How to Build a B2B Document Extractor with Both Rules and LLM: A Step-by-Step Comparison
Introduction
Extracting structured data from B2B PDF invoices, purchase orders, and receipts is a common challenge. Many developers turn to rule-based approaches using OCR (like Tesseract) or explore modern LLMs (like LLaMA 3) for more flexible extraction. This guide walks you through building the same extractor twice — once with pytesseract rules and once with Ollama + LLaMA 3 — so you can compare performance, accuracy, and maintenance on a realistic B2B order scenario.

What You Need
- Python 3.8+ installed on your system
- pytesseract and the Tesseract OCR engine (follow the installation instructions for your OS)
- Ollama (install from ollama.ai) with the LLaMA 3 model pulled (ollama pull llama3)
- A sample B2B PDF invoice or order document (use a real but anonymized one)
- Basic Python libraries: pdf2image, Pillow, re, json
- Text editor or IDE
Step-by-Step Guide
Step 1: Set Up the Environment and Sample Document
First, create a project folder and install dependencies:
pip install pytesseract pdf2image Pillow ollama
Place your sample B2B PDF in the folder. For this guide, we assume a purchase order containing fields like Order ID, Supplier Name, Line Items, Total Amount.
Step 2: Build the Rule-Based Extractor with pytesseract
Create a Python script rule_extractor.py. Use pdf2image to convert PDF pages to images, then apply Tesseract OCR:
from pdf2image import convert_from_path
import pytesseract

# Convert each PDF page to a PIL image, then OCR the first page
images = convert_from_path('order.pdf')
text = pytesseract.image_to_string(images[0])
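Scanned B2B documents are often noisy, and a little image cleanup before OCR can noticeably improve Tesseract's output. A minimal preprocessing sketch using Pillow (the threshold of 140 is an arbitrary starting point, not a tuned value):

```python
from PIL import ImageOps

def preprocess(image):
    """Grayscale, auto-contrast, and binarize a page image before OCR."""
    gray = ImageOps.grayscale(image)
    contrasted = ImageOps.autocontrast(gray)
    # Binarize: pixels above the threshold become white, the rest black
    return contrasted.point(lambda p: 255 if p > 140 else 0)
```

Apply it to each page before calling image_to_string, and tune the threshold against your own scans.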
Now define rules using regex and keyword matching. For example:
- Extract the Order ID by looking for patterns like Order #:\s*(\w+)
- Find the Supplier Name after the word Supplier or Vendor
- Parse line items using a tabular assumption (fixed column positions or a delimiter)
- Grab the total via Total:\s*\$?(\d+\.\d{2})
Test with your PDF and adjust regex patterns. This approach works well for consistent layouts but fails if the format changes.
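Putting those rules together, here is a minimal sketch of the extraction function. The sample text and field names are illustrative assumptions, not output from a real document:

```python
import re

# Hypothetical OCR output from a purchase order (values are made up)
text = """Order #: PO-48213
Supplier: Acme Industrial Supply
Widget A    10    $4.50    $45.00
Widget B     5    $9.00    $45.00
Total: $90.00"""

def extract_fields(text):
    """Pull key fields from OCR text with regex; None marks a missed field."""
    order_id = re.search(r'Order\s*#:\s*([\w-]+)', text)
    supplier = re.search(r'(?:Supplier|Vendor):?\s*(.+)', text)
    total = re.search(r'Total:\s*\$?(\d+\.\d{2})', text)
    return {
        'order_id': order_id.group(1) if order_id else None,
        'supplier': supplier.group(1).strip() if supplier else None,
        'total': float(total.group(1)) if total else None,
    }

print(extract_fields(text))
```

Returning None for misses (rather than raising) makes it easy to score how many fields the rules found on each document.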
Step 3: Build the LLM-Based Extractor with Ollama and LLaMA 3
Create llm_extractor.py. Read the PDF text as before (or use OCR output). Then pass it to Ollama:
import json
import ollama

prompt = """You are a B2B document parser. Extract fields: Order ID, Supplier Name, Line Items (as list), Total. Output only JSON.
Document:
{ocr_text}
""".format(ocr_text=text)
response = ollama.chat(model='llama3', messages=[{'role': 'user', 'content': prompt}])
result = json.loads(response['message']['content'])
This method is layout-agnostic and handles variations naturally. However, it requires running a local LLM and may be slower. You can also tweak the prompt to enforce schema.
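Even when told to output only JSON, local models sometimes wrap the JSON in extra prose, which makes a bare json.loads fail. A small defensive parser that falls back to grabbing the first {...} block (the sample reply is illustrative):

```python
import json
import re

def parse_llm_json(raw):
    """Parse a model reply as JSON; fall back to the first {...} block
    if the model wrapped the JSON in surrounding prose."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        match = re.search(r'\{.*\}', raw, re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise

reply = 'Here is the extraction:\n{"Order ID": "PO-48213", "Total": "90.00"}'
print(parse_llm_json(reply))
```

Use this in place of the bare json.loads call above to make the script resilient to chatty replies.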

Step 4: Compare Outputs and Handle Failures
Run both scripts on the same document. Compare extracted JSON:
- Rule-based may miss fields if layout shifts or OCR introduces noise
- LLM-based may hallucinate or misinterpret ambiguous text
For failures, enhance the rules with fallback patterns, or improve the LLM prompt with few-shot examples. Consider a hybrid pipeline in which the LLM acts as a backup when the rules miss fields.
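To make the comparison systematic rather than eyeballing JSON, you can diff the two outputs field by field. A minimal sketch (the sample results are illustrative):

```python
def compare_results(rule_result, llm_result, fields):
    """Report per-field agreement between the two extractors."""
    report = {}
    for field in fields:
        r, l = rule_result.get(field), llm_result.get(field)
        if r is None and l is None:
            report[field] = 'both missed'
        elif r == l:
            report[field] = 'agree'
        else:
            report[field] = f'disagree (rules={r!r}, llm={l!r})'
    return report

rule_out = {'order_id': 'PO-48213', 'total': None}
llm_out = {'order_id': 'PO-48213', 'total': '90.00'}
print(compare_results(rule_out, llm_out, ['order_id', 'total']))
```

Disagreements are the interesting rows: they tell you whether the rules missed a field or the LLM reinterpreted one.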
Step 5: Optimize for Your Use Case
For production, measure accuracy, speed, and maintenance overhead. Rule-based extraction is fast and cheap but brittle; LLM-based extraction is flexible but needs more compute (ideally a GPU) and careful prompt engineering.
You can also combine them: run the rules first, then fall back to the LLM when rule-based confidence drops below a threshold (for example, fewer than 90% of required fields found).
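The hybrid idea can be sketched as follows, using the fraction of required fields found as a naive confidence score. The stub extractors stand in for the real scripts and are assumptions for illustration:

```python
def rule_confidence(result, required_fields):
    """Naive confidence: fraction of required fields the rules filled in."""
    found = sum(1 for f in required_fields if result.get(f) is not None)
    return found / len(required_fields)

def hybrid_extract(text, rule_fn, llm_fn, required_fields, threshold=0.9):
    """Try rules first; fall back to the LLM when confidence is low."""
    result = rule_fn(text)
    if rule_confidence(result, required_fields) >= threshold:
        return result, 'rules'
    return llm_fn(text), 'llm'

# Stub extractors for illustration only
rules = lambda t: {'order_id': 'PO-1', 'total': None}
llm = lambda t: {'order_id': 'PO-1', 'total': '10.00'}
result, source = hybrid_extract('...', rules, llm, ['order_id', 'total'])
print(source)  # the rules found only 1 of 2 fields, so the LLM is used
```

A field-count score is crude; in practice you might weight critical fields (like the total) more heavily.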
Tips for Success
- Preprocess images before OCR: crop, deskew, convert to grayscale, increase contrast.
- Use structured output with LLMs: ask for JSON and validate with Pydantic.
- Test on multiple documents with varying layouts to see where each approach shines.
- Monitor costs: local LLM via Ollama has no API costs but uses compute; rules need no GPU.
- Version control both extraction scripts and sample documents to reproduce comparisons.
- Consider a hybrid system as the best of both worlds: rules for speed, LLM for edge cases.
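On the structured-output tip: Pydantic is the usual choice in production, but the core idea, checking field presence and types before trusting an extraction, fits in a few lines of stdlib Python. A minimal stand-in (the required fields mirror this guide's purchase-order schema):

```python
# Required fields and their expected types (an assumption for this guide)
REQUIRED = {
    'order_id': str,
    'supplier_name': str,
    'line_items': list,
    'total': (int, float),
}

def validate_order(data):
    """Return a list of schema errors; an empty list means the data passed."""
    errors = []
    for field, expected in REQUIRED.items():
        if field not in data:
            errors.append(f'missing field: {field}')
        elif not isinstance(data[field], expected):
            errors.append(f'wrong type for {field}: {type(data[field]).__name__}')
    return errors

good = {'order_id': 'PO-48213', 'supplier_name': 'Acme', 'line_items': [], 'total': 90.0}
bad = {'order_id': 'PO-48213', 'total': '90.00'}
print(validate_order(good))  # []
print(validate_order(bad))
```

When validation fails on LLM output, re-prompting with the error messages included often fixes the reply on the second try.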
By building the same extractor twice, you gain practical insight into trade-offs and can make an informed choice for your B2B document processing needs.