How to Build a B2B Document Extractor with Both Rules and LLM: A Step-by-Step Comparison
Introduction
Extracting structured data from B2B PDF invoices, purchase orders, and receipts is a common challenge. Many developers turn to rule-based approaches using OCR (like Tesseract) or explore modern LLMs (like LLaMA 3) for more flexible extraction. This guide walks you through building the same extractor twice — once with pytesseract rules and once with Ollama + LLaMA 3 — so you can compare performance, accuracy, and maintenance on a realistic B2B order scenario.

What You Need
- Python 3.8+ installed on your system
- pytesseract and the Tesseract OCR engine (follow the installation instructions for your OS)
- Ollama (install from ollama.ai) with the LLaMA 3 model pulled (ollama pull llama3)
- A sample B2B PDF invoice or order document (use a real but anonymized one)
- Basic Python libraries: pdf2image, Pillow, re, json
- Text editor or IDE
Step-by-Step Guide
Step 1: Set Up the Environment and Sample Document
First, create a project folder and install dependencies:
pip install pytesseract pdf2image Pillow ollama
Place your sample B2B PDF in the folder. For this guide, we assume a purchase order containing fields like Order ID, Supplier Name, Line Items, Total Amount.
Step 2: Build the Rule-Based Extractor with pytesseract
Create a Python script rule_extractor.py. Use pdf2image to convert PDF pages to images, then apply Tesseract OCR:
from pdf2image import convert_from_path
import pytesseract

# Convert each PDF page to a PIL image, then OCR the first page
images = convert_from_path('order.pdf')
text = pytesseract.image_to_string(images[0])
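Scanned B2B documents are often noisy, and a little image cleanup before OCR can noticeably improve Tesseract's output. A minimal preprocessing sketch using Pillow (the threshold of 140 is an arbitrary starting point, not a tuned value):

```python
from PIL import ImageOps

def preprocess(image):
    """Grayscale, auto-contrast, and binarize a page image before OCR."""
    gray = ImageOps.grayscale(image)
    contrasted = ImageOps.autocontrast(gray)
    # Binarize: pixels above the threshold become white, the rest black
    return contrasted.point(lambda p: 255 if p > 140 else 0)
```

Apply it to each page before calling image_to_string, and tune the threshold against your own scans.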
Now define rules using regex and keyword matching. For example:
- Extract the Order ID by looking for patterns like Order #:\s*(\w+)
- Find the Supplier Name after the word Supplier or Vendor
- Parse line items using a tabular assumption (fixed column positions or a delimiter)
- Grab the total via Total:\s*\$?(\d+\.\d{2})
Test with your PDF and adjust regex patterns. This approach works well for consistent layouts but fails if the format changes.
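Putting those rules together, here is a minimal sketch of the extraction function. The sample text and field names are illustrative assumptions, not output from a real document:

```python
import re

# Hypothetical OCR output from a purchase order (values are made up)
text = """Order #: PO-48213
Supplier: Acme Industrial Supply
Widget A    10    $4.50    $45.00
Widget B     5    $9.00    $45.00
Total: $90.00"""

def extract_fields(text):
    """Pull key fields from OCR text with regex; None marks a missed field."""
    order_id = re.search(r'Order\s*#:\s*([\w-]+)', text)
    supplier = re.search(r'(?:Supplier|Vendor):?\s*(.+)', text)
    total = re.search(r'Total:\s*\$?(\d+\.\d{2})', text)
    return {
        'order_id': order_id.group(1) if order_id else None,
        'supplier': supplier.group(1).strip() if supplier else None,
        'total': float(total.group(1)) if total else None,
    }

print(extract_fields(text))
```

Returning None for misses (rather than raising) makes it easy to score how many fields the rules found on each document.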
Step 3: Build the LLM-Based Extractor with Ollama and LLaMA 3
Create llm_extractor.py. Read the PDF text as before (or use OCR output). Then pass it to Ollama:
import json
import ollama

prompt = """You are a B2B document parser. Extract fields: Order ID, Supplier Name, Line Items (as list), Total. Output only JSON.
Document:
{ocr_text}
""".format(ocr_text=text)
response = ollama.chat(model='llama3', messages=[{'role': 'user', 'content': prompt}])
result = json.loads(response['message']['content'])
This method is layout-agnostic and handles variations naturally. However, it requires running a local LLM and may be slower. You can also tweak the prompt to enforce schema.
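Even when told to output only JSON, local models sometimes wrap the JSON in extra prose, which makes a bare json.loads fail. A small defensive parser that falls back to grabbing the first {...} block (the sample reply is illustrative):

```python
import json
import re

def parse_llm_json(raw):
    """Parse a model reply as JSON; fall back to the first {...} block
    if the model wrapped the JSON in surrounding prose."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        match = re.search(r'\{.*\}', raw, re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise

reply = 'Here is the extraction:\n{"Order ID": "PO-48213", "Total": "90.00"}'
print(parse_llm_json(reply))
```

Use this in place of the bare json.loads call above to make the script resilient to chatty replies.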

Step 4: Compare Outputs and Handle Failures
Run both scripts on the same document. Compare extracted JSON:
- Rule-based may miss fields if layout shifts or OCR introduces noise
- LLM-based may hallucinate or misinterpret ambiguous text
For failures, enhance the rules with fallback patterns, or improve the LLM prompt with few-shot examples. Consider a hybrid pipeline in which the LLM acts as a backup when the rules miss fields.
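To make the comparison systematic rather than eyeballing JSON, you can diff the two outputs field by field. A minimal sketch (the sample results are illustrative):

```python
def compare_results(rule_result, llm_result, fields):
    """Report per-field agreement between the two extractors."""
    report = {}
    for field in fields:
        r, l = rule_result.get(field), llm_result.get(field)
        if r is None and l is None:
            report[field] = 'both missed'
        elif r == l:
            report[field] = 'agree'
        else:
            report[field] = f'disagree (rules={r!r}, llm={l!r})'
    return report

rule_out = {'order_id': 'PO-48213', 'total': None}
llm_out = {'order_id': 'PO-48213', 'total': '90.00'}
print(compare_results(rule_out, llm_out, ['order_id', 'total']))
```

Disagreements are the interesting rows: they tell you whether the rules missed a field or the LLM reinterpreted one.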
Step 5: Optimize for Your Use Case
For production, measure accuracy, speed, and maintenance overhead. Rule-based extraction is fast and cheap but brittle; LLM-based extraction is flexible but needs more compute (ideally a GPU) and careful prompt engineering.
You can also combine them: run the rules first, then fall back to the LLM when rule-based confidence drops below a threshold (for example, fewer than 90% of required fields found).
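The hybrid idea can be sketched as follows, using the fraction of required fields found as a naive confidence score. The stub extractors stand in for the real scripts and are assumptions for illustration:

```python
def rule_confidence(result, required_fields):
    """Naive confidence: fraction of required fields the rules filled in."""
    found = sum(1 for f in required_fields if result.get(f) is not None)
    return found / len(required_fields)

def hybrid_extract(text, rule_fn, llm_fn, required_fields, threshold=0.9):
    """Try rules first; fall back to the LLM when confidence is low."""
    result = rule_fn(text)
    if rule_confidence(result, required_fields) >= threshold:
        return result, 'rules'
    return llm_fn(text), 'llm'

# Stub extractors for illustration only
rules = lambda t: {'order_id': 'PO-1', 'total': None}
llm = lambda t: {'order_id': 'PO-1', 'total': '10.00'}
result, source = hybrid_extract('...', rules, llm, ['order_id', 'total'])
print(source)  # the rules found only 1 of 2 fields, so the LLM is used
```

A field-count score is crude; in practice you might weight critical fields (like the total) more heavily.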
Tips for Success
- Preprocess images before OCR: crop, deskew, convert to grayscale, increase contrast.
- Use structured output with LLMs: ask for JSON and validate with Pydantic.
- Test on multiple documents with varying layouts to see where each approach shines.
- Monitor costs: local LLM via Ollama has no API costs but uses compute; rules need no GPU.
- Version control both extraction scripts and sample documents to reproduce comparisons.
- Consider a hybrid system as the best of both worlds: rules for speed, LLM for edge cases.
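On the structured-output tip: Pydantic is the usual choice in production, but the core idea, checking field presence and types before trusting an extraction, fits in a few lines of stdlib Python. A minimal stand-in (the required fields mirror this guide's purchase-order schema):

```python
# Required fields and their expected types (an assumption for this guide)
REQUIRED = {
    'order_id': str,
    'supplier_name': str,
    'line_items': list,
    'total': (int, float),
}

def validate_order(data):
    """Return a list of schema errors; an empty list means the data passed."""
    errors = []
    for field, expected in REQUIRED.items():
        if field not in data:
            errors.append(f'missing field: {field}')
        elif not isinstance(data[field], expected):
            errors.append(f'wrong type for {field}: {type(data[field]).__name__}')
    return errors

good = {'order_id': 'PO-48213', 'supplier_name': 'Acme', 'line_items': [], 'total': 90.0}
bad = {'order_id': 'PO-48213', 'total': '90.00'}
print(validate_order(good))  # []
print(validate_order(bad))
```

When validation fails on LLM output, re-prompting with the error messages included often fixes the reply on the second try.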
By building the same extractor twice, you gain practical insight into trade-offs and can make an informed choice for your B2B document processing needs.