How to Build a B2B Document Extractor: Rule-Based vs. LLM Approaches

Introduction

Extracting structured data from B2B documents—such as purchase orders, invoices, or delivery notes—is a common challenge. Two primary approaches exist: a traditional rule-based method using pytesseract for OCR and regex for parsing, and a modern LLM-based method using Ollama with LLaMA 3. This guide walks you through building both versions of the same document extractor, comparing their strengths and tradeoffs using a realistic B2B order scenario. By the end, you'll be able to choose the right approach for your own projects.

Source: towardsdatascience.com

What You Need

  • Python 3.8+ installed on your machine
  • pytesseract – Python wrapper for Tesseract OCR engine
  • Tesseract OCR engine installed separately (see Tesseract OCR documentation)
  • Ollama – local LLM server (download from ollama.com)
  • LLaMA 3 model (run ollama pull llama3 after installing Ollama)
  • Python libraries: pdf2image, Pillow, requests (the re module used for parsing ships with Python)
  • A sample B2B PDF (e.g., a purchase order with fields: company name, date, line items, totals)

Step-by-Step Instructions

Step 1: Set Up the Environment

Create a new Python virtual environment and install all required packages:

python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate
pip install pytesseract pdf2image Pillow requests

Ensure Tesseract OCR is installed globally (sudo apt install tesseract-ocr on Linux, or download the Windows installer). Also install and start Ollama, then pull the LLaMA 3 model:

ollama pull llama3

Step 2: Convert PDF to Images

B2B documents are often scanned PDFs. Use pdf2image to turn each page into a PNG image. Write a function that:

  • Takes the PDF path as input
  • Converts pages to images using convert_from_path
  • Returns a list of PIL Image objects
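The function described above can be sketched as follows. It assumes pdf2image and its poppler backend are installed; the import is kept inside the function so the module still loads before those dependencies are in place.

```python
def pdf_to_images(pdf_path, dpi=300):
    """Convert each page of a PDF into a PIL Image.

    Assumes pdf2image (and poppler) are installed; imported lazily
    so the rest of the module works without them.
    """
    from pdf2image import convert_from_path
    # convert_from_path returns one PIL.Image per page
    return convert_from_path(pdf_path, dpi=dpi)
```

A higher `dpi` (300 is a common choice for OCR) trades conversion time for recognition accuracy on small fonts.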

Step 3: Perform OCR with pytesseract

For each image, call pytesseract.image_to_string() to extract raw text. This step is identical for both rule-based and LLM approaches, as they both need the text first. Store the extracted text per page.
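A minimal sketch of this step, assuming pytesseract is installed and the tesseract binary is on your PATH (the import is again deferred so the module loads without it):

```python
def ocr_pages(images):
    """Run Tesseract OCR on each PIL image; return one text string per page.

    Assumes pytesseract is installed and the tesseract binary is on PATH.
    """
    import pytesseract
    return [pytesseract.image_to_string(img) for img in images]
```

Keeping one string per page (rather than concatenating everything) makes it easier to debug which page a bad extraction came from.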

Step 4: Build the Rule-Based Extractor

Use regular expressions and string logic to locate fields like Order Number, Date, Client Name, and Line Items. For example:

  • Search for patterns like r'Order\s*#:\s*(\S+)'
  • Use a list of known product names for line items
  • Parse multi-line blocks for tables

This method is fast and predictable, but fragile if the document format changes.
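A minimal rule-based extractor along these lines is shown below. The field labels (`Order #:`, `Date:`, `Client:`) and the line-item layout are assumptions for illustration and must match your own documents:

```python
import re

def extract_fields(text):
    """Rule-based extraction: regexes keyed to assumed field labels."""
    fields = {}
    m = re.search(r'Order\s*#:\s*(\S+)', text)
    fields['order_number'] = m.group(1) if m else None
    m = re.search(r'Date:\s*(\d{4}-\d{2}-\d{2})', text)
    fields['date'] = m.group(1) if m else None
    m = re.search(r'Client:\s*(.+)', text)
    fields['client_name'] = m.group(1).strip() if m else None
    # Assumed line-item layout: "<name>  <qty> x <price>", one item per line
    fields['line_items'] = [
        {'item': i, 'quantity': int(q), 'price': float(p)}
        for i, q, p in re.findall(r'^(.+?)\s{2,}(\d+)\s*x\s*([\d.]+)\s*$',
                                  text, re.M)
    ]
    return fields

sample = """Order #: PO-1042
Date: 2024-05-01
Client: Acme GmbH
Widget A  3 x 9.99
Widget B  1 x 24.50"""
print(extract_fields(sample))
```

Note how every field falls back to `None` when its pattern fails to match; that is exactly the fragility mentioned above, since a renamed label or reordered column silently yields missing data.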

Step 5: Build the LLM-Based Extractor

Instead of writing rules, pass the OCR text to LLaMA 3 via Ollama’s API, using a structured prompt that asks the model to return specific fields as JSON:

prompt = f"""
Extract the following information from this purchase order:
- order_number
- date
- client_name
- line_items (array of objects with 'item', 'quantity', 'price')
Return only valid JSON.

Text:
{text}
"""

Use the requests library to call Ollama:

response = requests.post(
    'http://localhost:11434/api/generate',
    json={'model': 'llama3', 'prompt': prompt, 'stream': False},
)

Then parse the model's reply — the response field of the returned JSON body — into a Python dict.
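Models do not always return bare JSON — replies are often wrapped in prose or Markdown fences even when the prompt says "Return only valid JSON". One defensive approach (a sketch, not the article's exact code) is to pull the first {...} block out of the reply before parsing:

```python
import json
import re

def parse_json_reply(reply):
    """Extract and parse the first JSON object from a model reply.

    Handles replies wrapped in prose or ```json fences by grabbing
    everything from the first '{' to the last '}'.
    """
    m = re.search(r'\{.*\}', reply, re.S)
    if not m:
        raise ValueError('no JSON object found in model reply')
    return json.loads(m.group(0))
```

With the request from the previous snippet, the fields would then be `parse_json_reply(response.json()['response'])`.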

Step 6: Compare Outputs

Run both extractors on the same set of PDFs and compare:

  • Accuracy: Which fields are correct?
  • Robustness: How does each handle missing data or typos?
  • Speed: Rule-based usually finishes in seconds; LLM may take 10–30 seconds per page.

The original experiment showed that the rule-based approach failed on a slightly different document format, while the LLM gracefully adapted—but hallucinated one item.
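To make the comparison above reproducible rather than anecdotal, you can score both extractors against hand-labelled ground truth. This is a minimal sketch with assumed dict-shaped outputs, not the original article's evaluation code:

```python
def compare_outputs(rule_out, llm_out, gold):
    """Score each extractor's output dict against a hand-labelled gold dict.

    Returns the fraction of gold fields each extractor got exactly right.
    """
    def score(pred):
        return sum(pred.get(k) == v for k, v in gold.items()) / len(gold)
    return {'rule_based': score(rule_out), 'llm': score(llm_out)}
```

Exact-match scoring is deliberately strict; it penalizes the LLM's hallucinated items just as hard as the rule-based extractor's missed fields.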

Tips for Success

  • Preprocess images: For rule-based OCR, apply thresholding or deskewing to improve accuracy.
  • Optimize LLM prompts: Include example outputs and specify format clearly to reduce hallucinations.
  • Fallback strategy: Use rule-based extraction for well-known templates and LLM as a fallback for unknown documents.
  • Test with diverse samples: Don’t rely on a single document; vary fonts, layouts, and printing quality.
  • Monitor costs: Local LLMs are free to run but need capable hardware (ideally a GPU); cloud LLMs charge per token.
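The fallback strategy from the tips above can be wired up as a small dispatcher. This sketch takes the two extractors as callables (the required-field names are assumptions matching the earlier steps):

```python
REQUIRED = ('order_number', 'date', 'client_name')

def extract_with_fallback(text, rule_based, llm_based):
    """Try the cheap, deterministic rule-based extractor first;
    invoke the slower LLM extractor only if a required field is missing."""
    result = rule_based(text)
    if any(result.get(k) is None for k in REQUIRED):
        result = llm_based(text)
    return result
```

This keeps the fast path fast for known templates while still handling the unfamiliar documents that break the regexes.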

By following these steps, you can build your own B2B document extractor and decide which approach best fits your needs. For a deep dive into the original comparison, see the full article.
