Why Sending Raw HTML to an LLM for Web Scraping Is a Mistake (and What to Do Instead)

The Hidden Cost of Large DOM Inputs

When developers first attempt to build a web scraper using large language models (LLMs), the natural instinct is to feed the entire page's HTML into the model and ask it to extract the relevant data. This seems straightforward, but it quickly reveals a major inefficiency: a typical product listing page contains 500–700 KB of raw DOM markup. At roughly four characters per token, that means paying for approximately 150,000 tokens per request, enduring 15–30 seconds of latency, and frequently hitting context limits, especially on complex pages. Many projects stall at this first hurdle.

The Reality Check: 15 Models, Consistent Performance

Over a four-month period, an exhaustive evaluation was conducted across 15 different models, including GPT-4, GPT-4o, Gemini 1.5 Pro, Gemini Ultra, Claude 3 Opus, Claude 3.5 Sonnet, Mistral Large, Llama 3 70B, Cohere Command R+, and several smaller fine-tuned variants. The results fell into a predictable pattern:

  • GPT-4 and Gemini Ultra delivered high accuracy but required 25–35 seconds per page.
  • Claude 3.5 Sonnet offered the best accuracy-to-latency trade-off but still needed 5–10 seconds.
  • Smaller models were faster but frequently hallucinated field names or produced inconsistent output.

No model solved the core latency problem because the fundamental approach—sending massive, unprocessed HTML—was flawed from the start.

The Breakthrough: Pre-Processing the DOM

The real bottleneck was not the model's reasoning capability but the sheer volume of input data. To address this, a DOM pre-processor was developed with the following steps (a sketch follows the list):

  1. Strip all <script>, <style>, and tracking pixel elements.
  2. Remove navigation, footer, and sidebar components.
  3. Collapse deeply nested wrappers that carry no semantic meaning.
  4. Apply SimHash to deduplicate structurally identical subtrees.
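
A minimal sketch of such a pre-processor is shown below, assuming a browser DOM environment such as a content script. The selector lists, the wrapper-collapse rule, and the deduplication step (here a simplified exact structural signature rather than a true SimHash) are illustrative assumptions, not the published pipeline.

```typescript
// Minimal DOM pre-processor sketch (browser context). Selector lists and the
// exact-signature dedup are simplified stand-ins for the steps described above.

function preprocessDom(root: HTMLElement): string {
  const clone = root.cloneNode(true) as HTMLElement;

  // 1. Strip scripts, styles, and obvious tracking elements.
  clone.querySelectorAll("script, style, noscript, iframe, img[width='1']")
       .forEach(el => el.remove());

  // 2. Remove navigation, footer, and sidebar components.
  clone.querySelectorAll("nav, footer, aside, header").forEach(el => el.remove());

  // 3. Collapse wrappers that add nesting but no semantics.
  collapseWrappers(clone);

  // 4. Deduplicate structurally identical sibling subtrees, keeping a few exemplars.
  dedupeSiblings(clone, 3);

  return clone.outerHTML;
}

// A single-child <div>/<span> with no attributes carries no semantic meaning;
// replace it with its child so depth shrinks without losing content.
function collapseWrappers(el: Element): void {
  for (const child of Array.from(el.children)) collapseWrappers(child);
  const isWrapper =
    (el.tagName === "DIV" || el.tagName === "SPAN") &&
    el.children.length === 1 &&
    el.attributes.length === 0;
  if (isWrapper && el.parentElement) el.replaceWith(el.children[0]);
}

// Cheap structural signature: nested tag names of a subtree, in document order.
function signature(el: Element): string {
  return el.tagName + "(" + Array.from(el.children).map(signature).join(",") + ")";
}

// Drop siblings whose structure matches more than `keep` earlier siblings.
function dedupeSiblings(el: Element, keep: number): void {
  const seen = new Map<string, number>();
  for (const child of Array.from(el.children)) {
    const key = signature(child);
    const count = (seen.get(key) ?? 0) + 1;
    seen.set(key, count);
    if (count > keep) child.remove();
    else dedupeSiblings(child, keep);
  }
}
```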

The result was a dramatic reduction from 580 KB to just 4.2 KB—a 99.3% decrease in input size. With a 4 KB input, every model became fast. More importantly, the reduced input made repeating structural patterns obvious: product cards, directory rows, and search results repeated 20, 50, or 100 times. This insight led to a fundamental shift in the architecture.
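
The SimHash step in the list above deserves a brief illustration: each subtree is fingerprinted from its structural features so that near-identical subtrees end up a small Hamming distance apart. The 32-bit fingerprints and tag/class features below are assumptions chosen for brevity, not the pre-processor's actual parameters.

```typescript
// SimHash sketch: fingerprint a subtree from its tag/class features so that
// structurally similar subtrees differ in only a few bits (32-bit for brevity).

function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// Features for a subtree: tag name plus class attribute of every descendant.
function subtreeFeatures(el: Element): string[] {
  const feats = [el.tagName + "." + (el.getAttribute("class") ?? "")];
  for (const child of Array.from(el.children)) feats.push(...subtreeFeatures(child));
  return feats;
}

function simHash(el: Element): number {
  const v = new Array(32).fill(0);
  for (const feature of subtreeFeatures(el)) {
    const h = fnv1a(feature);
    for (let b = 0; b < 32; b++) v[b] += ((h >>> b) & 1) ? 1 : -1;
  }
  let fp = 0;
  for (let b = 0; b < 32; b++) if (v[b] > 0) fp |= 1 << b;
  return fp >>> 0;
}

// Subtrees whose fingerprints differ in only a few bits are treated as duplicates.
function hamming(a: number, b: number): number {
  let x = (a ^ b) >>> 0, bits = 0;
  while (x) { bits += x & 1; x >>>= 1; }
  return bits;
}
```

With a small Hamming threshold, repeated product cards collapse to a single exemplar while one representative copy is kept for the model.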

The Architecture Decision: Heuristics Before AI

Once the structural patterns were visible, it became clear that paying an LLM to detect those patterns was unnecessary. Instead, a heuristic detector was designed (sketched in code after the list) to:

  • Identify elements with three or more structurally identical siblings.
  • Score candidate lists based on depth, child count uniformity, and text density.
  • Return ranked list candidates in under 0.2 milliseconds.
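
A sketch of such a detector follows, reusing the signature() helper from the pre-processor sketch above. The sibling threshold, scoring weights, and depth penalty are illustrative assumptions; the actual engine's heuristics are not published.

```typescript
// Heuristic list detector sketch. Scores containers whose children share a
// structural signature; thresholds and weights are illustrative only.

interface ListCandidate {
  container: Element;
  itemCount: number;
  score: number;
}

function detectLists(root: Element, minSiblings = 3): ListCandidate[] {
  const candidates: ListCandidate[] = [];

  const visit = (el: Element, depth: number): void => {
    const children = Array.from(el.children);

    if (children.length >= minSiblings) {
      // Group children by structural signature; the largest group is the repeat unit.
      const groups = new Map<string, number>();
      for (const child of children) {
        const key = signature(child);
        groups.set(key, (groups.get(key) ?? 0) + 1);
      }
      const repeats = Math.max(...Array.from(groups.values()));

      if (repeats >= minSiblings) {
        const uniformity = repeats / children.length;              // child-count uniformity
        const textDensity = (el.textContent ?? "").trim().length / children.length;
        const depthPenalty = 1 / (1 + 0.1 * depth);                // prefer shallower containers
        candidates.push({
          container: el,
          itemCount: repeats,
          score: repeats * uniformity * Math.log1p(textDensity) * depthPenalty,
        });
      }
    }

    for (const child of children) visit(child, depth + 1);
  };

  visit(root, 0);
  return candidates.sort((a, b) => b.score - a.score); // best candidate first
}
```

On a typical listing page, the top-ranked candidate is the product grid itself, and only one of its items needs to reach the model.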

Then, AI enters only after detection—not to identify the list, but to label fields and structure the output. This reduces the LLM's job from 150,000 tokens to approximately 200 tokens. The resulting performance is dramatic:

Step             Approach            Latency
List detection   Heuristics          0.2 ms
Field labeling   LLM (small input)   ~2 s
Total                                ~2 s

Compare this to the naive LLM approach, which takes 25–35 seconds per page.
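
To make the "approximately 200 tokens" concrete, the labeling call only needs one exemplar item from the detected list. The sketch below uses the OpenAI chat completions endpoint with gpt-4o-mini; the endpoint, model choice, and prompt wording are assumptions for illustration, not Clura's actual integration.

```typescript
// Field-labeling sketch: send one repeat-unit's HTML (a few hundred tokens),
// not the page, and ask for a selector -> field-name map as JSON.
// Endpoint, model, and prompt are illustrative assumptions.

async function labelFields(sampleItemHtml: string, apiKey: string): Promise<Record<string, string>> {
  const prompt =
    "This is one item from a repeating list on a web page. Return a JSON object " +
    "mapping CSS selectors (relative to the item) to short field names such as " +
    '"title", "price", or "rating".\n\n' + sampleItemHtml;

  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: prompt }],
      response_format: { type: "json_object" },
    }),
  });

  const data = await res.json();
  return JSON.parse(data.choices[0].message.content); // e.g. { ".title": "title", ".price": "price" }
}
```

Once the selector-to-field map comes back, extracting every record is a plain DOM walk over the detected list, with no further model calls.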

What Was Actually Shipped

This architecture became the foundation for Clura, a heuristic-first AI web scraper Chrome extension. On any page, Clura automatically detects every list using the heuristic engine. Users simply pick the desired list and the fields to extract; all records are retrieved in seconds. There are no prompts to describe data, no training phase, and no long waits. The heuristic layer handles detection; AI handles labeling.

The Lesson: LLMs Excel at Meaning, Not Scanning HTML

Large language models are exceptional at understanding what something means. They are terrible at scanning 600 KB of HTML to find where something is. That is a structural pattern problem—and structural pattern problems are what algorithms are built for. By combining fast, cheap heuristics for pattern detection with small, targeted LLM calls for semantic labeling, you can achieve speeds and accuracy that neither method can reach alone.
