AI · 10 min read

Web Scraping for LLM Pipelines: Clean Data for RAG

Web scraping for LLM apps — how to collect, clean, and format web data for RAG pipelines. Markdown as the optimal format for AI agents.

Patrick Spielmann

March 1, 2026

Every large language model has the same fundamental limitation: it only knows what you put into its context window. Training data goes stale. Internal knowledge bases have gaps. The web, on the other hand, has nearly everything — pricing pages, documentation, competitor analysis, job postings, research papers, product specs. The problem is getting that data into a format your model can actually use.

Most teams hit this wall early. You build a RAG pipeline, point it at some URLs, and the output quality is terrible. The model hallucinates. The retrieved chunks are full of navigation menus and cookie banners. Your token costs explode because you are feeding raw HTML into an embedding model that does not care about <div class="footer-nav">.

This is a data quality problem disguised as a model problem. Fix the input, and the output fixes itself.

Why Web Scraping for LLM Applications Matters

LLMs are only as good as the data you ground them with. Without retrieval, a model will confidently fabricate facts, cite papers that do not exist, and give you pricing from 2023. RAG solves this by fetching relevant documents before the model generates a response. But the "relevant documents" part assumes you actually have them.

The web is the largest knowledge base on the planet. It is updated in real time. It covers every industry, every company, every topic. If you are building an LLM application — a customer support bot, a research assistant, a competitive intelligence tool, a content generator — the web is where your grounding data lives.

The challenge is that web pages are built for browsers, not for language models. A typical page is 70% boilerplate: headers, footers, sidebars, scripts, tracking pixels, ads. The actual content — the part your model needs — might be 30% of the raw HTML. And that HTML is wrapped in deeply nested <div> tags that burn through your token budget without adding any semantic value.

Web scraping for LLM applications is a different discipline than traditional scraping. You are not extracting structured fields into a database. You are extracting readable content into a format that preserves meaning while minimizing noise.

The RAG Pipeline Architecture

A RAG pipeline has seven stages. Each one matters, and most teams get stages 2 and 3 wrong.

1. Fetch — Retrieve the raw HTML from a URL. This needs to handle JavaScript-rendered pages (SPAs, React apps, dynamically loaded content). A simple HTTP GET will miss most of the modern web.

2. Extract — Separate the main content from boilerplate. Remove navigation, footers, ads, cookie banners, and script tags. Keep the article body, headings, tables, lists, and code blocks.

3. Convert — Transform the extracted content into a clean, structured format. This is where markdown comes in — it preserves the document hierarchy without the overhead of HTML tags.

4. Chunk — Split the content into semantically meaningful segments. Not fixed-size character windows, but sections that correspond to coherent ideas — usually split on headings or paragraph boundaries.

5. Embed — Run each chunk through an embedding model (OpenAI text-embedding-3-small, Cohere embed-v4, etc.) to produce vector representations.

6. Retrieve — When a user asks a question, embed the query and find the most similar chunks via vector search (Pinecone, Weaviate, Qdrant, pgvector).

7. Generate — Pass the retrieved chunks as context to the LLM along with the user's question. The model generates a response grounded in real data.
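Stages 6 and 7 boil down to a nearest-neighbor search over stored vectors. Here is a minimal sketch with toy three-dimensional vectors standing in for real embeddings — a production pipeline would call an embedding model and a vector database instead:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_vec, index, k=2):
    # index: list of (chunk_text, vector) pairs; return the k most similar chunks
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy "embeddings" — in practice these come from your embedding model
index = [
    ("Pricing starts at $49/month.", [0.9, 0.1, 0.0]),
    ("Our office dog is named Biscuit.", [0.0, 0.2, 0.9]),
    ("Enterprise plans include SSO.", [0.8, 0.3, 0.1]),
]
top = retrieve([1.0, 0.0, 0.0], index, k=2)
```

The retrieved chunks in `top` then go into the prompt as grounding context for the generation step.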

Stages 2 and 3 are where most pipelines fail. If your extraction is bad, your chunks are full of noise. If your format is wrong, your embeddings are diluted with irrelevant tokens. Everything downstream suffers.

Why Markdown Is the Optimal Format for LLMs

Raw HTML is the worst possible format for LLM input. Consider a simple heading:

<div class="section-header mt-8 mb-4">
  <h2 class="text-2xl font-bold text-gray-900 tracking-tight">
    Pricing Plans
  </h2>
</div>

In markdown, that same heading is:

## Pricing Plans

A handful of tokens instead of dozens. Multiply this across an entire page, and the difference is massive. A typical web page that costs 15,000 tokens in raw HTML costs roughly 5,000 tokens as clean markdown. That is a 67% reduction.
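You can sanity-check the savings with a rough rule of thumb of about four characters per token for English text — for exact counts, use your target model's tokenizer:

```python
html = '''<div class="section-header mt-8 mb-4">
  <h2 class="text-2xl font-bold text-gray-900 tracking-tight">
    Pricing Plans
  </h2>
</div>'''
markdown = "## Pricing Plans"

def approx_tokens(text):
    # Rough heuristic: ~4 characters per token for English text
    return max(1, len(text) // 4)

html_tokens = approx_tokens(html)
md_tokens = approx_tokens(markdown)
savings = 1 - md_tokens / html_tokens  # fraction of tokens eliminated
```

On this single heading the estimated savings are already above 80%, and full pages with nested wrappers fare worse.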

But token savings are only part of the story. Markdown also preserves the semantic structure that matters for retrieval:

  • Headings (##, ###) tell the chunker where one topic ends and another begins
  • Lists stay as lists, not as a soup of <li> tags nested in <ul> inside <div> wrappers
  • Tables remain readable without CSS classes and <thead>/<tbody> markup
  • Code blocks are fenced and language-tagged, not wrapped in <pre><code class="language-python">
  • Links keep their anchor text and URL without onclick handlers and tracking parameters

When you feed markdown into an embedding model, every token carries meaning. When you feed HTML, a significant percentage of your tokens are styling classes, data attributes, and structural tags that add zero semantic value.

For a deeper dive into the numbers, read Markdown vs HTML — we break down the token comparison across different page types.

Web Scraping for LLM: A Step-by-Step Approach

Here is the practical workflow. You need three things: a way to fetch JavaScript-rendered pages, a way to extract main content, and a way to convert to clean markdown.

You can build this yourself with Puppeteer + Readability + Turndown. That works until you hit the edge cases: Cloudflare challenges, rate limiting, pages that load content on scroll, iframes, shadow DOM. Every site is different, and maintaining a scraping pipeline is a full-time job.

Or you can use an API that handles all of this. LeadMagic's URL to Markdown API takes a URL and returns clean markdown. It handles JavaScript rendering, content extraction, and noise removal in a single request.

curl -X POST https://api.web2md.app/api/scrape \
  -H "X-API-Key: your_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/pricing"}'

The response is clean markdown — headings, lists, tables, code blocks — ready to chunk and embed. No nav bars, no footers, no cookie consent dialogs.

You can also try it interactively with our URL to Markdown tool — paste a URL, get markdown back instantly.

If you already have HTML (from your own scraper, a cache, or a CMS export), the HTML to Markdown converter handles the conversion step directly. For the full technical walkthrough on conversion, see how to convert HTML to markdown.

Building a Web Scraping API for AI Pipelines

Here is a practical pipeline for feeding web data into a RAG system. This is pseudocode — adapt it to your stack.

import requests
from datetime import datetime

from chunker import semantic_chunk   # placeholder: your chunking module
from embeddings import embed_batch   # placeholder: your embedding client
from vectordb import upsert          # placeholder: your vector DB client

LEADMAGIC_API = "https://api.web2md.app/api/scrape"
API_KEY = "your_key"

def ingest_urls(urls: list[str]):
    for url in urls:
        # Step 1-3: Fetch, extract, and convert in one call
        response = requests.post(
            LEADMAGIC_API,
            headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
            json={"url": url}
        )
        response.raise_for_status()  # surface auth and rate-limit errors early
        markdown = response.json()["markdown"]

        # Step 4: Chunk on heading boundaries
        chunks = semantic_chunk(markdown, max_tokens=512)

        # Step 5: Embed all chunks
        vectors = embed_batch([c.text for c in chunks])

        # Step 6: Store in vector database with metadata
        documents = [
            {
                "id": f"{url}_{i}",
                "vector": vec,
                "text": chunk.text,
                "metadata": {
                    "source_url": url,
                    "heading": chunk.heading,
                    "ingested_at": datetime.now().isoformat()
                }
            }
            for i, (chunk, vec) in enumerate(zip(chunks, vectors))
        ]
        upsert(documents)

A few implementation details that matter:

Chunking strategy. Split on markdown headings (##, ###), not on fixed character counts. A 512-token chunk that starts at a heading and ends before the next one captures a complete thought. A 512-character window that starts mid-paragraph and ends mid-sentence is nearly useless for retrieval.
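A minimal heading-based chunker is a regex split; a production version would also enforce a token cap per chunk on top of this:

```python
import re

def chunk_on_headings(markdown: str) -> list[dict]:
    # Split on ## / ### headings; each chunk keeps its heading as metadata
    parts = re.split(r"(?m)^(#{2,3} .+)$", markdown)
    chunks, heading = [], None
    for part in parts:
        if re.match(r"^#{2,3} ", part):
            heading = part.strip()
        elif part.strip():
            chunks.append({"heading": heading, "text": part.strip()})
    return chunks

doc = "Intro text.\n\n## Pricing\nPlans start at $49.\n\n## FAQ\nYes, there is a free tier."
chunks = chunk_on_headings(doc)
```

Each chunk now starts at a topic boundary and carries its heading, which doubles as retrieval metadata.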

Metadata. Always store the source URL with each chunk. When your LLM cites a fact, you need to be able to trace it back to the original page. Headings make good metadata too — they help with filtering and re-ranking.

Refresh schedule. Web data goes stale. Pricing pages change quarterly, docs update weekly, blog posts get edited. Build a refresh mechanism — re-ingest URLs on a schedule, diff the markdown, and only update chunks that changed.
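A content hash makes the diff step cheap. This sketch keeps hashes in an in-memory dict; a real pipeline would store them alongside the chunks in your database:

```python
import hashlib

def content_hash(markdown: str) -> str:
    # Normalize whitespace so cosmetic changes don't trigger re-embedding
    normalized = " ".join(markdown.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def needs_refresh(url: str, markdown: str, store: dict) -> bool:
    # store maps url -> last-seen content hash
    new_hash = content_hash(markdown)
    if store.get(url) == new_hash:
        return False  # unchanged: skip chunking and embedding
    store[url] = new_hash
    return True

store = {}
first = needs_refresh("https://example.com/pricing", "## Pricing\n$49", store)
second = needs_refresh("https://example.com/pricing", "## Pricing\n$49", store)
```

The first call returns True (new content), the second False (unchanged), so only changed pages pay the embedding cost on each refresh cycle.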

MCP Integration: Give AI Assistants Web Reading

If you are using AI coding tools like Cursor or Claude Desktop, there is an even simpler path than calling the API directly: MCP (Model Context Protocol).

MCP lets AI assistants use external tools natively — without writing integration code. Add web2md as an MCP server and your AI assistant can read any webpage as part of its workflow:

{
  "mcpServers": {
    "web2md": {
      "url": "https://mcp.web2md.app/sse"
    }
  }
}

Once connected, the AI assistant can fetch and process web pages in real time — reading documentation, researching competitors, pulling product specs, or ingesting content for analysis. The web2md MCP server handles JavaScript rendering, content extraction, and noise removal, returning clean markdown that fits directly into the assistant's context window.

This matters for a few reasons:

  • No glue code. You do not need to write a LangChain document loader or a custom tool. The MCP protocol handles tool discovery and invocation.
  • Token-efficient. The clean markdown output uses 67% fewer tokens than raw HTML, which means more useful content fits in the context window.
  • Real-time. The assistant fetches live web data during the conversation, not from a stale index.

For production RAG pipelines, the REST API gives you more control over batching, scheduling, and metadata. For interactive AI workflows — coding assistants, research agents, chat-based tools — MCP is the faster integration path.

Preparing LLM-Ready Data from Web Scraping

Whether you are building a RAG pipeline or preparing training data for fine-tuning, the same principle applies: quality beats quantity every time.

A thousand pages of clean, well-structured markdown will produce better results than ten thousand pages of noisy HTML. Here is what "clean" means in practice:

Deduplication. The web is full of duplicate content — syndicated articles, mirror sites, paginated pages that repeat headers. Deduplicate at the URL level first, then at the content level using simhash or minhash. Feeding duplicates into your embedding model wastes compute and skews retrieval toward over-represented content.
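Exact-duplicate removal is a few lines with a normalized hash; near-duplicates (lightly edited syndicated copies) need simhash or minhash on top. A sketch of the exact-match pass:

```python
import hashlib

def dedupe(pages: list[str]) -> list[str]:
    # Exact-duplicate removal via a normalized content hash.
    # Near-duplicate detection would use simhash/minhash instead.
    seen, unique = set(), []
    for page in pages:
        key = hashlib.md5(" ".join(page.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(page)
    return unique

pages = [
    "## Pricing\nPlans from $49.",
    "## Pricing\nplans   from $49.",  # same content, different whitespace/case
    "## FAQ\nFree tier: yes.",
]
unique = dedupe(pages)
```

Lowercasing and collapsing whitespace before hashing catches the trivial duplicates that URL-level dedup misses.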

Content filtering. Not every page on a domain is useful. Remove 404 pages, login walls, terms of service (unless that is your use case), and auto-generated tag/category pages. Filter by content length too — a page with 50 words of actual content is not worth indexing.
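A simple filter along these lines works as a first pass — the blocklist patterns and the 50-word floor here are illustrative, so tune both for your corpus:

```python
import re

# Titles that signal error pages, login walls, or legal boilerplate
BLOCKLIST = re.compile(r"(404|page not found|log in|sign in|terms of service)", re.I)

def worth_indexing(markdown: str, min_words: int = 50) -> bool:
    lines = markdown.splitlines()
    title = lines[0] if lines else ""
    if BLOCKLIST.search(title):
        return False  # obvious error/login/legal page
    return len(markdown.split()) >= min_words  # drop thin pages
```

Run this between conversion and chunking so junk pages never reach the embedding step.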

Language consistency. If your LLM application serves English-speaking users, filter out non-English pages before embedding. Mixed-language corpora degrade retrieval quality.

Structural consistency. Markdown from different sources can vary in formatting — some use * for lists, others use -. Some use # for all headings, others skip levels. Normalize your markdown before chunking to ensure consistent embedding quality.
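A couple of regex passes cover the common cases — list markers and excess blank lines here; extend as your sources require:

```python
import re

def normalize_markdown(md: str) -> str:
    # Unify list markers: * and + become -
    md = re.sub(r"(?m)^([ \t]*)[*+] ", r"\1- ", md)
    # Collapse runs of blank lines down to a single blank line
    md = re.sub(r"\n{3,}", "\n\n", md)
    return md

messy = "## Plans\n\n\n\n* Starter\n+ Pro\n- Enterprise"
clean = normalize_markdown(messy)
```

Normalizing before chunking means identical content from different sources embeds identically.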

Freshness signals. When possible, extract and store publication dates, last-modified headers, or version numbers. This lets you prioritize recent content during retrieval — critical for domains where information changes frequently.

Common Mistakes in Web Data Collection

I have seen teams burn weeks on web data pipelines that produce terrible results. These are the patterns that keep showing up.

Scraping raw HTML and feeding it directly to the model. This is the most common mistake. You fetch a page with requests.get(), dump the HTML into your chunker, and wonder why retrieval quality is awful. The model is embedding CSS classes and <div> tags instead of actual content. Always convert to markdown first.

Ignoring JavaScript-rendered pages. More than half the web renders content client-side. If you are using a simple HTTP client, you are getting an empty <div id="root"></div> instead of the actual page content. You need a headless browser — or an API that handles rendering for you.

Not removing boilerplate. Even if you convert HTML to markdown, you need to extract the main content first. A naive HTML-to-markdown conversion will faithfully convert your navigation menu, cookie banner, and footer into markdown — which then pollutes your embeddings.

Fixed-size chunking. Splitting content every 500 characters with no regard for document structure produces chunks that start mid-sentence and end mid-paragraph. Use heading-based or semantic chunking instead.

No source tracking. When your LLM says something wrong and a user asks "where did this come from?", you need to point to the source URL and the specific chunk. Without metadata, you cannot debug retrieval quality or trace errors.

One-time ingestion. Web data is not static. If you ingest a set of URLs once and never refresh, your knowledge base drifts further from reality every day. Build refresh into your pipeline from the start.

Ship Better LLM Applications

The bottleneck for most LLM applications is not the model — it is the data. A mediocre model with clean, well-structured input data will outperform a frontier model with noisy, poorly formatted context every time.

Get the extraction right. Convert to markdown. Chunk on semantic boundaries. Store metadata. Refresh regularly.

If you are building a RAG pipeline that needs web data, start with the URL to Markdown API. One API call per URL, clean markdown back, no infrastructure to maintain. Or try the URL to Markdown tool to see the output quality before writing any code. For a step-by-step walkthrough of different extraction methods, see our guide on how to extract text from any website.
