Developer · 10 min read

How to Extract Text from Any Website (2026 Guide)

How to extract text from any website — browser tools, Python scripts, and APIs. Covers JS-rendered pages and AI-ready output.

Jesse Ouellette

February 21, 2026

You need the text from a webpage — not the navigation, not the ads, not the cookie banners. Just the actual content.

Maybe you're feeding it into an LLM. Maybe you're archiving documentation. Maybe you're building a dataset. Whatever the reason, the raw HTML is useless. You need clean, structured text.

This guide covers every practical method to extract text from a website, from zero-code online tools to Python scripts to production-ready APIs.

Extract Text from Website Online

The fastest way to convert a webpage to text is an online tool. No installation, no code.

LeadMagic's URL to Markdown converter does this in one step:

  1. Paste a URL
  2. Click convert
  3. Get clean markdown output

The tool renders the page (including JavaScript), strips navigation and boilerplate, and returns the main content as structured markdown. Headings, lists, tables, and code blocks are preserved. Ads and chrome are removed.

This works for one-off extractions. If you already have the HTML source, use the HTML to Markdown tool instead.

When to use this method:

  • You need text from a handful of pages
  • You don't want to write code
  • You need it right now

Extract Website Text with Python

For repeatable extraction, Python is the standard. The combination of requests and BeautifulSoup handles most static pages.

Basic extraction

import requests
from bs4 import BeautifulSoup

url = "https://example.com/blog/some-article"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Remove scripts, styles, and page chrome before extracting text
for tag in soup(["script", "style", "nav", "footer", "header"]):
    tag.decompose()

text = soup.get_text(separator="\n", strip=True)
print(text)

This gives you raw text. It works, but the output is flat — no headings, no structure, no distinction between a title and a paragraph.

Extracting structured content

If you need the document structure (and you probably do), target specific elements:

article = soup.find("article") or soup.find("main")

if article:
    # Walk headings and paragraphs in document order so structure is preserved
    for element in article.find_all(["h1", "h2", "h3", "p"]):
        if element.name.startswith("h"):
            level = int(element.name[1])
            print(f"{'#' * level} {element.get_text(strip=True)}")
        else:
            print(element.get_text(strip=True))
            print()
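The heading-and-paragraph loop above can be wrapped into a reusable helper that returns a markdown string instead of printing. A minimal sketch (the function name and sample HTML are illustrative, not part of any library):

```python
from bs4 import BeautifulSoup

def article_to_markdown(html: str) -> str:
    """Render an article's headings and paragraphs as markdown-ish text."""
    soup = BeautifulSoup(html, "html.parser")
    article = soup.find("article") or soup.find("main") or soup
    lines = []
    for element in article.find_all(["h1", "h2", "h3", "p"]):
        text = element.get_text(strip=True)
        if not text:
            continue
        if element.name.startswith("h"):
            # h1 -> "# ", h2 -> "## ", h3 -> "### "
            lines.append(f"{'#' * int(element.name[1])} {text}")
        else:
            lines.append(text)
    return "\n\n".join(lines)

sample = "<article><h1>Title</h1><p>Intro.</p><h2>Setup</h2><p>Steps.</p></article>"
print(article_to_markdown(sample))
```

Returning a string instead of printing makes the function easy to reuse in the batch loop below.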

Batch extraction

Loop through multiple URLs:

urls = [
    "https://example.com/page-1",
    "https://example.com/page-2",
    "https://example.com/page-3",
]

results = {}
for url in urls:
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    results[url] = soup.get_text(separator="\n", strip=True)

Limitations: This approach fails on JavaScript-rendered pages. If the content loads after the initial HTML (SPAs, infinite scroll, lazy-loaded sections), requests won't see it.

Convert Webpage to Text with JavaScript

If you're in a Node.js environment, cheerio is the BeautifulSoup equivalent.

Static pages with Cheerio

import * as cheerio from "cheerio";

const response = await fetch("https://example.com/article");
const html = await response.text();
const $ = cheerio.load(html);

// Remove noise
$("script, style, nav, footer, header").remove();

const text = $("article").text().trim();
console.log(text);

Dynamic pages with Puppeteer

For JavaScript-rendered content, use Puppeteer to run a real browser:

import puppeteer from "puppeteer";

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://example.com/spa-page", {
  waitUntil: "networkidle0",
});

const text = await page.evaluate(() => {
  const article = document.querySelector("article") || document.querySelector("main");
  return article ? article.innerText : document.body.innerText;
});

console.log(text);
await browser.close();

Puppeteer gives you the fully rendered DOM, so dynamic content is captured. The trade-off is speed — launching a browser is orders of magnitude slower than an HTTP request.

Extract Text from JavaScript-Rendered Pages

This is where most extraction projects hit a wall. Modern websites rely heavily on client-side rendering:

  • React, Vue, Angular SPAs — Content doesn't exist in the initial HTML
  • Lazy-loaded sections — Content loads on scroll or interaction
  • API-driven pages — Data fetched after page load via XHR/fetch
  • Auth-gated content — Requires login before content appears

A basic requests.get() or fetch() returns the empty shell — a <div id="root"></div> with no content.
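Before reaching for a headless browser, you can often detect an empty shell from the raw HTML alone. A rough heuristic sketch using only the standard library (the function name and the 200-character threshold are arbitrary choices, not a standard):

```python
import re

def looks_js_rendered(html: str, min_text_chars: int = 200) -> bool:
    """Rough heuristic: strip markup and see how much visible text remains.

    A client-rendered shell is mostly tags and scripts with almost no text.
    Tune the threshold for your pages.
    """
    no_scripts = re.sub(r"(?s)<(script|style)[^>]*>.*?</\1>", " ", html)
    text = re.sub(r"<[^>]+>", " ", no_scripts)
    text = re.sub(r"\s+", " ", text).strip()
    return len(text) < min_text_chars

shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
static_page = "<html><body><article>" + "<p>Real paragraph content here.</p>" * 20 + "</article></body></html>"

print(looks_js_rendered(shell))        # the SPA shell has no visible text
print(looks_js_rendered(static_page))  # the static page has plenty
```

Running this check on the plain-HTTP response tells you whether a cheap `requests` fetch is enough or you need one of the rendering approaches below.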

Solutions

| Approach | Pros | Cons |
|---|---|---|
| Playwright / Puppeteer | Full browser, handles everything | Slow, resource-heavy, breaks at scale |
| Selenium | Mature, well-documented | Even slower than Playwright |
| requests-html (Python) | Lighter than a full browser | Limited JS support, unmaintained |
| LeadMagic API | Managed rendering, fast, structured output | Costs credits per request |

For one-off scripts, Playwright works fine. For production pipelines processing hundreds or thousands of URLs, running headless browsers becomes an infrastructure problem. That's where a managed API makes more sense.

Website to Text via API

When you need to extract content from URLs at scale — reliably, without managing browser infrastructure — use the URL to Markdown API.

How it works

Send a URL, get back clean markdown:

curl -X POST https://api.web2md.app/api/scrape \
  -H "X-API-Key: your_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article"}'

Response:

{
  "success": true,
  "data": {
    "markdown": "# Article Title\n\nFirst paragraph of content...\n\n## Section Heading\n\n...",
    "title": "Article Title",
    "url": "https://example.com/article"
  }
}

The API handles JavaScript rendering, content extraction, and cleanup. You get structured markdown — not a blob of text.

When to use the API

  • Batch processing — Extract text from hundreds of URLs programmatically
  • Production pipelines — Reliable extraction without managing Puppeteer/Playwright infrastructure
  • JS-rendered pages — No need to run your own headless browser
  • AI/LLM workflows — Get markdown that's immediately usable as LLM context

Python example

import requests

api_key = "your_key"
urls = ["https://example.com/page-1", "https://example.com/page-2"]

for url in urls:
    response = requests.post(
        "https://api.web2md.app/api/scrape",
        headers={
            "X-API-Key": api_key,
            "Content-Type": "application/json",
        },
        json={"url": url},
    )
    data = response.json()
    if data["success"]:
        print(f"--- {url} ---")
        print(data["data"]["markdown"][:500])

Choosing the Right Text Extraction Method

| Method | Best for | Handles JS | Structured output | Setup time |
|---|---|---|---|---|
| Online tool | Quick one-off extractions | Yes | Markdown | None |
| Python + BeautifulSoup | Static pages, small batches | No | Raw text | 5 min |
| Node.js + Cheerio | Static pages (JS ecosystem) | No | Raw text | 5 min |
| Puppeteer / Playwright | JS-rendered pages, scraping | Yes | Raw text | 15 min |
| LeadMagic API | Production pipelines, batch | Yes | Markdown | 2 min |

Decision shortcut:

  • Need it once? Use the online tool.
  • Need it in a script, static pages only? Use BeautifulSoup or Cheerio.
  • Need JS rendering in a script? Use Playwright.
  • Need it at scale, reliably? Use the API.

Extract Content from URL for AI and LLMs

If you're extracting website text to feed into an LLM, the output format matters as much as the extraction itself.

Raw HTML is wasteful. A typical webpage's HTML is 70-80% markup, navigation, scripts, and styling. The actual content is a fraction of the total. Feeding raw HTML into an LLM means:

  • Wasted tokens — You're paying for <div class="container mx-auto px-4"> instead of actual content
  • Worse results — The model has to parse through noise to find the signal
  • Context window limits — Large HTML files can exceed context limits entirely

Markdown solves this. It preserves structure (headings, lists, tables, code blocks) while discarding everything else. Our testing shows markdown uses roughly 67% fewer tokens than the equivalent HTML.
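To see the overhead concretely, compare a utility-class-heavy HTML snippet with the same content as markdown. The snippet and the resulting ratio are illustrative only; real pages vary:

```python
# A small, realistic HTML fragment with framework class names
html = (
    '<div class="container mx-auto px-4">'
    '<h2 class="text-xl font-bold mb-2">Installation</h2>'
    '<p class="text-gray-700">Run pip install to get started.</p>'
    "</div>"
)
# The same content as markdown
markdown = "## Installation\n\nRun pip install to get started."

# Fraction of bytes spent on markup rather than content
overhead = 1 - len(markdown) / len(html)
print(f"Markdown is {overhead:.0%} smaller than the equivalent HTML")
```

Even on this tiny fragment the markup dominates; on full pages with navigation, scripts, and styling the gap is far larger.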

Practical example

Extracting a documentation page:

| Format | Size | Tokens (GPT-4) | Usable content |
|---|---|---|---|
| Raw HTML | 47 KB | ~12,000 | ~30% |
| Plain text | 8 KB | ~2,100 | ~85% |
| Markdown | 11 KB | ~2,800 | ~95% |

Markdown is slightly larger than plain text, but it retains structure that plain text loses. An LLM reading markdown understands that ## Installation is a section heading — that context matters for accurate responses.

For more on converting existing HTML to markdown, see the HTML to Markdown guide.

RAG pipelines

For retrieval-augmented generation, markdown extraction is the first step:

  1. Extract — Convert URLs to markdown via API
  2. Chunk — Split markdown by headings (natural semantic boundaries)
  3. Embed — Generate embeddings for each chunk
  4. Store — Index in a vector database
  5. Retrieve — Query with user questions, retrieve relevant chunks

Markdown's heading structure gives you natural chunk boundaries. You don't need custom splitting logic — just split on ## headings and each chunk is a coherent section.
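The split-on-headings step can be sketched with a single regex (the function name and sample document are illustrative):

```python
import re

def chunk_markdown(markdown: str) -> list[str]:
    """Split markdown into sections at '## ' headings.

    Anything before the first '## ' heading becomes the first chunk.
    """
    # Zero-width split: each '## ' at the start of a line begins a new chunk
    chunks = re.split(r"(?m)^(?=## )", markdown)
    return [chunk.strip() for chunk in chunks if chunk.strip()]

doc = "# Guide\n\nIntro text.\n\n## Installation\n\npip install.\n\n## Usage\n\nRun it."
for chunk in chunk_markdown(doc):
    print(chunk.splitlines()[0])  # first line of each chunk is its heading
```

Each chunk keeps its heading attached, so the embedding for a section carries its own context.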

Get Started

Pick the method that fits your use case. For quick extractions, open the URL to Markdown converter and paste a URL. For programmatic access, grab an API key from app.leadmagic.io and start with the URL to Markdown API.

Every LeadMagic account includes API access. No separate plan, no add-ons.

Get your API key in 30 seconds

100 free credits. No credit card. API, CLI, and MCP — all from one key.