# How to Extract Text from Any Website (2026 Guide)
How to extract text from any website — browser tools, Python scripts, and APIs. Covers JS-rendered pages and AI-ready output.
Jesse Ouellette
February 21, 2026
You need the text from a webpage — not the navigation, not the ads, not the cookie banners. Just the actual content.
Maybe you're feeding it into an LLM. Maybe you're archiving documentation. Maybe you're building a dataset. Whatever the reason, the raw HTML is useless. You need clean, structured text.
This guide covers every practical method to extract text from a website, from zero-code online tools to Python scripts to production-ready APIs.
## Extract Text from Website Online
The fastest way to convert a webpage to text is an online tool. No installation, no code.
LeadMagic's URL to Markdown converter does this in one step:
- Paste a URL
- Click convert
- Get clean markdown output
The tool renders the page (including JavaScript), strips navigation and boilerplate, and returns the main content as structured markdown. Headings, lists, tables, and code blocks are preserved. Ads and chrome are removed.
This works for one-off extractions. If you already have the HTML source, use the HTML to Markdown tool instead.
When to use this method:
- You need text from a handful of pages
- You don't want to write code
- You need it right now
## Extract Website Text with Python
For repeatable extraction, Python is the standard. The combination of `requests` and BeautifulSoup handles most static pages.
### Basic extraction

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/blog/some-article"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(response.text, "html.parser")

# Remove script, style, and boilerplate elements
for tag in soup(["script", "style", "nav", "footer", "header"]):
    tag.decompose()

text = soup.get_text(separator="\n", strip=True)
print(text)
```
This gives you raw text. It works, but the output is flat — no headings, no structure, no distinction between a title and a paragraph.
### Extracting structured content
If you need the document structure (and you probably do), target specific elements:
```python
article = soup.find("article") or soup.find("main")

if article:
    for heading in article.find_all(["h1", "h2", "h3"]):
        level = int(heading.name[1])
        print(f"{'#' * level} {heading.get_text(strip=True)}")
    for paragraph in article.find_all("p"):
        print(paragraph.get_text(strip=True))
        print()
```
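Printing works for inspection, but downstream code usually wants a single string. A minimal sketch of that assembly step; `to_markdown` is a hypothetical helper (not part of BeautifulSoup) that takes the tag name and text of each element:

```python
def to_markdown(elements):
    """Render (tag_name, text) pairs as one markdown string.

    `elements` might come from the BeautifulSoup loop above, e.g.
    [(el.name, el.get_text(strip=True)) for el in article.find_all(...)].
    """
    lines = []
    for tag, text in elements:
        if tag in ("h1", "h2", "h3"):
            # h2 -> "## ", h3 -> "### ", and so on
            lines.append("#" * int(tag[1]) + " " + text)
        else:
            lines.append(text)
        lines.append("")  # blank line between blocks
    return "\n".join(lines).strip()
```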
### Batch extraction
Loop through multiple URLs:
```python
urls = [
    "https://example.com/page-1",
    "https://example.com/page-2",
    "https://example.com/page-3",
]

results = {}

for url in urls:
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(response.text, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    results[url] = soup.get_text(separator="\n", strip=True)
```
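One bad URL shouldn't kill the whole batch. A sketch of a retry wrapper, assuming `fetch` is any callable that takes a URL and returns extracted text (for instance, a function wrapping the requests/BeautifulSoup code above):

```python
import time

def fetch_all(urls, fetch, retries=2, delay=1.0):
    """Apply `fetch` to each URL, retrying failures with a growing delay.

    Returns (results, failures) so one failing URL never stops the batch.
    """
    results, failures = {}, []
    for url in urls:
        for attempt in range(retries + 1):
            try:
                results[url] = fetch(url)
                break
            except Exception:
                if attempt == retries:
                    failures.append(url)
                else:
                    time.sleep(delay * (attempt + 1))
    return results, failures
```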
Limitations: This approach fails on JavaScript-rendered pages. If the content loads after the initial HTML (SPAs, infinite scroll, lazy-loaded sections), `requests` won't see it.
## Convert Webpage to Text with JavaScript
If you're in a Node.js environment, cheerio is the BeautifulSoup equivalent.
### Static pages with Cheerio

```javascript
import * as cheerio from "cheerio";

const response = await fetch("https://example.com/article");
const html = await response.text();
const $ = cheerio.load(html);

// Remove noise
$("script, style, nav, footer, header").remove();

const text = $("article").text().trim();
console.log(text);
```
### Dynamic pages with Puppeteer
For JavaScript-rendered content, use Puppeteer to run a real browser:
```javascript
import puppeteer from "puppeteer";

const browser = await puppeteer.launch();
const page = await browser.newPage();

await page.goto("https://example.com/spa-page", {
  waitUntil: "networkidle0",
});

const text = await page.evaluate(() => {
  const article =
    document.querySelector("article") || document.querySelector("main");
  return article ? article.innerText : document.body.innerText;
});

console.log(text);
await browser.close();
```
Puppeteer gives you the fully rendered DOM, so dynamic content is captured. The trade-off is speed — launching a browser is orders of magnitude slower than an HTTP request.
## Extract Text from JavaScript-Rendered Pages
This is where most extraction projects hit a wall. Modern websites rely heavily on client-side rendering:
- React, Vue, Angular SPAs — Content doesn't exist in the initial HTML
- Lazy-loaded sections — Content loads on scroll or interaction
- API-driven pages — Data fetched after page load via XHR/fetch
- Auth-gated content — Requires login before content appears
A basic `requests.get()` or `fetch()` returns the empty shell — a `<div id="root"></div>` with no content.
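You can often detect this case before reaching for a browser. A rough heuristic (a sketch, not production code): strip the markup and check whether any meaningful text survives.

```python
import re

def looks_js_rendered(html, threshold=200):
    """Heuristic: if a page's HTML yields almost no visible text after
    removing scripts, styles, and tags, the content is probably rendered
    client-side and needs a real browser (or a rendering API)."""
    # Drop script/style bodies, then every remaining tag
    stripped = re.sub(r"(?is)<(script|style)[^>]*>.*?</\1>", "", html)
    stripped = re.sub(r"<[^>]+>", "", stripped)
    return len(stripped.strip()) < threshold
```

The threshold is arbitrary; tune it against pages you know are static.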
### Solutions
| Approach | Pros | Cons |
|---|---|---|
| Playwright/Puppeteer | Full browser, handles everything | Slow, resource-heavy, breaks at scale |
| Selenium | Mature, well-documented | Even slower than Playwright |
| requests-html (Python) | Lighter than a full browser | Limited JS support, unmaintained |
| LeadMagic API | Managed rendering, fast, structured output | Costs credits per request |
For one-off scripts, Playwright works fine. For production pipelines processing hundreds or thousands of URLs, running headless browsers becomes an infrastructure problem. That's where a managed API makes more sense.
## Website to Text via API
When you need to extract content from URLs at scale — reliably, without managing browser infrastructure — use the URL to Markdown API.
### How it works
Send a URL, get back clean markdown:
```bash
curl -X POST https://api.web2md.app/api/scrape \
  -H "X-API-Key: your_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article"}'
```
Response:

```json
{
  "success": true,
  "data": {
    "markdown": "# Article Title\n\nFirst paragraph of content...\n\n## Section Heading\n\n...",
    "title": "Article Title",
    "url": "https://example.com/article"
  }
}
```
The API handles JavaScript rendering, content extraction, and cleanup. You get structured markdown — not a blob of text.
### When to use the API
- Batch processing — Extract text from hundreds of URLs programmatically
- Production pipelines — Reliable extraction without managing Puppeteer/Playwright infrastructure
- JS-rendered pages — No need to run your own headless browser
- AI/LLM workflows — Get markdown that's immediately usable as LLM context
### Python example
```python
import requests

api_key = "your_key"
urls = ["https://example.com/page-1", "https://example.com/page-2"]

for url in urls:
    response = requests.post(
        "https://api.web2md.app/api/scrape",
        headers={
            "X-API-Key": api_key,
            "Content-Type": "application/json",
        },
        json={"url": url},
    )
    data = response.json()
    if data["success"]:
        print(f"--- {url} ---")
        print(data["data"]["markdown"][:500])
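The loop above runs sequentially. API calls are I/O-bound, so a thread pool speeds up large batches; a sketch, where `scrape` stands in for any callable that wraps the POST request shown above:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_many(urls, scrape, max_workers=5):
    """Run `scrape` across a thread pool; map each URL to its result,
    or to None if the call raised."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception:
                results[url] = None
    return results
```

Keep `max_workers` modest so you stay inside the provider's rate limits.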
## Choosing the Right Text Extraction Method
| Method | Best for | Handles JS | Structured output | Setup time |
|---|---|---|---|---|
| Online tool | Quick one-off extractions | Yes | Markdown | None |
| Python + BeautifulSoup | Static pages, small batches | No | Raw text | 5 min |
| Node.js + Cheerio | Static pages (JS ecosystem) | No | Raw text | 5 min |
| Puppeteer / Playwright | JS-rendered pages, scraping | Yes | Raw text | 15 min |
| LeadMagic API | Production pipelines, batch | Yes | Markdown | 2 min |
Decision shortcut:
- Need it once? Use the online tool.
- Need it in a script, static pages only? Use BeautifulSoup or Cheerio.
- Need JS rendering in a script? Use Playwright.
- Need it at scale, reliably? Use the API.
## Extract Content from URL for AI and LLMs
If you're extracting website text to feed into an LLM, the output format matters as much as the extraction itself.
Raw HTML is wasteful. A typical webpage's HTML is 70-80% markup, navigation, scripts, and styling. The actual content is a fraction of the total. Feeding raw HTML into an LLM means:
- Wasted tokens — You're paying for `<div class="container mx-auto px-4">` instead of actual content
- Worse results — The model has to parse through noise to find the signal
- Context window limits — Large HTML files can exceed context limits entirely
Markdown solves this. It preserves structure (headings, lists, tables, code blocks) while discarding everything else. Our testing shows markdown uses roughly 67% fewer tokens than the equivalent HTML.
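The exact savings depend on the page, but a characters-per-token estimate (roughly four characters per token for English; use a real tokenizer like tiktoken for actual counts) gives a feel for the difference. The snippet and its sample strings are illustrative only:

```python
def rough_tokens(text):
    """Crude estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

# The same content as HTML and as markdown
html = '<div class="prose"><h2 id="install">Installation</h2><p>Run the installer.</p></div>'
markdown = "## Installation\n\nRun the installer."

# Fraction of tokens saved by sending markdown instead of HTML
saving = 1 - rough_tokens(markdown) / rough_tokens(html)
```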
### Practical example
Extracting a documentation page:
| Format | Size | Tokens (GPT-4) | Usable content |
|---|---|---|---|
| Raw HTML | 47 KB | ~12,000 | ~30% |
| Plain text | 8 KB | ~2,100 | ~85% |
| Markdown | 11 KB | ~2,800 | ~95% |
Markdown is slightly larger than plain text, but it retains structure that plain text loses. An LLM reading markdown understands that `## Installation` is a section heading — that context matters for accurate responses.
For more on converting existing HTML to markdown, see the HTML to Markdown guide.
### RAG pipelines
For retrieval-augmented generation, markdown extraction is the first step:
1. Extract — Convert URLs to markdown via API
2. Chunk — Split markdown by headings (natural semantic boundaries)
3. Embed — Generate embeddings for each chunk
4. Store — Index in a vector database
5. Retrieve — Query with user questions, retrieve relevant chunks
Markdown's heading structure gives you natural chunk boundaries. You don't need custom splitting logic — just split on `##` headings and each chunk is a coherent section.
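A sketch of that splitting step (stdlib only; assumes headings sit at the start of a line):

```python
import re

def chunk_by_heading(markdown, level=2):
    """Split markdown into chunks, breaking before each heading of the
    given level. Each chunk keeps its heading line as context."""
    marker = "#" * level + " "
    # Zero-width lookahead split: break *before* each heading line
    parts = re.split(rf"^(?={re.escape(marker)})", markdown, flags=re.MULTILINE)
    return [part.strip() for part in parts if part.strip()]
```

Any text before the first heading becomes its own chunk, which is usually what you want for an intro paragraph.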
## Get Started
Pick the method that fits your use case. For quick extractions, open the URL to Markdown converter and paste a URL. For programmatic access, grab an API key from app.leadmagic.io and start with the URL to Markdown API.
Every LeadMagic account includes API access. No separate plan, no add-ons.
## Related Posts
- Integrate an email finder API with curl, Python, and Node.js. Includes auth, rate limits, error handling, and batch patterns.
- Learn how to convert HTML to Markdown with Python, JavaScript, Pandoc, and online tools. Code examples and comparison table included.
- Markdown vs HTML — syntax differences, when to use each, and conversion methods. Why markdown wins for LLMs and AI pipelines.