Developer · 10 min read

How to Extract Text from Any Website (2026 Guide)

How to extract text from any website — browser tools, Python scripts, and APIs. Covers JS-rendered pages and AI-ready output.

Jesse Ouellette

February 21, 2026

You need the text from a webpage — not the navigation, not the ads, not the cookie banners. Just the actual content.

Maybe you're feeding it into an LLM. Maybe you're archiving documentation. Maybe you're building a dataset. Whatever the reason, the raw HTML is useless. You need clean, structured text.

This guide covers every practical method to extract text from a website, from zero-code online tools to Python scripts to production-ready APIs.

Extract Text from Website Online

The fastest way to convert a webpage to text is an online tool. No installation, no code.

LeadMagic's URL to Markdown converter does this in one step:

  1. Paste a URL
  2. Click convert
  3. Get clean markdown output

The tool renders the page (including JavaScript), strips navigation and boilerplate, and returns the main content as structured markdown. Headings, lists, tables, and code blocks are preserved. Ads and chrome are removed.

This works for one-off extractions. If you already have the HTML source, use the HTML to Markdown tool instead.

When to use this method:

  • You need text from a handful of pages
  • You don't want to write code
  • You need it right now

Extract Website Text with Python

For repeatable extraction, Python is the standard. The combination of requests and BeautifulSoup handles most static pages.

Basic extraction

import requests
from bs4 import BeautifulSoup

url = "https://example.com/blog/some-article"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Remove scripts, styles, and page chrome before extracting text
for tag in soup(["script", "style", "nav", "footer", "header"]):
    tag.decompose()

text = soup.get_text(separator="\n", strip=True)
print(text)

This gives you raw text. It works, but the output is flat — no headings, no structure, no distinction between a title and a paragraph.

Extracting structured content

If you need the document structure (and you probably do), target specific elements:

article = soup.find("article") or soup.find("main")

if article:
    # Walk headings and paragraphs in document order so structure is preserved
    for element in article.find_all(["h1", "h2", "h3", "p"]):
        if element.name.startswith("h"):
            level = int(element.name[1])
            print(f"{'#' * level} {element.get_text(strip=True)}")
        else:
            print(element.get_text(strip=True))
            print()
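The heading-and-paragraph loop above can be wrapped into a reusable helper that returns a markdown string instead of printing. A minimal sketch (the function name and sample HTML are illustrative, not part of any library):

```python
from bs4 import BeautifulSoup

def article_to_markdown(html: str) -> str:
    """Render an article's headings and paragraphs as markdown-ish text."""
    soup = BeautifulSoup(html, "html.parser")
    article = soup.find("article") or soup.find("main") or soup
    lines = []
    for element in article.find_all(["h1", "h2", "h3", "p"]):
        text = element.get_text(strip=True)
        if not text:
            continue
        if element.name.startswith("h"):
            # h1 -> "# ", h2 -> "## ", h3 -> "### "
            lines.append(f"{'#' * int(element.name[1])} {text}")
        else:
            lines.append(text)
    return "\n\n".join(lines)

sample = "<article><h1>Title</h1><p>Intro.</p><h2>Setup</h2><p>Steps.</p></article>"
print(article_to_markdown(sample))
```

Returning a string instead of printing makes the function easy to reuse in the batch loop below.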

Batch extraction

Loop through multiple URLs:

urls = [
    "https://example.com/page-1",
    "https://example.com/page-2",
    "https://example.com/page-3",
]

results = {}
for url in urls:
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    results[url] = soup.get_text(separator="\n", strip=True)

Limitations: This approach fails on JavaScript-rendered pages. If the content loads after the initial HTML (SPAs, infinite scroll, lazy-loaded sections), requests won't see it.

Convert Webpage to Text with JavaScript

If you're in a Node.js environment, cheerio is the BeautifulSoup equivalent.

Static pages with Cheerio

import * as cheerio from "cheerio";

const response = await fetch("https://example.com/article");
const html = await response.text();
const $ = cheerio.load(html);

// Remove noise
$("script, style, nav, footer, header").remove();

const text = $("article").text().trim();
console.log(text);

Dynamic pages with Puppeteer

For JavaScript-rendered content, use Puppeteer to run a real browser:

import puppeteer from "puppeteer";

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://example.com/spa-page", {
  waitUntil: "networkidle0",
});

const text = await page.evaluate(() => {
  const article = document.querySelector("article") || document.querySelector("main");
  return article ? article.innerText : document.body.innerText;
});

console.log(text);
await browser.close();

Puppeteer gives you the fully rendered DOM, so dynamic content is captured. The trade-off is speed — launching a browser is orders of magnitude slower than an HTTP request.

Extract Text from JavaScript-Rendered Pages

This is where most extraction projects hit a wall. Modern websites rely heavily on client-side rendering:

  • React, Vue, Angular SPAs — Content doesn't exist in the initial HTML
  • Lazy-loaded sections — Content loads on scroll or interaction
  • API-driven pages — Data fetched after page load via XHR/fetch
  • Auth-gated content — Requires login before content appears

A basic requests.get() or fetch() returns the empty shell — a <div id="root"></div> with no content.
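Before reaching for a headless browser, you can often detect an empty shell from the raw HTML alone. A rough heuristic sketch using only the standard library (the function name and the 200-character threshold are arbitrary choices, not a standard):

```python
import re

def looks_js_rendered(html: str, min_text_chars: int = 200) -> bool:
    """Rough heuristic: strip markup and see how much visible text remains.

    A client-rendered shell is mostly tags and scripts with almost no text.
    Tune the threshold for your pages.
    """
    no_scripts = re.sub(r"(?s)<(script|style)[^>]*>.*?</\1>", " ", html)
    text = re.sub(r"<[^>]+>", " ", no_scripts)
    text = re.sub(r"\s+", " ", text).strip()
    return len(text) < min_text_chars

shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
static_page = "<html><body><article>" + "<p>Real paragraph content here.</p>" * 20 + "</article></body></html>"

print(looks_js_rendered(shell))        # the SPA shell has no visible text
print(looks_js_rendered(static_page))  # the static page has plenty
```

Running this check on the plain-HTTP response tells you whether a cheap `requests` fetch is enough or you need one of the rendering approaches below.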

Solutions

| Approach | Pros | Cons |
|---|---|---|
| Playwright / Puppeteer | Full browser, handles everything | Slow, resource-heavy, breaks at scale |
| Selenium | Mature, well-documented | Even slower than Playwright |
| requests-html (Python) | Lighter than a full browser | Limited JS support, unmaintained |
| LeadMagic API | Managed rendering, fast, structured output | Costs credits per request |

For one-off scripts, Playwright works fine. For production pipelines processing hundreds or thousands of URLs, running headless browsers becomes an infrastructure problem. That's where a managed API makes more sense.

Website to Text via API

When you need to extract content from URLs at scale — reliably, without managing browser infrastructure — use the URL to Markdown API.

How it works

Send a URL, get back clean markdown:

curl -X POST https://api.web2md.app/api/scrape \
  -H "X-API-Key: your_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article"}'

Response:

{
  "success": true,
  "data": {
    "markdown": "# Article Title\n\nFirst paragraph of content...\n\n## Section Heading\n\n...",
    "title": "Article Title",
    "url": "https://example.com/article"
  }
}

The API handles JavaScript rendering, content extraction, and cleanup. You get structured markdown — not a blob of text.

When to use the API

  • Batch processing — Extract text from hundreds of URLs programmatically
  • Production pipelines — Reliable extraction without managing Puppeteer/Playwright infrastructure
  • JS-rendered pages — No need to run your own headless browser
  • AI/LLM workflows — Get markdown that's immediately usable as LLM context

Python example

import requests

api_key = "your_key"
urls = ["https://example.com/page-1", "https://example.com/page-2"]

for url in urls:
    response = requests.post(
        "https://api.web2md.app/api/scrape",
        headers={
            "X-API-Key": api_key,
            "Content-Type": "application/json",
        },
        json={"url": url},
    )
    data = response.json()
    if data["success"]:
        print(f"--- {url} ---")
        print(data["data"]["markdown"][:500])

Choosing the Right Text Extraction Method

| Method | Best for | Handles JS | Structured output | Setup time |
|---|---|---|---|---|
| Online tool | Quick one-off extractions | Yes | Markdown | None |
| Python + BeautifulSoup | Static pages, small batches | No | Raw text | 5 min |
| Node.js + Cheerio | Static pages (JS ecosystem) | No | Raw text | 5 min |
| Puppeteer / Playwright | JS-rendered pages, scraping | Yes | Raw text | 15 min |
| LeadMagic API | Production pipelines, batch | Yes | Markdown | 2 min |

Decision shortcut:

  • Need it once? Use the online tool.
  • Need it in a script, static pages only? Use BeautifulSoup or Cheerio.
  • Need JS rendering in a script? Use Playwright.
  • Need it at scale, reliably? Use the API.

Extract Content from URL for AI and LLMs

If you're extracting website text to feed into an LLM, the output format matters as much as the extraction itself.

Raw HTML is wasteful. A typical webpage's HTML is 70-80% markup, navigation, scripts, and styling. The actual content is a fraction of the total. Feeding raw HTML into an LLM means:

  • Wasted tokens — You're paying for <div class="container mx-auto px-4"> instead of actual content
  • Worse results — The model has to parse through noise to find the signal
  • Context window limits — Large HTML files can exceed context limits entirely

Markdown solves this. It preserves structure (headings, lists, tables, code blocks) while discarding everything else. Our testing shows markdown uses roughly 67% fewer tokens than the equivalent HTML.
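To see the overhead concretely, compare a utility-class-heavy HTML snippet with the same content as markdown. The snippet and the resulting ratio are illustrative only; real pages vary:

```python
# A small, realistic HTML fragment with framework class names
html = (
    '<div class="container mx-auto px-4">'
    '<h2 class="text-xl font-bold mb-2">Installation</h2>'
    '<p class="text-gray-700">Run pip install to get started.</p>'
    "</div>"
)
# The same content as markdown
markdown = "## Installation\n\nRun pip install to get started."

# Fraction of bytes spent on markup rather than content
overhead = 1 - len(markdown) / len(html)
print(f"Markdown is {overhead:.0%} smaller than the equivalent HTML")
```

Even on this tiny fragment the markup dominates; on full pages with navigation, scripts, and styling the gap is far larger.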

Practical example

Extracting a documentation page:

| Format | Size | Tokens (GPT-4) | Usable content |
|---|---|---|---|
| Raw HTML | 47 KB | ~12,000 | ~30% |
| Plain text | 8 KB | ~2,100 | ~85% |
| Markdown | 11 KB | ~2,800 | ~95% |

Markdown is slightly larger than plain text, but it retains structure that plain text loses. An LLM reading markdown understands that ## Installation is a section heading — that context matters for accurate responses.

For more on converting existing HTML to markdown, see the HTML to Markdown guide.

RAG pipelines

For retrieval-augmented generation, markdown extraction is the first step:

  1. Extract — Convert URLs to markdown via API
  2. Chunk — Split markdown by headings (natural semantic boundaries)
  3. Embed — Generate embeddings for each chunk
  4. Store — Index in a vector database
  5. Retrieve — Query with user questions, retrieve relevant chunks

Markdown's heading structure gives you natural chunk boundaries. You don't need custom splitting logic — just split on ## headings and each chunk is a coherent section.
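The split-on-headings step can be sketched with a single regex (the function name and sample document are illustrative):

```python
import re

def chunk_markdown(markdown: str) -> list[str]:
    """Split markdown into sections at '## ' headings.

    Anything before the first '## ' heading becomes the first chunk.
    """
    # Zero-width split: each '## ' at the start of a line begins a new chunk
    chunks = re.split(r"(?m)^(?=## )", markdown)
    return [chunk.strip() for chunk in chunks if chunk.strip()]

doc = "# Guide\n\nIntro text.\n\n## Installation\n\npip install.\n\n## Usage\n\nRun it."
for chunk in chunk_markdown(doc):
    print(chunk.splitlines()[0])  # first line of each chunk is its heading
```

Each chunk keeps its heading attached, so the embedding for a section carries its own context.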

Get Started

Pick the method that fits your use case. For quick extractions, open the URL to Markdown converter and paste a URL. For programmatic access, grab an API key from app.leadmagic.io and start with the URL to Markdown API.

Every LeadMagic account includes API access. No separate plan, no add-ons.

Get your API key in 30 seconds

100 free credits. No credit card. API, CLI, and MCP — all from one key.