Scraping and chunking long-form articles for LLM context is one of those problems that looks trivial until your RAG pipeline starts hallucinating or returning irrelevant passages, because the raw HTML you fed it was 60% navigation, ads, and cookie banners. Getting the extraction and chunking layers right is the difference between a retrieval system that works and one that confidently answers from boilerplate footer text.
## Why Raw HTML Makes Terrible LLM Context
Most web pages are architecturally hostile to text extraction. A typical news article contains 3-5x more non-content HTML than actual article text: nav bars, related-article widgets, comment sections, cookie consent modals, and structured-data scripts all end up in your scraped payload if you're not aggressive about stripping them.
The specific failure modes matter:
- boilerplate contamination: chunks built from nav text score well on embedding similarity but carry zero semantic value
- encoding artifacts: UTF-8 bytes mis-decoded as Windows-1252 (or the reverse) produce mojibake that breaks tokenizers
- truncated sentences at chunk boundaries: splitting on fixed character counts cuts mid-sentence, destroying coherence for the LLM
- missing metadata: a chunk with no URL or heading path is nearly unrecoverable when you need to cite or re-fetch the source
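The encoding failure mode is easy to reproduce, and in the common Windows-1252/UTF-8 case it is also reversible. A minimal stdlib sketch (the sample strings are illustrative):

```python
good = "café, naïve"

# UTF-8 bytes wrongly decoded as Windows-1252 produce classic mojibake.
mangled = good.encode("utf-8").decode("windows-1252")
print(mangled)  # cafÃ©, naÃ¯ve

# Reverse the mis-decode: re-encode as Windows-1252, decode as UTF-8.
repaired = mangled.encode("windows-1252").decode("utf-8")
assert repaired == good
```

This round-trip only works when every mangled byte has a Windows-1252 mapping; for messier corpora, a dedicated repair library is the safer route.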
If you're scraping at scale, these issues compound: a corpus of 50,000 articles with 20% boilerplate contamination means 10,000 chunks that actively degrade your retrieval quality. For news pipelines specifically, How to Scrape Google News Articles in 2026 covers the upstream fetch layer in detail, but once you have the HTML, the extraction problem is yours to solve.
## Extraction Layer: Choosing Your Parser
Four tools dominate production extraction pipelines in 2026, and they are not interchangeable.
trafilatura is the default choice for article extraction. It uses a combination of HTML structure heuristics and readability scoring to isolate the main content block, strips boilerplate aggressively, and returns clean text, markdown, or XML with optional metadata (author, date, title). Benchmark tests on CommonCrawl samples consistently show trafilatura outperforming alternatives on precision for news and blog content.
newspaper4k (the maintained fork of newspaper3k) adds NLP-based keyword and summary extraction on top of content isolation. Useful if you need those fields, but it is slower and more opinionated than trafilatura.
Jina Reader (r.jina.ai/) is an API-based approach: pass a URL, get back clean markdown. The tradeoffs are latency (300-800 ms per request), cost at scale, and loss of control over the extraction logic. It works well for prototypes or low-volume pipelines where you don't want to maintain a scraper fleet.
Raw BeautifulSoup is the wrong choice for article extraction unless you're working with a single known site structure. It has no boilerplate-detection logic, so you're writing and maintaining CSS selectors forever. For sites with dynamic selectors, pair it with a self-healing scraper using LLMs rather than trying to hand-maintain the extraction rules.
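For the narrow case where BeautifulSoup is appropriate, a single site whose markup you control or monitor, the selector approach looks roughly like this; the `article` selector and the sample markup are stand-ins for whatever the target site actually serves:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <nav>Home | About | Subscribe</nav>
  <article><h1>Headline</h1><p>First paragraph.</p><p>Second paragraph.</p></article>
  <footer>Cookie notice</footer>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
node = soup.select_one("article")  # returns None the day the site redesigns
text = node.get_text("\n", strip=True) if node else ""
print(text)
```

The failure mode is silent: when the selector stops matching, you get empty strings, not exceptions, which is exactly why hand-maintained selectors don't scale across heterogeneous sources.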
## Chunking Strategies Compared
Once you have clean text, chunking is where most pipelines make their consequential choices. The right strategy depends on your retrieval use case and the LLM's context window.
| Strategy | Typical chunk size | Overlap | Best for | Main tradeoff |
|---|---|---|---|---|
| fixed-size (chars) | 500-1000 chars | 10-20% | fast indexing, homogeneous corpora | breaks sentences mid-thought |
| sentence-boundary | 3-8 sentences | 1-2 sentences | QA over dense prose | variable token counts complicate batching |
| heading-based | full H2/H3 section | none | structured docs, technical articles | sections vary from 50 to 2000+ tokens |
| semantic (embedding) | 256-512 tokens | none | high-precision retrieval | expensive, requires embedding at chunk time |
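The first two rows of the table can be sketched in a few lines of stdlib Python. Note how the fixed-size splitter cuts mid-sentence while the sentence-boundary splitter never does; the naive regex split is an assumption and will misfire on abbreviations like "e.g." in real prose:

```python
import re

text = (
    "Chunking looks simple. It is not. Fixed windows cut sentences in half. "
    "Sentence-aware splitting keeps each thought intact."
)

# Fixed-size: hard character windows, blind to sentence boundaries.
def fixed_chunks(text: str, size: int) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

# Sentence-boundary: group N sentences per chunk (naive regex split).
def sentence_chunks(text: str, per_chunk: int) -> list[str]:
    sents = re.split(r"(?<=[.!?])\s+", text.strip())
    return [" ".join(sents[i:i + per_chunk]) for i in range(0, len(sents), per_chunk)]

print(fixed_chunks(text, 60)[0])     # ends mid-sentence
print(sentence_chunks(text, 2)[0])   # ends on a sentence boundary
```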
For long-form editorial content, heading-based chunking is the most semantically coherent approach: each chunk maps to an author-intended unit of meaning, preserves context, and makes source citation obvious. The main risk is runaway sections, which you handle with a token ceiling and a recursive split fallback.
## Heading-Based Chunking with tiktoken
The following approach converts trafilatura's markdown output to structured chunks keyed by heading path, then enforces a 1024-token ceiling using tiktoken:

```python
import trafilatura
import tiktoken
from dataclasses import dataclass

enc = tiktoken.get_encoding("cl100k_base")
MAX_TOKENS = 1024


@dataclass
class Chunk:
    url: str
    heading_path: str
    text: str
    token_count: int
    scraped_at: str


def extract_and_chunk(url: str, html: str, scraped_at: str) -> list[Chunk]:
    extracted = trafilatura.extract(
        html, include_tables=False, include_comments=False,
        output_format="markdown", url=url,
    )
    if not extracted:
        return []

    chunks: list[Chunk] = []
    current_heading = "intro"
    current_lines: list[str] = []

    def flush(heading: str, lines: list[str]) -> None:
        text = "\n".join(lines).strip()
        if not text:
            return
        tokens = enc.encode(text)
        # Split oversized sections into 1024-token windows.
        for i in range(0, len(tokens), MAX_TOKENS):
            window = enc.decode(tokens[i:i + MAX_TOKENS])
            chunks.append(Chunk(
                url=url,
                heading_path=heading,
                text=window,
                token_count=min(MAX_TOKENS, len(tokens) - i),
                scraped_at=scraped_at,
            ))

    # Walk the markdown line by line, flushing a chunk at each H2/H3.
    for line in extracted.splitlines():
        if line.startswith("## ") or line.startswith("### "):
            flush(current_heading, current_lines)
            current_heading = line.lstrip("# ").strip()
            current_lines = []
        else:
            current_lines.append(line)
    flush(current_heading, current_lines)
    return chunks
```

This produces chunks with a `heading_path` field that tells downstream retrieval exactly where in the article the text came from, which is critical for citation and re-ranking.
## Metadata Design for Downstream Retrieval
The chunk object itself is only half the story: what you store alongside the embedding determines how well your retrieval pipeline can filter, rank, and cite results.
At minimum, store these fields per chunk:
- `url`: the canonical source URL, not a redirect chain
- `heading_path`: the H2/H3 breadcrumb (e.g. "chunking strategies > heading-based")
- `token_count`: lets you pack multiple chunks into a context window without guessing
- `scraped_at`: ISO 8601 timestamp for freshness filtering
- `domain`: enables source-level filtering in retrieval
- `word_count`: quick quality signal; chunks under 30 words are usually boilerplate fragments
Once you're storing chunks with rich metadata, the vector DB choice matters. Whether you reach for pgvector on Postgres you already run, or a dedicated engine like Qdrant or Weaviate, each makes different tradeoffs on query latency, filtering expressiveness, and operational overhead. For a full retrieval architecture built on a scraped corpus, the 2026 RAG architecture guide walks through the end-to-end design, including re-ranking and context assembly.
One underused pattern: validate chunk objects with pydantic before inserting them into the vector DB. This catches encoding issues and schema drift early. Auto-generating Pydantic models from LLM-inferred schemas is a practical approach when your source corpus is heterogeneous.
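A minimal sketch of that validation gate, assuming pydantic v2; the `ChunkRecord` model, its constraints, and the mojibake screen are illustrative choices, not a prescribed schema:

```python
from pydantic import BaseModel, Field, ValidationError, field_validator

class ChunkRecord(BaseModel):
    url: str = Field(pattern=r"^https?://")
    heading_path: str
    text: str = Field(min_length=1)
    token_count: int = Field(gt=0, le=1024)
    scraped_at: str  # ISO 8601; a datetime field would be stricter

    @field_validator("text")
    @classmethod
    def no_mojibake(cls, v: str) -> str:
        # Cheap screen for classic UTF-8-as-Windows-1252 artifacts.
        if "Ã" in v or "â€" in v:
            raise ValueError("probable encoding artifact in chunk text")
        return v

try:
    ChunkRecord(url="not-a-url", heading_path="intro",
                text="ok", token_count=10, scraped_at="2026-01-01T00:00:00Z")
except ValidationError:
    print("rejected before it reached the vector DB")
```

Rejecting at this boundary means schema drift surfaces as a loud exception in the ingest job, not as silently degraded retrieval weeks later.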
## Bottom line
Use trafilatura for extraction, heading-based chunking with a 1024-token ceiling for editorial content, and always attach the URL plus heading path as chunk metadata. Fixed-size splitting is faster to implement but degrades retrieval quality enough to matter in production. dataresearchtools.com covers the full pipeline from scraping to retrieval; start here, then follow the internal links above for each layer.
## Related guides on dataresearchtools.com
- Vector Search for Scraped Data: pgvector vs Qdrant vs Weaviate (2026)
- Building a RAG App on Scraped Documentation: 2026 Architecture
- LLM-Driven Scraping Schemas: Auto-Generating Pydantic Models (2026)
- Self-Healing Scrapers with LLMs: When Selectors Break (2026)
- Pillar: How to Scrape Google News Articles in 2026