How to Scrape and Chunk Long-Form Articles for LLM Context (2026)

Scraping and chunking long-form articles for LLM context is one of those problems that looks trivial until your RAG pipeline starts hallucinating or returning irrelevant passages because the raw HTML you fed it was 60% navigation, ads, and cookie banners. Getting the extraction and chunking layers right makes the difference between a retrieval system that works and one that confidently answers from boilerplate footer text.

Why Raw HTML Makes Terrible LLM Context

Most web pages are architecturally hostile to text extraction. A typical news article contains 3-5x more non-content HTML than actual article text: nav bars, related-article widgets, comment sections, cookie consent modals, and structured-data scripts all end up in your scraped payload if you’re not aggressive about stripping them.

The specific failure modes matter:

  • Boilerplate contamination: chunks that embed nav text can score high on embedding similarity while carrying zero semantic value
  • Encoding artifacts: Windows-1252 characters mangled into UTF-8 produce gibberish that breaks tokenizers
  • Truncated sentences at chunk boundaries: splitting on fixed character counts cuts mid-sentence, destroying coherence for the LLM
  • Missing metadata: a chunk with no URL or heading path is nearly unrecoverable when you need to cite or re-fetch the source

If you’re scraping at scale, these issues compound. A corpus of 50,000 articles with 20% boilerplate contamination means 10,000 documents’ worth of chunks that actively degrade your retrieval quality. For news pipelines specifically, How to Scrape Google News Articles in 2026 covers the upstream fetch layer in detail, but once you have the HTML, the extraction problem is yours to solve.
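Of these four, encoding artifacts are the cheapest to catch early. A minimal sketch of a pre-chunking cleanup pass, assuming you add the ftfy library to the pipeline:

import ftfy

def clean_text(raw: str) -> str:
    # ftfy detects and repairs mojibake, e.g. Windows-1252 bytes
    # decoded as UTF-8: "â€™" becomes "’"
    return ftfy.fix_text(raw)

print(clean_text("The articleâ€™s text"))  # -> "The article’s text"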

Extraction Layer: Choosing Your Parser

Four tools dominate production extraction pipelines in 2026. They are not interchangeable.

trafilatura is the default choice for article extraction. It uses a combination of HTML structure heuristics and readability scoring to isolate the main content block, strips boilerplate aggressively, and returns clean text, markdown, or XML with optional metadata (author, date, title). Benchmark tests on CommonCrawl samples consistently show trafilatura outperforming alternatives on precision for news and blog content.
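A minimal sketch of the happy path (the URL is a placeholder; fetch_url, extract, and the with_metadata flag are part of trafilatura’s public API):

import trafilatura

# fetch_url performs the HTTP request and returns the raw HTML, or None on failure
html = trafilatura.fetch_url("https://example.com/some-article")  # placeholder URL
if html:
    # with_metadata=True includes title, author, and date alongside the text
    result = trafilatura.extract(html, output_format="json", with_metadata=True)
    print(result)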

newspaper4k (the maintained fork of newspaper3k) adds NLP-based keyword and summary extraction on top of content isolation. It’s useful if you need those fields, but it’s slower and more opinionated than trafilatura.

Jina Reader (r.jina.ai) is an API-based approach: pass a URL, get back clean markdown. The tradeoff is latency (300-800ms per request), cost at scale, and loss of control over the extraction logic. It works well for prototypes or low-volume pipelines where you don’t want to maintain a scraper fleet.
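The API surface is deliberately small: prefix the target URL with the reader endpoint and issue a GET. A sketch using requests (the target URL is a placeholder):

import requests

target = "https://example.com/some-article"  # placeholder URL
# Jina Reader returns the page as LLM-ready markdown in the response body
resp = requests.get(f"https://r.jina.ai/{target}", timeout=30)
resp.raise_for_status()
markdown = resp.text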

Raw BeautifulSoup is the wrong choice for article extraction unless you’re working with a single known site structure. It has no boilerplate-detection logic, so you’re writing and maintaining CSS selectors forever. For sites with dynamic selectors, pair it with a self-healing scraper using LLMs rather than trying to hand-maintain the extraction rules.

Chunking Strategies Compared

Once you have clean text, chunking is where most pipelines make their most consequential choices. The right strategy depends on your retrieval use case and the LLM’s context window.

| Strategy | Typical chunk size | Overlap | Best for | Main tradeoff |
|---|---|---|---|---|
| fixed-size (chars) | 500-1000 chars | 10-20% | fast indexing, homogeneous corpora | breaks sentences mid-thought |
| sentence-boundary | 3-8 sentences | 1-2 sentences | QA over dense prose | variable token counts complicate batching |
| heading-based | full H2/H3 section | none | structured docs, technical articles | sections vary from 50 to 2000+ tokens |
| semantic (embedding) | 256-512 tokens | none | high-precision retrieval | expensive, requires embedding at chunk time |
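For contrast with the heading-based implementation below, the first row is nearly a one-liner. A minimal sketch of fixed-size splitting with overlap (the size and overlap values are arbitrary defaults, not recommendations):

def fixed_size_chunks(text: str, size: int = 800, overlap: int = 150) -> list[str]:
    # each window starts (size - overlap) characters after the previous one,
    # so consecutive chunks share `overlap` characters of context
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

It is fast and trivially parallelizable, but it will happily cut a sentence, or a word, in half at every boundary.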

For long-form editorial content, heading-based chunking is the most semantically coherent approach. Each chunk maps to an author-intended unit of meaning, preserves context, and makes source citation obvious. The main risk is runaway sections, which you handle with a token ceiling and a recursive split fallback.

Heading-Based Chunking with tiktoken

The following approach converts trafilatura’s markdown output to structured chunks keyed by heading path, then enforces a 1024-token ceiling using tiktoken:

import trafilatura
import tiktoken
from dataclasses import dataclass

enc = tiktoken.get_encoding("cl100k_base")
MAX_TOKENS = 1024

@dataclass
class Chunk:
    url: str
    heading_path: str
    text: str
    token_count: int
    scraped_at: str

def extract_and_chunk(url: str, html: str, scraped_at: str) -> list[Chunk]:
    extracted = trafilatura.extract(
        html, include_tables=False, include_comments=False,
        output_format="markdown", url=url
    )
    if not extracted:
        return []

    chunks = []
    current_heading = "intro"
    current_lines: list[str] = []

    def flush(heading: str, lines: list[str]):
        text = "\n".join(lines).strip()
        if not text:
            return
        tokens = enc.encode(text)
        # split oversized sections into 1024-token windows
        for i in range(0, len(tokens), MAX_TOKENS):
            window = enc.decode(tokens[i:i + MAX_TOKENS])
            chunks.append(Chunk(
                url=url,
                heading_path=heading,
                text=window,
                token_count=min(MAX_TOKENS, len(tokens) - i),
                scraped_at=scraped_at,
            ))

    for line in extracted.splitlines():
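        # trafilatura's markdown output renders H2/H3 headings as "## "/"### "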
        if line.startswith("## ") or line.startswith("### "):
            flush(current_heading, current_lines)
            current_heading = line.lstrip("# ").strip()
            current_lines = []
        else:
            current_lines.append(line)

    flush(current_heading, current_lines)
    return chunks

This produces chunks with a heading_path field that tells downstream retrieval exactly where in the article the text came from, which is critical for citation and re-ranking.
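Usage is a single call per document (the file name, URL, and timestamp below are placeholders):

with open("article.html", encoding="utf-8") as f:  # placeholder file
    html = f.read()

for chunk in extract_and_chunk("https://example.com/post", html, "2026-01-15T09:30:00Z"):
    print(chunk.heading_path, chunk.token_count)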

Metadata Design for Downstream Retrieval

The chunk object itself is only half the story. What you store alongside the embedding determines how well your retrieval pipeline can filter, rank, and cite results.

At minimum, store these fields per chunk (a sketch that derives the last two follows the list):

  1. url — the canonical source URL, not a redirect chain
  2. heading_path — the H2/H3 breadcrumb (e.g. “chunking strategies > heading-based”)
  3. token_count — lets you pack multiple chunks into a context window without guessing
  4. scraped_at — ISO 8601 timestamp for freshness filtering
  5. domain — enables source-level filtering in retrieval
  6. word_count — quick quality signal; chunks under 30 words are usually boilerplate fragments
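Here is one way to derive the last two fields from the Chunk objects produced earlier; ChunkRecord and enrich are names invented for this sketch:

from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass
class ChunkRecord:
    url: str
    heading_path: str
    text: str
    token_count: int
    scraped_at: str
    domain: str
    word_count: int

def enrich(chunk: Chunk) -> ChunkRecord:
    return ChunkRecord(
        url=chunk.url,
        heading_path=chunk.heading_path,
        text=chunk.text,
        token_count=chunk.token_count,
        scraped_at=chunk.scraped_at,
        # netloc is the host portion of the URL, used for source-level filtering
        domain=urlparse(chunk.url).netloc,
        # chunks under ~30 words are usually boilerplate fragments
        word_count=len(chunk.text.split()),
    )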

Once you’re storing chunks with rich metadata, the vector DB choice matters: pgvector (a natural fit if you’re running Postgres already), Qdrant, and Weaviate each make different tradeoffs on query latency, filtering expressiveness, and operational overhead. For a full retrieval architecture built on a scraped corpus, the 2026 RAG architecture guide walks through the end-to-end design, including re-ranking and context assembly.

One underused pattern: validate chunk objects with pydantic before inserting them into the vector DB. This catches encoding issues and schema drift early. Auto-generating Pydantic models from LLM-inferred schemas is a practical approach when your source corpus is heterogeneous.
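A minimal sketch of that validation gate, assuming pydantic v2 (the mojibake check is a crude illustrative heuristic, not an exhaustive test):

from pydantic import BaseModel, field_validator

class ChunkModel(BaseModel):
    url: str
    heading_path: str
    text: str
    token_count: int
    scraped_at: str

    @field_validator("text")
    @classmethod
    def reject_mojibake(cls, v: str) -> str:
        # "â€" and "Ã©" are common signatures of Windows-1252 text mis-decoded as UTF-8
        if "â€" in v or "Ã©" in v:
            raise ValueError("likely encoding artifact in chunk text")
        return v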

Bottom Line

Use trafilatura for extraction, heading-based chunking with a 1024-token ceiling for editorial content, and always attach the URL plus heading path as chunk metadata. Fixed-size splitting is faster to implement but degrades retrieval quality enough to matter in production. dataresearchtools.com covers the full pipeline from scraping to retrieval: start here, then follow the internal links above for each layer.
