Finding the right publication date on a modern web page is still more annoying than it should be. That’s exactly why htmldate keeps earning a spot in real pipelines in 2026. If your scraper, archive job, or content quality system needs a reliable page date without blindly trusting a single flaky meta tag, htmldate is one of the few Python libraries that actually does the job well. It’s small, focused, and honest about the problem: page dates are messy, often duplicated, sometimes overwritten by CMS themes, and frequently confused with update timestamps, comment dates, or image URLs.
What htmldate does well
htmldate isn’t an article extractor, crawler, or browser automation layer. It does one job: pull a likely publication or update date from a web page. And it does that job with more care than the average scraper utility. The current PyPI release is 1.9.4 (published November 4, 2025), supports Python 3.8+, and the API is refreshingly simple.
The core value is the layered extraction strategy:
- checks header markup like `meta` tags, Open Graph properties, and CMS-specific hints
- searches structural HTML elements like `time` and `abbr`
- falls back to text-pattern heuristics when markup is incomplete
- can prefer original publication dates or most recent update dates
That focus makes it a strong companion to extraction libraries, not a replacement. If your pipeline also needs body text, metadata, and canonical cleanup, pair it with something like Trafilatura Review 2026: Best Article Extraction Library Tested, then let htmldate handle the date-specific logic that generic parsers get wrong.
Where it fits in a real scraping stack
The biggest mistake teams make is expecting one library to crawl, render, extract article text, normalize metadata, and identify the correct publication date. That rarely holds up on large mixed-domain datasets. htmldate works best as a narrow component inside a broader architecture.
In practice, it fits into a pipeline like this:
- fetch HTML with your own client, retry logic, proxy rules, and timeouts
- pass raw HTML or an lxml tree to `htmldate`
- store both the chosen date and confidence context (source field or mode)
- compare against URL-derived or sitemap-derived dates when available
- flag conflicts for QA instead of silently overwriting
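The cross-check and flagging steps can be sketched in plain Python. `url_date`, `reconcile`, and the two-day tolerance window are illustrative names and choices, not htmldate APIs:

```python
import re
from datetime import date

def url_date(url):
    """Pull a /YYYY/MM/DD/ segment out of a URL path, if present."""
    m = re.search(r"/(20\d{2})/(\d{1,2})/(\d{1,2})/", url)
    return date(int(m.group(1)), int(m.group(2)), int(m.group(3))) if m else None

def reconcile(extracted, url, tolerance_days=2):
    """Compare an extracted ISO date with a URL-derived hint; flag conflicts for QA."""
    hint = url_date(url)
    parsed = date.fromisoformat(extracted) if extracted else None
    conflict = (
        parsed is not None
        and hint is not None
        and abs((parsed - hint).days) > tolerance_days
    )
    return {"date": extracted, "url_hint": hint, "needs_review": conflict}

# one day apart: inside the tolerance window, so no QA flag
print(reconcile("2024-03-15", "https://example.com/2024/03/14/story-123/"))
```

The point isn't the exact heuristic; it's that disagreements get surfaced instead of silently overwritten.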
That scales better than letting one monolithic library guess everything. And if you’re already building crawl infrastructure, Scrapy 2.12 Review 2026: What’s New and When to Pick Scrapy is still the more important decision, because scheduling, retries, robots handling, and deduplication all come before date extraction does.
A realistic example
```python
import requests
from htmldate import find_date

url = "https://example.com/news/story-123"
html = requests.get(url, timeout=20).text

# original publication date, bounded to a plausible range
published = find_date(
    html,
    url=url,
    original_date=True,
    extensive_search=True,
    min_date="2000-01-01",
    max_date="2026-12-31",
)

# most recent update date for the same page
updated = find_date(
    html,
    url=url,
    original_date=False,
    extensive_search=True,
)

print({"published": published, "updated": updated})
```

This pattern is better than handing htmldate a URL directly in serious pipelines. You keep control over headers, cookies, regional routing, and anti-bot behavior, while still using the library’s extraction logic.
Accuracy, speed, and what the benchmark actually means
The project publishes benchmark numbers that look better than most competing date extractors. On its 1,000-page evaluation set, the library reports strong results for both fast and extensive modes. Worth noting, but read them carefully: these numbers come from the project’s own test set, not an independent 2026 benchmark on today’s JavaScript-heavy news stack.
| Tool / mode | Precision | Recall | F-score | Relative time |
|---|---|---|---|---|
| htmldate fast | 0.883 | 0.924 | 0.903 | 1x |
| htmldate extensive | 0.870 | 0.993 | 0.928 | 1.7x |
| goose3 | 0.869 | 0.532 | 0.660 | 15x |
| newspaper3k | 0.769 | 0.667 | 0.715 | 15x |
| news-please | 0.801 | 0.768 | 0.784 | 34x |
Two things stand out. htmldate is unusually recall-friendly with `extensive_search=True`, which matters when you care more about not missing a date than shaving milliseconds. The runtime gap versus heavier extraction libraries is also big enough to matter at scale: 15x slower for goose3 adds up fast on a 10M-page dataset.
The benchmark is directionally credible, but it’s not the whole story. htmldate performs best on publisher pages that expose meaningful HTML or predictable text patterns. It gets shakier on:
- client-rendered pages with late-injected metadata
- commerce or docs pages with many competing dates
- forum threads where reply timestamps outnumber publication cues
- pages that expose only relative timestamps like “3 hours ago”
If your workload is mostly article pages, the tradeoff is fine. Mixed web junk? You need validation layers on top.
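A validation layer of the kind meant here can be as small as a plausibility gate before anything enters the dataset; the bounds and rules below are illustrative:

```python
from datetime import date, timedelta

def plausible(extracted, earliest=date(2000, 1, 1)):
    """Reject missing, malformed, pre-web, or future-dated results."""
    if not extracted:
        return False
    try:
        d = date.fromisoformat(extracted)
    except ValueError:
        return False
    # a date in the future is almost always a parsing artifact
    return earliest <= d <= date.today() + timedelta(days=1)

print([plausible(x) for x in ["2024-03-15", "1997-01-01", "2999-01-01", None, "junk"]])
# [True, False, False, False, False]
```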
The tradeoffs nobody should ignore
htmldate is reliable, not magical. It’s strongest when pages contain date evidence somewhere in the HTML payload. Weaker when the page needs JavaScript execution first, or when the “correct” date is actually a business decision rather than a parsing task.
The main tradeoffs:
- `original_date=True` is useful, but it’s not guaranteed to mean the editorial first publish date on every CMS
- `extensive_search=True` improves recall, but can pull in noisy text dates from timelines, footers, or embedded widgets
- URL hints help, but URL dates are often archive paths, not publication truth
- update-heavy publishers sometimes expose both `datePublished` and `dateModified`, and site quality varies wildly
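When a publisher does expose both fields in JSON-LD, reading them directly is cheap, with htmldate as the fallback for pages that don’t. The snippet and helper below are illustrative stdlib code, not part of htmldate:

```python
import json
import re

# Hypothetical publisher markup exposing both dates
html = """<script type="application/ld+json">
{"@type": "NewsArticle",
 "datePublished": "2024-03-15T08:00:00Z",
 "dateModified": "2024-06-01T12:30:00Z"}
</script>"""

def jsonld_dates(page):
    """Collect datePublished/dateModified from JSON-LD blocks on a page."""
    out = {}
    for block in re.findall(
        r'<script type="application/ld\+json">(.*?)</script>', page, re.S
    ):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue
        if not isinstance(data, dict):
            continue
        for key in ("datePublished", "dateModified"):
            if key in data:
                out[key] = data[key][:10]  # keep the ISO date part
    return out

print(jsonld_dates(html))
# {'datePublished': '2024-03-15', 'dateModified': '2024-06-01'}
```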
This is also why comparing date extraction inside full article parsers can be misleading. A package can be decent at pulling text and mediocre at dates, or the reverse. If you’re weighing all-in-one extractors, Newspaper3k vs Trafilatura vs Goose3 for Article Extraction (2026) gives the broader view, but I’d still trust a dedicated date layer over bundled date guesses for production QA.
Best use cases, and when to skip it
Use htmldate when you need a lightweight, scriptable date extractor that plugs into an existing crawler. It’s especially good for research datasets, media monitoring, content intelligence, and archival cleanup jobs where you’re processing large volumes of HTML and need consistent ISO-style dates.
Strong pick for:
- news and blog article archives
- post-publication metadata normalization
- historical page dating in research pipelines
- validation of publisher feeds or sitemap dates
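The sitemap-validation case is mostly stdlib XML work: pull `lastmod` values and compare them against extracted dates. The sitemap fragment here is invented:

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap fragment; real ones come from the site's sitemap.xml
sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2024-03-15</lastmod></url>
  <url><loc>https://example.com/b</loc><lastmod>2024-06-01T12:00:00+00:00</lastmod></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_dates(xml_text):
    """Map each URL to its lastmod date (date part only) for cross-checking."""
    root = ET.fromstring(xml_text)
    return {
        url.findtext("sm:loc", namespaces=NS):
            (url.findtext("sm:lastmod", namespaces=NS) or "")[:10]
        for url in root.findall("sm:url", NS)
    }

print(sitemap_dates(sitemap))
# {'https://example.com/a': '2024-03-15', 'https://example.com/b': '2024-06-01'}
```

A mismatch between `lastmod` and the extracted date isn't automatically an error, but it's exactly the kind of conflict worth flagging for QA.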
Skip it (or don’t rely on it alone) when your bottleneck is elsewhere. If the hard part is crawling at scale, rendering JavaScript, or navigating anti-bot defenses, the real decision is your scraper framework, not your date parser. That’s where Open Source Scraper Showdown 2026: Scrapy vs Crawlee vs Colly vs Crawl4AI is more relevant than any library-level review.
The most practical pattern: treat htmldate as a specialist. Crawler handles retrieval. Renderer handles JS when necessary. Text extractor handles article body. htmldate handles date inference. QA rules reconcile conflicts. Less fashionable than “one tool does everything,” but far easier to debug when something breaks at 2am.
Bottom line
htmldate does its narrow job well and it’s still the first thing I’d reach for when a pipeline needs reliable publication dates from HTML. It won’t solve JavaScript rendering or messy publisher logic, and it’s not meant to. For teams building production scrapers, DRT covers the broader ecosystem of extraction libraries and crawl tools, which is where most of the actual decisions live.