arXiv is one of the most valuable open-access repositories on the internet — 2.4 million preprints across physics, math, CS, biology, and economics, all publicly accessible. scraping arXiv preprint metadata and PDFs programmatically is a legitimate, well-supported workflow, and arXiv actively provides APIs to do it cleanly. the trap most engineers fall into is ignoring those APIs and hammering the HTML interface, which triggers rate limits and wastes engineering time.
the two official paths: OAI-PMH and the arXiv API
arXiv exposes two machine-readable interfaces you should know before touching a scraper:
- OAI-PMH endpoint (https://export.arxiv.org/oai2): harvests bulk metadata using the Open Archives Initiative Protocol. best for full corpus pulls and incremental updates by date range.
- arXiv API (https://export.arxiv.org/api/query): Atom XML search interface for targeted queries by keyword, author, category, or arXiv ID. supports pagination up to 2000 results per request.
both are free, require no authentication, and have documented rate limits (3 seconds between requests for the API). neither returns PDF binaries directly — for PDFs you construct URLs from the paper ID: https://arxiv.org/pdf/{arxiv_id}.
extracting metadata with the arXiv API
the arXiv API is Atom XML, which sounds annoying but parses cleanly with feedparser or direct xml.etree calls.
```python
import time

import feedparser
import requests

API_URL = "https://export.arxiv.org/api/query"


def fetch_arxiv(query, max_results=100, start=0):
    """Fetch one page of arXiv API results as a list of dicts."""
    params = {
        "search_query": query,
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    r = requests.get(API_URL, params=params, timeout=30)
    r.raise_for_status()
    feed = feedparser.parse(r.text)
    results = []
    for entry in feed.entries:
        # entry.id looks like http://arxiv.org/abs/2101.00001v1
        arxiv_id = entry.id.split("/abs/")[-1]
        results.append({
            "id": arxiv_id,
            "title": entry.title,
            "authors": [a.name for a in entry.authors],
            "published": entry.published,
            "summary": entry.summary,
            "categories": [t.term for t in entry.get("tags", [])],
            "pdf_url": f"https://arxiv.org/pdf/{arxiv_id}",
        })
    # pause once per request (not once per entry) to respect the 3-second limit
    time.sleep(3)
    return results


# quote the phrase so the title filter matches it as a unit
papers = fetch_arxiv('cat:cs.LG AND ti:"large language model"', max_results=200)
```

paginate by incrementing start in steps of max_results. for a full category harvest (cs.LG alone has 80k+ papers), the OAI-PMH path is faster since it supports resumptionToken continuation and avoids query limits.
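the pagination described above can be sketched as a small generator. this is a minimal sketch, not part of any arXiv client: `paginate` and its `fetch_page(start, size)` callback are hypothetical names, and the stop condition (a page shorter than requested means the results are exhausted) is an assumption about how you want to end the loop.

```python
import time
from typing import Callable, Dict, Iterator, List


def paginate(fetch_page: Callable[[int, int], List[Dict]],
             page_size: int = 100,
             limit: int = 1000,
             delay: float = 3.0) -> Iterator[Dict]:
    """Yield results page by page until a short page or `limit` is reached.

    `fetch_page(start, size)` is a caller-supplied function (hypothetical
    here) that returns one page of results as a list.
    """
    start = 0
    while start < limit:
        size = min(page_size, limit - start)
        page = fetch_page(start, size)
        yield from page
        start += size
        if len(page) < size or start >= limit:
            break  # short page: nothing left to fetch
        time.sleep(delay)  # space out consecutive requests
```

to drive it with the `fetch_arxiv` function from this section, something like `paginate(lambda s, n: fetch_arxiv("cat:cs.LG", max_results=n, start=s), delay=0)` should work, with `delay=0` because `fetch_arxiv` already sleeps after each request.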
bulk harvesting with OAI-PMH
OAI-PMH is the right tool for ingesting all preprints in a subject area or building an incremental sync. the from and until parameters filter on each record's datestamp, so you pull only records added or updated in a date window, which is useful for daily pipelines feeding an AI training dataset. if you are building something like a research corpus for fine-tuning, this pairs naturally with what the DRT team covered in How to Scrape PubMed Central Open Access Articles for AI Training (2026).
OAI-PMH returns Dublin Core or arXiv-specific metadata. the arXiv format (metadataPrefix=arXiv) is the richer of the two: it carries author affiliations, MSC codes, and DOI links as separate structured fields.
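to make the arXiv metadata format concrete, here is a minimal parsing sketch using only the standard library. it assumes the payload is a single `arXiv` element in the `http://arxiv.org/OAI/arXiv/` namespace with child elements such as `id`, `authors/author`, `categories`, `msc-class`, and `doi`; check those names against a live response before relying on them. `parse_arxiv_record` is a hypothetical helper, and the sample record is invented for illustration, not real data.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for metadataPrefix=arXiv payloads.
ARXIV_NS = {"a": "http://arxiv.org/OAI/arXiv/"}


def parse_arxiv_record(xml_text: str) -> dict:
    """Extract a few fields from one <arXiv> metadata element."""
    root = ET.fromstring(xml_text)

    def text(tag):
        node = root.find(f"a:{tag}", ARXIV_NS)
        return node.text.strip() if node is not None and node.text else None

    authors = []
    for author in root.findall("a:authors/a:author", ARXIV_NS):
        forenames = author.findtext("a:forenames", default="", namespaces=ARXIV_NS)
        keyname = author.findtext("a:keyname", default="", namespaces=ARXIV_NS)
        authors.append({
            "name": f"{forenames} {keyname}".strip(),
            "affiliations": [(n.text or "").strip()
                             for n in author.findall("a:affiliation", ARXIV_NS)],
        })
    return {
        "id": text("id"),
        "title": text("title"),
        "authors": authors,
        "categories": (text("categories") or "").split(),
        "msc_class": text("msc-class"),  # optional, often absent
        "doi": text("doi"),              # optional, often absent
    }


# Invented sample record, for illustration only.
SAMPLE = """<arXiv xmlns="http://arxiv.org/OAI/arXiv/">
  <id>0000.00000</id>
  <authors>
    <author><keyname>Doe</keyname><forenames>Jane</forenames>
      <affiliation>Example University</affiliation></author>
  </authors>
  <title>An illustrative title</title>
  <categories>cs.LG stat.ML</categories>
  <msc-class>68T05</msc-class>
  <doi>10.0000/example</doi>
</arXiv>"""

record = parse_arxiv_record(SAMPLE)
```

optional fields come back as `None` when missing, so downstream code can treat affiliations, MSC codes, and DOIs as best-effort rather than guaranteed.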
numbered steps for a clean OAI-PMH harvest:
1. start with verb=ListRecords&metadataPrefix=arXiv&set=cs (replace cs with your subject set)
2. parse the resumptionToken element in the response
3. fetch next batch: verb=ListRecords&resumptionToken= followed by the token value
4. sleep 20 seconds between requests (OAI-PMH is more sensitive to hammering than the API)
5. write each batch to disk before fetching the next: a full cs harvest is ~500k records and will not fit in memory
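the steps above can be sketched as a short harvest loop. this is a sketch under stated assumptions, not an official client: it uses the endpoint URL given earlier in this article, the standard OAI-PMH 2.0 namespace, and hypothetical helper names (`next_token`, `build_url`, `harvest`). it does no error handling, so a failed request simply raises.

```python
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Optional

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
BASE_URL = "https://export.arxiv.org/oai2"  # endpoint as given in this article


def next_token(xml_text: str) -> Optional[str]:
    """Return the resumptionToken text, or None when the list is complete."""
    root = ET.fromstring(xml_text)
    node = root.find(".//oai:resumptionToken", OAI_NS)
    if node is None or not (node.text or "").strip():
        return None  # missing or empty token marks the final batch
    return node.text.strip()


def build_url(token: Optional[str], set_spec: str = "cs") -> str:
    """First request carries metadataPrefix and set; later ones only the token."""
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "arXiv", "set": set_spec}
    return f"{BASE_URL}?{urllib.parse.urlencode(params)}"


def harvest(out_dir: str = "oai_batches", set_spec: str = "cs",
            delay: float = 20.0) -> int:
    """Write each response to disk, then follow the token; returns batch count."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    token, batch = None, 0
    while True:
        with urllib.request.urlopen(build_url(token, set_spec), timeout=60) as resp:
            xml_text = resp.read().decode("utf-8")
        # persist before parsing further so a crash never loses a fetched batch
        (out / f"batch_{batch:06d}.xml").write_text(xml_text, encoding="utf-8")
        batch += 1
        token = next_token(xml_text)
        if token is None:
            return batch
        time.sleep(delay)
```

saving the raw XML first and parsing later keeps the network loop simple and makes restarts cheap, since already-written batches never need to be fetched again.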
downloading PDFs at scale
PDF downloads work via direct HTTP GET on https://arxiv.org/pdf/{id}. arXiv serves these through a CDN and tolerates polite bulk downloads provided you honor a few constraints.
| approach | throughput | risk | notes |
|---|---|---|---|
| sequential with 3s sleep | ~1 PDF/4s | low | safe for up to ~10k PDFs |
| concurrent (5 workers) | ~1.2 PDFs/s | medium | use a semaphore, not a bare thread pool |
| S3 bulk access (requester-pays) | very high | none | official path for full corpus, costs ~$0.09/GB egress |
| third-party mirror (Semantic Scholar) | API-rate-limited | low | metadata only, no raw PDFs |
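for the concurrent row in the table, here is one way the semaphore idea could look. it is a minimal sketch: `download_all` is a hypothetical wrapper, `download_one` stands for whatever single-PDF function you already have, and the default of 5 in-flight requests mirrors the table rather than any limit published by arXiv.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def download_all(ids: Iterable[str],
                 download_one: Callable[[str], Optional[str]],
                 max_in_flight: int = 5,
                 pool_size: int = 16) -> Dict[str, Optional[str]]:
    """Run `download_one` over `ids` with at most `max_in_flight` at a time."""
    gate = threading.BoundedSemaphore(max_in_flight)

    def guarded(arxiv_id: str) -> Optional[str]:
        # the semaphore, not the pool size, decides how many requests overlap
        with gate:
            return download_one(arxiv_id)

    id_list = list(ids)
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # pool.map preserves input order, so zip pairs each id with its result
        return dict(zip(id_list, pool.map(guarded, id_list)))
```

keeping the limit in a semaphore means the pool can be resized or shared with other work without changing how hard arXiv gets hit. an exception raised inside `download_one` propagates out of `download_all`, so wrap the callback if you would rather record failures and continue.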
for anything over 50k PDFs, request bulk S3 access through arXiv’s data access program. it is a form submission with a brief project description and is usually approved within a week. this is the same institutional data pathway relevant to registries like ClinicalTrials.gov, which has a similar bulk download process for researchers.
for targeted PDF pulls under 10k, the direct HTTP approach with requests and exponential backoff on 429s is fine:
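```python
import time
from pathlib import Path

import requests

HEADERS = {"User-Agent": "ResearchBot/1.0 (+youremail@example.com)"}


def download_pdf(arxiv_id, dest_dir="pdfs"):
    """Download one PDF and return its local path, or None on failure."""
    url = f"https://arxiv.org/pdf/{arxiv_id}"
    dest = Path(dest_dir) / f"{arxiv_id.replace('/', '_')}.pdf"
    if dest.exists():
        return str(dest)  # already downloaded, skip the request
    dest.parent.mkdir(parents=True, exist_ok=True)
    for attempt in range(4):
        r = requests.get(url, timeout=60, headers=HEADERS)
        if r.status_code == 200:
            dest.write_bytes(r.content)
            time.sleep(3)  # courtesy delay before the caller's next request
            return str(dest)
        if r.status_code != 429:
            break  # other statuses are not retried here
        time.sleep(30 * 2 ** attempt)  # exponential backoff: 30s, 60s, 120s, 240s
    return None
```

always include a descriptive User-Agent with contact info. arXiv staff have flagged bots by user-agent pattern before reaching rate-limit escalation.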
parsing PDFs for structured data
raw PDF bytes are not immediately useful. the main extraction stack in 2026 looks like this:
- PyMuPDF (fitz): fastest text extraction, preserves reading order better than pdfminer
- marker-pdf: open-source, converts to clean markdown including equations and tables
- Nougat (Meta): OCR-based, handles math notation accurately but is GPU-heavy
- grobid: Java service that extracts structured references, author affiliations, and section boundaries — the best option if you care about citation graphs
for citation network analysis, grobid is worth the setup overhead. for plain text extraction to feed into embeddings or search indexes, PyMuPDF is the fastest path. the same extraction decisions apply when you are pulling structured content from other technical publishing platforms — the DRT team covered similar toolchain choices when looking at ProductHunt launch data and maker profiles.
for ML training pipelines, check the arXiv bulk data S3 bucket layout: source LaTeX files are also available and produce much cleaner text than PDF extraction — no layout artifacts, no OCR errors, equation markup intact.
common errors and what they mean
- 503 Service Unavailable from the API: back off 60 seconds, the API is under load
- PDF response is HTML (not binary): arXiv is serving an abstract page instead of the PDF, usually because the paper is under embargo or the ID is malformed
- OAI-PMH badResumptionToken: token expired (valid for 24 hours), restart the harvest from the last successfully saved from date
- noRecordsMatch: the date range or set filter returned zero results; not an error, just an empty window
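the cases above can be detected with two small checks. this is a sketch with hypothetical helper names: `classify_pdf_response` relies on the fact that PDF files begin with the bytes `%PDF-`, and `oai_error_code` assumes the standard OAI-PMH 2.0 layout where a failure appears as an `error` element with a `code` attribute directly under the root.

```python
import xml.etree.ElementTree as ET
from typing import Optional

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def classify_pdf_response(status_code: int, content_type: str, body: bytes) -> str:
    """Label a PDF download response so the caller can decide what to do."""
    if status_code == 429:
        return "rate_limited"
    if status_code == 503:
        return "unavailable"
    if status_code != 200:
        return "error"
    if body[:5] == b"%PDF-":
        return "pdf"  # real PDF files start with this magic prefix
    if "html" in content_type.lower() or body.lstrip()[:1] == b"<":
        return "html"  # an HTML page was served instead of the PDF
    return "unknown"


def oai_error_code(xml_text: str) -> Optional[str]:
    """Return the OAI-PMH error code (e.g. badResumptionToken), or None."""
    root = ET.fromstring(xml_text)
    node = root.find("oai:error", OAI_NS)
    return node.get("code") if node is not None else None
```

checking the body prefix rather than trusting the status code alone catches the "200 but HTML" case, which would otherwise be written to disk as a corrupt .pdf file.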
the debugging loop here is the same as any structured data pipeline. if you have worked through Hacker News front page scraping or review platforms like those covered in the G2 and Capterra guide, the pattern is familiar: parse response shapes defensively, log every non-200, and checkpoint state to disk so restarts are cheap.
Bottom line
use the arXiv API for targeted queries and OAI-PMH for bulk harvests — both are officially supported, rate-limit-tolerant when used correctly, and will not get your IP blocked. for PDF downloads at scale, the S3 bulk access program is the right tool and avoids the HTTP layer entirely. DRT covers this class of open-data, research-grade scraping target regularly because they represent the cleanest, highest-signal data pipelines available to engineers building AI and analytics products.
Related guides on dataresearchtools.com
- How to Scrape PubMed Central Open Access Articles for AI Training (2026)
- How to Scrape ClinicalTrials.gov Public Trial Registry (2026)
- How to Scrape ProductHunt Launch Data and Maker Profiles (2026)
- How to Scrape Hacker News Front Page Data Without API Limits (2026)
- Pillar: How to Scrape G2.com and Capterra SaaS Reviews Programmatically