How to Scrape arXiv Preprint Metadata and PDFs Programmatically (2026)

arXiv is one of the most valuable open-access repositories on the internet — 2.4 million preprints across physics, math, CS, biology, and economics, all publicly accessible. scraping arXiv preprint metadata and PDFs programmatically is a legitimate, well-supported workflow, and arXiv actively provides APIs to do it cleanly. the trap most engineers fall into is ignoring those APIs and hammering the HTML interface, which triggers rate limits and wastes engineering time.

the two official paths: OAI-PMH and the arXiv API

arXiv exposes two machine-readable interfaces you should know before touching a scraper:

  • OAI-PMH endpoint (https://export.arxiv.org/oai2): harvests bulk metadata using the Open Archives Initiative Protocol. best for full corpus pulls and incremental updates by date range.
  • arXiv API (https://export.arxiv.org/api/query): Atom XML search interface for targeted queries by keyword, author, category, or arxiv ID. supports pagination up to 2000 results per request.

both are free, require no authentication, and have documented rate limits (3 seconds between requests for the API). neither returns PDF binaries directly — for PDFs you construct URLs from the paper ID: https://arxiv.org/pdf/{arxiv_id}.
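
a minimal sketch of the ID-to-URL step, assuming the two ID shapes arXiv has used (new-style like 2101.00001v2 and old-style like hep-th/9901001). the helper names normalize_arxiv_id and pdf_url are hypothetical, and the regular expression is an approximation rather than an official grammar:

```python
import re

# Approximate patterns: new-style (2101.00001, optional vN) and
# old-style (hep-th/9901001 or math.GT/0309136, optional vN).
_ID_RE = re.compile(
    r"^(?:\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})(?:v\d+)?$"
)

def normalize_arxiv_id(raw: str) -> str:
    """Strip a URL or 'arXiv:' prefix and validate what is left."""
    candidate = raw.strip()
    for marker in ("/abs/", "/pdf/"):
        if marker in candidate:
            candidate = candidate.split(marker, 1)[1]
    if candidate.lower().startswith("arxiv:"):
        candidate = candidate[len("arxiv:"):]
    if candidate.endswith(".pdf"):
        candidate = candidate[:-4]
    if not _ID_RE.match(candidate):
        raise ValueError(f"not a recognizable arXiv ID: {raw!r}")
    return candidate

def pdf_url(raw: str) -> str:
    """Build the PDF URL from any of the accepted ID spellings."""
    return f"https://arxiv.org/pdf/{normalize_arxiv_id(raw)}"
```

validating the ID before building the URL catches malformed IDs locally instead of spending a request on them.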

extracting metadata with the arXiv API

the arXiv API is Atom XML, which sounds annoying but parses cleanly with feedparser or direct xml.etree calls.

import requests
import feedparser
import time

def fetch_arxiv(query, max_results=100, start=0):
    """Return one page of arXiv API results as a list of dicts."""
    base = "https://export.arxiv.org/api/query"
    params = {
        "search_query": query,
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending"
    }
    r = requests.get(base, params=params, timeout=30)
    r.raise_for_status()
    feed = feedparser.parse(r.text)
    results = []
    for entry in feed.entries:
        arxiv_id = entry.id.split("/abs/")[-1]  # e.g. 2101.00001v1
        results.append({
            "id": arxiv_id,
            "title": " ".join(entry.title.split()),  # titles can wrap across lines
            "authors": [a.name for a in entry.authors],
            "published": entry.published,
            "summary": entry.summary,
            "categories": [t.term for t in entry.get("tags", [])],
            "pdf_url": f"https://arxiv.org/pdf/{arxiv_id}"
        })
    time.sleep(3)  # stay within the documented 3-second spacing
    return results

# multi-word phrases need double quotes; without them the ti: prefix
# may apply only to the first word
papers = fetch_arxiv('cat:cs.LG AND ti:"large language model"', max_results=200)
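
for pipelines that would rather skip the feedparser dependency, the same fields can be read with the standard library. this is a sketch only: parse_atom is a hypothetical name, and SAMPLE is a hand-written stand-in for an API response, not a real arXiv record:

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace in ElementTree's {uri}tag form

def parse_atom(xml_text):
    """Extract id, title, authors, date and categories from an Atom feed."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.iter(f"{ATOM}entry"):
        abs_url = entry.findtext(f"{ATOM}id", default="")
        title = entry.findtext(f"{ATOM}title", default="")
        papers.append({
            "id": abs_url.split("/abs/")[-1],
            "title": " ".join(title.split()),  # collapse wrapped lines
            "authors": [a.findtext(f"{ATOM}name", default="")
                        for a in entry.findall(f"{ATOM}author")],
            "published": entry.findtext(f"{ATOM}published", default=""),
            "categories": [c.get("term")
                           for c in entry.findall(f"{ATOM}category")],
        })
    return papers

# Hand-written sample that mimics the feed's shape (illustrative data only).
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/2101.00001v1</id>
    <published>2021-01-01T00:00:00Z</published>
    <title>An  Example
      Title</title>
    <author><name>A. Author</name></author>
    <author><name>B. Author</name></author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom"/>
  </entry>
</feed>"""

papers_from_sample = parse_atom(SAMPLE)
```

with a live response, passing r.content (bytes) rather than r.text lets the XML parser handle the declared encoding itself.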

paginate by incrementing start in steps of max_results. for a full category harvest (cs.LG alone has 80k+ papers), the OAI-PMH path is faster since it supports resumptionToken continuation and avoids query limits.
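
the pagination loop might look like the sketch below. paginate is a hypothetical helper; it assumes a page-fetching callable with the same (query, max_results, start) signature as fetch_arxiv, and it treats a short or empty page as the end of the result set:

```python
def paginate(fetch_page, query, page_size=100, limit=1000):
    """Yield records page by page until a short page or the limit is hit."""
    start = 0
    while start < limit:
        wanted = min(page_size, limit - start)  # never ask past the limit
        page = fetch_page(query, max_results=wanted, start=start)
        if not page:
            break  # empty page: nothing more to read
        yield from page
        if len(page) < page_size:
            break  # short page: this was the last one
        start += len(page)
```

usage would be something like `for paper in paginate(fetch_arxiv, "cat:cs.LG", page_size=200, limit=2000): ...`. the 3-second spacing stays inside fetch_arxiv, so the loop itself needs no sleep.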

bulk harvesting with OAI-PMH

OAI-PMH is the right tool for ingesting all preprints in a subject area or building an incremental sync. the from and until parameters let you pull only papers submitted in a date window — useful for daily pipelines feeding an AI training dataset. if you are building something like a research corpus for fine-tuning, this pairs naturally with what the DRT team covered in How to Scrape PubMed Central Open Access Articles for AI Training (2026).

OAI-PMH returns Dublin Core or arXiv-specific metadata. the arXiv format (metadataPrefix=arXiv) includes author affiliations, MSC codes, and DOI links that the API Atom feed omits.

numbered steps for a clean OAI-PMH harvest:

  1. start with verb=ListRecords&metadataPrefix=arXiv&set=cs (replace cs with your subject set)
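  2. parse the <resumptionToken> element in the response
  3. fetch next batch: verb=ListRecords&resumptionToken=<token> (the token is sent on its own, without the set or metadataPrefix arguments)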
  4. sleep 20 seconds between requests (OAI-PMH is more sensitive to hammering than the API)
  5. write each batch to disk before fetching the next — a full cs harvest is ~500k records and will not fit in memory
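
the steps above can be sketched as a generator. harvest and fetch_xml are hypothetical names: fetch_xml stands for any callable that takes the query parameters and returns the response XML, so the HTTP call and the 20-second sleep live outside this function and are assumed, not shown:

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH 2.0 namespace

def harvest(fetch_xml, set_spec="cs", from_date=None, until_date=None):
    """Yield one list of <record> elements per OAI-PMH response."""
    params = {"verb": "ListRecords", "metadataPrefix": "arXiv", "set": set_spec}
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    while True:
        root = ET.fromstring(fetch_xml(params))
        error = root.find(f"{OAI}error")
        if error is not None:
            if error.get("code") == "noRecordsMatch":
                return  # empty window, not a failure
            raise RuntimeError(f"OAI-PMH error {error.get('code')}: {error.text}")
        yield root.findall(f".//{OAI}record")
        token = root.findtext(f".//{OAI}resumptionToken")
        if not token or not token.strip():
            return  # missing or empty token marks the final batch
        # Continuation requests carry the token and nothing else.
        params = {"verb": "ListRecords", "resumptionToken": token.strip()}
```

because each batch is yielded before the next request is made, the caller can write it to disk inside the for loop and never hold more than one response in memory.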

downloading PDFs at scale

PDF downloads work via direct HTTP GET on https://arxiv.org/pdf/{id}. arXiv serves these from S3 via CloudFront and tolerates polite bulk downloads provided you honor a few constraints.

| approach | throughput | risk | notes |
| --- | --- | --- | --- |
| sequential with 3s sleep | ~1 PDF/4s | low | safe for up to ~10k PDFs |
| concurrent (5 workers) | ~1.2 PDFs/s | medium | use a semaphore, not a bare thread pool |
| S3 bulk access (requester-pays) | very high | none | official path for full corpus, costs ~$0.09/GB egress |
| third-party mirror (Semantic Scholar) | API-rate-limited | low | metadata only, no raw PDFs |
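
one way to read the "semaphore, not a bare thread pool" note: let a semaphore cap the number of requests in flight, so the cap holds even if the pool is larger or shared with other work. a sketch under that assumption, where download_all is a hypothetical name and download_one is any per-ID download callable:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def download_all(ids, download_one, max_in_flight=5):
    """Call download_one(id) for every id, at most max_in_flight at a time."""
    gate = threading.BoundedSemaphore(max_in_flight)

    def guarded(arxiv_id):
        # The semaphore, not the pool size, limits concurrent requests.
        with gate:
            return arxiv_id, download_one(arxiv_id)

    # The pool is deliberately larger than the cap to show the gate at work.
    with ThreadPoolExecutor(max_workers=max_in_flight * 2) as pool:
        return dict(pool.map(guarded, ids))
```

the return value maps each ID to whatever download_one returned, so failed downloads (for example a None result) are easy to collect and retry later.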

for anything over 50k PDFs, request bulk S3 access through arXiv’s data access program. it is a form submission with a brief project description and is usually approved within a week. this is the same institutional data pathway relevant to registries like ClinicalTrials.gov, which has a similar bulk download process for researchers.

for targeted PDF pulls under 10k, the direct HTTP approach with requests and exponential backoff on 429s is fine:

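from pathlib import Path
import time, requests

def download_pdf(arxiv_id, dest_dir="pdfs"):
    url = f"https://arxiv.org/pdf/{arxiv_id}"
    dest = Path(dest_dir) / f"{arxiv_id.replace('/', '_')}.pdf"
    if dest.exists():
        return str(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)  # write_bytes fails if the directory is missing
    for attempt in range(4):
        r = requests.get(url, timeout=60, headers={"User-Agent": "ResearchBot/1.0 (+youremail@example.com)"})
        # a 200 can still carry an HTML page, so check for the PDF signature
        if r.status_code == 200 and r.content.startswith(b"%PDF"):
            dest.write_bytes(r.content)
            time.sleep(3)
            return str(dest)
        if r.status_code != 429:
            break  # HTML body or a non-retryable status: retrying will not help
        time.sleep(30 * 2 ** attempt)  # exponential backoff: 30s, 60s, 120s, 240s
    return None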

always include a descriptive User-Agent with contact info. arXiv staff have flagged bots by user-agent pattern before reaching rate-limit escalation.

parsing PDFs for structured data

raw PDF bytes are not immediately useful. the main extraction stack in 2026 looks like this:

  • PyMuPDF (fitz): fastest text extraction, preserves reading order better than pdfminer
  • marker-pdf: open-source, converts to clean markdown including equations and tables
  • Nougat (Meta): OCR-based, handles math notation accurately but is GPU-heavy
  • grobid: Java service that extracts structured references, author affiliations, and section boundaries — the best option if you care about citation graphs

for citation network analysis, grobid is worth the setup overhead. for plain text extraction to feed into embeddings or search indexes, PyMuPDF is the fastest path. the same extraction decisions apply when you are pulling structured content from other technical publishing platforms — the DRT team covered similar toolchain choices when looking at ProductHunt launch data and maker profiles.

for ML training pipelines, check the arXiv bulk data S3 bucket layout: source LaTeX files are also available and produce much cleaner text than PDF extraction — no layout artifacts, no OCR errors, equation markup intact.
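
a minimal sketch of the PyMuPDF path for plain-text extraction. extract_text relies on PyMuPDF's fitz.open and page.get_text("text"); tidy is a hypothetical cleanup helper, and its de-hyphenation rule is an assumption that will also merge genuine hyphenated compounds that happen to break at a line end:

```python
import re

def extract_text(pdf_path):
    """Return the plain text of every page, separated by form feeds."""
    import fitz  # PyMuPDF; imported lazily so tidy() works without it installed
    with fitz.open(pdf_path) as doc:
        return "\f".join(page.get_text("text") for page in doc)

def tidy(text):
    """Rejoin words hyphenated across line breaks and collapse whitespace."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    return re.sub(r"\s+", " ", text).strip()
```

for embeddings or a search index, tidy(extract_text(path)) gives a single-line string per paper. page boundaries are lost at that point, so keep the raw output if page numbers matter downstream.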

common errors and what they mean

  • 503 Service Unavailable from the API: back off 60 seconds, the API is under load
  • PDF response is HTML (not binary): arXiv is serving an abstract page instead of the PDF, usually because the paper is under embargo or the ID is malformed
  • OAI-PMH badResumptionToken: token expired (valid for 24 hours), restart the harvest from the last successfully saved from date
  • noRecordsMatch: the date range or set filter returned zero results — not an error, just an empty window
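
two of these checks are cheap to automate. the sketch below uses hypothetical helper names: looks_like_pdf tests for the %PDF- signature that PDF files start with, and retry_delay falls back to the 60-second wait mentioned above when the server sends no numeric Retry-After header:

```python
def looks_like_pdf(body: bytes) -> bool:
    """True if the payload starts with the PDF signature rather than HTML."""
    return body.lstrip()[:5] == b"%PDF-"

def retry_delay(status_code, headers, default=60):
    """Seconds to wait before retrying, or None if the status is not retryable."""
    if status_code not in (429, 503):
        return None
    value = str(headers.get("Retry-After", "")).strip()
    # Retry-After may also be an HTTP date; this sketch only handles seconds.
    return int(value) if value.isdigit() else default
```

checking the body before writing it to disk keeps abstract pages from being saved under a .pdf name and silently poisoning the corpus.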

the debugging loop here is the same as any structured data pipeline. if you have worked through Hacker News front page scraping or review platforms like those covered in the G2 and Capterra guide, the pattern is familiar: parse response shapes defensively, log every non-200, and checkpoint state to disk so restarts are cheap.
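
checkpointing can be as small as a JSON file written atomically. a sketch, with hypothetical function names and a state layout (last saved from date, current token) chosen for illustration:

```python
import json
import os
import tempfile
from pathlib import Path

def save_checkpoint(path, state):
    """Write state as JSON via a temp file so a crash never leaves a half-written file."""
    path = Path(path)
    fd, tmp = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)  # atomic rename over the old checkpoint

def load_checkpoint(path, default=None):
    """Return the saved state, or the default when no checkpoint exists yet."""
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))
    except FileNotFoundError:
        return {} if default is None else default
```

saving after every batch, for example save_checkpoint("harvest.json", {"last_from": "2026-01-01", "token": token}), means a restart resumes from the last completed batch instead of the beginning.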

Bottom line

use the arXiv API for targeted queries and OAI-PMH for bulk harvests: both are officially supported and, used within their documented rate limits, will not get your IP blocked. for PDF downloads at scale, the S3 bulk access program is the right tool and keeps the load off arXiv's web servers entirely. DRT covers this class of open-data, research-grade scraping target regularly because these sources are among the cleanest, highest-signal data pipelines available to engineers building AI and analytics products.
