PDF scraping is still one of the messiest data extraction problems in 2026, and the library you pick determines whether you spend two hours or two days cleaning output. PyMuPDF, pdfplumber, and Tabula cover most real-world use cases, but they solve different problems, and using the wrong one on the wrong document type is a silent time sink that shows up as corrupted tables or missing rows downstream.
What each library actually does
PyMuPDF (fitz) is a Python binding for MuPDF, a C library that renders PDFs at the object level. It reads the raw PDF content stream directly, which gives it speed and accuracy on text-heavy documents. It handles images, annotations, and embedded fonts without extra dependencies. Its structured output modes (`"blocks"`, `"dict"`, `"rawdict"`), all available in 1.24 and later releases, make clean text extraction significantly faster to post-process.
pdfplumber wraps pdfminer.six and adds spatial reasoning on top. It understands where characters sit on the page in coordinate space, which makes it the best tool for extracting tables from documents that weren’t generated by a spreadsheet or BI tool. The tradeoff is speed: pdfplumber is roughly 3-5x slower than PyMuPDF on long documents.
Tabula (tabula-py, the Python wrapper around tabula-java) targets one problem: tables in structured PDFs. It uses a Java runtime under the hood, which adds startup overhead and a JVM dependency. That’s a real infrastructure cost if you’re building a containerized pipeline. Tabula’s lattice mode (for PDFs with visible grid lines) is excellent; its stream mode (for whitespace-separated columns) is inconsistent.
If you’re already familiar with the HTML parsing landscape, this is a similar decision to the one covered in Best Python HTML Parsers 2026: lxml vs BeautifulSoup vs Selectolax — a fast low-level tool, a convenient mid-level tool, and a specialized one for structured data.
Head-to-head comparison
| Library | Text extraction | Table extraction | Speed | JVM required | Scanned PDF support |
|---|---|---|---|---|---|
| PyMuPDF | Excellent | Limited | Fast (C) | No | No (needs OCR) |
| pdfplumber | Good | Excellent | Moderate | No | No (needs OCR) |
| Tabula | Poor | Good-Excellent | Slow (JVM) | Yes | No (needs OCR) |
Speed note: on a 200-page dense PDF, PyMuPDF completes text extraction in under 2 seconds on a standard EC2 instance. pdfplumber takes 8-12 seconds on the same document. Tabula can take 20-30 seconds including JVM boot.
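Timings like these depend heavily on the document and the machine, so it is worth measuring on your own files. The helper below is a minimal sketch of such a measurement; `time_extractor` is a hypothetical name introduced here, and it assumes each library is wrapped in a function that takes a file path.

```python
import time
from statistics import median
from typing import Callable


def time_extractor(extract: Callable[[str], object], path: str, runs: int = 3) -> float:
    """Return the median wall-clock seconds taken by extract(path) over `runs` calls."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        extract(path)  # the result is discarded; only the elapsed time matters
        timings.append(time.perf_counter() - start)
    return median(timings)
```

Calling `time_extractor(my_extract_function, "report.pdf")` once per library gives comparable numbers; using the median keeps a single slow first run (cold disk cache, JVM boot) from dominating the result.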
For scanned PDFs where text is embedded in images, none of these libraries work natively. You need an OCR layer first; Image OCR for Web Scraping in 2026: Tesseract vs Google Vision vs Claude covers that step in detail.
Extracting text with PyMuPDF
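PyMuPDF is the right starting point for anything that’s programmatically generated: reports, invoices, filings, documentation. Here’s a minimal extraction pattern:

```python
import fitz  # PyMuPDF


def extract_text(path: str) -> str:
    """Return the document text, with a blank line between pages."""
    pages = []
    with fitz.open(path) as doc:  # closes the file even if extraction fails
        for page in doc:
            # Each block is (x0, y0, x1, y1, text, block_no, block_type).
            blocks = page.get_text("blocks")
            text_blocks = [b for b in blocks if b[6] == 0]  # skip image blocks
            # Order top-to-bottom, then left-to-right.
            text_blocks.sort(key=lambda b: (b[1], b[0]))
            pages.append("\n".join(b[4] for b in text_blocks))
    return "\n\n".join(pages)
```

Sorting blocks by (y0, x0) before joining gives a top-to-bottom, left-to-right order that does not depend on the order in which the PDF producer wrote the content stream. On true multi-column layouts this sort interleaves the columns, so group blocks by column first and sort by y0 within each column. The default `get_text("text")` emits text in content-stream order, which can scramble reading order silently, a common source of output that’s hard to debug at scale. For structured output, `get_text("dict")` returns span-level metadata including font size and bounding box (`"rawdict"` goes down to individual characters), which is useful for stripping headers and footers by position.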
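Sorting by vertical position alone interleaves the columns on a multi-column page. The sketch below shows one way to order blocks column by column instead. It assumes equal-width columns and works on plain tuples shaped like the output of `get_text("blocks")`; `order_blocks_by_column` and the `n_columns` parameter are hypothetical names introduced for illustration, not part of PyMuPDF.

```python
from typing import List, Sequence


def order_blocks_by_column(
    blocks: Sequence[tuple], page_width: float, n_columns: int = 2
) -> List[tuple]:
    """Order text blocks column by column, top to bottom within each column.

    Each block is (x0, y0, x1, y1, text, block_no, block_type).
    Assumes n_columns equal-width columns across the page.
    """
    column_width = page_width / n_columns
    text_blocks = [b for b in blocks if b[6] == 0]  # block_type 0 means text

    def sort_key(block: tuple):
        x_center = (block[0] + block[2]) / 2
        # Clamp so a block touching the right edge stays in the last column.
        column = min(int(x_center // column_width), n_columns - 1)
        return (column, block[1], block[0])

    return sorted(text_blocks, key=sort_key)
```

With PyMuPDF, the page width is available as `page.rect.width`. Blocks that span the full page width, such as a title above two columns, are assigned to whichever column contains their center, so they may need separate handling.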
Extracting tables with pdfplumber
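pdfplumber shines on annual reports, government data releases, and any PDF where tables are laid out spatially without explicit grid lines. By default the `extract_tables()` method builds cells from the ruling lines drawn on the page; with the `"text"` strategy it infers cell boundaries from the positions of words instead:

```python
import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        # One list of rows per detected table; each row is a list of cell strings (or None).
        for table in page.extract_tables():
            for row in table:
                print(row)
```

A few things to tune through `table_settings` when the default settings produce merged cells or missing rows:

- `snap_tolerance`: lines that lie within this distance of each other are snapped to the same position (default 3)
- `join_tolerance`: line segments on the same line whose ends are within this distance are joined into one segment (default 3)
- `edge_min_length`: edges shorter than this are discarded before the table is reconstructed, which filters out short lines that aren’t table borders (default 3)
- `text_x_tolerance`: with the `"text"` strategy, controls how far apart characters can be and still count as the same word (default 3, increase to 6-8 for large fonts)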
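As a rough illustration of how the tolerance settings above are passed in, the sketch below hands a `table_settings` dictionary to `extract_tables()` and tidies the rows it returns. The keys are pdfplumber table settings; the values are starting points rather than recommendations, and `clean_rows` and `extract_borderless_tables` are hypothetical helpers introduced here.

```python
from typing import List, Optional, Sequence

# Starting values only; tune them per document family.
TEXT_TABLE_SETTINGS = {
    "vertical_strategy": "text",    # infer column edges from word positions
    "horizontal_strategy": "text",  # infer row edges from word positions
    "snap_tolerance": 3,
    "join_tolerance": 3,
    "edge_min_length": 3,
    "text_x_tolerance": 3,
}


def clean_rows(table: Sequence[Sequence[Optional[str]]]) -> List[List[str]]:
    """Replace None cells with '', collapse whitespace, drop fully empty rows."""
    cleaned = []
    for row in table:
        cells = [" ".join((cell or "").split()) for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


def extract_borderless_tables(path: str, settings: dict = TEXT_TABLE_SETTINGS):
    """Extract every table in the file using the text-based strategies."""
    import pdfplumber  # imported lazily so clean_rows works without it installed

    results = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables(table_settings=settings):
                results.append(clean_rows(table))
    return results
```

Keeping the settings in one dictionary makes it easier to version them alongside the documents they were tuned for.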
The Excel and CSV Scraping Patterns for Web Data Pipelines (2026) article covers what to do once you’ve got clean tabular output and need to normalize it into a pipeline-ready format.
When to use Tabula
Use Tabula when:
- the PDF has clean, visible grid lines (Tabula’s lattice mode handles these better than pdfplumber in most cases)
- you’re already in a Java-friendly environment such as a data platform running Spark
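- the tables span many pages and you want them all from one call (`pages="all"` returns one DataFrame per detected table, so a table that continues across pages still has to be concatenated afterwards)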
Tabula’s Python wrapper is straightforward:

```python
import tabula

# lattice=True targets tables with visible ruling lines; the call returns a list of DataFrames.
dfs = tabula.read_pdf("data.pdf", pages="all", multiple_tables=True, lattice=True)
for df in dfs:
    print(df.head())
```

The main risk with Tabula in production is JVM cold start in serverless environments. On AWS Lambda with a Java layer, first-invocation latency can exceed 10 seconds. If you’re building a high-throughput PDF scraping pipeline of the kind covered in PDF Scraping: Extract Data from PDF Documents at Scale, that cost adds up fast, and switching to pdfplumber with coordinate-based detection usually wins on both speed and operational simplicity.
Don’t use Tabula for:
- text extraction (it’s not designed for it)
- PDFs without clear structure
- any pipeline where a JVM dependency is a deployment constraint
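The rules above can be condensed into a small routing function. This is a sketch under stated assumptions, not a definitive policy: `choose_table_backend` and its return labels are hypothetical, and whether a PDF has ruling lines is something the caller has to determine separately.

```python
import shutil
from typing import Optional


def java_available() -> bool:
    """True when a `java` executable is on PATH, which tabula-py needs at runtime."""
    return shutil.which("java") is not None


def choose_table_backend(has_ruling_lines: bool, has_java: Optional[bool] = None) -> str:
    """Pick a table extractor label based on document structure and environment."""
    if has_java is None:
        has_java = java_available()
    # Tabula is only worth its JVM cost for grid-line tables.
    if has_ruling_lines and has_java:
        return "tabula-lattice"
    return "pdfplumber"
```

Passing `has_java` explicitly keeps the decision deterministic in tests and in containers where the check should not touch the filesystem.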
Handling edge cases at scale
The libraries above fail predictably in specific scenarios. Here’s what to watch for:
- Encrypted PDFs: PyMuPDF supports password-based decryption via `doc.authenticate(password)`. pdfplumber accepts a `password` argument in `pdfplumber.open()`, and tabula-py accepts a `password` argument in `read_pdf()`.
- Right-to-left text: PyMuPDF handles Arabic and Hebrew correctly in 1.24+. pdfplumber sometimes reverses character order on RTL runs.
- Custom font encodings: both PyMuPDF and pdfplumber can output garbage characters when fonts use non-standard encoding tables. Use `get_text("rawdict")` in PyMuPDF and inspect the per-character output to see which fonts produce the bad characters.
- Mixed scanned + digital pages: run a page-level check using PyMuPDF (`page.get_text()` returns an empty or whitespace-only string for image-only pages) and route those pages to OCR before your main extraction pass.
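The page-level check for mixed scanned and digital documents can be sketched as follows. The `min_chars` threshold is an assumption added here so that pages carrying only a stray page number still go to OCR; `needs_ocr`, `split_pages_for_ocr`, and `route_pdf` are hypothetical names.

```python
from typing import Iterable, List, Tuple


def needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    """Treat a page as scanned when it yields almost no extractable text."""
    return len(page_text.strip()) < min_chars


def split_pages_for_ocr(
    page_texts: Iterable[str], min_chars: int = 20
) -> Tuple[List[int], List[int]]:
    """Return (digital_page_indexes, scanned_page_indexes), both zero-based."""
    digital, scanned = [], []
    for index, text in enumerate(page_texts):
        (scanned if needs_ocr(text, min_chars) else digital).append(index)
    return digital, scanned


def route_pdf(path: str, min_chars: int = 20) -> Tuple[List[int], List[int]]:
    """Read each page's text with PyMuPDF and split the pages into the two groups."""
    import fitz  # PyMuPDF, imported lazily so the pure helpers work without it

    with fitz.open(path) as doc:
        texts = [page.get_text() for page in doc]
    return split_pages_for_ocr(texts, min_chars)
```

The scanned indexes can then be rendered to images and sent to OCR, while the digital indexes go straight to the main extraction pass.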
The coordinate-aware approach in pdfplumber also makes it easier to isolate specific page regions, which is useful for financial PDFs where the target table sits between a header block and a footnote block. Similar spatial reasoning applies when selecting specific DOM nodes in HTML, which is why the Selectolax vs lxml Speed Benchmarks for HTML Parsing (2026) article draws a useful parallel between node selection strategies and region-based PDF extraction.
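Region isolation of this kind might look like the sketch below, which crops a page to the band between a header and a footnote before extracting tables. It relies on pdfplumber measuring `top` and `bottom` from the top of the page; `body_region` and `extract_tables_in_region` are hypothetical helpers, and the header and footnote positions are assumed to be known in advance.

```python
from typing import Tuple


def body_region(
    page_width: float, header_bottom: float, footnote_top: float, margin: float = 0.0
) -> Tuple[float, float, float, float]:
    """Return an (x0, top, x1, bottom) box lying between a header and a footnote block."""
    top = header_bottom + margin
    bottom = footnote_top - margin
    if bottom <= top:
        raise ValueError("footnote_top must sit below header_bottom")
    return (0.0, top, float(page_width), bottom)


def extract_tables_in_region(
    path: str, page_number: int, header_bottom: float, footnote_top: float
):
    """Extract tables only from the band between the header and the footnote."""
    import pdfplumber  # imported lazily so body_region works without it installed

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number]  # zero-based index into the page list
        bbox = body_region(page.width, header_bottom, footnote_top)
        return page.crop(bbox).extract_tables()
```

Cropping before extraction keeps header and footnote text from being pulled into the first and last table rows.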
Bottom line
For most programmatic PDFs, start with PyMuPDF for speed and reliability. Reach for pdfplumber when table structure matters more than throughput. Use Tabula only if you have grid-line tables and a deployment environment where the JVM dependency is acceptable. DRT covers more of the PDF extraction stack, including bulk processing patterns and anti-bot considerations for PDF endpoints, across the publication.
Related guides on dataresearchtools.com
- Best Python HTML Parsers 2026: lxml vs BeautifulSoup vs Selectolax
- Selectolax vs lxml Speed Benchmarks for HTML Parsing (2026)
- Excel and CSV Scraping Patterns for Web Data Pipelines (2026)
- Image OCR for Web Scraping in 2026: Tesseract vs Google Vision vs Claude
- Pillar: PDF Scraping: Extract Data from PDF Documents at Scale