
How to Convert a PDF to CSV (Tables, Forms, and Everything in Between)

April 23, 2026·7 min read

You have a PDF with a table in it: maybe a bank statement, a price list, a research dataset, or an exported report. You need it as a CSV so a spreadsheet, a database, or a script can do something useful with it. Converting a PDF to CSV sounds simple, and for cleanly structured PDFs it is. For messy real-world PDFs, it ranges from "annoying" to "small archaeology project". This guide walks through the different kinds of PDF you might be converting and the best tool for each.

Three kinds of PDF tables (and why it matters)

Before you pick a tool, classify your file:

  1. Native-text PDF with a real table structure: the table was generated programmatically (an exported report, a payroll system) and has actual table objects in the PDF. The text is selectable, columns are aligned, headers are clear. These are easy to convert.
  2. Native-text PDF without explicit table structure: the text is selectable but the "table" is just text positioned in columns. No table tags, no row/column metadata. Most PDFs fall into this category. Conversion needs heuristics.
  3. Scanned image PDF: the table is a picture of a table. No selectable text. Requires OCR before you can extract anything.

Each kind needs a different approach. For more on adding searchable text to scanned files, see how to make a PDF searchable with OCR.

Quick path: copy and paste

For a one-off small table, the fastest approach is also the dumbest one:

  1. Open the PDF in any reader
  2. Select the table area with your cursor
  3. Copy
  4. Paste into a spreadsheet (Excel, Google Sheets, LibreOffice Calc)
  5. Use Data → Text to Columns if the paste lands as a single column

This works surprisingly often for small tables in well-laid-out PDFs. It fails badly when columns are misaligned, when cells contain line breaks, or when the pasted text mixes tabs and spaces inconsistently.

Mid-tier: dedicated extractor tools

Several free tools specialize in table extraction:

  • Tabula: an open-source desktop app dedicated to PDF table extraction. You draw a rectangle around the table, choose "Stream" or "Lattice" extraction (Lattice for tables with visible borders, Stream for whitespace-separated columns), and export CSV. It is the standard tool for journalists and researchers pulling tables out of government PDFs.
  • Camelot (Python): programmatic table extraction. Works similarly to Tabula. Useful if you have a folder of PDFs to process in a script.
  • pdftotext with the layout flag: pdftotext -layout file.pdf - (from poppler-utils, see our poppler-utils introduction) preserves visual spacing. Combined with awk or Python column-splitting, this handles many simple tables.
  • Excel's "From PDF" import: modern Excel can import tables from a PDF directly via Data → Get Data → From File → From PDF. Power Query inspects the file and lets you pick which tables to import.
  • LibreOffice Calc + Draw: open the PDF in Draw, copy the table region, paste into Calc. Often surprisingly clean for native-text PDFs.
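As a rough illustration of the "pdftotext plus column-splitting" idea, here is a minimal Python sketch. It assumes you have already captured the output of pdftotext -layout as a string, and that columns are separated by at least two spaces while cell contents use single spaces. The function names and the sample text are made up for this example.

```python
import csv
import io
import re


def layout_text_to_rows(text):
    """Split each non-blank line into cells on runs of two or more spaces."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between table rows
        rows.append(re.split(r"\s{2,}", line.strip()))
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text; the csv module quotes cells containing commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical sample of what `pdftotext -layout` might print for a small table.
sample = (
    "Date        Description          Amount\n"
    "2026-01-03  Coffee beans, 1 kg    18.50\n"
    "2026-01-05  Train ticket          42.00\n"
)

rows = layout_text_to_rows(sample)
csv_text = rows_to_csv(rows)
print(csv_text)
```

This heuristic breaks as soon as a cell is empty, because the gap merges with the neighboring whitespace and the row comes out one column short. For tables like that, splitting at fixed character offsets is usually the safer route.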

Heavy-duty: commercial table extractors

For dirty real-world PDFs with merged cells, multi-row headers, and varying layouts, commercial tools have a clear edge:

  • ABBYY FineReader: the long-running OCR and document conversion suite. Excellent at table reconstruction.
  • Adobe Acrobat Pro: exports to Excel directly. It is worth paying for if you do this regularly: Acrobat is one of the better table extractors for arbitrary documents.
  • Smallpdf, iLovePDF, Nitro: online and desktop tools with table extraction. Quality varies, but for casual use these are quick.
  • PDF-XChange Editor: solid table extraction in its export tools.

For comparisons among these, see Smallpdf vs iLovePDF and Acrobat vs Foxit.

When you have a scanned PDF

You need OCR first. The pipeline is:

  1. Run OCR on the file to produce a searchable PDF with a text layer. Tools: ABBYY FineReader, Adobe Acrobat Pro, OCRmyPDF (open-source CLI), or Tesseract directly.
  2. Run your table extractor on the OCR'd PDF.

Scanned tables introduce their own headaches: OCR errors, misread numbers, and broken column alignment. Always check totals and known values against the original.
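If you go the OCRmyPDF route, the first step of the pipeline can be driven from Python. The sketch below assumes the ocrmypdf command is installed and on your PATH; the helper names and file names are placeholders. The --skip-text option tells OCRmyPDF to leave pages that already have a text layer alone.

```python
import shutil
import subprocess


def ocr_command(src, dst):
    """Build the OCRmyPDF command line for adding a text layer to src."""
    # --skip-text: do not OCR pages that already contain text
    return ["ocrmypdf", "--skip-text", src, dst]


def add_text_layer(src, dst):
    """Run OCRmyPDF, failing early if the tool is not available."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(ocr_command(src, dst), check=True)


# Show the command that would be run for a hypothetical scanned file.
command = ocr_command("scan.pdf", "scan_ocr.pdf")
print(" ".join(command))
```

Calling add_text_layer("scan.pdf", "scan_ocr.pdf") would then produce the searchable PDF that your table extractor reads in step 2.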

For an overview of OCR concepts, see PDF OCR explained.

Scripted conversion for batch jobs

If you need to convert dozens or hundreds of PDFs to CSV, write a small script. A typical Python pipeline:

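import camelot

# "stream" infers columns from whitespace; use "lattice" for ruled tables.
tables = camelot.read_pdf("statement.pdf", pages="all", flavor="stream")

# Write each detected table to its own CSV, numbered from 1.
for i, table in enumerate(tables, start=1):
    # table.df is a pandas DataFrame; header=False skips its 0, 1, 2... column labels.
    table.df.to_csv(f"statement_table_{i}.csv", index=False, header=False)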

For lattice tables (with visible cell borders), use flavor="lattice". For stream tables (whitespace-separated), use flavor="stream". If stream extraction is off, tune the row tolerance (Camelot's row_tol option) and pass explicit column boundaries (its columns option).

Camelot, Tabula's CLI mode, and pdfplumber are all good options; pick whichever fits your existing stack.

Common gotchas and how to handle them

Merged header cells. A table with "Q1 2026" spanning two columns and "Revenue / Expenses" under it confuses every extractor. Expect to clean these by hand or write a post-processing rule.
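One possible post-processing rule for a two-row header, sketched in Python: forward-fill the blank cells that a merged top-level header leaves behind, then join each top-level label with its sub-header. This assumes the extractor put the merged label in the first column it spans and left the rest empty, which is common but not guaranteed. The function name and sample data are illustrative.

```python
def flatten_headers(top, sub):
    """Forward-fill blank top-row cells, then join each with its sub-header."""
    filled = []
    last = ""
    for cell in top:
        if cell.strip():
            last = cell.strip()  # a new merged label starts here
        filled.append(last)      # blanks inherit the label to their left
    return [f"{t} {s}".strip() for t, s in zip(filled, sub)]


# Hypothetical extractor output for a header with two merged quarter cells.
top = ["", "Q1 2026", "", "Q2 2026", ""]
sub = ["Region", "Revenue", "Expenses", "Revenue", "Expenses"]

headers = flatten_headers(top, sub)
print(headers)
```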

Multi-line cells. A cell containing "Reference: ABC-123" on one line and "Order: XYZ-456" on the next often gets split into two rows. Re-merge by detecting empty leading columns.
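The "empty leading columns" rule can be sketched like this. It assumes every real row has a value in its first column (a date or an ID, say) and that all rows have the same width; the key_columns parameter and the sample rows are inventions for this example.

```python
def merge_continuation_rows(rows, key_columns=1):
    """Fold rows whose leading key columns are empty into the previous row."""
    merged = []
    for row in rows:
        leading_empty = all(not cell.strip() for cell in row[:key_columns])
        if merged and leading_empty:
            previous = merged[-1]
            for i, cell in enumerate(row):
                if cell.strip() and i < len(previous):
                    # append the spilled text to the cell above it
                    previous[i] = f"{previous[i]} {cell.strip()}".strip()
        else:
            merged.append(list(row))
    return merged


# Hypothetical extractor output where one cell spilled onto a second row.
rows = [
    ["2026-01-03", "Reference: ABC-123", "18.50"],
    ["", "Order: XYZ-456", ""],
    ["2026-01-05", "Reference: DEF-789", "42.00"],
]

merged = merge_continuation_rows(rows)
print(merged)
```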

Decimal and thousand separators. A European PDF using commas as decimals ("1.234,56") becomes ambiguous when parsed by a tool expecting commas as thousands. Normalize before importing into a spreadsheet.
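A small normalization helper might look like the following. It assumes you know which convention the PDF uses (guessing per value is unreliable) and that the only separators present are dots and commas. The function name is made up for illustration.

```python
from decimal import Decimal


def parse_number(text, decimal_sep=","):
    """Parse a formatted number, given which character marks the decimals."""
    thousands_sep = "." if decimal_sep == "," else ","
    cleaned = text.strip().replace(thousands_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)


european = parse_number("1.234,56")                 # comma as decimal separator
american = parse_number("1,234.56", decimal_sep=".")  # dot as decimal separator
print(european, american)
```

Both calls yield the same value, so downstream code sees one consistent format regardless of where the PDF came from.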

Footnotes and totals. Many financial tables include subtotals and footnotes mid-table. Filter them out or your aggregations will double-count.
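A filter for these rows could be sketched as below. The patterns are assumptions about how such rows tend to look (a first cell starting with "Total" or "Subtotal", or a lone cell starting with "*" or "Note"); adjust them to whatever your PDFs actually contain.

```python
import re
from decimal import Decimal

# Assumed label patterns; real documents may use different wording.
SUMMARY_LABEL = re.compile(r"^(sub-?)?total\b", re.IGNORECASE)
FOOTNOTE = re.compile(r"^(\*|note\b)", re.IGNORECASE)


def is_data_row(row):
    """Return False for subtotal/total rows and single-cell footnote rows."""
    first = row[0].strip()
    if SUMMARY_LABEL.match(first):
        return False
    non_empty = [cell for cell in row if cell.strip()]
    if len(non_empty) == 1 and FOOTNOTE.match(first):
        return False
    return True


# Hypothetical extracted rows with a subtotal, a footnote, and a grand total.
rows = [
    ["Rent", "1200.00"],
    ["Utilities", "150.00"],
    ["Subtotal", "1350.00"],
    ["* Utilities estimated", ""],
    ["Travel", "300.00"],
    ["Total", "1650.00"],
]

data = [row for row in rows if is_data_row(row)]
total = sum(Decimal(row[1]) for row in data)
print(total)
```

Summing only the filtered rows reproduces the printed grand total, which is a handy cross-check that the filter removed the right lines.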

Currency symbols and units. "$1,234.56" or "1,234 kg" must become a number plus a separate unit field if you want to compute on them. Strip symbols, convert to floats.
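Here is one way to split such a cell into a number and a unit, as a sketch. It assumes commas are thousands separators and dots are decimals, and it uses Decimal rather than float to avoid rounding surprises with money (float works the same way if you prefer it). The regular expression and function name are illustrative.

```python
import re
from decimal import Decimal

# Optional symbol prefix, a number with optional thousands commas, optional unit suffix.
AMOUNT = re.compile(
    r"^\s*(?P<prefix>[^\d\-+.]*)(?P<number>[-+]?[\d,]*\.?\d+)\s*(?P<suffix>.*?)\s*$"
)


def split_amount(text):
    """Return (value, unit) for strings such as '$1,234.56' or '1,234 kg'."""
    match = AMOUNT.match(text)
    if match is None:
        raise ValueError(f"not a recognizable amount: {text!r}")
    value = Decimal(match.group("number").replace(",", ""))
    unit = match.group("prefix").strip() or match.group("suffix").strip()
    return value, unit


price = split_amount("$1,234.56")
weight = split_amount("1,234 kg")
print(price, weight)
```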

Pagination. A table that spans 20 pages of a PDF usually has its header repeated on every page. After extraction, dedupe header rows before doing anything else.
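Deduping repeated headers is short enough to sketch in full. This version assumes the first extracted row is the header and that the repeats match it exactly apart from surrounding whitespace; the function name and sample rows are placeholders.

```python
def drop_repeated_headers(rows):
    """Keep the first row as the header and drop later rows identical to it."""
    if not rows:
        return []
    header = [cell.strip() for cell in rows[0]]
    body = [row for row in rows[1:] if [cell.strip() for cell in row] != header]
    return [rows[0]] + body


# Hypothetical rows from a table that continued onto a second page.
rows = [
    ["Date", "Amount"],
    ["2026-01-03", "18.50"],
    ["Date", "Amount "],
    ["2026-01-05", "42.00"],
]

cleaned = drop_repeated_headers(rows)
print(cleaned)
```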

Sanity-check your output

Always verify the conversion before you trust it:

  • Row count. Compare the number of data rows in the CSV against the visible count in the PDF.
  • Column totals. Spot-check that the sum of a numeric column matches a known total.
  • First and last rows. Confirm they match the PDF; extractors sometimes drop the first row (mistaking it for headers) or the last (truncating).
  • Special characters. Currency symbols, em dashes, and non-ASCII characters often get garbled. Open the CSV in a text editor before importing.
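The first two checks are easy to automate. The sketch below assumes the CSV has a header row, that the numeric column is already normalized to plain decimals, and that you have read the expected row count and total off the PDF yourself. The function name, column name, and sample data are made up.

```python
import csv
import io
from decimal import Decimal


def check_csv(csv_text, expected_rows, total_column, expected_total):
    """Return a list of human-readable problems; an empty list means it passed."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} data rows, found {len(rows)}")
    actual_total = sum(Decimal(row[total_column]) for row in rows)
    if actual_total != Decimal(expected_total):
        problems.append(f"expected total {expected_total}, found {actual_total}")
    return problems


good = "item,amount\nA,10.25\nB,50.25\n"
truncated = "item,amount\nA,10.25\n"  # last row lost during extraction

good_problems = check_csv(good, 2, "amount", "60.50")
truncated_problems = check_csv(truncated, 2, "amount", "60.50")
print(good_problems, truncated_problems)
```

A dropped last row shows up twice here, once in the row count and once in the total, which is why checking both is worth the extra line.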

When extraction just won't work

Sometimes a PDF is simply too messy for automated extraction. A few escape hatches:

  • Manual transcription: for a small dataset, often faster than fighting tools.
  • Crowdsource it: services like Amazon Mechanical Turk or specialist data-entry vendors can transcribe scanned tables quickly.
  • Vendor APIs: services like Amazon Textract, Google Document AI, and Azure Form Recognizer (since renamed Azure AI Document Intelligence) use modern machine learning models specifically for table extraction from messy documents. For a deeper look at this category, see AI data extraction from PDFs.

Takeaway

Converting a PDF to CSV is a sliding scale of difficulty: native PDFs with explicit table structure are nearly free; native PDFs with whitespace-aligned columns need a smart extractor like Tabula or Camelot; scanned PDFs need OCR first and then extraction; and pathological tables sometimes need human eyes. Pick the tool that matches the PDF, verify the numbers before trusting them, and keep a re-runnable script around for files you expect to receive again. If you need to clean or reorganize the PDF before extraction (say, splitting the document so you only process the relevant pages), you can do that in Docento.app without uploading anywhere.
