How to Convert a PDF to JSON: From Plain Extraction to Structured Output

JSON is the lingua franca of modern application data. APIs return JSON, databases store JSON, web apps consume JSON. PDFs do not. Converting a PDF to JSON is a common need whenever you want to feed PDF content into a web service, an LLM, a search index, or a downstream business process. This article walks through the practical approaches, from a single command line to enterprise document AI, and shows you which to pick for which job.

What "PDF to JSON" can mean

Just like PDF to XML, the term covers very different conversions:

Text dump. A JSON document containing the raw text from every page.
Text with layout. JSON with text chunks plus coordinates, fonts, and styling.
Structured content. JSON modeling headings, paragraphs, lists, tables, and figures.
Domain-specific JSON. The PDF is an invoice; the JSON conforms to your invoice schema.

Each is a different problem. Pick deliberately.

Quick command-line extraction

For a plain text-by-page dump, several tools work well:

pdftotext + jq / Python. Run pdftotext file.pdf - to get plain text, then wrap in JSON.
pdfplumber (Python). A few lines of code give you pages with extracted text, words with bounding boxes, and tables. Easy to serialize as JSON.
mutool (MuPDF). mutool draw -F txt extracts text, and mutool extract pulls out embedded images and fonts; pipe the output through a small script for JSON. See our MuPDF introduction.
pdfminer.six. pdf2txt.py --output_type tag produces an XML-like output you can map to JSON.

A minimal pdfplumber example:

import json, pdfplumber

with pdfplumber.open("file.pdf") as pdf:
    pages = [{"page": i + 1, "text": p.extract_text() or ""} for i, p in enumerate(pdf.pages)]

print(json.dumps({"pages": pages}, ensure_ascii=False, indent=2))

That handles the common case for native-text PDFs. For scanned PDFs, run OCR first; see PDF OCR explained and how to make a PDF searchable with OCR.
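
For the pdftotext route listed earlier, the wrapping step can be sketched as follows. This assumes pdftotext is installed and on the PATH, and it relies on pdftotext ending each page with a form feed character. The helper names pages_from_pdftotext and pdf_to_json are hypothetical.

```python
import json
import subprocess


def pages_from_pdftotext(raw: str) -> list[dict]:
    """Split pdftotext output into per-page records.

    pdftotext ends each page with a form feed character, so splitting
    on it recovers the page boundaries.
    """
    chunks = raw.split("\f")
    # The output ends with a form feed, which leaves one empty trailing chunk.
    if chunks and chunks[-1].strip() == "":
        chunks.pop()
    return [{"page": i + 1, "text": chunk.strip()} for i, chunk in enumerate(chunks)]


def pdf_to_json(path: str) -> str:
    """Run pdftotext on one file and return a JSON string (hypothetical helper)."""
    # The trailing "-" sends the text to stdout instead of a .txt file.
    result = subprocess.run(
        ["pdftotext", path, "-"],
        capture_output=True, encoding="utf-8", check=True,
    )
    return json.dumps(
        {"pages": pages_from_pdftotext(result.stdout)},
        ensure_ascii=False, indent=2,
    )
```

Calling pdf_to_json("file.pdf") should give the same {"pages": [...]} shape as the pdfplumber version, which keeps downstream consumers indifferent to the extractor used.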

Text plus layout

When you need coordinates, for example to highlight extracted regions back in the original PDF, most libraries expose them:

import json, pdfplumber

with pdfplumber.open("file.pdf") as pdf:
    blocks = []
    for i, page in enumerate(pdf.pages):
        for w in page.extract_words():
            blocks.append({
                "page": i + 1,
                "text": w["text"],
                "x0": w["x0"], "y0": w["top"],
                "x1": w["x1"], "y1": w["bottom"],
            })
print(json.dumps(blocks, ensure_ascii=False))

Many "PDF to JSON" SaaS APIs return output of a similar shape: a flat list of text items, each with a page number and a bounding box.

Structured content from tagged PDFs

If the PDF is properly tagged, you can extract its structure tree and emit JSON shaped like the document's logical model:

{
  "type": "Document",
  "title": "Quarterly Report",
  "children": [
    { "type": "H1", "text": "Quarterly Report" },
    { "type": "P", "text": "This document summarizes…" },
    {
      "type": "Table",
      "rows": [
        ["Quarter", "Revenue"],
        ["Q1", "$1.2M"]
      ]
    }
  ]
}

Tools:

Adobe Acrobat Pro, Export As → XML, then run XSLT to convert to JSON.
pikepdf + a structure walker, small Python that walks /StructTreeRoot and emits JSON.
CommonLook and axesPDF, commercial tools that produce structured exports for compliance.

For untagged PDFs, you can attempt structure inference using layout heuristics (font sizes for headings, indentation for lists), but the results are uneven. For documents you control, tag at authoring time and structured JSON falls out for free.
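
To make the font-size heuristic concrete, here is a minimal sketch. It assumes you have already grouped extracted text into lines with a font size attached; the input shape, the infer_blocks name, and the 1.2 ratio are all illustrative assumptions, not a standard.

```python
from collections import Counter


def infer_blocks(lines: list[dict], heading_ratio: float = 1.2) -> list[dict]:
    """Label each line as a heading ("H1") or paragraph ("P") by font size.

    `lines` is a hypothetical input shape: [{"text": str, "size": float}, ...].
    The most common font size is taken to be body text; anything at least
    `heading_ratio` times larger is treated as a heading.
    """
    if not lines:
        return []
    sizes = Counter(round(line["size"], 1) for line in lines)
    body_size = sizes.most_common(1)[0][0]
    blocks = []
    for line in lines:
        is_heading = line["size"] >= body_size * heading_ratio
        blocks.append({"type": "H1" if is_heading else "P", "text": line["text"]})
    return blocks
```

A single threshold like this cannot tell heading levels apart and will mislabel large pull quotes or captions, which is the unevenness mentioned above.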

Tables specifically

PDF table extraction deserves its own pass. Several tools produce JSON directly:

Tabula with the JSON output option
Camelot (tables[0].df.to_json())
pdfplumber (page.extract_tables() returns nested lists, trivially serialized)
AWS Textract, Google Document AI, and Azure AI Document Intelligence (formerly Form Recognizer), which return tables as nested JSON with cell coordinates and confidence scores
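
Nested lists serialize directly, but consumers often prefer one object per row. A small sketch of that conversion, assuming the first row of each table is its header (which is not always true) and using a hypothetical table_to_records helper:

```python
def table_to_records(table: list[list]) -> list[dict]:
    """Turn a header-first table (list of rows) into a list of dicts."""
    if not table:
        return []
    # Fall back to positional names when a header cell is empty or None.
    header = [str(h or "").strip() or f"col_{i}" for i, h in enumerate(table[0])]
    records = []
    for row in table[1:]:
        # Pad short rows so every record has every key; None becomes JSON null.
        padded = list(row) + [None] * (len(header) - len(row))
        records.append(dict(zip(header, padded)))
    return records
```

With pdfplumber, this could be applied to each table returned by page.extract_tables() before passing the result to json.dumps.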

For more on tables, see how to convert a PDF to CSV.

Form data extraction

If the PDF has interactive AcroForm fields, JSON conversion is direct (XFA forms keep their data in an embedded XML packet and need separate handling):

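import json
import pikepdf

with pikepdf.open("form.pdf") as pdf:
    # /AcroForm is optional; this reads top-level fields only (no /Kids recursion).
    fields = pdf.Root.AcroForm.Fields if "/AcroForm" in pdf.Root else []
    # A field may have no value (/V) yet; emit null rather than failing.
    data = {str(f.T): (str(f.V) if "/V" in f else None) for f in fields if "/T" in f}

print(json.dumps(data))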

Every named top-level field becomes a JSON key; fields nested under /Kids need a recursive walk. Combine with how to export PDF form data for more flexible export workflows.

Domain-specific JSON via document AI

If you need PDF → invoice JSON, contract JSON, lab-report JSON, this is no longer a pure conversion job. It is structured extraction, where the system has to recognize fields like "invoice number" or "test result". Options:

AWS Textract Analyze Document / Analyze Expense, purpose-built for invoices, receipts, identity documents. JSON output, including confidence scores per field.
Google Document AI, processors for many document types, with custom processors trainable for your forms.
Azure AI Document Intelligence, similar territory, strong on forms.
Open-source equivalents, Donut, LayoutLM variants, unstructured.io. Lower out-of-the-box accuracy on uncommon document types but flexible and self-hostable.
LLM-based extraction. Send the PDF text (or image) to an LLM with a prompt like "extract invoice number, date, line items as JSON matching this schema". Pair with response-format JSON schemas for reliability. See chatting with PDFs explained.

Each of these returns JSON. The differences are accuracy, cost, training requirements, and data residency.
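
Whichever service or model you use, it helps to treat its reply as untrusted until it has been parsed and checked. The sketch below does that for an LLM reply; it makes no API call, and the required keys and the parse_invoice_reply name are hypothetical stand-ins for your own schema.

```python
import json

# Hypothetical minimal invoice schema: the keys every reply must contain.
REQUIRED_KEYS = {"invoice_number", "date", "line_items"}
FENCE = "`" * 3


def parse_invoice_reply(raw: str) -> dict:
    """Parse a model reply into a dict, or raise ValueError."""
    text = raw.strip()
    # Some models wrap JSON in a Markdown fence; drop it if present.
    if text.startswith(FENCE):
        text = text.strip("`").removeprefix("json").strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data
```

A reply that fails here can be retried or routed to manual review instead of silently entering your data.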

Common gotchas

Schema design. Decide on your JSON schema before you start extracting. Otherwise every PDF produces ad-hoc shapes and downstream consumers become fragile.

Confidence and provenance. Production pipelines need more than the extracted value. Carry confidence scores per field and ideally bounding-box references back to the source page, so a human can audit when something looks off.
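
One possible record shape for this, as a sketch: each extracted field carries its confidence, page, and bounding box. The ExtractedField class, the needs_review helper, and the 0.9 threshold are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # 0.0 to 1.0, as reported by the extractor
    page: int          # 1-based page number in the source PDF
    bbox: tuple[float, float, float, float]  # x0, y0, x1, y1 on that page


def needs_review(fields: list[ExtractedField], threshold: float = 0.9) -> list[str]:
    """Return the names of fields whose confidence is below the threshold."""
    return [f.name for f in fields if f.confidence < threshold]
```

dataclasses.asdict turns such a record into a plain dict that json.dumps can serialize, with the bounding box emitted as a four-element array.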

Unicode and language. Multilingual PDFs are common. Ensure your tools handle UTF-8 end to end, normalize to NFC, and tag the language so downstream NLP knows what to expect.
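
A minimal sketch of the normalization step using only the standard library. The "lang" key and the helper names are assumptions about your output shape, not a convention.

```python
import json
import unicodedata


def normalize_text(text: str) -> str:
    """Return the NFC-normalized form of the extracted text."""
    return unicodedata.normalize("NFC", text)


def to_json(text: str, lang: str) -> str:
    """Serialize normalized text together with a language tag."""
    # ensure_ascii=False keeps non-ASCII characters as UTF-8 instead of escapes.
    return json.dumps({"lang": lang, "text": normalize_text(text)}, ensure_ascii=False)
```

NFC matters because PDFs sometimes store an accented letter as a base letter plus a combining mark, which looks identical on screen but fails string comparison and search.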

Reading order. As with PDF-to-XML, coordinate-based extraction is in content-stream order, not reading order. Reconstruct reading order before output for multi-column documents.
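
As an illustration, here is a deliberately naive sketch for a two-column page. It assumes word dicts with "x0" and "top" keys (the shape pdfplumber's extract_words returns) and assumes the gutter sits at the horizontal midpoint of the page; real layouts need proper column detection.

```python
def reading_order(words: list[dict], page_width: float) -> list[dict]:
    """Order word boxes for a two-column page: left column first, then right.

    Assumes the gutter is at the horizontal midpoint of the page.
    """
    midpoint = page_width / 2

    def sort_key(word: dict) -> tuple:
        column = 0 if word["x0"] < midpoint else 1
        # Round the vertical position so words on the same line sort by x0.
        return (column, round(word["top"]), word["x0"])

    return sorted(words, key=sort_key)
```

Full-width elements such as titles or tables that span both columns will be split incorrectly by this approach, so treat it as a starting point only.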

Date and number formats. Emit ISO 8601 dates and unambiguous numbers (decimal point, no thousands separators) in the JSON. Do the locale normalization in extraction, not downstream.
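
A small sketch of that normalization with the standard library. It assumes you know the source locale's date format and separators in advance; guessing them per document is a separate and harder problem. The helper names are hypothetical.

```python
from datetime import datetime
from decimal import Decimal


def normalize_date(raw: str, fmt: str) -> str:
    """Parse a date in a known source format and return ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(raw.strip(), fmt).date().isoformat()


def normalize_number(raw: str, decimal_sep: str = ".", thousands_sep: str = ",") -> str:
    """Return a plain decimal string with '.' as the decimal point and no grouping."""
    # Remove grouping first, then map the locale's decimal separator to '.'.
    cleaned = raw.strip().replace(thousands_sep, "").replace(decimal_sep, ".")
    # Decimal raises InvalidOperation on anything that is still not a number.
    return str(Decimal(cleaned))
```

Returning the number as a string keeps trailing zeros and avoids float rounding; convert to a JSON number only if your schema requires it.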

Encryption. A password-protected PDF needs the password to extract anything. See how to remove a password from a PDF (with the legal caveats that apply).

A small reference pipeline

A practical batch pipeline for general PDF → JSON:

Ingest, drop PDFs into a folder or S3 bucket.
Normalize, repair any corrupt PDFs, OCR if needed, strip metadata you do not want.
Extract, run pdfplumber or pdfminer for native text, AWS Textract or similar for scans and forms.
Map, apply schema-specific mapping rules to coerce raw extraction into your domain schema.
Validate, JSON Schema validation against your target shape; reject and quarantine anything that fails.
Index or deliver, push to your data warehouse, search index, or API consumer.

Each step is straightforward; together they form a reliable conversion pipeline.
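
The extract, map, validate, and quarantine steps can be sketched as a single loop. The extract and mapper callables, the REQUIRED shape, and the simple validate function are hypothetical; in particular, validate is a stand-in for real JSON Schema validation with a validator library.

```python
# Hypothetical target shape: a top-level "pages" key holding a list.
REQUIRED = {"pages": list}


def validate(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the document passes."""
    errors = []
    for key, expected in REQUIRED.items():
        if key not in doc:
            errors.append(f"missing key: {key}")
        elif not isinstance(doc[key], expected):
            errors.append(f"{key} should be {expected.__name__}")
    return errors


def run_batch(paths, extract, mapper):
    """Run extract -> map -> validate for each path; split results from rejects."""
    accepted, quarantined = [], []
    for path in paths:
        try:
            doc = mapper(extract(path))
            errors = validate(doc)
        except Exception as exc:  # an extractor crash also quarantines the file
            errors = [f"{type(exc).__name__}: {exc}"]
        if errors:
            quarantined.append({"path": path, "errors": errors})
        else:
            accepted.append({"path": path, "doc": doc})
    return accepted, quarantined
```

Keeping the error list next to each quarantined path makes it easy to see later whether a failure came from the extractor or from the schema check.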

Takeaway

Converting a PDF to JSON is a tiered problem. For plain text, a couple of CLI tools handle the job. For structured content, leverage tags or domain-specific document AI. For high-stakes pipelines, design your schema first and validate every output. JSON-output extraction tools have matured enormously in the last few years, so there is no longer an excuse to be parsing PDFs with regex in 2026. Pair the right extractor with a clear schema, and you have a reliable pipeline. For lightweight per-document tasks like trimming, splitting, or extracting specific pages before sending to your extractor, Docento.app handles the cleanup in the browser.