AI Data Extraction from PDFs: From Invoices to Lab Reports

April 29, 2026·7 min read

Extracting structured data from PDFs has long been the messy middle step of countless business workflows. Invoices, receipts, applications, lab reports, and contracts all arrive as PDFs and need to land in databases or spreadsheets. In 2026, AI data extraction has turned this from a fragile, template-driven engineering effort into a workflow that handles widely varying documents with high accuracy. This guide walks through the practical landscape.

The problem

A PDF arrives. You need certain fields out of it: invoice number, customer name, line items, total, due date. The traditional approaches:

  • Template-based extraction. Define field positions per document layout. Fragile: every layout change breaks it.
  • Regex on extracted text. Works for highly structured text. Brittle for varied formats.
  • Manual data entry. Handles any layout, but is slow, expensive, and error-prone at volume.
  • OCR plus parsing. For scanned PDFs, OCR adds noise that downstream parsing has to handle.

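AI extraction avoids most of these problems by recognizing a document's structure semantically instead of relying on fixed positions or patterns. Scanned PDFs still need OCR first.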

How AI extraction works

A modern AI extraction pipeline:

  1. Document understanding. The model parses the document, identifying fields, tables, and paragraphs.
  2. Field recognition. The model recognizes "this is an invoice number", "this is a date", "this is a line item".
  3. Structured output. Fields are returned as JSON, XML, or a similar structured format, with confidence scores.
  4. Validation. Schema-based or rule-based checks confirm the output makes sense.

Compared to traditional extraction, AI extraction adapts to varying layouts. A new vendor's invoice with a different format is usually handled without retraining.

Tools

Cloud document AI services:

  • AWS Textract: strong on forms, receipts, IDs, and invoices
  • Google Document AI: many specialized processors plus custom training
  • Azure AI Document Intelligence (formerly Form Recognizer): similar capabilities
  • Microsoft Syntex: for Microsoft 365-integrated workflows

These offer per-document processing with structured output. Cost is per page or per processed document.

Specialized extraction tools:

  • Rossum: focused on invoice and procurement extraction
  • Klippa: receipts, invoices, IDs
  • Hypatos: document processing for finance
  • Veryfi: receipts and expense management
  • Nanonets: general-purpose document AI

These typically have higher accuracy on their specific domains than general-purpose services.

Chat AI (ChatGPT, Claude, Gemini):

  • Upload PDF
  • Prompt: "Extract the invoice number, date, vendor, line items, and total as JSON."
  • Response: structured JSON

Useful for one-offs and prototyping. For production, dedicated tools are more reliable.
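When prototyping this way from a script, the fragile part is usually the reply rather than the prompt: chat models often wrap the JSON in commentary. The sketch below is one way to handle that, not any vendor's API. The helper names `build_prompt` and `parse_reply` are hypothetical, and the call that actually sends the prompt to a model is deliberately left out.

```python
import json
import re

# Field names are illustrative; use whatever your downstream system expects.
FIELDS = ["invoice_number", "invoice_date", "vendor", "line_items", "total"]


def build_prompt(fields):
    """Build an extraction prompt that asks for JSON only."""
    field_list = ", ".join(fields)
    return (
        f"Extract the following fields from the attached invoice: {field_list}. "
        "Respond with a single JSON object and nothing else. "
        "Use null for any field that is not present in the document."
    )


def parse_reply(reply, fields):
    """Pull the JSON object out of a chat reply and check that every field is there."""
    # Take everything from the first "{" to the last "}" so that any
    # surrounding commentary is ignored.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    data = json.loads(match.group(0))
    missing = [name for name in fields if name not in data]
    if missing:
        raise ValueError(f"reply is missing fields: {missing}")
    return data
```

Raising on a missing key, rather than silently filling a default, keeps prototype results honest about how often the model skips a field.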

Open-source / self-hosted:

  • Donut (Document Understanding Transformer): runs locally
  • LayoutLM variants: Microsoft research models
  • unstructured: Python library for document parsing
  • PaddleOCR plus custom extraction: open-source alternative

For privacy-sensitive workflows where data cannot leave your infrastructure, self-hosting is the path.

Quality and accuracy

Modern extraction accuracy on common document types in 2026:

  • Invoices: 95-99% field accuracy with leading tools (Textract, Rossum, etc.)
  • Receipts: 90-97% for standard retail receipts
  • IDs and passports: 95%+ for major government documents
  • Lab reports: 85-95% depending on standardization
  • Custom domain documents: varies widely; trainable processors can match the figures above once trained on your documents

For comparison, manual data entry has an error rate of roughly 5-10% on similar fields, so on well-supported document types AI extraction is usually both more accurate and faster.

A practical pipeline

A typical invoice extraction pipeline:

  1. Receive: the invoice arrives as an email attachment, scan, or upload
  2. Pre-process: OCR if scanned, normalize page size
  3. Extract: send to AWS Textract, Rossum, or similar
  4. Validate: check that the total equals the sum of line items, dates are reasonable, and the vendor exists
  5. Route: send matched, valid invoices to the ERP; send exceptions to a human reviewer
  6. Reconcile: match against the PO or expected values
  7. Archive: store both the original PDF and the extracted data

A well-designed pipeline handles 90-95% of invoices automatically and routes the rest to humans for review.
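The validate and route steps are where most of the custom logic lives, so here is a minimal sketch of them. Everything in it is an assumption made for illustration: the `ExtractedInvoice` shape, the rule that the total equals line items plus tax, the one-year date window, and the queue names `erp` and `human_review`.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from decimal import Decimal


@dataclass
class ExtractedInvoice:
    vendor_name: str
    invoice_date: date
    line_totals: list  # one Decimal per line item
    tax: Decimal
    total: Decimal


def find_problems(invoice, known_vendors, today):
    """Return a list of human-readable validation failures (empty if none)."""
    problems = []
    # Assumption: the stated total is line items plus tax, compared exactly.
    expected = sum(invoice.line_totals, Decimal("0")) + invoice.tax
    if expected != invoice.total:
        problems.append(f"total {invoice.total} != computed {expected}")
    # Assumption: "reasonable" means not in the future and under a year old.
    if not (today - timedelta(days=365) <= invoice.invoice_date <= today):
        problems.append("invoice date out of range")
    if invoice.vendor_name not in known_vendors:
        problems.append("unknown vendor")
    return problems


def route(invoice, known_vendors, today):
    """Send clean invoices to the ERP queue and exceptions to human review."""
    problems = find_problems(invoice, known_vendors, today)
    return ("human_review", problems) if problems else ("erp", [])
```

Returning the list of problems along with the queue name gives the reviewer a starting point instead of a bare "failed validation" flag.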

Schema design

Before extraction, define what you need:

{
  "invoice_number": "string",
  "invoice_date": "date",
  "vendor_name": "string",
  "vendor_address": "string",
  "due_date": "date",
  "line_items": [
    {
      "description": "string",
      "quantity": "number",
      "unit_price": "number",
      "line_total": "number"
    }
  ],
  "subtotal": "number",
  "tax": "number",
  "total": "number"
}

A clear schema with types makes extraction easier to validate and route.
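In code, the same schema can be expressed as typed classes so that a malformed extraction fails loudly at the boundary. This is a sketch using only the Python standard library. It assumes dates arrive as ISO 8601 strings (YYYY-MM-DD), and `invoice_from_dict` is a hypothetical helper name.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass
class LineItem:
    description: str
    quantity: Decimal
    unit_price: Decimal
    line_total: Decimal


@dataclass
class Invoice:
    invoice_number: str
    invoice_date: date
    vendor_name: str
    vendor_address: str
    due_date: date
    line_items: list
    subtotal: Decimal
    tax: Decimal
    total: Decimal


def _dec(value):
    # Going through str() avoids binary floating-point artifacts in amounts.
    return Decimal(str(value))


def invoice_from_dict(raw):
    """Convert a raw extraction dict into a typed Invoice.

    Raises KeyError for a missing field and ValueError or
    decimal.InvalidOperation for a value that does not parse.
    """
    items = [
        LineItem(
            description=str(item["description"]),
            quantity=_dec(item["quantity"]),
            unit_price=_dec(item["unit_price"]),
            line_total=_dec(item["line_total"]),
        )
        for item in raw["line_items"]
    ]
    return Invoice(
        invoice_number=str(raw["invoice_number"]),
        invoice_date=date.fromisoformat(raw["invoice_date"]),
        vendor_name=str(raw["vendor_name"]),
        vendor_address=str(raw["vendor_address"]),
        due_date=date.fromisoformat(raw["due_date"]),
        line_items=items,
        subtotal=_dec(raw["subtotal"]),
        tax=_dec(raw["tax"]),
        total=_dec(raw["total"]),
    )
```

`Decimal` is used for money so that later checks, such as comparing the total to the sum of line items, are exact.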

Confidence scoring

Good extraction tools return confidence per field:

  • Total: $1,234.56 (confidence 0.97)
  • Invoice number: INV-2026-0042 (confidence 0.99)
  • Vendor name: "AcmeCop" (confidence 0.62). Note the low score; this is probably an OCR error.

Use confidence thresholds to route low-confidence extractions to human review. A threshold of 0.95 typically catches genuine ambiguity while passing routine cases.
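Threshold routing takes only a few lines. The sketch below assumes each field arrives as a `(value, confidence)` pair with confidence on a 0-1 scale; real tools expose confidence in their own formats, so treat the input shape and the name `route_by_confidence` as placeholders.

```python
def route_by_confidence(fields, threshold=0.95):
    """Return the queue name and the fields whose confidence is below threshold.

    fields maps a field name to a (value, confidence) pair.
    """
    low = {
        name: confidence
        for name, (_value, confidence) in fields.items()
        if confidence < threshold
    }
    return ("human_review", low) if low else ("auto", low)


fields = {
    "total": ("$1,234.56", 0.97),
    "invoice_number": ("INV-2026-0042", 0.99),
    "vendor_name": ("AcmeCop", 0.62),
}
queue, low_fields = route_by_confidence(fields)
```

With the three example fields, the document goes to review because of `vendor_name` alone, and the reviewer can be shown just that field.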

Common document types

Different document types have different optimal tools:

Invoices. AWS Textract Analyze Expense, Rossum, Hypatos. Excellent accuracy in 2026.

Receipts. Veryfi, Klippa, Textract. Great for expense management.

Forms (applications, surveys). Google Document AI custom processors; Textract Analyze Document.

IDs and licenses. Textract Analyze ID, Klippa ID verification, or the ID document model in Azure AI Document Intelligence. Highly regulated; use approved providers.

Lab reports. Healthcare-specific document platforms or custom-trained models.

Contracts. Kira, eBrevia, and similar contract-analysis tools for legal review and clause extraction.

Resumes/CVs. Dedicated resume parsers such as Textkernel or Affinda, or the parsing built into HR platforms.

Financial statements. Specialized fintech extraction tools.

For each type, the dedicated tool usually outperforms general-purpose AI.

Privacy and compliance

Unless you self-host, extracting data from PDFs means sending content to a third-party service. For sensitive content:

  • Healthcare data. Use a provider that offers HIPAA-eligible services and will sign a business associate agreement (BAA); most major cloud services do. See HIPAA-compliant PDF handling.
  • Financial data. Ensure the provider has a current SOC 2 report and meets any sector-specific requirements that apply to you.
  • EU personal data. Verify GDPR compliance, including EU data residency if required. See GDPR and PDF documents.
  • Highly confidential. Self-host the extraction.

See risks of using AI on confidential PDFs.

Combining with human review

Even at the accuracy levels listed earlier, pure AI extraction leaves residual errors that many high-stakes workflows cannot accept. The right pattern is AI plus human review:

  1. AI extracts and assigns confidence scores
  2. High-confidence extractions flow through automatically
  3. Low-confidence extractions go to a human reviewer who confirms or corrects
  4. Reviewed extractions are added to a training corpus for ongoing model improvement

This hybrid approach handles 90-95% of documents with no human touch while ensuring the rest get attention.
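Steps 3 and 4 of that loop can be sketched as a single function that merges a reviewer's corrections and records each one as a training example. The record layout (`field`, `predicted`, `corrected`) and the name `apply_review` are assumptions for illustration, not a format any particular tool requires.

```python
def apply_review(extracted, corrections):
    """Merge reviewer corrections and collect them as training examples.

    extracted and corrections both map field names to values; corrections
    contains only the fields the reviewer changed or confirmed.
    """
    final = {**extracted, **corrections}
    examples = [
        {"field": name, "predicted": extracted.get(name), "corrected": value}
        for name, value in corrections.items()
        if extracted.get(name) != value  # confirmations add no new signal
    ]
    return final, examples
```

Storing the model's original prediction next to the corrected value makes it possible to measure later whether retraining actually fixed the recurring mistakes.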

Cost

Pricing varies:

  • AWS Textract: ~$0.0015-$0.05 per page depending on processing type
  • Google Document AI: similar per-page pricing
  • Rossum, Klippa: per-document pricing, often $0.05-$0.50 per document
  • Custom self-hosted: fixed infrastructure cost; per-document cost approaches zero

For high volumes, self-hosting can be cheaper but has setup and maintenance overhead.
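A quick break-even calculation shows where "high volumes" starts. The sketch below uses $0.05 per page, the upper end of the per-page range quoted in this section, and a purely hypothetical self-hosting cost of $2,000 per month; substitute your own figures.

```python
from decimal import Decimal


def monthly_cloud_cost(pages, price_per_page):
    """Cost of processing `pages` pages at a flat per-page price."""
    return pages * price_per_page


def break_even_pages(fixed_monthly_cost, price_per_page):
    """Monthly page count at which a fixed self-hosting cost equals cloud spend."""
    return fixed_monthly_cost / price_per_page


PRICE_PER_PAGE = Decimal("0.05")       # upper end of the range quoted in this section
SELF_HOSTED_MONTHLY = Decimal("2000")  # hypothetical infrastructure plus upkeep

pages_needed = break_even_pages(SELF_HOSTED_MONTHLY, PRICE_PER_PAGE)
```

Under these assumptions the break-even point is 40,000 pages per month. The calculation ignores engineering time for setup and maintenance, which in practice pushes the break-even volume higher.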

Common gotchas

Hallucinations. AI may invent fields not in the document. Validate extractions against the source.
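One inexpensive check, sketched here under simplifying assumptions, is to confirm that every extracted value literally appears in the document text. It only works if the extractor returns values as written in the document; normalized dates or amounts would be flagged incorrectly. The name `ungrounded_fields` is hypothetical.

```python
import re


def normalize(text):
    """Collapse whitespace and lowercase so line breaks do not cause misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def ungrounded_fields(extracted, source_text):
    """Return names of fields whose value does not appear in the source text."""
    haystack = normalize(source_text)
    return [
        name
        for name, value in extracted.items()
        if value is not None and normalize(str(value)) not in haystack
    ]
```

Anything this returns is a candidate for human review rather than proof of a hallucination, since OCR differences can also cause a mismatch.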

Confidence calibration. Different tools score confidence differently; calibrate thresholds for your tool.
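Calibrating can be as simple as measuring, on a hand-labeled sample, how accurate the extractions above each candidate threshold really are. In this sketch the sample format, a list of `(confidence, was_correct)` pairs, and both function names are assumptions.

```python
def accuracy_above(samples, threshold):
    """Share of correct extractions among those at or above the threshold.

    samples is a list of (confidence, was_correct) pairs from a labeled set.
    Returns None when nothing passes the threshold.
    """
    passed = [was_correct for confidence, was_correct in samples if confidence >= threshold]
    if not passed:
        return None
    return sum(passed) / len(passed)


def lowest_threshold_meeting(samples, target_accuracy, candidates):
    """Return the lowest candidate threshold whose accuracy meets the target."""
    for threshold in sorted(candidates):
        accuracy = accuracy_above(samples, threshold)
        if accuracy is not None and accuracy >= target_accuracy:
            return threshold
    return None
```

Choosing the lowest threshold that meets the target keeps as many documents as possible on the automatic path. A labeled sample of a few hundred fields per document type gives a far more stable answer than a handful.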

Mixed-language documents. These may confuse field recognition. Specify the expected languages where the tool allows it.

Image-only PDFs. Run OCR first; OCR quality determines extraction quality. See PDF OCR explained.

Tables across pages. Tables that span page breaks may extract incorrectly. Pre-process if needed.
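If the extractor returns one table fragment per page, a small post-processing step can stitch them back together. This sketch assumes each fragment is a list of rows, that all fragments belong to the same table, and that a continuation page may repeat the header row. `merge_table_fragments` is a hypothetical name.

```python
def merge_table_fragments(fragments):
    """Join per-page table fragments into one table, dropping repeated headers.

    Each fragment is a list of rows; the first row of the first non-empty
    fragment is treated as the header.
    """
    fragments = [fragment for fragment in fragments if fragment]
    if not fragments:
        return []
    header = fragments[0][0]
    merged = [header]
    for fragment in fragments:
        for row in fragment:
            if row == header:
                continue  # header repeated at the top of a continuation page
            merged.append(row)
    return merged
```

After merging, re-running the "total equals sum of line items" check is a cheap way to confirm that no rows were lost at the page break.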

Unusual layouts. Custom-designed invoices from boutique vendors may fail. Build a fallback to human review.

Number formats. "$1,234.56" and "1.234,56 €" are the same amount written in different locales. Verify locale handling.
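A safe approach is to never guess the separator: pass the decimal separator in explicitly, taken from the vendor's or document's known locale. The helper below is a minimal sketch of that idea, and `parse_amount` is a hypothetical name.

```python
import re
from decimal import Decimal


def parse_amount(text, decimal_sep):
    """Parse a formatted amount given the document's decimal separator."""
    if decimal_sep not in (".", ","):
        raise ValueError("decimal_sep must be '.' or ','")
    # Drop currency symbols, spaces, and anything else that is not part of the number.
    digits = re.sub(r"[^0-9.,\-]", "", text)
    group_sep = "," if decimal_sep == "." else "."
    digits = digits.replace(group_sep, "").replace(decimal_sep, ".")
    return Decimal(digits)


us_amount = parse_amount("$1,234.56", decimal_sep=".")
eu_amount = parse_amount("1.234,56 €", decimal_sep=",")
```

Both calls yield the same `Decimal` value, 1234.56. Passing the wrong separator silently produces a wrong amount, which is why the locale should come from metadata rather than be inferred from a single number.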

Date formats. "01/05/2026" is January 5 or May 1 depending on locale. Verify and normalize.
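The same rule applies to dates: make the day/month order an explicit input and store the result in ISO 8601 form. A minimal sketch, assuming slash-separated dates with four-digit years; `parse_date` is a hypothetical helper.

```python
from datetime import datetime


def parse_date(text, day_first):
    """Parse a slash-separated date using an explicit day/month order."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    return datetime.strptime(text, fmt).date()


us_reading = parse_date("01/05/2026", day_first=False).isoformat()
eu_reading = parse_date("01/05/2026", day_first=True).isoformat()
```

Here `us_reading` is "2026-01-05" and `eu_reading` is "2026-05-01". A date such as "13/05/2026" raises `ValueError` under the month-first reading, which is a useful signal that the assumed locale is wrong.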

Currency conversion. A USD invoice processed by an EU system needs explicit currency identification.

Special characters. Curly quotes, em dashes, and other non-ASCII characters may confuse extraction.

Document quality. Crumpled scans, poor lighting, and partial pages reduce accuracy. Preprocess where possible.

Real-world workflow examples

Accounts payable automation:

  1. Invoice arrives in email
  2. Automated pipeline extracts vendor, invoice number, amount, line items
  3. Validates against PO
  4. Routes to approver
  5. After approval, sends to ERP for payment

Reduces invoice processing time from days to hours.

Lab result intake:

  1. Lab sends PDF report
  2. AI extracts patient ID, test results, reference ranges
  3. Matches against the EHR record
  4. Notifies clinician of abnormal results

Replaces manual data entry and accelerates clinical workflow.

Expense reporting:

  1. Employee photographs receipt
  2. AI extracts merchant, date, amount, category
  3. Pre-fills expense report
  4. Employee reviews and submits

Cuts time per expense from minutes to seconds.

Takeaway

AI data extraction from PDFs has matured into a production-ready technology in 2026. Cloud services (AWS Textract, Google Document AI, Azure AI Document Intelligence) handle common documents with 90%+ accuracy; specialized tools (Rossum for invoices, Veryfi for receipts) do even better in their domains. Design schemas thoughtfully, set and calibrate confidence thresholds, pair AI with human review for high-stakes cases, and check provider compliance before sending sensitive data. For browser-based PDF operations alongside extraction workflows, Docento.app handles common tasks. For related topics, see AI PDF summarization explained, how to convert a PDF to JSON, and risks of using AI on confidential PDFs.