
Classifying PDFs at Scale With Machine Learning

April 3, 2026

Organizations that handle thousands of PDFs a day have a routine problem: incoming documents arrive without labels. Is this a contract or an invoice? A resume or a cover letter? A purchase order or a packing slip? Manual sorting is slow and inconsistent. Machine learning classifiers do the job in milliseconds to a few seconds per document, and on well-defined categories they can match or exceed the consistency of human reviewers. This guide covers the practical landscape of PDF classification in 2026.

Where classification fits

Classification is rarely the goal in itself. It is the routing step that decides what happens next:

  • An invoice goes to accounts payable extraction.
  • A resume goes to the recruiting workflow.
  • A signed contract goes to the contract repository.
  • An unknown document goes to a human review queue.

Without reliable classification, every downstream automation has to handle every possible type. With it, each downstream system gets a homogeneous stream of one document type and can use specialized extractors.

Types of classifiers

Rule-based. If the document contains the word "invoice" and a dollar sign and a date pattern, call it an invoice. Cheap and fast. Brittle and high-maintenance.
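As a minimal sketch of the rule described above, the snippet below checks for the word "invoice", a dollar amount, and a date. The regular expressions and the `classify_by_rules` name are illustrative assumptions, not a recommended rule set.

```python
import re

# Illustrative patterns only; a real rule set is usually longer and tuned per intake.
INVOICE_WORD = re.compile(r"\binvoice\b", re.IGNORECASE)
DOLLAR_AMOUNT = re.compile(r"\$\s?\d[\d,]*(?:\.\d{2})?")
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b|\b\d{4}-\d{2}-\d{2}\b")


def classify_by_rules(text: str) -> str:
    """Return "invoice" when all three signals are present, else "unknown"."""
    signals = (INVOICE_WORD, DOLLAR_AMOUNT, DATE_PATTERN)
    if all(pattern.search(text) for pattern in signals):
        return "invoice"
    return "unknown"


print(classify_by_rules("Invoice #1042 dated 2026-03-01, total due $1,250.00"))
print(classify_by_rules("Please find my resume attached."))
# The brittleness shows quickly: a letter that merely mentions an invoice also matches.
print(classify_by_rules("Thanks for the invoice; I paid $40 on 3/1/26."))
```

The third call is the maintenance problem in miniature: every false positive needs another rule or exception.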

Classical ML on text features. TF-IDF plus logistic regression or gradient-boosted trees. Surprisingly competitive for clean, distinct categories. Trains in seconds; runs in milliseconds.
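A TF-IDF plus logistic regression baseline might look like the sketch below, assuming scikit-learn is installed. The four training texts are made-up placeholders; a real run would use the labeled examples discussed later in this post.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data for illustration only.
train_texts = [
    "invoice number total amount due payment terms",
    "invoice bill to amount due tax total",
    "resume work experience education skills",
    "curriculum vitae experience skills references",
]
train_labels = ["invoice", "invoice", "resume", "resume"]

# Word and bigram TF-IDF features feeding a linear classifier.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)


def classify(text: str) -> tuple:
    """Return (label, confidence) for one document's text."""
    probabilities = model.predict_proba([text])[0]
    best = probabilities.argmax()
    return str(model.classes_[best]), float(probabilities[best])


print(classify("amount due on this invoice"))
print(classify("skills and work experience"))
```

With this little data the confidence values are barely above chance; they only become useful for routing once the training set is realistic.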

Embedding-plus-classifier. Embed the document text with a model like bge-base or text-embedding-3-small, then train a small classifier on the embeddings. Strong baseline; easy to retrain as categories evolve.

Fine-tuned transformer. Fine-tune a BERT, DeBERTa, or LayoutLM model on labeled examples. State of the art for hard cases.

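Vision and layout models. LayoutLMv3 combines text, bounding-box positions, and page images; Donut reads the page image directly without OCR; Microsoft's MarkupLM targets markup-based documents such as HTML rather than scanned pages. Right when layout matters, e.g. distinguishing a W-2 from a 1099 by visual structure.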

Zero-shot with LLMs. Hand the document text and a list of categories to a GPT-4-class model, Claude Sonnet, or Gemini Pro, and ask the model to pick the best fit. Works out of the box without training data; expensive at scale.
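The zero-shot approach mostly comes down to building a constrained prompt and parsing the reply defensively. The sketch below is provider-agnostic: `call_model` is a hypothetical callable you would wire to whichever API you use, and the prompt wording and `max_chars` limit are assumptions.

```python
from typing import Callable, Sequence


def build_prompt(text: str, categories: Sequence[str], max_chars: int = 4000) -> str:
    """Build a prompt that restricts the answer to the given category names."""
    options = "\n".join(f"- {name}" for name in categories)
    return (
        "Classify the document into exactly one of these categories:\n"
        f"{options}\n\n"
        "Reply with the category name only.\n\n"
        f"Document:\n{text[:max_chars]}"
    )


def parse_label(reply: str, categories: Sequence[str], fallback: str = "unknown") -> str:
    """Map the model's reply onto a known category, or fall back."""
    cleaned = reply.strip().strip(".\"'").lower()
    for name in categories:
        if cleaned == name.lower():
            return name
    return fallback


def classify_zero_shot(
    text: str, categories: Sequence[str], call_model: Callable[[str], str]
) -> str:
    # call_model is hypothetical: it takes a prompt string and returns the reply text.
    return parse_label(call_model(build_prompt(text, categories)), categories)


# Usage with a stand-in for a real API call.
categories = ["invoice", "resume", "contract"]
print(classify_zero_shot("Invoice #1042 ...", categories, lambda prompt: " Invoice.\n"))
```

Falling back to "unknown" on any reply outside the list keeps a chatty or off-script answer from leaking an invalid label into routing.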

Choosing an approach

A rough decision tree:

  1. Five distinct categories, plenty of labeled examples: classical ML. Done in an afternoon.
  2. Dozens of subtle categories, layout matters: LayoutLM or Donut.
  3. No labeled data yet, want to ship today: zero-shot LLM. Use to bootstrap a labeled set, then retrain a cheap classifier on that set.
  4. Mostly text-distinct categories, want a strong baseline: embedding-plus-classifier.

In practice, many teams run a tiered system: a cheap classifier for most documents, an LLM fallback for low-confidence cases.
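A tiered setup can be sketched in a few lines. Both classifiers here are passed in as callables, and the 0.9 threshold is an arbitrary placeholder rather than a recommended value.

```python
from typing import Callable, Tuple

Prediction = Tuple[str, float]  # (label, confidence)


def classify_tiered(
    text: str,
    cheap: Callable[[str], Prediction],
    llm: Callable[[str], str],
    threshold: float = 0.9,
) -> Tuple[str, str]:
    """Return (label, tier), calling the LLM only when the cheap model is unsure."""
    label, confidence = cheap(text)
    if confidence >= threshold:
        return label, "cheap"
    return llm(text), "llm"


# Stand-in classifiers for illustration.
def cheap_stub(text: str) -> Prediction:
    return ("invoice", 0.97) if "invoice" in text.lower() else ("other", 0.40)


def llm_stub(text: str) -> str:
    return "contract"


print(classify_tiered("Invoice #1042", cheap_stub, llm_stub))
print(classify_tiered("This agreement is made between...", cheap_stub, llm_stub))
```

Returning the tier alongside the label makes it easy to measure what fraction of traffic reaches the expensive path.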

Preparing data

Whatever the approach, the input is text and (optionally) page images. Steps:

  1. Extract text from each PDF. Native text where possible; OCR otherwise. See PDF OCR explained.
  2. Limit input size. The first page or two is often enough to classify. Skip the appendix. Saves cost and improves accuracy.
  3. Normalize. Lowercase, strip non-content (headers, footers, page numbers).
  4. Include metadata where useful: number of pages, filename, sender (for email-arrived PDFs).
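Steps 2 and 3 can be sketched as a small function over already-extracted page text. The page-number pattern and the two-page default are assumptions; detecting repeated headers and footers would need extra logic not shown here.

```python
import re
from typing import Sequence

# Matches lines such as "3", "Page 3", "Page 3 of 10", or "3/10".
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d+(?:\s*(?:of|/)\s*\d+)?\s*$", re.IGNORECASE)


def prepare_text(pages: Sequence[str], max_pages: int = 2) -> str:
    """Keep the first pages, drop page-number lines, lowercase, collapse whitespace."""
    kept_lines = []
    for page in pages[:max_pages]:
        for line in page.splitlines():
            if PAGE_NUMBER.match(line):
                continue
            kept_lines.append(line)
    text = " ".join(kept_lines).lower()
    return re.sub(r"\s+", " ", text).strip()


pages = [
    "INVOICE  #1042\nAcme Corp\nPage 1 of 3",
    "Total due: $1,250.00\n2",
    "Appendix A: terms and conditions",
]
print(prepare_text(pages))
```

Note that this heuristic also drops any content line that is just a number, which is usually acceptable for classification but not for extraction.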

Bias the training data to match the production distribution. A classifier trained 50/50 on contracts and invoices will misbehave if production is 95% invoices.

Labels and label hygiene

Categories are the foundation. They must be:

  • Mutually exclusive. A document should belong to one category, not three.
  • Collectively exhaustive. Add an "other" or "unknown" category for anything that does not fit.
  • Stable. Avoid renaming or splitting categories mid-training without revisiting all labels.

Label at least 50 to 100 examples per category for a classical or embedding-based classifier; 500 to 1,000 for a fine-tuned transformer. For LLM zero-shot, you do not need labels, but you do need a clear category definition for the prompt.

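For ambiguous documents, label the most likely category and add a "low confidence" flag. The flag lets you audit or down-weight these examples later, and a classifier trained on genuinely ambiguous cases tends to report lower confidence on similar documents.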

Production architecture

A typical PDF classification service has:

  1. Intake. Receive PDF from email, upload, API.
  2. Extract. Pull text and (optionally) render page 1 as image.
  3. Classify. Run through the model. Return a label plus a confidence score.
  4. Route. High confidence: send to the appropriate downstream automation. Low confidence: send to human review.
  5. Feedback. Capture human corrections. Periodically retrain.

The feedback loop matters most. Without it, accuracy decays as document types drift. With it, the model improves continuously.

Evaluation

Accuracy alone is misleading on imbalanced data. Track:

  • Per-class precision and recall. A 95% accurate classifier that misses every example of a rare class is useless for that class.
  • Confusion matrix. Which classes get confused for each other? Often reveals labeling issues or genuinely overlapping categories.
  • Confidence calibration. When the model says 80% confidence, is it right 80% of the time? Miscalibrated models break routing thresholds.
  • Cost per document. Compute and API costs at scale.
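The first and third metrics above are easy to compute by hand. The sketch below uses only the standard library; the function names and the five-bin default are assumptions for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple


def per_class_metrics(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, Tuple[float, float]]:
    """Return {label: (precision, recall)} for every label seen."""
    true_positive = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted, actual = Counter(y_pred), Counter(y_true)
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        precision = true_positive[label] / predicted[label] if predicted[label] else 0.0
        recall = true_positive[label] / actual[label] if actual[label] else 0.0
        metrics[label] = (precision, recall)
    return metrics


def calibration_table(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 5
) -> List[Tuple[float, float, float, int]]:
    """Return (bin_lower_edge, mean_confidence, accuracy, count) per non-empty bin."""
    bins = defaultdict(list)
    for confidence, is_correct in zip(confidences, correct):
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))
    rows = []
    for index in sorted(bins):
        items = bins[index]
        mean_confidence = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((index / n_bins, mean_confidence, accuracy, len(items)))
    return rows


# 95% accuracy overall, yet the rare class is never found.
y_true = ["invoice"] * 19 + ["contract"]
y_pred = ["invoice"] * 20
print(per_class_metrics(y_true, y_pred))
print(calibration_table([0.95, 0.92, 0.91, 0.55], [True, True, False, True]))
```

In a well-calibrated model, the mean confidence and the accuracy in each row are close to each other; a large gap in the top bin is the one that breaks auto-routing thresholds.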

Hold out a recent, in-production sample as the evaluation set. Synthetic or old test data understates real-world degradation.

Handling layout-sensitive types

Some document types look very similar in text but differ in layout: tax forms, government forms, standardized industry documents. Pure text classification confuses them. Solutions:

  • LayoutLM family uses bounding boxes and visual features alongside text.
  • Donut is an OCR-free model that works directly on page images, so OCR errors never enter the pipeline.
  • Vision-language models (GPT-4o, Claude Sonnet with vision) can classify from page images directly, no training required.

For high-volume layout-sensitive workloads, a fine-tuned layout model is the right answer. For low volume, a multimodal LLM is faster to ship.

Privacy and compliance

Classifying PDFs means reading their content. If documents contain PII, PHI, or financial data, the classifier (and any hosted service it runs in) must comply with the relevant regulations.

Common gotchas

Class drift. Production documents change over time. A classifier trained on 2024 invoices may miss the 2026 formats. Retrain at least quarterly.

Long documents. A 200-page document classified by its first paragraph may be miscategorized. For mixed documents, classify each section or page separately.

Multi-class documents. Some PDFs are bundles, e.g. a contract with attached invoices. Either split before classifying or label as "bundle" and have a downstream splitter.
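One way to handle both long and bundled documents is to classify each page and group consecutive pages that share a label. This is a sketch under the assumption that a page-level classifier already exists; `classify_page` is a hypothetical callable.

```python
from itertools import groupby
from typing import Callable, List, Sequence, Tuple


def split_bundle(
    pages: Sequence[str], classify_page: Callable[[str], str]
) -> List[Tuple[str, int, int]]:
    """Return (label, first_page, last_page) segments, with pages numbered from 1."""
    labels = [classify_page(page) for page in pages]
    segments = []
    page_number = 1
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, page_number, page_number + length - 1))
        page_number += length
    return segments


# Stand-in page classifier for illustration.
def page_stub(page: str) -> str:
    return "invoice" if "invoice" in page.lower() else "contract"


pages = ["Contract terms", "Contract signatures", "Invoice 1", "Invoice 2"]
print(split_bundle(pages, page_stub))
```

A single misclassified page in the middle of a document would create a spurious segment, so in practice some smoothing over neighboring pages is likely needed.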

Adversarial inputs. If users can name files or include arbitrary text, do not trust filenames or document titles as features. Train on body content.

Cost blow-up. LLM-based classification at $0.01 per document is fine at hundreds per day, brutal at millions. Tier with cheap classifiers first.
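The arithmetic is worth making explicit. The sketch below uses the $0.01 per-document figure from above; the 10% fallback rate and the zero cost for the cheap tier are illustrative assumptions.

```python
def daily_cost(
    docs_per_day: int,
    llm_cost_per_doc: float,
    fallback_rate: float = 1.0,
    cheap_cost_per_doc: float = 0.0,
) -> float:
    """Daily spend when only `fallback_rate` of documents reach the LLM."""
    llm_docs = docs_per_day * fallback_rate
    return docs_per_day * cheap_cost_per_doc + llm_docs * llm_cost_per_doc


print(f"${daily_cost(500, 0.01):,.2f}")                           # $5.00
print(f"${daily_cost(1_000_000, 0.01):,.2f}")                     # $10,000.00
print(f"${daily_cost(1_000_000, 0.01, fallback_rate=0.1):,.2f}")  # $1,000.00
```

At a million documents a day, the fallback rate is the number to watch: each percentage point sent to the LLM costs $100 a day at this price.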

Tools and libraries

Open source:

  • scikit-learn: classical ML; train an SVM or a gradient-boosted model in 20 lines.
  • transformers (Hugging Face): fine-tune BERT-class models.
  • LayoutLMv3, Donut, MarkupLM: layout-aware models.
  • spaCy: pipelines including text classification.
  • unstructured.io: PDF parsing with built-in classifiers for some document types.

Hosted:

  • Amazon Textract and Amazon Comprehend (AWS) for end-to-end pipelines.
  • Google Document AI for pre-built classifiers and custom training.
  • Azure Document Intelligence for pre-built models on common business documents.
  • Anthropic, OpenAI, Google APIs for zero-shot LLM classification.

Practical recipe

  1. Pick 5 to 20 document categories from your real intake. Add an "other" bucket.
  2. Label 50 examples per category from recent data.
  3. Baseline with embedding plus logistic regression. This shows how far a cheap approach gets you.
  4. If accuracy is enough (say, 95%+), ship it. If not, fine-tune or move to LayoutLM.
  5. Set a confidence threshold for auto-routing. Send below-threshold to human review.
  6. Build the feedback loop. Every human correction becomes a future training example.
  7. Retrain quarterly or whenever production accuracy drops.
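For step 5, one simple way to choose the threshold is to scan a held-out validation set for the lowest cutoff whose auto-routed documents meet a target precision. The function name and the target value below are assumptions, and the six data points are made up.

```python
from typing import Optional, Sequence


def pick_threshold(
    confidences: Sequence[float],
    correct: Sequence[bool],
    target_precision: float = 0.98,
) -> Optional[float]:
    """Lowest confidence cutoff whose auto-routed documents meet the target precision."""
    for threshold in sorted(set(confidences)):
        routed = [ok for c, ok in zip(confidences, correct) if c >= threshold]
        if routed and sum(routed) / len(routed) >= target_precision:
            return threshold
    return None  # no cutoff meets the target; keep everything in human review


confidences = [0.99, 0.97, 0.95, 0.80, 0.70, 0.60]
correct = [True, True, True, False, True, False]
print(pick_threshold(confidences, correct, target_precision=0.95))  # 0.95
```

Choosing the lowest qualifying cutoff maximizes how much traffic is auto-routed. On a small validation set the result is noisy, so it is worth re-running as feedback data accumulates.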

Takeaway

PDF classification is a mature problem with mature solutions. Start with the cheapest approach that meets your accuracy bar and only graduate when the data demands it. The big wins are in the operational details (clean labels, calibrated thresholds, feedback loops) more than in picking exotic models. For browser-based PDF preparation that often precedes classification (splitting bundles, redacting, signing), Docento.app handles common operations locally. For related topics, see AI data extraction from PDFs, building a RAG system with PDFs, and automating PDF workflows with n8n.
