Multimodal LLMs and PDF Documents

The shift from text-only LLMs to multimodal models (text plus vision) has reshaped what AI can do with PDFs. A 2023 LLM had to depend on text extraction; if the extraction was lossy or the document was scanned, the model worked from garbage. A 2026 multimodal LLM looks at the page as an image and reads it the way a human does, ignoring extraction entirely. This guide covers what changed, what is now possible, and where multimodal PDF work still falls short.

What "multimodal" means here

A multimodal LLM accepts both text and images as input. For PDFs:

You render each page as a PNG or JPEG.
You pass the image(s) plus a text prompt to the model.
The model returns text in response.

The big four in 2026: GPT-4o and the o-series (OpenAI), Claude Sonnet 4 and Opus 4 (Anthropic), Gemini 2.5 Pro (Google), and various open multimodal models (Llama 3.2 Vision, Qwen-VL, InternVL).

Why this matters for PDFs

A traditional PDF pipeline depends on:

Text extraction. Works for native PDFs, fails partially or completely for scans.
Layout reconstruction. Trying to recover paragraphs, columns, tables from positioned text fragments.
OCR for scanned content. Adds another lossy step.

A multimodal pipeline skips all three. The model sees what is on the page and reads it. Scanned, handwritten, multi-column, figure-heavy, none of it breaks the model the way it breaks extraction pipelines.

Strengths of multimodal PDF work

Scanned documents. No OCR step needed; the model reads from the image.
Forms. Read structured forms with handwritten or typed values.
Tables. Recovered correctly even with merged cells or no visible lines. See extracting tables from PDFs with AI.
Figures and charts. Describe a chart's content, extract data series from a bar chart, read an org diagram.
Equations. Read math from images, convert to LaTeX or describe.
Handwriting. Modern multimodal models read printed handwriting reasonably well; cursive is harder.
Page layout. The model perceives reading order, columns, sidebars.
Multi-language. No need for language-specific OCR; multimodal models cover most major languages out of the box.

Weaknesses

Cost. Image tokens are expensive. A high-resolution page can cost 5 to 20 times a text-only call.
Latency. Vision inference is slower; multi-page documents are slow.
Resolution tradeoffs. Too low and text becomes unreadable; too high and you pay more without quality gain. Most providers recommend 1024-2048 px on the long side.
Long documents. Even with long-context multimodal, processing every page as image hits context and cost ceilings.
Precise spatial work. "What is at coordinates (300, 450)?" is brittle. Better: ask about content, not coordinates.
Tables can still misalign. Multimodal does better than extraction but not perfectly.

The right answer for many real workflows is hybrid: text extraction where it works, multimodal where it does not.

Common workflows

Read a scanned letter.

Render the page; prompt "Transcribe this letter verbatim."

Faster and often more accurate than running Tesseract.

Extract a structured field set from an invoice.

Render the first page; prompt "Extract vendor, invoice number, total, line items, and due date as JSON."

Specialized invoice services still beat raw multimodal for high volume, but multimodal is fastest to ship.

Read a complex table.

Render the page; prompt "Extract the table on this page as Markdown. Preserve exact text. Treat merged cells by repeating values."

Beats extraction libraries for messy tables, costs more.

Describe a figure.

Render the page; prompt "Describe the figure on this page. Include axis labels, units, and what the figure shows."

Useful for accessibility (alt text generation) and for research summarization.

Classify a document.

Render the first page; prompt "Classify this document into one of: invoice, contract, resume, packing slip, other."

See classifying PDFs with machine learning for when this is the right approach.

Side-by-side comparison.

Render pages from two documents; prompt "Compare these two pages. What changed?"

Especially useful for redlines or contract revisions. See how to compare two PDFs.

Prompting multimodal LLMs

A few patterns:

Be specific about output format. "Return as JSON," "Use Markdown," "One line per row." Without specification the model picks for you.
Constrain hallucination. "If you cannot read a field, return null. Do not guess." Combined with a temperature near zero.
Ground in the image. "Quote the exact text on the page. Do not paraphrase."
Tell the model what kind of document. "This is a US W-2 tax form. Extract Box 1, Box 2, ..." Domain context boosts accuracy noticeably.
Use few-shot examples. For consistent JSON, show one or two examples.

Cost engineering

Multimodal costs add up fast. Strategies:

Process only the relevant page. Most documents do not need every page imaged.
Down-sample to the minimum readable resolution. Often 1024 px is enough.
Cache results. Hash the page image; reuse the extraction.
Tier with cheap text extraction. Run extraction first; only fall back to multimodal when extraction fails.
Batch where supported. Some providers allow multiple images in one call at a discount.
Pick the cheapest model that meets quality. Haiku and Mini-tier models often suffice for simple tasks.

A common cost-cutting approach: text-first pipeline with multimodal fallback for scanned and table-heavy pages.

Privacy considerations

Sending PDFs (especially scanned medical records, legal documents, financial filings) to a hosted multimodal LLM is a privacy decision. Considerations:

Provider data handling. Verify retention and training-data policies. Major providers offer enterprise plans with contractual guarantees.
Regional hosting. EU residency, US residency, dedicated tenants.
Open-source multimodal. Llama 3.2 Vision, Qwen-VL, InternVL run locally. Quality lags top hosted models but is good for many tasks.

See risks of using AI on confidential PDFs for the broader picture.

Multimodal in RAG

In a RAG pipeline, multimodal capabilities show up in two places:

Ingest time. Use multimodal to extract content from pages where text extraction is lossy. Store the extracted text in the index.
Query time. For specific questions that need to see the original layout (tables, figures), re-render the cited page and pass to a multimodal model.

The second pattern is sometimes called "image retrieval RAG" or "ColPali-style RAG" and is increasingly common for image-heavy corpora.

See building a RAG system with PDFs for the broader pipeline.

Limits in 2026

What multimodal still cannot reliably do:

Exact coordinate-level work. "Place a stamp at (400, 600)." Models often miscalibrate.
Pixel-perfect comparisons. "Find the 1-pixel difference between these two pages."
Very long documents. A 500-page PDF as 500 images blows budgets.
Highly specialized notation. Some scientific or engineering notation outside the training distribution.
Reasoning over many figures. Compare 20 chart figures across a report and synthesize. Models do parts of this; aggregation is brittle.

For these jobs, dedicated tools or human review still beat multimodal.

Tools and APIs

The major hosted multimodal options:

OpenAI GPT-4o, o-series: high quality on vision; broad capability.
Anthropic Claude Sonnet, Opus: strong vision; longer context windows.
Google Gemini 2.5 Pro: long context; native multimodal training; strong on figures.
Mistral Pixtral: open-weights vision model from Mistral.

Local / open:

Llama 3.2 Vision (11B, 90B).
Qwen-VL family.
InternVL family.
Florence-2 for fine-grained vision tasks.

For PDFs specifically, dedicated services like Reducto, Unstructured.io, and Apryse are built on multimodal under the hood and expose simpler APIs for document workflows.

Practical recipe

For a one-off multimodal PDF task:

Render the relevant page to a 1024-2048 px image (pdftoppm or pdf2image).
Pick a provider based on cost and privacy needs.
Write a specific prompt with format and grounding instructions.
Call the API with temperature near zero.
Verify the output against the page visually.

For production pipelines:

Tier: text extraction first, multimodal fallback.
Cache: hash inputs, reuse outputs.
Sample-evaluate: 5 to 10 percent human review.
Monitor costs: alerts on per-document spend.

Takeaway

Multimodal LLMs have closed many of the gaps in traditional PDF processing pipelines: scanned documents, handwritten content, complex tables, charts, figures. The cost is higher per page, but for the right tasks it is dramatically faster to ship than building specialized pipelines. The best modern workflows are hybrid: cheap extraction first, multimodal where it matters. For browser-based PDF preparation that often precedes multimodal extraction (splitting into the relevant pages, cropping, redacting), Docento.app keeps the file local. See also building a RAG system with PDFs, extracting tables from PDFs with AI, and AI data extraction from PDFs.