PDF OCR Explained: How Scanned Pages Become Searchable

OCR, Optical Character Recognition, is the difference between a PDF that's a stack of pictures and a PDF you can search, copy, and edit. The technology has improved dramatically in the last few years, but the basics of when and how to use it haven't changed. Here's everything most people need to know to use OCR confidently.

What OCR actually does

Take a scanned page. To a computer, it's a grid of pixels: colours, but no meaning. OCR analyses those pixels and produces text: characters, words, paragraphs. In a PDF, the OCR result is usually added as an invisible text layer aligned with the visible image, so the page still looks identical but is now searchable, copyable, and indexable.

This is the magic feature: the same PDF that looked like a scan now behaves like a real document.

When you need OCR

Run OCR when:

You scanned paper and want to search the result.
Someone sent you a PDF that's really just a series of phone photos of a contract.
You need to edit text in a scanned PDF.
You want to feed scanned documents to a search index, an LLM, or any text-based pipeline.
You want a PDF that's readable by screen readers for accessibility.

When you don't need OCR

Skip OCR when:

The PDF was born digital: exported from Word, Google Docs, LaTeX, etc. It already has a text layer.
The document will only be viewed, never searched or copied.
The "text" in the document is intentionally not text, for example, a scan of a hand-drawn sketch where the words are part of an artwork.

A quick test: open the PDF and try to highlight a word with your cursor. If the text gets selected, OCR is unnecessary. If you can only drag a selection box across the page image, OCR will help.
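
The same test can be automated when you have a folder of PDFs to sort. The sketch below is one possible approach, not a definitive check: it assumes the third-party pypdf library is installed, and the function names and the 20-characters-per-page cut-off are illustrative choices rather than anything standard.

```python
def looks_like_text_layer(page_texts, min_chars_per_page=20):
    """Return True if at least half the pages yielded a meaningful amount of text.

    The threshold is an arbitrary assumption; tune it for your documents.
    """
    if not page_texts:
        return False
    pages_with_text = sum(
        1 for text in page_texts if len((text or "").strip()) >= min_chars_per_page
    )
    return pages_with_text >= len(page_texts) / 2


def extract_page_texts(pdf_path):
    """Extract the text of each page with pypdf (imported lazily)."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def needs_ocr(pdf_path):
    """A PDF with no usable text layer is a candidate for OCR."""
    return not looks_like_text_layer(extract_page_texts(pdf_path))
```

Calling needs_ocr("contract.pdf") on a hypothetical file would return True for a pure image scan. A scan that someone has already OCR'd will return False, which is usually what you want.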

How modern OCR works (briefly)

Modern OCR has two stages:

Layout analysis: detect the page structure, where text blocks are, where images are, which columns belong together.
Character recognition: for each text block, recognise the characters, often using a neural network trained on text in many fonts and languages.
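
To make the first stage less abstract, here is a toy illustration of one classic layout-analysis idea: scan down the page and group consecutive rows that contain ink into text lines. It is a deliberately simplified sketch on a tiny hand-made bitmap, and real engines use far more sophisticated methods.

```python
def find_text_lines(bitmap):
    """Return (start_row, end_row) pairs for runs of rows that contain ink.

    bitmap is a list of rows; each row is a list of 0 (background) or 1 (ink).
    """
    lines = []
    start = None
    for y, row in enumerate(bitmap):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a new text line begins
        elif not has_ink and start is not None:
            lines.append((start, y - 1))   # the line ended on the previous row
            start = None
    if start is not None:                  # ink ran to the bottom of the page
        lines.append((start, len(bitmap) - 1))
    return lines


page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
print(find_text_lines(page))  # [(1, 2), (5, 5)]
```

Each detected band would then be handed to the character-recognition stage. The sketch also shows why skew hurts: on a tilted page the blank rows between lines disappear.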

Older OCR (early 2000s) used rule-based character matching. Modern OCR uses deep learning. The accuracy difference is enormous: modern OCR can hit 99%+ on clean scans, where older OCR managed 95% on a good day.

Choosing an OCR tool

The main free options:

Tesseract: open source, supports 100+ languages, runs anywhere from a phone to a server. Free, mature, well-documented.
PaddleOCR: newer, often more accurate on real-world scans, especially for non-English text.
EasyOCR: simpler API than Tesseract, good for prototypes.

Cloud OCR (Google Vision, Azure, AWS Textract) is often more accurate still, but it requires uploading your document. That is fine for non-sensitive material, not for confidential files.

For browser-based OCR without uploading: WebAssembly Tesseract works locally. Docento.app handles browser OCR without sending the file anywhere.

What affects accuracy

Garbage in, garbage out applies harder to OCR than almost anywhere else:

Resolution: 300 DPI is the standard minimum. 150 DPI scans cost noticeable accuracy. 600 DPI rarely helps over 300.
Contrast: faded printouts and dark scans both confuse OCR. Pre-process to high contrast black-on-white if you can.
Skew: tilted scans hurt accuracy a lot. Most OCR tools deskew automatically; check the output.
Compression artefacts: heavily JPEG-compressed scans introduce noise that OCR misreads.
Font and language: standard fonts in major languages are recognised almost perfectly; cursive script, handwriting, and minority languages are harder.
Layout: simple single-column text is easiest; multi-column with footnotes and tables is hardest.
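
Resolution is the easiest of these factors to check with arithmetic: divide the pixel dimensions of the scan by the physical size of the page. The sketch below assumes the page fills the whole image, which is true for flatbed scans but only roughly true for phone photos; the function names are illustrative.

```python
A4_INCHES = (8.27, 11.69)       # width, height of an A4 sheet
LETTER_INCHES = (8.5, 11.0)     # width, height of US Letter


def effective_dpi(width_px, height_px, width_in, height_in):
    """Return the lower of the horizontal and vertical pixels-per-inch."""
    return min(width_px / width_in, height_px / height_in)


def resolution_verdict(dpi, minimum=300):
    """Compare against the 300 DPI minimum mentioned above."""
    return "ok" if dpi >= minimum else "rescan recommended"


dpi = effective_dpi(1240, 1754, *A4_INCHES)
print(round(dpi), resolution_verdict(dpi))  # 150 rescan recommended
```

An A4 page needs roughly 2480 by 3508 pixels to reach 300 DPI.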

Pre-processing for better results

If accuracy matters, pre-process your scans before OCR:

Deskew so text lines are horizontal.
Increase contrast to make text fully black, background fully white.
Crop to remove edges that confuse layout analysis. See cropping a PDF.
Despeckle to remove dust and scanner artefacts.
Increase DPI by re-scanning if the original was below 300.
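
The contrast step usually means binarisation: picking a grey level and forcing every pixel to pure black or pure white. One common way to choose that level is Otsu's method, sketched below in plain Python on a flat list of grey values. This is an illustration of the idea, not necessarily what any particular OCR tool does internally; in practice you would use an image library rather than hand-rolled loops.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark from light pixels.

    pixels is a flat iterable of integers, 0 = black, 255 = white.
    Pixels at or below the returned level are treated as dark.
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        if weight_dark == 0:
            continue                      # no dark pixels yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break                         # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance: large when the two groups are well separated.
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [0 if value <= threshold else 255 for value in pixels]
```

On a faded printout, where the "black" text is really mid-grey, this picks a level between the text and the paper instead of relying on a fixed cut-off.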

Five minutes of pre-processing can turn a 95% accurate OCR into a 99% accurate one.
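
Those percentages are easier to judge as error counts: on a page of 3,000 characters, 95% accuracy means about 150 wrong characters and 99% means about 30. To measure accuracy on your own scans, type out a short passage by hand and compare it with the OCR output. The sketch below uses edit distance (the minimum number of single-character insertions, deletions, and substitutions), which is one common way to score OCR; the function names are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the OCR output
                current[j - 1] + 1,      # extra character in the OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, hypothesis):
    """Fraction of reference characters the OCR got right (0.0 to 1.0)."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1 - errors / len(reference))


print(character_accuracy("Total: $1,000.00", "Total: $1.000.00"))  # 0.9375
```

Run it on the same passage before and after pre-processing to see whether the extra effort paid off.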

Error patterns to watch for

OCR errors aren't random. They cluster around specific confusions:

0 vs O, 1 vs l vs I: numbers and letters that look alike.
rn vs m: in serif fonts.
comma vs period: in some fonts and at low resolution.
Decimal points vanishing or appearing: in numerical data this is critical to catch.
Hyphenation at line breaks: gets joined or split unpredictably.

For documents where numbers matter (invoices, tax forms, financial statements), reconcile a sample of values manually. The risk of a $1,000 charge becoming a $1.000 charge is real.
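
A small script can help decide which values to check first. The sketch below flags dollar amounts in OCR output that show the confusions listed above. It is a heuristic under stated assumptions: amounts are prefixed with "$" and follow US formatting (comma for thousands, period for decimals), so it would produce false alarms on documents that use a period as the thousands separator. It narrows down manual review; it does not replace it.

```python
import re

# A "$" followed by digits, separators, and the letters OCR confuses with digits.
AMOUNT = re.compile(r"\$\s?([0-9OolI.,]+)")


def flag_suspicious_amounts(text):
    """Return (token, reason) pairs for amounts that deserve a manual look."""
    findings = []
    for match in AMOUNT.finditer(text):
        token = match.group(1).rstrip(".,")   # drop sentence punctuation
        if not any(char.isdigit() for char in token):
            continue                          # not a number at all
        if re.search(r"[OolI]", token):
            findings.append((token, "letter inside a number"))
        elif re.search(r"\.\d{3}$", token):
            findings.append((token, "three digits after a period"))
        elif re.search(r",\d{1,2}$", token):
            findings.append((token, "comma used as decimal separator"))
    return findings


sample = "Fee $1.000 and tax $1,O00.00 and total $2,000.00."
for token, reason in flag_suspicious_amounts(sample):
    print(token, "->", reason)
```

On the sample line this flags "1.000" and "1,O00.00" and leaves "2,000.00" alone.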

After OCR: the searchable PDF

Most OCR tools produce a "searchable PDF": the original image plus an invisible text layer. The result:

Looks identical to the scan.
Can be searched in any PDF reader.
Lets you copy text out.
Can be edited (with limitations).
Is readable by screen readers.
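
As a concrete example, the Tesseract command-line tool can write a searchable PDF directly from a page image. The sketch below wraps that call in Python. It assumes the tesseract binary is installed and on your PATH, and that the input is an image file (PNG, TIFF, JPEG), since Tesseract does not read PDFs itself; the function names are illustrative.

```python
import subprocess
from pathlib import Path


def tesseract_pdf_command(image_path, language="eng"):
    """Build the command line: tesseract <image> <output base> -l <lang> pdf."""
    image = Path(image_path)
    output_base = image.with_suffix("")   # Tesseract appends ".pdf" itself
    return ["tesseract", str(image), str(output_base), "-l", language, "pdf"]


def make_searchable_pdf(image_path, language="eng"):
    """Run Tesseract and return the path of the PDF it wrote."""
    command = tesseract_pdf_command(image_path, language)
    subprocess.run(command, check=True)   # raises CalledProcessError on failure
    return Path(command[2] + ".pdf")


print(" ".join(tesseract_pdf_command("scan.png")))
# tesseract scan.png scan -l eng pdf
```

For a scan that is already a PDF, you would first export each page as an image, or use a tool that handles that step for you.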

To distribute the OCR'd version, see making a PDF searchable.

OCR for non-Latin scripts

Modern OCR handles non-Latin scripts well, but quality varies by language. Tesseract supports 100+ languages, but accuracy is highest on scripts with the most training data (Latin, Chinese, Arabic). Niche scripts may need specialised tools or careful pre-processing.

For multi-language documents, run OCR with multiple language packs. Most tools accept a list.
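
With Tesseract, that list is written as language codes joined by a plus sign, such as eng+deu. The sketch below shows the idea through the pytesseract wrapper. It assumes pytesseract and Pillow are installed along with the Tesseract language data for every code you pass; the helper names are illustrative.

```python
def language_argument(codes):
    """Join Tesseract language codes such as ["eng", "deu"] into "eng+deu"."""
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def ocr_image(image_path, codes):
    """Recognise text in an image using several languages at once."""
    from PIL import Image     # third-party: pip install pillow
    import pytesseract        # third-party: pip install pytesseract

    return pytesseract.image_to_string(
        Image.open(image_path), lang=language_argument(codes)
    )


print(language_argument(["eng", "deu"]))  # eng+deu
```

Only list the languages that actually appear in the document; every extra one gives the recogniser more ways to guess wrong.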

Handwriting OCR

Standard OCR is for printed text. Handwriting OCR (HTR, Handwritten Text Recognition) is a related but harder problem. Modern HTR works well on clean modern handwriting; cursive, old documents, and shorthand are still hard. If your documents are handwritten, look for tools that specifically advertise HTR.

Conclusion

OCR turns scans into real documents. Run it on anything you'll search, copy, or edit. Pre-process for better accuracy, reconcile critical numbers, and use the right tool for your language. Docento.app handles OCR in the browser without uploads, which is practical for occasional use on sensitive documents. For more, see making a PDF searchable and converting scanned PDFs to editable.