poppler-utils is a collection of small, fast, single-purpose command-line tools for working with PDFs. Each tool does one job well: extract text, render to image, convert to HTML, get info, merge, split. Together they form one of the most useful PDF toolkits on Unix-like systems, and they are often installed by default on desktop Linux distributions. This guide is an introduction.
What poppler-utils is
poppler-utils is the command-line interface to the poppler library, an open-source PDF rendering library based on a fork of xpdf. The library powers many PDF viewers (Okular, Evince, GNOME Documents) and is widely used in production systems.
The "utils" part is a bundle of CLI tools:
- pdftotext, extract text
- pdftohtml, convert to HTML
- pdftocairo, render to PNG, JPEG, PDF, SVG, PostScript
- pdfimages, extract embedded images
- pdfinfo, display PDF metadata
- pdfunite, merge PDFs
- pdfseparate, split PDF into individual pages
- pdfsig, verify digital signatures
- pdfdetach, extract embedded files
- pdftoppm, render to image (older, similar to pdftocairo)
- pdffonts, list fonts used
Each tool is small and focused, with concise options and predictable behavior.
Installing poppler-utils
Debian / Ubuntu:
sudo apt install poppler-utils
Fedora:
sudo dnf install poppler-utils
macOS:
brew install poppler
Windows:
Less common, but a poppler package is available on Chocolatey (choco install poppler). Alternatively, install the Linux package under WSL.
After installation, each tool is independently invocable.
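Because each tool is a separate binary, a script can check up front that the ones it needs are on the PATH. A minimal sketch; the function name missing_tools is made up for this example and is not part of poppler-utils:

```shell
# Print the name of every requested tool that is not found on PATH.
# missing_tools is a hypothetical helper name, not part of poppler-utils.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage: list any poppler-utils tools this machine lacks.
# missing_tools pdftotext pdftocairo pdfinfo pdfimages pdfunite pdfseparate
```

Empty output means everything requested was found.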
pdftotext: extract text
The most-used tool of the bunch.
pdftotext input.pdf output.txt
Or to standard out:
pdftotext input.pdf -
Layout preservation:
pdftotext -layout input.pdf -
-layout preserves visual columns and spacing, which is essential for tables and multi-column documents.
Page range:
pdftotext -f 1 -l 5 input.pdf -
Pages 1 through 5 only.
Encoding:
pdftotext -enc UTF-8 input.pdf -
Forces UTF-8 output (already the default in poppler's pdftotext).
Use cases:
- Indexing PDFs for search
- Pipelines that need raw text
- Quick inspection without opening a reader
- Feeding text to scripts and tools
See how to convert a PDF to text and how to convert a PDF to Markdown for related workflows.
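For indexing or pipeline use, the same command is usually run over a whole directory. The loop below is one way to do that, assuming the PDFs sit in a single directory and the .txt files should land next to them; pdfs_to_text is an illustrative name:

```shell
# Convert every PDF in a directory to a .txt file alongside it.
# pdfs_to_text is a hypothetical helper name; -layout and -enc are the
# pdftotext options described above.
pdfs_to_text() {
  dir=${1:-.}
  for pdf in "$dir"/*.pdf; do
    [ -e "$pdf" ] || continue            # glob matched nothing
    pdftotext -layout -enc UTF-8 "$pdf" "${pdf%.pdf}.txt" ||
      printf 'failed: %s\n' "$pdf" >&2   # report and keep going on a bad file
  done
}

# Usage: pdfs_to_text ./reports
```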
pdftohtml: convert to HTML
pdftohtml input.pdf
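In its default mode this writes a frameset (input.html), a page index (input_ind.html), and a single content document (inputs.html), plus any extracted images. With -c, each page is written to its own input-N.html file.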
Useful options:
- -c, complex output that tries to preserve the page layout
- -s, single HTML document containing all pages
- -i, ignore images
- -stdout, write to stdout
See how to convert a PDF to HTML.
pdftocairo: high-quality rendering
The modern rendering tool, replacing pdftoppm for most uses.
Render to PNG:
pdftocairo -png -r 300 input.pdf output
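Produces output-1.png, output-2.png, etc., at 300 DPI. Page numbers are zero-padded to the width of the page count, so a document with ten or more pages yields output-01.png, output-02.png, and so on.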
Render to JPEG:
pdftocairo -jpeg -r 150 input.pdf page
Re-render to a new PDF (the pages are redrawn through cairo, so the file size may shrink or grow):
pdftocairo -pdf input.pdf output.pdf
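Render a single page to SVG (for vector formats the last argument is the full output file name, so pick one page with -f and -l):
pdftocairo -svg -f 1 -l 1 input.pdf page-1.svg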
See how to convert a PDF to image and how to convert a PDF to SVG.
Render to TIFF (one file per page):
pdftocairo -tiff -r 300 input.pdf output
See how to convert a PDF to TIFF.
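A common variation on the rendering commands above is a thumbnail of the first page only. The sketch below uses pdftocairo's -singlefile and -scale-to options; the 400-pixel size, the default output name, and the function name are arbitrary choices for illustration:

```shell
# Render only page 1 as a PNG that fits in a 400 x 400 pixel box.
# first_page_thumb is a hypothetical helper; the default root "thumb" is an assumption.
first_page_thumb() {
  pdf=$1
  out_root=${2:-thumb}
  # -singlefile suppresses the page-number suffix, so this writes "$out_root.png".
  pdftocairo -png -singlefile -f 1 -l 1 -scale-to 400 "$pdf" "$out_root"
}

# Usage: first_page_thumb input.pdf cover   (writes cover.png)
```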
pdfimages: extract embedded images
pdfimages -all input.pdf images
Extracts every embedded image as a separate file: images-000.jpg, images-001.png, etc., in their original encoding.
Useful for:
- Recovering source images from a PDF (you may also want to see how to replace an image in PDF)
- Understanding what is inside a complex PDF
- Forensic inspection
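Before extracting anything, pdfimages -list prints a table of the embedded images, which is handy for inspection. The helper below counts the rows of that table; it assumes the listing begins with two header lines (column names and a dashed rule), and count_listed_images is a made-up name:

```shell
# Count the images in `pdfimages -list` output read from stdin.
# Assumes the listing starts with two header lines (column names and a
# dashed rule); count_listed_images is a hypothetical helper name.
count_listed_images() {
  awk 'NR > 2 && NF > 0 { n++ } END { print n + 0 }'
}

# Usage: pdfimages -list input.pdf | count_listed_images
```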
pdfinfo: display metadata
pdfinfo input.pdf
Output includes:
- Title, Author, Subject, Keywords
- Creator, Producer
- CreationDate, ModDate
- Pages, page size
- File size
- PDF version
- Tagged status
- Encryption status
For programmatic parsing:
pdfinfo input.pdf | grep "Pages:"
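The grep above returns the whole line, label included. When a script needs only the value, a small filter can strip the key and padding. This is a sketch that assumes the "Key: value" line format pdfinfo prints; pdfinfo_field is an illustrative name:

```shell
# Print the value of one "Key: value" field from pdfinfo output on stdin.
# pdfinfo_field is a hypothetical helper name.
pdfinfo_field() {
  awk -v key="$1" '
    index($0, key ":") == 1 {
      sub(/^[^:]*:[ \t]*/, "")   # drop the key, the colon, and the padding
      print
      exit
    }'
}

# Usage:
# pages=$(pdfinfo input.pdf | pdfinfo_field Pages)
```

Matching on the key plus its colon keeps "Pages" from also matching "Page size", and values that themselves contain colons (dates, for example) come through intact.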
pdfunite: merge PDFs
pdfunite file1.pdf file2.pdf file3.pdf combined.pdf
One of the quickest ways to merge PDFs on Linux. pdfunite takes any number of input files followed by the output file, and needs no options for the basic case. See how to combine PDF files.
pdfseparate: split PDF into pages
pdfseparate input.pdf page-%d.pdf
Produces page-1.pdf, page-2.pdf, etc.
pdfseparate -f 1 -l 5 input.pdf page-%d.pdf
Only pages 1-5. See how to split a PDF.
pdfsig: verify signatures
pdfsig input.pdf
Reports each digital signature in the file with its status (valid, invalid, unknown).
For verifying integrity in forensic or compliance workflows. See how to detect tampered PDFs and digital signatures vs electronic signatures.
pdfdetach: extract embedded files
pdfdetach -saveall input.pdf -o output_dir/
Extracts every embedded attachment to output_dir/. See hidden data in PDFs explained.
pdffonts: list fonts
pdffonts input.pdf
Lists every font used in the file with its encoding and whether it is embedded.
Useful for:
- Verifying fonts are embedded for portability
- Identifying missing or non-embedded fonts
- Compliance checks for print workflows
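For the embedding checks listed above, the pdffonts table can be filtered for rows whose "emb" column is "no". The sketch below assumes the usual column layout (two header lines, then rows ending in emb, sub, uni, and a two-part object ID); verify that against your poppler version before relying on it. unembedded_fonts is a made-up name:

```shell
# List fonts that are not embedded, reading `pdffonts` output from stdin.
# Assumed layout: two header lines, then rows ending in
#   emb sub uni object-number generation
# so "emb" is the fifth field from the right (the type column can contain
# spaces, which makes counting from the left unreliable).
# unembedded_fonts is a hypothetical helper name.
unembedded_fonts() {
  awk 'NR > 2 && NF >= 6 && $(NF - 4) == "no" { print $1 }'
}

# Usage: pdffonts input.pdf | unembedded_fonts
```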
Common pipelines
poppler-utils tools shine in pipelines:
Extract text and search:
pdftotext -layout input.pdf - | grep -i "confidential"
Render pages and count:
pdftocairo -png -r 100 input.pdf page && ls page-*.png | wc -l
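Burst, OCR each page, recombine (zero-padded page numbers keep the glob in page order, and the ocr/ directory must exist before ocrmypdf writes into it):
mkdir -p ocr
pdfseparate input.pdf page-%04d.pdf
for p in page-*.pdf; do
  ocrmypdf "$p" "ocr/$p"
done
pdfunite ocr/page-*.pdf final.pdf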
Inspect metadata and content:
pdfinfo input.pdf > metadata.txt
pdftotext input.pdf - > text.txt
pdffonts input.pdf > fonts.txt
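The extract-and-search pipeline extends naturally to a set of files. A minimal sketch that reports which PDFs mention a term; the function name and the choice of a case-insensitive fixed-string match are assumptions for the example:

```shell
# Print the name of each PDF whose extracted text contains a term
# (case-insensitive, fixed string). pdfs_containing is a hypothetical helper.
pdfs_containing() {
  term=$1
  shift
  for pdf in "$@"; do
    if pdftotext "$pdf" - 2>/dev/null | grep -qiF -- "$term"; then
      printf '%s\n' "$pdf"
    fi
  done
}

# Usage: pdfs_containing "confidential" reports/*.pdf
```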
When to use poppler-utils
- Lightweight, fast PDF operations on Linux/macOS
- Pipelines that chain multiple tools
- Scripting in shell, Python, or any language
- Read-only operations (text extraction, info, rendering)
- As the rendering engine in your applications via the poppler library
poppler-utils does not modify PDFs structurally (apart from pdfunite and pdfseparate). For structural edits, use qpdf. For operation-style commands such as cat, burst, and fill_form, use pdftk. For compression and conversion, use Ghostscript.
Strengths
- Speed. Tools are small and fast.
- Simplicity. Each tool does one thing well.
- Reliability. Mature, well-tested.
- Wide availability. Packaged by every major Linux distribution and often installed by default.
- Open source. Can be modified and embedded.
Weaknesses
- No structural editing. Cannot encrypt, decrypt, or modify metadata directly.
- Limited form support. No form filling or extraction.
- Some operations slower than alternatives. For large-scale rendering, MuPDF's mutool is faster.
- OCR is not included. Use OCRmyPDF or Tesseract.
poppler-utils vs alternatives
- poppler-utils for text extraction, info, and everyday rendering, with the simplest command lines
- qpdf for structure, see qpdf introduction
- Ghostscript for compression and conversion, see Ghostscript introduction
- pdftk for forms and stamps, see pdftk introduction
- MuPDF/mutool for fast rendering, see MuPDF introduction
A typical Linux PDF workflow chains these tools together.
Common gotchas
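Encoding issues. poppler's pdftotext writes UTF-8 by default, but the Xpdf build of pdftotext, which uses the same command name, defaults to Latin1. Pass -enc UTF-8 explicitly if a script may run against either.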
Layout in tables. Without -layout, tabular text comes out interleaved. Always use -layout for tables.
Hyphenation at line breaks. pdftotext does not re-join hyphenated words. Post-process if needed.
Custom encodings. Some PDFs have non-Unicode-mapped fonts. pdftotext produces gibberish. OCR is the fallback.
Multi-column documents. pdftotext -layout mostly works but may interleave columns in tricky layouts.
Image quality. pdftocairo renders raster output at 150 DPI unless told otherwise. Use -r 300 or higher for print-quality output.
Watermarks and decoration. pdftotext extracts all text, including watermarks and footers. Filter as needed.
Encrypted PDFs. Pass the password with -upw user_pw (user password) or -opw owner_pw (owner password).
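The hyphenation gotcha above can often be handled with a small post-processing filter. The awk sketch below joins a line ending in a letter plus hyphen with the line that follows. It is a heuristic, not a dictionary-aware fix, and dehyphenate is a made-up name:

```shell
# Rejoin words that pdftotext left split across lines as "exam-" / "ple".
# Heuristic sketch: it also merges genuine compounds broken at the hyphen
# ("well-" / "known" becomes "wellknown"), and it expects plain output
# without -layout indentation. dehyphenate is a hypothetical helper name.
dehyphenate() {
  awk '
    {
      line = carry $0
      carry = ""
      if (line ~ /[A-Za-z]-$/) {
        carry = substr(line, 1, length(line) - 1)   # hold the line back, minus the hyphen
        next
      }
      print line
    }
    END { if (carry != "") print carry "-" }        # last line ended in a hyphen
  '
}

# Usage: pdftotext input.pdf - | dehyphenate > text.txt
```

Note that the joined word pulls the rest of the following line up with it, so line breaks in the output no longer match the page.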
Practical recipe
For a typical "extract content from a PDF" job:
- pdfinfo input.pdf, confirm structure and metadata
- pdftotext -layout input.pdf text.txt, extract text
- pdfimages -all input.pdf images, extract images
- pdffonts input.pdf, list fonts for portability check
- Process text and images downstream
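The recipe can be wrapped in one function so that every artifact for a PDF lands in a single directory. This is a sketch under assumptions: the function name, the output directory layout, and the file names are all invented for illustration.

```shell
# Run the recipe steps for one PDF, writing everything into an output directory.
# extract_pdf_content and the output file names are hypothetical choices.
extract_pdf_content() {
  pdf=$1
  out=${2:-extracted}
  mkdir -p "$out/images" || return 1
  pdfinfo "$pdf" > "$out/info.txt" || return 1   # stop early if the file is unreadable
  pdftotext -layout "$pdf" "$out/text.txt"
  pdfimages -all "$pdf" "$out/images/img"
  pdffonts "$pdf" > "$out/fonts.txt"
}

# Usage: extract_pdf_content input.pdf out/
```

Running pdfinfo first doubles as a sanity check: if it fails, the remaining steps are skipped.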
Takeaway
poppler-utils is the lightweight, fast, scriptable CLI suite for reading and rendering PDFs on Unix-like systems. Each tool does one job exceptionally well: text extraction, HTML conversion, image rendering, info, merging, splitting. Combined with qpdf, pdftk, and Ghostscript, it forms the backbone of any serious Linux PDF workflow. For browser-based one-off operations, Docento.app handles many similar tasks visually. For related CLI tools, see Ghostscript introduction, qpdf introduction, pdftk introduction, and MuPDF introduction.