poppler-utils Introduction: Lightweight CLI Tools for PDFs

poppler-utils is a collection of small, fast, single-purpose command-line tools for working with PDFs. Each tool does one job well: extract text, render to image, convert to HTML, report metadata, merge, or split. Together they form one of the most useful PDF toolkits on Unix-like systems. The package is in the repositories of virtually every Linux distribution and ships by default with many desktop installs. This guide is an introduction.

What poppler-utils is

poppler-utils is the command-line interface to the poppler library, an open-source PDF rendering library based on a fork of xpdf. The library powers many PDF viewers (Okular, Evince, GNOME Documents) and is widely used in production systems.

The "utils" part is a bundle of CLI tools:

pdftotext, extract text
pdftohtml, convert to HTML
pdftocairo, render to PNG, JPEG, TIFF, PDF, PostScript, EPS, SVG
pdfimages, extract embedded images
pdfinfo, display PDF metadata
pdfunite, merge PDFs
pdfseparate, split PDF into individual pages
pdfsig, verify digital signatures
pdfdetach, extract embedded files
pdftoppm, render to image (older, similar to pdftocairo)
pdffonts, list fonts used

Each tool is small and focused, with concise options and predictable behavior.

Installing poppler-utils

Debian / Ubuntu:

sudo apt install poppler-utils

Fedora:

sudo dnf install poppler-utils

macOS:

brew install poppler

Windows:

Less common, but available via packages like poppler on Chocolatey: choco install poppler. Or use WSL.

After installation, each tool is independently invocable.

pdftotext: extract text

The most-used tool of the bunch.

pdftotext input.pdf output.txt

Or to standard out:

pdftotext input.pdf -

Layout preservation:

pdftotext -layout input.pdf -

-layout preserves visual columns and spacing, which is essential for tables and multi-column documents.

Page range:

pdftotext -f 1 -l 5 input.pdf -

Pages 1 through 5 only.

Encoding:

pdftotext -enc UTF-8 input.pdf -

Forces UTF-8 output. This is already the default in poppler's pdftotext, but stating it explicitly makes scripts more robust.

Use cases:

Indexing PDFs for search
Pipelines that need raw text
Quick inspection without opening a reader
Feeding text to scripts and tools
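
For the indexing use case, one text file per page is often handier than a single dump, because a search hit can then point at a page number. The following is a minimal sketch, assuming a POSIX shell. The function name pdf_to_page_text and the page-NNNN.txt naming are illustrative choices, not part of poppler-utils.

```shell
# Hypothetical helper: write one text file per page (page-0001.txt, ...).
# Usage: pdf_to_page_text input.pdf outdir
pdf_to_page_text() {
  pdf=$1
  outdir=$2
  mkdir -p "$outdir" || return 1
  # pdfinfo prints a line such as "Pages:          12"
  pages=$(pdfinfo "$pdf" | awk '/^Pages:/ {print $2}')
  [ -n "$pages" ] || return 1
  i=1
  while [ "$i" -le "$pages" ]; do
    # -f and -l select a single page; -layout keeps columns aligned
    pdftotext -layout -f "$i" -l "$i" "$pdf" \
      "$outdir/$(printf 'page-%04d.txt' "$i")" || return 1
    i=$((i + 1))
  done
}
```

Called as pdf_to_page_text input.pdf pages, it leaves pages/page-0001.txt and so on. Running pdftotext once per page is slower than a single pass, so this suits small and medium documents.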

See how to convert a PDF to text and how to convert a PDF to Markdown for related workflows.

pdftohtml: convert to HTML

pdftohtml input.pdf

By default this writes a frameset: input.html, an index in input_ind.html, and the page content in inputs.html, plus any extracted images.

Useful options:

-c, complex output that reproduces the page layout more closely
-s, single HTML document containing all pages
-i, ignore images
-stdout, write to stdout

See how to convert a PDF to HTML.

pdftocairo: high-quality rendering

The modern rendering tool, replacing pdftoppm for most uses.

Render to PNG:

pdftocairo -png -r 300 input.pdf output

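Produces output-1.png, output-2.png, etc., at 300 DPI. Page numbers are zero-padded when the document has ten or more pages (output-01.png, output-02.png, ...).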

Render to JPEG:

pdftocairo -jpeg -r 150 input.pdf page
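
When only a cover thumbnail is needed, rendering every page is wasteful. The sketch below wraps pdftocairo's -singlefile and -scale-to options. The function name pdf_thumbnail and the 256-pixel default are assumptions made for illustration.

```shell
# Hypothetical helper: render page 1 as a PNG whose long side is N pixels.
# Usage: pdf_thumbnail input.pdf out_root [pixels]   (writes out_root.png)
pdf_thumbnail() {
  pdf=$1
  root=$2
  size=${3:-256}
  # -singlefile writes one page and omits the page-number suffix
  # -scale-to sets the long side of the page, in pixels
  pdftocairo -png -singlefile -f 1 -l 1 -scale-to "$size" "$pdf" "$root"
}
```

For example, pdf_thumbnail input.pdf cover 512 writes cover.png with a long side of 512 pixels.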

Re-render to a new PDF (regenerated through cairo; the result is not guaranteed to be smaller):

pdftocairo -pdf input.pdf output.pdf

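Render a page to SVG:

pdftocairo -svg -f 1 -l 1 input.pdf page-1.svg

For vector formats (PDF, PS, EPS, SVG) the last argument is the full output filename, not a prefix.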

See how to convert a PDF to image and how to convert a PDF to SVG.

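Render to TIFF (one file per page):

pdftocairo -tiff -r 300 input.pdf output

Produces output-1.tif, output-2.tif, etc. pdftocairo does not write multi-page TIFFs. To get one, combine the per-page files afterwards with a tool such as tiffcp from libtiff.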

See how to convert a PDF to TIFF.

pdfimages: extract embedded images

pdfimages -all input.pdf images

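Extracts every embedded image as a separate file: images-000.jpg, images-001.png, etc. With -all, JPEG, JPEG 2000, JBIG2, and CCITT images are written in their native format, CMYK images as TIFF, and everything else as PNG.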

Useful for:

Recovering source images from a PDF (you may also want to see how to replace an image in PDF)
Understanding what is inside a complex PDF
Forensic inspection
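
Before extracting anything, pdfimages -list gives an inventory of what the file contains. The sketch below tallies images by encoding. It assumes the listing has two header lines and that the encoding is the ninth column, which matches current poppler output but is not a stable interface. The function name pdf_image_summary is hypothetical.

```shell
# Hypothetical helper: count embedded images per encoding (jpeg, image, ccitt, ...).
# Assumes two header lines and "enc" as the 9th whitespace-separated column.
pdf_image_summary() {
  pdfimages -list "$1" |
    awk 'NR > 2 { count[$9]++ } END { for (enc in count) print enc, count[enc] }' |
    sort
}
```

A file dominated by jpeg entries is probably a scan or a photo-heavy document, which is a hint that OCR may be needed for text.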

pdfinfo: display metadata

pdfinfo input.pdf

Output includes:

Title, Author, Subject, Keywords
Creator, Producer
CreationDate, ModDate
Pages, page size
File size
PDF version
Tagged status
Encryption status

For programmatic parsing:

pdfinfo input.pdf | grep "Pages:"
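
The grep above returns the whole line, label included. When a script needs only the value, a small awk filter is enough. This is a sketch; pdf_field is a hypothetical name, and it assumes the "Key:   value" layout that pdfinfo prints.

```shell
# Hypothetical helper: print the value of one pdfinfo field.
# Usage: pdf_field input.pdf Pages
pdf_field() {
  pdfinfo "$1" | awk -v key="$2" '
    index($0, key ":") == 1 {     # line starts with "Key:"
      sub(/^[^:]*:[ \t]*/, "")    # drop the label and its padding
      print
      exit
    }'
}
```

For example, pages=$(pdf_field input.pdf Pages) stores just the number, and pdf_field input.pdf "Page size" handles keys that contain a space.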

See how to edit PDF metadata.

pdfunite: merge PDFs

pdfunite file1.pdf file2.pdf file3.pdf combined.pdf

The simplest way to merge PDFs on Linux. The last argument is the output file; no options are needed for the basic case. See how to combine PDF files.

pdfseparate: split PDF into pages

pdfseparate input.pdf page-%d.pdf

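Produces page-1.pdf, page-2.pdf, etc. The pattern accepts printf-style variants, so page-%03d.pdf gives zero-padded names that sort in page order.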

pdfseparate -f 1 -l 5 input.pdf page-%d.pdf

Only pages 1-5. See how to split a PDF.
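
pdfseparate always writes one file per page, so extracting a page range as a single PDF takes two steps: separate, then reunite. The sketch below does that in a temporary directory. The function name pdf_extract_range is hypothetical.

```shell
# Hypothetical helper: copy pages FIRST..LAST of a PDF into one new PDF.
# Usage: pdf_extract_range input.pdf first last output.pdf
pdf_extract_range() {
  src=$1
  first=$2
  last=$3
  out=$4
  tmp=$(mktemp -d) || return 1
  # %06d zero-pads page numbers so the glob below expands in page order
  if pdfseparate -f "$first" -l "$last" "$src" "$tmp/p-%06d.pdf"; then
    pdfunite "$tmp"/p-*.pdf "$out"
    status=$?
  else
    status=1
  fi
  rm -rf "$tmp"
  return "$status"
}
```

For example, pdf_extract_range input.pdf 3 7 excerpt.pdf writes a five-page excerpt.pdf.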

pdfsig: verify signatures

pdfsig input.pdf

Reports each digital signature in the file with its status (valid, invalid, unknown).

For verifying integrity in forensic or compliance workflows. See how to detect tampered PDFs and digital signatures vs electronic signatures.

pdfdetach: extract embedded files

pdfdetach -saveall input.pdf -o output_dir/

Extracts every embedded attachment to output_dir/. See hidden data in PDFs explained.

pdffonts: list fonts

pdffonts input.pdf

Lists every font used in the file with its encoding and whether it is embedded.

Useful for:

Verifying fonts are embedded for portability
Identifying missing or non-embedded fonts
Compliance checks for print workflows
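
For the embedding check, the pdffonts table can be filtered mechanically. The sketch below prints fonts whose emb column is "no". It assumes the usual two header lines and reads columns from the right, because font types such as "Type 1" or "CID TrueType" contain spaces. The function name pdf_unembedded_fonts is hypothetical.

```shell
# Hypothetical helper: print the names of fonts that are not embedded.
# Counting from the right, the last five fields of each row are:
# emb, sub, uni, object number, generation.
pdf_unembedded_fonts() {
  pdffonts "$1" | awk 'NR > 2 && NF >= 7 && $(NF - 4) == "no" { print $1 }'
}
```

Empty output means every listed font is embedded. A font name containing spaces would be truncated to its first word by this filter.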

Common pipelines

poppler-utils tools shine in pipelines:

Extract text and search:

pdftotext -layout input.pdf - | grep -i "confidential"
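
The same idea extends to a whole directory tree. This sketch lists every PDF whose extracted text matches a pattern. The function name pdf_search is hypothetical, and filenames containing newlines are not handled.

```shell
# Hypothetical helper: list PDFs under DIR whose text contains PATTERN.
# Usage: pdf_search pattern [dir]
pdf_search() {
  pattern=$1
  dir=${2:-.}
  find "$dir" -type f -name '*.pdf' | while IFS= read -r f; do
    # Errors from damaged or encrypted files are silenced; such files never match
    if pdftotext -layout "$f" - 2>/dev/null | grep -qi -- "$pattern"; then
      printf '%s\n' "$f"
    fi
  done
}
```

For example, pdf_search confidential ~/Documents prints one matching path per line.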

Render pages and count:

pdftocairo -png -r 100 input.pdf page && ls page-*.png | wc -l

Burst, OCR each page, recombine:

mkdir -p ocr
# Zero-pad page numbers so the shell globs expand in page order
pdfseparate input.pdf page-%03d.pdf
for p in page-*.pdf; do
  ocrmypdf "$p" "ocr/$p"
done
pdfunite ocr/page-*.pdf final.pdf

Inspect metadata and content:

pdfinfo input.pdf > metadata.txt
pdftotext input.pdf - > text.txt
pdffonts input.pdf > fonts.txt

When to use poppler-utils

Lightweight, fast PDF operations on Linux/macOS
Pipelines that chain multiple tools
Scripting in shell, Python, or any language
Read-only operations (text extraction, info, rendering)
As the rendering engine in your applications via the poppler library

poppler-utils does not modify PDFs structurally (apart from pdfunite and pdfseparate). For structural edits, use qpdf. For forms and stamps, pdftk. For compression and conversion, Ghostscript.

Strengths

Speed. Tools are small and fast.
Simplicity. Each tool does one thing well.
Reliability. Mature, well-tested.
Wide availability. Packaged for virtually every Linux distribution and pre-installed on many desktop installs.
Open source. Can be modified and embedded.

Weaknesses

No structural editing. Cannot encrypt, decrypt, or modify metadata directly.
Limited form support. No form filling or extraction.
Some operations are slower than alternatives. For large-scale rendering, MuPDF's mutool is often faster.
OCR is not included. Use OCRmyPDF or Tesseract.

poppler-utils vs alternatives

poppler-utils for text extraction, info, rendering, fast and simple
qpdf for structure, see qpdf introduction
Ghostscript for compression and conversion, see Ghostscript introduction
pdftk for forms and stamps, see pdftk introduction
MuPDF/mutool for fast rendering, see MuPDF introduction

A typical Linux PDF workflow chains these tools together.

Common gotchas

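Encoding issues. poppler's pdftotext defaults to UTF-8, but the original xpdf build of pdftotext, which some systems install under the same name, defaults to Latin1. Pass -enc UTF-8 explicitly in scripts.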

Layout in tables. Without -layout, tabular text comes out interleaved. Always use -layout for tables.

Hyphenation at line breaks. pdftotext does not re-join hyphenated words. Post-process if needed.

Custom encodings. Some PDFs use fonts with no usable Unicode mapping, so pdftotext produces gibberish for them. OCR is the fallback.

Multi-column documents. pdftotext -layout mostly works but may interleave columns in tricky layouts.

Image quality. pdftocairo and pdftoppm render at 150 DPI by default. Use -r 300 or higher for print-quality output.

Watermarks and decoration. pdftotext extracts all text, including watermarks and footers. Filter as needed.

Encrypted PDFs. Use -upw user_pw or -opw owner_pw to provide passwords for encrypted PDFs.
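
The hyphenation gotcha above can be handled with a small post-processing filter. The sketch below joins a line ending in letter-plus-hyphen with the line that follows it. It is a heuristic: it also joins genuine compounds that happen to break at a line end ("well-" / "known" becomes "wellknown"). The function name dehyphenate is hypothetical.

```shell
# Hypothetical filter: re-join words hyphenated across line breaks.
# Usage: pdftotext input.pdf - | dehyphenate
dehyphenate() {
  awk '
    { line = carry $0; carry = "" }
    # Line ends in letter + hyphen: hold it back, minus the hyphen
    line ~ /[A-Za-z]-$/ { carry = substr(line, 1, length(line) - 1); next }
    { print line }
    # Input ended on a hyphenated line: emit it unchanged
    END { if (carry != "") print carry "-" }
  '
}
```

It works best on output produced without -layout, where continuation lines do not start with padding spaces.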

Practical recipe

For a typical "extract content from a PDF" job:

pdfinfo input.pdf, confirm structure and metadata
pdftotext -layout input.pdf text.txt, extract text
pdfimages -all input.pdf images, extract images
pdffonts input.pdf, list fonts for portability check
Process text and images downstream
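
The steps above can be sketched as one shell function that collects everything in an output directory. The function name pdf_extract_all and the file layout (metadata.txt, text.txt, fonts.txt, images/) are illustrative assumptions.

```shell
# Hypothetical helper: dump metadata, text, fonts, and images into OUTDIR.
# Usage: pdf_extract_all input.pdf outdir
pdf_extract_all() {
  pdf=$1
  outdir=$2
  mkdir -p "$outdir/images" || return 1
  pdfinfo "$pdf" > "$outdir/metadata.txt" || return 1
  pdftotext -layout "$pdf" "$outdir/text.txt" || return 1
  pdffonts "$pdf" > "$outdir/fonts.txt" || return 1
  # The last argument is a filename prefix: images/img-000.jpg, img-001.png, ...
  pdfimages -all "$pdf" "$outdir/images/img" || return 1
}
```

Stopping at the first failing step keeps a damaged or encrypted input from leaving a half-filled directory that looks complete.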

Takeaway

poppler-utils is the lightweight, fast, scriptable CLI suite for reading and rendering PDFs on Unix-like systems. Each tool does one job exceptionally well: text extraction, HTML conversion, image rendering, info, merging, splitting. Combined with qpdf, pdftk, and Ghostscript, it forms the backbone of any serious Linux PDF workflow. For browser-based one-off operations, Docento.app handles many similar tasks visually. For related CLI tools, see Ghostscript introduction, qpdf introduction, pdftk introduction, and MuPDF introduction.