poppler-utils Introduction: Lightweight CLI Tools for PDFs

April 23, 2026·7 min read

poppler-utils is a collection of small, fast, single-purpose command-line tools for working with PDFs. Each tool does one job well: extract text, render to image, convert to HTML, report metadata, merge, split. Together they form one of the most useful PDF toolkits on Unix-like systems, and they come preinstalled on many desktop Linux distributions. This guide introduces each tool and the options you are most likely to need.

What poppler-utils is

poppler-utils is the command-line interface to the poppler library, an open-source PDF rendering library based on a fork of xpdf. The library powers many PDF viewers (Okular, Evince, GNOME Documents) and is widely used in production systems.

The "utils" part is a bundle of CLI tools:

  • pdftotext, extract text
  • pdftohtml, convert to HTML
  • pdftocairo, render to PNG, JPEG, PDF, SVG, PostScript
  • pdfimages, extract embedded images
  • pdfinfo, display PDF metadata
  • pdfunite, merge PDFs
  • pdfseparate, split PDF into individual pages
  • pdfsig, verify digital signatures
  • pdfdetach, extract embedded files
  • pdftoppm, render to image (older, similar to pdftocairo)
  • pdffonts, list fonts used

Each tool is small and focused, with concise options and predictable behavior.

Installing poppler-utils

Debian / Ubuntu:

sudo apt install poppler-utils

Fedora:

sudo dnf install poppler-utils

macOS:

brew install poppler

Windows:

Less common, but builds are available through Chocolatey (choco install poppler). Alternatively, run the Linux packages under WSL.

After installation, each tool is independently invocable.
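
Because packaging differs between platforms, a script may want to confirm which tools are actually on the PATH before relying on them. A minimal sketch, assuming a POSIX shell; the function name missing_poppler_tools is made up for this example:

```shell
# Print the name of every requested tool that is not found on PATH.
# missing_poppler_tools is a hypothetical helper, not part of poppler-utils.
missing_poppler_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example call (prints nothing when every tool is installed):
#   missing_poppler_tools pdftotext pdftohtml pdftocairo pdfimages pdfinfo
```

An empty result means every listed tool was found.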

pdftotext: extract text

The most-used tool of the bunch.

pdftotext input.pdf output.txt

Or to standard out:

pdftotext input.pdf -

Layout preservation:

pdftotext -layout input.pdf -

-layout preserves visual columns and spacing, which is essential for tables and multi-column documents.

Page range:

pdftotext -f 1 -l 5 input.pdf -

Pages 1 through 5 only.

Encoding:

pdftotext -enc UTF-8 input.pdf -

Force UTF-8 (the default on most systems).

Use cases:

  • Indexing PDFs for search
  • Pipelines that need raw text
  • Quick inspection without opening a reader
  • Feeding text to scripts and tools
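
For the indexing use case, one possible approach is to convert a whole directory of PDFs into plain-text files that grep or a search indexer can read. This is a sketch under assumptions: the function name and the two directory arguments are hypothetical, and only the pdftotext options shown earlier are used.

```shell
# Convert every PDF in pdf_dir to a .txt file of the same name in txt_dir.
# pdfs_to_text, pdf_dir and txt_dir are names invented for this sketch.
pdfs_to_text() {
  pdf_dir=$1
  txt_dir=$2
  mkdir -p "$txt_dir" || return 1
  for pdf in "$pdf_dir"/*.pdf; do
    [ -e "$pdf" ] || continue            # the glob matched nothing
    name=$(basename "$pdf" .pdf)
    pdftotext -layout -enc UTF-8 "$pdf" "$txt_dir/$name.txt" ||
      printf 'skipped: %s\n' "$pdf" >&2  # report failures and keep going
  done
}
```

After something like pdfs_to_text ./docs ./index, a plain grep -ril "confidential" ./index lists the matching documents.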

See how to convert a PDF to text and how to convert a PDF to Markdown for related workflows.

pdftohtml: convert to HTML

pdftohtml input.pdf

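By default this writes a frameset (input.html), the converted content (inputs.html), and an outline page (input_ind.html), plus any extracted images.

Useful options:

  • -c, complex output that preserves the page layout (written as one input-N.html file per page)
  • -s, single HTML document containing all pages
  • -i, ignore images
  • -stdout, write to stdout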

See how to convert a PDF to HTML.

pdftocairo: high-quality rendering

The modern rendering tool, replacing pdftoppm for most uses.

Render to PNG:

pdftocairo -png -r 300 input.pdf output

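Produces output-1.png, output-2.png, etc., at 300 DPI. Page numbers are zero-padded when the document has ten or more pages (output-01.png, output-02.png, and so on).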

Render to JPEG:

pdftocairo -jpeg -r 150 input.pdf page

Re-render to a new PDF (the result is often, but not always, smaller):

pdftocairo -pdf input.pdf output.pdf

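Render one page to SVG (for vector formats the last argument is the full output filename, so pick the page with -f and -l):

pdftocairo -svg -f 1 -l 1 input.pdf page-1.svg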

See how to convert a PDF to image and how to convert a PDF to SVG.

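Render to TIFF (one file per page):

pdftocairo -tiff -r 300 input.pdf output

Produces output-1.tif, output-2.tif, etc. pdftocairo does not write multi-page TIFFs; combine the pages with a separate tool if you need a single file.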

See how to convert a PDF to TIFF.

pdfimages: extract embedded images

pdfimages -all input.pdf images

Extracts every embedded image as a separate file: images-000.jpg, images-001.png, etc., in their original encoding.

Useful for:

  • Recovering source images from a PDF (you may also want to see how to replace an image in PDF)
  • Understanding what is inside a complex PDF
  • Forensic inspection

pdfinfo: display metadata

pdfinfo input.pdf

Output includes:

  • Title, Author, Subject, Keywords
  • Creator, Producer
  • CreationDate, ModDate
  • Pages, page size
  • File size
  • PDF version
  • Tagged status
  • Encryption status

For programmatic parsing:

pdfinfo input.pdf | grep "Pages:"
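
grep returns the whole line, label included. When a script needs only the value, a small awk filter is one option. This is a sketch that assumes the usual "Key:   value" layout of pdfinfo output; the helper name pdfinfo_field is hypothetical.

```shell
# Read pdfinfo output on stdin and print the value of one field.
# pdfinfo_field is a made-up helper name for this sketch.
pdfinfo_field() {
  awk -v key="$1" '
    index($0, key ":") == 1 {            # line starts with "Key:"
      value = substr($0, length(key) + 2)
      sub(/^[ \t]+/, "", value)          # drop the alignment padding
      print value
      exit
    }
  '
}

# Example call:
#   pages=$(pdfinfo input.pdf | pdfinfo_field Pages)
```

Matching on the full key plus colon keeps "Pages" from also matching the "Page size" line.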

See how to edit PDF metadata.

pdfunite: merge PDFs

pdfunite file1.pdf file2.pdf file3.pdf combined.pdf

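One of the quickest ways to merge PDFs on Linux. The last argument is the output file; every argument before it is an input, merged in the order given. No options are needed for the basic case. See how to combine PDF files.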

pdfseparate: split PDF into pages

pdfseparate input.pdf page-%d.pdf

Produces page-1.pdf, page-2.pdf, etc.

pdfseparate -f 1 -l 5 input.pdf page-%d.pdf

Only pages 1-5. See how to split a PDF.

pdfsig: verify signatures

pdfsig input.pdf

Reports each digital signature in the file with its status (valid, invalid, unknown).

For verifying integrity in forensic or compliance workflows. See how to detect tampered PDFs and digital signatures vs electronic signatures.

pdfdetach: extract embedded files

pdfdetach -saveall input.pdf -o output_dir/

Extracts every embedded attachment to output_dir/. See hidden data in PDFs explained.

pdffonts: list fonts

pdffonts input.pdf

Lists every font used in the file with its encoding and whether it is embedded.

Useful for:

  • Verifying fonts are embedded for portability
  • Identifying missing or non-embedded fonts
  • Compliance checks for print workflows
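
To turn the embedding check into something a script can act on, the pdffonts table can be filtered. The sketch below assumes the column layout printed by current poppler releases, where each row ends with the emb, sub, and uni flags followed by the two-number object ID, and it assumes font names contain no spaces. The helper name is hypothetical.

```shell
# Read pdffonts output on stdin and print the names of fonts not embedded.
# Assumes each row ends with: emb sub uni <object number> <generation>.
# non_embedded_fonts is a made-up helper name for this sketch.
non_embedded_fonts() {
  awk 'NR > 2 && NF >= 7 && $(NF - 4) == "no" { print $1 }'
}

# Example call:
#   pdffonts input.pdf | non_embedded_fonts
```

Counting fields from the end of the row avoids problems with font types that contain spaces, such as "Type 1C" or "CID TrueType".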

Common pipelines

poppler-utils tools shine in pipelines:

Extract text and search:

pdftotext -layout input.pdf - | grep -i "confidential"

Render pages and count:

pdftocairo -png -r 100 input.pdf page && ls page-*.png | wc -l

Burst, OCR each page, recombine:

mkdir -p ocr
# Zero-pad page numbers so the shell glob sorts pages in order
pdfseparate input.pdf page-%03d.pdf
for p in page-*.pdf; do
  ocrmypdf "$p" "ocr/$p"
done
pdfunite ocr/page-*.pdf final.pdf

Inspect metadata and content:

pdfinfo input.pdf > metadata.txt
pdftotext input.pdf - > text.txt
pdffonts input.pdf > fonts.txt

When to use poppler-utils

  • Lightweight, fast PDF operations on Linux/macOS
  • Pipelines that chain multiple tools
  • Scripting in shell, Python, or any language
  • Read-only operations (text extraction, info, rendering)
  • As the rendering engine in your applications via the poppler library

poppler-utils does not modify PDFs structurally (apart from pdfunite and pdfseparate). For structural edits, use qpdf. For verb-based operations, pdftk. For compression and conversion, Ghostscript.

Strengths

  • Speed. Tools are small and fast.
  • Simplicity. Each tool does one thing well.
  • Reliability. Mature, well-tested.
  • Wide availability. Preinstalled on many desktop Linux distros and packaged for nearly all of them.
  • Open source. Can be modified and embedded.

Weaknesses

  • No structural editing. Cannot encrypt, decrypt, or modify metadata directly.
  • Limited form support. No form filling or extraction.
  • Some operations slower than alternatives. For large-scale rendering, MuPDF's mutool is faster.
  • OCR is not included. Use OCRmyPDF or Tesseract.

poppler-utils vs alternatives

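The comparison comes down to scope: poppler-utils reads and renders, qpdf handles structural edits, pdftk handles verb-style operations, Ghostscript handles compression and conversion, and MuPDF's mutool is the faster choice for large-scale rendering. A typical Linux PDF workflow chains these tools together.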

Common gotchas

Encoding issues. The output encoding can differ between builds and configurations, so do not rely on the default in scripts. Pass -enc UTF-8 explicitly.

Layout in tables. Without -layout, tabular text comes out interleaved. Always use -layout for tables.

Hyphenation at line breaks. pdftotext does not re-join hyphenated words. Post-process if needed.
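
One possible post-processing step is a filter that joins a line ending in a letter plus hyphen to the line that follows it. This is a rough sketch, not a linguistic solution: it also merges genuine compounds that happen to break at the hyphen (well-known becomes wellknown), and it suits plain output better than -layout output, where continuation lines may start with padding. The name dehyphenate is hypothetical.

```shell
# Join words split across lines by a trailing hyphen (text on stdin).
# dehyphenate is a made-up helper name for this sketch.
dehyphenate() {
  awk '
    {
      line = carry $0
      carry = ""
      if (line ~ /[A-Za-z]-$/) {
        # Drop the hyphen and hold the fragment for the next line.
        carry = substr(line, 1, length(line) - 1)
        next
      }
      print line
    }
    END { if (carry != "") print carry "-" }   # last line ended in a hyphen
  '
}

# Example call:
#   pdftotext input.pdf - | dehyphenate > text.txt
```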

Custom encodings. Some PDFs have non-Unicode-mapped fonts. pdftotext produces gibberish. OCR is the fallback.

Multi-column documents. pdftotext -layout mostly works but may interleave columns in tricky layouts.

Image quality. pdftocairo renders raster output at 150 DPI unless told otherwise. Use -r 300 or higher for print-quality output.
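
The resolution translates directly into pixel dimensions: PDF page sizes are measured in points (1/72 inch), so pixels = points × DPI / 72. A small sketch of that arithmetic, with a hypothetical helper name; the renderer's own rounding may differ by a pixel.

```shell
# Convert a length in PDF points (1/72 inch) to pixels at a given DPI.
# points_to_pixels is a made-up helper name for this sketch.
points_to_pixels() {
  awk -v pts="$1" -v dpi="$2" 'BEGIN { printf "%.0f\n", pts * dpi / 72 }'
}

# A US Letter page is 612 x 792 points:
#   points_to_pixels 612 300   # width in pixels at 300 DPI
#   points_to_pixels 792 300   # height in pixels at 300 DPI
```

At 300 DPI a Letter page comes out at roughly 2550 × 3300 pixels, four times the pixel count of the same page at 150 DPI.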

Watermarks and decoration. pdftotext extracts all text, including watermarks and footers. Filter as needed.

Encrypted PDFs. Use -upw user_pw or -opw owner_pw to provide passwords for encrypted PDFs.
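
In scripts it is usually better not to hard-code the password. One option, sketched here, is a wrapper that adds -upw only when an environment variable is set. PDF_USER_PW and the wrapper name are assumptions made for this example, not poppler conventions. Note that a password passed on the command line can be visible to other users in the process list.

```shell
# Call pdftotext, adding -upw when PDF_USER_PW is set and non-empty.
# PDF_USER_PW and pdftotext_maybe_pw are hypothetical names for this sketch.
pdftotext_maybe_pw() {
  if [ -n "${PDF_USER_PW:-}" ]; then
    pdftotext -upw "$PDF_USER_PW" "$@"
  else
    pdftotext "$@"
  fi
}

# Example call:
#   PDF_USER_PW='secret' pdftotext_maybe_pw -layout locked.pdf -
```

The same pattern works for -opw and for the other tools that accept these password options.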

Practical recipe

For a typical "extract content from a PDF" job:

  1. pdfinfo input.pdf, confirm structure and metadata
  2. pdftotext -layout input.pdf text.txt, extract text
  3. pdfimages -all input.pdf images, extract images
  4. pdffonts input.pdf, list fonts for portability check
  5. Process text and images downstream
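
Steps 1 through 4 can be wrapped in a single function. The following is a minimal sketch; the function name, the output layout, and the img filename prefix are choices made for illustration only.

```shell
# Run the inspection steps for one PDF and collect the results in out_dir.
# extract_pdf_content and the file names below are hypothetical choices.
extract_pdf_content() {
  input=$1
  out_dir=$2
  mkdir -p "$out_dir/images" || return 1
  pdfinfo "$input" > "$out_dir/metadata.txt" || return 1   # step 1: metadata
  pdftotext -layout "$input" "$out_dir/text.txt"            # step 2: text
  pdfimages -all "$input" "$out_dir/images/img"             # step 3: images
  pdffonts "$input" > "$out_dir/fonts.txt"                  # step 4: fonts
}

# Example call:
#   extract_pdf_content input.pdf ./extracted
```

The function stops early if pdfinfo cannot read the file, since the later steps would fail on the same input.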

Takeaway

poppler-utils is the lightweight, fast, scriptable CLI suite for reading and rendering PDFs on Unix-like systems. Each tool does one job exceptionally well: text extraction, HTML conversion, image rendering, info, merging, splitting. Combined with qpdf, pdftk, and Ghostscript, it forms the backbone of any serious Linux PDF workflow. For browser-based one-off operations, Docento.app handles many similar tasks visually. For related CLI tools, see Ghostscript introduction, qpdf introduction, pdftk introduction, and MuPDF introduction.
