A PDF is full of compressed content: page streams, embedded images, fonts, embedded files. The compression schemes are called "filters" in PDF terminology. They are pluggable, can be chained, and each is suited to different content. Understanding filters explains why PDFs are sometimes huge and sometimes tiny, and how to shrink them effectively.
The filter concept
A PDF stream object has a dictionary that describes how to decompress (decode) its bytes:
6 0 obj
<<
/Length 245
/Filter /FlateDecode
>>
stream
... compressed bytes ...
endstream
endobj
The /Filter entry names a decoder. A reader applies the decoder to get the original bytes.
Filters can chain:
/Filter [/ASCII85Decode /FlateDecode]
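This means: first ASCII85-decode, then Flate-decode. A reader applies the decoders in the order listed, which is the reverse of the order in which the producer applied the encoders.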
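As a rough illustration of such a chain, the following Python sketch applies decoders in the order a /Filter array lists them. The helper name decode_stream is hypothetical, only three filters are handled, and /DecodeParms (predictors) are ignored; treat it as a sketch under those assumptions, not a complete decoder.

```python
import base64
import zlib
from typing import Sequence


def decode_stream(raw: bytes, filters: Sequence[str]) -> bytes:
    """Apply PDF stream decoders in the order the /Filter array lists them.

    Hypothetical helper: handles only three filters and ignores /DecodeParms.
    """
    data = raw
    for name in filters:
        if name == "ASCII85Decode":
            # PDF ASCII85 data ends with "~>" but carries no "<~" prefix.
            data = data.strip()
            if data.endswith(b"~>"):
                data = data[:-2]
            data = base64.a85decode(data)
        elif name == "ASCIIHexDecode":
            # Hex data ends at ">"; whitespace is ignored, odd length pads with 0.
            digits = b"".join(data.split(b">")[0].split())
            if len(digits) % 2:
                digits += b"0"
            data = bytes.fromhex(digits.decode("ascii"))
        elif name == "FlateDecode":
            data = zlib.decompress(data)
        else:
            raise NotImplementedError(f"filter not handled in this sketch: {name}")
    return data


# Round trip: the producer encodes in reverse order (Flate first, then ASCII85).
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
encoded = base64.a85encode(zlib.compress(content)) + b"~>"
assert decode_stream(encoded, ["ASCII85Decode", "FlateDecode"]) == content
```

Passing the filter names in the wrong order fails immediately, because zlib cannot inflate ASCII85 text.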
The major filters
FlateDecode: zlib/deflate compression. The general-purpose choice for text, page contents, font data. The most common filter.
ASCIIHexDecode and ASCII85Decode: text encodings of binary, not real compression; they make the data larger (hex doubles it, ASCII85 adds about 25%). Used when a PDF needs to be all-ASCII (rare today).
LZWDecode: LZW compression. Older; mostly replaced by Flate.
RunLengthDecode: simple run-length encoding. Niche use.
CCITTFaxDecode: fax-style compression for monochrome (1-bit) images. Excellent on scanned text. The default for black-and-white scans.
JBIG2Decode: JBIG2 for bilevel images. Stronger than CCITT for scanned text; also more complex. Used heavily in newer scanners. Notable for occasionally swapping similar glyphs (causing the famous Xerox JBIG2 digit-substitution bug).
DCTDecode: JPEG. The default for color and grayscale photos.
JPXDecode: JPEG 2000. Lossless or lossy; better quality at the same size than JPEG but less universal support.
Crypt: a special filter applied during encryption; not a compressor.
Which filter for which content
- Text content streams: FlateDecode.
- Fonts: FlateDecode.
- Scanned monochrome (black-and-white) pages: CCITTFaxDecode or JBIG2Decode.
- Color photos and screenshots: DCTDecode (JPEG) or JPXDecode (JPEG 2000).
- Grayscale photos: DCTDecode.
- Diagrams with sharp edges: FlateDecode on the bitmap, or store as vector.
- Embedded files: FlateDecode (already-compressed payloads gain little).
A well-built PDF chooses the right filter per content type. A poorly-built one applies Flate to JPEG-encoded photos, gaining nothing and complicating decode.
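The point about Flate over JPEG data can be checked with a small experiment. The sketch below compares how zlib handles repetitive drawing operators versus near-random bytes. It assumes seeded pseudo-random bytes are a fair stand-in for JPEG entropy-coded data, and the function name flate_ratio is hypothetical.

```python
import random
import zlib


def flate_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, 9)) / len(data)


# Repetitive drawing operators, similar in spirit to a page content stream.
content_stream = b"BT /F1 12 Tf 72 720 Td (Hello, world) Tj ET\n" * 500

# Assumption: seeded pseudo-random bytes stand in for JPEG entropy-coded
# data, which is also close to incompressible.
rng = random.Random(0)
jpeg_like = bytes(rng.getrandbits(8) for _ in range(50_000))

print(f"content stream ratio: {flate_ratio(content_stream):.3f}")
print(f"JPEG-like data ratio: {flate_ratio(jpeg_like):.3f}")
```

The content stream shrinks to a small fraction of its size, while the JPEG-like data stays at roughly its original size, so the extra filter only adds decode work.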
Why PDFs get bloated
Common causes of large PDFs:
- High-quality JPEGs: a scanned receipt at 600 DPI color is 5+ MB; the same at 300 DPI grayscale is 200 KB.
- Embedded fonts: a full TrueType font is 100-300 KB; subsetting drops it to 20-50 KB.
- No image compression: rare but happens with simple PDF producers.
- Hidden full-page rasterization: a "text" PDF that's really an image of text. OCR fixes this somewhat, but file size remains heavy.
- Embedded files: attachments add up.
For reduction strategies, see reduce PDF file size.
Image-specific compression
For embedded images:
- JPEG (DCTDecode): lossy. Good for photos. Quality 70-85 is usually invisible at typical viewing.
- JPEG 2000 (JPXDecode): lossless or lossy. Better quality at the same size than JPEG; viewer support has broadened but is still less universal than JPEG's.
- CCITT G4: lossless. Great for 1-bit monochrome (scanned text).
- JBIG2: lossless or lossy. Excellent on scanned text; can produce surprising glyph substitutions in some modes.
- Flate on raw raster: lossless. Massive size for photos; OK for small graphics.
The big size lever for scanned PDFs: convert color scans to grayscale or 1-bit (CCITT/JBIG2) where possible. A 600 DPI color receipt scan at 5 MB drops to under 200 KB as 1-bit CCITT.
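The arithmetic behind that lever is easy to reproduce. The sketch below computes the uncompressed raster size the filter has to start from, assuming a US Letter page (8.5 x 11 inches) rather than a receipt; the function name raw_raster_bytes is hypothetical.

```python
def raw_raster_bytes(width_in: float, height_in: float, dpi: int,
                     components: int, bits_per_component: int) -> int:
    """Uncompressed image size in bytes, with each row padded to a whole byte."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bits_per_row = width_px * components * bits_per_component
    bytes_per_row = (bits_per_row + 7) // 8  # PDF image rows are byte-aligned
    return bytes_per_row * height_px


# Assumed page size: US Letter, 8.5 x 11 inches.
color_600 = raw_raster_bytes(8.5, 11, 600, components=3, bits_per_component=8)
gray_300 = raw_raster_bytes(8.5, 11, 300, components=1, bits_per_component=8)
mono_300 = raw_raster_bytes(8.5, 11, 300, components=1, bits_per_component=1)

for label, size in [("600 DPI color", color_600),
                    ("300 DPI gray", gray_300),
                    ("300 DPI 1-bit", mono_300)]:
    print(f"{label}: {size / 1_000_000:.1f} MB before compression")
```

Under these assumptions the color scan is roughly 96 times larger than the 1-bit scan before any filter runs, which is why DPI and color depth matter more than the choice of codec.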
Font compression
Fonts in PDFs are typically Flate-compressed:
- Subsetting: only embed the glyphs actually used in the document. Standard practice.
- Full embedding: every glyph of the font. Used when the document might be edited later.
- No embedding: the font is referenced by name; the reader substitutes. Risky.
For 100 pages of text using one font, embedded-subset adds maybe 30 KB. Embedded-full could add 200 KB. No-embedding adds zero but risks font substitution at view time.
For more on fonts, see embedded fonts in PDF explained and troubleshooting PDF fonts not displaying.
Compression vs. content
The filter affects the file size; the content choice often affects size more:
- A 300-page scanned PDF at 300 DPI color: 100+ MB.
- The same PDF re-OCR'd and represented as text plus low-DPI images: 5-10 MB.
- The same content as native-text PDF: under 1 MB.
For digital-source documents, always start from native text. Scanning plus OCR is the fallback.
Content streams
Page content streams (the drawing instructions) compress well with Flate:
- Native-text PDFs have small content streams (text plus formatting commands).
- Vector-heavy PDFs (CAD-exported, infographics) have larger content streams; still Flate-friendly.
- Raster-heavy PDFs have small content streams (mostly Do operators referencing images); the bulk is in the image streams.
Object streams
PDF 1.5+ allows multiple non-stream objects (dictionaries, arrays, numbers) to be combined into a single Flate-compressed object stream:
- Reduces overhead of repeated object boundaries.
- Improves compression by giving Flate more context.
- Invisible to readers; transparent to authors.
Modern PDF writers use object streams by default. Older ones write every object separately and uncompressed, which is less efficient.
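The "more context" effect can be seen with zlib alone. The sketch below uses made-up annotation dictionaries as stand-ins for the small objects a real file contains, and it leaves out the offset table that a real object stream also carries, so the numbers are only indicative.

```python
import zlib

# Hypothetical small objects, similar to link annotations in a real file.
objects = [
    b"<< /Type /Annot /Subtype /Link /Rect [72 %d 300 %d] /Border [0 0 0] >>"
    % (700 - 14 * i, 712 - 14 * i)
    for i in range(40)
]

# Written the old way: every object stored as-is, uncompressed.
raw_total = sum(len(o) for o in objects)

# Compressed one by one: Flate sees almost no repetition inside a single object.
separate_total = sum(len(zlib.compress(o)) for o in objects)

# Compressed together, as an object stream would: shared keys repeat.
combined_total = len(zlib.compress(b"\n".join(objects)))

print(f"raw: {raw_total} B, separate: {separate_total} B, combined: {combined_total} B")
```

Compressing each object on its own barely helps, which is why the gain comes from grouping objects rather than from Flate by itself.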
Decompression and inspection
To peek inside:
- qpdf --qdf input.pdf output.pdf: rewrites the file in a human-readable form, uncompressing Flate, LZW, and ASCII-encoded streams (image codecs such as DCT are left as-is by default).
- mutool show input.pdf 6: show object 6, decompressed.
- Decompress a stream manually: feed the bytes between stream and endstream to the corresponding decoders, applied in the order the /Filter entry lists them (the reverse of the order used when encoding).
For Flate streams specifically, zlib-flate -uncompress < bytes (from qpdf's tooling) is convenient.
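Where zlib-flate is not at hand, the same manual step can be sketched in Python. The helper below is hypothetical and deliberately naive: it assumes a single FlateDecode filter and a direct numeric /Length, and it does not resolve indirect references or parse the dictionary properly.

```python
import re
import zlib


def extract_flate_stream(obj_bytes: bytes) -> bytes:
    """Inflate the stream of one PDF object given as raw bytes.

    Assumptions: a single /FlateDecode filter and a direct /Length value
    (an indirect length such as "/Length 7 0 R" is rejected).
    """
    length_match = re.search(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R)", obj_bytes)
    keyword = re.search(rb"stream\r?\n", obj_bytes)
    if length_match is None or keyword is None:
        raise ValueError("expected a stream object with a direct /Length")
    start = keyword.end()  # stream data begins right after the end-of-line
    length = int(length_match.group(1))
    return zlib.decompress(obj_bytes[start:start + length])


# Build a tiny object in memory so the sketch runs without a PDF file.
payload = b"BT /F1 12 Tf (Hi) Tj ET"
compressed = zlib.compress(payload)
obj = (b"6 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(compressed)
       + compressed + b"\nendstream\nendobj\n")
assert extract_flate_stream(obj) == payload
```

Reading exactly /Length bytes, instead of searching for endstream, avoids cutting the data short when the compressed bytes happen to contain that keyword.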
Recompression
When optimizing a PDF:
- Inventory image content: count and size.
- Recompress images: lower JPEG quality where acceptable; convert photo regions to JPEG; reduce DPI for large rasters.
- Re-subset fonts: drop full-embedded fonts to subsets.
- Strip unused content: orphaned objects, old form fields, optional metadata.
- Linearize (or not, depending on use case).
Tools that do this:
- Acrobat Pro: Save As Optimized PDF (lots of knobs).
- Ghostscript: gs -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook -o output.pdf input.pdf (or /screen, /printer, /prepress).
- qpdf: structural cleanup; pair with image-specific tools for image-level optimization.
- MuPDF mutool clean.
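For scripted use, the Ghostscript call can be assembled in Python. The sketch below only builds the argument list; it assumes the pdfwrite device and the -o output flag, which Ghostscript needs in order to write a PDF, and the function name ghostscript_args is hypothetical.

```python
import shlex
from typing import List

# Presets named in the list above.
PRESETS = ("screen", "ebook", "printer", "prepress")


def ghostscript_args(src: str, dst: str, preset: str = "ebook") -> List[str]:
    """Build a Ghostscript command line that rewrites src into dst."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset}")
    return ["gs", "-sDEVICE=pdfwrite", f"-dPDFSETTINGS=/{preset}", "-o", dst, src]


# Show the command without running it.
print(shlex.join(ghostscript_args("input.pdf", "output.pdf")))
```

With Ghostscript installed, the list can be passed to subprocess.run(args, check=True); keeping it as a list avoids shell quoting problems with unusual file names.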
For browser-based PDF compression, Docento.app and similar tools handle common operations.
Lossy vs. lossless
For images:
- Lossy (JPEG, JPEG 2000 lossy mode, JBIG2 lossy mode): smaller files; visible artifacts at extreme compression.
- Lossless (CCITT G4 on 1-bit, Flate on raw, JPEG 2000 lossless): preserves exact pixels; larger files.
For text content streams: always lossless. PDF text shouldn't be re-rendered as JPEG.
For archival (PDF/A), lossy JBIG2 is restricted; preserve fidelity. See PDF/A archival format explained.
The JBIG2 substitution issue
A famous bug: Xerox scanners using JBIG2 sometimes swapped similar-looking digits and characters in scanned documents (e.g., a 6 turning into an 8). The cause: JBIG2's "pattern matching and substitution" mode treats nearly-identical glyphs as the same symbol, which works for clean text but breaks when scan quality is borderline.
For high-stakes scans (legal, financial), avoid lossy JBIG2 or use the safer "lossless" mode. Many modern scanners now use safer modes by default; verify on your gear.
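The failure mode can be imitated with a toy symbol table. The sketch below is not the JBIG2 algorithm; it only shows, on made-up 3x5 bitmaps, what happens when an encoder reuses an earlier symbol for any glyph within a pixel-difference threshold. All names are hypothetical.

```python
from typing import List, Sequence, Tuple

Glyph = Tuple[str, ...]  # one string per bitmap row, '#' = ink


def pixel_diff(a: Glyph, b: Glyph) -> int:
    """Count pixels that differ between two same-sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def build_symbol_table(glyphs: Sequence[Glyph],
                       max_diff: int) -> Tuple[List[Glyph], List[int]]:
    """Store each glyph once; reuse a stored symbol if it is close enough."""
    symbols: List[Glyph] = []
    indices: List[int] = []
    for glyph in glyphs:
        for i, known in enumerate(symbols):
            if pixel_diff(glyph, known) <= max_diff:
                indices.append(i)  # substitute the earlier symbol
                break
        else:
            symbols.append(glyph)
            indices.append(len(symbols) - 1)
    return symbols, indices


# Toy low-resolution digits that differ by a single pixel.
SIX = ("###", "#..", "###", "#.#", "###")
EIGHT = ("###", "#.#", "###", "#.#", "###")
page = [EIGHT, SIX, EIGHT]

exact_symbols, exact_indices = build_symbol_table(page, max_diff=0)
lossy_symbols, lossy_indices = build_symbol_table(page, max_diff=1)

# Decoding replays the stored symbols, so the lossy page shows 8 8 8.
decoded_lossy = [lossy_symbols[i] for i in lossy_indices]
assert decoded_lossy[1] == EIGHT
```

With exact matching the 6 keeps its own symbol; with a tolerance of one pixel it is silently replaced by the 8 that was stored first, and nothing in the output flags the change.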
Hidden costs of bad compression
Beyond file size:
- Decoding speed: complex filter chains slow rendering, especially on low-end devices.
- Memory: decompressed content has to fit in memory; huge images strain mobile.
- Cross-tool fidelity: exotic JPEG 2000 settings may not render in all viewers.
- Long-term durability: lossy compression baked into archival PDFs can't be undone.
Common gotchas
Already-compressed content compressed again. Flate-wrapping a JPEG gains nothing. Some PDF producers do this anyway.
No subsetting. Fonts embedded in full when subset would do. Easy size win on optimization.
Scanner DPI too high. 600 DPI color for a basic document is overkill. 200-300 DPI is plenty.
Image-of-text masquerading as text. A "scanned" PDF that no one OCR'd. Search returns nothing; file size is high.
Non-embedded fonts substituted at view time. The producer's font isn't on the viewer's system, so the rendering differs from the intent. Embed.
Multiple incremental updates with images. Each save can add a fresh copy of large content. Rewrite periodically. See PDF incremental updates explained.
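A quick way to spot the incremental-update case is to count end-of-file markers, since each appended update ends with its own %%EOF. The sketch below is a heuristic with a hypothetical function name: linearized files can also contain more than one marker, and the byte sequence could in principle appear inside stream data, so a count above one is a hint to investigate, not proof.

```python
def count_eof_markers(pdf_bytes: bytes) -> int:
    """Count %%EOF markers; more than one hints at appended updates."""
    return pdf_bytes.count(b"%%EOF")


# Synthetic stand-ins for file contents (not valid PDFs).
single_save = b"%PDF-1.7\n...objects...\nstartxref\n123\n%%EOF\n"
two_saves = single_save + b"...changed objects...\nstartxref\n456\n%%EOF\n"

print(count_eof_markers(single_save), count_eof_markers(two_saves))
```

On a real file, read it with open(path, "rb").read() and rewrite the PDF with a full save if the count keeps growing along with the file size.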
Practical recipe
For producing a well-compressed PDF:
- Native-text whenever possible. Don't scan if you can export.
- Scan settings: 300 DPI grayscale or 1-bit for text; color only when needed.
- JPEG quality 75-85 for embedded photos.
- Subset embedded fonts.
- Use object streams (any modern producer).
- Optimize before publishing: Acrobat Optimize, Ghostscript, or mutool clean.
- Re-test in target viewers after optimization.
For a routine "make this PDF smaller" need, see reduce PDF file size.
Takeaway
PDF compression filters are the unsung machinery behind sensible file sizes. Choosing the right filter per content type (Flate for text/fonts, JPEG/JPEG 2000 for photos, CCITT/JBIG2 for monochrome scans) plus subsetting fonts gets PDFs down to their natural minimum. The biggest wins come from content choices (DPI, color depth, native vs. scanned) rather than from clever filter tweaks. See also reduce PDF file size, PDF internals: objects and streams, and embedded fonts in PDF explained.