A PDF looks like a single document, but inside it is a small structured database of objects. Open a PDF in a hex editor or a text editor and you see a soup of obj/endobj markers, dictionaries, and binary streams. Understanding the structure helps you build tools, debug strange behavior, and stop being intimidated by the format. This guide walks through the actual internals.
The file structure
Every PDF has four major sections, in order:
- Header: %PDF-1.7 or %PDF-2.0, plus a few high-bit bytes to mark the file as binary.
- Body: a sequence of objects.
- Cross-reference table (xref): the byte offsets of every object.
- Trailer: pointer to the root object and the xref.
A PDF reader starts at the trailer, reads the xref, then jumps to the objects it needs. This random-access design is why a 1,000-page PDF can open instantly.
The header
%PDF-1.7 says the file conforms to PDF 1.7. The next line is typically a comment containing high-bit bytes (for example %\xE2\xE3\xCF\xD3) so that file-type detectors and transfer tools treat the file as binary rather than text. That's it.
PDF 2.0 (ISO 32000-2) is the current version in 2026, but 1.7 is still the most common because most tools haven't switched. See PDF 1.7 vs PDF 2.0.
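As a minimal sketch of reading the header described above, the following Python pulls the version out of the first bytes of a file. The function name pdf_version and the 1024-byte search window are illustrative assumptions, not part of any library.

```python
import re

def pdf_version(data: bytes):
    """Return the version from a %PDF-x.y header, or None if no header is found."""
    # Only the start of the file is searched; the 1024-byte window is an
    # arbitrary allowance for junk that some files carry before the header.
    match = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    return match.group(1).decode("ascii") if match else None

sample = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"
print(pdf_version(sample))
```

In practice you would pass the first kilobyte read from a file opened in binary mode.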
Objects
Eight basic types:
- Boolean: true, false.
- Numeric: integers and reals (42, 3.14).
- String: literal ((Hello)) or hex (<48656c6c6f>).
- Name: /Type, /Page (start with a slash).
- Array: [1 2 3].
- Dictionary: << /Key /Value /Key2 (string) >>.
- Stream: a dictionary followed by raw bytes (often compressed).
- Null: null.
Plus indirect references: any object can be given an object number and a generation number, which makes it an indirect object that other objects can reference.
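To make the hex string form from the list above concrete, here is a small sketch that decodes one into bytes. The helper name decode_hex_string is hypothetical. Note that the digits 48656c6c6f spell "Hello" with a capital H.

```python
def decode_hex_string(token: str) -> bytes:
    """Decode a PDF hex string such as <48656c6c6f> into raw bytes."""
    # Whitespace between hex digits is allowed and ignored.
    digits = "".join(token.strip("<>").split())
    # An odd number of digits is read as if a trailing 0 were present.
    if len(digits) % 2:
        digits += "0"
    return bytes.fromhex(digits)

print(decode_hex_string("<48656c6c6f>"))
```

Literal strings in parentheses need more work (escapes and nested parentheses), which this sketch does not attempt.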
Indirect objects
The actual object definitions live in the body:
1 0 obj
<<
/Type /Catalog
/Pages 2 0 R
>>
endobj
1 0 is the object number plus generation; 2 0 R is a reference to object 2, generation 0. The R keyword is what turns the two numbers into a reference, and references are how objects link to one another throughout the file.
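The difference between defining an object and referencing one can be sketched with two regular expressions. This is a rough scan for exploration, assuming uncompressed text: a real parser must tokenize properly, because stream data can contain bytes that look like these patterns.

```python
import re

OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")   # "1 0 obj" starts a definition
REFERENCE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")      # "2 0 R" points at another object

body = b"1 0 obj\n<<\n/Type /Catalog\n/Pages 2 0 R\n>>\nendobj\n"

# Each result is an (object number, generation) pair.
defined = [(int(n), int(g)) for n, g in OBJ_HEADER.findall(body)]
referenced = [(int(n), int(g)) for n, g in REFERENCE.findall(body)]

print("defined:", defined)
print("referenced:", referenced)
```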
The catalog
Every PDF has a root catalog. The trailer points to it. The catalog points to:
- The page tree (/Pages).
- Optionally: outlines (bookmarks), metadata, named destinations, form fields, structure tree, viewer preferences.
The page tree
Pages are organized as a tree, which writers usually keep balanced so that any page can be reached in a few steps:
2 0 obj
<<
/Type /Pages
/Kids [3 0 R 4 0 R 5 0 R]
/Count 3
>>
endobj
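Walking the tree comes down to following /Kids until you reach leaf nodes of /Type /Page. The sketch below uses a toy in-memory model (a dict from object number to dictionary) rather than a parsed file; the objects mapping and the iter_pages name are assumptions made for illustration.

```python
# Toy model of the objects above: object number -> dictionary contents.
objects = {
    2: {"Type": "Pages", "Kids": [3, 4, 5], "Count": 3},
    3: {"Type": "Page", "Parent": 2},
    4: {"Type": "Page", "Parent": 2},
    5: {"Type": "Page", "Parent": 2},
}

def iter_pages(objects, node_num):
    """Yield page object numbers in document order, walking /Kids depth-first."""
    node = objects[node_num]
    if node["Type"] == "Page":
        yield node_num
        return
    for kid in node["Kids"]:
        yield from iter_pages(objects, kid)

pages = list(iter_pages(objects, 2))
print(pages)
```

Because intermediate /Pages nodes can nest, the recursion handles deeper trees the same way, and the number of pages found should match /Count on the root node.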
Each page is itself an object:
3 0 obj
<<
/Type /Page
/Parent 2 0 R
/MediaBox [0 0 612 792]
/Resources << /Font << /F1 7 0 R >> >>
/Contents 6 0 R
>>
endobj
The MediaBox defines the page size in points (1 point = 1/72 inch). 612x792 is US Letter.
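The point-to-inch arithmetic is simple enough to check in a couple of lines. This sketch assumes the MediaBox is given as the usual four numbers (lower-left x, lower-left y, upper-right x, upper-right y); page_size_inches is a hypothetical helper name.

```python
def page_size_inches(media_box):
    """Convert a MediaBox [llx lly urx ury] in points to (width, height) in inches."""
    llx, lly, urx, ury = media_box
    return ((urx - llx) / 72, (ury - lly) / 72)

print(page_size_inches([0, 0, 612, 792]))  # US Letter
print(page_size_inches([0, 0, 595, 842]))  # A4, rounded to whole points
```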
The Contents stream is the actual drawing instructions.
Content streams
The Contents object is a stream. It contains a small page-description language:
BT
/F1 12 Tf
100 700 Td
(Hello, World!) Tj
ET
This says: begin text, set font F1 at 12 points, move to (100, 700), show "Hello, World!", end text. PostScript-derived, RPN-style.
Operators include: Tj/TJ (show text), Td/TD (move position), Tf (set font), m/l (move/line for vector drawing), re (rectangle), f (fill), S (stroke; lowercase s closes the path first), and dozens more.
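The RPN shape of a content stream (operands first, then the operator that consumes them) can be sketched with a small tokenizer. This is a deliberately limited illustration: it handles names, numbers, simple literal strings, and alphabetic operators, and it ignores arrays, hex strings, nested parentheses, and inline images. The names TOKEN and parse_operations are assumptions, not a library API.

```python
import re

# Alternatives, in order: literal string, name, number, operator keyword.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|/[^\s/()<>\[\]]+|[-+]?\d*\.?\d+|[A-Za-z*']+")

def parse_operations(content: str):
    """Group tokens into (operator, operands) pairs; operands precede their operator."""
    operations, operands = [], []
    for token in TOKEN.findall(content):
        first = token[0]
        if first in "(/+-." or first.isdigit():
            operands.append(token)                 # still collecting operands
        else:
            operations.append((token, operands))   # operator closes the group
            operands = []
    return operations

stream = "BT\n/F1 12 Tf\n100 700 Td\n(Hello, World!) Tj\nET"
for operator, operands in parse_operations(stream):
    print(operator, operands)
```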
Streams and compression
Streams are usually compressed. The dictionary specifies how:
6 0 obj
<<
/Length 245
/Filter /FlateDecode
>>
stream
... compressed bytes ...
endstream
endobj
/Filter /FlateDecode means zlib/deflate compression. Other filters:
- /ASCIIHexDecode, /ASCII85Decode: text encodings.
- /LZWDecode: LZW compression.
- /RunLengthDecode: simple RLE.
- /CCITTFaxDecode: fax-style compression for monochrome images.
- /JBIG2Decode: JBIG2 for scanned text.
- /DCTDecode: JPEG.
- /JPXDecode: JPEG 2000.
- /Crypt: encryption filter.
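Filters can chain: /Filter [/ASCII85Decode /FlateDecode]. A reader applies them in array order when decoding, so here it undoes the ASCII85 encoding first and then inflates the result.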
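A chained decode can be sketched with the Python standard library. This handles only the two filters in the example and assumes no /DecodeParms (predictors) are in play; decode_stream is a hypothetical helper, and the round trip at the end builds its own test data.

```python
import base64
import zlib

def decode_stream(data: bytes, filters):
    """Apply PDF stream filters in the order listed in /Filter."""
    for name in filters:
        if name == "FlateDecode":
            data = zlib.decompress(data)
        elif name == "ASCII85Decode":
            data = data.strip()
            if data.endswith(b"~>"):      # PDF marks the end of ASCII85 data with ~>
                data = data[:-2]
            data = base64.a85decode(data)
        else:
            raise NotImplementedError(f"filter {name} is not handled in this sketch")
    return data

# Build sample data the way a writer would: deflate first, then ASCII85-encode.
raw = b"BT /F1 12 Tf 100 700 Td (Hello, World!) Tj ET"
encoded = base64.a85encode(zlib.compress(raw)) + b"~>"

print(decode_stream(encoded, ["ASCII85Decode", "FlateDecode"]))
```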
For more on compression, see PDF compression filters explained.
Fonts
A font in a PDF is a dictionary describing:
- The font type (Type1, TrueType, Type0, Type3).
- Encoding (character code to glyph name).
- Widths (per character).
- Optional: an embedded font program (the actual font file).
When fonts are embedded, the PDF is self-contained. When not, the reader has to find them on the user's system, or substitute.
For font issues, see troubleshooting PDF fonts not displaying and embedded fonts in PDF explained.
Images
Images are streams with an /Image subtype:
8 0 obj
<<
/Type /XObject
/Subtype /Image
/Width 800
/Height 600
/ColorSpace /DeviceRGB
/BitsPerComponent 8
/Filter /DCTDecode
/Length 12345
>>
stream
... JPEG data ...
endstream
endobj
An image is listed in the page's /Resources under /XObject and painted from the page's content stream with the Do operator.
The cross-reference table
After the body comes:
xref
0 6
0000000000 65535 f
0000000015 00000 n
0000000090 00000 n
0000000200 00000 n
0000000270 00000 n
0000000350 00000 n
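Each entry is a 10-digit byte offset, a 5-digit generation number, and n (in use) or f (free). The 0 6 line says this section covers six objects starting at object number 0, so object 5 is the last entry and its bytes start at offset 350.
PDF 1.5+ also supports cross-reference streams, which store the same information as a compressed stream and are required when object streams are used.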
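Reading a classic xref table is mostly splitting lines. The sketch below parses the example above into a dictionary; parse_xref_table is a hypothetical name, and the code assumes a well-formed table with no trailer text attached.

```python
def parse_xref_table(text: str):
    """Map object number -> (offset, generation, in_use) for a classic xref table."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or lines[0] != "xref":
        raise ValueError("expected the xref keyword")
    entries = {}
    i = 1
    while i < len(lines):
        # Subsection header such as "0 6": first object number, entry count.
        first, count = (int(x) for x in lines[i].split())
        i += 1
        for obj_num in range(first, first + count):
            offset, generation, kind = lines[i].split()
            # For free entries (f) the first field is the next free object
            # number rather than a byte offset.
            entries[obj_num] = (int(offset), int(generation), kind == "n")
            i += 1
    return entries

table = """xref
0 6
0000000000 65535 f
0000000015 00000 n
0000000090 00000 n
0000000200 00000 n
0000000270 00000 n
0000000350 00000 n
"""
entries = parse_xref_table(table)
print(entries[5])
```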
Object streams
To save space, PDF 1.5+ allows multiple objects to be stored inside a single compressed stream. The xref points into it. This is invisible to the person viewing the file but makes for smaller files.
The trailer
After the xref:
trailer
<<
/Size 6
/Root 1 0 R
>>
startxref
1234
%%EOF
/Root is the catalog. startxref is the byte offset of the xref. %%EOF marks the end.
A reader starts at %%EOF, reads backward to startxref, jumps to the xref, then traverses from /Root.
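The first step of that walk, locating startxref from the end of the file, can be sketched as follows. The 1024-byte tail window is an assumption chosen for illustration, and find_startxref is a hypothetical helper name.

```python
def find_startxref(data: bytes) -> int:
    """Return the byte offset that follows the last startxref keyword."""
    tail = data[-1024:]                       # the keyword sits near the end of the file
    pos = tail.rfind(b"startxref")
    if pos == -1:
        raise ValueError("no startxref keyword found near the end of the file")
    # The first whitespace-separated token after the keyword is the offset.
    return int(tail[pos + len(b"startxref"):].split()[0])

sample = b"trailer\n<<\n/Size 6\n/Root 1 0 R\n>>\nstartxref\n1234\n%%EOF\n"
print(find_startxref(sample))
```

From that offset a reader parses the xref, reads the trailer dictionary, and follows /Root to the catalog.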
Incremental updates
A PDF can be appended to. Instead of rewriting the file, a later save can add a new body, xref, and trailer at the end:
[original body]
[original xref]
[original trailer]
[new body]
[new xref]
[new trailer with /Prev pointing to old xref]
This is how Acrobat saves changes without rewriting the whole file. See PDF incremental updates explained.
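A quick way to spot appended revisions is to list every startxref value in file order. This is a heuristic sketch, not a parser: files that were never updated (for example linearized ones) can also contain more than one marker, and stream data could in principle contain matching bytes. The function name and the sample bytes are made up for illustration.

```python
import re

def startxref_offsets(data: bytes):
    """Return every startxref value in file order; a reader begins with the last one."""
    return [int(m.group(1)) for m in re.finditer(rb"startxref\s+(\d+)\s+%%EOF", data)]

sample = (
    b"%PDF-1.7\n...original body, xref, trailer...\nstartxref\n1234\n%%EOF\n"
    b"...new body and xref...\ntrailer\n<< /Size 7 /Root 1 0 R /Prev 1234 >>\n"
    b"startxref\n5678\n%%EOF\n"
)
offsets = startxref_offsets(sample)
print(offsets)
```

In this sample the newest trailer's /Prev value matches the earlier startxref, which is the link a reader follows to reach the older xref.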
Annotations
An annotation is an object referenced from a page's /Annots array:
9 0 obj
<<
/Type /Annot
/Subtype /Text
/Rect [100 700 120 720]
/Contents (A sticky note)
>>
endobj
Many subtypes: Text, Highlight, Square, Link, FileAttachment, Stamp, FreeText, etc.
For form fields (which appear on the page as annotations of subtype Widget), see PDF form field types explained.
Metadata
A PDF can have metadata in two places:
- The Info dictionary, referenced from the trailer's /Info entry: title, author, creator, producer, dates.
- XMP metadata as a stream attached to the catalog: structured RDF.
For inspecting and editing, see how to edit PDF metadata.
Encryption
When encrypted, the trailer contains an /Encrypt reference to an encryption dictionary. The dictionary specifies:
- Method (V): standard versions 1-5.
- Length: key length (40, 128, 256 bits).
- Permissions: print, copy, modify, etc.
- Owner and user password hashes.
Stream and string contents are encrypted with derived keys. Object structure (the dictionaries themselves) remains visible.
For more, see PDF encryption explained and AES 128 vs AES 256 PDF encryption.
Tools to inspect
Free tools for poking inside PDFs:
- qpdf with --qdf mode: produces a human-readable version of the PDF.
- mutool (from MuPDF): mutool show file.pdf trailer, etc.
- pdftk: legacy but still useful.
- PDFBox (Java): pretty-print everything.
- pdf-parser.py (Didier Stevens): forensic-focused.
- Origami (Ruby): manipulate PDF objects.
To explore a PDF:
qpdf --qdf file.pdf out.pdf
mutool show file.pdf catalog
mutool show file.pdf trailer
Why this matters
Knowing internals helps when:
- A tool produces broken PDFs. You can read the file and see what's wrong.
- You want to script PDF operations (extract pages, modify fields, batch process).
- You're hunting for hidden content (forensic or journalistic). See hidden data in PDFs explained.
- You're debugging compression, encryption, or annotation issues.
- You want to write a PDF tool. The spec is open; the internals are accessible.
Specification
The actual standard:
- ISO 32000-2:2020 (PDF 2.0): the current ISO PDF spec.
- ISO 32000-1:2008 (PDF 1.7): the previous, still widely used.
- Both can be obtained at no cost: a copy of ISO 32000-1 has long been freely available, and ISO 32000-2, which historically had to be purchased from ISO, is now available at no cost through the PDF Association.
PDF Association maintains companion documents and conformance specs (PDF/A, PDF/UA, PDF/X, etc.).
Common gotchas
Multiple xrefs after incremental updates. Each save adds a new one; the trailer chain references back.
Object stream contents. Inspecting via plain text tools shows compressed blobs. Use qpdf --qdf to expand.
Free objects. Objects marked f in the xref are reusable; numbering can be confusing.
Generation numbers. Most objects are generation 0. Higher generations appear only when a freed object number is reused, which is rare in normal PDFs.
Annotations vs content stream. A highlight that "looks like" it's on the page can be an annotation (overlay) or burned into the content stream. They behave differently.
Practical exercises
To get comfortable:
- Open a simple PDF in a text editor; identify header, objects, xref, trailer.
- Run qpdf --qdf on it; see the expanded form.
- Compare a PDF before and after an annotation; see the new objects.
- Examine the catalog and page tree manually.
- Decompress a content stream; read the operators.
Takeaway
PDF internals are not as scary as they look. The format is a small object database with strict structure rules. Once you can read the trailer, find the catalog, walk to a page, and decompress its content stream, you can debug most PDF problems and build most PDF tools. For browser-based PDF operations that work at the structural level (page extraction, metadata stripping, redaction), Docento.app handles them locally. See also PDF incremental updates explained, PDF compression filters explained, and PDF 1.7 vs PDF 2.0.