How to Convert a PDF to LaTeX When You Need the Source Back

Converting a PDF back to LaTeX is a conversion you should not need often. LaTeX is a source format and PDF is its output, so going in reverse is like reconstructing a recipe from a baked cake. That said, the need is real: usually the LaTeX source has been lost, a paper has to move to a different journal's template, or you want to extract equations cleanly for re-use. This guide walks through what is possible and what is not.

What "PDF to LaTeX" actually means

Three different goals get bundled under this name:

Faithful source reconstruction. A LaTeX file that, when compiled, produces a PDF visually similar to the original. Very hard. Almost always requires manual cleanup.
Extractable LaTeX components. Pull out specific pieces (equations, tables, figures, references) as LaTeX snippets you can paste into a new document. Much more tractable.
LaTeX-style Markdown. A clean Markdown-ish version that you can adapt to LaTeX by hand. The most realistic outcome for most documents.

Pick which one matches your actual need. The tools differ.

Why this is hard

LaTeX compiles \section{...}, \begin{itemize}, \begin{equation}, and so on into pages. The compiler decides line breaks, ligatures, hyphenation, kerning, and page breaks. By the time the PDF exists, those decisions are baked into the geometry. Reversing them (figuring out that "this line" was originally a \section{} because of its font size, or that "this paragraph" had \textbf{} because the glyphs are bold) is heuristic work, and the heuristics are imperfect.

The original source behind a published paper compiled cleanly. The LaTeX recovered by PDF-to-LaTeX tools usually does not, at least not without manual touch-ups.
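To make the heuristic nature concrete, here is a minimal Python sketch of the kind of guess a converter has to make. The Span record, the sample data, and the 1.4 and 1.15 size thresholds are all assumptions made up for illustration; real tools use richer signals than this.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Span:
    """One run of text as a PDF extractor might report it (hypothetical shape)."""
    text: str
    font_size: float
    bold: bool = False


def guess_markup(spans: list[Span]) -> list[str]:
    """Guess LaTeX markup from font size and weight alone."""
    # Treat the median font size as the body text size.
    body_size = median(s.font_size for s in spans)
    lines = []
    for s in spans:
        ratio = s.font_size / body_size
        if ratio >= 1.4:        # arbitrary threshold: much larger than body
            lines.append(f"\\section{{{s.text}}}")
        elif ratio >= 1.15:     # arbitrary threshold: somewhat larger than body
            lines.append(f"\\subsection{{{s.text}}}")
        elif s.bold:            # body-sized but bold
            lines.append(f"\\textbf{{{s.text}}}")
        else:
            lines.append(s.text)
    return lines


# Made-up sample spans, not taken from any real PDF.
spans = [
    Span("Introduction", 14.4, bold=True),
    Span("We study the problem of recovering source.", 10.0),
    Span("Related work", 12.0, bold=True),
    Span("Prior approaches rely on OCR.", 10.0),
    Span("Important:", 10.0, bold=True),
]
for line in guess_markup(spans):
    print(line)
```

The sketch cannot tell a bold run-in heading (\paragraph{}) from ordinary emphasis, and a template with unusual heading sizes would defeat the thresholds. That is the sense in which the recovered markup is a guess.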

Tools that try

TeX4ht / mk4ht, which goes the other way: it converts LaTeX to HTML/MathML for the web. Not a PDF-to-LaTeX tool, but useful in the broader ecosystem.

pdf2tex (open source), a small project that extracts a rough LaTeX from PDF. Reliability depends heavily on the source.

marker, nougat, modern ML-based document parsers (nougat comes from Meta AI) that can produce LaTeX output for equations and structured content. Generally Markdown-oriented, but with LaTeX equation extraction built in.

InftyReader, commercial OCR specifically tuned for math and scientific documents. Output options include LaTeX. Strong on equations.

Mathpix Snip / Mathpix CLI, paid service that converts equations from images to LaTeX with very high accuracy. The killer use case: snip equations from a PDF, get clean LaTeX, paste into your new document. Many academics use this routinely.

Adobe Acrobat Pro + manual, export to Word, then rebuild as LaTeX. Slow but produces clean LaTeX because the manual step removes garbage.

tralics, a LaTeX-to-XML translator developed at INRIA and used in its scientific publishing. Like TeX4ht, it starts from LaTeX source, not from a PDF.

A realistic recipe for equation recovery

The most common real need: "I want this equation from this paper as LaTeX so I can use it in mine."

Workflow:

Open the PDF
Use Mathpix Snip (or a similar equation-OCR tool) to capture each equation
Paste the resulting LaTeX into your document
Recompile and verify

This sidesteps the near-impossible problem of full source recovery and covers most of the actual day-to-day need.
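For the "recompile and verify" step, it can help to test each captured equation in isolation before pasting it into the real document. The following is a minimal sketch, assuming the snippet is bare math without its own \begin{equation} wrapper; the function name wrap_equation and the default package list are illustrative choices, not part of any tool.

```python
def wrap_equation(snippet: str, packages: tuple[str, ...] = ("amsmath", "amssymb")) -> str:
    """Return a minimal standalone document containing one display equation.

    Assumes `snippet` is bare math (no surrounding equation environment).
    """
    preamble = "\n".join(f"\\usepackage{{{p}}}" for p in packages)
    return (
        "\\documentclass{article}\n"
        f"{preamble}\n"
        "\\begin{document}\n"
        "\\[\n"
        f"{snippet.strip()}\n"
        "\\]\n"
        "\\end{document}\n"
    )


# Example snippet, standing in for whatever the equation-OCR tool returned.
doc = wrap_equation(r"\int_0^\infty e^{-x^2}\,dx = \frac{\sqrt{\pi}}{2}")
print(doc)
```

Saving the result to a file and running pdflatex on it shows quickly whether the snippet needs a package or macro that the new document does not load yet.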

A realistic recipe for whole-paper conversion

When you really do need a whole paper as LaTeX:

Convert the PDF to Markdown with marker or nougat. See how to convert a PDF to Markdown.
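Use Pandoc to convert Markdown to LaTeX: pandoc -s paper.md -o paper.tex. The -s (standalone) flag adds a preamble so the file can compile on its own; without it, Pandoc writes only the document body.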
Open paper.tex in your favorite editor. Expect heavy cleanup:
- Re-apply your journal's template (\documentclass{...} and packages)
- Restore custom commands and macros
- Fix figure references and labels
- Re-build the bibliography with BibTeX or BibLaTeX
Compile and iterate until it builds.

The end result is LaTeX you can edit, but it is not a literal recreation of the original source. It is your interpretation of how the source might have looked.
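The Pandoc step of this recipe can be scripted. Below is a minimal sketch, assuming Pandoc is installed and on the PATH and that the Markdown file already exists; the function names are illustrative. It uses only long-form Pandoc options (--from, --to, --standalone, --output).

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(markdown_path: str, tex_path: str) -> list[str]:
    """Build the Pandoc call; --standalone adds a preamble so the file can compile."""
    return [
        "pandoc",
        "--from", "markdown",
        "--to", "latex",
        "--standalone",
        "--output", tex_path,
        markdown_path,
    ]


def convert(markdown_path: str, tex_path: str) -> bool:
    """Run Pandoc if it is installed and the input exists; return True on success."""
    if shutil.which("pandoc") is None or not Path(markdown_path).is_file():
        return False
    result = subprocess.run(pandoc_command(markdown_path, tex_path))
    return result.returncode == 0


# Show the command that would be run for the file names used in this guide.
print(" ".join(pandoc_command("paper.md", "paper.tex")))
```

The generated preamble is Pandoc's default template, so it is only a placeholder until the journal's document class and packages are re-applied.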

What converts cleanly and what does not

Reasonably reliable:

Paragraphs of body text
Numbered and bulleted lists
Basic tables (small, no merged cells)
Simple inline math
Section and subsection headings

Hit-or-miss:

Complex tables
Display equations with custom alignment
Multi-column layouts (academic templates)
Figures with sub-figures
Bibliography (you almost always need to rebuild it from a .bib file)

Essentially impossible to recover automatically:

Custom commands and macros (\newcommand{} definitions)
Spacing fine-tuning (\vspace, \hspace adjustments)
Document class and package list
Author-specific style choices

If the source was lost, accept that the "recovered" LaTeX will be a clean reinterpretation, not a forensic restoration.

When you do NOT need to convert to LaTeX

A few cases where the urge to convert is misguided:

Quoting a single passage. Just transcribe it. Faster than any tool.
Re-using equations in a presentation. Most slide tools accept image clipboards. Screenshot the equation and paste.
Sharing a paper. PDFs already work everywhere. The recipient does not need source.
Modernizing a 30-year-old paper. Type it fresh. Faster, and the result is clean.

Sanity-check what the tool returns

A few things to verify in any PDF-to-LaTeX output:

Does it compile? Run pdflatex paper.tex. If it does not compile, the output is a starting point, not a deliverable.
Are equations correct? Compile and compare side by side with the original PDF.
Are references resolved? If the bibliography is broken, citations show as [?].
Are special characters escaped? Underscores, ampersands, and percent signs need escaping in LaTeX. Pipelines that go through Markdown first usually get this right; direct PDF-to-LaTeX tools often do not.
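When a tool leaves special characters unescaped, a small helper can fix plain prose before it goes into the .tex file. This is a minimal sketch with an illustrative function name; it is only meant for ordinary text, because running it over math or existing LaTeX commands would break them.

```python
# Characters LaTeX treats specially in ordinary text, with safe replacements.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def escape_latex(text: str) -> str:
    """Escape special characters one at a time, so nothing is escaped twice."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


print(escape_latex("R&D budget: 50% of total_cost (#3)"))
# prints: R\&D budget: 50\% of total\_cost (\#3)
```

Working character by character avoids the classic bug of chained string replacements, where the backslash inserted by one replacement gets escaped again by the next.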

For more on dealing with scanned source material before conversion, see PDF OCR explained and how to make a PDF searchable with OCR.
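The reference check in this section can be partly automated by scanning the .log file that pdflatex writes. The sketch below assumes the warning wording printed by the standard LaTeX kernel; the sample log text is made up for illustration, and real logs may wrap long lines, which this simple pattern does not handle.

```python
import re

# Matches kernel warnings such as:
#   LaTeX Warning: Citation `key' on page 1 undefined on input line 12.
UNDEFINED = re.compile(
    r"LaTeX Warning: (Reference|Citation) [`']([^']+)' on page \d+ undefined"
)


def undefined_keys(log_text: str) -> dict[str, list[str]]:
    """Collect the unresolved \\ref and \\cite keys mentioned in a LaTeX log."""
    found: dict[str, list[str]] = {"Reference": [], "Citation": []}
    for kind, key in UNDEFINED.findall(log_text):
        if key not in found[kind]:
            found[kind].append(key)
    return found


# Illustrative log excerpt, not output from a real run.
sample_log = """\
LaTeX Warning: Citation `knuth84' on page 1 undefined on input line 12.
LaTeX Warning: Reference `fig:flowchart' on page 2 undefined on input line 40.
LaTeX Warning: There were undefined references.
"""
print(undefined_keys(sample_log))
```

An empty result after a full compile cycle (LaTeX, then BibTeX or Biber, then LaTeX again as needed) is a reasonable sign that citations and cross-references resolve.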

Working with publishers

If you are converting a published paper because the publisher requested a LaTeX submission for a re-issue, ask first: the publisher often has access to the original source. The reproduction problem is much easier when you start from a .zip of .tex + .bib + figures rather than from a PDF.

If the publisher does not have the source, they may provide a template (.cls file). Use that template as your scaffold and pour the converted content into it. The visual result will match the publisher's expectation, even if the underlying LaTeX differs in structure from the original.

Common gotchas

Fonts. PDFs typically embed glyph subsets. The recovered text may contain characters with no obvious LaTeX representation (special ligatures, custom symbols). These show up as boxes or question marks and have to be replaced by hand.
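Part of this cleanup can be scripted. The sketch below, with illustrative function names, expands the five Latin ligature code points that extractors commonly emit and lists characters that probably need manual attention (the Unicode replacement character and private-use code points). Which characters count as suspicious is an assumption; adjust it to the document.

```python
# Unicode Latin ligatures and their plain-letter expansions.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace ligature code points with ordinary letters."""
    return text.translate(_TABLE)


def suspicious_characters(text: str) -> list[tuple[int, str]]:
    """Return (index, code point) for U+FFFD and private-use characters."""
    hits = []
    for i, ch in enumerate(text):
        if ch == "\ufffd" or 0xE000 <= ord(ch) <= 0xF8FF:
            hits.append((i, f"U+{ord(ch):04X}"))
    return hits


# Made-up extracted text containing three ligatures and one unknown glyph.
sample = "The \ufb01rst e\ufb03cient work\ufb02ow \ufffd"
clean = expand_ligatures(sample)
print(clean)
print(suspicious_characters(clean))
```

The explicit table is used instead of blanket Unicode normalization so that characters such as superscript digits are left alone.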

Math notation conventions. Different fields use different LaTeX styles. Physics often uses \vec{} for vectors; mathematics often uses \mathbf{}. Tools cannot tell which to use.
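One way to keep this decision reversible is to route every vector through a single macro and choose the convention once in the preamble. The snippet below is a minimal sketch; the macro name \vect is a hypothetical choice, not something a conversion tool produces.

```latex
\documentclass{article}
\usepackage{amsmath}

% Hypothetical macro name; keep exactly one of the two definitions.
\newcommand{\vect}[1]{\mathbf{#1}}   % bold upright, common in mathematics
% \newcommand{\vect}[1]{\vec{#1}}    % arrow accent, common in physics

\begin{document}
\[
  \vect{F} = m\,\vect{a}
\]
\end{document}
```

After conversion, a search-and-replace from the tool's choice to \vect{} lets you switch styles by editing one line.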

Footnotes and margin notes. Often misplaced after conversion. Manually restore.

Cross-references. Figure~\ref{fig:flowchart} becomes "Figure 3" in the PDF; the recovered LaTeX has "Figure 3" as literal text. You have to re-introduce labels and refs.
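A first pass at re-introducing references can be done with a regular expression. This is a minimal sketch under the assumption that references appear as "Figure N", "Table N", or "Section N" in the recovered text; the label scheme (fig:3, tab:12) is a placeholder, and you still need to add matching \label{} commands and rename them to something meaningful.

```python
import re

# Label prefixes are a common convention, used here as placeholders.
PREFIXES = {"Figure": "fig", "Table": "tab", "Section": "sec"}

# Skip dotted numbers such as "Section 3.2", which need manual handling.
PATTERN = re.compile(r"\b(Figure|Table|Section)\s+(\d+)\b(?!\.\d)")


def restore_refs(text: str) -> str:
    """Turn literal 'Figure 3' into 'Figure~\\ref{fig:3}' with placeholder labels."""
    def replace(match: re.Match) -> str:
        kind, number = match.group(1), match.group(2)
        return f"{kind}~\\ref{{{PREFIXES[kind]}:{number}}}"
    return PATTERN.sub(replace, text)


print(restore_refs("As shown in Figure 3 and Table 12, the method converges."))
# prints: As shown in Figure~\ref{fig:3} and Table~\ref{tab:12}, the method converges.
```

Run it on body text only. Applied to captions, it would turn the caption's own "Figure 3" into a reference to itself.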

Tables. Conversions often produce \begin{tabular}{...} with column specifications that compile but look wrong. Manual cleanup is expected.

Takeaway

Converting PDF to LaTeX is a heuristic process, not an exact reverse. The best uses are extracting equations cleanly (Mathpix Snip, InftyReader) and producing a Markdown intermediate that you then convert to LaTeX and clean up by hand. Whole-paper restoration of a lost source is mostly an act of rewriting using the PDF as a template. Plan time for cleanup, and consider whether you really need LaTeX or whether Markdown or HTML would actually serve the same purpose. For the upstream task of extracting just one section of a long paper before equation snipping, Docento.app lets you split the PDF in the browser first.