How to Batch Process Many PDFs at Once

April 7, 2026

Sooner or later, anyone who works with PDFs hits the wall of bulk operations. Compress 200 invoices for storage. Watermark every report from last quarter. Merge a folder of scans into a single PDF per supplier. Doing these one at a time turns a five-minute job into a five-hour one. Batch processing turns five hours back into five minutes.

When batching is worth the setup time

Batch processing has a one-time cost (write the script or set up the workflow) and a per-file cost. As a rule of thumb, if the operation will run on more than 20-30 files, batching wins. Below that, doing it by hand is usually faster.
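The break-even point can be estimated with a little arithmetic. The sketch below is illustrative only: the function name and the timings (30 minutes of setup, 90 seconds per file by hand, 5 seconds per file scripted) are assumptions, not measurements.

```python
import math


def break_even_files(setup_s: float, manual_s: float, batch_s: float) -> int:
    """Smallest file count at which batching is faster than manual work overall."""
    saved_per_file = manual_s - batch_s
    if saved_per_file <= 0:
        raise ValueError("batching must be faster per file than manual work")
    # Batching wins once setup_s + n * batch_s < n * manual_s,
    # i.e. once n > setup_s / saved_per_file.
    return math.floor(setup_s / saved_per_file) + 1


# Assumed numbers: 30 min of setup, 90 s per file by hand, 5 s per file scripted.
print(break_even_files(setup_s=30 * 60, manual_s=90, batch_s=5))  # prints 22
```

With these assumed timings the crossover lands at 22 files, which is in line with the 20-30 rule of thumb. A longer setup or a faster manual step moves it up.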

Things people usefully batch:

  • Compression of monthly statement PDFs.
  • Watermarking of all client deliverables.
  • OCR of every scan from the last year.
  • Merging of multi-file scans into per-document files.
  • Splitting of statements into per-month files.
  • Renaming based on PDF content.
  • Metadata stripping before external delivery.
  • Conversion of every Word document in a folder to PDF.

Method 1: Command-line tools

The fastest, most reliable batching uses standard PDF command-line tools in a shell loop:

  • qpdf — split, merge, encrypt, decrypt, repair.
  • Ghostscript — compress, convert formats, fix corrupted PDFs.
  • mutool (mupdf) — convert, extract, clean.
  • pdftk — older but solid for merge, split, watermark.
  • pdfcpu — modern Go-based tool, fast and feature-rich.
  • poppler-utils (pdftotext, pdfimages, pdfinfo) — text and image extraction.
  • pdftoppm — convert pages to images.
  • tesseract — OCR.

Example: compress every PDF in a folder.

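# Ghostscript does not create the output folder, so make it first.
mkdir -p compressed
for f in *.pdf; do
  [ -e "$f" ] || continue  # skip the literal "*.pdf" when nothing matches
  gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.7 \
     -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH \
     -sOutputFile="compressed/$f" "$f"
done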

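Example: OCR every scanned PDF in a folder with ocrmypdf, which uses tesseract as its OCR engine.

# Create the output folder, then write one searchable PDF per scan.
mkdir -p ocr
for f in scans/*.pdf; do
  ocrmypdf "$f" "ocr/${f##*/}"
done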

This pattern (loop over files, run a tool on each) covers most batch needs.

Method 2: Python with pypdf or pdfplumber

For anything beyond simple loops — conditional logic, data extraction, custom output naming — Python wins:

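from pathlib import Path

from pypdf import PdfReader, PdfWriter

out_dir = Path("compressed")
out_dir.mkdir(exist_ok=True)  # open() will not create the folder

for pdf_path in Path("invoices").glob("*.pdf"):
    reader = PdfReader(pdf_path)
    writer = PdfWriter()
    for page in reader.pages:
        # Add the page first: compression must run on the writer's copy.
        new_page = writer.add_page(page)
        new_page.compress_content_streams()  # CPU-intensive
    with open(out_dir / pdf_path.name, "wb") as out:
        writer.write(out)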

The advantage: full programmatic control. Read PDF text, extract a date or invoice number, rename the file based on content, branch on size, log results to a CSV. All practical with pypdf.
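As a minimal sketch of the rename-by-content idea, the helper below turns already-extracted text into a filename. The invoice format (`INV-` followed by digits), the ISO date format, and the name `name_from_text` are hypothetical choices for illustration. Real documents will need their own patterns.

```python
import re

# Hypothetical invoice-number format for illustration: "INV-" followed by digits.
INVOICE_RE = re.compile(r"\bINV-(\d{4,})\b")
# ISO-style dates such as 2026-04-07.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def name_from_text(text: str, fallback: str) -> str:
    """Build 'YYYY-MM-DD_INV-1234.pdf' from extracted text, or return fallback."""
    invoice = INVOICE_RE.search(text)
    date = DATE_RE.search(text)
    if not (invoice and date):
        return fallback  # keep the original name rather than guess
    return f"{date.group(0)}_INV-{invoice.group(1)}.pdf"


print(name_from_text("Invoice INV-00417 issued 2026-03-31", "scan_0001.pdf"))
print(name_from_text("no identifiers here", "scan_0002.pdf"))
```

In a real batch, the text would come from something like `PdfReader(pdf_path).pages[0].extract_text()`. Falling back to the original name keeps files that the patterns miss from being renamed incorrectly.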

Method 3: PowerShell on Windows

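Windows users often prefer PowerShell. With a command-line tool such as qpdf or pdftk-server on the PATH (or a .NET PDF library such as iTextSharp), similar batching is straightforward. This example uses qpdf to linearize each PDF, which optimizes it for web viewing rather than compressing it:

# Create the output folder, then linearize every PDF in the current folder.
New-Item -ItemType Directory -Force -Path linearized | Out-Null
Get-ChildItem -Filter *.pdf | ForEach-Object {
  & qpdf --linearize $_.FullName "linearized/$($_.Name)"
}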

For Windows-specific merge examples, see how to merge PDFs on Windows.

Method 4: Cloud batch services

When the batch is huge (10,000+ files), cloud services scale better than a laptop:

  • AWS Lambda + S3 trigger: every PDF uploaded to S3 triggers a Lambda that processes it. Pay per invocation.
  • Google Cloud Functions + Cloud Storage trigger.
  • Azure Functions + Blob Storage trigger.

These work especially well for OCR and conversion, where each file is independent and processing time per file is non-trivial.
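For the S3-triggered variant, the function receives an event that lists the uploaded objects. The sketch below only shows how the bucket and key might be pulled out of that event. The `print` call stands in for real processing, and the names `pdf_objects` and `handler` are illustrative.

```python
from urllib.parse import unquote_plus


def pdf_objects(event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs for PDF objects named in an S3 event."""
    found = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded in S3 notifications (spaces become '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.lower().endswith(".pdf"):
            found.append((bucket, key))
    return found


def handler(event, context):
    """Lambda-style entry point; the print is a placeholder for real work."""
    items = pdf_objects(event)
    for bucket, key in items:
        print(f"would process s3://{bucket}/{key}")
    return {"processed": len(items)}
```

Filtering on the `.pdf` extension inside the function is a safety net. The trigger itself can usually be restricted to a suffix as well.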

Privacy note: these services see every file. For sensitive documents, prefer on-device batching. See our note on browser-based privacy.

Method 5: Browser-based batch

A growing number of browser tools handle small-to-medium batches in-browser, with no uploads. Docento.app supports batch operations directly in the browser — drop a folder, choose an operation, run. Useful when:

  • You don't want to install command-line tools.
  • The files are confidential and shouldn't leave the device.
  • You only have access to a Chromebook, library computer, or work laptop with locked-down software installation.

Naming output files sensibly

The hardest part of batching, in practice, is naming the output. Common patterns:

  • Mirror the source name: input/foo.pdf → output/foo.pdf. Simple, doesn't help you find anything.
  • Add a suffix: input/foo.pdf → output/foo-compressed.pdf. Makes the operation visible.
  • Date-prefixed: 2026-04-07/foo.pdf. Useful for periodic batches.
  • Content-based: extract a date or invoice number from the PDF and use it. The most useful, the most fragile if your extraction logic misses a few files.

For long-term storage, content-based naming pays off enormously. See how to organise digital documents.
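The suffix and date-prefix patterns are simple to express with `pathlib`. This is a small sketch. The helper names `suffixed` and `date_prefixed` are made up for the example.

```python
from datetime import date
from pathlib import Path


def suffixed(src: Path, out_dir: Path, suffix: str) -> Path:
    """input/foo.pdf -> out_dir/foo-<suffix>.pdf"""
    return out_dir / f"{src.stem}-{suffix}{src.suffix}"


def date_prefixed(src: Path, out_dir: Path, day: date) -> Path:
    """input/foo.pdf -> out_dir/YYYY-MM-DD/foo.pdf"""
    return out_dir / day.isoformat() / src.name


src = Path("input/foo.pdf")
print(suffixed(src, Path("output"), "compressed").as_posix())
print(date_prefixed(src, Path("output"), date(2026, 4, 7)).as_posix())
```

Keeping the naming rule in one small function means the whole batch changes consistently when the convention changes.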

Logging and error handling

A batch over a thousand files will hit edge cases:

  • Corrupted PDFs that crash the tool. Wrap the call in try/except, log the failure, continue.
  • Password-protected PDFs. Detect with pdfinfo, skip or log.
  • Empty files masquerading as PDFs. Same.
  • Permission errors on locked output directories.

Always log to a file: which input was processed, what the output was, success or failure, time taken. When something goes wrong, the log tells you which 12 of 1,000 files need attention.
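One way to combine these habits is a small runner that wraps each file in try/except and writes one CSV row per input. This is a sketch under assumptions: `process` is whatever per-file function you supply, and the `%PDF-` header check is deliberately strict (some readers tolerate a few leading bytes before the header).

```python
import csv
import time
from pathlib import Path
from typing import Callable


def looks_like_pdf(path: Path) -> bool:
    """Cheap pre-check: file is non-empty and starts with the %PDF- header."""
    with path.open("rb") as fh:
        return fh.read(5) == b"%PDF-"


def run_batch(inputs: list[Path], out_dir: Path,
              process: Callable[[Path, Path], object], log_path: Path) -> int:
    """Run process(src, dst) on each input and log one CSV row per file.

    Returns the number of failures.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    failures = 0
    with log_path.open("w", newline="") as fh:
        log = csv.writer(fh)
        log.writerow(["input", "output", "status", "seconds", "error"])
        for src in inputs:
            dst = out_dir / src.name
            start = time.perf_counter()
            try:
                if not looks_like_pdf(src):
                    raise ValueError("missing %PDF- header")
                process(src, dst)
                status, error = "ok", ""
            except Exception as exc:  # one bad file must not stop the batch
                status, error = "failed", repr(exc)
                failures += 1
            elapsed = time.perf_counter() - start
            log.writerow([str(src), str(dst), status, f"{elapsed:.3f}", error])
    return failures
```

After a run, filtering the CSV on `status == "failed"` gives the short list of files that need attention.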

Idempotency

A good batch job can be run twice with the same result. Practical habits:

  • Skip files where the output already exists.
  • If the output exists but is older than the input, regenerate.
  • Log a checksum so you can verify nothing changed unexpectedly.

This lets you re-run a batch after fixing a bug without redoing the work that succeeded.
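These habits might be sketched with two small helpers: one decides whether a file needs (re)processing based on modification times, the other computes a checksum for the log. The function names are illustrative.

```python
import hashlib
from pathlib import Path


def needs_processing(src: Path, dst: Path) -> bool:
    """True if dst is missing or older than src."""
    if not dst.exists():
        return True
    return dst.stat().st_mtime < src.stat().st_mtime


def sha256_of(path: Path) -> str:
    """Hex SHA-256 of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `needs_processing(src, dst)` at the top of the loop and skipping when it returns `False` is enough to make most batch scripts safe to re-run.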

Performance

For big batches, performance matters:

  • Parallel execution: GNU parallel, xargs -P, or Python's multiprocessing. PDF tools are mostly single-threaded, so processing several files at once can give 4-8x speedups on a multi-core CPU.
  • Avoid recompression. If the operation is "merge then compress," combine into one step instead of two.
  • Profile the slow step. Sometimes 90% of the runtime is one specific tool; replacing it (e.g., Ghostscript with qpdf for non-compression jobs) cuts runtime hugely.
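For parallel execution from Python, one option is a thread pool that launches one external command per file. Threads are enough here because the heavy work happens in the child processes. This is a sketch: `run_parallel` and `build_cmd` are names invented for the example, and the worker count is an assumption to tune per machine.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Sequence


def run_parallel(paths: Sequence[Path],
                 build_cmd: Callable[[Path], list[str]],
                 workers: int = 4) -> dict[Path, int]:
    """Run one external command per file, `workers` at a time.

    Returns each path mapped to its command's exit code.
    """
    def run_one(path: Path) -> int:
        # The child process does the work, so a thread per job suffices.
        return subprocess.run(build_cmd(path), capture_output=True).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(run_one, paths))
    return dict(zip(paths, codes))
```

A `build_cmd` for the qpdf case could return `["qpdf", "--linearize", str(p), f"out/{p.name}"]`. Any non-zero exit code in the result marks a file to retry or inspect.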

Conclusion

Batching turns repetitive PDF work from hours into minutes. Command-line tools handle the simple cases. Python handles the complex ones. Browser tools handle the privacy-sensitive ones. Docento.app supports browser-based batch with no uploads, useful when the files are confidential or you can't install other tools. For specific operations to batch, see our guides on splitting, merging, OCR, and compression.
