Building a RAG System With PDF Knowledge Bases

Retrieval-augmented generation (RAG) is the dominant pattern for getting an LLM to answer questions over your own documents. PDFs are among the most common source formats in real-world RAG, and they are also the trickiest. This guide walks through the full pipeline for a PDF-backed RAG system: extraction, chunking, embedding, retrieval, and answer synthesis, plus the failure modes that bite first-time builders.

What RAG actually is

A RAG pipeline answers a question in two steps. First, it retrieves a small number of relevant passages from a large corpus. Second, it asks an LLM to answer the question using those passages as context. The retrieval step keeps the prompt small and grounded; the generation step produces fluent natural-language answers.

The alternative, dumping the whole corpus into a long-context model, works while the corpus fits in the context window (on the order of a couple hundred thousand tokens) but does not scale to a real document library. RAG scales to millions of pages while keeping the prompt size roughly constant.

Step 1: extract text from PDFs

The quality of your RAG system is capped by extraction quality. Garbage in, garbage out.

For native-text PDFs (most modern reports, contracts, and articles), text extraction is fast and high-fidelity. Open-source choices include pdfplumber, PyMuPDF, and pdfminer.six in Python; pdf-parse and pdfjs-dist (PDF.js) in Node. For tables, pdfplumber and Camelot do better than naive extraction.

For scanned PDFs, you need OCR first. See how to make a PDF searchable (OCR) for the basics and PDF OCR explained for the deeper mechanics. Modern OCR (Tesseract 5, AWS Textract, Google Document AI) is reliable for printed text and improving fast for handwriting.

A robust extraction step:

Detect whether each page has a text layer.
OCR pages without one.
Keep page numbers and bounding boxes alongside the text. You will need them for citations.
Preserve heading hierarchy where possible (font size or PDF structure tree).
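
The first three steps above can be sketched as a small routing function. This is a minimal sketch, not a definitive implementation: `extract_text` and `run_ocr` are hypothetical callables standing in for whatever extractor and OCR engine you use, the 20-character threshold for "has a text layer" is an assumption to tune, and bounding boxes are left out to keep it short.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PageText:
    page_number: int  # 1-based page number, kept for citations
    text: str
    source: str       # "text_layer" or "ocr"


def extract_pages(
    num_pages: int,
    extract_text: Callable[[int], str],  # hypothetical: native text for a page
    run_ocr: Callable[[int], str],       # hypothetical: OCR text for a page
    min_chars: int = 20,                 # assumed threshold, tune per corpus
) -> List[PageText]:
    pages: List[PageText] = []
    for page_number in range(1, num_pages + 1):
        text = extract_text(page_number).strip()
        if len(text) >= min_chars:
            pages.append(PageText(page_number, text, "text_layer"))
        else:
            # Too little native text: treat the page as scanned and OCR it.
            pages.append(PageText(page_number, run_ocr(page_number).strip(), "ocr"))
    return pages


# Toy usage with in-memory stand-ins for a real extractor and OCR engine.
native = {1: "Quarterly revenue grew by eight percent year over year.", 2: ""}
scanned = {2: "Signed and dated by both parties."}
pages = extract_pages(2, lambda n: native.get(n, ""), lambda n: scanned.get(n, ""))
```

Recording the `source` per page makes it easy to audit later whether bad answers cluster on OCR'd pages.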

Step 2: chunk intelligently

The most common mistake in early RAG builds is fixed-size chunking. A 500-token sliding window splits paragraphs mid-sentence and destroys the semantic coherence that retrieval depends on.

Better approaches:

Heading-aware chunking. Split on H2/H3 boundaries, then sub-chunk only if the section is too long.
Sentence-respecting windows. Use a tokenizer that snaps to sentence boundaries.
Hierarchical chunks. Store both a small "leaf" chunk (300 tokens) and the larger parent section (1,500 tokens). Retrieve on the leaf, supply the parent to the LLM.
Overlap. A 50 to 100 token overlap between adjacent chunks reduces boundary issues.
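
Sentence-respecting windows with overlap can be sketched in a few lines. This is an illustration under stated assumptions: token counts are approximated by whitespace-separated words, and the regex sentence splitter is deliberately naive (it will mis-split abbreviations such as "e.g."). A real build would use the embedding model's tokenizer and a proper sentence segmenter.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after ., !, or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(text: str, max_tokens: int = 300, overlap_tokens: int = 50) -> List[str]:
    # Token counts are approximated by word counts (an assumption).
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward until the overlap budget is used.
            carried: List[str] = []
            carried_len = 0
            for prev in reversed(current):
                prev_len = len(prev.split())
                if carried_len + prev_len > overlap_tokens:
                    break
                carried.insert(0, prev)
                carried_len += prev_len
            current, current_len = carried, carried_len
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
chunks = chunk_sentences(text, max_tokens=6, overlap_tokens=3)
```

With these toy limits, each chunk holds two sentences and repeats the last sentence of the previous chunk, so no sentence is ever cut in half.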

Each chunk should carry metadata: source document, page, section heading, position, and a stable chunk ID.
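
One way to carry that metadata is a small record type with an ID derived from the content. This is a sketch: the field names and the choice of a truncated SHA-256 are assumptions, and the file name in the usage lines is made up.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str    # source document identifier
    page: int      # page the chunk starts on
    heading: str   # nearest section heading
    position: int  # index of the chunk within the document
    text: str

    @property
    def chunk_id(self) -> str:
        # Hash of document, position, and text: the same input always yields
        # the same ID, and the ID changes when the chunk's content changes.
        key = f"{self.doc_id}|{self.position}|{self.text}".encode("utf-8")
        return hashlib.sha256(key).hexdigest()[:16]


a = Chunk("contract-2024.pdf", 3, "Termination", 12, "Either party may terminate.")
b = Chunk("contract-2024.pdf", 3, "Termination", 12, "Either party may terminate.")
c = Chunk("contract-2024.pdf", 3, "Termination", 12, "Neither party may terminate.")
```

Because the ID is a pure function of the content, re-running ingest on an unchanged document produces the same IDs, which keeps citations and evaluation sets valid across runs.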

Step 3: choose an embedding model

Embeddings turn each chunk into a vector. Similar text has similar vectors. The choices in 2026:

OpenAI text-embedding-3-large: high quality, hosted, costs money per token.
Cohere Embed v3: similar tier; good multilingual support.
Voyage AI: strong on retrieval benchmarks; recommended by Anthropic for use with Claude.
BGE, E5, GTE (open source): run locally; competitive on English benchmarks.
Multilingual variants (multilingual-e5-large, bge-m3): use these if your corpus has multiple languages.

Match the embedding model to your corpus. A finance corpus benefits from a model trained or fine-tuned on financial text; multilingual corpora need multilingual embeddings.
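
"Similar text has similar vectors" is usually measured with cosine similarity. The sketch below uses hand-made three-dimensional vectors purely for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for the embeddings of three short texts.
invoice = [0.9, 0.1, 0.0]
billing = [0.8, 0.2, 0.1]
weather = [0.0, 0.1, 0.9]

related = cosine_similarity(invoice, billing)
unrelated = cosine_similarity(invoice, weather)
```

Here `related` comes out far higher than `unrelated`, which is the property retrieval relies on.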

Step 4: store in a vector database

You need a vector index that supports approximate nearest-neighbor search. Options:

pgvector (PostgreSQL extension): great if you already use Postgres.
Qdrant, Weaviate, Milvus: dedicated vector DBs; rich filtering and hybrid search.
Pinecone: hosted; minimal ops; pay per index.
LanceDB, Chroma: lightweight, embeddable; good for prototypes.
Elasticsearch / OpenSearch: solid if you want BM25 alongside vectors.

For most real-world RAG, hybrid search (BM25 plus vectors) beats pure vector search. Keyword matches catch exact terms (product codes, names, citations) that semantic search may miss.
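
One common way to combine a BM25 ranking with a vector ranking is reciprocal rank fusion (RRF). The article does not prescribe a fusion method, so treat this as one option among several; `k = 60` is a conventional default rather than a tuned value, and the chunk IDs are made up.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    # Each ranking is a list of chunk IDs, best first.
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), then by ID so ties are deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


bm25_hits = ["c7", "c2", "c9"]    # keyword ranking (made-up IDs)
vector_hits = ["c2", "c4", "c7"]  # embedding ranking (made-up IDs)
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

RRF only needs ranks, not raw scores, which sidesteps the problem that BM25 scores and cosine similarities live on different scales.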

Step 5: retrieve

The retrieval step takes a user question, embeds it, and queries the vector index for the top K most similar chunks. Typical K is 10 to 50 for the initial retrieve, then a rerank step narrows to 3 to 8 for the final prompt.

Rerankers are cross-encoder models that read the query and a chunk together and score the pair, which is slower per pair but more accurate than the bi-encoder used at first-stage retrieval. Cohere Rerank, Voyage Rerank, and BGE rerankers all work well. Reranking tends to reduce hallucinations because the final context is more relevant.

Add filters where you have them: document type, date range, author, tag. Restricting the search to recent documents with a date filter often beats relying on general relevance alone.
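
The filter, retrieve, rerank flow can be sketched with pluggable scoring functions. Everything here is illustrative: `first_stage_score` and `rerank_score` are hypothetical callables standing in for a bi-encoder and a cross-encoder, and the toy scorers in the usage lines (word overlap and exact-phrase match) exist only to make the example runnable.

```python
from typing import Any, Callable, Dict, List, Optional

Chunk = Dict[str, Any]  # e.g. {"id": ..., "text": ..., "year": ...}


def retrieve(
    query: str,
    chunks: List[Chunk],
    first_stage_score: Callable[[str, Chunk], float],  # cheap scorer
    rerank_score: Callable[[str, Chunk], float],       # slower, more accurate scorer
    keep: Optional[Callable[[Chunk], bool]] = None,    # metadata filter
    initial_k: int = 20,
    final_k: int = 5,
) -> List[Chunk]:
    candidates = [c for c in chunks if keep is None or keep(c)]
    candidates.sort(key=lambda c: first_stage_score(query, c), reverse=True)
    shortlist = candidates[:initial_k]
    shortlist.sort(key=lambda c: rerank_score(query, c), reverse=True)
    return shortlist[:final_k]


def word_overlap(query: str, chunk: Chunk) -> float:
    return float(len(set(query.lower().split()) & set(chunk["text"].lower().split())))


def phrase_match(query: str, chunk: Chunk) -> float:
    return 1.0 if query.lower() in chunk["text"].lower() else 0.0


corpus = [
    {"id": "a", "text": "refund policy for annual plans", "year": 2021},
    {"id": "c", "text": "policy on refund requests changed", "year": 2025},
    {"id": "b", "text": "annual plans refund policy update", "year": 2025},
    {"id": "d", "text": "office holiday schedule", "year": 2025},
]
top = retrieve(
    "refund policy", corpus, word_overlap, phrase_match,
    keep=lambda c: c["year"] >= 2024, initial_k=3, final_k=2,
)
```

In this toy run the date filter removes chunk "a", the first stage cannot separate "b" from "c", and the second stage promotes "b" because it contains the exact phrase.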

Step 6: synthesize the answer

The final prompt sends the question plus the retrieved chunks to an LLM. A good system prompt:

Specifies that answers must be grounded in the provided context.
Tells the model to say "I do not know" if the context is insufficient.
Asks for citations (chunk IDs or page numbers).
Specifies the format (bullet points, paragraph, JSON).

For citations, include the chunk metadata in the prompt and ask the model to reference it. Verify in post-processing that every claim has a citation tied to a real chunk.
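
A minimal sketch of both halves, building the prompt and checking citations afterwards, might look like this. The prompt wording, the `[chunk-id]` citation convention, and the dictionary keys are all assumptions; the sentence check also assumes the model places the citation before the closing period.

```python
import re
from typing import Dict, List, Set


def build_prompt(question: str, chunks: List[Dict[str, str]]) -> str:
    context = "\n\n".join(
        f"[{c['id']}] (page {c['page']})\n{c['text']}" for c in chunks
    )
    return (
        "Answer using only the context below. Cite the chunk ID in square "
        "brackets after each claim. If the context is insufficient, say "
        '"I do not know".\n\n'
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def unknown_citations(answer: str, chunks: List[Dict[str, str]]) -> Set[str]:
    # IDs cited in the answer that do not match any chunk in the prompt.
    cited = set(re.findall(r"\[([^\[\]]+)\]", answer))
    return cited - {c["id"] for c in chunks}


def uncited_sentences(answer: str) -> List[str]:
    # Sentences that carry no [id] marker at all.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [s for s in sentences if not re.search(r"\[[^\[\]]+\]", s)]


chunks = [{"id": "c1", "page": "4", "text": "The notice period is 30 days."}]
prompt = build_prompt("What is the notice period?", chunks)
answer = "The notice period is 30 days [c1]. Renewal is automatic [c9]. Fees are waived."
bad_ids = unknown_citations(answer, chunks)
missing = uncited_sentences(answer)
```

This check only confirms that each citation points at a real chunk. Whether the chunk actually supports the claim is a separate question, covered by the citation-accuracy metric in the evaluation section.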

Handling tables and figures

Tables in PDFs are the single biggest source of RAG failure. A table that looks fine visually often extracts as a jumbled run of cells. Three options:

Extract table-as-Markdown with pdfplumber or camelot and embed the Markdown.
Use a vision-language model to extract structured table data per page (slower, costs more, much better quality).
Render the page as an image and use a multimodal LLM at query time for table-heavy questions.
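
For the first option, the conversion to Markdown is simple once the table has been extracted as rows of cells. The sketch below assumes input shaped like what pdfplumber's `extract_table()` returns (a list of rows, each a list of strings or `None`), treats the first row as the header, and assumes every row has the same number of cells.

```python
from typing import List, Optional


def table_to_markdown(rows: List[List[Optional[str]]]) -> str:
    def clean(cell: Optional[str]) -> str:
        # Empty cells become "", line breaks inside a cell collapse to spaces,
        # and pipes are escaped so they do not break the column layout.
        return " ".join((cell or "").split()).replace("|", "\\|")

    header, *body = [[clean(c) for c in row] for row in rows]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


rows = [["Region", "Q1\nRevenue"], ["EMEA", "1.2M"], ["APAC", None]]
markdown = table_to_markdown(rows)
```

Keeping the header row inside every table chunk matters: a row of numbers without its column names is close to meaningless to both the embedding model and the LLM.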

For figures and diagrams, multimodal embeddings or page-image embeddings work better than text-only pipelines. See multimodal LLMs and PDF documents.

Evaluation

Without evaluation you are flying blind. Build a small set of 50 to 200 representative questions with expected answers. Track:

Retrieval recall: does the correct chunk appear in the top K?
Answer correctness: does the LLM produce the right answer?
Faithfulness: does the answer stick to the retrieved context?
Citation accuracy: do the cited chunks actually support the claim?

Frameworks like RAGAS, TruLens, and DeepEval automate parts of this. Track scores over time as you change models, chunkers, and prompts.
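
Retrieval recall is the easiest of these metrics to compute yourself. A minimal sketch, assuming each question in the evaluation set has exactly one expected chunk ID (the question and chunk IDs below are made up):

```python
from typing import Dict, List


def recall_at_k(retrieved: Dict[str, List[str]], expected: Dict[str, str], k: int) -> float:
    # Fraction of questions whose expected chunk ID appears in the top k results.
    hits = sum(1 for q, gold in expected.items() if gold in retrieved.get(q, [])[:k])
    return hits / len(expected)


expected = {"q1": "c3", "q2": "c8", "q3": "c1", "q4": "c5"}
retrieved = {
    "q1": ["c3", "c9", "c2"],
    "q2": ["c4", "c8", "c7"],
    "q3": ["c6", "c2", "c1"],
    "q4": ["c9", "c7", "c2"],
}
recall_1 = recall_at_k(retrieved, expected, k=1)  # only q1 hits at rank 1
recall_3 = recall_at_k(retrieved, expected, k=3)  # q1, q2, q3 hit; q4 misses
```

A large gap between recall at the initial K and recall at the final K points at the reranker; low recall at the initial K points at chunking or embeddings.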

Common gotchas

Confusing chunks. Two chunks may have near-identical embeddings (a table of contents and the section it references). Use deduplication.
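
Deduplication can be done on embeddings or on the text itself. The sketch below uses a text-based technique, Jaccard similarity over word 3-grams (shingles), because it needs no model; the 0.8 threshold is an assumption, and the pairwise comparison is quadratic, so it suits small corpora or per-document checks.

```python
from typing import List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, n: int = 3) -> Set[Shingle]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def deduplicate(chunks: List[str], threshold: float = 0.8) -> List[str]:
    kept: List[str] = []
    kept_shingles: List[Set[Shingle]] = []
    for chunk in chunks:
        s = shingles(chunk)
        # Keep the chunk only if it is not too similar to anything kept so far.
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(chunk)
            kept_shingles.append(s)
    return kept


chunks = [
    "Payment is due within 30 days of invoice.",
    "payment is due within 30 days of invoice.",
    "Late fees accrue at 2 percent per month.",
]
unique = deduplicate(chunks)
```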

Stale data. A document may be replaced by a new version. Without a versioning strategy, your RAG system keeps answering from the old one.
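
One simple versioning strategy is to store a content hash per document and re-index when it changes. This is a sketch under assumptions: `indexed` is a hypothetical mapping from document ID to the hash of the version currently in the index, and the byte strings stand in for real file contents.

```python
import hashlib
from typing import Dict


def content_hash(pdf_bytes: bytes) -> str:
    return hashlib.sha256(pdf_bytes).hexdigest()


def needs_reindex(doc_id: str, pdf_bytes: bytes, indexed: Dict[str, str]) -> bool:
    # True for new documents and for documents whose bytes have changed.
    return indexed.get(doc_id) != content_hash(pdf_bytes)


indexed = {"policy.pdf": content_hash(b"version 1")}
unchanged = needs_reindex("policy.pdf", b"version 1", indexed)
changed = needs_reindex("policy.pdf", b"version 2", indexed)
brand_new = needs_reindex("handbook.pdf", b"version 1", indexed)
```

When a document does need re-indexing, delete its old chunks by document ID before inserting the new ones, so both versions never sit in the index together.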

PII leakage. RAG returns whatever is in the chunks. If a contract contains personal data, the LLM sees it. Redact at ingest if needed. See how to redact text in a PDF.

Permission boundaries. A user should only retrieve from documents they are authorized to read. Build filters into the retrieval step, not just the UI.
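
The idea can be sketched as a filter over chunk metadata. The `allowed_groups` field and the group names are assumptions for illustration; in a real deployment the same condition would be pushed into the vector database query as a metadata filter, so unauthorized chunks never leave the store.

```python
from typing import Any, Dict, List, Set


def authorized_chunks(chunks: List[Dict[str, Any]], user_groups: Set[str]) -> List[Dict[str, Any]]:
    # Keep a chunk only if the user belongs to at least one group on its ACL.
    return [c for c in chunks if user_groups & set(c["allowed_groups"])]


index = [
    {"id": "c1", "text": "Public holiday calendar", "allowed_groups": ["everyone"]},
    {"id": "c2", "text": "Executive compensation table", "allowed_groups": ["hr", "board"]},
]
visible = authorized_chunks(index, user_groups={"everyone", "engineering"})
```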

Costs that creep. Embedding 100,000 pages once is cheap; re-embedding the whole corpus every time you change models or chunking strategy adds up. Plan for it.

Hallucinations under low retrieval. When retrieval returns nothing relevant, models often invent. The system prompt should explicitly instruct the model to say "I do not know" in that case.
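
A complementary guard is to abstain in code before the model is ever called. In this sketch, `generate` is a hypothetical stand-in for the LLM call, and the 0.5 relevance threshold is an assumption that should be tuned against your evaluation set.

```python
from typing import Callable, List, Tuple


def answer_or_abstain(
    scored_chunks: List[Tuple[str, float]],  # (chunk text, relevance score)
    generate: Callable[[List[str]], str],    # hypothetical LLM call
    min_score: float = 0.5,                  # assumed threshold, tune on your eval set
) -> str:
    context = [text for text, score in scored_chunks if score >= min_score]
    if not context:
        # Nothing relevant was retrieved: abstain without calling the model.
        return "I do not know."
    return generate(context)


def fake_llm(context: List[str]) -> str:
    return f"Answer based on {len(context)} chunk(s)."


weak = answer_or_abstain([("unrelated text", 0.12)], fake_llm)
strong = answer_or_abstain([("relevant text", 0.91), ("noise", 0.2)], fake_llm)
```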

Frameworks

You do not need to build from scratch. Popular RAG frameworks:

LangChain: broad, mature, sometimes over-abstracted.
LlamaIndex: data-first; strong PDF parsers and chunking strategies.
Haystack: production-leaning; good evaluation tooling.
DSPy: declarative; powerful for optimizing pipelines.

Use a framework when prototyping, then drop into raw API calls for the parts that need control.

When not to use RAG

RAG is the right answer when your corpus is too large to fit in a context window, when documents change frequently, or when you need citations. It is the wrong answer when:

The corpus fits comfortably in a long-context window (under 200K tokens).
The question requires reasoning across the entire corpus, not retrieval of specific passages.
Latency budgets are sub-100ms (RAG adds a retrieval round-trip).

For small corpora, a long-context model with the whole document inline often outperforms RAG with less plumbing.

Practical recipe

To stand up a PDF RAG system in a week:

Days 1 to 2: extraction pipeline. Native text plus OCR fallback. Store text with page metadata.
Days 2 to 3: chunking strategy. Heading-aware where possible; overlap; metadata.
Day 3: embeddings and vector store. Pick one off the shelf; do not over-engineer.
Day 4: retrieval plus rerank. Implement hybrid search.
Day 5: generation prompt with citations.
Days 6 to 7: evaluation set; measure; iterate.

Takeaway

RAG over PDFs is now standard practice in 2026. The pipeline is conceptually simple but each stage has sharp edges, especially extraction and chunking. Get the document processing right and the rest is mostly engineering. For browser-based PDF tasks that often feed into RAG ingest (signing, redacting, splitting documents before indexing), Docento.app handles the common operations without uploading to a server. For related deep-dives see chat with your PDF library, extracting tables from PDFs with AI, and AI data extraction from PDFs.