
RAG Pipelines: What Actually Works in Production

Building a RAG pipeline that works in a demo is easy. Building one that works reliably in production is a different challenge entirely. Here is what we have learned.

Retrieval-augmented generation has become the default approach for connecting large language models to proprietary data. The basic concept is deceptively simple: retrieve relevant context at query time, inject it into the prompt, and let the model generate a grounded answer. Every demo looks impressive. The problems appear at scale.

We have built RAG systems across collections ranging from a few hundred pages to several million, in domains including legal, technical documentation, financial services, and internal knowledge bases. Here is a detailed look at what actually works in production, and where teams consistently get it wrong.

[Figure: RAG pipeline architecture, from user query to grounded answer]

The Retrieval Problem Is Harder Than the Generation Problem

Most teams spend 80% of their effort on the generation side (prompt engineering, model selection, output formatting) and treat retrieval as a solved problem. This is backwards. In our experience, retrieval quality determines 70-80% of the overall system quality. If you retrieve the wrong chunks, no amount of prompt engineering will save you.

The fundamental challenge is this: your user asks a question in natural language, and your system needs to find the 3-10 most relevant passages from potentially millions of chunks. “Relevant” is context-dependent, ambiguous, and often requires understanding concepts that are not explicitly stated in either the query or the document.

Chunking: The Foundation That Everyone Underestimates

How you split documents into chunks has an outsized impact on retrieval quality, and there is no single best strategy. The right approach depends on your document types, query patterns, and accuracy requirements.

Fixed-Size Chunking

The simplest approach: split text into chunks of N characters or tokens with some overlap. We typically start with 512-token chunks with 50-token overlap as a baseline. This works acceptably for homogeneous document collections where the information density is relatively uniform.

Where it fails: documents with varying structure. A 512-token chunk might cut a table in half, split a definition from its explanation, or combine the conclusion of one section with the introduction of the next. Each of these creates a chunk that is either incomplete or semantically incoherent.
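
As a minimal sketch of the baseline, here is fixed-size chunking with overlap. Tokens are approximated by whitespace-split words for self-containment; a real pipeline would count tokens with the embedding model's own tokenizer.

```python
def fixed_size_chunks(text, chunk_size=512, overlap=50):
    """Split text into overlapping windows of roughly chunk_size tokens.

    Tokens are approximated by whitespace-split words here; swap in the
    embedding model's tokenizer for production use.
    """
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

The overlap exists so that a sentence cut at one boundary survives intact in the neighbouring chunk; larger overlap trades storage for robustness.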

Semantic Chunking

A more sophisticated approach that uses the document’s own structure to determine chunk boundaries. We parse headings, paragraphs, lists, and other structural elements to create chunks that represent complete thoughts or sections.

For structured documents like technical documentation, legal contracts, and policy manuals, semantic chunking consistently outperforms fixed-size chunking by 15-25% on retrieval accuracy benchmarks. The implementation is more complex: you need reliable document parsing and sensible fallback behaviour for unstructured content. But the quality improvement justifies the effort.
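
A simplified sketch of the idea, assuming markdown-style headings mark section boundaries (real documents need a format-specific parser, and unstructured input needs a fixed-size fallback):

```python
import re

def semantic_chunks(markdown_text):
    """Split markdown-style text at heading boundaries, keeping each
    heading together with the body text that follows it."""
    chunks, current = [], []
    for line in markdown_text.splitlines():
        # A new heading closes the previous chunk.
        if re.match(r"^#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]
```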

Hierarchical Chunking

For complex documents, we sometimes implement a two-level chunking strategy. Large chunks (1000-2000 tokens) capture broad context, while smaller child chunks (200-400 tokens) capture specific details. At retrieval time, we search against the smaller chunks for precision, then expand to the parent chunk for context. This approach works particularly well for question-answering over long documents where the answer requires both specific facts and surrounding context.
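
One way to sketch the two-level structure: build large parent chunks, split each into small children, and keep a child-to-parent link so that a precise match on a child can be expanded to its parent at answer time. Word-based splitting stands in for real tokenization.

```python
def hierarchical_chunks(text, parent_size=1500, child_size=300):
    """Build (child_chunk, parent_chunk) pairs.

    Children are what you embed and search for precision; the linked
    parent supplies the surrounding context passed to the model.
    """
    words = text.split()
    pairs = []
    for p_start in range(0, len(words), parent_size):
        parent_words = words[p_start:p_start + parent_size]
        parent = " ".join(parent_words)
        for c_start in range(0, len(parent_words), child_size):
            child = " ".join(parent_words[c_start:c_start + child_size])
            pairs.append((child, parent))
    return pairs
```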

[Figure: Chunking strategy comparison, fixed-size vs semantic vs hierarchical]

What We Actually Do

In practice, we use a combination of strategies. Document type detection determines the primary chunking approach. PDFs with clear headings get semantic chunking. Unstructured text gets fixed-size chunking with larger overlap. Tables are extracted as complete units. Code blocks are never split. Metadata (document title, section heading, page number) is attached to every chunk.
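
A dispatch layer along these lines ties the strategies together. The three-way `kind` classification and the inline splitting logic are illustrative placeholders; the point is the routing, not the specific heuristics.

```python
def chunk_document(text, kind="plain"):
    """Route a document to a chunking strategy by detected kind
    (a hypothetical classification: 'structured', 'table', 'plain')."""
    if kind == "table":
        return [text]                      # tables travel as complete units
    if kind == "structured":
        # Stand-in for semantic chunking: split at blank-line boundaries.
        return [p.strip() for p in text.split("\n\n") if p.strip()]
    # Unstructured text: fixed-size word windows with larger overlap.
    words, out, size, overlap = text.split(), [], 512, 100
    for start in range(0, max(len(words), 1), size - overlap):
        out.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return out
```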

The key insight: chunking is not a one-time configuration. It is an iterative process that you refine based on retrieval quality metrics. We typically go through 3-5 chunking strategy iterations before reaching production-quality retrieval.

Embedding Models: Selection and Trade-offs

The embedding model converts your text chunks and queries into vectors that capture semantic meaning. The choice of model determines the quality of similarity matching and has direct implications for cost, latency, and infrastructure.

What We Evaluate

We evaluate embedding models on five dimensions:

Retrieval accuracy on your specific domain data. General-purpose benchmarks like MTEB are useful for shortlisting but do not predict performance on legal contracts, medical records, or product documentation. We always run domain-specific evaluations using a test set of 50-100 query-answer pairs drawn from your actual data.

Dimensionality affects storage costs and query latency. Models produce vectors ranging from 384 to 3072 dimensions. Higher dimensions generally capture more nuance but increase storage and slow down similarity search. For most use cases, 768-1024 dimensions hit the sweet spot.

Throughput matters for ingestion pipelines. If you are processing millions of chunks, a model that embeds 100 tokens/second versus 10,000 tokens/second makes the difference between a pipeline that runs in minutes versus days.

Multilingual capability if your documents span languages. Some models handle multilingual content natively; others require per-language models.

Context window of the embedding model limits your maximum chunk size. Most models support 512 tokens; newer models support 8192 or more. Longer context windows allow for larger chunks, which can improve retrieval quality for certain document types.
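
The dimensionality trade-off is easy to put numbers on. A rough storage estimate, assuming float32 vectors and ignoring index overhead and replication:

```python
def vector_storage_gb(num_vectors, dims, bytes_per_float=4):
    """Raw embedding storage in GB (float32, no index overhead)."""
    return num_vectors * dims * bytes_per_float / 1e9

# 10 million chunks at 768 vs 3072 dimensions:
small = vector_storage_gb(10_000_000, 768)    # 30.72 GB
large = vector_storage_gb(10_000_000, 3072)   # 122.88 GB
```

Quadrupling dimensions quadruples storage and similarity-search work, which is why 768-1024 dimensions is usually the sweet spot unless the extra nuance measurably improves retrieval on your evaluation set.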

Our Current Recommendations

For most English-language production deployments, we start with models in the E5 or GTE family for open-source options, or Cohere’s embed models or OpenAI’s text-embedding-3 family for API-based options. The specific choice depends on whether you need to run the model on your own infrastructure (data privacy requirements) or can use an API.

For domain-specific applications where accuracy is critical, we often fine-tune an open-source embedding model on your data. Fine-tuning requires a relatively small dataset (1,000-5,000 positive pairs) and typically yields a 10-20% improvement in retrieval accuracy on domain-specific queries. The investment pays for itself quickly in domains like legal or medical where precision matters.

Vector Databases: More Than Just Storage

The vector database stores your embeddings and handles similarity search at query time. The choice of platform depends on your scale, latency requirements, deployment model, and operational complexity budget.

Platform Options

Pinecone is our default recommendation for teams that want a managed service with minimal operational overhead. Serverless architecture, automatic scaling, solid performance up to tens of millions of vectors. The trade-off is vendor lock-in and limited control over infrastructure.

Weaviate is our choice when teams need more control or want to run on their own infrastructure. Strong hybrid search capabilities (vector + keyword), good multi-tenancy support, and an active community. Operational complexity is moderate.

pgvector is increasingly viable for teams already running PostgreSQL. It avoids introducing a new infrastructure component and works well for collections up to a few million vectors. Beyond that scale, dedicated vector databases offer better performance. We use pgvector for prototypes and smaller production deployments where adding another database would be overkill.

Qdrant has become our recommendation for high-performance requirements. Written in Rust, excellent query latency, and strong filtering capabilities. We use it for applications where sub-50ms retrieval is required.

Pure vector similarity search (finding the K nearest neighbours to the query vector) is a starting point, not a complete retrieval strategy. Production systems benefit from several additional techniques:

Hybrid search combines vector similarity with traditional keyword matching (BM25). This catches cases where semantic similarity misses exact terminology. A query for “SOC 2 Type II compliance requirements” needs exact term matching as much as semantic understanding. We typically weight the combination 70% vector, 30% keyword, adjusted per domain.
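
Score fusion for hybrid search can be sketched as a weighted sum, assuming both retrievers have already normalized their scores to [0, 1] (raw BM25 and cosine scores live on different scales and must be normalized first):

```python
def hybrid_scores(vector_hits, keyword_hits, alpha=0.7):
    """Fuse vector and keyword (BM25-style) scores per document id.

    Both inputs map doc_id -> score normalized to [0, 1]; alpha weights
    the vector side (0.7 here, tuned per domain).
    """
    fused = {}
    for doc_id in set(vector_hits) | set(keyword_hits):
        v = vector_hits.get(doc_id, 0.0)
        k = keyword_hits.get(doc_id, 0.0)
        fused[doc_id] = alpha * v + (1 - alpha) * k
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

Reciprocal rank fusion is a common alternative that sidesteps score normalization entirely by combining ranks instead of scores.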

Metadata filtering narrows the search space before running similarity search. If the user is asking about a specific contract, filter to that contract’s chunks first, then run similarity search within that subset. This improves both accuracy and latency.
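
A filter-then-rank sketch, assuming chunks are dicts carrying a `meta` map and a `vector` (a real vector database applies the filter inside the index rather than in Python):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filtered_search(index, query_vec, metadata_filter, top_k=10):
    """Narrow by metadata first, then rank only the surviving chunks."""
    candidates = [c for c in index
                  if all(c["meta"].get(k) == v
                         for k, v in metadata_filter.items())]
    return sorted(candidates,
                  key=lambda c: cosine(query_vec, c["vector"]),
                  reverse=True)[:top_k]
```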

Re-ranking uses a cross-encoder model to re-score the top candidates from the initial retrieval. Cross-encoders are more accurate than bi-encoders (embedding models) because they process the query and document together, but they are too slow to run against the full collection. Running a re-ranker over the top 20-50 candidates typically improves precision by 10-15%.
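
The two-stage shape looks like this, with `cross_encoder_score` as a stand-in for a real cross-encoder model call (a hosted re-rank API or a local model); the sketch only captures the score-the-shortlist-jointly structure:

```python
def rerank(query, candidates, cross_encoder_score, top_n=5):
    """Second-stage re-ranking: score each (query, chunk) pair jointly.

    `cross_encoder_score` is a placeholder for a cross-encoder model;
    it must return higher scores for more relevant chunks.
    """
    scored = [(cross_encoder_score(query, chunk), chunk)
              for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]
```

Because the expensive scorer only sees the 20-50 shortlisted chunks, the extra latency stays bounded regardless of collection size.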

Evaluation: The Non-Negotiable Discipline

You cannot improve what you do not measure. Every production RAG system needs an evaluation framework that runs continuously, not just during development.

What We Measure

Retrieval recall: Of the relevant chunks in the collection, what percentage does the retrieval system find? We measure this using a curated test set of queries with known relevant passages.

Retrieval precision: Of the chunks retrieved, what percentage are actually relevant? High recall with low precision means the model is drowning in noise.

Answer accuracy: Does the generated answer correctly reflect the information in the retrieved context? We use a combination of automated evaluation (using a separate LLM as a judge) and periodic human evaluation.

Faithfulness: Does the model’s answer stay grounded in the retrieved context, or does it hallucinate additional information? This is the most critical metric for trust.

Latency: End-to-end response time broken down by retrieval, re-ranking, and generation components.
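
The retrieval metrics above reduce to simple set arithmetic per query, given a curated set of relevant chunk ids:

```python
def retrieval_metrics(retrieved, relevant):
    """Recall and precision for a single query.

    `retrieved`: chunk ids the system returned;
    `relevant`: the curated set of ids known to answer the query.
    """
    hits = len(set(retrieved) & set(relevant))
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision
```

Averaging these over the whole test set gives the recall@K and precision@K figures tracked across pipeline changes.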

How We Build Evaluation Pipelines

We maintain a test set of 100-500 query-answer pairs per domain, with the relevant source passages identified. This test set is the most valuable artifact in the entire RAG system: it is what allows you to make changes with confidence.

Every change to chunking strategy, embedding model, retrieval parameters, or prompt template triggers an automated evaluation run against this test set. Results are tracked over time so we can detect regressions immediately.

We also implement production monitoring that samples live queries and evaluates answer quality using an LLM judge. This catches distribution shift: users start asking questions that your test set does not cover.

Common Failure Modes

After building numerous RAG systems, these are the failure modes we see most frequently:

Chunk boundaries destroying context. The answer to the user’s question spans two chunks, and neither chunk alone contains enough information. Fix: improve chunking strategy, increase overlap, or implement hierarchical retrieval.

Embedding model weak on domain terminology. The model does not understand that “AEM dispatcher” and “CDN caching layer” are related concepts in the Adobe ecosystem. Fix: fine-tune embeddings or add synonym expansion to the query.

Insufficient retrieval context. Retrieving 3 chunks when the answer requires information from 7 different sections. Fix: increase K, implement iterative retrieval, or use query decomposition.

Over-reliance on vector similarity. The most semantically similar chunk is not always the most useful chunk. A passage that defines a term might be highly similar to a query about that term, but the user actually wants the passage that describes how to configure it. Fix: hybrid search, query classification, and diverse retrieval strategies.

The Bottom Line

RAG is not a plug-and-play solution. It is an engineering discipline that requires thoughtful design, rigorous testing, and continuous refinement. But when done well, it gives you something valuable: AI systems that actually know your business, grounded in your data, with answers you can trace back to specific sources.

The teams that succeed with RAG are the ones that treat it as a search engineering problem first and a language model problem second. Get the retrieval right, measure relentlessly, and the generation side largely takes care of itself.

ai 15 April 2026