
Building Production RAG Systems with LangChain: Lessons Learned

Growthnix Team

October 28, 2025

What Is RAG and Why Does It Matter?

Retrieval-Augmented Generation (RAG) is a technique that gives large language models access to your private data. Instead of relying solely on the model's training data, a RAG system retrieves relevant documents from your knowledge base and includes them in the prompt. This allows the LLM to answer questions about your specific business, products, or domain accurately and with up-to-date information.

RAG has become the standard approach for building AI-powered search, customer support chatbots, internal knowledge assistants, and document analysis tools. It is simpler and cheaper than fine-tuning, and it works with any LLM provider. But building a RAG system that actually works well in production is significantly harder than the tutorials suggest.

The Production RAG Architecture

Document Ingestion Pipeline

The first step is getting your documents into a searchable format. This involves loading documents from various sources (PDFs, web pages, databases, APIs), splitting them into chunks of appropriate size, generating embeddings for each chunk, and storing them in a vector database. We use LangChain's document loaders and text splitters for ingestion, OpenAI's text-embedding-3-small for embeddings, and Supabase with pgvector for storage.
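The four stages above can be sketched as one small pipeline. This is an illustrative, stdlib-only sketch, not the actual implementation: `load_documents` is a hypothetical stub standing in for LangChain's document loaders, the splitter is a deliberately naive fixed-size one (structure-aware chunking is covered in the next section), and `embed` replaces a real model like text-embedding-3-small with a deterministic hash-based pseudo-vector so the example runs anywhere.

```python
import hashlib

def load_documents():
    # Hypothetical loader stub; in production this would pull from
    # PDFs, web pages, databases, or APIs via document loaders.
    return [{"title": "Refund Policy",
             "text": "Refunds are issued within 14 days. " * 20}]

def split_into_chunks(text, size=400, overlap=80):
    # Naive fixed-size splitter with overlap, for illustration only.
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def embed(chunk):
    # Stand-in for a real embedding model: a deterministic
    # pseudo-vector derived from a SHA-256 hash of the chunk.
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    return [b / 255 for b in digest[:8]]

def ingest(store):
    # Load -> split -> embed -> store, the shape of any RAG ingestion run.
    for doc in load_documents():
        for i, chunk in enumerate(split_into_chunks(doc["text"])):
            store.append({
                "embedding": embed(chunk),
                "text": chunk,
                "metadata": {"title": doc["title"], "chunk": i},
            })
    return store

vector_store = ingest([])  # in production: a table in Supabase/pgvector
```

In a real system each stage is swapped for the production component (loaders, a structure-aware splitter, an embedding API, pgvector), but the data flow stays exactly this shape.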

Chunking Strategy

Chunking is the most underrated part of RAG. Poor chunking leads to poor retrieval, which leads to poor answers. We have learned several lessons the hard way:

  • Chunk size matters enormously. Too small (under 200 tokens) and you lose context. Too large (over 1000 tokens) and you get irrelevant noise. We typically use 500-800 tokens with 100-token overlap.
  • Respect document structure. Split on section headers and paragraph boundaries, not arbitrary character counts. LangChain's RecursiveCharacterTextSplitter handles this well.
  • Include metadata. Every chunk should carry metadata about its source — document title, section heading, page number, and date. This metadata is essential for filtering and citation.
  • Handle tables and structured data separately. Standard text splitting destroys tables. Extract tables as structured data and store them with special handling.
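The first three rules above can be combined into a small splitter. This is a minimal sketch, not LangChain's RecursiveCharacterTextSplitter: it measures characters rather than tokens, treats blank lines as the only structural boundary, seeds each new chunk with the tail of the previous one for overlap, and attaches source metadata to every chunk.

```python
def chunk_document(text, title, max_chars=800, overlap_chars=100):
    """Pack paragraphs into chunks up to max_chars, splitting only on
    paragraph boundaries, with a character-based overlap between chunks."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            # Carry the tail of the previous chunk forward as overlap
            current = current[-overlap_chars:] + "\n\n" + para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    # Every chunk carries metadata for filtering and citation
    return [{"text": c, "metadata": {"title": title, "chunk": i}}
            for i, c in enumerate(chunks)]
```

A production splitter would also fall back to sentence and word boundaries when a single paragraph exceeds the chunk size, and count tokens with the embedding model's tokenizer instead of characters.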

Retrieval Strategy

Naive vector similarity search — the approach most tutorials teach — works for demos but struggles in production. Real-world queries are ambiguous, use different terminology than your documents, and often require information from multiple document sections. Here is what we do instead:

  • Hybrid search: Combine vector similarity with keyword search (BM25). This catches cases where exact terminology matters, like product names or error codes.
  • Query transformation: Use the LLM to rewrite user queries before searching. A vague question like "how do I fix the billing issue" becomes a specific search query targeting your documentation.
  • Reranking: Retrieve more documents than needed (top 20), then use a reranking model to select the most relevant 3-5. This dramatically improves precision.
  • Multi-query retrieval: Generate multiple search queries from a single user question and merge the results. This improves recall for complex questions.
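The merge step behind hybrid and multi-query retrieval is commonly done with reciprocal rank fusion (RRF), which scores each document by its rank in every result list. Below is a minimal sketch with made-up document ids; the `k=60` constant is the conventional RRF smoothing value, not something specific to our stack.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked result lists (e.g. vector search and BM25) by
    reciprocal-rank score: documents ranked highly in any list rise."""
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc3", "doc1", "doc7", "doc2"]   # similarity order
keyword_hits = ["doc1", "doc9", "doc3", "doc4"]  # BM25 order
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# doc1 and doc3, found by both searches, rise to the top of the fused list
```

The fused list is what you would hand to a reranking model: retrieve broadly, fuse, then rerank down to the 3-5 chunks that actually enter the prompt.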

Common Production Pitfalls

Hallucination Despite RAG

RAG reduces hallucination but does not eliminate it. The LLM can still generate plausible-sounding answers that are not supported by the retrieved documents. Mitigate this by including explicit instructions in your system prompt: "Only answer based on the provided context. If the context does not contain the answer, say you do not know." Also implement citation — require the model to reference specific document sections in its answer.
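Both mitigations — the grounding instruction and required citations — can be wired in at the prompt-assembly step. A minimal sketch (the function names and bracket-citation format are illustrative choices, not a standard):

```python
def build_grounded_prompt(question, chunks):
    """Label each retrieved chunk so the model can cite it, and
    instruct the model to refuse when the context has no answer."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    system = (
        "Only answer based on the provided context. "
        "If the context does not contain the answer, say you do not know. "
        "Cite the sources you used as [1], [2], ..."
    )
    return system, f"Context:\n{context}\n\nQuestion: {question}"

def cites_sources(answer, num_chunks):
    # Cheap guard: flag answers that reference no context chunk at all.
    return any(f"[{i + 1}]" in answer for i in range(num_chunks))
```

The `cites_sources` check is deliberately crude — it only proves the model mentioned a source, not that the claim is supported — but it catches the worst case of a confident answer with zero grounding.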

Stale Data

Your knowledge base changes. Documents are updated, policies change, products evolve. Build an automated re-ingestion pipeline that detects changes to source documents and updates the vector store. We use N8N workflows that monitor source systems and trigger re-ingestion when content changes.
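The change-detection core of such a pipeline is simple: fingerprint each source document and re-ingest only the ones whose fingerprint moved. A stdlib sketch (the surrounding scheduling — in our case N8N workflows — is out of scope here):

```python
import hashlib

def content_fingerprint(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_changed_docs(sources, known_fingerprints):
    """Compare each source document's hash against the stored value;
    return the ids that need re-ingestion and update the records."""
    changed = []
    for doc_id, text in sources.items():
        fp = content_fingerprint(text)
        if known_fingerprints.get(doc_id) != fp:
            changed.append(doc_id)
            known_fingerprints[doc_id] = fp
    return changed
```

On each run, only the returned ids go back through the ingestion pipeline — which also means deleting their old chunks from the vector store before inserting the new ones, or you will retrieve a mix of stale and fresh content.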

Performance at Scale

A RAG system with 10,000 documents performs very differently from one with 10 million. At scale, you need efficient indexing (use HNSW indexes in pgvector), query caching for common questions, and semantic routing to narrow the search space before retrieval. Monitor your p95 latency and set performance budgets.
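Of those three levers, query caching is the easiest to add. A minimal sketch — keying on a whitespace-and-case-normalized query with a TTL so answers can expire; a production version would key on an embedding to also catch paraphrases:

```python
import time

class QueryCache:
    """Cache answers for repeated questions, keyed on a normalized
    form of the query, with a time-to-live so entries go stale."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self._store = {}

    @staticmethod
    def _key(query):
        # Collapse case and whitespace so trivially different phrasings hit
        return " ".join(query.lower().split())

    def get(self, query):
        entry = self._store.get(self._key(query))
        if entry and time.monotonic() - entry[1] < self.ttl:
            return entry[0]
        return None

    def put(self, query, answer):
        self._store[self._key(query)] = (answer, time.monotonic())
```

Even this naive cache removes an embedding call, a vector search, and an LLM call for every repeated question — often a large fraction of support traffic.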

Building Your First Production RAG System

Start with a focused knowledge base — your product documentation or FAQ, not your entire company wiki. Use LangChain for orchestration, a quality embedding model, and Supabase with pgvector for storage. Implement hybrid search from the beginning. Build an evaluation framework that tests retrieval quality against a set of known question-answer pairs. Measure, iterate, and expand the knowledge base gradually. RAG done well is a significant competitive advantage — it turns your private data into an intelligent, accessible system.
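The evaluation framework can start as small as a single metric: the fraction of known questions whose expected document appears in the top-k retrieved results. A sketch, with `retrieve` standing in for whatever retrieval function your pipeline exposes:

```python
def retrieval_hit_rate(eval_set, retrieve, k=5):
    """Fraction of (question, expected_doc) pairs for which the
    expected document appears in the top-k retrieved results."""
    hits = 0
    for question, expected_doc in eval_set:
        if expected_doc in retrieve(question)[:k]:
            hits += 1
    return hits / len(eval_set)
```

Run it on every chunking or retrieval change: if the hit rate drops, the change regressed retrieval regardless of how the answers "feel". Recall-style metrics like this pair naturally with the reranking precision work described above.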

Tagged with

LangChain · RAG · Vector Database · AI · NLP · Supabase · pgvector

Ready to put this into practice?

We help businesses implement the strategies and tools we write about. Let's talk about your project.