What is RAG?
RAG (Retrieval-Augmented Generation) combines retrieval, most often vector search, with large language models (LLMs). Instead of relying only on the model's training data, RAG retrieves relevant passages from your documents and includes them in the prompt.
RAG Flow
1. The user asks a question.
2. The system searches a vector database for the most relevant document chunks.
3. The retrieved chunks are added to the prompt as context.
4. The LLM generates an answer using that context.
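The four steps above can be sketched as a single function. This is a minimal illustration, not a library API. `rag_answer` is a hypothetical name. `retrieve` and `generate` are stand-ins that you would replace with a real vector search and a real LLM call.

```python
from typing import Callable, List


def rag_answer(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str], str],
    top_k: int = 5,
) -> str:
    # 1. The user's question arrives as `question`
    # 2. Retrieve the most relevant document chunks
    chunks = retrieve(question, top_k)
    # 3. Add the retrieved context to the prompt
    context = "\n\n".join(chunks)
    prompt = f"Answer based on the following context:\n{context}\n\nQuestion: {question}"
    # 4. The LLM generates an answer using that context
    return generate(prompt)


# Hypothetical stand-ins so the sketch runs without external services
docs = ["Refunds are accepted within 30 days.", "Shipping takes 5 days."]


def fake_retrieve(question: str, k: int) -> List[str]:
    # Naive keyword match standing in for vector search
    return [d for d in docs if "refund" in d.lower()][:k]


def echo_llm(prompt: str) -> str:
    # Returns the prompt unchanged so you can inspect what would be sent to the LLM
    return prompt


print(rag_answer("What is our refund policy?", fake_retrieve, echo_llm))
```

Passing the retriever and the LLM in as functions keeps the flow independent of any particular vector database or model provider.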
Why Use RAG?
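- Private data: LLMs can answer questions about your own documents
- Up-to-date: New information only needs to be indexed, not trained into the model
- Reduced hallucination: Answers are grounded in retrieved sources (this reduces hallucination but does not eliminate it)
- Transparency: Answers can cite the sources they were based on
- Cost-effective: Usually cheaper than fine-tuning when the goal is adding knowledge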
Building a RAG System
Step 1: Prepare Documents
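```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split documents into chunks (here with LangChain's recursive splitter, one common option)
# Typical chunk size: 500-1000 characters
# Include overlap: 50-100 characters
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)  # documents: a list of LangChain Document objects
```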
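To see what chunk size and overlap mean in practice, here is a minimal fixed-size character chunker. `chunk_text` is a hypothetical helper written for illustration, not part of any library. The sketch implements the simple "fixed size" strategy; library splitters usually also try to respect paragraph and sentence boundaries.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> List[str]:
    """Split text into fixed-size character chunks; consecutive chunks share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
for c in chunk_text(sample, chunk_size=12, overlap=4):
    print(c)
```

Each chunk starts `chunk_size - overlap` characters after the previous one. A sentence that straddles a boundary therefore appears whole in at least one chunk, provided the sentence is no longer than the overlap.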
Step 2: Create Embeddings
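```python
from sentence_transformers import SentenceTransformer

# Generate embeddings for each chunk's text (the model name is one common choice)
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode([chunk.page_content for chunk in chunks])
# Store in vector database (vector_db is a placeholder; keep text + metadata for citations)
vector_db.add(embeddings, metadata=chunks)
```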
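The sketch below shows what a vector database does with those embeddings. It is a toy in-memory store that uses brute-force cosine similarity. `InMemoryVectorStore` is a hypothetical class for illustration only. Real vector databases typically use approximate nearest-neighbour indexes to stay fast at scale. The 2-dimensional vectors below stand in for real embedding-model output.

```python
import math
from typing import Any, Dict, List, Sequence, Tuple


class InMemoryVectorStore:
    """Toy vector store: keeps vectors in a list and does brute-force cosine search."""

    def __init__(self) -> None:
        self._vectors: List[List[float]] = []
        self._metadata: List[Dict[str, Any]] = []

    def add(self, vectors: Sequence[Sequence[float]], metadata: Sequence[Dict[str, Any]]) -> None:
        if len(vectors) != len(metadata):
            raise ValueError("need one metadata dict per vector")
        for vec, meta in zip(vectors, metadata):
            self._vectors.append([float(x) for x in vec])
            self._metadata.append(dict(meta))

    @staticmethod
    def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def search(self, query_vector: Sequence[float], top_k: int = 5) -> List[Tuple[float, Dict[str, Any]]]:
        # Score every stored vector against the query, highest similarity first
        scored = [(self._cosine(query_vector, v), m) for v, m in zip(self._vectors, self._metadata)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


store = InMemoryVectorStore()
store.add(
    [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]],
    [{"text": "refund policy"}, {"text": "shipping times"}, {"text": "returns and refunds"}],
)
for score, meta in store.search([1.0, 0.1], top_k=2):
    print(f"{score:.3f}  {meta['text']}")
```

Storing the chunk text and metadata next to each vector lets a search hit be turned straight back into prompt context and citations.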
Step 3: Query
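```python
# User question
query = "What is our refund policy?"
# Embed the question with the same model used for the chunks, then search
query_embedding = embedding_model.encode(query)
results = vector_db.search(query_embedding, top_k=5)  # placeholder vector_db API
# Build prompt from the retrieved chunk texts, not the raw result objects
context = "\n\n".join(result.page_content for result in results)  # assumes results are the stored chunks
prompt = f"""Answer based on the following context:
{context}

Question: {query}
"""
# Get LLM response (llm is a placeholder for your model client)
answer = llm.generate(prompt)
```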
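To support citations, one option is to number each retrieved chunk and show its source in the prompt. This is a sketch. `build_prompt` and the `text`/`source` keys are hypothetical, and the instruction wording is one reasonable choice rather than a required format.

```python
from typing import Dict, List


def build_prompt(question: str, chunks: List[Dict[str, str]]) -> str:
    """Number each retrieved chunk and show its source so the model can cite it."""
    context = "\n\n".join(
        f"[{i}] (source: {chunk['source']})\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite sources by their [number]. If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


retrieved = [
    {"text": "Refunds are available within 30 days of purchase.", "source": "policies.pdf p.2"},
    {"text": "Store credit is offered after 30 days.", "source": "faq.md"},
]
print(build_prompt("What is our refund policy?", retrieved))
```

Telling the model to admit when the context lacks the answer is a common way to discourage it from falling back on its training data.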
Chunking Strategies
| Strategy | Description | Best For |
|---|---|---|
| Fixed size | Split every N characters | Simple, uniform text |
| Sentence | Split at sentence boundaries | Preserving context |
| Recursive | Try paragraph, then sentence, then character | Most documents |
| Semantic | Split where meaning changes | Complex documents |
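The "Recursive" strategy can be sketched in a few lines. The splitter tries paragraph breaks first, then sentence breaks, then spaces, and hard-splits by character only as a last resort. `recursive_split` is a hypothetical helper; LangChain's `RecursiveCharacterTextSplitter` is a production implementation of the same idea. For simplicity, this sketch drops the separator that falls at a chunk boundary and adds no overlap.

```python
from typing import List, Tuple


def recursive_split(
    text: str, chunk_size: int, separators: Tuple[str, ...] = ("\n\n", ". ", " ")
) -> List[str]:
    """Split on the coarsest separator first; only pieces still too long fall through to finer ones."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Last resort: hard split every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)  # the separator at this boundary is dropped
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph is short.\n\nSecond paragraph is a bit longer. It has two sentences."
for chunk in recursive_split(sample, chunk_size=40):
    print(repr(chunk))
```

The short first paragraph stays intact. The longer second paragraph is split at its sentence boundary instead of mid-word.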
Best Practices
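- Chunk overlap: 10-20% overlap reduces the chance of splitting related content across a chunk boundary
- Metadata: Store source, page number, and date with each chunk so answers can cite them
- Reranking: Re-score the top retrieved chunks with a reranker (such as a cross-encoder) before building the prompt
- Hybrid search: Combine vector search with keyword search (such as BM25) to catch exact terms, names, and IDs
- Evaluation: Measure retrieval quality (for example, whether the relevant chunks appear in the top k results), not just final answers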
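Two of these practices can be sketched briefly. Reciprocal rank fusion (RRF) is one common way to merge vector and keyword results for hybrid search, though not the only one. Recall@k is one simple retrieval metric for evaluation. The function names and document IDs are hypothetical, and `k=60` is a commonly used default for the RRF constant.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs; each list contributes 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


vector_hits = ["doc3", "doc1", "doc7"]   # e.g. from vector similarity search
keyword_hits = ["doc1", "doc9", "doc3"]  # e.g. from BM25 keyword search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['doc1', 'doc3', 'doc9', 'doc7']
print(recall_at_k(fused, relevant={"doc1", "doc9"}, k=2))  # 0.5
```

RRF uses only rank positions, so it can combine result lists whose raw scores (cosine similarity vs. BM25) are on different scales.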
Popular RAG Tools
- LangChain: Python framework with RAG components
- LlamaIndex: Data framework for LLM apps
- Haystack: End-to-end NLP framework
- Vercel AI SDK: TypeScript toolkit for building AI apps, including RAG
Common Challenges
Poor Retrieval
Solution: Better chunking, hybrid search, reranking
Lost in the Middle
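LLMs tend to use information at the start and end of a long prompt more reliably than information in the middle. Put the most relevant chunks first.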
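To take advantage of both ends of the prompt, you can place the highest-scoring chunks at the start and end and push the weakest into the middle. This is a sketch of that reordering. `reorder_for_long_context` and the `(score, text)` input format are hypothetical.

```python
from typing import List, Sequence, Tuple


def reorder_for_long_context(scored_chunks: Sequence[Tuple[float, str]]) -> List[str]:
    """Place the highest-scoring chunks at the start and end, weakest in the middle."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    front: List[str] = []
    back: List[str] = []
    for i, (_, text) in enumerate(ranked):
        # Alternate: best -> front, 2nd best -> back, 3rd -> front, ...
        (front if i % 2 == 0 else back).append(text)
    return front + back[::-1]


chunks = [(0.9, "A"), (0.8, "B"), (0.7, "C"), (0.6, "D"), (0.5, "E")]
print(reorder_for_long_context(chunks))  # ['A', 'C', 'E', 'D', 'B']
```

The best chunk ("A") comes first, the second best ("B") comes last, and the weakest ("E") sits in the middle.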
Context Length
The relevant chunks may not all fit in the model's context window. Summarize or filter chunks before adding them to the prompt.
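Filtering can be as simple as greedily keeping the best-scoring chunks that fit within a token budget. This is a sketch with hypothetical helper names. The "about 4 characters per token" estimate is only a rough heuristic for English text. In practice, count tokens with your model's own tokenizer (for example, tiktoken for OpenAI models).

```python
from typing import List, Sequence, Tuple


def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token for English text (heuristic only)."""
    return max(1, len(text) // 4)


def select_within_budget(scored_chunks: Sequence[Tuple[float, str]], max_tokens: int) -> List[str]:
    """Greedily keep the highest-scoring chunks that still fit in the token budget."""
    selected: List[str] = []
    used = 0
    for _, text in sorted(scored_chunks, key=lambda pair: pair[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost > max_tokens:
            continue  # skip this chunk; a shorter, lower-scoring one may still fit
        selected.append(text)
        used += cost
    return selected


chunks = [(0.9, "x" * 400), (0.8, "y" * 800), (0.7, "z" * 200)]
print([c[0] for c in select_within_budget(chunks, max_tokens=200)])  # ['x', 'z']
```

The 800-character chunk would push the total over budget, so it is skipped, while the smaller, lower-scoring chunk still fits.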