The Problem RAG Solves
LLMs have a knowledge cutoff and no access to your proprietary data. Fine-tuning adds your data to the model weights — but it is expensive, slow to update, and prone to forgetting. Retrieval-Augmented Generation (RAG) takes a different approach: at query time, retrieve the relevant documents from your knowledge base and inject them into the LLM's context. The model reasons over current, accurate data — not stale training weights.
This is how you build AI that knows your product documentation, your codebase, your customer history, and your internal policies — and stays up to date without retraining.
How RAG Works
User Query
│
▼
Embed query → vector [0.12, -0.87, 0.34, ...]
│
▼
Vector DB similarity search → top-k relevant chunks
│
▼
Inject chunks into LLM prompt as context
│
▼
LLM generates answer grounded in retrieved documents

The quality of your RAG system depends on three things: chunking strategy, embedding model quality, and retrieval precision.
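The retrieval half of the pipeline above can be sketched in a few lines. This is an illustrative in-memory version: `topK` and `buildPrompt` are hypothetical names, the embedding step is assumed to have happened already, and a real system would use a vector database instead of a linear scan.

```typescript
type Doc = { content: string; embedding: number[] }

// Cosine similarity between two embedding vectors
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Vector DB similarity search step: return the k most similar docs
function topK(queryEmbedding: number[], docs: Doc[], k: number): Doc[] {
  return [...docs]
    .sort((x, y) =>
      cosineSimilarity(queryEmbedding, y.embedding) -
      cosineSimilarity(queryEmbedding, x.embedding))
    .slice(0, k)
}

// Prompt-injection step: ground the LLM in the retrieved chunks
function buildPrompt(query: string, chunks: Doc[]): string {
  const context = chunks.map((c) => c.content).join("\n---\n")
  return `Answer using only this context:\n${context}\n\nQuestion: ${query}`
}
```

The final prompt then goes to the LLM as usual; the model never sees the full corpus, only the top-k chunks.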
Choosing a Vector Database
pgvector (PostgreSQL extension)
The pragmatic choice for teams already on PostgreSQL. Adds vector similarity search with `<->` (L2 distance), `<=>` (cosine distance), and `<#>` (negative inner product) operators. No new infrastructure. Scales to tens of millions of vectors comfortably.
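At that scale a sequential scan is too slow, so you add an approximate-nearest-neighbor index. A sketch, assuming pgvector 0.5 or later and the `documents` table defined below (`ivfflat` is the older alternative):

```sql
-- HNSW index for cosine-distance queries (the <=> operator)
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
```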
CREATE TABLE documents (
id bigserial PRIMARY KEY,
content text,
embedding vector(1536)
);
-- Find 5 most similar documents
SELECT content, embedding <=> $1 AS distance
FROM documents
ORDER BY embedding <=> $1
LIMIT 5;
Pinecone
Managed, purpose-built vector database. Handles billions of vectors, real-time upserts, and metadata filtering out of the box. The right choice when pgvector's performance hits its ceiling or when you need serverless vector search with zero ops overhead.
Weaviate
Open-source, self-hostable, with built-in support for hybrid search (vector + BM25 keyword). Excellent for use cases where keyword matching alongside semantic similarity matters — legal document search, technical documentation, compliance queries.
Production Chunking Strategy
How you split documents matters more than which vector database you choose.
// Naive chunking — splits mid-sentence, loses context
const chunks = document.split(/\n{2,}/) // paragraph split
// Better: recursive character splitting with overlap
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter"
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 512,
chunkOverlap: 64, // overlap preserves context across chunk boundaries
separators: ["\n\n", "\n", ". ", " ", ""],
})
const chunks = await splitter.createDocuments([documentText])
Rule of thumb: chunks should be semantically complete thoughts. A chunk of 512 tokens with a 64-token overlap works well for most prose documents.
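The overlap mechanic itself is simple. A character-level sketch of the idea (illustrative only: the real splitter also breaks preferentially at the separators listed above rather than at fixed offsets):

```typescript
// Slide a window of chunkSize characters, stepping by (chunkSize - overlap),
// so each chunk repeats the tail of the previous one.
function chunkWithOverlap(text: string, chunkSize: number, overlap: number): string[] {
  const chunks: string[] = []
  const step = chunkSize - overlap
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize))
    if (start + chunkSize >= text.length) break // last window reached the end
  }
  return chunks
}
```

The overlapping tail is what lets a fact that straddles a chunk boundary still appear whole in at least one chunk.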
Hybrid Search: When Pure Vector Is Not Enough
Vector search finds semantically similar content. Keyword search finds exact matches. For product names, error codes, and proper nouns — keyword search wins. For conceptual questions — vector wins.
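The usual fusion is a weighted blend of the two scores, which is what the alpha parameter in the Weaviate example below controls. A minimal sketch, assuming both scores have already been normalized to [0, 1] (real engines also handle normalization and result deduplication):

```typescript
// alpha = 0 means keyword only, alpha = 1 means vector only
function hybridScore(vectorScore: number, keywordScore: number, alpha: number): number {
  return alpha * vectorScore + (1 - alpha) * keywordScore
}
```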
Hybrid search combines both:
// Weaviate hybrid search
const result = await client.graphql
.get()
.withClassName('Document')
.withHybrid({ query: userQuery, alpha: 0.75 }) // 0 = keyword only, 1 = vector only
.withLimit(5)
.do()
Evaluating RAG Quality
The most common failure mode: high retrieval recall, poor answer quality. The retrieved chunks are relevant but the LLM still hallucinates. Evaluate each component separately:
- Retrieval precision: Are the top-k chunks actually relevant? (human eval or LLM-as-judge)
- Answer faithfulness: Is the answer grounded in the retrieved context? (check with Ragas, TruLens)
- Answer relevance: Does the answer address the question asked?
Deploy evaluation as part of your CI pipeline — catch retrieval regressions before they reach production.
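For the retrieval-precision check, the CI assertion can be as simple as precision@k against a small hand-labeled query set; faithfulness and relevance are what tools like Ragas and TruLens cover. A sketch, where chunk ids and the labeled relevant set are hypothetical:

```typescript
// Fraction of the top-k retrieved chunk ids that are in the labeled relevant set
function precisionAtK(retrievedIds: string[], relevantIds: Set<string>, k: number): number {
  const topk = retrievedIds.slice(0, k)
  if (topk.length === 0) return 0
  const hits = topk.filter((id) => relevantIds.has(id)).length
  return hits / topk.length
}
```

A CI job runs this per labeled query and fails the build if the average drops below a threshold you pick, which is what catches retrieval regressions before deployment.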