The Problem RAG Solves
LLMs have a knowledge cutoff and no access to your proprietary data. Fine-tuning adds your data to the model weights — but it is expensive, slow to update, and prone to forgetting. Retrieval-Augmented Generation (RAG) takes a different approach: at query time, retrieve the relevant documents from your knowledge base and inject them into the LLM's context. The model reasons over current, accurate data — not stale training weights.
This is how you build AI that knows your product documentation, your codebase, your customer history, and your internal policies — and stays up to date without retraining.
How RAG Works
User Query
│
▼
Embed query → vector [0.12, -0.87, 0.34, ...]
│
▼
Vector DB similarity search → top-k relevant chunks
│
▼
Inject chunks into LLM prompt as context
│
▼
LLM generates answer grounded in retrieved documents

The quality of your RAG system depends on three things: chunking strategy, embedding model quality, and retrieval precision.
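The retrieval half of the pipeline above can be sketched in a few lines. This is an illustrative in-memory version: `topK` and `buildPrompt` are hypothetical names, the embedding step is assumed to have happened already, and a real system would use a vector database instead of a linear scan.

```typescript
type Doc = { content: string; embedding: number[] }

// Cosine similarity between two embedding vectors
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Vector DB similarity search step: return the k most similar docs
function topK(queryEmbedding: number[], docs: Doc[], k: number): Doc[] {
  return [...docs]
    .sort((x, y) =>
      cosineSimilarity(queryEmbedding, y.embedding) -
      cosineSimilarity(queryEmbedding, x.embedding))
    .slice(0, k)
}

// Prompt-injection step: ground the LLM in the retrieved chunks
function buildPrompt(query: string, chunks: Doc[]): string {
  const context = chunks.map((c) => c.content).join("\n---\n")
  return `Answer using only this context:\n${context}\n\nQuestion: ${query}`
}
```

The final prompt then goes to the LLM as usual; the model never sees the full corpus, only the top-k chunks.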
Choosing a Vector Database
pgvector (PostgreSQL extension)
The pragmatic choice for teams already on PostgreSQL. Adds vector similarity search with `<->` (L2 distance), `<=>` (cosine distance), and `<#>` (negative inner product) operators. No new infrastructure. Scales to tens of millions of vectors comfortably.
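At that scale a sequential scan is too slow, so you add an approximate-nearest-neighbor index. A sketch, assuming pgvector 0.5 or later and the `documents` table defined below (`ivfflat` is the older alternative):

```sql
-- HNSW index for cosine-distance queries (the <=> operator)
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
```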
CREATE TABLE documents (
id bigserial PRIMARY KEY,
content text,
embedding vector(1536)
);
-- Find 5 most similar documents
SELECT content, embedding <=> $1 AS distance
FROM documents
ORDER BY embedding <=> $1
LIMIT 5;
Pinecone
Managed, purpose-built vector database. Handles billions of vectors, real-time upserts, and metadata filtering out of the box. The right choice when pgvector's performance hits its ceiling or when you need serverless vector search with zero ops overhead.
Weaviate
Open-source, self-hostable, with built-in support for hybrid search (vector + BM25 keyword). Excellent for use cases where keyword matching alongside semantic similarity matters — legal document search, technical documentation, compliance queries.
Production Chunking Strategy
How you split documents matters more than which vector database you choose.
// Naive chunking — splits mid-sentence, loses context
const chunks = document.split(/\n{2,}/) // paragraph split
// Better: recursive character splitting with overlap
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter"
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 512,
chunkOverlap: 64, // overlap preserves context across chunk boundaries
separators: ["\n\n", "\n", ". ", " ", ""],
})
const chunks = await splitter.createDocuments([documentText])
Rule of thumb: chunks should be semantically complete thoughts. A chunk of 512 tokens with a 64-token overlap works well for most prose documents.
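The overlap mechanic itself is simple. A character-level sketch of the idea (illustrative only: the real splitter also breaks preferentially at the separators listed above rather than at fixed offsets):

```typescript
// Slide a window of chunkSize characters, stepping by (chunkSize - overlap),
// so each chunk repeats the tail of the previous one.
function chunkWithOverlap(text: string, chunkSize: number, overlap: number): string[] {
  const chunks: string[] = []
  const step = chunkSize - overlap
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize))
    if (start + chunkSize >= text.length) break // last window reached the end
  }
  return chunks
}
```

The overlapping tail is what lets a fact that straddles a chunk boundary still appear whole in at least one chunk.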
Hybrid Search: When Pure Vector Is Not Enough
Vector search finds semantically similar content. Keyword search finds exact matches. For product names, error codes, and proper nouns — keyword search wins. For conceptual questions — vector wins.
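The usual fusion is a weighted blend of the two scores, which is what the alpha parameter in the Weaviate example below controls. A minimal sketch, assuming both scores have already been normalized to [0, 1] (real engines also handle normalization and result deduplication):

```typescript
// alpha = 0 means keyword only, alpha = 1 means vector only
function hybridScore(vectorScore: number, keywordScore: number, alpha: number): number {
  return alpha * vectorScore + (1 - alpha) * keywordScore
}
```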
Hybrid search combines both:
// Weaviate hybrid search
const result = await client.graphql
.get()
.withClassName('Document')
.withHybrid({ query: userQuery, alpha: 0.75 }) // 0 = keyword only, 1 = vector only
.withLimit(5)
.do()
Evaluating RAG Quality
The most common failure mode: high retrieval recall, poor answer quality. The retrieved chunks are relevant but the LLM still hallucinates. Evaluate each component separately:
- Retrieval precision: Are the top-k chunks actually relevant? (human eval or LLM-as-judge)
- Answer faithfulness: Is the answer grounded in the retrieved context? (check with Ragas, TruLens)
- Answer relevance: Does the answer address the question asked?
Deploy evaluation as part of your CI pipeline — catch retrieval regressions before they reach production.
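For the retrieval-precision check, the CI assertion can be as simple as precision@k against a small hand-labeled query set; faithfulness and relevance are what tools like Ragas and TruLens cover. A sketch, where chunk ids and the labeled relevant set are hypothetical:

```typescript
// Fraction of the top-k retrieved chunk ids that are in the labeled relevant set
function precisionAtK(retrievedIds: string[], relevantIds: Set<string>, k: number): number {
  const topk = retrievedIds.slice(0, k)
  if (topk.length === 0) return 0
  const hits = topk.filter((id) => relevantIds.has(id)).length
  return hits / topk.length
}
```

A CI job runs this per labeled query and fails the build if the average drops below a threshold you pick, which is what catches retrieval regressions before deployment.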