RAG with Claude and pgvector: A Document Q&A POC in Python

Series
AI in Production: 30 Real-World Use Cases with Claude

Part 10 of 30

TL;DR

  • RAG (retrieval-augmented generation) lets Claude answer questions grounded in your own documents, not just its training data. Claude plus pgvector is one of the most practical starting points for production document Q&A.
  • pgvector turns Postgres into a vector store you already know how to operate, back up, and monitor. No extra managed service needed for most teams.
  • The full pipeline is five steps: chunk, embed, store, retrieve (cosine similarity top-k), generate. Each step is independently replaceable.
  • Claude cites the source chunks in its answer, giving users traceability and letting you audit hallucinations before they reach production.
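  • The POC needs only docker compose up for Postgres and one Python script. It takes under 15 minutes to go from zero to working Q&A, first on the bundled sample documents and then on your own files once you swap in a loader.
  • Prompt caching on the system prompt cuts the input cost of the cached portion by up to 90% on repeated queries, which matters when the cached block is large.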

Why RAG with Claude and pgvector Makes Sense for Engineering Teams

Most enterprise AI projects hit the same wall. Claude is smart enough to answer almost any question, but it does not know what is in your internal docs, product specs, support tickets, or legal contracts. Fine-tuning is expensive, slow, and quickly becomes stale. RAG (retrieval-augmented generation) is the answer that works in production: you keep documents in a searchable store, retrieve the relevant fragments at query time, and feed only those fragments to the model. The model then answers from evidence, not guesswork.

The specific combination of Claude and pgvector is worth explaining before diving into code. pgvector is a Postgres extension that adds a vector column type and approximate nearest-neighbor search. If your team already runs Postgres (and many engineering teams do), this means your vector store lives in the same database cluster as the rest of your data. You get ACID transactions, row-level security, familiar backup tooling, and a JOIN if you ever need one. The tradeoff is that pgvector’s HNSW index does not match dedicated vector databases such as Pinecone or Weaviate at a hundred million vectors, but it handles tens of millions, which is well beyond what most internal document systems need.

Claude’s role is the generation half: reading retrieved chunks, attributing each claim to a specific source, and producing a coherent answer. That citation behavior is not magic. You have to ask for it explicitly in your system prompt, and the POC below shows how.

Who actually needs this

  • Product teams that want employees to ask questions about a knowledge base or wiki without copy-pasting content into a chat window.
  • Legal and compliance teams that need answers from contracts or regulations with explicit source references.
  • Customer support organizations building a Tier-0 bot that can answer from product documentation before escalating.
  • Developer tools teams that want a code-aware Q&A system over large internal codebases or RFCs.
  • Any team that has outgrown “give Claude the whole document” because the documents are too large or too numerous.

What this article builds

A fully self-contained Python project that: ingests text documents (the demo uses inline samples; swap in your own text or PDF loader), splits them into overlapping chunks, embeds each chunk using a local sentence-transformer model (so embedding incurs no API cost), stores embeddings in Postgres with pgvector, answers questions by retrieving the top-5 most similar chunks, and calls Claude with those chunks to get a cited answer. Postgres starts with docker compose up; the pipeline itself is one Python script.

RAG Architecture: The Five-Step Pipeline

[Figure 1 diagram. Ingest (offline / batch): 1. Chunk documents, 2. Embed each chunk, 3. Store in pgvector. Query time: 4. Retrieve top-k chunks, 5. Generate with Claude + citations.]

Figure 1. The five-step RAG pipeline. Steps 1-3 run offline during ingestion. Steps 4-5 run on every user query.

Steps 1 through 3 are the offline ingestion pipeline. You run them once (or on a cron job) whenever documents are added or updated. Steps 4 and 5 are the hot path that runs on every user query. Separating these two phases matters because it lets you re-embed or re-chunk without touching query logic, and it keeps query latency low because the only work at query time is embedding one question, running one vector search, and making one Claude call.

Chunking strategy choices

The right chunk size is almost always smaller than you think. Claude can read many tokens, but if you send a 20-page chapter as a single chunk, it drowns out the specific sentence the user actually needs. Common starting points:

  • 512 tokens with 64-token overlap: Good general default. Overlap preserves context at chunk boundaries.
  • 256 tokens with 32-token overlap: Better precision for factual Q&A where exact sentences matter.
  • Paragraph-based splitting: Best for structured docs (contracts, specs) where paragraphs have semantic coherence.

The POC below uses 512 characters (not tokens) with 64-character overlap. That is a simplification that works fine for demos. In production you want to split on token boundaries using the model’s tokenizer.
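
As a rough illustration of the paragraph-based option listed above, the sketch below packs whole paragraphs into chunks up to a size limit. The function name chunk_by_paragraph and the max_chars default are assumptions made for this example; they are not part of the POC.

```python
def chunk_by_paragraph(text: str, max_chars: int = 512) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut,
    so very long paragraphs may still need a fallback splitter.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Either it still fits, or this is the first paragraph of a new chunk.
            current = candidate
        else:
            # Adding this paragraph would overflow: close the chunk, start a new one.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Because chunks end on paragraph boundaries, no sentence is cut in half, at the price of uneven chunk lengths.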

Embedding model choices

This POC uses sentence-transformers/all-MiniLM-L6-v2 from HuggingFace running locally. It produces 384-dimensional vectors, is fast on CPU, and is free. For production you might want OpenAI text-embedding-3-small (1536 dimensions, very good quality, costs money) or Cohere’s embedding API. The key constraint: whatever model you use to embed at ingest time, you must use the same model at query time. Mixing models produces nonsense similarity scores.

Setting Up Postgres with pgvector

pgvector ships as a Postgres extension. The Docker image pgvector/pgvector:pg16 comes with it pre-installed. You just need to run CREATE EXTENSION vector; once in your database.

docker pull pgvector/pgvector:pg16

The docker-compose below wires everything together. No manual database setup needed. The initdb.sql script runs automatically on first container start.

[Figure 2 diagram. Python app (rag_poc.py) -> embed -> sentence-transformers (all-MiniLM-L6-v2); Python app -> store/query -> Postgres 16 + pgvector (documents table); Python app -> top-k chunks -> Claude (claude-sonnet-4-6).]

Figure 2. Component map. The Python app calls sentence-transformers locally for embeddings, stores/retrieves from Postgres+pgvector, then passes retrieved chunks to Claude for answer generation.

The Complete POC: File by File

The project has four files: docker-compose.yml, initdb.sql, requirements.txt, and rag_poc.py, plus a .env file for configuration. Copy them into a fresh directory, run docker compose up -d to start Postgres, then run python rag_poc.py --ingest to load the sample documents and answer the default question.

Install and requirements

pip install anthropic psycopg2-binary sentence-transformers numpy python-dotenv

requirements.txt

anthropic>=0.25.0
psycopg2-binary>=2.9.9
sentence-transformers>=2.7.0
numpy>=1.26.0
python-dotenv>=1.0.0

.env example

# .env
ANTHROPIC_API_KEY=sk-ant-your-key-here
PGHOST=localhost
PGPORT=5432
PGDATABASE=ragdb
PGUSER=raguser
PGPASSWORD=ragpassword

docker-compose.yml

version: "3.9"

services:
  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_DB: ragdb
      POSTGRES_USER: raguser
      POSTGRES_PASSWORD: ragpassword
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
      - ./initdb.sql:/docker-entrypoint-initdb.d/initdb.sql:ro
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U raguser -d ragdb"]
      interval: 5s
      timeout: 5s
      retries: 10

volumes:
  pgdata:

initdb.sql

-- Run automatically on first container start.
-- Enables pgvector and creates the chunks table.

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS document_chunks (
    id          SERIAL PRIMARY KEY,
    source      TEXT NOT NULL,          -- filename or URL
    chunk_index INTEGER NOT NULL,       -- position in document
    chunk_text  TEXT NOT NULL,          -- raw text of this chunk
    embedding   vector(384) NOT NULL,   -- all-MiniLM-L6-v2 produces 384 dims
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

-- HNSW index for fast approximate nearest-neighbour search.
-- cosine distance matches the similarity metric we use at query time.
CREATE INDEX IF NOT EXISTS idx_chunks_embedding
    ON document_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

rag_poc.py (full source)

"""
rag_poc.py  --  RAG with Claude and pgvector

Pipeline:
  1. Chunk  - split text documents into overlapping windows
  2. Embed  - encode each chunk with sentence-transformers (local, free)
  3. Store  - insert embeddings into Postgres pgvector
  4. Retrieve - cosine similarity top-k for a user question
  5. Generate - Claude answers from retrieved chunks with citations

Usage:
  # First run: ingest sample documents, then answer a question
  python rag_poc.py --ingest --question "What is the return policy?"

  # Subsequent runs: skip ingestion, just ask
  python rag_poc.py --question "How do I reset my password?"
"""

import os
import sys
import argparse
import textwrap
from pathlib import Path

import anthropic
import psycopg2
from psycopg2.extras import execute_values
from sentence_transformers import SentenceTransformer
import numpy as np
from dotenv import load_dotenv

# ---------------------------------------------------------------------------
# Config
# ---------------------------------------------------------------------------
load_dotenv()

ANTHROPIC_API_KEY = os.environ["ANTHROPIC_API_KEY"]   # fail fast if unset; never hard-code (the SDK reads it from the environment)
PGHOST     = os.getenv("PGHOST", "localhost")
PGPORT     = int(os.getenv("PGPORT", 5432))
PGDATABASE = os.getenv("PGDATABASE", "ragdb")
PGUSER     = os.getenv("PGUSER", "raguser")
PGPASSWORD = os.getenv("PGPASSWORD", "ragpassword")

EMBED_MODEL  = "all-MiniLM-L6-v2"   # 384 dims, fast on CPU, free
CLAUDE_MODEL = "claude-sonnet-4-6"  # balanced; upgrade to claude-opus-4-8 for complex reasoning
CHUNK_SIZE   = 512   # characters
CHUNK_OVERLAP = 64   # characters
TOP_K        = 5     # number of chunks to retrieve

# ---------------------------------------------------------------------------
# Sample documents (inline for demo; replace with real file loading)
# ---------------------------------------------------------------------------
SAMPLE_DOCS = {
    "return_policy.txt": textwrap.dedent("""\
        Return Policy

        We accept returns within 30 days of purchase.
        Items must be unused and in original packaging.
        Digital downloads and software licenses are non-refundable
        once the key has been revealed or the download started.
        To initiate a return, email [email protected] with your
        order number. We process refunds within 5-7 business days
        to the original payment method.
        Shipping costs for returns are covered by the customer
        unless the item was defective or we made an error.
    """),

    "password_policy.txt": textwrap.dedent("""\
        Password and Account Security

        Passwords must be at least 12 characters long and include
        at least one uppercase letter, one number, and one symbol.
        We recommend using a password manager.

        To reset your password:
        1. Visit https://example.com/reset-password
        2. Enter your registered email address.
        3. Check your inbox for a reset link (valid for 15 minutes).
        4. Follow the link and enter a new password.

        If you do not receive the email within 5 minutes, check
        your spam folder. If the account is still locked, contact
        [email protected]. Account lockout occurs after 5 failed
        login attempts. Locked accounts unlock automatically after
        30 minutes or can be unlocked instantly by support.
    """),

    "shipping_policy.txt": textwrap.dedent("""\
        Shipping Information

        Standard shipping (3-5 business days): free on orders over $50.
        Standard shipping on orders under $50: $6.99.
        Express shipping (1-2 business days): $14.99 flat.
        Overnight shipping: $29.99 flat.

        We ship to all 50 US states. International shipping is
        available to Canada and the UK at checkout. Orders placed
        before 2 PM Eastern time on business days ship the same day.
        You will receive a tracking number by email once the package
        is picked up by the carrier.

        We are not responsible for delays caused by customs,
        weather, or carrier failures beyond our control.
    """),
}

# ---------------------------------------------------------------------------
# Step 1: Chunking
# ---------------------------------------------------------------------------
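def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    """
    Split text into overlapping windows of `size` characters.
    Each chunk starts `size - overlap` characters after the previous.
    """
    if not 0 <= overlap < size:
        # A step of zero or less would make the loop below run forever.
        raise ValueError("overlap must be >= 0 and smaller than size")
    chunks = []
    step = size - overlap
    start = 0
    while start < len(text):
        end = start + size
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= len(text):
            break  # this window reached the end; skip a redundant tail chunk
        start += step
    return chunks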


# ---------------------------------------------------------------------------
# Step 2: Embedding
# ---------------------------------------------------------------------------
def load_embed_model() -> SentenceTransformer:
    """Load sentence-transformer model. Downloaded on first call, cached after."""
    print(f"Loading embedding model: {EMBED_MODEL}")
    return SentenceTransformer(EMBED_MODEL)


def embed_chunks(model: SentenceTransformer, chunks: list[str]) -> np.ndarray:
    """
    Encode a list of text chunks to numpy float32 vectors.
    We pass normalize_embeddings=True explicitly (encode() does not default to it)
    so every vector has unit length, which suits cosine similarity.
    """
    print(f"  Embedding {len(chunks)} chunks...")
    vectors = model.encode(chunks, normalize_embeddings=True, show_progress_bar=False)
    return vectors.astype(np.float32)


# ---------------------------------------------------------------------------
# Step 3: Postgres / pgvector helpers
# ---------------------------------------------------------------------------
def get_conn():
    return psycopg2.connect(
        host=PGHOST,
        port=PGPORT,
        dbname=PGDATABASE,
        user=PGUSER,
        password=PGPASSWORD,
    )


def clear_source(conn, source: str):
    """Remove all existing chunks for this source so re-ingestion is idempotent."""
    with conn.cursor() as cur:
        cur.execute("DELETE FROM document_chunks WHERE source = %s", (source,))
    conn.commit()


def insert_chunks(conn, source: str, chunks: list[str], vectors: np.ndarray):
    """
    Bulk-insert chunks with their embeddings.
    psycopg2 adapts each Python list to a Postgres array, and the ::vector
    cast in the template converts that array to a pgvector value.
    psycopg2 execute_values keeps batch inserts to a few round trips.
    """
    rows = [
        (source, idx, text, vec.tolist())
        for idx, (text, vec) in enumerate(zip(chunks, vectors))
    ]
    with conn.cursor() as cur:
        execute_values(
            cur,
            """
            INSERT INTO document_chunks (source, chunk_index, chunk_text, embedding)
            VALUES %s
            """,
            rows,
            template="(%s, %s, %s, %s::vector)",
        )
    conn.commit()
    print(f"  Stored {len(rows)} chunks from '{source}'")


# ---------------------------------------------------------------------------
# Step 4: Retrieval
# ---------------------------------------------------------------------------
def retrieve_top_k(conn, model: SentenceTransformer, question: str, k: int = TOP_K) -> list[dict]:
    """
    Embed the question and find the k nearest chunks by cosine distance.
    pgvector cosine distance operator: <=>
    Cosine similarity = 1 - cosine_distance
    """
    q_vec = model.encode([question], normalize_embeddings=True)[0].astype(np.float32)

    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT source, chunk_index, chunk_text,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM document_chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (q_vec.tolist(), q_vec.tolist(), k),
        )
        rows = cur.fetchall()

    results = []
    for source, chunk_index, chunk_text, similarity in rows:
        results.append({
            "source": source,
            "chunk_index": chunk_index,
            "text": chunk_text,
            "similarity": float(similarity),
        })
    return results


# ---------------------------------------------------------------------------
# Step 5: Generation with Claude
# ---------------------------------------------------------------------------
def build_context_block(chunks: list[dict]) -> str:
    """
    Format retrieved chunks into a numbered context string for Claude.
    Each chunk is labelled with its source and similarity score so Claude
    can cite them accurately.
    """
    parts = []
    for i, chunk in enumerate(chunks, 1):
        parts.append(
            f"[{i}] Source: {chunk['source']} (similarity: {chunk['similarity']:.3f})\n"
            f"{chunk['text']}"
        )
    return "\n\n---\n\n".join(parts)


def answer_with_claude(question: str, chunks: list[dict]) -> str:
    """
    Pass the retrieved chunks to Claude and ask for a cited answer.

    Key prompt design choices:
    - The system prompt tells Claude to answer ONLY from the context.
    - It must cite sources using [1], [2] notation matching the context block.
    - If the answer is not in the context, Claude must say so explicitly.
    - We enable prompt caching on the system prompt for repeated queries.
    """
    client = anthropic.Anthropic()

    context = build_context_block(chunks)

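    # Prompt caching: the system prompt is identical across queries.
    # Marking it ephemeral asks Anthropic to cache that prefix on their end.
    # Savings kick in on the second request with the same cached block.
    # Blocks below the model's minimum cacheable size are not cached, so this
    # short prompt only demonstrates the mechanics.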
    system_prompt = [
        {
            "type": "text",
            "text": (
                "You are a helpful Q&A assistant. Answer the user's question "
                "using ONLY the context passages provided below. "
                "Cite every claim with the passage number in square brackets, "
                "for example [1] or [2]. "
                "If the answer cannot be found in the provided context, "
                "reply with: 'I could not find a reliable answer in the available documents.' "
                "Do not speculate or use outside knowledge. Be concise."
            ),
            "cache_control": {"type": "ephemeral"},  # prompt caching
        }
    ]

    user_message = (
        f"Context passages:\n\n{context}\n\n"
        f"Question: {question}"
    )

    try:
        msg = client.messages.create(
            model=CLAUDE_MODEL,
            max_tokens=1024,
            system=system_prompt,
            messages=[{"role": "user", "content": user_message}],
        )
    except anthropic.APIError as exc:
        print(f"Claude API error: {exc}", file=sys.stderr)
        raise

    # Report token usage (helpful for cost tracking; see cache stats too)
    usage = msg.usage
    print(
        f"\n[Token usage] input={usage.input_tokens} "
        f"output={usage.output_tokens} "
        f"cache_created={getattr(usage, 'cache_creation_input_tokens', 0)} "
        f"cache_read={getattr(usage, 'cache_read_input_tokens', 0)}"
    )

    return msg.content[0].text


# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
def ingest_documents(model: SentenceTransformer):
    """Chunk, embed, and store all sample documents."""
    conn = get_conn()
    try:
        for source, text in SAMPLE_DOCS.items():
            print(f"\nIngesting '{source}'...")
            chunks = chunk_text(text)
            print(f"  {len(chunks)} chunks created")
            vectors = embed_chunks(model, chunks)
            clear_source(conn, source)
            insert_chunks(conn, source, chunks, vectors)
        print("\nIngestion complete.")
    finally:
        conn.close()


def answer_question(model: SentenceTransformer, question: str):
    """Retrieve relevant chunks and generate a cited answer."""
    conn = get_conn()
    try:
        print(f"\nRetrieving top-{TOP_K} chunks for: '{question}'")
        chunks = retrieve_top_k(conn, model, question)

        print("\nTop retrieved chunks:")
        for i, c in enumerate(chunks, 1):
            print(f"  [{i}] {c['source']} (sim={c['similarity']:.3f}): {c['text'][:80]}...")

        print("\nGenerating answer with Claude...")
        answer = answer_with_claude(question, chunks)

        print("\n" + "=" * 60)
        print(f"Q: {question}")
        print("=" * 60)
        print(answer)
        print("=" * 60)

        return answer
    finally:
        conn.close()


def main():
    parser = argparse.ArgumentParser(description="RAG with Claude and pgvector")
    parser.add_argument("--ingest", action="store_true", help="Ingest sample documents")
    parser.add_argument("--question", type=str, default="What is the return policy?",
                        help="Question to answer")
    args = parser.parse_args()

    model = load_embed_model()

    if args.ingest:
        ingest_documents(model)

    answer_question(model, args.question)


if __name__ == "__main__":
    main()

Sample run with illustrative output

$ docker compose up -d
[+] Running 1/1
 ✓ Container rag-postgres-1  Started

$ python rag_poc.py --ingest --question "What is the return policy?"

Loading embedding model: all-MiniLM-L6-v2

Ingesting 'return_policy.txt'...
  3 chunks created
  Embedding 3 chunks...
  Stored 3 chunks from 'return_policy.txt'

Ingesting 'password_policy.txt'...
  3 chunks created
  Embedding 3 chunks...
  Stored 3 chunks from 'password_policy.txt'

Ingesting 'shipping_policy.txt'...
  3 chunks created
  Embedding 3 chunks...
  Stored 3 chunks from 'shipping_policy.txt'

Ingestion complete.

Retrieving top-5 chunks for: 'What is the return policy?'

Top retrieved chunks:
  [1] return_policy.txt (sim=0.891): Return Policy  We accept returns within 30 days of purchase...
  [2] return_policy.txt (sim=0.723): To initiate a return, email [email protected] with your...
  [3] shipping_policy.txt (sim=0.412): Standard shipping (3-5 business days): free on orders...
  [4] password_policy.txt (sim=0.198): Passwords must be at least 12 characters long...
  [5] shipping_policy.txt (sim=0.187): We ship to all 50 US states...

Generating answer with Claude...

[Token usage] input=643 output=112 cache_created=87 cache_read=0

============================================================
Q: What is the return policy?
============================================================
Returns are accepted within 30 days of purchase, provided items
are unused and in original packaging [1]. Digital downloads and
software licenses are non-refundable once the key is revealed or
the download has started [1].

To initiate a return, email [email protected] with your order
number [2]. Refunds are processed within 5-7 business days to
the original payment method [2]. Return shipping costs are the
customer's responsibility unless the item was defective or an
error was made by the seller [2].
============================================================

$ python rag_poc.py --question "How do I reset my password?"

Loading embedding model: all-MiniLM-L6-v2

Retrieving top-5 chunks for: 'How do I reset my password?'
...

[Token usage] input=643 output=134 cache_created=0 cache_read=87

============================================================
Q: How do I reset my password?
============================================================
To reset your password, visit https://example.com/reset-password
and enter your registered email address [2]. A reset link will
be emailed to you and is valid for 15 minutes [2]. If the email
does not arrive within 5 minutes, check your spam folder [2].

If your account is locked due to 5 failed login attempts, it
will unlock automatically after 30 minutes. Support at
[email protected] can also unlock it immediately [2].
============================================================

Key idea: Notice the second query shows cache_read=87 and cache_created=0. The system prompt was served from Anthropic’s prompt cache, so you paid roughly a tenth of the normal input price on that repeated portion. With a large system prompt this effect is significant. See Part 4 on prompt caching for the full breakdown.

Honest caveat about caching: Anthropic only caches a block once it crosses a minimum size (about 1,024 tokens for Sonnet-class models). The short system prompt in this POC is below that floor, so in real runs you will see cache_creation_input_tokens=0 until you make the cached block bigger. The numbers above are illustrative of the mechanics. Caching pays off when you put a large, stable knowledge block (company guidelines, a glossary, a style guide of a few thousand tokens) into the cached system block. Below the floor, the cache_control marker is simply ignored and you are billed normally, so leaving it in costs you nothing.

Going Deeper: Production Considerations for RAG with Claude and pgvector

Hybrid search: combining BM25 and vector similarity

Pure vector search misses exact-match queries. If a user types an unusual product code or a proper noun not well-represented in the embedding space, keyword search (BM25-style full-text ranking) will find it where cosine similarity fails. The standard pattern is reciprocal rank fusion (RRF): run both searches, rank each independently, then combine the two rankings by summing 1 / (k + rank) for every chunk. Postgres supports tsvector full-text search natively; its built-in ts_rank is not BM25, but it fills the same keyword-matching role. You can run both searches in SQL and fuse the results in Python. This is worth adding before go-live.
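
A minimal sketch of the fusion step, assuming each search has already returned an ordered list of chunk ids (best first). The function name and the constant k = 60, a commonly used default for RRF, are choices made for this example rather than part of the POC.

```python
def reciprocal_rank_fusion(
    rankings: list[list[int]], k: int = 60, top_n: int = 5
) -> list[int]:
    """Fuse several ranked lists of chunk ids into one list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Only ranks are used, never raw scores, so
    cosine similarity and keyword scores need no common scale.
    """
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]


vector_hits = [3, 1, 2]    # ids ordered by cosine distance
keyword_hits = [7, 3, 9]   # ids ordered by full-text rank
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Chunk 3 appears in both lists, so it comes out on top even though it was only second in the keyword ranking.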

Metadata filtering

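pgvector’s approximate indexes do not filter inside the index: with HNSW, a WHERE clause is applied after the index scan, so a selective filter can return fewer than k rows. You can still add a JSONB column metadata to document_chunks and apply a WHERE metadata @> '{"department":"legal"}' clause alongside the vector ORDER BY. When the filter is selective, the planner may use the metadata index instead and rank the matching rows exactly. Otherwise, raise hnsw.ef_search or, on pgvector 0.8.0 and later, enable iterative index scans. This is how you implement tenant isolation or document-type scoping.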

-- Add to initdb.sql if you need metadata filtering
ALTER TABLE document_chunks ADD COLUMN IF NOT EXISTS metadata JSONB DEFAULT '{}';
CREATE INDEX IF NOT EXISTS idx_chunks_metadata ON document_chunks USING gin(metadata);
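
One way to wire the filter into the retrieval query is sketched below. It assumes the metadata column from the snippet above; FILTERED_SEARCH_SQL and filtered_search_params are hypothetical names, not part of the POC. The filter is passed as a bound parameter rather than formatted into the SQL string.

```python
import json

# Same shape as the POC's retrieval query, plus a JSONB containment filter.
FILTERED_SEARCH_SQL = """
    SELECT source, chunk_index, chunk_text,
           1 - (embedding <=> %s::vector) AS similarity
    FROM document_chunks
    WHERE metadata @> %s::jsonb
    ORDER BY embedding <=> %s::vector
    LIMIT %s
"""


def filtered_search_params(
    query_vector: list[float], metadata_filter: dict, k: int = 5
) -> tuple:
    """Build the parameter tuple for FILTERED_SEARCH_SQL.

    The filter is serialised to JSON and bound as a parameter, so
    user-supplied filter values never get interpolated into the SQL text.
    """
    filter_json = json.dumps(metadata_filter, sort_keys=True)
    return (query_vector, filter_json, query_vector, k)
```

Usage inside retrieve_top_k would look like cur.execute(FILTERED_SEARCH_SQL, filtered_search_params(q_vec.tolist(), {"department": "legal"})).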

Re-ranking with a cross-encoder

Bi-encoder embeddings (like all-MiniLM) are fast but not the most accurate at ranking. A cross-encoder (e.g., cross-encoder/ms-marco-MiniLM-L-6-v2) reads the query and a candidate chunk together and produces a much more accurate relevance score. The typical pattern: retrieve top-20 by cosine similarity (cheap), re-rank with cross-encoder (slightly slower), pass top-5 to Claude. This gives you better precision without running the slower model over every chunk in the corpus at query time.
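
The retrieve-then-re-rank pattern can be sketched as follows. The rerank function and its score_pairs parameter are assumptions for this example: score_pairs is any callable that takes a list of (question, passage) pairs and returns one score per pair, which keeps the sketch independent of a specific model.

```python
from typing import Callable, Sequence


def rerank(
    question: str,
    candidates: list[dict],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_n: int = 5,
) -> list[dict]:
    """Re-order retrieved chunks by a pairwise relevance score.

    `candidates` uses the same dict shape as retrieve_top_k in the POC
    (each item has a "text" key). Returns the top_n chunks, best first,
    with the score attached under "rerank_score".
    """
    if not candidates:
        return []
    pairs = [(question, c["text"]) for c in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [{**chunk, "rerank_score": float(score)} for chunk, score in ranked[:top_n]]
```

With sentence-transformers, the predict method of CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2") accepts such a list of pairs and can be passed as score_pairs; retrieve with k=20 first, then call rerank with top_n=5.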

Chunking documents with structure

PDFs and HTML documents have structure: headings, sections, tables. Ignoring that structure and splitting by character count often slices sentences in half or separates a question from its answer. LangChain’s RecursiveCharacterTextSplitter tries to split on paragraph, then line, then word boundaries. For very structured docs (contracts, legal filings), parsing to a document tree before chunking is worth the extra work.

Connecting to Part 13: the full support agent

This POC is a complete RAG pipeline but it is single-turn. In Part 13 of this series, we extend this pattern into a multi-turn support agent that combines RAG retrieval with tool use: it can look up order status, create tickets, and escalate to a human, all within the same conversation loop.

Common Pitfalls

  • Wrong embedding at query time. If you retrain or switch your embedding model, you must re-embed every chunk in the database. Mixing models means similarity scores are meaningless. Keep the model name in a config table and assert it at startup.
  • Chunks that are too large. Long chunks reduce precision. The top retrieved chunk might be a 1,000-token section where only 20 tokens are relevant. Start with 256-512 characters and tune from real recall metrics.
  • Not using overlap. A fact that falls at the boundary of two chunks will be split between them. 10-15% overlap (e.g., 64 characters on 512-character chunks) prevents most boundary problems.
  • Treating similarity score as confidence. A similarity of 0.85 does not mean the chunk answers the question. It means the chunk is close to the question in embedding space. Always pass multiple chunks and let Claude judge relevance.
  • No hallucination guard in the prompt. Without an explicit instruction to answer only from context, Claude will use its training data to fill gaps. The system prompt in the POC above includes this guard. Do not remove it.
  • Re-ingesting without clearing. Running ingest twice without clearing old rows creates duplicate chunks. The clear_source function in the POC handles this. In production, add a content hash column and upsert on it instead.
  • Forgetting the HNSW index. Without the index, every query does a full sequential scan. On 10,000 chunks that is tolerable, but the cost grows linearly with the table and becomes a bottleneck as the corpus grows. The CREATE INDEX USING hnsw in initdb.sql handles this.
  • pgvector version mismatch. HNSW indexes were added in pgvector 0.5.0, so older versions reject the USING hnsw clause. The Docker image pgvector/pgvector:pg16 ships a current version. If you install pgvector from your distro’s package manager, check the version first.
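
The first pitfall suggests asserting the embedding model at startup. A minimal sketch of that guard follows, assuming the ingest job recorded its settings somewhere (for example a small key/value config table) and that they have been loaded into a dict. The keys embed_model and embed_dims and the function name are hypothetical.

```python
def check_embedding_config(stored: dict, model_name: str, dimensions: int) -> None:
    """Raise if the running embedding model differs from the one used at ingest.

    `stored` holds what the ingest job recorded. Mixing models makes
    similarity scores meaningless, so refusing to start is the safer failure.
    """
    problems = []
    if stored.get("embed_model") != model_name:
        problems.append(f"model {stored.get('embed_model')!r} != {model_name!r}")
    if stored.get("embed_dims") != dimensions:
        problems.append(f"dimensions {stored.get('embed_dims')!r} != {dimensions!r}")
    if problems:
        raise RuntimeError("Embedding config mismatch: " + "; ".join(problems))
```

Calling this once in main(), before any query is embedded, turns a silent quality problem into an immediate, readable error.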

Cost and Latency

The cost split in a RAG pipeline is: embedding (near-zero with a local model), vector search (near-zero on Postgres), Claude generation (the dominant cost). The table below shows representative numbers at moderate scale.

| Step | Typical latency | Cost per 1,000 queries | Scaling notes |
| --- | --- | --- | --- |
| Embed question (local) | 5-20ms | $0 | Linear with batch size; GPU speeds this up 10x |
| pgvector HNSW search (10k chunks) | 2-8ms | $0 | Sub-linear with HNSW; degrades gracefully to 1M+ chunks |
| Claude claude-haiku-4-5 (500 input tokens) | 400-800ms | ~$0.40 | Fastest; fine for short factual answers |
| Claude claude-sonnet-4-6 (500 input tokens) | 600-1400ms | ~$1.50 | Best overall quality/cost default |
| Claude claude-opus-4-8 (500 input tokens) | 1500-4000ms | ~$7.50 | Reserve for complex multi-step reasoning over large contexts |
| Prompt cache hit (same system prompt) | Same latency | ~10% of normal input cost | Effective when system prompt is large and stable |

Prompt caching is the most impactful cost lever here. If your system prompt contains a large static knowledge block (say, a 2,000-token company guidelines section), a cache hit bills those tokens at roughly 10% of the normal input price, a saving of about 90% on that portion of every repeat call. The POC already marks the system prompt for caching. For a heavier context, also cache the context block itself if it is reused across queries.
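
The arithmetic behind that claim can be sketched in a few lines. The price of 3.0 USD per million input tokens is a placeholder, and the 0.1 multiplier simply mirrors the "~10% of normal input cost" figure in the table above; check current pricing before relying on either number.

```python
def input_cost_usd(
    static_tokens: int,
    dynamic_tokens: int,
    price_per_mtok: float,
    static_multiplier: float = 1.0,
) -> float:
    """Estimate input cost for one request.

    static_tokens     - the stable block (for example a cached system prompt)
    dynamic_tokens    - the part that changes per query (chunks + question)
    static_multiplier - 1.0 when uncached, about 0.1 on a cache hit (assumed)
    """
    billable = static_tokens * static_multiplier + dynamic_tokens
    return billable * price_per_mtok / 1_000_000


PRICE = 3.0  # placeholder USD per million input tokens, not a quoted price

uncached = input_cost_usd(2000, 500, PRICE)                        # 2,500 billable tokens
cache_hit = input_cost_usd(2000, 500, PRICE, static_multiplier=0.1)  # 700 billable tokens
```

Under these assumptions the per-request input cost drops from 0.0075 to 0.0021 USD. The saving applies only to the static block, so the larger that block is relative to the per-query context, the closer the total saving gets to 90%.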

| Situation | Recommended model | Why |
| --- | --- | --- |
| High-volume support bot, simple FAQ answers | claude-haiku-4-5 | 4x cheaper than Sonnet; answers are factual, context is provided |
| General document Q&A, most internal tools | claude-sonnet-4-6 | Best balance; handles multi-part questions well |
| Legal, contracts, complex multi-doc synthesis | claude-opus-4-8 | Better at nuanced reasoning across conflicting chunks |
| Classification / routing before Q&A | claude-haiku-4-5 | Route to specialist doc store before calling heavier model |

Extending the POC: What to Add Before Production

Async ingestion pipeline

The POC ingests synchronously in the main process. For any real document volume (hundreds of files), you want a background worker. A simple approach: write file paths to a queue (Redis or a Postgres table), run workers that pull from the queue, embed, and insert. This also handles failures gracefully: if embedding fails mid-document you do not lose half the chunks.
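
To make the worker pattern concrete, here is an in-process sketch that uses the standard library queue as a stand-in for Redis or a Postgres table. The names run_ingest_workers and ingest_one are hypothetical; ingest_one would wrap the chunk, embed, and insert steps for a single file.

```python
import queue
import threading
from typing import Callable


def run_ingest_workers(
    paths: list[str], ingest_one: Callable[[str], None], num_workers: int = 2
) -> list[tuple[str, str]]:
    """Process paths with a pool of worker threads.

    Returns a list of (path, error message) for files that failed, so one
    bad document does not stop the rest and can be retried later.
    """
    jobs: queue.Queue = queue.Queue()
    for path in paths:
        jobs.put(path)

    failed: list[tuple[str, str]] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                path = jobs.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            try:
                ingest_one(path)
            except Exception as exc:  # record the failure and keep going
                with lock:
                    failed.append((path, str(exc)))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return failed
```

For the "do not lose half the chunks" guarantee, ingest_one should delete and insert a document's chunks inside one database transaction, so a failure rolls the whole document back.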

Document update detection

Track a content_hash (SHA-256 of the raw text) alongside each source. On re-ingest, compare hashes. Only re-chunk and re-embed files whose hash changed. This makes incremental updates fast and avoids unnecessary API or compute calls.
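
A small sketch of the hash comparison, assuming the stored hashes have been read from the database into a dict keyed by source. Both function names are made up for this example.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the raw document text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sources_to_reingest(docs: dict[str, str], stored_hashes: dict[str, str]) -> list[str]:
    """Return the sources whose text is new or has changed since the last ingest.

    `docs` maps source name to current text; `stored_hashes` maps source
    name to the hash recorded at the previous ingest.
    """
    return [
        source
        for source, text in docs.items()
        if stored_hashes.get(source) != content_hash(text)
    ]
```

Only the returned sources need to go through chunking and embedding again; everything else can be skipped.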

Answer evaluation

RAG systems drift over time as documents change and questions evolve. Build an eval set of question-answer pairs and run it on every ingestion cycle. See Part 24 on LLM evals for a practical harness you can adapt. At minimum, track whether the correct source chunk appears in the top-3 retrieved results (recall@3) and whether Claude’s answer contains the key fact (checked by a second Claude call or regex).
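The recall@3 check can be sketched in a few lines. The eval-case keys (question, expected_chunk_id) and the retrieve callable are assumptions made for illustration; adapt them to whatever your retrieval function actually returns.

```python
from typing import Callable


def recall_at_k(
    eval_set: list[dict],
    retrieve: Callable[[str, int], list[int]],
    k: int = 3,
) -> float:
    """Fraction of eval questions whose expected chunk is in the top-k results.

    eval_set: [{"question": str, "expected_chunk_id": int}, ...]  (assumed shape)
    retrieve: hypothetical function returning chunk ids, best match first
    """
    if not eval_set:
        return 0.0
    hits = 0
    for case in eval_set:
        top_ids = retrieve(case["question"], k)[:k]
        if case["expected_chunk_id"] in top_ids:
            hits += 1
    return hits / len(eval_set)
```

Logging this number after every ingestion cycle gives you an early signal when a chunking or embedding change quietly degrades retrieval.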

Streaming responses

For a web interface, streaming the Claude response token by token is much better UX than waiting a couple of seconds for the full answer. The Anthropic SDK makes this easy. See Part 26 on streaming for the full pattern. In the POC, replace client.messages.create(...) with client.messages.stream(...), used as a context manager (with client.messages.stream(...) as stream:), and iterate stream.text_stream to receive text as it arrives.
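A minimal sketch of the streaming swap, written as a generator so a web handler can forward each piece of text as it arrives. The function name stream_answer and its parameters are illustrative; the client argument is assumed to be an anthropic.Anthropic() instance created elsewhere in the POC.

```python
from typing import Iterator


def stream_answer(client, model: str, system_prompt: str, user_message: str) -> Iterator[str]:
    """Yield the answer text incrementally instead of returning it all at once."""
    # messages.stream is a context manager; leaving the block closes the connection.
    with client.messages.stream(
        model=model,
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": user_message}],
    ) as stream:
        for text in stream.text_stream:
            yield text
```

In the console demo you could consume it with a loop such as for piece in stream_answer(...): print(piece, end="", flush=True).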

Structured citations via tool use

The POC asks Claude to include citations in prose ([1], [2], etc.). That is fine for a console demo, but parsing brackets out of prose is brittle once you want to render clickable source links in a web UI. A more reliable approach for downstream processing: define a single tool that represents your output schema and force Claude to call it. You then read the structured object directly from the tool input. The drop-in replacement for answer_with_claude below does exactly that. It is the same retrieval pipeline; only the generation step changes.

def answer_with_citations(question: str, chunks: list[dict]) -> dict:
    """
    Generation step that returns MACHINE-READABLE citations.

    Instead of asking Claude to embed [1] markers in prose, we define one
    tool whose input_schema IS our output shape, then force Claude to call
    it with tool_choice. We read block.input as the structured result.
    """
    client = anthropic.Anthropic()
    context = build_context_block(chunks)

    answer_tool = {
        "name": "submit_answer",
        "description": "Return the final answer with explicit source citations.",
        "input_schema": {
            "type": "object",
            "properties": {
                "answer": {
                    "type": "string",
                    "description": "The answer, written for an end user.",
                },
                "found": {
                    "type": "boolean",
                    "description": "True if the context contained the answer.",
                },
                "citations": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "passage": {"type": "integer"},
                            "source": {"type": "string"},
                            "quote": {"type": "string"},
                        },
                        "required": ["passage", "source", "quote"],
                    },
                },
            },
            "required": ["answer", "found", "citations"],
        },
    }

    system_prompt = (
        "You answer questions using ONLY the supplied context passages. "
        "Set found=false and citations=[] if the answer is not present. "
        "Every citation quote must be copied verbatim from a passage."
    )
    user_message = f"Context passages:\n\n{context}\n\nQuestion: {question}"

    try:
        msg = client.messages.create(
            model=CLAUDE_MODEL,
            max_tokens=1024,
            system=system_prompt,
            tools=[answer_tool],
            tool_choice={"type": "tool", "name": "submit_answer"},
            messages=[{"role": "user", "content": user_message}],
        )
    except anthropic.APIError as exc:
        print(f"Claude API error: {exc}", file=sys.stderr)
        raise

    # With a forced tool, Claude responds with a tool_use block whose
    # .input is our structured object. stop_reason will be "tool_use".
    for block in msg.content:
        if block.type == "tool_use" and block.name == "submit_answer":
            return block.input  # {"answer": ..., "found": ..., "citations": [...]}

    # Defensive fallback: should not happen with a forced tool.
    return {"answer": "", "found": False, "citations": []}

Because tool_choice forces the call, msg.stop_reason is normally "tool_use" and the response contains a tool_use block whose input is the structured object. If the output is cut off at max_tokens, stop_reason is "max_tokens" instead and the input may be incomplete, so check it before trusting the result. You get a Python dict with answer, a found boolean you can branch on, and a list of citations that each name a passage number, a source file, and the verbatim quote. Rendering a footnote with a real link is then a loop over citations, not a regex over prose. Part 2 on tool use covers the full request/response loop including multi-tool flows. Part 3 on structured output goes deeper on the forced-tool pattern and schema design.
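The footnote loop might look like the sketch below. It also checks that each quote really appears in the passage it cites, since the system prompt asks for verbatim quotes but nothing enforces it. The helper name render_footnotes, the passages mapping (passage number to passage text), and the source_urls mapping are assumptions for illustration and not part of the POC.

```python
def render_footnotes(
    result: dict,
    passages: dict[int, str],
    source_urls: dict[str, str],
) -> list[str]:
    """Turn the structured citations into footnote lines.

    result:      the dict returned by answer_with_citations
    passages:    {passage number: passage text} as sent to Claude (assumed shape)
    source_urls: {source file name: link target} (hypothetical lookup)
    """
    footnotes = []
    for cite in result.get("citations", []):
        passage_text = passages.get(cite["passage"], "")
        # Audit step: flag quotes that are not literally present in the passage.
        verified = cite["quote"] in passage_text
        # Fall back to the bare file name when no URL is known.
        url = source_urls.get(cite["source"], cite["source"])
        flag = "" if verified else " (quote not found in passage)"
        footnotes.append(f'[{cite["passage"]}] {url}: "{cite["quote"]}"{flag}')
    return footnotes
```

Flagged footnotes are a cheap hallucination signal: you can hide them, log them, or route the answer for review before it reaches a user.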

Frequently Asked Questions

Can I use OpenAI embeddings with this pgvector setup instead of sentence-transformers?

Yes. Swap embed_chunks to call openai.embeddings.create(model="text-embedding-3-small", input=chunks) and change the vector dimension in initdb.sql to 1536 (or 3072 for the large model). OpenAI’s embeddings often retrieve better than small local models on English text, at the cost of an API fee and of sending your chunk text to a third party. The rest of the pipeline stays the same. The critical rule: pick one embedding model and never mix it with another in the same table, because vectors from different models are not comparable. Switching models means re-embedding every stored chunk.

How many documents can pgvector handle before it becomes slow?

With the HNSW index, pgvector handles tens of millions of vectors on a well-provisioned Postgres instance (8+ GB RAM, SSD). At 1 million 384-dimensional float32 vectors, the raw vector data alone is roughly 1.5 GB (1,000,000 × 384 × 4 bytes), and the HNSW graph adds overhead on top, so size RAM to keep the index in memory. Query latency typically stays under 10ms for top-10 searches up to about 5 million rows. Beyond that, tune the search parameters (raise hnsw.ef_search for better recall, lower it for speed) or shard by document category.
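The 1.5 GB figure is simple arithmetic on the raw vector payload, which makes it easy to redo for your own row count and embedding dimension. The helper below is an illustrative lower bound only: it ignores per-row storage overhead and the HNSW graph itself, both of which add to the real footprint.

```python
def raw_vector_gb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Size of the raw float32 vector data in GB (10**9 bytes), excluding index overhead."""
    return num_vectors * dimensions * bytes_per_value / 1e9


# 1 million chunks with a 384-dimensional local embedding model
print(raw_vector_gb(1_000_000, 384))   # 1.536

# Same corpus with a 1536-dimensional embedding model
print(raw_vector_gb(1_000_000, 1536))  # 6.144
```

The comparison shows why the embedding dimension matters as much as the row count: moving from 384 to 1536 dimensions quadruples the memory needed for the same corpus.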

What happens when Claude gets context chunks that contradict each other?

Claude will usually note the contradiction and present both perspectives if the system prompt does not tell it otherwise. For legal or compliance use cases, add an instruction like “If context passages conflict, report the conflict to the user rather than choosing one.” You can also add a step that deduplicates or reconciles chunks before passing them to Claude.

Is this approach safe for confidential documents?

The raw text of each chunk is stored in Postgres in plaintext. If your documents are confidential, apply the same access controls you would to any Postgres table (row-level security, encrypted connections, encrypted at-rest storage). Embedding vectors are not human-readable, but published inversion techniques can recover a meaningful part of the original text from them, so treat the vectors as sensitive as the documents themselves. Before sending confidential document text to an external embedding API, review that provider’s data retention policy; if it does not meet your requirements, embed locally.

How do I handle very long documents that exceed the chunk budget?

The top-k retrieval step naturally handles this: you ingest the whole document in chunks and at query time only the relevant chunks are passed to Claude. You do not send the whole document on every query. A 500-page PDF produces several thousand chunks, but a single Q&A call will only send 5-10 of them to Claude.

Should I use pgvector in production or a dedicated vector database like Pinecone?

For most teams, pgvector is the right answer up to the low tens of millions of vectors. It removes an entire service from your infrastructure (and its associated cost, ops burden, and failure modes). Pinecone and Weaviate offer better performance at very large scale and have managed offerings that remove the ops burden in a different way. If you already run Postgres, start with pgvector. You can migrate to a dedicated store later if you genuinely hit scale limits.

Do I need to re-embed documents if I switch from claude-haiku-4-5 to claude-sonnet-4-6?

No. The embedding model and the generation model are completely independent. Embeddings are produced by sentence-transformers (or whatever embedding model you choose) and stored in pgvector. The Claude model only sees the retrieved text chunks, not the vectors. You can switch Claude models on any query with no changes to the database.

Read the full series index at skillsuites.com/category/ai-use-cases/.
