TL;DR
- RAG (retrieval-augmented generation) lets Claude answer questions grounded in your own documents, not just its training data. The RAG + Claude + pgvector stack is one of the most practical starting points for production document Q&A.
- pgvector turns Postgres into a vector store you already know how to operate, backup, and monitor. No extra managed service needed for most teams.
- The full pipeline is five steps: chunk, embed, store, retrieve (cosine similarity top-k), generate. Each step is independently replaceable.
- Claude cites the source chunks in its answer, giving users traceability and letting you audit hallucinations before they reach production.
- The complete POC needs only Postgres in Docker, started with a single docker compose up. It takes under 15 minutes to go from zero to working Q&A on your own documents.
- Prompt caching on the system prompt cuts the input cost of the cached portion by up to 90% on repeated queries, which matters when that prompt is large.
Why RAG with Claude and pgvector Makes Sense for Engineering Teams
Most enterprise AI projects hit the same wall. Claude is smart enough to answer almost any question, but it does not know what is in your internal docs, product specs, support tickets, or legal contracts. Fine-tuning is expensive, slow, and quickly becomes stale. RAG (retrieval-augmented generation) is the answer that works in production: you keep documents in a searchable store, retrieve the relevant fragments at query time, and feed only those fragments to the model. The model then answers from evidence, not guesswork.
The specific combination of RAG, Claude, and pgvector is worth explaining before diving into code. pgvector is a Postgres extension that adds a vector column type and approximate nearest-neighbor search. If your team already runs Postgres (and the majority of engineering teams do), this means your vector store lives in the same database cluster as the rest of your data. You get ACID transactions, row-level security, familiar backup tooling, and a JOIN if you ever need one. The tradeoff is that pgvector’s HNSW index does not match Pinecone or Weaviate at a hundred million vectors, but it easily handles tens of millions, which is well beyond what most internal document systems need.
Claude’s role is the generation half: reading retrieved chunks, attributing each claim to a specific source, and producing a coherent answer. That citation behavior is not magic. You have to ask for it explicitly in your system prompt, and the POC below shows how.
Who actually needs this
- Product teams that want employees to ask questions about a knowledge base or wiki without copy-pasting content into a chat window.
- Legal and compliance teams that need answers from contracts or regulations with explicit source references.
- Customer support organizations building a Tier-0 bot that can answer from product documentation before escalating.
- Developer tools teams that want a code-aware Q&A system over large internal codebases or RFCs.
- Any team that has outgrown “give Claude the whole document” because the documents are too large or too numerous.
What this article builds
A fully self-contained Python project that: ingests text documents (the demo uses inline samples; loading your own text files or PDFs is a small swap-in), splits them into overlapping chunks, embeds each chunk using a local sentence-transformer model (so embedding costs nothing), stores embeddings in Postgres with pgvector, answers questions by retrieving the top-5 most similar chunks, and calls Claude with those chunks to get a cited answer. Postgres starts with docker compose up; the pipeline itself runs with python rag_poc.py.
RAG Architecture: The Five-Step Pipeline
Steps 1 through 3 are the offline ingestion pipeline. You run them once (or on a cron job) whenever documents are added or updated. Steps 4 and 5 are the hot path that runs on every user query. Separating these two phases matters because it lets you re-embed or re-chunk without touching query logic, and it keeps query latency low because the vector search is the only variable-cost operation at runtime.
Chunking strategy choices
The right chunk size is almost always smaller than you think. Claude can read many tokens, but if you send a 20-page chapter as a single chunk, it drowns out the specific sentence the user actually needs. Common starting points:
- 512 tokens with 64-token overlap: Good general default. Overlap preserves context at chunk boundaries.
- 256 tokens with 32-token overlap: Better precision for factual Q&A where exact sentences matter.
- Paragraph-based splitting: Best for structured docs (contracts, specs) where paragraphs have semantic coherence.
The POC below uses 512 characters (not tokens) with 64-character overlap. That is a simplification that works fine for demos. In production you want to split on token boundaries using the model’s tokenizer.
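A token-window variant of the splitter might look like the sketch below. It is not part of the POC: chunk_by_tokens and its tokenize/detokenize parameters are hypothetical names, and whitespace splitting only stands in for the embedding model's real tokenizer, which you would plug in instead.

```python
from typing import Callable


def chunk_by_tokens(
    text: str,
    size: int = 512,
    overlap: int = 64,
    tokenize: Callable[[str], list[str]] = str.split,    # assumption: stand-in tokenizer
    detokenize: Callable[[list[str]], str] = " ".join,   # assumption: stand-in detokenizer
) -> list[str]:
    """Split text into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    tokens = tokenize(text)
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(detokenize(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


print(chunk_by_tokens("a b c d e f g h i j", size=4, overlap=1))
```

With a real tokenizer the window sizes line up with what the embedding model actually sees, which is the point of splitting on tokens rather than characters.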
Embedding model choices
This POC uses sentence-transformers/all-MiniLM-L6-v2 from HuggingFace running locally. It produces 384-dimensional vectors, is fast on CPU, and is free. For production you might want OpenAI text-embedding-3-small (1536 dimensions, very good quality, costs money) or Cohere’s embedding API. The key constraint: whatever model you use to embed at ingest time, you must use the same model at query time. Mixing models produces nonsense similarity scores.
Setting Up Postgres with pgvector
pgvector ships as a Postgres extension. The Docker image pgvector/pgvector:pg16 comes with it pre-installed. You just need to run CREATE EXTENSION vector; once in your database.
docker pull pgvector/pgvector:pg16
The docker-compose below wires everything together. No manual database setup needed. The initdb.sql script runs automatically on first container start.
The Complete POC: File by File
The project has four files: docker-compose.yml, initdb.sql, requirements.txt, and rag_poc.py. Copy them into a fresh directory and run docker compose up -d to start Postgres, then python rag_poc.py to run the pipeline.
Install and requirements
pip install anthropic psycopg2-binary sentence-transformers numpy python-dotenv
requirements.txt
anthropic>=0.25.0
psycopg2-binary>=2.9.9
sentence-transformers>=2.7.0
numpy>=1.26.0
python-dotenv>=1.0.0
.env example
# .env
ANTHROPIC_API_KEY=sk-ant-your-key-here
PGHOST=localhost
PGPORT=5432
PGDATABASE=ragdb
PGUSER=raguser
PGPASSWORD=ragpassword
docker-compose.yml
version: "3.9"
services:
  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_DB: ragdb
      POSTGRES_USER: raguser
      POSTGRES_PASSWORD: ragpassword
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
      - ./initdb.sql:/docker-entrypoint-initdb.d/initdb.sql:ro
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U raguser -d ragdb"]
      interval: 5s
      timeout: 5s
      retries: 10
volumes:
  pgdata:
initdb.sql
-- Run automatically on first container start.
-- Enables pgvector and creates the chunks table.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS document_chunks (
id SERIAL PRIMARY KEY,
source TEXT NOT NULL, -- filename or URL
chunk_index INTEGER NOT NULL, -- position in document
chunk_text TEXT NOT NULL, -- raw text of this chunk
embedding vector(384) NOT NULL, -- all-MiniLM-L6-v2 produces 384 dims
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- HNSW index for fast approximate nearest-neighbour search.
-- cosine distance matches the similarity metric we use at query time.
CREATE INDEX IF NOT EXISTS idx_chunks_embedding
ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
rag_poc.py (full source)
"""
rag_poc.py -- RAG with Claude and pgvector
Pipeline:
1. Chunk - split text documents into overlapping windows
2. Embed - encode each chunk with sentence-transformers (local, free)
3. Store - insert embeddings into Postgres pgvector
4. Retrieve - cosine similarity top-k for a user question
5. Generate - Claude answers from retrieved chunks with citations
Usage:
# First run: ingest sample documents, then answer a question
python rag_poc.py --ingest --question "What is the return policy?"
# Subsequent runs: skip ingestion, just ask
python rag_poc.py --question "How do I reset my password?"
"""
import os
import sys
import argparse
import textwrap
from pathlib import Path
import anthropic
import psycopg2
from psycopg2.extras import execute_values
from sentence_transformers import SentenceTransformer
import numpy as np
from dotenv import load_dotenv
# ---------------------------------------------------------------------------
# Config
# ---------------------------------------------------------------------------
load_dotenv()
ANTHROPIC_API_KEY = os.environ["ANTHROPIC_API_KEY"] # never hard-code
PGHOST = os.getenv("PGHOST", "localhost")
PGPORT = int(os.getenv("PGPORT", 5432))
PGDATABASE = os.getenv("PGDATABASE", "ragdb")
PGUSER = os.getenv("PGUSER", "raguser")
PGPASSWORD = os.getenv("PGPASSWORD", "ragpassword")
EMBED_MODEL = "all-MiniLM-L6-v2" # 384 dims, fast on CPU, free
CLAUDE_MODEL = "claude-sonnet-4-6" # balanced; upgrade to claude-opus-4-8 for complex reasoning
CHUNK_SIZE = 512 # characters
CHUNK_OVERLAP = 64 # characters
TOP_K = 5 # number of chunks to retrieve
# ---------------------------------------------------------------------------
# Sample documents (inline for demo; replace with real file loading)
# ---------------------------------------------------------------------------
SAMPLE_DOCS = {
"return_policy.txt": textwrap.dedent("""\
Return Policy
We accept returns within 30 days of purchase.
Items must be unused and in original packaging.
Digital downloads and software licenses are non-refundable
once the key has been revealed or the download started.
To initiate a return, email [email protected] with your
order number. We process refunds within 5-7 business days
to the original payment method.
Shipping costs for returns are covered by the customer
unless the item was defective or we made an error.
"""),
"password_policy.txt": textwrap.dedent("""\
Password and Account Security
Passwords must be at least 12 characters long and include
at least one uppercase letter, one number, and one symbol.
We recommend using a password manager.
To reset your password:
1. Visit https://example.com/reset-password
2. Enter your registered email address.
3. Check your inbox for a reset link (valid for 15 minutes).
4. Follow the link and enter a new password.
If you do not receive the email within 5 minutes, check
your spam folder. If the account is still locked, contact
[email protected]. Account lockout occurs after 5 failed
login attempts. Locked accounts unlock automatically after
30 minutes or can be unlocked instantly by support.
"""),
"shipping_policy.txt": textwrap.dedent("""\
Shipping Information
Standard shipping (3-5 business days): free on orders over $50.
Standard shipping on orders under $50: $6.99.
Express shipping (1-2 business days): $14.99 flat.
Overnight shipping: $29.99 flat.
We ship to all 50 US states. International shipping is
available to Canada and the UK at checkout. Orders placed
before 2 PM Eastern time on business days ship the same day.
You will receive a tracking number by email once the package
is picked up by the carrier.
We are not responsible for delays caused by customs,
weather, or carrier failures beyond our control.
"""),
}
# ---------------------------------------------------------------------------
# Step 1: Chunking
# ---------------------------------------------------------------------------
def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    """
    Split text into overlapping windows of `size` characters.
    Each chunk starts `size - overlap` characters after the previous.
    """
    if not 0 <= overlap < size:
        # A zero or negative step would make the loop below run forever.
        raise ValueError("overlap must be >= 0 and smaller than size")
    chunks = []
    step = size - overlap
    start = 0
    while start < len(text):
        end = start + size
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        start += step
    return chunks
# ---------------------------------------------------------------------------
# Step 2: Embedding
# ---------------------------------------------------------------------------
def load_embed_model() -> SentenceTransformer:
    """Load sentence-transformer model. Downloaded on first call, cached after."""
    print(f"Loading embedding model: {EMBED_MODEL}")
    return SentenceTransformer(EMBED_MODEL)


def embed_chunks(model: SentenceTransformer, chunks: list[str]) -> np.ndarray:
    """
    Encode a list of text chunks to numpy float32 vectors.
    encode() does not normalise unless asked, so we pass normalize_embeddings=True
    to get unit-length vectors, which suits cosine similarity.
    """
    print(f" Embedding {len(chunks)} chunks...")
    vectors = model.encode(chunks, normalize_embeddings=True, show_progress_bar=False)
    return vectors.astype(np.float32)
# ---------------------------------------------------------------------------
# Step 3: Postgres / pgvector helpers
# ---------------------------------------------------------------------------
def get_conn():
    return psycopg2.connect(
        host=PGHOST,
        port=PGPORT,
        dbname=PGDATABASE,
        user=PGUSER,
        password=PGPASSWORD,
    )


def clear_source(conn, source: str):
    """Remove all existing chunks for this source so re-ingestion is idempotent."""
    with conn.cursor() as cur:
        cur.execute("DELETE FROM document_chunks WHERE source = %s", (source,))
    conn.commit()


def insert_chunks(conn, source: str, chunks: list[str], vectors: np.ndarray):
    """
    Bulk-insert chunks with their embeddings.
    pgvector expects the vector as a Python list or the '[x,y,z]' string format.
    psycopg2 execute_values is fastest for batch inserts.
    """
    rows = [
        (source, idx, text, vec.tolist())
        for idx, (text, vec) in enumerate(zip(chunks, vectors))
    ]
    with conn.cursor() as cur:
        execute_values(
            cur,
            """
            INSERT INTO document_chunks (source, chunk_index, chunk_text, embedding)
            VALUES %s
            """,
            rows,
            template="(%s, %s, %s, %s::vector)",
        )
    conn.commit()
    print(f" Stored {len(rows)} chunks from '{source}'")
# ---------------------------------------------------------------------------
# Step 4: Retrieval
# ---------------------------------------------------------------------------
def retrieve_top_k(conn, model: SentenceTransformer, question: str, k: int = TOP_K) -> list[dict]:
    """
    Embed the question and find the k nearest chunks by cosine distance.
    pgvector cosine distance operator: <=>
    Cosine similarity = 1 - cosine_distance
    """
    q_vec = model.encode([question], normalize_embeddings=True)[0].astype(np.float32)
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT source, chunk_index, chunk_text,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM document_chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (q_vec.tolist(), q_vec.tolist(), k),
        )
        rows = cur.fetchall()
    results = []
    for source, chunk_index, chunk_text, similarity in rows:
        results.append({
            "source": source,
            "chunk_index": chunk_index,
            "text": chunk_text,
            "similarity": float(similarity),
        })
    return results
# ---------------------------------------------------------------------------
# Step 5: Generation with Claude
# ---------------------------------------------------------------------------
def build_context_block(chunks: list[dict]) -> str:
    """
    Format retrieved chunks into a numbered context string for Claude.
    Each chunk is labelled with its source and similarity score so Claude
    can cite them accurately.
    """
    parts = []
    for i, chunk in enumerate(chunks, 1):
        parts.append(
            f"[{i}] Source: {chunk['source']} (similarity: {chunk['similarity']:.3f})\n"
            f"{chunk['text']}"
        )
    return "\n\n---\n\n".join(parts)
def answer_with_claude(question: str, chunks: list[dict]) -> str:
    """
    Pass the retrieved chunks to Claude and ask for a cited answer.
    Key prompt design choices:
    - The system prompt tells Claude to answer ONLY from the context.
    - It must cite sources using [1], [2] notation matching the context block.
    - If the answer is not in the context, Claude must say so explicitly.
    - We enable prompt caching on the system prompt for repeated queries.
    """
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    context = build_context_block(chunks)
    # Prompt caching: the system prompt is identical across queries.
    # Marking it ephemeral asks Anthropic to cache it on their end; this only
    # takes effect once the block meets the model's minimum cacheable length.
    # Savings kick in on the second request with the same cached block.
    system_prompt = [
        {
            "type": "text",
            "text": (
                "You are a helpful Q&A assistant. Answer the user's question "
                "using ONLY the context passages provided below. "
                "Cite every claim with the passage number in square brackets, "
                "for example [1] or [2]. "
                "If the answer cannot be found in the provided context, "
                "reply with: 'I could not find a reliable answer in the available documents.' "
                "Do not speculate or use outside knowledge. Be concise."
            ),
            "cache_control": {"type": "ephemeral"},  # prompt caching
        }
    ]
    user_message = (
        f"Context passages:\n\n{context}\n\n"
        f"Question: {question}"
    )
    try:
        msg = client.messages.create(
            model=CLAUDE_MODEL,
            max_tokens=1024,
            system=system_prompt,
            messages=[{"role": "user", "content": user_message}],
        )
    except anthropic.APIError as exc:
        print(f"Claude API error: {exc}", file=sys.stderr)
        raise
    # Report token usage (helpful for cost tracking; see cache stats too)
    usage = msg.usage
    print(
        f"\n[Token usage] input={usage.input_tokens} "
        f"output={usage.output_tokens} "
        f"cache_created={getattr(usage, 'cache_creation_input_tokens', 0)} "
        f"cache_read={getattr(usage, 'cache_read_input_tokens', 0)}"
    )
    return msg.content[0].text
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
def ingest_documents(model: SentenceTransformer):
    """Chunk, embed, and store all sample documents."""
    conn = get_conn()
    try:
        for source, text in SAMPLE_DOCS.items():
            print(f"\nIngesting '{source}'...")
            chunks = chunk_text(text)
            print(f" {len(chunks)} chunks created")
            vectors = embed_chunks(model, chunks)
            clear_source(conn, source)
            insert_chunks(conn, source, chunks, vectors)
        print("\nIngestion complete.")
    finally:
        conn.close()


def answer_question(model: SentenceTransformer, question: str):
    """Retrieve relevant chunks and generate a cited answer."""
    conn = get_conn()
    try:
        print(f"\nRetrieving top-{TOP_K} chunks for: '{question}'")
        chunks = retrieve_top_k(conn, model, question)
        print("\nTop retrieved chunks:")
        for i, c in enumerate(chunks, 1):
            print(f" [{i}] {c['source']} (sim={c['similarity']:.3f}): {c['text'][:80]}...")
        print("\nGenerating answer with Claude...")
        answer = answer_with_claude(question, chunks)
        print("\n" + "=" * 60)
        print(f"Q: {question}")
        print("=" * 60)
        print(answer)
        print("=" * 60)
        return answer
    finally:
        conn.close()


def main():
    parser = argparse.ArgumentParser(description="RAG with Claude and pgvector")
    parser.add_argument("--ingest", action="store_true", help="Ingest sample documents")
    parser.add_argument("--question", type=str, default="What is the return policy?",
                        help="Question to answer")
    args = parser.parse_args()
    model = load_embed_model()
    if args.ingest:
        ingest_documents(model)
    answer_question(model, args.question)


if __name__ == "__main__":
    main()
Sample run with realistic output
$ docker compose up -d
[+] Running 1/1
✓ Container rag-postgres-1 Started
$ python rag_poc.py --ingest --question "What is the return policy?"
Loading embedding model: all-MiniLM-L6-v2
Ingesting 'return_policy.txt'...
3 chunks created
Embedding 3 chunks...
Stored 3 chunks from 'return_policy.txt'
Ingesting 'password_policy.txt'...
3 chunks created
Embedding 3 chunks...
Stored 3 chunks from 'password_policy.txt'
Ingesting 'shipping_policy.txt'...
3 chunks created
Embedding 3 chunks...
Stored 3 chunks from 'shipping_policy.txt'
Ingestion complete.
Retrieving top-5 chunks for: 'What is the return policy?'
Top retrieved chunks:
[1] return_policy.txt (sim=0.891): Return Policy We accept returns within 30 days of purchase...
[2] return_policy.txt (sim=0.723): To initiate a return, email [email protected] with your...
[3] shipping_policy.txt (sim=0.412): Standard shipping (3-5 business days): free on orders...
[4] password_policy.txt (sim=0.198): Passwords must be at least 12 characters long...
[5] shipping_policy.txt (sim=0.187): We ship to all 50 US states...
Generating answer with Claude...
[Token usage] input=643 output=112 cache_created=87 cache_read=0
============================================================
Q: What is the return policy?
============================================================
Returns are accepted within 30 days of purchase, provided items
are unused and in original packaging [1]. Digital downloads and
software licenses are non-refundable once the key is revealed or
the download has started [1].
To initiate a return, email [email protected] with your order
number [2]. Refunds are processed within 5-7 business days to
the original payment method [2]. Return shipping costs are the
customer's responsibility unless the item was defective or an
error was made by the seller [2].
============================================================
$ python rag_poc.py --question "How do I reset my password?"
Loading embedding model: all-MiniLM-L6-v2
Retrieving top-5 chunks for: 'How do I reset my password?'
...
[Token usage] input=643 output=134 cache_created=0 cache_read=87
============================================================
Q: How do I reset my password?
============================================================
To reset your password, visit https://example.com/reset-password
and enter your registered email address [2]. A reset link will
be emailed to you and is valid for 15 minutes [2]. If the email
does not arrive within 5 minutes, check your spam folder [2].
If your account is locked due to 5 failed login attempts, it
will unlock automatically after 30 minutes. Support at
[email protected] can also unlock it immediately [2].
============================================================
On the second run, note cache_read=87 and cache_created=0. The system prompt was served from Anthropic’s prompt cache, so you paid roughly a tenth of the normal input price on that repeated portion. With a large system prompt this effect is significant. See Part 4 on prompt caching for the full breakdown.
In practice, expect cache_creation_input_tokens=0 until you make the cached block bigger, because prompts below the model’s minimum cacheable length are not cached. The numbers above are illustrative of the mechanics. Caching pays off when you put a large, stable knowledge block (company guidelines, a glossary, a style guide of a few thousand tokens) into the cached system block. Below the floor, the cache_control marker is simply ignored and you are billed normally, so leaving it in costs you nothing.
Going Deeper: Production Considerations for RAG with Claude and pgvector
Hybrid search: combining BM25 and vector similarity
Pure vector search misses exact-match queries. If a user types an unusual product code or a proper noun not well-represented in the embedding space, keyword search (BM25 or similar) will find it where cosine similarity fails. The standard pattern is reciprocal rank fusion (RRF): run both searches, rank each result list independently, then combine the ranks rather than the raw scores, which sit on different scales. Postgres supports tsvector full-text search natively, although its built-in ts_rank is not BM25. You can do both in one SQL query and fuse in Python. This is worth adding before go-live.
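The fusion step can be sketched in a few lines. This is an illustration, not part of the POC: reciprocal_rank_fusion is a hypothetical helper, the chunk ids are made up, and k = 60 is the constant commonly used with RRF rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of chunk ids; each input list is ordered best-first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); raw scores are never compared.
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_hits = ["c7", "c2", "c9"]    # hypothetical ids from the <=> vector query
keyword_hits = ["c2", "c4", "c7"]   # hypothetical ids from a tsvector query
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print([chunk_id for chunk_id, _ in fused])
```

Chunks that appear in both lists rise to the top, while a chunk found by only one search still survives with a lower score.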
Metadata filtering
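pgvector does not support filtered ANN natively in all index types, but you can add a JSONB column metadata to document_chunks and apply a WHERE metadata @> '{"department":"legal"}' clause alongside the vector ORDER BY. How this executes depends on the query planner: with a selective filter it can filter first and rank the surviving rows exactly, but when it uses the HNSW index the filter is applied after the index scan, so a query can return fewer than k rows. Check the plan with EXPLAIN on realistic data. This is how you implement tenant isolation or document-type scoping.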
-- Add to initdb.sql if you need metadata filtering
ALTER TABLE document_chunks ADD COLUMN IF NOT EXISTS metadata JSONB DEFAULT '{}';
CREATE INDEX IF NOT EXISTS idx_chunks_metadata ON document_chunks USING gin(metadata);
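From Python, the filtered search could be issued roughly as follows. This is a sketch under assumptions: build_filtered_query is a hypothetical helper, it presumes the metadata column from the ALTER TABLE statement above exists, and it only builds the statement and parameters for cur.execute.

```python
import json


def build_filtered_query(q_vec: list[float], metadata_filter: dict, k: int = 5):
    """Return (sql, params) for a top-k search restricted by JSONB containment."""
    sql = """
        SELECT source, chunk_index, chunk_text,
               1 - (embedding <=> %s::vector) AS similarity
        FROM document_chunks
        WHERE metadata @> %s::jsonb
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """
    # The filter is passed as a JSON string parameter, never interpolated into the SQL.
    params = (q_vec, json.dumps(metadata_filter), q_vec, k)
    return sql, params


sql, params = build_filtered_query([0.1, 0.2], {"department": "legal"}, k=3)
print(params[1])
```

In retrieve_top_k you would pass the result straight through as cur.execute(sql, params), with q_vec being the 384-dimensional question embedding.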
Re-ranking with a cross-encoder
Bi-encoder embeddings (like all-MiniLM) are fast but not the most accurate at ranking. A cross-encoder (e.g., cross-encoder/ms-marco-MiniLM-L-6-v2) reads the query and a candidate chunk together and produces a much more accurate relevance score. The typical pattern: retrieve top-20 by cosine similarity (cheap), re-rank with cross-encoder (slightly slower), pass top-5 to Claude. This gives you better precision without paying the cross-encoder's cost on every chunk in the database.
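The retrieve-then-re-rank pattern can be sketched with a pluggable scorer. Everything named here is illustrative: rerank and word_overlap are hypothetical helpers, and the toy word-overlap scorer only stands in for a real cross-encoder, which you would wrap in a function with the same (question, passage) -> float shape.

```python
from typing import Callable


def rerank(
    question: str,
    candidates: list[dict],
    score_fn: Callable[[str, str], float],
    top_n: int = 5,
) -> list[dict]:
    """Re-score candidate chunks against the question and keep the best top_n."""
    scored = [
        {**chunk, "rerank_score": score_fn(question, chunk["text"])}
        for chunk in candidates
    ]
    scored.sort(key=lambda c: c["rerank_score"], reverse=True)
    return scored[:top_n]


def word_overlap(question: str, passage: str) -> float:
    """Toy stand-in scorer: fraction of question words present in the passage."""
    q_words = set(question.lower().split())
    return len(q_words & set(passage.lower().split())) / max(len(q_words), 1)


candidates = [  # in the POC these would be the top-20 rows from retrieve_top_k
    {"source": "shipping_policy.txt", "text": "We ship to all 50 US states."},
    {"source": "return_policy.txt", "text": "We accept returns within 30 days of purchase."},
]
best = rerank("returns within how many days", candidates, word_overlap, top_n=1)
print(best[0]["source"])
```

Keeping the scorer as a parameter means the retrieval code does not change when you swap the toy function for a real model.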
Chunking documents with structure
PDFs and HTML documents have structure: headings, sections, tables. Ignoring that structure and splitting by character count often slices sentences in half or separates a question from its answer. The LangChain recursive character text splitter tries to split on paragraph then sentence then word boundaries. For very structured docs (contracts, legal filings) parsing to a document tree before chunking is worth the extra work.
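A minimal structure-aware splitter, in the spirit of the recursive approach described above, might look like this. It is a standard-library sketch and not LangChain's implementation: split_structured is a hypothetical helper, the sentence regex is deliberately naive, and overlap is omitted for brevity.

```python
import re


def split_structured(text: str, max_chars: int = 512) -> list[str]:
    """Split on paragraphs first, then sentences, then hard character cuts."""
    chunks: list[str] = []
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        if len(para) <= max_chars:
            chunks.append(para)  # the paragraph fits as one chunk
            continue
        # Paragraph too long: pack whole sentences into chunks.
        current = ""
        for sentence in re.split(r"(?<=[.!?])\s+", para):
            candidate = f"{current} {sentence}".strip()
            if len(candidate) <= max_chars:
                current = candidate
                continue
            if current:
                chunks.append(current)
            # A single sentence longer than max_chars gets a hard cut.
            while len(sentence) > max_chars:
                chunks.append(sentence[:max_chars])
                sentence = sentence[max_chars:]
            current = sentence
        if current:
            chunks.append(current)
    return chunks


doc = "Short paragraph.\n\nFirst sentence here. Second sentence here. Third one."
print(split_structured(doc, max_chars=45))
```

Paragraph boundaries are never crossed, so a chunk cannot mix the end of one section with the start of the next.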
Connecting to Part 13: the full support agent
This POC is a complete RAG pipeline but it is single-turn. In Part 13 of this series, we extend this pattern into a multi-turn support agent that combines RAG retrieval with tool use: it can look up order status, create tickets, and escalate to a human, all within the same conversation loop.
Common Pitfalls
- Wrong embedding at query time. If you retrain or switch your embedding model, you must re-embed every chunk in the database. Mixing models means similarity scores are meaningless. Keep the model name in a config table and assert it at startup.
- Chunks that are too large. Long chunks reduce precision. The top retrieved chunk might be a 1,000-token section where only 20 tokens are relevant. Start with 256-512 characters and tune from real recall metrics.
- Not using overlap. A fact that falls at the boundary of two chunks will be split between them. 10-15% overlap (e.g., 64 characters on 512-character chunks) prevents most boundary problems.
- Treating similarity score as confidence. A similarity of 0.85 does not mean the chunk answers the question. It means the chunk is close to the question in embedding space. Always pass multiple chunks and let Claude judge relevance.
- No hallucination guard in the prompt. Without an explicit instruction to answer only from context, Claude will use its training data to fill gaps. The system prompt in the POC above includes this guard. Do not remove it.
- Re-ingesting without clearing. Running ingest twice without clearing old rows creates duplicate chunks. The clear_source function in the POC handles this. In production, add a content hash column and upsert on it instead.
- Forgetting the HNSW index. Without the index, every query does a full sequential scan. On 10,000 chunks that is tolerable. On 100,000 chunks it becomes seconds per query. The CREATE INDEX USING hnsw in initdb.sql handles this.
- pgvector version mismatch. The vector_cosine_ops HNSW index was added in pgvector 0.5.0. The Docker image pgvector/pgvector:pg16 ships a current version. If you install pgvector from your distro’s package manager, check the version first.
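For the first pitfall in the list above, a startup check could look like the sketch below. The one-row config table it implies and the key names embed_model and embed_dims are assumptions for illustration; the POC itself does not create such a table.

```python
EMBED_MODEL = "all-MiniLM-L6-v2"
EMBED_DIMS = 384


def assert_embedding_config(stored: dict, model_name: str = EMBED_MODEL, dims: int = EMBED_DIMS) -> None:
    """Fail fast if the chunks were embedded with a different model or dimension."""
    if stored.get("embed_model") != model_name or stored.get("embed_dims") != dims:
        raise RuntimeError(
            f"Embedding config mismatch: table has {stored}, "
            f"app expects model={model_name!r}, dims={dims}. Re-embed before querying."
        )


# `stored` would be read from a hypothetical config table at startup.
assert_embedding_config({"embed_model": "all-MiniLM-L6-v2", "embed_dims": 384})
```

Raising at startup is deliberate: a mismatched model does not produce errors at query time, only quietly meaningless similarity scores.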
Cost and Latency
The cost split in a RAG pipeline is: embedding (near-zero with a local model), vector search (near-zero on Postgres), Claude generation (the dominant cost). The table below shows representative numbers at moderate scale.
| Step | Typical latency | Cost per 1,000 queries (input tokens only) | Scaling notes |
|---|---|---|---|
| Embed question (local) | 5-20ms | $0 | Linear with batch size; GPU speeds this up 10x |
| pgvector HNSW search (10k chunks) | 2-8ms | $0 | Sub-linear with HNSW; degrades gracefully to 1M+ chunks |
| Claude claude-haiku-4-5 (500 input tokens) | 400-800ms | ~$0.40 | Fastest; fine for short factual answers |
| Claude claude-sonnet-4-6 (500 input tokens) | 600-1400ms | ~$1.50 | Best overall quality/cost default |
| Claude claude-opus-4-8 (500 input tokens) | 1500-4000ms | ~$7.50 | Reserve for complex multi-step reasoning over large contexts |
| Prompt cache hit (same system prompt) | Same latency | ~10% of normal input cost | Effective when system prompt is large and stable |
Prompt caching is the most impactful cost lever here. If your system prompt contains a large static knowledge block (say, a 2,000-token company guidelines section), caching it cuts the cost of those input tokens by about 90% on every repeat call that hits the cache. The POC already enables caching on the system prompt. For a heavier context, also cache the context block itself if it is reused across queries.
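The arithmetic behind that saving can be sketched as follows. The inputs are assumptions: $3 per million input tokens is implied by the Sonnet row above ($1.50 for 1,000 queries of 500 tokens), cache reads are billed at roughly a tenth of the normal rate, and cache-write surcharges and output tokens are ignored.

```python
def input_cost_usd(
    queries: int,
    tokens_per_query: int,
    price_per_mtok: float,
    cached_tokens: int = 0,
    cache_read_multiplier: float = 0.10,  # assumption: cache reads cost ~10% of normal input
) -> float:
    """Input-side cost when `cached_tokens` of each query are served from the cache."""
    uncached = tokens_per_query - cached_tokens
    billable = uncached + cached_tokens * cache_read_multiplier
    return queries * billable * price_per_mtok / 1_000_000


# Assumed workload: a 2,000-token cached block plus 500 fresh tokens per query.
without_cache = input_cost_usd(1_000, 2_500, 3.0)
with_cache = input_cost_usd(1_000, 2_500, 3.0, cached_tokens=2_000)
print(f"${without_cache:.2f} vs ${with_cache:.2f} per 1,000 queries")
```

The saving applies only to the cached 2,000 tokens; the 500 fresh tokens per query are billed at the full rate either way.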
| Situation | Recommended model | Why |
|---|---|---|
| High-volume support bot, simple FAQ answers | claude-haiku-4-5 | 4x cheaper than Sonnet; answers are factual, context is provided |
| General document Q&A, most internal tools | claude-sonnet-4-6 | Best balance; handles multi-part questions well |
| Legal, contracts, complex multi-doc synthesis | claude-opus-4-8 | Better at nuanced reasoning across conflicting chunks |
| Classification / routing before Q&A | claude-haiku-4-5 | Route to specialist doc store before calling heavier model |
Extending the POC: What to Add Before Production
Async ingestion pipeline
The POC ingests synchronously in the main process. For any real document volume (hundreds of files), you want a background worker. A simple approach: write file paths to a queue (Redis or a Postgres table), run workers that pull from the queue, embed, and insert. This also handles failures gracefully: if embedding fails mid-document you do not lose half the chunks.
Document update detection
Track a content_hash (SHA-256 of the raw text) alongside each source. On re-ingest, compare hashes. Only re-chunk and re-embed files whose hash changed. This makes incremental updates fast and avoids unnecessary API or compute calls.
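The comparison itself is small. In this sketch content_hash and sources_to_reingest are hypothetical helpers, and stored_hashes stands for whatever you read back from a content_hash column, which the POC schema does not yet have.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 of the raw document text, as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sources_to_reingest(docs: dict[str, str], stored_hashes: dict[str, str]) -> list[str]:
    """Return sources whose text is new or has changed since the last ingest."""
    return [
        source
        for source, text in docs.items()
        if stored_hashes.get(source) != content_hash(text)
    ]


docs = {"a.txt": "hello", "b.txt": "world"}
stored = {"a.txt": content_hash("hello"), "b.txt": content_hash("old text")}
print(sources_to_reingest(docs, stored))
```

Only the sources returned here need to go through chunking and embedding again; unchanged files are skipped entirely.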
Answer evaluation
RAG systems drift over time as documents change and questions evolve. Build an eval set of question-answer pairs and run it on every ingestion cycle. See Part 24 on LLM evals for a practical harness you can adapt. At minimum, track whether the correct source chunk appears in the top-3 retrieved results (recall@3) and whether Claude’s answer contains the key fact (checked by a second Claude call or regex).
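A retrieval metric along those lines can be computed as in the sketch below. It simplifies in one labelled way: it matches on the source filename rather than the exact chunk, and the eval case keys (question, expected_source) are assumed names. The retrieve argument is any callable that returns ranked dicts with a source key, for example a lambda wrapping retrieve_top_k.

```python
def recall_at_k(eval_set: list[dict], retrieve, k: int = 3) -> float:
    """Fraction of questions whose expected source appears in the top-k results."""
    if not eval_set:
        return 0.0
    hits = 0
    for case in eval_set:
        results = retrieve(case["question"])[:k]
        if any(r["source"] == case["expected_source"] for r in results):
            hits += 1
    return hits / len(eval_set)


def fake_retrieve(question: str) -> list[dict]:
    """Stand-in retriever so the sketch runs without a database."""
    if "return" in question:
        return [{"source": "return_policy.txt"}]
    return [{"source": "shipping_policy.txt"}]


cases = [
    {"question": "What is the return window?", "expected_source": "return_policy.txt"},
    {"question": "How do I reset my password?", "expected_source": "password_policy.txt"},
]
print(recall_at_k(cases, fake_retrieve))
```

Tracking this number after every ingestion run makes a retrieval regression visible before it shows up as a wrong answer.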
Streaming responses
For a web interface, streaming the Claude response token by token is much better UX than waiting 2 seconds for the full answer. The Anthropic SDK makes this easy. See Part 26 on streaming for the full pattern. In the POC, replace client.messages.create(...) with a with client.messages.stream(...) as stream: block and iterate stream.text_stream.
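A minimal streaming wrapper might look like the sketch below. stream_answer is a hypothetical helper; it takes the client as a parameter (in the POC that would be anthropic.Anthropic()) and assumes a plain-string system prompt.

```python
def stream_answer(client, model: str, system: str, user_message: str):
    """Yield text fragments as they arrive instead of waiting for the full reply."""
    # messages.stream() is used as a context manager so the connection is closed cleanly.
    with client.messages.stream(
        model=model,
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": user_message}],
    ) as stream:
        for text in stream.text_stream:
            yield text
```

A console caller would loop over stream_answer(...) and print each fragment with end="" and flush=True; a web handler would forward the same fragments over server-sent events or a websocket.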
Structured citations via tool use
The POC asks Claude to include citations in prose ([1], [2], etc.). That is fine for a console demo, but parsing brackets out of prose is brittle once you want to render clickable source links in a web UI. A more reliable approach for downstream processing: define a single tool that represents your output schema and force Claude to call it. You then read the structured object directly from the tool input. The drop-in replacement for answer_with_claude below does exactly that. It is the same retrieval pipeline; only the generation step changes.
def answer_with_citations(question: str, chunks: list[dict]) -> dict:
    """
    Generation step that returns MACHINE-READABLE citations.
    Instead of asking Claude to embed [1] markers in prose, we define one
    tool whose input_schema IS our output shape, then force Claude to call
    it with tool_choice. We read block.input as the structured result.
    """
    client = anthropic.Anthropic()
    context = build_context_block(chunks)
    answer_tool = {
        "name": "submit_answer",
        "description": "Return the final answer with explicit source citations.",
        "input_schema": {
            "type": "object",
            "properties": {
                "answer": {
                    "type": "string",
                    "description": "The answer, written for an end user.",
                },
                "found": {
                    "type": "boolean",
                    "description": "True if the context contained the answer.",
                },
                "citations": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "passage": {"type": "integer"},
                            "source": {"type": "string"},
                            "quote": {"type": "string"},
                        },
                        "required": ["passage", "source", "quote"],
                    },
                },
            },
            "required": ["answer", "found", "citations"],
        },
    }
    system_prompt = (
        "You answer questions using ONLY the supplied context passages. "
        "Set found=false and citations=[] if the answer is not present. "
        "Every citation quote must be copied verbatim from a passage."
    )
    user_message = f"Context passages:\n\n{context}\n\nQuestion: {question}"
    try:
        msg = client.messages.create(
            model=CLAUDE_MODEL,
            max_tokens=1024,
            system=system_prompt,
            tools=[answer_tool],
            tool_choice={"type": "tool", "name": "submit_answer"},
            messages=[{"role": "user", "content": user_message}],
        )
    except anthropic.APIError as exc:
        print(f"Claude API error: {exc}", file=sys.stderr)
        raise
    # With a forced tool, Claude responds with a tool_use block whose
    # .input is our structured object. stop_reason will be "tool_use".
    for block in msg.content:
        if block.type == "tool_use" and block.name == "submit_answer":
            return block.input  # {"answer": ..., "found": ..., "citations": [...]}
    # Defensive fallback: should not happen with a forced tool.
    return {"answer": "", "found": False, "citations": []}
Because tool_choice forces the call, msg.stop_reason is normally "tool_use" and the structured object is the input of the tool_use content block. If the response is cut off at max_tokens, stop_reason is "max_tokens" instead and the input may be incomplete, so check it before trusting the result. You get a Python dict with answer, a found boolean you can branch on, and a list of citations that each name a passage number, a source file, and the verbatim quote. Rendering a footnote with a real link is then a loop over citations, not a regex over prose. Part 2 on tool use covers the full request/response loop including multi-tool flows. Part 3 on structured output goes deeper on the forced-tool pattern and schema design.
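As a rough illustration of that "loop over citations", the sketch below checks each quote against the passage it cites and then renders the survivors as an HTML list. It is not part of the POC: the chunk keys "content" and "source", the 1-based passage numbering, and the /docs/ URL prefix are all assumptions made for the example, so adjust them to match whatever build_context_block actually uses.

```python
from html import escape


def verify_citations(result: dict, chunks: list[dict]) -> list[dict]:
    """Keep only citations whose quote really appears in the cited passage.

    Assumes passages are numbered from 1 in the order of `chunks`, and that
    each chunk dict has a "content" key (hypothetical name).
    """
    verified = []
    for cite in result.get("citations", []):
        idx = cite["passage"] - 1
        if not 0 <= idx < len(chunks):
            continue  # passage number does not exist
        # Collapse whitespace so line wraps in the source do not cause misses.
        haystack = " ".join(chunks[idx]["content"].split())
        needle = " ".join(cite["quote"].split())
        if needle and needle in haystack:
            verified.append(cite)
    return verified


def render_footnotes(citations: list[dict]) -> str:
    """Render citations as an HTML ordered list of source links.

    The /docs/ prefix is a placeholder for wherever your files are served.
    """
    items = []
    for cite in citations:
        source = escape(cite["source"], quote=True)
        quote = escape(cite["quote"])
        items.append(f'<li><a href="/docs/{source}">{source}</a>: "{quote}"</li>')
    return "<ol>\n" + "\n".join(items) + "\n</ol>"
```

Running verify_citations before rendering means a quote the model paraphrased or invented is dropped rather than shown to the user as if it were verbatim.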
Frequently Asked Questions
Can I use OpenAI embeddings with this pgvector setup instead of sentence-transformers?
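Yes. Swap embed_chunks to call openai.embeddings.create(model="text-embedding-3-small", input=chunks) and change the vector dimension in initdb.sql to 1536 (or 3072 for text-embedding-3-large). Note that pgvector can only index vector columns of up to 2,000 dimensions, so 3072-dimensional embeddings need the halfvec type or a reduced dimensions setting before you can build an HNSW index on them. OpenAI’s embeddings often retrieve better than small local models on English text, at the cost of an API fee. The rest of the pipeline stays the same. The critical rule: pick one embedding model and never mix it with another in the same table.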
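A minimal sketch of that swap is below. It assumes you pass in an openai.OpenAI() client and that the table column is vector(1536); the function name embed_chunks_openai and the EXPECTED_DIM constant are made up for this example. The dimension check is there to catch the "mixed models in one table" mistake at ingest time rather than at query time.

```python
EXPECTED_DIM = 1536  # assumption: must match the vector(...) size in initdb.sql


def embed_chunks_openai(
    chunks: list[str],
    client,
    model: str = "text-embedding-3-small",
) -> list[list[float]]:
    """Embed text chunks via the OpenAI embeddings endpoint.

    `client` is expected to be an openai.OpenAI() instance (or anything
    exposing the same embeddings.create interface).
    """
    response = client.embeddings.create(model=model, input=chunks)
    vectors = [item.embedding for item in response.data]
    for vec in vectors:
        if len(vec) != EXPECTED_DIM:
            raise ValueError(
                f"Got {len(vec)}-dim embedding, table expects {EXPECTED_DIM}"
            )
    return vectors
```

The API limits how many inputs a single request can carry, so a large ingest would need to call this in batches.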
How many documents can pgvector handle before it becomes slow?
With the HNSW index, pgvector handles tens of millions of vectors on a well-provisioned Postgres instance (8+ GB RAM, SSD). At 1 million 384-dimensional float32 vectors, the raw vector data alone is about 1.5 GB (1,000,000 × 384 × 4 bytes), and the HNSW graph adds overhead on top of that, so budget RAM accordingly. Query latency typically stays under 10 ms for top-10 searches up to about 5 million rows. Beyond that, tune the search parameters (raise hnsw.ef_search from its default of 40 for better recall, lower it for speed) or shard by document category.
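The 1.5 GB figure can be sanity-checked with back-of-envelope arithmetic. The helper below is only an estimate of the raw float32 payload; HNSW graph links and Postgres page overhead come on top, and the function name is made up for this example.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_float: int = 4) -> int:
    """Bytes needed for the vector payload alone (no index or row overhead)."""
    return num_vectors * dim * bytes_per_float


for dim in (384, 1536):
    size_gb = raw_vector_bytes(1_000_000, dim) / 1e9
    print(f"1M vectors x {dim} dims: {size_gb:.2f} GB raw")
```

The same million chunks cost four times as much memory at 1536 dimensions as at 384, which is worth weighing when choosing an embedding model.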
What happens when Claude gets context chunks that contradict each other?
Claude will usually note the contradiction and present both perspectives if the system prompt does not tell it otherwise. For legal or compliance use cases, add an instruction like “If context passages conflict, report the conflict to the user rather than choosing one.” You can also add a step that deduplicates or reconciles chunks before passing them to Claude.
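The simplest form of that extra step is exact-duplicate removal, sketched below. It assumes each chunk is a dict with a "content" key (a hypothetical name; use whatever your retrieval step returns) and that chunks arrive ranked best-first. It does not reconcile genuinely conflicting passages; that still needs the prompt instruction above or a human decision.

```python
import re


def dedupe_chunks(chunks: list[dict]) -> list[dict]:
    """Drop chunks whose text repeats an earlier chunk.

    Text is compared after lowercasing and collapsing whitespace, and the
    first (highest-ranked) copy is kept.
    """
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        key = re.sub(r"\s+", " ", chunk["content"]).strip().lower()
        if key in seen:
            continue
        seen.add(key)
        unique.append(chunk)
    return unique
```

Duplicates are common when the same policy text appears in several versions of a document, and removing them frees context budget for passages that add information.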
Is this approach safe for confidential documents?
The raw text of each chunk is stored in Postgres in plaintext. If your documents are confidential, apply the same access controls you would to any Postgres table (row-level security, encrypted connections, encrypted at-rest storage). Embedding vectors are not trivially reversible, but published inversion attacks can recover much of the original text from them, so treat the vectors as sensitive too. Before sending confidential document text to an external embedding API, review that provider’s data retention policy.
How do I handle very long documents that exceed the chunk budget?
The top-k retrieval step naturally handles this: you ingest the whole document in chunks and at query time only the relevant chunks are passed to Claude. You do not send the whole document on every query. A 500-page PDF produces several thousand chunks, but a single Q&A call will only send 5-10 of them to Claude.
Should I use pgvector in production or a dedicated vector database like Pinecone?
For most teams, pgvector is the right answer up to roughly 10 million vectors. It removes an entire service from your infrastructure (and its associated cost, ops burden, and failure modes). Pinecone and Weaviate offer better performance at very large scale and have managed offerings that remove the ops burden in a different way. If you already run Postgres, start with pgvector. You can migrate to a dedicated store later if you genuinely hit scale limits.
Do I need to re-embed documents if I switch from claude-haiku-4-5 to claude-sonnet-4-6?
No. The embedding model and the generation model are completely independent. Embeddings are produced by sentence-transformers (or whatever embedding model you choose) and stored in pgvector. The Claude model only sees the retrieved text chunks, not the vectors. You can switch Claude models on any query with no changes to the database.
Read the full series index at skillsuites.com/category/ai-use-cases/.
Further Reading and External Resources
- Anthropic prompt caching documentation for the exact cache_control syntax and billing details.
- pgvector on GitHub for HNSW vs IVFFlat index comparisons, tuning parameters, and version history.
- sentence-transformers documentation for the full model zoo, including multilingual and code-aware embedding models.
- Anthropic Messages API reference for the full messages.create parameter list including tool use and vision.
- Postgres full-text search documentation for the tsvector/tsquery system you would use in a hybrid BM25 + vector search setup.