Build an AI Customer Support Agent with Claude (RAG + Tools)

Series
AI in Production: 30 Real-World Use Cases with Claude

Part 13 of 30

TL;DR

  • An AI customer support agent combines retrieval-augmented generation (RAG) with tool calls so the model answers from real docs and takes actions like looking up orders or filing tickets.
  • The full loop is: embed query, retrieve top-K chunks, inject into context, call Claude, handle tool_use stop, execute tools, loop back until end_turn.
  • Two tools cover most tier-1 support: lookup_order (read data) and create_ticket (write action). Add more without changing the loop.
  • Prompt caching on the knowledge base cuts costs sharply when the same docs are hit across many turns or sessions.
  • Use claude-sonnet-4-6 as the default; route purely factual, single-turn FAQ questions to claude-haiku-4-5 to save money at high volume.
  • The POC below is fully runnable with pip install anthropic sentence-transformers numpy python-dotenv and a small in-memory vector store.

Why an AI Customer Support Agent Pays for Itself Quickly

Tier-1 support is expensive to staff and boring to do. The questions are mostly the same: where is my order, how do I reset my password, what is your refund policy. A human agent earning $18/hour can handle maybe 15 tickets per hour. At any meaningful scale, that cost compounds fast.

An AI customer support agent running on Claude handles those questions instantly, 24/7, in many languages, while also escalating edge cases to humans with a pre-filled ticket. Done well, it can deflect 60 to 80 percent of tier-1 volume. The remaining 20 to 40 percent reaches a human with full context already attached, so the human spends their time solving rather than gathering.

The critical design choice is the combination of RAG and tool use. RAG alone lets the agent answer from your docs but it cannot act. Tools alone let it act but it might hallucinate facts it should have looked up from a doc. Together, the agent knows your policies and can check a real order status. That pairing is what separates a useful production agent from a demo.

If you have not read the RAG fundamentals from Part 10 (RAG with Claude and pgvector) or the tool use basics from Part 2 (Tool Use with Claude), skim those first. This article builds on both.

Architecture of the AI Customer Support Agent

The Three Layers

The agent has three layers that work together on every turn:

  1. Knowledge layer (RAG): A set of plain-text documents (policies, FAQs, product descriptions) that have been chunked and embedded. On each turn, the user message is embedded and the top-K most similar chunks are retrieved. Those chunks are injected into the system prompt as context.
  2. Action layer (tools): Two or more functions that the model can call when it needs live data or needs to write something. In this POC: lookup_order hits a fake order database and create_ticket writes a support ticket record. Claude decides when to call them based on the conversation.
  3. Conversation layer (multi-turn loop): A message history that grows each turn. Claude’s responses and tool results both get appended so the model has full context for follow-up questions.
[Figure 1 diagram: AI Customer Support Agent: Request Flow. User Message → Embed Query → Vector Store (top-K, built from the Knowledge Base of docs/FAQs) → Claude Sonnet 4.6 + Tools → Tools (lookup_order, create_ticket) → tool result appended to message history.]
Figure 1. Request flow for the AI customer support agent. Each user message triggers an embedding lookup, retrieves relevant doc chunks, and calls Claude with both the docs and the full tool set. If Claude returns a tool call, the function runs and the result is fed back in a new turn.

Why Not Just Fine-Tune?

Fine-tuning bakes knowledge into weights. That works if your knowledge is static, but support knowledge changes constantly: new product versions, policy changes, pricing updates. With RAG, you update a document and the agent knows it instantly. No retraining cycle.

The Multi-Turn Loop in Detail

Most tutorials show a single-turn tool use example. Production support is multi-turn: “Where is order 1234?” then “Can you escalate it?” then “What is your SLA for escalations?” The loop must handle all of that with the same history object. The structure is:

  1. Retrieve RAG chunks for the latest user message and append them to the system prompt (or to the user message as a context block).
  2. Send the full messages list plus tool definitions to Claude.
  3. If stop_reason == "tool_use", extract all tool-use blocks, execute them, append a user message with tool_result content, go to step 2.
  4. If stop_reason == "end_turn", extract the text and return it to the user. Make sure both the user message and the assistant response are in the history so the next turn sees them.

Step 3 can happen multiple times in one logical turn if Claude decides to call two tools in sequence. The loop handles that naturally.
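The control flow of steps 2 to 4 can be sketched without any network calls. This is a minimal illustration, not the POC itself: run_turn and fake_model are hypothetical names, and fake_model is a scripted stand-in that mimics the shape of a Messages API response using plain dicts.

```python
from typing import Callable


def run_turn(call_model: Callable[[list], dict], tools: dict, history: list,
             user_text: str, max_iterations: int = 8) -> str:
    """Drive one logical turn: call the model, run tools, repeat until end_turn."""
    history.append({"role": "user", "content": user_text})
    for _ in range(max_iterations):
        response = call_model(history)
        history.append({"role": "assistant", "content": response["content"]})
        if response["stop_reason"] == "end_turn":
            return "".join(b["text"] for b in response["content"] if b["type"] == "text")
        if response["stop_reason"] != "tool_use":
            break
        # Step 3: run every requested tool and answer with ONE user message.
        results = [
            {"type": "tool_result", "tool_use_id": b["id"],
             "content": str(tools[b["name"]](**b["input"]))}
            for b in response["content"] if b["type"] == "tool_use"
        ]
        history.append({"role": "user", "content": results})
    return "[stopped without a final answer]"


def fake_model(history: list) -> dict:
    """Scripted stand-in: ask for a tool first, then answer once results arrive."""
    if isinstance(history[-1]["content"], str):
        return {"stop_reason": "tool_use", "content": [
            {"type": "tool_use", "id": "t1", "name": "lookup_order",
             "input": {"order_id": "ORD-1001"}}]}
    return {"stop_reason": "end_turn",
            "content": [{"type": "text", "text": "Your order has shipped."}]}


history: list = []
reply = run_turn(fake_model, {"lookup_order": lambda order_id: {"status": "shipped"}},
                 history, "Where is ORD-1001?")
print(reply, "| messages in history:", len(history))
```

The history ends up with four messages: the user question, the assistant tool request, the user-role tool result, and the final assistant answer. That ordering is what the real API expects.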

Knowledge Base Design for Support Agents

What Goes in the Knowledge Base

Start with three document types that cover 80 percent of support queries:

  • Policy documents: Refund policy, SLA tiers, shipping windows, acceptable use. These answer “what will you do for me” questions.
  • Product FAQs: How does X work, what is Y limit, is Z compatible. These answer “how do I use this” questions.
  • Troubleshooting guides: Step-by-step resolution flows for the top 10 to 20 error types. These answer “something is broken” questions.

Avoid dumping your entire wiki into the knowledge base. Long documents with weak signal produce noisy retrievals. Prefer focused, 200 to 400 word chunks that each answer one question well.

Chunking Strategy

For support docs, semantic chunking by section heading works better than fixed-token sliding windows. A section on “Refund Eligibility” should be one chunk, not split mid-sentence because a 512-token window ended there. In the POC below, the knowledge base is small enough to keep each document as a single chunk. For a real deployment, use a recursive text splitter with overlap.
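Chunking by section heading can be sketched in a few lines. This is an illustration under one assumption: that your source docs mark sections with a line starting with "## ". The function name chunk_by_heading and the marker are hypothetical; adapt them to however your docs are actually structured.

```python
def chunk_by_heading(doc: str, marker: str = "## ") -> list[dict]:
    """Split a document into one chunk per heading, keeping the heading as the title."""
    chunks: list[dict] = []
    title, lines = "Introduction", []

    def flush() -> None:
        # Only emit a chunk if the section has some non-blank text.
        if any(line.strip() for line in lines):
            chunks.append({"title": title, "text": "\n".join(lines).strip()})

    for line in doc.splitlines():
        if line.startswith(marker):
            flush()
            title, lines = line[len(marker):].strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


sample = """## Refund Eligibility
Unused items can be refunded within 30 days.

## Refund Timing
Refunds are processed within 5 to 7 business days."""

chunks = chunk_by_heading(sample)
for chunk in chunks:
    print(chunk["title"], "->", chunk["text"])
```

Sections that grow well past the 200 to 400 word target are the ones worth splitting further with a recursive splitter.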

Embedding Model Choice

For this POC, sentence-transformers/all-MiniLM-L6-v2 runs locally, costs nothing, and produces good retrieval quality on English support text. For production at scale, consider OpenAI’s text-embedding-3-small (API-hosted, so there is no local model to run) or Voyage AI’s models (strong on code and technical text). The embedding model is orthogonal to Claude: you can swap it without touching the agent logic, as long as you re-embed the knowledge base with the new model.
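One way to keep the embedding model swappable is to pass the embedder in as a plain function. The sketch below assumes a hypothetical make_retriever helper and uses toy_embed, a deliberately crude letter-frequency embedder, purely so the example runs without downloads. In a real system you would pass in a function that wraps your actual embedding model.

```python
import math
from typing import Callable, Sequence

# An embedder is any function from a batch of texts to unit-length vectors.
Embedder = Callable[[Sequence[str]], list[list[float]]]


def toy_embed(texts: Sequence[str]) -> list[list[float]]:
    """Stand-in embedder: normalised letter-frequency vectors. Not for real use."""
    vectors = []
    for text in texts:
        vec = [0.0] * 26
        for ch in text.lower():
            if "a" <= ch <= "z":
                vec[ord(ch) - ord("a")] += 1.0
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def make_retriever(embed: Embedder, docs: list[str]):
    """Embed the docs once, then return a retrieve() bound to that embedder."""
    doc_vecs = embed(docs)

    def retrieve(query: str, top_k: int = 2) -> list[str]:
        q = embed([query])[0]
        scores = [sum(a * b for a, b in zip(q, d)) for d in doc_vecs]
        ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
        return [docs[i] for i in ranked[:top_k]]

    return retrieve


retrieve = make_retriever(toy_embed, ["refund policy", "shipping times"])
print(retrieve("refund", top_k=1))
```

Swapping models then means passing a different embed function to make_retriever; the agent loop only ever sees retrieve.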

Defining the Tools

lookup_order

This tool takes an order ID and returns status, shipping carrier, estimated delivery, and any hold reason. In production this calls your order management system. In the POC it queries a small in-memory dictionary. The key design decision is the return shape: return a structured dict with well-named fields so Claude can cite them accurately in its reply.

Key idea: Give tool results clean, labelled fields, not raw database rows or opaque JSON blobs. Claude summarises what it receives. Ambiguous field names produce ambiguous summaries.
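As an illustration of that idea, here is a sketch that maps a raw row to labelled fields before it is returned to the model. The row layout and the numeric status codes are invented for the example; only the output field names follow the POC.

```python
# Hypothetical status codes as they might be stored in an order table.
STATUS_NAMES = {0: "pending", 1: "confirmed", 2: "shipped",
                3: "delivered", 4: "on_hold", 5: "cancelled"}


def shape_order_row(row: tuple) -> dict:
    """Turn a positional DB row into a dict with self-explanatory field names."""
    order_id, status_code, carrier, eta, hold_reason = row
    return {
        "order_id": order_id,
        "status": STATUS_NAMES.get(status_code, "unknown"),
        "carrier": carrier,
        "estimated_delivery": eta,
        "hold_reason": hold_reason,
    }


raw_row = ("ORD-1001", 2, "UPS", "2026-06-07", None)
shaped = shape_order_row(raw_row)
print(shaped)
```

The model never sees the bare code 2; it sees "status": "shipped", which it can quote directly.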

create_ticket

This tool creates a support ticket with a category, priority, summary, and optional order ID. It returns a ticket ID and an acknowledgement message. The agent calls it when the user asks for escalation, when no resolution is found in the knowledge base, or when the user explicitly says “I want to speak to someone.”

Notice that create_ticket is a write operation. You may want guardrails on it: rate limiting per session, requiring a confirmation message before executing, or logging every call. Those are production concerns. The POC demonstrates the plumbing; add the safety layer before shipping.
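A thin wrapper is one possible shape for those guardrails. The sketch below is an assumption-heavy illustration: guarded_create_ticket, the per-session limit of 2, and the confirmed flag (which your application would set once the customer has said yes) are all hypothetical, and fake_create_ticket stands in for the real tool.

```python
import logging
from collections import defaultdict
from typing import Callable

logger = logging.getLogger("support_agent.tickets")

MAX_TICKETS_PER_SESSION = 2  # assumed limit; tune for your workload
_tickets_by_session: dict[str, int] = defaultdict(int)


def guarded_create_ticket(session_id: str, confirmed: bool,
                          create_fn: Callable[..., dict], **ticket_args) -> dict:
    """Run create_fn only if the customer confirmed and the session is under its limit."""
    if not confirmed:
        return {"error": "confirmation_required"}
    if _tickets_by_session[session_id] >= MAX_TICKETS_PER_SESSION:
        return {"error": "rate_limited"}
    _tickets_by_session[session_id] += 1
    logger.info("create_ticket session=%s args=%s", session_id, ticket_args)
    return create_fn(**ticket_args)


def fake_create_ticket(**kwargs) -> dict:
    """Stand-in for the real create_ticket tool."""
    return {"ticket_id": "TKT-DEMO", **kwargs}


first = guarded_create_ticket("s1", True, fake_create_ticket,
                              category="shipping", priority="high",
                              summary="Order on hold")
unconfirmed = guarded_create_ticket("s1", False, fake_create_ticket,
                                    category="shipping", priority="high",
                                    summary="Order on hold")
print(first["ticket_id"], unconfirmed["error"])
```

Returning an error dict rather than raising keeps the agent loop alive: the model receives the refusal as a tool result and can ask the customer to confirm.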

Tool Schema (the JSON you pass Claude)

TOOLS = [
    {
        "name": "lookup_order",
        "description": (
            "Look up the current status of a customer order. "
            "Call this when the customer asks about an order, shipment, or delivery. "
            "Returns order status, carrier, estimated delivery, and any hold reason."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "The order ID, e.g. ORD-1234."
                }
            },
            "required": ["order_id"]
        }
    },
    {
        "name": "create_ticket",
        "description": (
            "Create a support ticket for issues that need human review or escalation. "
            "Call this when the customer explicitly asks to escalate, when the issue "
            "cannot be resolved from the knowledge base, or when the customer is very frustrated."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "category": {
                    "type": "string",
                    "enum": ["billing", "shipping", "technical", "returns", "other"],
                    "description": "Category of the support issue."
                },
                "priority": {
                    "type": "string",
                    "enum": ["low", "medium", "high", "urgent"],
                    "description": "Priority level based on customer impact."
                },
                "summary": {
                    "type": "string",
                    "description": "One to three sentence summary of the issue and what has been tried."
                },
                "order_id": {
                    "type": "string",
                    "description": "Related order ID if applicable."
                }
            },
            "required": ["category", "priority", "summary"]
        }
    }
]

For more on writing good tool definitions, see Part 2 of this series. Clear descriptions matter: Claude uses the description to decide when to call the tool. Vague descriptions produce wrong-tool calls.
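Even with clear descriptions, it can be worth checking a tool call's arguments against the schema before executing it. The checker below is a deliberately small sketch that only understands required and enum; validate_tool_input is a hypothetical helper, and a full JSON Schema validator would cover far more.

```python
CREATE_TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string",
                     "enum": ["billing", "shipping", "technical", "returns", "other"]},
        "priority": {"type": "string", "enum": ["low", "medium", "high", "urgent"]},
        "summary": {"type": "string"},
        "order_id": {"type": "string"},
    },
    "required": ["category", "priority", "summary"],
}


def validate_tool_input(schema: dict, tool_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the input looks acceptable."""
    problems = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in tool_input:
            problems.append(f"missing required field: {name}")
    for name, value in tool_input.items():
        spec = properties.get(name)
        if spec is None:
            problems.append(f"unexpected field: {name}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name} must be one of {spec['enum']}")
    return problems


good = validate_tool_input(CREATE_TICKET_SCHEMA,
                           {"category": "billing", "priority": "high", "summary": "Double charge"})
bad = validate_tool_input(CREATE_TICKET_SCHEMA,
                          {"category": "billing", "priority": "asap"})
print(good, bad)
```

When the list is non-empty, send it back as the tool result so the model can correct its own call.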

Prompt Caching on the Knowledge Base

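If you inject the knowledge base into the system prompt (or as a leading context block in the user message), and that text is identical across many requests, you can cache it. Marking a content block with cache_control of type ephemeral tells the platform to cache the prompt prefix up to and including that block. The default cache lifetime is 5 minutes, refreshed each time the cache is read. Subsequent requests with the same prefix pay cache-read rates for those tokens, about 10x cheaper than regular input tokens; the request that writes the cache pays a small premium over the normal input price. Prefixes shorter than the model’s minimum cacheable length are not cached at all, so a very small knowledge base may see no benefit.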

For a support agent with a 20KB knowledge base and 100 requests per minute, caching can cut knowledge-base token costs by 80 to 90 percent. That is significant. The pattern is:

system = [
    {
        "type": "text",
        "text": BASE_SYSTEM_PROMPT
    },
    {
        "type": "text",
        "text": knowledge_base_text,   # the retrieved or full KB
        "cache_control": {"type": "ephemeral"}
    }
]

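See Part 4 (Prompt Caching) for the full details and the math. In the POC below, cache_control is set on the knowledge base block of the system prompt. That block holds the chunks retrieved for the current message, so the cache is reused whenever consecutive calls carry the same chunks, as every pass through the tool loop within a turn does.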
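The savings depend on how often requests actually hit the cache. A quick sketch of the arithmetic, assuming cache reads cost 0.10x the base input price (the "about 10x cheaper" figure above) and cache writes cost 1.25x the base input price. Both multipliers are assumptions to check against current pricing, and relative_kb_cost is a hypothetical helper name.

```python
def relative_kb_cost(hit_rate: float,
                     read_multiplier: float = 0.10,
                     write_multiplier: float = 1.25) -> float:
    """Cost of the KB tokens relative to sending them uncached on every request."""
    return hit_rate * read_multiplier + (1.0 - hit_rate) * write_multiplier


for hit_rate in (0.50, 0.90, 0.99):
    saving = 1.0 - relative_kb_cost(hit_rate)
    print(f"hit rate {hit_rate:.0%}: KB token cost cut by {saving:.1%}")
```

The takeaway is that the hit rate matters more than the discount: misses pay the write premium, so a cache that is rewritten on half of all requests saves far less than one that is read almost every time.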

The Complete POC

What the POC Does

The POC is a terminal-based multi-turn support agent. It starts a conversation loop, embeds each user message, retrieves the top-2 knowledge base chunks, builds a cached system prompt, sends the full message history to Claude with both tools defined, and handles tool calls in a nested loop until Claude produces a final text answer. Conversation history grows across turns so the agent remembers earlier context.

Install and Dependencies

pip install anthropic sentence-transformers numpy python-dotenv

# requirements.txt
anthropic>=0.28.0
sentence-transformers>=3.0.0
numpy>=1.26.0
python-dotenv>=1.0.0

# .env.example
ANTHROPIC_API_KEY=sk-ant-...

Full Source: support_agent.py

"""
support_agent.py - AI Customer Support Agent with RAG + Tools
Part 13 of "AI in Production: 30 Real-World Use Cases with Claude"
Build an AI Customer Support Agent with Claude (RAG + Tools)

Run:
    python support_agent.py

Requirements:
    pip install anthropic sentence-transformers numpy python-dotenv
"""

import os
import time
import uuid
import json
import datetime
from typing import Optional

import numpy as np
from dotenv import load_dotenv
from sentence_transformers import SentenceTransformer
import anthropic

load_dotenv()  # reads ANTHROPIC_API_KEY from .env into the environment

# ---------------------------------------------------------------------------
# 1. Knowledge Base (small, in-memory for the POC)
# ---------------------------------------------------------------------------

KNOWLEDGE_DOCS = [
    {
        "id": "refund-policy",
        "title": "Refund Policy",
        "text": (
            "Refund Policy: Customers may request a full refund within 30 days of purchase "
            "for any unused item in its original packaging. Digital products are refundable "
            "within 7 days if the download has not been activated. Refunds are processed "
            "within 5 to 7 business days back to the original payment method. Shipping costs "
            "are non-refundable unless the return is due to our error. To start a refund, "
            "contact support with your order ID."
        )
    },
    {
        "id": "shipping-policy",
        "title": "Shipping Policy",
        "text": (
            "Shipping Policy: Standard shipping takes 5 to 7 business days within the US. "
            "Express shipping takes 2 business days and costs $12.99. Overnight shipping is "
            "available for $24.99. Orders placed before 2 PM EST ship same day on business "
            "days. International shipping to 45 countries is available; delivery times are "
            "10 to 21 business days depending on destination. Once an order ships, you will "
            "receive a tracking number by email within 1 hour."
        )
    },
    {
        "id": "account-faq",
        "title": "Account and Password FAQ",
        "text": (
            "Account FAQ: To reset your password, go to the login page and click 'Forgot "
            "Password'. You will receive a reset link valid for 24 hours. If you do not "
            "receive the email, check your spam folder or contact support. To update your "
            "email address, log in and navigate to Account Settings. Two accounts cannot "
            "share the same email address. For enterprise accounts with SSO, contact your "
            "IT administrator to update credentials."
        )
    },
    {
        "id": "order-status-faq",
        "title": "Order Status FAQ",
        "text": (
            "Order Status: You can check your order status at any time by providing your "
            "order ID (format: ORD-XXXX). Orders have the following statuses: 'pending' "
            "(payment processing), 'confirmed' (payment accepted, preparing to ship), "
            "'shipped' (in transit with carrier), 'delivered' (confirmed delivery), "
            "'on_hold' (action required, see hold reason), 'cancelled' (order cancelled). "
            "If your order shows 'on_hold', check the hold reason or contact support."
        )
    },
    {
        "id": "damaged-item",
        "title": "Damaged or Defective Items",
        "text": (
            "Damaged or Defective Items: If your item arrives damaged or defective, contact "
            "support within 48 hours of delivery. Include your order ID and photos of the "
            "damage. We will ship a replacement at no cost with expedited shipping, or issue "
            "a full refund including original shipping charges. We do not require you to "
            "return the damaged item for replacements under $50. For higher-value items, "
            "a prepaid return label will be emailed."
        )
    }
]

# ---------------------------------------------------------------------------
# 2. In-memory Vector Store
# ---------------------------------------------------------------------------

print("Loading embedding model (first run downloads ~90MB)...")
EMBED_MODEL = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
DOC_TEXTS = [d["text"] for d in KNOWLEDGE_DOCS]
DOC_EMBEDDINGS = EMBED_MODEL.encode(DOC_TEXTS, normalize_embeddings=True)


def retrieve(query: str, top_k: int = 2) -> list[dict]:
    """Return top_k most relevant knowledge base chunks for a query."""
    q_emb = EMBED_MODEL.encode([query], normalize_embeddings=True)
    # Embeddings are normalised, so the dot product is the cosine similarity.
    scores = (DOC_EMBEDDINGS @ q_emb.T).flatten()
    top_indices = np.argsort(scores)[::-1][:top_k]
    return [KNOWLEDGE_DOCS[i] for i in top_indices]


# ---------------------------------------------------------------------------
# 3. Fake Order Database
# ---------------------------------------------------------------------------

ORDERS = {
    "ORD-1001": {
        "order_id": "ORD-1001",
        "status": "shipped",
        "carrier": "UPS",
        "tracking": "1Z999AA10123456784",
        "estimated_delivery": "2026-06-07",
        "hold_reason": None,
        "items": ["Blue Widget x2", "Red Gadget x1"],
        "total": "$87.50"
    },
    "ORD-1002": {
        "order_id": "ORD-1002",
        "status": "on_hold",
        "carrier": None,
        "tracking": None,
        "estimated_delivery": None,
        "hold_reason": "Address verification failed. Please confirm your shipping address.",
        "items": ["Premium Package x1"],
        "total": "$149.00"
    },
    "ORD-1003": {
        "order_id": "ORD-1003",
        "status": "delivered",
        "carrier": "FedEx",
        "tracking": "7489923401234560",
        "estimated_delivery": "2026-06-01",
        "hold_reason": None,
        "items": ["Starter Kit x1"],
        "total": "$34.99"
    }
}

# In-memory ticket store (for the demo)
TICKETS: dict[str, dict] = {}

# ---------------------------------------------------------------------------
# 4. Tool Implementations
# ---------------------------------------------------------------------------


def lookup_order(order_id: str) -> dict:
    """Look up an order by ID and return its status details."""
    order_id = order_id.strip().upper()
    if order_id in ORDERS:
        return ORDERS[order_id]
    return {
        "error": f"Order '{order_id}' not found. Please check the order ID format (e.g. ORD-1001)."
    }


def create_ticket(
    category: str,
    priority: str,
    summary: str,
    order_id: Optional[str] = None
) -> dict:
    """Create a support ticket and return the ticket ID."""
    ticket_id = f"TKT-{uuid.uuid4().hex[:6].upper()}"
    ticket = {
        "ticket_id": ticket_id,
        "category": category,
        "priority": priority,
        "summary": summary,
        "order_id": order_id,
        "status": "open",
        "created_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "message": (
            f"Ticket {ticket_id} has been created ({priority} priority). "
            "A support agent will reach out within 1 business day for medium/low priority, "
            "or within 2 hours for high/urgent priority."
        )
    }
    TICKETS[ticket_id] = ticket
    return ticket


TOOL_MAP = {
    "lookup_order": lookup_order,
    "create_ticket": create_ticket,
}

# ---------------------------------------------------------------------------
# 5. Tool Definitions for Claude
# ---------------------------------------------------------------------------

TOOLS = [
    {
        "name": "lookup_order",
        "description": (
            "Look up the current status of a customer order. "
            "Call this when the customer asks about an order, shipment, or delivery. "
            "Returns order status, carrier, estimated delivery, and any hold reason."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "The order ID, e.g. ORD-1001."
                }
            },
            "required": ["order_id"]
        }
    },
    {
        "name": "create_ticket",
        "description": (
            "Create a support ticket for issues that need human review or escalation. "
            "Call this when the customer explicitly asks to escalate, when the issue "
            "cannot be resolved from the knowledge base, or when the customer is very frustrated."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "category": {
                    "type": "string",
                    "enum": ["billing", "shipping", "technical", "returns", "other"],
                    "description": "Category of the support issue."
                },
                "priority": {
                    "type": "string",
                    "enum": ["low", "medium", "high", "urgent"],
                    "description": "Priority level based on customer impact."
                },
                "summary": {
                    "type": "string",
                    "description": "One to three sentence summary of the issue and what has been tried."
                },
                "order_id": {
                    "type": "string",
                    "description": "Related order ID if applicable."
                }
            },
            "required": ["category", "priority", "summary"]
        }
    }
]

# ---------------------------------------------------------------------------
# 6. System Prompt
# ---------------------------------------------------------------------------

BASE_SYSTEM = """You are a friendly and efficient customer support agent for ShopWave, \
an e-commerce platform. Your job is to help customers with orders, refunds, shipping, \
account issues, and general questions.

Guidelines:
- Answer questions using the provided knowledge base context first.
- Use the lookup_order tool when the customer asks about a specific order status.
- Use the create_ticket tool when: the customer asks to escalate, the issue cannot be \
resolved from available information, or the customer expresses significant frustration.
- Be concise but complete. Do not make up policies or order details.
- If you create a ticket, tell the customer the ticket ID and expected response time.
- Always confirm before creating a ticket if the customer has not explicitly requested one.
"""

# ---------------------------------------------------------------------------
# 7. The Agent Loop
# ---------------------------------------------------------------------------

# The SDK reads ANTHROPIC_API_KEY from the environment. Never hardcode a key.
client = anthropic.Anthropic()

conversation_history: list[dict] = []

MODEL = "claude-sonnet-4-6"
MAX_TOOL_ITERATIONS = 8  # guard against runaway tool loops


def call_claude(system, messages, max_retries: int = 4):
    """
    Call the Messages API with simple exponential backoff on transient errors.
    Re-raises after the final attempt so the caller can decide what to do.
    """
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model=MODEL,
                max_tokens=1024,
                system=system,
                tools=TOOLS,
                messages=messages,
            )
        except (
            anthropic.RateLimitError,
            anthropic.InternalServerError,
            anthropic.APIConnectionError,
        ) as exc:
            # 429, 5xx, and connection failures are worth retrying; back off and try again.
            # (Catching APIStatusError here would also retry 400/401 errors.)
            if attempt == max_retries - 1:
                raise
            print(f"  [transient error: {exc}; retrying in {delay:.1f}s]")
            time.sleep(delay)
            delay *= 2
        except anthropic.APIError as exc:
            # Non-retryable API error (bad request, auth, etc.): fail fast.
            print(f"  [API error: {exc}]")
            raise


def build_system_with_cache(retrieved_chunks: list[dict]) -> list[dict]:
    """
    Build the system prompt as a list of content blocks so the KB portion
    can be cached with prompt caching.
    """
    kb_text = "\n\n---\n\n".join(
        f"[{doc['title']}]\n{doc['text']}" for doc in retrieved_chunks
    )
    return [
        {
            "type": "text",
            "text": BASE_SYSTEM
        },
        {
            "type": "text",
            "text": f"\n\nRelevant knowledge base context for this query:\n\n{kb_text}",
            "cache_control": {"type": "ephemeral"}
        }
    ]


def run_agent_turn(user_message: str) -> str:
    """
    Process one user turn through the full agent loop.
    Handles RAG retrieval, tool calls (potentially multiple), and returns
    the final text response. Mutates conversation_history in place.
    """
    # 1. Retrieve relevant docs
    retrieved = retrieve(user_message, top_k=2)

    # 2. Append user message to history
    conversation_history.append({
        "role": "user",
        "content": user_message
    })

    # 3. Build system with cached KB
    system = build_system_with_cache(retrieved)

    # 4. Inner loop: handles 0 or more tool calls per turn
    iterations = 0
    while True:
        iterations += 1
        if iterations > MAX_TOOL_ITERATIONS:
            # Safety net: stop a runaway tool loop before it creates work.
            print("  [max tool iterations reached; stopping]")
            return ("I'm having trouble completing that automatically. "
                    "Let me hand this to a human teammate who can finish it.")

        response = call_claude(system, conversation_history)

        # Debug: show cache stats if available
        usage = response.usage
        if hasattr(usage, "cache_creation_input_tokens") and usage.cache_creation_input_tokens:
            print(f"  [cache WRITE: {usage.cache_creation_input_tokens} tokens]")
        if hasattr(usage, "cache_read_input_tokens") and usage.cache_read_input_tokens:
            print(f"  [cache HIT: {usage.cache_read_input_tokens} tokens]")

        if response.stop_reason == "end_turn":
            # Extract text from content blocks
            text_parts = [
                block.text for block in response.content if hasattr(block, "text")
            ]
            assistant_text = "\n".join(text_parts)

            # Append assistant response to history
            conversation_history.append({
                "role": "assistant",
                "content": response.content
            })
            return assistant_text

        elif response.stop_reason == "tool_use":
            # Append the assistant message (contains tool_use blocks) to history
            conversation_history.append({
                "role": "assistant",
                "content": response.content
            })

            # Execute all tool calls and collect results
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    tool_name = block.name
                    tool_input = block.input
                    tool_use_id = block.id

                    print(f"  [Tool call: {tool_name}({json.dumps(tool_input)})]")

                    if tool_name in TOOL_MAP:
                        result = TOOL_MAP[tool_name](**tool_input)
                    else:
                        result = {"error": f"Unknown tool: {tool_name}"}

                    print(f"  [Tool result: {json.dumps(result, default=str)[:120]}...]")

                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": tool_use_id,
                        "content": json.dumps(result, default=str)
                    })

            # Append tool results as a user message and loop back
            conversation_history.append({
                "role": "user",
                "content": tool_results
            })
            # Continue inner loop: Claude will now process the tool results

        else:
            # Unexpected stop reason
            break

    return "[Agent stopped unexpectedly]"


# ---------------------------------------------------------------------------
# 8. Main REPL
# ---------------------------------------------------------------------------

def main():
    print("\n=== ShopWave Customer Support Agent ===")
    print("Type your question (or 'quit' to exit)\n")

    while True:
        try:
            user_input = input("You: ").strip()
        except (EOFError, KeyboardInterrupt):
            print("\nGoodbye!")
            break

        if not user_input:
            continue
        if user_input.lower() in ("quit", "exit", "q"):
            print("Thank you for contacting ShopWave support. Goodbye!")
            break

        reply = run_agent_turn(user_input)
        print(f"\nAgent: {reply}\n")


if __name__ == "__main__":
    main()

Sample Run

This is a realistic multi-turn session showing RAG answers, a tool call, and ticket creation:

=== ShopWave Customer Support Agent ===
Type your question (or 'quit' to exit)

You: What is your refund policy?

Agent: Our refund policy allows you to request a full refund within 30 days of
purchase for any unused item in its original packaging. Digital products are
refundable within 7 days if the download has not been activated.

Refunds are processed within 5 to 7 business days back to your original payment
method. Shipping costs are generally non-refundable unless the return is due to
our error.

To start a refund, just provide your order ID and I can help you from there.

You: Can you check order ORD-1002 for me?

  [Tool call: lookup_order({"order_id": "ORD-1002"})]
  [cache WRITE: 412 tokens]
  [Tool result: {"order_id": "ORD-1002", "status": "on_hold", "carrier": null, ...]

Agent: Your order ORD-1002 is currently on hold. The reason is:

"Address verification failed. Please confirm your shipping address."

This means the order has not shipped yet. To resolve this, you will need to
confirm your shipping address. You can update it in your account under Order
History, or I can escalate this to our team if you would like immediate help.

Would you like me to create a support ticket for this?

You: Yes please escalate it, this is urgent

  [Tool call: create_ticket({"category": "shipping", "priority": "urgent",
    "summary": "Order ORD-1002 is on hold due to address verification failure.
    Customer has requested urgent escalation to confirm shipping address.",
    "order_id": "ORD-1002"})]
  [cache HIT: 412 tokens]
  [Tool result: {"ticket_id": "TKT-A3F91C", "status": "open", ...]

Agent: I have created an urgent support ticket for you:

Ticket ID: TKT-A3F91C
Priority: Urgent
Category: Shipping
Issue: Order ORD-1002 on hold due to address verification

A support agent will reach out within 2 hours to help confirm your address and
get your order shipped. Is there anything else I can help you with?

You: quit
Thank you for contacting ShopWave support. Goodbye!

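Notice the cache hit on the second tool call turn. The cached prefix was written on an earlier API call and read back at the discounted cache-read rate rather than the full input price. Cache reads are cheap, not free. Within a turn, every pass through the tool loop reuses the same system blocks, so those calls hit the cache; across turns, a hit requires the retrieved chunks (and therefore the cached text) to be identical.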

Common Pitfalls

Pitfall 1: Forgetting to Append Tool Results as a User Message

The Anthropic API requires tool results to be sent back as a user-role message with tool_result content blocks. If you send them as an assistant message or skip them entirely, you get a 400 error or a confused model. Check the inner loop in the POC: the assistant message (containing the tool-use block) is appended first, then a user message with the result is appended before the next API call.

Pitfall 2: Embedding the Question, Not the User’s Intent

If the user says “I bought something a week ago and it still has not arrived,” the literal words are about time and shipping. A pure keyword search misses “shipping policy.” Embedding the full sentence works better because sentence embeddings capture semantic meaning, not just keywords. For very short queries like “return,” consider expanding the query before embedding (e.g., “how do I return an item and get a refund”).
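Query expansion for very short inputs can be as simple as a lookup table applied before embedding. The sketch below is one possible approach, with a hypothetical expand_query helper and an invented EXPANSIONS table; a production system might generate expansions with a model instead.

```python
# Hypothetical expansions for terse queries; extend with your own top keywords.
EXPANSIONS = {
    "return": "how do I return an item and get a refund",
    "refund": "what is the refund policy and how long do refunds take",
    "tracking": "where is my order and how do I track the shipment",
}


def expand_query(query: str, min_words: int = 3) -> str:
    """Append a fuller phrasing to very short queries; leave longer ones untouched."""
    words = [w.strip("?.!,").lower() for w in query.split()]
    if len(words) >= min_words:
        return query
    extras = [EXPANSIONS[w] for w in words if w in EXPANSIONS]
    return " ".join([query, *extras]) if extras else query


print(expand_query("return"))
print(expand_query("I bought something a week ago and it still has not arrived"))
```

The expanded string is what gets embedded; the original user message is still what goes into the conversation history.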

Pitfall 3: Infinite Tool Call Loops

In rare cases a model may call the same tool repeatedly if the result does not satisfy an expectation. Add a max-iterations guard to the inner loop (e.g., if iterations > 8: break). This is especially important for write tools like create_ticket where an infinite loop means infinite tickets created.

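Pitfall 4: Stale Knowledge After Doc Updates

The prompt cache itself cannot serve an outdated policy: it is keyed on the exact content of the prompt prefix, so changed KB text simply misses the cache and is written again. The stale copies live in your own pipeline. In the POC, documents are embedded once at startup, so an edited doc is invisible until the process restarts and re-embeds it. For high-stakes policy updates (price changes, new terms), re-index as part of the deploy and add a version marker to the KB text so logs show which revision each answer was based on.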
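A content hash is one straightforward version marker. The sketch below uses a hypothetical kb_version helper: the hash changes whenever any document changes, so it can be prepended to the KB text and also compared against the version your index was built from.

```python
import hashlib
import json


def kb_version(docs: list[dict]) -> str:
    """Short, order-independent fingerprint of the knowledge base content."""
    canonical = json.dumps(sorted(docs, key=lambda d: d["id"]), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


docs_v1 = [
    {"id": "refund-policy", "text": "Refunds within 30 days."},
    {"id": "shipping-policy", "text": "Ships in 5 to 7 business days."},
]
# Same KB after a policy edit.
docs_v2 = [dict(docs_v1[0], text="Refunds within 14 days."), docs_v1[1]]

indexed_version = kb_version(docs_v1)               # stored alongside the embeddings
needs_reindex = kb_version(docs_v2) != indexed_version
kb_header = f"[KB version {kb_version(docs_v2)}]"   # prepend to the KB text block

print(needs_reindex, kb_header)
```

Because the marker is part of the prompt text, a new revision also produces a new cached prefix without any extra work.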

Pitfall 5: Injecting Too Much Context

Retrieving top-10 chunks to “be safe” inflates input tokens and costs more per request. It can also hurt answer quality: the model may cite a less-relevant chunk and ignore the best one buried in position 8. Top-2 to top-4 is usually the right range for focused support queries. Measure retrieval precision with a small eval set before tuning K upward.
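A small eval set makes the choice of K measurable. The sketch below computes a hit rate at K (the share of queries whose expected doc appears in the top K), which is a simpler stand-in for precision. Everything here is hypothetical: the tiny DOCS list, the keyword_retrieve stub that stands in for the real retriever, and the four eval queries.

```python
DOCS = [
    {"id": "refund-policy", "text": "refund within 30 days"},
    {"id": "shipping-policy", "text": "standard shipping takes 5 to 7 business days"},
    {"id": "account-faq", "text": "reset your password from the login page"},
]


def keyword_retrieve(query: str, top_k: int = 2) -> list[dict]:
    """Stub retriever: rank docs by number of shared words (ties keep list order)."""
    q_words = set(query.lower().split())
    ranked = sorted(DOCS, key=lambda d: len(q_words & set(d["text"].split())), reverse=True)
    return ranked[:top_k]


def hit_rate_at_k(retrieve, eval_set: list[tuple[str, str]], k: int) -> float:
    """Fraction of queries whose expected doc id is among the top-k results."""
    hits = sum(
        expected_id in [doc["id"] for doc in retrieve(query, top_k=k)]
        for query, expected_id in eval_set
    )
    return hits / len(eval_set)


EVAL_SET = [
    ("how long does a refund take", "refund-policy"),
    ("when does standard shipping arrive", "shipping-policy"),
    ("i forgot my login", "account-faq"),
    ("where is my parcel", "shipping-policy"),
]

for k in (1, 2):
    print(f"hit rate @ {k}: {hit_rate_at_k(keyword_retrieve, EVAL_SET, k):.2f}")
```

If raising K barely moves the hit rate, the extra chunks are mostly cost and noise.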

Pitfall 6: Not Handling Partial Tool Results

If a tool call fails (network error, DB down), the result block should still be sent back with an error message rather than raising an exception that kills the loop. Claude will gracefully acknowledge the failure and offer to try again or escalate. The POC’s try/except on the API call handles Claude API errors; wrap your tool implementations similarly.
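A sketch of that wrapping, assuming a hypothetical execute_tool_safely helper that always returns a well-formed tool_result block. The is_error field is part of the tool_result format and tells the model the call failed; flaky_lookup is an invented tool that simulates an outage.

```python
import json


def execute_tool_safely(tool_map: dict, name: str, tool_input: dict, tool_use_id: str) -> dict:
    """Run a tool and always return a tool_result block, even when the tool raises."""
    try:
        if name not in tool_map:
            raise ValueError(f"Unknown tool: {name}")
        payload, is_error = tool_map[name](**tool_input), False
    except Exception as exc:  # deliberately broad: the loop must survive any tool failure
        payload, is_error = {"error": f"{type(exc).__name__}: {exc}"}, True
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "content": json.dumps(payload, default=str),
        "is_error": is_error,
    }


def flaky_lookup(order_id: str) -> dict:
    """Stand-in tool that simulates the order database being down."""
    raise ConnectionError("order database unreachable")


failed = execute_tool_safely({"lookup_order": flaky_lookup}, "lookup_order",
                             {"order_id": "ORD-1001"}, "toolu_demo")
print(failed)
```

Log the full exception on your side and keep the message sent to the model short, so internal details do not leak into customer replies.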

Cost and Latency

Model | Input ($/1M tokens) | Output ($/1M tokens) | Cache read ($/1M tokens) | Median first-token latency | Best for
claude-haiku-4-5 | $0.80 | $4.00 | $0.08 | ~300ms | FAQ triage, short factual answers, high volume
claude-sonnet-4-6 | $3.00 | $15.00 | $0.30 | ~600ms | Full support agent (recommended default)
claude-opus-4-8 | $15.00 | $75.00 | $1.50 | ~1200ms | Complex escalations, sensitive negotiations

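At 10,000 support requests per day, each with 800 uncached input tokens and 200 output tokens, plus a 20KB knowledge base (assumed here to be about 5,000 tokens) that is read from cache on 90 percent of requests and rewritten to cache at 1.25x the input price on the rest, the table prices work out to:

  • Haiku: roughly $23/day
  • Sonnet: roughly $86/day (versus roughly $204/day without caching)
  • Opus: roughly $431/day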

For most support workloads, Sonnet is the right call. Route purely FAQ-style, single-turn questions to Haiku and reserve Opus for the top 1 to 2 percent of escalations that require nuanced handling. See Part 27 (Model Routing and Batching) for a complete routing pattern.
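A back-of-the-envelope calculator makes it easy to redo this kind of estimate with your own traffic. The sketch below uses the prices from the table above. The other numbers are assumptions to adjust: a knowledge base of about 5,000 tokens (20KB at roughly 4 characters per token), a 90 percent cache hit rate, and cache writes billed at 1.25x the input price. daily_cost is a hypothetical helper name.

```python
# (input, output, cache read) in dollars per 1M tokens, taken from the table above.
PRICES = {
    "claude-haiku-4-5": (0.80, 4.00, 0.08),
    "claude-sonnet-4-6": (3.00, 15.00, 0.30),
    "claude-opus-4-8": (15.00, 75.00, 1.50),
}


def daily_cost(model: str, requests: int = 10_000, input_tokens: int = 800,
               output_tokens: int = 200, kb_tokens: int = 5_000,
               hit_rate: float = 0.9, write_multiplier: float = 1.25) -> float:
    """Estimated dollars per day; kb_tokens, hit_rate, write_multiplier are assumptions."""
    input_price, output_price, read_price = PRICES[model]
    # Blended price of a KB token: cache reads on hits, cache writes on misses.
    kb_price = hit_rate * read_price + (1.0 - hit_rate) * input_price * write_multiplier
    per_request = (input_tokens * input_price
                   + output_tokens * output_price
                   + kb_tokens * kb_price) / 1_000_000
    return requests * per_request


for model in PRICES:
    cached = daily_cost(model)
    # hit_rate=0 with write_multiplier=1 models sending the KB uncached every time.
    uncached = daily_cost(model, hit_rate=0.0, write_multiplier=1.0)
    print(f"{model}: ${cached:,.2f}/day cached, ${uncached:,.2f}/day uncached")
```

Output tokens are a large share of the bill at these ratios, so capping response length can matter as much as caching the knowledge base.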

Scenario | Recommended model | RAG? | Tools? | Notes
FAQ bot (no order data) | Haiku | Yes | No | RAG only, no tool overhead
Order status lookup | Sonnet | Yes | lookup_order | Standard case
Full tier-1 deflection | Sonnet | Yes | Both | This POC
Angry customer escalation | Sonnet or Opus | Yes | create_ticket | Higher stakes, tone matters
B2B contract dispute | Opus | Yes (contract docs) | Both + lookup_contract | High stakes, precise citations needed

Extending the POC for Production

Replace the In-Memory Store with pgvector

The POC uses NumPy for similarity search. For production, swap in pgvector on Postgres (covered in detail in Part 10). This gives you persistence, concurrent access, filtered retrieval (e.g., retrieve only docs tagged for a specific product line), and the ability to update the KB without restarting the process.

Add a Classifier for Routing

Before calling Claude, run a cheap Haiku call to classify the query into one of: FAQ, order-lookup, complaint, escalation-request. Use the classification to pick the model and tools. A pure FAQ query does not need tools defined at all, which slightly improves response quality (Claude does not consider calling tools it cannot use) and reduces output token variability.

[Figure: Model routing by query type. User message -> Classifier (Haiku 4.5) -> FAQ path (Haiku + RAG only), Order path (Sonnet + lookup_order), or Escalation path (Sonnet + both tools) -> Response to user.]
Figure 2. A classifier layer (Haiku) routes each query to the cheapest appropriate path. FAQ queries skip tool definitions entirely. Only order and escalation paths get the full tool set.
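A minimal sketch of that routing layer follows. It assumes `client` is an `anthropic.Anthropic()` instance. The `ROUTES` table, the classifier prompt, and the choice to send complaints and unrecognized labels down the escalation path are illustrative assumptions, not part of the POC.

```python
# Hypothetical routing table: label -> model and tool names to enable.
ROUTES = {
    "faq": {"model": "claude-haiku-4-5", "tools": []},
    "order-lookup": {"model": "claude-sonnet-4-6", "tools": ["lookup_order"]},
    "complaint": {"model": "claude-sonnet-4-6",
                  "tools": ["lookup_order", "create_ticket"]},
    "escalation-request": {"model": "claude-sonnet-4-6",
                           "tools": ["lookup_order", "create_ticket"]},
}
DEFAULT_LABEL = "escalation-request"  # assumption: fail toward the safest path

CLASSIFIER_SYSTEM = (
    "Classify the customer message into exactly one label: "
    "faq, order-lookup, complaint, escalation-request. "
    "Reply with the label only."
)


def classify(client, user_message):
    """Ask Haiku for a single label; fall back if the reply is unexpected."""
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=10,
        system=CLASSIFIER_SYSTEM,
        messages=[{"role": "user", "content": user_message}],
    )
    label = response.content[0].text.strip().lower()
    return label if label in ROUTES else DEFAULT_LABEL


def route(client, user_message):
    """Return the model and tool names to use for this message."""
    return ROUTES[classify(client, user_message)]
```

The caller would then pass only the tool definitions named in `route(...)["tools"]` to the main request, so the FAQ path sends no tool definitions at all.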

Add Observability

Log every tool call (inputs and outputs), every cache hit/miss, and every response with its input and output token count. At scale you want to know which queries are triggering the most tool calls (possible ambiguity in the KB), which RAG chunks are being retrieved most (good candidates for explicit FAQ entries), and which queries produce escalations (gaps in self-service coverage). See Part 28 (Observability for LLM Apps) for a full tracing approach.
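One way to capture these fields is a structured log record per API response, sketched below. It assumes `response` is the message object returned by the Anthropic SDK (with `model`, `stop_reason`, and a `usage` object) and that `tool_calls` is a list you build yourself while executing tools. The function names and record layout are hypothetical.

```python
import json
import logging
import time

logger = logging.getLogger("support_agent")


def usage_record(response, tool_calls, started_at):
    """Flatten one API response into a JSON-friendly log record.

    `started_at` is a time.monotonic() reading taken before the API call.
    The cache fields may be absent or None when caching is not in use,
    so they are read defensively.
    """
    usage = response.usage
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    return {
        "model": response.model,
        "stop_reason": response.stop_reason,
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
        "cache_read_tokens": cache_read,
        "cache_write_tokens": cache_write,
        "cache_hit": cache_read > 0,
        "tool_calls": tool_calls,  # e.g. [{"name": ..., "input": ..., "output": ...}]
        "latency_ms": round((time.monotonic() - started_at) * 1000),
    }


def log_turn(response, tool_calls, started_at):
    """Emit the record as one JSON line and return it."""
    record = usage_record(response, tool_calls, started_at)
    logger.info(json.dumps(record, default=str))
    return record
```

One JSON line per response keeps the logs easy to aggregate later by tool name, cache hit rate, or stop reason.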

Guardrails

Before shipping, add: input validation (length cap, PII scrubbing), output filtering (do not leak other customers’ order data), rate limiting per session, and an escalation safety net (always offer a human path). For a full guardrails treatment, see Part 25 (Guardrails and Prompt Injection Defense).
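The input-validation piece can start as small as the sketch below, which covers only the length cap and a crude PII scrub. The 2,000-character cap and both regular expressions are illustrative assumptions. Pattern-based scrubbing misses plenty of PII and can also mask long numeric IDs, so treat it as a first line of defense only.

```python
import re

MAX_CHARS = 2000  # assumed cap; tune to your own traffic

# Heuristic patterns, not exhaustive PII detection.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")  # 13 to 16 digits, optional separators


def sanitize_input(text):
    """Trim and cap the message, then mask email addresses and card-like numbers."""
    text = text.strip()[:MAX_CHARS]
    text = EMAIL_RE.sub("[email]", text)
    text = CARD_RE.sub("[card]", text)
    return text
```

Run this on the raw user message before it reaches retrieval or the model, and log the sanitized version rather than the original.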

Frequently Asked Questions

What is the difference between an AI customer support agent and a simple chatbot?

A simple chatbot matches keywords to pre-written responses. An AI customer support agent uses a large language model to understand natural language, retrieves relevant information from a knowledge base, and can call real tools to look up live data or take actions. The result is much higher coverage: the agent handles variations in phrasing, multi-part questions, and novel issues that a keyword bot would answer with “I did not understand your question.”

Can Claude handle multiple tool calls in a single turn?

Yes. If Claude needs to both look up an order and create a ticket in the same response, it may return both tool_use blocks in one API response. The inner loop in the POC handles this: it iterates over all blocks in the response, executes each tool, collects all results into a single user message with multiple tool_result entries, and sends that back in one call.
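The collection step can be sketched as follows. This is an illustration of the pattern rather than the POC's exact code: `handlers` is a hypothetical dict mapping tool names to Python functions, and `response` is assumed to be an Anthropic SDK message whose content blocks expose `type`, `id`, `name`, and `input`.

```python
import json


def run_tool_calls(response, handlers):
    """Execute every tool_use block and return ONE user message with all results."""
    tool_results = []
    for block in response.content:
        if block.type != "tool_use":
            continue  # skip text blocks that accompany the tool calls
        output = handlers[block.name](**block.input)
        tool_results.append({
            "type": "tool_result",
            "tool_use_id": block.id,  # ties the result to its request
            "content": json.dumps(output),
        })
    return {"role": "user", "content": tool_results}
```

The assistant message containing the tool_use blocks goes into the history first, followed by this single user message, before the next API call.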

How many knowledge base documents can I use?

There is no hard limit on the number of documents in your vector store. You retrieve only the top-K most relevant chunks per query, so the store size does not directly affect per-request cost. What matters is retrieval quality: a store of 10,000 high-quality focused chunks will outperform a store of 500 low-quality large documents. For most SaaS support teams, 100 to 500 well-written document chunks cover the vast majority of queries.

What happens if a tool call fails?

The POC catches tool errors and returns them as structured dicts with an "error" key. Claude reads the error message and responds appropriately: telling the user the system is temporarily unavailable, offering to try again, or offering to create a ticket for follow-up. Do not let tool failures throw exceptions that abort the conversation. The graceful error path is part of the user experience.
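A wrapper along these lines keeps failures inside the conversation. It is a sketch under assumptions: `safe_execute` and `to_tool_result` are hypothetical names, and setting `is_error` on the tool_result block is an optional extra that the POC may not use. In production you would likely replace the raw exception text with a generic message so internal details do not reach the model.

```python
import json


def safe_execute(handler, tool_input):
    """Run a tool handler; convert any exception into an {"error": ...} dict."""
    try:
        return handler(**tool_input)
    except Exception as exc:  # deliberately broad: no tool failure should abort the chat
        return {"error": f"{type(exc).__name__}: {exc}"}


def to_tool_result(tool_use_id, output):
    """Build a tool_result block, flagging it when the output carries an error."""
    result = {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "content": json.dumps(output),
    }
    if isinstance(output, dict) and "error" in output:
        result["is_error"] = True
    return result
```

Because the error comes back as ordinary tool output, the model can apologize, retry, or offer a ticket instead of the session ending on a stack trace.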

Should the knowledge base context go in the system prompt or the user message?

Either can work, but the system prompt caches better. Prompt caching matches on the prompt prefix, and the system prompt stays constant across turns for the same KB content. If you inject that content into the user message, the prefix changes every turn because the user message changes, so the cache is missed. For best caching efficiency, put the static KB portion in the system prompt as a cached block, and keep the dynamic per-turn content in the user message.
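In request terms, that split looks roughly like the sketch below. The helper name `build_request`, the instruction text, and the `max_tokens` value are assumptions. The `cache_control` marker on the system block is how the Messages API identifies the end of a cacheable prefix.

```python
def build_request(kb_text, history, user_message, model="claude-sonnet-4-6"):
    """Build keyword arguments for client.messages.create().

    The static knowledge base sits in a system block marked for caching;
    the per-turn user message stays outside the cached prefix.
    """
    return {
        "model": model,
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": "You are a support agent. Answer from the knowledge base below.",
            },
            {
                "type": "text",
                "text": kb_text,
                # Everything up to and including this block becomes the cached prefix.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        "messages": history + [{"role": "user", "content": user_message}],
    }
```

A call would then be `client.messages.create(**build_request(kb_text, history, "Where is my order?"))`. Very short prompts may fall below the minimum cacheable length, in which case the marker has no effect.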

How do I keep the knowledge base up to date?

Re-embed and re-insert changed documents whenever they are updated. If you are using pgvector, use an INSERT ... ON CONFLICT DO UPDATE pattern keyed by document ID. The ephemeral prompt cache needs no manual invalidation: caching matches on the exact prompt content, so an updated KB block misses the cache on the next request and is written as a new entry, while the stale entry expires on its own (the default lifetime is 5 minutes). The re-embedding step is cheap: a 500-word document costs a fraction of a cent to embed with most models.
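The upsert can be sketched as a small query builder. The table `kb_chunks` and its columns are hypothetical, and `ON CONFLICT (doc_id)` only works if `doc_id` has a primary-key or unique constraint.

```python
def build_upsert(doc_id, content, embedding):
    """Return (sql, params) that inserts a chunk or overwrites the existing row.

    Assumes a hypothetical table kb_chunks(doc_id PRIMARY KEY, content,
    embedding vector). The embedding is sent as a pgvector text literal
    and cast with ::vector.
    """
    vector_literal = "[" + ",".join(str(float(x)) for x in embedding) + "]"
    sql = (
        "INSERT INTO kb_chunks (doc_id, content, embedding) "
        "VALUES (%s, %s, %s::vector) "
        "ON CONFLICT (doc_id) DO UPDATE "
        "SET content = EXCLUDED.content, embedding = EXCLUDED.embedding"
    )
    return sql, (doc_id, content, vector_literal)
```

With a driver such as psycopg this runs as `cursor.execute(*build_upsert(doc_id, text, vector))`, followed by a commit.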

Is this approach suitable for B2B or enterprise support with complex contracts?

Yes, but with some additions. Enterprise support often involves contract-specific SLAs, account tiers, and sensitive pricing. Add a lookup_account tool that returns the customer’s tier and SLA. Store contract documents in the vector DB with customer-ID metadata and filter retrievals to only that customer’s docs. Use claude-opus-4-8 for high-stakes interactions where precise citation is critical. The same loop applies; you are just adding more domain-specific tools and knowledge.
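As a sketch of those two additions, the block below defines a lookup_account tool in the same JSON-schema shape the Messages API expects for tools, plus a metadata filter over retrieved chunks. The field names (`customer_id`, the chunk dict layout) and the description text are assumptions for illustration.

```python
# Tool definition in the Messages API format (name, description, input_schema).
LOOKUP_ACCOUNT_TOOL = {
    "name": "lookup_account",
    "description": "Return the customer's account tier and SLA terms.",
    "input_schema": {
        "type": "object",
        "properties": {
            "customer_id": {
                "type": "string",
                "description": "Internal ID of the customer account.",
            },
        },
        "required": ["customer_id"],
    },
}


def filter_chunks_for_customer(chunks, customer_id):
    """Keep only chunks whose metadata belongs to this customer.

    `chunks` is assumed to be a list of dicts with a "customer_id" key.
    In pgvector the same filter would be a WHERE clause applied before ranking.
    """
    return [chunk for chunk in chunks if chunk.get("customer_id") == customer_id]
```

Filtering before the chunks reach the prompt matters more here than in consumer support, since one customer's contract terms must never appear in another customer's answer.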

Back to the full series: AI in Production: 30 Real-World Use Cases with Claude
