TL;DR
- LLM observability means capturing model, tokens, latency, cost, and request IDs for every Claude call in structured, queryable logs.
- A Python decorator is the least-invasive way to add this: wrap your existing calls and get full telemetry without touching business logic.
- Cost per call is calculable from the usage object on every response: input tokens, output tokens, and the per-model pricing table.
- Structured JSON logs let you pipe data to any backend (CloudWatch, Datadog, Grafana Loki, a plain file) without changing the instrumentation.
- An aggregation report built from the log file gives p50/p95 latency, total spend, error rate, and per-model breakdowns in seconds.
- The POC below is complete and runnable: one file, zero external telemetry dependencies, ready to extend with OpenTelemetry spans.
Why LLM Observability Is Not Optional in Production
You can build a Claude-powered feature in an afternoon. Getting it to behave predictably over weeks of real traffic is a different problem entirely. Token counts drift when prompts change. Latency spikes when a model tier is saturated. A single runaway tool-use loop can burn through your monthly budget in an hour. Without instrumentation, you find out about these problems from a support ticket or a billing alert, not from a dashboard.
LLM observability covers the specific signals that matter for language model calls: which model was used, how many tokens were consumed (input vs. output, cached vs. fresh), how long the call took, what the call cost, and whether it succeeded. Traditional APM tools give you HTTP status codes and latency percentiles. That is a start, but it misses the economics. A 200 OK response that cost $0.45 and took 8 seconds is worth knowing about even when nothing technically failed.
This article walks through a complete observability setup for Claude applications: a Python decorator that captures every call, a cost calculation layer, structured JSON output that routes to any backend, and a small aggregation script that gives you a usable report from the raw logs. The POC is self-contained, but every piece maps directly to how you would extend it in a real system with OpenTelemetry, Datadog, or a dedicated LLM observability platform.
What You Actually Need to Track
Before writing any code, it helps to be specific about what signals matter:
- Request identity: a unique trace ID per call so you can correlate a log entry back to the user session, job, or API request that triggered it.
- Model and version: which model id was requested and, if the API returns one, the actual model id used. This matters when you start routing between tiers.
- Token usage: input tokens, output tokens, cache creation tokens, and cache read tokens. These are the four numbers on every msg.usage object.
- Latency: time-to-first-token is ideal for streaming; total wall-clock time is the minimum for non-streaming calls.
- Cost: calculated from token counts and a pricing table. Not returned by the API, but derivable.
- Error type and message: anthropic.APIError subclasses tell you whether a failure was a rate limit, a timeout, an overload, or a bad request.
- Caller context: a tag or name for which part of your application made the call (e.g., “summarizer”, “classifier”, “qa-chain”).
Architecture of the Tracing Layer
The cleanest pattern for this is a Python decorator applied to the function that calls client.messages.create. The decorator intercepts the call, starts a timer, lets the real call run, reads the response, calculates cost, and writes a structured log entry. The rest of your application calls the decorated function exactly as before.
Why a Decorator and Not Middleware
HTTP middleware works well for observability when every AI call goes through a single HTTP gateway you control. In many Python apps, especially those using the Anthropic SDK directly, calls happen at multiple call sites. A decorator applied once to a wrapper function gives you a single place to add instrumentation, regardless of how many places in the codebase call the function. It also keeps the trace context local to the call: you know exactly which function was measured, which makes it easier to add caller-specific tags.
An alternative is to subclass or monkey-patch the Anthropic client. That captures calls regardless of wrapper functions, but it makes testing harder and couples your instrumentation to the SDK internals. The decorator pattern is explicit and replaceable.
Cost Calculation from the Usage Object
The Anthropic API does not return a cost field. It returns token counts, and cost is a function of token counts multiplied by per-model prices. Since the pricing table changes, you should store it in a config dict rather than hardcoding magic numbers inline.
As of mid-2026, the pricing (in USD per million tokens) for the three main model tiers is:
| Model ID | Input ($/M tokens) | Output ($/M tokens) | Cache Write ($/M tokens) | Cache Read ($/M tokens) |
|---|---|---|---|---|
| claude-opus-4-8 | $15.00 | $75.00 | $18.75 | $1.50 |
| claude-sonnet-4-6 | $3.00 | $15.00 | $3.75 | $0.30 |
| claude-haiku-4-5 | $0.80 | $4.00 | $1.00 | $0.08 |
Cache read tokens are priced at roughly 10% of regular input tokens, which is why prompt caching (Part 4) changes the economics so dramatically for applications with large, repeated system prompts. Your observability layer should track cache hits and misses explicitly, because a deployment without caching enabled looks very different in cost than one with it.
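To make the caching economics concrete, here is a minimal worked example using the Sonnet row of the table above. The 10,000-token system prompt and the 1,000-call volume are hypothetical figures chosen for illustration, and the sketch assumes every call after the first hits the cache.

```python
# Prices in USD per million tokens, copied from the Sonnet row of the table above.
SONNET = {"input": 3.00, "cache_write": 3.75, "cache_read": 0.30}


def token_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD for `tokens` tokens at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million


SYSTEM_PROMPT_TOKENS = 10_000  # hypothetical prompt size
CALLS = 1_000                  # hypothetical call volume

# Without caching: every call pays the full input price for the system prompt.
uncached = CALLS * token_cost(SYSTEM_PROMPT_TOKENS, SONNET["input"])

# With caching: one cache write, then cache reads on every later call
# (assumes a 100% hit rate, which real traffic will not always achieve).
cached = (
    token_cost(SYSTEM_PROMPT_TOKENS, SONNET["cache_write"])
    + (CALLS - 1) * token_cost(SYSTEM_PROMPT_TOKENS, SONNET["cache_read"])
)

print(f"uncached: ${uncached:.2f}  cached: ${cached:.2f}")
```

Under these assumptions the system prompt costs about $30.00 uncached and about $3.03 cached, which is the gap your cache hit and miss counters should be able to explain.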
The Full POC: A Tracing Decorator for Claude
Installation
pip install anthropic python-dotenv
requirements.txt
anthropic>=0.27.0
python-dotenv>=1.0.0
.env
ANTHROPIC_API_KEY=sk-ant-your-key-here
LOG_FILE=claude_traces.jsonl
tracer.py (complete source)
"""
tracer.py

A decorator-based observability layer for Claude API calls.
Logs model, tokens, latency, cost, and request_id to structured
JSONL, then provides an aggregation report from the log file.

Usage:
    from tracer import trace_claude, print_report

    @trace_claude(caller="my-feature")
    def ask_claude(prompt: str, system: str = "") -> str:
        msg = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=system,
            messages=[{"role": "user", "content": prompt}],
        )
        ask_claude._last_response = msg  # side-channel read by the decorator
        return msg.content[0].text

    result = ask_claude("Summarize this text: ...")
    print_report()
"""
from __future__ import annotations
import functools
import json
import logging
import os
import time
import traceback
import uuid
from collections import defaultdict
from pathlib import Path
from typing import Any, Callable
import anthropic
from dotenv import load_dotenv
load_dotenv()
# ---------------------------------------------------------------------------
# Pricing table (USD per million tokens, mid-2026)
# Update this dict when Anthropic adjusts pricing.
# ---------------------------------------------------------------------------
PRICING: dict[str, dict[str, float]] = {
    "claude-opus-4-8": {
        "input": 15.00,
        "output": 75.00,
        "cache_write": 18.75,
        "cache_read": 1.50,
    },
    "claude-sonnet-4-6": {
        "input": 3.00,
        "output": 15.00,
        "cache_write": 3.75,
        "cache_read": 0.30,
    },
    "claude-haiku-4-5": {
        "input": 0.80,
        "output": 4.00,
        "cache_write": 1.00,
        "cache_read": 0.08,
    },
}
# Default price for unknown model ids (use Sonnet as a conservative estimate)
DEFAULT_PRICING = PRICING["claude-sonnet-4-6"]
LOG_FILE = Path(os.getenv("LOG_FILE", "claude_traces.jsonl"))
# Standard Python logger: writes to stderr, separate from JSONL trace log
_log = logging.getLogger("tracer")
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s %(message)s")
def _get_prices(model: str) -> dict[str, float]:
    """Return pricing dict for a model id, falling back to the default."""
    # Substring match, so a known id with extra qualifiers (for example a
    # trailing '-20241022' date suffix) still resolves to its base entry.
    for key in PRICING:
        if key in model:
            return PRICING[key]
    _log.warning("Unknown model %r, using default pricing", model)
    return DEFAULT_PRICING
def _calculate_cost(model: str, usage: anthropic.types.Usage | None) -> float:
    """
    Calculate the total USD cost for a single API call.

    Args:
        model: The model id string returned (or requested) for this call.
        usage: The usage object from the response (or None on error).

    Returns:
        Estimated cost in USD as a float. Returns 0.0 if usage is None.
    """
    if usage is None:
        return 0.0
    prices = _get_prices(model)
    m = 1_000_000  # tokens per unit price
    # Older SDK versions may lack the cache fields, hence getattr with defaults.
    input_tokens = getattr(usage, "input_tokens", 0) or 0
    output_tokens = getattr(usage, "output_tokens", 0) or 0
    cache_creation = getattr(usage, "cache_creation_input_tokens", 0) or 0
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    # Cached tokens replace fresh input tokens; avoid double-counting.
    # The input_tokens figure from the API already nets out cached reads.
    cost = (
        (input_tokens / m) * prices["input"]
        + (output_tokens / m) * prices["output"]
        + (cache_creation / m) * prices["cache_write"]
        + (cache_read / m) * prices["cache_read"]
    )
    return round(cost, 8)


def _write_log(record: dict) -> None:
    """Append a single JSON line to the trace log file."""
    with LOG_FILE.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
def trace_claude(caller: str = "default"):
    """
    Decorator factory for tracing Claude API calls.

    Wrap any function that calls client.messages.create and returns the
    response or a derived value. The decorator measures latency, reads
    usage from a Message object that the function stores on itself via a
    side-channel attribute (_last_response), and writes a structured JSON
    log entry.

    Because most application code extracts .content[0].text and returns
    a plain string, the decorator cannot see the Message in the return
    value. The side-channel is a plain function attribute, not
    thread-local storage, so concurrent calls to the same function can
    overwrite each other's response.

    Args:
        caller: A tag identifying which part of the app made this call.

    Returns:
        A decorator.

    Example:
        @trace_claude(caller="summariser")
        def summarise(text: str) -> str:
            msg = client.messages.create(...)
            summarise._last_response = msg  # side-channel
            return msg.content[0].text
    """
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            request_id = str(uuid.uuid4())
            start = time.perf_counter()
            record: dict[str, Any] = {
                "request_id": request_id,
                "caller": caller,
                "function": fn.__name__,
                "ts_start": time.time(),
                # Only populated when the wrapped function is called with a
                # model= keyword argument; otherwise "unknown".
                "model_requested": kwargs.get("model", "unknown"),
                "status": "ok",
                "error": None,
                "latency_s": None,
                "model_used": None,
                "input_tokens": None,
                "output_tokens": None,
                "cache_creation_tokens": None,
                "cache_read_tokens": None,
                "stop_reason": None,
                "cost_usd": None,
            }
            # Reset the side-channel slot. The decorated name is bound to
            # `wrapper`, so `my_func._last_response = msg` inside the function
            # body sets the attribute on `wrapper`, not on `fn`.
            wrapper._last_response = None  # type: ignore[attr-defined]
            try:
                result = fn(*args, **kwargs)
                elapsed = time.perf_counter() - start
                record["latency_s"] = round(elapsed, 4)
                # Try to read the response object from the side-channel
                resp = getattr(wrapper, "_last_response", None)
                if isinstance(resp, anthropic.types.Message):
                    usage = resp.usage
                    model_used = resp.model
                    record["model_used"] = model_used
                    record["input_tokens"] = getattr(usage, "input_tokens", None)
                    record["output_tokens"] = getattr(usage, "output_tokens", None)
                    record["cache_creation_tokens"] = getattr(
                        usage, "cache_creation_input_tokens", None
                    )
                    record["cache_read_tokens"] = getattr(
                        usage, "cache_read_input_tokens", None
                    )
                    record["stop_reason"] = resp.stop_reason
                    record["cost_usd"] = _calculate_cost(model_used, usage)
                else:
                    # If no side-channel, cost and tokens remain None.
                    # Log a warning so the developer knows to wire it up.
                    _log.warning(
                        "request_id=%s: no _last_response set; token/cost fields omitted",
                        request_id,
                    )
                _write_log(record)
                return result
            except anthropic.APIError as exc:
                elapsed = time.perf_counter() - start
                record["latency_s"] = round(elapsed, 4)
                record["status"] = "error"
                record["error"] = {
                    "type": type(exc).__name__,
                    "message": str(exc),
                    "status_code": getattr(exc, "status_code", None),
                }
                # Partial tokens may have been charged even on error.
                # If the exception carries a response body, try to extract usage.
                raw_resp = getattr(exc, "response", None)
                if raw_resp is not None:
                    try:
                        body = raw_resp.json()
                        usage_data = body.get("usage", {})
                        record["input_tokens"] = usage_data.get("input_tokens")
                        record["output_tokens"] = usage_data.get("output_tokens")
                    except Exception:
                        pass  # body missing or not JSON; keep token fields as None
                _write_log(record)
                raise  # re-raise so calling code can handle it
            except Exception as exc:
                elapsed = time.perf_counter() - start
                record["latency_s"] = round(elapsed, 4)
                record["status"] = "error"
                record["error"] = {
                    "type": type(exc).__name__,
                    "message": str(exc),
                    "traceback": traceback.format_exc(limit=5),
                }
                _write_log(record)
                raise

        return wrapper

    return decorator
# ---------------------------------------------------------------------------
# Aggregation report
# ---------------------------------------------------------------------------
def load_traces(log_file: Path = LOG_FILE) -> list[dict]:
    """Read all trace records from the JSONL log file."""
    if not log_file.exists():
        return []
    records = []
    with log_file.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                try:
                    records.append(json.loads(line))
                except json.JSONDecodeError:
                    pass  # skip malformed or partially written lines
    return records


def _percentile(values: list[float], p: int) -> float:
    """Compute the p-th percentile with linear interpolation (input need not be sorted)."""
    if not values:
        return 0.0
    values = sorted(values)
    idx = (p / 100) * (len(values) - 1)
    lo, hi = int(idx), min(int(idx) + 1, len(values) - 1)
    frac = idx - lo
    return round(values[lo] + frac * (values[hi] - values[lo]), 4)
def print_report(log_file: Path = LOG_FILE, top_n: int = 5) -> None:
    """
    Print a human-readable aggregation report from the trace log.

    Covers:
    - Total calls, error rate, total cost
    - Latency p50 / p95 across all calls
    - Per-model breakdown (calls, tokens, cost)
    - Top-N most expensive calls
    """
    records = load_traces(log_file)
    if not records:
        print("No trace records found in", log_file)
        return
    total = len(records)
    errors = [r for r in records if r.get("status") == "error"]
    ok_records = [r for r in records if r.get("status") == "ok"]
    latencies = [r["latency_s"] for r in records if r.get("latency_s") is not None]
    costs = [r["cost_usd"] for r in ok_records if r.get("cost_usd") is not None]
    total_cost = sum(costs)
    total_input = sum(r.get("input_tokens") or 0 for r in ok_records)
    total_output = sum(r.get("output_tokens") or 0 for r in ok_records)
    total_cache_read = sum(r.get("cache_read_tokens") or 0 for r in ok_records)

    print("=" * 60)
    print(" CLAUDE TRACE REPORT")
    print("=" * 60)
    print(f" Log file : {log_file}")
    print(f" Total calls : {total}")
    print(f" Successes : {len(ok_records)}")
    print(f" Errors : {len(errors)} ({100*len(errors)/total:.1f}%)")
    print(f" Total cost : ${total_cost:.6f}")
    print(f" Input tok : {total_input:,}")
    print(f" Output tok : {total_output:,}")
    print(f" Cache reads : {total_cache_read:,}")
    print()
    print(f" Latency p50 : {_percentile(latencies, 50):.3f}s")
    print(f" Latency p95 : {_percentile(latencies, 95):.3f}s")
    print(f" Latency max : {max(latencies, default=0):.3f}s")

    # Per-model breakdown
    by_model: dict[str, dict] = defaultdict(
        lambda: {"calls": 0, "input_tokens": 0, "output_tokens": 0, "cost": 0.0, "errors": 0}
    )
    for r in records:
        model = r.get("model_used") or r.get("model_requested") or "unknown"
        by_model[model]["calls"] += 1
        by_model[model]["input_tokens"] += r.get("input_tokens") or 0
        by_model[model]["output_tokens"] += r.get("output_tokens") or 0
        by_model[model]["cost"] += r.get("cost_usd") or 0.0
        if r.get("status") == "error":
            by_model[model]["errors"] += 1
    print()
    print(" Per-model breakdown:")
    print(f" {'Model':<30} {'Calls':>6} {'Errors':>6} {'In Tok':>10} {'Out Tok':>9} {'Cost USD':>12}")
    print(" " + "-" * 78)
    for model, stats in sorted(by_model.items(), key=lambda x: -x[1]["cost"]):
        print(
            f" {model[:30]:<30} {stats['calls']:>6} {stats['errors']:>6} "
            f"{stats['input_tokens']:>10,} {stats['output_tokens']:>9,} ${stats['cost']:>11.6f}"
        )

    # Top-N most expensive calls
    expensive = sorted(
        [r for r in ok_records if r.get("cost_usd")],
        key=lambda r: r["cost_usd"],
        reverse=True,
    )[:top_n]
    if expensive:
        print()
        print(f" Top {top_n} most expensive calls:")
        for i, r in enumerate(expensive, 1):
            print(
                f" {i}. request_id={r['request_id'][:8]}... "
                f"caller={r.get('caller','?')} "
                f"model={r.get('model_used','?')} "
                f"cost=${r['cost_usd']:.6f} "
                f"latency={r['latency_s']:.3f}s"
            )

    # Error summary
    if errors:
        print()
        print(" Recent errors:")
        for r in errors[-3:]:
            err = r.get("error") or {}
            print(
                f" - {r['request_id'][:8]}... "
                f"{err.get('type','?')}: {str(err.get('message',''))[:80]}"
            )
    print("=" * 60)
# ---------------------------------------------------------------------------
# Demo application
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

    # --- Example 1: Simple Q&A with Sonnet ---
    @trace_claude(caller="qa-feature")
    def ask_claude(prompt: str, system: str = "") -> str:
        """Ask Claude a question and return the answer text."""
        msg = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            system=system if system else "You are a concise technical assistant.",
            messages=[{"role": "user", "content": prompt}],
        )
        ask_claude._last_response = msg  # wire up side-channel
        return msg.content[0].text

    # --- Example 2: Fast classification with Haiku ---
    @trace_claude(caller="classifier")
    def classify_intent(text: str) -> str:
        """Classify the intent of a user message."""
        msg = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=64,
            system="Classify the user intent as one of: question, complaint, request, feedback. Reply with one word only.",
            messages=[{"role": "user", "content": text}],
        )
        classify_intent._last_response = msg
        return msg.content[0].text.strip().lower()

    # --- Example 3: Heavier summarisation with Opus ---
    @trace_claude(caller="summariser")
    def summarise_doc(document: str) -> str:
        """Produce a concise summary of a long document."""
        msg = client.messages.create(
            model="claude-opus-4-8",
            max_tokens=256,
            system="You are an expert technical writer. Summarise the document in 3 bullet points.",
            messages=[{"role": "user", "content": document}],
        )
        summarise_doc._last_response = msg
        return msg.content[0].text

    # Run the demo calls
    print("Running demo calls...\n")

    # Call 1
    answer = ask_claude("What is the difference between p50 and p95 latency?")
    print("Q&A answer:", answer[:120], "...\n")

    # Call 2
    intent = classify_intent("I cannot log in and I have been trying for an hour.")
    print("Classified intent:", intent, "\n")

    # Call 3
    summary = summarise_doc(
        """
        LLM observability refers to the practice of instrumenting language model
        applications to capture operational signals: token usage, latency, cost,
        error rates, and model version. Unlike traditional application monitoring,
        LLM observability must account for variable-length inputs and outputs,
        non-deterministic responses, and per-token pricing models. Tools in this
        space include OpenLLMetry, LangSmith, Helicone, and vendor-native dashboards.
        The core challenge is correlating a language model call back to the user
        session, job queue item, or API request that triggered it, especially in
        async pipelines where the call may happen in a background worker.
        """
    )
    print("Summary:", summary[:200], "...\n")

    # Simulate a fourth call that triggers an error (bad model name)
    @trace_claude(caller="bad-call-demo")
    def bad_call() -> str:
        msg = client.messages.create(
            model="claude-nonexistent-model",
            max_tokens=16,
            messages=[{"role": "user", "content": "test"}],
        )
        bad_call._last_response = msg
        return msg.content[0].text

    try:
        bad_call()
    except anthropic.APIError as e:
        print(f"Expected error caught: {type(e).__name__}: {e}\n")

    # Print the aggregation report
    print_report()
Sample Run Output
Running demo calls...

Q&A answer: p50 latency (the median) means half of your requests complete faster
than that value. p95 means 95% of requests complete faster. p95 catches tail latency
spikes that the median hides ...

Classified intent: complaint

Summary:
- LLM observability captures operational signals (tokens, latency, cost, errors,
model version) to give teams visibility into how language model applications behave
in production.
- The key challenge is correlating a model call back to the originating request in
async pipelines where the call may happen in a background worker.
- Established tools include OpenLLMetry, LangSmith, Helicone, and vendor-native
dashboards ...

Expected error caught: BadRequestError: Error code: 400 - ...

============================================================
 CLAUDE TRACE REPORT
============================================================
 Log file : claude_traces.jsonl
 Total calls : 4
 Successes : 3
 Errors : 1 (25.0%)
 Total cost : $0.012053
 Input tok : 412
 Output tok : 187
 Cache reads : 0

 Latency p50 : 1.843s
 Latency p95 : 4.121s
 Latency max : 4.351s

 Per-model breakdown:
 Model                           Calls Errors     In Tok   Out Tok     Cost USD
 ------------------------------------------------------------------------------
 claude-opus-4-8                     1      0        203        98 $   0.010395
 claude-sonnet-4-6                   1      0        156        72 $   0.001548
 claude-haiku-4-5                    1      0         53        17 $   0.000110
 unknown                             1      1          0         0 $   0.000000

 Top 5 most expensive calls:
 1. request_id=a3f7c2d1... caller=summariser model=claude-opus-4-8 cost=$0.010395 latency=4.351s
 2. request_id=b8e1a9f4... caller=qa-feature model=claude-sonnet-4-6 cost=$0.001548 latency=1.843s
 3. request_id=d2c4b3e5... caller=classifier model=claude-haiku-4-5 cost=$0.000110 latency=0.612s

 Recent errors:
 - e7f9d1c2... BadRequestError: Error code: 400 - ...
============================================================
The JSONL Log Format
Each line in claude_traces.jsonl is a self-contained JSON object. Here is a representative record for a successful call:
{
  "request_id": "b8e1a9f4-3c2d-4a1b-9e7f-d5c6a8b0e2f1",
  "caller": "qa-feature",
  "function": "ask_claude",
  "ts_start": 1717430412.384,
  "model_requested": "unknown",
  "status": "ok",
  "error": null,
  "latency_s": 1.843,
  "model_used": "claude-sonnet-4-6",
  "input_tokens": 156,
  "output_tokens": 72,
  "cache_creation_tokens": 0,
  "cache_read_tokens": 0,
  "stop_reason": "end_turn",
  "cost_usd": 0.001548
}
This format ingests directly into Datadog Log Management, CloudWatch Insights, Grafana Loki, or any system that accepts newline-delimited JSON. No schema changes are needed to add a field: just add a key to the record dict and every downstream query tool picks it up automatically.
Extending to OpenTelemetry
The JSONL approach is enough for a single service. When your architecture has multiple services all calling Claude, you need distributed tracing so you can follow a user request across the whole call chain. OpenTelemetry is the standard here: it gives you trace IDs that propagate via HTTP headers, span hierarchies, and exporters for every major observability backend.
To add OTel to the decorator, you add two lines: start a span before the call and set attributes on it after. The opentelemetry-api and opentelemetry-sdk packages handle propagation. The OpenLLMetry project (github.com/traceloop/openllmetry) provides pre-built instrumentations for the Anthropic SDK that follow the semantic conventions for AI spans, so you can get distributed LLM tracing without writing custom instrumentation from scratch.
For teams already on autonomous agent loops (Part 22) or MCP tool chains (Part 23), distributed tracing becomes essential: a single user request can trigger a chain of five or ten Claude calls, and you need to know which one in the chain is slow or expensive.
What to Alert On
Having the data is step one. Knowing what to alert on is step two. Not every anomaly needs a PagerDuty notification, but these signals consistently indicate real problems:
- Error rate above 2% over a 5-minute window: Could be a rate limit, a downstream model outage, or a prompt that started generating invalid inputs for a tool. Investigate before it compounds.
- p95 latency above 10 seconds: Users feel this. The Anthropic API has built-in timeouts, but a slow p95 usually means you are hitting the API under load that warrants batching or routing (Part 27) to a faster model tier.
- Cost per hour spiking beyond 3x the rolling average: A runaway loop or a prompt injection that is forcing unexpectedly long outputs will show here before it appears on a billing alert.
- Stop reason "max_tokens" appearing frequently: If many responses are truncated, your max_tokens value is too low for the task. This produces degraded output quality silently.
- Cache read rate dropping to zero for a feature that should be caching: System prompts may have changed, or caching may not be configured correctly. Part 4 of this series covers the exact setup.
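Three of the signals above (error rate, p95 latency, and truncation frequency) can be checked directly against a window of trace records. The sketch below is one way to do that, assuming records shaped like the POC's JSONL entries. The function name check_alerts and the 10% truncation threshold are hypothetical, since the list only says "frequently", and the p95 here uses the nearest-rank method rather than interpolation.

```python
import math
from typing import Any


def check_alerts(
    records: list[dict[str, Any]],
    max_error_rate: float = 0.02,        # 2% threshold from the list above
    max_p95_latency_s: float = 10.0,     # 10 s threshold from the list above
    max_truncation_rate: float = 0.10,   # hypothetical cut-off for "frequently"
) -> list[str]:
    """Return alert messages for one window of trace records (e.g. the last 5 minutes)."""
    if not records:
        return []
    alerts: list[str] = []
    total = len(records)

    error_rate = sum(r.get("status") == "error" for r in records) / total
    if error_rate > max_error_rate:
        alerts.append(f"error rate {error_rate:.1%} exceeds {max_error_rate:.1%}")

    latencies = sorted(r["latency_s"] for r in records if r.get("latency_s") is not None)
    if latencies:
        # Nearest-rank p95: smallest value with at least 95% of samples at or below it.
        p95 = latencies[math.ceil(0.95 * len(latencies)) - 1]
        if p95 > max_p95_latency_s:
            alerts.append(f"p95 latency {p95:.2f}s exceeds {max_p95_latency_s:.2f}s")

    truncation_rate = sum(r.get("stop_reason") == "max_tokens" for r in records) / total
    if truncation_rate > max_truncation_rate:
        alerts.append(f"{truncation_rate:.1%} of responses hit max_tokens")

    return alerts
```

Selecting the window (for example, filtering on ts_start) and routing the returned messages to a pager or chat channel are left to the caller.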
LLM Observability: What the Existing Tools Offer
The approach above is a deliberately minimal, zero-dependency foundation. Several commercial and open-source tools build on this foundation and are worth knowing:
| Tool | Type | Best For | Claude Support |
|---|---|---|---|
| LangSmith | Hosted SaaS | LangChain pipelines, evals, prompt versioning | Yes (via LangChain or direct) |
| Helicone | Hosted proxy | Drop-in token/cost logging via base_url swap | Yes (Anthropic proxy mode) |
| OpenLLMetry | Open source OTel | Self-hosted, OTel-native, any backend | Yes (auto-instrumentation) |
| Weights & Biases Weave | Hosted SaaS | Experiment tracking + trace correlation | Yes |
| Custom JSONL (this POC) | Self-hosted | Full control, zero vendor lock-in | Native (you write it) |
The proxy-based tools (like Helicone) are the simplest to adopt: you change base_url in the Anthropic client constructor and every call is logged without touching application code. The tradeoff is that all your prompts and responses transit a third-party server. For sensitive data, the decorator pattern keeps everything inside your own infrastructure.
Common Pitfalls
Forgetting to Wire the Side-Channel
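The decorator captures usage from the _last_response attribute set on the decorated function. If your wrapped function does not set it, token and cost fields are logged as null, and those calls contribute nothing to the cost totals in the aggregation report. The decorator logs a message whenever the side-channel is unset, so check the log output on the first run of every new wrapper function. Also note that a function attribute is shared across threads: concurrent calls to the same wrapper can overwrite each other's response, so use threading.local or a contextvars.ContextVar instead in multi-threaded code.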
Logging Prompts and Responses at INFO Level in Production
The POC logs metadata only, which is intentional. Logging full prompt text and response text is useful during development, but it creates two production risks: you may log PII from user inputs, and your log volume can spike by 10x to 100x when responses are long. If you need prompt logging, add it behind a flag and send it to a separate, more tightly access-controlled log stream.
Using Wall-Clock Time as Latency for Async Code
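The time.perf_counter() approach works correctly for synchronous calls. For async code using await client.messages.create(), the wrapper itself must be an async def that awaits the wrapped coroutine. A synchronous wrapper around an async function gets the un-awaited coroutine back immediately and records a latency near zero. With an async wrapper, the timer measures wall time correctly because the await suspends the coroutine and resumes it only when the response arrives. What it does not measure separately is queue time when calls wait behind a semaphore or rate limiter. Add a separate field for queue entry time if that is relevant in your architecture.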
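A minimal sketch of an async variant follows, assuming calls are throttled by an asyncio.Semaphore. The names trace_async, TRACES, and queue_s are hypothetical and not part of the POC, and records are appended to an in-memory list here rather than written to the JSONL file.

```python
import asyncio
import functools
import time
from typing import Any, Awaitable, Callable, Optional

TRACES: list[dict[str, Any]] = []  # stand-in for the POC's log writer in this sketch


def trace_async(caller: str = "default", limiter: Optional[asyncio.Semaphore] = None):
    """Async tracing decorator that records queue time and call time separately."""

    def decorator(fn: Callable[..., Awaitable[Any]]) -> Callable[..., Awaitable[Any]]:
        @functools.wraps(fn)
        async def wrapper(*args: Any, **kwargs: Any) -> Any:
            enqueued = time.perf_counter()
            if limiter is not None:
                await limiter.acquire()  # time spent waiting here is queue time
            started = time.perf_counter()
            status = "ok"
            try:
                # Awaiting inside the wrapper is what makes the latency figure valid.
                return await fn(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                finished = time.perf_counter()
                if limiter is not None:
                    limiter.release()
                TRACES.append({
                    "caller": caller,
                    "function": fn.__name__,
                    "status": status,
                    "queue_s": round(started - enqueued, 4),
                    "latency_s": round(finished - started, 4),
                })

        return wrapper

    return decorator
```

Create the semaphore inside the running event loop (for example at the top of your main coroutine) and pass it in as limiter, so that queue_s reflects time spent waiting for a slot and latency_s reflects only the awaited call.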
Stale Pricing Table
Anthropic adjusts pricing as models mature. If the pricing dict in your code is months old, your cost estimates will drift from reality. Pin the dict to a config file or environment variable, or fetch it from a lightweight pricing API. At minimum, add a comment in the code with the date the table was last verified against the Anthropic pricing page.
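One way to keep the table out of the code is a small JSON file that carries its own verification date. The sketch below assumes a hypothetical file layout with a verified_on date and a models mapping; the function name load_pricing and the 90-day staleness threshold are illustrative choices, not something the POC defines.

```python
import json
import warnings
from datetime import date
from pathlib import Path


def load_pricing(path: Path, max_age_days: int = 90) -> tuple[dict[str, dict[str, float]], int]:
    """
    Load a pricing table from a JSON file shaped like:
        {"verified_on": "YYYY-MM-DD", "models": {"<model-id>": {"input": ..., ...}}}
    Returns (models, age_in_days) and warns when the table looks stale.
    """
    data = json.loads(path.read_text(encoding="utf-8"))
    verified_on = date.fromisoformat(data["verified_on"])
    age_days = (date.today() - verified_on).days
    if age_days > max_age_days:
        warnings.warn(
            f"Pricing table last verified {age_days} days ago; "
            "re-check it against the current pricing page."
        )
    return data["models"], age_days
```

The returned models dict has the same shape as the PRICING dict in tracer.py, so it could replace the hardcoded table at import time.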
Missing Request IDs When Calls Are Retried
The decorator generates a new UUID for each call attempt. If you wrap your calls in a retry loop, each retry gets its own request ID, which is what you want: you can see that a particular request failed twice before succeeding. But make sure you also log a "parent request id" or "session id" so you can group the retry attempts together in your analysis. Add it as a parameter to the decorator factory.
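The grouping idea can be sketched as a retry helper that mints one parent ID per logical request and a fresh request ID per attempt. The helper name call_with_retries, the field name parent_request_id, and the backoff values are assumptions for illustration. The sketch retries on any exception for brevity, whereas real code would retry only on errors that are actually retryable, such as rate limits or timeouts.

```python
import time
import uuid
from typing import Any, Callable


def call_with_retries(
    fn: Callable[[], Any],
    *,
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    log: Callable[[dict], None] = print,
) -> Any:
    """Call fn(), retrying on failure; every attempt is logged under one parent id."""
    parent_id = str(uuid.uuid4())          # shared by all attempts of this request
    for attempt in range(1, max_attempts + 1):
        record = {
            "parent_request_id": parent_id,
            "request_id": str(uuid.uuid4()),  # unique per attempt
            "attempt": attempt,
        }
        try:
            result = fn()
        except Exception as exc:  # simplification: narrow this in production code
            log({**record, "status": "error", "error": type(exc).__name__})
            if attempt == max_attempts:
                raise
            time.sleep(base_delay_s * 2 ** (attempt - 1))  # exponential backoff
        else:
            log({**record, "status": "ok"})
            return result
```

Grouping the log by parent_request_id then shows how many attempts each logical request needed and what the failed attempts cost.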
Cost and Latency Note
The numbers below give a rough sense of what to expect for short, single-turn calls with prompt lengths around 200 tokens and response lengths around 100 tokens. Real production numbers will vary with prompt length, model load, and network conditions.
| Model | Typical Latency (200 in / 100 out) | Cost per 1,000 calls (200 in / 100 out) | Use When |
|---|---|---|---|
| claude-haiku-4-5 | 0.4s to 0.9s | about $0.56 | Classification, routing, short lookups |
| claude-sonnet-4-6 | 1.2s to 3.5s | about $2.10 | Most production tasks |
| claude-opus-4-8 | 3.0s to 9.0s | about $10.50 | Hard reasoning, high-stakes outputs |
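A per-1,000-call cost for any token mix can be derived from the pricing table earlier in the article. The sketch below does the arithmetic for the 200-in / 100-out case with no caching; the function name cost_per_1000_calls is hypothetical.

```python
# (input, output) prices in USD per million tokens, from the pricing table above.
PRICES = {
    "claude-haiku-4-5": (0.80, 4.00),
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-opus-4-8": (15.00, 75.00),
}


def cost_per_1000_calls(model: str, input_tokens: int = 200, output_tokens: int = 100) -> float:
    """USD cost of 1,000 identical calls with no caching."""
    input_price, output_price = PRICES[model]
    per_call = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return round(per_call * 1000, 4)


for model in PRICES:
    print(f"{model}: ${cost_per_1000_calls(model):.2f} per 1,000 calls")
```

For Sonnet this is (200 × 3.00 + 100 × 15.00) / 1,000,000 = $0.0021 per call, or $2.10 per 1,000 calls; output tokens account for most of the cost even though there are fewer of them.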
For applications that run thousands of calls per day, even small differences in model choice compound quickly. The model routing patterns in Part 27 show how to classify request complexity and route to the cheapest model that can handle it, using your observability data to verify the routing is working as expected.
The connection to eval harnesses (Part 24) is also direct: once you have structured logs, you can replay any production request through your eval suite to check whether a model change or prompt change would have produced a different result.
Frequently Asked Questions
Does the Anthropic API return a request ID I can use for support tickets?
Yes. The response object has a _request_id attribute (prefixed with an underscore because it is metadata, not part of the public response schema). You can log it alongside your own UUID. When you contact Anthropic support about a specific call, providing this ID lets them pull server-side telemetry for that exact request. Access it with msg._request_id after a successful call and store it in your log record under a field like anthropic_request_id.
What is the difference between input tokens and cache creation tokens?
When you enable prompt caching, the first time a cached block is sent, the API charges for writing it to the cache (at a slight premium over regular input pricing). Subsequent requests that hit the cache pay only the much cheaper cache read rate. The usage object splits this into three separate fields: input_tokens (fresh, non-cached input), cache_creation_input_tokens (tokens being written to cache for the first time), and cache_read_input_tokens (tokens read from an existing cache entry). You need all three to calculate the true cost of a call.
How do I trace calls that go through a LangChain or LlamaIndex pipeline?
Both frameworks have callback systems that fire before and after LLM calls. LangChain has BaseCallbackHandler with on_llm_start and on_llm_end methods. LlamaIndex has a similar event system. You can write a custom callback that calls _write_log from the POC above. Alternatively, OpenLLMetry auto-patches both frameworks at the module level so you do not need to modify pipeline code at all.
Can I capture streaming call latency with this decorator?
The decorator as written measures total wall-clock time, which covers the full stream duration for a streaming call. To capture time-to-first-token specifically, you need to instrument inside the stream context manager. The pattern is: record the start time before the with client.messages.stream(...) block, record a first-token timestamp inside the loop on the first iteration, and record the finish time after the loop. You can then log ttft_s (time to first token) and total_s as separate fields.
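The pattern can be sketched as a small generator that wraps any iterable of text chunks, so it does not depend on the SDK and can be tested with a fake stream. The name timed_stream and the timings dict are hypothetical; with the Anthropic SDK you would pass stream.text_stream from inside the with client.messages.stream(...) block and take start before entering that block.

```python
import time
from typing import Iterable, Iterator


def timed_stream(chunks: Iterable[str], timings: dict, start: float) -> Iterator[str]:
    """
    Yield chunks unchanged while recording time-to-first-token and total time.

    `start` should be a time.perf_counter() value taken before the request was
    sent. `total_s` is only set once the consumer has exhausted the stream.
    """
    timings["ttft_s"] = None
    timings["total_s"] = None
    for chunk in chunks:
        if timings["ttft_s"] is None:
            timings["ttft_s"] = round(time.perf_counter() - start, 4)
        yield chunk
    timings["total_s"] = round(time.perf_counter() - start, 4)


def fake_stream() -> Iterator[str]:
    """Stand-in for a real token stream, used only to demonstrate the timer."""
    for piece in ("Hel", "lo"):
        time.sleep(0.02)
        yield piece


timings: dict = {}
start = time.perf_counter()
text = "".join(timed_stream(fake_stream(), timings, start))
print(text, timings)
```

Both fields can then be merged into the trace record alongside the token counts from the final message.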
How should I handle sensitive data in the logs?
The POC logs metadata only (tokens, latency, cost, stop reason). Never log the actual prompt text or response text in a metadata log. If you need prompt logging for debugging, create a separate, encrypted log stream with stricter access controls and a shorter retention policy. For GDPR or HIPAA compliance, treat any log containing user input as personal data and apply the same data handling rules you apply to your primary database.
How do I get a per-user cost breakdown?
Add a user_id parameter to the decorator factory and include it in the log record. The aggregation report function can then group by user_id instead of (or in addition to) model. For multi-tenant applications, tenant_id is usually the first grouping level, with user breakdowns nested inside. The JSONL format makes this easy: use jq or a Python script to filter and group however you need.
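As a sketch of the nested grouping, assuming each log record carries tenant_id and user_id fields (neither is written by the POC as it stands) and using a hypothetical function name:

```python
from collections import defaultdict
from typing import Any


def cost_by_tenant_and_user(records: list[dict[str, Any]]) -> dict[str, dict[str, float]]:
    """Sum cost_usd per user, nested under each tenant; missing ids become 'unknown'."""
    totals: dict[str, dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for r in records:
        tenant = r.get("tenant_id") or "unknown"
        user = r.get("user_id") or "unknown"
        totals[tenant][user] += r.get("cost_usd") or 0.0
    # Convert to plain dicts so the result serialises cleanly with json.dumps.
    return {tenant: dict(users) for tenant, users in totals.items()}
```

Records logged before the fields were added fall into the "unknown" bucket instead of raising, which keeps old and new log lines queryable together.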
Is there an official Anthropic SDK method for token counting before sending a request?
Yes. The SDK exposes client.messages.count_tokens(model=..., messages=[...]), which returns a token count without sending the request to the model. This is useful for pre-flight checks (will this prompt exceed the context window?) and for building cost estimators in UI features that show users how much a query will cost before they submit it. It makes a separate, lightweight API call, so use it judiciously in high-throughput paths.
Back to the full series: AI in Production: 30 Real-World Use Cases with Claude