LLM Observability: Trace and Debug Claude in Production

Q: Does the Anthropic API return a request ID I can use for support tickets?

Yes. The response object has a _request_id attribute. Log it alongside your own UUID. When you contact Anthropic support, providing this ID lets them pull server-side telemetry for that exact request.
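A minimal sketch of pairing the two IDs in one record. The helper name correlation_ids and the field names are hypothetical, and the SimpleNamespace below only stands in for a real SDK response; the ID value is made up for illustration.

```python
import uuid
from types import SimpleNamespace
from typing import Any, Dict, Optional


def correlation_ids(response: Any) -> Dict[str, Optional[str]]:
    """Pair a locally generated trace ID with the API-side request ID."""
    return {
        "trace_id": str(uuid.uuid4()),  # your own UUID
        # None if the object carries no request ID (e.g. a test double)
        "anthropic_request_id": getattr(response, "_request_id", None),
    }


# Stand-in for a real response object; the ID value is invented.
fake_response = SimpleNamespace(_request_id="req_example123")
ids = correlation_ids(fake_response)
print(ids["anthropic_request_id"])
```

Logging both fields means a support ticket can quote the API-side ID while your own dashboards keep joining on the local one.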

Q: What is the difference between input tokens and cache creation tokens?

When prompt caching is enabled, the first time a cached block is sent the API charges for writing it to cache (cache_creation_input_tokens). Subsequent requests pay only the cheaper cache read rate (cache_read_input_tokens). The input_tokens field covers fresh, non-cached input only.

Q: How do I trace calls that go through a LangChain or LlamaIndex pipeline?

Both frameworks have callback systems. LangChain has BaseCallbackHandler with on_llm_start and on_llm_end. Alternatively, OpenLLMetry auto-patches both frameworks at the module level so you do not need to modify pipeline code.

Q: Can I capture streaming call latency with this decorator?

The decorator as written measures total wall-clock time. For time-to-first-token, record a first-token timestamp inside the stream loop on the first iteration and log ttft_s as a separate field alongside total_s.
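One way to sketch this, assuming the stream can be treated as a plain iterable of chunks. The timed_stream helper and the fake_stream generator are hypothetical names used for illustration; fake_stream stands in for a real streaming response.

```python
import time
from typing import Any, Dict, Iterable, Iterator


def timed_stream(chunks: Iterable[Any], timings: Dict[str, float]) -> Iterator[Any]:
    """Yield chunks unchanged while recording ttft_s and total_s into `timings`."""
    start = time.perf_counter()  # runs when iteration begins (generators are lazy)
    first_seen = False
    for chunk in chunks:
        if not first_seen:
            timings["ttft_s"] = round(time.perf_counter() - start, 4)
            first_seen = True
        yield chunk
    timings["total_s"] = round(time.perf_counter() - start, 4)


def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming response: a short delay before each chunk."""
    for word in ("Hello", " ", "world"):
        time.sleep(0.01)
        yield word


timings: Dict[str, float] = {}
text = "".join(timed_stream(fake_stream(), timings))
print(text, timings)
```

Because the timer starts on the first iteration, the request should be issued lazily inside the iterable being wrapped; otherwise take the start timestamp before opening the stream and pass it in.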

Q: How should I handle sensitive data in the logs?

Log metadata only (tokens, latency, cost, stop reason). Never log actual prompt or response text in a metadata log. If you need prompt logging, use a separate encrypted log stream with stricter access controls and a shorter retention policy.

Q: How do I get a per-user cost breakdown?

Add a user_id parameter to the decorator factory and include it in the log record. The aggregation report can then group by user_id. For multi-tenant applications, tenant_id is usually the first grouping level.
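A rough sketch of the grouping step, assuming each log record already carries tenant_id and user_id keys next to cost_usd. The function name and the sample records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def cost_by_tenant_and_user(records: Iterable[dict]) -> Dict[Tuple[str, str], float]:
    """Sum cost_usd per (tenant_id, user_id); missing costs count as 0.0."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for r in records:
        key = (r.get("tenant_id") or "unknown", r.get("user_id") or "unknown")
        totals[key] += r.get("cost_usd") or 0.0
    return dict(totals)


sample = [
    {"tenant_id": "acme", "user_id": "u1", "cost_usd": 0.002},
    {"tenant_id": "acme", "user_id": "u1", "cost_usd": 0.003},
    {"tenant_id": "acme", "user_id": "u2", "cost_usd": 0.001},
    {"tenant_id": "globex", "user_id": "u9", "cost_usd": None},  # failed call
]
breakdown = cost_by_tenant_and_user(sample)
for (tenant, user), cost in sorted(breakdown.items()):
    print(f"{tenant:<8} {user:<4} ${cost:.6f}")
```

Keying on the (tenant, user) pair keeps the tenant as the outer grouping level while still allowing a per-user drill-down.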

Q: Is there an official Anthropic SDK method for token counting before sending a request?

Yes. client.messages.count_tokens(model=..., messages=[...]) returns a token count without sending the request to the model. Useful for pre-flight context-window checks and UI cost estimators.

By Asif · June 5, 2026 · 12 min read · AI Use Cases · Updated June 15, 2026

Series
AI in Production: 30 Real-World Use Cases with Claude

Part 28 of 30

TL;DR

LLM observability means capturing model, tokens, latency, cost, and request IDs for every Claude call in structured, queryable logs.
A Python decorator is the least-invasive way to add this: wrap your existing calls and get full telemetry without touching business logic.
Cost per call is calculable from the usage object on every response: input tokens, output tokens, and the per-model pricing table.
Structured JSON logs let you pipe data to any backend (CloudWatch, Datadog, Grafana Loki, a plain file) without changing the instrumentation.
An aggregation report built from the log file gives p50/p95 latency, total spend, error rate, and per-model breakdowns in seconds.
The POC below is complete and runnable: one file, zero external telemetry dependencies, ready to extend with OpenTelemetry spans.

Why LLM Observability Is Not Optional in Production

You can build a Claude-powered feature in an afternoon. Getting it to behave predictably over weeks of real traffic is a different problem entirely. Token counts drift when prompts change. Latency spikes when a model tier is saturated. A single runaway tool-use loop can burn through your monthly budget in an hour. Without instrumentation, you find out about these problems from a support ticket or a billing alert, not from a dashboard.

LLM observability covers the specific signals that matter for language model calls: which model was used, how many tokens were consumed (input vs. output, cached vs. fresh), how long the call took, what the call cost, and whether it succeeded. Traditional APM tools give you HTTP status codes and latency percentiles. That is a start, but it misses the economics. A 200 OK response that cost $0.45 and took 8 seconds is worth knowing about even when nothing technically failed.

This article walks through a complete observability setup for Claude applications: a Python decorator that captures every call, a cost calculation layer, structured JSON output that routes to any backend, and a small aggregation script that gives you a usable report from the raw logs. The POC is self-contained, but every piece maps directly to how you would extend it in a real system with OpenTelemetry, Datadog, or a dedicated LLM observability platform.

What You Actually Need to Track

Before writing any code, it helps to be specific about what signals matter:

Request identity: a unique trace ID per call so you can correlate a log entry back to the user session, job, or API request that triggered it.
Model and version: which model id was requested and, if the API returns one, the actual model id used. This matters when you start routing between tiers.
Token usage: input tokens, output tokens, cache creation tokens, and cache read tokens. These are the four numbers on every msg.usage object.
Latency: time-to-first-token is ideal for streaming; total wall-clock time is the minimum for non-streaming calls.
Cost: calculated from token counts and a pricing table. Not returned by the API, but derivable.
Error type and message: anthropic.APIError subclasses tell you whether a failure was a rate limit, a timeout, an overload, or a bad request.
Caller context: a tag or name for which part of your application made the call (e.g., “summarizer”, “classifier”, “qa-chain”).

Architecture of the Tracing Layer

The cleanest pattern for this is a Python decorator applied to the function that calls client.messages.create. The decorator starts a timer, records the parameters, lets the real call run, captures the response, calculates cost, and writes a structured log entry. Call sites keep calling the same function name and do not change.

Application code (calls ask_claude())
  -> @trace_claude: start timer, capture params
  -> Claude API: messages.create()
  <- response: usage, stop_reason
  -> @trace_claude: stop timer, calculate cost, build log record
  -> Structured JSON log (file / CloudWatch / Datadog / Loki)

Figure 1: The tracing decorator sits between application code and the Claude API. It captures timing and usage data transparently, then writes a structured log record that can route to any backend.

Why a Decorator and Not Middleware

HTTP middleware works well for observability when every AI call goes through a single HTTP gateway you control. In many Python apps, especially those using the Anthropic SDK directly, calls happen at multiple call sites. A decorator applied once to a wrapper function gives you a single place to add instrumentation, regardless of how many places in the codebase call the function. It also keeps the trace context local to the call: you know exactly which function was measured, which makes it easier to add caller-specific tags.

An alternative is to subclass or monkey-patch the Anthropic client. That captures calls regardless of wrapper functions, but it makes testing harder and couples your instrumentation to the SDK internals. The decorator pattern is explicit and replaceable.

Cost Calculation from the Usage Object

The Anthropic API does not return a cost field. It returns token counts, and cost is a function of token counts multiplied by per-model prices. Since the pricing table changes, you should store it in a config dict rather than hardcoding magic numbers inline.

As of mid-2026, the pricing (in USD per million tokens) for the three main model tiers is:

Model ID	Input ($/M tokens)	Output ($/M tokens)	Cache Write ($/M tokens)	Cache Read ($/M tokens)
claude-opus-4-8	$15.00	$75.00	$18.75	$1.50
claude-sonnet-4-6	$3.00	$15.00	$3.75	$0.30
claude-haiku-4-5	$0.80	$4.00	$1.00	$0.08

Cache read tokens are priced at roughly 10% of regular input tokens, which is why prompt caching (Part 4) changes the economics so dramatically for applications with large, repeated system prompts. Your observability layer should track cache hits and misses explicitly, because a deployment without caching enabled looks very different in cost than one with it.
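To make the caching effect concrete, here is a small worked calculation using the Sonnet prices from the table above. It is a sketch under a simplifying assumption: the same system prompt is resent on every call, the first call writes the cache, and every later call hits it. The function name and the token counts are illustrative.

```python
# Sonnet prices from the table above, USD per million tokens.
INPUT, CACHE_WRITE, CACHE_READ = 3.00, 3.75, 0.30
M = 1_000_000


def prompt_cost(system_tokens: int, calls: int, cached: bool) -> float:
    """Input-side cost of resending the same system prompt `calls` times."""
    if not cached:
        return calls * system_tokens / M * INPUT
    # Assumption: the first call writes the cache, the rest read from it.
    write = system_tokens / M * CACHE_WRITE
    reads = (calls - 1) * system_tokens / M * CACHE_READ
    return write + reads


uncached = prompt_cost(10_000, calls=100, cached=False)
cached = prompt_cost(10_000, calls=100, cached=True)
print(f"uncached: ${uncached:.4f}  cached: ${cached:.4f}")
```

For a 10,000-token system prompt over 100 calls this works out to $3.00 uncached versus about $0.33 cached, which is why the cache hit rate deserves its own field in the log.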

Key idea: Always calculate cost inside the decorator, not in a post-processing step. If a call fails partway through, you may still have been charged for the input tokens that were processed. Log the cost estimate alongside the error record so you can account for partial charges in your aggregation.

The Full POC: A Tracing Decorator for Claude

Installation

pip install anthropic python-dotenv

requirements.txt

anthropic>=0.27.0
python-dotenv>=1.0.0

.env

ANTHROPIC_API_KEY=sk-ant-your-key-here
LOG_FILE=claude_traces.jsonl

tracer.py (complete source)

"""
tracer.py

A decorator-based observability layer for Claude API calls.
Logs model, tokens, latency, cost, and request_id to structured
JSONL, then provides an aggregation report from the log file.

Usage:
    from tracer import trace_claude, print_report

    @trace_claude(caller="my-feature")
    def ask_claude(prompt: str, system: str = "") -> str:
        msg = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=system,
            messages=[{"role": "user", "content": prompt}],
        )
        ask_claude._last_response = msg  # side-channel read by the decorator
        return msg.content[0].text

    result = ask_claude("Summarize this text: ...")
    print_report()
"""

from __future__ import annotations

import functools
import json
import logging
import os
import time
import traceback
import uuid
from collections import defaultdict
from pathlib import Path
from typing import Any, Callable

import anthropic
from dotenv import load_dotenv

load_dotenv()

# ---------------------------------------------------------------------------
# Pricing table (USD per million tokens, mid-2026)
# Update this dict when Anthropic adjusts pricing.
# ---------------------------------------------------------------------------
PRICING: dict[str, dict[str, float]] = {
    "claude-opus-4-8": {
        "input": 15.00,
        "output": 75.00,
        "cache_write": 18.75,
        "cache_read": 1.50,
    },
    "claude-sonnet-4-6": {
        "input": 3.00,
        "output": 15.00,
        "cache_write": 3.75,
        "cache_read": 0.30,
    },
    "claude-haiku-4-5": {
        "input": 0.80,
        "output": 4.00,
        "cache_write": 1.00,
        "cache_read": 0.08,
    },
}

# Default price for unknown model ids. Sonnet is a mid-tier estimate, so an
# unrecognised Opus-class model will be under-costed until PRICING is updated.
DEFAULT_PRICING = PRICING["claude-sonnet-4-6"]

LOG_FILE = Path(os.getenv("LOG_FILE", "claude_traces.jsonl"))

# Standard Python logger: writes to stderr, separate from JSONL trace log
_log = logging.getLogger("tracer")
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s %(message)s")


def _get_prices(model: str) -> dict[str, float]:
    """Return pricing dict for a model id, falling back to the default."""
    # Substring match, so dated ids (e.g. with a '-20241022' suffix) resolve
    # to their base entry in PRICING.
    for key, prices in PRICING.items():
        if key in model:
            return prices
    _log.warning("Unknown model %r, using default pricing", model)
    return DEFAULT_PRICING


def _calculate_cost(model: str, usage: anthropic.types.Usage | None) -> float:
    """
    Calculate the total USD cost for a single API call.

    Args:
        model:  The model id string returned (or requested) for this call.
        usage:  The usage object from the response (or None on error).

    Returns:
        Estimated cost in USD as a float. Returns 0.0 if usage is None.
    """
    if usage is None:
        return 0.0

    prices = _get_prices(model)
    m = 1_000_000  # tokens per unit price

    input_tokens = getattr(usage, "input_tokens", 0) or 0
    output_tokens = getattr(usage, "output_tokens", 0) or 0
    cache_creation = getattr(usage, "cache_creation_input_tokens", 0) or 0
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0

    # input_tokens covers fresh, non-cached input only. Cache writes and
    # cache reads are reported separately, so the four terms do not overlap.
    cost = (
        (input_tokens / m) * prices["input"]
        + (output_tokens / m) * prices["output"]
        + (cache_creation / m) * prices["cache_write"]
        + (cache_read / m) * prices["cache_read"]
    )
    return round(cost, 8)


def _write_log(record: dict) -> None:
    """Append a single JSON line to the trace log file."""
    with LOG_FILE.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def trace_claude(caller: str = "default"):
    """
    Decorator factory for tracing Claude API calls.

    Wrap any function that calls client.messages.create and returns the
    response or a derived value. The decorator measures latency, reads
    usage from a response object that the wrapped function stores on the
    decorated function via a side-channel attribute, and writes a
    structured JSON log entry.

    Because most application code extracts .content[0].text and returns
    a plain string, the wrapped function hands the Message object back by
    assigning it to <decorated_name>._last_response. This is a plain
    function attribute, not thread-local storage, so concurrent calls to
    the same decorated function can overwrite each other's response.

    Args:
        caller: A tag identifying which part of the app made this call.

    Returns:
        A decorator.

    Example:
        @trace_claude(caller="summariser")
        def summarise(text: str) -> str:
            msg = client.messages.create(...)
            summarise._last_response = msg   # side-channel
            return msg.content[0].text
    """

    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            request_id = str(uuid.uuid4())
            start = time.perf_counter()
            record: dict[str, Any] = {
                "request_id": request_id,
                "caller": caller,
                "function": fn.__name__,
                "ts_start": time.time(),
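                # Only populated when the wrapped function takes a "model"
                # keyword argument; otherwise this stays "unknown".
                "model_requested": kwargs.get("model", "unknown"),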
                "status": "ok",
                "error": None,
                "latency_s": None,
                "model_used": None,
                "input_tokens": None,
                "output_tokens": None,
                "cache_creation_tokens": None,
                "cache_read_tokens": None,
                "stop_reason": None,
                "cost_usd": None,
            }

            # Reset the side-channel slot. Wrapped functions assign to the
            # decorated name (e.g. ask_claude._last_response = msg), which is
            # this wrapper object, not the original fn.
            wrapper._last_response = None  # type: ignore[attr-defined]

            try:
                result = fn(*args, **kwargs)
                elapsed = time.perf_counter() - start
                record["latency_s"] = round(elapsed, 4)

                # Try to read the response object from the side-channel
                resp = getattr(wrapper, "_last_response", None)
                if isinstance(resp, anthropic.types.Message):
                    usage = resp.usage
                    model_used = resp.model
                    record["model_used"] = model_used
                    record["input_tokens"] = getattr(usage, "input_tokens", None)
                    record["output_tokens"] = getattr(usage, "output_tokens", None)
                    record["cache_creation_tokens"] = getattr(
                        usage, "cache_creation_input_tokens", None
                    )
                    record["cache_read_tokens"] = getattr(
                        usage, "cache_read_input_tokens", None
                    )
                    record["stop_reason"] = resp.stop_reason
                    record["cost_usd"] = _calculate_cost(model_used, usage)
                else:
                    # No side-channel response: cost and tokens remain None.
                    # Warn so the developer knows to wire it up (a debug-level
                    # message would be hidden by the INFO level set above).
                    _log.warning(
                        "request_id=%s: no _last_response set; token/cost fields omitted",
                        request_id,
                    )

                _write_log(record)
                return result

            except anthropic.APIError as exc:
                elapsed = time.perf_counter() - start
                record["latency_s"] = round(elapsed, 4)
                record["status"] = "error"
                record["error"] = {
                    "type": type(exc).__name__,
                    "message": str(exc),
                    "status_code": getattr(exc, "status_code", None),
                }
                # Partial tokens may have been charged even on error.
                # If the exception carries a response body, try to extract usage.
                raw_resp = getattr(exc, "response", None)
                if raw_resp is not None:
                    try:
                        body = raw_resp.json()
                        usage_data = body.get("usage", {})
                        record["input_tokens"] = usage_data.get("input_tokens")
                        record["output_tokens"] = usage_data.get("output_tokens")
                    except Exception:
                        pass

                _write_log(record)
                raise  # re-raise so calling code can handle it

            except Exception as exc:
                elapsed = time.perf_counter() - start
                record["latency_s"] = round(elapsed, 4)
                record["status"] = "error"
                record["error"] = {
                    "type": type(exc).__name__,
                    "message": str(exc),
                    "traceback": traceback.format_exc(limit=5),
                }
                _write_log(record)
                raise

        return wrapper

    return decorator


# ---------------------------------------------------------------------------
# Aggregation report
# ---------------------------------------------------------------------------

def load_traces(log_file: Path = LOG_FILE) -> list[dict]:
    """Read all trace records from the JSONL log file."""
    if not log_file.exists():
        return []
    records = []
    with log_file.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                try:
                    records.append(json.loads(line))
                except json.JSONDecodeError:
                    pass
    return records


def _percentile(values: list[float], p: int) -> float:
    """Compute the p-th percentile with linear interpolation (input is sorted here)."""
    if not values:
        return 0.0
    values = sorted(values)
    idx = (p / 100) * (len(values) - 1)
    lo, hi = int(idx), min(int(idx) + 1, len(values) - 1)
    frac = idx - lo
    return round(values[lo] + frac * (values[hi] - values[lo]), 4)


def print_report(log_file: Path = LOG_FILE, top_n: int = 5) -> None:
    """
    Print a human-readable aggregation report from the trace log.

    Covers:
      - Total calls, error rate, total cost
      - Latency p50 / p95 across all calls
      - Per-model breakdown (calls, tokens, cost)
      - Top-N most expensive calls
    """
    records = load_traces(log_file)
    if not records:
        print("No trace records found in", log_file)
        return

    total = len(records)
    errors = [r for r in records if r.get("status") == "error"]
    ok_records = [r for r in records if r.get("status") == "ok"]

    latencies = [r["latency_s"] for r in records if r.get("latency_s") is not None]
    costs = [r["cost_usd"] for r in ok_records if r.get("cost_usd") is not None]
    total_cost = sum(costs)
    total_input = sum(r.get("input_tokens") or 0 for r in ok_records)
    total_output = sum(r.get("output_tokens") or 0 for r in ok_records)
    total_cache_read = sum(r.get("cache_read_tokens") or 0 for r in ok_records)

    print("=" * 60)
    print("  CLAUDE TRACE REPORT")
    print("=" * 60)
    print(f"  Log file    : {log_file}")
    print(f"  Total calls : {total}")
    print(f"  Successes   : {len(ok_records)}")
    print(f"  Errors      : {len(errors)}  ({100*len(errors)/total:.1f}%)")
    print(f"  Total cost  : ${total_cost:.6f}")
    print(f"  Input tok   : {total_input:,}")
    print(f"  Output tok  : {total_output:,}")
    print(f"  Cache reads : {total_cache_read:,}")
    print()
    print(f"  Latency p50 : {_percentile(latencies, 50):.3f}s")
    print(f"  Latency p95 : {_percentile(latencies, 95):.3f}s")
    print(f"  Latency max : {max(latencies, default=0):.3f}s")

    # Per-model breakdown
    by_model: dict[str, dict] = defaultdict(
        lambda: {"calls": 0, "input_tokens": 0, "output_tokens": 0, "cost": 0.0, "errors": 0}
    )
    for r in records:
        model = r.get("model_used") or r.get("model_requested") or "unknown"
        by_model[model]["calls"] += 1
        by_model[model]["input_tokens"] += r.get("input_tokens") or 0
        by_model[model]["output_tokens"] += r.get("output_tokens") or 0
        by_model[model]["cost"] += r.get("cost_usd") or 0.0
        if r.get("status") == "error":
            by_model[model]["errors"] += 1

    print()
    print("  Per-model breakdown:")
    print(f"  {'Model':<30} {'Calls':>6} {'Errors':>6} {'In Tok':>10} {'Out Tok':>9} {'Cost USD':>12}")
    print("  " + "-" * 78)
    for model, stats in sorted(by_model.items(), key=lambda x: -x[1]["cost"]):
        print(
            f"  {model[:30]:<30} {stats['calls']:>6} {stats['errors']:>6} "
            f"{stats['input_tokens']:>10,} {stats['output_tokens']:>9,} ${stats['cost']:>11.6f}"
        )

    # Top-N most expensive calls
    expensive = sorted(
        [r for r in ok_records if r.get("cost_usd")],
        key=lambda r: r["cost_usd"],
        reverse=True,
    )[:top_n]

    if expensive:
        print()
        print(f"  Top {top_n} most expensive calls:")
        for i, r in enumerate(expensive, 1):
            print(
                f"  {i}. request_id={r['request_id'][:8]}...  "
                f"caller={r.get('caller','?')}  "
                f"model={r.get('model_used','?')}  "
                f"cost=${r['cost_usd']:.6f}  "
                f"latency={r['latency_s']:.3f}s"
            )

    # Error summary
    if errors:
        print()
        print("  Recent errors:")
        for r in errors[-3:]:
            err = r.get("error") or {}
            print(
                f"  - {r['request_id'][:8]}...  "
                f"{err.get('type','?')}: {str(err.get('message',''))[:80]}"
            )

    print("=" * 60)


# ---------------------------------------------------------------------------
# Demo application
# ---------------------------------------------------------------------------

if __name__ == "__main__":
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

    # --- Example 1: Simple Q&A with Sonnet ---

    @trace_claude(caller="qa-feature")
    def ask_claude(prompt: str, system: str = "") -> str:
        """Ask Claude a question and return the answer text."""
        msg = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            system=system if system else "You are a concise technical assistant.",
            messages=[{"role": "user", "content": prompt}],
        )
        ask_claude._last_response = msg  # wire up side-channel
        return msg.content[0].text

    # --- Example 2: Fast classification with Haiku ---

    @trace_claude(caller="classifier")
    def classify_intent(text: str) -> str:
        """Classify the intent of a user message."""
        msg = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=64,
            system="Classify the user intent as one of: question, complaint, request, feedback. Reply with one word only.",
            messages=[{"role": "user", "content": text}],
        )
        classify_intent._last_response = msg
        return msg.content[0].text.strip().lower()

    # --- Example 3: Heavier summarisation with Opus ---

    @trace_claude(caller="summariser")
    def summarise_doc(document: str) -> str:
        """Produce a concise summary of a long document."""
        msg = client.messages.create(
            model="claude-opus-4-8",
            max_tokens=256,
            system="You are an expert technical writer. Summarise the document in 3 bullet points.",
            messages=[{"role": "user", "content": document}],
        )
        summarise_doc._last_response = msg
        return msg.content[0].text

    # Run the demo calls
    print("Running demo calls...\n")

    # Call 1
    answer = ask_claude("What is the difference between p50 and p95 latency?")
    print("Q&A answer:", answer[:120], "...\n")

    # Call 2
    intent = classify_intent("I cannot log in and I have been trying for an hour.")
    print("Classified intent:", intent, "\n")

    # Call 3
    summary = summarise_doc(
        """
        LLM observability refers to the practice of instrumenting language model
        applications to capture operational signals: token usage, latency, cost,
        error rates, and model version. Unlike traditional application monitoring,
        LLM observability must account for variable-length inputs and outputs,
        non-deterministic responses, and per-token pricing models. Tools in this
        space include OpenLLMetry, LangSmith, Helicone, and vendor-native dashboards.
        The core challenge is correlating a language model call back to the user
        session, job queue item, or API request that triggered it, especially in
        async pipelines where the call may happen in a background worker.
        """
    )
    print("Summary:", summary[:200], "...\n")

    # Simulate a fourth call that triggers an error (bad model name)
    @trace_claude(caller="bad-call-demo")
    def bad_call() -> str:
        msg = client.messages.create(
            model="claude-nonexistent-model",
            max_tokens=16,
            messages=[{"role": "user", "content": "test"}],
        )
        bad_call._last_response = msg
        return msg.content[0].text

    try:
        bad_call()
    except anthropic.APIError as e:
        print(f"Expected error caught: {type(e).__name__}: {e}\n")

    # Print the aggregation report
    print_report()

Sample Run Output

Running demo calls...

Q&A answer: p50 latency (the median) means half of your requests complete faster
than that value. p95 means 95% of requests complete faster. p95 catches tail latency
spikes that the median hides ...

Classified intent: complaint

Summary:
- LLM observability captures operational signals (tokens, latency, cost, errors,
  model version) to give teams visibility into how language model applications behave
  in production.
- The key challenge is correlating a model call back to the originating request in
  async pipelines where the call may happen in a background worker.
- Established tools include OpenLLMetry, LangSmith, Helicone, and vendor-native
  dashboards ...

Expected error caught: NotFoundError: Error code: 404 - ...

============================================================
  CLAUDE TRACE REPORT
============================================================
  Log file    : claude_traces.jsonl
  Total calls : 4
  Successes   : 3
  Errors      : 1  (25.0%)
  Total cost  : $0.012053
  Input tok   : 412
  Output tok  : 187
  Cache reads : 0

  Latency p50 : 1.843s
  Latency p95 : 4.121s
  Latency max : 4.351s

  Per-model breakdown:
  Model                           Calls Errors     In Tok   Out Tok     Cost USD
  ------------------------------------------------------------------------------
  claude-opus-4-8                     1      0        203        98 $   0.010395
  claude-sonnet-4-6                   1      0        156        72 $   0.001548
  claude-haiku-4-5                    1      0         53        17 $   0.000110
  unknown                             1      1          0         0 $   0.000000

  Top 5 most expensive calls:
  1. request_id=a3f7c2d1...  caller=summariser  model=claude-opus-4-8  cost=$0.010395  latency=4.351s
  2. request_id=b8e1a9f4...  caller=qa-feature  model=claude-sonnet-4-6  cost=$0.001548  latency=1.843s
  3. request_id=d2c4b3e5...  caller=classifier  model=claude-haiku-4-5  cost=$0.000110  latency=0.612s

  Recent errors:
  - e7f9d1c2...  NotFoundError: Error code: 404 - ...
============================================================

The JSONL Log Format

Each line in claude_traces.jsonl is a self-contained JSON object. Here is a representative record for a successful call:

{
  "request_id": "b8e1a9f4-3c2d-4a1b-9e7f-d5c6a8b0e2f1",
  "caller": "qa-feature",
  "function": "ask_claude",
  "ts_start": 1717430412.384,
  "model_requested": "unknown",
  "status": "ok",
  "error": null,
  "latency_s": 1.8430,
  "model_used": "claude-sonnet-4-6",
  "input_tokens": 156,
  "output_tokens": 72,
  "cache_creation_tokens": 0,
  "cache_read_tokens": 0,
  "stop_reason": "end_turn",
  "cost_usd": 0.001548
}

This format ingests directly into Datadog Log Management, CloudWatch Insights, Grafana Loki, or any system that accepts newline-delimited JSON. No schema changes are needed to add a field: just add a key to the record dict and every downstream query tool picks it up automatically.

Extending to OpenTelemetry

The JSONL approach is enough for a single service. When your architecture has multiple services all calling Claude, you need distributed tracing so you can follow a user request across the whole call chain. OpenTelemetry is the standard here: it gives you trace IDs that propagate via HTTP headers, span hierarchies, and exporters for every major observability backend.

Trace ID: b8e1a9f4 (propagated via the traceparent header)
  Span 1: HTTP handler /api/summarise [0ms ... 4800ms]
    Span 2: summarise_doc [12ms ... 4750ms]
      Span 3: claude.messages.create [15ms ... 4740ms]
        llm.request.model = claude-opus-4-8
        llm.usage.input_tokens = 203
        llm.usage.output_tokens = 98
        llm.cost.usd = 0.010395

Figure 2: OpenTelemetry trace hierarchy. The LLM span (Span 3) carries token, cost, and model attributes. The parent spans carry the same trace ID, letting you see total request cost across all Claude calls in a single trace.

To add OTel to the decorator, you add two lines: start a span before the call and set attributes on it after. The opentelemetry-api and opentelemetry-sdk packages handle propagation. The OpenLLMetry project (github.com/traceloop/openllmetry) provides pre-built instrumentations for the Anthropic SDK that follow the semantic conventions for AI spans, so you can get distributed LLM tracing without writing custom instrumentation from scratch.

For teams already on autonomous agent loops (Part 22) or MCP tool chains (Part 23), distributed tracing becomes essential: a single user request can trigger a chain of five or ten Claude calls, and you need to know which one in the chain is slow or expensive.

What to Alert On

Having the data is step one. Knowing what to alert on is step two. Not every anomaly needs a PagerDuty notification, but these signals consistently indicate real problems:

Error rate above 2% over a 5-minute window: Could be a rate limit, a downstream model outage, or a prompt that started generating invalid inputs for a tool. Investigate before it compounds.
p95 latency above 10 seconds: Users feel this. The Anthropic API has built-in timeouts, but a slow p95 usually means you are hitting the API under load that warrants batching or routing (Part 27) to a faster model tier.
Cost per hour spiking beyond 3x the rolling average: A runaway loop or a prompt injection that is forcing unexpectedly long outputs will show here before it appears on a billing alert.
Stop reason "max_tokens" appearing frequently: If many responses are truncated, your max_tokens value is too low for the task. This produces degraded output quality silently.
Cache read rate dropping to zero for a feature that should be caching: System prompts may have changed, or caching may not be configured correctly. Part 4 of this series covers the exact setup.
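The rules above can be sketched as a single check over one window of trace records. This is an illustration, not a production alerting system: the function name, the 20% cutoff for "frequently" truncated responses, and the caching_callers parameter are assumptions, and p95 uses a simple nearest-rank approximation.

```python
from typing import Iterable, List


def check_alerts(
    records: List[dict],
    avg_cost_per_hour: float,
    window_hours: float,
    truncation_limit: float = 0.2,        # assumed cutoff for "frequently"
    caching_callers: Iterable[str] = (),  # callers expected to hit the cache
) -> List[str]:
    """Return the names of the alert rules that fire for one window of records."""
    alerts: List[str] = []
    if not records:
        return alerts
    n = len(records)

    if sum(r.get("status") == "error" for r in records) / n > 0.02:
        alerts.append("error_rate")

    latencies = sorted(r["latency_s"] for r in records if r.get("latency_s") is not None)
    if latencies:
        # Nearest-rank approximation of p95.
        p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
        if p95 > 10.0:
            alerts.append("p95_latency")

    cost_per_hour = sum(r.get("cost_usd") or 0.0 for r in records) / window_hours
    if cost_per_hour > 3 * avg_cost_per_hour:
        alerts.append("cost_spike")

    if sum(r.get("stop_reason") == "max_tokens" for r in records) / n > truncation_limit:
        alerts.append("truncation")

    for caller in caching_callers:
        calls = [r for r in records if r.get("caller") == caller]
        if calls and not any(r.get("cache_read_tokens") for r in calls):
            alerts.append(f"cache_miss:{caller}")
    return alerts


ok = {"status": "ok", "latency_s": 1.0, "cost_usd": 0.01, "stop_reason": "end_turn"}
cut = {"status": "ok", "latency_s": 1.0, "cost_usd": 0.01, "stop_reason": "max_tokens"}
err = {"status": "error", "latency_s": 12.0, "cost_usd": None, "stop_reason": None}
window = [ok] * 6 + [cut] * 3 + [err]  # a deliberately unhealthy 5-minute window
alerts = check_alerts(window, avg_cost_per_hour=0.30, window_hours=5 / 60)
print(alerts)
```

In practice the same thresholds would live in your monitoring backend's alert rules; a script like this is mainly useful for checking them against real log data before wiring up notifications.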

LLM Observability: What the Existing Tools Offer

The approach above is a deliberately minimal, zero-dependency foundation. Several commercial and open-source tools cover the same ground with more features and are worth knowing:

Tool	Type	Best For	Claude Support
LangSmith	Hosted SaaS	LangChain pipelines, evals, prompt versioning	Yes (via LangChain or direct)
Helicone	Hosted proxy	Drop-in token/cost logging via base_url swap	Yes (Anthropic proxy mode)
OpenLLMetry	Open source OTel	Self-hosted, OTel-native, any backend	Yes (auto-instrumentation)
Weights & Biases Weave	Hosted SaaS	Experiment tracking + trace correlation	Yes
Custom JSONL (this POC)	Self-hosted	Full control, zero vendor lock-in	Native (you write it)

The proxy-based tools (like Helicone) are the simplest to adopt: you change base_url in the Anthropic client constructor and every call is logged without touching application code. The tradeoff is that all your prompts and responses transit a third-party server. For sensitive data, the decorator pattern keeps everything inside your own infrastructure.

Common Pitfalls

Forgetting to Wire the Side-Channel

The decorator captures usage from fn._last_response. If your wrapped function does not set this attribute, the token and cost fields are logged as None. The aggregation report will still run, but cost totals will be zero. Log a debug-level message in the decorator when the attribute is missing (as the code above does), and check the output the first time you run a new wrapper function.
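A minimal sketch of that check, assuming the fn._last_response convention described above. The helper name read_usage and the logger name are hypothetical, and the real decorator may structure this differently.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("claude_observability")  # hypothetical logger name


def read_usage(fn: Callable[..., Any]) -> dict[str, Optional[int]]:
    """Read token usage from fn._last_response, logging when it is missing."""
    response = getattr(fn, "_last_response", None)
    usage = getattr(response, "usage", None)
    if usage is None:
        # Without the side-channel, token and cost fields end up as None.
        logger.debug(
            "%s did not set _last_response; token and cost fields will be None",
            getattr(fn, "__name__", repr(fn)),
        )
        return {"input_tokens": None, "output_tokens": None}
    return {"input_tokens": usage.input_tokens, "output_tokens": usage.output_tokens}
```

Run the first call to any new wrapper with the logger set to DEBUG so the message is visible.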

Logging Prompts and Responses at INFO Level in Production

The POC logs metadata only, which is intentional. Logging full prompt text and response text is useful during development, but it creates two production risks: you may log PII from user inputs, and your log volume can spike by 10x to 100x when responses are long. If you need prompt logging, add it behind a flag and send it to a separate, more tightly access-controlled log stream.

Using Wall-Clock Time as Latency for Async Code

The time.perf_counter() approach works correctly for synchronous calls. For async code using await client.messages.create(), the timer still measures wall time correctly because the await suspends the coroutine and resumes it only when the response arrives. What it does not measure is queue time if calls wait behind a semaphore or rate limiter. Add a separate field for queue entry time if that is relevant in your architecture.
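One way to separate queue time from call time is to take a timestamp before waiting on the limiter and another once the slot is acquired. The sketch below assumes an asyncio.Semaphore as the limiter; the function name timed_call and the field names queue_s and latency_s are illustrative.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable


async def timed_call(sem: asyncio.Semaphore, call: Callable[[], Awaitable[Any]]) -> dict:
    """Await call() behind a semaphore, timing the wait and the call separately."""
    enqueued = time.perf_counter()
    async with sem:
        # Time spent waiting for a slot, before the API call starts.
        started = time.perf_counter()
        result = await call()
        finished = time.perf_counter()
    return {
        "queue_s": started - enqueued,
        "latency_s": finished - started,
        "result": result,
    }
```

In practice, call would be something like lambda: client.messages.create(...) on an async client. A rising queue_s with a flat latency_s points at your own concurrency limit rather than at the API.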

Stale Pricing Table

Anthropic adjusts pricing as models mature. If the pricing dict in your code is months old, your cost estimates will drift from reality. Move the dict into a config file or environment variable, or fetch it from a lightweight pricing API. At minimum, add a comment in the code with the date the table was last verified against the Anthropic pricing page.
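One way to make staleness visible is to store the verification date next to the prices and check it at load time. In the sketch below, the JSON layout, the model name example-model, the prices, and the 90-day limit are all placeholders, not real Anthropic pricing.

```python
import json
from datetime import date

# Placeholder config; in practice this would live in a file or env variable.
PRICING_JSON = """
{
  "verified_on": "2025-01-15",
  "usd_per_million_tokens": {
    "example-model": {"input": 1.00, "output": 5.00}
  }
}
"""


def load_pricing(raw: str, today: date, max_age_days: int = 90) -> tuple[dict, bool]:
    """Return (pricing table, is_stale) from a JSON config string."""
    data = json.loads(raw)
    verified_on = date.fromisoformat(data["verified_on"])
    is_stale = (today - verified_on).days > max_age_days
    return data["usd_per_million_tokens"], is_stale
```

The caller can log a warning or fail a CI check when is_stale is true, which turns a silent drift into a visible chore.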

Missing Request IDs When Calls Are Retried

The decorator generates a new UUID for each call attempt. If you wrap your calls in a retry loop, each retry gets its own request ID, which is what you want: you can see that a particular request failed twice before succeeding. But make sure you also log a "parent request id" or "session id" so you can group the retry attempts together in your analysis. Add it as a parameter to the decorator factory.
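A sketch of the grouping idea: each attempt gets a fresh request_id, while all attempts share one parent_request_id. The function name call_with_retries and the record fields are illustrative, and backoff between attempts is left out to keep the example short.

```python
import uuid
from typing import Any, Callable


def call_with_retries(
    fn: Callable[[], Any],
    log: Callable[[dict], None],
    max_attempts: int = 3,
) -> Any:
    """Call fn up to max_attempts times, logging one record per attempt."""
    parent_request_id = str(uuid.uuid4())  # shared by every attempt
    for attempt in range(1, max_attempts + 1):
        record = {
            "request_id": str(uuid.uuid4()),  # unique per attempt
            "parent_request_id": parent_request_id,
            "attempt": attempt,
        }
        try:
            result = fn()
        except Exception as exc:
            record.update(status="error", error=type(exc).__name__)
            log(record)
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
        else:
            record["status"] = "ok"
            log(record)
            return result
```

Grouping the log by parent_request_id then shows, for example, two error records followed by one ok record for a call that succeeded on its third attempt.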

Cost and Latency Note

The numbers below give a rough sense of what to expect for short, single-turn calls with prompt lengths around 200 tokens and response lengths around 100 tokens. Real production numbers will vary with prompt length, model load, and network conditions.

Model	Typical Latency (200 in / 100 out)	Cost per 1,000 calls	Use When
claude-haiku-4-5	0.4s to 0.9s	$0.04 to $0.24	Classification, routing, short lookups
claude-sonnet-4-6	1.2s to 3.5s	$0.30 to $1.50	Most production tasks
claude-opus-4-8	3.0s to 9.0s	$1.50 to $9.00	Hard reasoning, high-stakes outputs

For applications that run thousands of calls per day, even small differences in model choice compound quickly. The model routing patterns in Part 27 show how to classify request complexity and route to the cheapest model that can handle it, using your observability data to verify the routing is working as expected.

The connection to eval harnesses (Part 24) is also direct: once you have structured logs, you can replay any production request through your eval suite to check whether a model change or prompt change would have produced a different result.

Frequently Asked Questions

Does the Anthropic API return a request ID I can use for support tickets?

Yes. The response object has a _request_id attribute. The value comes from the request-id response header rather than the response body, and despite the underscore prefix the SDK documents it as a public property. Log it alongside your own UUID. When you contact Anthropic support about a specific call, providing this ID lets them pull server-side telemetry for that exact request. Access it with msg._request_id after a successful call and store it in your log record under a field like anthropic_request_id.
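A small sketch of how that might look in a log record. The helper name build_log_record and the record layout are hypothetical; only the _request_id attribute and the anthropic_request_id field name come from the text above.

```python
import uuid
from typing import Any


def build_log_record(msg: Any) -> dict:
    """Pair your own UUID with Anthropic's request ID for one call."""
    return {
        "request_id": str(uuid.uuid4()),  # your own correlation ID
        # getattr with a default keeps logging safe if the attribute is absent.
        "anthropic_request_id": getattr(msg, "_request_id", None),
    }
```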

What is the difference between input tokens and cache creation tokens?

When you enable prompt caching, the first time a cached block is sent, the API charges for writing it to the cache (at a slight premium over regular input pricing). Subsequent requests that hit the cache pay only the much cheaper cache read rate. The usage object splits this into three separate fields: input_tokens (fresh, non-cached input), cache_creation_input_tokens (tokens being written to cache for the first time), and cache_read_input_tokens (tokens read from an existing cache entry). You need all three to calculate the true cost of a call.
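To show how the three input fields combine with output tokens, here is a sketch of the cost arithmetic. The function name and the rate keys (input, cache_write, cache_read, output) are assumptions, and the rates in the usage example are placeholders rather than real prices.

```python
def call_cost_usd(usage: dict, rates_per_million: dict) -> float:
    """Cost of one call in USD, given per-million-token rates for each bucket."""
    def tokens(field: str) -> int:
        # Cache fields can be missing or None when caching is not in use.
        return usage.get(field) or 0

    return (
        tokens("input_tokens") * rates_per_million["input"]
        + tokens("cache_creation_input_tokens") * rates_per_million["cache_write"]
        + tokens("cache_read_input_tokens") * rates_per_million["cache_read"]
        + tokens("output_tokens") * rates_per_million["output"]
    ) / 1_000_000


# Placeholder rates for illustration only (USD per million tokens).
example_rates = {"input": 1.00, "cache_write": 1.25, "cache_read": 0.10, "output": 5.00}
example_usage = {
    "input_tokens": 200,
    "cache_creation_input_tokens": 1000,
    "cache_read_input_tokens": 0,
    "output_tokens": 100,
}
example_cost = call_cost_usd(example_usage, example_rates)
```

Leaving out either cache field understates the first, cache-writing call and overstates every later call that reads from the cache.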

How do I trace calls that go through a LangChain or LlamaIndex pipeline?

Both frameworks have callback systems that fire before and after LLM calls. LangChain has BaseCallbackHandler with on_llm_start and on_llm_end methods. LlamaIndex has a similar event system. You can write a custom callback that calls _write_log from the POC above. Alternatively, OpenLLMetry auto-patches both frameworks at the module level so you do not need to modify pipeline code at all.

Can I capture streaming call latency with this decorator?

The decorator as written measures total wall-clock time, which covers the full stream duration for a streaming call. To capture time-to-first-token specifically, you need to instrument inside the stream context manager. The pattern is: record the start time before the with client.messages.stream(...) block, record a first-token timestamp inside the loop on the first iteration, and record the finish time after the loop. You can then log ttft_s (time to first token) and total_s as separate fields.
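The pattern can be sketched as a helper that consumes any iterable of text chunks. With the Anthropic SDK you would take start = time.perf_counter() before the with client.messages.stream(...) as stream: block and pass stream.text_stream together with that start value. The helper name consume_stream is illustrative.

```python
import time
from typing import Iterable, Optional


def consume_stream(chunks: Iterable[str], start: Optional[float] = None) -> dict:
    """Consume a text stream, recording time-to-first-token and total time."""
    if start is None:
        start = time.perf_counter()
    first_token_at: Optional[float] = None
    parts: list[str] = []

    for text in chunks:
        if first_token_at is None:
            # First iteration of the loop: the first chunk has arrived.
            first_token_at = time.perf_counter()
        parts.append(text)

    finished = time.perf_counter()
    return {
        "ttft_s": None if first_token_at is None else first_token_at - start,
        "total_s": finished - start,
        "text": "".join(parts),
    }
```

An empty stream yields ttft_s of None rather than zero, so a failed stream is not mistaken for an instant one.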

How should I handle sensitive data in the logs?

The POC logs metadata only (tokens, latency, cost, stop reason). Never log the actual prompt text or response text in a metadata log. If you need prompt logging for debugging, create a separate, encrypted log stream with stricter access controls and a shorter retention policy. For GDPR or HIPAA compliance, treat any log containing user input as personal data and apply the same data handling rules you apply to your primary database.

How do I get a per-user cost breakdown?

Add a user_id parameter to the decorator factory and include it in the log record. The aggregation report function can then group by user_id instead of (or in addition to) model. For multi-tenant applications, tenant_id is usually the first grouping level, with user breakdowns nested inside. The JSONL format makes this easy: use jq or a Python script to filter and group however you need.
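A Python version of that grouping might look like the sketch below. It assumes each JSONL record carries tenant_id, user_id, and cost_usd fields, which are illustrative names and may differ from the POC's log schema.

```python
import json
from collections import defaultdict
from typing import Iterable


def cost_by_tenant_and_user(lines: Iterable[str]) -> dict[str, dict[str, float]]:
    """Sum cost_usd per tenant, with per-user totals nested inside."""
    totals: defaultdict = defaultdict(lambda: defaultdict(float))
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the log file
        record = json.loads(line)
        tenant = record.get("tenant_id", "unknown")
        user = record.get("user_id", "unknown")
        # cost_usd may be None when usage was not captured for a call.
        totals[tenant][user] += record.get("cost_usd") or 0.0
    return {tenant: dict(users) for tenant, users in totals.items()}
```

Since an open file is an iterable of lines, you can pass open("your_log.jsonl") directly, where the file name is whatever your logger writes to.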

Is there an official Anthropic SDK method for token counting before sending a request?

Yes. The SDK exposes client.messages.count_tokens(model=..., messages=[...]), which returns a token count without sending the request to the model. This is useful for pre-flight checks (will this prompt exceed the context window?) and for building cost estimators in UI features that show users how much a query will cost before they submit it. It makes a separate, lightweight API call, so use it judiciously in high-throughput paths.
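A pre-flight check along those lines could be sketched as follows. The function name fits_context and the context_window and reserve_for_output parameters are assumptions for illustration; the client is passed in, and the sketch relies on the count_tokens result exposing an input_tokens field.

```python
from typing import Any


def fits_context(
    client: Any,
    model: str,
    messages: list[dict],
    context_window: int,      # the model's context limit, supplied by you
    reserve_for_output: int,  # tokens you want left over for the response
) -> bool:
    """Return True if the prompt plus the output reserve fits in the window."""
    count = client.messages.count_tokens(model=model, messages=messages)
    return count.input_tokens + reserve_for_output <= context_window
```

Because each check is its own API call, caching the result per prompt template is one way to keep it off the hot path.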

Back to the full series: AI in Production: 30 Real-World Use Cases with Claude
