TL;DR
- This AI incident response copilot takes an alert payload, recent log lines, and service metadata and returns a structured hypothesis, a step-by-step runbook, and a ready-to-post status page update within seconds.
- Structured output via tool use gives you machine-readable fields (severity, affected_services, hypothesis, runbook_steps, status_update) that slot directly into PagerDuty, Slack, or your internal dashboard.
- Prompt caching on the system prompt and service catalog cuts repeat call costs by up to 90% when the same on-call engineer is triaging a storm of alerts in the same incident window.
- The POC below is a complete, runnable Python script: drop in your alert JSON and log lines and it hands back everything a responder needs in one shot.
- Model choice matters: use claude-sonnet-4-6 for real-time triage and claude-opus-4-8 only for post-incident root cause reports where you need deeper reasoning.
- AI incident response does not replace your runbook or your team. It collapses the time from “alert fires” to “first hypothesis formed” from minutes to seconds.
The Real Cost of a Slow AI Incident Response
On-call is expensive in every currency: developer time, customer trust, revenue per minute of downtime, and the psychological toll of 3 AM pages. The median time-to-acknowledge for a P1 incident at a mid-sized engineering organization sits around 8 minutes. The median time-to-hypothesis, meaning the first informed guess about what is actually wrong, is closer to 25 minutes once you account for log hunting, Slack archaeology, and reading the runbook for a service you last touched six months ago.
AI incident response tools aim at that 17-minute gap. They do not replace the engineer; they hand that engineer a starting point before they have had a chance to open a second browser tab. A well-formed hypothesis at minute two changes everything: the responder either confirms and acts, or rules it out quickly and moves to the next candidate. Both outcomes are faster than starting from zero.
This article builds a working copilot in Python using Claude. You feed it three things every alert system already produces: the alert payload (JSON from PagerDuty, Grafana, or your custom alerter), a block of recent log lines from the affected service, and a small metadata blob describing the service (dependencies, SLOs, owner). Claude returns three things: a ranked hypothesis with a confidence note, a numbered runbook for the most likely cause, and a drafted status page update at the right severity level. All structured. All parseable. Ready to pipe into whatever tooling you already run.
Who Benefits and What It Actually Saves
The clearest win is for small-to-medium engineering teams where no one person holds all the context for every service. A backend engineer paged for a frontend CDN issue, or a new hire on their first on-call rotation, is exactly the person who loses the most time to orientation. Claude collapses that orientation phase.
Even for experienced teams, the copilot earns its keep during alert storms. When five services degrade simultaneously, the bottleneck is not skill; it is cognitive bandwidth. Having a tool produce a first-pass hypothesis and a draft runbook for each alert in parallel lets one engineer triage in the time it used to take to open the first service dashboard.
Concrete numbers from teams that have deployed similar tooling:
- Time-to-first-hypothesis reduced from 15-25 minutes to 1-3 minutes in controlled benchmarks.
- Status page lag (time between incident start and first public update) cut by 60 to 70 percent because the draft is ready before the engineer has even confirmed the cause.
- Junior on-call confidence increases measurably; structured runbook steps lower the chance of mis-applied fixes during high-stress moments.
- Post-incident review quality improves because the hypothesis log shows what was considered and ruled out, not just what was done.
Architecture of the Incident Response Copilot
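The system is intentionally thin. There is no persistent state required for the MVP, though you will want to add it once you see the value. The data flow is: the alert payload, recent log lines, and service metadata are assembled into a single request; Claude is called with the cached system prompt and a forced tool call; and the structured triage result is handed to whatever downstream tooling you already run.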
Why Structured Output Is Non-Negotiable Here
Free-form text from an LLM is useful for human reading but hard to act on programmatically. An incident copilot needs to feed downstream systems: post the status update via a Statuspage API, open a Jira ticket with the hypothesis as the description, send the runbook steps to Slack as a formatted list. That means the output must be a predictable JSON shape. Claude’s tool use mechanism is the right way to enforce this. You define the schema once as a JSON Schema object, pass it as a tool, and force the model to use it with tool_choice={"type": "tool", "name": "triage_result"}. The model cannot return a different shape.
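Forcing the tool constrains the shape, but a cheap sanity check before fanning the result out to Statuspage, Jira, or Slack is still worth having, for example to catch a response that was cut off. The sketch below is one possible check, not part of the POC: `validate_triage` and the field table are hypothetical names, and the fields mirror the ones listed in the TL;DR.

```python
# Hypothetical pre-flight check on the dict returned by the triage tool call.
REQUIRED_FIELDS = {
    "severity": str,
    "affected_services": list,
    "primary_hypothesis": dict,
    "runbook_steps": list,
    "status_page_update": dict,
}
ALLOWED_SEVERITIES = {"SEV-1", "SEV-2", "SEV-3", "SEV-4"}


def validate_triage(result: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks usable."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in result:
            problems.append(f"missing field: {field}")
        elif not isinstance(result[field], expected_type):
            problems.append(f"wrong type for {field}: {type(result[field]).__name__}")
    severity = result.get("severity")
    if isinstance(severity, str) and severity not in ALLOWED_SEVERITIES:
        problems.append(f"unknown severity: {severity!r}")
    return problems
```

If the returned list is non-empty, one option is to skip the automated posting and show the raw result to the responder instead.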
Why Prompt Caching Matters for On-Call
During an incident window, you might call the copilot 10 to 30 times: once per alert, once after each remediation attempt, once to draft the post-incident summary. The system prompt carries your service catalog, your runbook library, and your incident severity definitions. That prompt can easily be 4,000 to 8,000 tokens. Without caching, you pay full input token costs on every single call. With prompt caching (covered in detail in Part 4: Prompt Caching with Claude), the first call writes the cache at a small premium over the normal input price, and subsequent calls within the cache lifetime read from it at roughly 10% of the normal input price. On a 30-call incident window, that is about a 7x cost reduction on the cached portion of the input.
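The arithmetic behind that estimate can be sketched in a few lines. The multipliers are assumptions to check against current pricing: a cache write billed at 1.25x the base input price and a cache read at 0.10x. The 6,000-token prefix is simply the midpoint of the range above, and the sketch assumes every call lands inside the cache lifetime.

```python
def input_cost_units(
    calls: int,
    cached_tokens: int,
    write_mult: float = 1.25,  # assumed cache-write multiplier
    read_mult: float = 0.10,   # assumed cache-read multiplier
) -> tuple[float, float]:
    """Return (uncached, cached) cost of the stable prefix in base-price token units."""
    uncached = calls * cached_tokens
    # One write on the first call, then reads on every later call.
    cached = cached_tokens * write_mult + (calls - 1) * cached_tokens * read_mult
    return uncached, cached


uncached, cached = input_cost_units(calls=30, cached_tokens=6000)
print(f"uncached={uncached:.0f} cached={cached:.0f} reduction={uncached / cached:.1f}x")
# prints: uncached=180000 cached=24900 reduction=7.2x
```

The ratio does not depend on the prefix size, only on the number of calls: fewer calls in the window means the one-time write cost weighs more heavily.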
Designing the Prompt for Incident Triage
System Prompt Structure
The system prompt has three parts: role definition, the service catalog (static, cacheable), and output instructions. Keep the service catalog in the system prompt, not the user message. This is the large stable block that benefits from caching. The alert payload and logs belong in the user message because they change on every call.
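One way to notice when per-alert data has slipped into the cacheable block is to log a fingerprint of the system prompt next to the cache statistics. This is a minimal sketch; `prompt_fingerprint` is a hypothetical helper, and it assumes the system prompt is a list of text content blocks.

```python
import hashlib


def prompt_fingerprint(system_blocks: list[dict]) -> str:
    """Short SHA-256 fingerprint of the text in the system prompt blocks."""
    joined = "\n".join(
        block["text"] for block in system_blocks if block.get("type") == "text"
    )
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:12]


# Illustrative blocks only: the second one leaks a per-alert field.
stable = [{"type": "text", "text": "role + severity guide + service catalog"}]
leaky = [{"type": "text", "text": "role + severity guide + alert_id=AGT-1"}]

print(prompt_fingerprint(stable) == prompt_fingerprint(stable))  # prints: True
print(prompt_fingerprint(stable) == prompt_fingerprint(leaky))   # prints: False
```

If the fingerprint changes between two calls in the same incident window, something dynamic is in the system prompt and the cache will not be hit.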
The role definition should be specific and grounded. “You are an expert site reliability engineer” is fine but vague. Better: “You are an SRE with deep knowledge of distributed systems, database connection pooling, Kubernetes pod scheduling, and CDN edge caching. Your job is to form the most likely hypothesis given the evidence, not the most complete one.” The specificity steers the model away from listing every possible cause and toward a confident, ranked assessment.
Evidence Framing in the User Message
Structure the user message in clearly labeled sections. Claude performs better when the evidence is organized rather than concatenated. Use section headers like [ALERT PAYLOAD], [RECENT LOGS (last 50 lines)], and [SERVICE METADATA]. Tell the model explicitly what you want it to weight: “The log lines are the most time-specific evidence. Use them to anchor your hypothesis timeline.”
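The [RECENT LOGS] section is only as good as the lines that go into it. The sketch below shows one way to pick them: keep lines that mention an affected service and fall inside a window before the alert fired. `select_log_lines` is a hypothetical helper, and it assumes each line starts with an ISO-8601 UTC timestamp ending in `Z`; adapt the parsing to your own log format.

```python
from datetime import datetime, timedelta


def select_log_lines(
    raw_logs: str,
    services: list[str],
    alert_time: str,
    window_minutes: int = 10,
    max_lines: int = 50,
) -> str:
    """Keep lines that mention one of `services` within the window before the alert."""
    fired = datetime.fromisoformat(alert_time.replace("Z", "+00:00"))
    earliest = fired - timedelta(minutes=window_minutes)
    kept = []
    for line in raw_logs.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) < 2:
            continue
        try:
            ts = datetime.fromisoformat(parts[0].replace("Z", "+00:00"))
        except ValueError:
            continue  # skip lines without a leading ISO-8601 timestamp
        if earliest <= ts <= fired and any(svc in parts[1] for svc in services):
            kept.append(line)
    # Newest lines are at the bottom, so keep the tail.
    return "\n".join(kept[-max_lines:])
```

In practice the same filtering is usually cheaper to do in the log query itself; the function is mainly a guard against dumping an unfiltered export into the prompt.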
Getting the Hypothesis Quality Right
Ask for a primary hypothesis and two alternative hypotheses, each with a confidence level (high/medium/low) and a one-sentence rationale. This forces the model to express uncertainty honestly and gives the on-call engineer an immediate sense of whether to trust the output. A high-confidence hypothesis with clear log evidence is a strong signal to act. A low-confidence primary with two plausible alternatives is a signal to gather more data first.
The Full POC: Incident Response Copilot in Python
The project below is a single Python file plus a small supporting module for the service catalog. It accepts an alert payload and log lines (from files or stdin), calls Claude with prompt caching enabled and structured output enforced, and prints the full triage package as both human-readable text and a JSON dump.
This builds directly on patterns from Part 2: Tool Use with Claude and Part 3: Structured Output from Claude. If you have not read those, the code below is self-contained, but those articles explain the mechanics in depth.
Installation and Requirements
pip install anthropic python-dotenv
# requirements.txt
anthropic>=0.27.0
python-dotenv>=1.0.0
# .env.example
ANTHROPIC_API_KEY=sk-ant-your-key-here
Service Catalog Module
# service_catalog.py
"""
Static service catalog for the incident response copilot.
In production this would be loaded from a CMDB or a YAML file.
Kept as a Python dict here so the POC is self-contained.
"""
SERVICE_CATALOG = {
"api-gateway": {
"description": "Kong-based API gateway, entry point for all external traffic.",
"owner": "platform-team",
"slo_availability": "99.95%",
"dependencies": ["auth-service", "rate-limiter", "upstream-postgres"],
"typical_failure_modes": [
"upstream auth-service timeout causing 502 storms",
"rate-limiter Redis connection pool exhaustion",
"TLS cert expiry on the upstream backend",
"upstream-postgres connection pool saturation during peak load",
],
"runbook_url": "https://wiki.internal/runbooks/api-gateway",
"pagerduty_escalation": "platform-team-p1",
},
"auth-service": {
"description": "JWT issuance and validation service backed by Postgres.",
"owner": "identity-team",
"slo_availability": "99.99%",
"dependencies": ["auth-postgres", "redis-session"],
"typical_failure_modes": [
"auth-postgres slow query causing p99 latency spike",
"redis-session eviction under memory pressure",
"JWT signing key rotation mis-configuration",
"pod OOMKilled due to session cache growth",
],
"runbook_url": "https://wiki.internal/runbooks/auth-service",
"pagerduty_escalation": "identity-team-p1",
},
"checkout-service": {
"description": "Payment processing microservice. PCI-scoped.",
"owner": "payments-team",
"slo_availability": "99.9%",
"dependencies": ["payment-gateway-external", "orders-postgres", "inventory-service"],
"typical_failure_modes": [
"payment-gateway-external rate limit or outage",
"orders-postgres replication lag causing stale reads",
"inventory-service timeout cascading into checkout failures",
"connection pool exhaustion under Black Friday-style load spikes",
],
"runbook_url": "https://wiki.internal/runbooks/checkout-service",
"pagerduty_escalation": "payments-team-p1",
},
"inventory-service": {
"description": "Real-time inventory count and reservation service.",
"owner": "catalog-team",
"slo_availability": "99.9%",
"dependencies": ["inventory-postgres", "redis-inventory"],
"typical_failure_modes": [
"redis-inventory flapping causing cache miss storm on Postgres",
"inventory-postgres disk space exhaustion from audit log table",
"deadlock on row-level locking during concurrent reservation",
],
"runbook_url": "https://wiki.internal/runbooks/inventory-service",
"pagerduty_escalation": "catalog-team-p1",
},
}
INCIDENT_SEVERITY_GUIDE = """
SEV-1: Customer-facing feature fully down for >5% of users. Revenue impact. Immediate page.
SEV-2: Customer-facing feature degraded or fully down for <5% of users. Page within 15 min.
SEV-3: Internal tooling or non-critical feature degraded. Ticket, no page.
SEV-4: Minor glitch, auto-recovering. Log and monitor.
"""
def get_catalog_text() -> str:
    """Return a formatted text representation of the service catalog for the system prompt."""
    lines = ["=== SERVICE CATALOG ==="]
    for svc, meta in SERVICE_CATALOG.items():
        lines.append(f"\nService: {svc}")
        lines.append(f" Description: {meta['description']}")
        lines.append(f" Owner: {meta['owner']}")
        lines.append(f" SLO: {meta['slo_availability']} availability")
        lines.append(f" Dependencies: {', '.join(meta['dependencies'])}")
        lines.append(" Known failure modes:")
        for mode in meta["typical_failure_modes"]:
            lines.append(f" - {mode}")
        lines.append(f" Runbook: {meta['runbook_url']}")
    return "\n".join(lines)
Main Copilot Script
# incident_copilot.py
"""
Incident Response Copilot using Claude.
Usage:
python incident_copilot.py --alert alert.json --logs recent.log
python incident_copilot.py --demo # runs with bundled demo data
"""
import os
import sys
import json
import argparse
import textwrap
from typing import Any
import anthropic
from dotenv import load_dotenv
from service_catalog import get_catalog_text, INCIDENT_SEVERITY_GUIDE
load_dotenv()
# ---------------------------------------------------------------------------
# Tool schema: defines the structured output shape Claude must return
# ---------------------------------------------------------------------------
TRIAGE_TOOL = {
"name": "triage_result",
"description": (
"Return a complete incident triage package. "
"Every field is required. Use the evidence to fill each field accurately."
),
"input_schema": {
"type": "object",
"properties": {
"severity": {
"type": "string",
"enum": ["SEV-1", "SEV-2", "SEV-3", "SEV-4"],
"description": "Incident severity based on customer impact.",
},
"affected_services": {
"type": "array",
"items": {"type": "string"},
"description": "List of service names that are failing or degraded.",
},
"primary_hypothesis": {
"type": "object",
"properties": {
"title": {"type": "string", "description": "One-line summary of the root cause theory."},
"confidence": {
"type": "string",
"enum": ["high", "medium", "low"],
"description": "How strongly the available evidence supports this hypothesis.",
},
"rationale": {
"type": "string",
"description": "2-4 sentences citing specific evidence (log lines, alert fields) that support this hypothesis.",
},
"evidence_quotes": {
"type": "array",
"items": {"type": "string"},
"description": "Verbatim excerpts from the logs or alert that most directly support this hypothesis.",
},
},
"required": ["title", "confidence", "rationale", "evidence_quotes"],
},
"alternative_hypotheses": {
"type": "array",
"maxItems": 3,
"items": {
"type": "object",
"properties": {
"title": {"type": "string"},
"confidence": {"type": "string", "enum": ["high", "medium", "low"]},
"rationale": {"type": "string"},
},
"required": ["title", "confidence", "rationale"],
},
"description": "Up to 3 alternative hypotheses ordered by descending confidence.",
},
"runbook_steps": {
"type": "array",
"items": {
"type": "object",
"properties": {
"step_number": {"type": "integer"},
"action": {"type": "string", "description": "What to do."},
"command_or_url": {
"type": "string",
"description": "Specific CLI command, dashboard URL, or API call if applicable. Empty string if not applicable.",
},
"expected_outcome": {"type": "string", "description": "What you expect to see if this step confirms or resolves the issue."},
},
"required": ["step_number", "action", "command_or_url", "expected_outcome"],
},
"description": "Ordered remediation steps for the primary hypothesis.",
},
"status_page_update": {
"type": "object",
"properties": {
"title": {"type": "string", "description": "Short incident title for the status page (under 80 chars)."},
"body": {
"type": "string",
"description": (
"Public-facing status update. Factual, calm tone. "
"Do not include internal hypotheses. "
"Do not mention specific internal service names. "
"State what customers experience and that the team is investigating."
),
},
"component": {"type": "string", "description": "Affected public-facing component name."},
"status": {
"type": "string",
"enum": ["investigating", "identified", "monitoring", "resolved"],
},
},
"required": ["title", "body", "component", "status"],
},
"immediate_escalations": {
"type": "array",
"items": {"type": "string"},
"description": "Team or person names to page or notify immediately based on affected services.",
},
"data_needed": {
"type": "array",
"items": {"type": "string"},
"description": "Additional data or metrics to gather to increase hypothesis confidence.",
},
},
"required": [
"severity",
"affected_services",
"primary_hypothesis",
"alternative_hypotheses",
"runbook_steps",
"status_page_update",
"immediate_escalations",
"data_needed",
],
},
}
# ---------------------------------------------------------------------------
# System prompt builder (the large stable block that gets cached)
# ---------------------------------------------------------------------------
def build_system_prompt() -> list[dict]:
    """
    Return the system prompt as a list of content blocks.
    The large catalog block has cache_control set so Claude caches it.
    """
    # The prompt text stays flush left so no stray indentation ends up in the prompt.
    catalog_and_guide = f"""
You are a senior site reliability engineer acting as an incident response copilot.
Your job is to analyze the incoming alert and log evidence and produce a structured
triage package that an on-call engineer can act on immediately.
Guidelines:
- Form the MOST LIKELY hypothesis first, ranked by evidence strength, not by worst-case scenario.
- Cite specific log lines or alert fields for every claim. Do not speculate beyond the evidence.
- Runbook steps must be concrete and ordered. Assume the engineer has shell access and dashboard access.
- The status page update must be written for customers, not engineers. No internal jargon.
- Express uncertainty explicitly. A medium-confidence hypothesis is better than a false high.
- If the logs are insufficient to form a hypothesis, say so in the rationale and populate data_needed.
{INCIDENT_SEVERITY_GUIDE}
{get_catalog_text()}
""".strip()
    return [
        {
            "type": "text",
            "text": catalog_and_guide,
            "cache_control": {"type": "ephemeral"},
        }
    ]
# ---------------------------------------------------------------------------
# User message builder
# ---------------------------------------------------------------------------
def build_user_message(alert_payload: dict, log_lines: str, extra_context: str = "") -> str:
    parts = [
        "[ALERT PAYLOAD]",
        json.dumps(alert_payload, indent=2),
        "",
        "[RECENT LOGS (last 100 lines, newest at bottom)]",
        log_lines.strip(),
    ]
    if extra_context:
        parts += ["", "[ADDITIONAL CONTEXT]", extra_context.strip()]
    parts += [
        "",
        "Analyze the above and call the triage_result tool with your structured assessment.",
        "Weight the log lines most heavily as they are the most time-specific evidence.",
    ]
    return "\n".join(parts)
# ---------------------------------------------------------------------------
# Core triage call
# ---------------------------------------------------------------------------
def run_triage(
    alert_payload: dict,
    log_lines: str,
    extra_context: str = "",
    verbose: bool = False,
) -> dict[str, Any]:
    client = anthropic.Anthropic()
    system = build_system_prompt()
    user_content = build_user_message(alert_payload, log_lines, extra_context)
    try:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            system=system,
            tools=[TRIAGE_TOOL],
            tool_choice={"type": "tool", "name": "triage_result"},
            messages=[{"role": "user", "content": user_content}],
        )
    except anthropic.APIError as exc:
        print(f"[ERROR] Claude API call failed: {exc}", file=sys.stderr)
        raise
    if verbose:
        print(f"[DEBUG] stop_reason={response.stop_reason}", file=sys.stderr)
        print(
            f"[DEBUG] tokens: input={response.usage.input_tokens} "
            f"output={response.usage.output_tokens} "
            f"cache_creation={getattr(response.usage, 'cache_creation_input_tokens', 'n/a')} "
            f"cache_read={getattr(response.usage, 'cache_read_input_tokens', 'n/a')}",
            file=sys.stderr,
        )
    # Extract structured result from tool use block
    for block in response.content:
        if block.type == "tool_use" and block.name == "triage_result":
            return block.input
    raise ValueError("Claude did not return a triage_result tool call. Response: " + str(response))
# ---------------------------------------------------------------------------
# Pretty-print helper
# ---------------------------------------------------------------------------
def print_triage(result: dict) -> None:
    sep = "-" * 60
    print(f"\n{'=' * 60}")
    print(" INCIDENT TRIAGE REPORT")
    print(f"{'=' * 60}")
    print(f"\nSEVERITY : {result['severity']}")
    print(f"AFFECTED : {', '.join(result['affected_services'])}")
    print(f"\n{sep}")
    print("PRIMARY HYPOTHESIS")
    print(sep)
    hyp = result["primary_hypothesis"]
    print(f" Title : {hyp['title']}")
    print(f" Confidence : {hyp['confidence'].upper()}")
    print(f" Rationale : {textwrap.fill(hyp['rationale'], width=70, subsequent_indent=' ')}")
    if hyp.get("evidence_quotes"):
        print(" Evidence :")
        for q in hyp["evidence_quotes"]:
            print(f" > {q}")
    if result.get("alternative_hypotheses"):
        print(f"\n{sep}")
        print("ALTERNATIVE HYPOTHESES")
        print(sep)
        for alt in result["alternative_hypotheses"]:
            print(f" [{alt['confidence'].upper()}] {alt['title']}")
            print(f" {textwrap.fill(alt['rationale'], width=66, subsequent_indent=' ')}")
    print(f"\n{sep}")
    print("RUNBOOK STEPS")
    print(sep)
    for step in result["runbook_steps"]:
        print(f" {step['step_number']}. {step['action']}")
        if step.get("command_or_url"):
            print(f" CMD: {step['command_or_url']}")
        print(f" Expected: {step['expected_outcome']}")
    print(f"\n{sep}")
    print("STATUS PAGE UPDATE")
    print(sep)
    sp = result["status_page_update"]
    print(f" Title : {sp['title']}")
    print(f" Component : {sp['component']}")
    print(f" Status : {sp['status']}")
    print(" Body:")
    print(textwrap.fill(sp["body"], width=70, initial_indent=" ", subsequent_indent=" "))
    if result.get("immediate_escalations"):
        print(f"\n{sep}")
        print("ESCALATE NOW")
        print(sep)
        for esc in result["immediate_escalations"]:
            print(f" - {esc}")
    if result.get("data_needed"):
        print(f"\n{sep}")
        print("GATHER MORE DATA")
        print(sep)
        for item in result["data_needed"]:
            print(f" - {item}")
    print(f"\n{'=' * 60}\n")
# ---------------------------------------------------------------------------
# Demo data
# ---------------------------------------------------------------------------
DEMO_ALERT = {
"alert_id": "AGT-20240603-0412",
"fired_at": "2024-06-03T04:12:38Z",
"alert_name": "checkout_service_error_rate_high",
"service": "checkout-service",
"environment": "production",
"severity_label": "critical",
"threshold": "error_rate > 5% for 3m",
"current_value": "error_rate=18.4%",
"runbook_hint": "https://wiki.internal/runbooks/checkout-service",
"labels": {
"namespace": "payments",
"cluster": "prod-us-east-1",
"pod_count_healthy": 3,
"pod_count_total": 5,
},
}
DEMO_LOGS = """
2024-06-03T04:09:11Z INFO checkout-service[payments-5f9b] Order 88821 received, starting payment flow
2024-06-03T04:09:12Z INFO checkout-service[payments-5f9b] Calling inventory-service /reserve itemId=SK-440
2024-06-03T04:09:12Z ERROR checkout-service[payments-5f9b] inventory-service connection timeout after 3000ms (attempt 1/3)
2024-06-03T04:09:15Z ERROR checkout-service[payments-5f9b] inventory-service connection timeout after 3000ms (attempt 2/3)
2024-06-03T04:09:18Z ERROR checkout-service[payments-5f9b] inventory-service connection timeout after 3000ms (attempt 3/3)
2024-06-03T04:09:18Z ERROR checkout-service[payments-5f9b] inventory-service exhausted retries, order 88821 failed with 503
2024-06-03T04:09:19Z WARN checkout-service[payments-7a2c] Circuit breaker OPEN for inventory-service (threshold: 10 failures/30s)
2024-06-03T04:09:20Z ERROR checkout-service[payments-7a2c] Circuit breaker OPEN: fast-failing order 88825
2024-06-03T04:09:20Z ERROR checkout-service[payments-6d1f] Circuit breaker OPEN: fast-failing order 88826
2024-06-03T04:09:21Z INFO inventory-service[inv-3c8e] redis-inventory connection error: ECONNREFUSED 10.0.4.22:6379
2024-06-03T04:09:21Z ERROR inventory-service[inv-3c8e] Failed to acquire redis lock for reservation, falling back to Postgres
2024-06-03T04:09:21Z ERROR inventory-service[inv-3c8e] Postgres query timeout 5012ms (threshold 1000ms) on SELECT ... FOR UPDATE
2024-06-03T04:09:22Z ERROR inventory-service[inv-4b7d] redis-inventory connection error: ECONNREFUSED 10.0.4.22:6379
2024-06-03T04:09:22Z ERROR inventory-service[inv-4b7d] Postgres query timeout 4988ms on SELECT ... FOR UPDATE
2024-06-03T04:09:23Z WARN inventory-service[inv-3c8e] Active Postgres connections: 98/100 (98% pool utilization)
2024-06-03T04:09:23Z ERROR inventory-service[inv-4b7d] Active Postgres connections: 99/100 (99% pool utilization)
2024-06-03T04:09:24Z ERROR inventory-service[inv-5c9a] Cannot acquire DB connection: pool exhausted (100/100)
2024-06-03T04:09:24Z FATAL inventory-service[inv-5c9a] Returning 503 to caller: no DB connections available
2024-06-03T04:09:25Z ERROR checkout-service[payments-5f9b] inventory-service returned 503 for order 88830
2024-06-03T04:09:30Z WARN checkout-service[payments-5f9b] 45 orders failed in last 60s (threshold: 10)
2024-06-03T04:12:35Z INFO alertmanager Firing: checkout_service_error_rate_high (18.4% errors, threshold 5%)
"""
DEMO_CONTEXT = "PagerDuty shows no other alerts in the payments namespace. Last deployment to checkout-service was 6 hours ago (no config changes). inventory-service had a Redis cluster maintenance window scheduled for 04:00-04:30 UTC today."
# ---------------------------------------------------------------------------
# CLI entry point
# ---------------------------------------------------------------------------
def main() -> None:
    parser = argparse.ArgumentParser(description="Incident Response Copilot using Claude")
    parser.add_argument("--alert", help="Path to alert payload JSON file")
    parser.add_argument("--logs", help="Path to recent log lines text file")
    parser.add_argument("--context", help="Path to additional context text file", default="")
    parser.add_argument("--demo", action="store_true", help="Run with bundled demo data")
    parser.add_argument("--json-out", help="Write raw JSON result to this file")
    parser.add_argument("--verbose", action="store_true", help="Print debug info including token counts")
    args = parser.parse_args()
    if args.demo:
        alert = DEMO_ALERT
        logs = DEMO_LOGS
        context = DEMO_CONTEXT
    elif args.alert and args.logs:
        with open(args.alert) as f:
            alert = json.load(f)
        with open(args.logs) as f:
            logs = f.read()
        context = ""
        if args.context:
            with open(args.context) as f:
                context = f.read()
    else:
        parser.print_help()
        sys.exit(1)
    print("Analyzing incident... ", end="", flush=True)
    result = run_triage(alert, logs, extra_context=context, verbose=args.verbose)
    print("done.")
    print_triage(result)
    if args.json_out:
        with open(args.json_out, "w") as f:
            json.dump(result, f, indent=2)
        print(f"[INFO] Raw JSON written to {args.json_out}")


if __name__ == "__main__":
    main()
Sample Run with Demo Data
Run the demo with:
python incident_copilot.py --demo --verbose
Expected output (trimmed for clarity):
[DEBUG] stop_reason=tool_use
[DEBUG] tokens: input=2847 output=1203 cache_creation=2401 cache_read=0
============================================================
INCIDENT TRIAGE REPORT
============================================================
SEVERITY : SEV-1
AFFECTED : checkout-service, inventory-service, redis-inventory, inventory-postgres
------------------------------------------------------------
PRIMARY HYPOTHESIS
------------------------------------------------------------
Title : Redis cluster maintenance window caused cache miss storm, exhausting inventory-postgres connection pool
Confidence : HIGH
Rationale : Log lines from 04:09:21Z show inventory-service pods getting ECONNREFUSED on redis-inventory (10.0.4.22:6379),
matching the scheduled 04:00-04:30 maintenance window. With Redis unavailable, all reservation
requests fell back to Postgres. The pool climbed from 98/100 to 100/100 in under 3 seconds,
triggering inventory 503s, which opened the checkout-service circuit breaker, causing the 18.4%
error rate seen in the alert.
Evidence :
> inventory-service[inv-3c8e] redis-inventory connection error: ECONNREFUSED 10.0.4.22:6379
> inventory-service[inv-3c8e] Active Postgres connections: 98/100 (98% pool utilization)
> inventory-service[inv-5c9a] Cannot acquire DB connection: pool exhausted (100/100)
> checkout-service[payments-7a2c] Circuit breaker OPEN for inventory-service
------------------------------------------------------------
ALTERNATIVE HYPOTHESES
------------------------------------------------------------
[LOW] inventory-postgres query regression causing independent slowness
SELECT ... FOR UPDATE timeouts at 5s suggest possible lock contention or missing
index, but the Redis outage is the more parsimonious explanation for the timing.
[LOW] checkout-service deployment introduced a bug increasing reservation call volume
Last deploy was 6h ago with no config changes; timing aligns with maintenance, not deploy.
------------------------------------------------------------
RUNBOOK STEPS
------------------------------------------------------------
1. Confirm Redis cluster status
CMD: redis-cli -h 10.0.4.22 -p 6379 PING
Expected: If maintenance is still in progress you will get ECONNREFUSED or timeout. If complete, PONG.
2. Check maintenance window status with the Redis team
CMD: (contact catalog-team or check #infra-maintenance Slack channel)
Expected: Confirm 04:00-04:30 window scope and whether Redis is back online.
3. Monitor inventory-service Postgres connection pool
CMD: kubectl exec -n payments deploy/inventory-service -- curl localhost:8080/metrics | grep db_pool
Expected: Pool utilization dropping below 80% as Redis reconnects and traffic shifts back from Postgres.
4. Check circuit breaker state on checkout-service
CMD: kubectl exec -n payments deploy/checkout-service -- curl localhost:8080/actuator/health
Expected: circuit-breaker for inventory-service should transition from OPEN to HALF_OPEN to CLOSED.
5. If Redis maintenance ended and pool is still exhausted, restart inventory-service pods
CMD: kubectl rollout restart deployment/inventory-service -n payments
Expected: Fresh pods reconnect to Redis, bypassing the exhausted connection pool state.
6. Verify error rate drops below threshold after circuit breaker closes
CMD: (watch Grafana: checkout_service_error_rate panel)
Expected: Error rate drops below 5% within 2-3 minutes of inventory-service recovery.
------------------------------------------------------------
STATUS PAGE UPDATE
------------------------------------------------------------
Title : Elevated checkout errors affecting some orders
Component : Checkout
Status : investigating
Body:
We are currently investigating elevated error rates affecting a
portion of checkout attempts. Some customers may experience failed
order placements. Our team has identified the issue and is working
to restore full service. We will provide an update within 15
minutes.
------------------------------------------------------------
ESCALATE NOW
------------------------------------------------------------
- payments-team-p1 (checkout-service owner)
- catalog-team-p1 (inventory-service owner)
------------------------------------------------------------
GATHER MORE DATA
------------------------------------------------------------
- Redis cluster status page or maintenance ticket to confirm 04:30 end time
- Postgres slow query log for inventory-postgres to rule out independent query regression
- Number of orders successfully processed vs failed in the window (business impact quantification)
============================================================
Integrating the Copilot with Your Existing Stack
PagerDuty Webhook Integration
PagerDuty’s Event Rules can call a webhook on any new incident. Your Lambda or Cloud Run function receives the payload, fetches the last N log lines from your log aggregator (Datadog, Loki, CloudWatch), pulls service metadata from a YAML file or API, and calls run_triage(). The structured result goes into three places simultaneously: a Slack thread on the incident channel, a Jira ticket body, and a draft Statuspage incident (held in “investigating” status until a human confirms and publishes).
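The Slack leg of that fan-out mostly comes down to rendering the structured result as text. A minimal sketch, assuming the result shape produced by the triage_result tool: `runbook_to_slack_text` is a hypothetical helper, and the actual posting call is left to whichever Slack client you already use.

```python
def runbook_to_slack_text(result: dict) -> str:
    """Render severity, hypothesis, and runbook steps as Slack-style mrkdwn text."""
    hyp = result["primary_hypothesis"]
    lines = [
        f"*{result['severity']}* | {', '.join(result['affected_services'])}",
        f"*Hypothesis ({hyp['confidence']}):* {hyp['title']}",
        "*Runbook:*",
    ]
    for step in result["runbook_steps"]:
        lines.append(f"{step['step_number']}. {step['action']}")
        if step.get("command_or_url"):
            # Inline code formatting keeps commands copy-pasteable.
            lines.append(f"    `{step['command_or_url']}`")
    return "\n".join(lines)
```

The same dict can feed the Jira description and the Statuspage draft; only the rendering differs per destination.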
Streaming for Real-Time Display
For a CLI or terminal dashboard experience, you can stream the response as it is generated by using Claude’s streaming API. See Part 26: Streaming Responses with Claude for the full pattern. Note that streaming and forced tool use work together: with a forced tool call the stream carries the tool input as partial JSON fragments (input_json_delta events) rather than prose, so you accumulate the fragments and parse the JSON once the tool_use block is complete.
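The accumulation step can be sketched without an API call. The function below assumes stream events shaped like the Messages streaming events, where a `content_block_delta` event carries a delta of type `input_json_delta` with a `partial_json` string; the `SimpleNamespace` objects are stand-ins for illustration, and `collect_tool_input` is a hypothetical helper.

```python
import json
from types import SimpleNamespace


def collect_tool_input(events) -> dict:
    """Join input_json_delta fragments from a stream and parse them once complete."""
    fragments = []
    for event in events:
        if event.type == "content_block_delta" and event.delta.type == "input_json_delta":
            fragments.append(event.delta.partial_json)
    return json.loads("".join(fragments))


# Stand-in events for illustration only; a real stream comes from the SDK.
fake_events = [
    SimpleNamespace(type="content_block_start"),
    SimpleNamespace(
        type="content_block_delta",
        delta=SimpleNamespace(type="input_json_delta", partial_json='{"severity": "SE'),
    ),
    SimpleNamespace(
        type="content_block_delta",
        delta=SimpleNamespace(type="input_json_delta", partial_json='V-1"}'),
    ),
    SimpleNamespace(type="content_block_stop"),
]
print(collect_tool_input(fake_events))  # prints: {'severity': 'SEV-1'}
```

Individual fragments are not valid JSON on their own, so a live display would typically show progress (for example, a byte count) and render the fields only after the final parse.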
Multi-Alert Batching During Storms
When 15 alerts fire at once, you do not want 15 sequential Claude calls. Use Python’s asyncio or a thread pool to run triage calls in parallel. The prompt cache means the second through fifteenth calls all read the same cached system prompt rather than paying creation costs again. The pattern is covered in depth in Part 27: Cut AI Costs: Model Routing and Batching with Claude.
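A bounded fan-out with asyncio might look like the sketch below. `triage_many` and `fake_triage` are hypothetical names; `fake_triage` stands in for an async version of the triage call (for example, one built on `anthropic.AsyncAnthropic`), and the limit of 5 is only a starting point.

```python
import asyncio


async def triage_many(alerts, triage_one, max_concurrency: int = 5):
    """Run `triage_one(alert)` for every alert with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(alert):
        async with semaphore:
            return await triage_one(alert)

    # return_exceptions=True keeps one failed call from discarding the other results.
    return await asyncio.gather(*(guarded(a) for a in alerts), return_exceptions=True)


async def fake_triage(alert: dict) -> dict:
    # Stand-in for an async Claude call; sleeps briefly instead of hitting the API.
    await asyncio.sleep(0.01)
    return {"alert_id": alert["alert_id"], "severity": "SEV-2"}


alerts = [{"alert_id": f"A-{i}"} for i in range(15)]
results = asyncio.run(triage_many(alerts, fake_triage))
print(len(results), results[0])  # prints: 15 {'alert_id': 'A-0', 'severity': 'SEV-2'}
```

One caveat worth checking against the prompt caching documentation: a cache entry is only readable once the request that writes it has begun responding, so calls launched at the same instant may each pay the write cost. Letting the first call finish before fanning out the rest is one way to avoid that.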
Common Pitfalls
- Sending too many log lines. Dumping 10,000 lines into the context is wasteful and often counterproductive. The model’s attention is finite. Send the 50 to 200 lines closest in time to the alert, filtered to the affected service. Pre-filter with a quick grep or a log query before calling the API.
- Trusting high-confidence outputs blindly. Claude can be confidently wrong when the log evidence is ambiguous or when the true cause is something not in the service catalog (a network partition, a third-party API outage, a hardware failure). Always treat the output as a hypothesis, not a diagnosis.
- Leaking internal hostnames or IPs into status page drafts. The structured output schema helps here: the status_page_update field has explicit instructions not to include internal service names. But audit the output before posting. Add a post-processing step that scans for known internal hostnames.
- Not bumping max_tokens for complex incidents. A service with 8 dependencies and a 150-line log can produce a triage result that exceeds 1,024 tokens. The POC uses 4,096 to be safe. Hitting the token limit mid-JSON produces an unparseable response.
- Re-using the same client instance across event loop iterations without handling rate limits. Add exponential backoff on anthropic.RateLimitError. For alert storms, consider a local queue that throttles to 5 concurrent Claude calls rather than 50 simultaneous ones.
- Skipping the cache on the first call after a redeploy. The prompt cache is ephemeral (5 minutes by default on the API). In a Lambda function that is invoked less often than that, you may see cache creation costs on every invocation. Cache writes are billed at a premium over ordinary input tokens, so a cache that never hits costs slightly more than not caching at all; budget for it.
- Putting dynamic data in the system prompt. If you include the current time, the alert ID, or any per-alert field in the system prompt, the cache will never hit. All dynamic content belongs in the user message.
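Two of the pitfalls above lend themselves to small guard functions. The sketch below shows one possible shape: `select_log_window` keeps only the lines nearest the alert for the affected service, and `find_internal_hostnames` flags internal names in a status page draft before it is posted. Both function names, the `(timestamp, service, message)` tuple layout, and the substring-matching approach are assumptions for illustration.

```python
from datetime import datetime

LogEntry = tuple[datetime, str, str]  # (timestamp, service, message)


def select_log_window(
    entries: list[LogEntry],
    alert_time: datetime,
    service: str,
    max_lines: int = 200,
) -> list[LogEntry]:
    """Keep the max_lines entries for `service` closest in time to the alert."""
    matching = [e for e in entries if e[1] == service]
    # Rank by distance from the alert, keep the nearest, restore time order.
    matching.sort(key=lambda e: abs(e[0] - alert_time))
    return sorted(matching[:max_lines], key=lambda e: e[0])


def find_internal_hostnames(text: str, internal_names: list[str]) -> list[str]:
    """Return every known internal name that appears in the draft text."""
    lowered = text.lower()
    return sorted(name for name in internal_names if name.lower() in lowered)
```

A non-empty result from `find_internal_hostnames` is a reason to block the post and send the draft back to a human, not to redact silently.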
Cost and Latency
The numbers below are based on the demo scenario: a 2,847-token system prompt (with 2,401 tokens cached after the first call) and a 1,203-token output. Costs are at published Claude API rates as of mid-2025; check anthropic.com/pricing for current figures.
| Scenario | Model | Input tokens | Cache hit | Output tokens | Approx cost | Latency (p50) |
|---|---|---|---|---|---|---|
| First call (cache miss) | claude-sonnet-4-6 | 2,847 | No | 1,200 | ~$0.015 | 2.8s |
| Repeat call (cache hit) | claude-sonnet-4-6 | 2,847 (446 unique) | Yes (2,401) | 1,200 | ~$0.004 | 1.6s |
| Post-incident report | claude-opus-4-8 | 6,000 | Yes (4,800) | 2,400 | ~$0.11 | 8s |
| Alert severity classification only | claude-haiku-4-5 | 800 | N/A | 50 | ~$0.0002 | 0.4s |
For a heavy on-call night with 30 triage calls, all using the cached system prompt, total cost is roughly $0.15 to $0.35 depending on output length. That is well under the cost of one engineer-minute at a senior SWE salary, and it provides value well before that engineer opens their first dashboard.
| Task | Recommended model | Reason |
|---|---|---|
| Real-time triage during live incident | claude-sonnet-4-6 | Best balance of reasoning quality and latency under 3s |
| Severity classification (routing only) | claude-haiku-4-5 | Sub-500ms, cost near zero, sufficient for binary severity judgement |
| Post-incident root cause analysis | claude-opus-4-8 | Deeper causal reasoning justifies higher cost when not time-critical |
| Batch processing past incidents for pattern detection | claude-haiku-4-5 or claude-sonnet-4-6 | Use Anthropic’s batch API to cut costs 50% further on async work |
Beyond the MVP: Where to Take This Next
The POC above gets you to a working copilot in an afternoon. Here is where teams typically extend it once they see value:
Feedback Loop for Hypothesis Quality
After an incident closes, record whether the primary hypothesis was correct, partially correct, or wrong. Feed this signal back as examples in the system prompt (a few-shot section after the service catalog). Over 20 to 30 incidents, hypothesis accuracy improves noticeably for your specific service topology. This is the same pattern used in Part 24: Evaluate Your Claude App.
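A minimal sketch of turning closed-incident records into that few-shot section follows. The record fields (`alert_summary`, `hypothesis`, `outcome`, `actual_cause`), the function name, and the text layout are all hypothetical; adapt them to whatever your incident tracker stores.

```python
from typing import TypedDict


class IncidentRecord(TypedDict):
    alert_summary: str
    hypothesis: str
    outcome: str  # "correct", "partially_correct", or "wrong"
    actual_cause: str


def build_few_shot_section(
    records: list[IncidentRecord], max_examples: int = 5
) -> str:
    """Render past incidents as a text block for the system prompt."""
    lines = ["PAST INCIDENT EXAMPLES"]
    for i, rec in enumerate(records[:max_examples], start=1):
        lines.append(f"Example {i}")
        lines.append(f"  Alert: {rec['alert_summary']}")
        lines.append(f"  Hypothesis given: {rec['hypothesis']}")
        lines.append(f"  Outcome: {rec['outcome']}")
        # Only spell out the real cause when the hypothesis missed.
        if rec["outcome"] != "correct":
            lines.append(f"  Actual cause: {rec['actual_cause']}")
    return "\n".join(lines)
```

Because this block changes only when an incident closes, it can sit in the cached part of the system prompt; regenerate it on a schedule rather than per alert.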
Tool Use for Live Data Fetching
Instead of pre-fetching logs before calling Claude, give Claude tools to fetch them itself: a get_logs tool that calls your Loki API, a get_metrics tool that queries Prometheus, a get_deployment_history tool that hits your CI system. This turns the copilot into an active investigator, not just a classifier. The pattern is covered in Part 2: Tool Use with Claude.
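For illustration, the three tools might be declared like this. The tool names come from the paragraph above; every parameter name and description is an assumption, and the handlers that actually call Loki, Prometheus, or your CI system are left out. Each entry uses the `name` / `description` / `input_schema` shape that Claude's tool use expects, with `input_schema` written as JSON Schema.

```python
# Hypothetical tool declarations; parameter names are illustrative only.
INVESTIGATION_TOOLS = [
    {
        "name": "get_logs",
        "description": "Fetch recent log lines for one service.",
        "input_schema": {
            "type": "object",
            "properties": {
                "service": {"type": "string"},
                "minutes": {"type": "integer", "minimum": 1},
                "query": {"type": "string"},
            },
            "required": ["service", "minutes"],
        },
    },
    {
        "name": "get_metrics",
        "description": "Run a PromQL query over a recent time window.",
        "input_schema": {
            "type": "object",
            "properties": {
                "promql": {"type": "string"},
                "minutes": {"type": "integer", "minimum": 1},
            },
            "required": ["promql"],
        },
    },
    {
        "name": "get_deployment_history",
        "description": "List the most recent deployments of a service.",
        "input_schema": {
            "type": "object",
            "properties": {
                "service": {"type": "string"},
                "limit": {"type": "integer", "minimum": 1},
            },
            "required": ["service"],
        },
    },
]
```

Keep the descriptions specific: the model decides which tool to call largely from these strings, so "recent log lines for one service" steers it better than "logs".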
Autonomous Agent Loop
Push the concept further: an agent that receives an alert, fetches its own evidence, forms a hypothesis, executes a safe read-only diagnostic command (like checking pod status or querying connection pool metrics), observes the result, and refines the hypothesis. This is the territory of Part 22: Build an Autonomous Agent Loop with Claude. The safety constraint is critical: the agent must only run read-only operations autonomously; any remediation action requires a human confirmation step.
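The read-only constraint is easiest to enforce with an allowlist checked before anything executes. The sketch below is one possible gate, not a complete sandbox: the `READ_ONLY_COMMANDS` set and the function name are hypothetical, and anything not explicitly allowed falls through to human confirmation.

```python
import shlex

# Hypothetical allowlist of (binary, subcommand) pairs considered read-only.
READ_ONLY_COMMANDS = {
    ("kubectl", "get"),
    ("kubectl", "describe"),
    ("kubectl", "logs"),
    ("kubectl", "top"),
}

# Characters that could chain or redirect commands if a shell were involved.
SHELL_METACHARACTERS = set(";|&><$`\n")


def requires_human_confirmation(command: str) -> bool:
    """Return True unless the command is on the read-only allowlist."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return True
    try:
        tokens = shlex.split(command)
    except ValueError:
        return True  # unbalanced quotes: refuse to guess
    if len(tokens) < 2:
        return True
    return (tokens[0], tokens[1]) not in READ_ONLY_COMMANDS
```

The gate defaults to "ask a human" on every ambiguous input. Pair it with execution through an argument list rather than a shell string so the allowlist check and the executed command cannot diverge.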
Observability Integration
Log every call to Claude, the input token count, the hypothesis produced, and the final outcome (correct/incorrect) to your observability stack. This gives you a real-time view of copilot performance and lets you detect drift: if hypothesis accuracy drops, the service catalog may be stale or the model behavior may have changed. See Part 28: Observability for LLM Apps for the tracing patterns.
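A small sketch of what that logging could look like: one JSON record per triage call, plus a rolling accuracy figure for drift detection. The field names, the `copilot_triage` event label, and the 20-incident window are assumptions; ship the record through whatever logger or tracer your stack already uses.

```python
import json
import time
from typing import Optional


def triage_log_record(
    alert_id: str,
    model: str,
    input_tokens: int,
    hypothesis: str,
    outcome: Optional[str] = None,
) -> str:
    """Serialize one triage call as a JSON log line."""
    return json.dumps({
        "event": "copilot_triage",
        "ts": time.time(),
        "alert_id": alert_id,
        "model": model,
        "input_tokens": input_tokens,
        "hypothesis": hypothesis,
        "outcome": outcome,  # filled in after the incident closes
    })


def hypothesis_accuracy(outcomes: list[Optional[str]], window: int = 20) -> Optional[float]:
    """Share of 'correct' among the last `window` resolved incidents."""
    resolved = [o for o in outcomes if o in ("correct", "incorrect")][-window:]
    if not resolved:
        return None  # nothing resolved yet, so no accuracy to report
    return sum(o == "correct" for o in resolved) / len(resolved)
```

Alert on the rolling figure rather than on single misses; one wrong hypothesis is noise, a falling average suggests a stale service catalog.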
Frequently Asked Questions
Can this replace PagerDuty or Opsgenie entirely?
No, and it should not try to. PagerDuty handles on-call scheduling, escalation policies, phone calls, and acknowledgment SLAs. This copilot is a layer on top of those systems that enriches the notification with a hypothesis and a runbook before the engineer even picks up the phone. Think of it as augmenting the alert, not replacing the alerting system.
How do I handle sensitive data in log lines passed to the Claude API?
Scrub PII and credentials before sending. Build a log sanitizer that strips email addresses, credit card numbers, JWT tokens, and API keys using regex before the log block is constructed. Anthropic’s data handling policies (see anthropic.com/legal/privacy) apply, but your compliance requirements may be stricter. For highly regulated environments, consider a self-hosted or private-cloud deployment path, or strip logs to stack traces and error codes only.
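As a starting point, a regex-based sanitizer might look like the sketch below. The patterns are deliberately simple assumptions, not a compliance-grade scrubber: the card pattern will also redact any 13 to 16 digit number (including millisecond timestamps), and the key pattern only catches `name=value` or `name: value` forms for the listed names.

```python
import re

# Order matters: redact JWTs and key=value secrets before the broader patterns.
_PATTERNS = [
    (re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"), "[JWT]"),
    (
        re.compile(
            r"\b(api[_-]?key|token|secret|password)\b\s*[=:]\s*\S+",
            re.IGNORECASE,
        ),
        r"\1=[REDACTED]",
    ),
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
    # 13 to 16 digits, optionally separated by spaces or hyphens.
    (re.compile(r"\b\d(?:[ -]?\d){12,15}\b"), "[CARD]"),
]


def sanitize_log_line(line: str) -> str:
    """Replace likely secrets and PII with fixed placeholders."""
    for pattern, replacement in _PATTERNS:
        line = pattern.sub(replacement, line)
    return line


def sanitize_logs(lines: list[str]) -> list[str]:
    return [sanitize_log_line(line) for line in lines]
```

Run the sanitizer on a sample of real logs and review what survives before trusting it; secrets in formats you did not anticipate will pass straight through.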
What if the log lines are from multiple services?
Pass them all, but label them clearly in the user message. Use section headers like [LOGS: checkout-service] and [LOGS: inventory-service]. Claude handles multi-service correlation well when the structure is clear. In fact, cross-service log correlation is one of the areas where the copilot adds the most value over a human reading each service’s logs separately.
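A tiny sketch of building that labeled block, assuming logs are already grouped per service in a dict (`build_log_block` is a hypothetical helper name):

```python
def build_log_block(logs_by_service: dict[str, list[str]]) -> str:
    """Join per-service logs under [LOGS: <service>] section headers."""
    sections = []
    for service, lines in logs_by_service.items():
        sections.append(f"[LOGS: {service}]\n" + "\n".join(lines))
    # A blank line between sections keeps service boundaries unambiguous.
    return "\n\n".join(sections)
```

The returned string goes into the user message, not the system prompt, since it changes on every alert.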
How does this perform during a total outage when logs are unavailable?
Tell Claude explicitly in the user message that logs are unavailable. The model will lower its hypothesis confidence accordingly and populate the data_needed field with what to gather first. A low-confidence hypothesis with clear “gather this data first” instructions is still more useful than an engineer staring at a dark dashboard with no starting point.
Can I use this for mobile app crashes, not just server incidents?
Yes. Replace the log lines with a symbolicated crash trace and the alert payload with a crash count / affected-version metric from Firebase or Sentry. Update the service catalog to describe your app versions and their known issues. The structured output schema works identically; just interpret runbook_steps in terms of your mobile release process (steps might involve a hotfix push or a feature flag toggle rather than a kubectl command).
Why not just use a fine-tuned model specific to my infrastructure?
Fine-tuning is expensive to set up and maintain, and incident patterns change as your architecture evolves. The few-shot examples approach (adding your past correct hypotheses to the system prompt) gives you most of the adaptation benefit with none of the fine-tuning operational overhead. For most teams, the service catalog and past examples in the system prompt are sufficient to get highly relevant, infrastructure-specific output.
Is claude-sonnet-4-6 accurate enough for production incident triage, or should I always use Opus?
In practice, claude-sonnet-4-6 performs well for the structured triage task described here. The model’s reasoning over evidence in a well-structured prompt is strong. claude-opus-4-8 adds value for post-incident root cause analysis where you are reasoning over longer timelines, multiple contributing factors, and potentially ambiguous causal chains. For real-time triage where latency matters, sonnet is the right default. Start with sonnet, A/B test with opus on a sample of incidents, and switch only if you see measurable accuracy improvement on your specific workloads.
External references: Anthropic Tool Use documentation, Anthropic Prompt Caching documentation, Anthropic API pricing, PagerDuty Webhook V3 reference, Grafana alerting webhook notifier.