TL;DR
- AI log analysis with Claude reduces mean time to diagnose (MTTD) by grouping hundreds of raw log lines into a handful of root-cause clusters, each ranked by severity.
- The POC uses Claude’s structured output (tool-forcing) to return a JSON payload with a fixed schema: clusters, severity scores, affected services, and proposed next actions.
- Prompt caching pins the static system prompt and tool schema so repeated triage calls pay only a fraction of the normal input price for that prefix.
- Claude Sonnet 4.6 hits the right balance for this workload: fast enough for near-real-time triage, accurate enough to spot non-obvious cross-service patterns.
- The full Python project (a single script of a few hundred lines) reads a plain-text log file, calls the API, and prints a colour-coded triage report to stdout.
- Production hardening tips cover chunk sizing for large log files, retry logic, and how to wire the output into PagerDuty or Slack.
Why AI Log Analysis Matters More Than You Think
On a quiet Wednesday morning your on-call engineer opens Slack to find 1,400 unread log lines fired by a distributed payment service. Some are noise (health-check timeouts), some are symptoms (database connection pool exhaustion), and one buried line is the actual cause (a bad certificate rotation that made the auth service reject every token silently). Finding that single line by hand takes 20 minutes if you are experienced, an hour if you are not.
That is the routine cost. When an incident spans three services and two cloud providers, the number climbs. Postmortem data from multiple SRE teams consistently shows that 30 to 50 percent of time-to-resolve is spent reading logs, not fixing anything.
AI log analysis does not replace an engineer. It compresses the reading phase from minutes to seconds, hands the engineer a structured brief (root causes, severity ranking, suggested actions), and lets them start fixing immediately. Claude is well-suited here because it reads free-form text naturally, understands stack traces, SQL errors, HTTP status codes, and cloud provider messages without custom parsers for each.
Why Log Reading Is the Slow Part of an Incident
Incident response has three rough phases: detection, diagnosis, and remediation. Detection is mostly solved by alerting rules and dashboards. Remediation is usually fast once you know the cause: restart a pod, roll back a deploy, scale a pool. The expensive middle phase is diagnosis, and the bulk of diagnosis is reading. An engineer scrolls through interleaved log streams from several services, holds a mental timeline in their head, and tries to separate cause from symptom. The harder the incident, the more streams there are, and the more the human working memory becomes the bottleneck.
A language model does not get tired at line 800 and does not lose the thread when a fourth service shows up halfway down the file. It reads the entire batch at once and reasons about relationships across all of it in a single pass. That is the specific capability that makes log triage a good fit for an LLM, rather than a generic “AI is good at text” hand-wave. The model is doing the part a human is worst at: holding a large, messy, multi-source timeline in attention without dropping anything.
What “Good Triage” Actually Means
A useful triage output is not a summary of what is in the logs. It is a ranked set of distinct problems, each separated from the others, each with a plausible causal story and a concrete first move. The distinction matters because a single incident often presents as a wall of identical-looking error lines that actually trace back to two or three unrelated causes. The job is to untangle them. The schema later in this article encodes exactly that goal: clusters by root cause, severity per cluster, and ordered next actions per cluster.
Who Should Use This Pattern
- SRE and platform teams that manage more services than people.
- Backend teams with no dedicated observability tooling who pipe logs to files or S3.
- Startups where the founding engineer is also the on-call rotation.
- Teams that already have Datadog or Grafana but want an LLM layer that explains what a dashboard is telling them in plain language.
What This Article Covers
This article explains the design choices, walks through a complete runnable Python POC, and gives you the production considerations you need before putting this in front of a real incident. The code is in one file, reads a sample log from disk, calls the Anthropic API using structured output (tool-forcing), and prints a triage report. Everything you need is here.
If you are new to the series, Part 3 on structured output explains the tool-forcing pattern in detail. Part 4 on prompt caching covers the caching technique used in the POC to keep repeated triage calls cheap.
The Architecture: From Raw Logs to Structured Triage
Choosing the Right Output Shape
The key design decision is how Claude returns results. Free-form prose is readable but fragile to parse downstream. A structured JSON payload with a fixed schema is what you want when the output feeds a Slack alert, a ticket creator, or a CI gate.
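Claude’s tool-forcing pattern gives you this reliably. You define one tool whose input_schema is exactly the JSON shape you want, then set tool_choice={"type": "tool", "name": "triage_logs"}. Claude fills in the schema fields. The shape of the result is predictable: every call returns a tool_use block whose input is already parsed JSON following the schema, which you read directly from block.input. No regex, no JSON extraction hacks. Forcing the tool guarantees the call itself rather than strict schema validation, so keep a light check on required fields before anything downstream depends on them.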
This is the same pattern covered in Part 2 on tool use. If you have not read that article yet, the code below is self-contained, but the conceptual background there will help.
Prompt Caching for Repeated Triage Calls
In a real deployment the system prompt (which includes your log format spec and triage instructions) is the same on every call. Marking it with cache_control: {"type": "ephemeral"} turns that static block, together with the tool definition that precedes it in the request, into a prompt cache entry. Subsequent calls within the five-minute cache window pay the full input rate only for the new log lines; the cached prefix is billed at a small fraction of that rate. The saving therefore scales with the share of each request that is static prefix: large when the instructions dominate, modest when the log content does. The prefix also has to reach the model's minimum cacheable length before caching takes effect. The POC shows exactly how to wire this.
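To see where the saving comes from, the arithmetic can be sketched as below. The multipliers (cache writes at 1.25x the base input price, cache reads at 0.1x) and the token counts are assumptions for illustration, not figures from the POC; check current pricing before relying on them. The function name and parameters are hypothetical.

```python
def input_cost_usd(
    prefix_tokens: int,
    log_tokens: int,
    calls: int,
    price_per_mtok: float,
    cached: bool,
    write_mult: float = 1.25,  # assumed cache-write multiplier
    read_mult: float = 0.10,   # assumed cache-read multiplier
) -> float:
    """Input-token cost for `calls` requests that share one static prefix.

    Assumes every call after the first lands inside the cache window.
    """
    per_tok = price_per_mtok / 1_000_000
    logs = log_tokens * calls * per_tok
    if not cached:
        return prefix_tokens * calls * per_tok + logs
    first = prefix_tokens * write_mult * per_tok
    rest = prefix_tokens * read_mult * (calls - 1) * per_tok
    return first + rest + logs


# Hypothetical workload: 1,400-token prefix, 2,000 tokens of logs, 100 calls.
plain = input_cost_usd(1400, 2000, 100, 3.00, cached=False)
warm = input_cost_usd(1400, 2000, 100, 3.00, cached=True)
saving = 1 - warm / plain
print(f"uncached ${plain:.4f}  cached ${warm:.4f}  saving {saving:.0%}")
```

Only the prefix term shrinks, so under these assumptions the overall saving is set by how large the prefix is relative to the log content in each call.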
The Triage JSON Schema
The output schema is the contract between Claude and every downstream consumer. Keep it minimal but complete.
| Field | Type | Description |
|---|---|---|
| clusters | array of objects | Each object is one root-cause group |
| clusters[].id | string | Short slug, e.g. db-pool-exhaustion |
| clusters[].title | string | One-line human summary of the root cause |
| clusters[].severity | string enum | critical, high, medium, low |
| clusters[].affected_services | array of strings | Services mentioned in the clustered lines |
| clusters[].log_line_indices | array of integers | 0-based indices of grouped log lines |
| clusters[].explanation | string | Two to four sentences explaining root cause and impact |
| clusters[].next_actions | array of strings | Ordered list of concrete remediation steps |
| summary | string | Paragraph-level overall assessment |
| total_errors | integer | Count of lines classified as errors or critical |
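Because downstream consumers depend on this contract, a cheap check before handing the payload on can be useful. The following is a minimal sketch, not part of the POC; `contract_problems` and the constant names are hypothetical, and it only checks field presence and the severity enum rather than full JSON Schema validation.

```python
from typing import Any

SEVERITIES = ("critical", "high", "medium", "low")
CLUSTER_FIELDS = (
    "id", "title", "severity", "affected_services",
    "log_line_indices", "explanation", "next_actions",
)


def contract_problems(report: dict[str, Any]) -> list[str]:
    """Return human-readable contract violations (empty list means OK)."""
    problems = []
    for key in ("clusters", "summary", "total_errors"):
        if key not in report:
            problems.append(f"missing top-level field: {key}")
    for n, cluster in enumerate(report.get("clusters", [])):
        for field in CLUSTER_FIELDS:
            if field not in cluster:
                problems.append(f"clusters[{n}] missing field: {field}")
        if cluster.get("severity") not in SEVERITIES:
            problems.append(f"clusters[{n}] has unknown severity")
    return problems


# A payload that omits next_actions and uses a label outside the enum.
bad = {
    "clusters": [{
        "id": "db-pool-exhaustion", "title": "Pool exhausted",
        "severity": "urgent", "affected_services": ["payment-svc"],
        "log_line_indices": [1, 2], "explanation": "Pool is full.",
    }],
    "summary": "One problem.",
    "total_errors": 2,
}
for problem in contract_problems(bad):
    print(problem)
```

An empty list means the payload is safe to pass to a Slack formatter or CI gate; anything else is a reason to retry the call or fall back to showing the raw logs.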
AI Log Analysis: The Complete Python POC
Project Structure
log-triage/
├── .env
├── requirements.txt
├── sample_logs.txt
└── log_triage.py
Install
pip install anthropic python-dotenv
requirements.txt
anthropic>=0.28.0
python-dotenv>=1.0.0
.env
ANTHROPIC_API_KEY=sk-ant-...
sample_logs.txt
This file represents about 90 seconds of output from a payment microservice, a database proxy, and an auth service. It contains multiple overlapping problems.
2026-06-04T08:00:01Z INFO [api-gateway] GET /health 200 12ms
2026-06-04T08:00:03Z ERROR [payment-svc] DB connection timeout after 30s: pool_size=20 active=20 waiting=14
2026-06-04T08:00:03Z ERROR [payment-svc] DB connection timeout after 30s: pool_size=20 active=20 waiting=15
2026-06-04T08:00:04Z ERROR [payment-svc] Failed to process payment txn=TXN-9823: upstream database unreachable
2026-06-04T08:00:05Z ERROR [auth-svc] Token validation failed: certificate verify failed (SSL: CERTIFICATE_VERIFY_FAILED)
2026-06-04T08:00:05Z ERROR [auth-svc] Token validation failed: certificate verify failed (SSL: CERTIFICATE_VERIFY_FAILED)
2026-06-04T08:00:06Z WARN [api-gateway] Upstream auth-svc returned 500 for POST /auth/verify
2026-06-04T08:00:06Z ERROR [payment-svc] DB connection timeout after 30s: pool_size=20 active=20 waiting=18
2026-06-04T08:00:07Z ERROR [payment-svc] Failed to process payment txn=TXN-9824: upstream database unreachable
2026-06-04T08:00:07Z ERROR [payment-svc] Failed to process payment txn=TXN-9825: upstream database unreachable
2026-06-04T08:00:08Z ERROR [auth-svc] Token validation failed: certificate verify failed (SSL: CERTIFICATE_VERIFY_FAILED)
2026-06-04T08:00:08Z ERROR [notification-svc] SMTP connection refused: host=smtp.internal port=587
2026-06-04T08:00:09Z INFO [api-gateway] GET /health 200 11ms
2026-06-04T08:00:09Z ERROR [payment-svc] DB connection timeout after 30s: pool_size=20 active=20 waiting=20
2026-06-04T08:00:10Z CRITICAL [payment-svc] Connection pool fully saturated. All new requests will fail immediately.
2026-06-04T08:00:10Z ERROR [payment-svc] Failed to process payment txn=TXN-9826: upstream database unreachable
2026-06-04T08:00:11Z ERROR [auth-svc] Token validation failed: certificate verify failed (SSL: CERTIFICATE_VERIFY_FAILED)
2026-06-04T08:00:11Z ERROR [api-gateway] Circuit breaker OPEN for payment-svc after 5 consecutive failures
2026-06-04T08:00:12Z ERROR [auth-svc] Token validation failed: certificate verify failed (SSL: CERTIFICATE_VERIFY_FAILED)
2026-06-04T08:00:13Z ERROR [payment-svc] Failed to process payment txn=TXN-9827: circuit breaker open
2026-06-04T08:00:13Z ERROR [notification-svc] SMTP connection refused: host=smtp.internal port=587
2026-06-04T08:00:14Z WARN [db-proxy] Max connections reached: host=pg-primary-01 current=250 limit=250
2026-06-04T08:00:14Z ERROR [payment-svc] Failed to process payment txn=TXN-9828: circuit breaker open
2026-06-04T08:00:15Z ERROR [auth-svc] Token validation failed: certificate verify failed (SSL: CERTIFICATE_VERIFY_FAILED)
2026-06-04T08:00:15Z INFO [api-gateway] GET /health 200 14ms
2026-06-04T08:00:16Z ERROR [notification-svc] SMTP connection refused: host=smtp.internal port=587
2026-06-04T08:00:16Z ERROR [payment-svc] Failed to process payment txn=TXN-9829: circuit breaker open
2026-06-04T08:00:17Z WARN [db-proxy] Replication lag on pg-replica-02: 14.3s (threshold: 5s)
2026-06-04T08:00:18Z ERROR [payment-svc] Failed to process payment txn=TXN-9830: circuit breaker open
2026-06-04T08:00:18Z ERROR [auth-svc] Token validation failed: certificate verify failed (SSL: CERTIFICATE_VERIFY_FAILED)
2026-06-04T08:00:19Z INFO [api-gateway] GET /health 200 13ms
2026-06-04T08:00:20Z CRITICAL [db-proxy] Primary host pg-primary-01 unreachable. Initiating failover.
2026-06-04T08:00:21Z ERROR [payment-svc] Failed to process payment txn=TXN-9831: circuit breaker open
2026-06-04T08:00:21Z ERROR [notification-svc] SMTP connection refused: host=smtp.internal port=587
log_triage.py (complete, runnable)
"""
log_triage.py
=============
Feed raw log lines to Claude, get back a structured triage report:
  - root-cause clusters
  - severity ranking (critical / high / medium / low)
  - affected services
  - next actions

Uses:
  - claude-sonnet-4-6 (balanced cost/accuracy for this workload)
  - Tool-forcing for deterministic JSON output
  - Prompt caching on the static system prompt block

Usage:
    python log_triage.py sample_logs.txt
    python log_triage.py sample_logs.txt --max-lines 500
"""
import argparse
import json
import sys
import time
from pathlib import Path
from typing import Any

import anthropic
from dotenv import load_dotenv

load_dotenv()

# ---------------------------------------------------------------------------
# Constants
# ---------------------------------------------------------------------------
MODEL = "claude-sonnet-4-6"
MAX_TOKENS = 4096
MAX_LINES_PER_CALL = 200  # chunk size; adjust for very large files

# Colour codes for terminal output (degrade gracefully on Windows)
COLOURS = {
    "critical": "\033[91m",  # bright red
    "high": "\033[93m",      # yellow
    "medium": "\033[94m",    # blue
    "low": "\033[92m",       # green
    "reset": "\033[0m",
    "bold": "\033[1m",
}


def c(colour: str, text: str) -> str:
    """Wrap text in a terminal colour code."""
    if not sys.stdout.isatty():
        return text
    return f"{COLOURS.get(colour, '')}{text}{COLOURS['reset']}"

# ---------------------------------------------------------------------------
# Schema: the tool that Claude must fill in
# ---------------------------------------------------------------------------
TRIAGE_TOOL: dict[str, Any] = {
    "name": "triage_logs",
    "description": (
        "Return a structured triage report for the provided log lines. "
        "Group related errors by root cause, rank by severity, and propose next actions."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "clusters": {
                "type": "array",
                "description": "List of root-cause clusters, ordered from most to least severe.",
                "items": {
                    "type": "object",
                    "properties": {
                        "id": {
                            "type": "string",
                            "description": "Short kebab-case slug for this cluster, e.g. db-pool-exhaustion."
                        },
                        "title": {
                            "type": "string",
                            "description": "One-line human-readable summary of the root cause."
                        },
                        "severity": {
                            "type": "string",
                            "enum": ["critical", "high", "medium", "low"],
                            "description": "Severity level based on blast radius and urgency."
                        },
                        "affected_services": {
                            "type": "array",
                            "items": {"type": "string"},
                            "description": "Service names extracted from the log lines in this cluster."
                        },
                        "log_line_indices": {
                            "type": "array",
                            "items": {"type": "integer"},
                            "description": "0-based indices of the log lines that belong to this cluster."
                        },
                        "explanation": {
                            "type": "string",
                            "description": (
                                "Two to four sentences: what is happening, why it started, "
                                "and what the downstream impact is."
                            )
                        },
                        "next_actions": {
                            "type": "array",
                            "items": {"type": "string"},
                            "description": (
                                "Ordered, concrete remediation steps an on-call engineer "
                                "should take right now."
                            )
                        }
                    },
                    "required": [
                        "id", "title", "severity", "affected_services",
                        "log_line_indices", "explanation", "next_actions"
                    ]
                }
            },
            "summary": {
                "type": "string",
                "description": (
                    "Two to five sentences summarising the overall health of the system "
                    "based on these log lines."
                )
            },
            "total_errors": {
                "type": "integer",
                "description": "Count of log lines classified as ERROR or CRITICAL in the input."
            }
        },
        "required": ["clusters", "summary", "total_errors"]
    }
}

# ---------------------------------------------------------------------------
# System prompt (static; will be cached)
# ---------------------------------------------------------------------------
SYSTEM_PROMPT = """You are an expert SRE (Site Reliability Engineer) specialising in distributed-systems
incident triage. You receive a numbered list of raw log lines from one or more services.
Your job:
1. Read every line carefully.
2. Group lines by root cause (not by service or log level). A single root cause may span multiple services.
3. For each group, determine the severity: critical (system-wide outage or data loss risk), high (major
feature broken, users directly affected), medium (degraded performance or isolated failures), low
(warnings, informational anomalies).
4. Order clusters from highest to lowest severity.
5. For each cluster, list concrete, ordered next actions an engineer should take right now.
6. Assign each log line to exactly one cluster. Do not leave lines unassigned.
7. Count only lines whose level is ERROR or CRITICAL toward total_errors.
Be specific: name the actual service, the actual error message, the actual config value mentioned.
Do not give vague actions like "investigate the issue". Give actions like
"Check pg-primary-01 disk utilisation with: df -h on the database host" or
"Rotate the TLS certificate for auth-svc and restart the pod"."""
# ---------------------------------------------------------------------------
# Core triage function
# ---------------------------------------------------------------------------
def triage_log_chunk(
    client: anthropic.Anthropic,
    lines: list[str],
    chunk_index: int = 0,
) -> dict[str, Any]:
    """
    Send a chunk of log lines to Claude and return the parsed triage dict.
    Uses tool-forcing (tool_choice) to get structured JSON output.
    Marks the system prompt with cache_control so repeated calls pay
    only for the new log content.
    """
    # Number the lines so Claude can reference them by index
    numbered = "\n".join(f"[{i}] {line}" for i, line in enumerate(lines))
    user_message = f"Triage the following {len(lines)} log lines:\n\n{numbered}"

    # Build the request once so the retry sends exactly the same payload
    request: dict[str, Any] = {
        "model": MODEL,
        "max_tokens": MAX_TOKENS,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # Pin system prompt in cache; saves tokens on every subsequent call
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "tools": [TRIAGE_TOOL],
        "tool_choice": {"type": "tool", "name": "triage_logs"},
        "messages": [{"role": "user", "content": user_message}],
    }

    try:
        response = client.messages.create(**request)
    except anthropic.APIError as exc:
        print(f"[chunk {chunk_index}] API error: {exc}", file=sys.stderr)
        # Simple one-retry with a fixed pause; a second failure propagates
        time.sleep(5)
        response = client.messages.create(**request)

    # Extract the tool_use block (copied so the SDK object is not mutated)
    triage_data: dict[str, Any] = {}
    for block in response.content:
        if block.type == "tool_use" and block.name == "triage_logs":
            triage_data = dict(block.input)
            break

    # Attach token usage for cost tracking
    triage_data["_usage"] = {
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "cache_creation_input_tokens": getattr(
            response.usage, "cache_creation_input_tokens", 0
        ) or 0,
        "cache_read_input_tokens": getattr(
            response.usage, "cache_read_input_tokens", 0
        ) or 0,
    }
    return triage_data

# ---------------------------------------------------------------------------
# Merge results from multiple chunks (for large files)
# ---------------------------------------------------------------------------
def merge_triage_results(chunks: list[dict[str, Any]]) -> dict[str, Any]:
    """
    When a log file is split across multiple API calls, merge the results
    into a single triage report. Cluster IDs from different chunks that
    share the same root cause are left separate (a second LLM pass could
    deduplicate them, but for most files one chunk is sufficient).
    """
    if len(chunks) == 1:
        return chunks[0]

    merged: dict[str, Any] = {
        "clusters": [],
        "summary": "",
        "total_errors": 0,
        "_usage": {
            "input_tokens": 0,
            "output_tokens": 0,
            "cache_creation_input_tokens": 0,
            "cache_read_input_tokens": 0,
        },
    }
    summaries = []
    for i, chunk in enumerate(chunks):
        merged["clusters"].extend(chunk.get("clusters", []))
        merged["total_errors"] += chunk.get("total_errors", 0)
        if chunk.get("summary"):
            summaries.append(f"[Chunk {i + 1}] {chunk['summary']}")
        for key in merged["_usage"]:
            merged["_usage"][key] += chunk.get("_usage", {}).get(key, 0)
    merged["summary"] = " ".join(summaries)

    # Re-sort all clusters by severity
    sev_order = {"critical": 0, "high": 1, "medium": 2, "low": 3}
    merged["clusters"].sort(
        key=lambda cl: sev_order.get(cl.get("severity", "low"), 3)
    )
    return merged

# ---------------------------------------------------------------------------
# Reporting
# ---------------------------------------------------------------------------
def print_triage_report(report: dict[str, Any], log_lines: list[str]) -> None:
    """Print a human-readable triage report to stdout."""
    print()
    print(c("bold", "=" * 60))
    print(c("bold", " LOG TRIAGE REPORT"))
    print(c("bold", "=" * 60))

    usage = report.get("_usage", {})
    print(
        f"\nTokens: {usage.get('input_tokens', 0)} in / "
        f"{usage.get('output_tokens', 0)} out | "
        f"Cache created: {usage.get('cache_creation_input_tokens', 0)} | "
        f"Cache read: {usage.get('cache_read_input_tokens', 0)}"
    )
    print(f"Total error/critical lines: {report.get('total_errors', 0)}")
    print()
    print(c("bold", "Summary"))
    print(report.get("summary", ""))
    print()

    clusters = report.get("clusters", [])
    if not clusters:
        print("No clusters returned.")
        return

    for idx, cluster in enumerate(clusters, 1):
        sev = cluster.get("severity", "low")
        print(c(sev, f" Cluster {idx}: [{sev.upper()}] {cluster.get('title', '')}"))
        print(f" ID: {cluster.get('id', '')}")
        services = ", ".join(cluster.get("affected_services", []))
        print(f" Affected services: {services}")
        print()
        print(f" {cluster.get('explanation', '')}")
        print()

        indices = cluster.get("log_line_indices", [])
        if indices:
            print(" Sample log lines:")
            # Show at most four lines; skip any index outside the file
            for li in indices[:4]:
                if 0 <= li < len(log_lines):
                    print(f" [{li}] {log_lines[li].rstrip()}")
            if len(indices) > 4:
                print(f" ... and {len(indices) - 4} more lines")
            print()

        actions = cluster.get("next_actions", [])
        if actions:
            print(" Next actions:")
            for step, action in enumerate(actions, 1):
                print(f" {step}. {action}")
            print()

        print(" " + "-" * 55)
        print()

# ---------------------------------------------------------------------------
# CLI entry point
# ---------------------------------------------------------------------------
def main() -> None:
    parser = argparse.ArgumentParser(description="Triage log files with Claude AI.")
    parser.add_argument("log_file", help="Path to a plain-text log file.")
    parser.add_argument(
        "--max-lines",
        type=int,
        default=MAX_LINES_PER_CALL,
        help=f"Lines per API call chunk (default: {MAX_LINES_PER_CALL}).",
    )
    parser.add_argument(
        "--json-out",
        metavar="FILE",
        help="Write the raw triage JSON to a file (optional).",
    )
    args = parser.parse_args()
    if args.max_lines < 1:
        parser.error("--max-lines must be at least 1")

    log_path = Path(args.log_file)
    if not log_path.exists():
        print(f"File not found: {log_path}", file=sys.stderr)
        sys.exit(1)

    log_lines = log_path.read_text(encoding="utf-8").splitlines()
    print(f"Loaded {len(log_lines)} log lines from {log_path.name}")

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

    # Split into chunks if needed
    chunks_input = [
        log_lines[i: i + args.max_lines]
        for i in range(0, len(log_lines), args.max_lines)
    ]
    print(f"Sending {len(chunks_input)} chunk(s) to {MODEL}...")

    chunk_results = []
    for chunk_idx, chunk in enumerate(chunks_input):
        print(f" Chunk {chunk_idx + 1}/{len(chunks_input)}: {len(chunk)} lines")
        result = triage_log_chunk(client, chunk, chunk_index=chunk_idx)
        # Claude numbers lines from 0 within each chunk; shift the indices
        # so they point into the full file when the report is printed.
        offset = chunk_idx * args.max_lines
        for cluster in result.get("clusters", []):
            cluster["log_line_indices"] = [
                offset + i for i in cluster.get("log_line_indices", [])
            ]
        chunk_results.append(result)

    report = merge_triage_results(chunk_results)
    print_triage_report(report, log_lines)

    if args.json_out:
        out_path = Path(args.json_out)
        out_path.write_text(json.dumps(report, indent=2), encoding="utf-8")
        print(f"\nJSON report written to {out_path}")


if __name__ == "__main__":
    main()
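The merge step in the script leaves clusters from different chunks separate. One cheap way to collapse some of them without a second LLM pass is to merge clusters that happen to share the same id. The sketch below is not part of the POC and rests on a loud assumption: that Claude picks the same slug for the same root cause in different chunks, which is plausible but not guaranteed. `dedupe_by_id` is a hypothetical helper.

```python
from typing import Any

SEV_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def dedupe_by_id(clusters: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Merge clusters with identical ids, keeping the more severe label."""
    merged: dict[str, dict[str, Any]] = {}
    for cl in clusters:
        key = cl.get("id", "")
        if key not in merged:
            # Copy the lists so the input clusters are not mutated
            merged[key] = {
                **cl,
                "affected_services": list(cl.get("affected_services", [])),
                "log_line_indices": list(cl.get("log_line_indices", [])),
            }
            continue
        kept = merged[key]
        for svc in cl.get("affected_services", []):
            if svc not in kept["affected_services"]:
                kept["affected_services"].append(svc)
        kept["log_line_indices"] = sorted(
            set(kept["log_line_indices"]) | set(cl.get("log_line_indices", []))
        )
        if SEV_ORDER.get(cl.get("severity"), 3) < SEV_ORDER.get(kept.get("severity"), 3):
            kept["severity"] = cl["severity"]
    return sorted(merged.values(), key=lambda c: SEV_ORDER.get(c.get("severity"), 3))


chunked = [
    {"id": "db-pool", "severity": "high", "affected_services": ["payment-svc"],
     "log_line_indices": [1, 2]},
    {"id": "smtp", "severity": "medium", "affected_services": ["notification-svc"],
     "log_line_indices": [11]},
    {"id": "db-pool", "severity": "critical", "affected_services": ["payment-svc", "db-proxy"],
     "log_line_indices": [221, 232]},
]
for cluster in dedupe_by_id(chunked):
    print(cluster["id"], cluster["severity"], cluster["log_line_indices"])
```

The title, explanation, and next actions of the first occurrence are kept as-is, so this is a rough consolidation rather than a true semantic merge.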
Sample Run
$ python log_triage.py sample_logs.txt
Loaded 34 log lines from sample_logs.txt
Sending 1 chunk(s) to claude-sonnet-4-6...
Chunk 1/1: 34 lines
============================================================
LOG TRIAGE REPORT
============================================================
Tokens: 1847 in / 812 out | Cache created: 1423 | Cache read: 0
Total error/critical lines: 24
Summary
The system is experiencing a cascading failure originating from two separate
root causes: an SSL certificate problem in auth-svc and a full connection-pool
exhaustion on the primary PostgreSQL host. The database issue has escalated to
a complete primary failover event. Payment processing is fully offline behind
an open circuit breaker. SMTP delivery is also failing independently.
Cluster 1: [CRITICAL] PostgreSQL primary host unreachable, triggering failover
ID: postgres-primary-failover
Affected services: payment-svc, db-proxy
pg-primary-01 exhausted its 250-connection limit and became unreachable,
triggering an automatic failover. This caused payment-svc to exhaust its
own connection pool (all 20 slots active, 20 waiting), which then tripped
the api-gateway circuit breaker for payment-svc after 5 consecutive failures.
All payment transactions from TXN-9826 onward are failing. Replication lag
on pg-replica-02 (14.3s) suggests the replica was already under stress before
the failover initiated.
Sample log lines:
[2] 2026-06-04T08:00:03Z ERROR [payment-svc] DB connection timeout ...
[14] 2026-06-04T08:00:10Z CRITICAL [payment-svc] Connection pool fully saturated
[21] 2026-06-04T08:00:14Z WARN [db-proxy] Max connections reached ...
[31] 2026-06-04T08:00:20Z CRITICAL [db-proxy] Primary host pg-primary-01 unreachable
Next actions:
1. SSH to db-proxy host and confirm failover completed: check pg-replica-02 is now primary
2. Monitor pg-replica-02 replication status with: SELECT pg_is_in_recovery();
3. Once failover is confirmed, drain payment-svc connection pool: restart payment-svc pods
4. Reset circuit breaker in api-gateway config or redeploy api-gateway pod
5. Investigate why pg-primary-01 hit 250 connections: run pg_stat_activity query on new primary
-------------------------------------------------------
Cluster 2: [HIGH] auth-svc TLS certificate validation failure
ID: auth-svc-cert-failure
Affected services: auth-svc, api-gateway
auth-svc is rejecting every token because SSL certificate verification is
failing, likely due to a certificate rotation that introduced an invalid or
expired cert. The api-gateway is receiving 500 responses from auth-svc and
logging upstream errors. This affects all authenticated requests across the
system, independent of the database issue.
Sample log lines:
[4] 2026-06-04T08:00:05Z ERROR [auth-svc] Token validation failed: certificate verify failed
[5] 2026-06-04T08:00:05Z ERROR [auth-svc] Token validation failed: certificate verify failed
[6] 2026-06-04T08:00:06Z WARN [api-gateway] Upstream auth-svc returned 500
[10] 2026-06-04T08:00:08Z ERROR [auth-svc] Token validation failed: certificate verify failed
Next actions:
1. Check recent certificate changes: kubectl describe secret auth-svc-tls -n prod
2. Validate the certificate chain: openssl s_client -connect auth-svc:443
3. If certificate is expired or wrong, redeploy the correct cert and restart auth-svc
4. Confirm api-gateway upstream health check recovers after auth-svc restart
-------------------------------------------------------
Cluster 3: [MEDIUM] SMTP server refusing connections for notification-svc
ID: smtp-connection-refused
Affected services: notification-svc
notification-svc cannot connect to the internal SMTP relay on port 587.
This is an independent failure from the database and auth issues. Email
notifications are silently dropped. No data loss risk but user-visible
effects (missed confirmations, alerts) will accumulate.
Sample log lines:
[11] 2026-06-04T08:00:08Z ERROR [notification-svc] SMTP connection refused
[20] 2026-06-04T08:00:13Z ERROR [notification-svc] SMTP connection refused
[25] 2026-06-04T08:00:16Z ERROR [notification-svc] SMTP connection refused
[33] 2026-06-04T08:00:21Z ERROR [notification-svc] SMTP connection refused
Next actions:
1. Check whether smtp.internal is reachable: telnet smtp.internal 587
2. Verify the SMTP relay service is running on its host
3. Check smtp.internal firewall rules for port 587
Design Choices and Trade-offs
Model Selection
| Model | Best for log triage when | Approx. cost (input/1M tokens) | Typical latency (200 lines) |
|---|---|---|---|
| claude-haiku-4-5 | High-volume, simple logs; single-service; tight budget | $0.80 | 1.5 to 3s |
| claude-sonnet-4-6 | Multi-service incidents; cross-cutting patterns; most production use | $3.00 | 5 to 12s |
| claude-opus-4-8 | Complex distributed traces; postmortem root-cause writeups | $15.00 | 15 to 40s |
For real-time triage (on-call alert fires, engineer wants a brief in under 15 seconds) Sonnet 4.6 is the right default. For async postmortem analysis where you want a detailed narrative, Opus 4.8 is worth the cost. For high-volume pipeline analytics (process 50,000 log batches per day, flag ones worth a human look), route to Haiku first; escalate interesting ones to Sonnet. Part 27 on model routing covers this escalation pattern in depth.
Chunk Size
The default MAX_LINES_PER_CALL of 200 keeps each request small. A typical structured log line is 80 to 120 characters, so 200 lines is 16,000 to 24,000 characters, or very roughly 4,000 to 6,000 tokens at a rule-of-thumb four characters per token (timestamps, IDs, and hex strings often tokenise less efficiently than prose). Combine that with the roughly 1,400 tokens of cached system prompt and schema and you are still far below the 200K context window, with plenty of room for the output.
For very dense lines (Java stack traces, multiline JSON) reduce the chunk size to 50 to 100 lines. For short single-field log lines (nginx access logs) you can push 500 per call safely.
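Since line length varies so much, one alternative to a fixed line count is to chunk by a character budget instead. The sketch below is a hypothetical helper, not part of the POC; the 20,000-character default is an assumption that corresponds to very roughly 5,000 tokens at four characters per token.

```python
def chunk_by_chars(lines: list[str], max_chars: int = 20_000) -> list[list[str]]:
    """Split lines into chunks whose total length stays within max_chars.

    A single line longer than max_chars still gets a chunk of its own.
    """
    chunks: list[list[str]] = []
    current: list[str] = []
    size = 0
    for line in lines:
        cost = len(line) + 1  # +1 for the newline that joins lines
        if current and size + cost > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(line)
        size += cost
    if current:
        chunks.append(current)
    return chunks


# Tiny budget to make the behaviour visible.
print(chunk_by_chars(["aaaa", "bbbb", "cc"], max_chars=10))
```

With this approach dense stack traces naturally produce smaller chunks and short access-log lines produce larger ones, without tuning the line count by hand.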
Why Not Just Use Grep or a SIEM Rule?
Grep finds known patterns. Claude finds unknown relationships. The sample log above contains a non-obvious causal chain: the certificate failure in auth-svc is independent of the database exhaustion, but both together produce the same surface symptom (payment failures). A grep rule would tag both as “payment errors.” Claude correctly separates them into two clusters with different remediation paths. That distinction is the entire value.
Wiring Into Your Production Stack
Sending to Slack
The triage JSON maps cleanly to a Slack Block Kit message. Use clusters[0] (the highest-severity cluster) for the alert title, the summary for the text body, and its next_actions list as a numbered block. Here is the relevant fragment, to call after the report is built in a pipeline:
import json as _json
import urllib.request


def post_to_slack(webhook_url: str, report: dict) -> None:
    """Post the top cluster, summary, and next actions to a Slack webhook."""
    clusters = report.get("clusters", [])
    top = clusters[0] if clusters else {}
    # Slack header blocks cap plain_text at 150 characters, so truncate
    title = f"[{top.get('severity', '').upper()}] {top.get('title', '')}"[:150]
    blocks = [
        {"type": "header", "text": {"type": "plain_text", "text": title}},
        {"type": "section", "text": {"type": "mrkdwn",
                                     "text": report.get("summary", "")}},
        {"type": "section", "text": {"type": "mrkdwn",
                                     "text": "*Next actions:*\n" + "\n".join(
                                         f"{i+1}. {a}"
                                         for i, a in enumerate(top.get("next_actions", []))
                                     )}},
    ]
    payload = _json.dumps({"blocks": blocks}).encode()
    req = urllib.request.Request(webhook_url, data=payload,
                                 headers={"Content-Type": "application/json"})
    # Time out rather than hang the pipeline if Slack is unreachable
    urllib.request.urlopen(req, timeout=10)
CI Gate Pattern
In a CI pipeline that runs integration tests and captures application logs, you can pipe the logs through the triage script and exit non-zero if any cluster has severity critical or high. This stops a deployment if a new build introduces database connection errors visible in the test harness logs.
python log_triage.py test-run.log --json-out triage.json && python -c "import json,sys; r=json.load(open('triage.json')); sys.exit(1 if any(c['severity'] in ('critical','high') for c in r['clusters']) else 0)"
For a deeper look at agent loops that do multi-step reasoning over logs (fetch more context, open a ticket, run a remediation script), see Part 22 on autonomous agent loops.
Common Pitfalls
Sending Too Many Lines Without Context
Feeding 5,000 lines of access logs (mostly 200 OK) drowns out the 30 error lines. Pre-filter with a grep or log-level filter before sending to Claude. You only need the ERROR, WARN, and CRITICAL lines plus enough INFO context (5 to 10 lines before the first error) to give Claude the timeline. Clean input cuts token cost roughly in proportion to the lines you drop and tends to produce better clusters.
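The pre-filter described above can be sketched as follows. This is a hypothetical helper, not part of the POC. It assumes one log entry per line with the level appearing as an upper-case word, as in the sample log; lines with no recognisable level after the first problem line are dropped, so join multiline entries first.

```python
import re
from typing import Optional

LEVEL_RE = re.compile(r"\b(INFO|DEBUG|WARN|WARNING|ERROR|CRITICAL|FATAL)\b")
KEEP = {"WARN", "WARNING", "ERROR", "CRITICAL", "FATAL"}


def level_of(line: str) -> Optional[str]:
    """Return the first level keyword found in the line, if any."""
    match = LEVEL_RE.search(line)
    return match.group(1) if match else None


def prefilter(lines: list[str], context: int = 5) -> list[str]:
    """Keep problem lines plus `context` lines before the first one."""
    first = next((i for i, ln in enumerate(lines) if level_of(ln) in KEEP), None)
    if first is None:
        return []  # nothing worth triaging
    start = max(0, first - context)
    kept = lines[start:first]
    kept += [ln for ln in lines[first:] if level_of(ln) in KEEP]
    return kept


sample = [
    "08:00:01Z INFO [api-gateway] GET /health 200",
    "08:00:02Z INFO [api-gateway] GET /health 200",
    "08:00:03Z ERROR [payment-svc] DB connection timeout",
    "08:00:04Z INFO [api-gateway] GET /health 200",
    "08:00:05Z WARN [db-proxy] Max connections reached",
]
print(prefilter(sample, context=1))
```

Note that filtering changes line positions, so any indices Claude returns refer to the filtered list rather than the original file.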
Expecting Line-by-Line Attribution to Be Perfect
The log_line_indices field is a guide, not a formal database query. On ambiguous cases (an INFO line that contextualises an ERROR) Claude may place it differently between runs. Do not write brittle downstream logic that requires a specific line to be in a specific cluster. Use the indices to jump to the right area of the log, not as a primary key.
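A small sanity check makes this looseness visible instead of letting it surprise you. The sketch below, a hypothetical helper that is not part of the POC, reports which returned indices fall outside the input, which appear in more than one cluster, and which input lines were never assigned.

```python
from collections import Counter
from typing import Any


def index_coverage(report: dict[str, Any], line_count: int) -> dict[str, list[int]]:
    """Classify the indices Claude returned against the real line range."""
    seen = Counter(
        i
        for cluster in report.get("clusters", [])
        for i in cluster.get("log_line_indices", [])
    )
    return {
        "out_of_range": sorted(i for i in seen if not 0 <= i < line_count),
        "duplicated": sorted(i for i, n in seen.items() if n > 1),
        "unassigned": [i for i in range(line_count) if i not in seen],
    }


report = {"clusters": [
    {"log_line_indices": [0, 1, 7]},
    {"log_line_indices": [1]},
]}
print(index_coverage(report, line_count=4))
```

Logging these three lists per call is usually enough; treating any of them as a hard failure would reintroduce the brittleness this section warns against.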
Ignoring the Cache on the Second Call
The first call against a cold cache should return cache_creation_input_tokens > 0 and cache_read_input_tokens == 0. The second call within 5 minutes should show cache_read_input_tokens equal to the previous cache_creation_input_tokens. If you are not seeing cache hits, confirm you are sending an identical prefix on both calls: the same tool definition and the same system list (same string, same structure). Even a single character difference creates a cache miss. If both counters stay at zero on every call, the prefix is probably shorter than the model's minimum cacheable length.
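A tiny helper makes the cache state explicit in logs. This sketch is hypothetical and not part of the POC; it assumes a usage dict with the two cache counter keys used in this article, such as the `_usage` dict the script attaches to each result.

```python
def cache_status(usage: dict) -> str:
    """Classify one call as a cache 'hit', a cache 'write', or 'none'."""
    created = usage.get("cache_creation_input_tokens") or 0
    read = usage.get("cache_read_input_tokens") or 0
    if read > 0:
        return "hit"
    if created > 0:
        return "write"
    return "none"


first_call = {"cache_creation_input_tokens": 1423, "cache_read_input_tokens": 0}
second_call = {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 1423}
print(cache_status(first_call), cache_status(second_call))
```

A run of "write" results on consecutive calls suggests the prefix is changing between requests; a run of "none" suggests the prefix is not being cached at all.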
Using the Wrong Model for the Volume
Sonnet at $3 per million input tokens on a 1,800-token call (typical for 200 lines) costs about $0.0054 in input per triage, before output tokens. Across 1,000 triage calls per day that is $5.40 a day. That is fine for an incident response tool. If you plan to run this on every minute of logs from every service, route through Haiku first and escalate interesting batches to Sonnet. The cost difference is roughly 4x.
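The arithmetic in the paragraph above, as a short sketch. It counts input tokens only, because that is the only rate the paragraph quotes; output tokens are billed separately and would add to the total. The function name `input_cost_usd` is hypothetical.

```python
def input_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Input-side cost of one call at a per-million-token rate."""
    return tokens * usd_per_million / 1_000_000


# Figures from the paragraph above: 1,800 tokens at $3 per million, 1,000 calls a day.
per_call = input_cost_usd(1_800, 3.0)
per_day = per_call * 1_000

print(f"${per_call:.4f} per call")  # $0.0054 per call
print(f"${per_day:.2f} per day")    # $5.40 per day
```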
Not Handling Multiline Log Entries
Stack traces and JSON-format logs often span multiple physical lines. If you split on \n naively, Claude sees fragments. Use a simple heuristic: if a line does not start with a timestamp pattern, concatenate it to the previous line before numbering. The POC uses simple line splitting (fine for the sample); add this pre-processing step before passing to triage_log_chunk in production.
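A minimal sketch of that heuristic, assuming each entry starts with an ISO-like timestamp such as `2026-01-15 09:14:02`. The pattern, the ` | ` joiner, and the name `merge_multiline` are all assumptions to adapt to your own format.

```python
import re

# Assumption: entries begin with YYYY-MM-DD followed by a space or "T" and HH:MM:SS.
TIMESTAMP_RE = re.compile(r"^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")


def merge_multiline(raw_lines: list[str]) -> list[str]:
    """Fold continuation lines (stack frames, wrapped JSON) into the entry above."""
    entries: list[str] = []
    for line in raw_lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # drop blank lines entirely
        if TIMESTAMP_RE.match(line) or not entries:
            entries.append(line)  # a new logical entry starts here
        else:
            entries[-1] += " | " + line.strip()  # continuation of the previous entry
    return entries
```

Number the merged entries, not the physical lines, so that one stack trace counts as one line in the indices Claude returns.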
Treating a Single Call as Ground Truth for Severity
The severity label is a judgment, and judgments vary at the margins. A failure that one run calls “high” another run may call “critical” when the same lines are present. This is fine for a brief that a human reads, but it is a trap if you wire an automated paging decision to a strict equality check on one call. Two defenses help. First, keep your automated gates coarse: page on “critical or high,” not on “critical only,” so a one-level wobble does not change the outcome. Second, for the calls where the decision genuinely matters, set a lower temperature and consider running the triage twice and taking the higher severity. The cost of a second Sonnet call is a few tenths of a cent, which is nothing next to a missed page.
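Both defenses fit in a few lines. The sketch below assumes each triage result is a dict with a `clusters` list and that severities come from a four-level scale; only "critical" and "high" are confirmed by this section, so `low` and `medium` are assumptions, as are the names `worst_rank` and `should_page`.

```python
# Assumed scale, lowest to highest. Adjust to the labels your schema allows.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def worst_rank(report: dict) -> int:
    """Rank of the most severe cluster in one run, or -1 if there are none.

    An unknown label raises ValueError rather than being silently ignored.
    """
    ranks = [SEVERITY_ORDER.index(c["severity"]) for c in report.get("clusters", [])]
    return max(ranks, default=-1)


def should_page(run_a: dict, run_b: dict, threshold: str = "high") -> bool:
    """Page if the higher severity across two runs meets a coarse threshold."""
    worst = max(worst_rank(run_a), worst_rank(run_b))
    return worst >= SEVERITY_ORDER.index(threshold)
```

Because the gate is "high or above" and takes the higher of the two runs, a one-level wobble between "high" and "critical" cannot flip the decision.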
Leaking Secrets Into the Prompt
Logs are full of things you do not want to send anywhere: bearer tokens, session cookies, connection strings with embedded passwords, customer email addresses, internal hostnames. Build a redaction pass that runs before the log lines ever reach triage_log_chunk. A short list of regular expressions covers most of it: anything that looks like a JWT, anything after password= or token=, anything matching an email pattern. Replace the match with a placeholder like [REDACTED-TOKEN]. The triage quality does not suffer, because the model is reasoning about the shape of the error, not the secret value inside it.
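A minimal sketch of such a redaction pass, covering the three patterns named above. The expressions are deliberately simple starting points rather than a complete secret scanner. `[REDACTED-EMAIL]` and the name `redact` are hypothetical additions.

```python
import re

# (pattern, replacement) pairs, applied in order to every line.
REDACTIONS = [
    # JWT-shaped strings: three base64url segments, the first starting with "eyJ".
    (re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"), "[REDACTED-TOKEN]"),
    # Anything after password= or token= (also catches keys like auth_token=).
    (re.compile(r"(?i)(password|token)=\S+"), r"\1=[REDACTED-TOKEN]"),
    # Email-shaped strings.
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "[REDACTED-EMAIL]"),
]


def redact(line: str) -> str:
    """Replace secret-looking substrings with placeholders."""
    for pattern, replacement in REDACTIONS:
        line = pattern.sub(replacement, line)
    return line
```

Apply `redact` to every line before numbering and chunking, and extend the list with patterns for your own connection strings and internal hostnames.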
Forgetting to Bound max_tokens for the Output
A very noisy batch can produce a long structured response with many clusters and long action lists. If max_tokens is set too low the tool call gets truncated and block.input comes back as malformed or partial JSON. The POC sets MAX_TOKENS = 4096, which comfortably holds a dozen detailed clusters. If you raise the chunk size well past 200 lines, raise max_tokens in step, and check response.stop_reason: a value of "max_tokens" means the output was cut off and you should retry with a higher ceiling or a smaller chunk.
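The stop_reason check can live in a small wrapper. In this sketch, `call` is a hypothetical callable that takes a max_tokens value, performs the API request, and returns a response object with a `stop_reason` attribute. The doubling strategy and the 16,384 ceiling are arbitrary assumptions, not values from the POC.

```python
def triage_with_budget(call, max_tokens: int = 4096, ceiling: int = 16384):
    """Retry a truncated triage call with a larger output budget.

    `call(max_tokens)` must return an object exposing `.stop_reason`.
    """
    while True:
        response = call(max_tokens)
        if response.stop_reason != "max_tokens":
            return response  # output completed normally
        if max_tokens >= ceiling:
            # Still truncated at the ceiling: the chunk itself is too large.
            raise RuntimeError("output truncated at ceiling; retry with a smaller chunk")
        max_tokens = min(max_tokens * 2, ceiling)
```

Each retry re-bills the full input, so for chunks that regularly hit the ceiling it is cheaper to split the chunk than to keep raising the budget.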
Cost and Latency Reference
The numbers below assume a 200-line chunk, mixed log levels, Anthropic’s published pricing as of mid-2026. Cache hit pricing is 10 percent of the base input price.
| Scenario | Model | Input tokens (est.) | Output tokens (est.) | Cost per call | p50 latency |
|---|---|---|---|---|---|
| First call (cold cache) | Sonnet 4.6 | 2,300 | 800 | $0.011 | 8s |
| Subsequent calls (warm cache) | Sonnet 4.6 | 900 uncached + 1,400 cached | 800 | $0.005 | 6s |
| High-volume routing (cold) | Haiku 4.5 | 2,300 | 600 | $0.002 | 2s |
| Deep postmortem | Opus 4.8 | 2,300 | 1,200 | $0.052 | 25s |
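With a warm cache the per-call cost for Sonnet drops from $0.011 to $0.005. Over 10,000 calls a month that is the difference between $110 and $50. The cache is close to free to use: writing an entry is billed at a small premium over the base input price, and a single cache hit more than recovers it. The only requirement is re-sending the same cache_control block. Always do it.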
For observability of these calls in production (tracking latency percentiles, cache hit rates, per-service cost allocation), the pattern from Part 28 on LLM observability applies directly here: wrap triage_log_chunk in a span, record _usage as span attributes, and emit to your OTEL collector.
Frequently Asked Questions
Can this handle logs from multiple services in the same file?
Yes, that is actually where it produces the most value. The sample log above contains four services in one file. Claude naturally cross-references them and identifies that the payment failures are a downstream effect of the database exhaustion, not an independent cause. Single-service logs are also fine; the clusters just end up being purely temporal or error-type groupings.
Does the model hallucinate log lines that are not there?
Rarely, and mostly in the indices rather than the content. The explanations and next actions are grounded in the lines you actually sent, and they do not invent errors that are not in the input. What can happen is a misattributed line index (the model reports index 7 when the line is actually index 8); the structured output schema does not prevent this kind of off-by-one. Treat the indices as approximate pointers: jump near the right area, then read a few lines around it.
How do I handle log files with 50,000 lines?
Pre-filter aggressively. Keep only ERROR, WARN, and CRITICAL lines plus a short window of context around each. In most applications this reduces a 50,000-line file to under 1,000 relevant lines. Then batch those 1,000 lines into chunks of 200 and run them through the chunker. If you need a unified summary across all chunks, add a second LLM call that takes the per-chunk cluster summaries as input and consolidates them.
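The batching step might look like the sketch below. It is a stand-in for the POC's own chunker, which is not shown in this section. It assumes the `log_line_indices` returned for each chunk are zero-based and relative to that chunk, so a consolidation pass needs each chunk's offset to map them back to the filtered file. Both function names are hypothetical.

```python
from collections.abc import Iterator


def chunk_lines(lines: list[str], size: int = 200) -> Iterator[tuple[int, list[str]]]:
    """Yield (offset, chunk) pairs, where offset is the chunk's first global index."""
    for start in range(0, len(lines), size):
        yield start, lines[start:start + size]


def to_global_indices(offset: int, local_indices: list[int]) -> list[int]:
    """Map chunk-relative line indices back to positions in the full filtered list."""
    return [offset + i for i in local_indices]
```

Keeping the offset alongside each chunk lets the consolidating call, or the engineer reading its summary, point at one consistent set of line numbers across all chunks.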
What log formats does this work with?
Any format that a human can read. The model does not require a specific timestamp format, log level syntax, or field order. It has seen syslog, JSON-structured logs, Python logging output, Rails logs, nginx access logs, Spring Boot output, and Kubernetes pod logs in training. If a line is meaningful to you, it is meaningful to Claude.
Can I use this for security log analysis?
Yes, with one caveat: do not send raw authentication logs, session tokens, PII, or credentials to the API. Filter those fields before sending. For security event correlation (firewall deny logs, failed SSH attempts, unusual API call patterns) the clustering approach works well. The model can identify patterns like “10 IPs probing port 22 in a 60-second window” from a batch of firewall logs.
How does this compare to using a SIEM like Splunk or Elastic?
SIEM tools are excellent at known-pattern alerting at high volume with sub-second latency. Claude is better at novel-pattern discovery on moderate volume (hundreds to thousands of lines per call) where you need a natural-language explanation and prioritised remediation steps. The two are complementary: your SIEM fires the alert, your log triage script generates the incident brief. The combination is more useful than either alone.
Is the triage output good enough to page a human directly?
The output is good enough to include in the page. Whether to fire the page based solely on the triage output depends on your confidence in the log pre-filtering and your tolerance for false positives. A sensible initial policy: run the triage script, post the summary to your Slack incident channel, and let the existing PagerDuty rules (based on error rate thresholds, not on AI output) make the actual paging decision. Once you have calibrated the severity labels against real incidents for a few weeks, you can gate pages directly on the triage result.
Browse the complete series at skillsuites.com/category/ai-use-cases/.