Claude Message Batches API: Batch 10,000 Requests at 50% Off

By Asif·June 6, 2026·8 min read·AI Use Cases·Updated June 15, 2026

The Claude Message Batches API lets you send up to 100,000 requests in a single asynchronous job and pay 50% less for every input and output token. If you are running classification, summarization, extraction, or evaluation over a large dataset, batching is the single easiest way to cut your Claude bill in half without touching prompt quality or switching models. This guide is a code-first walkthrough: you will create a batch, poll it to completion, retrieve results, stack it with prompt caching, and handle every edge case in production Python.

Everything here uses the official anthropic Python SDK and the current Claude models (Opus 4.8 by default — claude-opus-4-8). No fluff, no filler, just working examples you can run today.

When to use the Claude Message Batches API (and when not to)

The Messages API has two delivery modes for the same model intelligence:

Synchronous (/v1/messages) — you send one request, you block, you get one answer. Use it for anything a human is waiting on: chat, autocomplete, an API a user is calling live.
Asynchronous / batch (/v1/messages/batches) — you submit thousands of requests at once and collect the results later. Use it for anything not latency-sensitive.

Reach for the Batches API whenever the work is bulk and offline: classifying a backlog of support tickets, summarizing every document in a data lake, generating product descriptions for a catalog, extracting fields from thousands of PDFs, or running a large evaluation set against a new prompt. The trade-off is simple — you give up real-time responses, and in exchange you get half-price tokens and a much higher throughput ceiling.

Key limits and specifications

Before you architect around batching, know the hard numbers:

Property	Limit
Requests per batch	Up to 100,000
Batch payload size	Up to 256 MB (whichever cap you hit first)
Typical completion time	Most batches finish within 1 hour
Maximum processing window	24 hours — anything not finished expires
Result availability	Results downloadable for 29 days after creation
Cost	50% discount on input and output tokens
Feature support	All Messages API features — vision, tools, system prompts, prompt caching

That last row matters: batching is not a stripped-down endpoint. Anything you can do in a single message — multimodal inputs, tool use, structured outputs — works inside a batch request too.
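For jobs larger than one batch, the two caps in the table above suggest a simple chunking step. The following is a minimal sketch, not an official utility: `split_into_batches` is a hypothetical helper, it assumes each request is a plain dict, it measures size as the JSON-serialized length of each request (the real payload has some extra envelope overhead), and it reads "256 MB" as 256 × 1024 × 1024 bytes.

```python
import json

MAX_REQUESTS_PER_BATCH = 100_000          # request cap from the table above
MAX_BATCH_BYTES = 256 * 1024 * 1024       # assumption: MB interpreted as MiB


def split_into_batches(requests, max_requests=MAX_REQUESTS_PER_BATCH,
                       max_bytes=MAX_BATCH_BYTES):
    """Group request dicts into chunks that respect both caps.

    Size is approximated by the UTF-8 length of each request's JSON form.
    A single request larger than max_bytes still ends up alone in its own
    chunk; this sketch does not try to reject or shrink it.
    """
    batches, current, current_bytes = [], [], 0
    for request in requests:
        size = len(json.dumps(request).encode("utf-8"))
        # Close the current chunk if adding this request would break a cap.
        if current and (len(current) >= max_requests
                        or current_bytes + size > max_bytes):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(request)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

Each returned chunk can then be submitted as its own batch. Leaving some headroom below both caps is a reasonable precaution, since the size estimate here is approximate.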

How much does it actually save? A worked example

Say you need to process 10,000 requests, each roughly 1,000 input tokens and 500 output tokens, on Claude Opus 4.8 (standard pricing: $5 / 1M input, $25 / 1M output).

Input: 10M tokens → $50 synchronous → $25 batched
Output: 5M tokens → $125 synchronous → $62.50 batched
Total: $175 → $87.50 — a clean 50% cut
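The same arithmetic can be wrapped in a small function so you can plug in your own volumes. This is an illustrative sketch: `estimate_cost` is a hypothetical helper, and the prices are simply the per-million-token figures quoted above, passed in as arguments rather than looked up anywhere.

```python
def estimate_cost(n_requests, input_tokens_each, output_tokens_each,
                  input_price_per_mtok, output_price_per_mtok,
                  batch_discount=0.5):
    """Return (standard_cost, batched_cost) in dollars.

    Prices are per one million tokens. batch_discount is the fraction
    taken off both input and output tokens (0.5 means 50% off).
    """
    input_cost = n_requests * input_tokens_each / 1_000_000 * input_price_per_mtok
    output_cost = n_requests * output_tokens_each / 1_000_000 * output_price_per_mtok
    standard = input_cost + output_cost
    return standard, standard * (1 - batch_discount)


# The worked example: 10,000 requests, 1,000 input and 500 output tokens each.
standard, batched = estimate_cost(10_000, 1_000, 500, 5.0, 25.0)
print(f"standard ${standard:.2f}, batched ${batched:.2f}")
```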

For high-volume, simpler jobs you can compound the savings by also choosing a faster, cheaper model such as Claude Haiku 4.5: the 50% batch discount applies to whichever model you pick. For more levers, see our guide on cutting AI costs with model routing and batching.

Create a batch in Python

A batch is just a list of normal Messages API requests, each tagged with your own custom_id so you can match results back to inputs. Install the SDK with pip install anthropic, then:

import anthropic
from anthropic.types.message_create_params import MessageCreateParamsNonStreaming
from anthropic.types.messages.batch_create_params import Request

client = anthropic.Anthropic()

message_batch = client.messages.batches.create(
    requests=[
        Request(
            custom_id="request-1",
            params=MessageCreateParamsNonStreaming(
                model="claude-opus-4-8",
                max_tokens=1024,
                messages=[{"role": "user", "content": "Summarize the impacts of climate change"}],
            ),
        ),
        Request(
            custom_id="request-2",
            params=MessageCreateParamsNonStreaming(
                model="claude-opus-4-8",
                max_tokens=1024,
                messages=[{"role": "user", "content": "Explain quantum computing basics"}],
            ),
        ),
    ]
)

print(f"Batch ID: {message_batch.id}")
print(f"Status: {message_batch.processing_status}")

The custom_id is the one field you must not skip. Results come back in no guaranteed order, so custom_id is how you reconnect each answer to its source row. Use something meaningful — a database primary key, a file path, a ticket number.
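One way to make the join key hard to get wrong is to derive it from your source rows and check for duplicates before submitting. The sketch below is an assumption-laden illustration: `build_requests` is a hypothetical helper, rows are assumed to be `(row_id, prompt)` pairs, and the requests are built as plain dicts with the same keys as the typed `Request` objects shown above.

```python
from collections import Counter


def build_requests(rows, model, max_tokens=1024):
    """Turn (row_id, prompt) pairs into batch request dicts.

    Raises ValueError if two rows would map to the same custom_id,
    since results could then not be matched back to their inputs.
    """
    rows = [(str(row_id), prompt) for row_id, prompt in rows]
    counts = Counter(row_id for row_id, _ in rows)
    duplicates = sorted(key for key, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"duplicate custom_id values: {duplicates}")
    return [
        {
            "custom_id": row_id,
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for row_id, prompt in rows
    ]
```

The resulting list could be passed as the `requests` argument in the call above. Check the API reference for any length or character restrictions on `custom_id` before using raw keys such as file paths.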

Poll for completion

Batches are asynchronous, so you retrieve the batch object and watch its processing_status until it reads ended:

import time

while True:
    batch = client.messages.batches.retrieve(message_batch.id)
    if batch.processing_status == "ended":
        break
    print(f"Processing: {batch.request_counts.processing}")
    time.sleep(60)

print("Batch complete!")
print(f"Succeeded: {batch.request_counts.succeeded}")
print(f"Errored:   {batch.request_counts.errored}")

The request_counts object (processing, succeeded, errored, canceled, expired) gives you a live progress breakdown. Poll on a sane interval — every 30–60 seconds is plenty. For very large jobs, treat the batch ID as durable state: persist it, and let a separate worker or cron job check back rather than blocking a process for an hour.
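If you do poll in-process, it helps to bound the wait instead of looping forever. Here is a minimal sketch of that idea: `wait_for_batch` is a hypothetical helper that takes any zero-argument callable returning the current status string, so it makes no API calls itself. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.

```python
import time


def wait_for_batch(fetch_status, interval_s=60.0, timeout_s=24 * 3600,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it returns "ended" or the timeout passes.

    Returns True if the batch ended, False if the deadline was reached
    first. The default timeout mirrors the 24-hour processing window.
    """
    deadline = clock() + timeout_s
    while True:
        if fetch_status() == "ended":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

With the client from the earlier examples, usage might look like `wait_for_batch(lambda: client.messages.batches.retrieve(message_batch.id).processing_status)`.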

Retrieve and handle results

Once the batch has ended, stream the results. Each result carries a type — and a robust pipeline handles all of them, not just the happy path:

for result in client.messages.batches.results(message_batch.id):
    match result.result.type:
        case "succeeded":
            msg = result.result.message
            text = next((b.text for b in msg.content if b.type == "text"), "")
            print(f"[{result.custom_id}] {text[:120]}")
        case "errored":
            if result.result.error.type == "invalid_request":
                print(f"[{result.custom_id}] Validation error — fix the request and resubmit")
            else:
                print(f"[{result.custom_id}] Server error — safe to retry")
        case "canceled":
            print(f"[{result.custom_id}] Canceled")
        case "expired":
            print(f"[{result.custom_id}] Expired — resubmit")

The four result types map cleanly to actions: succeeded → use the message; errored with invalid_request → a bug in your request, fix and resend; errored with any other error type → a transient server error, safe to retry; canceled → the batch was canceled before this request ran, resubmit only if you still need it; expired → the 24-hour window closed, resubmit. Build retry logic around these from day one and a 100,000-request job becomes boringly reliable.
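To turn that mapping into retry logic, one option is to sort results into buckets first and decide what to resubmit afterwards. The sketch below is illustrative: `triage_results` is a hypothetical helper, and it assumes only that each result exposes `custom_id`, `result.type`, and (for errors) `result.error.type`, as in the loop above.

```python
def triage_results(results):
    """Sort batch results into buckets keyed by the action they need.

    done:     custom_id -> message, for succeeded requests
    fix:      custom_ids whose request was invalid and needs editing
    retry:    custom_ids that are safe to resubmit unchanged
    canceled: custom_ids that never ran because the batch was canceled
    """
    buckets = {"done": {}, "fix": [], "retry": [], "canceled": []}
    for item in results:
        kind = item.result.type
        if kind == "succeeded":
            buckets["done"][item.custom_id] = item.result.message
        elif kind == "errored":
            if item.result.error.type == "invalid_request":
                buckets["fix"].append(item.custom_id)
            else:
                buckets["retry"].append(item.custom_id)
        elif kind == "expired":
            buckets["retry"].append(item.custom_id)
        elif kind == "canceled":
            buckets["canceled"].append(item.custom_id)
    return buckets
```

Passing `client.messages.batches.results(message_batch.id)` to this function would give you a `retry` list whose original requests can be submitted as a new batch.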

Stack batching with prompt caching for even bigger savings

Batch discounts stack with prompt caching. If every request in your batch shares a large common prefix (a long system prompt, a reference document, a set of few-shot examples), mark it cacheable and every request that hits the cache reads that prefix at roughly a tenth of the input price, on top of the 50% batch discount. Because batch requests are processed concurrently and in no fixed order, treat cache hits as best-effort rather than guaranteed:

shared_system = [
    {"type": "text", "text": "You are a meticulous literary analyst."},
    {
        "type": "text",
        "text": large_reference_document,   # shared across all requests
        "cache_control": {"type": "ephemeral"},
    },
]

message_batch = client.messages.batches.create(
    requests=[
        Request(
            custom_id=f"analysis-{i}",
            params=MessageCreateParamsNonStreaming(
                model="claude-opus-4-8",
                max_tokens=1024,
                system=shared_system,
                messages=[{"role": "user", "content": question}],
            ),
        )
        for i, question in enumerate(questions)
    ]
)

For a deeper dive on what invalidates a cache and how to place breakpoints, see the official Anthropic prompt caching and batch processing documentation.
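To get a feel for how much the shared prefix costs at different cache hit rates, a rough estimate can be sketched as follows. Everything here is an assumption to adjust: `shared_prefix_cost` is a hypothetical helper, the 0.1 read multiplier reflects the "roughly a tenth" figure above, the 1.25 write multiplier for cache misses is an assumed value that you should check against current pricing, and the model treats every miss as a cache write.

```python
def shared_prefix_cost(n_requests, prefix_tokens, input_price_per_mtok,
                       hit_rate, cache_read_multiplier=0.1,
                       cache_write_multiplier=1.25, batch_discount=0.5):
    """Estimate the batched input cost of the shared prefix alone.

    hit_rate is the fraction of requests that read the prefix from cache.
    Both multipliers are assumptions, expressed relative to the normal
    input price; misses are modeled as paying the cache-write price.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    hits = n_requests * hit_rate
    misses = n_requests - hits
    weighted_requests = (hits * cache_read_multiplier
                         + misses * cache_write_multiplier)
    base_cost_per_request = prefix_tokens / 1_000_000 * input_price_per_mtok
    return weighted_requests * base_cost_per_request * (1 - batch_discount)
```

Comparing the result at a pessimistic and an optimistic `hit_rate` gives a range rather than a single number, which suits a best-effort cache.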

Real-world use cases that fit batching perfectly

Bulk document & invoice extraction — pair batches with Claude Vision to parse thousands of PDFs overnight. See AI invoice data extraction with Claude Vision.
Ticket and email classification — score an entire backlog in one job. See ticket classification and routing with Claude.
Evaluation harnesses — run a whole eval set against a new prompt at half cost. See building a practical eval harness in Python.
Content generation pipelines — generate descriptions, summaries, or translations for a full catalog.

Production best practices and gotchas

Always set a unique custom_id. Results are unordered; this is your only join key.
Don’t block a web process polling for an hour. Persist the batch ID and poll from a background worker or scheduled task.
Download results within 29 days. After that the result set is no longer retrievable.
Plan for the 24-hour expiry. Most batches finish in under an hour, but build resubmit logic for the expired case.
Right-size max_tokens. Output tokens cost real money even at 50% off — don’t request 16K tokens for a one-word classification.
Validate before you submit. A malformed request returns invalid_request per-item; it won’t sink the whole batch, but it wastes a cycle.
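The advice above about persisting the batch ID can be as simple as a small state file that a background worker reads on each run. This is a minimal sketch under stated assumptions: the helper names and the JSON file layout are hypothetical, and the file handling is not safe for concurrent writers, so a database row would be the sturdier choice in production.

```python
import json
from pathlib import Path


def _load(state_path):
    """Read the state file, or return an empty dict if it does not exist."""
    path = Path(state_path)
    return json.loads(path.read_text()) if path.exists() else {}


def record_batch(state_path, batch_id):
    """Remember a newly submitted batch so a later process can poll it."""
    state = _load(state_path)
    state[batch_id] = {"status": "submitted"}
    Path(state_path).write_text(json.dumps(state, indent=2))


def mark_collected(state_path, batch_id):
    """Flag a batch whose results have been downloaded and stored."""
    state = _load(state_path)
    state[batch_id]["status"] = "collected"
    Path(state_path).write_text(json.dumps(state, indent=2))


def pending_batches(state_path):
    """Return batch IDs that still need polling or result download."""
    return [batch_id for batch_id, info in _load(state_path).items()
            if info["status"] != "collected"]
```

A scheduled task could call `pending_batches`, retrieve each batch, and call `mark_collected` once results are safely in your own storage, well inside the 29-day window.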

Frequently asked questions

How much cheaper is the Claude Message Batches API?

You pay 50% of standard pricing on both input and output tokens for every request processed through a batch. The discount applies to any supported model.

How many requests can one batch hold?

Up to 100,000 requests or 256 MB of payload, whichever limit you reach first. For larger jobs, split the work across multiple batches.

How long does a batch take to process?

Most batches complete within an hour. The maximum processing window is 24 hours; any request not finished by then is marked expired and should be resubmitted.

Can I use tools, vision, or prompt caching inside a batch?

Yes. The Batches API supports every Messages API feature, and the batch discount stacks with prompt caching for even lower input costs on shared context.

How long are results available?

Batch results remain downloadable for 29 days after the batch is created. Persist them to your own storage if you need them longer.

Conclusion

If your workload is bulk and offline, the Claude Message Batches API is close to free money: the same model, the same prompts, half the token cost, and room for up to 100,000 requests per batch. Create a batch, poll processing_status, handle the four result types, and layer prompt caching on top of shared context. Wire in resubmit logic for expirations and you have a pipeline that scales to millions of requests without scaling your bill the same way.

Want to push costs down further? Read our companion guide on model routing and cost optimization with Claude.