TL;DR
- Claude’s vision API accepts base64-encoded images as content blocks alongside text, making multimodal calls a straightforward extension of the standard messages API.
- A single script can handle four distinct vision tasks: scene description, OCR-style text extraction, product classification, and chart Q&A, each using only a different prompt and image.
- Use claude-haiku-4-5 for high-volume classification pipelines and claude-sonnet-4-6 when you need detailed descriptions or structured data extraction from complex visuals.
- Token costs for images depend on image resolution; resizing large images before sending can cut input token counts by 60-80% with minimal quality loss on most tasks.
- Claude vision image analysis integrates cleanly with structured output via tool use, so you can return typed Python objects rather than raw strings from every vision call.
- The full POC below is a single Python script of roughly 400 lines, depends only on the Anthropic SDK, Pillow, and python-dotenv, and can process a single image or a folder of images from the command line.
Why Vision Matters for Production Apps
Most teams building AI features start with text: chatbots, summarizers, classifiers. Vision gets added later, often as an afterthought, because it feels harder. In practice, the Anthropic SDK makes image input nearly identical to text input. You swap a string for an image content block. That’s most of the change.
The business value is significant. Consider where unstructured visual data sits in a typical company’s workflow:
- Support tickets arrive with screenshots that agents have to manually interpret before routing.
- E-commerce catalogs contain product photos that need automated tagging for search.
- Finance teams export charts from dashboards and paste them into slide decks where someone has to re-read the numbers.
- Logistics teams photograph damaged parcels that need to be classified and linked to a claim.
- Document-heavy industries (legal, medical, construction) work with scanned forms, site photos, and annotated blueprints.
In every case, the current solution is either manual human review or a specialized CV model trained on narrow domain data. Claude vision image analysis replaces both with a single general model that accepts natural language instructions. You don’t need a labeled dataset. You don’t need to retrain when requirements change. You write a prompt.
This article covers four concrete tasks: describing an image, extracting text from it, classifying a product photo, and answering a question about a chart. Each is a production-relevant pattern. We’ll build them all into one script, then talk about cost, latency, and the pitfalls teams hit when they first ship vision features.
How Claude Vision Image Analysis Works Under the Hood
The Content Block Model
Every Claude API call is a list of messages. Each message has a role and content. Content, in the multimodal case, is a list of blocks. A block can be text or image. That’s it. There’s no separate endpoint, no multimodal-specific client method, no configuration flag.
An image block looks like this:
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/png",
"data": "<base64-encoded bytes>"
}
}
You can include multiple image blocks in a single message, mix them with text blocks, and position them before or after the text. Claude reads the conversation as an ordered sequence, so placing the image before the question is slightly more natural, but both orderings work.
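As a minimal sketch of the ordering described above, the helper below assembles one user message from several image blocks followed by a single text block. The name `build_user_message` and the placeholder base64 strings are hypothetical and only for illustration; the block shapes are the ones shown in the schema above.

```python
from typing import Any


def build_user_message(images: list[tuple[str, str]], prompt: str) -> dict[str, Any]:
    """Return a user message with every image block first, then the prompt.

    `images` holds (base64_data, media_type) pairs. Hypothetical helper, not an SDK function.
    """
    content: list[dict[str, Any]] = []
    for b64_data, media_type in images:
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": b64_data},
        })
    # The question goes last so it reads as referring to the images above it.
    content.append({"type": "text", "text": prompt})
    return {"role": "user", "content": content}


message = build_user_message(
    [("<base64 of photo>", "image/jpeg"), ("<base64 of label>", "image/png")],
    "Which product in the first image matches the label in the second?",
)
print([block["type"] for block in message["content"]])  # ['image', 'image', 'text']
```

The returned dict can be passed as one element of the `messages` list in a normal API call.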
What Claude Sees
Claude works from the image itself. From the API's point of view there is no separate OCR step whose text output Claude then reads, and you never handle an intermediate embedding vector. The image and the text are interpreted together in one request, which is why Claude can answer questions that require correlating text in the image with visual structure, for example: “what does the tallest bar in this chart represent?”
The practical implication is that you can ask genuinely multimodal questions. “Which product in this photo matches the description in the label below the image?” is a valid prompt.
Supported Formats and Limits
| Format | MIME type | Typical use |
|---|---|---|
| PNG | image/png | Screenshots, diagrams, UI capture |
| JPEG | image/jpeg | Product photos, site photography |
| GIF | image/gif | Single-frame only; use PNG instead |
| WebP | image/webp | Web assets; good compression ratio |
The current limit is 20 MB per image after base64 encoding. In practice, anything over 2000px on its longest edge is larger than Claude needs for most tasks. A 4K image adds tokens without improving answer quality for most description or classification tasks. Resize before sending.
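To make the resize advice concrete, here is a small sketch that computes the target dimensions for a given long-edge cap and gives a rough token estimate. The estimate uses the width × height / 750 approximation from Anthropic's vision documentation; treat that formula, and the helper names, as assumptions to check against the current docs rather than billing guarantees.

```python
def fit_long_edge(width: int, height: int, max_long_edge: int = 1568) -> tuple[int, int]:
    """Scale (width, height) down so the longest edge is at most max_long_edge.

    Aspect ratio is preserved; images already within the cap are returned unchanged.
    """
    long_edge = max(width, height)
    if long_edge <= max_long_edge:
        return width, height
    scale = max_long_edge / long_edge
    return max(1, round(width * scale)), max(1, round(height * scale))


def estimate_image_tokens(width: int, height: int) -> int:
    """Rough token estimate, assuming tokens ~= (width * height) / 750."""
    return round(width * height / 750)


resized = fit_long_edge(3840, 2160)
print(resized)                          # (1568, 882)
print(estimate_image_tokens(*resized))  # 1844
```

Running the estimate on your own typical image sizes is a quick way to see whether resizing is worth adding to a pipeline.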
The Four Vision Tasks: What Each One Is Good For
Task 1: Scene and Object Description
Ask Claude to describe an image and you get structured natural language output. The model identifies objects, their spatial relationships, colors, text visible in the image, and contextual clues about the setting. This is useful for:
- Generating alt text for accessibility compliance at scale.
- Building a searchable metadata index over a photo library.
- Automating content moderation pre-screening before a human review step.
- Creating captions for social media posts from product photography.
The key prompt design choice is specificity. “Describe this image” produces a general paragraph. “List the objects in this image as a JSON array with a confidence score for each” produces something you can actually process downstream. This connects to the structured output pattern covered in Part 3: Structured Output from Claude.
Task 2: Text Extraction from Images
Claude reads text embedded in images accurately, including handwritten notes, printed labels, scanned forms, and on-screen UI text. It is not a traditional OCR engine (it won’t give you bounding boxes or per-character confidence scores), but for extracting the content of text in natural language form, it works well and handles messy real-world images better than many classical OCR tools.
Practical applications include:
- Reading product SKUs and prices from shelf photos in retail audits.
- Extracting field values from scanned forms before routing to a CRM.
- Parsing meter readings, serial numbers, or barcode text from field-service photos.
- Pulling ingredient lists or nutritional data from food packaging images.
For structured document extraction (invoices, PDFs), see Part 20: Extract Data from PDFs and Invoices Using Claude Vision, which covers the full pipeline including multi-page handling.
Task 3: Product Photo Classification
Given a product image and a taxonomy, Claude assigns a category. You can make this zero-shot (just describe the taxonomy in the prompt) or few-shot (include example images with their correct labels as earlier content blocks). For most product catalogs, zero-shot with a well-written taxonomy description performs adequately. Few-shot examples are worth adding when category boundaries are ambiguous or domain-specific.
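The few-shot variant can be sketched as a content list in which each labeled example image is followed by its category, and the image to classify comes last. The helper names and the exact prompt wording below are illustrative assumptions, not a prescribed format.

```python
from typing import Any

Block = dict[str, Any]


def _image(b64_data: str, media_type: str) -> Block:
    """Build a base64 image content block."""
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": b64_data},
    }


def few_shot_content(
    examples: list[tuple[str, str, str]],  # (base64_data, media_type, correct_label)
    target: tuple[str, str],               # (base64_data, media_type)
    categories: list[str],
) -> list[Block]:
    """Return content blocks: taxonomy, labeled examples, then the target image."""
    content: list[Block] = [
        {"type": "text", "text": "Allowed categories: " + ", ".join(categories)}
    ]
    for b64_data, media_type, label in examples:
        content.append(_image(b64_data, media_type))
        content.append({"type": "text", "text": f"Category: {label}"})
    content.append(_image(*target))
    content.append(
        {"type": "text", "text": "Classify this last image. Reply with one category from the list."}
    )
    return content


blocks = few_shot_content(
    examples=[("<b64 shoe>", "image/jpeg", "Clothing"), ("<b64 lamp>", "image/jpeg", "Furniture")],
    target=("<b64 unknown>", "image/jpeg"),
    categories=["Electronics", "Clothing", "Furniture", "Other"],
)
print([b["type"] for b in blocks])
```

Each example image adds its own image tokens to every call, so a handful of well-chosen examples is usually a better trade than a large set.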
The structured output approach works particularly well here: define a tool whose input schema includes category, subcategory, and confidence fields, force Claude to call it, and read block.input as a typed dict. This is the same tool-use pattern described in Part 2: Tool Use with Claude.
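A forced tool call gives you a dict with the right keys, but not a guarantee that the values are sensible. The sketch below checks a tool input against the taxonomy and flags low-confidence results for review. The function name, the `needs_review` field, and the 0.8 threshold are assumptions for illustration; the confidence value is self-reported by the model, so the threshold is a heuristic rather than a calibrated probability.

```python
def validate_classification(
    tool_input: dict, categories: list[str], min_confidence: float = 0.8
) -> dict:
    """Return the tool input plus a list of problems and a needs_review flag."""
    problems: list[str] = []

    category = tool_input.get("category")
    if category not in categories:
        problems.append(f"category {category!r} is not in the taxonomy")

    confidence = tool_input.get("confidence")
    valid_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not valid_number or not 0.0 <= confidence <= 1.0:
        problems.append(f"confidence {confidence!r} is not a number in [0, 1]")

    # Short-circuits before comparing when the confidence value itself is invalid.
    needs_review = bool(problems) or confidence < min_confidence
    return {**tool_input, "problems": problems, "needs_review": needs_review}


checked = validate_classification(
    {"category": "Bags and Luggage", "subcategory": "Laptop Bags", "confidence": 0.97},
    ["Electronics", "Clothing", "Bags and Luggage", "Other"],
)
print(checked["needs_review"])  # False
```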
Task 4: Chart and Data Visualization Q&A
This is the one that surprises people the most. Claude can read bar charts, line graphs, pie charts, scatter plots, and tables rendered as images, then answer questions about the data. It can extract approximate values, identify trends, compare series, and spot anomalies.
“Approximate” is the operative word. For a bar chart with clear axis labels and gridlines, Claude reads values to within a few percent of the true number. For complex or poorly labeled charts, precision drops. The pattern is useful for: analytics reporting pipelines where a human will verify important numbers, early-stage data exploration where you want a quick narrative summary, and accessibility tooling that converts charts to text descriptions.
Choosing the Right Model for Claude Vision Image Analysis
| Task | Recommended model | Why | Approx. input price per 1M tokens (image tokens bill as input) |
|---|---|---|---|
| Product classification (binary / fixed taxonomy) | claude-haiku-4-5 | Fast, cheap; the taxonomy is in the prompt, so deep reasoning is not needed | $0.25 |
| Scene description / alt text generation | claude-sonnet-4-6 | Balanced; good detail without Opus-level cost | $3 |
| Text extraction from messy/handwritten images | claude-sonnet-4-6 | Haiku struggles with low-quality scans; Sonnet handles them well | $3 |
| Chart Q&A with precise numerical reads | claude-opus-4-8 | Complex reasoning over visual data benefits from the most capable model | $15 |
| Multi-image batch processing (high volume) | claude-haiku-4-5 | Cost scales with volume; use Haiku unless quality is insufficient | $0.25 |
The routing pattern from Part 27: Cut AI Costs with Model Routing and Batching applies directly here: run a cheap classifier on incoming images to route simple cases to Haiku and complex ones to Sonnet or Opus.
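A minimal sketch of that routing idea, assuming the upstream cheap classifier has already produced a boolean "this input looks complex" signal. The default mapping mirrors the recommendations in the table above; the escalation table and the function name are illustrative choices, not part of any SDK.

```python
# Default model per task, following the table above.
DEFAULT_MODEL_BY_TASK = {
    "classify": "claude-haiku-4-5",
    "describe": "claude-sonnet-4-6",
    "extract_text": "claude-sonnet-4-6",
    "chart_qa": "claude-sonnet-4-6",
}

# One step up for inputs flagged as complex (assumed escalation order).
ESCALATION = {
    "claude-haiku-4-5": "claude-sonnet-4-6",
    "claude-sonnet-4-6": "claude-opus-4-8",
}


def pick_model(task: str, complex_input: bool = False) -> str:
    """Return the model for a task, escalating one tier for complex inputs."""
    model = DEFAULT_MODEL_BY_TASK[task]
    if complex_input:
        model = ESCALATION.get(model, model)
    return model


print(pick_model("classify"))                      # claude-haiku-4-5
print(pick_model("chart_qa", complex_input=True))  # claude-opus-4-8
```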
The Full POC: Four Vision Tasks in One Script
Project Layout
This script accepts a single image or a folder of images, runs the task selected with --task on each image, and prints the results as JSON (optionally writing them to a file with --output). It uses base64 encoding (the simplest path, no URL serving required), can send multiple image blocks in one API call for the chart task (via --reference), and returns structured output via tool use for the classification task.
Install and Setup
pip install anthropic pillow python-dotenv
requirements.txt
anthropic>=0.28.0
pillow>=10.0.0
python-dotenv>=1.0.0
.env
ANTHROPIC_API_KEY=sk-ant-your-key-here
Full Source: vision_tasks.py
"""
vision_tasks.py
Four vision tasks in one script using the Anthropic SDK:
1. Image description
2. Text extraction
3. Product classification (structured output via tool use)
4. Chart Q&A
Usage:
python vision_tasks.py --image path/to/image.jpg --task describe
python vision_tasks.py --image path/to/image.jpg --task extract_text
python vision_tasks.py --image path/to/product.jpg --task classify --taxonomy "Electronics,Clothing,Food,Furniture,Other"
python vision_tasks.py --image path/to/chart.png --task chart_qa --question "Which category has the highest value?"
python vision_tasks.py --folder path/to/images/ --task describe --output results.json
"""
import argparse
import base64
import json
import os
import sys
import time
from pathlib import Path
from typing import Optional
import anthropic
from dotenv import load_dotenv
try:
    from PIL import Image
    PIL_AVAILABLE = True
except ImportError:
    PIL_AVAILABLE = False
load_dotenv()
# ---------------------------------------------------------------------------
# Client
# ---------------------------------------------------------------------------
client = anthropic.Anthropic() # reads ANTHROPIC_API_KEY from environment
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}
MAX_LONG_EDGE_PX = 1568  # Resize images larger than this; saves tokens with minimal quality loss


def load_image_as_b64(image_path: str, max_long_edge: int = MAX_LONG_EDGE_PX) -> tuple[str, str]:
    """
    Load an image from disk, optionally resize it, and return (base64_data, media_type).
    Resizing is only applied when Pillow is available.
    """
    path = Path(image_path)
    suffix = path.suffix.lower()
    media_type_map = {
        ".jpg": "image/jpeg",
        ".jpeg": "image/jpeg",
        ".png": "image/png",
        ".gif": "image/gif",
        ".webp": "image/webp",
    }
    if suffix not in media_type_map:
        raise ValueError(f"Unsupported image format: {suffix}")
    media_type = media_type_map[suffix]
    if PIL_AVAILABLE:
        import io
        with Image.open(image_path) as img:
            # Convert palette images to RGB to avoid mode issues
            if img.mode in ("P", "RGBA") and suffix in (".jpg", ".jpeg"):
                img = img.convert("RGB")
            # Resize if needed to reduce token cost
            w, h = img.size
            if max(w, h) > max_long_edge:
                scale = max_long_edge / max(w, h)
                img = img.resize((int(w * scale), int(h * scale)), Image.LANCZOS)
            # Re-encode in the original format so media_type stays accurate
            buf = io.BytesIO()
            save_format = "JPEG" if suffix in (".jpg", ".jpeg") else suffix.lstrip(".").upper()
            if save_format == "JPG":
                save_format = "JPEG"
            img.save(buf, format=save_format)
            raw_bytes = buf.getvalue()
    else:
        with open(image_path, "rb") as f:
            raw_bytes = f.read()
    b64_data = base64.standard_b64encode(raw_bytes).decode("utf-8")
    return b64_data, media_type
def image_block(b64_data: str, media_type: str) -> dict:
    """Build a Claude image content block."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": b64_data,
        },
    }


def text_block(text: str) -> dict:
    """Build a plain text content block."""
    return {"type": "text", "text": text}


def call_with_retry(fn, max_retries: int = 3, base_delay: float = 2.0):
    """Simple exponential backoff wrapper for API calls."""
    for attempt in range(max_retries):
        try:
            return fn()
        except anthropic.RateLimitError:
            if attempt == max_retries - 1:
                raise
            wait = base_delay * (2 ** attempt)
            print(f" Rate limit hit, waiting {wait:.0f}s before retry {attempt + 2}/{max_retries}...")
            time.sleep(wait)
        except anthropic.APIError as exc:
            if attempt == max_retries - 1:
                raise
            print(f" API error ({exc}), retrying {attempt + 2}/{max_retries}...")
            time.sleep(base_delay)
# ---------------------------------------------------------------------------
# Task 1: Image Description
# ---------------------------------------------------------------------------
def describe_image(image_path: str) -> dict:
    """
    Ask Claude to produce a structured description of the image.
    Returns a dict with 'description', 'objects', 'text_visible', and 'model_used'.
    """
    b64, media = load_image_as_b64(image_path)

    def _call():
        return client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=(
                "You are a precise image analyst. When given an image, respond with valid JSON only. "
                "No markdown fences, no commentary outside the JSON."
            ),
            messages=[
                {
                    "role": "user",
                    "content": [
                        image_block(b64, media),
                        text_block(
                            "Describe this image. Return a JSON object with these keys:\n"
                            " description (string): one clear paragraph describing the scene\n"
                            " objects (array of strings): main objects or subjects visible\n"
                            " dominant_colors (array of strings): 2-4 dominant colors\n"
                            " text_visible (string): any text you can read in the image, or empty string\n"
                            " setting (string): indoor / outdoor / digital / document / other"
                        ),
                    ],
                }
            ],
        )

    msg = call_with_retry(_call)
    raw = msg.content[0].text.strip()
    try:
        result = json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to the raw text if the model did not return valid JSON
        result = {"raw_response": raw}
    result["model_used"] = "claude-sonnet-4-6"
    result["input_tokens"] = msg.usage.input_tokens
    result["output_tokens"] = msg.usage.output_tokens
    return result
# ---------------------------------------------------------------------------
# Task 2: Text Extraction
# ---------------------------------------------------------------------------
def extract_text(image_path: str) -> dict:
    """
    Extract all readable text from the image.
    Preserves layout using whitespace where possible.
    """
    b64, media = load_image_as_b64(image_path)

    def _call():
        return client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            system=(
                "You extract text from images. Reproduce the text you see exactly, "
                "preserving line breaks and indentation. If no text is present, reply with the single word NONE."
            ),
            messages=[
                {
                    "role": "user",
                    "content": [
                        image_block(b64, media),
                        text_block(
                            "Extract all text visible in this image. "
                            "Preserve formatting as closely as possible. "
                            "If there are multiple distinct text regions (e.g., a header and a table), "
                            "separate them with a blank line and a label like [HEADER], [TABLE], [LABEL]."
                        ),
                    ],
                }
            ],
        )

    msg = call_with_retry(_call)
    extracted = msg.content[0].text.strip()
    return {
        "extracted_text": extracted,
        "has_text": extracted.upper() != "NONE",
        "model_used": "claude-sonnet-4-6",
        "input_tokens": msg.usage.input_tokens,
        "output_tokens": msg.usage.output_tokens,
    }
# ---------------------------------------------------------------------------
# Task 3: Product Classification (structured output via tool use)
# ---------------------------------------------------------------------------
CLASSIFY_TOOL = {
    "name": "submit_classification",
    "description": "Submit the product classification result.",
    "input_schema": {
        "type": "object",
        "properties": {
            "category": {
                "type": "string",
                "description": "The top-level category from the provided taxonomy.",
            },
            "subcategory": {
                "type": "string",
                "description": "A more specific subcategory within the top-level category.",
            },
            "confidence": {
                "type": "number",
                "description": "Confidence score from 0.0 to 1.0.",
            },
            "reasoning": {
                "type": "string",
                "description": "One sentence explaining the classification.",
            },
        },
        "required": ["category", "subcategory", "confidence", "reasoning"],
    },
}
def classify_product(image_path: str, taxonomy: str) -> dict:
    """
    Classify a product image into the provided taxonomy.
    Uses tool use to get structured output.
    taxonomy: comma-separated category names, e.g. "Electronics,Clothing,Food,Furniture,Other"
    """
    b64, media = load_image_as_b64(image_path)
    categories = [c.strip() for c in taxonomy.split(",")]

    def _call():
        return client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=512,
            system="You are a product catalog classifier. Classify the product in the image into the given taxonomy.",
            tools=[CLASSIFY_TOOL],
            # Force the model to answer through the tool so the output is structured
            tool_choice={"type": "tool", "name": "submit_classification"},
            messages=[
                {
                    "role": "user",
                    "content": [
                        image_block(b64, media),
                        text_block(
                            f"Classify this product into one of the following categories: {', '.join(categories)}.\n"
                            "Use the submit_classification tool to return your answer."
                        ),
                    ],
                }
            ],
        )

    msg = call_with_retry(_call)
    result = {}
    for block in msg.content:
        if block.type == "tool_use" and block.name == "submit_classification":
            # Copy so the metadata added below does not mutate the response object
            result = dict(block.input)
            break
    result["model_used"] = "claude-haiku-4-5"
    result["input_tokens"] = msg.usage.input_tokens
    result["output_tokens"] = msg.usage.output_tokens
    return result
# ---------------------------------------------------------------------------
# Task 4: Chart Q&A (multi-image block example)
# ---------------------------------------------------------------------------
def chart_qa(chart_path: str, question: str, reference_image_path: Optional[str] = None) -> dict:
    """
    Answer a question about a chart image.
    Optionally accepts a second reference image (e.g., a legend) as an additional block.
    """
    b64_chart, media_chart = load_image_as_b64(chart_path)
    content_blocks = []
    if reference_image_path:
        # Demonstrate sending multiple image blocks in one call
        b64_ref, media_ref = load_image_as_b64(reference_image_path)
        content_blocks.append(text_block("[Reference image for context:]"))
        content_blocks.append(image_block(b64_ref, media_ref))
        content_blocks.append(text_block("[Chart to analyze:]"))
    content_blocks.append(image_block(b64_chart, media_chart))
    content_blocks.append(
        text_block(
            f"Answer the following question about this chart:\n\n{question}\n\n"
            "Be specific. If you are reading a numerical value, state the approximate number. "
            "If the chart does not contain enough information to answer, say so clearly."
        )
    )

    def _call():
        return client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            system=(
                "You are a data analyst who reads charts accurately. "
                "When answering questions about chart values, give specific numbers where visible. "
                "State your confidence if a value is hard to read precisely."
            ),
            messages=[{"role": "user", "content": content_blocks}],
        )

    msg = call_with_retry(_call)
    answer = msg.content[0].text.strip()
    return {
        "question": question,
        "answer": answer,
        "model_used": "claude-sonnet-4-6",
        "input_tokens": msg.usage.input_tokens,
        "output_tokens": msg.usage.output_tokens,
    }
# ---------------------------------------------------------------------------
# Batch folder processing
# ---------------------------------------------------------------------------
def process_folder(folder: str, task: str, **kwargs) -> list[dict]:
    """Run a task over every image in a folder. Returns list of results."""
    results = []
    folder_path = Path(folder)
    image_files = [
        f for f in sorted(folder_path.iterdir())
        if f.suffix.lower() in SUPPORTED_EXTENSIONS
    ]
    if not image_files:
        print(f"No supported images found in {folder}")
        return results
    print(f"Processing {len(image_files)} images with task '{task}'...")
    for idx, img_path in enumerate(image_files, 1):
        print(f" [{idx}/{len(image_files)}] {img_path.name}")
        try:
            if task == "describe":
                result = describe_image(str(img_path))
            elif task == "extract_text":
                result = extract_text(str(img_path))
            elif task == "classify":
                taxonomy = kwargs.get("taxonomy", "Electronics,Clothing,Food,Furniture,Other")
                result = classify_product(str(img_path), taxonomy)
            elif task == "chart_qa":
                question = kwargs.get("question", "What is the main trend shown in this chart?")
                result = chart_qa(str(img_path), question)
            else:
                print(f" Unknown task: {task}")
                continue
            result["file"] = str(img_path)
            results.append(result)
        except (anthropic.APIError, OSError, ValueError) as exc:
            # OSError also covers unreadable or corrupt image files, so one bad
            # file is recorded as an error instead of aborting the whole batch
            print(f" ERROR processing {img_path.name}: {exc}")
            results.append({"file": str(img_path), "error": str(exc)})
    return results
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def main():
    parser = argparse.ArgumentParser(description="Claude vision tasks demo")
    parser.add_argument("--image", help="Path to a single image file")
    parser.add_argument("--folder", help="Path to a folder of images (batch mode)")
    parser.add_argument(
        "--task",
        required=True,
        choices=["describe", "extract_text", "classify", "chart_qa"],
        help="Which vision task to run",
    )
    parser.add_argument("--taxonomy", default="Electronics,Clothing,Food,Furniture,Other",
                        help="Comma-separated taxonomy for classify task")
    parser.add_argument("--question", default="What is the main trend shown in this chart?",
                        help="Question to ask for chart_qa task")
    parser.add_argument("--reference", help="Optional second image for chart_qa (e.g. legend)")
    parser.add_argument("--output", help="Write JSON results to this file")
    args = parser.parse_args()
    if not args.image and not args.folder:
        parser.error("Provide either --image or --folder")
    if args.folder:
        results = process_folder(
            args.folder, args.task,
            taxonomy=args.taxonomy,
            question=args.question,
        )
        output_data = results
    else:
        img = args.image
        if args.task == "describe":
            output_data = describe_image(img)
        elif args.task == "extract_text":
            output_data = extract_text(img)
        elif args.task == "classify":
            output_data = classify_product(img, args.taxonomy)
        elif args.task == "chart_qa":
            output_data = chart_qa(img, args.question, args.reference)
        else:
            print(f"Unknown task: {args.task}")
            sys.exit(1)
    print(json.dumps(output_data, indent=2))
    if args.output:
        with open(args.output, "w") as f:
            json.dump(output_data, f, indent=2)
        print(f"\nResults written to {args.output}")


if __name__ == "__main__":
    main()
Sample Runs with Realistic Output
Task 1: Describe a product photo
$ python vision_tasks.py --image product_shoe.jpg --task describe
{
"description": "A white leather athletic sneaker photographed on a clean white background. The shoe faces left at a 45-degree angle, showing the full side profile and partial top. The midsole is light grey with a subtle texture pattern. A small brand logo in orange is visible on the heel tab.",
"objects": ["sneaker", "shoelace", "midsole", "heel tab", "brand logo"],
"dominant_colors": ["white", "light grey", "orange"],
"text_visible": "SPRINT",
"setting": "digital",
"model_used": "claude-sonnet-4-6",
"input_tokens": 1842,
"output_tokens": 118
}
Task 2: Extract text from a form image
$ python vision_tasks.py --image invoice_scan.png --task extract_text
{
"extracted_text": "[HEADER]\nTAX INVOICE\nInvoice #: INV-2024-00847\nDate: 15 March 2024\n\n[TABLE]\nItem Qty Unit Price Total\nStainless Bracket M6 4 $12.50 $50.00\nHex Bolt Set 50pc 2 $8.75 $17.50\nMounting Kit Pro 1 $34.00 $34.00\n\n[FOOTER]\nSubtotal: $101.50\nGST (10%): $10.15\nTotal Due: $111.65\nPayment due 30 days from invoice date.",
"has_text": true,
"model_used": "claude-sonnet-4-6",
"input_tokens": 2104,
"output_tokens": 156
}
Task 3: Classify a product photo
$ python vision_tasks.py --image laptop_bag.jpg --task classify \
--taxonomy "Electronics,Clothing,Bags and Luggage,Food,Furniture,Other"
{
"category": "Bags and Luggage",
"subcategory": "Laptop Bags",
"confidence": 0.97,
"reasoning": "The image shows a padded neoprene sleeve with a zipper, sized to fit a 15-inch laptop, with a visible carry handle.",
"model_used": "claude-haiku-4-5",
"input_tokens": 1456,
"output_tokens": 78
}
Task 4: Chart Q&A
$ python vision_tasks.py --image q1_revenue_bar_chart.png --task chart_qa \
--question "Which month had the highest revenue, and what was the approximate value?"
{
"question": "Which month had the highest revenue, and what was the approximate value?",
"answer": "March had the highest revenue, with a bar reaching approximately $2.4 million based on the y-axis scale. This is around 18% higher than the next highest month (February at roughly $2.0 million). The y-axis is labeled in millions of USD with gridlines at each $0.5M increment.",
"model_used": "claude-sonnet-4-6",
"input_tokens": 1783,
"output_tokens": 89
}
Architecture Patterns for Production Vision Pipelines
Async Processing with a Queue
Vision calls are slower than text calls because the image tokens add to the prompt processing time. For a user-facing feature, you generally do not want to block the HTTP response on the Claude call. The standard pattern is:
- Accept the image upload, store it in object storage, return a job ID immediately (HTTP 202).
- Enqueue a background job with the image path and task parameters.
- The worker calls the Claude API and writes results to a database row keyed by job ID.
- The client polls or receives a webhook when results are ready.
This is the same pattern used in Part 13: Build a Customer Support Agent with Claude for async tool calls. The vision POC above fits into this pattern without modification: the describe_image, classify_product, etc. functions are already synchronous, side-effect-free units that a worker can call.
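The four steps can be sketched in-process with the standard library. This is a toy under stated assumptions: a dict stands in for the database, `queue.Queue` for the job queue, and `fake_vision_task` for a real call such as describe_image. In production you would swap each for your actual storage, queue, and worker infrastructure.

```python
import queue
import threading
import uuid

jobs: dict[str, dict] = {}   # stands in for a database table keyed by job ID
job_queue = queue.Queue()    # stands in for a real job queue


def fake_vision_task(image_path: str, task: str) -> dict:
    """Placeholder for describe_image / classify_product etc. (no API call)."""
    return {"file": image_path, "task": task, "description": "placeholder result"}


def submit_job(image_path: str, task: str) -> str:
    """Steps 1-2: record the job, enqueue it, and return the job ID at once."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "queued", "result": None}
    job_queue.put((job_id, image_path, task))
    return job_id


def worker() -> None:
    """Step 3: pull jobs, run the vision task, store the result."""
    while True:
        item = job_queue.get()
        if item is None:  # sentinel: shut the worker down
            job_queue.task_done()
            break
        job_id, image_path, task = item
        try:
            jobs[job_id] = {"status": "done", "result": fake_vision_task(image_path, task)}
        except Exception as exc:
            jobs[job_id] = {"status": "failed", "result": str(exc)}
        finally:
            job_queue.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

job_id = submit_job("uploads/shoe.jpg", "describe")
job_queue.join()                # step 4 would be a poll or webhook; here we just wait
print(jobs[job_id]["status"])   # done

job_queue.put(None)
thread.join()
```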
Caching Frequently Seen Images
If your pipeline processes the same product catalog images repeatedly (daily re-indexing, for example), hash the image content and cache results in Redis or a database with a TTL. Image tokens are expensive at scale; re-processing the same 10,000 SKU photos every night is waste you can eliminate cheaply.
For cases where you process many images against the same long system prompt (a detailed product taxonomy description, for example), prompt caching from Part 4: Prompt Caching with Claude applies. Mark the system prompt block with cache_control set to ephemeral: the first call in a cache window pays to write the cache, and later calls in that window read the cached prompt tokens at a reduced rate instead of the full input price.
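As a sketch, a cacheable system prompt is passed as a list of text blocks, with `cache_control` on the block to cache. The helper name and the placeholder taxonomy text are assumptions; note that caching only takes effect above a model-specific minimum prompt length, so check the prompt caching documentation for your model.

```python
def cached_system_prompt(taxonomy_description: str) -> list[dict]:
    """Return a system prompt as a block list with the long text marked cacheable."""
    return [
        {
            "type": "text",
            "text": taxonomy_description,
            "cache_control": {"type": "ephemeral"},
        }
    ]


system_blocks = cached_system_prompt(
    "You are a product catalog classifier. <long taxonomy description goes here>"
)
print(system_blocks[0]["cache_control"])  # {'type': 'ephemeral'}

# Assumed usage with the POC's client (not executed here):
# client.messages.create(model="claude-haiku-4-5", max_tokens=512,
#                        system=system_blocks, messages=[...])
```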
Common Pitfalls
Sending Oversized Images
A 4K PNG can be 10 MB or more. Base64 encoding expands that by about 33%. The resulting image token count can be 3,000 to 4,000 tokens, which adds $0.012 to $0.06 per call at Sonnet and Opus pricing, respectively. Multiply by 100,000 calls per month and the cost becomes meaningful. Always resize to 1568px on the longest edge before sending. For most vision tasks (description, classification, text extraction), this has no measurable effect on output quality.
Assuming Pixel-Perfect OCR
Claude reads text very well, but it is not a deterministic OCR engine. On high-quality printed documents, accuracy is excellent. On low-resolution scans, handwriting, or text with complex backgrounds, it will sometimes miss or misread characters. For tasks where accuracy on every character matters (e.g., parsing a check amount), always build a human review step into the workflow for low-confidence outputs.
Sending Images Without Context
A prompt like “what is this?” is much less effective than “this is a photo taken by a field technician of an electrical panel. Identify any visible safety issues.” Claude performs better when it knows the domain context. Add a short system prompt that describes the business context, and your outputs will be more accurate and more consistently formatted.
Not Handling API Errors in Batch Jobs
The retry wrapper in the POC above is intentional. Vision calls occasionally hit rate limits because image tokens count toward your token-per-minute quota, and a folder of large images can saturate the limit quickly. The exponential backoff in call_with_retry handles this. Without it, a batch job will fail on the first rate limit and leave your output half-complete.
Expecting Exact Numerical Reads on Dense Charts
Claude reads chart values by visually interpolating against axis gridlines. On a chart with clearly labeled gridlines and simple bar positions, it gets within 2-5% of the true value. On a stacked area chart with 8 series and no gridlines, it will give you approximate values with appropriate uncertainty. Design your prompts to ask for the shape of the data (“which quarter shows the sharpest decline”) when precision matters less, and save exact reads for simple charts where Claude can clearly interpolate.
Forgetting to Sanitize Dynamic Prompt Inputs
If you build prompts by interpolating user-supplied strings, a user can inject instructions. This is prompt injection, covered in detail in Part 25: Guardrails and Prompt Injection Defense for Claude Apps. For vision pipelines, the attack surface is the question text (chart Q&A) or the taxonomy string (classification). Validate and sanitize inputs before interpolation.
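One way to narrow that attack surface is to validate both inputs before they reach the prompt. The sketch below is an assumption-laden starting point: the allowed character set, the length limits, and the function names are illustrative, and validation like this reduces the risk of injection rather than eliminating it.

```python
import re

MAX_QUESTION_CHARS = 300
# Letters, digits, spaces, and a few separators; 1 to 40 characters per category.
CATEGORY_PATTERN = re.compile(r"[A-Za-z0-9 &/\-]{1,40}")


def clean_taxonomy(taxonomy: str, max_categories: int = 50) -> list[str]:
    """Split a comma-separated taxonomy and reject anything that is not a plain label."""
    categories = [c.strip() for c in taxonomy.split(",") if c.strip()]
    if not categories or len(categories) > max_categories:
        raise ValueError("Taxonomy must contain between 1 and 50 categories")
    for category in categories:
        if not CATEGORY_PATTERN.fullmatch(category):
            raise ValueError(f"Invalid category name: {category!r}")
    return categories


def clean_question(question: str) -> str:
    """Collapse whitespace (including newlines) and enforce a length limit."""
    collapsed = " ".join(question.split())
    if not collapsed:
        raise ValueError("Question is empty")
    if len(collapsed) > MAX_QUESTION_CHARS:
        raise ValueError("Question is too long")
    return collapsed


print(clean_taxonomy("Electronics, Bags and Luggage ,Other"))
# ['Electronics', 'Bags and Luggage', 'Other']
print(clean_question("Which  month\n was highest?"))
# Which month was highest?
```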
Cost and Latency Reference
The numbers below are representative estimates based on typical image sizes and task complexity. Actual costs depend on image resolution (which determines image token count) and response length.
| Task | Model | Typical input tokens (image + prompt) | Typical output tokens | P50 latency | Approx. cost per call |
|---|---|---|---|---|---|
| Product classification | claude-haiku-4-5 | 1,400-1,800 | 60-100 | 0.8s | $0.0004 |
| Image description | claude-sonnet-4-6 | 1,800-2,400 | 100-200 | 2.5s | $0.009 |
| Text extraction | claude-sonnet-4-6 | 2,000-3,000 | 150-400 | 3.5s | $0.015 |
| Chart Q&A | claude-sonnet-4-6 | 1,700-2,200 | 80-150 | 2.8s | $0.010 |
| Chart Q&A (complex) | claude-opus-4-8 | 1,700-2,200 | 100-200 | 5.0s | $0.045 |
For streaming responses on long descriptions, use the streaming pattern from Part 26: Streaming Responses with Claude. Time-to-first-token is typically under 1 second even for vision calls, so streaming gives users immediate feedback while the full description arrives.
Frequently Asked Questions
Can I send a URL instead of base64-encoded image data?
Yes. The image source can be {"type": "url", "url": "https://..."} instead of the base64 block. However, the URL must be publicly accessible at call time, and Anthropic’s servers will fetch it. For production workloads where images are stored in private object storage, base64 is more reliable and avoids permission issues. The POC above uses base64 for this reason.
How many images can I include in a single API call?
Multiple image blocks are supported in a single message. The practical limit is your token budget for the call. Each image at 1568px contributes roughly 1,100-1,600 image tokens. With a 200K context window on Sonnet and Opus, you could technically send dozens of images in one call, but cost and latency scale linearly with image count. For bulk processing, it is usually better to process images in parallel individual calls than to batch many into one.
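Parallel individual calls can be sketched with a thread pool, since each call spends most of its time waiting on the network. `run_parallel` and the stand-in task function are hypothetical; in the POC you would pass describe_image or a similar function, and size `max_workers` to stay inside your rate limits.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def run_parallel(
    paths: list[str], task_fn: Callable[[str], dict], max_workers: int = 4
) -> dict[str, dict]:
    """Run task_fn on each path concurrently; collect results or errors per path."""
    results: dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(task_fn, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # One failed image should not abort the rest of the batch.
                results[path] = {"error": str(exc)}
    return results


def pretend_describe(path: str) -> dict:
    """Stand-in for a real vision call (no API request is made)."""
    if path.endswith(".txt"):
        raise ValueError("Unsupported image format: .txt")
    return {"file": path, "description": "placeholder"}


batch = run_parallel(["a.jpg", "b.png", "notes.txt"], pretend_describe)
print(sorted(batch))  # ['a.jpg', 'b.png', 'notes.txt']
```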
Does vision work with the streaming API?
Yes. Vision calls support streaming exactly like text calls. Replace client.messages.create() with client.messages.stream() and iterate stream.text_stream. The image block goes in the messages array the same way. Streaming is useful when the output is a long description that you want to display progressively.
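A short sketch of that change, written as a function that takes the client as a parameter so it stays independent of how the client is created. `client.messages.stream()` is used as a context manager; the function name and prompt text are illustrative.

```python
def stream_description(client, b64_data: str, media_type: str,
                       model: str = "claude-sonnet-4-6") -> str:
    """Stream a description of one image, printing text as it arrives.

    `client` is expected to be an anthropic.Anthropic() instance.
    Returns the full text once the stream finishes.
    """
    chunks: list[str] = []
    with client.messages.stream(
        model=model,
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {"type": "base64", "media_type": media_type, "data": b64_data},
                    },
                    {"type": "text", "text": "Describe this image in one detailed paragraph."},
                ],
            }
        ],
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)  # show partial output immediately
            chunks.append(text)
    return "".join(chunks)


# Assumed usage with the POC helpers (not executed here):
# b64, media = load_image_as_b64("product_shoe.jpg")
# full_text = stream_description(anthropic.Anthropic(), b64, media)
```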
What is the difference between Claude vision and dedicated OCR tools like Textract?
Dedicated OCR tools like Amazon Textract return bounding box coordinates, per-word confidence scores, and structured form/table data in a machine-readable format. Claude vision returns natural language, optionally structured via tool use. For document processing where you need exact bounding boxes or character-level confidence, use a dedicated OCR service. For tasks where you want to extract meaning (what does this field say, what is this form about), Claude is often faster to integrate and handles irregular layouts better. Many production pipelines use both: Textract for structured fields, Claude for semantic interpretation of the extracted text.
Can Claude identify logos, brands, or specific products by name?
Claude recognizes well-known brands and logos that appear prominently in its training data. It can identify common product categories and consumer brands with reasonable accuracy. It will not reliably identify obscure regional brands, internal product codes, or novel items that do not appear in its training distribution. For brand-specific recognition tasks, few-shot prompting with labeled examples in the prompt improves accuracy significantly.
How do I test that my vision prompts are consistently accurate?
Build an eval harness: collect a labeled test set of 50-100 images per task with known ground truth, run your prompt against them, and score the outputs. For classification, accuracy is straightforward to compute. For description and extraction, use Claude itself as a scorer: “given this ground truth and this predicted output, rate agreement from 0 to 5.” This is the evaluation approach covered in Part 24: Evaluate Your Claude App. Run your eval on every prompt change before deploying to production.
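For the classification case, the scoring step can be as small as the sketch below. It assumes predictions and ground truth are both dicts mapping a filename to a category; the function name and the `<missing>` marker are illustrative choices.

```python
from collections import Counter


def score_classification(predictions: dict[str, str], ground_truth: dict[str, str]) -> dict:
    """Return accuracy plus the most common (expected, predicted) confusions."""
    correct = 0
    confusions: Counter = Counter()
    for filename, expected in ground_truth.items():
        predicted = predictions.get(filename, "<missing>")
        if predicted == expected:
            correct += 1
        else:
            confusions[(expected, predicted)] += 1
    total = len(ground_truth)
    accuracy = correct / total if total else 0.0
    return {"accuracy": accuracy, "confusions": confusions.most_common()}


truth = {"a.jpg": "Clothing", "b.jpg": "Food", "c.jpg": "Furniture", "d.jpg": "Other"}
preds = {"a.jpg": "Clothing", "b.jpg": "Food", "c.jpg": "Other", "d.jpg": "Other"}
report = score_classification(preds, truth)
print(report["accuracy"])    # 0.75
print(report["confusions"])  # [(('Furniture', 'Other'), 1)]
```

The confusion list is often more useful than the headline accuracy, because it shows which category boundaries the prompt needs to describe better.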
What happens if I send a non-image file as an image block?
The API will return an error if the base64 data does not decode to a valid image in the declared format. Validate the MIME type and do a minimal image decode check (Pillow’s Image.open(...).verify()) before constructing the API call. This prevents confusing errors downstream and keeps your batch jobs from failing mid-run on corrupt files.
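A cheap first check, before any Pillow decode, is to look at the file's leading bytes. The sketch below recognizes the four supported formats by their standard signatures. It only inspects the header, so it catches mislabeled or non-image files but not truncated or corrupt image data; the function name is hypothetical.

```python
from typing import Optional


def sniff_media_type(data: bytes) -> Optional[str]:
    """Guess the MIME type from magic bytes; return None if unrecognized."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "image/gif"
    # WebP is a RIFF container: "RIFF", 4 size bytes, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None


print(sniff_media_type(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))  # image/png
print(sniff_media_type(b"%PDF-1.7"))                         # None
```

Comparing the sniffed type with the type implied by the file extension also catches files that were renamed without being converted.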
Further reading: Anthropic Vision Documentation covers supported formats, token cost formulas, and example code. The Messages API reference documents the complete content block schema. For image token pricing details, see the Anthropic pricing page. For Pillow image processing, the Pillow documentation covers all resize and format conversion methods used in the POC.