TL;DR
- Claude reads your Python function source and generates a complete pytest file covering happy path, edge cases, and error cases in under 10 seconds.
- The POC in this article uses the Anthropic SDK with claude-sonnet-4-6, structured output via tool use, and then runs pytest as a subprocess so you see results immediately.
- AI test generation workflows in Python cut the time to get a new module under test from hours to minutes, which matters most when onboarding a codebase or hardening legacy code.
- Claude is particularly strong at spotting boundary conditions (empty inputs, None, overflow) that engineers skip when writing tests by hand under deadline pressure.
- Cost per test file is roughly $0.002 to $0.008 with Sonnet, making this economical at scale when you run it in CI on every new or modified function.
- Treat Claude output as a high-quality draft: review the generated assertions, add domain-specific fixtures, and pin any flaky mocks before merging.
Why AI Test Generation in Python Is Worth Your Time
Writing tests is one of those tasks everyone agrees is important and few people do consistently. The reasons are predictable: tests feel slow to write when you are in flow, the function is “obviously correct,” the deadline is real, and you will “add tests later.” Later never comes.
The coverage gap compounds over time. A codebase with 20% coverage has accumulated years of implicit assumptions that exist only in the original author’s head. When that engineer leaves or the code changes, bugs surface in production rather than in CI. The fix costs ten to fifty times more than catching it at commit time.
AI test generation tooling for Python attacks this problem directly. You feed the function source to Claude, it returns a full pytest file, and you run it. The whole round trip is under fifteen seconds. That is faster than opening a new file and typing the first import. The generated tests are not perfect, but they are a solid draft that typically reaches 80 to 90 percent of what a senior engineer would write for the same function, including cases that are easy to forget: empty collections, None arguments, type boundaries, and expected exceptions.
This article walks through a complete, runnable POC. By the end you will have a command-line tool that takes any Python source file, calls Claude, writes a pytest file, runs it, and prints the results. You can drop this into a pre-commit hook or a CI job to get automated test scaffolding on every new function.
If you are new to the series, Part 5 covers building an AI code review bot with Claude, which is a natural complement to test generation. Part 3 explains structured output from Claude, a pattern this article uses to parse the generated test file reliably.
What Claude Brings to AI Test Generation Python Workflows
Understanding intent, not just syntax
Mechanical tools like Hypothesis or coverage.py work at the code structure level. Coverage measurement can tell you which branches exist and which lines your tests reach, and property-based testing can throw generated inputs at a function, but neither can tell you what the function is supposed to do. Claude reads the docstring, the variable names, the logic, and the inline comments together. It infers intent. That is why it generates assertions like assert result == "guest" rather than just assert result is not None. The test checks behavior, not just execution.
Edge cases from domain knowledge
When a function handles email addresses, Claude generates tests for addresses with plus signs, subdomains, and international characters because it knows these are common edge cases in email validation, not because the function’s AST reveals them. This domain-aware test generation is the real advantage over purely mechanical tools.
Readable test code
Claude names test functions descriptively (test_returns_empty_list_when_input_is_none), groups related cases into classes, adds comments explaining why a particular edge case matters, and uses pytest.raises correctly. The output reads like something a careful engineer wrote, which makes review fast.
Architecture of the POC
The two-phase design
The POC splits work into two phases. Phase one uses Python’s built-in ast module to extract every function definition from the source file. This gives Claude focused input (one function at a time) rather than dumping an entire 500-line module in a single prompt, which would waste tokens and confuse the model about which function to test.
Phase two uses Claude with tool use to produce structured output. The tool schema defines exactly what we want back: a list of test cases, each with a name, category (happy/edge/error), and the test body as a string. After Claude returns the structured data, the POC assembles the pytest file from these pieces and writes it to disk. Then it runs pytest as a subprocess and captures stdout/stderr.
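Before assembling the file, it can help to confirm that each returned body is at least syntactically valid Python. The sketch below is an assumption and not part of the POC: find_invalid_cases is a hypothetical helper that wraps every body in a throwaway function and runs it through ast.parse, returning the names of the cases that fail to parse.

```python
import ast
import textwrap


def find_invalid_cases(result: dict) -> list[str]:
    """Return names of test cases whose body is not valid Python.

    `result` is assumed to have the shape of the tool output described
    above: a dict with a "test_cases" list of {"name", "body", ...} dicts.
    """
    invalid = []
    for case in result.get("test_cases", []):
        body = textwrap.dedent(case.get("body", "")) or "pass"
        # Wrap the body in a dummy function so it is parsed in the same
        # context it will have inside the generated test method.
        wrapped = "def _probe():\n" + textwrap.indent(body, "    ")
        try:
            ast.parse(wrapped)
        except SyntaxError:
            invalid.append(case.get("name", "<unnamed>"))
    return invalid


sample = {
    "test_cases": [
        {"name": "test_ok", "body": "assert 1 + 1 == 2"},
        {"name": "test_broken", "body": "assert (1 + 1 == 2"},
    ]
}
print(find_invalid_cases(sample))
```

A check like this only catches syntax errors. Wrong expected values still surface only when pytest runs.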
Why tool use for structured output
You could ask Claude to return a Python code block in plain text and parse it with a regex. In practice this is fragile. Claude might add prose before the code, wrap it in markdown fences with extra whitespace, or return multiple code blocks. Defining a tool schema forces Claude to return structured data that maps directly to your Python objects with zero parsing ambiguity. Part 3 of this series covers Claude structured JSON output in depth if you want more background on this pattern.
Model selection
This POC uses claude-sonnet-4-6. For most functions Haiku would be fast and cheap enough, but Sonnet generates noticeably better edge cases for functions with non-trivial logic, particularly ones involving type coercion, string parsing, or stateful side effects. Opus is overkill here unless you are generating tests for a cryptographic primitive or a complex financial calculation where getting every boundary exactly right matters more than cost.
| Model | Avg test quality | Latency (typical) | Cost per function | Best for |
|---|---|---|---|---|
| claude-haiku-4-5 | Good (simple functions) | 1 to 3 s | ~$0.0005 | CRUD handlers, simple validators |
| claude-sonnet-4-6 | Very good (most cases) | 4 to 10 s | ~$0.003 | Business logic, parsers, transformations |
| claude-opus-4-8 | Excellent (complex logic) | 10 to 25 s | ~$0.020 | Cryptography, financial math, state machines |
Project Setup
Install dependencies
pip install anthropic pytest python-dotenv
No other libraries are needed. The ast, subprocess, and pathlib modules are all part of the standard library.
requirements.txt
anthropic>=0.40.0
pytest>=8.0.0
python-dotenv>=1.0.0
.env.example
# Copy to .env and fill in your key
ANTHROPIC_API_KEY=sk-ant-...
Never hardcode the API key. The Anthropic SDK reads ANTHROPIC_API_KEY from the environment automatically. The python-dotenv package loads it from .env during local development.
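If the key is missing, the failure otherwise shows up only when the first request is made. A minimal sketch of a fail-fast check follows; require_api_key is a hypothetical helper, not part of the POC or the SDK, and it accepts an optional mapping so it can be exercised without touching the real environment.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key, or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "ANTHROPIC_API_KEY is not set. Export it or add it to your .env file."
        )
    return key
```

Calling this right after load_dotenv() would stop the run before any parsing work is done.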
The Complete POC: ai_test_generator.py
Target function file (to be tested)
We need a sample Python file to test against. Save this as sample_functions.py in the same directory:
# sample_functions.py
"""
Sample Python functions for the AI test generation POC.
These deliberately include edge cases worth testing.
"""
from typing import Optional, List


def calculate_discount(price: float, discount_pct: float) -> float:
    """
    Apply a percentage discount to a price.

    Args:
        price: The original price. Must be non-negative.
        discount_pct: Discount percentage, 0 to 100 inclusive.

    Returns:
        The discounted price, rounded to 2 decimal places.

    Raises:
        ValueError: If price is negative or discount_pct is outside [0, 100].
    """
    if price < 0:
        raise ValueError(f"Price cannot be negative, got {price}")
    if not (0 <= discount_pct <= 100):
        raise ValueError(
            f"Discount must be between 0 and 100, got {discount_pct}"
        )
    discounted = price * (1 - discount_pct / 100)
    return round(discounted, 2)


def slugify(text: str) -> str:
    """
    Convert a string to a URL-safe slug.

    Converts to lowercase, replaces spaces and special characters
    with hyphens, strips leading/trailing hyphens, and collapses
    consecutive hyphens into one.

    Args:
        text: The string to slugify.

    Returns:
        A lowercase, hyphen-separated slug.

    Raises:
        TypeError: If text is not a string.
    """
    import re

    if not isinstance(text, str):
        raise TypeError(f"Expected str, got {type(text).__name__}")
    text = text.lower().strip()
    text = re.sub(r"[^\w\s-]", "", text)
    text = re.sub(r"[\s_]+", "-", text)
    text = re.sub(r"-+", "-", text)
    text = text.strip("-")
    return text


def paginate(items: List, page: int, page_size: int) -> dict:
    """
    Slice a list into a paginated response dict.

    Args:
        items: The full list to paginate.
        page: 1-based page number.
        page_size: Number of items per page. Must be >= 1.

    Returns:
        A dict with keys: data (list), page, page_size,
        total_items, total_pages, has_next, has_prev.

    Raises:
        ValueError: If page < 1 or page_size < 1.
    """
    if page < 1:
        raise ValueError(f"page must be >= 1, got {page}")
    if page_size < 1:
        raise ValueError(f"page_size must be >= 1, got {page_size}")
    total_items = len(items)
    import math

    total_pages = math.ceil(total_items / page_size) if total_items > 0 else 1
    start = (page - 1) * page_size
    end = start + page_size
    return {
        "data": items[start:end],
        "page": page,
        "page_size": page_size,
        "total_items": total_items,
        "total_pages": total_pages,
        "has_next": page < total_pages,
        "has_prev": page > 1,
    }
The main test generator script
# ai_test_generator.py
"""
AI Test Generation POC using Claude.

Given a Python source file, this script:
  1. Parses every function definition using the ast module
  2. Sends each function to Claude (claude-sonnet-4-6) with a structured
     output tool schema
  3. Assembles a complete pytest file from the structured output
  4. Writes the file to disk
  5. Runs pytest as a subprocess and prints the results

Usage:
    python ai_test_generator.py sample_functions.py

Environment:
    ANTHROPIC_API_KEY must be set (or present in a .env file)
"""
import ast
import json
import os
import subprocess
import sys
import textwrap
import time
from pathlib import Path
from typing import Any

import anthropic
from dotenv import load_dotenv

load_dotenv()
# ---- Tool schema for structured test output --------------------------------
TEST_TOOL_SCHEMA = {
    "name": "generate_test_cases",
    "description": (
        "Return a structured list of pytest test cases for the given Python function. "
        "Each test case must include the full test body as valid Python source code."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "module_import": {
                "type": "string",
                "description": (
                    "The import statement(s) needed to import the function under test, "
                    "e.g. 'from sample_functions import calculate_discount'"
                ),
            },
            "extra_imports": {
                "type": "array",
                "items": {"type": "string"},
                "description": (
                    "Additional import lines needed by the tests, "
                    "e.g. ['import pytest', 'from unittest.mock import patch']"
                ),
            },
            "test_cases": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {
                            "type": "string",
                            "description": (
                                "Function name, snake_case, starts with 'test_'. "
                                "Be descriptive: test_returns_zero_when_discount_is_100"
                            ),
                        },
                        "category": {
                            "type": "string",
                            "enum": ["happy_path", "edge_case", "error_case"],
                            "description": "Classification of the test type.",
                        },
                        "docstring": {
                            "type": "string",
                            "description": "One sentence explaining what this test checks.",
                        },
                        "body": {
                            "type": "string",
                            "description": (
                                "The complete test function body, LEFT-ALIGNED at column 0 "
                                "(no leading indentation on the first statement). "
                                "Indent nested blocks normally with 4 spaces relative to that. "
                                "Do NOT include the 'def test_...:' line, only the body. "
                                "Use pytest.raises() for exception tests. "
                                "Assert concrete values, not just 'is not None'."
                            ),
                        },
                    },
                    "required": ["name", "category", "docstring", "body"],
                },
                "minItems": 5,
            },
            "fixtures": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "body": {"type": "string"},
                    },
                    "required": ["name", "body"],
                },
                "description": (
                    "Optional pytest fixtures needed by the test cases. "
                    "Each fixture body must be LEFT-ALIGNED at column 0, like the test bodies."
                ),
            },
        },
        "required": ["module_import", "extra_imports", "test_cases"],
    },
}

SYSTEM_PROMPT = """You are a senior Python engineer specializing in test-driven development.
Your job is to generate high-quality pytest test cases for Python functions.

Rules:
- Cover at least 2 happy path cases, 2 edge cases, and 2 error/exception cases per function.
- Use concrete, specific assertion values derived from the function logic, not just 'is not None'.
- Use pytest.raises(ExceptionType, match=r"...") for exception tests.
- Name tests descriptively: what the test does and what the expected outcome is.
- Do not import the function inside test bodies; rely on the module_import at file level.
- If the function uses random state or time, use unittest.mock to patch it.
- Keep test bodies self-contained: no shared state between tests unless via a fixture.
- Prefer parametrize for systematic cases only if it genuinely reduces repetition."""
# ---- AST extraction --------------------------------------------------------
def extract_functions(source: str) -> list[dict[str, str]]:
    """
    Parse source code and return a list of dicts with name, source, and docstring
    for each top-level function definition.
    """
    tree = ast.parse(source)
    functions = []
    lines = source.splitlines()

    # Iterate over tree.body rather than ast.walk(tree): walk() would also
    # yield nested functions and class methods, which the generated tests
    # cannot import directly from the module.
    for node in tree.body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        start = node.lineno - 1
        end = node.end_lineno
        func_source = "\n".join(lines[start:end])
        docstring = ast.get_docstring(node) or ""
        functions.append({
            "name": node.name,
            "source": func_source,
            "docstring": docstring,
        })
    return functions
# ---- Claude call with structured output ------------------------------------
def generate_tests_for_function(
    client: anthropic.Anthropic,
    func: dict[str, str],
    module_name: str,
) -> dict[str, Any]:
    """
    Call Claude with tool use to generate structured test cases for one function.
    Returns the tool input dict.
    """
    user_message = f"""Generate comprehensive pytest test cases for this Python function.

Module name (for imports): {module_name}
Function name: {func['name']}

Function source:
```python
{func['source']}
```

Generate tests that cover:
1. At least 2 happy path cases with realistic inputs and exact expected values
2. At least 2 edge cases (empty inputs, zero, None where applicable, boundary values)
3. At least 2 error cases (invalid arguments that should raise exceptions)

Use the generate_test_cases tool to return the structured result."""

    max_retries = 3
    for attempt in range(max_retries):
        try:
            msg = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=4096,
                system=SYSTEM_PROMPT,
                tools=[TEST_TOOL_SCHEMA],
                tool_choice={"type": "tool", "name": "generate_test_cases"},
                messages=[{"role": "user", "content": user_message}],
            )
            # Find the tool_use block
            for block in msg.content:
                if block.type == "tool_use" and block.name == "generate_test_cases":
                    print(
                        f" Tokens used: {msg.usage.input_tokens} in / "
                        f"{msg.usage.output_tokens} out"
                    )
                    return block.input
            raise ValueError("Claude did not return a tool_use block")
        except anthropic.APIError as e:
            if attempt < max_retries - 1:
                wait = 2 ** attempt
                print(f" API error (attempt {attempt + 1}): {e}. Retrying in {wait}s...")
                time.sleep(wait)
            else:
                raise
# ---- Assemble pytest file --------------------------------------------------
def _reindent(body: str, spaces: int) -> list[str]:
    """
    Normalize a model-generated code body and indent it by `spaces` columns.

    The schema asks the model for left-aligned bodies, but models are not
    perfectly consistent. textwrap.dedent removes any common leading whitespace
    so a body that came back uniformly indented is flattened back to column 0
    first. We then prefix every non-blank line with the target indent, which
    preserves relative nesting (the inside of a with/for block stays deeper than
    its header). Blank lines are emitted empty so the file has no trailing
    whitespace.
    """
    out: list[str] = []
    pad = " " * spaces
    for line in textwrap.dedent(body).splitlines():
        out.append(pad + line if line.strip() else "")
    return out


def assemble_pytest_file(
    all_results: list[dict[str, Any]],
    source_file: Path,
) -> str:
    """
    Assemble a complete pytest file from the list of structured tool outputs.
    Each element of all_results corresponds to one function.
    """
    lines = [
        "# Auto-generated by ai_test_generator.py",
        f"# Source: {source_file.name}",
        "# Review and adjust before committing.",
        "",
    ]

    # Collect all imports (deduplicated)
    all_imports: list[str] = ["import pytest"]
    seen_imports: set[str] = {"import pytest"}
    for result in all_results:
        mi = result.get("module_import", "")
        if mi and mi not in seen_imports:
            all_imports.append(mi)
            seen_imports.add(mi)
        for imp in result.get("extra_imports", []):
            if imp and imp not in seen_imports:
                all_imports.append(imp)
                seen_imports.add(imp)
    lines.extend(all_imports)
    lines.append("")
    lines.append("")

    # Fixtures
    for result in all_results:
        for fixture in result.get("fixtures", []):
            lines.append("@pytest.fixture")
            lines.append(f"def {fixture['name']}():")
            lines.extend(_reindent(fixture.get("body", "pass"), 4))
            lines.append("")
            lines.append("")

    # Test cases grouped by function (class per function)
    for result in all_results:
        module_import = result.get("module_import", "")
        # Derive function name from import for class name
        # e.g. "from sample_functions import calculate_discount" -> "CalculateDiscount"
        func_name = module_import.split("import ")[-1].strip() if "import " in module_import else "Function"
        class_name = "Test" + "".join(part.capitalize() for part in func_name.split("_"))
        lines.append(f"class {class_name}:")
        lines.append(f'    """Tests for {func_name}."""')
        lines.append("")
        for tc in result.get("test_cases", []):
            category_comment = {
                "happy_path": "Happy path",
                "edge_case": "Edge case",
                "error_case": "Error case",
            }.get(tc.get("category", ""), "")
            lines.append(f"    def {tc['name']}(self):")
            doc = tc.get("docstring", "")
            if doc:
                lines.append(f'        """{doc}"""')
            if category_comment:
                lines.append(f"        # {category_comment}")
            # Indent the body by 8 spaces (4 for class + 4 for method) via the
            # shared helper, which dedents first so nested blocks stay valid.
            lines.extend(_reindent(tc.get("body", "pass"), 8))
            lines.append("")
        lines.append("")
    return "\n".join(lines)
# ---- Run pytest ------------------------------------------------------------
def run_pytest(test_file: Path) -> tuple[int, str]:
    """
    Run pytest on the given file as a subprocess.
    Returns (return_code, combined_output).
    """
    result = subprocess.run(
        [sys.executable, "-m", "pytest", str(test_file), "-v", "--tb=short"],
        capture_output=True,
        text=True,
    )
    combined = result.stdout + result.stderr
    return result.returncode, combined


# ---- Main ------------------------------------------------------------------
def main() -> None:
    if len(sys.argv) < 2:
        print("Usage: python ai_test_generator.py <source_file.py>")
        sys.exit(1)

    source_path = Path(sys.argv[1])
    if not source_path.exists():
        print(f"Error: {source_path} not found")
        sys.exit(1)

    source_code = source_path.read_text(encoding="utf-8")
    module_name = source_path.stem  # e.g. "sample_functions"

    print(f"Parsing {source_path.name}...")
    functions = extract_functions(source_code)
    if not functions:
        print("No top-level functions found.")
        sys.exit(0)
    print(f"Found {len(functions)} function(s): {', '.join(f['name'] for f in functions)}")
    print()

    client = anthropic.Anthropic()
    all_results = []
    for func in functions:
        print(f"Generating tests for: {func['name']}")
        result = generate_tests_for_function(client, func, module_name)
        test_count = len(result.get("test_cases", []))
        categories = [tc.get("category") for tc in result.get("test_cases", [])]
        happy = categories.count("happy_path")
        edge = categories.count("edge_case")
        error = categories.count("error_case")
        print(f" Generated {test_count} tests ({happy} happy / {edge} edge / {error} error)")
        all_results.append(result)
    print()

    print("Assembling pytest file...")
    pytest_source = assemble_pytest_file(all_results, source_path)
    output_path = source_path.parent / f"test_{source_path.name}"
    output_path.write_text(pytest_source, encoding="utf-8")
    print(f"Written: {output_path}")
    print()

    print("Running pytest...")
    print("-" * 60)
    return_code, output = run_pytest(output_path)
    print(output)
    print("-" * 60)
    if return_code == 0:
        print("All tests passed.")
    else:
        print(f"pytest exited with code {return_code}. Review the generated file and fix any assertion mismatches.")
    sys.exit(return_code)


if __name__ == "__main__":
    main()
The tool_choice={"type": "tool", "name": "generate_test_cases"} parameter forces Claude to always respond through the tool schema, never with free text. This eliminates the need to parse markdown code fences and gives the output the same structure on every call.
Note on the assemble step
The interesting detail in assemble_pytest_file is the shared _reindent helper. The schema asks Claude for left-aligned bodies, but a test body still needs 8 spaces of indentation once it sits inside a class method, and a fixture body needs 4. A naive line.lstrip() would flatten any nested blocks (the inside of a with pytest.raises(...): or a for loop would lose its extra indentation and the file would not parse). The helper runs textwrap.dedent first to normalize whatever the model actually sent (left-aligned or uniformly indented both collapse to column 0), then prefixes every non-blank line with the target indent. That preserves relative nesting no matter how deep it goes, and blank lines stay empty so the output has no trailing whitespace. Using one helper for both fixtures and test bodies means there is exactly one place that can get the indentation contract wrong, which is the kind of small consolidation that keeps generated-code tools from drifting into subtle bugs.
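The difference between the two approaches is easy to see on a two-line body. This sketch uses textwrap.indent as a compact stand-in for the loop in _reindent (an equivalent for this input, not the POC's exact code), with a made-up body that arrives uniformly indented by four spaces.

```python
import textwrap

# A model-generated body that came back uniformly indented by 4 spaces.
body = (
    "    with pytest.raises(ValueError):\n"
    "        calculate_discount(-1, 10)\n"
)

# Naive approach: strip every line, then indent. Nesting is lost, so the
# call ends up at the same depth as the `with` header.
naive = [" " * 8 + line.lstrip() for line in body.splitlines()]

# Dedent-then-indent: only the common leading whitespace is removed, so the
# call stays one level deeper than the `with` header.
kept = textwrap.indent(textwrap.dedent(body), " " * 8).splitlines()

for line in naive:
    print(repr(line))
for line in kept:
    print(repr(line))
```

The naive output would fail to parse once written to the test file, because the with block has no indented body.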
Sample Run: Realistic Input and Output
Running the generator
python ai_test_generator.py sample_functions.py
Console output
Parsing sample_functions.py...
Found 3 function(s): calculate_discount, slugify, paginate
Generating tests for: calculate_discount
Tokens used: 1842 in / 987 out
Generated 7 tests (2 happy / 3 edge / 2 error)
Generating tests for: slugify
Tokens used: 1923 in / 1104 out
Generated 8 tests (2 happy / 4 edge / 2 error)
Generating tests for: paginate
Tokens used: 2011 in / 1231 out
Generated 9 tests (3 happy / 3 edge / 3 error)
Assembling pytest file...
Written: test_sample_functions.py
Running pytest...
------------------------------------------------------------
============================= test session starts ==============================
platform linux -- Python 3.12.3, pytest-8.1.1
collected 24 items
test_sample_functions.py::TestCalculateDiscount::test_applies_10_percent_discount PASSED
test_sample_functions.py::TestCalculateDiscount::test_applies_50_percent_discount PASSED
test_sample_functions.py::TestCalculateDiscount::test_returns_full_price_when_discount_is_zero PASSED
test_sample_functions.py::TestCalculateDiscount::test_returns_zero_when_discount_is_100 PASSED
test_sample_functions.py::TestCalculateDiscount::test_rounds_to_two_decimal_places PASSED
test_sample_functions.py::TestCalculateDiscount::test_raises_on_negative_price PASSED
test_sample_functions.py::TestCalculateDiscount::test_raises_on_discount_above_100 PASSED
test_sample_functions.py::TestSlugify::test_converts_simple_string PASSED
test_sample_functions.py::TestSlugify::test_handles_multiple_spaces PASSED
test_sample_functions.py::TestSlugify::test_strips_special_characters PASSED
test_sample_functions.py::TestSlugify::test_returns_empty_string_for_empty_input PASSED
test_sample_functions.py::TestSlugify::test_strips_leading_trailing_hyphens PASSED
test_sample_functions.py::TestSlugify::test_collapses_consecutive_hyphens PASSED
test_sample_functions.py::TestSlugify::test_raises_on_non_string_input PASSED
test_sample_functions.py::TestSlugify::test_handles_all_special_chars PASSED
test_sample_functions.py::TestPaginate::test_first_page_of_ten_items PASSED
test_sample_functions.py::TestPaginate::test_last_page_partial PASSED
test_sample_functions.py::TestPaginate::test_single_page_when_items_fit PASSED
test_sample_functions.py::TestPaginate::test_empty_list_returns_one_total_page PASSED
test_sample_functions.py::TestPaginate::test_has_next_and_has_prev_flags PASSED
test_sample_functions.py::TestPaginate::test_page_beyond_total_returns_empty_data PASSED
test_sample_functions.py::TestPaginate::test_raises_on_page_zero PASSED
test_sample_functions.py::TestPaginate::test_raises_on_zero_page_size PASSED
test_sample_functions.py::TestPaginate::test_raises_on_negative_page_size PASSED
========================= 24 passed in 0.18s ===========================
------------------------------------------------------------
All tests passed.
24 tests generated and all passing in 0.18 seconds of pytest runtime (plus the ~18 seconds of API calls). The total cost for this run is approximately $0.011 using Sonnet at current pricing.
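To check a figure like this against your own runs, total the token counts the script prints and multiply by the rates on the current pricing page. The sketch below is a plain arithmetic helper; estimate_cost is a hypothetical name, and the rates are parameters because this article's code does not hard-code any price.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_rate: float,
    output_rate: float,
) -> float:
    """Return the cost in dollars; rates are dollars per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# (input, output) token pairs as printed in the sample run above.
usage = [(1842, 987), (1923, 1104), (2011, 1231)]
total_in = sum(tokens_in for tokens_in, _ in usage)
total_out = sum(tokens_out for _, tokens_out in usage)

print(total_in, total_out)
```

Output tokens usually dominate the bill for this workload, since each function produces roughly a thousand tokens of test code.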
Excerpt from the generated test file
# Auto-generated by ai_test_generator.py
# Source: sample_functions.py
# Review and adjust before committing.

import pytest
from sample_functions import calculate_discount
from sample_functions import slugify
from sample_functions import paginate


class TestCalculateDiscount:
    """Tests for calculate_discount."""

    def test_applies_10_percent_discount(self):
        """Applying a 10% discount to 100.00 should return 90.00."""
        result = calculate_discount(100.0, 10)
        assert result == 90.0

    def test_applies_50_percent_discount(self):
        """Applying a 50% discount should halve the price."""
        result = calculate_discount(49.99, 50)
        assert result == 25.0

    def test_returns_full_price_when_discount_is_zero(self):
        """A 0% discount returns the original price unchanged."""
        result = calculate_discount(29.95, 0)
        assert result == 29.95

    def test_returns_zero_when_discount_is_100(self):
        """A 100% discount makes any price free."""
        result = calculate_discount(199.99, 100)
        assert result == 0.0

    def test_rounds_to_two_decimal_places(self):
        """Result must be rounded to exactly 2 decimal places."""
        result = calculate_discount(10.0, 33)
        assert result == 6.7

    def test_raises_on_negative_price(self):
        """A negative price must raise ValueError."""
        with pytest.raises(ValueError, match="Price cannot be negative"):
            calculate_discount(-5.0, 10)

    def test_raises_on_discount_above_100(self):
        """A discount above 100 must raise ValueError."""
        with pytest.raises(ValueError, match="Discount must be between"):
            calculate_discount(100.0, 110)


class TestSlugify:
    """Tests for slugify."""

    def test_converts_simple_string(self):
        """A plain English sentence becomes a lowercase hyphenated slug."""
        assert slugify("Hello World") == "hello-world"

    def test_handles_multiple_spaces(self):
        """Multiple consecutive spaces collapse to a single hyphen."""
        assert slugify("foo   bar") == "foo-bar"

    def test_strips_special_characters(self):
        """Characters that are not word chars, spaces, or hyphens are removed."""
        assert slugify("Hello, World!") == "hello-world"

    def test_returns_empty_string_for_empty_input(self):
        """An empty string should return an empty string."""
        assert slugify("") == ""

    def test_strips_leading_trailing_hyphens(self):
        """Leading and trailing hyphens from special chars are removed."""
        assert slugify("--hello--") == "hello"

    def test_collapses_consecutive_hyphens(self):
        """Multiple consecutive hyphens collapse to one."""
        assert slugify("hello---world") == "hello-world"

    def test_raises_on_non_string_input(self):
        """Passing a non-string raises TypeError."""
        with pytest.raises(TypeError, match="Expected str"):
            slugify(42)

    def test_handles_all_special_chars(self):
        """A string of only special chars returns an empty slug."""
        assert slugify("!@#$%^&*()") == ""
Integrating AI Test Generation Python Workflows into CI
Pre-commit hook
The fastest way to get value from this tool is to run it as a pre-commit hook. Every time a developer adds or modifies a Python file, the hook generates a draft test file next to it. The developer reviews and adjusts before the commit lands.
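# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: ai-test-generation
        name: Generate pytest draft
        entry: python ai_test_generator.py
        language: python
        types: [python]
        # Match test_ files in any directory, not only at the repository root.
        exclude: "(^|/)test_"
        pass_filenames: true
        # pytest is required as well: the generator runs it inside the hook's environment.
        additional_dependencies: [anthropic, pytest, python-dotenv]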
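One caveat with pass_filenames: pre-commit passes every staged file that matches to the entry command in a single invocation, while main() in the POC reads only sys.argv[1]. A minimal sketch of the argument handling that would cover several files follows; select_sources is a hypothetical helper, and wiring it into main() is left as an adaptation.

```python
from pathlib import Path


def select_sources(argv: list[str]) -> list[Path]:
    """Return every Python source path in argv, skipping test files.

    argv is expected in sys.argv form: argv[0] is the script name and the
    remaining entries are the file names passed by the hook.
    """
    paths = [Path(arg) for arg in argv[1:]]
    return [
        p for p in paths
        if p.suffix == ".py" and not p.name.startswith("test_")
    ]


print(select_sources(["ai_test_generator.py", "pricing.py", "src/test_pricing.py"]))
```

Looping over the returned paths and running the existing parse, generate, and write steps for each one keeps the rest of the script unchanged.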
GitHub Actions integration
# .github/workflows/ai-tests.yml
name: AI Test Generation
on:
  pull_request:
    paths:
      - "*.py"
      - "src/**/*.py"
jobs:
  generate-and-run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          # Full history is needed so the base branch can be diffed against HEAD;
          # the default shallow clone does not contain it.
          fetch-depth: 0
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install anthropic pytest python-dotenv
      - name: Generate tests for changed Python files
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          git diff --name-only origin/${{ github.base_ref }}...HEAD \
            | grep '\.py$' \
            | grep -Ev '(^|/)test_' \
            | xargs -I{} python ai_test_generator.py {}
      - name: Run generated tests
        run: pytest test_*.py -v --tb=short
This integration pairs naturally with the AI code review bot from Part 5. You could run both on every PR: one to review the code, one to generate and run tests. The PR author gets a safety net before the human reviewer even looks at it.
Prompt caching for large codebases
If you are running this against many functions from the same module, the system prompt is static across every call. You can cache it using Claude’s prompt caching feature, which drops the cost of the cached tokens by 90 percent after the first call. Part 4 of this series covers prompt caching in detail. The change in the generator is minimal: wrap the system string in a list with a cache_control block.
msg = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    tools=[TEST_TOOL_SCHEMA],
    tool_choice={"type": "tool", "name": "generate_test_cases"},
    messages=[{"role": "user", "content": user_message}],
)
print(
    f" Cache: {msg.usage.cache_creation_input_tokens} created, "
    f"{msg.usage.cache_read_input_tokens} read"
)
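On a 30-function module with a 450-token system prompt, caching avoids paying the full input rate on roughly 13,000 tokens across the batch (29 cache reads of 450 tokens each, since the first call writes the cache). The saving per run is a few cents at most, which adds up when you run this in CI dozens of times a day. Check the cache counters printed above on a real run: prompt caching only takes effect once the cached prefix reaches the model's minimum cacheable length, and a short system prompt on its own may fall below it.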
Common Pitfalls
Claude asserts the wrong expected value
This is the most common failure mode. Claude infers expected values by reasoning about the function logic. For simple arithmetic it is almost always right. For string manipulation, floating-point rounding, or functions with subtle state dependencies, it sometimes gets a boundary value off by one or misses a precision detail. The fix is to run the tests and read the failures carefully. The generated tests are a draft; they save you from writing 24 test skeletons by hand, not from ever reading a failing assertion.
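Floating-point rounding is a typical source of such mismatches. The sketch below shows the standard Python behavior that trips up hand-derived expected values: round() operates on the stored binary value, which for 2.675 is slightly below the decimal literal.

```python
from decimal import Decimal, ROUND_HALF_UP

# 2.675 cannot be represented exactly in binary floating point; the stored
# value is slightly below 2.675, so round() goes down rather than up.
rounded_float = round(2.675, 2)

# Decimal arithmetic on the literal string rounds the way a human expects.
rounded_decimal = Decimal("2.675").quantize(
    Decimal("0.01"), rounding=ROUND_HALF_UP
)

print(rounded_float)
print(rounded_decimal)
```

A generated assertion that expects 2.68 from a round()-based function would fail here, and the failure points at the test's expectation, not at the function.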
Generated tests import from the wrong path
The POC derives the import from the filename (module_name = source_path.stem). If your function lives at src/utils/pricing.py and your pytest.ini sets pythonpath = src, the import should be from utils.pricing import ... not from pricing import .... Pass the relative module path as a second argument or derive it from the project root to handle this correctly.
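One way to derive the dotted module name is to compute the source path relative to the import root. This is a sketch under the assumption that you know the import root (for example the directory named in pythonpath); module_name_for is a hypothetical helper, not part of the POC.

```python
from pathlib import Path


def module_name_for(source_path: Path, import_root: Path) -> str:
    """Build a dotted module name for source_path relative to import_root."""
    relative = source_path.resolve().relative_to(import_root.resolve())
    parts = list(relative.with_suffix("").parts)
    # A package's __init__.py is imported by the package name itself.
    if parts and parts[-1] == "__init__":
        parts.pop()
    return ".".join(parts)


print(module_name_for(Path("src/utils/pricing.py"), Path("src")))
```

The result can replace source_path.stem when building the module_name passed to the prompt. relative_to raises ValueError if the file is not under the import root, which is a useful early failure.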
Tests fail because of missing fixtures or external dependencies
If the function under test hits a database, calls an external API, or reads from a file, Claude will sometimes generate tests that also try to do the real call. Look for these and replace them with unittest.mock.patch or pytest-mock fixtures. Claude will generally suggest mocking if the function signature or docstring makes the dependency obvious, but it cannot always infer it from the source alone.
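As an illustration of the replacement, here is a minimal sketch using unittest.mock.patch.object. PriceClient and price_with_tax are invented for this example; the point is only that the patched method returns a fixed value, so the test never reaches the network.

```python
from unittest.mock import patch


class PriceClient:
    """Hypothetical client whose real method would call an external API."""

    def fetch_price(self, sku: str) -> float:
        raise RuntimeError("network call not available in tests")


def price_with_tax(client: PriceClient, sku: str) -> float:
    """Hypothetical function under test: depends on the external client."""
    return round(client.fetch_price(sku) * 1.2, 2)


def test_price_with_tax_uses_fetched_price():
    # Replace the network-bound method with a stub for the duration of the test.
    with patch.object(PriceClient, "fetch_price", return_value=10.0) as stub:
        assert price_with_tax(PriceClient(), "sku-1") == 12.0
        stub.assert_called_once_with("sku-1")
```

If a generated test calls the real method instead, it will either fail in CI or, worse, pass only when the external service happens to be reachable.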
Duplicate test names when testing multiple functions
The deduplication logic in assemble_pytest_file handles imports but not test names across classes. If two functions both get a test named test_raises_on_negative_input, pytest will run both without conflict because they are in different classes. The class-per-function structure prevents any real collision, but be aware of it if you flatten the structure.
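Duplicates inside a single class are a different matter: in Python, a second method with the same name silently replaces the first, so one of the tests would never run. A small sketch of a check for that case follows; duplicate_test_names is a hypothetical helper that takes one tool output dict.

```python
from collections import Counter


def duplicate_test_names(result: dict) -> list[str]:
    """Return test names that appear more than once in one tool output."""
    names = [tc["name"] for tc in result.get("test_cases", [])]
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


result = {
    "test_cases": [
        {"name": "test_raises_on_negative_input"},
        {"name": "test_returns_zero"},
        {"name": "test_raises_on_negative_input"},
    ]
}
print(duplicate_test_names(result))
```

Running it on each result before assembly lets you rename or drop the repeated case instead of losing it silently.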
The tool_use block is missing
If Claude returns stop_reason == "end_turn" instead of "tool_use", it ignored the tool_choice directive. This should not happen with tool_choice={"type": "tool", "name": "..."}, but if the API is under heavy load or returns a partial response, the content list may be empty or contain only text blocks. The retry loop in the POC handles this by raising an error and retrying up to three times with exponential backoff.
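The check-and-retry behavior can be sketched roughly as follows. This is not the POC's actual loop: extract_tool_input, call_with_retry, and MissingToolUseError are illustrative names, and SimpleNamespace objects stand in for SDK responses so the snippet runs without an API key. It assumes only that a response exposes stop_reason and a content list whose blocks carry a type attribute.

```python
import time
from types import SimpleNamespace


class MissingToolUseError(RuntimeError):
    """Raised when a response carries no tool_use block."""


def extract_tool_input(response):
    """Return the input dict of the first tool_use block, or raise."""
    for block in response.content:
        if getattr(block, "type", None) == "tool_use":
            return block.input
    raise MissingToolUseError(
        f"no tool_use block (stop_reason={response.stop_reason!r})"
    )


def call_with_retry(make_request, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Retry make_request until a tool_use block appears, waiting 1s, 2s, ..."""
    for attempt in range(max_attempts):
        try:
            return extract_tool_input(make_request())
        except MissingToolUseError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            sleep(base_delay * 2 ** attempt)


# Simulated responses: a text-only reply first, then a proper tool_use reply.
responses = iter([
    SimpleNamespace(stop_reason="end_turn",
                    content=[SimpleNamespace(type="text", text="Here are tests")]),
    SimpleNamespace(stop_reason="tool_use",
                    content=[SimpleNamespace(type="tool_use", input={"test_cases": []})]),
])
result = call_with_retry(lambda: next(responses), sleep=lambda seconds: None)
print(result)  # {'test_cases': []}
```

Passing sleep as a parameter keeps the backoff testable; in real use, make_request would wrap the messages call.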
Token limits on very long functions
A 200-line function with complex logic can push the input context to 3,000 tokens, which is still well within the model's context window. The tighter constraint is the output: max_tokens=4096 leaves plenty of headroom for the tests generated from a function that size. If you are generating tests for functions over 300 lines, consider increasing max_tokens to 8192 or splitting the function into smaller pieces first. Very long functions are often a design smell anyway; a function too long to generate tests for comfortably is a good signal that it needs to be refactored.
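One way to apply the 300-line rule automatically is to measure each function with the ast module before calling the API. The helper name suggest_max_tokens is an assumption for illustration; the 4096 and 8192 values and the 300-line threshold come from the paragraph above.

```python
import ast


def suggest_max_tokens(module_source: str, long_threshold: int = 300) -> dict[str, int]:
    """Map each module-level function name to a suggested max_tokens value."""
    suggestions = {}
    for node in ast.parse(module_source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # end_lineno is available on Python 3.8 and later.
            line_count = node.end_lineno - node.lineno + 1
            suggestions[node.name] = 8192 if line_count > long_threshold else 4096
    return suggestions


short_fn = "def add(a, b):\n    return a + b\n"
long_fn = "def big():\n" + "    x = 1\n" * 320 + "    return x\n"
print(suggest_max_tokens(short_fn + "\n" + long_fn))  # {'add': 4096, 'big': 8192}
```

The same line count doubles as a refactoring report: anything that lands in the 8192 bucket is a candidate for splitting.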
Cost and Latency
| Scenario | Model | Functions | Approx. tokens | Cost | Wall time |
|---|---|---|---|---|---|
| Single utility function | Haiku | 1 | 2,000 in / 800 out | ~$0.0004 | 2 s |
| Single utility function | Sonnet | 1 | 2,000 in / 1,000 out | ~$0.0040 | 6 s |
| Module with 10 functions | Sonnet + caching | 10 | 25,000 in / 10,000 out | ~$0.030 | 55 s |
| Module with 30 functions | Sonnet + caching | 30 | 70,000 in / 30,000 out | ~$0.085 | ~3 min |
| Complex math / crypto fn | Opus | 1 | 3,000 in / 1,500 out | ~$0.025 | 18 s |
All costs use June 2026 public pricing. Cache read tokens cost 10 percent of the base input rate, so the “30 functions” scenario above drops from ~$0.085 to ~$0.055 after the first run in a session. For a team running this in CI 50 times a day on a 10-function module, monthly cost is under $50. That is well within the budget of virtually any engineering team when measured against the time saved.
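The monthly figure follows directly from the table. A quick check, assuming a 30-day month and the ~$0.030 per-run cost of the 10-function row:

```python
def monthly_cost(cost_per_run: float, runs_per_day: int, days: int = 30) -> float:
    """Estimated monthly spend; days=30 is an assumption, not a billing rule."""
    return cost_per_run * runs_per_day * days


# 10-function module, Sonnet + caching, 50 CI runs per day.
print(f"${monthly_cost(0.030, 50):.2f}")  # $45.00
```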
If you need to cut costs further, Haiku is viable for functions under 50 lines with simple logic. The quality drop is noticeable for complex business logic but acceptable for CRUD handlers and simple validators. See Part 27 on model routing and batching for patterns to route automatically based on function complexity.
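A rough routing heuristic along these lines might look like the sketch below. It is not the approach from Part 27: pick_model_tier is a hypothetical name, the branch-count threshold is an arbitrary stand-in for "simple logic", and the function returns a tier label rather than a concrete model ID so you can map it to whatever models you use.

```python
import ast

# Node types counted as "branching"; an assumed proxy for logical complexity.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp)


def pick_model_tier(function_source: str, max_lines: int = 50, max_branches: int = 3) -> str:
    """Return "haiku" for short, simple functions and "sonnet" otherwise."""
    tree = ast.parse(function_source)
    line_count = len(function_source.strip().splitlines())
    branches = sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))
    if line_count < max_lines and branches <= max_branches:
        return "haiku"
    return "sonnet"


simple_fn = "def is_positive(n):\n    return n > 0\n"
branchy_fn = (
    "def classify(n):\n"
    "    if n < 0:\n        return 'neg'\n"
    "    if n == 0:\n        return 'zero'\n"
    "    if n < 10:\n        return 'small'\n"
    "    if n < 100:\n        return 'medium'\n"
    "    return 'large'\n"
)
print(pick_model_tier(simple_fn), pick_model_tier(branchy_fn))  # haiku sonnet
```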
Frequently Asked Questions
How accurate are the generated assertions?
For pure functions with deterministic outputs (math, string manipulation, data transformation), accuracy is high: 85 to 95 percent of assertions are correct on the first generation in testing. For functions with side effects, external calls, or time-dependent behavior, accuracy drops and you will need to mock dependencies manually. The rule of thumb is: the cleaner the function, the better the tests. This is one reason AI test generation indirectly encourages better function design.
Can I use this on async functions?
Yes. The ast.walk call in extract_functions already handles ast.AsyncFunctionDef. Claude generates async def test_... bodies when it sees async def in the source. Make sure you have pytest-asyncio installed and set asyncio_mode to auto (asyncio_mode = auto in pytest.ini, or asyncio_mode = "auto" under [tool.pytest.ini_options] in pyproject.toml) so pytest runs async tests correctly.
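To see how the parser distinguishes the two kinds of definition, here is a small standalone sketch. It is not the POC's extract_functions; list_functions is an illustrative name, and it iterates over the module body directly instead of using ast.walk.

```python
import ast

SAMPLE = '''
async def fetch_user(user_id):
    return {"id": user_id}

def slugify(title):
    return title.lower().replace(" ", "-")
'''


def list_functions(module_source: str) -> list[tuple[str, bool]]:
    """Return (name, is_async) for each module-level function."""
    tree = ast.parse(module_source)
    return [
        (node.name, isinstance(node, ast.AsyncFunctionDef))
        for node in tree.body
        # async def produces AsyncFunctionDef, a separate node type from FunctionDef.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]


print(list_functions(SAMPLE))  # [('fetch_user', True), ('slugify', False)]
```

The is_async flag is the kind of hint you could pass along in the prompt if you want to be explicit rather than rely on Claude reading the keyword.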
What about functions that depend on fixtures or database state?
The structured output schema includes a fixtures array. Claude will populate it when it detects that the function needs setup state. However, it cannot know your specific database schema or ORM models. For these cases, treat the generated fixture as a structural placeholder and fill in the real setup code (SQLAlchemy session creation, factory-boy factories, etc.) manually. The test skeleton is still valuable even when the fixtures need work.
How does this compare to property-based testing with Hypothesis?
They are complementary, not competing approaches. Hypothesis generates inputs randomly within constraints you define and is excellent at finding numeric boundary bugs that Claude might specify incorrectly. Claude generates semantically meaningful test cases with descriptive names, which are better for documentation, regression coverage, and code review. A practical combination: use this tool to generate the initial test file, then add @given decorators from Hypothesis to the pure numeric functions where broad boundary checking matters most. Part 24, on evaluation harnesses for Claude apps, discusses this combination in the context of testing AI pipelines themselves.
Can I feed the failed test output back to Claude for correction?
Yes, and this is a natural extension of the POC. After running pytest, parse the output for FAILED lines and the associated short traceback, then send a follow-up message to Claude: “These tests failed with the following errors. Fix the assertions.” Because you are using tool use with a structured schema, Claude can return corrected test cases in the same format and you can patch the file. This turns the generator into a self-correcting loop. See Part 22 on autonomous agent loops for the general pattern.
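The parsing half of that loop can be sketched as below, assuming pytest's short test summary lines of the form "FAILED node_id - reason". The helper names collect_failures and build_fix_prompt are illustrative, and a node ID that itself contains " - " would be split in the wrong place.

```python
import re

# Matches short-summary lines such as:
# FAILED test_pricing.py::TestApplyDiscount::test_rounds_half_up - assert 1 == 2
FAILED_LINE = re.compile(r"^FAILED (?P<node_id>.+?)(?: - (?P<reason>.*))?$")


def collect_failures(pytest_output: str) -> list[dict[str, str]]:
    """Extract node IDs and one-line reasons from pytest's summary section."""
    failures = []
    for line in pytest_output.splitlines():
        match = FAILED_LINE.match(line.strip())
        if match:
            failures.append(
                {"node_id": match["node_id"], "reason": match["reason"] or ""}
            )
    return failures


def build_fix_prompt(failures: list[dict[str, str]]) -> str:
    """Assemble the follow-up message text to send back to Claude."""
    lines = ["These tests failed with the following errors. Fix the assertions."]
    lines += [f"- {item['node_id']}: {item['reason']}" for item in failures]
    return "\n".join(lines)


SAMPLE_OUTPUT = """\
=========================== short test summary info ============================
FAILED test_pricing.py::TestApplyDiscount::test_rounds_half_up - assert 10.12 == 10.13
FAILED test_pricing.py::TestParseSku::test_empty_string
========================= 2 failed, 5 passed in 0.12s ==========================
"""
print(build_fix_prompt(collect_failures(SAMPLE_OUTPUT)))
```

In practice you would append the relevant traceback excerpt for each node ID as well, since the one-line reason is often too terse for Claude to correct the expected value.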
Does this work with methods inside classes?
Not out of the box. ast.walk traverses the entire tree, and isinstance(node, ast.FunctionDef) matches methods as well as functions, so the POC applies a filter (marked # Only top-level functions (parent is Module) in the code) that restricts extraction to module-level definitions. To include class methods, remove that filter and pass the class name to Claude as context so it generates instance = MyClass() setup in the test body. The structured schema handles this cleanly since the fixture mechanism already supports arbitrary setup code.
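A minimal sketch of collecting methods together with their class name, so the class can be passed along as context, might look like this. The name extract_methods is hypothetical and the code only looks at classes defined at module level; it is not taken from the POC.

```python
import ast

SAMPLE = '''
class Cart:
    def __init__(self):
        self.items = []

    def total(self):
        return sum(item.price for item in self.items)

def helper():
    return 1
'''


def extract_methods(module_source: str) -> list[tuple[str, str]]:
    """Return (class_name, method_name) pairs for top-level classes."""
    pairs = []
    for node in ast.parse(module_source).body:
        if isinstance(node, ast.ClassDef):
            for item in node.body:
                # Only direct members of the class body, not nested functions.
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    pairs.append((node.name, item.name))
    return pairs


print(extract_methods(SAMPLE))  # [('Cart', '__init__'), ('Cart', 'total')]
```

Sending the __init__ source together with each method gives Claude what it needs to construct a valid instance in the test body.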
What if the function is not well-documented?
Claude infers behavior from the code itself, not just the docstring. Even a completely undocumented function will get reasonable tests based on what the code actually does. The quality is noticeably better with a docstring (Claude knows the intended contract vs. the actual implementation), but it is not required. Where Claude cannot determine expected behavior from the source, it typically generates a test with a comment like # TODO: verify expected value, which is a reasonable signal to add documentation.
Back to the full series: AI in Production: 30 Real-World Use Cases with Claude
Related reading in this series: Part 2: Tool Use with Claude (the structured output pattern this POC relies on) · Part 3: Structured Output from Claude · Part 5: AI Code Review Bot · Part 24: Evaluate Your Claude App
External references: Anthropic Tool Use documentation · pytest official documentation · Python ast module · Hypothesis property-based testing