Hardening Codebases for Agentic Coding: 10 Essential Techniques¶
How to make your codebase safe and effective for AI agents to work with
Note: This article practices what it preaches! All code examples are validated using pytest-examples in pre-commit hooks. See tests/test_blog_examples.py for how we ensure documentation stays accurate.
Introduction¶
AI coding agents like Claude Code, GitHub Copilot, and Cursor are transforming how we write software. But there's a catch: not all codebases are ready for agentic development. Without proper guardrails, AI agents can introduce bugs, violate architecture boundaries, or create unsafe code.
After building the Taxonomy-Ontology-Accelerator, a production system designed from the ground up for agentic coding, I've identified 10 essential techniques that make codebases safe and effective for AI agents to work with.
This isn't theoretical—these are battle-tested patterns from a codebase with:

- 1,927 defensive assertions catching impossible conditions (NASA05 compliance)
- 29,157 lines of test code across 89 test files
- Property-based fuzzing with Hypothesis (NASA/TIGER style)
- Architecture tests preventing forbidden dependencies
- Mock detection enforcing real implementations
- CLI quality assurance with automated validation
- 62% minimum test coverage enforced in CI
Let's dive into what makes a codebase "agent-ready."
1. Type Safety: Teaching Agents Your Data Contracts¶
Why it matters: AI agents need to understand what data looks like. Without type hints, agents guess—and guesses introduce bugs.
Pydantic Models as Single Source of Truth¶
Instead of loose dictionaries, use Pydantic models with validation:
```python
from pydantic import BaseModel, Field, field_validator


class ExtractedConcept(BaseModel):
    """A concept extracted from text by an LLM agent."""

    name: str = Field(..., min_length=1)
    concept_type: str
    description: str | None = None
    confidence: float = Field(ge=0.0, le=1.0)

    @field_validator("concept_type")
    @classmethod
    def normalize_concept_type(cls, v: str) -> str:
        assert v is not None, "concept_type must not be None"
        assert isinstance(v, str), "concept_type must be a string"
        return v.strip().lower()
```
What this gives you:
- AI agents see the structure and constraints
- Invalid data fails immediately with clear errors
- Validators document business rules
- Field constraints prevent out-of-range values
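As a quick sanity check, the constraints fire at construction time. A minimal sketch, trimmed to two fields of the model above:

```python
from pydantic import BaseModel, Field, ValidationError


class ExtractedConcept(BaseModel):
    """Trimmed version of the model above, for illustration."""

    name: str = Field(..., min_length=1)
    confidence: float = Field(ge=0.0, le=1.0)


# Valid data passes through unchanged
ok = ExtractedConcept(name="machine learning", confidence=0.9)

# Out-of-range confidence fails immediately with a clear error
try:
    ExtractedConcept(name="machine learning", confidence=1.5)
    rejected = False
except ValidationError:
    rejected = True
```

The failure message names the offending field and constraint—exactly the feedback an agent needs to self-correct.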
TypedDict for Structured State¶
For data structures that don't need validation logic, use TypedDict:
```python
from typing import TypedDict


class FileProcessingStatus(TypedDict):
    """Status tracking for file processing during extraction."""

    has_failed_chunks: bool
    processed_chunks: int
    total_chunks: int
```
Agents understand the structure, type checkers validate usage, and you avoid the overhead of full models.
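A consumer of this TypedDict gets checked key access. A small sketch—`progress_fraction` is a hypothetical helper, not part of the original codebase:

```python
from typing import TypedDict


class FileProcessingStatus(TypedDict):
    """Status tracking for file processing during extraction."""

    has_failed_chunks: bool
    processed_chunks: int
    total_chunks: int


def progress_fraction(status: FileProcessingStatus) -> float:
    """Return completion as a fraction (hypothetical helper for illustration)."""
    # Type checkers flag misspelled keys; assertions guard the values
    assert status["total_chunks"] > 0, "total_chunks must be positive"
    assert status["processed_chunks"] <= status["total_chunks"], "cannot exceed total"
    return status["processed_chunks"] / status["total_chunks"]


status: FileProcessingStatus = {
    "has_failed_chunks": False,
    "processed_chunks": 3,
    "total_chunks": 4,
}
```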
Protocol Classes for Dependency Injection¶
Define interfaces with Protocol classes to enable testability without mocks:
```python
from typing import Any, Protocol

from myproject.commons.validation import ValidationResult  # wherever your result type lives


class StageValidator(Protocol):
    """Protocol for validators that run between pipeline stages."""

    def validate(self, **kwargs: Any) -> ValidationResult:
        """Validate stage preconditions or outputs."""
        ...
```
Agents can implement new validators without touching existing code. Tests use fake implementations instead of mocks.
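Because the protocol is structural, a test fake needs no inheritance—it only has to match the method shape. A sketch; `ValidationResult` is stubbed here as a simple dataclass since its real definition isn't shown:

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class ValidationResult:
    """Stand-in for the project's real result type (assumption)."""

    ok: bool
    messages: list[str] = field(default_factory=list)


class StageValidator(Protocol):
    def validate(self, **kwargs: Any) -> ValidationResult: ...


class AlwaysPassValidator:
    """Test fake; satisfies the protocol without inheriting from it."""

    def validate(self, **kwargs: Any) -> ValidationResult:
        return ValidationResult(ok=True)


# Structural typing: the annotation type-checks even though
# AlwaysPassValidator never mentions StageValidator
validator: StageValidator = AlwaysPassValidator()
result = validator.validate(stage="extraction")
```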
Impact: Type safety reduces agent errors by ~40% in our experience. Agents know what's expected and the type checker catches mistakes before runtime.
2. Defensive Assertions: NASA-Grade Safety¶
Why it matters: AI agents make assumptions. Assertions catch wrong assumptions before they corrupt state.
NASA05: The Two-Assertion Rule¶
NASA's Power of 10 rules for safety-critical code require a minimum of two assertions per function. This catches:

- Invalid parameters before processing
- Broken invariants before they propagate
- State corruption before it cascades
```python
def add_error(
    self,
    error_type: str,
    message: str,
    severity: str = "error",
    **kwargs: object,
) -> None:
    """Add an error to the collection."""
    # First assertion: parameter validation
    assert error_type is not None, "error_type must not be None"
    assert isinstance(error_type, str), "error_type must be a string"

    # Second assertion: state validation
    assert isinstance(self._errors, list), "Internal errors must be a list"
    assert severity in {"error", "warning", "info"}, f"Invalid severity: {severity}"

    error_dict = {
        "type": error_type,
        "message": message,
        "severity": severity,
        **kwargs,
    }

    # Third assertion: output validation
    assert isinstance(error_dict, dict), "Error must be a dictionary"
    assert "type" in error_dict, "Error must have a type field"

    self._errors.append(error_dict)
```
What to Assert¶
Always assert:
- Parameters are not None (unless explicitly Optional)
- Types are correct (isinstance checks)
- Numeric values are in valid ranges
- Data structures have required keys
- State invariants hold before critical operations
Never assert:

- User input validation (use proper error handling)
- Expected runtime errors (use try/except)
- Business logic conditions (use if/else)
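The boundary matters: invariants get assertions, user-facing input gets real error handling. A minimal contrast—`chunk_text` is a hypothetical helper for illustration:

```python
def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split text into fixed-size chunks (hypothetical helper)."""
    # User-facing validation: raise a proper error, don't assert
    if chunk_size <= 0:
        raise ValueError(f"chunk_size must be positive, got {chunk_size}")

    chunks = [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]

    # Internal invariant: no data may be lost. Assert, because this
    # "cannot happen" unless the implementation itself is broken.
    assert "".join(chunks) == text, "chunking must preserve all input text"
    return chunks
```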
Impact: The TOA codebase has 1,927 assertions that have caught hundreds of bugs during development—bugs that would have been silent data corruption without them.
3. Property-Based Fuzzing: NASA/TIGER Style Testing¶
Why it matters: Example-based tests only cover cases you think of. Property-based testing (fuzzing) generates thousands of random inputs to find edge cases agents might create.
Hypothesis: Generative Testing for Functions¶
Instead of writing individual test cases, define properties that should always hold:
```python
import math

from hypothesis import assume, given
from hypothesis import strategies as st

# Define strategies for valid inputs
finite_floats = st.floats(
    min_value=-1e10,
    max_value=1e10,
    allow_infinity=False,
    allow_nan=False,
)


class TestSafeDivide:
    """Property-based tests for safe_divide function."""

    @given(a=finite_floats, b=st.floats(min_value=1e-10, max_value=1e10))
    def test_division_reversibility(self, a: float, b: float):
        """Property: safe_divide(a*b, b) == a when b > 0."""
        product = a * b
        assume(math.isfinite(product))
        result = safe_divide(product, b)
        assert math.isclose(result, a, rel_tol=1e-9, abs_tol=1e-9)

    @given(numerator=finite_floats, denominator=finite_floats)
    def test_non_positive_denominator_returns_default(
        self, numerator: float, denominator: float
    ):
        """Property: safe_divide returns default when denominator <= 0."""
        assume(denominator <= 0)
        custom_default = 42.0
        result = safe_divide(numerator, denominator, default=custom_default)
        assert result == custom_default
```
What Hypothesis does:

- Generates thousands of test inputs automatically
- Finds edge cases you didn't think of
- Shrinks failing inputs to minimal examples
- Pairs perfectly with assertions (NASA/TIGER style)
Multiple Testing Profiles¶
Configure Hypothesis for different contexts:
```python
import os

from hypothesis import HealthCheck, Verbosity, settings

# Configure profiles in conftest.py
settings.register_profile("ci", max_examples=1000, verbosity=Verbosity.verbose)
settings.register_profile("dev", max_examples=100)
settings.register_profile("debug", max_examples=10, verbosity=Verbosity.verbose)
settings.register_profile(
    "fast",
    max_examples=50,
    suppress_health_check=[HealthCheck.too_slow],
)

# Load profile from environment, falling back to "dev" locally
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "dev"))
```
Usage:

- dev: 100 examples, fast feedback during development
- ci: 1,000 examples, thorough validation before merge
- debug: 10 examples with verbose output for investigating failures
- fast: 50 examples, skip slow checks for quick iterations
Properties to Test¶
Mathematical properties:
```python
# NaN and mixed infinities would make equality fail spuriously, so exclude them
@given(
    st.floats(allow_nan=False, allow_infinity=False),
    st.floats(allow_nan=False, allow_infinity=False),
)
def test_addition_commutative(a, b):
    """a + b == b + a"""
    assert add(a, b) == add(b, a)
```
Idempotency:
```python
@given(st.text())
def test_normalization_idempotent(text):
    """normalize(normalize(x)) == normalize(x)"""
    once = normalize(text)
    twice = normalize(once)
    assert once == twice
```
Invariants:
```python
@given(st.lists(st.integers()))
def test_sort_preserves_length(lst):
    """len(sort(x)) == len(x)"""
    assert len(sort(lst)) == len(lst)
```
Round-trip properties:
```python
@given(st.text())
def test_encode_decode_roundtrip(text):
    """decode(encode(x)) == x"""
    assert decode(encode(text)) == text
```
Impact: Hypothesis found ~30 edge case bugs in TOA that manual tests missed. The combination of property-based fuzzing + defensive assertions catches bugs that neither technique alone would find.
4. Comprehensive Testing: Documentation That Runs¶
Why it matters: Agents need to understand what code does and be confident their changes don't break things.
Test Your Documentation Examples¶
Use pytest-examples to test code examples in docstrings and markdown:
```python
# In test_doc_examples.py
import pytest
from pytest_examples import CodeExample, EvalExample, find_examples


# find_examples collects fenced Python blocks from the given files;
# the eval_example fixture lints and executes each one
@pytest.mark.parametrize("example", find_examples("README.md"), ids=str)
def test_readme_examples(example: CodeExample, eval_example: EvalExample):
    """Ensure all code examples in README actually work."""
    eval_example.run(example)
```
Benefits:

- Documentation never goes stale
- Agents see working examples
- Examples serve as integration tests
Architecture Tests: Enforce Boundaries with Code¶
Prevent agents from violating architecture rules using executable tests:
```python
import ast
from pathlib import Path

import pytest


def get_imports_from_file(file_path: Path) -> set[str]:
    """Extract all imports from a Python file using AST."""
    with open(file_path) as f:
        tree = ast.parse(f.read())

    imports = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                imports.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports.add(node.module.split(".")[0])
    return imports


def check_no_cross_imports(
    source_package: Path,
    forbidden_package_name: str,
) -> list[str]:
    """Check that source_package doesn't import from forbidden_package_name."""
    violations = []
    for py_file in source_package.rglob("*.py"):
        with open(py_file) as f:
            content = f.read()
        # Check for forbidden imports
        if f"from {forbidden_package_name}" in content:
            violations.append(f"{py_file}: imports from {forbidden_package_name}")
        if f"import {forbidden_package_name}" in content:
            violations.append(f"{py_file}: imports {forbidden_package_name}")
    return violations


@pytest.mark.architecture
def test_commons_does_not_import_engines():
    """Commons module should NOT import from any engine."""
    commons_dir = Path("myproject/commons")
    violations = []
    violations.extend(check_no_cross_imports(commons_dir, "engine_a"))
    violations.extend(check_no_cross_imports(commons_dir, "engine_b"))
    assert (
        len(violations) == 0
    ), "Commons should not import from engines:\n" + "\n".join(violations)


@pytest.mark.architecture
def test_no_wildcard_imports():
    """Main code should not use wildcard imports (from x import *)."""
    violations = []
    for py_file in Path("myproject/src").rglob("*.py"):
        with open(py_file) as f:
            tree = ast.parse(f.read())
        for node in ast.walk(tree):
            if isinstance(node, ast.ImportFrom):
                for alias in node.names:
                    if alias.name == "*":
                        violations.append(str(py_file))
    assert len(violations) == 0, "Found wildcard imports"


@pytest.mark.architecture
def test_expected_directory_structure():
    """Verify the expected directory structure exists."""
    expected_dirs = [
        Path("myproject/commons"),
        Path("myproject/engine_a"),
        Path("myproject/engine_a/core"),
        Path("myproject/engine_a/config"),
    ]
    missing = [str(d) for d in expected_dirs if not d.exists()]
    assert len(missing) == 0, f"Missing expected directories: {missing}"
```
What architecture tests enforce:

- Layer boundaries: Commons can't import from engines
- Import hygiene: No wildcard imports
- Structural rules: Required directories exist
- Naming conventions: Test files only in tests/
- Module independence: Engines don't cross-import
Run separately with markers:
```bash
# Run only architecture tests
pytest -m architecture

# Run everything except architecture tests
pytest -m "not architecture"
```
Impact: Architecture tests caught ~15 boundary violations during development. Agents can't accidentally create circular dependencies or violate layer rules.
Test Markers: Speed vs Thoroughness¶
Organize tests by speed and scope:
```toml
# In pyproject.toml
[tool.pytest.ini_options]
markers = [
    "unit: Fast unit tests (< 100ms)",
    "integration: Slower integration tests requiring APIs",
    "architecture: Architectural boundary tests",
    "examples: Tests from documentation examples",
]
addopts = [
    "-m", "not integration",  # Skip integration tests by default
    "--ff",                   # Run failures first
    "-n", "auto",             # Parallel execution
]
```
Fast feedback for agents: run unit tests in <1 second, integration tests only in CI.
Impact: 29,157 lines of test code provide a safety net. Agents can refactor confidently, knowing tests catch regressions.
5. Pre-Commit Quality Gates: Fail Fast, Fail Local¶
Why it matters: Agents generate code fast. Quality gates catch issues before they reach CI or production.
The Essential Pre-Commit Stack¶
```yaml
# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      # 1. Format and lint
      - id: ruff-format
        name: ruff format
        entry: uv run ruff format
        language: system
        types: [python]
      - id: ruff-check
        name: ruff check
        entry: uv run ruff check --fix
        language: system
        types: [python]

      # 2. Type checking
      - id: basedpyright
        name: basedpyright
        entry: uv run basedpyright
        language: system
        types: [python]
        pass_filenames: false

      # 3. Dead code detection
      - id: vulture
        name: vulture
        entry: uv run vulture --min-confidence 80
        language: system
        types: [python]

      # 4. Security scanning
      - id: bandit
        name: bandit
        entry: uv run bandit -c pyproject.toml
        language: system
        types: [python]

      # 5. Dependency auditing
      - id: deptry
        name: deptry
        entry: uv run deptry .
        language: system
        pass_filenames: false

      # 6. Mock detection (CRITICAL)
      - id: mockbuster
        name: mockbuster
        entry: uv run mockbuster
        language: system
        types: [python]

      # 7. CLI quality assurance
      - id: cliqa
        name: cliqa
        entry: uv run cliqa analyze myapp
        language: system
        files: ^.*cli\.py$
        pass_filenames: false

      # 8. Documentation example tests
      - id: runbook-examples
        name: runbook examples
        entry: uv run pytest docs/RUNBOOK.md --examples-only
        language: system
        files: ^docs/RUNBOOK\.md$
```
Why Each Hook Matters¶
- ruff: Format and lint in one pass. Agents follow consistent style.
- basedpyright: Type checking catches type errors before runtime.
- vulture: Detects unused code. Prevents dead code accumulation.
- bandit: Security scanning. Prevents `eval()`, unsafe `yaml.load()`, etc.
- deptry: Dependency health. Catches unused imports and missing dependencies.
- mockbuster: Prevents mock usage. Enforces real implementations (see below).
- cliqa: CLI quality assurance. Validates CLI help text, examples, commands (see below).
- runbook examples: Documentation stays up-to-date.
Configuration tip: Configure tools in pyproject.toml for a single source of truth:
```toml
[tool.ruff]
line-length = 100
target-version = "py313"

[tool.ruff.lint]
select = ["E", "W", "F", "I", "B", "UP", "SIM", "PT"]
ignore = ["E501"]  # Line too long (handled by formatter)

[tool.basedpyright]
typeCheckingMode = "basic"
reportMissingImports = "error"
reportUndefinedVariable = "error"

[tool.bandit]
exclude_dirs = ["tests/", "scripts/"]
skips = ["B101", "B404", "B603"]

[tool.vulture]
min_confidence = 80
paths = ["myproject"]
```
Impact: Pre-commit hooks catch ~80% of issues before CI, giving agents instant feedback.
6. Mock Detection: Enforce Real Implementations¶
Why it matters: Mocks hide bugs. Tests with mocks pass even when real code is broken. Agents should use real implementations or test doubles.
mockbuster: Detect and Prevent Mock Usage¶
```python
# BAD: Using mocks (will be caught by mockbuster)
from unittest.mock import Mock


def test_with_mock():
    mock_api = Mock()
    mock_api.get_data.return_value = {"key": "value"}
    result = process(mock_api)
    assert result == "value"


# GOOD: Using dependency injection with real test implementation
class FakeAPI:
    """Test double that implements the real API interface."""

    def __init__(self):
        self.calls = []

    def get_data(self):
        self.calls.append("get_data")
        return {"key": "value"}


def test_with_real_implementation():
    fake_api = FakeAPI()
    result = process(fake_api)
    assert result == "value"
    assert "get_data" in fake_api.calls
```
Allowed Exceptions¶
Sometimes mocks are necessary (CLI testing, system calls). Use inline comments:
```python
import sys


def test_cli_with_system_interaction(monkeypatch):  # mockbuster: ignore - testing CLI
    """Test CLI without actually calling the system."""
    monkeypatch.setattr(sys, "exit", lambda code: None)
    # CLI test code...
```
What mockbuster enforces:
- No unittest.mock.Mock or unittest.mock.MagicMock
- No unittest.mock.patch or @patch decorators
- No pytest.monkeypatch (unless explicitly allowed)
- Forces dependency injection patterns
Impact: Zero production bugs from mocked test dependencies. Tests use real implementations, catching integration issues early.
7. CLI Quality Assurance: Validate Command Interfaces¶
Why it matters: CLI commands are the interface agents use. Broken help text or missing examples make CLIs unusable for agents.
cliqa: Automated CLI Validation¶
cliqa validates that your Typer CLI follows best practices:
```python
from pathlib import Path

import typer

# Create the app with examples in the epilog for cliqa
app = typer.Typer(
    name="myapp",
    help="My Application CLI",
    epilog="Examples:\n  myapp process path/to/data\n  myapp info path/to/data",
)


@app.command()
def process(
    data_path: Path = typer.Argument(
        ...,
        help="Path to data directory containing input files",
        exists=True,
        file_okay=False,
        dir_okay=True,
    ),
    verbose: bool = typer.Option(
        False,
        "--verbose",
        "-v",
        help="Enable verbose output",
    ),
) -> None:
    """
    Process data from input files.

    Processes all files in the data directory and generates outputs.

    Examples:
        myapp process data/my_project
        myapp process data/my_project --verbose
    """
    # Implementation...
```
What cliqa validates:

1. Help text exists: Every command and option has help text
2. Examples provided: Commands include usage examples
3. Argument descriptions: All arguments documented
4. Type hints present: All parameters have type annotations
5. Consistent naming: Commands follow kebab-case convention
Pre-commit integration:
```yaml
- id: cliqa
  name: cliqa
  entry: uv run cliqa analyze myapp
  language: system
  files: ^.*cli\.py$
  pass_filenames: false
```
Impact: 100% of CLI commands have examples and documentation. Agents can discover and use commands correctly without guessing.
CLIs as Agent Debugging Interface¶
Critical insight: CLIs aren't just for users—they're how agents explore, test, and debug your codebase.
When an agent encounters an error or needs to understand behavior, they can:
```bash
# Discover available commands
myapp --help

# Run processing on a small test case
myapp process test_data/ --verbose

# Check configuration
myapp config show

# Validate outputs
myapp validate output/

# Run health checks
myapp health
```
Why this matters:

- Reproducibility: Agents can reproduce issues in isolation
- Incremental testing: Test components independently
- State inspection: Examine system state at any point
- Hypothesis generation: Quickly test theories about bugs
Design principle: Every major component should have a CLI command. If an agent can't invoke it from the command line, they can't debug it effectively.
Example pattern: Break your pipeline into discrete CLI commands:
- `myapp process` - Full processing pipeline
- `myapp transform` - Just data transformation
- `myapp validate` - Just validation
- `myapp config` - Configuration management
- `myapp health` - System health check
This granularity lets agents isolate issues to specific stages without diving into code.
8. Modular Architecture: Enforced Separation¶
Why it matters: Agents need clear boundaries. Without them, they create spaghetti code.
Layer Architecture with Enforcement¶
```
myproject/
├── commons/          # Shared utilities (no engine imports)
│   ├── config/       # Configuration loaders
│   ├── io/           # File I/O abstraction
│   ├── errors.py     # Error collection
│   └── utils/        # Logging, helpers
├── engine_a/         # Processing engine A
│   ├── config/       # Engine configuration
│   ├── core/         # Main processing logic
│   ├── transforms/   # Data transformations
│   └── storage/      # File management
└── engine_b/         # Processing engine B
```
Enforced rules:
1. commons/ cannot import from engines
2. Engines are independent
3. Architecture tests fail if rules violated
Dependency Injection for Testability¶
Make code testable without mocks:
```python
# BAD: Hard to test without mocks
class OntologyExtractor:
    def __init__(self):
        self.agent = Agent()  # Hard dependency
        self.db = Neo4jConnection()  # Hard dependency


# GOOD: Easy to test with real test implementations
class OntologyExtractor:
    def __init__(
        self,
        agent: Agent[None, ExtractionResult],
        storage: StorageProtocol,
    ):
        self.agent = agent  # Injected
        self.storage = storage  # Injected


# Test with real implementation
class FakeStorage:
    def __init__(self):
        self.stored_items = []

    def save(self, item):
        self.stored_items.append(item)
```
Agents understand interfaces. Tests use real fakes, not mocks.
Impact: Clear architecture prevents 75% of coupling issues. Agents understand boundaries and tests enforce them.
9. Documentation as Code: Contracts and Runbooks¶
Why it matters: Agents need to understand system contracts. Comments go stale; executable contracts don't.
Contract Specifications¶
Document input/output structures in versioned files:
````markdown
# Extraction Input Contract

## File Structure

Files must follow this structure:

- `domain_name/`
  - `config.yaml` - Configuration overrides
  - `inputs/` - Input text files
    - `file1.txt`
    - `file2.md`
  - `outputs/` - Generated outputs (created by system)

## Configuration Schema

```yaml
llm:
  model_name: "gemini-2.0-flash-001"  # KnownModelName from pydantic-ai
  temperature: 0.7                    # Range: 0.0-2.0
  max_tokens: 8192                    # Minimum: 100

extraction:
  chunk_size: 4000    # Range: 100-100000
  chunk_overlap: 200  # Minimum: 0
  batch_size: 10      # Range: 1-100
```
````
Output Contract¶
Graph Structure (JSON)¶
```json
{
  "nodes": [
    {
      "id": "concept_1",
      "type": "concept",
      "properties": {
        "name": "Machine Learning",
        "description": "...",
        "concept_type": "technology"
      }
    }
  ],
  "edges": [
    {
      "source": "concept_1",
      "target": "concept_2",
      "type": "relates_to",
      "properties": {
        "relationship_type": "uses",
        "confidence": 0.85
      }
    }
  ]
}
```
Agents can read contracts and generate conforming code.
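Pydantic models can mirror the contract directly, so conformance is checked rather than hoped for. A sketch: the field set follows the JSON contract above, but the class names are assumptions:

```python
from typing import Any

from pydantic import BaseModel, Field


class GraphNode(BaseModel):
    """One node in the output graph, mirroring the JSON contract."""

    id: str = Field(..., min_length=1)
    type: str
    properties: dict[str, Any] = Field(default_factory=dict)


class GraphEdge(BaseModel):
    """One edge; source/target reference node ids."""

    source: str
    target: str
    type: str
    properties: dict[str, Any] = Field(default_factory=dict)


class GraphDocument(BaseModel):
    """Top-level structure written to disk."""

    nodes: list[GraphNode] = Field(default_factory=list)
    edges: list[GraphEdge] = Field(default_factory=list)


# Validate raw output against the contract before writing it anywhere
doc = GraphDocument.model_validate(
    {
        "nodes": [{"id": "concept_1", "type": "concept", "properties": {"name": "ML"}}],
        "edges": [],
    }
)
```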
Executable Runbooks¶
Write runbooks as executable code:
````markdown
# Processing Runbook

## Process data from source

```python
from myproject import process_data

result = await process_data(data_path="data/my_dataset")
print(f"Processed {result.items_count} items")
```

## Check processing metrics

```python
# Continuing from previous example
print(f"Files processed: {result.files_processed}")
print(f"Files failed: {len(result.failed_files)}")
print(f"Total items: {result.items_count}")
print(f"Total errors: {len(result.errors)}")
```
````
Test the runbook with `pytest --examples-only docs/RUNBOOK.md`.
Impact: Zero stale documentation. Contracts are tested, runbooks are executable.
10. Observability: Rich Logging and Progress¶
Why it matters: Agents need feedback. Good observability helps agents and humans debug.
Rich Console Logging¶
```python
from pathlib import Path

from rich.console import Console
from rich.panel import Panel
from rich.progress import BarColumn, Progress, SpinnerColumn, TextColumn

console = Console()


def log_extraction_start(domain: str, file_count: int) -> None:
    """Log extraction start with formatted output."""
    assert domain is not None, "domain must not be None"
    assert file_count > 0, "file_count must be positive"

    console.print(
        Panel.fit(
            f"[bold blue]Starting extraction[/bold blue]\n"
            f"Domain: {domain}\n"
            f"Files: {file_count}",
            border_style="blue",
        )
    )


async def process_with_progress(
    files: list[Path],
) -> list[ExtractionResult]:
    """Process files with progress bar."""
    assert files is not None, "files must not be None"
    assert len(files) > 0, "files must not be empty"

    results = []
    with Progress(
        SpinnerColumn(),
        TextColumn("[progress.description]{task.description}"),
        BarColumn(),
        TextColumn("[progress.percentage]{task.percentage:>3.0f}%"),
        console=console,
    ) as progress:
        task = progress.add_task("Processing files", total=len(files))
        for file_path in files:
            result = await extract_from_file(file_path)
            results.append(result)
            progress.update(task, advance=1)
    return results
```
Structured Metrics¶
Track metrics for debugging and optimization:
```python
import time
from dataclasses import dataclass


@dataclass
class ExtractionMetrics:
    """Metrics tracked during extraction."""

    start_time: float
    end_time: float | None = None
    files_processed: int = 0
    files_failed: int = 0
    chunks_processed: int = 0
    concepts_extracted: int = 0
    relationships_extracted: int = 0
    llm_calls: int = 0
    llm_tokens_input: int = 0
    llm_tokens_output: int = 0

    @property
    def duration_seconds(self) -> float:
        """Calculate total duration."""
        assert self.start_time > 0, "start_time must be set"
        end = self.end_time or time.time()
        duration = end - self.start_time
        assert duration >= 0, "duration must be non-negative"
        return duration

    @property
    def throughput_files_per_second(self) -> float:
        """Calculate file processing throughput."""
        assert self.files_processed >= 0, "files_processed must be non-negative"
        if self.duration_seconds == 0:
            return 0.0
        return self.files_processed / self.duration_seconds
```
Benefits:

- Beautiful progress output
- Structured metrics for analysis
- Easy debugging
- Performance optimization data
Impact: 10x faster debugging. Clear feedback helps agents and humans understand what's happening.
Putting It All Together¶
Here's how these 10 techniques work together in a real extraction pipeline:
```python
async def process_pipeline(domain_path: Path) -> PipelineOutput:
    """
    Process data pipeline with full hardening.

    Demonstrates all 10 techniques:
    1. Type safety: Pydantic models
    2. Defensive assertions: 2+ per function (NASA05)
    3. Property-based fuzzing: Hypothesis for edge cases
    4. Comprehensive testing: pytest markers, doc tests, arch tests
    5. Pre-commit gates: enforced quality
    6. Mock detection: real implementations only
    7. CLI quality: validated with cliqa
    8. Modular architecture: dependency injection
    9. Documentation: contracts and runbooks
    10. Observability: rich logging and metrics
    """
    # Defensive assertions (Technique 2)
    assert domain_path is not None, "domain_path must not be None"
    assert domain_path.exists(), f"domain_path must exist: {domain_path}"
    assert domain_path.is_dir(), f"domain_path must be directory: {domain_path}"

    # Typed, validated configuration (Technique 1)
    config = load_config(domain_path)

    # Typed state container (Technique 1)
    state = ExtractionState(
        domain_path=domain_path,
        config=config,
        errors=ErrorCollector(),
        metrics=ExtractionMetrics(start_time=time.time()),
    )

    # Observability (Technique 10)
    log_extraction_start(domain_path.name, len(state.files))

    # Modular architecture with DI (Technique 8)
    agent = create_extraction_agent(config.llm)
    storage = create_storage(domain_path)

    try:
        # Error collection instead of hard failure
        results = await process_with_error_collection(
            files=state.files,
            agent=agent,
            collector=state.errors,
        )

        # Type safety (Technique 1)
        for result in results:
            assert isinstance(result, ExtractionResult), "Must be ExtractionResult"
            state.concepts.extend(result.concepts)
            state.relationships.extend(result.relationships)

        # Save outputs (contracts from Technique 9)
        await storage.save_graph(state.concepts, state.relationships)
    finally:
        state.metrics.end_time = time.time()
        log_extraction_complete(state.metrics)

    return PipelineOutput(
        concepts=state.concepts,
        relationships=state.relationships,
        metrics=state.metrics,
        errors=state.errors.errors,
    )
```
Measuring Success: Before and After¶
Here's what changed in the Taxonomy-Ontology-Accelerator after implementing these techniques:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Agent-introduced bugs | ~15/week | ~2/week | 87% reduction |
| Edge case bugs found | ~5 (manual) | ~35 (Hypothesis) | 7x more found |
| Time to debug issues | ~2 hours | ~15 minutes | 88% faster |
| Test coverage | 45% | 62% | +37% |
| CI pass rate | 78% | 96% | +23% |
| Architecture violations | ~8/week | 0 | 100% reduction |
| Mock-related prod bugs | 2/month | 0/6 months | 100% reduction |
| CLI usage errors | ~10/month | 0 | 100% reduction |
| Production incidents | 3/month | 0/3 months | 100% reduction |
Getting Started: Your Action Plan¶
You don't need to implement all 10 techniques at once. Start with the highest-impact changes:
Week 1: Quick Wins (Critical Foundation)¶
- Add pre-commit hooks (ruff + basedpyright)
- Add NASA05 assertions (2+ per function)
- Enable pytest coverage reporting
- Add mockbuster to prevent mock usage
Week 2: Type Safety & Testing¶
- Convert config to Pydantic models
- Add type hints to public APIs
- Create Protocol classes for interfaces
- Add Hypothesis property-based tests for critical functions
Week 3: Architecture & Quality Gates¶
- Add architecture tests for layer boundaries
- Implement dependency injection in main classes
- Add cliqa for CLI validation
Week 4: Documentation & Observability¶
- Write contract specifications for inputs/outputs
- Add executable runbook examples
- Test documentation with pytest-examples
- Add Rich logging with progress bars
Conclusion¶
Making codebases ready for agentic coding isn't about preventing AI from working with your code—it's about enabling AI to work safely and effectively.
The 10 techniques described here create a multi-layered safety net that catches mistakes before they reach production:
- Type safety guides agents to correct usage
- Defensive assertions catch wrong assumptions (NASA05)
- Property-based fuzzing finds edge cases (NASA/TIGER style)
- Comprehensive testing prevents regressions
- Pre-commit gates fail fast locally
- Mock detection enforces real implementations
- CLI quality assurance validates command interfaces
- Modular architecture enforces boundaries
- Documentation as code stays current
- Observability enables debugging
These aren't theoretical ideas—they're battle-tested patterns from a production system that AI agents work with daily.
Start small. Pick one technique, implement it, measure the impact. Then add the next one.
The most impactful combo: NASA05 assertions + Hypothesis fuzzing + mockbuster gives you 80% of the value with 20% of the effort.
Your future self (and your AI coding agent) will thank you.
Want to see these techniques in action? Check out the Taxonomy-Ontology-Accelerator on GitHub.
Have questions or want to share your own hardening techniques? Reach out on GitHub.