Hardening Codebases for Agentic Coding: 10 Essential Techniques¶
How to make your codebase safe and effective for AI agents to work with
Note: This article practices what it preaches! All code examples are validated using pytest-examples in pre-commit hooks. See tests/test_blog_examples.py for how we ensure documentation stays accurate.
Introduction¶
AI coding agents like Claude Code, GitHub Copilot, and Cursor are transforming how we write software. But there's a catch: not all codebases are ready for agentic development. Without proper guardrails, AI agents can introduce bugs, violate architecture boundaries, or create unsafe code.
After building the Taxonomy-Ontology-Accelerator, a production system designed from the ground up for agentic coding, I've identified 10 essential techniques that make codebases safe and effective for AI agents to work with.
This isn't theoretical—these are battle-tested patterns from a codebase with:

- 1,927 defensive assertions catching impossible conditions (NASA05 compliance)
- 29,157 lines of test code across 89 test files
- Property-based fuzzing with Hypothesis (NASA/TIGER style)
- Architecture tests preventing forbidden dependencies
- Mock detection enforcing real implementations
- CLI quality assurance with automated validation
- 62% minimum test coverage enforced in CI
Let's dive into what makes a codebase "agent-ready."
1. Type Safety: Teaching Agents Your Data Contracts¶
Why it matters: AI agents need to understand what data looks like. Without type hints, agents guess—and guesses introduce bugs.
Pydantic Models as Single Source of Truth¶
Instead of loose dictionaries, use Pydantic models with validation:
```python
from pydantic import BaseModel, Field, field_validator


class ExtractedConcept(BaseModel):
    """A concept extracted from text by an LLM agent."""

    name: str = Field(..., min_length=1)
    concept_type: str
    description: str | None = None
    confidence: float = Field(ge=0.0, le=1.0)

    @field_validator("concept_type")
    @classmethod
    def normalize_concept_type(cls, v: str) -> str:
        assert v is not None, "concept_type must not be None"
        assert isinstance(v, str), "concept_type must be a string"
        return v.strip().lower()
```
What this gives you:
- AI agents see the structure and constraints
- Invalid data fails immediately with clear errors
- Validators document business rules
- Field constraints prevent out-of-range values
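As a quick sanity check, the constraints fire at construction time. A minimal sketch, trimmed to two fields of the model above:

```python
from pydantic import BaseModel, Field, ValidationError


class ExtractedConcept(BaseModel):
    """Trimmed version of the model above, for illustration."""

    name: str = Field(..., min_length=1)
    confidence: float = Field(ge=0.0, le=1.0)


# Valid data passes through unchanged
ok = ExtractedConcept(name="machine learning", confidence=0.9)

# Out-of-range confidence fails immediately with a clear error
try:
    ExtractedConcept(name="machine learning", confidence=1.5)
    rejected = False
except ValidationError:
    rejected = True
```

The failure message names the offending field and constraint—exactly the feedback an agent needs to self-correct.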
TypedDict for Structured State¶
For data structures that don't need validation logic, use TypedDict:
```python
from typing import TypedDict


class FileProcessingStatus(TypedDict):
    """Status tracking for file processing during extraction."""

    has_failed_chunks: bool
    processed_chunks: int
    total_chunks: int
```
Agents understand the structure, type checkers validate usage, and you avoid the overhead of full models.
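A consumer of this TypedDict gets checked key access. A small sketch—`progress_fraction` is a hypothetical helper, not part of the original codebase:

```python
from typing import TypedDict


class FileProcessingStatus(TypedDict):
    """Status tracking for file processing during extraction."""

    has_failed_chunks: bool
    processed_chunks: int
    total_chunks: int


def progress_fraction(status: FileProcessingStatus) -> float:
    """Return completion as a fraction (hypothetical helper for illustration)."""
    # Type checkers flag misspelled keys; assertions guard the values
    assert status["total_chunks"] > 0, "total_chunks must be positive"
    assert status["processed_chunks"] <= status["total_chunks"], "cannot exceed total"
    return status["processed_chunks"] / status["total_chunks"]


status: FileProcessingStatus = {
    "has_failed_chunks": False,
    "processed_chunks": 3,
    "total_chunks": 4,
}
```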
Protocol Classes for Dependency Injection¶
Define interfaces with Protocol classes to enable testability without mocks:
```python
from typing import Any, Protocol

from myproject.commons.validation import ValidationResult  # wherever your result type lives


class StageValidator(Protocol):
    """Protocol for validators that run between pipeline stages."""

    def validate(self, **kwargs: Any) -> ValidationResult:
        """Validate stage preconditions or outputs."""
        ...
```
Agents can implement new validators without touching existing code. Tests use fake implementations instead of mocks.
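Because the protocol is structural, a test fake needs no inheritance—it only has to match the method shape. A sketch; `ValidationResult` is stubbed here as a simple dataclass since its real definition isn't shown:

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class ValidationResult:
    """Stand-in for the project's real result type (assumption)."""

    ok: bool
    messages: list[str] = field(default_factory=list)


class StageValidator(Protocol):
    def validate(self, **kwargs: Any) -> ValidationResult: ...


class AlwaysPassValidator:
    """Test fake; satisfies the protocol without inheriting from it."""

    def validate(self, **kwargs: Any) -> ValidationResult:
        return ValidationResult(ok=True)


# Structural typing: the annotation type-checks even though
# AlwaysPassValidator never mentions StageValidator
validator: StageValidator = AlwaysPassValidator()
result = validator.validate(stage="extraction")
```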
Impact: Type safety reduces agent errors by ~40% in our experience. Agents know what's expected and the type checker catches mistakes before runtime.
2. Defensive Assertions: NASA-Grade Safety¶
Why it matters: AI agents make assumptions. Assertions catch wrong assumptions before they corrupt state.
NASA05: The Two-Assertion Rule¶
NASA's Power of 10 rules for safety-critical code require a minimum of two assertions per function. This catches:

- Invalid parameters before processing
- Broken invariants before they propagate
- State corruption before it cascades
```python
def add_error(
    self,
    error_type: str,
    message: str,
    severity: str = "error",
    **kwargs: object,
) -> None:
    """Add an error to the collection."""
    # First assertion: parameter validation
    assert error_type is not None, "error_type must not be None"
    assert isinstance(error_type, str), "error_type must be a string"

    # Second assertion: state validation
    assert isinstance(self._errors, list), "Internal errors must be a list"
    assert severity in {"error", "warning", "info"}, f"Invalid severity: {severity}"

    error_dict = {
        "type": error_type,
        "message": message,
        "severity": severity,
        **kwargs,
    }

    # Third assertion: output validation
    assert isinstance(error_dict, dict), "Error must be a dictionary"
    assert "type" in error_dict, "Error must have a type field"

    self._errors.append(error_dict)
```
What to Assert¶
Always assert:
- Parameters are not None (unless explicitly Optional)
- Types are correct (isinstance checks)
- Numeric values are in valid ranges
- Data structures have required keys
- State invariants hold before critical operations
Never assert:

- User input validation (use proper error handling)
- Expected runtime errors (use try/except)
- Business logic conditions (use if/else)
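The boundary matters: invariants get assertions, user-facing input gets real error handling. A minimal contrast—`chunk_text` is a hypothetical helper for illustration:

```python
def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split text into fixed-size chunks (hypothetical helper)."""
    # User-facing validation: raise a proper error, don't assert
    if chunk_size <= 0:
        raise ValueError(f"chunk_size must be positive, got {chunk_size}")

    chunks = [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]

    # Internal invariant: no data may be lost. Assert, because this
    # "cannot happen" unless the implementation itself is broken.
    assert "".join(chunks) == text, "chunking must preserve all input text"
    return chunks
```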
Impact: The TOA codebase has 1,927 assertions that have caught hundreds of bugs during development—bugs that would have been silent data corruption without them.
3. Property-Based Fuzzing: NASA/TIGER Style Testing¶
Why it matters: Example-based tests only cover cases you think of. Property-based testing (fuzzing) generates thousands of random inputs to find edge cases agents might create.
Hypothesis: Generative Testing for Functions¶
Instead of writing individual test cases, define properties that should always hold:
```python
import math

from hypothesis import assume, given
from hypothesis import strategies as st

# Define strategies for valid inputs
finite_floats = st.floats(
    min_value=-1e10,
    max_value=1e10,
    allow_infinity=False,
    allow_nan=False,
)


class TestSafeDivide:
    """Property-based tests for safe_divide function."""

    @given(a=finite_floats, b=st.floats(min_value=1e-10, max_value=1e10))
    def test_division_reversibility(self, a: float, b: float):
        """Property: safe_divide(a*b, b) == a when b > 0."""
        product = a * b
        assume(math.isfinite(product))
        result = safe_divide(product, b)
        assert math.isclose(result, a, rel_tol=1e-9, abs_tol=1e-9)

    @given(numerator=finite_floats, denominator=finite_floats)
    def test_non_positive_denominator_returns_default(
        self, numerator: float, denominator: float
    ):
        """Property: safe_divide returns default when denominator <= 0."""
        assume(denominator <= 0)
        custom_default = 42.0
        result = safe_divide(numerator, denominator, default=custom_default)
        assert result == custom_default
```
What Hypothesis does:

- Generates thousands of test inputs automatically
- Finds edge cases you didn't think of
- Shrinks failing inputs to minimal examples
- Pairs perfectly with assertions (NASA/TIGER style)
Multiple Testing Profiles¶
Configure Hypothesis for different contexts:
```python
import os

from hypothesis import HealthCheck, Verbosity, settings

# Configure profiles in conftest.py
settings.register_profile("ci", max_examples=1000, verbosity=Verbosity.verbose)
settings.register_profile("dev", max_examples=100)
settings.register_profile("debug", max_examples=10, verbosity=Verbosity.verbose)
settings.register_profile(
    "fast",
    max_examples=50,
    suppress_health_check=[HealthCheck.too_slow],
)

# Load profile from environment, falling back to "dev" locally
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "dev"))
```
Usage:

- dev: 100 examples, fast feedback during development
- ci: 1,000 examples, thorough validation before merge
- debug: 10 examples with verbose output for investigating failures
- fast: 50 examples, skip slow checks for quick iterations
Properties to Test¶
Mathematical properties:
```python
# NaN and mixed infinities would make equality fail spuriously, so exclude them
@given(
    st.floats(allow_nan=False, allow_infinity=False),
    st.floats(allow_nan=False, allow_infinity=False),
)
def test_addition_commutative(a, b):
    """a + b == b + a"""
    assert add(a, b) == add(b, a)
```
Idempotency:
```python
@given(st.text())
def test_normalization_idempotent(text):
    """normalize(normalize(x)) == normalize(x)"""
    once = normalize(text)
    twice = normalize(once)
    assert once == twice
```
Invariants:
```python
@given(st.lists(st.integers()))
def test_sort_preserves_length(lst):
    """len(sort(x)) == len(x)"""
    assert len(sort(lst)) == len(lst)
```
Round-trip properties:
```python
@given(st.text())
def test_encode_decode_roundtrip(text):
    """decode(encode(x)) == x"""
    assert decode(encode(text)) == text
```
Impact: Hypothesis found ~30 edge case bugs in TOA that manual tests missed. The combination of property-based fuzzing + defensive assertions catches bugs that neither technique alone would find.
4. Comprehensive Testing: Documentation That Runs¶
Why it matters: Agents need to understand what code does and be confident their changes don't break things.
Test Your Documentation Examples¶
Use pytest-examples to test code examples in docstrings and markdown:
```python
# In test_doc_examples.py
import pytest
from pytest_examples import CodeExample, EvalExample, find_examples


# find_examples collects fenced Python blocks from the given files;
# the eval_example fixture lints and executes each one
@pytest.mark.parametrize("example", find_examples("README.md"), ids=str)
def test_readme_examples(example: CodeExample, eval_example: EvalExample):
    """Ensure all code examples in README actually work."""
    eval_example.run(example)
```
Benefits:

- Documentation never goes stale
- Agents see working examples
- Examples serve as integration tests
Architecture Tests: Enforce Boundaries with Code¶
Prevent agents from violating architecture rules using executable tests:
```python
import ast
from pathlib import Path

import pytest


def get_imports_from_file(file_path: Path) -> set[str]:
    """Extract all imports from a Python file using AST."""
    with open(file_path) as f:
        tree = ast.parse(f.read())

    imports = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                imports.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports.add(node.module.split(".")[0])
    return imports


def check_no_cross_imports(
    source_package: Path,
    forbidden_package_name: str,
) -> list[str]:
    """Check that source_package doesn't import from forbidden_package_name."""
    violations = []
    for py_file in source_package.rglob("*.py"):
        with open(py_file) as f:
            content = f.read()
        # Check for forbidden imports
        if f"from {forbidden_package_name}" in content:
            violations.append(f"{py_file}: imports from {forbidden_package_name}")
        if f"import {forbidden_package_name}" in content:
            violations.append(f"{py_file}: imports {forbidden_package_name}")
    return violations


@pytest.mark.architecture
def test_commons_does_not_import_engines():
    """Commons module should NOT import from any engine."""
    commons_dir = Path("myproject/commons")
    violations = []
    violations.extend(check_no_cross_imports(commons_dir, "engine_a"))
    violations.extend(check_no_cross_imports(commons_dir, "engine_b"))
    assert (
        len(violations) == 0
    ), "Commons should not import from engines:\n" + "\n".join(violations)


@pytest.mark.architecture
def test_no_wildcard_imports():
    """Main code should not use wildcard imports (from x import *)."""
    violations = []
    for py_file in Path("myproject/src").rglob("*.py"):
        with open(py_file) as f:
            tree = ast.parse(f.read())
        for node in ast.walk(tree):
            if isinstance(node, ast.ImportFrom):
                for alias in node.names:
                    if alias.name == "*":
                        violations.append(str(py_file))
    assert len(violations) == 0, "Found wildcard imports"


@pytest.mark.architecture
def test_expected_directory_structure():
    """Verify the expected directory structure exists."""
    expected_dirs = [
        Path("myproject/commons"),
        Path("myproject/engine_a"),
        Path("myproject/engine_a/core"),
        Path("myproject/engine_a/config"),
    ]
    missing = [str(d) for d in expected_dirs if not d.exists()]
    assert len(missing) == 0, f"Missing expected directories: {missing}"
```
What architecture tests enforce:

- Layer boundaries: Commons can't import from engines
- Import hygiene: No wildcard imports
- Structural rules: Required directories exist
- Naming conventions: Test files only in tests/
- Module independence: Engines don't cross-import
Run separately with markers:
```bash
# Run only architecture tests
pytest -m architecture

# Run everything except architecture tests
pytest -m "not architecture"
```
Impact: Architecture tests caught ~15 boundary violations during development. Agents can't accidentally create circular dependencies or violate layer rules.
Test Markers: Speed vs Thoroughness¶
Organize tests by speed and scope:
```toml
# In pyproject.toml
[tool.pytest.ini_options]
markers = [
    "unit: Fast unit tests (< 100ms)",
    "integration: Slower integration tests requiring APIs",
    "architecture: Architectural boundary tests",
    "examples: Tests from documentation examples",
]
addopts = [
    "-m", "not integration",  # Skip integration tests by default
    "--ff",                   # Run failures first
    "-n", "auto",             # Parallel execution
]
```
Fast feedback for agents: run unit tests in <1 second, integration tests only in CI.
Impact: 29,157 lines of test code provide a safety net. Agents can refactor confidently, knowing tests catch regressions.
5. Pre-Commit Quality Gates: Fail Fast, Fail Local¶
Why it matters: Agents generate code fast. Quality gates catch issues before they reach CI or production.
The Essential Pre-Commit Stack¶
```yaml
# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      # 1. Format and lint
      - id: ruff-format
        name: ruff format
        entry: uv run ruff format
        language: system
        types: [python]
      - id: ruff-check
        name: ruff check
        entry: uv run ruff check --fix
        language: system
        types: [python]

      # 2. Type checking
      - id: basedpyright
        name: basedpyright
        entry: uv run basedpyright
        language: system
        types: [python]
        pass_filenames: false

      # 3. Dead code detection
      - id: vulture
        name: vulture
        entry: uv run vulture --min-confidence 80
        language: system
        types: [python]

      # 4. Security scanning
      - id: bandit
        name: bandit
        entry: uv run bandit -c pyproject.toml
        language: system
        types: [python]

      # 5. Dependency auditing
      - id: deptry
        name: deptry
        entry: uv run deptry .
        language: system
        pass_filenames: false

      # 6. Mock detection (CRITICAL)
      - id: mockbuster
        name: mockbuster
        entry: uv run mockbuster
        language: system
        types: [python]

      # 7. CLI quality assurance
      - id: cliqa
        name: cliqa
        entry: uv run cliqa analyze myapp
        language: system
        files: ^.*cli\.py$
        pass_filenames: false

      # 8. Documentation example tests
      - id: runbook-examples
        name: runbook examples
        entry: uv run pytest docs/RUNBOOK.md --examples-only
        language: system
        files: ^docs/RUNBOOK\.md$
```
Why Each Hook Matters¶
- ruff: Format and lint in one pass. Agents follow consistent style.
- basedpyright: Type checking catches type errors before runtime.
- vulture: Detects unused code. Prevents dead code accumulation.
- bandit: Security scanning. Prevents `eval()`, unsafe `yaml.load()`, etc.
- deptry: Dependency health. Catches unused imports and missing dependencies.
- mockbuster: Prevents mock usage. Enforces real implementations (see below).
- cliqa: CLI quality assurance. Validates CLI help text, examples, commands (see below).
- runbook examples: Documentation stays up-to-date.
Configuration tip: Configure tools in pyproject.toml for a single source of truth:
```toml
[tool.ruff]
line-length = 100
target-version = "py313"

[tool.ruff.lint]
select = ["E", "W", "F", "I", "B", "UP", "SIM", "PT"]
ignore = ["E501"]  # Line too long (handled by formatter)

[tool.basedpyright]
typeCheckingMode = "basic"
reportMissingImports = "error"
reportUndefinedVariable = "error"

[tool.bandit]
exclude_dirs = ["tests/", "scripts/"]
skips = ["B101", "B404", "B603"]

[tool.vulture]
min_confidence = 80
paths = ["myproject"]
```
Impact: Pre-commit hooks catch ~80% of issues before CI, giving agents instant feedback.
6. Mock Detection: Enforce Real Implementations¶
Why it matters: Mocks hide bugs. Tests with mocks pass even when real code is broken. Agents should use real implementations or test doubles.
mockbuster: Detect and Prevent Mock Usage¶
```python
# BAD: Using mocks (will be caught by mockbuster)
from unittest.mock import Mock


def test_with_mock():
    mock_api = Mock()
    mock_api.get_data.return_value = {"key": "value"}
    result = process(mock_api)
    assert result == "value"


# GOOD: Using dependency injection with real test implementation
class FakeAPI:
    """Test double that implements the real API interface."""

    def __init__(self):
        self.calls = []

    def get_data(self):
        self.calls.append("get_data")
        return {"key": "value"}


def test_with_real_implementation():
    fake_api = FakeAPI()
    result = process(fake_api)
    assert result == "value"
    assert "get_data" in fake_api.calls
```
Allowed Exceptions¶
Sometimes mocks are necessary (CLI testing, system calls). Use inline comments:
```python
import sys


def test_cli_with_system_interaction(monkeypatch):  # mockbuster: ignore - testing CLI
    """Test CLI without actually calling the system."""
    monkeypatch.setattr(sys, "exit", lambda code: None)
    # CLI test code...
```
What mockbuster enforces:
- No unittest.mock.Mock or unittest.mock.MagicMock
- No unittest.mock.patch or @patch decorators
- No pytest.monkeypatch (unless explicitly allowed)
- Forces dependency injection patterns
Impact: Zero production bugs from mocked test dependencies. Tests use real implementations, catching integration issues early.
7. CLI Quality Assurance: Validate Command Interfaces¶
Why it matters: CLI commands are the interface agents use. Broken help text or missing examples make CLIs unusable for agents.
cliqa: Automated CLI Validation¶
cliqa validates that your Typer CLI follows best practices:
```python
from pathlib import Path

import typer

# Create the app with examples in the epilog for cliqa
app = typer.Typer(
    name="myapp",
    help="My Application CLI",
    epilog="Examples:\n  myapp process path/to/data\n  myapp info path/to/data",
)


@app.command()
def process(
    data_path: Path = typer.Argument(
        ...,
        help="Path to data directory containing input files",
        exists=True,
        file_okay=False,
        dir_okay=True,
    ),
    verbose: bool = typer.Option(
        False,
        "--verbose",
        "-v",
        help="Enable verbose output",
    ),
) -> None:
    """
    Process data from input files.

    Processes all files in the data directory and generates outputs.

    Examples:
        myapp process data/my_project
        myapp process data/my_project --verbose
    """
    # Implementation...
```
What cliqa validates:

1. Help text exists: Every command and option has help text
2. Examples provided: Commands include usage examples
3. Argument descriptions: All arguments documented
4. Type hints present: All parameters have type annotations
5. Consistent naming: Commands follow kebab-case convention
Pre-commit integration:
```yaml
- id: cliqa
  name: cliqa
  entry: uv run cliqa analyze myapp
  language: system
  files: ^.*cli\.py$
  pass_filenames: false
```
Impact: 100% of CLI commands have examples and documentation. Agents can discover and use commands correctly without guessing.
CLIs as Agent Debugging Interface¶
Critical insight: CLIs aren't just for users—they're how agents explore, test, and debug your codebase.
When an agent encounters an error or needs to understand behavior, they can:
```bash
# Discover available commands
myapp --help

# Run processing on a small test case
myapp process test_data/ --verbose

# Check configuration
myapp config show

# Validate outputs
myapp validate output/

# Run health checks
myapp health
```
Why this matters:

- Reproducibility: Agents can reproduce issues in isolation
- Incremental testing: Test components independently
- State inspection: Examine system state at any point
- Hypothesis generation: Quickly test theories about bugs
Design principle: Every major component should have a CLI command. If an agent can't invoke it from the command line, they can't debug it effectively.
Example pattern: Break your pipeline into discrete CLI commands:
- `myapp process` - Full processing pipeline
- `myapp transform` - Just data transformation
- `myapp validate` - Just validation
- `myapp config` - Configuration management
- `myapp health` - System health check
This granularity lets agents isolate issues to specific stages without diving into code.
8. Modular Architecture: Enforced Separation¶
Why it matters: Agents need clear boundaries. Without them, they create spaghetti code.
Layer Architecture with Enforcement¶
```
myproject/
├── commons/          # Shared utilities (no engine imports)
│   ├── config/       # Configuration loaders
│   ├── io/           # File I/O abstraction
│   ├── errors.py     # Error collection
│   └── utils/        # Logging, helpers
├── engine_a/         # Processing engine A
│   ├── config/       # Engine configuration
│   ├── core/         # Main processing logic
│   ├── transforms/   # Data transformations
│   └── storage/      # File management
└── engine_b/         # Processing engine B
```
Enforced rules:
1. commons/ cannot import from engines
2. Engines are independent
3. Architecture tests fail if rules violated
Dependency Injection for Testability¶
Make code testable without mocks:
```python
# BAD: Hard to test without mocks
class OntologyExtractor:
    def __init__(self):
        self.agent = Agent()  # Hard dependency
        self.db = Neo4jConnection()  # Hard dependency


# GOOD: Easy to test with real test implementations
class OntologyExtractor:
    def __init__(
        self,
        agent: Agent[None, ExtractionResult],
        storage: StorageProtocol,
    ):
        self.agent = agent  # Injected
        self.storage = storage  # Injected


# Test with real implementation
class FakeStorage:
    def __init__(self):
        self.stored_items = []

    def save(self, item):
        self.stored_items.append(item)
```
Agents understand interfaces. Tests use real fakes, not mocks.
Impact: Clear architecture prevents 75% of coupling issues. Agents understand boundaries and tests enforce them.
9. Documentation as Code: Contracts and Runbooks¶
Why it matters: Agents need to understand system contracts. Comments go stale; executable contracts don't.
Contract Specifications¶
Document input/output structures in versioned files:
````markdown
# Extraction Input Contract

## File Structure

Files must follow this structure:

- `domain_name/`
  - `config.yaml` - Configuration overrides
  - `inputs/` - Input text files
    - `file1.txt`
    - `file2.md`
  - `outputs/` - Generated outputs (created by system)

## Configuration Schema

```yaml
llm:
  model_name: "gemini-2.0-flash-001"  # KnownModelName from pydantic-ai
  temperature: 0.7                    # Range: 0.0-2.0
  max_tokens: 8192                    # Minimum: 100

extraction:
  chunk_size: 4000    # Range: 100-100000
  chunk_overlap: 200  # Minimum: 0
  batch_size: 10      # Range: 1-100
```
````
Output Contract¶
Graph Structure (JSON)¶
```json
{
  "nodes": [
    {
      "id": "concept_1",
      "type": "concept",
      "properties": {
        "name": "Machine Learning",
        "description": "...",
        "concept_type": "technology"
      }
    }
  ],
  "edges": [
    {
      "source": "concept_1",
      "target": "concept_2",
      "type": "relates_to",
      "properties": {
        "relationship_type": "uses",
        "confidence": 0.85
      }
    }
  ]
}
```
Agents can read contracts and generate conforming code.
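Pydantic models can mirror the contract directly, so conformance is checked rather than hoped for. A sketch: the field set follows the JSON contract above, but the class names are assumptions:

```python
from typing import Any

from pydantic import BaseModel, Field


class GraphNode(BaseModel):
    """One node in the output graph, mirroring the JSON contract."""

    id: str = Field(..., min_length=1)
    type: str
    properties: dict[str, Any] = Field(default_factory=dict)


class GraphEdge(BaseModel):
    """One edge; source/target reference node ids."""

    source: str
    target: str
    type: str
    properties: dict[str, Any] = Field(default_factory=dict)


class GraphDocument(BaseModel):
    """Top-level structure written to disk."""

    nodes: list[GraphNode] = Field(default_factory=list)
    edges: list[GraphEdge] = Field(default_factory=list)


# Validate raw output against the contract before writing it anywhere
doc = GraphDocument.model_validate(
    {
        "nodes": [{"id": "concept_1", "type": "concept", "properties": {"name": "ML"}}],
        "edges": [],
    }
)
```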
Executable Runbooks¶
Write runbooks as executable code:
````markdown
# Processing Runbook

## Process data from source

```python
from myproject import process_data

result = await process_data(data_path="data/my_dataset")
print(f"Processed {result.items_count} items")
```

## Check processing metrics

```python
# Continuing from previous example
print(f"Files processed: {result.files_processed}")
print(f"Files failed: {len(result.failed_files)}")
print(f"Total items: {result.items_count}")
print(f"Total errors: {len(result.errors)}")
```
````
Test the runbook with `pytest --examples-only docs/RUNBOOK.md`.
Impact: Zero stale documentation. Contracts are tested, runbooks are executable.
10. Observability: Rich Logging and Progress¶
Why it matters: Agents need feedback. Good observability helps agents and humans debug.
Rich Console Logging¶
```python
from pathlib import Path

from rich.console import Console
from rich.panel import Panel
from rich.progress import BarColumn, Progress, SpinnerColumn, TextColumn

console = Console()


def log_extraction_start(domain: str, file_count: int) -> None:
    """Log extraction start with formatted output."""
    assert domain is not None, "domain must not be None"
    assert file_count > 0, "file_count must be positive"

    console.print(
        Panel.fit(
            f"[bold blue]Starting extraction[/bold blue]\n"
            f"Domain: {domain}\n"
            f"Files: {file_count}",
            border_style="blue",
        )
    )


async def process_with_progress(
    files: list[Path],
) -> list[ExtractionResult]:
    """Process files with progress bar."""
    assert files is not None, "files must not be None"
    assert len(files) > 0, "files must not be empty"

    results = []
    with Progress(
        SpinnerColumn(),
        TextColumn("[progress.description]{task.description}"),
        BarColumn(),
        TextColumn("[progress.percentage]{task.percentage:>3.0f}%"),
        console=console,
    ) as progress:
        task = progress.add_task("Processing files", total=len(files))
        for file_path in files:
            result = await extract_from_file(file_path)
            results.append(result)
            progress.update(task, advance=1)
    return results
```
Structured Metrics¶
Track metrics for debugging and optimization:
```python
import time
from dataclasses import dataclass


@dataclass
class ExtractionMetrics:
    """Metrics tracked during extraction."""

    start_time: float
    end_time: float | None = None
    files_processed: int = 0
    files_failed: int = 0
    chunks_processed: int = 0
    concepts_extracted: int = 0
    relationships_extracted: int = 0
    llm_calls: int = 0
    llm_tokens_input: int = 0
    llm_tokens_output: int = 0

    @property
    def duration_seconds(self) -> float:
        """Calculate total duration."""
        assert self.start_time > 0, "start_time must be set"
        end = self.end_time or time.time()
        duration = end - self.start_time
        assert duration >= 0, "duration must be non-negative"
        return duration

    @property
    def throughput_files_per_second(self) -> float:
        """Calculate file processing throughput."""
        assert self.files_processed >= 0, "files_processed must be non-negative"
        if self.duration_seconds == 0:
            return 0.0
        return self.files_processed / self.duration_seconds
```
Benefits:

- Beautiful progress output
- Structured metrics for analysis
- Easy debugging
- Performance optimization data
Impact: 10x faster debugging. Clear feedback helps agents and humans understand what's happening.
Putting It All Together¶
Here's how these 10 techniques work together in a real extraction pipeline:
```python
async def process_pipeline(domain_path: Path) -> PipelineOutput:
    """
    Process data pipeline with full hardening.

    Demonstrates all 10 techniques:
    1. Type safety: Pydantic models
    2. Defensive assertions: 2+ per function (NASA05)
    3. Property-based fuzzing: Hypothesis for edge cases
    4. Comprehensive testing: pytest markers, doc tests, arch tests
    5. Pre-commit gates: enforced quality
    6. Mock detection: real implementations only
    7. CLI quality: validated with cliqa
    8. Modular architecture: dependency injection
    9. Documentation: contracts and runbooks
    10. Observability: rich logging and metrics
    """
    # Defensive assertions (Technique 2)
    assert domain_path is not None, "domain_path must not be None"
    assert domain_path.exists(), f"domain_path must exist: {domain_path}"
    assert domain_path.is_dir(), f"domain_path must be directory: {domain_path}"

    # Typed, validated configuration (Technique 1)
    config = load_config(domain_path)

    # Typed state container (Technique 1)
    state = ExtractionState(
        domain_path=domain_path,
        config=config,
        errors=ErrorCollector(),
        metrics=ExtractionMetrics(start_time=time.time()),
    )

    # Observability (Technique 10)
    log_extraction_start(domain_path.name, len(state.files))

    # Modular architecture with DI (Technique 8)
    agent = create_extraction_agent(config.llm)
    storage = create_storage(domain_path)

    try:
        # Error collection instead of hard failure
        results = await process_with_error_collection(
            files=state.files,
            agent=agent,
            collector=state.errors,
        )

        # Type safety (Technique 1)
        for result in results:
            assert isinstance(result, ExtractionResult), "Must be ExtractionResult"
            state.concepts.extend(result.concepts)
            state.relationships.extend(result.relationships)

        # Save outputs (contracts from Technique 9)
        await storage.save_graph(state.concepts, state.relationships)
    finally:
        state.metrics.end_time = time.time()
        log_extraction_complete(state.metrics)

    return PipelineOutput(
        concepts=state.concepts,
        relationships=state.relationships,
        metrics=state.metrics,
        errors=state.errors.errors,
    )
```
Measuring Success: Before and After¶
Here's what changed in the Taxonomy-Ontology-Accelerator after implementing these techniques:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Agent-introduced bugs | ~15/week | ~2/week | 87% reduction |
| Edge case bugs found | ~5 (manual) | ~35 (Hypothesis) | 7x more found |
| Time to debug issues | ~2 hours | ~15 minutes | 88% faster |
| Test coverage | 45% | 62% | +37% |
| CI pass rate | 78% | 96% | +23% |
| Architecture violations | ~8/week | 0 | 100% reduction |
| Mock-related prod bugs | 2/month | 0/6 months | 100% reduction |
| CLI usage errors | ~10/month | 0 | 100% reduction |
| Production incidents | 3/month | 0/3 months | 100% reduction |
Getting Started: Your Action Plan¶
You don't need to implement all 10 techniques at once. Start with the highest-impact changes:
Week 1: Quick Wins (Critical Foundation)¶
- Add pre-commit hooks (ruff + basedpyright)
- Add NASA05 assertions (2+ per function)
- Enable pytest coverage reporting
- Add mockbuster to prevent mock usage
Week 2: Type Safety & Testing¶
- Convert config to Pydantic models
- Add type hints to public APIs
- Create Protocol classes for interfaces
- Add Hypothesis property-based tests for critical functions
Week 3: Architecture & Quality Gates¶
- Add architecture tests for layer boundaries
- Implement dependency injection in main classes
- Add cliqa for CLI validation
Week 4: Documentation & Observability¶
- Write contract specifications for inputs/outputs
- Add executable runbook examples
- Test documentation with pytest-examples
- Add Rich logging with progress bars
Conclusion¶
Making codebases ready for agentic coding isn't about preventing AI from working with your code—it's about enabling AI to work safely and effectively.
The 10 techniques described here create a multi-layered safety net that catches mistakes before they reach production:
- Type safety guides agents to correct usage
- Defensive assertions catch wrong assumptions (NASA05)
- Property-based fuzzing finds edge cases (NASA/TIGER style)
- Comprehensive testing prevents regressions
- Pre-commit gates fail fast locally
- Mock detection enforces real implementations
- CLI quality assurance validates command interfaces
- Modular architecture enforces boundaries
- Documentation as code stays current
- Observability enables debugging
These aren't theoretical ideas—they're battle-tested patterns from a production system that AI agents work with daily.
Start small. Pick one technique, implement it, measure the impact. Then add the next one.
The most impactful combo: NASA05 assertions + Hypothesis fuzzing + mockbuster gives you 80% of the value with 20% of the effort.
Your future self (and your AI coding agent) will thank you.
Want to see these techniques in action? Check out the Taxonomy-Ontology-Accelerator on GitHub.
Have questions or want to share your own hardening techniques? Reach out on GitHub.