Testing & meta-testing

Two layers, kept separate

| Layer | What it tests | When it runs | Speed |
|---|---|---|---|
| Unit | `bfev` Python modules (config, logging, crosscheck, contracts, audit) | `pytest -q` on every change | < 5 s |
| Contract | Agents refuse correctly on broken inputs | `bfev test <agent>` | seconds |
| Golden (replay) | Agent post-processing reproduces a frozen output dir from a frozen input dir | `bfev test <agent>` | seconds |
| Golden (live) | Same, but actually invokes Claude | `bfev test <agent> --live` | minutes |

Unit tests

Already in tests/:

pytest -q

Run before every commit.

Contract tests

A contract test wires up an Inputs/Outputs pair (typically pointing at shared golden fixtures) and asserts the contract's check_postconditions returns the expected pass/fail. One file per case under tests/agents/<agent>/contract/<case>.yaml:

setup:
  inputs:
    filled_xlsx:     golden/diesel-only/in/filled.xlsx
    client_yaml:     golden/diesel-only/in/client.yaml
    categories_json: golden/diesel-only/in/categories.json
  outputs:
    aggregates_json: golden/diesel-only/expected/MISSING_aggregates.json
    results_json:    golden/diesel-only/in/results.json
    crosscheck_xlsx: golden/diesel-only/expected/crosscheck.xlsx
    meta_yaml:       golden/diesel-only/expected/meta.yaml
    audit_log:       golden/diesel-only/expected/audit.log

expect:
  contract: fail                     # or "pass"
  failure_codes: ["missing_output"]  # required subset of failure codes

Paths resolve against the agent's fixture root (tests/agents/<agent>/), so contract tests can reuse golden fixtures instead of duplicating them.

Contract tests run by default against the replay harness (fast, deterministic). The contract check_postconditions() catches most violations without ever calling Claude.
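
As a rough illustration of how an expect: block could be evaluated, the sketch below checks a set of declared output paths and compares the resulting failure codes with the expectation. The helper names, the way "missing_output" is detected and the list-of-codes return shape are assumptions made for this example; the real check_postconditions in bfev.contracts may look different.

```python
from pathlib import Path


def check_outputs_exist(root: Path, outputs: dict[str, str]) -> list[str]:
    """Hypothetical stand-in for a postcondition check.

    Returns ["missing_output"] if any declared output file is absent,
    otherwise an empty list.
    """
    missing = [name for name, rel in outputs.items() if not (root / rel).is_file()]
    return ["missing_output"] if missing else []


def matches_expectation(failure_codes: list[str], expect: dict) -> bool:
    """Compare observed failure codes with an `expect:` block.

    `failure_codes` in the expectation is treated as a required subset:
    extra observed codes are tolerated, missing ones are not.
    """
    observed_status = "fail" if failure_codes else "pass"
    if observed_status != expect["contract"]:
        return False
    required = set(expect.get("failure_codes", []))
    return required <= set(failure_codes)
```

With the case shown above, the deliberately wrong aggregates_json path would yield the "missing_output" code, which satisfies `contract: fail` together with the required subset.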

Golden tests

A golden case is a frozen (input dir, expected output dir) pair under tests/agents/<agent>/golden/<case>/:

tests/agents/calculate/golden/diesel-only-toy/
  in/
    filled.xlsx
    client.yaml
    categories.json
  expected/
    aggregates.json
    results.json
    crosscheck.xlsx
  assertions.yaml          # tolerances + required claims

assertions.yaml example (real shape — see tests/agents/calculate/golden/diesel-only/assertions.yaml):

builder: bfev.crosscheck:build           # any pkg.mod:fn callable
builder_args:
  results_json: in/results.json
  out_xlsx:     out/crosscheck.xlsx

tol_pct: 0.5

compare:
  - actual: out/crosscheck.xlsx
    expected: expected/crosscheck.xlsx
    mode: xlsx_structure                  # or json_diff, or tree

contract:                                 # optional: re-validate the assembled out dir
  module: bfev.contracts:CalculateContract
  inputs:  {...}
  outputs: {...}
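
How tol_pct is applied is not spelled out here. One plausible reading, sketched below, treats it as a relative tolerance in percent when numeric values are compared, for example in a json_diff comparison. The helper names are hypothetical and the actual comparison logic in bfev may differ.

```python
def within_tol(actual: float, expected: float, tol_pct: float) -> bool:
    """True if `actual` lies within `tol_pct` percent of `expected` (assumed semantics)."""
    return abs(actual - expected) <= abs(expected) * tol_pct / 100.0


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def diff_values(actual, expected, tol_pct: float, path: str = "") -> list[str]:
    """Walk two parsed JSON values and list the paths whose values disagree."""
    if isinstance(actual, dict) and isinstance(expected, dict):
        problems = []
        for key in sorted(set(actual) | set(expected)):
            if key not in actual or key not in expected:
                problems.append(f"{path}/{key}: present on one side only")
            else:
                problems += diff_values(actual[key], expected[key], tol_pct, f"{path}/{key}")
        return problems
    if isinstance(actual, list) and isinstance(expected, list):
        if len(actual) != len(expected):
            return [f"{path}: length {len(actual)} != {len(expected)}"]
        problems = []
        for i, (a, e) in enumerate(zip(actual, expected)):
            problems += diff_values(a, e, tol_pct, f"{path}/{i}")
        return problems
    if _is_number(actual) and _is_number(expected):
        return [] if within_tol(actual, expected, tol_pct) else [f"{path}: {actual} != {expected}"]
    return [] if actual == expected else [f"{path}: {actual!r} != {expected!r}"]
```

Under this reading, with tol_pct: 0.5 a value of 100.4 would match an expected 100.0, while 100.6 would be reported as a difference.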

The harness:

1. Copies in/ and expected/ into a temp dir, then creates an empty out/.
2. Invokes the deterministic builder declared in builder: (no Claude call).
3. Runs every compare: entry between its actual and expected paths.
4. Optionally re-runs the agent's contract check_postconditions against the assembled directory; this is what catches the live-formula invariant.
5. Reports pass/fail per case; with --accept, copies actual → expected.
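
The replay steps above can be sketched roughly as follows. This is an illustration under assumptions, not the actual harness: resolve_builder and run_case are hypothetical names, builder arguments are assumed to be passed as keyword arguments with paths resolved against the temp dir, and a byte-for-byte comparison stands in for the real compare: modes. The optional contract re-check is omitted.

```python
import filecmp
import importlib
import shutil
import tempfile
from pathlib import Path


def resolve_builder(spec: str):
    """Turn a "pkg.mod:fn" string into the callable it names."""
    module_name, _, attr = spec.partition(":")
    return getattr(importlib.import_module(module_name), attr)


def run_case(case_dir: Path, builder, builder_args: dict, compare: list, accept: bool = False) -> bool:
    """Replay one golden case; returns True if every comparison passes."""
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)
        # Copy the frozen inputs and expected outputs, then create an empty out/.
        shutil.copytree(case_dir / "in", work / "in")
        shutil.copytree(case_dir / "expected", work / "expected")
        (work / "out").mkdir()

        # Run the deterministic builder (assumed to accept paths as keyword arguments).
        builder(**{name: work / rel for name, rel in builder_args.items()})

        # Compare each actual/expected pair; byte equality is only a placeholder.
        ok = all(
            filecmp.cmp(work / c["actual"], work / c["expected"], shallow=False)
            for c in compare
        )

        # With accept, promote the fresh outputs to the case's expected files.
        if accept and not ok:
            for c in compare:
                shutil.copy2(work / c["actual"], case_dir / c["expected"])
        return ok
```

Working in a temp dir keeps the checked-in fixtures untouched unless accept is set, which is why the resulting git diff is a reliable record of what changed.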

Updating a golden after an intentional change

bfev test calculate --case diesel-only-toy --accept

This overwrites expected/ with the new output. Review the diff in git before committing — that diff is the agent-prompt review.

Adversarial probes

A handful of cases that try to bait the agent into breaking the rules you care about. Cheap insurance:

tests/agents/report/adversarial/
  bait-unsourced-entity.yaml       # injects "BAD" into the source PDF
  bait-leak-phrase-v2.yaml         # source PDF mentions "v2 recalibrage"
  bait-knowbox.yaml                # asks for didactic side-content

These run on every agent-prompt change. Failures are blockers.
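
One way such a probe could be scored, assuming each probe lists phrases that must never appear in the agent's output, is to scan the generated text for them. The find_leaks helper and the idea of a per-probe list of forbidden phrases are hypothetical; the actual probe format under tests/agents/report/adversarial/ is not shown here.

```python
def find_leaks(output_text: str, forbidden: list[str]) -> list[str]:
    """Return the forbidden phrases that occur in the output (case-insensitive)."""
    lowered = output_text.lower()
    return [phrase for phrase in forbidden if phrase.lower() in lowered]


def probe_passes(output_text: str, forbidden: list[str]) -> bool:
    """A probe passes only if none of the baited phrases made it into the output."""
    return not find_leaks(output_text, forbidden)
```

Plain substring matching is deliberately crude: it can flag harmless text that happens to contain a baited phrase, which is usually the right bias for a blocking check.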

Live vs replay

Replay (default): runs the deterministic post-processing layer of an agent (Quarto render, contract checks, audit). Does not call Claude. Fast. Catches: contract violations, audit regressions, formula-vs-literal drift, packaging breakage.
Live (--live): actually invokes Claude. Use it after touching SKILL.md or an agent prompt. Catches: rule-following regressions, prompt-induced hallucinations.

In CI: replay only. Live runs are manual.