| Layer | What it tests | When it runs | Speed |
|---|---|---|---|
| Unit | bfev Python modules (config, logging, crosscheck, contracts, audit) |
pytest -q on every change |
< 5 s |
| Contract | Agents refuse correctly on broken inputs | bfev test <agent> |
seconds |
| Golden (replay) | Agent post-processing reproduces a frozen output dir from a frozen input dir | bfev test <agent> |
seconds |
| Golden (live) | Same, but actually invokes Claude | bfev test <agent> --live |
minutes |
Already in tests/:
pytest -q
Run before every commit.
A contract test wires up an Inputs/Outputs pair (typically pointing at
shared golden fixtures) and asserts the contract's check_postconditions
returns the expected pass/fail. One file per case under
tests/agents/<agent>/contract/<case>.yaml:
setup:
inputs:
filled_xlsx: golden/diesel-only/in/filled.xlsx
client_yaml: golden/diesel-only/in/client.yaml
categories_json: golden/diesel-only/in/categories.json
outputs:
aggregates_json: golden/diesel-only/expected/MISSING_aggregates.json
results_json: golden/diesel-only/in/results.json
crosscheck_xlsx: golden/diesel-only/expected/crosscheck.xlsx
meta_yaml: golden/diesel-only/expected/meta.yaml
audit_log: golden/diesel-only/expected/audit.log
expect:
contract: fail # or "pass"
failure_codes: ["missing_output"] # required subset of failure codes
Paths resolve against the agent's fixture root (tests/agents/<agent>/),
so contract tests can reuse golden fixtures DRY-ly.
Contract tests run by default against the replay harness (fast,
deterministic). The contract check_postconditions() catches most violations
without ever calling Claude.
A golden case is a frozen (input dir, expected output dir) pair under
tests/agents/<agent>/golden/<case>/:
tests/agents/calculate/golden/diesel-only-toy/
in/
filled.xlsx
client.yaml
categories.json
expected/
aggregates.json
results.json
crosscheck.xlsx
assertions.yaml # tolerances + required claims
assertions.yaml example (real shape — see
tests/agents/calculate/golden/diesel-only/assertions.yaml):
builder: bfev.crosscheck:build # any pkg.mod:fn callable
builder_args:
results_json: in/results.json
out_xlsx: out/crosscheck.xlsx
tol_pct: 0.5
compare:
- actual: out/crosscheck.xlsx
expected: expected/crosscheck.xlsx
mode: xlsx_structure # or json_diff, or tree
contract: # optional: re-validate the assembled out dir
module: bfev.contracts:CalculateContract
inputs: {...}
outputs: {...}
The harness:
in/ and expected/ into a temp dir, then mkdir out/.builder: (no Claude call).compare: entry between actual and expected paths.check_postconditions against the
assembled directory — this is what catches the live-formula invariant.--accept, copies actual → expected.bfev test calculate --case diesel-only-toy --accept
This overwrites expected/ with the new output. Review the diff in git
before committing — that diff is the agent-prompt review.
A handful of cases that try to bait the rules you care about. Cheap insurance:
tests/agents/report/adversarial/
bait-unsourced-entity.yaml # injects "BAD" into the source PDF
bait-leak-phrase-v2.yaml # source PDF mentions "v2 recalibrage"
bait-knowbox.yaml # asks for didactic side-content
These run on every agent-prompt change. Failures are blockers.
--live): actually invokes Claude. Used after touching SKILL.md
or an agent prompt. Caught: rule-following regressions, prompt-induced
hallucinations.In CI: replay only. Live runs are manual.