Testing & meta-testing

Two layers, kept separate

| Layer | What it tests | When it runs | Speed |
|---|---|---|---|
| Unit | bfev Python modules (config, logging, crosscheck, contracts, audit) | pytest -q on every change | < 5 s |
| Contract | Agents refuse correctly on broken inputs | bfev test <agent> | seconds |
| Golden (replay) | Agent post-processing reproduces a frozen output dir from a frozen input dir | bfev test <agent> | seconds |
| Golden (live) | Same, but actually invokes Claude | bfev test <agent> --live | minutes |

Unit tests

Already in tests/:

pytest -q

Run before every commit.

Contract tests

A contract test wires up an Inputs/Outputs pair (typically pointing at shared golden fixtures) and asserts the contract's check_postconditions returns the expected pass/fail. One file per case under tests/agents/<agent>/contract/<case>.yaml:

setup:
  inputs:
    filled_xlsx:     golden/diesel-only/in/filled.xlsx
    client_yaml:     golden/diesel-only/in/client.yaml
    categories_json: golden/diesel-only/in/categories.json
  outputs:
    aggregates_json: golden/diesel-only/expected/MISSING_aggregates.json
    results_json:    golden/diesel-only/in/results.json
    crosscheck_xlsx: golden/diesel-only/expected/crosscheck.xlsx
    meta_yaml:       golden/diesel-only/expected/meta.yaml
    audit_log:       golden/diesel-only/expected/audit.log

expect:
  contract: fail                     # or "pass"
  failure_codes: ["missing_output"]  # required subset of failure codes

Paths resolve against the agent's fixture root (tests/agents/<agent>/), so contract tests can reuse the golden fixtures instead of duplicating them.

Contract tests run against the replay harness by default (fast, deterministic). The contract's check_postconditions() catches most violations without ever calling Claude.
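
The expect: block can be read as a small predicate over the contract's result. A minimal sketch of that check follows, assuming check_postconditions yields a pass flag plus a list of failure codes. ContractResult, resolve and matches_expectation are hypothetical names for illustration, not part of bfev, and "stale_meta" is an invented failure code.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ContractResult:
    """Hypothetical stand-in for whatever check_postconditions returns."""
    passed: bool
    failure_codes: list[str] = field(default_factory=list)


def resolve(fixture_root: Path, rel: str) -> Path:
    # Case paths are relative to the agent's fixture root.
    return fixture_root / rel


def matches_expectation(expect: dict, result: ContractResult) -> bool:
    # "contract: pass" or "contract: fail" must match the outcome exactly.
    want_pass = expect["contract"] == "pass"
    if want_pass != result.passed:
        return False
    # failure_codes is a required subset: extra codes are tolerated.
    required = set(expect.get("failure_codes", []))
    return required <= set(result.failure_codes)


expect = {"contract": "fail", "failure_codes": ["missing_output"]}
# "stale_meta" is an invented extra code; it does not break the subset check.
result = ContractResult(passed=False, failure_codes=["missing_output", "stale_meta"])
ok = matches_expectation(expect, result)
path = resolve(Path("tests/agents/calculate"), "golden/diesel-only/in/client.yaml")
```

Treating failure_codes as a subset keeps a case stable when the contract later reports additional codes for the same broken fixture.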

Golden tests

A golden case is a frozen (input dir, expected output dir) pair under tests/agents/<agent>/golden/<case>/:

tests/agents/calculate/golden/diesel-only-toy/
  in/
    filled.xlsx
    client.yaml
    categories.json
  expected/
    aggregates.json
    results.json
    crosscheck.xlsx
  assertions.yaml          # tolerances + required claims

assertions.yaml example (real shape — see tests/agents/calculate/golden/diesel-only/assertions.yaml):

builder: bfev.crosscheck:build           # any pkg.mod:fn callable
builder_args:
  results_json: in/results.json
  out_xlsx:     out/crosscheck.xlsx

tol_pct: 0.5

compare:
  - actual: out/crosscheck.xlsx
    expected: expected/crosscheck.xlsx
    mode: xlsx_structure                  # or json_diff, or tree

contract:                                 # optional: re-validate the assembled out dir
  module: bfev.contracts:CalculateContract
  inputs:  {...}
  outputs: {...}

The harness:

  1. Copies in/ and expected/ into a temp dir, then creates an empty out/.
  2. Invokes the deterministic builder declared in builder: (no Claude call).
  3. Runs every compare: entry, checking each actual path against its expected path.
  4. Optionally re-runs the agent's contract check_postconditions against the assembled directory. This is the step that catches violations of the live-formula invariant.
  5. Reports pass/fail per case; with --accept, copies actual over expected.
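
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not bfev's harness. run_case and load_callable are hypothetical names, builder_args are assumed to be passed as keyword paths rooted in the temp dir, byte equality stands in for the real compare modes, and the optional contract step is omitted.

```python
import importlib
import shutil
import tempfile
from pathlib import Path


def load_callable(spec: str):
    """Resolve a "pkg.mod:fn" string to the callable it names."""
    module_name, _, attr = spec.partition(":")
    return getattr(importlib.import_module(module_name), attr)


def run_case(case_dir: Path, builder, builder_args: dict, compare: list,
             accept: bool = False) -> bool:
    """Replay one golden case; returns True when every comparison matches."""
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)
        # Step 1: copy the frozen dirs and create an empty out/.
        shutil.copytree(case_dir / "in", work / "in")
        shutil.copytree(case_dir / "expected", work / "expected")
        (work / "out").mkdir()
        # Step 2: call the deterministic builder with paths rooted in the temp dir
        # (assumption: builder_args map keyword names to relative paths).
        builder(**{name: work / rel for name, rel in builder_args.items()})
        # Step 3: compare each actual/expected pair. Byte equality is a
        # placeholder for the xlsx_structure / json_diff / tree modes.
        passed = all(
            (work / c["actual"]).read_bytes() == (work / c["expected"]).read_bytes()
            for c in compare
        )
        # Step 5: with accept, promote the new outputs into the frozen expected/.
        if accept and not passed:
            for c in compare:
                shutil.copyfile(work / c["actual"], case_dir / c["expected"])
        return passed
```

A builder string such as bfev.crosscheck:build would be resolved with load_callable before being handed to run_case. Working in a temp dir keeps the frozen fixtures untouched unless accept is set.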

Updating a golden after an intentional change

bfev test calculate --case diesel-only-toy --accept

This overwrites expected/ with the new output. Review the diff in git before committing: that diff is how the agent-prompt change gets reviewed.

Adversarial probes

A handful of cases that try to bait the agent into breaking the rules you care about. Cheap insurance:

tests/agents/report/adversarial/
  bait-unsourced-entity.yaml       # injects "BAD" into the source PDF
  bait-leak-phrase-v2.yaml         # source PDF mentions "v2 recalibrage"
  bait-knowbox.yaml                # asks for didactic side-content

These run on every agent-prompt change. Failures are blockers.
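
A probe of this kind often reduces to checking that certain strings never reach the agent's output. A minimal sketch, assuming each probe carries a list of forbidden phrases; find_leaks is a hypothetical helper, and the actual probe YAML schema is not shown here.

```python
def find_leaks(output_text: str, forbidden: list[str]) -> list[str]:
    """Return the forbidden phrases found in the output, ignoring case."""
    lowered = output_text.casefold()
    return [phrase for phrase in forbidden if phrase.casefold() in lowered]


# Toy report text: the bait phrase from the source PDF has leaked through.
report = "Totals were updated after the V2 recalibrage."
leaks = find_leaks(report, ["v2 recalibrage", "internal draft"])
```

An empty result means the probe passes; any returned phrase would block the prompt change.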

Live vs replay

CI runs replay only. Live runs (bfev test <agent> --live) are manual.