Governance benchmark
Run the primary governance suite:
termyte bench
termyte bench --json
The suite evaluates 1,200 unique command texts through the stable,
non-executing check and YAML policy path:
- 400 expected allow;
- 400 expected warn;
- 400 expected block.
No fixture command is executed and no check log is written.
Reported metrics
- overall and per-category accuracy;
- per-decision precision and recall;
- decision confusion matrix;
- false positives and false negatives;
- false-safe rate;
- overblock rate.
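As a rough illustration of how the confusion matrix, accuracy, and per-decision precision and recall relate, here is a minimal Python sketch. It does not read Termyte's output. The `(expected, actual)` pairs, the helper names, and the sample data are all hypothetical, and it assumes the only decisions are allow, warn, and block.

```python
from collections import Counter

# Assumption: the three decisions named in the suite description.
DECISIONS = ("allow", "warn", "block")


def confusion_matrix(pairs):
    """Return matrix[expected][actual] = number of cases with that pair."""
    counts = Counter(pairs)
    return {e: {a: counts[(e, a)] for a in DECISIONS} for e in DECISIONS}


def precision_recall(matrix, decision):
    """Precision and recall for one decision, read off the matrix."""
    true_pos = matrix[decision][decision]
    predicted = sum(matrix[e][decision] for e in DECISIONS)  # column total
    expected = sum(matrix[decision].values())                # row total
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / expected if expected else 0.0
    return precision, recall


# Hypothetical (expected, actual) results, not real benchmark data.
sample_pairs = [
    ("allow", "allow"), ("allow", "allow"), ("allow", "warn"),
    ("warn", "warn"), ("warn", "allow"),
    ("block", "block"), ("block", "block"), ("block", "warn"),
]

matrix = confusion_matrix(sample_pairs)
accuracy = sum(matrix[d][d] for d in DECISIONS) / len(sample_pairs)

print(accuracy)  # 0.625 (5 of 8 cases match)
for d in DECISIONS:
    print(d, precision_recall(matrix, d))
```

Rows are expected decisions and columns are actual decisions, so recall comes from a row total and precision from a column total.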
False-safe means a case expected to block received a weaker decision (warn or
allow), or a case expected to warn was allowed. Overblock means a case expected
to allow received a stricter decision (warn or block).
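The two definitions above can be sketched in Python as follows. This is an illustration, not Termyte's implementation. The severity ordering, the function names, and the sample data are hypothetical. The denominators are also an assumption, because the text above defines the events but not what each rate is divided by. Here false-safe is measured over cases expected to warn or block, and overblock over cases expected to allow.

```python
# Assumption: decisions are ordered allow < warn < block by strictness.
SEVERITY = {"allow": 0, "warn": 1, "block": 2}


def false_safe_rate(pairs):
    """Share of expected warn/block cases that got a weaker decision."""
    risky = [(e, a) for e, a in pairs if e in ("warn", "block")]
    if not risky:
        return 0.0
    misses = sum(1 for e, a in risky if SEVERITY[a] < SEVERITY[e])
    return misses / len(risky)


def overblock_rate(pairs):
    """Share of expected allow cases that got warn or block."""
    allowed = [a for e, a in pairs if e == "allow"]
    if not allowed:
        return 0.0
    stricter = sum(1 for a in allowed if SEVERITY[a] > SEVERITY["allow"])
    return stricter / len(allowed)


# Hypothetical (expected, actual) results, not real benchmark data.
results = [
    ("allow", "allow"),
    ("allow", "block"),   # overblock
    ("warn", "allow"),    # false-safe
    ("warn", "block"),    # wrong, but neither false-safe nor overblock
    ("block", "warn"),    # false-safe
    ("block", "block"),
]

print(false_safe_rate(results))  # 0.5 (2 of 4 warn/block cases)
print(overblock_rate(results))   # 0.5 (1 of 2 allow cases)
```

Note that an expected warn that was blocked counts as an ordinary misclassification in this sketch: it is stricter than expected, but the overblock definition covers only expected allows.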
Legacy compatibility suite
The legacy suite contains 230 cases and evaluates the older SQLite runtime
inspection path. Some cases permit multiple acceptable decisions.
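One way to score cases that permit multiple acceptable decisions is to treat each label as a set and count a result as correct when the actual decision is a member. The sketch below assumes that representation. The legacy fixture format is not described here, so the data layout, the names, and the sample cases are hypothetical.

```python
def legacy_accuracy(cases):
    """Fraction of cases whose actual decision is in the acceptable set.

    Each case is (acceptable_decisions, actual_decision).
    """
    if not cases:
        return 0.0
    correct = sum(1 for acceptable, actual in cases if actual in acceptable)
    return correct / len(cases)


# Hypothetical cases, not taken from the legacy suite.
legacy_cases = [
    ({"allow"}, "allow"),          # single acceptable decision, matched
    ({"warn", "block"}, "block"),  # either warn or block is acceptable
    ({"warn", "block"}, "allow"),  # outside the acceptable set
    ({"block"}, "block"),
]

print(legacy_accuracy(legacy_cases))  # 0.75 (3 of 4 cases)
```

Under set-valued labels a case has no single expected decision, so a single-label confusion matrix is not directly defined for it.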
Do not combine the governance and legacy results. They use different decision
engines and labeling methods.
Benchmark results validate only the included labeled fixtures. They do not
prove complete command coverage, sandbox isolation, or protection for commands
that bypass Termyte.