Governance benchmark

Run the primary governance suite:
termyte bench
termyte bench --json
The suite evaluates 1,200 unique command texts through the stable, non-executing check and YAML policy path, split evenly across the three expected decisions:
  • 400 expected allow;
  • 400 expected warn;
  • 400 expected block.
No fixture command is executed and no check log is written.

Reported metrics

  • overall and per-category accuracy;
  • per-decision precision and recall;
  • decision confusion matrix;
  • false positives and false negatives;
  • false-safe rate;
  • overblock rate.
False-safe means an expected block was classified below block (as warn or allow), or an expected warn was classified as allow. Overblock means an expected allow received a stricter decision (warn or block).
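The definitions above can be sketched in a few lines of Python. This is an illustrative sketch only, not Termyte's implementation: the function names, the (expected, actual) pair format, and the sample data are hypothetical. The denominators are also an assumption, since the text does not state them: here the false-safe rate is taken over cases expected to warn or block, and the overblock rate over cases expected to allow.

```python
from collections import Counter

# Decisions ordered from least to most strict.
SEVERITY = {"allow": 0, "warn": 1, "block": 2}


def governance_rates(pairs):
    """Compute a confusion matrix plus false-safe and overblock rates.

    pairs: iterable of (expected, actual) decision strings.
    Denominators are an assumption (see the lead-in paragraph).
    """
    confusion = Counter(pairs)
    # False-safe: the actual decision is less strict than the expected one.
    false_safe = sum(
        n for (exp, act), n in confusion.items()
        if SEVERITY[act] < SEVERITY[exp]
    )
    # Overblock: an expected allow received a stricter decision.
    overblock = sum(
        n for (exp, act), n in confusion.items()
        if exp == "allow" and SEVERITY[act] > SEVERITY[exp]
    )
    risky_total = sum(n for (exp, _), n in confusion.items() if exp != "allow")
    allow_total = sum(n for (exp, _), n in confusion.items() if exp == "allow")
    return {
        "confusion": confusion,
        "false_safe_rate": false_safe / risky_total if risky_total else 0.0,
        "overblock_rate": overblock / allow_total if allow_total else 0.0,
    }


def precision_recall(confusion, decision):
    """Per-decision precision and recall from the confusion Counter."""
    true_pos = confusion[(decision, decision)]
    predicted = sum(n for (_, act), n in confusion.items() if act == decision)
    expected = sum(n for (exp, _), n in confusion.items() if exp == decision)
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / expected if expected else 0.0
    return precision, recall


# Hypothetical sample: (expected, actual) pairs, not real benchmark output.
example = [
    ("block", "block"), ("block", "warn"),
    ("warn", "allow"), ("warn", "warn"),
    ("allow", "allow"), ("allow", "warn"),
    ("allow", "block"), ("allow", "allow"),
]
rates = governance_rates(example)
```

In this sample, two of the four warn-or-block cases were classified too leniently and two of the four allow cases too strictly, so both rates come out to 0.5 under the assumed denominators.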

Legacy compatibility suite

termyte bench --legacy
The legacy suite contains 230 cases and evaluates the older SQLite runtime inspection path. Some cases permit multiple acceptable decisions. Do not combine the governance and legacy results, because the two suites use different decision engines and labeling methods.
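One way to see why the two result sets are not comparable is to look at how a case with multiple acceptable decisions might be scored. The sketch below is a hypothetical illustration, not Termyte's scoring code: the function names and the case format are assumptions. A case counts as correct when the actual decision falls anywhere in its acceptable set, which is a looser criterion than matching a single expected label.

```python
def case_passes(actual, acceptable):
    """A case passes if the actual decision is any of the acceptable ones."""
    return actual in acceptable


def set_based_accuracy(cases):
    """cases: iterable of (actual_decision, set_of_acceptable_decisions)."""
    cases = list(cases)
    if not cases:
        return 0.0
    passed = sum(1 for actual, acceptable in cases if case_passes(actual, acceptable))
    return passed / len(cases)


# Hypothetical cases, not taken from the legacy suite.
legacy_like = [
    ("warn", {"warn", "block"}),   # passes: either strict decision is acceptable
    ("allow", {"warn", "block"}),  # fails
    ("block", {"block"}),          # passes
    ("allow", {"allow"}),          # passes
]
accuracy = set_based_accuracy(legacy_like)
```

Here three of the four hypothetical cases pass, giving 0.75. The same engine output scored against single expected labels could yield a different figure, so averaging or summing across the two suites would mix incompatible measures.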
Benchmark results validate only the included labeled fixtures. They do not prove complete command coverage, sandbox isolation, or protection for commands that bypass Termyte.