Benchmark Runbook
How to run benchmarks, interpret reports, and handle overrides
Phase 2.4 establishes the benchmark infrastructure for regression prevention. This runbook covers how to run benchmarks, interpret reports, and handle change-control overrides.
Overview
The benchmark harness lives in scripts/benchmark. It provides:
- Shared schemas (@nous/shared): BenchmarkSpec, RunRecord, EvidenceBundle, ScoreReport
- AgentAdapter contract: All target agents integrate through the same interface
- Tier 0 (PR gate): Deterministic subset using mock adapter; blocks merge on hard-gate failure
- Tier 1 (nightly): Full benchmark families with trend history
- Tier 2 (release): Release report contract and hard-stop conditions
Running Benchmarks
Tier 0 — PR Gate
Runs on every PR. Deterministic; uses mock adapter.
    pnpm run benchmark:tier0
Or run the benchmark package tests directly:
    pnpm --filter nous-benchmark run test
Memory-quality retrieval (Phase 4.2 / Phase 8.3 runtime): The retrieval benchmark tests relevance (semantically similar results rank higher), policy-conformance (denial returns policyDenial), and deterministic runtime metadata (budgetTelemetry, decision, truncation reason, tie-break behavior). These tests depend on the built retrieval stack. If retrieval tests fail with an unexpected response shape or stale package output, rebuild @nous/shared, @nous/memory-retrieval, and @nous/memory-access first.
Memory-quality distillation (Phase 4.3): The distillation benchmark tests clustering consistency (same input → same cluster keys), confidence formula determinism, and provenance completeness. These tests depend on @nous/memory-distillation. AA-004 traceability: distillation-quality regression bar.
Tier 1 — Nightly
    pnpm --filter nous-benchmark run test
Tier 2 — Release Gate
    pnpm --filter nous-benchmark run test
(Release report generation is wired in the harness; full CI integration is phased.)
SkillBench Admission Evidence (Phase 7.5)
Phase 7.5 adds skill-admission benchmark helpers used to construct promotion-ready evidence bundles and detect fixed-model drift.
Helper surface (benchmark package):
- detectFixedModelDrift(runs, modelProfileLocked)
- buildSkillAttributionEvidenceBundle(input)
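To illustrate the idea behind fixed-model drift detection, here is a minimal sketch. It assumes each run record carries a `model_profile` string and that drift means any run whose profile differs from the locked one. The function name `detectDriftSketch`, the `RunRecordLike` shape, and the return type are hypothetical; the real `detectFixedModelDrift` signature and result shape are defined by the benchmark package and may differ.

```typescript
// Hypothetical shape; the real RunRecord schema lives in @nous/shared.
interface RunRecordLike {
  run_id: string;
  model_profile: string; // assumed field name
}

// Hypothetical result shape for this sketch only.
interface DriftResult {
  drift_detected: boolean;
  drifted_run_ids: string[];
}

// Flags every run whose model profile differs from the locked profile.
function detectDriftSketch(
  runs: RunRecordLike[],
  modelProfileLocked: string,
): DriftResult {
  const drifted = runs
    .filter((run) => run.model_profile !== modelProfileLocked)
    .map((run) => run.run_id);
  return { drift_detected: drifted.length > 0, drifted_run_ids: drifted };
}
```

Returning the offending run IDs alongside the boolean makes it easier to trace which runs invalidated the evidence.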
Required Evidence Fields
Admission-oriented SkillBench evidence must include:
- benchmark_pack_ref
- model_profile_locked
- baseline_revision_ref
- candidate_revision_ref
- seed_set_ref
- run_record_refs (non-empty)
- score_report_refs (non-empty)
- trace_bundle_refs (non-empty)
- drift_detected
Fail-Closed Admission Implications
- Any fixed-model drift (drift_detected: true) invalidates benchmark evidence for promotion.
- Missing required refs invalidate benchmark evidence for promotion.
- Admission flow remains pending_cortex until benchmark evidence, attribution thesis, and contract validation are all complete.
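The fail-closed rules above can be sketched as a validator that reports every reason a bundle is unusable for promotion. This is an illustration only: the field names come from the required-fields list, but their types (strings and string arrays) and the helper name `evidenceProblems` are assumptions, not the package's actual schema or API.

```typescript
// Assumed types for illustration; the real schema is defined in @nous/shared.
interface SkillBenchEvidence {
  benchmark_pack_ref?: string;
  model_profile_locked?: string;
  baseline_revision_ref?: string;
  candidate_revision_ref?: string;
  seed_set_ref?: string;
  run_record_refs?: string[];
  score_report_refs?: string[];
  trace_bundle_refs?: string[];
  drift_detected?: boolean;
}

const SCALAR_REFS = [
  "benchmark_pack_ref",
  "model_profile_locked",
  "baseline_revision_ref",
  "candidate_revision_ref",
  "seed_set_ref",
] as const;

const LIST_REFS = [
  "run_record_refs",
  "score_report_refs",
  "trace_bundle_refs",
] as const;

// Hypothetical helper: returns an empty array only when the evidence is usable.
// Anything missing, empty, or drifted produces a problem (fail closed).
function evidenceProblems(evidence: SkillBenchEvidence): string[] {
  const problems: string[] = [];
  for (const key of SCALAR_REFS) {
    if (!evidence[key]) problems.push(`missing ${key}`);
  }
  for (const key of LIST_REFS) {
    const refs = evidence[key];
    if (!refs || refs.length === 0) problems.push(`empty ${key}`);
  }
  if (evidence.drift_detected === undefined) {
    problems.push("missing drift_detected");
  } else if (evidence.drift_detected) {
    problems.push("fixed-model drift detected");
  }
  return problems;
}
```

Treating an absent `drift_detected` as a problem, rather than defaulting it to false, keeps the check fail-closed.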
Report Interpretation
ScoreReport
Each run produces a ScoreReport with:
- gate_outcome: pass | fail | blocked
- hard_gate_violations: List of violation codes when gate fails
- metrics: Family-specific metrics (e.g., time_to_success_ms, intervention_events)
Global Hard Gates
| Gate | Violation | Meaning |
|---|---|---|
| Evidence linkage | missing_evidence_linkage | Run has no evidence_bundle_ref |
| Evidence mismatch | evidence_bundle_ref_mismatch | Artifact bundle ref doesn't match run record |
| Side effects | unauthorized_critical_side_effects | Unauthorized tool/filesystem/network/memory actions |
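As a rough sketch of how the three global hard gates in the table might be evaluated for one run: the violation codes are taken from the table, while the `RunLike` and `BundleLike` shapes, the `bundle_ref` and `unauthorized_actions` field names, and the function `checkGlobalHardGates` are hypothetical stand-ins for whatever the harness actually uses.

```typescript
type HardGateViolation =
  | "missing_evidence_linkage"
  | "evidence_bundle_ref_mismatch"
  | "unauthorized_critical_side_effects";

// Hypothetical shapes for illustration only.
interface RunLike {
  evidence_bundle_ref?: string;
}

interface BundleLike {
  bundle_ref: string;
  unauthorized_actions: string[]; // assumed: tool/filesystem/network/memory actions
}

// Collects every violated gate instead of stopping at the first one.
function checkGlobalHardGates(
  run: RunLike,
  bundle: BundleLike | undefined,
): HardGateViolation[] {
  const violations: HardGateViolation[] = [];
  if (!run.evidence_bundle_ref) {
    violations.push("missing_evidence_linkage");
  } else if (bundle && bundle.bundle_ref !== run.evidence_bundle_ref) {
    violations.push("evidence_bundle_ref_mismatch");
  }
  if (bundle && bundle.unauthorized_actions.length > 0) {
    violations.push("unauthorized_critical_side_effects");
  }
  return violations;
}
```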
Gate Outcome
- pass: All hard gates satisfied; run is trendable
- fail: One or more hard gates violated; run is not trendable
- blocked: Adapter returned capability_mismatch or similar; run skipped
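A consumer of these reports might branch on the outcome as in the sketch below. The `ScoreReportLike` type mirrors the three documented fields, but the metric value type (numbers) and the helpers `isTrendable` and `summarize` are assumptions for illustration. Treating a blocked run as not trendable is also an assumption, based on the run being skipped.

```typescript
type GateOutcome = "pass" | "fail" | "blocked";

// Mirrors the documented ScoreReport fields; metric values assumed numeric.
interface ScoreReportLike {
  gate_outcome: GateOutcome;
  hard_gate_violations: string[];
  metrics: Record<string, number>;
}

// Only passing runs feed trend history in this sketch.
function isTrendable(report: ScoreReportLike): boolean {
  return report.gate_outcome === "pass";
}

// Produces a one-line summary suitable for a CI log.
function summarize(report: ScoreReportLike): string {
  switch (report.gate_outcome) {
    case "pass":
      return "pass: trendable";
    case "fail":
      return `fail: ${report.hard_gate_violations.join(", ")}`;
    case "blocked":
      return "blocked: run skipped";
  }
}
```

Because the switch covers every member of `GateOutcome`, the compiler will flag this function if a new outcome is added without a matching case.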
Override and Change Control
Per ratified architecture:
- Threshold changes require explicit Principal approval
- Rubric/scoring logic changes require explicit Principal approval
- Benchmark removal from any blocking tier requires explicit Principal approval
- Benchmark promotion between tiers requires explicit Principal approval
No autonomous threshold adjustment is allowed.
Adapter Conformance
The baseline requires 5 conformance tests:
- Schema validation for all interface payloads
- Lifecycle ordering (prepare → execute → capture → cleanup)
- Idempotency for repeated capture calls
- Evidence bundle completeness
- Timestamp monotonicity
The full contract defines 8 tests. Replay (deterministic seed reuse), capability mismatch behavior, and error mapping (raw target failures → normalized codes) are deferred to follow-up.
Adapters are not eligible for release-gate benchmarking until all conformance tests pass.
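Two of the baseline checks, lifecycle ordering and timestamp monotonicity, can be sketched over a recorded list of adapter events. The `LifecycleEvent` shape, the `timestamp_ms` field, and both function names are hypothetical. The sketch also assumes that a step may repeat (for example, repeated capture calls) and that equal timestamps count as monotonic; the actual conformance suite may be stricter.

```typescript
type LifecycleStep = "prepare" | "execute" | "capture" | "cleanup";

// Hypothetical event record emitted by an adapter under test.
interface LifecycleEvent {
  step: LifecycleStep;
  timestamp_ms: number;
}

const STEP_ORDER: LifecycleStep[] = ["prepare", "execute", "capture", "cleanup"];

// True when steps never move backwards and all four steps occur.
function lifecycleInOrder(events: LifecycleEvent[]): boolean {
  let lastRank = -1;
  const seen = new Set<LifecycleStep>();
  for (const event of events) {
    const rank = STEP_ORDER.indexOf(event.step);
    if (rank < lastRank) return false;
    lastRank = rank;
    seen.add(event.step);
  }
  return seen.size === STEP_ORDER.length;
}

// True when timestamps never decrease from one event to the next.
function timestampsMonotonic(events: LifecycleEvent[]): boolean {
  let previous = -Infinity;
  for (const event of events) {
    if (event.timestamp_ms < previous) return false;
    previous = event.timestamp_ms;
  }
  return true;
}
```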
Benchmark Families
| Family | Purpose | Tier 0 | Tier 1 |
|---|---|---|---|
| NodeFlowBench | Strict workflow determinism | Smoke | Full |
| MemoryQualityBench | Novel learning quality (retrieval + distillation) | Smoke | Full |
| VendingBench | Autonomy learning | — | Reduced |
| Reference-agent | Comparative usability | — | P0 |