Benchmark Runbook

Phase 2.4 establishes the benchmark infrastructure for regression prevention. This runbook covers how to run benchmarks, interpret reports, and handle change-control overrides.

Overview

The benchmark harness lives in scripts/benchmark. It provides:

Shared schemas (@nous/shared): BenchmarkSpec, RunRecord, EvidenceBundle, ScoreReport
AgentAdapter contract: All target agents integrate through the same interface
Tier 0 (PR gate): Deterministic subset using mock adapter; blocks merge on hard-gate failure
Tier 1 (nightly): Full benchmark families with trend history
Tier 2 (release): Release report contract and hard-stop conditions

Running Benchmarks

Tier 0 — PR Gate

Runs on every PR. Deterministic; uses mock adapter.

pnpm run benchmark:tier0

Or run the benchmark package tests directly:

pnpm --filter nous-benchmark run test

Memory-quality retrieval (Phase 4.2 / Phase 8.3 runtime): The retrieval benchmark tests relevance (semantically similar results rank higher), policy conformance (a denial returns policyDenial), and deterministic runtime metadata (budgetTelemetry, decision, truncation reason, tie-break behavior). These tests depend on the built retrieval stack. If retrieval tests fail with an unexpected response shape or stale package output, rebuild @nous/shared, @nous/memory-retrieval, and @nous/memory-access first.
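To illustrate what "deterministic tie-break behavior" can mean in practice, here is a minimal sketch of a stable ranking. The ScoredResult shape, the rankDeterministically name, and the id-based tie-break rule are all assumptions for illustration; the retrieval stack may order ties differently.

```typescript
// Hypothetical result shape; the real retrieval response schema may differ.
interface ScoredResult {
  id: string;
  score: number;
}

// Sort by score, highest first. Equal scores are ordered by id so the
// output does not depend on input order. The id tie-break is an assumption.
function rankDeterministically(results: ScoredResult[]): ScoredResult[] {
  return results.slice().sort((a, b) => {
    if (b.score !== a.score) return b.score - a.score;
    return a.id < b.id ? -1 : a.id > b.id ? 1 : 0;
  });
}
```

A benchmark check along these lines would feed the same results in different input orders and require identical output each time.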

Memory-quality distillation (Phase 4.3): The distillation benchmark tests clustering consistency (same input → same cluster keys), confidence formula determinism, and provenance completeness. These tests depend on @nous/memory-distillation. AA-004 traceability: distillation-quality regression bar.
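The clustering-consistency property (same input → same cluster keys) can be checked with a comparison like the sketch below. The sameClusterKeys helper is hypothetical and assumes cluster keys are plain strings; it is not part of @nous/memory-distillation.

```typescript
// Returns true when two distillation runs over the same input produced the
// same cluster keys, ignoring the order in which the keys were emitted.
function sameClusterKeys(first: string[], second: string[]): boolean {
  if (first.length !== second.length) return false;
  const a = first.slice().sort();
  const b = second.slice().sort();
  return a.every((key, i) => key === b[i]);
}
```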

Tier 1 — Nightly

pnpm --filter nous-benchmark run test

Tier 2 — Release Gate

pnpm --filter nous-benchmark run test

(Release report generation is wired into the harness; full CI integration is being phased in.)

SkillBench Admission Evidence (Phase 7.5)

Phase 7.5 adds skill-admission benchmark helpers used to construct promotion-ready evidence bundles and detect fixed-model drift.

Helper surface (benchmark package):

detectFixedModelDrift(runs, modelProfileLocked)
buildSkillAttributionEvidenceBundle(input)
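As a rough picture of what fixed-model drift detection involves, the sketch below flags any run whose model profile differs from the locked one. It mirrors the argument order of detectFixedModelDrift(runs, modelProfileLocked) but is not the package's implementation; the RunLike shape and its model_profile field are assumptions.

```typescript
// Hypothetical run shape: the real RunRecord schema lives in @nous/shared
// and may name this field differently.
interface RunLike {
  run_id: string;
  model_profile: string; // assumed field name
}

// Returns true when any run used a model profile other than the locked one.
function sketchDetectFixedModelDrift(
  runs: RunLike[],
  modelProfileLocked: string
): boolean {
  return runs.some((run) => run.model_profile !== modelProfileLocked);
}
```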

Required Evidence Fields

Admission-oriented SkillBench evidence must include:

benchmark_pack_ref
model_profile_locked
baseline_revision_ref
candidate_revision_ref
seed_set_ref
run_record_refs (non-empty)
score_report_refs (non-empty)
trace_bundle_refs (non-empty)
drift_detected
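The field list above could be expressed as a type plus a completeness check, roughly as follows. The field names come from this runbook; the field types (string refs, string arrays) and the findMissingEvidenceFields helper are assumptions, and the authoritative schema is whatever @nous/shared defines.

```typescript
// Field names are taken from the runbook; the types are assumed.
interface SkillBenchEvidence {
  benchmark_pack_ref: string;
  model_profile_locked: string;
  baseline_revision_ref: string;
  candidate_revision_ref: string;
  seed_set_ref: string;
  run_record_refs: string[];
  score_report_refs: string[];
  trace_bundle_refs: string[];
  drift_detected: boolean;
}

// Hypothetical helper: lists required fields that are absent or empty.
function findMissingEvidenceFields(e: Partial<SkillBenchEvidence>): string[] {
  const missing: string[] = [];
  const scalarRefs = [
    "benchmark_pack_ref",
    "model_profile_locked",
    "baseline_revision_ref",
    "candidate_revision_ref",
    "seed_set_ref",
  ] as const;
  for (const key of scalarRefs) {
    const value = e[key];
    if (typeof value !== "string" || value.length === 0) missing.push(key);
  }
  const listRefs = ["run_record_refs", "score_report_refs", "trace_bundle_refs"] as const;
  for (const key of listRefs) {
    const value = e[key];
    // These lists must be present and non-empty.
    if (!Array.isArray(value) || value.length === 0) missing.push(key);
  }
  if (typeof e.drift_detected !== "boolean") missing.push("drift_detected");
  return missing;
}
```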

Fail-Closed Admission Implications

Any fixed-model drift (drift_detected: true) invalidates benchmark evidence for promotion.
Missing required refs invalidate benchmark evidence for promotion.
Admission flow remains pending_cortex until benchmark evidence, attribution thesis, and contract validation are all complete.
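The fail-closed rules above amount to a conjunction: anything short of fully valid inputs leaves admission in pending_cortex. The sketch below is one way to read them. Only pending_cortex appears in this runbook; the ready_for_decision state, the AdmissionInputs shape, and the function name are hypothetical.

```typescript
// "pending_cortex" is from the runbook; "ready_for_decision" is a
// placeholder name for whatever state follows it.
type AdmissionState = "pending_cortex" | "ready_for_decision";

// Hypothetical inputs summarising the three prerequisites.
interface AdmissionInputs {
  missingEvidenceFields: string[]; // empty when all required refs are present
  driftDetected: boolean;
  attributionThesisComplete: boolean;
  contractValidationComplete: boolean;
}

// Fail closed: any drift, missing ref, or incomplete prerequisite keeps the
// flow in pending_cortex.
function sketchAdmissionState(inputs: AdmissionInputs): AdmissionState {
  const evidenceUsable =
    inputs.missingEvidenceFields.length === 0 && !inputs.driftDetected;
  const allComplete =
    evidenceUsable &&
    inputs.attributionThesisComplete &&
    inputs.contractValidationComplete;
  return allComplete ? "ready_for_decision" : "pending_cortex";
}
```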

Report Interpretation

ScoreReport

Each run produces a ScoreReport with:

gate_outcome: pass | fail | blocked
hard_gate_violations: List of violation codes when gate fails
metrics: Family-specific metrics (e.g., time_to_success_ms, intervention_events)

Global Hard Gates

Gate	Violation	Meaning
Evidence linkage	`missing_evidence_linkage`	Run has no `evidence_bundle_ref`
Evidence mismatch	`evidence_bundle_ref_mismatch`	Artifact bundle ref doesn't match run record
Side effects	`unauthorized_critical_side_effects`	Unauthorized tool/filesystem/network/memory actions
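The three global hard gates can be read as a function from a run to a list of violation codes. The sketch below uses the violation codes from the table; the RunUnderTest shape, including artifact_bundle_ref and unauthorized_actions, is an assumption about how the harness might surface this data.

```typescript
type HardGateViolation =
  | "missing_evidence_linkage"
  | "evidence_bundle_ref_mismatch"
  | "unauthorized_critical_side_effects";

// Hypothetical view of a run; only evidence_bundle_ref is named in the runbook.
interface RunUnderTest {
  evidence_bundle_ref?: string;
  artifact_bundle_ref?: string; // assumed: ref recorded on the stored artifact
  unauthorized_actions: string[]; // assumed: tool/filesystem/network/memory actions
}

function checkGlobalHardGates(run: RunUnderTest): HardGateViolation[] {
  const violations: HardGateViolation[] = [];
  if (!run.evidence_bundle_ref) {
    violations.push("missing_evidence_linkage");
  } else if (run.artifact_bundle_ref !== run.evidence_bundle_ref) {
    // Only meaningful when the run record has a ref to compare against.
    violations.push("evidence_bundle_ref_mismatch");
  }
  if (run.unauthorized_actions.length > 0) {
    violations.push("unauthorized_critical_side_effects");
  }
  return violations;
}
```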

Gate Outcome

pass: All hard gates satisfied; run is trendable
fail: One or more hard gates violated; run is not trendable
blocked: Adapter returned capability_mismatch or similar; run skipped
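Put together, the three outcomes can be derived from the violation list and the adapter result, as in this sketch. The AdapterStatus type and both function names are hypothetical, and treating blocked runs as not trendable is an assumption (the runbook only says they are skipped).

```typescript
type GateOutcome = "pass" | "fail" | "blocked";

// Hypothetical adapter result; "capability_mismatch" is the one status the
// runbook names as leading to a blocked run.
type AdapterStatus = "ok" | "capability_mismatch";

function deriveGateOutcome(
  hardGateViolations: string[],
  adapterStatus: AdapterStatus
): GateOutcome {
  if (adapterStatus === "capability_mismatch") return "blocked";
  return hardGateViolations.length === 0 ? "pass" : "fail";
}

// Only passing runs feed trend history. Excluding blocked runs is assumed.
function isTrendable(outcome: GateOutcome): boolean {
  return outcome === "pass";
}
```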

Override and Change Control

Per ratified architecture:

Threshold changes require explicit Principal approval
Rubric/scoring logic changes require explicit Principal approval
Benchmark removal from any blocking tier requires explicit Principal approval
Benchmark promotion between tiers requires explicit Principal approval

No autonomous threshold adjustment is allowed.
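One way to enforce the threshold rule mechanically is to diff the threshold configuration and refuse any change that lacks recorded approval. This is only a sketch: the Thresholds shape, the function names, and the principalApproved flag are hypothetical, and how approval is actually recorded is outside this runbook.

```typescript
// Hypothetical threshold config: metric name → numeric threshold.
type Thresholds = Record<string, number>;

// Lists every threshold that was added, removed, or changed.
function changedThresholds(before: Thresholds, after: Thresholds): string[] {
  const keys = Object.keys(before).concat(
    Object.keys(after).filter((key) => !(key in before))
  );
  const changed: string[] = [];
  for (const key of keys) {
    if (before[key] !== after[key]) changed.push(key);
  }
  return changed.sort();
}

// A change set is acceptable only if nothing changed or a Principal approved it.
function thresholdChangeAllowed(
  before: Thresholds,
  after: Thresholds,
  principalApproved: boolean
): boolean {
  return changedThresholds(before, after).length === 0 || principalApproved;
}
```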

Adapter Conformance

The baseline requires 5 conformance tests:

Schema validation for all interface payloads
Lifecycle ordering (prepare → execute → capture → cleanup)
Idempotency for repeated capture calls
Evidence bundle completeness
Timestamp monotonicity

The full contract defines 8 tests. The remaining three, replay (deterministic seed reuse), capability mismatch behavior, and error mapping (raw target failures → normalized codes), are deferred to follow-up work.

Adapters are not eligible for release-gate benchmarking until all conformance tests pass.
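Three of the baseline checks (lifecycle ordering, capture idempotency, timestamp monotonicity) lend themselves to small predicates. The sketch below assumes an adapter emits events with a phase and a millisecond timestamp; the AdapterEvent shape and function names are hypothetical, as are the choices to allow repeated calls within a phase and to accept equal timestamps.

```typescript
const LIFECYCLE = ["prepare", "execute", "capture", "cleanup"] as const;
type Phase = (typeof LIFECYCLE)[number];

// Hypothetical event record emitted by an adapter under test.
interface AdapterEvent {
  phase: Phase;
  timestamp_ms: number;
}

// Phases must start at prepare, advance one step at a time (repeats of the
// current phase are allowed), and end at cleanup.
function isLifecycleOrdered(events: AdapterEvent[]): boolean {
  let last = -1;
  for (const event of events) {
    const index = LIFECYCLE.indexOf(event.phase);
    if (index !== last && index !== last + 1) return false;
    last = index;
  }
  return last === LIFECYCLE.length - 1;
}

// Timestamps must never go backwards; equal timestamps are accepted here.
function areTimestampsMonotonic(events: AdapterEvent[]): boolean {
  let previous = -Infinity;
  for (const event of events) {
    if (event.timestamp_ms < previous) return false;
    previous = event.timestamp_ms;
  }
  return true;
}

// Calls capture twice and compares the serialized results.
function isCaptureIdempotent(capture: () => unknown): boolean {
  return JSON.stringify(capture()) === JSON.stringify(capture());
}
```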

Benchmark Families

Family	Purpose	Tier 0	Tier 1
NodeFlowBench	Strict workflow determinism	Smoke	Full
MemoryQualityBench	Novel learning quality (retrieval + distillation)	Smoke	Full
VendingBench	Autonomy learning	—	Reduced
Reference-agent	Comparative usability	—	P0
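For tooling that needs to know which families run in which tier, the table above could be encoded as data. The sketch below copies the cell values verbatim and uses null for the "—" cells; the FamilyCoverage type and familiesForTier helper are hypothetical and not part of the harness.

```typescript
// Cell values copied from the table; null stands for a "—" cell.
type Coverage = "Smoke" | "Full" | "Reduced" | "P0" | null;

interface FamilyCoverage {
  family: string;
  tier0: Coverage;
  tier1: Coverage;
}

const FAMILIES: FamilyCoverage[] = [
  { family: "NodeFlowBench", tier0: "Smoke", tier1: "Full" },
  { family: "MemoryQualityBench", tier0: "Smoke", tier1: "Full" },
  { family: "VendingBench", tier0: null, tier1: "Reduced" },
  { family: "Reference-agent", tier0: null, tier1: "P0" },
];

// Returns the families that have any coverage in the given tier.
function familiesForTier(tier: "tier0" | "tier1"): string[] {
  return FAMILIES.filter((entry) => entry[tier] !== null).map(
    (entry) => entry.family
  );
}
```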
