Benchmark Runbook

How to run benchmarks, interpret reports, and handle overrides

Phase 2.4 establishes the benchmark infrastructure for regression prevention. This runbook covers how to run benchmarks, interpret reports, and handle change-control overrides.

Overview

The benchmark harness lives in scripts/benchmark. It provides:

  • Shared schemas (@nous/shared): BenchmarkSpec, RunRecord, EvidenceBundle, ScoreReport
  • AgentAdapter contract: All target agents integrate through the same interface
  • Tier 0 (PR gate): Deterministic subset using mock adapter; blocks merge on hard-gate failure
  • Tier 1 (nightly): Full benchmark families with trend history
  • Tier 2 (release): Release report contract and hard-stop conditions

Running Benchmarks

Tier 0 — PR Gate

Runs on every PR. Deterministic; uses mock adapter.

pnpm run benchmark:tier0

Or run the benchmark package tests directly:

pnpm --filter nous-benchmark run test

Memory-quality retrieval (Phase 4.2 / Phase 8.3 runtime): The retrieval benchmark tests relevance (semantically similar results rank higher), policy-conformance (denial returns policyDenial), and deterministic runtime metadata (budgetTelemetry, decision, truncation reason, tie-break behavior). These tests depend on the built retrieval stack. If retrieval tests fail with unexpected response shape or stale package output, rebuild @nous/shared, @nous/memory-retrieval, and @nous/memory-access first.

Memory-quality distillation (Phase 4.3): The distillation benchmark tests clustering consistency (same input → same cluster keys), confidence formula determinism, and provenance completeness. These tests depend on @nous/memory-distillation. AA-004 traceability: distillation-quality regression bar.
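The clustering-consistency property (same input → same cluster keys) can be illustrated with a small determinism probe. This is only a sketch: `toyClusterKeys` is a hypothetical stand-in rather than the @nous/memory-distillation API, and the repeat count is an arbitrary choice.

```typescript
// Hypothetical stand-in for a distillation step: groups entries by a
// normalized topic and returns the cluster keys in sorted order.
function toyClusterKeys(entries: { topic: string }[]): string[] {
  const keys = new Set(entries.map((entry) => entry.topic.trim().toLowerCase()));
  return Array.from(keys).sort();
}

// Calls `fn` several times on the same input and reports whether every call
// produced the same ordered list of keys as the first call.
function producesStableKeys<I>(
  fn: (input: I) => string[],
  input: I,
  repeats: number = 3,
): boolean {
  const baseline = JSON.stringify(fn(input));
  for (let i = 1; i < repeats; i++) {
    if (JSON.stringify(fn(input)) !== baseline) return false;
  }
  return true;
}

const sampleEntries = [{ topic: "Billing" }, { topic: "billing " }, { topic: "auth" }];
console.log(producesStableKeys(toyClusterKeys, sampleEntries));
```

Comparing serialized key lists also catches ordering differences, which matters if downstream code treats key order as significant.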

Tier 1 — Nightly

pnpm --filter nous-benchmark run test

Tier 2 — Release Gate

pnpm --filter nous-benchmark run test

(Release report generation is wired in the harness; full CI integration is phased.)

SkillBench Admission Evidence (Phase 7.5)

Phase 7.5 adds skill-admission benchmark helpers used to construct promotion-ready evidence bundles and detect fixed-model drift.

Helper surface (benchmark package):

  • detectFixedModelDrift(runs, modelProfileLocked)
  • buildSkillAttributionEvidenceBundle(input)
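The runbook does not spell out how drift is detected, so the following is one plausible reading rather than the helper's actual implementation. It assumes each run records the model profile it used in a `model_profile` string; that field, the `RunLike` shape, and the function name `findDriftingRuns` are all hypothetical.

```typescript
// Hypothetical run shape; the real RunRecord schema lives in @nous/shared.
interface RunLike {
  run_id: string;
  model_profile: string; // assumed: identifier of the model profile the run used
}

// Returns the ids of runs whose model profile differs from the locked profile.
function findDriftingRuns(runs: RunLike[], modelProfileLocked: string): string[] {
  return runs
    .filter((run) => run.model_profile !== modelProfileLocked)
    .map((run) => run.run_id);
}

const exampleRuns: RunLike[] = [
  { run_id: "run-1", model_profile: "profile-a" },
  { run_id: "run-2", model_profile: "profile-b" },
];
const driftingIds = findDriftingRuns(exampleRuns, "profile-a");
const driftDetected = driftingIds.length > 0; // would feed a drift_detected flag
console.log(driftingIds, driftDetected);
```

Returning the offending run ids rather than a bare boolean makes it easier to see which run broke the fixed-model assumption.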

Required Evidence Fields

Admission-oriented SkillBench evidence must include:

  • benchmark_pack_ref
  • model_profile_locked
  • baseline_revision_ref
  • candidate_revision_ref
  • seed_set_ref
  • run_record_refs (non-empty)
  • score_report_refs (non-empty)
  • trace_bundle_refs (non-empty)
  • drift_detected
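A completeness check over the fields listed above might look like the following sketch. The field names come from the list; the TypeScript types (strings for single refs, string arrays for the non-empty ref lists, a string for `model_profile_locked`) and the function name are assumptions.

```typescript
// Assumed shape; the authoritative schema is the EvidenceBundle in @nous/shared.
interface SkillBenchEvidence {
  benchmark_pack_ref: string;
  model_profile_locked: string;
  baseline_revision_ref: string;
  candidate_revision_ref: string;
  seed_set_ref: string;
  run_record_refs: string[];
  score_report_refs: string[];
  trace_bundle_refs: string[];
  drift_detected: boolean;
}

const SINGLE_VALUE_FIELDS = [
  "benchmark_pack_ref",
  "model_profile_locked",
  "baseline_revision_ref",
  "candidate_revision_ref",
  "seed_set_ref",
] as const;

const NON_EMPTY_LIST_FIELDS = [
  "run_record_refs",
  "score_report_refs",
  "trace_bundle_refs",
] as const;

// Returns the names of required fields that are absent, blank, or empty.
function findMissingEvidenceFields(evidence: Partial<SkillBenchEvidence>): string[] {
  const missing: string[] = [];
  for (const key of SINGLE_VALUE_FIELDS) {
    const value = evidence[key];
    if (typeof value !== "string" || value.trim() === "") missing.push(key);
  }
  for (const key of NON_EMPTY_LIST_FIELDS) {
    const value = evidence[key];
    if (!Array.isArray(value) || value.length === 0) missing.push(key);
  }
  // drift_detected must be present as an explicit boolean, true or false.
  if (typeof evidence.drift_detected !== "boolean") missing.push("drift_detected");
  return missing;
}

console.log(findMissingEvidenceFields({ benchmark_pack_ref: "pack-1", run_record_refs: [] }));
```

Accepting a `Partial` input lets the check run on bundles that are still being assembled.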

Fail-Closed Admission Implications

  • Any fixed-model drift (drift_detected: true) invalidates benchmark evidence for promotion.
  • Missing required refs invalidate benchmark evidence for promotion.
  • Admission flow remains pending_cortex until benchmark evidence, attribution thesis, and contract validation are all complete.
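The fail-closed rules above reduce to two boolean checks. The sketch below is illustrative only: `AdmissionInputs` and both function names are hypothetical, and it deliberately does not name the state that follows `pending_cortex`, since the runbook does not define it.

```typescript
// Hypothetical summary of what the admission flow knows about a candidate.
interface AdmissionInputs {
  driftDetected: boolean;              // drift_detected from the evidence bundle
  missingRequiredRefs: string[];       // names of required refs that are absent
  attributionThesisComplete: boolean;
  contractValidationComplete: boolean;
}

// Benchmark evidence counts for promotion only with no drift and no missing refs.
function isBenchmarkEvidenceValid(inputs: AdmissionInputs): boolean {
  return !inputs.driftDetected && inputs.missingRequiredRefs.length === 0;
}

// The flow stays in pending_cortex unless every prerequisite is satisfied.
function remainsPendingCortex(inputs: AdmissionInputs): boolean {
  return !(
    isBenchmarkEvidenceValid(inputs) &&
    inputs.attributionThesisComplete &&
    inputs.contractValidationComplete
  );
}

console.log(
  remainsPendingCortex({
    driftDetected: true,
    missingRequiredRefs: [],
    attributionThesisComplete: true,
    contractValidationComplete: true,
  }),
);
```

Because the default answer is "still pending", any unknown or incomplete input keeps the candidate from being promoted.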

Report Interpretation

ScoreReport

Each run produces a ScoreReport with:

  • gate_outcome: pass | fail | blocked
  • hard_gate_violations: List of violation codes when gate fails
  • metrics: Family-specific metrics (e.g., time_to_success_ms, intervention_events)

Global Hard Gates

| Gate | Violation | Meaning |
| --- | --- | --- |
| Evidence linkage | missing_evidence_linkage | Run has no evidence_bundle_ref |
| Evidence mismatch | evidence_bundle_ref_mismatch | Artifact bundle ref doesn't match run record |
| Side effects | unauthorized_critical_side_effects | Unauthorized tool/filesystem/network/memory actions |
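As a rough illustration of how the three global hard gates could be evaluated, the sketch below maps a run's evidence refs and observed side effects to violation codes. Only the violation codes come from this runbook; the `GateInput` shape and the function name are hypothetical.

```typescript
type HardGateViolation =
  | "missing_evidence_linkage"
  | "evidence_bundle_ref_mismatch"
  | "unauthorized_critical_side_effects";

// Hypothetical inputs for one run.
interface GateInput {
  runEvidenceBundleRef?: string; // evidence_bundle_ref on the run record
  artifactBundleRef?: string;    // ref carried by the stored artifact bundle
  unauthorizedActions: string[]; // unauthorized tool/filesystem/network/memory actions
}

function collectHardGateViolations(input: GateInput): HardGateViolation[] {
  const violations: HardGateViolation[] = [];
  if (!input.runEvidenceBundleRef) {
    // No linkage at all; a mismatch check would be meaningless here.
    violations.push("missing_evidence_linkage");
  } else if (input.artifactBundleRef !== input.runEvidenceBundleRef) {
    violations.push("evidence_bundle_ref_mismatch");
  }
  if (input.unauthorizedActions.length > 0) {
    violations.push("unauthorized_critical_side_effects");
  }
  return violations;
}

console.log(
  collectHardGateViolations({
    runEvidenceBundleRef: "bundle-1",
    artifactBundleRef: "bundle-2",
    unauthorizedActions: [],
  }),
);
```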

Gate Outcome

  • pass: All hard gates satisfied; run is trendable
  • fail: One or more hard gates violated; run is not trendable
  • blocked: Adapter returned capability_mismatch or similar; run skipped
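The three outcomes can be derived from two inputs, as in the sketch below. Two points are assumptions rather than documented behavior: that `blocked` takes precedence over `fail`, and that a blocked (skipped) run is not trendable. The function names are hypothetical.

```typescript
type GateOutcome = "pass" | "fail" | "blocked";

// adapterBlocked: the adapter returned capability_mismatch or a similar signal.
// hardGateViolations: violation codes collected for the run.
function resolveGateOutcome(adapterBlocked: boolean, hardGateViolations: string[]): GateOutcome {
  if (adapterBlocked) return "blocked"; // assumed to take precedence: the run was skipped
  return hardGateViolations.length > 0 ? "fail" : "pass";
}

// Only passing runs feed trend history; treating blocked runs as non-trendable
// is an assumption.
function isTrendable(outcome: GateOutcome): boolean {
  return outcome === "pass";
}

const outcome = resolveGateOutcome(false, ["missing_evidence_linkage"]);
console.log(outcome, isTrendable(outcome));
```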

Override and Change Control

Per ratified architecture:

  • Threshold changes require explicit Principal approval
  • Rubric/scoring logic changes require explicit Principal approval
  • Benchmark removal from any blocking tier requires explicit Principal approval
  • Benchmark promotion between tiers requires explicit Principal approval

No autonomous threshold adjustment is allowed.

Adapter Conformance

The baseline requires 5 conformance tests:

  1. Schema validation for all interface payloads
  2. Lifecycle ordering (prepare → execute → capture → cleanup)
  3. Idempotency for repeated capture calls
  4. Evidence bundle completeness
  5. Timestamp monotonicity

The full contract defines 8 tests. The remaining 3 are deferred to follow-up work: replay (deterministic seed reuse), capability mismatch behavior, and error mapping (raw target failures → normalized codes).

Adapters are not eligible for release-gate benchmarking until all conformance tests pass.
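Two of the baseline checks, lifecycle ordering and timestamp monotonicity, are simple enough to sketch directly. The step names follow the lifecycle listed above; the function names are hypothetical, and two details are assumptions: repeated `capture` calls are allowed (to accommodate the idempotency test), and "monotonic" is read as non-decreasing.

```typescript
const LIFECYCLE = ["prepare", "execute", "capture", "cleanup"] as const;
type LifecycleStep = (typeof LIFECYCLE)[number];

// True when the recorded calls run prepare, execute, capture, cleanup in order
// and reach cleanup. Consecutive repeated capture calls are tolerated.
function followsLifecycleOrder(calls: LifecycleStep[]): boolean {
  let stage = -1; // index of the last lifecycle step seen
  for (const call of calls) {
    const index = LIFECYCLE.indexOf(call);
    if (call === "capture" && index === stage) continue; // repeated capture
    if (index !== stage + 1) return false;               // skipped or out of order
    stage = index;
  }
  return stage === LIFECYCLE.length - 1;
}

// True when no timestamp is earlier than the one before it.
function isMonotonic(timestamps: number[]): boolean {
  let previous = -Infinity;
  for (const current of timestamps) {
    if (current < previous) return false;
    previous = current;
  }
  return true;
}

console.log(followsLifecycleOrder(["prepare", "execute", "capture", "capture", "cleanup"]));
console.log(isMonotonic([100, 100, 250]));
```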

Benchmark Families

| Family | Purpose | Tier 0 | Tier 1 |
| --- | --- | --- | --- |
| NodeFlowBench | Strict workflow determinism | Smoke | Full |
| MemoryQualityBench | Novel learning quality (retrieval + distillation) | Smoke | Full |
| VendingBench | Autonomy learning | | Reduced |
| Reference-agent | Comparative usability | | P0 |
