Troubleshooting
Common issues and recovery steps
Ollama Not Running
Symptom: Health check shows Ollama as "unhealthy". Chat fails or times out.
Fix:
# Start Ollama (in a separate terminal)
ollama serve
# Pull a model if needed
ollama pull llama3.2:3b
Backend Not Reachable
Symptom: CLI or web UI cannot connect. "Connection refused" or similar.
Fix:
- Ensure the backend is running: pnpm dev:web
- Check the API URL: CLI defaults to http://localhost:4317
- If 4317 was busy, installer/dev may have started on the next free port; use the printed URL (or set --api-url in CLI)
- If using a different port, set --api-url for CLI
pnpm install Fails (EPERM, Windows)
Symptom: pnpm install fails with permission errors on Windows.
Fix:
- Run terminal as Administrator
- Or use a different drive (e.g. D:) if C: has restrictions
- Disable antivirus temporarily for the install directory
Tests Fail from Root (Workspace Resolution)
Symptom: Root test runs fail with package import resolution errors (for example Cannot find package '@nous/shared') or discover unexpected dist/ test files.
Fix:
- Run root tests through the workspace command: pnpm test
- Run package-local tests with each package script: pnpm --filter <package-name> test
- Avoid ad-hoc root invocation (pnpm exec vitest run) unless you explicitly pass the intended config file
Config Validation Errors
Symptom: Startup fails with ConfigError listing validation failures.
Fix:
- Check the config file (JSON5) for syntax errors
- Ensure required fields are present: profile, pfcTier, providers, storage, etc.
- Remove the config file to regenerate defaults, or fix the reported fields
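A minimal sketch of the required-field check, assuming the config file has already been parsed from JSON5 into a plain object. The helper name findMissingFields and the sample values are hypothetical, and the field list covers only the four fields named above; the real schema checks more than this.

```typescript
// Hypothetical helper, not part of Nous: lists required top-level fields that are absent.
// Only the four fields named in this guide are checked; the real schema has more.
const REQUIRED_FIELDS = ["profile", "pfcTier", "providers", "storage"] as const;

function findMissingFields(config: Record<string, unknown>): string[] {
  return REQUIRED_FIELDS.filter((field) => config[field] === undefined);
}

// Illustrative input: a config that lacks pfcTier and storage.
const missingFields = findMissingFields({ profile: "example", providers: {} });
console.log(missingFields);
```

A check like this only narrows down which fields to look at; the ConfigError message remains the authoritative list of validation failures.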
Voice Request Stays in Text Confirmation or Blocked State
Symptom: A voice-originated action does not execute even though the requested intent seems clear.
Meaning: Phase 11.3 keeps risky voice actions behind canonical confidence-governance and confirmation checks. The runtime can return clarification, text confirmation, dual-channel confirmation, or blocked posture instead of executing directly.
Fix:
- Check whether the request is high-risk, destructive, or T3; those actions require text or dual-channel confirmation.
- If the response mentions low confidence, repeat the request more explicitly or continue in text.
- If the active Principal session is missing or stale, complete confirmation from a current trusted session instead of retrying voice-only.
- Treat the returned reason code and confirmation posture as canonical runtime truth.
Voice Degraded Mode Remains Active
Symptom: Voice responses keep directing you to continue in text, or MAO/other surfaces show voice degraded mode even after one successful turn.
Meaning: Degraded mode is a safety posture. Phase 11.3 keeps risky controls text-first until recovery is sustained; a single improved turn does not automatically clear the degraded state.
Fix:
- Continue risky control actions in text while degraded mode is active.
- Check whether recent turns included low ASR confidence, low intent confidence, handoff instability, transport degradation, or interruption recovery.
- After conditions stabilize, start a fresh voice turn and let the runtime clear degraded mode through the sustained recovery path.
- If degraded mode persists unexpectedly, inspect the voice session projection and witness-linked evidence rather than forcing execution.
Cloud Model List Shows Stale or Missing Models
Symptom: The model picker shows a minimal list with a staleness indicator, or expected cloud models do not appear.
Meaning: The dynamic /v1/models API call for that provider failed or returned an error. Nous falls back to a minimal static list when the live API is unreachable.
Fix:
- Check that the provider API key is stored (Configuration > Provider Keys). Providers without a stored key are skipped silently.
- Verify network connectivity to the provider API (api.anthropic.com or api.openai.com).
- Reopen the configuration panel to trigger a fresh fetch; cached responses expire after 5 minutes.
- If the provider API is experiencing an outage, the fallback list is expected behavior until the API recovers.
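The caching and fallback behavior described above can be pictured with a small sketch. It assumes a five-minute time-to-live and a static fallback list; the type and function names (ModelListCache, isCacheFresh, chooseModelList) are hypothetical and do not reflect the actual Nous implementation.

```typescript
const CACHE_TTL_MS = 5 * 60 * 1000; // five minutes, per the guidance above

interface ModelListCache {
  models: string[];
  fetchedAtMs: number;
}

// Hypothetical: a cached list is reused only while it is younger than the TTL.
function isCacheFresh(cache: ModelListCache | undefined, nowMs: number): boolean {
  return cache !== undefined && nowMs - cache.fetchedAtMs < CACHE_TTL_MS;
}

// Hypothetical: live === null models a failed /v1/models call for the provider.
function chooseModelList(
  live: string[] | null,
  fallback: string[],
): { models: string[]; stale: boolean } {
  return live !== null
    ? { models: live, stale: false }
    : { models: fallback, stale: true };
}
```

Under these assumptions, a picker that shows the staleness indicator corresponds to the `stale: true` branch: the live call failed and the static list was substituted.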
Model Selection Does Not Take Effect
Symptom: After selecting a new model, chat responses still use the previous model.
Fix:
- Verify the selection was saved successfully (no error banner in the configuration panel).
- Model selections are applied immediately at runtime. If the selected model spec is invalid or malformed, it is rejected without corrupting provider config — check the configuration panel for error feedback.
- If the selected model is from a cloud provider, ensure the provider key is still configured and valid.
First-Run Loop
Symptom: First-run flow keeps appearing.
Fix: First-run completes when you send a message and receive a response, or when a project exists. Ensure Ollama is running so the health check and model invocation succeed.
Memory Inspector Shows a Global-Scope Warning
Symptom: The /memory page shows a reason-code banner after you switch Scope to Global only or Project + global, and global entries do not appear.
Meaning: The selected project either does not inherit global memory or the runtime denied global-scope inspection. This is an explicit policy result, not an empty-state guess.
Fix:
- Read the banner's reason code and explanation first.
- If you only need project-local memory, switch Scope back to Project only.
- If you expected global visibility, check the project's memory-access policy and whether inheritsGlobal is enabled.
- If the denial should not have happened, inspect trace and audit evidence for the underlying policy decision rather than retrying blindly.
Memory Export Includes More Than the Current Filtered View
Symptom: The exported memory bundle contains entries that are not visible in the current inspector result list.
Meaning: This is expected. The export flow always returns the full project memory bundle, including STM context, durable entries, audit history, and tombstones. Search and filter controls narrow the on-screen inspection view only.
Fix:
- Use the inspector filters to review a subset.
- Use export when you need the authoritative full-state bundle.
- If you need a narrower dataset for analysis, filter the exported bundle after download.
Hard Delete Cannot Be Confirmed or Is Rejected
Symptom: The Memory inspector refuses to proceed with hard delete, or the result banner says the delete was not applied.
Meaning: Hard delete is confirmation-protected and requires a non-empty rationale before the request is sent. Even with a rationale, the governed mutation can still deny or defer the action.
Fix:
- Enter a clear rationale before selecting Confirm hard delete.
- Read the returned reason code in the result banner.
- If the action was not applied, inspect memory.audit for the authoritative mutation outcome and evidence refs.
- Treat deny or defer outcomes as runtime truth; clear the governing blocker instead of retrying without changes.
Learning Visibility Shows Missing Source or Evidence Diagnostics
Symptom: The /memory Learning detail view reports missing source records, missing evidence refs, degraded lineage integrity, or unavailable control-state context.
Meaning: The selected distilled pattern references canonical inputs that are no longer fully resolvable in the current project view. Phase 8.8 surfaces those gaps explicitly instead of hiding them.
Fix:
- Read the diagnostic first; it is part of the contract, not a cosmetic warning.
- Check whether the referenced source records were superseded, deleted, or created in an older state that no longer has full evidence linkage.
- Use linked trace IDs, audit history, and evidence refs to determine whether the pattern should be refreshed, replaced, or retired through the governed runtime path.
- Do not assume the missing input can be reconstructed from the UI alone.
Learning Visibility Says Governance Cards Are Representative
Symptom: The Learning view notes that lifecycle events are derived, governance cards are representative, or historicalDecisionLogAvailable is false.
Meaning: Phase 8.8 projects the current canonical confidence-governance outcome for a pattern, but it does not yet persist a per-pattern historical decision ledger in workflow runtime history.
Fix:
- Use the card's outcome and reasonCode to understand the current governance posture of the pattern.
- Use traces, audit evidence, and linked provenance when you need proof of a specific past decision.
- Treat representative cards as interpretation aids for current behavior, not as historical proof that a particular runtime action already happened.
Dispatch or Action Blocked (WMODE-*)
Symptom: A workflow dispatch or lifecycle action is blocked with a reason code such as WMODE-002, WMODE-003, or WMODE-010.
Meaning: The admission guard rejected the action because it would violate the authority chain or workmode boundaries.
| Code | Cause |
|---|---|
| WMODE-002 | Authority widening — e.g. a worker attempted to dispatch, or an orchestrator tried to dispatch to cortex |
| WMODE-003 | Nested orchestration — an orchestrator tried to dispatch to another orchestrator |
| WMODE-010 | Worker escalation — a worker attempted to dispatch to an authoritative agent |
Fix: The action is not permitted by design. Check that the dispatch source and target respect the authority chain: nous_cortex → orchestration_agent → worker_agent. See Workmode and Authority Boundaries.
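As a reading aid for the table, the sketch below maps a dispatch source and target to the code the table documents for that pair. It is not the admission guard itself. The function name is hypothetical, and it assumes that "authoritative agent" means nous_cortex or orchestration_agent and that a worker dispatching to another worker falls under WMODE-002.

```typescript
type AgentRole = "nous_cortex" | "orchestration_agent" | "worker_agent";
type WmodeCode = "WMODE-002" | "WMODE-003" | "WMODE-010";

// Hypothetical helper: returns the code the table documents for a dispatch,
// or undefined when the table does not name that source/target pair.
function documentedBlockCode(source: AgentRole, target: AgentRole): WmodeCode | undefined {
  if (source === "orchestration_agent") {
    if (target === "orchestration_agent") return "WMODE-003"; // nested orchestration
    if (target === "nous_cortex") return "WMODE-002"; // authority widening
    return undefined; // orchestrator -> worker follows the authority chain
  }
  if (source === "worker_agent") {
    // Assumption: "authoritative agent" means nous_cortex or orchestration_agent.
    return target === "worker_agent" ? "WMODE-002" : "WMODE-010";
  }
  return undefined; // nous_cortex dispatches are not covered by the table
}

console.log(documentedBlockCode("orchestration_agent", "orchestration_agent"));
```

The reason code returned by the runtime is always the authority; use a mapping like this only to understand why a given pair was rejected.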
Ingress Rejection (Phase 5.3)
Symptom: Automation trigger (scheduler, hook, webhook) returns rejected with a reason code.
Reason codes and fixes:
| Code | Cause | Fix |
|---|---|---|
| unauthenticated | Webhook HMAC failed or missing | Verify HMAC signature; ensure key_id and auth_context_ref are correct |
| scope_mismatch | Principal not bound to workflow | Check credential scope matches project_id and workflow_ref |
| event_forbidden | Event type not in allowlist | Add event_name to credential's allowed_event_names |
| policy_blocked | Policy blocks this trigger class | Review project/workflow policy for external trigger allowance |
| replay_detected | Stale timestamp or duplicate nonce | Ensure occurred_at within +/- 5 min; use unique nonce per request |
| rate_limited | Rate limit exceeded | Wait and retry with backoff; reduce request frequency |
| invalid_envelope | Missing project_id, workflow_ref, or invalid trigger_type | Fix envelope; ensure all required fields present |
| control_state_blocked | Project hard_stopped or paused_review | Resume project or release hard_stop via Projects UI / operator-control |
| workflow_admission_blocked | Ingress was valid but canonical workflow admission still failed | Inspect the returned reason_code, project lookup, and workflow configuration before retrying |
See Automation Gateway for full operator guidance.
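To make the replay_detected row concrete, here is a sketch of a freshness and nonce check a client could run before sending. It assumes occurred_at is an ISO 8601 timestamp and uses an in-memory nonce set; the function name, the storage, and the treatment of unparseable timestamps are illustrative assumptions, not the gateway's actual logic.

```typescript
const REPLAY_WINDOW_MS = 5 * 60 * 1000; // +/- 5 minutes, per the table above

// Hypothetical in-memory nonce store; a real gateway would persist and expire these.
const seenNonces = new Set<string>();

function checkReplay(occurredAt: string, nonce: string, nowMs: number): "ok" | "replay_detected" {
  const occurredMs = Date.parse(occurredAt);
  // Assumption: an unparseable timestamp is treated like a stale one in this sketch.
  const stale = Number.isNaN(occurredMs) || Math.abs(nowMs - occurredMs) > REPLAY_WINDOW_MS;
  if (stale || seenNonces.has(nonce)) return "replay_detected";
  seenNonces.add(nonce);
  return "ok";
}
```

Running a check like this against your own clock helps separate clock skew (a stale occurred_at) from nonce reuse when a trigger is rejected.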
Workflow Cannot Start Because the Definition Is Missing or Invalid
Symptom: A workflow start request returns workflow_definition_unavailable or workflow_definition_invalid, and no new run_id is created.
Meaning: The runtime could not resolve or validate the canonical workflow definition stored in the target project. This fails closed before any run starts.
| Code | Cause | Fix |
|---|---|---|
| workflow_definition_unavailable | The project has no matching workflow definition or no valid default workflow reference | Check the project's stored workflow configuration and default workflow selection |
| workflow_definition_invalid | The canonical workflow definition failed validation | Correct the graph definition, then retry after validation passes |
Common validation failures include:
- Cycles in the node graph
- Dangling edges that reference missing nodes
- Missing or invalid entry nodes
- Duplicate node or edge identities
Do not retry the same start request until the canonical project workflow definition is repaired.
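The four validation failures listed above can be checked offline with a sketch like the following. The WorkflowGraph shape (entryNodeIds, nodes, edges with from and to) is a hypothetical simplification, not the canonical definition format, and the messages are illustrative.

```typescript
// Hypothetical, simplified graph shape; the canonical definition format may differ.
interface WorkflowGraph {
  entryNodeIds: string[];
  nodes: { id: string }[];
  edges: { id: string; from: string; to: string }[];
}

function validateGraph(graph: WorkflowGraph): string[] {
  const problems: string[] = [];
  const nodeIds = new Set<string>();
  for (const node of graph.nodes) {
    if (nodeIds.has(node.id)) problems.push(`duplicate node: ${node.id}`);
    nodeIds.add(node.id);
  }
  const edgeIds = new Set<string>();
  const outgoing = new Map<string, string[]>();
  for (const edge of graph.edges) {
    if (edgeIds.has(edge.id)) problems.push(`duplicate edge: ${edge.id}`);
    edgeIds.add(edge.id);
    if (!nodeIds.has(edge.from) || !nodeIds.has(edge.to)) {
      problems.push(`dangling edge: ${edge.id}`);
      continue; // dangling edges are excluded from the cycle check
    }
    outgoing.set(edge.from, [...(outgoing.get(edge.from) ?? []), edge.to]);
  }
  if (graph.entryNodeIds.length === 0) problems.push("missing entry node");
  for (const id of graph.entryNodeIds) {
    if (!nodeIds.has(id)) problems.push(`invalid entry node: ${id}`);
  }
  // Depth-first search: reaching a node that is still being visited means a cycle.
  const state = new Map<string, "visiting" | "done">();
  const hasCycleFrom = (id: string): boolean => {
    if (state.get(id) === "visiting") return true;
    if (state.get(id) === "done") return false;
    state.set(id, "visiting");
    const cyclic = (outgoing.get(id) ?? []).some((next) => hasCycleFrom(next));
    state.set(id, "done");
    return cyclic;
  };
  if (Array.from(nodeIds).some((id) => hasCycleFrom(id))) problems.push("cycle in node graph");
  return problems;
}
```

An empty result from a check like this does not guarantee that runtime validation passes; it only rules out the four structural problems listed above.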
Workflow Admission Blocked Before Run Creation (Phase 9.1)
Symptom: A start or automation request clears transport-level checks, but the workflow still does not start and returns an admission reason code.
Meaning: Workflow admission failed after definition resolution but before run creation. No canonical run state exists yet, so the fix is at the project, authority, or control-state layer.
| Code | Cause | Fix |
|---|---|---|
| AUTH-SCOPE-MISMATCH | The request scope and resolved workflow definition do not match | Ensure the project/workflow identifiers point at the same stored definition |
| POL-CONTROL-STATE-BLOCKED | The project is hard stopped or otherwise blocked | Use operator-control to release the blocking state |
| POL-PAUSED-BLOCKED | The project is paused for review | Resume the project before retrying |
| OPCTL-INVALID-STATE | The request attempts to resume from an invalid control state | Fix the control-state transition first |
| WMODE-001 | Unsupported workmode for this admission path | Use a supported workmode configuration |
| WMODE-002 | Authority widening was attempted | Narrow the request back to the allowed authority chain |
| WMODE-003 | Nested orchestration was attempted | Use the standard orchestration lane rather than orchestrating inside orchestration |
| WMODE-010 | A worker attempted an authoritative start path | Reissue the request from the correct authority source |
Admission failures are authoritative. Use the returned reasonCode and evidenceRefs before retrying.
Scheduled Trigger Repeats or Never Advances (Phase 9.3)
Symptom: A cron/calendar schedule appears stuck, keeps returning duplicate outcomes, or never seems to fire again after a restart.
Meaning: Phase 9.3 persists schedule definitions and due cursors, but actual run truth still comes from ingress and workflow state.
Fix:
- Check the schedule's nextDueAt, lastDispatchedAt, workflowDefinitionId, and workmodeId.
- If ingress returns accepted_already_dispatched, treat that as duplicate-safe recovery of the original run rather than a failure.
- If the schedule keeps returning workflow_admission_blocked, repair the target project's workflow configuration instead of forcing redispatch.
- If nextDueAt is missing or stale after restart, let the scheduler recompute it from the canonical trigger definition before assuming work was skipped.
Artifact Retrieve/List Returns Nothing Even Though a Write Happened (Phase 9.3)
Symptom: An artifact write appears to succeed, but default retrieve or list calls return nothing.
Meaning: Phase 9.3 keeps prepared versions hidden from default visibility. Only committed versions are normal runtime truth.
Fix:
- Check whether the artifact version is still prepared instead of committed.
- If the producing workflow is still waiting on checkpoint_commit, do not treat the artifact as durably available yet.
- Verify you are reading from the correct projectId; artifact IDs are not global bearer tokens.
- If a prepared version never commits, inspect recovery or checkpoint evidence instead of replaying the same write blindly.
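The default visibility rule can be sketched as a filter over version records. The ArtifactVersion shape and the function name are hypothetical; the sketch only illustrates why a prepared version, or a read against the wrong projectId, yields an empty result.

```typescript
type VersionState = "prepared" | "committed";

// Hypothetical record shape for illustration only.
interface ArtifactVersion {
  artifactId: string;
  projectId: string;
  version: number;
  state: VersionState;
}

// Default reads see only committed versions that belong to the requested project.
function visibleVersions(all: ArtifactVersion[], projectId: string): ArtifactVersion[] {
  return all.filter((v) => v.projectId === projectId && v.state === "committed");
}
```

Under these assumptions, a write that produced only a prepared version is filtered out until its checkpoint commits, which matches the symptom described above.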
Artifact Integrity Mismatch or Corruption Signal (Phase 9.3)
Symptom: Artifact retrieval fails closed, or downstream tooling reports an integrity mismatch.
Meaning: The stored payload bytes no longer match the artifact manifest's integrityRef (sha256:<64-char-hex>). The runtime is deliberately refusing to return corrupted content.
Fix:
- Treat the failure as authoritative; do not bypass it by reading the raw payload directly.
- Check whether the payload document was interrupted, partially rewritten, or manually edited.
- Recreate or recommit the artifact from the canonical workflow output if the checkpoint and evidence chain allow it.
- If the producing run had uncertain external effects, escalate into review rather than assuming the artifact can be reconstructed safely.
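For orientation, the integrity comparison can be sketched as below. This shows what a mismatch means and is not a way to bypass the runtime check. It assumes the digest is lowercase hex SHA-256 over the raw payload bytes and uses Node's built-in crypto module; the function name is hypothetical.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: true only when the payload hashes to the digest in integrityRef.
function matchesIntegrityRef(payload: Uint8Array, integrityRef: string): boolean {
  // Expected format from this guide: sha256:<64-char-hex> (lowercase hex assumed here).
  const match = /^sha256:([0-9a-f]{64})$/.exec(integrityRef);
  if (match === null) return false; // malformed ref: fail closed
  const actual = createHash("sha256").update(payload).digest("hex");
  return actual === match[1];
}
```

Any change to the stored bytes, including a partial rewrite or a manual edit, changes the digest, which is why the runtime refuses to return the payload.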
Workflow Run State or Dispatch Lineage Is Hard to Interpret
Symptom: A run exists, but it appears stalled, paused, blocked, or unexpectedly branched.
Meaning: Phase 9.1 and Phase 9.2 expose canonical run, node, wait, checkpoint, correction, and dispatch-lineage state. Those records explain what happened without requiring the UI to infer workflow truth.
| Field | How to use it |
|---|---|
| status | Overall workflow state such as ready, running, waiting, blocked_review, paused, completed, or failed |
| readyNodeIds | Nodes eligible to dispatch now |
| waitingNodeIds | Nodes intentionally paused on a continuation path |
| blockedNodeIds | Nodes that forced the run into review-required posture |
| completedNodeIds | Nodes that have already finished |
| nodeStates[<nodeId>].status | Per-node state such as pending, ready, running, waiting, completed, blocked, or failed |
| nodeStates[<nodeId>].activeWaitState | Canonical wait details, including wait kind, reason code, evidence refs, and optional resumeToken |
| nodeStates[<nodeId>].correctionArcs | The corrective path recorded by the runtime (resume, retry, reprompt, or rollback) |
| lastPreparedCheckpointId / lastCommittedCheckpointId | Whether the run has only prepared checkpoint state or a durable checkpoint that is safe to continue from |
| selectedBranchKey / activatedEdgeIds | Which condition branch actually fired and which outbound edges became live |
| dispatchLineage[*].dispatchLineageId | Unique identifier for one dispatch attempt |
| parentNodeId / viaEdgeId | How the current node became ready |
| reasonCode / evidenceRefs | The authoritative explanation for a transition, pause, block, or failure |
Fix:
- If the run is paused, inspect control-state and operator actions first.
- If the run is waiting, inspect activeWaitState.kind, reasonCode, and resumeToken on the waiting node before dispatching anything else.
- If a node is blocked or failed, use the reasonCode and evidenceRefs to identify the governing blocker.
- If the run is blocked_review, inspect the newest correctionArcs entry before deciding whether the path is resume, retry, reprompt, or rollback.
- If branching looks wrong, inspect parentNodeId and viaEdgeId in dispatch lineage before assuming the UI or scheduler is wrong.
- If checkpoint behavior looks wrong, compare lastPreparedCheckpointId and lastCommittedCheckpointId before resuming.
- If no nodes are ready, inspect upstream node states and graph structure rather than replaying the same dispatch.
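The triage order above can be condensed into a small lookup. The sketch assumes a reduced RunSnapshot shape containing only status and readyNodeIds, and it uses the status values listed in the table, which may not be exhaustive; the function name and returned wording are illustrative.

```typescript
// Status values as listed in the table above; the real set may be larger.
type RunStatus =
  | "ready" | "running" | "waiting" | "blocked_review"
  | "paused" | "completed" | "failed";

// Hypothetical reduced view of a run record.
interface RunSnapshot {
  status: RunStatus;
  readyNodeIds: string[];
}

function firstThingToInspect(run: RunSnapshot): string {
  switch (run.status) {
    case "paused":
      return "control state and operator actions";
    case "waiting":
      return "activeWaitState.kind, reasonCode, and resumeToken on the waiting node";
    case "blocked_review":
      return "newest correctionArcs entry";
    case "failed":
      return "reasonCode and evidenceRefs";
    case "completed":
      return "nothing: the run has finished";
    default:
      // ready or running: a run with no ready nodes points at upstream state.
      return run.readyNodeIds.length === 0
        ? "upstream node states and graph structure"
        : "nothing: ready nodes can dispatch";
  }
}

console.log(firstThingToInspect({ status: "waiting", readyNodeIds: [] }));
```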
Workflow Is Waiting and Will Not Advance (Phase 9.2)
Symptom: A run remains waiting, and no new nodes dispatch even though the run is not failed.
Meaning: The runtime is intentionally holding progress on a canonical wait state. The wait kind tells you what must happen next.
| Wait kind | Meaning | Fix |
|---|---|---|
| async_batch | External or long-running work has not been completed yet | Wait for the completion witness, then continue the same node with the current resumeToken |
| human_decision | A person must approve or reject the step | Submit the explicit decision instead of replaying the node |
| retry_backoff | Governance or retry policy deferred immediate progress | Clear the blocker or let the backoff condition resolve, then resume or reprompt the same node |
| checkpoint_commit | The node output exists, but the checkpoint commit is not durable yet | Wait for checkpoint commit completion and continue the same node rather than executing it again |
If the control state is paused_review or resuming, the wait can remain active with reason codes such as workflow_wait_paused_review or workflow_wait_resuming until the control-state transition is resolved.
Workflow Is Blocked for Review After Resume, Decision, or Retry (Phase 9.2)
Symptom: A run enters blocked_review after a continuation, human decision, or retry-related step.
Meaning: The runtime refused blind progress and recorded a correction posture in correctionArcs.
| Reason code | Cause | Fix |
|---|---|---|
| workflow_continuation_token_mismatch | A stale or wrong continuation token was supplied | Re-read the waiting node and use its current resumeToken |
| workflow_resume_review_required | The previous attempt has unknown_external_effect, so blind resume is unsafe | Review the evidence and explicitly approve continuation before resuming |
| workflow_human_decision_rejected | The human-decision node was rejected | Follow the correction arc and rollback/reprompt path instead of replaying the same continuation |
| workflow_resume_denied_hard_stopped | The project became hard_stopped during resume | Clear the hard stop before trying to continue |
| workflow_retry_backoff_resolution_required | A retry/backoff path needs explicit operator action | Apply the indicated retry or reprompt path, then continue from the same run |
Treat correctionArcs as authoritative runtime state. They are not suggestions; they are the canonical record of how the workflow expects recovery to proceed.
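The token-mismatch row is the easiest to reproduce locally. The sketch below compares a supplied continuation token with the waiting node's current resumeToken; the WaitingNode shape and function name are hypothetical, and the runtime's real check may consider more than string equality.

```typescript
// Hypothetical reduced view of a waiting node.
interface WaitingNode {
  nodeId: string;
  resumeToken?: string;
}

type ContinuationCheck =
  | { ok: true }
  | { ok: false; reasonCode: "workflow_continuation_token_mismatch" };

// A continuation is accepted only with the node's current token.
function checkContinuation(node: WaitingNode, suppliedToken: string): ContinuationCheck {
  if (node.resumeToken === undefined || node.resumeToken !== suppliedToken) {
    return { ok: false, reasonCode: "workflow_continuation_token_mismatch" };
  }
  return { ok: true };
}
```

If a check like this fails, re-read the waiting node to pick up its current resumeToken rather than resending the old one.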
Checkpoint Commit Stays Pending (Phase 9.2)
Symptom: The run stays waiting with activeWaitState.kind = checkpoint_commit, or lastPreparedCheckpointId changes while lastCommittedCheckpointId does not.
Meaning: The runtime prepared checkpoint state but has not yet durably committed it. Phase 9.2 does not allow blind continuation from provisional checkpoint state.
Fix:
- Wait for the checkpoint commit to succeed, then continue the same node/run with the committed checkpoint reference.
- Do not redispatch the node while checkpoint_commit is still active.
- If commit never completes, inspect checkpoint-manager evidence and the node's reasonCode before deciding whether the path should remain waiting or be escalated into a review workflow.
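A quick way to spot the pending-commit condition is to compare the two checkpoint fields. The sketch assumes that once a prepared checkpoint commits, lastCommittedCheckpointId equals lastPreparedCheckpointId; that equality is an assumption of this example rather than something this guide states, and the names are hypothetical.

```typescript
// Hypothetical reduced view of a run's checkpoint fields.
interface CheckpointView {
  lastPreparedCheckpointId?: string;
  lastCommittedCheckpointId?: string;
}

// Assumption: a committed checkpoint carries the same ID as the prepared one it came from.
function hasUncommittedCheckpoint(view: CheckpointView): boolean {
  return (
    view.lastPreparedCheckpointId !== undefined &&
    view.lastPreparedCheckpointId !== view.lastCommittedCheckpointId
  );
}
```

When this returns true, the run is still on provisional checkpoint state and should be left waiting rather than redispatched.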
Recovery Blocked or Failed (Phase 5.4)
Symptom: Workflow shows recovery_blocked_review_required or recovery_failed_hard_stop.
Meaning: The recovery system reached a terminal state that requires operator action or investigation.
| State | Cause | Fix |
|---|---|---|
| recovery_blocked_review_required | unknown_external_effect flagged; retry blocked (missing idempotency); or policy/invariant violation | Review evidence; use operator-control to authorize resume or escalate |
| recovery_failed_hard_stop | Integrity mismatch; policy violation; or unrecoverable failure | Investigate logs and evidence; use operator-control to change control state; re-trigger if appropriate |
Fix: See Recovery Governance for full operator guidance.