Operations Checklist

Verify shell health first, then follow the core operator path. Use the runbook or endpoint map when something stops working.

Core operator modes

  • Healthy: project pipeline state updates and event stream are current.
  • Degraded: health is up but event stream/pipeline updates are delayed.
  • Blocked: missing permissions, budget hard stop, or unresolved backend contract gaps.
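The three modes above can be read as a small decision rule. The sketch below is illustrative only: it assumes the health check has already passed, and the names classify_mode, OperatorMode, and the 30-second freshness threshold are hypothetical, not part of the product. Substitute the real SLO target for event freshness.

```python
from enum import Enum
from typing import Sequence


class OperatorMode(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    BLOCKED = "blocked"


# Hypothetical freshness threshold; replace with the real SLO target.
MAX_EVENT_LAG_SECONDS = 30.0


def classify_mode(event_lag_seconds: float,
                  blockers: Sequence[str] = ()) -> OperatorMode:
    """Classify the operator mode, assuming the health check already passed.

    blockers lists any active blocking conditions, e.g. missing permissions,
    a budget hard stop, or an unresolved backend contract gap.
    """
    if blockers:
        # Any blocking condition wins, however fresh the events are.
        return OperatorMode.BLOCKED
    if event_lag_seconds > MAX_EVENT_LAG_SECONDS:
        # Health is up, but pipeline/stream updates are delayed.
        return OperatorMode.DEGRADED
    return OperatorMode.HEALTHY
```

Checking blockers first reflects the assumption that a blocked shell should never be reported as merely degraded.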

Pre-check before shift

  1. Open dashboard and verify health status.
  2. Open projects and confirm at least one project loads.
  3. Open one project and verify pipeline state endpoint responds.
  4. Open endpoint map and verify deployment contract version.

Core operator flow

  1. Start triage: open the files route (/projects/[projectId]/files), upload files, and verify that the pipeline is running in the triage phase.
  2. Review plan: open the plan route (/projects/[projectId]/plan), commit or resume the plan, and verify that the plan version increments.
  3. Run evaluation: open the calibration route (/projects/[projectId]/runs/[runId]/calibration), launch a full run, and verify that you are redirected to the live route with a new run id.
  4. Review results: open the results route (/projects/[projectId]/runs/[runId]/results) and inspect filters, evidence, and applicant receipts.
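The four routes in the flow above follow one pattern, so they can be generated from a project id and run id. This is a minimal sketch; the helper names are hypothetical. The live route is left out because its path is not given in this checklist.

```python
from urllib.parse import quote


def _segment(value: str) -> str:
    """Percent-encode a value so it is safe as a single path segment."""
    return quote(str(value), safe="")


def files_route(project_id: str) -> str:
    return f"/projects/{_segment(project_id)}/files"


def plan_route(project_id: str) -> str:
    return f"/projects/{_segment(project_id)}/plan"


def calibration_route(project_id: str, run_id: str) -> str:
    return (f"/projects/{_segment(project_id)}"
            f"/runs/{_segment(run_id)}/calibration")


def results_route(project_id: str, run_id: str) -> str:
    return (f"/projects/{_segment(project_id)}"
            f"/runs/{_segment(run_id)}/results")
```

For example, files_route("p1") yields /projects/p1/files.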

Incident triage checklist

  1. Capture the project id, run id, route, request id, and timestamp.
  2. Check run state, stream/events, and budget endpoints.
  3. Classify as auth, validation, transient infra, budget stop, or orchestration failure.
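One way to make steps 1 and 3 hard to skip is to record each incident in a fixed structure. The sketch below is an assumption about how such a record could look; IncidentRecord, IncidentCategory, and missing_fields are hypothetical names, and the ISO 8601 timestamp format is assumed rather than specified by this checklist.

```python
from dataclasses import dataclass, fields
from enum import Enum
from typing import List, Optional


class IncidentCategory(Enum):
    AUTH = "auth"
    VALIDATION = "validation"
    TRANSIENT_INFRA = "transient infra"
    BUDGET_STOP = "budget stop"
    ORCHESTRATION_FAILURE = "orchestration failure"


@dataclass(frozen=True)
class IncidentRecord:
    project_id: str
    run_id: str
    route: str
    request_id: str
    timestamp: str  # assumed ISO 8601, e.g. "2024-01-01T09:30:00Z"
    category: Optional[IncidentCategory] = None  # set in step 3


def missing_fields(record: IncidentRecord) -> List[str]:
    """Return the names of required capture fields that are still empty."""
    return [
        f.name
        for f in fields(record)
        if f.name != "category" and not getattr(record, f.name)
    ]
```

An empty list from missing_fields means step 1 is complete and the incident is ready to classify.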

Escalation path

  1. Start here for the core operator sequence.
  2. Endpoint map: contract or permissions failures.
  3. Runbook: route-specific recovery.

Endpoints for checking run state, events, and budget:

  • GET /api/v1/projects/{project_id}/runs/{run_id}/state
  • GET /api/v1/projects/{project_id}/runs/{run_id}/events/stream
  • GET /api/v1/projects/{project_id}/budget?run_id={run_id}
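The three endpoint paths above can be filled in from the project id and run id captured during triage. This is a minimal sketch with a hypothetical helper name, triage_endpoints. It returns paths only, and the host and authentication are deployment-specific and not covered here.

```python
from typing import Dict
from urllib.parse import quote, urlencode


def triage_endpoints(project_id: str, run_id: str) -> Dict[str, str]:
    """Build the state, event stream, and budget paths for one run."""
    project = quote(project_id, safe="")  # encode as a single path segment
    run = quote(run_id, safe="")
    run_base = f"/api/v1/projects/{project}/runs/{run}"
    return {
        "state": f"{run_base}/state",
        "events": f"{run_base}/events/stream",
        # The budget endpoint takes the run id as a query parameter.
        "budget": f"/api/v1/projects/{project}/budget?"
                  + urlencode({"run_id": run_id}),
    }
```

All three are GET requests, so the returned paths can be passed to any HTTP client once the base URL is prepended.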

Troubleshooting matrix

  • 401/403: verify user role and org membership.
  • 409: plan version conflict; reload plan and retry.
  • 429: throttled or budget-limited; wait or request approval.
  • 5xx: capture request id and escalate.
  • Stale UI: compare event freshness against SLO targets.
  • Cancelled run: confirm the cancellation reason, then continue from run history or the new run action.
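The status-code rows of the matrix above can be expressed as a lookup. The sketch below uses a hypothetical function name, first_action, and covers only the HTTP rows; the stale UI and cancelled run rows are not tied to a status code and are left out. The fallback message for unlisted codes is an assumption, not part of the matrix.

```python
def first_action(status_code: int) -> str:
    """Return the first troubleshooting action for an HTTP status code."""
    if status_code in (401, 403):
        return "verify user role and org membership"
    if status_code == 409:
        return "plan version conflict; reload plan and retry"
    if status_code == 429:
        return "throttled or budget-limited; wait or request approval"
    if 500 <= status_code <= 599:
        return "capture request id and escalate"
    # Assumed fallback: the matrix does not list other codes.
    return "not covered by the matrix; follow the incident triage checklist"
```

For example, first_action(409) returns the plan-conflict action, and any 5xx code maps to escalation.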

Release readiness artifacts

  • Troubleshooting runbook
  • Release readiness templates: docs/v2-architecture/40-frontend/operations/05-RELEASE-READINESS-TEMPLATES.md