Incident Response

Alert triage map: match the name of the firing alert to its runbook section.

Alert -> runbook map

Alert (condition)                       Severity  Runbook section
ApiHighErrorRate (5xx > 1% over 5m)     P1        API high error rate
WorkerQueueStalled (> 10 min)           P2        Worker queue stalled
WebsocketConnectionsDrop (> 50% over 5m) P2       Websocket connections drop
SchedulerMissed (> 2 consecutive)       P3        Scheduler missed
MigrateJobFailed                        P1        Migrate Job failed
ArgoCDSyncFailed                        P1        Argo CD sync failed
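
The map above can also be expressed as a small lookup, which may be handy in an on-call helper script. This is a minimal sketch: the function name runbook_for is hypothetical, and the section names are the headings used on this page.

```shell
# Map an alert name to the runbook section heading used on this page.
# runbook_for is an illustrative name, not part of any existing tooling.
runbook_for() {
  case "$1" in
    ApiHighErrorRate)         echo "API high error rate" ;;
    WorkerQueueStalled)       echo "Worker queue stalled" ;;
    WebsocketConnectionsDrop) echo "Websocket connections drop" ;;
    SchedulerMissed)          echo "Scheduler missed" ;;
    MigrateJobFailed)         echo "Migrate Job failed" ;;
    ArgoCDSyncFailed)         echo "Argo CD sync failed" ;;
    *) echo "unknown alert: $1" >&2; return 1 ;;
  esac
}
```

For example, runbook_for WorkerQueueStalled prints "Worker queue stalled"; an unrecognized name returns a non-zero status so a calling script can stop early.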

API high error rate

Purpose

Investigate elevated 5xx responses on the api workload.

When to use

Alert ApiHighErrorRate fires.
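
To make the threshold concrete: the triage map lists the condition as 5xx above 1% over 5 minutes. The sketch below assumes this means the share of 5xx responses among all responses in the window; the actual alert rule is not shown here, and the request counts are hypothetical inputs you would read from your metrics.

```shell
# Succeed (exit 0) when 5xx responses exceed 1% of all responses.
# Usage: error_rate_breached <total_requests> <5xx_responses>
# Assumption: both counts cover the same 5-minute window.
error_rate_breached() {
  total=$1
  errors=$2
  # No traffic means there is no measurable error rate.
  [ "$total" -gt 0 ] || return 1
  # errors/total > 1/100 is equivalent to errors*100 > total (integers only).
  [ $((errors * 100)) -gt "$total" ]
}
```

With 10000 requests, 101 errors breaches the threshold and 100 errors (exactly 1%) does not, because the comparison is strict.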

Steps

  1. Check pod logs: kubectl logs -n <tenant> -l app=api --tail=100
  2. Check for a recent deploy: argocd app history <tenant>-laravel
  3. If a bad deploy is the cause: roll back by dispatching do-deploy.yml with the previous tag (see Promote Image)
  4. If the cause is infrastructure: check the DO status page and managed DB health
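
The first two steps can be wrapped so the tenant is filled in once. The following is a sketch only: api_triage and the DRY_RUN variable are hypothetical names, and the commands are exactly the ones from steps 1 and 2.

```shell
# Print (DRY_RUN=1) or run the first two triage commands for one tenant.
# api_triage and DRY_RUN are illustrative names, not existing tooling.
api_triage() {
  tenant=$1
  [ -n "$tenant" ] || { echo "usage: api_triage <tenant>" >&2; return 2; }
  for cmd in \
    "kubectl logs -n ${tenant} -l app=api --tail=100" \
    "argocd app history ${tenant}-laravel"
  do
    echo "+ $cmd"
    # Unquoted $cmd is intentional: the commands contain no quoted arguments.
    [ "${DRY_RUN:-0}" = "1" ] || $cmd
  done
}
```

Running it with DRY_RUN=1 only echoes the commands, which is a cheap way to confirm the tenant name before touching the cluster.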

Rollback

Re-dispatch do-deploy.yml with the last known-good tag.


Worker queue stalled

Purpose

Horizon is not processing queued jobs.

Steps

  1. Check Horizon status: kubectl exec -n <tenant> deploy/worker -- php artisan horizon:status
  2. Check Redis connectivity: kubectl exec -n <tenant> deploy/worker -- php artisan tinker --execute='dump(Illuminate\Support\Facades\Redis::connection()->ping());'
  3. Restart the worker Deployment if it is stuck: kubectl rollout restart deploy/worker -n <tenant>
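
The status output from step 1 can be turned into a suggested next step. This sketch assumes horizon:status reports one of "running", "paused", or "inactive" somewhere in its output; the exact wording may differ between Horizon versions, and horizon_next_step is a hypothetical helper name.

```shell
# Read `php artisan horizon:status` output on stdin and suggest a next step.
# Assumption: the output contains the word running, paused, or inactive.
horizon_next_step() {
  status=$(cat)
  case "$status" in
    *running*)  echo "running: check Redis connectivity next" ;;
    *paused*)   echo "paused: resume with php artisan horizon:continue" ;;
    *inactive*) echo "inactive: restart the worker deployment" ;;
    *)          echo "unrecognized: inspect the worker logs" ;;
  esac
}
```

To use it, pipe the output of the horizon:status command from step 1 into horizon_next_step.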

Websocket connections drop

Purpose

Reverb pods are losing active WebSocket connections.

Steps

  1. Check Reverb pods: kubectl get pods -n <tenant> -l app=websockets
  2. Check Redis broker connectivity (REVERB_SCALING_REPLICATION=redis)
  3. Check the health of the Gateway listener on port 6001
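
When there are many Reverb pods, the listing from step 1 can be filtered down to the unhealthy ones. This is a sketch that assumes the default column order of kubectl get pods (NAME, READY, STATUS, ...) with the header row suppressed by --no-headers; unhealthy_pods is a hypothetical helper name.

```shell
# Read `kubectl get pods --no-headers` output on stdin and print pods that
# are not Running or whose ready container count differs from the total.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")   # READY column, for example "1/1"
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
  }'
}
```

To use it, add --no-headers to the command in step 1 and pipe the result into unhealthy_pods; empty output means every listed pod is Running and fully ready.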

Scheduler missed

Purpose

The scheduler CronJob is not firing, or its Jobs are timing out.

Steps

  1. List CronJobs: kubectl get cronjobs -n <tenant>
  2. List the five most recent Jobs: kubectl get jobs -n <tenant> --sort-by=.metadata.creationTimestamp | tail -5
  3. Check the logs of any failed Job
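
To find which Jobs to pull logs for, the Job list can be reduced to those with failed pods. This sketch assumes the input comes from kubectl get jobs with -o custom-columns selecting .metadata.name, .status.succeeded and .status.failed, where kubectl prints <none> for unset fields; failed_jobs is a hypothetical helper name.

```shell
# Read lines of "NAME SUCCEEDED FAILED" on stdin and print Jobs whose
# failed pod count is set and greater than zero.
# Expected input (one line per Job), produced for example by:
#   kubectl get jobs -n TENANT --no-headers -o custom-columns=NAME:.metadata.name,SUCCEEDED:.status.succeeded,FAILED:.status.failed
failed_jobs() {
  awk '$3 != "<none>" && $3 + 0 > 0 { print $1 " failed=" $3 }'
}
```

Each printed name can then be passed to kubectl logs job/<name> -n <tenant>. A Job that retried and eventually succeeded will also appear here, since it still has failed pods.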

Migrate Job failed

Purpose

The PreSync migration Job failed, which blocks the Argo CD sync. Existing pods are still running the old code.

Steps

  1. Find the failed Job name: argocd app get <tenant>-laravel
  2. Read the Job logs: kubectl logs job/<migrate-job-name> -n <tenant>
  3. Fix the migration, or roll back the image tag by dispatching do-deploy.yml
  4. Once fixed, re-sync: argocd app sync <tenant>-laravel

Argo CD sync failed

Purpose

Argo CD cannot sync an Application.

Steps

  1. Read the conditions: argocd app get <app-name>
  2. Common causes: a failed PreSync hook (see Migrate Job failed), a network issue reaching the cluster, or an invalid manifest
  3. If the issue is connectivity to infra-ops: check the GitHub App credentials Secret in the argocd namespace