Incident Response
Alert triage map. Match alert name to runbook section.
Alert -> runbook map
| Alert | Severity | Runbook section |
|---|---|---|
| ApiHighErrorRate (5xx > 1% 5m) | P1 | API errors |
| WorkerQueueStalled (> 10min) | P2 | Queue stalled |
| WebsocketConnectionsDrop (> 50% 5m) | P2 | WS drop |
| SchedulerMissed (> 2 consecutive) | P3 | Scheduler |
| MigrateJobFailed | P1 | Migrate failed |
| ArgoCDSyncFailed | P1 | Argo sync |
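The map above can also be written as a small lookup for use in triage scripts. This is a minimal sketch: the function name `runbook_for` is hypothetical, and the severities and section names are copied from the table.

```shell
# Hypothetical helper: print "<severity> <runbook section>" for an alert name.
# Mappings are copied from the alert table; unknown alerts return non-zero.
runbook_for() {
  case "$1" in
    ApiHighErrorRate)         echo "P1 API errors" ;;
    WorkerQueueStalled)       echo "P2 Queue stalled" ;;
    WebsocketConnectionsDrop) echo "P2 WS drop" ;;
    SchedulerMissed)          echo "P3 Scheduler" ;;
    MigrateJobFailed)         echo "P1 Migrate failed" ;;
    ArgoCDSyncFailed)         echo "P1 Argo sync" ;;
    *) echo "unknown alert: $1" >&2; return 1 ;;
  esac
}
```

For example, `runbook_for MigrateJobFailed` prints `P1 Migrate failed`.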
API high error rate
Purpose
Investigate elevated 5xx responses on the api workload.
When to use
Alert ApiHighErrorRate fires.
Steps
- Check pod logs:
  `kubectl logs -n <tenant> -l app=api --tail=100`
- Check recent deploy:
  `argocd app history <tenant>-laravel`
- If bad deploy: roll back via `do-deploy.yml` dispatch with the previous tag (see Promote Image)
- If infra issue: check DO status page, managed DB health
Rollback
Re-dispatch `do-deploy.yml` with the last known-good tag.
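The rollback dispatch can be sketched as a dry-run helper that prints the command instead of running it. This assumes `do-deploy.yml` is a GitHub Actions workflow triggered through `workflow_dispatch` and dispatched with the GitHub CLI. The input names `tenant` and `tag` and the helper name `rollback_cmd` are hypothetical, so check the workflow file for the real inputs.

```shell
# Hypothetical helper: print (not run) the dispatch command for a rollback.
# Assumes do-deploy.yml accepts workflow_dispatch inputs named "tenant" and
# "tag"; both input names are assumptions, not taken from the workflow.
rollback_cmd() {
  tenant="$1"
  tag="$2"
  if [ -z "$tenant" ] || [ -z "$tag" ]; then
    echo "usage: rollback_cmd <tenant> <known-good-tag>" >&2
    return 2
  fi
  printf 'gh workflow run do-deploy.yml -f tenant=%s -f tag=%s\n' "$tenant" "$tag"
}
```

Printing the command first lets the on-call engineer confirm the tag before dispatching it.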
Worker queue stalled
Purpose
Horizon not processing jobs.
Steps
- `kubectl exec -n <tenant> deploy/worker -- php artisan horizon:status`
- Check Redis connectivity:
  `kubectl exec -n <tenant> deploy/worker -- php artisan tinker --execute="Cache::ping()"`
- Restart worker pod if stuck:
  `kubectl rollout restart deploy/worker -n <tenant>`
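The order of the steps above can be sketched as a small decision helper. It assumes the `horizon:status` output contains the phrase "is running" when Horizon is healthy. That wording is an assumption to verify against your Horizon version, and the helper name `horizon_next_step` is hypothetical.

```shell
# Hypothetical helper: choose the next runbook step from horizon:status output.
# Assumes a healthy Horizon reports a line containing "is running"; any other
# output is treated as stuck.
horizon_next_step() {
  case "$1" in
    *"is running"*) echo "check-redis" ;;    # Horizon is up: look at Redis next
    *)              echo "restart-worker" ;; # paused, inactive, or unknown
  esac
}
```

Pass it the captured output of the first step, for example `horizon_next_step "$status"`.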
Websocket connections drop
Purpose
Reverb pod(s) losing active WS connections.
Steps
- Check Reverb pods:
  `kubectl get pods -n <tenant> -l app=websockets`
- Check Redis broker connectivity (`REVERB_SCALING_REPLICATION=redis`)
- Check Gateway listener :6001 health
Scheduler missed
Purpose
CronJob not firing or timing out.
Steps
- `kubectl get cronjobs -n <tenant>`
- `kubectl get jobs -n <tenant> --sort-by=.metadata.creationTimestamp | tail -5`
- Check job logs for failures
Migrate Job failed
Purpose
PreSync migration failed, blocking Argo sync. Existing pods still running old code.
Steps
- `argocd app get <tenant>-laravel` and note the failed Job name
- `kubectl logs job/<migrate-job-name> -n <tenant>`
- Fix migration or roll back image tag via `do-deploy.yml` dispatch
- Once fixed:
  `argocd app sync <tenant>-laravel`
Argo CD sync failed
Purpose
Argo CD cannot sync an Application.
Steps
- `argocd app get <app-name>` and read the conditions
- Common causes: PreSync hook failed (see Migrate Job failed), network issue to cluster, invalid manifest
- If network issue to infra-ops: check the GitHub App credentials Secret in the `argocd` namespace