AWS to DO Backend Deploy Preparation Plan
Linear-project-and-tickets plan. NOT implementation. Each ticket has context, scope, AC, validation/verification steps. Mirrors format of aws-to-do-data-migration_f4a8604e.plan.md.
1. Goal
Build the deploy plumbing for [soa/](soa/) Laravel backend on DOKS (cluster oep-prd-cluster). End state: all 3 tenant envs (oep-stg, mansety-prd, us-prd) running on DOKS via ArgoCD with no live traffic yet. Per-tenant traffic cutover = separate Linear project per tenant.
2. Scope
In scope
- Doppler workspace structure (project + configs only) managed via Terraform; service tokens generated manually.
- Migrate org-wide GH Actions secrets to Doppler; remove the org-level GH copies.
- TF additions in `unipuka-infra-do/oep-infra/` for: namespaces, DOCR pullsecret, Doppler service-token K8s Secret bootstrap, Argo CD repo creds Secret, Argo CD `helm_release` install + root Application apply.
- `unipuka-infra-ops/` GitOps repo full layout (platform chart + apps charts + AppProjects + ApplicationSets + docs).
- Argo CD + ESO + Gateway + kube-prometheus-stack + Loki + OpenCost installed and tenant-scoped.
- `laravel-app` Helm chart (api + worker + websockets/Reverb + scheduler CronJob + PreSync migrate Job).
- New CI workflows in `soa/.github/workflows/`: build + push DOCR (`do-build.yml`), gitops deploy (`do-deploy.yml`).
- Add `/health` endpoint in soa.
- All 3 tenant envs deployed (idle, no traffic) and parity-verified.
Out of scope
- Per-tenant traffic cutover (DNS flip, AWS drain, traffic switch) -> separate `AWS to DO Backend Cutover: <tenant>` Linear project per tenant.
- Extracting reusable workflows to [unipuka-github-workflows/](unipuka-github-workflows/) -> future iteration after CI stabilises.
- Argo Rollouts / canary deploys -> future iteration post-cutover-stable.
- Image security scanning blocking gate (Trivy advisory only).
- Removing legacy AWS workflows -> deferred to cutover project per tenant.
3. Locked design decisions (from grilling)
- Manifest tool: Helm everywhere (laravel-app chart + platform chart).
- Bootstrap: Pure Terraform. `helm_release` in `unipuka-infra-do/oep-infra/` installs Argo CD on DOKS. TF `kubernetes_manifest` applies the App-of-Apps root Application. Argo CD then syncs everything else from `unipuka-infra-ops/` (AppProjects, ApplicationSets, platform chart, app charts). No `bootstrap.sh`. TF stays the source of truth for Argo CD's own chart version + values.
- Deploys: Plain `Deployment` v1, rollingUpdate strategy. Argo Rollouts post-cutover.
- Image: One env-agnostic image per commit. Tag = git-SHA + semver. DOCR = `registry.digitalocean.com/oep` (DO container registry `oep` in region `ams3` per `unipuka-infra-do/oep-infra/registry.tf` + default region in `unipuka-infra-do/oep-infra/variables.tf`). Full path `registry.digitalocean.com/oep/api:<tag>`. Same digest used across all tenants.
- Deploy gating: Manual `workflow_dispatch` for ALL envs. Direct commit to `unipuka-infra-ops/master` from the deploy workflow. GH Environment + required reviewers for prod tenants. `do-build.yml` also supports `workflow_dispatch` with `ref` input to rebuild from any commit.
- Doppler structure: ONE Doppler project per service (`oep`), configs = env (`oep-stg`, `mansety-prd`, `us-prd`) + a `base` config for shared keys. Project + config structure is portable to any ESO-compatible store (paths preserved). Service tokens generated manually in Doppler UI (NOT through TF) to keep token values out of Doppler-provider TF state; tokens land as GH repo-level secrets (BD-1.3) + K8s Secrets (BD-3.3) via separate manual steps.
- Reverb websockets: Same shared Gateway, additional HTTPS listener on port 6001 with TLS terminate (Cloudflare Origin cert). Zero frontend env change. Multi-replica + Redis broker for HA.
- Migrate hook: PreSync (NOT PostSync). Convention: migrations must be backward-compatible to support rollback. Update CLAUDE.md.
- ApplicationSet: v1 = List generator only (3 tenants, 1 chart). Per-tenant `targetRevision` overridable from List element for staging-only test of template changes (see Section 5.5). Matrix generator deferred to when 2nd app joins.
- Argo Projects: Two AppProjects: `platform` (cluster-scoped) and `tenants` (namespace-scoped to `oep-stg|mansety-prd|us-prd`).
- Sync policy: `prune: true` + `selfHeal: true` for ALL Argo Applications (incl. prd). infra-ops master is the source of truth; removed manifests get deleted. Branch protection on infra-ops (review-required for prod values changes) is the gate.
- Monitoring: kube-prometheus-stack + Loki + OpenCost in scope.
- Trivy: advisory only.
- Cutover: per-tenant, separate project per tenant.
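The PreSync migrate-hook decision above could render to a Job along these lines. This is a minimal sketch, not the chart's actual template: the Job name and the `laravel-app-env` Secret name are hypothetical placeholders, and the deadline, backoff and command values are simply the ones listed for BD-4.6.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: laravel-migrate                 # hypothetical name
  annotations:
    # Argo CD runs this Job before it applies the rest of the sync.
    argocd.argoproj.io/hook: PreSync
    # Delete the previous hook Job only when a new one is about to be created.
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  backoffLimit: 0                       # fail fast, no retries
  activeDeadlineSeconds: 1800
  template:
    spec:
      restartPolicy: Never
      imagePullSecrets:
        - name: do-registry
      containers:
        - name: migrate
          image: registry.digitalocean.com/oep/api:<tag>
          command: ["php", "artisan", "migrate", "--force", "--isolated"]
          envFrom:
            - secretRef:
                name: laravel-app-env   # hypothetical Secret name
```

Because the hook runs before the Deployments are updated, the old pods keep serving against the migrated schema for a short window, which is why the backward-compatible migration convention matters.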
4. Linear project
- Project name: `AWS to DO Migration: Backend Deploy Preparation`
- Identifier prefix: `BD-` (Backend Deploy). Milestones `BD-0..BD-8`; tickets `BD-<milestone>.<n>`.
- Summary: "Prep work to run Laravel backend on DOKS via ArgoCD GitOps. Builds CI, Doppler structure, platform components (Argo+ESO+Gateway+monitoring), `laravel-app` Helm chart, and idle deploys to all 3 tenants. Per-tenant traffic cutover is a separate project per tenant."
- Labels: reuse `migration`, `infra`, `backend`. Add `deploy-prep`.
- Acceptance for project closure: all 3 tenant deployments running idle on DOKS, parity-verified against AWS source-of-truth, CI green end-to-end on workflow_dispatch, Doppler is the only source for backend secrets, monitoring stack live with per-workload alerts, runbooks committed. Cutover preconditions handed off to per-tenant cutover projects.
5. End-state architecture
flowchart LR
subgraph soa [soa repo]
BuildWf["do-build.yml<br/>(dispatch or merge)"]
DeployWf["do-deploy.yml<br/>(dispatch + tenant + tag)"]
end
subgraph DOCR
Img["api:<tag>"]
end
subgraph InfraOps [unipuka-infra-ops]
Platform["platform/<br/>(ESO, Gateway, NetPol,<br/>quotas, monitoring)"]
AppsLaravel["apps/laravel/<br/>(Helm chart +<br/>values-<tenant>.yaml)"]
Root["bootstrap/<br/>root Application +<br/>ApplicationSets"]
end
subgraph DOKS [DOKS oep-prd-cluster]
Argo[ArgoCD]
ESO[External Secrets Operator]
Gateway["Shared Gateway<br/>:443 + :6001"]
Mon["kube-prometheus-stack<br/>Loki + OpenCost"]
subgraph Tenants
OepStg["oep-stg ns<br/>(api+worker+ws+sched)"]
ManPrd["mansety-prd ns"]
UsPrd["us-prd ns"]
end
end
Doppler[(Doppler<br/>oep project)]
subgraph InfraDo [unipuka-infra-do/oep-infra]
Tf["TF: namespaces, DOCR pullsecret,<br/>Doppler service-token Secret,<br/>Argo repo creds, Doppler project+configs,<br/>helm_release argo-cd, root Application"]
end
BuildWf --> Img
DeployWf -->|direct commit| InfraOps
Tf -.->|kubernetes + helm providers| DOKS
Tf -.->|doppler provider: project + configs only| Doppler
Argo -->|watches| InfraOps
Argo -->|applies| Platform
Argo -->|applies| AppsLaravel
ESO -->|sync| Doppler
ESO -->|writes K8s Secret| Tenants
Img -->|imagePullSecret| Tenants
Gateway -->|HTTPRoute| Tenants
BuildWf -.->|doppler run| Doppler
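The "Shared Gateway :443 + :6001" node in the diagram could be expressed with the Gateway API roughly as follows, shown for one tenant only. This is a sketch under assumptions: the Gateway name, its namespace, the `gatewayClassName`, the websocket hostname and the certificate Secret names are all hypothetical, and the `tenant` namespace label is assumed to be the one applied in BD-3.1.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: shared-gateway                  # hypothetical
  namespace: gateway-system             # hypothetical
spec:
  gatewayClassName: example-class       # assumption: depends on the controller installed on DOKS
  listeners:
    - name: http-redirect
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: Same                    # only the redirect HTTPRoute in the Gateway's own ns
    - name: https-oep-stg
      protocol: HTTPS
      port: 443
      hostname: api-staging.unipuka.app
      tls:
        mode: Terminate
        certificateRefs:
          - kind: Secret
            name: cloudflare-origin-api-staging   # hypothetical
      allowedRoutes:
        namespaces:
          from: Selector
          selector:
            matchLabels:
              tenant: oep-stg           # tenant isolation via namespace label
    - name: ws-oep-stg
      protocol: HTTPS
      port: 6001
      hostname: ws-staging.unipuka.app  # hypothetical ws hostname
      tls:
        mode: Terminate
        certificateRefs:
          - kind: Secret
            name: cloudflare-origin-ws-staging    # hypothetical
      allowedRoutes:
        namespaces:
          from: Selector
          selector:
            matchLabels:
              tenant: oep-stg
```

Each further tenant would add its own pair of HTTPS listeners with a matching namespace selector, so an HTTPRoute in one tenant namespace cannot attach to another tenant's hostname.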
5.5 Template versioning + env-promotion strategy
Two independent version dials, decoupled on purpose:
- Image tag (= app code version). Per-tenant `image.tag` in `apps/laravel/values-<tenant>.yaml`. Bumped by `do-deploy.yml` workflow dispatch. Promotion = same image tag promoted across envs by re-running the workflow per tenant. Allows older tags to be re-deployed (just dispatch with the older `<tag>`).
- Chart version (= manifest template version). Bumped in `apps/laravel/Chart.yaml` (semver) when templates, defaults, or platform-side contract change. CI lint job in `unipuka-infra-ops/.github/workflows/helm-lint.yml` enforces version bump on template diff. Pushed to master in a PR.
To test template changes in staging only without affecting prd:
- Open PR branch in `unipuka-infra-ops` (e.g. `feat/reverb-redis-broker-tune`).
- ApplicationSet List generator gives each tenant element a `targetRevision` field. Default = `master`. Edit the `oep-stg` element in `bootstrap/applicationsets/laravel.yaml` to `targetRevision: feat/reverb-redis-broker-tune` ON THE PR BRANCH itself.
- Push PR branch. Argo CD doesn't auto-pick this up because the ApplicationSet manifest is still on master.
- Manually merge ONLY the ApplicationSet change (the `targetRevision` flip for `oep-stg`) to master via a small follow-up PR, OR temporarily patch the live ApplicationSet via `kubectl edit` (recorded in runbook). Then Argo CD points `oep-stg` Application at the PR branch; prd Applications stay on master.
- Validate in `oep-stg`. When happy, merge the original PR to master, then revert/clean up the `targetRevision: feat/...` back to `master` for `oep-stg`. Prods catch up on next dispatch.
- Runbook for this lives in `unipuka-infra-ops/docs/runbooks/test-template-on-staging.md` (BD-8.2).
ApplicationSet skeleton (illustrative):
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: laravel
spec:
  generators:
    - list:
        elements:
          - tenant: oep-stg
            targetRevision: master
          - tenant: mansety-prd
            targetRevision: master
          - tenant: us-prd
            targetRevision: master
  template:
    metadata:
      name: '{{tenant}}-laravel'
    spec:
      project: tenants
      source:
        repoURL: https://github.com/unipuka/unipuka-infra-ops.git
        targetRevision: '{{targetRevision}}'
        path: apps/laravel
        helm:
          valueFiles:
            - 'values-{{tenant}}.yaml'
      destination:
        namespace: '{{tenant}}'
        server: https://kubernetes.default.svc
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
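Against the skeleton above, the staging-only template test touches a single List element. The fragment below is a sketch of the generator section during the test window, reusing the example branch name from the steps; the prd elements are deliberately left on `master`.

```yaml
# bootstrap/applicationsets/laravel.yaml (generator section only)
spec:
  generators:
    - list:
        elements:
          - tenant: oep-stg
            # Temporary: points only the oep-stg Application at the PR branch.
            # Revert to master once the original PR is merged.
            targetRevision: feat/reverb-redis-broker-tune
          - tenant: mansety-prd
            targetRevision: master
          - tenant: us-prd
            targetRevision: master
```

Keeping the override in the List element rather than in the template means the change is visible in one small diff and is trivial to revert.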
6. Repository layouts at completion
6.1 unipuka-infra-ops/ (greenfield)
unipuka-infra-ops/
  bootstrap/
    root-application.yaml  # App-of-Apps root. Applied by TF kubernetes_manifest. Argo CD syncs everything under bootstrap/ + platform/ + apps/ from this point.
    appprojects/
      platform.yaml  # AppProject: cluster-scoped, sourceRepo unipuka-infra-ops
      tenants.yaml  # AppProject: namespace-scoped (oep-stg|mansety-prd|us-prd)
    applicationsets/
      laravel.yaml  # ApplicationSet: List generator over 3 tenants -> apps/laravel; per-element targetRevision (Section 5.5)
      platform.yaml  # Application: pulls platform/ chart into the cluster
  platform/
    Chart.yaml
    values.yaml  # global toggles
    templates/
      namespaces.yaml  # all tenant + platform ns with labels
      resourcequotas.yaml  # per-tenant ResourceQuota
      limitranges.yaml  # per-tenant LimitRange
      networkpolicies.yaml  # default-deny + allow rules
      gateway.yaml  # shared Gateway: listeners :443 (per-tenant hostnames) + :6001 (per-tenant ws hostnames)
      cloudflare-origin-cert.yaml  # TLS Secret per hostname, refs ExternalSecret
      eso-clustersecretstores.yaml  # 1 ClusterSecretStore per Doppler config
    charts/
      external-secrets/  # ESO operator Helm subchart
      kube-prometheus-stack/  # Helm subchart, scrape configs, alerts
      loki/  # Loki + Promtail/Alloy
      opencost/  # OpenCost Helm
  apps/
    laravel/
      Chart.yaml
      values.yaml  # defaults
      values-oep-stg.yaml
      values-mansety-prd.yaml
      values-us-prd.yaml
      templates/
        _helpers.tpl
        api-deployment.yaml  # nginx + php-fpm via supervisord (or split, see BD-4.2)
        api-service.yaml
        api-httproute-443.yaml
        worker-deployment.yaml  # horizon
        websockets-deployment.yaml  # reverb, multi-replica, redis broker
        websockets-service.yaml
        websockets-httproute-6001.yaml
        scheduler-cronjob.yaml  # */1 schedule:run
        migrate-presync-job.yaml  # PreSync hook
        externalsecret.yaml  # per-workload envFrom
        pdb.yaml
        hpa.yaml
        servicemonitor.yaml  # Prometheus scrape
        alertrules.yaml  # PrometheusRule
  docs/  # rendered to docs.unipuka.app via mkdocs-material (BD-2)
    index.md  # landing
    runbooks/
      bootstrap.md  # how to terraform apply the TF helm_release + root Application + day-0 prerequisites
      add-tenant.md
      rotate-secrets.md  # Doppler service-token rotation (manual)
      promote-image.md  # how to use do-deploy.yml; re-deploy older tag; DOCR retention
      test-template-on-staging.md  # per Section 5.5
      argocd-break-glass.md  # how to use built-in admin secret when Dex/GitHub OAuth is down
      incident-response.md  # alert -> runbook map
    adr/
      index.md  # auto-generated index of ADRs
      _template.md  # ADR template (Context/Decision/Status/Consequences/Alternatives)
      0001-secret-management.md  # Doppler structure rationale + portability
    audit/
      secrets-audit.md  # BD-1.4 deliverable
      cutover-precondition-oep-stg.md
      cutover-precondition-mansety-prd.md
      cutover-precondition-us-prd.md
  mkdocs.yml  # mkdocs-material config (BD-2.1)
  requirements.txt  # mkdocs + plugins pinned versions
  .github/workflows/
    helm-lint.yml  # chart-testing on PR
    docs-build.yml  # mkdocs --strict on PR; build+deploy on master push (BD-2.2)
  README.md
6.2 soa/.github/workflows/ (new files added)
do-build.yml # lint + test + build + Trivy advisory + push DOCR + Sentry release
do-deploy.yml # workflow_dispatch(tag,tenant) -> commit to unipuka-infra-ops master
(Legacy oep-staging-pipeline.yml, mansety-production-pipeline.yml, us-production-pipeline.yml stay until per-tenant cutover.)
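As a rough illustration of the `do-deploy.yml` contract described above (`workflow_dispatch(tag,tenant)` -> commit to `unipuka-infra-ops` master), a workflow could look like the sketch below. Assumptions, none of which come from the plan: the push-token secret name `INFRA_OPS_PUSH_TOKEN`, a GH Environment named after each tenant, the commit message format, and `yq` v4 being available on the runner.

```yaml
name: do-deploy
on:
  workflow_dispatch:
    inputs:
      tag:
        description: Image tag to deploy (git SHA or semver)
        required: true
        type: string
      tenant:
        description: Target tenant
        required: true
        type: choice
        options: [oep-stg, mansety-prd, us-prd]

jobs:
  deploy:
    runs-on: ubuntu-latest
    # Assumes a GH Environment per tenant; required reviewers on prod gate the job.
    environment: ${{ inputs.tenant }}
    steps:
      - uses: actions/checkout@v4
        with:
          repository: unipuka/unipuka-infra-ops
          ref: master
          token: ${{ secrets.INFRA_OPS_PUSH_TOKEN }}   # hypothetical secret name
      - name: Bump image.tag and commit
        env:
          TAG: ${{ inputs.tag }}
          TENANT: ${{ inputs.tenant }}
        run: |
          # Rewrite only the tenant's image tag; everything else stays untouched.
          yq -i '.image.tag = strenv(TAG)' "apps/laravel/values-${TENANT}.yaml"
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git commit -am "deploy(${TENANT}): api ${TAG}"
          git push origin master
```

Note that `git commit` exits non-zero when the tag is unchanged, so re-dispatching the currently deployed tag would fail the job rather than push an empty commit; whether that is desirable is a design choice for the real workflow.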
6.3 unipuka-infra-do/oep-infra/ (additions only)
providers.tf # add doppler + kubernetes + helm + cloudflare providers
variables.tf # add doppler_admin_token, doppler_service_token_oep_stg|mansety_prd|us_prd, argocd_github_app_id, argocd_github_app_private_key, argocd_github_app_installation_id
doppler.tf # Doppler project oep + configs (base, oep-stg, mansety-prd, us-prd) via Doppler TF provider. NO service tokens managed here (manual).
k8s_namespaces.tf # tenant + platform namespaces (argocd, external-secrets, monitoring, oep-stg, mansety-prd, us-prd) via kubernetes_namespace
k8s_secrets.tf # one-off K8s Secrets via kubernetes provider:
# - DOCR imagePullSecret `do-registry` in each tenant + monitoring + external-secrets ns
# - Doppler service-token Secret per tenant in external-secrets ns (value from var.doppler_service_token_<env>, NOT from Doppler API)
# - Argo CD repo creds Secret in argocd ns (GH App from sensitive vars)
argocd.tf # helm_release `argocd` from argo Helm chart (CRDs included) + kubernetes_manifest for bootstrap/root-application.yaml. Pinned chart version. Ongoing argo-cd upgrades = TF apply.
docs_site.tf # Cloudflare Pages project for docs.unipuka.app, deploy_on_push from unipuka-infra-ops/master, Cloudflare Access policy bound to unipuka GH org (BD-2.3)
registry.tf (existing) is extended in BD-5.5 to add GC retention policy (30 tagged images target) as DOCR doesn't expose retention as a first-class TF arg.
7. Milestones
BD-0: Planning + project bootstrap (3 tickets)
- BD-0.1 Create the Linear project `AWS to DO Migration: Backend Deploy Preparation` with milestones BD-1..BD-8 and label `deploy-prep`. Paste this plan into the project description. Commit canonical copy to [unipuka-infra-ops/docs/backend-deploy-prep-plan.md](unipuka-infra-ops/docs/backend-deploy-prep-plan.md). (1pt)
- BD-0.2 Audit + update `CLAUDE.md` to reflect locked decisions + real infra state. Known discrepancies as of plan time: (a) Infra section says DO region `fra1`, actual = `ams3` per `unipuka-infra-do/oep-infra/variables.tf` line 22; (b) Kubernetes section says PostSync migrate hook, locked decision = PreSync; (c) Bootstrap wording mentions ESO + Doppler but predates locked Doppler project structure (project=oep, configs=env, manual service tokens); (d) AppProject naming not specified, locked = `platform` + `tenants`; (e) Argo CD install path not specified, locked = TF `helm_release` (no `bootstrap.sh`); (f) Monitoring stack listed but observability scope was unclear, locked = full kube-prometheus-stack + Loki + OpenCost in scope; (g) Gateway listeners not enumerated, locked = `:80 redirect`, `:443`, `:6001` (Reverb). Update each section. Re-audit at BD-8.1 for any further drift. (2pt)
  - AC: Each discrepancy above resolved in CLAUDE.md with a commit referencing this ticket.
  - V&V: Diff CLAUDE.md vs actual TF + plan Section 3 produces zero contradictions.
- BD-0.3 ADR: secret-management approach. Document the Doppler structure decision (project = service `oep`, configs = env), the boundary between Doppler-managed secrets and Helm-managed static config, the manual-service-token rationale, the ESO sync model, and rotation cadence. Commit to `unipuka-infra-ops/docs/adr/0001-secret-management.md`. (1pt)
BD-1: Doppler foundation (5 tickets)
- BD-1.1 Add Doppler Terraform provider (`DopplerHQ/doppler`) to `unipuka-infra-do/oep-infra/providers.tf`. Add `doppler_admin_token` variable (workspace admin token used ONLY at TF apply time; passed via `TF_VAR_doppler_admin_token` env, never committed). Bootstrap workspace: `terraform init`, verify provider auth. (1pt)
  - AC: `terraform plan` runs clean with Doppler provider configured.
  - V&V: `terraform providers` lists Doppler; planned changes are empty for `doppler.tf` placeholder.
- BD-1.2 TF: Create Doppler project `oep` + configs `base`, `oep-stg`, `mansety-prd`, `us-prd` (configs inherit from `base`). Lifecycle: `prevent_destroy = true` on the project. File: `unipuka-infra-do/oep-infra/doppler.tf`. NOTE: this ticket manages only the project + configs (structure), NOT secret values, NOT service tokens. (2pt)
  - AC: Doppler dashboard shows project + 4 configs with inheritance from `base`.
  - V&V: `doppler configs --project oep` lists all 4.
- BD-1.3 Manually generate 3 Doppler service tokens (one per non-`base` config) in the Doppler UI with read-only scope. For each token, manually create a GH secret at the `unipuka/soa` repo level (NOT org-level) named `DOPPLER_TOKEN_OEP_STG`, `DOPPLER_TOKEN_MANSETY_PRD`, `DOPPLER_TOKEN_US_PRD`. Token values are also stored in TF input vars (`TF_VAR_doppler_service_token_oep_stg`, etc.) at apply time only (passed via env, never committed) so BD-3.3 can land them as K8s Secrets. Rationale: keeps tokens out of TF-managed Doppler-provider state (which is the leak surface) AND out of Doppler-side automation outputs. Document the generation runbook in `unipuka-infra-ops/docs/runbooks/rotate-secrets.md`. (2pt)
  - AC: 3 service tokens visible in Doppler UI with read-only scope; 3 GH secrets exist at `unipuka/soa` repo level.
  - V&V: From a test workflow on `master`, `doppler secrets download --token=$DOPPLER_TOKEN_OEP_STG --no-file --format json` returns the expected JSON shape.
- BD-1.4 Audit + migrate org-wide GH Actions secrets to Doppler. Deliverables: `unipuka-infra-ops/docs/audit/secrets-audit.md` listing every GH secret used by `soa` workflows today (FIREBASE_CREDENTIALS, SENTRY_AUTH_TOKEN, etc.), the target Doppler config + key path, and migration ownership. Then mirror values into Doppler `base` or per-env config. Once a key is mirrored + new workflow consumes it via `doppler run`, the org-level GH secret is removed. (3pt)
  - AC: Audit doc exists with status per secret (migrated / WIP / blocked). All migrated secrets removed from GH org-level scope.
  - V&V: For each migrated secret, value in Doppler matches the previous GH secret (verified via `gh secret list` BEFORE removal + `doppler secrets get` AFTER). Note that `gh secret list` shows names and timestamps only, so the value itself has to be compared against its original source.
- BD-1.5 Populate per-tenant Doppler config keys (`oep-stg`, `mansety-prd`, `us-prd`) with all Laravel runtime env vars currently in AWS SSM + the existing ECS task-definitions in `soa/task-definitions/`. Includes: `APP_KEY`, `DB_*`, `REDIS_*`, `AWS_*` (now DO Spaces equivalents), Sentry/Bugsnag DSNs, Pusher, Typesense, Firebase. Key names must stay backward-compatible so they can be picked up via ESO ExternalSecret -> K8s Secret -> envFrom. (3pt)
  - AC: Each config has >= N keys matching the ECS task-def env list (N captured during audit).
  - V&V: `doppler secrets --config oep-stg` matches a snapshot of `aws ecs describe-task-definition --task-definition oep/api`.
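The BD-1.3 V&V step ("from a test workflow on `master`") could be exercised with a throwaway workflow like the sketch below. It assumes the `dopplerhq/cli-action` GitHub Action is used to install the CLI and that `jq` is available on the runner; the workflow name is hypothetical. It prints only the number of keys, never secret values.

```yaml
name: doppler-token-check               # hypothetical, delete after verification
on: workflow_dispatch
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      # Assumption: installs the Doppler CLI onto the runner.
      - uses: dopplerhq/cli-action@v3
      - name: Download oep-stg secrets and report key count only
        env:
          DOPPLER_TOKEN_OEP_STG: ${{ secrets.DOPPLER_TOKEN_OEP_STG }}
        run: |
          doppler secrets download --token="$DOPPLER_TOKEN_OEP_STG" --no-file --format json \
            | jq 'keys | length'
```

A non-zero key count confirms that the repo-level secret holds a working read-only token for the config without echoing any secret into the job log.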
BD-2: Documentation platform (4 tickets)
Set up a single source-of-truth, browsable docs site rendered with mkdocs-material. Hosts all runbooks, ADRs, audit docs, cutover checklists. Markdown lives in unipuka-infra-ops/docs/; the same files are usable as plain markdown in-repo AND rendered as a static site. Lands early so every subsequent milestone writes its docs into this structure from day one.
- BD-2.1 Set up mkdocs-material in `unipuka-infra-ops/`. Add `mkdocs.yml` at repo root with: site name `Unipuka Platform Docs`, theme `material`, navigation (Runbooks, ADRs, Audit, Cutover, Reference), search plugin, mermaid diagrams (`pymdownx.superfences` + mermaid custom_fences), `mkdocs-awesome-pages-plugin` for auto-nav, `mkdocs-git-revision-date-localized-plugin`. `requirements.txt` pins versions. Bootstrap pages: `docs/index.md` (landing), `docs/runbooks/index.md`, `docs/adr/index.md` (ADR index), `docs/audit/index.md`. (2pt)
  - AC: `mkdocs serve` renders site locally with all bootstrap pages in nav.
  - V&V: Touch a markdown file under `docs/`, observe live reload. Mermaid diagram in a sample page renders.
- BD-2.2 Add CI workflow `unipuka-infra-ops/.github/workflows/docs-build.yml`: on PR -> `mkdocs build --strict` (fails on broken links / nav warnings), markdown linter (`markdownlint-cli`). On push to master -> build + deploy to hosting target (BD-2.3). Caches pip deps. (2pt)
  - AC: PR with a markdown link to a missing page turns red on `mkdocs build --strict`.
  - V&V: Open a deliberate broken-link PR, CI fails. Fix it, CI green.
- BD-2.3 Pick + provision hosting. Options: a) GitHub Pages (free, requires public repo OR GH Enterprise; infra-ops is private so this means GH Pages with org-level "private pages" which needs GH Enterprise), b) DO App Platform static site (free starter tier, custom domain via Cloudflare, gated by Cloudflare Access for internal-only), c) DOKS in-cluster behind the shared Gateway with Cloudflare Access. Recommend (b): DO App Platform static site at `docs.unipuka.app`, deploy_on_push from master, Cloudflare Access policy restricts to `unipuka.com` Google Workspace identities OR GH org members via OAuth. Add resource to `unipuka-infra-do/oep-infra/` (new `docs_site.tf` or extend the existing module). (3pt)
  - AC: `https://docs.unipuka.app` reachable, Cloudflare Access challenge fires for non-team members.
  - V&V: Logged-in team member loads docs site; logged-out / non-team gets 403.
- BD-2.4 ADR template + index + first ADR moved into docs site. Template (`docs/adr/_template.md`): Context, Decision, Status, Consequences, Alternatives. Index (`docs/adr/index.md`) auto-lists ADRs via awesome-pages. Move BD-0.3's `0001-secret-management.md` into `docs/adr/`. Add a runbook stub for every operator runbook expected by this project (matches list in Section 6.1): `bootstrap.md`, `add-tenant.md`, `rotate-secrets.md`, `promote-image.md`, `test-template-on-staging.md`, `incident-response.md`. Each stub is a placeholder with the eventual heading skeleton; the milestone that creates the matching feature fills them in. (1pt)
  - AC: ADR index lists 0001; all 6 runbook stubs exist with template skeleton.
  - V&V: Click each runbook from rendered docs nav -> page loads with `## Purpose / ## When to use / ## Steps / ## Rollback / ## V&V` sections.
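A starting point for the `mkdocs.yml` described in BD-2.1 might look like the following. It is a sketch only: it covers the theme, plugins and mermaid fence named in the ticket, leaves navigation to the awesome-pages plugin, and assumes the packages pinned in `requirements.txt` are installed.

```yaml
site_name: Unipuka Platform Docs
theme:
  name: material

plugins:
  - search
  - awesome-pages                  # from mkdocs-awesome-pages-plugin (auto-nav)
  - git-revision-date-localized    # from mkdocs-git-revision-date-localized-plugin

markdown_extensions:
  - pymdownx.superfences:
      custom_fences:
        # Render fenced blocks tagged "mermaid" as diagrams instead of code.
        - name: mermaid
          class: mermaid
          format: !!python/name:pymdownx.superfences.fence_code_format
```

Running `mkdocs build --strict` against this file locally is a cheap way to confirm the plugin names resolve before wiring up the CI workflow in BD-2.2.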
BD-3: Platform bootstrap on DOKS (14 tickets)
- BD-3.1 TF: Create K8s namespaces (`argocd`, `external-secrets`, `monitoring`, `oep-stg`, `mansety-prd`, `us-prd`) via the `kubernetes` provider in [unipuka-infra-do/oep-infra/k8s_bootstrap.tf](unipuka-infra-do/oep-infra/k8s_bootstrap.tf). Label each with `team`, `tier`, `tenant`, `env`. (1pt)
  - AC: `kubectl get ns` lists all 6 with labels.
  - V&V: TF `plan` after apply = empty diff.
- BD-3.2 TF: Create DOCR imagePullSecret in each tenant ns + monitoring + external-secrets ns. Uses `digitalocean_kubernetes_cluster.main.kube_config[0]` -> kubernetes provider. Name = `do-registry`. (2pt)
  - AC: `kubectl -n oep-stg get secret do-registry -o yaml` shows dockerconfigjson type.
  - V&V: Pod with `imagePullSecrets: [{name: do-registry}]` in tenant ns pulls a private DOCR image.
- BD-3.3 TF: Create Doppler service-token K8s Secret per tenant (`doppler-token-oep-stg`, etc.) in the `external-secrets` namespace via the kubernetes provider in `unipuka-infra-do/oep-infra/k8s_secrets.tf`. Secret value sourced from `var.doppler_service_token_<env>` (sensitive input from CI/local env). NOT pulled from Doppler API to keep tokens out of Doppler-provider state. (1pt)
  - AC: 3 Secrets in `external-secrets` ns containing service token.
  - V&V: ESO ClusterSecretStore (BD-3.6) references these Secrets and reports `Status: Valid`.
- BD-3.4 TF: Create Argo CD repo credentials Secret in `argocd` namespace via kubernetes provider. Pick: GitHub App (preferred over PAT for long-lived no-expiry tokens). App ID + private key + installation ID supplied as TF sensitive vars (`var.argocd_github_app_id`, `var.argocd_github_app_private_key`, `var.argocd_github_app_installation_id`). Secret labeled `argocd.argoproj.io/secret-type: repo-creds` so Argo CD picks it up automatically. (2pt)
  - AC: `argocd repo list` from inside cluster shows `unipuka-infra-ops` as connected.
  - V&V: Push a test commit to infra-ops master; Argo CD picks up the change within sync interval.
- BD-3.5 TF: Install Argo CD via `helm_release` in `unipuka-infra-do/oep-infra/argocd.tf`. Pinned chart version (e.g. `argo-cd-7.x`). Values inline OR from [unipuka-infra-ops/bootstrap/argocd-values.yaml](unipuka-infra-ops/bootstrap/argocd-values.yaml) (read via `file()`). Then `kubernetes_manifest` resource applies the root Application from `unipuka-infra-ops/bootstrap/root-application.yaml`. After this, TF stays the source of truth for Argo CD's own chart version + values; Argo CD syncs everything under `bootstrap/`, `platform/`, `apps/` from infra-ops. (3pt)
  - AC: `terraform apply` lands a working Argo CD; root Application + child Applications all `Synced/Healthy`. Re-applying TF is idempotent (zero diff).
  - V&V: `argocd app list` shows `root`, `platform`, `<tenant>-laravel` x3. Bump Helm chart version in TF, run apply, observe rolling upgrade of Argo CD itself with no drift.
- BD-3.6 Platform chart: ESO operator install (Helm subchart in [unipuka-infra-ops/platform/charts/external-secrets/](unipuka-infra-ops/platform/charts/external-secrets/)) + one `ClusterSecretStore` CR per Doppler config (3 total) referencing the `doppler-token-<env>` Secret. (2pt)
  - AC: `kubectl get clustersecretstore` shows 3 stores all `Valid`.
  - V&V: Create a probe `ExternalSecret` in `oep-stg` ns referencing the `oep-stg` ClusterSecretStore + a known Doppler key; verify synced Secret exists with the right value.
- BD-3.7 Platform chart: shared `Gateway` resource with TWO listener sections: a) HTTPS :443 (per-tenant hostnames from values), b) HTTPS :6001 (per-tenant ws hostnames). Both TLS terminate with Cloudflare Origin certs. `allowedRoutes.namespaces.selector` enforces tenant isolation. Add HTTP :80 listener with redirect HTTPRoute (per Gateway-test pattern). (3pt)
  - AC: `kubectl describe gateway -n <gateway-ns>` shows 3 listeners (HTTP/HTTPS/WS); `Status: Accepted`. DO LB created.
  - V&V: From outside CF, `curl https://api-staging.unipuka.app/health` resolves (404 acceptable pre-app-deploy). `curl -i http://api-staging.unipuka.app/` returns 301 -> https.
- BD-3.8 Platform chart: TLS Secret per hostname for the Cloudflare Origin certs. Pulled via ESO from Doppler `base` config (keys: `CF_ORIGIN_CERT_<host>`, `CF_ORIGIN_KEY_<host>`). Avoid committing the cert files currently sitting in [unipuka-infra-ops/cloudflare-unipuka-cert.pem](unipuka-infra-ops/cloudflare-unipuka-cert.pem) -- delete them in this ticket. (2pt)
  - AC: `kubectl -n <gateway-ns> get secret cloudflare-origin-<host>` exists with valid cert.
  - V&V: TLS handshake on the Gateway listener uses the right SAN.
- BD-3.9 Platform chart: `NetworkPolicy` templates per tenant ns. Default-deny ingress + egress; explicit allow for: a) intra-ns pod-to-pod, b) DNS to kube-system, c) DOKS managed MySQL + Redis (intra-VPC IPs via `ipBlock`), d) egress to internet for Doppler API, Sentry/Bugsnag, Typesense, Pusher, Firebase, Cloudflare Origin (use FQDN-resolver pattern or wide CIDR + label-based). (3pt)
  - AC: Pod in `oep-stg` can curl `https://api.doppler.com`, MySQL, Redis; pod cannot curl pod in `mansety-prd`.
  - V&V: Run `kubectl exec` probe pod tests for each allow + deny path.
- BD-3.10 Platform chart: `ResourceQuota` + `LimitRange` per tenant ns from CLAUDE.md numbers (CPU/mem/pods cap, per-container defaults). Tunable per tenant via Helm values. (1pt)
  - AC: `kubectl describe quota -n oep-stg` shows hard limits.
  - V&V: Apply a Pod requesting > quota; admission denies it with the expected reason.
- BD-3.11 Platform chart: install `kube-prometheus-stack` Helm subchart in `monitoring` ns. Add taint + toleration so it lands on a dedicated `dedicated=monitoring:NoSchedule` node (label one node manually first or add a node pool via TF). ServiceMonitor + PrometheusRule CRDs available. Grafana ingress via the shared Gateway with a dedicated `grafana.<base>` hostname. (3pt)
  - AC: Prometheus pods Running on the dedicated node; Grafana reachable on its hostname; Prometheus scrapes itself + Alertmanager.
  - V&V: `up{job="kubernetes-pods"}` query returns non-empty in Prometheus UI.
- BD-3.12 Platform chart: install Loki + Promtail/Grafana Alloy in `monitoring` ns. Wire Grafana datasource. Logs from all tenant + platform ns flow into Loki. Retention 7d for staging, 30d for prod. Storage: DO Spaces backend or in-cluster persistent volume (pick + document). (3pt)
  - AC: Grafana Explore on Loki datasource returns logs labeled `{namespace="oep-stg"}`.
  - V&V: Trigger a Pod log line in `oep-stg`; observe within 30s in Grafana.
- BD-3.13 Platform chart: install OpenCost. Wire Prometheus datasource. (1pt)
  - AC: OpenCost UI reachable, shows per-namespace cost projection within 24h after install.
  - V&V: Compare OpenCost monthly projection vs DO billing for the same window (rough match).
- BD-3.14 Argo CD auth: SSO via GitHub OAuth through Dex (bundled with argo-cd Helm chart). Why this option (vs alternatives):
  - Picked: GitHub OAuth via Dex. Team identity already in `unipuka` GH org. RBAC mapped to GH teams. No separate user store to maintain. Built-in admin password kept only for break-glass.
  - Rejected: local accounts (per-user passwords) -> N more secrets to rotate, manual offboarding.
  - Rejected: Google OAuth direct -> we use GH already; one fewer IdP.
  - Rejected: PR-only token / no SSO -> not viable for a team UI.

  Implementation: create a "GitHub OAuth App" under the `unipuka` GH org with callback `https://argocd.unipuka.app/api/dex/callback`. Store client ID + client secret in Doppler `base` config (keys `ARGOCD_DEX_GITHUB_CLIENT_ID`, `ARGOCD_DEX_GITHUB_CLIENT_SECRET`). ESO ExternalSecret syncs them into the `argocd` ns as `argocd-dex-github`. Argo CD `argocd-cm` ConfigMap (managed via Helm values pulled by TF in BD-3.5) wires up the dex github connector + `argocd-rbac-cm` maps GH teams to roles (e.g. `unipuka:platform-admins` -> `role:admin`, `unipuka:engineers` -> `role:readonly-prod-edit-stg`). Built-in admin disabled via `admin.enabled: false` in values (after first apply). Operator break-glass via `argocd-initial-admin-secret` only if Dex is broken (documented in `docs/runbooks/argocd-break-glass.md`). Argo CD UI exposed via a per-env HTTPRoute on the shared Gateway (`argocd.unipuka.app`, protected behind Cloudflare Access as defense-in-depth). (3pt)
  - AC: Team member visits `argocd.unipuka.app`, redirected to GitHub OAuth, lands authenticated with the role mapped from their GH team. Built-in admin login disabled.
  - V&V: A user not in the `unipuka` GH org cannot complete the OAuth flow. A user in `unipuka:engineers` gets read-only on prod Applications, edit on stg.
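For BD-3.3 and BD-3.6, the token Secret, the `ClusterSecretStore` and the probe `ExternalSecret` fit together roughly as sketched below. Assumptions to verify against the installed ESO version: the `v1beta1` API version, the store name `doppler-oep-stg`, the `dopplerToken` key inside the TF-created Secret, and the probe names. `APP_KEY` is used only because BD-1.5 lists it as a populated key.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: doppler-oep-stg                 # hypothetical store name
spec:
  provider:
    doppler:
      auth:
        secretRef:
          dopplerToken:
            name: doppler-token-oep-stg # Secret created by TF in BD-3.3
            key: dopplerToken           # assumed key name inside that Secret
            namespace: external-secrets
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: probe                           # hypothetical, for the BD-3.6 V&V check
  namespace: oep-stg
spec:
  refreshInterval: 60s
  secretStoreRef:
    kind: ClusterSecretStore
    name: doppler-oep-stg
  target:
    name: probe-secret                  # K8s Secret that ESO creates in oep-stg
  data:
    - secretKey: APP_KEY
      remoteRef:
        key: APP_KEY
```

Because each Doppler service token is scoped to one config, one store per token keeps the tenant boundary at the credential level rather than relying on path filters.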
BD-4: laravel-app Helm chart (11 tickets)¶
- BD-4.1 Initialise
[unipuka-infra-ops/apps/laravel/](unipuka-infra-ops/apps/laravel/)Helm chart skeleton withChart.yaml, defaultvalues.yaml, emptytemplates/,values-oep-stg.yaml,values-mansety-prd.yaml,values-us-prd.yaml. Define values schema (values.schema.json) for: image, replicas, resources, hostnames (api + ws), redis broker DSN ref, podDisruptionBudget thresholds, hpa thresholds. (2pt) - AC:
helm lint apps/laravel -f apps/laravel/values-oep-stg.yamlpasses for all 3 tenant values files. - V&V:
helm template ...for each tenant produces non-empty output. - BD-4.2 api workload Helm templates. Decision encoded here: keep all-in-one supervisord image but run only nginx + php-fpm via supervisord (no worker/ws/scheduler in this Deployment). Args drive supervisord program subset. Service + HTTPRoute on Gateway listener
:443. Readiness probe ->GET /health. Liveness probe ->GET /health. (3pt) - AC: Rendered template applies cleanly in cluster, /health returns 200.
- V&V: kubectl exec into a pod, run
php artisanworks; nginx + php-fpm both alive. - BD-4.3 worker Helm template =
Deploymentrunningphp artisan horizon. Readiness probe ->php artisan horizon:status. Graceful shutdown:terminationGracePeriodSeconds+ Horizon'sphp artisan horizon:terminatepreStop hook. (2pt) - AC: Horizon dashboard reachable in-cluster; jobs in tenant Redis queue get processed.
- V&V: Push a sample job, observe Horizon picks it up; SIGTERM during job lets it finish before pod dies.
- BD-4.4 websockets (Reverb) Helm template = `Deployment` running `php artisan reverb:start` with replicas: 2 default. Reverb config uses Redis broker (`BROADCAST_DRIVER=reverb` + `REVERB_SCALING_ENABLED=true` + `REVERB_SCALING_REPLICATION=redis` env). Service ClusterIP. HTTPRoute on Gateway listener `:6001`. PDB minAvailable: 1. (3pt)
- AC: 2 Reverb pods, both receive client connections behind the Gateway listener :6001.
- V&V: Open 2 WS clients connecting to same hostname, ensure messages broadcast from one are seen by the other regardless of pod routing.
- BD-4.5 scheduler `CronJob` template. Schedule `*/1 * * * *`. `concurrencyPolicy: Forbid`. `successfulJobsHistoryLimit: 1`, `failedJobsHistoryLimit: 3`. Command = `php artisan schedule:run`. (1pt)
- AC: CronJob fires every minute, completes < 30s typical.
- V&V: `kubectl get jobs -n <tenant>` shows runs every minute; logs show `Running scheduled command:` output.
- BD-4.6 PreSync migrate `Job` template with Argo CD hook annotations: `argocd.argoproj.io/hook: PreSync`, `argocd.argoproj.io/hook-delete-policy: BeforeHookCreation`. Command = `php artisan migrate --force --isolated`. `activeDeadlineSeconds: 1800`. `backoffLimit: 0`. (2pt)
- AC: New release triggered via Argo Sync -> Job runs before any Deployment update; failed Job -> Argo Sync fails with Job logs visible.
- V&V: Simulate a deliberate migration failure -> Argo Sync reports `Degraded`; existing Deployments untouched.
- BD-4.7 `ExternalSecret` template per workload. Each refers to the tenant's `ClusterSecretStore` (BD-3.6) and pulls all keys with `dataFrom.find.name.regexp: ".*"` or an explicit list. Target K8s Secret `<release>-app-env` consumed via `envFrom` in api/worker/ws/scheduler/migrate. `refreshInterval: 60s`. (2pt)
- AC: `kubectl get externalsecret -n oep-stg` -> Status: `SecretSynced`.
- V&V: Edit a key in Doppler `oep-stg` config; within 60-90s see the K8s Secret value change.
- BD-4.8 PodDisruptionBudget + HorizontalPodAutoscaler templates. PDBs: api minAvailable 50%, ws minAvailable 1, worker minAvailable 0 (worker = batch, OK to drain). HPA: api scales on CPU (target 70%) + memory; ws is meant to scale on connections, but v1 uses manual replicas only. Toggleable per tenant. (1pt)
- AC: `kubectl get hpa -n oep-stg` shows api HPA active.
- V&V: Load-gen on api pushes HPA replicas up; replicas return to base once the load stops.
- BD-4.9 Per-tenant values files. Encode from existing AWS task-defs + frontend env table in [unipuka-infra-do/oep-infra/tenants.tf](unipuka-infra-do/oep-infra/tenants.tf):
- oep-stg: 1 replica per workload, `api.unipuka.app` / `api-staging.unipuka.app` hostnames (decide canonical), small resource requests.
- mansety-prd: 2 replicas api + ws, 1 worker (Horizon supervisors handle internal concurrency), `api.mansety.com` hostname, prod-sized resources.
- us-prd: 2 replicas, `api.us-academy.net` hostname, prod-sized resources. (3pt)
- AC: `helm template` per tenant matches expected hostnames + replicas.
- V&V: Dry-run apply of rendered manifests in respective ns has no admission errors.
- BD-4.10 Add `/health` route to [soa/](soa/). Returns 200 + JSON with `db: ok`, `redis: ok`, `app: ok`. Used by readiness/liveness probes. Update [soa/Dockerfile](soa/Dockerfile) if any change needed. (2pt)
- AC: `php artisan serve` locally, `curl /health` returns 200.
- V&V: Probe in cluster Pod returns 200 in < 200ms.
- BD-4.11 chart-testing in [unipuka-infra-ops/.github/workflows/helm-lint.yml](unipuka-infra-ops/.github/workflows/helm-lint.yml). Runs `ct lint --target-branch master`. Triggered on PR. (1pt)
- AC: Workflow green on a sample PR touching the chart.
- V&V: Deliberately break the chart, PR turns red.
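To make BD-4.6 and BD-4.7 concrete, the two manifests below show roughly what the chart would render for one tenant. This is a sketch under stated assumptions: the resource names, the `docr-pull` pull secret, the `doppler-oep-stg` store name and the image tag are placeholders, and the `external-secrets.io/v1beta1` API version depends on the ESO release installed in BD-3.

```yaml
# Rendered shape of the BD-4.6 migrate hook (names are illustrative).
apiVersion: batch/v1
kind: Job
metadata:
  name: laravel-app-migrate
  annotations:
    # Run before the regular manifests of the sync are applied.
    argocd.argoproj.io/hook: PreSync
    # Delete the previous hook Job just before creating the new one.
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  backoffLimit: 0                # a failed migration is never retried
  activeDeadlineSeconds: 1800    # 30 min ceiling
  template:
    spec:
      restartPolicy: Never
      imagePullSecrets:
        - name: docr-pull        # assumed name of the DOCR pull secret
      containers:
        - name: migrate
          image: registry.digitalocean.com/oep/api:v0.1.0
          command: ["php", "artisan", "migrate", "--force", "--isolated"]
          envFrom:
            - secretRef:
                name: laravel-app-app-env
---
# Rendered shape of the BD-4.7 ExternalSecret for the same release.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: laravel-app-app-env
spec:
  refreshInterval: 60s
  secretStoreRef:
    kind: ClusterSecretStore
    name: doppler-oep-stg        # assumed store name from BD-3.6
  target:
    name: laravel-app-app-env    # <release>-app-env
  dataFrom:
    - find:
        name:
          regexp: ".*"           # pull every key in the Doppler config
```

One ordering detail worth checking during implementation: PreSync hooks run before the sync's regular resources are applied, so on the very first sync the target Secret may not exist yet when the migrate Job starts. The chart may need the ExternalSecret to land earlier (for example as its own earlier hook) for that first run.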
BD-5: CI pipeline in soa repo (6 tickets)¶
- BD-5.1 New workflow `soa/.github/workflows/do-build.yml`. Triggers: a) push to `master`, b) `workflow_dispatch` with optional `ref` input (commit SHA or tag, defaults to current branch). Jobs: validate-jsonnet (kept), bake (checkout the resolved `ref`) -> lint (pint via `doppler run`) -> test (phpunit via `doppler run` with `DOPPLER_TOKEN_OEP_STG` for parity envs) -> trivy advisory scan -> push image to `registry.digitalocean.com/oep/api:<git-sha>` + `registry.digitalocean.com/oep/api:v<semver>`. DOCR login via DO API token from `DIGITALOCEAN_ACCESS_TOKEN` GH secret. Doppler integration: at workflow start, `doppler-cli` configured with `DOPPLER_TOKEN_*` to fetch any keys needed at CI time (Firebase creds for tests, Sentry token). The `ref` input lets us rebuild + republish an arbitrary historical commit if its image was GC'd from DOCR (e.g. for a hot rollback past the retention window). (4pt)
- AC: Workflow green end-to-end on `master` push; dispatch with `ref=<historical-sha>` rebuilds + repushes the same SHA tag.
- V&V: `doctl registry repository list-tags api` shows both tags; `docker pull` works from outside cluster (with creds); dispatch with old `ref` produces a tag equal to the older `<git-sha>`.
- BD-5.2 Auto-versioning logic. On push to `master`, conventional-commit parser bumps semver (major/minor/patch) and creates GH release with auto-notes (mirrors [unipuka-github-workflows/.github/workflows/release.yml](unipuka-github-workflows/.github/workflows/release.yml) pattern). Outputs `<semver>` -> consumed by image push step. (2pt)
- AC: A `feat:` commit creates a minor bump release on master push.
- V&V: `gh release list -R unipuka/soa` shows the new tag.
- BD-5.3 New workflow [soa/.github/workflows/do-deploy.yml](soa/.github/workflows/do-deploy.yml). Trigger: `workflow_dispatch` only. Inputs: `tag` (image tag, default = latest GH release tag), `tenant` (oep-stg/mansety-prd/us-prd). Steps: a) checkout `unipuka-infra-ops`, b) `yq -i '.image.tag = strenv(TAG)' apps/laravel/values-${tenant}.yaml`, c) configure git user, d) commit `chore(deploy): bump <tenant> to <tag>`, e) push to `unipuka-infra-ops/master`. Slack notification on success/failure. Direct commit (no PR). (3pt)
- AC: Dispatch with `tag=v0.1.0`, `tenant=oep-stg` -> commit appears in infra-ops; Argo picks up + syncs.
- V&V: `git -C unipuka-infra-ops log -1 --pretty=full master` shows the commit; `argocd app get oep-stg-laravel` reflects new image tag within 1-2 min.
- BD-5.4 GH Environment protection rules in `unipuka/soa`. Create envs `oep-stg`, `mansety-prd`, `us-prd`. Required reviewers on prod envs (`mansety-prd`, `us-prd`). `do-deploy.yml` declares `environment: ${{ inputs.tenant }}` so the gate fires. (1pt)
- AC: Manual dispatch to `mansety-prd` pauses for reviewer approval.
- V&V: Reviewer approves -> workflow proceeds; declines -> workflow aborts.
- BD-5.5 DOCR retention policy via TF in `unipuka-infra-do/oep-infra/registry.tf`. Cost driver: DOCR Starter is free up to 500MB; Basic is $5/mo up to 5GB; Pro is $20/mo up to 100GB. soa/api image ~500-700MB compressed. 30 tags ~= 15-21GB if nothing were shared (each tag = unique manifest + delta layers, so layer sharing makes the effective size lower in practice). Goal: keep last 30 tagged images per repo (~1 month of daily releases) which is the rollback window we realistically need; older tags rebuilt on demand via BD-5.1 `ref` input. DOCR doesn't expose retention as native TF args, so implement via a GH Action on a weekly cron that lists tags, sorts by push date, deletes all but the newest 30, then runs `doctl registry garbage-collection start --include-untagged-manifests`. Document in `unipuka-infra-ops/docs/promote-image.md`. (2pt)
- AC: After 30 pushes, the 31st push (or next weekly GC, whichever first) removes the oldest tag; untagged manifests are absent from DOCR after the next weekly GC.
- V&V: `doctl registry repository list-manifests api | wc -l` <= 30 + a small buffer; no untagged manifests after GC run. Monthly DOCR bill confirms Basic-tier ceiling not exceeded.
- BD-5.6 Sentry release integration in `do-build.yml`. After image push: `sentry-cli releases new -p soa $SEMVER` + `set-commits --auto` + `finalize` + `deploys ... new -e <env>` per tenant. Token from Doppler `base`. Decision: deploys fired in `do-deploy.yml` (per tenant) instead of `do-build.yml` (per build) so deploy environment is correct. (2pt)
- AC: Sentry release exists per build; per-env deploy markers exist per dispatch.
- V&V: Sentry UI shows release timeline with commits + deploys per env.
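The BD-5.3 steps together with the BD-5.4 approval gate can be sketched as a single workflow. This is an illustration under assumptions, not the final file: the repository slug `unipuka/unipuka-infra-ops` and the secret name `INFRA_OPS_PUSH_TOKEN` are hypothetical, the Slack notification is omitted, and `tag` is made a required input because a `workflow_dispatch` default cannot be computed dynamically (resolving "latest GH release tag" would need an extra step).

```yaml
# Sketch of soa/.github/workflows/do-deploy.yml (see assumptions above).
name: do-deploy
on:
  workflow_dispatch:
    inputs:
      tag:
        description: Image tag to deploy, for example v0.1.0
        required: true
        type: string
      tenant:
        description: Target tenant environment
        required: true
        type: choice
        options:
          - oep-stg
          - mansety-prd
          - us-prd

jobs:
  bump:
    runs-on: ubuntu-latest
    # BD-5.4: prod environments pause here until a reviewer approves.
    environment: ${{ inputs.tenant }}
    env:
      TAG: ${{ inputs.tag }}
      TENANT: ${{ inputs.tenant }}
    steps:
      - name: Checkout infra-ops
        uses: actions/checkout@v4
        with:
          repository: unipuka/unipuka-infra-ops     # assumed slug
          ref: master
          token: ${{ secrets.INFRA_OPS_PUSH_TOKEN }} # hypothetical secret name

      - name: Bump image tag in the tenant values file
        run: yq -i '.image.tag = strenv(TAG)' "apps/laravel/values-${TENANT}.yaml"

      - name: Commit and push directly to master
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
          git commit -am "chore(deploy): bump ${TENANT} to ${TAG}"
          git push origin HEAD:master
```

Note that `git commit` exits non-zero when the values file already carries the requested tag, so a re-dispatch of the same tag fails the job as written; whether that should be an error or a no-op is a choice for the implementing ticket.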
BD-6: oep-stg end-to-end deploy + parity verify (7 tickets)¶
- BD-6.1 First-time Argo CD bring-up. Operator runs `terraform apply` in `unipuka-infra-do/oep-infra/` which executes the new `argocd.tf` (BD-3.5) for the first time. Document the prerequisites (kubeconfig set to oep-prd-cluster, all required TF input vars exported) in `unipuka-infra-ops/docs/bootstrap.md`. (1pt)
- AC: Argo CD UI reachable, root Application synced.
- V&V: `argocd app list` lists `root`, `platform`. Both `Synced`/`Healthy`. Subsequent `terraform apply` is a no-op.
- BD-6.2 Apply ApplicationSet for laravel-app (committed in infra-ops, automatically picked up by root Application from BD-6.1). Generates 3 Applications (`oep-stg-laravel`, `mansety-prd-laravel`, `us-prd-laravel`). ALL applications have `syncPolicy.automated.prune = true` + `syncPolicy.automated.selfHeal = true` (per Section 3 locked decision). infra-ops master is the only source of truth; removed manifests = deleted resources. Branch protection on infra-ops with required reviewer for `apps/laravel/values-*-prd.yaml` is the safety net. (2pt)
- AC: 3 Applications visible in Argo, all auto-syncing with prune + selfHeal on.
- V&V: Touch a file under `apps/laravel/templates/`, push to master, observe each tenant Application re-sync within 1-2 min. Delete a Service from `templates/`, push, observe it disappear from each tenant ns.
- BD-6.3 Argo CD AppProjects applied. `platform` AppProject with cluster-scoped permissions limited to `argocd|external-secrets|monitoring|<gateway-ns>` ns. `tenants` AppProject limited to `oep-stg|mansety-prd|us-prd` ns + allowlist of CRDs (no ClusterRole, no namespace creation). (1pt)
- AC: Attempting to deploy a ClusterRole via `tenants` AppProject is denied.
- V&V: `argocd proj list` shows both.
- BD-6.4 First end-to-end deploy of `laravel-app` to `oep-stg`. Sequence: a) confirm Doppler `oep-stg` populated, b) trigger `do-build.yml` on a clean master commit, c) trigger `do-deploy.yml` dispatch tag=`v0.1.0` tenant=`oep-stg`, d) watch Argo sync (PreSync migrate runs first), e) confirm all 4 workloads Healthy. (3pt)
- AC: `kubectl get pods -n oep-stg` all `Running`. `/health` returns 200 via Gateway. Horizon dashboard reachable. Reverb accepting WS.
- V&V: smoke matrix in BD-6.5 passes.
- BD-6.5 Parity smoke matrix for `oep-stg`. Create a smoke-test script [unipuka-infra-ops/scripts/smoke.sh](unipuka-infra-ops/scripts/smoke.sh) that hits N critical endpoints (login, list courses, websocket connect, queue job dispatch, scheduled command run). Run against the new DOKS oep-stg deploy. (3pt)
- AC: Script returns 0 exit code.
- V&V: Each endpoint's response matches the AWS-side baseline (status code + JSON shape).
- BD-6.6 Per-workload alert rules. PrometheusRule per workload in the chart's [alertrules.yaml](unipuka-infra-ops/apps/laravel/templates/alertrules.yaml): api 5xx ratio > 1% (5m), worker queue stalled > 10min, ws active connections drop > 50% (5m), scheduler missed > 2 consecutive runs, migrate Job failure. Routed to Slack via Alertmanager. (3pt)
- AC: Alerts visible in Alertmanager UI; trigger each manually to verify routing.
- V&V: Each alert ends up in the configured Slack channel within 1 min of firing.
- BD-6.7 Bugsnag release integration (kept alongside Sentry per CLAUDE.md). [soa/](soa/) release script invoked from `do-build.yml`, source maps + release markers uploaded. (1pt)
- AC: Bugsnag dashboard shows release per build.
- V&V: Trigger a deliberate exception in oep-stg -> Bugsnag captures it with the right release ID.
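The ApplicationSet from BD-6.2 could take roughly the following shape. It is a sketch assuming a plain list generator and an HTTPS `repoURL` for the infra-ops repo (both are assumptions, the plan does not fix either); the Application names, the `tenants` project, the chart path and the per-tenant values files come from the tickets above.

```yaml
# Sketch of the laravel-app ApplicationSet (generator and repoURL are assumed).
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: laravel-app
  namespace: argocd
spec:
  generators:
    # One element per tenant; each renders one Application.
    - list:
        elements:
          - tenant: oep-stg
          - tenant: mansety-prd
          - tenant: us-prd
  template:
    metadata:
      name: '{{tenant}}-laravel'
    spec:
      project: tenants            # AppProject from BD-6.3
      source:
        repoURL: https://github.com/unipuka/unipuka-infra-ops.git  # assumed URL
        targetRevision: master
        path: apps/laravel
        helm:
          valueFiles:
            - 'values-{{tenant}}.yaml'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{tenant}}'
      syncPolicy:
        automated:
          prune: true             # removed manifests = deleted resources
          selfHeal: true          # manual drift in the cluster is reverted
```

Adding a tenant then amounts to one new list element plus a matching `values-<tenant>.yaml`, which is the flow the `add-tenant.md` runbook in BD-8.2 would describe.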
BD-7: mansety-prd + us-prd deploy idle (5 tickets)¶
- BD-7.1 Tune sizing in `values-mansety-prd.yaml` + `values-us-prd.yaml` to match current ECS task-def CPU/mem footprint (api 2 replicas each running 0.5 vCPU / 1Gi). Set replica counts + HPA bounds. (2pt)
- AC: `helm template` renders match the agreed-upon sizing doc.
- V&V: Resource sum stays within tenant ResourceQuota.
- BD-7.2 Populate Doppler `mansety-prd` + `us-prd` configs with secrets from AWS SSM. Cross-reference [soa/task-definitions/mansety/api.json](soa/task-definitions/mansety/api.json) + `us/api.json`. (3pt)
- AC: Per-config secret count matches AWS SSM count (audit doc).
- V&V: `doppler secrets --config mansety-prd | wc -l` matches expected.
- BD-7.3 Deploy `laravel-app` to `mansety-prd` and `us-prd` namespaces (manual `do-deploy.yml` dispatch per tenant, follow GH Environment approval). No live traffic yet; traffic stays on AWS. (2pt)
- AC: All workloads Running per tenant.
- V&V: BD-6.5 parity smoke matrix passes per tenant.
- BD-7.4 Full parity test matrix per prod tenant. Run [soa/](soa/) integration test suite + Scribe API contract tests against DO mansety-prd and us-prd (read-only mode where possible). Doppler config inherits the test-only `APP_KEY` if needed. (3pt)
- AC: Test suite green per tenant.
- V&V: Compare critical endpoint responses byte-for-byte against the AWS-side baseline; document any drift.
- BD-7.5 Per-tenant cutover-precondition checklist document. [unipuka-infra-ops/docs/cutover-precondition-<tenant>.md](unipuka-infra-ops/docs/cutover-precondition-mansety-prd.md). Items: pods healthy 7d, alerts quiet 7d, parity matrix green, data-migration-DM6 done, secrets snapshot taken, DNS rollback plan staged. Owned by + handed to the per-tenant cutover Linear project. (1pt)
- AC: 3 checklist docs committed (oep-stg, mansety-prd, us-prd).
- V&V: Linked from the new cutover project descriptions when created.
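As an illustration of the BD-7.1 sizing translation, the ECS footprint (0.5 vCPU / 1Gi per api task) maps onto Kubernetes resource requests as below. Every key name in this fragment is hypothetical until the BD-4.1 values schema exists, and the HPA upper bound is a placeholder rather than a number from this plan; the replica counts, hostname and 70% CPU target are taken from BD-4.8 and BD-4.9.

```yaml
# values-mansety-prd.yaml sketch (key names hypothetical, see lead-in).
api:
  replicas: 2
  resources:
    requests:
      cpu: 500m        # 0.5 vCPU from the ECS task-def
      memory: 1Gi
    limits:
      memory: 1Gi
  hpa:
    enabled: true
    minReplicas: 2
    maxReplicas: 4     # placeholder upper bound, to be set in BD-7.1
    targetCPUUtilizationPercentage: 70
ws:
  replicas: 2
worker:
  replicas: 1          # Horizon supervisors handle internal concurrency
hostnames:
  api: api.mansety.com
```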
BD-8: Cleanup + handoff (3 tickets)¶
- BD-8.1 Update [CLAUDE.md](CLAUDE.md) repo map and the Kubernetes section to reflect the final state: infra-ops layout, AppProjects (`platform` + `tenants`), Doppler structure, Gateway listeners (`:80`/`:443`/`:6001`), monitoring stack, helm chart path. (1pt)
- BD-8.2 Operator runbooks committed in [unipuka-infra-ops/docs/](unipuka-infra-ops/docs/): `bootstrap.md`, `add-tenant.md`, `rotate-secrets.md`, `promote-image.md`, `incident-response.md` (alert -> runbook map). (3pt)
- AC: Each runbook covers happy + sad paths + rollback steps.
- V&V: Pair with an on-call engineer, run one runbook end-to-end on staging.
- BD-8.3 Spin up 3 per-tenant cutover Linear projects (`AWS to DO Backend Cutover: oep-stg`, `... mansety-prd`, `... us-prd`). Link from this project. Hand off the cutover-precondition checklists. (1pt)
8. Dependencies summary¶
flowchart TD
BD0[BD-0 Planning]
BD1[BD-1 Doppler foundation]
BD2[BD-2 Documentation platform]
BD3[BD-3 Platform bootstrap]
BD4[BD-4 Laravel Helm chart]
BD5[BD-5 CI pipeline in soa]
BD6[BD-6 oep-stg deploy]
BD7[BD-7 prd tenants deploy idle]
BD8[BD-8 Cleanup + handoff]
BD0 --> BD1
BD0 --> BD2
BD1 --> BD3
BD2 -.->|enables docs| BD3
BD3 --> BD4
BD3 --> BD5
BD4 --> BD6
BD5 --> BD6
BD6 --> BD7
BD7 --> BD8
BD-2 (docs) runs in parallel with BD-1/BD-3; it does not block BD-3 but every milestone from BD-3 onward writes its docs into the site that BD-2 provisions.
Critical path: BD-0 -> BD-1 -> BD-3 -> (BD-4 and BD-5 in parallel) -> BD-6 -> BD-7 -> BD-8.
9. Risks + mitigations¶
- Doppler service token leak. Mitigation: tokens generated manually in Doppler UI (BD-1.3, never created through the Doppler TF provider), stored in 3 places only: a) Doppler itself, b) `DOPPLER_TOKEN_*` GH secrets at the `unipuka/soa` repo level (not org-level), c) K8s `Secret` in `external-secrets` ns. TF state does hold the token value as part of the K8s Secret resource (encrypted at rest via the DO Spaces backend), but it never appears in the Doppler-provider portion of state. Rotation runbook in BD-8.2.
- Migrate Job blocks sync. PreSync hook has 30min timeout + backoffLimit 0. On failure, Argo Sync stays `Degraded` BUT existing Deployments are untouched (the new revision never applies). From a traffic POV this is auto-rollback: old pods keep serving until the operator either (a) fixes the migration + re-dispatches, or (b) re-dispatches `do-deploy.yml` with the previous image tag. Alertmanager fires a `SyncFailed` alert routed to Slack within 1 min. Runbook in BD-8.2.
- Reverb broker race on multi-replica. Mitigated by Redis broker (`REVERB_SCALING_REPLICATION=redis`). BD-4.4 V&V explicitly tests cross-pod broadcast.
- NetworkPolicy too tight breaks egress. Document allow-list per tenant; default-deny applied last after staging-only soak in BD-3.9.
- Argo prune deletes a resource by accident. Mitigated by infra-ops branch protection: required reviewer on `apps/laravel/values-*-prd.yaml` + `apps/laravel/templates/**` paths. Pull request explicitly shows the deletion diff. Argo's "delete pending" prune window + git revert = simple rollback. Acceptable trade-off for keeping master as the only source of truth.
- DOCR garbage collection deletes referenced image. Retention keeps last 30 tagged images per repo (BD-5.5) which is ~1 month of daily releases. The intent is to stay within the Basic-tier $5/mo footprint, which only holds if layer sharing keeps effective storage under the 5GB Basic limit (BD-5.5 estimates 15-21GB before sharing). Beyond the retention window, operator rebuilds + republishes the image from any commit via `do-build.yml` `workflow_dispatch` `ref` input (BD-5.1).
- GitOps drift between Doppler + Helm values. Secrets MUST flow ONLY through ESO -> ExternalSecret -> K8s Secret -> envFrom; never in Helm values or git. Static config (hostnames, replicas, resource sizing) lives ONLY in Helm values. Document the boundary in BD-4.1.
10. Estimated points¶
- BD-0: 4pt (BD-0.2 grown to 2pt for discrepancy audit; BD-0.3 ADR shrunk to 1pt)
- BD-1: 11pt (BD-1.6 dropped)
- BD-2: 8pt (mkdocs-material docs platform; 4 tickets)
- BD-3: 30pt (BD-3.5 +1pt for TF helm_release; BD-3.14 Argo CD auth via Dex+GitHub +3pt)
- BD-4: 22pt
- BD-5: 14pt
- BD-6: 14pt
- BD-7: 11pt
- BD-8: 5pt
- Total: ~119pt (~15 person-weeks @ 8pt/week; can compress with parallelism on BD-2 + BD-4 + BD-5).
11. What this plan does NOT do¶
- Does not perform the actual implementation. Just creates the Linear project + tickets. Each ticket gets picked up + implemented per its own AC + V&V.
- Does not cover per-tenant traffic cutover. Three separate Linear projects (one per tenant) will be spun up in BD-8.3 to handle DNS flip + AWS drain + traffic switch + rollback per tenant.
- Does not migrate or delete the legacy AWS Jsonnet workflows in [soa/.github/workflows/](soa/.github/workflows/). Those stay until the cutover project for the respective tenant closes.
- Does not migrate the existing AWS ECR/ECS resources or [unipuka-infra/](unipuka-infra/) TF. Those are removed by the cutover projects per tenant.
- Does not introduce Argo Rollouts / canary. Tracked as future iteration after cutover stable.
- Does not introduce or evaluate HashiCorp Vault. Doppler is the secret store. Out of scope explicitly per Section 3.