AWS to DO Backend Deploy Preparation Plan

Linear-project-and-tickets plan. NOT implementation. Each ticket has context, scope, AC, validation/verification steps. Mirrors format of aws-to-do-data-migration_f4a8604e.plan.md.

1. Goal

Build the deploy plumbing for [soa/](soa/) Laravel backend on DOKS (cluster oep-prd-cluster). End state: all 3 tenant envs (oep-stg, mansety-prd, us-prd) running on DOKS via ArgoCD with no live traffic yet. Per-tenant traffic cutover = separate Linear project per tenant.

2. Scope

In scope

  • Doppler workspace structure (project + configs only) managed via Terraform; service tokens generated manually.
  • Migrate org-wide GH Actions secrets to Doppler; remove the org-level GH copies.
  • TF additions in unipuka-infra-do/oep-infra/ for: namespaces, DOCR pullsecret, Doppler service-token K8s Secret bootstrap, Argo CD repo creds Secret, Argo CD helm_release install + root Application apply.
  • unipuka-infra-ops/ GitOps repo full layout (platform chart + apps charts + AppProjects + ApplicationSets + docs).
  • Argo CD + ESO + Gateway + kube-prometheus-stack + Loki + OpenCost installed and tenant-scoped.
  • laravel-app Helm chart (api + worker + websockets/Reverb + scheduler CronJob + PreSync migrate Job).
  • New CI workflows in soa/.github/workflows/: build + push DOCR (do-build.yml), gitops deploy (do-deploy.yml).
  • Add /health endpoint in soa.
  • All 3 tenant envs deployed (idle, no traffic) and parity-verified.

Out of scope

  • Per-tenant traffic cutover (DNS flip, AWS drain, traffic switch) -> separate AWS to DO Backend Cutover: <tenant> Linear project per tenant.
  • Extracting reusable workflows to [unipuka-github-workflows/](unipuka-github-workflows/) -> future iteration after CI stabilises.
  • Argo Rollouts / canary deploys -> future iteration post-cutover-stable.
  • Image security scanning blocking gate (Trivy advisory only).
  • Removing legacy AWS workflows -> deferred to cutover project per tenant.

3. Locked design decisions (from grilling)

  • Manifest tool: Helm everywhere (laravel-app chart + platform chart).
  • Bootstrap: Pure Terraform. helm_release in unipuka-infra-do/oep-infra/ installs Argo CD on DOKS. TF kubernetes_manifest applies the App-of-Apps root Application. Argo CD then syncs everything else from unipuka-infra-ops/ (AppProjects, ApplicationSets, platform chart, app charts). No bootstrap.sh. TF stays the source of truth for Argo CD's own chart version + values.
  • Deploys: Plain Deployment v1, rollingUpdate strategy. Argo Rollouts post-cutover.
  • Image: One env-agnostic image per commit. Tag = git-SHA + semver. DOCR = registry.digitalocean.com/oep (DO container registry oep in region ams3 per unipuka-infra-do/oep-infra/registry.tf + default region in unipuka-infra-do/oep-infra/variables.tf). Full path registry.digitalocean.com/oep/api:<tag>. Same digest used across all tenants.
  • Deploy gating: Manual workflow_dispatch for ALL envs. Direct commit to unipuka-infra-ops/master from the deploy workflow. GH Environment + required reviewers for prod tenants. do-build.yml also supports workflow_dispatch with ref input to rebuild from any commit.
  • Doppler structure: ONE Doppler project per service (oep), configs = env (oep-stg, mansety-prd, us-prd) + a base config for shared keys. Project + config structure is portable to any ESO-compatible store (paths preserved). Service tokens generated manually in Doppler UI (NOT through the Doppler TF provider) to keep token values out of Doppler-provider TF state; tokens land as GH repo-level secrets on unipuka/soa (BD-1.3) + K8s Secrets created from TF sensitive input vars (BD-3.3).
  • Reverb websockets: Same shared Gateway, additional HTTPS listener on port 6001 with TLS terminate (Cloudflare Origin cert). Zero frontend env change. Multi-replica + Redis broker for HA.
  • Migrate hook: PreSync (NOT PostSync). Convention: migrations must be backward-compatible to support rollback. Update CLAUDE.md.
  • ApplicationSet: v1 = List generator only (3 tenants, 1 chart). Per-tenant targetRevision overridable from List element for staging-only test of template changes (see Section 5.5). Matrix generator deferred to when 2nd app joins.
  • Argo Projects: Two AppProjects: platform (cluster-scoped) and tenants (namespace-scoped to oep-stg|mansety-prd|us-prd).
  • Sync policy: prune: true + selfHeal: true for ALL Argo Applications (incl. prd). infra-ops master is the source of truth; removed manifests get deleted. Branch protection on infra-ops (review-required for prod values changes) is the gate.
  • Monitoring: kube-prometheus-stack + Loki + OpenCost in scope.
  • Trivy: advisory only.
  • Cutover: per-tenant, separate project per tenant.
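
The PreSync migrate hook decided above could be sketched as an Argo CD hook Job roughly like this. The Job name, the `laravel-env` Secret name, and the `backoffLimit` choice are illustrative assumptions, not part of the plan; `<tag>` stays a placeholder.

```yaml
# Sketch only: runs migrations before Argo CD applies the new Deployments.
apiVersion: batch/v1
kind: Job
metadata:
  name: laravel-migrate              # hypothetical name
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  backoffLimit: 0                    # assumption: fail the sync rather than retry a migration
  template:
    spec:
      restartPolicy: Never
      imagePullSecrets:
        - name: do-registry
      containers:
        - name: migrate
          image: registry.digitalocean.com/oep/api:<tag>
          command: ["php", "artisan", "migrate", "--force"]
          envFrom:
            - secretRef:
                name: laravel-env    # hypothetical Secret written by ESO
```

Because PreSync hooks run before the rest of the sync, the ESO-written Secret may not exist yet on a tenant's very first sync; that case likely needs handling in the chart or the bootstrap runbook.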

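For the namespace-scoped tenants AppProject in the list above, a minimal sketch might look like the following. The description text and the empty cluster-resource whitelist are assumptions; the repo URL and namespaces come from this plan.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: tenants
  namespace: argocd
spec:
  description: Tenant workloads (namespace-scoped)   # illustrative
  sourceRepos:
    - https://github.com/unipuka/unipuka-infra-ops.git
  destinations:
    - server: https://kubernetes.default.svc
      namespace: oep-stg
    - server: https://kubernetes.default.svc
      namespace: mansety-prd
    - server: https://kubernetes.default.svc
      namespace: us-prd
  clusterResourceWhitelist: []       # assumption: tenants create no cluster-scoped resources
```
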
4. Linear project

  • Project name: AWS to DO Migration: Backend Deploy Preparation
  • Identifier prefix: BD- (Backend Deploy). Milestones BD-0 .. BD-8; tickets BD-<milestone>.<n>.
  • Summary: "Prep work to run Laravel backend on DOKS via ArgoCD GitOps. Builds CI, Doppler structure, platform components (Argo+ESO+Gateway+monitoring), laravel-app Helm chart, and idle deploys to all 3 tenants. Per-tenant traffic cutover is a separate project per tenant."
  • Labels: reuse migration, infra, backend. Add deploy-prep.
  • Acceptance for project closure: all 3 tenant deployments running idle on DOKS, parity-verified against AWS source-of-truth, CI green end-to-end on workflow_dispatch, Doppler is the only source for backend secrets, monitoring stack live with per-workload alerts, runbooks committed. Cutover preconditions handed off to per-tenant cutover projects.

5. End-state architecture

flowchart LR
  subgraph soa [soa repo]
    BuildWf["do-build.yml<br/>(dispatch or merge)"]
    DeployWf["do-deploy.yml<br/>(dispatch + tenant + tag)"]
  end

  subgraph DOCR
    Img["api:<tag>"]
  end

  subgraph InfraOps [unipuka-infra-ops]
    Platform["platform/<br/>(ESO, Gateway, NetPol,<br/>quotas, monitoring)"]
    AppsLaravel["apps/laravel/<br/>(Helm chart +<br/>values-<tenant>.yaml)"]
    Root["bootstrap/<br/>root Application +<br/>ApplicationSets"]
  end

  subgraph DOKS [DOKS oep-prd-cluster]
    Argo[ArgoCD]
    ESO[External Secrets Operator]
    Gateway["Shared Gateway<br/>:443 + :6001"]
    Mon["kube-prometheus-stack<br/>Loki + OpenCost"]
    subgraph Tenants
      OepStg["oep-stg ns<br/>(api+worker+ws+sched)"]
      ManPrd["mansety-prd ns"]
      UsPrd["us-prd ns"]
    end
  end

  Doppler[(Doppler<br/>oep project)]
  subgraph InfraDo [unipuka-infra-do/oep-infra]
    Tf["TF: namespaces, DOCR pullsecret,<br/>Doppler service-token Secret,<br/>Argo repo creds, Doppler project+configs,<br/>helm_release argo-cd, root Application"]
  end

  BuildWf --> Img
  DeployWf -->|direct commit| InfraOps
  Tf -.->|kubernetes + helm providers| DOKS
  Tf -.->|doppler provider: project + configs only| Doppler
  Argo -->|watches| InfraOps
  Argo -->|applies| Platform
  Argo -->|applies| AppsLaravel
  ESO -->|sync| Doppler
  ESO -->|writes K8s Secret| Tenants
  Img -->|imagePullSecret| Tenants
  Gateway -->|HTTPRoute| Tenants
  BuildWf -.->|doppler run| Doppler

5.5 Template versioning + env-promotion strategy

Two independent version dials, decoupled on purpose:

  1. Image tag (= app code version). Per-tenant image.tag in apps/laravel/values-<tenant>.yaml. Bumped by do-deploy.yml workflow dispatch. Promotion = same image tag promoted across envs by re-running the workflow per tenant. Allows older tags to be re-deployed (just dispatch with the older <tag>).

  2. Chart version (= manifest template version). Bumped in apps/laravel/Chart.yaml (semver) when templates, defaults, or the platform-side contract change. CI lint job in unipuka-infra-ops/.github/workflows/helm-lint.yml enforces a version bump on any template diff. Lands on master via PR.
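
To make the two dials concrete, here is a sketch of the two files involved. The `image.repository` key, the chart name, and the version number are assumptions for illustration; only `image.tag` and the file paths are named in this plan.

```yaml
# apps/laravel/values-oep-stg.yaml (dial 1: image tag, bumped by do-deploy.yml)
image:
  repository: registry.digitalocean.com/oep/api   # assumed key name
  tag: "<tag>"
---
# apps/laravel/Chart.yaml (dial 2: chart version, bumped via PR on template changes)
apiVersion: v2
name: laravel-app        # assumed chart name
version: 0.1.0           # hypothetical starting version
```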

To test template changes in staging only without affecting prd:

  • Open PR branch in unipuka-infra-ops (e.g. feat/reverb-redis-broker-tune).
  • ApplicationSet List generator gives each tenant element a targetRevision field. Default = master. Edit the oep-stg element in bootstrap/applicationsets/laravel.yaml to targetRevision: feat/reverb-redis-broker-tune ON THE PR BRANCH itself.
  • Push PR branch. Argo CD doesn't auto-pick this up because the ApplicationSet manifest is still on master.
  • Manually merge ONLY the ApplicationSet change (the targetRevision flip for oep-stg) to master via a small follow-up PR, OR temporarily patch the live ApplicationSet via kubectl edit (recorded in runbook). Then Argo CD points oep-stg Application at the PR branch; prd Applications stay on master.
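  • Validate in oep-stg. When happy, merge the original PR to master, then revert/clean up the targetRevision: feat/... back to master for oep-stg. Because the prd Applications track master with automated sync (Section 3), they pick up the template change as soon as the PR merges, so merge only once the change is safe for prod.
  • Runbook for this lives in unipuka-infra-ops/docs/runbooks/test-template-on-staging.md (BD-8.2).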

ApplicationSet skeleton (illustrative):

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: laravel
spec:
  generators:
    - list:
        elements:
          - tenant: oep-stg
            targetRevision: master
          - tenant: mansety-prd
            targetRevision: master
          - tenant: us-prd
            targetRevision: master
  template:
    metadata:
      name: '{{tenant}}-laravel'
    spec:
      project: tenants
      source:
        repoURL: https://github.com/unipuka/unipuka-infra-ops.git
        targetRevision: '{{targetRevision}}'
        path: apps/laravel
        helm:
          valueFiles:
            - 'values-{{tenant}}.yaml'
      destination:
        namespace: '{{tenant}}'
        server: https://kubernetes.default.svc
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
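
During a staging-only template test, the only part of the skeleton that changes is the oep-stg list element. A sketch, using the example branch name from the steps above:

```yaml
# bootstrap/applicationsets/laravel.yaml, generator section only
generators:
  - list:
      elements:
        - tenant: oep-stg
          targetRevision: feat/reverb-redis-broker-tune   # temporary; back to master after validation
        - tenant: mansety-prd
          targetRevision: master
        - tenant: us-prd
          targetRevision: master
```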

6. Repository layouts at completion

6.1 unipuka-infra-ops/ (greenfield)

unipuka-infra-ops/
  bootstrap/
    root-application.yaml              # App-of-Apps root. Applied by TF kubernetes_manifest. Argo CD syncs everything under bootstrap/ + platform/ + apps/ from this point.
    appprojects/
      platform.yaml                    # AppProject: cluster-scoped, sourceRepo unipuka-infra-ops
      tenants.yaml                     # AppProject: namespace-scoped (oep-stg|mansety-prd|us-prd)
    applicationsets/
      laravel.yaml                     # ApplicationSet: List generator over 3 tenants -> apps/laravel; per-element targetRevision (Section 5.5)
      platform.yaml                    # Application: pulls platform/ chart into the cluster
  platform/
    Chart.yaml
    values.yaml                        # global toggles
    templates/
      namespaces.yaml                  # all tenant + platform ns with labels
      resourcequotas.yaml              # per-tenant ResourceQuota
      limitranges.yaml                 # per-tenant LimitRange
      networkpolicies.yaml             # default-deny + allow rules
      gateway.yaml                     # shared Gateway: listeners :443 (per-tenant hostnames) + :6001 (per-tenant ws hostnames)
      cloudflare-origin-cert.yaml      # TLS Secret per hostname, refs ExternalSecret
      eso-clustersecretstores.yaml     # 1 ClusterSecretStore per Doppler config
    charts/
      external-secrets/                # ESO operator Helm subchart
      kube-prometheus-stack/           # Helm subchart, scrape configs, alerts
      loki/                            # Loki + Promtail/Alloy
      opencost/                        # OpenCost Helm
  apps/
    laravel/
      Chart.yaml
      values.yaml                      # defaults
      values-oep-stg.yaml
      values-mansety-prd.yaml
      values-us-prd.yaml
      templates/
        _helpers.tpl
        api-deployment.yaml            # nginx + php-fpm via supervisord (or split, see BD-4.2)
        api-service.yaml
        api-httproute-443.yaml
        worker-deployment.yaml         # horizon
        websockets-deployment.yaml     # reverb, multi-replica, redis broker
        websockets-service.yaml
        websockets-httproute-6001.yaml
        scheduler-cronjob.yaml         # */1 schedule:run
        migrate-presync-job.yaml       # PreSync hook
        externalsecret.yaml            # per-workload envFrom
        pdb.yaml
        hpa.yaml
        servicemonitor.yaml            # Prometheus scrape
        alertrules.yaml                # PrometheusRule
  docs/                                # rendered to docs.unipuka.app via mkdocs-material (BD-2)
    index.md                           # landing
    runbooks/
      bootstrap.md                     # how to terraform apply the TF helm_release + root Application + day-0 prerequisites
      add-tenant.md
      rotate-secrets.md                # Doppler service-token rotation (manual)
      promote-image.md                 # how to use do-deploy.yml; re-deploy older tag; DOCR retention
      test-template-on-staging.md      # per Section 5.5
      argocd-break-glass.md            # how to use built-in admin secret when Dex/GitHub OAuth is down
      incident-response.md             # alert -> runbook map
    adr/
      index.md                         # auto-generated index of ADRs
      _template.md                     # ADR template (Context/Decision/Status/Consequences/Alternatives)
      0001-secret-management.md        # Doppler structure rationale + portability
    audit/
      secrets-audit.md                 # BD-1.4 deliverable
      cutover-precondition-oep-stg.md
      cutover-precondition-mansety-prd.md
      cutover-precondition-us-prd.md
  mkdocs.yml                           # mkdocs-material config (BD-2.1)
  requirements.txt                     # mkdocs + plugins pinned versions
  .github/workflows/
    helm-lint.yml                      # chart-testing on PR
    docs-build.yml                     # mkdocs --strict on PR; build+deploy on master push (BD-2.2)
  README.md

6.2 soa/.github/workflows/ (new files added)

do-build.yml                           # lint + test + build + Trivy advisory + push DOCR + Sentry release
do-deploy.yml                          # workflow_dispatch(tag,tenant) -> commit to unipuka-infra-ops master

(Legacy oep-staging-pipeline.yml, mansety-production-pipeline.yml, us-production-pipeline.yml stay until per-tenant cutover.)
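
A rough sketch of what do-deploy.yml could look like under the locked decisions (manual dispatch, direct commit to infra-ops master, GH Environment gating). The secret name `INFRA_OPS_PUSH_TOKEN`, the job name, the commit message format, and the use of mikefarah yq v4 on the runner are assumptions.

```yaml
name: do-deploy
on:
  workflow_dispatch:
    inputs:
      tenant:
        description: Tenant to deploy
        required: true
        type: choice
        options: [oep-stg, mansety-prd, us-prd]
      tag:
        description: Image tag already pushed to DOCR
        required: true
        type: string
jobs:
  bump-tag:
    runs-on: ubuntu-latest
    environment: ${{ inputs.tenant }}   # assumption: one GH Environment per tenant, reviewers on prod
    steps:
      - uses: actions/checkout@v4
        with:
          repository: unipuka/unipuka-infra-ops
          ref: master
          token: ${{ secrets.INFRA_OPS_PUSH_TOKEN }}   # hypothetical secret name
      - name: Bump image.tag and push
        env:
          TENANT: ${{ inputs.tenant }}
          TAG: ${{ inputs.tag }}
        run: |
          # assumes mikefarah yq v4 is available on the runner
          yq -i '.image.tag = strenv(TAG)' "apps/laravel/values-${TENANT}.yaml"
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git commit -am "deploy(${TENANT}): api ${TAG}"
          git push origin master
```

Note that `git commit` exits non-zero when the tag is unchanged, so re-dispatching the currently deployed tag fails the job in this sketch rather than creating an empty commit.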

6.3 unipuka-infra-do/oep-infra/ (additions only)

providers.tf                           # add doppler + kubernetes + helm + cloudflare providers
variables.tf                           # add doppler_admin_token, doppler_service_token_oep_stg|mansety_prd|us_prd, argocd_github_app_id, argocd_github_app_private_key, argocd_github_app_installation_id
doppler.tf                             # Doppler project oep + configs (base, oep-stg, mansety-prd, us-prd) via Doppler TF provider. NO service tokens managed here (manual).
k8s_namespaces.tf                      # tenant + platform namespaces (argocd, external-secrets, monitoring, oep-stg, mansety-prd, us-prd) via kubernetes_namespace
k8s_secrets.tf                         # one-off K8s Secrets via kubernetes provider:
                                       #   - DOCR imagePullSecret `do-registry` in each tenant + monitoring + external-secrets ns
                                       #   - Doppler service-token Secret per tenant in external-secrets ns (value from var.doppler_service_token_<env>, NOT from Doppler API)
                                       #   - Argo CD repo creds Secret in argocd ns (GH App from sensitive vars)
argocd.tf                              # helm_release `argocd` from argo Helm chart (CRDs included) + kubernetes_manifest for bootstrap/root-application.yaml. Pinned chart version. Ongoing argo-cd upgrades = TF apply.
docs_site.tf                           # Hosting for docs.unipuka.app (DO App Platform static site if the BD-2.3 recommendation is adopted), deploy_on_push from unipuka-infra-ops/master, Cloudflare Access policy bound to unipuka GH org (BD-2.3)

registry.tf (existing) is extended in BD-5.5 to add GC retention policy (30 tagged images target) as DOCR doesn't expose retention as a first-class TF arg.

7. Milestones

BD-0: Planning + project bootstrap (3 tickets)

  • BD-0.1 Create the Linear project AWS to DO Migration: Backend Deploy Preparation with milestones BD-0..BD-8 and label deploy-prep. Paste this plan into the project description. Commit canonical copy to [unipuka-infra-ops/docs/backend-deploy-prep-plan.md](unipuka-infra-ops/docs/backend-deploy-prep-plan.md). (1pt)
  • BD-0.2 Audit + update CLAUDE.md to reflect locked decisions + real infra state. Known discrepancies as of plan time: (a) Infra section says DO region fra1, actual = ams3 per unipuka-infra-do/oep-infra/variables.tf line 22; (b) Kubernetes section says PostSync migrate hook, locked decision = PreSync; (c) Bootstrap wording mentions ESO + Doppler but predates locked Doppler project structure (project=oep, configs=env, manual service tokens); (d) AppProject naming not specified, locked = platform + tenants; (e) Argo CD install path not specified, locked = TF helm_release (no bootstrap.sh); (f) Monitoring stack listed but observability scope was unclear, locked = full kube-prometheus-stack + Loki + OpenCost in scope; (g) Gateway listeners not enumerated, locked = :80 redirect, :443, :6001 (Reverb). Update each section. Re-audit at BD-8.1 for any further drift. (2pt)
  • AC: Each discrepancy above resolved in CLAUDE.md with a commit referencing this ticket.
  • V&V: Diff CLAUDE.md vs actual TF + plan Section 3 produces zero contradictions.
  • BD-0.3 ADR: secret-management approach. Document the Doppler structure decision (project = service oep, configs = env), the boundary between Doppler-managed secrets and Helm-managed static config, the manual-service-token rationale, the ESO sync model, and rotation cadence. Commit to unipuka-infra-ops/docs/adr/0001-secret-management.md. (1pt)

BD-1: Doppler foundation (5 tickets)

  • BD-1.1 Add Doppler Terraform provider (DopplerHQ/doppler) to unipuka-infra-do/oep-infra/providers.tf. Add doppler_admin_token variable (workspace admin token used ONLY at TF apply time; passed via TF_VAR_doppler_admin_token env, never committed). Bootstrap workspace: terraform init, verify provider auth. (1pt)
  • AC: terraform plan runs clean with Doppler provider configured.
  • V&V: terraform providers lists Doppler; planned changes are empty for doppler.tf placeholder.
  • BD-1.2 TF: Create Doppler project oep + configs base, oep-stg, mansety-prd, us-prd (configs inherit from base). Lifecycle: prevent_destroy = true on the project. File: unipuka-infra-do/oep-infra/doppler.tf. NOTE: this ticket manages only the project + configs (structure), NOT secret values, NOT service tokens. (2pt)
  • AC: Doppler dashboard shows project + 4 configs with inheritance from base.
  • V&V: doppler configs --project oep lists all 4.
  • BD-1.3 Manually generate 3 Doppler service tokens (one per non-base config) in the Doppler UI with read-only scope. For each token, manually create a GH secret at the unipuka/soa repo level (NOT org-level) named DOPPLER_TOKEN_OEP_STG, DOPPLER_TOKEN_MANSETY_PRD, DOPPLER_TOKEN_US_PRD. Token values are also supplied as TF input vars (TF_VAR_doppler_service_token_oep_stg, etc.) at apply time only (passed via env, never committed) so BD-3.3 can land them as K8s Secrets. Rationale: keeps tokens out of TF-managed Doppler-provider state (which is the leak surface) AND out of Doppler-side automation outputs. Document the generation runbook in unipuka-infra-ops/docs/runbooks/rotate-secrets.md. (2pt)
  • AC: 3 service tokens visible in Doppler UI with read-only scope; 3 GH secrets exist at unipuka/soa repo level.
  • V&V: From a test workflow on master, doppler secrets download --token=$DOPPLER_TOKEN_OEP_STG --no-file --format json returns the expected JSON shape.
  • BD-1.4 Audit + migrate org-wide GH Actions secrets to Doppler. Deliverables: unipuka-infra-ops/docs/audit/secrets-audit.md listing every GH secret used by soa workflows today (FIREBASE_CREDENTIALS, SENTRY_AUTH_TOKEN, etc.), the target Doppler config + key path, and migration ownership. Then mirror values into Doppler base or per-env config. Once a key is mirrored + new workflow consumes it via doppler run, the org-level GH secret is removed. (3pt)
  • AC: Audit doc exists with status per secret (migrated / WIP / blocked). All migrated secrets removed from GH org-level scope.
  • V&V: For each migrated secret, value in Doppler matches the previous GH secret. gh secret list only confirms the secret name exists BEFORE removal (GH never returns secret values), so compare values in a one-off workflow run that reads both the GH secret and doppler secrets get, then remove the GH copy.
  • BD-1.5 Populate per-tenant Doppler config keys (oep-stg, mansety-prd, us-prd) with all Laravel runtime env vars currently in AWS SSM + the existing ECS task-definitions in soa/task-definitions/. Includes: APP_KEY, DB_*, REDIS_*, AWS_* (now DO Spaces equivalents), Sentry/Bugsnag DSNs, Pusher, Typesense, Firebase. Key names must stay identical to the current env var names so they flow unchanged through ESO ExternalSecret -> K8s Secret -> envFrom. (3pt)
  • AC: Each config has >= N keys matching the ECS task-def env list (N captured during audit).
  • V&V: doppler secrets --config oep-stg matches a snapshot of aws ecs describe-task-definition --task-definition oep/api.
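
The ESO ExternalSecret -> K8s Secret -> envFrom path from BD-1.5 could be sketched as below for oep-stg. The store name, the `laravel-env` name, the `dopplerToken` key inside the token Secret, and the refresh interval are assumptions; the API version should match whatever the installed ESO release serves.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: doppler-oep-stg              # hypothetical store name
spec:
  provider:
    doppler:
      auth:
        secretRef:
          dopplerToken:
            name: doppler-token-oep-stg
            namespace: external-secrets
            key: dopplerToken        # assumed key inside the TF-created Secret
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: laravel-env                  # hypothetical
  namespace: oep-stg
spec:
  refreshInterval: 1h                # illustrative
  secretStoreRef:
    kind: ClusterSecretStore
    name: doppler-oep-stg
  target:
    name: laravel-env                # K8s Secret consumed by workloads via envFrom
  dataFrom:
    - find:
        name:
          regexp: ".*"               # pull every key in the Doppler config
```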

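The BD-1.3 V&V step could be run from a throwaway workflow along these lines. The workflow name is hypothetical, and installing the CLI through `dopplerhq/cli-action` is an assumption about the runner setup; the `doppler secrets download` flags are the ones quoted in the ticket.

```yaml
name: doppler-token-smoke-test       # hypothetical throwaway workflow
on: workflow_dispatch
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: dopplerhq/cli-action@v3   # assumption: installs the Doppler CLI
      - name: Download oep-stg secrets as JSON
        env:
          DOPPLER_TOKEN_OEP_STG: ${{ secrets.DOPPLER_TOKEN_OEP_STG }}
        run: |
          # print key names only, never values
          doppler secrets download --token="$DOPPLER_TOKEN_OEP_STG" --no-file --format json | jq 'keys'
```
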
BD-2: Documentation platform (4 tickets)

Set up a single source-of-truth, browsable docs site rendered with mkdocs-material. Hosts all runbooks, ADRs, audit docs, cutover checklists. Markdown lives in unipuka-infra-ops/docs/; the same files are usable as plain markdown in-repo AND rendered as a static site. Lands early so every subsequent milestone writes its docs into this structure from day one.

  • BD-2.1 Set up mkdocs-material in unipuka-infra-ops/. Add mkdocs.yml at repo root with: site name Unipuka Platform Docs, theme material, navigation (Runbooks, ADRs, Audit, Cutover, Reference), search plugin, mermaid diagrams (pymdownx.superfences + mermaid custom_fences), mkdocs-awesome-pages-plugin for auto-nav, mkdocs-git-revision-date-localized-plugin. requirements.txt pins versions. Bootstrap pages: docs/index.md (landing), docs/runbooks/index.md, docs/adr/index.md (ADR index), docs/audit/index.md. (2pt)
  • AC: mkdocs serve renders site locally with all bootstrap pages in nav.
  • V&V: Touch a markdown file under docs/, observe live reload. Mermaid diagram in a sample page renders.
  • BD-2.2 Add CI workflow unipuka-infra-ops/.github/workflows/docs-build.yml: on PR -> mkdocs build --strict (fails on broken links / nav warnings), markdown linter (markdownlint-cli). On push to master -> build + deploy to hosting target (BD-2.3). Caches pip deps. (2pt)
  • AC: PR with a markdown link to a missing page turns red on mkdocs build --strict.
  • V&V: Open a deliberate broken-link PR, CI fails. Fix it, CI green.
  • BD-2.3 Pick + provision hosting. Options: a) GitHub Pages (free for public repos; infra-ops is private, so this means org-level "private pages", which needs GH Enterprise), b) DO App Platform static site (free starter tier, custom domain via Cloudflare, gated by Cloudflare Access for internal-only), c) DOKS in-cluster behind the shared Gateway with Cloudflare Access. Recommend (b): DO App Platform static site at docs.unipuka.app, deploy_on_push from master, Cloudflare Access policy restricts to unipuka.com Google Workspace identities OR GH org members via OAuth. Add resource to unipuka-infra-do/oep-infra/ (new docs_site.tf or extend the existing module). (3pt)
  • AC: https://docs.unipuka.app reachable, Cloudflare Access challenge fires for non-team members.
  • V&V: Logged-in team member loads docs site; logged-out / non-team gets 403.
  • BD-2.4 ADR template + index + first ADR moved into docs site. Template (docs/adr/_template.md): Context, Decision, Status, Consequences, Alternatives. Index (docs/adr/index.md) auto-lists ADRs via awesome-pages. Move BD-0.3's 0001-secret-management.md into docs/adr/. Add a runbook stub for every operator runbook expected by this project (matches list in Section 6.1): bootstrap.md, add-tenant.md, rotate-secrets.md, promote-image.md, test-template-on-staging.md, argocd-break-glass.md, incident-response.md. Each stub is a placeholder with the eventual heading skeleton; the milestone that creates the matching feature fills it in. (1pt)
  • AC: ADR index lists 0001; all 7 runbook stubs exist with template skeleton.
  • V&V: Click each runbook from rendered docs nav -> page loads with ## Purpose / ## When to use / ## Steps / ## Rollback / ## V&V sections.
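
A minimal mkdocs.yml matching the BD-2.1 description might start like this. It is a sketch: navigation and theme options are left out, and the plugin set simply mirrors the ticket text.

```yaml
site_name: Unipuka Platform Docs
theme:
  name: material
plugins:
  - search
  - awesome-pages
  - git-revision-date-localized
markdown_extensions:
  - pymdownx.superfences:
      custom_fences:
        - name: mermaid
          class: mermaid
          format: !!python/name:pymdownx.superfences.fence_code_format
```

The git-revision-date-localized plugin reads commit dates from git history, so CI checkouts need full history rather than a shallow clone.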

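The PR half of docs-build.yml (BD-2.2) could be sketched as follows. The Python version is an assumption, and the markdownlint step and the deploy-on-master step are deliberately omitted here.

```yaml
name: docs-build
on:
  pull_request:
  push:
    branches: [master]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0             # full history for git-revision-date-localized
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"     # assumed version
          cache: pip
      - run: pip install -r requirements.txt
      - run: mkdocs build --strict   # fails on broken links / nav warnings
```
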
BD-3: Platform bootstrap on DOKS (14 tickets)

  • BD-3.1 TF: Create K8s namespaces (argocd, external-secrets, monitoring, oep-stg, mansety-prd, us-prd) via the kubernetes provider in [unipuka-infra-do/oep-infra/k8s_namespaces.tf](unipuka-infra-do/oep-infra/k8s_namespaces.tf). Label each with team, tier, tenant, env. (1pt)
  • AC: kubectl get ns lists all 6 with labels.
  • V&V: TF plan after apply = empty diff.
  • BD-3.2 TF: Create DOCR imagePullSecret in each tenant ns + monitoring + external-secrets ns. Uses digitalocean_kubernetes_cluster.main.kube_config[0] -> kubernetes provider. Name = do-registry. (2pt)
  • AC: kubectl -n oep-stg get secret do-registry -o yaml shows dockerconfigjson type.
  • V&V: Pod with imagePullSecrets: [{name: do-registry}] in tenant ns pulls a private DOCR image.
  • BD-3.3 TF: Create Doppler service-token K8s Secret per tenant (doppler-token-oep-stg, etc.) in the external-secrets namespace via the kubernetes provider in unipuka-infra-do/oep-infra/k8s_secrets.tf. Secret value sourced from var.doppler_service_token_<env> (sensitive input from CI/local env). NOT pulled from Doppler API to keep tokens out of Doppler-provider state. (1pt)
  • AC: 3 Secrets in external-secrets ns containing service token.
  • V&V: ESO ClusterSecretStore (BD-3.6) references these Secrets and reports Status: Valid.
  • BD-3.4 TF: Create Argo CD repo credentials Secret in argocd namespace via kubernetes provider. Pick: GitHub App (preferred over a PAT because the App credential is long-lived with no expiry). App ID + private key + installation ID supplied as TF sensitive vars (var.argocd_github_app_id, var.argocd_github_app_private_key, var.argocd_github_app_installation_id). Secret labeled argocd.argoproj.io/secret-type: repo-creds so Argo CD picks it up automatically. (2pt)
  • AC: argocd repo list from inside cluster shows unipuka-infra-ops as connected.
  • V&V: Push a test commit to infra-ops master; Argo CD picks up the change within sync interval.
  • BD-3.5 TF: Install Argo CD via helm_release in unipuka-infra-do/oep-infra/argocd.tf. Pinned chart version (e.g. argo-cd-7.x). Values inline OR from [unipuka-infra-ops/bootstrap/argocd-values.yaml](unipuka-infra-ops/bootstrap/argocd-values.yaml) (read via file()). Then kubernetes_manifest resource applies the root Application from unipuka-infra-ops/bootstrap/root-application.yaml. After this, TF stays the source of truth for Argo CD's own chart version + values; Argo CD syncs everything under bootstrap/, platform/, apps/ from infra-ops. (3pt)
  • AC: terraform apply lands a working Argo CD; root Application + child Applications all Synced/Healthy. Re-applying TF is idempotent (zero diff).
  • V&V: argocd app list shows root, platform, <tenant>-laravel x3. Bump Helm chart version in TF, run apply, observe rolling upgrade of Argo CD itself with no drift.
  • BD-3.6 Platform chart: ESO operator install (Helm subchart in [unipuka-infra-ops/platform/charts/external-secrets/](unipuka-infra-ops/platform/charts/external-secrets/)) + one ClusterSecretStore CR per Doppler config (3 total) referencing the doppler-token-<env> Secret. (2pt)
  • AC: kubectl get clustersecretstore shows 3 stores all Valid.
  • V&V: Create a probe ExternalSecret in oep-stg ns referencing the oep-stg ClusterSecretStore + a known Doppler key; verify synced Secret exists with the right value.
  • BD-3.7 Platform chart: shared Gateway resource with THREE listener sections: a) HTTPS :443 (per-tenant hostnames from values), b) HTTPS :6001 (per-tenant ws hostnames), c) HTTP :80 with a redirect HTTPRoute (per Gateway-test pattern). Both HTTPS listeners TLS terminate with Cloudflare Origin certs. allowedRoutes.namespaces.selector enforces tenant isolation. (3pt)
  • AC: kubectl describe gateway -n <gateway-ns> shows 3 listeners (HTTP/HTTPS/WS); Status: Accepted. DO LB created.
  • V&V: From outside CF, curl https://api-staging.unipuka.app/health resolves (404 acceptable pre-app-deploy). curl -i http://api-staging.unipuka.app/ returns 301 -> https.
  • BD-3.8 Platform chart: TLS Secret per hostname for the Cloudflare Origin certs. Pulled via ESO from Doppler base config (keys: CF_ORIGIN_CERT_<host>, CF_ORIGIN_KEY_<host>). The cert files currently committed at [unipuka-infra-ops/cloudflare-unipuka-cert.pem](unipuka-infra-ops/cloudflare-unipuka-cert.pem) must not stay in the repo; delete them in this ticket. (2pt)
  • AC: kubectl -n <gateway-ns> get secret cloudflare-origin-<host> exists with valid cert.
  • V&V: TLS handshake on the Gateway listener uses the right SAN.
  • BD-3.9 Platform chart: NetworkPolicy templates per tenant ns. Default-deny ingress + egress; explicit allow for: a) intra-ns pod-to-pod, b) DNS to kube-system, c) DO managed MySQL + Redis (intra-VPC IPs via ipBlock), d) egress to internet for Doppler API, Sentry/Bugsnag, Typesense, Pusher, Firebase, Cloudflare Origin (use FQDN-resolver pattern or wide CIDR + label-based). (3pt)
  • AC: Pod in oep-stg can curl https://api.doppler.com, MySQL, Redis; pod cannot curl pod in mansety-prd.
  • V&V: Run kubectl exec probe pod tests for each allow + deny path.
  • BD-3.10 Platform chart: ResourceQuota + LimitRange per tenant ns from CLAUDE.md numbers (CPU/mem/pods cap, per-container defaults). Tunable per tenant via Helm values. (1pt)
  • AC: kubectl describe quota -n oep-stg shows hard limits.
  • V&V: Apply a Pod requesting > quota; admission denies it with the expected reason.
  • BD-3.11 Platform chart: install kube-prometheus-stack Helm subchart in monitoring ns. Add a toleration + nodeSelector so it lands on a dedicated node tainted dedicated=monitoring:NoSchedule (taint + label one node manually first or add a node pool via TF). ServiceMonitor + PrometheusRule CRDs available. Grafana ingress via the shared Gateway with a dedicated grafana.<base> hostname. (3pt)
  • AC: Prometheus pods Running on the dedicated node; Grafana reachable on its hostname; Prometheus scrapes itself + Alertmanager.
  • V&V: up{job="kubernetes-pods"} query returns non-empty in Prometheus UI.
  • BD-3.12 Platform chart: install Loki + Promtail/Grafana Alloy in monitoring ns. Wire Grafana datasource. Logs from all tenant + platform ns flow into Loki. Retention 7d for staging, 30d for prod. Storage: DO Spaces backend or in-cluster persistent volume (pick + document). (3pt)
  • AC: Grafana Explore on Loki datasource returns logs labeled {namespace="oep-stg"}.
  • V&V: Trigger a Pod log line in oep-stg; observe within 30s in Grafana.
  • BD-3.13 Platform chart: install OpenCost. Wire Prometheus datasource. (1pt)
  • AC: OpenCost UI reachable, shows per-namespace cost projection within 24h after install.
  • V&V: Compare OpenCost monthly projection vs DO billing for the same window (rough match).
  • BD-3.14 Argo CD auth: SSO via GitHub OAuth through Dex (bundled with argo-cd Helm chart). Why this option (vs alternatives):
  • Picked: GitHub OAuth via Dex. Team identity already in unipuka GH org. RBAC mapped to GH teams. No separate user store to maintain. Built-in admin password kept only for break-glass.
  • Rejected: local accounts (per-user passwords) -> N more secrets to rotate, manual offboarding.
  • Rejected: Google OAuth direct -> we use GH already; one fewer IdP.
  • Rejected: PR-only token / no SSO -> not viable for a team UI.

Implementation: create a GitHub OAuth App under the unipuka GH org with callback https://argocd.unipuka.app/api/dex/callback. Store client ID + client secret in Doppler base config (keys ARGOCD_DEX_GITHUB_CLIENT_ID, ARGOCD_DEX_GITHUB_CLIENT_SECRET). An ESO ExternalSecret syncs them into the argocd ns as argocd-dex-github. The argocd-cm ConfigMap (managed via the Helm values applied by TF in BD-3.5) wires up the Dex github connector, and argocd-rbac-cm maps GH teams to roles (e.g. unipuka:platform-admins -> role:admin, unipuka:engineers -> role:readonly-prod-edit-stg). Built-in admin is disabled via admin.enabled: false in values (after first apply). Break-glass (re-enable admin, then log in with argocd-initial-admin-secret) is used only if Dex is broken and is documented in docs/runbooks/argocd-break-glass.md. Argo CD UI is exposed via a dedicated HTTPRoute on the shared Gateway (argocd.unipuka.app, protected behind Cloudflare Access as defense-in-depth). (3pt)
  • AC: Team member visits argocd.unipuka.app, redirected to GitHub OAuth, lands authenticated with the role mapped from their GH team. Built-in admin login disabled.
  • V&V: A user not in the unipuka GH org cannot complete the OAuth flow. A user in unipuka:engineers gets read-only on prod Applications, edit on stg.
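
A sketch of the argo-cd Helm values that BD-3.14 implies. The `configs.cm` / `configs.rbac` layout depends on the pinned chart version, the `$secret:key` references assume the argocd-dex-github Secret carries the `app.kubernetes.io/part-of: argocd` label, and the two policy lines for the custom role are one possible reading of "read-only on prod, edit on stg".

```yaml
configs:
  cm:
    url: https://argocd.unipuka.app
    admin.enabled: false             # flip only after the first successful SSO login
    dex.config: |
      connectors:
        - type: github
          id: github
          name: GitHub
          config:
            clientID: $argocd-dex-github:ARGOCD_DEX_GITHUB_CLIENT_ID
            clientSecret: $argocd-dex-github:ARGOCD_DEX_GITHUB_CLIENT_SECRET
            orgs:
              - name: unipuka
  rbac:
    policy.csv: |
      g, unipuka:platform-admins, role:admin
      g, unipuka:engineers, role:readonly-prod-edit-stg
      p, role:readonly-prod-edit-stg, applications, get, tenants/*, allow
      p, role:readonly-prod-edit-stg, applications, *, tenants/oep-stg-laravel, allow
```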

BD-4: laravel-app Helm chart (11 tickets)

  • BD-4.1 Initialise [unipuka-infra-ops/apps/laravel/](unipuka-infra-ops/apps/laravel/) Helm chart skeleton with Chart.yaml, default values.yaml, empty templates/, values-oep-stg.yaml, values-mansety-prd.yaml, values-us-prd.yaml. Define values schema (values.schema.json) for: image, replicas, resources, hostnames (api + ws), redis broker DSN ref, podDisruptionBudget thresholds, hpa thresholds. (2pt)
  • AC: helm lint apps/laravel -f apps/laravel/values-oep-stg.yaml passes for all 3 tenant values files.
  • V&V: helm template ... for each tenant produces non-empty output.
  • BD-4.2 api workload Helm templates. Decision encoded here: keep all-in-one supervisord image but run only nginx + php-fpm via supervisord (no worker/ws/scheduler in this Deployment). Args drive supervisord program subset. Service + HTTPRoute on Gateway listener :443. Readiness probe -> GET /health. Liveness probe -> GET /health. (3pt)
  • AC: Rendered template applies cleanly in cluster, /health returns 200.
  • V&V: kubectl exec into a pod and confirm php artisan runs; nginx + php-fpm both alive.
  • BD-4.3 worker Helm template = Deployment running php artisan horizon. Readiness probe -> php artisan horizon:status. Graceful shutdown: terminationGracePeriodSeconds + Horizon's php artisan horizon:terminate preStop hook. (2pt)
  • AC: Horizon dashboard reachable in-cluster; jobs in tenant Redis queue get processed.
  • V&V: Push a sample job, observe Horizon picks it up; SIGTERM during job lets it finish before pod dies.
  • BD-4.4 websockets (Reverb) Helm template = Deployment running php artisan reverb:start with replicas: 2 default. Reverb config uses Redis broker (BROADCAST_DRIVER=reverb + REVERB_SCALING_ENABLED=true + REVERB_SCALING_REPLICATION=redis env). Service ClusterIP. HTTPRoute on Gateway listener :6001. PDB minAvailable: 1. (3pt)
  • AC: 2 Reverb pods, both receive client connections behind the Gateway listener :6001.
  • V&V: Open 2 WS clients connecting to same hostname, ensure messages broadcast from one are seen by the other regardless of pod routing.
  • BD-4.5 scheduler CronJob template. Schedule */1 * * * * (every minute). concurrencyPolicy: Forbid. successfulJobsHistoryLimit: 1, failedJobsHistoryLimit: 3. Command = php artisan schedule:run. (1pt)
  • AC: CronJob fires every minute, completes < 30s typical.
  • V&V: kubectl get jobs -n <tenant> shows runs every minute; logs show Running scheduled command: output.
  • BD-4.6 PreSync migrate Job template with Argo CD hook annotations: argocd.argoproj.io/hook: PreSync, argocd.argoproj.io/hook-delete-policy: BeforeHookCreation. Command = php artisan migrate --force --isolated. activeDeadlineSeconds: 1800. backoffLimit: 0. (2pt)
  • AC: New release triggered via Argo Sync -> Job runs before any Deployment update; failed Job -> Argo Sync fails with Job logs visible.
  • V&V: Simulate a deliberate migration failure -> Argo Sync reports Degraded; existing Deployments untouched.
  • BD-4.7 ExternalSecret template per workload. Each refers to the tenant's ClusterSecretStore (BD-3.6) and pulls all keys with dataFrom.find.name.regexp: ".*" or an explicit list. Target K8s Secret <release>-app-env is consumed via envFrom in api/worker/ws/scheduler/migrate. refreshInterval: 60s. (2pt)
  • AC: kubectl get externalsecret -n oep-stg -> Status: SecretSynced.
  • V&V: Edit a key in Doppler oep-stg config; within 60-90s see the K8s Secret value change.
  • BD-4.8 PodDisruptionBudget + HorizontalPodAutoscaler templates. PDBs: api minAvailable 50%, ws minAvailable 1, worker minAvailable 0 (worker = batch, OK to drain). HPA: api scales on CPU (target 70%) + memory; ws is meant to scale on connections, but v1 uses manual replicas only. Toggleable per tenant. (1pt)
  • AC: kubectl get hpa -n oep-stg shows api HPA active.
  • V&V: Load-gen on api pushes HPA replicas up; reset back to base after.
  • BD-4.9 Per-tenant values files. Encode from existing AWS task-defs + frontend env table in [unipuka-infra-do/oep-infra/tenants.tf](unipuka-infra-do/oep-infra/tenants.tf):
  • oep-stg: 1 replica per workload, api.unipuka.app/api-staging.unipuka.app hostnames (decide canonical), small resource requests.
  • mansety-prd: 2 replicas api + ws, 1 worker (Horizon supervisors handle internal concurrency), api.mansety.com hostname, prod-sized resources.
  • us-prd: 2 replicas, api.us-academy.net hostname, prod-sized resources. (3pt)
  • AC: helm template per tenant matches expected hostnames + replicas.
  • V&V: Dry-run apply of rendered manifests in respective ns has no admission errors.
  • BD-4.10 Add /health route to [soa/](soa/). Returns 200 + JSON with db: ok, redis: ok, app: ok. Used by readiness/liveness probes. Update [soa/Dockerfile](soa/Dockerfile) if any change needed. (2pt)
  • AC: php artisan serve locally, curl /health returns 200.
  • V&V: Probe in cluster Pod returns 200 < 200ms.
  • BD-4.11 chart-testing in [unipuka-infra-ops/.github/workflows/helm-lint.yml](unipuka-infra-ops/.github/workflows/helm-lint.yml). Runs ct lint --target-branch master. Triggered on PR. (1pt)
  • AC: Workflow green on a sample PR touching the chart.
  • V&V: Deliberately break the chart, PR turns red.
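
The BD-4.9 acceptance check (rendered values match the expected hostnames + replicas) could be scripted roughly as below. This is a minimal sketch, not the chart's actual values schema: the nested key layout (`api.replicas`, `api.hostname`) and the function name are hypothetical, and only the replica counts and hostnames come from BD-4.9. The oep-stg hostname is skipped because the canonical one is still undecided, and the us-prd "2 replicas" is assumed to mean the api workload.

```python
# Expected per-tenant values, taken from BD-4.9. The dict layout is hypothetical.
EXPECTED = {
    # Canonical oep-stg hostname is undecided in the plan, so it is not checked.
    "oep-stg": {"replicas": {"api": 1, "worker": 1, "ws": 1}, "hostname": None},
    "mansety-prd": {"replicas": {"api": 2, "ws": 2, "worker": 1}, "hostname": "api.mansety.com"},
    # Assumption: "2 replicas" for us-prd refers to the api workload.
    "us-prd": {"replicas": {"api": 2}, "hostname": "api.us-academy.net"},
}


def check_tenant(tenant, values):
    """Return a list of human-readable mismatches; an empty list means parity."""
    expected = EXPECTED[tenant]
    problems = []
    for workload, want in expected["replicas"].items():
        got = values.get(workload, {}).get("replicas")
        if got != want:
            problems.append(f"{tenant}: {workload}.replicas is {got}, expected {want}")
    if expected["hostname"] is not None:
        got_host = values.get("api", {}).get("hostname")
        if got_host != expected["hostname"]:
            problems.append(f"{tenant}: api.hostname is {got_host}, expected {expected['hostname']}")
    return problems
```

Feeding it the parsed contents of each values-<tenant>.yaml and failing CI on a non-empty list would turn the AC into a repeatable check.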

BD-5: CI pipeline in soa repo (6 tickets)

  • BD-5.1 New workflow soa/.github/workflows/do-build.yml. Triggers: a) push to master, b) workflow_dispatch with optional ref input (commit SHA or tag, defaults to current branch). Jobs: validate-jsonnet (kept), bake (checkout the resolved ref) -> lint (pint via doppler run) -> test (phpunit via doppler run with DOPPLER_TOKEN_OEP_STG for parity envs) -> trivy advisory scan -> push image to registry.digitalocean.com/oep/api:<git-sha> + registry.digitalocean.com/oep/api:v<semver>. DOCR login via DO API token from DIGITALOCEAN_ACCESS_TOKEN GH secret. Doppler integration: at workflow start, doppler-cli configured with DOPPLER_TOKEN_* to fetch any keys needed at CI time (Firebase creds for tests, Sentry token). The ref input lets us rebuild + republish an arbitrary historical commit if its image was GC'd from DOCR (e.g. for a hot rollback past the retention window). (4pt)
  • AC: Workflow green end-to-end on master push; dispatch with ref=<historical-sha> rebuilds + repushes the same SHA tag.
  • V&V: doctl registry repository list-tags api shows both tags; docker pull works from outside cluster (with creds); dispatch with old ref produces a tag equal to the older <git-sha>.
  • BD-5.2 Auto-versioning logic. On push to master, conventional-commit parser bumps semver (major/minor/patch) and creates GH release with auto-notes (mirrors [unipuka-github-workflows/.github/workflows/release.yml](unipuka-github-workflows/.github/workflows/release.yml) pattern). Outputs <semver> -> consumed by image push step. (2pt)
  • AC: A feat: commit creates a minor bump release on master push.
  • V&V: gh release list -R unipuka/soa shows the new tag.
  • BD-5.3 New workflow [soa/.github/workflows/do-deploy.yml](soa/.github/workflows/do-deploy.yml). Trigger: workflow_dispatch only. Inputs: tag (image tag, default = latest GH release tag), tenant (oep-stg/mansety-prd/us-prd). Steps: a) checkout unipuka-infra-ops, b) yq -i '.image.tag = strenv(TAG)' apps/laravel/values-${tenant}.yaml, c) configure git user, d) commit chore(deploy): bump <tenant> to <tag>, e) push to unipuka-infra-ops/master. Slack notification on success/failure. Direct commit (no PR). (3pt)
  • AC: Dispatch with tag=v0.1.0, tenant=oep-stg -> commit appears in infra-ops; Argo picks up + syncs.
  • V&V: git -C unipuka-infra-ops log -1 master --pretty=full shows the commit; argocd app get oep-stg reflects new image tag within 1-2 min.
  • BD-5.4 GH Environment protection rules in unipuka/soa. Create envs oep-stg, mansety-prd, us-prd. Required reviewers on prod envs (mansety-prd, us-prd). do-deploy.yml declares environment: ${{ inputs.tenant }} so the gate fires. (1pt)
  • AC: Manual dispatch to mansety-prd pauses for reviewer approval.
  • V&V: Reviewer approves -> workflow proceeds; declines -> workflow aborts.
  • BD-5.5 DOCR retention policy via TF in unipuka-infra-do/oep-infra/registry.tf. Cost driver: DOCR Starter is free up to 500MB; Basic is $5/mo up to 5GB; Pro is $20/mo up to 100GB. soa/api image ~500-700MB compressed. 30 tags ~= 15-21GB as a naive upper bound (each tag = unique manifest + delta layers; layer sharing makes the effective size lower in practice). Goal: keep last 30 tagged images per repo (~1 month of daily releases), which is the rollback window we realistically need; older tags are rebuilt on demand via the BD-5.1 ref input. DOCR doesn't expose retention as native TF args, so implement via a GH Action on a weekly cron that lists tags, sorts by push date, deletes all but the newest 30, then runs doctl registry garbage-collection start --include-untagged-manifests. Document in unipuka-infra-ops/docs/promote-image.md. (2pt)
  • AC: After 30 pushes, the 31st push (or next weekly GC, whichever first) removes the oldest tag; untagged manifests are absent from DOCR after the next weekly GC.
  • V&V: doctl registry repository list-manifests api | wc -l <= 30 + a small buffer; no untagged manifests after GC run. Monthly DOCR bill confirms Basic-tier ceiling not exceeded.
  • BD-5.6 Sentry release integration. In do-build.yml, after image push: sentry-cli releases new -p soa $SEMVER + set-commits --auto + finalize. In do-deploy.yml, per tenant: deploys ... new -e <env>. Token from Doppler base. Decision: deploys are fired in do-deploy.yml (per tenant) instead of do-build.yml (per build) so the deploy environment is correct. (2pt)
  • AC: Sentry release exists per build; per-env deploy markers exist per dispatch.
  • V&V: Sentry UI shows release timeline with commits + deploys per env.
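
The auto-versioning rule in BD-5.2 (conventional commits drive a major/minor/patch bump) can be illustrated with a small sketch. It is not the logic of the referenced release.yml, which is not shown here; it assumes the common Conventional Commits mapping (`feat` -> minor, `fix` -> patch, `!` or a `BREAKING CHANGE:` footer -> major), and the function names are hypothetical.

```python
import re

# Header shape "type(scope)!: subject" from the Conventional Commits convention.
HEADER = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?(?P<bang>!)?: ")


def bump_level(messages):
    """Return 'major', 'minor', 'patch', or None for a list of commit messages."""
    level = None
    for msg in messages:
        match = HEADER.match(msg)
        if match is None:
            continue  # non-conventional commits do not trigger a bump in this sketch
        if match.group("bang") or "BREAKING CHANGE:" in msg:
            return "major"
        if match.group("type") == "feat":
            level = "minor"
        elif match.group("type") == "fix" and level is None:
            level = "patch"
    return level


def next_version(current, messages):
    """Apply the bump to a 'MAJOR.MINOR.PATCH' string; unchanged if nothing qualifies."""
    major, minor, patch = (int(part) for part in current.split("."))
    level = bump_level(messages)
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    if level == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return current
```

With this mapping, a single `feat:` commit on top of 0.1.0 yields 0.2.0, which is the minor bump the BD-5.2 AC expects.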

BD-6: oep-stg end-to-end deploy + parity verify (7 tickets)

  • BD-6.1 First-time Argo CD bring-up. Operator runs terraform apply in unipuka-infra-do/oep-infra/ which executes the new argocd.tf (BD-3.5) for the first time. Document the prerequisites (kubeconfig set to oep-prd-cluster, all required TF input vars exported) in unipuka-infra-ops/docs/bootstrap.md. (1pt)
  • AC: Argo CD UI reachable, root Application synced.
  • V&V: argocd app list lists root, platform. Both Synced/Healthy. Subsequent terraform apply is a no-op.
  • BD-6.2 Apply ApplicationSet for laravel-app (committed in infra-ops, automatically picked up by root Application from BD-6.1). Generates 3 Applications (oep-stg-laravel, mansety-prd-laravel, us-prd-laravel). ALL applications have syncPolicy.automated.prune = true + syncPolicy.automated.selfHeal = true (per Section 3 locked decision). infra-ops master is the only source of truth; removed manifests = deleted resources. Branch protection on infra-ops with required reviewer for apps/laravel/values-*-prd.yaml is the safety net. (2pt)
  • AC: 3 Applications visible in Argo, all auto-syncing with prune + selfHeal on.
  • V&V: Touch a file under apps/laravel/templates/, push to master, observe each tenant Application re-sync within 1-2 min. Delete a Service from templates/, push, observe it disappear from each tenant ns.
  • BD-6.3 Argo CD AppProjects applied. platform AppProject with cluster-scoped permissions limited to argocd|external-secrets|monitoring|<gateway-ns> ns. tenants AppProject limited to oep-stg|mansety-prd|us-prd ns + allowlist of CRDs (no ClusterRole, no namespace creation). (1pt)
  • AC: Attempting to deploy a ClusterRole via tenants AppProject is denied.
  • V&V: argocd proj list shows both.
  • BD-6.4 First end-to-end deploy of laravel-app to oep-stg. Sequence: a) confirm Doppler oep-stg populated, b) trigger do-build.yml on a clean master commit, c) trigger do-deploy.yml dispatch tag=v0.1.0 tenant=oep-stg, d) watch Argo sync (PreSync migrate runs first), e) confirm all 4 workloads Healthy. (3pt)
  • AC: kubectl get pods -n oep-stg all Running. /health returns 200 via Gateway. Horizon dashboard reachable. Reverb accepting WS.
  • V&V: smoke matrix in BD-6.5 passes.
  • BD-6.5 Parity smoke matrix for oep-stg. Create a smoke-test script [unipuka-infra-ops/scripts/smoke.sh](unipuka-infra-ops/scripts/smoke.sh) that hits N critical endpoints (login, list courses, websocket connect, queue job dispatch, scheduled command run). Run against the new DOKS oep-stg deploy. (3pt)
  • AC: Script returns 0 exit code.
  • V&V: Each endpoint's response matches the AWS-side baseline (status code + JSON shape).
  • BD-6.6 Per-workload alert rules. PrometheusRule per workload in the chart's [alertrules.yaml](unipuka-infra-ops/apps/laravel/templates/alertrules.yaml): api 5xx ratio > 1% (5m), worker queue stalled > 10min, ws active connections drop > 50% (5m), scheduler missed > 2 consecutive runs, migrate Job failure. Routed to Slack via Alertmanager. (3pt)
  • AC: Alerts visible in Alertmanager UI; trigger each manually to verify routing.
  • V&V: Each alert ends up in the configured Slack channel within 1 min of firing.
  • BD-6.7 Bugsnag release integration (kept alongside Sentry per CLAUDE.md). [soa/](soa/) release script invoked from do-build.yml, source maps + release markers uploaded. (1pt)
  • AC: Bugsnag dashboard shows release per build.
  • V&V: Trigger a deliberate exception in oep-stg -> Bugsnag captures it with the right release ID.
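
The BD-6.5 parity check compares each endpoint against the AWS-side baseline on status code + JSON shape, not on values. A minimal sketch of what "same shape" could mean is below. The helper names are hypothetical, and it assumes lists are homogeneous so the first element stands in for the rest.

```python
def json_shape(value):
    """Reduce a decoded JSON value to its structure: keys and type names, no data."""
    if isinstance(value, dict):
        return {key: json_shape(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        # Assumption: lists are homogeneous, so the first element represents all items.
        return [json_shape(value[0])] if value else []
    return type(value).__name__


def matches_baseline(baseline, candidate):
    """Each argument is a (status_code, decoded_json_body) pair."""
    same_status = baseline[0] == candidate[0]
    same_shape = json_shape(baseline[1]) == json_shape(candidate[1])
    return same_status and same_shape
```

One caveat of this approach: an empty list and a populated list have different shapes, so both sides should be queried against comparable data.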

BD-7: mansety-prd + us-prd deploy idle (5 tickets)

  • BD-7.1 Tune sizing in values-mansety-prd.yaml + values-us-prd.yaml to match current ECS task-def CPU/mem footprint (api 2 replicas each running 0.5 vCPU / 1Gi). Set replica counts + HPA bounds. (2pt)
  • AC: helm template renders match the agreed-upon sizing doc.
  • V&V: Resource sum stays within tenant ResourceQuota.
  • BD-7.2 Populate Doppler mansety-prd + us-prd configs with secrets from AWS SSM. Cross-reference [soa/task-definitions/mansety/api.json](soa/task-definitions/mansety/api.json) + us/api.json. (3pt)
  • AC: Per-config secret count matches AWS SSM count (audit doc).
  • V&V: doppler secrets --config mansety-prd | wc -l matches expected.
  • BD-7.3 Deploy laravel-app to mansety-prd and us-prd namespaces (manual do-deploy.yml dispatch per tenant, follow GH Environment approval). No live traffic yet; traffic stays on AWS. (2pt)
  • AC: All workloads Running per tenant.
  • V&V: BD-6.5 parity smoke matrix passes per tenant.
  • BD-7.4 Full parity test matrix per prod tenant. Run [soa/](soa/) integration test suite + Scribe API contract tests against DO mansety-prd and us-prd (read-only mode where possible). Doppler config inherits the test-only APP_KEY if needed. (3pt)
  • AC: Test suite green per tenant.
  • V&V: Compare critical endpoint responses byte-for-byte against the AWS-side baseline; document any drift.
  • BD-7.5 Per-tenant cutover-precondition checklist document. [unipuka-infra-ops/docs/cutover-precondition-<tenant>.md](unipuka-infra-ops/docs/cutover-precondition-mansety-prd.md). Items: pods healthy 7d, alerts quiet 7d, parity matrix green, data-migration-DM6 done, secrets snapshot taken, DNS rollback plan staged. Owned by + handed to the per-tenant cutover Linear project. (1pt)
  • AC: 3 checklist docs committed (oep-stg, mansety-prd, us-prd).
  • V&V: Linked from the new cutover project descriptions when created.
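
BD-7.2 audits parity by secret count. Comparing key names is a slightly stricter variant that also catches a renamed or mistyped key when the counts happen to match. The sketch below is one way to express that, under the assumption that both key lists have already been exported to plain lists of names; the function name and the example keys in the usage note are hypothetical.

```python
def key_drift(ssm_keys, doppler_keys):
    """Compare two collections of secret names and report what differs."""
    ssm, doppler = set(ssm_keys), set(doppler_keys)
    return {
        "missing_in_doppler": sorted(ssm - doppler),
        "extra_in_doppler": sorted(doppler - ssm),
    }
```

For example, `key_drift(["APP_KEY", "DB_HOST"], ["APP_KEY", "DB_HOSTNAME"])` reports `DB_HOST` as missing and `DB_HOSTNAME` as extra even though both sides hold two keys.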

BD-8: Cleanup + handoff (3 tickets)

  • BD-8.1 Update [CLAUDE.md](CLAUDE.md) repo map and the Kubernetes section to reflect the final state: infra-ops layout, AppProjects (platform + tenants), Doppler structure, Gateway listeners (:80/:443/:6001), monitoring stack, helm chart path. (1pt)
  • BD-8.2 Operator runbooks committed in [unipuka-infra-ops/docs/](unipuka-infra-ops/docs/): bootstrap.md, add-tenant.md, rotate-secrets.md, promote-image.md, incident-response.md (alert -> runbook map). (3pt)
  • AC: Each runbook covers happy + sad paths + rollback steps.
  • V&V: Pair with an on-call engineer, run one runbook end-to-end on staging.
  • BD-8.3 Spin up 3 per-tenant cutover Linear projects (AWS to DO Backend Cutover: oep-stg, ... mansety-prd, ... us-prd). Link from this project. Hand off the cutover-precondition checklists. (1pt)

8. Dependencies summary

flowchart TD
  BD0[BD-0 Planning]
  BD1[BD-1 Doppler foundation]
  BD2[BD-2 Documentation platform]
  BD3[BD-3 Platform bootstrap]
  BD4[BD-4 Laravel Helm chart]
  BD5[BD-5 CI pipeline in soa]
  BD6[BD-6 oep-stg deploy]
  BD7[BD-7 prd tenants deploy idle]
  BD8[BD-8 Cleanup + handoff]

  BD0 --> BD1
  BD0 --> BD2
  BD1 --> BD3
  BD2 -.->|enables docs| BD3
  BD3 --> BD4
  BD3 --> BD5
  BD4 --> BD6
  BD5 --> BD6
  BD6 --> BD7
  BD7 --> BD8

BD-2 (docs) runs in parallel with BD-1/BD-3; it does not block BD-3 but every milestone from BD-3 onward writes its docs into the site that BD-2 provisions.

Critical path: BD-0 -> BD-1 -> BD-3 -> (BD-4 and BD-5 in parallel) -> BD-6 -> BD-7 -> BD-8.
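
The ordering above can be derived mechanically from the flowchart's blocking edges. The sketch below uses Python's standard-library `graphlib` to group milestones into waves that can run in parallel. The `BLOCKED_BY` table and the function name are illustrative, and the dotted BD-2 -> BD-3 edge is deliberately left out because BD-2 does not block BD-3.

```python
from graphlib import TopologicalSorter

# Blocking edges from the flowchart: each key waits for every milestone in its set.
# The dotted BD-2 -> BD-3 "enables docs" edge is omitted because it is non-blocking.
BLOCKED_BY = {
    "BD-1": {"BD-0"},
    "BD-2": {"BD-0"},
    "BD-3": {"BD-1"},
    "BD-4": {"BD-3"},
    "BD-5": {"BD-3"},
    "BD-6": {"BD-4", "BD-5"},
    "BD-7": {"BD-6"},
    "BD-8": {"BD-7"},
}


def execution_waves(blocked_by):
    """Group milestones into waves; everything in one wave can run in parallel."""
    sorter = TopologicalSorter(blocked_by)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

Run on `BLOCKED_BY`, this puts BD-1 and BD-2 in the same wave and BD-4 and BD-5 in the same wave, matching the parallelism called out in this section.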

9. Risks + mitigations

  • Doppler service token leak. Mitigation: tokens are generated manually in the Doppler UI (BD-1.3), so they never appear in the Doppler-provider portion of TF state. They are stored in 3 places only: a) Doppler itself, b) DOPPLER_TOKEN_* GH secrets at the unipuka/soa repo level (not org-level), c) K8s Secret in external-secrets ns. Because TF bootstraps that K8s Secret, TF state does hold its value (sealed via DO Spaces backend encryption). Rotation runbook in BD-8.2.
  • Migrate Job blocks sync. PreSync hook has 30min timeout + backoffLimit 0. On failure, Argo Sync stays Degraded BUT existing Deployments are untouched (the new revision never applies). From a traffic POV this is auto-rollback: old pods keep serving until the operator either (a) fixes the migration + re-dispatches, or (b) re-dispatches do-deploy.yml with the previous image tag. Alertmanager fires a SyncFailed alert routed to Slack within 1 min. Runbook in BD-8.2.
  • Reverb broker race on multi-replica. Mitigated by Redis broker (REVERB_SCALING_REPLICATION=redis). BD-4.4 V&V explicitly tests cross-pod broadcast.
  • NetworkPolicy too tight breaks egress. Document allow-list per tenant; default-deny applied last after staging-only soak in BD-3.9.
  • Argo prune deletes a resource by accident. Mitigated by infra-ops branch protection: required reviewer on apps/laravel/values-*-prd.yaml + apps/laravel/templates/** paths. Pull request explicitly shows the deletion diff. Argo's "delete pending" prune window + git revert = simple rollback. Acceptable trade-off for keeping master as the only source of truth.
  • DOCR garbage collection deletes referenced image. Retention keeps last 30 tagged images per repo (BD-5.5) which is ~1 month of daily releases and caps storage at the Basic-tier $5/mo footprint. Beyond that window, operator rebuilds + republishes the image from any commit via do-build.yml workflow_dispatch ref input (BD-5.1).
  • GitOps drift between Doppler + Helm values. Secrets MUST flow ONLY through ESO -> ExternalSecret -> K8s Secret -> envFrom; never in Helm values or git. Static config (hostnames, replicas, resource sizing) lives ONLY in Helm values. Document the boundary in BD-4.1.
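
The retention rule behind the DOCR risk above (BD-5.5: sort tags by push date, delete all but the newest 30) reduces to a few lines. This is a sketch of the selection step only, assuming the tag list has already been fetched as (name, pushed-at) pairs. The function name and input shape are hypothetical, and the actual deletion and garbage-collection calls are out of scope here.

```python
from datetime import datetime, timedelta, timezone


def tags_to_delete(tags, keep=30):
    """Return tag names that fall outside the newest `keep` tags.

    `tags` is an iterable of (tag_name, pushed_at) pairs; this input shape is an
    assumption of the sketch, not the format of any particular CLI or API.
    """
    newest_first = sorted(tags, key=lambda tag: tag[1], reverse=True)
    return [name for name, _ in newest_first[keep:]]
```

Since BD-5.1 pushes both a <git-sha> tag and a v<semver> tag per build, counting tags individually keeps fewer than 30 distinct images. Grouping tags by manifest before applying the cut-off would be needed if the window is meant to be 30 images.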

10. Estimated points

  • BD-0: 4pt (BD-0.2 grown to 2pt for discrepancy audit; BD-0.3 ADR shrunk to 1pt)
  • BD-1: 11pt (BD-1.6 dropped)
  • BD-2: 8pt (mkdocs-material docs platform; 4 tickets)
  • BD-3: 30pt (BD-3.5 +1pt for TF helm_release; BD-3.14 Argo CD auth via Dex+GitHub +3pt)
  • BD-4: 22pt
  • BD-5: 14pt
  • BD-6: 14pt
  • BD-7: 11pt
  • BD-8: 5pt
  • Total: ~119pt (~15 person-weeks @ 8pt/week; can compress with parallelism on BD-2 + BD-4 + BD-5).
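
As a quick arithmetic check of the total and the person-week estimate, the per-milestone figures listed above can be summed directly:

```python
# Per-milestone estimates as listed in this section.
POINTS = {
    "BD-0": 4, "BD-1": 11, "BD-2": 8, "BD-3": 30, "BD-4": 22,
    "BD-5": 14, "BD-6": 14, "BD-7": 11, "BD-8": 5,
}
POINTS_PER_WEEK = 8  # velocity assumed by the plan

total = sum(POINTS.values())
weeks = total / POINTS_PER_WEEK
print(total, weeks)  # prints: 119 14.875
```

So 119pt at 8pt/week is just under 15 person-weeks before any compression from running BD-2, BD-4 and BD-5 in parallel.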

11. What this plan does NOT do

  • Does not perform the actual implementation. Just creates the Linear project + tickets. Each ticket gets picked up + implemented per its own AC + V&V.
  • Does not cover per-tenant traffic cutover. Three separate Linear projects (one per tenant) will be spun up in BD-8.3 to handle DNS flip + AWS drain + traffic switch + rollback per tenant.
  • Does not migrate or delete the legacy AWS Jsonnet workflows in [soa/.github/workflows/](soa/.github/workflows/). Those stay until the cutover project for the respective tenant closes.
  • Does not migrate the existing AWS ECR/ECS resources or [unipuka-infra/](unipuka-infra/) TF. Those are removed by the cutover projects per tenant.
  • Does not introduce Argo Rollouts / canary. Tracked as future iteration after cutover stable.
  • Does not introduce or evaluate HashiCorp Vault. Doppler is the secret store. Out of scope explicitly per Section 3.