Backend deploy target is self-hosted Docker (VPS / Kubernetes / Docker Engine), not Cloud Run. Add a multi-stage Dockerfile (Node 20, bcrypt compiled in build stage, non-root runtime), .dockerignore, a staging docker-compose, and DEPLOY.md covering install, build, migrate, run, and log mapping/rotation. Pin engines.node>=20. Update deployment.md runbook and backend/CLAUDE.md infra line off Cloud Run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Deployment notes
Operational decisions and dependency configuration for staging/production. Keep this updated as we make infra choices; cross-link from feature plans when a deploy-time setting matters.
Infrastructure summary
| Component | Service | Tier / Notes |
|---|---|---|
| Backend (public + internal) | Self-hosted Docker (VPS / Kubernetes / Docker Engine) | NOT Cloud Run. Container from backend/Dockerfile; horizontal scaling via replicas; SIGTERM trapped for graceful drain (server.js) |
| Database | GCP Cloud SQL (PostgreSQL) | Source of truth for all durable state |
| Pub/sub + cache | Valkey | Self-hosted on VM today; Memorystore Standard (HA) recommended for prod (see § Valkey) |
| Networking | GCP VPC | Internal listener (port 3001) never exposed; CC reaches it via VPN |
| Payment | Xendit | See phase5-xendit-plan.md for keys / webhook URL setup |
| Auth | Self-managed JWT + FCM-only Firebase | See backend/CLAUDE.md |
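The backend row above notes that SIGTERM is trapped for graceful drain in server.js. That file is not reproduced here, so the following is only a minimal sketch of the general pattern, assuming a plain Node `http.Server`. `installGracefulDrain`, `drainTimeoutMs`, and the injectable `signals` / `exit` parameters are hypothetical names chosen for illustration, not the project's actual code.

```javascript
// Hypothetical helper: on SIGTERM, stop accepting new connections, let
// in-flight requests finish, then exit. A timer forces exit if draining stalls.
function installGracefulDrain(server, options = {}) {
  const {
    signals = process,                     // anything with .on('SIGTERM', fn)
    exit = (code) => process.exit(code),   // injectable so the logic can be tested
    drainTimeoutMs = 10000,                // assumed budget, tune per orchestrator
  } = options;

  let draining = false;
  signals.on('SIGTERM', () => {
    if (draining) return;                  // ignore repeated signals
    draining = true;
    const timer = setTimeout(() => exit(1), drainTimeoutMs);
    timer.unref();                         // the timer alone should not keep the process alive
    server.close(() => {                   // callback runs once the server has fully closed
      clearTimeout(timer);
      exit(0);
    });
  });
  return () => draining;                   // lets a readiness check report "draining"
}

// Assumed wiring (not taken from server.js):
// const http = require('node:http');
// const server = http.createServer(handler).listen(3000);
// installGracefulDrain(server);
```

With replicas behind a proxy or a Kubernetes Service, the drain budget would normally be kept below the orchestrator's termination grace period so the forced-exit path rarely triggers.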
Valkey
Valkey is used for two distinct purposes:
- Pub/sub — cross-instance event fan-out (chat messages, session lifecycle, config invalidation). See backend/src/plugins/valkey.js.
- Availability mirror: mitras:online, mitras:deactivated, mitra:capacity:<id>, mitra:heartbeat:<id>, and availability:snapshot, per valkey-online-mirror-plan.md. Postgres remains the durable source of truth; Valkey is the hot read path.
Persistence — required or optional?
Not required. All durable state lives in Postgres; Valkey is a cache + ephemeral liveness layer that fully rebuilds via seedFromPostgres() on backend reconnect.
What's actually in Valkey, and what happens if it's wiped:
| Key | Derivable from Postgres? | Cost of loss |
|---|---|---|
| mitras:online | yes | reseeded on reconnect |
| mitras:deactivated | yes | reseeded on reconnect |
| mitra:capacity:<id> | yes (COUNT(*) FROM chat_sessions) | reseeded on reconnect |
| mitra:heartbeat:<id> | no, pure transient liveness | seed writes NOW; ≤ a few seconds of fuzz on last_heartbeat_at forensics |
| availability:snapshot | recomputable | next beacon poll repopulates |
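To make the table concrete, here is a hedged sketch of what a reseed could look like. The real seedFromPostgres() is not shown in this document. The input shapes (`onlineIds`, `deactivatedIds`, `activeSessionCounts`), the choice to write heartbeats only for online mitras, and the in-memory `Map` standing in for Valkey are all assumptions for illustration.

```javascript
// Hypothetical reseed: rebuild the mirror keys from rows already read from Postgres.
// `store` stands in for Valkey: sets are JS Sets, scalars are strings.
function seedMirror(store, rows, now = Date.now()) {
  const { onlineIds, deactivatedIds, activeSessionCounts } = rows;

  store.set('mitras:online', new Set(onlineIds));
  store.set('mitras:deactivated', new Set(deactivatedIds));

  // Assumption: the capacity key holds the per-mitra session count.
  for (const [id, count] of Object.entries(activeSessionCounts)) {
    store.set(`mitra:capacity:${id}`, String(count));
  }

  // Heartbeats are not derivable from Postgres, so the seed writes "now".
  for (const id of onlineIds) {
    store.set(`mitra:heartbeat:${id}`, String(now));
  }

  // availability:snapshot is left out on purpose: per the table, the next
  // beacon poll repopulates it.
  return store;
}

const store = seedMirror(
  new Map(),
  { onlineIds: ['m1', 'm2'], deactivatedIds: ['m9'], activeSessionCounts: { m1: 2, m2: 0 } },
  1700000000000,
);
console.log(store.get('mitra:capacity:m1')); // prints 2
```

Because every key is either recomputed or overwritten with the current time, running the seed twice gives the same result apart from the heartbeat timestamps.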
Reader code in services/* has explicit Postgres fallbacks for every Valkey op, so the cold-cache window during a restart degrades performance, not correctness.
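The fallback behavior described above can be sketched roughly as follows. This is not the code in services/*; `readWithFallback` and its two callback parameters are hypothetical, and treating a `null` / `undefined` cache result as a miss is an assumption.

```javascript
// Hypothetical read path: try the Valkey-backed reader first, and fall back to
// Postgres when Valkey throws (connection lost) or has no value yet (cold cache).
async function readWithFallback(readFromValkey, readFromPostgres) {
  try {
    const cached = await readFromValkey();
    if (cached !== null && cached !== undefined) {
      return { value: cached, source: 'valkey' };
    }
  } catch (err) {
    // Cache errors are swallowed here; Postgres is the source of truth.
  }
  return { value: await readFromPostgres(), source: 'postgres' };
}

(async () => {
  const hit = await readWithFallback(async () => 3, async () => 99);
  const miss = await readWithFallback(
    async () => { throw new Error('valkey down'); },
    async () => 99,
  );
  console.log(hit.source, miss.source); // prints: valkey postgres
})();
```

The caller always gets a value; only the latency differs, which matches the "degrades performance, not correctness" claim.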
Persistence recommendation by environment
| Environment | Setting | Reason |
|---|---|---|
| Dev / local | No persistence (--save "" --appendonly no or just default) | Restarts wipe state; reseed handles it cleanly; zero disk overhead |
| Staging | AOF on (--appendonly yes) | Verifies prod-like behavior; tiny disk cost |
| Production | AOF on, optionally RDB too (--appendonly yes --save 60 1000) | Eliminates cold-cache window after restart; trivial disk footprint (few MB) |
The application code is identical across all three — persistence is a deploy-time knob, not a code-level concern.
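Since persistence is purely a deploy-time knob, a deploy script could pick the flags from the environment name. The sketch below only restates the table above; the environment labels (`dev`, `staging`, `production`) and the `valkeyServerArgs` helper are hypothetical.

```javascript
// Flags copied from the persistence table; the keys are assumed labels.
const PERSISTENCE_FLAGS = {
  dev: ['--save', '', '--appendonly', 'no'],
  staging: ['--appendonly', 'yes'],
  production: ['--appendonly', 'yes', '--save', '60', '1000'],
};

// Returns the argv for the Valkey container command in the given environment.
function valkeyServerArgs(environment) {
  if (!Object.hasOwn(PERSISTENCE_FLAGS, environment)) {
    throw new Error(`unknown environment: ${environment}`);
  }
  return ['valkey-server', ...PERSISTENCE_FLAGS[environment]];
}

console.log(valkeyServerArgs('staging').join(' ')); // prints: valkey-server --appendonly yes
```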
Self-hosted Valkey (current state, dev/staging)
Docker container on the existing VM. Reference config:
valkey:
  image: valkey/valkey:7-alpine
  command: valkey-server --appendonly yes --save 60 1000
  volumes:
    - valkey-data:/data
  ports:
    - "6379:6379"
  restart: unless-stopped
Backend reaches it via VALKEY_URL=redis://<vm-ip>:6379 in backend/.env (or the container's environment).
Memorystore migration (when going to prod)
The reseed-from-Postgres flow makes migration trivial — Valkey state is never load-bearing:
- Provision Memorystore for Valkey, Standard tier (HA with replica) in the same VPC + region as the backend.
- Smallest available size (~1 GB) is plenty; actual data footprint is well under 1 MB.
- Cost: ~$50/month at minimum sizing in asia-southeast2.
- Update the backend env: VALKEY_URL=redis://<memorystore-internal-ip>:6379.
- Deploy the new revision. With a rolling deploy, new instances seed Memorystore from Postgres while old instances drain on the old Valkey.
- Shut down old Valkey once traffic has migrated.
Zero downtime. No data migration needed (state is derivable). The cold-cache window on new instances is handled by the existing Postgres-fallback reader paths.
Tier choice rationale
| Tier | When to use | Failover behavior |
|---|---|---|
| Self-hosted Docker | Dev, staging | Manual restart; backend reseeds when Valkey comes back |
| Memorystore Basic | Cost-sensitive single-AZ staging | ~1–5 min outage per maintenance event; backend handles via Postgres fallback |
| Memorystore Standard (HA) | Production | ~30s automatic failover; replica keeps data live |
The system is correct on any tier — HA reduces customer-visible latency spikes during Valkey events from minutes to seconds.
Cloud Run
(Placeholder for prod tuning — fill in as we make decisions about region, min/max instances, concurrency, secrets manager wiring.)
Manual staging deploy runbook
Goal: stand up a staging backend so the Android staging flavor (com.mybestie.staging) has a real API_BASE_URL to talk to. Done manually for now (no CI/CD yet — see open ops).
Deploy target: self-hosted Docker (VPS / Kubernetes / Docker Engine), not Cloud Run. The backend ships a multi-stage backend/Dockerfile (Node 20, non-root runtime, native bcrypt compiled in the build stage). Build with docker build -t halobestie-backend ./backend.

The full operational runbook lives in backend/DEPLOY.md: install Docker, build/push, migrate, run (Docker + Compose + k8s), and log mapping/rotation. The steps below are the staging-bring-up summary.
A1 — Provision the staging database (Cloud SQL Postgres)
- Create a Cloud SQL Postgres instance (or a separate halobestie_staging DB on a shared instance). Pin the same region as the backend host.
- Capture its connection string for DATABASE_URL (Cloud SQL connector / Unix socket form, or private IP over the VPC).
- Run migrations + seed against it:
    cd backend
    DATABASE_URL=postgresql://... npm run db:migrate
    DATABASE_URL=postgresql://... npm run db:seed
A2 — Provision staging Valkey — self-hosted Docker on the VM is fine for staging (--appendonly yes, see § Valkey). Note the VALKEY_URL.
A3 — Staging Firebase Admin creds — the app's staging google-services.json / GoogleService-Info.plist point at Firebase project my-bestie-876ec. The backend's FIREBASE_SERVICE_ACCOUNT must be a service-account key from that same project, or FCM push + token verification will silently target the wrong project. Mount it as a secret and set FIREBASE_SERVICE_ACCOUNT_PATH (or switch to a Secret Manager mount).
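Because a mismatched key fails silently, a small pre-flight check can catch it before the container starts. Firebase service-account key files are JSON and carry a `project_id` field; the sketch below compares it with the expected project. The function names and the idea of running this at startup are assumptions, not existing backend code.

```javascript
const fs = require('node:fs');

// Throws when the parsed service-account key belongs to a different project.
function assertFirebaseProject(serviceAccount, expectedProjectId) {
  if (serviceAccount.project_id !== expectedProjectId) {
    throw new Error(
      `service account is for "${serviceAccount.project_id}", expected "${expectedProjectId}"`,
    );
  }
  return true;
}

// Hypothetical startup check: read the mounted key file and verify its project.
function checkServiceAccountFile(path, expectedProjectId) {
  const parsed = JSON.parse(fs.readFileSync(path, 'utf8'));
  return assertFirebaseProject(parsed, expectedProjectId);
}

// Assumed usage, with the path taken from FIREBASE_SERVICE_ACCOUNT_PATH:
// checkServiceAccountFile(process.env.FIREBASE_SERVICE_ACCOUNT_PATH, 'my-bestie-876ec');
```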
A4 — Build the image + run migrations, then start the container.
Build (on a build host or in CI), then push to your registry:
docker build -t <registry>/halobestie-backend:staging ./backend
docker push <registry>/halobestie-backend:staging
Run migrations as a one-off before (re)starting the service — never auto-migrate on boot (replica race):
docker run --rm --env-file backend/.env.staging \
<registry>/halobestie-backend:staging node src/db/migrate.js
# first deploy only:
docker run --rm --env-file backend/.env.staging \
<registry>/halobestie-backend:staging node src/db/seed.js
Run the service (plain Docker Engine example; k8s = Deployment + Service with the same env/secrets and liveness/readiness probes on :3000):
docker run -d --name halobestie-staging \
--env-file backend/.env.staging \
-p 3000:3000 \
-v /path/to/firebase-sa.json:/secrets/firebase-sa.json:ro \
--restart unless-stopped \
<registry>/halobestie-backend:staging
- Publish only port 3000. The internal listener (3001) stays bound to 127.0.0.1 inside the container; do not map it.
- FIREBASE_SERVICE_ACCOUNT_PATH must point at the mounted path (e.g. /secrets/firebase-sa.json), not a baked-in file.
- Put a TLS-terminating reverse proxy (Nginx / Traefik / Caddy) in front for https://staging-api.halobestie.com.
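The port rule in the list above follows from how the two listeners bind. As a rough illustration only (the backend's actual server setup is not shown in this document), a two-listener layout in plain Node could look like this; `listenerBindings` and `startListeners` are hypothetical names, and the 0.0.0.0 host for the public listener is an assumption.

```javascript
const http = require('node:http');

// Assumed layout: public API on all interfaces, internal listener on loopback only.
function listenerBindings({ publicPort = 3000, internalPort = 3001 } = {}) {
  return [
    { name: 'public', host: '0.0.0.0', port: publicPort },
    { name: 'internal', host: '127.0.0.1', port: internalPort },
  ];
}

// Hypothetical starter: one http.Server per binding, each with its own handler.
function startListeners(handlers, bindings = listenerBindings()) {
  return bindings.map(({ name, host, port }) =>
    http.createServer(handlers[name]).listen(port, host));
}

// Assumed usage:
// startListeners({ public: publicApp, internal: internalApp });
```

A listener bound to 127.0.0.1 inside the container is not reachable through a published port, so leaving 3001 out of the -p flags and binding it to loopback give two independent layers of protection.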
Staging-specific env values (backend/.env.staging; see backend/.env.example for the full list):
| Var | Staging value |
|---|---|
| AUTH_JWT_SECRET | a fresh secret, not the prod one |
| XENDIT_ENABLED | false until you wire test-mode keys + webhook |
| XENDIT_SECRET_KEY / XENDIT_WEBHOOK_TOKEN | Xendit test credentials |
| XENDIT_SUCCESS/FAILURE_REDIRECT_URL | staging backend's /payment/return/* URLs |
| FAZPASS_ENABLED | false (test-user OTP bypass path) unless testing real OTP |
| CC_ORIGIN | staging control-center origin (if deployed) |
| ADMIN_EMAIL / ADMIN_PASSWORD | staging control-center login |
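A few of the rules in the table above are mechanical enough to check before a deploy. The following is a minimal sketch under stated assumptions: `stagingEnvProblems` is a hypothetical helper, the flags are assumed to be the literal strings `true` / `false`, and only the JWT and Xendit rows are covered.

```javascript
// Hypothetical pre-deploy check for values parsed from backend/.env.staging.
// Returns a list of human-readable problems; an empty list means "looks fine".
function stagingEnvProblems(env, { prodJwtSecret } = {}) {
  const problems = [];

  if (!env.AUTH_JWT_SECRET) {
    problems.push('AUTH_JWT_SECRET is missing');
  } else if (prodJwtSecret && env.AUTH_JWT_SECRET === prodJwtSecret) {
    problems.push('AUTH_JWT_SECRET must not reuse the prod secret');
  }

  // Xendit credentials only matter once the feature flag is switched on.
  if (env.XENDIT_ENABLED === 'true') {
    for (const key of ['XENDIT_SECRET_KEY', 'XENDIT_WEBHOOK_TOKEN']) {
      if (!env[key]) problems.push(`${key} is required when XENDIT_ENABLED=true`);
    }
  }
  return problems;
}

console.log(stagingEnvProblems({ AUTH_JWT_SECRET: 's3cret', XENDIT_ENABLED: 'false' }).length); // prints 0
```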
Public listener only. The internal listener (port 3001, control center) must stay off the public internet; don't publish or proxy it from this container. CC for staging, if needed, goes behind the VPC/VPN per the root architecture rules.
A5 — Capture the URL. Point a DNS record (e.g. staging-api.halobestie.com) at the host/reverse proxy and terminate TLS there. This HTTPS URL is the value the app needs in Phase B.
App handoff (Phase B) — once A5 gives a URL
- Put the real URL in client_app/env/staging.json + mitra_app/env/staging.json (API_BASE_URL), and remove the _TODO key from the client file.
- Build the staging APK:
    cd client_app
    flutter build apk --flavor staging -t lib/main_staging.dart --dart-define-from-file=env/staging.json
  Output: build/app/outputs/flutter-apk/app-staging-release.apk.
- Distribute via Firebase App Distribution (debug-signed APK is accepted, so no upload keystore is needed for staging) or share the APK directly. com.mybestie.staging installs side-by-side with prod.
Release signing is still debug keys (client_app/android/app/build.gradle.kts release { ... }). Fine for Firebase App Distribution / direct APK. A real upload keystore is only required if you later publish staging to a Play Store internal-testing track. iOS staging is not wired yet (only one Runner.xcscheme, no per-flavor schemes/build-configs).
Cloud SQL
(Placeholder — pool size, machine type, HA flag, backup retention.)
Xendit
See phase5-xendit-plan.md for credential setup and webhook URL configuration. Stage 8 (live E2E) is currently blocked on test-mode keys.
Open ops decisions
- Confirm Memorystore Standard tier for prod deploy (recommended in § Valkey).
- Pin GCP region for backend + Cloud SQL + Memorystore (all must match for sub-ms internal latency).
- Secrets manager (GCP Secret Manager vs container env vars) for AUTH_JWT_SECRET, XENDIT_SECRET_KEY, etc.
- Backup retention policy for Cloud SQL.
- CI/CD pipeline for backend image builds and deploys.