Ramadhan Sjamsani 91bdbd5289 build(backend): Dockerize for self-hosted deploy + deploy/log docs

Backend deploy target is self-hosted Docker (VPS / Kubernetes / Docker
Engine), not Cloud Run. Add a multi-stage Dockerfile (Node 20, bcrypt
compiled in build stage, non-root runtime), .dockerignore, a staging
docker-compose, and DEPLOY.md covering install, build, migrate, run, and
log mapping/rotation. Pin engines.node>=20. Update deployment.md runbook
and backend/CLAUDE.md infra line off Cloud Run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-05 15:10:59 +08:00

Deployment notes

Operational decisions and dependency configuration for staging/production. Keep this updated as we make infra choices; cross-link from feature plans when a deploy-time setting matters.

Infrastructure summary

| Component | Service | Tier / Notes |
| --- | --- | --- |
| Backend (public + internal) | Self-hosted Docker (VPS / Kubernetes / Docker Engine) | NOT Cloud Run. Container from backend/Dockerfile; horizontal scaling via replicas; SIGTERM trapped for graceful drain (server.js) |
| Database | GCP Cloud SQL (PostgreSQL) | Source of truth for all durable state |
| Pub/sub + cache | Valkey | Self-hosted on VM today; Memorystore Standard (HA) recommended for prod (see § Valkey) |
| Networking | GCP VPC | Internal listener (port 3001) never exposed; CC reaches it via VPN |
| Payment | Xendit | See phase5-xendit-plan.md for keys / webhook URL setup |
| Auth | Self-managed JWT + FCM-only Firebase | See backend/CLAUDE.md |
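
The graceful drain noted in the backend row could look roughly like the sketch below. The real handler lives in server.js and may differ; the function name, the 10-second default, and the injected `exit` are assumptions made for illustration.

```javascript
// Sketch only, not the project's server.js.
// `server` is anything with a Node-style close(callback), e.g. an http.Server.
function createShutdownHandler(server, { timeoutMs = 10_000, exit = process.exit } = {}) {
  let draining = false;
  return function onSignal() {
    if (draining) return; // ignore repeated signals
    draining = true;
    // Safety net: force a non-zero exit if connections never finish.
    const timer = setTimeout(() => exit(1), timeoutMs);
    if (typeof timer.unref === 'function') timer.unref();
    // close() stops accepting new connections; the callback fires once
    // the existing connections have ended.
    server.close((err) => {
      clearTimeout(timer);
      exit(err ? 1 : 0);
    });
  };
}
```

Wiring it up would be `process.once('SIGTERM', createShutdownHandler(server))`. The timeout should stay below the orchestrator's grace period (`docker stop` waits 10 seconds by default before sending SIGKILL).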

Valkey

Valkey is used for two distinct purposes:

- Pub/sub: cross-instance event fan-out (chat messages, session lifecycle, config invalidation). See backend/src/plugins/valkey.js.
- Availability mirror: mitras:online, mitras:deactivated, mitra:capacity:<id>, mitra:heartbeat:<id>, and availability:snapshot, per valkey-online-mirror-plan.md. Postgres remains the durable source of truth; Valkey is the hot read path.

Persistence — required or optional?

Not required. All durable state lives in Postgres; Valkey is a cache + ephemeral liveness layer that fully rebuilds via seedFromPostgres() on backend reconnect.
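
To make the rebuild concrete, here is a minimal sketch of what a reseed could do. It is not the actual seedFromPostgres(): the query, the column names, the key types (sets for mitras:*, strings for the per-mitra keys), and the ioredis-style client methods are all assumptions.

```javascript
// Hypothetical reseed sketch. `pg` exposes query() returning { rows };
// `valkey` exposes del/sadd/set in the style of ioredis.
async function seedFromPostgresSketch({ pg, valkey, now = () => Date.now() }) {
  // Hypothetical query and columns; the real schema may differ.
  const { rows } = await pg.query(
    'SELECT id, is_online, is_deactivated, active_sessions FROM mitra_status',
  );
  // Start from a clean slate so stale members do not survive the reseed.
  await valkey.del('mitras:online', 'mitras:deactivated');
  for (const row of rows) {
    if (row.is_online) await valkey.sadd('mitras:online', row.id);
    if (row.is_deactivated) await valkey.sadd('mitras:deactivated', row.id);
    await valkey.set(`mitra:capacity:${row.id}`, String(row.active_sessions));
    // Heartbeats are not stored in Postgres, so the seed stamps "now".
    if (row.is_online) await valkey.set(`mitra:heartbeat:${row.id}`, String(now()));
  }
  return rows.length;
}
```

Because every key is either derived from Postgres or stamped at seed time, running the function against an empty Valkey yields a usable state without any dump or restore step.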

What's actually in Valkey, and what happens if it's wiped:

| Key | Derivable from Postgres? | Cost of loss |
| --- | --- | --- |
| `mitras:online` | yes | reseeded on reconnect |
| `mitras:deactivated` | yes | reseeded on reconnect |
| `mitra:capacity:<id>` | yes (`COUNT(*) FROM chat_sessions`) | reseeded on reconnect |
| `mitra:heartbeat:<id>` | no (pure transient liveness) | seed writes `NOW`; ≤ a few seconds of fuzz on `last_heartbeat_at` forensics |
| `availability:snapshot` | recomputable | next beacon poll repopulates |

Reader code in services/* has explicit Postgres fallbacks for every Valkey op, so the cold-cache window during a restart degrades performance, not correctness.
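
A fallback reader of this kind might be shaped like the sketch below. It is illustrative only: the function name, the `mitra_id` and `ended_at` columns, and the injected clients are assumptions, not the code in services/*.

```javascript
// Hypothetical read path: try Valkey first, fall back to Postgres on a
// miss or on any Valkey error. `valkey.get` and `pg.query` are injected.
async function getMitraCapacity(mitraId, { valkey, pg }) {
  try {
    const cached = await valkey.get(`mitra:capacity:${mitraId}`);
    if (cached !== null && cached !== undefined) {
      return { value: Number(cached), source: 'valkey' };
    }
  } catch (err) {
    // Valkey is down or restarting: fall through to the durable source.
  }
  // Assumed schema; the source only states COUNT(*) FROM chat_sessions.
  const { rows } = await pg.query(
    'SELECT COUNT(*)::int AS n FROM chat_sessions WHERE mitra_id = $1 AND ended_at IS NULL',
    [mitraId],
  );
  return { value: rows[0].n, source: 'postgres' };
}
```

The caller gets the same answer from either path; only latency differs, which is why a wiped or cold cache affects performance rather than correctness.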

Persistence recommendation by environment

| Environment | Setting | Reason |
| --- | --- | --- |
| Dev / local | No persistence (`--save "" --appendonly no` or just default) | Restarts wipe state; reseed handles it cleanly; zero disk overhead |
| Staging | AOF on (`--appendonly yes`) | Verifies prod-like behavior; tiny disk cost |
| Production | AOF on, optionally RDB too (`--appendonly yes --save 60 1000`) | Eliminates cold-cache window after restart; trivial disk footprint (few MB) |

The application code is identical across all three — persistence is a deploy-time knob, not a code-level concern.

Self-hosted Valkey (current state, dev/staging)

Docker container on the existing VM. Reference config:

valkey:
  image: valkey/valkey:7-alpine
  command: valkey-server --appendonly yes --save 60 1000
  volumes:
    - valkey-data:/data
  ports:
    - "6379:6379"
  restart: unless-stopped

Backend reaches it via VALKEY_URL=redis://<vm-ip>:6379 in backend/.env (or the container's env file, e.g. backend/.env.staging).

Memorystore migration (when going to prod)

The reseed-from-Postgres flow makes migration trivial — Valkey state is never load-bearing:

1. Provision Memorystore for Valkey, Standard tier (HA with replica), in the same VPC and region as the backend hosts.
   - Smallest available size (~1 GB) is plenty; actual data footprint is well under 1 MB.
   - Cost: ~$50/month at minimum sizing in asia-southeast2.
2. Update the backend env: VALKEY_URL=redis://<memorystore-internal-ip>:6379.
3. Roll the change out replica by replica. New instances seed Memorystore from Postgres; old instances drain on the old Valkey.
4. Shut down the old Valkey once traffic has migrated.

Zero downtime. No data migration needed (state is derivable). The cold-cache window on new instances is handled by the existing Postgres-fallback reader paths.

Tier choice rationale

| Tier | When to use | Failover behavior |
| --- | --- | --- |
| Self-hosted Docker | Dev, staging | Manual restart; backend reseeds when Valkey comes back |
| Memorystore Basic | Cost-sensitive single-AZ staging | ~1–5 min outage per maintenance event; backend handles via Postgres fallback |
| Memorystore Standard (HA) | Production | ~30s automatic failover; replica keeps data live |

The system is correct on any tier — HA reduces customer-visible latency spikes during Valkey events from minutes to seconds.

Backend runtime (self-hosted Docker)

(Placeholder for prod tuning: fill in as we make decisions about region, replica count, concurrency, secrets manager wiring.)

Manual staging deploy runbook

Goal: stand up a staging backend so the Android staging flavor (com.mybestie.staging) has a real API_BASE_URL to talk to. Done manually for now (no CI/CD yet — see open ops).

Deploy target: self-hosted Docker (VPS / Kubernetes / Docker Engine) — not Cloud Run. The backend ships a multi-stage backend/Dockerfile (Node 20, non-root runtime, native bcrypt compiled in the build stage). Build with docker build -t halobestie-backend ./backend.

Full operational runbook — install Docker, build/push, migrate, run (Docker + Compose + k8s), and log mapping/rotation — lives in backend/DEPLOY.md. The steps below are the staging-bring-up summary.

A1 — Provision the staging database (Cloud SQL Postgres)

- Create a Cloud SQL Postgres instance (or a separate halobestie_staging DB on a shared instance). Pin it to the same region as the backend host.
- Capture its connection string for DATABASE_URL (use the private IP if the backend host sits inside the VPC; otherwise connect through the Cloud SQL Auth Proxy).

Run migrations + seed against it:

cd backend
DATABASE_URL=postgresql://... npm run db:migrate
DATABASE_URL=postgresql://... npm run db:seed

A2 — Provision staging Valkey — self-hosted Docker on the VM is fine for staging (--appendonly yes, see § Valkey). Note the VALKEY_URL.

A3 — Staging Firebase Admin creds — the app's staging google-services.json / GoogleService-Info.plist point at Firebase project my-bestie-876ec. The backend's FIREBASE_SERVICE_ACCOUNT must be a service-account key from that same project, or FCM push + token verification will silently target the wrong project. Mount it as a secret and set FIREBASE_SERVICE_ACCOUNT_PATH (or switch to a Secret Manager mount).
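
One way to catch a project mismatch before it fails silently is to inspect the key file at startup. The sketch below assumes the standard service-account JSON layout (`type`, `project_id`, `client_email`); the function name is hypothetical and not part of the backend.

```javascript
// Hypothetical startup guard: rejects a key from the wrong Firebase project.
// `serviceAccountJson` is the raw text of the mounted key file.
function checkFirebaseProject(serviceAccountJson, expectedProjectId) {
  const key = JSON.parse(serviceAccountJson);
  if (key.type !== 'service_account') {
    throw new Error('not a service-account key file');
  }
  if (key.project_id !== expectedProjectId) {
    throw new Error(
      `service account belongs to "${key.project_id}", expected "${expectedProjectId}"`,
    );
  }
  // Returned so the caller can log which identity is in use.
  return key.client_email;
}
```

In the backend this could be called with the contents of the file at FIREBASE_SERVICE_ACCOUNT_PATH and the staging project ID `my-bestie-876ec`, failing the boot instead of sending pushes to the wrong project.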

A4 — Build the image + run migrations, then start the container.

Build (on a build host or in CI), then push to your registry:

docker build -t <registry>/halobestie-backend:staging ./backend
docker push <registry>/halobestie-backend:staging

Run migrations as a one-off before (re)starting the service — never auto-migrate on boot (replica race):

docker run --rm --env-file backend/.env.staging \
  <registry>/halobestie-backend:staging node src/db/migrate.js
# first deploy only:
docker run --rm --env-file backend/.env.staging \
  <registry>/halobestie-backend:staging node src/db/seed.js

Run the service (plain Docker Engine example; k8s = Deployment + Service with the same env/secrets and liveness/readiness probes on :3000):

docker run -d --name halobestie-staging \
  --env-file backend/.env.staging \
  -p 3000:3000 \
  -v /path/to/firebase-sa.json:/secrets/firebase-sa.json:ro \
  --restart unless-stopped \
  <registry>/halobestie-backend:staging

- Publish only port 3000. The internal listener (3001) stays bound to 127.0.0.1 inside the container; do not map it.
- FIREBASE_SERVICE_ACCOUNT_PATH must point at the mounted path (e.g. /secrets/firebase-sa.json), not a baked-in file.
- Put a TLS-terminating reverse proxy (Nginx / Traefik / Caddy) in front for https://staging-api.halobestie.com.

Staging-specific env values (backend/.env.staging; see backend/.env.example for the full list):

| Var | Staging value |
| --- | --- |
| `AUTH_JWT_SECRET` | a fresh secret, not the prod one |
| `XENDIT_ENABLED` | `false` until you wire test-mode keys + webhook |
| `XENDIT_SECRET_KEY` / `XENDIT_WEBHOOK_TOKEN` | Xendit test credentials |
| `XENDIT_SUCCESS/FAILURE_REDIRECT_URL` | staging backend's `/payment/return/*` URLs |
| `FAZPASS_ENABLED` | `false` (test-user OTP bypass path) unless testing real OTP |
| `CC_ORIGIN` | staging control-center origin (if deployed) |
| `ADMIN_EMAIL` / `ADMIN_PASSWORD` | staging control-center login |
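
A small pre-flight check can flag the most common mistakes in these values before the container starts. This is a sketch under assumptions: the function is hypothetical, and it assumes boolean flags are passed as the strings `'true'` / `'false'`, which the source does not confirm.

```javascript
// Hypothetical pre-flight check for the staging env values.
// Returns a list of human-readable problems; empty means nothing obvious is wrong.
function checkStagingEnv(env) {
  const problems = [];
  if (!env.AUTH_JWT_SECRET) {
    problems.push('AUTH_JWT_SECRET is not set');
  }
  // Xendit credentials only matter once the integration is switched on.
  if (env.XENDIT_ENABLED === 'true') {
    for (const name of ['XENDIT_SECRET_KEY', 'XENDIT_WEBHOOK_TOKEN']) {
      if (!env[name]) problems.push(`${name} is required when XENDIT_ENABLED=true`);
    }
  }
  // Not an error, but worth a second look on staging.
  if (env.FAZPASS_ENABLED === 'true') {
    problems.push('FAZPASS_ENABLED=true uses real OTP; confirm this is intended');
  }
  return problems;
}
```

Running `checkStagingEnv(process.env)` at boot and logging the result keeps the check advisory; whether any of these should abort startup is a separate decision.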

Public listener only. The internal listener (port 3001, control center) must stay off the public internet, so do not publish it from the backend container or route it through the reverse proxy. CC for staging, if needed, goes behind the VPC/VPN per the root architecture rules.

A5 — Capture the URL. Point a DNS record (e.g. staging-api.halobestie.com) at the host/reverse proxy and terminate TLS there. This HTTPS URL is the value the app needs in Phase B.
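
Before handing the URL to the app team, a quick smoke check confirms that DNS, TLS, and the proxy are wired up. The sketch below uses the `fetch` built into Node 20; the `/health` path is a placeholder, since the source does not name a health route, and the function itself is hypothetical.

```javascript
// Hypothetical smoke check. `path` is an assumed route; substitute any
// endpoint the backend is known to answer on the public listener.
async function smokeCheck(baseUrl, { fetchImpl = fetch, path = '/health' } = {}) {
  const url = new URL(path, baseUrl);
  if (url.protocol !== 'https:') {
    throw new Error('staging API must be served over HTTPS');
  }
  // redirect: 'manual' so a proxy redirect is not mistaken for a healthy backend.
  const res = await fetchImpl(url, { redirect: 'manual' });
  return { url: url.href, status: res.status, ok: res.status >= 200 && res.status < 300 };
}
```

Called as `await smokeCheck('https://staging-api.halobestie.com')`, it resolves to an object whose `ok` field is true only for a 2xx response.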

App handoff (Phase B) — once A5 gives a URL

Put the real URL in client_app/env/staging.json + mitra_app/env/staging.json (API_BASE_URL), and remove the _TODO key from the client file.

Build the staging APK:

cd client_app
flutter build apk --flavor staging -t lib/main_staging.dart --dart-define-from-file=env/staging.json

Output: build/app/outputs/flutter-apk/app-staging-release.apk.

Distribute via Firebase App Distribution (debug-signed APK is accepted — no upload keystore needed for staging) or share the APK directly. com.mybestie.staging installs side-by-side with prod.

Release signing is still debug keys (client_app/android/app/build.gradle.kts release { ... }). Fine for Firebase App Distribution / direct APK. A real upload keystore is only required if you later publish staging to a Play Store internal-testing track. iOS staging is not wired yet (only one Runner.xcscheme — no per-flavor schemes/build-configs).

Cloud SQL

(Placeholder — pool size, machine type, HA flag, backup retention.)

Xendit

See phase5-xendit-plan.md for credential setup and webhook URL configuration. Stage 8 (live E2E) is currently blocked on test-mode keys.

Open ops decisions

- Confirm Memorystore Standard tier for prod deploy (recommended in § Valkey).
- Pin GCP region for backend + Cloud SQL + Memorystore (all must match for sub-ms internal latency).
- Secrets manager (GCP Secret Manager vs container env files / env vars) for AUTH_JWT_SECRET, XENDIT_SECRET_KEY, etc.
- Backup retention policy for Cloud SQL.
- CI/CD pipeline for backend image builds and deploys.
