# Deployment notes

Operational decisions and dependency configuration for staging/production. Keep this updated as we make infra choices; cross-link from feature plans when a deploy-time setting matters.

## Infrastructure summary

| Component | Service | Tier / Notes |
|---|---|---|
| Backend (public + internal) | Self-hosted Docker (VPS / Kubernetes / Docker Engine) | NOT Cloud Run. Container from [backend/Dockerfile](../backend/Dockerfile); horizontal scaling via replicas; SIGTERM trapped for graceful drain ([server.js](../backend/src/server.js)) |
| Database | GCP Cloud SQL (PostgreSQL) | Source of truth for all durable state |
| Pub/sub + cache | Valkey | Self-hosted on VM today; Memorystore Standard (HA) recommended for prod (see [§ Valkey](#valkey)) |
| Networking | GCP VPC | Internal listener (port 3001) never exposed; CC reaches it via VPN |
| Payment | Xendit | See [phase5-xendit-plan.md](phase5-xendit-plan.md) for keys / webhook URL setup |
| Auth | Self-managed JWT + FCM-only Firebase | See [backend/CLAUDE.md](../backend/CLAUDE.md) |

## Valkey

Valkey is used for two distinct purposes:

1. **Pub/sub** — cross-instance event fan-out (chat messages, session lifecycle, config invalidation). See [backend/src/plugins/valkey.js](../backend/src/plugins/valkey.js).
2. **Availability mirror** — `mitras:online`, `mitras:deactivated`, `mitra:capacity:<id>`, `mitra:heartbeat:<id>`, and `availability:snapshot` per [valkey-online-mirror-plan.md](valkey-online-mirror-plan.md). Postgres remains the durable source of truth; Valkey is the hot read path.

### Persistence — required or optional?

**Not required.** All durable state lives in Postgres; Valkey is a cache + ephemeral liveness layer that fully rebuilds via `seedFromPostgres()` on backend reconnect.

What's actually in Valkey, and what happens if it's wiped:

| Key | Derivable from Postgres? | Cost of loss |
|---|---|---|
| `mitras:online` | yes | reseeded on reconnect |
| `mitras:deactivated` | yes | reseeded on reconnect |
| `mitra:capacity:<id>` | yes (`COUNT(*) FROM chat_sessions`) | reseeded on reconnect |
| `mitra:heartbeat:<id>` | **no** — pure transient liveness | seed writes `NOW`; ≤ a few seconds of fuzz on `last_heartbeat_at` forensics |
| `availability:snapshot` | recomputable | next beacon poll repopulates |

Reader code in `services/*` has explicit Postgres fallbacks for every Valkey op, so the cold-cache window during a restart degrades performance, not correctness.

### Persistence recommendation by environment

| Environment | Setting | Reason |
|---|---|---|
| **Dev / local** | No persistence (`--save "" --appendonly no` or just default) | Restarts wipe state; reseed handles it cleanly; zero disk overhead |
| **Staging** | AOF on (`--appendonly yes`) | Verifies prod-like behavior; tiny disk cost |
| **Production** | AOF on, optionally RDB too (`--appendonly yes --save 60 1000`) | Eliminates cold-cache window after restart; trivial disk footprint (few MB) |

The application code is identical across all three — persistence is a deploy-time knob, not a code-level concern.

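
Because persistence is only a deploy-time knob, the table above can be captured in a small lookup. The following is a minimal sketch; the function name `valkey_persistence_flags` is hypothetical and not part of the repo. It prints the flag string for an environment so it can be pasted into a Compose `command:` line.

```bash
#!/usr/bin/env bash
# Hypothetical helper (assumption, not in the repo): map a deploy
# environment to the Valkey persistence flags from the table above.
valkey_persistence_flags() {
  case "$1" in
    dev|local)  printf '%s\n' '--save "" --appendonly no' ;;
    staging)    printf '%s\n' '--appendonly yes' ;;
    production) printf '%s\n' '--appendonly yes --save 60 1000' ;;
    *)          echo "unknown environment: $1" >&2; return 1 ;;
  esac
}
```

The output is meant to be read by a human or templated into a config file. Do not word-split it directly into a `valkey-server` invocation, since the empty `""` argument for `--save` would not survive unquoted shell expansion.
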
### Self-hosted Valkey (current state, dev/staging)

Docker container on the existing VM. Reference config:

```yaml
valkey:
  image: valkey/valkey:7-alpine
  # AOF plus an RDB snapshot every 60 s if at least 1000 keys changed
  command: valkey-server --appendonly yes --save 60 1000
  volumes:
    - valkey-data:/data
  ports:
    - "6379:6379"
  restart: unless-stopped
```

Backend reaches it via `VALKEY_URL=redis://<vm-ip>:6379` in `backend/.env` (or the container's environment).

### Memorystore migration (when going to prod)

The reseed-from-Postgres flow makes migration trivial — Valkey state is never load-bearing:

1. Provision **Memorystore for Valkey, Standard tier** (HA with replica) in the same VPC + region as the backend.
   - Smallest available size (~1 GB) is plenty; actual data footprint is well under 1 MB.
   - Cost: ~$50/month at minimum sizing in asia-southeast2.
2. Update the backend env: `VALKEY_URL=redis://<memorystore-internal-ip>:6379`.
3. Roll out the change. During a rolling deploy, new instances seed Memorystore from Postgres while old instances drain on the old Valkey.
4. Shut down old Valkey once traffic has migrated.

**Zero downtime.** No data migration needed (state is derivable). The cold-cache window on new instances is handled by the existing Postgres-fallback reader paths.

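
For a container whose environment comes from an env file, step 2 amounts to changing one line. The sketch below assumes that setup; `set_env_var` is a hypothetical helper, not something that ships in the repo. It replaces an existing `KEY=` line or appends one, leaving every other line untouched.

```bash
#!/usr/bin/env bash
# Hypothetical helper (assumption): replace or append KEY=value in an
# env file without touching any other line.
set_env_var() {
  local file="$1" key="$2" value="$3" tmp
  tmp="$(mktemp)" || return 1
  # Keep every line except an existing assignment of KEY.
  grep -v "^${key}=" "$file" > "$tmp" || true
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Write back in place so the file keeps its permissions.
  cat "$tmp" > "$file" && rm -f "$tmp"
}

# Usage (placeholders left as-is on purpose):
#   set_env_var <env-file> VALKEY_URL "redis://<memorystore-internal-ip>:6379"
```
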
### Tier choice rationale

| Tier | When to use | Failover behavior |
|---|---|---|
| Self-hosted Docker | Dev, staging | Manual restart; backend reseeds when Valkey comes back |
| Memorystore Basic | Cost-sensitive single-AZ staging | ~1–5 min outage per maintenance event; backend handles via Postgres fallback |
| Memorystore Standard (HA) | **Production** | ~30s automatic failover; replica keeps data live |

The system is correct on any tier — HA reduces customer-visible latency spikes during Valkey events from minutes to seconds.

## Cloud Run

(Placeholder for prod tuning — fill in as we make decisions about region, min/max instances, concurrency, secrets manager wiring.)

### Manual staging deploy runbook

Goal: stand up a staging backend so the Android **staging** flavor (`com.mybestie.staging`) has a real `API_BASE_URL` to talk to. Done manually for now (no CI/CD yet — see open ops).

> **Deploy target: self-hosted Docker** (VPS / Kubernetes / Docker Engine) — not Cloud Run. The backend ships a multi-stage [backend/Dockerfile](../backend/Dockerfile) (Node 20, non-root runtime, native `bcrypt` compiled in the build stage). Build with `docker build -t halobestie-backend ./backend`.
>
> **Full operational runbook — install Docker, build/push, migrate, run (Docker + Compose + k8s), and log mapping/rotation — lives in [backend/DEPLOY.md](../backend/DEPLOY.md).** The steps below are the staging-bring-up summary.

**A1 — Provision the staging database (Cloud SQL Postgres)**

1. Create a Cloud SQL Postgres instance (or a separate `halobestie_staging` DB on a shared instance). Pin the **same region** as the backend host.
2. Capture its connection string for `DATABASE_URL` (use the Cloud SQL connector / Unix-socket form, or the private IP over the VPC).
3. Run migrations + seed against it:

   ```bash
   cd backend
   DATABASE_URL=postgresql://... npm run db:migrate
   DATABASE_URL=postgresql://... npm run db:seed
   ```

**A2 — Provision staging Valkey** — self-hosted Docker on the VM is fine for staging (`--appendonly yes`, see [§ Valkey](#valkey)). Note the `VALKEY_URL`.

**A3 — Staging Firebase Admin creds** — the app's staging `google-services.json` / `GoogleService-Info.plist` point at Firebase project **`my-bestie-876ec`**. The backend's `FIREBASE_SERVICE_ACCOUNT` **must be a service-account key from that same project**, or FCM push + token verification will silently target the wrong project. Mount it as a secret and set `FIREBASE_SERVICE_ACCOUNT_PATH` (or switch to a Secret Manager mount).

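
Since a project mismatch in A3 fails silently, a quick preflight on the key file can catch it before deploy. This is a minimal sketch: `check_firebase_project` is a hypothetical name, and the `sed` extraction relies only on the `project_id` field that service-account key files carry. Where `jq` is available, it would be a more robust way to read the field.

```bash
#!/usr/bin/env bash
# Hypothetical preflight (assumption, not in the repo): confirm that a
# service-account key belongs to the expected Firebase project.
check_firebase_project() {
  local key_file="$1" expected="${2:-my-bestie-876ec}" actual
  # Pull the value of "project_id" out of the JSON key file.
  actual="$(sed -n 's/.*"project_id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$key_file" | head -n 1)"
  if [ "$actual" != "$expected" ]; then
    echo "service account is for '$actual', expected '$expected'" >&2
    return 1
  fi
}

# Usage: check_firebase_project /path/to/firebase-sa.json
```
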
**A4 — Build the image + run migrations, then start the container.**

Build (on a build host or in CI), then push to your registry:

```bash
docker build -t <registry>/halobestie-backend:staging ./backend
docker push <registry>/halobestie-backend:staging
```

Run migrations as a **one-off** before (re)starting the service — never auto-migrate on boot (replica race):

```bash
docker run --rm --env-file backend/.env.staging \
  <registry>/halobestie-backend:staging node src/db/migrate.js
# first deploy only:
docker run --rm --env-file backend/.env.staging \
  <registry>/halobestie-backend:staging node src/db/seed.js
```

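
The "migrate once, then restart" ordering can be enforced with a small wrapper, so a failed migration never leads to a restart. The following is a sketch under assumptions: `migrate_then` and the `DOCKER` override variable are hypothetical, and whatever follows the first two arguments is treated as your own restart or rollout command.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper (assumption): run the one-off migration, and only
# if it succeeds run the restart/rollout command given as the remaining
# arguments. DOCKER may be overridden; it defaults to docker.
migrate_then() {
  local image="$1" env_file="$2"
  shift 2
  if ! "${DOCKER:-docker}" run --rm --env-file "$env_file" "$image" node src/db/migrate.js; then
    echo "migration failed; not restarting the service" >&2
    return 1
  fi
  "$@"   # runs only after a clean migration
}

# Usage (restart-staging.sh is a hypothetical script name):
#   migrate_then <registry>/halobestie-backend:staging backend/.env.staging ./restart-staging.sh
```
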
Run the service (plain Docker Engine example; k8s = Deployment + Service with the same env/secrets and liveness/readiness probes on `:3000`):

```bash
docker run -d --name halobestie-staging \
  --env-file backend/.env.staging \
  -p 3000:3000 \
  -v /path/to/firebase-sa.json:/secrets/firebase-sa.json:ro \
  --restart unless-stopped \
  <registry>/halobestie-backend:staging
```

- Publish **only** port 3000. The internal listener (3001) stays bound to `127.0.0.1` inside the container — do not map it.
- `FIREBASE_SERVICE_ACCOUNT_PATH` must point at the mounted path (e.g. `/secrets/firebase-sa.json`), not a baked-in file.
- Put a TLS-terminating reverse proxy (Nginx / Traefik / Caddy) in front for `https://staging-api.halobestie.com`.

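
The "publish only port 3000" rule above can be checked after the container starts by inspecting `docker port` output, which lists one mapping per line in the form `3000/tcp -> 0.0.0.0:3000`. A minimal sketch follows; the function name `only_public_port_published` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical check (assumption): read `docker port <container>` output
# on stdin and fail if anything other than the allowed port is published.
only_public_port_published() {
  local allowed="${1:-3000}" line port status=0
  while IFS= read -r line; do
    [ -z "$line" ] && continue
    port="${line%%/*}"          # "3000/tcp -> 0.0.0.0:3000" -> "3000"
    if [ "$port" != "$allowed" ]; then
      echo "unexpected published port: $line" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage: docker port halobestie-staging | only_public_port_published 3000
```
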
Staging-specific env values (`backend/.env.staging`; see [backend/.env.example](../backend/.env.example) for the full list):

| Var | Staging value |
|---|---|
| `AUTH_JWT_SECRET` | a fresh secret — **not** the prod one |
| `XENDIT_ENABLED` | `false` until you wire test-mode keys + webhook |
| `XENDIT_SECRET_KEY` / `XENDIT_WEBHOOK_TOKEN` | Xendit **test** credentials |
| `XENDIT_SUCCESS/FAILURE_REDIRECT_URL` | staging backend's `/payment/return/*` URLs |
| `FAZPASS_ENABLED` | `false` (test-user OTP bypass path) unless testing real OTP |
| `CC_ORIGIN` | staging control-center origin (if deployed) |
| `ADMIN_EMAIL` / `ADMIN_PASSWORD` | staging control-center login |

> **Public listener only.** The internal listener (port 3001, control center) must stay off the public internet — don't expose it from this host or container. CC for staging, if needed, goes behind the VPC/VPN per the root architecture rules.

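
Before starting the container, it can help to confirm that the env file actually sets the values listed above. The sketch below is an assumption-laden convenience: `check_env_file` is a hypothetical helper that only checks for non-empty `KEY=value` lines and does not validate the values themselves.

```bash
#!/usr/bin/env bash
# Hypothetical preflight (assumption): report every listed key that is
# missing or empty in an env file; return non-zero if any is.
check_env_file() {
  local file="$1" key missing=0
  shift
  for key in "$@"; do
    if ! grep -q "^${key}=." "$file"; then
      echo "missing or empty: $key" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   check_env_file backend/.env.staging \
#     AUTH_JWT_SECRET XENDIT_ENABLED FAZPASS_ENABLED ADMIN_EMAIL ADMIN_PASSWORD
```
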
**A5 — Capture the URL.** Point a DNS record (e.g. `staging-api.halobestie.com`) at the host/reverse proxy and terminate TLS there. **This HTTPS URL is the value the app needs** in Phase B.

### App handoff (Phase B) — once A5 gives a URL

1. Put the real URL in [`client_app/env/staging.json`](../client_app/env/staging.json) + [`mitra_app/env/staging.json`](../mitra_app/env/staging.json) (`API_BASE_URL`), and remove the `_TODO` key from the client file.
2. Build the staging APK:

   ```bash
   cd client_app
   flutter build apk --flavor staging -t lib/main_staging.dart --dart-define-from-file=env/staging.json
   ```

   Output: `build/app/outputs/flutter-apk/app-staging-release.apk`.
3. Distribute via **Firebase App Distribution** (debug-signed APK is accepted — no upload keystore needed for staging) or share the APK directly. `com.mybestie.staging` installs side-by-side with prod.

> **Release signing is still debug keys** ([client_app/android/app/build.gradle.kts](../client_app/android/app/build.gradle.kts) `release { ... }`). Fine for Firebase App Distribution / direct APK. A real upload keystore is only required if you later publish staging to a Play Store internal-testing track. iOS staging is **not** wired yet (only one `Runner.xcscheme` — no per-flavor schemes/build-configs).

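
Step 1 of the handoff is easy to get half-done, so a check on the env JSON before building may be worthwhile. This is a rough sketch: `check_staging_json` is a hypothetical helper, and it uses plain `grep` pattern matching rather than a JSON parser, so it only covers the two conditions named in step 1.

```bash
#!/usr/bin/env bash
# Hypothetical check (assumption): the staging env file must carry an
# https API_BASE_URL and must no longer contain the _TODO key.
check_staging_json() {
  local file="$1"
  if grep -q '"_TODO"' "$file"; then
    echo "$file: _TODO key is still present" >&2
    return 1
  fi
  if ! grep -Eq '"API_BASE_URL"[[:space:]]*:[[:space:]]*"https://[^"]+"' "$file"; then
    echo "$file: API_BASE_URL is missing or not an https URL" >&2
    return 1
  fi
}

# Usage: check_staging_json client_app/env/staging.json && check_staging_json mitra_app/env/staging.json
```
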
## Cloud SQL

(Placeholder — pool size, machine type, HA flag, backup retention.)

## Xendit

See [phase5-xendit-plan.md](phase5-xendit-plan.md) for credential setup and webhook URL configuration. Stage 8 (live E2E) is currently blocked on test-mode keys.

## Open ops decisions

- [ ] Confirm Memorystore Standard tier for prod deploy (recommended in [§ Valkey](#valkey)).
- [ ] Pin GCP region for backend + Cloud SQL + Memorystore (all must match for sub-ms internal latency).
- [ ] Secrets manager (GCP Secret Manager vs plain container env vars) for `AUTH_JWT_SECRET`, `XENDIT_SECRET_KEY`, etc.
- [ ] Backup retention policy for Cloud SQL.
- [ ] CI/CD pipeline for backend image builds and deploys.