halobestie-clone/requirement/deployment.md
Ramadhan Sjamsani 91bdbd5289 build(backend): Dockerize for self-hosted deploy + deploy/log docs
Backend deploy target is self-hosted Docker (VPS / Kubernetes / Docker
Engine), not Cloud Run. Add a multi-stage Dockerfile (Node 20, bcrypt
compiled in build stage, non-root runtime), .dockerignore, a staging
docker-compose, and DEPLOY.md covering install, build, migrate, run, and
log mapping/rotation. Pin engines.node>=20. Update deployment.md runbook
and backend/CLAUDE.md infra line off Cloud Run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 15:10:59 +08:00

# Deployment notes
Operational decisions and dependency configuration for staging/production. Keep this updated as we make infra choices; cross-link from feature plans when a deploy-time setting matters.
## Infrastructure summary
| Component | Service | Tier / Notes |
|---|---|---|
| Backend (public + internal) | Self-hosted Docker (VPS / Kubernetes / Docker Engine) | NOT Cloud Run. Container from [backend/Dockerfile](../backend/Dockerfile); horizontal scaling via replicas; SIGTERM trapped for graceful drain ([server.js](../backend/src/server.js)) |
| Database | GCP Cloud SQL (PostgreSQL) | Source of truth for all durable state |
| Pub/sub + cache | Valkey | Self-hosted on VM today; Memorystore Standard (HA) recommended for prod (see [§ Valkey](#valkey)) |
| Networking | GCP VPC | Internal listener (port 3001) never exposed; CC reaches it via VPN |
| Payment | Xendit | See [phase5-xendit-plan.md](phase5-xendit-plan.md) for keys / webhook URL setup |
| Auth | Self-managed JWT + FCM-only Firebase | See [backend/CLAUDE.md](../backend/CLAUDE.md) |
## Valkey
Valkey is used for two distinct purposes:
1. **Pub/sub**: cross-instance event fan-out (chat messages, session lifecycle, config invalidation). See [backend/src/plugins/valkey.js](../backend/src/plugins/valkey.js).
2. **Availability mirror**: `mitras:online`, `mitras:deactivated`, `mitra:capacity:<id>`, `mitra:heartbeat:<id>`, and `availability:snapshot` per [valkey-online-mirror-plan.md](valkey-online-mirror-plan.md). Postgres remains the durable source of truth; Valkey is the hot read path.
### Persistence — required or optional?
**Not required.** All durable state lives in Postgres; Valkey is a cache + ephemeral liveness layer that fully rebuilds via `seedFromPostgres()` on backend reconnect.
What's actually in Valkey, and what happens if it's wiped:
| Key | Derivable from Postgres? | Cost of loss |
|---|---|---|
| `mitras:online` | yes | reseeded on reconnect |
| `mitras:deactivated` | yes | reseeded on reconnect |
| `mitra:capacity:<id>` | yes (`COUNT(*) FROM chat_sessions`) | reseeded on reconnect |
| `mitra:heartbeat:<id>` | **no** — pure transient liveness | seed writes `NOW`; ≤ a few seconds of fuzz on `last_heartbeat_at` forensics |
| `availability:snapshot` | recomputable | next beacon poll repopulates |
Reader code in services/* has explicit Postgres fallbacks for every Valkey op, so the cold-cache window during a restart degrades performance, not correctness.
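
To see which of the keys in the table above are present on a given instance (for example right after a restart), a small report helper can be useful. This is a minimal sketch, assuming `valkey-cli` is installed and can reach the instance; the function name `mirror_key_report` is illustrative and not part of the repo. It uses `TYPE` so it does not assume how each key is stored.

```bash
#!/usr/bin/env bash
# Sketch: report which availability-mirror keys currently exist in Valkey.
# Assumption: valkey-cli is installed; the helper name is hypothetical.
VALKEY_CLI="${VALKEY_CLI:-valkey-cli}"

mirror_key_report() {
  local url="$1" key
  # Fixed-name keys from the table; TYPE prints "none" when a key is absent.
  for key in mitras:online mitras:deactivated availability:snapshot; do
    printf '%s\t%s\n' "$key" "$("$VALKEY_CLI" -u "$url" TYPE "$key")"
  done
  # Per-mitra keys: count matches using SCAN-based iteration.
  for key in 'mitra:capacity:*' 'mitra:heartbeat:*'; do
    printf '%s\t%s\n' "$key" \
      "$("$VALKEY_CLI" -u "$url" --scan --pattern "$key" | grep -c .)"
  done
}
```

Usage: `mirror_key_report "$VALKEY_URL"`. Running it before and after a backend reconnect shows whether the reseed repopulated the mirror.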
### Persistence recommendation by environment
| Environment | Setting | Reason |
|---|---|---|
| **Dev / local** | No persistence (`--save "" --appendonly no`). The built-in default still takes periodic RDB snapshots, so pass the flags if you want none | Restarts wipe state; reseed handles it cleanly; zero disk overhead |
| **Staging** | AOF on (`--appendonly yes`) | Verifies prod-like behavior; tiny disk cost |
| **Production** | AOF on, optionally RDB too (`--appendonly yes --save 60 1000`) | Eliminates cold-cache window after restart; trivial disk footprint (few MB) |
The application code is identical across all three — persistence is a deploy-time knob, not a code-level concern.
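
Because persistence is set at deploy time, it is easy for an environment to drift from the table above. The following is a hedged sketch of a check against the live instance, assuming `valkey-cli` is available and the server permits `CONFIG GET` (managed services may restrict it). The helper names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: confirm the running Valkey matches the intended AOF setting.
# Assumption: CONFIG GET is allowed on the target instance.
VALKEY_CLI="${VALKEY_CLI:-valkey-cli}"

# Print the live value of one directive (CONFIG GET replies name, then value).
valkey_config_value() {
  local url="$1" directive="$2"
  "$VALKEY_CLI" -u "$url" CONFIG GET "$directive" | sed -n '2p'
}

# expected: "yes" for staging/production, "no" for dev/local.
check_aof() {
  local url="$1" expected="$2" actual
  actual="$(valkey_config_value "$url" appendonly)"
  if [ "$actual" = "$expected" ]; then
    echo "ok: appendonly=$actual"
  else
    echo "mismatch: appendonly=$actual, expected $expected" >&2
    return 1
  fi
}
```

Usage: `check_aof "$VALKEY_URL" yes` on staging or production.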
### Self-hosted Valkey (current state, dev/staging)
Docker container on the existing VM. Reference config:
```yaml
# Service entry for a Compose file (goes under `services:`).
valkey:
  image: valkey/valkey:7-alpine
  command: valkey-server --appendonly yes --save 60 1000
  volumes:
    - valkey-data:/data
  ports:
    - "6379:6379"
  restart: unless-stopped
```
Backend reaches it via `VALKEY_URL=redis://<vm-ip>:6379` in `backend/.env` (or the container's environment, e.g. via `--env-file`).
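
Before starting the backend against a freshly started Valkey container, it can help to wait until the server answers. A minimal sketch, assuming `valkey-cli` is installed on the host running the check; `wait_for_valkey` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: retry PING until Valkey answers PONG or the attempts run out.
VALKEY_CLI="${VALKEY_CLI:-valkey-cli}"

wait_for_valkey() {
  local url="$1" attempts="${2:-10}" delay="${3:-1}" i
  for ((i = 1; i <= attempts; i++)); do
    if [ "$("$VALKEY_CLI" -u "$url" PING 2>/dev/null)" = "PONG" ]; then
      echo "valkey reachable after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "valkey not reachable after $attempts attempt(s)" >&2
  return 1
}
```

Usage: `wait_for_valkey "redis://<vm-ip>:6379" 10 1`.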
### Memorystore migration (when going to prod)
The reseed-from-Postgres flow makes migration trivial — Valkey state is never load-bearing:
1. Provision **Memorystore for Valkey, Standard tier** (HA with replica) in the same VPC + region as the backend hosts.
   - Smallest available size (~1 GB) is plenty; actual data footprint is well under 1 MB.
   - Cost: ~$50/month at minimum sizing in asia-southeast2.
2. Update the backend env: `VALKEY_URL=redis://<memorystore-internal-ip>:6379`.
3. Deploy with a rolling restart of the replicas: new instances seed Memorystore from Postgres while old instances drain on the old Valkey.
4. Shut down old Valkey once traffic has migrated.
4. Shut down old Valkey once traffic has migrated.
**Zero downtime.** No data migration needed (state is derivable). The cold-cache window on new instances is handled by the existing Postgres-fallback reader paths.
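
Before step 4, a coarse gate can confirm that the new instance is reachable and has been written to. This is a sketch under stated assumptions: `valkey-cli` can reach the new endpoint, and a non-zero `DBSIZE` is treated only as a rough signal that reseeding ran, not as proof of a complete seed. The helper name is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: rough pre-shutdown gate for the old Valkey (step 4).
VALKEY_CLI="${VALKEY_CLI:-valkey-cli}"

new_valkey_seeded() {
  local new_url="$1" size
  size="$("$VALKEY_CLI" -u "$new_url" DBSIZE)" || return 1
  # Reject anything that is not a plain non-negative integer.
  case "$size" in
    ''|*[!0-9]*) echo "unexpected DBSIZE reply: $size" >&2; return 1 ;;
  esac
  if [ "$size" -gt 0 ]; then
    echo "new instance holds $size key(s)"
  else
    echo "new instance is empty; keep the old Valkey running" >&2
    return 1
  fi
}
```

Usage: `new_valkey_seeded "redis://<memorystore-internal-ip>:6379" && echo "safe to retire old Valkey"`.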
### Tier choice rationale
| Tier | When to use | Failover behavior |
|---|---|---|
| Self-hosted Docker | Dev, staging | Manual restart; backend reseeds when Valkey comes back |
| Memorystore Basic | Cost-sensitive single-AZ staging | ~15 min outage per maintenance event; backend handles via Postgres fallback |
| Memorystore Standard (HA) | **Production** | ~30s automatic failover; replica keeps data live |
The system is correct on any tier — HA reduces customer-visible latency spikes during Valkey events from minutes to seconds.
## Backend hosting (self-hosted Docker)
(Placeholder for prod tuning: fill in as we make decisions about region, replica count, concurrency, secrets manager wiring.)
### Manual staging deploy runbook
Goal: stand up a staging backend so the Android **staging** flavor (`com.mybestie.staging`) has a real `API_BASE_URL` to talk to. Done manually for now (no CI/CD yet — see open ops).
> **Deploy target: self-hosted Docker** (VPS / Kubernetes / Docker Engine) — not Cloud Run. The backend ships a multi-stage [backend/Dockerfile](../backend/Dockerfile) (Node 20, non-root runtime, native `bcrypt` compiled in the build stage). Build with `docker build -t halobestie-backend ./backend`.
>
> **Full operational runbook — install Docker, build/push, migrate, run (Docker + Compose + k8s), and log mapping/rotation — lives in [backend/DEPLOY.md](../backend/DEPLOY.md).** The steps below are the staging-bring-up summary.
**A1 — Provision the staging database (Cloud SQL Postgres)**
1. Create a Cloud SQL Postgres instance (or a separate `halobestie_staging` DB on a shared instance). Pin the **same region** as the backend host.
2. Capture its connection string for `DATABASE_URL` (private IP over the VPC, or the Cloud SQL Auth Proxy if the backend host sits outside the VPC).
3. Run migrations + seed against it:
```bash
cd backend
DATABASE_URL=postgresql://... npm run db:migrate
DATABASE_URL=postgresql://... npm run db:seed
```
**A2 — Provision staging Valkey** — self-hosted Docker on the VM is fine for staging (`--appendonly yes`, see [§ Valkey](#valkey)). Note the `VALKEY_URL`.
**A3 — Staging Firebase Admin creds** — the app's staging `google-services.json` / `GoogleService-Info.plist` point at Firebase project **`my-bestie-876ec`**. The backend's `FIREBASE_SERVICE_ACCOUNT` **must be a service-account key from that same project**, or FCM push + token verification will silently target the wrong project. Mount it as a secret and set `FIREBASE_SERVICE_ACCOUNT_PATH` (or switch to a Secret Manager mount).
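
Since a key from the wrong project fails silently, a quick check before mounting the file can catch the mismatch early. A minimal sketch: service-account key files carry a top-level `project_id` field, and the helper below (a hypothetical name) only greps for it rather than parsing the JSON.

```bash
#!/usr/bin/env bash
# Sketch: succeed only if the service-account key is for the expected project.
check_firebase_project() {
  local key_file="$1" expected="${2:-my-bestie-876ec}"
  if [ ! -r "$key_file" ]; then
    echo "cannot read $key_file" >&2
    return 1
  fi
  # Match "project_id": "<expected>" with optional whitespace around the colon.
  if grep -Eq "\"project_id\"[[:space:]]*:[[:space:]]*\"${expected}\"" "$key_file"; then
    echo "ok: key is for project $expected"
  else
    echo "project_id in $key_file is not $expected" >&2
    return 1
  fi
}
```

Usage: `check_firebase_project /path/to/firebase-sa.json`.
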
**A4 — Build the image + run migrations, then start the container.**
Build (on a build host or in CI), then push to your registry:
```bash
docker build -t <registry>/halobestie-backend:staging ./backend
docker push <registry>/halobestie-backend:staging
```
Run migrations as a **one-off** before (re)starting the service — never auto-migrate on boot (replica race):
```bash
docker run --rm --env-file backend/.env.staging \
  <registry>/halobestie-backend:staging node src/db/migrate.js
# first deploy only:
docker run --rm --env-file backend/.env.staging \
  <registry>/halobestie-backend:staging node src/db/seed.js
```
Run the service (plain Docker Engine example; k8s = Deployment + Service with the same env/secrets and liveness/readiness probes on `:3000`):
```bash
docker run -d --name halobestie-staging \
  --env-file backend/.env.staging \
  -p 3000:3000 \
  -v /path/to/firebase-sa.json:/secrets/firebase-sa.json:ro \
  --restart unless-stopped \
  <registry>/halobestie-backend:staging
```
- Publish **only** port 3000. The internal listener (3001) stays bound to `127.0.0.1` inside the container — do not map it.
- `FIREBASE_SERVICE_ACCOUNT_PATH` must point at the mounted path (e.g. `/secrets/firebase-sa.json`), not a baked-in file.
- Put a TLS-terminating reverse proxy (Nginx / Traefik / Caddy) in front for `https://staging-api.halobestie.com`.
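
The port rule above can be verified after the container starts. A hedged sketch using `docker port`, which lists a container's published mappings; the container name defaults to the `halobestie-staging` name used in the run example, and `check_published_ports` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: fail if the internal listener (3001) is published on the host.
DOCKER="${DOCKER:-docker}"

check_published_ports() {
  local name="${1:-halobestie-staging}" ports
  # Lines look like "3000/tcp -> 0.0.0.0:3000".
  ports="$("$DOCKER" port "$name")" || return 1
  if printf '%s\n' "$ports" | grep -q '^3001/'; then
    echo "port 3001 is published; remove the mapping" >&2
    return 1
  fi
  if ! printf '%s\n' "$ports" | grep -q '^3000/'; then
    echo "port 3000 is not published" >&2
    return 1
  fi
  echo "ok: 3000 published, 3001 not published"
}
```

Usage: `check_published_ports halobestie-staging`.
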
Staging-specific env values (`backend/.env.staging`; see [backend/.env.example](../backend/.env.example) for the full list):
| Var | Staging value |
|---|---|
| `AUTH_JWT_SECRET` | a fresh secret — **not** the prod one |
| `XENDIT_ENABLED` | `false` until you wire test-mode keys + webhook |
| `XENDIT_SECRET_KEY` / `XENDIT_WEBHOOK_TOKEN` | Xendit **test** credentials |
| `XENDIT_SUCCESS/FAILURE_REDIRECT_URL` | staging backend's `/payment/return/*` URLs |
| `FAZPASS_ENABLED` | `false` (test-user OTP bypass path) unless testing real OTP |
| `CC_ORIGIN` | staging control-center origin (if deployed) |
| `ADMIN_EMAIL` / `ADMIN_PASSWORD` | staging control-center login |
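
A pre-flight check on `backend/.env.staging` can catch missing values before the container starts. This is a sketch only: which variables count as required here is an assumption drawn from the table above (the authoritative list is `backend/.env.example`), and `check_staging_env` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: check a dotenv-style file for the staging values in the table.
# Assumption: the "required" list below is illustrative, not authoritative.
check_staging_env() {
  local file="$1" var missing=0
  for var in AUTH_JWT_SECRET XENDIT_ENABLED FAZPASS_ENABLED ADMIN_EMAIL ADMIN_PASSWORD; do
    # A variable counts as set only if it has a non-empty value.
    if ! grep -Eq "^${var}=.+" "$file"; then
      echo "missing or empty: $var" >&2
      missing=1
    fi
  done
  # Payments should stay off until test-mode keys and the webhook are wired.
  if grep -Eq '^XENDIT_ENABLED=true' "$file"; then
    echo "warning: XENDIT_ENABLED=true; confirm test-mode keys are in place" >&2
  fi
  return "$missing"
}
```

Usage: `check_staging_env backend/.env.staging && echo "env looks complete"`.
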
> **Public listener only.** The internal listener (port 3001, control center) must stay off the public internet; do not publish it from this container or route to it from the reverse proxy. CC for staging, if needed, goes behind the VPC/VPN per the root architecture rules.
**A5 — Capture the URL.** Point a DNS record (e.g. `staging-api.halobestie.com`) at the host/reverse proxy and terminate TLS there. **This HTTPS URL is the value the app needs** in Phase B.
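
Before handing the URL to the app team, a quick smoke test confirms that DNS, TLS, and the proxy all line up. A minimal sketch with `curl`; the default path `/health` is a placeholder, since this document does not name a health route, so substitute one the backend actually serves. `smoke_test` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: request a path over HTTPS and succeed only on a 2xx status.
# Assumption: "/health" is a placeholder path, not a confirmed route.
CURL="${CURL:-curl}"

smoke_test() {
  local base_url="$1" path="${2:-/health}" status
  case "$base_url" in
    https://*) ;;
    *) echo "expected an https:// URL, got: $base_url" >&2; return 1 ;;
  esac
  # -w prints only the status code; curl itself fails on DNS or TLS errors.
  status="$("$CURL" -sS -o /dev/null -w '%{http_code}' --max-time 10 \
    "${base_url%/}${path}")" || return 1
  case "$status" in
    2??) echo "ok: $status" ;;
    *) echo "unexpected status: $status" >&2; return 1 ;;
  esac
}
```

Usage: `smoke_test https://staging-api.halobestie.com /your/route`.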
### App handoff (Phase B) — once A5 gives a URL
1. Put the real URL in [`client_app/env/staging.json`](../client_app/env/staging.json) + [`mitra_app/env/staging.json`](../mitra_app/env/staging.json) (`API_BASE_URL`), and remove the `_TODO` key from the client file.
2. Build the staging APK:
```bash
cd client_app
flutter build apk --flavor staging -t lib/main_staging.dart --dart-define-from-file=env/staging.json
```
Output: `build/app/outputs/flutter-apk/app-staging-release.apk`.
3. Distribute via **Firebase App Distribution** (debug-signed APK is accepted — no upload keystore needed for staging) or share the APK directly. `com.mybestie.staging` installs side-by-side with prod.
> **Release signing is still debug keys** ([client_app/android/app/build.gradle.kts](../client_app/android/app/build.gradle.kts) `release { ... }`). Fine for Firebase App Distribution / direct APK. A real upload keystore is only required if you later publish staging to a Play Store internal-testing track. iOS staging is **not** wired yet (only one `Runner.xcscheme` — no per-flavor schemes/build-configs).
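
Step 1 of the handoff is easy to leave half-done, so a check before the build in step 2 can help. This is a sketch that greps the env files rather than parsing JSON; it assumes `API_BASE_URL` and `_TODO` appear as top-level keys as described above, and `check_app_env` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: succeed only if the env file has an https API_BASE_URL and no _TODO key.
check_app_env() {
  local file="$1"
  if ! grep -Eq '"API_BASE_URL"[[:space:]]*:[[:space:]]*"https://[^"]+"' "$file"; then
    echo "$file: API_BASE_URL is missing or not https" >&2
    return 1
  fi
  if grep -q '"_TODO"' "$file"; then
    echo "$file: remove the _TODO key" >&2
    return 1
  fi
  echo "ok: $file"
}
```

Usage: `check_app_env client_app/env/staging.json && check_app_env mitra_app/env/staging.json`.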
## Cloud SQL
(Placeholder — pool size, machine type, HA flag, backup retention.)
## Xendit
See [phase5-xendit-plan.md](phase5-xendit-plan.md) for credential setup and webhook URL configuration. Stage 8 (live E2E) is currently blocked on test-mode keys.
## Open ops decisions
- [ ] Confirm Memorystore Standard tier for prod deploy (recommended in [§ Valkey](#valkey)).
- [ ] Pin GCP region for backend + Cloud SQL + Memorystore (all must match for sub-ms internal latency).
- [ ] Secrets manager (GCP Secret Manager vs container env vars / mounted secret files) for `AUTH_JWT_SECRET`, `XENDIT_SECRET_KEY`, etc.
- [ ] Backup retention policy for Cloud SQL.
- [ ] CI/CD pipeline for backend image builds and deploys.