# Deployment notes

Operational decisions and dependency configuration for staging/production. Keep this updated as we make infra choices; cross-link from feature plans when a deploy-time setting matters.

## Infrastructure summary

| Component | Service | Tier / Notes |
|---|---|---|
| Backend (public + internal) | GCP Cloud Run | Horizontal scaling; SIGTERM trapped for graceful drain ([server.js](../backend/src/server.js)) |
| Database | GCP Cloud SQL (PostgreSQL) | Source of truth for all durable state |
| Pub/sub + cache | Valkey | Self-hosted on VM today; Memorystore Standard (HA) recommended for prod (see [§ Valkey](#valkey)) |
| Networking | GCP VPC | Internal listener (port 3001) never exposed; CC reaches it via VPN |
| Payment | Xendit | See [phase5-xendit-plan.md](phase5-xendit-plan.md) for keys / webhook URL setup |
| Auth | Self-managed JWT + FCM-only Firebase | See [backend/CLAUDE.md](../backend/CLAUDE.md) |

## Valkey

Valkey is used for two distinct purposes:

1. **Pub/sub**: cross-instance event fan-out (chat messages, session lifecycle, config invalidation). See [backend/src/plugins/valkey.js](../backend/src/plugins/valkey.js).
2. **Availability mirror**: `mitras:online`, `mitras:deactivated`, `mitra:capacity:`, `mitra:heartbeat:`, and `availability:snapshot` per [valkey-online-mirror-plan.md](valkey-online-mirror-plan.md). Postgres remains the durable source of truth; Valkey is the hot read path.

### Persistence: required or optional?

**Not required.** All durable state lives in Postgres; Valkey is a cache + ephemeral liveness layer that fully rebuilds via `seedFromPostgres()` on backend reconnect.

What's actually in Valkey, and what happens if it's wiped:

| Key | Derivable from Postgres? | Cost of loss |
|---|---|---|
| `mitras:online` | yes | reseeded on reconnect |
| `mitras:deactivated` | yes | reseeded on reconnect |
| `mitra:capacity:` | yes (`COUNT(*) FROM chat_sessions`) | reseeded on reconnect |
| `mitra:heartbeat:` | **no** (pure transient liveness) | seed writes `NOW`; ≤ a few seconds of fuzz on `last_heartbeat_at` forensics |
| `availability:snapshot` | recomputable | next beacon poll repopulates |

Reader code in `services/*` has explicit Postgres fallbacks for every Valkey op, so the cold-cache window during a restart degrades performance, not correctness.

### Persistence recommendation by environment

| Environment | Setting | Reason |
|---|---|---|
| **Dev / local** | No persistence (`--save "" --appendonly no` or just default) | Restarts wipe state; reseed handles it cleanly; zero disk overhead |
| **Staging** | AOF on (`--appendonly yes`) | Verifies prod-like behavior; tiny disk cost |
| **Production** | AOF on, optionally RDB too (`--appendonly yes --save 60 1000`) | Eliminates cold-cache window after restart; trivial disk footprint (few MB) |

The application code is identical across all three; persistence is a deploy-time knob, not a code-level concern.

### Self-hosted Valkey (current state, dev/staging)

Docker container on the existing VM. Reference config:

```yaml
valkey:
  image: valkey/valkey:7-alpine
  command: valkey-server --appendonly yes --save 60 1000
  volumes:
    - valkey-data:/data
  ports:
    - "6379:6379"
  restart: unless-stopped
```

Backend reaches it via `VALKEY_URL=redis://:6379` in `backend/.env` (or Cloud Run env var).

### Memorystore migration (when going to prod)

The reseed-from-Postgres flow makes migration trivial because Valkey state is never load-bearing:

1. Provision **Memorystore for Valkey, Standard tier** (HA with replica) in the same VPC + region as Cloud Run.
   - Smallest available size (~1 GB) is plenty; actual data footprint is well under 1 MB.
   - Cost: ~$50/month at minimum sizing in asia-southeast2.
2. Update Cloud Run env: `VALKEY_URL=redis://:6379`.
3. Deploy new revision. Cloud Run rolling deploy → new instances seed Memorystore from Postgres; old instances drain on old Valkey.
4. Shut down old Valkey once traffic has migrated.

**Zero downtime.** No data migration needed (state is derivable). The cold-cache window on new instances is handled by the existing Postgres-fallback reader paths.

### Tier choice rationale

| Tier | When to use | Failover behavior |
|---|---|---|
| Self-hosted Docker | Dev, staging | Manual restart; backend reseeds when Valkey comes back |
| Memorystore Basic | Cost-sensitive single-AZ staging | ~1–5 min outage per maintenance event; backend handles via Postgres fallback |
| Memorystore Standard (HA) | **Production** | ~30s automatic failover; replica keeps data live |

The system is correct on any tier; HA reduces customer-visible latency spikes during Valkey events from minutes to seconds.

## Cloud Run

(Placeholder: fill in as we make decisions about region, min/max instances, concurrency, secrets manager wiring.)

## Cloud SQL

(Placeholder: pool size, machine type, HA flag, backup retention.)

## Xendit

See [phase5-xendit-plan.md](phase5-xendit-plan.md) for credential setup and webhook URL configuration. Stage 8 (live E2E) is currently blocked on test-mode keys.

## Open ops decisions

- [ ] Confirm Memorystore Standard tier for prod deploy (recommended in [§ Valkey](#valkey)).
- [ ] Pin GCP region for backend + Cloud SQL + Memorystore (all must match for sub-ms internal latency).
- [ ] Secrets manager (GCP Secret Manager vs Cloud Run env vars) for `AUTH_JWT_SECRET`, `XENDIT_SECRET_KEY`, etc.
- [ ] Backup retention policy for Cloud SQL.
- [ ] CI/CD pipeline for Cloud Run deploys.
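
## Appendix: Postgres-fallback read sketch

The Valkey section states that reader code falls back to Postgres for every Valkey op, so a wiped or unreachable Valkey degrades performance rather than correctness. The sketch below illustrates that pattern only; it is not the code in `services/*`. The function name `getOnlineMitraIds`, the assumption that `mitras:online` is a set, the SQL table and column names, and the in-memory stand-in clients are all hypothetical. The client shapes (`smembers`, `query`) follow what Redis-compatible and Postgres clients for Node commonly expose, but treat them as assumptions too.

```javascript
// Hypothetical sketch: read the online-mitra list from Valkey, falling back
// to Postgres when Valkey is unreachable or cold. Names are illustrative only.

/**
 * @param {{ smembers(key: string): Promise<string[]> }} valkey  Redis-compatible client (assumed shape)
 * @param {{ query(sql: string): Promise<{ rows: { id: string }[] }> }} pg  Postgres client (assumed shape)
 * @returns {Promise<{ source: 'valkey' | 'postgres', ids: string[] }>}
 */
async function getOnlineMitraIds(valkey, pg) {
  try {
    // Assumption: `mitras:online` is stored as a set of mitra ids.
    const ids = await valkey.smembers('mitras:online');
    // An empty result may mean "nobody online" or "cache not yet reseeded";
    // this sketch treats it as cold and asks Postgres to be safe.
    if (ids.length > 0) {
      return { source: 'valkey', ids };
    }
  } catch (err) {
    // Valkey is down or restarting: fall through to the durable source.
  }

  // Hypothetical query; the real schema may differ.
  const { rows } = await pg.query('SELECT id FROM mitras WHERE is_online = true');
  return { source: 'postgres', ids: rows.map((row) => row.id) };
}

// In-memory stand-ins so the sketch runs without real services.
const warmValkey = { smembers: async () => ['m1', 'm2'] };
const coldValkey = { smembers: async () => [] };
const downValkey = {
  smembers: async () => {
    throw new Error('ECONNREFUSED');
  },
};
const fakePg = {
  query: async () => ({ rows: [{ id: 'm1' }, { id: 'm2' }, { id: 'm3' }] }),
};
```

Calling `getOnlineMitraIds(downValkey, fakePg)` resolves from the Postgres stand-in, while `getOnlineMitraIds(warmValkey, fakePg)` never touches it. Treating an empty set as a cache miss costs one extra query when no mitra is online, which seems an acceptable trade for not serving a falsely empty list during the cold-cache window.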