Phase 6: Valkey availability mirror — move read path off Postgres

Mitra-availability state (online flag, deactivated flag, per-mitra session count, heartbeat liveness) mirrored into Valkey so the customer beacon + pairing blast + dashboard counts no longer hit Postgres on the hot path. Postgres remains the durable source of truth; Valkey state is fully derivable via seedFromPostgres on startup + reconnect.

Schema
- mitras:online SET: mirror of is_online
- mitras:deactivated SET: mirror of is_active=false
- mitra:capacity:<id> STRING: active+pending_payment session count
- mitra:heartbeat:<id> STRING: ISO timestamp of last ping
- availability:snapshot JSON: beacon cache, TTL 10s, cluster-shared

Write paths (Postgres first, best-effort Valkey)
- setOnline/setOffline mirror SADD/SREM + heartbeat SET/DEL
- updateMitraStatus mirrors mitras:deactivated AND revokes auth_sessions on deactivate (bounds the "ghost online" window to access-token TTL)
- heartbeat is Valkey-only on the hot path; the per-ping Postgres UPDATE on last_heartbeat_at is eliminated (was 1,200 ops/min at prod scale)
- chat_session lifecycle (accept/end/reroute/extension/expiry) calls recomputeCapacityForMitra after each UPDATE; derive-from-truth avoids the bookkeeping risk of per-transition INCR/DECR

Read paths (Valkey-first, Postgres fallback on Valkey error)
- isMitraReachable: SISMEMBER mitras:online + heartbeat freshness
- findAvailableMitras: SDIFF + pipelined GETs, filter by capacity + heartbeat
- countAvailableMitrasFromCache: Valkey-driven, cached cluster-wide 10s TTL
- dashboard online count: SCARD
- Each reader wraps Valkey ops in try/catch → Postgres fallback on outage

Heartbeat path on /api/mitra/status/heartbeat
- resolveMitra preHandler replaced with heartbeatGuard: SISMEMBER on mitras:deactivated (~0 DB hits per ping). Falls back to full DB resolveMitra if Valkey is unreachable so a Valkey outage doesn't silently accept heartbeats from deactivated mitras.

Three sweeps, env-configurable cadences
- MITRA_AUTO_OFFLINE_SWEEP_SECONDS (30): Valkey-driven stale detection
- HEARTBEAT_MIRROR_INTERVAL_SECONDS (60): batched UPSERT writes Valkey timestamps to Postgres last_heartbeat_at via UNNEST (1 statement per cycle, idempotent across instances)
- VALKEY_ONLINE_MIRROR_SWEEP_SECONDS (300): periodic reseed heals drift

Startup
- restoreActiveTimers → seedFromPostgres → bind listeners
- onValkeyReady re-runs the seed on every reconnect (cold start + reseed on Valkey restart, no manual intervention)

Failure semantics
- Read fallback: every Valkey read wrapped, falls back to existing Postgres JOIN query; system stays correct during Valkey outage, performance degrades not breaks
- Write best-effort: Postgres write commits before Valkey is touched; Valkey errors log + continue; reconciliation sweep heals drift
- Auto-offline sweep aborts entirely on Valkey error (does NOT mass-offline via Postgres scan during Valkey hiccup)

Tests
- New: 32 integration tests in mitra-status.valkey-mirror.test.js covering seed, write-through, fallbacks, capacity lifecycle, auto-offline sweep, heartbeat mirror, deactivation flow, beacon cache
- Updated: fixtures.js seeds Valkey alongside Postgres when isOnline=true
- Updated: helpers/db.js resetDb also flushes test Valkey
- Fixed 2 pre-existing session-timer flakes (string IDs failed uuid parse; vi.advanceTimersByTimeAsync raced real Postgres I/O)
- All 124/124 backend tests pass (was 90/92)

Docs
- requirement/valkey-online-mirror-plan.md: canonical plan
- requirement/valkey-online-mirror-testing.md: manual E2E checklist
- requirement/deployment.md: infra + Valkey persistence guidance for prod (Memorystore Standard tier recommended; migration from self-hosted Valkey is zero-downtime via reseed-from-Postgres)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 18:07:55 +08:00
parent 3fff4b1c6e
commit 553dbac52f
20 changed files with 1839 additions and 82 deletions
--- a/requirement/deployment.md
+++ b/requirement/deployment.md
@@ -0,0 +1,107 @@
+# Deployment notes
+
+Operational decisions and dependency configuration for staging/production. Keep this updated as we make infra choices; cross-link from feature plans when a deploy-time setting matters.
+
+## Infrastructure summary
+
+| Component | Service | Tier / Notes |
+|---|---|---|
+| Backend (public + internal) | GCP Cloud Run | Horizontal scaling; SIGTERM trapped for graceful drain ([server.js](../backend/src/server.js)) |
+| Database | GCP Cloud SQL (PostgreSQL) | Source of truth for all durable state |
+| Pub/sub + cache | Valkey | Self-hosted on VM today; Memorystore Standard (HA) recommended for prod (see [§ Valkey](#valkey)) |
+| Networking | GCP VPC | Internal listener (port 3001) never exposed; CC reaches it via VPN |
+| Payment | Xendit | See [phase5-xendit-plan.md](phase5-xendit-plan.md) for keys / webhook URL setup |
+| Auth | Self-managed JWT + FCM-only Firebase | See [backend/CLAUDE.md](../backend/CLAUDE.md) |
+
+## Valkey
+
+Valkey is used for two distinct purposes:
+
+1. **Pub/sub** — cross-instance event fan-out (chat messages, session lifecycle, config invalidation). See [backend/src/plugins/valkey.js](../backend/src/plugins/valkey.js).
+2. **Availability mirror** — `mitras:online`, `mitras:deactivated`, `mitra:capacity:<id>`, `mitra:heartbeat:<id>`, and `availability:snapshot` per [valkey-online-mirror-plan.md](valkey-online-mirror-plan.md). Postgres remains the durable source of truth; Valkey is the hot read path.
+
+### Persistence — required or optional?
+
+**Not required.** All durable state lives in Postgres; Valkey is a cache + ephemeral liveness layer that fully rebuilds via `seedFromPostgres()` on backend startup and on every Valkey reconnect.
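
To illustrate why the state is rebuildable, here is a minimal sketch of how a seed could translate Postgres rows into Valkey commands for the keys listed above. The function name `buildSeedCommands`, the row fields (`id`, `is_online`, `is_active`, `open_sessions`), and the choice to clear the sets first are all assumptions made for illustration; this is not the actual `seedFromPostgres()` implementation.

```javascript
// Sketch only: turn Postgres rows into the Valkey commands a seed might issue.
// Row field names (id, is_online, is_active, open_sessions) are assumed.
function buildSeedCommands(rows, now = new Date()) {
  // Assumption: start from empty sets so stale members do not survive a reseed.
  const commands = [['DEL', 'mitras:online', 'mitras:deactivated']];
  for (const row of rows) {
    if (row.is_online) {
      commands.push(['SADD', 'mitras:online', row.id]);
      // Heartbeat liveness is transient, so the seed stamps the current time.
      commands.push(['SET', `mitra:heartbeat:${row.id}`, now.toISOString()]);
    }
    if (!row.is_active) {
      commands.push(['SADD', 'mitras:deactivated', row.id]);
    }
    // Capacity is the count of open sessions, stored as a string value.
    commands.push(['SET', `mitra:capacity:${row.id}`, String(row.open_sessions)]);
  }
  return commands;
}
```

Returning a plain command list keeps the sketch independent of any particular client library; a real seed would presumably send these through a pipeline.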
+
+What's actually in Valkey, and what happens if it's wiped:
+
+| Key | Derivable from Postgres? | Cost of loss |
+|---|---|---|
+| `mitras:online` | yes | reseeded on reconnect |
+| `mitras:deactivated` | yes | reseeded on reconnect |
+| `mitra:capacity:<id>` | yes (`COUNT(*) FROM chat_sessions`) | reseeded on reconnect |
+| `mitra:heartbeat:<id>` | **no** — pure transient liveness | seed writes `NOW`; ≤ a few seconds of fuzz on `last_heartbeat_at` forensics |
+| `availability:snapshot` | recomputable | next beacon poll repopulates |
+
+Reader code in services/* has explicit Postgres fallbacks for every Valkey op, so the cold-cache window during a restart degrades performance, not correctness.
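
The Valkey-first, Postgres-fallback pattern described above might look roughly like the following sketch. `readWithFallback` and its injected `fromValkey` / `fromPostgres` callbacks are hypothetical names chosen for illustration; the real readers in services/* may be structured differently.

```javascript
// Hypothetical helper: try the Valkey read first, fall back to Postgres on any error.
// `fromValkey` and `fromPostgres` are injected async functions (assumed names).
async function readWithFallback(fromValkey, fromPostgres, log = console) {
  try {
    return { source: 'valkey', value: await fromValkey() };
  } catch (err) {
    // A Valkey failure costs latency, not correctness: Postgres is still the truth.
    log.warn(`valkey read failed, using postgres: ${err.message}`);
    return { source: 'postgres', value: await fromPostgres() };
  }
}
```

Tagging the result with its `source` is an optional extra that makes it easy to count fallback hits in logs or metrics during an outage.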
+
+### Persistence recommendation by environment
+
+| Environment | Setting | Reason |
+|---|---|---|
+| **Dev / local** | No persistence (`--save "" --appendonly no`; the stock defaults still write periodic RDB snapshots, which is also harmless here) | Restarts wipe state; reseed handles it cleanly; zero disk overhead |
+| **Staging** | AOF on (`--appendonly yes`) | Verifies prod-like behavior; tiny disk cost |
+| **Production** | AOF on, optionally RDB too (`--appendonly yes --save 60 1000`) | Eliminates cold-cache window after restart; trivial disk footprint (few MB) |
+
+The application code is identical across all three — persistence is a deploy-time knob, not a code-level concern.
+
+### Self-hosted Valkey (current state, dev/staging)
+
+Docker container on the existing VM. Reference config:
+
+```yaml
+valkey:
+  image: valkey/valkey:7-alpine
+  command: valkey-server --appendonly yes --save 60 1000
+  volumes:
+    - valkey-data:/data
+  ports:
+    - "6379:6379"
+  restart: unless-stopped
+```
+
+Backend reaches it via `VALKEY_URL=redis://<vm-ip>:6379` in `backend/.env` (or Cloud Run env var).
+
+### Memorystore migration (when going to prod)
+
+The reseed-from-Postgres flow makes migration trivial — Valkey state is never load-bearing:
+
+1. Provision **Memorystore for Valkey, Standard tier** (HA with replica) in the same VPC + region as Cloud Run.
+   - Smallest available size (~1 GB) is plenty; actual data footprint is well under 1 MB.
+   - Cost: ~$50/month at minimum sizing in asia-southeast2.
+2. Update Cloud Run env: `VALKEY_URL=redis://<memorystore-internal-ip>:6379`.
+3. Deploy new revision. Cloud Run rolling deploy → new instances seed Memorystore from Postgres; old instances drain on old Valkey.
+4. Shut down old Valkey once traffic has migrated.
+
+**Zero downtime.** No data migration needed (state is derivable). The cold-cache window on new instances is handled by the existing Postgres-fallback reader paths.
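
The reseed-on-connect behavior that makes this migration safe can be sketched as a listener on the client's ready event. This assumes an ioredis-style client that emits `ready` after each connection or reconnection; `seedOnEveryReady` is a hypothetical name, and `seed` stands in for `seedFromPostgres()`.

```javascript
const { EventEmitter } = require('node:events');

// Sketch: re-run the seed every time the client reports it is ready.
// Assumes the client emits 'ready' on first connect and after each reconnect.
function seedOnEveryReady(client, seed, log = console) {
  client.on('ready', async () => {
    try {
      await seed();
    } catch (err) {
      // A failed seed is logged, not thrown; readers still fall back to Postgres.
      log.error(`seed failed: ${err.message}`);
    }
  });
}

// Demo with a fake client: a cold start plus one reconnect triggers two seeds.
const fakeClient = new EventEmitter();
let seedRuns = 0;
seedOnEveryReady(fakeClient, async () => { seedRuns += 1; });
fakeClient.emit('ready');
fakeClient.emit('ready');
```

Because the seed is tied to the connection event rather than to process startup, pointing `VALKEY_URL` at an empty instance needs no separate migration step in this model.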
+
+### Tier choice rationale
+
+| Tier | When to use | Failover behavior |
+|---|---|---|
+| Self-hosted Docker | Dev, staging | Manual restart; backend reseeds when Valkey comes back |
+| Memorystore Basic | Cost-sensitive single-AZ staging | ~1–5 min outage per maintenance event; backend handles via Postgres fallback |
+| Memorystore Standard (HA) | **Production** | ~30s automatic failover; replica keeps data live |
+
+The system is correct on any tier — HA reduces customer-visible latency spikes during Valkey events from minutes to seconds.
+
+## Cloud Run
+
+(Placeholder — fill in as we make decisions about region, min/max instances, concurrency, secrets manager wiring.)
+
+## Cloud SQL
+
+(Placeholder — pool size, machine type, HA flag, backup retention.)
+
+## Xendit
+
+See [phase5-xendit-plan.md](phase5-xendit-plan.md) for credential setup and webhook URL configuration. Stage 8 (live E2E) is currently blocked on test-mode keys.
+
+## Open ops decisions
+
+- [ ] Confirm Memorystore Standard tier for prod deploy (recommended in [§ Valkey](#valkey)).
+- [ ] Pin GCP region for backend + Cloud SQL + Memorystore (all must match for sub-ms internal latency).
+- [ ] Secrets manager (GCP Secret Manager vs Cloud Run env vars) for `AUTH_JWT_SECRET`, `XENDIT_SECRET_KEY`, etc.
+- [ ] Backup retention policy for Cloud SQL.
+- [ ] CI/CD pipeline for Cloud Run deploys.