halobestie-clone/requirement/deployment.md
Ramadhan Sjamsani 553dbac52f Phase 6: Valkey availability mirror — move read path off Postgres
Mitra-availability state (online flag, deactivated flag, per-mitra session
count, heartbeat liveness) mirrored into Valkey so the customer beacon
+ pairing blast + dashboard counts no longer hit Postgres on the hot path.
Postgres remains the durable source of truth; Valkey state is fully
derivable via seedFromPostgres on startup + reconnect.

Schema
- mitras:online           SET    — mirror of is_online
- mitras:deactivated      SET    — mirror of is_active=false
- mitra:capacity:<id>     STRING — active+pending_payment session count
- mitra:heartbeat:<id>    STRING — ISO timestamp of last ping
- availability:snapshot   JSON   — beacon cache, TTL 10s, cluster-shared

Write paths (Postgres first, best-effort Valkey)
- setOnline/setOffline mirror SADD/SREM + heartbeat SET/DEL
- updateMitraStatus mirrors mitras:deactivated AND revokes auth_sessions
  on deactivate (bounds the "ghost online" window to access-token TTL)
- heartbeat is Valkey-only on the hot path; the per-ping Postgres UPDATE
  on last_heartbeat_at is eliminated (was 1,200 ops/min at prod scale)
- chat_session lifecycle (accept/end/reroute/extension/expiry) calls
  recomputeCapacityForMitra after each UPDATE — derive-from-truth avoids
  the bookkeeping risk of per-transition INCR/DECR

Read paths (Valkey-first, Postgres fallback on Valkey error)
- isMitraReachable: SISMEMBER mitras:online + heartbeat freshness
- findAvailableMitras: SDIFF + pipelined GETs, filter by capacity + heartbeat
- countAvailableMitrasFromCache: Valkey-driven, cached cluster-wide 10s TTL
- dashboard online count: SCARD
- Each reader wraps Valkey ops in try/catch → Postgres fallback on outage

Heartbeat path on /api/mitra/status/heartbeat
- resolveMitra preHandler replaced with heartbeatGuard: SISMEMBER on
  mitras:deactivated (~0 DB hits per ping). Falls back to full DB
  resolveMitra if Valkey is unreachable so a Valkey outage doesn't
  silently accept heartbeats from deactivated mitras.

Three sweeps, env-configurable cadences
- MITRA_AUTO_OFFLINE_SWEEP_SECONDS (30) — Valkey-driven stale detection
- HEARTBEAT_MIRROR_INTERVAL_SECONDS (60) — batched UPSERT writes
  Valkey timestamps to Postgres last_heartbeat_at via UNNEST (1 statement
  per cycle, idempotent across instances)
- VALKEY_ONLINE_MIRROR_SWEEP_SECONDS (300) — periodic reseed heals drift

Startup
- restoreActiveTimers → seedFromPostgres → bind listeners
- onValkeyReady re-runs the seed on every reconnect (cold start + reseed
  on Valkey restart, no manual intervention)

Failure semantics
- Read fallback: every Valkey read wrapped, falls back to existing
  Postgres JOIN query — system stays correct during Valkey outage,
  performance degrades not breaks
- Write best-effort: Postgres write commits before Valkey is touched;
  Valkey errors log + continue; reconciliation sweep heals drift
- Auto-offline sweep aborts entirely on Valkey error (does NOT mass-
  offline via Postgres scan during Valkey hiccup)

Tests
- New: 32 integration tests in mitra-status.valkey-mirror.test.js
  covering seed, write-through, fallbacks, capacity lifecycle,
  auto-offline sweep, heartbeat mirror, deactivation flow, beacon cache
- Updated: fixtures.js seeds Valkey alongside Postgres when isOnline=true
- Updated: helpers/db.js resetDb also flushes test Valkey
- Fixed 2 pre-existing session-timer flakes (string IDs failed uuid
  parse; vi.advanceTimersByTimeAsync raced real Postgres I/O)
- All 124/124 backend tests pass (was 90/92)

Docs
- requirement/valkey-online-mirror-plan.md — canonical plan
- requirement/valkey-online-mirror-testing.md — manual E2E checklist
- requirement/deployment.md — infra + Valkey persistence guidance for
  prod (Memorystore Standard tier recommended; migration from
  self-hosted Valkey is zero-downtime via reseed-from-Postgres)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 18:07:55 +08:00

# Deployment notes
Operational decisions and dependency configuration for staging/production. Keep this updated as we make infra choices; cross-link from feature plans when a deploy-time setting matters.
## Infrastructure summary
| Component | Service | Tier / Notes |
|---|---|---|
| Backend (public + internal) | GCP Cloud Run | Horizontal scaling; SIGTERM trapped for graceful drain ([server.js](../backend/src/server.js)) |
| Database | GCP Cloud SQL (PostgreSQL) | Source of truth for all durable state |
| Pub/sub + cache | Valkey | Self-hosted on VM today; Memorystore Standard (HA) recommended for prod (see [§ Valkey](#valkey)) |
| Networking | GCP VPC | Internal listener (port 3001) never exposed; CC reaches it via VPN |
| Payment | Xendit | See [phase5-xendit-plan.md](phase5-xendit-plan.md) for keys / webhook URL setup |
| Auth | Self-managed JWT + FCM-only Firebase | See [backend/CLAUDE.md](../backend/CLAUDE.md) |
## Valkey
Valkey is used for two distinct purposes:
1. **Pub/sub**: cross-instance event fan-out (chat messages, session lifecycle, config invalidation). See [backend/src/plugins/valkey.js](../backend/src/plugins/valkey.js).
2. **Availability mirror**: `mitras:online`, `mitras:deactivated`, `mitra:capacity:<id>`, `mitra:heartbeat:<id>`, and `availability:snapshot` per [valkey-online-mirror-plan.md](valkey-online-mirror-plan.md). Postgres remains the durable source of truth; Valkey is the hot read path.
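
As a rough illustration of how the mirror keys combine on the read path, the sketch below picks out available mitras from the keys listed above. It is not the project's implementation: `FakeValkey`, `MAX_SESSIONS`, and `HEARTBEAT_FRESH_MS` are hypothetical stand-ins, and the real thresholds are not stated in this document. A real client would issue `SDIFF` and pipelined `GET`s over the network.

```javascript
// In-memory stand-in for a Valkey client (hypothetical; for illustration only).
class FakeValkey {
  constructor() {
    this.sets = new Map();    // key -> Set of members
    this.strings = new Map(); // key -> string value
  }
  async sadd(key, ...members) {
    if (!this.sets.has(key)) this.sets.set(key, new Set());
    members.forEach((m) => this.sets.get(key).add(m));
  }
  async sdiff(a, b) {
    const exclude = this.sets.get(b) ?? new Set();
    return [...(this.sets.get(a) ?? [])].filter((m) => !exclude.has(m));
  }
  async set(key, value) { this.strings.set(key, String(value)); }
  async mget(keys) { return keys.map((k) => this.strings.get(k) ?? null); }
}

// Assumed thresholds; the source does not give the real values.
const MAX_SESSIONS = 1;
const HEARTBEAT_FRESH_MS = 90_000;

async function findAvailableMitras(valkey, now = Date.now()) {
  // Online mitras that have not been deactivated.
  const ids = await valkey.sdiff('mitras:online', 'mitras:deactivated');
  if (ids.length === 0) return [];
  const capacities = await valkey.mget(ids.map((id) => `mitra:capacity:${id}`));
  const heartbeats = await valkey.mget(ids.map((id) => `mitra:heartbeat:${id}`));
  return ids.filter((id, i) => {
    const sessions = Number(capacities[i] ?? 0);      // missing key counts as zero sessions
    const lastPing = Date.parse(heartbeats[i] ?? ''); // NaN when the key is missing
    // A NaN comparison is false, so mitras without a heartbeat are filtered out.
    return sessions < MAX_SESSIONS && now - lastPing <= HEARTBEAT_FRESH_MS;
  });
}
```

Treating a missing heartbeat as "not reachable" is a deliberate choice in this sketch: it errs toward hiding a mitra rather than routing a customer to one whose liveness is unknown.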
### Persistence — required or optional?
**Not required.** All durable state lives in Postgres; Valkey is a cache + ephemeral liveness layer that fully rebuilds via `seedFromPostgres()` on backend reconnect.
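
To make the rebuild concrete, here is a minimal sketch of what a seed could compute. It is an assumption-laden illustration, not the actual `seedFromPostgres()`: the helper name `seedFromRows` and the row shape `{ id, is_online, is_active, open_sessions }` are hypothetical, and the rows are passed in rather than queried so the example stays self-contained.

```javascript
// Hypothetical helper: rebuilds the mirror state from rows already read from Postgres.
// Each row is assumed to look like { id, is_online, is_active, open_sessions }.
function seedFromRows(rows, now = new Date()) {
  const state = {
    online: new Set(),      // would become mitras:online
    deactivated: new Set(), // would become mitras:deactivated
    strings: new Map(),     // mitra:capacity:<id> and mitra:heartbeat:<id>
  };
  for (const row of rows) {
    if (!row.is_active) state.deactivated.add(row.id);
    state.strings.set(`mitra:capacity:${row.id}`, String(row.open_sessions));
    if (row.is_online) {
      state.online.add(row.id);
      // Heartbeats cannot be derived, so online mitras are stamped with the seed time.
      state.strings.set(`mitra:heartbeat:${row.id}`, now.toISOString());
    }
  }
  return state;
}
```

Because the result depends only on the rows and the seed time, running it again after a wipe yields the same sets and counters, which is what makes a lost Valkey instance recoverable without a data migration.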
What's actually in Valkey, and what happens if it's wiped:
| Key | Derivable from Postgres? | Cost of loss |
|---|---|---|
| `mitras:online` | yes | reseeded on reconnect |
| `mitras:deactivated` | yes | reseeded on reconnect |
| `mitra:capacity:<id>` | yes (`COUNT(*) FROM chat_sessions`) | reseeded on reconnect |
| `mitra:heartbeat:<id>` | **no** (pure transient liveness) | reseed stamps online mitras with `NOW`, so `last_heartbeat_at` forensics may be off by up to a few seconds |
| `availability:snapshot` | recomputable | next beacon poll repopulates |
Reader code in `services/*` has explicit Postgres fallbacks for every Valkey op, so the cold-cache window during a restart degrades performance, not correctness.
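
The fallback pattern can be sketched as a small wrapper. This is a simplified illustration under assumptions: `readWithFallback`, `countOnline`, and `db.countOnlineMitras` are hypothetical names, and the real readers may structure their try/catch differently.

```javascript
// Generic wrapper: try the Valkey read first, fall back to Postgres if it throws.
async function readWithFallback(valkeyRead, postgresRead, log = console.warn) {
  try {
    return await valkeyRead();
  } catch (err) {
    log(`valkey read failed, falling back to postgres: ${err.message}`);
    return postgresRead();
  }
}

// Hypothetical usage for a dashboard online count (SCARD on mitras:online).
async function countOnline(valkey, db) {
  return readWithFallback(
    () => valkey.scard('mitras:online'),
    () => db.countOnlineMitras(), // stand-in for the existing Postgres query
  );
}
```

Only a thrown error triggers the fallback. An empty or stale answer from Valkey is returned as-is, which is why the periodic reseed is still needed to heal drift.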
### Persistence recommendation by environment
| Environment | Setting | Reason |
|---|---|---|
| **Dev / local** | No persistence (`--save "" --appendonly no`); the stock defaults, which only take periodic RDB snapshots, are also fine | Restarts wipe state; reseed handles it cleanly; zero disk overhead |
| **Staging** | AOF on (`--appendonly yes`) | Verifies prod-like behavior; tiny disk cost |
| **Production** | AOF on, optionally RDB too (`--appendonly yes --save 60 1000`) | Eliminates cold-cache window after restart; trivial disk footprint (few MB) |
The application code is identical across all three — persistence is a deploy-time knob, not a code-level concern.
### Self-hosted Valkey (current state, dev/staging)
Docker container on the existing VM. Reference config:
```yaml
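# Sits in the Compose file alongside the other services.
services:
  valkey:
    image: valkey/valkey:7-alpine
    command: valkey-server --appendonly yes --save 60 1000
    volumes:
      - valkey-data:/data   # AOF and RDB files live here
    ports:
      - "6379:6379"
    restart: unless-stopped

# Named volumes must be declared at the top level.
volumes:
  valkey-data: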
```
Backend reaches it via `VALKEY_URL=redis://<vm-ip>:6379` in `backend/.env` (or Cloud Run env var).
### Memorystore migration (when going to prod)
The reseed-from-Postgres flow makes migration trivial — Valkey state is never load-bearing:
1. Provision **Memorystore for Valkey, Standard tier** (HA with replica) in the same VPC + region as Cloud Run.
   - Smallest available size (~1 GB) is plenty; actual data footprint is well under 1 MB.
   - Cost: ~$50/month at minimum sizing in asia-southeast2.
2. Update Cloud Run env: `VALKEY_URL=redis://<memorystore-internal-ip>:6379`.
3. Deploy new revision. Cloud Run rolling deploy → new instances seed Memorystore from Postgres; old instances drain on old Valkey.
4. Shut down old Valkey once traffic has migrated.
**Zero downtime.** No data migration needed (state is derivable). The cold-cache window on new instances is handled by the existing Postgres-fallback reader paths.
### Tier choice rationale
| Tier | When to use | Failover behavior |
|---|---|---|
| Self-hosted Docker | Dev, staging | Manual restart; backend reseeds when Valkey comes back |
| Memorystore Basic | Cost-sensitive single-AZ staging | ~15 min outage per maintenance event; backend handles via Postgres fallback |
| Memorystore Standard (HA) | **Production** | ~30s automatic failover; replica keeps data live |
The system is correct on any tier — HA reduces customer-visible latency spikes during Valkey events from minutes to seconds.
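
The "backend reseeds when Valkey comes back" behavior in the table above could be wired roughly as follows. This is a sketch under assumptions: the `'ready'` event name and the `client.on(...)` shape mirror common Node Redis-compatible clients but are not confirmed by this document, and `seed` stands in for `seedFromPostgres()`.

```javascript
// Hypothetical wiring: run the seed every time the client reports it is ready,
// which covers both cold start and reconnects after a Valkey restart or failover.
function onValkeyReady(client, seed, log = console.error) {
  client.on('ready', () => {
    Promise.resolve()
      .then(seed)
      // A failed seed is logged rather than thrown, so it cannot crash the process;
      // readers keep using their Postgres fallbacks until a later seed succeeds.
      .catch((err) => log(`reseed failed: ${err.message}`));
  });
}
```

Seeding on every `ready` event, rather than only at boot, is what removes the need for manual intervention after a failover or maintenance restart.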
## Cloud Run
(Placeholder — fill in as we make decisions about region, min/max instances, concurrency, secrets manager wiring.)
## Cloud SQL
(Placeholder — pool size, machine type, HA flag, backup retention.)
## Xendit
See [phase5-xendit-plan.md](phase5-xendit-plan.md) for credential setup and webhook URL configuration. Stage 8 (live E2E) is currently blocked on test-mode keys.
## Open ops decisions
- [ ] Confirm Memorystore Standard tier for prod deploy (recommended in [§ Valkey](#valkey)).
- [ ] Pin GCP region for backend + Cloud SQL + Memorystore (all must match for sub-ms internal latency).
- [ ] Secrets manager (GCP Secret Manager vs Cloud Run env vars) for `AUTH_JWT_SECRET`, `XENDIT_SECRET_KEY`, etc.
- [ ] Backup retention policy for Cloud SQL.
- [ ] CI/CD pipeline for Cloud Run deploys.