
Ramadhan Sjamsani 553dbac52f Phase 6: Valkey availability mirror — move read path off Postgres

Mitra-availability state (online flag, deactivated flag, per-mitra session
count, heartbeat liveness) mirrored into Valkey so the customer beacon
+ pairing blast + dashboard counts no longer hit Postgres on the hot path.
Postgres remains the durable source of truth; Valkey state is fully
derivable via seedFromPostgres on startup + reconnect.

Schema
- mitras:online           SET    — mirror of is_online
- mitras:deactivated      SET    — mirror of is_active=false
- mitra:capacity:<id>     STRING — active+pending_payment session count
- mitra:heartbeat:<id>    STRING — ISO timestamp of last ping
- availability:snapshot   JSON   — beacon cache, TTL 10s, cluster-shared

Write paths (Postgres first, best-effort Valkey)
- setOnline/setOffline mirror SADD/SREM + heartbeat SET/DEL
- updateMitraStatus mirrors mitras:deactivated AND revokes auth_sessions
  on deactivate (bounds the "ghost online" window to access-token TTL)
- heartbeat is Valkey-only on the hot path; the per-ping Postgres UPDATE
  on last_heartbeat_at is eliminated (was 1,200 ops/min at prod scale)
- chat_session lifecycle (accept/end/reroute/extension/expiry) calls
  recomputeCapacityForMitra after each UPDATE — derive-from-truth avoids
  the bookkeeping risk of per-transition INCR/DECR
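
The derive-from-truth idea in the last bullet can be illustrated with a minimal sketch. It uses an in-memory array and a `Map` in place of Postgres and Valkey; the row shape (`mitraId`, `status`) and the helper `countOpenSessions` are hypothetical, and only the counted statuses (active and pending_payment) come from the schema above.

```javascript
// Statuses that count toward capacity, per the schema note above.
const COUNTED_STATUSES = new Set(["active", "pending_payment"]);

// Derive the count from the session rows themselves (the "truth").
// The row shape { mitraId, status } is an assumption for illustration.
function countOpenSessions(sessions, mitraId) {
  return sessions.filter(
    (s) => s.mitraId === mitraId && COUNTED_STATUSES.has(s.status)
  ).length;
}

// Overwrite the mirrored counter with the derived value. Running this twice
// yields the same result, whereas a missed or duplicated INCR/DECR would
// leave the counter permanently skewed.
function recomputeCapacityForMitra(sessions, cache, mitraId) {
  const count = countOpenSessions(sessions, mitraId);
  cache.set(`mitra:capacity:${mitraId}`, String(count));
  return count;
}
```

Because the function overwrites rather than adjusts, calling it after every lifecycle UPDATE is safe even when two transitions race or one is retried.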

Read paths (Valkey-first, Postgres fallback on Valkey error)
- isMitraReachable: SISMEMBER mitras:online + heartbeat freshness
- findAvailableMitras: SDIFF + pipelined GETs, filter by capacity + heartbeat
- countAvailableMitrasFromCache: Valkey-driven, cached cluster-wide 10s TTL
- dashboard online count: SCARD
- Each reader wraps Valkey ops in try/catch → Postgres fallback on outage
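
As a rough illustration of the findAvailableMitras filter, the sketch below applies the same three checks (not deactivated, under capacity, fresh heartbeat) to plain JavaScript collections standing in for the SDIFF result and the pipelined GET replies. The option names `maxSessions` and `staleAfterMs`, and the choice to treat a missing capacity key as zero, are assumptions rather than anything stated in this commit.

```javascript
// online / deactivated: Set<string> of mitra ids
// capacity / heartbeat: Map<string, string> keyed by mitra id
function filterAvailableMitras({ online, deactivated, capacity, heartbeat }, opts) {
  const { maxSessions, staleAfterMs, now } = opts;
  const available = [];
  for (const id of online) {
    // Equivalent of SDIFF mitras:online mitras:deactivated.
    if (deactivated.has(id)) continue;

    // A missing capacity key is treated as zero sessions (assumption).
    const sessions = Number(capacity.get(id) ?? 0);
    if (sessions >= maxSessions) continue;

    // A missing or unparseable heartbeat counts as stale.
    const lastPing = Date.parse(heartbeat.get(id) ?? "");
    if (Number.isNaN(lastPing) || now - lastPing > staleAfterMs) continue;

    available.push(id);
  }
  return available;
}
```

Keeping the filter pure makes it easy to exercise without a running Valkey; the caller is responsible for fetching the four collections and for falling back to Postgres if that fetch throws.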

Heartbeat path on /api/mitra/status/heartbeat
- resolveMitra preHandler replaced with heartbeatGuard: SISMEMBER on
  mitras:deactivated (~0 DB hits per ping). Falls back to full DB
  resolveMitra if Valkey is unreachable so a Valkey outage doesn't
  silently accept heartbeats from deactivated mitras.
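
A minimal sketch of the guard's decision logic follows. The two injected functions, `isDeactivatedInValkey` and `resolveMitraFromDb`, are hypothetical stand-ins for the SISMEMBER call and the full DB lookup; the real preHandler signature is not shown in this commit and will differ.

```javascript
// Decide whether a heartbeat from mitraId should be accepted.
// deps.isDeactivatedInValkey: async (id) => boolean   (SISMEMBER mitras:deactivated)
// deps.resolveMitraFromDb:    async (id) => row | null (full Postgres lookup)
async function heartbeatGuard(mitraId, deps) {
  try {
    // Hot path: one set-membership check, no DB hit.
    const deactivated = await deps.isDeactivatedInValkey(mitraId);
    return { allowed: !deactivated, source: "valkey" };
  } catch {
    // Valkey unreachable: fall back to the database so a deactivated mitra
    // is still rejected during the outage.
    const mitra = await deps.resolveMitraFromDb(mitraId);
    return { allowed: Boolean(mitra && mitra.is_active), source: "postgres" };
  }
}
```

The fallback deliberately fails closed: if the database lookup returns nothing, the heartbeat is refused.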

Three sweeps, env-configurable cadences
- MITRA_AUTO_OFFLINE_SWEEP_SECONDS (30) — Valkey-driven stale detection
- HEARTBEAT_MIRROR_INTERVAL_SECONDS (60) — batched UPSERT writes
  Valkey timestamps to Postgres last_heartbeat_at via UNNEST (1 statement
  per cycle, idempotent across instances)
- VALKEY_ONLINE_MIRROR_SWEEP_SECONDS (300) — periodic reseed heals drift
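
The sketch below shows one way the three cadences could be read from the environment, and how a whole batch of heartbeats can be sent as a single UNNEST statement. It is written as `UPDATE ... FROM` rather than the UPSERT the commit mentions, and the table name `mitras`, the column `id`, and the `uuid` cast are assumptions; only the env var names, their defaults, and `last_heartbeat_at` come from the list above.

```javascript
// Read a positive integer number of seconds, falling back to the default.
function readSeconds(env, name, fallback) {
  const parsed = Number.parseInt(env[name] ?? "", 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

function sweepCadences(env = process.env) {
  return {
    autoOfflineSweep: readSeconds(env, "MITRA_AUTO_OFFLINE_SWEEP_SECONDS", 30),
    heartbeatMirror: readSeconds(env, "HEARTBEAT_MIRROR_INTERVAL_SECONDS", 60),
    onlineMirrorSweep: readSeconds(env, "VALKEY_ONLINE_MIRROR_SWEEP_SECONDS", 300),
  };
}

// Build one parameterized statement for the whole batch.
// heartbeats: Map<mitraId, ISO timestamp> read from Valkey.
// Table and column names here are illustrative assumptions.
function buildHeartbeatMirrorQuery(heartbeats) {
  const ids = [];
  const stamps = [];
  for (const [id, iso] of heartbeats) {
    ids.push(id);
    stamps.push(iso);
  }
  return {
    text:
      "UPDATE mitras AS m SET last_heartbeat_at = v.ts " +
      "FROM UNNEST($1::uuid[], $2::timestamptz[]) AS v(id, ts) " +
      "WHERE m.id = v.id",
    values: [ids, stamps],
  };
}
```

Two parallel arrays keep the statement at two parameters regardless of batch size, and writing the same timestamps twice leaves the rows unchanged, which is what makes the cycle safe to run on several instances.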

Startup
- restoreActiveTimers → seedFromPostgres → bind listeners
- onValkeyReady re-runs the seed on every reconnect (cold start + reseed
  on Valkey restart, no manual intervention)

Failure semantics
- Read fallback: every Valkey read wrapped, falls back to existing
  Postgres JOIN query — system stays correct during Valkey outage,
  performance degrades not breaks
- Write best-effort: Postgres write commits before Valkey is touched;
  Valkey errors log + continue; reconciliation sweep heals drift
- Auto-offline sweep aborts entirely on Valkey error (does NOT mass-
  offline via Postgres scan during Valkey hiccup)
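
The abort-on-error rule for the auto-offline sweep can be sketched as follows. The injected `readHeartbeats` and `setOffline` functions and the `staleAfterMs` threshold are hypothetical; the point is only that a failed Valkey read ends the cycle without marking anyone offline.

```javascript
// deps.readHeartbeats: async () => Map<mitraId, ISO timestamp> (from Valkey)
// deps.setOffline:     async (mitraId) => void
async function autoOfflineSweep(deps, { now, staleAfterMs }) {
  let heartbeats;
  try {
    heartbeats = await deps.readHeartbeats();
  } catch {
    // No liveness data available: abort the cycle instead of guessing,
    // so a Valkey hiccup never takes every mitra offline.
    return { aborted: true, offlined: [] };
  }

  const offlined = [];
  for (const [id, iso] of heartbeats) {
    const lastPing = Date.parse(iso);
    if (Number.isNaN(lastPing) || now - lastPing > staleAfterMs) {
      await deps.setOffline(id);
      offlined.push(id);
    }
  }
  return { aborted: false, offlined };
}
```

An aborted cycle costs at most one sweep interval of delay before stale mitras are caught, which is a smaller risk than a false mass-offline.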

Tests
- New: 32 integration tests in mitra-status.valkey-mirror.test.js
  covering seed, write-through, fallbacks, capacity lifecycle,
  auto-offline sweep, heartbeat mirror, deactivation flow, beacon cache
- Updated: fixtures.js seeds Valkey alongside Postgres when isOnline=true
- Updated: helpers/db.js resetDb also flushes test Valkey
- Fixed 2 pre-existing session-timer flakes (string IDs failed uuid
  parse; vi.advanceTimersByTimeAsync raced real Postgres I/O)
- All 124/124 backend tests pass (was 90/92)

Docs
- requirement/valkey-online-mirror-plan.md — canonical plan
- requirement/valkey-online-mirror-testing.md — manual E2E checklist
- requirement/deployment.md — infra + Valkey persistence guidance for
  prod (Memorystore Standard tier recommended; migration from
  self-hosted Valkey is zero-downtime via reseed-from-Postgres)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-25 18:07:55 +08:00

5.5 KiB


Deployment notes

Operational decisions and dependency configuration for staging/production. Keep this updated as we make infra choices; cross-link from feature plans when a deploy-time setting matters.

Infrastructure summary

| Component | Service | Tier / Notes |
| --- | --- | --- |
| Backend (public + internal) | GCP Cloud Run | Horizontal scaling; SIGTERM trapped for graceful drain (server.js) |
| Database | GCP Cloud SQL (PostgreSQL) | Source of truth for all durable state |
| Pub/sub + cache | Valkey | Self-hosted on VM today; Memorystore Standard (HA) recommended for prod (see § Valkey) |
| Networking | GCP VPC | Internal listener (port 3001) never exposed; CC reaches it via VPN |
| Payment | Xendit | See phase5-xendit-plan.md for keys / webhook URL setup |
| Auth | Self-managed JWT + FCM-only Firebase | See backend/CLAUDE.md |

Valkey

Valkey is used for two distinct purposes:

- Pub/sub: cross-instance event fan-out (chat messages, session lifecycle, config invalidation). See backend/src/plugins/valkey.js.
- Availability mirror: mitras:online, mitras:deactivated, mitra:capacity:<id>, mitra:heartbeat:<id>, and availability:snapshot per valkey-online-mirror-plan.md. Postgres remains the durable source of truth; Valkey is the hot read path.

Persistence — required or optional?

Not required. All durable state lives in Postgres; Valkey is a cache + ephemeral liveness layer that fully rebuilds via seedFromPostgres() on backend reconnect.
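
To make the rebuild concrete, here is a minimal sketch of what a seed step might compute before writing to Valkey. The row shape (`id`, `is_online`, `is_active`, `open_sessions`) and the function name `buildSeed` are hypothetical; the actual seedFromPostgres() may be structured quite differently.

```javascript
// rows: one entry per mitra, e.g. from a single Postgres query (assumed shape).
// nowIso: timestamp written for heartbeats, which Postgres cannot supply.
function buildSeed(rows, nowIso) {
  const seed = { online: new Set(), deactivated: new Set(), strings: new Map() };
  for (const row of rows) {
    if (!row.is_active) seed.deactivated.add(row.id); // mitras:deactivated
    seed.strings.set(`mitra:capacity:${row.id}`, String(row.open_sessions));
    if (row.is_online) {
      seed.online.add(row.id); // mitras:online
      // Liveness is transient, so the seed stamps "now" and lets the next
      // real ping (or the stale sweep) correct it.
      seed.strings.set(`mitra:heartbeat:${row.id}`, nowIso);
    }
  }
  return seed;
}
```

Since every value is computed from Postgres rows plus the current time, running the seed against an empty or a stale Valkey converges to the same state.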

What's actually in Valkey, and what happens if it's wiped:

| Key | Derivable from Postgres? | Cost of loss |
| --- | --- | --- |
| `mitras:online` | yes | reseeded on reconnect |
| `mitras:deactivated` | yes | reseeded on reconnect |
| `mitra:capacity:<id>` | yes (`COUNT(*) FROM chat_sessions`) | reseeded on reconnect |
| `mitra:heartbeat:<id>` | no (pure transient liveness) | seed writes `NOW`; ≤ a few seconds of fuzz on `last_heartbeat_at` forensics |
| `availability:snapshot` | recomputable | next beacon poll repopulates |

Reader code in services/* has explicit Postgres fallbacks for every Valkey op, so the cold-cache window during a restart degrades performance, not correctness.
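
The fallback pattern described here might look roughly like the wrapper below. The name `valkeyFirst` and the injected reader functions are hypothetical; the real readers in services/* may inline the try/catch rather than share a helper.

```javascript
// Try the Valkey reader first; on any error, log and use the Postgres reader.
// Both readers are async functions supplied by the caller.
async function valkeyFirst(readFromValkey, readFromPostgres, log = console) {
  try {
    return await readFromValkey();
  } catch (err) {
    log.warn(`valkey read failed, falling back to postgres: ${err.message}`);
    return readFromPostgres();
  }
}
```

A caller would pass two closures that answer the same question, for example a SCARD-based count and the equivalent SQL count, so the result is the same either way and only latency differs.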

Persistence recommendation by environment

| Environment | Setting | Reason |
| --- | --- | --- |
| Dev / local | No persistence (`--save "" --appendonly no` or just default) | Restarts wipe state; reseed handles it cleanly; zero disk overhead |
| Staging | AOF on (`--appendonly yes`) | Verifies prod-like behavior; tiny disk cost |
| Production | AOF on, optionally RDB too (`--appendonly yes --save 60 1000`) | Eliminates cold-cache window after restart; trivial disk footprint (few MB) |

The application code is identical across all three — persistence is a deploy-time knob, not a code-level concern.

Self-hosted Valkey (current state, dev/staging)

Docker container on the existing VM. Reference config:

valkey:
  image: valkey/valkey:7-alpine
  command: valkey-server --appendonly yes --save 60 1000
  volumes:
    - valkey-data:/data
  ports:
    - "6379:6379"
  restart: unless-stopped

Backend reaches it via VALKEY_URL=redis://<vm-ip>:6379 in backend/.env (or Cloud Run env var).

Memorystore migration (when going to prod)

The reseed-from-Postgres flow makes migration trivial — Valkey state is never load-bearing:

1. Provision Memorystore for Valkey, Standard tier (HA with replica) in the same VPC + region as Cloud Run.
   - Smallest available size (~1 GB) is plenty; actual data footprint is well under 1 MB.
   - Cost: ~$50/month at minimum sizing in asia-southeast2.
2. Update Cloud Run env: VALKEY_URL=redis://<memorystore-internal-ip>:6379.
3. Deploy new revision. Cloud Run rolling deploy → new instances seed Memorystore from Postgres; old instances drain on old Valkey.
4. Shut down old Valkey once traffic has migrated.

Zero downtime. No data migration needed (state is derivable). The cold-cache window on new instances is handled by the existing Postgres-fallback reader paths.

Tier choice rationale

| Tier | When to use | Failover behavior |
| --- | --- | --- |
| Self-hosted Docker | Dev, staging | Manual restart; backend reseeds when Valkey comes back |
| Memorystore Basic | Cost-sensitive single-AZ staging | ~1–5 min outage per maintenance event; backend handles via Postgres fallback |
| Memorystore Standard (HA) | Production | ~30s automatic failover; replica keeps data live |

The system is correct on any tier — HA reduces customer-visible latency spikes during Valkey events from minutes to seconds.

Cloud Run

(Placeholder — fill in as we make decisions about region, min/max instances, concurrency, secrets manager wiring.)

Cloud SQL

(Placeholder — pool size, machine type, HA flag, backup retention.)

Xendit

See phase5-xendit-plan.md for credential setup and webhook URL configuration. Stage 8 (live E2E) is currently blocked on test-mode keys.

Open ops decisions

- Confirm Memorystore Standard tier for prod deploy (recommended in § Valkey).
- Pin GCP region for backend + Cloud SQL + Memorystore (all must match for sub-ms internal latency).
- Secrets manager (GCP Secret Manager vs Cloud Run env vars) for AUTH_JWT_SECRET, XENDIT_SECRET_KEY, etc.
- Backup retention policy for Cloud SQL.
- CI/CD pipeline for Cloud Run deploys.
