Files

Ramadhan Sjamsani 553dbac52f Phase 6: Valkey availability mirror — move read path off Postgres

Mitra-availability state (online flag, deactivated flag, per-mitra session
count, heartbeat liveness) mirrored into Valkey so the customer beacon
+ pairing blast + dashboard counts no longer hit Postgres on the hot path.
Postgres remains the durable source of truth; Valkey state is fully
derivable via seedFromPostgres on startup + reconnect.

Schema
- mitras:online           SET    — mirror of is_online
- mitras:deactivated      SET    — mirror of is_active=false
- mitra:capacity:<id>     STRING — active+pending_payment session count
- mitra💓<id>    STRING — ISO timestamp of last ping
- availability:snapshot   JSON   — beacon cache, TTL 10s, cluster-shared

Write paths (Postgres first, best-effort Valkey)
- setOnline/setOffline mirror SADD/SREM + heartbeat SET/DEL
- updateMitraStatus mirrors mitras:deactivated AND revokes auth_sessions
  on deactivate (bounds the "ghost online" window to access-token TTL)
- heartbeat is Valkey-only on the hot path; the per-ping Postgres UPDATE
  on last_heartbeat_at is eliminated (was 1,200 ops/min at prod scale)
- chat_session lifecycle (accept/end/reroute/extension/expiry) calls
  recomputeCapacityForMitra after each UPDATE — derive-from-truth avoids
  the bookkeeping risk of per-transition INCR/DECR

Read paths (Valkey-first, Postgres fallback on Valkey error)
- isMitraReachable: SISMEMBER mitras:online + heartbeat freshness
- findAvailableMitras: SDIFF + pipelined GETs, filter by capacity + heartbeat
- countAvailableMitrasFromCache: Valkey-driven, cached cluster-wide 10s TTL
- dashboard online count: SCARD
- Each reader wraps Valkey ops in try/catch → Postgres fallback on outage

Heartbeat path on /api/mitra/status/heartbeat
- resolveMitra preHandler replaced with heartbeatGuard: SISMEMBER on
  mitras:deactivated (~0 DB hits per ping). Falls back to full DB
  resolveMitra if Valkey is unreachable so a Valkey outage doesn't
  silently accept heartbeats from deactivated mitras.

Three sweeps, env-configurable cadences
- MITRA_AUTO_OFFLINE_SWEEP_SECONDS (30) — Valkey-driven stale detection
- HEARTBEAT_MIRROR_INTERVAL_SECONDS (60) — batched UPSERT writes
  Valkey timestamps to Postgres last_heartbeat_at via UNNEST (1 statement
  per cycle, idempotent across instances)
- VALKEY_ONLINE_MIRROR_SWEEP_SECONDS (300) — periodic reseed heals drift

Startup
- restoreActiveTimers → seedFromPostgres → bind listeners
- onValkeyReady re-runs the seed on every reconnect (cold start + reseed
  on Valkey restart, no manual intervention)

Failure semantics
- Read fallback: every Valkey read wrapped, falls back to existing
  Postgres JOIN query — system stays correct during Valkey outage,
  performance degrades not breaks
- Write best-effort: Postgres write commits before Valkey is touched;
  Valkey errors log + continue; reconciliation sweep heals drift
- Auto-offline sweep aborts entirely on Valkey error (does NOT mass-
  offline via Postgres scan during Valkey hiccup)

Tests
- New: 32 integration tests in mitra-status.valkey-mirror.test.js
  covering seed, write-through, fallbacks, capacity lifecycle,
  auto-offline sweep, heartbeat mirror, deactivation flow, beacon cache
- Updated: fixtures.js seeds Valkey alongside Postgres when isOnline=true
- Updated: helpers/db.js resetDb also flushes test Valkey
- Fixed 2 pre-existing session-timer flakes (string IDs failed uuid
  parse; vi.advanceTimersByTimeAsync raced real Postgres I/O)
- All 124/124 backend tests pass (was 90/92)

Docs
- requirement/valkey-online-mirror-plan.md — canonical plan
- requirement/valkey-online-mirror-testing.md — manual E2E checklist
- requirement/deployment.md — infra + Valkey persistence guidance for
  prod (Memorystore Standard tier recommended; migration from
  self-hosted Valkey is zero-downtime via reseed-from-Postgres)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-25 18:07:55 +08:00

15 KiB

Raw Blame History

Valkey Availability Mirror — Testing Checklist

End-to-end verification for the Valkey-mirror refactor described in valkey-online-mirror-plan.md.

Cluster labels: [BE] backend / curl / SQL / Valkey-cli, [CC] control_center, [M] mitra_app, [C] client_app.

Run order: Section A first (seed + startup) — every subsequent section assumes a fresh seed. Sections B–J are otherwise independent.

Setup

Backend running on 192.168.88.247:3000 (public) + :3001 (internal) — curl http://192.168.88.247:3000/api/shared/auth-providers returns 200
Valkey reachable from backend (VALKEY_URL matches running instance; [valkey] subscribed to config:invalidate appears in backend boot log)
Postgres reachable; backend boot log shows [valkey-mirror] seed: X online, Y deactivated, Z with active sessions
At least 3 mitra accounts exist in DB (we need to flip them online/offline/deactivated across tests)
One customer account ready for the blast scenarios

Helpful aliases for verification (run from backend/):

alias vk='node --env-file=.env -e'
# Then in tests: vk "(async()=>{const v=(await import('./src/plugins/valkey.js')).getValkeyClient();
#   console.log(await v.smembers('mitras:online'));process.exit(0)})()"

Section A — Seed + Startup

Verifies seedFromPostgres() populates Valkey correctly from Postgres truth.

[BE] Restart backend; log shows one [valkey-mirror] seed: N online, M deactivated, K with active sessions line on startup

[BE] Counts in the log match Postgres truth:

SELECT
  (SELECT COUNT(*) FROM mitra_online_status WHERE is_online=true) AS online,
  (SELECT COUNT(*) FROM mitras WHERE is_active=false) AS deactivated,
  (SELECT COUNT(DISTINCT mitra_id) FROM chat_sessions
   WHERE mitra_id IS NOT NULL AND status IN ('active','pending_payment')) AS with_sessions;

[BE] Valkey contents match — SMEMBERS mitras:online returns the same IDs as SELECT mitra_id FROM mitra_online_status WHERE is_online=true
[BE] Valkey mitra:heartbeat:<id> exists for every currently-online mitra, value is a recent ISO timestamp (within seed time)
[BE] Valkey mitra:capacity:<id> matches SELECT COUNT(*) FROM chat_sessions WHERE mitra_id=<id> AND status IN ('active','pending_payment') for every online mitra
[BE] SMEMBERS mitras:deactivated matches SELECT id FROM mitras WHERE is_active=false

Reconnect re-seed

[BE] With backend running, FLUSHDB on Valkey, wait ~5s for ioredis reconnect, verify a second [valkey-mirror] seed: log entry appears
[BE] All four Valkey structures are rebuilt and match Postgres again

Section B — Online / Offline toggle (write-through)

Verifies setOnline / setOffline writes both stores in the right order.

[M] Mitra taps "online" → backend updates Postgres mitra_online_status.is_online=true
[BE] SISMEMBER mitras:online <mitra-id> returns 1 within 1s of the toggle
[BE] GET mitra:heartbeat:<mitra-id> returns a fresh ISO timestamp (within seconds of the toggle)
[M] Mitra taps "offline"
[BE] Postgres is_online=false, Valkey SISMEMBER mitras:online <id> returns 0, GET mitra:heartbeat:<id> returns nil
[BE] mitra_online_logs has paired online / offline audit rows

Valkey-failure best-effort write

[BE] Stop Valkey, then toggle a mitra online → request succeeds (200), backend log shows [valkey-mirror] setOnline <id> failed: but Postgres is updated correctly
[BE] Restart Valkey → reconciliation sweep (≤ 300s default) eventually rebuilds the SET to include this mitra

Section C — Heartbeat path

Verifies the rewrite: per-ping = 1 Valkey write, 0 DB writes.

[M] Mitra online for at least one heartbeat cycle (~30s)
[BE] Watch Postgres query log during heartbeat — no UPDATE mitra_online_status SET last_heartbeat_at rows fire on every ping (only the batched mirror, default every 60s)
[BE] GET mitra:heartbeat:<id> value advances on each ping (re-check ~30s later, timestamp moves forward)
[BE] After 60s+ wait, SELECT last_heartbeat_at FROM mitra_online_status WHERE mitra_id=<id> advances (heartbeat mirror sweep ran)

Deactivation guard via Valkey

[CC] Admin deactivates the mitra (Phase 5 path: is_active=false)
[BE] SISMEMBER mitras:deactivated <id> immediately returns 1
[M] Mitra app's next heartbeat → backend returns 403 ACCOUNT_INACTIVE
[BE] No Postgres SELECT on mitras table for that heartbeat (verify with query log) — guard is pure Valkey

Fallback when Valkey is down

[BE] Stop Valkey
[M] Mitra app heartbeats → backend logs [heartbeat] valkey check failed, falling back to DB, request still succeeds for active mitra, returns 403 for deactivated mitra (full DB-backed resolveMitra path)
[BE] Restart Valkey → next heartbeat uses Valkey path again (no fallback log line)

Section D — Capacity counter

Verifies INCR/DECR across session lifecycle via recomputeCapacityForMitra.

[BE] Reset: pick a mitra with mitra:capacity:<id> = 0
[C] Customer pays → mitra accepts the blast → chat starts
[BE] GET mitra:capacity:<id> returns 1 within 1s of mitra accepting
[C] Second customer pays → same mitra accepts (assuming max_customers_per_mitra >= 2)
[BE] GET mitra:capacity:<id> returns 2
[C] First session ends naturally (timer expires + goodbye flow completes)
[BE] GET mitra:capacity:<id> returns 1 within 1s
[C] Second session ends
[BE] GET mitra:capacity:<id> returns 0

Reroute

[CC] Reroute an active session from mitra A → mitra B
[BE] mitra:capacity:A decrements, mitra:capacity:B increments — both atomic with the chat_sessions UPDATE

Capacity gates blast

[BE] Set mitra:capacity:<id> to max_customers_per_mitra directly (SET mitra:capacity:<id> 3)
[C] Customer pays → blast → this mitra is excluded (verify with chat_request_notifications — no row for this mitra)

Section E — Deactivation flow (full propagation)

Verifies updateMitraStatus + revokeAllSessionsForUser close the deactivation gap.

[BE] Mitra has an active access token (capture from /api/mitra/auth/otp/verify or use existing logged-in session)
[BE] Confirm mitra can call protected routes (curl -H "Authorization: Bearer <token>" /api/mitra/...)
[CC] Admin deactivates the mitra
[BE] Postgres: mitras.is_active=false, auth_sessions.revoked_at IS NOT NULL for this mitra
[BE] Valkey: SISMEMBER mitras:deactivated <id> = 1
[BE] Mitra's current access token still works on routes that don't re-check active state (stateless JWT) — bounded by access-token TTL
[BE] Mitra's heartbeat returns 403 ACCOUNT_INACTIVE immediately (Valkey check on hot path)
[BE] Mitra's next refresh token attempt fails (because auth_sessions.revoked_at was set) → app effectively logs them out

Re-activation

[CC] Re-activate the mitra
[BE] Postgres is_active=true, Valkey SISMEMBER mitras:deactivated <id> = 0
[M] Mitra re-logs in, heartbeats again successfully

Section F — Read paths (Valkey-first)

Verifies all reads use Valkey with Postgres fallback.

`isMitraReachable`

[BE] Online mitra with fresh heartbeat → isMitraReachable returns true (call via node -e or any route that uses it, e.g. extension flow)
[BE] Manually set mitra:heartbeat:<id> to a timestamp older than stale_after_seconds → isMitraReachable returns false (even though is_online=true in Postgres)
[BE] Stop Valkey → isMitraReachable logs [isMitraReachable] valkey unavailable, falling back to DB and returns based on Postgres is_online

`findAvailableMitras` (blast)

[BE] Run findAvailableMitras (e.g. trigger a customer blast) — log shows Valkey path used (no warning about fallback)
[BE] Result IDs match SDIFF mitras:online mitras:deactivated filtered by capacity + heartbeat freshness
[BE] Stop Valkey → next blast logs [findAvailableMitras] valkey unavailable, falling back to Postgres and still returns correct results

`countAvailableMitrasFromCache` (customer beacon)

[BE] curl /public/bestie-availability returns {available: bool, count: N} matching reality
[BE] GET availability:snapshot in Valkey shows cached JSON within 10s of last poll
[BE] Multiple rapid polls (5+ per second from 3 different IPs) → only one Valkey-driven recompute per 10s; Postgres query log shows zero mitra-availability JOINs in steady state (only the once-per-10s cache miss)
[CC] Operator changes max_customers_per_mitra → availability:snapshot is DELd (cache bust), next poll recomputes

Dashboard online count

[CC] Dashboard "Online Mitras" stat matches SCARD mitras:online in Valkey
[BE] Verify the dashboard query no longer hits SELECT COUNT(*) FROM mitra_online_status WHERE is_online=true (check query log)

Section G — Auto-offline sweep (Valkey-driven)

Verifies stale heartbeats → flipped offline via Valkey diff.

[BE] Set MITRA_AUTO_OFFLINE_SWEEP_SECONDS=10 in env for faster test cycle, restart backend
[CC] Set stale_after_seconds=15 in CC settings (or directly in app_config)
[M] Mitra goes online, sends one heartbeat
[BE] Manually delete the heartbeat key: DEL mitra:heartbeat:<id>
[BE] Wait up to 25s (15s stale + 10s sweep cadence)
[BE] Postgres: is_online=false for this mitra, audit row in mitra_online_logs with status='offline'
[BE] Valkey: SISMEMBER mitras:online <id> = 0, no mitra:heartbeat:<id> key

Sweep skips on Valkey error

[BE] Stop Valkey
[BE] Backend log shows [auto-offline] valkey unavailable, skipping this tick: each sweep cadence
[BE] No Postgres UPDATE fires during the outage (verify with query log) — confirms we don't mass-offline on Valkey hiccup

Section H — Heartbeat → Postgres batched mirror

Verifies the 60s UNNEST UPDATE.

[BE] Multiple mitras online, all heartbeating
[BE] Set HEARTBEAT_MIRROR_INTERVAL_SECONDS=15 for faster cycles, restart
[BE] Wait one mirror cycle — Postgres log shows one UPDATE statement (with UNNEST(...) containing all online mitra IDs)
[BE] SELECT last_heartbeat_at FROM mitra_online_status WHERE is_online=true returns timestamps within last cycle
[BE] Compare with Valkey GET mitra:heartbeat:<id> — Postgres lags Valkey by ≤ mirror-cadence seconds (forensic-grade, not real-time)

Mirror skips on Valkey error

[BE] Stop Valkey
[BE] Backend log shows [heartbeat-mirror] valkey unavailable, skipping: each cycle
[BE] Postgres last_heartbeat_at does NOT advance during the outage (correct — Valkey is the source of "when was last ping?")

Section I — Reconciliation sweep

Verifies drift heals every VALKEY_ONLINE_MIRROR_SWEEP_SECONDS.

[BE] Set VALKEY_ONLINE_MIRROR_SWEEP_SECONDS=30 for faster test, restart

[BE] Manually corrupt Valkey:

SADD mitras:online 00000000-0000-0000-0000-000000000999  # bogus ID
SREM mitras:online <real-online-mitra-id>                 # remove a real one
SET mitra:capacity:<real-id> 99                           # bogus capacity

[BE] Wait one sweep cycle (~30s) → log shows [valkey-mirror] seed: again
[BE] After sweep: Valkey state matches Postgres exactly (bogus ID gone, real ID present, capacity reset)

Sweep disabled when env=0

[BE] Set VALKEY_ONLINE_MIRROR_SWEEP_SECONDS=0, restart
[BE] Confirm no periodic seed log appears after the initial startup seed

Section J — Failure modes (Valkey degradation)

End-to-end behavior when Valkey is down for an extended period.

[BE] Stop Valkey
[C] Customer beacon poll → availability:snapshot GET fails → falls back to Postgres JOIN; UX unchanged but DB query rate spikes
[C] Customer triggers blast → findAvailableMitras Valkey path fails → falls back to Postgres JOIN; blast still works
[M] Mitra heartbeat → Valkey write fails (logged), but heartbeat returns 200; the missed write is irrelevant (auto-offline sweep is also skipping)
[M] Mitra toggle online → Postgres update succeeds, Valkey SADD fails (logged); on next reconciliation sweep after Valkey returns, mitra is back in mitras:online SET
[BE] Restart Valkey → reconnect listener fires → seedFromPostgres() runs → state restored; degraded period ends

Section K — Multi-instance (defer until Cloud Run)

Run only when ≥ 2 backend instances are active.

[BE] Two instances both run their own seed on startup — final Valkey state is consistent (idempotent DEL + SADD)
[BE] Concurrent setOnline calls on the same mitra from different instances → final SET state correct (atomic SADD)
[BE] availability:snapshot cache miss on instance A fills the snapshot; instance B's next poll reads the cached value (cluster-shared cache works)
[BE] Operator changes max_customers_per_mitra on one instance → config:invalidate pub/sub fires → other instance also DELs availability:snapshot
[BE] Heartbeat mirror UPDATEs from multiple instances are idempotent (last writer wins on timestamp, no errors)

Smoke tests (quick happy path)

5-minute sanity check after any deploy:

[BE] Backend log shows successful seed on startup
[M] Mitra toggles online → appears in SMEMBERS mitras:online
[C] Customer sees "Mulai Curhat" enabled
[C] Customer pays → mitra accepts → chat starts → mitra:capacity:<id> increments
[C] Chat ends → counter decrements
[M] Mitra toggles offline → removed from SET

Known limitations / what this checklist does NOT cover

Load testing — sustained heartbeat volume at prod scale (300+ mitras × 2 pings/min). Plan: separate load-test stage when prod is provisioned.
Memorystore-specific behavior — failover, RDB+AOF interaction. Plan: re-run Sections A, G, I, J against Memorystore Standard tier before prod cutover.
Long-running drift — overnight runs where eviction or memory-pressure could affect Valkey state. Plan: monitor INFO memory in prod for the first week.

15 KiB Raw Blame History Unescape Escape