Mitra-availability state (online flag, deactivated flag, per-mitra session count, heartbeat liveness) mirrored into Valkey so the customer beacon + pairing blast + dashboard counts no longer hit Postgres on the hot path. Postgres remains the durable source of truth; Valkey state is fully derivable via seedFromPostgres on startup + reconnect. Schema - mitras:online SET — mirror of is_online - mitras:deactivated SET — mirror of is_active=false - mitra:capacity:<id> STRING — active+pending_payment session count - mitra💓<id> STRING — ISO timestamp of last ping - availability:snapshot JSON — beacon cache, TTL 10s, cluster-shared Write paths (Postgres first, best-effort Valkey) - setOnline/setOffline mirror SADD/SREM + heartbeat SET/DEL - updateMitraStatus mirrors mitras:deactivated AND revokes auth_sessions on deactivate (bounds the "ghost online" window to access-token TTL) - heartbeat is Valkey-only on the hot path; the per-ping Postgres UPDATE on last_heartbeat_at is eliminated (was 1,200 ops/min at prod scale) - chat_session lifecycle (accept/end/reroute/extension/expiry) calls recomputeCapacityForMitra after each UPDATE — derive-from-truth avoids the bookkeeping risk of per-transition INCR/DECR Read paths (Valkey-first, Postgres fallback on Valkey error) - isMitraReachable: SISMEMBER mitras:online + heartbeat freshness - findAvailableMitras: SDIFF + pipelined GETs, filter by capacity + heartbeat - countAvailableMitrasFromCache: Valkey-driven, cached cluster-wide 10s TTL - dashboard online count: SCARD - Each reader wraps Valkey ops in try/catch → Postgres fallback on outage Heartbeat path on /api/mitra/status/heartbeat - resolveMitra preHandler replaced with heartbeatGuard: SISMEMBER on mitras:deactivated (~0 DB hits per ping). Falls back to full DB resolveMitra if Valkey is unreachable so a Valkey outage doesn't silently accept heartbeats from deactivated mitras. Three sweeps, env-configurable cadences - MITRA_AUTO_OFFLINE_SWEEP_SECONDS (30) — Valkey-driven stale detection - HEARTBEAT_MIRROR_INTERVAL_SECONDS (60) — batched UPSERT writes Valkey timestamps to Postgres last_heartbeat_at via UNNEST (1 statement per cycle, idempotent across instances) - VALKEY_ONLINE_MIRROR_SWEEP_SECONDS (300) — periodic reseed heals drift Startup - restoreActiveTimers → seedFromPostgres → bind listeners - onValkeyReady re-runs the seed on every reconnect (cold start + reseed on Valkey restart, no manual intervention) Failure semantics - Read fallback: every Valkey read wrapped, falls back to existing Postgres JOIN query — system stays correct during Valkey outage, performance degrades not breaks - Write best-effort: Postgres write commits before Valkey is touched; Valkey errors log + continue; reconciliation sweep heals drift - Auto-offline sweep aborts entirely on Valkey error (does NOT mass- offline via Postgres scan during Valkey hiccup) Tests - New: 32 integration tests in mitra-status.valkey-mirror.test.js covering seed, write-through, fallbacks, capacity lifecycle, auto-offline sweep, heartbeat mirror, deactivation flow, beacon cache - Updated: fixtures.js seeds Valkey alongside Postgres when isOnline=true - Updated: helpers/db.js resetDb also flushes test Valkey - Fixed 2 pre-existing session-timer flakes (string IDs failed uuid parse; vi.advanceTimersByTimeAsync raced real Postgres I/O) - All 124/124 backend tests pass (was 90/92) Docs - requirement/valkey-online-mirror-plan.md — canonical plan - requirement/valkey-online-mirror-testing.md — manual E2E checklist - requirement/deployment.md — infra + Valkey persistence guidance for prod (Memorystore Standard tier recommended; migration from self-hosted Valkey is zero-downtime via reseed-from-Postgres) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
15 KiB
Valkey Availability Mirror — Testing Checklist
End-to-end verification for the Valkey-mirror refactor described in valkey-online-mirror-plan.md.
Cluster labels: [BE] backend / curl / SQL / Valkey-cli, [CC] control_center, [M] mitra_app, [C] client_app.
Run order: Section A first (seed + startup) — every subsequent section assumes a fresh seed. Sections B–J are otherwise independent.
Setup
- Backend running on
192.168.88.247:3000(public) +:3001(internal) —curl http://192.168.88.247:3000/api/shared/auth-providersreturns 200 - Valkey reachable from backend (
VALKEY_URLmatches running instance;[valkey] subscribed to config:invalidateappears in backend boot log) - Postgres reachable; backend boot log shows
[valkey-mirror] seed: X online, Y deactivated, Z with active sessions - At least 3 mitra accounts exist in DB (we need to flip them online/offline/deactivated across tests)
- One customer account ready for the blast scenarios
- Helpful aliases for verification (run from
backend/):alias vk='node --env-file=.env -e' # Then in tests: vk "(async()=>{const v=(await import('./src/plugins/valkey.js')).getValkeyClient(); # console.log(await v.smembers('mitras:online'));process.exit(0)})()"
Section A — Seed + Startup
Verifies seedFromPostgres() populates Valkey correctly from Postgres truth.
- [BE] Restart backend; log shows one
[valkey-mirror] seed: N online, M deactivated, K with active sessionsline on startup - [BE] Counts in the log match Postgres truth:
SELECT (SELECT COUNT(*) FROM mitra_online_status WHERE is_online=true) AS online, (SELECT COUNT(*) FROM mitras WHERE is_active=false) AS deactivated, (SELECT COUNT(DISTINCT mitra_id) FROM chat_sessions WHERE mitra_id IS NOT NULL AND status IN ('active','pending_payment')) AS with_sessions; - [BE] Valkey contents match —
SMEMBERS mitras:onlinereturns the same IDs asSELECT mitra_id FROM mitra_online_status WHERE is_online=true - [BE] Valkey
mitra:heartbeat:<id>exists for every currently-online mitra, value is a recent ISO timestamp (within seed time) - [BE] Valkey
mitra:capacity:<id>matchesSELECT COUNT(*) FROM chat_sessions WHERE mitra_id=<id> AND status IN ('active','pending_payment')for every online mitra - [BE]
SMEMBERS mitras:deactivatedmatchesSELECT id FROM mitras WHERE is_active=false
Reconnect re-seed
- [BE] With backend running,
FLUSHDBon Valkey, wait ~5s for ioredis reconnect, verify a second[valkey-mirror] seed:log entry appears - [BE] All four Valkey structures are rebuilt and match Postgres again
Section B — Online / Offline toggle (write-through)
Verifies setOnline / setOffline writes both stores in the right order.
- [M] Mitra taps "online" → backend updates Postgres
mitra_online_status.is_online=true - [BE]
SISMEMBER mitras:online <mitra-id>returns1within 1s of the toggle - [BE]
GET mitra:heartbeat:<mitra-id>returns a fresh ISO timestamp (within seconds of the toggle) - [M] Mitra taps "offline"
- [BE] Postgres
is_online=false, ValkeySISMEMBER mitras:online <id>returns0,GET mitra:heartbeat:<id>returnsnil - [BE]
mitra_online_logshas pairedonline/offlineaudit rows
Valkey-failure best-effort write
- [BE] Stop Valkey, then toggle a mitra online → request succeeds (200), backend log shows
[valkey-mirror] setOnline <id> failed:but Postgres is updated correctly - [BE] Restart Valkey → reconciliation sweep (≤ 300s default) eventually rebuilds the SET to include this mitra
Section C — Heartbeat path
Verifies the rewrite: per-ping = 1 Valkey write, 0 DB writes.
- [M] Mitra online for at least one heartbeat cycle (~30s)
- [BE] Watch Postgres query log during heartbeat — no
UPDATE mitra_online_status SET last_heartbeat_atrows fire on every ping (only the batched mirror, default every 60s) - [BE]
GET mitra:heartbeat:<id>value advances on each ping (re-check ~30s later, timestamp moves forward) - [BE] After 60s+ wait,
SELECT last_heartbeat_at FROM mitra_online_status WHERE mitra_id=<id>advances (heartbeat mirror sweep ran)
Deactivation guard via Valkey
- [CC] Admin deactivates the mitra (Phase 5 path:
is_active=false) - [BE]
SISMEMBER mitras:deactivated <id>immediately returns1 - [M] Mitra app's next heartbeat → backend returns
403 ACCOUNT_INACTIVE - [BE] No Postgres SELECT on
mitrastable for that heartbeat (verify with query log) — guard is pure Valkey
Fallback when Valkey is down
- [BE] Stop Valkey
- [M] Mitra app heartbeats → backend logs
[heartbeat] valkey check failed, falling back to DB, request still succeeds for active mitra, returns403for deactivated mitra (full DB-backed resolveMitra path) - [BE] Restart Valkey → next heartbeat uses Valkey path again (no fallback log line)
Section D — Capacity counter
Verifies INCR/DECR across session lifecycle via recomputeCapacityForMitra.
- [BE] Reset: pick a mitra with
mitra:capacity:<id> = 0 - [C] Customer pays → mitra accepts the blast → chat starts
- [BE]
GET mitra:capacity:<id>returns1within 1s of mitra accepting - [C] Second customer pays → same mitra accepts (assuming
max_customers_per_mitra >= 2) - [BE]
GET mitra:capacity:<id>returns2 - [C] First session ends naturally (timer expires + goodbye flow completes)
- [BE]
GET mitra:capacity:<id>returns1within 1s - [C] Second session ends
- [BE]
GET mitra:capacity:<id>returns0
Reroute
- [CC] Reroute an active session from mitra A → mitra B
- [BE]
mitra:capacity:Adecrements,mitra:capacity:Bincrements — both atomic with the chat_sessions UPDATE
Capacity gates blast
- [BE] Set
mitra:capacity:<id>tomax_customers_per_mitradirectly (SET mitra:capacity:<id> 3) - [C] Customer pays → blast → this mitra is excluded (verify with
chat_request_notifications— no row for this mitra)
Section E — Deactivation flow (full propagation)
Verifies updateMitraStatus + revokeAllSessionsForUser close the deactivation gap.
- [BE] Mitra has an active access token (capture from
/api/mitra/auth/otp/verifyor use existing logged-in session) - [BE] Confirm mitra can call protected routes (
curl -H "Authorization: Bearer <token>" /api/mitra/...) - [CC] Admin deactivates the mitra
- [BE] Postgres:
mitras.is_active=false,auth_sessions.revoked_at IS NOT NULLfor this mitra - [BE] Valkey:
SISMEMBER mitras:deactivated <id>= 1 - [BE] Mitra's current access token still works on routes that don't re-check active state (stateless JWT) — bounded by access-token TTL
- [BE] Mitra's heartbeat returns
403 ACCOUNT_INACTIVEimmediately (Valkey check on hot path) - [BE] Mitra's next refresh token attempt fails (because
auth_sessions.revoked_atwas set) → app effectively logs them out
Re-activation
- [CC] Re-activate the mitra
- [BE] Postgres
is_active=true, ValkeySISMEMBER mitras:deactivated <id>= 0 - [M] Mitra re-logs in, heartbeats again successfully
Section F — Read paths (Valkey-first)
Verifies all reads use Valkey with Postgres fallback.
isMitraReachable
- [BE] Online mitra with fresh heartbeat →
isMitraReachablereturnstrue(call vianode -eor any route that uses it, e.g. extension flow) - [BE] Manually set
mitra:heartbeat:<id>to a timestamp older thanstale_after_seconds→isMitraReachablereturnsfalse(even thoughis_online=truein Postgres) - [BE] Stop Valkey →
isMitraReachablelogs[isMitraReachable] valkey unavailable, falling back to DBand returns based on Postgresis_online
findAvailableMitras (blast)
- [BE] Run
findAvailableMitras(e.g. trigger a customer blast) — log shows Valkey path used (no warning about fallback) - [BE] Result IDs match
SDIFF mitras:online mitras:deactivatedfiltered by capacity + heartbeat freshness - [BE] Stop Valkey → next blast logs
[findAvailableMitras] valkey unavailable, falling back to Postgresand still returns correct results
countAvailableMitrasFromCache (customer beacon)
- [BE]
curl /public/bestie-availabilityreturns{available: bool, count: N}matching reality - [BE]
GET availability:snapshotin Valkey shows cached JSON within 10s of last poll - [BE] Multiple rapid polls (5+ per second from 3 different IPs) → only one Valkey-driven recompute per 10s; Postgres query log shows zero mitra-availability JOINs in steady state (only the once-per-10s cache miss)
- [CC] Operator changes
max_customers_per_mitra→availability:snapshotisDELd (cache bust), next poll recomputes
Dashboard online count
- [CC] Dashboard "Online Mitras" stat matches
SCARD mitras:onlinein Valkey - [BE] Verify the dashboard query no longer hits
SELECT COUNT(*) FROM mitra_online_status WHERE is_online=true(check query log)
Section G — Auto-offline sweep (Valkey-driven)
Verifies stale heartbeats → flipped offline via Valkey diff.
- [BE] Set
MITRA_AUTO_OFFLINE_SWEEP_SECONDS=10in env for faster test cycle, restart backend - [CC] Set
stale_after_seconds=15in CC settings (or directly inapp_config) - [M] Mitra goes online, sends one heartbeat
- [BE] Manually delete the heartbeat key:
DEL mitra:heartbeat:<id> - [BE] Wait up to 25s (15s stale + 10s sweep cadence)
- [BE] Postgres:
is_online=falsefor this mitra, audit row inmitra_online_logswith status='offline' - [BE] Valkey:
SISMEMBER mitras:online <id>= 0, nomitra:heartbeat:<id>key
Sweep skips on Valkey error
- [BE] Stop Valkey
- [BE] Backend log shows
[auto-offline] valkey unavailable, skipping this tick:each sweep cadence - [BE] No Postgres UPDATE fires during the outage (verify with query log) — confirms we don't mass-offline on Valkey hiccup
Section H — Heartbeat → Postgres batched mirror
Verifies the 60s UNNEST UPDATE.
- [BE] Multiple mitras online, all heartbeating
- [BE] Set
HEARTBEAT_MIRROR_INTERVAL_SECONDS=15for faster cycles, restart - [BE] Wait one mirror cycle — Postgres log shows one UPDATE statement (with
UNNEST(...)containing all online mitra IDs) - [BE]
SELECT last_heartbeat_at FROM mitra_online_status WHERE is_online=truereturns timestamps within last cycle - [BE] Compare with Valkey
GET mitra:heartbeat:<id>— Postgres lags Valkey by ≤ mirror-cadence seconds (forensic-grade, not real-time)
Mirror skips on Valkey error
- [BE] Stop Valkey
- [BE] Backend log shows
[heartbeat-mirror] valkey unavailable, skipping:each cycle - [BE] Postgres
last_heartbeat_atdoes NOT advance during the outage (correct — Valkey is the source of "when was last ping?")
Section I — Reconciliation sweep
Verifies drift heals every VALKEY_ONLINE_MIRROR_SWEEP_SECONDS.
- [BE] Set
VALKEY_ONLINE_MIRROR_SWEEP_SECONDS=30for faster test, restart - [BE] Manually corrupt Valkey:
SADD mitras:online 00000000-0000-0000-0000-000000000999 # bogus ID SREM mitras:online <real-online-mitra-id> # remove a real one SET mitra:capacity:<real-id> 99 # bogus capacity - [BE] Wait one sweep cycle (~30s) → log shows
[valkey-mirror] seed:again - [BE] After sweep: Valkey state matches Postgres exactly (bogus ID gone, real ID present, capacity reset)
Sweep disabled when env=0
- [BE] Set
VALKEY_ONLINE_MIRROR_SWEEP_SECONDS=0, restart - [BE] Confirm no periodic seed log appears after the initial startup seed
Section J — Failure modes (Valkey degradation)
End-to-end behavior when Valkey is down for an extended period.
- [BE] Stop Valkey
- [C] Customer beacon poll →
availability:snapshotGET fails → falls back to Postgres JOIN; UX unchanged but DB query rate spikes - [C] Customer triggers blast →
findAvailableMitrasValkey path fails → falls back to Postgres JOIN; blast still works - [M] Mitra heartbeat → Valkey write fails (logged), but heartbeat returns 200; the missed write is irrelevant (auto-offline sweep is also skipping)
- [M] Mitra toggle online → Postgres update succeeds, Valkey SADD fails (logged); on next reconciliation sweep after Valkey returns, mitra is back in
mitras:onlineSET - [BE] Restart Valkey → reconnect listener fires →
seedFromPostgres()runs → state restored; degraded period ends
Section K — Multi-instance (defer until Cloud Run)
Run only when ≥ 2 backend instances are active.
- [BE] Two instances both run their own seed on startup — final Valkey state is consistent (idempotent
DEL + SADD) - [BE] Concurrent setOnline calls on the same mitra from different instances → final SET state correct (atomic SADD)
- [BE]
availability:snapshotcache miss on instance A fills the snapshot; instance B's next poll reads the cached value (cluster-shared cache works) - [BE] Operator changes
max_customers_per_mitraon one instance →config:invalidatepub/sub fires → other instance also DELsavailability:snapshot - [BE] Heartbeat mirror UPDATEs from multiple instances are idempotent (last writer wins on timestamp, no errors)
Smoke tests (quick happy path)
5-minute sanity check after any deploy:
- [BE] Backend log shows successful seed on startup
- [M] Mitra toggles online → appears in
SMEMBERS mitras:online - [C] Customer sees "Mulai Curhat" enabled
- [C] Customer pays → mitra accepts → chat starts →
mitra:capacity:<id>increments - [C] Chat ends → counter decrements
- [M] Mitra toggles offline → removed from SET
Known limitations / what this checklist does NOT cover
- Load testing — sustained heartbeat volume at prod scale (300+ mitras × 2 pings/min). Plan: separate load-test stage when prod is provisioned.
- Memorystore-specific behavior — failover, RDB+AOF interaction. Plan: re-run Sections A, G, I, J against Memorystore Standard tier before prod cutover.
- Long-running drift — overnight runs where eviction or memory-pressure could affect Valkey state. Plan: monitor
INFO memoryin prod for the first week.