# Valkey Availability Mirror — Testing Checklist End-to-end verification for the Valkey-mirror refactor described in [valkey-online-mirror-plan.md](valkey-online-mirror-plan.md). Cluster labels: **[BE]** backend / curl / SQL / Valkey-cli, **[CC]** control_center, **[M]** mitra_app, **[C]** client_app. > **Run order:** Section A first (seed + startup) — every subsequent section assumes a fresh seed. Sections B–J are otherwise independent. --- ## Setup - [ ] Backend running on `192.168.88.247:3000` (public) + `:3001` (internal) — `curl http://192.168.88.247:3000/api/shared/auth-providers` returns 200 - [ ] Valkey reachable from backend (`VALKEY_URL` matches running instance; `[valkey] subscribed to config:invalidate` appears in backend boot log) - [ ] Postgres reachable; backend boot log shows `[valkey-mirror] seed: X online, Y deactivated, Z with active sessions` - [ ] At least 3 mitra accounts exist in DB (we need to flip them online/offline/deactivated across tests) - [ ] One customer account ready for the blast scenarios - [ ] Helpful aliases for verification (run from `backend/`): ```bash alias vk='node --env-file=.env -e' # Then in tests: vk "(async()=>{const v=(await import('./src/plugins/valkey.js')).getValkeyClient(); # console.log(await v.smembers('mitras:online'));process.exit(0)})()" ``` --- ## Section A — Seed + Startup Verifies `seedFromPostgres()` populates Valkey correctly from Postgres truth. - [ ] **[BE]** Restart backend; log shows one `[valkey-mirror] seed: N online, M deactivated, K with active sessions` line on startup - [ ] **[BE]** Counts in the log match Postgres truth: ```sql SELECT (SELECT COUNT(*) FROM mitra_online_status WHERE is_online=true) AS online, (SELECT COUNT(*) FROM mitras WHERE is_active=false) AS deactivated, (SELECT COUNT(DISTINCT mitra_id) FROM chat_sessions WHERE mitra_id IS NOT NULL AND status IN ('active','pending_payment')) AS with_sessions; ``` - [ ] **[BE]** Valkey contents match — `SMEMBERS mitras:online` returns the same IDs as `SELECT mitra_id FROM mitra_online_status WHERE is_online=true` - [ ] **[BE]** Valkey `mitra:heartbeat:` exists for every currently-online mitra, value is a recent ISO timestamp (within seed time) - [ ] **[BE]** Valkey `mitra:capacity:` matches `SELECT COUNT(*) FROM chat_sessions WHERE mitra_id= AND status IN ('active','pending_payment')` for every online mitra - [ ] **[BE]** `SMEMBERS mitras:deactivated` matches `SELECT id FROM mitras WHERE is_active=false` ### Reconnect re-seed - [ ] **[BE]** With backend running, `FLUSHDB` on Valkey, wait ~5s for ioredis reconnect, verify a second `[valkey-mirror] seed:` log entry appears - [ ] **[BE]** All four Valkey structures are rebuilt and match Postgres again --- ## Section B — Online / Offline toggle (write-through) Verifies `setOnline` / `setOffline` writes both stores in the right order. - [ ] **[M]** Mitra taps "online" → backend updates Postgres `mitra_online_status.is_online=true` - [ ] **[BE]** `SISMEMBER mitras:online ` returns `1` within 1s of the toggle - [ ] **[BE]** `GET mitra:heartbeat:` returns a fresh ISO timestamp (within seconds of the toggle) - [ ] **[M]** Mitra taps "offline" - [ ] **[BE]** Postgres `is_online=false`, Valkey `SISMEMBER mitras:online ` returns `0`, `GET mitra:heartbeat:` returns `nil` - [ ] **[BE]** `mitra_online_logs` has paired `online` / `offline` audit rows ### Valkey-failure best-effort write - [ ] **[BE]** Stop Valkey, then toggle a mitra online → request succeeds (200), backend log shows `[valkey-mirror] setOnline failed:` but Postgres is updated correctly - [ ] **[BE]** Restart Valkey → reconciliation sweep (≤ 300s default) eventually rebuilds the SET to include this mitra --- ## Section C — Heartbeat path Verifies the rewrite: per-ping = 1 Valkey write, 0 DB writes. - [ ] **[M]** Mitra online for at least one heartbeat cycle (~30s) - [ ] **[BE]** Watch Postgres query log during heartbeat — **no `UPDATE mitra_online_status SET last_heartbeat_at` rows fire on every ping** (only the batched mirror, default every 60s) - [ ] **[BE]** `GET mitra:heartbeat:` value advances on each ping (re-check ~30s later, timestamp moves forward) - [ ] **[BE]** After 60s+ wait, `SELECT last_heartbeat_at FROM mitra_online_status WHERE mitra_id=` advances (heartbeat mirror sweep ran) ### Deactivation guard via Valkey - [ ] **[CC]** Admin deactivates the mitra (Phase 5 path: `is_active=false`) - [ ] **[BE]** `SISMEMBER mitras:deactivated ` immediately returns `1` - [ ] **[M]** Mitra app's next heartbeat → backend returns `403 ACCOUNT_INACTIVE` - [ ] **[BE]** No Postgres SELECT on `mitras` table for that heartbeat (verify with query log) — guard is pure Valkey ### Fallback when Valkey is down - [ ] **[BE]** Stop Valkey - [ ] **[M]** Mitra app heartbeats → backend logs `[heartbeat] valkey check failed, falling back to DB`, request still succeeds for active mitra, returns `403` for deactivated mitra (full DB-backed resolveMitra path) - [ ] **[BE]** Restart Valkey → next heartbeat uses Valkey path again (no fallback log line) --- ## Section D — Capacity counter Verifies INCR/DECR across session lifecycle via `recomputeCapacityForMitra`. - [ ] **[BE]** Reset: pick a mitra with `mitra:capacity: = 0` - [ ] **[C]** Customer pays → mitra accepts the blast → chat starts - [ ] **[BE]** `GET mitra:capacity:` returns `1` within 1s of mitra accepting - [ ] **[C]** Second customer pays → same mitra accepts (assuming `max_customers_per_mitra >= 2`) - [ ] **[BE]** `GET mitra:capacity:` returns `2` - [ ] **[C]** First session ends naturally (timer expires + goodbye flow completes) - [ ] **[BE]** `GET mitra:capacity:` returns `1` within 1s - [ ] **[C]** Second session ends - [ ] **[BE]** `GET mitra:capacity:` returns `0` ### Reroute - [ ] **[CC]** Reroute an active session from mitra A → mitra B - [ ] **[BE]** `mitra:capacity:A` decrements, `mitra:capacity:B` increments — both atomic with the chat_sessions UPDATE ### Capacity gates blast - [ ] **[BE]** Set `mitra:capacity:` to `max_customers_per_mitra` directly (`SET mitra:capacity: 3`) - [ ] **[C]** Customer pays → blast → **this mitra is excluded** (verify with `chat_request_notifications` — no row for this mitra) --- ## Section E — Deactivation flow (full propagation) Verifies `updateMitraStatus` + `revokeAllSessionsForUser` close the deactivation gap. - [ ] **[BE]** Mitra has an active access token (capture from `/api/mitra/auth/otp/verify` or use existing logged-in session) - [ ] **[BE]** Confirm mitra can call protected routes (`curl -H "Authorization: Bearer " /api/mitra/...`) - [ ] **[CC]** Admin deactivates the mitra - [ ] **[BE]** Postgres: `mitras.is_active=false`, `auth_sessions.revoked_at IS NOT NULL` for this mitra - [ ] **[BE]** Valkey: `SISMEMBER mitras:deactivated ` = 1 - [ ] **[BE]** Mitra's current access token still works on routes that don't re-check active state (stateless JWT) — bounded by access-token TTL - [ ] **[BE]** Mitra's heartbeat returns `403 ACCOUNT_INACTIVE` immediately (Valkey check on hot path) - [ ] **[BE]** Mitra's next refresh token attempt fails (because `auth_sessions.revoked_at` was set) → app effectively logs them out ### Re-activation - [ ] **[CC]** Re-activate the mitra - [ ] **[BE]** Postgres `is_active=true`, Valkey `SISMEMBER mitras:deactivated ` = 0 - [ ] **[M]** Mitra re-logs in, heartbeats again successfully --- ## Section F — Read paths (Valkey-first) Verifies all reads use Valkey with Postgres fallback. ### `isMitraReachable` - [ ] **[BE]** Online mitra with fresh heartbeat → `isMitraReachable` returns `true` (call via `node -e` or any route that uses it, e.g. extension flow) - [ ] **[BE]** Manually set `mitra:heartbeat:` to a timestamp older than `stale_after_seconds` → `isMitraReachable` returns `false` (even though `is_online=true` in Postgres) - [ ] **[BE]** Stop Valkey → `isMitraReachable` logs `[isMitraReachable] valkey unavailable, falling back to DB` and returns based on Postgres `is_online` ### `findAvailableMitras` (blast) - [ ] **[BE]** Run `findAvailableMitras` (e.g. trigger a customer blast) — log shows Valkey path used (no warning about fallback) - [ ] **[BE]** Result IDs match `SDIFF mitras:online mitras:deactivated` filtered by capacity + heartbeat freshness - [ ] **[BE]** Stop Valkey → next blast logs `[findAvailableMitras] valkey unavailable, falling back to Postgres` and still returns correct results ### `countAvailableMitrasFromCache` (customer beacon) - [ ] **[BE]** `curl /public/bestie-availability` returns `{available: bool, count: N}` matching reality - [ ] **[BE]** `GET availability:snapshot` in Valkey shows cached JSON within 10s of last poll - [ ] **[BE]** Multiple rapid polls (5+ per second from 3 different IPs) → only one Valkey-driven recompute per 10s; Postgres query log shows **zero** mitra-availability JOINs in steady state (only the once-per-10s cache miss) - [ ] **[CC]** Operator changes `max_customers_per_mitra` → `availability:snapshot` is `DEL`d (cache bust), next poll recomputes ### Dashboard online count - [ ] **[CC]** Dashboard "Online Mitras" stat matches `SCARD mitras:online` in Valkey - [ ] **[BE]** Verify the dashboard query no longer hits `SELECT COUNT(*) FROM mitra_online_status WHERE is_online=true` (check query log) --- ## Section G — Auto-offline sweep (Valkey-driven) Verifies stale heartbeats → flipped offline via Valkey diff. - [ ] **[BE]** Set `MITRA_AUTO_OFFLINE_SWEEP_SECONDS=10` in env for faster test cycle, restart backend - [ ] **[CC]** Set `stale_after_seconds=15` in CC settings (or directly in `app_config`) - [ ] **[M]** Mitra goes online, sends one heartbeat - [ ] **[BE]** Manually delete the heartbeat key: `DEL mitra:heartbeat:` - [ ] **[BE]** Wait up to 25s (15s stale + 10s sweep cadence) - [ ] **[BE]** Postgres: `is_online=false` for this mitra, audit row in `mitra_online_logs` with status='offline' - [ ] **[BE]** Valkey: `SISMEMBER mitras:online ` = 0, no `mitra:heartbeat:` key ### Sweep skips on Valkey error - [ ] **[BE]** Stop Valkey - [ ] **[BE]** Backend log shows `[auto-offline] valkey unavailable, skipping this tick:` each sweep cadence - [ ] **[BE]** **No Postgres UPDATE fires** during the outage (verify with query log) — confirms we don't mass-offline on Valkey hiccup --- ## Section H — Heartbeat → Postgres batched mirror Verifies the 60s UNNEST UPDATE. - [ ] **[BE]** Multiple mitras online, all heartbeating - [ ] **[BE]** Set `HEARTBEAT_MIRROR_INTERVAL_SECONDS=15` for faster cycles, restart - [ ] **[BE]** Wait one mirror cycle — Postgres log shows **one** UPDATE statement (with `UNNEST(...)` containing all online mitra IDs) - [ ] **[BE]** `SELECT last_heartbeat_at FROM mitra_online_status WHERE is_online=true` returns timestamps within last cycle - [ ] **[BE]** Compare with Valkey `GET mitra:heartbeat:` — Postgres lags Valkey by ≤ mirror-cadence seconds (forensic-grade, not real-time) ### Mirror skips on Valkey error - [ ] **[BE]** Stop Valkey - [ ] **[BE]** Backend log shows `[heartbeat-mirror] valkey unavailable, skipping:` each cycle - [ ] **[BE]** Postgres `last_heartbeat_at` does NOT advance during the outage (correct — Valkey is the source of "when was last ping?") --- ## Section I — Reconciliation sweep Verifies drift heals every `VALKEY_ONLINE_MIRROR_SWEEP_SECONDS`. - [ ] **[BE]** Set `VALKEY_ONLINE_MIRROR_SWEEP_SECONDS=30` for faster test, restart - [ ] **[BE]** Manually corrupt Valkey: ``` SADD mitras:online 00000000-0000-0000-0000-000000000999 # bogus ID SREM mitras:online # remove a real one SET mitra:capacity: 99 # bogus capacity ``` - [ ] **[BE]** Wait one sweep cycle (~30s) → log shows `[valkey-mirror] seed:` again - [ ] **[BE]** After sweep: Valkey state matches Postgres exactly (bogus ID gone, real ID present, capacity reset) ### Sweep disabled when env=0 - [ ] **[BE]** Set `VALKEY_ONLINE_MIRROR_SWEEP_SECONDS=0`, restart - [ ] **[BE]** Confirm no periodic seed log appears after the initial startup seed --- ## Section J — Failure modes (Valkey degradation) End-to-end behavior when Valkey is down for an extended period. - [ ] **[BE]** Stop Valkey - [ ] **[C]** Customer beacon poll → `availability:snapshot` GET fails → falls back to Postgres JOIN; UX unchanged but DB query rate spikes - [ ] **[C]** Customer triggers blast → `findAvailableMitras` Valkey path fails → falls back to Postgres JOIN; blast still works - [ ] **[M]** Mitra heartbeat → Valkey write fails (logged), but heartbeat returns 200; the missed write is irrelevant (auto-offline sweep is also skipping) - [ ] **[M]** Mitra toggle online → Postgres update succeeds, Valkey SADD fails (logged); on next reconciliation sweep after Valkey returns, mitra is back in `mitras:online` SET - [ ] **[BE]** Restart Valkey → reconnect listener fires → `seedFromPostgres()` runs → state restored; degraded period ends --- ## Section K — Multi-instance (defer until Cloud Run) Run only when ≥ 2 backend instances are active. - [ ] **[BE]** Two instances both run their own seed on startup — final Valkey state is consistent (idempotent `DEL + SADD`) - [ ] **[BE]** Concurrent setOnline calls on the same mitra from different instances → final SET state correct (atomic SADD) - [ ] **[BE]** `availability:snapshot` cache miss on instance A fills the snapshot; instance B's next poll reads the cached value (cluster-shared cache works) - [ ] **[BE]** Operator changes `max_customers_per_mitra` on one instance → `config:invalidate` pub/sub fires → other instance also DELs `availability:snapshot` - [ ] **[BE]** Heartbeat mirror UPDATEs from multiple instances are idempotent (last writer wins on timestamp, no errors) --- ## Smoke tests (quick happy path) 5-minute sanity check after any deploy: - [ ] **[BE]** Backend log shows successful seed on startup - [ ] **[M]** Mitra toggles online → appears in `SMEMBERS mitras:online` - [ ] **[C]** Customer sees "Mulai Curhat" enabled - [ ] **[C]** Customer pays → mitra accepts → chat starts → `mitra:capacity:` increments - [ ] **[C]** Chat ends → counter decrements - [ ] **[M]** Mitra toggles offline → removed from SET --- ## Known limitations / what this checklist does NOT cover - **Load testing** — sustained heartbeat volume at prod scale (300+ mitras × 2 pings/min). Plan: separate load-test stage when prod is provisioned. - **Memorystore-specific behavior** — failover, RDB+AOF interaction. Plan: re-run Sections A, G, I, J against Memorystore Standard tier before prod cutover. - **Long-running drift** — overnight runs where eviction or memory-pressure could affect Valkey state. Plan: monitor `INFO memory` in prod for the first week.