Files
halobestie-clone/requirement/valkey-online-mirror-testing.md
Ramadhan Sjamsani 553dbac52f Phase 6: Valkey availability mirror — move read path off Postgres
Mitra-availability state (online flag, deactivated flag, per-mitra session
count, heartbeat liveness) mirrored into Valkey so the customer beacon
+ pairing blast + dashboard counts no longer hit Postgres on the hot path.
Postgres remains the durable source of truth; Valkey state is fully
derivable via seedFromPostgres on startup + reconnect.

Schema
- mitras:online           SET    — mirror of is_online
- mitras:deactivated      SET    — mirror of is_active=false
- mitra:capacity:<id>     STRING — active+pending_payment session count
- mitra💓<id>    STRING — ISO timestamp of last ping
- availability:snapshot   JSON   — beacon cache, TTL 10s, cluster-shared

Write paths (Postgres first, best-effort Valkey)
- setOnline/setOffline mirror SADD/SREM + heartbeat SET/DEL
- updateMitraStatus mirrors mitras:deactivated AND revokes auth_sessions
  on deactivate (bounds the "ghost online" window to access-token TTL)
- heartbeat is Valkey-only on the hot path; the per-ping Postgres UPDATE
  on last_heartbeat_at is eliminated (was 1,200 ops/min at prod scale)
- chat_session lifecycle (accept/end/reroute/extension/expiry) calls
  recomputeCapacityForMitra after each UPDATE — derive-from-truth avoids
  the bookkeeping risk of per-transition INCR/DECR

Read paths (Valkey-first, Postgres fallback on Valkey error)
- isMitraReachable: SISMEMBER mitras:online + heartbeat freshness
- findAvailableMitras: SDIFF + pipelined GETs, filter by capacity + heartbeat
- countAvailableMitrasFromCache: Valkey-driven, cached cluster-wide 10s TTL
- dashboard online count: SCARD
- Each reader wraps Valkey ops in try/catch → Postgres fallback on outage

Heartbeat path on /api/mitra/status/heartbeat
- resolveMitra preHandler replaced with heartbeatGuard: SISMEMBER on
  mitras:deactivated (~0 DB hits per ping). Falls back to full DB
  resolveMitra if Valkey is unreachable so a Valkey outage doesn't
  silently accept heartbeats from deactivated mitras.

Three sweeps, env-configurable cadences
- MITRA_AUTO_OFFLINE_SWEEP_SECONDS (30) — Valkey-driven stale detection
- HEARTBEAT_MIRROR_INTERVAL_SECONDS (60) — batched UPSERT writes
  Valkey timestamps to Postgres last_heartbeat_at via UNNEST (1 statement
  per cycle, idempotent across instances)
- VALKEY_ONLINE_MIRROR_SWEEP_SECONDS (300) — periodic reseed heals drift

Startup
- restoreActiveTimers → seedFromPostgres → bind listeners
- onValkeyReady re-runs the seed on every reconnect (cold start + reseed
  on Valkey restart, no manual intervention)

Failure semantics
- Read fallback: every Valkey read wrapped, falls back to existing
  Postgres JOIN query — system stays correct during Valkey outage,
  performance degrades not breaks
- Write best-effort: Postgres write commits before Valkey is touched;
  Valkey errors log + continue; reconciliation sweep heals drift
- Auto-offline sweep aborts entirely on Valkey error (does NOT mass-
  offline via Postgres scan during Valkey hiccup)

Tests
- New: 32 integration tests in mitra-status.valkey-mirror.test.js
  covering seed, write-through, fallbacks, capacity lifecycle,
  auto-offline sweep, heartbeat mirror, deactivation flow, beacon cache
- Updated: fixtures.js seeds Valkey alongside Postgres when isOnline=true
- Updated: helpers/db.js resetDb also flushes test Valkey
- Fixed 2 pre-existing session-timer flakes (string IDs failed uuid
  parse; vi.advanceTimersByTimeAsync raced real Postgres I/O)
- All 124/124 backend tests pass (was 90/92)

Docs
- requirement/valkey-online-mirror-plan.md — canonical plan
- requirement/valkey-online-mirror-testing.md — manual E2E checklist
- requirement/deployment.md — infra + Valkey persistence guidance for
  prod (Memorystore Standard tier recommended; migration from
  self-hosted Valkey is zero-downtime via reseed-from-Postgres)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 18:07:55 +08:00

273 lines
15 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Valkey Availability Mirror — Testing Checklist
End-to-end verification for the Valkey-mirror refactor described in [valkey-online-mirror-plan.md](valkey-online-mirror-plan.md).
Cluster labels: **[BE]** backend / curl / SQL / Valkey-cli, **[CC]** control_center, **[M]** mitra_app, **[C]** client_app.
> **Run order:** Section A first (seed + startup) — every subsequent section assumes a fresh seed. Sections BJ are otherwise independent.
---
## Setup
- [ ] Backend running on `192.168.88.247:3000` (public) + `:3001` (internal) — `curl http://192.168.88.247:3000/api/shared/auth-providers` returns 200
- [ ] Valkey reachable from backend (`VALKEY_URL` matches running instance; `[valkey] subscribed to config:invalidate` appears in backend boot log)
- [ ] Postgres reachable; backend boot log shows `[valkey-mirror] seed: X online, Y deactivated, Z with active sessions`
- [ ] At least 3 mitra accounts exist in DB (we need to flip them online/offline/deactivated across tests)
- [ ] One customer account ready for the blast scenarios
- [ ] Helpful aliases for verification (run from `backend/`):
```bash
alias vk='node --env-file=.env -e'
# Then in tests: vk "(async()=>{const v=(await import('./src/plugins/valkey.js')).getValkeyClient();
# console.log(await v.smembers('mitras:online'));process.exit(0)})()"
```
---
## Section A — Seed + Startup
Verifies `seedFromPostgres()` populates Valkey correctly from Postgres truth.
- [ ] **[BE]** Restart backend; log shows one `[valkey-mirror] seed: N online, M deactivated, K with active sessions` line on startup
- [ ] **[BE]** Counts in the log match Postgres truth:
```sql
SELECT
(SELECT COUNT(*) FROM mitra_online_status WHERE is_online=true) AS online,
(SELECT COUNT(*) FROM mitras WHERE is_active=false) AS deactivated,
(SELECT COUNT(DISTINCT mitra_id) FROM chat_sessions
WHERE mitra_id IS NOT NULL AND status IN ('active','pending_payment')) AS with_sessions;
```
- [ ] **[BE]** Valkey contents match — `SMEMBERS mitras:online` returns the same IDs as `SELECT mitra_id FROM mitra_online_status WHERE is_online=true`
- [ ] **[BE]** Valkey `mitra:heartbeat:<id>` exists for every currently-online mitra, value is a recent ISO timestamp (within seed time)
- [ ] **[BE]** Valkey `mitra:capacity:<id>` matches `SELECT COUNT(*) FROM chat_sessions WHERE mitra_id=<id> AND status IN ('active','pending_payment')` for every online mitra
- [ ] **[BE]** `SMEMBERS mitras:deactivated` matches `SELECT id FROM mitras WHERE is_active=false`
### Reconnect re-seed
- [ ] **[BE]** With backend running, `FLUSHDB` on Valkey, wait ~5s for ioredis reconnect, verify a second `[valkey-mirror] seed:` log entry appears
- [ ] **[BE]** All four Valkey structures are rebuilt and match Postgres again
---
## Section B — Online / Offline toggle (write-through)
Verifies `setOnline` / `setOffline` writes both stores in the right order.
- [ ] **[M]** Mitra taps "online" → backend updates Postgres `mitra_online_status.is_online=true`
- [ ] **[BE]** `SISMEMBER mitras:online <mitra-id>` returns `1` within 1s of the toggle
- [ ] **[BE]** `GET mitra:heartbeat:<mitra-id>` returns a fresh ISO timestamp (within seconds of the toggle)
- [ ] **[M]** Mitra taps "offline"
- [ ] **[BE]** Postgres `is_online=false`, Valkey `SISMEMBER mitras:online <id>` returns `0`, `GET mitra:heartbeat:<id>` returns `nil`
- [ ] **[BE]** `mitra_online_logs` has paired `online` / `offline` audit rows
### Valkey-failure best-effort write
- [ ] **[BE]** Stop Valkey, then toggle a mitra online → request succeeds (200), backend log shows `[valkey-mirror] setOnline <id> failed:` but Postgres is updated correctly
- [ ] **[BE]** Restart Valkey → reconciliation sweep (≤ 300s default) eventually rebuilds the SET to include this mitra
---
## Section C — Heartbeat path
Verifies the rewrite: per-ping = 1 Valkey write, 0 DB writes.
- [ ] **[M]** Mitra online for at least one heartbeat cycle (~30s)
- [ ] **[BE]** Watch Postgres query log during heartbeat — **no `UPDATE mitra_online_status SET last_heartbeat_at` rows fire on every ping** (only the batched mirror, default every 60s)
- [ ] **[BE]** `GET mitra:heartbeat:<id>` value advances on each ping (re-check ~30s later, timestamp moves forward)
- [ ] **[BE]** After 60s+ wait, `SELECT last_heartbeat_at FROM mitra_online_status WHERE mitra_id=<id>` advances (heartbeat mirror sweep ran)
### Deactivation guard via Valkey
- [ ] **[CC]** Admin deactivates the mitra (Phase 5 path: `is_active=false`)
- [ ] **[BE]** `SISMEMBER mitras:deactivated <id>` immediately returns `1`
- [ ] **[M]** Mitra app's next heartbeat → backend returns `403 ACCOUNT_INACTIVE`
- [ ] **[BE]** No Postgres SELECT on `mitras` table for that heartbeat (verify with query log) — guard is pure Valkey
### Fallback when Valkey is down
- [ ] **[BE]** Stop Valkey
- [ ] **[M]** Mitra app heartbeats → backend logs `[heartbeat] valkey check failed, falling back to DB`, request still succeeds for active mitra, returns `403` for deactivated mitra (full DB-backed resolveMitra path)
- [ ] **[BE]** Restart Valkey → next heartbeat uses Valkey path again (no fallback log line)
---
## Section D — Capacity counter
Verifies INCR/DECR across session lifecycle via `recomputeCapacityForMitra`.
- [ ] **[BE]** Reset: pick a mitra with `mitra:capacity:<id> = 0`
- [ ] **[C]** Customer pays → mitra accepts the blast → chat starts
- [ ] **[BE]** `GET mitra:capacity:<id>` returns `1` within 1s of mitra accepting
- [ ] **[C]** Second customer pays → same mitra accepts (assuming `max_customers_per_mitra >= 2`)
- [ ] **[BE]** `GET mitra:capacity:<id>` returns `2`
- [ ] **[C]** First session ends naturally (timer expires + goodbye flow completes)
- [ ] **[BE]** `GET mitra:capacity:<id>` returns `1` within 1s
- [ ] **[C]** Second session ends
- [ ] **[BE]** `GET mitra:capacity:<id>` returns `0`
### Reroute
- [ ] **[CC]** Reroute an active session from mitra A → mitra B
- [ ] **[BE]** `mitra:capacity:A` decrements, `mitra:capacity:B` increments — both atomic with the chat_sessions UPDATE
### Capacity gates blast
- [ ] **[BE]** Set `mitra:capacity:<id>` to `max_customers_per_mitra` directly (`SET mitra:capacity:<id> 3`)
- [ ] **[C]** Customer pays → blast → **this mitra is excluded** (verify with `chat_request_notifications` — no row for this mitra)
---
## Section E — Deactivation flow (full propagation)
Verifies `updateMitraStatus` + `revokeAllSessionsForUser` close the deactivation gap.
- [ ] **[BE]** Mitra has an active access token (capture from `/api/mitra/auth/otp/verify` or use existing logged-in session)
- [ ] **[BE]** Confirm mitra can call protected routes (`curl -H "Authorization: Bearer <token>" /api/mitra/...`)
- [ ] **[CC]** Admin deactivates the mitra
- [ ] **[BE]** Postgres: `mitras.is_active=false`, `auth_sessions.revoked_at IS NOT NULL` for this mitra
- [ ] **[BE]** Valkey: `SISMEMBER mitras:deactivated <id>` = 1
- [ ] **[BE]** Mitra's current access token still works on routes that don't re-check active state (stateless JWT) — bounded by access-token TTL
- [ ] **[BE]** Mitra's heartbeat returns `403 ACCOUNT_INACTIVE` immediately (Valkey check on hot path)
- [ ] **[BE]** Mitra's next refresh token attempt fails (because `auth_sessions.revoked_at` was set) → app effectively logs them out
### Re-activation
- [ ] **[CC]** Re-activate the mitra
- [ ] **[BE]** Postgres `is_active=true`, Valkey `SISMEMBER mitras:deactivated <id>` = 0
- [ ] **[M]** Mitra re-logs in, heartbeats again successfully
---
## Section F — Read paths (Valkey-first)
Verifies all reads use Valkey with Postgres fallback.
### `isMitraReachable`
- [ ] **[BE]** Online mitra with fresh heartbeat → `isMitraReachable` returns `true` (call via `node -e` or any route that uses it, e.g. extension flow)
- [ ] **[BE]** Manually set `mitra:heartbeat:<id>` to a timestamp older than `stale_after_seconds` → `isMitraReachable` returns `false` (even though `is_online=true` in Postgres)
- [ ] **[BE]** Stop Valkey → `isMitraReachable` logs `[isMitraReachable] valkey unavailable, falling back to DB` and returns based on Postgres `is_online`
### `findAvailableMitras` (blast)
- [ ] **[BE]** Run `findAvailableMitras` (e.g. trigger a customer blast) — log shows Valkey path used (no warning about fallback)
- [ ] **[BE]** Result IDs match `SDIFF mitras:online mitras:deactivated` filtered by capacity + heartbeat freshness
- [ ] **[BE]** Stop Valkey → next blast logs `[findAvailableMitras] valkey unavailable, falling back to Postgres` and still returns correct results
### `countAvailableMitrasFromCache` (customer beacon)
- [ ] **[BE]** `curl /public/bestie-availability` returns `{available: bool, count: N}` matching reality
- [ ] **[BE]** `GET availability:snapshot` in Valkey shows cached JSON within 10s of last poll
- [ ] **[BE]** Multiple rapid polls (5+ per second from 3 different IPs) → only one Valkey-driven recompute per 10s; Postgres query log shows **zero** mitra-availability JOINs in steady state (only the once-per-10s cache miss)
- [ ] **[CC]** Operator changes `max_customers_per_mitra` → `availability:snapshot` is `DEL`d (cache bust), next poll recomputes
### Dashboard online count
- [ ] **[CC]** Dashboard "Online Mitras" stat matches `SCARD mitras:online` in Valkey
- [ ] **[BE]** Verify the dashboard query no longer hits `SELECT COUNT(*) FROM mitra_online_status WHERE is_online=true` (check query log)
---
## Section G — Auto-offline sweep (Valkey-driven)
Verifies stale heartbeats → flipped offline via Valkey diff.
- [ ] **[BE]** Set `MITRA_AUTO_OFFLINE_SWEEP_SECONDS=10` in env for faster test cycle, restart backend
- [ ] **[CC]** Set `stale_after_seconds=15` in CC settings (or directly in `app_config`)
- [ ] **[M]** Mitra goes online, sends one heartbeat
- [ ] **[BE]** Manually delete the heartbeat key: `DEL mitra:heartbeat:<id>`
- [ ] **[BE]** Wait up to 25s (15s stale + 10s sweep cadence)
- [ ] **[BE]** Postgres: `is_online=false` for this mitra, audit row in `mitra_online_logs` with status='offline'
- [ ] **[BE]** Valkey: `SISMEMBER mitras:online <id>` = 0, no `mitra:heartbeat:<id>` key
### Sweep skips on Valkey error
- [ ] **[BE]** Stop Valkey
- [ ] **[BE]** Backend log shows `[auto-offline] valkey unavailable, skipping this tick:` each sweep cadence
- [ ] **[BE]** **No Postgres UPDATE fires** during the outage (verify with query log) — confirms we don't mass-offline on Valkey hiccup
---
## Section H — Heartbeat → Postgres batched mirror
Verifies the 60s UNNEST UPDATE.
- [ ] **[BE]** Multiple mitras online, all heartbeating
- [ ] **[BE]** Set `HEARTBEAT_MIRROR_INTERVAL_SECONDS=15` for faster cycles, restart
- [ ] **[BE]** Wait one mirror cycle — Postgres log shows **one** UPDATE statement (with `UNNEST(...)` containing all online mitra IDs)
- [ ] **[BE]** `SELECT last_heartbeat_at FROM mitra_online_status WHERE is_online=true` returns timestamps within last cycle
- [ ] **[BE]** Compare with Valkey `GET mitra:heartbeat:<id>` — Postgres lags Valkey by ≤ mirror-cadence seconds (forensic-grade, not real-time)
### Mirror skips on Valkey error
- [ ] **[BE]** Stop Valkey
- [ ] **[BE]** Backend log shows `[heartbeat-mirror] valkey unavailable, skipping:` each cycle
- [ ] **[BE]** Postgres `last_heartbeat_at` does NOT advance during the outage (correct — Valkey is the source of "when was last ping?")
---
## Section I — Reconciliation sweep
Verifies drift heals every `VALKEY_ONLINE_MIRROR_SWEEP_SECONDS`.
- [ ] **[BE]** Set `VALKEY_ONLINE_MIRROR_SWEEP_SECONDS=30` for faster test, restart
- [ ] **[BE]** Manually corrupt Valkey:
```
SADD mitras:online 00000000-0000-0000-0000-000000000999 # bogus ID
SREM mitras:online <real-online-mitra-id> # remove a real one
SET mitra:capacity:<real-id> 99 # bogus capacity
```
- [ ] **[BE]** Wait one sweep cycle (~30s) → log shows `[valkey-mirror] seed:` again
- [ ] **[BE]** After sweep: Valkey state matches Postgres exactly (bogus ID gone, real ID present, capacity reset)
### Sweep disabled when env=0
- [ ] **[BE]** Set `VALKEY_ONLINE_MIRROR_SWEEP_SECONDS=0`, restart
- [ ] **[BE]** Confirm no periodic seed log appears after the initial startup seed
---
## Section J — Failure modes (Valkey degradation)
End-to-end behavior when Valkey is down for an extended period.
- [ ] **[BE]** Stop Valkey
- [ ] **[C]** Customer beacon poll → `availability:snapshot` GET fails → falls back to Postgres JOIN; UX unchanged but DB query rate spikes
- [ ] **[C]** Customer triggers blast → `findAvailableMitras` Valkey path fails → falls back to Postgres JOIN; blast still works
- [ ] **[M]** Mitra heartbeat → Valkey write fails (logged), but heartbeat returns 200; the missed write is irrelevant (auto-offline sweep is also skipping)
- [ ] **[M]** Mitra toggle online → Postgres update succeeds, Valkey SADD fails (logged); on next reconciliation sweep after Valkey returns, mitra is back in `mitras:online` SET
- [ ] **[BE]** Restart Valkey → reconnect listener fires → `seedFromPostgres()` runs → state restored; degraded period ends
---
## Section K — Multi-instance (defer until Cloud Run)
Run only when ≥ 2 backend instances are active.
- [ ] **[BE]** Two instances both run their own seed on startup — final Valkey state is consistent (idempotent `DEL + SADD`)
- [ ] **[BE]** Concurrent setOnline calls on the same mitra from different instances → final SET state correct (atomic SADD)
- [ ] **[BE]** `availability:snapshot` cache miss on instance A fills the snapshot; instance B's next poll reads the cached value (cluster-shared cache works)
- [ ] **[BE]** Operator changes `max_customers_per_mitra` on one instance → `config:invalidate` pub/sub fires → other instance also DELs `availability:snapshot`
- [ ] **[BE]** Heartbeat mirror UPDATEs from multiple instances are idempotent (last writer wins on timestamp, no errors)
---
## Smoke tests (quick happy path)
5-minute sanity check after any deploy:
- [ ] **[BE]** Backend log shows successful seed on startup
- [ ] **[M]** Mitra toggles online → appears in `SMEMBERS mitras:online`
- [ ] **[C]** Customer sees "Mulai Curhat" enabled
- [ ] **[C]** Customer pays → mitra accepts → chat starts → `mitra:capacity:<id>` increments
- [ ] **[C]** Chat ends → counter decrements
- [ ] **[M]** Mitra toggles offline → removed from SET
---
## Known limitations / what this checklist does NOT cover
- **Load testing** — sustained heartbeat volume at prod scale (300+ mitras × 2 pings/min). Plan: separate load-test stage when prod is provisioned.
- **Memorystore-specific behavior** — failover, RDB+AOF interaction. Plan: re-run Sections A, G, I, J against Memorystore Standard tier before prod cutover.
- **Long-running drift** — overnight runs where eviction or memory-pressure could affect Valkey state. Plan: monitor `INFO memory` in prod for the first week.