Mitra-availability state (online flag, deactivated flag, per-mitra session count, heartbeat liveness) mirrored into Valkey so the customer beacon + pairing blast + dashboard counts no longer hit Postgres on the hot path. Postgres remains the durable source of truth; Valkey state is fully derivable via seedFromPostgres on startup + reconnect. Schema - mitras:online SET — mirror of is_online - mitras:deactivated SET — mirror of is_active=false - mitra:capacity:<id> STRING — active+pending_payment session count - mitra💓<id> STRING — ISO timestamp of last ping - availability:snapshot JSON — beacon cache, TTL 10s, cluster-shared Write paths (Postgres first, best-effort Valkey) - setOnline/setOffline mirror SADD/SREM + heartbeat SET/DEL - updateMitraStatus mirrors mitras:deactivated AND revokes auth_sessions on deactivate (bounds the "ghost online" window to access-token TTL) - heartbeat is Valkey-only on the hot path; the per-ping Postgres UPDATE on last_heartbeat_at is eliminated (was 1,200 ops/min at prod scale) - chat_session lifecycle (accept/end/reroute/extension/expiry) calls recomputeCapacityForMitra after each UPDATE — derive-from-truth avoids the bookkeeping risk of per-transition INCR/DECR Read paths (Valkey-first, Postgres fallback on Valkey error) - isMitraReachable: SISMEMBER mitras:online + heartbeat freshness - findAvailableMitras: SDIFF + pipelined GETs, filter by capacity + heartbeat - countAvailableMitrasFromCache: Valkey-driven, cached cluster-wide 10s TTL - dashboard online count: SCARD - Each reader wraps Valkey ops in try/catch → Postgres fallback on outage Heartbeat path on /api/mitra/status/heartbeat - resolveMitra preHandler replaced with heartbeatGuard: SISMEMBER on mitras:deactivated (~0 DB hits per ping). Falls back to full DB resolveMitra if Valkey is unreachable so a Valkey outage doesn't silently accept heartbeats from deactivated mitras. Three sweeps, env-configurable cadences - MITRA_AUTO_OFFLINE_SWEEP_SECONDS (30) — Valkey-driven stale detection - HEARTBEAT_MIRROR_INTERVAL_SECONDS (60) — batched UPSERT writes Valkey timestamps to Postgres last_heartbeat_at via UNNEST (1 statement per cycle, idempotent across instances) - VALKEY_ONLINE_MIRROR_SWEEP_SECONDS (300) — periodic reseed heals drift Startup - restoreActiveTimers → seedFromPostgres → bind listeners - onValkeyReady re-runs the seed on every reconnect (cold start + reseed on Valkey restart, no manual intervention) Failure semantics - Read fallback: every Valkey read wrapped, falls back to existing Postgres JOIN query — system stays correct during Valkey outage, performance degrades not breaks - Write best-effort: Postgres write commits before Valkey is touched; Valkey errors log + continue; reconciliation sweep heals drift - Auto-offline sweep aborts entirely on Valkey error (does NOT mass- offline via Postgres scan during Valkey hiccup) Tests - New: 32 integration tests in mitra-status.valkey-mirror.test.js covering seed, write-through, fallbacks, capacity lifecycle, auto-offline sweep, heartbeat mirror, deactivation flow, beacon cache - Updated: fixtures.js seeds Valkey alongside Postgres when isOnline=true - Updated: helpers/db.js resetDb also flushes test Valkey - Fixed 2 pre-existing session-timer flakes (string IDs failed uuid parse; vi.advanceTimersByTimeAsync raced real Postgres I/O) - All 124/124 backend tests pass (was 90/92) Docs - requirement/valkey-online-mirror-plan.md — canonical plan - requirement/valkey-online-mirror-testing.md — manual E2E checklist - requirement/deployment.md — infra + Valkey persistence guidance for prod (Memorystore Standard tier recommended; migration from self-hosted Valkey is zero-downtime via reseed-from-Postgres) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
273 lines
15 KiB
Markdown
273 lines
15 KiB
Markdown
# Valkey Availability Mirror — Testing Checklist
|
||
|
||
End-to-end verification for the Valkey-mirror refactor described in [valkey-online-mirror-plan.md](valkey-online-mirror-plan.md).
|
||
|
||
Cluster labels: **[BE]** backend / curl / SQL / Valkey-cli, **[CC]** control_center, **[M]** mitra_app, **[C]** client_app.
|
||
|
||
> **Run order:** Section A first (seed + startup) — every subsequent section assumes a fresh seed. Sections B–J are otherwise independent.
|
||
|
||
---
|
||
|
||
## Setup
|
||
|
||
- [ ] Backend running on `192.168.88.247:3000` (public) + `:3001` (internal) — `curl http://192.168.88.247:3000/api/shared/auth-providers` returns 200
|
||
- [ ] Valkey reachable from backend (`VALKEY_URL` matches running instance; `[valkey] subscribed to config:invalidate` appears in backend boot log)
|
||
- [ ] Postgres reachable; backend boot log shows `[valkey-mirror] seed: X online, Y deactivated, Z with active sessions`
|
||
- [ ] At least 3 mitra accounts exist in DB (we need to flip them online/offline/deactivated across tests)
|
||
- [ ] One customer account ready for the blast scenarios
|
||
- [ ] Helpful aliases for verification (run from `backend/`):
|
||
```bash
|
||
alias vk='node --env-file=.env -e'
|
||
# Then in tests: vk "(async()=>{const v=(await import('./src/plugins/valkey.js')).getValkeyClient();
|
||
# console.log(await v.smembers('mitras:online'));process.exit(0)})()"
|
||
```
|
||
|
||
---
|
||
|
||
## Section A — Seed + Startup
|
||
|
||
Verifies `seedFromPostgres()` populates Valkey correctly from Postgres truth.
|
||
|
||
- [ ] **[BE]** Restart backend; log shows one `[valkey-mirror] seed: N online, M deactivated, K with active sessions` line on startup
|
||
- [ ] **[BE]** Counts in the log match Postgres truth:
|
||
```sql
|
||
SELECT
|
||
(SELECT COUNT(*) FROM mitra_online_status WHERE is_online=true) AS online,
|
||
(SELECT COUNT(*) FROM mitras WHERE is_active=false) AS deactivated,
|
||
(SELECT COUNT(DISTINCT mitra_id) FROM chat_sessions
|
||
WHERE mitra_id IS NOT NULL AND status IN ('active','pending_payment')) AS with_sessions;
|
||
```
|
||
- [ ] **[BE]** Valkey contents match — `SMEMBERS mitras:online` returns the same IDs as `SELECT mitra_id FROM mitra_online_status WHERE is_online=true`
|
||
- [ ] **[BE]** Valkey `mitra:heartbeat:<id>` exists for every currently-online mitra, value is a recent ISO timestamp (within seed time)
|
||
- [ ] **[BE]** Valkey `mitra:capacity:<id>` matches `SELECT COUNT(*) FROM chat_sessions WHERE mitra_id=<id> AND status IN ('active','pending_payment')` for every online mitra
|
||
- [ ] **[BE]** `SMEMBERS mitras:deactivated` matches `SELECT id FROM mitras WHERE is_active=false`
|
||
|
||
### Reconnect re-seed
|
||
|
||
- [ ] **[BE]** With backend running, `FLUSHDB` on Valkey, wait ~5s for ioredis reconnect, verify a second `[valkey-mirror] seed:` log entry appears
|
||
- [ ] **[BE]** All four Valkey structures are rebuilt and match Postgres again
|
||
|
||
---
|
||
|
||
## Section B — Online / Offline toggle (write-through)
|
||
|
||
Verifies `setOnline` / `setOffline` writes both stores in the right order.
|
||
|
||
- [ ] **[M]** Mitra taps "online" → backend updates Postgres `mitra_online_status.is_online=true`
|
||
- [ ] **[BE]** `SISMEMBER mitras:online <mitra-id>` returns `1` within 1s of the toggle
|
||
- [ ] **[BE]** `GET mitra:heartbeat:<mitra-id>` returns a fresh ISO timestamp (within seconds of the toggle)
|
||
- [ ] **[M]** Mitra taps "offline"
|
||
- [ ] **[BE]** Postgres `is_online=false`, Valkey `SISMEMBER mitras:online <id>` returns `0`, `GET mitra:heartbeat:<id>` returns `nil`
|
||
- [ ] **[BE]** `mitra_online_logs` has paired `online` / `offline` audit rows
|
||
|
||
### Valkey-failure best-effort write
|
||
|
||
- [ ] **[BE]** Stop Valkey, then toggle a mitra online → request succeeds (200), backend log shows `[valkey-mirror] setOnline <id> failed:` but Postgres is updated correctly
|
||
- [ ] **[BE]** Restart Valkey → reconciliation sweep (≤ 300s default) eventually rebuilds the SET to include this mitra
|
||
|
||
---
|
||
|
||
## Section C — Heartbeat path
|
||
|
||
Verifies the rewrite: per-ping = 1 Valkey write, 0 DB writes.
|
||
|
||
- [ ] **[M]** Mitra online for at least one heartbeat cycle (~30s)
|
||
- [ ] **[BE]** Watch Postgres query log during heartbeat — **no `UPDATE mitra_online_status SET last_heartbeat_at` rows fire on every ping** (only the batched mirror, default every 60s)
|
||
- [ ] **[BE]** `GET mitra:heartbeat:<id>` value advances on each ping (re-check ~30s later, timestamp moves forward)
|
||
- [ ] **[BE]** After 60s+ wait, `SELECT last_heartbeat_at FROM mitra_online_status WHERE mitra_id=<id>` advances (heartbeat mirror sweep ran)
|
||
|
||
### Deactivation guard via Valkey
|
||
|
||
- [ ] **[CC]** Admin deactivates the mitra (Phase 5 path: `is_active=false`)
|
||
- [ ] **[BE]** `SISMEMBER mitras:deactivated <id>` immediately returns `1`
|
||
- [ ] **[M]** Mitra app's next heartbeat → backend returns `403 ACCOUNT_INACTIVE`
|
||
- [ ] **[BE]** No Postgres SELECT on `mitras` table for that heartbeat (verify with query log) — guard is pure Valkey
|
||
|
||
### Fallback when Valkey is down
|
||
|
||
- [ ] **[BE]** Stop Valkey
|
||
- [ ] **[M]** Mitra app heartbeats → backend logs `[heartbeat] valkey check failed, falling back to DB`, request still succeeds for active mitra, returns `403` for deactivated mitra (full DB-backed resolveMitra path)
|
||
- [ ] **[BE]** Restart Valkey → next heartbeat uses Valkey path again (no fallback log line)
|
||
|
||
---
|
||
|
||
## Section D — Capacity counter
|
||
|
||
Verifies INCR/DECR across session lifecycle via `recomputeCapacityForMitra`.
|
||
|
||
- [ ] **[BE]** Reset: pick a mitra with `mitra:capacity:<id> = 0`
|
||
- [ ] **[C]** Customer pays → mitra accepts the blast → chat starts
|
||
- [ ] **[BE]** `GET mitra:capacity:<id>` returns `1` within 1s of mitra accepting
|
||
- [ ] **[C]** Second customer pays → same mitra accepts (assuming `max_customers_per_mitra >= 2`)
|
||
- [ ] **[BE]** `GET mitra:capacity:<id>` returns `2`
|
||
- [ ] **[C]** First session ends naturally (timer expires + goodbye flow completes)
|
||
- [ ] **[BE]** `GET mitra:capacity:<id>` returns `1` within 1s
|
||
- [ ] **[C]** Second session ends
|
||
- [ ] **[BE]** `GET mitra:capacity:<id>` returns `0`
|
||
|
||
### Reroute
|
||
|
||
- [ ] **[CC]** Reroute an active session from mitra A → mitra B
|
||
- [ ] **[BE]** `mitra:capacity:A` decrements, `mitra:capacity:B` increments — both atomic with the chat_sessions UPDATE
|
||
|
||
### Capacity gates blast
|
||
|
||
- [ ] **[BE]** Set `mitra:capacity:<id>` to `max_customers_per_mitra` directly (`SET mitra:capacity:<id> 3`)
|
||
- [ ] **[C]** Customer pays → blast → **this mitra is excluded** (verify with `chat_request_notifications` — no row for this mitra)
|
||
|
||
---
|
||
|
||
## Section E — Deactivation flow (full propagation)
|
||
|
||
Verifies `updateMitraStatus` + `revokeAllSessionsForUser` close the deactivation gap.
|
||
|
||
- [ ] **[BE]** Mitra has an active access token (capture from `/api/mitra/auth/otp/verify` or use existing logged-in session)
|
||
- [ ] **[BE]** Confirm mitra can call protected routes (`curl -H "Authorization: Bearer <token>" /api/mitra/...`)
|
||
- [ ] **[CC]** Admin deactivates the mitra
|
||
- [ ] **[BE]** Postgres: `mitras.is_active=false`, `auth_sessions.revoked_at IS NOT NULL` for this mitra
|
||
- [ ] **[BE]** Valkey: `SISMEMBER mitras:deactivated <id>` = 1
|
||
- [ ] **[BE]** Mitra's current access token still works on routes that don't re-check active state (stateless JWT) — bounded by access-token TTL
|
||
- [ ] **[BE]** Mitra's heartbeat returns `403 ACCOUNT_INACTIVE` immediately (Valkey check on hot path)
|
||
- [ ] **[BE]** Mitra's next refresh token attempt fails (because `auth_sessions.revoked_at` was set) → app effectively logs them out
|
||
|
||
### Re-activation
|
||
|
||
- [ ] **[CC]** Re-activate the mitra
|
||
- [ ] **[BE]** Postgres `is_active=true`, Valkey `SISMEMBER mitras:deactivated <id>` = 0
|
||
- [ ] **[M]** Mitra re-logs in, heartbeats again successfully
|
||
|
||
---
|
||
|
||
## Section F — Read paths (Valkey-first)
|
||
|
||
Verifies all reads use Valkey with Postgres fallback.
|
||
|
||
### `isMitraReachable`
|
||
|
||
- [ ] **[BE]** Online mitra with fresh heartbeat → `isMitraReachable` returns `true` (call via `node -e` or any route that uses it, e.g. extension flow)
|
||
- [ ] **[BE]** Manually set `mitra:heartbeat:<id>` to a timestamp older than `stale_after_seconds` → `isMitraReachable` returns `false` (even though `is_online=true` in Postgres)
|
||
- [ ] **[BE]** Stop Valkey → `isMitraReachable` logs `[isMitraReachable] valkey unavailable, falling back to DB` and returns based on Postgres `is_online`
|
||
|
||
### `findAvailableMitras` (blast)
|
||
|
||
- [ ] **[BE]** Run `findAvailableMitras` (e.g. trigger a customer blast) — log shows Valkey path used (no warning about fallback)
|
||
- [ ] **[BE]** Result IDs match `SDIFF mitras:online mitras:deactivated` filtered by capacity + heartbeat freshness
|
||
- [ ] **[BE]** Stop Valkey → next blast logs `[findAvailableMitras] valkey unavailable, falling back to Postgres` and still returns correct results
|
||
|
||
### `countAvailableMitrasFromCache` (customer beacon)
|
||
|
||
- [ ] **[BE]** `curl /public/bestie-availability` returns `{available: bool, count: N}` matching reality
|
||
- [ ] **[BE]** `GET availability:snapshot` in Valkey shows cached JSON within 10s of last poll
|
||
- [ ] **[BE]** Multiple rapid polls (5+ per second from 3 different IPs) → only one Valkey-driven recompute per 10s; Postgres query log shows **zero** mitra-availability JOINs in steady state (only the once-per-10s cache miss)
|
||
- [ ] **[CC]** Operator changes `max_customers_per_mitra` → `availability:snapshot` is `DEL`d (cache bust), next poll recomputes
|
||
|
||
### Dashboard online count
|
||
|
||
- [ ] **[CC]** Dashboard "Online Mitras" stat matches `SCARD mitras:online` in Valkey
|
||
- [ ] **[BE]** Verify the dashboard query no longer hits `SELECT COUNT(*) FROM mitra_online_status WHERE is_online=true` (check query log)
|
||
|
||
---
|
||
|
||
## Section G — Auto-offline sweep (Valkey-driven)
|
||
|
||
Verifies stale heartbeats → flipped offline via Valkey diff.
|
||
|
||
- [ ] **[BE]** Set `MITRA_AUTO_OFFLINE_SWEEP_SECONDS=10` in env for faster test cycle, restart backend
|
||
- [ ] **[CC]** Set `stale_after_seconds=15` in CC settings (or directly in `app_config`)
|
||
- [ ] **[M]** Mitra goes online, sends one heartbeat
|
||
- [ ] **[BE]** Manually delete the heartbeat key: `DEL mitra:heartbeat:<id>`
|
||
- [ ] **[BE]** Wait up to 25s (15s stale + 10s sweep cadence)
|
||
- [ ] **[BE]** Postgres: `is_online=false` for this mitra, audit row in `mitra_online_logs` with status='offline'
|
||
- [ ] **[BE]** Valkey: `SISMEMBER mitras:online <id>` = 0, no `mitra:heartbeat:<id>` key
|
||
|
||
### Sweep skips on Valkey error
|
||
|
||
- [ ] **[BE]** Stop Valkey
|
||
- [ ] **[BE]** Backend log shows `[auto-offline] valkey unavailable, skipping this tick:` each sweep cadence
|
||
- [ ] **[BE]** **No Postgres UPDATE fires** during the outage (verify with query log) — confirms we don't mass-offline on Valkey hiccup
|
||
|
||
---
|
||
|
||
## Section H — Heartbeat → Postgres batched mirror
|
||
|
||
Verifies the 60s UNNEST UPDATE.
|
||
|
||
- [ ] **[BE]** Multiple mitras online, all heartbeating
|
||
- [ ] **[BE]** Set `HEARTBEAT_MIRROR_INTERVAL_SECONDS=15` for faster cycles, restart
|
||
- [ ] **[BE]** Wait one mirror cycle — Postgres log shows **one** UPDATE statement (with `UNNEST(...)` containing all online mitra IDs)
|
||
- [ ] **[BE]** `SELECT last_heartbeat_at FROM mitra_online_status WHERE is_online=true` returns timestamps within last cycle
|
||
- [ ] **[BE]** Compare with Valkey `GET mitra:heartbeat:<id>` — Postgres lags Valkey by ≤ mirror-cadence seconds (forensic-grade, not real-time)
|
||
|
||
### Mirror skips on Valkey error
|
||
|
||
- [ ] **[BE]** Stop Valkey
|
||
- [ ] **[BE]** Backend log shows `[heartbeat-mirror] valkey unavailable, skipping:` each cycle
|
||
- [ ] **[BE]** Postgres `last_heartbeat_at` does NOT advance during the outage (correct — Valkey is the source of "when was last ping?")
|
||
|
||
---
|
||
|
||
## Section I — Reconciliation sweep
|
||
|
||
Verifies drift heals every `VALKEY_ONLINE_MIRROR_SWEEP_SECONDS`.
|
||
|
||
- [ ] **[BE]** Set `VALKEY_ONLINE_MIRROR_SWEEP_SECONDS=30` for faster test, restart
|
||
- [ ] **[BE]** Manually corrupt Valkey:
|
||
```
|
||
SADD mitras:online 00000000-0000-0000-0000-000000000999 # bogus ID
|
||
SREM mitras:online <real-online-mitra-id> # remove a real one
|
||
SET mitra:capacity:<real-id> 99 # bogus capacity
|
||
```
|
||
- [ ] **[BE]** Wait one sweep cycle (~30s) → log shows `[valkey-mirror] seed:` again
|
||
- [ ] **[BE]** After sweep: Valkey state matches Postgres exactly (bogus ID gone, real ID present, capacity reset)
|
||
|
||
### Sweep disabled when env=0
|
||
|
||
- [ ] **[BE]** Set `VALKEY_ONLINE_MIRROR_SWEEP_SECONDS=0`, restart
|
||
- [ ] **[BE]** Confirm no periodic seed log appears after the initial startup seed
|
||
|
||
---
|
||
|
||
## Section J — Failure modes (Valkey degradation)
|
||
|
||
End-to-end behavior when Valkey is down for an extended period.
|
||
|
||
- [ ] **[BE]** Stop Valkey
|
||
- [ ] **[C]** Customer beacon poll → `availability:snapshot` GET fails → falls back to Postgres JOIN; UX unchanged but DB query rate spikes
|
||
- [ ] **[C]** Customer triggers blast → `findAvailableMitras` Valkey path fails → falls back to Postgres JOIN; blast still works
|
||
- [ ] **[M]** Mitra heartbeat → Valkey write fails (logged), but heartbeat returns 200; the missed write is irrelevant (auto-offline sweep is also skipping)
|
||
- [ ] **[M]** Mitra toggle online → Postgres update succeeds, Valkey SADD fails (logged); on next reconciliation sweep after Valkey returns, mitra is back in `mitras:online` SET
|
||
- [ ] **[BE]** Restart Valkey → reconnect listener fires → `seedFromPostgres()` runs → state restored; degraded period ends
|
||
|
||
---
|
||
|
||
## Section K — Multi-instance (defer until Cloud Run)
|
||
|
||
Run only when ≥ 2 backend instances are active.
|
||
|
||
- [ ] **[BE]** Two instances both run their own seed on startup — final Valkey state is consistent (idempotent `DEL + SADD`)
|
||
- [ ] **[BE]** Concurrent setOnline calls on the same mitra from different instances → final SET state correct (atomic SADD)
|
||
- [ ] **[BE]** `availability:snapshot` cache miss on instance A fills the snapshot; instance B's next poll reads the cached value (cluster-shared cache works)
|
||
- [ ] **[BE]** Operator changes `max_customers_per_mitra` on one instance → `config:invalidate` pub/sub fires → other instance also DELs `availability:snapshot`
|
||
- [ ] **[BE]** Heartbeat mirror UPDATEs from multiple instances are idempotent (last writer wins on timestamp, no errors)
|
||
|
||
---
|
||
|
||
## Smoke tests (quick happy path)
|
||
|
||
5-minute sanity check after any deploy:
|
||
|
||
- [ ] **[BE]** Backend log shows successful seed on startup
|
||
- [ ] **[M]** Mitra toggles online → appears in `SMEMBERS mitras:online`
|
||
- [ ] **[C]** Customer sees "Mulai Curhat" enabled
|
||
- [ ] **[C]** Customer pays → mitra accepts → chat starts → `mitra:capacity:<id>` increments
|
||
- [ ] **[C]** Chat ends → counter decrements
|
||
- [ ] **[M]** Mitra toggles offline → removed from SET
|
||
|
||
---
|
||
|
||
## Known limitations / what this checklist does NOT cover
|
||
|
||
- **Load testing** — sustained heartbeat volume at prod scale (300+ mitras × 2 pings/min). Plan: separate load-test stage when prod is provisioned.
|
||
- **Memorystore-specific behavior** — failover, RDB+AOF interaction. Plan: re-run Sections A, G, I, J against Memorystore Standard tier before prod cutover.
|
||
- **Long-running drift** — overnight runs where eviction or memory-pressure could affect Valkey state. Plan: monitor `INFO memory` in prod for the first week.
|