halobestie-clone/requirement/valkey-online-mirror-testing.md

# Valkey Availability Mirror — Testing Checklist

End-to-end verification for the Valkey-mirror refactor described in [valkey-online-mirror-plan.md](valkey-online-mirror-plan.md).

Cluster labels: **[BE]** backend / curl / SQL / Valkey-cli, **[CC]** control_center, **[M]** mitra_app, **[C]** client_app.

> **Run order:** Section A first (seed + startup) — every subsequent section assumes a fresh seed. Sections B–J are otherwise independent.

---

## Setup

- [ ] Backend running on `192.168.88.247:3000` (public) + `:3001` (internal) — `curl http://192.168.88.247:3000/api/shared/auth-providers` returns 200
- [ ] Valkey reachable from backend (`VALKEY_URL` matches running instance; `[valkey] subscribed to config:invalidate` appears in backend boot log)
- [ ] Postgres reachable; backend boot log shows `[valkey-mirror] seed: X online, Y deactivated, Z with active sessions`
- [ ] At least 3 mitra accounts exist in DB (we need to flip them online/offline/deactivated across tests)
- [ ] One customer account ready for the blast scenarios
- [ ] Helpful aliases for verification (run from `backend/`):
  ```bash
  alias vk='node --env-file=.env -e'
  # Then in tests: vk "(async()=>{const v=(await import('./src/plugins/valkey.js')).getValkeyClient();
  #   console.log(await v.smembers('mitras:online'));process.exit(0)})()"
  ```

---

## Section A — Seed + Startup

Verifies `seedFromPostgres()` populates Valkey correctly from Postgres truth.

- [ ] **[BE]** Restart backend; log shows one `[valkey-mirror] seed: N online, M deactivated, K with active sessions` line on startup
- [ ] **[BE]** Counts in the log match Postgres truth:
  ```sql
  SELECT
    (SELECT COUNT(*) FROM mitra_online_status WHERE is_online=true) AS online,
    (SELECT COUNT(*) FROM mitras WHERE is_active=false) AS deactivated,
    (SELECT COUNT(DISTINCT mitra_id) FROM chat_sessions
     WHERE mitra_id IS NOT NULL AND status IN ('active','pending_payment')) AS with_sessions;
  ```
- [ ] **[BE]** Valkey contents match — `SMEMBERS mitras:online` returns the same IDs as `SELECT mitra_id FROM mitra_online_status WHERE is_online=true`
- [ ] **[BE]** Valkey `mitra:heartbeat:<id>` exists for every currently-online mitra, value is a recent ISO timestamp (within seed time)
- [ ] **[BE]** Valkey `mitra:capacity:<id>` matches `SELECT COUNT(*) FROM chat_sessions WHERE mitra_id=<id> AND status IN ('active','pending_payment')` for every online mitra
- [ ] **[BE]** `SMEMBERS mitras:deactivated` matches `SELECT id FROM mitras WHERE is_active=false`

### Reconnect re-seed

- [ ] **[BE]** With backend running, `FLUSHDB` on Valkey, wait ~5s for ioredis reconnect, verify a second `[valkey-mirror] seed:` log entry appears
- [ ] **[BE]** All four Valkey structures are rebuilt and match Postgres again

---

## Section B — Online / Offline toggle (write-through)

Verifies `setOnline` / `setOffline` writes both stores in the right order.

- [ ] **[M]** Mitra taps "online" → backend updates Postgres `mitra_online_status.is_online=true`
- [ ] **[BE]** `SISMEMBER mitras:online <mitra-id>` returns `1` within 1s of the toggle
- [ ] **[BE]** `GET mitra:heartbeat:<mitra-id>` returns a fresh ISO timestamp (within seconds of the toggle)
- [ ] **[M]** Mitra taps "offline"
- [ ] **[BE]** Postgres `is_online=false`, Valkey `SISMEMBER mitras:online <id>` returns `0`, `GET mitra:heartbeat:<id>` returns `nil`
- [ ] **[BE]** `mitra_online_logs` has paired `online` / `offline` audit rows

### Valkey-failure best-effort write

- [ ] **[BE]** Stop Valkey, then toggle a mitra online → request succeeds (200), backend log shows `[valkey-mirror] setOnline <id> failed:` but Postgres is updated correctly
- [ ] **[BE]** Restart Valkey → reconciliation sweep (≤ 300s default) eventually rebuilds the SET to include this mitra

---

## Section C — Heartbeat path

Verifies the rewrite: per-ping = 1 Valkey write, 0 DB writes.

- [ ] **[M]** Mitra online for at least one heartbeat cycle (~30s)
- [ ] **[BE]** Watch Postgres query log during heartbeat — **no `UPDATE mitra_online_status SET last_heartbeat_at` rows fire on every ping** (only the batched mirror, default every 60s)
- [ ] **[BE]** `GET mitra:heartbeat:<id>` value advances on each ping (re-check ~30s later, timestamp moves forward)
- [ ] **[BE]** After 60s+ wait, `SELECT last_heartbeat_at FROM mitra_online_status WHERE mitra_id=<id>` advances (heartbeat mirror sweep ran)

### Deactivation guard via Valkey

- [ ] **[CC]** Admin deactivates the mitra (Phase 5 path: `is_active=false`)
- [ ] **[BE]** `SISMEMBER mitras:deactivated <id>` immediately returns `1`
- [ ] **[M]** Mitra app's next heartbeat → backend returns `403 ACCOUNT_INACTIVE`
- [ ] **[BE]** No Postgres SELECT on `mitras` table for that heartbeat (verify with query log) — guard is pure Valkey

### Fallback when Valkey is down

- [ ] **[BE]** Stop Valkey
- [ ] **[M]** Mitra app heartbeats → backend logs `[heartbeat] valkey check failed, falling back to DB`, request still succeeds for active mitra, returns `403` for deactivated mitra (full DB-backed resolveMitra path)
- [ ] **[BE]** Restart Valkey → next heartbeat uses Valkey path again (no fallback log line)

---

## Section D — Capacity counter

Verifies INCR/DECR across session lifecycle via `recomputeCapacityForMitra`.

- [ ] **[BE]** Reset: pick a mitra with `mitra:capacity:<id> = 0`
- [ ] **[C]** Customer pays → mitra accepts the blast → chat starts
- [ ] **[BE]** `GET mitra:capacity:<id>` returns `1` within 1s of mitra accepting
- [ ] **[C]** Second customer pays → same mitra accepts (assuming `max_customers_per_mitra >= 2`)
- [ ] **[BE]** `GET mitra:capacity:<id>` returns `2`
- [ ] **[C]** First session ends naturally (timer expires + goodbye flow completes)
- [ ] **[BE]** `GET mitra:capacity:<id>` returns `1` within 1s
- [ ] **[C]** Second session ends
- [ ] **[BE]** `GET mitra:capacity:<id>` returns `0`

### Reroute

- [ ] **[CC]** Reroute an active session from mitra A → mitra B
- [ ] **[BE]** `mitra:capacity:A` decrements, `mitra:capacity:B` increments — both atomic with the chat_sessions UPDATE

### Capacity gates blast

- [ ] **[BE]** Set `mitra:capacity:<id>` to `max_customers_per_mitra` directly (`SET mitra:capacity:<id> 3`)
- [ ] **[C]** Customer pays → blast → **this mitra is excluded** (verify with `chat_request_notifications` — no row for this mitra)

---

## Section E — Deactivation flow (full propagation)

Verifies `updateMitraStatus` + `revokeAllSessionsForUser` close the deactivation gap.

- [ ] **[BE]** Mitra has an active access token (capture from `/api/mitra/auth/otp/verify` or use existing logged-in session)
- [ ] **[BE]** Confirm mitra can call protected routes (`curl -H "Authorization: Bearer <token>" /api/mitra/...`)
- [ ] **[CC]** Admin deactivates the mitra
- [ ] **[BE]** Postgres: `mitras.is_active=false`, `auth_sessions.revoked_at IS NOT NULL` for this mitra
- [ ] **[BE]** Valkey: `SISMEMBER mitras:deactivated <id>` = 1
- [ ] **[BE]** Mitra's current access token still works on routes that don't re-check active state (stateless JWT) — bounded by access-token TTL
- [ ] **[BE]** Mitra's heartbeat returns `403 ACCOUNT_INACTIVE` immediately (Valkey check on hot path)
- [ ] **[BE]** Mitra's next refresh token attempt fails (because `auth_sessions.revoked_at` was set) → app effectively logs them out

### Re-activation

- [ ] **[CC]** Re-activate the mitra
- [ ] **[BE]** Postgres `is_active=true`, Valkey `SISMEMBER mitras:deactivated <id>` = 0
- [ ] **[M]** Mitra re-logs in, heartbeats again successfully

---

## Section F — Read paths (Valkey-first)

Verifies all reads use Valkey with Postgres fallback.

### `isMitraReachable`

- [ ] **[BE]** Online mitra with fresh heartbeat → `isMitraReachable` returns `true` (call via `node -e` or any route that uses it, e.g. extension flow)
- [ ] **[BE]** Manually set `mitra:heartbeat:<id>` to a timestamp older than `stale_after_seconds` → `isMitraReachable` returns `false` (even though `is_online=true` in Postgres)
- [ ] **[BE]** Stop Valkey → `isMitraReachable` logs `[isMitraReachable] valkey unavailable, falling back to DB` and returns based on Postgres `is_online`

### `findAvailableMitras` (blast)

- [ ] **[BE]** Run `findAvailableMitras` (e.g. trigger a customer blast) — log shows Valkey path used (no warning about fallback)
- [ ] **[BE]** Result IDs match `SDIFF mitras:online mitras:deactivated` filtered by capacity + heartbeat freshness
- [ ] **[BE]** Stop Valkey → next blast logs `[findAvailableMitras] valkey unavailable, falling back to Postgres` and still returns correct results

### `countAvailableMitrasFromCache` (customer beacon)

- [ ] **[BE]** `curl /public/bestie-availability` returns `{available: bool, count: N}` matching reality
- [ ] **[BE]** `GET availability:snapshot` in Valkey shows cached JSON within 10s of last poll
- [ ] **[BE]** Multiple rapid polls (5+ per second from 3 different IPs) → only one Valkey-driven recompute per 10s; Postgres query log shows **zero** mitra-availability JOINs in steady state (only the once-per-10s cache miss)
- [ ] **[CC]** Operator changes `max_customers_per_mitra` → `availability:snapshot` is `DEL`d (cache bust), next poll recomputes

### Dashboard online count

- [ ] **[CC]** Dashboard "Online Mitras" stat matches `SCARD mitras:online` in Valkey
- [ ] **[BE]** Verify the dashboard query no longer hits `SELECT COUNT(*) FROM mitra_online_status WHERE is_online=true` (check query log)

---

## Section G — Auto-offline sweep (Valkey-driven)

Verifies stale heartbeats → flipped offline via Valkey diff.

- [ ] **[BE]** Set `MITRA_AUTO_OFFLINE_SWEEP_SECONDS=10` in env for faster test cycle, restart backend
- [ ] **[CC]** Set `stale_after_seconds=15` in CC settings (or directly in `app_config`)
- [ ] **[M]** Mitra goes online, sends one heartbeat
- [ ] **[BE]** Manually delete the heartbeat key: `DEL mitra:heartbeat:<id>`
- [ ] **[BE]** Wait up to 25s (15s stale + 10s sweep cadence)
- [ ] **[BE]** Postgres: `is_online=false` for this mitra, audit row in `mitra_online_logs` with status='offline'
- [ ] **[BE]** Valkey: `SISMEMBER mitras:online <id>` = 0, no `mitra:heartbeat:<id>` key

### Sweep skips on Valkey error

- [ ] **[BE]** Stop Valkey
- [ ] **[BE]** Backend log shows `[auto-offline] valkey unavailable, skipping this tick:` each sweep cadence
- [ ] **[BE]** **No Postgres UPDATE fires** during the outage (verify with query log) — confirms we don't mass-offline on Valkey hiccup

---

## Section H — Heartbeat → Postgres batched mirror

Verifies the 60s UNNEST UPDATE.

- [ ] **[BE]** Multiple mitras online, all heartbeating
- [ ] **[BE]** Set `HEARTBEAT_MIRROR_INTERVAL_SECONDS=15` for faster cycles, restart
- [ ] **[BE]** Wait one mirror cycle — Postgres log shows **one** UPDATE statement (with `UNNEST(...)` containing all online mitra IDs)
- [ ] **[BE]** `SELECT last_heartbeat_at FROM mitra_online_status WHERE is_online=true` returns timestamps within last cycle
- [ ] **[BE]** Compare with Valkey `GET mitra:heartbeat:<id>` — Postgres lags Valkey by ≤ mirror-cadence seconds (forensic-grade, not real-time)

### Mirror skips on Valkey error

- [ ] **[BE]** Stop Valkey
- [ ] **[BE]** Backend log shows `[heartbeat-mirror] valkey unavailable, skipping:` each cycle
- [ ] **[BE]** Postgres `last_heartbeat_at` does NOT advance during the outage (correct — Valkey is the source of "when was last ping?")

---

## Section I — Reconciliation sweep

Verifies drift heals every `VALKEY_ONLINE_MIRROR_SWEEP_SECONDS`.

- [ ] **[BE]** Set `VALKEY_ONLINE_MIRROR_SWEEP_SECONDS=30` for faster test, restart
- [ ] **[BE]** Manually corrupt Valkey:
  ```
  SADD mitras:online 00000000-0000-0000-0000-000000000999  # bogus ID
  SREM mitras:online <real-online-mitra-id>                 # remove a real one
  SET mitra:capacity:<real-id> 99                           # bogus capacity
  ```
- [ ] **[BE]** Wait one sweep cycle (~30s) → log shows `[valkey-mirror] seed:` again
- [ ] **[BE]** After sweep: Valkey state matches Postgres exactly (bogus ID gone, real ID present, capacity reset)

### Sweep disabled when env=0

- [ ] **[BE]** Set `VALKEY_ONLINE_MIRROR_SWEEP_SECONDS=0`, restart
- [ ] **[BE]** Confirm no periodic seed log appears after the initial startup seed

---

## Section J — Failure modes (Valkey degradation)

End-to-end behavior when Valkey is down for an extended period.

- [ ] **[BE]** Stop Valkey
- [ ] **[C]** Customer beacon poll → `availability:snapshot` GET fails → falls back to Postgres JOIN; UX unchanged but DB query rate spikes
- [ ] **[C]** Customer triggers blast → `findAvailableMitras` Valkey path fails → falls back to Postgres JOIN; blast still works
- [ ] **[M]** Mitra heartbeat → Valkey write fails (logged), but heartbeat returns 200; the missed write is irrelevant (auto-offline sweep is also skipping)
- [ ] **[M]** Mitra toggle online → Postgres update succeeds, Valkey SADD fails (logged); on next reconciliation sweep after Valkey returns, mitra is back in `mitras:online` SET
- [ ] **[BE]** Restart Valkey → reconnect listener fires → `seedFromPostgres()` runs → state restored; degraded period ends

---

## Section K — Multi-instance (defer until Cloud Run)

Run only when ≥ 2 backend instances are active.

- [ ] **[BE]** Two instances both run their own seed on startup — final Valkey state is consistent (idempotent `DEL + SADD`)
- [ ] **[BE]** Concurrent setOnline calls on the same mitra from different instances → final SET state correct (atomic SADD)
- [ ] **[BE]** `availability:snapshot` cache miss on instance A fills the snapshot; instance B's next poll reads the cached value (cluster-shared cache works)
- [ ] **[BE]** Operator changes `max_customers_per_mitra` on one instance → `config:invalidate` pub/sub fires → other instance also DELs `availability:snapshot`
- [ ] **[BE]** Heartbeat mirror UPDATEs from multiple instances are idempotent (last writer wins on timestamp, no errors)

---

## Smoke tests (quick happy path)

5-minute sanity check after any deploy:

- [ ] **[BE]** Backend log shows successful seed on startup
- [ ] **[M]** Mitra toggles online → appears in `SMEMBERS mitras:online`
- [ ] **[C]** Customer sees "Mulai Curhat" enabled
- [ ] **[C]** Customer pays → mitra accepts → chat starts → `mitra:capacity:<id>` increments
- [ ] **[C]** Chat ends → counter decrements
- [ ] **[M]** Mitra toggles offline → removed from SET

---

## Known limitations / what this checklist does NOT cover

- **Load testing** — sustained heartbeat volume at prod scale (300+ mitras × 2 pings/min). Plan: separate load-test stage when prod is provisioned.
- **Memorystore-specific behavior** — failover, RDB+AOF interaction. Plan: re-run Sections A, G, I, J against Memorystore Standard tier before prod cutover.
- **Long-running drift** — overnight runs where eviction or memory-pressure could affect Valkey state. Plan: monitor `INFO memory` in prod for the first week.