Phase 6: Valkey availability mirror — move read path off Postgres

Mitra-availability state (online flag, deactivated flag, per-mitra session count, heartbeat liveness) mirrored into Valkey so the customer beacon + pairing blast + dashboard counts no longer hit Postgres on the hot path. Postgres remains the durable source of truth; Valkey state is fully derivable via seedFromPostgres on startup + reconnect.

Schema
- mitras:online SET — mirror of is_online
- mitras:deactivated SET — mirror of is_active=false
- mitra:capacity:<id> STRING — active+pending_payment session count
- mitra:heartbeat:<id> STRING — ISO timestamp of last ping
- availability:snapshot JSON — beacon cache, TTL 10s, cluster-shared

Write paths (Postgres first, best-effort Valkey)
- setOnline/setOffline mirror SADD/SREM + heartbeat SET/DEL
- updateMitraStatus mirrors mitras:deactivated AND revokes auth_sessions on deactivate (bounds the "ghost online" window to access-token TTL)
- heartbeat is Valkey-only on the hot path; the per-ping Postgres UPDATE on last_heartbeat_at is eliminated (was 1,200 ops/min at prod scale)
- chat_session lifecycle (accept/end/reroute/extension/expiry) calls recomputeCapacityForMitra after each UPDATE — derive-from-truth avoids the bookkeeping risk of per-transition INCR/DECR

Read paths (Valkey-first, Postgres fallback on Valkey error)
- isMitraReachable: SISMEMBER mitras:online + heartbeat freshness
- findAvailableMitras: SDIFF + pipelined GETs, filter by capacity + heartbeat
- countAvailableMitrasFromCache: Valkey-driven, cached cluster-wide 10s TTL
- dashboard online count: SCARD
- Each reader wraps Valkey ops in try/catch → Postgres fallback on outage

Heartbeat path on /api/mitra/status/heartbeat
- resolveMitra preHandler replaced with heartbeatGuard: SISMEMBER on mitras:deactivated (~0 DB hits per ping). Falls back to full DB resolveMitra if Valkey is unreachable so a Valkey outage doesn't silently accept heartbeats from deactivated mitras.

Three sweeps, env-configurable cadences
- MITRA_AUTO_OFFLINE_SWEEP_SECONDS (30) — Valkey-driven stale detection
- HEARTBEAT_MIRROR_INTERVAL_SECONDS (60) — batched UPSERT writes Valkey timestamps to Postgres last_heartbeat_at via UNNEST (1 statement per cycle, idempotent across instances)
- VALKEY_ONLINE_MIRROR_SWEEP_SECONDS (300) — periodic reseed heals drift

Startup
- restoreActiveTimers → seedFromPostgres → bind listeners
- onValkeyReady re-runs the seed on every reconnect (cold start + reseed on Valkey restart, no manual intervention)

Failure semantics
- Read fallback: every Valkey read wrapped, falls back to existing Postgres JOIN query — system stays correct during Valkey outage, performance degrades not breaks
- Write best-effort: Postgres write commits before Valkey is touched; Valkey errors log + continue; reconciliation sweep heals drift
- Auto-offline sweep aborts entirely on Valkey error (does NOT mass-offline via Postgres scan during Valkey hiccup)

Tests
- New: 32 integration tests in mitra-status.valkey-mirror.test.js covering seed, write-through, fallbacks, capacity lifecycle, auto-offline sweep, heartbeat mirror, deactivation flow, beacon cache
- Updated: fixtures.js seeds Valkey alongside Postgres when isOnline=true
- Updated: helpers/db.js resetDb also flushes test Valkey
- Fixed 2 pre-existing session-timer flakes (string IDs failed uuid parse; vi.advanceTimersByTimeAsync raced real Postgres I/O)
- All 124/124 backend tests pass (was 90/92)

Docs
- requirement/valkey-online-mirror-plan.md — canonical plan
- requirement/valkey-online-mirror-testing.md — manual E2E checklist
- requirement/deployment.md — infra + Valkey persistence guidance for prod (Memorystore Standard tier recommended; migration from self-hosted Valkey is zero-downtime via reseed-from-Postgres)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 18:07:55 +08:00
parent 3fff4b1c6e
commit 553dbac52f
20 changed files with 1839 additions and 82 deletions
--- a/requirement/deployment.md
+++ b/requirement/deployment.md
@@ -0,0 +1,107 @@
+# Deployment notes
+
+Operational decisions and dependency configuration for staging/production. Keep this updated as we make infra choices; cross-link from feature plans when a deploy-time setting matters.
+
+## Infrastructure summary
+
+| Component | Service | Tier / Notes |
+|---|---|---|
+| Backend (public + internal) | GCP Cloud Run | Horizontal scaling; SIGTERM trapped for graceful drain ([server.js](../backend/src/server.js)) |
+| Database | GCP Cloud SQL (PostgreSQL) | Source of truth for all durable state |
+| Pub/sub + cache | Valkey | Self-hosted on VM today; Memorystore Standard (HA) recommended for prod (see [§ Valkey](#valkey)) |
+| Networking | GCP VPC | Internal listener (port 3001) never exposed; CC reaches it via VPN |
+| Payment | Xendit | See [phase5-xendit-plan.md](phase5-xendit-plan.md) for keys / webhook URL setup |
+| Auth | Self-managed JWT + FCM-only Firebase | See [backend/CLAUDE.md](../backend/CLAUDE.md) |
+
+## Valkey
+
+Valkey is used for two distinct purposes:
+
+1. **Pub/sub** — cross-instance event fan-out (chat messages, session lifecycle, config invalidation). See [backend/src/plugins/valkey.js](../backend/src/plugins/valkey.js).
+2. **Availability mirror** — `mitras:online`, `mitras:deactivated`, `mitra:capacity:<id>`, `mitra:heartbeat:<id>`, and `availability:snapshot` per [valkey-online-mirror-plan.md](valkey-online-mirror-plan.md). Postgres remains the durable source of truth; Valkey is the hot read path.
+
+### Persistence — required or optional?
+
+**Not required.** All durable state lives in Postgres; Valkey is a cache + ephemeral liveness layer that fully rebuilds via `seedFromPostgres()` on backend reconnect.
+
+What's actually in Valkey, and what happens if it's wiped:
+
+| Key | Derivable from Postgres? | Cost of loss |
+|---|---|---|
+| `mitras:online` | yes | reseeded on reconnect |
+| `mitras:deactivated` | yes | reseeded on reconnect |
+| `mitra:capacity:<id>` | yes (`COUNT(*) FROM chat_sessions`) | reseeded on reconnect |
+| `mitra:heartbeat:<id>` | **no** — pure transient liveness | seed writes `NOW`; ≤ a few seconds of fuzz on `last_heartbeat_at` forensics |
+| `availability:snapshot` | recomputable | next beacon poll repopulates |
+
+Reader code in `services/*` has explicit Postgres fallbacks for every Valkey op, so the cold-cache window during a restart degrades performance, not correctness.
+
+### Persistence recommendation by environment
+
+| Environment | Setting | Reason |
+|---|---|---|
+| **Dev / local** | No persistence (`--save "" --appendonly no`; note that the stock defaults still write periodic RDB snapshots, which is also fine here) | Restarts wipe state; reseed handles it cleanly; zero disk overhead |
+| **Staging** | AOF on (`--appendonly yes`) | Verifies prod-like behavior; tiny disk cost |
+| **Production** | AOF on, optionally RDB too (`--appendonly yes --save 60 1000`) | Eliminates cold-cache window after restart; trivial disk footprint (few MB) |
+
+The application code is identical across all three — persistence is a deploy-time knob, not a code-level concern.
+
+### Self-hosted Valkey (current state, dev/staging)
+
+Docker container on the existing VM. Reference config:
+
+```yaml
+valkey:
+  image: valkey/valkey:7-alpine
+  command: valkey-server --appendonly yes --save 60 1000
+  volumes:
+    - valkey-data:/data
+  ports:
+    - "6379:6379"
+  restart: unless-stopped
+```
+
+Backend reaches it via `VALKEY_URL=redis://<vm-ip>:6379` in `backend/.env` (or Cloud Run env var).
+
+### Memorystore migration (when going to prod)
+
+The reseed-from-Postgres flow makes migration trivial — Valkey state is never load-bearing:
+
+1. Provision **Memorystore for Valkey, Standard tier** (HA with replica) in the same VPC + region as Cloud Run.
+   - Smallest available size (~1 GB) is plenty; actual data footprint is well under 1 MB.
+   - Cost: ~$50/month at minimum sizing in asia-southeast2.
+2. Update Cloud Run env: `VALKEY_URL=redis://<memorystore-internal-ip>:6379`.
+3. Deploy new revision. Cloud Run rolling deploy → new instances seed Memorystore from Postgres; old instances drain on old Valkey.
+4. Shut down old Valkey once traffic has migrated.
+
+**Zero downtime.** No data migration needed (state is derivable). The cold-cache window on new instances is handled by the existing Postgres-fallback reader paths.
+
+### Tier choice rationale
+
+| Tier | When to use | Failover behavior |
+|---|---|---|
+| Self-hosted Docker | Dev, staging | Manual restart; backend reseeds when Valkey comes back |
+| Memorystore Basic | Cost-sensitive single-AZ staging | ~1–5 min outage per maintenance event; backend handles via Postgres fallback |
+| Memorystore Standard (HA) | **Production** | ~30s automatic failover; replica keeps data live |
+
+The system is correct on any tier — HA reduces customer-visible latency spikes during Valkey events from minutes to seconds.
+
+## Cloud Run
+
+(Placeholder — fill in as we make decisions about region, min/max instances, concurrency, secrets manager wiring.)
+
+## Cloud SQL
+
+(Placeholder — pool size, machine type, HA flag, backup retention.)
+
+## Xendit
+
+See [phase5-xendit-plan.md](phase5-xendit-plan.md) for credential setup and webhook URL configuration. Stage 8 (live E2E) is currently blocked on test-mode keys.
+
+## Open ops decisions
+
+- [ ] Confirm Memorystore Standard tier for prod deploy (recommended in [§ Valkey](#valkey)).
+- [ ] Pin GCP region for backend + Cloud SQL + Memorystore (all must match for sub-ms internal latency).
+- [ ] Secrets manager (GCP Secret Manager vs Cloud Run env vars) for `AUTH_JWT_SECRET`, `XENDIT_SECRET_KEY`, etc.
+- [ ] Backup retention policy for Cloud SQL.
+- [ ] CI/CD pipeline for Cloud Run deploys.
--- a/requirement/valkey-online-mirror-plan.md
+++ b/requirement/valkey-online-mirror-plan.md
@@ -0,0 +1,376 @@
+# Valkey mirror for mitra availability — plan
+
+**Status:** approved (2026-05-25). Open issue: heartbeat auth preHandler (see [§ Open issues](#open-issues)).
+**Created:** 2026-05-25
+**Owner:** Ramadhan
+
+## Goal
+
+Move the **read path** for "is this mitra available to blast?" entirely into Valkey at production scale (hundreds of online mitras, thousands of customers polling). Eliminate per-heartbeat Postgres writes. Keep Postgres as the durable source of truth via either real-time mirroring (writes that already exist) or periodic batched mirroring (heartbeats).
+
+## North star
+
+- **Postgres = durable source of truth.** Every fact lives there.
+- **Valkey = read path + ephemeral-write target.** Mirrors Postgres state; reads compute availability from primitive signals.
+- **Valkey unreachable on a read:** fall back to the existing Postgres JOIN query. Outage degrades performance, never breaks pairing.
+- **Valkey unreachable on a write:** log + continue. Reconciliation sweep heals drift.
+- **Valkey restart / cold start:** reseed from Postgres before serving traffic.
+
+## Schema
+
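+Four Valkey structures hold the mirrored state. None has a TTL (heartbeat freshness is computed by the sweep, not by key expiry); the only key with a TTL is the `availability:snapshot` cache described in [§ Shared beacon snapshot cache](#shared-beacon-snapshot-cache):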
+
+| Key | Type | Value / members | Updated by |
+|---|---|---|---|
+| `mitras:online` | SET | mitra UUIDs | `setOnline` SADD; `setOffline` SREM; sweep SREM; bulk-SREM on deactivate |
+| `mitras:deactivated` | SET | mitra UUIDs | `updateMitraStatus(is_active=false)` SADD; `updateMitraStatus(is_active=true)` SREM |
+| `mitra:capacity:<id>` | STRING (integer) | count of active+pending_payment sessions assigned to this mitra | INCR on session accept; DECR on session end/expire/cancel; DECR/INCR pair on reroute |
+| `mitra:heartbeat:<id>` | STRING | ISO 8601 timestamp of last heartbeat | heartbeat handler `SET`; `setOnline` `SET` (seed); `setOffline` `DEL` |
+
+Postgres mirror columns (durable):
+- `mitra_online_status.is_online` — written by `setOnline` / `setOffline` / sweep. Already in schema.
+- `mitra_online_status.last_heartbeat_at` — written by the **batched heartbeat mirror** every 60s (NOT per-ping). Already in schema.
+- `mitras.is_active` — written by `updateMitraStatus`. Already in schema.
+- `chat_sessions.status` + `mitra_id` — already source-of-truth for session counts.
+
+## Read path (computing "available for blast")
+
+All reads compute on the fly from Valkey primitives. No memoized `mitras:available` set.
+
+```js
+const findAvailableMitras = async () => {
+  // 1. online minus deactivated
+  const candidates = await valkey.sdiff('mitras:online', 'mitras:deactivated')
+  if (!candidates.length) return []
+
+  // 2. fetch capacity + heartbeat for each candidate (pipelined: 1 roundtrip)
+  const pipe = valkey.pipeline()
+  for (const id of candidates) {
+    pipe.get(`mitra:capacity:${id}`)
+    pipe.get(`mitra:heartbeat:${id}`)
+  }
+  const results = await pipe.exec()
+
+  // 3. filter by capacity + heartbeat freshness
+  const { max_customers_per_mitra } = await getMaxCustomersPerMitra()
+  const { stale_after_seconds } = await getMitraPingConfig()
+  const cutoff = Date.now() - stale_after_seconds * 1000
+
+  const eligible = []
+  for (let i = 0; i < candidates.length; i++) {
+    const count = Number(results[i * 2][1] ?? 0)
+    const heartbeatAt = results[i * 2 + 1][1]
+    if (count >= max_customers_per_mitra) continue
+    if (!heartbeatAt || Date.parse(heartbeatAt) < cutoff) continue
+    eligible.push({ id: candidates[i], active_session_count: count })
+  }
+  return eligible
+}
+```
+
+**Cost at prod scale (300 online):** 1 `SDIFF` + 600 `GET` (pipelined) = ~1ms. Negligible.
+
+### Read sites — what changes
+
+| Caller | Today | After |
+|---|---|---|
+| `isMitraReachable(mitraId)` ([mitra-status.service.js:215](../backend/src/services/mitra-status.service.js#L215)) | `SELECT is_online ...` | `SISMEMBER mitras:online` + check `mitra:heartbeat:<id>` freshness |
+| `findAvailableMitras` ([pairing.service.js:79](../backend/src/services/pairing.service.js#L79)) | full JOIN with chat_sessions | Valkey-driven (above) |
+| `countAvailableMitrasFromCache` ([mitra-status.service.js:181](../backend/src/services/mitra-status.service.js#L181)) | full JOIN, cached in-process 10s | Valkey-driven, cached in **Valkey** 10s (shared cluster-wide) |
+| Dashboard online count ([dashboard.service.js:16](../backend/src/services/dashboard.service.js#L16)) | `COUNT(*) WHERE is_online=true` | `SCARD mitras:online` |
+| `getStatus(mitraId)` (mitra's own status poll) | full SELECT | Hybrid: `SISMEMBER` for `is_online`, Postgres for timestamps |
+| `getOnlineMitras` (CC dashboard) | full JOIN with display_name + active_session_count | **unchanged** — low volume, joins make sense in SQL |
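
The heartbeat freshness check shared by `isMitraReachable` and `findAvailableMitras` can be factored into one pure helper. This is a minimal sketch: the name `isHeartbeatFresh` and the injectable `nowMs` clock are assumptions, not part of the plan. Unlike the inline check in the plan, it also treats an unparsable timestamp as stale.

```javascript
// Hypothetical helper (not in the plan): true when the ISO 8601 heartbeat
// timestamp is no older than staleAfterSeconds relative to nowMs.
const isHeartbeatFresh = (heartbeatIso, staleAfterSeconds, nowMs = Date.now()) => {
  if (!heartbeatIso) return false // missing key: never pinged, or removed by setOffline
  const at = Date.parse(heartbeatIso)
  if (Number.isNaN(at)) return false // assumption: unparsable values count as stale
  return at >= nowMs - staleAfterSeconds * 1000
}

// Fixed inputs so the result does not depend on the wall clock.
const sampleNow = Date.parse('2026-05-25T10:00:00.000Z')
console.log(isHeartbeatFresh('2026-05-25T09:59:30.000Z', 60, sampleNow)) // true (30s old)
console.log(isHeartbeatFresh('2026-05-25T09:58:00.000Z', 60, sampleNow)) // false (120s old)
console.log(isHeartbeatFresh(null, 60, sampleNow)) // false
```

Passing the clock in explicitly keeps the helper testable without fake timers.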
+
+### Reader fallback
+
+Every Valkey read is wrapped:
+
+```js
+try {
+  return await valkey.sismember('mitras:online', mitraId)
+} catch (err) {
+  log.warn({ err }, '[mitras:online] valkey unreachable, falling back to DB')
+  const [row] = await sql`SELECT is_online FROM mitra_online_status WHERE mitra_id = ${mitraId}`
+  return Boolean(row?.is_online)
+}
+```
+
+For `findAvailableMitras`, the fallback is the existing Postgres JOIN query.
+
+## Write paths
+
+Each Postgres write is followed by a Valkey write. Postgres always commits first. Valkey failures log + continue.
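
The ordering rule above (Postgres commits first, Valkey is best-effort) can be captured in one wrapper. This is a sketch under assumptions: `writeThrough` and its parameter names are hypothetical, and the `log.warn({ err }, msg)` call only mirrors the logging style used elsewhere in this plan.

```javascript
// Hypothetical wrapper: run the Postgres write first, then attempt the Valkey
// mirror without letting a Valkey failure fail the overall call.
const writeThrough = async ({ writePostgres, mirrorValkey, log = console }) => {
  const result = await writePostgres() // if this throws, nothing is mirrored
  try {
    await mirrorValkey(result)
  } catch (err) {
    // Best-effort: the reconciliation sweep is expected to heal this drift.
    log.warn({ err }, '[valkey-mirror] write failed, continuing')
  }
  return result
}

// Demo with stubs standing in for the real SQL and Valkey calls.
const demoWriteThrough = async () => {
  const calls = []
  const result = await writeThrough({
    writePostgres: async () => { calls.push('postgres'); return { mitra_id: 'm1' } },
    mirrorValkey: async () => { calls.push('valkey'); throw new Error('ECONNREFUSED') },
    log: { warn: () => calls.push('warn') },
  })
  return { calls, result }
}
```

In the demo the Valkey step throws, yet the caller still receives the Postgres result and only a warning is logged.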
+
+### Online / offline (mitra app toggle)
+
+| Action | Postgres | Valkey |
+|---|---|---|
+| `setOnline(mitraId)` | `UPDATE mitra_online_status SET is_online=true, last_online_at=NOW(), last_heartbeat_at=NOW()` | `SADD mitras:online <id>` + `SET mitra:heartbeat:<id> <now-iso>` |
+| `setOffline(mitraId)` | `UPDATE mitra_online_status SET is_online=false, last_offline_at=NOW()` | `SREM mitras:online <id>` + `DEL mitra:heartbeat:<id>` |
+
+The `SET heartbeat` on `setOnline` is critical: without it, the very next sweep would mark the freshly-online mitra stale (their first real heartbeat is up to `heartbeat_cadence_seconds` away).
+
+### Admin activate / deactivate (CC)
+
+[`updateMitraStatus`](../backend/src/services/mitra.service.js) ([mitra.service.js:32](../backend/src/services/mitra.service.js#L32)):
+
+| Action | Postgres | Valkey |
+|---|---|---|
+| activate (`is_active=true`) | `UPDATE mitras SET is_active=true` | `SREM mitras:deactivated <id>` |
+| deactivate (`is_active=false`) | `UPDATE mitras SET is_active=false` | `SADD mitras:deactivated <id>` |
+
+### Mitra heartbeat (hot path)
+
+Heartbeat handler ([mitra-status.service.js:79](../backend/src/services/mitra-status.service.js#L79)) is rewritten to operate entirely against Valkey:
+
+1. `authenticate` plugin verifies JWT signature + expiry + `userType === 'mitra'` (no DB).
+2. `SISMEMBER mitras:deactivated <userId>` → if true, return `403`.
+3. `SET mitra:heartbeat:<userId> <now-iso>`.
+
+Steps 2 + 3 pipelined into one Valkey roundtrip.
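
Steps 2 and 3 might look like the sketch below once the JWT has been verified. `handleHeartbeat` is a hypothetical name and `client` is assumed to be an ioredis-compatible client. Because both commands share one pipeline, the `SET` still runs for a deactivated mitra; that appears harmless as long as readers exclude `mitras:deactivated`, but it is worth confirming.

```javascript
// Sketch of the heartbeat hot path after JWT verification (assumed names).
const handleHeartbeat = async (client, userId, now = new Date()) => {
  const results = await client
    .pipeline()
    .sismember('mitras:deactivated', userId)
    .set(`mitra:heartbeat:${userId}`, now.toISOString())
    .exec()
  // ioredis returns one [err, value] pair per queued command.
  for (const [err] of results) if (err) throw err
  const deactivated = results[0][1] === 1
  return deactivated ? { status: 403 } : { status: 200 }
}
```

Throwing on a per-command error lets the route fall back to the full DB `resolveMitra` path when Valkey is unhealthy.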
+
+| | Today | After |
+|---|---|---|
+| Auth check | `getMitraById` SELECT + `is_active` check (1 DB read) | `SISMEMBER mitras:deactivated` (Valkey only) |
+| Body | `UPDATE mitra_online_status SET last_heartbeat_at=NOW() WHERE id=? AND is_online=true` (1 DB write) | `SET mitra:heartbeat:<id>` (Valkey only) |
+| Postgres ops per ping | 2 | **0** |
+| Valkey ops per ping | 0 | 2 (one pipelined roundtrip) |
+| At prod scale (300 online × 2 pings/min) | 1,200 DB ops/min | 1,200 Valkey ops/min, **0 DB ops/min** |
+
+`mitras:deactivated` is already maintained on every CC `updateMitraStatus` call (see [§ Admin activate / deactivate](#admin-activate--deactivate-cc)) so deactivation propagates to the heartbeat path within one Valkey write window (~ms).
+
+### Heartbeat → Postgres batched mirror
+
+A background job runs every `HEARTBEAT_MIRROR_INTERVAL_SECONDS` (env, default 60). One SQL statement, touches all online mitras at once:
+
+```js
+const mirrorHeartbeatsToPostgres = async () => {
+  const onlineIds = await valkey.smembers('mitras:online')
+  if (!onlineIds.length) return
+
+  const pipe = valkey.pipeline()
+  for (const id of onlineIds) pipe.get(`mitra:heartbeat:${id}`)
+  const results = await pipe.exec()
+
+  const pairs = []
+  for (let i = 0; i < onlineIds.length; i++) {
+    const ts = results[i][1]
+    if (ts) pairs.push({ mitra_id: onlineIds[i], ts })
+  }
+  if (!pairs.length) return
+
+  await sql`
+    UPDATE mitra_online_status m
+    SET last_heartbeat_at = u.ts::timestamptz, updated_at = NOW()
+    FROM (SELECT * FROM UNNEST(
+      ${sql.array(pairs.map(p => p.mitra_id))}::uuid[],
+      ${sql.array(pairs.map(p => p.ts))}::text[]
+    ) AS t(mitra_id, ts)) u
+    WHERE m.mitra_id = u.mitra_id
+  `
+}
+```
+
+**Cost at prod scale (300 online):** 1 batched UPDATE per minute per instance. At 3 instances = 3 statements/min cluster-wide (redundant but idempotent: latest timestamp wins), versus 300 × 2 = 600 per-ping UPDATE statements/min today. That is a 200× reduction in statement count.
+
+**No leader election initially.** The 3 redundant UPDATEs/min are idempotent and still 200× fewer statements than today. If WAL pressure becomes visible, add a Valkey-NX lease leader-elect (~15 LOC); deferred.
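
For reference, the deferred Valkey-NX lease could be as small as the sketch below. The key name `lease:heartbeat-mirror`, the 55-second TTL, and both function names are illustrative assumptions; `client` is assumed to be ioredis-compatible, where `set(key, value, 'EX', seconds, 'NX')` resolves to `'OK'` on success and `null` when the key already exists.

```javascript
// Sketch of a Valkey-NX lease (assumed key name and TTL).
const tryAcquireLease = async (client, instanceId, ttlSeconds = 55) => {
  // SET key value EX ttl NX: succeeds only if no instance currently holds the lease.
  const reply = await client.set('lease:heartbeat-mirror', instanceId, 'EX', ttlSeconds, 'NX')
  return reply === 'OK'
}

// Run the mirror job only on the instance that wins the lease for this cycle.
const mirrorIfLeader = async (client, instanceId, mirrorHeartbeatsToPostgres) => {
  if (!(await tryAcquireLease(client, instanceId))) return false // another instance has it
  await mirrorHeartbeatsToPostgres()
  return true
}
```

A TTL slightly below the mirror interval lets the lease lapse before the next cycle, so no explicit release is needed.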
+
+### Session capacity counter
+
+Touch all four services that mutate session state:
+
+| Trigger | File | Postgres write (existing) | Valkey write |
+|---|---|---|---|
+| Mitra accepts a chat (status → `active`, mitra_id set) | `pairing.service.js` accept handler | `UPDATE chat_sessions SET status='active', mitra_id=?` | `INCR mitra:capacity:<mitra_id>` |
+| Session ends (status → `ended` / `expired` / `cancelled`) | `closure.service.js`, expiry sweepers | `UPDATE chat_sessions SET status=...` | `DECR mitra:capacity:<mitra_id>` |
+| Session reroute (mitra A → mitra B) | `session.service.js` | `UPDATE chat_sessions SET mitra_id=B` | `DECR mitra:capacity:A` + `INCR mitra:capacity:B` (pipelined) |
+
+**What counts as occupying capacity:** sessions in `ACTIVE` or `PENDING_PAYMENT` status with a non-null `mitra_id`. This matches the existing `findAvailableMitras` predicate. Extension flow (active → pending_payment → active) does NOT change capacity — mitra stays occupied throughout.
+
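+**Negative-counter guard:** clamp the counter at zero (the equivalent of `Math.max(0, ...)`), either atomically in a Lua script or with a `GET` check that resets the key to 0 instead of issuing `DECR` when the counter is already zero. This prevents drift if a DECR fires without a prior INCR (e.g. a legacy session created before the Valkey mirror existed). The reconciliation sweep recomputes from Postgres anyway.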
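
A minimal sketch of the `GET`-check variant of this guard follows. The name `decrementCapacity` matches the helper listed under Files touched, but the body is an assumption, and `client` is assumed to be ioredis-compatible. The check is not atomic, so two concurrent decrements can still race; the Lua variant or the reconciliation sweep covers that case.

```javascript
// Sketch: decrement mitra:capacity:<id> but never let it drop below zero.
const decrementCapacity = async (client, mitraId) => {
  const key = `mitra:capacity:${mitraId}`
  const current = Number((await client.get(key)) ?? 0) // missing key counts as 0
  if (current <= 0) {
    await client.set(key, 0) // normalise a missing or negative counter
    return 0
  }
  return client.decr(key)
}
```

Returning the new count lets callers log it or assert on it in tests.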
+
+## Auto-offline sweep — Valkey-driven
+
+Replaces the current Postgres seq-scan with a Valkey computation. Runs every `MITRA_AUTO_OFFLINE_SWEEP_SECONDS` (env, default 30):
+
+```js
+const autoOfflineStaleMitras = async () => {
+  const { stale_after_seconds, require_ping } = await getMitraPingConfig()
+  if (!require_ping) return 0
+
+  const onlineIds = await valkey.smembers('mitras:online')
+  if (!onlineIds.length) return 0
+
+  const pipe = valkey.pipeline()
+  for (const id of onlineIds) pipe.get(`mitra:heartbeat:${id}`)
+  const results = await pipe.exec()
+
+  const cutoff = Date.now() - stale_after_seconds * 1000
+  const stale = []
+  for (let i = 0; i < onlineIds.length; i++) {
+    const ts = results[i][1]
+    if (!ts || Date.parse(ts) < cutoff) stale.push(onlineIds[i])
+  }
+  if (!stale.length) return 0
+
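+  // Postgres: bulk flip is_online=false. RETURNING yields only the rows this
+  // instance actually flipped, so a concurrent sweep on another instance does
+  // not produce duplicate log rows.
+  const flipped = await sql`
+    UPDATE mitra_online_status
+    SET is_online = false, last_offline_at = NOW(), updated_at = NOW()
+    WHERE mitra_id = ANY(${sql.array(stale)}::uuid[]) AND is_online = true
+    RETURNING mitra_id
+  `
+  // Log rows
+  for (const { mitra_id } of flipped) {
+    await sql`INSERT INTO mitra_online_logs (mitra_id, status) VALUES (${mitra_id}, 'offline')`
+  }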
+  // Valkey: bulk SREM + DEL heartbeat keys
+  const cleanup = valkey.pipeline()
+  cleanup.srem('mitras:online', ...stale)
+  for (const id of stale) cleanup.del(`mitra:heartbeat:${id}`)
+  await cleanup.exec()
+
+  invalidateAvailabilityCache()
+  return stale.length
+}
+```
+
+**Stale threshold:** `stale_after_seconds` is read from `getMitraPingConfig()` ([config.service.js](../backend/src/services/config.service.js)) — the existing CC-tunable knob. Not a new env.
+
+**Sweep cadence:** `MITRA_AUTO_OFFLINE_SWEEP_SECONDS` env, default 30 (matches current hardcoded setInterval).
+
+**Failure semantics:** if any Valkey op throws, the entire sweep aborts for this tick. The next tick retries. We never want to mass-offline mitras due to a transient Valkey hiccup.
+
+## Shared beacon snapshot cache
+
+Replace the in-process `availabilityCache` ([mitra-status.service.js:14](../backend/src/services/mitra-status.service.js#L14)) with a Valkey GET/SETEX key. Even though reads are now sub-ms Valkey ops, this caps total Valkey query load at high beacon-poll rates (5,000 customers × 12/min = 60,000 polls/min → cache → 6 computations/min cluster-wide).
+
+```
+KEY:     availability:snapshot
+TYPE:    string (JSON: {"available": bool, "count": number})
+TTL:     10 seconds
+```
+
+`config:invalidate` pub/sub subscriber extended to `DEL availability:snapshot` on `max_customers_per_mitra` bust.
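
The GET/SETEX read-through for this key could be sketched as follows. `getAvailabilitySnapshot` and the `computeSnapshot` parameter are hypothetical names; `client` is assumed to be ioredis-compatible, where `setex(key, seconds, value)` sets the value together with its TTL.

```javascript
const SNAPSHOT_KEY = 'availability:snapshot'
const SNAPSHOT_TTL_SECONDS = 10

// Sketch: serve the cached snapshot when present, otherwise compute and cache it.
// computeSnapshot stands in for the Valkey-driven availability computation.
const getAvailabilitySnapshot = async (client, computeSnapshot) => {
  const cached = await client.get(SNAPSHOT_KEY)
  if (cached) return JSON.parse(cached)
  const snapshot = await computeSnapshot() // expected shape: { available, count }
  await client.setex(SNAPSHOT_KEY, SNAPSHOT_TTL_SECONDS, JSON.stringify(snapshot))
  return snapshot
}
```

Concurrent cache misses may each recompute once; with a 10-second TTL that is a small, bounded amount of duplicate work.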
+
+## Startup / reconnect / reseed
+
+Three triggers reseed Valkey state from Postgres. All idempotent.
+
+### Backend startup
+
+In [`server.js`](../backend/src/server.js), after Valkey emits `'ready'` and before the public listener binds:
+
+```js
+const onlineRows = await sql`SELECT mitra_id FROM mitra_online_status WHERE is_online = true`
+const deactRows = await sql`SELECT id FROM mitras WHERE is_active = false`
+const sessionCountRows = await sql`
+  SELECT mitra_id, COUNT(*)::int AS c FROM chat_sessions
+  WHERE mitra_id IS NOT NULL AND status IN ('active', 'pending_payment')
+  GROUP BY mitra_id
+`
+
+const pipe = valkey.multi()
+pipe.del('mitras:online', 'mitras:deactivated')
+if (onlineRows.length) {
+  pipe.sadd('mitras:online', ...onlineRows.map(r => r.mitra_id))
+  // Seed heartbeat timestamps to NOW so the first sweep doesn't mass-offline
+  // currently-online mitras. They'll refresh on their next ping anyway.
+  const now = new Date().toISOString()
+  for (const r of onlineRows) pipe.set(`mitra:heartbeat:${r.mitra_id}`, now)
+}
+if (deactRows.length) pipe.sadd('mitras:deactivated', ...deactRows.map(r => r.id))
+for (const r of sessionCountRows) pipe.set(`mitra:capacity:${r.mitra_id}`, r.c)
+await pipe.exec()
+```
+
+### ioredis reconnect
+
+Listen to the ioredis `'ready'` event (fires on initial connect AND each reconnect). Re-run the seed.
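
Wiring the reseed to the `'ready'` event might look like this sketch. `reseedOnReady` is a hypothetical name, `seed` stands in for `seedFromPostgres`, and the in-flight guard is an added assumption so that rapid reconnects do not start a second seed while one is still running.

```javascript
// Sketch: re-run the seed on every 'ready' event (initial connect + reconnects).
const reseedOnReady = (client, seed, log = console) => {
  let inFlight = null
  client.on('ready', () => {
    if (inFlight) return // a seed is already running; skip this event
    inFlight = Promise.resolve()
      .then(seed)
      .catch((err) => log.warn({ err }, '[valkey-mirror] reseed failed'))
      .finally(() => { inFlight = null })
  })
}
```

A failed seed is only logged here, on the assumption that the periodic reconciliation sweep retries it.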
+
+### Periodic reconciliation sweeper
+
+`VALKEY_ONLINE_MIRROR_SWEEP_SECONDS` env, default 300. Runs the seed (idempotent — `DEL` + `SADD` + `SET` lands the same state). Belt-and-braces against drift from failed best-effort writes, out-of-band Postgres mutations, Valkey eviction.
+
+## Three sweeps, three cadences (summary)
+
+| Sweep | Purpose | Cadence env | Default | Reads from | Writes to |
+|---|---|---|---|---|---|
+| Auto-offline | Detect stale-heartbeat mitras → flip offline | `MITRA_AUTO_OFFLINE_SWEEP_SECONDS` | 30 | Valkey | Postgres + Valkey |
+| Heartbeat mirror | Persist Valkey heartbeats to Postgres for forensics/backup | `HEARTBEAT_MIRROR_INTERVAL_SECONDS` | 60 | Valkey | Postgres |
+| Reconciliation | Heal Valkey/Postgres drift | `VALKEY_ONLINE_MIRROR_SWEEP_SECONDS` | 300 | Postgres | Valkey |
+
+All three run on every backend instance independently. All idempotent. No leader election required.
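
Resolving the three cadences from the environment could be sketched as below. The defaults come from the table above; `SWEEP_DEFAULTS`, `readSeconds`, and `sweepIntervalsMs` are hypothetical names, and falling back to the default on an invalid value is an assumption about the desired behaviour.

```javascript
// Defaults from the sweep summary table; helper names are hypothetical.
const SWEEP_DEFAULTS = {
  MITRA_AUTO_OFFLINE_SWEEP_SECONDS: 30,
  HEARTBEAT_MIRROR_INTERVAL_SECONDS: 60,
  VALKEY_ONLINE_MIRROR_SWEEP_SECONDS: 300,
}

// Parse one cadence; a missing, non-numeric, or non-positive value uses the default.
const readSeconds = (env, name) => {
  const parsed = Number.parseInt(env[name] ?? '', 10)
  return Number.isInteger(parsed) && parsed > 0 ? parsed : SWEEP_DEFAULTS[name]
}

// Resolve all three cadences to milliseconds, ready to pass to setInterval.
const sweepIntervalsMs = (env = process.env) =>
  Object.fromEntries(
    Object.keys(SWEEP_DEFAULTS).map((name) => [name, readSeconds(env, name) * 1000])
  )

console.log(sweepIntervalsMs({ HEARTBEAT_MIRROR_INTERVAL_SECONDS: '15' }))
```

Keeping the parsing in one place means a typo in an env value degrades to the documented default instead of a `NaN` interval.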
+
+## Multi-instance safety
+
+Cloud Run runs N instances. Each instance:
+
+- Writes both stores on its own mutations. Atomic Valkey ops (`SADD` / `SREM` / `INCR` / `DECR` / `SET`) — no cross-instance coordination needed.
+- Runs all three sweeps independently. Redundant but idempotent.
+- Recompute-on-read for blast eligibility — no stale aggregate to invalidate.
+
+The `mitra:capacity:<id>` counter is the most race-sensitive: `INCR` / `DECR` are atomic but a session-state change must consistently fire exactly one increment and one decrement over its lifetime. The reconciliation sweep recomputes from `chat_sessions` and resets the counter, healing any drift.
+
+## Failure mode summary
+
+| | Behavior |
+|---|---|
+| Valkey unreachable on read | Fall back to Postgres JOIN query |
+| Valkey unreachable on write | Log + continue. Reconciliation sweep heals later. |
+| Postgres unreachable on heartbeat mirror | Skip this cycle. Next cycle writes the latest. |
+| Auto-offline sweep can't reach Valkey | Skip this tick. Mitras stay "online" until Valkey comes back + heartbeat ages out. |
+| Valkey crash (catastrophic) | Backend reconnects → reseed from Postgres. Worst case: ≤60s of `last_heartbeat_at` forensics lost. |
+| Backend crash | Other instances keep running. New instance reseed on startup. |
+
+## Files touched
+
+| File | Change |
+|---|---|
+| `backend/src/plugins/valkey.js` | Add wrappers: `sadd`, `srem`, `sismember`, `smembers`, `sdiff`, `scard`, `set`, `get`, `del`, `incr`, `decr`, `pipeline`/`multi` + `'ready'` reconnect hook |
+| `backend/src/services/config.service.js` | Add `getMitraAutoOfflineSweepSeconds`, `getHeartbeatMirrorIntervalSeconds`, `getValkeyOnlineMirrorSweepSeconds` getters |
+| `backend/src/services/mitra-status.service.js` | Major rewrite (see [§ Read path](#read-path-computing-available-for-blast) and [§ Write paths](#write-paths)). Add `incrementCapacity`, `decrementCapacity`, `mirrorHeartbeatsToPostgres`, `seedFromPostgres` |
+| `backend/src/services/mitra.service.js` | Wrap `updateMitraStatus` with `SADD`/`SREM mitras:deactivated` |
+| `backend/src/services/pairing.service.js` | Rewrite `findAvailableMitras` as Valkey-driven (Postgres fallback). `INCR mitra:capacity` on session accept. |
+| `backend/src/services/closure.service.js` | `DECR mitra:capacity` on session end/expire/cancel |
+| `backend/src/services/session.service.js` | `DECR` + `INCR` pair on reroute |
+| `backend/src/services/dashboard.service.js` | `SCARD mitras:online` for online count |
+| `backend/src/routes/public/mitra.status.routes.js` | Replace `resolveMitra` on `POST /heartbeat` with a Valkey `SISMEMBER mitras:deactivated` check (keep `resolveMitra` on `/online`, `/offline`, `GET /`) |
+| `backend/src/server.js` | Call `seedFromPostgres` on startup (before listener binds); replace hardcoded 30_000 setInterval with env-driven cadence; register heartbeat mirror + reconciliation sweep intervals |
+| `backend/.env.example` | Document `MITRA_AUTO_OFFLINE_SWEEP_SECONDS`, `HEARTBEAT_MIRROR_INTERVAL_SECONDS`, `VALKEY_ONLINE_MIRROR_SWEEP_SECONDS` |
+| `backend/test/services/mitra-status.service.test.js` | Add tests (see [§ Test plan](#test-plan)) |
+| `backend/test/services/pairing.service.test.js` | Update for Valkey-driven `findAvailableMitras` |
+| `backend/test/helpers/valkey.js` (new if absent) | Test helper for clean-slate Valkey state per test |
+
+**Estimated touch:** ~400 LOC + ~200 LOC tests. ~2 days focused work.
+
+## Test plan
+
+### Unit
+1. Mock Valkey; verify each writer calls Postgres → Valkey in the correct order, including the heartbeat seed on `setOnline` and the heartbeat `DEL` on `setOffline`.
+2. Verify reader fallback path runs when Valkey ops throw.
+3. Verify auto-offline sweep aborts entirely when Valkey ops throw (does NOT mass-offline via Postgres-only path).
+4. Verify capacity counter never goes negative (Math.max guard).
+
+### Integration (real Valkey + Postgres)
+1. **Seed correctness:** insert N online rows + M deactivated + session counts in DB; run startup seed; verify all four Valkey structures match.
+2. **Heartbeat refresh:** `setHeartbeat()` → verify Valkey value updates; check that the periodic mirror writes Postgres `last_heartbeat_at` within one mirror cycle.
+3. **Auto-offline:** insert online mitra, manually expire heartbeat by setting timestamp in past, run sweep, verify Postgres `is_online=false` + Valkey `SREM` + `DEL` heartbeat key.
+4. **Capacity lifecycle:** simulate session accept → end across multiple mitras; verify counter increments/decrements; verify reroute moves count from A to B atomically.
+5. **Restart resilience:** seed state, simulate Valkey restart (FLUSHALL), trigger reconnect handler, verify all four structures reseed correctly.
+6. **Reconciliation:** corrupt Valkey (random SADD of non-existent mitra, wrong capacity counter, missing entries); run reconciliation sweep; verify convergence to Postgres state.
+7. **Fallback:** disable Valkey mid-test; verify `findAvailableMitras` falls back to Postgres JOIN query and returns sensible results.
+
+### Regression
+- The 90 (of 92) existing backend tests that pass today should still pass.
+- Maestro flows for pairing (ts-customer-*) should pass unchanged.
+
+## Decisions (locked 2026-05-25)
+
+1. **`revokeAllAuthSessions(mitraId)` added to `updateMitraStatus`** in the same PR. Bounds the deactivation gap to access-token TTL across all mitra routes (not just heartbeat).
+2. **Prod Valkey: Memorystore for Valkey, Standard tier** (HA with replica, smallest available capacity ~1GB). Built-in replication keeps heartbeat timestamps live across failover. Staging/dev can run Basic tier — the reseed-from-Postgres flow handles cold-cache restarts correctly either way.
+3. **Keep `last_heartbeat_at` column.** Written by the 60s batched mirror; remains available for operator forensics ("when was X last seen?"). Drop only if a future audit confirms no consumer reads it.
+
+## Future phases (deferred)
+
+- **Heartbeat → Valkey TTL with keyspace notifications.** Replace timestamp-comparison sweep with `notify-keyspace-events Ex` → instant detection of expired heartbeats. Requires Memorystore config change + a subscriber. Defer until 30s detection lag is the visible bottleneck.
+- **Leader-elected mirror/sweep.** Use a Valkey-NX lease so only one instance runs each background job. ~15 LOC each. Defer until the redundant work shows up in metrics.
--- a/requirement/valkey-online-mirror-testing.md
+++ b/requirement/valkey-online-mirror-testing.md
@@ -0,0 +1,272 @@
+# Valkey Availability Mirror — Testing Checklist
+
+End-to-end verification for the Valkey-mirror refactor described in [valkey-online-mirror-plan.md](valkey-online-mirror-plan.md).
+
+Cluster labels: **[BE]** backend / curl / SQL / Valkey-cli, **[CC]** control_center, **[M]** mitra_app, **[C]** client_app.
+
+> **Run order:** Section A first (seed + startup) — every subsequent section assumes a fresh seed. Sections B–J are otherwise independent.
+
+---
+
+## Setup
+
+- [ ] Backend running on `192.168.88.247:3000` (public) + `:3001` (internal) — `curl http://192.168.88.247:3000/api/shared/auth-providers` returns 200
+- [ ] Valkey reachable from backend (`VALKEY_URL` matches running instance; `[valkey] subscribed to config:invalidate` appears in backend boot log)
+- [ ] Postgres reachable; backend boot log shows `[valkey-mirror] seed: X online, Y deactivated, Z with active sessions`
+- [ ] At least 3 mitra accounts exist in DB (we need to flip them online/offline/deactivated across tests)
+- [ ] One customer account ready for the blast scenarios
+- [ ] Helpful aliases for verification (run from `backend/`):
+  ```bash
+  alias vk='node --env-file=.env -e'
+  # Then in tests: vk "(async()=>{const v=(await import('./src/plugins/valkey.js')).getValkeyClient();
+  #   console.log(await v.smembers('mitras:online'));process.exit(0)})()"
+  ```
+
+---
+
+## Section A — Seed + Startup
+
+Verifies `seedFromPostgres()` populates Valkey correctly from Postgres truth.
+
+- [ ] **[BE]** Restart backend; log shows one `[valkey-mirror] seed: N online, M deactivated, K with active sessions` line on startup
+- [ ] **[BE]** Counts in the log match Postgres truth:
+  ```sql
+  SELECT
+    (SELECT COUNT(*) FROM mitra_online_status WHERE is_online=true) AS online,
+    (SELECT COUNT(*) FROM mitras WHERE is_active=false) AS deactivated,
+    (SELECT COUNT(DISTINCT mitra_id) FROM chat_sessions
+     WHERE mitra_id IS NOT NULL AND status IN ('active','pending_payment')) AS with_sessions;
+  ```
+- [ ] **[BE]** Valkey contents match — `SMEMBERS mitras:online` returns the same IDs as `SELECT mitra_id FROM mitra_online_status WHERE is_online=true`
+- [ ] **[BE]** Valkey `mitra:heartbeat:<id>` exists for every currently-online mitra, value is a recent ISO timestamp (within seed time)
+- [ ] **[BE]** Valkey `mitra:capacity:<id>` matches `SELECT COUNT(*) FROM chat_sessions WHERE mitra_id=<id> AND status IN ('active','pending_payment')` for every online mitra
+- [ ] **[BE]** `SMEMBERS mitras:deactivated` matches `SELECT id FROM mitras WHERE is_active=false`
+
+### Reconnect re-seed
+
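+- [ ] **[BE]** With backend running, run `FLUSHDB` on Valkey and then restart Valkey (a `FLUSHDB` alone keeps the connection open, so no reconnect fires); wait ~5s for the ioredis reconnect and verify a second `[valkey-mirror] seed:` log entry appears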
+- [ ] **[BE]** All four Valkey structures are rebuilt and match Postgres again
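+
+The "Valkey matches Postgres" checks above come down to a set comparison. A small sketch of that comparison, assuming the IDs have already been fetched into plain arrays (for example from `SMEMBERS` and the matching `SELECT`); `diffSets` is a hypothetical helper, not part of the backend:
+
+```javascript
+// Compare the IDs Postgres says belong in a set against what Valkey holds.
+// Both arguments are plain arrays of mitra ID strings.
+function diffSets(expected, actual) {
+  const exp = new Set(expected);
+  const act = new Set(actual);
+  return {
+    missing: [...exp].filter((id) => !act.has(id)).sort(), // in Postgres, not in Valkey
+    extra: [...act].filter((id) => !exp.has(id)).sort(), // in Valkey, not in Postgres
+  };
+}
+
+// Hypothetical IDs: "b" never reached Valkey, "z" is a leftover.
+const drift = diffSets(['a', 'b', 'c'], ['a', 'c', 'z']);
+console.log(drift);
+```
+
+A clean seed should give empty `missing` and `extra` arrays for both `mitras:online` and `mitras:deactivated`.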
+
+---
+
+## Section B — Online / Offline toggle (write-through)
+
+Verifies `setOnline` / `setOffline` writes both stores in the right order.
+
+- [ ] **[M]** Mitra taps "online" → backend updates Postgres `mitra_online_status.is_online=true`
+- [ ] **[BE]** `SISMEMBER mitras:online <mitra-id>` returns `1` within 1s of the toggle
+- [ ] **[BE]** `GET mitra:heartbeat:<mitra-id>` returns a fresh ISO timestamp (within seconds of the toggle)
+- [ ] **[M]** Mitra taps "offline"
+- [ ] **[BE]** Postgres `is_online=false`, Valkey `SISMEMBER mitras:online <id>` returns `0`, `GET mitra:heartbeat:<id>` returns `nil`
+- [ ] **[BE]** `mitra_online_logs` has paired `online` / `offline` audit rows
+
+### Valkey-failure best-effort write
+
+- [ ] **[BE]** Stop Valkey, then toggle a mitra online → request succeeds (200), backend log shows `[valkey-mirror] setOnline <id> failed:` but Postgres is updated correctly
+- [ ] **[BE]** Restart Valkey → reconciliation sweep (≤ 300s default) eventually rebuilds the SET to include this mitra
+
+---
+
+## Section C — Heartbeat path
+
+Verifies the rewrite: per-ping = 1 Valkey write, 0 DB writes.
+
+- [ ] **[M]** Mitra online for at least one heartbeat cycle (~30s)
+- [ ] **[BE]** Watch the Postgres query log during heartbeats: **no `UPDATE mitra_online_status SET last_heartbeat_at` statement fires per ping** (only the batched mirror, default every 60s)
+- [ ] **[BE]** `GET mitra:heartbeat:<id>` value advances on each ping (re-check ~30s later, timestamp moves forward)
+- [ ] **[BE]** After 60s+ wait, `SELECT last_heartbeat_at FROM mitra_online_status WHERE mitra_id=<id>` advances (heartbeat mirror sweep ran)
+
+### Deactivation guard via Valkey
+
+- [ ] **[CC]** Admin deactivates the mitra (Phase 5 path: `is_active=false`)
+- [ ] **[BE]** `SISMEMBER mitras:deactivated <id>` immediately returns `1`
+- [ ] **[M]** Mitra app's next heartbeat → backend returns `403 ACCOUNT_INACTIVE`
+- [ ] **[BE]** No Postgres SELECT on `mitras` table for that heartbeat (verify with query log) — guard is pure Valkey
+
+### Fallback when Valkey is down
+
+- [ ] **[BE]** Stop Valkey
+- [ ] **[M]** Mitra app heartbeats → backend logs `[heartbeat] valkey check failed, falling back to DB`, request still succeeds for active mitra, returns `403` for deactivated mitra (full DB-backed resolveMitra path)
+- [ ] **[BE]** Restart Valkey → next heartbeat uses Valkey path again (no fallback log line)
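+
+The guard behavior exercised in this section can be sketched as follows. This is an illustration under assumptions, not the actual `heartbeatGuard`: the function name, the return shape, and the injected `resolveFromDb` callback (standing in for the DB-backed `resolveMitra` path) are hypothetical. The only client call assumed is ioredis-style `sismember`, which resolves to `1` or `0`.
+
+```javascript
+// Valkey-first deactivation check with a Postgres fallback on Valkey error.
+async function heartbeatGuardSketch(mitraId, valkey, resolveFromDb) {
+  try {
+    const deactivated = await valkey.sismember('mitras:deactivated', mitraId);
+    return deactivated === 1
+      ? { status: 403, code: 'ACCOUNT_INACTIVE', via: 'valkey' }
+      : { status: 200, via: 'valkey' };
+  } catch (err) {
+    // Valkey unreachable: do not accept blindly, ask Postgres instead.
+    const mitra = await resolveFromDb(mitraId);
+    return mitra && mitra.is_active
+      ? { status: 200, via: 'db' }
+      : { status: 403, code: 'ACCOUNT_INACTIVE', via: 'db' };
+  }
+}
+
+// Hypothetical fakes for the two Valkey states and the DB lookup.
+const healthyValkey = { sismember: async (_key, id) => (id === 'm-deactivated' ? 1 : 0) };
+const downValkey = { sismember: async () => { throw new Error('connection refused'); } };
+const resolveFromDb = async (id) => ({ id, is_active: id !== 'm-deactivated' });
+```
+
+The point of the fallback branch is that a Valkey outage changes where the answer comes from, not what the answer is: a deactivated mitra gets `403` on both paths.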
+
+---
+
+## Section D — Capacity counter
+
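+Verifies that the counter tracks the session lifecycle via `recomputeCapacityForMitra`, which recomputes the count from Postgres after each `chat_sessions` UPDATE rather than applying per-transition INCR/DECR.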
+
+- [ ] **[BE]** Reset: pick a mitra with `mitra:capacity:<id> = 0`
+- [ ] **[C]** Customer pays → mitra accepts the blast → chat starts
+- [ ] **[BE]** `GET mitra:capacity:<id>` returns `1` within 1s of mitra accepting
+- [ ] **[C]** Second customer pays → same mitra accepts (assuming `max_customers_per_mitra >= 2`)
+- [ ] **[BE]** `GET mitra:capacity:<id>` returns `2`
+- [ ] **[C]** First session ends naturally (timer expires + goodbye flow completes)
+- [ ] **[BE]** `GET mitra:capacity:<id>` returns `1` within 1s
+- [ ] **[C]** Second session ends
+- [ ] **[BE]** `GET mitra:capacity:<id>` returns `0`
+
+### Reroute
+
+- [ ] **[CC]** Reroute an active session from mitra A → mitra B
+- [ ] **[BE]** `mitra:capacity:A` decrements and `mitra:capacity:B` increments, each recomputed right after the chat_sessions UPDATE (Postgres first, Valkey best-effort)
+
+### Capacity gates blast
+
+- [ ] **[BE]** Set `mitra:capacity:<id>` to `max_customers_per_mitra` directly (`SET mitra:capacity:<id> 3`)
+- [ ] **[C]** Customer pays → blast → **this mitra is excluded** (verify with `chat_request_notifications` — no row for this mitra)
+
+---
+
+## Section E — Deactivation flow (full propagation)
+
+Verifies `updateMitraStatus` + `revokeAllSessionsForUser` close the deactivation gap.
+
+- [ ] **[BE]** Mitra has an active access token (capture from `/api/mitra/auth/otp/verify` or use existing logged-in session)
+- [ ] **[BE]** Confirm mitra can call protected routes (`curl -H "Authorization: Bearer <token>" /api/mitra/...`)
+- [ ] **[CC]** Admin deactivates the mitra
+- [ ] **[BE]** Postgres: `mitras.is_active=false`, `auth_sessions.revoked_at IS NOT NULL` for this mitra
+- [ ] **[BE]** Valkey: `SISMEMBER mitras:deactivated <id>` = 1
+- [ ] **[BE]** Mitra's current access token still works on routes that don't re-check active state (stateless JWT) — bounded by access-token TTL
+- [ ] **[BE]** Mitra's heartbeat returns `403 ACCOUNT_INACTIVE` immediately (Valkey check on hot path)
+- [ ] **[BE]** Mitra's next refresh token attempt fails (because `auth_sessions.revoked_at` was set) → app effectively logs them out
+
+### Re-activation
+
+- [ ] **[CC]** Re-activate the mitra
+- [ ] **[BE]** Postgres `is_active=true`, Valkey `SISMEMBER mitras:deactivated <id>` = 0
+- [ ] **[M]** Mitra re-logs in, heartbeats again successfully
+
+---
+
+## Section F — Read paths (Valkey-first)
+
+Verifies all reads use Valkey with Postgres fallback.
+
+### `isMitraReachable`
+
+- [ ] **[BE]** Online mitra with fresh heartbeat → `isMitraReachable` returns `true` (call via `node -e` or any route that uses it, e.g. extension flow)
+- [ ] **[BE]** Manually set `mitra:heartbeat:<id>` to a timestamp older than `stale_after_seconds` → `isMitraReachable` returns `false` (even though `is_online=true` in Postgres)
+- [ ] **[BE]** Stop Valkey → `isMitraReachable` logs `[isMitraReachable] valkey unavailable, falling back to DB` and returns based on Postgres `is_online`
+
+### `findAvailableMitras` (blast)
+
+- [ ] **[BE]** Run `findAvailableMitras` (e.g. trigger a customer blast) — log shows Valkey path used (no warning about fallback)
+- [ ] **[BE]** Result IDs match `SDIFF mitras:online mitras:deactivated` filtered by capacity + heartbeat freshness
+- [ ] **[BE]** Stop Valkey → next blast logs `[findAvailableMitras] valkey unavailable, falling back to Postgres` and still returns correct results
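+
+To compute the expected result IDs by hand, the filter chain can be sketched as a pure function over data already read from Valkey. `filterAvailable` and its parameter names are hypothetical; treating a missing capacity key as `0` and using an inclusive freshness boundary are assumptions, not confirmed behavior of `findAvailableMitras`.
+
+```javascript
+// online / deactivated: arrays of IDs; capacity / heartbeat: maps keyed by ID.
+function filterAvailable({ online, deactivated, capacity, heartbeat, maxPerMitra, staleAfterSeconds, nowMs }) {
+  const blocked = new Set(deactivated);
+  return online
+    .filter((id) => !blocked.has(id)) // equivalent of SDIFF mitras:online mitras:deactivated
+    .filter((id) => Number(capacity[id] ?? 0) < maxPerMitra) // room for one more session
+    .filter((id) => {
+      const last = Date.parse(heartbeat[id]); // NaN when the key is missing
+      return !Number.isNaN(last) && nowMs - last <= staleAfterSeconds * 1000;
+    });
+}
+
+// Hypothetical snapshot: b is deactivated, c is at capacity, d has a stale heartbeat.
+const available = filterAvailable({
+  online: ['a', 'b', 'c', 'd'],
+  deactivated: ['b'],
+  capacity: { a: '0', c: '3', d: '1' },
+  heartbeat: {
+    a: '2026-05-25T09:59:50Z',
+    c: '2026-05-25T09:59:55Z',
+    d: '2026-05-25T09:58:00Z',
+  },
+  maxPerMitra: 3,
+  staleAfterSeconds: 45,
+  nowMs: Date.parse('2026-05-25T10:00:00Z'),
+});
+console.log(available); // [ 'a' ]
+```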
+
+### `countAvailableMitrasFromCache` (customer beacon)
+
+- [ ] **[BE]** `curl /public/bestie-availability` returns `{available: bool, count: N}` matching reality
+- [ ] **[BE]** `GET availability:snapshot` in Valkey shows cached JSON within 10s of last poll
+- [ ] **[BE]** Multiple rapid polls (5+ per second from 3 different IPs) → only one Valkey-driven recompute per 10s (on cache miss); Postgres query log shows **zero** mitra-availability JOINs in steady state
+- [ ] **[CC]** Operator changes `max_customers_per_mitra` → `availability:snapshot` is `DEL`d (cache bust), next poll recomputes
+
+### Dashboard online count
+
+- [ ] **[CC]** Dashboard "Online Mitras" stat matches `SCARD mitras:online` in Valkey
+- [ ] **[BE]** Verify the dashboard query no longer hits `SELECT COUNT(*) FROM mitra_online_status WHERE is_online=true` (check query log)
+
+---
+
+## Section G — Auto-offline sweep (Valkey-driven)
+
+Verifies stale heartbeats → flipped offline via Valkey diff.
+
+- [ ] **[BE]** Set `MITRA_AUTO_OFFLINE_SWEEP_SECONDS=10` in env for faster test cycle, restart backend
+- [ ] **[CC]** Set `stale_after_seconds=15` in CC settings (or directly in `app_config`)
+- [ ] **[M]** Mitra goes online, sends one heartbeat
+- [ ] **[BE]** Manually delete the heartbeat key: `DEL mitra:heartbeat:<id>`
+- [ ] **[BE]** Wait up to 25s (15s stale + 10s sweep cadence)
+- [ ] **[BE]** Postgres: `is_online=false` for this mitra, audit row in `mitra_online_logs` with status='offline'
+- [ ] **[BE]** Valkey: `SISMEMBER mitras:online <id>` = 0, no `mitra:heartbeat:<id>` key
+
+### Sweep skips on Valkey error
+
+- [ ] **[BE]** Stop Valkey
+- [ ] **[BE]** Backend log shows `[auto-offline] valkey unavailable, skipping this tick:` each sweep cadence
+- [ ] **[BE]** **No Postgres UPDATE fires** during the outage (verify with query log) — confirms we don't mass-offline on Valkey hiccup
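+
+A sketch of one sweep tick that shows why no Postgres UPDATE fires during an outage. Everything here is illustrative: `autoOfflineTick`, the `markOfflineInDb` callback, and the return shape are hypothetical, and reading heartbeats with `mget` and treating a missing key as stale are assumptions. The client calls assumed are ioredis-style `smembers` and `mget` (array form).
+
+```javascript
+async function autoOfflineTick(valkey, markOfflineInDb, staleAfterSeconds, nowMs) {
+  let ids;
+  let beats;
+  try {
+    ids = await valkey.smembers('mitras:online');
+    beats = ids.length > 0
+      ? await valkey.mget(ids.map((id) => `mitra:heartbeat:${id}`))
+      : [];
+  } catch (err) {
+    // Valkey error: skip the whole tick, touch nothing in Postgres.
+    return { skipped: true, offlined: [] };
+  }
+  const stale = ids.filter((id, i) => {
+    const last = Date.parse(beats[i]); // NaN for a missing (null) heartbeat
+    return Number.isNaN(last) || nowMs - last > staleAfterSeconds * 1000;
+  });
+  for (const id of stale) await markOfflineInDb(id);
+  return { skipped: false, offlined: stale };
+}
+
+// Hypothetical fakes: m1 pinged 5s ago, m2 has no heartbeat key.
+const sweepNowMs = Date.parse('2026-05-25T10:00:00Z');
+const sweepValkey = {
+  smembers: async () => ['m1', 'm2'],
+  mget: async () => ['2026-05-25T09:59:55Z', null],
+};
+const brokenValkey = { smembers: async () => { throw new Error('ECONNREFUSED'); } };
+```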
+
+---
+
+## Section H — Heartbeat → Postgres batched mirror
+
+Verifies the 60s UNNEST UPDATE.
+
+- [ ] **[BE]** Multiple mitras online, all heartbeating
+- [ ] **[BE]** Set `HEARTBEAT_MIRROR_INTERVAL_SECONDS=15` for faster cycles, restart
+- [ ] **[BE]** Wait one mirror cycle — Postgres log shows **one** UPDATE statement (with `UNNEST(...)` containing all online mitra IDs)
+- [ ] **[BE]** `SELECT last_heartbeat_at FROM mitra_online_status WHERE is_online=true` returns timestamps within last cycle
+- [ ] **[BE]** Compare with Valkey `GET mitra:heartbeat:<id>` — Postgres lags Valkey by ≤ mirror-cadence seconds (forensic-grade, not real-time)
+
+### Mirror skips on Valkey error
+
+- [ ] **[BE]** Stop Valkey
+- [ ] **[BE]** Backend log shows `[heartbeat-mirror] valkey unavailable, skipping:` each cycle
+- [ ] **[BE]** Postgres `last_heartbeat_at` does NOT advance during the outage (correct — Valkey is the source of "when was last ping?")
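+
+For orientation when reading the Postgres log, the single batched statement could look roughly like the sketch below. The SQL text, the `uuid[]` and `timestamptz[]` casts, and the `buildMirrorParams` helper are assumptions for illustration; only the table and column names come from this checklist.
+
+```javascript
+// One statement per cycle: two parallel arrays are unnested into (mitra_id, ts) rows.
+const MIRROR_SQL = `
+  UPDATE mitra_online_status AS s
+     SET last_heartbeat_at = v.ts
+    FROM UNNEST($1::uuid[], $2::timestamptz[]) AS v(mitra_id, ts)
+   WHERE s.mitra_id = v.mitra_id`;
+
+// heartbeats: map of mitra ID -> ISO timestamp string (or null when the key is missing).
+function buildMirrorParams(heartbeats) {
+  const ids = [];
+  const stamps = [];
+  for (const [id, iso] of Object.entries(heartbeats)) {
+    if (iso && !Number.isNaN(Date.parse(iso))) {
+      ids.push(id);
+      stamps.push(iso);
+    }
+  }
+  return [ids, stamps]; // same length, same order
+}
+
+// Hypothetical input: one valid timestamp, one missing key, one garbage value.
+const mirrorParams = buildMirrorParams({
+  '11111111-1111-1111-1111-111111111111': '2026-05-25T10:00:00Z',
+  '22222222-2222-2222-2222-222222222222': null,
+  '33333333-3333-3333-3333-333333333333': 'not-a-date',
+});
+```
+
+Because the statement writes the same timestamp no matter which instance runs it, repeated execution across instances is harmless.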
+
+---
+
+## Section I — Reconciliation sweep
+
+Verifies drift heals every `VALKEY_ONLINE_MIRROR_SWEEP_SECONDS`.
+
+- [ ] **[BE]** Set `VALKEY_ONLINE_MIRROR_SWEEP_SECONDS=30` for faster test, restart
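+- [ ] **[BE]** Manually corrupt Valkey by adding a bogus ID, removing a real online mitra, and setting a bogus capacity, in that order (comments are kept out of the commands so nothing extra is sent as an argument):
+  ```
+  SADD mitras:online 00000000-0000-0000-0000-000000000999
+  SREM mitras:online <real-online-mitra-id>
+  SET mitra:capacity:<real-id> 99
+  ```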
+- [ ] **[BE]** Wait one sweep cycle (~30s) → log shows `[valkey-mirror] seed:` again
+- [ ] **[BE]** After sweep: Valkey state matches Postgres exactly (bogus ID gone, real ID present, capacity reset)
+
+### Sweep disabled when env=0
+
+- [ ] **[BE]** Set `VALKEY_ONLINE_MIRROR_SWEEP_SECONDS=0`, restart
+- [ ] **[BE]** Confirm no periodic seed log appears after the initial startup seed
+
+---
+
+## Section J — Failure modes (Valkey degradation)
+
+End-to-end behavior when Valkey is down for an extended period.
+
+- [ ] **[BE]** Stop Valkey
+- [ ] **[C]** Customer beacon poll → `availability:snapshot` GET fails → falls back to Postgres JOIN; UX unchanged but DB query rate spikes
+- [ ] **[C]** Customer triggers blast → `findAvailableMitras` Valkey path fails → falls back to Postgres JOIN; blast still works
+- [ ] **[M]** Mitra heartbeat → Valkey write fails (logged), but heartbeat returns 200; the missed write is irrelevant (auto-offline sweep is also skipping)
+- [ ] **[M]** Mitra toggle online → Postgres update succeeds, Valkey SADD fails (logged); on next reconciliation sweep after Valkey returns, mitra is back in `mitras:online` SET
+- [ ] **[BE]** Restart Valkey → reconnect listener fires → `seedFromPostgres()` runs → state restored; degraded period ends
+
+---
+
+## Section K — Multi-instance (defer until Cloud Run)
+
+Run only when ≥ 2 backend instances are active.
+
+- [ ] **[BE]** Two instances both run their own seed on startup — final Valkey state is consistent (idempotent `DEL + SADD`)
+- [ ] **[BE]** Concurrent setOnline calls on the same mitra from different instances → final SET state correct (atomic SADD)
+- [ ] **[BE]** `availability:snapshot` cache miss on instance A fills the snapshot; instance B's next poll reads the cached value (cluster-shared cache works)
+- [ ] **[BE]** Operator changes `max_customers_per_mitra` on one instance → `config:invalidate` pub/sub fires → other instance also DELs `availability:snapshot`
+- [ ] **[BE]** Heartbeat mirror UPDATEs from multiple instances are idempotent (last writer wins on timestamp, no errors)
+
+---
+
+## Smoke tests (quick happy path)
+
+5-minute sanity check after any deploy:
+
+- [ ] **[BE]** Backend log shows successful seed on startup
+- [ ] **[M]** Mitra toggles online → appears in `SMEMBERS mitras:online`
+- [ ] **[C]** Customer sees "Mulai Curhat" enabled
+- [ ] **[C]** Customer pays → mitra accepts → chat starts → `mitra:capacity:<id>` increments
+- [ ] **[C]** Chat ends → counter decrements
+- [ ] **[M]** Mitra toggles offline → removed from SET
+
+---
+
+## Known limitations / what this checklist does NOT cover
+
+- **Load testing** — sustained heartbeat volume at prod scale (300+ mitras × 2 pings/min). Plan: separate load-test stage when prod is provisioned.
+- **Memorystore-specific behavior** — failover, RDB+AOF interaction. Plan: re-run Sections A, G, I, J against Memorystore Standard tier before prod cutover.
+- **Long-running drift** — overnight runs where eviction or memory-pressure could affect Valkey state. Plan: monitor `INFO memory` in prod for the first week.