Phase 6: Valkey availability mirror — move read path off Postgres

Mitra availability state (online flag, deactivated flag, per-mitra session
count, heartbeat liveness) is now mirrored into Valkey so the customer beacon,
pairing blast, and dashboard counts no longer hit Postgres on the hot path.
Postgres remains the durable source of truth; Valkey state is fully
derivable via seedFromPostgres on startup + reconnect.

Schema
- mitras:online           SET    — mirror of is_online
- mitras:deactivated      SET    — mirror of is_active=false
- mitra:capacity:<id>     STRING — active+pending_payment session count
- mitra:heartbeat:<id>    STRING — ISO timestamp of last ping
- availability:snapshot   JSON   — beacon cache, TTL 10s, cluster-shared

Write paths (Postgres first, best-effort Valkey)
- setOnline/setOffline mirror SADD/SREM + heartbeat SET/DEL
- updateMitraStatus mirrors mitras:deactivated AND revokes auth_sessions
  on deactivate (bounds the "ghost online" window to access-token TTL)
- heartbeat is Valkey-only on the hot path; the per-ping Postgres UPDATE
  on last_heartbeat_at is eliminated (was 1,200 ops/min at prod scale)
- chat_session lifecycle (accept/end/reroute/extension/expiry) calls
  recomputeCapacityForMitra after each UPDATE — derive-from-truth avoids
  the bookkeeping risk of per-transition INCR/DECR
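
A minimal sketch of the derive-from-truth write described above, not the
actual recomputeCapacityForMitra. The injected countOpenSessions and kv
dependencies are hypothetical stand-ins for the Postgres query and the
Valkey client; only the mitra:capacity:<id> key comes from the schema.

```javascript
// Sketch only: `countOpenSessions` and `kv` are hypothetical injected
// dependencies, not the project's real Postgres or Valkey clients.
const capacityKey = (mitraId) => `mitra:capacity:${mitraId}`

async function recomputeCapacity(mitraId, { countOpenSessions, kv, log = console }) {
  // Re-count from the durable store instead of applying INCR/DECR per
  // transition, so a missed or duplicated transition cannot skew the value.
  const count = await countOpenSessions(mitraId)
  try {
    await kv.set(capacityKey(mitraId), String(count))
  } catch (err) {
    // Best-effort mirror: log and continue; a later reseed heals drift.
    log.warn(`[recomputeCapacity] valkey write failed: ${err.message}`)
  }
  return count
}
```

Because the value is recomputed rather than adjusted, calling it twice for
the same transition is harmless.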

Read paths (Valkey-first, Postgres fallback on Valkey error)
- isMitraReachable: SISMEMBER mitras:online + heartbeat freshness
- findAvailableMitras: SDIFF + pipelined GETs, filter by capacity + heartbeat
- countAvailableMitrasFromCache: Valkey-driven, cached cluster-wide 10s TTL
- dashboard online count: SCARD
- Each reader wraps Valkey ops in try/catch → Postgres fallback on outage
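
The reachability check can be sketched roughly as follows, assuming a
client that exposes sismember and get. The kv and queryPostgres parameters
are hypothetical, and the heartbeat key is assumed to follow the
mitra:heartbeat:<id> pattern from the schema.

```javascript
// Sketch only: `kv` and `queryPostgres` are hypothetical injected
// dependencies; key names are assumed from the schema in this commit message.
async function isReachable(mitraId, { kv, queryPostgres, staleAfterSeconds, now = Date.now }) {
  try {
    // SISMEMBER returns 1/0 on most clients; treat any falsy value as "not online".
    const online = await kv.sismember('mitras:online', mitraId)
    if (!online) return false
    const heartbeat = await kv.get(`mitra:heartbeat:${mitraId}`)
    if (!heartbeat) return false
    // Fresh means the last ping is no older than the staleness window.
    return Date.parse(heartbeat) >= now() - staleAfterSeconds * 1000
  } catch (err) {
    // Any Valkey failure falls back to the durable source of truth.
    return queryPostgres(mitraId)
  }
}
```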

Heartbeat path on /api/mitra/status/heartbeat
- resolveMitra preHandler replaced with heartbeatGuard: SISMEMBER on
  mitras:deactivated (~0 DB hits per ping). Falls back to full DB
  resolveMitra if Valkey is unreachable so a Valkey outage doesn't
  silently accept heartbeats from deactivated mitras.
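
One possible shape for the guard, written as a plain async function rather
than a framework preHandler. makeHeartbeatGuard, kv, and resolveMitraFromDb
are hypothetical names for illustration; the real heartbeatGuard is not
shown in this commit message.

```javascript
// Sketch only: `kv` and `resolveMitraFromDb` are hypothetical stand-ins for
// the Valkey client and the full DB-backed resolveMitra lookup.
function makeHeartbeatGuard({ kv, resolveMitraFromDb }) {
  return async function heartbeatGuard(mitraId) {
    try {
      // Hot path: one set-membership check, no database access.
      const deactivated = await kv.sismember('mitras:deactivated', mitraId)
      return { allowed: !deactivated, source: 'valkey' }
    } catch (err) {
      // Valkey unreachable: fail safe by asking the database, so a
      // deactivated mitra is never accepted just because the cache is down.
      const mitra = await resolveMitraFromDb(mitraId)
      return { allowed: Boolean(mitra && mitra.is_active), source: 'postgres' }
    }
  }
}
```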

Three sweeps, env-configurable cadences
- MITRA_AUTO_OFFLINE_SWEEP_SECONDS (30) — Valkey-driven stale detection
- HEARTBEAT_MIRROR_INTERVAL_SECONDS (60) — batched UPSERT writes
  Valkey timestamps to Postgres last_heartbeat_at via UNNEST (1 statement
  per cycle, idempotent across instances)
- VALKEY_ONLINE_MIRROR_SWEEP_SECONDS (300) — periodic reseed heals drift
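
To illustrate the one-statement-per-cycle idea, the sketch below builds a
single parameterized statement from a batch of heartbeats. It is written as
UPDATE ... FROM UNNEST rather than an upsert because the exact statement is
not shown here; buildHeartbeatMirrorQuery is a hypothetical helper, and the
uuid id type is an assumption.

```javascript
// Sketch only: hypothetical helper. Assumes mitras.id is a uuid and
// last_heartbeat_at is a timestamptz; adjust the casts if that is wrong.
function buildHeartbeatMirrorQuery(heartbeats) {
  const ids = []
  const timestamps = []
  for (const [id, iso] of heartbeats) {
    // Skip missing or unparseable timestamps instead of failing the batch.
    if (!iso || Number.isNaN(Date.parse(iso))) continue
    ids.push(id)
    timestamps.push(iso)
  }
  if (ids.length === 0) return null // nothing to mirror this cycle
  // Two parallel arrays are unnested into rows (id, ts) and joined to mitras,
  // so the whole batch is written by one statement.
  const text = `
    UPDATE mitras AS m
    SET last_heartbeat_at = v.ts
    FROM UNNEST($1::uuid[], $2::timestamptz[]) AS v(id, ts)
    WHERE m.id = v.id`
  return { text, values: [ids, timestamps] }
}
```

Running the same batch twice writes the same values, so overlapping cycles
on several instances do not conflict.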

Startup
- restoreActiveTimers → seedFromPostgres → bind listeners
- onValkeyReady re-runs the seed on every reconnect (cold start + reseed
  on Valkey restart, no manual intervention)

Failure semantics
- Read fallback: every Valkey read is wrapped and falls back to the existing
  Postgres JOIN query, so the system stays correct during a Valkey outage;
  performance degrades but nothing breaks
- Write best-effort: Postgres write commits before Valkey is touched;
  Valkey errors log + continue; reconciliation sweep heals drift
- Auto-offline sweep aborts entirely on Valkey error (does NOT mass-
  offline via Postgres scan during Valkey hiccup)
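
The abort-on-error rule for the auto-offline sweep might look like the
sketch below. autoOfflineSweep, kv, and setOffline are hypothetical names,
and treating a missing heartbeat as stale is an assumption rather than
something stated above.

```javascript
// Sketch only: `kv` and `setOffline` are hypothetical injected dependencies.
async function autoOfflineSweep({ kv, setOffline, staleAfterSeconds, now = Date.now, log = console }) {
  let ids
  let heartbeats
  try {
    ids = await kv.smembers('mitras:online')
    heartbeats = await Promise.all(ids.map((id) => kv.get(`mitra:heartbeat:${id}`)))
  } catch (err) {
    // Abort the whole cycle: without heartbeat data there is no basis for
    // deciding who is stale, so nobody is taken offline.
    log.warn(`[autoOfflineSweep] valkey error, skipping cycle: ${err.message}`)
    return { aborted: true, offlined: [] }
  }
  const cutoff = now() - staleAfterSeconds * 1000
  const offlined = []
  for (let i = 0; i < ids.length; i++) {
    const hb = heartbeats[i]
    if (hb && Date.parse(hb) >= cutoff) continue // still fresh
    await setOffline(ids[i]) // assumption: a missing heartbeat counts as stale
    offlined.push(ids[i])
  }
  return { aborted: false, offlined }
}
```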

Tests
- New: 32 integration tests in mitra-status.valkey-mirror.test.js
  covering seed, write-through, fallbacks, capacity lifecycle,
  auto-offline sweep, heartbeat mirror, deactivation flow, beacon cache
- Updated: fixtures.js seeds Valkey alongside Postgres when isOnline=true
- Updated: helpers/db.js resetDb also flushes test Valkey
- Fixed 2 pre-existing session-timer flakes (string IDs failed uuid
  parse; vi.advanceTimersByTimeAsync raced real Postgres I/O)
- All 124/124 backend tests pass (was 90/92)

Docs
- requirement/valkey-online-mirror-plan.md — canonical plan
- requirement/valkey-online-mirror-testing.md — manual E2E checklist
- requirement/deployment.md — infra + Valkey persistence guidance for
  prod (Memorystore Standard tier recommended; migration from
  self-hosted Valkey is zero-downtime via reseed-from-Postgres)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 18:07:55 +08:00
parent 3fff4b1c6e
commit 553dbac52f
20 changed files with 1839 additions and 82 deletions

@@ -1,11 +1,13 @@
 import { getDb } from '../db/client.js'
-import { getMaxCustomersPerMitra, getPairingBlastTimeoutSeconds, getReturningChatConfirmationTimeoutSeconds } from './config.service.js'
+import * as valkey from '../plugins/valkey.js'
+import { VK_MITRAS_ONLINE, VK_MITRAS_DEACTIVATED, vkCapacityKey, vkHeartbeatKey } from './mitra-status.service.js'
+import { getMaxCustomersPerMitra, getPairingBlastTimeoutSeconds, getReturningChatConfirmationTimeoutSeconds, getMitraPingConfig } from './config.service.js'
 import { sendToUser } from '../plugins/websocket.js'
 import { sendPushNotification } from './notification.service.js'
 import { startSessionTimer } from './session-timer.service.js'
 import { startSessionListener } from './chat-handler.service.js'
 import { consumePaymentSession, failPaymentSession, getPaymentSession, recordIntermediateFailure } from './payment.service.js'
-import { isMitraReachable, isMitraInActiveSessionWithCustomer, getMitraActiveSessionCount } from './mitra-status.service.js'
+import { isMitraReachable, isMitraInActiveSessionWithCustomer, getMitraActiveSessionCount, recomputeCapacityForMitra, recomputeCapacityBySession } from './mitra-status.service.js'
 import {
   UserType,
   SessionStatus,
@@ -76,10 +78,37 @@ const notifyCustomer = async (customerId, data) => {
   }
 }
-export const findAvailableMitras = async () => {
+// Valkey-driven: SDIFF(online, deactivated) → for each candidate, pipelined
+// GET capacity + heartbeat, then filter by capacity gate + heartbeat freshness.
+// Postgres fallback runs if any Valkey op throws (full JOIN as before).
+const findAvailableMitrasFromValkey = async () => {
+  const { max_customers_per_mitra } = await getMaxCustomersPerMitra()
+  const { stale_after_seconds } = await getMitraPingConfig()
+  const candidates = await valkey.sdiff(VK_MITRAS_ONLINE, VK_MITRAS_DEACTIVATED)
+  if (!candidates.length) return []
+  const pipe = valkey.pipeline()
+  for (const id of candidates) {
+    pipe.get(vkCapacityKey(id))
+    pipe.get(vkHeartbeatKey(id))
+  }
+  const results = await pipe.exec()
+  const cutoff = Date.now() - stale_after_seconds * 1000
+  const eligible = []
+  for (let i = 0; i < candidates.length; i++) {
+    const capacity = Number(results[i * 2][1] ?? 0)
+    const heartbeat = results[i * 2 + 1][1]
+    if (capacity >= max_customers_per_mitra) continue
+    if (!heartbeat || Date.parse(heartbeat) < cutoff) continue
+    eligible.push({ id: candidates[i], active_session_count: capacity })
+  }
+  return eligible
+}
+const findAvailableMitrasFromPostgres = async () => {
   const { max_customers_per_mitra } = await getMaxCustomersPerMitra()
   // Project active_session_count alongside the mitra row so the blast loop doesn't
   // need a per-mitra COUNT roundtrip later.
   const mitras = await sql`
     SELECT m.id, m.display_name, sub.active_session_count
     FROM mitras m
@@ -96,6 +125,15 @@ export const findAvailableMitras = async () => {
   return mitras
 }
+export const findAvailableMitras = async () => {
+  try {
+    return await findAvailableMitrasFromValkey()
+  } catch (err) {
+    console.warn('[findAvailableMitras] valkey unavailable, falling back to Postgres:', err.message)
+    return findAvailableMitrasFromPostgres()
+  }
+}
 /**
  * Validate that a payment session is owned by the customer, confirmed, and not yet consumed.
  * Throws on mismatch. Returns the loaded payment session row.
@@ -414,6 +452,10 @@ export const acceptPairingRequest = async (sessionId, mitraId) => {
     })
   }
+  // Mitra now occupies a capacity slot (PENDING_PAYMENT counts per
+  // findAvailableMitras predicate). Mirror to Valkey.
+  await recomputeCapacityForMitra(mitraId)
   // Mark this mitra's notification as accepted
   await sql`
     UPDATE chat_request_notifications