Phase 6: Valkey availability mirror — move read path off Postgres

Mitra-availability state (online flag, deactivated flag, per-mitra session
count, heartbeat liveness) is now mirrored into Valkey so the customer beacon,
pairing blast, and dashboard counts no longer hit Postgres on the hot path.
Postgres remains the durable source of truth; Valkey state is fully
derivable via seedFromPostgres on startup and on reconnect.

Schema
- mitras:online           SET    — mirror of is_online
- mitras:deactivated      SET    — mirror of is_active=false
- mitra:capacity:<id>     STRING — active+pending_payment session count
- mitra:heartbeat:<id>    STRING — ISO timestamp of last ping
- availability:snapshot   JSON   — beacon cache, TTL 10s, cluster-shared

Write paths (Postgres first, best-effort Valkey)
- setOnline/setOffline mirror SADD/SREM + heartbeat SET/DEL
- updateMitraStatus mirrors mitras:deactivated AND revokes auth_sessions
  on deactivate (bounds the "ghost online" window to access-token TTL)
- heartbeat is Valkey-only on the hot path; the per-ping Postgres UPDATE
  on last_heartbeat_at is eliminated (was 1,200 ops/min at prod scale)
- chat_session lifecycle (accept/end/reroute/extension/expiry) calls
  recomputeCapacityForMitra after each UPDATE — derive-from-truth avoids
  the bookkeeping risk of per-transition INCR/DECR
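
A minimal sketch of the derive-from-truth idea behind recomputeCapacityForMitra,
using in-memory stand-ins instead of Postgres and Valkey. The session shape, the
status strings, and countOccupiedSlots are assumptions made for illustration;
the real function runs a COUNT(*) over chat_sessions and SETs the result.

```javascript
// Illustrative only: status values are assumed to match SessionStatus.ACTIVE
// and SessionStatus.PENDING_PAYMENT; the names below are hypothetical.
const OCCUPYING = new Set(['active', 'pending_payment'])

// Derive the occupied-slot count for one mitra from the full session list.
function countOccupiedSlots(sessions, mitraId) {
  return sessions.filter(
    (s) => s.mitraId === mitraId && OCCUPYING.has(s.status),
  ).length
}

const sessions = [
  { id: 's1', mitraId: 'm1', status: 'active' },
  { id: 's2', mitraId: 'm1', status: 'pending_payment' },
  { id: 's3', mitraId: 'm1', status: 'ended' },
  { id: 's4', mitraId: 'm2', status: 'active' },
]

// Stand-in for the mitra:capacity:<id> keys.
const capacity = new Map()

// Recomputing twice (for example, a duplicated lifecycle event) writes the
// same value both times; a second INCR would have over-counted.
capacity.set('m1', countOccupiedSlots(sessions, 'm1'))
capacity.set('m1', countOccupiedSlots(sessions, 'm1'))
```

The trade-off is one extra COUNT query per lifecycle transition in exchange for
a value that cannot drift because of a missed or repeated transition.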

Read paths (Valkey-first, Postgres fallback on Valkey error)
- isMitraReachable: SISMEMBER mitras:online + heartbeat freshness
- findAvailableMitras: SDIFF + pipelined GETs, filter by capacity + heartbeat
- countAvailableMitrasFromCache: Valkey-driven, cached cluster-wide 10s TTL
- dashboard online count: SCARD
- Each reader wraps Valkey ops in try/catch → Postgres fallback on outage
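
The read-side filter (SDIFF, then capacity and heartbeat checks) can be sketched
as a pure function over plain Sets and Maps. This is an approximation of the
logic in computeAvailabilityFromValkey, not the shipped code; countAvailable and
its parameter names are hypothetical.

```javascript
// Sets/Maps stand in for mitras:online, mitras:deactivated,
// mitra:capacity:<id> and mitra:heartbeat:<id>.
function countAvailable({
  online, deactivated, capacity, heartbeats,
  maxPerMitra, staleAfterSeconds, nowMs,
}) {
  const cutoff = nowMs - staleAfterSeconds * 1000
  let count = 0
  for (const id of online) {
    if (deactivated.has(id)) continue            // SDIFF online \ deactivated
    const used = Number(capacity.get(id) ?? 0)   // missing key counts as 0
    if (used >= maxPerMitra) continue            // no free slot
    const hb = heartbeats.get(id)
    if (!hb || Date.parse(hb) < cutoff) continue // missing or stale heartbeat
    count++
  }
  return { available: count > 0, count }
}

const nowMs = Date.parse('2026-05-25T10:00:00.000Z')
const result = countAvailable({
  online: new Set(['m1', 'm2', 'm3', 'm4', 'm5']),
  deactivated: new Set(['m2']),
  capacity: new Map([['m1', 0], ['m3', 2], ['m4', 0]]),
  heartbeats: new Map([
    ['m1', '2026-05-25T09:59:50.000Z'], // 10s old: fresh
    ['m2', '2026-05-25T09:59:50.000Z'],
    ['m3', '2026-05-25T09:59:50.000Z'],
    ['m4', '2026-05-25T09:58:00.000Z'], // 120s old: stale
  ]),
  maxPerMitra: 2,
  staleAfterSeconds: 60,
  nowMs,
})
// result is { available: true, count: 1 }: only m1 passes every filter.
```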

Heartbeat path on /api/mitra/status/heartbeat
- resolveMitra preHandler replaced with heartbeatGuard: SISMEMBER on
  mitras:deactivated (~0 DB hits per ping). Falls back to full DB
  resolveMitra if Valkey is unreachable so a Valkey outage doesn't
  silently accept heartbeats from deactivated mitras.

Three sweeps, env-configurable cadences
- MITRA_AUTO_OFFLINE_SWEEP_SECONDS (30) — Valkey-driven stale detection
- HEARTBEAT_MIRROR_INTERVAL_SECONDS (60) — batched UPDATE writes
  Valkey timestamps to Postgres last_heartbeat_at via UNNEST (1 statement
  per cycle, idempotent across instances)
- VALKEY_ONLINE_MIRROR_SWEEP_SECONDS (300) — periodic reseed heals drift
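
As a rough illustration of the first two sweeps, the sketch below separates the
pure selection steps from the I/O: picking stale mitras for the auto-offline
sweep, and building the parallel id/timestamp arrays that the heartbeat mirror
passes to UNNEST. findStale and buildMirrorBatch are hypothetical helper names;
the commit inlines this logic in autoOfflineStaleMitras and
mirrorHeartbeatsToPostgres.

```javascript
// onlineIds[i] pairs with heartbeatValues[i], as returned by a pipelined GET.
function findStale(onlineIds, heartbeatValues, staleAfterSeconds, nowMs) {
  const cutoff = nowMs - staleAfterSeconds * 1000
  const stale = []
  for (let i = 0; i < onlineIds.length; i++) {
    const ts = heartbeatValues[i]
    // A missing heartbeat key is treated the same as an expired one.
    if (!ts || Date.parse(ts) < cutoff) stale.push(onlineIds[i])
  }
  return stale
}

// Keep only mitras that have a heartbeat value; the two arrays stay aligned
// so they can be zipped by UNNEST(ids, ts) in a single UPDATE.
function buildMirrorBatch(onlineIds, heartbeatValues) {
  const ids = []
  const ts = []
  for (let i = 0; i < onlineIds.length; i++) {
    if (heartbeatValues[i]) {
      ids.push(onlineIds[i])
      ts.push(heartbeatValues[i])
    }
  }
  return { ids, ts }
}

const sweepNow = Date.parse('2026-05-25T10:00:00.000Z')
const sweepIds = ['a', 'b', 'c']
const sweepHeartbeats = [
  '2026-05-25T09:59:45.000Z', // fresh
  null,                       // key missing
  '2026-05-25T09:55:00.000Z', // 5 minutes old
]

const staleIds = findStale(sweepIds, sweepHeartbeats, 60, sweepNow)
const mirrorBatch = buildMirrorBatch(sweepIds, sweepHeartbeats)
```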

Startup
- restoreActiveTimers → seedFromPostgres → bind listeners
- onValkeyReady re-runs the seed on every reconnect (cold start + reseed
  on Valkey restart, no manual intervention)
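
Why re-running the seed on every reconnect is safe can be shown with an
in-memory model. This is a sketch under assumed data shapes (the store object,
seedFromRows, and snapshot are hypothetical); the real seedFromPostgres issues
DEL plus bulk SADD/SET through a Valkey pipeline.

```javascript
// store mimics the Valkey keys: two sets plus two per-mitra string maps.
function seedFromRows(store, { onlineIds, deactivatedIds, capacityRows }, nowIso) {
  store.online = new Set(onlineIds)           // DEL + SADD mitras:online
  store.deactivated = new Set(deactivatedIds) // DEL + SADD mitras:deactivated
  for (const id of onlineIds) {
    store.heartbeat.set(id, nowIso) // seed with "now" so the first sweep
                                    // after a restart does not mass-offline
    store.capacity.set(id, 0)       // reset, then overlay real counts below
  }
  for (const { mitraId, count } of capacityRows) {
    store.capacity.set(mitraId, count)
  }
  return store
}

// Stable, comparable view of the store for equality checks.
function snapshot(store) {
  return JSON.stringify({
    online: [...store.online].sort(),
    deactivated: [...store.deactivated].sort(),
    capacity: [...store.capacity].sort(),
    heartbeat: [...store.heartbeat].sort(),
  })
}

const truth = {
  onlineIds: ['m1', 'm2'],
  deactivatedIds: ['m9'],
  capacityRows: [{ mitraId: 'm1', count: 2 }],
}
const seedTime = '2026-05-25T10:00:00.000Z'

// Start from a drifted store: a ghost member that Postgres no longer lists.
const store = {
  online: new Set(['ghost']),
  deactivated: new Set(),
  capacity: new Map(),
  heartbeat: new Map(),
}
const afterFirst = snapshot(seedFromRows(store, truth, seedTime))
const afterSecond = snapshot(seedFromRows(store, truth, seedTime))
```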

Failure semantics
- Read fallback: every Valkey read wrapped, falls back to existing
  Postgres JOIN query — system stays correct during Valkey outage,
  performance degrades not breaks
- Write best-effort: Postgres write commits before Valkey is touched;
  Valkey errors log + continue; reconciliation sweep heals drift
- Auto-offline sweep aborts entirely on Valkey error (does NOT mass-
  offline via Postgres scan during Valkey hiccup)
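
The two failure behaviours above differ on purpose: readers fall back to
Postgres, while the auto-offline sweep skips the tick. A hedged sketch of both
shapes follows; readWithFallback and sweepOrSkip are hypothetical names, and
the callbacks stand in for the real Valkey and Postgres calls.

```javascript
// Reader shape: try Valkey, fall back to the Postgres query on any error.
async function readWithFallback(valkeyRead, postgresRead, label) {
  try {
    return await valkeyRead()
  } catch (err) {
    console.warn(`[${label}] valkey unavailable, falling back to Postgres:`, err.message)
    return postgresRead()
  }
}

// Sweep shape: if stale detection cannot read Valkey, do nothing this tick
// rather than mass-offlining from a Postgres scan.
async function sweepOrSkip(listStaleFromValkey, flipOffline) {
  let stale
  try {
    stale = await listStaleFromValkey()
  } catch (err) {
    console.warn('[auto-offline] valkey unavailable, skipping this tick:', err.message)
    return 0
  }
  if (!stale.length) return 0
  await flipOffline(stale)
  return stale.length
}

// Simulated outage used in the usage lines below.
const valkeyDown = async () => { throw new Error('ECONNREFUSED') }

const demo = (async () => {
  const read = await readWithFallback(valkeyDown, async () => ({ count: 3 }), 'demo')
  const flipped = []
  const skipped = await sweepOrSkip(valkeyDown, async (ids) => flipped.push(...ids))
  const swept = await sweepOrSkip(async () => ['m7'], async (ids) => flipped.push(...ids))
  return { read, skipped, swept, flipped }
})()
```

During the simulated outage the reader still returns the Postgres answer, and
the sweep flips nobody offline until Valkey is readable again.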

Tests
- New: 32 integration tests in mitra-status.valkey-mirror.test.js
  covering seed, write-through, fallbacks, capacity lifecycle,
  auto-offline sweep, heartbeat mirror, deactivation flow, beacon cache
- Updated: fixtures.js seeds Valkey alongside Postgres when isOnline=true
- Updated: helpers/db.js resetDb also flushes test Valkey
- Fixed 2 pre-existing session-timer flakes (string IDs failed uuid
  parse; vi.advanceTimersByTimeAsync raced real Postgres I/O)
- All 124/124 backend tests pass (was 90/92)

Docs
- requirement/valkey-online-mirror-plan.md — canonical plan
- requirement/valkey-online-mirror-testing.md — manual E2E checklist
- requirement/deployment.md — infra + Valkey persistence guidance for
  prod (Memorystore Standard tier recommended; migration from
  self-hosted Valkey is zero-downtime via reseed-from-Postgres)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 18:07:55 +08:00
parent 3fff4b1c6e
commit 553dbac52f
20 changed files with 1839 additions and 82 deletions


@@ -1,24 +1,80 @@
import { getDb } from '../db/client.js'
import { SessionStatus } from '../constants.js'
import { getMitraPingConfig, getMaxCustomersPerMitra } from './config.service.js'
import { subscribe } from '../plugins/valkey.js'
import * as valkey from '../plugins/valkey.js'
import { subscribe, onValkeyReady } from '../plugins/valkey.js'
const sql = getDb()
// --- Short-TTL availability cache for the 5s-poll endpoint ---
// In-memory snapshot { available, count, expiresAt }. The cache:
// - is recomputed at most once per AVAILABILITY_TTL_MS (10s backstop)
// - is invalidated explicitly when CC changes max_customers_per_mitra (call invalidateAvailabilityCache())
// This keeps customer polls off the DB hot path while staying close to real time.
const AVAILABILITY_TTL_MS = 10_000
let availabilityCache = null // { available, count, expiresAt }
// Per requirement/valkey-online-mirror-plan.md § Schema.
export const VK_MITRAS_ONLINE = 'mitras:online'
export const VK_MITRAS_DEACTIVATED = 'mitras:deactivated'
export const vkCapacityKey = (mitraId) => `mitra:capacity:${mitraId}`
export const vkHeartbeatKey = (mitraId) => `mitra:heartbeat:${mitraId}`
export const invalidateAvailabilityCache = () => {
availabilityCache = null
// Rebuilds Valkey availability state from Postgres. Called on backend startup,
// on Valkey reconnect (via onValkeyReady), and by the reconciliation sweep.
// Idempotent — DEL + bulk SADD/SET produces the same final state on every run.
export const seedFromPostgres = async () => {
try {
const [onlineRows, deactRows, capacityRows] = await Promise.all([
sql`SELECT mitra_id FROM mitra_online_status WHERE is_online = true`,
sql`SELECT id FROM mitras WHERE is_active = false`,
sql`
SELECT mitra_id, COUNT(*)::int AS c FROM chat_sessions
WHERE mitra_id IS NOT NULL
AND status IN (${SessionStatus.ACTIVE}, ${SessionStatus.PENDING_PAYMENT})
GROUP BY mitra_id
`,
])
const pipe = valkey.pipeline()
pipe.del(VK_MITRAS_ONLINE)
pipe.del(VK_MITRAS_DEACTIVATED)
const now = new Date().toISOString()
if (onlineRows.length) {
pipe.sadd(VK_MITRAS_ONLINE, ...onlineRows.map((r) => r.mitra_id))
// Seed heartbeats with NOW so the first sweep after restart doesn't
// mass-offline. Mitras refresh on their next ping anyway.
for (const r of onlineRows) pipe.set(vkHeartbeatKey(r.mitra_id), now)
// Reset capacity for currently-online mitras; overlay real counts below.
// Offline mitras' stale capacity keys don't affect reads (SDIFF excludes them).
for (const r of onlineRows) pipe.set(vkCapacityKey(r.mitra_id), 0)
}
if (deactRows.length) {
pipe.sadd(VK_MITRAS_DEACTIVATED, ...deactRows.map((r) => r.id))
}
for (const r of capacityRows) pipe.set(vkCapacityKey(r.mitra_id), r.c)
await pipe.exec()
console.log(
`[valkey-mirror] seed: ${onlineRows.length} online, ${deactRows.length} deactivated, ${capacityRows.length} with active sessions`,
)
} catch (err) {
console.error('[valkey-mirror] seed failed:', err)
}
}
// Subscribe once at module load so other-instance config updates also bust this cache.
// Single-instance: the local mutator already invalidates, so this is a no-op extra.
// Re-seed on every Valkey reconnect (fires on initial connect too).
onValkeyReady(seedFromPostgres)
// --- Beacon snapshot cache (Valkey-backed, cluster-shared) ---
// `availability:snapshot` JSON `{available, count}`, TTL 10s. All backend
// instances share the same cache: one Valkey-driven compute per 10s
// cluster-wide regardless of how many customers are polling.
const AVAILABILITY_CACHE_KEY = 'availability:snapshot'
const AVAILABILITY_TTL_SECONDS = 10
export const invalidateAvailabilityCache = async () => {
try {
await valkey.del(AVAILABILITY_CACHE_KEY)
} catch (err) {
console.error('[valkey-mirror] invalidateAvailabilityCache failed:', err)
}
}
// Bust the shared cache when CC changes max_customers_per_mitra (any instance).
let _subscribed = false
const ensureSubscribed = () => {
if (_subscribed) return
@@ -43,6 +99,39 @@ export const ensureStatusRow = async (mitraId) => {
`
}
// Best-effort Valkey writer. Postgres remains source of truth; a Valkey hiccup
// shouldn't fail the originating request — the reconciliation sweep heals drift.
const tryValkey = async (fn, label) => {
try { await fn() } catch (err) {
console.error(`[valkey-mirror] ${label} failed:`, err)
}
}
// Recompute `mitra:capacity:<id>` from chat_sessions truth. Called after every
// chat_session state change that could affect a mitra's occupied-slot count.
// Recompute-from-truth avoids the bookkeeping risks of per-transition INCR/DECR
// (double-counts, missed transitions across all the UPDATE sites in pairing,
// closure, extension, session-timer, session services).
export const recomputeCapacityForMitra = async (mitraId) => {
if (!mitraId) return
const [row] = await sql`
SELECT COUNT(*)::int AS c FROM chat_sessions
WHERE mitra_id = ${mitraId}
AND status IN (${SessionStatus.ACTIVE}, ${SessionStatus.PENDING_PAYMENT})
`
await tryValkey(
() => valkey.set(vkCapacityKey(mitraId), row.c),
`recomputeCapacity ${mitraId}`,
)
}
// Lookup mitra_id from the session, then recompute. Use this from UPDATE sites
// where the session's mitra_id may not be in local scope.
export const recomputeCapacityBySession = async (sessionId) => {
const [row] = await sql`SELECT mitra_id FROM chat_sessions WHERE id = ${sessionId}`
if (row?.mitra_id) await recomputeCapacityForMitra(row.mitra_id)
}
export const setOnline = async (mitraId) => {
await ensureStatusRow(mitraId)
const now = new Date()
@@ -54,6 +143,14 @@ export const setOnline = async (mitraId) => {
await sql`
INSERT INTO mitra_online_logs (mitra_id, status) VALUES (${mitraId}, 'online')
`
await tryValkey(async () => {
const pipe = valkey.pipeline()
pipe.sadd(VK_MITRAS_ONLINE, mitraId)
pipe.set(vkHeartbeatKey(mitraId), now.toISOString())
await pipe.exec()
}, `setOnline ${mitraId}`)
invalidateAvailabilityCache()
}
@@ -73,16 +170,32 @@ export const setOffline = async (mitraId) => {
await sql`
INSERT INTO mitra_online_logs (mitra_id, status) VALUES (${mitraId}, 'offline')
`
await tryValkey(async () => {
const pipe = valkey.pipeline()
pipe.srem(VK_MITRAS_ONLINE, mitraId)
pipe.del(vkHeartbeatKey(mitraId))
await pipe.exec()
}, `setOffline ${mitraId}`)
invalidateAvailabilityCache()
}
// Heartbeat hot path: Valkey-only. Per-ping Postgres UPDATE eliminated; the
// 60s batched heartbeat-mirror job (mirrorHeartbeatsToPostgres) writes
// `last_heartbeat_at` to Postgres for forensics/restart safety.
//
// NOTE: there is intentionally no `is_online = true` gate here (the old SQL
// UPDATE had one). The Valkey SET is global; if a mitra heartbeats while
// `is_online=false` in Postgres, their TTL key gets refreshed but they're
// still not in `mitras:online`, so blast eligibility is unchanged. The
// reconciliation sweep will clean up the orphan heartbeat key.
export const heartbeat = async (mitraId) => {
const now = new Date()
await sql`
UPDATE mitra_online_status
SET last_heartbeat_at = ${now}, updated_at = ${now}
WHERE mitra_id = ${mitraId} AND is_online = true
`
const now = new Date().toISOString()
await tryValkey(
() => valkey.set(vkHeartbeatKey(mitraId), now),
`heartbeat ${mitraId}`,
)
}
export const getStatus = async (mitraId) => {
@@ -130,39 +243,95 @@ export const getOnlineLogs = async (mitraId, { page = 1, limit = 50 } = {}) => {
return { items, total: Number(count), page, limit }
}
// Valkey-driven: enumerate mitras:online, read each heartbeat timestamp from
// Valkey, find stales, then bulk-flip Postgres + clean up Valkey.
//
// Failure semantics: if any Valkey op throws, the sweep aborts entirely. We
// never mass-offline mitras via a Postgres scan because Valkey is unreachable
// — that would risk false-offlining a fleet during a Valkey hiccup.
export const autoOfflineStaleMitras = async () => {
const pingConfig = await getMitraPingConfig()
// If ping is not required, skip the auto-offline sweep entirely
if (!pingConfig.require_ping) return 0
// stale_after_seconds is the operator-facing knob — what they set is what
// they get. No multiplier, no implicit "tolerate N missed heartbeats"
// contract baked in. The CC PATCH validates that the value is >= the env-
// driven heartbeat cadence so single missed pings can't flip a mitra
// offline.
const staleSeconds = pingConfig.stale_after_seconds
const stale = await sql`
UPDATE mitra_online_status
SET is_online = false, last_offline_at = NOW(), updated_at = NOW()
WHERE is_online = true
AND last_heartbeat_at < NOW() - ${staleSeconds + ' seconds'}::interval
RETURNING mitra_id
`
for (const row of stale) {
await sql`
INSERT INTO mitra_online_logs (mitra_id, status) VALUES (${row.mitra_id}, 'offline')
`
let onlineIds, heartbeatValues
try {
onlineIds = await valkey.smembers(VK_MITRAS_ONLINE)
if (!onlineIds.length) return 0
const pipe = valkey.pipeline()
for (const id of onlineIds) pipe.get(vkHeartbeatKey(id))
const results = await pipe.exec()
heartbeatValues = results.map((r) => r[1])
} catch (err) {
console.warn('[auto-offline] valkey unavailable, skipping this tick:', err.message)
return 0
}
// Capacity may have changed (mitra went offline) — invalidate the customer-facing
// availability cache so the next poll reflects reality.
if (stale.length > 0) invalidateAvailabilityCache()
const cutoff = Date.now() - pingConfig.stale_after_seconds * 1000
const stale = []
for (let i = 0; i < onlineIds.length; i++) {
const ts = heartbeatValues[i]
if (!ts || Date.parse(ts) < cutoff) stale.push(onlineIds[i])
}
if (!stale.length) return 0
await sql`
UPDATE mitra_online_status
SET is_online = false, last_offline_at = NOW(), updated_at = NOW()
WHERE mitra_id = ANY(${sql.array(stale)}::uuid[]) AND is_online = true
`
for (const id of stale) {
await sql`INSERT INTO mitra_online_logs (mitra_id, status) VALUES (${id}, 'offline')`
}
await tryValkey(async () => {
const cleanup = valkey.pipeline()
cleanup.srem(VK_MITRAS_ONLINE, ...stale)
for (const id of stale) cleanup.del(vkHeartbeatKey(id))
await cleanup.exec()
}, `auto-offline cleanup (${stale.length} stale)`)
invalidateAvailabilityCache()
return stale.length
}
// Batched mirror: Valkey heartbeat timestamps → Postgres `last_heartbeat_at`.
// Runs every HEARTBEAT_MIRROR_INTERVAL_SECONDS (default 60). One UNNEST UPDATE
// regardless of online count. Idempotent — latest timestamp wins; multiple
// instances running concurrently is fine (no leader election needed).
export const mirrorHeartbeatsToPostgres = async () => {
let onlineIds, heartbeatValues
try {
onlineIds = await valkey.smembers(VK_MITRAS_ONLINE)
if (!onlineIds.length) return 0
const pipe = valkey.pipeline()
for (const id of onlineIds) pipe.get(vkHeartbeatKey(id))
const results = await pipe.exec()
heartbeatValues = results.map((r) => r[1])
} catch (err) {
console.warn('[heartbeat-mirror] valkey unavailable, skipping:', err.message)
return 0
}
const ids = []
const ts = []
for (let i = 0; i < onlineIds.length; i++) {
if (heartbeatValues[i]) {
ids.push(onlineIds[i])
ts.push(heartbeatValues[i])
}
}
if (!ids.length) return 0
await sql`
UPDATE mitra_online_status m
SET last_heartbeat_at = u.ts::timestamptz, updated_at = NOW()
FROM (
SELECT * FROM UNNEST(${sql.array(ids)}::uuid[], ${sql.array(ts)}::text[]) AS t(mitra_id, ts)
) u
WHERE m.mitra_id = u.mitra_id
`
return ids.length
}
/**
* Customer-home availability check, cached in-memory for AVAILABILITY_TTL_MS.
*
@@ -178,12 +347,33 @@ export const autoOfflineStaleMitras = async () => {
* sets/hashes (matching the existing memory item "Session Timer Scaling"); the contract
* of this function — Valkey/cache reads only on the hot path — stays the same.
*/
export const countAvailableMitrasFromCache = async () => {
const now = Date.now()
if (availabilityCache && availabilityCache.expiresAt > now) {
return { available: availabilityCache.available, count: availabilityCache.count }
}
const computeAvailabilityFromValkey = async () => {
const { max_customers_per_mitra } = await getMaxCustomersPerMitra()
const { stale_after_seconds } = await getMitraPingConfig()
const candidates = await valkey.sdiff(VK_MITRAS_ONLINE, VK_MITRAS_DEACTIVATED)
if (!candidates.length) return { available: false, count: 0 }
const pipe = valkey.pipeline()
for (const id of candidates) {
pipe.get(vkCapacityKey(id))
pipe.get(vkHeartbeatKey(id))
}
const results = await pipe.exec()
const cutoff = Date.now() - stale_after_seconds * 1000
let count = 0
for (let i = 0; i < candidates.length; i++) {
const capacity = Number(results[i * 2][1] ?? 0)
const heartbeat = results[i * 2 + 1][1]
if (capacity >= max_customers_per_mitra) continue
if (!heartbeat || Date.parse(heartbeat) < cutoff) continue
count++
}
return { available: count > 0, count }
}
const computeAvailabilityFromPostgres = async () => {
const { max_customers_per_mitra } = await getMaxCustomersPerMitra()
const [{ count }] = await sql`
SELECT COUNT(*)::int AS count
@@ -197,26 +387,42 @@ export const countAvailableMitrasFromCache = async () => {
AND cs.status IN (${SessionStatus.ACTIVE}, ${SessionStatus.PENDING_PAYMENT})
) < ${max_customers_per_mitra}
`
return { available: count > 0, count }
}
const available = count > 0
availabilityCache = {
available,
count,
expiresAt: now + AVAILABILITY_TTL_MS,
export const countAvailableMitrasFromCache = async () => {
try {
const cached = await valkey.get(AVAILABILITY_CACHE_KEY)
if (cached) return JSON.parse(cached)
const snapshot = await computeAvailabilityFromValkey()
await valkey.getValkeyClient().setex(AVAILABILITY_CACHE_KEY, AVAILABILITY_TTL_SECONDS, JSON.stringify(snapshot))
return snapshot
} catch (err) {
console.warn('[countAvailableMitras] valkey unavailable, falling back to Postgres:', err.message)
return computeAvailabilityFromPostgres()
}
return { available, count }
}
/**
* Mitra-online check for use during pairing/extension safeguards.
* Combines the Valkey-mirrored online flag (Postgres mitra_online_status today) with
* the WebSocket-connected check. Never use "in-session" as a proxy for "online".
* Mitra-reachable check: in `mitras:online` SET AND heartbeat is fresh.
* Falls back to a Postgres `is_online` read if Valkey is unreachable; the
* fallback skips the heartbeat-freshness check (sweep takes care of stale rows
* within `stale_after_seconds + sweep_cadence`).
*/
export const isMitraReachable = async (mitraId) => {
const [row] = await sql`
SELECT is_online FROM mitra_online_status WHERE mitra_id = ${mitraId}
`
return Boolean(row?.is_online)
try {
const inSet = await valkey.sismember(VK_MITRAS_ONLINE, mitraId)
if (!inSet) return false
const heartbeat = await valkey.get(vkHeartbeatKey(mitraId))
if (!heartbeat) return false
const { stale_after_seconds } = await getMitraPingConfig()
return Date.parse(heartbeat) >= Date.now() - stale_after_seconds * 1000
} catch (err) {
console.warn('[isMitraReachable] valkey unavailable, falling back to DB:', err.message)
const [row] = await sql`SELECT is_online FROM mitra_online_status WHERE mitra_id = ${mitraId}`
return Boolean(row?.is_online)
}
}
/**