Technical article

Production Monitoring for a Real-Time Crypto Volatility Surface

A live volatility surface needs monitoring that understands market data, not only servers. This article covers the Derivasys checks that make stale books, delayed workers, unstable SVI fits, and bad risk nodes visible before users trust the dashboard.

Sean Gordon / July 3, 2026

Tags: production monitoring, observability, BTC options, SVI diagnostics, market data, volatility surface, surface freshness, alerting, runbooks, surface health API, SLO burn rate, alert payloads, synthetic canaries.

Monitoring scope

A live surface needs domain-aware monitoring.

Generic infrastructure monitoring can tell you that a container is running. It cannot tell you whether the BTC options surface is stale, whether the 25-delta risk reversal came from a weak wing quote, or whether SVI accepted a shape that should have been quarantined.

Derivasys monitoring has to join two worlds: operational telemetry from the pipeline and market-aware checks from the volatility surface itself.

  • Exchange connection status and reconnect counts.
  • Quote freshness by venue, instrument, expiry, and currency.
  • Order-book sequence gaps and resubscribe events.
  • IV inversion failures and slow-path counts.
  • SVI residuals, parameter movement, and reject reasons.
  • Dashboard snapshot age and WebSocket fanout lag.

Figure: Derivasys through-fit matrix used for monitoring quote and fit quality. Through-fit panels are monitoring surfaces: they show whether real quotes support the fitted smile or whether a slice needs review.

Freshness

Freshness has to be measured at several layers.

Feed freshness is a product-level signal, not just an exchange socket signal. A dashboard can look live while one layer is stale: the exchange socket can be connected, the order-book worker can publish, and the frontend can render, yet a specific expiry or wing may not have received a useful update recently.

The useful freshness signal is therefore not one timestamp. It is a chain of timestamps that shows where delay entered the system.

  • Alert on stale instruments, not only disconnected venues.
  • Track expiry-level freshness because front-expiry and long-dated books behave differently.
  • Expose snapshot age in dashboard and API responses.
  • Separate no-market conditions from infrastructure failures.
exchange_ts
  -> local_received_at
  -> book_published_at
  -> iv_calculated_at
  -> smile_fit_at
  -> dashboard_snapshot_at
  -> browser_received_at
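
As a minimal sketch of how the chain above could be used, the following Python computes the delay between adjacent stages and reports the largest one. The stage names come from the chain; the function names, the millisecond integer timestamps, and the sample values are assumptions for illustration. Note that the first hop compares an exchange clock with a local clock, so clock skew can distort it.

```python
from typing import Dict, Optional, Tuple

# Stage order taken from the timestamp chain in the text.
CHAIN = [
    "exchange_ts",
    "local_received_at",
    "book_published_at",
    "iv_calculated_at",
    "smile_fit_at",
    "dashboard_snapshot_at",
    "browser_received_at",
]


def stage_gaps_ms(stamps: Dict[str, int]) -> Dict[str, int]:
    """Delay in milliseconds between each adjacent pair of stages present in stamps."""
    present = [name for name in CHAIN if name in stamps]
    return {
        f"{earlier}->{later}": stamps[later] - stamps[earlier]
        for earlier, later in zip(present, present[1:])
    }


def largest_gap(stamps: Dict[str, int]) -> Optional[Tuple[str, int]]:
    """Return (hop, delay_ms) for the slowest hop, or None if fewer than two stamps exist."""
    gaps = stage_gaps_ms(stamps)
    if not gaps:
        return None
    hop = max(gaps, key=gaps.get)
    return hop, gaps[hop]


# Hypothetical sample: most of the delay enters at IV calculation.
sample = {
    "exchange_ts": 1000,
    "local_received_at": 1005,
    "book_published_at": 1010,
    "iv_calculated_at": 1900,
    "smile_fit_at": 1950,
    "dashboard_snapshot_at": 1960,
    "browser_received_at": 1990,
}
print(largest_gap(sample))
```

Reporting the hop name rather than only the total age is what lets an alert point at a worker instead of at the dashboard.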

Surface SLOs

The service-level objective should describe the surface users receive.

A volatility platform can meet generic uptime targets while still serving stale market structure. The meaningful SLO is not only process availability; it is whether the displayed surface, API snapshot, risk reversals, flies, and fixed-tenor rows are recent enough and backed by accepted inputs.

Derivasys monitoring therefore starts with product state. A green exchange socket is not enough if the front-expiry BTC smile was last accepted too long ago or if a displayed risk node was carried forward from a weak wing quote.

  • Track accepted surface age by currency, expiry, and venue contribution.
  • Separate dashboard-render age from source-market age and fit age.
  • Define freshness budgets for ATM rows, risk reversals, flies, and fixed-tenor rows.
  • Report degraded surface state when values are reused, interpolation-heavy, or quote-light.
surface_slo = accepted_surface_age < max_surface_age
  && critical_expiries_fresh
  && fit_reject_rate < max_fit_reject_rate
  && risk_nodes_have_provenance
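
One way to evaluate that condition is sketched below in Python. The `SurfaceSnapshot` fields, the threshold defaults, and the choice to fail when no fits were attempted are all assumptions made for the example, not Derivasys settings.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SurfaceSnapshot:
    accepted_surface_age_ms: int
    critical_expiry_age_ms: Dict[str, int]  # age of the last accepted smile per critical expiry
    fits_attempted: int
    fits_rejected: int
    risk_nodes_total: int
    risk_nodes_with_provenance: int


def surface_slo_met(
    s: SurfaceSnapshot,
    max_surface_age_ms: int = 5_000,     # placeholder budget
    max_expiry_age_ms: int = 10_000,     # placeholder budget
    max_fit_reject_rate: float = 0.2,    # placeholder budget
) -> bool:
    critical_expiries_fresh = all(
        age < max_expiry_age_ms for age in s.critical_expiry_age_ms.values()
    )
    # Assumption: with no fit attempts there is no evidence of a healthy fit, so fail closed.
    fit_reject_rate = s.fits_rejected / s.fits_attempted if s.fits_attempted else 1.0
    risk_nodes_have_provenance = s.risk_nodes_with_provenance == s.risk_nodes_total
    return (
        s.accepted_surface_age_ms < max_surface_age_ms
        and critical_expiries_fresh
        and fit_reject_rate < max_fit_reject_rate
        and risk_nodes_have_provenance
    )
```

Keeping the four terms as named booleans makes it easy to report which one failed when the objective is missed.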

Surface health API

The health endpoint should report market state, not only process state.

A Kubernetes readiness check can say a pod is alive while the front BTC expiry is stale, the latest SVI candidate was rejected, or the dashboard is serving carried-forward risk nodes. A volatility surface health API has to expose the product state users actually consume.

Derivasys monitoring treats the health endpoint as a compact surface status object. It should be safe for the dashboard, API consumers, and alerting system to read the same state and reach the same conclusion about whether a surface is healthy, degraded, held back, or stale.

  • Report accepted surface age by currency, expiry, and venue contribution.
  • Expose fit status, latest reject reason, risk-node freshness, and dashboard patch age together.
  • Distinguish healthy, degraded, held-back, and stale states instead of returning one boolean.
  • Include a source snapshot id so an operator can replay the state behind the health response.
SurfaceHealth {
  currency,
  status,
  accepted_surface_age_ms,
  stale_expiries,
  latest_fit_reject,
  risk_node_age_ms,
  dashboard_patch_age_ms,
  replay_snapshot_id
}
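
The status object above could be modeled roughly as follows. This is a sketch under stated assumptions: the field names follow the structure in the text, but the precedence of the four states, the two thresholds, and deriving `status` from the other fields (rather than storing it) are illustrative choices.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    HELD_BACK = "held_back"
    STALE = "stale"


@dataclass
class SurfaceHealth:
    currency: str
    accepted_surface_age_ms: int
    risk_node_age_ms: int
    dashboard_patch_age_ms: int
    replay_snapshot_id: str
    stale_expiries: List[str] = field(default_factory=list)
    latest_fit_reject: Optional[str] = None

    def status(self, stale_after_ms: int = 30_000, node_budget_ms: int = 10_000) -> Status:
        # Assumed precedence: stale beats held back, which beats degraded.
        if self.accepted_surface_age_ms > stale_after_ms:
            return Status.STALE
        if self.latest_fit_reject is not None:
            return Status.HELD_BACK  # previous accepted fit is still being served
        if self.stale_expiries or self.risk_node_age_ms > node_budget_ms:
            return Status.DEGRADED
        return Status.HEALTHY
```

Because the dashboard, API consumers, and alerting all call the same derivation, they cannot disagree about which state a given snapshot is in.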

Fit health

A successful SVI solve is not the same as a healthy surface.

SVI can produce a curve even when the inputs are weak. Monitoring has to inspect residuals, wing slopes, calendar consistency, parameter jumps, quote-through-fit status, and whether the fit is being pulled by one stale or illiquid quote.

The production question is not just whether calibration returned parameters. It is whether those parameters should be published into user-facing risk nodes and API state.

  • Track accepted, rejected, and reused fits separately.
  • Store reject reasons such as stale quotes, crossed books, bad wings, or calendar inconsistency.
  • Alert on parameter movement that is larger than the quote set justifies.
  • Compare risk reversal and fly jumps against the underlying quote provenance.
  • Keep previous accepted fits available for graceful degradation.

Reject taxonomy

Fit rejects need structured reasons, not just failed calibration logs.

SVI rejection is useful only when the operator can tell whether the issue came from market data, model constraints, solver behavior, or calendar consistency. A single failed-fit counter hides the difference between stale books, sparse wings, crossed markets, parameter jumps, and no-arbitrage checks.

Structured reject reasons also make alerting less noisy. A short-lived sparse-wing reject on a long-dated expiry is different from repeated front-expiry calendar violations that affect dashboard risk nodes.

  • Classify rejects as quote-quality, freshness, solver, parameter-bound, residual, or calendar-consistency failures.
  • Attach currency, expiry, venue mix, and affected risk nodes to every reject event.
  • Count reused previous fits separately from rejected current fits.
  • Show the latest accepted fit beside the latest rejected candidate so operators can compare them.
FitReject {
  currency,
  expiry,
  reason,
  affected_nodes,
  quote_count,
  max_residual,
  previous_fit_age_ms,
  rejected_at
}
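
A possible typed version of that event is sketched below, with the reject classes from the list above as an enum. The `_ms` suffixes, the tuple type for affected nodes, and the grouping helper are assumptions added for the example.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Tuple


class RejectReason(Enum):
    QUOTE_QUALITY = "quote_quality"
    FRESHNESS = "freshness"
    SOLVER = "solver"
    PARAMETER_BOUND = "parameter_bound"
    RESIDUAL = "residual"
    CALENDAR_CONSISTENCY = "calendar_consistency"


@dataclass(frozen=True)
class FitReject:
    currency: str
    expiry: str
    reason: RejectReason
    affected_nodes: Tuple[str, ...]
    quote_count: int
    max_residual: float
    previous_fit_age_ms: int
    rejected_at_ms: int


def rejects_by_reason(events: Iterable[FitReject]) -> Counter:
    """Count rejects per (currency, expiry, reason) instead of one failed-fit total."""
    return Counter((e.currency, e.expiry, e.reason) for e in events)
```

Grouping by expiry and reason is what separates a single sparse-wing reject on a long-dated expiry from repeated calendar violations on the front expiry.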

Pipeline lag

Queue lag tells you which worker is falling behind.

Once the platform moves into Kafka and Kubernetes, the failure mode changes. A connector might be healthy while the IV worker falls behind. A surface worker might publish late even though the order-book topic is current.

Monitoring by topic, partition, worker, currency, and expiry makes the operational bottleneck visible before it becomes a stale dashboard.

  • Measure consumer lag for order-book, volatility, surface, and dashboard topics.
  • Track slow IV inversion counts by currency and expiry.
  • Separate backlog from intentional batching.
  • Emit structured worker heartbeats with current sequence and last published snapshot.
  • Use replay to reproduce surface states after changing fit logic.

Figure: Kafka and Kubernetes roadmap for monitoring volatility surface pipeline lag. Pipeline monitoring has to identify which topic or worker introduced lag, not only that the final dashboard is delayed.
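
To illustrate per-topic lag, the sketch below computes consumer lag from two offset maps. It deliberately makes no Kafka client calls: how the end offsets and committed offsets are fetched depends on the client in use and is not shown. The topic names and the decision to treat a missing committed offset as zero are assumptions.

```python
from typing import Dict, Tuple

TopicPartition = Tuple[str, int]  # (topic name, partition number)


def consumer_lag(
    end_offsets: Dict[TopicPartition, int],
    committed_offsets: Dict[TopicPartition, int],
) -> Dict[TopicPartition, int]:
    """Messages not yet consumed per partition, floored at zero."""
    return {
        tp: max(end - committed_offsets.get(tp, 0), 0)
        for tp, end in end_offsets.items()
    }


def lag_by_topic(lag: Dict[TopicPartition, int]) -> Dict[str, int]:
    """Sum partition lag per topic so the slowest stage stands out."""
    totals: Dict[str, int] = {}
    for (topic, _partition), value in lag.items():
        totals[topic] = totals.get(topic, 0) + value
    return totals


# Hypothetical offsets: the order-book topic is nearly current, the volatility topic is behind.
end = {("order-book", 0): 1000, ("order-book", 1): 1200, ("volatility", 0): 900}
committed = {("order-book", 0): 1000, ("order-book", 1): 1190, ("volatility", 0): 400}
totals = lag_by_topic(consumer_lag(end, committed))
print(max(totals, key=totals.get))
```

A raw message count does not distinguish backlog from intentional batching, so a real check would likely compare lag against the expected batch size for each worker.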

SLO burn rate

Burn-rate alerts should separate short bursts from sustained stale surfaces.

A live crypto options surface will occasionally reject a weak slice, reuse a previous smile, or hold a far-wing node. Paging on every short-lived condition creates noise. Missing a sustained stale front-expiry surface is worse.

The monitoring layer should therefore track SLO burn rate by product surface, not only by service. A short burst of IV queue lag is informational if accepted risk nodes stay current. The same lag becomes urgent when front-tenor smiles, risk reversals, flies, or fixed-tenor rows breach their freshness budget.

  • Calculate burn rate separately for front expiries, long expiries, risk nodes, and dashboard patches.
  • Use different alert windows for transient quote-light markets and sustained worker backlog.
  • Escalate when stale accepted state overlaps with fit rejects or queue lag.
  • Downgrade noisy wing-only degradation when dashboard-critical nodes remain current.
burn_rate = stale_surface_minutes / allowed_stale_minutes
page = burn_rate > threshold
  && critical_nodes_stale
  && latest_state_not_market_closed
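
The two expressions above translate into a few lines of Python. The default threshold of 1.0 (meaning the budget is being consumed exactly as fast as allowed) and the function names are assumptions for illustration.

```python
def burn_rate(stale_surface_minutes: float, allowed_stale_minutes: float) -> float:
    """Ratio of observed stale time to the stale time the budget allows for the window."""
    if allowed_stale_minutes <= 0:
        raise ValueError("allowed_stale_minutes must be positive")
    return stale_surface_minutes / allowed_stale_minutes


def should_page(
    rate: float,
    critical_nodes_stale: bool,
    market_closed: bool,
    threshold: float = 1.0,  # placeholder; tune per surface and window
) -> bool:
    return rate > threshold and critical_nodes_stale and not market_closed


# Worked example: 6 stale minutes against a 4-minute allowance gives a burn rate of 1.5.
rate = burn_rate(6, 4)
print(rate, should_page(rate, critical_nodes_stale=True, market_closed=False))
```

Running the same calculation over a short and a long window, with separate thresholds, is one way to tell a transient quote-light burst from a sustained backlog.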

Timestamp chain

A freshness alert should point to the layer where time was lost.

When a dashboard row is stale, restarting the frontend is often the wrong response. The delay may have entered at exchange receipt, book publication, IV calculation, SVI fitting, surface publishing, API fanout, or browser delivery.

A timestamp chain turns a vague stale-surface alert into a debugging path. It lets an operator distinguish venue silence from worker backlog, fit rejects, WebSocket fanout lag, and browser-side delivery issues.

  • Record venue timestamp, local receive time, book publish time, IV time, fit time, snapshot time, and browser receive time.
  • Display the largest gap in the chain beside stale dashboard panels.
  • Alert differently for venue silence, worker backlog, rejected fits, and fanout delay.
  • Keep these timestamps in API responses so downstream consumers can monitor their own freshness.

Figure: Derivasys dashboard risk analytics with timestamps and surface freshness diagnostics. Surface freshness should identify where delay entered the quote, fit, snapshot, or delivery chain.

Alert design

Useful alerts are tied to user-facing surface quality.

The worst alert is technically true but operationally useless. Volatility dashboard alerts need to explain what the user should distrust: a venue, an expiry, a wing, a risk node, a fixed tenor, or the full surface.

Derivasys alert design therefore starts from the surface object. If a condition does not change whether the dashboard or API should publish a value, it is probably not a high-priority page.

  • Page on stale front expiries that still appear in dashboard rows.
  • Warn on quote-through-fit breaches that affect risk reversals or flies.
  • Escalate when multiple venues disagree beyond configured thresholds.
  • Downgrade values that are interpolation-only or extrapolation-heavy.
  • Attach the latest book, IV, fit, and snapshot timestamps to alerts.

Figure: Derivasys risk reversal and fly panels used for monitoring risk-node quality. Risk-node monitoring should explain whether a skew or fly move is real market structure or a quote-quality artifact.

Alert payloads

An alert should include the quote, fit, and dashboard state needed for triage.

A message that says 'surface stale' is not enough. The useful alert tells the operator which expiry, which risk node, which worker, which source quote state, and which fit decision made the surface unsafe to trust.

That alert payload also becomes incident evidence. If the team later replays the window, the alert should point to the same QuoteState, VolState, SmileState, and dashboard patch ids used to reproduce the output.

  • Include affected currency, expiry, venue mix, risk nodes, and visible dashboard panels.
  • Attach the latest quote timestamp, IV timestamp, fit timestamp, and dashboard patch timestamp.
  • Carry fit reject reason, reused-surface reason, interpolation share, and queue-lag source.
  • Link alerts to through-fit and risk-node panels rather than forcing operators to search logs.
SurfaceAlert {
  severity,
  affected_nodes,
  source_quote_state_id,
  vol_state_id,
  smile_state_id,
  dashboard_patch_id,
  runbook_step
}
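
A minimal sketch of that payload as a serializable Python object follows. The field names mirror the structure above; the string types, the id formats, and the `to_json` helper are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class SurfaceAlert:
    severity: str
    affected_nodes: List[str]
    source_quote_state_id: str
    vol_state_id: str
    smile_state_id: str
    dashboard_patch_id: str
    runbook_step: str


def to_json(alert: SurfaceAlert) -> str:
    """Serialize with sorted keys so identical alerts produce identical payloads."""
    return json.dumps(asdict(alert), sort_keys=True)


# Hypothetical ids and node names, for illustration only.
alert = SurfaceAlert(
    severity="page",
    affected_nodes=["rr25_front", "fly25_front"],
    source_quote_state_id="qs-001",
    vol_state_id="vs-001",
    smile_state_id="ss-001",
    dashboard_patch_id="dp-001",
    runbook_step="check_quote_freshness_by_expiry",
)
print(to_json(alert))
```

Carrying the state ids in the payload is what allows the same alert to be used later as the starting point for a replay.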

Risk-node freshness

Risk reversal and fly alerts need the quote and fit state that produced them.

Derived panels can go stale in subtle ways. The surface may have a recent headline timestamp while a 25-delta risk reversal or fly is still based on a reused smile, sparse wing, interpolation-heavy node, or previous fit.

Monitoring risk nodes independently matters because traders often consume those rows before inspecting the full surface. If a risk reversal or fly is carried forward, the dashboard should say so rather than making the value look as fresh as the latest quote.

  • Track risk-node age separately from raw quote age and surface snapshot age.
  • Attach source smile, wing quote freshness, interpolation share, and fit residual state.
  • Downgrade nodes when one wing is stale, extrapolated, or unsupported by live venue marks.
  • Link node alerts back to through-fit panels so the cause is inspectable.

Figure: Derivasys risk reversal and fly panels with freshness monitoring context. Risk-node alerts should explain whether a skew or fly move is market structure, stale input, or fit behavior.
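
As a sketch of tracking node age separately from snapshot age, the Python below treats a risk node as only as fresh as its oldest input. The input fields, the three state labels, and the 10-second budget are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class RiskNodeInputs:
    call_wing_age_ms: int
    put_wing_age_ms: int
    smile_fit_age_ms: int
    extrapolated: bool = False
    reused_fit: bool = False


def node_age_ms(n: RiskNodeInputs) -> int:
    """A node is only as fresh as its oldest input."""
    return max(n.call_wing_age_ms, n.put_wing_age_ms, n.smile_fit_age_ms)


def node_state(n: RiskNodeInputs, budget_ms: int = 10_000) -> str:
    if n.reused_fit:
        return "carried_forward"  # value comes from a previous accepted smile
    if n.extrapolated or node_age_ms(n) > budget_ms:
        return "downgraded"       # one wing is stale or unsupported by live marks
    return "live"


# One stale wing is enough to downgrade the node even though the fit itself is recent.
print(node_state(RiskNodeInputs(call_wing_age_ms=800, put_wing_age_ms=42_000, smile_fit_age_ms=900)))
```

Using the maximum input age, rather than the snapshot timestamp, prevents a recent headline time from hiding a stale wing.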

Runbooks

Runbooks should restore confidence in the surface, not just restart services.

Restarting a worker can clear a symptom while leaving the surface state questionable. A useful runbook tells the operator how to confirm whether books are fresh, IV calculations are current, SVI fits are accepted, and dashboard snapshots are safe to publish.

That is the difference between infrastructure uptime and product reliability. The user does not care that a pod restarted successfully if the displayed volatility surface still reflects stale inputs.

1. Check venue connection and sequence gaps
2. Check quote freshness by expiry
3. Check IV inversion failures and slow paths
4. Check latest accepted SVI fit and reject reason
5. Check dashboard snapshot age
6. Re-enable API publication only after surface health recovers
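
The ordered steps above can be sketched as a list of checks that stops at the first failure. The check callables here are hypothetical stand-ins; in practice each would query the relevant metric or health field.

```python
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[], bool]]


def first_failing_step(checks: List[Check]) -> Optional[str]:
    """Run checks in runbook order and return the name of the first one that fails."""
    for name, check in checks:
        if not check():
            return name
    return None


def may_reenable_publication(checks: List[Check]) -> bool:
    """API publication is re-enabled only when every earlier step passes."""
    return first_failing_step(checks) is None


# Hypothetical results: books and quotes are fine, but the latest SVI fit is still rejected.
runbook: List[Check] = [
    ("venue connection and sequence gaps", lambda: True),
    ("quote freshness by expiry", lambda: True),
    ("IV inversion failures and slow paths", lambda: True),
    ("latest accepted SVI fit", lambda: False),
    ("dashboard snapshot age", lambda: True),
]
print(first_failing_step(runbook), may_reenable_publication(runbook))
```

Stopping at the first failure follows the pipeline order, so the operator investigates the earliest stage that is unhealthy rather than a downstream symptom.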

Synthetic canaries

Synthetic replay canaries catch monitoring gaps before the market does.

Some failures are hard to wait for in live markets: sequence gaps, stale far wings, repeated fit rejects, and dashboard fanout delays may appear only during bursty windows. A synthetic replay canary lets the system exercise those cases deliberately.

The canary should not only assert that workers stay up. It should check that health responses, alert payloads, runbook links, and dashboard downgrade states appear when known bad market states are replayed.

  • Replay a known stale-wing window and confirm the risk-node alert fires.
  • Replay a queue-lag window and confirm the health API reports degraded surface state.
  • Replay a rejected-fit sequence and confirm the dashboard holds previous accepted values.
  • Fail the canary if monitoring stays green while the product state is unsafe.

Figure: Derivasys synthetic replay canary path for volatility surface monitoring. Synthetic replay canaries verify that quote-state, fit-state, alerting, and dashboard downgrade paths all respond to known bad market windows.
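
A canary check of this kind could be expressed as expected versus observed monitoring state per replayed window. The sketch below assumes the replay itself has already run and recorded what the health API and alerting reported; the case names, status strings, and types are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CanaryCase:
    name: str
    expected_status: str
    expects_alert: bool


@dataclass(frozen=True)
class Observed:
    status: str
    alert_fired: bool


def canary_failures(cases: List[CanaryCase], observed: Dict[str, Observed]) -> List[str]:
    """List every replayed window where monitoring did not react as expected."""
    failures = []
    for case in cases:
        seen = observed.get(case.name)
        if seen is None:
            failures.append(f"{case.name}: no observation recorded")
        elif seen.status != case.expected_status:
            failures.append(f"{case.name}: status {seen.status}, expected {case.expected_status}")
        elif case.expects_alert and not seen.alert_fired:
            failures.append(f"{case.name}: expected alert did not fire")
    return failures


cases = [
    CanaryCase("stale_wing_window", "degraded", expects_alert=True),
    CanaryCase("queue_lag_window", "degraded", expects_alert=False),
]
# Monitoring stayed green during the queue-lag replay, so the canary should fail.
observed = {
    "stale_wing_window": Observed("degraded", alert_fired=True),
    "queue_lag_window": Observed("healthy", alert_fired=False),
}
print(canary_failures(cases, observed))
```

A non-empty result corresponds to the last bullet above: monitoring stayed green while the replayed product state was unsafe.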

Incident review

Replay turns a production incident into a reproducible surface state.

The most useful post-incident artifact is not a screenshot of a stale dashboard. It is the replayable sequence of quote states, forward states, IV records, fit candidates, reject reasons, and dashboard snapshots that produced the bad output.

That is where monitoring connects back to the Kafka and Kubernetes roadmap. Once market state is replayable, a fix to quote filtering, SVI weighting, or risk-node publication can be tested against the exact sequence that created the incident.

  • Preserve raw venue payloads and normalized quote-state events for incident windows.
  • Store accepted and rejected fit candidates with parameter values and residual summaries.
  • Replay incidents through changed calibration logic before promoting the fix.
  • Add runbook checks when replay shows an alert was missing or too noisy.
incident_window
  -> replay QuoteState + ForwardState
  -> rebuild IV and SmileState
  -> compare risk nodes and dashboard patches
  -> promote fix after reproduced output is explained
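
The comparison step in that flow might look like the following sketch, which diffs risk-node values from the incident against values rebuilt by replay. The node names, the absolute tolerance, and the flat name-to-value mapping are assumptions for illustration.

```python
import math
from typing import Dict, Optional, Tuple

Diff = Tuple[Optional[float], Optional[float]]


def risk_node_diffs(
    original: Dict[str, float],
    replayed: Dict[str, float],
    abs_tol: float = 1e-6,
) -> Dict[str, Diff]:
    """Return nodes whose value differs, or that exist on only one side."""
    diffs: Dict[str, Diff] = {}
    for node in sorted(set(original) | set(replayed)):
        a = original.get(node)
        b = replayed.get(node)
        if a is None or b is None or not math.isclose(a, b, rel_tol=0.0, abs_tol=abs_tol):
            diffs[node] = (a, b)
    return diffs


# Hypothetical values: the replay with changed logic moves the fly and publishes one extra node.
incident = {"rr25_7d": -0.031, "fly25_7d": 0.012}
replay = {"rr25_7d": -0.031, "fly25_7d": 0.015, "atm_7d": 0.55}
print(risk_node_diffs(incident, replay))
```

An empty diff against unchanged logic confirms the incident is reproduced; a non-empty diff after a change shows exactly which nodes the fix would alter.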
