Oncall Runbook#

This runbook provides investigation and remediation steps for common Venice alerts. It is organized by component so operators can quickly find the relevant section when an alert fires.

How to Use This Runbook#

Identify the alert — Match the firing alert metric name to a section below.
Investigate — Follow the investigation steps to identify the affected store(s) and host(s).
Remediate — Apply the recommended fix.

If your investigation uncovers a bug in Venice, please file an issue on the Venice repository. For questions or help troubleshooting, post in the Venice Slack community.

General Triage Workflow#

For most alerts, the triage flow is:

Identify which store(s) and/or host(s) are affected using your metrics dashboard. Aggregate metrics and sort by max value descending to find the top contributors.
Determine scope: is this a single host or multiple hosts? A single store or cluster-wide?
For single-host issues, the problem is often bad host state — a restart may resolve it.
For multi-host or cluster-wide issues, investigate systemic causes (deployments, config changes, capacity).
Collect diagnostics (heap dumps, thread dumps, logs) before restarting services.
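
The "collect diagnostics before restarting" step can be wrapped in a small helper so the dumps are not forgotten under pressure. The following is a minimal sketch, not an official Venice tool: the function names, the file naming scheme, and the default output directory are assumptions made for illustration. It relies only on the standard `jstack` and `jmap` JDK tools.

```shell
# Sketch: capture JVM diagnostics for one process before restarting it.
# File naming and the default output directory are illustrative assumptions.

# Build output paths from a directory, PID, and timestamp.
thread_dump_path() { printf '%s/threaddump-%s-%s.txt\n' "$1" "$2" "$3"; }
heap_dump_path()   { printf '%s/heapdump-%s-%s.hprof\n' "$1" "$2" "$3"; }

# Take the thread dump first (quick), then the heap dump (slower).
collect_diagnostics() {
  local pid="$1" out_dir="${2:-/tmp}" stamp
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  mkdir -p "$out_dir" || return 1
  jstack "$pid" > "$(thread_dump_path "$out_dir" "$pid" "$stamp")" || return 1
  jmap -dump:live,format=b,file="$(heap_dump_path "$out_dir" "$pid" "$stamp")" "$pid" || return 1
  echo "Diagnostics for PID $pid saved under $out_dir"
}

# Usage (replace <PID> with the JVM process ID):
#   collect_diagnostics <PID> /var/tmp/venice-diag
```

Timestamped file names keep repeated captures on the same host from overwriting each other.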

Ingestion and Write Path Alerts#

Ingestion Task Errored Gauge#

Metric: current--ingestion_task_errored_gauge

A value greater than 0 means the ingestion task for a store has stopped due to an error. Hybrid stores will serve stale data from that point on, and batch stores will stop processing control messages.

Investigation steps:

Aggregate the metric and sort by max value to find the affected store(s).
Check the server logs for the affected store(s) to identify the exception that caused the ingestion task to stop.
Check total--bytes_consumed for the affected store(s) to confirm that ingestion stopped at around the same time the gauge rose above 0.

Remediation:

Restart the affected server node to recover the ingestion task.
If the error recurs after restart, investigate the root cause in the server logs.

Leader Producer Failure Count#

Metric: total--leader_producer_failure_count

One or more stores have experienced Kafka producer failures on the leader replica. This typically indicates that a push job or real-time ingestion has failed in the specified datacenter.

Investigation steps:

Aggregate the metric and sort by max value descending to find the store(s) involved.
Check the server logs for producer-related exceptions.
Check whether there are corresponding issues with the backup, current, or future store versions.

Remediation:

Investigate the root cause in server logs.
If caused by a transient Kafka issue, the producer may recover automatically.

Ingestion Failure Count#

Metric: total--ingestion_failure

General ingestion failures have been detected. This is a broad indicator that something in the ingestion pipeline is failing.

Investigation steps:

Check server logs for the specific exception(s) causing ingestion failures.
Correlate with other alerts (leader offset lag, producer failures) to identify the root cause.

Remediation:

Address the underlying cause identified in the logs.
Restart the affected server if the issue is caused by transient state.

Timestamp Regression DCR Error#

Metric: total--timestamp_regression_dcr_error

An error was found in the Deterministic Conflict Resolution (DCR) logic used by Active-Active replication. Timestamp regression means a record arrived with a timestamp older than the existing value, which should not happen under normal operation.

Investigation steps:

Check server logs for the specific DCR error details.
Identify which store(s) and partition(s) are affected.

Remediation:

This typically requires developer investigation to determine the root cause.

Stuck Consumer#

Metric: stuck_consumer_found

A value greater than 0 means a shared Kafka consumer task is stuck and is not consuming data. Hybrid partitions assigned to that consumer will have stale data, and batch partitions will not complete ingestion.

Investigation steps:

Find the host that is firing this alert.

Take a heap dump on the affected host:

```
jmap -dump:live,format=b,file=heapdump.hprof <PID>
```

Check the server log file for the name of the stuck consumer thread. Look for log lines like:
```
Shared consumer couldn't make any progress for over N ms!
```

Remediation:

Restart the Venice server on the affected host to recover the stuck consumer.
If the issue recurs, check for known Kafka consumer bugs (e.g., corrupt record exceptions that are recoverable via consumer restart).

Controller Alerts#

Admin Message Errors#

Metric: admin_message_div_error_report_count

The controller encountered errors processing admin messages. Admin messages are used for cluster coordination operations (store creation, version swaps, config changes, etc.).

Investigation steps:

Check controller logs for the specific error.
Determine if the error is transient or persistent.

Remediation:

This may require restarting parent controllers and skipping failed admin messages.
Important: Confirm with your development team before skipping admin messages, as this can cause inconsistencies if done incorrectly.

Failed Admin Messages#

Metric: failed_admin_messages

Admin messages are failing to be processed by the controller. This can block store operations across the cluster.

Investigation steps:

Check whether the cluster is in maintenance mode. Failed admin messages are expected in that state, so rule it out before investigating other causes.
Check controller logs for the specific exception causing the failures.

Remediation:

If caused by maintenance mode: take the cluster out of maintenance mode once the maintenance is complete.
If not caused by maintenance mode: this may require skipping the failed admin messages.
Important: Confirm with your development team before skipping admin messages.

Controller Error Partition Gauge#

Metric: ErrorPartitionGauge (child and parent controllers, emitted by Apache Helix)

One or more cluster resources (Helix resources) are in an ERROR state. This can affect store availability and partition assignment.

Investigation steps:

Check the Helix controller UI or API for cluster resource(s) that are in ERROR state:
```
GET /admin/v2/clusters/<cluster>/resources
```
Check controller logs to see if an exception was thrown.
If this alert fires right after a config or code deployment, consider rolling back.

Remediation:

If caused by a recent deployment: roll back the change.
If caused by a transient issue: the Helix controller may self-recover once the underlying issue is resolved.

Maintenance Mode#

Metric: maintenance_mode (Helix cluster state)

A Venice cluster has entered maintenance mode. While in maintenance mode, Helix will not perform partition reassignment, which means down replicas will not be replaced.

Investigation steps:

Investigate why the cluster entered maintenance mode. Common reasons:
- Enabled manually by an operator for planned maintenance (cluster expansion, node swap, etc.).
- Number of down instances exceeded MAX_OFFLINE_INSTANCES_ALLOWED (automatic).
- Node crashes or planned infrastructure maintenance.
Check the maintenance mode reason in ZooKeeper:
```
/venice/<cluster>/CONTROLLER/MAINTENANCE
```

Remediation:

If this is planned maintenance: no action needed — take the cluster out of maintenance mode when the maintenance is complete.
If this is unplanned: recover the down instances and then take the cluster out of maintenance mode.
Disable maintenance mode via the Helix REST API or your cluster management tooling.

Rebalance Failure Gauge#

Metric: RebalanceFailureGauge (emitted by Apache Helix)

A metric value of 1 indicates that the Helix controller is unable to perform resource creation, assignment, or rebalance. Common causes include bad rack-aware configurations or bad cluster configurations. If the cluster enters maintenance mode, rebalance failures will also occur.

Investigation steps:

Check if the cluster is in maintenance mode — rebalance failures are expected during maintenance mode.

Check the Helix controller logs for rebalance-related exceptions. Look for messages like:

```
Failed to calculate best possible states for resource ...
Error computing assignment ...
```

Check for recent configuration changes that may have introduced bad rack-aware or cluster configs.

Remediation:

If caused by maintenance mode: resolve the maintenance mode issue first.
If caused by bad configuration: revert the configuration change.
If caused by a specific resource: identify and fix the problematic resource.
If the rebalance failure persists, engage Helix experts to assist with diagnosis.

Protocol Auto-Detection Service Errors#

Metric: protocol_version_auto_detection_error

The protocol version auto-detection service runs periodically (every 10 minutes) to find the minimum admin operation protocol version across all controller instances. If the service runs successfully, the error count drops to 0. A non-zero value means the detection is failing.

Investigation steps:

Check the parent controller logs for ProtocolVersionAutoDetectionService entries.

A successful run will log:

```
Current good Admin Operation version for cluster <cluster> is N and upstream version is N
```

If the service is failing, determine whether it is the parent or child controller that is failing.
This service relies on leader detection — if there is no leader, investigate why leader election has failed.
Check ZooKeeper for the current protocol version:
```
/venice-parent/<cluster>/adminTopicMetadata
```

Remediation:

If there is no leader controller, investigate and resolve the leader election issue.
The service can be disabled via controller configuration if needed as a temporary workaround:
```
controller.protocol.version.auto.detection.service.enabled=false
```

Server Resource Alerts#

JVM Heap Usage#

Metric: VeniceJVMStats--HeapUsage

JVM heap usage is approaching the configured maximum. If heap usage reaches 100%, the application will crash with an OutOfMemoryError. This applies to server, router, and controller processes.

Investigation steps:

Identify the affected node(s) from your monitoring dashboard.

Take a heap dump and thread dump on the affected node(s) as soon as possible:

```
# Heap dump
jmap -dump:live,format=b,file=heapdump.hprof <PID>

# Thread dump
jstack <PID> > threaddump.txt
```

Remediation:

Restart the application on the affected node(s) as soon as possible to prevent a crash.
Important: Collect heap and thread dumps before restarting — they are essential for root cause analysis.

Store Buffer Service Memory Usage#

Metric: total_memory_usage or max_memory_usage_per_writer

The store buffer service memory usage is high. The store buffer sits between the Kafka consumer and the storage engine, buffering records before they are written to RocksDB. Metrics are emitted per drainer type (sorted/unsorted). High usage can indicate a deadlock between the shared-consumer thread, Kafka producer callback thread, and buffer drainer thread.

Investigation steps:

Identify the affected node(s) from your monitoring dashboard.

Take a heap dump on the affected node immediately:

```
jmap -dump:live,format=b,file=heapdump.hprof <PID>
```

Remediation:

Restart the Venice server on the affected node to recover. Take the heap dump first, since it is critical for diagnosis, and consult your development team before restarting if possible.

SSD/Disk Health Status#

Metric: disk_healthy

The SSD on a Venice server node may be degraded or failing. Venice servers use local SSDs for RocksDB storage, so disk health is critical for data serving.

Investigation steps:

SSH onto the affected host and check disk health:

```
# Check kernel messages for NVMe errors
dmesg -T | grep nvme

# List NVMe devices (if no device is listed, the disk is dead)
sudo nvme list

# Check if the disk mounts properly
sudo mount -av
```

Remediation:

If the SSD shows errors but is still functional:
1. Try remounting (the second command runs only if the unmount succeeds): sudo umount /mnt/data && sudo mount -a
2. If remounting does not fix the issue, reboot the host.
If the SSD is dead (not detected by nvme list):
1. Swap the node out of the cluster.
2. File a hardware ticket for disk replacement, then re-image the host.
If this is a no-data-point alert (the monitoring agent is not emitting data):
1. Investigate why the monitoring agent is not emitting data and restart it.

Filesystem Usage#

Metric: filesystem_usage (OS-level, e.g., node_exporter or collectd)

Disk usage on a server or router node has exceeded the alert threshold. High disk usage can degrade performance and eventually cause the server to stop accepting writes.

Investigation steps:

Identify which hosts are affected and check the file usage pattern for the cluster.
Unbalanced partition assignment: If some hosts have decreased usage while others have increased, this may indicate an unbalanced Helix resource assignment.

General high usage: SSH into a problematic host and check disk usage:

```
# Check which filesystem has the issue
df -h

# Check log sizes
du -sh /path/to/logs/

# Check RocksDB data size
du -sh /path/to/venice/data/rocksdb/

# Check for unexpected files in /tmp
du -sh /tmp/*
```

For router nodes:

```
sudo du -hx --exclude=/proc / | sort -hr | head -n 10
```
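
When many hosts need checking, the `df` step can be filtered down to the filesystems that are actually near full. The sketch below is an illustration only: the function name and the 85% default are assumptions, not a Venice-recommended threshold. It reads `df -P` output (the POSIX single-line format) on stdin, so mount points containing spaces are not handled.

```shell
# Sketch: print mount points whose usage is at or above a threshold percentage.
# Expects `df -P` output on stdin; the 85% default is an illustrative assumption.
filesystems_over() {
  local threshold="${1:-85}"
  awk -v limit="$threshold" 'NR > 1 {
    pct = $5
    sub(/%/, "", pct)              # "90%" -> "90"
    if (pct + 0 >= limit) print $6, $5
  }'
}

# Usage:
#   df -P | filesystems_over 85
```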

Remediation:

Unbalanced assignment: Add more hosts to the affected zones so partitions are distributed more evenly. Work with the Helix team to understand the cause of the imbalance.
Log accumulation: Clean up old log files or adjust log rotation settings.
General high usage: Identify and remove the source of unexpected disk consumption.

CPU Wait#

Metric: cpu_wait (OS-level, e.g., node_exporter or collectd)

The CPU is spending a significant amount of time waiting for I/O operations to complete. This usually indicates hardware issues on the underlying disk.

Investigation steps:

Check kernel logs for hardware failures:
```
dmesg -T
```
Check I/O metrics for processes doing unexpected read/write.
Use iotop to identify the process (if any):
```
sudo iotop -o -P -a
```
Press left/right arrow keys to sort by different columns.
For Venice servers, check NVMe write statistics to confirm high I/O usage. If the problematic hosts also have higher wait time than other hosts, this usually indicates a hardware failure.
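
To confirm the I/O wait figure directly on a host, independent of the monitoring agent, the share of CPU time spent in iowait can be computed from two samples of the aggregate `cpu` line in `/proc/stat`. This is a rough sketch; the function name is made up for illustration, and the total is taken as the sum of the user, nice, system, idle, iowait, irq, softirq, and steal counters.

```shell
# Sketch: percentage of CPU time spent in iowait between two samples of the
# aggregate "cpu" line from /proc/stat.
# $1 is the earlier "cpu ..." line, $2 the later one.
iowait_percent() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      total = 0
      for (i = 2; i <= 9; i++) total += $i   # user .. steal
      t[NR] = total
      w[NR] = $6                             # iowait counter
    }
    END {
      dt = t[2] - t[1]
      if (dt <= 0) { print "0.0"; exit }
      printf "%.1f\n", 100 * (w[2] - w[1]) / dt
    }'
}

# Usage:
#   a="$(grep '^cpu ' /proc/stat)"; sleep 5; b="$(grep '^cpu ' /proc/stat)"
#   iowait_percent "$a" "$b"
```

Comparing the result across a suspect host and a healthy peer gives a quick sanity check on the alert.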

Remediation:

If caused by a rogue process: identify and stop the process, then reboot if needed.
If caused by hardware failure: swap the host out of the cluster and file a hardware repair ticket.

RocksDB Delayed Write Rate#

Metric: rocksdb.actual-delayed-write-rate

RocksDB is throttling writes due to write stalls. This can occur when compaction cannot keep up with the incoming write rate, causing L0 SST files to accumulate.

Before investigating, check whether there are any offset/time lag alerts on the hybrid current version. If there are none, a random spike of this metric is benign and does not break SLA guarantees.

Investigation steps:

Check if there are corresponding offset lag alerts for hybrid stores. If not, this may be benign.
To identify stores with high write throughput, check the per-store consumed bytes metrics for the cluster.
Check server logs for frequent offset update/sync entries.
Review RocksDB logs (typically stored alongside the data directory) for compaction-related warnings.

Remediation:

Venice tunes several RocksDB configs to reduce write stalls:

Increase total write buffer size to accommodate more memtables and avoid premature flush.
Reduce individual memtable size to fit more memtables within the total buffer.
Increase compaction threads to speed up compaction.
Reduce L1 target size to match L0 size, speeding up L0-to-L1 compaction.

If these tuning strategies do not reduce the lag, it may be a capacity issue — adding more server nodes can help distribute the write load.

For detailed information on RocksDB write stalls, see the RocksDB Write Stalls documentation.

Server Committed Memory#

Metric: os/mem.committed_as (OS-level, from /proc/meminfo)

The operating system's committed memory is higher than expected. This can be caused by huge pages not being properly reserved by the application.

Investigation steps:

SSH into the affected host and check the HugePages reservation:
```
grep -i huge /proc/meminfo
```
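If HugePages_Rsvd is 0 but HugePages_Total is non-zero, and HugePages_Free still equals HugePages_Total, the application is not using the reserved huge pages. (Pages that have already been faulted in are counted against HugePages_Free rather than HugePages_Rsvd.)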
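
The HugePages check can be sketched as a small parser over `/proc/meminfo`. The function name and the output labels are illustrative assumptions. The sketch reports `unused` only when HugePages_Free also equals HugePages_Total, because pages that a process has already faulted in no longer count toward HugePages_Rsvd.

```shell
# Sketch: classify huge page usage from /proc/meminfo-formatted text on stdin.
# Output labels ("none-configured", "unused", "in-use") are illustrative.
hugepages_status() {
  awk '
    /^HugePages_Total:/ { total = $2 + 0 }
    /^HugePages_Free:/  { free_pages = $2 + 0 }
    /^HugePages_Rsvd:/  { rsvd = $2 + 0 }
    END {
      if (total == 0) print "none-configured"
      else if (rsvd == 0 && free_pages == total) print "unused"
      else print "in-use"
    }'
}

# Usage:
#   hugepages_status < /proc/meminfo
```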

Remediation:

If HugePages_Rsvd is 0: restart the Venice server so it properly reserves huge pages on startup.

Metaspace Memory Usage#

Metric: metaspace_memory_pool_used (JVM-level, from JMX java.lang:type=MemoryPool)

The JVM metaspace usage has exceeded the alert threshold. This indicates the application is either loading an excessive number of classes or experiencing a classloader memory leak. This applies to server, router, and controller processes.

Investigation steps:

Examine historical patterns to identify when the problem first appeared.
Review code and dependency changes made around that time period.
To pinpoint the problematic change, use a git bisect approach — deploy older versions to verify which change introduced the regression.

Remediation:

Revert the problematic change if identified.
Restart the affected node as a temporary mitigation.

File Descriptor Usage#

Metric: file_descriptor_usage (OS-level, from /proc/<PID>/fd)

The service is approaching its file descriptor limit. When services run out of file descriptors, they cannot open new files for writing or accept new network connections. This applies to server and router processes.

Investigation steps:

Check the file descriptor limit on the host:
```
cat /proc/<PID>/limits
```
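Check the current file descriptor usage by counting the entries under /proc/<PID>/fd (the line count of lsof -p <PID> also includes a header line and non-descriptor entries such as memory-mapped files, so it overstates the number of open descriptors):
```
ls /proc/<PID>/fd | wc -l
```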
Determine if the limit is too low or if the service is behaving abnormally.
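
To judge whether the limit is too low or the usage is abnormal, it helps to express the two numbers as a percentage. The sketch below is illustrative only; the function names are assumptions. It reads the soft "Max open files" limit from text in the `/proc/<PID>/limits` format and computes integer usage.

```shell
# Sketch: soft "Max open files" limit from /proc/<PID>/limits text on stdin.
# Columns are: name (3 words), soft limit, hard limit, units.
soft_fd_limit() {
  awk '/^Max open files/ { print $4 }'
}

# Integer percentage of the limit currently in use.
fd_usage_percent() {
  local used="$1" limit="$2"
  echo $(( used * 100 / limit ))
}

# Usage (replace <PID> with the process ID):
#   limit="$(soft_fd_limit < /proc/<PID>/limits)"
#   used="$(ls /proc/<PID>/fd | wc -l)"
#   fd_usage_percent "$used" "$limit"
```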

Remediation:

If the limit is low: increase the file descriptor limit in your system configuration (e.g., /etc/security/limits.conf).
If the service is consuming excessive file descriptors: investigate the root cause (connection leaks, file handle leaks, etc.).
As a temporary mitigation, add additional hosts to the cluster to distribute the load.

Participant Store Consumption Task Stuck or Dead#

Metric: <cluster>-participant_store_consumption_task--heartbeat

The participant store consumption task has stopped emitting heartbeats. This task processes participant messages that coordinate server-level operations. The alert can trigger because:

The consumption task is taking too long to process messages (may auto-recover).
The consumption task has died due to an unhandled exception.

Investigation steps:

Check server logs for exceptions related to the participant store consumption task.
Check if the heartbeat resumes on its own (transient delay).

Remediation:

Restart the Venice server to recover the task. If the task dies again after the restart, use the reproduced error in the server logs to find the root cause.

Router Alerts#

Unhealthy Host Count (Router Heartbeat)#

Metric: total--unhealthy_host_count_caused_by_router_heart_beat

The router has detected unhealthy backend server(s) via heartbeat checks. This means the router is unable to route requests to those servers, reducing serving capacity.

Investigation steps:

Identify the affected router node(s) and check if the issue is with the router or the backend servers.
Check whether there has been a recent upgrade on the router or server.

Remediation:

Take a heap dump and thread dump on the affected router node(s), then restart the router.
If the problem persists and there was a recent deployment, consider rolling back.

Router CPU Usage#

Metric: router_instantaneous_cpu_usage (OS/container-level, from process CPU monitoring)

Router CPU usage is elevated, which can cause increased latency and request timeouts. Average CPU is not a good indicator of router saturation — instantaneous/P95 CPU usage is a better signal.

Investigation steps:

Check if the CPU increase correlates with increased QPS or a new workload being added to the cluster.
Verify that the router fleet has sufficient capacity for peak load.

Remediation:

Add enough routers to the cluster so that P95 CPU usage stays under 90% during peak load. To estimate the required fleet size, scale linearly from current utilization: required routers ≈ current routers × current P95 CPU ÷ target P95 CPU.
If sustained growth requires fleet expansion, engage your capacity planning team.
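
The linear estimate for router count can be worked through as follows. This is a back-of-the-envelope sketch that assumes CPU usage scales linearly with load and that load is spread evenly across routers. The function name and the example numbers are hypothetical.

```shell
# Sketch: estimate the router fleet size needed to bring P95 CPU to a target.
# Arguments: current router count, current P95 CPU (%), target P95 CPU (%).
required_routers() {
  local current="$1" p95="$2" target="$3"
  # Ceiling of current * p95 / target, using integer arithmetic.
  echo $(( (current * p95 + target - 1) / target ))
}

# Usage: 20 routers at 95% P95 CPU, aiming for 80% to leave headroom under 90%:
#   required_routers 20 95 80   # prints 24, i.e. add 4 routers
```

Picking a target somewhat below the 90% ceiling leaves room for the uneven load distribution that the linear model ignores.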

Router Active Connection Count#

Metric: connection_pool--total_active_connection_count

The number of active connections to a router is elevated. This can indicate increased traffic or a single router receiving disproportionate load.

Investigation steps:

Check if QPS is increasing alongside the connection count.
If only one router in the cluster has significantly higher QPS and connection count, check if a load test or traffic migration is in progress.

Remediation:

If caused by uneven load distribution: investigate your load balancer configuration.
If caused by overall traffic growth: add more routers.

Router Memory (Cgroup) Usage#

Metric: cgroup_memory_usage_bytes (container/cgroup-level, from cAdvisor or kubelet)

Router memory usage is approaching the container/cgroup memory limit. High memory usage can degrade router performance and eventually cause OOM kills.

Investigation steps:

Identify which routers have the high memory issue.
Common causes:
- Planned/unplanned infrastructure maintenance (e.g., network maintenance on a rack). Check if there is ongoing maintenance that may be causing connection pooling issues.
- Software defects — take a heap dump first.
- Excessive logging — check if the router log file is growing rapidly, which can be triggered by issues in other components.

Remediation:

If only a few hosts are affected: restart the router on those hosts.
If the entire fleet is affected: engage your development team immediately to investigate.
For excessive logging: identify the logging pattern in the router log file and root-cause the issue that is triggering the excessive log output.

SSL Handshake Thread Pool Saturation#

Metric: ssl_handshake_thread_pool--active_thread_number

The number of active SSL handshake threads is growing continuously, indicating a possible SSL handshake storm. While the thread pool throttling protects the server from crashing, users will experience read availability drops. This applies to server and router processes.

Investigation steps:

Check whether the number of SSL handshakes keeps growing over time.
Check server/router logs for SSL-related errors.
Investigate which clients are performing frequent SSL handshakes and why.

Remediation:

Identify and address the root cause of the excessive SSL handshakes (e.g., client misconfiguration, connection pooling issues).
The built-in throttling will protect the server from crashing, but client-side read availability will be degraded until the root cause is resolved.

Read Path and Client Alerts#

Compute/Read Latency Spikes (P99)#

Metric: compute_storage_engine_read_compute_latency

Read compute latency on the server has spiked. This affects read-compute operations where the server performs computation (e.g., dot product, cosine similarity) on stored data before returning results.

Related metrics: compute_storage_engine_read_compute_deserialization_latency, compute_storage_engine_read_compute_serialization_latency.

Investigation steps:

Identify the affected host(s) from your monitoring dashboard.
Try collecting a JFR (Java Flight Recorder) profile on the affected host(s), one at a time:
```
jcmd <PID> JFR.start duration=10s filename=profile.jfr
```

Remediation:

Collecting a JFR profile may itself resolve the issue if it triggers JIT recompilation.
If JFR does not fix the spike, restart the Venice server on the affected node.

GET/BATCH_GET/BATCH_GET STREAMING Unhealthy Request Count#

Metric: unhealthy_request (per request type)

These metrics monitor the unhealthy (failed) request count for GET, BATCH_GET, and BATCH_GET STREAMING operations from the router's perspective. This is an aggregated view across all stores.

Investigation steps:

The aggregated metric does not show which stores are affected. Check per-store unhealthy request metrics to identify the impacted store(s).
Search server and router logs for the affected store names to identify what caused the unhealthy requests.

Remediation:

Address the root cause based on the logs (e.g., server-side exceptions, timeout issues, resource exhaustion).

Client-Side Unhealthy Requests#

Metric: unhealthy_request

This metric monitors unhealthy requests from the Venice client's perspective for each cluster. It catches issues between the client and the router layer that may not be visible from the router side alone.

Investigation steps:

Check if the issue correlates with router-side unhealthy request metrics.
Check for network issues between the client service and the Venice routers.
Check if the affected cluster's routers are healthy.

Remediation:

Address the underlying cause (router issues, network issues, client-side issues).
For client-side issues, engage the service owner to coordinate investigation.

Leader Heartbeat Delay#

Metric: heartbeat_delay_ms_leader-<region>

The leader replica's heartbeat is delayed, which may indicate that the StoreIngestionTask (SIT) thread for the affected store has died or is blocked.

Investigation steps:

Identify the problematic store name from the metrics dashboard.
Check the server logs for exceptions related to the StoreIngestionTask thread for that store. Filter by time range and exception type.
If the delay appears to be growing linearly, it is likely caused by a dead SIT thread.
If SIT is not the cause, investigate the ingestion path for other bottlenecks.
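
One way to test the "growing linearly" observation is to compare two readings of the delay metric. The sketch below assumes the delay is measured as the current time minus the timestamp of the last processed heartbeat, so that a thread processing no heartbeats shows a growth rate close to 1 ms of delay per ms of wall-clock time. The function name and sample values are hypothetical.

```shell
# Sketch: growth rate of heartbeat delay between two samples.
# Arguments: t1_ms delay1_ms t2_ms delay2_ms (sample times and delays, in ms).
delay_growth_rate() {
  awk -v t1="$1" -v d1="$2" -v t2="$3" -v d2="$4" 'BEGIN {
    if (t2 <= t1) { print "invalid"; exit }
    printf "%.2f\n", (d2 - d1) / (t2 - t1)
  }'
}

# Usage: delay went from 60 s to 360 s over a 300 s window:
#   delay_growth_rate 0 60000 300000 360000   # prints 1.00
```

A rate near 1.00 is consistent with a dead or fully blocked thread, while a rate well below 1 suggests the thread is still making progress but falling behind.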

Remediation:

Restart the Venice server on the affected host to recover the SIT thread.

Infrastructure and Kafka Alerts#

Down Instance Gauge#

Metric: DownInstanceGauge (emitted by Apache Helix)

Two or more Venice server instances are down in a cluster.

Investigation steps:

Check if there is planned infrastructure maintenance in progress.
Identify which instances are down.

Remediation:

If instances are down due to a crash: attempt to reboot the node.
If the node has bad hardware: swap it out, file a repair ticket, and engage your infrastructure team for replacement.
If planned maintenance is in progress: no action needed, but monitor for completion.

Kafka Partition Count#

Metric: kafka_partitions_count (Kafka broker-level, from JMX or Kafka metrics reporter)

This monitors the total number of Kafka partitions used by Venice. If it approaches the limit of what the Kafka cluster can handle, it can impact the entire write path.

Note: The alert threshold is generally set lower than the Kafka cluster's actual limit because early action is needed to prevent larger problems.

Investigation steps:

Check if there is an abnormal surge in partition count (e.g., due to a misconfiguration or runaway store creation).
If growth is normal (organic), coordinate with your Kafka team to understand the current limits and plan for scaling.

Remediation:

Abnormal surge: Investigate and address the root cause (e.g., clean up unused topics, fix misconfiguration).
Organic growth: Work with your Kafka team to:
1. Project when Venice will reach the Kafka limit.
2. Scale up the Kafka cluster.
3. Increase the alert threshold once the Kafka cluster can safely handle the higher count.
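
The projection in step 1 can be approximated with a simple linear extrapolation. The sketch below assumes a constant daily growth rate, which organic growth may not follow. The function name and the example numbers are hypothetical.

```shell
# Sketch: days until the partition count reaches a limit at a constant growth rate.
# Arguments: current partition count, Kafka limit, partitions added per day.
days_until_limit() {
  local current="$1" limit="$2" per_day="$3"
  if [ "$per_day" -le 0 ]; then echo "never"; return 0; fi
  if [ "$current" -ge "$limit" ]; then echo 0; return 0; fi
  # Ceiling of (limit - current) / per_day.
  echo $(( (limit - current + per_day - 1) / per_day ))
}

# Usage: 180000 partitions today, limit of 200000, growing by 250 per day:
#   days_until_limit 180000 200000 250   # prints 80
```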