Kafka troubleshooting gets difficult when the symptom is broad but the cause is buried somewhere across brokers, clients, storage, or the network. Dropping throughput, rising latency, growing consumer lag, and replica instability often look alike from the outside even when the root causes are entirely different.
This checklist is designed to guide a systematic production investigation. It helps you ask the right questions, inspect the most telling Kafka metrics first, and narrow the problem down faster when the picture gets complicated.
## The Foundational Mindset: Systematic Investigation
Before diving into the checklist, adopt this mindset:
- Define the Problem Clearly: What are the exact symptoms? When did they start? Is it impacting all topics/clients or a subset?
- Gather Evidence: Don’t guess. Collect logs, metrics, and configurations.
- Isolate the Scope: Is it a producer, consumer, or broker issue? Is it network, disk, or CPU?
- Correlate Events: Did the issue coincide with a deployment, configuration change, or infrastructure event?
- Reproduce (if possible): Can you reproduce the issue in a non-production environment?
*Diagram 1: General Troubleshooting Flow for Kafka Issues.*
## A Consultant’s 15-Point Diagnostic Checklist
This checklist is categorized by common problem areas. Start with the most relevant category based on initial symptoms.
### Broker & Cluster Health Diagnostics
1. **Check for Under-Replicated / Offline Partitions:**
   - Metrics: `kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions`, `kafka.controller:type=KafkaController,name=OfflinePartitionsCount`.
   - Why: A value > 0 indicates reduced fault tolerance, potential data-loss risk, or unavailable partitions.
   - Action: Examine broker logs on the affected leaders/replicas for errors (disk full, network issues, crashes). Check `IsrShrinksPerSec` for ISR instability. `kafka-topics.sh --describe --under-replicated-partitions` lists the affected partitions directly. Ensure brokers are alive and reachable.
2. **Verify Active Controller:**
   - Metric: `kafka.controller:type=KafkaController,name=ActiveControllerCount`.
   - Why: Must sum to exactly 1 across the cluster. If 0 or >1, the cluster is in a critical state (no elected controller, or split-brain).
   - Action: Check ZooKeeper (if used) or KRaft quorum health. Examine logs on all brokers for controller election messages or errors.
3. **Assess Broker Resource Saturation (CPU, Memory, Disk, Network):**
   - Metrics: OS-level CPU utilization (especially iowait), free memory (page cache!), disk queue depth/await times, network throughput/errors. Kafka JMX: `NetworkProcessorAvgIdlePercent`, `RequestHandlerAvgIdlePercent`.
   - Why: Overloaded brokers cause high latency and instability.
   - Action: Identify the bottleneck. If CPU, check compression, SSL, and thread pool sizes. If disk, check I/O patterns and disk health. If memory, check JVM heap vs. page cache allocation. If network, check bandwidth limits and MTU.
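As a quick triage aid, the two idle-percent metrics map directly onto broker thread pools. The sketch below turns them into a rule of thumb (the 30% alert threshold is a common monitoring convention, not an official Kafka default):

```python
def triage_broker_threads(network_idle: float, handler_idle: float) -> list[str]:
    """Flag likely thread-pool bottlenecks from JMX idle-percent samples (0.0-1.0).

    Low NetworkProcessorAvgIdlePercent -> network threads (num.network.threads) busy;
    low RequestHandlerAvgIdlePercent  -> I/O threads (num.io.threads) busy.
    """
    findings = []
    if network_idle < 0.3:  # illustrative alert threshold
        findings.append("network threads saturated: consider raising num.network.threads")
    if handler_idle < 0.3:  # illustrative alert threshold
        findings.append("request handlers saturated: consider raising num.io.threads")
    return findings

# Example: healthy network path, overloaded request handlers
print(triage_broker_threads(network_idle=0.85, handler_idle=0.12))
```

Values near 1.0 mean the pool is mostly idle; values near 0 mean requests are queuing behind busy threads.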
4. **Inspect Broker Logs for Recurring Errors or Warnings:**
   - Location: `logs/server.log`, `logs/controller.log`, `logs/state-change.log`.
   - Why: Logs often contain explicit error messages, stack traces, or warnings about misconfigurations, hardware issues, or connectivity problems.
   - Action: Look for keywords like ERROR, WARN, FATAL, Exception, timeout, “failed to”. Correlate timestamps with the problem’s onset.
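A first pass over the logs can be a plain grep. The snippet below demonstrates the pattern against an inline sample log; in practice, point it at `logs/server.log`:

```shell
# Create a small sample log to demonstrate the triage pattern
cat > /tmp/sample-server.log <<'EOF'
[2024-05-01 10:00:01] INFO Starting log cleaner
[2024-05-01 10:00:05] WARN Connection to node 2 could not be established
[2024-05-01 10:00:09] ERROR Error while appending records (kafka.server.ReplicaManager)
EOF

# Surface errors/warnings, then count occurrences by severity
grep -E 'ERROR|WARN|FATAL' /tmp/sample-server.log
grep -oE 'ERROR|WARN|FATAL' /tmp/sample-server.log | sort | uniq -c
```

The severity count is a cheap way to spot a flood of one recurring error versus scattered one-off warnings.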
5. **Review ZooKeeper/KRaft Quorum Health:**
   - Metrics (ZK): `zk_avg_latency`, `zk_outstanding_requests`, `zk_followers`, `zk_synced_followers`. Use the `mntr` or `srvr` four-letter words.
   - Metrics (KRaft): `kafka.raft:type=RaftManager,name=current-leader-id`, `...current-epoch`, `...high-watermark`. Examine controller logs.
   - Why: Kafka relies on its consensus layer. Issues here (e.g., quorum loss, high latency) cripple the cluster.
   - Action: Ensure quorum members are healthy, network connectivity between them is good, and they are not resource-starved.
### Producer-Side Diagnostics
6. **Analyze Producer Error Rates & Latency:**
   - Metrics: `record-error-rate`, `request-latency-avg/max/p99`, `errors-total`.
   - Why: High error rates or latencies indicate problems sending data.
   - Action: Check producer logs for specific exceptions (`TimeoutException`, `RecordTooLargeException`, `NotLeaderForPartitionException`). Verify broker connectivity and the `acks`, `retries`, `max.request.size`, and idempotence configuration.
7. **Check Producer Buffer Fullness & Request Pipelining:**
   - Metrics/Config: `buffer-available-bytes` (producer metric); `max.in.flight.requests.per.connection` (producer config).
   - Why: Full buffers (low `buffer-available-bytes`) or misconfigured pipelining can stall producers.
   - Action: Increase `buffer.memory` if it is chronically full and the brokers can handle the load. Ensure `max.in.flight.requests.per.connection` is appropriate (1 for strict ordering without idempotence, up to 5 with idempotence).
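For a rough floor on `buffer.memory`, a bandwidth-delay estimate works: the buffer must hold everything produced while requests are in flight. A sketch with illustrative inputs (the safety factor and sample numbers are assumptions, not Kafka defaults):

```python
def min_buffer_memory(bytes_per_sec: float, broker_latency_sec: float,
                      safety_factor: float = 2.0) -> int:
    """Estimate a floor for buffer.memory: data produced while requests
    are in flight, times a safety factor for bursts and retries."""
    return int(bytes_per_sec * broker_latency_sec * safety_factor)

# 50 MB/s produce rate, 200 ms observed request latency
print(min_buffer_memory(50_000_000, 0.2))  # 20000000 bytes (20 MB)
```

If the estimate approaches or exceeds the configured `buffer.memory` (32 MB by default), `send()` will start blocking for up to `max.block.ms` under load.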
### Consumer-Side Diagnostics
8. **Investigate High or Growing Consumer Lag:**
   - Metric: `records-lag-max` per consumer group/topic/partition (`kafka-consumer-groups.sh --describe --group <group>` shows per-partition lag directly).
   - Why: The most common symptom of consumers not keeping up with producers.
   - Action:
     - Is the processing logic slow? Optimize consumer code; consider async processing.
     - Too few consumer instances for the partition count? Scale out consumers.
     - Is `max.poll.records` too high for `max.poll.interval.ms`? Adjust these.
     - Is a downstream system the bottleneck?
     - Check consumer logs for errors or frequent rebalances.
9. **Look for Frequent Consumer Rebalances:**
   - Metrics: `join-rate`, `sync-rate`, `last-rebalance-seconds-ago`. Consumer logs showing “Revoking partitions” / “Assigning partitions”.
   - Why: Rebalances stop consumption. Frequent rebalances indicate instability (consumers crashing, `session.timeout.ms` too low, `max.poll.interval.ms` violations, or “fencing” with static membership when instance IDs conflict).
   - Action: Stabilize consumers. Increase timeouts if appropriate. Ensure a unique `group.instance.id` for each static member. Fix underlying consumer crashes.
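The lag arithmetic behind these checks is simple enough to sanity-check by hand: per-partition lag is the log-end offset minus the group’s committed offset. A minimal sketch over hypothetical offset snapshots (in practice the numbers come from `kafka-consumer-groups.sh --describe` or the AdminClient):

```python
def compute_lag(end_offsets: dict[int, int], committed: dict[int, int]) -> dict[int, int]:
    """Per-partition lag = log-end offset - committed offset.
    A partition with no committed offset yet is reported as fully lagging."""
    return {p: end - committed.get(p, 0) for p, end in end_offsets.items()}

end_offsets = {0: 1_500, 1: 2_000, 2: 900}   # hypothetical log-end offsets
committed   = {0: 1_500, 1: 1_200}           # partition 2 has no commit yet
lag = compute_lag(end_offsets, committed)
print(lag)                # {0: 0, 1: 800, 2: 900}
print(max(lag.values()))  # 900
```

`records-lag-max` is the maximum of these per-partition values, so a single stuck partition can dominate the metric even when the rest of the group is healthy.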
10. **Examine Consumer Fetch Behavior & Errors:**
    - Metrics: `fetch-latency-avg/max/p99`, `fetch-throttle-time-avg/max`, `records-consumed-rate`.
    - Why: High fetch latency points to broker load or network issues. Throttling indicates broker quotas are being hit.
    - Action: Check broker health and the network. If throttled, review quotas or consumer fetch rates.
### Network & Configuration Diagnostics
11. **Verify Network Connectivity & Performance (Client-Broker, Broker-Broker):**
    - Tools: `ping`, `traceroute`, `iperf`, `netstat`, `ss`; OS network error counters.
    - Why: Packet loss, high latency, or bandwidth saturation severely impacts Kafka.
    - Action: Test connectivity between all relevant nodes. Check for firewall issues, DNS resolution problems, MTU mismatches, and switch/router issues.
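Beyond `ping`, it is worth probing the broker’s actual TCP listener port, since ICMP can succeed while the listener itself is down or firewalled. A minimal stdlib probe (host and port below are placeholders):

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0

# Check a (hypothetical) broker's listener port
print(tcp_reachable("127.0.0.1", 9092))
```

Run it from each client host and from each broker toward its peers; an asymmetric result (reachable one way, not the other) usually points at firewall rules or routing.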
12. **Review Critical Configurations for Mismatches or Suboptimal Values:**
    - Areas: `message.max.bytes` (broker/topic) vs. client `max.request.size`/`max.partition.fetch.bytes`. Replication factors. Security settings (SSL, SASL). `acks`, `linger.ms`, `batch.size`.
    - Why: Incorrect or inconsistent configurations are a common source of problems.
    - Action: Systematically review configurations on brokers, topics, producers, and consumers. Ensure consistency where needed (e.g., security protocols).
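The size-related settings lend themselves to a mechanical cross-check. A sketch of such a lint (the config names are real Kafka settings; the values and failure descriptions are illustrative, and note that consumers since KIP-74 can make progress past an oversized first record, so the second check mainly matters for older clients):

```python
def check_size_limits(broker_message_max: int, producer_request_max: int,
                      consumer_partition_fetch_max: int) -> list[str]:
    """Cross-check message.max.bytes (broker/topic) against client size limits."""
    issues = []
    if producer_request_max > broker_message_max:
        issues.append("producer max.request.size > broker message.max.bytes: "
                      "broker may reject batches (RecordTooLargeException)")
    if consumer_partition_fetch_max < broker_message_max:
        issues.append("consumer max.partition.fetch.bytes < broker message.max.bytes: "
                      "older consumers could stall on a record they cannot fetch")
    return issues

# Broker allows 5 MB records; producer allows 10 MB; consumer fetches only 1 MB
for issue in check_size_limits(5_242_880, 10_485_760, 1_048_576):
    print(issue)
```

Running the same check against every producer and consumer config in the fleet catches drift that manual review tends to miss.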
13. **Check for Silent Data Loss (If Suspected):**
    - Symptoms: Gaps in data; applications behaving unexpectedly.
    - Why: Could be `acks=0` on the producer, unhandled producer errors, topic misconfiguration (e.g., `min.insync.replicas` too low with `acks=all`), or bugs in client logic.
    - Action: This is complex. Audit producer `acks` and error handling. Verify `min.insync.replicas`. Use tools to compare record counts/checksums where possible. This often requires deep application-level tracing.
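When counts must be compared across systems, even a crude key-set diff narrows the search. A sketch (the keys and helper are hypothetical; a real audit would stream keys from both ends rather than hold full sets in memory):

```python
import hashlib

def audit_gap(produced_keys, consumed_keys):
    """Return keys produced but never consumed (possible loss), plus a
    short digest of each set for cheap cross-team comparison."""
    produced, consumed = set(produced_keys), set(consumed_keys)

    def digest(keys):
        return hashlib.sha256("".join(sorted(keys)).encode()).hexdigest()[:12]

    return {
        "missing": sorted(produced - consumed),
        "produced_digest": digest(produced),
        "consumed_digest": digest(consumed),
    }

# Hypothetical audit: record "k2" never showed up downstream
print(audit_gap(["k1", "k2", "k3"], ["k1", "k3"]))
```

Matching digests rule out loss quickly without shipping full key lists between teams; mismatched digests justify the deeper application-level tracing mentioned above.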
### Environment & External Factor Diagnostics
14. **Look for “Noisy Neighbor” Issues or Shared Resource Contention:**
    - Context: Virtualized environments, Kubernetes, shared storage/network.
    - Why: Other applications or VMs consuming excessive resources can impact Kafka performance unpredictably.
    - Action: Monitor resource usage at the host/hypervisor/node level. Consider resource quotas, dedicated nodes, or QoS where possible.
15. **Correlate with External Events & Changes:**
    - Examples: Recent deployments (Kafka, clients, infrastructure), OS patching, hardware changes, network maintenance, dependency-service outages.
    - Why: Kafka issues are often triggered by external changes.
    - Action: Maintain a change log. Correlate issue start times with known events. Review deployment histories.
**Expert Insight: The Power of Correlated Metrics.** Isolated metrics rarely tell the whole story. The key to diagnosing complex issues is correlating metrics across different layers. For example, high producer latency combined with low broker CPU and high broker network queue time points to a saturated broker network card or an upstream network bottleneck, not a CPU issue on the broker itself. Use dashboards that allow overlaying multiple metrics.
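The overlay idea can even be quantified: compute correlations between metric series from different layers and chase the strongest relationship first. A minimal Pearson sketch over hypothetical per-minute samples mirroring the example above:

```python
def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-minute samples during an incident
producer_latency_ms = [12, 15, 40, 95, 120, 110]
broker_cpu_pct      = [35, 33, 36, 34, 33, 36]
net_queue_ms        = [1, 2, 9, 30, 41, 37]

print(round(pearson(producer_latency_ms, broker_cpu_pct), 2))  # -0.06 (weak)
print(round(pearson(producer_latency_ms, net_queue_ms), 2))    # 1.0 (strong)
```

Here the latency series moves with the network queue series, not with CPU, pointing the investigation at the network path rather than the broker’s processors.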
## When to Call in the Experts
While this checklist provides a strong diagnostic framework, some Kafka production issues are deeply intricate, requiring specialized knowledge of Kafka internals, distributed systems patterns, and advanced performance analysis techniques. If you’ve run through this checklist and are still struggling, or if the business impact is severe, engaging expert Kafka consultants like ActiveWizards can provide a rapid path to resolution and help implement long-term stability.
## Conclusion: From Chaos to Clarity
Troubleshooting Apache Kafka in production can be challenging, but a systematic diagnostic approach transforms it from a guessing game into a methodical investigation. This 15-point checklist, born from real-world consulting engagements, provides a structured path to identify root causes and restore your Kafka cluster’s health. Remember to combine this checklist with robust monitoring, clear problem definition, and iterative testing.
## Facing Stubborn Kafka Production Issues? ActiveWizards Can Help!
Our expert Kafka consultants specialize in diagnosing and resolving complex production problems, optimizing performance, and ensuring the stability of your critical data pipelines. Don’t let Kafka issues derail your business.