Kafka troubleshooting gets difficult when the symptom is broad but the cause is buried somewhere across brokers, clients, storage, or the network. Dropping throughput, rising latency, growing consumer lag, and replica instability often look alike from the outside even when the root causes are entirely different.
This checklist is designed to guide a systematic production investigation. It helps you ask the right questions, inspect the most telling Kafka metrics first, and narrow the problem down faster when the picture gets complicated.
## The Foundational Mindset: Systematic Investigation
Before diving into the checklist, adopt this mindset:
- Define the Problem Clearly: What are the exact symptoms? When did they start? Is it impacting all topics/clients or a subset?
- Gather Evidence: Don’t guess. Collect logs, metrics, and configurations.
- Isolate the Scope: Is it a producer, consumer, or broker issue? Is it network, disk, or CPU?
- Correlate Events: Did the issue coincide with a deployment, configuration change, or infrastructure event?
- Reproduce (if possible): Can you reproduce the issue in a non-production environment?
*Diagram 1: General Troubleshooting Flow for Kafka Issues.*
## A Consultant’s 15-Point Diagnostic Checklist
This checklist is categorized by common problem areas. Start with the most relevant category based on initial symptoms.
### Broker & Cluster Health Diagnostics
1. **Check for Under-Replicated / Offline Partitions:**
   - Metrics: `kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions`, `kafka.controller:type=KafkaController,name=OfflinePartitionsCount`.
   - Why: A value > 0 indicates reduced fault tolerance, potential data-loss risk, or unavailable partitions.
   - Action: Examine broker logs on the affected leaders/replicas for errors (disk full, network issues, crashes). Check `IsrShrinksPerSec` for ISR instability. `kafka-topics.sh --describe --under-replicated-partitions` lists the affected partitions directly. Ensure brokers are alive and reachable.
2. **Verify Active Controller:**
   - Metric: `kafka.controller:type=KafkaController,name=ActiveControllerCount`.
   - Why: Must sum to exactly 1 across the cluster. If 0 or >1, the cluster is in a critical state (no elected controller, or split-brain).
   - Action: Check ZooKeeper (if used) or KRaft quorum health. Examine logs on all brokers for controller election messages or errors.
3. **Assess Broker Resource Saturation (CPU, Memory, Disk, Network):**
   - Metrics: OS-level CPU utilization (especially iowait), free memory (page cache!), disk queue depth/await times, network throughput/errors. Kafka JMX: `NetworkProcessorAvgIdlePercent`, `RequestHandlerAvgIdlePercent`.
   - Why: Overloaded brokers cause high latency and instability.
   - Action: Identify the bottleneck. If CPU, check compression, SSL, and thread pool sizes. If disk, check I/O patterns and disk health. If memory, check JVM heap vs. page cache allocation. If network, check bandwidth limits and MTU.
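As a quick triage aid, the two idle-percent metrics map directly onto broker thread pools. The sketch below turns them into a rule of thumb (the 30% alert threshold is a common monitoring convention, not an official Kafka default):

```python
def triage_broker_threads(network_idle: float, handler_idle: float) -> list[str]:
    """Flag likely thread-pool bottlenecks from JMX idle-percent samples (0.0-1.0).

    Low NetworkProcessorAvgIdlePercent -> network threads (num.network.threads) busy;
    low RequestHandlerAvgIdlePercent  -> I/O threads (num.io.threads) busy.
    """
    findings = []
    if network_idle < 0.3:  # illustrative alert threshold
        findings.append("network threads saturated: consider raising num.network.threads")
    if handler_idle < 0.3:  # illustrative alert threshold
        findings.append("request handlers saturated: consider raising num.io.threads")
    return findings

# Example: healthy network path, overloaded request handlers
print(triage_broker_threads(network_idle=0.85, handler_idle=0.12))
```

Values near 1.0 mean the pool is mostly idle; values near 0 mean requests are queuing behind busy threads.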
4. **Inspect Broker Logs for Recurring Errors or Warnings:**
   - Location: `logs/server.log`, `logs/controller.log`, `logs/state-change.log`.
   - Why: Logs often contain explicit error messages, stack traces, or warnings about misconfigurations, hardware issues, or connectivity problems.
   - Action: Look for keywords like ERROR, WARN, FATAL, Exception, timeout, “failed to”. Correlate timestamps with the problem’s onset.
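A first pass over the logs can be a plain grep. The snippet below demonstrates the pattern against an inline sample log; in practice, point it at `logs/server.log`:

```shell
# Create a small sample log to demonstrate the triage pattern
cat > /tmp/sample-server.log <<'EOF'
[2024-05-01 10:00:01] INFO Starting log cleaner
[2024-05-01 10:00:05] WARN Connection to node 2 could not be established
[2024-05-01 10:00:09] ERROR Error while appending records (kafka.server.ReplicaManager)
EOF

# Surface errors/warnings, then count occurrences by severity
grep -E 'ERROR|WARN|FATAL' /tmp/sample-server.log
grep -oE 'ERROR|WARN|FATAL' /tmp/sample-server.log | sort | uniq -c
```

The severity count is a cheap way to spot a flood of one recurring error versus scattered one-off warnings.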
5. **Review ZooKeeper/KRaft Quorum Health:**
   - Metrics (ZK): `zk_avg_latency`, `zk_outstanding_requests`, `zk_followers`, `zk_synced_followers`. Use the `mntr` or `srvr` four-letter words.
   - Metrics (KRaft): `kafka.raft:type=RaftManager,name=current-leader-id`, `...current-epoch`, `...high-watermark`. Examine controller logs.
   - Why: Kafka relies on its consensus layer. Issues here (e.g., quorum loss, high latency) cripple the cluster.
   - Action: Ensure quorum members are healthy, network connectivity between them is good, and they are not resource-starved.
### Producer-Side Diagnostics
6. **Analyze Producer Error Rates & Latency:**
   - Metrics: `record-error-rate`, `request-latency-avg/max/p99`, `errors-total`.
   - Why: High error rates or latencies indicate problems sending data.
   - Action: Check producer logs for specific exceptions (`TimeoutException`, `RecordTooLargeException`, `NotLeaderForPartitionException`). Verify broker connectivity and the `acks`, `retries`, `max.request.size`, and idempotence configuration.
7. **Check Producer Buffer Fullness & Request Pipelining:**
   - Metrics/Config: `buffer-available-bytes` (producer metric); `max.in.flight.requests.per.connection` (producer config).
   - Why: Full buffers (low `buffer-available-bytes`) or misconfigured pipelining can stall producers.
   - Action: Increase `buffer.memory` if it is chronically full and the brokers can handle the load. Ensure `max.in.flight.requests.per.connection` is appropriate (1 for strict ordering without idempotence, up to 5 with idempotence).
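For a rough floor on `buffer.memory`, a bandwidth-delay estimate works: the buffer must hold everything produced while requests are in flight. A sketch with illustrative inputs (the safety factor and sample numbers are assumptions, not Kafka defaults):

```python
def min_buffer_memory(bytes_per_sec: float, broker_latency_sec: float,
                      safety_factor: float = 2.0) -> int:
    """Estimate a floor for buffer.memory: data produced while requests
    are in flight, times a safety factor for bursts and retries."""
    return int(bytes_per_sec * broker_latency_sec * safety_factor)

# 50 MB/s produce rate, 200 ms observed request latency
print(min_buffer_memory(50_000_000, 0.2))  # 20000000 bytes (20 MB)
```

If the estimate approaches or exceeds the configured `buffer.memory` (32 MB by default), `send()` will start blocking for up to `max.block.ms` under load.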
### Consumer-Side Diagnostics
8. **Investigate High or Growing Consumer Lag:**
   - Metric: `records-lag-max` per consumer group/topic/partition (`kafka-consumer-groups.sh --describe --group <group>` shows per-partition lag directly).
   - Why: The most common symptom of consumers not keeping up with producers.
   - Action:
     - Is the processing logic slow? Optimize consumer code; consider async processing.
     - Too few consumer instances for the partition count? Scale out consumers.
     - Is `max.poll.records` too high for `max.poll.interval.ms`? Adjust these.
     - Is a downstream system the bottleneck?
     - Check consumer logs for errors or frequent rebalances.
9. **Look for Frequent Consumer Rebalances:**
   - Metrics: `join-rate`, `sync-rate`, `last-rebalance-seconds-ago`. Consumer logs showing “Revoking partitions” / “Assigning partitions”.
   - Why: Rebalances stop consumption. Frequent rebalances indicate instability (consumers crashing, `session.timeout.ms` too low, `max.poll.interval.ms` violations, or “fencing” with static membership when instance IDs conflict).
   - Action: Stabilize consumers. Increase timeouts if appropriate. Ensure a unique `group.instance.id` for each static member. Fix underlying consumer crashes.
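The lag arithmetic behind these checks is simple enough to sanity-check by hand: per-partition lag is the log-end offset minus the group’s committed offset. A minimal sketch over hypothetical offset snapshots (in practice the numbers come from `kafka-consumer-groups.sh --describe` or the AdminClient):

```python
def compute_lag(end_offsets: dict[int, int], committed: dict[int, int]) -> dict[int, int]:
    """Per-partition lag = log-end offset - committed offset.
    A partition with no committed offset yet is reported as fully lagging."""
    return {p: end - committed.get(p, 0) for p, end in end_offsets.items()}

end_offsets = {0: 1_500, 1: 2_000, 2: 900}   # hypothetical log-end offsets
committed   = {0: 1_500, 1: 1_200}           # partition 2 has no commit yet
lag = compute_lag(end_offsets, committed)
print(lag)                # {0: 0, 1: 800, 2: 900}
print(max(lag.values()))  # 900
```

`records-lag-max` is the maximum of these per-partition values, so a single stuck partition can dominate the metric even when the rest of the group is healthy.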
10. **Examine Consumer Fetch Behavior & Errors:**
    - Metrics: `fetch-latency-avg/max/p99`, `fetch-throttle-time-avg/max`, `records-consumed-rate`.
    - Why: High fetch latency points to broker load or network issues. Throttling indicates broker quotas are being hit.
    - Action: Check broker health and the network. If throttled, review quotas or consumer fetch rates.
### Network & Configuration Diagnostics
11. **Verify Network Connectivity & Performance (Client-Broker, Broker-Broker):**
    - Tools: `ping`, `traceroute`, `iperf`, `netstat`, `ss`; OS network error counters.
    - Why: Packet loss, high latency, or bandwidth saturation severely impacts Kafka.
    - Action: Test connectivity between all relevant nodes. Check for firewall issues, DNS resolution problems, MTU mismatches, and switch/router issues.
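Beyond `ping`, it is worth probing the broker’s actual TCP listener port, since ICMP can succeed while the listener itself is down or firewalled. A minimal stdlib probe (host and port below are placeholders):

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0

# Check a (hypothetical) broker's listener port
print(tcp_reachable("127.0.0.1", 9092))
```

Run it from each client host and from each broker toward its peers; an asymmetric result (reachable one way, not the other) usually points at firewall rules or routing.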
12. **Review Critical Configurations for Mismatches or Suboptimal Values:**
    - Areas: `message.max.bytes` (broker/topic) vs. client `max.request.size`/`max.partition.fetch.bytes`. Replication factors. Security settings (SSL, SASL). `acks`, `linger.ms`, `batch.size`.
    - Why: Incorrect or inconsistent configurations are a common source of problems.
    - Action: Systematically review configurations on brokers, topics, producers, and consumers. Ensure consistency where needed (e.g., security protocols).
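The size-related settings lend themselves to a mechanical cross-check. A sketch of such a lint (the config names are real Kafka settings; the values and failure descriptions are illustrative, and note that consumers since KIP-74 can make progress past an oversized first record, so the second check mainly matters for older clients):

```python
def check_size_limits(broker_message_max: int, producer_request_max: int,
                      consumer_partition_fetch_max: int) -> list[str]:
    """Cross-check message.max.bytes (broker/topic) against client size limits."""
    issues = []
    if producer_request_max > broker_message_max:
        issues.append("producer max.request.size > broker message.max.bytes: "
                      "broker may reject batches (RecordTooLargeException)")
    if consumer_partition_fetch_max < broker_message_max:
        issues.append("consumer max.partition.fetch.bytes < broker message.max.bytes: "
                      "older consumers could stall on a record they cannot fetch")
    return issues

# Broker allows 5 MB records; producer allows 10 MB; consumer fetches only 1 MB
for issue in check_size_limits(5_242_880, 10_485_760, 1_048_576):
    print(issue)
```

Running the same check against every producer and consumer config in the fleet catches drift that manual review tends to miss.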
13. **Check for Silent Data Loss (If Suspected):**
    - Symptoms: Gaps in data; applications behaving unexpectedly.
    - Why: Could be `acks=0` on the producer, unhandled producer errors, topic misconfiguration (e.g., `min.insync.replicas` too low with `acks=all`), or bugs in client logic.
    - Action: This is complex. Audit producer `acks` and error handling. Verify `min.insync.replicas`. Use tools to compare record counts/checksums where possible. This often requires deep application-level tracing.
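When counts must be compared across systems, even a crude key-set diff narrows the search. A sketch (the keys and helper are hypothetical; a real audit would stream keys from both ends rather than hold full sets in memory):

```python
import hashlib

def audit_gap(produced_keys, consumed_keys):
    """Return keys produced but never consumed (possible loss), plus a
    short digest of each set for cheap cross-team comparison."""
    produced, consumed = set(produced_keys), set(consumed_keys)

    def digest(keys):
        return hashlib.sha256("".join(sorted(keys)).encode()).hexdigest()[:12]

    return {
        "missing": sorted(produced - consumed),
        "produced_digest": digest(produced),
        "consumed_digest": digest(consumed),
    }

# Hypothetical audit: record "k2" never showed up downstream
print(audit_gap(["k1", "k2", "k3"], ["k1", "k3"]))
```

Matching digests rule out loss quickly without shipping full key lists between teams; mismatched digests justify the deeper application-level tracing mentioned above.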
### Environment & External Factor Diagnostics
14. **Look for “Noisy Neighbor” Issues or Shared Resource Contention:**
    - Context: Virtualized environments, Kubernetes, shared storage/network.
    - Why: Other applications or VMs consuming excessive resources can impact Kafka performance unpredictably.
    - Action: Monitor resource usage at the host/hypervisor/node level. Consider resource quotas, dedicated nodes, or QoS where possible.
15. **Correlate with External Events & Changes:**
    - Examples: Recent deployments (Kafka, clients, infrastructure), OS patching, hardware changes, network maintenance, dependency-service outages.
    - Why: Kafka issues are often triggered by external changes.
    - Action: Maintain a change log. Correlate issue start times with known events. Review deployment histories.
**Expert Insight: The Power of Correlated Metrics.** Isolated metrics rarely tell the whole story. The key to diagnosing complex issues is correlating metrics across different layers. For example, high producer latency combined with low broker CPU and high broker network queue time points to a saturated broker network card or an upstream network bottleneck, not a CPU issue on the broker itself. Use dashboards that allow overlaying multiple metrics.
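The overlay idea can even be quantified: compute correlations between metric series from different layers and chase the strongest relationship first. A minimal Pearson sketch over hypothetical per-minute samples mirroring the example above:

```python
def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-minute samples during an incident
producer_latency_ms = [12, 15, 40, 95, 120, 110]
broker_cpu_pct      = [35, 33, 36, 34, 33, 36]
net_queue_ms        = [1, 2, 9, 30, 41, 37]

print(round(pearson(producer_latency_ms, broker_cpu_pct), 2))  # -0.06 (weak)
print(round(pearson(producer_latency_ms, net_queue_ms), 2))    # 1.0 (strong)
```

Here the latency series moves with the network queue series, not with CPU, pointing the investigation at the network path rather than the broker’s processors.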
## When to Call in the Experts
While this checklist provides a strong diagnostic framework, some Kafka production issues are deeply intricate, requiring specialized knowledge of Kafka internals, distributed systems patterns, and advanced performance analysis techniques. If you’ve run through this checklist and are still struggling, or if the business impact is severe, engaging expert Kafka consultants like ActiveWizards can provide a rapid path to resolution and help implement long-term stability.
## Conclusion: From Chaos to Clarity
Troubleshooting Apache Kafka in production can be challenging, but a systematic diagnostic approach transforms it from a guessing game into a methodical investigation. This 15-point checklist, born from real-world consulting engagements, provides a structured path to identify root causes and restore your Kafka cluster’s health. Remember to combine this checklist with robust monitoring, clear problem definition, and iterative testing.
## Facing Stubborn Kafka Production Issues? ActiveWizards Can Help!
Our expert Kafka consultants specialize in diagnosing and resolving complex production problems, optimizing performance, and ensuring the stability of your critical data pipelines. Don’t let Kafka issues derail your business.