Kafka monitoring with Prometheus and Grafana is most useful when it goes beyond generic host dashboards and focuses on the Kafka metrics that actually drive incidents: replication state, request latency, consumer lag, JVM behavior, and the infrastructure surrounding the brokers.
Prometheus, Grafana, and Telegraf remain a practical stack for Kafka observability because each tool solves a clear part of the problem:
- Kafka and its clients expose JVM and application metrics
- Prometheus scrapes and stores time-series data
- Telegraf collects host and system metrics that Kafka itself does not expose
- Grafana turns those metrics into dashboards, alerts, and operational context
This article updates the older setup-oriented guide into a production-oriented monitoring pattern you can still use in 2026.
The Monitoring Architecture
The cleanest setup is usually:
- expose Kafka broker and client metrics through JMX
- convert those metrics into a Prometheus-friendly endpoint with a JMX exporter
- collect host-level metrics with Telegraf
- scrape both sources from Prometheus
- visualize and alert from Grafana
That gives you two different layers of signal:
- Kafka layer: broker requests, replication state, fetch behavior, producer and consumer health
- Infrastructure layer: CPU saturation, disk latency, memory pressure, filesystem usage, network throughput
If you skip the infrastructure layer, you can miss the real cause of Kafka incidents. A broker may look unhealthy because of I/O wait, noisy neighbors, or network pressure rather than a Kafka configuration bug.
What to Monitor on Kafka
The highest-value Kafka metrics usually fall into four groups.
1. Broker Request Path
These metrics tell you whether the broker is keeping up with client traffic:
- request rates by API
- request queue time
- request handler idle time
- network processor idle time
- produce and fetch latency
- bytes in and bytes out
If these deteriorate, producers and consumers feel it quickly.
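As a concrete example, the PromQL queries below sketch two of these signals once the JMX exporter is in place. The metric names depend entirely on the rules in your kafka.yml, so treat them as placeholders to adapt rather than copy.

```promql
# Request handler idle ratio per broker (values near 0 mean the broker is saturated).
# Metric name assumes a common JMX exporter rule set; yours may differ.
avg by (instance) (kafka_server_kafkarequesthandlerpool_requesthandleravgidlepercent)

# Produce request rate per broker over the last five minutes (again, name depends on your rules).
sum by (instance) (rate(kafka_network_requestmetrics_requests_total{request="Produce"}[5m]))
```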
2. Partition and Replication Health
These metrics tell you whether the cluster is safe, not just fast:
- under-replicated partitions
- offline partitions
- in-sync replica counts
- leader elections
- replica fetch lag
Fast throughput means little if replicas are falling behind or partitions are going offline during failures.
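One way to act on these signals is a Prometheus alert rule. The sketch below assumes JMX exporter rules that publish under-replicated and offline partition counts under the names shown; substitute whatever your configuration actually emits.

```yaml
groups:
  - name: kafka-replication
    rules:
      - alert: KafkaUnderReplicatedPartitions
        # Metric name depends on your JMX exporter rules.
        expr: sum(kafka_server_replicamanager_underreplicatedpartitions) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Under-replicated partitions detected"
      - alert: KafkaOfflinePartitions
        expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Offline partitions detected"
```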
3. Consumer Health
Many Kafka incidents are actually consumer incidents. Monitor:
- consumer lag
- rebalance frequency
- poll and fetch behavior
- records consumed rate
- commit latency and failures
Consumer lag should always be interpreted alongside throughput and partition skew. A single lag number without that context is often misleading. For a metric-focused companion article, see kafka-monitoring-key-metrics-guide.
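To put that advice into practice, the queries below sketch lag alongside throughput in PromQL. They assume a separate lag exporter publishing the hypothetical metric and label names shown; substitute whatever your exporter emits.

```promql
# Absolute lag per consumer group and topic (metric and label names depend on your lag exporter).
sum by (consumergroup, topic) (kafka_consumergroup_lag)

# Roughly the same lag expressed as seconds of catch-up at the current consume rate,
# which is usually more meaningful than a raw record count.
sum by (consumergroup, topic) (kafka_consumergroup_lag)
  /
sum by (consumergroup, topic) (rate(kafka_consumergroup_current_offset[5m]))
```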
4. JVM and Host Signals
Kafka is still a JVM application running on real machines. Monitor:
- heap usage and GC pauses
- open file descriptors
- disk throughput and disk latency
- page cache pressure
- network throughput and packet errors
- CPU usage and I/O wait
These infrastructure signals are where Telegraf earns its keep.
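With the Telegraf inputs shown later in this article, those host signals can be queried directly. The names below follow Telegraf's default measurement_field naming through the prometheus_client output, but verify them against your own /metrics endpoint.

```promql
# I/O wait as a percentage of total CPU time per host (Telegraf cpu input).
cpu_usage_iowait{cpu="cpu-total"}

# Time each disk spent servicing I/O, as a rate (Telegraf diskio input; io_time is in milliseconds).
rate(diskio_io_time[5m])

# Heap usage from the JMX exporter agent's built-in JVM collectors (name may vary by exporter version).
jvm_memory_bytes_used{area="heap"}
```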
Why Prometheus, Telegraf, and Grafana Still Work Well Together
This stack remains relevant because it is simple and composable.
Prometheus
Prometheus gives you pull-based collection, service discovery, alert rules, and a query model that works well for infrastructure and Kafka metrics. It is especially good at tracking rates, error trends, and lag over time rather than only showing raw point-in-time numbers.
Telegraf
Telegraf fills the host-metrics gap. It is useful for collecting:
- CPU, memory, disk, filesystem, and network metrics
- container or VM metrics
- supporting signals from OS and adjacent services
Telegraf can expose those metrics on a Prometheus-compatible endpoint, which keeps your scraping model consistent.
Grafana
Grafana is where operators correlate the signals:
- consumer lag rising
- broker request time increasing
- disk I/O wait spiking
- replication health degrading
That correlation is the difference between a dashboard and an observability workflow.
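Keeping Grafana itself under configuration helps that workflow survive rebuilds. The snippet below is a minimal datasource provisioning file; the file path and Prometheus URL are placeholders for your own environment.

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml (path assumes a default Grafana install)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.internal:9090   # placeholder address
    isDefault: true
```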
A Modern Instrumentation Pattern
The older version of this article used hard-coded package versions and ZooKeeper-era examples. That ages badly. A better approach is to keep the setup generic.
Kafka JMX Export
Kafka metrics are commonly exposed through JMX and then translated for Prometheus scraping by a Java agent.
KAFKA_OPTS="-javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=7071:/etc/jmx/kafka.yml"Use the same pattern for JVM-based producers, consumers, Connect workers, or stream-processing applications when they need dedicated metrics endpoints.
If you are running modern Kafka in KRaft mode, that does not change the basic observability approach. You still monitor broker request behavior, controller health, replication state, and client-side metrics. The storage and control plane changed; the need for strong telemetry did not.
Prometheus Scrape Jobs
Your Prometheus config should treat Kafka and host metrics as separate targets.
```yaml
scrape_configs:
  - job_name: kafka-brokers
    static_configs:
      - targets:
          - broker-1.internal:7071
          - broker-2.internal:7071
          - broker-3.internal:7071

  - job_name: telegraf-hosts
    static_configs:
      - targets:
          - broker-1.internal:9273
          - broker-2.internal:9273
          - broker-3.internal:9273
```

Telegraf Prometheus Client Output
Expose Telegraf metrics on a local HTTP endpoint so Prometheus can scrape them.
```toml
[[outputs.prometheus_client]]
  listen = ":9273"

[[inputs.cpu]]
  percpu = true
  totalcpu = true

[[inputs.mem]]

[[inputs.disk]]

[[inputs.diskio]]

[[inputs.net]]
```

That gives you the minimum infrastructure layer most Kafka teams need.
Dashboards That Actually Help During Incidents
The most useful Grafana dashboards are not the prettiest ones. They are the ones that reduce diagnosis time.
A good operational dashboard usually includes:
- cluster throughput and request latency
- under-replicated and offline partition counts
- broker CPU, disk, and network by node
- consumer lag by group and topic
- JVM heap and GC panels
- alerts or annotations for deploys, rebalances, and broker restarts
Avoid building one giant dashboard with every available metric. Split them by use:
- broker health
- consumer health
- capacity and infrastructure
- incident drill-down
Common Monitoring Mistakes
Several monitoring mistakes show up repeatedly in Kafka estates:
- treating consumer lag as the only health metric
- ignoring disk latency while focusing on broker CPU
- not watching replication health during traffic spikes
- scraping only brokers and not producers, consumers, or Connect
- building dashboards without alert thresholds or runbook context
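One way to address that last point is to attach thresholds and runbook links directly to the alert rules. The sketch below assumes a lag metric from an external exporter and a hypothetical internal runbook URL; both are placeholders.

```yaml
groups:
  - name: kafka-consumers
    rules:
      - alert: ConsumerGroupLagHigh
        # Threshold and metric name are placeholders; tune per group and exporter.
        expr: sum by (consumergroup) (kafka_consumergroup_lag) > 100000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Consumer group {{ $labels.consumergroup }} is falling behind"
          runbook_url: "https://wiki.example.internal/runbooks/kafka-consumer-lag"
```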
Another common mistake is benchmarking and monitoring separately. In practice, they should reinforce each other. If you are tuning Kafka performance, also see kafka-benchmarking-methodologies-and-tools-for-performance.
What “Good” Looks Like
You do not need a perfect observability platform on day one. You need a stack that answers these questions quickly:
- Are brokers healthy?
- Is the cluster safe?
- Are consumers keeping up?
- Is the bottleneck inside Kafka or outside it?
- Which host, topic, partition, or client is responsible?
Prometheus, Telegraf, and Grafana still form a solid answer when they are used with that operational goal in mind.
Conclusion
Kafka monitoring works best when you combine Kafka-native metrics with infrastructure telemetry and then visualize them in a way that supports incident response, capacity planning, and tuning. Prometheus stores the time series, Telegraf fills the host-level gap, and Grafana gives operators the correlation layer.
The stack is not new, but it is still effective. What changed is the standard for using it: modern Kafka teams need production-grade dashboards, alerting, and diagnosis flows rather than a handful of setup commands.
Need Help Building a Production Monitoring Stack for Kafka?
ActiveWizards helps teams design Kafka observability, alerting, and dashboarding systems for reliable production operations.