This lesson is where many candidates overreact. Lag, throughput problems, disk alarms, and network slowness all feel urgent, but the best next step depends on which constraint is actually binding.
Triage chooser
| Symptom | Strongest first focus |
| --- | --- |
| consumer lag rising | consumer speed, partition count, rebalance churn, downstream bottlenecks |
| producer timeout | broker saturation, ISR health, and network path |
| disk pressure | storage policy, capacity, segment growth, and broker health |
| unstable throughput | network path, broker load, and client behavior |
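To make "consumer lag rising" concrete, here is a minimal sketch of how lag is usually defined: per partition, the log end offset minus the consumer group's committed offset. The function names and sample offsets are hypothetical and the values are made up for illustration. Treating a partition with no committed offset as starting from 0 is an assumption of this sketch, not Kafka behavior.

```python
def partition_lag(end_offsets, committed_offsets):
    """Lag per partition: log end offset minus the group's committed offset.

    Assumption for this sketch: a partition with no committed offset is
    treated as if the group had committed offset 0.
    """
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def lag_is_rising(total_lag_samples):
    """True only if total lag grew between every consecutive pair of samples."""
    if len(total_lag_samples) < 2:
        return False  # a single sample cannot show a trend
    pairs = zip(total_lag_samples, total_lag_samples[1:])
    return all(later > earlier for earlier, later in pairs)


# Hypothetical snapshot of a three-partition topic.
end_offsets = {0: 1500, 1: 1480, 2: 1510}
committed = {0: 1400, 1: 1475, 2: 1200}

lag = partition_lag(end_offsets, committed)
print(lag)                # {0: 100, 1: 5, 2: 310}
print(sum(lag.values()))  # 415
print(lag_is_rising([120, 260, 415]))  # True
```

A single large lag number is less informative than the trend. Lag that is steady may only reflect batch size, while lag that keeps growing means consumption is slower than production.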
What the exam is really testing
| If the scenario shows… | Strong reading |
| --- | --- |
| lag with healthy cluster replication | consumer-side or downstream constraint may dominate |
| timeout during degraded cluster health | durability or broker saturation may be the real cause |
| storage alarms | retention and capacity may matter more than arbitrary tuning |
| rebalance churn | consumer-group stability is under test |
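For the storage-alarm row, a rough back-of-the-envelope estimate often shows why retention and capacity dominate. The sketch below assumes time-based retention and a steady ingest rate, and it ignores compression, index files, and the fact that retention is applied to whole segments. The function names and figures are hypothetical.

```python
def steady_state_bytes(ingest_bytes_per_sec, retention_hours, replication_factor):
    """Approximate cluster-wide disk footprint once retention is in steady state."""
    return ingest_bytes_per_sec * retention_hours * 3600 * replication_factor


def hours_until_full(free_bytes, net_growth_bytes_per_sec):
    """Hours until free space runs out, or None if usage is not growing."""
    if net_growth_bytes_per_sec <= 0:
        return None
    return free_bytes / net_growth_bytes_per_sec / 3600


# Hypothetical topic: 5 MB/s ingest, 72 h retention, replication factor 3.
footprint = steady_state_bytes(5_000_000, 72, 3)
print(footprint)  # 3888000000000 bytes, roughly 3.9 TB

# Hypothetical cluster: 540 GB free, growing by 15 MB/s net of deletions.
print(hours_until_full(540_000_000_000, 15_000_000))  # 10.0
```

If the projected footprint already exceeds provisioned capacity, the fix lies in retention policy or capacity, and throughput tuning will not clear the alarm.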
Common traps
| Trap | Better rule |
| --- | --- |
| cutting durability first to improve throughput | restore health before weakening guarantees |
| changing retention before confirming disk is the real cause | diagnose the binding constraint first |
| treating all lag as “add more consumers” | partition count, downstream speed, and rebalance behavior all matter |
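The "add more consumers" trap can be illustrated with simple arithmetic. Within one consumer group, each partition is assigned to at most one consumer, so consumers beyond the partition count sit idle. The sketch below assumes partitions are evenly loaded and that every active consumer processes at the same rate. The names and numbers are hypothetical.

```python
def active_consumers(consumer_count, partition_count):
    """Consumers that can actually be assigned work within one group."""
    return min(consumer_count, partition_count)


def drain_seconds(lag_messages, produce_rate, per_consumer_rate,
                  consumer_count, partition_count):
    """Seconds to clear the backlog, or None if lag will never drain."""
    consume_rate = active_consumers(consumer_count, partition_count) * per_consumer_rate
    net_rate = consume_rate - produce_rate
    if net_rate <= 0:
        return None  # consumers cannot keep up; lag stays flat or grows
    return lag_messages / net_rate


# 6 partitions, producers at 2500 msg/s, backlog of 90,000 messages.
print(drain_seconds(90_000, 2500, 500, 6, 6))   # 180.0
# Doubling consumers to 12 changes nothing: only 6 can be assigned partitions.
print(drain_seconds(90_000, 2500, 500, 12, 6))  # 180.0
# A slower downstream (400 msg/s per consumer) means lag never drains.
print(drain_seconds(90_000, 2500, 400, 12, 6))  # None
```

In the last case the constraint is downstream speed, so the useful change is faster processing or more partitions. Adding group members does not help.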
Decision order that usually wins
1. Start by asking whether the binding constraint is consumers, brokers, disk, or network.
2. If the cluster is healthy, lag usually belongs first in the consumer or downstream lane.
3. If the cluster is degraded, do not weaken durability controls before understanding replication and broker health.
4. For disk alarms, separate storage policy and growth rate from generic performance tuning.
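The decision order above can be sketched as a small function. This is only one possible encoding. The `Signals` fields, the returned labels, and the choice to check the disk alarm before cluster health are assumptions made for illustration, not a rule stated by the lesson.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical summary of what a scenario reports."""
    cluster_healthy: bool  # e.g. replication in sync, brokers not saturated
    lag_rising: bool
    disk_alarm: bool


def first_focus(signals):
    """Return the lane to investigate first; never a tuning action."""
    if signals.disk_alarm:
        # Storage policy and growth rate, not generic performance tuning.
        return "disk"
    if not signals.cluster_healthy:
        # Understand replication and broker health before touching durability.
        return "brokers"
    if signals.lag_rising:
        # Healthy cluster plus lag points at consumers or downstream systems.
        return "consumers"
    # Nothing else stands out: look at the network path and client behavior.
    return "network"


print(first_focus(Signals(cluster_healthy=True, lag_rising=True, disk_alarm=False)))
# consumers
print(first_focus(Signals(cluster_healthy=False, lag_rising=True, disk_alarm=False)))
# brokers
```

Note that the function only names where to look. It deliberately returns no action such as lowering acknowledgement or retention settings, which matches the rule of diagnosing the binding constraint before changing guarantees.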
Quiz