Confluent CCAAK Cheat Sheet: Kafka Admin, Cluster Ops, and Security

Confluent CCAAK cheat sheet for Kafka admin, cluster ops, security, traps, and final review.

Use this for last-mile review. CCAAK rewards the answer that protects durability, cluster safety, and reversible operations instead of treating Kafka like a pile of isolated configs.

ISR: In-sync replicas currently caught up enough to count toward safe acknowledgements.

URP: Under-replicated partitions, meaning partitions whose in-sync replica set is smaller than their assigned replica set because one or more followers have fallen behind or gone offline.

Controller: Broker role responsible for metadata leadership and partition-state coordination.

CCAAK answer sequence

Use this when the stem mixes durability, security, topic behavior, or operations.

    flowchart TD
        S["Scenario"] --> D["Classify the durability lane"]
        D --> C["Check connectivity or security lane"]
        C --> T["Check topic behavior lane"]
        T --> O["Check operations lane and evidence"]

Read the question in this order

  1. Durability lane: acks, ISR, min.insync.replicas, replication factor, and leader safety.
  2. Connectivity or security lane: listeners, advertised.listeners, TLS, SASL, and ACL boundaries.
  3. Topic behavior lane: partitions, retention, compaction, and cleanup intent.
  4. Operations lane: lag, URP, offline partitions, maintenance safety, and recovery order.

Fastest 10-minute review

| If the question says… | Strongest first lane |
| --- | --- |
| safe acknowledgements or data loss tolerance | acks, ISR, and min.insync.replicas |
| producer says broker is up but clients still fail | advertised.listeners, DNS, TLS, and SASL alignment |
| cluster feels unstable | controller health, broker availability, URP, and leader election |
| replay window or storage pressure | retention, compaction, and segment growth |
| consumer lag or throughput issue | partitions, consumer speed, broker health, disk, and network path |
| rolling restart or config change | smallest reversible change with health checks between steps |

Durability control loop

    flowchart LR
        P["Producer"] --> A["acks policy"]
        A --> L["Leader replica"]
        L --> I["ISR members"]
        I --> M["min.insync.replicas gate"]
        M --> S["Safe acknowledgement or write failure"]

Write-safety matrix

| Control | What it changes | Strong reminder | Common trap |
| --- | --- | --- | --- |
| acks=0 | producer does not wait for broker ack | low safety and low confirmation | treating it like normal production durability |
| acks=1 | leader-only acknowledgement | better than no ack, still weaker than ISR-based safety | assuming it is enough for critical data |
| acks=all | waits for ISR-based confirmation | strongest normal durability lane | forgetting it depends on healthy ISR and min.insync.replicas |
| min.insync.replicas | minimum ISR required for safe writes | raises write safety when paired with acks=all | setting it high without considering write availability |
| replication factor | total copies | improves resilience | thinking it replaces correct producer or topic settings |
| unclean.leader.election.enable | whether an out-of-sync replica may lead | usually keep false to protect data | enabling it to hide availability problems |

One-sentence rule

Replication factor improves resilience, but safe writes still depend on producer acks, ISR health, and min.insync.replicas.
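The rule above can be sketched as a small decision function. This is a simplified model for review, not broker code: the `PartitionState` class, the function names, and the outcome strings are all hypothetical, and the model assumes the partition still has a live leader.

```python
from dataclasses import dataclass


@dataclass
class PartitionState:
    replication_factor: int   # total replicas assigned to the partition
    isr_size: int             # replicas currently in sync, leader included
    min_insync_replicas: int  # topic-level or broker-default setting


def write_outcome(acks: str, p: PartitionState) -> str:
    """Describe what the producer learns about a write under each acks mode."""
    if acks == "0":
        return "unconfirmed: producer did not wait"
    if acks == "1":
        # min.insync.replicas is not checked for leader-only acknowledgements.
        return "acknowledged by leader only"
    if acks == "all":
        if p.isr_size < p.min_insync_replicas:
            return "rejected: not enough in-sync replicas"
        return f"acknowledged after {p.isr_size} in-sync replicas"
    raise ValueError(f"unknown acks value: {acks!r}")


def replicas_you_can_lose(p: PartitionState) -> int:
    """How many in-sync replicas can drop out before acks=all writes fail."""
    return max(p.isr_size - p.min_insync_replicas, 0)


healthy = PartitionState(replication_factor=3, isr_size=3, min_insync_replicas=2)
degraded = PartitionState(replication_factor=3, isr_size=1, min_insync_replicas=2)

print(write_outcome("all", healthy))    # acknowledged after 3 in-sync replicas
print(write_outcome("all", degraded))   # rejected: not enough in-sync replicas
print(write_outcome("1", degraded))     # acknowledged by leader only
print(replicas_you_can_lose(healthy))   # 1
```

Note that the degraded partition still accepts `acks=1` writes even though only one copy exists, which is exactly the trap in the matrix: replication factor 3 alone did not make that write safe.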

Topic behavior and storage chooser

| Requirement | Strongest first fit | Why |
| --- | --- | --- |
| more consumer parallelism | more partitions | consumer-group concurrency follows partition count |
| preserve full history for replay or audit | delete retention policy | event log semantics |
| keep latest value per key | compaction | changelog or latest-state semantics |
| preserve ordering for one entity | stable key | ordering is per partition only |
| reduce storage growth safely | retention and segment policy review | storage pressure is often a policy problem |
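To see why a stable key preserves per-entity ordering, here is a minimal sketch of keyed partitioning. The `pick_partition` helper and the sample keys are hypothetical, and CRC32 is a stand-in hash chosen for the illustration; Kafka's default partitioner uses a different hash (murmur2), so the partition numbers here will not match a real cluster.

```python
import zlib


def pick_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a deterministic hash (CRC32 stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [
    ("order-42", "created"),
    ("order-7", "created"),
    ("order-42", "paid"),
    ("order-42", "shipped"),
]

# Append each event to the partition chosen by its key, as a producer would.
partitions = {}
for key, value in events:
    partitions.setdefault(pick_partition(key, 6), []).append((key, value))

# Every "order-42" event landed in one partition, so its order is preserved.
order_42 = [v for records in partitions.values() for k, v in records if k == "order-42"]
print(order_42)  # ['created', 'paid', 'shipped']
```

Because the mapping depends on the partition count, adding partitions later can send an existing key to a different partition, which is worth remembering when a question combines "more partitions" with "preserve ordering".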

Retention and compaction table

| Topic intent | Better fit | Trap to avoid |
| --- | --- | --- |
| immutable event stream | delete retention | using compaction and losing replay history |
| latest state per entity | compaction | expecting compacted topic to preserve full event history |
| long replay window | longer retention and capacity planning | shrinking retention as the first storage fix |
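The difference between the two cleanup intents can be illustrated with a toy in-memory log. This is a sketch under simplifying assumptions: the `Record` type and both functions are hypothetical, and a real broker applies retention to whole log segments and runs compaction asynchronously rather than record by record.

```python
from typing import NamedTuple


class Record(NamedTuple):
    offset: int
    timestamp_ms: int
    key: str
    value: str


def apply_delete_retention(log, now_ms, retention_ms):
    """Keep every record that is still inside the retention window."""
    return [r for r in log if now_ms - r.timestamp_ms <= retention_ms]


def apply_compaction(log):
    """Keep only the newest record for each key, in offset order."""
    latest = {}
    for r in log:
        latest[r.key] = r  # later offsets overwrite earlier ones
    return sorted(latest.values(), key=lambda r: r.offset)


log = [
    Record(0, 1_000, "user-1", "plan=free"),
    Record(1, 2_000, "user-1", "plan=pro"),
    Record(2, 3_000, "user-2", "plan=free"),
]

kept_by_retention = apply_delete_retention(log, now_ms=3_500, retention_ms=3_000)
kept_by_compaction = apply_compaction(log)

print([r.offset for r in kept_by_retention])   # [0, 1, 2]
print([r.offset for r in kept_by_compaction])  # [1, 2]
```

Delete retention keeps the full history for `user-1` while it is inside the window; compaction drops the superseded `plan=free` record, so replaying the compacted topic shows only the latest state.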

Listener and security stack

| Control | What it really does | What the exam is usually testing |
| --- | --- | --- |
| listeners | where brokers bind | interface and port availability |
| advertised.listeners | what clients are told to use | wrong endpoint breaks clients even when broker is healthy |
| listener.security.protocol.map | listener-to-protocol mapping | mismatched TLS or SASL path |
| inter.broker.listener.name | broker-to-broker communication path | replication and cluster communication health |
| TLS | encryption in transit | trust and certificate correctness |
| SASL | authentication | who may connect |
| ACLs | authorization | what authenticated principals may do |
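The four listener settings in the table have to agree with each other, and a small consistency check makes the relationships concrete. This is a hedged sketch, not a replacement for broker validation: the function names, listener names (`INTERNAL`, `EXTERNAL`), hostnames, and ports are hypothetical, and the checks cover only a few common mistakes.

```python
def parse_listeners(value):
    """Parse 'NAME://host:port,...' into {name: (host, port)}."""
    parsed = {}
    for entry in value.split(","):
        name, _, address = entry.strip().partition("://")
        host, _, port = address.rpartition(":")
        parsed[name] = (host, int(port))
    return parsed


def parse_protocol_map(value):
    """Parse 'NAME:PROTOCOL,...' into {name: protocol}."""
    return dict(pair.strip().split(":", 1) for pair in value.split(","))


def check_listener_config(cfg):
    """Return a list of problems; an empty list means the settings agree."""
    listeners = parse_listeners(cfg["listeners"])
    advertised = parse_listeners(cfg["advertised.listeners"])
    protocols = parse_protocol_map(cfg["listener.security.protocol.map"])
    problems = []
    for name, (host, _port) in advertised.items():
        if name not in listeners:
            problems.append(f"{name} is advertised but not bound in listeners")
        if host == "0.0.0.0":
            problems.append(f"{name} advertises 0.0.0.0, which clients cannot connect to")
    for name in listeners:
        if name not in protocols:
            problems.append(f"{name} has no security protocol mapping")
    inter_broker = cfg["inter.broker.listener.name"]
    if inter_broker not in listeners:
        problems.append(f"inter-broker listener {inter_broker} is not defined")
    return problems


broker = {
    "listeners": "INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:9093",
    "advertised.listeners": "INTERNAL://broker-1.internal:9092,EXTERNAL://0.0.0.0:9093",
    "listener.security.protocol.map": "INTERNAL:SASL_SSL,EXTERNAL:SASL_SSL",
    "inter.broker.listener.name": "INTERNAL",
}

problems = check_listener_config(broker)
print(problems)
```

In this example the broker binds and starts normally, yet external clients are handed an address they cannot reach, which matches the "broker is healthy but clients fail" pattern.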

Security boundary table

| Pair | Keep this distinction clear |
| --- | --- |
| TLS vs SASL vs ACLs | encryption vs authentication vs authorization |
| listeners vs advertised.listeners | bind location vs client-facing endpoint |
| client connectivity vs broker replication path | external client path versus inter-broker path |
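Authorization is the last of the three layers: it only decides what an already authenticated principal may do. The sketch below models that decision in a deliberately simplified way. The `Acl` type, the principal, and the resource strings are hypothetical, and the model ignores wildcards, prefixed resource patterns, host restrictions, and super users.

```python
from typing import NamedTuple


class Acl(NamedTuple):
    principal: str
    permission: str  # "ALLOW" or "DENY"
    operation: str
    resource: str


def is_authorized(principal, operation, resource, acls):
    """Deny wins over allow; with no matching ACL the request is refused."""
    matches = [
        a for a in acls
        if a.principal == principal
        and a.operation == operation
        and a.resource == resource
    ]
    if any(a.permission == "DENY" for a in matches):
        return False
    return any(a.permission == "ALLOW" for a in matches)


acls = [
    Acl("User:orders-app", "ALLOW", "Write", "Topic:orders"),
    Acl("User:orders-app", "ALLOW", "Read", "Topic:orders"),
    Acl("User:orders-app", "DENY", "Read", "Topic:orders"),
]

print(is_authorized("User:orders-app", "Write", "Topic:orders", acls))  # True
print(is_authorized("User:orders-app", "Read", "Topic:orders", acls))   # False
print(is_authorized("User:reporting", "Read", "Topic:orders", acls))    # False
```

A client that connects over TLS and authenticates with SASL can still fail the third check, so an authorization error does not imply a certificate or credential problem.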

Cluster-health and lag triage

| Symptom | First things to check | Common trap |
| --- | --- | --- |
| URP climbing | broker health, disk I/O, inter-broker network, follower lag | blaming consumers for replication issues |
| offline partitions | leader availability, controller stability, multi-broker impact | treating it like ordinary lag |
| consumer lag rising | partition count, consumer throughput, rebalance churn, downstream bottleneck | changing retention before confirming the consumer is simply slow |
| frequent rebalances | long processing time, heartbeat/session timing, unstable membership | tuning brokers first |
| disk pressure | log.dirs, retention, segment growth, broker capacity | increasing replication factor during a storage incident |
| producer timeout under load | ISR health, broker saturation, network path, write safety requirements | disabling durability first |
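Consumer lag is simple arithmetic, and computing it per partition helps separate a slow group from a single hot partition. The following is a minimal sketch with made-up offsets; the function name and the sample numbers are hypothetical.

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Lag per partition = log end offset minus the group's committed offset."""
    return {
        partition: max(end - committed_offsets[partition], 0)
        for partition, end in log_end_offsets.items()
    }


end_offsets = {0: 1_500, 1: 1_480, 2: 9_000}
committed = {0: 1_490, 1: 1_475, 2: 2_000}

lag = partition_lag(end_offsets, committed)
total_lag = sum(lag.values())
worst_partition = max(lag, key=lag.get)

print(lag)              # {0: 10, 1: 5, 2: 7000}
print(total_lag)        # 7015
print(worst_partition)  # 2
```

Here almost all of the lag sits on partition 2, which points toward a skewed key or one slow consumer instance rather than a cluster-wide problem. Neither reading is a reason to change retention first.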

High-confusion admin pairs

| Pair | Keep this distinction clear |
| --- | --- |
| replication factor vs ISR | total copies versus healthy caught-up subset |
| URP vs offline partition | lagging replica health issue versus partition lacking a working leader |
| broker config vs topic config | cluster default versus per-topic behavior |
| lag vs durability issue | consumer speed problem versus replication-safety problem |
| compaction vs retention delete | latest-by-key store versus event history |
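The URP versus offline distinction can be expressed as a severity ladder over a partition's replica list, ISR, and leader. This is an illustrative sketch: the function name, the state labels, and the broker ids are hypothetical, and `leader=None` stands in for a partition with no active leader.

```python
def classify_partition(replicas, isr, leader, min_insync_replicas=1):
    """Return the most severe condition that applies to one partition."""
    if leader is None:
        return "offline"            # no working leader: reads and writes stop
    if len(isr) < min_insync_replicas:
        return "under-min-isr"      # acks=all writes are rejected
    if len(isr) < len(replicas):
        return "under-replicated"   # still serving, with reduced redundancy
    return "healthy"


print(classify_partition([1, 2, 3], [1, 2, 3], leader=1, min_insync_replicas=2))  # healthy
print(classify_partition([1, 2, 3], [1, 2], leader=1, min_insync_replicas=2))     # under-replicated
print(classify_partition([1, 2, 3], [1], leader=1, min_insync_replicas=2))        # under-min-isr
print(classify_partition([1, 2, 3], [], leader=None, min_insync_replicas=2))      # offline
```

An under-replicated partition is a warning about redundancy, while an offline partition is an outage, so the two call for different urgency.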

Safe maintenance order

| Operation | Safe first step | What strong answers protect |
| --- | --- | --- |
| rolling restart | restart one broker at a time and recheck health between nodes | quorum and ISR safety |
| security change | validate with a test client before broad rollout | clients from a cluster-wide outage |
| disk incident | free space, stabilize brokers, then repair replication | data safety before tuning |
| config change | smallest reversible change first | rollback path and blast radius |
| topic-setting update | confirm workload intent before changing compaction or retention | replay and durability behavior |
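The rolling-restart row can be sketched as a loop that refuses to move on until the cluster is healthy again. Everything here is hypothetical scaffolding: `restart` and `count_urp` are caller-supplied callables standing in for whatever tooling restarts a broker and reads the under-replicated partition count, and the timeout values are arbitrary.

```python
import time


def rolling_restart(broker_ids, restart, count_urp,
                    timeout_s=300, poll_s=5, sleep=time.sleep):
    """Restart brokers one at a time, waiting for URP to reach zero between nodes."""
    done = []
    for broker_id in broker_ids:
        # Never start on the next broker while replication is still degraded.
        if count_urp() != 0:
            raise RuntimeError(f"cluster unhealthy before broker {broker_id}; stopping")
        restart(broker_id)
        waited = 0
        while count_urp() != 0:
            if waited >= timeout_s:
                raise RuntimeError(f"URP did not clear after broker {broker_id}; stopping")
            sleep(poll_s)
            waited += poll_s
        done.append(broker_id)
    return done


# Dry run with fake readings: URP spikes after broker 1 restarts, then recovers.
urp_readings = iter([0, 2, 1, 0, 0, 0])
restarted = []
result = rolling_restart(
    [1, 2],
    restart=restarted.append,
    count_urp=lambda: next(urp_readings),
    sleep=lambda seconds: None,  # skip real waiting in the dry run
)
print(result)  # [1, 2]
```

Stopping on the first unhealthy reading, rather than pressing on, is the design choice that keeps the blast radius to a single broker.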

Operations traps

| Trap | Better reading |
| --- | --- |
| tuning throughput while the cluster is unhealthy | restore safety first, optimize later |
| enabling unsafe leader behavior to keep writes alive | availability without durability is often the wrong trade-off for exam answers |
| changing multiple controls at once | the exam favors contained, reversible maintenance |

Last 15-minute review

| Review this | Because it fixes… |
| --- | --- |
| acks, ISR, and min.insync.replicas | durability misses |
| listeners vs advertised.listeners | connectivity questions that look like generic networking |
| retention vs compaction | topic-behavior distractors |
| URP vs offline partitions | severity misclassification |
| TLS vs SASL vs ACLs | security-control confusion |
| rolling restart order | unsafe-operations distractors |

What strong answers usually do

  • reason from durability first, then tuning
  • distinguish cluster-wide behavior from topic-specific behavior
  • protect safety during maintenance instead of optimizing too early
  • classify incidents correctly before reaching for commands
  • choose the least risky next step that restores health without widening the blast radius
Revised on Sunday, May 10, 2026