Study SOA-C03 High Availability, Load Balancing and Resilience: key concepts, common traps, and exam decision cues.
This lesson is about staying up through failure. SOA-C03 expects CloudOps engineers to understand how health checks, load balancers, Multi-AZ patterns, and fault-tolerant deployment choices reduce outage impact when individual components fail.
Health check: Probe used to determine whether a target is healthy enough to receive traffic.
Fault tolerance: Ability of a system to continue operating even when one component fails.
AWS wants you to distinguish scaling, high availability, and disaster recovery:
| If the problem is mainly about… | Strongest first lane | What it does not automatically solve |
|---|---|---|
| more traffic under normal healthy conditions | Scaling | AZ failure or regional failure |
| staying online when one component or AZ fails | High availability | Long-term data recovery after corruption |
| restoring service after a broader outage or destructive event | Disaster recovery | Real-time traffic balancing by itself |
SOA-C03 often hides this distinction in the last sentence of the stem. If the requirement is to survive failure with minimal interruption, the answer is usually an availability pattern, not just larger capacity.
The exam is not math-heavy, but you should be able to reason from the basic relationship:
\[ \text{Availability} = 1 - \frac{\text{Downtime}}{\text{Total Time}} \]
If you want a quick annual downtime estimate:
\[ \text{Downtime Hours Per Year} = (1 - \text{Availability}) \times 8760 \]
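As a quick check on the arithmetic, the downtime formula can be sketched in a few lines of Python. The function name `downtime_hours_per_year` and the sample availability targets are illustrative choices, not values taken from the exam guide.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours, the constant used in the formula


def downtime_hours_per_year(availability: float) -> float:
    """Return the expected annual downtime in hours.

    `availability` is a fraction between 0 and 1 (0.999 means 99.9%).
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical targets chosen only to show the scale of each extra "nine".
    for target in (0.99, 0.999, 0.9999):
        hours = downtime_hours_per_year(target)
        print(f"{target:.2%} availability -> {hours:.2f} hours of downtime per year")
```

At 99% availability the formula gives about 87.6 hours of downtime a year; at 99.9% it gives about 8.76 hours.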
This estimate is useful because many availability questions are really asking whether the proposed design meaningfully reduces downtime, not whether it only looks redundant on a diagram.
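One way to test whether a design really reduces downtime is to combine per-component availabilities. The sketch below is a simplified model, not an AWS formula: it assumes component failures are independent (real Availability Zones only approximate this), and the 0.99 figure and the helper names `parallel` and `series` are hypothetical.

```python
HOURS_PER_YEAR = 8760


def parallel(availability: float, copies: int) -> float:
    """Availability of `copies` redundant components where any one can serve.

    Assumes failures are independent, which is a simplification.
    """
    return 1.0 - (1.0 - availability) ** copies


def series(*availabilities: float) -> float:
    """Availability of a chain in which every tier must be up at once."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


if __name__ == "__main__":
    single = 0.99  # hypothetical availability of one instance

    web_tier = parallel(single, copies=2)  # two redundant web targets
    overall = series(web_tier, single)     # followed by one single database

    for label, value in (("redundant web tier", web_tier),
                         ("web tier + single database", overall)):
        downtime = (1.0 - value) * HOURS_PER_YEAR
        print(f"{label}: {value:.4%} available, {downtime:.1f} hours down per year")
```

With these illustrative numbers, the redundant web tier alone reaches about 99.99%, but once it is chained to a single 99% database the whole request path falls just under 99%, roughly 88 hours of downtime a year. The diagram shows redundancy, yet the remaining single point of failure dominates the result.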
| If the stem emphasizes… | Strongest first pattern |
|---|---|
| sending traffic only to healthy targets | load balancer plus health checks |
| surviving a single-AZ failure | Multi-AZ placement and redundant targets |
| DNS-based regional failover | Route 53 health checks and routing policy |
| removing a single point of failure in one tier | redundant components in that tier, not only bigger instances |
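The first two rows of the table can be illustrated with a toy model of health-check-driven routing. This is a conceptual sketch in plain Python, not the Elastic Load Balancing API: the class `ToyLoadBalancer`, its threshold parameters, and the target and AZ names are all hypothetical, loosely modeled on the idea that a target changes state only after several consecutive probe results.

```python
from dataclasses import dataclass


@dataclass
class Target:
    name: str
    az: str
    healthy: bool = True
    failures: int = 0   # consecutive failed probes
    successes: int = 0  # consecutive passed probes


class ToyLoadBalancer:
    """Round-robin router that only sends traffic to healthy targets."""

    def __init__(self, targets, unhealthy_threshold=2, healthy_threshold=2):
        self.targets = list(targets)
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self._next = 0

    def record_probe(self, target: Target, passed: bool) -> None:
        """Update a target's state from one health check result."""
        if passed:
            target.successes += 1
            target.failures = 0
            if not target.healthy and target.successes >= self.healthy_threshold:
                target.healthy = True
        else:
            target.failures += 1
            target.successes = 0
            if target.healthy and target.failures >= self.unhealthy_threshold:
                target.healthy = False

    def route(self) -> Target:
        """Pick the next healthy target, skipping unhealthy ones."""
        healthy = [t for t in self.targets if t.healthy]
        if not healthy:
            raise RuntimeError("no healthy targets available")
        target = healthy[self._next % len(healthy)]
        self._next += 1
        return target


if __name__ == "__main__":
    lb = ToyLoadBalancer([Target("web-a", "az-1"), Target("web-b", "az-2")])
    web_a = lb.targets[0]

    # Simulate az-1 failing: two consecutive failed probes mark web-a unhealthy.
    lb.record_probe(web_a, passed=False)
    lb.record_probe(web_a, passed=False)

    print([lb.route().name for _ in range(4)])  # traffic now goes only to web-b
```

Because the second target sits in a different AZ, losing `az-1` leaves a healthy target to absorb traffic. With both targets in the same AZ, the health checks would still work, but there would be nothing healthy left to route to.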
A workload already scales out under traffic spikes, but the company is still failing over poorly when one Availability Zone has problems. Which interpretation is strongest first?
Correct answer: B. The problem is survival through failure, not simply adding capacity during healthy operation.