Keep this cheat sheet open while drilling. SOA-C03 is an operations exam first: read the signal, classify the operational boundary, take the lowest-risk remediation step, verify the result, then improve the environment so the incident is less likely to recur.
Runbook: Step-by-step operational procedure for diagnosing, remediating, or escalating an issue safely.
Remediation: Corrective action that restores or stabilizes service after a fault or policy failure.
Blast radius: How much of the environment a change or incident can affect.
Fast lane picker
| If the question is really about… | Focus first on… | Strongest first move |
| --- | --- | --- |
| alarms, logs, or noisy incidents | CloudWatch, CloudTrail, log correlation, composite alarms | improve signal quality before automating reaction |
| outage tolerance or restore targets | Multi-AZ, backup, restore, failover, RTO/RPO | translate the continuity target before naming the service |
| repeatable operational work | Systems Manager, EventBridge, Lambda, CloudFormation | prefer safe automation over manual repetition |
| access, compliance, or investigation | IAM, resource policy, KMS, Config, Security Hub | follow the evaluation chain before guessing |
| connectivity or delivery path | route tables, security groups, NACLs, endpoints, CloudFront | classify the path component that owns the failure |
Incident triage flow
Use this when the choice is between investigating first or remediating first.
flowchart TD
S["Signal"] --> B["Boundary: app, host, AWS service, or dependency?"]
B --> R["Recent change, logs, metrics, or dependency failure?"]
R --> L["Smallest safe remediation"]
L --> V["Verify recovery and automate the fix if it repeats"]
CloudOps control loop
flowchart LR
A["Detect signal"] --> B["Triage severity and boundary"]
B --> C["Diagnose likely root cause"]
C --> D["Apply lowest-risk remediation"]
D --> E["Verify recovery"]
E --> F["Document and automate prevention"]
SOA-C03 answer sequence
Use this when the question is really about operational judgment under a live signal.
flowchart TD
S["Scenario"] --> B["Find the service boundary"]
B --> R["Read logs, metrics, and recent changes"]
R --> M["Make the smallest safe remediation"]
M --> V["Verify recovery"]
V --> A["Automate or document the repeat fix"]
What to notice:
- wrong answers often skip verification or jump to a high-blast-radius fix
- SOA-C03 rewards safer operational judgment more than flashy redesigns
- if a remediation pattern is stable and repeatable, automation is usually the next improvement
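The "smallest safe remediation" step can be sketched as a ranking over candidate actions. This is a toy model for drilling, not an AWS API: the `Remediation` class, the blast-radius scores, and the example actions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Remediation:
    name: str
    blast_radius: int  # hypothetical score: how many resources the change can touch
    reversible: bool   # can the change be rolled back quickly?


def smallest_safe_remediation(options):
    """Prefer reversible actions first, then the lowest blast radius."""
    return min(options, key=lambda o: (not o.reversible, o.blast_radius))


candidates = [
    Remediation("rebuild the whole VPC", blast_radius=200, reversible=False),
    Remediation("roll back the last deployment", blast_radius=5, reversible=True),
    Remediation("restart every instance in the fleet", blast_radius=40, reversible=True),
]

choice = smallest_safe_remediation(candidates)
print(choice.name)  # roll back the last deployment
```

Whatever the ranking picks, the flow still requires a verification step afterwards.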
Quick facts
| Item | Value |
| --- | --- |
| Questions | 65 |
| Duration | 130 minutes |
| Passing score | 720 scaled |
| Weighted domains | D1 22%, D2 22%, D3 22%, D4 16%, D5 18% |
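As a pacing aid, the quick facts translate into a per-question time budget. The short calculation below uses only the numbers listed above; the variable names are illustrative.

```python
QUESTIONS = 65
DURATION_MINUTES = 130
DOMAIN_WEIGHTS = {"D1": 22, "D2": 22, "D3": 22, "D4": 16, "D5": 18}

# Average time available per question if none are skipped.
minutes_per_question = DURATION_MINUTES / QUESTIONS

# Sanity check that the listed domain weights cover the whole exam.
total_weight = sum(DOMAIN_WEIGHTS.values())

print(f"{minutes_per_question:.1f} minutes per question")  # 2.0 minutes per question
print(f"domain weights sum to {total_weight}%")            # domain weights sum to 100%
```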
Monitoring and logging chooser
| Need | Strongest first AWS signal | Why |
| --- | --- | --- |
| resource or service health trend | CloudWatch metrics | fast operational signal |
| application or system event detail | CloudWatch Logs | detailed event and error context |
| API audit trail | CloudTrail | identity and action history |
| network accept/deny and path detail | VPC Flow Logs | packet-path evidence |
| user-facing synthetic experience | CloudWatch Synthetics or route health patterns | outside-in service verification |
| Alarm problem | Better operational answer |
| --- | --- |
| too many noisy alerts | composite alarms, tuned thresholds, actionable routing |
| repeated known incident pattern | EventBridge plus Lambda or SSM Automation |
| no root-cause evidence | correlate metrics, logs, and recent deploy/change timeline |
| metrics exist but host internals do not | install the CloudWatch agent or fix the missing telemetry path |
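One way to cut alert noise is a composite alarm that fires only when several underlying alarms agree. The sketch below builds a rule expression in the `ALARM("name") AND/OR ...` form that CloudWatch composite alarms use. The helper `composite_rule` and the alarm names are hypothetical, and the resulting string would still need to be supplied as the alarm rule when the composite alarm is created.

```python
def composite_rule(required, any_of=()):
    """Build a composite alarm rule expression.

    required: alarm names that must all be in ALARM state
    any_of:   alarm names where at least one must be in ALARM state
    """
    parts = [f'ALARM("{name}")' for name in required]
    if any_of:
        alternatives = " OR ".join(f'ALARM("{name}")' for name in any_of)
        parts.append(f"({alternatives})")
    return " AND ".join(parts)


# Hypothetical alarm names: page only when user-facing errors coincide
# with at least one database pressure signal.
rule = composite_rule(
    required=["api-5xx-rate-high"],
    any_of=["db-cpu-high", "db-connections-high"],
)
print(rule)
# ALARM("api-5xx-rate-high") AND (ALARM("db-cpu-high") OR ALARM("db-connections-high"))
```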
Reliability and continuity chooser
| Requirement | Strongest first fit | Why |
| --- | --- | --- |
| in-region database availability | RDS or Aurora Multi-AZ | managed HA pattern |
| regional DNS failover behavior | Route 53 health checks and routing policy | traffic steering on health |
| read offload and cache pressure reduction | ElastiCache or CloudFront where appropriate | reduces repeated backend load |
| point-in-time restore and data protection | backup frequency and restore design matched to RPO | backup design follows data-loss target |
| minimal downtime during failure | tested failover pattern, not just backups | outage-time requirement is about recovery speed |
| Pair | Keep this distinction clear |
| --- | --- |
| backup vs DR | data-restoration capability vs outage-time resilience |
| Multi-AZ vs read replica | availability/failover vs scaling reads |
| RTO vs RPO | restore time vs acceptable data-loss window |
| versioning vs backup | object-history feature vs broader restore strategy |
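The RTO/RPO distinction can be checked with simple arithmetic. The sketch below assumes that worst-case data loss is roughly one backup interval and that restore time has actually been measured in a test; the function name and the example figures are hypothetical.

```python
from datetime import timedelta


def meets_targets(backup_interval, measured_restore_time, rpo, rto):
    """Return (rpo_ok, rto_ok) for a backup-and-restore design.

    Worst-case data loss is about one backup interval, so the interval
    must not exceed the RPO. The measured restore time must not exceed the RTO.
    """
    return backup_interval <= rpo, measured_restore_time <= rto


# Hypothetical design: nightly backups, a 3-hour tested restore,
# against a 1-hour RPO and a 4-hour RTO.
rpo_ok, rto_ok = meets_targets(
    backup_interval=timedelta(hours=24),
    measured_restore_time=timedelta(hours=3),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
)
print(rpo_ok, rto_ok)  # False True
```

Here the restore is fast enough for the RTO, but nightly backups cannot satisfy a one-hour RPO, which is the kind of mismatch the table asks you to spot.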
Automation and provisioning chooser
| Requirement | Strongest first fit | Why |
| --- | --- | --- |
| declarative infrastructure | CloudFormation or CDK | repeatable provisioning |
| repeatable fleet operations | Systems Manager | patching, automation, session, inventory |
| event-driven remediation | EventBridge plus Lambda or SSM | reactive but controlled automation |
| multi-account deployment standardization | StackSets / Organizations-aware rollout | centralized governance |
| secure shell-less instance access | Session Manager | no public SSH exposure required |
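Event-driven remediation starts with an event pattern that selects only the events worth reacting to. The sketch below shows a pattern for EC2 instances entering the `stopped` state, plus a deliberately simplified matcher to illustrate how such a pattern filters events. The `matches` function is a hypothetical teaching aid that handles exact values only; it is not the EventBridge matching engine, and the instance ID is made up.

```python
import json

# Event pattern: EC2 state-change notifications where the new state is "stopped".
EVENT_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}


def matches(pattern, event):
    """Simplified exact-value matching: every pattern key must be present,
    a list means "any of these values", and a dict recurses into the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


# Hypothetical event, trimmed to the fields the pattern inspects.
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}

print(json.dumps(EVENT_PATTERN))
print(matches(EVENT_PATTERN, sample_event))  # True
```

A narrow pattern keeps the automation "reactive but controlled": the target Lambda function or SSM document runs only for the state it was written to handle.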
Deployment and automation traps
| Trap | Better reading |
| --- | --- |
| rebuilding manually for every known incident | automate the proven fix |
| looking only at the final CloudFormation error | inspect the first failing resource in stack events |
| using wide admin access for automation | give the runbook or function only the permissions it needs |
| changing many resources at once during an outage | choose the smallest reversible remediation first |
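To make "only the permissions it needs" concrete, here is a sketch of an identity policy for a runbook that may reboot one specific instance and nothing else, followed by a small check that flags bare wildcards. The account ID, Region, instance ID, and the `wildcard_findings` helper are hypothetical, and the check is far less thorough than a real policy analyzer.

```python
import json

# Hypothetical account ID, Region, and instance ID.
REBOOT_ONLY_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["ec2:RebootInstances"],
            "Resource": "arn:aws:ec2:us-east-1:111122223333:instance/i-0123456789abcdef0",
        }
    ],
}


def wildcard_findings(policy):
    """Flag Allow statements whose Action or Resource is a bare '*'."""
    findings = []
    for i, stmt in enumerate(policy["Statement"]):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        if "*" in actions:
            findings.append(f"statement {i}: Action is '*'")
        if "*" in resources:
            findings.append(f"statement {i}: Resource is '*'")
    return findings


print(json.dumps(REBOOT_ONLY_POLICY, indent=2))
print(wildcard_findings(REBOOT_ONLY_POLICY))  # []
```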
Security and compliance ops chooser
| Requirement | Strongest first fit |
| --- | --- |
| least-privilege identity evaluation | IAM policies, resource policies, Access Analyzer |
| audit trail and configuration history | CloudTrail and AWS Config |
| secret storage and rotation | Secrets Manager |
| encryption key custody | KMS |
| aggregated findings | Security Hub |
| threat detection | GuardDuty |
| workload exposure and package assessment | Inspector |
| If access is denied… | Check in this order |
| --- | --- |
| identity permissions | IAM policy or role |
| target resource permissions | resource policy |
| encryption boundary | KMS key policy or grant |
| org-level boundary | SCP or delegated account restrictions |
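The check order above can be drilled as an ordered checklist. The sketch below encodes the four layers from the table and reports the first one that blocks or has not been examined yet. It models the troubleshooting order only, not how AWS evaluates policies (an explicit deny at any layer still wins); the function name and result format are hypothetical.

```python
CHECK_ORDER = [
    ("identity permissions", "IAM policy or role"),
    ("target resource permissions", "resource policy"),
    ("encryption boundary", "KMS key policy or grant"),
    ("org-level boundary", "SCP or delegated account restrictions"),
]


def first_blocking_layer(results):
    """results maps a layer name to True (allows) or False (blocks).

    Layers missing from results have not been checked yet, so the
    function stops there and says what to look at next.
    """
    for layer, where_to_look in CHECK_ORDER:
        allowed = results.get(layer)
        if allowed is None:
            return f"check next: {layer} ({where_to_look})"
        if not allowed:
            return f"blocked at: {layer} ({where_to_look})"
    return "all four layers allow the request"


# Hypothetical findings so far: IAM and the resource policy allow, KMS does not.
findings = {
    "identity permissions": True,
    "target resource permissions": True,
    "encryption boundary": False,
}
print(first_blocking_layer(findings))
# blocked at: encryption boundary (KMS key policy or grant)
```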
Networking and content delivery chooser
| Need | Strongest first fit | Why |
| --- | --- | --- |
| private access to AWS-managed service | VPC endpoint / PrivateLink | avoids public NAT path |
| CDN and cache edge layer | CloudFront | edge caching and acceleration |
| global traffic acceleration | Global Accelerator | static anycast-style path optimization |
| hybrid private connectivity | VPN or Transit Gateway patterns | network extension design |
| app path troubleshooting | route table -> SG -> NACL -> endpoint/DNS order | fastest structured isolation path |
Network symptom table
| Symptom | First things to check | Common trap |
| --- | --- | --- |
| instance or service unreachable | route tables, SGs, NACLs, gateway path | opening everything before proving the block |
| intermittent connectivity | return-path state, NACL stateless rules, endpoint path | assuming NAT supports inbound reachability |
| CloudFront serving wrong or stale content | cache behavior, TTL, invalidation, origin health | treating cache issue as origin outage immediately |
| private service access failing | endpoint type, route, SG, DNS | debugging IAM before proving network path |
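The "NACL stateless rules" entry is easier to remember with a small model. Network ACL rules are evaluated in ascending rule-number order, the first match decides, and return traffic needs its own rule because no connection state is tracked. The sketch below assumes TCP only and ignores protocol and ICMP fields; `NaclRule`, `nacl_allows`, and the addresses are hypothetical.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class NaclRule:
    number: int
    cidr: str
    port_from: int
    port_to: int
    allow: bool


def nacl_allows(rules, ip, port):
    """Evaluate rules in ascending rule-number order; the first match decides.
    If nothing matches, the implicit final rule denies."""
    for rule in sorted(rules, key=lambda r: r.number):
        in_cidr = ip_address(ip) in ip_network(rule.cidr)
        if in_cidr and rule.port_from <= port <= rule.port_to:
            return rule.allow
    return False


# Inbound HTTPS is allowed, but outbound only allows destination port 443,
# so replies to the client's ephemeral port have no matching allow rule.
inbound = [NaclRule(100, "0.0.0.0/0", 443, 443, True)]
outbound = [NaclRule(100, "0.0.0.0/0", 443, 443, True)]

client_ip, client_port = "203.0.113.10", 50123
print(nacl_allows(inbound, client_ip, 443))           # True: the request gets in
print(nacl_allows(outbound, client_ip, client_port))  # False: the reply is dropped
```

Adding an outbound allow rule for the ephemeral port range fixes this case, which is why a security group (stateful) and a NACL (stateless) can disagree about the same flow.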
Cost-aware operations quick wins
| Pattern | Operationally safer cost win |
| --- | --- |
| stale storage growth | lifecycle policies, snapshot retention rules, archive tiers |
| repeated NAT egress | use VPC endpoints where the service supports them |
| oversized compute | right-size from utilization and recommendation data |
| idle orphaned resources | clean unattached volumes, stale snapshots, unused load balancers |
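Finding unattached volumes is a simple filter once the inventory is in hand. The sketch below works on a dictionary shaped like an EC2 `DescribeVolumes` response, where an unattached EBS volume reports the state `available`. The sample data and the helper name are hypothetical, and in practice the response would come from the EC2 API rather than a literal.

```python
def unattached_volume_ids(describe_volumes_response):
    """Return IDs of volumes in the 'available' state (not attached to an instance)."""
    return [
        volume["VolumeId"]
        for volume in describe_volumes_response.get("Volumes", [])
        if volume.get("State") == "available"
    ]


# Hypothetical inventory trimmed to the fields used here.
sample_response = {
    "Volumes": [
        {"VolumeId": "vol-0aaa1111bbbb2222c", "State": "in-use", "Size": 100},
        {"VolumeId": "vol-0ddd3333eeee4444f", "State": "available", "Size": 500},
    ]
}

print(unattached_volume_ids(sample_response))  # ['vol-0ddd3333eeee4444f']
```

Listing candidates first and deleting only after review (ideally after a final snapshot) keeps the cleanup itself low-risk.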
Last 15-minute review
| Review this | Because it fixes… |
| --- | --- |
| CloudWatch vs CloudTrail vs Flow Logs | wrong-signal mistakes |
| composite alarms and runbook automation | noisy-incident and repetitive-work misses |
| Multi-AZ, backups, restore, RTO/RPO | continuity confusion |
| CloudFormation, Systems Manager, EventBridge roles | automation-pattern misses |
| IAM policy vs resource policy vs KMS policy | access-denied confusion |
| route table -> SG -> NACL -> endpoint order | networking troubleshooting mistakes |
What strong answers usually do
- start from the safest signal and operational boundary
- prefer low-risk remediation and rollback paths before invasive reconfiguration
- treat observability, backup, security, and automation as one operating model
- choose the option that makes the environment more repeatable and observable after the incident
Quiz