AWS SOA-C03 Cheat Sheet: Monitoring, Recovery, and Automation

AWS SOA-C03 cheat sheet for monitoring, recovery, automation, traps, and final review.

Keep this cheat sheet open while drilling. SOA-C03 is an operations exam first: read the signal, classify the operational boundary, take the lowest-risk remediation step, verify the result, then improve the environment so the incident is less likely to recur.

Runbook: Step-by-step operational procedure for diagnosing, remediating, or escalating an issue safely.

Remediation: Corrective action that restores or stabilizes service after a fault or policy failure.

Blast radius: How much of the environment a change or incident can affect.

Fast lane picker

| If the question is really about… | Focus first on… | Strongest first move |
| --- | --- | --- |
| alarms, logs, or noisy incidents | CloudWatch, CloudTrail, log correlation, composite alarms | improve signal quality before automating reaction |
| outage tolerance or restore targets | Multi-AZ, backup, restore, failover, RTO/RPO | translate the continuity target before naming the service |
| repeatable operational work | Systems Manager, EventBridge, Lambda, CloudFormation | prefer safe automation over manual repetition |
| access, compliance, or investigation | IAM, resource policy, KMS, Config, Security Hub | follow the evaluation chain before guessing |
| connectivity or delivery path | route tables, security groups, NACLs, endpoints, CloudFront | classify the path component that owns the failure |

Incident triage flow

Use this when the choice is between investigating first or remediating first.

    flowchart TD
	  S["Signal"] --> B["Boundary: app, host, AWS service, or dependency?"]
	  B --> R["Recent change, logs, metrics, or dependency failure?"]
	  R --> L["Smallest safe remediation"]
	  L --> V["Verify recovery and automate the fix if it repeats"]

CloudOps control loop

    flowchart LR
	  A["Detect signal"] --> B["Triage severity and boundary"]
	  B --> C["Diagnose likely root cause"]
	  C --> D["Apply lowest-risk remediation"]
	  D --> E["Verify recovery"]
	  E --> F["Document and automate prevention"]

SOA-C03 answer sequence

Use this when the question is really about operational judgment under a live signal.

    flowchart TD
	  S["Scenario"] --> B["Find the service boundary"]
	  B --> R["Read logs, metrics, and recent changes"]
	  R --> M["Make the smallest safe remediation"]
	  M --> V["Verify recovery"]
	  V --> A["Automate or document the repeat fix"]

What to notice:

  • wrong answers often skip verification or jump to a high-blast-radius fix
  • SOA-C03 rewards safer operational judgment more than flashy redesigns
  • if a remediation pattern is stable and repeatable, automation is usually the next improvement

Quick facts

| Item | Value |
| --- | --- |
| Questions | 65 |
| Duration | 130 minutes |
| Passing score | 720 scaled |
| Weighted domains | D1 22%, D2 22%, D3 22%, D4 16%, D5 18% |

Monitoring and logging chooser

| Need | Strongest first AWS signal | Why |
| --- | --- | --- |
| resource or service health trend | CloudWatch metrics | fast operational signal |
| application or system event detail | CloudWatch Logs | detailed event and error context |
| API audit trail | CloudTrail | identity and action history |
| network accept/deny and path detail | VPC Flow Logs | packet-path evidence |
| user-facing synthetic experience | CloudWatch Synthetics or route health patterns | outside-in service verification |

| Alarm problem | Better operational answer |
| --- | --- |
| too many noisy alerts | composite alarms, tuned thresholds, actionable routing |
| repeated known incident pattern | EventBridge plus Lambda or SSM Automation |
| no root-cause evidence | correlate metrics, logs, and recent deploy/change timeline |
| metrics exist but host internals do not | CloudWatch agent or missing telemetry path |
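
As a rough illustration of the composite-alarm answer above, the sketch below builds a CloudWatch `AlarmRule` expression from child alarm names, so a page fires only when several signals agree. The alarm names and helper functions are hypothetical; only the `ALARM("...")`, `AND`, and `OR` rule syntax comes from CloudWatch itself.

```python
def alarm(name: str) -> str:
    """Return the ALARM("name") term used inside a composite AlarmRule."""
    return f'ALARM("{name}")'


def all_of(*terms: str) -> str:
    """Join terms with AND, parenthesised so the result nests safely."""
    return "(" + " AND ".join(terms) + ")"


def any_of(*terms: str) -> str:
    """Join terms with OR, parenthesised so the result nests safely."""
    return "(" + " OR ".join(terms) + ")"


# Hypothetical alarm names: alert only when errors are high AND at least
# one saturation signal is firing at the same time.
rule = all_of(
    alarm("api-5xx-high"),
    any_of(alarm("cpu-high"), alarm("db-connections-high")),
)
print(rule)
```

The resulting string is the kind of value that could be passed as `AlarmRule` when creating a composite alarm; the child alarms would need to exist first.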

Reliability and continuity chooser

| Requirement | Strongest first fit | Why |
| --- | --- | --- |
| in-region database availability | RDS or Aurora Multi-AZ | managed HA pattern |
| regional DNS failover behavior | Route 53 health checks and routing policy | traffic steering on health |
| read offload and cache pressure reduction | ElastiCache or CloudFront where appropriate | reduces repeated backend load |
| point-in-time restore and data protection | backup frequency and restore design matched to RPO | backup design follows data-loss target |
| minimal downtime during failure | tested failover pattern, not just backups | outage-time requirement is about recovery speed |

| Pair | Keep this distinction clear |
| --- | --- |
| backup vs DR | data-restoration capability vs outage-time resilience |
| Multi-AZ vs read replica | availability/failover vs scaling reads |
| RTO vs RPO | restore time vs acceptable data-loss window |
| versioning vs backup | object-history feature vs broader restore strategy |
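
The RTO vs RPO distinction can be checked with simple arithmetic. The sketch below assumes worst-case data loss is about one full backup interval and worst-case downtime is at least the restore duration; the function name and all numbers are made up for illustration.

```python
from datetime import timedelta


def meets_targets(backup_interval, restore_duration, rpo, rto):
    """Return (rpo_ok, rto_ok) for a backup-and-restore design.

    Assumption: worst-case data loss is roughly one backup interval,
    and worst-case downtime is at least the time a restore takes.
    """
    return backup_interval <= rpo, restore_duration <= rto


# Hypothetical design: nightly backups and a 3-hour restore,
# measured against a 1-hour RPO and a 4-hour RTO.
rpo_ok, rto_ok = meets_targets(
    backup_interval=timedelta(hours=24),
    restore_duration=timedelta(hours=3),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
)
print(f"RPO met: {rpo_ok}, RTO met: {rto_ok}")  # RPO met: False, RTO met: True
```

Here the restore is fast enough, but nightly backups cannot satisfy a 1-hour data-loss window, which is why the backup frequency has to follow the RPO.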

Automation and provisioning chooser

| Requirement | Strongest first fit | Why |
| --- | --- | --- |
| declarative infrastructure | CloudFormation or CDK | repeatable provisioning |
| repeatable fleet operations | Systems Manager | patching, automation, session, inventory |
| event-driven remediation | EventBridge plus Lambda or SSM | reactive but controlled automation |
| multi-account deployment standardization | StackSets / Organizations-aware rollout | centralized governance |
| secure shell-less instance access | Session Manager | no public SSH exposure required |
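
To make the event-driven remediation row more concrete, here is a sketch of an EventBridge-style event pattern for CloudWatch alarm state changes, plus a deliberately simplified local matcher for reasoning about which events would trigger a target. The matcher is an assumption for illustration and covers only exact-value lists and nested objects; real EventBridge patterns support more operators. The sample event is trimmed and hypothetical.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: every pattern key must be present in the event.

    A list means "the event value must equal one of these";
    a dict means "recurse into the nested object".
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Pattern shape for alarms that transition into the ALARM state.
ALARM_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}

# Hypothetical, trimmed event for a made-up alarm name.
event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"alarmName": "api-5xx-high", "state": {"value": "ALARM"}},
}

print(matches(ALARM_PATTERN, event))
```

Keeping the pattern narrow (one source, one detail type, one state) is what makes the automation "reactive but controlled": the remediation target runs only for the incident it was written for.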

Deployment and automation traps

| Trap | Better reading |
| --- | --- |
| rebuilding manually for every known incident | automate the proven fix |
| looking only at terminal CloudFormation error | inspect the first failing resource in stack events |
| using wide admin access for automation | give the runbook or function only the permissions it needs |
| changing many resources at once during an outage | choose the smallest reversible remediation first |
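
The "first failing resource" advice can be sketched as a small filter over stack events. The dictionaries below mimic the shape of CloudFormation stack event records (`Timestamp`, `LogicalResourceId`, `ResourceStatus`, `ResourceStatusReason`), but the resource names, timestamps, and reasons are invented sample data.

```python
from datetime import datetime


def first_failure(events):
    """Return the earliest event whose status ends in _FAILED, or None."""
    failed = [e for e in events if e["ResourceStatus"].endswith("_FAILED")]
    return min(failed, key=lambda e: e["Timestamp"], default=None)


# Hypothetical events, newest first, as stack events are usually listed.
events = [
    {"Timestamp": datetime(2026, 1, 1, 10, 3), "LogicalResourceId": "DemoStack",
     "ResourceStatus": "ROLLBACK_IN_PROGRESS",
     "ResourceStatusReason": "The following resource(s) failed to create"},
    {"Timestamp": datetime(2026, 1, 1, 10, 2), "LogicalResourceId": "AppQueue",
     "ResourceStatus": "CREATE_FAILED",
     "ResourceStatusReason": "Resource creation cancelled"},
    {"Timestamp": datetime(2026, 1, 1, 10, 1), "LogicalResourceId": "AppBucket",
     "ResourceStatus": "CREATE_FAILED",
     "ResourceStatusReason": "Bucket name already exists"},
]

root = first_failure(events)
print(root["LogicalResourceId"], "-", root["ResourceStatusReason"])
```

The newest events describe the rollback and the cancelled siblings; the earliest `_FAILED` event is usually the one that explains why the stack failed.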

Security and compliance ops chooser

| Requirement | Strongest first fit |
| --- | --- |
| least-privilege identity evaluation | IAM policies, resource policies, Access Analyzer |
| audit trail and configuration history | CloudTrail and AWS Config |
| secret storage and rotation | Secrets Manager |
| encryption key custody | KMS |
| aggregated findings | Security Hub |
| threat detection | GuardDuty |
| workload exposure and package assessment | Inspector |

| If access is denied… | Check in this order |
| --- | --- |
| identity permissions | IAM policy or role |
| target resource permissions | resource policy |
| encryption boundary | KMS key policy or grant |
| org-level boundary | SCP or delegated account restrictions |
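
The access-denied order above can be treated as an ordered checklist. The sketch below is only a troubleshooting aid under that assumption: it walks the layers in the listed order and reports the first one not yet confirmed to allow the request. It does not reproduce IAM's actual policy evaluation logic, and the function name and input format are hypothetical.

```python
from typing import Optional

# Layers in the order listed above, with where to look for each.
CHECK_ORDER = [
    ("identity permissions", "IAM policy or role"),
    ("target resource permissions", "resource policy"),
    ("encryption boundary", "KMS key policy or grant"),
    ("org-level boundary", "SCP or delegated account restrictions"),
]


def first_blocker(allowed: dict) -> Optional[str]:
    """Return the first layer not confirmed to allow the request.

    Layers missing from `allowed` count as unverified, so they are
    reported as the next thing to check.
    """
    for layer, where_to_look in CHECK_ORDER:
        if not allowed.get(layer, False):
            return f"{layer}: check {where_to_look}"
    return None


# Hypothetical findings from an investigation so far.
findings = {
    "identity permissions": True,
    "target resource permissions": True,
    "encryption boundary": False,
    "org-level boundary": True,
}
print(first_blocker(findings))
```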

Networking and content delivery chooser

| Need | Strongest first fit | Why |
| --- | --- | --- |
| private access to AWS-managed service | VPC endpoint / PrivateLink | avoids public NAT path |
| CDN and cache edge layer | CloudFront | edge caching and acceleration |
| global traffic acceleration | Global Accelerator | static anycast-style path optimization |
| hybrid private connectivity | VPN or Transit Gateway patterns | network extension design |
| app path troubleshooting | route table -> SG -> NACL -> endpoint/DNS order | fastest structured isolation path |

Network symptom table

| Symptom | First things to check | Common trap |
| --- | --- | --- |
| instance or service unreachable | route tables, SGs, NACLs, gateway path | opening everything before proving the block |
| intermittent connectivity | return-path state, NACL stateless rules, endpoint path | assuming NAT supports inbound reachability |
| CloudFront serving wrong or stale content | cache behavior, TTL, invalidation, origin health | treating cache issue as origin outage immediately |
| private service access failing | endpoint type, route, SG, DNS | debugging IAM before proving network path |
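
The "NACL stateless rules" check is easier to see with a toy model. The sketch below assumes a simplified network ACL in which rules are evaluated in ascending rule-number order, the first matching rule wins, and anything unmatched is denied. Because NACLs are stateless, the response to an inbound request must also be allowed outbound to the client's ephemeral port. The rule tuples and port numbers are illustrative only.

```python
def nacl_allows(rules, port):
    """rules: list of (rule_number, low_port, high_port, action).

    The lowest rule number that covers the port decides;
    no match means the implicit deny applies.
    """
    for _, low, high, action in sorted(rules):
        if low <= port <= high:
            return action == "allow"
    return False


def round_trip_ok(inbound, outbound, server_port, client_port):
    """Request must pass inbound rules; the reply must pass outbound rules."""
    return nacl_allows(inbound, server_port) and nacl_allows(outbound, client_port)


inbound = [(100, 443, 443, "allow")]
outbound_broken = [(100, 443, 443, "allow")]        # no ephemeral range
outbound_fixed = [(100, 1024, 65535, "allow")]      # reply ports allowed

# Hypothetical client connecting from ephemeral port 50000 to HTTPS.
print(round_trip_ok(inbound, outbound_broken, 443, 50000))  # False
print(round_trip_ok(inbound, outbound_fixed, 443, 50000))   # True
```

A security group would not need the second rule because it tracks connection state, which is why the return path is one of the first things to check on a NACL.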

Cost-aware operations quick wins

| Pattern | Operationally safer cost win |
| --- | --- |
| stale storage growth | lifecycle policies, snapshot retention rules, archive tiers |
| repeated NAT egress | use VPC endpoints where the service supports them |
| oversized compute | right-size from utilization and recommendation data |
| idle orphaned resources | clean unattached volumes, stale snapshots, unused load balancers |
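
For the idle-resource row, a minimal sketch of spotting unattached EBS volumes from inventory data follows. The records mimic the shape of EC2 volume descriptions (`VolumeId`, `State`, `Size`, `Attachments`), but the IDs and sizes are made up, and the helper name is hypothetical. It only lists candidates; deciding to delete is a separate, reviewed step.

```python
def unattached_volumes(volumes):
    """Return volumes in the 'available' state with no attachments."""
    return [
        v for v in volumes
        if v["State"] == "available" and not v.get("Attachments")
    ]


# Hypothetical inventory; Size is in GiB.
volumes = [
    {"VolumeId": "vol-0aaa", "State": "in-use", "Size": 100,
     "Attachments": [{"InstanceId": "i-0123"}]},
    {"VolumeId": "vol-0bbb", "State": "available", "Size": 50, "Attachments": []},
    {"VolumeId": "vol-0ccc", "State": "available", "Size": 200, "Attachments": []},
]

candidates = unattached_volumes(volumes)
candidate_ids = [v["VolumeId"] for v in candidates]
reclaimable_gib = sum(v["Size"] for v in candidates)
print(candidate_ids, reclaimable_gib)  # ['vol-0bbb', 'vol-0ccc'] 250
```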

Last 15-minute review

| Review this | Because it fixes… |
| --- | --- |
| CloudWatch vs CloudTrail vs Flow Logs | wrong-signal mistakes |
| composite alarms and runbook automation | noisy-incident and repetitive-work misses |
| Multi-AZ, backups, restore, RTO/RPO | continuity confusion |
| CloudFormation, Systems Manager, EventBridge roles | automation-pattern misses |
| IAM policy vs resource policy vs KMS policy | access-denied confusion |
| route table -> SG -> NACL -> endpoint order | networking troubleshooting mistakes |

What strong answers usually do

  • start from the safest signal and operational boundary
  • prefer low-risk remediation and rollback paths before invasive reconfiguration
  • treat observability, backup, security, and automation as one operating model
  • choose the option that makes the environment more repeatable and observable after the incident

Revised on Sunday, May 10, 2026