AWS SOA-C03 Cheat Sheet: CloudOps Monitoring, Recovery, and Automation

March 28, 2026

AWS SOA-C03 CloudOps cheat sheet for monitoring, remediation, reliability, automation, security, networking, troubleshooting, and final review traps.

Keep this cheat sheet open while drilling. SOA-C03 is an operations exam first: read the signal, classify the operational boundary, take the lowest-risk remediation step, verify the result, then improve the environment so the incident gets less likely next time.

The current CloudOps version of this exam is not just the old SysOps exam under a new name. It emphasizes monitored operations, repeatable remediation, IaC-aware provisioning, business continuity, secure access, and network troubleshooting. The best answer usually improves the operating model, not just the immediate symptom.

Runbook: Step-by-step operational procedure for diagnosing, remediating, or escalating an issue safely.

Remediation: Corrective action that restores or stabilizes service after a fault or policy failure.

Blast radius: How much of the environment a change or incident can affect.

Quick facts (SOA-C03)

I verified these current AWS exam facts on May 24, 2026.

Item	Value
Exam	AWS Certified CloudOps Engineer - Associate
Exam code	SOA-C03
Questions	65 total
Scoring	50 scored + 15 unscored (unscored items are not identified)
Question types	Multiple choice and multiple response
Time	130 minutes
Passing score	720, scaled 100-1000
Cost	150 USD

Domain weights and review priority

Domain	Weight	What to compress for final review
Monitoring, Logging, Analysis, Remediation, and Performance Optimization	22%	CloudWatch, CloudTrail, logs, metrics, alarms, remediation, performance signals
Reliability and Business Continuity	22%	backups, restore, Multi-AZ, failover, RTO/RPO, health checks, continuity testing
Deployment, Provisioning, and Automation	22%	CloudFormation, Systems Manager, automation, patching, provisioning, safe change
Security and Compliance	16%	IAM, KMS, Config, audit trails, least privilege, secrets, compliance evidence
Networking and Content Delivery	18%	VPC routing, SG/NACL, endpoints, DNS, CloudFront, connectivity troubleshooting

SOA-C03 is now framed as a CloudOps exam. The highest-value review habit is to start from the operational signal, isolate the service boundary, apply the lowest-risk fix, and verify the result.

CloudOps proof stack

SOA-C03 questions usually ask whether an operator can detect a signal, isolate the boundary, remediate safely, and prevent recurrence. Keep this stack in mind before choosing a service:

Signal: identify the metric, log, event, alarm, finding, health check, or user symptom that proves something changed.
Boundary: decide whether the problem is workload, deployment, infrastructure, IAM/KMS, network path, DNS, storage, database, or managed service behavior.
Evidence: inspect logs, metrics, CloudTrail, Config history, stack events, Flow Logs, health checks, recent deployments, and runbook output before changing broad settings.
Low-risk remediation: apply the smallest reversible fix: rollback, restart, scale, restore, fail over, rotate, patch, update route/security policy, or run a scoped automation.
Verification: confirm recovery with the original signal plus user-facing or service-health evidence.
Prevention: add alarm tuning, runbook automation, IaC correction, backup test, Config rule, patch baseline, dashboard, or operational documentation so the same issue is less likely.

If an answer jumps straight to redesign, broad permissions, or manual console edits without evidence and verification, it is usually too risky for CloudOps.

Official task compression

Monitoring, Logging, Analysis, Remediation, and Performance Optimization covers metrics, alarms, filters, logs, dashboards, automated remediation, and compute/storage/database optimization. Signals must lead to action: metric, log, alarm, owner, remediation, validation, and cost/performance evidence.

Reliability and Business Continuity covers scaling, elasticity, high availability, resilient environments, backup, restore, and RTO/RPO. Backups are not the same as continuity; match RTO/RPO to failover, restore, and test evidence.

Deployment, Provisioning, and Automation covers AMIs, container images, CloudFormation, CDK, StackSets, RAM, deployment issues, and operational automation. Prefer repeatable provisioning and safe automation over console drift and manual fixes.

Security and Compliance covers IAM, access troubleshooting, multi-account security, encryption, secrets, findings, and compliance. Follow the authorization and evidence chain before broadening access.

Networking and Content Delivery covers VPC, private connectivity, Route 53, CloudFront, Global Accelerator, network logs, caching, and hybrid issues. Troubleshoot the path in order: DNS, route, security control, endpoint, logs, cache behavior, then remediation.

Fast lane picker

If the question is really about…	Focus first on…	Strongest first move
alarms, logs, or noisy incidents	CloudWatch, CloudTrail, log correlation, composite alarms	improve signal quality before automating reaction
outage tolerance or restore targets	Multi-AZ, backup, restore, failover, RTO/RPO	translate the continuity target before naming the service
repeatable operational work	Systems Manager, EventBridge, Lambda, CloudFormation	prefer safe automation over manual repetition
access, compliance, or investigation	IAM, resource policy, KMS, Config, Security Hub	follow the evaluation chain before guessing
connectivity or delivery path	route tables, security groups, NACLs, endpoints, CloudFront	classify the path component that owns the failure

Incident triage flow

Use this when the choice is between investigating first or remediating first.

    flowchart TD
	  S["Signal"] --> B["Boundary: app, host, AWS service, or dependency?"]
	  B --> R["Recent change, logs, metrics, or dependency failure?"]
	  R --> L["Smallest safe remediation"]
	  L --> V["Verify recovery and automate the fix if it repeats"]

CloudOps control loop

    flowchart LR
	  A["Detect signal"] --> B["Triage severity and boundary"]
	  B --> C["Diagnose likely root cause"]
	  C --> D["Apply lowest-risk remediation"]
	  D --> E["Verify recovery"]
	  E --> F["Document and automate prevention"]

SOA-C03 answer sequence

Use this when the question is really about operational judgment under a live signal.

    flowchart TD
	  S["Scenario"] --> B["Find the service boundary"]
	  B --> R["Read logs, metrics, and recent changes"]
	  R --> M["Make the smallest safe remediation"]
	  M --> V["Verify recovery"]
	  V --> A["Automate or document the repeat fix"]

What to notice:

wrong answers often skip verification or jump to a high-blast-radius fix
SOA-C03 rewards safer operational judgment more than flashy redesigns
if a remediation pattern is stable and repeatable, automation is usually the next improvement

Final answer stack

When two options both sound operationally plausible, keep the one that satisfies the full CloudOps loop:

Use the right signal first. CloudWatch metrics show health, CloudWatch Logs show application/system detail, CloudTrail shows API actors, Config shows configuration history, and Flow Logs show network flow metadata.
Fix the narrow layer. For access denied, inspect IAM, resource policy, KMS key policy, SCP, and CloudTrail before granting broad access. For networking, inspect DNS, route, SG, NACL, endpoint, and logs before opening everything.
Respect blast radius. Roll back, fail over, scale, patch, or automate the smallest safe scope before replacing architecture.
Verify against the requirement. RTO, RPO, latency, error rate, backup success, restore success, and alarm state should prove the fix worked.
Return to source of truth. Correct CloudFormation/CDK/StackSets/Systems Manager state after emergency changes so drift does not become normal.
Automate only proven fixes. EventBridge, Lambda, and Systems Manager Automation need scoped roles, idempotent steps, execution logs, and failure handling.
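
As a rough illustration of what "idempotent steps" and "execution logs" can mean for an automated fix, here is a minimal sketch that uses an in-memory stand-in instead of a real AWS client. `FakeInstance` and `ensure_running` are hypothetical names invented for this example, not part of any AWS SDK.

```python
from dataclasses import dataclass


@dataclass
class FakeInstance:
    """In-memory stand-in for a managed resource (hypothetical, not an AWS API)."""
    state: str = "stopped"
    start_calls: int = 0

    def start(self) -> None:
        self.start_calls += 1
        self.state = "running"


def ensure_running(instance: FakeInstance, log: list) -> str:
    """Idempotent remediation: act only when observed state differs from the target."""
    if instance.state == "running":
        log.append("no-op: already running")
        return "unchanged"
    instance.start()
    # Verify the result before reporting success.
    if instance.state != "running":
        log.append("failed: instance did not reach running")
        return "failed"
    log.append("remediated: started instance")
    return "changed"


execution_log: list = []
instance = FakeInstance()
first = ensure_running(instance, execution_log)
second = ensure_running(instance, execution_log)  # safe to re-run
```

Running the step twice changes the resource once; the second run is a logged no-op, which is the property that makes a runbook safe to retry.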

Signal-to-action map

Signal	First evidence	Safer first action
CloudWatch alarm fires after a deployment	deployment timeline, application logs, target health, CloudTrail change event	stop rollout, roll back, or remediate the smallest failing dependency
repeated known incident	runbook history, alarm pattern, Systems Manager execution output	automate the proven runbook with scoped permissions and execution logging
performance degradation	CloudWatch metrics, Performance Insights, container or host metrics, recent change	tune the bottleneck before replacing the architecture
failed CloudFormation update	stack events, first failed resource, dependency and permission chain	fix template, parameter, quota, dependency, or role issue through IaC
access denied during operations	IAM policy, resource policy, KMS key policy, SCP, CloudTrail	fix the narrow authorization layer instead of granting admin access
network path failure	route table, SG, NACL, DNS, endpoint, Flow Logs, Reachability Analyzer	prove the blocked hop before opening rules broadly

CloudOps evidence map

Stem evidence	Strong answer includes	Reject answers that
“EC2 memory or disk issue”	CloudWatch agent or host-level telemetry path	use default EC2 metrics only (memory and disk usage need agent data)
“alert storm” or “too many pages”	composite alarms, tuned thresholds, suppression/routing logic, clear owner	add more raw alarms with no action path
“same incident repeats”	EventBridge rule, SSM Automation/Lambda runbook, scoped role, execution output	keep manual steps as the long-term answer
“cross-account/Region visibility”	shareable CloudWatch dashboards, central logs/metrics strategy, account/Region scope	rely on one account-local console view
“need who changed it”	CloudTrail plus Config resource timeline	use CloudWatch metric graphs only
“performance bottleneck”	metric source, baseline, bottleneck layer, right-size/tune action	replace architecture before proving constraint
“recovery target stated”	RTO/RPO, backup frequency, restore method, tested restore evidence	assume any backup satisfies continuity
“network issue”	DNS, route table, SG, NACL, endpoint, flow/access logs, Reachability Analyzer	open all rules or restart instances first
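
For the memory and disk row above, a minimal sketch of a CloudWatch agent configuration on Linux, built as a Python dictionary and serialized to JSON. The choice of mount point (`/`) and the two measurements are assumptions for illustration; a real fleet would tune these.

```python
import json

# Sketch of a CloudWatch agent "metrics" section for host-level telemetry.
# Default EC2 metrics do not include memory or disk usage, so the agent
# collects them from inside the instance.
agent_config = {
    "metrics": {
        # Tag each datapoint with the instance ID so alarms can target a host.
        "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
        "metrics_collected": {
            "mem": {"measurement": ["mem_used_percent"]},
            "disk": {"measurement": ["used_percent"], "resources": ["/"]},
        },
    }
}

config_json = json.dumps(agent_config, indent=2)
```

The resulting JSON is the kind of document the agent reads at startup; how it is distributed (for example through Systems Manager Parameter Store) is left out of this sketch.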

Question-type traps

Question type	Exam-day habit
Multiple choice	Identify the failing layer first: metric, log, workload, deployment, access, or network path.
Multiple response	Include every required operational step: detect, diagnose, remediate, verify, and prevent recurrence.

Unanswered questions are scored as incorrect, and there is no penalty for guessing. Do not over-invest in a single long operations scenario; mark it, collect easier service-signal questions, then return.

Scenario eliminations

Stem clue	Eliminate first	Keep in play
repeated known issue with documented manual fix	keep paging humans only	EventBridge plus Systems Manager Automation or Lambda runbook
alert storm with no clear owner	more raw alarms	composite alarms, tuned thresholds, actionable routing
need “who changed what”	CloudWatch metrics only	CloudTrail and AWS Config
EC2 access needed without inbound SSH	public bastion by default	Systems Manager Session Manager
private subnet needs AWS service access	NAT gateway first	VPC endpoint or PrivateLink where supported
backup exists but recovery target is unclear	assume backup solves DR	RTO/RPO, restore testing, failover design
stack update failed	retry blindly	inspect first failing CloudFormation event and dependency/permission chain
intermittent network path	open all rules	route table, SG, NACL, DNS, endpoint, and Flow Logs evidence

CloudOps distractors often fix the symptom with the largest possible change. Prefer evidence, narrow blast radius, reversible remediation, and verification.

Monitoring and logging chooser

Need	Strongest first AWS signal	Why
resource or service health trend	CloudWatch metrics	fast operational signal
application or system event detail	CloudWatch Logs	detailed event and error context
API audit trail	CloudTrail	identity and action history
network accept/reject and path detail	VPC Flow Logs	flow-level metadata (not packet contents) showing what was accepted or rejected
user-facing synthetic experience	CloudWatch Synthetics canaries or Route 53 health checks	outside-in service verification
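
To make the Flow Logs row concrete, here is a small sketch that parses records in the default (version 2) space-separated format and filters rejected flows to a given destination port. The sample records, account ID, and interface ID are made up for illustration.

```python
# Field order of the default VPC Flow Logs record format (version 2).
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def parse_flow_record(line: str) -> dict:
    """Split one default-format record into a field dictionary."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def rejected_flows(lines, dstport: int) -> list:
    """Return parsed records that were rejected on the given destination port."""
    records = (parse_flow_record(line) for line in lines)
    return [r for r in records
            if r["action"] == "REJECT" and r["dstport"] == str(dstport)]


# Hypothetical sample records.
sample = [
    "2 111122223333 eni-0abc 10.0.1.5 10.0.2.9 50000 443 6 10 840 1700000000 1700000060 ACCEPT OK",
    "2 111122223333 eni-0abc 10.0.1.7 10.0.2.9 50001 22 6 3 180 1700000000 1700000060 REJECT OK",
]
blocked_ssh = rejected_flows(sample, 22)
```

A `REJECT` here shows that a security group or network ACL dropped the flow; it does not show which rule did it, so the next step is still to inspect those controls.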

Alarm problem	Better operational answer
too many noisy alerts	composite alarms, tuned thresholds, actionable routing
repeated known incident pattern	EventBridge plus Lambda or SSM Automation
no root-cause evidence	correlate metrics, logs, and recent deploy/change timeline
default metrics exist but host internals (memory, disk) are missing	install the CloudWatch agent to close the telemetry gap
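
For the noisy-alert row, a minimal sketch of building a composite alarm rule expression. The alarm names and the SNS topic ARN are hypothetical; the dictionary keys follow the parameter names of the CloudWatch `PutCompositeAlarm` API, but nothing here calls AWS.

```python
def composite_rule(alarm_names, operator: str = "AND") -> str:
    """Combine child alarms into a composite AlarmRule expression."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    if not alarm_names:
        raise ValueError("at least one child alarm is required")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


# Page only when errors AND latency are both in ALARM (hypothetical alarm names).
rule = composite_rule(["api-5xx-high", "api-latency-high"])

composite_alarm_params = {
    "AlarmName": "api-degraded",
    "AlarmRule": rule,
    # Hypothetical SNS topic for the owning team.
    "AlarmActions": ["arn:aws:sns:us-east-1:111122223333:ops-pager"],
}
```

Requiring both child alarms reduces pages from a single noisy signal; the trade-off is that a real incident showing only one symptom will not page, so the combination should be chosen deliberately.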

Alarm and remediation chooser

Operational need	Better answer	Watch for
Notify humans of an actionable condition	CloudWatch alarm to SNS/User Notifications	Alert must have owner, severity, and runbook link.
Reduce noisy multi-signal paging	Composite alarm	Do not hide real incidents; combine signals deliberately.
Trigger safe automated remediation	CloudWatch/EventBridge -> SSM Automation or Lambda	Target role must be scoped and action must be idempotent.
Route AWS service events	EventBridge rule with specific pattern	Broad event patterns create false positives.
Run known operational procedure	Predefined or custom SSM Automation runbook	Capture execution output and failure path.
Show operations status across teams	Shareable dashboard across accounts/Regions where needed	Dashboards do not replace alarms or runbooks.
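
The "specific pattern" row can be illustrated with a simplified event-pattern matcher. This is a sketch of exact-value matching only; real EventBridge patterns support more operators (prefix, numeric, anything-but, and so on), and the `matches` helper is a hypothetical name, not an AWS API.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: lists are allowed values, dicts are nested patterns."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


def ec2_event(state: str) -> dict:
    """Build a trimmed-down EC2 state-change event (instance ID is made up)."""
    return {
        "source": "aws.ec2",
        "detail-type": "EC2 Instance State-change Notification",
        "detail": {"instance-id": "i-0123456789abcdef0", "state": state},
    }


# Specific: only stopped instances trigger the remediation target.
specific_pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}
# Broad: every EC2 event matches, including ones that need no action.
broad_pattern = {"source": ["aws.ec2"]}

stopped_hit = matches(specific_pattern, ec2_event("stopped"))
running_hit = matches(specific_pattern, ec2_event("running"))
broad_hit = matches(broad_pattern, ec2_event("running"))
```

The broad pattern fires on a healthy `running` transition, which is exactly the false positive the table warns about.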

Performance optimization chooser

Bottleneck clue	Check first	Stronger remediation
EC2 CPU, memory, disk, or network pressure	CloudWatch, CloudWatch agent, instance type, ENA, EBS metrics	right-size, tune storage/networking, adjust Auto Scaling, or use placement where justified
EBS latency or throughput	volume type, IOPS/throughput settings, queue length, instance limits	modify volume type or performance settings based on measured need
S3 transfer or storage cost issue	access pattern, lifecycle, multipart, transfer path, DataSync need	lifecycle policy, multipart upload, DataSync, or Transfer Acceleration only when the path justifies it
shared file storage mismatch	EFS or FSx workload fit, throughput mode, lifecycle policy	choose storage service and lifecycle behavior by workload semantics
RDS performance pressure	Performance Insights, CloudWatch metrics, connection count, storage, proxy need	tune instance/storage, add RDS Proxy where connection storms are the issue, or adjust scaling design

Reliability and continuity chooser

Requirement	Strongest first fit	Why
in-region database availability	RDS or Aurora Multi-AZ	managed HA pattern
regional DNS failover behavior	Route 53 health checks and routing policy	traffic steering on health
read offload and cache pressure reduction	ElastiCache or CloudFront where appropriate	reduces repeated backend load
point-in-time restore and data protection	backup frequency and restore design matched to RPO	backup design follows data-loss target
minimal downtime during failure	tested failover pattern, not just backups	outage-time requirement is about recovery speed

Pair	Keep this distinction clear
backup vs DR	data-restoration capability vs outage-time resilience
Multi-AZ vs read replica	availability/failover vs scaling reads
RTO vs RPO	restore time vs acceptable data-loss window
versioning vs backup	object-history feature vs broader restore strategy

Backup and restore evidence checklist

Requirement	Confirm this before choosing
“restore to a point before corruption”	PITR/snapshot capability, retention window, and restore test evidence
“recover within minutes”	failover or warm standby pattern; backup-only may be too slow
“minimal data loss”	RPO-aligned backup/replication frequency
“protect against accidental delete”	versioning, backup vault, retention/lock controls where required
“centralize backup policy”	AWS Backup plan, vault, resource assignment, cross-account/Region copy where needed
“prove backup works”	restore test, runbook, monitoring for backup job failures
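
A quick way to reason about the checklist above is to compare measured values against the stated targets. This sketch assumes worst-case data loss is roughly one backup interval and that restore time comes from an actual restore test; both are simplifications, and the function name is hypothetical.

```python
from datetime import timedelta


def meets_targets(backup_interval: timedelta,
                  measured_restore_time: timedelta,
                  rpo: timedelta,
                  rto: timedelta) -> dict:
    """Compare backup cadence and tested restore time with RPO and RTO."""
    return {
        # Data written just after a backup is lost until the next one runs.
        "rpo_met": backup_interval <= rpo,
        # Only a tested restore duration counts as evidence for RTO.
        "rto_met": measured_restore_time <= rto,
    }


# Daily backups, 3-hour tested restore, against a 1-hour RPO and 4-hour RTO.
result = meets_targets(
    backup_interval=timedelta(hours=24),
    measured_restore_time=timedelta(hours=3),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
)
```

In this example the restore is fast enough, but a daily backup cannot satisfy a one-hour RPO, so "a backup exists" is not a sufficient answer.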

Automation and provisioning chooser

Requirement	Strongest first fit	Why
declarative infrastructure	CloudFormation or CDK	repeatable provisioning
repeatable fleet operations	Systems Manager	patching, automation, session, inventory
event-driven remediation	EventBridge plus Lambda or SSM	reactive but controlled automation
multi-account deployment standardization	StackSets / Organizations-aware rollout	centralized governance
secure shell-less instance access	Session Manager	no public SSH exposure required

Deployment and automation traps

Trap	Better reading
rebuilding manually for every known incident	automate the proven fix
looking only at terminal CloudFormation error	inspect the first failing resource in stack events
using wide admin access for automation	give the runbook or function only the permissions it needs
changing many resources at once during an outage	choose the smallest reversible remediation first
treating Terraform/Git as out of scope	third-party tools and Git can appear when the question is about operational automation or deployment maintenance
StackSets without account scope	multi-account rollout still needs OU/account targeting, permissions, failure behavior, and drift handling
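
The "first failing resource" trap can be sketched as a small helper over stack events. The dictionary keys mirror fields in the CloudFormation `DescribeStackEvents` response; the events themselves, including the failure reasons, are invented for illustration.

```python
from datetime import datetime, timedelta


def first_failed_event(events: list):
    """Return the earliest *_FAILED event, which usually holds the root cause."""
    failed = [e for e in events if e["ResourceStatus"].endswith("_FAILED")]
    return min(failed, key=lambda e: e["Timestamp"]) if failed else None


t0 = datetime(2026, 1, 1, 12, 0, 0)

# Stack events are typically listed newest first, so the top entry is a
# downstream symptom rather than the original failure.
events = [
    {"Timestamp": t0 + timedelta(seconds=30), "LogicalResourceId": "AppService",
     "ResourceStatus": "CREATE_FAILED",
     "ResourceStatusReason": "Resource creation cancelled"},
    {"Timestamp": t0 + timedelta(seconds=20), "LogicalResourceId": "AppRole",
     "ResourceStatus": "CREATE_FAILED",
     "ResourceStatusReason": "Caller lacks permission to create the role"},
    {"Timestamp": t0, "LogicalResourceId": "AppBucket",
     "ResourceStatus": "CREATE_COMPLETE", "ResourceStatusReason": ""},
]

root_cause = first_failed_event(events)
```

Here the newest failure only says the creation was cancelled; the earlier `AppRole` failure carries the permission problem that needs fixing in the template or role.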

Security and compliance ops chooser

Requirement	Strongest first fit
least-privilege identity evaluation	IAM policies, resource policies, Access Analyzer
audit trail and configuration history	CloudTrail and AWS Config
secret storage and rotation	Secrets Manager
encryption key custody	KMS
aggregated findings	Security Hub
threat detection	GuardDuty
workload exposure and package assessment	Inspector

If access is denied…	Check in this order
identity permissions	IAM policy or role
target resource permissions	resource policy
encryption boundary	KMS key policy or grant
org-level boundary	SCP or delegated account restrictions
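
The check order in the table can be written as a small triage helper. This mirrors the troubleshooting order above, not IAM's actual policy evaluation logic (where an explicit deny in any applicable policy overrides allows). The layer names and the `first_blocking_layer` function are hypothetical.

```python
# Troubleshooting order taken from the table above.
CHECK_ORDER = ["identity_policy", "resource_policy", "kms_key_policy", "scp"]


def first_blocking_layer(results: dict):
    """Return the first layer not confirmed to permit the action, else None.

    `results` maps layer name to True once that layer has been verified;
    a layer that has not been checked yet is treated as unverified.
    """
    for layer in CHECK_ORDER:
        if not results.get(layer, False):
            return layer
    return None


# Example: IAM and bucket policy look fine, but the KMS key policy blocks decrypt.
findings = {
    "identity_policy": True,
    "resource_policy": True,
    "kms_key_policy": False,
    "scp": True,
}
blocker = first_blocking_layer(findings)
```

The point of the helper is the habit it encodes: fix the one layer that blocks the request instead of attaching broad permissions to the identity.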

Compliance and findings map

Finding or requirement	Strong first path
noncompliant configuration	AWS Config rule, remediation, notification, and evidence trail
security posture finding	Security Hub or GuardDuty finding routed to owner or runbook
exposed workload package or vulnerability	Inspector assessment and patch/remediation workflow
Trusted Advisor security check	remediate the specific check, then monitor for recurrence
encryption at rest issue	KMS configuration, key policy, resource setting, and audit evidence
secret exposure or rotation need	Secrets Manager or Parameter Store fit, KMS, IAM scope, and rotation workflow

Networking and content delivery chooser

Need	Strongest first fit	Why
private access to AWS-managed service	VPC endpoint / PrivateLink	avoids public NAT path
CDN and cache edge layer	CloudFront	edge caching and acceleration
global traffic acceleration	Global Accelerator	static anycast IPs and optimized network path
hybrid private connectivity	Site-to-Site VPN, Direct Connect, or Transit Gateway patterns	network extension design
app path troubleshooting	route table -> SG -> NACL -> endpoint/DNS order	fastest structured isolation path

Network symptom table

Symptom	First things to check	Common trap
instance or service unreachable	route tables, SGs, NACLs, gateway path	opening everything before proving the block
intermittent connectivity	return-path state, NACL stateless rules, endpoint path	assuming NAT supports inbound reachability
CloudFront serving wrong or stale content	cache behavior, TTL, invalidation, origin health	treating cache issue as origin outage immediately
private service access failing	endpoint type, route, SG, DNS	debugging IAM before proving network path
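
The "NACL stateless rules" entry is easier to remember with a toy model. This sketch treats rules as allowed port ranges only and ignores rule numbers, deny entries, CIDRs, and protocols; all function names are hypothetical.

```python
def port_allowed(ranges, port: int) -> bool:
    """True if the port falls inside any (low, high) allowed range."""
    return any(low <= port <= high for low, high in ranges)


def connection_works(sg_inbound, nacl_inbound, nacl_outbound,
                     server_port: int, client_port: int) -> bool:
    """Model an inbound request to a server and its reply to the client."""
    if not port_allowed(sg_inbound, server_port):
        return False
    if not port_allowed(nacl_inbound, server_port):
        return False
    # Security groups are stateful: the reply is allowed automatically.
    # Network ACLs are stateless: the reply to the client's ephemeral
    # port needs its own outbound allow rule.
    return port_allowed(nacl_outbound, client_port)


https_only = [(443, 443)]
ephemeral = [(1024, 65535)]

# Outbound NACL allows only 443, so replies to port 50000 are dropped.
broken = connection_works(https_only, https_only, https_only, 443, 50000)
# Outbound NACL allows the ephemeral range, so the reply gets through.
fixed = connection_works(https_only, https_only, ephemeral, 443, 50000)
```

The security group is identical in both cases; only the network ACL's outbound rule decides whether the return path works.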

Network operations evidence map

Symptom	Evidence source	Safer next action
Instance cannot reach internet	route table, NAT/IGW, subnet, SG/NACL, Flow Logs	Fix missing route/path rather than broadening every rule.
Private AWS service access fails	endpoint type, endpoint policy, private DNS, route, SG	Confirm endpoint path before assuming public NAT is required.
DNS answer is wrong	Route 53 records, Resolver rules, query logs, TTL/cache behavior	Fix DNS config before changing routes.
CloudFront returns stale content	cache behavior, TTL, invalidation, origin headers, CloudFront logs	Do not treat every stale object as origin outage.
WAF/Network Firewall behavior unclear	WAF logs, firewall logs, Flow Logs, Config	Prove whether traffic hit the control point.
Hybrid connectivity intermittent	VPN/DX metrics, BGP status, route propagation, Flow Logs	Validate path health and routes before replacing services.

Network and content-delivery traps

Trap	Better reading
NAT gateway first for private AWS service access	VPC endpoints or PrivateLink can avoid public egress and repeated NAT cost where supported.
route table only for every connectivity issue	DNS, security groups, NACLs, endpoint policy, NAT, and return path may own the failure.
CloudFront stale content treated as origin failure	Check cache behavior, TTL, invalidation, origin response, and headers first.
Global Accelerator confused with CloudFront	Global Accelerator provides static anycast IPs and improves TCP/UDP path performance and failover; CloudFront caches HTTP content at the edge.
network protection audited only by manual review	Use service logs and controls such as WAF logs, Network Firewall logs, Shield metrics, Resolver DNS Firewall, and Config where relevant.

Cost-aware operations quick wins

Pattern	Operationally safer cost win
stale storage growth	lifecycle policies, snapshot retention rules, archive tiers
repeated NAT egress	use VPC endpoints where the service supports them
oversized compute	right-size from utilization and recommendation data
idle orphaned resources	clean unattached volumes, stale snapshots, unused load balancers
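
For the orphaned-resources row, a minimal sketch of finding unattached EBS volumes from a list of volume descriptions. The dictionary keys mirror fields in the EC2 `DescribeVolumes` response; the volume IDs are made up, and no AWS call is made here.

```python
def unattached_volumes(volumes: list) -> list:
    """Return IDs of volumes that are available and have no attachments."""
    return [
        v["VolumeId"]
        for v in volumes
        if v["State"] == "available" and not v.get("Attachments")
    ]


# Hypothetical inventory snapshot.
inventory = [
    {"VolumeId": "vol-0aaa", "State": "in-use",
     "Attachments": [{"InstanceId": "i-0123456789abcdef0"}]},
    {"VolumeId": "vol-0bbb", "State": "available", "Attachments": []},
    {"VolumeId": "vol-0ccc", "State": "available", "Attachments": []},
]

candidates = unattached_volumes(inventory)
```

Treat the result as a review list rather than a delete list: snapshot or confirm ownership before removing anything, in keeping with the low-risk remediation habit.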

Last 15-minute review

Review this	Because it fixes…
CloudWatch vs CloudTrail vs Flow Logs	wrong-signal mistakes
composite alarms and runbook automation	noisy-incident and repetitive-work misses
Multi-AZ, backups, restore, RTO/RPO	continuity confusion
CloudFormation, Systems Manager, EventBridge roles	automation-pattern misses
IAM policy vs resource policy vs KMS policy	access-denied confusion
route table -> SG -> NACL -> endpoint order	networking troubleshooting mistakes

What strong answers usually do

start from the safest signal and operational boundary
prefer low-risk remediation and rollback paths before invasive reconfiguration
treat observability, backup, security, and automation as one operating model
choose the option that makes the environment more repeatable and observable after the incident

Revised on Monday, June 15, 2026
