AWS MLA-C01 Cheat Sheet: SageMaker AI, MLOps, and Drift

March 28, 2026

AWS MLA-C01 cheat sheet for SageMaker AI, ML pipelines, feature engineering, model development, deployment, monitoring, drift, security, cost, and final review traps.

Keep this cheat sheet open while drilling questions. MLA‑C01 rewards “production ML realism”: data quality gates, repeatability, safe deployments, drift monitoring, cost controls, and least-privilege security.

Quick facts (MLA-C01)

I verified these current AWS exam facts on May 24, 2026.

Item	Value
Exam	AWS Certified Machine Learning Engineer - Associate
Exam code	MLA-C01
Questions	65 total
Scoring	50 scored + 15 unscored (unscored items are not identified)
Question types	Multiple choice, multiple response, ordering, and matching
Time	130 minutes
Passing score	720, scaled 100-1000
Cost	150 USD
Intended candidate	At least 1 year using SageMaker and other AWS ML engineering services

Domain weights and review priority

Domain	Weight	What to compress for final review
Data Preparation for Machine Learning	28%	ingestion, transformation, validation, feature engineering, bias, train/serve consistency
ML Model Development	26%	model approach, training, tuning, metrics, evaluation, versioning, explainability
Deployment and Orchestration of ML Workflows	22%	endpoints, batch/async/serverless inference, pipelines, CI/CD, IaC, scaling
ML Solution Monitoring, Maintenance, and Security	24%	drift, Model Monitor, logs, cost, IAM, KMS, VPC isolation, incident response

MLA-C01 is an engineering and MLOps exam, not a pure data-science theory exam. Prefer answers that make the ML workflow repeatable, observable, deployable, secure, and recoverable; reject answers that only improve model accuracy while ignoring production controls.

MLOps proof stack

MLA-C01 questions usually test whether an ML workflow can move from data to production without losing evidence, safety, or control. Before choosing an answer, walk the scenario through this stack:

Data readiness: verify ingestion, format, labels, leakage, bias, missing values, feature transformations, and train/serve consistency before changing model complexity.
Model evidence: choose the model approach, metric, tuning method, explainability control, and artifact version that match the problem and risk.
Pipeline repeatability: preserve code, parameters, data lineage, experiment results, registry state, approvals, and IaC so the workflow can be reproduced.
Deployment fit: match real-time, serverless, async, batch, or multi-model serving to latency, payload, traffic shape, cost, and rollback requirements.
Monitoring loop: baseline inputs and outputs, detect data drift/model drift, capture latency/errors, and trigger rollback or retraining from evidence.
Security boundary: apply IAM, KMS, VPC isolation, private access, CloudTrail, artifact controls, and least privilege across data, training, registry, and inference.

If an answer improves accuracy but cannot explain data quality, versioning, deployment safety, monitoring, or security, it is usually too narrow for Machine Learning Engineer - Associate.

Official task compression

Data Preparation for ML covers ingesting and storing data, transformations, feature engineering, labeling, integrity checks, bias, and compliance. Fix data quality, leakage, formats, feature consistency, bias, and sensitive-data handling before model complexity.

ML Model Development covers choosing the modeling approach, training/refinement, hyperparameter tuning, evaluation, versioning, explainability, and model comparison. Choose the simplest model/service that fits the problem, then prove performance with the right metric and versioned artifacts.

Deployment and Orchestration covers endpoint type, compute, containers, IaC, auto scaling, CI/CD, tests, rollback, and workflow automation. Deployment is an infrastructure decision: latency, payload, traffic pattern, cost, versioning, rollback, and orchestration all matter.

ML Solution Monitoring, Maintenance, and Security covers inference monitoring, drift, infrastructure/cost optimization, CloudTrail, IAM, KMS, VPC isolation, incident response, and retraining triggers. Production ML requires baselines, alarms, dashboards, cost controls, least privilege, encryption, network isolation, and a maintenance loop.

MLA-C01 boundary cues

AWS frames MLA-C01 as an ML engineering exam. Use that boundary to reject overbuilt or off-scope answers:

If the answer designs the entire enterprise ML strategy or a full end-to-end platform from scratch, it is probably beyond the associate-level target unless the stem asks for a specific workflow component.
If the answer depends on deep specialization in multiple ML domains such as NLP plus computer vision, look for the simpler MLOps/service-fit answer first.
If the answer focuses on quantization details or low-level model compression trade-offs, be cautious; the exam emphasizes building, deploying, monitoring, and maintaining ML solutions on AWS.
If the answer tunes a model before checking data leakage, feature consistency, metric fit, and evaluation representativeness, it is usually premature.
If the answer deploys a model without registry/versioning, endpoint fit, monitoring baseline, rollback path, and IAM/KMS controls, it is not production-ready.

Fast strategy (what the exam expects)

If the question asks for the best-fit managed ML option, the answer is often SageMaker (Feature Store, Pipelines, Model Registry, managed endpoints).
If the scenario is “data is messy,” think data quality checks, profiling, transformations, and feature consistency (train/serve).
If the scenario is “accuracy dropped in prod,” think drift, monitoring baselines, A/B or shadow, and retraining triggers.
If the scenario is “cost is spiking,” think right-sizing, endpoint type selection, auto scaling, Spot / Savings Plans, and budgets/tags.
If the scenario mentions security or compliance, include least-privilege IAM, encryption, VPC isolation, and audit logging.
Read the last sentence first to capture constraints: latency, cost, ops effort, compliance, auditability.

AWS notes that the exam may use short AWS service names with a Help-button reference list. Do not let short names become a distraction; map each option back to the lifecycle layer it controls.

ML decision flow

Use this when the question is really asking which lifecycle layer is failing.

    flowchart TD
	  S["Scenario"] --> D["Data quality or feature mismatch?"]
	  D -->|yes| P["Fix prep, profiling, or Feature Store first"]
	  D -->|no| M["Model / deployment / monitoring issue?"]
	  M -->|yes| O["Tune, register, deploy, or monitor"]
	  M -->|no| G["Check governance, cost, and security controls"]

MLA-C01 answer sequence

Use this when the stem mixes data prep, model choice, deployment, and monitoring.

    flowchart TD
	  S["Scenario"] --> D["Check data quality and feature consistency"]
	  D --> M["Choose model, training, or tuning path"]
	  M --> P["Pick deployment and orchestration pattern"]
	  P --> O["Add monitoring, drift, and scaling controls"]
	  O --> G["Verify governance, security, and cost fit"]

Domain weights (how to allocate your time)

Domain	Weight	Prep focus
Domain 1: Data Preparation for ML	28%	Ingest/ETL, feature engineering, data quality and bias basics
Domain 2: ML Model Development	26%	Model choice, training/tuning, evaluation, Clarify/Debugger/Registry
Domain 3: Deployment + Orchestration	22%	Endpoint types, scaling, IaC, CI/CD for ML pipelines
Domain 4: Monitoring + Security	24%	Drift/model monitor, infra monitoring + costs, security controls

Question-type traps

Question type	Exam-day habit
Multiple choice	Find the failing lifecycle layer first: data, features, training, evaluation, deployment, monitoring, cost, or security.
Multiple response	Select the full production control set. A correct answer may require data validation plus model registry plus deployment monitoring, not just one service.
Ordering	Put the ML workflow in a safe sequence: prepare data, train/tune, evaluate, register, approve, deploy, monitor, retrain.
Matching	Match by capability rather than brand familiarity: feature consistency, tuning, explainability, registry, endpoint mode, drift monitoring, or access control.

Unanswered questions are incorrect and there is no penalty for guessing. With 65 questions in 130 minutes, your average budget is exactly 2 minutes per question; mark long ordering/matching items if they block progress.

Scenario eliminations

Stem clue	Eliminate first	Keep in play
training and inference features disagree	retrain a bigger model only	SageMaker Feature Store, shared feature definitions, validation before serving
messy training data	tune hyperparameters first	profiling, cleaning, transforms, quality checks, leakage/bias review
model performs well offline but fails in production	assume metric choice is enough	train/serve skew, endpoint logs, input drift, latency, dependency behavior
accuracy dropped after launch	redeploy last container blindly	Model Monitor baselines, drift alerts, investigation, rollback or retraining trigger
low-latency steady traffic	batch transform	real-time endpoint with scaling and monitoring
unpredictable traffic with idle gaps	always-on endpoint by default	serverless endpoint if latency/cold-start trade-off fits
large non-interactive inference backlog	real-time endpoint	batch transform or async endpoint, depending on payload and timing
need governed model promotion	overwrite endpoint manually	Model Registry, approval workflow, versioned artifacts, rollback path
training job cannot read encrypted data	S3 policy only	IAM plus KMS key policy/grants and encryption context where relevant
compliance requires private training/inference	public endpoint plus app checks	VPC configuration, private subnets/endpoints, KMS, least privilege, CloudTrail

ML failure diagnosis chain

Use this chain when a question says the model is inaccurate, slow, expensive, unstable, or noncompliant. Do not jump straight to a larger model.

    flowchart LR
	  S["Symptom"] --> D["Data and feature evidence"]
	  D --> M["Metric and model evidence"]
	  M --> P["Pipeline and registry evidence"]
	  P --> I["Inference and scaling evidence"]
	  I --> G["Security, cost, and governance evidence"]
	  G --> A["Action: fix data, tune, roll back, retrain, or right-size"]

MLA-C01 rewards disciplined troubleshooting. The best answer usually names the first evidence source, then chooses the smallest safe fix that preserves versioning, monitoring, and rollback.

Evidence-before-action map

Symptom in the stem	Inspect first	Strong action
offline metric was good but production quality fell	training/serving feature parity, capture data, input distribution, later labels	fix train/serve skew or drift before retuning blindly
validation score is high but real users complain	metric choice, class imbalance, business error cost, sample coverage	choose a better metric and representative evaluation set
training loss is unstable or not converging	Debugger data, learning rate, batch size, gradients, training logs	tune training configuration or data preprocessing from evidence
model endpoint latency increased	p50/p95 latency, invocations per instance, CPU/GPU utilization, container logs	right-size, auto scale, batch, optimize model server, or change endpoint mode
endpoint cost is high during idle periods	traffic shape, utilization, endpoint type, instance size, batch backlog	switch serving mode or scale policy before buying more capacity
new model should be promoted safely	registry metrics, approval state, lineage, canary/shadow results	approve and deploy a versioned artifact with rollback path
data or artifact access fails	execution role, S3 policy, KMS key policy, VPC endpoint, security group	fix the narrow permission or network layer, not broad admin access

Deployment mode chooser

Requirement	Best first fit	Reject when
interactive request with strict latency	real-time endpoint	workload is offline, bursty, or minutes-long
bursty traffic with idle periods	serverless inference if cold start and limits fit	p95 latency must be consistently low
large payloads or long-running inference	asynchronous inference	user needs immediate response
scheduled scoring over many records	batch transform or batch inference	live request/response is required
many models with uneven traffic	multi-model endpoint	model loading latency or isolation requirements are unacceptable
compare model candidates safely	shadow or A/B based on whether users should see candidate output	the test lacks metrics, alarms, and rollback
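
The chooser rows above can be read as an ordered set of rules. The sketch below is one possible encoding, not an official decision procedure: the `Workload` fields and the rule order are assumptions made for illustration, and real scenarios also weigh payload limits, cost, and rollback needs.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of an inference workload (illustrative fields only)."""
    scheduled_bulk: bool = False   # scheduled scoring over many records
    long_or_large: bool = False    # large payloads or minutes-long inference
    many_models: bool = False      # many models with uneven traffic
    idle_periods: bool = False     # bursty traffic with long idle gaps
    strict_latency: bool = False   # p95 latency must stay consistently low


def choose_serving_mode(w: Workload) -> str:
    """Return the 'best first fit' from the chooser table, checked in order."""
    if w.scheduled_bulk:
        return "batch transform"
    if w.long_or_large:
        return "asynchronous inference"
    if w.many_models:
        return "multi-model endpoint"
    if w.idle_periods and not w.strict_latency:
        return "serverless inference"
    # Default: interactive traffic that needs consistently low latency.
    return "real-time endpoint"


print(choose_serving_mode(Workload(idle_periods=True)))
print(choose_serving_mode(Workload(idle_periods=True, strict_latency=True)))
```

The second call shows the "reject when" column in action: idle periods alone suggest serverless, but a strict latency requirement pushes the choice back to a real-time endpoint.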

Training-data and feature traps

Trap	Better MLA-C01 instinct
accuracy is high on imbalanced data	check precision, recall, F1, ROC-AUC, and business cost of false positives/negatives
feature transformation differs between notebook and endpoint	centralize or version the feature pipeline, often with Feature Store for reusable features
labels arrive late after inference	monitor inputs immediately and evaluate model quality when ground truth becomes available
sensitive columns are kept because they improve accuracy	classify, mask, minimize, encrypt, and check policy before training or serving
random split leaks future information	split by time, entity, or business boundary when leakage would inflate validation metrics
retraining happens manually after incidents	automate data validation, training, evaluation, registry approval, deployment, and monitoring
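
To make the leakage row concrete, here is a minimal sketch of a time-based split in plain Python. The row layout (`event_date`, `label`) and the cutoff date are hypothetical; the point is only that no training row is newer than any validation row.

```python
from datetime import date


def time_based_split(rows, cutoff):
    """Put rows before the cutoff in training and rows at or after it in validation."""
    train = [r for r in rows if r["event_date"] < cutoff]
    valid = [r for r in rows if r["event_date"] >= cutoff]
    return train, valid


# Hypothetical labeled events, ordered by time.
rows = [
    {"event_date": date(2025, 1, 5), "label": 0},
    {"event_date": date(2025, 2, 10), "label": 1},
    {"event_date": date(2025, 3, 15), "label": 0},
    {"event_date": date(2025, 4, 20), "label": 1},
]

train, valid = time_based_split(rows, cutoff=date(2025, 3, 1))
print(len(train), len(valid))
```

A random shuffle of the same rows could place the April event in training and the January event in validation, which lets the model "see the future" and inflates validation metrics.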

Final 20-minute recall (exam day)

Cue -> best answer (pattern map)

If the question says…	Usually best answer
Data is messy/inconsistent before training	Data Wrangler/DataBrew + quality checks
Train/serve feature mismatch	SageMaker Feature Store
Need systematic hyperparameter search	SageMaker Automatic Model Tuning
Need fairness/explainability evidence	SageMaker Clarify
Training instability / convergence issues	SageMaker Debugger
Accuracy degraded in production	SageMaker Model Monitor + drift triggers + retraining
Govern model promotion and rollback	SageMaker Model Registry + approval workflow
Constant low-latency traffic	Real-time endpoint
Spiky traffic with long idle periods (cold starts tolerable)	Serverless endpoint
Long-running or non-interactive inference	Async endpoint or batch transform

Must-memorize MLA defaults

Topic	Fast recall
First failure domain	Data quality and leakage before model changes
Metric selection	Match metric to business cost (precision vs recall trade-off)
Drift controls	Baselines, alerts, and versioned retraining pipeline
Cost controls	Right-size, auto scale, pick correct endpoint mode, use Spot where safe
Security baseline	Least-privilege IAM, KMS/TLS, VPC isolation, CloudTrail

Last-minute traps

Chasing model complexity before fixing data quality.
Choosing real-time endpoints for workloads that are actually batch/async.
Treating accuracy as the only metric while ignoring latency/cost/compliance.
Deploying without monitoring baselines and rollback path.

0) SageMaker service map (high yield)

Capability	What it’s for	MLA‑C01 “why it matters”
SageMaker Data Wrangler	Data prep + feature engineering	Fast, repeatable transforms; reduces time-to-first-model
SageMaker Feature Store	Central feature storage	Avoid train/serve skew; feature reuse and governance
SageMaker Training	Managed training jobs	Repeatable, scalable training on AWS compute
SageMaker AMT	Hyperparameter tuning	Systematic search for better model configs
SageMaker Clarify	Bias + explainability	Responsible ML evidence + model understanding
SageMaker Debugger	Training diagnostics	Debug convergence and training instability
SageMaker Model Registry	Versioning + approvals	Auditability, rollback, safe promotion to prod
SageMaker Endpoints	Managed model serving	Real-time/serverless/async inference patterns
SageMaker Model Monitor	Monitoring workflows	Detect drift and quality issues in production
SageMaker Pipelines	ML workflow orchestration	Build-test-train-evaluate-register-deploy automation

SageMaker AI lifecycle chooser

Requirement	Strong first fit	Watch for
Reusable ML workflow	SageMaker Pipelines	Parameters, artifacts, approvals, and reruns
Governed model promotion	Model Registry	Versioned metrics, approval state, rollback path
Training diagnostics	SageMaker Debugger	Training tensors/rules/logs, not production drift
Bias and explainability evidence	SageMaker Clarify	Pretraining and post-training analysis
Train/serve feature consistency	Feature Store	Online/offline feature parity and access controls
Experiment comparison	Experiments and lineage	Metrics, artifacts, reproducibility, and traceability
Production drift detection	Model Monitor	Baselines, capture config, alerts, and retraining path

1) End-to-end ML on AWS (mental model)

    flowchart LR
	  S["Sources"] --> I["Ingest"]
	  I --> T["Transform + Quality Checks"]
	  T --> F["Feature Engineering + Feature Store"]
	  F --> TR["Train + Tune"]
	  TR --> E["Evaluate + Bias/Explainability"]
	  E --> R["Register + Approve"]
	  R --> D["Deploy Endpoint or Batch"]
	  D --> M["Monitor Drift/Quality/Cost/Security"]
	  M -->|Triggers| RT["Retrain"]
	  RT --> TR

High-yield framing: MLA‑C01 is about the pipeline, not just the model.

2) Domain 1 — Data preparation (28%)

“Which tool should I use?” (ETL and prep picker)

You need…	Typical best-fit	Why
Visual data prep + fast iteration	SageMaker Data Wrangler	Interactive + repeatable workflows
No/low-code transforms and profiling	AWS Glue DataBrew	Good for business-friendly prep
Scalable ETL jobs	AWS Glue / Spark	Production batch ETL at scale
Big Spark workloads (custom)	Amazon EMR	More control over Spark
Simple streaming transforms	AWS Lambda	Event-driven, lightweight
Streaming analytics	Amazon Managed Service for Apache Flink	Stateful streaming at scale

Data formats (pickers)

Format	Why it shows up	Typical trade-off
Parquet / ORC	Columnar analytics + efficient reads	Best for large tabular datasets
CSV / JSON	Interop + simplicity	Bigger + slower at scale
Avro	Schema evolution + streaming	Good for pipelines
RecordIO	ML-specific record formats	Useful with some training stacks

Rule: choose formats based on access patterns (scan vs selective reads), schema evolution, and scale.

Data ingestion and storage (high yield)

Amazon S3: default data lake for ML (durable, cheap, scalable).
Amazon EFS / FSx: file-based access patterns; useful when training expects POSIX-like file semantics.
Streaming ingestion: use Amazon Kinesis Data Streams or Amazon MSK when low-latency data arrival matters.

Common best answers:

Use AWS Glue / Spark on EMR for big ETL jobs.
Use SageMaker Data Wrangler for fast interactive prep and repeatable transformations.
Use SageMaker Feature Store to keep training/inference features consistent.

Feature Store: why it matters

Avoid train/serve skew: the feature used in training is the same feature served to inference.
Support feature reuse across teams and models.
Enable governance: feature definitions and versions.
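
The train/serve skew point can be illustrated without any AWS API. The sketch below is plain Python, not the Feature Store SDK: the feature names and input fields are hypothetical, and it only shows the principle that training and serving should call one shared feature definition.

```python
import math


def build_features(raw: dict) -> dict:
    """Single feature definition shared by the training and serving paths."""
    return {
        "log_amount": math.log1p(max(raw["amount"], 0.0)),
        "is_weekend": 1 if raw["day_of_week"] in (5, 6) else 0,
    }


def training_rows(records):
    """Offline path: build features for a batch of historical records."""
    return [build_features(r) for r in records]


def serving_features(request):
    """Online path: build features for one inference request."""
    return build_features(request)


record = {"amount": 100.0, "day_of_week": 6}
print(training_rows([record])[0] == serving_features(record))
```

Skew appears when the notebook and the endpoint each reimplement `build_features` slightly differently. A managed feature store addresses the same problem at team scale by storing the computed features once and serving them to both paths.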

Data integrity + bias basics (often tested)

Problem	What to do	Tooling you might name
Missing/invalid data	Add data quality checks + fail fast	Glue DataBrew / Glue Data Quality
Class imbalance	Resampling or synthetic data	Conceptual technique; Clarify to measure imbalance
Bias sources	Identify selection/measurement bias	SageMaker Clarify (bias analysis)
Sensitive data	Classify + mask/anonymize + encrypt	KMS + access controls
Compliance constraints	Data residency + least privilege + audit logs	IAM + CloudTrail + region choices

High-yield rule: don’t “fix” model issues before you verify data quality and leakage.

3) Domain 2 — Model development (26%)

Choosing an approach

If you need…	Typical best-fit
A standard AI capability with minimal ML ops	AWS AI services (Translate/Transcribe/Rekognition, etc.)
A custom model with managed training + deployment	Amazon SageMaker
A foundation model / generative capability	Amazon Bedrock (when applicable)

Rule: don’t overbuild. If an AWS managed AI service solves it, it usually wins on time-to-value and ops.

Training and tuning (high yield)

Training loop terms: epoch, step, batch size.
Speedups: early stopping, distributed training.
Generalization controls: regularization (L1/L2, dropout, weight decay) + better data/features.
Hyperparameter tuning: random search vs Bayesian optimization; in SageMaker, use Automatic Model Tuning (AMT).

Metrics picker (what to choose)

Task	Common metrics	What the exam tries to trick you on
Classification	Accuracy, precision, recall, F1, ROC-AUC	Class imbalance makes accuracy misleading
Regression	MAE/RMSE	Outliers and error cost (what matters more?)
Model selection	Metric + cost/latency	“Best” isn’t only accuracy
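
The class-imbalance trap in the first row is easy to see with a small worked example. This is a minimal pure-Python sketch with made-up labels: 95 negatives, 5 positives, and a model that always predicts the negative class. Precision is reported as 0.0 when there are no positive predictions, which is a convention chosen here.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical imbalanced data: the model never predicts the positive class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print(classification_metrics(y_true, y_pred))
```

Accuracy comes out at 0.95 while recall and F1 are 0.0, so the model misses every positive case despite looking accurate.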

Overfitting vs underfitting (signals)

Symptom	Likely issue	Typical fix
Training score high, validation score low	Overfitting	Regularization, simpler model, more data, better features
Both low	Underfitting	More expressive model, better features, tune hyperparameters

Clarify vs Debugger vs Model Monitor (common confusion)

Tool	What it helps with	When to name it
SageMaker Clarify	Bias + explainability	Fairness questions, “why did it predict X?”
SageMaker Debugger	Training diagnostics + convergence	Training instability, loss not decreasing, debugging training
SageMaker Model Monitor	Production monitoring workflows	Drift, data quality degradation, monitoring baselines

Model Registry (repeatability + governance)

Track: model artifacts, metrics, lineage, approvals.
Enables safe promotion/rollback and audit-ready workflows.

4) Domain 3 — Deployment and orchestration (22%)

Endpoint types (must-know picker)

Endpoint type	Best for	Typical constraint
Real-time	Steady, low-latency inference	Cost for always-on capacity
Serverless	Spiky traffic, scale-to-zero	Cold starts + limits
Asynchronous	Long inference time, bursty workloads	Event-style patterns + polling/callback
Batch inference	Scheduled/offline scoring	Not interactive

Deployment mode traps

Scenario wording	Better answer	Why the distractor fails
“User waits for an immediate response”	Real-time endpoint	Batch and async add workflow delay.
“Traffic is intermittent and idle most of the day”	Serverless endpoint if cold starts fit	Always-on real-time capacity can waste cost.
“Payloads are large or inference takes minutes”	Async inference	Real-time endpoints are built for lower-latency request/response.
“Millions of records scored overnight”	Batch transform or batch inference	Live endpoints add unnecessary serving cost.
“Many related models with uneven traffic”	Multi-model endpoint	One endpoint per model can overprovision idle capacity.
“Compare a candidate model without affecting users”	Shadow test	A/B sends user traffic to both variants intentionally.
“Shift a small percentage of production traffic”	Canary or A/B	Shadow is observation, not direct user-serving comparison.

Scaling metrics (what to pick)

Metric	Good when…	Watch out
Invocations per instance	Request volume drives load	Spiky traffic can cause oscillation
Latency	You have a latency SLO	Noisy metrics require smoothing
CPU/GPU utilization	Compute bound models	Not always correlated to request rate
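
As a sketch of the "invocations per instance" row, the helper below builds the two request payloads that a target-tracking setup typically needs. It only constructs dictionaries; the intent is that they would be passed to the Application Auto Scaling client in boto3 (`register_scalable_target` and `put_scaling_policy`). The endpoint name, variant name, target value, capacity bounds, and cooldowns are all hypothetical values chosen for illustration.

```python
def target_tracking_config(endpoint_name, variant_name, target_invocations,
                           min_instances=1, max_instances=4):
    """Build scalable-target and scaling-policy payloads for an endpoint variant."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"

    scalable_target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }
    scaling_policy = {
        "PolicyName": f"{endpoint_name}-invocations-target",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_invocations),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            # Cooldowns are assumed values; longer scale-in damps oscillation.
            "ScaleInCooldown": 300,
            "ScaleOutCooldown": 60,
        },
    }
    return scalable_target, scaling_policy


target, policy = target_tracking_config("demo-endpoint", "AllTraffic", 100)
print(target["ResourceId"])
```

The asymmetric cooldowns reflect the "watch out" column: scaling out quickly protects latency, while scaling in slowly reduces flapping under spiky traffic.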

Deployment evidence map

Symptom	Evidence to inspect	Likely fix pattern
p95 latency rises under load	Endpoint metrics, model/container logs, invocation rate	Tune auto scaling, instance type, batching, or model server config.
Errors spike after a model update	Endpoint logs, deployment event history, registry version	Roll back, compare artifacts, and add promotion gates.
Cold starts hurt users	Serverless endpoint metrics and latency distribution	Use provisioned concurrency or a real-time endpoint when latency is strict.
GPU instance is expensive but underused	GPU utilization, request batching, model size	Right-size, batch requests, or choose a CPU/accelerated alternative.
Endpoint cannot read model artifact	Execution role, S3 policy, KMS key policy	Align IAM and KMS permissions for the serving role.
Private endpoint cannot reach dependencies	VPC route tables, security groups, VPC endpoints	Add private service access or adjust network isolation design.

Multi-model / multi-container (why they exist)

Multi-model: multiple models behind one endpoint to reduce cost.
Multi-container: pre/post-processing plus model serving, or multiple frameworks.

IaC + containers (exam patterns)

IaC: CloudFormation or CDK for reproducible environments.
Containers: build/publish to ECR, deploy via SageMaker, ECS, or EKS.

CI/CD for ML (what’s different)

You version and validate more than code:

Code + data + features + model artifacts + evaluation reports
Promotion gates: accuracy thresholds, bias checks, smoke tests, canary/shadow validation

Typical services: CodePipeline/CodeBuild/CodeDeploy, SageMaker Pipelines, EventBridge triggers.

    flowchart LR
	  G["Git push"] --> CP["CodePipeline"]
	  CP --> CB["CodeBuild: tests + build"]
	  CB --> P["SageMaker Pipeline: process/train/eval"]
	  P --> Gate{"Meets<br/>thresholds?"}
	  Gate -->|yes| MR["Model Registry approve"]
	  Gate -->|no| Stop["Stop + report"]
	  MR --> Dep["Deploy (canary/shadow)"]
	  Dep --> Mon["Monitor + rollback triggers"]
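
The "Meets thresholds?" gate in the flow above can be sketched as a small function that a pipeline step might run after evaluation. The metric names (`f1`, `bias_gap`, `p95_latency_ms`) and threshold values are hypothetical; a real gate would read them from the evaluation report and the team's own policy.

```python
def promotion_gate(metrics, thresholds):
    """Compare candidate metrics to thresholds and list every failed check."""
    failures = []
    if metrics["f1"] < thresholds["min_f1"]:
        failures.append(f"f1 {metrics['f1']:.3f} is below {thresholds['min_f1']:.3f}")
    if metrics["bias_gap"] > thresholds["max_bias_gap"]:
        failures.append(f"bias_gap {metrics['bias_gap']:.3f} exceeds {thresholds['max_bias_gap']:.3f}")
    if metrics["p95_latency_ms"] > thresholds["max_p95_latency_ms"]:
        failures.append(
            f"p95_latency_ms {metrics['p95_latency_ms']} exceeds {thresholds['max_p95_latency_ms']}"
        )
    # Approve only when no check failed; otherwise stop and report the reasons.
    return {"approved": not failures, "failures": failures}


thresholds = {"min_f1": 0.80, "max_bias_gap": 0.10, "max_p95_latency_ms": 200}
candidate = {"f1": 0.70, "bias_gap": 0.04, "p95_latency_ms": 250}
print(promotion_gate(candidate, thresholds))
```

Returning the list of failures, rather than a bare boolean, supports the "Stop + report" branch: the evidence for a rejection is kept alongside the decision.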

5) Domain 4 — Monitoring, cost, and security (24%)

Monitoring and drift (high yield)

Data drift: input distribution changed.
Concept drift: relationship between input and label changed.
Use baselines + ongoing checks; monitor latency/errors too.

Common services/patterns:

SageMaker Model Monitor for monitoring workflows.
A/B testing or shadow deployments for safe comparison.

Monitoring checklist (what to instrument)

Inference quality: when ground truth is available later, compare predicted vs actual.
Data quality: nulls, ranges, schema changes, category explosion.
Distribution shift: feature histograms/summary stats vs baseline.
Ops signals: p50/p95 latency, error rate, throttles, timeouts.
Safety/security: anomalous traffic spikes, abuse patterns, permission failures.
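
One way to picture the "distribution shift" item is a Population Stability Index (PSI) between a baseline histogram and a current one. This is an illustrative technique swapped in for explanation, not a description of how SageMaker Model Monitor computes drift. The bin counts are made up, and the 0.2 alert level is only a commonly cited rule of thumb, treated here as an assumption.

```python
import math


def psi(baseline_counts, current_counts, eps=1e-6):
    """Population Stability Index for two histograms that share the same bins."""
    b_total = sum(baseline_counts)
    c_total = sum(current_counts)
    score = 0.0
    for b, c in zip(baseline_counts, current_counts):
        # Floor each fraction at eps so empty bins do not cause log(0).
        b_frac = max(b / b_total, eps)
        c_frac = max(c / c_total, eps)
        score += (c_frac - b_frac) * math.log(c_frac / b_frac)
    return score


baseline = [25, 25, 25, 25]   # hypothetical feature histogram at training time
current = [10, 20, 30, 40]    # hypothetical histogram from captured traffic

drift = psi(baseline, current)
ALERT_LEVEL = 0.2  # assumed threshold, not an AWS default
print(round(drift, 3), drift > ALERT_LEVEL)
```

Identical histograms score exactly 0.0, and every term in the sum is non-negative, so the score grows as the current distribution moves away from the baseline. That is why a stored baseline is a precondition for any drift alert.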

Drift and retraining gates

Gate	What it proves	Exam trap
Data-quality baseline	Inputs still match expected schema, ranges, and null rules	Alerting on drift without knowing the original baseline.
Model-quality check	Predictions still match later ground truth	Declaring success before labels arrive.
Bias/explainability review	Important groups and features remain acceptable	Monitoring accuracy while ignoring fairness or explainability constraints.
Feature-attribution comparison	Drivers of predictions have not changed unexpectedly	Treating every accuracy drop as only a hyperparameter problem.
Retraining pipeline	A validated path exists from new data to approved model	Manual retraining with no repeatability or rollback.
Registry approval	A promoted model has versioned evidence	Replacing production artifacts without audit history.

Infra + cost optimization (high yield)

Theme	What to do
Observability	CloudWatch metrics/logs/alarms; Logs Insights; X-Ray for traces
Rightsizing	Pick instance family/size based on perf; use Inference Recommender + Compute Optimizer
Spend control	Tags + Cost Explorer + Budgets + Trusted Advisor
Purchasing options	Managed Spot Training, SageMaker Savings Plans, or Reserved Instances for EC2-based workloads, where the workload fits

Cost levers (common “best answer” patterns)

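Choose the right inference mode first. As a rough ordering from least to most always-on cost: batch (no persistent endpoint) → async (can scale in to zero when idle) → serverless (pay per use) → real-time (always-on capacity).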
Right-size and auto scale; don’t leave endpoints overprovisioned.
Use Spot for fault-tolerant training/batch where interruptions are acceptable.
Use Budgets + tags early (before the bills surprise you).

Security defaults (high yield)

Least privilege IAM for training jobs, pipelines, and endpoints.
Encrypt at rest + in transit (KMS + TLS).
VPC isolation (subnets + security groups) for ML resources when required.
Audit trails (CloudTrail) + controlled access to logs and artifacts.

Security and isolation chooser

Requirement	Choose	Include in the answer
Private training or inference	VPC configuration	Private subnets, security groups, VPC endpoints, no unnecessary public internet path
Encrypted datasets and artifacts	KMS-backed encryption	IAM permissions plus KMS key policy/grants
Controlled model promotion	Model Registry and audit logs	Approval state, CloudTrail/EventBridge events, versioned artifacts
Sensitive notebooks or Studio access	Role-based access	Least privilege, scoped data access, secret handling, log hygiene
Cross-account model or data access	Explicit resource policies	Trust policy, S3/KMS permissions, and narrow prefixes
Compliance review	Centralized logs and retention	CloudTrail, CloudWatch Logs, S3 access logs where appropriate

Common IAM/security “gotchas”

Training role can read from S3 but cannot decrypt with the KMS key (the IAM policy and the KMS key policy do not both allow it).
Endpoint role has broad S3 access (“*”) instead of a tight prefix.
Secrets leak into logs/artifacts (build logs, notebooks, environment variables).
No audit trail for model registry approvals or endpoint updates.
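
The first two gotchas can be sketched as an identity policy that scopes S3 access to one prefix and grants decrypt on one key. The bucket, prefix, and key ARN below are placeholders, and this covers only the IAM side: the KMS key policy must separately allow the role, which is exactly the mismatch described above.

```python
import json


def training_role_policy(bucket, prefix, kms_key_arn):
    """Build a narrowly scoped IAM policy document for a training role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadTrainingPrefix",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {
                "Sid": "ListTrainingPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "DecryptTrainingData",
                "Effect": "Allow",
                "Action": ["kms:Decrypt"],
                "Resource": kms_key_arn,
            },
        ],
    }


policy = training_role_policy(
    bucket="example-ml-bucket",                     # placeholder bucket name
    prefix="training/v1",                           # placeholder prefix
    kms_key_arn="arn:aws:kms:us-east-1:111122223333:key/example-key-id",  # placeholder ARN
)
print(json.dumps(policy, indent=2))
```

No statement uses `"*"` as its resource, which is the property an exam answer about least privilege is usually looking for.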

Common production traps

Trap answer	Better MLA-C01 answer
“Retrain a larger model” when features changed	Validate feature pipeline, train/serve skew, and input distribution first.
“Use real-time endpoint” for every inference workload	Match serving mode to latency, payload size, traffic shape, and cost.
“Add Model Monitor” with no baseline or response plan	Create baseline, alert, investigate, retrain, approve, and redeploy.
“Use Feature Store” without governance	Define online/offline parity, ownership, access, and update process.
“Tune hyperparameters” for biased or incomplete data	Fix sampling, labeling, bias, and leakage before tuning.
“Encrypt with KMS” but omit permissions	Include both IAM and KMS key policy alignment.

Next steps

Use Resources to stay anchored to the official exam guide and SageMaker docs.
Use the FAQ to confirm expected depth and where the exam is more engineering than data science.
Turn weak deployment, monitoring, and security rows into timed scenario drills.

Revised on Monday, June 15, 2026
