AWS MLA-C01 data prep guide covering ingestion, feature engineering, labeling, quality, bias, and readiness decisions.
This chapter covers the MLA-C01 domain that tests whether you can get data ready for ML in a way that is technically sound and operationally realistic. AWS expects you to know how data is ingested, stored, transformed, labeled, validated, and prepared for training without creating hidden integrity or compliance problems.
AWS currently weights Data Preparation for Machine Learning at 28% of scored content.
This domain is not just about moving data into S3. It is testing whether you can:

| Lesson | Focus |
|---|---|
| 1.1 Ingestion & Feature Store | Learn how AWS expects ML engineers to choose storage, formats, ingestion paths, and feature-serving foundations. |
| 1.2 Features & Labeling | Learn how data is cleaned, transformed, encoded, labeled, and turned into useful features. |
| 1.3 Quality, Bias & Readiness | Learn how validation, bias checks, encryption, masking, and model-input readiness shape the final training dataset. |

| If the question is really about… | Go first to… |
|---|---|
| S3, EFS, FSx, Kinesis, Kafka, file formats, feature store, or ingestion bottlenecks | 1.1 Data Ingestion, Storage, Formats & Feature Store |
| Data Wrangler, Glue, DataBrew, Spark, encoding, feature creation, or labeling | 1.2 Transformations, Feature Engineering & Labeling |
| Data quality, Clarify, bias mitigation, PII, PHI, masking, or training-data loading choices | 1.3 Data Quality, Bias, Compliance & Modeling Readiness |
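As a concrete anchor for the bias-check row above, here is a minimal sketch of a class imbalance calculation, one of the pre-training bias metrics that SageMaker Clarify reports. It assumes the commonly documented definition, (n_a - n_d) / (n_a + n_d), where n_a and n_d count rows in the advantaged and disadvantaged facet groups. The function name and the sample data are hypothetical and exist only for illustration; this is not the Clarify API.

```python
from collections import Counter


def class_imbalance(facet_values, advantaged):
    """Return (n_a - n_d) / (n_a + n_d) for one facet column.

    n_a counts rows whose facet value equals `advantaged`;
    n_d counts every other row. The result lies in [-1, 1],
    and values near 0 indicate a balanced facet.
    """
    counts = Counter(facet_values)
    n_a = counts[advantaged]
    n_d = sum(counts.values()) - n_a
    total = n_a + n_d
    if total == 0:
        raise ValueError("facet column is empty")
    return (n_a - n_d) / total


# Hypothetical facet column: 8 rows in group "a", 2 rows in group "b".
sample_facet = ["a"] * 8 + ["b"] * 2
ci = class_imbalance(sample_facet, advantaged="a")
```

A value far from zero suggests one group dominates the training set, which is a data-preparation finding to address before any model is chosen.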

| Symptom | What is usually going wrong | Fix first |
|---|---|---|
| every storage answer seems valid | you are not mapping format and storage to the access pattern | rework 1.1 and classify whether the problem is ingestion, serving, training throughput, or feature reuse |
| feature engineering questions feel hand-wavy | you are not separating raw cleanup from model-useful transformation | rework 1.2 and track what changes the information content versus what only cleans the pipeline |
| compliance and quality stems feel like policy trivia | you are not treating bad data as a model-readiness problem | rework 1.3 and tie every control to training quality, fairness, or safe use |
| you keep optimizing model choice before dataset quality | you are skipping the upstream failure point | stay in the data-prep lane until the training set is trustworthy and usable |
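The distinction in the feature engineering row above, between steps that only clean the pipeline and steps that change what the model can learn, can be sketched in a few lines. This is an illustrative example using only the Python standard library; the record fields, function names, and vocabulary are hypothetical and do not come from any AWS service.

```python
import math


def clean_record(raw):
    """Pipeline cleanup: fix types and whitespace without adding information."""
    return {
        "city": raw["city"].strip().lower(),
        "amount": float(raw["amount"]),
    }


def engineer_features(record, city_vocab):
    """Model-useful transforms: log-scale a skewed value, one-hot encode a category."""
    features = {"log_amount": math.log1p(record["amount"])}
    for city in city_vocab:
        features[f"city={city}"] = 1.0 if record["city"] == city else 0.0
    return features


# Hypothetical raw row as it might arrive from an ingestion job.
raw = {"city": "  Seattle ", "amount": "120.50"}
cleaned = clean_record(raw)
features = engineer_features(cleaned, city_vocab=["austin", "seattle"])
```

Keeping the two stages in separate functions makes it easier to see which step a question is really asking about: `clean_record` leaves the information content unchanged, while `engineer_features` reshapes it for the model.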

Make sure you can explain:

Then move to 2. Model Dev, where AWS assumes the data pipeline is good enough and starts testing model-family and evaluation judgment.