Study MLA-C01 Data Ingestion, Storage, Formats and Feature Store: key concepts, common traps, and exam decision cues.
This lesson covers the first ML engineering decision AWS tests repeatedly: how the data gets into the workflow, where it lives, and in what format it should be stored. MLA-C01 expects you to match the data source and storage pattern to the access path, scale, and ML workflow requirement.
Columnar format: Storage format such as Parquet or ORC that is optimized for reading selected columns efficiently.
Streaming source: Data source that continuously emits events or records instead of arriving only in static batches.
Feature Store: Managed system for storing and serving engineered features consistently across training and inference.
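AWS wants you to distinguish how the data arrives from whether its engineered features must be reused online and offline. The flowchart below asks those two questions in order: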
flowchart LR
A["Incoming data"] --> B{"Arrives continuously?"}
B -->|Yes| C["Streaming ingestion path"]
B -->|No| D["Batch landing path"]
C --> E{"Need reusable online/offline features?"}
D --> E
E -->|Yes| F["Feature Store and curated feature path"]
E -->|No| G["Raw and curated storage chosen for training and analytics access"]
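The same two questions can be written as a small function. This is only a sketch of the flowchart's logic: `PathDecision` and `choose_path` are hypothetical names introduced for illustration, and the returned strings are the flowchart's own labels rather than AWS service names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathDecision:
    """Result of the two flowchart questions (hypothetical helper type)."""
    ingestion: str
    storage: str


def choose_path(arrives_continuously: bool, needs_reusable_features: bool) -> PathDecision:
    """Answer the ingestion question first, then the feature-reuse question."""
    ingestion = (
        "Streaming ingestion path" if arrives_continuously else "Batch landing path"
    )
    storage = (
        "Feature Store and curated feature path"
        if needs_reusable_features
        else "Raw and curated storage chosen for training and analytics access"
    )
    return PathDecision(ingestion=ingestion, storage=storage)


if __name__ == "__main__":
    # Clickstream that also feeds online inference: both answers are "yes".
    print(choose_path(arrives_continuously=True, needs_reusable_features=True))
```

Note that the two answers are independent: a batch source can still need a Feature Store, and a streaming source can still land in plain curated storage.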
The exam usually punishes answers that jump to a service before deciding whether the real issue is ingestion mode, format efficiency, or feature reuse.
| If the problem is mainly about… | Strong lane |
|---|---|
| repeated analytics-style reads over structured fields | columnar formats and efficient storage |
| low-latency streaming ingestion | Amazon Kinesis, Apache Kafka (for example on Amazon MSK), or a related streaming path |
| reusable engineered features across training and inference | Feature Store |
| cost, performance, and scale trade-offs for the raw dataset | initial storage choice (for example Amazon S3, Amazon EFS, or Amazon FSx) |
| online and offline consistency for the same engineered features | Feature Store plus disciplined feature definitions |
| Decision | Main question |
|---|---|
| Storage service | Where should the data live for durability, scale, and access pattern? |
| File or record format | How should the data be organized for efficient reads and downstream processing? |
| Ingestion path | How does the data arrive: continuous stream or periodic batch? |
| Feature-serving path | How do engineered features stay reusable and consistent across workflows? |
It is common for an answer to mention both storage and format, but AWS still expects you to know which one is doing the real work in the stem.
| Symptom | What is usually going wrong | Fix first |
|---|---|---|
| every storage answer sounds plausible | you are not mapping storage to the actual access pattern | ask whether the data is mostly batch training input, streaming events, or reusable features |
| file-format distractors keep winning | you are treating format as a generic implementation detail | decide whether the problem is scan efficiency, selected-column access, or raw object durability |
| Feature Store questions feel optional | you are not thinking about training-serving consistency | ask whether the same engineered features must exist in both offline and online workflows |
| streaming answers seem overused | you are not separating continuous ingestion from scheduled data refresh | decide whether the source is event-driven or periodic first |
| Trap | Better reading |
|---|---|
| “S3 answers every storage question.” | S3 may be the raw landing zone, but the stem may really be about format, feature reuse, or streaming shape. |
| “Feature Store is just another storage bucket.” | Feature Store is about feature consistency and serving, not generic raw-object storage. |
| “Streaming is always better because it is more modern.” | Streaming is stronger only when the source emits data continuously and the workload actually needs low-latency ingestion. |
| “Any format works if the compute is large enough.” | MLA-C01 often rewards the format that reduces unnecessary read cost and processing effort. |
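The last trap can be made concrete with a toy comparison of row-oriented and column-oriented layouts. This is a rough sketch under simplifying assumptions: it counts field values touched rather than bytes, uses plain Python structures instead of real Parquet or ORC files, and ignores compression and encoding. The dataset and helper names are invented for illustration.

```python
# Toy dataset: each record has four fields, but the query needs only "clicks".
rows = [
    {"user_id": 1, "country": "DE", "clicks": 12, "bio": "x" * 200},
    {"user_id": 2, "country": "US", "clicks": 7, "bio": "y" * 200},
    {"user_id": 3, "country": "US", "clicks": 30, "bio": "z" * 200},
]


def to_columns(records):
    """Pivot row-oriented records into one list per field (columnar layout)."""
    return {name: [r[name] for r in records] for name in records[0]}


def values_touched_row_layout(records):
    """A row layout reads every field of every record, whatever column is wanted."""
    return sum(len(r) for r in records)


def values_touched_column_layout(columns, wanted):
    """A columnar layout reads only the values of the requested column."""
    return len(columns[wanted])


columns = to_columns(rows)
total_clicks = sum(columns["clicks"])

if __name__ == "__main__":
    print("row layout touches:", values_touched_row_layout(rows))
    print("column layout touches:", values_touched_column_layout(columns, "clicks"))
    print("total clicks:", total_clicks)
```

Larger compute would finish the row-oriented scan faster, but it would still read the wide `bio` field that the query never uses. That wasted read is what the format choice removes.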
A team receives clickstream events continuously, trains recommendation models in scheduled jobs, and serves a small set of engineered features during online inference. They currently rebuild the same feature logic separately for training and inference, causing skew and inconsistent results.
The strongest first answer is to introduce a Feature Store so that one feature definition feeds both training and inference, while keeping streaming ingestion for the clickstream and a raw landing zone for the events. The real problem is not storage durability. It is training-serving inconsistency.
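The skew in this scenario comes from implementing the same feature logic twice. A minimal sketch of the alternative is shown here: one feature function writes to a store that exposes both an offline history and an online latest-value view. `ToyFeatureStore`, `click_features`, and the event fields are hypothetical and in-memory only. This is not the Amazon SageMaker Feature Store API.

```python
from datetime import datetime, timezone


def click_features(events):
    """Single feature definition shared by training and inference (hypothetical)."""
    clicks = [e for e in events if e["type"] == "click"]
    return {
        "click_count": len(clicks),
        "distinct_items": len({e["item_id"] for e in clicks}),
    }


class ToyFeatureStore:
    """In-memory stand-in: append-only offline log plus latest-value online view."""

    def __init__(self):
        self.offline = []  # full history, read by scheduled training jobs
        self.online = {}   # latest record per user, read at inference time

    def write(self, user_id, features, event_time):
        record = {"user_id": user_id, "event_time": event_time, **features}
        self.offline.append(record)
        current = self.online.get(user_id)
        # Keep only the most recent record per user in the online view.
        if current is None or event_time >= current["event_time"]:
            self.online[user_id] = record

    def read_online(self, user_id):
        return self.online[user_id]


store = ToyFeatureStore()
events = [
    {"type": "click", "item_id": "a"},
    {"type": "view", "item_id": "b"},
    {"type": "click", "item_id": "a"},
    {"type": "click", "item_id": "c"},
]
store.write("u1", click_features(events), datetime(2024, 1, 1, tzinfo=timezone.utc))

training_row = store.offline[-1]        # what a training job would read
serving_row = store.read_online("u1")   # what online inference would read
```

Because both rows come from a single call to `click_features`, the training job and the inference path cannot drift apart the way two separate implementations can.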