MLA-C01 Data Ingestion, Storage, Formats and Feature Store Guide

Study MLA-C01 Data Ingestion, Storage, Formats and Feature Store: key concepts, common traps, and exam decision cues.

This lesson covers the first ML engineering decision AWS tests repeatedly: how the data gets into the workflow, where it lives, and in what format it should be stored. MLA-C01 expects you to match the data source and storage pattern to the access path, scale, and ML workflow requirement.

Columnar format: Storage format such as Parquet or ORC that is optimized for reading selected columns efficiently.

Streaming source: Data source that continuously emits events or records instead of arriving only in static batches.

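Feature Store: Managed system for storing and serving engineered features consistently across training and inference. On AWS this is Amazon SageMaker Feature Store, which provides an online store for low-latency lookups and an offline store for training and batch use.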

What AWS is really testing here

AWS wants you to distinguish:

  • batch ingestion from streaming ingestion
  • file format choice from storage service choice
  • initial storage decisions from later feature transformation
  • raw data landing zones from curated feature-serving paths
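The first distinction in the list above, batch versus streaming ingestion, can be illustrated with a minimal sketch. The function names and the `value` field are hypothetical and no AWS service is involved; the only point is that a batch job sees the whole dataset up front, while a streaming consumer has to produce results incrementally as records arrive.

```python
from typing import Iterable, Iterator


def ingest_batch(records: list[dict]) -> float:
    """Batch path: the full dataset is available before processing starts."""
    return sum(r["value"] for r in records) / len(records)


def ingest_stream(events: Iterable[dict]) -> Iterator[float]:
    """Streaming path: update a running mean as each event arrives."""
    count = 0
    total = 0.0
    for event in events:
        count += 1
        total += event["value"]
        yield total / count


# Hypothetical records used for both paths.
records = [{"value": 2.0}, {"value": 4.0}, {"value": 6.0}]

batch_mean = ingest_batch(records)                   # one answer, after all data landed
running_means = list(ingest_stream(iter(records)))   # one answer per arriving event
```

Both paths end at the same final mean, but only the streaming path has a usable answer before the last record arrives. That freshness is what justifies the extra machinery.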

Start with the access pattern, not the product name

    flowchart LR
      A["Incoming data"] --> B{"Arrives continuously?"}
      B -->|Yes| C["Streaming ingestion path"]
      B -->|No| D["Batch landing path"]
      C --> E{"Need reusable online/offline features?"}
      D --> E
      E -->|Yes| F["Feature Store and curated feature path"]
      E -->|No| G["Raw and curated storage chosen for training and analytics access"]

The exam usually punishes answers that jump to a service before deciding whether the real issue is ingestion mode, format efficiency, or feature reuse.
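The flowchart above can be read as two independent questions, which the following sketch encodes directly. The function name `choose_path` and its two boolean parameters are illustrative assumptions, not part of any AWS API; the returned labels are taken from the flowchart nodes.

```python
def choose_path(arrives_continuously: bool, needs_reusable_features: bool) -> tuple[str, str]:
    """Return (ingestion path, destination) following the flowchart's two decisions."""
    # Decision 1: arrival pattern determines the ingestion path.
    if arrives_continuously:
        ingestion = "Streaming ingestion path"
    else:
        ingestion = "Batch landing path"

    # Decision 2: feature reuse determines the destination, whatever the ingestion path.
    if needs_reusable_features:
        destination = "Feature Store and curated feature path"
    else:
        destination = "Raw and curated storage chosen for training and analytics access"

    return ingestion, destination


clickstream = choose_path(arrives_continuously=True, needs_reusable_features=True)
nightly_export = choose_path(arrives_continuously=False, needs_reusable_features=False)
```

Keeping the two decisions separate mirrors the flowchart: the ingestion mode never determines on its own whether a Feature Store is needed.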

Strongest-first chooser

| If the problem is mainly about… | Strong lane |
| --- | --- |
| repeated analytics-style reads over structured fields | columnar formats and efficient storage |
| low-latency streaming ingestion | Kinesis, Kafka, or related streaming path |
| reusable engineered features across training and inference | Feature Store |
| cost, performance, and scale trade-offs for the raw dataset | initial storage architecture |
| online and offline consistency for the same engineered features | Feature Store plus disciplined feature definitions |

Storage choice and format choice are different decisions

| Decision | Main question |
| --- | --- |
| Storage service | Where should the data live for durability, scale, and access pattern? |
| File or record format | How should the data be organized for efficient reads and downstream processing? |
| Ingestion path | How does the data arrive: continuous stream or periodic batch? |
| Feature-serving path | How do engineered features stay reusable and consistent across workflows? |

It is common for an answer to mention both storage and format, but AWS still expects you to know which one is doing the real work in the stem.
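To see why format is a separate decision from storage, the toy sketch below keeps the same records in a row-oriented layout and a column-oriented layout and counts how many values each layout must touch to answer a single-column query. This is only an in-memory model with made-up field names; real Parquet or ORC files add compression, row groups, and metadata that are not modeled here.

```python
# Hypothetical records; the same data could sit in any storage service.
rows = [
    {"user_id": 1, "country": "CA", "clicks": 12, "spend": 3.5},
    {"user_id": 2, "country": "US", "clicks": 7, "spend": 0.0},
    {"user_id": 3, "country": "CA", "clicks": 30, "spend": 9.25},
]

# Column-oriented layout: one list per field instead of one dict per record.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def read_from_rows(rows, wanted):
    """Row layout: every field of every record is read, then most are discarded."""
    touched = 0
    result = {name: [] for name in wanted}
    for row in rows:
        touched += len(row)
        for name in wanted:
            result[name].append(row[name])
    return result, touched


def read_from_columns(columns, wanted):
    """Column layout: only the requested columns are read."""
    result = {name: columns[name] for name in wanted}
    touched = sum(len(values) for values in result.values())
    return result, touched


row_result, row_touched = read_from_rows(rows, ["clicks"])
col_result, col_touched = read_from_columns(columns, ["clicks"])
```

Both reads return the same `clicks` values, but the row layout touches all four fields of each record while the column layout touches only one. The gap widens as tables get wider, which is the case where a stem is really about format rather than about the storage service.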

If you keep missing questions in this lesson

| Symptom | What is usually going wrong | Fix first |
| --- | --- | --- |
| every storage answer sounds plausible | you are not mapping storage to the actual access pattern | ask whether the data is mostly batch training input, streaming events, or reusable features |
| file-format distractors keep winning | you are treating format as a generic implementation detail | decide whether the problem is scan efficiency, selected-column access, or raw object durability |
| Feature Store questions feel optional | you are not thinking about training-serving consistency | ask whether the same engineered features must exist in both offline and online workflows |
| streaming answers seem overused | you are not separating continuous ingestion from scheduled data refresh | decide whether the source is event-driven or periodic first |

Common traps

| Trap | Better reading |
| --- | --- |
| “S3 answers every storage question.” | S3 may be the raw landing zone, but the stem may really be about format, feature reuse, or streaming shape. |
| “Feature Store is just another storage bucket.” | Feature Store is about feature consistency and serving, not generic raw-object storage. |
| “Streaming is always better because it is more modern.” | Streaming is stronger only when the source emits data continuously and the workload actually needs low-latency freshness. |
| “Any format works if the compute is large enough.” | MLA-C01 often rewards the format that reduces unnecessary read cost and processing effort. |

Harder scenario

A team receives clickstream events continuously, trains recommendation models in scheduled jobs, and serves a small set of engineered features during online inference. They currently rebuild the same feature logic separately for training and inference, causing skew and inconsistent results.

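The strongest first answer is to introduce a Feature Store-centered path while keeping an appropriate raw landing zone and streaming ingestion pattern. The real problem is not just storage durability; it is training-serving inconsistency, which is addressed by defining each feature once and serving that same definition to both the scheduled training jobs and online inference.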
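A minimal sketch of the idea behind this scenario, assuming a hypothetical in-memory stand-in rather than the actual SageMaker Feature Store API: the feature logic lives in one function, and both the offline (training) view and the online (inference) view are populated from it, so the two cannot drift apart. The class name, event shape, and feature names are all illustrative assumptions.

```python
def compute_features(events: list[dict]) -> dict:
    """Single feature definition shared by training and inference."""
    clicks = sum(1 for e in events if e["type"] == "click")
    views = sum(1 for e in events if e["type"] == "view")
    return {
        "click_count": clicks,
        "click_through_rate": clicks / views if views else 0.0,
    }


class InMemoryFeatureStore:
    """Hypothetical stand-in: an offline history plus an online latest-value view."""

    def __init__(self) -> None:
        self.offline: list[dict] = []       # append-only records for training jobs
        self.online: dict[str, dict] = {}   # latest features per user for inference

    def ingest(self, user_id: str, events: list[dict]) -> None:
        # Features are computed once, then written to both views.
        features = compute_features(events)
        self.offline.append({"user_id": user_id, **features})
        self.online[user_id] = features

    def training_rows(self) -> list[dict]:
        return list(self.offline)

    def get_online(self, user_id: str) -> dict:
        return self.online[user_id]


store = InMemoryFeatureStore()
store.ingest("u1", [{"type": "view"}, {"type": "view"}, {"type": "click"}])

training_row = store.training_rows()[0]   # what a scheduled training job would read
online_row = store.get_online("u1")       # what online inference would read
```

Because both views are written from the same `compute_features` call, the values seen at training time and at inference time match by construction. The team in the scenario lacks that property because it rebuilds the logic twice.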

Decision order that usually wins

  1. First classify the problem as arrival pattern, data layout, feature reuse, or downstream training fit.
  2. If the same engineered features must stay consistent across training and inference, think Feature Store first.
  3. If the data arrives continuously and freshness matters, stay in the streaming-ingestion lane before debating later modeling choices.
  4. If the question is about efficient analytical access over large structured datasets, think columnar formats such as Parquet.
  5. Separate how data arrives from how features are served because the exam often puts both ideas in the same stem.

Revised on Sunday, May 10, 2026