Learn the Kinesis, Glue, Athena, EMR, DataSync, Lake Formation, and transfer-pattern choices AWS tests for SAA-C03 ingestion and transformation scenarios.
This newer SAA-C03 task group is about building data paths that scale cleanly from ingestion through transformation and analytics. The exam is not testing you as a dedicated data engineer. It is testing whether you can choose the right AWS-managed path for transfer, streaming, transformation, and analysis requirements.
The exam guide points to analytics and visualization services, ingestion patterns, transfer services such as DataSync and Storage Gateway, transformation services such as Glue, secure access to ingestion points, streaming services such as Kinesis, and format transformation choices.
This is really a pipeline-shape question:
| Requirement | Strongest first fit | Why |
|---|---|---|
| Real-time streaming ingestion | Kinesis | Purpose-built for streaming pipelines |
| Managed ETL and cataloging | Glue | Strong fit for transformation workflows and data catalog integration |
| Query data in place in S3 | Athena | Fast analytical query pattern without managing clusters |
| Large-scale data processing cluster | EMR | Better fit for heavier distributed processing needs |
| Online or batch transfer into AWS storage | DataSync or Storage Gateway | DataSync handles managed bulk or scheduled copies; Storage Gateway gives on-premises systems ongoing hybrid access. Both beat custom copy scripts |

Mapped onto pipeline stages, the same services line up like this:

| Stage | Typical service fit | What the exam is really asking |
|---|---|---|
| Transfer into AWS | DataSync or Storage Gateway | How the data gets there reliably |
| Real-time streaming | Kinesis | How events flow continuously |
| Transformation and cataloging | Glue | How raw data becomes usable |
| Data lake governance | Lake Formation | How access is controlled and shared safely |
| Query and visualization | Athena and QuickSight | How people consume results without provisioning analytical clusters |
If the problem is secure lake access, Athena alone is not the answer. If the problem is format conversion, DataSync alone is not the answer. SAA-C03 rewards the candidate who notices which stage is actually broken.
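The stage-first habit can be sketched as a simple lookup. This is only an illustration of the table above, not an AWS API; the stage keys (`transfer`, `streaming`, and so on) and the `first_fit` helper are hypothetical names chosen for the example.

```python
# Stage-to-service mapping copied from the stage table above.
# The dictionary keys are hypothetical labels for this sketch.
STAGE_TO_SERVICE = {
    "transfer": "DataSync or Storage Gateway",
    "streaming": "Kinesis",
    "transformation": "Glue",
    "governance": "Lake Formation",
    "consumption": "Athena and QuickSight",
}


def first_fit(broken_stage: str) -> str:
    """Return the service family to evaluate first for the broken stage."""
    try:
        return STAGE_TO_SERVICE[broken_stage]
    except KeyError:
        raise ValueError(f"unknown stage: {broken_stage!r}") from None


print(first_fit("governance"))
```

The point of the sketch is the order of reasoning: name the broken stage first, then pick the service, rather than starting from a favorite service.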
High-performing data pipelines are still security designs.
| Requirement | Strongest first fit | Why |
|---|---|---|
| Private ingestion into S3 from VPC-based workloads | VPC endpoint plus bucket policy and least-privilege IAM | Reduces public exposure and tightens the path |
| Central data lake access control across accounts or teams | Lake Formation | Stronger governance answer than scattered bucket permissions alone |
| Encryption and controlled key usage for pipeline data | S3 encryption plus KMS key policy design | Keeps pipeline access and data protection aligned |
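
As a minimal sketch of the first row, a bucket policy can deny any request that does not arrive through a specific VPC endpoint by testing the `aws:SourceVpce` condition key. The endpoint ID below is a made-up placeholder and the helper name `vpce_only_bucket_policy` is hypothetical; a blanket deny like this also blocks console and administrative access from outside the endpoint, so real policies usually carve out exceptions.

```python
import json


def vpce_only_bucket_policy(bucket: str, vpce_id: str) -> dict:
    """Build a policy that denies S3 access unless it comes via one VPC endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnlessFromVpcEndpoint",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_id}},
            }
        ],
    }


# "vpce-0123456789abcdef0" is a placeholder endpoint ID, not a real resource.
policy = vpce_only_bucket_policy("raw-orders-bucket", "vpce-0123456789abcdef0")
print(json.dumps(policy, indent=2))
```

The deny statement only narrows the path. Writers still need an explicit least-privilege IAM allow for the actions they perform.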

A typical end-to-end pipeline, written as a Mermaid flowchart:

flowchart LR
I["Transfer or streaming ingress"] --> R["Raw S3 landing zone"]
R --> G["Glue transform and catalog"]
G --> C["Curated S3 in Parquet or optimized layout"]
C --> L["Lake Formation governance"]
L --> A["Athena and QuickSight consumption"]
The exam often asks which stage is the real decision point. If the problem is transfer, Glue is usually not the answer. If the problem is transformation or query speed, the right answer is often a format-and-catalog decision rather than a bigger cluster.
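One common form of "optimized layout" is Hive-style partitioning, where `key=value` path segments let query engines such as Athena skip partitions that a filter excludes. The sketch below assumes a date-based scheme; the `year`/`month`/`day` keys and the `partition_prefix` helper are illustrative choices, not something the exam guide prescribes.

```python
from datetime import date


def partition_prefix(base: str, day: date) -> str:
    """Build a Hive-style partition prefix (key=value path segments)."""
    return (
        f"{base.rstrip('/')}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )


print(partition_prefix("s3://curated-orders-bucket/orders", date(2024, 5, 7)))
# s3://curated-orders-bucket/orders/year=2024/month=05/day=07/
```

A query that filters on `year` and `month` then touches only the matching prefixes, which is the kind of layout decision that often beats adding cluster capacity.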
Resources:
  AppEventsStream:
    Type: AWS::Kinesis::Stream
    Properties:
      Name: app-events
      ShardCount: 2              # provisioned capacity: two shards
      RetentionPeriodHours: 24   # how long records stay readable
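What to notice: the stream is provisioned with a fixed `ShardCount` of 2, so ingest capacity is set by the number of shards (each shard accepts up to 1 MB/s or 1,000 records/s of writes). `RetentionPeriodHours: 24` is the default retention window and bounds how far back a consumer can replay.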
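A rough way to reason about a shard count like the one above is to divide expected traffic by the per-shard limits of a provisioned Kinesis data stream (1 MB/s or 1,000 records/s in, 2 MB/s out) and take the largest result. The traffic figures in the example call are invented for illustration, and `shards_needed` is a hypothetical helper, not an AWS API.

```python
import math

# Per-shard limits for a provisioned Kinesis data stream.
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0


def shards_needed(write_mb_s: float, records_s: float, read_mb_s: float) -> int:
    """Estimate shards as the tightest of the write, record-rate, and read limits."""
    return max(
        1,
        math.ceil(write_mb_s / WRITE_MB_PER_SHARD),
        math.ceil(records_s / WRITE_RECORDS_PER_SHARD),
        math.ceil(read_mb_s / READ_MB_PER_SHARD),
    )


# Illustrative load: 1.5 MB/s in, 1,200 records/s, 3 MB/s total read.
print(shards_needed(write_mb_s=1.5, records_s=1200, read_mb_s=3.0))  # 2
```

The 2 MB/s read limit is shared by standard consumers of a shard, so adding consumers can raise the read-side requirement even when write volume is unchanged.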
The Spark job below (the kind of code that runs in Glue or on EMR) shows the format-conversion move AWS wants you to recognize when data arrives in one shape but must be queried efficiently in another.
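from pyspark.sql import SparkSession

# Reuse the active session (Glue and EMR provide one) or create it locally.
spark = SparkSession.builder.getOrCreate()

# Read raw CSV; the first line of each file holds the column names.
raw = spark.read.option("header", "true").csv("s3://raw-orders-bucket/orders/")

# Rewrite as Parquet in about eight files, replacing the curated prefix.
(raw
    .repartition(8)
    .write
    .mode("overwrite")
    .format("parquet")
    .save("s3://curated-orders-bucket/orders/"))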
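What to notice: the job reads header-bearing CSV from the raw bucket and rewrites it as Parquet in the curated bucket. Parquet is columnar, so Athena can read only the columns a query needs. `repartition(8)` yields about eight output files instead of many tiny objects, and `mode("overwrite")` replaces whatever was already at the curated prefix.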
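Because Athena bills by data scanned, the payoff of a columnar format can be approximated with simple arithmetic. The sketch below assumes equally sized columns and an illustrative 4x compression ratio; both are assumptions for the example, not measured values, and `estimated_scan_gb` is a hypothetical helper.

```python
def estimated_scan_gb(
    dataset_gb: float,
    columns_read: int,
    total_columns: int,
    compression_ratio: float = 1.0,
) -> float:
    """Rough data-scanned estimate when a query reads only some columns.

    Assumes every column is the same size, which real tables rarely satisfy.
    """
    if not 0 < columns_read <= total_columns:
        raise ValueError("columns_read must be between 1 and total_columns")
    return dataset_gb / compression_ratio * columns_read / total_columns


# Row-oriented CSV: the query has to read the whole 100 GB.
csv_scan = 100.0
# Parquet: 3 of 30 columns, with an assumed 4x compression ratio.
parquet_scan = estimated_scan_gb(
    100.0, columns_read=3, total_columns=30, compression_ratio=4.0
)
print(csv_scan, parquet_scan)
```

Under these assumptions the same query scans 2.5 GB instead of 100 GB, which is why the exam tends to prefer a format change over more query capacity.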
Do not stop at ingestion. AWS explicitly includes analytics and visualization use cases here.
| Requirement | Strongest first fit | Why |
|---|---|---|
| SQL-style analysis directly on data in S3 | Athena | Query-in-place answer with low ops |
| Governed data-lake sharing and permissions | Lake Formation | Controls lake access patterns more cleanly |
| Business dashboards on top of analytical data | QuickSight | Visualization layer rather than pipeline layer |
| Heavy distributed processing with cluster control | EMR | Better when managed query-in-place is not enough |
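
To make the governance row concrete, the sketch below builds the argument dictionary for a table-level or column-level `SELECT` grant, shaped like the parameters that the Lake Formation `GrantPermissions` operation (boto3 `grant_permissions`) takes. It only constructs the payload and never calls AWS; the account ID, role name, database, and table are hypothetical placeholders.

```python
def select_grant_request(principal_arn, database, table, columns=None):
    """Build GrantPermissions-style arguments for a SELECT grant."""
    if columns:
        # Column-level grant: expose only the listed columns.
        resource = {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        }
    else:
        # Table-level grant: expose the whole table.
        resource = {"Table": {"DatabaseName": database, "Name": table}}
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": resource,
        "Permissions": ["SELECT"],
    }


# Placeholder principal and catalog names, for illustration only.
request = select_grant_request(
    "arn:aws:iam::111122223333:role/analyst",
    database="sales",
    table="orders",
    columns=["order_id", "order_total"],
)
print(request)
```

Granting at the catalog level like this, rather than editing bucket policies per team, is what makes Lake Formation the governance answer in cross-team scenarios.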

When a scenario describes a symptom rather than a requirement, check the matching layer first:

| Symptom | Strongest first check | Why |
|---|---|---|
| Data arrives slowly from on-premises systems | Transfer method and network path | This is usually a transfer problem before it is a Glue or Athena problem |
| Data is present in S3 but analysts cannot query it effectively | Catalog and format layer | Query services work best when the data is shaped and described correctly |
| The team is managing clusters for simple transformation work | EMR versus managed ETL fit | The exam often prefers managed transformation when cluster control is unnecessary |
| Streaming consumers fall behind | Stream throughput and consumer design | This is an ingestion-capacity and consumer-scaling question, not a pure storage question |
| Analysts can query the lake but permissions are messy across teams | Governance layer fit | Lake Formation or tighter lake-access design may be the real missing layer |
Move next into "4. Cost-Optimized Architectures" to study how the same storage, compute, database, and network choices change when cost becomes the deciding constraint.