AWS DEA-C01 cheat sheet for ingestion, metadata, lakehouse choices, traps, and final review.
Keep this cheat sheet open while drilling questions. DEA‑C01 rewards “production data platform realism”: correct service selection, replayability/backfills, partitioning/file formats, monitoring and data quality, and governance-by-default.
CDC: Change data capture, where source-system changes are emitted as events for downstream ingestion.
ETL: Extract, transform, and load workflow for moving and reshaping data into a target system.
Lake Formation: AWS governance layer for permissions and controls on S3-based data lakes.
MWAA: Managed Workflows for Apache Airflow, the AWS managed orchestration service for Airflow DAGs.
Use this when the question is really about where the pipeline should land, transform, govern, and publish.
```mermaid
flowchart LR
    SRC["Sources"] --> ING["Ingest: batch, stream, or CDC"]
    ING --> RAW["Raw S3 landing"]
    RAW --> ETL["Transform: Glue, EMR, Lambda"]
    ETL --> CUR["Curated S3 / warehouse"]
    CUR --> GOV["Governance: Lake Formation + catalog + permissions"]
    CUR --> OBS["Observe: CloudWatch, CloudTrail, Macie"]
```
Use this when the stem mixes sources, ingestion, storage, governance, and consumption.
```mermaid
flowchart TD
    S["Scenario"] --> I["Identify the source shape"]
    I --> P["Pick batch, streaming, or CDC"]
    P --> L["Land in durable storage first"]
    L --> T["Transform with the right engine"]
    T --> G["Apply catalog + permissions + governance"]
    G --> C["Choose the serving layer"]
    C --> M["Monitor quality, cost, and replayability"]
```
| Item | Value |
|---|---|
| Questions | 65 (multiple-choice + multiple-response) |
| Time | 130 minutes |
| Passing score | 720 (scaled 100–1000) |
| Cost | 150 USD |
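| Domains | D1 Data Ingestion and Transformation 34% • D2 Data Store Management 26% • D3 Data Operations and Support 22% • D4 Data Security and Governance 18% |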
| If the question says… | Usually best answer |
|---|---|
| Replayable ingest and backfills | S3 raw zone + idempotent processing + checkpoints |
| Database replication / CDC | AWS DMS |
| Low-latency event stream analytics | Kinesis Data Streams or MSK (+ Flink when stateful processing is needed) |
| Cheapest ad-hoc SQL on S3 | Athena + Parquet + partition pruning |
| Warehouse-style analytics and mixed workload SQL | Redshift (plus Spectrum for external S3 data) |
| Cross-engine data permissions on lake data | Lake Formation + Glue Data Catalog |
| Production orchestration with dependencies/retries | MWAA or Step Functions |
| PII discovery in S3 | Amazon Macie |
| Schema discovery and metadata | Glue crawlers + explicit table design where needed |
| Data quality guardrails | In-pipeline checks + quarantine + alerting |
| Topic | Fast recall |
|---|---|
| File format for analytics | Parquet/ORC beats CSV/JSON for scan cost and speed |
| S3 table performance | Partition on query predicates; avoid tiny files |
| Delivery semantics | Most streaming/integration paths are at-least-once |
| Governance baseline | CloudTrail, encryption (KMS), least-privilege access |
| Query cost lever | Reduce data scanned first (partition + columnar + projection) |
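As a rough worked example of the query cost lever, the sketch below estimates bytes scanned under three layouts. The dataset size, the even spread of data across partitions and columns, and the 5 USD per TB price are illustrative assumptions, not quoted figures; compression is ignored.

```python
TB = 1024 ** 4  # bytes per tebibyte


def scanned_bytes(total_bytes, partitions_total, partitions_read,
                  columns_total, columns_read, columnar):
    """Estimate bytes scanned, assuming data is spread evenly
    across partitions and columns (a simplification)."""
    # Partition pruning: only the matching partitions are read.
    after_pruning = total_bytes * partitions_read / partitions_total
    if not columnar:
        # Row formats such as CSV/JSON read whole rows.
        return after_pruning
    # Columnar formats read only the projected columns.
    return after_pruning * columns_read / columns_total


def cost_usd(nbytes, usd_per_tb=5.0):
    # usd_per_tb is an assumed price for illustration only.
    return nbytes / TB * usd_per_tb


# Hypothetical dataset: 10 TB, 365 daily partitions, 20 columns.
raw = 10 * TB
full_csv = scanned_bytes(raw, 365, 365, 20, 20, columnar=False)
one_day_csv = scanned_bytes(raw, 365, 1, 20, 2, columnar=False)
one_day_parquet = scanned_bytes(raw, 365, 1, 20, 2, columnar=True)

for label, nbytes in [("full scan, CSV", full_csv),
                      ("one day, CSV", one_day_csv),
                      ("one day, 2 of 20 columns, Parquet", one_day_parquet)]:
    print(f"{label}: {cost_usd(nbytes):.4f} USD")
```

The ordering matters more than the exact numbers: pruning partitions cuts the scan first, and a columnar format then cuts it again by the share of columns actually selected.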
```mermaid
flowchart LR
    SRC["Sources<br/>(SaaS, DBs, apps, streams)"] --> ING["Ingest<br/>(DMS, AppFlow, Kinesis, MSK)"]
    ING --> RAW["S3 data lake<br/>(raw/bronze)"]
    RAW --> ETL["Transform<br/>(Glue, EMR, Lambda)"]
    ETL --> CUR["S3 curated<br/>(silver/gold)"]
    CUR --> CAT["Glue Data Catalog"]
    CAT --> ATH["Athena<br/>(serverless SQL)"]
    CUR --> RS["Redshift<br/>(warehouse)"]
    ATH --> BI["QuickSight / BI"]
    RS --> BI
    CUR --> GOV["Lake Formation<br/>(permissions)"]
    ING --> ORCH["Orchestrate<br/>(MWAA, Step Functions, EventBridge)"]
    ORCH --> ETL
    MON["Monitor + audit<br/>(CloudWatch, CloudTrail, Macie)"] --> ORCH
    MON --> RS
    MON --> ATH
```
High-yield framing: DEA‑C01 is about the pipeline + platform, not just one service.
| Pattern | Best for | Typical AWS answers | Common gotcha |
|---|---|---|---|
| Batch | Daily/hourly loads, predictable schedules | S3 landing + Glue/EMR; EventBridge schedule; AppFlow | Backfills + late data handling |
| Streaming | Near-real-time events | Kinesis Data Streams; MSK; (optional) Flink | Ordering, retries, consumer lag |
| CDC (change data capture) | Database replication | AWS DMS | Exactly-once isn’t guaranteed; handle duplicates |
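Because delivery is at-least-once, the consumer has to make duplicates and replays harmless. A minimal sketch of idempotent apply follows; the event fields (`pk`, `seq`, `op`, `row`) are hypothetical names, with `seq` standing in for any per-source monotonic marker such as a log sequence number.

```python
def apply_changes(table, events):
    """Apply CDC events to `table` (a dict keyed by primary key), idempotently."""
    for ev in sorted(events, key=lambda e: e["seq"]):
        current = table.get(ev["pk"])
        # Skip anything already applied: duplicates and replays become no-ops.
        if current is not None and ev["seq"] <= current["seq"]:
            continue
        if ev["op"] == "delete":
            # Keep a tombstone so a late duplicate cannot resurrect the row.
            table[ev["pk"]] = {"seq": ev["seq"], "row": None}
        else:  # insert or update
            table[ev["pk"]] = {"seq": ev["seq"], "row": ev["row"]}
    return table


events = [
    {"pk": 1, "seq": 1, "op": "insert", "row": {"name": "a"}},
    {"pk": 1, "seq": 2, "op": "update", "row": {"name": "b"}},
    {"pk": 1, "seq": 2, "op": "update", "row": {"name": "b"}},  # duplicate delivery
    {"pk": 2, "seq": 3, "op": "insert", "row": {"name": "c"}},
    {"pk": 2, "seq": 4, "op": "delete", "row": None},
]

state = apply_changes({}, events)
replayed = apply_changes(dict(state), events)  # simulate a full replay
print(state == replayed)
```

Replaying the same events leaves the state unchanged, which is the property that makes backfills and retries safe.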
| Need | Typical best-fit |
|---|---|
| Run every N minutes | EventBridge schedule |
| Run when file arrives in S3 | S3 event notifications or EventBridge |
| Complex dependencies + retries | MWAA or Step Functions |
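For the file-arrival row, a Lambda-style handler usually starts by pulling the bucket and key out of the event. The sketch below assumes the direct S3 event notification shape (`Records[].s3`); events routed through EventBridge are shaped differently. The bucket and key are made-up sample values.

```python
from urllib.parse import unquote_plus


def objects_from_s3_event(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded, for example '+' for a space
        # and '%3D' for '=' in Hive-style partition folders.
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def handler(event, context=None):
    """Lambda-style entry point: list the new objects for the next step."""
    return [f"s3://{bucket}/{key}" for bucket, key in objects_from_s3_event(event)]


# Hypothetical sample payload, trimmed to the fields used above.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-raw-bucket"},
                "object": {"key": "events/dt%3D2025-12-12/part+0001.json"},
            }
        }
    ]
}

print(handler(sample_event))
```

Forgetting to decode the key is a common cause of "object not found" errors on partitioned paths.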
| You need… | Best-fit (typical) | Why |
|---|---|---|
| Managed Spark ETL with less ops | AWS Glue | Serverless Spark ETL, integrated with the Glue Data Catalog |
| Full control over Spark (big jobs) | Amazon EMR | More knobs/control; long-running clusters optional |
| Lightweight transforms or glue code | AWS Lambda | Event-driven, simple steps |
| SQL transforms close to the warehouse | Amazon Redshift | Push compute to the warehouse when appropriate |
| Approach | When it’s best | Risk |
|---|---|---|
| Glue crawler | Fast discovery, unknown schemas | Schema drift surprises |
| Explicit DDL | Strong contracts | More manual maintenance |
High-yield rule: keep partitions in sync (MSCK REPAIR TABLE, partition projection, or crawler updates), or queries “miss” new data.
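One alternative to a full repair scan is to register only the partitions that are new. The sketch below builds `ALTER TABLE ... ADD IF NOT EXISTS PARTITION` statements from a list of object keys; the table name, bucket, and the `dt=YYYY-MM-DD` folder layout are assumptions for illustration, and running the statements is left to whatever client you use.

```python
import re

# Hive-style partition folder, e.g. ".../dt=2025-12-12/part-0000.parquet"
DT_FOLDER = re.compile(r"(?:^|/)dt=(\d{4}-\d{2}-\d{2})/")


def add_partition_statements(table, location, keys, known=()):
    """Return DDL statements for dt partitions seen in `keys` but not in `known`."""
    found = set()
    for key in keys:
        match = DT_FOLDER.search(key)
        if match:
            found.add(match.group(1))
    statements = []
    for dt in sorted(found - set(known)):
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (dt = '{dt}') LOCATION '{location}/dt={dt}/'"
        )
    return statements


keys = [
    "events/dt=2025-12-11/a.parquet",
    "events/dt=2025-12-12/b.parquet",
    "events/dt=2025-12-12/c.parquet",
    "events/_SUCCESS",  # no partition folder, ignored
]
statements = add_partition_statements(
    "curated.events",
    "s3://example-curated-bucket/events",
    keys,
    known={"2025-12-11"},
)
for statement in statements:
    print(statement)
```

`IF NOT EXISTS` keeps the statement safe to rerun, so the step stays idempotent when the pipeline retries.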
| You need… | Best-fit | Why |
|---|---|---|
| Ad hoc SQL on S3 | Athena | Serverless, pay per scan |
| High-concurrency BI dashboards | Redshift | Warehouse optimization + caching |
| Query S3 from Redshift | Redshift Spectrum | External tables on S3 |
In Redshift, use COPY from S3 for fast parallel loads (it also reads columnar files such as Parquet).
Use UNLOAD to export query results back to S3.
If your table is partitioned by dt, always filter by it:
```sql
SELECT *
FROM curated.events
WHERE dt = '2025-12-12'
  AND event_type = 'purchase';
```
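```sql
-- Athena CTAS: write a partitioned Parquet table.
-- Partition columns must come last in the SELECT list.
CREATE TABLE curated.daily_sales
WITH (format = 'PARQUET', partitioned_by = ARRAY['dt'])
AS
SELECT customer_id, SUM(amount) AS total, dt
FROM raw.sales
GROUP BY dt, customer_id;
```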
| You need… | Best-fit | Why |
|---|---|---|
| DAGs, complex dependencies, retries | MWAA (Airflow) | Mature DAG patterns |
| Serverless state machine orchestration | Step Functions | Visual state, retries, integration patterns |
```mermaid
flowchart LR
    E["EventBridge schedule"] --> W["Workflow start"]
    W --> I["Ingest"]
    I --> V{"Valid?"}
    V -->|yes| T["Transform"]
    V -->|no| Q["Quarantine + alert"]
    T --> C["Catalog/partitions update"]
    C --> P["Publish dataset"]
    P --> N["Notify (SNS)"]
```
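The flow above can be sketched as plain Python to make the branch and the retries concrete. This is an illustration of the control flow only, not how MWAA or Step Functions express it; the step names and the `steps` dictionary are hypothetical.

```python
import time


def with_retries(step, attempts=3, base_delay=0.0):
    """Call step() up to `attempts` times with exponential backoff.

    base_delay is 0 here so the demo runs instantly; use a real delay in practice.
    """
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))


def run_workflow(batch, steps, attempts=3):
    """Run one batch through the workflow and return the path taken."""
    path = ["ingest"]
    data = with_retries(lambda: steps["ingest"](batch), attempts)
    if not steps["validate"](data):
        # Invalid batch: set it aside and stop before anything is published.
        steps["quarantine"](data)
        path.append("quarantine+alert")
        return path
    curated = with_retries(lambda: steps["transform"](data), attempts)
    path.append("transform")
    for name in ("update_catalog", "publish", "notify"):
        with_retries(lambda: steps[name](curated), attempts)
        path.append(name)
    return path


calls = {"ingest": 0}


def flaky_ingest(batch):
    """Fail once to simulate a transient source error."""
    calls["ingest"] += 1
    if calls["ingest"] == 1:
        raise ConnectionError("transient failure")
    return batch


quarantined = []
steps = {
    "ingest": flaky_ingest,
    "validate": lambda rows: all("id" in row for row in rows),
    "quarantine": quarantined.append,
    "transform": lambda rows: [{**row, "ok": True} for row in rows],
    "update_catalog": lambda rows: None,
    "publish": lambda rows: None,
    "notify": lambda rows: None,
}

good_path = run_workflow([{"id": 1}], steps)
bad_path = run_workflow([{"name": "x"}], steps)
print(good_path)
print(bad_path)
```

The first run recovers from the transient ingest failure and publishes; the second never reaches the publish step, which is the point of validating before transforming.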
High-yield reliability rules:
Common AWS tooling:
| Dimension | Example check |
|---|---|
| Completeness | Required fields not null |
| Consistency | Same customer_id format across sources |
| Accuracy | Values within expected ranges |
| Integrity | Valid foreign keys / referential relationships |
High-yield pattern: run checks in-pipeline, quarantine bad records, and alert.
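A minimal sketch of that pattern, with one check per dimension in the table. The field names, the `C` plus six digits ID format, and the amount range are made-up rules for illustration; alerting is reduced to returning the quarantined records with their failure reasons.

```python
import re

CUSTOMER_ID = re.compile(r"^C\d{6}$")  # hypothetical canonical ID format
REQUIRED = ("order_id", "customer_id", "amount")


def check_record(rec, known_customers):
    """Return the list of quality dimensions this record fails."""
    failures = []
    # Completeness: required fields must be present and not null.
    if any(rec.get(field) is None for field in REQUIRED):
        failures.append("completeness")
    cid = rec.get("customer_id")
    # Consistency: same customer_id format across sources.
    if cid is not None and not (isinstance(cid, str) and CUSTOMER_ID.match(cid)):
        failures.append("consistency")
    amount = rec.get("amount")
    # Accuracy: values within an expected range (assumed bounds).
    if amount is not None and not (0 <= amount <= 100_000):
        failures.append("accuracy")
    # Integrity: customer_id must reference a known customer.
    if cid is not None and cid not in known_customers:
        failures.append("integrity")
    return failures


def split_batch(records, known_customers):
    """Separate clean records from quarantined ones, keeping the reasons."""
    passed, quarantined = [], []
    for rec in records:
        failures = check_record(rec, known_customers)
        if failures:
            quarantined.append({"record": rec, "failures": failures})
        else:
            passed.append(rec)
    return passed, quarantined


records = [
    {"order_id": 1, "customer_id": "C000001", "amount": 25.0},
    {"order_id": 2, "customer_id": "c-17", "amount": 10.0},
    {"order_id": 3, "customer_id": "C000001", "amount": -5},
    {"order_id": 4, "customer_id": None, "amount": 5},
]
passed, quarantined = split_batch(records, known_customers={"C000001"})
print(len(passed), [q["failures"] for q in quarantined])
```

Keeping the failure reasons with each quarantined record makes the alert actionable and lets fixed records be replayed later.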
Lake Formation helps you manage fine-grained permissions for data in S3 across engines like Athena/EMR/Redshift Spectrum, using a consistent governance model.
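To make "fine-grained" concrete, the sketch below builds the request for a column-level SELECT grant. The dictionary shape follows the boto3 Lake Formation `grant_permissions` call as I understand it, so verify it against the current API reference before relying on it; the role ARN, database, table, and column names are placeholders.

```python
def column_grant_request(principal_arn, database, table, columns):
    """Build keyword arguments for a column-level SELECT grant."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                # Only these columns become visible to the principal.
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


# Placeholder identifiers, for illustration only.
request = column_grant_request(
    "arn:aws:iam::111122223333:role/analyst",
    "curated",
    "events",
    ["dt", "event_type"],
)
print(request)

# With AWS credentials configured, this could then be sent as:
#   boto3.client("lakeformation").grant_permissions(**request)
```

The same grant then applies whichever integrated engine the analyst queries through, which is the main reason to prefer it over per-engine policies.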