AWS DEA-C01 Cheat Sheet: Ingestion, Lakehouse, and Governance

March 28, 2026

AWS DEA-C01 cheat sheet for ingestion, transformation, lakehouse storage, catalogs, operations, security, governance, cost, and final review traps.


Keep this cheat sheet open while drilling questions. DEA-C01 rewards “production data platform realism”: correct service selection, replayability/backfills, partitioning/file formats, monitoring and data quality, and governance-by-default.

CDC: Change data capture, where source-system changes are emitted as events for downstream ingestion.

ETL: Extract, transform, and load workflow for moving and reshaping data into a target system.

Lake Formation: AWS governance layer for permissions and controls on S3-based data lakes.

MWAA: Managed Workflows for Apache Airflow, the AWS managed orchestration service for Airflow DAGs.

Data flow chooser

Use this when the question is really about where the pipeline should land, transform, govern, and publish.

    flowchart LR
	  SRC["Sources"] --> ING["Ingest: batch, stream, or CDC"]
	  ING --> RAW["Raw S3 landing"]
	  RAW --> ETL["Transform: Glue, EMR, Lambda"]
	  ETL --> CUR["Curated S3 / warehouse"]
	  CUR --> GOV["Governance: Lake Formation + catalog + permissions"]
	  CUR --> OBS["Observe: CloudWatch, CloudTrail, Macie"]

DEA-C01 answer sequence

Use this when the stem mixes sources, ingestion, storage, governance, and consumption.

    flowchart TD
	  S["Scenario"] --> I["Identify the source shape"]
	  I --> P["Pick batch, streaming, or CDC"]
	  P --> L["Land in durable storage first"]
	  L --> T["Transform with the right engine"]
	  T --> G["Apply catalog + permissions + governance"]
	  G --> C["Choose the serving layer"]
	  C --> M["Monitor quality, cost, and replayability"]

Quick facts (DEA-C01)

I verified these current AWS exam facts on May 24, 2026.

Item	Value
Exam	AWS Certified Data Engineer - Associate
Exam code	DEA-C01
Questions	65 total
Scoring	50 scored + 15 unscored (unscored items are not identified)
Question types	Multiple choice and multiple response
Time	130 minutes
Passing score	720, scaled 100-1000
Cost	150 USD
Intended candidate	2-3 years of data engineering experience, including 1-2 years of hands-on AWS experience

Domain weights and review priority

Domain	Weight	What to compress for final review
Data Ingestion and Transformation	34%	batch, streaming, CDC, Glue, EMR, Lambda, orchestration, idempotency, replay
Data Store Management	26%	S3 lake layout, file formats, partitioning, cataloging, Redshift, Athena, lifecycle
Data Operations and Support	22%	monitoring, troubleshooting, data quality, cost/performance, lineage, alerts
Data Security and Governance	18%	Lake Formation, IAM, KMS, privacy, logging, compliance, access boundaries

DEA-C01 is an implementation exam. Prefer answers that make pipelines repeatable, observable, governed, and recoverable; reject answers that only move data once without replay, quality checks, ownership, or cost controls.

Data engineering proof stack

DEA-C01 questions usually test whether a pipeline can survive real data-platform pressure, not whether you can name a single analytics service. Before choosing the answer, walk the scenario through this stack:

Source contract: identify batch, stream, CDC, API, file, database, SaaS, or event source behavior, including schema and arrival guarantees.
Durable landing: preserve raw or minimally transformed data when replay, audit, or backfill matters.
Processing semantics: choose the engine from volume, velocity, transformation complexity, latency, state, ordering, and cost.
Data shape: control file format, compression, partitioning, small-file behavior, schema evolution, table format, and catalog metadata.
Quality path: validate freshness, completeness, duplicates, nulls, schema drift, bad records, and quarantine/replay behavior.
Governance boundary: apply IAM, Lake Formation, KMS, Macie, CloudTrail, logging, privacy controls, and cross-account sharing deliberately.
Operations evidence: prove success with job history, metrics, logs, lineage, freshness checks, cost signals, and alert ownership.

If an answer moves data successfully but cannot replay, explain freshness, isolate bad records, or prove governed access, it is usually too thin for Data Engineer - Associate.

Official task compression

Data Ingestion and Transformation covers batch, streaming, CDC, APIs, scheduling, fan-in/fan-out, replayability, transformation, orchestration, code, and IaC. Preserve raw data, handle duplicates, support replay, and choose the transform engine from volume, velocity, variety, latency, and operational cost.

Data Store Management covers store choice, cataloging, lifecycle, schema evolution, Redshift/S3 movement, open table formats, and vector concepts. Storage decisions are access-pattern decisions: query engine, file format, partitioning, catalog, lifecycle, schema evolution, and governance all interact.

Data Operations and Support covers automation, analysis, monitoring, logs, troubleshooting, data quality, skew, sampling, cost/performance, and operational evidence. A pipeline is not production-ready until freshness, quality, cost, failures, lineage, and alerts are observable.

Data Security and Governance covers IAM, Lake Formation, KMS, masking, audit logs, privacy, data sharing, sovereignty, and controlled access. Prefer least privilege, fine-grained lake permissions, encryption, auditability, PII detection, and Region controls over informal bucket sprawl.

DEA-C01 boundary cues

AWS frames DEA-C01 as a data engineering implementation exam. Use that boundary to reject overbuilt or off-scope answers:

If the answer performs ML training or inference, it is probably outside the intended candidate scope unless the stem is only asking about data preparation for that downstream use.
If the answer draws business conclusions from data, it is likely analyst work; DEA-C01 expects the data engineer to make the data reliable, queryable, governed, and observable.
If the answer depends on language-specific syntax, be careful. The exam expects programming concepts for pipelines, not memorization of one language’s syntax details.
If the answer chooses a store before identifying volume, variety, velocity, query pattern, governance, and lifecycle, it is probably guessing.
If the answer fixes a failed pipeline without logs, metrics, job history, checkpoints, and replay/quarantine behavior, it is operationally incomplete.

Fast strategy (what the exam expects)

If the requirement is replayable + backfillable, design for idempotency, checkpoints, and reprocessing (S3 as durable landing is common).
If you see “best cost/performance for queries on S3”, think Parquet + partitioning + Athena/Redshift Spectrum, not raw CSV scans.
If you see “govern access to S3 data across services”, think Lake Formation + Glue Data Catalog, not just IAM.
If you see “batch vs streaming”, focus on latency, ordering, retention, and operational complexity.
If you see “audit” or “governance”, include CloudTrail, central log storage, and controlled access to logs.
If you see vectors in a data-engineering stem, stay grounded in data-store and retrieval design: storage, metadata filters, security, lifecycle, and access pattern before model behavior.

Question-type traps

Question type	Exam-day habit
Multiple choice	Identify the pipeline failure point first: source capture, durable landing, transform engine, catalog, serving store, quality gate, or access control.
Multiple response	Include every required data-platform control. A strong answer may need ingestion plus checkpointing, cataloging, monitoring, and security.

Unanswered questions are incorrect and there is no penalty for guessing. With 65 questions in 130 minutes, your average budget is exactly 2 minutes per question; do not spend 6 minutes on one dense pipeline scenario before banking easier service-selection points.

Scenario eliminations

Stem clue	Eliminate first	Keep in play
need replayable ingest and backfills	transform directly into final table only	raw S3 landing, idempotent jobs, checkpoints, partitioned curated outputs
source database changes must flow downstream	scheduled full extracts forever	AWS DMS CDC, durable landing, duplicate handling, restart strategy
ad-hoc SQL on S3 with lowest operational overhead	Redshift cluster by default	Athena with columnar format, partition pruning, and catalog metadata
warehouse analytics with managed performance	Athena for every workload	Redshift, data loading design, sort/distribution choices where relevant
data lake access must be governed across engines	IAM-only bucket sprawl	Lake Formation plus Glue Data Catalog, tag-based or fine-grained permissions
tiny files causing slow queries	add more crawlers	compaction, partition strategy, columnar format, file-size hygiene
schema changed upstream	ignore crawler/catalog impact	schema evolution handling, validation, quarantine, compatibility rules
pipeline failed overnight	rerun everything manually	workflow retries, dead-letter/quarantine path, CloudWatch logs/metrics, alert ownership
PII appears in data lake	rely on naming conventions	Macie discovery, classification, encryption, access policy, retention controls
cost spike in analytics	more compute first	reduce scanned data, partition pruning, compression, right-sized service choice
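The compaction fix in the tiny-files row can be planned before any data is rewritten. The sketch below groups small objects into batches of roughly one target size. The function name, the 128 MB target, and the greedy grouping are illustrative assumptions, not an AWS API or a recommended setting.

```python
from typing import Dict, List

TARGET_BYTES = 128 * 1024 * 1024  # assumed target output size; tune per engine


def plan_compaction(files: Dict[str, int], target_bytes: int = TARGET_BYTES) -> List[List[str]]:
    """Group small files into batches that each add up to at most target_bytes.

    files maps object key -> size in bytes. Files already at or above the
    target are left alone. Each returned batch would be rewritten as one file.
    """
    small = sorted(
        ((key, size) for key, size in files.items() if size < target_bytes),
        key=lambda item: item[1],
        reverse=True,
    )
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for key, size in small:
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        batches.append(current)
    # A batch with a single file gains nothing from being rewritten.
    return [batch for batch in batches if len(batch) > 1]


if __name__ == "__main__":
    sizes = {"a": 40, "b": 40, "c": 30, "d": 500, "e": 10}
    print(plan_compaction(sizes, target_bytes=100))
```

In a real job the plan would feed a Glue or EMR rewrite step; the planner only decides which objects belong together.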

Production pipeline evidence chain

Use this chain when the stem asks for a durable data platform rather than a one-time transfer.

    flowchart LR
	  S["Source contract"] --> L["Raw durable landing"]
	  L --> C["Checkpoint and dedupe key"]
	  C --> V["Validation and quarantine"]
	  V --> T["Transform to curated format"]
	  T --> M["Catalog, partition, and lineage"]
	  M --> G["Govern access and keys"]
	  G --> O["Freshness, cost, and failure alarms"]

The strongest DEA-C01 answer usually shows where records came from, how the job can replay safely, how bad records are isolated, how consumers discover the table, and how operators prove freshness and access control.
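The "replay safely" idea can be sketched in a few lines. This is a minimal illustration, assuming an in-memory set and dictionary stand in for a durable checkpoint store and a curated table; `run_batch` and `backfill` are hypothetical names, not part of any AWS SDK.

```python
from typing import Callable, Dict, Iterable, Set


def run_batch(
    raw_keys: Iterable[str],
    checkpoint: Set[str],
    curated: Dict[str, str],
    transform: Callable[[str], str],
) -> int:
    """Process each raw object at most once per checkpoint; return how many ran.

    curated is keyed by the raw key, so a rerun overwrites the same output
    instead of appending a duplicate.
    """
    processed = 0
    for key in raw_keys:
        if key in checkpoint:
            continue  # already handled in an earlier run
        curated[key] = transform(key)  # deterministic target: safe to repeat
        checkpoint.add(key)  # record progress only after the write succeeds
        processed += 1
    return processed


def backfill(keys: Iterable[str], checkpoint: Set[str]) -> None:
    """Force reprocessing by clearing checkpoint entries for the given keys."""
    for key in keys:
        checkpoint.discard(key)


if __name__ == "__main__":
    keys = ["raw/dt=2025-12-12/a.json", "raw/dt=2025-12-12/b.json"]
    seen: Set[str] = set()
    out: Dict[str, str] = {}
    print(run_batch(keys, seen, out, str.upper))  # first run processes both
    print(run_batch(keys, seen, out, str.upper))  # rerun is a no-op
```

Because the raw objects are never modified, a backfill is just a checkpoint reset followed by a normal run.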

Data-platform design checks

Design question	Strong exam answer	Weak answer pattern
Can the pipeline replay or backfill?	raw immutable landing, checkpoints, idempotent transforms, dedupe keys	overwrite final tables directly and hope reruns work
Can consumers trust the schema?	explicit contracts, catalog updates, schema evolution handling, validation	crawler-only discovery with no compatibility plan
Can queries stay cost-efficient?	Parquet/ORC, partition pruning, compaction, compression, projection	scan raw CSV/JSON forever or partition on unusable columns
Can bad records be handled safely?	quarantine path, failed-rule evidence, alert owner, replay after correction	drop records silently or rerun the whole pipeline manually
Can governance cross engines?	Lake Formation, Glue Data Catalog, IAM, KMS, audit trail	bucket policies copied per consumer without catalog discipline
Can operations explain failure?	CloudWatch logs/metrics, CloudTrail, job run history, freshness checks	generic “enable logging” without searchable evidence

Storage and schema chooser

Requirement	Prefer	Watch for
ad hoc SQL over S3 with low operations	Athena plus Glue Data Catalog	scanned bytes, partitions, columnar files, workgroup controls
governed lake tables shared across engines	S3 data lake, Glue Data Catalog, Lake Formation	row/column/tag permissions, partitions, KMS, cross-account sharing
warehouse BI with predictable performance	Redshift	load design, distribution/sort strategy, concurrency, materialized views
key-value lookup with predictable access pattern	DynamoDB	partition key design, hot keys, TTL, streams, point-in-time recovery
source database change capture	AWS DMS CDC into durable landing	duplicates, ordering, restart position, schema conversion, backfill
lakehouse updates or deletes	Apache Iceberg or supported open table format	compaction, snapshot retention, engine compatibility, catalog integration
vector search or semantic retrieval	vector-capable store or knowledge-base pattern	embeddings, metadata filters, HNSW/IVF tradeoffs, authorization

Operations evidence map

Symptom in the stem	First evidence to inspect	Better fix
today’s dashboard is empty	S3 prefix, partition registration, freshness metric, upstream job status	sync partitions or repair upstream run before blaming BI
duplicate rows after CDC replay	checkpoint, DMS task logs, primary key or event ID, merge logic	make writes idempotent and dedupe on stable business keys
Athena bill spiked	bytes scanned, file format, partition filters, workgroup settings	convert/compress/partition data before adding more compute
Glue job runtime doubled	input file count, skew, shuffle, worker metrics, recent schema change	compact files, fix skew, tune worker type/count from evidence
downstream schema broke	catalog version, source schema diff, validation failure, consumer contract	quarantine incompatible records and publish a compatible schema path
compliance asks who accessed data	CloudTrail, Lake Formation audit path, access logs, KMS usage, Config history	produce centralized audit evidence, not screenshots or manual notes

Final 20-minute recall (exam day)

Cue -> best answer (pattern map)

If the question says…	Usually best answer
Replayable ingest and backfills	S3 raw zone + idempotent processing + checkpoints
Database replication / CDC	AWS DMS
Low-latency event stream analytics	Kinesis Data Streams or MSK (+ Flink when stateful processing is needed)
Cheapest ad-hoc SQL on S3	Athena + Parquet + partition pruning
Warehouse-style analytics and mixed workload SQL	Redshift (plus Spectrum for external S3 data)
Cross-engine data permissions on lake data	Lake Formation + Glue Data Catalog
Production orchestration with dependencies/retries	MWAA or Step Functions
PII discovery in S3	Amazon Macie
Schema discovery and metadata	Glue crawlers + explicit table design where needed
Data quality guardrails	In-pipeline checks + quarantine + alerting

Must-memorize DEA defaults

Topic	Fast recall
File format for analytics	Parquet/ORC beats CSV/JSON for scan cost and speed
S3 table performance	Partition on query predicates; avoid tiny files
Delivery semantics	Most streaming/integration paths are at-least-once
Governance baseline	CloudTrail, encryption (KMS), least-privilege access
Query cost lever	Reduce data scanned first (partition + columnar + projection)
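The "reduce data scanned first" lever is easy to check with rough arithmetic. The sketch below assumes evenly sized partitions and columns, and it takes the per-terabyte price as a parameter because pricing varies by Region and over time; the 5.0 used in the example is a placeholder, not a quoted price.

```python
TB = 1024 ** 4  # bytes per terabyte (binary), used here for simplicity


def scan_cost_usd(bytes_scanned: int, usd_per_tb: float) -> float:
    """Cost of one query under a simple pay-per-bytes-scanned model."""
    return bytes_scanned / TB * usd_per_tb


def bytes_after_pruning(
    total_bytes: int,
    partitions_total: int,
    partitions_read: int,
    column_fraction: float,
) -> int:
    """Rough bytes scanned, assuming evenly sized partitions and columns.

    column_fraction is the share of the row width actually selected, which
    only helps when the files are columnar (Parquet/ORC).
    """
    return int(total_bytes * (partitions_read / partitions_total) * column_fraction)


if __name__ == "__main__":
    table_bytes = 10 * TB
    price = 5.0  # placeholder price per TB; check current pricing
    full = scan_cost_usd(table_bytes, price)
    pruned = scan_cost_usd(bytes_after_pruning(table_bytes, 365, 1, 0.2), price)
    print(f"full scan: ${full:.2f}, one day and a fifth of the columns: ${pruned:.4f}")
```

The point of the exercise is the ratio, not the absolute numbers: a partition filter and a narrow column list shrink the bill by orders of magnitude before any compute is added.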

Store and cost decision traps

Trap	Better exam instinct
Redshift for every SQL question	Athena is often better for low-ops ad hoc S3 queries; Redshift fits warehouse performance, concurrency, and modeled analytics.
Athena scans raw JSON forever	Convert to Parquet/ORC, partition by common predicates, and avoid tiny files.
DynamoDB used like a warehouse	DynamoDB is for key-value/document access patterns, not arbitrary analytical scans.
RDS selected for unbounded event analytics	Use lake, stream, or warehouse patterns when volume and analytical access dominate.
Lifecycle only after storage cost spikes	Apply lifecycle, retention, and deletion rules when designing the data product.
Cross-Region replication without sovereignty check	Data residency, privacy, backup, and replication controls are part of the answer.

Last-minute traps

Assuming exactly-once semantics by default.
Partitioning on high-cardinality columns that hurt performance.
Relying only on IAM where Lake Formation is explicitly required.
Shipping pipelines without backfill/replay design.

1) End-to-end data platform on AWS (mental model)

    flowchart LR
	  SRC["Sources<br/>(SaaS, DBs, apps, streams)"] --> ING["Ingest<br/>(DMS, AppFlow, Kinesis, MSK)"]
	  ING --> RAW["S3 data lake<br/>(raw/bronze)"]
	  RAW --> ETL["Transform<br/>(Glue, EMR, Lambda)"]
	  ETL --> CUR["S3 curated<br/>(silver/gold)"]
	  CUR --> CAT["Glue Data Catalog"]
	  CAT --> ATH["Athena<br/>(serverless SQL)"]
	  CUR --> RS["Redshift<br/>(warehouse)"]
	  ATH --> BI["QuickSight / BI"]
	  RS --> BI
	  CUR --> GOV["Lake Formation<br/>(permissions)"]
	  ING --> ORCH["Orchestrate<br/>(MWAA, Step Functions, EventBridge)"]
	  ORCH --> ETL
	  MON["Monitor + audit<br/>(CloudWatch, CloudTrail, Macie)"] --> ORCH
	  MON --> RS
	  MON --> ATH

High-yield framing: DEA-C01 is about the pipeline + platform, not just one service.

2) Ingestion patterns (Domain 1)

Batch vs streaming vs CDC (picker)

Pattern	Best for	Typical AWS answers	Common gotcha
Batch	Daily/hourly loads, predictable schedules	S3 landing + Glue/EMR; EventBridge schedule; AppFlow	Backfills + late data handling
Streaming	Near-real-time events	Kinesis Data Streams; MSK; (optional) Flink	Ordering, retries, consumer lag
CDC (change data capture)	Database replication	AWS DMS	Exactly-once isn’t guaranteed; handle duplicates

Streaming and CDC traps

Trap	Better exam instinct
Treating stream retention as permanent storage	Land durable raw events in S3 or another appropriate store when replay/backfill matters.
Assuming one consumer can satisfy every downstream need	Design fan-out, consumer isolation, delivery guarantees, and checkpointing explicitly.
Ignoring source rate limits	Use throttling, batching, retries, backoff, and API quota-aware ingestion.
CDC without duplicate handling	DMS/streaming paths can replay records; design idempotent writes and deduplication keys.
Stateful transform treated like simple Lambda glue	Stateful windows, joins, and aggregations usually require a streaming engine or workflow with durable state.
LLM data processing used without validation	LLM-assisted extraction or cleanup still needs schema checks, confidence handling, auditability, and human review when impact is high.
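The "idempotent writes and deduplication keys" instinct for CDC can be sketched as a merge that ignores anything it has already applied. The event shape (key, sequence number, operation, row) is a hypothetical simplification; real DMS output has its own format, and the sequence number here stands in for whatever monotonic position the source provides.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical event shape: (primary key, sequence number, operation, row)
ChangeEvent = Tuple[str, int, str, Optional[dict]]


def apply_changes(
    target: Dict[str, dict],
    last_seq: Dict[str, int],
    events: Iterable[ChangeEvent],
) -> None:
    """Apply CDC events so that replays and stale duplicates are harmless."""
    for key, seq, op, row in events:
        if seq <= last_seq.get(key, -1):
            continue  # duplicate or stale event: already applied
        if op == "delete":
            target.pop(key, None)
        else:  # "insert" and "update" are both treated as an upsert
            target[key] = dict(row or {})
        last_seq[key] = seq


if __name__ == "__main__":
    events = [
        ("c1", 1, "insert", {"name": "Ann"}),
        ("c1", 2, "update", {"name": "Anne"}),
        ("c2", 3, "insert", {"name": "Bob"}),
        ("c2", 4, "delete", None),
    ]
    table: Dict[str, dict] = {}
    positions: Dict[str, int] = {}
    apply_changes(table, positions, events)
    apply_changes(table, positions, events)  # replay changes nothing
    print(table)
```

Tracking the last applied position per key is what lets a restarted task resend a window of events without producing duplicate rows or resurrecting deleted ones.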

Triggers and scheduling (high yield)

Need	Typical best-fit
Run every N minutes	EventBridge schedule
Run when file arrives in S3	S3 event notifications or EventBridge
Complex dependencies + retries	MWAA or Step Functions
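For the "run when file arrives in S3" row, a Lambda-style handler typically starts by pulling the bucket and key out of the notification. The sketch below assumes the direct S3 event notification payload (a `Records` list with `s3.bucket.name` and `s3.object.key`); events routed through EventBridge use a different envelope, so this parser would need adjusting there. The bucket name in the example is made up.

```python
from typing import Any, Dict, List, Tuple
from urllib.parse import unquote_plus


def objects_from_s3_event(event: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs: List[Tuple[str, str]] = []
    for record in event.get("Records", []):
        s3 = record.get("s3")
        if not s3:
            continue  # not an S3 record
        bucket = s3["bucket"]["name"]
        key = unquote_plus(s3["object"]["key"])  # keys arrive URL-encoded
        pairs.append((bucket, key))
    return pairs


def handler(event: Dict[str, Any], context: Any) -> List[str]:
    """Lambda-style entry point: list the new objects for the next step."""
    return [f"s3://{bucket}/{key}" for bucket, key in objects_from_s3_event(event)]


if __name__ == "__main__":
    sample = {
        "Records": [
            {
                "s3": {
                    "bucket": {"name": "example-raw"},
                    "object": {"key": "raw/dt%3D2025-12-12/part+1.json"},
                }
            }
        ]
    }
    print(handler(sample, None))
```

Decoding the key matters for Hive-style prefixes, because the `=` in `dt=2025-12-12` is encoded in the notification.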

3) ETL and processing choices (Domain 1)

Glue vs EMR vs Lambda vs Redshift (fast picker)

You need…	Best-fit (typical)	Why
Managed Spark ETL with less ops	AWS Glue	Serverless Spark ETL with Data Catalog and connector integrations
Full control over Spark (big jobs)	Amazon EMR	More knobs/control; long-running clusters optional
Lightweight transforms or glue code	AWS Lambda	Event-driven, simple steps
SQL transforms close to the warehouse	Amazon Redshift	Push compute to the warehouse when appropriate

File formats (exam-friendly rules)

Use Parquet/ORC for analytics on S3 (columnar + compression).
Avoid raw CSV/JSON at scale for Athena/Redshift Spectrum scans (cost and performance).

4) Catalogs, partitions, and schema drift (Domain 2)

Glue Data Catalog (what it does)

Central metadata store for S3 data (databases/tables/partitions).
Enables engines like Athena, EMR, and Redshift Spectrum to interpret schema.

Crawlers vs explicit DDL

Approach	When it’s best	Risk
Glue crawler	Fast discovery, unknown schemas	Schema drift surprises
Explicit DDL	Strong contracts	More manual maintenance

High-yield rule: keep partitions in sync (MSCK REPAIR TABLE, ALTER TABLE ADD PARTITION, partition projection, or crawler updates); otherwise queries silently miss new data.
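One way to keep daily partitions in sync is to compare what should exist with what the catalog has registered, then emit DDL only for the gaps. This is a sketch under assumptions: the table is partitioned by a single `dt` column with Hive-style `dt=YYYY-MM-DD` prefixes, and the table name and S3 location are made up. It only builds statement strings and does not call Athena.

```python
from datetime import date, timedelta
from typing import Iterable, List, Set


def expected_partitions(start: date, end: date) -> List[str]:
    """ISO dates for every day from start to end, inclusive."""
    days = (end - start).days
    return [(start + timedelta(days=n)).isoformat() for n in range(days + 1)]


def add_partition_statements(
    table: str,
    location_prefix: str,
    registered: Set[str],
    expected: Iterable[str],
) -> List[str]:
    """Build one ADD PARTITION statement per expected-but-unregistered day."""
    prefix = location_prefix.rstrip("/")
    return [
        f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION (dt='{dt}') "
        f"LOCATION '{prefix}/dt={dt}/'"
        for dt in expected
        if dt not in registered
    ]


if __name__ == "__main__":
    wanted = expected_partitions(date(2025, 12, 11), date(2025, 12, 13))
    known = {"2025-12-11", "2025-12-12"}
    for statement in add_partition_statements(
        "curated.events", "s3://example-bucket/curated/events/", known, wanted
    ):
        print(statement)
```

`IF NOT EXISTS` keeps the step idempotent, so rerunning it after a partial failure is safe.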

Open table formats, vectors, and catalogs

Requirement	Strong first fit	Watch for
lakehouse table updates, deletes, or schema evolution	Apache Iceberg or another supported open table format	catalog integration, snapshot lifecycle, compaction, and engine compatibility
semantic search or RAG-style retrieval over enterprise data	vector index or knowledge-base pattern	embeddings, metadata filters, vector index type, refresh, and authorization
approximate nearest-neighbor vector search	HNSW or IVF-style index concepts	recall/latency tradeoffs and index maintenance, not normal B-tree thinking
technical metadata for query engines	Glue Data Catalog or Hive metastore	partition sync, schema drift, crawler behavior, and table definitions
business data catalog and projects	SageMaker Catalog or SageMaker Unified Studio concepts	domain, domain unit, project, ownership, and access governance
lineage and schema evolution	catalog metadata, DMS/SCT where migration is involved, and lineage-aware tools	compatibility, downstream contracts, and audit evidence

5) Storage and analytics service selection (Domain 2/3)

Athena vs Redshift (exam picker)

You need…	Best-fit	Why
Ad hoc SQL on S3	Athena	Serverless, pay per scan
High concurrency BI dashboards	Redshift	Warehouse optimization + caching
Query S3 from Redshift	Redshift Spectrum	External tables on S3

Redshift data loading (high yield)

Use COPY from S3 for fast loads: it loads files in parallel, so split large inputs into multiple files, and it can read columnar formats such as Parquet and ORC.
Use UNLOAD to export query results back to S3.

6) SQL patterns (Domain 1/3)

Partition pruning (Athena mindset)

If your table is partitioned by dt, always filter by it:

    -- dt is the partition column, so this filter prunes partitions
    SELECT *
    FROM curated.events
    WHERE dt = '2025-12-12'
      AND event_type = 'purchase';

CTAS for repeatable outputs (Athena)

    -- Athena CTAS: partition columns must come last in the SELECT list
    CREATE TABLE curated.daily_sales
    WITH (format = 'PARQUET', partitioned_by = ARRAY['dt'])
    AS
    SELECT customer_id, SUM(amount) AS total, dt
    FROM raw.sales
    GROUP BY dt, customer_id;

7) Orchestration and reliability (Domain 1/3)

MWAA vs Step Functions (fast picker)

You need…	Best-fit	Why
DAGs, complex dependencies, retries	MWAA (Airflow)	Mature DAG patterns
Serverless state machine orchestration	Step Functions	Visual state, retries, integration patterns

    flowchart LR
	  E["EventBridge schedule"] --> W["Workflow start"]
	  W --> I["Ingest"]
	  I --> V{"Valid?"}
	  V -->|yes| T["Transform"]
	  V -->|no| Q["Quarantine + alert"]
	  T --> C["Catalog/partitions update"]
	  C --> P["Publish dataset"]
	  P --> N["Notify (SNS)"]

High-yield reliability rules:

Design for retries + duplicates (at-least-once is normal).
Make steps idempotent (safe re-runs).
Track freshness/latency SLIs (what matters to users).
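The retry rule above pairs naturally with exponential backoff and jitter. The helper below is a minimal sketch, not a replacement for the retry settings built into Step Functions or Airflow; the function name, defaults, and the injectable `sleep` parameter (there so the delay can be stubbed out) are illustrative choices. It is only safe to wrap steps that are idempotent.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    step: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call step until it succeeds or attempts run out, backing off between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the failure to the orchestrator
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))  # full jitter spreads out retries
    raise ValueError("attempts must be at least 1")


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_step() -> str:
        state["calls"] += 1
        if state["calls"] < 3:
            raise RuntimeError("transient failure")
        return "ok"

    print(retry(flaky_step, sleep=lambda seconds: None), state["calls"])
```

Re-raising after the last attempt matters: a step that swallows its final failure hides the problem from alarms and from the quarantine path.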

8) Monitoring and troubleshooting (Domain 3)

What to monitor

Pipeline health: failures, retries, runtime, backlog/lag
Freshness: “is today’s partition present?”
Cost: scan volume (Athena), cluster usage (EMR/Redshift), data transfer
Security/audit: access logs, permission changes
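The freshness question ("is today's partition present?") reduces to a small check once the partition values are listed. The sketch assumes partitions are ISO dates in a `dt` column and that the caller has already fetched them from the catalog; the function names and the lag parameter are hypothetical.

```python
from datetime import date, timedelta
from typing import Iterable, Optional


def latest_partition(partitions: Iterable[str]) -> Optional[date]:
    """Newest dt value among ISO-formatted partition values, or None if empty."""
    parsed = [date.fromisoformat(value) for value in partitions]
    return max(parsed) if parsed else None


def is_fresh(partitions: Iterable[str], today: date, max_lag_days: int = 0) -> bool:
    """True when the newest partition is no older than max_lag_days."""
    newest = latest_partition(partitions)
    return newest is not None and (today - newest) <= timedelta(days=max_lag_days)


if __name__ == "__main__":
    registered = ["2025-12-10", "2025-12-11"]
    print(is_fresh(registered, today=date(2025, 12, 12)))  # today's partition is missing
    print(is_fresh(registered, today=date(2025, 12, 12), max_lag_days=1))
```

The boolean would typically be published as a metric so an alarm, rather than a dashboard user, notices the gap.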

Common AWS tooling:

CloudWatch (metrics/logs/alarms, Logs Insights)
CloudTrail (API calls; audit)
Macie (PII discovery; policy violations)

Troubleshooting map

Symptom	First evidence	Strong first fix
newest data missing in Athena	partition catalog, crawler run, partition projection, S3 prefix	sync partitions or repair table design before rerunning all ETL
Glue job slow or failing	job logs, worker sizing, input file count, skew, shuffle behavior	fix partitioning/file sizes/skew or right-size workers based on evidence
stream lag grows	shard/partition throughput, consumer errors, checkpoint age, downstream writes	scale consumers or shards, batch correctly, and remove downstream bottleneck
Redshift load slow	COPY pattern, file sizes, compression, distribution/sort design, WLM pressure	stage files correctly and tune load/query design instead of row-by-row inserts
data quality alert fires	failed rule, source change, schema version, quarantine location	quarantine bad records, alert owner, and preserve replay path
audit question asks “who changed it”	CloudTrail or CloudTrail Lake, Config timeline, service logs	use API/config evidence, not only application logs

9) Data quality (Domain 3)

Data quality dimensions (memorize)

Dimension	Example check
Completeness	Required fields not null
Consistency	Same customer_id format across sources
Accuracy	Values within expected ranges
Integrity	Valid foreign keys / referential relationships

High-yield pattern: run checks in-pipeline, quarantine bad records, and alert.
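The check, quarantine, and alert pattern can be sketched as a rule table plus a splitter. The two rules below mirror the completeness and accuracy rows; the field names, the 0 to 10,000 range, and the `_failed_rules` annotation are made-up examples, and a managed option such as Glue Data Quality would replace this in practice.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, object]
Rule = Callable[[Record], bool]

# Hypothetical rules keyed by a human-readable name used in alerts.
RULES: Dict[str, Rule] = {
    "completeness: customer_id present": lambda r: r.get("customer_id") not in (None, ""),
    "accuracy: amount between 0 and 10000": lambda r: (
        isinstance(r.get("amount"), (int, float)) and 0 <= r["amount"] <= 10_000
    ),
}


def split_records(
    records: Iterable[Record],
    rules: Dict[str, Rule] = RULES,
) -> Tuple[List[Record], List[Record]]:
    """Return (good, quarantined); quarantined rows carry the rules they failed."""
    good: List[Record] = []
    quarantined: List[Record] = []
    for record in records:
        failed = [name for name, rule in rules.items() if not rule(record)]
        if failed:
            quarantined.append({**record, "_failed_rules": failed})
        else:
            good.append(record)
    return good, quarantined


if __name__ == "__main__":
    rows = [
        {"customer_id": "c1", "amount": 25.0},
        {"customer_id": None, "amount": 25.0},
        {"customer_id": "c3", "amount": -5},
    ]
    good_rows, bad_rows = split_records(rows)
    print(len(good_rows), "good;", len(bad_rows), "quarantined")
```

Keeping the failed rule names with each quarantined row gives the alert owner the evidence needed to correct the source and replay.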

10) Security and governance (Domain 4)

Lake Formation (why it’s a big deal)

Lake Formation helps you manage fine-grained permissions for data in S3 across engines like Athena/EMR/Redshift Spectrum, using a consistent governance model.

Encryption and key points

Prefer SSE-KMS for S3 and service-level encryption for analytics services.
Use TLS for encryption in transit.
Don’t log secrets or raw PII; keep logs access-controlled.

Audit readiness checklist

CloudTrail enabled and centralized (optionally CloudTrail Lake for queries)
CloudWatch Logs retention + encryption set
Access to logs is restricted (separation of duties)
Data sharing has explicit approvals and is traceable

Governance and privacy chooser

Requirement	Strong first fit	Watch for
fine-grained access across Athena, EMR, Redshift Spectrum, and S3 lake data	Lake Formation with Glue Data Catalog	tag-based permissions, row/column controls, cross-account sharing, and IAM interaction
PII discovery in S3	Macie with classification workflow	discovery must lead to masking, access control, encryption, retention, or remediation
data masking or anonymization	transformation-time masking, view-level controls, or service-specific masking	do not leave raw sensitive data broadly queryable
cross-account encrypted data access	KMS key policy, grants, IAM, Lake Formation/resource policy	both data permission and key permission must allow the path
centralized audit queries	CloudTrail Lake, Athena over logs, CloudWatch Logs Insights, or OpenSearch based on volume/use case	retention, encryption, query access, and separation of duties
data sovereignty	Region restrictions, replication controls, backup policy, and governance rules	backup and replication can violate residency requirements if ignored
governed data projects	SageMaker Catalog projects, domains, domain units, or appropriate catalog governance	ownership and access model must be explicit, not just a bucket folder

Next steps

Use Resources to stay anchored to the official exam guide and core analytics docs.
Use the FAQ to confirm expected depth, candidate profile, and service coverage.
Turn your weak rows into replayable scenario prompts and drill them under time.


Revised on Monday, June 15, 2026
