AWS DEA-C01 Glossary: Lakehouse, Catalog, and Ingestion Terms

AWS DEA-C01 glossary of lakehouse, catalog, ingestion, transformation, and governance terms.

Use this glossary when streaming, lake, catalog, and warehouse terms start to blur together. Keep it beside the Cheat Sheet and Resources pages; it is a quick reference, not a substitute for studying real service trade-offs.

| Term | Short meaning |
| --- | --- |
| Data lake | Centralized storage pattern that keeps raw and curated data available for many engines and consumers |
| Data warehouse | Analytics platform optimized for structured SQL reporting and performance-tuned query workloads |
| CDC | Change data capture, where inserts, updates, and deletes are emitted as downstream change events |
| Backfill | Reprocessing historical data to fill gaps or rebuild downstream tables |
| Schema evolution | Controlled change in table or event structure over time |
| Partition pruning | Query engine reading only the partitions needed for a specific filter |
| Glue Data Catalog | AWS metadata catalog used by services such as Athena, Glue, and Redshift Spectrum |
| Crawler | Glue process that discovers schema and partitions from source data |
| Job bookmark | Glue tracking mechanism that helps avoid reprocessing already handled data |
| Checkpoint | Persisted processing position used to resume or replay safely |
| Lake Formation | AWS governance layer for permissions and controls on S3-backed data lakes |
| Dimensional model | Analytics modeling pattern built around facts and dimensions for reporting |
| UNLOAD | Redshift command that exports query results or table data to Amazon S3 |
| TTL | Time to live setting that lets DynamoDB expire items automatically after a validity window |
| Skew | Uneven data distribution that makes one worker or partition handle far more work than others |
| Lineage | Record of where data came from and how it changed across the pipeline |
| Least privilege | Granting only the actions and resource scope a workload really needs |
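
As a rough illustration of the partition pruning entry, the sketch below filters a list of object keys down to the partitions a date filter actually needs. It assumes a Hive-style `dt=YYYY-MM-DD` key layout; the key names and the `partition_value` and `prune` helpers are hypothetical and only mimic what a query engine does internally, not how any AWS service implements it.

```python
from datetime import date
from typing import List, Optional

# Hypothetical Hive-style object keys; the layout and names are illustrative only.
KEYS = [
    "sales/dt=2024-01-01/part-0000.parquet",
    "sales/dt=2024-01-02/part-0000.parquet",
    "sales/dt=2024-01-02/part-0001.parquet",
    "sales/dt=2024-01-03/part-0000.parquet",
]


def partition_value(key: str, column: str) -> Optional[str]:
    """Return the value of `column` encoded in a Hive-style key, or None."""
    for segment in key.split("/"):
        name, sep, value = segment.partition("=")
        if sep and name == column:
            return value
    return None


def prune(keys: List[str], column: str, start: date, end: date) -> List[str]:
    """Keep only keys whose partition date falls within [start, end]."""
    kept = []
    for key in keys:
        value = partition_value(key, column)
        if value is not None and start <= date.fromisoformat(value) <= end:
            kept.append(key)
    return kept


# A filter on a single day only needs the files under that day's partition.
selected = prune(KEYS, "dt", date(2024, 1, 2), date(2024, 1, 2))
```

The point of the sketch is that the decision is made from the key (partition metadata) alone, so files outside the filter are never opened.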

Commonly confused pairs

| Pair | Keep this distinction clear |
| --- | --- |
| Athena vs Redshift | serverless SQL on S3 versus managed warehouse for broader analytical workloads |
| crawler vs explicit schema | automatic discovery versus manual metadata control |
| checkpoint vs bookmark | generic stream or pipeline progress marker versus Glue-specific processed-state tracking |
| CDC vs full load | incremental source changes versus complete dataset copy |
| governance vs encryption | access and audit control versus protection of data at rest or in transit |
| Lake Formation vs IAM | governed lake-data permissions versus baseline AWS identity and service permissions |
| Glue vs EMR | managed/serverless ETL path versus cluster-control big-data processing |
| masking vs encryption | obfuscating exposed values versus cryptographically protecting stored or transmitted data |
| EventBridge vs Step Functions | triggering or routing events versus coordinating multi-step workflow logic |
| versioning vs lifecycle | object-history recovery versus age-based storage-tiering or expiration rules |
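
To make the CDC versus full load distinction concrete, here is a minimal sketch that builds a target table from a complete snapshot and then keeps it current by applying change events. The event shape (`op`, `id`, `data`) and the function names are assumptions made for illustration; real CDC tools emit their own record formats.

```python
def full_load(snapshot):
    """Full load: rebuild the target from a complete copy of the source."""
    return {row["id"]: dict(row) for row in snapshot}


def apply_cdc(target, events):
    """CDC: apply ordered insert/update/delete events to an existing target."""
    for event in events:
        op, key = event["op"], event["id"]
        if op == "delete":
            target.pop(key, None)  # deleting a missing key is a no-op
        elif op in ("insert", "update"):
            target[key] = {"id": key, **event["data"]}
        else:
            raise ValueError(f"unknown op: {op}")
    return target


# Initial full load copies every row from the (hypothetical) source snapshot.
source_snapshot = [{"id": 1, "status": "new"}, {"id": 2, "status": "new"}]
target = full_load(source_snapshot)

# Later, only the changes travel downstream instead of the whole dataset.
changes = [
    {"op": "update", "id": 1, "data": {"status": "shipped"}},
    {"op": "delete", "id": 2},
    {"op": "insert", "id": 3, "data": {"status": "new"}},
]
target = apply_cdc(target, changes)
```

In this sketch the full load cost grows with the size of the source, while the CDC step grows only with the number of changes, which is the trade-off the pair is meant to capture.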

If three terms blur together

| Blur cluster | Keep this separation clear |
| --- | --- |
| Athena / Redshift / QuickSight | query engine on S3 / warehouse analytics engine / BI presentation layer |
| EventBridge / Step Functions / SNS | trigger / orchestrator / notification fan-out |
| KMS / Lake Formation / IAM | key management / governed lake access / baseline AWS permissions |
| crawler / catalog / business catalog | discovery mechanism / technical metadata layer / ownership-lineage-governance context |

If the confusion is really about…

| Topic family | Best page to revisit |
| --- | --- |
| service fit and high-yield trade-offs | Cheat Sheet |
| current AWS facts and primary docs | Resources |
| pacing and review order | Study Plan |
| overall exam framing | Guide root |

Revised on Sunday, May 10, 2026