AWS DEA-C01 glossary of lakehouse, catalog, ingestion, transformation, and governance terms.
Use this glossary when streaming, lake, catalog, and warehouse terms start to blur together. Keep it beside the Cheat Sheet and Resources pages; it does not replace studying real service trade-offs.

| Term | Short meaning |
|---|---|
| Data lake | Centralized storage pattern that keeps raw and curated data available for many engines and consumers |
| Data warehouse | Analytics platform optimized for structured SQL reporting and performance-tuned query workloads |
| CDC | Change data capture, where source inserts, updates, and deletes are emitted as change events for downstream consumers |
| Backfill | Reprocessing historical data to fill gaps or rebuild downstream tables |
| Schema evolution | Controlled change in table or event structure over time |
| Partition pruning | Query engine reading only the partitions needed for a specific filter |
| Glue Data Catalog | AWS metadata catalog used by services such as Athena, Glue, and Redshift Spectrum |
| Crawler | Glue process that discovers schema and partitions from source data |
| Job bookmark | Glue tracking mechanism that helps avoid reprocessing already handled data |
| Checkpoint | Persisted processing position used to resume or replay safely |
| Lake Formation | AWS governance layer for permissions and controls on S3-backed data lakes |
| Dimensional model | Analytics modeling pattern built around facts and dimensions for reporting |
| UNLOAD | Redshift command that exports query results or table data to Amazon S3 |
| TTL | Time-to-live setting that lets DynamoDB delete items automatically once their expiry timestamp has passed |
| Skew | Uneven data distribution that makes one worker or partition handle far more work than others |
| Lineage | Record of where data came from and how it changed across the pipeline |
| Least privilege | Granting only the actions and resource scope a workload really needs |
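
As a rough illustration of the partition pruning entry above, the sketch below filters Hive-style object keys by a date partition before any file is opened. The key layout, the `dt` partition key, and the helper names `partition_value` and `prune` are assumptions made for this example; real engines such as Athena resolve partitions through catalog metadata rather than by parsing keys like this.

```python
from datetime import date
from typing import List, Optional

# Hypothetical Hive-style object keys; the layout and the `dt` key are assumptions.
OBJECT_KEYS = [
    "sales/dt=2024-01-01/part-0000.parquet",
    "sales/dt=2024-01-02/part-0000.parquet",
    "sales/dt=2024-01-02/part-0001.parquet",
    "sales/dt=2024-01-03/part-0000.parquet",
]


def partition_value(key: str, partition_key: str = "dt") -> Optional[str]:
    """Return the value of a `name=value` path segment, or None if absent."""
    for segment in key.split("/"):
        name, sep, value = segment.partition("=")
        if sep and name == partition_key:
            return value
    return None


def prune(keys: List[str], start: date, end: date) -> List[str]:
    """Keep only keys whose dt partition falls within [start, end]."""
    selected = []
    for key in keys:
        value = partition_value(key)
        if value is not None and start <= date.fromisoformat(value) <= end:
            selected.append(key)
    return selected


# A filter on a single day touches two of the four files.
to_read = prune(OBJECT_KEYS, date(2024, 1, 2), date(2024, 1, 2))
print(to_read)
```

The point to remember is that the filter is applied to partition metadata, so files outside the requested range are never scanned.
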

| Pair | Keep this distinction clear |
|---|---|
| Athena vs Redshift | serverless SQL on S3 versus managed warehouse for broader analytical workloads |
| crawler vs explicit schema | automatic discovery versus manual metadata control |
| checkpoint vs bookmark | generic stream or pipeline progress marker versus Glue-specific processed-state tracking |
| CDC vs full load | incremental source changes versus complete dataset copy |
| governance vs encryption | access and audit control versus protection of data at rest or in transit |
| Lake Formation vs IAM | governed lake-data permissions versus baseline AWS identity and service permissions |
| Glue vs EMR | managed, serverless ETL versus big-data processing where you control the cluster |
| masking vs encryption | obfuscating exposed values versus cryptographically protecting stored or transmitted data |
| EventBridge vs Step Functions | triggering or routing events versus coordinating multi-step workflow logic |
| versioning vs lifecycle | object-history recovery versus age-based storage-tiering or expiration rules |
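
The CDC versus full load pair in the table above can be sketched with plain dictionaries. This is a minimal illustration, not any AWS service's API: the event shape (`op` plus `row`), the `id` key, and the function names `full_load` and `apply_cdc` are all assumptions made for the example.

```python
from typing import Dict, Iterable, List


def full_load(source_rows: Iterable[dict]) -> Dict[int, dict]:
    """Full load: rebuild the target from a complete copy of the source."""
    return {row["id"]: dict(row) for row in source_rows}


def apply_cdc(target: Dict[int, dict], events: List[dict]) -> Dict[int, dict]:
    """CDC: apply only the incremental changes, in order, to an existing target."""
    for event in events:
        op, row = event["op"], event["row"]
        if op in ("insert", "update"):
            target[row["id"]] = dict(row)  # upsert the latest row image
        elif op == "delete":
            target.pop(row["id"], None)  # tolerate an already-missing row
        else:
            raise ValueError(f"unknown operation: {op}")
    return target


# Initial full copy, then a small batch of hypothetical change events.
target = full_load([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
changes = [
    {"op": "update", "row": {"id": 1, "name": "c"}},
    {"op": "delete", "row": {"id": 2}},
    {"op": "insert", "row": {"id": 3, "name": "d"}},
]
apply_cdc(target, changes)
print(target)
```

The full load reads every source row, while the CDC path touches only the three changed keys, which is why CDC usually suits large tables with a low change rate.
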

| Blur cluster | Keep this separation clear |
|---|---|
| Athena / Redshift / QuickSight | query engine on S3 / warehouse analytics engine / BI presentation layer |
| EventBridge / Step Functions / SNS | trigger / orchestrator / notification fan-out |
| KMS / Lake Formation / IAM | key management / governed lake access / baseline AWS permissions |
| crawler / catalog / business catalog | discovery mechanism / technical metadata layer / ownership-lineage-governance context |

| Topic family | Best page to revisit |
|---|---|
| service fit and high-yield trade-offs | Cheat Sheet |
| current AWS facts and primary docs | Resources |
| pacing and review order | Study Plan |
| overall exam framing | Guide root |