Study DEA-C01 Catalogs, Crawlers and Metadata: key concepts, common traps, and exam decision cues.
Data platforms fail quickly when metadata becomes stale or unreadable. DEA-C01 expects you to know the role of the catalog, crawlers, partitions, and schema tracking because many downstream services depend on them.
Data catalog: Central metadata layer that tells query and pipeline services what data exists, where it lives, and how it is structured.
Crawler: AWS Glue feature that scans data sources and infers schemas and partitions for catalog entries.
Metadata freshness: Keeping schema, partition, and connection details current enough that downstream consumers can still query or process the data correctly.
AWS wants you to separate:
DEA-C01 often hides metadata failures inside query problems. The data may exist and still be unusable because the catalog, partition state, or schema registration is stale or wrong.
| Requirement | Strongest first fit | Why |
|---|---|---|
| multiple analytics services need one shared metadata layer | AWS Glue Data Catalog | It is the common metadata plane for many AWS analytics patterns |
| schemas should be discovered from changing file-based datasets | AWS Glue crawler | The need is automated schema and partition discovery |
| table definition must stay tightly controlled | manual catalog table definition | DEA-C01 expects manual control when crawler guesses could break consumers |
| new external source or target must be cataloged through Glue | Glue connection | The issue is connectivity and catalog integration, not just schema text |
| the requirement is business metadata, ownership, or lineage discovery | business data catalog tooling such as SageMaker Catalog | The need goes beyond raw technical schema registration |
| If the stem emphasizes… | Think first | Why this fits |
|---|---|---|
| one shared metadata plane for analytics services | Glue Data Catalog | This is the technical metadata system of record |
| schema and partition discovery from changing files | Glue crawler | The center of gravity is automated inference |
| exact schema control and stable curated definitions | manual catalog table definition | Controlled tables may be safer than crawler guesses |
| reaching external or managed sources through Glue | Glue connection | The problem is integration and connectivity metadata |
| ownership, glossary, lineage, and stewardship | business data catalog | This goes beyond table registration |
This path matters because the data is not useful to analysts until the metadata layer is usable too.
```mermaid
flowchart LR
    A["Data source"] --> B{"How is metadata created?"}
    B -->|Automatic discovery| C["Glue crawler"]
    B -->|Controlled definition| D["Manual table definition"]
    A --> E["Glue connection when source integration is needed"]
    C --> F["Glue Data Catalog"]
    D --> F
    E --> F
    F --> G["Athena, Glue, Spectrum, downstream analytics"]
```
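
The controlled-definition path in the flowchart can be sketched as a hand-written table definition. This is a minimal sketch, not a definitive implementation: the table name, bucket, columns, and the `build_table_input` helper are hypothetical, and it assumes a Parquet dataset with one Hive-style partition key. The dictionary follows the `TableInput` shape that the Glue `create_table` API accepts.

```python
from typing import Dict, List, Tuple

# Standard Hive class names for Parquet-backed external tables.
PARQUET_INPUT = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
PARQUET_OUTPUT = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"
PARQUET_SERDE = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"


def build_table_input(
    name: str,
    location: str,
    columns: List[Tuple[str, str]],
    partition_keys: List[Tuple[str, str]],
) -> Dict:
    """Build a Glue TableInput dict with an explicit, hand-controlled schema."""

    def as_cols(pairs: List[Tuple[str, str]]) -> List[Dict[str, str]]:
        return [{"Name": col, "Type": col_type} for col, col_type in pairs]

    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "PartitionKeys": as_cols(partition_keys),
        "StorageDescriptor": {
            "Columns": as_cols(columns),
            "Location": location,
            "InputFormat": PARQUET_INPUT,
            "OutputFormat": PARQUET_OUTPUT,
            "SerdeInfo": {"SerializationLibrary": PARQUET_SERDE},
        },
    }


# Hypothetical curated table; all names and paths are illustrative only.
table_input = build_table_input(
    name="orders_curated",
    location="s3://example-bucket/curated/orders/",
    columns=[("order_id", "string"), ("amount", "double")],
    partition_keys=[("dt", "string")],
)

# With boto3 installed and credentials configured, this could be registered via:
#   boto3.client("glue").create_table(DatabaseName="analytics", TableInput=table_input)
```

Because every column and type is written out by hand, an odd file landing in the prefix cannot change the schema that consumers see.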
| Situation | Crawler is stronger first | Manual definition is stronger first |
|---|---|---|
| raw landing zone with many changing folders | yes | no |
| highly controlled curated tables | sometimes | often |
| partition discovery should happen automatically | yes | no |
| schema mistakes would be expensive or disruptive | maybe | yes |
| custom table properties or exact column intent matter | weak | strong |
| Problem | Better reading |
|---|---|
| new partition folders exist but queries miss them | synchronize partitions or refresh the catalog metadata |
| crawler guessed the wrong schema from mixed files | narrow the crawler scope or define the table manually |
| the data exists in storage but Athena cannot query it | check the catalog entry, schema, partitions, and location mapping first |
| teams know where data lives but not what it means | a business data catalog may be the missing layer |
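
For the "crawler guessed the wrong schema from mixed files" row, one way to narrow the crawler scope is to point it at a single prefix, exclude stray files, and log schema changes instead of applying them. The sketch below only builds the request dictionary. The `build_crawler_request` helper, role ARN, database, path, and exclusion patterns are all hypothetical, while the keys follow the parameters of the Glue `create_crawler` API.

```python
from typing import Dict, List


def build_crawler_request(
    name: str,
    role_arn: str,
    database: str,
    s3_path: str,
    exclusions: List[str],
) -> Dict:
    """Build create_crawler keyword arguments for one narrowly scoped S3 target."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [
                # Exclusions are glob patterns relative to the include path.
                {"Path": s3_path, "Exclusions": list(exclusions)}
            ]
        },
        # Log schema changes and deletions rather than rewriting the table.
        "SchemaChangePolicy": {
            "UpdateBehavior": "LOG",
            "DeleteBehavior": "LOG",
        },
    }


# Hypothetical values for illustration only.
crawler_request = build_crawler_request(
    name="orders-raw-crawler",
    role_arn="arn:aws:iam::123456789012:role/example-glue-crawler-role",
    database="analytics",
    s3_path="s3://example-bucket/raw/orders/",
    exclusions=["**/_tmp/**", "**.json"],
)

# With boto3 installed and credentials configured:
#   boto3.client("glue").create_crawler(**crawler_request)
```

If the schema still drifts under a scope this narrow, that is usually the cue to switch to a manual table definition.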
When metadata answers blur together, use this order:
| Situation | Stronger first answer |
|---|---|
| new partitions exist but queries miss them | refresh or synchronize partition metadata |
| curated tables must not drift because of odd files | manual table control |
| many changing raw folders need discovery | crawler |
| users need owner, glossary, and lineage context | business catalog layer |
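
The "new partitions exist but queries miss them" case can be illustrated by comparing partition folders in storage with the partitions the catalog already knows, then generating Athena `ALTER TABLE ... ADD PARTITION` statements for the gap. This is a sketch under stated assumptions: Hive-style `dt=<value>` prefixes, a hypothetical table and bucket, and hypothetical helper names. For Hive-style layouts, running `MSCK REPAIR TABLE` in Athena or rerunning a crawler are alternative ways to synchronize partitions.

```python
from typing import Iterable, List


def missing_partitions(
    s3_prefixes: Iterable[str],
    catalog_values: Iterable[str],
    key: str = "dt",
) -> List[str]:
    """Return partition values found in S3 keys but absent from the catalog."""
    marker = f"{key}="
    found = set()
    for prefix in s3_prefixes:
        for segment in prefix.strip("/").split("/"):
            if segment.startswith(marker):
                found.add(segment[len(marker):])
    return sorted(found - set(catalog_values))


def add_partition_sql(
    table: str,
    base_location: str,
    values: Iterable[str],
    key: str = "dt",
) -> List[str]:
    """Build one ALTER TABLE ADD PARTITION statement per missing value."""
    base = base_location.rstrip("/")
    return [
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({key}='{value}') LOCATION '{base}/{key}={value}/'"
        for value in values
    ]


# Hypothetical listing of S3 keys and of partitions already in the catalog.
s3_keys = [
    "curated/orders/dt=2024-05-01/part-0.parquet",
    "curated/orders/dt=2024-05-02/part-0.parquet",
    "curated/orders/dt=2024-05-03/part-0.parquet",
]
cataloged = ["2024-05-01"]

missing = missing_partitions(s3_keys, cataloged)
statements = add_partition_sql(
    "orders_curated", "s3://example-bucket/curated/orders/", missing
)
for statement in statements:
    print(statement)
```

Here the files for the two newer dates exist in S3, but Athena would return rows only for `2024-05-01` until the generated statements (or an equivalent partition sync) are run.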
| Trap | Better reading |
|---|---|
| “The data is in S3, so Athena will just find it.” | DEA-C01 expects a usable catalog and correct metadata, not just stored files. |
| “Crawler output is always correct.” | Crawlers are helpful, but manual control is stronger when schema drift would hurt. |
| “Cataloging is only about column names.” | Metadata also includes partitions, locations, and connection context. |
| “Business cataloging and technical cataloging are the same thing.” | DEA-C01 can distinguish technical schema discovery from broader data-governance and lineage needs. |
A team stores new Parquet files in partitioned S3 prefixes every day. Athena queries work only for old partitions, and a crawler sometimes misreads rare optional columns. What is the strongest reading first?
Correct answer: B. DEA-C01 expects you to preserve the catalog layer, keep partitions synchronized, and choose more controlled schema management when crawler inference becomes risky.