DEA-C01 Catalogs, Crawlers and Metadata Guide

Study DEA-C01 Catalogs, Crawlers and Metadata: key concepts, common traps, and exam decision cues.

Data platforms fail quickly when metadata goes stale or becomes unreadable. DEA-C01 expects you to know the role of the catalog, crawlers, partitions, and schema tracking because many downstream services depend on them.

Data catalog: Central metadata layer that tells query and pipeline services what data exists, where it lives, and how it is structured.

Crawler: AWS Glue feature that scans data sources and infers schemas and partitions for catalog entries.

Metadata freshness: Keeping schema, partition, and connection details current enough that downstream consumers can still query or process the data correctly.

What AWS is really testing here

AWS wants you to separate:

  • storing data from describing data
  • automatic discovery from controlled metadata management
  • technical schema registration from business cataloging and lineage needs
  • “the files exist in S3” from “Athena, Glue, or Redshift Spectrum can actually use them”

DEA-C01 often hides metadata failures inside query problems. The data may exist and still be unusable because the catalog, partition state, or schema registration is stale or wrong.
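
The gap between "the files exist in S3" and "the catalog knows about them" can be sketched as a simple set difference. This is a minimal illustration, assuming Hive-style partition folders (`dt=2024-01-01/`); the function names, the `sales/` prefix, and the sample keys are hypothetical. In practice the two input lists would come from an S3 listing and from the catalog's partition listing.

```python
def partition_values_from_key(key, table_prefix):
    """Extract Hive-style partition values (e.g. dt=2024-01-01) from an S3 key."""
    relative = key[len(table_prefix):] if key.startswith(table_prefix) else key
    folders = relative.split("/")[:-1]  # drop the file name
    return tuple(f.split("=", 1)[1] for f in folders if "=" in f)


def missing_partitions(s3_keys, catalog_partitions, table_prefix):
    """Return partition value tuples present in storage but absent from the catalog."""
    in_storage = {partition_values_from_key(k, table_prefix) for k in s3_keys}
    in_storage.discard(())  # ignore keys that sit outside any partition folder
    in_catalog = {tuple(p) for p in catalog_partitions}
    return sorted(in_storage - in_catalog)


# Hypothetical sample data: three days of files, two registered partitions.
keys = [
    "sales/dt=2024-01-01/part-0000.parquet",
    "sales/dt=2024-01-02/part-0000.parquet",
    "sales/dt=2024-01-03/part-0000.parquet",
]
registered = [["2024-01-01"], ["2024-01-02"]]

print(missing_partitions(keys, registered, "sales/"))  # [('2024-01-03',)]
```

Any partition this check reports is data that exists in storage but that a query engine reading the catalog would not see.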

Cataloging chooser

| Requirement | Strongest first fit | Why |
| --- | --- | --- |
| multiple analytics services need one shared metadata layer | AWS Glue Data Catalog | It is the common metadata plane for many AWS analytics patterns |
| schemas should be discovered from changing file-based datasets | AWS Glue crawler | The need is automated schema and partition discovery |
| table definition must stay tightly controlled | manual catalog table definition | DEA-C01 expects manual control when crawler guesses could break consumers |
| new external source or target must be cataloged through Glue | Glue connection | The issue is connectivity and catalog integration, not just schema text |
| the requirement is business metadata, ownership, or lineage discovery | business data catalog tooling such as SageMaker Catalog | The need goes beyond raw technical schema registration |

Catalog, crawler, connection, and business context are different layers

| If the stem emphasizes… | Think first | Why this fits |
| --- | --- | --- |
| one shared metadata plane for analytics services | Glue Data Catalog | This is the technical metadata system of record |
| schema and partition discovery from changing files | Glue crawler | The center of gravity is automated inference |
| exact schema control and stable curated definitions | manual catalog table definition | Controlled tables may be safer than crawler guesses |
| reaching external or managed sources through Glue | Glue connection | The problem is integration and connectivity metadata |
| ownership, glossary, lineage, and stewardship | business data catalog | This goes beyond table registration |

Metadata control path

This path matters because the data is not useful to analysts until the metadata layer is usable too.

    flowchart LR
      A["Data source"] --> B{"How is metadata created?"}
      B -->|Automatic discovery| C["Glue crawler"]
      B -->|Controlled definition| D["Manual table definition"]
      A --> E["Glue connection when source integration is needed"]
      C --> F["Glue Data Catalog"]
      D --> F
      E --> F
      F --> G["Athena, Glue, Spectrum, downstream analytics"]

Crawlers versus manual definitions

| Situation | Crawler is stronger first | Manual definition is stronger first |
| --- | --- | --- |
| raw landing zone with many changing folders | yes | no |
| highly controlled curated tables | sometimes | often |
| partition discovery should happen automatically | yes | no |
| schema mistakes would be expensive or disruptive | maybe | yes |
| custom table properties or exact column intent matter | weak | strong |
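
A manual definition means you state the columns, partition keys, and location yourself instead of letting a crawler infer them. The sketch below builds a dictionary in the shape that the boto3 Glue client's `create_table` call accepts as `TableInput` for a Parquet table. The helper name, the table name, the `curated` database, the bucket, and the columns are all hypothetical; only the dictionary is built here, and the actual API call is shown as a comment because it needs AWS credentials.

```python
def parquet_table_input(name, location, columns, partition_keys):
    """Build a TableInput-shaped dict for an external Parquet table."""
    def as_columns(pairs):
        return [{"Name": col, "Type": col_type} for col, col_type in pairs]

    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "PartitionKeys": as_columns(partition_keys),
        "StorageDescriptor": {
            "Columns": as_columns(columns),
            "Location": location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
            },
        },
    }


# Hypothetical curated table: exact column types are stated, not inferred.
table_input = parquet_table_input(
    name="orders",
    location="s3://example-bucket/curated/orders/",
    columns=[("order_id", "string"), ("amount", "decimal(12,2)")],
    partition_keys=[("dt", "string")],
)

# With credentials configured, registration would look roughly like:
#   import boto3
#   boto3.client("glue").create_table(DatabaseName="curated", TableInput=table_input)
print(table_input["StorageDescriptor"]["Columns"])
```

The design point is that an odd file landing in the prefix cannot change `amount` from `decimal(12,2)` to something else, because nothing is inferring the schema.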

How strong DEA-C01 answers usually reason

  1. Ask whether the problem is technical metadata registration, automated discovery, schema control, or business context.
  2. Use Glue Data Catalog as the shared technical metadata layer.
  3. Use crawlers when discovery should be automated and the inference risk is acceptable.
  4. Prefer manual definitions when schema precision matters and crawler mistakes are costly.
  5. Separate business cataloging from the technical schema layer.
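
Step 3 above, automated discovery with acceptable inference risk, can be sketched as a crawler request with a deliberately narrow scope. This builds keyword arguments in the shape of the boto3 Glue client's `create_crawler` call: one S3 target, exclusion patterns to keep stray files out of inference, and a schema change policy of `LOG` so that schema changes to existing tables are logged rather than applied. The crawler name, role ARN, database, path, and exclusion patterns are hypothetical placeholders.

```python
def crawler_request(name, role_arn, database, s3_path, exclusions=()):
    """Build create_crawler-shaped kwargs for one narrowly scoped S3 target."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [{"Path": s3_path, "Exclusions": list(exclusions)}],
        },
        # LOG: record detected schema changes instead of rewriting the table.
        "SchemaChangePolicy": {"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
    }


# Hypothetical values throughout; nothing here is a real account or bucket.
request = crawler_request(
    name="raw-events-crawler",
    role_arn="arn:aws:iam::123456789012:role/example-glue-crawler-role",
    database="raw",
    s3_path="s3://example-bucket/raw/events/",
    exclusions=["**/_tmp/**", "**.csv"],
)

# With credentials configured, creation would look roughly like:
#   import boto3
#   boto3.client("glue").create_crawler(**request)
print(request["Targets"]["S3Targets"][0])
```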

Partitions and metadata freshness

| Problem | Better reading |
| --- | --- |
| new partition folders exist but queries miss them | synchronize partitions or refresh the catalog metadata |
| crawler guessed the wrong schema from mixed files | narrow the crawler scope or define the table manually |
| the data exists in storage but is not queryable in Athena | check the catalog entry, schema, partitions, and location mapping first |
| teams know where data lives but not what it means | a business data catalog may be the missing layer |
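
For the first row, two common ways to synchronize partitions from Athena are `MSCK REPAIR TABLE`, which scans for Hive-style partition folders, and `ALTER TABLE ... ADD PARTITION`, which registers one known partition explicitly. The sketch below only builds the statement text; the helper names, the `sales` table, and the bucket path are hypothetical, and the values are assumed to be trusted because no escaping is done.

```python
def repair_table_sql(table):
    """Statement that asks Athena to discover Hive-style partition folders."""
    return f"MSCK REPAIR TABLE {table}"


def add_partition_sql(table, partition, location):
    """Statement that registers one partition explicitly (inputs assumed trusted)."""
    spec = ", ".join(f"{key}='{value}'" for key, value in partition.items())
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}'"
    )


print(repair_table_sql("sales"))
print(add_partition_sql(
    "sales",
    {"dt": "2024-01-03"},
    "s3://example-bucket/sales/dt=2024-01-03/",
))
```

The explicit form suits pipelines that already know which partition they just wrote; the repair form suits occasional catch-up on tables with many unregistered folders.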

Decision order that usually wins

When metadata answers blur together, use this order:

  1. Decide whether the issue is catalog presence, schema discovery, schema control, partition freshness, or business context.
  2. If many analytics services need one technical metadata plane, choose Glue Data Catalog.
  3. If changing raw files must be discovered automatically, choose a Glue crawler.
  4. If curated definitions must stay precise, choose manual table control over crawler convenience.
  5. If the question is really about ownership, glossary, or lineage, move to the business catalog lane instead of stopping at technical registration.
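
As a study aid, the order above reduces to "classify the issue first, then look up the first fit". The sketch below encodes that as a plain lookup; the category keys and the function name are hypothetical labels for the five issue types in step 1, not AWS terminology.

```python
# Issue category (step 1) mapped to the strongest first answer (steps 2 to 5).
FIRST_FIT = {
    "catalog_presence": "Glue Data Catalog",
    "schema_discovery": "Glue crawler",
    "schema_control": "manual table definition",
    "partition_freshness": "refresh or synchronize partition metadata",
    "business_context": "business data catalog",
}


def first_fit(issue):
    """Return the first-fit answer for a classified issue, or raise if unclassified."""
    if issue not in FIRST_FIT:
        raise ValueError(f"classify the issue first; unknown category: {issue!r}")
    return FIRST_FIT[issue]


print(first_fit("partition_freshness"))
print(first_fit("business_context"))
```

The lookup refuses to answer until the issue is classified, which mirrors the point of step 1: most wrong answers come from solving a different metadata problem than the one in the stem.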

Common tie-breaks

| Situation | Stronger first answer |
| --- | --- |
| new partitions exist but queries miss them | refresh or synchronize partition metadata |
| curated tables must not drift because of odd files | manual table control |
| many changing raw folders need discovery | crawler |
| users need owner, glossary, and lineage context | business catalog layer |

Common traps

| Trap | Better reading |
| --- | --- |
| “The data is in S3, so Athena will just find it.” | DEA-C01 expects a usable catalog and correct metadata, not just stored files. |
| “Crawler output is always correct.” | Crawlers are helpful, but manual control is stronger when schema drift would hurt. |
| “Cataloging is only about column names.” | Metadata also includes partitions, locations, and connection context. |
| “Business cataloging and technical cataloging are the same thing.” | DEA-C01 can distinguish technical schema discovery from broader data-governance and lineage needs. |

Harder scenario question

A team stores new Parquet files in partitioned S3 prefixes every day. Athena queries return data only from older partitions, and a crawler sometimes misreads rare optional columns. What is the strongest first reading?

  • A. Remove the catalog entirely and query S3 directly
  • B. Keep the Glue Data Catalog, refresh partition metadata, and use tighter crawler or manual table control
  • C. Replace Athena with Route 53
  • D. Disable partitions so the crawler has less to scan

Correct answer: B. DEA-C01 expects you to preserve the catalog layer, keep partitions synchronized, and choose more controlled schema management when crawler inference becomes risky.

Revised on Sunday, May 10, 2026