Catalogs, Crawlers, and Metadata

April 1, 2026

DEA-C01 lesson on Glue Data Catalog, crawlers, Hive metastore, partitions, connections, and business cataloging.

Data platforms fail fast when metadata goes stale or unreadable. DEA-C01 expects you to know the role of the catalog, crawlers, partitions, and schema tracking because many downstream services depend on them.

Data catalog: Central metadata layer that tells query and pipeline services what data exists, where it lives, and how it is structured.

Crawler: AWS Glue feature that scans data sources and infers schemas and partitions for catalog entries.

Metadata freshness: Keeping schema, partition, and connection details current enough that downstream consumers can still query or process the data correctly.

What AWS is really testing here

AWS wants you to separate:

storing data from describing data
automatic discovery from controlled metadata management
technical schema registration from business cataloging and lineage needs
“the files exist in S3” from “Athena, Glue, or Redshift Spectrum can actually use them”

DEA-C01 often hides metadata failures inside query problems. The data may exist and still be unusable because the catalog, partition state, or schema registration is stale or wrong.

Cataloging chooser

Requirement	Strongest first fit	Why
multiple analytics services need one shared metadata layer	AWS Glue Data Catalog	It is the common metadata plane for many AWS analytics patterns
schemas should be discovered from changing file-based datasets	AWS Glue crawler	The need is automated schema and partition discovery
table definition must stay tightly controlled	manual catalog table definition	DEA-C01 expects manual control when crawler guesses could break consumers
new external source or target must be cataloged through Glue	Glue connection	The issue is connectivity and catalog integration, not just schema text
the requirement is business metadata, ownership, or lineage discovery	business data catalog tooling such as SageMaker Catalog	The need goes beyond raw technical schema registration

Catalog, crawler, connection, and business context are different layers

If the stem emphasizes…	Think first	Why this fits
one shared metadata plane for analytics services	Glue Data Catalog	This is the technical metadata system of record
schema and partition discovery from changing files	Glue crawler	The center of gravity is automated inference
exact schema control and stable curated definitions	manual catalog table definition	Controlled tables may be safer than crawler guesses
reaching external or managed sources through Glue	Glue connection	The problem is integration and connectivity metadata
ownership, glossary, lineage, and stewardship	business data catalog	This goes beyond table registration
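
To make the connection layer concrete, here is a minimal sketch of the metadata a Glue connection carries, assuming a JDBC source whose credentials live in AWS Secrets Manager. The helper name, the host, the secret name, and the network identifiers are all hypothetical; the dictionary keys follow the `ConnectionInput` shape accepted by Glue's `CreateConnection` API.

```python
def build_jdbc_connection_input(name, jdbc_url, secret_id,
                                subnet_id, security_group_ids, availability_zone):
    """Build a ConnectionInput dict for a JDBC source (hypothetical helper)."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            # Reference a secret instead of embedding a username and password.
            "SECRET_ID": secret_id,
        },
        # Network placement Glue needs in order to reach the source.
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


# All values below are placeholders, not real resources.
conn = build_jdbc_connection_input(
    name="orders-db",
    jdbc_url="jdbc:postgresql://db.example.internal:5432/orders",
    secret_id="example/orders-db",
    subnet_id="subnet-0example",
    security_group_ids=["sg-0example"],
    availability_zone="us-east-1a",
)

# With boto3 installed and credentials configured, the dict could be sent as:
#   boto3.client("glue").create_connection(ConnectionInput=conn)
```

Note that nothing in this structure describes columns or partitions, which is why a connection is a separate layer from table registration.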

Metadata control path

This path matters because the data is not useful to analysts until the metadata layer is usable too.

    flowchart LR
      A["Data source"] --> B{"How is metadata created?"}
      B -->|Automatic discovery| C["Glue crawler"]
      B -->|Controlled definition| D["Manual table definition"]
      A --> E["Glue connection when source integration is needed"]
      C --> F["Glue Data Catalog"]
      D --> F
      E --> F
      F --> G["Athena, Glue, Spectrum, downstream analytics"]
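
The automatic-discovery branch of the path above can be sketched as a crawler request. This is an illustration under assumptions, not a recommended configuration: the helper name, role ARN, database, and bucket are hypothetical, while the keys match the parameters of Glue's `CreateCrawler` API. The default shown here logs schema changes on existing tables instead of rewriting them, which is one way to limit inference risk.

```python
ALLOWED_UPDATE_BEHAVIORS = {"LOG", "UPDATE_IN_DATABASE"}


def build_crawler_request(name, role_arn, database, s3_path,
                          exclusions=(), update_behavior="LOG"):
    """Build keyword arguments for a Glue crawler over one S3 prefix."""
    if update_behavior not in ALLOWED_UPDATE_BEHAVIORS:
        raise ValueError(f"unsupported update behavior: {update_behavior}")
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            # Exclusions are glob patterns that narrow what the crawler reads.
            "S3Targets": [{"Path": s3_path, "Exclusions": list(exclusions)}],
        },
        "SchemaChangePolicy": {
            "UpdateBehavior": update_behavior,
            "DeleteBehavior": "LOG",
        },
    }


# Placeholder values only.
request = build_crawler_request(
    name="raw-sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="raw",
    s3_path="s3://example-bucket/raw/sales/",
    exclusions=["**/_temporary/**"],
)

# With boto3 installed and credentials configured:
#   boto3.client("glue").create_crawler(**request)
```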

Crawlers versus manual definitions

Situation	Crawler is stronger first	Manual definition is stronger first
raw landing zone with many changing folders	yes	no
highly controlled curated tables	sometimes	often
partition discovery should happen automatically	yes	no
schema mistakes would be expensive or disruptive	maybe	yes
custom table properties or exact column intent matter	weak	strong
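
For contrast with crawler inference, a manual definition states every column, partition key, and location explicitly. The sketch below assumes a Parquet dataset and builds a `TableInput` dict in the shape Glue's `CreateTable` API accepts; the helper name, table name, columns, and bucket are hypothetical.

```python
def build_parquet_table_input(name, location, columns, partition_keys):
    """Build a Glue TableInput for an external Parquet table.

    columns and partition_keys are lists of (name, type) pairs.
    """
    def as_columns(pairs):
        return [{"Name": col, "Type": col_type} for col, col_type in pairs]

    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "PartitionKeys": as_columns(partition_keys),
        "StorageDescriptor": {
            "Columns": as_columns(columns),
            "Location": location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary":
                    "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
            },
        },
    }


# Placeholder schema: the optional column is declared on purpose,
# so it does not depend on which files a crawler happens to sample.
table_input = build_parquet_table_input(
    name="sales_curated",
    location="s3://example-bucket/curated/sales/",
    columns=[("order_id", "string"), ("amount", "double"), ("coupon_code", "string")],
    partition_keys=[("dt", "string")],
)

# With boto3 installed and credentials configured:
#   boto3.client("glue").create_table(DatabaseName="curated", TableInput=table_input)
```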

How strong DEA-C01 answers usually reason

1. Ask whether the problem is technical metadata registration, automated discovery, schema control, or business context.
2. Use Glue Data Catalog as the shared technical metadata layer.
3. Use crawlers when discovery should be automated and the inference risk is acceptable.
4. Prefer manual definitions when schema precision matters and crawler mistakes are costly.
5. Separate business cataloging from the technical schema layer.

Partitions and metadata freshness

Problem	Better reading
new partition folders exist but queries miss them	register the new partitions in the catalog, for example by rerunning the crawler or by running MSCK REPAIR TABLE or ALTER TABLE ADD PARTITION in Athena
crawler guessed the wrong schema from mixed files	narrow the crawler scope or define the table manually
the data exists in S3 but Athena cannot query it	check the catalog entry, schema, partitions, and location mapping first
teams know where data lives but not what it means	a business data catalog may be the missing layer
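
As a sketch of the first row, the two Athena statements commonly used to register partitions can be generated like this. The helper names, table, and bucket are hypothetical, and the string formatting does no escaping, so treat it as an illustration only. `MSCK REPAIR TABLE` relies on Hive-style `key=value` prefixes, while `ALTER TABLE ADD PARTITION` names one partition and its location explicitly.

```python
def add_partition_ddl(table, partition, location):
    """Return an Athena statement that registers one partition explicitly."""
    spec = ", ".join(f"{key}='{value}'" for key, value in partition.items())
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}'"
    )


def repair_table_ddl(table):
    """Return an Athena statement that scans for Hive-style partitions."""
    return f"MSCK REPAIR TABLE {table}"


# Placeholder table and bucket.
statement = add_partition_ddl(
    "sales",
    {"dt": "2026-04-01"},
    "s3://example-bucket/sales/dt=2026-04-01/",
)
print(statement)
print(repair_table_ddl("sales"))
```

The explicit form touches only the named partition, which tends to suit a daily load job; the repair form is simpler to write but has to list the table's whole S3 location.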

Decision order that usually wins

When metadata answers blur together, use this order:

1. Decide whether the issue is catalog presence, schema discovery, schema control, partition freshness, or business context.
2. If many analytics services need one technical metadata plane, choose Glue Data Catalog.
3. If changing raw files must be discovered automatically, choose a Glue crawler.
4. If curated definitions must stay precise, choose manual table control over crawler convenience.
5. If the question is really about ownership, glossary, or lineage, move to the business catalog lane instead of stopping at technical registration.
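
The decision order above can be written down as a small lookup, purely as a study aid. The need labels and the function name are invented for this sketch and do not come from any AWS API. Because the layers are not mutually exclusive, the function returns every matching layer in the order listed.

```python
# Hypothetical need labels paired with the layer that addresses each one,
# kept in the same order as the list above.
DECISION_ORDER = [
    ("shared_technical_metadata", "Glue Data Catalog"),
    ("automatic_discovery", "Glue crawler"),
    ("precise_curated_schema", "manual table definition"),
    ("business_context", "business data catalog"),
]


def choose_layers(needs):
    """Return the matching metadata layers, in decision order."""
    known = {label for label, _ in DECISION_ORDER}
    unknown = set(needs) - known
    if unknown:
        raise ValueError(f"unrecognized needs: {sorted(unknown)}")
    return [layer for label, layer in DECISION_ORDER if label in needs]


print(choose_layers({"business_context", "automatic_discovery"}))
```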

Common tie-breaks

Situation	Stronger first answer
new partitions exist but queries miss them	refresh or synchronize partition metadata
curated tables must not drift because of odd files	manual table control
many changing raw folders need discovery	crawler
users need owner, glossary, and lineage context	business catalog layer

Common traps

Trap	Better reading
“The data is in S3, so Athena will just find it.”	DEA-C01 expects a usable catalog and correct metadata, not just stored files.
“Crawler output is always correct.”	Crawlers are helpful, but manual control is stronger when schema drift would hurt.
“Cataloging is only about column names.”	Metadata also includes partitions, locations, and connection context.
“Business cataloging and technical cataloging are the same thing.”	DEA-C01 can distinguish technical schema discovery from broader data-governance and lineage needs.

Harder scenario question

A team stores new Parquet files in partitioned S3 prefixes every day. Athena queries return data only from older partitions, and a crawler sometimes misreads rarely populated optional columns. What is the strongest first reading?

A. Remove the catalog entirely and query S3 directly
B. Keep the Glue Data Catalog, refresh partition metadata, and use tighter crawler or manual table control
C. Replace Athena with Route 53
D. Disable partitions so the crawler has less to scan

Correct answer: B. DEA-C01 expects you to preserve the catalog layer, keep partitions synchronized, and choose more controlled schema management when crawler inference becomes risky.
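
One way to picture "keep partitions synchronized" is to compare the partition prefixes present in S3 with the partitions the catalog already knows. The sketch below works on plain lists so it runs without AWS access; the function names and sample keys are hypothetical, and it assumes Hive-style `key=value` prefixes. In practice the inputs might come from S3 object listings and Glue's `GetPartitions` API.

```python
def partitions_from_keys(keys, table_prefix):
    """Collect Hive-style partition specs found under a table prefix."""
    found = set()
    for key in keys:
        if not key.startswith(table_prefix):
            continue
        relative = key[len(table_prefix):]
        # Drop the file name, keep only key=value path segments.
        segments = tuple(s for s in relative.split("/")[:-1] if "=" in s)
        if segments:
            found.add(segments)
    return found


def missing_partitions(keys, table_prefix, cataloged):
    """Return partitions present in storage but absent from the catalog."""
    return sorted(partitions_from_keys(keys, table_prefix) - set(cataloged))


# Sample data standing in for an S3 listing and a catalog lookup.
s3_keys = [
    "sales/dt=2026-04-01/part-0.parquet",
    "sales/dt=2026-04-02/part-0.parquet",
    "sales/_SUCCESS",
]
cataloged = {("dt=2026-04-01",)}

print(missing_partitions(s3_keys, "sales/", cataloged))
```

Any partition this reports is one that exists in storage but that Athena cannot see until it is registered.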

Revised on Monday, June 15, 2026
