Databricks DE-ASSOC Cheat Sheet: Ingestion, Catalogs, and Delta

March 28, 2026

Databricks DE-ASSOC cheat sheet for ingestion, catalogs, Delta, traps, and final review.

Use this for last-mile review. Keep it open while drilling mixed questions. DE-ASSOC is usually easier when you classify the stem in this order:

1. Platform lane: workspace, compute, serverless, SQL, notebook, or Unity Catalog?
2. Pipeline lane: ingestion, transformation, Lakeflow logic, workflow scheduling, or job recovery?
3. Table/governance lane: Delta behavior, managed vs external storage, permissions, sharing, or lineage?
4. Evidence lane: logs, Spark UI, failed task pattern, skew, tiny files, or bad join logic?

DE-ASSOC section map in one screen

Official section	Best cheat-sheet focus
1. Databricks Intelligence Platform	platform fit, compute choices, serverless and SQL warehouse cues
2. Development and Ingestion	notebooks, Databricks Connect, Auto Loader, checkpoint thinking
3. Data Processing and Transformations	medallion purpose, Delta rules, DDL/DML, DataFrame and SQL patterns
4. Productionizing Data Pipelines	Asset Bundles, workflows, repair and rerun, serverless jobs, Spark UI
5. Data Governance and Quality	managed vs external tables, Unity Catalog, lineage, Delta Sharing, federation

Platform and pipeline mental model

    flowchart TD
      subgraph Platform["Platform Lane"]
        WS["Workspace + Notebooks"] --> Compute["Interactive, Job, or SQL Compute"]
      end
      subgraph Pipeline["Pipeline Lane"]
        Auto["Auto Loader / Ingest"] --> Bronze["Bronze"] --> Silver["Silver"] --> Gold["Gold"]
      end
      subgraph Production["Production + Governance"]
        Gold --> Workflows["Lakeflow / Workflows / Jobs"]
        Workflows --> Observe["Spark UI, Logs, Repair"]
        UC["Unity Catalog + Lineage + Permissions"]
      end
      Compute --> Auto
      UC -. governs .-> Bronze
      UC -. governs .-> Silver
      UC -. governs .-> Gold

DE-ASSOC answer sequence

Use this when the stem mixes workspace, compute, ingestion, governance, and production behavior.

    flowchart TD
      S["Scenario"] --> P["Classify the lane"]
      P --> W["Workspace, compute, SQL, pipeline, or governance?"]
      W --> D["Choose the right Databricks feature"]
      D --> G["Check Unity Catalog, table type, and permissions"]
      G --> O["Verify logs, Spark UI, lineage, or run recovery"]

Fast platform picker

If the question is mainly about…	Strongest first lane
interactive exploration and ad hoc transformation work	notebook on the right compute
local IDE-driven development against Databricks	Databricks Connect
incremental file discovery and append-heavy ingestion	Auto Loader
declarative ETL pipeline structure	Lakeflow Declarative Pipelines
scheduled production execution and repair	Databricks Workflows
SQL-serving, dashboards, or analyst-facing queries	SQL warehouse
governance boundary, permissions, lineage, or sharing	Unity Catalog

Compute and workload fit

Workload signal	Interactive cluster	Job compute / serverless job	SQL warehouse
notebook exploration or development loop	strongest fit	weak	weak
scheduled batch pipeline	possible but less disciplined	strongest fit	weak
analyst SQL and BI path	weak	weak	strongest fit
exam trap	using interactive compute as permanent production runtime	forgetting repair/rerun and scheduling behavior	treating it like general ETL compute

Compute traps

Trap	Better reading
“it runs in a notebook, so it belongs on an interactive cluster forever”	separate development workflow from scheduled production execution
“serverless means every workload should move there”	first classify whether the question is about SQL serving, notebook work, or job execution
“workspace” and “compute” blur together	workspace is the operating environment; compute is the execution lane

Ingestion and development picker

Requirement	Strongest first lane	Why
discover new files incrementally with less manual listing logic	Auto Loader	ingestion-first tool with checkpoint/state thinking
move local dev workflow toward Databricks execution	Databricks Connect	local IDE development against platform runtime
one-off file load from cloud storage into a table	direct batch load, for example `COPY INTO` or a plain batch read	simpler than inventing a streaming path
understand why an ingest step failed	logs, run details, and recent source/schema changes	evidence before redesign

Auto Loader cues

Cue	Fast recall
repeated file arrival over time	Auto Loader lane
checkpoint thinking	resume incremental processing safely
schema drift concern	classify whether schema should be enforced, rescued, or intentionally evolved
common trap	treating Auto Loader like a generic transformation framework instead of an ingestion lane
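
The cues above can be sketched in PySpark. This is a minimal sketch, not a reference implementation: the helper names, the paths, and the table name are hypothetical, and `spark` is assumed to be the session a Databricks notebook or job provides. The three schema choices in the table map roughly onto the `cloudFiles.schemaEvolutionMode` values `failOnNewColumns` (enforce), `rescue` (rescue), and `addNewColumns` (evolve).

```python
def autoloader_options(file_format, schema_location, evolution_mode="addNewColumns"):
    """Build reader options for an Auto Loader (cloudFiles) stream."""
    allowed = {"addNewColumns", "rescue", "failOnNewColumns", "none"}
    if evolution_mode not in allowed:
        raise ValueError(f"unsupported schema evolution mode: {evolution_mode}")
    return {
        "cloudFiles.format": file_format,
        # Where Auto Loader tracks the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.schemaEvolutionMode": evolution_mode,
    }


def start_bronze_ingest(spark, source_path, target_table, checkpoint_path, options):
    """Start an incremental ingest into a Bronze table (hypothetical helper).

    Assumes `spark` is a Databricks-provided SparkSession; not called here.
    """
    return (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .load(source_path)
        .writeStream
        # The checkpoint is what lets a restarted stream resume safely.
        .option("checkpointLocation", checkpoint_path)
        # Process everything available now, then stop (batch-style schedule).
        .trigger(availableNow=True)
        .toTable(target_table)
    )


# Hypothetical paths, for illustration only.
opts = autoloader_options("json", "/Volumes/demo/bronze/_schemas/orders", "rescue")
```

Keeping the option dictionary separate from the stream definition makes the schema decision explicit, which is usually the part an exam question is probing.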

Delta and transformation rules

If the question is really about…	Strongest first lane
ACID table behavior on the lake	Delta table
upsert or change-merge logic	`MERGE`
historical inspection or rollback reasoning	Delta history or time travel
incompatible write protection	schema enforcement
intentionally adding columns	schema evolution
transformation layer purpose	Bronze vs Silver vs Gold choice
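
For quick recall of the `MERGE` and time travel rows, here is a small sketch that only builds the SQL text. The helper names and the table names are hypothetical; on Databricks the strings would typically be passed to `spark.sql(...)`. `UPDATE SET *` and `INSERT *` assume the source and target share column names.

```python
def build_merge_sql(target, source, key_columns):
    """Return a Delta MERGE statement that upserts `source` into `target`."""
    if not key_columns:
        raise ValueError("at least one key column is required")
    on_clause = " AND ".join(f"t.{c} = s.{c}" for c in key_columns)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {on_clause}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


def build_history_sql(table):
    """Return the statement that lists a Delta table's commit history."""
    return f"DESCRIBE HISTORY {table}"


def build_time_travel_sql(table, version):
    """Return a query that reads the table as of an earlier version."""
    return f"SELECT * FROM {table} VERSION AS OF {int(version)}"


# Hypothetical three-level names, for illustration only.
merge_sql = build_merge_sql(
    "main.silver.customers", "main.bronze.customer_updates", ["customer_id"]
)
```

The `ON` clause is the part to read first in a `MERGE` question: it decides which rows count as matched and which are inserted.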

Bronze / Silver / Gold

Layer	Main purpose	Common exam reading
Bronze	raw ingest, append-heavy, close to source	keep source fidelity and land data safely
Silver	cleaned, validated, joined, shaped	enforce quality and prepare reusable data
Gold	business-ready serving layer	curated output for BI, reporting, or stable consumption

High-confusion Delta pairs

Pair	Keep this distinction clear
schema enforcement vs schema evolution	reject incompatible write versus intentionally allow structure change
managed vs external table	Databricks-managed storage location versus externally controlled storage path
batch transformation logic vs streaming or incremental ingest	processing lane versus arrival/discovery lane
medallion layer choice vs Unity Catalog object boundary	data refinement stage versus governance namespace

SQL and DataFrame quick rules

Question pattern	Strongest reading
“keep all left-side rows”	left join
“find missing matches”	anti join, or a left join filtered to rows with no match
“top or latest row within each entity”	window function such as `ROW_NUMBER()`
“too much shuffle after join or aggregation”	wide transformation, possible skew, repartitioning or better key design
“slow query with excessive small reads”	file layout, compaction, pruning, and data skipping before brute-force scaling
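
Two of the patterns above, latest row per entity and missing matches, can be tried without a cluster. The sketch below uses Python's built-in `sqlite3` purely as a stand-in engine, and the tables and rows are made-up toy data. The `ROW_NUMBER()` query has the same shape in Spark SQL, which also accepts `LEFT ANTI JOIN` for the second query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, order_id INTEGER, order_ts TEXT);
    INSERT INTO orders VALUES
        (1, 100, '2026-01-01'),
        (1, 101, '2026-02-01'),
        (2, 200, '2026-01-15');
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Lin'), (3, 'Sam');
""")

# Latest order per customer: rank rows within each customer, keep rank 1.
latest_per_customer = conn.execute("""
    SELECT customer_id, order_id
    FROM (
        SELECT customer_id, order_id,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id ORDER BY order_ts DESC
               ) AS rn
        FROM orders
    ) AS ranked
    WHERE rn = 1
    ORDER BY customer_id
""").fetchall()

# Missing matches: keep left-side rows that found no partner on the right.
customers_without_orders = conn.execute("""
    SELECT c.customer_id, c.name
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    WHERE o.customer_id IS NULL
""").fetchall()

conn.close()
```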

Production, jobs, and repair cues

Requirement	Strongest first lane
packaged deployable project structure	Databricks Asset Bundles
scheduled dependency-aware execution	Workflows
rerun only the failed work rather than everything	repair / rerun logic
performance evidence instead of guesswork	Spark UI and run diagnostics
smaller operational burden for scheduled jobs	serverless jobs when the stem points there

Workflow traps

Trap	Better reading
rerun the whole pipeline every time	repair the failed path when the question is about safe recovery
notebook success means production readiness	separate interactive proof from packaged, scheduled, observable workflow behavior
“optimize” with no evidence	inspect Spark UI, task skew, shuffle pattern, and recent code changes first

Governance picker

If the question is mainly about…	Strongest first lane
catalogs, schemas, tables, and privilege boundaries	Unity Catalog object model
who can access what	privileges granted to users, groups, and service principals
where data lives and who manages the path	managed vs external table choice
auditability and downstream visibility	lineage and audit logs
sharing data to others without copying every object manually	Delta Sharing
querying external systems through a governed connection	Lakehouse Federation

Governance-boundary table

Item	What it really answers	Do not confuse it with
catalog	high-level namespace and governance boundary	a single physical data file path
schema	grouping inside a catalog	a medallion layer by itself
managed table	Databricks-managed storage lifecycle	external table storage ownership
lineage	upstream/downstream dependency evidence	permissions
sharing	controlled exposure to consumers	cloning or duplicating data pipelines
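
The catalog, schema, and table boundary is easiest to remember as a chain of grants. This sketch builds the statements a read-only consumer would typically need; the helper name, object names, and group name are hypothetical, and it assumes the common case where reading a table requires `USE CATALOG`, `USE SCHEMA`, and `SELECT` together.

```python
def grant_read_access(catalog, schema, table, principal):
    """Return the GRANT statements for read access down the three-level namespace."""
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`",
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO `{principal}`",
    ]


# Hypothetical names, for illustration only.
grants = grant_read_access("main", "gold", "daily_revenue", "analysts")
```

An unexpected permission denial often means one link in this chain is missing, even when `SELECT` on the table itself was granted.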

Troubleshooting first look

Symptom	Inspect first
duplicate records after upsert	`MERGE` condition and source uniqueness
write fails on mismatch	schema enforcement versus intended evolution
slow transformation after join or aggregate	shuffle, skew, partitioning, and Spark UI evidence
job failed after partial success	workflow run details, repair path, and failed task boundary
unexpected permission denial	Unity Catalog object boundary and granted privileges
too many tiny files	write pattern, compaction strategy, and table layout
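
For the first symptom, a source batch that carries the same key more than once is one common cause worth ruling out before touching the `MERGE` condition. The check below is a plain-Python sketch on made-up rows; the function name is hypothetical, and on a real table the same idea would usually be a `GROUP BY` on the key with `HAVING COUNT(*) > 1`.

```python
from collections import Counter


def duplicate_keys(rows, key_columns):
    """Return {key_tuple: count} for every key that appears more than once."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Toy source batch, for illustration only.
source_batch = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": "b@example.com"},
    {"customer_id": 1, "email": "a2@example.com"},
]

dupes = duplicate_keys(source_batch, ["customer_id"])
```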

Last 15-minute review

Recheck this	Because the exam often hides the miss here
development workflow vs production workflow	notebook comfort is not the same as job discipline
Auto Loader vs Lakeflow vs Workflows	ingestion, declarative pipeline logic, and scheduling are different lanes
Bronze / Silver / Gold purpose	many answers fail because the layer purpose is blurred
managed vs external tables	governance and storage ownership matter
Delta rules such as `MERGE`, time travel, and schema behavior	these are high-yield feature distinctions

What strong DE-ASSOC answers usually do

classify whether the question is about platform, pipeline, governance, or runtime evidence
choose the more repeatable and observable production behavior over the more manual notebook habit
separate ingestion, transformation, and scheduling instead of treating them as one tool choice
keep Unity Catalog, table type, lineage, and sharing boundaries precise

Revised on Monday, June 15, 2026
