Databricks DE-ASSOC Cheat Sheet: Ingestion, Catalogs, and Delta

Databricks DE-ASSOC cheat sheet for ingestion, catalogs, Delta, traps, and final review.

Use this for last-mile review. Keep it open while drilling mixed questions. DE-ASSOC is usually easier when you classify the stem in this order:

  1. Platform lane: workspace, compute, serverless, SQL, notebook, or Unity Catalog?
  2. Pipeline lane: ingestion, transformation, Lakeflow logic, workflow scheduling, or job recovery?
  3. Table/governance lane: Delta behavior, managed vs external storage, permissions, sharing, or lineage?
  4. Evidence lane: logs, Spark UI, failed task pattern, skew, tiny files, or bad join logic?

DE-ASSOC section map in one screen

Official section | Best cheat-sheet focus
1. Databricks Intelligence Platform | platform fit, compute choices, serverless and SQL warehouse cues
2. Development and Ingestion | notebooks, Databricks Connect, Auto Loader, checkpoint thinking
3. Data Processing and Transformations | medallion purpose, Delta rules, DDL/DML, DataFrame and SQL patterns
4. Productionizing Data Pipelines | Asset Bundles, workflows, repair and rerun, serverless jobs, Spark UI
5. Data Governance and Quality | managed vs external tables, Unity Catalog, lineage, Delta Sharing, federation

Platform and pipeline mental model

    flowchart TD
      subgraph Platform["Platform Lane"]
        WS["Workspace + Notebooks"] --> Compute["Interactive, Job, or SQL Compute"]
      end
      subgraph Pipeline["Pipeline Lane"]
        Auto["Auto Loader / Ingest"] --> Bronze["Bronze"] --> Silver["Silver"] --> Gold["Gold"]
      end
      subgraph Production["Production + Governance"]
        Gold --> Workflows["Lakeflow / Workflows / Jobs"]
        Workflows --> Observe["Spark UI, Logs, Repair"]
        UC["Unity Catalog + Lineage + Permissions"]
      end
      Compute --> Auto
      UC -. governs .-> Bronze
      UC -. governs .-> Silver
      UC -. governs .-> Gold

DE-ASSOC answer sequence

Use this when the stem mixes workspace, compute, ingestion, governance, and production behavior.

    flowchart TD
      S["Scenario"] --> P["Classify the lane"]
      P --> W["Workspace, compute, SQL, pipeline, or governance?"]
      W --> D["Choose the right Databricks feature"]
      D --> G["Check Unity Catalog, table type, and permissions"]
      G --> O["Verify logs, Spark UI, lineage, or run recovery"]

Fast platform picker

If the question is mainly about… | Strongest first lane
interactive exploration and ad hoc transformation work | notebook on the right compute
local IDE-driven development against Databricks | Databricks Connect
incremental file discovery and append-heavy ingestion | Auto Loader
declarative ETL pipeline structure | Lakeflow Declarative Pipelines
scheduled production execution and repair | Databricks Workflows
SQL-serving, dashboards, or analyst-facing queries | SQL warehouse
governance boundary, permissions, lineage, or sharing | Unity Catalog

Compute and workload fit

Workload signal | Interactive cluster | Job compute / serverless job | SQL warehouse
notebook exploration or development loop | strongest fit | weak | weak
scheduled batch pipeline | possible but less disciplined | strongest fit | weak
analyst SQL and BI path | weak | weak | strongest fit
exam trap | using interactive compute as permanent production runtime | forgetting repair/rerun and scheduling behavior | treating it like general ETL compute

Compute traps

Trap | Better reading
“it runs in a notebook, so it belongs on an interactive cluster forever” | separate development workflow from scheduled production execution
“serverless means every workload should move there” | first classify whether the question is about SQL serving, notebook work, or job execution
“workspace” and “compute” blur together | workspace is the operating environment; compute is the execution lane

Ingestion and development picker

Requirement | Strongest first lane | Why
discover new files incrementally with less manual listing logic | Auto Loader | ingestion-first tool with checkpoint/state thinking
move local dev workflow toward Databricks execution | Databricks Connect | local IDE development against platform runtime
one-off file load from stage or source into a table | direct load pattern | simpler than inventing a streaming path
understand why an ingest step failed | logs, run details, and recent source/schema changes | evidence before redesign

Auto Loader cues

Cue | Fast recall
repeated file arrival over time | Auto Loader lane
checkpoint thinking | resume incremental processing safely
schema drift concern | classify whether schema should be enforced, rescued, or intentionally evolved
common trap | treating Auto Loader like a generic transformation framework instead of an ingestion lane
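
The checkpoint and schema cues above can be sketched as a small PySpark function. This is a minimal sketch, assuming a JSON source; the function name, the paths, and the table name are hypothetical placeholders, and `spark` is assumed to be the session that a Databricks notebook or job provides.

```python
def start_bronze_ingest(spark, source_path, schema_path, checkpoint_path, target_table):
    """Incrementally load newly arrived JSON files into a Bronze table.

    Assumption: `spark` is an active SparkSession on Databricks, where the
    "cloudFiles" (Auto Loader) source is available.
    """
    stream = (
        spark.readStream.format("cloudFiles")            # Auto Loader source
        .option("cloudFiles.format", "json")              # format of the arriving files
        .option("cloudFiles.schemaLocation", schema_path) # where the inferred schema is tracked
        .load(source_path)
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_path)    # remembers which files were processed
        .trigger(availableNow=True)                       # process what is pending, then stop
        .toTable(target_table)
    )
```

Reusing the same checkpoint path on the next run is what lets the stream resume without reprocessing files it has already loaded. Transformation logic is deliberately absent here, which matches the "ingestion lane" reading.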

Delta and transformation rules

If the question is really about… | Strongest first lane
ACID table behavior on the lake | Delta table
upsert or change-merge logic | MERGE
historical inspection or rollback reasoning | Delta history or time travel
incompatible write protection | schema enforcement
intentionally adding columns | schema evolution
transformation layer purpose | Bronze vs Silver vs Gold choice
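
As a quick recall aid for the MERGE and time-travel rows above, the helpers below render the corresponding Databricks SQL statements as strings. This is a sketch for review purposes only: the helper names are hypothetical, the table and column names are placeholders, and identifiers are interpolated without validation, so it is not a pattern for untrusted input.

```python
def merge_sql(target: str, source: str, key: str) -> str:
    """Render an upsert: update rows that match on `key`, insert the rest."""
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


def history_sql(table: str) -> str:
    """Render the statement that lists a Delta table's commit history."""
    return f"DESCRIBE HISTORY {table}"


def time_travel_sql(table: str, version: int) -> str:
    """Render a read of the table as it was at an earlier version."""
    return f"SELECT * FROM {table} VERSION AS OF {version}"


def restore_sql(table: str, version: int) -> str:
    """Render a rollback of the table to an earlier version."""
    return f"RESTORE TABLE {table} TO VERSION AS OF {version}"


if __name__ == "__main__":
    print(merge_sql("main.silver.customers", "main.bronze.customer_updates", "customer_id"))
```

Reading history, querying an old version, and restoring are three separate actions; exam stems often hinge on which one the scenario actually needs.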

Bronze / Silver / Gold

Layer | Main purpose | Common exam reading
Bronze | raw ingest, append-heavy, close to source | keep source fidelity and land data safely
Silver | cleaned, validated, joined, shaped | enforce quality and prepare reusable data
Gold | business-ready serving layer | curated output for BI, reporting, or stable consumption

High-confusion Delta pairs

Pair | Keep this distinction clear
schema enforcement vs schema evolution | reject incompatible write versus intentionally allow structure change
managed vs external table | Databricks-managed storage location versus externally controlled storage path
batch transformation logic vs streaming or incremental ingest | processing lane versus arrival/discovery lane
medallion layer choice vs Unity Catalog object boundary | data refinement stage versus governance namespace
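
The enforcement-versus-evolution pair can be modeled in a few lines of plain Python. This is a deliberately simplified model, not Delta Lake's actual implementation: the function name and the dict-based schema representation are assumptions for illustration, and real Delta has additional rules. In Delta itself, evolution is typically opted into per write, for example with the `mergeSchema` write option.

```python
def resolve_write_schema(table_schema: dict, incoming_schema: dict,
                         allow_evolution: bool = False) -> dict:
    """Return the table schema after a write, or raise if the write is rejected.

    Schemas are modeled as {column_name: type_name} dicts (an assumption
    made for this sketch).
    """
    # Enforcement: an existing column may not silently change type.
    for column, dtype in incoming_schema.items():
        if column in table_schema and table_schema[column] != dtype:
            raise ValueError(
                f"type mismatch for {column!r}: "
                f"table has {table_schema[column]}, write has {dtype}"
            )

    # New columns are rejected unless evolution was requested on purpose.
    new_columns = {c: t for c, t in incoming_schema.items() if c not in table_schema}
    if new_columns and not allow_evolution:
        raise ValueError(f"unexpected columns {sorted(new_columns)}; evolution not enabled")

    # Evolution: append the new columns to the existing schema.
    return {**table_schema, **new_columns}
```

The key reading is that enforcement is the default protection, while evolution is an explicit decision to let the structure change.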

SQL and DataFrame quick rules

Question pattern | Strongest reading
“keep all left-side rows” | left join
“find missing matches” | anti join or left-side missing-match logic
“top or latest row within each entity” | window function such as ROW_NUMBER()
“too much shuffle after join or aggregation” | wide transformation, possible skew, repartitioning or better key design
“slow query with excessive small reads” | file layout, compaction, pruning, and data skipping before brute-force scaling
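
The first three patterns in the table can be tried locally. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Spark SQL, so the table names and sample rows are invented for illustration. Spark SQL also offers a direct `LEFT ANTI JOIN`; SQLite does not, so the missing-match logic is written here as a left join filtered on NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER, name TEXT);
CREATE TABLE orders (id INTEGER, customer_id INTEGER, ts TEXT);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO orders VALUES
  (10, 1, '2024-01-01'), (11, 1, '2024-02-01'), (12, 2, '2024-01-15');
""")

# "Keep all left-side rows": customers without orders survive with NULLs.
left_join_rows = conn.execute("""
    SELECT c.name, o.id
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id, o.id
""").fetchall()

# "Find missing matches": keep only customers with no matching order.
missing_rows = conn.execute("""
    SELECT c.name
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    WHERE o.id IS NULL
""").fetchall()

# "Latest row within each entity": rank orders per customer, keep rank 1.
latest_rows = conn.execute("""
    SELECT customer_id, id
    FROM (
        SELECT customer_id, id,
               ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY ts DESC) AS rn
        FROM orders
    ) AS ranked
    WHERE rn = 1
    ORDER BY customer_id
""").fetchall()
```

The window-function query needs a SQLite build with window support (3.25 or later), which current Python distributions normally include.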

Production, jobs, and repair cues

Requirement | Strongest first lane
packaged deployable project structure | Databricks Asset Bundles
scheduled dependency-aware execution | Workflows
rerun only the failed work rather than everything | repair / rerun logic
performance evidence instead of guesswork | Spark UI and run diagnostics
smaller operational burden for scheduled jobs | serverless jobs when the stem points there

Workflow traps

Trap | Better reading
rerun the whole pipeline every time | repair the failed path when the question is about safe recovery
notebook success means production readiness | separate interactive proof from packaged, scheduled, observable workflow behavior
“optimize” with no evidence | inspect Spark UI, task skew, shuffle pattern, and recent code changes first
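
The "repair the failed path" idea can be illustrated with a small plain-Python model of a task graph. This is a conceptual sketch, not the Databricks Jobs API: the function name, the status strings, and the dict-based graph are assumptions made for illustration. It selects the failed tasks plus everything downstream of them and leaves successful upstream work alone.

```python
def tasks_to_repair(depends_on: dict, status: dict) -> set:
    """Return the failed tasks plus every task that depends on them.

    depends_on maps each task to the list of tasks it waits for.
    status maps each task to "success", "failed", or "skipped".
    """
    to_rerun = {task for task, state in status.items() if state == "failed"}

    # Keep sweeping the graph until no new downstream task is picked up.
    changed = True
    while changed:
        changed = False
        for task, parents in depends_on.items():
            if task not in to_rerun and any(p in to_rerun for p in parents):
                to_rerun.add(task)
                changed = True
    return to_rerun


if __name__ == "__main__":
    graph = {"ingest": [], "clean": ["ingest"], "report": ["clean"], "audit": ["ingest"]}
    states = {"ingest": "success", "clean": "failed", "report": "skipped", "audit": "success"}
    print(sorted(tasks_to_repair(graph, states)))
```

In this example only `clean` and `report` are selected; `ingest` and `audit` already succeeded, so a full rerun would repeat work for no benefit.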

Unity Catalog, sharing, and governance

If the question is mainly about… | Strongest first lane
catalogs, schemas, tables, and privilege boundaries | Unity Catalog object model
who can access what | permissions and role boundary
where data lives and who manages the path | managed vs external table choice
auditability and downstream visibility | lineage and audit logs
sharing data to others without copying every object manually | Delta Sharing
querying external systems through a governed connection | federation

Governance-boundary table

Item | What it really answers | Do not confuse it with
catalog | high-level namespace and governance boundary | a single physical data file path
schema | grouping inside a catalog | a medallion layer by itself
managed table | Databricks-managed storage lifecycle | external table storage ownership
lineage | upstream/downstream dependency evidence | permissions
sharing | controlled exposure to consumers | cloning or duplicating data pipelines
Troubleshooting first look

Symptom | Inspect first
duplicate records after upsert | MERGE condition and source uniqueness
write fails on mismatch | schema enforcement versus intended evolution
slow transformation after join or aggregate | shuffle, skew, partitioning, and Spark UI evidence
job failed after partial success | workflow run details, repair path, and failed task boundary
unexpected permission denial | Unity Catalog object boundary and granted privileges
too many tiny files | write pattern, compaction strategy, and table layout
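
The first symptom, duplicates after an upsert, is easier to remember with a toy model of MERGE. The plain-Python sketch below is a simplification with hypothetical function names, not Delta's implementation. It captures two behaviors: every unmatched source row is inserted, so duplicate new keys in the source become duplicate rows, and several source rows matching the same target row is treated as an error, as Delta typically does.

```python
def merge_upsert(target: list, source: list, key: str) -> list:
    """Model MERGE: update matched rows, insert every unmatched source row."""
    result = {row[key]: dict(row) for row in target}  # assumes target keys are unique
    matched_keys = set()
    inserts = []
    for row in source:
        k = row[key]
        if k in result:
            if k in matched_keys:
                raise ValueError(f"multiple source rows match target key {k!r}")
            matched_keys.add(k)
            result[k] = dict(row)       # WHEN MATCHED: overwrite the target row
        else:
            inserts.append(dict(row))   # WHEN NOT MATCHED: insert, even if repeated
    return list(result.values()) + inserts


def dedupe_latest(source: list, key: str, order_by: str) -> list:
    """Keep only the newest source row per key before merging."""
    latest = {}
    for row in source:
        k = row[key]
        if k not in latest or row[order_by] > latest[k][order_by]:
            latest[k] = row
    return list(latest.values())


if __name__ == "__main__":
    target = [{"id": 1, "v": "a", "ts": 1}]
    source = [{"id": 2, "v": "b", "ts": 2}, {"id": 2, "v": "c", "ts": 3}]
    print(merge_upsert(target, source, "id"))                             # id 2 appears twice
    print(merge_upsert(target, dedupe_latest(source, "id", "ts"), "id"))  # id 2 appears once
```

Deduplicating the source on the merge key before the MERGE is the usual fix, which is why "source uniqueness" sits next to "MERGE condition" in the first-look list.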

Last 15-minute review

Recheck this | Because the exam often hides the miss here
development workflow vs production workflow | notebook comfort is not the same as job discipline
Auto Loader vs Lakeflow vs Workflows | ingestion, declarative pipeline logic, and scheduling are different lanes
Bronze / Silver / Gold purpose | many answers fail because the layer purpose is blurred
managed vs external tables | governance and storage ownership matter
Delta rules such as MERGE, time travel, and schema behavior | these are high-yield feature distinctions

What strong DE-ASSOC answers usually do

  • classify whether the question is about platform, pipeline, governance, or runtime evidence
  • choose the more repeatable and observable production behavior over the more manual notebook habit
  • separate ingestion, transformation, and scheduling instead of treating them as one tool choice
  • keep Unity Catalog, table type, lineage, and sharing boundaries precise
Revised on Sunday, May 10, 2026