Databricks DA-ASSOC Cheat Sheet: Unity Catalog, Dashboards, and SQL

Databricks DA-ASSOC cheat sheet for Unity Catalog, dashboards, SQL, traps, and final review.

Use this for last-mile review. DA-ASSOC is usually easiest when you decide, in order:

  1. Row grain: what should one row represent before any join or aggregate?
  2. Workflow layer: Unity Catalog, import path, SQL Warehouse, dashboard, alert, Genie, or secure-sharing issue?
  3. SQL correctness: join, filter, null-handling, grouping, time-travel, or window-frame issue?
  4. Trust boundary: permissions, lineage, freshness, certified-dataset quality, or trusted-asset mismatch?

Workflow in one screen

    flowchart TD
	  Source["Source or Imported Data"] --> Catalog["Unity Catalog + Certified Dataset Discovery"]
	  Catalog --> Query["Databricks SQL Query Logic"]
	  Query --> Analyze["Warehouse, Query History, Profile, Cache"]
	  Analyze --> Consume["Dashboard, Alert, or Genie Consumption"]
	  Govern["Permissions, Sharing, Lineage, Ownership"] -. protects .-> Consume

DA-ASSOC answer sequence

Use this when the stem mixes grain, SQL logic, Databricks SQL, dashboards, alerts, or governance.

    flowchart TD
	  S["Scenario"] --> G["Check row grain first"]
	  G --> W["Classify the workflow layer"]
	  W --> Q["Fix SQL correctness or query analysis"]
	  Q --> C["Check trust boundary, permissions, and lineage"]
	  C --> V["Verify with query profile, history, or trusted asset"]

Fastest 10-minute review

If the question says… Strongest first lane
dashboard number is too high check row grain and duplicate joins before chart settings
rows disappeared after a LEFT JOIN inspect filters placed after the join
“latest record per customer” ROW_NUMBER() with partition and descending order
alert did not fire inspect source query logic, alert condition, schedule, and destination
query is expensive or slow warehouse fit, query profile, filter selectivity, and data scanned
federated join question decide whether the need is on-demand querying versus ingestion first
user trusts Genie output too much or too little inspect curated datasets, instructions, trusted assets, and governed access
“metric looks wrong” lineage, freshness, and source-query correctness before dashboard cosmetics

DA-ASSOC review map

Exam area Use this cheat-sheet section first
Unity Catalog and managed data governed objects, table types, namespace, and discovery cues
query execution row-grain checks, join chooser, filter-placement traps, warehouses, and time travel
query analysis Photon, query history, query profile, caching, and liquid clustering
dashboards, alerts, and Genie parameter, refresh, alert, trusted-asset, and sharing cues
governance and trusted analytics permissions, lineage, ownership, certified datasets, and secure-consumption boundaries

Row grain first

Question to ask Why it matters
what should one row represent? almost every wrong metric starts here
is the dimension side unique on the chosen key? non-unique dimensions cause row multiplication
should I aggregate before or after the join? wrong ordering distorts totals
does the filter belong before or after the join or window? placement changes semantics

Grain-diagnostic table

Symptom Strongest first cause
counts doubled after enrichment non-unique dimension key or many-to-many join
sums inflated after lookup join repeated dimension rows or wrong grain
dashboard total differs from trusted report query grain drift before visualization
query uses repeated DISTINCT to “fix” numbers underlying join or grain problem, not a final fix
Genie answer sounds plausible but metric is off trusted asset or source query may be tied to the wrong grain

Fast grain traps

Trap Better reading
adjusting chart settings when the metric is wrong repair the query grain first
aggregating detail after a many-to-many join reduce or normalize before the join when needed
blaming warehouse size for obviously wrong totals correctness comes before performance

Join chooser

Join type Keeps rows from Strong use
INNER both sides with matches enrichment when a match is required
LEFT left side always keep all facts and attach optional dimension data
FULL both sides always mismatch or reconciliation review
SEMI left rows with a match existence filtering without bringing columns
ANTI left rows without a match find missing or unmatched records

Join correctness table

Symptom Strongest first check
expected unmatched rows are missing LEFT join plus post-join filter may have become INNER logic
inflated sums after adding a lookup table dimension uniqueness and join cardinality
output needs “records with no match” ANTI join or LEFT join plus null check
unexpected nulls in dimension columns LEFT join behavior may be correct, not broken
same data lives outside Databricks and inside Databricks federated read may fit analysis, but ingestion may fit repeated production use

Filter-placement traps

Pattern Why it matters
filter in WHERE after a LEFT join on right-table column can remove unmatched left rows
filter before the join can reduce row set early and preserve intended outer-join semantics
filter after aggregation changes grouped result, not source row set

Window-function rules

Piece What it really does Common mistake
PARTITION BY defines the group forgetting the grouping boundary
ORDER BY defines sequence expecting running logic without order
frame defines which ordered rows count relying on defaults without checking fit

Window-function picker

If the question is mainly about… Strongest first lane
latest row per entity ROW_NUMBER() with partition and descending sort
running total window aggregate with ordered frame
ranking with ties RANK() or DENSE_RANK() based on tie behavior
compare prior or next row LAG() or LEAD()
compare current data with prior table version Delta Lake history or time travel, not a window function alone

High-confusion window pairs

Pair Keep this distinction clear
partition vs order group boundary versus sequence inside that group
row number vs rank unique sequence versus tied-rank behavior
grouped aggregate vs window aggregate one row per group versus value repeated beside detail rows
current-row comparison vs running total LAG or LEAD logic versus ordered-frame accumulation

Table and storage-type cues

If the question is really about… Strongest first lane
Databricks-managed storage lifecycle managed table
data kept outside Databricks-managed storage external table
precomputed query result for repeated analytics materialized view
continuously arriving data streaming table
querying old Delta table versions time travel
governed sharing to outside systems Delta Sharing

High-confusion data-object pairs

Pair Keep this distinction clear
managed table vs external table Databricks-managed lifecycle versus externally managed storage path
materialized view vs standard view stored refreshed result versus logical saved query
materialized view vs streaming table repeated query acceleration versus continuous ingestion pattern
imported data vs federated data copied into Databricks versus queried in place

CTE, aggregation, and query-shape cues

If the question is really about… Strongest first lane
query readability or staged transformation CTE
reduce rows before expensive downstream work early filter and aggregate discipline
one metric per chosen grain aggregate at that grain before visualizing
de-duplicating after messy joins fix the join shape before leaning on DISTINCT

Bad SQL smells

Smell Better reading
many nested fixes after a wrong join the grain is probably wrong earlier
grouping and windowing without a clear business grain stop and define the row first
dashboard-ready query still returns unstable duplicates lock down metric grain before formatting

Databricks SQL warehouse and query-analysis cues

Requirement Strongest first lane
analyst-facing SQL and dashboard execution Databricks SQL warehouse
explain why query is slow query history, execution details, scanned data, and filter shape
reusable filtered analysis parameters instead of many copied queries
trusted metric reused across dashboards stable governed query logic
warehouse seems fine but answer is still wrong fix SQL logic before tuning compute

Performance first look

Symptom Inspect first
slow query filter selectivity, join shape, and data scanned
high cost or long runtime warehouse fit, repeated scans, and cache opportunity
dashboard looks stale refresh behavior, upstream data readiness, and query schedule
query returns correct logic slowly correctness is fine; then inspect warehouse and scan pattern
repeated filter on large tables is still costly liquid clustering may fit the filter pattern

Dashboard, alert, and Genie picker

If the stem is mainly about… Strongest first lane
visual presentation of governed metrics dashboard
threshold-triggered notification alert
natural-language exploration over curated governed data Genie
reusable filters or comparable views parameterized query and dashboard filter design
published analytics outside the workspace dashboard sharing or embedding

Consumption-boundary table

Feature Main job Common miss
dashboard present trusted metrics visually expecting visualization to fix wrong SQL
alert notify when threshold logic is met forgetting schedule, query result shape, or destination behavior
Genie natural-language access to curated data assuming it replaces dataset curation and governance
query history inspect performance and executions treating it as the semantic source of truth

Dashboard and Genie traps

Trap Better reading
changing the chart when the metric is wrong fix grain and query logic first
blaming Genie when the source model is weak inspect dataset quality, instructions, trusted assets, and governed access
alert fired on the wrong threshold validate source query, alert schedule, and condition definition
assuming published dashboards inherit every object permission automatically restate dashboard sharing mode and object access boundary

Governance and trusted analytics

Boundary What it really answers
Unity Catalog permissions who can access which governed object
lineage where the metric or dataset came from
sharing who can consume the output
freshness whether the result reflects current data
ownership who controls object changes and secure management

Governance traps

Trap Better reading
user cannot see the result inspect object permissions and sharing boundary before rewriting SQL
metric looks suspicious inspect lineage and source-query semantics
data is present but not trusted freshness, lineage, and documented metric meaning matter
security question sounds like storage only namespace, roles, sharing settings, and object ownership all matter

Last 15-minute review

Recheck this Because the miss usually hides here
one-row meaning before every join or aggregate wrong numbers usually start here
managed, external, materialized, streaming, and shared object differences Databricks nouns blur under time pressure
warehouse, profile, cache, and liquid-clustering cues performance questions punish wrong-layer thinking
dashboard, alert, and Genie job boundaries consumption features are related but not interchangeable
catalog, schema, object, owner, and sharing scope security questions often hide inside naming and permission boundaries

What strong DA-ASSOC answers usually do

  • separate governed data discovery, query authoring, query analysis, consumption, and security
  • keep row grain and trusted-source quality ahead of cosmetic fixes
  • decide whether the question is asking for copied data, shared data, federated data, or governed internal data
  • classify dashboard, alert, and Genie behavior before choosing a feature
    Recheck this Because the miss often hides here
    one-row-per-what row-grain mistakes poison everything downstream
    join type and filter placement many wrong answers are logic errors, not syntax gaps
    partition, order, and frame window questions punish blurred boundaries
    warehouse, dashboard, alert, and Genie roles Databricks workflow features are distinct lanes
    permissions, lineage, and freshness trusted analytics is more than valid SQL

What strong DA-ASSOC answers usually do

  • preserve the correct row grain before joining or aggregating
  • separate SQL correctness, warehouse performance, dashboard behavior, and governance
  • treat alerts, lineage, freshness, and sharing as part of trusted analytics
  • fix the query logic before polishing the visual layer
Revised on Sunday, May 10, 2026