Databricks ML-ASSOC Missing Values and Feature Transforms Guide

April 13, 2026

Study Databricks ML-ASSOC Missing Values and Feature Transforms: key concepts, common traps, and exam decision cues.

This lesson is about choosing the preprocessing move that actually fits the data. The exam is not asking whether you know a default function name. It is asking whether you understand what the transformation does to the feature and the model.

Transformation picker

Need	Better first instinct
fill missing continuous values	choose mean or median based on the distribution and robustness need
fill missing categorical values	mode often fits
represent categorical variables numerically	one-hot encoding when the model and feature cardinality make sense
reduce skew in a numeric feature	consider a log transform when the values are positive and right-skewed
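
The imputation rows in the picker can be illustrated with a minimal sketch. It uses only the Python standard library rather than any Databricks or Spark API, and the `impute` helper, the `incomes` column, and the `colors` column are hypothetical names invented for this illustration.

```python
import statistics

def impute(values, strategy):
    """Return a copy of values with None replaced by the chosen statistic."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Hypothetical continuous column with one extreme value and one missing entry.
incomes = [30, 32, 35, 38, 500, None]
mean_filled = impute(incomes, "mean")      # the outlier pulls the fill value up to 127
median_filled = impute(incomes, "median")  # the fill value stays at 35

# Hypothetical categorical column: the most frequent label fills the gap.
colors = ["red", "blue", "red", None]
mode_filled = impute(colors, "mode")
```

With a single extreme value in the column, the mean fill lands far from every typical observation, while the median fill stays close to them. That gap is the robustness trade-off the table refers to.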

Decision order

Ask this first	Why it matters
is the feature continuous or categorical?	imputation and encoding options change immediately
is the distribution skewed or sensitive to outliers?	that affects mean versus median decisions
is the transformation helping the model, or only making the data look tidy?	the exam rewards useful preprocessing, not decorative changes

Common traps

Trap	Better rule
using mean imputation by habit	check the distribution first; median is usually safer when skew or outliers are present
applying one-hot encoding to every categorical situation	high-cardinality features and some model types make it a poor fit
forgetting to reverse log scale for interpretation or metric calculation when needed	if the target was log-transformed, exponentiate predictions before interpreting them or computing metrics on the original scale
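
The log-scale trap can be sketched as a round trip. This example assumes `math.log1p` and `math.expm1` as the transform pair (one common choice, not something the lesson prescribes), and the target values and the fixed prediction offset are made up for illustration.

```python
import math

# Hypothetical right-skewed target values (for example, prices).
actual = [10.0, 100.0, 1000.0]

# log1p tolerates zeros; expm1 is its inverse.
log_actual = [math.log1p(y) for y in actual]

# Stand-in for a model trained on the log scale: every prediction is off by 0.1 log units.
log_predicted = [v + 0.1 for v in log_actual]

# Back-transform before interpreting predictions or computing metrics.
predicted = [math.expm1(v) for v in log_predicted]

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

rmse_log_scale = rmse(log_actual, log_predicted)  # in log units
rmse_original_scale = rmse(actual, predicted)     # in the target's own units
```

The two RMSE values describe the same predictions but differ by orders of magnitude, so reporting the log-scale number as if it were in original units would badly understate the error.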

What the exam is really testing

Databricks is usually testing judgment, not library trivia:

mean vs median vs mode asks whether you noticed the feature type and distribution
one-hot encoding asks whether the categorical representation fits the model and data shape
log transform asks whether skew reduction helps enough to justify the transformed scale

The stronger answer usually explains why the transformation improves the workflow instead of naming it mechanically.
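
To make the "data shape" point about one-hot encoding concrete, here is a small sketch in plain Python. The `one_hot` helper and both example columns are hypothetical; a real pipeline would normally use a library encoder instead.

```python
def one_hot(values):
    """Encode category labels as 0/1 rows, one column per distinct label."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

# Low cardinality: three distinct labels become three columns.
plan_cols, plan_rows = one_hot(["free", "pro", "free", "team"])

# High cardinality: every distinct ID becomes its own column.
user_ids = [f"user_{i}" for i in range(1000)]
id_cols, id_rows = one_hot(user_ids)
```

The first column expands to three indicator columns, which is easy for most models to use. The second expands to 1,000 columns that are almost entirely zeros, which is the kind of shape where one-hot encoding stops being a sensible default.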

Scenario triage

Scenario clue	Stronger answer shape
“continuous column with missing values and no strong skew issue”	mean can be reasonable, but verify the distribution first
“continuous column with skew or outlier sensitivity”	median becomes more attractive
“categorical feature with missing values”	mode is often the simplest fit
“numeric feature with heavy skew”	consider log transform where the business meaning still works
“categorical representation question”	ask whether one-hot encoding is actually appropriate for the model and feature

Decision order that usually wins

This lesson usually tests whether you can match preprocessing to data shape and model needs. If a continuous feature has missing values, inspect the distribution before choosing imputation. If skew or extremes matter, median may be stronger than mean. One-hot encoding is not “always yes”; it depends on the feature and model trade-offs. The weak answer usually applies one transform by reflex.
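
The "inspect the distribution first" step could be sketched as a simple skewness check. The `skew_threshold` of 1.0 is an arbitrary assumption for illustration, not an exam rule, and `skewness` and `pick_fill_value` are hypothetical helper names.

```python
import statistics

def skewness(values):
    """Moment-based skewness: third central moment over variance to the 1.5 power."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # a constant column has no skew
    return m3 / m2 ** 1.5

def pick_fill_value(values, skew_threshold=1.0):
    """Choose median for strongly skewed data, mean otherwise (threshold is an assumption)."""
    observed = [v for v in values if v is not None]
    if abs(skewness(observed)) > skew_threshold:
        return "median", statistics.median(observed)
    return "mean", statistics.mean(observed)

# Hypothetical columns, each with one missing entry.
symmetric = [48, 49, 50, 51, 52, None]
skewed = [1, 2, 2, 3, 3, 4, 250, None]

symmetric_choice = pick_fill_value(symmetric)
skewed_choice = pick_fill_value(skewed)
```

A fixed threshold is a blunt instrument; in practice a histogram or summary statistics would inform the same decision. The point is only that the choice follows from the data rather than from habit.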

Revised on Monday, June 15, 2026