Study Databricks ML-ASSOC Missing Values and Feature Transforms: key concepts, common traps, and exam decision cues.
This lesson is about choosing the preprocessing move that actually fits the data. The exam is not asking whether you know a default function name. It is asking whether you understand what the transformation does to the feature and the model.

| Need | Better first instinct |
|---|---|
| fill missing continuous values | choose mean or median based on the distribution and robustness need |
| fill missing categorical values | mode often fits |
| represent categorical variables numerically | one-hot encoding when the model and feature cardinality make sense |
| reduce skew in a numeric feature | consider a log transform where appropriate |
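
As a rough illustration of the first two rows, the sketch below fills a numeric column with the mean and with the median, and fills a categorical column with the mode. The `impute` helper and the sample columns are hypothetical and use only the Python standard library; on Databricks the same choice would normally be made through a library imputer rather than hand-written code.

```python
from statistics import mean, median, mode

def impute(values, strategy):
    """Replace None entries with a statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

# Hypothetical numeric column with one extreme value and one missing entry.
income = [30, 32, 35, 38, None, 400]
# Hypothetical categorical column with one missing entry.
plan = ["basic", "pro", "basic", None, "basic"]

income_mean = impute(income, "mean")      # fill value is pulled up by the outlier
income_median = impute(income, "median")  # fill value stays near the typical rows
plan_mode = impute(plan, "mode")          # most frequent category fills the gap

print(income_mean[4], income_median[4], plan_mode[3])
```

With a single outlier in the column, the mean-based fill lands far from every typical row, while the median-based fill does not. That difference is the reason the distribution matters before picking a strategy.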

| Ask this first | Why it matters |
|---|---|
| is the feature continuous or categorical? | imputation and encoding options change immediately |
| is the distribution skewed or sensitive to outliers? | that affects mean versus median decisions |
| is the transformation helping the model, or only making the data look tidy? | the exam rewards useful preprocessing, not decorative changes |
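
One way to make the skew question concrete is to compute a skewness statistic before choosing between mean and median. The sketch below is a minimal example under stated assumptions: `skewness` and `suggest_imputation` are hypothetical helpers, and the cutoff of 1.0 is an illustrative rule of thumb, not an exam-defined value.

```python
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: the average cubed z-score of the observed values."""
    observed = [v for v in values if v is not None]
    mu, sigma = mean(observed), pstdev(observed)
    if sigma == 0:
        return 0.0  # a constant column has no skew
    return sum(((v - mu) / sigma) ** 3 for v in observed) / len(observed)

def suggest_imputation(values, threshold=1.0):
    """Suggest 'median' when |skewness| exceeds the assumed threshold, else 'mean'."""
    return "median" if abs(skewness(values)) > threshold else "mean"

symmetric = [10, 11, 12, None, 13, 14]
right_skewed = [1, 1, 2, 2, 3, None, 50]

print(suggest_imputation(symmetric))
print(suggest_imputation(right_skewed))
```

A statistic like this supports the decision but does not replace it. A histogram or a quick look at the extreme values is still worthwhile before committing to an imputation strategy.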

| Trap | Better rule |
|---|---|
| using mean imputation by habit | check the distribution first; skew or outliers can favor the median |
| applying one-hot encoding to every categorical feature | high cardinality or the choice of model can make it a poor fit |
| forgetting to reverse the log scale before interpreting predictions or computing metrics | log-transformed targets may need exponentiation back to the original scale |
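
The last trap can be sketched with a round trip through the log scale. The example below assumes a non-negative target and uses `math.log1p` with `math.expm1` as the matching inverse. The prices and the "predictions" are made up for illustration, and no real model is involved.

```python
import math

def log_transform(values):
    """Apply log(1 + x); zeros are handled, inputs are assumed non-negative."""
    return [math.log1p(v) for v in values]

def inverse_log_transform(values):
    """Apply exp(x) - 1, which reverses log_transform."""
    return [math.expm1(v) for v in values]

# Hypothetical right-skewed target in original units.
prices = [0.0, 9.0, 99.0, 9999.0]
log_prices = log_transform(prices)

# Pretend a model is off by the same amount (0.1) on the log scale for every row.
log_predictions = [p + 0.1 for p in log_prices]

# Convert back before computing an error in original units.
predictions = inverse_log_transform(log_predictions)
abs_errors = [abs(p - y) for p, y in zip(predictions, prices)]

print(abs_errors)
```

The same error on the log scale turns into very different errors in original units, so a metric computed on the transformed values does not describe the error a stakeholder would actually see.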

Databricks is usually testing judgment, not library trivia. The stronger answer usually explains why the transformation improves the workflow instead of naming it mechanically.

| Scenario clue | Stronger answer shape |
|---|---|
| “continuous column with missing values and no strong skew issue” | mean can be reasonable, but verify the distribution first |
| “continuous column with skew or outlier sensitivity” | median becomes more attractive |
| “categorical feature with missing values” | mode is often the simplest fit |
| “numeric feature with heavy skew” | consider log transform where the business meaning still works |
| “categorical representation question” | ask whether one-hot encoding is actually appropriate for the model and feature |
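
For the categorical representation row, the sketch below shows what one-hot encoding does to a feature and why cardinality matters: each distinct value becomes its own column. The `one_hot` helper and its `max_categories` guard are hypothetical, and the limit of 10 is an arbitrary illustrative value, not a standard threshold.

```python
def one_hot(values, max_categories=10):
    """One-hot encode category labels, refusing features with too many distinct values."""
    categories = sorted(set(values))
    if len(categories) > max_categories:
        raise ValueError(
            f"{len(categories)} distinct values; one-hot would add that many columns"
        )
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)  # one column per category
        row[index[v]] = 1            # mark the column for this row's category
        rows.append(row)
    return categories, rows

colors = ["red", "green", "red", "blue"]
categories, encoded = one_hot(colors)

print(categories)
print(encoded)
```

A three-value feature becomes three columns, which is manageable. An identifier-like feature with thousands of values would produce thousands of sparse columns, which is the kind of situation where one-hot encoding stops being the right default.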

This lesson usually tests whether you can match preprocessing to data shape and model needs. If a continuous feature has missing values, inspect the distribution before choosing an imputation strategy. If skew or extreme values matter, the median may be a more robust choice than the mean. One-hot encoding is not “always yes”; it depends on the feature's cardinality and the model's trade-offs. The weak answer usually applies one transform by reflex.