Databricks ML-PRO Glossary: Key Terms

Databricks ML-PRO glossary of Spark ML, training, tuning, inference, and MLOps terms.

Use this glossary when SparkML, MLflow, feature-engineering, monitoring, and deployment terms start to blur together. Keep it beside the cheat sheet and resources, not in place of scenario practice.

High-yield terms

Term | Short meaning | Why it matters on ML-PRO
SparkML | Spark’s distributed ML library for pipelines, estimators, transformers, and scalable inference | core model-development term
Nested run | MLflow tracking pattern that groups child runs under a parent run | key advanced experimentation term
Alias | Stable label pointing to a chosen registered model version | key release-control term
Point-in-time correctness | Feature lookup behavior that prevents leakage by using only information available at that moment | one of the highest-value feature-engineering concepts
Online table | Databricks feature-serving storage for low-latency applications | key online-feature term
Lakehouse Monitoring | Databricks monitoring surface for data and model-quality signals | key drift and monitoring term
Drift metric | Statistical signal that tracks change in data or model behavior over time | key monitoring decision term
Data parallelism | Split data across workers while training the same model structure | key scaling strategy term
Model parallelism | Split model computation itself across resources | key large-model scaling term
Optuna | Hyperparameter tuning framework used in Databricks workflows and often paired with MLflow logging | key tuning term
Ray | Distributed compute framework often contrasted with Spark for ML workloads | key scaling trade-off term
Databricks Asset Bundle | Packaging and deployment structure for Databricks assets and environment promotion | key MLOps term
Blue-green deployment | Deployment strategy that shifts traffic between two environments with a clear cutover path | key rollout term
Canary deployment | Rollout strategy that exposes a small portion of traffic first | key blast-radius-control term
Custom PyFunc model | MLflow model packaged through the pyfunc interface for custom serving logic | key deployment-interface term
Deploy code strategy | Lifecycle approach where code, rather than a trained model artifact, is promoted across environments, and each environment trains and registers its own model | key MLOps architecture term
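
Point-in-time correctness is easier to remember with a tiny lookup in hand. The sketch below is plain Python, not the Databricks feature-engineering API; `feature_history`, `point_in_time_value`, and the dates are hypothetical names and values chosen for illustration. It returns the most recent feature value recorded at or before a label's timestamp, and contrasts that with simply taking the latest value, which would leak future information into training.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history for one entity: (effective_time, value), sorted by time.
feature_history = [
    (datetime(2024, 1, 1), 10.0),
    (datetime(2024, 2, 1), 12.5),
    (datetime(2024, 3, 1), 20.0),
]


def point_in_time_value(history, as_of):
    """Return the latest value whose effective_time <= as_of, or None if there is none."""
    times = [t for t, _ in history]
    idx = bisect_right(times, as_of)  # number of records at or before as_of
    if idx == 0:
        return None  # no feature value was known yet at that moment
    return history[idx - 1][1]


label_time = datetime(2024, 2, 15)

# Point-in-time correct: only uses what was known on 2024-02-15.
pit_value = point_in_time_value(feature_history, label_time)

# Leaky alternative: the newest value, which was recorded after the label time.
latest_value = feature_history[-1][1]
```

Here `pit_value` is the February record and `latest_value` is the March record; training on the latter would reward the model for knowing the future.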

Commonly confused pairs

Pair | Keep this distinction clear
MLflow run vs registered model version | experiment record versus release artifact
alias vs serving endpoint | release pointer versus deployed inference interface
point-in-time correctness vs feature freshness | leakage prevention versus recency of values
drift vs rollout regression | gradual distribution or quality change versus bad release event
SparkML vs single-node model | distributed pipeline fit versus local model path
Spark vs Ray | DataFrame-centric distributed data and ML engine versus general-purpose distributed task and actor framework, each with its own trade-offs
retrain vs rollback | create a new candidate versus restore a known good state
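
The run, version, alias, retrain, and rollback distinctions can be sketched with a toy in-memory registry. This is an illustrative stand-in only: `ToyRegistry` and its methods are hypothetical and are not the MLflow or Unity Catalog API. The point it shows is that retraining adds a new version, while rollback only moves an existing pointer.

```python
class ToyRegistry:
    """In-memory stand-in for a model registry (hypothetical; NOT the MLflow API)."""

    def __init__(self):
        self.versions = {}  # version number -> id of the run that produced it
        self.aliases = {}   # alias name -> version number

    def register(self, run_id):
        """Record a run's model as the next version and return its number."""
        version = len(self.versions) + 1
        self.versions[version] = run_id
        return version

    def set_alias(self, alias, version):
        """Point an alias at an existing version."""
        if version not in self.versions:
            raise KeyError(f"unknown version {version}")
        self.aliases[alias] = version

    def resolve(self, alias):
        """Return the version number an alias currently points to."""
        return self.aliases[alias]


registry = ToyRegistry()

v1 = registry.register("run-a")        # experiment run becomes version 1
registry.set_alias("champion", v1)     # release pointer selects version 1

v2 = registry.register("run-b")        # retrain: a new candidate version exists
registry.set_alias("champion", v2)     # promotion: pointer moves to version 2

registry.set_alias("champion", v1)     # rollback: pointer returns to known-good version 1
```

After the rollback, both versions still exist; only the alias changed. A serving endpoint is a separate thing again: it is whatever loads the version the alias resolves to.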

If three terms blur together

Cluster | Fast separation
run / version / alias | track the experiment, govern the releasable artifact, point the release control at the chosen version
drift / outage / rollout regression | gradual change, service failure, or bad deployment event
SparkML / Ray / single-node training | distributed Spark pipeline, alternative distributed framework, or local model path
blue-green / canary / rollback | cutover strategy, partial rollout, or revert to a trusted prior state
point-in-time correctness / leakage / online features | correct historical lookup, future-information contamination, or low-latency feature serving
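
The blue-green / canary / rollback row can be pictured as one routing function with a single dial. The sketch below is a generic illustration, not how any particular serving platform splits traffic; `route`, `canary_percent`, and the request ids are hypothetical. It hashes each request id into one of 100 buckets so the same id always lands on the same side.

```python
import hashlib


def route(request_id: str, canary_percent: int) -> str:
    """Deterministically send roughly canary_percent% of request ids to the canary."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return "canary" if bucket < canary_percent else "stable"


# Hypothetical request ids, used only to measure the observed split.
request_ids = [f"req-{i}" for i in range(10_000)]

# Canary: a small share of traffic reaches the new version first.
canary_share = sum(route(r, 10) == "canary" for r in request_ids) / len(request_ids)
```

Under this framing, a canary keeps `canary_percent` small and raises it gradually, a blue-green cutover flips it from 0 to 100 in one step, and a rollback sets it back to 0.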

One-sentence memory hooks

  • If the model scored well offline, ask whether the feature path and release path are still safe.
  • If production gets worse, separate drift, feature bug, rollout regression, and serving failure before acting.
  • If scaling is the issue, choose the approach that fits the workload before adding more compute.
  • If monitoring fires, decide whether the right action is retrain, rollback, block promotion, or fix upstream data.
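
For the monitoring hook, it helps to see what a drift metric can look like in practice. The sketch below computes a population stability index (PSI) in plain Python; PSI is one common choice and is used here as an assumption, since the glossary does not name a specific metric. The `psi` function, the bin edges, and the sample values are all hypothetical.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population stability index between two samples over shared bin edges.

    Values outside [edges[0], edges[-1]] are ignored in this sketch.
    """

    def proportions(values):
        counts = [0] * (len(edges) - 1)
        for v in values:
            for i in range(len(edges) - 1):
                is_last_bin = i == len(edges) - 2
                # Bins are half-open, except the last one, which includes its right edge.
                if edges[i] <= v < edges[i + 1] or (is_last_bin and v == edges[-1]):
                    counts[i] += 1
                    break
        total = sum(counts)
        # Floor each proportion at eps so empty bins do not break the logarithm.
        return [max(c / total, eps) for c in counts]

    p = proportions(expected)
    q = proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


edges = [0.0, 0.25, 0.5, 0.75, 1.0]
baseline = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
shifted = [0.6, 0.7, 0.8, 0.85, 0.9, 0.9, 0.95, 0.95, 0.99, 0.99]

no_drift = psi(baseline, baseline, edges)   # identical samples
with_drift = psi(baseline, shifted, edges)  # mass moved into the top bins
```

A frequently quoted rule of thumb treats PSI above roughly 0.25 as a large shift, but that threshold is a convention rather than something stated in this glossary. A high value says that the distribution moved; it does not say whether the right response is to retrain, roll back, block promotion, or fix upstream data.
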
Revised on Sunday, May 10, 2026