Databricks GENAI-ASSOC Cheat Sheet: RAG, Agents, and Evaluation

Databricks GENAI-ASSOC cheat sheet for RAG, agents, evaluation, traps, and final review.

Use this for last-mile review. Keep it open while drilling mixed questions. GENAI-ASSOC usually gets easier when you classify the failure or design choice first:

  1. Design lane: business goal, model task, chain component, agent tool order, or Agent Bricks fit?
  2. Data prep lane: source quality, extraction, chunking, metadata, embeddings, or retrieval metrics?
  3. Development lane: prompt augmentation, framework choice, guardrails, model fit, or Agent Framework pattern?
  4. Deployment lane: pyfunc packaging, Vector Search, serving, registration, MCP, governance, or monitoring?

RAG system map

    flowchart TD
        Need["Business Requirement"] --> Design["Model Task + Chain Design"]
        Docs["Source Documents"] --> Prep["Extraction + Chunking + Delta Tables"]
        Prep --> Search["Embeddings + Vector Search + Filters"]
        Design --> Dev["Prompt + Chain + Agent Logic"]
        Search --> Dev
        Dev --> Deploy["Pyfunc / Serving / UC Registration / Interface"]
        Gov["Governance + Guardrails"] -. constrains .-> Prep
        Gov -. constrains .-> Dev
        Deploy --> Eval["Evaluation, Tracing, Logging, Monitoring"]

GENAI-ASSOC answer sequence

Use this when the stem mixes task fit, retrieval, grounding, agents, deployment, or governance.

    flowchart TD
        S["Scenario"] --> T["Clarify the business task"]
        T --> D["Choose model task and chain design"]
        D --> R["Check chunking, embeddings, and retrieval"]
        R --> G["Add governance, guardrails, and permission checks"]
        G --> V["Validate with tracing, logging, and evaluation"]

Fast lane picker

If the question is mainly about… | Strongest first lane
the application solves the wrong business task | requirements, model task, or chain design
documents are split badly or context windows overflow | chunking strategy
the right documents are not returned | retrieval quality, filters, embeddings, reranking, or top-k
answers sound fluent but wrong | grounding, retrieval quality, model fit, or guardrails
latency or cost spikes | context length, top-k, serving path, vector search config, or model choice
cross-tenant leakage or unsafe output | metadata filters, governance, prompt-safety controls, masking, and evaluation
agent tooling or multi-step reasoning sounds overbuilt or underbuilt | Agent Bricks, tools, or multi-agent design fit

Design and tool-choice cues

If the question is really about… | Strongest first lane
what the AI pipeline should take in and produce | define inputs and outputs from the business use case
which model task fits the requirement | model task selection
which components belong in the chain | chain design and tool order
whether to use Agent Bricks | choose the Databricks packaged option that matches the problem
multi-stage reasoning or tool usage | define and order tools explicitly

Design traps

Trap | Better reading
starting with a framework before defining the business requirement | requirements first, tools second
using an agent pattern when a simpler chain fits | choose the least complex architecture that satisfies the task
treating Agent Bricks as generic buzzwords | each brick solves a specific type of problem

Chunking and embeddings

Decision | Trade-off | Rule of thumb
chunk size | recall versus precision | big enough for meaning, small enough for focused retrieval
overlap | continuity versus redundancy/cost | use enough overlap to preserve context edges without flooding the index
metadata | filter precision versus ingestion effort | store source, version, tenant, sensitivity, and freshness if those affect retrieval
embedding choice | semantic quality versus cost/latency | choose the embedding path that matches the retrieval task and governance boundary
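
The chunk size and overlap trade-off above can be sketched in a few lines. This is a minimal, word-based illustration, not a Databricks API: the function names `chunk_text` and `with_metadata` and the default sizes are assumptions chosen for the example, and a real pipeline would usually split on tokens or document structure instead of whitespace.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word windows; each window repeats `overlap` words of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def with_metadata(chunks: list[str], source: str, version: str, tenant: str) -> list[dict]:
    """Attach the metadata fields that later drive filtering (hypothetical field names)."""
    return [
        {"chunk_id": i, "text": c, "source": source, "version": version, "tenant": tenant}
        for i, c in enumerate(chunks)
    ]
```

Raising `overlap` keeps more context at chunk edges but grows the index; shrinking `chunk_size` sharpens retrieval until chunks stop carrying enough meaning on their own.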

Chunking traps

Trap | Better reading
chunks are huge so retrieval “has context” | oversized chunks dilute precision and waste context window
chunks are tiny because “more precise is better” | undersized chunks lose meaning and hurt answer quality
metadata is optional | metadata often controls tenant, document version, freshness, and policy boundaries
extraction package choice does not matter | OCR, PDF, HTML, and other source formats need the right extraction path

Retrieval quality

If the problem is mainly… | Strongest first explanation
irrelevant documents appear | weak chunking, weak embeddings, or missing filters
right documents exist but do not surface | top-k, index quality, ranking, or query formulation issue
wrong tenant or version shows up | missing metadata filters or governance boundary
latency is too high | candidate set too large, top-k too large, or unnecessary context packing
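
To make the filter and top-k rows concrete, here is a pure-Python stand-in for a filtered similarity search. It is not the Databricks Vector Search API; the `retrieve` function, the `INDEX` layout, and the two-dimensional vectors are all hypothetical and exist only to show that filtering decides eligibility before ranking decides order.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, filters, top_k=3):
    """Keep only documents whose metadata matches every filter, then rank and cut to top_k."""
    eligible = [
        doc for doc in index
        if all(doc["meta"].get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(eligible, key=lambda doc: cosine(query_vec, doc["vec"]), reverse=True)
    return ranked[:top_k]


# Toy index: b1 is just as similar as a1 but belongs to another tenant.
INDEX = [
    {"id": "a1", "vec": [1.0, 0.0], "meta": {"tenant": "acme", "version": "v2"}},
    {"id": "a2", "vec": [0.8, 0.6], "meta": {"tenant": "acme", "version": "v2"}},
    {"id": "b1", "vec": [1.0, 0.0], "meta": {"tenant": "globex", "version": "v2"}},
    {"id": "a0", "vec": [0.0, 1.0], "meta": {"tenant": "acme", "version": "v1"}},
]

hits = retrieve([1.0, 0.1], INDEX, {"tenant": "acme", "version": "v2"}, top_k=2)
```

Without the tenant filter, `b1` would tie for first place, which is the "wrong tenant shows up" failure in the table.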

Retrieval quick rules

Cue | Fast recall
tenant isolation | metadata filtering and governance boundary
most useful few documents | rank and top-k discipline
query meaning mismatch | reformulation or better retrieval strategy
repeated miss on same content family | source documents may be weak before the model is weak
poor ordering among good candidates | reranking can be the missing step
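
The last row is easier to remember with a two-stage picture: a first stage returns candidates, and a second stage reorders them. The sketch below uses a deliberately crude word-overlap score as a stand-in for a real reranking model; `overlap_score` and `rerank` are hypothetical names, and the scoring rule is an assumption made only to keep the example self-contained.

```python
def overlap_score(query: str, text: str) -> float:
    """Fraction of query words that also appear in the candidate text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def rerank(query: str, candidates: list[str], top_n: int = 2) -> list[str]:
    """Reorder first-stage candidates by the second-stage score and keep the best top_n."""
    return sorted(candidates, key=lambda c: overlap_score(query, c), reverse=True)[:top_n]


first_stage = [
    "billing overview for api users",
    "how to rotate an api key safely",
    "key features of the dashboard",
]
best = rerank("rotate api key", first_stage, top_n=2)
```

All three candidates were "good enough" to be retrieved; reranking only changes which one the model sees first.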

Development and generation

If the question is mainly about… | Strongest first lane
organizing retrieved evidence into the prompt | prompt assembly with grounded context
unsupported claims in the answer | grounding or retrieval weakness, not just prompt wording
style or output format | prompt instruction layer
model capability versus cost or latency | model selection and experiment signal
framework choice | LangChain or similar tooling fit for the application design
lifecycle tooling for agents | MLflow and Agent Framework
multi-agent use with Genie or conversational APIs | multi-agent pattern and Databricks-specific integration
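
Prompt assembly with grounded context, the first row above, might look like the following sketch. The template wording, the `build_prompt` name, and the character budget are assumptions for illustration; a production chain would typically budget in tokens and use whatever prompt template its framework provides.

```python
def build_prompt(question: str, passages: list[dict], max_chars: int = 1200) -> str:
    """Pack numbered, source-tagged passages into the prompt until the budget is used up."""
    header = (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
    )
    context_lines = []
    used = 0
    for number, passage in enumerate(passages, start=1):
        line = f"[{number}] ({passage['source']}) {passage['text']}"
        if used + len(line) > max_chars:
            break  # stop packing instead of overflowing the context budget
        context_lines.append(line)
        used += len(line)
    context = "\n".join(context_lines)
    return f"{header}Context:\n{context}\n\nQuestion: {question}\nAnswer (cite passage numbers):"
```

Passages are assumed to arrive already ranked, so the budget cut drops the weakest evidence first.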

Generation traps

Trap | Better reading
clever prompting can fix missing source evidence | retrieval and source quality usually dominate
bigger context is always safer | too much weak context can increase noise, cost, and latency
model swap is the first response to every quality issue | first classify whether the miss is retrieval, context packing, or evaluation blind spot
safety issue means only the prompt changed | guardrails and policy controls are separate from prompt wording

Deployment picker

If the question is really about… | Strongest first lane
package a chain with pre- and post-processing | pyfunc model
retrieve from Databricks vector indexes | Vector Search
serve an LLM app on Databricks | model serving or Foundation Model APIs path
register the model or chain in the governed catalog | Unity Catalog plus MLflow registration
store intermediate memory or structured state | persistent datastore choice
batch inference against data | ai_query() where it fits
promote prompts or indexes across environments | CI/CD and prompt lifecycle controls
add tools via managed, external, or custom servers | MCP
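
The pyfunc row is about wrapping pre-processing, retrieval, generation, and post-processing behind one `predict` call. The sketch below is a plain Python class that only mirrors the `predict(context, model_input)` shape of `mlflow.pyfunc.PythonModel`; it deliberately does not import MLflow so it runs anywhere. In real use you would subclass `mlflow.pyfunc.PythonModel` and log it with MLflow. The class name and the injected `retriever` and `generator` callables are hypothetical.

```python
class RagChainModel:
    """Shaped like an mlflow.pyfunc.PythonModel; subclass that class in a real deployment."""

    def __init__(self, retriever, generator):
        self.retriever = retriever  # callable: question -> list of documents
        self.generator = generator  # callable: (question, documents) -> raw answer text

    def _preprocess(self, model_input):
        # Normalize incoming questions before retrieval.
        return [question.strip() for question in model_input["question"]]

    def _postprocess(self, raw_answer: str) -> str:
        # Clean up model output before returning it to the caller.
        return raw_answer.strip()

    def predict(self, context, model_input):
        answers = []
        for question in self._preprocess(model_input):
            documents = self.retriever(question)
            answers.append(self._postprocess(self.generator(question, documents)))
        return answers


# Stub components stand in for Vector Search and a served LLM.
model = RagChainModel(
    retriever=lambda q: [f"doc about {q}"],
    generator=lambda q, docs: f"  {q} -> {len(docs)} docs \n",
)
answers = model.predict(None, {"question": ["  hi  "]})
```

Keeping pre- and post-processing inside the packaged model means the serving endpoint and batch jobs run exactly the same logic.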

Evaluation loop

What to evaluate | Examples
retrieval quality | hit rate, useful-context rate, groundedness support
answer quality | correctness, completeness, citation quality
safety | prompt injection resilience, leakage resistance, harmful output behavior
regressions | fixed benchmark set and repeatable comparisons
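
Hit rate, the first retrieval metric listed, is simple enough to compute by hand. The sketch below assumes an evaluation set where each example lists the ids of the chunks that should be retrieved; the `hit_rate_at_k` name, the example fields, and the fake retriever are assumptions for illustration.

```python
def hit_rate_at_k(eval_set: list[dict], retrieve_fn, k: int = 3) -> float:
    """Share of questions where at least one relevant id appears in the top-k results."""
    if not eval_set:
        return 0.0
    hits = 0
    for example in eval_set:
        retrieved_ids = retrieve_fn(example["question"])[:k]
        if set(retrieved_ids) & set(example["relevant_ids"]):
            hits += 1
    return hits / len(eval_set)


EVAL_SET = [
    {"question": "refund window", "relevant_ids": ["policy-3"]},
    {"question": "rotate api key", "relevant_ids": ["security-7"]},
]

# Fake retriever with canned results, standing in for a real index query.
CANNED = {
    "refund window": ["policy-3", "faq-1"],
    "rotate api key": ["billing-2", "faq-9"],
}
score = hit_rate_at_k(EVAL_SET, lambda q: CANNED.get(q, []), k=2)
```

Because this metric never looks at the generated answer, it isolates retrieval misses from generation misses.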

Evaluation rules

  • Keep a fixed evaluation set so changes are comparable.
  • Evaluate retrieval and answer quality separately.
  • Include safety and governance checks, not just quality checks.
  • Treat evaluation as a release gate, not an afterthought.
  • Use tracing, scorers, and SME feedback to improve the system intentionally.
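
"Release gate" in the rules above can be read as a comparison against a fixed baseline. A minimal sketch, assuming higher-is-better metrics and a hypothetical tolerance of 0.02; the `release_gate` name and the metric names are illustrative, not part of any Databricks tooling.

```python
def release_gate(baseline: dict, candidate: dict, max_drop: float = 0.02):
    """Return (passed, regressions); a metric regresses if it is missing or drops more than max_drop."""
    regressions = {}
    for metric, base_value in baseline.items():
        new_value = candidate.get(metric)
        if new_value is None or new_value < base_value - max_drop:
            regressions[metric] = (base_value, new_value)
    return (not regressions, regressions)


baseline = {"hit_rate": 0.80, "groundedness": 0.90}
candidate = {"hit_rate": 0.85, "groundedness": 0.85}
passed, regressions = release_gate(baseline, candidate)
```

Here retrieval improved but groundedness fell by more than the tolerance, so the candidate is blocked even though one headline number went up.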

Monitoring, cost, and observability

Requirement | Strongest first lane
reduce repeated work | cache embeddings or retrieval results where appropriate
reduce candidate set | metadata filters and intentional top-k
reduce context cost | shorter focused chunks and tighter prompt assembly
understand failures over time | logging, tracing, inference tables, and observability
safer rollout | benchmark and regression gate before broad deployment
track live endpoint behavior | inference logging, Agent Monitoring, or AI Gateway tables
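
For the "reduce repeated work" row, a cache in front of the embedding call is the simplest illustration. The sketch uses the standard library's `functools.lru_cache`; `_expensive_embed` is a fake stand-in for a real embedding endpoint, and the call counter exists only to show that repeated inputs skip the expensive path.

```python
from functools import lru_cache

CALLS = {"embed": 0}  # counts how often the expensive path actually runs


def _expensive_embed(text: str) -> tuple:
    """Fake embedding call; a real system would hit an embedding model here."""
    CALLS["embed"] += 1
    return tuple(float(ord(ch) % 7) for ch in text[:4])


@lru_cache(maxsize=1024)
def cached_embed(text: str) -> tuple:
    """Return a cached embedding for text seen before, otherwise compute and store it."""
    return _expensive_embed(text)


cached_embed("reset my password")
cached_embed("reset my password")  # served from the cache
cached_embed("rotate api key")
```

Caching only helps when inputs repeat exactly, so normalizing queries before the cache lookup is a common companion step, and cached results must be invalidated when the embedding model changes.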

Cost and latency traps

Trap | Better reading
unlimited top-k for “better recall” | wider retrieval can increase latency and degrade answer focus
large prompt context by default | only include evidence that materially helps the answer
watching token cost only | tail latency, retrieval cost, and governance overhead matter too
monitoring only after deployment | the exam blueprint expects monitoring to be designed before launch and maintained after it
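
The "tail latency" reading is easiest to see with numbers. The sketch below computes nearest-rank percentiles over a made-up list of request latencies in milliseconds; the `percentile` helper and the sample values are assumptions for illustration, not measurements.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct percent of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies: nine ordinary requests and one slow outlier.
latencies_ms = [120, 130, 125, 140, 135, 128, 132, 138, 122, 2400]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
mean = sum(latencies_ms) / len(latencies_ms)
```

The median looks healthy while the 95th percentile exposes the slow request, which is why averages and token cost alone can hide a serving problem.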

Governance and safety

Boundary | What it really answers
metadata filters | which documents are eligible for retrieval
governance policy | who can access what content and model path
safety checks | whether the system resists harmful or leaking behavior
audit/evaluation records | whether changes remain explainable and reviewable
legal and licensing controls | whether the source data can be used safely and lawfully
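
Masking, one of the content-protection techniques this sheet contrasts with guardrails, can be sketched with two regular expressions. The patterns below are simplified assumptions that catch common email and US-style phone formats only; they are not a complete PII detector, and `mask_pii` is a hypothetical helper name.

```python
import re

# Simplified patterns for illustration; real PII detection needs broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def mask_pii(text: str) -> str:
    """Replace matched emails and phone numbers with fixed placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


masked = mask_pii("Contact jane.doe@example.com or 555-123-4567.")
```

Masking can run at ingestion (before chunks are indexed) or on model output; either way it protects content, while guardrails decide whether a request or response is allowed at all.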

High-confusion pairs

Pair | Keep this distinction clear
retrieval quality vs answer quality | getting the right evidence versus using it well
chunking vs prompting | document preparation versus instruction layer
metadata filtering vs reranking | eligibility boundary versus ordering of candidates
evaluation vs monitoring | release-quality judgment versus ongoing operational observation
MLflow vs Agent Framework | lifecycle tooling versus agent-building runtime framework
Vector Search vs model serving | retrieving context versus serving the app or model
masking vs guardrails | content protection technique versus broader runtime safety control
inference logging vs inference tables | capturing requests and outputs versus structured monitoring surfaces

Last 15-minute review

Recheck this | Because the miss often hides here
requirement, model task, and chain fit | many architecture misses start before code
chunk size, overlap, metadata, and reranking | many retrieval failures start upstream
model, embedding, and framework choice | tool selection can hide inside seemingly simple scenario stems
Vector Search, serving, MLflow, and UC boundaries | Databricks nouns blur under time pressure
evaluation set, tracing, and monitoring surfaces | production safety depends on repeatable evidence
governance, licensing, and tenant boundaries | good GenAI systems are also access-control systems

What strong GENAI-ASSOC answers usually do

  • treat RAG as a system, not as prompt-writing alone
  • separate design, retrieval, generation, deployment, evaluation, and governance
  • prefer the more observable and controllable deployment path
  • fix data and retrieval quality before assuming a prompt or model swap solves everything

Revised on Sunday, May 10, 2026