Study Databricks GENAI-ASSOC Chunking and Retrieval Inputs: key concepts, common traps, and exam decision cues.
Chunking is not just a preprocessing detail. On this exam it is a design choice that affects quality, latency, cost, and whether retrieval has the right information to work with in the first place.
| Decision | Strong reading |
|---|---|
| larger chunks | more context per chunk, lower record count, but lower precision |
| smaller chunks | tighter precision, but more records and more risk of losing context |
| overlap | preserves boundary context, but increases redundancy and cost |
| metadata | helps eligibility, freshness, tenancy, and filtering decisions |
| Need | Better first instinct |
|---|---|
| store chunked text for governed retrieval use | Delta tables in Unity Catalog |
| keep tenant, version, or sensitivity boundaries clear | write useful metadata with the chunks |
| Input-design choice | Why the exam cares |
|---|---|
| chunk size | changes recall, precision, and operating cost |
| overlap | protects boundary context but can create redundancy |
| metadata | supports filters, versioning, sensitivity, and tenancy |
| governed storage | keeps retrieval assets usable inside UC boundaries |
| Trap | Better rule |
|---|---|
| huge chunks because “the model needs more context” | oversized chunks can hurt retrieval precision |
| tiny chunks because “more records means more accuracy” | too-small chunks lose meaning |
| metadata treated as optional | metadata often carries the governance boundary |
A team stores chunked text without document version, business unit, or sensitivity metadata. Later they need filtered retrieval for tenant-specific answers and governance review. What failed first?
Correct answer: B. Retrieval quality is not only chunk size. Metadata often defines freshness, tenancy, and governance eligibility.
Chunking questions usually reward balance over extremes. If chunks are too small, they lose the semantic unit the answer needs. If they are too large, retrieval becomes noisy and expensive. Metadata matters because it can control freshness, filtering, sensitivity, and tenant boundaries. The weak answer usually maxes one variable without respecting retrieval fit.