Study Databricks GENAI-ASSOC Source Extraction and Cleaning: key concepts, common traps, and exam decision cues.

If the knowledge base is weak, the application stays weak. The exam checks whether you can pick the right source documents, extract content correctly, and remove content that hurts retrieval quality.

| If the stem says… | Read it as… |
|---|---|
| the app cannot answer a business question family | the source documents may be missing the needed knowledge |
| the documents are scanned images | extraction method matters |
| the results contain lots of irrelevant noise | content cleaning and filtering matter before embedding |

| Source type | What you need to think about |
|---|---|
| clean HTML or structured web content | parser fit and content selection |
| PDFs or office docs | structured extraction path |
| scanned images | OCR-capable extraction path |

| If the source problem is… | Better first move |
|---|---|
| wrong document set | fix coverage before tuning retrieval |
| noisy headers, footers, or irrelevant boilerplate | clean the text before embedding |
| scanned image content | choose OCR-capable extraction |
| mixed source formats | do not assume one parser or package fits all of them |

| Trap | Better rule |
|---|---|
| assuming every source can use the same extraction package | source format changes the right extraction tool |
| embedding everything without filtering | irrelevant content pollutes retrieval quality |
| blaming the model when the knowledge base is incomplete | source quality comes first |
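
The rule that source format changes the right extraction tool can be sketched as a small routing step. This is a minimal illustration, not a Databricks API: the function names `choose_extraction_path` and `group_by_extraction_path`, the extension map, and the path labels are all hypothetical, and a real pipeline would map each label to whichever parser or OCR tool it actually uses.

```python
from pathlib import Path

# Hypothetical labels for extraction paths (assumed for illustration only).
# Note: a PDF that is really a scanned image would still need the OCR path;
# extension alone cannot tell you that.
EXTRACTION_PATHS = {
    ".html": "html_parser",
    ".htm": "html_parser",
    ".pdf": "structured_document_extraction",
    ".docx": "structured_document_extraction",
    ".pptx": "structured_document_extraction",
    ".png": "ocr",
    ".jpg": "ocr",
    ".jpeg": "ocr",
    ".tiff": "ocr",
}


def choose_extraction_path(filename: str) -> str:
    """Return the extraction path label for a file, based on its extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in EXTRACTION_PATHS:
        # Fail loudly instead of silently pushing an unknown format
        # through a parser that was never meant for it.
        raise ValueError(f"No extraction path configured for {suffix!r}")
    return EXTRACTION_PATHS[suffix]


def group_by_extraction_path(filenames):
    """Group filenames by extraction path so each group gets its own tool."""
    groups = {}
    for name in filenames:
        groups.setdefault(choose_extraction_path(name), []).append(name)
    return groups
```

Raising on an unknown extension is a deliberate choice in this sketch: a mixed corpus should surface unsupported formats early rather than embed whatever a mismatched parser happens to return.
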

A retrieval app keeps surfacing document footers, navigation text, and legal boilerplate instead of useful answer content. What should you inspect first?

Correct answer: A. When the extracted text is full of irrelevant noise, retrieval quality breaks down at the source, before the embedding model or the LLM becomes the limiting factor.
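
One way to picture the cleaning step in this scenario is a heuristic that drops lines repeated across most pages, since headers, footers, and navigation bars tend to recur verbatim. The sketch below is an assumption-laden illustration: the function name `strip_repeated_lines` and the `min_fraction` threshold are hypothetical, and it is not a method prescribed by the exam or by Databricks.

```python
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop lines that appear on at least `min_fraction` of the pages.

    `pages` is a list of page texts. Blank lines are dropped as well.
    This is a heuristic: content that legitimately repeats on most
    pages would also be removed, so the threshold needs checking
    against real documents.
    """
    if len(pages) < 2:
        # With a single page there is nothing to compare against.
        return list(pages)

    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})

    threshold = min_fraction * len(pages)
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [
            line.strip()
            for line in page.splitlines()
            if line.strip() and line.strip() not in repeated
        ]
        cleaned.append("\n".join(kept))
    return cleaned
```

Running this before chunking and embedding keeps the recurring boilerplate out of the index, which is the inspection point the question is steering toward.
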
Source-prep questions usually start by asking whether the content is actually machine-readable. If the input is scanned image text, think OCR. If users keep asking things the app cannot answer, inspect whether the right source documents are present at all. The exam often rewards fixing source readiness before touching prompts or model size.
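
The machine-readability check can be sketched as a simple test on whatever text a non-OCR extractor returned for each page: if a page yields almost no alphanumeric characters, it is probably a scanned image. The names `needs_ocr` and `pages_needing_ocr` and the `min_alnum_chars` default are hypothetical, and the threshold is an assumption that would need tuning on real documents.

```python
def needs_ocr(extracted_text: str, min_alnum_chars: int = 20) -> bool:
    """Return True when a page's extracted text looks too empty to be real text."""
    alnum = sum(ch.isalnum() for ch in extracted_text)
    return alnum < min_alnum_chars


def pages_needing_ocr(page_texts, min_alnum_chars: int = 20):
    """Return the zero-based indexes of pages that should go to an OCR path."""
    return [
        index
        for index, text in enumerate(page_texts)
        if needs_ocr(text, min_alnum_chars)
    ]
```

A check like this only covers readiness of the documents already collected; whether the right documents are present at all is a separate coverage question that no extraction step can fix.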