Study DEA-C01 Data Quality, Consistency and Skew: key concepts, common traps, and exam decision cues.
Data platforms are only as good as the quality checks around them. DEA-C01 expects you to know how missing fields, drift, inconsistency, and skew can distort results even when the pipeline itself technically “runs.”
Data skew: Uneven data distribution that causes partitions or workers to process very different volumes, often hurting performance.
Data validation rule: Check that enforces expectations such as required fields, type constraints, ranges, or uniqueness.
Consistency issue: Mismatch between copies, stages, timestamps, or derived outputs that should agree but do not.
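
A validation rule of the kind defined above can be sketched in a few lines of plain Python. This is only an illustration, not an AWS API: the field names (`order_id`, `customer_id`, `amount`), the allowed range, and the `validate` helper are all hypothetical.

```python
from typing import Any

# Hypothetical mandatory fields for an illustrative "orders" feed.
REQUIRED = ("order_id", "customer_id", "amount")


def validate(rows: list[dict[str, Any]]) -> tuple[list[dict], list[tuple[dict, str]]]:
    """Split rows into accepted rows and (row, reason) quarantine entries."""
    accepted, quarantined = [], []
    seen_ids = set()
    for row in rows:
        missing = [f for f in REQUIRED if row.get(f) in (None, "")]
        amount = row.get("amount")
        if missing:  # required-field rule
            quarantined.append((row, f"missing: {', '.join(missing)}"))
        elif isinstance(amount, bool) or not isinstance(amount, (int, float)):  # type rule
            quarantined.append((row, "amount is not numeric"))
        elif not 0 <= amount <= 100_000:  # range rule (assumed bounds)
            quarantined.append((row, "amount out of range"))
        elif row["order_id"] in seen_ids:  # uniqueness rule
            quarantined.append((row, "duplicate order_id"))
        else:
            seen_ids.add(row["order_id"])
            accepted.append(row)
    return accepted, quarantined


rows = [
    {"order_id": "A1", "customer_id": "C1", "amount": 25.0},
    {"order_id": "A2", "customer_id": "", "amount": 10.0},
    {"order_id": "A3", "customer_id": "C2", "amount": -5},
    {"order_id": "A1", "customer_id": "C3", "amount": 40.0},
]
accepted, quarantined = validate(rows)
reasons = [reason for _, reason in quarantined]
```

Quarantining with a reason, rather than silently dropping rows, keeps the rejected records available for later review.
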
DEA-C01 wants you to notice that “the pipeline succeeded” does not mean “the dataset is trustworthy.” The strongest answers separate correctness, reconciliation, and runtime-balance problems instead of blaming one generic “data quality” bucket.
| Requirement | Strongest first fit | Why |
|---|---|---|
| reject rows with missing mandatory values | validation rule at ingest or transform time | The need is hard data-quality gating |
| detect duplicates before downstream aggregation | deduplication or key-based validation | The issue is record uniqueness |
| identify columns whose distributions have shifted materially | profile and compare quality metrics over time | DEA-C01 expects drift-awareness, not only syntax checks |
| one partition takes far longer than the others | investigate skew and rebalance data layout or keys | The problem is uneven distribution |
| outputs from two stages disagree unexpectedly | consistency checks between source and derived layers | The issue is reconciliation, not just schema shape |
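
The "profile and compare quality metrics over time" row can be illustrated with a small drift check. The sketch below assumes non-empty numeric columns with `None` for missing values; the `profile` and `drifted` helpers and both tolerance values are hypothetical choices, not thresholds taken from the exam guide.

```python
from statistics import mean


def profile(values):
    """Summarize a column: share of missing values and mean of present ones."""
    present = [v for v in values if v is not None]
    return {"null_rate": 1 - len(present) / len(values), "mean": mean(present)}


def drifted(baseline, current, null_rate_tol=0.05, mean_rel_tol=0.25):
    """Return the names of metrics that moved more than the assumed tolerances."""
    flags = []
    if abs(current["null_rate"] - baseline["null_rate"]) > null_rate_tol:
        flags.append("null_rate")
    if abs(current["mean"] - baseline["mean"]) > mean_rel_tol * abs(baseline["mean"]):
        flags.append("mean")
    return flags


last_week = [10, 12, 11, 9, None, 10, 11, 12, 10, 11]
today = [10, None, None, 30, 28, None, 31, 29, None, 30]
flags = drifted(profile(last_week), profile(today))
```

Every value in `today` still parses as a number, so a type-only check would pass; only the comparison against the earlier profile exposes the shift.
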
| If the stem emphasizes… | Think first | Why this fits |
|---|---|---|
| required fields, ranges, or allowed values | validation rules | The problem is field-level correctness |
| source and derived counts do not line up | reconciliation or consistency checks | The problem is stage agreement |
| one partition or worker takes far longer | skew analysis | The problem is uneven data distribution |
| repeated events inflate metrics | deduplication and idempotency controls | The issue is duplicate handling |
| values still parse but look implausible | semantic quality checks and profiling | Type correctness is not enough |
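
For the "source and derived counts do not line up" cue, a reconciliation check might look like the following sketch. The `reconcile` helper and the `bronze`/`curated` key lists are hypothetical; a real pipeline would read the keys from the two stages being compared.

```python
def reconcile(source_keys, derived_keys):
    """Compare record keys between a source stage and a derived stage."""
    source, derived = set(source_keys), set(derived_keys)
    return {
        "source_count": len(source_keys),
        "derived_count": len(derived_keys),
        "missing_in_derived": sorted(source - derived),
        "unexpected_in_derived": sorted(derived - source),
    }


bronze = ["e1", "e2", "e3", "e4"]
curated = ["e1", "e2", "e4", "e4"]  # e3 was lost and e4 was loaded twice
report = reconcile(bronze, curated)
```

In this example the raw counts match (4 and 4) even though `e3` is missing, which is why comparing keys as well as counts gives a stronger consistency check.
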
| Symptom | Better reading |
|---|---|
| required fields are blank | validation and reject-or-quarantine rules |
| row counts between stages diverge unexpectedly | reconciliation and consistency checks |
| one worker runs far longer than the others | skew or hot-key distribution problem |
| the schema still parses but values look implausible | semantic quality problem, not only schema correctness |
| dashboards fluctuate because duplicate events were loaded | deduplication and idempotency controls may be missing |
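
The duplicate-event symptom can be sketched as a key-based deduplication step. This minimal example assumes each event carries a stable `event_id` and an `ingested_at` ordering value; both field names and the `deduplicate` helper are hypothetical.

```python
def deduplicate(events):
    """Keep one record per event_id, preferring the most recently ingested copy."""
    latest = {}
    for event in events:
        key = event["event_id"]
        if key not in latest or event["ingested_at"] > latest[key]["ingested_at"]:
            latest[key] = event
    return list(latest.values())


events = [
    {"event_id": "e1", "ingested_at": 1, "amount": 10},
    {"event_id": "e2", "ingested_at": 2, "amount": 5},
    {"event_id": "e1", "ingested_at": 3, "amount": 10},  # same event loaded again
]
naive_total = sum(e["amount"] for e in events)
clean_total = sum(e["amount"] for e in deduplicate(events))
```

The naive total counts `e1` twice; the deduplicated total does not, which is the difference a dashboard would show.
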
flowchart LR
A["Bad output symptom"] --> B{"What kind of failure is it?"}
B -->|Missing or invalid fields| C["Validation problem"]
B -->|Stage counts or states disagree| D["Consistency / reconciliation problem"]
B -->|Repeated records inflate totals| E["Deduplication problem"]
B -->|One worker or partition drags| F["Skew problem"]
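
The flowchart's branches can also be written as a small lookup, which makes it explicit that one bad output may fall into several categories at once. The symptom labels and the `classify` helper are hypothetical names chosen for this sketch.

```python
from enum import Enum


class Failure(Enum):
    VALIDATION = "Validation problem"
    CONSISTENCY = "Consistency / reconciliation problem"
    DEDUPLICATION = "Deduplication problem"
    SKEW = "Skew problem"


def classify(symptoms: set[str]) -> list[Failure]:
    """Map observed symptoms to every matching branch of the flowchart."""
    # Hypothetical symptom labels, one per flowchart edge.
    branches = {
        "missing_or_invalid_fields": Failure.VALIDATION,
        "stage_counts_disagree": Failure.CONSISTENCY,
        "repeated_records": Failure.DEDUPLICATION,
        "slow_partition": Failure.SKEW,
    }
    return [failure for label, failure in branches.items() if label in symptoms]


found = classify({"slow_partition", "missing_or_invalid_fields", "repeated_records"})
```

Returning a list rather than a single value reflects the point that these problems are separate and can coexist.
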
| Problem type | What it harms first |
|---|---|
| skew | runtime and performance balance |
| missing or invalid values | data correctness |
| duplicate records | aggregation accuracy and trust |
| stage mismatch or stale outputs | reconciliation and downstream confidence |
| distribution drift | model, dashboard, or rule reliability over time |
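
Because skew harms runtime balance rather than correctness, it is usually spotted by measuring how unevenly rows spread across keys or partitions. The sketch below is one simple way to do that; the `skew_report` helper, the sample keys, and the `threshold` of 2x the average are assumptions for illustration.

```python
from collections import Counter


def skew_report(keys, threshold=2.0):
    """Report how far the largest key group sits above the average group size."""
    counts = Counter(keys)
    avg = sum(counts.values()) / len(counts)
    hot = {k: n for k, n in counts.items() if n > threshold * avg}
    return {"max_over_mean": max(counts.values()) / avg, "hot_keys": hot}


# Hypothetical partition keys: one country code dominates the dataset.
keys = ["us"] * 90 + ["de"] * 5 + ["fr"] * 5
report = skew_report(keys)
```

A `max_over_mean` ratio near 1 suggests balanced work; here the `us` group is well above the average, so the worker handling it would be expected to run longest.
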
When quality issues overlap, use this order:
| Trap | Better reading |
|---|---|
| “The job succeeded, so the data must be fine.” | DEA-C01 separates pipeline success from data trustworthiness. |
| “Skew is just another null-value issue.” | Skew is a distribution and performance problem, not a field-validation problem. |
| “A schema check catches all quality issues.” | Correct types do not guarantee correct values, uniqueness, or consistency. |
| “Duplicates are harmless if the table still loads.” | Duplicates can distort downstream metrics and business decisions. |
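
The schema-check trap is easy to demonstrate: a row can satisfy every type constraint and still be wrong. In this sketch the `SCHEMA` mapping, the column names, and the plausibility bounds are hypothetical.

```python
from datetime import date

# Hypothetical schema: column name -> expected Python type.
SCHEMA = {"patient_id": str, "age": int, "visit_date": date}


def passes_schema(row):
    """Type-only check: every column is present with the expected type."""
    return all(isinstance(row.get(col), typ) for col, typ in SCHEMA.items())


def semantic_issues(row, today):
    """Value-level checks that a type-only schema check cannot express."""
    issues = []
    if not 0 <= row["age"] <= 120:  # assumed plausible range
        issues.append("implausible age")
    if row["visit_date"] > today:
        issues.append("visit_date in the future")
    return issues


row = {"patient_id": "p1", "age": 430, "visit_date": date(2999, 1, 1)}
schema_ok = passes_schema(row)
issues = semantic_issues(row, today=date(2024, 1, 1))
```

The row passes the schema check but fails both semantic checks, so schema validation alone would let it through.
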
| Situation | Stronger first answer |
|---|---|
| row counts diverge between bronze and curated layers | reconciliation check |
| a few keys dominate processing time | skew investigation and data-layout rebalance |
| values are present but unrealistic | semantic-quality profiling |
| metrics double because retry behavior re-ingests the same event | deduplication or idempotency controls |
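
One common way to rebalance when a few keys dominate is key salting: hot keys get a suffix so their rows spread across several groups, and a second stage merges the partial results. The sketch below uses plain Python counters to show the idea; `salt_keys`, `merge_partials`, the `#` separator, and the bucket count are hypothetical, and it assumes original keys never contain `#`.

```python
from collections import Counter


def salt_keys(keys, hot_keys, buckets):
    """Append a round-robin salt to hot keys so they spread across buckets."""
    return [
        f"{key}#{i % buckets}" if key in hot_keys else key
        for i, key in enumerate(keys)
    ]


def merge_partials(partials):
    """Second stage: strip the salt and combine partial counts per original key."""
    totals = Counter()
    for salted, n in partials.items():
        totals[salted.split("#")[0]] += n
    return totals


keys = ["us"] * 90 + ["de"] * 5 + ["fr"] * 5
partials = Counter(salt_keys(keys, hot_keys={"us"}, buckets=4))  # first-stage counts
totals = merge_partials(partials)
```

After salting, the largest group shrinks from 90 rows to 23, while the merged totals still equal the unsalted counts. The trade-off is the extra aggregation stage.
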
A nightly aggregation completes successfully, but one partition runs much longer than the rest, some records have blank required fields, and the final dashboard shows inflated counts due to repeat events. What is the strongest reading first?
Correct answer: B. DEA-C01 expects you to recognize that performance balance, field validation, and duplicate control are separate operational quality concerns.