Study DEA-C01 Transform Services and Format Trade-Offs: key concepts, common traps, and exam decision cues.
After data lands, DEA-C01 usually asks which transformation path makes sense. The right answer depends on scale, latency, SQL versus code, format conversion, and the amount of infrastructure you want to manage.
The exam usually tests whether you can match the transformation engine to the shape of the workload.
| Need | Strongest first fit |
|---|---|
| serverless ETL and catalog-aware transforms | AWS Glue |
| Hadoop or Spark cluster control | Amazon EMR |
| warehouse-native transformation and serving | Amazon Redshift pattern |
| lightweight event-driven reshape step | AWS Lambda |
| If the stem emphasizes… | Think first | Why this fits |
|---|---|---|
| low-ops managed ETL with scheduling and catalog integration | Glue | The workload wants managed transformation infrastructure |
| cluster-level control, Spark tuning, or custom framework behavior | EMR | The workload needs more control than a managed ETL path |
| transforms that belong close to warehouse tables and BI-serving models | Redshift-native transform pattern | The data already lives in or should stay near the warehouse serving layer |
| tiny record-level reshape or validation around an event | Lambda | The task is narrow and event-driven, not a full data platform |
| If the stem emphasizes… | Think first | What to keep in view |
|---|---|---|
| low-ops ETL, scheduling, catalog integration | Glue | Managed transforms and schema-aware data workflows |
| Spark jobs, framework control, or custom cluster tuning | EMR | More control, more responsibility |
| transforms close to analytical querying and serving | Redshift pattern | Keep heavy warehouse work close to the warehouse |
| small stateless reshape or validation step | Lambda | Do not force Lambda into large ETL jobs |
| scan efficiency and analytics optimization | Columnar formats and partitioning | Format decisions often matter after engine choice |
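The "do not force Lambda into large ETL jobs" cue can be made concrete with a rough fit check. The sketch below is illustrative only: `TransformStep`, `fits_lambda`, and the `headroom` factor are hypothetical names and assumptions, and the two limits reflect Lambda's documented maximum timeout (15 minutes) and maximum memory (10,240 MB), which should be verified against current service quotas.

```python
from dataclasses import dataclass

# Lambda hard limits as commonly documented; verify against current AWS quotas.
LAMBDA_MAX_TIMEOUT_SECONDS = 900   # 15 minutes
LAMBDA_MAX_MEMORY_MB = 10_240      # 10 GB


@dataclass
class TransformStep:
    """Hypothetical description of one transform step (illustration only)."""
    name: str
    expected_runtime_seconds: float
    peak_memory_mb: int
    needs_shuffle_or_join: bool  # cross-record work such as joins or aggregations


def fits_lambda(step: TransformStep, headroom: float = 0.5) -> bool:
    """Return True only if the step sits comfortably inside Lambda's limits.

    `headroom` is an assumed safety factor: the step may use at most that
    fraction of each limit. Steps that need joins or shuffles are rejected
    outright, which is a simplifying assumption rather than an AWS rule.
    """
    if step.needs_shuffle_or_join:
        return False
    within_time = step.expected_runtime_seconds <= LAMBDA_MAX_TIMEOUT_SECONDS * headroom
    within_memory = step.peak_memory_mb <= LAMBDA_MAX_MEMORY_MB * headroom
    return within_time and within_memory


validate = TransformStep("validate_record", 2, 256, False)
nightly_join = TransformStep("nightly_join", 3600, 64_000, True)
print(fits_lambda(validate), fits_lambda(nightly_join))
```

A step that fails this kind of check is a hint that the stem is really describing Glue or EMR territory.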
flowchart LR
A["Data landed"] --> B{"What kind of transform is this?"}
B -->|Managed ETL with low ops| C["Glue"]
B -->|Custom Spark or cluster control| D["EMR"]
B -->|Warehouse-centered transform| E["Redshift pattern"]
B -->|Small event-driven step| F["Lambda"]
C --> G["Then optimize format and partitioning"]
D --> G
E --> G
F --> G
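The branching in the flowchart can be sketched as a small function. This is a minimal illustration, not an official decision procedure: the `Workload` fields and the `first_fit_engine` name are hypothetical, and the order in which the branches are checked (and Glue as the fallback) is an assumption, since the flowchart itself does not rank them.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical summary of what an exam stem emphasizes."""
    small_event_driven: bool = False
    warehouse_centered: bool = False
    needs_cluster_control: bool = False


def first_fit_engine(w: Workload) -> str:
    """Map a workload shape to the first-fit engine from the flowchart.

    Check order is an assumption: narrow cases are tested first, and
    managed low-ops ETL (Glue) is treated as the default.
    """
    if w.small_event_driven:
        return "Lambda"
    if w.warehouse_centered:
        return "Redshift pattern"
    if w.needs_cluster_control:
        return "EMR"
    return "Glue"


print(first_fit_engine(Workload(needs_cluster_control=True)))
```

Whatever the function returns, the last box of the flowchart still applies: format and partitioning are decided after the engine.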
The exam often rewards patterns that convert data into analytics-friendly columnar formats (such as Parquet or ORC) and partition it along common query filters, rather than leaving everything in its raw landed form.
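To see why partitioning matters for scan efficiency, consider the common Hive-style `key=value` prefix layout. The sketch below uses only the standard library and plain strings; the `sales` table name and the helper names `partition_prefix` and `prune` are hypothetical, and the year/month/day key scheme is an assumed example rather than a required one.

```python
from datetime import date


def partition_prefix(table: str, day: date) -> str:
    """Build a Hive-style partition prefix, e.g. sales/year=2024/month=03/day=07/."""
    return f"{table}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


def prune(prefixes: list[str], year: int, month: int) -> list[str]:
    """Keep only the prefixes whose partition keys match the filter.

    This mimics partition pruning: prefixes that cannot match are skipped
    without reading any of the data stored under them.
    """
    wanted = f"/year={year:04d}/month={month:02d}/"
    return [p for p in prefixes if wanted in p]


days = [date(2024, 2, 28), date(2024, 3, 1), date(2024, 3, 2)]
prefixes = [partition_prefix("sales", d) for d in days]
print(prune(prefixes, 2024, 3))
```

A query filtered to March 2024 only has to touch two of the three prefixes here; with unpartitioned CSV, every file would be scanned.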
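When the transformation answer is not obvious, settle the engine first and the format second: Lambda for a small event-driven step, a Redshift pattern for warehouse-centered work, EMR when cluster control is required, Glue for low-ops managed ETL, and then format and partitioning. The common traps below are misreadings of that order: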
| Trap | Better reading |
|---|---|
| “It mentions Spark, so Glue and EMR are interchangeable.” | The exam still cares about managed/serverless versus cluster-control trade-offs. |
| “It mentions a transform, so Lambda is always cheapest and best.” | Lambda is usually for smaller event-driven steps, not every heavy data-processing job. |
| “File format is secondary, so CSV forever is fine.” | DEA-C01 often rewards moving toward analytics-friendly formats and partitioning. |
| “Because data ends in a warehouse, every transform must happen outside it.” | Some stems are really about a warehouse-native transformation pattern. |
| Situation | Stronger first answer |
|---|---|
| managed ETL with low ops and shared metadata integration | Glue |
| long-running Spark jobs and custom tuning | EMR |
| repeated transforms on curated warehouse data | Redshift-native pattern |
| tiny event-driven cleanup or validation step | Lambda |
| slow analytical scans after transformation | improve format and partitioning strategy |
A data team runs small schema cleanup on ingest, large scheduled joins across many files, and warehouse-serving transforms for BI consumers. The strongest answer usually splits the work instead of forcing one engine everywhere: a lightweight event-driven step where appropriate, managed ETL or Spark for heavier processing, and warehouse-native transformation where the workload is really about analytics serving.