Question 9 - Top Amazon/AWS DEA-C01 Real Exam Questions [March 2026 Update]

Q: 9

A company uses Amazon S3 to store semi-structured data in a transactional data lake. Some of the data files are small, but other data files are tens of terabytes. A data engineer must perform a change data capture (CDC) operation to identify changed data from the data source. The data source sends a full snapshot as a JSON file every day and ingests the changed data into the data lake. Which solution will capture the changed data MOST cost-effectively?

Options

Correct Answer:

Explanation

The most cost-effective and scalable solution for performing Change Data Capture (CDC) on large, file-based snapshots in a data lake is to use an open-source transactional data lake format like Apache Hudi, Apache Iceberg, or Delta Lake. These formats are specifically designed to bring database-like capabilities, such as MERGE (upsert) operations, directly to data stored in Amazon S3. Using a service like AWS Glue or Amazon EMR, the daily snapshot can be efficiently merged with the existing data lake table, applying only the inserts and updates. This avoids the high cost and architectural complexity of loading terabyte-scale files into a relational database.
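To make the merge step concrete, here is a minimal sketch in plain Python of what a record-level upsert against a daily full snapshot amounts to. It does not use the Apache Hudi, Iceberg, or Delta Lake APIs. The function names (`diff_snapshots`, `apply_changes`), the `id` record key, and the in-memory dictionary standing in for the data lake table are all assumptions made for illustration. A table format would do the equivalent work at scale on files in Amazon S3.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def diff_snapshots(
    previous: List[Record], current: List[Record], key: str = "id"
) -> Tuple[List[Record], List[Record], List[Any]]:
    """Compare two full snapshots and return (inserts, updates, deleted_keys).

    `key` is a hypothetical record key; real data would define its own.
    """
    prev_by_key = {r[key]: r for r in previous}
    curr_by_key = {r[key]: r for r in current}

    # Records whose key appears only in the new snapshot.
    inserts = [r for k, r in curr_by_key.items() if k not in prev_by_key]
    # Records whose key exists in both snapshots but whose content differs.
    updates = [
        r for k, r in curr_by_key.items()
        if k in prev_by_key and prev_by_key[k] != r
    ]
    # Keys that disappeared from the new snapshot.
    deleted_keys = [k for k in prev_by_key if k not in curr_by_key]
    return inserts, updates, deleted_keys


def apply_changes(
    table: Dict[Any, Record],
    inserts: List[Record],
    updates: List[Record],
    deleted_keys: List[Any],
    key: str = "id",
) -> Dict[Any, Record]:
    """Return a new table with only the changed records written (an upsert)."""
    merged = dict(table)
    for record in inserts + updates:
        merged[record[key]] = record
    for k in deleted_keys:
        merged.pop(k, None)
    return merged


if __name__ == "__main__":
    yesterday = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
    today = [{"id": 1, "v": "a"}, {"id": 2, "v": "B"}, {"id": 4, "v": "d"}]

    ins, upd, dele = diff_snapshots(yesterday, today)
    table = {r["id"]: r for r in yesterday}
    print(apply_changes(table, ins, upd, dele))
```

In this toy example record 1 is untouched, so only records 2, 3, and 4 are written or removed. Touching only the changed records is what keeps the merge cheap relative to rewriting the whole table.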

References

1. AWS Big Data Blog

"Implement a CDC-based ETL pipeline using Amazon S3, AWS Glue, and Apache Hudi": This article details the exact pattern described in the correct answer. It states, "Apache Hudi enables you to manage data at the record level in Amazon S3 to perform inserts, updates, and deletes... This helps in use cases like change data capture (CDC)..." This directly supports using a format like Hudi for the described CDC operation.

2. AWS Glue Developer Guide

"Using transactional data lake frameworks with AWS Glue": The documentation confirms native support for these formats. "AWS Glue supports the open-source transactional data lake frameworks: Apache Hudi, Apache Iceberg, and Linux Foundation Delta Lake. These frameworks allow you to run ACID transactions on your Amazon S3 based data lake." This shows that the solution in option C is a well-supported pattern on AWS.

3. AWS Lambda Developer Guide

"Lambda quotas": The official documentation lists the "Function timeout" as 900 seconds (15 minutes). This technical limitation makes option A infeasible for processing terabyte-scale files, which would take significantly longer.
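A rough back-of-the-envelope check of that timeout argument, as a sketch: the 900-second limit comes from the reference above, while the 10 TB file size (the low end of "tens of terabytes"), the decimal unit conversion, and the helper name `required_throughput_gb_per_s` are assumptions chosen for illustration.

```python
LAMBDA_TIMEOUT_SECONDS = 900  # "Function timeout" quoted in the reference above


def required_throughput_gb_per_s(
    file_size_tb: float, timeout_s: int = LAMBDA_TIMEOUT_SECONDS
) -> float:
    """Sustained read rate needed to scan one file within a single invocation."""
    file_size_gb = file_size_tb * 1000  # decimal units: 1 TB = 1000 GB (assumption)
    return file_size_gb / timeout_s


if __name__ == "__main__":
    # Assumed 10 TB file, the low end of "tens of terabytes".
    print(f"{required_throughput_gb_per_s(10):.2f} GB/s")
```

Under these assumptions a single invocation would have to sustain roughly 11 GB/s for the full 15 minutes just to read the file once, before any parsing or comparison work.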

4. AWS Database Migration Service User Guide

"What Is AWS Database Migration Service?": The guide describes AWS DMS as a tool to "migrate your data to and from most widely used commercial and open-source databases." Its primary use case is database-to-database replication, not performing CDC on flat files in S3. This makes options B and D architecturally inappropriate.
