Q: 1
A company extracts approximately 1 TB of data every day from data sources such as SAP HANA,
Microsoft SQL Server, MongoDB, Apache Kafka, and Amazon DynamoDB. Some of the data sources
have undefined data schemas or data schemas that change.
A data engineer must implement a solution that can detect the schema for these data sources. The
solution must extract, transform, and load the data to an Amazon S3 bucket. The company has a
service level agreement (SLA) to load the data into the S3 bucket within 15 minutes of data creation.
Which solution will meet these requirements with the LEAST operational overhead?
Options
Discussion
B. AWS Glue is built for this scenario, especially with changing or undefined schemas.
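For anyone who wants to see the schema-detection side concretely, here's a minimal boto3 sketch of a Glue crawler scheduled inside the 15-minute SLA window. The crawler name, role ARN, database, and S3 path are all placeholders, not values from the question:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names: crawler, role, database, and path are placeholders.
glue.create_crawler(
    Name="daily-ingest-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="data_lake",
    Targets={"S3Targets": [{"Path": "s3://example-landing-bucket/raw/"}]},
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",  # pick up schema changes
        "DeleteBehavior": "DEPRECATE_IN_DATABASE",
    },
    # Run every 15 minutes to stay inside the SLA window.
    Schedule="cron(0/15 * * * ? *)",
)
glue.start_crawler(Name="daily-ingest-crawler")
```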
Q: 2
A data engineer must orchestrate a data pipeline that consists of one AWS Lambda function and one
AWS Glue job. The solution must integrate with AWS services.
Which solution will meet these requirements with the LEAST management overhead?
Options
Discussion
A, imo. Step Functions is fully managed and integrates natively with Lambda and Glue, so there's barely any infrastructure to handle. The other options need you to run EC2 or EKS, which means more ops overhead. Pretty sure that's what they want here.
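Not the official answer, but a rough sketch of what that state machine could look like, defined and created with boto3. The function/job ARNs, names, and the execution role are placeholders:

```python
import json
import boto3

# Two-step workflow: Lambda first, then the Glue job (all ARNs are placeholders).
definition = {
    "StartAt": "RunLambda",
    "States": {
        "RunLambda": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:prepare-data",
            "Next": "RunGlueJob",
        },
        "RunGlueJob": {
            "Type": "Task",
            # .sync makes Step Functions wait for the Glue job to finish.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "transform-job"},
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="etl-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsExecRole",
)
```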
Q: 3
A technology company currently uses Amazon Kinesis Data Streams to collect log data in real time.
The company wants to use Amazon Redshift for downstream real-time queries and to enrich the log
data.
Which solution will ingest data into Amazon Redshift with the LEAST operational overhead?
Options
Discussion
Saw something like this in a practice test; it points to D. Redshift streaming ingestion has barely any setup compared to the others.
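For reference, streaming ingestion is just two SQL statements; here they are run through the Redshift Data API so it stays scriptable. Cluster ID, database, user, role ARN, and stream name are all placeholders:

```python
import boto3

rsd = boto3.client("redshift-data")

# Map the Kinesis stream into Redshift (role ARN is a placeholder).
rsd.execute_statement(
    ClusterIdentifier="log-cluster",
    Database="dev",
    DbUser="admin",
    Sql="""CREATE EXTERNAL SCHEMA kinesis_schema
           FROM KINESIS
           IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftStreamingRole';""",
)

# Materialized view over the stream; AUTO REFRESH keeps it near real time.
rsd.execute_statement(
    ClusterIdentifier="log-cluster",
    Database="dev",
    DbUser="admin",
    Sql="""CREATE MATERIALIZED VIEW log_stream_mv AUTO REFRESH YES AS
           SELECT approximate_arrival_timestamp,
                  json_parse(kinesis_data) AS payload
           FROM kinesis_schema."app-log-stream";""",
)
```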
Q: 4
A company uses an Amazon Redshift provisioned cluster as its database. The Redshift cluster has five
reserved ra3.4xlarge nodes and uses key distribution.
A data engineer notices that one of the nodes frequently has a CPU load over 90%. SQL queries that
run on the node are queued. The other four nodes usually have a CPU load under 15% during daily
operations.
The data engineer wants to maintain the current number of compute nodes. The data engineer also
wants to balance the load more evenly across all five compute nodes.
Which solution will meet these requirements?
Options
Discussion
Option B looks right. If one node is overloaded and others are mostly idle, that's usually a sign the distribution key isn't set well and data isn't spread out. Picking a column with high cardinality should make things more balanced across nodes. Pretty sure, but open to other ideas.
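If it helps, changing the distribution key is a one-liner; here it is via the Redshift Data API. The table and column names are hypothetical, just illustrating the high-cardinality idea:

```python
import boto3

rsd = boto3.client("redshift-data")

# Repoint the distribution key to a high-cardinality column so rows spread
# evenly across all five nodes; cluster, table, and column are placeholders.
rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="admin",
    Sql="ALTER TABLE sales ALTER DISTKEY order_id;",
)
```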
Q: 5
A data engineer is configuring an AWS Glue job to read data from an Amazon S3 bucket. The data
engineer has set up the necessary AWS Glue connection details and an associated IAM role.
However, when the data engineer attempts to run the AWS Glue job, the data engineer receives an
error message that indicates that there are problems with the Amazon S3 VPC gateway endpoint.
The data engineer must resolve the error and connect the AWS Glue job to the S3 bucket.
Which solution will meet this requirement?
Options
Discussion
No comments yet.
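For anyone who lands here: the usual fix is to make sure an S3 gateway endpoint exists in the VPC and is associated with the route table of the subnet the Glue connection uses. A minimal boto3 sketch (the VPC ID, route table ID, and Region are placeholders):

```python
import boto3

ec2 = boto3.client("ec2")

# Create the S3 gateway endpoint and attach it to the route table used by
# the Glue connection's subnet. All IDs below are placeholders.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0abc1234",
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0def5678"],
)
```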
Q: 6
A company loads transaction data for each day into Amazon Redshift tables at the end of each day.
The company wants to have the ability to track which tables have been loaded and which tables still
need to be loaded.
A data engineer wants to store the load statuses of Redshift tables in an Amazon DynamoDB table.
The data engineer creates an AWS Lambda function to publish the details of the load statuses to
DynamoDB.
How should the data engineer invoke the Lambda function to write load statuses to the DynamoDB
table?
Options
Discussion
Option B
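Whatever invocation path the correct option uses, the Lambda side is a small put_item. A minimal sketch, with the table name, key schema, and event fields all hypothetical:

```python
import boto3
from datetime import datetime, timezone

dynamodb = boto3.resource("dynamodb")
# "redshift_load_status" is a placeholder table name.
table = dynamodb.Table("redshift_load_status")

def lambda_handler(event, context):
    # Assumes the invoking event carries the Redshift table name and outcome.
    table.put_item(
        Item={
            "table_name": event["table_name"],
            "load_date": datetime.now(timezone.utc).isoformat(),
            "status": event.get("status", "LOADED"),
        }
    )
    return {"statusCode": 200}
```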
Q: 7
A company maintains multiple extract, transform, and load (ETL) workflows that ingest data from the
company's operational databases into an Amazon S3 based data lake. The ETL workflows use AWS
Glue and Amazon EMR to process data.
The company wants to improve the existing architecture to provide automated orchestration and to
require minimal manual effort.
Which solution will meet these requirements with the LEAST operational overhead?
Options
Discussion
Nice clear question; matches what I've seen in exam reports. A is the way to go: AWS Glue workflows are built for orchestrating ETL pipelines and are fully managed, so there's less operational effort needed.
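Rough sketch of what a Glue workflow with a scheduled start and a dependency trigger looks like in boto3. The workflow, trigger, and job names are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Container for the orchestrated pipeline (name is a placeholder).
glue.create_workflow(Name="etl-orchestration")

# Scheduled trigger that kicks off the first job inside the workflow.
glue.create_trigger(
    Name="nightly-start",
    WorkflowName="etl-orchestration",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "ingest-operational-db"}],
    StartOnCreation=True,
)

# Conditional trigger: run the transform job only after ingest succeeds.
glue.create_trigger(
    Name="after-ingest",
    WorkflowName="etl-orchestration",
    Type="CONDITIONAL",
    Predicate={
        "Conditions": [{
            "LogicalOperator": "EQUALS",
            "JobName": "ingest-operational-db",
            "State": "SUCCEEDED",
        }]
    },
    Actions=[{"JobName": "transform-to-data-lake"}],
    StartOnCreation=True,
)
```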
Q: 8
A company has multiple applications that use datasets that are stored in an Amazon S3 bucket. The
company has an ecommerce application that generates a dataset that contains personally
identifiable information (PII). The company has an internal analytics application that does not require
access to the PII.
To comply with regulations, the company must not share PII unnecessarily. A data engineer needs to
implement a solution that will redact PII dynamically, based on the needs of each application that
accesses the dataset.
Which solution will meet the requirements with the LEAST operational overhead?
Options
Discussion
It's B; saw a similar question in exam reports. S3 Object Lambda lets you do dynamic redaction without duplicating data.
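For context, the Object Lambda function sits in the GET path and rewrites the response. A minimal handler sketch, assuming the dataset is a JSON array of records; the PII field names are hypothetical:

```python
import json
import urllib.request
import boto3

s3 = boto3.client("s3")

# Hypothetical redaction: drop keys we treat as PII from each JSON record.
PII_FIELDS = {"name", "email", "phone"}

def lambda_handler(event, context):
    ctx = event["getObjectContext"]
    # Fetch the original object through the presigned URL S3 provides.
    original = urllib.request.urlopen(ctx["inputS3Url"]).read()

    records = json.loads(original)
    redacted = [
        {k: v for k, v in rec.items() if k not in PII_FIELDS}
        for rec in records
    ]

    # Return the transformed object to the caller.
    s3.write_get_object_response(
        Body=json.dumps(redacted).encode(),
        RequestRoute=ctx["outputRoute"],
        RequestToken=ctx["outputToken"],
    )
    return {"status_code": 200}
```

The PII-generating app reads through the regular bucket endpoint; the analytics app reads through the Object Lambda access point, so there's only one copy of the data.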
Q: 9
A company uses Amazon S3 to store semi-structured data in a transactional data lake. Some of the
data files are small, but other data files are tens of terabytes.
A data engineer must perform a change data capture (CDC) operation to identify changed data from
the data source. The data source sends a full snapshot as a JSON file every day and ingests the
changed data into the data lake.
Which solution will capture the changed data MOST cost-effectively?
Options
Discussion
C. Saw a similar question in some exam reports; open-source lakehouse formats like Hudi or Iceberg are built for this scenario.
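Roughly what that looks like with Hudi in PySpark: write the daily snapshot as an upsert and let Hudi reconcile changed rows against existing ones. Paths, table name, and key columns are placeholders, and the Spark session is assumed to have the Hudi connector configured:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Daily full snapshot from the source (path is a placeholder).
snapshot = spark.read.json("s3://example-bucket/snapshots/2024-01-01/")

hudi_options = {
    "hoodie.table.name": "transactions",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",  # Hudi diffs against existing rows
}

(snapshot.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-bucket/lake/transactions/"))
```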
Q: 10
During a security review, a company identified a vulnerability in an AWS Glue job. The company
discovered that credentials to access an Amazon Redshift cluster were hard coded in the job script.
A data engineer must remediate the security vulnerability in the AWS Glue job. The solution must
securely store the credentials.
Which combination of steps should the data engineer take to meet these requirements? (Choose
two.)
Options
Discussion
D and E, I think. Secrets Manager for storing creds safely, and the IAM role needs permissions to access them. Not totally sure that's it, but it makes sense from practice exams. Can someone who tried this confirm?
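The remediated Glue script would then look something like this: fetch the credentials at run time instead of hard coding them. The secret name and its JSON keys are placeholders, and the job's role is assumed to have secretsmanager:GetSecretValue on that secret:

```python
import json
import boto3

# Pull the Redshift credentials from Secrets Manager at run time
# (secret name and field names below are placeholders).
secrets = boto3.client("secretsmanager")
secret = json.loads(
    secrets.get_secret_value(SecretId="prod/redshift/etl-user")["SecretString"]
)

jdbc_url = f"jdbc:redshift://{secret['host']}:{secret['port']}/{secret['dbname']}"
username = secret["username"]
password = secret["password"]
```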