DEA-C01
Q: 1
A company extracts approximately 1 TB of data every day from data sources such as SAP HANA,
Microsoft SQL Server, MongoDB, Apache Kafka, and Amazon DynamoDB. Some of the data sources
have undefined data schemas or data schemas that change.
A data engineer must implement a solution that can detect the schema for these data sources. The
solution must extract, transform, and load the data to an Amazon S3 bucket. The company has a
service level agreement (SLA) to load the data into the S3 bucket within 15 minutes of data creation.
Which solution will meet these requirements with the LEAST operational overhead?
Options
Q: 2
A data engineer must orchestrate a data pipeline that consists of one AWS Lambda function and one
AWS Glue job. The solution must integrate with AWS services.
Which solution will meet these requirements with the LEAST management overhead?
Options
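A note on the pattern behind this question: AWS Step Functions has native service integrations for both Lambda and Glue, so the whole pipeline can be one state machine with no orchestration code to manage. A minimal sketch, expressed as a Python dict in Amazon States Language; the function name, job name, and the ordering (Lambda first, then Glue) are illustrative assumptions, not values from the question.

import json

state_machine = {
    "StartAt": "RunLambda",
    "States": {
        "RunLambda": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": "MyFunction"},  # placeholder name
            "Next": "RunGlueJob",
        },
        "RunGlueJob": {
            "Type": "Task",
            # The .sync suffix makes Step Functions wait for the Glue job to finish
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "my-glue-job"},  # placeholder name
            "End": True,
        },
    },
}
print(json.dumps(state_machine, indent=2))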
Q: 3
A technology company currently uses Amazon Kinesis Data Streams to collect log data in real time.
The company wants to use Amazon Redshift for downstream real-time queries and to enrich the log
data.
Which solution will ingest data into Amazon Redshift with the LEAST operational overhead?
Options
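For context on the pattern this question tests: Amazon Redshift streaming ingestion can read a Kinesis data stream directly through an external schema plus a materialized view, with no intermediate delivery service. A sketch using the Redshift Data API; the stream, role, cluster, and view names are all placeholders.

import boto3

statements = [
    # Map this account's Kinesis streams into a Redshift schema
    """CREATE EXTERNAL SCHEMA kinesis_schema
       FROM KINESIS
       IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-streaming';""",
    # Materialize the stream; AUTO REFRESH keeps it near real time
    """CREATE MATERIALIZED VIEW log_events AUTO REFRESH YES AS
       SELECT approximate_arrival_timestamp,
              JSON_PARSE(FROM_VARBYTE(kinesis_data, 'utf-8')) AS payload
       FROM kinesis_schema."my-log-stream";""",
]
client = boto3.client("redshift-data")
for sql in statements:
    client.execute_statement(
        ClusterIdentifier="my-cluster", Database="dev", DbUser="awsuser", Sql=sql
    )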
Q: 4
A company uses an Amazon Redshift provisioned cluster as its database. The Redshift cluster has five
reserved ra3.4xlarge nodes and uses key distribution.
A data engineer notices that one of the nodes frequently has a CPU load over 90%. SQL queries that
run on the node are queued. The other four nodes usually have a CPU load under 15% during daily
operations.
The data engineer wants to maintain the current number of compute nodes. The data engineer also
wants to balance the load more evenly across all five compute nodes.
Which solution will meet these requirements?
Options
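A note on diagnosing this scenario: one persistently hot node under KEY distribution usually means the distribution key is skewed, and svv_table_info reports that skew per table. The sketch below assumes an invented table name; changing the distribution style rebalances rows across all five nodes without resizing the cluster.

import boto3

client = boto3.client("redshift-data")

# Rank tables by row skew to find the one overloading the hot node
diagnose = 'SELECT "table", skew_rows FROM svv_table_info ORDER BY skew_rows DESC;'
# Spread the skewed table's rows evenly across the nodes
fix = "ALTER TABLE sales ALTER DISTSTYLE EVEN;"  # table name is a placeholder

for sql in (diagnose, fix):
    client.execute_statement(
        ClusterIdentifier="my-cluster", Database="dev", DbUser="awsuser", Sql=sql
    )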
Q: 5
A data engineer is configuring an AWS Glue job to read data from an Amazon S3 bucket. The data
engineer has set up the necessary AWS Glue connection details and an associated IAM role.
However, when the data engineer attempts to run the AWS Glue job, the data engineer receives an
error message that indicates that there are problems with the Amazon S3 VPC gateway endpoint.
The data engineer must resolve the error and connect the AWS Glue job to the S3 bucket.
Which solution will meet this requirement?
Options
Q: 6
A company loads transaction data for each day into Amazon Redshift tables at the end of each day.
The company wants to have the ability to track which tables have been loaded and which tables still
need to be loaded.
A data engineer wants to store the load statuses of Redshift tables in an Amazon DynamoDB table.
The data engineer creates an AWS Lambda function to publish the details of the load statuses to
DynamoDB.
How should the data engineer invoke the Lambda function to write load statuses to the DynamoDB
table?
Options
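Whatever invocation mechanism the options propose, the function body itself stays simple. A minimal sketch of the Lambda function the question describes, assuming an invented DynamoDB table name and event shape:

import boto3

dynamodb = boto3.resource("dynamodb")
status_table = dynamodb.Table("redshift_load_statuses")  # placeholder table name

def lambda_handler(event, context):
    # The event is assumed to carry the Redshift table name and its load outcome
    status_table.put_item(
        Item={
            "table_name": event["table_name"],
            "load_date": event["load_date"],
            "status": event["status"],  # for example "LOADED" or "PENDING"
        }
    )
    return {"written": event["table_name"]}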
Q: 7
A company maintains multiple extract, transform, and load (ETL) workflows that ingest data from the
company's operational databases into an Amazon S3-based data lake. The ETL workflows use AWS
Glue and Amazon EMR to process data.
The company wants to improve the existing architecture to provide automated orchestration and to
require minimal manual effort.
Which solution will meet these requirements with the LEAST operational overhead?
Options
Q: 8
A company has multiple applications that use datasets that are stored in an Amazon S3 bucket. The
company has an ecommerce application that generates a dataset that contains personally
identifiable information (PII). The company has an internal analytics application that does not require
access to the PII.
To comply with regulations, the company must not share PII unnecessarily. A data engineer needs to
implement a solution that will redact PII dynamically, based on the needs of each application that
accesses the dataset.
Which solution will meet these requirements with the LEAST operational overhead?
Options
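One service built for exactly this per-application redaction is S3 Object Lambda: each application reads through its own access point, and a Lambda function rewrites the object in flight for the analytics application. A sketch, assuming JSON objects and invented PII field names:

import json
import urllib.request

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    ctx = event["getObjectContext"]
    # Fetch the original object through the presigned URL Object Lambda supplies
    original = urllib.request.urlopen(ctx["inputS3Url"]).read()
    record = json.loads(original)
    for pii_field in ("email", "ssn"):  # assumed field names
        record.pop(pii_field, None)
    # Return the redacted object to the caller
    s3.write_get_object_response(
        RequestRoute=ctx["outputRoute"],
        RequestToken=ctx["outputToken"],
        Body=json.dumps(record).encode(),
    )
    return {"status_code": 200}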
Q: 9
A company uses Amazon S3 to store semi-structured data in a transactional data lake. Some of the
data files are small, but other data files are tens of terabytes.
A data engineer must perform a change data capture (CDC) operation to identify changed data from
the data source. The data source sends a full snapshot as a JSON file every day. The data engineer
must ingest only the changed data into the data lake.
Which solution will capture the changed data MOST cost-effectively?
Options
Q: 10
During a security review, a company identified a vulnerability in an AWS Glue job. The company
discovered that credentials to access an Amazon Redshift cluster were hard coded in the job script.
A data engineer must remediate the security vulnerability in the AWS Glue job. The solution must
securely store the credentials.
Which combination of steps should the data engineer take to meet these requirements? (Choose
two.)
Options
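The standard remediation pattern here is to move the credentials into AWS Secrets Manager and fetch them at run time inside the Glue job script. A sketch, with an invented secret name and key layout:

import json

import boto3

def get_redshift_credentials(secret_name="redshift/etl-user"):  # placeholder name
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_name)
    secret = json.loads(response["SecretString"])
    return secret["username"], secret["password"]

# The Glue job's IAM role also needs secretsmanager:GetSecretValue on the secret.
username, password = get_redshift_credentials()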
Q: 11
A company is building a data stream processing application. The application runs in an Amazon
Elastic Kubernetes Service (Amazon EKS) cluster. The application stores processed data in an Amazon
DynamoDB table.
The company needs the application containers in the EKS cluster to have secure access to the
DynamoDB table. The company does not want to embed AWS credentials in the containers.
Which solution will meet these requirements?
Options
Q: 12
A company uses an Amazon QuickSight dashboard to monitor usage of one of the company's
applications. The company uses AWS Glue jobs to process data for the dashboard. The company
stores the data in a single Amazon S3 bucket. The company adds new data every day.
A data engineer discovers that dashboard queries are becoming slower over time. The data engineer
determines that the root cause of the slowing queries is long-running AWS Glue jobs.
Which actions should the data engineer take to improve the performance of the AWS Glue jobs?
(Choose two.)
Options
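Two common levers for this scenario are Glue job bookmarks, so each run processes only newly arrived files, and partitioned columnar output, so dashboard queries scan less data. A sketch of a Glue ETL script using both; the S3 paths and partition column are assumptions:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)  # bookmarks themselves are enabled in the job settings

# With bookmarks enabled, this reads only data the job has not seen before
frame = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/raw/"]},
    format="json",
    transformation_ctx="source",  # required for bookmark tracking
)

# Write partitioned Parquet so downstream queries prune partitions and scan less
glue_context.write_dynamic_frame.from_options(
    frame=frame,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/",
                        "partitionKeys": ["ingest_date"]},
    format="parquet",
    transformation_ctx="sink",
)
job.commit()  # advances the bookmark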
Q: 13
A retail company has a customer data hub in an Amazon S3 bucket. Employees from many countries
use the data hub to support company-wide analytics. A governance team must ensure that the
company's data analysts can access data only for customers who are within the same country as the
analysts.
Which solution will meet these requirements with the LEAST operational effort?
Options
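One low-effort mechanism that matches this requirement is a Lake Formation row-level data cells filter, granted per analyst group. The sketch below is an assumption about the setup, with invented catalog, database, table, and column values:

import boto3

boto3.client("lakeformation").create_data_cells_filter(
    TableData={
        "TableCatalogId": "111122223333",  # placeholder account ID
        "DatabaseName": "customer_hub",    # placeholder
        "TableName": "customers",          # placeholder
        "Name": "us_analysts_only",
        # Analysts granted this filter see only rows for their own country
        "RowFilter": {"FilterExpression": "country = 'US'"},
        "ColumnWildcard": {},              # all columns; only rows are restricted
    }
)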
Q: 14
A data engineer is building a data pipeline on AWS by using AWS Glue extract, transform, and load
(ETL) jobs. The data engineer needs to process data from Amazon RDS and MongoDB, perform
transformations, and load the transformed data into Amazon Redshift for analytics. The data updates
must occur every hour.
Which combination of tasks will meet these requirements with the LEAST operational overhead?
(Choose two.)
Options
Q: 15
A company's data engineer needs to optimize the performance of SQL queries on tables. The company
stores data in an Amazon Redshift cluster. The data engineer cannot increase the size of the cluster
because of budget constraints.
The company stores the data in multiple tables and loads the data by using the EVEN distribution
style. Some tables are hundreds of gigabytes in size. Other tables are less than 10 MB in size.
Which solution will meet these requirements?
Options
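The usual tuning move for this mix of table sizes is to replicate the tiny tables with DISTSTYLE ALL and give the large tables a join-friendly DISTKEY instead of EVEN; neither change requires a bigger cluster. A sketch in which every table and column name is invented:

import boto3

client = boto3.client("redshift-data")

statements = [
    # Copy the small (<10 MB) lookup table to every node: no redistribution at join time
    "ALTER TABLE country_codes ALTER DISTSTYLE ALL;",
    # Collocate the large table on its most common join column
    "ALTER TABLE transactions ALTER DISTKEY customer_id;",
]
for sql in statements:
    client.execute_statement(
        ClusterIdentifier="my-cluster", Database="dev", DbUser="awsuser", Sql=sql
    )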
Q: 16
A company uses Amazon Redshift for its data warehouse. The company must automate refresh
schedules for Amazon Redshift materialized views.
Which solution will meet this requirement with the LEAST effort?
Options
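The least-effort mechanism Redshift itself offers for this is creating the materialized view with AUTO REFRESH YES, so Redshift schedules the refreshes. A sketch with placeholder view and table names:

import boto3

sql = """
CREATE MATERIALIZED VIEW daily_sales_mv   -- placeholder view name
AUTO REFRESH YES
AS
SELECT sale_date, SUM(amount) AS total_amount
FROM sales                                -- placeholder table
GROUP BY sale_date;
"""

boto3.client("redshift-data").execute_statement(
    ClusterIdentifier="my-cluster", Database="dev", DbUser="awsuser", Sql=sql
)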
Q: 17
A retail company stores order information in an Amazon Aurora table named Orders. The company
needs to create operational reports from the Orders table with minimal latency. The Orders table
contains billions of rows, and over 100,000 transactions can occur each second.
A marketing team needs to join the Orders data with an Amazon Redshift table named Campaigns in
the marketing team's data warehouse. The operational Aurora database must not be affected.
Which solution will meet these requirements with the LEAST operational effort?
Options
Q: 18
A company currently uses a provisioned Amazon EMR cluster that includes general purpose Amazon
EC2 instances. The EMR cluster uses EMR managed scaling between one and five task nodes for the
company's long-running Apache Spark extract, transform, and load (ETL) job. The company runs the
ETL job every day.
When the company runs the ETL job, the EMR cluster quickly scales up to five nodes. The EMR cluster
often reaches maximum CPU usage, but the memory usage remains under 30%.
The company wants to modify the EMR cluster configuration to reduce the EMR costs to run the daily
ETL job.
Which solution will meet these requirements MOST cost-effectively?
Options
Q: 19
A company currently stores all of its data in Amazon S3 by using the S3 Standard storage class.
A data engineer examined data access patterns to identify trends. During the first 6 months, most
data files are accessed several times each day. Between 6 months and 2 years, most data files are
accessed once or twice each month. After 2 years, data files are accessed only once or twice each
year.
The data engineer needs to use an S3 Lifecycle policy to develop new data storage rules. The new
storage solution must continue to provide high availability.
Which solution will meet these requirements in the MOST cost-effective way?
Options
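Mapping the stated access pattern onto storage classes: Standard for the first 6 months, a transition at 180 days to S3 Standard-IA (which stays multi-AZ, preserving availability, unlike One Zone-IA), then a transition at 730 days to a Glacier class. A sketch of that lifecycle rule; the bucket name and the choice of GLACIER as the archive tier are assumptions:

import boto3

boto3.client("s3").put_bucket_lifecycle_configuration(
    Bucket="my-data-bucket",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-by-age",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Transitions": [
                    {"Days": 180, "StorageClass": "STANDARD_IA"},  # ~6 months
                    {"Days": 730, "StorageClass": "GLACIER"},      # ~2 years
                ],
            }
        ]
    },
)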
Q: 20
A company needs to load customer data that comes from a third party into an Amazon Redshift data
warehouse. The company stores order data and product data in the same data warehouse. The
company wants to use the combined dataset to identify potential new customers.
A data engineer notices that one of the fields in the source data includes values that are in JSON
format.
How should the data engineer load the JSON data into the data warehouse with the LEAST effort?
Options
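The lowest-effort route Redshift provides for a JSON-typed field is the SUPER data type: land the value as-is and query it later with PartiQL dot notation. A sketch with invented table, column, bucket, and role names:

import boto3

statements = [
    "CREATE TABLE customers (customer_id INT, raw_profile SUPER);",
    # COPY can load the JSON field directly into the SUPER column
    """COPY customers FROM 's3://my-bucket/customers/'
       IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
       FORMAT JSON 'auto';""",
    # PartiQL navigation into the semi-structured column
    "SELECT customer_id, raw_profile.address.city FROM customers;",
]
client = boto3.client("redshift-data")
for sql in statements:
    client.execute_statement(
        ClusterIdentifier="my-cluster", Database="dev", DbUser="awsuser", Sql=sql
    )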
Q: 21
A data engineer needs to use an Amazon QuickSight dashboard that is based on Amazon Athena
queries on data that is stored in an Amazon S3 bucket. When the data engineer connects to the
QuickSight dashboard, the data engineer receives an error message that indicates insufficient
permissions.
Which factors could cause the permissions-related errors? (Choose two.)
Options
Q: 22
A company is planning to use a provisioned Amazon EMR cluster that runs Apache Spark jobs to
perform big data analysis. The company requires high reliability. A big data team must follow best
practices for running cost-optimized and long-running workloads on Amazon EMR. The team must
find a solution that will maintain the company's current level of performance.
Which combination of resources will meet these requirements MOST cost-effectively? (Choose two.)
Options
Q: 23
A company has a production AWS account that runs company workloads. The company's security
team created a security AWS account to store and analyze security logs from the production AWS
account. The security logs in the production AWS account are stored in Amazon CloudWatch Logs.
The company needs to use Amazon Kinesis Data Streams to deliver the security logs to the security
AWS account.
Which solution will meet these requirements?
Options
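A compressed sketch of the cross-account pattern this question describes: the security account exposes its Kinesis stream through a CloudWatch Logs destination, and the production account attaches a subscription filter to that destination. Every ARN, name, and log group below is a placeholder, and the destination also needs an access policy (put_destination_policy) that allows the production account to subscribe:

import boto3

# Run with credentials for the security account
security_logs = boto3.client("logs")
destination = security_logs.put_destination(
    destinationName="security-logs-destination",
    targetArn="arn:aws:kinesis:us-east-1:222222222222:stream/security-logs",
    roleArn="arn:aws:iam::222222222222:role/cwl-to-kinesis",
)

# Run with credentials for the production account
production_logs = boto3.client("logs")
production_logs.put_subscription_filter(
    logGroupName="/workloads/app",
    filterName="to-security-account",
    filterPattern="",  # empty pattern forwards every log event
    destinationArn=destination["destination"]["arn"],
)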
Q: 24
A company has a frontend ReactJS website that uses Amazon API Gateway to invoke REST APIs. The
APIs perform the functionality of the website. A data engineer needs to write a Python script that
can be occasionally invoked through API Gateway. The code must return results to API Gateway.
Which solution will meet these requirements with the LEAST operational overhead?
Options
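The lightest-weight fit for an occasionally invoked Python script behind API Gateway is a Lambda function with a proxy integration, where the handler returns a statusCode/body envelope that API Gateway turns into the HTTP response. A minimal sketch; the payload is illustrative:

import json

def lambda_handler(event, context):
    # event carries the HTTP request (path, query string, body) from API Gateway
    result = {"message": "script ran", "path": event.get("path")}
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(result),
    }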
Q: 25
A data engineer needs to create an Amazon Athena table based on a subset of data from an existing
Athena table named cities_world. The cities_world table contains cities that are located around the
world. The data engineer must create a new table named cities_us to contain only the cities from
cities_world that are located in the US.
Which SQL statement should the data engineer use to meet this requirement?
Options
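The statement the question is after has the shape of an Athena CTAS query: create the new table from a filtered SELECT. The country column name and its 'us' value are assumptions about the cities_world schema, as are the database and results location:

import boto3

ctas = """
CREATE TABLE cities_us AS
SELECT *
FROM cities_world
WHERE country = 'us';
"""

boto3.client("athena").start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "my_database"},                  # placeholder
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder
)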