Q: 11
A company is building a data stream processing application. The application runs in an Amazon
Elastic Kubernetes Service (Amazon EKS) cluster. The application stores processed data in an Amazon
DynamoDB table.
The company needs the application containers in the EKS cluster to have secure access to the
DynamoDB table. The company does not want to embed AWS credentials in the containers.
Which solution will meet these requirements?
Options
Discussion
B is correct here. IAM Roles for Service Accounts (IRSA) lets your EKS pods assume an IAM role through an annotated Kubernetes service account, so the containers receive temporary credentials automatically instead of having keys embedded or passed around. Nice clear scenario; I've seen similar ones in practice tests!
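Study note on the mechanics: with IRSA, the EKS pod identity webhook injects a role ARN and a web-identity token path into the pod's environment, and the AWS SDKs pick those up through the normal credential chain (no access keys anywhere in the image). A minimal stdlib sketch of that discovery step, with hypothetical values for illustration:

```python
import os

# Under IRSA, EKS injects these two variables into the pod; AWS SDKs read
# them and call AssumeRoleWithWebIdentity automatically, so no access keys
# ever appear in the container image or its environment.
IRSA_VARS = ("AWS_ROLE_ARN", "AWS_WEB_IDENTITY_TOKEN_FILE")

def irsa_configured(env=os.environ):
    """Return True if the pod has IRSA credentials available."""
    return all(v in env for v in IRSA_VARS)

# Simulated pod environment (hypothetical role and token path):
pod_env = {
    "AWS_ROLE_ARN": "arn:aws:iam::123456789012:role/dynamodb-writer",
    "AWS_WEB_IDENTITY_TOKEN_FILE": "/var/run/secrets/eks.amazonaws.com/serviceaccount/token",
}
print(irsa_configured(pod_env))  # True when both variables are present
```

In a real pod you would never build this check yourself; boto3 and the other SDKs do it transparently.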
Be respectful. No spam.
Q: 12
A company uses an Amazon QuickSight dashboard to monitor usage of one of the company's
applications. The company uses AWS Glue jobs to process data for the dashboard. The company
stores the data in a single Amazon S3 bucket. The company adds new data every day.
A data engineer discovers that dashboard queries are becoming slower over time. The data engineer
determines that the root cause of the slowing queries is long-running AWS Glue jobs.
Which actions should the data engineer take to improve the performance of the AWS Glue jobs?
(Choose two.)
Options
Discussion
C
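Study note (the options aren't reproduced above): a frequent root cause in this scenario is daily files piling up unpartitioned in a single bucket, so every Glue run rescans everything. A common mitigation, a sketch rather than necessarily the keyed answer, is Hive-style date partitioning (often paired with Glue job bookmarks so each run touches only new data):

```python
from datetime import date

def partition_prefix(base: str, d: date) -> str:
    """Hive-style daily partition prefix, e.g. .../year=2024/month=05/day=17/.
    Glue and Athena can prune to just the partitions a query needs."""
    return f"{base}/year={d:%Y}/month={d:%m}/day={d:%d}/"

# Hypothetical bucket name, for illustration only:
print(partition_prefix("s3://app-usage-data", date(2024, 5, 17)))
# s3://app-usage-data/year=2024/month=05/day=17/
```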
Q: 13
A retail company has a customer data hub in an Amazon S3 bucket. Employees from many countries
use the data hub to support company-wide analytics. A governance team must ensure that the
company's data analysts can access data only for customers who are within the same country as the
analysts.
Which solution will meet these requirements with the LEAST operational effort?
Options
Discussion
No comments yet. Be the first to comment.
Q: 14
A data engineer is building a data pipeline on AWS by using AWS Glue extract, transform, and load
(ETL) jobs. The data engineer needs to process data from Amazon RDS and MongoDB, perform
transformations, and load the transformed data into Amazon Redshift for analytics. The data updates
must occur every hour.
Which combination of tasks will meet these requirements with the LEAST operational overhead?
(Choose two.)
Options
Discussion
Q: 15
A company's data engineer needs to optimize the performance of SQL queries that run against the company's tables. The company stores data in an Amazon Redshift cluster. The data engineer cannot increase the size of the cluster because of budget constraints.
The company stores the data in multiple tables and loads the data by using the EVEN distribution
style. Some tables are hundreds of gigabytes in size. Other tables are less than 10 MB in size.
Which solution will meet these requirements?
Options
Discussion
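Study note: the size mix in the stem (multi-hundred-GB tables next to sub-10 MB tables, all loaded with EVEN distribution) points at distribution-style tuning. The usual heuristic is to replicate small dimension tables to every node with DISTSTYLE ALL so joins avoid redistribution, and let Redshift manage the large ones. A toy sketch of that heuristic, with the 10 MB threshold taken from the question rather than any Redshift rule:

```python
def suggest_diststyle(size_mb: float) -> str:
    """Heuristic only: ALL replicates a small table to every node so joins
    need no data movement; AUTO lets Redshift pick KEY/EVEN for big tables."""
    SMALL_TABLE_MB = 10  # threshold from the scenario, not a Redshift limit
    return "ALL" if size_mb < SMALL_TABLE_MB else "AUTO"

print(suggest_diststyle(5))        # ALL  -> small table copied to each node
print(suggest_diststyle(500_000))  # AUTO -> Redshift manages distribution
```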
Q: 16
A company uses Amazon Redshift for its data warehouse. The company must automate refresh
schedules for Amazon Redshift materialized views.
Which solution will meet this requirement with the LEAST effort?
Options
Discussion
B, not C
Q: 17
A retail company stores order information in an Amazon Aurora table named Orders. The company
needs to create operational reports from the Orders table with minimal latency. The Orders table
contains billions of rows, and over 100,000 transactions can occur each second.
A marketing team needs to join the Orders data with an Amazon Redshift table named Campaigns in
the marketing team's data warehouse. The operational Aurora database must not be affected.
Which solution will meet these requirements with the LEAST operational effort?
Options
Discussion
Q: 18
A company currently uses a provisioned Amazon EMR cluster that includes general purpose Amazon
EC2 instances. The EMR cluster uses EMR managed scaling between one and five task nodes for the
company's long-running Apache Spark extract, transform, and load (ETL) job. The company runs the
ETL job every day.
When the company runs the ETL job, the EMR cluster quickly scales up to five nodes. The EMR cluster
often reaches maximum CPU usage, but the memory usage remains under 30%.
The company wants to modify the EMR cluster configuration to reduce the EMR costs to run the daily
ETL job.
Which solution will meet these requirements MOST cost-effectively?
Options
Discussion
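Study note: the utilization figures in the stem (CPU pegged at maximum, memory under 30%) are the classic signature for moving the task nodes from general purpose to compute optimized instances, which buy more vCPU per dollar. A toy heuristic mirroring that reasoning (instance families here are illustrative, not from the question's options):

```python
def pick_instance_family(cpu_util: float, mem_util: float) -> str:
    """Crude right-sizing heuristic based on which resource is the bottleneck."""
    if cpu_util >= 0.9 and mem_util <= 0.3:
        return "compute-optimized (e.g. C5)"   # CPU-bound, memory idle
    if mem_util >= 0.9 and cpu_util <= 0.3:
        return "memory-optimized (e.g. R5)"    # the opposite signature
    return "general purpose (e.g. M5)"         # balanced workload

print(pick_instance_family(0.95, 0.25))  # compute-optimized (e.g. C5)
```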
Q: 19
A company currently stores all of its data in Amazon S3 by using the S3 Standard storage class.
A data engineer examined data access patterns to identify trends. During the first 6 months, most
data files are accessed several times each day. Between 6 months and 2 years, most data files are
accessed once or twice each month. After 2 years, data files are accessed only once or twice each
year.
The data engineer needs to use an S3 Lifecycle policy to develop new data storage rules. The new
storage solution must continue to provide high availability.
Which solution will meet these requirements in the MOST cost-effective way?
Options
Discussion
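Study note: the access pattern in the stem maps naturally onto S3 Lifecycle transitions at 6 months and 2 years. Because the scenario requires continued high availability, the single-AZ One Zone-IA class is out. A sketch of the rule as it would be passed to `put_bucket_lifecycle_configuration` (the exact storage classes in the keyed answer may differ):

```python
# Lifecycle rule matching the stated access pattern (a sketch, not the
# verified answer). One Zone-IA is avoided: the scenario requires the
# storage to remain highly available.
lifecycle_rule = {
    "ID": "age-out-data",
    "Status": "Enabled",
    "Filter": {"Prefix": ""},  # apply to the whole bucket
    "Transitions": [
        {"Days": 180, "StorageClass": "STANDARD_IA"},  # ~6 months: monthly access
        {"Days": 730, "StorageClass": "GLACIER"},      # ~2 years: 1-2 reads/year
    ],
}

# Sanity check: transitions must occur in increasing order of object age.
days = [t["Days"] for t in lifecycle_rule["Transitions"]]
print(days == sorted(days))  # True
```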
Q: 20
A company needs to load customer data that comes from a third party into an Amazon Redshift data
warehouse. The company stores order data and product data in the same data warehouse. The
company wants to use the combined dataset to identify potential new customers.
A data engineer notices that one of the fields in the source data includes values that are in JSON
format.
How should the data engineer load the JSON data into the data warehouse with the LEAST effort?
Options
Discussion
B. Seen similar questions in practice tests.
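Study note: without the options reproduced here, the usual least-effort route for a JSON-formatted field is to load it into a Redshift SUPER column and query it with PartiQL, rather than flattening it in a separate ETL step. For illustration only, here is the shape of such a field and what a client-side parse of it would yield (record contents are hypothetical):

```python
import json

# Hypothetical third-party record: one field arrives as a JSON string.
record = {
    "customer_id": "C-1001",
    "attributes": '{"segment": "retail", "opt_in": true}',  # JSON-in-a-field
}

# Parsed client-side just to show the structure; with Redshift's SUPER type
# the raw string can instead be loaded as-is and navigated with PartiQL.
attrs = json.loads(record["attributes"])
print(attrs["segment"])  # retail
```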