Q: 8
[Data Engineering]
A Machine Learning Specialist is developing a daily ETL workflow containing multiple ETL jobs. The
workflow consists of the following processes:
* Start the workflow as soon as data is uploaded to Amazon S3
* When all the datasets are available in Amazon S3, start an ETL job to join the uploaded datasets
with multiple terabyte-sized datasets already stored in Amazon S3
* Store the results of joining datasets in Amazon S3
* If one of the jobs fails, send a notification to the Administrator
Which configuration will meet these requirements?
Options
Discussion
Option A looks right to me. Step Functions can coordinate the ETL workflow and wait for uploads, then Glue is ideal for joining large datasets in S3. CloudWatch with SNS covers notifications. Pretty sure this matches the scenario, unless I’m missing something.
C/D? I'm not convinced D can handle multi-terabyte joins (Lambda limits), but C's use of AWS Batch doesn't really fit the trigger/flow requirements here either. I'd probably still say A is best, but curious if there's a real-world case where C could work. Anyone see C actually being used like this?
C/D? I don't think D handles big datasets well; Lambda chaining hits payload and runtime limits with multi-TB data. C doesn't really match the workflow triggers either. Pretty sure A's Step Functions with Glue is the design AWS pushes for this kind of use case. Disagree?
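For anyone who wants to see what Option A's orchestration would actually look like, here's a minimal sketch of the state machine in Amazon States Language, built as a Python dict for readability. The Glue job name, SNS topic ARN, and account ID are made-up placeholders, and the S3-upload trigger that starts the execution (e.g., an EventBridge rule) isn't shown:

```python
import json

# Sketch of the Option A workflow: a Step Functions state machine that runs
# a Glue job to join the datasets, and publishes to SNS if the job fails.
# JobName, TopicArn, and the account ID below are placeholder assumptions.
state_machine = {
    "Comment": "Daily ETL: join uploaded datasets with existing S3 data",
    "StartAt": "JoinDatasets",
    "States": {
        "JoinDatasets": {
            "Type": "Task",
            # The .sync suffix makes Step Functions wait for the Glue
            # job run to complete before moving on (or failing).
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "join-datasets-job"},
            "Catch": [
                {
                    # Route any job failure to the notification step.
                    "ErrorEquals": ["States.ALL"],
                    "Next": "NotifyAdmin",
                }
            ],
            "End": True,
        },
        "NotifyAdmin": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:etl-alerts",
                "Message": "Daily ETL join job failed",
            },
            "End": True,
        },
    },
}

print(json.dumps(state_machine, indent=2))
```

The `.sync` service integration is the key detail: it lets Step Functions coordinate a long-running Glue join without any Lambda timeout concerns, which is exactly the weakness of the Lambda-chain design in D.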
Why do you think Lambda chaining (D) works with multi-terabyte S3 joins? Isn't Glue better for that size?
Probably A for this one
D. Had something like this in a mock; the setup was a Lambda chain for the ETL parts.