Q: 8
[Data Engineering]
A Machine Learning Specialist is developing a daily ETL workflow containing multiple ETL jobs. The
workflow consists of the following processes:
* Start the workflow as soon as data is uploaded to Amazon S3
* When all the datasets are available in Amazon S3, start an ETL job to join the uploaded datasets
with multiple terabyte-sized datasets already stored in Amazon S3
* Store the results of joining datasets in Amazon S3
* If one of the jobs fails, send a notification to the Administrator
Which configuration will meet these requirements?
Options
Discussion
Option A looks right to me. Step Functions can coordinate the ETL workflow and wait for uploads, then Glue is ideal for joining large datasets in S3. CloudWatch with SNS covers notifications. Pretty sure this matches the scenario, unless I’m missing something.
C/D? I'm not convinced D can handle multi-terabyte joins (Lambda limits), but C's use of AWS Batch doesn't really fit the trigger/flow requirements here either. I'd probably still say A is best, but curious if there's a real-world case where C could work. Anyone see C actually being used like this?
C/D? I don't think D handles big datasets well; Lambda chaining hits payload and runtime limits with multi-TB data. C doesn't really match the workflow triggers either. Pretty sure A's Step Functions with Glue is the design AWS pushes for this kind of use case. Disagree?
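For anyone who wants to see what Option A's orchestration would actually look like, here's a minimal sketch of the state machine in Amazon States Language, built as a Python dict for readability. The Glue job name, SNS topic ARN, and account ID are made-up placeholders, and the S3-upload trigger that starts the execution (e.g., an EventBridge rule) isn't shown:

```python
import json

# Sketch of the Option A workflow: a Step Functions state machine that runs
# a Glue job to join the datasets, and publishes to SNS if the job fails.
# JobName, TopicArn, and the account ID below are placeholder assumptions.
state_machine = {
    "Comment": "Daily ETL: join uploaded datasets with existing S3 data",
    "StartAt": "JoinDatasets",
    "States": {
        "JoinDatasets": {
            "Type": "Task",
            # The .sync suffix makes Step Functions wait for the Glue
            # job run to complete before moving on (or failing).
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "join-datasets-job"},
            "Catch": [
                {
                    # Route any job failure to the notification step.
                    "ErrorEquals": ["States.ALL"],
                    "Next": "NotifyAdmin",
                }
            ],
            "End": True,
        },
        "NotifyAdmin": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:etl-alerts",
                "Message": "Daily ETL join job failed",
            },
            "End": True,
        },
    },
}

print(json.dumps(state_machine, indent=2))
```

The `.sync` service integration is the key detail: it lets Step Functions coordinate a long-running Glue join without any Lambda timeout concerns, which is exactly the weakness of the Lambda-chain design in D.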
Why do you think Lambda chaining (D) works with multi-terabyte S3 joins? Isn't Glue better for that size?
Probably A for this one
D. Had something like this in a mock; the setup was a Lambda chain for the ETL parts.