Q: 6
[Data Engineering]
A company wants to predict stock market price trends. The company stores stock market data each
business day in Amazon S3 in Apache Parquet format. The company stores 20 GB of data each day for
each stock code.
A data engineer must use Apache Spark to perform batch preprocessing transformations quickly so the company can complete prediction jobs before the stock market opens the next day. The company plans to track more stock codes and needs a way to scale the preprocessing transformations.
Which AWS service or feature will meet these requirements with the LEAST development effort over
time?
Options
Discussion
B
A tbh
I don’t think EMR (B) is right for "least dev effort". Glue (A) offers managed, serverless Spark that scales automatically, and there’s basically no infrastructure to maintain. EMR is more hands-on and can be overkill unless you need deep Spark tuning. Athena and Lambda just don’t fit a Spark batch workload at this scale. Open to being corrected if anyone has counterexamples from recent AWS docs.
Actually it's A here. EMR (B) is a trap because it needs more ongoing setup and management, while Glue runs Spark with far less development effort, which is exactly what the question asks for. Athena and Lambda can't handle Spark batch jobs at this scale. If anyone's seen a newer AWS recommendation that changes this, let me know.
Probably B here since EMR handles Spark natively and you get a lot of flexibility for scaling big batch jobs. Glue is less manual but I think EMR's better for heavy Spark workflows. Not totally sure though, open to feedback.
I've seen similar practice questions, and the official AWS guide covers Glue for this scenario. Going with A.
Ugh, AWS loves to push Glue for all ETL. I was thinking B at first since EMR gives you more direct Spark control, but the setup's a pain if you want to scale out later or just manage tons of jobs. Wouldn't Athena be easier here?