Q: 6
Case study
An ML engineer is developing a fraud detection model on AWS. The training dataset includes
transaction logs, customer profiles, and tables from an on-premises MySQL database. The
transaction logs and customer profiles are stored in Amazon S3.
The dataset has a class imbalance that affects the learning of the model's algorithm. Additionally,
many of the features have interdependencies. The algorithm is not capturing all the desired
underlying patterns in the data.
Before the ML engineer trains the model, the ML engineer must resolve the issue of the imbalanced
data.
Which solution will meet this requirement with the LEAST operational effort?
Options
Discussion
Option D. Data Wrangler automates balancing, so it's the lowest effort of the options for this case.
Option D here. Data Wrangler's balance data operation is specifically designed for quick class balancing and sits right inside SageMaker, so fewer steps are needed than with Glue or Athena. Small catch: if the source data weren't in S3 or otherwise reachable from SageMaker, setup could be more involved, but in this scenario it's the fastest. Pretty sure that's why D wins.
Maybe D, I'd say. Glue DataBrew (C) is tempting since it has built-in transforms, but in AWS practice exams and real workflows, Data Wrangler's balance data operation is purpose-built for this and a lot simpler if you're already in SageMaker. C looks good but it's more manual. Disagree?
B tbh. SageMaker Studio Classic has built-in algorithms for handling imbalanced data, which can automate a lot of the process if you're already in Studio. I know D is more direct, but figured B would also minimize effort here? Correct me if I'm missing something.
D. C is a trap since it works but takes more setup if you're using SageMaker already. Practice exams usually flag Data Wrangler as lowest effort for this use case.
It's D. Not totally sure, since sometimes C gets picked, but Data Wrangler's balance data operation is made for this and super fast.
B doesn't fit for least effort, D does. Data Wrangler's balance data is way faster here.
I’d say this is the same as a common question on practice tests, and D is usually the right move. SageMaker Data Wrangler has that balance data operation built in, so oversampling takes just a couple of clicks. The workflow stays in SageMaker too, so it's definitely less effort than using Glue or Athena. I think D but let me know if you disagree.
Probably D, since Data Wrangler has the balance data operation baked in. Super quick to oversample and fits right into SageMaker workflow. C needs more setup in Glue so not as low-effort. Let me know if you think otherwise.
D. Data Wrangler makes oversampling super easy for class imbalance, with barely any manual setup compared to the others.
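For anyone wondering what the oversampling several comments mention actually does to the data, here is a rough sketch in plain Python. This is not Data Wrangler's implementation, just the basic idea of random oversampling. The `random_oversample` helper, the `is_fraud` label name, and the toy transactions are all made up for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, label_key="is_fraud", seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class.

    `label_key` is a hypothetical column name, not from the question.
    """
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)

    # Every class is grown to the size of the largest one.
    target = max(len(group) for group in by_label.values())

    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap for smaller classes.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))

    rng.shuffle(balanced)
    return balanced


# Toy data: 8 legitimate transactions and 2 fraudulent ones.
transactions = [{"amount": 10 * i, "is_fraud": 0} for i in range(1, 9)]
transactions += [{"amount": 950, "is_fraud": 1}, {"amount": 1200, "is_fraud": 1}]

balanced = random_oversample(transactions)
counts = Counter(row["is_fraud"] for row in balanced)
print(counts)
```

After the call, both classes have the same number of rows. The point of picking D is that you get this kind of transform from the UI without writing or maintaining code like the above.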