Q: 14
An ML engineer is using Amazon SageMaker to train a deep learning model that requires distributed
training. After some training attempts, the ML engineer observes that the instances are not
performing as expected. The ML engineer identifies communication overhead between the training
instances.
What should the ML engineer do to MINIMIZE the communication overhead between the instances?
Options
Discussion
C. Keeping training instances and data in the same AZ really cuts network latency for distributed jobs. The official AWS ML exam guide and practice exams stress minimizing cross-AZ traffic for exactly this reason, so I'm pretty confident here.
Option C is it. Keeping compute and data in the same AZ (and subnet) reduces network latency for distributed ML jobs, which exam guides and AWS whitepapers drill on. I remember labs where any cross-AZ setup added delays quickly. Pretty sure about this, but open to other interpretations if someone's seen different results in practice.
Option C
C, not D.
C, not D. Only C puts both compute and data in the same AZ, so network latency is lowest. Pretty sure that's what matters most for distributed training here. If someone disagrees, let me know.
Anyone checked the official doc or tried labs for this scenario?
C
Feels like C, since keeping both compute and data in the same AZ removes cross-AZ latency, which can slow down distributed training. D looks tempting if you're thinking about fault tolerance, but it's not right when communication overhead is the main concern. Correct me if you see it differently.
A is wrong; C. For the lowest communication overhead in distributed training, data and compute need to be in the same AZ. That avoids cross-AZ latency and extra data-transfer costs. Pretty sure C lines up with AWS ML best practices here, but let me know if you think otherwise.
C tbh, seen this same scenario in AWS exam guides and labs.
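For anyone who wants to see what "same AZ" looks like in practice, here's a minimal sketch of the AZ-pinning parts of a SageMaker `CreateTrainingJob` request. In SageMaker's API, a `VpcConfig` with a single subnet pins all training instances to that subnet's Availability Zone. The subnet/security-group IDs, job name, and instance sizing below are placeholder assumptions, and the dict is built in plain Python rather than sent via boto3, so treat this as an illustration rather than a runnable job.

```python
# Sketch: pinning a distributed SageMaker training job to one AZ.
# All IDs and names below are placeholders, not real resources.
# With boto3, this dict would form part of the request passed to
# sagemaker_client.create_training_job(**request).

def build_training_request(subnet_id: str, security_group_id: str) -> dict:
    """Build the AZ-pinning parts of a CreateTrainingJob request."""
    return {
        "TrainingJobName": "distributed-training-same-az",  # hypothetical name
        "ResourceConfig": {
            "InstanceType": "ml.p3.8xlarge",  # example GPU instance type
            "InstanceCount": 4,  # all 4 instances land in the same subnet/AZ
            "VolumeSizeInGB": 100,
        },
        # A single subnet maps to a single Availability Zone, so all
        # inter-instance traffic stays inside that AZ.
        "VpcConfig": {
            "Subnets": [subnet_id],
            "SecurityGroupIds": [security_group_id],
        },
    }

request = build_training_request("subnet-0123456789abcdef0",
                                 "sg-0123456789abcdef0")
print(len(request["VpcConfig"]["Subnets"]))  # 1 subnet -> 1 AZ
```

The equivalent with the SageMaker Python SDK is passing `subnets=[...]` (with one subnet) and `security_group_ids=[...]` to the `Estimator`.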