Q: 13
You work for the AI team of an automobile company, and you are developing a visual defect
detection model using TensorFlow and Keras. To improve your model performance, you want to
incorporate some image augmentation functions such as translation, cropping, and contrast
tweaking. You randomly apply these functions to each training batch. You want to optimize your data
processing pipeline for runtime and compute resource utilization. What should you do?
Options
Discussion
A. Official TensorFlow docs and practice labs usually push a tf.data pipeline for this type of augmentation.
A. Had something like this in a mock exam; tf.data pipelines are far more efficient for augmentation since they use native TensorFlow ops and support optimizations like parallel mapping and prefetching. Keras generators are decent for small stuff but don't scale or integrate as well, especially with distributed training. Pretty sure A is what they're looking for here, unless I'm missing a subtlety.
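For anyone who hasn't built one of these: here's a minimal sketch of the kind of tf.data pipeline being described, with the parallel mapping and prefetching called out. The synthetic data, sizes, and parameter values are illustrative assumptions, and the wrap-around roll is just one cheap way to fake a random translation with native ops.

```python
import tensorflow as tf

def augment(image, label):
    # Random translation via wrap-around roll (illustrative choice).
    shift = tf.random.uniform([2], minval=-6, maxval=7, dtype=tf.int32)
    image = tf.roll(image, shift=shift, axis=[0, 1])
    # Random crop, then resize back to the model's input size.
    image = tf.image.random_crop(image, size=[56, 56, 3])
    image = tf.image.resize(image, [64, 64])
    # Random contrast tweak.
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    return image, label

# Fake dataset: 8 random 64x64 RGB images with dummy labels.
images = tf.random.uniform([8, 64, 64, 3])
labels = tf.zeros([8], dtype=tf.int32)

ds = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    .shuffle(8)
    .map(augment, num_parallel_calls=tf.data.AUTOTUNE)  # parallel mapping
    .batch(4)
    .prefetch(tf.data.AUTOTUNE)  # overlap preprocessing with training
)

for batch_images, batch_labels in ds:
    print(batch_images.shape)  # each batch: (4, 64, 64, 3)
```

Because the augmentation runs inside `map` with `AUTOTUNE`, TensorFlow parallelizes it across CPU cores and `prefetch` keeps the accelerator fed, which is exactly the runtime/compute win the question is fishing for.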
A is the way to go here. tf.data pipelines can parallelize and optimize these augmentations on the fly, and they're easier to scale and integrate with distributed training. Pretty sure that's what Google expects in this scenario, though B isn't totally wrong for smaller tasks.
A tf.data pipeline wins every time for this on Google exams, tbh. A imo
C/D? I know pre-generating augmentations with Dataflow (C or D) is tempting because you could save runtime augmentation cost, but having tried something like this before, the storage overhead gets out of hand really fast. Also, a tf.data pipeline (A) typically does these ops efficiently on the fly and fits better with distributed TensorFlow workflows. Maybe someone sees a reason to pick C or D over A for a very large dataset?
I don't think B is best. Keras generators work, but they can be slow, especially for large datasets. A tf.data pipeline usually gives better performance, but option C seems tempting since pre-generated augmentations could save time during training, right?
Probably B since Keras generators can handle augmentations and batching together; saw a similar approach in some exam reports.
tf.data lets you apply those augmentations on the fly and take advantage of TensorFlow's built-in optimizations. A is the efficient pick for compute and time. If I'm missing a catch about the dataset size, let me know.
Option D seems like a trap here: staging the data as TFRecords with Dataflow sounds scalable but adds complexity and latency. A is more efficient with tf.data, but I get why someone might pick D for big pipelines. Not 100% sure though.