The best option for reducing pipeline execution time and cost while minimizing pipeline changes is to enable caching for the pipeline job and disable caching for the model training step. This lets you use Vertex AI Pipelines to reuse the output of the data preprocessing step and avoid unnecessary recomputation. Vertex AI Pipelines is a service that orchestrates machine learning workflows on Vertex AI: it can run preprocessing and training steps on custom Docker images, and evaluate, deploy, and monitor the resulting model. Caching is a Vertex AI Pipelines feature that stores the output of a pipeline step and skips the step's execution if its input parameters and code have not changed. Caching reduces execution time and cost, because the same step is never re-run with the same input and code, and it minimizes pipeline changes, because no steps or parameters need to be added or removed. In this scenario, the pipeline has two steps: the first preprocesses 10 TB of data, completes in about 1 hour, and saves the result to a Cloud Storage bucket; the second trains a model on the processed data. You can update the model's code to test different algorithms and run the pipeline job with caching enabled. The job will reuse the cached output of the preprocessing step and skip its execution, while the training step, with caching disabled, always re-runs with the updated code. This way, you reduce pipeline execution time and cost while minimizing pipeline changes.
The other options are not as good as option D, for the following reasons:
Option A: Adding a pipeline parameter and an additional pipeline step that, depending on the parameter value, either conducts or skips data preprocessing before starting model training would require more skills and steps than enabling caching for the pipeline job and disabling it for the model training step. A pipeline parameter is a variable that controls the input or output of a pipeline step; it lets you customize the pipeline's logic and experiment with different values. An additional pipeline step is a new instance of a pipeline component that performs part of the workflow, such as data preprocessing or model training; it extends the pipeline's functionality at the cost of added complexity. To implement this option, you would need to write code to define the pipeline parameter, create the additional step, implement the conditional logic, and compile and run the pipeline. Moreover, this option would not reuse the output of the preprocessing step from the cache, but rather from the Cloud Storage bucket, which can increase data transfer and access costs.
Option B: Creating another pipeline without the preprocessing step and hardcoding the preprocessed Cloud Storage file location for model training would also require more skills and steps than enabling caching for the pipeline job and disabling it for the model training step. Such a pipeline includes only the model training step and uses the preprocessed data in the Cloud Storage bucket as its input, which does avoid re-running preprocessing every time. However, you would need to write code to create the new pipeline, remove the preprocessing step, hardcode the Cloud Storage file location, and compile and run the pipeline. Moreover, this option would not reuse the output of the preprocessing step from the cache, but rather from the Cloud Storage bucket, which can increase data transfer and access costs. Furthermore, it would create a second pipeline, which increases maintenance and management costs.
Option C: Configuring a machine with more CPU and RAM from the compute-optimized machine family for the data preprocessing step would not reduce pipeline execution time and cost while minimizing pipeline changes; it would instead increase cost and complexity. A compute-optimized machine has a high ratio of CPU cores to memory and provides high performance and scalability for compute-intensive workloads, so it could speed up the preprocessing step. However, you would need to write code to configure the machine type parameters for the preprocessing step and compile and run the pipeline. Moreover, machines with more CPU and RAM from the compute-optimized family are more expensive than smaller machines from other families, which raises the per-run cost. Most importantly, this option would not reuse the output of the preprocessing step from the cache; it would re-run the full preprocessing step on every pipeline execution, which keeps execution time and cost high.
Reference:
Preparing for Google Cloud Certification: Machine Learning Engineer, Course 3: Production ML Systems, Week 3: MLOps
Google Cloud Professional Machine Learning Engineer Exam Guide, Section 3: Scaling ML models in production, 3.2 Automating ML workflows
Official Google Cloud Certified Professional Machine Learning Engineer Study Guide, Chapter 6: Production ML Systems, Section 6.4: Automating ML Workflows
Vertex AI Pipelines
Caching
Pipeline parameters
Machine types