The best option for scaling a Vertex AI endpoint efficiently when demand increases in the future, given a scikit-learn model that is deployed to a Vertex AI endpoint and tested on live production traffic, is to configure an appropriate minReplicaCount value based on expected baseline traffic. This option leverages the built-in autoscaling of Vertex AI to adjust endpoint resources to traffic patterns. Vertex AI is a unified platform for building and deploying machine learning solutions on Google Cloud. It can deploy a trained model to an online prediction endpoint, which provides low-latency predictions for individual instances, and it also offers tools for data analysis, model development, deployment, monitoring, and governance. The minReplicaCount value specifies the minimum number of replicas that the endpoint always keeps running, regardless of load. Setting it from the expected baseline traffic ensures that the endpoint has enough resources to serve that traffic without high latency or errors, while leaving the endpoint free to scale out when demand grows. You can set minReplicaCount when you deploy the model to the endpoint, or update it later. Vertex AI then automatically scales the number of replicas up or down within the range defined by minReplicaCount and maxReplicaCount, based on the target utilization percentage of the autoscaling metric [1].
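As a minimal sketch, assuming the google-cloud-aiplatform Python SDK and placeholder project, region, model ID, and replica counts, a deployment with autoscaling bounds might look like this:

from google.cloud import aiplatform

# Placeholder project, region, and model ID; substitute your own.
aiplatform.init(project="my-project", location="us-central1")
model = aiplatform.Model("projects/my-project/locations/us-central1/models/1234567890")

# min_replica_count covers the expected baseline traffic; max_replica_count
# caps how far Vertex AI may scale out as demand grows.
endpoint = model.deploy(
    machine_type="n1-standard-4",
    min_replica_count=2,   # always-on replicas sized for baseline traffic
    max_replica_count=10,  # headroom for future demand
)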
The other options are not as good as option B, for the following reasons:
Option A: Deploying two models to the same endpoint and distributing requests among them evenly would not scale the endpoint efficiently when demand increases, and it would add complexity and cost to the deployment process. A model is a resource that represents a machine learning model you can use for prediction. A model can have one or more versions, which are different implementations of the same model and are useful for experimenting, iterating, and improving accuracy. An endpoint is a resource that provides the service endpoint (URL) you use to request predictions; it can host one or more deployed models, which are instances of model versions backed by physical resources. Splitting traffic evenly across two deployed models does spread the load, but the combined capacity is still fixed: when demand grows beyond what the two deployments can handle together, nothing scales automatically. You would also have to create and configure both models, deploy them to the same endpoint, and manage the traffic split yourself, all without benefiting from Vertex AI autoscaling, which adjusts the number of replicas to traffic patterns and provides optimal resource utilization, cost savings, and better performance [2].
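For reference, and only as a sketch with placeholder IDs, an even split is expressed through the traffic_split argument of the SDK's Endpoint.deploy; the key "0" denotes the model being deployed in that call, and the other key is the ID of the already-deployed model:

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")
endpoint = aiplatform.Endpoint("projects/my-project/locations/us-central1/endpoints/555")
model_b = aiplatform.Model("projects/my-project/locations/us-central1/models/999")

# Route 50% of requests to the new model and 50% to the one already
# deployed (its deployed-model ID below is a placeholder).
endpoint.deploy(
    model=model_b,
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=1,
    traffic_split={"0": 50, "1122334455": 50},
)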
Option C: Setting the target utilization percentage in the autoscalingMetricSpecs configuration to a higher value would not scale the endpoint efficiently when demand increases, and could cause errors or poor performance. The target utilization percentage specifies the desired utilization level of each replica, and it therefore governs how aggressively the autoscaler adds or removes replicas. A higher target lets each replica run closer to saturation before new replicas are added, which can reduce the replica count and save some resources, but at the risk of high latency, low throughput, or resource exhaustion during load spikes. Moreover, raising the target does nothing to guarantee that the endpoint has enough resources for the expected baseline traffic, which is exactly what an appropriate minReplicaCount provides [1].
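In the Python SDK, this setting corresponds (assuming CPU-based autoscaling) to the autoscaling_target_cpu_utilization argument; the values below are placeholders:

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")
model = aiplatform.Model("projects/my-project/locations/us-central1/models/1234567890")

# A high CPU target (e.g. 90%) keeps each replica near saturation before
# scale-out triggers, trading latency headroom for lower cost.
model.deploy(
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=10,
    autoscaling_target_cpu_utilization=90,
)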
Option D: Changing the model’s machine type to one that utilizes GPUs would not scale the endpoint efficiently when demand increases, and could increase the complexity and cost of the deployment. The machine type specifies the kind of virtual machine that the prediction service uses for the deployed model, and it affects prediction speed and throughput. GPUs can accelerate computation for models that support them, but a scikit-learn model runs on CPUs, so a GPU machine type would add cost without improving its predictions. More fundamentally, a larger machine is vertical scaling with a fixed ceiling: once a single replica is saturated, growing demand still requires more replicas. You would need to redeploy the model with the new machine type, and you would still not be using Vertex AI autoscaling, which adjusts the number of replicas to traffic patterns and provides optimal resource utilization, cost savings, and better performance [2].
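For completeness, attaching an accelerator at deployment time looks like the sketch below (placeholder values); note that it changes only the per-replica hardware, not the scaling behavior:

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")
model = aiplatform.Model("projects/my-project/locations/us-central1/models/1234567890")

# A GPU changes per-replica hardware only; the replica count still
# determines how much total traffic the endpoint can absorb.
model.deploy(
    machine_type="n1-standard-4",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=1,
)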
Reference:
[1] Configure compute resources for prediction | Vertex AI | Google Cloud
[2] Deploy a model to an endpoint | Vertex AI | Google Cloud