Q: 1
The CUSTOM tier for Cloud Machine Learning Engine allows you to specify the number of which
types of cluster nodes?
Options
Discussion
Option C. You get to set both the worker count and the parameter server count with CUSTOM; that's not true for the other tiers. Pretty sure that's what the docs say too.
Why do Google questions always feel like a riddle just to test tiers? C, workers and parameter servers.
C imo. Pretty clear question, nice to see the options laid out like that.
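For anyone who wants to see what CUSTOM actually unlocks, here's a rough sketch of a training job submission (project, bucket, and job names are all made up, not from the question):

```python
# Hedged sketch: submitting an ML Engine training job with scaleTier CUSTOM.
# With CUSTOM you pick the machine types AND the node counts yourself.
from googleapiclient import discovery

training_inputs = {
    "scaleTier": "CUSTOM",
    "masterType": "n1-standard-4",
    "workerType": "n1-standard-4",
    "parameterServerType": "n1-standard-4",
    "workerCount": 4,            # number of worker nodes
    "parameterServerCount": 2,   # number of parameter server nodes
    "packageUris": ["gs://my-bucket/trainer-0.1.tar.gz"],
    "pythonModule": "trainer.task",
    "region": "us-central1",
}

ml = discovery.build("ml", "v1")
ml.projects().jobs().create(
    parent="projects/my-project",
    body={"jobId": "custom_tier_job_001", "trainingInput": training_inputs},
).execute()
```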
Q: 2
You are designing the architecture to process your data from Cloud Storage to BigQuery by using
Dataflow. The network team provided you with the Shared VPC network and subnetwork to be used
by your pipelines. You need to enable the deployment of the pipeline on the Shared VPC network.
What should you do?
Options
Discussion
A is wrong; it's B. The pipeline needs compute.networkUser on the service account that runs it, not on the service agent.
D imo. Assigning the dataflow.admin role to the service account seems like it would give enough permissions for the pipeline to run on the Shared VPC, including managing jobs and maybe network access too. Not 100% sure that's all that's needed, but dataflow.admin feels pretty broad for Dataflow tasks. Correct me if I'm missing something.
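To make the Shared VPC part concrete, here's a rough Beam sketch (all project/bucket/subnet names are placeholders). Note the subnetwork URL points at the host project, and this only works once the worker service account holds compute.networkUser on that subnetwork:

```python
# Rough sketch of launching the GCS -> BigQuery pipeline onto a Shared VPC
# subnetwork. All names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="service-project-id",    # the service project the job runs in
    region="europe-west1",
    temp_location="gs://my-bucket/temp",
    service_account_email=(
        "pipeline-runner@service-project-id.iam.gserviceaccount.com"
    ),
    # Shared VPC subnetwork URL lives in the HOST project, not the service project
    subnetwork=(
        "https://www.googleapis.com/compute/v1/projects/host-project-id"
        "/regions/europe-west1/subnetworks/shared-subnet"
    ),
)

with beam.Pipeline(options=options) as p:
    (p
     | "ReadCSV" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv",
                                         skip_header_lines=1)
     | "ToRow" >> beam.Map(lambda line: dict(zip(["id", "value"],
                                                 line.split(","))))
     | "WriteBQ" >> beam.io.WriteToBigQuery(
           "service-project-id:my_dataset.my_table",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
```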
Q: 3
You operate a logistics company, and you want to improve event delivery reliability for vehicle-based
sensors. You operate small data centers around the world to capture these events, but leased lines
that provide connectivity from your event collection infrastructure to your event processing
infrastructure are unreliable, with unpredictable latency. You want to address this issue in the most
cost-effective way. What should you do?
Options
Discussion
Option C
Q: 4
Which of the following is not true about Dataflow pipelines?
Options
Discussion
D, not much doubt. The official guide explains this; worth reviewing the Dataflow pipeline concepts again if unsure.
A is fine, but D isn't. Pipelines can't share data between instances.
D imo
Q: 5
Your company is migrating their 30-node Apache Hadoop cluster to the cloud. They want to re-use
Hadoop jobs they have already created and minimize the management of the cluster as much as
possible. They also want to be able to persist data beyond the life of the cluster. What should you do?
Options
Discussion
B tbh
D imo
B, not D
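Whichever letter it maps to in your dump, the intent of the question is Dataproc plus gs:// storage: existing Hadoop jars run unchanged, and the data outlives the cluster. Hedged sketch, all names and paths are placeholders:

```python
# Sketch: reusing an existing Hadoop jar on Dataproc while keeping the data
# on Cloud Storage so it persists beyond the cluster's life.
from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "migrated-hadoop-cluster"},
    "hadoop_job": {
        "main_jar_file_uri": "gs://my-bucket/jars/wordcount.jar",
        # The Cloud Storage connector lets existing jobs read/write gs://
        # paths just like hdfs:// paths, no code changes needed.
        "args": ["gs://my-bucket/input/", "gs://my-bucket/output/"],
    },
}

operation = client.submit_job_as_operation(
    request={"project_id": "my-project", "region": region, "job": job}
)
print(operation.result().reference.job_id)
```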
Q: 6
You are running a pipeline in Cloud Dataflow that receives messages from a Cloud Pub/Sub topic and
writes the results to a BigQuery dataset in the EU. Currently, your pipeline is located in europe-west4
and has a maximum of 3 workers, instance type n1-standard-1. You notice that during peak periods,
your pipeline is struggling to process records in a timely fashion, when all 3 workers are at maximum
CPU utilization. Which two actions can you take to increase performance of your pipeline? (Choose
two.)
Options
Discussion
Options A and B. More workers or bigger machines will give better throughput for Dataflow jobs; pretty sure those are the intended performance fixes here.
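Concretely, the two knobs those answers point at look like this in Beam pipeline options (values are illustrative; in practice you'd raise one or both):

```python
# The two tuning knobs: give autoscaling more headroom (max_num_workers)
# or use beefier workers (machine_type).
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="europe-west4",
    temp_location="gs://my-bucket/temp",
    max_num_workers=10,              # raise the cap from 3
    machine_type="n1-standard-4",    # bump from n1-standard-1
)
```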
Q: 7
Your company is implementing a data warehouse using BigQuery, and you have been tasked with
designing the data model. You move your on-premises sales data warehouse with a star schema
to BigQuery but notice performance issues when querying the data of the past 30 days. Based on
Google's recommended practices, what should you do to speed up the query without increasing
storage costs?
Options
Discussion
Ugh, these GCP questions love to trip me up. Probably D because partitioning by transaction date should make recent queries run way faster, especially when filtering on the past 30 days. I think that's standard for BigQuery performance tweaks, but maybe I'm missing something?
B tbh. Sharding by customer ID could split the data and maybe help with performance if queries are always by customer, but I don't remember Google recommending this for recent time-based filtering. I'm pretty sure splitting that way lets BigQuery scan less if customers are evenly distributed. Could be wrong, open to corrections.
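To make the partitioning suggestion concrete, here's a rough sketch of a date-partitioned table plus a 30-day query that prunes everything older (dataset/table/column names are made up):

```python
# Sketch: a table partitioned on the transaction date, so a 30-day filter
# only scans ~30 daily partitions instead of the whole table.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

table = bigquery.Table(
    "my-project.sales_dw.transactions",
    schema=[
        bigquery.SchemaField("transaction_id", "STRING"),
        bigquery.SchemaField("customer_id", "STRING"),
        bigquery.SchemaField("amount", "NUMERIC"),
        bigquery.SchemaField("transaction_date", "DATE"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="transaction_date",  # partition on the date column, not ingestion time
)
client.create_table(table)

# A filter on the partition column prunes all older partitions:
query = """
    SELECT customer_id, SUM(amount) AS total
    FROM `my-project.sales_dw.transactions`
    WHERE transaction_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
    GROUP BY customer_id
"""
rows = client.query(query).result()
```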
Q: 8
You are designing a data processing pipeline. The pipeline must be able to scale automatically as load
increases. Messages must be processed at least once, and must be ordered within windows of 1
hour. How should you design the solution?
Options
Discussion
C/D? Practice exams hit on Dataflow windowing a lot. Check the official guide and do hands-on labs to be sure.
B tbh
It's D. Pub/Sub for ingestion plus Dataflow for windowed ordering is the combo that auto-scales. Kafka is a common distractor here.
Is "must be ordered within windows" strictly referring to event time or processing time? If it's event time windows, then D fits, but if strict end-to-end ordering is needed across all messages, other options might be better.
Q: 9
You are designing a fault-tolerant architecture to store data in a regional BigQuery dataset. You need
to ensure that your application is able to recover from a corruption event in your tables that occurred
within the past seven days. You want to adopt managed services with the lowest RPO and most cost-
effective solution. What should you do?
Options
Discussion
Option C here, as time travel is built-in for seven days and does not add extra costs. Clean and straightforward question.
C tbh, seen similar advice in official docs and exam practice tests.
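For reference, the time-travel recovery looks roughly like this (table names are placeholders). Any standard dataset supports FOR SYSTEM_TIME AS OF up to seven days back, at no extra storage cost:

```python
# Sketch: query the table as it looked before the corruption, then write
# that snapshot back out as a restored table.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

restore_sql = """
    CREATE OR REPLACE TABLE `my-project.analytics.events_restored` AS
    SELECT *
    FROM `my-project.analytics.events`
      FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 DAY)
"""
client.query(restore_sql).result()  # blocks until the restore query finishes
```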
Q: 10
You work for a car manufacturer and have set up a data pipeline using Google Cloud Pub/Sub to
capture anomalous sensor events. You are using a push subscription in Cloud Pub/Sub that calls a
custom HTTPS endpoint that you have created to take action on these anomalous events as they
occur. Your custom HTTPS endpoint keeps getting an inordinate amount of duplicate messages. What
is the most likely cause of these duplicate messages?
Options
Discussion
Option B actually makes sense here. If the SSL cert is out of date, Pub/Sub's push will get handshake failures and treat it like a non-ack, which leads to retries, so you get duplicates. I know D's also common, but in this scenario cert problems will cause exactly this issue. Pretty sure B is right; correct me if I'm missing something.
Nah, I don't think it's D here. B is the catch: Pub/Sub push fails when certs are invalid, so messages aren't acknowledged and you get retries (hence duplicates). D is a standard culprit, but this question points at HTTPS/SSL issues.
It's D, not B. Pub/Sub will keep retrying if your endpoint isn't sending an ack fast enough; classic trap here.
Guessing D. Pub/Sub resends unacked messages so if your endpoint isn't acknowledging them in time, you'll get duplicates.
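To illustrate the ack-deadline explanation, here's a toy push endpoint (hypothetical, not from the question): return a 2xx quickly and hand the slow work off elsewhere, otherwise Pub/Sub treats the delivery as a nack and resends:

```python
# Toy Flask push endpoint: ack fast, process later. Slow inline processing is
# exactly what causes the duplicate deliveries described in the question.
import base64
import json
from flask import Flask, request

app = Flask(__name__)

@app.route("/pubsub/push", methods=["POST"])
def pubsub_push():
    envelope = json.loads(request.data)
    payload = base64.b64decode(envelope["message"]["data"])
    enqueue_for_processing(payload)  # hand off; don't block the response
    return ("", 204)  # a quick 2xx within the ack deadline = acknowledged

def enqueue_for_processing(payload: bytes) -> None:
    # Placeholder for a queue/task handler (hypothetical helper).
    pass
```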