PROFESSIONAL-DATA-ENGINEER.pdf
Q: 1
You are deploying a batch pipeline in Dataflow. This pipeline reads data from Cloud Storage,
transforms the data, and then writes the data into BigQuery. The security team has enabled an
organizational constraint in Google Cloud, requiring all Compute Engine instances to use only
internal IP addresses and no external IP addresses. What should you do?
Options
Q: 2
You are using Cloud Bigtable to persist and serve stock market data for each of the major indices. To
serve the trading application, you need to access only the most recent stock prices that are streaming
in. How should you design your row key and tables to ensure that you can access the data with the
simplest query?
Options
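Question 2 turns on Bigtable row key design for most-recent-first reads. One common pattern (offered as background, not as the graded answer) is a reverse timestamp in the row key, so the newest price sorts lexicographically first under each index prefix; a minimal sketch with hypothetical index names:

```python
import sys

def stock_row_key(index: str, timestamp_ms: int) -> str:
    """Build a Bigtable row key where newer data sorts first.

    Bigtable stores rows sorted by key, so subtracting the event timestamp
    from a large constant (a "reverse timestamp") makes the most recent
    price the first row returned by a simple prefix scan. Index names and
    the key layout here are hypothetical.
    """
    reverse_ts = sys.maxsize - timestamp_ms
    return f"{index}#{reverse_ts:020d}"  # zero-pad so string order == numeric order

# A newer event produces a lexicographically smaller key under the same prefix,
# so a scan starting at "NASDAQ#" returns the latest price first.
older = stock_row_key("NASDAQ", 1_700_000_000_000)
newer = stock_row_key("NASDAQ", 1_700_000_060_000)
```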
Q: 3
Your team is working on a binary classification problem. You have trained a support vector machine
(SVM) classifier with default parameters, and received an area under the curve (AUC) of 0.87 on the
validation set. You want to increase the AUC of the model. What should you do?
Options
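Question 3 assumes familiarity with what AUC measures. As background, a small pure-Python computation of ROC AUC via the rank-sum (Mann-Whitney) formulation: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The data here are made up for illustration:

```python
def auc(labels, scores):
    """ROC AUC via the Mann-Whitney U statistic.

    Counts, over all positive/negative pairs, how often the positive
    example outscores the negative one (ties count as half a win).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A perfectly ranked toy set yields AUC 1.0; random scoring hovers near 0.5,
# which is why 0.87 in the question indicates a usable but improvable model.
```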
Q: 4
You are testing a Dataflow pipeline to ingest and transform text files. The files are compressed with
gzip, errors are written to a dead-letter queue, and you are using SideInputs to join data. You noticed
that the pipeline is taking longer to complete than expected. What should you do to expedite the
Dataflow job?
Options
Q: 5
A shipping company has live package-tracking data that is sent to an Apache Kafka stream in real
time. This is then loaded into BigQuery. Analysts in your company want to query the tracking data in
BigQuery to analyze geospatial trends in the lifecycle of a package. The table was originally created
with ingest-date partitioning. Over time, the query processing time has increased. You need to
implement a change that would improve query performance in BigQuery. What should you do?
Options
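Question 5 hinges on BigQuery partitioning and clustering choices. For reference, a sketch of DDL that partitions on an event timestamp column and clusters on a frequently filtered key, held as a Python string; all table and column names are hypothetical, and this illustrates the technique rather than the graded answer:

```python
# Column-based time partitioning plus clustering lets BigQuery prune
# partitions and blocks when analysts filter on event time and a tracking
# key. Ingestion-time partitioning (as in the question) only helps queries
# that filter on _PARTITIONTIME, which analysts rarely do.
DDL = """
CREATE TABLE mydataset.package_tracking_new
PARTITION BY DATE(event_timestamp)
CLUSTER BY package_id
AS SELECT * FROM mydataset.package_tracking
"""
```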
Q: 6
Data Analysts in your company have the Cloud IAM Owner role assigned to them in their projects to
allow them to work with multiple GCP products in their projects. Your organization requires that all
BigQuery data access logs be retained for 6 months. You need to ensure that only audit personnel in
your company can access the data access logs for all projects. What should you do?
Options
Q: 7
Your business users need a way to clean and prepare data before using the data for analysis. Your
business users are less technically savvy and prefer to work with graphical user interfaces to define
their transformations. After the data has been transformed, the business users want to perform their
analysis directly in a spreadsheet. You need to recommend a solution that they can use. What should
you do?
Options
Q: 8
You need to choose a database for a new project that has the following requirements:
Fully managed
Able to automatically scale up
Transactionally consistent
Able to scale up to 6 TB
Able to be queried using SQL
Which database do you choose?
Options
Q: 9
You are building a new application that you need to collect data from in a scalable way. Data arrives
continuously from the application throughout the day, and you expect to generate approximately 150
GB of JSON data per day by the end of the year. Your requirements are:
Decoupling producer from consumer
Space and cost-efficient storage of the raw ingested data, which is to be stored indefinitely
Near real-time SQL query
Maintain at least 2 years of historical data, which will be queried with SQL
Which pipeline should you use to meet these requirements?
Options
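Question 9 names decoupling of producer from consumer as a requirement; on Google Cloud this usually means a message bus such as Pub/Sub. The idea in miniature, using a stdlib queue as a stand-in so that producer and consumer share no direct reference to each other:

```python
import queue
import threading

buf = queue.Queue()  # stand-in for a message bus such as Pub/Sub

def producer():
    # The producer only knows the bus, not who consumes the events.
    for i in range(3):
        buf.put({"event": i})
    buf.put(None)  # sentinel marking end of stream (illustration only)

received = []

def consumer():
    # The consumer only knows the bus; it can be scaled or replaced
    # without touching the producer.
    while (msg := buf.get()) is not None:
        received.append(msg)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
```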
Q: 10
You've migrated a Hadoop job from an on-premises cluster to Dataproc and Cloud Storage. Your Spark
job is a complex analytical workload that consists of many shuffling operations, and the initial data are
Parquet files (on average 200-400 MB in size each). You see some degradation in performance after the
migration to Dataproc, so you'd like to optimize for it. Your organization is very cost-sensitive, so you'd
like to continue using Dataproc on preemptible workers (with only 2 non-preemptible workers) for this
workload. What should you do?
Options