View Professional-Data-Engineer Exam Questions

Q: 11

You have several Spark jobs that run on a Cloud Dataproc cluster on a schedule. Some of the jobs run in sequence, and some of the jobs run concurrently. You need to automate this process. What should you do?

Options

Correct Answer:

Explanation

Cloud Composer is Google Cloud's fully managed workflow orchestration service built on Apache Airflow. It is the ideal solution for this scenario because it is designed to author, schedule, and monitor complex data pipelines. Using a Directed Acyclic Graph (DAG), you can define the precise relationships between your Spark jobs, specifying which ones must run in sequence and which can run concurrently. Cloud Composer includes robust, built-in scheduling capabilities, directly fulfilling the requirement to run the jobs on a schedule. It provides a complete, scalable, and manageable solution for automating complex workflows with dependencies.
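The sequencing idea behind a DAG can be sketched in plain Python. This is not Airflow or Cloud Composer code; it is a minimal illustration, using the standard-library `graphlib` module, of how declared dependencies determine which jobs must wait and which can run side by side. The job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical Spark job names; each key maps to the set of jobs it depends on.
dependencies = {
    "ingest": set(),
    "clean_a": {"ingest"},
    "clean_b": {"ingest"},
    "report": {"clean_a", "clean_b"},
}


def execution_waves(deps):
    """Group jobs into waves: jobs in one wave may run concurrently,
    and a wave starts only after every job in the previous wave is done."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose dependencies are all met
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


waves = execution_waves(dependencies)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

In this sketch `clean_a` and `clean_b` land in the same wave, which corresponds to the jobs that an orchestrator could run concurrently, while `report` waits for both.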

Why Incorrect

A. Create a Cloud Dataproc Workflow Template: While Workflow Templates can define a graph of jobs with dependencies, they lack a native, built-in scheduler. You would need to integrate an external service like Cloud Scheduler to automate execution, making it an incomplete solution.

B. Create an initialization action to execute the jobs: Initialization actions are scripts that run only once during cluster creation to configure the cluster nodes (e.g., install software). They are not designed for scheduling or orchestrating recurring application jobs.

D. Create a Bash script that uses the Cloud SDK...: This approach is manual and brittle. It requires custom coding for scheduling, dependency management, error handling, and logging, all of which are standard features in a managed orchestrator like Cloud Composer.

References

1. Cloud Composer for Orchestration: "Cloud Composer is a fully managed workflow orchestration service that empowers you to author, schedule, and monitor pipelines that span across clouds and on-premises data centers." It uses Directed Acyclic Graphs (DAGs) to manage workflows.

Source: Google Cloud Documentation, "What is Cloud Composer?", Overview section. (https://cloud.google.com/composer/docs/concepts/overview)

2. Dataproc Workflow Templates vs. Cloud Composer: The documentation explicitly notes that for scheduled workflows, an external trigger is needed for Workflow Templates. "To run a workflow template on a schedule, you can use Cloud Scheduler to send a request to the Dataproc API... Cloud Composer is a fully managed workflow orchestration service that can be used to author, schedule, and monitor complex workflows, including Dataproc workflows."

Source: Google Cloud Documentation, "Dataproc Workflow Templates", section "Workflow Templates and Cloud Composer". (https://cloud.google.com/dataproc/docs/concepts/workflows/overview#workflowtemplatesandcloudcomposer)

3. Purpose of Initialization Actions: "Initialization actions are scripts or executables that Dataproc runs on all nodes in a Dataproc cluster immediately after the cluster is set up." This confirms they are for setup, not job execution.

Source: Google Cloud Documentation, "Initialization actions", Overview section. (https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions)

Q: 12

Government regulations in your industry mandate that you have to maintain an auditable record of access to certain types of data. Assuming that all expiring logs will be archived correctly, where should you store data that is subject to that mandate?

Options

Correct Answer:

Explanation

BigQuery is designed for large-scale data analytics and provides robust, built-in auditing capabilities. By controlling access to datasets via IAM (Identity and Access Management), you ensure only authorized personnel can query the data. Critically, BigQuery automatically generates Data Access audit logs for actions that read data, such as running a query. These logs capture the user's identity (principalEmail), the timestamp, and the exact query executed. This creates a direct, tamper-evident, and auditable trail of data access, fulfilling the regulatory mandate without requiring custom development.
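As a rough illustration of what such an audit trail contains, the sketch below pulls the identity, timestamp, and query text out of a single log entry. The entry is a hand-written, heavily trimmed sample, and the nested path to the query text is an assumption about the log layout rather than a guaranteed schema; check it against real entries exported from Cloud Logging before relying on it.

```python
import json

# Hypothetical, trimmed sample entry; real entries carry many more fields.
sample_entry = json.loads("""
{
  "timestamp": "2024-01-01T00:00:00Z",
  "protoPayload": {
    "authenticationInfo": {"principalEmail": "analyst@example.com"},
    "metadata": {
      "jobInsertion": {
        "job": {"jobConfig": {"queryConfig": {"query": "SELECT 1"}}}
      }
    }
  }
}
""")


def dig(mapping, *keys):
    """Follow a chain of keys through nested dicts, returning None if any is missing."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def summarize_access(entry):
    """Extract who ran what, and when, from one audit log entry."""
    return {
        "principal": dig(entry, "protoPayload", "authenticationInfo", "principalEmail"),
        "timestamp": entry.get("timestamp"),
        # Assumed location of the query text; adjust to the format you observe.
        "query": dig(entry, "protoPayload", "metadata", "jobInsertion",
                     "job", "jobConfig", "queryConfig", "query"),
    }


summary = summarize_access(sample_entry)
print(summary)
```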

Why Incorrect

A. Managing user-supplied keys externally makes auditing difficult. Google Cloud logs the service account that requested the data, but it cannot audit which end-user used the external key to decrypt it.

C. Cloud SQL Admin activity logs track administrative changes (e.g., editing an instance), not data access operations like SELECT queries. This log type is insufficient for the specified audit requirement.

D. This approach relies on a custom-built application for logging. This is less reliable and more complex than a native, managed audit logging service and places the burden of creating a compliant audit trail on the developer.

References

1. Google Cloud Documentation - Cloud Audit Logs with BigQuery: "Data Access audit logs contain log entries for API calls that read the configuration or metadata of resources, as well as user-driven API calls that create, modify, or read user-provided resource data... BigQuery writes Data Access audit logs for the following APIs... google.cloud.bigquery.v2.JobService.Query". This confirms that user queries are logged, providing the required audit trail.

2. Google Cloud Documentation - Overview of Cloud Audit Logs: "Data Access audit logs: Entries for operations that read the configuration or metadata of resources, as well as user-driven API calls that create, modify, or read user-provided resource data." This defines the purpose of Data Access logs, which directly matches the question's need.

3. Google Cloud Documentation - Cloud SQL Audit Logging: "Admin Activity audit logs contain log entries for operations that modify the configuration or metadata of a Cloud SQL resource." This explicitly states that Admin Activity logs are for configuration changes, not data access, making option C incorrect.

4. Google Cloud Documentation - Customer-Supplied Encryption Keys (CSEK): "Cloud Storage does not store your key on Google's servers or otherwise manage your key. Instead, you provide your key for each operation, and the key is purged from Google's servers after the operation is complete." This highlights that key management is external, making it impossible for Google's native audit logs to track which end-user's key was used.

Q: 13

Your team is building a data lake platform on Google Cloud. As part of the data foundation design, you plan to store all the raw data in Cloud Storage. You expect to ingest approximately 25 GB of data a day, and your billing department is worried about the increasing cost of storing old data. The current business requirements are:

• The old data can be deleted anytime
• You plan to use the visualization layer for current and historical reporting
• The old data should be available instantly when accessed
• There should not be any charges for data retrieval

What should you do to optimize for cost?

Options

Correct Answer:

Explanation

The Autoclass feature is the optimal solution as it meets all specified requirements. It automatically transitions data between a frequent access tier (equivalent to Standard) and an infrequent access tier (equivalent to Nearline) based on access patterns, thus optimizing storage costs. Crucially, Autoclass provides instant access to all data regardless of its tier and eliminates retrieval fees, which are explicit business requirements. This approach simplifies management by removing the need to configure and maintain complex Object Lifecycle Management policies while satisfying the constraints of instant availability and no retrieval charges for historical data.
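The tiering behaviour described above can be modelled with a small function. This is a simplified sketch, not the Cloud Storage API: it assumes the two-tier setup described in the explanation and a 30-day inactivity threshold, and both the function name and the threshold parameter are illustrative assumptions.

```python
def autoclass_tier(days_since_last_access, threshold_days=30):
    """Return the tier a two-tier Autoclass bucket would hold an object in.

    Assumption: an object untouched for `threshold_days` moves to the
    infrequent tier, and any access moves it back to the frequent tier.
    """
    if days_since_last_access < 0:
        raise ValueError("days_since_last_access must be non-negative")
    if days_since_last_access < threshold_days:
        return "frequent (Standard)"
    return "infrequent (Nearline)"


for age in (0, 29, 30, 400):
    print(age, "->", autoclass_tier(age))
```

Because the transition follows access patterns rather than a fixed schedule, data that the reporting layer keeps reading stays in the frequent tier, while untouched history drifts to the cheaper one.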

Why Incorrect

B: This option is incorrect because it relies on Nearline, Coldline, and Archive storage, all of which incur retrieval fees, violating a key requirement. Archive storage still serves data within milliseconds, so the retrieval fee, not access latency, is the disqualifying factor.

C: This option is incorrect for the same reason as B (retrieval fees on the colder storage classes). It also proposes an illogical lifecycle policy, moving data from a colder class (Coldline) to a warmer one (Nearline).

D: This option is incorrect because, like option B, it uses storage classes (including Archive) that have retrieval fees, which the requirements explicitly forbid.

References

1. Google Cloud Documentation - Autoclass: "Autoclass simplifies storage management... Access to all objects is at Standard storage speeds, regardless of the object's assigned storage class... There are no retrieval fees when Autoclass moves an object to the Standard storage class tier." This source confirms that Autoclass provides instant access and has no retrieval fees, directly supporting option A.

2. Google Cloud Documentation - Storage Classes: This document details the properties of different storage classes.

For Nearline, Coldline: "In addition to a per-operation cost, a retrieval cost applies..." This confirms that options B, C, and D violate the "no retrieval charges" requirement.

For Archive: retrieval fees and a 365-day minimum storage duration apply, even though the data remains available within milliseconds. This confirms that any option using Archive storage violates the "no retrieval charges" requirement.

3. Google Cloud Documentation - Object Lifecycle Management: This page describes how lifecycle policies work by transitioning objects between storage classes. While it's a valid cost-optimization tool, its application in options B, C, and D leads to a violation of the scenario's specific access and billing constraints.

Q: 14

Which of the following IAM roles does your Compute Engine account require to be able to run pipeline jobs?

Options

Correct Answer:

Explanation

The dataflow.worker (roles/dataflow.worker) IAM role is specifically designed for the service account used by Compute Engine worker VMs. This role grants the necessary permissions for the worker instances to pull tasks from the Dataflow service, report their status and progress, and access the job's state. Without this role, the Compute Engine instances launched by Dataflow cannot function as workers to execute the pipeline's distributed processing tasks. The worker service account is distinct from the user account that launches the job.
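To make the role assignment concrete, the sketch below adds a role binding to an IAM policy represented as a plain dictionary with the usual `bindings` / `role` / `members` shape. It does not call any Google Cloud API; the helper name and the service account email are hypothetical.

```python
import copy


def add_binding(policy, role, member):
    """Return a copy of an IAM policy dict with `member` bound to `role`."""
    updated = copy.deepcopy(policy)
    bindings = updated.setdefault("bindings", [])
    for binding in bindings:
        if binding["role"] == role:
            if member not in binding["members"]:
                binding["members"].append(member)
            return updated
    # No binding for this role yet, so create one.
    bindings.append({"role": role, "members": [member]})
    return updated


# Hypothetical worker service account; substitute the one your workers use.
worker = "serviceAccount:dataflow-worker@example-project.iam.gserviceaccount.com"
policy = add_binding({"bindings": []}, "roles/dataflow.worker", worker)
print(policy)
```

The binding targets the worker service account rather than the user who submits the job, which mirrors the distinction drawn in the explanation.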

Why Incorrect

B. dataflow.compute: This is not a valid predefined IAM role in Google Cloud. It is likely included as a distractor.

C. dataflow.developer: This role (roles/dataflow.developer) is for users or service accounts that need to create, drain, cancel, and manage Dataflow jobs, not for the worker VMs that execute the job's tasks.

D. dataflow.viewer: This role (roles/dataflow.viewer) provides read-only permissions to view Dataflow jobs and their status. It is insufficient for a worker VM, which must actively perform tasks and update its status.

References

1. Google Cloud Documentation - Dataflow security and permissions: In the section "Required roles," it explicitly states: "The worker service account needs the roles/dataflow.worker role to be able to process data as part of a Dataflow job." This role includes permissions like dataflow.workItems.lease, dataflow.workItems.reportStatus, and dataflow.workItems.sendMessage.

Source: Google Cloud Documentation, "Dataflow security and permissions", Section: "Required roles".

2. Google Cloud Documentation - Granting permissions to the worker service account: This guide details the necessary roles for the worker service account. It specifies: "To provide the necessary permissions to the worker service account, grant it the Dataflow Worker (roles/dataflow.worker) role."

Source: Google Cloud Documentation, "Granting permissions to the worker service account", Section: "Grant the required roles".

3. Google Cloud Documentation - IAM basic and predefined roles reference: The official reference for all predefined roles lists roles/dataflow.worker with the description: "Provides the permissions necessary for a Compute Engine service account to run work units for a Dataflow job." It also lists the specific permissions contained within the role.

Source: Google Cloud Documentation, "IAM basic and predefined roles reference", Filter for "Dataflow".

Q: 15

Cloud Dataproc is a managed Apache Hadoop and Apache _____ service.

Options

Correct Answer:

Explanation

Cloud Dataproc is a fully managed and highly scalable service from Google Cloud for running open-source data and analytics workloads. Its core function is to simplify the creation, management, and scaling of clusters for popular big data frameworks. The service is fundamentally built to provide managed instances of Apache Hadoop and Apache Spark, allowing users to leverage these powerful open-source tools for batch processing, querying, streaming, and machine learning without the overhead of infrastructure management.

Why Incorrect

A. Blaze: This is not a core Apache big data project offered as a managed service within Cloud Dataproc. Google's internal build tool was named Blaze (now open-sourced as Bazel).

C. Fire: This is not a recognized Apache project in the context of big data processing and is not associated with the Cloud Dataproc service.

D. Ignite: While Apache Ignite is an open-source in-memory computing platform, it is not a foundational, defining component of the Cloud Dataproc service in the same way as Hadoop and Spark.

References

1. Google Cloud Documentation - Dataproc Overview: The official product page explicitly states, "Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning." This directly confirms that Spark is the correct answer.

Source: Google Cloud, "Dataproc Overview", Section: "What is Dataproc?".

2. Google Cloud Documentation - What is Dataproc?: The introductory paragraph of the core documentation states, "Dataproc is a fully managed and highly scalable service for running Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks." This highlights Apache Spark as a primary component.

Source: Google Cloud, "What is Dataproc?", First paragraph.

3. Coursera - Google Cloud Big Data and Machine Learning Fundamentals: In the course module covering data processing, Cloud Dataproc is introduced as Google's managed platform for the Apache Hadoop and Apache Spark ecosystem.

Source: Coursera, "Google Cloud Big Data and Machine Learning Fundamentals", Week 2: "Managed Big Data Services", Video: "Cloud Dataproc for Spark and Hadoop".

Q: 16

An external customer provides you with a daily dump of data from their database. The data flows into Google Cloud Storage (GCS) as comma-separated values (CSV) files. You want to analyze this data in Google BigQuery, but the data could have rows that are formatted incorrectly or corrupted. How should you build this pipeline?

Options

Correct Answer:

Explanation

A Google Cloud Dataflow batch pipeline is the most robust and scalable solution for this scenario. Dataflow is designed for large-scale Extract, Transform, and Load (ETL) operations. Within a Dataflow pipeline, you can implement custom logic to parse and validate each row from the source CSV files. By using a try-catch block or a side output pattern, the pipeline can gracefully handle parsing errors. Valid records are written to the main BigQuery table, while corrupted or malformed records are routed to a separate "dead-letter" table. This ensures the main ingestion process is not halted by bad data and provides a mechanism to isolate, analyze, and remediate the problematic records.
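The routing logic can be sketched without any Beam or Dataflow dependency. The example below is a plain-Python stand-in for the try-catch / side-output pattern, not pipeline code: it assumes a hypothetical three-column schema (`id`, `name`, `amount`) and collects rejected rows in a list that plays the role of the dead-letter table.

```python
import csv
import io


def split_rows(csv_text, expected_columns=3):
    """Parse CSV text, returning (valid_rows, dead_letter_rows)."""
    valid, dead_letter = [], []
    reader = csv.reader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=1):
        try:
            if len(row) != expected_columns:
                raise ValueError(f"expected {expected_columns} columns, got {len(row)}")
            # Type conversion doubles as validation for this hypothetical schema.
            valid.append({"id": int(row[0]), "name": row[1], "amount": float(row[2])})
        except ValueError as err:
            # Keep the raw row and the reason so it can be inspected and replayed.
            dead_letter.append({"line": line_number, "raw": row, "error": str(err)})
    return valid, dead_letter


sample = "1,alice,10.5\n2,bob\nx,carol,3\n4,dave,7\n"
valid_rows, bad_rows = split_rows(sample)
print(len(valid_rows), "valid;", len(bad_rows), "routed to dead letter")
```

In a real pipeline the two lists would correspond to two outputs of one transform, each written to its own BigQuery table, so a bad row never stops the good rows from loading.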

Why Incorrect

A. Federated queries are not suitable for ingestion pipelines with data quality issues, as a malformed row can cause the entire query to fail, and they lack robust, built-in error-handling mechanisms for routing bad records.

B. Monitoring and alerting are reactive measures. They would notify you after a load job has failed due to bad data but do not provide a mechanism to handle the bad records and successfully load the valid ones.

C. Setting maxBadRecords to 0 instructs BigQuery to fail the entire load job if even a single bad record is found. This is the opposite of the requirement to build a resilient pipeline that can handle corrupted rows.
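The threshold semantics behind option C can be stated in a few lines. This is a toy model of the rule, not a call to the BigQuery API; the function name and return strings are illustrative.

```python
def load_outcome(bad_record_count, max_bad_records=0):
    """Model the rule: the job fails once bad records exceed the allowed maximum."""
    if bad_record_count > max_bad_records:
        return "FAILED"
    return "SUCCEEDED"


print(load_outcome(bad_record_count=1))                      # threshold of 0
print(load_outcome(bad_record_count=1, max_bad_records=10))  # tolerant threshold
```

Even with a higher threshold, the skipped rows are simply dropped from the load, which is why a threshold alone does not give you a place to inspect and repair them.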

References

1. Cloud Dataflow for Robust ETL: Google Cloud's official documentation outlines patterns for building resilient data pipelines. The use of side outputs to handle errors is a standard practice. "A common pattern is to use a dead-letter queue... For a streaming pipeline, you can use a side output to capture errors." This principle applies equally to batch pipelines for routing bad records.

Source: Google Cloud Documentation, "Handle pipeline errors," Common error-handling patterns.

2. BigQuery Load Job Error Handling: The BigQuery documentation specifies the maxBadRecords property for load jobs. "The maximum number of bad records that BigQuery can ignore when running the job. If the number of bad records exceeds this value, an invalid error is returned in the job result. The default value is 0, which requires that all records are valid." This confirms that option C would cause the job to fail.

Source: Google Cloud Documentation, BigQuery API, "Jobs," JobConfigurationLoad resource, maxBadRecords property.

3. Limitations of Federated Queries for ETL: While useful for ad-hoc analysis, federated (external) tables are not ideal for robust ETL. The documentation notes that query performance is lower and that errors in the source data can cause query failures. "If the external data is not clean, queries can fail. For example, if there are extra columns or incorrect data types."

Source: Google Cloud Documentation, "Introduction to external tables."

4. Data Processing Pipeline Architectures: The Google Cloud Architecture Center provides reference architectures for data processing. For scenarios requiring transformation and validation before loading, Dataflow is the recommended tool. "This pipeline uses Dataflow to pull messages from Pub/Sub, process them, and then write them to BigQuery for analysis." This pattern of using Dataflow for pre-load processing is standard.

Source: Google Cloud Architecture Center, "Serverless data-processing pipeline."

Q: 17

You have an upstream process that writes data to Cloud Storage. This data is then read by an Apache Spark job that runs on Dataproc. These jobs are run in the us-central1 region, but the data could be stored anywhere in the United States. You need to have a recovery process in place in case of a catastrophic single region failure. You need an approach with a maximum of 15 minutes of data loss (RPO=15 mins). You want to ensure that there is minimal latency when reading the data. What should you do?

Options

Q: 18

Which of the following is not true about Dataflow pipelines?

Options

Correct Answer:

Explanation

A Dataflow pipeline is a self-contained, directed acyclic graph (DAG) of data processing steps that executes as a single, isolated job. Each pipeline run, or instance, is independent and does not have a built-in mechanism to directly share in-memory data or state with other running pipeline instances. While multiple pipelines can read from or write to a common external storage system like a Pub/Sub topic or a BigQuery table, this is interaction via external sources and sinks, not direct data sharing between the pipeline execution instances themselves. The core processing logic and intermediate data within one pipeline job are isolated from another.

Why Incorrect

A. A pipeline is fundamentally a series of operations, called PTransforms, that are applied to distributed datasets, called PCollections. This statement is true.

B. A pipeline is the definition of a data processing task. When you execute a pipeline using the Dataflow service, it creates and runs a Dataflow job. This statement is true.

C. The structure of a pipeline, with its PCollections (nodes) and PTransforms (edges), forms a Directed Acyclic Graph (DAG) that represents the entire workflow. This statement is true.

References

1. Google Cloud Documentation, Dataflow, "Pipelines": "A pipeline is a graph of transformations that are applied to collections of data. In the Beam SDKs, a pipeline is represented by an object of the Pipeline class... When you run your pipeline, it executes as a single job on the Dataflow service." This source confirms that a pipeline is a graph of operations (A, C) that runs as a job (B). The concept of a job being a "single" execution implies isolation, contradicting D.

2. Apache Beam Programming Guide, "Overview": "A pipeline is a user-constructed graph of PTransforms that defines the data processing job... The Beam SDKs use this graph to create a Directed Acyclic Graph (DAG) of steps." This reference directly supports that a pipeline is a graph of steps (C) and represents a job (B).

3. Google Cloud Documentation, Dataflow, "Execution graph": "When you run your Dataflow pipeline, Dataflow creates an execution graph from the PTransforms and PCollections in your code. This execution graph is called a pipeline..." This source explicitly links the pipeline to a graph structure (C). The lack of any mention of inter-pipeline communication mechanisms further supports that D is not a feature.

Q: 19

You used Cloud Dataprep to create a recipe on a sample of data in a BigQuery table. You want to reuse this recipe on a daily upload of data with the same schema, after the load job with variable execution time completes. What should you do?

Options

Correct Answer:

Explanation

The core requirement is to run a Cloud Dataprep recipe after a preceding BigQuery load job, which has a variable completion time. This necessitates a workflow orchestration tool that can manage dependencies between tasks, rather than a simple time-based scheduler.

Cloud Composer, a managed Apache Airflow service, is designed for this purpose. By exporting the Dataprep job as a Cloud Dataflow template, it can be integrated as a task within a Cloud Composer Directed Acyclic Graph (DAG). This DAG can be designed so that the Dataflow task (running the Dataprep recipe) is triggered only upon the successful completion of the BigQuery load task, perfectly handling the variable timing.
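The difference between a clock-based trigger and a completion-based trigger can be shown with a small simulation. This is not Airflow code; the two functions are hypothetical stand-ins for the BigQuery load task and the Dataflow task that runs the recipe.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

events = []


def load_job():
    """Stand-in for the BigQuery load; its duration varies from run to run."""
    time.sleep(random.uniform(0.01, 0.05))
    events.append("load finished")
    return True


def run_recipe():
    """Stand-in for the Dataflow job generated from the Dataprep recipe."""
    events.append("recipe started")


with ThreadPoolExecutor(max_workers=1) as pool:
    # result() blocks until the load completes, however long it takes.
    load_succeeded = pool.submit(load_job).result()
    if load_succeeded:
        run_recipe()

print(events)
```

However long the load takes, the recipe starts only afterwards, which is the guarantee that a fixed cron time cannot give.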

Why Incorrect

A. Create a cron schedule in Cloud Dataprep.

This is a time-based scheduler. It cannot guarantee the preceding load job has finished, potentially causing the recipe to run on incomplete or old data.

B. Create an App Engine cron job to schedule the execution of the Cloud Dataprep job.

Similar to option A, this uses a time-based trigger (cron) and cannot manage the dependency on the completion of the variable-time load job.

C. Export the recipe as a Cloud Dataprep template, and create a job in Cloud Scheduler.

Cloud Scheduler is also a time-based cron service. It is unsuitable for event-driven workflows that depend on the completion of other tasks with variable runtimes.

References

1. Google Cloud Documentation - Cloud Dataprep: "Overview of operationalizing". This document explicitly states, "For complex pipelines, you can use Cloud Composer to orchestrate your Dataprep jobs with other data processing tasks." This directly supports using Cloud Composer for orchestration.

2. Google Cloud Documentation - Cloud Composer: "About Cloud Composer". The documentation describes it as a "workflow orchestration service" used to "author, schedule, and monitor pipelines". It highlights the use of operators to integrate with services like BigQuery and Dataflow, enabling the creation of dependency-driven workflows.

3. Google Cloud Documentation - Cloud Dataflow: "Execute a template". This page details how Dataflow templates can be executed from various environments, including using the DataflowTemplateOperator within a Cloud Composer DAG. This confirms the mechanism for integrating the exported Dataprep logic into the orchestrated workflow.

4. Google Cloud Documentation - Cloud Dataprep: "Run a Job". This page clarifies that "When you run a job, your recipe steps are converted into a Cloud Dataflow job that executes over your source data." This establishes the underlying connection between Dataprep and Dataflow, making the export to a Dataflow template a logical step for external orchestration.

Q: 20

You are planning to use Cloud Storage as part of your data lake solution. The Cloud Storage bucket will contain objects ingested from external systems. Each object will be ingested once, and the access patterns of individual objects will be random. You want to minimize the cost of storing and retrieving these objects. You want to ensure that any cost optimization efforts are transparent to the users and applications. What should you do?

Options
