Question 16 - Google Professional Data Engineer Real Exam Questions [June 2026 Update]

Q: 16

An external customer provides you with a daily dump of data from their database. The data flows into Google Cloud Storage (GCS) as comma-separated values (CSV) files. You want to analyze this data in Google BigQuery, but the data could contain rows that are incorrectly formatted or corrupted. How should you build this pipeline?

Options

Correct Answer:

Explanation

A Google Cloud Dataflow batch pipeline is the most robust and scalable solution for this scenario. Dataflow is designed for large-scale Extract, Transform, and Load (ETL) operations. Within a Dataflow pipeline, you can implement custom logic to parse and validate each row from the source CSV files. By catching parsing exceptions inside the transform and emitting the failed rows to a side output, the pipeline handles bad input without failing. Valid records are written to the main BigQuery table, while corrupted or malformed records are routed to a separate "dead-letter" table. This ensures that bad data does not halt the main ingestion process, and it provides a mechanism to isolate, analyze, and remediate the problematic records.
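The routing logic described above can be sketched in plain Python. This is only an illustration of the validate-and-route idea, not Dataflow code: the three-column schema (id, name, amount) and the function names parse_row and route_rows are hypothetical, and in a real pipeline the same try/except logic would typically sit inside a transform that emits failures to a side output.

```python
import csv
import io

# Hypothetical schema for illustration: id (int), name (str), amount (float).
EXPECTED_COLUMNS = 3


def parse_row(line):
    """Parse one CSV line into a record; raise ValueError if it is malformed."""
    rows = list(csv.reader(io.StringIO(line)))
    if len(rows) != 1:
        raise ValueError("expected exactly one CSV record")
    fields = rows[0]
    if len(fields) != EXPECTED_COLUMNS:
        raise ValueError(f"expected {EXPECTED_COLUMNS} fields, got {len(fields)}")
    # int() and float() raise ValueError on values that cannot be converted.
    return {"id": int(fields[0]), "name": fields[1], "amount": float(fields[2])}


def route_rows(lines):
    """Split raw lines into valid records and dead-letter entries."""
    valid, dead_letter = [], []
    for line in lines:
        try:
            valid.append(parse_row(line))
        except ValueError as err:
            # Keep the raw line and the reason so the row can be inspected later.
            dead_letter.append({"raw_line": line, "error": str(err)})
    return valid, dead_letter


valid, dead_letter = route_rows(["1,alice,9.5", "2,bob", "x,carol,1.0"])
```

Here the first line would end up in valid, while the other two would land in dead_letter together with the reason they failed. Keeping the raw line alongside the error message is what makes later remediation possible.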

Why Incorrect

A. Federated queries are not suitable for ingestion pipelines with data quality issues, as a malformed row can cause the entire query to fail, and they lack robust, built-in error-handling mechanisms for routing bad records.

B. Monitoring and alerting are reactive measures. They would notify you after a load job has failed due to bad data but do not provide a mechanism to handle the bad records and successfully load the valid ones.

C. Setting maxBadRecords to 0 instructs BigQuery to fail the entire load job if even a single bad record is found. This is the opposite of the requirement to build a resilient pipeline that can handle corrupted rows.
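The threshold behavior behind option C can be modeled in a few lines. This is a toy model of the rule "the job fails when the number of bad records exceeds the limit", not a call to the BigQuery API; the function name load_job_outcome is hypothetical.

```python
def load_job_outcome(bad_record_count, max_bad_records=0):
    """Model the load-job rule: fail when bad records exceed the allowed maximum."""
    return "failed" if bad_record_count > max_bad_records else "success"


# With the limit at 0, a single bad row is enough to fail the whole job.
outcome_strict = load_job_outcome(bad_record_count=1, max_bad_records=0)
# A higher limit lets the job finish, but the bad rows are only ignored.
outcome_lenient = load_job_outcome(bad_record_count=3, max_bad_records=5)
```

Even with a higher limit, the bad rows are ignored rather than captured, so this setting alone gives no way to inspect or repair them.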

References

1. Cloud Dataflow for Robust ETL: Google Cloud's official documentation outlines patterns for building resilient data pipelines. The use of side outputs to handle errors is a standard practice. "A common pattern is to use a dead-letter queue... For a streaming pipeline, you can use a side output to capture errors." This principle applies equally to batch pipelines for routing bad records.

Source: Google Cloud Documentation, "Handle pipeline errors," Common error-handling patterns.

2. BigQuery Load Job Error Handling: The BigQuery documentation specifies the maxBadRecords property for load jobs. "The maximum number of bad records that BigQuery can ignore when running the job. If the number of bad records exceeds this value, an invalid error is returned in the job result. The default value is 0, which requires that all records are valid." This confirms that option C would cause the job to fail.

Source: Google Cloud Documentation, BigQuery API, "Jobs," JobConfigurationLoad resource, maxBadRecords property.

3. Limitations of Federated Queries for ETL: While useful for ad-hoc analysis, federated (external) tables are not ideal for robust ETL. The documentation notes that query performance is lower and that errors in the source data can cause query failures. "If the external data is not clean, queries can fail. For example, if there are extra columns or incorrect data types."

Source: Google Cloud Documentation, "Introduction to external tables."

4. Data Processing Pipeline Architectures: The Google Cloud Architecture Center provides reference architectures for data processing. For scenarios requiring transformation and validation before loading, Dataflow is the recommended tool. "This pipeline uses Dataflow to pull messages from Pub/Sub, process them, and then write them to BigQuery for analysis." This pattern of using Dataflow for pre-load processing is standard.

Source: Google Cloud Architecture Center, "Serverless data-processing pipeline."
