Kubeflow is an open source platform for developing, orchestrating, deploying, and running scalable and portable machine learning workflows on Kubernetes. Kubeflow Pipelines is a component of Kubeflow that lets you build and manage end-to-end machine learning pipelines using a graphical user interface or a Python-based domain-specific language (DSL). Kubeflow Pipelines can help you automate and orchestrate your machine learning workflows and integrate them with various Google Cloud services and tools [1].
One of the Google Cloud services that you can use with Kubeflow Pipelines is BigQuery, a serverless, scalable, and cost-effective data warehouse that lets you run fast, complex queries on large-scale data. BigQuery can help you analyze and prepare your data for machine learning, and store and manage your machine learning models [2].
The easiest way to execute a query against BigQuery as the first step in your Kubeflow pipeline, and to use the results of that query as the input to the next step, is the BigQuery Query Component, a pre-built component available in the Kubeflow Pipelines repository on GitHub. The BigQuery Query Component runs a SQL query on BigQuery and outputs the results as a table or a file. You can load the component into your pipeline from its URL, specify the query and the output parameters, and then use the component's output as the input to the next step in your pipeline, such as a data processing or a model training step [3].
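As a minimal sketch of this approach, assuming the KFP v1 SDK: the component URL, project, dataset, bucket path, and the downstream train_op are placeholders, and parameter names vary by component version, so check the kubeflow/pipelines repository for the current component path before using it.

```python
import kfp
from kfp import dsl
from kfp.components import load_component_from_url

# Assumed URL for the pre-built BigQuery Query Component; verify the current
# path in the kubeflow/pipelines repository before relying on it.
BIGQUERY_COMPONENT_URL = (
    'https://raw.githubusercontent.com/kubeflow/pipelines/master/'
    'components/gcp/bigquery/query/component.yaml'
)
bigquery_query_op = load_component_from_url(BIGQUERY_COMPONENT_URL)

@dsl.pipeline(
    name='bq-query-pipeline',
    description='Run a BigQuery query, then feed the results to the next step.'
)
def bq_pipeline(project_id: str = 'my-project'):  # placeholder project ID
    query_task = bigquery_query_op(
        query='SELECT * FROM `my-project.my_dataset.my_table`',  # placeholder query
        project_id=project_id,
        # One commonly exposed output location for the query results; the exact
        # parameter name depends on the component version.
        output_gcs_path='gs://my-bucket/bq_results/data.csv',
    )
    # A hypothetical downstream training component consuming the query output:
    # train_task = train_op(training_data=query_task.outputs['output_gcs_path'])
```

Because the component is defined declaratively in a YAML file, Kubeflow Pipelines handles the container image and the wiring of outputs to downstream inputs for you.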
The other options are not as easy or as feasible. Using the BigQuery console to execute your query and then saving the query results into a new BigQuery table does not integrate with your Kubeflow pipeline, requires manual intervention, and duplicates data. Writing a Python script that uses the BigQuery API to execute queries against BigQuery is not ideal, as it requires custom code and makes authentication and error handling your responsibility (see the sketch after this paragraph). Using the Kubeflow Pipelines DSL to create a custom component that uses the Python BigQuery client library to execute queries is not optimal, as it requires building and packaging a Docker container image for the component, then testing and debugging it.
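For comparison, here is a minimal sketch of what the plain Python script option involves, using the google-cloud-bigquery client library (all project and table names are placeholders); note how credentials and error handling become your own code rather than the pipeline's:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

def run_query(sql: str, destination_table: str) -> None:
    # The client resolves credentials from the environment, e.g. via the
    # GOOGLE_APPLICATION_CREDENTIALS variable; you must manage these yourself.
    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(destination=destination_table)
    try:
        query_job = client.query(sql, job_config=job_config)
        query_job.result()  # block until the query job completes
    except Exception as exc:
        # Retries, logging, and failure propagation are all custom code here.
        raise RuntimeError(f'BigQuery job failed: {exc}') from exc

# Placeholder identifiers for illustration only:
run_query(
    'SELECT * FROM `my-project.my_dataset.my_table`',
    'my-project.my_dataset.query_results',
)
```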
References:
[1] Kubeflow Pipelines overview
[2] BigQuery overview
[3] BigQuery Query Component