Q: 20
You are planning to use Cloud Storage as part of your data lake solution. The Cloud Storage bucket
will contain objects ingested from external systems. Each object will be ingested once, and the access
patterns of individual objects will be random. You want to minimize the cost of storing and retrieving
these objects. You want to ensure that any cost optimization efforts are transparent to the users and
applications. What should you do?
Options
Discussion
I think C could work since Cloud Functions can run on a schedule and process Pub/Sub messages, but not sure how well it scales with high event loads. Not as robust as Dataflow maybe, but still viable for light pipelines. Anyone see issues?
A imo. Streaming Dataflow with tumbling windows is made for real-time scalable aggregation like this. B is tempting but loses out on near-real-time and scalability, especially with huge event spikes. Open to other takes though.
Streaming aggregation is the scalable way here, so A.
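For anyone unsure what "tumbling windows" means in the comments above, here is a minimal plain-Python sketch of the idea: events are grouped into fixed, non-overlapping time buckets and aggregated per bucket. This is not Dataflow or Apache Beam code. The function name, the `(timestamp, key)` event shape, and the 60-second window size are assumptions made up for illustration only.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds=60):
    """Count events per key in fixed, non-overlapping (tumbling) windows.

    events: iterable of (timestamp_seconds, key) pairs (hypothetical shape).
    Returns a dict mapping (window_start, key) to the event count.
    """
    counts = defaultdict(int)
    for ts, key in events:
        # Each timestamp belongs to exactly one window, identified by its start.
        window_start = int(ts // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Illustrative data only: two windows, [0, 60) and [60, 120).
events = [(5, "click"), (42, "click"), (61, "click"), (75, "view")]
result = tumbling_window_counts(events, window_seconds=60)
```

A streaming engine does the same grouping continuously and in parallel as events arrive, which is the scalability point made for A; this sketch only shows the windowing logic on a small in-memory list.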