Q: 5
In order for Structured Streaming to reliably track the exact progress of processing so that it can
handle any kind of failure by restarting and/or reprocessing, which two of the following approaches
does Spark use to record the offset range of the data being processed in each trigger?
Options
Discussion
Probably A. From what I remember, Spark records offset ranges using both checkpointing and write-ahead logs so it can recover reliably after a failure. Idempotent sinks matter for end-to-end output guarantees, but I don't think they're used for tracking progress itself. Anyone see otherwise?
A is what I've seen in the docs too: Spark uses checkpointing and write-ahead logs together to record the offsets consumed in each trigger, which is what lets it recover from failures. Pretty sure that's the best fit here, unless something changed recently.
C vs E? Both mention idempotent sinks, but isn't it replayable sources that are needed for full recovery? Not 100 percent sure here.
Not C, A
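To make the idea behind answer A concrete, here's a toy sketch in plain Python of how durably logging each trigger's offset range before processing enables exact recovery after a restart. This is an illustration of the checkpoint/write-ahead-log pattern only, not Spark's actual implementation; the `OffsetLog` class and its methods are made up for this example.

```python
import json
import os
import tempfile

class OffsetLog:
    """Toy write-ahead log for offset ranges (NOT Spark's real code).
    Each trigger's offset range is durably written *before* the batch
    is processed, so a restarted query knows exactly which data range
    to reprocess."""

    def __init__(self, checkpoint_dir):
        self.dir = checkpoint_dir
        os.makedirs(checkpoint_dir, exist_ok=True)

    def commit(self, batch_id, start_offset, end_offset):
        # Record the offset range for this trigger ahead of processing.
        path = os.path.join(self.dir, f"{batch_id}.json")
        with open(path, "w") as f:
            json.dump({"start": start_offset, "end": end_offset}, f)

    def latest(self):
        # On restart, recover the most recently logged offset range.
        batches = [int(name.split(".")[0]) for name in os.listdir(self.dir)]
        if not batches:
            return None
        last = max(batches)
        with open(os.path.join(self.dir, f"{last}.json")) as f:
            return last, json.load(f)

# Usage: log two triggers, then simulate a restart and recovery.
ckpt = tempfile.mkdtemp()
log = OffsetLog(ckpt)
log.commit(0, 0, 100)
log.commit(1, 100, 250)
batch_id, offsets = OffsetLog(ckpt).latest()  # fresh instance = "restarted" query
print(batch_id, offsets)  # → 1 {'start': 100, 'end': 250}
```

In real Spark, you get this behavior by setting the `checkpointLocation` option on the streaming query; the engine persists offset ranges to that directory and replays the in-flight batch from a replayable source after a failure.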