Q: 10
You are experimenting with a built-in distributed XGBoost model in Vertex AI Workbench user-
managed notebooks. You use BigQuery to split your data into training and validation sets using the
following queries:
CREATE OR REPLACE TABLE `myproject.mydataset.training` AS
(SELECT * FROM `myproject.mydataset.mytable` WHERE RAND() < 0.8);
CREATE OR REPLACE TABLE `myproject.mydataset.validation` AS
(SELECT * FROM `myproject.mydataset.mytable` WHERE RAND() < 0.2);
After training the model, you achieve an area under the receiver operating characteristic curve (AUC
ROC) value of 0.8, but after deploying the model to production, you notice that your model
performance has dropped to an AUC ROC value of 0.65. What problem is most likely occurring?
Options
Discussion
Wait, but doesn’t the RAND() approach here mean some records show up in both training and validation tables, not necessarily every record? Feels like partial overlap (option C) is the bigger issue, especially since D would only ever happen if you got super unlucky with a tiny dataset. Am I missing something?
C. D is a trap: it's not that every record is duplicated, just that some overlap occurs because RAND() is evaluated separately for each row in each query. I've seen this come up on similar questions.
C
D here. Since RAND() < 0.2 runs separately, it's possible (though rare) for every record to satisfy both conditions and end up in both tables, especially if the dataset is tiny or badly randomized. Not totally sure, open to other takes.
D, since if a row gets RAND() < 0.2 both times, it's in both sets for sure. So it's technically possible every record lands in both if you're really unlucky, especially with small tables. Not totally confident though, might be missing something about typical overlap rates.
Yeah this is definitely C. The way RAND() works in both queries means some records will end up in both tables, which messes with your validation accuracy. Pretty common pitfall if you aren't using a deterministic split like FARM_FINGERPRINT. Agree?
D isn't right here. C is the real issue: separate RAND() calls mean some records land in both sets, so training data leaks into your validation set. Not 100% sure there isn't a tiny edge case for D with a tiny dataset, but C matches what usually happens.
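A quick way to sanity-check the partial-vs-total overlap argument is a simulation (a sketch, not BigQuery itself; it just assumes each query draws an independent uniform value per row, which is how RAND() behaves across two separate queries):

```python
import random

random.seed(42)

n_rows = 100_000
row_ids = range(n_rows)

# Each query evaluates RAND() independently for every row.
training = {r for r in row_ids if random.random() < 0.8}
validation = {r for r in row_ids if random.random() < 0.2}

overlap = training & validation

# Expected overlap is 0.8 * 0.2 = 16% of all rows: partial, not total.
print(f"training: {len(training)}, validation: {len(validation)}")
print(f"overlap: {len(overlap)} ({len(overlap) / n_rows:.1%})")
```

Roughly 16% of rows end up in both tables, which is exactly the "some overlap, not every record" situation described above.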
C tbh, partial overlap is the actual gotcha here, not full duplication. With RAND() like that you always risk leaking some records into both sets unless you hash on unique ids instead. If I'm wrong let me know, but pretty sure that's what trips people up.
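For anyone curious what "hash on unique ids instead" looks like, here's a minimal sketch of a deterministic split. It mirrors the FARM_FINGERPRINT approach but uses Python's hashlib as a stand-in, and the row keys are made up for illustration:

```python
import hashlib

def split_bucket(key: str, n_buckets: int = 10) -> int:
    # Deterministic hash of a unique row key (stand-in for FARM_FINGERPRINT).
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

row_ids = [f"row-{i}" for i in range(1_000)]

# Buckets 0-7 -> training (~80%), buckets 8-9 -> validation (~20%).
training = [r for r in row_ids if split_bucket(r) < 8]
validation = [r for r in row_ids if split_bucket(r) >= 8]

# The same key always hashes to the same bucket, so the sets can't overlap.
assert not set(training) & set(validation)
assert len(training) + len(validation) == len(row_ids)
```

Because membership is a pure function of the row key, rerunning either query gives the same split every time, which is what RAND() fails to guarantee.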
C/D? If the question stressed you must avoid any overlap, D wins, but for practical leakage C is correct.
It's D, since if RAND() gives a value less than 0.2, those records would always be in both sets. It's a pretty extreme edge case, but technically that overlap could cover every row if the table is small enough or the random numbers lined up. Not 100% sure, so feel free to disagree.