You want to see if adding new pseudo-labelled data from this data source helps your CV score
Assume you trained a model on all of your labelled training data
Then you use that model to pseudo-label some new, unlabelled data
You train your model on the old training data + the new pseudo-labelled data you made
Now you do CV on your new model to see if the CV score improves
THERE IS IMPLICIT DATA LEAKAGE
Why? Because the model that labelled the new dataset was trained on all of the labelled data, including every fold you later hold out
So when you do your CV, the pseudo-labelled training data is DERIVED from information in your held-out (validation) fold
“In order to fix the problem, we generated 5 different sets of pseudolabels where for each train/val split we used only those models that were trained using only the current train set.”
Your pseudo-labels will be less accurate, since each labelling model sees only part of the data, but that’s fine. You’re just doing CV
For the final submission, retrain your model on all of the labelled data first, then use that model to do the pseudo-labelling.
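
The per-fold fix quoted above can be sketched roughly like this. It is only a minimal illustration, assuming scikit-learn, a logistic regression as a stand-in model, and synthetic data; the function name leak_free_pseudo_label_cv and all variable names are made up for the example and are not from the original write-up.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold


def leak_free_pseudo_label_cv(X, y, X_unlabelled, n_splits=5, seed=0):
    """Return per-fold validation accuracy, with pseudo-labels made per fold."""
    scores = []
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kfold.split(X):
        X_train, y_train = X[train_idx], y[train_idx]

        # The labelling model sees only this fold's train part,
        # never the held-out part, so nothing leaks into the pseudo-labels.
        labeller = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        pseudo_y = labeller.predict(X_unlabelled)

        # Retrain on fold-train data plus this fold's own pseudo-labels.
        X_aug = np.vstack([X_train, X_unlabelled])
        y_aug = np.concatenate([y_train, pseudo_y])
        model = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)

        # Score on the held-out fold only.
        scores.append(model.score(X[val_idx], y[val_idx]))
    return scores


# Synthetic data, purely to make the sketch runnable (an assumption).
X_all, y_all = make_classification(n_samples=600, n_features=10, random_state=0)
X_lab, y_lab, X_unlab = X_all[:300], y_all[:300], X_all[300:]

fold_scores = leak_free_pseudo_label_cv(X_lab, y_lab, X_unlab)
print("mean CV accuracy:", np.mean(fold_scores))
```

Compare this mean against the same CV loop without the pseudo-labelled rows to see whether the extra data actually helps. Each fold gets its own set of pseudo-labels, so with 5 folds you generate 5 sets, as in the quote.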