🏖️ Kaggle Solutions

pseudo-labeling

Mar 01, 2025, 2 min read

Pseudo-labeling is when you take your trained model and label new, unlabeled data with the model’s predictions.
Once you have the predictions, concat them with your original training data and retrain everything. This often yields better results.
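
A minimal sketch of that loop, in plain Python so it runs anywhere. `CentroidModel` is a toy stand-in invented for this example (not from any library), and `pseudo_label` is a hypothetical helper name. Swap in any model with `fit`/`predict`.

```python
from statistics import mean


class CentroidModel:
    """Toy 1-D classifier (illustrative only): predicts the class with the nearest mean."""

    def fit(self, X, y):
        self.centroids = {c: mean(x for x, t in zip(X, y) if t == c) for c in set(y)}
        return self

    def predict(self, X):
        return [min(self.centroids, key=lambda c: abs(x - self.centroids[c])) for x in X]


def pseudo_label(make_model, X_train, y_train, X_unlabeled):
    # 1. Train on the labeled data only.
    model = make_model().fit(X_train, y_train)
    # 2. Treat the model's predictions on the unlabeled data as labels.
    pseudo_y = model.predict(X_unlabeled)
    # 3. Concat old and pseudo-labeled data, then retrain from scratch.
    X_all = list(X_train) + list(X_unlabeled)
    y_all = list(y_train) + list(pseudo_y)
    return make_model().fit(X_all, y_all), pseudo_y


X_train = [0.0, 1.0, 9.0, 10.0]
y_train = [0, 0, 1, 1]
X_unlabeled = [2.0, 8.0]

final_model, pseudo_y = pseudo_label(CentroidModel, X_train, y_train, X_unlabeled)
```

In practice people often keep only the high-confidence predictions as pseudo-labels. This sketch skips that step to stay short.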

Important Notes:

When using pseudo-labeled data, make sure that you are cross-validating properly:
- https://www.kaggle.com/c/google-quest-challenge/discussion/129840
- The situation:
  - you want to see if adding pseudo-labeled data from a new data source helps your CV
- 1. Assume you trained a model on all of your labeled training data
- 2. Then you use that model to label some new, unlabeled data
- 3. You train a new model on the old training data + the pseudo-labeled data you made
- 4. Now you run CV on the new model to see if the CV score improves
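- THERE IS IMPLICIT DATA LEAKAGE
  - Why? Because the model that labeled the new data was trained on all of the labeled data, including the rows that later land in your validation folds
  - so when you do your CV, the pseudo-labeled training data is DERIVED from information in your VALIDATION fold, and the CV score comes out too optimistic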
- “In order to fix the problem, we generated 5 different sets of pseudolabels where for each train/val split we used only those models that were trained using only the current train set.”
  - your pseudolabels will be less accurate, since each labeling model sees less data, but that’s fine. you’re just doing CV
- Before you submit, train your model on all of the labeled data, generate the pseudo-labels with it, and then retrain on everything.
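
The per-fold fix can be sketched like this. It is only an illustration of the idea in the quote, not the code from the linked discussion. `CentroidModel`, `k_folds`, and `cv_with_pseudo_labels` are hypothetical names made up for this example, and the data is a toy.

```python
from statistics import mean


class CentroidModel:
    """Toy 1-D classifier (illustrative only): predicts the class with the nearest mean."""

    def fit(self, X, y):
        self.centroids = {c: mean(x for x, t in zip(X, y) if t == c) for c in set(y)}
        return self

    def predict(self, X):
        return [min(self.centroids, key=lambda c: abs(x - self.centroids[c])) for x in X]


def k_folds(n, k):
    """Yield (train_idx, val_idx) pairs; fold i holds out every index j with j % k == i."""
    for i in range(k):
        val_idx = [j for j in range(n) if j % k == i]
        train_idx = [j for j in range(n) if j % k != i]
        yield train_idx, val_idx


def cv_with_pseudo_labels(make_model, X, y, X_unlabeled, k=5):
    scores = []
    for train_idx, val_idx in k_folds(len(X), k):
        X_tr = [X[j] for j in train_idx]
        y_tr = [y[j] for j in train_idx]
        X_val = [X[j] for j in val_idx]
        y_val = [y[j] for j in val_idx]

        # The labeler sees ONLY this fold's train split, never the validation rows,
        # so the pseudo-labels carry no information from the held-out fold.
        labeler = make_model().fit(X_tr, y_tr)
        pseudo_y = labeler.predict(X_unlabeled)

        # Retrain on this fold's train split + its own set of pseudo-labels.
        model = make_model().fit(X_tr + list(X_unlabeled), y_tr + pseudo_y)

        # Score on the untouched validation fold (plain accuracy here).
        preds = model.predict(X_val)
        scores.append(mean(float(p == t) for p, t in zip(preds, y_val)))
    return scores


X = [0.0, 1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0, 14.0]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
X_unlabeled = [1.5, 2.5, 11.5, 12.5]

scores = cv_with_pseudo_labels(CentroidModel, X, y, X_unlabeled, k=5)
```

With 5 folds you end up with 5 different sets of pseudo-labels, one per train/val split, which matches the setup described in the quote. Compare these scores against the same CV run without the pseudo-labeled rows to see whether they actually help.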
