use case:

  • you are training a model to detect cancer cells

    • but you only give it images of cancer cells that came from one patient
  • If there are MORE images from that patient in the validation set, your CV is biased

    • cause your model may only learn how cancer looks for that one patient Solution
  • Make sure all the data points for one person is in the train set OR in the test set

    • only have Bob / Tracy / Mary in the train set
    • only have Dillian’s images in the test set
  • DO NOT MIX BOB’S IMAGES IN THE TRAIN AND TEST SET

  • GroupKFold prevents this mixing from happening