• Problem: You want your model to predict ordinal targets well
  • Solution: rather than predicting which of [0, ⅓, ⅔, 1] y is, your model predicts whether y > 0, y > ⅓, and y > ⅔
    • this is a better representation of the target
  • Why? I think it’s because it’s easier for the network to learn these cumulative probabilities than to directly classify the target as 0, ⅓, ⅔, or 1
  • Note: you need some postprocessing to turn the model’s three outputs into a final prediction on the original [0, ⅓, ⅔, 1] scale
    • the rest of this page explains the math of the postprocessing
  • FAQ:
    • why not represent each ordinal class as a one-hot encoding?
      • because one-hot encoding throws away the ordering of the values (illustrated in the snippet below)
    • this technique works well when your ordinal target has many duplicate values, i.e. many rows share each level
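    • to make the ordering point concrete, here’s a tiny illustration (my own example, not from the source thread): with the cumulative “t > a” encoding, the L1 distance between two encoded classes grows with their ordinal distance, while all distinct one-hot encodings are equally far apart

```python
import numpy as np

levels = np.array([0, 1/3, 2/3, 1])       # the 4 ordinal values
thresholds = levels[:-1]                  # 0, 1/3, 2/3

one_hot = np.eye(4)                       # class i -> row i of the identity
cumulative = (levels[:, None] > thresholds[None, :]).astype(float)
# rows: [0,0,0], [1,0,0], [1,1,0], [1,1,1]

for k in range(4):
    print(k,
          np.abs(one_hot[0] - one_hot[k]).sum(),        # 0, then always 2
          np.abs(cumulative[0] - cumulative[k]).sum())  # 0, 1, 2, 3
```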

  • The main idea:
    • Let t be the true target’s value
    • Assume your target t has unique values [0, ⅓, ⅔, 1]
    • then we will generate 3 new target columns:
      • t > 0, t > ⅓, and t > ⅔
      • we fill in each of these new binary columns based on what the original value of t was
      • e.g. if t = ⅔, then the value for that row in each column is:
        • in the t > 0 column: 1
        • in the t > ⅓ column: 1
        • in the t > ⅔ column: 0
    • We can think of the model’s three outputs as its estimates of “the probability that t > 0”, “that t > ⅓”, and “that t > ⅔”
    • now that we have all these probabilities, we can generate our final answer:
      • t_hat = Σ a · P(t = a), summed over the possible values a
        • where P(t = a) = P(t > a⁻) − P(t > a), with a⁻ the value immediately below a (taking P(t > a⁻) = 1 when a is the smallest value, and P(t > a) = 0 when a is the largest)
        • see below to understand why this works
    • When the values are evenly spaced, as in this example, the formula simplifies to (both steps are sketched in code right after this list):
      • t_hat = mean(P(t > 0), P(t > ⅓), P(t > ⅔))
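    • a minimal numpy sketch of both steps; the data and variable names are my own, not from the source thread

```python
import numpy as np

t = np.array([0, 1/3, 2/3, 1, 1/3])    # original ordinal targets, one per row
thresholds = np.array([0, 1/3, 2/3])   # one threshold per "t > a" column

# Step 1: generate the 3 binary target columns t > 0, t > 1/3, t > 2/3.
binary_targets = (t[:, None] > thresholds[None, :]).astype(float)
# e.g. the row for t = 2/3 is [1, 1, 0], matching the example above

# Step 2 (postprocessing): given the model's predicted probabilities for the
# three columns, the evenly spaced case decodes to their mean.
probs = np.array([[0.9, 0.8, 0.1],     # made-up model outputs for two rows
                  [0.2, 0.1, 0.0]])
t_hat = probs.mean(axis=1)             # = 1/3 * (P(t>0) + P(t>1/3) + P(t>2/3))
print(t_hat)                           # [0.6 0.1]
```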

  • here is how we derive the “values are evenly spaced” case; first, recover each class probability from the cumulative ones:
    • P(t = 0) = 1 - P(t > 0)
    • P(t = ⅓) = P(t > 0) - P(t > ⅓)
    • P(t = ⅔) = P(t > ⅓) - P(t > ⅔)
    • P(t = 1) = P(t > ⅔) - 0
  • Then we can express our estimate of the target as:
    • t_hat = 0·P(t = 0) + ⅓·P(t = ⅓) + ⅔·P(t = ⅔) + 1·P(t = 1)
      • why is this the right formula? it’s the expected value of t under the model’s predicted distribution: t_hat = E[t] = Σ a·P(t = a)
      • the expected value is a natural point prediction: it’s the single value that minimizes the expected squared error against the true t
      • sanity check: when we simplify below, the terms cancel into a clean mean of the new target columns
  • substituting and simplifying (verified numerically in the sketch after this list):
    • t_hat = ⅓(P(t > 0) − P(t > ⅓)) + ⅔(P(t > ⅓) − P(t > ⅔)) + P(t > ⅔)
    • t_hat = ⅓P(t > 0) + ⅓P(t > ⅓) + ⅓P(t > ⅔)
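  • a quick numeric check (my own code, not from the source): the general decode via P(t = a) matches the mean shortcut

```python
import numpy as np

values = np.array([0, 1/3, 2/3, 1])
rng = np.random.default_rng(0)

# Fake predictions for P(t > 0), P(t > 1/3), P(t > 2/3),
# sorted descending so they look like a sensible survival curve.
p_gt = np.sort(rng.uniform(size=3))[::-1]

# General decode: P(t = a) = P(t > value below a) - P(t > a),
# padding with P(t > below smallest) = 1 and P(t > largest) = 0.
p_gt_padded = np.concatenate(([1.0], p_gt, [0.0]))
p_eq = p_gt_padded[:-1] - p_gt_padded[1:]   # one probability per value
t_hat_general = (values * p_eq).sum()

# Evenly spaced shortcut: the mean of the three probabilities.
t_hat_mean = p_gt.mean()

print(t_hat_general, t_hat_mean)
assert np.isclose(t_hat_general, t_hat_mean)
```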
  • https://www.kaggle.com/competitions/google-quest-challenge/discussion/129978
    • the Kaggler couldn’t remember which paper the trick came from
    • note: I rewrote the explanation here because the thread’s notation was confusing
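  • to tie it all together, a hedged end-to-end sketch: one logistic regression per threshold column on synthetic data (the dataset, feature, and variable names are illustrative assumptions, not the setup from the thread, which used a neural network)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
values = np.array([0, 1/3, 2/3, 1])
thresholds = values[:-1]

# Synthetic data: one feature that loosely tracks the ordinal class.
n = 1000
class_idx = rng.integers(0, 4, size=n)
X = (class_idx + rng.normal(scale=0.7, size=n)).reshape(-1, 1)
t = values[class_idx]

# Fit one binary classifier per "t > a" column.
models = [LogisticRegression().fit(X, (t > a).astype(int)) for a in thresholds]

# Postprocess: stack the P(t > a) predictions, then take the mean
# (valid here because the values are evenly spaced).
p_gt = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
t_hat = p_gt.mean(axis=1)

print("MAE:", np.abs(t_hat - t).mean())
```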