**Problem:** You want your model to predict ordinal targets well, e.g. when you're optimizing a ranking-style metric such as Jaccard similarity (aka Intersection over Union).

**Solution:** rather than predicting whether y is `0, ⅓, ⅔, or 1`, your model predicts whether `y > 0`, `y > ⅓`, and `y > ⅔`

- this is a better representation of the target

**Why?** I think it's because it's easier for the network to learn these cumulative probabilities than to directly predict whether the target is `0, ⅓, ⅔, or 1`

- Note: you need to do some postprocessing so the final output is the model's probability that y is `0, ⅓, ⅔, or 1`
- the rest of this page explains the math of the postprocessing

**FAQ:**

- why not represent each ordinal class as a one-hot encoding?
    - because you lose the ordering of the values
- this technique works well if your ordinal target has duplicate values

- The main idea:
    - Let `t` be the true target's value
    - Assume your target t has unique values `[0, ⅓, ⅔, 1]`
    - then we will generate 3 new target columns: `t > 0`, `t > ⅓`, and `t > ⅔`
    - we fill in each of these new binary columns based on what the original value of t was
    - e.g. if t = ⅔, then the value for that row in each column is:
        - in the `t > 0` column: 1
        - in the `t > ⅓` column: 1
        - in the `t > ⅔` column: 0
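A minimal sketch of this encoding in numpy (the variable names and example rows are mine, not from the original write-up):

```python
import numpy as np

# Encode an ordinal target with unique values [0, 1/3, 2/3, 1]
# into 3 cumulative binary columns: t > 0, t > 1/3, t > 2/3.
t = np.array([0.0, 1 / 3, 2 / 3, 1.0])        # one example row per unique value
thresholds = np.array([0.0, 1 / 3, 2 / 3])    # every unique value except the largest

# column j is 1 where t > thresholds[j], else 0
cumulative_targets = (t[:, None] > thresholds[None, :]).astype(int)

print(cumulative_targets)
# the row for t = 2/3 comes out [1, 1, 0], matching the example above
```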

- We can think of these predictions as the model guessing "the probability that `t > 0`, `t > ⅓`, or `t > ⅔`"
- now that we have all these probabilities, we can generate our final answer:
    - **t_hat = $∑_{a}a⋅P(t=a)$**
    - where `P(t = a) = P(t > a_prev) - P(t > a)`, and `a_prev` is the unique value just below `a` (for the smallest value use `P(t > a_prev) = 1`; for the largest, `P(t > a) = 0`)
    - see below to understand why this works


- When the values are evenly spaced, as in this example, the formula simplifies to:
    - **t_hat = mean(P(t > 0), P(t > ⅓), P(t > ⅔))**


- here is the derivation for the case where the values are evenly spaced:
- P(t = 0) = 1 - P(t > 0)
- P(t = ⅓) = P(t > 0) - P(t > ⅓)
- P(t = ⅔) = P(t > ⅓) - P(t > ⅔)
- P(t = 1) = P(t > ⅔) - P(t > 1) = P(t > ⅔), since P(t > 1) = 0
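These four identities can be sanity-checked numerically; here's a sketch (the cumulative probabilities are made-up model outputs, not from the original post):

```python
# Recover the per-class probabilities P(t = a) from cumulative predictions.
# Assumes the cumulative probabilities are non-increasing; otherwise
# some P(t = a) would come out negative.
p_gt = [0.9, 0.6, 0.2]            # made-up P(t > 0), P(t > 1/3), P(t > 2/3)

# pad with P(t > anything below the min) = 1 and P(t > max) = 0,
# then take differences of adjacent entries
padded = [1.0] + p_gt + [0.0]
p_eq = [padded[i] - padded[i + 1] for i in range(len(padded) - 1)]

print(p_eq)                       # [P(t=0), P(t=1/3), P(t=2/3), P(t=1)]
assert abs(sum(p_eq) - 1.0) < 1e-9   # the sum telescopes to exactly 1
```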

- Then we can express the predicted target value as:
    - t_hat = 0·P(t = 0) + ⅓·P(t = ⅓) + ⅔·P(t = ⅔) + 1·P(t = 1)
- TBH I still don't fully understand why this is the right formula. how did they come up with it in the first place???
    - My hunch: **t_hat = $∑_{a}a⋅P(t=a)$** weights each possible value `a` by the model's probability of it, so it's just the expected value of t under the model's predicted distribution
    - another reason I trust this: below, when we simplify, many terms cancel out and we get a nice "mean" of all the new target columns


- substituting the identities above into t_hat = 0·P(t = 0) + ⅓·P(t = ⅓) + ⅔·P(t = ⅔) + 1·P(t = 1) and simplifying:
    - t_hat = ⅓(P(t > 0) - P(t > ⅓)) + ⅔(P(t > ⅓) - P(t > ⅔)) + P(t > ⅔)
    - **t_hat = ⅓P(t > 0) + ⅓P(t > ⅓) + ⅓P(t > ⅔)**
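The full decode step, plus a numeric check that the weighted sum and the "mean" shortcut agree (the cumulative probabilities are made-up model outputs):

```python
# Check: for evenly spaced values, the expected-value decoder equals
# the mean of the cumulative probabilities.
values = [0.0, 1 / 3, 2 / 3, 1.0]
p_gt = [0.9, 0.6, 0.2]            # made-up P(t > 0), P(t > 1/3), P(t > 2/3)

# per-class probabilities via differences of the padded cumulative probs
padded = [1.0] + p_gt + [0.0]
p_eq = [padded[i] - padded[i + 1] for i in range(len(padded) - 1)]

t_hat_full = sum(a * p for a, p in zip(values, p_eq))  # sum over a of a * P(t = a)
t_hat_mean = sum(p_gt) / len(p_gt)                     # mean of cumulative probs

print(t_hat_full, t_hat_mean)     # both come out to the same value
```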

- source: https://www.kaggle.com/competitions/google-quest-challenge/discussion/129978
    - the kaggler forgot which paper this trick was from
    - note that I rewrote this because their explanation wasn't very good (their notation was confusing)