summary
- KLDivergence measures how one probability distribution diverges from a second, expected probability distribution.
- You mainly use it when the model’s output represents a probability distribution
- This video explains it intuitively: https://www.youtube.com/watch?v=SxGYPqCgJWM
- if two distributions assign similar probabilities to the same sequence, then the distributions are similar
- and vice versa
- it’s a natural measure of dissimilarity between probability distributions, motivated by asking how likely the second distribution would be to generate samples drawn from the first
- you can use it:
- as a model loss
- to detect feature drift between your training data and production data, though the population stability index (PSI) is often preferred since it’s symmetric (see the sketch below)
- L(y_true, y_pred) = y_true * (log y_true - log y_pred), summed over classes
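A minimal NumPy sketch of the loss above, plus the PSI check mentioned in the drift bullet (the function names and the epsilon clipping are my own choices, not a library API):

```python
import numpy as np

def kl_divergence(y_true, y_pred, eps=1e-12):
    # D_KL(y_true || y_pred), matching the formula above.
    # eps clips away zeros so log() stays finite; real loss implementations do this too.
    y_true = np.clip(y_true, eps, 1.0)
    y_pred = np.clip(y_pred, eps, 1.0)
    return np.sum(y_true * (np.log(y_true) - np.log(y_pred)))

def psi(expected, actual, eps=1e-12):
    # Population stability index over binned proportions. Symmetric in its
    # arguments, unlike KL: psi(e, a) == psi(a, e) == KL(a||e) + KL(e||a).
    expected = np.clip(expected, eps, 1.0)
    actual = np.clip(actual, eps, 1.0)
    return np.sum((actual - expected) * np.log(actual / expected))

p = np.array([0.7, 0.2, 0.1])
print(kl_divergence(p, np.array([0.6, 0.3, 0.1])))  # similar -> small (~0.03)
print(kl_divergence(p, np.array([0.1, 0.2, 0.7])))  # dissimilar -> larger (~1.17)
print(psi(p, np.array([0.6, 0.3, 0.1])))            # drift check between two binnings
```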
- I’m not sure if KL divergence is the same as adding label smoothing
- I do see people doing KLDivergence AND label smoothing together
- I think we just need to test.
- Is label smoothing equivalent to adding a KL divergence term or a cross-entropy term?
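For what it’s worth, the two are connected: cross-entropy against a target distribution q equals H(q) + D_KL(q||p), so with smoothed targets, the CE loss and the KL loss differ only by the (constant) entropy of the smoothed targets and give identical gradients. A minimal NumPy check of that claim (alpha, the helper names, and the test values are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def smooth(one_hot, alpha=0.1):
    # Standard label smoothing: mix the one-hot target with the uniform distribution.
    k = one_hot.shape[-1]
    return (1 - alpha) * one_hot + alpha / k

y_smooth = smooth(np.array([0.0, 1.0, 0.0]))

for _ in range(3):
    logits = rng.normal(size=3)
    y_pred = np.exp(logits) / np.exp(logits).sum()  # softmax
    ce = -np.sum(y_smooth * np.log(y_pred))
    kl = np.sum(y_smooth * (np.log(y_smooth) - np.log(y_pred)))
    # ce - kl equals the entropy of y_smooth, independent of y_pred:
    print(ce - kl)  # prints the same constant every iteration
```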
- KL divergence is not technically a distance metric, since it’s not symmetric ( D_KL(P||Q) != D_KL(Q||P) ).
TODO: cover reverse/forward KL divergence
- Since it’s not symmetric, the order in which you put the P and Q distributions matters
- https://agustinus.kristia.de/techblog/2016/12/21/forward-reverse-kl/
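A quick numeric illustration of the asymmetry (the two distributions are arbitrary picks):

```python
import numpy as np

def kl(p, q):
    # D_KL(P || Q): expectation under P of log(P/Q).
    return np.sum(p * (np.log(p) - np.log(q)))

p = np.array([0.9, 0.05, 0.05])   # peaked distribution
q = np.array([1/3, 1/3, 1/3])     # uniform distribution

print(kl(p, q))  # forward KL, D_KL(P||Q) ~ 0.70
print(kl(q, p))  # reverse KL, D_KL(Q||P) ~ 0.93 -- a different number
```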