• for classification problems, rather than using hard binary targets (0 or 1), label smoothing softens them, so the correct class becomes e.g. 0.9 instead of 1
    • https://towardsdatascience.com/what-is-label-smoothing-108debd7ef06
    • If we do not use label smoothing, the label vector is the one-hot encoded vector [1, 0, 0]. The model is pushed to make the logit of the correct class, a, much larger than the other logits b and c (a ≫ b and a ≫ c). For example, applying softmax to the logit vector [10, 0, 0] gives [0.9999, 0, 0] rounded to 4 decimal places.
    • If we use label smoothing with α = 0.1, the smoothed label vector is ≈ [0.9333, 0.0333, 0.0333]. After softmax, the logit vector [3.3322, 0, 0] already matches the smoothed label vector to 4 decimal places, and the gap between the largest logit and the rest is much smaller. This is why label smoothing is called a regularization technique: it restrains the largest logit from becoming much bigger than the rest (see the numpy sketch after this list).
    • Label smoothing replaces the one-hot encoded label vector y_hot with a mixture of y_hot and the uniform distribution: y_ls = (1 − α) * y_hot + α / K
        • where K is the number of classes and α is the smoothing parameter
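A minimal numpy sketch of the numbers above (the smoothing formula and the two softmax checks); the function names and the values α = 0.1, K = 3 are just this example, not code from the linked article:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D array of logits
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def smooth_labels(y_hot, alpha):
    # y_ls = (1 - alpha) * y_hot + alpha / K, where K is the number of classes
    K = y_hot.shape[-1]
    return (1.0 - alpha) * y_hot + alpha / K

y_hot = np.array([1.0, 0.0, 0.0])

# Hard targets: softmax of [10, 0, 0] is already ~[0.9999, 0, 0],
# so the model keeps pushing the top logit far above the others.
print(softmax(np.array([10.0, 0.0, 0.0])).round(4))   # ~[0.9999 0.     0.    ]

# Smoothed targets with alpha = 0.1
y_ls = smooth_labels(y_hot, alpha=0.1)
print(y_ls.round(4))                                   # ~[0.9333 0.0333 0.0333]

# A much smaller logit gap already reproduces the smoothed target after softmax.
print(softmax(np.array([3.3322, 0.0, 0.0])).round(4))  # ~[0.9333 0.0333 0.0333]
```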

I’m not sure how label smoothing is related to KLDivergence: