https://blmoistawinde.github.io/ml_equations_latex/#cross-entropy
Summary
- aka log loss / binary cross entropy loss
- It measures the performance of a classification model whose output is a probability value between 0 and 1.
- why is it called “cross-entropy”?
- since the entropy of a random variable X is: $H(X) = -\sum_{x} p(x)\log p(x)$
- cross-entropy "crosses" two distributions: it keeps the true distribution $p$ inside the sum but takes the log of the predicted distribution $q$: $H(p, q) = -\sum_{x} p(x)\log q(x)$ (a small numeric sketch follows this list)
- the summation is over each class (i.e. one term per class in the distribution)
- for 2 classes, we get the BCE loss below
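To make the definition concrete, here is a minimal NumPy sketch; the function name `cross_entropy` and the example distributions are made up for illustration.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy H(p, q) = -sum_x p(x) * log(q(x)).

    p: true distribution (e.g. a one-hot label), q: predicted probabilities.
    eps guards against log(0).
    """
    q = np.clip(q, eps, 1.0)
    return -np.sum(p * np.log(q))

p = np.array([0.0, 1.0, 0.0])   # one-hot true label for a 3-class problem
q = np.array([0.2, 0.7, 0.1])   # model's predicted class probabilities
print(cross_entropy(p, q))      # -log(0.7) ≈ 0.357
print(cross_entropy(p, p))      # 0.0 -- a perfect prediction costs nothing
```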
In binary classification (one target), BCELoss:
- $\mathrm{BCE} = -\big[\, y\log(\hat{y}) + (1 - y)\log(1 - \hat{y}) \,\big]$ (a NumPy check of this formula follows the derivation below)
- derivation:
- the likelihood for a binomial distribution is:
- $L(\theta) = \binom{n}{k}\,\theta^{k}(1-\theta)^{n-k}$
- for $n$ samples, $k$ of which are in the positive class, and $\theta$, the probability of the positive class
- since the binomial coefficient $\binom{n}{k}$ is a constant (it does not depend on $\theta$), we drop it when optimizing the likelihood function of the binomial distribution
- $L(\theta) \propto \theta^{k}(1-\theta)^{n-k}$
- we take the log of the likelihood:
- $\log L(\theta) = k\log\theta + (n-k)\log(1-\theta)$
- change the variables:
- let $\hat{y} = \theta$
- this makes sense since $\hat{y}$ is our model's estimate of the parameter $\theta$
- divide everything by $n$: $\frac{1}{n}\log L = \frac{k}{n}\log\hat{y} + \frac{n-k}{n}\log(1-\hat{y})$
- replace $\frac{k}{n}$ with $y$
- if you think about it, $y$ is the fraction of truly positive examples in the dataset (the empirical probability of the positive class), so this replacement makes sense
- let $\ell(\hat{y})$ denote the resulting normalized log-likelihood:
- $\ell(\hat{y}) = y\log(\hat{y}) + (1-y)\log(1-\hat{y})$
- Define the loss as the negative of this log-likelihood: $\mathcal{L}(y, \hat{y}) = -\ell(\hat{y}) = -\big[\, y\log(\hat{y}) + (1-y)\log(1-\hat{y}) \,\big]$, which is exactly the BCE loss above
- This is because minimizing the negative log-likelihood is the same as performing Maximum Likelihood Estimation (MLE)
- because a larger likelihood means the assumed distribution fits the observed data better!
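To make this concrete, here is a minimal NumPy sketch of the mean BCE over a batch; the function name `binary_cross_entropy` and the sample arrays are made up for illustration.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Mean BCE: -1/n * sum( y*log(y_hat) + (1-y)*log(1-y_hat) ).

    y:     true labels in {0, 1}
    y_hat: predicted probabilities in (0, 1)
    eps clips predictions away from 0 and 1 so log() never blows up.
    """
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y     = np.array([1, 0, 1, 1])          # ground-truth labels
y_hat = np.array([0.9, 0.2, 0.7, 0.4])  # model's predicted probabilities
print(binary_cross_entropy(y, y_hat))   # ≈ 0.40
```

PyTorch's `torch.nn.BCELoss` (with its default mean reduction) computes the same quantity from probabilities; `BCEWithLogitsLoss` takes raw logits instead.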
categorical cross entropy
In multiclass classification (multiple targets: M > 2):
- $-\sum_{c=1}^{M} y_{o,c}\,\log(p_{o,c})$ (a NumPy sketch follows the symbol definitions below)
- M: number of classes
- log: the natural log
- $y_{o,c}$: binary indicator (0 or 1) of whether class label $c$ is the correct classification for observation $o$
- $p_{o,c}$: the predicted probability that observation $o$ is of class $c$
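A minimal NumPy sketch of the formula above, averaged over observations; the function name `categorical_cross_entropy`, the one-hot labels, and the probability rows are made-up example values.

```python
import numpy as np

def categorical_cross_entropy(y, p, eps=1e-12):
    """Mean over observations o of -sum_c y[o, c] * log(p[o, c]).

    y: one-hot labels, shape (n_obs, M)
    p: predicted class probabilities, shape (n_obs, M), rows summing to 1
    """
    p = np.clip(p, eps, 1.0)
    return -np.mean(np.sum(y * np.log(p), axis=1))

# 2 observations, M = 3 classes
y = np.array([[0, 1, 0],
              [1, 0, 0]])
p = np.array([[0.1, 0.8, 0.1],
              [0.3, 0.5, 0.2]])
print(categorical_cross_entropy(y, p))   # ≈ 0.71
```

For reference, PyTorch's `torch.nn.CrossEntropyLoss` takes integer class indices and raw logits, folding the softmax and this sum into a single call.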
pros
- It penalizes the model more strongly for incorrect predictions that are confident, which can lead to better-calibrated models that are less overconfident in their predictions.
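A quick numeric illustration of the point above; the predicted probabilities are arbitrary example values.

```python
import numpy as np

# true label y = 1, so BCE reduces to -log(y_hat)
for y_hat in (0.4, 0.1, 0.99):
    print(f"y_hat={y_hat:>4}: loss={-np.log(y_hat):.3f}")
# y_hat= 0.4: loss=0.916   (mildly wrong)
# y_hat= 0.1: loss=2.303   (confidently wrong -> penalized much harder)
# y_hat=0.99: loss=0.010   (confidently right -> near-zero loss)
```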
cons
- Sensitive to class imbalance: the loss will be dominated by the majority class, biasing the model toward it
- Sensitive to outliers and label noise, since the log term assigns a very large loss to confidently "wrong" examples