https://blmoistawinde.github.io/ml_equations_latex/#cross-entropy
Summary
- aka log loss / binary cross entropy loss
- It measures the performance of a classification model whose output is a probability value between 0 and 1.
- why is it called “cross-entropy”?
- since the entropy of a random variable X is: $H(X) = -\sum_{x} p(x)\log p(x)$
- cross-entropy "crosses" two distributions: it keeps the true distribution $p$ inside the sum but takes the log of the predicted distribution $q$: $H(p, q) = -\sum_{x} p(x)\log q(x)$ (a small numeric sketch follows this list)
- the summation is over each class (i.e. one term per class in the distribution)
- for 2 classes, we get the BCE loss below
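To make the definition concrete, here is a minimal NumPy sketch; the function name `cross_entropy` and the example distributions are made up for illustration.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy H(p, q) = -sum_x p(x) * log(q(x)).

    p: true distribution (e.g. a one-hot label), q: predicted probabilities.
    eps guards against log(0).
    """
    q = np.clip(q, eps, 1.0)
    return -np.sum(p * np.log(q))

p = np.array([0.0, 1.0, 0.0])   # one-hot true label for a 3-class problem
q = np.array([0.2, 0.7, 0.1])   # model's predicted class probabilities
print(cross_entropy(p, q))      # -log(0.7) ≈ 0.357
print(cross_entropy(p, p))      # 0.0 -- a perfect prediction costs nothing
```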
In binary classification (one target), BCELoss:
- $\mathrm{BCE} = -\big[\, y\log(\hat{y}) + (1 - y)\log(1 - \hat{y}) \,\big]$ (a NumPy check of this formula follows the derivation below)
- derivation:
- the likelihood for a binomial distribution is:
- $L(\theta) = \binom{n}{k}\,\theta^{k}(1-\theta)^{n-k}$
- for $n$ samples, $k$ of which are in the positive class, and $\theta$, the probability of the positive class
- since the binomial coefficient $\binom{n}{k}$ is a constant (it does not depend on $\theta$), we drop it when optimizing the likelihood function of the binomial distribution
- $L(\theta) \propto \theta^{k}(1-\theta)^{n-k}$
- we take the log of the likelihood:
- $\log L(\theta) = k\log\theta + (n-k)\log(1-\theta)$
- change the variables:
- let $\hat{y} = \theta$
- this makes sense since $\hat{y}$ is our model's estimate of the parameter $\theta$
- divide everything by $n$: $\frac{1}{n}\log L = \frac{k}{n}\log\hat{y} + \frac{n-k}{n}\log(1-\hat{y})$
- replace $\frac{k}{n}$ with $y$
- if you think about it, $y$ is the fraction of truly positive examples in the dataset (the empirical probability of the positive class), so this replacement makes sense
- let $\ell(\hat{y})$ denote the resulting normalized log-likelihood:
- $\ell(\hat{y}) = y\log(\hat{y}) + (1-y)\log(1-\hat{y})$
- Define the loss as the negative of this log-likelihood: $\mathcal{L}(y, \hat{y}) = -\ell(\hat{y}) = -\big[\, y\log(\hat{y}) + (1-y)\log(1-\hat{y}) \,\big]$, which is exactly the BCE loss above
- This is because minimizing the negative log-likelihood is the same as performing Maximum Likelihood Estimation (MLE)
- because a larger likelihood means the assumed distribution fits the observed data better!
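To make this concrete, here is a minimal NumPy sketch of the mean BCE over a batch; the function name `binary_cross_entropy` and the sample arrays are made up for illustration.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Mean BCE: -1/n * sum( y*log(y_hat) + (1-y)*log(1-y_hat) ).

    y:     true labels in {0, 1}
    y_hat: predicted probabilities in (0, 1)
    eps clips predictions away from 0 and 1 so log() never blows up.
    """
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y     = np.array([1, 0, 1, 1])          # ground-truth labels
y_hat = np.array([0.9, 0.2, 0.7, 0.4])  # model's predicted probabilities
print(binary_cross_entropy(y, y_hat))   # ≈ 0.40
```

PyTorch's `torch.nn.BCELoss` (with its default mean reduction) computes the same quantity from probabilities; `BCEWithLogitsLoss` takes raw logits instead.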
categorical cross entropy
In multiclass classification (multiple targets: M > 2):
- $-\sum_{c=1}^{M} y_{o,c}\,\log(p_{o,c})$ (a NumPy sketch follows the symbol definitions below)
- M: number of classes
- log: the natural log
- $y_{o,c}$: binary indicator (0 or 1) of whether class label $c$ is the correct classification for observation $o$
- $p_{o,c}$: the predicted probability that observation $o$ is of class $c$
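A minimal NumPy sketch of the formula above, averaged over observations; the function name `categorical_cross_entropy`, the one-hot labels, and the probability rows are made-up example values.

```python
import numpy as np

def categorical_cross_entropy(y, p, eps=1e-12):
    """Mean over observations o of -sum_c y[o, c] * log(p[o, c]).

    y: one-hot labels, shape (n_obs, M)
    p: predicted class probabilities, shape (n_obs, M), rows summing to 1
    """
    p = np.clip(p, eps, 1.0)
    return -np.mean(np.sum(y * np.log(p), axis=1))

# 2 observations, M = 3 classes
y = np.array([[0, 1, 0],
              [1, 0, 0]])
p = np.array([[0.1, 0.8, 0.1],
              [0.3, 0.5, 0.2]])
print(categorical_cross_entropy(y, p))   # ≈ 0.71
```

For reference, PyTorch's `torch.nn.CrossEntropyLoss` takes integer class indices and raw logits, folding the softmax and this sum into a single call.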
pros
- It penalizes the model more strongly for incorrect predictions that are confident, which can lead to better-calibrated models that are less overconfident in their predictions.
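A quick numeric illustration of the point above; the predicted probabilities are arbitrary example values.

```python
import numpy as np

# true label y = 1, so BCE reduces to -log(y_hat)
for y_hat in (0.4, 0.1, 0.99):
    print(f"y_hat={y_hat:>4}: loss={-np.log(y_hat):.3f}")
# y_hat= 0.4: loss=0.916   (mildly wrong)
# y_hat= 0.1: loss=2.303   (confidently wrong -> penalized much harder)
# y_hat=0.99: loss=0.010   (confidently right -> near-zero loss)
```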
cons
- Sensitive to class imbalance: the loss will be dominated by the majority class, biasing the model toward it
- Sensitive to outliers and label noise, since the log term assigns a very large loss to confidently "wrong" examples