Why use a softmax activation function? Why can’t we just normalize by the sum of the absolute value of each logit value?
e.g. $\text{output probability}_i = \frac{\text{logit}_i}{\sum_j \lvert \text{logit}_j \rvert}$
The problem is that our loss functions are applied to the outputs of the network,
and these loss functions typically involve a logarithm (e.g. cross-entropy loss applies a log on the outermost layer),
so by normalizing with softmax, which exponentiates the logits, the log in the loss cancels the exponential in the softmax. As a result, the gradient with respect to each logit scales linearly with the output's error (the predicted probability minus the target), even when that error is large.
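Concretely, for a one-hot target $y$ the cancellation gives $\frac{\partial}{\partial z_i}\left[-\log \operatorname{softmax}(z)_y\right] = \operatorname{softmax}(z)_i - y_i$. Below is a minimal NumPy sketch (the `softmax` and `cross_entropy` helpers are just illustrative, not from any particular library) that checks this analytic gradient against a finite-difference estimate:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(probs, target_idx):
    # Negative log-likelihood of the true class
    return -np.log(probs[target_idx])

# Example logits and a one-hot target (values chosen arbitrarily for illustration)
z = np.array([2.0, -1.0, 0.5])
y = np.array([1.0, 0.0, 0.0])
target_idx = 0

p = softmax(z)

# Analytic gradient of cross-entropy w.r.t. the logits: p - y
analytic_grad = p - y

# Numerical gradient via central finite differences, as a sanity check
eps = 1e-6
numeric_grad = np.zeros_like(z)
for i in range(len(z)):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[i] += eps
    z_minus[i] -= eps
    numeric_grad[i] = (cross_entropy(softmax(z_plus), target_idx)
                       - cross_entropy(softmax(z_minus), target_idx)) / (2 * eps)

print(analytic_grad)   # matches numeric_grad: the gradient is exactly p - y
print(numeric_grad)
```

Since the analytic gradient works out to exactly $p - y$, the gradient magnitude grows in proportion to how wrong the prediction is, which is the linear scaling described above.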