why do people start off with a low learning rate (warmup) when training has just begun?
shouldn't larger learning rates be better? since your model is so far from the optimum, you'd want to take bigger steps to get there.
the answer is (hypothesized) to be that "at the beginning of training, your later layers' weights have very high variance, and this makes training really hard" — so you keep the steps small until that variance settles down, then ramp the learning rate up.
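to make that concrete, here's a minimal sketch of linear warmup — the function name, `warmup_steps`, and the numbers are just illustrative, not from any particular paper or library:

```python
def warmup_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Linearly ramp the learning rate from ~0 to base_lr over the
    first warmup_steps updates, so early high-variance updates take
    small steps; after that, hand off to your usual schedule."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


# usage (illustrative values): at step 100 of a 1000-step warmup,
# the lr is ~10% of its target
lr = warmup_lr(step=100, base_lr=3e-4, warmup_steps=1000)
```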