Gradient Descent: The Ultimate Optimizer
- When we train deep neural networks by gradient descent, we have to select a step size α for our optimizer
- If α is too small, the optimizer converges very slowly, whereas if α is too large, it may fail to converge at all
- First, recall the standard weight update rule at step i for SGD, using some fixed step size α:
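  w_{i+1} = w_i − α · ∂L/∂w_i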
- The SGD update rule is just a multiplication of the gradient by the constant α, so its derivative with respect to α is trivial: ∂w_{i+1}/∂α = −∂L/∂w_i. This means α itself can be updated by gradient descent (see the sketch below).
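A minimal sketch of this idea (not the paper's implementation): on a toy quadratic loss, the chain rule gives ∂L(w_{i+1})/∂α = −∇L(w_{i+1})·∇L(w_i), so we can descend on α at the same time as on w. The quadratic matrix A, the "hyper" step size kappa, and the step count are illustrative assumptions.

```python
# Hypergradient descent on the SGD step size, on a toy quadratic loss
# L(w) = 0.5 * w^T A w. A, kappa, and n_steps are illustrative choices.
import numpy as np

def loss_grad(w, A):
    """Gradient of L(w) = 0.5 * w^T A w."""
    return A @ w

rng = np.random.default_rng(0)
A = np.diag([1.0, 10.0])   # ill-conditioned quadratic
w = rng.normal(size=2)
alpha = 0.01               # step size, now itself learned
kappa = 1e-4               # "hyper" step size for updating alpha

prev_grad = loss_grad(w, A)
for step in range(100):
    # SGD update: w_{i+1} = w_i - alpha * dL/dw_i
    w = w - alpha * prev_grad
    grad = loss_grad(w, A)
    # Since dw_{i+1}/dalpha = -dL/dw_i, the chain rule gives
    # dL(w_{i+1})/dalpha = -grad . prev_grad; descend on alpha:
    alpha = alpha + kappa * (grad @ prev_grad)
    prev_grad = grad

print("final loss:", 0.5 * w @ A @ w, "learned alpha:", alpha)
```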
- The Adam optimizer has a much more sophisticated update rule involving four hyperparameters: α, β1, β2, and ε (though ε is typically not tuned). Its update at step i is reproduced below for reference.
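For reference, the standard Adam update at step i, with gradient g_i = ∂L/∂w_i and moment estimates m_i, v_i:

  m_i = β1 · m_{i−1} + (1 − β1) · g_i
  v_i = β2 · v_{i−1} + (1 − β2) · g_i²
  m̂_i = m_i / (1 − β1^i),   v̂_i = v_i / (1 − β2^i)
  w_{i+1} = w_i − α · m̂_i / (√v̂_i + ε)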