"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

February 01, 2023

Paper Read - Gradient Descent: The Ultimate Optimizer


  • When we train deep neural networks by gradient descent, we have to select a step size α for our optimizer
  • If α is too small, the optimizer runs very slowly, whereas if α is too large, the optimizer fails to converge
  • First, recall the standard weight update rule at step i for SGD, using some fixed step size α:
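    In the paper's notation, with $w_i$ the weights at step $i$ and $f$ the training loss, this is

    $$ w_{i+1} = w_i - \alpha \, \frac{\partial f(w_i)}{\partial w_i} $$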

  • The SGD update rule involves α only as a multiplicative constant, so its derivative with respect to α is trivial to compute by hand (see the sketch below, after the list).
  • The Adam optimizer has a much more sophisticated update rule involving four hyperparameters α, β1, β2, ε (though ε is typically not tuned), so deriving its hypergradient by hand is far more laborious, which is why the paper computes it with automatic differentiation instead.

  • AdaGrad's effective step size decays aggressively over time; in the paper's experiments, the AdaGrad/AdaGrad hyperoptimizer increases α to make up for this effect.
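
To make the SGD case concrete, here is a minimal sketch of the hand-derived hypergradient update for α on a toy quadratic loss. The loss, the hyper-step-size kappa, and the initial values are illustrative choices of mine, not taken from the paper; the paper's actual contribution is computing the same quantity automatically via backpropagation.

import numpy as np

# Toy loss f(w) = 0.5 * ||w||^2, whose gradient is simply w.
def loss_grad(w):
    return w

w = np.array([5.0, -3.0])   # weights
alpha = 0.01                # step size, itself adjusted by gradient descent
kappa = 1e-3                # step size for alpha (illustrative choice)

prev_grad = np.zeros_like(w)
for step in range(100):
    g = loss_grad(w)
    # Since w_i = w_{i-1} - alpha * grad f(w_{i-1}), the chain rule gives
    #   d f(w_i) / d alpha = -grad f(w_i) . grad f(w_{i-1}),
    # so gradient descent on alpha *adds* kappa * (g . prev_grad).
    alpha = alpha + kappa * np.dot(g, prev_grad)
    w = w - alpha * g
    prev_grad = g

print("final w:", w, "adapted alpha:", alpha)

As long as successive gradients agree in direction, the dot product is positive and α grows from its too-small initial value, which is the same kind of compensating behavior the AdaGrad bullet above describes.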

Keep Exploring!!!
