"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

March 03, 2017

Day #56 - Deep Learning Class #3 Notes - Training Deep Networks

Part I
Parameters Overview
  • Multi Layer Perceptron - trained with a Mean Square Error loss plus a weight decay (regularization) term to prevent overfitting
  • The error function of a deep network is a non-convex loss function - minimized with Gradient Descent
  • Saddle points (minimum along a few dimensions, maximum along the others) can be a problem in deep networks; train deep networks so as to avoid getting stuck at saddle points
  • Vanishing Gradient Problem - when gradient values become small, updates become slow; because of the chain rule in backpropagation, the gradient that reaches the earlier layers becomes very small
  • AlexNet had only eight layers; at the time, the best networks used only 7-8 layers
  • Exploding gradients - clip (truncate) the gradient when it exceeds a threshold
  • Mini-batch SGD - run GD on batches (e.g. 20 / 100 points): average the gradient over the batch and update all layers in the network; this is SGD with a batch size
  • Iteration - Whenever a weight update is done
  • Epoch - whenever the training set is used once (one full pass)
  • Momentum - during GD, the idea is like a blind person navigating a mountain range
  • Momentum is useful for highly elliptical contours - it helps reach a local minimum (possibly a better one); it is not useful for a spherical contour plot
  • Spherical contours - the normal (gradient) direction takes you directly to the centre
  • Contour plot - a cross section of the mountain (error surface)
  • Nesterov Momentum - go one step further in the direction of the step we are about to take (interim update, then compute the gradient); works very well in practice - see the update-rule sketch after this list
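A minimal sketch of the momentum and Nesterov-momentum updates described in the bullets above, run on a toy elliptical quadratic loss. The loss, learning rate, and momentum coefficient are illustrative assumptions, not values from the class.
```python
import numpy as np

def grad(w):
    # Gradient of a toy quadratic loss f(w) = 0.5 * w^T A w with highly
    # elliptical contours (very different curvature along the two axes).
    A = np.diag([1.0, 50.0])
    return A @ w

def momentum_step(w, v, lr=0.01, mu=0.9):
    # Classical momentum: accumulate a velocity, then move along it.
    v = mu * v - lr * grad(w)
    return w + v, v

def nesterov_step(w, v, lr=0.01, mu=0.9):
    # Nesterov momentum: take an interim step in the velocity direction,
    # then compute the gradient at that look-ahead point.
    w_ahead = w + mu * v
    v = mu * v - lr * grad(w_ahead)
    return w + v, v

w, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    w, v = nesterov_step(w, v)
print(w)  # approaches the minimum at [0, 0]
```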
Part II
Choosing Activation Function
  • Different activation functions: Sigmoid, tanh, ReLU, Leaky ReLU, maxout (a small numerical sketch follows this list)
  • Sigmoid - squashes the output into the range (0, 1) and brings non-linearity into the network; its non-zero-centred output is addressed by tanh
  • tanh - belongs to the logistic family of functions, output in (-1, +1); used even today
  • ReLU - Rectified Linear Unit, the most popular: max(0, x), linear on the positive side and zero on the negative side, so negative outputs become zero; the default for images and videos
  • Leaky ReLU - y = x if x > 0, otherwise a small multiple of x; it lets a small amount of the negative signal pass through
  • Maxout - the neurons in a layer are grouped (e.g. groups of 10) and each group's output is the maximum over its members
  • Softmax - ensures each activation lies between 0 and 1 (and the outputs sum to 1)
  • Hierarchical Softmax - used in word2vec
  • ReLU mostly for images and videos; if there are too many dead units, try Leaky ReLU
  • In RNNs and LSTMs, sigmoid and tanh are still used
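A small numerical sketch of the activation functions listed above; the sample inputs and the Leaky ReLU slope are assumptions.
```python
import numpy as np

def sigmoid(x):
    # Squashes values into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Logistic-family function with output in (-1, +1).
    return np.tanh(x)

def relu(x):
    # max(0, x): linear on the positive side, zero on the negative side.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Lets a small fraction (alpha) of the negative signal pass through.
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    # Each output lies in (0, 1) and the outputs sum to 1
    # (inputs shifted by the max for numerical stability).
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(sigmoid(x), tanh(x), relu(x), leaky_relu(x), softmax(x), sep="\n")
```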
Part III
Choosing Loss Function
  • Loss function and cost function mean the same thing
  • MSE - Gradient is simple
  • Cross Entropy Loss function
  • Entropy - H(p) = -Σ p_i log p_i
  • Binary cross Entropy
  • Negative Log Likelihood
  • Softmax for the binary case is the same as Sigmoid
  • Start with NLL and minimise it for the given activation function (a small numeric sketch follows this list)
  • KL Divergence measures the distance between two distributions
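A small numeric sketch of entropy, negative log likelihood, and binary cross entropy; the example probabilities are made up for illustration.
```python
import numpy as np

def entropy(p):
    # H(p) = -sum_i p_i log p_i, for a distribution with non-zero entries.
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def nll(probs, target):
    # Negative log likelihood of the true class under the predicted distribution
    # (this is the cross-entropy loss for a one-hot target).
    return -np.log(probs[target])

def binary_cross_entropy(y, p, eps=1e-12):
    # y in {0, 1}, p = predicted probability of class 1.
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

probs = np.array([0.7, 0.2, 0.1])    # e.g. a softmax output over 3 classes
print(entropy(probs))                # entropy of the predicted distribution
print(nll(probs, target=0))          # loss when class 0 is the true class
print(binary_cross_entropy(1, 0.9))  # binary case
```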
Part IV
Choosing Learning Rate
  • For a convex function, GD will always take you to the minimum (every local minimum is global)
  • With the correct (optimal) learning rate, GD can reach the minimum in a single step (on a quadratic surface)
  • Hessian - the matrix of second derivatives (the optimal learning rate is related to the inverse of the Hessian)
  • The gradient is a vector, not a single value
  • The optimal learning rate is governed by the eigenvalues of the Hessian
  • An adaptive learning rate is the best approach to choosing the learning rate
  • Adagrad is one such method
  • It takes smaller steps along steep cliffs; on flat surfaces, where plain GD would take a very long time, it keeps making progress
  • RMSProp - Root Mean Square Propagation
  • Adam - the most popular method today (the default choice)
  • Adam combines momentum (an average of past gradients) with the current gradient and RMSProp-style scaling; see the update-rule sketch after this list
  • AdaDelta is another similar method
  • To choose an optimizer, see http://sebastianruder.com/optimizing-gradient-descent
  • SGD here means mini-batch SGD
  • Choice for training - SGD + Nesterov momentum, or SGD with Adagrad / RMSProp / Adam
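A minimal sketch of the Adagrad, RMSProp, and Adam update rules; the hyper-parameter values are the commonly quoted defaults, and the toy loss in the usage example is an assumption.
```python
import numpy as np

def adagrad(w, g, cache, lr=0.01, eps=1e-8):
    # Accumulate squared gradients; steep directions get ever smaller steps.
    cache = cache + g ** 2
    return w - lr * g / (np.sqrt(cache) + eps), cache

def rmsprop(w, g, cache, lr=0.001, decay=0.9, eps=1e-8):
    # Like Adagrad, but with a decaying root-mean-square average.
    cache = decay * cache + (1 - decay) * g ** 2
    return w - lr * g / (np.sqrt(cache) + eps), cache

def adam(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Momentum-style running average of gradients (m) plus RMSProp-style
    # scaling (v), with bias correction for the early steps.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

w = np.array([1.0, -2.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 101):
    g = 2 * w                          # gradient of the toy loss f(w) = sum(w^2)
    w, m, v = adam(w, g, m, v, t, lr=0.1)
print(w)                               # approaches the minimum at [0, 0]
```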
Part V
  • The math of backpropagation
  • Backpropagation computes the gradients that GD uses to update the weights (a worked sketch follows this list)
  • Issues with GD / training with GD
  • Choosing the learning rate / optimization methods
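A minimal worked sketch of backpropagation with gradient descent for a one-hidden-layer network (sigmoid hidden layer, linear output, MSE loss); the toy data, layer sizes, and learning rate are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4 samples, 3 input features, 1 target value each.
X = rng.normal(size=(4, 3))
y = rng.normal(size=(4, 1))

# One hidden layer of 5 units.
W1, b1 = rng.uniform(-0.5, 0.5, size=(3, 5)), np.zeros(5)
W2, b2 = rng.uniform(-0.5, 0.5, size=(5, 1)), np.zeros(1)
lr = 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(1000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)            # hidden activations
    y_hat = h @ W2 + b2                 # linear output
    loss = np.mean((y_hat - y) ** 2)    # MSE

    # Backward pass: chain rule, layer by layer.
    d_out = 2 * (y_hat - y) / len(X)    # dLoss / dy_hat
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * h * (1 - h)  # back through the sigmoid
    dW1 = X.T @ d_h
    db1 = d_h.sum(axis=0)

    # Gradient descent update.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(loss)  # the loss decreases as training proceeds
```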
Part VI - Regularization
  • Training a deep network is essentially gradient descent
  • The difference between ML and pure optimization is generalization
  • ML cares about the best performance tomorrow (the model must work well on unseen data, so generalization is important)
  • Regularization methods are incorporated to improve generalization performance
  • When training accuracy keeps increasing but test accuracy starts decreasing, that is the point to stop (see the sketch after this list)
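A sketch of the early-stopping rule implied above: keep training while held-out accuracy improves and stop once it starts dropping. The validation-accuracy curve here is simulated purely to illustrate the stopping point.
```python
import numpy as np

def early_stop(val_accuracies, patience=3):
    # Return the epoch with the best validation accuracy, stopping once the
    # accuracy has failed to improve for `patience` consecutive epochs.
    best_acc, best_epoch, bad = -np.inf, 0, 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc, best_epoch, bad = acc, epoch, 0
        else:
            bad += 1                   # test/validation accuracy is falling
            if bad >= patience:
                break                  # the point to stop training
    return best_epoch, best_acc

# Simulated curve: improves for a while, then overfitting sets in.
epochs = np.arange(30)
val_acc = 0.9 - 0.5 * np.exp(-epochs / 4) - 0.004 * epochs
print(early_stop(val_acc))            # stops shortly after the peak
```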
Part VII - When to Stop
  • Train for some epochs, lower the learning rate, and train again
  • Stop when the maximum weight change falls below a particular threshold
  • Put a weight decay term in the error function itself
  • L2 weight decay (add the sum of the squares of the weights)
  • L1 weight decay (sum of the absolute values of the weights); gives sparse solutions
  • Dropout - in each iteration, in each mini-batch, randomly drop a certain % of nodes in every layer; this gives excellent regularization performance (see the sketch after this list)
  • It acts like an ensemble of different models (similar to Random Forests)
  • DropConnect (an extension of Dropout)
  • Add noise to the data - Gaussian, salt-and-pepper noise
  • Batch Normalization layer (recommended) - implemented in all the major libraries
  • Shuffle your inputs
  • Choose the mini-batch size such that the network learns faster
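A minimal sketch of (inverted) dropout and an L2 weight-decay penalty; the drop probability and lambda value are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, drop_prob=0.5, training=True):
    # In each mini-batch, randomly drop a fraction of the nodes in a layer.
    # Inverted dropout scales the survivors so nothing changes at test time.
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

def l2_penalty(weights, lam=1e-4):
    # L2 weight decay term added to the error function: lam * sum of squares.
    return lam * sum(np.sum(w ** 2) for w in weights)

h = rng.normal(size=(4, 10))          # activations of one layer, batch of 4
print(dropout(h, drop_prob=0.5))      # roughly half the units are zeroed
W = [rng.normal(size=(10, 5)), rng.normal(size=(5, 1))]
print(l2_penalty(W))                  # extra term added to the loss
```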
Curriculum Learning
  • Provide the slides and let the students figure the course out
  • Lots of data + lots of compute is behind deep learning success (Google / FB)
  • Unsupervised learning is the approach taken by Facebook for data analysis
  • Generating data programmatically - NIPS machine learning conference
  • Data Augmentation - change the illumination in the data, reduce pixel intensity, and train the network with all kinds of data: mirrored images, noise, artificial images (see the sketch after this list)
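A small sketch of the augmentations mentioned above (mirroring, lowering illumination, Gaussian and salt-and-pepper noise) applied to a toy image; the noise levels and scaling factor are assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    # Produce several artificial variants of one training image.
    mirrored = image[:, ::-1]                            # horizontal mirror
    darker = np.clip(image * 0.7, 0.0, 1.0)              # reduced illumination
    gaussian = np.clip(image + rng.normal(0, 0.05, image.shape), 0.0, 1.0)
    salt_pepper = image.copy()
    mask = rng.random(image.shape)
    salt_pepper[mask < 0.02] = 0.0                       # pepper
    salt_pepper[mask > 0.98] = 1.0                       # salt
    return [mirrored, darker, gaussian, salt_pepper]

img = rng.random((8, 8))                                 # toy grayscale image in [0, 1]
print(len(augment(img)), "augmented copies per image")
```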
Target Values
  • For a binary classification problem, use +1 and -1 as the target values
Weight Initialization
  • GD works, but takes you to different local minima depending on where you start
  • The starting point is defined by how you initialize the network
  • Never initialize all the weights to zero
  • Recommended approach - Xavier initialization (see the sketch below)
  • For every layer in the network, draw the weights randomly from a uniform distribution whose range depends on the layer size
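A minimal sketch of Xavier (Glorot) uniform initialization applied to every layer of a small network; the layer sizes are an assumption.
```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(fan_in, fan_out):
    # Draw weights from a uniform distribution whose range depends on the
    # layer's fan-in and fan-out, so activations keep a similar scale across
    # layers. Never initialize all weights to zero.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Weights for every layer of a small 784 -> 256 -> 64 -> 10 network.
sizes = [784, 256, 64, 10]
weights = [xavier_uniform(n_in, n_out) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
print([w.shape for w in weights])
```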

Happy Learning!!!
