Parameters Overview
- Multi-Layer Perceptron (Mean Square Error loss plus a weight decay term - a regularization term that prevents overfitting)
- The error function is a non-convex loss function; it is minimized with Gradient Descent
- Saddle points (a minimum along a few dimensions, a maximum along the others) can be a problem in deep networks; training must be able to escape them
- Vanishing Gradient Problem - because of the chain rule in backpropagation, gradient values become small as they pass backwards, so the updates that reach the earlier layers are very small and learning slows down
- AlexNet had only eight layers; at the time, the best networks used only 7-8 layers
- Exploding gradients - clip the gradient when its value exceeds a threshold
- Mini-batch SGD - run GD on batches (e.g. 20 or 100 points): average the gradients over the batch and update all layers in the network; this is SGD with a batch size greater than one
- Iteration - whenever a weight update is done
- Epoch - whenever the full training set has been used once
- Momentum - during GD, the intuition is that of a blind person navigating a mountain range, accumulating a direction from past steps
- Momentum is useful for finding the local minimum (or some other, better minimum) faster when the contours are highly elliptical; it is not needed for a spherical contour plot
- Spherical contours - the normal (gradient) takes you directly to the centre
- Contour plot - a cross-section of the error "mountain"
- Nesterov momentum - look one step further in the direction of the step we are about to take (make an interim update, then compute the gradient there); works very well in practice
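A minimal NumPy sketch of mini-batch SGD with classical and Nesterov momentum, as described above. The linear-model loss, function names, and default values are illustrative assumptions, not something prescribed in these notes:

```python
import numpy as np

def grad_loss(w, X, y):
    """Gradient of a mean-square-error loss for a toy linear model."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def sgd_momentum(w, X, y, lr=0.01, mu=0.9, nesterov=True, epochs=10, batch_size=20):
    v = np.zeros_like(w)                       # velocity (accumulated gradient direction)
    n = len(y)
    for _ in range(epochs):                    # one epoch = one pass over the training set
        idx = np.random.permutation(n)         # shuffle inputs each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            if nesterov:
                # gradient at the look-ahead (interim) point w + mu * v
                g = grad_loss(w + mu * v, X[batch], y[batch])
            else:
                g = grad_loss(w, X[batch], y[batch])
            v = mu * v - lr * g                # update the velocity
            w = w + v                          # one iteration = one weight update
    return w
```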
Choosing Activation Function
- Different activation functions: sigmoid, tanh, ReLU, Leaky ReLU, maxout
- Sigmoid (output between 0 and 1) - squashes its input into this range and brings non-linearity into the network; its outputs are never zero-centred, which tanh addresses
- tanh - belongs to the logistic family of functions, output between -1 and +1; still used today
- ReLU (Rectified Linear Unit) - the most popular: max(0, x), linear on the positive side and zero on the negative side (any negative input is set to zero); the default for images and videos
- Leaky ReLU - y = x if x > 0, otherwise a small multiple of x; lets a small amount of the negative input pass through
- Maxout - the neurons in a layer are split into groups (e.g. groups of 10) and each group outputs the maximum of its members
- Softmax - ensures each output lies between 0 and 1 and that the outputs sum to 1
- Hierarchical softmax - a faster approximation, used for example in word2vec
- ReLU mostly for images and videos; if too many units are dead, try Leaky ReLU
- In RNNs and LSTMs, sigmoid and tanh are still used
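A small NumPy sketch of the activation functions listed above; the function names and the max-subtraction trick in softmax are the usual conventions, shown here only for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))            # output in (0, 1)

def tanh(x):
    return np.tanh(x)                          # output in (-1, 1), zero-centred

def relu(x):
    return np.maximum(0.0, x)                  # linear for x > 0, zero otherwise

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)       # small slope on the negative side

def softmax(x):
    z = x - np.max(x)                          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()                         # outputs in (0, 1) and sum to 1
```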
Choosing Loss Function
- Loss function and cost function mean the same thing
- MSE - Gradient is simple
- Cross Entropy Loss function
- Entropy - H(p) = -Σ p_i log p_i
- Binary cross-entropy
- Negative Log Likelihood (NLL)
- Softmax for the binary case is the same as sigmoid
- Start with NLL: minimise the negative log likelihood given the particular output activation function
- KL divergence - measures the distance between two distributions (it is not symmetric)
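A small NumPy sketch of cross entropy, negative log likelihood, and KL divergence as defined above; the epsilon clipping is an illustrative numerical-stability assumption:

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Cross entropy H(p, q) = -sum_i p_i log q_i; p is the target distribution."""
    q = np.clip(q, eps, 1.0)
    return -np.sum(p * np.log(q))

def nll(probs, target_index, eps=1e-12):
    """Negative log likelihood of the true class under the predicted probabilities."""
    return -np.log(max(probs[target_index], eps))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_i p_i log(p_i / q_i); zero only when p == q, not symmetric."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q))
```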
Choosing Learning Rate
- For a convex function, GD will always take you to the minimum (any local minimum is the global one)
- For a quadratic error surface, GD can reach the minimum in one step if the learning rate is chosen correctly
- Hessian - the matrix of second derivatives (the optimal step uses the inverse of the Hessian)
- The gradient is a vector, not a single value
- Optimal learning rate - governed by the eigenvalues of the Hessian (on the order of the inverse of the largest eigenvalue)
- An adaptive method is the best approach to choosing the learning rate
- Adagrad is one such method
- It goes slower along steep directions and takes relatively larger steps across flat surfaces, where the approach would otherwise be very long
- RMSProp - Root Mean Square Propagation
- Adam - the most popular method today (a good default)
- Adam combines momentum (a running average of gradients) with RMSProp-style scaling by a running average of squared gradients
- AdaDelta is another similar method
- To choose between them, see http://sebastianruder.com/optimizing-gradient-descent
- SGD here means mini-batch SGD
- Choice for training - SGD + Nesterov momentum, or SGD with Adagrad / RMSProp / Adam (see the Adam sketch after this list)
- Math of Backpropagation
- Backpropagation computes the gradients that GD uses
- Issues with GD / training with GD
- These are addressed through the learning rate and the optimization method
- Training a deep network is essentially running GD
- The difference between ML and pure optimization is generalization
- ML wants the best performance tomorrow - the model must work well on unseen data, so generalization is important
- Regularization methods are incorporated for generalization performance
- When training accuracy keeps increasing but test accuracy starts decreasing, that is the point to stop (early stopping)
- Train for some epochs, lower the learning rate, and train again
- Stop when the maximum weight change falls below a particular threshold
- Weight decay - a penalty term added to the error function itself
- L2 weight decay (add the sum of squared weights)
- L1 weight decay (add the absolute values of the weights); gives sparse solutions
- Dropout - in each iteration, for each mini-batch, randomly drop a certain percentage of nodes in every layer; this gives excellent regularization performance (see the dropout sketch after this list)
- Effectively an ensemble of different models (similar in spirit to Random Forests)
- DropConnect (an extension of dropout that drops individual connections rather than nodes)
- Add noise to the data - Gaussian noise, salt-and-pepper noise
- Batch Normalization layer (recommended) - implemented in all the major libraries
- Shuffle your inputs
- Choose the mini-batch size such that the network learns faster
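A minimal NumPy sketch of one Adam update, combining momentum with RMSProp-style scaling as noted above; the default decay rates and epsilon are the commonly used values, and the state dictionary is an illustrative assumption:

```python
import numpy as np

def adam_step(w, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (first moment) plus RMSProp-style scaling (second moment)."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad        # running average of gradients
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2   # running average of squared gradients
    m_hat = state["m"] / (1 - beta1 ** state["t"])              # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), state

# usage with an illustrative weight vector:
# w = np.zeros(10)
# state = {"t": 0, "m": np.zeros_like(w), "v": np.zeros_like(w)}
# w, state = adam_step(w, grad, state)
```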
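A minimal NumPy sketch of inverted dropout and an L2 weight-decay penalty, matching the regularization bullets above; the drop probability and the lambda value are illustrative assumptions:

```python
import numpy as np

def dropout(activations, drop_prob=0.5, training=True):
    """Inverted dropout: randomly zero a fraction of units during training and rescale the rest."""
    if not training or drop_prob == 0.0:
        return activations                     # keep all units at test time
    mask = np.random.rand(*activations.shape) >= drop_prob
    return activations * mask / (1.0 - drop_prob)

def l2_penalty(weights, lam=1e-4):
    """L2 weight decay: lambda times the sum of squared weights, added to the error function."""
    return lam * np.sum(weights ** 2)
```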
Curriculum Learning
- Present training examples the way a course is organised: easier examples first, then progressively harder ones
- Lots of data + lots of computing behind deep learning's success (Google / Facebook)
- Unsupervised learning is the approach taken by Facebook for data analysis
- Generating data programmatically - a topic at the NIPS machine learning conference
- Data augmentation - change the illumination in the data, reduce the intensity of pixels, and train the network with all kinds of data: mirrored images, noise, artificial images
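A small NumPy sketch of the kinds of augmentation mentioned above (mirroring, illumination changes, added noise); the probabilities and intensity ranges are illustrative assumptions:

```python
import numpy as np

def augment(image, rng=np.random.default_rng()):
    """Return a randomly augmented copy of an HxWxC image with values in [0, 1]."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                                # horizontal mirror
    out = np.clip(out * rng.uniform(0.7, 1.3), 0.0, 1.0)     # change illumination / pixel intensity
    if rng.random() < 0.5:
        out = np.clip(out + rng.normal(0.0, 0.02, out.shape), 0.0, 1.0)  # add Gaussian noise
    return out
```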
Target Values
- For a binary classification problem, use target values of +1 and -1 (matched to the range of the output activation)
Weight Initialization
- GD works but can take you to different local minima
- The starting point is defined by how you initialize the network
- Never initialize all weights to zero (every unit would then learn the same thing)
- Recommended approach - Xavier initialization
- For every layer in the network, draw the weights randomly from a uniform distribution whose range is scaled by the layer's fan-in and fan-out
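A minimal sketch of Xavier (Glorot) uniform initialization; the layer sizes in the usage line are hypothetical:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=np.random.default_rng()):
    """Xavier/Glorot uniform: U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# usage with hypothetical layer sizes: W1 = xavier_uniform(784, 256)
```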
Happy Learning!!!