Parameters Overview
- Multi-Layer Perceptron (Mean Square Error loss plus a weight decay term - a regularization term that prevents overfitting)
- The error function is a non-convex loss function; it is minimized with Gradient Descent
- Saddle points (a minimum along a few dimensions, a maximum along the others) can be a problem in deep networks; training must be able to escape them
- Vanishing Gradient Problem - because of the chain rule in backpropagation, gradient values become small as they pass backwards, so the updates that reach the earlier layers are very small and learning slows down
- AlexNet had only eight layers; at the time, the best networks used only 7-8 layers
- Exploding gradients - clip the gradient when its value exceeds a threshold
- Mini-batch SGD - run GD on batches (e.g. 20 or 100 points): average the gradients over the batch and update all layers in the network; this is SGD with a batch size greater than one
- Iteration - whenever a weight update is done
- Epoch - whenever the full training set has been used once
- Momentum - during GD, the intuition is that of a blind person navigating a mountain range, accumulating a direction from past steps
- Momentum is useful for finding the local minimum (or some other, better minimum) faster when the contours are highly elliptical; it is not needed for a spherical contour plot
- Spherical contours - the normal (gradient) takes you directly to the centre
- Contour plot - a cross-section of the error "mountain"
- Nesterov momentum - look one step further in the direction of the step we are about to take (make an interim update, then compute the gradient there); works very well in practice
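A minimal NumPy sketch of mini-batch SGD with classical and Nesterov momentum, as described above. The linear-model loss, function names, and default values are illustrative assumptions, not something prescribed in these notes:

```python
import numpy as np

def grad_loss(w, X, y):
    """Gradient of a mean-square-error loss for a toy linear model."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def sgd_momentum(w, X, y, lr=0.01, mu=0.9, nesterov=True, epochs=10, batch_size=20):
    v = np.zeros_like(w)                       # velocity (accumulated gradient direction)
    n = len(y)
    for _ in range(epochs):                    # one epoch = one pass over the training set
        idx = np.random.permutation(n)         # shuffle inputs each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            if nesterov:
                # gradient at the look-ahead (interim) point w + mu * v
                g = grad_loss(w + mu * v, X[batch], y[batch])
            else:
                g = grad_loss(w, X[batch], y[batch])
            v = mu * v - lr * g                # update the velocity
            w = w + v                          # one iteration = one weight update
    return w
```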
Choosing Activation Function
- Different activation functions: sigmoid, tanh, ReLU, Leaky ReLU, maxout
- Sigmoid (output between 0 and 1) - squashes its input into this range and brings non-linearity into the network; its outputs are never zero-centred, which tanh addresses
- tanh - belongs to the logistic family of functions, output between -1 and +1; still used today
- ReLU (Rectified Linear Unit) - the most popular: max(0, x), linear on the positive side and zero on the negative side (any negative input is set to zero); the default for images and videos
- Leaky ReLU - y = x if x > 0, otherwise a small multiple of x; lets a small amount of the negative input pass through
- Maxout - the neurons in a layer are split into groups (e.g. groups of 10) and each group outputs the maximum of its members
- Softmax - ensures each output lies between 0 and 1 and that the outputs sum to 1
- Hierarchical softmax - a faster approximation, used for example in word2vec
- ReLU mostly for images and videos; if too many units are dead, try Leaky ReLU
- In RNNs and LSTMs, sigmoid and tanh are still used
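A small NumPy sketch of the activation functions listed above; the function names and the max-subtraction trick in softmax are the usual conventions, shown here only for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))            # output in (0, 1)

def tanh(x):
    return np.tanh(x)                          # output in (-1, 1), zero-centred

def relu(x):
    return np.maximum(0.0, x)                  # linear for x > 0, zero otherwise

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)       # small slope on the negative side

def softmax(x):
    z = x - np.max(x)                          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()                         # outputs in (0, 1) and sum to 1
```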
Choosing Loss Function
- Loss function and cost function mean the same thing
- MSE - Gradient is simple
- Cross Entropy Loss function
- Entropy - H(p) = -Σ p_i log p_i
- Binary cross-entropy
- Negative Log Likelihood (NLL)
- Softmax for the binary case is the same as sigmoid
- Start with NLL: minimise the negative log likelihood given the particular output activation function
- KL divergence - measures the distance between two distributions (it is not symmetric)
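A small NumPy sketch of cross entropy, negative log likelihood, and KL divergence as defined above; the epsilon clipping is an illustrative numerical-stability assumption:

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Cross entropy H(p, q) = -sum_i p_i log q_i; p is the target distribution."""
    q = np.clip(q, eps, 1.0)
    return -np.sum(p * np.log(q))

def nll(probs, target_index, eps=1e-12):
    """Negative log likelihood of the true class under the predicted probabilities."""
    return -np.log(max(probs[target_index], eps))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_i p_i log(p_i / q_i); zero only when p == q, not symmetric."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q))
```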
Choosing Learning Rate
- For a convex function, GD will always take you to the minimum (any local minimum is the global one)
- For a quadratic error surface, GD can reach the minimum in one step if the learning rate is chosen correctly
- Hessian - the matrix of second derivatives (the optimal step uses the inverse of the Hessian)
- The gradient is a vector, not a single value
- Optimal learning rate - governed by the eigenvalues of the Hessian (on the order of the inverse of the largest eigenvalue)
- An adaptive method is the best approach to choosing the learning rate
- Adagrad is one such method
- It goes slower along steep directions and takes relatively larger steps across flat surfaces, where the approach would otherwise be very long
- RMSProp - Root Mean Square Propagation
- Adam - the most popular method today (a good default)
- Adam combines momentum (a running average of gradients) with RMSProp-style scaling by a running average of squared gradients
- AdaDelta is another similar method
- To choose between them, see http://sebastianruder.com/optimizing-gradient-descent
- SGD here means mini-batch SGD
- Choice for training - SGD + Nesterov momentum, or SGD with Adagrad / RMSProp / Adam (see the Adam sketch after this list)
- Math of Backpropagation
- Backpropagation computes the gradients that GD uses
- Issues with GD / training with GD
- These are addressed through the learning rate and the optimization method
- Training a deep network is essentially running GD
- The difference between ML and pure optimization is generalization
- ML wants the best performance tomorrow - the model must work well on unseen data, so generalization is important
- Regularization methods are incorporated for generalization performance
- When training accuracy keeps increasing but test accuracy starts decreasing, that is the point to stop (early stopping)
- Train for some epochs, lower the learning rate, and train again
- Stop when the maximum weight change falls below a particular threshold
- Weight decay - a penalty term added to the error function itself
- L2 weight decay (add the sum of squared weights)
- L1 weight decay (add the absolute values of the weights); gives sparse solutions
- Dropout - in each iteration, for each mini-batch, randomly drop a certain percentage of nodes in every layer; this gives excellent regularization performance (see the dropout sketch after this list)
- Effectively an ensemble of different models (similar in spirit to Random Forests)
- DropConnect (an extension of dropout that drops individual connections rather than nodes)
- Add noise to the data - Gaussian noise, salt-and-pepper noise
- Batch Normalization layer (recommended) - implemented in all the major libraries
- Shuffle your inputs
- Choose the mini-batch size such that the network learns faster
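A minimal NumPy sketch of one Adam update, combining momentum with RMSProp-style scaling as noted above; the default decay rates and epsilon are the commonly used values, and the state dictionary is an illustrative assumption:

```python
import numpy as np

def adam_step(w, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (first moment) plus RMSProp-style scaling (second moment)."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad        # running average of gradients
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2   # running average of squared gradients
    m_hat = state["m"] / (1 - beta1 ** state["t"])              # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), state

# usage with an illustrative weight vector:
# w = np.zeros(10)
# state = {"t": 0, "m": np.zeros_like(w), "v": np.zeros_like(w)}
# w, state = adam_step(w, grad, state)
```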
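A minimal NumPy sketch of inverted dropout and an L2 weight-decay penalty, matching the regularization bullets above; the drop probability and the lambda value are illustrative assumptions:

```python
import numpy as np

def dropout(activations, drop_prob=0.5, training=True):
    """Inverted dropout: randomly zero a fraction of units during training and rescale the rest."""
    if not training or drop_prob == 0.0:
        return activations                     # keep all units at test time
    mask = np.random.rand(*activations.shape) >= drop_prob
    return activations * mask / (1.0 - drop_prob)

def l2_penalty(weights, lam=1e-4):
    """L2 weight decay: lambda times the sum of squared weights, added to the error function."""
    return lam * np.sum(weights ** 2)
```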
Curriculum Learning
- Present training examples the way a course is organised: easier examples first, then progressively harder ones
- Lots of data + lots of computing behind deep learning's success (Google / Facebook)
- Unsupervised learning is the approach taken by Facebook for data analysis
- Generating data programmatically - a topic at the NIPS machine learning conference
- Data augmentation - change the illumination in the data, reduce the intensity of pixels, and train the network with all kinds of data: mirrored images, noise, artificial images
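A small NumPy sketch of the kinds of augmentation mentioned above (mirroring, illumination changes, added noise); the probabilities and intensity ranges are illustrative assumptions:

```python
import numpy as np

def augment(image, rng=np.random.default_rng()):
    """Return a randomly augmented copy of an HxWxC image with values in [0, 1]."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                                # horizontal mirror
    out = np.clip(out * rng.uniform(0.7, 1.3), 0.0, 1.0)     # change illumination / pixel intensity
    if rng.random() < 0.5:
        out = np.clip(out + rng.normal(0.0, 0.02, out.shape), 0.0, 1.0)  # add Gaussian noise
    return out
```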
Target Values
- For a binary classification problem, use target values of +1 and -1 (matched to the range of the output activation)
Weight Initialization
- GD works but can take you to different local minima
- The starting point is defined by how you initialize the network
- Never initialize all weights to zero (every unit would then learn the same thing)
- Recommended approach - Xavier initialization
- For every layer in the network, draw the weights randomly from a uniform distribution whose range is scaled by the layer's fan-in and fan-out
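A minimal sketch of Xavier (Glorot) uniform initialization; the layer sizes in the usage line are hypothetical:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=np.random.default_rng()):
    """Xavier/Glorot uniform: U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# usage with hypothetical layer sizes: W1 = xavier_uniform(784, 256)
```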
Happy Learning!!!