"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

December 24, 2018

Day #172 - CNN - One Pager, Notes and Best Practices

Summary from MIT / Stanford Classes / Geoff Hinton Papers / Readings.

Deep Learning - "Learn directly from data without manual feature engineering"

A deep learning architecture is a multilayer stack of simple modules, most of which compute simple non-linear input-output mappings. The backpropagation procedure used to compute the gradient of an objective function with respect to the weights of such a multilayer stack of modules is nothing more than a practical application of the chain rule of derivatives.

Overview
(Figures: Step 1 and Step 2 overview diagrams)
CNN Key Components
  • Input
  • Convolution
  • Strides
  • Pooling
  • Fully Connected
  • Output
Image - Represented as an RGB matrix of shape 3 color channels x Height x Width. Pixel values lie in the [0, 255] range
Data Augmentation - Random mix of translation, rotation, stretching, shearing, and lens distortions
Convolution - Convolve the filter with the image (slide over the image spatially, computing dot products). Convolution preserves the spatial relationship between pixels
Convolution Filters - Replace large convolutions (5x5, 7x7) with stacks of 3x3 convolutions
Activation Function - Introduces non-linearities into the network. Often ReLU, since image data is highly non-linear
Transfer Learning - Works best when the new data is similar to the original training data; freeze the network and retrain the top layer
Pooling - Downsampling applied to each feature map. Max pooling shrinks the size by keeping the maximum value in each window


Advantages of downsampling - Decreases the input size for subsequent layers and works against overfitting
Flattening - Converts the feature maps into a 1D feature vector, flattening all spatial structure into a single long vector
Fully Connected - Contains neurons that connect to the entire input volume, as in ordinary neural networks
After the forward pass we compute the loss; in the backward pass we compute gradients to backpropagate and update the weights
Fully Connected Layer - The layer where the max-pooled matrix is flattened; its output is fed into a softmax / cross-entropy layer for prediction.
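
A minimal Keras sketch (my own illustration, not from the classes above) wiring these components together for a hypothetical 32x32 RGB input and 10 classes:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Hypothetical 32x32 RGB input, 10 output classes.
model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),               # Input: height x width x 3 color channels
    layers.Conv2D(32, (3, 3), activation='relu'),  # Convolution + ReLU non-linearity
    layers.MaxPooling2D((2, 2)),                   # Pooling: downsample each feature map
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                              # Flattening: single long 1D feature vector
    layers.Dense(64, activation='relu'),           # Fully connected layer
    layers.Dense(10, activation='softmax'),        # Output: class probabilities
])
model.summary()
```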

CNN Concepts
Sliding Windows - Pick a portion of the image and run detection; keep moving horizontally and vertically, predicting for each selected area. Filter predictions by confidence threshold and generate bounding boxes (a rough sketch follows below).
Non-max Suppression (NMS) - When several bounding boxes cover the same object, select the box with the highest probability for each detected object.
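
A rough NumPy sketch of the sliding-window idea; `classify` is a hypothetical patch classifier returning a confidence score, and the window/stride sizes are illustrative assumptions:

```python
import numpy as np

def sliding_window_detect(image, classify, window=64, stride=32, threshold=0.5):
    """Slide a fixed-size window over the image and keep confident detections."""
    boxes = []
    height, width = image.shape[:2]
    for y in range(0, height - window + 1, stride):      # move vertically
        for x in range(0, width - window + 1, stride):   # move horizontally
            patch = image[y:y + window, x:x + window]
            score = classify(patch)                       # hypothetical classifier, returns a confidence
            if score >= threshold:                        # confidence threshold filter
                boxes.append((x, y, x + window, y + window, score))
    return boxes                                          # bounding boxes as (x1, y1, x2, y2, score)
```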

Keras
  • Dense Layer - A dense layer represents a matrix-vector multiplication: each input node is connected to each output node (see the sketch after this list).
  • Dropout - A dropout layer is used for regularization.
  • Hidden Layer - Any layer between the input and the output layer; a sparsely connected hidden layer is one that is not dense.
  • Fully-connected layer - A (usually) cheap way of learning non-linear combinations of the extracted features. In place of fully connected layers, we can also use a conventional classifier such as an SVM. Fully connected layers perform classification based on the features extracted by the previous layers.
  • Convolutional layers provide a meaningful, low-dimensional, and somewhat invariant feature space.
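A tiny NumPy sketch of the "dense layer = matrix-vector multiplication" point above; the sizes are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))     # every one of the 4 input nodes connects to every one of the 3 output nodes
b = np.zeros(3)                 # bias vector
x = rng.normal(size=4)          # one input sample

y = np.maximum(W @ x + b, 0.0)  # dense layer = matrix-vector multiplication + bias, then ReLU
print(y.shape)                  # (3,)
```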
Experiments
  • Convolution Level - Standard convolution, dilated convolution, transposed convolution, strided convolution, Conv2D, ConvLSTM2D (see the sketch after this list)
  • Kernels - 3x3 and 2x2 kernels. Convolution tricks - 1x7 and 7x1 factorizations; replace large convolutions (5x5, 7x7) with stacks of 3x3 convolutions
  • Loss Functions - RMSE, Cross Entropy
  • Optimizers - SGD, AdaGrad, AdaDelta, RMSProp
  • Activation Functions - Sigmoid, Tanh, ReLU
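A hedged Keras sketch of a few of these convolution variants on a dummy feature map (shapes and filter counts are illustrative assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 32, 32, 16))   # dummy feature map: batch, height, width, channels

standard   = layers.Conv2D(32, (3, 3), padding='same')(x)                      # standard convolution
dilated    = layers.Conv2D(32, (3, 3), dilation_rate=2, padding='same')(x)     # dilated convolution
strided    = layers.Conv2D(32, (3, 3), strides=2, padding='same')(x)           # strided convolution (downsamples)
transposed = layers.Conv2DTranspose(32, (3, 3), strides=2, padding='same')(x)  # transposed convolution (upsamples)
factorized = layers.Conv2D(32, (7, 1), padding='same')(
             layers.Conv2D(32, (1, 7), padding='same')(x))                     # 1x7 followed by 7x1 trick

print(standard.shape, dilated.shape, strided.shape, transposed.shape, factorized.shape)
# (1, 32, 32, 32) (1, 32, 32, 32) (1, 16, 16, 32) (1, 64, 64, 32) (1, 32, 32, 32)
```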
CNN - Recognize spatial patterns
RNN - Recognize sequential patterns
GAN - Two networks: one generates samples, the other tests (discriminates) the output of the generator
Reinforcement Learning - Trial-and-error learning

Questions
How forward propagation and backpropagation work.
  • A Feed-Forward Neural Network is a type of Neural Network architecture where the connections are "fed forward", i.e. do not form cycles (like in recurrent nets).
  • Values flow from the input layer to the hidden layer, and from the hidden layer to the output layer.
  • The values are "fed forward"
Backpropagation is a training algorithm consisting of 2 steps: 
  • Feed forward the values 
  • Calculate the error and propagate it back to the earlier layers. 
So to be precise, forward-propagation is part of the backpropagation algorithm but comes before back-propagating.
  • The input to backpropagation is the output_vector and target_output_vector; the output is the adjusted_weight_vector.
  • Backpropagation is an efficient method of computing gradients in directed graphs of computations, such as neural networks
  • Stochastic gradient descent (SGD) is an optimization method used e.g. to minimize a loss function.
  • In a sense, SGD "writes the program" into the weights of the neural network (see the sketch below).
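A minimal TensorFlow sketch of one forward pass, loss computation, and SGD weight update; the toy data and layer sizes are assumptions for illustration:

```python
import tensorflow as tf

# Toy data: 8 samples, 4 features, binary targets.
x = tf.random.normal((8, 4))
y = tf.cast(tf.random.uniform((8, 1), maxval=2, dtype=tf.int32), tf.float32)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu'),    # input layer -> hidden layer
    tf.keras.layers.Dense(1, activation='sigmoid'),  # hidden layer -> output layer
])
loss_fn = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

with tf.GradientTape() as tape:
    pred = model(x)          # step 1: feed the values forward
    loss = loss_fn(y, pred)  # calculate the error
grads = tape.gradient(loss, model.trainable_variables)             # step 2: propagate the error back
optimizer.apply_gradients(zip(grads, model.trainable_variables))   # SGD adjusts the weights
```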
The various loss functions and their considerations
  • Mean Squared Error (MSE), or quadratic loss, is widely used in linear regression as the performance measure.
  • Cross entropy is commonly used in binary classification (labels are assumed to take values 0 or 1).
  • Cross entropy loss is usually used in classification problems. In essence, it is a measure of the difference between the desired probability distribution and the predicted probability distribution.
  • Negative log likelihood loss is widely used in neural networks; it measures how well a classifier's predicted probabilities match the true labels (a short numerical sketch follows this list).
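A short numerical sketch of these losses on toy values (the numbers are made up for illustration):

```python
import numpy as np

y_true = np.array([1.0, 0.0, 1.0])   # toy binary labels
y_pred = np.array([0.9, 0.2, 0.7])   # predicted probabilities

mse = np.mean((y_true - y_pred) ** 2)                                        # Mean Squared Error
bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))  # binary cross entropy
nll = -np.mean(np.log(np.where(y_true == 1, y_pred, 1 - y_pred)))            # negative log likelihood
print(mse, bce, nll)   # for binary labels, cross entropy and NLL give the same value
```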
Loss Functions - Reference
  • Triplet loss is another loss commonly used in CNN-based image retrieval. During training, an image triplet (Ia, Ip, In) is fed into the model as a single sample, where Ia, Ip and In represent the anchor, positive and negative images respectively. The idea is that the distance between the anchor and the positive image should be smaller than the distance between the anchor and the negative image (see the sketch below).
  • Contrastive loss is often used in image retrieval tasks to learn discriminative features for images.
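A small NumPy sketch of the triplet loss idea; the embeddings and margin are toy assumptions:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Anchor-positive distance should be smaller than anchor-negative distance by at least `margin`."""
    d_ap = np.sum((anchor - positive) ** 2)
    d_an = np.sum((anchor - negative) ** 2)
    return max(0.0, d_ap - d_an + margin)

# Toy embeddings; in practice these come from the CNN.
a, p, n = np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.0, 1.0])
print(triplet_loss(a, p, n))   # 0.0 here, since the positive is already much closer than the negative
```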
More Questions
  • The various activation functions and why they are needed
  • Optimization functions and why they are needed
  • Bias and variance / over and under fitting - what causes them, and the various methods to handle them
  • CNNs, RNNs, GANs, attention, Transformer, unsupervised and semi supervised, RL, decision trees, Ensemble Learning, SVM, Auto encoders…Understand interpretation, bias, fairness
  • The statistics theory (the more the better), and the linear algebra and calculus technicalities
IoU is a measure of the overlap between two bounding boxes: the ground-truth bounding box (i.e. the hand-labeled box) and the predicted bounding box from the model.
Non-maximal suppression - Pick the bounding box with the maximum confidence, output it as a prediction, and suppress overlapping lower-confidence boxes (see the sketch below).
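A rough NumPy sketch of IoU and non-max suppression as described above (the box format and threshold are assumptions):

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-confidence box, drop boxes that overlap it too much, repeat."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```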
  • CNN - Multiple Regions - Slide - Convolve - Feature Extraction
  • RCNN - Selective search extracts about 2000 region proposals; each patch is fed to a CNN, followed by an SVM to predict the class of each patch.
  • Fast RCNN - Still uses selective search for region proposals, but runs the CNN once over the whole image and classifies each proposed region from the shared feature map.
  • Faster RCNN - Replaces selective search with a very small convolutional network, the Region Proposal Network, to generate regions of interest; the dedicated region proposal network is followed by a classifier. Roughly 7 FPS (frames per second).
  • Feature Pyramid Network (FPN) is a feature extractor designed with feature pyramid concept to improve accuracy and speed
YOLO and SSD - Regression-based detectors
  • YOLO treats detection as a simple regression problem: it takes an input image and learns the class probabilities and bounding box coordinates directly
  • YOLO uses DarkNet as the backbone for feature extraction, followed by additional convolutional layers.
  • One limitation of YOLO is that each grid cell predicts only one class, so it struggles with very small objects
  • Predictions (object locations and classes) are made by one single network
  • SSD is a single-shot detector that uses a VGG16 network as the feature extractor (equivalent to the CNN in Faster R-CNN), with custom convolution layers added afterward and convolution filters used to make the predictions. SSD only uses upper layers for detection and therefore performs much worse on small objects.
Gradient Calculation
  • Crucial for backpropagation
  • The error is a function of the edge weights
  • Find the combination of w1, w2, ..., wn that minimizes the error function y
  • Minimal error is the goal
  • This is an optimization problem in high dimensions
  • In general it is NP-hard (non-deterministic polynomial-time hard), so we rely on simulations and approximations
  • Compare with genetic-algorithm style search for problems such as the TSP (travelling salesman problem)
  • Compute the gradient and adjust the edge weights
  • Gradient = vector of partial derivatives; it gives the direction of steepest change
  • The derivative of the sigmoid function is easy to code
  • Learning via backpropagation requires computing the derivative of the activation function
  • deltaoutput = error * dsigmoid(sum)
  • Errors flow from the output layer back to the input layer, hence the name backpropagation
Backpropagation
  • Equation for updating the edge weights (see the sketch below)
  • Delta = learningrate * gradient + momentum * previousChange
  • Learning rate - how fast we learn; needs to be tuned so training converges
  • Momentum - helps avoid getting stuck in local minima
  • Gradient = derivative of the loss with respect to the weight
  • Error / loss signal = predicted - actual
  • Weights are updated from the output layer back toward the input layer
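A plain NumPy sketch of these update rules for a single sigmoid neuron; the data, learning rate and momentum values are toy assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)                 # derivative of the sigmoid activation

rng = np.random.default_rng(0)
w, b = rng.normal(size=2), 0.0           # randomly initialize weights / bias
x, target = np.array([0.5, -1.0]), 1.0   # one toy training example
lr, momentum = 0.5, 0.9                  # learning rate and momentum (illustrative values)
prev_change = np.zeros(2)

for _ in range(100):
    s = w @ x + b                        # weighted sum
    predicted = sigmoid(s)               # forward pass
    error = predicted - target           # error / loss signal = predicted - actual
    delta_output = error * dsigmoid(s)   # deltaoutput = error * dsigmoid(sum)
    gradient = delta_output * x          # d(loss) / d(weight)
    change = lr * gradient + momentum * prev_change   # Delta = learningrate*gradient + momentum*previousChange
    w -= change                          # update edge weights, output side back toward input
    b -= lr * delta_output
    prev_change = change
```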
Backprop Good Read - Link

CNN - Sliding window, reduces dimensions, works best on images
RNN - Sequence model that retains some portion of history. Limited parallel training; hard to capture relationships when points are far apart
Transformers - Learn, in multiple ways, the relationship between each item in the input sequence and all other items in the input
Encoder - Transforms inputs into embeddings. Several multi-headed self-attention blocks stacked on top of each other (see the sketch below)
Decoder - Maps embeddings to the output sequence
Applications - Used for Seq2Seq tasks / machine translation
Feedback - May not work well for hierarchical relationships
Reference  - Link
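
A bare-bones NumPy sketch of scaled dot-product self-attention, the building block behind the multi-headed attention mentioned above; the sequence length, embedding size and random projections are assumptions:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: every position attends to every other position."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # pairwise relationships between all items
    return softmax(scores) @ V                # attention-weighted sum of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                        # toy sequence: 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)         # (5, 8)
```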

Where Does Backprop Fail?
Every layer has multiple neurons. An RNN has memory cells, which receive inputs at different points in time; GRU, LSTM and basic cells are variations of these memory cells, and they get unrolled through time. Training still uses backprop via gradient descent: there are weights and biases for every connection, the loss is calculated from a cost/loss function, weights and biases are randomly initialized, and the optimizer descends the slope to find the values with the smallest error, converging at the best possible weights and biases for all connections.
Failure cases - Gradients stop changing (the vanishing gradient problem) and the loss remains constant (see the sketch below).
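
A tiny numerical sketch of the vanishing gradient problem: backprop through many unrolled steps multiplies many small derivatives, and the recurrent weight of 0.9 used here is purely an illustrative assumption:

```python
import numpy as np

def dsigmoid(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)          # never larger than 0.25

grad = 1.0
for _ in range(50):               # backprop through 50 unrolled time steps
    grad *= dsigmoid(0.0) * 0.9   # 0.9 stands in for a recurrent weight (assumption)
print(grad)                       # roughly 0.225**50 -- effectively zero, so early steps stop learning
```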

Ref Notes 

Pooling
  • Average Pooling and Max Pooling
  • Hierarchical Pooling
  • Attentive Pooling
Convolution
  • Causal Convolutions (see the sketch after this list)
  • Multi-Resolution CNN Block
  • Multi-Head Multi-Resolution CNN Block
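A hedged Keras sketch of a causal convolution, plus one simple reading of a multi-resolution block (parallel kernel sizes concatenated); shapes and sizes are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((4, 100, 8))   # dummy batch: 4 sequences, 100 time steps, 8 features

# Causal convolution: each output step only sees the current and past inputs, never the future.
causal = layers.Conv1D(16, kernel_size=3, padding='causal', dilation_rate=2)(x)

# One interpretation of a multi-resolution block: parallel branches with different kernel sizes, concatenated.
branches = [layers.Conv1D(16, k, padding='same')(x) for k in (3, 5, 7)]
multi_res = layers.Concatenate()(branches)

print(causal.shape, multi_res.shape)   # (4, 100, 16) (4, 100, 48)
```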
A sequence encoder is the cornerstone of many advanced AI applications, such as semantic search, question answering and machine reading comprehension.
  • Batchnorm - In effect, performs a kind of coordinated rescaling of its inputs
  • Dropout - Randomly disables neurons during training, in order to force other neurons to learn as well
  • L1 regularization - The cost added is proportional to the absolute value of the weight coefficients
  • L2 regularization - The cost added is proportional to the square of the weight coefficients (a Keras sketch combining these follows below)
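A small Keras sketch combining these regularizers in one model; the layer sizes and regularization strengths (0.01) are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    layers.Input(shape=(20,)),                                # 20 input features (assumed)
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l1(0.01)),   # L1: cost proportional to |weights|
    layers.BatchNormalization(),                              # coordinated rescaling of its inputs
    layers.Dropout(0.5),                                      # randomly disables neurons during training
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(0.01)),   # L2: cost proportional to weights squared
    layers.Dense(1, activation='sigmoid'),
])
```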

Bayesian Convolutional Neural Networks (Bayesian CNNs) and traditional Convolutional Neural Networks (CNNs) are both types of neural networks used for processing grid-like data, such as images. The main difference between them lies in their approach to handling uncertainty and learning model parameters.

Convolutional Neural Networks (CNNs) are a type of feedforward neural network that use convolutional layers to learn local features in the input data. CNNs are trained using backpropagation and optimization algorithms like stochastic gradient descent to minimize a loss function. The learned parameters (weights and biases) are point estimates, meaning that they do not capture the uncertainty in the model.

Bayesian Convolutional Neural Networks (Bayesian CNNs) extend traditional CNNs by incorporating Bayesian inference to learn the distribution of model parameters instead of point estimates. This allows Bayesian CNNs to capture the uncertainty in the model, which can be useful for tasks where uncertainty quantification is important, such as medical image analysis or safety-critical applications.

Here are some pros and cons of Bayesian CNNs compared to traditional CNNs:

Pros:
  • Uncertainty quantification - Bayesian CNNs can provide a measure of uncertainty in their predictions, which can be useful for decision-making and risk assessment.
  • Regularization - The Bayesian approach naturally incorporates regularization, which can help prevent overfitting and improve generalization.
  • Robustness - Bayesian CNNs can be more robust to adversarial examples and noisy data, as they take the uncertainty in the model parameters into account.
Cons:
  • Computational complexity - Bayesian CNNs are generally more computationally expensive than traditional CNNs, as they require sampling from the posterior distribution of model parameters or approximating it using techniques like variational inference.
  • Implementation complexity - Implementing Bayesian CNNs can be more challenging than traditional CNNs, as it requires additional tools and techniques for handling the Bayesian aspects.
Here's a simple example of a traditional CNN using Python and TensorFlow:
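A minimal sketch along those lines, trained here on random dummy data standing in for real images:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Random dummy data standing in for real images: 100 samples of 28x28 grayscale, 10 classes.
x_train = np.random.rand(100, 28, 28, 1).astype("float32")
y_train = np.random.randint(0, 10, size=(100,))

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(16, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=2, batch_size=16)
```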



Happy Mastering DL!!!
