"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;
Showing posts with label RNN. Show all posts

March 05, 2023

LSTM One Pager

At its core, an LSTM preserves information from inputs that have already passed through it by using the hidden state.

Loss Computation Steps

  • Many-to-many RNN - the loss is computed at each time step.
  • Many-to-one RNN - the decision is based on the final hidden state of the network.

LSTM

  • forget irrelevant parts of previous state
  • selectively update cell state values
  • output certain parts of cell state

LSTM / GRU

  • LSTM (Long Short Term Memory): LSTM has three gates (input, output and forget gate)
  • GRU (Gated Recurrent Unit): GRU has two gates (reset and update gate).
  • A GRU exposes its complete memory (hidden state) at every step, unlike an LSTM, which filters its cell state through the output gate; this can help in applications where full exposure is an advantage.
  • GRUs often train faster and can perform better than LSTMs when training data is limited

Gradient Updates

  • If the gradients are large (exploding gradients), learning diverges. Solution: clip the gradients to a certain max value.
  • If the gradients are small (vanishing gradients), learning becomes very slow or stops. Solution: introduce gated memory via LSTM, GRU, etc.
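
The clipping fix for exploding gradients can be sketched in a few lines. This is a minimal illustration, not a library implementation: `clip_by_value` caps each gradient at a max value as described above, and `clip_by_norm` is a commonly used variant that rescales the whole gradient vector. Both function names and the example numbers are assumptions made for this sketch.

```python
import math

def clip_by_value(grads, max_value):
    """Cap every gradient component to the range [-max_value, max_value]."""
    return [max(-max_value, min(max_value, g)) for g in grads]

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm == 0.0 or norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

# Hypothetical exploding gradient (L2 norm is 50)
grads = [30.0, -40.0]
by_value = clip_by_value(grads, max_value=5.0)
by_norm = clip_by_norm(grads, max_norm=5.0)
```

Value clipping can change the direction of the gradient, while norm clipping keeps the direction and only shrinks the length.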

Unidirectional  vs BiLSTM 

  • Unidirectional LSTM only preserves information of the past because the only inputs it has seen are from the past.
  • BiLSTM has two networks: one accesses past information in the forward direction and the other accesses future information in the reverse direction

Ref - Link

Keep Exploring!!!

December 24, 2018

Day #172 - CNN - One Pager, Notes and Best Practices

Summary from MIT / Stanford Classes / Geoff Hinton Papers / Readings.

Deep Learning - "Learn directly from data without manual feature engineering"

A deep learning architecture is a multilayer stack of simple modules, many of which compute simple non-linear input-output mappings. The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives

Overview


Step 1


Step 2
CNN Key Components
  • Input
  • Convolution
  • Strides
  • Pooling
  • Fully Connected
  • Output
Image - Represented as an RGB matrix of shape 3 color channels X Height X Width. Each color value is in the [0,255] range
Data Augmentation - Random mix of translation, rotation, stretching, shearing, lens distortions
Convolution - Convolve the filter with the image (Slide over image spatially computing dot products). Convolution preserves the spatial relationship between pixels
Convolution Filters - Replace Large Convolutions (5x5, 7x7) with stacks of 3 x 3 convolutions
Activation Function - Introduces non-linearities into the network. Non-Linearity - Often ReLU (image data is highly non-linear)
Transfer Learning - Works better for similar types of data, Freeze network and retrain top layer
Pooling - Downsampling for each feature map. MAX Pooling - Shrinks size using max
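
To make the convolution and max pooling steps concrete, here is a minimal pure-Python sketch. The function names, the 5 x 5 toy image, and the 2 x 2 edge kernel are assumptions chosen for illustration; real frameworks add padding, channels, and batching.

```python
def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image (no padding), taking a dot product at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for a in range(kh):
                for b in range(kw):
                    total += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)
        out.append(row)
    return out

def max_pool2d(feature_map, size=2):
    """Downsample by keeping the max of each non-overlapping size x size window."""
    out = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            row.append(max(feature_map[i + a][j + b]
                           for a in range(size) for b in range(size)))
        out.append(row)
    return out

# Toy image with a vertical edge between column 1 and column 2
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
kernel = [[-1, 1], [-1, 1]]           # responds to left-to-right intensity increases
feature_map = conv2d(image, kernel)   # 4 x 4 feature map
pooled = max_pool2d(feature_map)      # 2 x 2 after pooling
```

The feature map lights up only where the edge is, and pooling keeps that response while halving each spatial dimension.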


Advantages of downsampling - Decreased size of input for upcoming layers, Works against overfitting
Flattening - Convert into 1D feature vector.  Flattens all its structure to create a single long feature vector
Fully Connected - Has a neuron fully connected to output, Contains neurons that connect to the entire input volume as in ordinary neural networks
After Forward pass we compute - Loss. In Backward pass we Compute Gradient to backpropagate and update weights
Fully Connected Layer - Layer that receives the flattened max-pooled matrix. Its output is fed into a softmax layer (trained with cross-entropy loss) for prediction.
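
The softmax / cross-entropy step at the end of the network can be sketched as follows. The logits are made-up numbers standing in for the fully connected layer's output, and the function names are assumptions for this example.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 (max is subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(probs[target_index])

logits = [2.0, 1.0, 0.1]          # hypothetical outputs of the fully connected layer
probs = softmax(logits)
loss_if_class0 = cross_entropy(probs, 0)   # correct class has the highest score
loss_if_class2 = cross_entropy(probs, 2)   # correct class has the lowest score
```

The loss is small when the network puts high probability on the correct class and grows as that probability shrinks.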

CNN Concepts
Sliding Windows - Pick a portion of the image and run detection. Keep moving horizontally and vertically to pick and predict for each selected area. Use prediction confidence and a threshold to filter the selections. Generate bounding boxes.
Non-max Suppression (NMS) - Given several bounding boxes for the same object, select the box with the highest probability
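
A minimal sketch of greedy non-max suppression, assuming boxes are given as (x1, y1, x2, y2) corner tuples and that overlap is measured with IoU (intersection over union). The 0.5 threshold, the function names, and the sample boxes are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap a kept box too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)   # indices of the surviving boxes
```

Here the second box overlaps the first heavily and is suppressed, while the third box covers a different area and survives.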

Keras
  • Dense Layer - A dense layer represents a matrix vector multiplication.  each input node is connected to each output node.
  • Dropout - A dropout layer is used for regularization  
  • Hidden Layer - A sparse layer is a hidden layer that is not dense
  • Fully-connected layer is a (usually) cheap way of learning non-linear combinations of these features. In place of fully connected layers, we can also use a conventional classifier like SVM. Fully Connected layers perform classification based on the features extracted by the previous layers.
  • Convolutional layers are providing a meaningful, low-dimensional, and somewhat invariant feature space
Experiments
  • Convolution Level - Standard Convolution, Dilated Convolution, Transposed Convolution, Strided Convolution, Conv2D, ConvLSTM2D
  • Kernels - 3x3 and 2x2 kernels, Convolution Tricks - 1 X 7, 7 X 1, Replace Large Convolutions (5x5, 7x7) with stacks of 3 x 3 convolutions, 
  • Loss Functions - RMSE, Cross Entropy
  • Optimizers - SGD, AdaGrad, AdaDelta, RMSProp
  • Activation Functions - Sigmoid, Tanh, ReLU
CNN - Recognize spatial patterns
RNN - Recognize Sequential patterns
GAN - Two Networks, One to Generate, Another One Testing Output of Generation
Reinforcement Learning - Trial and Error Learning

Questions
How feed-forward and backpropagation work.
  • A Feed-Forward Neural Network is a type of Neural Network architecture where the connections are "fed forward", i.e. do not form cycles (like in recurrent nets).
  • Input to hidden and from hidden to output layer. 
  • The values are "fed forward"
Backpropagation is a training algorithm consisting of 2 steps: 
  • Feed forward the values 
  • Calculate the error and propagate it back to the earlier layers. 
So to be precise, forward-propagation is part of the backpropagation algorithm but comes before back-propagating.
  • Input for backpropagation is output_vector, target_output_vector, output is adjusted_weight_vector.
  • Backpropagation is an efficient method of computing gradients in directed graphs of computations, such as neural networks
  • Stochastic gradient descent (SGD) is an optimization method used e.g. to minimize a loss function.
  • SGD writes code in weights of neural network
The various loss functions and their considerations
  • Mean Squared Error (MSE), or quadratic, loss function is widely used in linear regression as the performance measure
  • Binary Cross Entropy is commonly used in binary classification (labels are assumed to take values 0 or 1)
  • Cross Entropy Loss is usually used in classification problems. In essence, it is a measure of the difference between the desired probability distribution and the predicted probability distribution
  • Negative Log Likelihood loss function is widely used in neural networks; it penalizes a classifier for assigning low probability to the correct class.
Loss Functions - Reference
  • Triplet Loss is another loss commonly used in CNN-based image retrieval. During the training process, an image triplet (Ia, Ip, In) is fed into the model as a single sample, where Ia, Ip and In represent the anchor, positive and negative images respectively. The idea is that the distance between the anchor and positive images should be smaller than that between the anchor and negative images.
  • Contrastive Loss is often used in image retrieval tasks to learn discriminative features for images. 
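
A small sketch of the triplet loss idea on plain embedding vectors. The margin value, the use of Euclidean distance (some formulations use squared distance), and the toy embeddings are assumptions for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by at least the margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor = [0.0, 0.0]
positive = [0.0, 1.0]    # distance 1 from the anchor
negative = [3.0, 4.0]    # distance 5 from the anchor
good = triplet_loss(anchor, positive, negative)   # already well separated
bad = triplet_loss(anchor, negative, positive)    # roles swapped: large penalty
```

Training pushes embeddings so that triplets like the second case move toward the first.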
More Questions
  • The various activation functions and why they are needed
  • Optimization functions and why they are needed
  • Bias and variance / over and under fitting - what causes them, and the various methods to handle them
  • CNNs, RNNs, GANs, attention, Transformer, unsupervised and semi supervised, RL, decision trees, Ensemble Learning, SVM, Auto encoders…Understand interpretation, bias, fairness
  • The statistics theory (the more the better), and the linear algebra and calculus technicalities
IoU (Intersection over Union) is a measure of the overlap between two bounding boxes: the ground truth bounding box (i.e. the hand-labeled bounding box) and the predicted bounding box from the model
Non-maximal suppression - Pick the bounding box with the maximum box confidence. Output this box as prediction.
  • CNN - Multiple Regions - Slide - Convolve - Feature Extraction
  • RCNN - Selective Search. 2000 regions extracted. Feed these patches to CNN, followed by SVM to predict the class of each patch.
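  • Fast RCNN - Selective search still generates the region proposals, but the CNN runs once over the whole image and the features for each region are pooled from the shared feature map.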
  • Faster RCNN replaces selective search with a very small convolutional network called Region Proposal Network to generate regions of Interests. Faster R-CNN has a dedicated region proposal network followed by a classifier. 7 FPS (frame per second)
  • Feature Pyramid Network (FPN) is a feature extractor designed with feature pyramid concept to improve accuracy and speed
YOLO and SSD Regression-Based detectors
  • Yolo detection is a simple regression problem which takes an input image and learns the class probabilities and bounding box coordinates
  • YOLO uses DarkNet to make feature detection followed by convolutional layers.
  • One limitation of YOLO is that it predicts only one class per grid cell; hence, it struggles with very small objects
  • Predictions (object locations and classes) are made from one single network 
  • SSD is a single shot detector using a VGG16 network as a feature extractor (equivalent to the CNN in Faster R-CNN). Custom convolution layers are then added afterward, and convolution filters are used to make predictions. SSD only uses upper layers for detection and therefore performs much worse for small objects.
Gradient Calculation
  • Crucial for backpropagation
  • Function with edge weights
  • Edge weights yield Error weights
  • Find the combination of w1,w2..wn that will minimize function y
  • Minimal error is the goal
  • The optimization problem in higher dimensions
  • NP-hard problem, make simulations, approximations
  • NP-hardness (non-deterministic polynomial-time hard)
  • Simulate genetic algorithms
  • TSP
  • Compute gradient adjust edge weights
  • Gradient (Partial derivative)
  • Direction of gradient
  • Derivative sigmoid function
  • Easy to code
  • Learning for backpropagation compute the derivative of the activation function
  • deltaoutput = error*dsigmoid(sum)
  • O/P to I/P layer so-called backpropagation
Backpropagation
  • Equation for updating edge weights
  • Delta = learningrate*gradient + momentum*previousChange
  • Learning Rate - How fast we learn, optimum to converge
  • Momentum - Helps avoid local minima
  • Gradient = d(loss) / d(weight)
  • Error / Loss = predicted - actual
  • Update weights from right to left (output layer back toward the input layer)
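
The update equations above can be sketched for a single sigmoid neuron. This is a toy illustration under stated assumptions: one weight and one bias, a squared-error loss, `error = predicted - actual` as in the notes, and hyperparameters picked for the example. All names are made up for this sketch.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dsigmoid(x):
    """Derivative of the sigmoid, evaluated at the pre-activation value x."""
    s = sigmoid(x)
    return s * (1.0 - s)

def train_neuron(samples, learning_rate=0.5, momentum=0.5, epochs=1000):
    w, b = 0.0, 0.0
    prev_dw, prev_db = 0.0, 0.0
    for _ in range(epochs):
        for x, actual in samples:
            total = w * x + b                    # weighted sum into the neuron
            predicted = sigmoid(total)           # forward pass
            error = predicted - actual
            delta_output = error * dsigmoid(total)
            # Delta = learningrate * gradient + momentum * previousChange
            dw = learning_rate * delta_output * x + momentum * prev_dw
            db = learning_rate * delta_output + momentum * prev_db
            w -= dw                              # step against the gradient
            b -= db
            prev_dw, prev_db = dw, db
    return w, b

samples = [(0.0, 0.0), (1.0, 1.0)]   # learn: input 0 -> 0, input 1 -> 1
w, b = train_neuron(samples)
```

After training, `sigmoid(b)` should sit close to 0 and `sigmoid(w + b)` close to 1; the momentum term carries part of the previous step into the current one.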
Backprop Good Read - Link

CNN - Sliding Window, Reduce Dimensions, Works best on images 
RNN - Sequence, Retain some portion of history. Limit to parallel training, Hard to capture relationship when points are far
Transformers - Learn, in multiple ways (attention heads), the relationship between each item in the input sequence and all other items in the input
Encoder - Transforms inputs to embeddings. Several multi-headed self-attention layers stacked on top of each other.
Decoder - Embedding to output sequence
Applications - Used for Seq2Seq / Machine translations
Feedback - May not work for hierarchical relationships
Reference  - Link

Where Backprop Fails?
Every layer has multiple neurons. An RNN has memory cells, which receive inputs at different points in time. GRU, LSTM, and the basic cell are variations of such cells. Memory cells get unrolled through time, and backprop runs via gradient descent. There are as many weights / biases as there are connections. The loss is calculated with a cost/loss function. Weights/biases are randomly initialized, which places the starting point somewhere on the slope of the loss surface. The optimizer then attempts to descend the slope to find optimal values (the smallest value of error), converging at the best possible values of the weights/biases for all connections in the neurons.
Failure cases - Gradients don't change (vanishing gradient problem), loss remains constant

Ref Notes 

  • Average Pooling and Max Pooling
  • Hierarchical Pooling
  • Attentive Pooling
Convolution
  • Causal Convolutions
  • Multi-Resolution CNN Block
  • Multi-Head Multi-Resolution CNN Block
A sequence encoder is the cornerstone of many advanced AI applications, such as semantic search, question-answering, machine reading comprehension
  • Batchnorm - in effect, performs a kind of coordinated rescaling of its inputs
  • Dropouts: Randomly disables neurons during the training, in order to force other neurons to be trained as well
  • L1 regularization - Cost added is proportional to the absolute value of the weight coefficients
  • L2 regularization - Cost added is proportional to the square of the value of the weight coefficients
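
The regularizers in this list can be sketched in a few lines. The strength `lam`, the sample weights, and the use of "inverted" dropout (scaling the surviving activations during training) are assumptions for this illustration.

```python
import random

def l1_penalty(weights, lam):
    """Cost proportional to the absolute value of the weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """Cost proportional to the square of the weights."""
    return lam * sum(w * w for w in weights)

def dropout(activations, rate, rng):
    """Randomly zero activations during training; scale survivors so the expected sum is unchanged."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

weights = [0.5, -2.0, 1.5]
l1 = l1_penalty(weights, lam=0.01)   # 0.01 * (0.5 + 2.0 + 1.5)
l2 = l2_penalty(weights, lam=0.01)   # 0.01 * (0.25 + 4.0 + 2.25)

rng = random.Random(0)               # seeded so the run is repeatable
activations = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(activations, rate=0.5, rng=rng)
```

The penalty is added to the training loss, so large weights cost more; L1 tends to push weights to exactly zero while L2 shrinks them smoothly.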

Bayesian Convolutional Neural Networks (Bayesian CNNs) and traditional Convolutional Neural Networks (CNNs) are both types of neural networks used for processing grid-like data, such as images. The main difference between them lies in their approach to handling uncertainty and learning model parameters.

Convolutional Neural Networks (CNNs) are a type of feedforward neural network that use convolutional layers to learn local features in the input data. CNNs are trained using backpropagation and optimization algorithms like stochastic gradient descent to minimize a loss function. The learned parameters (weights and biases) are point estimates, meaning that they do not capture the uncertainty in the model.

Bayesian Convolutional Neural Networks (Bayesian CNNs) extend traditional CNNs by incorporating Bayesian inference to learn the distribution of model parameters instead of point estimates. This allows Bayesian CNNs to capture the uncertainty in the model, which can be useful for tasks where uncertainty quantification is important, such as medical image analysis or safety-critical applications.

Here are some pros and cons of Bayesian CNNs compared to traditional CNNs:

Pros:
Uncertainty quantification: Bayesian CNNs can provide a measure of uncertainty in their predictions, which can be useful for decision-making and risk assessment.
Regularization: The Bayesian approach naturally incorporates regularization, which can help prevent overfitting and improve generalization.
Robustness: Bayesian CNNs can be more robust to adversarial examples and noisy data, as they take into account the uncertainty in the model parameters.
Cons:

Computational complexity: Bayesian CNNs are generally more computationally expensive than traditional CNNs, as they require sampling from the posterior distribution of model parameters or approximating it using techniques like variational inference.
Implementation complexity: Implementing Bayesian CNNs can be more challenging than traditional CNNs, as it requires additional tools and techniques for handling the Bayesian aspect.
Here's a simple example of a traditional CNN using Python and TensorFlow:



Happy Mastering DL!!!

December 22, 2018

Day #170 - RNN

Updated (May 30 / 2022) - Based on student discussions :)
  • RNN = Feed-forward network with previous state/sequencing
  • LSTM - Cell memory stores the t-1 output
    • has 3 gates - Forget, Update (input), Output
    • Can be bidirectional for offline data
  • CNNs are mostly used for images, RNNs are mainly used for sequential data like videos or texts
Key Summary
  • Recurrent Neural Networks
  • Flexibility in architecture
  • Operate over sequences of input and output
  • Image to sequence of words
  • Sequence of words and classify sentiment of sentence
  • Function of all the frames
  • RNN for processing sequentially
  • Paper - DRAW - Recurrent Neural Network for Image Generation
  • Paper - Multiple Object Recognition with Visual Attention (Paper) - sequence processing of fixed inputs
  • Arrows - Functional dependencies
  • RNN has a state - Receives through time input vectors
  • It has state internally, Modify state as function, Weights are inside RNN
  • Predict output based on certain state
  • RNN - Collection of vectors, Function of previous state + current input vector
  • Single Hidden State and Recurrence formula
  • Character level language models
  • Feed Sequence of character and ask NN to predict sequence
  • One hot representation - turn on bit that corresponds to the order
  • Hidden layer summarizes all characters until then
  • Softmax classifier over next character
  • Same function always applied at each step
  • Initialization - Setting it to zero
  • Order of data-set matters, Function of everything that comes before it
  • Character level RNN - https://gist.github.com/karpathy/d4dee566867f8291f086
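
A minimal sketch of the recurrence described in this list: one-hot input, a single hidden state updated by the same function at every step, and a softmax over the next character. The weights are random and untrained, the vocabulary comes from a toy string, and the names `Why`, `bh`, `by` plus the sizes and initialization range are assumptions for illustration.

```python
import math
import random

def one_hot(index, size):
    """Vector of zeros with a single 1 at the character's index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(x, h_prev, Wxh, Whh, Why, bh, by):
    """New hidden state from previous state + current input, then next-character probabilities."""
    h = [math.tanh(a + b + c)
         for a, b, c in zip(matvec(Wxh, x), matvec(Whh, h_prev), bh)]
    logits = [a + b for a, b in zip(matvec(Why, h), by)]
    return h, softmax(logits)

text = "hello"
chars = sorted(set(text))                      # vocabulary: e, h, l, o
char_to_ix = {ch: i for i, ch in enumerate(chars)}
vocab_size, hidden_size = len(chars), 4

rng = random.Random(0)
def rand_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

Wxh = rand_matrix(hidden_size, vocab_size)     # input -> hidden
Whh = rand_matrix(hidden_size, hidden_size)    # hidden -> hidden
Why = rand_matrix(vocab_size, hidden_size)     # hidden -> output
bh, by = [0.0] * hidden_size, [0.0] * vocab_size

h = [0.0] * hidden_size                        # initial state set to zero
for ch in text:
    x = one_hot(char_to_ix[ch], vocab_size)
    h, probs = rnn_step(x, h, Wxh, Whh, Why, bh, by)
```

After the loop, `h` summarizes all characters seen so far and `probs` is the (untrained) distribution over the next character.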


RNN
  • Input, Order characters
  • Associate indexes for every character - sequence length is 25
  • Too large data cannot be put on top of it
  • Chunks of input data (25 characters)
  • Backpropagate 25 characters
  • Wxh, Whh  - Parameters to train
  • Sampling code to generate samples of characters it thinks
  • RNN distribution of next character sequence
  • Adagrad Update
  • Loss function - Forward and backward method
  • Backward 25 all the way to 1
  • Backpropagate thru softmax, activation function
  • Sample functions generate new text data
  • 25 softmax at every batch, they all backpropagate
  • Regularization is done 
  • Loss function - Forward pass - Compute Loss, Backward pass - Compute Gradient
  • Indexes and sequences of indexes, RNN has no knowledge of characters
  • Quite interesting examples of poetry, formula generation, code generation
  • Three layer LSTM
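
The Adagrad update mentioned in this list can be sketched on a toy problem. This shows the standard Adagrad rule under assumptions of my own: a single parameter, the quadratic loss (w - 3)^2, and a learning rate chosen for the example.

```python
import math

def adagrad_update(params, grads, memory, learning_rate=0.1, eps=1e-8):
    """Per-parameter step size shrinks as squared gradients accumulate in memory."""
    for i, g in enumerate(grads):
        memory[i] += g * g
        params[i] -= learning_rate * g / math.sqrt(memory[i] + eps)

# Toy problem: minimize (w - 3)^2, whose gradient is 2 * (w - 3)
params = [0.0]
memory = [0.0]
for _ in range(100):
    grads = [2.0 * (params[0] - 3.0)]
    adagrad_update(params, grads, memory, learning_rate=1.0)
```

Parameters that keep receiving large gradients get smaller steps over time, which helps when different weights see gradients of very different scale.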



Working Details
  • Character level RNN on text
  • Cell is excited or not based on hidden states
  • Quote detection cell (Until open and close)
  • Line length tracking cell
  • Deeper the expression
  • RNNs are used for training sequence models
Image Captioning
  • Sequence of words for Image
  • Image -> CNN
  • ConvNet Process Image
  • RNN - Remember Sequences
  • Conditioning generated model with output of convolution process
  • Predict Next word / Remember information
  • Word level embedding
  • Sample until end of sentence
  • Number of dimensions = Number of words + 1 (End Token)
  • Backpropagate at single time
  • Embedding - One hot representation
  • Image plugged in first step
  • Backpropagate everything completely jointly
  • We can figure out features to better describe in end

More Architectures
  • Look up image and use feature maps in image
  • Attention over image
  • Soft Attention
  • Selective Attention over inputs
Multiple RNNs
  • RNNs feed into each other
  • All in single computational graph
LSTM
  • Recurrence formula is slightly complicated
  • Concatenate and new formula for combining vectors
LSTM instead of RNN
  • x (input), h(previous hidden state)
  • f - sigma gate - forget gate - reset some cells to zero
  • g - tanh gate
  • LSTM has hidden and cell state vector (two vectors)
  • LSTM operate over cell state


LSTM
  • f, i, g, o - n dimensional vectors
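
One LSTM step with the f, i, g, o vectors can be sketched as below. This follows the common textbook formulation (concatenate x and the previous hidden state, compute four gates, update the cell state, then the hidden state); the dictionary layout of `W` and `b`, the sizes, and the hand-set biases are assumptions for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """W[name] is an n x (len(x) + n) matrix and b[name] an n-vector, for name in f, i, g, o."""
    z = x + h_prev                               # concatenate input and previous hidden state
    def gate(name, activation):
        return [activation(sum(w * v for w, v in zip(row, z)) + bias)
                for row, bias in zip(W[name], b[name])]
    f = gate("f", sigmoid)      # forget gate: which cells to reset toward zero
    i = gate("i", sigmoid)      # input gate: which cells to update
    g = gate("g", math.tanh)    # candidate values to write
    o = gate("o", sigmoid)      # output gate: which parts of the cell to expose
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

n, d = 2, 3                                      # hidden size and input size
W = {name: [[0.0] * (d + n) for _ in range(n)] for name in "figo"}
# Hand-set biases: forget gate fully open, input gate closed
b = {"f": [100.0] * n, "i": [-100.0] * n, "g": [0.0] * n, "o": [0.0] * n}
x, h_prev, c_prev = [1.0, 2.0, 3.0], [0.0, 0.0], [0.5, -0.25]
h, c = lstm_step(x, h_prev, c_prev, W, b)
```

With the forget gate open and the input gate closed, the cell state passes through unchanged, which is the mechanism that lets information and gradients survive over many steps.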
RNN vs LSTM
  • Based on Hidden state operate on cell
  • Forget gate - reset some cells to zero
  • LSTM handles the vanishing gradient problem well
  • Relu used here



This is a special 100th Post of Learning for this year. Also, 170th post for Data Science. I hope this incremental learning always adds the delta for the next big idea

Keep Mastering DL!!!