Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): RNN

March 05, 2023

LSTM One Pager

At its core, an LSTM uses its hidden state to preserve information from inputs that have already passed through it.

Loss Computation Steps

Many-to-many RNN - the loss is computed at each time step.
Many-to-one RNN - the decision is based on the final hidden state of the network.

LSTM

forget irrelevant parts of previous state
selectively update cell state values
output certain parts of cell state

LSTM / GRU

LSTM (Long Short-Term Memory): has three gates (input, output and forget gates).
GRU (Gated Recurrent Unit): has two gates (reset and update gates).
Unlike an LSTM, a GRU exposes its complete memory at every step, which can help in applications where that is an advantage.
GRUs have fewer parameters, so they often train faster and can perform better than LSTMs when training data is limited.

Gradient Updates

If the gradients are large (exploding gradients), learning diverges. Solution: clip the gradients to a certain maximum value.
If the gradients are small (vanishing gradients), learning becomes very slow or stops. Solution: introduce memory via LSTM, GRU, etc.
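
The clipping fix for exploding gradients can be sketched in a few lines. This is a minimal sketch using NumPy, assuming clipping by global norm (one common variant; clipping each value with `np.clip` is another). The function name `clip_by_global_norm` and the example gradients are illustrative, not from the original notes.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients down together if their combined L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm <= max_norm:
        return [g.copy() for g in grads], total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

# Hypothetical gradients for two parameter tensors; their global norm is 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
```

Scaling every gradient by the same factor keeps the update direction and only shortens the step.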

Unidirectional vs BiLSTM

Unidirectional LSTM only preserves information of the past because the only inputs it has seen are from the past.
BiLSTM has two networks: one accesses past information in the forward direction and the other accesses future information in the reverse direction.

Ref - Link

Keep Exploring!!!

December 24, 2018

Day #172 - CNN - One Pager, Notes and Best Practices

Summary from MIT / Stanford Classes / Geoff Hinton Papers / Readings.

Deep Learning - "Learn directly from data without manual feature engineering"

A deep learning architecture is a multilayer stack of simple modules, many of which compute non-linear input-output mappings. The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives.

CNN Key Components

Input
Convolution
Strides
Pooling
Fully Connected
Output

Image - Represented as an RGB matrix of shape 3 color channels x height x width. Each color value lies in the range [0, 255]
Data Augmentation - Random mix of translation, rotation, stretching, shearing, lens distortions
Convolution - Convolve the filter with the image (Slide over image spatially computing dot products). Convolution preserves the spatial relationship between pixels
Convolution Filters - Replace Large Convolutions (5x5, 7x7) with stacks of 3 x 3 convolutions
Activation Function - Introduces non-linearities into the network. Often ReLU (image data is highly non-linear)
Transfer Learning - Works better for similar types of data, Freeze network and retrain top layer
Pooling - Downsampling for each feature map. MAX Pooling - Shrinks size using max

Advantages of downsampling - Decreased size of input for upcoming layers, Works against overfitting
Flattening - Convert into 1D feature vector. Flattens all its structure to create a single long feature vector
Fully Connected - Has a neuron fully connected to output, Contains neurons that connect to the entire input volume as in ordinary neural networks

After Forward pass we compute - Loss. In Backward pass we Compute Gradient to backpropagate and update weights
Fully Connected Layer - Layer that receives the flattened max-pooled matrix. Its output is fed into a softmax / cross-entropy layer for prediction.
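
The convolution, ReLU, pooling and flattening steps above can be traced on a tiny input. This is a minimal sketch in NumPy for a single-channel image with no padding. The function names, the 6 x 6 image and the edge filter are illustrative assumptions, and real frameworks use much faster implementations.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image and take a dot product at each position."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

def relu(x):
    """Non-linearity: keep positive values, zero out the rest."""
    return np.maximum(x, 0)

def max_pool(x, size=2):
    """Downsample by taking the max of each non-overlapping size x size block."""
    h, w = x.shape[0] // size, x.shape[1] // size
    x = x[:h * size, :w * size]
    return x.reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6 x 6 "image"
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)          # simple horizontal-gradient filter
feature_map = relu(conv2d(image, kernel))          # shape (4, 4)
pooled = max_pool(feature_map)                     # shape (2, 2)
flat = pooled.flatten()                            # 1D vector for the fully connected layer
```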

CNN Concepts
Sliding Windows - Pick a portion of the image and run detection. Keep moving horizontally and vertically to pick and predict for each selected area. Use prediction confidence and a threshold to filter the selections. Generate bounding boxes.
Non-max Suppression (NMS) - Given several bounding boxes, select the box with the highest probability for each detected object

Keras

Dense Layer - A dense layer represents a matrix-vector multiplication. Each input node is connected to each output node.
Dropout - A dropout layer is used for regularization
Hidden Layer - A sparse layer is a hidden layer that is not dense
Fully-connected layer is a (usually) cheap way of learning non-linear combinations of these features. In place of fully connected layers, we can also use a conventional classifier like SVM. Fully Connected layers perform classification based on the features extracted by the previous layers.
Convolutional layers are providing a meaningful, low-dimensional, and somewhat invariant feature space

Experiments

Convolution Level - Standard Convolution, Dilated Convolution, Transposed Convolution, Strided Convolution, Conv2D, ConvLSTM2D
Kernels - 3x3 and 2x2 kernels, Convolution Tricks - 1 X 7, 7 X 1, Replace Large Convolutions (5x5, 7x7) with stacks of 3 x 3 convolutions
Loss Functions - RMSE, Cross Entropy
Optimizers - SGD, AdaGrad, AdaDelta, RMSProp
Activation Functions - Sigmoid, Tanh, ReLU

CNN - Recognize Spatial patterns

RNN - Recognize Sequential patterns

GAN - Two Networks, One to Generate, Another One Testing Output of Generation

Reinforcement Learning - Trial and Error Learning

Questions
How feed-forward and backpropagation work.

A Feed-Forward Neural Network is a type of Neural Network architecture where the connections are "fed forward", i.e. do not form cycles (like in recurrent nets).
Input to hidden and from hidden to output layer.
The values are "fed forward"

Backpropagation is a training algorithm consisting of 2 steps:

Feed forward the values
Calculate the error and propagate it back to the earlier layers.

So to be precise, forward-propagation is part of the backpropagation algorithm but comes before back-propagating.

Input for backpropagation is output_vector, target_output_vector, output is adjusted_weight_vector.
Backpropagation is an efficient method of computing gradients in directed graphs of computations, such as neural networks
Stochastic gradient descent (SGD) is an optimization method used e.g. to minimize a loss function.
In effect, SGD "writes the program" into the weights of the neural network

The various loss functions and their considerations

Mean Squared Error (MSE), or quadratic, loss function is widely used in linear regression as the performance measure
Cross Entropy is commonly used in binary classification (labels are assumed to take values 0 or 1)
Cross Entropy Loss is usually used in classification problems. In essence, it is a measure of the difference between the desired probability distribution and the predicted probability distribution
Negative Log Likelihood loss function is widely used in neural networks. It penalizes a classifier for assigning low probability to the correct class.
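
The losses above are short to write down. This is a minimal sketch in NumPy with made-up predictions; the function names and the `eps` clipping constant are assumptions for illustration. With one-hot targets, the cross entropy below equals the negative log likelihood of the true class.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error, typical for regression."""
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(p_true, p_pred, eps=1e-12):
    """Average cross entropy between target and predicted distributions (one row per sample)."""
    p_pred = np.clip(p_pred, eps, 1.0)   # avoid log(0)
    return -np.sum(p_true * np.log(p_pred), axis=-1).mean()

regression_loss = mse(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 5.0]))

target = np.array([[0.0, 1.0, 0.0]])       # one-hot label: class 1
predicted = np.array([[0.2, 0.7, 0.1]])    # softmax output of a hypothetical model
classification_loss = cross_entropy(target, predicted)
```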

Loss Functions - Reference

Triplet Loss is another loss commonly used in CNN-based image retrieval. During the training process, an image triplet (Ia, Ip, In) is fed into the model as a single sample, where Ia, Ip and In represent the anchor, positive and negative images respectively. The idea is that the distance between the anchor and positive images should be smaller than that between the anchor and negative images.
Contrastive Loss is often used in image retrieval tasks to learn discriminative features for images.
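
The triplet idea can be sketched as a hinge on the two distances. This is a minimal sketch in NumPy, assuming squared Euclidean distance on embedding vectors and a `margin` hyperparameter. Both are common choices rather than details given in these notes, and the embeddings are made up.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(d_pos - d_neg + margin, 0.0)

anchor = np.array([0.0, 0.0])
near = np.array([0.0, 1.0])   # squared distance 1 from the anchor
far = np.array([3.0, 0.0])    # squared distance 9 from the anchor

good_triplet = triplet_loss(anchor, positive=near, negative=far)   # already satisfied
bad_triplet = triplet_loss(anchor, positive=far, negative=near)    # violated, so penalized
```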

More Questions

The various activation functions and why they are needed
Optimization functions and why they are needed
Bias and variance / over and under fitting - what causes them, and the various methods to handle them
CNNs, RNNs, GANs, attention, Transformer, unsupervised and semi supervised, RL, decision trees, Ensemble Learning, SVM, Auto encoders…Understand interpretation, bias, fairness
The statistics theory (the more the better), and the linear algebra and calculus technicalities

IoU is a measure of the overlap between two bounding boxes: The ground truth bounding boxes (i.e. the hand labeled bounding boxes), The predicted bounding boxes from the model
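
IoU is the intersection area divided by the union area. This is a minimal sketch, assuming boxes are given as `(x1, y1, x2, y2)` corner coordinates (the notes do not fix a box format):

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # zero if the boxes do not overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

ground_truth = (0, 0, 2, 2)
prediction = (1, 1, 3, 3)
overlap = iou(ground_truth, prediction)   # intersection 1, union 7
```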

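Non-maximal suppression - Pick the bounding box with the maximum box confidence and output this box as a prediction. Discard the remaining boxes that overlap it heavily (high IoU), then repeat with the boxes that are left.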
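
A greedy version of non-maximal suppression can be sketched as follows. It assumes `(x1, y1, x2, y2)` boxes and an IoU threshold of 0.5, which are illustrative choices, and it handles a single class for simplicity.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest confidence first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # most confident remaining box
        keep.append(best)
        # drop every remaining box that overlaps the chosen one too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

Here the second box overlaps the first with IoU of about 0.68, so it is suppressed, while the third box is far away and survives.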

CNN - Multiple Regions - Slide - Convolve - Feature Extraction
R-CNN - Selective search extracts about 2000 regions. These patches are fed to a CNN, followed by an SVM that predicts the class of each patch.
Fast R-CNN - Still relies on selective search to generate region proposals, but runs the CNN once per image and shares the features across regions.
Faster R-CNN - Replaces selective search with a very small convolutional network called the Region Proposal Network (RPN) to generate regions of interest, followed by a classifier. Runs at about 7 FPS (frames per second).
Feature Pyramid Network (FPN) is a feature extractor designed with feature pyramid concept to improve accuracy and speed

YOLO and SSD Regression-Based detectors

YOLO treats detection as a simple regression problem: it takes an input image and learns the class probabilities and bounding box coordinates
YOLO uses DarkNet for feature detection, followed by convolutional layers.
One limitation of YOLO is that each grid cell predicts only one class, so it struggles with very small objects
Predictions (object locations and classes) are made from one single network

SSD is a single shot detector that uses a VGG16 network as a feature extractor (equivalent to the CNN in Faster R-CNN). Custom convolution layers are then added afterward, and convolution filters are used to make predictions. SSD only uses upper layers for detection and therefore performs much worse for small objects.

Gradient Calculation

Crucial for backpropagation
Function with edge weights
Edge weights yield Error weights
Find the combination of w1,w2..wn that will minimize function y
Minimal error is the goal
The optimization problem in higher dimensions
NP-hard problem, make simulations, approximations
NP-hardness (non-deterministic polynomial-time hard)
Simulate genetic algorithms
TSP
Compute gradient adjust edge weights
Gradient (Partial derivative)
Direction of gradient
Derivative sigmoid function
Easy to code
Learning via backpropagation requires computing the derivative of the activation function
deltaoutput = error*dsigmoid(sum)
Errors flow from the output (O/P) layer back to the input (I/P) layer, hence the name backpropagation
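
The line `deltaoutput = error*dsigmoid(sum)` can be sketched in NumPy. This assumes `sum` means the weighted input to each output neuron and uses the error convention predicted minus actual; the example values are made up.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    """Derivative of the sigmoid, written in terms of the sigmoid itself."""
    s = sigmoid(x)
    return s * (1.0 - s)

weighted_sum = np.array([0.0, 2.0])   # hypothetical pre-activation of two output neurons
target = np.array([1.0, 0.0])         # desired outputs

output = sigmoid(weighted_sum)        # forward pass
error = output - target               # predicted - actual
delta_output = error * dsigmoid(weighted_sum)
```

The same delta is then multiplied by the incoming activations to get the gradient for each edge weight.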

Backpropagation

Equation for updating edge weights
Delta = learningrate*gradient + momentum*previousChange
Learning Rate - How fast we learn; needs an optimum value to converge
Momentum - Helps avoid getting stuck in local minima
Gradient = derivative of the loss with respect to the weight
Error / Loss = predicted - actual
Update weights from right to left (output layer back to input layer)
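
The update equation above can be sketched on a one-dimensional problem. This minimal sketch assumes the weight is moved against the gradient (weight minus Delta) and uses the toy loss `f(w) = w**2`; the learning rate and momentum values are illustrative.

```python
def momentum_step(weight, gradient, previous_change, learning_rate=0.1, momentum=0.9):
    """One update: Delta = learning_rate * gradient + momentum * previous_change."""
    change = learning_rate * gradient + momentum * previous_change
    return weight - change, change

def grad(w):
    """Gradient of the toy loss f(w) = w**2."""
    return 2.0 * w

w, change = 1.0, 0.0
history = []
for _ in range(2):
    w, change = momentum_step(w, grad(w), change)
    history.append(w)
# history is approximately [0.8, 0.46]: the second step is larger
# because it reuses part of the previous change.
```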

Backprop Good Read - Link

CNN - Sliding Window, Reduce Dimensions, Works best on images

RNN - Sequence, Retain some portion of history. Limit to parallel training, Hard to capture relationship when points are far

Transformers - Learn, in multiple ways (attention heads), the relationship between each item in the input sequence and all other items in the input

Encoder - Transforms inputs to embeddings. Several multi-head self-attention layers stacked on top of each other.

Decoder - Embedding to output sequence
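
The self-attention used in the encoder can be sketched for a single head. This is a minimal sketch of scaled dot-product attention in NumPy. The projection matrices `w_q`, `w_k`, `w_v` are random placeholders standing in for learned weights, and the sizes are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence x of shape (T, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # how strongly each item attends to every other
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                   # 4 sequence items, 8-dimensional embeddings
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

A multi-head layer runs several such heads in parallel with different projections and concatenates their outputs.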

Applications - Used for Seq2Seq / Machine translations

Feedback - May not work for hierarchical relationships

Reference - Link

Where Backprop Fails?

Every layer has multiple neurons. An RNN has memory cells, which receive inputs at different points in time. GRU / LSTM / basic cells are variations of memory cells. Memory cells get unrolled through time. Training uses backprop with gradient descent. There is a weight / bias for each connection. The loss is calculated from a cost / loss function. Weights / biases are randomly initialized, which places the model somewhere on the slope of the loss surface. The optimizer then attempts to descend the slope to find optimal values (the smallest error), converging at the best possible weights / biases for all connections between neurons.

Failure cases - Gradients don't change (vanishing gradient problem), loss remains constant

Ref Notes

4 Sequence Encoding Blocks You Must Know Besides RNN/LSTM in Tensorflow

Average Pooling and Max Pooling
Hierarchical Pooling
Attentive Pooling

Convolution

Causal Convolutions
Multi-Resolution CNN Block
Multi-Head Multi-Resolution CNN Block

A sequence encoder is the cornerstone of many advanced AI applications, such as semantic search, question-answering, machine reading comprehension

Batchnorm - In effect, performs a kind of coordinated rescaling of its inputs
Dropout - Randomly disables neurons during training, in order to force other neurons to be trained as well
L1 regularization - Cost added is proportional to the absolute value of the weight coefficients
L2 regularization - Cost added is proportional to the square of the value of the weight coefficients
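
The regularizers above are each a line or two of NumPy. This is a minimal sketch: the strength `lam`, the dropout `rate`, and the use of "inverted" dropout (rescaling the surviving activations) are assumptions for illustration.

```python
import numpy as np

def l1_penalty(weights, lam):
    """Cost proportional to the absolute value of the weights."""
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam):
    """Cost proportional to the square of the weights."""
    return lam * np.sum(weights ** 2)

def dropout(activations, rate, rng):
    """Inverted dropout: zero a random fraction `rate` of units and rescale the rest."""
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

w = np.array([0.5, -2.0, 0.0])
l1 = l1_penalty(w, lam=0.01)   # 0.01 * 2.5
l2 = l2_penalty(w, lam=0.01)   # 0.01 * 4.25

rng = np.random.default_rng(0)
dropped = dropout(np.ones(1000), rate=0.5, rng=rng)   # training-time only
```

Either penalty is added to the data loss before computing gradients.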

Bayesian Convolutional Neural Networks (Bayesian CNNs) and traditional Convolutional Neural Networks (CNNs) are both types of neural networks used for processing grid-like data, such as images. The main difference between them lies in their approach to handling uncertainty and learning model parameters.

Convolutional Neural Networks (CNNs) are a type of feedforward neural network that use convolutional layers to learn local features in the input data. CNNs are trained using backpropagation and optimization algorithms like stochastic gradient descent to minimize a loss function. The learned parameters (weights and biases) are point estimates, meaning that they do not capture the uncertainty in the model.

Bayesian Convolutional Neural Networks (Bayesian CNNs) extend traditional CNNs by incorporating Bayesian inference to learn the distribution of model parameters instead of point estimates. This allows Bayesian CNNs to capture the uncertainty in the model, which can be useful for tasks where uncertainty quantification is important, such as medical image analysis or safety-critical applications.

Here are some pros and cons of Bayesian CNNs compared to traditional CNNs:

Pros:

Uncertainty quantification: Bayesian CNNs can provide a measure of uncertainty in their predictions, which can be useful for decision-making and risk assessment.

Regularization: The Bayesian approach naturally incorporates regularization, which can help prevent overfitting and improve generalization.

Robustness: Bayesian CNNs can be more robust to adversarial examples and noisy data, as they take into account the uncertainty in the model parameters.

Cons:

Computational complexity: Bayesian CNNs are generally more computationally expensive than traditional CNNs, as they require sampling from the posterior distribution of model parameters or approximating it using techniques like variational inference.

Implementation complexity: Implementing Bayesian CNNs can be more challenging than traditional CNNs, as it requires additional tools and techniques for handling the Bayesian aspect.

Here's a simple example of a traditional CNN using Python and TensorFlow:
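
A minimal sketch of such a network, assuming TensorFlow 2.x with its Keras API is installed. The input shape, layer sizes and the `build_cnn` name are illustrative assumptions, not taken from the post.

```python
import tensorflow as tf

def build_cnn(input_shape=(28, 28, 1), num_classes=10):
    """Small conv -> pool -> conv -> pool -> dense classifier."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dropout(0.5),      # regularization
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    # The learned weights are point estimates, trained by minimizing the loss.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn()
```

Training would then call `model.fit` on image arrays with integer class labels.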

Happy Mastering DL!!!

December 22, 2018

Day #170 - RNN

Updated (May 30 / 2022) - Based on student discussions :)

RNN = Neural network with previous state / sequencing
LSTM - Cell memory stores the output from step t-1

Has 3 gates - Forget, Update (input), Output
Can be bidirectional for offline data

CNNs are mostly used for images, RNNs are mainly used for sequential data like videos or texts

Key Summary

Recurrent Neural Networks
Flexibility in architecture
Operate over sequences of input and output
Image to sequence of words
Sequence of words and classify sentiment of sentence
Function of all the frames
RNN for processing sequentially
Paper - DRAW - Recurrent Neural Network for Image Generation
Paper - Multiple Object Recognition with Visual Attention (Paper) - sequence processing of fixed inputs
Arrows - Functional dependencies
RNN has a state - Receives through time input vectors
It has state internally, Modify state as function, Weights are inside RNN
Predict output based on certain state
RNN - Collection of vectors, Function of previous state + current input vector
Single Hidden State and Recurrence formula
Character level language models
Feed Sequence of character and ask NN to predict sequence
One hot representation - turn on bit that corresponds to the order
Hidden layer summarizes all characters until then
Softmax classifier over next character
Same function always applied at each step
Initialization - Setting it to zero
Order of data-set matters, Function of everything that comes before it
Character level RNN - https://gist.github.com/karpathy/d4dee566867f8291f086
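
The recurrence in the notes above (one-hot input, a single hidden state, softmax over the next character) can be sketched as below. This is a minimal sketch in NumPy, not the linked gist. The toy text "hello", the hidden size, and the output matrix `Why` with biases `bh` and `by` are assumptions for illustration.

```python
import numpy as np

vocab = sorted(set("hello"))
char_to_ix = {ch: i for i, ch in enumerate(vocab)}
vocab_size, hidden_size = len(vocab), 8

rng = np.random.default_rng(0)
Wxh = rng.normal(scale=0.01, size=(hidden_size, vocab_size))   # input to hidden
Whh = rng.normal(scale=0.01, size=(hidden_size, hidden_size))  # hidden to hidden
Why = rng.normal(scale=0.01, size=(vocab_size, hidden_size))   # hidden to output (assumed)
bh, by = np.zeros(hidden_size), np.zeros(vocab_size)

def rnn_step(ix, h_prev):
    """Same function applied at every step: new state from previous state + current input."""
    x = np.zeros(vocab_size)
    x[ix] = 1.0                                   # one-hot encoding of the character index
    h = np.tanh(Wxh @ x + Whh @ h_prev + bh)      # recurrence formula
    logits = Why @ h + by
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                          # softmax over the next character
    return h, probs

h = np.zeros(hidden_size)                         # initial state set to zero
loss = 0.0
for cur, nxt in zip("hell", "ello"):              # predict each next character
    h, probs = rnn_step(char_to_ix[cur], h)
    loss += -np.log(probs[char_to_ix[nxt]])       # cross-entropy at every time step
```

With untrained, near-zero weights the predictions are close to uniform, so the loss is close to 4 * log(4).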

RNN

Input, Order characters
Associate indexes with every character - sequence length is 25
The full data set is too large to backpropagate through at once
Chunks of input data (25 characters)
Backpropagate through 25 characters
Wxh, Whh - Parameters to train
Sampling code generates samples of the characters it thinks should come next
RNN distribution of next character sequence
Adagrad Update
Loss function - Forward and backward method
Backward 25 all the way to 1
Backpropagate thru softmax, activation function
Sample functions generate new text data
25 softmax at every batch, they all backpropagate
Regularization is done
Loss function - Forward pass - Compute Loss, Backward pass - Compute Gradient
Indexes and sequences of indexes, RNN has no knowledge of characters
Quite interesting examples of poetry, formula generation, code generation
Three layer LSTM
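
The Adagrad update mentioned in this list can be sketched as below. This is a minimal sketch in NumPy; the learning rate, the `eps` constant and the example values are illustrative assumptions.

```python
import numpy as np

def adagrad_update(param, grad, memory, learning_rate=0.1, eps=1e-8):
    """Per-parameter step size shrinks as squared gradients accumulate in `memory`."""
    memory = memory + grad ** 2
    param = param - learning_rate * grad / np.sqrt(memory + eps)
    return param, memory

param = np.array([1.0])
memory = np.zeros_like(param)
param, memory = adagrad_update(param, np.array([2.0]), memory)
# memory is now 4.0, so the step is 0.1 * 2 / sqrt(4), about 0.1
```

Parameters that keep receiving large gradients take smaller steps over time, while rarely updated ones keep larger steps.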

Working Details

Character level RNN on text
Cell is excited or not based on hidden states
Quote detection cell (Until open and close)
Line length tracking cell
Deeper the expression
RNNs are used for training sequence models

Image Captioning

Sequence of words for Image
Image -> CNN
ConvNet Process Image
RNN - Remember Sequences
Conditioning generated model with output of convolution process
Predict Next word / Remember information
Word level embedding
Sample until end of sentence
Number of dimensions = Number of words + 1 (End Token)
Backpropagate at single time
Embedding - One hot representation
Image plugged in first step
Backpropagate everything completely jointly
We can figure out features to better describe in end

More Architectures

Look up image and use feature maps in image
Attention over image
Soft Attention
Selective Attention over inputs

Multiple RNNs

RNNs feed into each other
All in single computational graph

LSTM

Recurrence formula is slightly complicated
Concatenate the vectors and apply a new formula for combining them

LSTM instead of RNN

x (input), h(previous hidden state)
f - sigmoid gate - forget gate - reset some cells to zero
g - tanh gate
LSTM has hidden and cell state vector (two vectors)
LSTM operate over cell state

LSTM

f, i, g, o - n dimensional vectors
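
One LSTM step with the `f`, `i`, `g`, `o` vectors can be sketched as below. This is a minimal sketch in NumPy, assuming the standard formulation with one stacked weight matrix `W` applied to the concatenated input and previous hidden state. The gate ordering inside `W` and the zero weights in the example are illustrative choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. W has shape (4n, input_size + n), b has shape (4n,)."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b   # concatenate input and previous hidden state
    i = sigmoid(z[0:n])           # input gate: which cells to update
    f = sigmoid(z[n:2 * n])       # forget gate: which cells to reset toward zero
    o = sigmoid(z[2 * n:3 * n])   # output gate: which cells to expose
    g = np.tanh(z[3 * n:4 * n])   # candidate values in [-1, 1]
    c = f * c_prev + i * g        # cell state: forget, then selectively update
    h = o * np.tanh(c)            # hidden state: output certain parts of the cell state
    return h, c

input_size, n = 3, 5
W = np.zeros((4 * n, input_size + n))   # zero weights: every sigmoid gate is 0.5, g is 0
b = np.zeros(4 * n)
h, c = lstm_step(np.ones(input_size), np.zeros(n), np.ones(n), W, b)
```

The additive form of the cell-state update is what lets gradients flow across many steps more easily than in a plain RNN.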

RNN vs LSTM

Based on Hidden state operate on cell
Forget gate - reset some cells to zero
LSTM handles the vanishing gradient problem well
Relu used here

This is a special 100th Post of Learning for this year. Also, 170th post for Data Science. I hope this incremental learning always adds the delta for the next big idea

Keep Mastering DL!!!
