"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

December 24, 2018

Day #172 - CNN - One Pager, Notes and Best Practices

Summary from MIT / Stanford Classes / Geoff Hinton Papers / Readings.

Deep Learning - "Learn directly from data without manual feature engineering"

A deep learning architecture is a multilayer stack of simple modules, most of which compute simple non-linear input-output mappings. The backpropagation procedure used to compute the gradient of an objective function with respect to the weights of such a multilayer stack of modules is nothing more than a practical application of the chain rule of derivatives.

Overview
(Figures: Step 1 and Step 2 overview diagrams)
CNN Key Components
  • Input
  • Convolution
  • Strides
  • Pooling
  • Fully Connected
  • Output
Image - Represented as an RGB matrix of shape 3 color channels x Height x Width. Pixel values lie in the [0, 255] range
Data Augmentation - Random mix of translation, rotation, stretching, shearing, and lens distortions
Convolution - Convolve the filter with the image (slide over the image spatially, computing dot products). Convolution preserves the spatial relationship between pixels
Convolution Filters - Replace large convolutions (5x5, 7x7) with stacks of 3x3 convolutions
Activation Function - Introduces non-linearities into the network. Often ReLU, since image data is highly non-linear
Transfer Learning - Works best when the new data is similar to the original training data; freeze the network and retrain the top layer
Pooling - Downsampling applied to each feature map. Max pooling shrinks the size by keeping the maximum value in each window


Advantages of downsampling - Decreases the input size for subsequent layers and works against overfitting
Flattening - Converts the feature maps into a 1D feature vector, flattening all spatial structure into a single long vector
Fully Connected - Contains neurons that connect to the entire input volume, as in ordinary neural networks
After the forward pass we compute the loss; in the backward pass we compute gradients to backpropagate and update the weights
Fully Connected Layer - The layer where the max-pooled matrix is flattened; its output is fed into a softmax / cross-entropy layer for prediction.
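
A minimal Keras sketch (my own illustration, not from the classes above) wiring these components together for a hypothetical 32x32 RGB input and 10 classes:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Hypothetical 32x32 RGB input, 10 output classes.
model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),               # Input: height x width x 3 color channels
    layers.Conv2D(32, (3, 3), activation='relu'),  # Convolution + ReLU non-linearity
    layers.MaxPooling2D((2, 2)),                   # Pooling: downsample each feature map
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                              # Flattening: single long 1D feature vector
    layers.Dense(64, activation='relu'),           # Fully connected layer
    layers.Dense(10, activation='softmax'),        # Output: class probabilities
])
model.summary()
```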

CNN Concepts
Sliding Windows - Pick a portion of the image and run detection; keep moving horizontally and vertically, predicting for each selected area. Filter predictions by confidence threshold and generate bounding boxes (a rough sketch follows below).
Non-max Suppression (NMS) - When several bounding boxes cover the same object, select the box with the highest probability for each detected object.
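
A rough NumPy sketch of the sliding-window idea; `classify` is a hypothetical patch classifier returning a confidence score, and the window/stride sizes are illustrative assumptions:

```python
import numpy as np

def sliding_window_detect(image, classify, window=64, stride=32, threshold=0.5):
    """Slide a fixed-size window over the image and keep confident detections."""
    boxes = []
    height, width = image.shape[:2]
    for y in range(0, height - window + 1, stride):      # move vertically
        for x in range(0, width - window + 1, stride):   # move horizontally
            patch = image[y:y + window, x:x + window]
            score = classify(patch)                       # hypothetical classifier, returns a confidence
            if score >= threshold:                        # confidence threshold filter
                boxes.append((x, y, x + window, y + window, score))
    return boxes                                          # bounding boxes as (x1, y1, x2, y2, score)
```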

Keras
  • Dense Layer - A dense layer represents a matrix-vector multiplication: each input node is connected to each output node (see the sketch after this list).
  • Dropout - A dropout layer is used for regularization.
  • Hidden Layer - Any layer between the input and the output layer; a sparsely connected hidden layer is one that is not dense.
  • Fully-connected layer - A (usually) cheap way of learning non-linear combinations of the extracted features. In place of fully connected layers, we can also use a conventional classifier such as an SVM. Fully connected layers perform classification based on the features extracted by the previous layers.
  • Convolutional layers provide a meaningful, low-dimensional, and somewhat invariant feature space.
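A tiny NumPy sketch of the "dense layer = matrix-vector multiplication" point above; the sizes are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))     # every one of the 4 input nodes connects to every one of the 3 output nodes
b = np.zeros(3)                 # bias vector
x = rng.normal(size=4)          # one input sample

y = np.maximum(W @ x + b, 0.0)  # dense layer = matrix-vector multiplication + bias, then ReLU
print(y.shape)                  # (3,)
```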
Experiments
  • Convolution Level - Standard convolution, dilated convolution, transposed convolution, strided convolution, Conv2D, ConvLSTM2D (see the sketch after this list)
  • Kernels - 3x3 and 2x2 kernels. Convolution tricks - 1x7 and 7x1 factorizations; replace large convolutions (5x5, 7x7) with stacks of 3x3 convolutions
  • Loss Functions - RMSE, Cross Entropy
  • Optimizers - SGD, AdaGrad, AdaDelta, RMSProp
  • Activation Functions - Sigmoid, Tanh, ReLU
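A hedged Keras sketch of a few of these convolution variants on a dummy feature map (shapes and filter counts are illustrative assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 32, 32, 16))   # dummy feature map: batch, height, width, channels

standard   = layers.Conv2D(32, (3, 3), padding='same')(x)                      # standard convolution
dilated    = layers.Conv2D(32, (3, 3), dilation_rate=2, padding='same')(x)     # dilated convolution
strided    = layers.Conv2D(32, (3, 3), strides=2, padding='same')(x)           # strided convolution (downsamples)
transposed = layers.Conv2DTranspose(32, (3, 3), strides=2, padding='same')(x)  # transposed convolution (upsamples)
factorized = layers.Conv2D(32, (7, 1), padding='same')(
             layers.Conv2D(32, (1, 7), padding='same')(x))                     # 1x7 followed by 7x1 trick

print(standard.shape, dilated.shape, strided.shape, transposed.shape, factorized.shape)
# (1, 32, 32, 32) (1, 32, 32, 32) (1, 16, 16, 32) (1, 64, 64, 32) (1, 32, 32, 32)
```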
CNN - Recognize spatial patterns
RNN - Recognize sequential patterns
GAN - Two networks: one generates samples, the other tests (discriminates) the output of the generator
Reinforcement Learning - Trial-and-error learning

Questions
How forward propagation and backpropagation work.
  • A Feed-Forward Neural Network is a type of Neural Network architecture where the connections are "fed forward", i.e. do not form cycles (like in recurrent nets).
  • Values flow from the input layer to the hidden layer, and from the hidden layer to the output layer.
  • The values are "fed forward"
Backpropagation is a training algorithm consisting of 2 steps: 
  • Feed forward the values 
  • Calculate the error and propagate it back to the earlier layers. 
So to be precise, forward-propagation is part of the backpropagation algorithm but comes before back-propagating.
  • The input to backpropagation is the output_vector and target_output_vector; the output is the adjusted_weight_vector.
  • Backpropagation is an efficient method of computing gradients in directed graphs of computations, such as neural networks
  • Stochastic gradient descent (SGD) is an optimization method used e.g. to minimize a loss function.
  • In a sense, SGD "writes the program" into the weights of the neural network (see the sketch below).
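A minimal TensorFlow sketch of one forward pass, loss computation, and SGD weight update; the toy data and layer sizes are assumptions for illustration:

```python
import tensorflow as tf

# Toy data: 8 samples, 4 features, binary targets.
x = tf.random.normal((8, 4))
y = tf.cast(tf.random.uniform((8, 1), maxval=2, dtype=tf.int32), tf.float32)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu'),    # input layer -> hidden layer
    tf.keras.layers.Dense(1, activation='sigmoid'),  # hidden layer -> output layer
])
loss_fn = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

with tf.GradientTape() as tape:
    pred = model(x)          # step 1: feed the values forward
    loss = loss_fn(y, pred)  # calculate the error
grads = tape.gradient(loss, model.trainable_variables)             # step 2: propagate the error back
optimizer.apply_gradients(zip(grads, model.trainable_variables))   # SGD adjusts the weights
```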
The various loss functions and their considerations
  • Mean Squared Error (MSE), or quadratic loss, is widely used in linear regression as the performance measure.
  • Cross entropy is commonly used in binary classification (labels are assumed to take values 0 or 1).
  • Cross entropy loss is usually used in classification problems. In essence, it is a measure of the difference between the desired probability distribution and the predicted probability distribution.
  • Negative log likelihood loss is widely used in neural networks; it measures how well a classifier's predicted probabilities match the true labels (a short numerical sketch follows this list).
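A short numerical sketch of these losses on toy values (the numbers are made up for illustration):

```python
import numpy as np

y_true = np.array([1.0, 0.0, 1.0])   # toy binary labels
y_pred = np.array([0.9, 0.2, 0.7])   # predicted probabilities

mse = np.mean((y_true - y_pred) ** 2)                                        # Mean Squared Error
bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))  # binary cross entropy
nll = -np.mean(np.log(np.where(y_true == 1, y_pred, 1 - y_pred)))            # negative log likelihood
print(mse, bce, nll)   # for binary labels, cross entropy and NLL give the same value
```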
Loss Functions - Reference
  • Triplet loss is another loss commonly used in CNN-based image retrieval. During training, an image triplet (Ia, Ip, In) is fed into the model as a single sample, where Ia, Ip and In represent the anchor, positive and negative images respectively. The idea is that the distance between the anchor and the positive image should be smaller than the distance between the anchor and the negative image (see the sketch below).
  • Contrastive loss is often used in image retrieval tasks to learn discriminative features for images.
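A small NumPy sketch of the triplet loss idea; the embeddings and margin are toy assumptions:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Anchor-positive distance should be smaller than anchor-negative distance by at least `margin`."""
    d_ap = np.sum((anchor - positive) ** 2)
    d_an = np.sum((anchor - negative) ** 2)
    return max(0.0, d_ap - d_an + margin)

# Toy embeddings; in practice these come from the CNN.
a, p, n = np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.0, 1.0])
print(triplet_loss(a, p, n))   # 0.0 here, since the positive is already much closer than the negative
```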
More Questions
  • The various activation functions and why they are needed
  • Optimization functions and why they are needed
  • Bias and variance / over and under fitting - what causes them, and the various methods to handle them
  • CNNs, RNNs, GANs, attention, Transformer, unsupervised and semi supervised, RL, decision trees, Ensemble Learning, SVM, Auto encoders…Understand interpretation, bias, fairness
  • The statistics theory (the more the better), and the linear algebra and calculus technicalities
IoU is a measure of the overlap between two bounding boxes: the ground-truth bounding box (i.e. the hand-labeled box) and the predicted bounding box from the model.
Non-maximal suppression - Pick the bounding box with the maximum confidence, output it as a prediction, and suppress overlapping lower-confidence boxes (see the sketch below).
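A rough NumPy sketch of IoU and non-max suppression as described above (the box format and threshold are assumptions):

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-confidence box, drop boxes that overlap it too much, repeat."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```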
  • CNN - Multiple Regions - Slide - Convolve - Feature Extraction
  • RCNN - Selective search extracts about 2000 region proposals; each patch is fed to a CNN, followed by an SVM to predict the class of each patch.
  • Fast RCNN - Still uses selective search for region proposals, but runs the CNN once over the whole image and classifies each proposed region from the shared feature map.
  • Faster RCNN - Replaces selective search with a very small convolutional network, the Region Proposal Network, to generate regions of interest; the dedicated region proposal network is followed by a classifier. Roughly 7 FPS (frames per second).
  • Feature Pyramid Network (FPN) is a feature extractor designed with feature pyramid concept to improve accuracy and speed
YOLO and SSD - Regression-based detectors
  • YOLO treats detection as a simple regression problem: it takes an input image and learns the class probabilities and bounding box coordinates directly
  • YOLO uses DarkNet as the backbone for feature extraction, followed by additional convolutional layers.
  • One limitation of YOLO is that each grid cell predicts only one class, so it struggles with very small objects
  • Predictions (object locations and classes) are made by one single network
  • SSD is a single-shot detector that uses a VGG16 network as the feature extractor (equivalent to the CNN in Faster R-CNN), with custom convolution layers added afterward and convolution filters used to make the predictions. SSD only uses upper layers for detection and therefore performs much worse on small objects.
Gradient Calculation
  • Crucial for backpropagation
  • The error is a function of the edge weights
  • Find the combination of w1, w2, ..., wn that minimizes the error function y
  • Minimal error is the goal
  • This is an optimization problem in high dimensions
  • In general it is NP-hard (non-deterministic polynomial-time hard), so we rely on simulations and approximations
  • Compare with genetic-algorithm style search for problems such as the TSP (travelling salesman problem)
  • Compute the gradient and adjust the edge weights
  • Gradient = vector of partial derivatives; it gives the direction of steepest change
  • The derivative of the sigmoid function is easy to code
  • Learning via backpropagation requires computing the derivative of the activation function
  • deltaoutput = error * dsigmoid(sum)
  • Errors flow from the output layer back to the input layer, hence the name backpropagation
Backpropagation
  • Equation for updating the edge weights (see the sketch below)
  • Delta = learningrate * gradient + momentum * previousChange
  • Learning rate - how fast we learn; needs to be tuned so training converges
  • Momentum - helps avoid getting stuck in local minima
  • Gradient = derivative of the loss with respect to the weight
  • Error / loss signal = predicted - actual
  • Weights are updated from the output layer back toward the input layer
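A plain NumPy sketch of these update rules for a single sigmoid neuron; the data, learning rate and momentum values are toy assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)                 # derivative of the sigmoid activation

rng = np.random.default_rng(0)
w, b = rng.normal(size=2), 0.0           # randomly initialize weights / bias
x, target = np.array([0.5, -1.0]), 1.0   # one toy training example
lr, momentum = 0.5, 0.9                  # learning rate and momentum (illustrative values)
prev_change = np.zeros(2)

for _ in range(100):
    s = w @ x + b                        # weighted sum
    predicted = sigmoid(s)               # forward pass
    error = predicted - target           # error / loss signal = predicted - actual
    delta_output = error * dsigmoid(s)   # deltaoutput = error * dsigmoid(sum)
    gradient = delta_output * x          # d(loss) / d(weight)
    change = lr * gradient + momentum * prev_change   # Delta = learningrate*gradient + momentum*previousChange
    w -= change                          # update edge weights, output side back toward input
    b -= lr * delta_output
    prev_change = change
```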
Backprop Good Read - Link

CNN - Sliding window, reduces dimensions, works best on images
RNN - Sequence model that retains some portion of history. Limited parallel training; hard to capture relationships when points are far apart
Transformers - Learn, in multiple ways, the relationship between each item in the input sequence and all other items in the input
Encoder - Transforms inputs into embeddings. Several multi-headed self-attention blocks stacked on top of each other (see the sketch below)
Decoder - Maps embeddings to the output sequence
Applications - Used for Seq2Seq tasks / machine translation
Feedback - May not work well for hierarchical relationships
Reference  - Link
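
A bare-bones NumPy sketch of scaled dot-product self-attention, the building block behind the multi-headed attention mentioned above; the sequence length, embedding size and random projections are assumptions:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: every position attends to every other position."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # pairwise relationships between all items
    return softmax(scores) @ V                # attention-weighted sum of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                        # toy sequence: 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)         # (5, 8)
```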

Where Does Backprop Fail?
Every layer has multiple neurons. An RNN has memory cells, which receive inputs at different points in time; GRU, LSTM and basic cells are variations of these memory cells, and they get unrolled through time. Training still uses backprop via gradient descent: there are weights and biases for every connection, the loss is calculated from a cost/loss function, weights and biases are randomly initialized, and the optimizer descends the slope to find the values with the smallest error, converging at the best possible weights and biases for all connections.
Failure cases - Gradients stop changing (the vanishing gradient problem) and the loss remains constant (see the sketch below).
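
A tiny numerical sketch of the vanishing gradient problem: backprop through many unrolled steps multiplies many small derivatives, and the recurrent weight of 0.9 used here is purely an illustrative assumption:

```python
import numpy as np

def dsigmoid(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)          # never larger than 0.25

grad = 1.0
for _ in range(50):               # backprop through 50 unrolled time steps
    grad *= dsigmoid(0.0) * 0.9   # 0.9 stands in for a recurrent weight (assumption)
print(grad)                       # roughly 0.225**50 -- effectively zero, so early steps stop learning
```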

Ref Notes 

Pooling
  • Average Pooling and Max Pooling
  • Hierarchical Pooling
  • Attentive Pooling
Convolution
  • Causal Convolutions (see the sketch after this list)
  • Multi-Resolution CNN Block
  • Multi-Head Multi-Resolution CNN Block
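A hedged Keras sketch of a causal convolution, plus one simple reading of a multi-resolution block (parallel kernel sizes concatenated); shapes and sizes are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((4, 100, 8))   # dummy batch: 4 sequences, 100 time steps, 8 features

# Causal convolution: each output step only sees the current and past inputs, never the future.
causal = layers.Conv1D(16, kernel_size=3, padding='causal', dilation_rate=2)(x)

# One interpretation of a multi-resolution block: parallel branches with different kernel sizes, concatenated.
branches = [layers.Conv1D(16, k, padding='same')(x) for k in (3, 5, 7)]
multi_res = layers.Concatenate()(branches)

print(causal.shape, multi_res.shape)   # (4, 100, 16) (4, 100, 48)
```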
A sequence encoder is the cornerstone of many advanced AI applications, such as semantic search, question answering and machine reading comprehension.
  • Batchnorm - In effect, performs a kind of coordinated rescaling of its inputs
  • Dropout - Randomly disables neurons during training, in order to force other neurons to learn as well
  • L1 regularization - The cost added is proportional to the absolute value of the weight coefficients
  • L2 regularization - The cost added is proportional to the square of the weight coefficients (a Keras sketch combining these follows below)
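A small Keras sketch combining these regularizers in one model; the layer sizes and regularization strengths (0.01) are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    layers.Input(shape=(20,)),                                # 20 input features (assumed)
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l1(0.01)),   # L1: cost proportional to |weights|
    layers.BatchNormalization(),                              # coordinated rescaling of its inputs
    layers.Dropout(0.5),                                      # randomly disables neurons during training
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(0.01)),   # L2: cost proportional to weights squared
    layers.Dense(1, activation='sigmoid'),
])
```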

Bayesian Convolutional Neural Networks (Bayesian CNNs) and traditional Convolutional Neural Networks (CNNs) are both types of neural networks used for processing grid-like data, such as images. The main difference between them lies in their approach to handling uncertainty and learning model parameters.

Convolutional Neural Networks (CNNs) are a type of feedforward neural network that use convolutional layers to learn local features in the input data. CNNs are trained using backpropagation and optimization algorithms like stochastic gradient descent to minimize a loss function. The learned parameters (weights and biases) are point estimates, meaning that they do not capture the uncertainty in the model.

Bayesian Convolutional Neural Networks (Bayesian CNNs) extend traditional CNNs by incorporating Bayesian inference to learn the distribution of model parameters instead of point estimates. This allows Bayesian CNNs to capture the uncertainty in the model, which can be useful for tasks where uncertainty quantification is important, such as medical image analysis or safety-critical applications.

Here are some pros and cons of Bayesian CNNs compared to traditional CNNs:

Pros:
  • Uncertainty quantification - Bayesian CNNs can provide a measure of uncertainty in their predictions, which can be useful for decision-making and risk assessment.
  • Regularization - The Bayesian approach naturally incorporates regularization, which can help prevent overfitting and improve generalization.
  • Robustness - Bayesian CNNs can be more robust to adversarial examples and noisy data, as they take the uncertainty in the model parameters into account.
Cons:
  • Computational complexity - Bayesian CNNs are generally more computationally expensive than traditional CNNs, as they require sampling from the posterior distribution of model parameters or approximating it using techniques like variational inference.
  • Implementation complexity - Implementing Bayesian CNNs can be more challenging than traditional CNNs, as it requires additional tools and techniques for handling the Bayesian aspects.
Here's a simple example of a traditional CNN using Python and TensorFlow:
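A minimal sketch along those lines, trained here on random dummy data standing in for real images:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Random dummy data standing in for real images: 100 samples of 28x28 grayscale, 10 classes.
x_train = np.random.rand(100, 28, 28, 1).astype("float32")
y_train = np.random.randint(0, 10, size=(100,))

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(16, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=2, batch_size=16)
```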



Happy Mastering DL!!!
