Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): 2018

December 31, 2018

New Year Surprise - Featured in Top 100 Data Science Blogs

2018 delivered a surprise today. I was wondering whether my blog had improved in the rankings, and it appears in the latest list of Top 100 Data Science Blogs, Websites & Newsletters To Follow in 2019.

My blog is in the #59 spot :). It moved up from #76 (honorary mention) to #59 this year.

Keep Learning, Keep Sharing Knowledge.

Wish you all a blessed, Successful, Happy and prosperous 2019

December 30, 2018

Research Papers and Learning Resources

Happy Mastering DL!!!

December 28, 2018

Day #175 - Videos and Unsupervised Learning

Dense Trajectory features

Detect key points to track
Tracklets obtained and features get accumulated
Feature points at different scales
Track using optical flow methods
Bunch of features extracted in local coordinate system of every track
15 frames, x,y positions
Extract features in local coordinate system between two frames
Differences in the key points reflects the optical flow

Key Point Detection

Detect features
Run Optical flow algos
Displacement vector between every single frame
Optical flow methods in python check
Histogram bins
SVM
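
The histogram step above can be sketched in a few lines. This is a minimal sketch, assuming the displacement vectors have already been produced by some optical flow method; the function name flow_histogram, the 8 bins and the magnitude weighting are illustrative choices, not taken from the lecture.

```python
import math

def flow_histogram(displacements, n_bins=8):
    """Bin (dx, dy) displacement vectors by direction, weighted by magnitude."""
    hist = [0.0] * n_bins
    bin_width = 2 * math.pi / n_bins
    for dx, dy in displacements:
        magnitude = math.hypot(dx, dy)
        if magnitude == 0.0:
            continue  # a static point has no direction
        angle = math.atan2(dy, dx) % (2 * math.pi)  # map to [0, 2*pi)
        hist[int(angle // bin_width) % n_bins] += magnitude
    total = sum(hist)
    # L1-normalise so tracks with different amounts of motion are comparable
    return [h / total for h in hist] if total > 0 else hist

# Toy track (hypothetical values): mostly moving right, one step up.
track = [(1.0, 0.0), (2.0, 0.0), (0.0, 1.0)]
descriptor = flow_histogram(track)
```

The fixed-length descriptor is the kind of vector that could then be handed to an SVM.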

Deep Network

Process frame in Alexnet
Encode 15 frame in CNN
Sharing weights spatially
Extend filters in small amounts in time
11 x 11 x T (Temporal Extent)
3 (R,G,B)
Sliding filters in time
Carving out activation volume

Spatio-Temporal ConvNets

3D Conv in Space and Time
Slow Fusion 3D Conv Approach
Learned filters on first layers (Smaller filters - More layers)
Spatio-Temporal ConvNets
Datasets are not quite there
3D Conv, LSTM
Single frame networks are baseline

Spatio-Temporal ConvNets

C3D
3 x 3 x 3 conv, 2 x 2 x 2 pool
VGG in 3D
3D Conv is Painful
Two-stream approach: one ConvNet looks at the image
The other looks at optical flow
Extract Optical Flow, Fuse in the end
Optical flow contains lot of information
Need to check - Compute optical flow between two frames

Long time Spatial - Temporal ConvNets

Videos with temporal dependencies
Events larger than timescale
Attention model
Attention over different parts of the video
Process images at detail level, resize at global level
RNN
Video - Classes prediction at point in time
RNN allow to have infinite context
3D conv, lstm
CNN + LSTM

Video Classification Architectures

RNN + 3D Convnet
RNN before the ConvNet processes the image (Idea)
ConvNets between frames (Scales) - Speed up and Slow down
Background Subtraction to only look at things of interest (Check code)
Weight sharing between ConvNet and RNN

Idea

Get Rid of RNN
Convnet
All neurons in the convnet are recurrent
GRU slightly different update formula
Replace through the conv
Convolve over input, Output and then RNN
RNN Convnet (Check code)

Summary

Local motion 3D Conv
Global motion LSTM

Research papers of video + audio not there
Supervised Learning - Dataset has data x, label y. Goal in supervised learning is function that takes input x and outputs y
Example - Classification, Regression, Object detection, Semantic segmentation, image captioning

Unsupervised Learning

Just data, no labels
Learn Some structure on data
Examples - Clustering, dimensionality reduction, feature learning, generative models

Autoencoders

Traditional - Feature Learning
Variational - Generate Samples

Input x -> Pass thru Encode network -> Learnable feature Z
Reconstruction - Reproduce data x from features z
Decoder - Smaller features - Blows back to original data
Encoder / Decoder sometimes share weights
PCA Optimal for L2 Reconstruction
Our intention is to learn features that are useful for other tasks
Generate Fake images like original images
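
The note that PCA is optimal for L2 reconstruction can be illustrated with a linear autoencoder whose encoder and decoder share weights. This is a minimal sketch, assuming NumPy is available; the toy data, the 2-D code size and all variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples in 5-D that mostly vary along 2 hidden directions.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
x = latent @ mixing + 0.01 * rng.normal(size=(200, 5))
x = x - x.mean(axis=0)  # PCA assumes centred data

# Principal directions are the right singular vectors of the data matrix.
_, _, vt = np.linalg.svd(x, full_matrices=False)
components = vt[:2]      # shared encoder / decoder weights, shape (2, 5)

z = x @ components.T     # encoder: 5-D input x -> 2-D features z
x_hat = z @ components   # decoder: 2-D features z -> 5-D reconstruction

mse = float(np.mean((x - x_hat) ** 2))  # L2 reconstruction error
```

A non-linear autoencoder replaces the two matrix products with small networks and learns the weights by gradient descent instead of an SVD.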

Variational AutoEncoder

Assume a prior distribution exists in the outside world
Assume distribution is Gaussian
Bayes rule gives the posterior
Probability given observed data
Unsupervised data to learn features
Maximum likelihood
Variational Inference
Insert Extra constant, break into two different terms
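
The "insert extra constant" step is commonly written as multiplying and dividing by an approximate posterior inside the log-likelihood, which splits it into two terms. A sketch of that decomposition, assuming the usual notation (encoder q_phi(z|x), decoder p_theta(x|z), Gaussian prior p(z)), none of which appears in these notes:

```latex
\begin{aligned}
\log p_\theta(x)
  &= \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]
   + D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right) \\
  &\ge \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
   - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)
\end{aligned}
```

The inequality holds because the dropped KL term is non-negative. In the bound, the first term rewards reconstruction and the second keeps the encoder close to the prior.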

Adversarial Networks - Generate Samples

Generator - Mini batches of random noise
Discriminator - both original and fake images
Architecture bigger and powerful using multiscale processing
Generate at multiple scales
Low-Resolution -> Upsample -> Delta on top of it -> Upsample ..

Variational Autoencoder

Adversarial noise inputs can be changed as we generate
Interpolate between random points in latent space

GAN

Learns Nice useful representations
Variational Autoencoder
Add Adversarial network to VAE
Discriminator Network added
Pixel Loss
Generate Samples like Alexnet

Python / OpenCV code to try

Program #1 - Generate optical flow between frames, Compute SIFT between two frames, Color the moved pixels
Program #2 - Upsample Images
Program #3 - Generate Optical flow data, Use it to feed to CNN to classify actions
Program #4 - GANs
Program #5 - CNN with 3x3, 1x7 different filters and training / test accuracy

Happy Mastering DL!!!

December 27, 2018

Day #174 - CS231 - Lecture 13: Segmentation, soft attention, spatial transformers

Key Lessons

Segmentation
Semantic Segmentation
Instance Segmentation

Soft Attention

Discrete Locations
Continuous Locations (Spatial Transformers)

Inception-v4

Deep Network
Repeated Modules
No Padding
Strided Convolution and MaxPooling
Efficient Convolution Tricks, 1 X 7, 7 X 1

Segmentation
Two Subtasks involved
Semantic

Input Image / Fixed number of classes
Background class
Label Every pixel with one of semantic class
Higher level understanding of images
Not aware of instances
Count of objects not known

Instance

Detect Instances given category, label
Simultaneous detection and segmentation

Semantic Segmentation

Label without instances
Given input image, Extract patch from image
Run it through CNN
Classify centre pixel as COW
Run over entire image
Expensive Operation
100 Pixel order of magnitude receptive field
Semantic Segmentation - Multi scale
Super Pixel / Segmentation Trees

Semantic Segmentation - Refinement

Apply CNN once to get labels
Increases Effective Receptive Field for output
Recurrent Convolutional network
Iteratively refine outputs

Semantic Segmentation - Upsampling

Input Images run through convolutions extract feature maps
Learn upsampling as part of network
Skip connections - Help in low level details
Convolutional features from different layers in network
Accuracy - Classification metrics, intersection over union (Ground truth vs Region Predicted)
Learnable Upsampling - Deconvolution (Convolution Transpose), Fractionally strided convolution
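
Learnable upsampling is easiest to see in one dimension. A minimal sketch of a transposed (fractionally strided) convolution, assuming no padding; the function name and the example kernel are illustrative only, and in a real network the kernel weights would be learned.

```python
def conv_transpose_1d(x, kernel, stride=2):
    """1-D transposed convolution without padding."""
    out_len = (len(x) - 1) * stride + len(kernel)
    out = [0.0] * out_len
    for i, value in enumerate(x):
        for k, weight in enumerate(kernel):
            # each input value "stamps" a scaled copy of the kernel into the output
            out[i * stride + k] += value * weight
    return out

# With a kernel of ones and stride 2 this reduces to nearest-neighbour upsampling.
upsampled = conv_transpose_1d([1.0, 2.0, 3.0], [1.0, 1.0], stride=2)
```

When the kernel is longer than the stride, neighbouring stamps overlap and their contributions are summed.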

Instance Segmentation

Generalization
Distinguish Instances
Detect and label instances
End up looking like detection models
SDS (Simultaneous Detection and Segmentation)
RCNN - External Region proposals
Box CNN, Region CNN
Mask with mean color of dataset

Hypercolumns

Region Classification
Region Refinement
Extract Convolutional features
Upsample and combine them together
Bi-linear / nearest neighbor for upsampling

Cascades

High Resolution Images
From Conv feature maps
Each Feature Map predicts region of interest
Reshape boxes to fixed size

Attention Models

RNN for Captioning

H X W X 3
Input Image -> CNN -> Features -> Hidden State -> First Word -> Second Word
One chance to look at input image
Weighted vector from input features
Hidden states sent to the model

Soft Vs Hard Attention

Grid of features
Distribution over grid locations
Attention - Nice Interpretable Outputs

Soft Attention for Translation

Many to Many RNN
Sequence to Sequence model
Generate Output sequence similar to captioning
Content based addressing
Probability distribution directly
Soft Attention Easy to implement and train
Constrained to fixed grid provided by feature maps
Spatial Transformer Network - Similar to texture mapping
Attend to arbitrary parts of input in a nice way
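
Soft attention over a grid of features amounts to a softmax followed by a weighted sum. A minimal sketch, assuming NumPy and assuming the per-location scores have already been produced by the RNN hidden state; the names soft_attention, features and scores are hypothetical.

```python
import numpy as np

def soft_attention(features, scores):
    """features: (locations, dim); scores: (locations,) unnormalised."""
    weights = np.exp(scores - scores.max())   # subtract max for numerical stability
    weights /= weights.sum()                  # softmax: distribution over grid locations
    context = weights @ features              # expected feature vector under that distribution
    return weights, context

# Three grid locations with 2-D features; equal scores give uniform attention.
features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
scores = np.array([0.0, 0.0, 0.0])
weights, context = soft_attention(features, scores)
```

Because every step is differentiable, this version trains with ordinary backpropagation, which is what makes soft attention easy to implement.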

Happy Mastering DL!!!

December 26, 2018

Day #173 - CS231N - Lecture 11: Deep Learning libraries - Notes

Key Summary

Caffe

From Berkeley
Widely used for CNN
Written in C++
Python and Matlab bindings
Good for standard feedforward vanilla CNN
Blob - Weights, Pixel values, Intermediate values (n dimensional tensor)
Layer - Function - Input / Output blob
Common Problem - Not much documentation on layer types
Net - Combines bunch of layers
Solver - Intended to run forward / backward in network / resume checkpoints
Gradient Descent, RMSProp are in the solver
Protocol buffers - Binary, strongly typed, JSON-like format for serializing data to disk
caffe.proto - Defines all the protocol buffer types
Convert Data - File format LMDB
.prototxt file to define the net
Solver - Learning Rate, Regularization rates

Torch

NYU
Written in C and Lua
Used in Fb and Deepmind
Lua - High Level Scripting for embedded devices, similar to JS
JIT compilation to make things fast
Learn Lua in 5 mins site
Torch tensors are just like numpy arrays
GPU is just another data type
optim package implements momentum, Adam
Caffe has Nets and Layers
Torch just has modules
Modules are classes written in Lua
Containers to combine multiple modules
nngraph hookup more complex topology easily
Not great for RNN

Backward Pass

updateGradInput
accGradParameters - Accumulate grad parameters - Receive gradients from upstream

Workflow in Torch

Preprocess data
Train a model in Lua / Torch
Use Trained model

Theano

University of Montreal
High Level Wrappers - Keras, Lasagne
Computational graphs
Debugging hard

Lasagne - High Level Wrapper for theano

Tensorflow

Similar to Theano
From Professional Engineers
First ground up from Industrial Place
Create Placeholders for data and labels - Create input nodes
Initialize variables with numpy arrays
Compute Score, Probs, Loss
SGD to minimise loss
Wrap it in Session Code
One hot - Y always integer
In some frameworks it is a vector where everything is zero except the correct class
Tensorflow wants one hot
Tensorboard to visualise the network
Async or Sync training
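
The two label formats mentioned above (integer class index versus one-hot vector) can be converted with a one-liner. A minimal sketch in plain Python; the function name one_hot is illustrative and is not a reference to any framework's API.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into vectors that are zero except at the correct class."""
    return [[1 if c == y else 0 for c in range(num_classes)] for y in labels]

# Two samples, three classes: class 0 and class 2.
encoded = one_hot([0, 2], num_classes=3)
```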

Projects and Architecture Inputs

#1. Image Captioning

Need Pretrained models
Need RNNs

#2. Semantic Segmentation

Need pretrained model
Need loss function

#3. Object Detection

Pretrained models
Custom imperative code
Caffe + Python

Keras - Good Presentation

Happy Mastering DL!!!

December 24, 2018

Day #172 - CNN - One Pager, Notes and Best Practices

Summary from MIT / Stanford Classes / Geoff Hinton Papers / Readings.

Deep Learning - "Learn directly from data without manual feature engineering"

A deep learning architecture is a multilayer stack of simple modules, many of which compute simple non-linear input-output mappings. The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule of derivatives.

Overview

CNN Key Components

Input
Convolution
Strides
Pooling
Fully Connected
Output

Image - Represented as RGB Matrix with Height and width = 3 color channels X Height X width. Color represented in [0,255] Range
Data Augmentation - Random mix of translation, rotation, stretching, shearing, lens distortions
Convolution - Convolve the filter with the image (Slide over image spatially computing dot products). Convolution preserves the spatial relationship between pixels
Convolution Filters - Replace Large Convolutions (5x5, 7x7) with stacks of 3 x 3 convolutions
Activation Function - Introduce non-linearities in the network. Non-Linearity - Often Relu (Image data highly non-linear)
Transfer Learning - Works better for similar types of data, Freeze network and retrain top layer
Pooling - Downsampling for each feature map. MAX Pooling - Shrinks size using max

Advantages of downsampling - Decreased size of input for upcoming layers, Works against overfitting
Flattening - Convert into 1D feature vector. Flattens all its structure to create a single long feature vector
Fully Connected - Has a neuron fully connected to output, Contains neurons that connect to the entire input volume as in ordinary neural networks

After Forward pass we compute - Loss. In Backward pass we Compute Gradient to backpropagate and update weights
Fully Connected Layer - Layer where the maxpooled matrix is flattened. FCN is fed into softmax / cross entropy layer for prediction.
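
The effect of strides, padding and pooling on feature-map size follows a standard formula, output = (input - filter + 2 * padding) / stride + 1. A minimal sketch in plain Python; the function name and the example layer sizes are illustrative and do not come from the notes.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    span = input_size - filter_size + 2 * padding
    if span % stride != 0:
        raise ValueError("filter does not tile the input evenly")
    return span // stride + 1

same_size = conv_output_size(32, 5, stride=1, padding=2)   # 5x5 conv, pad 2 keeps 32
strided = conv_output_size(227, 11, stride=4)              # 11x11 conv, stride 4
pooled = conv_output_size(32, 2, stride=2)                 # 2x2 max pool halves the size
```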

CNN Concepts
Sliding Windows - Pick a portion of image and run detection, Keep Moving Horizontally, Vertically to pick and predict for each selected Area. Prediction Confidence and Threshold filter to select the same. Generate Bounding Boxes.
Non-max Suppression (NMS) - Several bounding boxes, Select box with highest probability for detected objects

Keras

Dense Layer - A dense layer represents a matrix vector multiplication. each input node is connected to each output node.
Dropout - A dropout layer is used for regularization
Hidden Layer - A sparse layer is a hidden layer that is not dense
Fully-connected layer is a (usually) cheap way of learning non-linear combinations of these features. In place of fully connected layers, we can also use a conventional classifier like SVM. Fully Connected layers perform classification based on the features extracted by the previous layers.
Convolutional layers are providing a meaningful, low-dimensional, and somewhat invariant feature space

Experiments

Convolution Level - Standard Convolution, Dilated Convolution, Transposed Convolution, Strided Convolution, Conv2D, ConvLSTM2D
Kernels - 3x3 and 2x2 kernels, Convolution Tricks - 1 X 7, 7 X 1, Replace Large Convolutions (5x5, 7x7) with stacks of 3 x 3 convolutions,
Loss Functions - RMSE, Cross Entropy
Optimizers - SGD, AdaGrad, AdaDelta, RMSProp
Activation Functions - Sigmoid, Tanh, Relu

CNN - Recognize Spatial patterns

RNN - Recognize Sequential patterns

GAN - Two Networks, One to Generate, Another One Testing Output of Generation

Reinforcement Learning - Trial and Error Learning

Questions
How Forward feed and backwards prop work.

A Feed-Forward Neural Network is a type of Neural Network architecture where the connections are "fed forward", i.e. do not form cycles (like in recurrent nets).
Input to hidden and from hidden to output layer.
The values are "fed forward"

Backpropagation is a training algorithm consisting of 2 steps:

Feed forward the values
Calculate the error and propagate it back to the earlier layers.

So to be precise, forward-propagation is part of the backpropagation algorithm but comes before back-propagating.

Input for backpropagation is output_vector, target_output_vector, output is adjusted_weight_vector.
Backpropagation is an efficient method of computing gradients in directed graphs of computations, such as neural networks
Stochastic gradient descent (SGD) is an optimization method used e.g. to minimize a loss function.
SGD writes code in weights of neural network

The various loss functions and their considerations

Mean Squared Error (MSE), or quadratic, loss function is widely used in linear regression as the performance measure
Cross Entropy is commonly-used in binary classification (labels are assumed to take values 0 or 1)
Cross Entropy Loss is usually used in classification problems. In essence, it is a measure of difference between the desired probability distribution and the predicted probability distribution
Negative Log Likelihood loss function is widely used in neural networks, it measures the accuracy of a classifier.
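
For a single sample with one correct class, cross entropy and negative log likelihood reduce to the same number: minus the log of the probability assigned to the correct class. A minimal sketch in plain Python; the logits are made-up values.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    shift = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Negative log likelihood of the correct class."""
    return -math.log(probs[target_index])

probs = softmax([2.0, 1.0, 0.1])
loss = cross_entropy(probs, target_index=0)
```

The loss is zero only when the model puts probability 1 on the correct class, and it grows without bound as that probability approaches 0.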

Loss Functions - Reference

Triplet Loss is another loss commonly used in CNN-based image retrieval. During the training process, an image triplet (Ia, Ip, In) is fed into the model as a single sample, where Ia, Ip and In represent the anchor, positive and negative images respectively. The idea is that the distance between the anchor and positive images should be smaller than that between the anchor and negative images.
Contrastive Loss is often used in image retrieval tasks to learn discriminative features for images.
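
The triplet idea can be written as a hinge on the two distances. A minimal sketch in plain Python, assuming Euclidean distance on already-computed embeddings and a margin of 1.0; both choices are assumptions for illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by at least the margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Well-separated triplet: no loss.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 3.0])
# Positive is further away than the negative: the loss is positive.
hard = triplet_loss([0.0, 0.0], [0.0, 3.0], [0.0, 1.0])
```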

More Questions

The various activation functions and why they are needed
Optimization functions and why they are needed
Bias and variance / over and under fitting - what causes them, and the various methods to handle them
CNNs, RNNs, GANs, attention, Transformer, unsupervised and semi supervised, RL, decision trees, Ensemble Learning, SVM, Auto encoders…Understand interpretation, bias, fairness
The statistics theory (the more the better), and the linear algebra and calculus technicalities

IoU is a measure of the overlap between two bounding boxes: The ground truth bounding boxes (i.e. the hand labeled bounding boxes), The predicted bounding boxes from the model

Non-maximal suppression - Pick the bounding box with the maximum box confidence. Output this box as prediction.
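
IoU and non-maximal suppression fit together as follows. This is a minimal sketch of a common greedy variant in plain Python: besides picking the most confident box, it also discards boxes that overlap an already kept box too much. The box format (x1, y1, x2, y2) and the 0.5 threshold are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, most confident first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # keep a box only if it does not overlap a more confident kept box
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

Here the second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the distant third box survives.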

CNN - Multiple Regions - Slide - Convolve - Feature Extraction
RCNN - Selective Search. 2000 regions extracted. Feed these patches to CNN, followed by SVM to predict the class of each patch.
Fast RCNN - Selective search generates the region proposals.
Faster RCNN replaces selective search with a very small convolutional network called Region Proposal Network to generate regions of interest. Faster R-CNN has a dedicated region proposal network followed by a classifier. Runs at about 7 FPS (frames per second)
Feature Pyramid Network (FPN) is a feature extractor designed with feature pyramid concept to improve accuracy and speed

YOLO and SSD Regression-Based detectors

Yolo detection is a simple regression problem which takes an input image and learns the class probabilities and bounding box coordinates
YOLO uses DarkNet to make feature detection followed by convolutional layers.
One limitation of YOLO is that it predicts only one class per grid cell; hence, it struggles with very small objects
Predictions (object locations and classes) are made from one single network

SSD is a single shot detector using a VGG16 network as a feature extractor (equivalent to the CNN in Faster R-CNN). Custom convolution layers are added afterward, and convolution filters are used to make predictions. SSD only uses upper layers for detection and therefore performs much worse for small objects.

Gradient Calculation

Crucial for backpropagation
Function with edge weights
Edge weights yield Error weights
Find the combination of w1,w2..wn that will minimize function y
Minimal error is the goal
The optimization problem in higher dimensions
NP-hard problem, make simulations, approximations
NP-hardness (non-deterministic polynomial-time hard)
Simulate genetic algorithms
TSP
Compute gradient adjust edge weights
Gradient (Partial derivative)
Direction of gradient
Derivative sigmoid function
Easy to code
For backpropagation, compute the derivative of the activation function
deltaoutput = error*dsigmoid(sum)
Propagating from the O/P layer back to the I/P layer is the so-called backpropagation
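
The output delta for a single sigmoid neuron can be computed and sanity-checked like this. A minimal sketch in plain Python; the input value, the target, and the sign convention (error = predicted - actual) are assumptions for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dsigmoid(x):
    """Derivative of the sigmoid, expressed through the sigmoid itself."""
    s = sigmoid(x)
    return s * (1.0 - s)

weighted_sum = 0.5              # pre-activation of the output neuron (made-up value)
target = 1.0
output = sigmoid(weighted_sum)
error = output - target         # predicted - actual
delta_output = error * dsigmoid(weighted_sum)

# Finite-difference check that dsigmoid really is the slope of sigmoid.
eps = 1e-6
numeric = (sigmoid(weighted_sum + eps) - sigmoid(weighted_sum - eps)) / (2 * eps)
```

The fact that the derivative reuses the forward value is one reason the sigmoid is easy to code.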

Backpropagation

Equation for updating edge weights
Delta = learningrate*gradient + momentum*previousChange
Learning Rate - How fast we learn, optimum to converge
Momentum - Avoid local minimums
derivative (loss) / (weight)
Error / Loss = predicted - actual
Update weights from right to left (output layer back to input layer)
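
The update rule with momentum can be tried on a one-weight problem. A minimal sketch in plain Python, assuming the toy loss (w - 3)^2 and the hyperparameter values shown; none of these numbers come from the notes.

```python
def minimise(gradient_fn, weight, learning_rate=0.1, momentum=0.5, steps=100):
    """Gradient descent where each step also carries part of the previous step."""
    previous_change = 0.0
    for _ in range(steps):
        delta = learning_rate * gradient_fn(weight) + momentum * previous_change
        weight -= delta               # move against the gradient
        previous_change = delta
    return weight

# Loss (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
best = minimise(lambda w: 2.0 * (w - 3.0), weight=0.0)
```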

Backprop Good Read - Link

CNN - Sliding Window, Reduce Dimensions, Works best on images

RNN - Sequence, Retain some portion of history. Limit to parallel training, Hard to capture relationship when points are far

Transformers - Learn multiple ways, the relationship between each item in the input sequence to all other items in the input

Encoder - Transform inputs to embeddings. Several Multiheaded self attention models stacked up on each other.

Decoder - Embedding to output sequence

Applications - Used for Seq2Seq / Machine translations

Feedback - May not work for hierarchical relationships

Reference - Link
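
The core of the encoder, relating every item in the sequence to every other item, is scaled dot-product self-attention. A minimal single-head sketch, assuming NumPy; the random weight matrices stand in for learned projections, and the sizes (4 items, 8 dimensions) are arbitrary.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (sequence, dim). Returns attended values and the attention weights."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])        # similarity of each item to all items
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # each row is a distribution
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                        # 4 items with 8-D embeddings
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

A multi-headed layer runs several such heads with different projections and concatenates their outputs. Nothing in the computation is sequential across items, which is why it parallelises better than an RNN.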

Where Backprop Fails?

Every layer has multiple neurons. An RNN has memory cells, which receive inputs at different points in time; GRU, LSTM and the basic cell are variations of such cells. Memory cells get unrolled over time, and backprop is then applied via gradient descent. The number of weights / biases grows with the number of connections. The loss is calculated from a cost / loss function. Weights / biases are randomly initialized, which places the network somewhere on the slope of the loss surface. The optimizer then attempts to descend the slope to find optimal values (smallest value of error), converging at the best possible values of weights / biases for all connections between neurons.

Failure cases - Gradients don't change (Vanishing Gradient problem), Loss remains constant

Ref Notes

4 Sequence Encoding Blocks You Must Know Besides RNN/LSTM in Tensorflow

Average Pooling and Max Pooling
Hierarchical Pooling
Attentive Pooling

Convolution

Causal Convolutions
Multi-Resolution CNN Block
Multi-Head Multi-Resolution CNN Block

A sequence encoder is the cornerstone of many advanced AI applications, such as semantic search, question-answering, machine reading comprehension

Batchnorm - in effect, performs a kind of coordinated rescaling of its inputs
Dropouts: Randomly disables neurons during the training, in order to force other neurons to be trained as well
L1 regularization - Cost added is proportional to the absolute value of the weights coefficients
L2 regularization - Cost added is proportional to the square of the value of the weights coefficients
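
The two penalties differ only in how each weight is measured. A minimal sketch in plain Python; the weights and the regularization strength are made-up values.

```python
def l1_penalty(weights, strength):
    """Cost proportional to the absolute value of each weight."""
    return strength * sum(abs(w) for w in weights)

def l2_penalty(weights, strength):
    """Cost proportional to the square of each weight."""
    return strength * sum(w * w for w in weights)

weights = [0.5, -2.0, 0.0]
l1 = l1_penalty(weights, strength=0.1)   # 0.1 * (0.5 + 2.0)
l2 = l2_penalty(weights, strength=0.1)   # 0.1 * (0.25 + 4.0)
```

Either value is added to the data loss before gradients are computed. L2 punishes the large weight much harder than L1 does.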

Bayesian Convolutional Neural Networks (Bayesian CNNs) and traditional Convolutional Neural Networks (CNNs) are both types of neural networks used for processing grid-like data, such as images. The main difference between them lies in their approach to handling uncertainty and learning model parameters.

Convolutional Neural Networks (CNNs) are a type of feedforward neural network that use convolutional layers to learn local features in the input data. CNNs are trained using backpropagation and optimization algorithms like stochastic gradient descent to minimize a loss function. The learned parameters (weights and biases) are point estimates, meaning that they do not capture the uncertainty in the model.

Bayesian Convolutional Neural Networks (Bayesian CNNs) extend traditional CNNs by incorporating Bayesian inference to learn the distribution of model parameters instead of point estimates. This allows Bayesian CNNs to capture the uncertainty in the model, which can be useful for tasks where uncertainty quantification is important, such as medical image analysis or safety-critical applications.

Here are some pros and cons of Bayesian CNNs compared to traditional CNNs:

Pros:

Uncertainty quantification: Bayesian CNNs can provide a measure of uncertainty in their predictions, which can be useful for decision-making and risk assessment.

Regularization: The Bayesian approach naturally incorporates regularization, which can help prevent overfitting and improve generalization.

Robustness: Bayesian CNNs can be more robust to adversarial examples and noisy data, as they take into account the uncertainty in the model parameters.

Cons:

Computational complexity: Bayesian CNNs are generally more computationally expensive than traditional CNNs, as they require sampling from the posterior distribution of model parameters or approximating it using techniques like variational inference.

Implementation complexity: Implementing Bayesian CNNs can be more challenging than traditional CNNs, as it requires additional tools and techniques for handling the Bayesian aspect.

Here's a simple example of a traditional CNN using Python and TensorFlow:

Happy Mastering DL!!!

Day #171 - ConvNets in practice - CS231n Lessons

Key Summary

RNN, For Modelling Sequences - Vanilla, LSTM
RNN for language models
CNN + RNN for image captioning
Feedforward - Feedforward function

Low Level CNN Working Practice
Making Most of Data

Data Augmentation
Images + Labels -> CNN -> Compute Loss -> Back Propagate
Images + Transformation + Labels -> CNN -> Compute Loss -> Back Propagate
Artificially expand training set, Preserve Labels, Widely used in practice
Types of Transformation
Horizontal Flip
Random Crops / Samples from Training Images / Random Scale and Rotation
Color Jitter (Randomly jitter contrast)
Color Jitter with PCA
Data Augmentation - Random mix of translation, rotation, stretching, shearing, lens distortions
Dropout/ DropConnect - Randomly drop or sets weights to zero
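
Two of the transformations above, horizontal flip and random crop, need only array slicing. A minimal sketch, assuming NumPy and an image stored as height x width x channels; the function name and sizes are illustrative.

```python
import numpy as np

def random_flip_and_crop(image, crop_size, rng):
    """Label-preserving augmentation: random horizontal flip, then a random square crop."""
    if rng.random() < 0.5:
        image = image[:, ::-1]                     # reverse the width axis
    height, width = image.shape[:2]
    top = rng.integers(0, height - crop_size + 1)
    left = rng.integers(0, width - crop_size + 1)
    return image[top:top + crop_size, left:left + crop_size]

rng = np.random.default_rng(0)
image = np.arange(8 * 8 * 3).reshape(8, 8, 3)      # stand-in for an 8 x 8 RGB image
patch = random_flip_and_crop(image, crop_size=6, rng=rng)
```

Calling it again on the same image yields a different patch with the same label, which is how the training set is artificially expanded.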

Data Augmentation - Summary

Simple to implement, Use it
Useful for small datasets
Fits into framework of noise / marginalization

Transfer Learning

You need a lot of data if you want to train / use CNNs
Train on Image Net / Pre-train model download
Treat it as feature extractor
Replace last layer with Linear Classifier
Freeze network and retrain top layer
Train only the last layers (Final Layers)
Works better for similar types of data
Edges, Color, Gabor applicable for any type of visual data
Image captioning word vectors (Pre-trained)

Convolutions

Computational workhorse

Design Efficient Network Architecture

Three stacked 3 x 3 convolutions have the same receptive field as one 7 x 7
H, W, C Filters, Stride 1

Convolution - Summary

Replace Large Convolutions (5x5, 7x7) with stacks of 3 x 3 convolutions
1 x 1 bottleneck convolutions are very efficient
Can factor N x N convolutions into 1 x N and N x 1
All the above gives fewer parameters, less compute and more non-linearity
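
The parameter savings can be checked with simple arithmetic. A minimal sketch in plain Python, assuming C = 64 input and output channels, stride 1, and ignoring biases; the helper names are illustrative.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Number of weights in one convolution layer (biases ignored)."""
    return kernel_h * kernel_w * in_channels * out_channels

def receptive_field(num_layers, kernel_size):
    """Receptive field of stacked stride-1 convolutions with the same kernel size."""
    return 1 + num_layers * (kernel_size - 1)

C = 64
single_7x7 = conv_params(7, 7, C, C)                          # 49 * C^2
stack_3x3 = 3 * conv_params(3, 3, C, C)                       # 27 * C^2
factored = conv_params(1, 7, C, C) + conv_params(7, 1, C, C)  # 14 * C^2
```

The stack of three 3 x 3 layers sees the same 7 x 7 window with roughly half the weights, and it applies three non-linearities instead of one.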

All about Convolutions (Computing them)

im2col (Convolution recast as matrix multiply)
im2col memory overhead
Depth C to match input
Take Each Convolutional weights compute inner products
FFT - Convolution theorem: the convolution of two signals equals the element-wise product of their Fourier transforms
FFT of weights, input image
Elementwise computation
Compute inverse, Speed up only for larger filters
FFT doesn't work too well in practice
FFT doesn't handle striding too well
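
Both ideas above can be checked numerically on tiny inputs. A minimal sketch, assuming NumPy, a single-channel image, stride 1 and no padding; as in most CNN code, "convolution" in the im2col part means sliding dot products without flipping the kernel. All names and test values are illustrative.

```python
import numpy as np

def im2col(image, k):
    """Stack every k x k patch of a 2-D image as one row of a matrix."""
    h, w = image.shape
    out_h, out_w = h - k + 1, w - k + 1
    cols = np.empty((out_h * out_w, k * k))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = image[i:i + k, j:j + k].ravel()
    return cols, (out_h, out_w)

def conv2d_im2col(image, kernel):
    """Convolution recast as a single matrix-vector product."""
    cols, out_shape = im2col(image, kernel.shape[0])
    return (cols @ kernel.ravel()).reshape(out_shape)

def conv2d_direct(image, kernel):
    """Reference implementation: explicit sliding window."""
    k = kernel.shape[0]
    out = np.zeros((image.shape[0] - k + 1, image.shape[1] - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

image = np.arange(25.0).reshape(5, 5)
kernel = np.array([[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]])
cols, _ = im2col(image, 3)   # 81 entries for a 25-pixel image: the memory overhead

# Convolution theorem for circular convolution of two 1-D signals.
signal = np.array([1.0, 2.0, 3.0, 4.0])
filt = np.array([1.0, 0.0, -1.0, 0.0])
via_fft = np.real(np.fft.ifft(np.fft.fft(signal) * np.fft.fft(filt)))
n = len(signal)
circular = np.array([sum(signal[m] * filt[(i - m) % n] for m in range(n))
                     for i in range(n)])
```

The patch matrix duplicates overlapping pixels, which is the im2col memory overhead noted above. The FFT route pays for three transforms, so it only wins for larger filters.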

Fast Algorithms

Strassen's Algorithm
Naive matrix multiplication

Processing

NVIDIA is much more common for GPUs
GPU good at matrix multiplication
Floating point precision discussions
16 bit floating point operations from Nervana
Lower precision makes things faster and still works

Happy Mastering DL!!!
