"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

December 31, 2018

New Year Surprise - Featured in Top 100 Data Science Blogs

2018 gave this surprise today. I was wondering whether my blog had improved in the rankings, and it has, according to the latest list of Top 100 Data Science Blogs, Websites & Newsletters To Follow in 2019.

My blog is in the #59 spot :). From #76 (honorary mention), it has moved up to #59 this year.


Keep Learning, Keep Sharing Knowledge.

Wish you all a blessed, successful, happy and prosperous 2019

December 28, 2018

Day #175 - Videos and Unsupervised Learning

Dense Trajectory features
  • Detect key points to track
  • Tracklets obtained and features get accumulated
  • Feature points at different scales
  • Track using optical flow methods
  • Bunch of features extracted in local coordinate system of every track
  • 15 frames, x,y positions
  • Extract features in local coordinate system between two frames
  • Differences in the key points reflects the optical flow



Key Point Detection
  • Detect features
  • Run Optical flow algos
  • Displacement vector between every single frame
  • Optical flow methods in python check
  • Histogram bins
  • SVM
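
The key point pipeline above (displacement vectors -> histogram bins -> SVM) can be sketched roughly as below. This is only a minimal illustration: the function name, the bin count and the sample displacements are made up, and a real pipeline would get the displacements from an optical flow routine and hand the histogram to a classifier.

```python
import math

def flow_orientation_histogram(displacements, num_bins=8):
    """Bin (dx, dy) displacement vectors by direction, weighted by magnitude."""
    hist = [0.0] * num_bins
    bin_width = 2 * math.pi / num_bins
    for dx, dy in displacements:
        magnitude = math.hypot(dx, dy)
        if magnitude == 0.0:
            continue  # a static point carries no direction
        angle = math.atan2(dy, dx) % (2 * math.pi)  # map to [0, 2*pi)
        index = min(int(angle / bin_width), num_bins - 1)
        hist[index] += magnitude
    total = sum(hist)
    # Normalise so clips with different numbers of tracks are comparable.
    return [h / total for h in hist] if total > 0 else hist

# Hypothetical displacements of tracked key points between two frames.
displacements = [(1.0, 0.0), (2.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
features = flow_orientation_histogram(displacements, num_bins=4)
```

The resulting `features` list is the kind of fixed-length vector that could then be fed to an SVM.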

Deep Network
  • Process frame in Alexnet
  • Encode 15 frames in CNN
  • Sharing weights spatially
  • Extend filters in small amounts in time
  • 11 x 11 x T (Temporal Extent)
  • 3 (R,G,B)
  • Sliding filters in time
  • Carving out activation volume
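
To see what "sliding filters in time" does to the activation volume, here is a small shape calculation. It is a sketch under assumptions: the 227 x 227 frame size, spatial stride 4 and temporal extent T = 3 are example values chosen for illustration, not numbers from the lecture notes.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Standard output-size formula for one dimension of a convolution."""
    return (input_size + 2 * padding - filter_size) // stride + 1

def conv3d_output_shape(frames, height, width, t_extent, k, stride_t=1, stride_xy=1):
    """Shape (time, height, width) produced by one k x k x t_extent filter."""
    return (
        conv_output_size(frames, t_extent, stride_t),
        conv_output_size(height, k, stride_xy),
        conv_output_size(width, k, stride_xy),
    )

# 15 frames, 11 x 11 x T filter with an assumed T = 3 and spatial stride 4.
shape = conv3d_output_shape(15, 227, 227, t_extent=3, k=11, stride_xy=4)
```

With T = 1 the time dimension stays at 15, which is the single-frame case; a larger T shrinks it, which is how the filter mixes information across neighbouring frames.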


Spatio-Temporal ConvNets
  • 3D Conv in Space and Time
  • Slow Fusion 3D Conv Approach
  • Learned filters on first layers (Smaller filters - More layers)
  • Spatio-Temporal ConvNets
  • Datasets are not quite there
  • 3D Conv, LSTM
  • Single frame networks are baseline



Spatio-Temporal ConvNets
  • c3D
  • 3 x 3 x 3 conv, 2 x 2 x 2 pool
  • VGG in 3D
  • 3D Conv is Painful
  • Two-stream idea: one ConvNet looks at the image
  • The other looks at optical flow
  • Extract Optical Flow, Fuse in the end
  • Optical flow contains a lot of information
  • Need to check - Compute optical flow between two frames
Long time Spatial - Temporal ConvNets
  • Videos with temporal dependencies
  • Events larger than timescale
  • Attention model
  • Attention over different parts of the video
  • Process images at detail level, resize at global level
  • RNN
  • Video - Classes prediction at point in time
  • RNN allow to have infinite context
  • 3D conv, lstm
  • CNN + LSTM



Video Classification Architectures
  • RNN + 3D Convnet
  • RNN before the ConvNet processes the image (Idea)
  • ConvNets between frames (Scales) - Speed up and Slow down
  • Background subtraction to look only at things of interest (Check code)
  • Weight sharing between ConvNet and RNN
Idea
  • Get Rid of RNN
  • Convnet
  • All neurons in the ConvNet are recurrent
  • GRU slightly different update formula
  • Replace through the conv
  • Convolve over input, Output and then RNN
  • RNN Convnet (Check code)
Summary
  • Local motion 3D Conv
  • Global motion LSTM



Research papers of video + audio not there
Supervised Learning - Dataset has data x, label y. Goal in supervised learning is function that takes input x and outputs y
Example - Classification, Regression, Object detection, Semantic segmentation, image captioning

Unsupervised Learning
  • Just data, no labels
  • Learn Some structure on data
  • Examples - Clustering, dimensionality reduction, feature learning, generative models
Autoencoders
  • Traditional - Feature Learning
  • Variational - Generate Samples


  • Input x -> Pass thru Encode network -> Learnable feature Z
  • Reconstruction - Reproduce data x from features z
  • Decoder - Smaller features - Blows back to original data
  • Encoder / Decoder sometimes share weights
  • PCA Optimal for L2 Reconstruction
  • Our intention is learn useful tasks
  • Generate Fake images like original images
Variational AutoEncoder
  • Exist outside world prior distribution
  • Assume distribution is Gaussian
  • Bayes rule gives the posterior
  • Probability given observed data
  • Unsupervised data to learn features
  • Maximum likelihood
  • Variational Inference
  • Insert Extra constant, break into two different terms

Adversarial Networks - Generate Samples
  • Generator - Mini batches of random noise
  • Discriminator - both original and fake images
  • Architecture bigger and powerful using multiscale processing
  • Generate at multiple scales
  • Low-Resolution -> Upsample -> Delta on top of it -> Upsample ..
Variational Autoencoder
  • Adversarial noise inputs can be changed as we generate
  • Interpolate between random points in latent space
GAN
  • Learns Nice useful representations
  • Variational Autoencoder
  • Add Adversarial network to VAE
  • Discriminator Network added
  • Pixel Loss
  • Generate Samples like Alexnet


Python / OpenCV code to try
  • Program #1 - Generate optical flow between frames, Compute sift between two frames, Color the moved pixels
  • Program #2 - Upsample Images
  • Program #3 - Generate Optical flow data, Use it to feed to CNN to classify actions
  • Program #4 - GANs
  • Program #5 - CNN with 3x3, 1x7 different filters and training / test accuracy



Happy Mastering DL!!!

December 27, 2018

Day #174 - CS231 - Lecture 13: Segmentation, soft attention, spatial transformers

Key Lessons
  • Segmentation
  • Semantic Segmentation
  • Instance Segmentation
Soft Attention
  • Discrete Locations
  • Continuous Locations (Spatial Transformers)
Inception-v4
  • Deep Network
  • Repeated Modules
  • No Padding
  • Strided Convolution and MaxPooling
  • Efficient Convolution Tricks, 1 X 7, 7 X 1
Segmentation
Two Subtasks involved
Semantic
  • Input Image / Fixed number of classes
  • Background class
  • Label Every pixel with one of semantic class
  • Higher level understanding of images
  • Not aware of instances
  • Count of objects not known
Instance
  • Detect Instances given category, label
  • Simultaneous detection and segmentation
Semantic Segmentation
  • Label without instances
  • Given input image, Extract patch from image
  • Run it through CNN
  • Classify centre pixel as COW
  • Run over entire image
  • Expensive Operation
  • 100 Pixel order of magnitude receptive field
  • Semantic Segmentation - Multi scale
  • Super Pixel / Segmentation Trees


Semantic Segmentation - Refinement
  • Apply CNN once to get labels
  • Increases Effective Receptive Field for output
  • Recurrent Convolutional network
  • Iteratively refine outputs

Semantic Segmentation - Upsampling
  • Input Images run through convolutions extract feature maps
  • Learn upsampling as part of network
  • Skip connections - Help in low level details
  • Convolutional features from different layers in network
  • Accuracy - Classification metrics, intersection over union (Ground truth vs Region Predicted)
  • Learnable Upsampling - Deconvolution (Convolution Transpose), Fractionally strided convolution
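
A tiny 1D sketch of the learnable upsampling idea (convolution transpose): every input value stamps a scaled copy of the kernel into the output, and overlapping copies are summed. The function name and the example values are illustrative only; in a network the kernel weights would be learned.

```python
def conv_transpose_1d(x, kernel, stride=2):
    """Each input value scales a copy of the kernel; overlaps are summed."""
    out_len = (len(x) - 1) * stride + len(kernel)
    out = [0.0] * out_len
    for i, value in enumerate(x):
        for j, w in enumerate(kernel):
            out[i * stride + j] += value * w
    return out

# Two input values become five output values with stride 2 and a size-3 kernel.
upsampled = conv_transpose_1d([1.0, 2.0], [1.0, 1.0, 1.0], stride=2)
```

The middle output position receives contributions from both inputs, which is where the "fractionally strided" overlap shows up.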
Instance Segmentation
  • Generalization
  • Distinguish Instances
  • Detect and label instances
  • End up looking like detection models
  • SDS (Simultaneous Detection and Segmentation)
  • RCNN - External Region proposals
  • Box CNN, Region CNN
  • Mask with mean color of dataset

Hypercolumns
  • Region Classification
  • Region Refinement
  • Extract Convolutional features
  • Upsample and combine them together
  • Bi-linear / nearest neighbor for upsampling
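
A rough sketch of the hypercolumn idea on toy single-channel feature maps: upsample the coarse map with nearest neighbor, then stack it with the fine map per pixel. The helper names and the 2 x 2 / 4 x 4 sizes are assumptions made for the example.

```python
def upsample_nearest(feature_map, factor):
    """Repeat every value `factor` times along both axes."""
    out = []
    for row in feature_map:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

def hypercolumns(fine, coarse, factor):
    """Per-pixel tuple of (fine feature, upsampled coarse feature)."""
    coarse_up = upsample_nearest(coarse, factor)
    return [list(zip(f_row, c_row)) for f_row, c_row in zip(fine, coarse_up)]

coarse = [[1, 2], [3, 4]]                     # e.g. a deep, low-resolution layer
fine = [[0] * 4 for _ in range(4)]            # e.g. an early, high-resolution layer
columns = hypercolumns(fine, coarse, factor=2)
```

Bilinear interpolation would replace `upsample_nearest` with a smoother resize; the stacking step stays the same.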

Cascades
  • High Resolution Images
  • From Conv feature maps
  • Each Feature Map predicts region of interest
  • Reshape boxes to fixed size



Attention Models
RNN for Captioning
  • H X W X 3
  • Input Image -> CNN -> Features -> Hidden State -> First Word -> Second Word
  • One chance to look at input image
  • Weighted vector from input features
  • Hidden states sent to the model
Soft Vs Hard Attention
  • Grid of features
  • Distribution over grid locations
  • Attention - Nice Interpretable Outputs
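
Soft attention in one small sketch: a softmax turns scores into a distribution over grid locations, and the context vector is the weighted sum of the features at those locations. The function names and the two-location toy grid are illustrative; in the captioning model the scores would come from the RNN hidden state.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(features, scores):
    """features: one vector per grid location; scores: one score per location."""
    weights = softmax(scores)
    dim = len(features[0])
    context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return weights, context

# Two grid locations with equal scores: the context is the plain average.
weights, context = soft_attention([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

Hard attention would instead sample a single location from `weights`, which is not differentiable; the weighted sum is what makes soft attention easy to train.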


Soft Attention for Translation
  • Many to Many RNN
  • Sequence to Sequence model
  • Generate Output sequence similar to captioning
  • Content based addressing
  • Probability distribution directly
  • Soft Attention Easy to implement and train
  • Constrained to fixed grid provided by feature maps
  • Spatial Transformer Network - Similar to texture mapping
  • Attend to arbitrary parts of input in a nice way



Happy Mastering DL!!!

December 26, 2018

Day #173 - CS231N - Lecture 11: Deep Learning libraries - Notes

Key Summary

Caffe
  • From Berkeley
  • Widely used for CNN
  • Written in C++
  • Python and Matlab bindings
  • Good for standard feedforward vanilla CNN
  • Blob - Weights, Pixel values, Intermediate values (n dimensional tensor)
  • Layer - Function - Input / Output blob
  • Common Problem - Not much documentation on layer types
  • Net - Combines bunch of layers
  • Solver - Intended to run forward / backward in network / resume checkpoints
  • Gradient Descent, RMSProp are in the solver
  • Protocol buffers - Binary, strongly typed, JSON-like format for serializing data to disk
  • caffe.proto - Defines all protocol buffer message types
  • Convert Data - File format LMDB
  • .prototxt files to define the net
  • Solver - Learning Rate, Regularization rates




Torch
  • NYU 
  • Written in C
  • Used in Fb and Deepmind
  • Lua - High Level Scripting for embedded devices, similar to JS
  • JIT compilation to make things fast
  • Learn Lua in 5 mins site
  • Torch tensors are just like numpy arrays
  • GPU is just another data type
  • optim package implements momentum, Adam
  • Caffe has Nets and Layers
  • Torch just has modules
  • Modules are classes written in Lua
  • Containers to combine multiple modules
  • nngraph hookup more complex topology easily
  • Not great for RNN
Backward Pass
  • updateGradInput
  • accGradParameters - Accumulate grad parameters - Receive gradients from upstream
Workflow in Torch
  • Preprocess data
  • Train a model in Lua / Torch
  • Use Trained model


Theano
  • University of Montreal
  • High Level Wrappers - Keras, Lasagne
  • Computational graphs
  • Debugging hard
Lasagne - High Level Wrapper for theano





Tensorflow
  • Similar to Theano
  • From Professional Engineers
  • First ground up from Industrial Place
  • Create Placeholders for data and labels - Create input nodes
  • Initialize variables with numpy arrays
  • Compute Score, Probs, Loss
  • SGD to minimise loss
  • Wrap it in Session Code
  • One hot - Y always integer
  • In some frameworks it is a vector where everything is zero except the correct class
  • Tensorflow wants one hot
  • Tensorboard to visualise the network
  • Async or Sync training
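
To make the integer-label versus one-hot distinction concrete, a minimal pure-Python sketch (the helper `one_hot` here is written for illustration and is not the TensorFlow function):

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into vectors that are zero except at the class index."""
    return [[1.0 if c == y else 0.0 for c in range(num_classes)] for y in labels]

# Integer labels for three examples in a 4-class problem.
y = [2, 0, 3]
y_one_hot = one_hot(y, num_classes=4)
```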
Projects and Architecture Inputs
#1. Image Captioning
  • Need Pretrained models
  • Need RNNs
#2. Semantic Segmentation
  • Need pretrained model
  • Need loss function
#3. Object Detection
  • Pretrained models
  • Custom imperative code
  • Caffe + Python





Keras - Good Presentation

Happy Mastering DL!!!

December 24, 2018

Day #172 - CNN - One Pager, Notes and Best Practices

Summary from MIT / Stanford Classes / Geoff Hinton Papers / Readings.

Deep Learning - "Learn directly from data without manual feature engineering"

A deep learning architecture is a multilayer stack of simple modules, all of which may compute simple non-linear input-output mappings. The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives.

Overview


Step 1


Step 2
CNN Key Components
  • Input
  • Convolution
  • Strides
  • Pooling
  • Fully Connected
  • Output
Image - Represented as RGB Matrix with Height and width = 3 color channels X Height X width. Color represented in [0,255] Range
Data Augmentation - Random mix of translation, rotation, stretching, shearing, lens distortions
Convolution - Convolve the filter with the image (Slide over image spatially computing dot products). Convolution preserves the spatial relationship between pixels
Convolution Filters - Replace Large Convolutions (5x5, 7x7) with stacks of 3 x 3 convolutions
Activation Function - Introduce non-linearities in the network. Non-Linearity - Often Relu (Image data highly non-linear)
Transfer Learning - Works better for similar types of data, Freeze network and retrain top layer
Pooling - Downsampling for each feature map. MAX Pooling - Shrinks size using max


Advantages of downsampling - Decreased size of input for upcoming layers, Works against overfitting
Flattening - Convert into 1D feature vector.  Flattens all its structure to create a single long feature vector
Fully Connected - Has a neuron fully connected to output, Contains neurons that connect to the entire input volume as in ordinary neural networks
After Forward pass we compute - Loss. In Backward pass we Compute Gradient to backpropagate and update weights
Fully Connected Layer - Layer where the maxpooled matrix is flattened. FCN is fed into softmax / cross entropy layer for prediction.
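
The components above (convolution, ReLU, max pooling, flattening) can be chained in a few lines of plain Python. This is a toy sketch on a single-channel image with a hand-picked kernel; the image, kernel and function names are assumptions for illustration, and a real network would learn the kernel values.

```python
def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image and take dot products (no padding).
    Like most DL frameworks, this is technically cross-correlation."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    acc += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(acc)
        out.append(row)
    return out

def relu(feature_map):
    return [[max(0.0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Non-overlapping size x size max pooling."""
    return [[max(feature_map[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(feature_map[0]) - size + 1, size)]
            for i in range(0, len(feature_map) - size + 1, size)]

def flatten(feature_map):
    return [v for row in feature_map for v in row]

# 5 x 5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
kernel = [[-1, 1], [-1, 1]]  # responds to dark-to-bright transitions
features = flatten(max_pool(relu(conv2d(image, kernel))))
```

The flattened `features` vector is what a fully connected layer would take as input.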

CNN Concepts
Sliding Windows - Pick a portion of image and run detection, Keep Moving Horizontally, Vertically to pick and predict for each selected Area. Use prediction confidence and a threshold to filter the selected windows. Generate Bounding Boxes.
Non-max Suppression (NMS) - Given several overlapping bounding boxes for a detected object, select the box with the highest probability and discard the rest

Keras
  • Dense Layer - A dense layer represents a matrix vector multiplication.  each input node is connected to each output node.
  • Dropout - A dropout layer is used for regularization  
  • Hidden Layer - A sparse layer is a hidden layer that is not dense
  • Fully-connected layer is a (usually) cheap way of learning non-linear combinations of these features. In place of fully connected layers, we can also use a conventional classifier like SVM. Fully Connected layers perform classification based on the features extracted by the previous layers.
  • Convolutional layers are providing a meaningful, low-dimensional, and somewhat invariant feature space
Experiments
  • Convolution Level - Standard Convolution, Dilated Convolution, Transposed Convolution, Strided Convolution, Conv2D, ConvLSTM2D
  • Kernels - 3x3 and 2x2 kernels, Convolution Tricks - 1 X 7, 7 X 1, Replace Large Convolutions (5x5, 7x7) with stacks of 3 x 3 convolutions, 
  • Loss Functions - RMSE, Cross Entropy
  • Optimizers - SGD, AdaGrad, AdaDelta, RMSProp
  • Activation Functions - Sigmoid, Tanh, Relu
CNN - Recognize Spatial patterns
RNN - Recognize Sequential patterns
GAN - Two Networks, One to Generate, Another One Testing Output of Generation
Reinforcement Learning - Trial and Error Learning

Questions
How Forward feed and backwards prop work.
  • A Feed-Forward Neural Network is a type of Neural Network architecture where the connections are "fed forward", i.e. do not form cycles (like in recurrent nets).
  • Input to hidden and from hidden to output layer. 
  • The values are "fed forward"
Backpropagation is a training algorithm consisting of 2 steps: 
  • Feed forward the values 
  • Calculate the error and propagate it back to the earlier layers. 
So to be precise, forward-propagation is part of the backpropagation algorithm but comes before back-propagating.
  • Input for backpropagation is output_vector, target_output_vector, output is adjusted_weight_vector.
  • Backpropagation is an efficient method of computing gradients in directed graphs of computations, such as neural networks
  • Stochastic gradient descent (SGD) is an optimization method used e.g. to minimize a loss function.
  • SGD writes code in weights of neural network
The various loss functions and their considerations
  • Mean Squared Error (MSE), or quadratic, loss function is widely used in linear regression as the performance measure
  • Cross Entropy is commonly used in binary classification (labels are assumed to take values 0 or 1)
  • Cross Entropy Loss is usually used in classification problems. In essence, it is a measure of difference between the desired probability distribution and the predicted probability distribution
  • Negative Log Likelihood loss function is widely used in neural networks, it measures the accuracy of a classifier.
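
A minimal sketch of the two most common losses from the list above. The function names and the `eps` clamp are choices made for this example, not taken from any particular library.

```python
import math

def mse(predicted, actual):
    """Mean squared error between two equal-length lists of numbers."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def cross_entropy(predicted_probs, target_probs, eps=1e-12):
    """Cross entropy between a target distribution and predicted probabilities.
    eps guards against log(0)."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(predicted_probs, target_probs))

regression_loss = mse([1.0, 2.0], [1.0, 4.0])
classification_loss = cross_entropy([0.5, 0.5], [1.0, 0.0])
```

With a one-hot target, the cross entropy reduces to the negative log of the probability assigned to the correct class, which is why it coincides with the negative log likelihood in that case.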
Loss Functions - Reference
  • Triplet Loss is another loss commonly used in CNN-based image retrieval. During training, an image triplet (Ia, Ip, In) is fed into the model as a single sample, where Ia, Ip and In represent the anchor, positive and negative images respectively. The idea is that the distance between the anchor and positive images should be smaller than that between the anchor and negative images.
  • Contrastive Loss is often used in image retrieval tasks to learn discriminative features for images. 
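
The triplet idea can be written as a hinge on two distances. This is a sketch under assumptions: it uses plain Euclidean distance on toy 2D embeddings and a margin of 1.0, neither of which comes from the notes, and some formulations use squared distances instead.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by at least `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy embeddings: positive at distance 1, negative at distance 5 from the anchor.
good = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
bad = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```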
More Questions
  • The various activation functions and why they are needed
  • Optimization functions and why they are needed
  • Bias and variance / over and under fitting - what causes them, and the various methods to handle them
  • CNNs, RNNs, GANs, attention, Transformer, unsupervised and semi supervised, RL, decision trees, Ensemble Learning, SVM, Auto encoders…Understand interpretation, bias, fairness
  • The statistics theory (the more the better), and the linear algebra and calculus technicalities
IoU is a measure of the overlap between two bounding boxes: The ground truth bounding boxes (i.e. the hand labeled bounding boxes), The predicted bounding boxes from the model
Non-maximal suppression - Pick the bounding box with the maximum box confidence. Output this box as prediction.
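
IoU and non-maximal suppression fit together as in the sketch below. Assumptions for the example: boxes are `(x1, y1, x2, y2)` tuples, the IoU threshold is 0.5, and the greedy loop keeps going after the first pick so that separate objects each keep one box.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the most confident box, drop boxes that overlap it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

Here the second box overlaps the first heavily and is suppressed, while the third box covers a different region and survives.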
  • CNN - Multiple Regions - Slide - Convolve - Feature Extraction
  • RCNN - Selective Search. 2000 regions extracted. Feed these patches to CNN, followed by SVM to predict the class of each patch.
  • Fast RCNN - Selective search still generates the region proposals; the CNN runs once per image and features are pooled per region.
  • Faster RCNN replaces selective search with a very small convolutional network called Region Proposal Network to generate regions of Interests. Faster R-CNN has a dedicated region proposal network followed by a classifier. 7 FPS (frame per second)
  • Feature Pyramid Network (FPN) is a feature extractor designed with feature pyramid concept to improve accuracy and speed
YOLO and SSD Regression-Based detectors
  • Yolo detection is a simple regression problem which takes an input image and learns the class probabilities and bounding box coordinates
  • YOLO uses DarkNet to make feature detection followed by convolutional layers.
  • One limitation for YOLO is that it predicts only one class per grid cell; hence, it struggles with very small objects
  • Predictions (object locations and classes) are made from one single network 
  • SSD is a single shot detector using a VGG16 network as a feature extractor (equivalent to the CNN in Faster R-CNN). Custom convolution layers are then added afterward, and convolution filters make the predictions. SSD only uses upper layers for detection and therefore performs much worse for small objects.
Gradient Calculation
  • Crucial for backpropagation
  • Function with edge weights
  • Edge weights yield Error weights
  • Find the combination of w1,w2..wn that will minimize function y
  • Minimal error is the goal
  • The optimization problem in higher dimensions
  • NP-hard problem, make simulations, approximations
  • NP-hardness (non-deterministic polynomial-time hard)
  • Simulate genetic algorithms
  • TSP
  • Compute gradient adjust edge weights
  • Gradient (Partial derivative)
  • Direction of gradient
  • Derivative sigmoid function
  • Easy to code
  • Learning for backpropagation compute the derivative of the activation function
  • deltaoutput = error*dsigmoid(sum)
  • O/P to I/P layer so-called backpropagation
Backpropagation
  • Equation for updating edge weights
  • Delta = learningrate*gradient + momentum*previousChange
  • Learning Rate - How fast we learn, optimum to converge
  • Momentum - Avoid local minima
  • derivative (loss) / (weight)
  • Error / Loss = predicted - actual
  • Update weights from right / left
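
The two formulas in these notes, `deltaoutput = error*dsigmoid(sum)` and `Delta = learningrate*gradient + momentum*previousChange`, can be tried on a single sigmoid neuron. This is only a sketch: the inputs, target, learning rate, momentum and step count are made-up values, and the error is taken as `target - output` so that the change is added to the weight.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dsigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def train_neuron(inputs, target, weights, learning_rate=0.5, momentum=0.9, steps=300):
    """Train one sigmoid neuron on one example using the momentum update rule."""
    weights = list(weights)                 # do not modify the caller's list
    previous_change = [0.0] * len(weights)
    for _ in range(steps):
        total = sum(w * x for w, x in zip(weights, inputs))
        output = sigmoid(total)
        error = target - output             # sign chosen so adding the change reduces the error
        delta_output = error * dsigmoid(total)
        for i, x in enumerate(inputs):
            gradient = delta_output * x
            change = learning_rate * gradient + momentum * previous_change[i]
            weights[i] += change
            previous_change[i] = change
    return weights

inputs = [1.0, 0.5]
trained = train_neuron(inputs, target=0.8, weights=[0.1, -0.2])
prediction = sigmoid(sum(w * x for w, x in zip(trained, inputs)))
```

In a multi-layer network the same delta is passed from the output layer back towards the input layer, one layer at a time.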
Backprop Good Read - Link

CNN - Sliding Window, Reduce Dimensions, Works best on images 
RNN - Sequence, Retain some portion of history. Limit to parallel training, Hard to capture relationship when points are far
Transformers - Learn multiple ways, the relationship between each item in the input sequence to all other items in the input
Encoder - Transform inputs to embeddings. Several Multiheaded self attention models stacked up on each other.
Decoder - Embedding to output sequence
Applications - Used for Seq2Seq / Machine translations
Feedback - May not work for hierarchical relationships
Reference  - Link

Where Backprop Fails?
Every layer has multiple neurons. RNN has memory cells. Memory cells have inputs at different periods of time. GRU / LSTM / Basic cell are variations of cells. Memory cells get unrolled with time. Backprop via gradient descent. There are as many weights / biases as connections. The loss is calculated from a cost/loss function. Weights/biases are initialized randomly, somewhere on the slope of the loss surface. The optimizer will attempt to descend the slope to find optimal values (smallest value of error). Converge at best possible value of weights/bias for all connections in neurons.
Failure cases - Gradients don't change (Vanishing Gradient problem), Loss remains constant

Ref Notes 

  • Average Pooling and Max Pooling
  • Hierarchical Pooling
  • Attentive Pooling
Convolution
  • Causal Convolutions
  • Multi-Resolution CNN Block
  • Multi-Head Multi-Resolution CNN Block
A sequence encoder is the cornerstone of many advanced AI applications, such as semantic search, question-answering, machine reading comprehension
  • Batchnorm - in effect, performs a kind of coordinated rescaling of its inputs
  • Dropouts: Randomly disables neurons during the training, in order to force other neurons to be trained as well
  • L1 regularization - Cost added is proportional to the absolute value of the weights coefficients 
  • L2 regularization - Cost added is proportional to the square of the value of the weights coefficients 

Bayesian Convolutional Neural Networks (Bayesian CNNs) and traditional Convolutional Neural Networks (CNNs) are both types of neural networks used for processing grid-like data, such as images. The main difference between them lies in their approach to handling uncertainty and learning model parameters.

Convolutional Neural Networks (CNNs) are a type of feedforward neural network that use convolutional layers to learn local features in the input data. CNNs are trained using backpropagation and optimization algorithms like stochastic gradient descent to minimize a loss function. The learned parameters (weights and biases) are point estimates, meaning that they do not capture the uncertainty in the model.

Bayesian Convolutional Neural Networks (Bayesian CNNs) extend traditional CNNs by incorporating Bayesian inference to learn the distribution of model parameters instead of point estimates. This allows Bayesian CNNs to capture the uncertainty in the model, which can be useful for tasks where uncertainty quantification is important, such as medical image analysis or safety-critical applications.

Here are some pros and cons of Bayesian CNNs compared to traditional CNNs:

Pros:
Uncertainty quantification: Bayesian CNNs can provide a measure of uncertainty in their predictions, which can be useful for decision-making and risk assessment.
Regularization: The Bayesian approach naturally incorporates regularization, which can help prevent overfitting and improve generalization.
Robustness: Bayesian CNNs can be more robust to adversarial examples and noisy data, as they take into account the uncertainty in the model parameters.
Cons:

Computational complexity: Bayesian CNNs are generally more computationally expensive than traditional CNNs, as they require sampling from the posterior distribution of model parameters or approximating it using techniques like variational inference.
Implementation complexity: Implementing Bayesian CNNs can be more challenging than traditional CNNs, as it requires additional tools and techniques for handling the Bayesian aspect.
Here's a simple example of a traditional CNN using Python and TensorFlow:



Happy Mastering DL!!!

Day #171 - ConvNets in practice - CS231n Lessons

Key Summary
  • RNN, For Modelling Sequences - Vanilla, LSTM
  • RNN for language models
  • CNN + RNN for image captioning
  • Feedforward - Feedforward function


Low Level CNN Working Practice
Making Most of Data
  • Data Augmentation
  • Images + Labels -> CNN -> Compute Loss -> Back Propagate
  • Images + Transformation + Labels -> CNN -> Compute Loss -> Back Propagate
  • Artificially expand training set, Preserve Labels, Widely used in practice
  • Types of Transformation 
  • Horizontal Flip
  • Random Crops / Samples from Training Images / Random Scale and Rotation 
  • Color Jitter (Randomly jitter contrast)
  • Color Jitter with PCA
  • Data Augmentation - Random mix of translation, rotation, stretching, shearing, lens distortions
  • Dropout / DropConnect - Randomly drop activations or set weights to zero
Data Augmentation - Summary
  • Simple to implement, Use it
  • Useful for small datasets
  • Fits into framework of noise / marginalization
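
A small sketch of label-preserving augmentation with a horizontal flip and a random crop. Images are plain nested lists here, and the function names, flip probability and crop size are illustrative assumptions.

```python
import random

def horizontal_flip(image):
    """Mirror each row of the image."""
    return [row[::-1] for row in image]

def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

def augment(image, label, rng, crop_h, crop_w):
    """Randomly flip, then crop; the label is passed through unchanged."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    return random_crop(image, crop_h, crop_w, rng), label

image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
rng = random.Random(0)  # fixed seed so runs are repeatable
augmented, label = augment(image, "cat", rng, crop_h=2, crop_w=2)
```

Calling `augment` on every training batch gives the network a slightly different view of each image per epoch without collecting new data.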
Transfer Learning
  • You need a lot of data if you want to train / use CNNs
  • Train on Image Net / Pre-train model download
  • Treat it as feature extractor
  • Replace last layer with Linear Classifier
  • Freeze network and retrain top layer
  • Train only the last layers (Final Layers)
  • Works better for similar types of data
  • Edges, Color, Gabor applicable for any type of visual data
  • Image captioning word vectors (Pre-trained)
Convolutions
  • Computational workhorse
Design Efficient Network Architecture
  • A stack of three 3 x 3 convolutions sees the same 7 x 7 region as a single 7 x 7 convolution
  • H, W, C Filters, Stride 1
Convolution - Summary
  • Replace Large Convolutions (5x5, 7x7) with stacks of 3 x 3 convolutions
  • 1 x 1 bottleneck convolutions are very efficient
  • Can factor N x N convolutions into 1 x N and N x 1
  • All the above gives fewer parameters, less compute and more non-linearity
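
The parameter savings in this summary can be checked with simple arithmetic. The sketch assumes C input and C output channels in every layer, stride 1, and ignores biases; C = 64 is just an example value.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weight count of one convolution layer (biases ignored)."""
    return kernel_h * kernel_w * in_channels * out_channels

def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)

C = 64
single_7x7 = conv_params(7, 7, C, C)                            # 49 * C^2
stack_3x3 = 3 * conv_params(3, 3, C, C)                         # 27 * C^2
factored_7 = conv_params(1, 7, C, C) + conv_params(7, 1, C, C)  # 14 * C^2
```

The 3 x 3 stack covers the same 7 x 7 region with roughly half the weights, and it gets two extra non-linearities in between.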


All about Convolutions (Computing them)
  • im2col (Convolution recast as matrix multiply)
  • im2col memory overhead
  • Depth C to match input
  • Take Each Convolutional weights compute inner products
  • FFT - Convolution theorem: convolving two signals is the same as an elementwise product of their Fourier transforms
  • FFT of weights, input image
  • Elementwise computation
  • Compute inverse, Speed up only for larger filters
  • FFT doesn't work too well in practice
  • FFT doesn't handle striding too well
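
A toy version of the im2col trick for a single-channel image: every receptive field is unrolled into a row, so the convolution becomes a matrix-vector product with the flattened kernel. The function names are illustrative, and real implementations handle channels, strides and padding and call an optimized matrix multiply.

```python
def im2col(image, kh, kw):
    """One row per output position, holding the kh * kw pixels under the kernel."""
    patches = []
    for i in range(len(image) - kh + 1):
        for j in range(len(image[0]) - kw + 1):
            patches.append([image[i + a][j + b] for a in range(kh) for b in range(kw)])
    return patches

def conv2d_im2col(image, kernel):
    """Convolution (cross-correlation, stride 1, no padding) via im2col."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [w for row in kernel for w in row]
    patches = im2col(image, kh, kw)
    out_w = len(image[0]) - kw + 1
    flat_out = [sum(p * w for p, w in zip(patch, flat_kernel)) for patch in patches]
    return [flat_out[r:r + out_w] for r in range(0, len(flat_out), out_w)]

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
patches = im2col(image, 2, 2)
result = conv2d_im2col(image, kernel)
```

The memory overhead is visible even here: the 9 input pixels expand into 4 patches of 4 values each, because overlapping receptive fields duplicate pixels.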

Fast Algorithms
  • Strassen's Algorithm
  • Naive matrix multiplication
Processing
  • NVIDIA is much more common for GPUs
  • GPU good at matrix multiplication
  • Floating point precision discussions
  • 16 bit floating point operations from Nervana
  • Lower precision makes things faster and still works




Happy Mastering DL!!!