- Detect key points to track
- Tracklets obtained and features get accumulated
- Feature points at different scales
- Track using optical flow methods
- Bunch of features extracted in local coordinate system of every track
- 15 frames, x,y positions
- Extract features in local coordinate system between two frames
- Differences in key point positions reflect the optical flow
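The tracking notes above (15 frames of x,y positions, with motion read off the differences between positions) can be sketched in NumPy. This is a minimal sketch, assuming the track is already available as an array of positions. `trajectory_descriptor` is a hypothetical name, and normalizing by total path length is an assumption, not something stated in the notes.

```python
import numpy as np

def trajectory_descriptor(track):
    """Turn a (T, 2) array of x,y positions into normalized displacements."""
    track = np.asarray(track, dtype=float)
    # Displacement between consecutive frames: shape (T - 1, 2).
    disp = np.diff(track, axis=0)
    # Total path length, used to make the descriptor scale-invariant (assumption).
    total = np.linalg.norm(disp, axis=1).sum()
    if total == 0:
        return disp.ravel()  # static point: all-zero descriptor
    return (disp / total).ravel()

# Hypothetical track: a point moving 1 px to the right per frame for 15 frames.
track = np.stack([np.arange(15), np.zeros(15)], axis=1)
desc = trajectory_descriptor(track)  # 14 displacements, flattened to 28 values
```

A descriptor like this could then be binned into histograms and passed to a classifier such as an SVM.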
Key Point Detection
- Detect features
- Run Optical flow algos
- Displacement vector between every single frame
- Optical flow methods in Python (to check)
- Histogram bins
- SVM
- Process each frame in AlexNet
- Encode 15 frames in a CNN
- Sharing weights spatially
- Extend filters in small amounts in time
- 11 x 11 x T (Temporal Extent)
- 3 (R,G,B)
- Sliding filters in time
- Carving out activation volume
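Extending an 11 x 11 filter by a temporal extent T and sliding it in time as well as space can be sketched with naive loops. This is a rough illustration for one filter, assuming stride 1 and no padding. The clip size, T = 4, and the name `conv3d_single_filter` are made up for the example.

```python
import numpy as np

def conv3d_single_filter(clip, filt):
    """Slide one filter over a clip in time and space (stride 1, no padding).

    clip: (frames, height, width, 3) video clip
    filt: (T, k, k, 3) filter with temporal extent T
    returns: (frames - T + 1, height - k + 1, width - k + 1) activation volume
    """
    F, H, W, _ = clip.shape
    T, k, _, _ = filt.shape
    out = np.zeros((F - T + 1, H - k + 1, W - k + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                # The same weights are reused at every position in space and time.
                patch = clip[t:t + T, y:y + k, x:x + k, :]
                out[t, y, x] = np.sum(patch * filt)
    return out

rng = np.random.default_rng(0)
clip = rng.standard_normal((15, 32, 32, 3))  # 15 RGB frames (illustrative size)
filt = rng.standard_normal((4, 11, 11, 3))   # 11 x 11 x T filter with T = 4
activation = conv3d_single_filter(clip, filt)  # shape (12, 22, 22)
```

Each filter carves out one such volume; a layer stacks the volumes from all of its filters.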
Spatio-Temporal ConvNets
- 3D Conv in Space and Time
- Slow Fusion 3D Conv Approach
- Learned filters on first layers (Smaller filters - More layers)
- Spatio-Temporal ConvNets
- Datasets are not quite there
- 3D Conv, LSTM
- Single frame networks are baseline
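A single-frame baseline can be sketched as: score every frame independently with an image network, then pool the scores into one video-level prediction. In the sketch below the per-frame logits are toy values standing in for network outputs, and averaging softmax probabilities is an assumed pooling choice.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def single_frame_baseline(frame_logits):
    """Average per-frame class probabilities into one video-level prediction."""
    probs = softmax(frame_logits)     # (frames, classes)
    video_probs = probs.mean(axis=0)  # (classes,)
    return int(video_probs.argmax()), video_probs

# Toy stand-in for per-frame network outputs: 15 frames, 5 classes.
frame_logits = np.tile(np.array([0.1, 0.0, 2.0, 0.3, -0.5]), (15, 1))
label, video_probs = single_frame_baseline(frame_logits)
```

Because no temporal information is used, this gives a reference point for judging what the 3D Conv and LSTM variants add.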
Spatio-Temporal ConvNets
- c3D
- 3 x 3 x 3 conv, 2 x 2 x 2 pool
- VGG in 3D
- 3D Conv is Painful
- Two-stream approach: one ConvNet looks at the image
- The other looks at optical flow
- Extract Optical Flow, Fuse in the end
- Optical flow contains a lot of information
- Need to check - Compute optical flow between two frames
- Videos with temporal dependencies
- Events larger than timescale
- Attention model
- Attention over different parts of idea
- Process images at detail level, resize at global level
- RNN
- Video - Classes prediction at point in time
- RNNs allow, in principle, infinite context
- 3D Conv, LSTM
- CNN + LSTM
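The CNN + LSTM idea (per-frame features fed into a recurrent network, so the prediction at each point in time can depend on everything seen so far) can be sketched in NumPy. This is a minimal sketch with random weights and random vectors standing in for CNN features. The gate ordering, the sizes, and the names `lstm_step` and `run_lstm` are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One LSTM step. W maps the concatenation [x, h] to four stacked gates."""
    H = h.shape[0]
    gates = W @ np.concatenate([x, h]) + b  # (4H,)
    i = sigmoid(gates[0:H])                 # input gate
    f = sigmoid(gates[H:2 * H])             # forget gate
    o = sigmoid(gates[2 * H:3 * H])         # output gate
    g = np.tanh(gates[3 * H:4 * H])         # candidate cell update
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

def run_lstm(frame_features, hidden_size, rng):
    """Feed per-frame features through an LSTM, one frame at a time."""
    D = frame_features.shape[1]
    W = rng.standard_normal((4 * hidden_size, D + hidden_size)) * 0.1
    b = np.zeros(4 * hidden_size)
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    states = []
    for x in frame_features:
        h, c = lstm_step(x, h, c, W, b)
        states.append(h)
    return np.stack(states)  # one hidden state per frame

rng = np.random.default_rng(0)
frame_features = rng.standard_normal((15, 64))  # stand-in for CNN features
hidden_states = run_lstm(frame_features, hidden_size=32, rng=rng)
```

A classifier applied to each row of `hidden_states` would give a class prediction at every point in time.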
Video Classification Architectures
- RNN + 3D Convnet
- RNN before the ConvNet processes the image (Idea)
- ConvNets between frames (Scales) - Speed up and Slow down
- Background subtraction: only look at things of interest (check code)
- Weight sharing between ConvNet and RNN
- Get Rid of RNN
- Convnet
- All neurons in the ConvNet are recurrent
- GRU slightly different update formula
- Replace through the conv
- Convolve over input, Output and then RNN
- RNN Convnet (Check code)
- Local motion 3D Conv
- Global motion LSTM
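The "replace through the conv" idea, where the matrix multiplies in the GRU update become convolutions so that every neuron in the ConvNet is recurrent, can be sketched as below. This is a rough NumPy illustration, not the lecture's code. The particular GRU formulation, the channel counts, and the names `conv2d_same` and `conv_gru_step` are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, w):
    """Naive zero-padded 'same' convolution. x: (Cin, H, W), w: (Cout, Cin, k, k)."""
    cout, _, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    _, H, W = x.shape
    out = np.zeros((cout, H, W))
    for i in range(k):
        for j in range(k):
            # Contribution of kernel tap (i, j) to every output position.
            out += np.einsum('oc,chw->ohw', w[:, :, i, j], xp[:, i:i + H, j:j + W])
    return out

def conv_gru_step(x, h, params):
    """GRU update in which each matrix multiply is replaced by a convolution."""
    z = sigmoid(conv2d_same(x, params['Wz']) + conv2d_same(h, params['Uz']))  # update gate
    r = sigmoid(conv2d_same(x, params['Wr']) + conv2d_same(h, params['Ur']))  # reset gate
    h_tilde = np.tanh(conv2d_same(x, params['Wh']) + conv2d_same(r * h, params['Uh']))
    return (1 - z) * h + z * h_tilde

rng = np.random.default_rng(0)
cin, ch, k = 3, 4, 3  # illustrative channel counts and kernel size
params = {name: rng.standard_normal((ch, cin if name[0] == 'W' else ch, k, k)) * 0.1
          for name in ['Wz', 'Uz', 'Wr', 'Ur', 'Wh', 'Uh']}
frames = rng.standard_normal((5, cin, 8, 8))
h = np.zeros((ch, 8, 8))
for x in frames:
    h = conv_gru_step(x, h, params)  # hidden state keeps its spatial layout
```

The hidden state stays a feature map, so recurrence happens at every spatial location rather than in a separate RNN placed on top of the ConvNet.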
Research papers on video + audio are not there yet
Supervised Learning - The dataset has data x and labels y. The goal in supervised learning is to learn a function that takes input x and outputs y
Examples - Classification, regression, object detection, semantic segmentation, image captioning
Unsupervised Learning
- Just data, no labels
- Learn Some structure on data
- Examples - Clustering, dimensionality reduction, feature learning, generative models
- Traditional - Feature Learning
- Variational - Generate Samples
- Input x -> Pass thru Encode network -> Learnable feature Z
- Reconstruction - Reproduce data x from features z
- Decoder - Smaller features - Blows back to original data
- Encoder / Decoder sometimes share weights
- PCA Optimal for L2 Reconstruction
- Our intention is to learn features that are useful for other tasks
- Generate Fake images like original images
- Assume a prior distribution over the latent z exists in the outside world
- Assume distribution is Gaussian
- Bayes' rule gives the posterior
- Probability given observed data
- Unsupervised data to learn features
- Maximum likelihood
- Variational Inference
- Insert Extra constant, break into two different terms
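The "insert extra constant, break into terms" step can be written out as follows. This is a sketch of the standard variational lower bound derivation, with assumed notation: q(z|x) is the encoder's approximate posterior, p(x|z) the decoder, and p(z) the Gaussian prior.

```latex
\begin{aligned}
\log p(x)
  &= \mathbb{E}_{z \sim q(z \mid x)}\big[\log p(x)\big]
     && \text{($p(x)$ does not depend on $z$)} \\
  &= \mathbb{E}_{q}\!\left[\log \frac{p(x \mid z)\, p(z)}{p(z \mid x)}\right]
     && \text{(Bayes' rule)} \\
  &= \mathbb{E}_{q}\!\left[\log \frac{p(x \mid z)\, p(z)}{p(z \mid x)}
       \cdot \frac{q(z \mid x)}{q(z \mid x)}\right]
     && \text{(multiply by a constant 1)} \\
  &= \mathbb{E}_{q}\big[\log p(x \mid z)\big]
     - \mathrm{KL}\big(q(z \mid x) \,\|\, p(z)\big)
     + \mathrm{KL}\big(q(z \mid x) \,\|\, p(z \mid x)\big)
\end{aligned}
```

The last KL term is non-negative and involves the intractable posterior p(z|x), so it is dropped. The two remaining terms (reconstruction and a KL to the prior) form a lower bound on log p(x), which is what is maximized in place of the exact likelihood.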
Adversarial Networks - Generate Samples
- Generator - Mini batches of random noise
- Discriminator - both original and fake images
- Architecture bigger and powerful using multiscale processing
- Generate at multiple scales
- Low-Resolution -> Upsample -> Delta on top of it -> Upsample ..
- Adversarial noise inputs can be changed as we generate
- Interpolate between random points in latent space
- Learns Nice useful representations
- Variational Autoencoder
- Add Adversarial network to VAE
- Discriminator Network added
- Pixel Loss
- Generate Samples like Alexnet
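The multiscale recipe from the notes (low-resolution -> upsample -> delta on top -> upsample ..) can be sketched as a loop. This is only an illustration: `random_delta` is a hypothetical stand-in for a per-scale generator network, and nearest-neighbor upsampling is an assumed choice.

```python
import numpy as np

def upsample2x(img):
    """Nearest-neighbor 2x upsampling of an (H, W) image."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def coarse_to_fine(low_res, delta_fn, levels):
    """Repeat: upsample, then add a delta produced at the new resolution."""
    img = low_res
    for _ in range(levels):
        img = upsample2x(img)
        img = img + delta_fn(img)
    return img

rng = np.random.default_rng(0)

def random_delta(img):
    # Stand-in for a per-scale generator: small random residual (assumption).
    return 0.1 * rng.standard_normal(img.shape)

low_res = rng.standard_normal((8, 8))  # stand-in for a low-resolution sample
sample = coarse_to_fine(low_res, random_delta, levels=2)  # 8x8 -> 16x16 -> 32x32
```

In an adversarial setup, each scale would have its own generator producing the delta and its own discriminator judging the result at that resolution.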
Python / OpenCV code to try
- Program #1 - Generate optical flow between frames, compute SIFT between two frames, color the moved pixels
- Program #2 - Upsample Images
- Program #3 - Generate Optical flow data, Use it to feed to CNN to classify actions
- Program #4 - GANs
- Program #5 - CNN with 3x3, 1x7 different filters and training / test accuracy
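As a starting point for the optical flow programs, here is a minimal NumPy sketch of a single-window Lucas-Kanade estimate between two frames. It is not OpenCV code. The synthetic Gaussian blob frames and the names `gaussian_blob` and `lucas_kanade_window` are assumptions, and a real program would estimate flow per pixel or per key point instead of once for the whole image.

```python
import numpy as np

def gaussian_blob(size, cx, cy, sigma):
    """Synthetic frame: one smooth blob centered at (cx, cy)."""
    y, x = np.mgrid[0:size, 0:size]
    return np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))

def lucas_kanade_window(frame1, frame2):
    """Estimate one (u, v) flow vector for the whole window by least squares."""
    Iy, Ix = np.gradient(frame1)  # spatial gradients (rows = y, columns = x)
    It = frame2 - frame1          # temporal difference
    # Normal equations of the brightness constancy constraint Ix*u + Iy*v + It = 0.
    A = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    b = -np.array([np.sum(Ix * It), np.sum(Iy * It)])
    u, v = np.linalg.solve(A, b)
    return u, v

frame1 = gaussian_blob(64, cx=32, cy=32, sigma=6)
frame2 = gaussian_blob(64, cx=33, cy=32, sigma=6)  # blob moved 1 px to the right
u, v = lucas_kanade_window(frame1, frame2)         # expected to be close to (1, 0)
```

Stacking such flow fields over consecutive frames gives the kind of input Program #3 would feed to a CNN.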
Happy Mastering DL!!!