"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;
Showing posts with label Action Recognition. Show all posts

June 06, 2020

Learning Notes - Action Recognition - Part II

Paper #1 - Learning Motion in Feature Space: Locally-Consistent Deformable Convolution Networks for Fine-Grained Action Detection

Key Notes
  • Extraction of local spatio-temporal features followed by temporal modeling
Spatio-temporal feature extraction
  • Sample consecutive frames
  • Optical flow for temporal modeling
  • Improved Dense Trajectories (IDT), Motion History Image (MHI)
Network Architecture
  • Bi-directional LSTM
  • Spatial-temporal CNN (STCNN) with Segmentation models
  • Temporal convolutional networks (TCN)
  • Temporal deformable residual networks (TDRN) 
Different Convolution Strategies
  • Standard convolution - uses filters with a fixed, box-shaped sampling grid
  • Dilated convolution - expands the filter by inserting zeros between its weights, which enlarges the receptive field without adding parameters
  • Keras example: out = Conv2D(10, (3, 3), dilation_rate=2)(input_tensor)
  • Deformable convolution - learns offsets for the filter's sampling positions, so the filter shape adapts to the input
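To make the zero-filling idea behind dilated convolution concrete, here is a minimal pure-Python sketch in 1-D. The helper names `dilate_kernel` and `conv1d_valid` are made up for illustration and are not from the paper or from Keras.

```python
def dilate_kernel(kernel, rate):
    """Insert (rate - 1) zeros between neighbouring weights of a 1-D kernel."""
    if rate < 1:
        raise ValueError("rate must be >= 1")
    dilated = []
    for i, w in enumerate(kernel):
        dilated.append(w)
        if i < len(kernel) - 1:
            dilated.extend([0] * (rate - 1))
    return dilated


def conv1d_valid(signal, kernel):
    """'Valid' cross-correlation, the operation deep learning conv layers compute."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


kernel = [1, 2, 3]
dilated = dilate_kernel(kernel, rate=2)   # [1, 0, 2, 0, 3]
signal = [1, 2, 3, 4, 5, 6, 7]
out = conv1d_valid(signal, dilated)       # [22, 28, 34]
print(dilated, out)
```

A kernel of size k with dilation rate r covers k + (k - 1)(r - 1) input positions, so the 3-tap kernel above spans 5 samples while still having only 3 learnable weights.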
Implementation
  • Downsampled to 6fps
  • Frames were resized to 224x224 and augmented using random cropping and mean removal
  • Each video snippet contained 16 frames after sampling
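The sampling steps above can be sketched roughly as follows. The 30 fps source rate, the non-overlapping snippets, and the function names are assumptions for illustration; the notes only state the 6 fps target and the 16-frame snippet length.

```python
def downsample_indices(num_frames, src_fps, dst_fps=6):
    """Indices of the source frames kept when reducing src_fps to dst_fps."""
    if dst_fps > src_fps:
        raise ValueError("dst_fps must not exceed src_fps")
    n_out = (num_frames * dst_fps) // src_fps
    return [(i * src_fps) // dst_fps for i in range(n_out)]


def make_snippets(indices, length=16):
    """Split frame indices into non-overlapping snippets; drop an incomplete tail."""
    return [
        indices[i:i + length]
        for i in range(0, len(indices) - length + 1, length)
    ]


# Hypothetical 10-second clip recorded at 30 fps (300 frames).
kept = downsample_indices(300, src_fps=30, dst_fps=6)
snippets = make_snippets(kept, length=16)
print(len(kept), len(snippets), snippets[0][:3])
```

With these assumed numbers, every fifth frame is kept (60 frames), which yields three full 16-frame snippets and a discarded 12-frame tail.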
Key Notes
  • Generative Adversarial Network (GAN) to generate exact joint locations from noisy probability heat maps
  • Detection classification is applied to a continuous sequence of videos of multiple activities
  • Generative adversarial network (GAN) to produce potential body joint locations in an unsupervised manner
Features
  • Optical flow (OF) and feature matching
  • Picking from shelf vs putting back
  • Joint location estimation results using GAN-based approach.
  • Actions - Reach, Retract, Hand in, Insp. Product, Insp. Shelf
  • A similar approach to keypoint detection on fashion datasets can be leveraged here too

Key Notes
  • Temporal Convolutional Networks (TCNs)
  • Two types of TCNs 
  • First, the Encoder-Decoder TCN (ED-TCN) uses only a hierarchy of temporal convolutions, pooling, and upsampling, yet can efficiently capture long-range temporal patterns
  • Second, the Dilated TCN uses dilated convolutions instead of pooling and upsampling
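One way to see why dilation helps with long-range temporal patterns is to compute the receptive field of a stack of temporal convolutions. This is a small sketch; the kernel size of 2 and the doubling dilation schedule are assumptions chosen for illustration, not values taken from the paper.

```python
def receptive_field(kernel_size, dilations):
    """Number of input time steps seen by one output step of stacked 1-D convs."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


plain = receptive_field(2, [1, 1, 1, 1])     # 5
dilated = receptive_field(2, [1, 2, 4, 8])   # 16
print(plain, dilated)
```

With the same four layers and the same number of weights, the dilated stack sees 16 time steps instead of 5, and the gap grows exponentially as more layers are added.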
Code Temporal Convolutional Networks
More Reads
An introduction to ConvLSTM
Keras Convolutional LSTM network
Dense-Optical-Flow
Anomaly Detection in Videos using LSTM Convolutional Autoencoder
Attention Based CNN-ConvLSTM for Pedestrian Attribute Recognition

December 01, 2018

Day #157 - Video Analysis using Deep Learning - Research papers - Action Recognition

Paper 1 - Action Classification and Highlighting in Videos

A limitation of RNNs is their inability to backpropagate error across long temporal intervals (the vanishing gradient problem)

Key Summary notes from paper
  • End-to-end encoder-decoder LSTM framework with the built-in attention mechanism, LSTM decoder is equipped with an attention/alignment model
  • Encodes a video into a temporal sequence of visual representations and chooses an adaptively weighted subset of that sequence for prediction
  • Classify actions and highlight frames associated with the action
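The "adaptively weighted subset" idea can be sketched with a softmax over per-frame relevance scores. In the paper the scores come from a learned attention/alignment model; here they are hard-coded, and the function names and toy features are assumptions for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(frame_features, scores):
    """Return attention weights and the weighted sum of the frame features."""
    weights = softmax(scores)
    dim = len(frame_features[0])
    context = [
        sum(w * f[d] for w, f in zip(weights, frame_features))
        for d in range(dim)
    ]
    return weights, context


# Toy example: three frames with 2-D features; the third frame scores highest.
frame_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 2.0]   # hypothetical alignment scores
weights, context = attend(frame_features, scores)
highlight = max(range(len(weights)), key=weights.__getitem__)
print(weights, context, highlight)
```

The context vector would feed the classifier, while the weights themselves indicate which frames to highlight for the predicted action.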
Implementation Learnings
  • CNN Encoder - Set of frames passed to extract features, VGGNet used in this case
  • Action Model - Feedforward network plus LSTM Decoder
Real World Implementation Article - Video Analysis to Detect Suspicious Activity Based on Deep Learning

Key Summary
  • Use Transfer Learning to extract features
  • Pass the data to new RNN
  • Perform Classification on it
Key Lessons
  • Extract frames from video
  • Use Inception network to generate features
  • Set of 15 frames used to compute action and aggregate value
  • Pass the 15 frames value to RNN (LSTM)
  • Perform Action Classification
Implementation Approach #2 - Five video classification methods implemented in Keras and TensorFlow
  • I liked the approach of combination of CNN and RNN
Presentation #1 - Multi-Dimensional LSTM Networks for Video Prediction

Key Lessons
  • Standard LSTM, Bidirectional LSTM
  • Parallel Multi-Dimensional LSTM
  • Convolutional LSTM for video prediction
  • Inputs, states, and gates of a Convolutional LSTM are 3D tensors
  • 20 Convolutional LSTM layers + 2 skip connections
Paper #2 - What is Convolutional LSTM ?

Key Lessons
  • Extending Fully Connected LSTM to have convolutional structures in both input to state and state to state transitions
  • LSTM encoder-decoder framework proposed in [23] provides a general framework for sequence-to-sequence learning problems by training temporally concatenated LSTMs
  • In ConvLSTM, the inputs, cell outputs, hidden states, and gates are 3D tensors whose last two dimensions are spatial dimensions (rows and columns)
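For reference, the ConvLSTM update can be written as below, as I recall it from the ConvLSTM paper. Here $*$ denotes convolution and $\circ$ the Hadamard (element-wise) product; $X_t$ is the input, $H_t$ the hidden state, $C_t$ the cell state, and $i_t$, $f_t$, $o_t$ the gates.

```latex
\begin{align}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f\right) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o\right) \\
H_t &= o_t \circ \tanh\left(C_t\right)
\end{align}
```

Compared with the fully connected LSTM, the only change is that the matrix products in the input-to-state and state-to-state terms are replaced by convolutions, which is what preserves the spatial layout of the tensors.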
Paper #3 - Exploiting Objects with LSTMs for Video Categorization
Key Summary
  • A CNN takes a single frame or optical flow image as its input and hence fails to consider temporal coherence in videos
  • To exploit long term temporal dynamics recent studies adopted LSTM
  • First level CNN used to extract high level objects, then they are utilized by LSTM to capture temporal dynamics in videos
Paper #4 - Tracking of Humans in Video Stream Using LSTM Recurrent Neural Network
Key Summary
  • Yolo + LSTM = ROLO
  • Input frame - Yolo Features - Spatial constraint detection - Temporal constraint LSTM - Prediction

Paper #5 - Beyond Short Snippets: Deep Networks for Video Classification
Key Summary
  • RNN that uses LSTM cells connected to the output of the underlying CNN
  • LSTM cells operate on frame-level CNN activations
  • Captures the video's temporal evolution
Different feature pooling strategies
  • Conv Pooling
  • Late Pooling
  • Slow Pooling
  • Local Pooling
  • GoogLeNet Conv Pooling
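The pooling variants differ mainly in where max-pooling over time is applied relative to the fully connected layers. Below is a minimal sketch of the temporal max-pooling operation itself, on toy per-frame feature vectors. The function names and numbers are made up for illustration, and the fully connected layers that sit between pooling stages in the paper are left out.

```python
def max_pool_over_time(frame_features):
    """Element-wise max across all frames: one descriptor for the whole clip."""
    return [max(column) for column in zip(*frame_features)]


def local_max_pool(frame_features, window):
    """Max-pool within consecutive, non-overlapping temporal windows."""
    return [
        max_pool_over_time(frame_features[i:i + window])
        for i in range(0, len(frame_features), window)
    ]


# Four frames, each with a hypothetical 3-D CNN feature vector.
frames = [
    [0.1, 0.9, 0.3],
    [0.7, 0.2, 0.4],
    [0.5, 0.6, 0.8],
    [0.2, 0.1, 0.0],
]
video_descriptor = max_pool_over_time(frames)   # [0.7, 0.9, 0.8]
local = local_max_pool(frames, window=2)        # [[0.7, 0.9, 0.4], [0.5, 0.6, 0.8]]
two_stage = max_pool_over_time(local)           # [0.7, 0.9, 0.8]
print(video_descriptor, local, two_stage)
```

In this stripped-down form the two-stage result equals the single global max-pool; the strategies only start to differ once learned layers are placed between the pooling stages.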
More Reads
Real-Time Recurrent Regression Networks for Visual Tracking of Generic Objects
Online Video Object Detection using Association LSTM

More References
https://github.com/harvitronix/five-video-classification-methods
https://github.com/harvitronix/continuous-online-video-classification-blog
https://github.com/tencia/video_predict
https://github.com/sagarvegad/Video-Classification-CNN-and-LSTM-
http://blog.qure.ai/notes/deep-learning-for-videos-action-recognition-review
https://github.com/Guanghan/ROLO
https://www.pyimagesearch.com/2019/07/15/video-classification-with-keras-and-deep-learning/

Action Recognition 
  • Static Action Recognition 
  • Video action recognition - Optical flow between frames
  • Stitch Multiple Frames and evaluate with CNN

Session - Link
Key Lessons
  • Video is a stack of frames
  • Sports-1M and UCF-101 datasets
  • Preprocess / Crop to a fixed size
  • Frame-based object detections
  • Late Fusion - frames spaced widely apart (15 frames)
  • Overlapping patches
  • One on Centre of object
  • Low-resolution frame
  • Data Augmentation
  • Resize / Rotate


Happy Learning!!!