"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

December 01, 2018

Day #157 - Video Analysis using Deep Learning - Research papers - Action Recognition

Paper 1 - Action Classification and Highlighting in Videos

A limitation of RNNs is their inability to backpropagate error through long temporal intervals (the vanishing gradient problem)
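To see where the effect comes from, consider a plain RNN with hidden state h_t = tanh(W h_{t-1} + U x_t). This particular form is an assumption chosen for illustration and is not taken from the paper. The gradient of the loss with respect to an early hidden state h_k is a product of per-step Jacobians:

```latex
\frac{\partial \mathcal{L}}{\partial h_k}
  = \frac{\partial \mathcal{L}}{\partial h_T}
    \prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}},
\qquad
\frac{\partial h_t}{\partial h_{t-1}}
  = \operatorname{diag}\!\left(1 - h_t^{2}\right) W
```

If the norm of every factor is bounded by some \gamma < 1, the product shrinks at least as fast as \gamma^{T-k}, so frames far back in time contribute almost nothing to the weight update. LSTM gating is the usual way to soften this.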

Key Summary notes from paper
  • End-to-end encoder-decoder LSTM framework with a built-in attention mechanism: the LSTM decoder is equipped with an attention/alignment model
  • Encodes a video into a temporal sequence of visual representations and chooses an adaptively weighted subset of that sequence for prediction
  • Classify actions and highlight frames associated with the action
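The "adaptively weighted subset" idea can be pictured with a small NumPy sketch. This is not the paper's model: the scoring function here is a plain dot product, `query` is a stand-in for the decoder's hidden state, and the names `attend`, `frame_feats` and `highlight` are made up for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(frame_feats, query):
    """Return attention weights over frames and the weighted context vector.

    frame_feats: (T, D) array, one feature vector per frame (e.g. CNN output).
    query:       (D,) array, hypothetical stand-in for the decoder state.
    """
    scores = frame_feats @ query      # (T,) one relevance score per frame
    weights = softmax(scores)         # (T,) non-negative, sums to 1
    context = weights @ frame_feats   # (D,) weighted summary of the frames
    return weights, context

rng = np.random.default_rng(0)
frame_feats = rng.normal(size=(8, 16))   # 8 frames, 16-dim toy features
query = rng.normal(size=16)

weights, context = attend(frame_feats, query)
# Frames with the largest weights are the ones a model would "highlight".
highlight = np.argsort(weights)[::-1][:3]
print(weights.round(3), highlight)
```

The same weights serve two purposes: they build the context vector used for classification, and they rank the frames for highlighting.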
Implementation Learnings
  • CNN Encoder - Frames are passed through a CNN to extract features (VGGNet in this case)
  • Action Model - Feedforward network plus LSTM Decoder
Real World Implementation Article - Video Analysis to Detect Suspicious Activity Based on Deep Learning

Key Summary
  • Use transfer learning to extract features
  • Pass the features to a new RNN
  • Perform classification on the RNN output
Key Lessons
  • Extract frames from video
  • Use Inception network to generate features
  • A set of 15 frames is used to compute the action and an aggregate value
  • Pass the features of those 15 frames to an RNN (LSTM)
  • Perform Action Classification
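The grouping step in the lessons above can be sketched as follows. The function name `make_sequences` and the 2048-wide feature size are assumptions for illustration (2048 is a common size for pooled Inception features). The article's actual code may window the frames differently.

```python
import numpy as np

def make_sequences(frame_feats, seq_len=15):
    """Split per-frame features (N, D) into chunks of shape (num_seq, seq_len, D).

    Chunks do not overlap. Trailing frames that do not fill a whole
    chunk are dropped.
    """
    n_frames, dim = frame_feats.shape
    num_seq = n_frames // seq_len
    usable = frame_feats[: num_seq * seq_len]
    return usable.reshape(num_seq, seq_len, dim)

# Stand-in for CNN features: 100 frames, 2048 values each (assumed size).
frame_feats = np.zeros((100, 2048), dtype=np.float32)
sequences = make_sequences(frame_feats)
print(sequences.shape)  # (6, 15, 2048)
```

Each row of `sequences` has the (time steps, features) layout that an LSTM layer typically expects for one training sample.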
Implementation Approach #2 - Five video classification methods implemented in Keras and TensorFlow
  • I liked the approach that combines a CNN with an RNN
Presentation #1 - Multi-Dimensional LSTM Networks for Video Prediction

Key Lessons
  • Standard LSTM, Bidirectional LSTM
  • Parallel Multi-Dimensional LSTM
  • Convolutional LSTM for video prediction
  • In a Convolutional LSTM, the inputs, cell outputs, and states are 3D tensors
  • 20 Convolutional LSTM layers + 2 skip connections
Paper #2 - What is Convolutional LSTM ?

Key Lessons
  • Extends the fully connected LSTM (FC-LSTM) to have convolutional structures in both the input-to-state and state-to-state transitions
  • The LSTM encoder-decoder framework proposed in [23] provides a general framework for sequence-to-sequence learning problems by training temporally concatenated LSTMs
  • In ConvLSTM, the inputs, cell outputs, hidden states, and gates are 3D tensors whose last two dimensions are spatial dimensions (rows and columns)
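For reference, the ConvLSTM update is usually written as below, where * is convolution and \circ is the element-wise (Hadamard) product. This is the commonly cited peephole form, reproduced from memory and not copied from these notes, so check the symbols against the paper before relying on them.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f\right) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o\right) \\
H_t &= o_t \circ \tanh\left(C_t\right)
\end{aligned}
```

Replacing each convolution with a matrix product and flattening the tensors to vectors gives back the fully connected LSTM. That is the sense in which ConvLSTM "extends" it.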
Paper #3 - Exploiting Objects with LSTMs for Video Categorization
Key Summary
  • A CNN takes a frame or an optical flow image as its input, and hence fails to consider temporal coherence in videos
  • To exploit long-term temporal dynamics, recent studies have adopted LSTMs
  • A first-level CNN extracts high-level objects, which are then used by an LSTM to capture temporal dynamics in videos
Paper #4 - Tracking of Humans in Video Stream Using LSTM Recurrent Neural Network
Key Summary
  • YOLO + LSTM = ROLO
  • Input frame - YOLO features - Spatial constraint detection - Temporal constraint LSTM - Prediction

Paper #5 - Beyond Short Snippets: Deep Networks for Video Classification
Key Summary
  • An RNN built from LSTM cells is connected to the output of the underlying CNN
  • The LSTM cells operate on frame-level CNN activations
  • Captures the video's temporal evolution
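To make "LSTM cells operating on frame-level CNN activations" concrete, here is a minimal NumPy sketch of a single LSTM layer run over a sequence of frame features. The weights are random and untrained, the gate ordering is an arbitrary choice, and the sizes are toy values. It only shows the data flow, not the paper's architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(frame_feats, W, U, b):
    """Run one LSTM layer over frame features and return the final hidden state.

    frame_feats: (T, D) frame-level CNN activations.
    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) biases,
    stacked in the order input gate, forget gate, output gate, candidate.
    """
    H = U.shape[1]
    h = np.zeros(H)  # hidden state
    c = np.zeros(H)  # cell state
    for x in frame_feats:
        z = W @ x + U @ h + b
        i = sigmoid(z[0:H])            # input gate
        f = sigmoid(z[H:2 * H])        # forget gate
        o = sigmoid(z[2 * H:3 * H])    # output gate
        g = np.tanh(z[3 * H:4 * H])    # candidate cell update
        c = f * c + i * g
        h = o * np.tanh(c)
    return h

rng = np.random.default_rng(0)
T, D, H = 10, 32, 8                    # frames, feature size, hidden size
frame_feats = rng.normal(size=(T, D))
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)

h_final = lstm_forward(frame_feats, W, U, b)
print(h_final.shape)  # (8,)
```

A classifier layer on top of `h_final` (or on every intermediate `h`) would turn this video-level summary into action scores.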
Different feature pooling strategies
  • Conv Pooling
  • Late Pooling
  • Slow Pooling
  • Local Pooling
  • GoogLeNet Conv Pooling
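What these strategies share is a max-pooling step over the time axis that collapses per-frame features into one video-level descriptor. They differ mainly in where that pooling sits relative to the other layers. A minimal sketch of the pooling step alone, with a hypothetical function name and toy numbers:

```python
import numpy as np

def temporal_max_pool(frame_feats):
    """Collapse (T, D) frame-level features into one (D,) video-level vector."""
    return frame_feats.max(axis=0)

# Three frames, three feature channels each (toy values).
frame_feats = np.array([[0.1, 0.9, 0.3],
                        [0.7, 0.2, 0.4],
                        [0.5, 0.6, 0.8]])
video_feat = temporal_max_pool(frame_feats)
print(video_feat)  # [0.7 0.9 0.8]
```

Because the maximum ignores frame order, this descriptor keeps which features fired but not when. That is the gap the LSTM variant is meant to fill.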
More Reads
Real-Time Recurrent Regression Networks for Visual Tracking of Generic Objects
Online Video Object Detection using Association LSTM

More References
https://github.com/harvitronix/five-video-classification-methods
https://github.com/harvitronix/continuous-online-video-classification-blog
https://github.com/tencia/video_predict
https://github.com/sagarvegad/Video-Classification-CNN-and-LSTM-
http://blog.qure.ai/notes/deep-learning-for-videos-action-recognition-review
https://github.com/Guanghan/ROLO
https://www.pyimagesearch.com/2019/07/15/video-classification-with-keras-and-deep-learning/

Action Recognition 
  • Static Action Recognition 
  • Video action recognition - Optical flow between frames
  • Stitch Multiple Frames and evaluate with CNN
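One simple reading of "stitch multiple frames" is to concatenate consecutive frames along the channel axis so a 2D CNN sees several time steps at once. The sketch below assumes that reading. The function name and frame sizes are illustrative only.

```python
import numpy as np

def stack_frames(frames):
    """Concatenate T frames of shape (H, W, C) into one (H, W, T*C) array."""
    return np.concatenate(frames, axis=-1)

# Five tiny RGB frames; frame t is filled with the value t so the
# channel ordering is easy to inspect.
frames = [np.full((4, 4, 3), t, dtype=np.uint8) for t in range(5)]
stacked = stack_frames(frames)
print(stacked.shape)  # (4, 4, 15)
```

The first convolution layer of the network then needs 15 input channels in this example, where a single RGB frame would need 3.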

Session - Link
Key Lessons
  • Video is a stack of frames
  • Sports-1M and UCF-101 datasets
  • Preprocess / Crop to a fixed size
  • Frame-based object detections
  • Late Fusion - Frames spaced widely apart (15 frames)
  • Overlapping patches
  • One patch on the centre of the object
  • Low-resolution frame
  • Data Augmentation
  • Resize / Rotate
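The preprocessing items above (fixed-size crop, centre patch, low-resolution frame, rotation) can be sketched in plain NumPy. The sizes are arbitrary, the helper names are hypothetical, and the downsampling is a naive stride with no smoothing. A real pipeline would likely use an image library for resizing.

```python
import numpy as np

def center_crop(frame, size):
    """Cut a size x size window from the middle of an (H, W, ...) frame."""
    h, w = frame.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return frame[top:top + size, left:left + size]

def downsample(frame, factor):
    """Naive low-resolution copy: keep every `factor`-th pixel."""
    return frame[::factor, ::factor]

frame = np.zeros((240, 320, 3), dtype=np.uint8)  # one video frame
square = center_crop(frame, 224)     # crop to a fixed size
centre = center_crop(square, 112)    # full-resolution patch on the centre
low_res = downsample(square, 2)      # low-resolution view of the whole crop
rotated = np.rot90(square)           # simple 90-degree rotation augmentation

print(square.shape, centre.shape, low_res.shape, rotated.shape)
```

Here `centre` and `low_res` have the same shape, so two network streams with identical input sizes could process the detail view and the context view side by side.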


Happy Learning!!!
