Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): Day #157 - Video Analysis using Deep Learning - Research papers

December 01, 2018

Day #157 - Video Analysis using Deep Learning - Research papers - Action Recognition

Paper 1 - Action Classification and Highlighting in Videos

Limitation of RNN is the inability to backpropagate error through long-range temporal interval (a problem known as vanishing gradient effect)

Key Summary notes from paper

End-to-end encoder-decoder LSTM framework with the built-in attention mechanism, LSTM decoder is equipped with an attention/alignment model
Encodes a video into a temporal sequence of visual representations and chooses an adaptively wighted subset of that sequence for prediction
Classify actions and highlight frames associated with the action

Implementation Learnings

CNN Encoder - Set of frames passed to extract features, VGGNet used in this case
Action Model - Feedforward network plus LSTM Decoder

Real World Implementation Article - Video Analysis to Detect Suspicious Activity Based on Deep Learning

Key Summary

Use Transfer Learning to extract features
Pass the data to new RNN
Perform Classification on it

Key Lessons

Extract frames from video
Use Inception network to generate features
Set of 15 frames used to compute action and aggregate value
Pass the 15 frames value to RNN (LSTM)
Perform Action Classification

Implementation Approach #2 - Five video classification methods implemented in Keras and TensorFlow

I liked the approach of combination of CNN and RNN

Presentation #1 - Multi-Dimensional LSTM Networks for Video Prediction

Key Lessons

Standard LSTM, Bidirectional LSTM
Parallel Multi-Dimensional LSTM
Convolutional LSTM for video prediction
Convolutional LSTM are 3D Tensors
20 Convolutional LSTM layers + 2 skip connections

Paper #2 - What is Convolutional LSTM ?

Key Lessons

Extending Fully Connected LSTM to have convolutional structures in both input to state and state to state transitions
LSTM encoder-decoder framework proposed in [23] provides a general framework for sequence-to-sequence learning problems by training temporally concatenated LSTMs
ConvLSTM are 3D tensors whose last two dimensions are spatial dimensions (rows and columns)

Paper #3 - Exploiting Objects with LSTMs for Video Categorization
Key Summary

CNN takes frame / optional flow image as its input, hence fails to consider temporal coherence in videos
To exploit long term temporal dynamics recent studies adopted LSTM
First level CNN used to extract high level objects, then they are utilized by LSTM to capture temporal dynamics in videos

Paper #4 - Tracking of Humans in Video Stream Using LSTM Recurrent Neural Network
Key Summary

Yolo + LSTM = ROLO
Input frame - Yolo Features - Spatial constraint detection - Temporal constraint LSTM - Prediction

Paper #5 - Beyond Short Snippets: Deep Networks for Video Classification
Key Summary

RNN that uses LSTM cells that are connected to the output of underlying CNN
LSTM cells operates on frame level CNN activations
Capture videos temporal evolution

Different Feature pooling strategy

Conv Pooling
Late Pooling
Slow Pooling
Local Pooling
GoogLeNet Conv Pooling

More Reads
Real-Time Recurrent Regression Networks for Visual Tracking of Generic Objects
Online Video Object Detection using Association LSTM

More References
https://github.com/harvitronix/five-video-classification-methods
https://github.com/harvitronix/continuous-online-video-classification-blog
https://github.com/tencia/video_predict
https://github.com/sagarvegad/Video-Classification-CNN-and-LSTM-
http://blog.qure.ai/notes/deep-learning-for-videos-action-recognition-review
https://github.com/Guanghan/ROLO
https://www.pyimagesearch.com/2019/07/15/video-classification-with-keras-and-deep-learning/
https://github.com/sagarvegad/Video-Classification-CNN-and-LSTM-