Only the key summary points. These are selected lines (copied from the papers) for my quick reference and understanding.
YOLO Notes
- Resize Image
- Run CNN (A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes)
- Non-max Suppression
Alternative Techniques
- Sliding window and region proposal-based techniques
Implementation Details
- YOLO sees the entire image during training and test time so it encodes contextual information about classes as well as their appearance
- YOLO divides the input image into an S × S grid
- If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
- Each bounding box consists of 5 predictions: x, y, w, h, and confidence
- Network architecture is inspired by the GoogLeNet model for image classification
- YOLO predicts multiple bounding boxes per grid cell
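A small sketch to make the grid idea above concrete for myself. It is not the paper's code. The function names are made up, and I am assuming (from my reading of the paper) that x, y are offsets inside the responsible cell, that w, h are fractions of the whole image, and that each cell outputs B boxes of 5 values plus C class probabilities. The values S=7, B=2, C=20 are only illustrative.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell that contains the box center."""
    # Clamp so a center lying exactly on the right/bottom edge stays in the grid.
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col


def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Encode a pixel-space box as (row, col, x, y, w, h).

    x, y: center offset within the responsible cell, in [0, 1] (assumed).
    w, h: box size as a fraction of the image size (assumed).
    """
    row, col = responsible_cell(cx, cy, img_w, img_h, S)
    x = cx / img_w * S - col
    y = cy / img_h * S - row
    return row, col, x, y, w / img_w, h / img_h


def output_size(S, B, C):
    """Number of values predicted per image: S*S cells, each B*5 + C values."""
    return S * S * (B * 5 + C)


if __name__ == "__main__":
    # A box centered in the middle of a 448x448 image.
    print(encode_box(224, 224, 100, 50, 448, 448))
    print(output_size(7, 2, 20))
```

With these illustrative values, the center of the image falls in the middle cell of the 7 × 7 grid, and only that cell is responsible for the object.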
Limitations of YOLO
- Struggles to generalize to objects in new or unusual aspect ratios or configurations
Other Detection Systems
- Haar, SIFT, HOG, convolutional features
What is Non-max Suppression
All modern object detectors follow a three-step recipe:
(1) proposing a search space of windows (exhaustive by sliding window or sparser using proposals),
(2) scoring/refining the window with a classifier/regressor, and
(3) merging windows that might belong to the same object.
Non-Max Suppression - The algorithm greedily selects high scoring detections and deletes close-by less confident neighbours since they are likely to cover the same object
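To check my understanding of the greedy procedure described above, here is a minimal pure-Python sketch. The corner box format (x1, y1, x2, y2), the function names, and the 0.5 IoU threshold are my own assumptions, not something taken from the papers.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-max suppression; returns indices of the kept boxes."""
    # Visit detections from most to least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any
        # higher-scoring box that was already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))
```

In the example, the second box overlaps the first heavily (IoU about 0.68) and has a lower score, so it is dropped. The third box is far away and survives.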
R-CNN [10] - Replaced feature extraction and classifiers with a neural network
Related work - Viola & Jones, deformable parts model (DPM), and clustering algorithms (mean-shift, agglomerative, and affinity propagation clustering)
Deformable parts models (DPM) - Use a sliding window approach to object detection
R-CNN - Uses region proposals instead of sliding windows to find objects in images. Selective Search [34] generates potential bounding boxes, a convolutional network extracts features, an SVM scores the boxes, a linear model adjusts the bounding boxes, and non-max suppression eliminates duplicate detections.
SSD Notes
- Discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location
- The core of SSD is predicting category scores and box offsets for a fixed set of default bounding boxes using small convolutional filters applied to feature maps
- Based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes
- Ground truth information needs to be assigned to specific outputs in the fixed set of detector outputs
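A rough sketch of how I picture the default boxes and the ground truth assignment from the points above. This is a simplification, not the SSD implementation. The function names are hypothetical, boxes use normalized [0, 1] coordinates, and I am assuming each default box has width scale * sqrt(ar) and height scale / sqrt(ar). For matching, I assume every ground truth box takes its best-overlapping default box plus any default box with IoU above 0.5. Conflicts between several ground truth boxes are resolved naively here.

```python
import math


def default_boxes(fmap_size, scale, aspect_ratios):
    """Default boxes (cx, cy, w, h) for a square fmap_size x fmap_size feature map."""
    boxes = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            # Center of this feature map location in normalized coordinates.
            cx = (col + 0.5) / fmap_size
            cy = (row + 0.5) / fmap_size
            for ar in aspect_ratios:
                boxes.append((cx, cy, scale * math.sqrt(ar), scale / math.sqrt(ar)))
    return boxes


def to_corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match(gt_boxes, defaults, threshold=0.5):
    """For each default box, return the index of its ground truth box, or -1.

    gt_boxes are given in corner format (x1, y1, x2, y2).
    """
    corners = [to_corners(d) for d in defaults]
    assignment = [-1] * len(defaults)
    for g, gt in enumerate(gt_boxes):
        overlaps = [iou(gt, c) for c in corners]
        # Any still-unassigned default box with enough overlap is a positive.
        for d, overlap in enumerate(overlaps):
            if overlap > threshold and assignment[d] == -1:
                assignment[d] = g
        # The best-overlapping default box is always assigned to this ground truth.
        best = max(range(len(defaults)), key=lambda d: overlaps[d])
        assignment[best] = g
    return assignment


if __name__ == "__main__":
    defaults = default_boxes(fmap_size=2, scale=0.5, aspect_ratios=[1.0, 2.0])
    print(len(defaults))
    print(match([(0.0, 0.0, 0.5, 0.5)], defaults))
```

Default boxes left at -1 would be treated as background. The point of the fixed set is that every ground truth box ends up attached to specific detector outputs before the loss is computed.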
I need to revisit these papers over the next couple of months to understand them better.
Happy Learning!!!