- The evolution of vision can be traced back about 540 million years
- Human vision has, in effect, been trained over 540 million years of evolution
- Visual processing in the brain involves a hierarchy of layers
- To a computer, images are numbers
- Pixels of a grayscale image are represented by a 2D array of numbers
- An RGB image is a 3D array (height × width × 3 color channels)
- Computer vision tasks
- Regression (the output takes a continuous value)
- Classification (the output is a single class label)
- Detect the presence of features in a particular image
- Manual feature extraction relies on domain knowledge
- Define the features by hand
- Detect the features and classify
- Occlusion
- Viewpoint variation
- Scale variation
- Deformation
- Background clutter
- Intra-class variation
- Illumination conditions
- Learn a hierarchy of features directly from image data
- Low level (edges, dark spots)
- Mid level (eyes, ears, nose)
- High level (facial structures)
- Fully connected network: multiple hidden layers
- Input: a 2D image flattened into a vector of pixel values
- Flattening discards all spatial information
- Each neuron in a hidden layer is connected to every neuron in the input layer
- Slide a patch window across the image so that spatial structure is taken into account
- Apply a set of weights (a filter) to each patch to extract local features
- Use multiple filters, each with its own set of weights
- This patch-wise operation is known as convolution
- Convolution preserves the spatial relationship between pixels
- Elementwise multiplication between the patch and the filter, followed by summing the products
- Different filters produce different effects, such as sharpening or edge detection
- Use multiple filters to extract different features
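
The patch-sliding operation in the notes above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper name `convolve2d`, the 5x5 toy image, and the hand-picked vertical edge filter are all hypothetical and not from the original notes. In a CNN the filter weights would be learned rather than chosen by hand, and a real implementation would use an array library instead of nested loops.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding.

    At each position, multiply the patch and the kernel elementwise and
    sum the products. Like most CNN libraries, this does not flip the
    kernel, so strictly speaking it computes a cross-correlation.
    """
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = img_h - k_h + 1, img_w - k_w + 1

    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(k_h):
                for dj in range(k_w):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 5x5 grayscale image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]

# Flattening turns the image into a 25-element vector; which pixels were
# neighbours is no longer visible in that representation.
flat = [pixel for row in image for pixel in row]

# Hand-picked filter (an assumption for illustration) that responds to
# dark-to-bright transitions from left to right.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(image, vertical_edge)
for row in feature_map:
    print(row)
# [0, 3, 3]
# [0, 3, 3]
# [0, 3, 3]
```

The feature map is zero where the patch is uniformly dark and positive where the patch straddles the edge, so the output still records where in the image the feature occurs.
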
CNNs for Classification
- Convolution: apply filters with learned weights to generate feature maps
- Non-linearity: often ReLU (image data is highly non-linear)
- Pooling: a downsampling operation applied to each feature map
- Train the model to learn the filter weights
- Each neuron sees only a patch of the inputs
- Apply a matrix of weights to the patch: multiply elementwise and sum
- Depth of the output volume = number of filters
- ReLU: a pixel-by-pixel operation that replaces all negative values with zero (a non-linear operation)
- Pooling: reduces dimensionality while preserving spatial invariance (a downsampling operation)
- Stack these layer operations to learn a hierarchy of features
- Feature learning pipeline followed by a classification stage
- ImageNet dataset: 14 million images
- 21,841 categories
- Deeper networks: how deep can we go?
- New architectures beyond feature learning
- Semantic segmentation (fully convolutional network): downsampling and upsampling operations in an encoder-decoder, e.g. driving-scene segmentation
- Object detection: generate region proposals and classify each one; takes a really long time to compute
- Image captioning: generate semantic content by removing the fully connected layers and replacing them with an RNN
- CNN feature layers + RNN (trained to predict words that describe the image)
- CAM (class activation map)
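
The ReLU and pooling steps listed under "CNNs for Classification" can be sketched the same way. This is a minimal illustration, not a reference implementation: the function names `relu` and `max_pool`, the choice of 2x2 non-overlapping max pooling, and the 4x4 feature map values are assumptions made for the example (the notes only say "pooling" and "downsampling").

```python
def relu(feature_map):
    """Replace every negative value with zero, element by element."""
    return [[max(0, value) for value in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping size x size max pooling.

    Keeps the largest value in each window. Trailing rows or columns
    that do not fill a whole window are dropped.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, height - size + 1, size):
        row = []
        for j in range(0, width - size + 1, size):
            window = [
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map, as one convolution filter might produce.
feature_map = [
    [1, -2, 3, 0],
    [-1, 5, -3, 2],
    [0, -4, 2, -1],
    [4, 1, -2, -6],
]

activated = relu(feature_map)
pooled = max_pool(activated, size=2)

print(activated)  # [[1, 0, 3, 0], [0, 5, 0, 2], [0, 0, 2, 0], [4, 1, 0, 0]]
print(pooled)     # [[5, 3], [4, 2]]
```

Each filter yields one such map, so a layer with several filters produces a stack of maps whose depth equals the number of filters. Pooling halves the height and width here while keeping the strongest response in each region.
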