Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): September 2018

September 21, 2018

Day #133 - Data sets, Data Challenges in Machine Learning

Google blogs and research papers are the mother of all data analysis work. Rather than jumping directly into executing pieces of code, it is very interesting to understand the perspective and practices behind data collection and maintenance. Listed below is a summary of my readings from Google papers and blogs.

Practical advice for analysis of large, complex data sets

Technical - Ideas to Analyse Data

Look at distributions within data
Look at examples to validate understanding
Consider outliers
Check for consistency over time (Validity over period of time)
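
The checks above can be sketched with the Python standard library. This is only an illustration, not code from the Google post: the sample numbers, the `iqr_outliers` helper and the 1.5 x IQR rule are all assumptions.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up daily latency samples (ms), one list per week
week1 = [101, 98, 103, 99, 102, 100, 97]
week2 = [100, 99, 250, 101, 98, 102, 100]

for name, week in (("week1", week1), ("week2", week2)):
    # Look at the distribution (median vs mean), flag outliers,
    # and compare the same statistics across time periods
    print(name,
          "median:", statistics.median(week),
          "mean:", round(statistics.mean(week), 1),
          "outliers:", iqr_outliers(week))
```

Comparing the two weeks side by side is the "consistency over time" check: the medians agree while the mean of the second week is pulled up by a single value.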

Process - Recommendations for Data Collection

Data collection setup
Reproducible
Exploratory Data Analysis

Social - Communicating your insights

Data Analysis starts with questions not with code or data
Accept ignorance and mistakes
Be skeptical
Educate Consumers

Crawling the internet: data science within a large engineering system

Identify and compute the refresh rate pattern of the data and refresh it accordingly

Machine Learning: The High-Interest Credit Card of Technical Debt
Very interesting article on data related risks / challenges.

Unstable Data Dependencies
Underutilized Data Dependencies
Legacy Features
Correction Cascades
When Correlations No Longer Correlate

Happy Learning!!!

September 18, 2018

Day #132 - Sequence Learning Paper

Understanding NLP and deep learning requires understanding the research papers behind them. Listed below are my readings and important points for my reference (copied from the papers)

Paper #1 - Sequence Learning
Captured below is a summary of key points from the Sequence Learning paper

Introduction

Multilayered Long Short-Term Memory (LSTM)
DNNs can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality
We are mapping a sequence of words representing the question to a sequence of words representing the answer
LSTM learns to map an input sentence of variable length into a fixed-dimensional vector representation

Model

Map the input sequence to a fixed-sized vector using one RNN
The goal of the LSTM is to estimate the conditional probability
First, we used two different LSTMs: one for the input sequence and another for the output sequence
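
For reference, the conditional probability mentioned above can be written as follows. Here v denotes the fixed-dimensional representation of the input sequence produced by the first LSTM, and the output length T' may differ from the input length T.

```latex
p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T) = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \ldots, y_{t-1})
```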

Paper #2 - Massive Exploration of Neural Machine Translation Architectures
https://arxiv.org/pdf/1703.03906.pdf

NMT - an end-to-end approach to automated translation

Based on an encoder-decoder architecture consisting of two recurrent neural networks (RNNs) and an attention mechanism that aligns target with source tokens
Shortcoming - amount of compute required to train them

NMT

Encoder-decoder architecture with attention mechanism
An encoder function fenc takes as input a sequence of source tokens x and produces a sequence of states h
Decoder is an RNN that predicts the probability of a target sequence y
The decoder RNN also uses a context vector, called the attention vector, which is calculated as a weighted average of the source states

Attention Mechanism

Commonly used attention mechanisms are the additive variant and the computationally less expensive multiplicative variant
Given an attention key h (an encoder state) and attention query s (a decoder state), the attention score for each pair is calculated
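
A minimal sketch of how the scores turn into a context vector. As an assumption for brevity it uses a plain dot-product score rather than the additive or multiplicative forms with learned weights; the toy states and the function names are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_attention(query, keys):
    """Return (weights, context) for one decoder state and all encoder states."""
    # One score per (query, key) pair; dot product is an assumed scoring function
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(keys[0])
    # Context (attention) vector: weighted average of the encoder states
    context = [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # keys h
decoder_state = [2.0, 0.0]                              # query s
weights, context = dot_attention(decoder_state, encoder_states)
print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```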

Paper #3 - NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE - https://arxiv.org/pdf/1409.0473.pdf

Happy Learning!!!

September 17, 2018

Day #131 - Dataset collection and Standardization process

A very important paper I came across today - Datasheets for Datasets

This paper provides a key checklist for data collection. Some of the important sections to refer to are

Dataset Composition
Data Collection Process
Data Preprocessing
Dataset Distribution
Dataset Maintenance
Legal & Ethical Considerations

A great reference to come back to and use.

Happy Learning!!!

Day #130 - Chatbot Architecture

Goal-Oriented Bots

Narrow Domain
Specific tasks
Example - Call Center
Model - Retrieval Based
Use Predefined responses

General Chat bots

General Conversation
Generative Models
For general entertainment
Generate new responses

Sequence to Sequence

Incoming message - Encoder
Decoder for response
Attention, or at least reversed input
Inputs must have a fixed length, so padding is done

Padding

EOS - End of Sentence
PAD - Filler
GO - Start of decoding
UNK - Unknown word not in vocabulary
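
A small sketch of how these special tokens are typically applied before feeding a batch to a seq2seq model. The token spellings, the `prepare` helper and the toy vocabulary are assumptions for illustration.

```python
PAD, GO, EOS, UNK = "<PAD>", "<GO>", "<EOS>", "<UNK>"

def prepare(tokens, vocab, max_len, is_decoder_input=False):
    """Map unknown words to UNK, add GO/EOS, and pad to a fixed length."""
    tokens = [t if t in vocab else UNK for t in tokens]
    if is_decoder_input:
        tokens = [GO] + tokens          # decoder input starts with GO
    tokens = tokens[:max_len - 1] + [EOS]  # truncate, always end with EOS
    return tokens + [PAD] * (max_len - len(tokens))

vocab = {"how", "are", "you", "fine"}
print(prepare(["how", "are", "you", "today"], vocab, 6))
print(prepare(["fine"], vocab, 4, is_decoder_input=True))
```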

Bucketizing

Opportunity to avoid padding by bucketizing
Place them in different batches for RNN
RNN to keep track of intent of conversations
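
Bucketizing can be sketched as below: each sequence goes into the smallest bucket that fits it, so it is padded only up to that bucket's length instead of the global maximum. The bucket sizes and the `bucketize` helper are assumptions.

```python
def bucketize(sequences, bucket_sizes, pad="<PAD>"):
    """Group sequences by length and pad each only up to its bucket size."""
    buckets = {size: [] for size in bucket_sizes}
    for seq in sequences:
        for size in sorted(bucket_sizes):
            if len(seq) <= size:
                buckets[size].append(seq + [pad] * (size - len(seq)))
                break
        # Sequences longer than the largest bucket are skipped in this sketch
    return buckets

seqs = [["hi"], ["how", "are", "you"], ["a", "b", "c", "d", "e", "f"]]
buckets = bucketize(seqs, [2, 4, 8])
for size, batch in buckets.items():
    print(size, batch)
```

Each bucket can then be fed to the RNN as its own batch.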

Cons

Too dramatic responses
Based on Domain of data

Intents Clustering

Graph of different responses
Labels to cluster them
Propagate the knowledge to other labels of graph
Expander library is used for this purpose

Updated - Jule 2022

Interesting Reads - LaMDA: Language Models for Dialog Applications

Key Notes

Language Models for Dialog Applications
Metrics - (sensibleness, specificity, and interestingness)

LaMDA: our breakthrough conversation technology

Like BERT and GPT-3, it’s built on Transformer
Meena, a 2.6 billion parameter end-to-end trained neural conversational model
At its heart lies the Evolved Transformer seq2seq architecture, a Transformer architecture discovered by evolutionary neural architecture search to improve perplexity.

Towards a Conversational Agent that Can Chat About…Anything

Happy Learning!!!

Day #129 - Neural Network for Words

Bag of Words

Vectorize each word as a one-hot encoded vector
Bag of Words representation - Sum of the individual One hot encoded vectors
BOW - Sum of sparse one hot encoded vectors
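
The "sum of one-hot vectors" idea in a few lines of plain Python. The vocabulary and sentence are toy values chosen for illustration, and this sketch keeps counts, so a repeated word shows up as 2.

```python
def one_hot(word, vocab):
    """Sparse one-hot vector: 1 at the word's vocabulary position, 0 elsewhere."""
    return [1 if word == v else 0 for v in vocab]

def bag_of_words(tokens, vocab):
    """Bag of words = element-wise sum of the tokens' one-hot vectors."""
    bow = [0] * len(vocab)
    for tok in tokens:
        for i, bit in enumerate(one_hot(tok, vocab)):
            bow[i] += bit
    return bow

vocab = ["are", "cat", "dog", "is", "now", "on", "table", "the"]
sentence = "the dog is on the table".split()
bow = bag_of_words(sentence, vocab)
print(bow)  # [0, 0, 1, 1, 0, 1, 1, 2]
```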

Neural Network for words

Dense Representation
Each word represented by Dense Vector
Word2Vec Embedding - Done in unsupervised manner
Sum of word2vec is feature representation
Convolutional filters to compute 2-gram words
Similar words have a small cosine distance (high cosine similarity) in word2vec
With a good embedding plus convolution we can capture higher-level meaning
Maximum pooling over time (Just like we do in images) - Input Sequence - Convolutional filter - Slide in one direction - Maximum Activation - Select Output
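
A minimal sketch of one convolutional filter over word embeddings followed by max pooling over time. The 2-dimensional embeddings and the filter weights are made-up numbers; a real model would learn them and use many filters.

```python
def conv1d_over_time(embeddings, kernel):
    """Slide a window of len(kernel) words along the sentence.

    Returns one activation per window position.
    """
    width = len(kernel)
    activations = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        act = sum(w * x
                  for k_row, e_row in zip(kernel, window)
                  for w, x in zip(k_row, e_row))
        activations.append(act)
    return activations

def max_over_time(activations):
    """Keep only the strongest activation, wherever it occurred."""
    return max(activations)

# Toy embeddings for a 4-word sentence (assumed values)
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
# One filter spanning 2 words, i.e. a 2-gram detector (assumed values)
bigram_filter = [[1.0, 0.0], [0.0, 1.0]]

acts = conv1d_over_time(sentence, bigram_filter)
feature = max_over_time(acts)
print(acts, feature)  # [2.0, 1.0, 1.5] 2.0
```

Each filter contributes one number to the feature vector, which is why the sentence length no longer matters after pooling.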

Architecture

3,4,5 gram window - For each ngram we learn 100 filters
Obtain embedding of input sequence
Apply multi layer perceptron on those 300 features

Paper - https://arxiv.org/pdf/1408.5882.pdf

Apply Convolutions for Text

One hot encoded characters
1000 Kernels, 1000 filters
Apply same pattern Convolution - Pooling - Convolution - Pooling
Moving window with stride of two, Obtain pooling output

Encoder-Decoder Architecture

Attention Mechanism
Encoder - Hidden representation of input sentence (Encodes thought of sentence)
Types of encoders (RNN, CNN, Hierarchical structures)
Decoder - Decodes the task / sequence into the other language
LSTM / RNN encodes input sentence (End of Sentence token)
Decoding - Conditional Language Modelling
Feed output of previous state as input for next state
Stack several layers of LSTM model
Every state of the decoder has three incoming arrows (from the previous state, from the context vector, and the current input)

Encoder - Maps the source sequence to a hidden vector (RNN)
Decoder - Performs language modelling conditioned on the given vector (RNN), but with more inputs (the three incoming arrows)
Prediction - Conditional probability (Softmax)

Attention Mechanism

Powerful Technique in Neural Networks
Encoder has H states, Decoder - X states
Helps focus on different parts of sentence

Compute Similarities

Additive Attention
Multiplicative Attention
Dot Product

Local Attention

Predict best place

Happy Learning!!!

September 14, 2018

Day #128 - NLP Basics - Demo and Notes

Found some code snippets for quick reference. Adding code examples and Basics Concepts for NLP Learning kit

	#NLP Notes and fundamentals

	#What is one hot encoding
	#Way to convert categorical data into numerical format
	#One hot encoding demo code
	#Code - Reference - https://github.com/jalajthanaki/NLPython/blob/master/ch5
	from __future__ import division
	import pandas as pd
	from sklearn.feature_extraction import DictVectorizer

	df = pd.DataFrame([['Bachelors','EntryLevel','Male'],['Masters','Management','Female'],['Phd','Sales','Male']],columns=['Education','Segment','Gender'])
	print(df)

	print(pd.get_dummies(df))

	#using Sci-kit learn
	v = DictVectorizer()
	qualitative_features=['Segment']
	x_qual = v.fit_transform(df[qualitative_features].to_dict('records'))
	print(v.vocabulary_)
	print(x_qual.toarray())

	#What is ngrams
	#For N = 1, This is a sentence
	#Unigrams are - This, is, a , sentence

	#For N = 2, This is a sentence
	#bigrams are - This is, is a, a sentence

	#For N = 3, This is a sentence
	#Trigrams are - This is a, is a sentence

	#ngrams Demo
	from nltk import ngrams
	sentence = 'test the words for multiple types of possible ngrams to generate'

	#ngrams2
	print('ngram = 2')
	ngramresults = ngrams(sentence.split(),2)
	for data in ngramresults:
	    print(data)

	#ngrams3
	print('ngram = 3')
	ngramresults = ngrams(sentence.split(),3)
	for data in ngramresults:
	    print(data)

	#ngrams4
	print('ngram = 4')
	ngramresults = ngrams(sentence.split(),4)
	for data in ngramresults:
	    print(data)

	#What is bow model
	#BOW example
	#Sentence - The dog is on the table
	#Vocabulary - are, cat, dog, is, now, on, the, table
	#BOW representation (binary presence, so the repeated "the" is still 1) - 0, 0, 1, 1, 0, 1, 1, 1

	from sklearn.feature_extraction.text import CountVectorizer
	import numpy as np

	ngram_vectorizer = CountVectorizer(analyzer='char_wb',ngram_range=(2,2),min_df=1)
	#The list holds the documents; here there are two documents
	#analyzer='char_wb' builds character n-grams inside word boundaries, here with n=2

	counts = ngram_vectorizer.fit_transform(['this is document1 movies data','this is document2 text data is present'])
	#Print the learned character 2-gram vocabulary (get_feature_names_out needs scikit-learn >= 1.0)
	print(ngram_vectorizer.get_feature_names_out())

	print(counts.toarray().astype(int))

	ngram_vectorizer = CountVectorizer(analyzer='char_wb',ngram_range=(1,1),min_df=1)
	#Same two documents, now with character n-grams of length 1
	counts = ngram_vectorizer.fit_transform(['this is document1 movies data','this is document2 text data is present'])
	#Print the learned single-character vocabulary
	print(ngram_vectorizer.get_feature_names_out())

	print(counts.toarray().astype(int))

	#Term Frequency, Document Frequency, Inverse Document Frequency
	#Term Frequency - Number of times term t appears in document d
	#Document Frequency - Number of documents in the collection in which term t appears
	#Inverse Document Frequency - log(N/df_t), where N = number of documents in the collection and df_t = number of documents containing term t
	#IDF = log[Total Docs / Docs containing the term]
	#Lower IDF means the term occurs in more documents


	from textblob import TextBlob
	import math

	def tf(word,blob):
	    #Fraction of the words in the document that are this word
	    return blob.words.count(word)/len(blob.words)

	def n_containing(word,bloblist):
	    #1 + number of documents containing the word (the +1 keeps the count from being zero)
	    return 1+sum(1 for blob in bloblist if word in blob.words)

	def idf(word,bloblist):
	    x = n_containing(word,bloblist)
	    return math.log(len(bloblist)/(x if x else 1))

	def tfidf(word,blob,bloblist):
	    return tf(word,blob)*idf(word,bloblist)

	text1 = "term frequency document frequency tf idf"
	text2 = "numeric stats intended to reflect the data format"
	text3 = "data collection in corpus data"

	blob1 = TextBlob(text1)
	blob2 = TextBlob(text2)
	blob3 = TextBlob(text3)

	bloblist = [blob1,blob2,blob3]
	tf_score = tf('frequency',blob1)
	idf_score = idf('frequency',bloblist)
	tfidf_score = tfidf('frequency',blob1,bloblist)

	print('tf score is '+str(tf_score))
	print('idf score is '+str(idf_score))
	print('tfidf score is '+str(tfidf_score))

Happy Learning!!!

September 09, 2018

Working on Research papers / GATE

It is absolutely important to follow your passion and burn yourself focusing on areas of interest, improving your skills, and balancing your work and learning.

A PhD / title / tag is not needed to pursue such things. Small amounts of continuous, focused learning effort are very important and key to moving forward. Going forward, I will post my research / GATE learning from my notes.

Reads - Ref1, Ref2

Happy GATE and Happy learning!!!

September 06, 2018

Day #127 - SSD, Yolo Paper Reading Notes

Only the key summary points. These are selected lines (copied) for my quick reference and understanding

Yolo Notes

Resize Image
Run CNN (A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes)
Non-max Suppression

Alternative Techniques

Sliding window and region proposal-based techniques

Implementation Details

YOLO sees the entire image during training and test time so it encodes contextual information about classes as well as their appearance
Our system divides the input image into a S × S grid
If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
Each bounding box consists of 5 predictions: x, y, w, h, and confidence
Network architecture is inspired by the GoogLeNet model for image classification
YOLO predicts multiple bounding boxes per grid cell
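
A small sketch of the grid-cell assignment described above: which cell is responsible for an object, and where the center sits inside that cell. The `responsible_cell` helper is hypothetical; S = 7 and the 448 x 448 input size are used here as example values.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col, x_offset, y_offset) for an object center in pixels.

    The offsets are the center's position relative to the cell, in [0, 1].
    """
    gx = cx / img_w * S
    gy = cy / img_h * S
    col = min(int(gx), S - 1)  # clamp centers lying on the right/bottom edge
    row = min(int(gy), S - 1)
    return row, col, gx - col, gy - row

print(responsible_cell(224, 224, 448, 448))  # (3, 3, 0.5, 0.5)
print(responsible_cell(100, 300, 448, 448))
```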

Limitations of YOLO

Struggles to generalize to objects in new or unusual aspect ratios or configurations

Other Detection Systems

Haar, SIFT, HOG, convolutional features

What is Non-max Suppression

All modern object detectors follow a three step recipe:
(1) proposing a search space of windows (exhaustive by sliding window or sparser using proposals),
(2) scoring/refining the window with a classifier/regressor, and
(3) merging windows that might belong to the same object.

Non-Max Suppression - The algorithm greedily selects high scoring detections and deletes close-by less confident neighbours since they are likely to cover the same object
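
The greedy procedure can be sketched in plain Python. The (x1, y1, x2, y2) box format, the 0.5 IoU threshold and the sample boxes are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedily keep the highest-scoring box and drop its close-by neighbours."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Discard remaining boxes that overlap the kept box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```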

R-CNN [10] - Replaced features extraction and classifiers by a neural network

Related work - Viola&Jones, deformable parts model (DPM), clustering algorithms, mean-shift clustering, agglomerative clustering, affinity propagation clustering

Deformable parts models. Deformable parts models (DPM) use a sliding window approach to object detection

R-CNN. Region proposals instead of sliding windows to find objects in images. Selective Search [34] generates potential bounding boxes, a convolutional network extracts features, an SVM scores the boxes, a linear model adjusts the bounding boxes, and non-max suppression eliminates duplicate detections.

SSD Notes

Discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location

The core of SSD is predicting category scores and box offsets for a fixed set of default bounding boxes using small convolutional filters applied to feature maps
Based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes
Ground truth information needs to be assigned to specific outputs in the fixed set of detector outputs
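
A sketch of how default boxes can be laid out over one feature map: one box per aspect ratio at every location, with width scale * sqrt(ratio) and height scale / sqrt(ratio). The feature map size, scale and aspect ratios below are example values, and `default_boxes` is a hypothetical helper.

```python
import math

def default_boxes(fmap_size, scale, aspect_ratios):
    """Return (cx, cy, w, h) boxes in relative [0, 1] image coordinates."""
    boxes = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            # Center of the feature map cell
            cx = (col + 0.5) / fmap_size
            cy = (row + 0.5) / fmap_size
            for ar in aspect_ratios:
                w = scale * math.sqrt(ar)
                h = scale / math.sqrt(ar)
                boxes.append((cx, cy, w, h))
    return boxes

boxes = default_boxes(fmap_size=2, scale=0.5, aspect_ratios=[1.0, 2.0, 0.5])
print(len(boxes))  # 2 * 2 locations * 3 ratios = 12
print(boxes[0])    # (0.25, 0.25, 0.5, 0.5)
```

The network then predicts, for each of these boxes, class scores and offsets relative to the box.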

I need to revisit these papers over the next couple of months to understand them better.

Happy Learning!!!

Day #126 - Deep Learning Class Notes

Lesson 8: Deep Learning Part 2 2018 - Single object detection

Advice

Just because you have not come across something doesn't mean it's hard
Type out all the code yourself every time
Don't wait to be perfect before you start communicating

Neural Network Architecture

Dataset
Network Architecture (Number of convolution layers, pooling, dropouts, activation functions)
Loss Function

Flow of Architecture for Single Object
1. I/p Image
2. ConvNet
3. Output Tensor vector

Flow of Architecture for Multiple Object
1. I/p Image
2. ConvNet
3. Output Tensor vector
4. 16 sets of outputs

Notes

Bounding Boxes
Take Labelled data and generate classes
Labeling is expensive
Pascal VOC Dataset
Bounding Box with coordinates, category, image_id

Steps (Pytorch coding)

Build Classifier
Finding biggest object in each image and classify
Go through each bounding box in image
Get Largest One
Using ResNet to Classify
Model with 4 activations, mean square loss functions
Multiple label classification
Add Rotations, Flips, Contrast Changes
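
The "go through each bounding box and get the largest one" step can be sketched as below. The annotation layout (a list of dicts with `image_id`, `category` and a `bbox` given as [x, y, width, height]) is an assumption for illustration and may not match the layout used in the lesson notebook.

```python
from collections import defaultdict

# Made-up annotations; bbox is assumed to be [x, y, width, height]
annotations = [
    {"image_id": 1, "category": "person", "bbox": [10, 20, 50, 80]},
    {"image_id": 1, "category": "dog", "bbox": [5, 5, 200, 150]},
    {"image_id": 2, "category": "car", "bbox": [0, 0, 30, 30]},
]

def largest_per_image(annotations):
    """Return {image_id: annotation with the largest box area}."""
    by_image = defaultdict(list)
    for ann in annotations:
        by_image[ann["image_id"]].append(ann)
    return {
        img: max(anns, key=lambda a: a["bbox"][2] * a["bbox"][3])
        for img, anns in by_image.items()
    }

largest = largest_per_image(annotations)
for img, ann in largest.items():
    print(img, ann["category"], ann["bbox"])
```

The category of the largest box then becomes the single label for that image when training the classifier.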

Architecture

Flatten
RELU
Dropout
Linear
Batch Normalization
Dropout
Linear
Loss functions

SSD

Single Shot Detection
Conv2D
Number of anchor boxes

Analysis

Transfer learning is done on top of it
Identify the highlighted sections

My Thoughts

Perform Segmentation
Pick the objects
Train and Classify them

To Learn Items

Python Debugger pdb.set_trace()
Detail Specific Code Walkthrough
lambda functions in Python

Adam

Momentum on gradient
Past Squared gradient
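
The two ingredients above combine into the Adam update as in this minimal single-parameter sketch. The hyperparameter defaults are the commonly used ones; the toy objective f(x) = x^2 and the learning rate of 0.05 are assumptions for the demo.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # momentum on the gradient
    v = beta2 * v + (1 - beta2) * grad * grad   # running average of squared gradient
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Minimize f(x) = x^2, whose gradient is 2x
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.05)
print(x)
```

Because the step is divided by the root of the squared-gradient average, the very first update moves the parameter by roughly the learning rate, regardless of the gradient's magnitude.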

Happy Learning!!!
