Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database)

September 25, 2016

Persuading Organization to embrace analytics

Products and systems already in production need to review their data collection techniques and identify more relevant data so that better analytics can be performed. If current systems rely on point-in-time data and overwrite or archive historical records over time, valuable information is lost.

Why Analytics ?

Predict your future based on your past and present
Correct your mistakes before it's too late
Identify and correct poor performing segments of business

How Analytics differs from Business Intelligence ?

I have worked on ETL, data marts and schemas for BI projects
BI helps to summarize and compare business performance YoY, QoQ
Analytics is the next step after BI: looking at future trends

Where are we lagging ?

We need analytics, but we do not have enough data points / features to perform it. Data collection is the key aspect. The lifeblood of data science is collecting meaningful data and building models from it. We need to devote sufficient time to collecting data, pipelining it, and processing and aggregating it for data analysis and modelling.

To evolve from the current product to a system with analytics capabilities, we need to change the way we store and process data. Technical constraints, project deadlines and resistance to change all have to be handled to make things work.

Persist, Persuade, Implement....

Happy Learning!!!

September 05, 2016

Day #31 - Support Vector Machines

SVM

Support Vector Machines
Widest-street approach: separate the +ve and -ve classes with a margin that is as wide as possible
A basic SVM classifies only two classes
Hard SVM (Strictly linearly separable)
Soft SVM (Tolerates points falling on the wrong side; the constant C controls how heavily such points are penalized)
Kernel functions perform a transformation of the data
Using a kernel function, we can still look for a linear separator, but in the transformed space
Kernels implicitly take the data into a higher-dimensional space
Other Key concepts discussed (Lagrange Multipliers, Quadratic Optimization problem)
Lagrangian constraint transform from 1D to 2D data
SVM (Linear way of approximation)
Types of Kernels - Polynomial Kernel, Radial Basis Function Kernel, Sigmoid Kernel
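
To make the kernel idea above concrete, here is a minimal sketch of the three kernels in plain Python. The parameter names (degree, coef0, gamma) and the sample points are my own assumptions for illustration, not values from the session. The last part checks that a degree-2 polynomial kernel gives the same number as an ordinary dot product in an explicitly mapped higher-dimensional space.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    return (dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, y, gamma=0.1, coef0=0.0):
    return math.tanh(gamma * dot(x, y) + coef0)

def explicit_quadratic_map(x):
    """Explicit 3-D feature map whose dot product equals (x . y)^2 for 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

# Illustrative points (assumed, not from the notes)
a, b = (1.0, 2.0), (3.0, 0.5)
kernel_value = polynomial_kernel(a, b, degree=2, coef0=0.0)
mapped_value = dot(explicit_quadratic_map(a), explicit_quadratic_map(b))
```

Both kernel_value and mapped_value describe the same similarity; the kernel simply avoids building the higher-dimensional vectors.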

Maths Behind it - Link
Good Relevant Read - SVM

Happy Data Analysis!!!

Day #30 - Machine Learning Fundamentals

Supervised Learning

Classification and Regression problems
Past data + Past outputs leveraged
Regression - Continuous Values
Classification - Discrete Labels

Unsupervised

Clustering - Discrete Labels
Dimensionality reduction - Continuous Values

Classifiers

SVM (Linear way of approximations)
KNN (Lazy learner)
Decision Tree (Rule based approach, Set of Rules)
Naive Bayes (Pick class with maximum probability)

Evaluation Methods

K-Fold Validation
Cross Validation
Ranking / Search - Relevance
Clustering - Intra-cluster and inter-cluster distances
Regression - Mean Square Error
ROC Curve
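
A rough sketch of two of the evaluation ideas above, K-fold splitting and mean square error, in plain Python. The function names are hypothetical helpers written for this note, and the folds are simple contiguous blocks without shuffling.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

folds = list(k_fold_indices(n_samples=10, k=3))
```

Each sample lands in exactly one test fold, so every data point is used for validation once.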

Bagging

Build classifier with 30% of data
Again partition and build another classifier with next 30% of data
Random Forests - Random combination of Trees
Randomly decide and split on attributes
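
The bagging idea above (train several classifiers on different 30% slices of the data and let them vote) could look roughly like this. The threshold "stump" learner, the toy data and the seed are all assumptions made up for the sketch; it is not a Random Forest implementation.

```python
import random
from collections import Counter

def train_stump(samples):
    """Hypothetical weak learner: threshold halfway between the two class means."""
    zeros = [x for x, label in samples if label == 0]
    ones = [x for x, label in samples if label == 1]
    if not zeros or not ones:
        # Degenerate subset: always predict the only class that was seen
        only = 0 if zeros else 1
        return lambda x: only
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > threshold else 0

def bagged_predict(models, x):
    """Majority vote across the individual classifiers."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
# Toy 1-D data (assumed): class 0 below 50, class 1 from 50 upwards
data = [(x, 0) for x in range(0, 50)] + [(x, 1) for x in range(50, 100)]
subset_size = int(0.3 * len(data))
models = [train_stump(rng.sample(data, subset_size)) for _ in range(5)]
```

Usage: bagged_predict(models, 5) and bagged_predict(models, 95) return the voted class for a new point.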

Boosting

Multiple weak classifiers build strong classifier
Sample with replacement
Adaboost - Adaptive boosting
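
As a sketch of what "adaptive" means in AdaBoost: after each weak classifier, misclassified points get a larger weight so the next classifier focuses on them. This shows a single reweighting round using the standard AdaBoost formulas; the labels and predictions are made-up illustration data.

```python
import math

def adaboost_round(weights, labels, predictions):
    """One AdaBoost reweighting step for labels and predictions in {-1, +1}."""
    error = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
    alpha = 0.5 * math.log((1 - error) / error)   # assumes 0 < error < 1
    updated = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, labels, predictions)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

labels      = [+1, +1, -1, -1, +1]
predictions = [+1, +1, -1, -1, -1]   # the weak classifier misses the last point
weights     = [0.2] * 5
alpha, new_weights = adaboost_round(weights, labels, predictions)
```

The returned alpha is the vote weight of this weak classifier in the final strong classifier.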

Stacking

Use Output from one classifier as input for another classifier
Knn -> O/P -> SVM

Happy Learning!!!

August 31, 2016

Day #29 - Decision Trees

Hierarchical, Divide and Conquer strategy, Supervised algorithm
Works on numerical data
Concepts discussed - Information gain, entropy computation (Shannon entropy)
Pruning based on chi-square / Shannon entropy
Convert all string / character into categorical / numerical mappings
You can also bucketize continuous variables
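
A small sketch of the entropy and information gain computation mentioned above, in plain Python. The yes/no labels and the candidate split are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5
split = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
gain = information_gain(parent, split)
```

A decision tree would compute this gain for each candidate attribute and split on the one with the highest value.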

Basic Python pointers

Good Reads
Link1 , Link2, Link3, Link4, Link5, Link6

Happy Learning!!!

August 15, 2016

Day #28 - R - Forecast Library Examples

Following Examples discussed. Library used - R - Forecast Library

Moving Average
Single Exponential Smoothing - Uses single smoothing factor
Double Exponential Smoothing - Uses two constants and is better at handling trends
Triple Exponential Smoothing - Smoothing factor, trend, seasonal factors considered
ARIMA
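
The session used the R forecast library; the snippet below is only a plain-Python sketch of what the first two techniques compute, not the library's API. The sales numbers and the alpha value are assumptions.

```python
def moving_average(series, window):
    """Average of each consecutive block of `window` observations."""
    return [sum(series[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(series))]

def single_exponential_smoothing(series, alpha):
    """s[t] = alpha * x[t] + (1 - alpha) * s[t-1], seeded with the first observation."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

sales = [10, 12, 14, 16, 18]
ma = moving_average(sales, window=3)
ses = single_exponential_smoothing(sales, alpha=0.5)
```

Note how the smoothed series lags behind a steadily rising series, which is why double exponential smoothing adds a trend term.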

Happy Learning!!!

August 08, 2016

Applied Machine Learning Notes

Supervised Learning

Classification (Discrete Labels)
Regression (Output is continuous, Example - Age, Stock prices)
Past data + Past Outputs used

Unsupervised Learning

Dimensionality reduction (Data in higher dimensions, Remove dimension without losing lot of information)
Reducing dimensionality makes it easy for computation (Continuous values)
Clustering (Discrete labels)
No Past outputs, Only current data

Reinforcement Learning

Game playing is the typical example (learning comes from rewards, not from labelled outputs)
Learning Policy
Negative / Positive reward for each step

Type of Models

Inductive (Learn model, Learn from a function) vs Transductive (Lazy learning, e.g. opinion from like-minded people)
Online (Learn from every new incoming tweet) vs Offline (Learn from the past year's tweets)
Generative (Apply Gaussian on Data, Use ML and compute Mean / Variance) vs Discriminative (Two sides of Line)
Parametric vs Non-Parametric Models

Happy Learning!!!

July 31, 2016

Fifth Elephant Day #2

Fifth Elephant Day #2 - Part I

Session #1 - Content Marketing

Distribute relevant consistent content. Traditional vs Content Marketing

Challenges

Delivering content with speed. Channel proliferation (mobile, computers, tablets)
Intersection of Brands, Trends, Community Interests (Social media post and metrics)
Data from social media pages, online aggregators

Technical Details

Computation of term frequency, inverse document frequency
Using Solr, Lucene for Indexes
Cosine Similarity
Greedy Algorithm
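
A minimal sketch of the TF-IDF and cosine similarity steps listed above, in plain Python. The talk used Solr / Lucene for indexing; this is not their scoring formula, just the textbook version, and the three documents are invented.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Return the vocabulary and one TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)
    doc_freq = {t: sum(1 for doc in tokenized if t in doc) for t in vocabulary}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([(counts[t] / len(doc)) * math.log(n_docs / doc_freq[t])
                        for t in vocabulary])
    return vocabulary, vectors

def cosine_similarity(u, v):
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

docs = ["data science blog", "data science notes", "cricket match report"]
vocab, vecs = tf_idf_vectors(docs)
related = cosine_similarity(vecs[0], vecs[1])
unrelated = cosine_similarity(vecs[0], vecs[2])
```

Documents that share terms score above zero, while documents with no terms in common score exactly zero.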

Session #2 - Reasoning

Prediction vs Reasoning problem
Prediction Problems Evolution
At Advanced level Deep Learning, XGBoost, Graphical models

When Apply prediction ?
Features as input -> Prediction performed (Independent, stateless)

Reasoning - Sequential, Stateful Exploration
Reasoning Problems - Diagnosis, routes, games, crossing roads

Flavours of Reasoning

Algorithmic (Search)
Logical reasoning
Bayesian probabilistic reasoning
Markovian reasoning

Knowledge, learning the process of reasoning; knowledge graphs were shown in the implementation of reasoning
{subject, predicate, object}
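
The {subject, predicate, object} form above can be sketched as a set of tuples with a simple pattern match. The facts and the match helper are hypothetical, made up for a toy diagnosis example; a real knowledge graph would use a proper triple store.

```python
# Toy knowledge graph (invented facts, for illustration only)
triples = {
    ("fever", "symptom_of", "flu"),
    ("cough", "symptom_of", "flu"),
    ("flu", "treated_by", "rest"),
}

def match(triples, subject=None, predicate=None, obj=None):
    """Return all triples matching the given pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )

symptoms_of_flu = [s for s, _, _ in match(triples, predicate="symptom_of", obj="flu")]
```

Chaining such lookups step by step is one simple form of the sequential, stateful exploration described under reasoning.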

Session #3 - Continuous online learning

70% noise in C2B communication
100% noise in B2C communication
Zipfian

Technicalities

Apriori - Market Basket Analysis
XGBoost - Alternative to DL
Bias - Variance Tradeoff
Spectral Clustering

Birds of a Feather Session

Google Deepmind (Used for Air conditioning)
Bayesian Probabilistic Learning
Deep Learning - Build Hierarchy of features (OCR type of problems)
Traditional Neural Network (Fully Connected, lot of degree of freedom)
Structural causality (Subsystem appears before, Domain knowledge)
Temporal causality - This and then that happened
CNN - learning weights
Spectral clustering
PCA (reduce high-dimensional data to fewer dimensions)
Deep Learning - Hidden layers obtained through coarse grained process

Deep Learning workshop Notes

Neural Networks
Multiple Layers
Lots of data

People Involved - Hinton, Andrew Ng, Bengio, LeCun

Deep Learning now

Speech recognition
Google Deep Models on Phone
Google street view (House numbers)
Imagenet
Captioning images
Reinforcement learning

Neural Networks

Simple mathematical units combine into complex functions
X-> input, W-> weights, Non linear function of output
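
A single unit as described above, sketched in plain Python: a weighted sum of the inputs passed through a non-linear function. The choice of sigmoid and the sample numbers are my assumptions.

```python
import math

def neuron(inputs, weights, bias=0.0):
    """Weighted sum of inputs followed by a sigmoid non-linearity."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# 1.0 * 0.5 + 2.0 * (-0.25) = 0, and sigmoid(0) = 0.5
output = neuron([1.0, 2.0], [0.5, -0.25])
```

Stacking many such units in layers is what lets simple units combine into complex functions.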

Multiple Layers

Multiple hidden layers between input and output
Training hidden layers is challenge

Gradient Descent

Define loss function
Minimize by moving against the gradient (downhill)
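
A minimal sketch of gradient descent on a one-variable loss. The quadratic loss, learning rate and step count are assumptions chosen so the result is easy to check by hand.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to reduce the loss."""
    w = start
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w

# Loss(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3
minimum = gradient_descent(lambda w: 2 * (w - 3), start=0.0)
```

With this learning rate the distance to the minimum shrinks by a factor of 0.8 per step, so 100 steps get very close to 3.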

Backpropagation

Move Errors back through the network
Chain rule conception
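
To see the chain rule at work, here is a sketch of backpropagation through a tiny network with one hidden unit and one linear output. The network shape, the squared-error loss and all numbers are assumptions for illustration. The gradient from the chain rule is compared against a numerical estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    h = sigmoid(w1 * x)      # hidden activation
    y = w2 * h               # linear output
    return h, y

def loss(x, target, w1, w2):
    _, y = forward(x, w1, w2)
    return 0.5 * (y - target) ** 2

def backward(x, target, w1, w2):
    """Gradients of the loss with respect to w1 and w2 via the chain rule."""
    h, y = forward(x, w1, w2)
    d_y = y - target                 # dLoss/dy
    d_w2 = d_y * h                   # dLoss/dy * dy/dw2
    d_h = d_y * w2                   # error moved back to the hidden unit
    d_w1 = d_h * h * (1 - h) * x     # dLoss/dh * dh/dz * dz/dw1
    return d_w1, d_w2

x, target, w1, w2 = 1.5, 1.0, 0.4, -0.6
grad_w1, grad_w2 = backward(x, target, w1, w2)

# Numerical check with central differences
eps = 1e-6
numeric_w1 = (loss(x, target, w1 + eps, w2) - loss(x, target, w1 - eps, w2)) / (2 * eps)
numeric_w2 = (loss(x, target, w1, w2 + eps) - loss(x, target, w1, w2 - eps)) / (2 * eps)
```

The analytic and numerical gradients should agree closely, which is a handy sanity check when writing backpropagation by hand.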

Tools

Caffe - Configuration file
Torch - Describe network in Lua
Theano - Describes computation, writes cuda code, runs and gives results

CNN

Used for images
Images are organized
Apply Convolutional filter
For Deep Learning GPU is important

Imagenet Competition

Convolution (Have all nice features retain them)
Pooling (Shrink image)
Softmax
Other

Simplest RNN - Gradient problem during training (vanishing gradients)
LSTM (Long Short Term memory)
Interword relationships from corpus (word2vec)
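
The convolution, pooling and softmax steps listed under Imagenet Competition can be sketched in plain Python on a tiny image. The 5x5 image, the vertical-edge filter and the scores are invented for illustration; real networks learn the filter weights and run on a GPU.

```python
import math

def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)] for r in range(out_h)]

def max_pool(feature_map, size=2):
    """Shrink the feature map by keeping the maximum of each size x size block."""
    return [[max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(feature_map[0]) - size + 1, size)]
            for r in range(0, len(feature_map) - size + 1, size)]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Dark left half, bright right half
image = [[0, 0, 1, 1, 1] for _ in range(5)]
vertical_edge = [[-1, 1], [-1, 1]]
features = convolve2d(image, vertical_edge)   # responds where dark meets bright
pooled = max_pool(features)
probabilities = softmax([2.0, 1.0, 0.1])
```

The filter only fires at the column where the image changes from dark to bright, and pooling keeps that response while halving the size.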

Happy Learning!!!

July 28, 2016

Fifth Elephant Day #1 Notes - Part II

Sessions # - Link

Talk #3 - Machine Learning in FinTech

Lending Space
Credit underwriting system

India

2% Credit card usage
65% of population < 27 yrs
Digital foot print (mobile)
Identity (Aadhaar)

40 Decisions / Minute -> 100 Crores a month

Use Cases / Scenarios

Truth Score (Validity of address / person / sources)
Need Score (Urgency / Time to respond application)
Saver Score (cash flow real-time analytics)
Credit Score (Debt to income)
Credit awareness score
Continuous risk assessments

Talk #4 - Driving Behaviour from Smartphone Sensors

Safe driving using smartphone sensors
Spatial / location data
Road traffic injuries due to distracted driving
Phone usage - 4x crash risk
Speedy driving - 45% car crash history
Driving behavior analysis / driving feedback
GPS + Inertial Navigational sensors (Accelerometer / Gyroscope / Magnetometer)

Characterization

Drive detection
Event detection
Collision detection

Qualification

Drive summarization and scoring
Risk modelling

Optimization

Events, location of events, duration of events

Dynamics

Sensors
Availability - wide variety across devices
Raw Data - noisy, unevenly spaced time series
Events - Time scales, combination of sensors
Model building - Labelled vs unlabelled data, feature engineering
Algorithms - Stream / batch efficiency

Techniques

Cluster data
Eliminated uninteresting time periods
Classification / Regression models
Spectral clustering

Talk #5 - Indian Agriculture

Crop rotation literacy
Data curation, Query tools on data product
Visualization and plotting of Agricultural data

Talk #6 and #7 - The last two talks were from ecologists

Using Image comparison for Big Cat Counting
Predicting Big Cat Areas (Territories)
Observe Nature, Frame Hypothesis, Design Experiments
Confront with competing hypothesis
Spacegap program
Markov chain Monte-Carlo technique

Happy Learning!!!

Fifth Elephant Day #1 Notes - Part I

Sessions # - Link

Talk #1 - Data for Genomic Analysis

Great talk by Ramesh. I had attended his session / technical discussion earlier. This session provided insights on genome / discrepancies in genome sequence leading to rare diseases.

Genome - 3 Billion X 2 Characters
Characters vary from person to person
Stats (1/10th of probability of cancer)
Baseline risk for breast cancer (1/8),(1/70) ovarian cancer
BRCA1 mutation (5-6 fold increase in breast cancer, 27 fold increase for ovarian cancer)

In India

35% inherited risk mutation
1/25 Thalassemia
1 in 400-900 Retinitis Pigmentosa
1 in 500, Hypertrophic Cardiomyopathy

Data Processing

1 Billion reads - 100GB data per person
Very similar sequence yet one character might differ
But reference is 3 Billion long

Efficiency

Need fast indexing
Suffix Trees and variations
Hash table based approaches
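
A rough sketch of the hash table based approach above: index every k-length substring (k-mer) of the reference so a read can be placed by lookup instead of scanning all 3 billion characters. The tiny reference, the read and k = 4 are assumptions; real aligners use far more refined seeding and scoring.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def candidate_positions(read, index, k):
    """Look up the first k characters of the read to get candidate start positions."""
    return index.get(read[:k], [])

def mismatches(read, reference, pos):
    """Count differing characters when the read is laid at the given position."""
    return sum(1 for a, b in zip(read, reference[pos:pos + len(read)]) if a != b)

reference = "ACGTACGTTAGC"
index = build_kmer_index(reference, k=4)
read = "ACGTTAGA"   # hypothetical read that differs from the reference by one character
hits = [(pos, mismatches(read, reference, pos))
        for pos in candidate_positions(read, index, 4)]
best_position = min(hits, key=lambda hit: hit[1])[0]
```

The read matches position 4 with a single differing character, which is the "very similar sequence yet one character might differ" situation from the talk.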

Reference Genome Sequence

Volume of data
Funnel down of variety of dimensions
Triplet Code (Molecule)
Variants of triplets narrowed down to differences in the genome
GPU processing / reduce computation time

Concepts Discussed / Used

Hypothesis Testing
Stats Models
GPU Processing to reduce computation time

They also provide assessment for hereditary diseases at corporate level.

Talk #2 - Alternative to Wall Street Data

This session gave me some new strategies to collect / analyze data

How to Identify occupancy rate at hotel ?

Count of cars from parking lots
Number of rooms lights on
Take pics of rooms from corner of street and predict based on images collected
Unconventional ways to think of data collection (Beating the wall street model)

What are usual ways

Checking websites

From Investor perspective lodging key metrics is a very important aspect
Data Sources

Direct data gathering
Web harvesting
Primary research

Primary Research

Look at / notice patterns in front of you
Difference in invoice numbers
Serial number changes, difference values

Free Data Sets in link
Lot of opportunity

Analyze international markets (India / China)
COGS
SG&A
etc.

How to value data sets ?

Scarcity - How widely used
Granularity - Time / aggregation level
Structured
Coverage

What is the generative value

Revenue Surprise Estimates
Dataset insight / Analysis
Operating GAAP measures

A great case study on the impact of smart watches vs luxury watches was presented. This session provided great insight into unconventional ways of collecting data

Generate money in automated system
Stock sensitivity to revenue surprises
Identify underlying ground truth

"Some Refreshing changes to world of investment"

Happy Learning!!!

July 24, 2016

Day #27 - Exploring ggplot2

About Me and Disclaimer

Welcome Visitor,
I have 20 years of experience (Coder - Empirical Learner - Teacher). I am currently working on Data Analytics (Video-Image-Text-Data) / Database / BI space. I dabble with "Data". Ping me or send a request to connect if what I do appeals to you and you want to talk about it (Data Science / Databases / Deep Learning / Architecture / Design Discussions / Consulting Projects / Machine Learning Trainings / Strategic Leadership Roles).
Personal Goal - Reach / Teach up to 10 Million Students through various mediums (Catalyst between Academics and Industry)
My request to readers: I hope you find the posts, code snippets and notes helpful; please share your learning with others. We can grow only by learning and teaching.

6+ years in AI, AI experience working on Image, Video, Text, Numbers - Data

15+ years in Databases

10+ in developing, deploying, monitoring large scale solutions in Supply Chain, Retail

It's my personal blog. The objective of this blog is to bookmark / share my learnings. Posts reflect my opinions, perspectives and interests. The blog posts presented are my personal views and do not represent my employer's views. I have acknowledged all posts with References/Bookmarks.

For questions/feedback/career opportunities/training / consulting assignments/mentoring - please drop a note to sivaram2k10(at)gmail(dot)com
Coach / Code / Innovate

A blogpost a day keeps your thinking going.
