"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

October 10, 2016

Day #36 - Pandas DataFrame Learnings
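
The original notes aren't reproduced here; below is a minimal sketch of the DataFrame basics this entry covered - construction, selection, filtering and group-by (the column names and data are made up for illustration):

import pandas as pd

# Build a small DataFrame from a dict (illustrative data)
df = pd.DataFrame({
    'name': ['Asha', 'Ravi', 'Meena', 'Kiran'],
    'dept': ['ML', 'BI', 'ML', 'BI'],
    'score': [82, 75, 91, 68],
})

print(df['score'].mean())                   # column selection + aggregate
print(df[df['score'] > 70])                 # boolean filtering
print(df.groupby('dept')['score'].mean())   # group-by aggregation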

Happy Learning!!!

Day #35 - Bias Vs Variance

These terms come up frequently when comparing a model's performance on the training set versus the test set.

Expected error = Bias^2 + Variance + Irreducible noise (the classic decomposition for squared loss; for classification the split is less clean, but the intuition carries over)

Bias (Under-fitting)
  • Bias is high if the model class cannot represent the true data distribution well; it does not depend on training set size.
  • High bias leads to under-fitting
How to identify High Bias
  • Training error will be high
  • Cross-validation error will also be high (both will be nearly the same)
Variance (Over-fitting)
  • High variance leads to over-fitting
How to identify High Variance
  • Training error will be low
  • Cross-validation error will be very high compared to the training error
How to Fix?
Variance decreases with more training data and increases with more complicated classifiers; bias decreases as the model class gets richer.
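
A quick way to see both patterns is scikit-learn's learning_curve; a minimal sketch (the dataset and max_depth here are arbitrary placeholders):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Train vs cross-validation accuracy at increasing training set sizes
sizes, train_scores, cv_scores = learning_curve(
    DecisionTreeClassifier(max_depth=3), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# High bias: both scores plateau low, close together.
# High variance: training score high, CV score well below it.
print(train_scores.mean(axis=1))
print(cv_scores.mean(axis=1))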

Happy Learning!!!

October 08, 2016

Day #34 - What is the difference between Logistic Regression and Naive Bayes?

Both are probabilistic classifiers.

Logistic Regression
  • Discriminative (the entire approach is purely discriminative)
  • Models P(Y|X) directly
  • Output lies between 0 and 1
  • Formula: P(Y=1|X) = exp(w0 + w1x) / (1 + exp(w0 + w1x))
  • Equivalently: 1 / (1 + exp(-(w0 + w1x)))
Naive Bayes
  • Generative model
  • Models P(X|Y); the "naive" assumption is that features are conditionally independent given the class
  • Learns a separate distribution for each class
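
A minimal scikit-learn sketch putting the two side by side (toy dataset; GaussianNB is the continuous-feature variant of Naive Bayes):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Discriminative: models P(Y|X) directly
lr = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
# Generative: models P(X|Y) per class, then applies Bayes rule
nb = GaussianNB().fit(X_tr, y_tr)

print('Logistic Regression:', lr.score(X_te, y_te))
print('Naive Bayes:', nb.score(X_te, y_te))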
Happy Learning!!!

October 04, 2016

Day #33 - Pandas Deep Dive

Happy Learning!!!

October 01, 2016

Day #32 - Regularization in Machine Learning

Large coefficients are a symptom of overfitting. Regularization penalizes coefficient size to avoid it.
  • L1 - Penalty on the sum of absolute values (Lasso: Least Absolute Shrinkage and Selection Operator). The L1 constraint region is a diamond whose corners sit on the coordinate axes, so the solution often lands there, driving some coefficients to exactly zero. This amounts to variable elimination: features that contribute minimally are dropped.
  • L2 - Penalty on the sum of squares (Ridge). The L2 constraint region is circular, so it shrinks all coefficients towards zero but eliminates none.
  • Discriminative - In SVM we use a hyperplane to separate the classes; this is an example of the discriminative approach.
  • Probabilistic / Generative - Model each class with a Gaussian distribution, justified partly by the Central Limit Theorem (averages of many samples tend towards a normal distribution).
  • Maximum Likelihood - Choose the parameters that maximize the probability of the observed points under the assumed distribution.
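
A small sketch showing L1 zeroing out coefficients while L2 only shrinks them (the alpha values are arbitrary):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10,
                       n_informative=3, noise=10, random_state=0)

lasso = Lasso(alpha=5.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=5.0).fit(X, y)   # L2 penalty

print('Lasso coefficients at zero:', np.sum(lasso.coef_ == 0))  # several exact zeros
print('Ridge coefficients at zero:', np.sum(ridge.coef_ == 0))  # typically none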

Happy Learning!!!

September 25, 2016

Persuading Organization to embrace analytics

Products / systems currently in production need to revisit their data collection techniques to capture more relevant data for better analytics. If current systems rely only on point-in-time data and overwrite or archive historical records over time, we lose all that valuable information.

Why Analytics ?
  • Predict your future based on your past and present
  • Correct your mistakes before it's too late
  • Identify and correct poor performing segments of business

How Analytics differs from Business Intelligence ?
  • I have worked on ETL, data marts and schemas for BI projects
  • BI helps summarize and compare business performance YoY, QoQ
  • Analytics is the next step beyond BI: looking at future trends

Where are we lagging ?

We need analytics, but we do not have enough data points / features to perform it. Data collection is a key aspect: the lifeblood of data science is collecting meaningful data and building models on top of it. We need to devote sufficient time to collecting data, pipelining it, and processing and aggregating it for analysis and modelling.

To evolve from the current product to a system with analytics capabilities, we need to change the way we store and process data. Technical constraints, project deadlines and resistance to change all have to be handled to make things work.

Persist, Persuade, Implement....

Happy Learning!!!

September 05, 2016

Day #31 - Support Vector Machines

  • Widest street approach: separate the +ve and -ve classes with as wide a margin as possible
  • A basic SVM classifies two classes at a time
  • Hard SVM (assumes strictly linearly separable data)
  • Soft SVM (allows points to fall on the wrong side; the constant C controls how much slack is permitted)
  • Kernel functions perform a transformation of the data
  • Using a kernel function we simulate the idea of finding a linear separator
  • Kernels take the data into a higher-dimensional space
  • Other key concepts discussed (Lagrange multipliers, the quadratic optimization problem)
  • Lagrangian constraints illustrated by transforming 1D data into 2D
  • SVM (a linear way of approximation)
  • Types of kernels - Polynomial, Radial Basis Function (RBF), Sigmoid
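
A minimal scikit-learn sketch of the kernels listed above on deliberately non-linear data (hyperparameters left at simple defaults):

from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# Linear vs kernelized separators on data a straight line cannot split
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    clf = SVC(kernel=kernel, C=1.0)   # C = soft-margin slack penalty
    print(kernel, cross_val_score(clf, X, y, cv=5).mean().round(3))
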
Maths Behind it - Link
Good Relevant Read - SVM

Happy Data Analysis!!!

Day #30 - Machine Learning Fundamentals

Supervised Learning
  • Classification and Regression problems
  • Past data + past outputs leveraged
  • Regression - continuous values
  • Classification - discrete labels
  • (Unsupervised counterparts: Clustering - discrete labels; Dimensionality reduction - continuous values)
Algorithms
  • SVM (linear way of approximation)
  • KNN (lazy learner)
  • Decision Tree (rule-based approach, a set of rules)
  • Naive Bayes (pick the class with maximum probability)
Evaluation Methods
  • K-fold cross-validation
  • Ranking / search - relevance
  • Clustering - intra-cluster and inter-cluster distances
  • Regression - mean squared error
  • ROC curve
Ensemble Methods
  • Bagging - build a classifier on 30% of the data, partition again and build another on the next 30%; sample with replacement
  • Random Forests - random combinations of trees; attributes to split on are decided randomly
  • Boosting - multiple weak classifiers combine into a strong classifier
  • Adaboost - adaptive boosting
  • Cascading - use the output of one classifier as the input of another (KNN -> output -> SVM)
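
A short sketch of two of the ideas above - k-fold evaluation and ensembles of weak trees (sizes and depths are arbitrary):

from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

weak = DecisionTreeClassifier(max_depth=3)       # one weak learner
bag = BaggingClassifier(weak, n_estimators=50)   # samples with replacement
rf = RandomForestClassifier(n_estimators=50)     # bagging + random splits

for name, clf in [('tree', weak), ('bagging', bag), ('forest', rf)]:
    print(name, cross_val_score(clf, X, y, cv=5).mean().round(3))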
Happy Learning!!!

August 31, 2016

Day #29 - Decision Trees

  • Hierarchical, divide-and-conquer strategy; a supervised algorithm
  • Works on numerical data
  • Concepts discussed - information gain, entropy computation (Shannon entropy)
  • Pruning based on chi-square / Shannon entropy
  • Convert all string / character features into categorical / numerical mappings
  • You can also bucketize continuous variables
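
A minimal sketch of a tree trained with the entropy (information gain) criterion; the label-encoding step mirrors the string-to-numeric mapping noted above (the data is made up):

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data with a string feature that must be mapped to numbers first
df = pd.DataFrame({'outlook': ['sunny', 'rain', 'sunny', 'overcast', 'rain'],
                   'humidity': [80, 65, 70, 75, 90],
                   'play': [0, 1, 1, 1, 0]})
df['outlook'] = LabelEncoder().fit_transform(df['outlook'])

tree = DecisionTreeClassifier(criterion='entropy')  # Shannon entropy splits
tree.fit(df[['outlook', 'humidity']], df['play'])
print(export_text(tree, feature_names=['outlook', 'humidity']))
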
Basic Python pointers

Good Reads
Link1 , Link2, Link3, Link4, Link5, Link6

Happy Learning!!!

August 15, 2016

Day #28 - R - Forecast Library Examples

The following examples were discussed, using R's forecast library:
  • Moving Average
  • Single Exponential Smoothing - uses a single smoothing factor
  • Double Exponential Smoothing - uses two constants and is better at handling trends
  • Triple Exponential Smoothing - considers smoothing factor, trend and seasonal factors
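
The original R snippets are not reproduced here; as a rough Python analogue (an assumption on my part - statsmodels' holtwinters module exposes the same three variants), on a made-up series:

import numpy as np
from statsmodels.tsa.holtwinters import (ExponentialSmoothing, Holt,
                                         SimpleExpSmoothing)

# Made-up monthly series with trend + seasonality
t = np.arange(48)
y = 10 + 0.5 * t + 3 * np.sin(2 * np.pi * t / 12)

single = SimpleExpSmoothing(y).fit()                  # level only
double = Holt(y).fit()                                # level + trend
triple = ExponentialSmoothing(y, trend='add', seasonal='add',
                              seasonal_periods=12).fit()  # + seasonality

print(single.forecast(3))
print(double.forecast(3))
print(triple.forecast(3))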

Happy Learning!!!

August 08, 2016

Applied Machine Learning Notes

Supervised Learning
  • Classification (Discrete Labels)
  • Regression (Output is continuous, Example - Age, Stock prices)
  • Past data + Past Outputs used
Unsupervised Learning
  • Dimensionality reduction (data in higher dimensions; remove dimensions without losing much information)
  • Reducing dimensionality makes computation easier (continuous values)
  • Clustering (Discrete labels)
  • No Past outputs, Only current data
Reinforcement Learning
  • Game playing is the classic setting (no labelled outputs; learn from rewards)
  • Learning Policy
  • Negative / Positive reward for each step
Type of Models
  • Inductive (learn a model, learn a function) vs Transductive (lazy learning, e.g. opinions from like-minded people)
  • Online (learn from every new incoming tweet) vs Offline (look at the past 1 year of tweets)
  • Generative (fit a Gaussian to the data, estimate mean / variance by maximum likelihood) vs Discriminative (the two sides of a line)
  • Parametric vs Non-Parametric Models
Happy Learning!!!

July 31, 2016

Fifth Elephant Day #2

Fifth Elephant Day #2 - Part I

Session #1 - Content Marketing
  • Distribute relevant consistent content. Traditional vs Content Marketing
  • Delivering content with speed. Channel proliferation (mobile, computers, tablets)
  • Intersection of Brands, Trends, Community Interests (Social media post and metrics)
  • Data from social media pages, online aggregators

Technical Details
  • Computation of term frequency, inverse document frequency
  • Using Solr, Lucene for Indexes
  • Cosine Similarity
  • Greedy Algorithm
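
A minimal sketch of the TF-IDF + cosine-similarity pipeline mentioned above (the documents are made up; Solr / Lucene would handle the indexing at scale):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ['brands engage communities on social media',
        'social media metrics for content marketing',
        'quarterly revenue report for investors']

tfidf = TfidfVectorizer()               # term frequency * inverse document frequency
X = tfidf.fit_transform(docs)
print(cosine_similarity(X[0], X[1:]))   # document 0 against the rest
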
Session #2 - Reasoning
  • Prediction vs Reasoning problem
  • Prediction Problems Evolution 
  • At Advanced level Deep Learning, XGBoost, Graphical models
When to apply prediction?
Features as input -> prediction performed (independent, stateless)

Reasoning - Sequential, Stateful Exploration
Reasoning Problems - Diagnosis, routes, games, crossing roads

Flavours of Reasoning
  • Algorithmic (Search)
  • Logical reasoning
  • Bayesian probabilistic reasoning
  • Markovnian reasoning
Knowledge, and learning the process of reasoning: knowledge graphs were shown as an implementation of reasoning, with triples of the form
{subject, predicate, object}

Session #3 - Continuous online learning
  • 70% noise in C2B communication
  • 100% noise in B2C communication
  • Zipfian
  • Apriori - Market Basket Analysis
  • XGBoost - Alternative to DL
  • Bias - Variance Tradeoff
  • Spectral Clustering

Birds of a Feather Session
  • Google DeepMind (used to optimize data-centre cooling / air conditioning)
  • Bayesian Probabilistic Learning
  • Deep Learning - builds a hierarchy of features (OCR-type problems)
  • Traditional neural network (fully connected, lots of degrees of freedom)
  • Structural causality (subsystem appears before; domain knowledge)
  • Temporal causality - this happened, and then that happened
  • CNN - learning weights
  • Spectral clustering
  • PCA (reduces dense, higher-dimensional data to fewer dimensions)
  • Deep Learning - hidden layers obtained through a coarse-grained process
Deep Learning workshop Notes
  • Neural Networks
  • Multiple Layers
  • Lots of data
People involved - Hinton, Andrew Ng, Bengio, LeCun

Deep Learning now
  • Speech recognition
  • Google Deep Models on Phone
  • Google street view (House numbers)
  • Imagenet
  • Captioning images
  • Reinforcement learning
Neural Networks
  • Simple mathematical units combine into complex functions
  • x -> inputs, w -> weights; a non-linear function of the weighted sum gives the output
Multiple Layers
  • Multiple hidden layers between input and output
  • Training the hidden layers is the challenge
Gradient Descent
  • Define a loss function
  • Minimize by moving along the gradient
  • Move errors back through the network (backpropagation)
  • The chain rule is the core concept
Frameworks
  • Caffe - network described in a configuration file
  • Torch - network described in Lua
  • Theano - describes the computation, writes CUDA code, runs it and returns results
Convolutional Networks
  • Used for images
  • Images are spatially organized
  • Apply convolutional filters
  • For deep learning a GPU is important
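
A bare-bones numpy sketch of the loop described above - define a loss and follow its gradient (a single linear unit, so the chain rule is trivial here):

import numpy as np

# Made-up data: y = 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 3 * x + rng.normal(0, 0.1, 100)

w, lr = 0.0, 0.1
for step in range(100):
    y_hat = w * x                     # forward pass
    grad = ((y_hat - y) * x).mean()   # gradient of squared loss w.r.t. w (factor 2 folded into lr)
    w -= lr * grad                    # step against the gradient
print(w)  # converges near 3
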
Imagenet Competition
  • Convolution (extracts features and retains them)
  • Pooling (shrinks the image)
  • Softmax
  • Other
Simple RNNs suffer from the vanishing gradient problem
LSTM (Long Short-Term Memory) networks address it
Inter-word relationships learned from a corpus (word2vec)
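
A tiny word2vec sketch using gensim (an assumption - the gensim 4.x API; a real model needs far more text than this toy corpus):

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences
sentences = [['machine', 'learning', 'models', 'need', 'data'],
             ['deep', 'learning', 'models', 'need', 'gpus'],
             ['data', 'drives', 'machine', 'learning']]

model = Word2Vec(sentences, vector_size=20, window=3, min_count=1, seed=0)
print(model.wv.most_similar('learning', topn=3))  # nearest words by embedding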

Happy Learning!!!

July 28, 2016

Fifth Elephant Day #1 Notes - Part II

Sessions # - Link

Talk #3 - Machine Learning in FinTech
  • Lending Space
  • Credit underwriting system
  • 2% Credit card usage
  • 65% of population < 27 yrs
  • Digital foot print (mobile)
  • Identity (Aadhar)
40 Decisions / Minute -> 100 Crores a month

Use Cases / Scenarios
  • Truth Score (Validity of address / person / sources)
  • Need Score (Urgency / Time to respond application)
  • Saver Score (cash flow real-time analytics)
  • Credit Score (Debt to income)
  • Credit awareness score
  • Continuous risk assessments
Talk #4 - Driving Behaviour from Smartphone Sensors
  • Safer driving using smartphone sensors
  • Spatial / location data
  • Road traffic injuries due to distracted driving
  • Phone usage while driving - 4x crash risk
  • Speeding - figures in 45% of car-crash histories
  • Driving behavior analysis / driving feedback
  • GPS + Inertial Navigational sensors (Accelerometer / Gyroscope / Magnetometer)
  • Drive detection
  • Event detection
  • Collision detection
  • Drive summarization and scoring
  • Risk modelling
  • Events, location of events, duration of events
  • Sensors
  • Availability - wide variety across devices
  • Raw Data - noisy, unevenly spaced time series
  • Events - Time scales, combination of sensors
  • Model building - Labelled vs unlabelled data, feature engineering
  • Algorithms - Stream / batch efficiency
  • Cluster data 
  • Eliminated uninteresting time periods
  • Classification / Regression models
  • Spectral clustering
Talk #5 - Indian Agriculture
  • Crop rotation literacy
  • Data curation, Query tools on data product
  • Visualization and plotting of Agricultural data
Talk #6 and #7 - The last two talks were from ecologists
  • Using Image comparison for Big Cat Counting
  • Predicting Big Cat Areas (Territories)
  • Observe Nature, Frame Hypothesis, Design Experiments
  • Confront with competing hypothesis
  • Spacegap program
  • Markov chain Monte-Carlo technique

Happy Learning!!!

Fifth Elephant Day #1 Notes - Part I

Sessions # - Link

Talk #1 - Data for Genomic Analysis

Great talk by Ramesh. I had attended his session / technical discussion earlier. This session provided insights into the genome and into discrepancies in genome sequences that lead to rare diseases.

Genome - 3 billion x 2 characters
Character variants vary from person to person
Stats (about a 1/10 probability of cancer)
Baseline risk: breast cancer 1 in 8, ovarian cancer 1 in 70
BRCA1 mutation (5-6 fold increase in breast cancer risk, 27 fold increase for ovarian cancer)

In India
  • 35% inherited risk mutation
  • 1/25 Thalassemia 
  • 1 in 400-900 Retinitis Pigmentosa
  • 1 in 500, Hypertrophic Cardiomyopathy
Data Processing
  • 1 billion reads - 100GB of data per person
  • Sequences are very similar, yet a single character might differ
  • The reference is 3 billion characters long
  • Need fast indexing
  • Suffix trees and variations
  • Hash table based approaches
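
A toy sketch of the hash-table indexing idea: store every k-mer of the reference in a dict so a read's candidate positions can be looked up quickly (real aligners use far more compact structures such as suffix arrays):

from collections import defaultdict

def build_kmer_index(reference, k=5):
    """Map every k-mer in the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

ref = 'ACGTACGGTACGTTAGC'   # stand-in for a 3-billion-character genome
index = build_kmer_index(ref, k=5)
print(index['ACGTA'])       # candidate alignment positions for a read
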
Reference Genome Sequence
  • Volume of data
  • Funnelled down across a variety of dimensions
  • Triplet code (molecule)
  • Variants of triplets nailed down to differences in the genome
  • GPU processing to reduce computation time
Concepts Discussed / Used
  • Hypothesis Testing
  • Stats Models
  • GPU Processing to reduce computation time
They also provide assessments for hereditary diseases at the corporate level.

Talk #2 - Alternative to Wall Street Data

This session gave me some new strategies to collect / analyze data

How to identify the occupancy rate at a hotel?
  • Count cars in the parking lots
  • Count the number of room lights on
  • Take pictures of rooms from the street corner and predict occupancy from the images collected
  • Unconventional ways of thinking about data collection (beating the Wall Street model)
What are the usual ways?
  • Checking websites
From an investor perspective, lodging key metrics are a very important signal.
Data Sources
  • Direct data gathering
  • Web harvesting
  • Primary research
Primary Research
  • Look at and notice patterns in front of you
  • Differences in invoice numbers
  • Serial number changes, differences in values
Free Data Sets in link
Lot of opportunity
  • Analyze international markets (India / China)
  • COGS
  • SG
  • ETC
How to value data sets ?
  • Scarcity - How widely used
  • Granularity - Time / aggregation level
  • Structured
  • Coverage

What is the generative value ?
  • Revenue Surprise Estimates
  • Dataset insight / Analysis
  • Operating GAAP measures
A great case study on the impact of smart watches vs luxury watches was presented. This session provided great insight into unconventional ways of collecting data.
  • Generate money in automated system
  • Stock sensitivity to revenue surprises
  • Identify underlying ground truth
"Some Refreshing changes to world of investment"

Happy Learning!!!

June 17, 2016

Good Read - Design Patterns

Happy Learning!!!

June 15, 2016

Day #26 - R - Moving Weighted Average

Example code based on a two-day workshop on the Azure ML module - a simple example of storing and accessing data from an Azure workspace.

Happy Learning!!!