"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

September 25, 2016

Persuading Organization to embrace analytics

Products and systems already in production need to review their data collection techniques and identify more relevant data to support better analytics. If a system relies on point-in-time data and overwrites or archives historical records after some time, all of that valuable information is lost.

Why Analytics ?
  • Predict your future based on your past and present
  • Correct your mistakes before it's too late
  • Identify and correct poor performing segments of business

How Analytics differs from Business Intelligence ?
  • I have worked on ETL, data marts and schemas for BI projects
  • BI helps to summarize and compare business performance YoY, QoQ
  • Analytics is the next step after BI: looking at future trends

Where are we lagging ?

We need analytics, but we do not have enough data points / features to perform it. Data collection is a key aspect. The lifeblood of data science is collecting meaningful data and building models from it. We need to devote sufficient time to collecting data, pipelining it, and processing and aggregating it for data analysis and modelling.

To evolve from the current product to a system with analytics capabilities, we need to change the way we store and process data. Technical aspects, project deadlines and resistance have to be handled to make things work.

Persist, Persuade, Implement....

Happy Learning!!!

September 05, 2016

Day #31 - Support Vector Machines

SVM
  • Support Vector Machines
  • Widest-street approach: separate the +ve and -ve classes with a margin as wide as possible
  • A basic SVM is a binary classifier (two classes); multi-class problems are usually handled by combining several binary SVMs, e.g. one-vs-rest
  • Hard-margin SVM (data must be strictly linearly separable)
  • Soft-margin SVM (some points may fall on the wrong side; the constant C controls how heavily such violations are penalized)
  • Kernel Functions perform transformation of data
  • Using a kernel function we can still look for a linear separator, but in the transformed space
  • Kernels take data into higher dimensional space
  • Other Key concepts discussed (Lagrange Multipliers, Quadratic Optimization problem)
  • Lagrangian constraints, and transforming data from 1D to 2D so that it becomes linearly separable
  • SVM (Linear way of approximation)
  • Types of Kernels - Polynomial Kernel, Radial Basis Function Kernel, Sigmoid Kernel
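
A minimal sketch of the kernel idea from the notes above, in plain Python. The function names, the toy 1D data and the threshold are assumptions made for illustration: points that no single threshold can split in 1D become linearly separable after the feature map x -> (x, x^2). The two kernel functions follow the standard polynomial and RBF definitions.

```python
import math

def feature_map(x):
    """Lift a 1D point into 2D: x -> (x, x^2)."""
    return (x, x * x)

def polynomial_kernel(a, b, degree=2, coef0=1.0):
    """(a . b + coef0) ** degree for equal-length vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    return (dot + coef0) ** degree

def rbf_kernel(a, b, gamma=0.5):
    """exp(-gamma * ||a - b||^2)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

# Toy data (made up): the negatives sit between the positives, so no
# single threshold on x separates the classes in 1D.
points = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [+1, +1, -1, -1, -1, +1, +1]

def classify(x, threshold=2.0):
    """In the lifted space the horizontal line x^2 = threshold separates the classes."""
    return +1 if feature_map(x)[1] > threshold else -1

predictions = [classify(x) for x in points]
```

A real SVM would learn the separating line instead of using a hand-picked threshold; the sketch only shows why the lift helps.
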
Maths Behind it - Link
Good Relevant Read - SVM

Happy Data Analysis!!!

Day #30 - Machine Learning Fundamentals

Supervised Learning
  • Classification and Regression problems
  • Past data + Past outputs leveraged
  • Regression - Continuous Values
  • Classification - Discrete Labels
Unsupervised
  • Clustering - Discrete Labels
  • Dimensionality reduction - Continuous Values
Classifiers
  • SVM (Linear way of approximations)
  • KNN (Lazy learner)
  • Decision Tree (Rule based approach, Set of Rules)
  • Naive Bayes (Pick class with maximum probability)
Evaluation Methods
  • K-Fold Validation
  • Cross Validation
  • Ranking / Search - Relevance
  • Clustering - Intra-cluster and inter-cluster distances
  • Regression - Mean Square Error
  • ROC Curve 
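
As a rough sketch of two of the evaluation methods listed above, here is K-fold splitting and mean squared error in plain Python. The helper names are assumptions for illustration; libraries such as scikit-learn ship their own versions.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

folds = list(k_fold_indices(n_samples=10, k=3))
```

Every sample lands in exactly one test fold, so each model is scored on data it did not see during training.
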
Bagging
  • Build classifier with 30% of data
  • Again partition and build another classifier with next 30% of data
  • Random Forests - Random combination of Trees
  • Randomly decide and split on attributes
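
A small sketch of the bagging idea above, under stated assumptions: it uses the common bootstrap variant (each model sees a same-size sample drawn with replacement) instead of fixed 30% partitions, and `train_stump` is a hypothetical one-threshold weak learner that assumes class 1 has the larger x values.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Hypothetical weak learner: threshold halfway between the class means of x."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    if not pos or not neg:  # degenerate sample: predict the only class seen
        only = 1 if pos else 0
        return lambda x: only
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > threshold else 0

def bagged_predict(models, x):
    """Majority vote over all models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
data = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]  # toy data
models = [train_stump(bootstrap_sample(data, rng)) for _ in range(25)]
```

Random forests follow the same pattern with decision trees as the base model, plus a random choice of attributes at each split.
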
Boosting
  • Multiple weak classifiers build strong classifier
  • Sample with replacement, giving more weight to previously misclassified points
  • Adaboost - Adaptive boosting
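
To make "adaptive" concrete, here is a sketch of a single AdaBoost reweighting round in its standard binary form. The function name and the five-sample example are assumptions for illustration; the weak classifier itself is left out.

```python
import math

def adaboost_round(weights, correct):
    """One AdaBoost reweighting step.

    weights: current sample weights (summing to 1)
    correct: booleans, True where the weak classifier was right
    Returns (alpha, new_weights). Assumes 0 < weighted error < 1.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1 - error) / error)  # vote strength of this classifier
    # Shrink weights of correct samples, grow weights of misclassified ones.
    scaled = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]

weights = [0.2] * 5
correct = [True, True, True, True, False]
alpha, new_weights = adaboost_round(weights, correct)
```

After the round the single misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.
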
Stacking
  • Use Output from one classifier as input for another classifier
  • Knn -> O/P -> SVM
Happy Learning!!!

August 31, 2016

Day #29 - Decision Trees

  • Hierarchical, Divide and Conquer strategy, Supervised algorithm
  • Works on numerical data (categorical values have to be encoded first)
  • Concepts discussed - Information gain, entropy computation (Shannon entropy)
  • Pruning based on chi-square / Shannon entropy
  • Convert all string / character into categorical / numerical mappings
  • You can also bucketize continuous variables
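
A short sketch of the entropy and information gain computations mentioned above, in plain Python. The weather-style rows and labels are made up for illustration.

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    """Entropy reduction from splitting on one categorical feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature_index], []).append(label)
    # Weighted average entropy of the groups produced by the split.
    remainder = sum(len(g) / len(labels) * shannon_entropy(g) for g in groups.values())
    return shannon_entropy(labels) - remainder

rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "hot"), ("rainy", "mild")]
labels = ["no", "no", "yes", "yes"]
```

Here the first feature separates the labels perfectly while the second tells us nothing, so a tree would split on the first feature.
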
Basic Python pointers

Good Reads
Link1 , Link2, Link3, Link4, Link5, Link6

Happy Learning!!!

August 15, 2016

Day #28 - R - Forecast Library Examples

Following Examples discussed. Library used - R - Forecast Library
  • Moving Average
  • Single Exponential Smoothing - Uses single smoothing factor
  • Double Exponential Smoothing - Uses two constants and is better at handling trends
  • Triple Exponential Smoothing - Smoothing factor, trend, seasonal factors considered
  • ARIMA
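
The session used the R forecast library; the sketch below is not that API. It is a plain Python illustration, with assumed function names and a made-up series, of the first two techniques in the list.

```python
def moving_average(series, window):
    """Trailing moving average; one value per full window."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

def single_exponential_smoothing(series, alpha):
    """s[0] = x[0]; s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

demand = [10, 12, 14, 16, 18]  # toy series with a steady upward trend
averaged = moving_average(demand, window=3)
smoothed = single_exponential_smoothing(demand, alpha=0.5)
```

On a trending series like this one the single smoothed values lag behind the data, which is the gap double exponential smoothing addresses with its second constant.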

Happy Learning!!!

August 08, 2016

Applied Machine Learning Notes


Supervised Learning
  • Classification (Discrete Labels)
  • Regression (Output is continuous, Example - Age, Stock prices)
  • Past data + Past Outputs used
Unsupervised Learning
  • Dimensionality reduction (Data in higher dimensions, Remove dimension without losing lot of information)
  • Reducing dimensionality makes it easy for computation (Continuous values)
  • Clustering (Discrete labels)
  • No Past outputs, Only current data
Reinforcement Learning
  • Game playing is a typical reinforcement learning problem (no labelled outputs, only rewards)
  • Learning Policy
  • Negative / Positive reward for each step
Type of Models
  • Inductive (Learn model, Learn from a function) vs Transductive (Lazy learning ex- Opinion from like minded people)
  • Online (Learn from every new incoming tweet) vs Offline (Look at the past 1 year of tweets)
  • Generative (Apply Gaussian on Data, Use ML and compute Mean / Variance) vs Discriminative (Two sides of Line)
  • Parametric vs Non-Parametric Models
Happy Learning!!!

July 31, 2016

Fifth Elephant Day #2

Fifth Elephant Day #2 - Part I

Session #1 - Content Marketing
  • Distribute relevant consistent content. Traditional vs Content Marketing
Challenges
  • Delivering content with speed. Channel proliferation (mobile, computers, tablets)
  • Intersection of Brands, Trends, Community Interests (Social media post and metrics)
  • Data from social media pages, online aggregators

Technical Details
  • Computation of term frequency, inverse document frequency
  • Using Solr, Lucene for Indexes
  • Cosine Similarity
  • Greedy Algorithm
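
A minimal sketch of the TF-IDF and cosine similarity steps listed above, in plain Python. The weighting used here (term frequency divided by document length, idf = log(N / df)) is one common variant and an assumption; Solr and Lucene use their own scoring formulas. The three documents are made up.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
                        for term, count in counts.items()})
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = [
    "cricket match score update".split(),
    "cricket score live".split(),
    "election results live update".split(),
]
vectors = tf_idf_vectors(docs)
```

The two cricket posts end up closer to each other than either is to the election post, which is the signal a content matcher would rank on.
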
Session #2 - Reasoning
  • Prediction vs Reasoning problem
  • Prediction Problems Evolution 
  • At Advanced level Deep Learning, XGBoost, Graphical models
When Apply prediction ?
Features as input -> Prediction performed (Independent, stateless)

Reasoning - Sequential, Stateful Exploration
Reasoning Problems - Diagnosis, routes, games, crossing roads

Flavours of Reasoning
  • Algorithmic (Search)
  • Logical reasoning
  • Bayesian probabilistic reasoning
  • Markovian reasoning
Knowledge, learning the process of reasoning; knowledge graphs were shown as an implementation of reasoning
{subject, predicate, object}
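
A toy sketch of a {subject, predicate, object} store and a two-step lookup over it, in plain Python. The triples, predicate names and helper functions are all made up for illustration and carry no real diagnostic meaning.

```python
triples = [
    ("fever", "symptom_of", "flu"),
    ("cough", "symptom_of", "flu"),
    ("flu", "treated_by", "rest"),
]

def query(triples, subject=None, predicate=None, obj=None):
    """Return triples matching every field that is not None."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

def treatments_for_symptom(triples, symptom):
    """Two-hop reasoning: symptom -> condition -> treatment."""
    results = []
    for _, _, condition in query(triples, subject=symptom, predicate="symptom_of"):
        for _, _, treatment in query(triples, subject=condition, predicate="treated_by"):
            results.append(treatment)
    return results
```

Unlike a single stateless prediction, the answer is built by chaining facts, which matches the sequential, stateful flavour of reasoning described in the session.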

Session #3 - Continuous online learning
  • 70% noise in C2B communication
  • 100% noise in B2C communication
  • Zipfian
Technicalities
  • Apriori - Market Basket Analysis
  • XGBoost - Alternative to DL
  • Bias - Variance Tradeoff
  • Spectral Clustering
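
A compact sketch of the Apriori idea behind market basket analysis, limited to itemsets of size one and two. The function name, baskets and support threshold are assumptions for illustration.

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets up to max_size.

    Apriori idea: a size-k candidate is only counted if all of its
    size-(k-1) subsets were already found to be frequent.
    """
    n = len(transactions)
    frequent = {}
    previous = None
    for size in range(1, max_size + 1):
        counts = Counter()
        for basket in transactions:
            for candidate in combinations(sorted(basket), size):
                if size == 1 or all(sub in previous
                                    for sub in combinations(candidate, size - 1)):
                    counts[candidate] += 1
        previous = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        if not previous:
            break
        frequent.update(previous)
    return frequent

baskets = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
result = frequent_itemsets(baskets, min_support=0.5)
```

Pruning by subsets is what keeps the candidate count manageable when the number of distinct items grows.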

Birds of a Feather Session
  • Google DeepMind (used to optimize data centre cooling)
  • Bayesian Probabilistic Learning
  • Deep Learning - Build Hierarchy of features (OCR type of problems)
  • Traditional Neural Network (fully connected, many degrees of freedom)
  • Structural causality (Subsystem appears before, Domain knowledge)
  • Temporal causality - This and then that happened
  • CNN - learning weights
  • Spectral clustering
  • PCA (reduce high-dimensional data to fewer dimensions)
  • Deep Learning - Hidden layers obtained through coarse grained process
Deep Learning workshop Notes
  • Neural Networks
  • Multiple Layers
  • Lots of data
People Involved - Hinton, Andrew Ng, Bengio, LeCun

Deep Learning now
  • Speech recognition
  • Google Deep Models on Phone
  • Google street view (House numbers)
  • Imagenet
  • Captioning images
  • Reinforcement learning
Neural Networks
  • Simple mathematical units combine into complex functions
  • X -> input, W -> weights, output is a non-linear function of the weighted input
Multiple Layers
  • Multiple hidden layers between input and output
  • Training hidden layers is challenge
Gradient Descent
  • Define loss function
  • Minimize it by moving against the gradient (downhill)
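
A minimal gradient descent sketch in plain Python. The quadratic loss, the starting point and the learning rate are assumptions chosen so the result is easy to check by hand.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to reduce the loss."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

# Hypothetical loss: L(x) = (x - 3)^2, so dL/dx = 2 * (x - 3).
def loss(x):
    return (x - 3) ** 2

def grad(x):
    return 2 * (x - 3)

minimum = gradient_descent(grad, start=0.0)
```

With this learning rate each step shrinks the distance to the minimum at x = 3 by a constant factor; too large a rate would overshoot and diverge.
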
Backpropagation
  • Move Errors back through the network
  • Based on the chain rule
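
To show the chain rule at work, here is a sketch of backpropagation through a single sigmoid neuron with a squared-error loss, checked against a numerical gradient. The input, weight, bias and target values are arbitrary assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w, b):
    """Single neuron: weighted input followed by a non-linearity."""
    return sigmoid(w * x + b)

def loss(x, w, b, target):
    return 0.5 * (forward(x, w, b) - target) ** 2

def backward(x, w, b, target):
    """Chain rule: dL/dw = dL/dy * dy/dz * dz/dw (and likewise for b)."""
    y = forward(x, w, b)
    dL_dy = y - target
    dy_dz = y * (1.0 - y)  # derivative of the sigmoid
    return dL_dy * dy_dz * x, dL_dy * dy_dz  # (dL/dw, dL/db)

x, w, b, target = 1.5, 0.8, -0.2, 1.0
grad_w, grad_b = backward(x, w, b, target)

# Numerical check with central differences.
eps = 1e-6
numeric_w = (loss(x, w + eps, b, target) - loss(x, w - eps, b, target)) / (2 * eps)
```

In a multi-layer network the same product of local derivatives is carried backwards layer by layer, which is what "moving errors back through the network" refers to.
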
Tools
  • Caffe - Configuration file
  • Torch - Describe network in Lua
  • Theano - Describes computation, writes cuda code, runs and gives results
CNN
  • Used for images
  • Images are organized
  • Apply Convolutional filter
  • For Deep Learning GPU is important
Imagenet Competition
  • Convolution (Have all nice features retain them)
  • Pooling (Shrink image)
  • Softmax
  • Other
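
A plain Python sketch of the three building blocks named above: convolution, pooling and softmax. The 5 x 5 image, the edge filter and the scores are made-up values; real networks learn the filter weights and run these operations on a GPU.

```python
import math

def convolve2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; shrinks each side by a factor of `size`."""
    return [[max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(feature_map[0]) - size + 1, size)]
            for r in range(0, len(feature_map) - size + 1, size)]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
vertical_edge = [[-1, 1], [-1, 1]]  # responds where brightness rises left to right

features = convolve2d(image, vertical_edge)  # 4 x 4 feature map
pooled = max_pool2d(features)                # 2 x 2 after pooling
probabilities = softmax([2.0, 1.0, 0.1])
```

The filter fires only along the vertical edge, and pooling keeps that response while shrinking the map.
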
Simplest RNN - vanishing gradient problem
LSTM (Long Short Term memory)
Interword relationships from corpus (word2vec)

Happy Learning!!!

July 28, 2016

Fifth Elephant Day #1 Notes - Part II

Sessions # - Link

Talk #3 - Machine Learning in FinTech
  • Lending Space
  • Credit underwriting system
India
  • 2% Credit card usage
  • 65% of population < 27 yrs
  • Digital foot print (mobile)
  • Identity (Aadhaar)
40 Decisions / Minute -> 100 Crores a month

Use Cases / Scenarios
  • Truth Score (Validity of address / person / sources)
  • Need Score (Urgency / Time to respond application)
  • Saver Score (cash flow real-time analytics)
  • Credit Score (Debt to income)
  • Credit awareness score
  • Continuous risk assessments
Talk #4 - Driving Behaviour from Smartphone Sensors
  • Safe driving using smartphone sensors
  • Spatial / location data
  • Road traffic injuries due to distracted driving
  • Phone usage - 4x crash risk
  • Speedy driving - 45% car crash history
  • Driving behavior analysis / driving feedback
  • GPS + Inertial Navigational sensors (Accelerometer / Gyroscope / Magnetometer)
Characterization
  • Drive detection
  • Event detection
  • Collision detection
Qualification
  • Drive summarization and scoring
  • Risk modelling
Optimization
  • Events, location of events, duration of events
Dynamics
  • Sensors
  • Availability - wide variety across devices
  • Raw Data - noisy, unevenly spaced time series
  • Events - Time scales, combination of sensors
  • Model building - Labelled vs unlabelled data, feature engineering
  • Algorithms - Stream / batch efficiency
Techniques
  • Cluster data 
  • Eliminate uninteresting time periods
  • Classification / Regression models
  • Spectral clustering
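
A rough sketch of two steps implied by the notes above: putting a noisy, unevenly spaced series onto a uniform time grid, then flagging an event. It uses speed readings and a hand-picked braking threshold as stand-ins; the function names, data and threshold are assumptions, not the speaker's method. At least two samples with distinct timestamps are assumed.

```python
def resample_linear(samples, step):
    """Linearly interpolate (time, value) samples onto a uniform time grid."""
    samples = sorted(samples)
    start, end = samples[0][0], samples[-1][0]
    n_points = int((end - start) / step + 1e-9) + 1
    resampled, j = [], 0
    for i in range(n_points):
        t = start + i * step
        # Advance to the segment [samples[j], samples[j + 1]] containing t.
        while j < len(samples) - 2 and samples[j + 1][0] < t:
            j += 1
        (t0, v0), (t1, v1) = samples[j], samples[j + 1]
        resampled.append((t, v0 + (v1 - v0) * (t - t0) / (t1 - t0)))
    return resampled

def harsh_braking_times(speed_series, threshold):
    """Times where speed drops by more than `threshold` per second."""
    events = []
    for (t0, v0), (t1, v1) in zip(speed_series, speed_series[1:]):
        if (v0 - v1) / (t1 - t0) > threshold:
            events.append(t1)
    return events

# Hypothetical speed readings (seconds, metres per second), unevenly spaced.
raw = [(0.0, 20.0), (0.7, 20.0), (2.0, 20.0), (3.0, 10.0), (4.0, 10.0)]
uniform = resample_linear(raw, step=1.0)
events = harsh_braking_times(uniform, threshold=5.0)
```
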
Talk #5 - Indian Agriculture
  • Crop rotation literacy
  • Data curation, Query tools on data product
  • Visualization and plotting of Agricultural data
Talk #6 and #7 - Last two talks were from Ecologists
  • Using Image comparison for Big Cat Counting
  • Predicting Big Cat Areas (Territories)
  • Observe Nature, Frame Hypothesis, Design Experiments
  • Confront with competing hypothesis
  • Spacegap program
  • Markov chain Monte-Carlo technique


Happy Learning!!!

Fifth Elephant Day #1 Notes - Part I

Sessions # - Link

Talk #1 - Data for Genomic Analysis

Great talk by Ramesh. I had attended his session / technical discussion earlier. This session provided insights on genome / discrepancies in genome sequence leading to rare diseases.

Genome - 3 Billion X 2 Characters
Characters vary from person to person
Stats (1/10th of probability of cancer)
Baseline risk for breast cancer (1/8),(1/70) ovarian cancer
BRCA1 mutation (5-6 fold increase in breast cancer, 27 fold increase for ovarian cancer)

In India
  • 35% inherited risk mutation
  • 1/25 Thalassemia 
  • 1 in 400-900 Retinitis Pigmentosa
  • 1 in 500, Hypertrophic Cardiomyopathy
Data Processing
  • 1 Billion reads - 100GB data per person
  • Very similar sequence yet one character might differ
  • But reference is 3 Billion long
Efficiency
  • Need fast indexing
  • Suffix Trees and variations
  • Hash table based approaches
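
A toy sketch of the hash-table approach to indexing a reference sequence: index every k-mer, use a read's first k-mer as a seed, then verify the whole read while tolerating a single differing character. The sequences, k and the function names are assumptions for illustration; production aligners are far more elaborate.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every length-k substring of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def align_read(read, reference, index, k, max_mismatches=1):
    """Seed with the read's first k-mer, then verify the full read.

    Returns (position, mismatch_positions) for the first hit, or None.
    """
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue
        mismatches = [i for i, (a, b) in enumerate(zip(read, window)) if a != b]
        if len(mismatches) <= max_mismatches:
            return pos, mismatches
    return None

reference = "ACGTTGCAACGGTACCTGA"  # toy stand-in for a 3-billion-character reference
index = build_kmer_index(reference, k=4)
hit = align_read("AACGGTTC", reference, index, k=4)
```

The lookup cost depends on the number of seed hits and not on the reference length, which is why fast indexing matters at a billion reads per person.
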
Reference Genome Sequence
  • Volume of data
  • Funnel down of variety of dimensions
  • Triplet Code (Molecule)
  • Variants of triplets narrowed down to differences in the genome
  • GPU processing / reduce computation time
Concepts Discussed / Used
  • Hypothesis Testing
  • Stats Models
  • GPU Processing to reduce computation time
They also provide assessment for hereditary diseases at corporate level.

Talk #2 - Alternative to Wall Street Data

This session gave me some new strategies to collect / analyze data

How to Identify occupancy rate at hotel ?
  •  Count of cars from parking lots
  •  Number of rooms lights on
  •  Take pics of rooms from corner of street and predict based on images collected
  •  Unconventional ways to think of data collection (Beating the wall street model)
What are usual ways
  •  Checking websites
From an investor's perspective, key lodging metrics are very important
Data Sources
  • Direct data gathering
  • Web harvesting
  • Primary research
Primary Research
  • Look at and notice patterns in front of you
  • Difference in invoice numbers
  • Serial number changes, difference values
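
The invoice-number idea can be sketched as simple arithmetic: two sightings of a sequential number, some days apart, give a rough volume estimate. The dates and numbers below are hypothetical, and the sketch assumes numbers are issued sequentially with no gaps or resets.

```python
from datetime import date

def estimated_daily_volume(first, second):
    """Estimate units per day from two sequential-number observations.

    Each observation is (date, serial_number).
    """
    (day_a, number_a), (day_b, number_b) = first, second
    elapsed_days = (day_b - day_a).days
    if elapsed_days <= 0:
        raise ValueError("second observation must be later than the first")
    return (number_b - number_a) / elapsed_days

# Hypothetical invoice numbers seen on two visits to the same business.
rate = estimated_daily_volume((date(2016, 7, 1), 10400), (date(2016, 7, 31), 11900))
```
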
Free Data Sets in link
Lot of opportunity
  • Analyze international markets (India / China)
  • COGS
  • SG&A
  • etc.
How to value data sets ?
  • Scarcity - How widely used
  • Granularity - Time / aggregation level
  • Structured
  • Coverage



What is the generative value
  • Revenue Surprise Estimates
  • Dataset insight / Analysis
  • Operating GAAP measures
A great case study on the impact of smart watches vs luxury watches was presented. This session provided great insight into unconventional ways of collecting data
  • Generate money in automated system
  • Stock sensitivity to revenue surprises
  • Identify underlying ground truth
"Some Refreshing changes to world of investment"

Happy Learning!!!