"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

December 31, 2017

December 08, 2017

Day #93 - Regularizations

Four methods of Regularization
  • Cross Validation inside training data
    • 4 to 5 folds of K-Fold Validations
    • Split into K non-intersecting subsets
    • Leave one out scheme
    • Target variable leakage is still present in K Fold Scheme
  • Smoothing based on the size of the category (see the sketch after this list)
    • Large categories with many data points can be trusted more
    • Formula = (mean(target)*nrows + globalmean*alpha)/(nrows + alpha)
    • alpha = the category size we can trust
  • Add Random Noise
    • Unstable, Hard to make it work
    • Too much noise
    • LOO, Leave one out Regularization
  • Expanding mean - sorting the data and calculating a running mean of the target
    • Fix the sorting order of the data
    • Use rows 0 to N-1 to calculate the mean for row N
    • Least Leakage
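
A minimal pandas sketch of the smoothing formula and the expanding-mean scheme above, assuming a toy frame with hypothetical 'category' and binary 'target' columns:

import numpy as np
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"category": ["a", "a", "a", "b", "b", "c", "c"],
                   "target":   [1,   0,   1,   0,   0,   1,   0]})
global_mean = df["target"].mean()
alpha = 5  # roughly the category size we are willing to trust

# Smoothing: (mean(target)*nrows + globalmean*alpha) / (nrows + alpha)
stats = df.groupby("category")["target"].agg(["mean", "count"])
smoothed = (stats["mean"] * stats["count"] + global_mean * alpha) / (stats["count"] + alpha)
df["cat_enc_smoothed"] = df["category"].map(smoothed)

# Expanding mean: fix a row order, use only rows 0..N-1 to encode row N
cumsum = df.groupby("category")["target"].cumsum() - df["target"]
cumcnt = df.groupby("category").cumcount()
df["cat_enc_expanding"] = (cumsum / cumcnt).fillna(global_mean)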
 Happy Learning!!!

November 28, 2017

Day #92 - Mean Encoding

Mean Encoding
  • Add new variables based on certain features
  • Label encoding is done usually
  • Mean encoding encodes each category with the mean of the target over that category
  • The proportion of label encoding also is included in this step
  • Mean encoding compared with label encoding
  • Label encoding - No logical order
  • Mean encoding - Classes are separable
  • We can reach a better loss with shorter trees
  • With plain label encoding, trees need a huge number of splits
  • Model tries to treat all categories differently
Constructing Mean Encoding
  • Goods - Number of ones in a group
  • Bads - Number of zeros
Likelihood = Goods/(Goods + Bads) = mean(target)
Weight of Evidence = ln(Goods/Bads) * 100
Count = Goods = sum(target)
Diff = Goods-Bads
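
A minimal pandas sketch of these quantities, assuming a toy frame with a hypothetical categorical 'feature' and a binary 'target':

import numpy as np
import pandas as pd

df = pd.DataFrame({"feature": ["a", "a", "b", "b", "b", "c", "c"],
                   "target":  [1,   0,   1,   1,   0,   0,   1]})

grp = df.groupby("feature")["target"]
goods = grp.sum()                    # number of ones in each group
bads = grp.count() - goods           # number of zeros in each group

likelihood = goods / (goods + bads)  # = mean(target) per category
woe = np.log(goods / bads) * 100     # weight of evidence
diff = goods - bads

df["feature_mean_enc"] = df["feature"].map(likelihood)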


Happy Learning!!!

November 24, 2017

Database Sharding and Scalability Basics

Some key considerations for NoSQL vs RDBMS
  • Performance - Latency tolerance, How slow my queries can run for huge data sets
  • Durability - Tolerance for data loss when the database crashes (losing in-memory data) or for lost transactions
  • Consistency - Weird results tolerance (Dirty data tolerance)
  • Availability - Downtime tolerance
Options for Scalability
  • Replication - Create copies of database, Application can talk to either database
  • Sharding - Choose a partition key; key-value stores partition data based on the key
  • Caching - Precomputed and stored, Manage cache expiration time and refresh logic
For streaming data we had already discussed Event Hubs and Apache Kafka. Now there is also KSQL (Kafka streaming SQL to run on continuous data).

Great Session Talk

 

RDBMS vs NoSQL Considerations, Quick Summary
  • Performance - Latency tolerance
  • Durability - Data loss tolerance
  • Consistency - Weird results tolerance (Dirty data tolerance)
  • Availability - Downtime tolerance
Happy Learning!!!

November 16, 2017

Day #91- Retail Analytics - Data Mining / Analytics

Running a successful #Retail store involves a lot of Data Mining / Analytics challenges to solve in order to arrive at decisions based on data. Some of the interesting Retail Data Mining / Analytics problems are
  • What sells best in each store, with item-level details?
  • What are the shopping times/routines for a particular store?
  • Using web data, identify the relevance of the shopping district / retail environment
  • What are the money-making items in the store (quantity vs price)?
  • What is the Sales / Stock ratio?
  • What is the forecast value of minimum orders for items in each store based on sales/traffic trends?
  • What is the correlation between loss items, shopping days/periods and people movements?
  • What are the retail price points identified based on end-of-season sales?
Forecasts / predictions come as next steps after data analysis.
Happy Analytics!!!

November 15, 2017

Day #90 - Regression Metrics Optimization

RMSE, MSE, R-Squared (Sometimes called L2 Loss)
Tree-Based
  • XGBoost, LightGBM
  • sklearn.RandomForestRegressor
Linear Models
  • sklearn.<>Regression
  • sklearn.SGDRegressor
Neural Networks
  • PyTorch
  • Keras
MAE (L1, Median Regression)
Tree-Based
  • LightGBM
  • sklearn.RandomForestRegressor
MSPE, MAPE
  • MSPE is weighted version of MSE
  • MAPE is weighted version of MAE
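
A minimal sketch of that idea on assumed toy data: since MSPE behaves like a weighted MSE, it can be approximated by passing sample weights proportional to 1/y^2 to any model that optimizes MSE and accepts sample weights (sklearn's regressors do):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = rng.rand(200) + 0.5           # keep targets away from zero

weights = 1.0 / (y ** 2)          # MSPE ~ MSE weighted by 1 / y^2
weights = weights / weights.sum()

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y, sample_weight=weights)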
Happy coding and learning!!!

November 14, 2017

Day #89 - Capsule networks

Key lessons
  • Instead of adding layers it nests layers inside it
  • We apply non-linearity to grouped neurons (a capsule)
  • Dynamic routing - Replace scalar output feature detector of CNN by routing by agreement based on output
CNN History
  • Latest paper on capsule networks
  • Offers state of art performance for MNIST dataset
  • Convolutional networks - Learn mapping for input data and output label
  • Convolution layer - Series of matrix multiplication and summation operations; outputs a feature map (a bunch of learned features from the image)
  • ReLU - Applies non-linearity (so the network can learn both linear and non-linear functions). Helps with the vanishing gradient problem (as the gradient backpropagates it gets smaller and smaller; ReLU mitigates this)
  • Pooling - Creates sections and takes the maximum pixel value from each section
  • Each line of code corresponds to a layer in the network (a small Keras sketch follows this list)
  • Dropout - Neurons randomly turned off to prevent overfitting (regularization technique)
  • For handling rotations - AlexNet added rotated copies of images to generalize to different rotations
  • Deeper networks improved classification accuracy
  • VGGNet - adding more layers
  • GoogLeNet - Convolutions with different sizes processed on the same input, several of those stacked together
  • ResNet - Instead of just stacking layers, an add (skip) operation alleviated the vanishing gradient problem
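
A minimal Keras sketch of the convolution -> ReLU -> pooling -> dropout pipeline described in the list above, assuming an MNIST-like 28x28 grayscale input and 10 classes:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dropout, Dense

model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),  # convolution + ReLU non-linearity
    MaxPooling2D((2, 2)),     # pooling: keep the max value in each section
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dropout(0.5),             # dropout: randomly drop neurons as regularization
    Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])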

Convolutional Network Challenges
  • As we go up the hierarchy each of features learnt will be more complex
  • Hierarchy happening with each layers
  • Sub-sampling loses spatial relationships
  • Spatial correlations are missed in sub-sampling and pooling
  • Bad for rotated images (Invariance issues)
Capsule Networks
  • Basic idea - The human brain attains translational invariance in a better way; instead of adding layers it nests layers inside it
  • Nested layer is called capsule, group of neurons
  • CNN route by pooling
  • Deeper in terms of nesting
Layer based squashing
  • Based on output neuron we apply non-linearity
  • We apply non-linearity to grouped neurons (capsule)
Dynamic routing
  • Replace scalar output by routing by agreement
  • Hierarchy tree of nested layers
Key difference - All iterations to compute output, For every capsule nested apply operations
Happy coding and learning!!!

Day #88 - Metrics Optimization

Loss vs Metric
  • Metric - The function we want to use to evaluate the model, e.g. accuracy in classification
  • Optimization Loss - Easy to optimize for given model, Function our model optimizes. MSE, LogLoss
  • Preprocess train and optimize another metric - MSPE, MAPE, RMSLE
  • Optimize another metric, postprocess predictions - Accuracy, Kappa
  • Early Stopping - Stop training when the model starts to overfit
 Custom loss functions
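
A minimal XGBoost sketch of early stopping on assumed toy data: boosting stops once the validation metric stops improving.

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

params = {"objective": "binary:logistic", "eval_metric": "logloss"}
model = xgb.train(params, dtrain, num_boost_round=1000,
                  evals=[(dvalid, "valid")],
                  early_stopping_rounds=50)  # stop when validation logloss has not improved for 50 rounds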

Accuracy Metrics





Happy Coding and Learning!!!

November 10, 2017

Day #87 - Classification Metrics

  • Accuracy (Essential for classification), Weighted Accuracy = Weighted Kappa
  • Logarithmic Loss (Depends on soft predictions probabilities)
  • Area under Receiver Operating Curve (Considers ordering of objects, tries all threshold to convert soft predictions to hard labels)
  • Kappa (Similar to R Squared)
Notations
N - Number of objects
L - Number of classes
y - Ground truth
y~ - Predictions
[a = b] - indicator function
  • Soft labels (soft predictions) are classifier's scores - Probabilities of objects
  • Hard labels (hard predictions) - argmax of f(x), or [f(x) > b] with b a threshold for binary classification; take the class with the maximum soft prediction as the predicted label. A function of the soft labels
Accuracy Score
  • Most referred measure of classifier quality
  • Higher is better
  • Need hard predictions
  • Number of correctly guessed objects
  • Argmax of soft predictions
Logloss
  • Work with soft predictions
  • Make classifier output posterior probabilities
  • Penalises for wrong answers
  • Set constant to frequencies of each class
Area Under Curve
  • Based on threshold decide percentage of above / below the threshold
  • Metric tries all possible ones and aggregate scores
  • Depends on order of objects
AUC - ROC
  • Compute TruePositive, FalsePositive
  • AUC max value 1
  • Fraction of correctly ordered pairs
AUC = (number of correctly ordered pairs) / (total number of pairs)
    = 1 - (number of incorrectly ordered pairs) / (total number of pairs)

Cohen's Kappa
  • Score = 1- ((1-accuracy)/(1-baseline))
  • Baselines different for each data
  • Similar to R squared
  • Here the baseline is the accuracy of random predictions made according to the class frequencies of the dataset
  • Error = (1- Accuracy)
  • Weighted Error Score = Confusion matrix * Weight matrix and sum their results
  • Weighted Kappa = 1 - ((weighted error)/(weighted baseline error))
  • Useful for medical applications
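
A minimal sklearn sketch on assumed toy predictions, computing the metrics covered above:

import numpy as np
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score, cohen_kappa_score

y_true = np.array([0, 0, 1, 1, 1, 0])
y_soft = np.array([0.1, 0.4, 0.8, 0.7, 0.3, 0.2])  # soft predictions (probabilities)
y_hard = (y_soft > 0.5).astype(int)                # hard labels via a threshold of 0.5

print(accuracy_score(y_true, y_hard))     # needs hard predictions
print(log_loss(y_true, y_soft))           # works with soft predictions
print(roc_auc_score(y_true, y_soft))      # depends only on the ordering of objects
print(cohen_kappa_score(y_true, y_hard))  # 1 - (1 - accuracy) / (1 - baseline)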

Ref - Link


Happy Learning and Coding!!!

November 09, 2017

Day #86 - Regression Metrics

  • Relative Errors most important to us
  • MSE, MAE work with absolute errors, not relative errors
  • MSPE (mean square percentage error)
  • MAPE (mean absolute percentage error) - Weighted version of MAE
  • RMSLE (Root mean square logarithmic error) - RMSE calculated in logarithmic scale - cares about relative errors
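
A minimal numpy sketch of these relative-error metrics on assumed toy arrays:

import numpy as np

y_true = np.array([10.0, 100.0, 1000.0])
y_pred = np.array([12.0, 110.0, 1100.0])

mspe = np.mean(((y_true - y_pred) / y_true) ** 2)                     # mean square percentage error
mape = np.mean(np.abs(y_true - y_pred) / y_true)                      # mean absolute percentage error
rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))  # RMSE in log scale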
Happy Coding and Learning!!!

November 07, 2017

Day #85 - Regression Metrics Optimization

Metrics
  • Metrics used to evaluate submissions
  • Best result finding optimal hyperplane
  • Exploratory metric analysis along with data analysis
  • Own ways to measure effectiveness of algorithms
Regression - Metrics
  • Mean Square Error
  • RMSE
  • R Squared
  • Same from optimization perspective
Classification
  • Accuracy
  • LogLoss
  • AUC
  • Cohen's Kappa
Regression Metrics
N - number of samples
y - target values
y~ - predictions
yi - the i-th target value
yi~ - the prediction for the i-th object

Mean Square Error
MSE = (1/N) * Σ (yi - yi~)^2
- Average of the squared differences between the targets and the predictions

RMSE - Root Mean square Error = Sqrt(MSE)

  • Same as scale of target
  • RMSE vs MSE
  • Similar in terms of minimizers
  • Every RMSE minimizer is MSE minimizer
  • MSE(a) > MSE(b) <=> RMSE(a) > RMSE(b)
  • MSE orders in same way as RMSE
  • MSE easier to work with
  • A bit of difference for gradient-based models
  • They may not be interchangeable for such learning methods (the learning rate needs adjusting)
R Squared
  • How much model is better than constant baseline
  • R Squared = 1 when predictions are perfect
  • When MSE is 0, R Squared = 1
  • All reasonable models score between 0 and 1
MAE - Mean Absolute Error
  • Avg of absolute difference value between target and predictions
  • Widely used in Finance
  • A $10 error is twice as bad as a $5 error
  • MAE easier to justify
  • Median of target values useful for MAE
  • The MAE gradient is a step function: -1 when the prediction is smaller than the target, +1 when it is greater
  • MAE is not differentiable when the prediction equals the target
MAE vs MSE
  • For outliers - use MAE
  • If the outliers are unexpected but normal values - use MSE
  • MAE robust to outliers
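
A minimal sklearn/numpy sketch on toy data of the metrics above, plus the constant-baseline fact that the mean of the target minimizes MSE while the median minimizes MAE:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # note the outlier
y_pred = np.array([1.5, 2.5, 2.5, 4.5, 20.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

best_const_for_mse = y_true.mean()      # constant prediction minimizing MSE
best_const_for_mae = np.median(y_true)  # constant prediction minimizing MAE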
Happy Learning and Coding!!!

November 05, 2017

Day #84 - Data Leaks and Validations

  • Mimic the train / test split of the test data
  • Perform KFold Validations
  • Choose best parameters for models
  • Submission Stage (Can't mimic exact train / test split)
  • Calculate mean and standard deviations of leader board scores
Data Leaks
  • Unexpected information in data that lets you make good predictions
  • Unusable in real world
  • Results of unintentional error
Time Series
  • Incorrect time splits still exist
  • Check public and private splits
  • Missing feature columns are data leaks
Unexpected Information
  • Use File creation dates
  • Resize features / change creation date
  • ID's no sense to include in model
Happy Learning and Coding!!!

October 31, 2017

Day #83 - Data Splitting Strategies

  • Time based splits
  • Validation to mimic the train / test split
  • Time based trend - differs significantly, Time based patterns important
Different splitting strategies can differ significantly
  • In generated features
  • In a way model will rely on that features
  • In Some kind of target leak
 Split Categories
  •  Random Split (Split randomly by rows, Rows independent of each other), Row wise
  • Devise special features for dependency cases
  • Timewise - Before particular date as training, After date as testing data. Useful features based on target
  • Moving window validation
  • By Id - (By Clustering pictures, grouping them and then finding features)
  • Combined (Split date for each shop independently)
Summary
  • In most cases split by Rownumber, Time, Id
  • Logic for feature generation depends on data splitting strategy
  • Set up your validation to mimic the train / test split of competition
Happy Learning and Coding!!!

Day #82 - Validation and Overfitting


  • Train Data (Past), Unseen Test Data (Future)
  • Divide into three parts - Train (Past), Validation (Past), Test (Future)
  • Underfitting (High Error on Both Training and Validation)
  • Overfitting (Doesn't generalize to test data, Low Error on Train, High Error on Validation)
  • Ideal (Lowest error on both training and testing data)
Validation Strategies
  • Hold Out (divide data into training / testing, No overlap between training / testing data ) - Used on Shuffle Data
  • K-Fold (Repeated hold out because we split our data) - Good Choice for medium amount of data, K- 1 training, one subset - Used on Shuffle Data
  • Leave one out : ngroups = len(train) - Too Little data (Special case of K fold, K = number of samples)
  • Stratification - Similar target distribution over different folds
Stratification useful for
  • Small datasets (Do Random Splits)
  • Unbalanced datasets
  • Multiclass classification
 Stratification preserves the target distribution over different folds
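
A minimal sklearn sketch of the hold-out, K-Fold and stratified K-Fold schemes above, on toy unbalanced data:

import numpy as np
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold

X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])  # unbalanced target

# Hold-out: a single train / validation split with no overlap
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# K-Fold: repeated hold-out; K-1 folds train, one fold validates
for tr_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    pass  # fit / evaluate the model here

# Stratified K-Fold: preserves the target distribution in every fold
for tr_idx, val_idx in StratifiedKFold(n_splits=2, shuffle=True, random_state=0).split(X, y):
    pass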

Happy Coding and Learning!!!

October 30, 2017

Day #81 - Dataset Cleaning


Dataset cleaning
  • Constant features (Remove features whose value remains constant in both training and testing data; if a value is constant in training but changes in testing, it is better to remove that feature too - only a fraction of the feature values may have been supplied in the data); see the sketch below
  • Duplicated features (Completely identical columns, This will slow down training time, remove duplicate columns)
  • Duplicated categorical features (Encode categorical features and compare them)
Other things to check
  • Duplicated rows (Duplicated rows with different targets, could be result of mistake, remove those duplicated rows to have high score on test set)
  • Check for common rows in train and test sets (Set labels manually for test rows in training set)
  • Check if dataset is shuffled (Oscillations around mean would be observed)
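
A minimal pandas sketch of these cleaning checks on a toy frame:

import pandas as pd

train = pd.DataFrame({"a": [1, 1, 1, 1],        # constant feature
                      "b": [1, 2, 3, 4],
                      "c": [1, 2, 3, 4],        # duplicate of b
                      "target": [0, 1, 0, 1]})

# Drop constant features
constant_cols = [c for c in train.columns if train[c].nunique() == 1]
train = train.drop(columns=constant_cols)

# Drop duplicated columns (identical content)
train = train.T.drop_duplicates().T

# Count duplicated rows
dup_rows = train.duplicated().sum()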
EDA Checklist
  • Get Domain Knowledge
  • Check How data is generated
  • Explore individual feature
  • Explore pairs and groups
  • Clean features
Happy Learning and Coding!!!

October 29, 2017

Day #80 - Visualizations

EDA is an art. Visualizations are art tools. Several different plots to prove hypothesis

Visualization Tools
  • Histograms (Split into bins, how many points fall in each bins, vary number of bins) - plt.hist(x)
  • XGBoost will benefit from explicit missing values
  • Plots - index versus value, plt.plot(x,'.'), randomness over indices
  • Statistics
Explore Feature Relations
  • Scatter Plots (Draw one feature vs another); compare how the data is distributed between the train and test sets
  • Correlation Plots (Run K-means clustering and reorder features) - how similar features are
  • Plot (index vs feature statistics)
Feature Groups
  • Generate new features based on groups
Pairs
  • ScatterPlot, Scatter matrix
  • Correlation Plot (Corrplot)
 Groups
  •  Corrplot + Clustering
  •  Plot (Index vs feature statistics)
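
A minimal matplotlib/pandas sketch of these visual checks on a toy frame:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.rand(100, 3), columns=["f1", "f2", "f3"])

plt.hist(df["f1"], bins=20)                     # histogram: vary the number of bins
plt.figure(); plt.plot(df["f1"].values, '.')    # index vs value: look for randomness over indices
plt.figure(); plt.scatter(df["f1"], df["f2"])   # scatter plot of one feature vs another
plt.matshow(df.corr())                          # correlation plot of the features
plt.show()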

More Read (Link)




Happy Learning and Coding!!!

Day #79 - Exploratory Data Analysis (EDA)

EDA
  • Looking data, Understanding data
  • Complete data understanding required to build accurate models
  • Generate Hypothesis / Apply Intuition 
  • Top solutions use Advanced and Aggressive Modelling
  • Find insights and magic feature, Start with EDA before hardcore modeling
Visualization
  • Identify Patterns (Visualization to idea)
  • Use patterns to find better models (Idea to visualization, Hypothesis testing)
EDA Steps
  • Domain Knowledge (Google, Wikipedia understand data)
  • Check the data is intuitive (validate values against acquired domain knowledge; manually correct errors, or mark incorrect rows and label them for the model to leverage)
  • Understand how the data is generated (Are the test set and training set generated by the same algorithm? We need to know the underlying data generation process; visualize training / test set plots)
Exploring Anonymized and Encrypted Data
Anonymized Data
  • Replace data with encrypted text (This will not impact model though)
  • No meaningful names of columns
  • Find unique values of features, sort them and find differences
  • Distance between two consecutive values and the pattern in it
Explore Individual Features
  • Guess the meaning of the columns
  • Guess the types of the column (Categorical, Boolean, Numeric etc..)
Explore Feature Relations
  • Find relation between pairs
  • Find feature groups
Useful Python functions
  • df.dtypes
  • df.info()
  • x.value_counts()
  • x.isnull()
Happy Learning and Coding!!!

Day #78 - Image Processing - Kaggle Lessons

  • Use a model trained on similar data
  • Train a network from scratch
  • Use a pretrained model and fine-tune it later
VGGNet16 Architecture
  • Replace the last layer with a new one of size 4 (the number of target classes); see the sketch below
  • Retrain model
  • Benefit from model trained from similar dataset
Image Augmentation
  • Increase number of training samples
  • Image rotations
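
A minimal Keras sketch (assuming 4 target classes and ImageNet weights) of the fine-tuning and rotation-augmentation ideas above:

from keras.applications import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense
from keras.preprocessing.image import ImageDataGenerator

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False                  # keep pretrained weights, retrain only the new head

x = Flatten()(base.output)
out = Dense(4, activation="softmax")(x)      # new last layer sized to our 4 classes
model = Model(inputs=base.input, outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy")

augmenter = ImageDataGenerator(rotation_range=30, horizontal_flip=True)  # more training samples via rotations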
Happy Learning and Coding!!!

Day #77 - Quick Summary - Kaggle Lessons - Features, Dates, Text

  • For Features - One Hot Encoding, Label Encoding, Frequency Encoding, Ranking, MinMaxScaler, StandardScaler
  • For Dates - Periodicity - Year, Date, Week, Time Slice - Time past since particular moment (before / after), Difference in Dates (Datetime_feature1 - Datetime_feature2), Boolean binary indicating date is holiday or not
  • For Text - Preprocessing - Lowercase, Stemming, Lemmatization, stopwords removal, Ngrams can help use local context, Postprocessing - TFiDF,  Use BOW for Ngrams
Happy Coding and Learning!!!

Day #76 - Text Processing - Kaggle Lessons


Bag of Words
  • Create new column for each unique word in data
  • Count occurrences in each documents
  • sklearn.feature_extraction.text.CountVectorizer
  • More comparable by using Term Frequency
  • tf = 1 / x.sum(axis=1)[:,None]
  • x = x*tf
  • Inverse Document Frequency
  • idf = np.log(x.shape[0] / (x > 0).sum(0))
  • N Grams
  • Bag of Words (Each row represents text, Each column represents unique word)
  • Classifying document

For N = 1, This is a sentence
Unigrams are - This, is, a , sentence

For N = 2, This is a sentence
bigrams are - This is, is a, a sentence

For N = 3, This is a sentence
Trigrams are - This is a, is a sentence

sklearn.feature_extraction.text.CountVectorizer: ngram_range, analyzer
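
A minimal sklearn sketch on a toy corpus of bag-of-words with unigrams + bigrams, and a TF-IDF variant:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["This is a sentence", "This is another sentence"]

bow = CountVectorizer(ngram_range=(1, 2), analyzer="word")   # unigrams and bigrams
X_bow = bow.fit_transform(texts)        # rows = documents, columns = unique n-grams

tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(texts)    # term frequency scaled by inverse document frequency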

Text Preprocessing steps
  • Lower case
  • Lemmatization (using knowledge of vocabulary and morphological analysis of words)
  • democracy, democratic and democratization -> democracy (Lemmatization)
  • Stemming (Chops off the endings of words)
  • democracy, democratic, and democratization -> democr (Stemming)
  • Stop words (Not contain important information)
sklearn.feature_extraction.text.CountVectorizer: the max_df and stop_words parameters help with stop word removal

I have done all of this in my assignment work; it is in my GitHub code.

For Applying Bag of words
  • Preprocessing - Lowercase, Stemming, Lemmatization, stopwords removal
  • Ngrams can help use local context
  • Postprocessing - TFiDF
  • Use BOW for Ngrams
BOW example
  • Sentence - The dog is on the table
  • Representation         - are, cat, dog, is, now, on, the, table
  • BOW representation  - 0,    0,    1,    1,     0,      1,    1,    1
BOW Issue

The food was good, not bad at all
The food was bad, not good at all

Both representations are the same however the meaning varies :)

Word to Vectors
  • Get vector representation of words and texts
  • Each word converted to vector
  • Uses nearby words
  • Words used in similar contexts end up with similar vector representations
  • Apply basic operations can be done on vectors
  • Words - Word2Vec, Glove, FastText
  • Sentences - Doc2Vec
  • There are pretrained models
Bag of Words
  • Very large vectors
  • Meaning of each value in vector is unknown
Word2Vec
  • Relatively small vectors
  • Values of vector can be interpreted only in some cases
  • The words with similar meaning often have similar embeddings
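
A minimal gensim sketch (assuming a recent gensim with 4.x-style parameter names) of training word vectors on a toy corpus; in practice pretrained Word2Vec / GloVe / FastText models are the more common choice:

from gensim.models import Word2Vec

sentences = [["the", "dog", "is", "on", "the", "table"],
             ["the", "cat", "is", "on", "the", "mat"]]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1)  # small dense vectors
vec = model.wv["dog"]                     # vector representation of a word
similar = model.wv.most_similar("dog")    # words used in similar contexts get similar vectors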
Happy Learning, Happy Coding!!!

October 27, 2017

Day #75 - Missing Values

  • Reasons for Missing Values
  • How to Engineer them effectively
  • Hidden Missing Values
  • Plot distribution of values and find from histogram
Filling missing Values
  • -999, -1 (fill with some constant value) - useful as a separate category for trees, but performance of linear models can suffer
  • mean, median
  • Reconstruct value
  • add isnull column
Reconstruction
  • Missing values in timeseries
  • Temperature values missing for some days of month
  • Based on increase / decrease pattern
  • Ignore missing value while calculating mean
  • Change Categories to frequencies
  • XGBoost can handle NAN
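
A minimal pandas sketch of these filling strategies on a toy column:

import numpy as np
import pandas as pd

df = pd.DataFrame({"feature": [1.0, np.nan, 3.0, np.nan, 5.0]})

df["feature_isnull"] = df["feature"].isnull()                         # add an isnull indicator column
df["feature_fill_const"] = df["feature"].fillna(-999)                 # fill with an out-of-range constant
df["feature_fill_mean"] = df["feature"].fillna(df["feature"].mean())  # mean ignores missing values
df["feature_reconstructed"] = df["feature"].interpolate()             # reconstruct, e.g. time-series values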
Happy Learning and Coding!!!

Day #74 - Feature Generation - DateTime and Coordinates

DateTime
  • Datetime features differ significantly from numeric and categorical features
  • Periodicity - Year, Date, Week
  • Time Slice - Time past since particular moment (before / after), Time moments in period
  • Difference in Dates (Datetime_feature1 - Datetime_feature2)
  • Special Time period (Medication every 3 days)
  • Sales Predictions (Days since last holiday, Days since weekend, Since last sales campaign)
  • Boolean binary indicating date is holiday or not
  • Sales Context Churn Prediction
  •     (Date Since user registration) - DateDiff
  •     (Date Since last purchase) - DateDiff
  •     (Date Since calling customer service) - DateDiff
  • Periodicity - Day number in week, month, season, year, second, minute, hour
  • Time Slice, Difference between dates
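
A minimal pandas sketch of these datetime features on toy data with hypothetical column names:

import pandas as pd

df = pd.DataFrame({
    "purchase_date": pd.to_datetime(["2017-01-02", "2017-03-15", "2017-12-25"]),
    "registration_date": pd.to_datetime(["2016-12-01", "2017-01-01", "2017-06-01"]),
})

# Periodicity
df["day_of_week"] = df["purchase_date"].dt.dayofweek
df["month"] = df["purchase_date"].dt.month

# Difference between two datetime features
df["days_since_registration"] = (df["purchase_date"] - df["registration_date"]).dt.days

# Boolean flag for a special date (hypothetical holiday list)
holidays = pd.to_datetime(["2017-12-25"])
df["is_holiday"] = df["purchase_date"].isin(holidays)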
Coordinates
  • This can be used for churn prediction (Likelihood customer will return)
  • In Real Estate Scenario for predictions on Prices
  •     (Distance from School)
  •     (Distance from Airport)
  •     (Flats around particular point)
  • Alternatively, distance from the most expensive flat
  • Centre of clusters and find distances from centre point
  • Aggregated Statistics for surrounding data
Happy Learning and Coding!!!

Day #73 - Feature Generation - Categorical and ordinal features

  • Label Encoding - Based on Sort Order, Order of Appearance
  • Frequency Encoding - Based on the percentage of occurrence
Categorical Features
  • Sex, Cabin, Embarked
  • One Hot Encoding
  • pandas.get_dummies
  • sklearn.preprocessing.OneHotEncoder
  • Works well for Linear methods (Minimum is zero, Maximum is 1)
  • Difficult for Tree methods based on One Hot Encoding Approach
  • Store only Non-Zero Elements (Sparse Matrices)
  • Create combination of features and get better results
  • Concatenate strings from both columns
  • One hot encoding it, Find optimal coefficient for every interaction
pclass,sex,pclass_sex
3,male,3male
1,female,1female
3,female,3female
1,female,1female

pclass_sex ==
1male,1female,2male,2female,3male,3female
0,0,0,0,1,0
0,1,0,0,0,0
0,0,0,0,0,1
0,1,0,0,0,0
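
A minimal pandas sketch of the interaction above: concatenate the two columns, then one-hot encode the combined feature.

import pandas as pd

df = pd.DataFrame({"pclass": [3, 1, 3, 1],
                   "sex": ["male", "female", "female", "female"]})

df["pclass_sex"] = df["pclass"].astype(str) + df["sex"]   # e.g. "3male"
dummies = pd.get_dummies(df["pclass_sex"])                # one column per interaction value
df = pd.concat([df, dummies], axis=1)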

Ordinal Features
  • Ordered categorical feature
  • First class expensive, second less, third least expensive
  • Drivers License Type A,B,C,D
  • Level of Education (Sorted in increasingly complex order)
  • Label Encoding, Map to numbers (Tree based)
  • Non-tree models can't use it effectively
Label Encoding
1. Alphabetically sorted [S,C,Q] -> [2,1,3]
 - sklearn.preprocessing.LabelEncoder

2. Order of Appearance
[S,C,Q] -> [1,2,3]
 - pandas.factorize

Frequency Encoding (Depending on Percentage of Occurences)
[S,C,Q] -> [0.5,0.3,0.2]
encoding = titanic.groupby('Embarked').size()
encoding = encoding/len(titanic)
titanic['enc'] = titanic.Embarked.map(encoding)

Frequency encoding will help for Linear based models (If frequency is correlated with target value then linear model will use the dependency). Preserve value distribution.
  • If several categories have equal frequency, apply a rank transform to break ties
  • from scipy.stats import rankdata
Summary
  • Ordinal is a special case of categorical feature
  • Label Encoding maps categories to numbers
  • Frequency encoding maps categories to frequencies
  • Label and frequency encoding are used for Tree based models
  • One-Hot encoding is used for non-tree based models
  • Interactions of categorical features can help linear models and KNN

Happy Coding and Learning!!!

Day #72- Feature Generation - Numeric Features

Feature Generation
  • Predict Apple Sales (Linear Trend)
  • Examples - Add features indicating week number, GBDT will consider min calculated value for each week
  • Created Generated Tree
Numeric Features - Preprocessing
  • Tree based Methods (Decision Tree)
  • Non Tree based Methods (NN, Linear Model, KNN)
Technique #1 - Scaling of values
  • Apply Regularization in equal amounts
  • Do proper scaling
Min Max Scaler (see the combined sketch after the transformations list below)
  • To [0,1]
  • sklearn.preprocessing.MinMaxScaler
  • X = (X-X.min())/(X.max()-X.min())
Standard Scaler
  • To mean = 0, std = 1
  • sklearn.preprocessing.StandardScaler
  • X = (X-X.mean())/X.std()
Preprocessing (scaling) should be done for all features, not just a few, so that their initial impact on the model is roughly similar
Preprocessing Outliers
  • Calculate lower and upper bound values
  • Rank transformation
  • Better option than Min-Max Scale
Ranking, Transformations
  • scipy.stats.rankdata
  • Log transformation  - np.log(1+x)
  • Raising to power < 1 - np.sqrt(x+2/3)
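
A minimal sketch of these scalings and transformations on a toy array with an outlier:

import numpy as np
from scipy.stats import rankdata
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [100.0]])   # note the outlier

x_minmax = MinMaxScaler().fit_transform(x)      # to [0, 1]
x_standard = StandardScaler().fit_transform(x)  # to mean 0, std 1
x_rank = rankdata(x.ravel())                    # rank transform, robust to outliers
x_log = np.log(1 + x)                           # log transformation
x_sqrt = np.sqrt(x + 2.0 / 3)                   # raising to a power < 1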
Feature Generation (Based on Feature Knowledge, Exploratory Data Analysis)
  • Creating new features
  • Engineer using prior knowledge and logic
  • Example: add price per square foot if the price and the size of the plot are provided
Summary
  • Tree based methods don't depend on scaling
  • Non-Tree methods hugely depend on scaling
Most often used preprocessing
  • MinMaxScaler - to [0,1]
  • StandardScaler - to mean==0, std==1
  • Rank - sets spaces between sorted values to be equal
  • np.log(1+x) and np.sqrt(1+x)
 Happy Learning and Coding!!!

Day #71 - Kaggle Best Practices


After a long pause, back to learning mode. This post is on learnings from the Coursera course - Winning Kaggle Competitions (https://www.coursera.org/learn/competitive-data-science)

Session #1
Basics on Kaggle Competition
  • Data - text, pictures (Format could be csv, database, text file, speech etc). Accompanied by description of features
  • Model - Exactly what is built during the competition. Transforms data into answers. Model properties - produce the best possible predictions and be reproducible
  • Submissions - Compare against models and predictions submitted
  • Evaluations - How good is your model, Quality of model defined by Evaluation function (Rate of correct answers)
  • Evaluation Criteria - Accuracy, Logistic Loss, AUC, RMSE, MAE

Guidelines for submissions
  • Analyze data
  • Fit model
  • Submit
  • See Public Score
Data Science Competition Platforms
  • kaggle
  • DrivenData
  • CrowdANALYTIX
  • CodaLab
  • DataScienceChallenge.net
  • DataScience.net
  • KDD
  • VizDooM
Session #2
Using Kaggle
  • Data format and explanations
  • Evaluation Criteria
  • Sample Submission File
  • Timelines page
Kaggle vs Real world Competitions

Real - World Machine learning problems have Several Stages
  • Understand business problem
  • Problem Formulation
  • Collect Data, Mine Examples
  • Clean Data
  • Preprocess it
  • Identify Model for Task
  • Select best models
  • Accuracy
  • Deploy model (make it available to users)
  • Monitor and retrain with new data

Kaggle
  • All data collected and problems fixed
  • Model creation and evaluation
Summary
  • Real world problems are complicated
  • Competition are a great way to learn
  • But Kaggle competitions don't address the questions of formalization, deployment and testing
Key insights
  • Importance of understanding the data
  • Tools to use
  • Try complex solutions, advanced feature engineering, doing huge calculations
Session #3
  • Linear Model (Classifying two sets of points with a straight line in 2 dimensions) - Logistic Regression, SVM (linear models with different loss functions). Good for sparse, high-dimensional data. Linear models split the space into two subspaces
  • Tree based - Use a decision tree as the basic block to build more complicated models (Decision Tree, Random Forest, Gradient Boosted Decision Trees). DT - Divide and conquer approach to recursively split the space into subspaces. Tree-based methods split the space into boxes
  • KNN - Nearest Neighbours; points close to each other are likely to have similar labels; take the K nearest objects and label with the majority vote. Relies heavily on how the distance between points is measured
  • Neural Networks - A special class of ML models; a black box that produces smooth separating curves. Play with the parameters of simple feed-forward networks. Good for images, sound, text, speech
Session #4 - Data Preprocessing
  • Preprocess for feature engineering
  • Basic feature generation for different types of features
  • Numeric, Categorical, DateTime based features
Features
  • (0/1) - Binary Features
  • Numeric features (Age, fare)
  • Categorical (Classes)
Feature Preprocessing
  • Each feature has own ways to be preprocessed
  • Depends on model to use
  • Linear models not for two class features
  • One hot encoder
  • Random forest can easily put each class separately and predict each probability

Data Types
Structured Data
  • Ordinal - Ranks 1st / 2nd / 3rd Ordinal Data
  • Numerical - Specific Numeric Data
  • Continuous - Petrol Prices continuous data
  • Categorical - Days of Week, Months of Year

Happy Learning and Coding!!!

September 25, 2017

September 01, 2017

Exploring Analytics in Microsoft Azure

I am working on the Microsoft Azure platform on a cloud BI solution. Some of the key components I worked with recently are
  • Azure Data Factory
  • Azure Data Lake
  • Azure SQL Data warehouse
  • Power BI on top of Data warehouse for reporting
I had earlier compared different stacks Microsoft / Google / Amazon.

The high level workflow for cloud based BI Solution and key components are

Step #1 - Moving Data from On-Premises to Cloud
Here the data management gateway is installed on the on-premises machines, and pipelines are created in Azure Data Factory to move data from on-premises to Azure Data Lake

Step #2 - Azure Data Factory
ADF provides a platform for data ingestion and for consuming high volumes of data. Setting up pipelines has some similarities to and differences from SSIS. The key points are
  • Everything is JSON based
  • Setting up Connections
  • Defining input and output data formats in datasets
  • Input and Output datasets also define the storage locations
  • Defining Pipeline logic which includes, logic, input, output datasets, scheduling for pipeline
  • This is fairly straightforward, but there is a learning curve with the tool and its configuration properties
Step #3 - Azure Data lake
Azure Data Lake is for storing data (relational / non-relational). If we have to integrate data from MSSQL and MySQL for real-time processing from two sources, we can leverage the data lake to store it and consolidate it later. The data stored in the Data Lake is referenced as external tables in Azure SQL Data Warehouse

Step #4 - Web Application
All references for data movement from the Data Lake and connectivity to the Data Warehouse are managed by access control leveraged through an Azure web app. The security aspect is well managed in the Azure infrastructure

Step #5 - Data Consolidation into SQL Datawarehouse
The external tables referencing the Data Lake can be queried in T-SQL and the data loaded into Azure Data Warehouse tables. These hold the fact and dimension tables that power our data warehouse. This can be done with stored procedures.

Step #6 - Power BI reporting
We have completed data loading and data consolidation. Next is Power BI. Power BI has a strong offering for web / mobile platforms and is convenient and easy to use. The extended analytics / R support / machine learning library support also makes it suitable for both Business Intelligence and Machine Learning solutions.

Security aspects of this architecture are well handled with firewalls and IAM access as needed. It seems very stable even though some of the components are constantly updated. This is a high-level architecture explanation; we will look into to-do exercises in the coming weeks.

Happy Learning!!!

July 19, 2017

Day #70 - Machine Learning - Deep Learning Fundamentals - Machine Learning Notes

A picture is worth 1000 words. A few examples listed in the book are very precise and clear on Machine Learning fundamentals. Below are a few of the images on Machine Learning / Deep Learning concepts

Figure #1


  • How Machine Learning, AI and Deep Learning are inter-related; the subset representation clearly shows the knowledge boundaries
  • Deep Learning frameworks allow developers to iterate quickly, Making algos accessible to practitioners. Deep learning frameworks help to scale machine learning code for millions of users
  • It's important to note that Machine Learning fundamentals are essential for working with Deep Learning

Figure #2


  • In Machine Learning, historical data is used to derive learnings / rules and apply them to future data predictions
  • From the data we need to identify (relevant features / variables), In this process we use different techniques like PCA, Correlation techniques, Derived features to identify relevant feature attributes for model creation
  • From the vast amount of data we collect through enterprise applications / systems we need to identify / extract relevant data to build models and validate them. Setting up the data pipeline, training with required dataset becomes key for better / high accuracy models
Figure #3

  • High level perspective of Deep Learning, How the nodes are defined, weights computed
  • The loss for each iteration is computed from the predictions and sent back to perform weight updates; these iterations are what we call backpropagation
  • The term Deep Learning comes from the networks being 'deep' - multiple hidden layers involved in the computation
Figure #4

  • SVM Wide street approach, line that separates two classes
  • Allow non-linear decision boundaries
  • Each dimension represents feature
  • Goal of SVM - Train a model that assigns unseen objects to a particular category
  • Advantage - High Dimensionality, Memory Efficiency, Versatility
Machine Learning Notes







Happy Learning!!!

May 16, 2017

Day #69 - TSQL Skills for Data Pipeline and Cleanup Work

Pivot is a key operation in data preparation tasks; MSSQL pivot without aggregation needs a bit of a workaround. Two things we will see in this post

Learning #1 - Script for Insert Data generation from MSSQL tables using SSMS (Hidden Gem in MSSQL)

Step 1 -  Database -> Tasks -> Generate Scripts

Step 2 -  Generate the Database objects (Tables as needed)


Step 3 -  Specify Save to Location, Data only option. After you specify options next step script runs and generates insert statement as needed.




Learning #2- Pivot for Data Preparation scenario
For a given scenario of customer/orders, Pivoting the data for next level of tasks



Happy Learning!!!

May 14, 2017

Weekend Seminar - Deep learning in production at Facebook

Good Talk - Deep learning in production at Facebook https://lnkd.in/fX7BZif

Notes from Session
Deep Learning Use Cases
  • Event Prediction - Listing top relevant stories for the user, predicting relevance - Approach - Logistic regression + Deep Neural Networks
  • Machine Translation - Automatically machine translated posts generated for users - Approach - Encoder - Decoder Architecture, Using RNN
  • Natural Language Processing - Understand Context of text - Deep Text - Approach - CNN for words + RNN for sequences
  • Computer Vision - Understand pics - Approach - CNN @ massive scale. Understand different aspects of pictures - Classification, Detection, Segmentation 
Scaling the models
  • Compute faster - Tweaks in FFT, Tiled FFT, Winograd to reduce convolution computations; NNPACK for CPUs, cuDNN for GPUs
  • Memory Usage - GPU + Activations Memory released and reallocated during different layers of processing in Deep Networks
  • Compress models - Exploit redundancy in model designs, prune them
Good Insights!!!

Kaggle Vs Enterprise Machine Learning Adoption - Two sides of coin


Reposting Summary from Quora Answer with my perspective added

What you don't learn in Kaggle Competitions
  • Determining business problem to solve with data
  • Real world data imbalance, Accuracy issues, Maintaining Models
  • Miss the challenges of data engineering (What features to select, causational vs correlation in domain context) 
What you learn by experimenting real world data science applications in Production
  • Identifying / Reusing Existing data for first level models 
  • Identifying pipelines to build for more relevant variables
  • ETL / Data Consolidation / Aggregation, Eliminating outliers / Redundant Data
Today's systems have enough transactional reporting / BI reports in place. The challenge is evolving from the current system: leveraging current data, building a basic model, slowly building pipelines and extending to other machine learning use cases.

Happy Learning!!!

April 29, 2017

Day #68 - CNN / RNN and Language Modelling Notes

At the end of every class, I have a feeling there is a lot to learn. People in the industry know things only at the application level. The depth of topics, mathematics discussed in class is very extensive. I always have a feeling of guilt "need to learn more". Every learning needs the breakpoint to correlate/understand end to end, to see the concept in a more familiar perspective. Always Keep Learning and Keep growing.

CNN Notes
  • In a CNN lower layers learn generic features like edges, shapes and feed it to higher layers
  • Earlier layer - Generic features
  • Later layer - Features specific to the corresponding problem
  • For any related problems we can leverage existing network VGG16, VGG19, Alexnet and modify the higher layers based on our need
  • ReLU only passes values through the activation function where the input is > 0
  • Vanishing gradient problem - Weights will stagnate over a period of time
  • ∂E/∂W - Gradient of the error with respect to the weights
  • ∂E/∂I - Gradient of the error with respect to the image
RNN
  • The main thing is that the weights are the same across the RNN
  • Weights between successive layers are the same
  • Document Classification, Data Generation, Chatbot, Time series - RNN can be used
LSTM - Long short term memory

Topics from Language Modelling class


Happy Learning!!!

April 28, 2017

Day #67 - Exploring Tableau Visualization

Canadian car sales data visualization examples. The interpretation varies based on the representation presented below. The data has all the details; exploring the same data from different visualization perspectives provides different interpretations of the same data.

Visualization #1 - This representation would help us figure out which month has usually high sales numbers

  • Three months of the year (Dec-Jan-Feb) have relatively weak sales figures compared to the rest of the year
  • The March-August trend shows good demand from customers, resulting in increased sales
  • The last few months of the year show decreased demand. This could be a seasonal factor/holidays/travel; this needs to be validated
Visualization #2 - Consolidated snapshot of comparison of yearly performance of sales numbers, Across several years and across all months (This one is a good big picture)


  • January is the lowest period of sales
  • Sales trend is increasing YOY (year over year)
  • May month consistently tops high sales for many years
The data format looks like below in Visualization #1

Visualization #3 - Data in simple table format



  • Six years total sales data is represented
  • Partial data is available for the year 2016
Happy Learning!!!

April 27, 2017

Day #66 - Maths behind backpropagation

Today it's mathematical learning for neural network fundamentals.
Keynotes
  • In a neural network, the network forward-propagates activations to produce the output and back-propagates the error to determine weight changes
  • Partial Derivative - Derivative of one of the variables holding the rest constant
  • Backpropagation uses gradient descent method, one needs to calculate the derivative of squared error function with respect to the weights of the network.
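
A minimal sketch of that idea on a toy one-weight "network": forward-propagate, compute the squared error, take its partial derivative with respect to the weight, and apply a gradient descent update.

x, y_true = 2.0, 1.0   # single input and target
w = 0.1                # single weight
lr = 0.1               # learning rate

for _ in range(100):
    y_pred = w * x                        # forward propagation
    error = 0.5 * (y_pred - y_true) ** 2  # squared error
    grad = (y_pred - y_true) * x          # dE/dw via the chain rule
    w -= lr * grad                        # gradient descent weight update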
Happy Learning!!!

April 26, 2017

Keep Learning - Good Motivation Note

Interesting Slide from presentation - Dev @ 40



Happy Learning!!!

April 23, 2017

Smart Farming

Product #1 - Automated Farming + Design Layout + Soil Monitoring + Solar powered = "Smart Farming"


Product #2 - Counting Fruits + Finding Weeds + Cattle monitoring

Happy Farming!!!

April 20, 2017

Data Science - Find your Winning use case

I observe a lot of technologies discussed in Data Science roles. It covers Big Data, Open Source, and Commercial Tools, R, Python, MapR, Spark, Azure, Various cloud providers etc...

"Identifying relevant domain/product related use case that helps improve business/numbers is the key"

This LinkedIn post provides great clarity on focusing on relevant use cases, small wins, and scaling success.


Happy Analytics!!!

April 17, 2017

Day #65 - Python Package Installation commands - Windows

Had an issue running some code and tried different options; uninstalling the existing version of Keras and reinstalling it worked. Bookmarking the commands


Happy Learning!!!

April 13, 2017

Day #64 - ETL for Data and Delta Data Management

Custom SSIS sample for an ETL setup for data extraction and updates

Scenario
  • Two Databases (Source and Target)
  • Example with Test Table with few columns
  • Ability to get New Data
  • Ability to get Delta Data (Updates)
Step in SSIS Project

Step 1 - Create a Data Flow Task

Step 2 - Add connection managers for Source and Target Databases



Step 3 - The operators and layout are (Source Data -> Lookup in Target Database -> Insert / Update Target Database)



Step 4 - OLEDB Data Source Settings


Step 5 -  Lookup to map for data



Step 6 - Lookup Mapping


Step 7 - Match Non-Matching for Insert / Updates



Step 8 - Match Destination Settings


Step 9 - Non Match Update Query



Step 10 - Non Match Update Params

Reference table script


SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Table_1](
[Col1] [int] NULL,
[Col2] [int] NULL,
[Col3] [int] NULL
) ON [PRIMARY]
GO


Happy Learning!!!!

April 08, 2017

Day #63 - Notes from Text processing and Parallel Programming

Quick Summary notes for future reference

Text Processing - Word Sense Disambiguation
  • Rely on leveraging wordnet (Knowledge sources)
  • from nltk.corpus import wordnet - leverage it
  • Leverage Machine readable dictionary
Lesks Algorithm
  • Sense bag (definitions of the ambiguous word)
  • Context bag (definitions of the surrounding context words)
  • The closest match is picked
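
A minimal NLTK sketch of Lesk-style disambiguation (assumes the WordNet corpus has already been downloaded, e.g. via nltk.download('wordnet')):

from nltk.wsd import lesk

sentence = "I went to the bank to deposit my money".split()
sense = lesk(sentence, "bank")        # picks the WordNet synset whose definition overlaps the context most
print(sense, sense.definition())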
Walkers Algorithm for word sense disambiguation
  • Use Thesaurus to find scores in context
  • Highest score will be picked up for context relevance
  • Thesaurus Library pywordnet, now part of NLTK
Keywords
  • Polysemy - many possible meanings for a word or phrase.
  • Homonym - same spelling or pronunciation but different meanings
Parallel Programming
  • Filter locks
  • Bakery Algorithm
Example Implementation - link

Memory Consistency
  • Strict Consistency 
  • Sequentially consistent
  • Relaxed(Weak) consistent
Linearization Point
From Stackoverflow

Coarse Grained Vs Fine Grained
From Stackoverflow

Petersons Algorithm


More Reads - Link

Happy Learning!!!

April 07, 2017

TSQL Code formatting tool

Free tool for TSQL code formatting. Added to SSMS


Happy Formatting!!!

April 02, 2017

Fundamentals Again - Day #61 - Hypothesis Testing

  • Alternative Hypothesis - There is difference between groups
  • Null Hypothesis - There is no difference between groups
  • Binomial distribution - Two possible outputs
  • Sampling distributions, Mode, Median, Mean, Variability in distribution (Standard Deviation), Chi Square Distribution 
  • Conduct T-Test, Check the P-value to know Significance
Ref - Coursera

Happy Learning!!!

March 31, 2017

Day #60 - TSQL Profiling - Expressprofiler

Way better and Less complicated than SQL profiler
  • Profiler by DB Name
  • Profile by login account name
These two options are good enough to nail down most issues. For blocking / deadlocks we can hop over to SQL Profiler. For basic checks this tool meets the need



Link - Download

Jasper Report passing parameter between datasets

Happy Learning!!!

Day #59 - Image Object Classification using Keras

This post is for basic image classification in Keras using VGG19. We leverage pre-trained models to detect objects in the image
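
A minimal Keras sketch in the spirit of this post, classifying an assumed local image 'sample.jpg' with pretrained VGG19:

import numpy as np
from keras.applications.vgg19 import VGG19, preprocess_input, decode_predictions
from keras.preprocessing import image

model = VGG19(weights="imagenet")

img = image.load_img("sample.jpg", target_size=(224, 224))  # hypothetical local image
x = np.expand_dims(image.img_to_array(img), axis=0)
x = preprocess_input(x)

preds = model.predict(x)
print(decode_predictions(preds, top=3)[0])  # top-3 predicted object classes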


Happy Learning!!!

March 20, 2017

Day #58 - Hacker Earth Challenge

With running projects, it's a bit challenging to manage multiple tasks. Bookmarking my thoughts until further analysis

Problem - Link

Data Analysis (Approach)
  • Load data in SQL Tables
  • Analyze each column, continuous or discrete variables
  • Outliers, missing data, summary of each Data Column
  • Manage Class Imbalances
  • Convert the dataset into numeric columns
  • Ignore any non-critical columns
  • Identify Data Correlations if it exists (Pending task)
The Approach
1. To eliminate class imbalance, used the SMOTE technique
2. Used XGBoost to train and predict
3. Python 2.7 used. Two files: one for data cleanup, the second for prediction
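
A minimal sketch of that approach on toy imbalanced data, assuming the imbalanced-learn package is installed (recent versions expose fit_resample):

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
print(Counter(y))                        # imbalanced classes

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))                    # balanced after SMOTE oversampling

model = XGBClassifier(n_estimators=100)
model.fit(X_res, y_res)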

Happy Learning!!!

Day #57 - Xgboost on Windows 7, Python 2.7

On Python 2.7, installed xgboost on Windows 7 with the steps below. This link was useful

1. Step 1 - Search for packages
anaconda search -t conda xgboost

2. Install the Windows-compatible package
conda install -c mndrake xgboost

On Python3 Win64 Windows

conda install -c jjhelmus r-xgboost-cpu

conda install -c mikesilva xgboost


conda install -c rdonnelly py-xgboost

Happy Learning!!!

March 15, 2017

error: Unable to find vcvarsall.bat

While building 'im2col_cython', I encountered this error on Windows 7. There were several solutions provided, but only the option below worked for me.

building 'im2col_cython' extension
error: Unable to find vcvarsall.bat

This post was useful to fix it

Step 1 - Installed Microsoft Visual C++ Compiler for Python 2.7 from https://www.microsoft.com/en-in/download/details.aspx?id=44266

Step 2 - In Directory C:\Anaconda2\Lib\distutils, Modified msvc9compiler.py file as below

Compiled as below
python.exe setup.py build_ext --inplace --compiler=msvc

Happy Learning!!!