- For Features - One Hot Encoding, Label Encoding, Frequency Encoding, Ranking, MinMaxScaler, StandardScaler
- For Dates - Periodicity (Year, Date, Week); Time Slice - time passed since a particular moment (before / after); Difference in Dates (Datetime_feature1 - Datetime_feature2); Boolean indicator of whether the date is a holiday
- For Text - Preprocessing - Lowercase, Stemming, Lemmatization, stop-word removal; N-grams can help use local context; Postprocessing - TF-IDF; Use BOW for N-grams
October 29, 2017
Day #77 - Quick Summary - Kaggle Lessons - Features, Dates, Text
Labels:
Data Science,
Data Science Tips
Day #76 - Text Processing - Kaggle Lessons
Bag of Words
- Create new column for each unique word in data
- Count occurrences in each document
- sklearn.feature_extraction.text.CountVectorizer
- Make documents more comparable by using Term Frequency
- tf = 1 / x.sum(axis=1)[:,None]
- x = x * tf
- Inverse Document Frequency (down-weights words that appear in most documents)
- idf = np.log(x.shape[0] / (x > 0).sum(0))
- x = x * idf
- N Grams
- Bag of Words (Each row represents text, Each column represents unique word)
- Classifying document
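The term frequency and inverse document frequency steps can be sketched with plain NumPy. The count matrix below is made up for illustration (in practice it would come from CountVectorizer), and the variable names x_tf and x_tfidf are my own. Note that sklearn.feature_extraction.text.TfidfVectorizer uses a smoothed IDF by default, so its numbers will differ from this unsmoothed version.

```python
import numpy as np

# Toy count matrix: 3 documents (rows) x 4 vocabulary words (columns).
# The counts are invented for this example.
x = np.array([
    [2, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 2],
], dtype=float)

# Term frequency: scale every row so that it sums to 1.
tf = 1 / x.sum(axis=1)[:, None]
x_tf = x * tf

# Inverse document frequency:
# log(number of documents / number of documents containing the word).
idf = np.log(x.shape[0] / (x > 0).sum(axis=0))
x_tfidf = x_tf * idf

print(x_tfidf.round(3))
```

The second word occurs in every document, so its IDF is log(1) = 0 and its column is zeroed out, which is exactly the down-weighting of uninformative words that IDF is meant to give.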
For N = 1, "This is a sentence"
Unigrams are - This, is, a, sentence
For N = 2, "This is a sentence"
Bigrams are - This is, is a, a sentence
For N = 3, "This is a sentence"
Trigrams are - This is a, is a sentence
sklearn.feature_extraction.text.CountVectorizer: ngram_range, analyzer
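A small pure-Python sketch of the n-gram idea above. The helper word_ngrams is a hypothetical name of mine, not a library function. With scikit-learn the same thing is done through CountVectorizer(ngram_range=(1, 2)), but keep in mind that its default token pattern drops one-character tokens such as "a".

```python
def word_ngrams(text, n):
    """Return the word n-grams of `text` in order (hypothetical helper)."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "This is a sentence"
unigrams = word_ngrams(sentence, 1)
bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)

print(unigrams)  # ['This', 'is', 'a', 'sentence']
print(bigrams)   # ['This is', 'is a', 'a sentence']
print(trigrams)  # ['This is a', 'is a sentence']
```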
Text Preprocessing steps
- Lower case
- Lemmatization (using knowledge of vocabulary and morphological analysis of words)
- democracy, democratic and democratization -> democracy (Lemmatization)
- Stemming (Chops off the endings of words)
- democracy, democratic and democratization -> democr (Stemming)
- Stop words (Do not contain important information, so they are removed)
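The preprocessing steps can be illustrated with a toy pipeline. This is only a sketch: the stop-word set and the suffix list are tiny assumptions picked to reproduce the "democr" example, and toy_stem is a crude stand-in, not a real stemming algorithm. For real work a library stemmer or lemmatizer (for example from NLTK) would be used instead.

```python
# Tiny illustrative resources; both are assumptions made for this example.
STOP_WORDS = {"the", "is", "a", "and", "of", "on"}
SUFFIXES = ("atization", "atic", "acy")


def toy_stem(word):
    """Chop the first matching suffix off the word (crude stand-in for a stemmer)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop stop words, then stem each remaining token."""
    tokens = text.lower().split()
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


tokens = preprocess("Democracy and the democratization of a democratic state")
print(tokens)  # ['democr', 'democr', 'democr', 'state']
```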
I have done all of this in my assignment work; it is available in my GitHub code.
For Applying Bag of words
- Preprocessing - Lowercase, Stemming, Lemmatization, stopwords removal
- Ngrams can help use local context
- Postprocessing - TF-IDF
- Use BOW for Ngrams
- Sentence - The dog is on the table
- Vocabulary - are, cat, dog, is, now, on, the, table
- BOW representation (binary presence) - 0, 0, 1, 1, 0, 1, 1, 1
BOW Issue
The food was good, not bad at all
The food was bad, not good at all
Both sentences have the same BOW representation, yet their meanings are opposite :)
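A quick check of this issue in plain Python. The helpers bag_of_words and bag_of_bigrams are names I made up for this sketch. The unigram bags of the two sentences are identical, while bags of bigrams keep some local word order and do differ.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word occurrences, ignoring punctuation and word order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def bag_of_bigrams(text):
    """Count adjacent word pairs, which keeps a little local context."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(zip(words, words[1:]))


s1 = "The food was good, not bad at all"
s2 = "The food was bad, not good at all"

same_unigrams = bag_of_words(s1) == bag_of_words(s2)
same_bigrams = bag_of_bigrams(s1) == bag_of_bigrams(s2)

print(same_unigrams)  # True
print(same_bigrams)   # False
```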
Word Embeddings (Word2Vec)
- Get vector representation of words and texts
- Each word is converted to a vector
- Training uses nearby words (the context)
- Different words used in the same context end up with similar vector representations
- Basic operations can be applied to the vectors (e.g. king - man + woman is close to queen)
- Words - Word2Vec, GloVe, FastText
- Sentences - Doc2Vec
- There are pretrained models
BOW vs Embeddings
- BOW - Very large vectors
- BOW - Meaning of each value in the vector is known
- Embeddings - Relatively small vectors
- Embeddings - Values of the vector can be interpreted only in some cases
- Embeddings - Words with similar meaning often have similar embeddings
October 27, 2017
Day #75 - Missing Values
- Reasons for Missing Values
- How to Engineer them effectively
- Hidden Missing Values
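- Plot the distribution of values; an unexpected spike in the histogram (e.g. at -1 or -999) reveals values that were already filled in
- -999, -1 (Fill with some out-of-range value) - lets tree models treat missing values as a separate category, but the performance of linear models and neural networks suffers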
- mean, median
- Reconstruct value
- add isnull column
- Missing values in timeseries
- Temperature values missing for some days of month
- Based on increase / decrease pattern
- Ignore missing value while calculating mean
- Change Categories to frequencies
- XGBoost can handle NaN
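The fill strategies above can be sketched with pandas. The temperature values and the column names are invented for this example; which strategy works best depends on the model and the data.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperatures with two missing days.
df = pd.DataFrame({"temp": [20.0, np.nan, 24.0, 25.0, np.nan, 21.0]})

# Keep a flag so the model can still tell which rows were imputed.
df["temp_isnull"] = df["temp"].isnull().astype(int)

# Option 1: out-of-range constant (usually fine for trees, hurts linear models).
df["temp_const"] = df["temp"].fillna(-999)

# Option 2: median of the observed values (NaN is ignored by median()).
df["temp_median"] = df["temp"].fillna(df["temp"].median())

# Option 3: reconstruct from neighbouring days, useful for time series.
df["temp_interp"] = df["temp"].interpolate()

print(df)
```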
Day #74 - Feature Generation - DateTime and Coordinates
DateTime
- Differ significantly from numeric and categorical features
- Periodicity - Year, Date, Week
- Time Slice - Time passed since a particular moment (before / after), Time moments in period
- Difference in Dates (Datetime_feature1 - Datetime_feature2)
- Special Time period (Medication every 3 days)
- Sales Predictions (Days since last holiday, Days since weekend, Since last sales campaign)
- Boolean binary indicating date is holiday or not
- Sales Context Churn Prediction
- (Date Since user registration) - DateDiff
- (Date Since last purchase) - DateDiff
- (Date Since calling customer service) - DateDiff
- Periodicity - Day number in week, month, season, year, second, minute, hour
- Time Slice, Difference between dates
- This can be used for churn prediction (Likelihood customer will return)
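A minimal pandas sketch of the date features listed above. The table, the column names, the reference date and the one-day holiday calendar are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical customer activity table.
df = pd.DataFrame({
    "registered": pd.to_datetime(["2017-01-01", "2017-03-15"]),
    "last_purchase": pd.to_datetime(["2017-10-27", "2017-10-29"]),
})
holidays = pd.to_datetime(["2017-12-25"])  # assumed holiday calendar

# Periodicity: position of the date inside a repeating period.
df["purchase_dow"] = df["last_purchase"].dt.dayofweek  # Monday = 0
df["purchase_month"] = df["last_purchase"].dt.month

# Difference between two date columns.
df["days_registered_to_purchase"] = (df["last_purchase"] - df["registered"]).dt.days

# Time passed since a particular moment (here: up to an assumed reference date).
reference = pd.Timestamp("2017-10-31")
df["days_since_purchase"] = (reference - df["last_purchase"]).dt.days

# Boolean indicator: did the purchase fall on a holiday?
df["purchase_on_holiday"] = df["last_purchase"].isin(holidays).astype(int)

print(df)
```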
Coordinates
- In Real Estate Scenario for predictions on Prices
- (Distance from School)
- (Distance from Airport)
- (Flats around particular point)
- Alternatively, distance from the most expensive flat
- Find centres of clusters and compute distances from each centre point
- Aggregated Statistics for surrounding data
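The distance and neighbourhood features can be sketched with NumPy. The coordinates are made up and treated as flat x/y positions in km, so plain Euclidean distance applies; real latitude/longitude data would need a geographic distance instead. distance_to is a helper name of my own.

```python
import numpy as np

# Hypothetical flat positions (x, y) in km and two points of interest.
flats = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
school = np.array([0.0, 0.0])
airport = np.array([6.0, 8.0])


def distance_to(points, target):
    """Euclidean distance from every row of `points` to `target`."""
    return np.linalg.norm(points - target, axis=1)


dist_school = distance_to(flats, school)
dist_airport = distance_to(flats, airport)

# Aggregated statistic: number of other flats within a 6 km radius.
pairwise = np.linalg.norm(flats[:, None, :] - flats[None, :, :], axis=2)
neighbours_within_6km = (pairwise <= 6.0).sum(axis=1) - 1  # exclude the flat itself

print(dist_school, dist_airport, neighbours_within_6km)
```

The same distance_to helper works for cluster centres or for the most expensive flat: only the target point changes.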
Day #73 - Feature Generation - Categorical and ordinal features
- Label Encoding - Based on Sort Order, Order of Appearance
- Frequency Encoding - Based on Percentage of occurrence
- Sex, Cabin, Embarked
- One Hot Encoding
- pandas.get_dummies
- sklearn.preprocessing.OneHotEncoder
- Works well for Linear methods (already scaled: minimum is 0, maximum is 1)
- Difficult for Tree methods when one-hot encoding produces many binary columns (they crowd out the few important numeric features)
- Store only Non-Zero Elements (Sparse Matrices)
- Create combination of features and get better results
- Concatenate strings from both columns
- One hot encoding it, Find optimal coefficient for every interaction
pclass,sex,pclass_sex
3,male,3male
1,female,1female
3,female,3female
1,female,1female
pclass_sex ==
1male,1female,2male,2female,3male,3female
0,0,0,0,1,0
0,1,0,0,0,0
0,0,0,0,0,1
0,1,0,0,0,0
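The feature interaction above can be reproduced with pandas, using the same four rows. One difference to keep in mind: pandas.get_dummies only creates columns for combinations that actually occur in the data, so this toy frame yields three columns rather than all six.

```python
import pandas as pd

df = pd.DataFrame({
    "pclass": [3, 1, 3, 1],
    "sex": ["male", "female", "female", "female"],
})

# Concatenate the two categorical columns into one interaction feature.
df["pclass_sex"] = df["pclass"].astype(str) + df["sex"]

# One-hot encode the interaction; a linear model can then learn
# one coefficient per (pclass, sex) combination.
onehot = pd.get_dummies(df["pclass_sex"])

print(df["pclass_sex"].tolist())
print(onehot.astype(int))
```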
Ordinal Features
- Ordered categorical feature
- First class expensive, second less, third least expensive
- Driver's License Type A,B,C,D
- Level of Education (Sorted in increasingly complex order)
- Label Encoding, Map to numbers (Tree based)
- Non-tree models can't use it effectively
1. Alphabetically sorted [S,C,Q] -> [3,1,2]
- sklearn.preprocessing.LabelEncoder (zero-based, so it returns [2,0,1])
2. Order of Appearance
[S,C,Q] -> [1,2,3]
- pandas.factorize (zero-based, so it returns [0,1,2])
Frequency Encoding (Depending on Percentage of Occurrences)
[S,C,Q] -> [0.5,0.3,0.2]
encoding = titanic.groupby('Embarked').size()
encoding = encoding/len(titanic)
titanic['enc'] = titanic.Embarked.map(encoding)
Frequency encoding can also help linear models: if the frequency is correlated with the target value, the linear model will use this dependency. It preserves information about the value distribution.
- If several categories have equal frequency they become indistinguishable; apply rank to deal with ties
- from scipy.stats import rankdata
- Ordinal is a special case of categorical feature
- Label Encoding maps categories to numbers
- Frequency encoding maps categories to frequencies
- Label and frequency encoding are used for Tree based models
- One-Hot encoding is used for non-tree based models
- Interactions of categorical features can help linear models and KNN
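A small runnable sketch of label and frequency encoding. The ten-row frame is a stand-in I made up so that the frequencies come out as 0.5, 0.3 and 0.2; it is not the real Titanic data. pandas category codes are used for the sorted variant, which gives the same zero-based order as LabelEncoder.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic 'Embarked' column.
titanic = pd.DataFrame(
    {"Embarked": ["S", "C", "S", "Q", "S", "C", "S", "S", "C", "Q"]}
)

# Label encoding by order of appearance (zero-based codes).
codes_appearance, uniques = pd.factorize(titanic["Embarked"])

# Label encoding by alphabetical order (C=0, Q=1, S=2).
codes_sorted = titanic["Embarked"].astype("category").cat.codes

# Frequency encoding: replace each category with its share of rows.
encoding = titanic.groupby("Embarked").size() / len(titanic)
titanic["enc"] = titanic["Embarked"].map(encoding)

print(list(uniques), list(codes_appearance))
print(codes_sorted.tolist())
print(encoding)
```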
Day #72- Feature Generation - Numeric Features
Feature Generation
- Predict Apple Sales (Linear Trend)
- Example - Add a feature indicating the week number; GBDT will then calculate something like the mean target value for each week
- Created Generated Tree
- Tree based Methods (Decision Tree)
- Non Tree based Methods (NN, Linear Model, KNN)
- Regularization should affect all features in equal amounts, so feature scales must be comparable
- Do proper scaling
- To [0,1]
- sklearn.preprocessing.MinMaxScaler
- X = (X-X.min())/(X.max()-X.min())
- To mean = 0, std = 1
- sklearn.preprocessing.StandardScaler
- X = (X-X.mean())/X.std()
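Both scaling formulas written out with NumPy on a made-up matrix. I pass axis=0 so that each column is scaled on its own; on a NumPy array, X.min() without an axis would take the minimum over the whole matrix.

```python
import numpy as np

# Two hypothetical features on very different scales.
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

# Min-max scaling: every column ends up in [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: every column gets mean 0 and standard deviation 1.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_std.round(3))
```

NumPy's std() and sklearn's StandardScaler both use the population standard deviation, while pandas' std() defaults to the sample version, so results on a DataFrame can differ slightly.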
Preprocessing Outliers
- Calculate lower and upper bound values (e.g. percentiles) and clip the feature to them
- Rank transformation
- Better option than Min-Max scaling when outliers are present
- scipy.stats.rankdata
- Log transformation - np.log(1+x)
- Raising to power < 1 - np.sqrt(x+2/3)
- Creating new features
- Engineer using prior knowledge and logic
- Example: adding price per square foot if the price and the size of the plot are provided
- Tree based methods don't depend on scaling
- Non-Tree methods hugely depend on scaling
- MinMaxScaler - to [0,1]
- StandardScaler - to mean==0, std==1
- Rank - sets spaces between sorted values to be equal
- np.log(1+x) and np.sqrt(1+x)
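The outlier treatments above on a toy feature with one extreme value. The data and the choice of the 1st and 99th percentiles as bounds are my assumptions for this sketch.

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical feature; the last value is an outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])

# Clipping: compute lower and upper bounds, then clip values to them.
lower, upper = np.percentile(x, [1, 99])
x_clipped = np.clip(x, lower, upper)

# Rank transformation: spacing between sorted values becomes equal.
x_rank = rankdata(x)

# Log and power transforms pull large values towards the rest.
x_log = np.log1p(x)           # same as np.log(1 + x)
x_sqrt = np.sqrt(x + 2 / 3)

print(x_clipped, x_rank, x_log.round(2), x_sqrt.round(2))
```

After ranking, the outlier sits only one step away from its neighbour, which is why rank often behaves better than min-max scaling for linear models, KNN and neural networks.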
Day #71 - Kaggle Best Practices
After a long pause, back to learning mode. This post covers learnings from the Coursera course - Winning Kaggle Competitions (https://www.coursera.org/learn/competitive-data-science)
Session #1
Basics on Kaggle Competition
- Data - text, pictures (format could be csv, database, text file, speech etc.), accompanied by a description of the features
- Model - Exactly what is built during the competition. Transforms data into answers. Model properties - produce the best possible predictions and be reproducible
- Submissions - Submitted predictions are scored and compared against other participants' submissions
- Evaluations - How good is your model? Quality of the model is defined by the evaluation function (e.g. rate of correct answers)
- Evaluation Criteria - Accuracy, Logistic Loss, AUC, RMSE, MAE
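The evaluation criteria listed above are all available in sklearn.metrics. The labels and predictions below are invented just to show the calls; RMSE is taken as the square root of the mean squared error.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    log_loss,
    mean_absolute_error,
    mean_squared_error,
    roc_auc_score,
)

# Hypothetical binary classification results.
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])   # predicted probability of class 1
y_pred = (y_prob >= 0.5).astype(int)       # hard labels at a 0.5 threshold

acc = accuracy_score(y_true, y_pred)
ll = log_loss(y_true, y_prob)
auc = roc_auc_score(y_true, y_prob)

# Hypothetical regression results.
y_reg_true = np.array([3.0, 5.0, 2.0])
y_reg_pred = np.array([2.0, 5.0, 4.0])

rmse = np.sqrt(mean_squared_error(y_reg_true, y_reg_pred))
mae = mean_absolute_error(y_reg_true, y_reg_pred)

print(acc, ll, auc, rmse, mae)
```

Accuracy needs hard labels, while log loss and AUC work on the predicted probabilities, so the same model can rank differently depending on the metric the competition uses.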
Guidelines for submissions
- Analyze data
- Fit model
- Submit
- See Public Score
Competition Platforms
- Kaggle
- DrivenData
- CrowdANALYTIX
- CodaLab
- DataScienceChallenge.net
- DataScience.net
- KDD
- ViZDoom
Using Kaggle
- Data format and explanations
- Evaluation Criteria
- Sample Submission File
- Timelines page
Real-World Machine Learning Problems Have Several Stages
- Understand business problem
- Problem Formulation
- Collect Data, Mine Examples
- Clean Data
- Preprocess it
- Identify Model for Task
- Select best models
- Accuracy
- Deploy model (make it available to users)
- Monitor and retrain with new data
Kaggle
- All data collected and problems fixed
- Model creation and evaluation
- Real world problems are complicated
- Competitions are a great way to learn
- But Kaggle competitions don't address the questions of formalization, deployment and testing
- Importance of understanding data
- Tools to users
- Try (Complex solutions, Advanced feature engineering, doing huge calculations)
- Linear Model (classifying two sets of points with a straight line in 2 dimensions) - Logistic Regression, SVM (linear models with different loss functions). Good for sparse, high-dimensional data. Linear models split the space into two subspaces
- Tree based - Use a decision tree as the basic block to build more complicated models (Decision Tree, Random Forest, Gradient Boosted Decision Trees). DT - Divide and conquer approach that recursively splits the space into sub-spaces. Tree based methods split the space into boxes
- KNN - Nearest Neighbours. Points close to each other are likely to have similar labels; find the K nearest objects and label by majority vote. Relies heavily on how closeness between points is measured
- Neural Networks - Special class of ML models, a black box that produces smooth separating curves. Play with the parameters of simple feed-forward networks. Good for images, sounds, text, speech
- Preprocess for feature engineering
- Basic feature generation for different types of features
- Numeric, Categorical, DateTime based features
- (0/1) - Binary Features
- Numeric features (Age, fare)
- Categorical (Classes)
- Each feature has own ways to be preprocessed
- Depends on model to use
- Linear models cannot use a label-encoded class feature well when the target depends on it non-linearly
- One hot encoder (fixes this for linear models)
- Random forest can easily put each class separately and predict each probability
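A synthetic check of why preprocessing depends on the model. The data is invented: a class feature with values 1/2/3 where only class 2 has a positive target, so the dependency is not monotone. Scores are computed on the training data purely to show what each model is able to represent.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data: the target is 1 only for the middle class.
pclass = np.tile([1, 2, 3], 100)
y = (pclass == 2).astype(int)

# Same information, two encodings.
X_label = pclass.reshape(-1, 1)                                   # raw label 1/2/3
X_onehot = (pclass[:, None] == np.array([1, 2, 3])).astype(int)   # one column per class

acc_linear_label = LogisticRegression().fit(X_label, y).score(X_label, y)
acc_linear_onehot = LogisticRegression().fit(X_onehot, y).score(X_onehot, y)
acc_forest_label = (
    RandomForestClassifier(n_estimators=20, random_state=0)
    .fit(X_label, y)
    .score(X_label, y)
)

print(acc_linear_label, acc_linear_onehot, acc_forest_label)
```

A single threshold on the raw label can never isolate the middle class, so the linear model stays at or below 2/3 accuracy; with one-hot columns it gets a separate coefficient per class, and the forest simply splits on both sides of class 2.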
Data Types
Structured Data
- Ordinal - Ranks 1st / 2nd / 3rd Ordinal Data
- Numerical - Specific Numeric Data
- Continuous - Petrol Prices continuous data
- Categorical - Days of Week, Months of Year
September 25, 2017
September 01, 2017
Exploring Analytics in Microsoft Azure
I am working on a BI cloud solution on the Microsoft Azure platform. Some of the key components I have worked with recently are
- Azure Data Factory
- Azure Data Lake
- Azure SQL Data warehouse
- Power BI on top of Data warehouse for reporting
The high-level workflow for the cloud-based BI solution and its key components are
Step #1 - Moving Data from On-Premises to Cloud
Here the data management gateway is installed on the on-premises machines. Pipelines are created in Azure Data Factory to move data from on-premises to Azure Data Lake.
Step #2 - Azure Data Factory
ADF provides a platform for data ingestion that can consume high volumes of data. Setting up pipelines has some similarities to, and some differences from, SSIS. The key differences are
- Everything is JSON based
- Setting up Connections
- Defining input and output data formats in datasets
- Input and Output datasets also define the storage locations
- Defining pipeline logic, which includes the activities, input and output datasets, and scheduling for the pipeline
- This is fairly straightforward, but there is some learning involved with the tool and its configuration properties
Step #3 - Azure Data lake
Azure Data Lake is for storing data (RDBMS / non-RDBMS). If we have to integrate data from two sources such as MSSQL and MySQL for real-time processing, we can leverage the data lake to store it and consolidate it later. The data stored in the Data Lake is referenced as external tables in Azure SQL Data Warehouse.
Step #4 - Web Application
All data movement from the Data Lake and connectivity to the Data Warehouse is managed by access control through an Azure web app. The security aspect is well managed in the Azure infrastructure.
Step #5 - Data Consolidation into SQL Datawarehouse
The external tables that reference the Data Lake can be queried in T-SQL and the data loaded into Azure SQL Data Warehouse tables. This is the location of the fact and dimension tables that power our data warehouse. The loading can be done by stored procedures.
Step #6 - Power BI reporting
We have completed data loading and data consolidation. Next is Power BI. Power BI has a powerful offering for web / mobile platforms and is convenient and easy to use. The extended Analytics / R support / Machine Learning library support also makes it suitable for running both Business Intelligence and Machine Learning solutions.
The security aspects of this architecture are well handled with Firewall and IAM access as needed. The platform seems very stable even though some of the components are constantly updated. This is a high-level explanation of the architecture; we will look into to-do exercises in the coming weeks.
Happy Learning!!!
Labels:
Big Data,
Data Science
July 19, 2017
Day #70 - Machine Learning - Deep Learning Fundamentals - Machine Learning Notes
A picture is worth 1000 words. A few examples listed in the book are very precise and clear on Machine Learning fundamentals. Below are a few of the images on Machine Learning / Deep Learning concepts
Figure #1
- How Machine Learning, AI and Deep Learning are inter-related; the subset representation clearly shows the knowledge boundaries
- Deep Learning frameworks allow developers to iterate quickly, making algorithms accessible to practitioners. They also help to scale machine learning code to millions of users
- It is important to note that the fundamentals of Machine Learning are needed to work with Deep Learning
Figure #2
- In Machine Learning, historical data is used to derive learnings / rules, which are then applied to predict on future data
- From the data we need to identify (relevant features / variables), In this process we use different techniques like PCA, Correlation techniques, Derived features to identify relevant feature attributes for model creation
- From the vast amount of data we collect through enterprise applications / systems we need to identify / extract relevant data to build models and validate them. Setting up the data pipeline, training with required dataset becomes key for better / high accuracy models
Figure #3
- High-level perspective of Deep Learning: how the nodes are defined and the weights computed
- In each iteration the loss is computed by comparing predictions with the expected outputs and sent back to perform weight updates; this is called back propagation
- The term Deep Learning is used because the networks are 'deep' - multiple hidden layers are involved in the computation
Figure #4
- SVM wide street approach: the line that separates two classes with the widest margin
- Allows non-linear decision boundaries (through kernels)
- Each dimension represents a feature
- Goal of SVM - Train a model that assigns unseen objects to a particular category
- Advantages - Works in high dimensions, memory efficiency, versatility
Machine Learning Notes