"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

January 25, 2020

Day #321 - Image Similarity Search

I have been documenting my code poorly on this blog, and I am trying to improve my documentation abilities. This post is based on a reference post.

There are a few key components involved in similarity search:
1. Data Collection
2. Generating Feature Vectors
3. Finding Similar matches based on different approaches

Step #1 - Data collection uses the Caltech 101 dataset, which contains images from many different object classes.

Step #2 - Feature vector generation leverages pre-trained networks: take an existing pre-trained network, remove the final classification layer, and retain the flattened feature vector from the penultimate layer.

Step #3 - scikit-learn's NearestNeighbors is used to find the closest matches. The algorithm options are 'ball_tree', 'kd_tree', 'brute' (brute-force search), and 'auto' (picks the best option for the data).
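Below is a minimal sketch of the overall flow (my own illustration, not the exact code from the referenced post; the dataset paths and parameter values are assumptions):

import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image
from sklearn.neighbors import NearestNeighbors

#Pre-trained network with the classification layer removed; pooling='avg' gives a flat 2048-d feature vector
model = ResNet50(weights='imagenet', include_top=False, pooling='avg')

def extract_features(img_path):
    img = image.load_img(img_path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return model.predict(x)[0]

#image_paths is a placeholder list of Caltech 101 image files
image_paths = ['101_ObjectCategories/airplanes/image_0001.jpg', '101_ObjectCategories/airplanes/image_0002.jpg']
features = np.array([extract_features(p) for p in image_paths])

#Index the feature vectors and query the closest matches for the first image
nn = NearestNeighbors(n_neighbors=min(5, len(image_paths)), algorithm='brute', metric='euclidean')
nn.fit(features)
distances, indices = nn.kneighbors([features[0]])
print(indices)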

Results 




Happy Learning!!!

January 22, 2020

Seat and Location based Pricing in Bus :)






  • Single Sleeper price 750 (Better Privacy / No Sharing)
  • Second-row single berth 700 (Can Exit Early)
  • Middle row single berth 650 (Reduced Privileges compared to above two)

  • Similar to Flight Seat Pricing :)

    Keep Thinking!!!

    January 20, 2020

    Catching up with Machine Learning for Experienced Folks

There are a lot of tutorials on the web; leverage whatever works for you.
The sequence I recommend is:
• Stats and Maths for Data Science (Stanford / other YouTube classes)
• Applications in each domain (find applicable use cases for your domain)
• Python (YouTube / Python programming sites) – data loading, analysis, joins, filters, etc.
• ML models (Linear and Logistic Regression, Decision Trees, Random Forests) - code them up in Python with datasets
• Classification vs Regression - code up in Python
• Experiment labelling, handling categorical data, etc.
• Take up one Udemy course after the above list is done (discounted ~$20 courses are often available, e.g., in December)
    Keep Coding!!!

    Quick ride prospects in 2020

Quick Ride will take a share of the rental cab market in 2020. Below are the compelling reasons for Quick Ride's success:
1. Quick Ride is affordable (for me it is 100/- on Quick Ride vs 400/- for a rental cab booking, i.e., 1/4th of the cost)
2. Quick Ride timing is predictable (delays and timings are predictable; predictability is much higher than on rental cab booking platforms)
3. Quick Ride has recurring booking options (and from rides we make friends)
4. Quick Ride connects you with like-minded tech folks; I found friends to mentor and learnt from their experiences
5. No peak charges (your BP stays in control), unlike rental cab booking platforms
6. No cancellation charges
7. More of a service and less of a business (this is an innovative approach and shows understanding of the market)
8. Users are verified, and most are fellow IT workers in Bangalore
9. Risk is minimal (no drunken driving or overcrowded vehicles on Quick Ride); vehicles are well maintained as they are owner-managed
10. Uber Pool and Ola Share failed miserably; Quick Ride instead created a market
11. Cons - some people treat Quick Ride as a business, with targets to make money
12. My prediction: 25% of Ola / Uber's market share in Bangalore will be taken up by Quick Ride in 2020
13. Quick Ride needs to expand to other regions/countries too
All the best Quick Ride team, Happy QuickRide!!!


    January 18, 2020

    January 17, 2020

    Jobs that would disappear before 2030 with AI

    • Security Jobs - We may not see security guards in hotels and apartments; the proportion will come down drastically. AI-driven security and surveillance give real-time alerts - record-and-playback systems will be gone soon. Edge analytics can now act smarter and more proactively.
    • Automated Tolls - Manual toll operators would no longer be needed for vehicle details entry. Everything would be automated: detection, deduction, and traffic regulation
    • Maids - House cleaning robots, pet management robots, kids monitoring robots, elderly assistance robots
    • Basic Health Data Analysis - X-ray diagnosis and report diagnosis would be done by systems based on the parameters observed; they might directly convey their findings
    • Customer Service - Customer service BPO jobs would be replaced by chatbots; massive job cuts would happen in the data entry and customer service space
    • Automated Loan Approval - Manual validations would be replaced with ML models that provide recommendations for loan approval
    • Drivers - As autonomous trucking technology matures, the number of highway truck driver jobs would come down proportionately
    • Vehicle Repair Mechanics - With more software-powered, renewable-energy-powered vehicles, we may move away from traditional fuel-powered vehicles
    • Cooks / Housekeeping Jobs - A lot of tools will come in to assist with automated cooking and automated vessel cleaning
    • Store Associates - AI will power restocking and self-checkout facilities
    A lot of these jobs will disappear and a lot of people will end up jobless. How we are going to upskill, reskill, and balance the economy matters.

    Keep Thinking!!!

    January 16, 2020

    My Favorite Retail Ideas in NRF 2020

    1. Smart Mirror

    2. Scan and go


    3. Vision Based Inventory Tracking


    4. Walmart AI Store Vision Powered


    5. Mobile based checkout

    6. Automated Checkout


    7. Shelf Edge Camera



    Retail in 2025

    • Every Retail Store is an Ecommerce Store
    • Every Offline Retail Store is a Fulfillment Centre
    • Every Offline Retail Store is a Warehouse
    • Every Offline Retail Store is a Returns Centre
    • Every Store will have Sensors, AR Experience, Personalized Experience
    • Specialty Stores, Private labels will be key
    • A lot of mobile-driven interfaces to search/try/buy

    Happy Learning!!!

    January 15, 2020

    Day #320 - Preprocessing Examples


    #https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
    import pandas as pd
    import numpy as np

    #Note: pd.DataFrame - capital D and F
    data = {'ColA':[10,20,30],'ColB':[5,'',10],'ColC':['A','B','C'],'ColD':[1,2,3],'ColE':[10,40,50],'BinCol':['A','B','A']}
    dataset = pd.DataFrame(data)
    print(dataset.head())

    def ReplaceMissingValues():
        #Column names
        colnames = list(dataset.columns)
        print(colnames)
        #Null stats for all columns
        print(dataset.isnull().sum())
        #Count empty-string values; axis=0 aggregates column-wise (axis=1 would be row-wise)
        print((dataset=='').sum(axis=0))
        #Replace empty strings with NaN, then NaN with a sentinel value
        datasetstd = dataset.replace('',np.NaN)
        print(datasetstd)
        datasetstd = datasetstd.replace(np.NaN,-999)
        return datasetstd

    from sklearn import preprocessing

    def HandleCategoryValues():
        datasetstd = ReplaceMissingValues()
        #One-hot encode ColC with pandas, then drop the original column
        catcol = pd.get_dummies(dataset['ColC'])
        print(catcol)
        datasetstd = datasetstd.drop(['ColC'],axis=1)
        print(datasetstd)
        #Column-wise merge of the remaining columns and the encoded columns
        frames = [datasetstd,catcol]
        result = pd.concat(frames,axis=1)
        print(result)
        #OneHotEncoder transforms each categorical feature with n_categories possible values into n_categories binary features
        rawdata = dataset[['BinCol']]
        ohe = preprocessing.OneHotEncoder()
        print('OneHotEncoder')
        print(rawdata)
        print(ohe.fit_transform(rawdata))
        ode = preprocessing.OrdinalEncoder()
        print('ordinal encoder')
        print(ode.fit_transform(rawdata))

    def StandardizeNumericalData():
        rawdata = dataset[['ColD','ColE']]
        print(rawdata)
        #min_max_scaler
        min_max_scaler = preprocessing.MinMaxScaler()
        print('min_max_scaler')
        print(min_max_scaler.fit_transform(rawdata))
        #standard scaler
        scaler = preprocessing.StandardScaler()
        print('scaler')
        print(scaler.fit_transform(rawdata))
        #max_abs_scaler
        max_abs_scaler = preprocessing.MaxAbsScaler()
        print('max_abs_scaler')
        print(max_abs_scaler.fit_transform(rawdata))

    HandleCategoryValues()
    #StandardizeNumericalData()

    import math
    #Define Data Frames
    data = {'name': ['Raj', 'Siva', 'Mike', 'Dan','New_Joinee'],
            'age': [22,38,26,35,22],
            'location':['Chennai',math.nan,'Bengaluru','Chennai',math.nan]}
    dframe = pd.DataFrame(data)
    print(dframe)
    print('Missing Data Stats')
    print(dframe.isna().sum())
    #Option 1 - Fill with the most frequent value
    from sklearn.impute import SimpleImputer
    imputer = SimpleImputer(strategy='most_frequent')
    newdata = imputer.fit_transform(dframe)
    print(newdata)
    #Option 2 - Fill with a constant placeholder
    imputer = SimpleImputer(strategy='constant',fill_value='missing')
    newdata = imputer.fit_transform(dframe)
    print(newdata)
    #Stratified split keeps the class proportions the same in train and test
    x = np.array([0,1,2,3,4,5,6,7,8,9])
    y = np.array([0,1,0,0,1,0,0,0,1,0])
    from sklearn.model_selection import train_test_split
    x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3,stratify=y)
    print(y_train)
    print(y_test)
    #https://twitter.com/justmarkham/status/1244986650410786817/photo/1
    data = {'name': ['Raj', 'Siva', 'Mike', 'Dan','New_Joinee'],
            'gender': ['M','M','M','M','F'],
            'location':['Chennai',math.nan,'Bengaluru','Chennai',math.nan]}
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.pipeline import make_pipeline
    from sklearn.compose import make_column_transformer
    ohe = OneHotEncoder()
    dframe = pd.DataFrame(data)
    ct = make_column_transformer((ohe,['gender','location']),remainder='passthrough')
    print(ct)
    Happy Learning!!!

    January 14, 2020

    Day #319 - Data Story - Datavision - AI Use cases


    Happy Learning!!!

    Day #318 - Image Comparison Techniques

    An image is a set of points (vectors). We can use different techniques for image similarity comparison; listed below are some out-of-the-box techniques. Dlib-based comparison was discussed earlier.
    • Haar cascade detection, region extraction
    Comparison Techniques
    • Euclidean distance
    • Cosine distance
    • Hamming distance
    • Jaccard-Needham dissimilarity


    Ref - Link

    SSIM (structural similarity) is another approach. The snippet below extracts face regions with a Haar cascade and compares them using the distance measures listed above.
    import cv2

    img1 = cv2.imread(r'D:\PetProject\image_comparison\4.jpg',0)
    face_cascade = cv2.CascadeClassifier(r'D:\PetProject\image_comparison\Haar.xml')
    #Perform Haar detection on the first image
    faces = face_cascade.detectMultiScale(img1, 1.3, 5)
    for (x,y,w,h) in faces:
        #Extract the face region
        face1 = img1[y:y+h, x:x+w]
        #Resize to a fixed size
        face1 = cv2.resize(face1,(120,120), interpolation = cv2.INTER_CUBIC)

    img2 = cv2.imread(r'D:\PetProject\image_comparison\5.jpg',0)
    #Perform Haar detection, extract and resize the face region from the second image
    faces = face_cascade.detectMultiScale(img2, 1.3, 5)
    for (x,y,w,h) in faces:
        face2 = img2[y:y+h, x:x+w]
        face2 = cv2.resize(face2,(120,120), interpolation = cv2.INTER_CUBIC)

    cv2.imshow('face1 1',face1)
    cv2.waitKey(0)
    cv2.imshow('face2 2',face2)
    cv2.waitKey(0)

    #Distance measures
    import scipy.spatial.distance as dist
    import numpy as np
    #Convert to 1D and normalize
    img1_data = face1.reshape(-1)
    img1_data = img1_data/255.0
    img2_data = face2.reshape(-1)
    img2_data = img2_data/255.0
    print(len(img1_data))
    print(len(img2_data))
    #Euclidean distance between the two 1-D arrays
    print(dist.euclidean(img1_data,img2_data))
    #Cosine distance between the 1-D arrays: 1 - A.B/(|A||B|)
    print(dist.cosine(img1_data,img2_data))
    #Hamming distance: the proportion of positions at which the two vectors differ
    print(dist.hamming(img1_data,img2_data))
    #Jaccard-Needham dissimilarity between two boolean 1-D arrays (dissimilarity between sample sets)
    print(dist.jaccard(img1_data,img2_data))
    cv2.destroyAllWindows()
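    As a minimal sketch of the SSIM option mentioned above (my own addition, assuming scikit-image is available; it builds on the face crops produced in the snippet):

    from skimage.metrics import structural_similarity as ssim
    #face1 and face2 are the 120x120 grayscale crops produced above
    #SSIM is 1.0 for identical images and decreases as the structure diverges
    score = ssim(face1, face2)
    print('SSIM score:', score)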
    Happy Learning!!!

    January 13, 2020

    Day #317 - Ensemble Methods

    A summary of ensemble techniques, with Bagging, Boosting, and Voting code snippets.

    #https://scikit-learn.org/stable/modules/ensemble.html#forest
    #Ensemble - combine several techniques
    #averaging - bagging - RandomForest
    #boosting - sequential build, combines weak classifiers - AdaBoost, Gradient Boosting
    #voting classifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    x_data = [[0,0],[1,1],[1,2],[0,4]]
    y_data = [0,1,1,0]
    rfit = RandomForestClassifier(n_estimators = 10)
    rfit = rfit.fit(x_data,y_data)
    scores = cross_val_score(rfit,x_data,y_data,cv=2)
    print(scores.mean())
    # The minimum number of samples required to split an internal node:
    dtree = DecisionTreeClassifier(max_depth=None,min_samples_split=2,random_state=0)
    dtree = dtree.fit(x_data,y_data)
    scores = cross_val_score(dtree,x_data,y_data,cv=2)
    print(scores.mean())
    print(rfit.predict([[3,3]]))
    print(dtree.predict([[3,3]]))
    #boosting
    from sklearn.ensemble import AdaBoostClassifier
    adaboostmodel = AdaBoostClassifier(n_estimators = 10)
    adaboostmodel = adaboostmodel.fit(x_data,y_data)
    print(adaboostmodel.predict([[3,3]]))
    from sklearn.ensemble import GradientBoostingClassifier
    gbmodel = GradientBoostingClassifier(n_estimators = 10,learning_rate=1.0,max_depth=1,random_state=0)
    gbmodel = gbmodel.fit(x_data,y_data)
    print(gbmodel.predict([[3,3]]))
    #VotingClassifier (with voting='hard') would classify based on the majority class label.
    #If ‘hard’, uses predicted class labels for majority rule voting. Else if ‘soft’, predicts the class label based on the argmax of the sums of the predicted probabilities
    from sklearn.ensemble import VotingClassifier
    estimators = [('model1',rfit),('model2',dtree),('model3',adaboostmodel),('model4',gbmodel)]
    votinghardmodel = VotingClassifier(estimators=estimators, voting = 'hard')
    votinghardmodel = votinghardmodel.fit(x_data,y_data)
    print(votinghardmodel.predict([[3,3]]))
    votingsoftmodel = VotingClassifier(estimators=estimators, voting = 'soft')
    votingsoftmodel = votingsoftmodel.fit(x_data,y_data)
    print(votingsoftmodel.predict([[3,3]]))
    Happy Learning!!!

    Day #316 - SVM Classification Examples


    #https://scikit-learn.org/stable/modules/svm.html#classification
    #SVM
    #Classification, Outlier detection
    #Useful for high dimensional spaces
    #two class classification
    from sklearn import svm
    x = [[0,0],[1,1],[3,3],[5,5],[2,3]]
    y = [0,1,2,1,1]
    svmmodel = svm.SVC()
    svmmodel.fit(x,y)
    #property attributes of SVM
    print(svmmodel.support_vectors_)
    #indexes of support vectors
    print(svmmodel.support_)
    #number of support vectors for each class
    print(svmmodel.n_support_)
    #multi-class classification
    #SVC, NuSVC and LinearSVC are classes capable of performing multi-class classification on a dataset.
    #For unbalanced problems, the keywords class_weight and sample_weight can be used to give certain classes or individual samples more importance
    svmweightmodel = svm.SVC(class_weight={0:0.8,1:0.1,2:0.1})
    svmweightmodel.fit(x,y)
    print('Default Prediction')
    print(svmmodel.predict([[1.,1.]]))
    print('Weight Prediction')
    print(svmweightmodel.predict([[1.,1.]]))
    #linear svc
    #LinearSVC implements “one-vs-the-rest” multi-class strategy
    linearsvcmodel = svm.LinearSVC()
    linearsvcmodel.fit(x,y)
    print('linearsvcmodel Prediction')
    print(linearsvcmodel.predict([[1.,1.]]))
    (SVMExamples.py)
    Happy Learning!!!

    Day #315 - ML Notes - Regression

    • L1 (Lasso) can shrink some coefficients to zero
    • L2 (Ridge) shrinks all the coefficients by the same proportion but eliminates none. Because L2 squares the coefficients, it punishes large values more heavily than small values.

    I am bad at reading: I skip content and focus directly on what I am trying to solve. I am going to go through the scikit-learn documentation and try all the code snippets.
    #https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares
    from sklearn import linear_model
    #Approach #1
    model = linear_model.LinearRegression()
    #initialize x and y
    x_data = [[0,0],[1,1],[2,2]]
    y_data = [0,1,2]
    #fit the model
    model.fit(x_data,y_data)
    params = model.coef_
    print(params[0])
    print(params[1])
    #prediction
    print(model.predict([[2,2]]))
    #Approach #2
    #Lasso
    #L1 (Lasso) can shrink some coefficients to zero
    lassomodel = linear_model.Lasso(alpha=0.1)
    lassomodel.fit(x_data,y_data)
    params = lassomodel.coef_
    print(params[0])
    print(params[1])
    #prediction
    print(lassomodel.predict([[2,2]]))
    #Approach #3
    #LARS
    #Least-angle regression (LARS) is a regression algorithm for high-dimensional data
    larsmodel = linear_model.LassoLars(alpha=0.1)
    larsmodel.fit(x_data,y_data)
    params = larsmodel.coef_
    print(params[0])
    print(params[1])
    #prediction
    print(larsmodel.predict([[2,2]]))
    #Approach #4
    #BayesianRidge
    #BayesianRidge estimates a probabilistic model of the regression problem
    bayesian = linear_model.BayesianRidge()
    bayesian.fit(x_data,y_data)
    #prediction
    print(bayesian.predict([[2,2]]))
    #Approach #5
    #Support Vector Regression.
    from sklearn import svm
    svmmodel = svm.SVR()
    svmmodel.fit(x_data,y_data)
    #prediction
    print(svmmodel.predict([[2,2]]))
    #Approach #6
    ridgemodel = linear_model.Ridge(alpha=.5)
    ridgemodel.fit(x_data,y_data)
    print(ridgemodel.predict([[2,2]]))
    Happy Learning!!!

    Concepts - WeightofEvidence, Information Value

    While checking on FinTech ML projects I came across the two concepts Weight of Evidence and Information Value. I found this link intuitive and easy to understand.

    Basically, when we bucketize, within each bucket range we can in turn sub-divide the other factors based on their distribution. In a retail scenario:

    Take customer age groups (20-30, 30-40, 40-50). Within each bucket, we can find the percentage of fraudulent customers. It may be:

    20-30 - 4% fraudulent
    30-40 - 2.5%
    40-50 - 1%

    This technique helps to assign possible values to a variable and decide its impact. This is my understanding; we can also infer the same from data analysis and the distribution percentages across different classes.
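    As a rough sketch of the computation (the bucket counts below are made up for illustration, not from the linked post):

    import pandas as pd
    import numpy as np

    #Hypothetical counts per age bucket: 'good' = non-fraudulent, 'bad' = fraudulent customers
    buckets = pd.DataFrame({'bucket': ['20-30', '30-40', '40-50'],
                            'good': [960, 975, 990],
                            'bad': [40, 25, 10]})
    #Distribution of goods and bads across buckets
    buckets['pct_good'] = buckets['good'] / buckets['good'].sum()
    buckets['pct_bad'] = buckets['bad'] / buckets['bad'].sum()
    #WoE = ln(%good / %bad); IV sums the weighted differences across buckets
    buckets['woe'] = np.log(buckets['pct_good'] / buckets['pct_bad'])
    buckets['iv'] = (buckets['pct_good'] - buckets['pct_bad']) * buckets['woe']
    print(buckets)
    print('Information Value:', buckets['iv'].sum())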

    More Reads - Link1, Link2, Link3, Link4

    Happy Learning!!!

    Data-flow -> Knowledge-flow -> Future Prospects


    • Data to Datalake
    • Datalake to Collective DataInsights
    • DataInsights to Features
    • Features to Models
    • Models to Predictions
    • Predictions to Preparedness

    Data -> Insights -> Predictions

    Happy Learning!!!

    January 11, 2020

    Decode your personality - Social Media Unplugged

    • Google - knows everything about you: regular searches, restricted searches, places visited
    • Gmail - your friends and their locations, your communications and the type of person you are
    • Facebook - your social circle, and your financial status inferred from where you live and where you work
    • LinkedIn - your average compensation can be estimated from your education, company, and years of experience
    • WhatsApp - all your mood swings, emotional discussions, and your relationship with the outside world
    • Mobile Number - your average calls, the number of contacts among friends/blood relations
    • Bank Account - spending patterns, places you visit, average expenses across food/shopping/travel
    • Uber - where you traveled, and what your pattern was for the last month
    We do not have ownership of the data we share. There is no expiry date for the data collected. This data is good enough to decode a person.

    Google Sued for Secretly Amassing Vast Trove of User Data
    • Consumer browsing history
    • Web activity data 
    • Invasion of privacy and violations
    • Storing geolocation data with its mobile apps
    Today I attended the Al Jazeera discussion on the Cambridge Analytica scandal (April 8th).
    Founder - Brittany Kaiser of OwnYourData
    • 87 million Facebook profiles were screened
    • Data science used to classify/label possible persuaders
    • Leverage all their behavioral data
    • Identify their interests (climate change, national security, refugee issues) based on their Facebook groups/feeds
    • Targeted ads to convert them into positive voters / compromise integrity and privacy
    Sounds scary :( :( - manipulating data



    Tech Talk Link

    What does Swiggy know about you?


    Single vs Family, Brand Centric, More than what I could think of :(

    What a Better Social Network Would Look Like
    • Make social networks nonprofits
    • Ban algorithmic amplification
    • Restrict personal data collection and behavioral advertising
    • Let a field of smaller social networks bloom
    • Stop putting white men in charge
    New Age Social Media - My Perspective
    Social media has to be more realistic and reflect our life. Our daily thoughts need to be known only to our first-level circle of immediate relationships. Our social thoughts, or tagged social posts, should be reflected in the second-level circle. Education and professional updates need to reflect in the connected circles.
    Everything needs to stay within limits to avoid information overload. Today data is business, and connections are the business value of the individual. In the long run, both businesses and consumers will lose value. Companies focused on rapid market share without ethical values will end up creating zombies rather than responsible citizens.

    Awful AI Projects - Link

    Keep Thinking!!!

    January 10, 2020

    Model Documentation and Coding Guidelines - Python

    This paper was very useful. It covers data source, purpose, model accuracy, and recommendations. The key metrics are below (screenshot from the paper).


    Structuring Machine Learning Projects

    ML Experiment Parameters

    • Model Parameters
    • Learning Rate
    • Number of Epochs Run
    • Training Loss
    • Validation Loss
    • CPU %%
    • Memory %%
    • Disk usage
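    As a rough sketch of capturing these per run (the helper name and the psutil calls are my own assumptions, not from the paper):

    import json, time
    import psutil  #assumed dependency for CPU / memory / disk stats

    def capture_run_metrics(model_params, epochs, training_loss, validation_loss):
        #Collect model parameters plus system utilisation for one experiment run
        return {'timestamp': time.time(),
                'model_parameters': model_params,
                'epochs': epochs,
                'training_loss': training_loss,
                'validation_loss': validation_loss,
                'cpu_percent': psutil.cpu_percent(interval=1),
                'memory_percent': psutil.virtual_memory().percent,
                'disk_percent': psutil.disk_usage('/').percent}

    run = capture_run_metrics({'learning_rate': 0.001, 'batch_size': 32},
                              epochs=10, training_loss=0.42, validation_loss=0.51)
    print(json.dumps(run, indent=2))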




    if readable():
        be_happy()
    else:
        refactor()
    #http://msdl.cs.mcgill.ca/people/shahla/misc/PythonConventions.pdf
    #filenames short file names
    #myfile.py
    #class name, CapWords convention
    #class MyClass:
    #private and protected variables with _
    # _myProtectedVar, _myPrivateVar
    #import in seperate lines
    #Bad
    import sys, os
    #Good
    import sys
    import os
    #hierarchy of import
    #standard library
    #major imports
    #App specific imports
    #indentation
    #break lines with \
    #no multiple statements in single line
    #Bad
    if foo == 'blah': doBlahThing()
    #good
    if foo == 'blah':
    doBlahThing()
    #No white space before paranthesis
    #bad
    spam (1)
    dict ['key']
    #good
    spam(1)
    dict['key']
    #no white space before comma, semicolon, or colon
    #bad
    if x == 4 : print x , y ; x , y = y , x
    #good
    if x == 4: print x, y; x, y = y, x
    #do not pad assignments with extra whitespace to align the operators
    #bad
    x         = 1
    operatorA = 2
    cab       = 3
    #good
    x = 1
    operatorA = 2
    cab = 3
    #comparisons - use explicit None checks instead of bare truthiness
    #bad
    if x:
        y = 6
    #good
    if x is not None:
        y = 6
    #http://www.cs.rpi.edu/academics/courses/fall18/csci1200/Good_Programming_Practices.pdf
    #uppercase constants
    GRAVITY
    #capitalize the first word of a class name
    Person()
    #private protected with _ before
    _speed
    #Variables
    #Avoid global variables
    #instead of public variable use getters and setters
    class Person():
        def __init__(self, name):
            self.name = name
        def getName(self):
            return self.name
        def setName(self, name):
            self.name = str(name)
    #Avoid deep nesting
    def work_check(word):
        if len(word) < 5:
            return False
        if len(word) % 2 == 0:
            return False
        if word[0] != 'a':
            return False
        return True
    #Exception Handling
    #https://www.datacamp.com/community/tutorials/exception-handling-python
    try:
        a = 100 / 0
        print(a)
    except ZeroDivisionError:
        print("Zero Division Exception Raised.")
    else:
        print("Success, no error!")
    #https://python.g-node.org/python-autumnschool-2010/_media/materials/day0-haenel-best-practices.pdf
    #https://gist.github.com/ericmjl/27e50331f24db3e8f957d1fe7bbbe510
    #https://github.com/bast/somepackage
    #https://dev.to/codemouse92/dead-simple-python-project-structure-and-imports-38c6
    #https://docs.python-guide.org/writing/structure/
    #https://towardsdatascience.com/manage-your-data-science-project-structure-in-early-stage-95f91d4d0600
    #https://github.com/Azure/Azure-TDSP-ProjectTemplate
    #https://drivendata.github.io/cookiecutter-data-science/
    ├── LICENSE
    ├── Makefile <- Makefile with commands like `make data` or `make train`
    ├── README.md <- The top-level README for developers using this project.
    ├── data
    │ ├── external <- Data from third party sources.
    │ ├── interim <- Intermediate data that has been transformed.
    │ ├── processed <- The final, canonical data sets for modeling.
    │ └── raw <- The original, immutable data dump.
    ├── docs <- A default Sphinx project; see sphinx-doc.org for details
    ├── models <- Trained and serialized models, model predictions, or model summaries
    ├── notebooks <- Jupyter notebooks. Naming convention is a number (for ordering),
    │ the creator's initials, and a short `-` delimited description, e.g.
    │ `1.0-jqp-initial-data-exploration`.
    ├── references <- Data dictionaries, manuals, and all other explanatory materials.
    ├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
    │ └── figures <- Generated graphics and figures to be used in reporting
    ├── requirements.txt <- The requirements file for reproducing the analysis environment, e.g.
    │ generated with `pip freeze > requirements.txt`
    ├── setup.py <- Make this project pip installable with `pip install -e`
    ├── src <- Source code for use in this project.
    │ ├── __init__.py <- Makes src a Python module
    │ │
    │ ├── data <- Scripts to download or generate data
    │ │ └── make_dataset.py
    │ │
    │ ├── features <- Scripts to turn raw data into features for modeling
    │ │ └── build_features.py
    │ │
    │ ├── models <- Scripts to train models and then use trained models to make
    │ │ │ predictions
    │ │ ├── predict_model.py
    │ │ └── train_model.py
    │ │
    │ └── visualization <- Scripts to create exploratory and results oriented visualizations
    │ └── visualize.py
    └── tox.ini <- tox file with settings for running tox; see tox.testrun.org
    #https://www.datacamp.com/community/tutorials/inner-classes-python
    #https://www.datacamp.com/community/tutorials/python-data-type-conversion
    #https://github.blog/2015-01-21-how-to-write-the-perfect-pull-request/
    Happy Learning!!!

    January 09, 2020

    Day #314 - Dynamic Taxing using AI

    Flat tax slabs and income tax rates are the procedure followed today. With the digital economy, the extensive data collected and monitored, and new metrics available, we need to consider dynamic taxing based on several parameters. These parameters need to be chosen from both short-term and long-term perspectives. Depending on industry prospects, growth, and economic factors, their values can be adjusted when predicting/recommending tax. We need to collect several parameters to predict tax numbers.

    Some of the parameters we can leverage are
    • Manpower / Natural resources
    • Renewable energy sources
    • Direct skilled employment
    • Indirect employment 
    • Contribution to Innovation
    • Technology adoption / Sharing for partners
    • Contribution for long term growth / Longevity of the business /company
    • Contribution to Education / R & D / IP
    • Greenhouse impact
    • Sector score
    • Taxing based on domain/industry
    • Profit margins, Balance Sheets, Supplier Balance Sheets / Company global profit margins
    • %% of revenue saved with Automation / Robots
    • %% of materials sourced/imported
    • Export value / quantity of items / demand 
    • Measure and adjust dynamically every quarter
    Data needs to drive the decisions. We need to be more dynamic about tax by understanding demand, market conditions, growth prospects, ecology, and economic impact. To develop sustainable growth, we need to focus on both short-term and long-term benefits.






    Ref - Link

    Keep Thinking!!!

    January 08, 2020

    Day #313 - Data science use cases solved in Indian Startups

    I have personally trained or connected with several Indian startups to understand their AI use cases with respect to the domains/businesses they operate in. Some of the key use cases I observed are:
    • Fintech - OCR, banking statements, documents data extraction (Computer Vision)
    • Textile - Similarity of items, thickness  (Computer Vision)
    • B2B - Shop verification, re-identification  (Computer Vision), Item Level Forecasting (Data)
    • Agritech - Ripening, Fruit quality assessment (Computer Vision)
    • Retail - Device failure predictions, IOT based predictions (Data)
    • SalesData - Sales Analytics, Cross-selling, Upselling (Data)
    Kaggle was one approach to learning. I always start from domain -> data -> use cases. This learning helped me map AI challenges and applications across domains.

    This is the trend I have observed in the past few years. I hope to train/connect with more startups across other domains. Feel free to connect with me for any AI training requirements/discussions.

    Happy Learning!!!

    January 07, 2020

    Data Insights for HelloFresh

    Menu Insights
    • Peak sellers
    • Weekday trends
    • Top trends based on seasonality
    • Review based listings
    Customer Insights
    • Top customers
    • Age Groups
    • By Gender
    • Average revenue per customer by age group
    • Recurring customer patterns
    • Food quality issues / incidents / patterns
    Demand Insights
    • Locality vs Demand
    • Maximum Ordered Items
    • Peak times
    • Lean times
    • Weekend patterns
    • Weekday patterns
    • Peak hour trends
    Delivery Insights
    • Transportation cost / time
    • Maximum Traffic Delay Areas
    • Incidents / Damages
    Basic ML Use cases
    • Forecast on volumes of items based on historical data
    • Segmenting customers based on Age / Gender / Veg / Non-Veg / Cuisine choices and providing recommendations
    • Forecast Order Volumes and assign Delivery partners based on Projected numbers to reduce other delays
    • Recommending a similar item every day from other restaurants based on historical data
    • Balanced diet customized to need /preferences based on user choices for a week
    Happy Learning!!!

    Day #312 - AI for hospitals - Children's Hospitals

    Medical is a very interesting and niche domain; I haven't worked with hospitals yet. The process is again similar to other areas: Data Collection -> Insights -> BI -> AI. Below is a high-level overview you can consider.

    Data Collection
    • Overview of Data Collected from Mothers, Newborns
    • Overview of Symptoms / Medications
    • Overview of Sequence of Medications / Side-Effects
    • Positive / Negative Cases
    • Finance / Insurance Related Insights
    Analytics Insights (BI)
    • Most Observed Issues
    • Rarely Observed Issues
    • Trend of Admissions / Patterns across Cities
    • Financial Insights / Insurance Related
    • Correlation to Past Medical History to Complications
    AI Use Cases
    • Patient historical data-based risk predictions
    • Monitoring for new-borns and proactive alerts
    • Tie-up with AI Companies for post pregnancy monitoring and alert of Kids (https://www.loveys.io/)
    • Use the Same Tech (https://www.loveys.io/) to monitor patients
    • Feedback / Sentiment Analysis from Clients
    Happy Learning!!!

    January 06, 2020

    Day #311 - Key Notes for Airbnb ML talks

    ML Use cases
    • Search Ranking 
    • Smart Pricing (Demand Vs Supply)
    • Fraud Detection (Risk Scoring)
    Link1 - Forecasting Uncertainty at Airbnb - Theresa Johnson (Airbnb)

    Insights tell you how the business works; model the business based on the data flow and decisions involved.

    See the problems as both
    • Demand Problem
    • Supply Problem
    Loved the pool cleaner theory
    • Demand Problem - Population, Household count, Number of cleanings
    • Supply Problem - Number of cleaners, Available hours
    Link2 - ML Airbnb

    Key Lessons
    ML tool for customized prices for each night instead of flat rates. The classic combination of demand, supply, peak pricing

    Link3 - Bighead: Airbnb's end-to-end Machine Learning Platform | Airbnb

    Design Goals
    • Seamless (Easy to prototype)
    • Versatile (All framework support)
    • Consistent Environments
    Tools - Redspot, BigHead, MLAutomator, Deep Thought

    Slides - Link

    Advanced ML Use Cases
    • Categorizing Listings
    • Experience Ranking
    • Object Detection
    • Customer Service Ticket Routing
    Happy Learning!!!

    January 01, 2020

    Day #310 - Handle Data Imbalance, Missing Data

    This post is about
    • Handling blank values
    • Handling missing values
    • Handling data imbalance

    import pandas as pd
    import numpy as np
    #Define Data Frames
    Data = {
    'avgage': [22,38,26,35,22,' ',20],
    'Collections':[5000,np.NaN,6000,np.NaN,np.NaN,'',4000],
    'Category':[1,0,1,0,1,1,1]
    }
    Dataset = pd.DataFrame(Data)
    Columnnames = Dataset.columns.tolist()
    print(Columnnames)
    print(Dataset.isnull().sum())
    print('Data Stats')
    print('=================')
    print(Dataset.info())
    #Missing Value Stats
    print('Null Value Stats')
    print('=================')
    print(Dataset.isnull().sum(axis=0))
    #blank values along the column
    print('Blank Values')
    print('=================')
    print((Dataset == '').sum(axis=0))
    #Replace Blank Values
    DatsetBV=Dataset.fillna("NaN")
    #Replace NaN Values
    DatsetBV = DatsetBV.replace('', np.NaN)
    DatsetBV = DatsetBV.replace(' ', np.NaN)
    DatsetBV = DatsetBV.replace(np.NaN,'-999')
    DatsetBV = DatsetBV.replace('NaN','-999')
    print(DatsetBV)
    #Data Imbalance
    print('Stats')
    print(DatsetBV['Category'].value_counts())
    #assign X and Y
    y = DatsetBV.Category
    x = DatsetBV.drop('Category',axis=1)
    #seperate majority and minority class
    df_majority = DatsetBV[DatsetBV.Category==1]
    df_minority = DatsetBV[DatsetBV.Category==0]
    from sklearn.utils import resample
    #upsample minority class
    df_minority_upsampled = resample(df_minority,replace=True #Sample with replacement
    ,n_samples=5, #to match majority class
    random_state = 123) #reproducible results
    #downsample majority class
    df_majority_downsample = resample(df_majority,replace=False #Sample without replacement
    ,n_samples=2, #to match minority class
    random_state = 123) #reproducible results
    #combine upsampled and majority classes
    df_balanced_class_option1 = pd.concat([df_majority,df_minority_upsampled])
    print('upsampled')
    print(df_balanced_class_option1)
    #combine downsampled and minority classes
    df_balanced_class_option2 = pd.concat([df_minority,df_majority_downsample])
    print('downsampled')
    print(df_balanced_class_option2)
    My request to readers: if you find these code snippets, blogs, and articles helpful, please share your learning with others. We grow only by learning and teaching.

    Happy Learning!!!

    Data Science Experiment - Milk Adulteration

    Data - Link


    import pandas as pd
    from sklearn import tree
    from sklearn.cross_validation import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    import sys
    from mlxtend.classifier import EnsembleVoteClassifier
    from sklearn import svm
    from imblearn.over_sampling import SMOTE
    verbose = False
    ratio = 'auto'
    print (sys.version)
    input_file = "TrainDataBinaryClassification.xls"
    df = pd.read_csv(input_file,header=0,sep=",")
    print(df.head())
    print(df.head(5))
    #Remove insignificant id column
    df.drop(['Id'],1,inplace=True)
    #List all column headers
    print(list(df))
    #Fill missing values
    df = df.fillna(-999)
    features = list(df.columns[:-1])
    print(features);
    y1 = df['class']
    x1 = df[features]
    #Option 1
    #SENN = SMOTEENN(ratio=ratio)
    #x, y = SENN.fit_sample(x1, y1)
    #Option #2
    sm = SMOTE(kind='svm')
    x, y = sm.fit_sample(x1, y1)
    pred_train, pred_test, tar_train, tar_test = train_test_split(x,y,test_size=0.3)
    print('Shape of test data')
    rf = RandomForestClassifier(n_estimators=350) # initialize
    classifier2 = rf.fit(x, y) # fit the data to the algorithm
    pred_train, pred_test, tar_train, tar_test = train_test_split(x,y,test_size=0.3)
    print('Shape of test data')
    classifier = tree.DecisionTreeClassifier(criterion="entropy")
    classifier = classifier.fit(x,y)
    classifier3 = RandomForestClassifier(n_jobs=250)
    classifier3 = classifier3.fit(x,y)
    classifier2 = svm.SVC()
    classifier2 = classifier2.fit(x,y)
    clfs = [classifier, classifier2, classifier3]
    clf = EnsembleVoteClassifier(clfs, voting='hard', weights = (4,4,5))
    clf.fit(x, y)
    input_file = "TestDataTwoClass.xls"
    df = pd.read_csv(input_file,header=0,sep=",")
    df2 = pd.read_csv(input_file,header=0,sep=",")
    df.drop(['Id'],1,inplace=True)
    df = df.fillna(-999)
    x = df[features]
    predictions = clf.predict(x)
    print('predictions')
    i = 0
    for i in range(0,len(predictions)):
        print(predictions[i])
    df['class'] = predictions
    df2['class'] = predictions
    print('count',df['class'])
    header = ["Id","class"]
    df2.to_csv("Results_Binary_Class_Adulteration_Sep18_2.csv", sep=',', columns = header,index=False)
    (Experiment1.py)
    import pandas as pd
    from sklearn import tree
    from sklearn.cross_validation import train_test_split
    import sklearn.metrics
    import sys
    sys.path.append('C:\\Anaconda2\\xgboost')
    import xgboost as xgb
    #2.7.12 |Anaconda 4.0.0 (64-bit)|
    print (sys.version)
    input_file = "TrainDataBinaryClassification.xls"
    df = pd.read_csv(input_file,header=0,sep=",")
    print(df.head())
    print(df.head(5))
    #Remove insignificant id column
    df.drop(['Id'],1,inplace=True)
    #List all column headers
    print(list(df))
    print(df.head())
    #Fill missing values
    df = df.fillna(-99)
    features = list(df.columns[:-1])
    print(features);
    y = df['class']
    x = df[features]
    pred_train, pred_test, tar_train, tar_test = train_test_split(x,y,test_size=0.3)
    print('Shape of test data')
    #classifier = tree.DecisionTreeClassifier(criterion="entropy")
    #classifier = classifier.fit(pred_train,tar_train)
    gbm = xgb.XGBClassifier(max_depth=3, n_estimators=300, learning_rate=0.05).fit(x, y)
    #print('acc', classifier.score(x,y))
    #predictions = classifier.predict(pred_test)
    #print(predictions)
    #print(sklearn.metrics.confusion_matrix(tar_test,predictions))
    #print('Classifier Accuracy')
    #print(sklearn.metrics.accuracy_score(tar_test,predictions))
    input_file = "TestDataTwoClassResults.xls"
    df = pd.read_csv(input_file,header=0,sep=",")
    df2 = pd.read_csv(input_file,header=0,sep=",")
    df.drop(['Id'],1,inplace=True)
    df = df.fillna(-99)
    x = df[features]
    predictions = gbm.predict(x)
    print('predictions')
    #print(predictions)
    i = 0
    for i in range(0,len(predictions)):
        print(predictions[i])
    #print('count',len(predictions))
    df2['class'] = predictions
    #df.to_csv("Results_Adulteration.csv", sep=',',index=False)
    header = ["Id","class"]
    df2.to_csv("Results_Adulteration_Sep15.csv", sep=',', columns = header,index=False)
    (Experiment2.py)
    import pandas as pd
    from sklearn import tree
    from sklearn.cross_validation import train_test_split
    import sklearn.metrics
    import sys
    from sklearn.ensemble import RandomForestClassifier
    #2.7.12 |Anaconda 4.0.0 (64-bit)|
    print (sys.version)
    input_file = "TrainDataMultiClassClassification_Custom_sep18.csv"
    df = pd.read_csv(input_file,header=0,sep=",")
    print(df.head())
    print(df.head(5))
    #Remove insignificant id column
    df.drop(['Id'],1,inplace=True)
    #List all column headers
    print(list(df))
    print(df.head())
    #Fill missing values
    df = df.fillna(-9999)
    features = list(df.columns[:-1])
    print(features);
    y = df['class']
    x = df[features]
    pred_train, pred_test, tar_train, tar_test = train_test_split(x,y,test_size=0.3)
    print('Shape of test data')
    classifier = tree.DecisionTreeClassifier(criterion="entropy")
    classifier = classifier.fit(pred_train,tar_train)
    print('acc', classifier.score(x,y))
    predictions = classifier.predict(pred_test)
    rf = RandomForestClassifier(n_estimators=300) # initialize
    rf.fit(x, y) # fit the data to the algorithm
    input_file = "TestDataMultiClass.xls"
    df = pd.read_csv(input_file,header=0,sep=",")
    df2 = pd.read_csv(input_file,header=0,sep=",")
    df.drop(['Id'],1,inplace=True)
    df = df.fillna(-9999)
    x = df[features]
    predictions = rf.predict(x)
    #predictions = classifier.predict(x)
    print('predictions')
    #print(predictions)
    i = 0
    for i in range(0,len(predictions)):
        print(predictions[i])
    #print('count',len(predictions))
    df['class'] = predictions
    #print('count',len(predictions))
    df2['class'] = predictions
    print('count',df['class'])
    #df.to_csv("Results_Multi_Class_Adulteration.csv", sep=',',index=False)
    header = ["Id","class"]
    df2.to_csv("Results_Multi_Class_Adulteration_2_Sep18_RF.csv", sep=',', columns = header,index=False)
    (Experiment3.py)

    Happy Learning!!!

    Stats Lessons




    Happy Learning!!!

    NOSQL Internals and Design Practices

    Objective – The objective of this paper is to analyze NoSQL internals from an RDBMS developer's perspective and provide design guidelines for NoSQL applications.
    Analysis
    RDBMS – RDBMS came into the picture to ensure that ACID properties are maintained and there is a single version of the truth. RDBMS plays a critical role in OLTP applications across banking, finance, and payment domains.
    Database design – The schema is normalized to avoid data redundancy. Primary keys and indexes are created so that query plans use the indexes to filter the required rows and fetch results in the shortest time.

    Query Execution – Data is typically stored in a B-Tree format and organized physically in the form of clustered indexes. This is why a search on the primary key is quick compared to any other non-indexed column. The database engine optimizes the execution plan by leveraging indexes, statistics, partitioning, and non-clustered indexes. Depending on the query, join and sort operators are applied to produce the execution plan, and the plan is reused if it already exists in memory.
    This paper was very useful for understanding OLTP internals. Reposting notes from my earlier blog post:
    • WAL – Changes are written to the log and committed to disk when a checkpoint is reached
    • Buffer Manager – cache for data fetched / recently used
    • Two-Phase locking – Optimistic/pessimistic locking depending on isolation levels
    • Concurrency control – Based on isolation levels
    NoSQL Databases 
    Similar to the OLTP aspects above, there are a few papers that describe designing NoSQL apps for read-heavy / write-heavy workloads. This paper was very useful for understanding the NoSQL perspective of designing apps on columnar databases.

    For Heavy Writes
    • Tall Skinny Tables
    • Consolidate data into single columns
    For Heavy Reads
    • Fewer column families
    • Use bloom filters
    There are multiple NoSQL databases (Key-Value, Document-based, Columnar Databases, etc...). 

    Happy Learning!!!

    Day #309 - Handle Categorical Columns

    Have a Great, Peaceful and Successful 2020
    This post is on Handling Categorical Columns
    import pandas as pd
    #Define Data Frames
    Data = {'Location': ['Singapore', 'India', 'Japan', 'China','Korea'],
    'avgage': [22,38,26,35,22],
    'Education': ['UG','PG','Phd','UG','PG']
    }
    Dataset = pd.DataFrame(Data)
    #Categorize Location
    location = Dataset['Location']
    catlocation = pd.get_dummies(location)
    print(catlocation)
    #Categorize Education
    education = Dataset['Education']
    cateducation = pd.get_dummies(education)
    print(cateducation)
    #Standardize Avg Age
    from sklearn import preprocessing
    age = Dataset['avgage'].values
    min_max_scaler = preprocessing.MinMaxScaler()
    age_scaled = min_max_scaler.fit_transform(age.reshape(-1, 1))
    agedf = pd.DataFrame(age_scaled)
    print(agedf)
    #Merge all the data
    frames = [catlocation,cateducation,agedf]
    #Merge three frames horizontally
    merged_data = pd.concat(frames, axis=1)
    print(merged_data)
    (catcolumns.py)
    Happy Learning!!!