# -*- coding: utf-8 -*-
"""DecisionTrees.ipynb

Automatically generated by Colaboratory.
"""

# The Gini index takes values in the interval [0, 0.5] (for binary
# classification), whereas entropy takes values in [0, 1].
# The Gini index is faster to compute because it avoids logarithms.
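# A quick numeric check of the two ranges above (a minimal sketch, not part of
# the original notebook): for a binary node with positive-class probability p,
# Gini = 1 - p^2 - (1-p)^2 peaks at 0.5 and entropy peaks at 1.0, both at p = 0.5.
import math
p = 0.5
gini = 1 - p**2 - (1 - p)**2                                # -> 0.5
entropy = -p*math.log(p, 2) - (1 - p)*math.log(1 - p, 2)    # -> 1.0
print('Gini at p=0.5:', gini, '| Entropy at p=0.5:', entropy)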

import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

dataset = pd.read_csv(r'/content/diabetes_dataset.csv', sep=',', header=0)
print(dataset.head())

# List all column names
print(dataset.columns.tolist())
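
# Optional sanity checks before modelling (a sketch, not in the original
# notebook): confirm the row/column counts, the dtypes, and that no column
# has missing values.
print(dataset.shape)
print(dataset.dtypes)
print(dataset.isnull().sum())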

# Drop the Outcome column and assign the remaining features to x
x = dataset.drop(['Outcome'], axis=1)
# Assign the Outcome column to y
y = dataset.Outcome

# Divide into an 80/20 train/test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
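
# Note: diabetes labels are usually imbalanced, so a stratified split keeps the
# class ratio the same in train and test. A minimal alternative to the call
# above (stratify is a standard train_test_split parameter):
# x_train, x_test, y_train, y_test = train_test_split(
#     x, y, test_size=0.2, random_state=1, stratify=y)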

# Create a Decision Tree classifier (Gini criterion)
model = DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=10, min_samples_split=2)
# Fit the model on the training data
model = model.fit(x_train, y_train)
# Predict the response for the test dataset
y_pred = model.predict(x_test)

# Evaluation using the accuracy score (as a percentage)
print("Accuracy:", accuracy_score(y_test, y_pred)*100)
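
# The confusion_matrix and classification_report imports above can be put to
# use here: per-class precision and recall are more informative than a single
# accuracy number on an imbalanced dataset.
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))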

#from sklearn import tree
#text_representation = tree.export_text(model)
#print(text_representation)

# Create a Decision Tree classifier (entropy criterion, random splitter)
model = DecisionTreeClassifier(criterion='entropy', splitter='random', max_depth=10, min_samples_split=2)
# Fit the model on the training data
model = model.fit(x_train, y_train)
# Predict the response for the test dataset
y_pred = model.predict(x_test)

# Evaluation using the accuracy score (as a percentage)
print("Accuracy:", accuracy_score(y_test, y_pred)*100)
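
# Because splitter='random' chooses candidate splits at random, the accuracy
# above varies from run to run. Fixing the classifier's random_state (a
# standard scikit-learn parameter) makes the result reproducible, e.g.:
# model = DecisionTreeClassifier(criterion='entropy', splitter='random',
#                                max_depth=10, min_samples_split=2,
#                                random_state=1)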

# For this worked information-gain example, restrict the Pregnancies feature
# to the values 0, 1 and 2 only. Counts (verified in the sketch below):
# Total samples = 349; Outcome = 0: 263, Outcome = 1: 86
# Pregnancies = 0: Outcome = 0: 73,  Outcome = 1: 38, total samples 111
# Pregnancies = 1: Outcome = 0: 106, Outcome = 1: 29, total samples 135
# Pregnancies = 2: Outcome = 0: 84,  Outcome = 1: 19, total samples 103
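
# The counts above can be reproduced straight from the dataframe (a sketch;
# it assumes the feature column is named 'Pregnancies', as in the standard
# Pima diabetes dataset):
# subset = dataset[dataset['Pregnancies'].isin([0, 1, 2])]
# print(subset.groupby(['Pregnancies', 'Outcome']).size())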

# Entropy of the full dataset (the parent node), computed over the Outcome label
import math
A = ((263/349)*math.log((263/349), 2))
B = ((86/349)*math.log((86/349), 2))
TotalEntropy = -A - B
print('Entropy of the full dataset')
print(TotalEntropy)

# For Pregnancies = 0: [73, 38], 111 samples
A = ((73/111)*math.log((73/111), 2))
B = ((38/111)*math.log((38/111), 2))
EntropyP0 = -A - B
print('Entropy of Pregnancies = 0')
print(EntropyP0)

# For Pregnancies = 1: [106, 29], 135 samples
A = ((106/135)*math.log((106/135), 2))
B = ((29/135)*math.log((29/135), 2))
EntropyP1 = -A - B
print('Entropy of Pregnancies = 1')
print(EntropyP1)

# For Pregnancies = 2: [84, 19], 103 samples
A = ((84/103)*math.log((84/103), 2))
B = ((19/103)*math.log((19/103), 2))
EntropyP2 = -A - B
print('Entropy of Pregnancies = 2')
print(EntropyP2)

# Information gain = parent entropy minus the weighted entropies of the children
Gain_Pfeature = TotalEntropy - (111/349)*EntropyP0 - (135/349)*EntropyP1 - (103/349)*EntropyP2
print('Information gain of the Pregnancies feature')
print(Gain_Pfeature)
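
# The same arithmetic wrapped in a small helper, as a sanity check on the hand
# computation above (a sketch; entropy() and the count lists simply restate
# the numbers already used):
def entropy(counts):
    total = sum(counts)
    return -sum((c/total)*math.log(c/total, 2) for c in counts if c > 0)

parent = entropy([263, 86])
children = [(111, [73, 38]), (135, [106, 29]), (103, [84, 19])]
gain = parent - sum((n/349)*entropy(c) for n, c in children)
print('Information gain (helper):', gain)  # should equal Gain_Pfeature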