"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

October 31, 2021

Remembering Facts vs Evaluating Ideas

I find it hard to remember configuration parameters, default settings, metrics. These are key to many certifications. Often we focus on the problem at hand, not specific functions or code to check.

Every definition is specific to a cloud provider, with its own set of theoretical FAQ-style questions and language-specific syntax. We measure neither problem solving nor domain knowledge; we rely on syntax and remembering facts. This is a stark difference between product and service companies.

Certification does not necessarily mean you have the skills to build a solution. It merely implies familiarity with a tool/infra. As long as you map your current skills to the new skills, find the gaps, and address them, you can build the required solution.

Learning is a collection of observations, experiments, experiences, applying your relevant past lessons. It is a compound effect. Building a solution is easy, but thinking from a futuristic perspective marks the difference between a newbie and an experienced techie.

20 years of experience is not about working on the same project. The wider you explore, the bigger the perspective. The more you fail, the more you are aware of different domains/roles. In the end, let it be a collective memory of different experiences. Win or lose, enjoy the journey.

I keep coding my logic with a mix of syntax I recollect across SQL, C, Python, R, C#. First, pseudo logic comes to mind. Later the logic is corrected based on StackOverflow answers. Every language has its own way of defining constructs and separators. Am I a bad programmer, mmm maybe... Always there is more to learn :)

Anyways value addition needs to be quantified so you need to pass this too :)


October 30, 2021

AI in Finance

Paper #1 - AI in Finance: Challenges, Techniques and Opportunities

Key Notes

  • Key areas are capital markets, trading, banking, insurance, lending/loan, investment, asset/wealth management, risk management, marketing, compliance and regulation, payment, contracting, auditing, accounting, financial infrastructure, blockchain, financial operations, financial services, financial security, and financial ethics
  • Classic techniques including logic, planning, knowledge representation, statistical modeling, mathematical modeling, optimization, autonomous systems, multiagent systems, expert systems
  • Modern techniques such as recent advances in representation learning, machine learning, optimization, data analytics, data mining and knowledge discovery, computational intelligence, event analysis, behavior informatics, social media/network analysis
  • Specific business problems, such as market trend forecasting, stock price prediction, credit scoring, fraud detection, financial report analysis, pricing and hedging, marketing, consumer behavior analysis, algorithmic trading, social commerce, and Internet finance.
  • Portfolio planning and optimization: including designing, planning, optimizing and recommending investment portfolios and strategies in a market
  • Forecasting and prediction: including the regression, classification, estimation and prediction of trend (up or down), movement (direction and scale, etc.), value (e.g., price or volatility)
  • Business profiling: including describing, segmenting, characterizing and classifying markets, products, customers, and services.
  • Sentiment and intention modeling: including characterizing, representing, modeling, analyzing and evaluating the polarity, diversity, propensity and their dynamics of customer sentiment and intention 
  • Anomaly detection: such as characterizing, quantifying, detecting, classifying and predicting abnormal, exceptional and changing behaviors, products, patterns, performance





Paper #2 - Enhancing Financial Inclusion using Mobile Phone Data and Social Network Analytics

Key Notes

  • Datasets - call-detail records and the credit and debit account information of customers are used to create scorecards for credit card applicants
  • Call-detail records are used to build call networks and advanced social network analytics techniques are applied to propagate influence from prior defaulters throughout the network to produce influence scores
  • predictive model for a target measure of interest (e.g., churn, fraud, default) 
  • sociodemographic  information, such as age, marital status and postcode; debit account activity, including timing and amount of payments; and credit card activity
  • sociodemographic features such as age, marital status and residency as reported at the time of the credit card application are extracted.
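
The idea of propagating influence from prior defaulters through a call network can be sketched as a random walk with restart on a weighted graph. This is a toy illustration under stated assumptions, not the paper's actual method: the function name `influence_scores`, the damping factor, and the use of call counts as edge weights are all hypothetical choices.

```python
from collections import defaultdict

def influence_scores(calls, defaulters, damping=0.85, iterations=50):
    """Spread influence from known defaulters over an undirected call graph.

    calls: iterable of (caller, callee, weight), e.g. weight = number of calls.
    defaulters: set of customers known to have defaulted.
    Returns a dict customer -> influence score (scores sum to 1).
    """
    # Build a symmetric weighted adjacency map from the call records.
    neighbours = defaultdict(dict)
    for a, b, w in calls:
        neighbours[a][b] = neighbours[a].get(b, 0.0) + w
        neighbours[b][a] = neighbours[b].get(a, 0.0) + w

    nodes = set(neighbours) | set(defaulters)
    # Restart distribution: all mass sits on the prior defaulters.
    restart = {n: (1.0 / len(defaulters) if n in defaulters else 0.0) for n in nodes}
    scores = dict(restart)

    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            total = sum(neighbours[n].values())
            if total == 0:
                # Isolated node: hand its mass back to the restart set.
                for m in nodes:
                    nxt[m] += damping * scores[n] * restart[m]
                continue
            # Pass influence to neighbours in proportion to call weight.
            for m, w in neighbours[n].items():
                nxt[m] += damping * scores[n] * w / total
        scores = nxt
    return scores
```

A customer's score could then be used as one extra feature next to the sociodemographic and account features listed above.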





Paper - P2P LOAN ACCEPTANCE AND DEFAULT PREDICTION WITH ARTIFICIAL INTELLIGENCE

Key Notes

Features for the first phase are: 

  • debt to Income ratio (of the applicant); 
  • employment length (of the applicant); 
  • loan amount (of the loan currently requested); 
  • purpose for which the loan is taken

Features for the second phase are:

  • loan amount (of the loan currently requested); 
  • term (of the loan currently requested); 
  • instalment (of the loan currently requested); 
  • employment length (of the applicant);
  • home ownership (of the applicant. Rented, owned or owned with a mortgage on the property); 
  • verification status of the income or income source (of the applicant. If this was verified by the Lending Club); 
  • purpose for which the loan is taken; 
  • Debt to Income ratio (of the applicant); 
  • earliest credit line in the record (of the applicant); 
  • number of open credit lines (in applicant’s credit file); 
  • number of derogatory public records (of the applicant);
  • revolving line utilisation rate (the amount of credit the borrower is using relative to all available revolving credit);
  • total number of credit lines (in applicant’s credit file); 
  • number of mortgage credit lines (in applicant’s credit file); 
  • number of bankruptcies (in the applicant’s public record); 
  • logarithm of the applicant’s annual income (the logarithm was taken for scaling purposes); 
  • FICO score (of the applicant); 
  • logarithm of total credit revolving balance (of the applicant).
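
A few of the derived features in this list (the ratio and logarithm ones) can be sketched as below. This is a minimal sketch, not the paper's preprocessing: the input field names are hypothetical, and using `log1p` for the revolving balance (so that a zero balance does not fail) is my own assumption.

```python
import math

def build_features(applicant):
    """Turn raw applicant fields (hypothetical names) into numeric features."""
    return {
        # Debt payments relative to income over the same period.
        "debt_to_income": applicant["monthly_debt"] / applicant["monthly_income"],
        # Logarithm taken for scaling purposes, as noted in the list.
        "log_annual_income": math.log(applicant["annual_income"]),
        # log1p(x) = log(1 + x); an assumption so a zero balance is handled.
        "log_revolving_balance": math.log1p(applicant["revolving_balance"]),
        # Credit in use relative to all available revolving credit.
        "revolving_utilisation": applicant["revolving_balance"] / applicant["revolving_limit"],
    }
```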

Paper #3 - Behavior Revealed in Mobile Phone Usage Predicts Credit Repayment

Key Notes

  • Mobile phone transaction history prior to the extension of credit, and whether the credit was repaid on time
  • Transition to a postpaid plan
  • Call and SMS metadata

Paper #4 - Data Science in Economics

Key Notes







More Reads - 

Keep Exploring!!!

Solving the right problem at the Right time matters

In 2013, I was part of the team that worked on traffic forecasting for retail stores

  • Multiple stores across geographies
  • Multiple DBs for each local store

The forecasting system ran at the enterprise level and synchronized data to the local stores, which had their own internal synchronization jobs.

  • These jobs were configured to run according to time zones of stores
  • The algorithms were mostly around a weighted moving average, trend + moving average 
  • The forecast job runs on previous data and projects hour-by-hour forecasts for the next day, based on hourly patterns
  • The actuals are captured the following day and measured against the forecast
  • Where a store had no data, data from sister stores (similar stores) was used for the calculation
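
The approach in the bullets above can be sketched roughly as follows. This is a reconstruction from memory under stated assumptions, not the original system: the function names, the weights `(1, 2, 3)`, and the dict-of-lists data layout are all hypothetical.

```python
def weighted_moving_average(history, weights):
    """Forecast the next value as a weighted average of recent observations.

    history: past values, oldest first.
    weights: one weight per recent value, oldest first, so the last weight
    applies to the most recent observation.
    """
    recent = history[-len(weights):]
    if len(recent) < len(weights):
        raise ValueError("not enough history for the given weights")
    return sum(v * w for v, w in zip(recent, weights)) / sum(weights)


def forecast_next_day(store_history, sister_history, weights=(1, 2, 3)):
    """Hourly forecast for the next day.

    Both arguments map hour -> list of past counts for that hour (same hour
    on previous days, oldest first). When the store has too little data for
    an hour, the sister store's data is used instead.
    """
    forecast = {}
    for hour in range(24):
        own = store_history.get(hour, [])
        series = own if len(own) >= len(weights) else sister_history.get(hour, [])
        if len(series) >= len(weights):
            forecast[hour] = weighted_moving_average(series, weights)
    return forecast


def mean_absolute_error(forecast, actuals):
    """Compare the forecast with the actuals captured the following day."""
    hours = [h for h in forecast if h in actuals]
    return sum(abs(forecast[h] - actuals[h]) for h in hours) / len(hours)
```

Tracking the error per store and per day is the simplest form of what we would now call drift monitoring.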

Everything we talk about today (measuring model drift, handling missing data, working at scale, coexisting with the existing transaction system) was built back then as custom server components.

What we missed:

  • If we had done a sales forecast instead of a traffic forecast, the solution could have been applied to both eCommerce and retail giants
  • We had inherent details of out-of-stock and replenishment alerts. The same could have been used for an out-of-stock forecast per zone and a replenishment forecast per zone
  • The real-time reports from RFID could have served as effective forecasting opportunities for the same

Sometimes we may have the right technology and architecture but not the right use cases. Now I see ML attempting the same things with #kubeflow, #pipelines, #scale. The same problem, which we solved with the models available at that point in time, would take a different set of skills to solve today 😊

Keep Exploring!!!

AI - Education Opportunities

Paper #1 - Strengthening e-Education in India using Machine Learning

Key Notes

Applying different data mining algorithms to a person's data and suggesting which course is appropriate for him based on his background knowledge


Paper #2 - Personalized Education in the AI Era: What to Expect Next?

Key Notes




  • Content summarization and question generation
  • Multi-modal content understanding
  • Human-in-the-loop content design





More Reads

Teaching Machine Learning in K–12 Computing Education: Potential and Pitfalls

Estimating returns to special education: combining machine learning and text analysis to address confounding

Keep Exploring!!!

October 25, 2021

Merlion - open-source machine learning library for time series - Forecasting

Paper - Merlion: A Machine Learning Library for Time Series

Key Notes

  • Another forecasting library, this one from Salesforce
  • Merlion includes classic statistical methods, tree ensembles, and deep learning methods. 
  • Merlion implements many diverse models for both forecasting and anomaly detection

Forecasting Algos List

Univariate time series forecasting

  • ARIMA (AutoRegressive Integrated Moving Average)
  • SARIMA (Seasonal ARIMA)
  • ETS (Error, Trend, Seasonality)
  • Prophet
  • Deep autoregressive LSTM

Multivariate forecasting models

  • autoregression algorithm
  • Vector Autoregression






Examples

Documentation

Orbit: A Python Package for Bayesian Forecasting

Orbit: Probabilistic Forecast with Exponential Smoothing

darts is a Python library for easy manipulation and forecasting of time series

Time Series Made Easy in Python

Keep Exploring!!!

October 24, 2021

Indian Startup #Greyorange #AI #DataScience #Robotics #WarehouseAutomation


Useful links for further review

Keep Exploring!!!

October 21, 2021

Cloud Comparison - Good Read

A good paper on cloud comparison. I had been looking for such a handy paper for a long time.

Paper - Public Cloud Infrastructure Vendors 

Key Notes

Service Types


Compute Services


Infra


Serverless


Storage


Database


Big Data


Real Time Streaming

ML Service 

Networking


Additional Services


Very good Work and Nice Summary from Paper!!!

Good Read - Databases Vs Blockchain

Paper - Trends in Development of Databases and Blockchain

Key Notes

  • ACID (Atomicity, Consistency, Isolation, and Durability) 
  • CAP - (Consistency, Availability, Partition tolerance)
  • DCS (Decentralization, Consistency, Scalability) theorem

Difference between blockchain and database

  • Blockchain differs from traditional databases in numerous ways like its decentralization, cryptographic security using chained hashes, no administration control, immutability, freedom to transfer
  • These distributed databases have their consensus mechanism for the joint agreement on a data block by the network parties
  • Blockchain databases = distributed databases that support features like complex data types, rich query structure, ACID compliance [3], low latency, fast scalability, and cloud hosting

CAP 

  • Consistency - Any read in the distributed system gives the latest write on the nodes.
  • Availability - A Client always receives a response at any point of time irrespective of whether the read is the latest write.
  • Partition Tolerance - In case of partition between nodes in the distributed system, the system should still be functioning

DCS

  • Decentralization - There is no trusted entity controlling the network, hence no single point of failure. 
  • Consistency - The blockchain nodes will read the same data at the same time. 
  • Scalability - The performance of blockchain should increase with the increase in the number of peers and the number of allocated computational resources. 




Good Read!!!


October 16, 2021

Smartphones are sensors, Customer Data Platform

If you aren't the paying customer, you are the product. 

Interesting read - Android Mobile OS Snooping By Samsung, Xiaomi, Huawei and Realme Handsets

Key Findings

  • Samsung, Xiaomi, Huawei and Realme Android variants all transmit a substantial volume of data to the OS developer (i.e. Samsung etc) and to third parties that have pre-installed system apps (including Google, Microsoft, Heytap, LinkedIn, Facebook)
  • Re-linkability of advertising identifiers. Samsung, Xiaomi, Realme and Google all collect long-lived device identifiers, e.g. the hardware serial number, as well as user-resettable identifiers, such as advertising IDs
  • On the Samsung handset the Google Advertising ID is sent to Samsung servers

  • What apps are used and when, what app screens are viewed, when and for how long
  • Several Samsung system apps use Google Analytics to log user interactions (windows viewed etc)
  • Samsung, Xiaomi, Realme, Huawei, Heytap and Google collect details of the apps installed on a handset
  • The list of installed apps is potentially sensitive information since it can reveal user interests and traits (a mental health app, a political news app)
  • No opt-out. As already noted, this data collection occurs even though privacy settings are enabled
  • Xiaomi collects the most extensive data on user interactions, including the timing and duration of every app window viewed by a user
  • One example of potentially sensitive metadata is the name, timing and duration of the app windows viewed by a user.
  • Data which is not sensitive in isolation can become sensitive when combined with other data
  • Android handsets can be directly tied to a person’s identity in at least two ways. Firstly, via the SIM: when a person has a contract with a mobile operator, the SIM is tied to that person. Secondly, via the app store used.
  • Use of the Google Play store requires login using a Google account, which links the handset to that account since Google collects device identifiers such as the hardware serial number and IMEI along with the account details
  • Sometimes the plaintext data (i.e. after decryption, if needed) is human-readable, e.g. JSON.
  • On a Samsung handset Samsung, Google and Microsoft/LinkedIn all collect data. That raises the question of whether the data collected separately by these parties can be linked together (and of course combined with data from other sources).
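
The re-linkability point is easy to see with a toy example: if a collector logs a long-lived identifier together with the resettable advertising ID, then resetting the advertising ID does not break the link. The sketch below is purely illustrative; the function name and the sample values are made up and do not come from the paper's traces.

```python
def link_advertising_ids(events):
    """Group advertising IDs by the long-lived identifier seen alongside them.

    events: iterable of (hardware_serial, advertising_id) pairs, as a
    collector receiving both values together could log them.
    Returns a dict hardware_serial -> list of advertising IDs in first-seen order.
    """
    by_device = {}
    for serial, ad_id in events:
        ids = by_device.setdefault(serial, [])
        if ad_id not in ids:
            ids.append(ad_id)
    return by_device


# Hypothetical log: the user on device SER1 resets the advertising ID once.
log = [
    ("SER1", "ad-aaa"),
    ("SER1", "ad-aaa"),
    ("SER1", "ad-bbb"),  # new ID after reset, same hardware serial
    ("SER2", "ad-ccc"),
]
linked = link_advertising_ids(log)
```

Here `linked["SER1"]` holds both the old and the new advertising ID, so the histories attached to each can be joined.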




Keep Exploring!!!

October 15, 2021

Decision Trees - 5 Mins Tutorials



Happy Learning!!! 

Pipelines - Pipelines

With this concept of pipelines, I sometimes feel that reality and the state of the art are way too different

  1. As of today, maybe 5% of companies have data consolidated for building models. The rest connect and extract data as needed
  2. ML is not a separate skill. Data across OLTP, OLAP, reporting, and ML all has to co-exist. 

The intent of the pipeline is to automate Model Building / Deployment. I have not seen direct training/deployment.

In Actual Implementation

  • Training code will be separate
  • Test data Location / Connectors to Pull data
  • Trained models storage / Saving their metrics
  • Deploying trained model as API

Still, we can achieve everything with the skills the team already has across DB / ML. We don't need a dedicated ML pipeline. This post on a DIY pipeline demonstrates the same: DIY machine learning training pipeline
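
To make the DIY point concrete, here is a minimal sketch of the four pieces listed under "In Actual Implementation", using only the Python standard library. It is my own toy version, not the code from the linked post: the one-variable least-squares model, the file names `model.pkl` and `metrics.json`, and all function names are assumptions.

```python
import json
import pickle
from pathlib import Path

def train(rows):
    """Training code: fit y = slope * x + intercept by ordinary least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = cov_xy / var_x
    return {"slope": slope, "intercept": mean_y - slope * mean_x}

def predict(model, x):
    return model["slope"] * x + model["intercept"]

def evaluate(model, rows):
    """Mean absolute error on held-out test rows."""
    return sum(abs(predict(model, x) - y) for x, y in rows) / len(rows)

def run_pipeline(train_rows, test_rows, out_dir):
    """Train, evaluate, then store the model and its metrics side by side."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    model = train(train_rows)
    metrics = {"mae": evaluate(model, test_rows), "n_train": len(train_rows)}
    (out / "model.pkl").write_bytes(pickle.dumps(model))
    (out / "metrics.json").write_text(json.dumps(metrics))
    return metrics

def load_and_predict(out_dir, x):
    """What an API handler would call: load the stored model and score one input."""
    model = pickle.loads((Path(out_dir) / "model.pkl").read_bytes())
    return predict(model, x)
```

In a real setup the rows would come from a DB connector and `load_and_predict` would sit behind whatever web framework the team already uses.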

More Read

Keep Exploring!!!


October 14, 2021

Telematics - Papers

Datasets

Paper - Synthetic Dataset Generation of Driver Telematics

Features

Further Datasets - Link

Paper - Collaborative Cloud-Edge Computation for Personalized Driving Behavior Modeling

Key Notes

  • Generative Adversarial Recurrent Neural Networks (GARNN)
  • CGARNN-Edge (Conditional GARNN)
  • Driving behavior modeling can also be used by insurance companies to determine the vehicle insurance premium
  • Drivers may have distinct driving behaviors because of their individual difference, such as age group, gender, and driving experience
  • Real-time performance is a stringent requirement for ADAS. For example, fatigue driving or other abnormal driving behaviors should be detected immediately

  • Driving behavior: speed, acceleration, brake force, steering, lane offset, and lane position signal




Paper - Driver Telematics Analysis

Key Notes

  • Aggressive behaviors include lane violations, failure to stop, speeding, sudden rise in acceleration, and other severe violations
  • Behavior parameters that account for aggregate driving profile: mean speed, mean speed excluding the stops, mean acceleration, mean deceleration, average length of a trip, mean number of acceleration/deceleration changes within a trip, standstill time proportion, acceleration time proportion, deceleration time proportion and constant speed time proportion
  • Trip Features - Ride Length, Ride Speed, Ride Length without stops, Ratio of Stops, speeds, angles, accelerations, speed*angles.
  • Driving features - Mean acceleration, Mean deceleration, Average number of acceleration/deceleration changes
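
Several of the aggregate features above can be computed from nothing more than a speed trace. The following is a small sketch under my own assumptions (evenly spaced samples in m/s, a 0.5 m/s cutoff for "stopped", hypothetical function and key names), not the paper's feature code.

```python
def trip_features(speeds, dt=1.0, stop_threshold=0.5):
    """Aggregate one trip's speed samples (m/s, taken every dt seconds)."""
    if len(speeds) < 2:
        raise ValueError("need at least two speed samples")
    # Acceleration between consecutive samples.
    accels = [(b - a) / dt for a, b in zip(speeds, speeds[1:])]
    moving = [s for s in speeds if s > stop_threshold]
    pos = [a for a in accels if a > 0]
    neg = [a for a in accels if a < 0]
    # Count switches between accelerating and decelerating, ignoring zeros.
    signs = [1 if a > 0 else -1 for a in accels if a != 0]
    changes = sum(1 for s, t in zip(signs, signs[1:]) if s != t)
    n = len(speeds)
    return {
        "mean_speed": sum(speeds) / n,
        "mean_speed_excluding_stops": sum(moving) / len(moving) if moving else 0.0,
        "mean_acceleration": sum(pos) / len(pos) if pos else 0.0,
        "mean_deceleration": sum(neg) / len(neg) if neg else 0.0,
        "accel_decel_changes": changes,
        "standstill_proportion": (n - len(moving)) / n,
        # Rough distance in metres: sum of speed * dt over the samples.
        "ride_length_m": sum(speeds) * dt,
    }
```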

Paper - Analyzing driving behavior from CAN data using context-specific information 

Key Notes


Speed based acceleration thresholds - applicable to all categories

  • Here, adaptive speed-based thresholds are derived from exponential regression equations
  • Thresholds for turn / straight segments are different. Stricter for turns
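
One way to picture speed-based thresholds from an exponential regression: fit y = a * exp(b * x) to (speed, acceleration limit) points, then flag an event when the observed acceleration exceeds the fitted limit at that speed. The sketch below is an assumption-laden illustration, not the paper's equations: the function names and the 0.8 factor that makes turns stricter are hypothetical.

```python
import math

def fit_exponential(points):
    """Least-squares fit of y = a * exp(b * x) via linear regression on ln(y).

    points: list of (speed, acceleration_limit) pairs with positive limits.
    """
    n = len(points)
    xs = [x for x, _ in points]
    logs = [math.log(y) for _, y in points]
    mean_x, mean_l = sum(xs) / n, sum(logs) / n
    b = (sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
         / sum((x - mean_x) ** 2 for x in xs))
    a = math.exp(mean_l - b * mean_x)
    return a, b

def threshold(speed, a, b, in_turn=False, turn_factor=0.8):
    """Acceleration limit at a given speed; stricter (lower) inside turns."""
    limit = a * math.exp(b * speed)
    return limit * turn_factor if in_turn else limit

def is_harsh(speed, accel, a, b, in_turn=False):
    """True when the absolute acceleration exceeds the limit for this speed."""
    return abs(accel) > threshold(speed, a, b, in_turn)
```

With a negative `b`, the allowed acceleration shrinks as speed grows, so the same acceleration can be normal at low speed and harsh at high speed.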

Paper - Driving Style Representation in Convolutional Recurrent Neural Network Model of Driver Identification

Key Notes

  • Input: A trajectory 𝑇 .
  • Model: A predictive model 𝑀 to capture variations in driving behavior to derive driving style information.
  • Goal: Predict identity of driver for trajectory 𝑇 based on driving style information.
  • Optimization Objective: Minimize prediction error

A Vehicle Classification Algorithm based on Telematics Data

Keep Exploring!!!

October 13, 2021

World of ML landscape

World of ML landscape - link

Picking a few selected areas to focus on and deep dive into.












Keep Exploring the Catalog!!!