Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database)

October 11, 2021

BERT QnA Example

Some examples are a good source of ideas that we can customize as needed. BERT-based QnA Example

Text Clustering - Did a decent job of clustering by JD type - Cloud, Server, ML, etc.

Unsupervised NER using BERT

Document search with fragment embeddings

Finbert

Keep Exploring!!!

October 10, 2021

Forecast - Planning - Recommendations - Paper Reads

Paper #1 - Maximizing Store Revenues using Tabu Search for Floor Space Optimization

Key Notes

Floor space is a valuable and scarce asset for retailers
The problem is formulated as a connected multi-choice knapsack problem with an additional global constraint; the authors propose a tabu search based metaheuristic that exploits multiple special neighborhood structures
Over the last decade, the number of products competing for limited space increased by up to 30%
The product mix of categories, merchandising rules, sales patterns and characteristics of display furniture
(1) develop a statistical model to measure the space elasticity; and
(2) formulate and solve an optimization problem for each store to determine the optimal assignment of planograms to maximize total revenue subject to certain business constraints
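
Picking one planogram per category under a store-wide space limit has the shape of a multiple-choice knapsack. Below is a minimal sketch of that idea, assuming made-up categories, space units and revenues. It uses a small exhaustive dynamic program, not the tabu search the paper proposes, and it ignores the paper's additional global constraint.

```python
def best_assignment(categories, capacity):
    """Pick exactly one (space, revenue) option per category, maximizing
    total revenue without exceeding the integer space capacity."""
    # best[s] = (revenue, picks) for the best assignment using exactly s units
    best = {0: (0.0, [])}
    for name, options in categories.items():
        nxt = {}
        for used, (rev, picks) in best.items():
            for space, revenue in options:
                s = used + space
                if s > capacity:
                    continue  # this option would not fit
                cand = (rev + revenue, picks + [(name, space)])
                if s not in nxt or cand[0] > nxt[s][0]:
                    nxt[s] = cand
        best = nxt
    if not best:
        return None  # no feasible assignment
    return max(best.values(), key=lambda t: t[0])


# Hypothetical data: each category offers planograms as (space units, expected revenue).
categories = {
    "snacks": [(2, 30.0), (3, 42.0), (4, 50.0)],
    "drinks": [(2, 25.0), (4, 55.0)],
    "dairy": [(1, 10.0), (3, 36.0)],
}
revenue, plan = best_assignment(categories, capacity=9)
print(revenue, plan)
```

For real store sizes the search space grows quickly, which is presumably why the paper reaches for a metaheuristic instead of an exact method.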

Paper #2 - Reversing ShopView analysis for planogram creation

Key Notes

ShopView can build the planogram without the need of manually creating it in software
OCR in the identification of products
Planograms specify the absolute physical locations of the products, and the amount of space each type of product should occupy
Planogram compliance using template images
Vision - Object Recognition based on attributes, Template and Feature Matching, Optical Character Recognition (OCR)
Custom Dictionary - Implementing a custom dictionary for the OCR engine seemed a good strategy since at first glance it would improve the performance of the OCR algorithm
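
A rough way to try the custom dictionary idea is to snap OCR tokens to the closest known product name. This sketch is a post-correction stand-in built on Python's `difflib`, not the engine-level dictionary the paper talks about; the product names, OCR tokens and the 0.8 cutoff are all assumptions for illustration.

```python
import difflib


def snap_to_dictionary(token, dictionary, cutoff=0.8):
    """Return the closest dictionary entry for an OCR token, or the token
    itself when nothing is similar enough."""
    lowered = {entry.lower(): entry for entry in dictionary}
    matches = difflib.get_close_matches(
        token.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else token


# Hypothetical product names and noisy OCR output.
product_names = ["Coca-Cola", "Pepsi", "Sprite"]
ocr_tokens = ["Coca-Cula", "Sprlte", "xyzzy"]
print([snap_to_dictionary(t, product_names) for t in ocr_tokens])
```

Tokens with no close match are returned unchanged, so the correction never invents a product that the OCR did not roughly see.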

Paper #3 - Deep Learning based Recommender System: A Survey and New Perspectives

Key Notes

Collaborative filtering makes recommendations by learning from user-item historical interactions, either explicit (e.g. user’s previous ratings) or implicit feedback (e.g. browsing history)
Content-based recommendation is based primarily on comparisons across items’ and users’ auxiliary information
A hybrid model refers to a recommender system that integrates two or more types of recommendation strategies
Strengths of deep learning based recommendation models - Nonlinear Transformation, Sequence Modelling
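
To make the collaborative filtering note concrete, here is a minimal user-based sketch in plain Python: find users with similar explicit ratings and score the items they liked. The users, items and ratings are invented, and this is the classic non-deep baseline, not one of the deep models the survey covers.

```python
from math import sqrt

# Hypothetical explicit feedback: user -> {item: rating}.
ratings = {
    "ann": {"A": 5, "B": 4},
    "bob": {"A": 4, "B": 5, "C": 5},
    "cat": {"B": 1, "D": 5},
}


def cosine(u, v):
    """Cosine similarity between two sparse rating dicts."""
    dot = sum(u[i] * v[i] for i in u if i in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def recommend(user, ratings, k=1):
    """Score items the user has not rated by similarity-weighted ratings
    of the other users, and return the top k item names."""
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], theirs)
        for item, r in theirs.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)[:k]


print(recommend("ann", ratings))
```

Here "bob" rates like "ann", so his extra item "C" outranks item "D" from the less similar "cat".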

Paper #4 - Fashion Retail: Forecasting Demand for New Items

Key Notes

Merchandising Factors - Discount, Visibility, Promotion
Derived Features - Age of Style, Trend and Seasonality, Cannibalisation

Paper #5 - Time Series Forecasting With Deep Learning: A Survey

More Reads

Keep Exploring!!!

October 09, 2021

Technology - Consulting - Coding - Domain Learning

In a senior role, what are the things we can accomplish? I agree with the perspective and the work that is called out: Link

Technical Work

Review the technical design/architecture
Analyze the security/scalability of the design
Collaborate with other technical teams to agree on interfaces and common APIs

People Work

1-1s on a weekly basis
Regular feedback

Plus my own additions

Patenting / Knowledge Sharing
Building your point of view
Be on top of tech - Code as you need

Ongoing

Teaching, mentoring and coaching
Technical conversations and reviewing designs

Plus a perspective on mastering technology vs domain. I like this article

Adding my top reasons to solve problems and not to master tech - Work On Interesting problems not Technologies

Ideas take time and need refinement
As you keep coding, you keep building perspectives
A working prototype creates more interest/excitement, and you keep improving it
Your interest will not die down, as you are solving newly discovered challenges
You will balance scope and features as you spot the unknowns
It's your idea, so you will not kill it :)

On WLB - Link

We collectively create the culture we live in; change comes from healthy WLB
20% of your work produces 80% of your value. Prioritize over priorities

Myth of super performers

I loved the lines below; I have seen this specific behavior: people who deliver but do not share or collaborate within the team. Adding my own perspectives

X is the only developer who gets anything done
Does not actively share knowledge with peers
Good at communication but bad at collaboration
Explains simple things in a complicated way
Good connect at the leadership level: over-communication at the leadership level, limited collaboration at ground level
Instead, making more people productive will reap the greatest benefits
Turn our attention from individuals to groups of people
Don’t mistake humility for ignorance - There are a lot of software engineers out there who won’t express opinions unless asked

Agile principles alternative definition

Empathy for customer needs
Actually getting stuff done
A bird’s-eye view of the product vs market
Able to balance the bird’s-eye view with the product view and the component view

More Reads

20 Things I’ve Learned in my 20 Years as a Software Engineer

Keep Thinking!!!

Dark side of profits

The dark side of analytics - mobile apps - facebook - youtube - amplify the #engagement for the sake of profit
If you aren't the paying customer, you are the product. #google #facebook #android
Anger/ Hate / Excitement / Drugs creates dopamine addiction and keeps the conversation going
For cab-sharing and delivery partners - the illusion of guaranteed income: the variable incentives seem attractive initially, but the mental and physical costs soon take a big toll
High-dopamine, low-effort entertainment (video games, drugs, porn, Netflix) becomes the default way to spend leisure time really quickly

Would I let this happen to my own kid, or do I treat everything outside my home as an untapped market?

As an end-user think

How much time does Zuck spend on FB every day?
Will they let their kids spend as much time on FB as a typical teen does?

Until we recognize that we are in this trap of low-cost internet and virtual addiction, we will never come out of it - low-cost mobile phones, free internet, and engagement, versus leaving behind your goals in life.

The same happens in every other domain. Why don't restaurants hesitate to use outdated/expired products in their food? It boils down to one's own integrity vs profits.

How I Met & Surpassed My Career Goals While Following One Actionable Rule

Keep Thinking!!!

Perspective of Learning

During school days

Why should I learn?
Life will be the same; will I become a driver / cleaner?
What are the ways to quit education? All my friends have started working
I don't apply anything I learn, so why should I learn?
My dad's work and what he learned are not connected

Now

Learn to know the state of the art
Learn to design things
Solve business problems in your own way
Learn to review others' work
Learn to have good domain knowledge
Learn to be employable and make better contributions

Now I learn more sincerely than my school days :) :) :)

NLP - NER - Papers

Paper #1 - Recent Trends in Named Entity Recognition (NER)

Key Notes

‘Named Entity Recognition’ refers to identifying person, organization, location
NER belongs to a general class of problems in NLP called sequence tagging
Prominent supervised learning methods - Hidden Markov Models (HMM), Decision Trees, Maximum Entropy Models (ME)

Unsupervised clustering method using lexical resources, e.g. WordNet
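
Since NER is framed above as sequence tagging, a small sketch of the output side helps: given one tag per token, collect the entity spans. The BIO tagging scheme (B- begins an entity, I- continues it, O is outside) is a common convention assumed here; the notes do not name a scheme, and the sentence and tags are invented.

```python
def bio_to_spans(tokens, tags):
    """Collect (entity_type, text) spans from per-token BIO tags."""
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that was open.
            if current:
                spans.append((label, " ".join(current)))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)  # continue the open entity
        else:
            # "O" or an inconsistent I- tag closes the open entity.
            if current:
                spans.append((label, " ".join(current)))
            current, label = [], None
    if current:
        spans.append((label, " ".join(current)))
    return spans


# Hypothetical tagger output for one sentence.
tokens = ["Jane", "Doe", "works", "at", "Acme", "in", "Paris"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))
```

Whatever model produces the tags (HMM, maximum entropy, or a neural tagger), this decoding step stays the same.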

Paper #2 - A Survey on Deep Learning for Named Entity Recognition

Key Notes

Rule-based approaches, which do not need annotated data as they rely on hand-crafted rules
Unsupervised learning approaches, which rely on unsupervised algorithms
Feature-based supervised learning approaches, which rely on supervised learning algorithms

71% of search queries contain at least one named entity

Word-level representation - continuous bag-of-words (CBOW) and continuous skip-gram models
Commonly used word embeddings include Google Word2Vec, Stanford GloVe, Facebook fastText and SENNA.
CharNER considers a sentence as a sequence of characters and utilizes LSTMs to extract character-level representations.
Besides word-level and character-level representations, some studies also incorporate additional information (e.g., gazetteers [18], [108], lexical similarity [109], linguistic dependency [110] and visual features [111]) into the final representations of words

Paper #3 - Document Ranking for Curated Document Databases using BERT and Knowledge Graph Embeddings: Introducing GRAB-Rank

Key Notes
Curated Document Databases (CDD) play an important role in helping researchers find relevant articles in scientific literature
Document ranking has been extensively used in the context of document retrieval
Recent work on Learning to Rank (LETOR) has used word embeddings of various kinds as the input
Word embeddings can be learnt from scratch or a pre-trained embedding model can be adopted
A popular algorithm for generating vector representations of words is GloVe (Global Vectors for Word Representation), an unsupervised learning algorithm that operates by aggregating global word-word co-occurrence statistics
Semantic document ranking models take into account the context of terms in relation to their neighbouring terms
Context of the word “bank”, either as: (i) an organisation for investing and borrowing money, (ii) the side of a river or lake, (iii) a long heap of some substance
A popular choice of pre-trained contextual model is the Bidirectional Encoder Representations from Transformers (BERT)
An alternative contextual model that can be used is Embeddings from Language Models (ELMo)
A knowledge graph is a collection of vertices and edges where the vertices represent entities or concepts, and the edges represent a relationship between entities and/or concepts.
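
The vertices-and-edges definition maps directly onto (head, relation, tail) triples. A minimal sketch follows, assuming a made-up `KnowledgeGraph` class and toy triples; it only stores and queries the graph and does not learn embeddings the way GRAB-Rank or Pykg2vec do.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Vertices are entities or concepts; edges are labeled relationships."""

    def __init__(self):
        self.edges = defaultdict(list)  # head -> [(relation, tail), ...]
        self.vertices = set()

    def add(self, head, relation, tail):
        """Store one (head, relation, tail) triple."""
        self.vertices.update((head, tail))
        self.edges[head].append((relation, tail))

    def neighbors(self, head, relation=None):
        """Tails reachable from head, optionally filtered by relation."""
        return [
            t for r, t in self.edges.get(head, [])
            if relation is None or r == relation
        ]


# Toy triples, invented for illustration.
kg = KnowledgeGraph()
kg.add("BERT", "is_a", "language model")
kg.add("BERT", "used_for", "document ranking")
kg.add("ELMo", "is_a", "language model")
print(kg.neighbors("BERT", "is_a"))
```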

OIE4KGC (Open Information Extraction for Knowledge Graph Construction)

More Reads

Code Experiments

Compare documents similarity using Python | NLP

Measuring the Document Similarity in Python

How to compute the similarity between two text documents?

How to do semantic document similarity using BERT

How to cluster text documents using BERT

Question answering using transformers and BERT

Calculating Document Similarities using BERT, word2vec, and other models

Pykg2vec representation of entities and relations in Knowledge Graphs

Pykg2vec: Python Library for KGE Methods

Python NLP Tutorial: Building A Knowledge Graph using Python and SpaCy

NER with spacy

Keep Exploring!!!

October 05, 2021

Time series variables and Insights

Good read - Link

Discrete variables - Discrete data is information that can only take certain values. Discrete data refers to individual and countable items (discrete variables). Countable, Point in time data (Bank balance). Looks like clusters, points. The number of customers who bought different items. The number of computers in each department. The number of items you buy at the grocery store each week
Continuous Variables - Continuous data is data that can take any value. Takes any measured value within a specific range. Height, weight, temperature and length are all examples of continuous data. Some continuous data will change over time. Looks like line graphs, continuous.
Univariate analysis is the simplest form of data analysis where the data being analyzed contains only one variable
Bivariate data - This type of data involves two different variables. The analysis deals with causes and relationships, and is done to find out the relationship between the two variables.
Multivariate analysis is the analysis of three or more variables.

Multiple linear regression
Multiple logistic regression
Multivariate analysis of variance (MANOVA)
Factor analysis
Cluster analysis
The aim of multivariate analysis is to find patterns and correlations between several variables simultaneously
Simple regression pertains to one dependent variable and one independent variable
Multiple regression (aka multivariable regression) pertains to one dependent variable and multiple independent variables
Multivariate regression pertains to multiple dependent variables and multiple independent variables
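
As a small worked case of simple regression (one dependent, one independent variable), here is an ordinary least squares fit written out by hand. The data points are invented so that the answer is easy to check; multiple and multivariate regression would need a matrix formulation, which this sketch does not cover.

```python
def fit_simple_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data that lies exactly on y = 1 + 2x.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
intercept, slope = fit_simple_regression(x, y)
print(intercept, slope)
```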

A stationary (time) series is one whose statistical properties such as the mean, variance and autocorrelation are all constant over time. Hence, a non-stationary series is one whose statistical properties change over time.
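
A quick way to build intuition for stationarity is to compare the mean and variance of the first and second half of a series. This is only a crude heuristic with an arbitrary tolerance, assumed here for illustration; it does not check autocorrelation and is not a formal statistical test.

```python
from statistics import mean, pvariance


def half_stats(series):
    """Return (mean, variance) for the first and second half of a series."""
    mid = len(series) // 2
    first, second = series[:mid], series[mid:]
    return (mean(first), pvariance(first)), (mean(second), pvariance(second))


def looks_stationary(series, tol=0.5):
    """Crude check: mean and variance of the two halves differ by at most tol."""
    (m1, v1), (m2, v2) = half_stats(series)
    return abs(m1 - m2) <= tol and abs(v1 - v2) <= tol


flat = [(-1) ** i for i in range(100)]  # oscillates around 0 with constant spread
trend = list(range(100))                # mean keeps rising over time
print(looks_stationary(flat), looks_stationary(trend))
```

The trending series fails because its mean shifts between the halves, which is exactly the "statistical properties change over time" case.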

Keep Exploring!!!

October 04, 2021

Dark Side of Social Media

I was expecting such evidence-based insights to understand social media manipulation.

Key Notes from Link1, Link2

There were conflicts of interest between what was good for the public and what was good for Facebook. And Facebook, over and over again, chose to optimize for its own interests, like making more money
Facebook has realised that if they change the algorithm to be safer, people will spend less time on the site, they'll click on less ads, they'll make less money
“The version of Facebook that exists today is tearing our societies apart and causing ethnic violence around the world,” says former Facebook employee Frances Haugen.

Paper #1 - THE WELFARE EFFECTS OF SOCIAL MEDIA

Adverse outcomes such as suicide and depression appear to have risen sharply over the same period that the use of smartphones and social media has expanded
Social media may create ideological “echo chambers” among like-minded friend groups, thereby increasing political polarization
Deactivating Facebook freed up 60 minutes per day for the average person in our Treatment group
Facebook deactivation significantly reduced news knowledge and attention to politics.

Data Curation paper - Reads

Paper #1 - A Survey on Data Cleaning Methods for Improved Machine Learning Model Performance

Two aspects of data cleaning: what to clean and how to clean

Key Notes

SampleClean: Simulated Clean Data Instances - SampleClean suggests a solution to sampling the raw data that can better present clean data instances.
Approximate Query Processing (AQP). The AQP consists of two steps: first, in Direct Estimate (DE), a set of k rows is sampled randomly and cleaned, and the training result is returned independently of the dirty data. The correction step is used to reweight the sample based on the contribution of the cleaned data
ActiveClean: Incremental Data Cleaning in Convex Models. ActiveClean gradually cleans a dirty dataset to learn a convex-loss model, such as Logistic Regression and Support Vector Machine (SVM).
HoloClean: Holistic Data Repairs With Probabilistic Inference
AlphaClean: Generate-Then-Search Parallel Data Cleaning
CPClean: Reusable Computation in Data Cleaning

ML Papers - Learning-with-Label-Noise

Paper #2 - Advancing Data Curation With Metadata and Statistical Relational Learning

Key Notes

We refer to data science as an umbrella term gathering algorithms and techniques from several disciplines, such as statistics, software engineering, and machine learning
Data is inconsistent, duplicated, stale, incomplete, and/or inaccurate. Data errors, such as outliers, duplicates, missing values, and inconsistencies.
Mapping Metadata to Data Quality Issues
Error Detection
Joint Error Detection and Repair Suggestion

Data Quality fundamentals

The Consistency dimension refers to the validity and integrity of values and tuples with respect to defined inter- and intra-relational constraints that exist within either single or multiple relations
The accuracy dimension identifies correct and true values of the entities presented by data.
Completeness is a degree to which values are included in a data collection
Timeliness dimension reflects the change and update of data by identifying the most current value of an entity in a database
Violations of the core data quality dimensions (Accuracy, Consistency, Uniqueness, Completeness and Timeliness) lead to data quality issues

Metadata is "structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource"

Single-Column Profiling Tasks

Cardinalities refers to the counts of values
Number of rows: the number of entities which are available in the table;
Distinctness: the number of distinct values of the single attribute;
Uniqueness: the ratio of the number of distinct values to the number of rows
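
The three cardinality measures are easy to compute for a single column. A minimal sketch, assuming the column is a plain Python list and using an invented `city` column:

```python
def cardinalities(values):
    """Single-column cardinality metadata: rows, distinctness, uniqueness."""
    rows = len(values)
    distinct = len(set(values))
    return {
        "rows": rows,
        "distinctness": distinct,
        # Ratio of distinct values to rows; 1.0 means every value is unique.
        "uniqueness": distinct / rows if rows else 0.0,
    }


city = ["Pune", "Delhi", "Pune", "Chennai"]  # hypothetical column
print(cardinalities(city))
```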

Value Distribution refers to the distribution of values on the column. This category includes:

Constancy: the ratio between the most frequent value count and the number of rows;
Extreme values: minimum and maximum values in numeric columns; shortest and longest strings in categorical, alphanumeric or text columns;
Histogram: values distribution summary on an attribute
Quartiles: three points that divide numeric distribution into four equal groups;
Inverse distribution: an inverse frequency distribution (a distribution of the frequency distribution);
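
The value distribution measures can be sketched the same way for a numeric column. The `ages` column is invented, and the quartiles use the default method of `statistics.quantiles`, which is one of several conventions; text columns would need length-based extremes instead of min and max.

```python
from collections import Counter
from statistics import quantiles


def value_distribution(values):
    """Value distribution metadata for one numeric column."""
    counts = Counter(values)
    return {
        # Share of rows taken by the most frequent value.
        "constancy": counts.most_common(1)[0][1] / len(values),
        "extremes": (min(values), max(values)),
        "histogram": dict(counts),
        # Three cut points that split the data into four groups.
        "quartiles": quantiles(values, n=4),
        # How many values occur once, twice, three times, ...
        "inverse_distribution": dict(Counter(counts.values())),
    }


ages = [21, 25, 25, 30, 30, 30, 40, 52]  # hypothetical column
profile = value_distribution(ages)
print(profile)
```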

Patterns

Patterns refers to the syntactic properties on the values of the individual column.
Lengths, which specifies the descriptive statistics of the column value lengths
Decimals, which determines the number of decimals in numeric columns

Multi-Column Profiling Tasks

Functional dependencies
What. The first dimension captures common data quality issues and typical data cleaning tasks, which had been found in the literature.
How. The second dimension reflects differently focused data cleaning approaches.

Rule-Based Approaches

Data cleaning rules or integrity constraints to detect and repair various error types in the dataset.
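
One common kind of integrity constraint is a functional dependency, for example "zip determines city". Below is a minimal sketch of rule-based error detection for such a rule. The column names and rows are made up, and the sketch only detects violations; it does not attempt the repair step.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return lhs values that map to more than one rhs value,
    i.e. violations of the functional dependency lhs -> rhs."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: sorted(vals) for key, vals in seen.items() if len(vals) > 1}


# Hypothetical table with one inconsistent city spelling.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newyork"},
    {"zip": "94105", "city": "San Francisco"},
]
print(fd_violations(rows, "zip", "city"))
```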

Statistical Approaches

DEC (Detect-Explore-Clean) framework [22] uses statistical and other analytical techniques, such as the Fleiss’ kappa measure, to compute the glitch score, which identifies and scores the data glitches

Probabilistic and Machine Learning-Based Approaches

The BoostClean system [141] addresses the domain value violations while cleaning training data for predictive models
The HoloClean system [202] considers error detection as a black-box component and expects the specification of integrity constraints-aligned data quality rules to make probabilistic suggestions on how to repair erroneous data values.
Interactive Data Cleaning
Numerous data cleaning systems use crowdsourcing for duplicate detection and resolution

Supervised Error Detection with Metadata

1) an Error Detection Suite, which includes pluggable error detection systems that function as black boxes to our system.

2) a Metadata Profiler Suite, which extracts various metadata categories, and

3) an Aggregation Suite, which combines the output of the error detection suite and the profiler.

Keep Exploring!!!

Probabilistic Forecasting Reads

Paper - Master's Thesis : Comparison of probabilistic forecasting deep learning models in the context of renewable energy production

DeepAR
Wavenet
Transformer
Temporal Fusion Transformer
Prophet

Awesome Reads

Timeseries ML

Code - Link

Naive forecasting models (Naive, Seasonal Naive, Moving Average, etc)
MXNet [10], developed by Amazon Web Services
GluonTS has been developed by an Amazon Web Services team to fill the gap in time series modeling toolkits
MQCNN, MQRNN, NBEATS and Wavenet do not output samples of a distribution function, but quantiles of the distribution itself
NPTS is the implementation of the “Non-Parametric Time Series Forecaster” model
MQCNN is the implementation of one variant of the model described in the paper “A Multi-Horizon Quantile Recurrent Forecaster”
The model Transformer is the implementation of the “Transformer” model architecture, as it was defined in paper [22]. It is described in this paper as “The first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention”
The model Wavenet is the implementation of the “Wavenet” model architecture, as it was defined in paper [23], with a quantized target. This network is composed of dilated causal convolutional layers. Both residual and parameterised skip connections are used throughout the network, to speed up convergence and enable training of much deeper models
DeepAR - learns a global model from the historical data of all time series, applying an LSTM-based recurrent neural network architecture to the probabilistic forecasting problem
Binomial distribution - Two possible outcomes (the prefix “bi” means two, or twice)
Assumptions - Each trial is independent. The probability of success (tails, heads, fail or pass) is exactly the same for each trial
Poisson distribution - Gives us the probability of a given number of events happening in a fixed interval of time
Continuous distribution - data can take on any value within a specified range
Discrete distribution is one in which the data can only take on certain values, for example integers
RNN architecture for probabilistic forecasting, incorporating a negative binomial likelihood
Markov chain Monte Carlo (MCMC) methods comprise a class of algorithms for sampling from a probability distribution.
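
The naive forecasting models listed above are worth having as baselines before trying DeepAR or a Transformer. Here is a minimal sketch of three of them in plain Python; the function names and the `sales` series are assumptions for illustration, not the GluonTS API, and they produce point forecasts only, not probabilistic ones.

```python
def naive(history, horizon):
    """Repeat the last observed value."""
    return [history[-1]] * horizon


def seasonal_naive(history, horizon, season):
    """Repeat the last full season of observations."""
    last = history[-season:]
    return [last[i % season] for i in range(horizon)]


def moving_average(history, horizon, window):
    """Repeat the mean of the last `window` observations."""
    avg = sum(history[-window:]) / window
    return [avg] * horizon


sales = [10, 12, 14, 11, 13, 15]  # hypothetical series with a season of 3
print(naive(sales, 2))
print(seasonal_naive(sales, 4, 3))
print(moving_average(sales, 2, 3))
```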

Spacetimeformer Multivariate Forecasting

Multivariate Time Series Forecasting with Transformers

Top 5 Forecasting Demos

Demand forecasting using RNN with LSTM on PyTorch

Anomaly Detection with PyOD

NGBoost: Natural Gradient Boosting for Probabilistic Prediction

Keep Exploring!!!!
