Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): August 2020

August 29, 2020

My Perspectives on Interpreting Data

Base decisions on data rather than on opinions or perspectives
Data-driven thinking: measure what you can collect and interpret
Stay unbiased and look for missing data
Use logical reasoning when examining skewness, relationships, and trends
Data is everywhere, but interpreting it correctly is a skill. Conveying facts without overselling or missing the point is also a key skill
Use data with caution and be proactive about making changes as conditions change
Be agile: observe, adapt, change, and monitor

Keep Thinking!!!

August 27, 2020

Technology / Job Trends

It looks like a boom, but many skills will converge in the next 5 years.

10 Years Ago

OLTP - Real-time
OLAP - BI

Past 5 Years to Date

Real-time OLAP - Columnstore Databases - Vertica
Data Aggregation - Across SQL, NoSQL, Data lake
In-memory Real-time machine learning - Spark
Data Science - Forecasting, Clustering, Anomaly Detection, Churn, CLV, Recommendations - Built on top of data lakes
Feature Stores - Evolving, with products such as Feast and Hopsworks

Now to Next 5 Years

Real-time analytics - Translytics (Microservices + Shard Data) - Newer forms of data store / Analytics
Dockerized + Kubernetes + KFServing - Everything as an API
Leverage more analytics at every stage of the data pipeline - KSQL (Kafka SQL), Spark ML
Unified feature stores to access real-time data, trends, and ML features. More and more tools will automate everything

The roles of Database Developer, BI Developer, and Data Scientist will start to overlap and create a new set of roles.

Interesting Read - ML feedback

Keep Thinking!!!

August 22, 2020

Research Paper read - Display Advertising with Real-Time Bidding (RTB) and Behavioural Targeting

Key Notes

RTB - Real-time bidding, the mechanism used to buy and sell ads
Key components - Demand-side platform, Supply-side platform, Real-time bidding
Input Signals - Image, Video, Audio

Online Advertising Ecosystem
Figure: the different components of the ecosystem and their interactions

Realtime behavioral targeting

Collect all traits
Monitor and Alert
Bid and reach out with relevant ads

User tracking
A user is typically identified by an HTTP cookie, designed to allow websites to remember the status of an individual user. This includes remembering items added to the cart in an online store or recording the user's previous browsing activities to generate personalized and dynamic content
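
To make the cookie idea concrete, here is a minimal sketch using Python's standard `http.cookies` module. The cookie name `uid`, the one-hour lifetime, and the helper names are my own assumptions for illustration, not something taken from the paper.

```python
from http.cookies import SimpleCookie


def build_set_cookie(user_id: str) -> str:
    """Build the value of a Set-Cookie header that tags a browser with an id."""
    cookie = SimpleCookie()
    cookie["uid"] = user_id          # "uid" is a hypothetical cookie name
    cookie["uid"]["path"] = "/"
    cookie["uid"]["max-age"] = 3600  # illustrative lifetime of one hour
    return cookie["uid"].OutputString()


def read_user_id(cookie_header: str):
    """Recover the id from the Cookie header sent back by the browser."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("uid")
    return morsel.value if morsel else None


# First visit: the site sends the id. Later visits: the browser returns it.
print(build_set_cookie("user-123"))
print(read_user_id("uid=user-123; cart=sku42"))
```

The same id coming back on every request is what lets a site keep the cart and the browsing history tied to one user.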

Personalized workflow

An interesting picture shows how many cookies are present on an NYT page. Cookie syncing is done to keep track of and sync all the cookies of a particular user.

ML Use Case for Click-through rate prediction

Look-alike modeling - on the basis of the learned user profiles, identify and target unknown users who have interests and commercial intents similar to those of the known (converted) customers

Conversion over multiple touchpoints

Key Concepts
CTR, Click-Through Rate - the probability of a specific user in a specific context clicking a specific ad
CVR, Conversion Rate - the probability that a user conversion is observed after the ad impression is shown
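
As a quick worked example, both rates can be estimated from raw counts. This is only the empirical estimate under my own made-up numbers. The paper's models predict these probabilities per user, context, and ad.

```python
def rate(events: int, impressions: int) -> float:
    """Empirical rate: events observed per ad impression shown."""
    if impressions == 0:
        return 0.0
    return events / impressions


# Hypothetical campaign counts, for illustration only
impressions = 10_000
clicks = 120
conversions = 6

ctr = rate(clicks, impressions)       # share of impressions that were clicked
cvr = rate(conversions, impressions)  # share of impressions followed by a conversion
print(f"CTR={ctr:.4f} CVR={cvr:.4f}")
```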

Keep Thinking!!!

Day #334 - Exploring - Featuretools

I have been hearing about feature generation and feature management. There are a couple of tools/frameworks in this space.

For a typical ML product level use case

Who defines the problem - Domain Expert / Product Manager
Who knows the data sources - BA / Database Developer / Product Manager
Raw Data -> Processed data - DB Developer
Data Exploration / Analysis / Feature Creation - BI / DB / ML Developer
Model Development / Validation - ML Developer
Deployment / Monitoring / Improvement - Devops / ML Developer

A feature store handles the part between raw data, data aggregation, and feature generation/feature engineering

Installing Featuretools

Analysis - It is basically like connecting a few tables and computing unique counts and averages. All of that is taken care of once you define the entities. It is like prebuilt analysis based on the identified associations
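
To show the kind of work this automates, here is a small sketch in plain Python. It does not call the Featuretools API. The tables, column names, and the `aggregate_features` helper are hypothetical, and it only mimics the idea of rolling a child table up to its parent with count, mean, and unique aggregations.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical parent and child tables linked by customer_id
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 20.0, "product": "A"},
    {"order_id": 11, "customer_id": 1, "amount": 40.0, "product": "B"},
    {"order_id": 12, "customer_id": 2, "amount": 15.0, "product": "A"},
]


def aggregate_features(parents, children, key, value, category):
    """Roll child rows up to one feature row per parent."""
    grouped = defaultdict(list)
    for row in children:
        grouped[row[key]].append(row)

    features = {}
    for parent in parents:
        rows = grouped.get(parent[key], [])
        features[parent[key]] = {
            "count": len(rows),
            "mean_" + value: mean(r[value] for r in rows) if rows else None,
            "num_unique_" + category: len({r[category] for r in rows}),
        }
    return features


features = aggregate_features(customers, orders, "customer_id", "amount", "product")
print(features)
```

Once the relationship between the two tables is declared, the aggregations follow mechanically, which is the part a feature generation tool takes off your hands.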

Experimenting with this on Colab - Colab notebook link

From Link, feature comparison between different feature stores

Paper - Link

Key Notes

Handling Data Ingestion
Aggregating data from diverse sources
Access controlled and versioned

Key Offerings

Automated Feature Generation
Access to generated features
Data Privacy / Data Governance
Data Visualization

My Thoughts

Today, with the cloud trend, all the data (OLTP, OLAP, SQL, NoSQL) sits next to each other
Generating reports that aggregate all sources in near real time is possible
Some features/variables can be pulled from OLTP tables
In a data lake / DW, some of the insights would already be present in computed reports
Metadata management would already be available in the system and would handle data quality aspects
ML systems will work together as part of a larger data ecosystem comprising OLTP, OLAP, SQL, and NoSQL systems. A lot of feature store workloads are already handled by other pieces.

More Reads

Keep Thinking!!

August 21, 2020

GCP - VM - Remote Jupyter Access

Steps provided in CS231 were perfect to try out

1. "Allow HTTP traffic" and "Allow HTTPS traffic"

2. Enable Static IP

3. Create Firewall rule

4. Jupyter configuration Update

Ref2 - Config File Update

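	Step #1 - Open the Jupyter config file
	========
	vi ~/.jupyter/jupyter_notebook_config.py

	Step #2 - Press i (insert mode), add the lines below, then :wq to save and quit
	========
	c = get_config()
	c.NotebookApp.ip = '*'              # listen on all network interfaces
	c.NotebookApp.open_browser = False  # do not try to open a browser on the VM
	c.NotebookApp.port = 8888           # should match the port allowed by the firewall rule

	Step #3 - Start the notebook server
	========
	jupyter notebook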

Happy Learning!!!

August 19, 2020

Research paper read - Serverless inferencing on Kubernetes

Serverless inferencing on Kubernetes

Key Notes

Uses the Knative serverless paradigm to provide a machine learning inference solution
Frameworks - MLFlow, Kubeflow

Deployment / Inference Challenges

Handling multiple machine learning frameworks in a consistent manner.
Updating running models with new versions.
Scaling models appropriately with constraints.
Monitoring models.
Canaries allow users to split a small percentage of traffic to their new model
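
A canary split can be pictured as weighted routing. The sketch below is my own illustration in plain Python, not KFServing code. In KFServing the split is configured declaratively, and the function name and the 10% figure here are assumptions.

```python
import random


def choose_model(canary_percent: float, rng: random.Random) -> str:
    """Send roughly canary_percent of requests to the new model."""
    return "canary" if rng.random() * 100 < canary_percent else "default"


rng = random.Random(0)  # fixed seed so the run is repeatable
counts = {"default": 0, "canary": 0}
for _ in range(10_000):
    counts[choose_model(10, rng)] += 1

print(counts)  # about one request in ten should reach the canary
```

If the canary's metrics look healthy, the percentage is raised step by step until the new model takes all the traffic.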

KFServing

KFServing is a project that was created within Kubeflow
Transformers allow focused data transformations of the request and response from the model

Example #1
Provide Inference Location

Create a storage initializer to download the artifacts from any popular storage (Google Storage, Amazon S3, Azure, local disk) and load onto the server.
Wire up networking so an endpoint is made available for inference requests

Example #2
Canary Location

Monitoring and explainability of models in production
Success Metrics for ML Model
1. Monitoring model performance
2. Monitoring metrics related to incoming data
3. Detecting outliers and drift
4. Explaining model predictions

Key aspects
A monitoring system requires functionality to determine when significant changes to the data and predictive distributions happen

Seldon Core provides a dedicated /send-feedback API endpoint accepting labels and performing user-defined metric calculations

Drift Detector - The goal of the drift detector is to identify when the distribution of the requests for the deployed model starts to diverge from the training data and model predictions
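
One simple way to picture drift detection on a single numeric feature is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical distributions. This is a technique I picked for illustration. The paper does not prescribe it, and the 0.3 threshold and the sample data are assumptions.

```python
from bisect import bisect_right


def ks_statistic(reference, live):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    ref = sorted(reference)
    cur = sorted(live)
    max_gap = 0.0
    for x in ref + cur:
        ref_cdf = bisect_right(ref, x) / len(ref)
        cur_cdf = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(ref_cdf - cur_cdf))
    return max_gap


def has_drifted(reference, live, threshold=0.3):
    """Flag drift when the gap exceeds an illustrative threshold."""
    return ks_statistic(reference, live) > threshold


training_values = list(range(100))               # feature values seen in training
similar_requests = [x + 0.5 for x in range(100)]  # live data, nearly the same shape
shifted_requests = [x + 50 for x in range(100)]   # live data moved upwards

print(has_drifted(training_values, similar_requests))
print(has_drifted(training_values, shifted_requests))
```

A production detector would run over windows of requests and many features, but the core question is the same: has the live distribution moved away from the reference?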

Model Monitoring - a Knative broker which can farm these out as desired via programmable triggers to serverless components such as outlier, drift and adversarial detection

More Reads - Minio - High performance object storage

Keep Thinking!!!

Research Paper Reads - MODELING USERS FOR ONLINE ADVERTISING

Paper #1 - MODELING USERS FOR ONLINE ADVERTISING

Key Notes

Contribution - a neural network model (app2vec) to vectorize mobile apps by studying how users employ these apps

Data Collected from Users

User activity data
User behaviors
Logging user activities
Contents consumed by users
Anonymous browser cookie syncing technique

Ad Platforms

Targeting audiences
User profiling
Ads based on their activity history across the web

Findings

Users watching polymorphic videos are likely to have similar interests

Insights

US mobile users download more than eight apps per month on average
90% of the time spent on mobile devices was spent using apps

Online Ad Targeting

Data - User browsing, app usage, and other activities on the Internet
Targeting - site/page context, placement size, user behavior and geolocation

User Targeting
Publishers, Advertisers, Ad-networks, Online users

Research Directions

Cross-device user tracking - Users access online content through multiple devices
Value of user profile - Different costs associated with them, Ad targeting on user profile

Observe User Online Advertising Profile and Ad Targeting
Do ads target user profiles in the field?
What are the ads shown to different users?
How do ads impact users' profiles?

Data - The capability to gather display ads and video ads from across the web is central to our work
Profile-driven crawling - Enables each crawler instance to interact with the ad ecosystem as though it were a unique user with particular characteristics.

The Anatomy of Online Advertising

Advertisers - Advertisers reach out to potential customers.
Publisher View - premium campaigns (specific advertisers, ad networks, ad exchanges)

Types of ads - Text Ads, Display Ads, Stream Ads, Video Ads
Video ads - Pre-roll, mid-roll, post-roll, overlay ads, sponsored videos

User Modeling on Mobile

app2vec to represent apps in a vector space without a priori knowledge of their semantics
app2vec to cluster apps based on app distances in their vector space
App similarity is computed through the bag-of-words method using app meta information
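
For reference, a bag-of-words similarity over app descriptions can be sketched as below. The descriptions are made up, and cosine similarity over word counts is my assumption of how the comparison would be done. app2vec itself learns vectors from usage instead of from this kind of metadata.

```python
from collections import Counter
from math import sqrt


def bag_of_words(text: str) -> Counter:
    """Word counts of a description, ignoring case."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical app descriptions
photo_app = bag_of_words("photo editor filters camera")
camera_app = bag_of_words("camera photo filters stickers")
bank_app = bag_of_words("banking payments transfers")

print(cosine_similarity(photo_app, camera_app))
print(cosine_similarity(photo_app, bank_app))
```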

Large Scale Look-alike Audience Modeling

A simple similarity-based look-alike system can use direct user-to-user similarity to search for users that look like (in other words, are similar to) the seeds
Another type of look-alike audience system for online advertising is built with Logistic Regression (LR)
User segments can be user characteristics such as user interest categories.
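
The similarity-based variant above can be sketched in a few lines: score every candidate by its closest seed and rank. The vectors, user ids, and the choice of cosine similarity with a max over seeds are assumptions for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def look_alike_scores(seeds, candidates):
    """Score each candidate by its similarity to the closest seed user."""
    return {
        user_id: max(cosine(vector, seed) for seed in seeds)
        for user_id, vector in candidates.items()
    }


# Hypothetical interest vectors (for example, affinity to three categories)
seeds = [[1.0, 0.0, 1.0], [0.9, 0.1, 0.8]]
candidates = {
    "u1": [1.0, 0.0, 0.9],
    "u2": [0.0, 1.0, 0.0],
    "u3": [0.5, 0.5, 0.5],
}

scores = look_alike_scores(seeds, candidates)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

The top of the ranked list is the extended audience. Comparing every candidate with every seed gets expensive at scale, which is where the LR-based and clustering-based approaches come in.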

Real-time Attention Based Look-alike Model for Recommender System
Key Notes

Real-time attention based look-alike model (RALM) for recommender systems
Deep neural networks (DNNs) and recurrent neural networks (RNNs) are increasingly popular for recommendation tasks
"Matthew effect" - low quality and poor diversity of recommended contents.

RALM

RALM is a similarity based look-alike model, which consists of user representation learning and look-alike learning
Deep interest network for multifields user interests representation learning
Local representation of seeds should be processed online in real-time
k-means clustering to partition seeds into k clusters
Similarity based methods determine similarity between seeds and users based on distance measurement.
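
As a rough picture of the clustering step, here is a tiny k-means over seed embeddings. The 2-D points, Euclidean distance, and taking the first k points as initial centroids are my simplifications. The paper's actual setup is not reproduced here.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Partition points into k clusters. Initial centroids are the first k points."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        for i, point in enumerate(points):
            assignments[i] = min(
                range(k), key=lambda c: squared_distance(point, centroids[c])
            )
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


# Hypothetical 2-D seed embeddings forming two obvious groups
seed_embeddings = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
assignments, centroids = kmeans(seed_embeddings, k=2)
print(assignments)
print(centroids)
```

Working with k centroids instead of every seed keeps the number of comparisons per user small.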

System Architecture
Offline Training

User representation learning - The user representation model is based on a deep learning network
Look-alike learning is based on an attention model and a clustering algorithm

Online asynchronous processing

User feedback monitor: The audience extension system updates the seeds of candidates through monitoring the click behaviors of all WeChat users in real-time
Online serving - The look-alike model predicts the global embedding of seeds through a global attention unit

Metrics

CTR (Click-through Rate): As the audience is extended, many new users sharing the same interests as the seeds are reached. Therefore, CTR is expected not to decrease
Category & Diversity: One of the purposes is to enrich users' interests in the system, so the paper defines a metric named diversity. It is represented by the number of content categories or tags a user has read in a day. With a more comprehensive user representation, more kinds of content will be reached, and category and tag diversity is expected to increase
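
The diversity metric is easy to compute from a read log. A minimal sketch, assuming each event is a (user, day, category) tuple. The event format and the sample data are hypothetical.

```python
from collections import defaultdict


def daily_diversity(read_events):
    """Number of distinct categories each user read on each day."""
    seen = defaultdict(set)
    for user_id, day, category in read_events:
        seen[(user_id, day)].add(category)
    return {key: len(categories) for key, categories in seen.items()}


# Hypothetical read log: (user, day, content category)
events = [
    ("u1", "2020-08-19", "sports"),
    ("u1", "2020-08-19", "tech"),
    ("u1", "2020-08-19", "sports"),  # repeat category, not counted twice
    ("u2", "2020-08-19", "news"),
    ("u1", "2020-08-20", "tech"),
]

diversity = daily_diversity(events)
print(diversity)
```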

More Reads
Comprehensive Audience Expansion based on End-to-End Neural Prediction

Keep Thinking!!!

August 14, 2020

Download GCP Storage files

Download files from GCP storage bucket

	#Google Cloud Shell
	#Step 1 - Create a local folder
	mkdir Data

	#Step 2 - Copy from the bucket (-m parallel, -R recursive)
	gsutil -m cp -R gs://BUCKET_NAME/FOLDER_OR_FILE_PATH ./Data

	#Step 3 - Zip the folder
	zip -r Data.zip Data

	#Step 4 - Download from Cloud Shell
	dl Data.zip

	#Alternative - Push the zip to a bucket and download it from there
	gsutil mv Data.zip gs://my_bucket/


	#Step 5 - Cleanup
	rm -f Data.zip
	rm -r Data

	#Ref -https://stackoverflow.com/questions/11640637/download-files-and-folders-from-google-storage-bucket-to-a-local-folder
	#Limit is 5GB

Happy Learning!!!
