"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

August 29, 2020

My Perspectives on Interpreting Data

  • Data-based decisions rather than opinions/perspectives
  • Data-Driven Thinking / Measure what you can collect/interpret
  • Staying unbiased / finding missing data
  • Use logical decisions / thoughts for skewness / relationships / trends
  • Data is everywhere, but interpreting it correctly is a skill. Conveying facts without overselling or missing the point is also a key skill
  • Use data with caution and be proactive about making changes as conditions change
  • Agile, Observe, Adapt, Change and Monitor
Keep Thinking!!!

August 27, 2020

Technology / Job Trends

It looks like a boom, but many skills will converge in the next 5 years.
10 Years Back
  • OLTP - Real-time
  • OLAP - BI
Past 5 Years to Date
  • Real-time OLAP - Columnstore Databases - Vertica
  • Data Aggregation - Across SQL, NoSQL, Data lake
  • In-memory Real-time machine learning - Spark
  • Data Science - Forecasting, Clustering, Anomaly, Churn, CLV, Recommendations - Built on top of Data lakes
  • Feature Stores - Evolving / Embracing products - Feast, Hopsworks
Now to Next 5 Years
  • Real-time analytics - Translytics (Microservices + Shard Data) - Newer forms of data store / Analytics
  • Dockerized + Kubernetes + KFServing - Everything as API
  • Leverage more analytics at every stage of data pipeline - KSQL (Kafka SQL), Spark ML
  • Unified FeatureStores to access - Realtime, Trends, ML Features, More and more tools will automate everything
The roles of Database Developer, BI Developer, and Data Scientist will start to overlap and create a new set of roles.

Interesting Read - ML feedback 

Keep Thinking!!!

August 22, 2020

Research Paper read - Display Advertising with Real-Time Bidding (RTB) and Behavioural Targeting

Key Notes
  • RTB - Real-time bids. The mechanism to buy and sell ads
  • Key components - Demand-side platform, Supply-side platform, Real-time bidding
  • Input Signals - Image, Video, Audio
Online Advertising Ecosystem
The different components and their interactions are displayed in the picture below

Realtime behavioral targeting
  • Collect all traits
  • Monitor and Alert
  • Bid and reach out with relevant ads
User tracking
A user is typically identified by an HTTP cookie, designed to allow websites to remember the status of an individual user, including remembering shopping items added in the cart in an online store or recording the user’s previous browsing activities for generating personalized and dynamical content

Personalized workflow
This is an interesting picture. It shows how many cookies are present on an NYT page. Cookie syncing is done to keep track of and sync all cookies of a particular user.


ML Use Case for Click-through rate prediction
Look-alike modeling - on the basis of the learned user profiles, identify and target unknown users who have interests and commercial intents similar to those of the known (converted) customers

Conversion over multiple touchpoints
Key Concepts
CTR, Click-Through Rate - the probability of a specific user in a specific context clicking a specific ad
CVR, Conversion Rate - the probability that a user conversion is observed after the ad impression is shown
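
As a minimal sketch, both rates can be estimated as observed frequencies over an impression log. The log layout and the field names `clicked` and `converted` are assumptions for illustration. The paper treats CTR and CVR as probabilities predicted per user, context and ad, whereas this only computes aggregate frequencies, with CVR counted per impression as in the definition above.

```python
def empirical_rates(impressions):
    """Return (ctr, cvr) as observed frequencies over impression records.

    Each record is a dict with boolean fields 'clicked' and 'converted'
    (hypothetical field names, not taken from the paper).
    """
    n = len(impressions)
    if n == 0:
        return 0.0, 0.0
    clicks = sum(1 for imp in impressions if imp["clicked"])
    conversions = sum(1 for imp in impressions if imp["converted"])
    return clicks / n, conversions / n


# Toy log: four impressions of one ad, two clicks, one conversion.
log = [
    {"ad": "a1", "clicked": True, "converted": False},
    {"ad": "a1", "clicked": False, "converted": False},
    {"ad": "a1", "clicked": True, "converted": True},
    {"ad": "a1", "clicked": False, "converted": False},
]
ctr, cvr = empirical_rates(log)
print(f"CTR={ctr:.2f} CVR={cvr:.2f}")
```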

Keep Thinking!!!

Day #334 - Exploring - Featuretools

I have been hearing about feature generation and feature management. There are a couple of tools/frameworks in this space.

For a typical ML product level use case
  • Who defines the problem - Domain Expert / Product Manager
  • Who knows the data sources - BA / Database Developer / Product Manager
  • Raw Data -> Processed data - DB Developer
  • Data Exploration / Analysis / Feature Creation - BI / DB / ML Developer
  • Model Development / Validation - ML Developer
  • Deployment / Monitoring / Improvement - Devops / ML Developer
A feature store handles the part between raw data, data aggregation, and feature generation/feature engineering

Installing Featuretools


Analysis - It is basically like connecting a few tables and computing aggregates such as unique counts and averages. All of that is taken care of once you define the entities. It is like a prebuilt analysis based on the identified associations
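
To make the idea concrete, here is a rough plain-Python sketch of the kind of aggregation that gets generated once a parent/child relationship is known. It deliberately does not use the Featuretools API. The tables, column names and the helper `aggregate_child_features` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate_child_features(parents, children, parent_key, value_col, category_col):
    """For each parent row, compute COUNT, MEAN and NUM_UNIQUE over its child rows."""
    grouped = defaultdict(list)
    for row in children:
        grouped[row[parent_key]].append(row)

    features = {}
    for parent in parents:
        rows = grouped.get(parent[parent_key], [])
        features[parent[parent_key]] = {
            "COUNT": len(rows),
            f"MEAN({value_col})": mean(r[value_col] for r in rows) if rows else None,
            f"NUM_UNIQUE({category_col})": len({r[category_col] for r in rows}),
        }
    return features


# Hypothetical parent table (customers) and child table (transactions).
customers = [{"customer_id": 1}, {"customer_id": 2}]
transactions = [
    {"customer_id": 1, "amount": 10.0, "product": "book"},
    {"customer_id": 1, "amount": 30.0, "product": "pen"},
    {"customer_id": 2, "amount": 5.0, "product": "book"},
]
features = aggregate_child_features(
    customers, transactions, "customer_id", "amount", "product"
)
print(features[1])
```

The point of the sketch is that the caller only declares which column links the two tables. The count, mean and unique-count features follow mechanically from that association.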

Experimenting with this on Colab - Colab notebook link
From Link, a feature comparison between different feature stores

Paper - Link
Key Notes
  • Handling Data Ingestion 
  • Aggregating data from diverse sources
  • Access controlled and versioned
Key Offerings
  • Automated Feature Generation
  • Access to generated feature
  • Data Privacy / Data Governance
  • Data Visualization
My Thoughts
  • Today, with the cloud trend, all the data (OLTP, OLAP, SQL, NoSQL) sits next to each other
  • Generating reports aggregating all sources in near real-time fashion is possible
  • Some features/variables can be pulled from OLTP tables
  • In a Data Lake / DW, Some of the insights would be already present in computed reports
  • Metadata management would already be available in the system which will handle data quality aspects
  • ML systems will work together as part of a larger data ecosystem comprising OLTP, OLAP, SQL, and NoSQL systems. A lot of feature store workloads are already handled by other pieces.
More Reads
Keep Thinking!!

August 21, 2020

GCP - VM - Remote Jupyter Access

Steps provided in CS231 were perfect to try out

1. "Allow HTTP traffic" and "Allow HTTPS traffic"
2. Enable Static IP
3. Create Firewall rule
4. Jupyter configuration Update
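
For step 4, the config file is itself Python. The fragment below is a sketch of the settings that typically matter for remote access. The file location, the port value and the choice of `0.0.0.0` are assumptions, and the option names are those of the classic Notebook server (newer Jupyter Server releases read the same options under `c.ServerApp`).

```python
# ~/.jupyter/jupyter_notebook_config.py (assumed location)
c = get_config()  # injected by Jupyter when it loads this file

c.NotebookApp.ip = "0.0.0.0"        # listen on all interfaces, not just localhost
c.NotebookApp.open_browser = False  # the VM has no desktop browser to open
c.NotebookApp.port = 8888           # placeholder: must match the port opened in the firewall rule
```

If the file does not exist yet, `jupyter notebook --generate-config` creates a commented template to edit.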

Ref2 - Config File Update

Happy Learning!!!

August 19, 2020

Research paper read - Serverless inferencing on Kubernetes

Serverless inferencing on Kubernetes

Key Notes
  • KNative serverless paradigm to provide a serverless machine learning inference solution
  • Frameworks - MLFlow, Kubeflow
Deployment / Inference Challenges
  • Handling multiple machine learning frameworks in a consistent manner.
  • Updating running models with new versions.
  • Scaling models appropriately with constraints.
  • Monitoring models.
  • Canaries allow users to split a small percentage of traffic to their new model
KFServing
  • KFServing is a project that was created within the Kubeflow project
  • Transformers allow focused data transformations of the request and response from the model
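
The transformer idea can be sketched as a wrapper that pre-processes the request, calls the model, and post-processes the response. This is an illustration only, not the KFServing API. The class `ScalingTransformer`, the toy predictor, the labels and the `instances`/`predictions` payload shape are all assumptions.

```python
class ScalingTransformer:
    """Wraps a predictor with request pre-processing and response post-processing."""

    def __init__(self, predictor, scale=255.0, labels=("cat", "dog")):
        self.predictor = predictor
        self.scale = scale
        self.labels = labels

    def preprocess(self, request):
        # Normalise raw pixel values into [0, 1] before they reach the model.
        return {
            "instances": [[v / self.scale for v in inst] for inst in request["instances"]]
        }

    def postprocess(self, response):
        # Map each score vector to the label with the highest score.
        return {
            "predictions": [
                self.labels[max(range(len(scores)), key=scores.__getitem__)]
                for scores in response["predictions"]
            ]
        }

    def predict(self, request):
        return self.postprocess(self.predictor(self.preprocess(request)))


def toy_predictor(request):
    # Stand-in model: the "dog" score is the mean input, "cat" is its complement.
    predictions = []
    for inst in request["instances"]:
        m = sum(inst) / len(inst)
        predictions.append([1.0 - m, m])
    return {"predictions": predictions}


service = ScalingTransformer(toy_predictor)
result = service.predict({"instances": [[0, 51], [255, 204]]})
print(result)
```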

Example #1 
Provide Inference Location
  • Create a storage initializer to download the artifacts from any popular storage (Google Storage, Amazon S3, Azure, local disk) and load onto the server.
  • Wire up networking so an endpoint is made available for inference requests
Example #2
Canary Location
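
The canary behaviour noted earlier (sending a small percentage of traffic to the new model) can be sketched in a few lines. This is a toy router under stated assumptions, not how KFServing implements the split. The function `make_canary_router` and the two lambda "models" are hypothetical.

```python
import random


def make_canary_router(default_model, canary_model, canary_percent, rng=None):
    """Return a function that sends roughly canary_percent% of requests to the canary."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    rng = rng or random.Random()

    def route(request):
        # rng.random() is in [0, 1), so 0 never picks the canary and 100 always does.
        use_canary = rng.random() * 100 < canary_percent
        model = canary_model if use_canary else default_model
        return model(request)

    return route


route = make_canary_router(
    default_model=lambda req: ("v1", req),
    canary_model=lambda req: ("v2", req),
    canary_percent=10,
    rng=random.Random(0),  # seeded only to make the demo repeatable
)
served_by = [route(i)[0] for i in range(10_000)]
canary_share = served_by.count("v2") / len(served_by)
print(f"canary share: {canary_share:.3f}")
```

Comparing the metrics of the two groups before raising the percentage is what makes the rollout safe. The router itself only decides who sees which version.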

Monitoring and explainability of models in production
Success Metrics for ML Model
1. Monitoring model performance
2. Monitoring metrics related to incoming data
3. Detecting outliers and drift
4. Explaining model predictions

Key aspects
A monitoring system requires functionality to determine when significant changes to data and predictive distributions happen

Seldon Core provides a dedicated /send-feedback API endpoint accepting labels and performing user-defined metric calculations

Drift Detector - The goal of the drift detector is to identify when the distribution of the requests to the deployed model starts to diverge from the training data and model predictions
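
One simple way to picture drift detection is to compare a single numeric feature in the training data against recent requests using the two-sample Kolmogorov-Smirnov statistic. This is my own minimal sketch, not the detector described in the paper. The threshold of 0.3 and the toy data are arbitrary assumptions.

```python
from bisect import bisect_right


def ks_statistic(reference, live):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    ref, cur = sorted(reference), sorted(live)
    gap = 0.0
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def has_drifted(reference, live, threshold=0.3):
    # The threshold is an arbitrary illustration value, not a recommended setting.
    return ks_statistic(reference, live) > threshold


training_values = list(range(100))               # feature values seen in training
similar_requests = list(range(1, 101))           # nearly the same distribution
shifted_requests = [v + 50 for v in range(100)]  # distribution moved upwards

print(has_drifted(training_values, similar_requests))
print(has_drifted(training_values, shifted_requests))
```

A production detector would track many features and the prediction distribution as well, and would pick the threshold from a significance level rather than a fixed constant.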

Model Monitoring - request payloads go to a KNative broker, which can farm them out as desired via programmable triggers to serverless components such as outlier, drift and adversarial detection

More Reads - Minio - High performance object storage

Keep Thinking!!!

Research Paper Reads - MODELING USERS FOR ONLINE ADVERTISING

Paper #1 - MODELING USERS FOR ONLINE ADVERTISING

Key Notes
  • Contribution - a neural network model (app2vec) to vectorize mobile apps by studying how users employ these apps
Data Collected from Users
  • User activity data
  • User behaviors
  • Logging user activities
  • Contents consumed by users
  • Anonymous browser cookie syncing technique
Ad Platforms
  • Targeting audiences
  • User profiling
  • Ads based on their activity history across the web
Findings
  • Users watching polymorphic videos are likely to have similar interests
Insights
  • US mobile users download more than eight apps per month on average
  • 90% of the time spent on mobile devices was spent using apps
Online Ad Targeting
  • Data - users' browsing, app usage, and other activities on the Internet
  • Targeting - site/page context, placement size, user behavior and geolocation
User Targeting
Publishers, Advertisers, Ad-networks, Online users

Research Directions
  • Cross-device user tracking - Users access online content through multiple devices
  • Value of user profile - Different costs associated with them, Ad targeting on user profile
Observe User Online Advertising Profile and Ad Targeting
Do ads target user profiles in the field?
What are the ads shown to different users?
How do ads impact users profiles?

Data - The capability to gather display ads and video ads from across the web is central to our work
Profile-driven crawling - Enables each crawler instance to interact with the ad ecosystem as though it were a unique user with particular characteristics.
The Anatomy of Online Advertising
  • Advertisers - Advertisers reach out to potential customers.
  • Publisher View - premium campaigns (specific advertisers, ad networks, ad exchanges)
Types of ads - Text Ads, Display Ads, Stream Ads, Video Ads
Video ads - Pre-roll, mid-roll, post-roll, Overlay ads, Sponsored Videos

User Modeling on Mobile
  • app2vec to represent apps in a vector space without a priori knowledge of their semantics
  • app2vec to cluster apps based on app distances in their vector space
  • App similarity is computed through the bag-of-words method using app meta information
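
The bag-of-words approach to app similarity can be sketched as word-count vectors over app descriptions compared with cosine similarity. The app names and descriptions below are made up, and this is not app2vec itself, which learns vectors from how users employ the apps.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Turn a description into a word-count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical app meta information (name -> description).
apps = {
    "RunTracker": "track running distance pace fitness",
    "GymLog": "log gym workouts fitness strength",
    "PixelPaint": "draw paint sketch art canvas",
}
vectors = {name: bag_of_words(desc) for name, desc in apps.items()}
sim_fitness = cosine_similarity(vectors["RunTracker"], vectors["GymLog"])
sim_art = cosine_similarity(vectors["RunTracker"], vectors["PixelPaint"])
print(sim_fitness, sim_art)
```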
Large Scale Look-alike Audience Modeling
  • A simple similarity-based look-alike system can use direct user-2-user similarity to search for users that look like (in other words, are similar to) the seeds
  • Another type of look-alike audience system for online advertising is built with Logistic Regression (LR)
  • User segments can be user characteristics such as user interest categories. 
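
A minimal sketch of the similarity-based variant: score every candidate by its best cosine similarity to any seed user and keep the top few. The interest vectors, user ids and the helper `look_alike` are hypothetical. A real system would need approximate nearest-neighbour search to work at scale.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def look_alike(seeds, candidates, top_k=2):
    """Rank candidate users by their best cosine similarity to any seed user."""
    scored = [
        (max(cosine(vec, seed) for seed in seeds.values()), user)
        for user, vec in candidates.items()
    ]
    scored.sort(reverse=True)
    return [user for _, user in scored[:top_k]]


# Hypothetical interest vectors: [sports, travel, finance]
seeds = {"s1": [0.9, 0.1, 0.0], "s2": [0.8, 0.2, 0.1]}
candidates = {
    "u1": [0.7, 0.3, 0.0],
    "u2": [0.0, 0.1, 0.9],
    "u3": [0.9, 0.0, 0.1],
    "u4": [0.1, 0.9, 0.1],
}
audience = look_alike(seeds, candidates, top_k=2)
print(audience)
```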

Real-time Attention Based Look-alike Model for Recommender System
Key Notes
  • Real-time attention based look-alike model (RALM) for recommender systems
  • Deep neural networks (DNNs) and recurrent neural networks (RNNs) are increasingly popular for recommendation tasks
  • "Matthew effect" - low quality and poor diversity of recommended contents.

RALM
  • RALM is a similarity based look-alike model, which consists of user representation learning and look-alike learning
  • Deep interest network for multifields user interests representation learning
  • Local representation of seeds should be processed online in real-time
  • k-means clustering to partition seeds into k clusters
  • Similarity based methods determine similarity between seeds and users based on distance measurement.
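
The clustering-plus-attention idea can be pictured with a toy example: partition seed embeddings with k-means, then let a target user attend over the cluster centroids. This is a loose sketch, not the RALM architecture. The dot-product softmax stands in for the learned attention units, and the first-k initialisation and 2-D embeddings are simplifying assumptions.

```python
import math


def kmeans(points, k, iters=10):
    """Plain k-means; centroids start at the first k points (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def attention_pool(query, centroids):
    """Softmax over dot-product scores, then a weighted sum of the centroids."""
    scores = [sum(q * c for q, c in zip(query, cen)) for cen in centroids]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * cen[d] for w, cen in zip(weights, centroids))
        for d in range(len(query))
    ]
    return weights, pooled


# Hypothetical 2-D seed embeddings forming two interest groups.
seed_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centroids = kmeans(seed_embeddings, k=2)
weights, seed_summary = attention_pool([1.0, 0.0], centroids)
print(centroids, weights)
```

Working with k centroids instead of every seed keeps the online attention step cheap, which is the practical reason for clustering the seeds first.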

System Architecture
Offline Training
  • User representation learning. The user representation model is developed based on a deep learning network
  • Look-alike learning is based on an attention model and a clustering algorithm
Online asynchronous processing
  • User feedback monitor: The audience extension system updates the seeds of candidates through monitoring the click behaviors of all WeChat users in real-time
  • Online serving: The look-alike model predicts the global embedding of seeds through a global attention unit

Metrics
  • CTR (Click-through Rate): As the audience is expanded, many new users sharing the same interests with the seeds are reached. Therefore, CTR is expected not to decrease
  • Category & Diversity. One of our purposes is enriching user’s interest in our system, so we define a metric named diversity. It is represented by a number of content categories or tags a user has read in a day. With a more comprehensive user representation, more kinds of contents will be reached and category&tag diversity is expected to increase
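
The diversity metric as described (number of distinct content categories or tags a user has read in a day) is straightforward to compute from read events. The event tuple layout `(user, day, category)` is an assumption for this sketch.

```python
from collections import defaultdict


def daily_diversity(read_events):
    """Map (user, day) to the number of distinct categories read that day."""
    seen = defaultdict(set)
    for user, day, category in read_events:
        seen[(user, day)].add(category)
    return {key: len(categories) for key, categories in seen.items()}


# Hypothetical read log: (user, day, category)
events = [
    ("u1", "2020-08-19", "sports"),
    ("u1", "2020-08-19", "sports"),
    ("u1", "2020-08-19", "tech"),
    ("u1", "2020-08-20", "finance"),
    ("u2", "2020-08-19", "travel"),
]
diversity = daily_diversity(events)
print(diversity)
```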
More Reads
Comprehensive Audience Expansion based on End-to-End Neural Prediction

Keep Thinking!!!

August 14, 2020

Download GCP Storage files

Download files from GCP storage bucket
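
A minimal sketch of one way to do this with the `google-cloud-storage` Python client, assuming the package is installed and Application Default Credentials are set up. The helper name `download_bucket_files`, its `client` parameter and the bucket/prefix names in the usage comment are hypothetical.

```python
from pathlib import Path


def download_bucket_files(bucket_name, dest_dir, prefix="", client=None):
    """Download every object under `prefix` from a GCS bucket into dest_dir."""
    if client is None:
        # Imported lazily so the helper can also be exercised with a stand-in client.
        from google.cloud import storage

        client = storage.Client()  # picks up Application Default Credentials

    dest = Path(dest_dir)
    downloaded = []
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        if blob.name.endswith("/"):  # skip "folder" placeholder objects
            continue
        target = dest / blob.name
        target.parent.mkdir(parents=True, exist_ok=True)
        blob.download_to_filename(str(target))
        downloaded.append(target)
    return downloaded


# Usage (hypothetical bucket and prefix):
# download_bucket_files("my-bucket", "downloads", prefix="data/")
```

Objects keep their bucket path under the destination directory, so `data/a.csv` lands at `downloads/data/a.csv`.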


Happy Learning!!!