"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

November 07, 2023

YouTube recommendation systems

Some papers need more iterations to learn the basics. Summary from link 

Patterns

  • Episodic series are usually watched sequentially
  • Users often discover artists in a genre beginning with the most broadly popular before focusing on smaller niches
  • Balancing new content against well-established videos is an exploration/exploitation trade-off
  • If a user was recently recommended a video but did not watch it then the model will naturally demote this impression on the next page load

Approach

  • Recommendation is split into two distinct problems: candidate generation and ranking
  • Candidate generation takes the user's YouTube activity history as input
  • Shortlist small subset (hundreds) of videos from a large corpus
  • Broad personalization via collaborative filtering
  • Similarity between users - video watches, search query tokens, demographics
  • Similarity of features describing the video and user
  • Ranking network accomplishes this task by assigning a score to each video according to a desired objective function

  • Recommendation as extreme multiclass classification, where the prediction problem becomes accurately classifying a specific video watch among millions of candidate videos
  • Embeddings of each candidate video
  • The task of the deep neural network is to learn user embeddings u as a function of the user’s history and context that are useful for discriminating among videos with a softmax classifier
  • Explicit feedback mechanisms exist on YouTube (thumbs up/down) + implicit feedback of watches
  • Orders of magnitude more implicit user history is available, whereas explicit feedback is extremely sparse
  • Traditional softmax vs hierarchical softmax 
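The "extreme multiclass" framing above can be sketched in a few lines: given a user embedding u and candidate video embeddings, the watch probability is a softmax over dot products. The embeddings below are hard-coded toy values (the paper learns them with a DNN); this is only an illustration of the scoring, not the paper's implementation.

```python
import math

# Hypothetical toy embeddings; in the real system these are learned.
user_embedding = [0.2, 0.5, 0.1]
video_embeddings = {
    "video_a": [0.3, 0.4, 0.0],
    "video_b": [0.1, 0.9, 0.2],
    "video_c": [0.0, 0.1, 0.8],
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax_over_videos(u, videos):
    """P(watch video j | user) = exp(u.v_j) / sum_i exp(u.v_i)."""
    logits = {vid: dot(u, emb) for vid, emb in videos.items()}
    m = max(logits.values())  # subtract max to stabilize the exponentials
    exps = {vid: math.exp(z - m) for vid, z in logits.items()}
    total = sum(exps.values())
    return {vid: e / total for vid, e in exps.items()}

probs = softmax_over_videos(user_embedding, video_embeddings)
```

With millions of classes this full softmax is the expensive part, which is why the paper discusses alternatives such as sampled or hierarchical softmax.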

Features used

  • Video watches, search query tokens, demographics
  • Feedback mechanisms 
  • implicit user history when feedback is extremely sparse
  • Demographic features 
  • User’s geographic region and device
  • User’s gender, logged-in state and age

Scaling Lessons

  • Previous systems at YouTube relied on hashing
  • Scoring problem reduces to a nearest neighbor search in the dot product space
  • Linear and tree-based methods were previously used for watch-time prediction
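The "nearest neighbor search in the dot product space" point reduces to: score every candidate by its dot product with the user vector and keep the top k. A minimal brute-force sketch with invented vectors (production systems use approximate search over millions of videos):

```python
# Toy serving-time shortlist: maximum-inner-product search between
# the user vector and candidate video vectors. Values are made up.
user_vec = [0.4, 0.3, 0.2]
catalog = {
    "v1": [0.5, 0.1, 0.0],
    "v2": [0.4, 0.4, 0.1],
    "v3": [0.0, 0.2, 0.9],
}

def top_k_by_dot_product(u, videos, k=2):
    scores = {vid: sum(a * b for a, b in zip(u, emb))
              for vid, emb in videos.items()}
    # Highest dot product first.
    return sorted(scores, key=scores.get, reverse=True)[:k]

shortlist = top_k_by_dot_product(user_vec, catalog, k=2)
```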

Feature Engineering

  • A user’s watch history is represented by a variable-length sequence of sparse video IDs, which is mapped to a dense vector representation via the embeddings
  • Network requires fixed-sized dense inputs
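One common way to reconcile a variable-length ID sequence with the network's fixed-size input is to look up each ID's embedding and average them. The embedding table below is invented for illustration:

```python
# Toy sketch: variable-length watch history -> fixed-size dense input
# by averaging the per-video embeddings. Table values are made up.
EMBED_DIM = 3
embedding_table = {
    "cat_video": [0.9, 0.1, 0.0],
    "dog_video": [0.8, 0.2, 0.1],
    "news_clip": [0.0, 0.1, 0.9],
}

def watch_history_to_dense(history):
    if not history:
        return [0.0] * EMBED_DIM  # empty history -> zero vector
    vecs = [embedding_table[v] for v in history]
    # Component-wise mean over the sequence gives a fixed-size vector.
    return [sum(col) / len(vecs) for col in zip(*vecs)]

dense = watch_history_to_dense(["cat_video", "dog_video", "news_clip"])
```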

Algorithms

  • If users are discovering videos through means other than our recommendations, we want to be able to quickly propagate this discovery to others via collaborative filtering.


Ranking

  • The primary role of ranking is to use impression data to specialize and calibrate candidate predictions for the particular user interface. 
  • Assign an independent score to each video impression using logistic regression
  • Ranking with watch time better captures engagement
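The ranking bullets above can be sketched as a single logistic unit per impression; at serving time the exponentiated logit (the odds) can be used so that videos expected to hold attention longer rank higher, which is the watch-time-weighted idea. Weights, bias, and feature vectors below are invented, not learned:

```python
import math

# Toy logistic-regression ranker; parameters are illustrative only.
weights = [1.2, -0.7, 0.3]
bias = -0.1

def logistic_score(features):
    """Probability-like score in (0, 1) for one impression."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def serving_score(features):
    """Odds e^z, used for ordering in the watch-time-weighted variant."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return math.exp(z)

impressions = {"vid_x": [1.0, 0.0, 0.5], "vid_y": [0.2, 1.0, 0.1]}
ranked = sorted(impressions,
                key=lambda v: serving_score(impressions[v]),
                reverse=True)
```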

Feature Representation

  • Categorical features range from binary (e.g. whether the user is logged-in) to those with millions of possible values (e.g. the user’s last search query)
  • Features are further split according to whether they contribute only a single value (“univalent”) or a set of values (“multivalent”)
  • An example of a univalent categorical feature is the video ID of the impression being scored
  • multivalent feature might be a bag of the last N video IDs the user has watched
  • A univalent feature in machine learning would be a feature that can assume only one value or has only one dimension
  • If you have a dataset with a feature called "IsEmailVerified," it could potentially be a binary feature signifying whether a user's email is verified (1) or not (0)
  • A multivalent feature would be a feature that can have multiple values or dimensions. 
  • For example, a feature representing "VehicleType" could have multiple categories like "Car," "Truck," "Bike," and "Bus," each representing a different mode of transportation.
  • Another multivalent example is "Color," which could encode the color of an item as an RGB tuple (Red, Green, Blue).
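The univalent/multivalent distinction above is just a shape distinction per example. A small illustration with made-up feature names and values:

```python
# Toy example: a univalent feature carries one value per example,
# a multivalent one carries a bag of values. All names are invented.
example = {
    "impression_video_id": "video_42",                       # univalent
    "last_watched_ids": ["video_7", "video_9", "video_7"],   # multivalent bag (repeats allowed)
    "is_logged_in": 1,                                       # binary (univalent)
}

def is_multivalent(value):
    # A collection of values => multivalent; a scalar => univalent.
    return isinstance(value, (list, set, tuple))

shapes = {name: ("multivalent" if is_multivalent(v) else "univalent")
          for name, v in example.items()}
```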

Deep ranking network architecture

  • Takes both univalent and multivalent features as input
  • All layers are fully connected. In practice, hundreds of features are fed into the network.
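A minimal forward pass through a small fully connected (ReLU) tower, standing in for the deep ranking network described above. Layer sizes and weights are arbitrary stand-ins, not the paper's architecture:

```python
import random

random.seed(0)  # deterministic toy weights

def dense_layer(x, n_out):
    """One fully connected layer with ReLU; random weights for illustration."""
    n_in = len(x)
    w = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(w, b)]

features = [0.3, 0.7, 0.1, 0.9]  # concatenated dense feature vector
h1 = dense_layer(features, 8)    # hidden layer 1
h2 = dense_layer(h1, 4)          # hidden layer 2
out = dense_layer(h2, 1)         # final score
```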



Embedding

  • Categorical features in the same ID space also share underlying embeddings.
  • Out-of-vocabulary values are simply mapped to the zero embedding
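Both bullets can be sketched together: features in the same ID space (e.g. the impression video ID and the user's watch history) look up the same table, and unknown IDs fall back to the zero vector. Table contents are invented:

```python
# Toy shared embedding table; out-of-vocabulary IDs -> zero embedding.
EMBED_DIM = 3
shared_table = {
    "video_1": [0.1, 0.2, 0.3],
    "video_2": [0.4, 0.5, 0.6],
}
ZERO = [0.0] * EMBED_DIM

def lookup(video_id):
    # Same table serves every feature in this ID space.
    return shared_table.get(video_id, ZERO)

impression_emb = lookup("video_1")                     # univalent use
history_embs = [lookup(v) for v in ["video_2", "video_999"]]  # "video_999" is OOV
```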

CHATGPT4 Support

  • Define the user and video embeddings (simulating the learned embeddings from the actual system).
  • Simulate the function recommend_videos that scores and ranks videos for a user based on their embedding.
  • Implement the softmax function for the recommendation score.
  • Provide sample usage of the recommendation function using dummy data.
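The four steps above can be put together as one small simulation. Everything here is dummy data standing in for the learned embeddings, as the first step says:

```python
import math

# 1. Define user and video embeddings (dummy stand-ins for learned ones).
user_embeddings = {"user_1": [0.2, 0.8, 0.1]}
video_embeddings = {
    "video_a": [0.1, 0.9, 0.0],
    "video_b": [0.7, 0.1, 0.3],
    "video_c": [0.2, 0.2, 0.9],
}

# 3. Softmax over the raw dot-product scores.
def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# 2. Score and rank videos for a user based on their embedding.
def recommend_videos(user_id, k=2):
    u = user_embeddings[user_id]
    ids = list(video_embeddings)
    scores = [sum(a * b for a, b in zip(u, video_embeddings[v])) for v in ids]
    probs = softmax(scores)
    ranked = sorted(zip(ids, probs), key=lambda p: p[1], reverse=True)
    return ranked[:k]

# 4. Sample usage with the dummy data.
recommendations = recommend_videos("user_1", k=2)
```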

softmax vs hierarchical softmax with python example

The key takeaway is that while softmax calculates probabilities across all possible classes, hierarchical softmax does it for a specific path in a binary tree, reducing the computation when dealing with a vast number of classes.

Softmax Probabilities: The softmax function has converted the logits ([2.5, 1.0, 0.5]) into probabilities ([0.73612472, 0.16425163, 0.09962365]), which sum to 1.

Hierarchical Softmax Probability of a Class: The hierarchical softmax has calculated the probability for the class represented by the path 'LL' (left, left), which is 0.6 × 0.8 = 0.48.
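The two results quoted above can be reproduced directly. The per-node left-branch probabilities (0.6 and 0.8) are toy values; in a real hierarchical softmax each internal node's probability comes from its own logistic unit:

```python
import math

def softmax(logits):
    """Full softmax: normalize over ALL classes."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([2.5, 1.0, 0.5])  # ~[0.7361, 0.1643, 0.0996]

def hierarchical_path_prob(path, left_probs):
    """Hierarchical softmax: a leaf's probability is the product of
    branch probabilities along its root-to-leaf path ('L' = go left)."""
    p = 1.0
    for step, p_left in zip(path, left_probs):
        p *= p_left if step == "L" else (1.0 - p_left)
    return p

p_ll = hierarchical_path_prob("LL", [0.6, 0.8])  # 0.6 * 0.8 = 0.48
```

The contrast is visible in the work done: the full softmax touches every class, while the hierarchical version only evaluates the O(log N) nodes on one path.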


Keep Exploring!!!
