Google has actively applied deep learning to various products in recent years. Here we introduce one of them: the YouTube recommendation system.

Deep Neural Networks for YouTube Recommendations (Sep. 2016)

System overview
The paper describes the deep neural network architecture used to recommend YouTube videos.
The system consists of two deep neural networks arranged in four steps:
   1. Video corpus
   2. Candidate video generation (deep neural network)
   3. Ranking the videos (deep neural network)
   4. Recommendations

fig.1

Candidate video generation (deep neural network)
The architecture of the candidate generation model is shown below.

fig.3

A high-dimensional embedding is learned for each video, and these embeddings are fed into a feed-forward neural network. The input vector combines the user’s watch history, search history, and demographic and geographic features. The input layer is followed by several fully connected hidden layers of Rectified Linear Units (ReLU). All videos and search tokens were embedded with 256 floats each, with a maximum bag size of 50 recent watches and 50 recent searches. The model was trained until convergence over all YouTube users.
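
To make the input construction concrete, here is a minimal sketch of how a bag of recent watches and searches could be turned into one fixed-size input vector. It assumes the bag is reduced by simple averaging and that ids are ordered oldest to newest. The names `average_bag` and `build_input` and the tiny 4-float embeddings are hypothetical, chosen only for illustration (the article's setting is 256 floats).

```python
from typing import Dict, List, Sequence

EMBED_DIM = 4   # illustration only; the article's setting is 256
MAX_BAG = 50    # maximum bag size for recent watches / searches


def average_bag(ids: Sequence[int],
                table: Dict[int, List[float]],
                dim: int = EMBED_DIM,
                max_bag: int = MAX_BAG) -> List[float]:
    """Average the embeddings of the most recent `max_bag` ids.

    Assumes `ids` is ordered oldest to newest. An empty bag maps to zeros.
    """
    recent = list(ids)[-max_bag:]
    if not recent:
        return [0.0] * dim
    total = [0.0] * dim
    for item_id in recent:
        for k, value in enumerate(table[item_id]):
            total[k] += value
    return [x / len(recent) for x in total]


def build_input(watch_ids: Sequence[int],
                search_ids: Sequence[int],
                video_table: Dict[int, List[float]],
                search_table: Dict[int, List[float]],
                other_features: Sequence[float]) -> List[float]:
    """Concatenate averaged watch and search vectors with the other features."""
    watch_vec = average_bag(watch_ids, video_table)
    search_vec = average_bag(search_ids, search_table)
    return watch_vec + search_vec + list(other_features)


# Toy embedding tables (made-up values).
video_table = {0: [1.0, 0.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0, 0.0]}
search_table = {0: [2.0, 2.0, 2.0, 2.0]}
x = build_input([0, 1], [0], video_table, search_table, [0.5])
```

Whatever the length of the user's history, the resulting vector has the same size, which is what a fully connected network requires.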

Networks of the following depths were compared:
・Depth 0: A linear layer simply transforms the concatenation layer to match the softmax dimension of 256
・Depth 1: 256 ReLU
・Depth 2: 512 ReLU → 256 ReLU
・Depth 3: 1024 ReLU → 512 ReLU → 256 ReLU
・Depth 4: 2048 ReLU → 1024 ReLU → 512 ReLU → 256 ReLU
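
The configurations above follow a simple pattern: each added layer doubles the width of the first hidden layer, and the last layer always has 256 units. A small sketch that reproduces the list (the helper name `tower_widths` is ours, not the paper's):

```python
from typing import List


def tower_widths(depth: int, output_dim: int = 256) -> List[int]:
    """Return the ReLU layer widths for a given depth, widest layer first.

    Depth 0 returns an empty list: in that configuration there are no ReLU
    layers, only a linear projection to `output_dim`.
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return [output_dim * 2 ** k for k in range(depth - 1, -1, -1)]


for depth in range(5):
    print(depth, tower_widths(depth))
```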

This deep neural network needs to classify a specific video watch “wt” at time “t” among millions of videos “i” (classes) from a corpus “V” based on user “U” and context “C”. The task is to learn user embeddings “u” as a function of the user’s history and context that are useful for discriminating among videos with a softmax classifier.
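
The classification problem above can be written as a softmax over the corpus. The formula below uses the symbols already introduced, plus one that the text does not define: v_j, the embedding of candidate video j, which lives in the same space as the user embedding u so that their inner product is defined.

```latex
P(w_t = i \mid U, C) = \frac{e^{v_i \cdot u}}{\sum_{j \in V} e^{v_j \cdot u}}
```

The denominator sums over every video in V, which is why the number of classes (millions) dominates the cost of this step.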

fig.2

Ranking the videos (deep neural network)
The architecture of the ranking network is shown below.

This model works with the much smaller set of candidate videos produced by the first model, so it can take advantage of many more features describing the video and the user’s relationship to it.

The ranking model assigns an independent score to each video impression using logistic regression; the list of videos is then sorted by this score and returned to the user.
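
The score-then-sort step can be sketched as follows. This is only an illustration: a single logistic unit stands in for the full ranking network, and the feature values, weights, and the function name `rank_candidates` are made up.

```python
import math
from typing import Dict, List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def rank_candidates(candidates: Dict[str, Sequence[float]],
                    weights: Sequence[float],
                    bias: float = 0.0) -> List[Tuple[str, float]]:
    """Score each candidate independently, then sort by score, highest first."""
    scored = []
    for video_id, features in candidates.items():
        z = bias + sum(w * x for w, x in zip(weights, features))
        scored.append((video_id, sigmoid(z)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidates, each with one made-up feature value.
candidates = {"video_a": [0.0], "video_b": [2.0], "video_c": [1.0]}
ranking = rank_candidates(candidates, weights=[1.0])
```

Because each impression is scored on its own, the scoring loop does not depend on which other candidates are present; only the final sort looks at the list as a whole.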

The final ranking objective is continually tuned based on live A/B testing results, but it is generally a simple function of expected watch time per impression.

fig.4

Three challenges
Recommending YouTube videos poses three challenges.

・scale
The scale challenge reflects YouTube’s massive user base and corpus. The authors describe it as follows:
The two-stage approach to recommendation allows us to make recommendations from a very large corpus (millions) of videos while still being certain that the small number of videos appearing on the device are personalized and engaging for the user.

・freshness
The freshness challenge reflects the fact that YouTube has a very dynamic corpus, with many hours of video uploaded per second, and that users prefer newer content. The authors describe it as follows:
Recommending recently uploaded (“fresh”) content is extremely important for YouTube as a product. We consistently observe that users prefer fresh content, though not at the expense of relevance. In addition to the first-order effect of simply recommending new videos that users want to watch, there is a critical secondary phenomenon of bootstrapping and propagating viral content.

・noise
The noise challenge arises because “we rarely observe the ground truth of user satisfaction and instead model noisy implicit feedback signals.” The authors add:
Ranking by click-through rate often promotes deceptive videos that the user does not complete (“clickbait”) whereas watch time better captures engagement.
→ Watch time is the key signal.

Results

・Increasing the width and depth of hidden layers improves results.

・Ranking is a more classical machine learning problem, yet the deep learning approach outperformed previous linear and tree-based methods for watch time prediction.

・Logistic regression was modified by weighting training examples with watch time for positive examples (the video impression was clicked) and unity for negative examples (the impression was not clicked), allowing the model to learn odds that closely model expected watch time.
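
Why the weighted odds track expected watch time can be checked with a small worked example. The sketch assumes the simplest possible case, a logistic regression with only an intercept, where the weighted maximum-likelihood odds have a closed form: total positive weight divided by total negative weight. The watch times and counts are made-up numbers.

```python
from typing import Sequence


def learned_odds(watch_times: Sequence[float], num_unclicked: int) -> float:
    """Odds fitted by an intercept-only weighted logistic regression.

    Each clicked impression is weighted by its watch time and each
    unclicked impression by 1, so odds = sum(watch times) / unclicked count.
    """
    positive_weight = float(sum(watch_times))
    negative_weight = float(num_unclicked)
    return positive_weight / negative_weight


def expected_watch_time(watch_times: Sequence[float], num_unclicked: int) -> float:
    """Average watch time per impression, counting unclicked ones as zero."""
    impressions = len(watch_times) + num_unclicked
    return sum(watch_times) / impressions


# Hypothetical data: 100 impressions, 3 clicks watched for 30, 90 and 60 seconds.
watch_times = [30.0, 90.0, 60.0]
num_unclicked = 97

odds = learned_odds(watch_times, num_unclicked)
expected = expected_watch_time(watch_times, num_unclicked)
print(odds, expected)
```

The two quantities differ only in the denominator (unclicked impressions versus all impressions), so they are close whenever clicks are a small fraction of impressions, as in this example with a 3% click rate.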

For full details, see the paper:
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf

Related
Translation system:
https://arxiv.org/pdf/1609.08144.pdf
https://arxiv.org/pdf/1611.04558v1.pdf
Recommendation system:
・Google Play Store:https://arxiv.org/pdf/1606.07792v1.pdf
・YouTube:https://research.google.com/pubs/pub45530.html

Author: Takumu Sumi / Research and Business Development
I develop technical solutions to meet clients’ needs and lead efforts to resolve problems.
In addition, I create technology roadmaps and develop original technology based on our research at LEAPMIND Inc.

Posted by SumiTakumu