Francois Belletti Francois Belletti

Faster long range dependent recommendations for Youtube

Long range predictions under tight latency and memory constraints

A key insight to enable faster recommendations without adding much of a compute burden was to factorize the memory space of RNNs into several independent spaces which enabled a much more robust memory cell without a quadratic scaling of compute.

In https://proceedings.mlr.press/v84/belletti18a.html, there is a detailed presentation of the model as well as its practical empirical impact.

The key thing to understand is that there is a need to dissociate the two roles of a weight in a linear layer in deep learning:

  1. Each weight enables connectivity;

  2. Each weights enables modulation.

Connectivity is obviously the role the weight has in a matrix multiply of linking one of the inputs to one of the inputs. The modulations is the learned part which uses the capacity of the model to amplify or decay an input towards a certain output.

Starting with Deep Fried Convnets, some researchers developed the key insight that the two aspects can be dissociated and tuned independently by using layers featuring fewer degrees of freedom. This can be enabled with low rank matrices, as well as Fourier transforms or factorization (depth-wise separation).

Alternately, XCeption additionally introduced the idea of factorizing the latent space into independent components to require less compute.

This can be applied to any network, as with RNNs in our paper, with the additional benefit for RNNs that independent memory cells cannot corrupt one-another and therefore enable long range dependence at a low cost.

However, as measured in our follow-up paper, while there is provable long range dependence in human behavior as relevant to recommender systems, once can prove that regardless of how many memory cells one has in a standard RNN, due to the leakiness of the write gates, there is constant corruption of the memory state.

With bounded memory spaces as in LSTMs or GRUs, this provably prevents long range dependence and is perhaps one of the strongest argument one can make towards using alternate solutions in such contexts, such as attention, which we did in our subsequent paper.

Read More