Recommendation systems help websites serve you new content similar to what you already like. Often this is done by describing each user with a set of abstract numbers, which is then compared against other sets of abstract numbers describing the content on the website. Some people (Google, for example) consider this too simple a model, because people can be interested in many types of content that are not necessarily related to each other. Instead, we can describe each user with multiple sets of numbers rather than just one.
How is this different from just using more numbers? Let's look at what the numbers in question actually represent. To determine how relevant an item might be for a particular user, a similarity score is calculated between the numbers representing that item and the numbers representing that user. However, users can be interested in many things that are not necessarily related, and therefore their interests cannot be described well by a single set of numbers.
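As a simplified illustration, the single-vector approach scores every item against a user with one similarity computation. The vectors and sizes below are placeholders, not the project's data.

```python
import numpy as np

# Hypothetical single-vector setup: one m-dimensional vector per user and per item.
m = 250
user_vector = np.random.rand(m)       # stand-in for one user's representation
item_vectors = np.random.rand(10, m)  # stand-in for 10 candidate item vectors

# Score each item for this user with a dot product (cosine similarity is the
# normalized variant). Higher score = more relevant under this model.
scores = item_vectors @ user_vector
ranked = np.argsort(scores)[::-1]     # best-scoring items first
```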
This project builds a model using vector representations of documents read by different users on a website, with the goal of giving the best possible recommendations for users based on their individual interests. This is a content-based recommendation system, meaning that recommendations are based on the content of the documents themselves rather than on shared interaction patterns between users. All of the research used for this project is oriented around collaborative recommendation, so document vectors are used in place of optimized item embeddings, but most of the math is otherwise the same.
This project was inspired by research done at Condé Nast presented at an event by Dataiku on October 23, 2019.
The starting point for the idea of a content-based recommendation system based on Nonlinear Latent Factorization (NLF) is a paper by Weston et al. (2013), which describes a model that uses multiple "interest units" per user and differentiates between them by using only the one providing the best user-item match while ignoring the others.
From the paper:
> The key idea of the proposed model is to define T interest vectors per user, where the user part of the model is written as Û, which is an m × |U| × T tensor. Hence, we also write Û_iu ∈ ℝ^m as the m-dimensional vector that represents the ith of T possible interests for user u. The item part of the model is the same as in the classical user-item factorization models, and is still denoted as an m × |D| matrix V.
This is a bit abstract still, but makes more sense in the context of content-based recommendations. The original paper optimizes both User and Item vectors to get the best user-item pairings, in the typical style of a fully collaborative recommendation system enabled by matrix decomposition. Instead, my project uses doc2vec-generated item vectors so that the original relative meaning between items, based on their actual content, is not lost.
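A minimal sketch of the two scoring rules, assuming dot-product similarity: the classical model scores an item against a single user vector, while the nonlinear model takes the maximum score over the T interest units. All shapes and values below are placeholders.

```python
import numpy as np

# Shapes follow the paper's notation: U_hat is |U| x T x m (T interest vectors
# per user), V is |D| x m (one doc2vec vector per item).
num_users, T, m, num_items = 1000, 3, 250, 5000
U_hat = np.random.randn(num_users, T, m)
V = np.random.randn(num_items, m)

def linear_score(user_vec, d):
    """Classical single-vector score for comparison."""
    return user_vec @ V[d]

def nonlinear_score(u, d):
    """Nonlinear score f(u, d): the best-matching interest unit wins."""
    return np.max(U_hat[u] @ V[d])   # max over the T interest units
```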
Once vectors are optimized for each user, recommendations can be made based on vector similarity.
| variable | definition |
|---|---|
| U | All users |
| u | Single user |
| i | Interest unit |
| T | Number of discrete user interests (don't confuse this with the vector transpose symbol) |
| V | Item vectors |
| D | Corpus (all items) |
| Du | Items relevant to a user |
| d | Item relevant to a user |
| d-bar | Item irrelevant to a user |
| f | Scoring function |
| L | Function which weights a user's rank of an item |
| J | Cost/objective function |
| alpha | Gradient descent learning rate |
| h | Hinge regularization parameter (typically 1) |
The L(rank(d)) function allows for weighting a user's ranking of items in order to tweak bias in the model. However, it is not actually useful for a binary ranking system (like the one in this project), as it is constant. The "1 + ..." term is used to implement "hinge" loss, and is useful as a form of regularization in gradient descent.
Using this equation with the nonlinear scoring model allows for a type of "max-based" nonlinearity.
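A rough sketch of the pairwise hinge term for the binary case, where L(rank(d)) is constant and drops out; the function name and default margin of 1 follow the description above rather than any particular implementation.

```python
def hinge_loss(f_d, f_dbar, h=1.0):
    """Pairwise hinge term |h - f(u, d) + f(u, d_bar)|_+ for one item pair.

    f_d: score of an item the user actually read (d in Du)
    f_dbar: score of a randomly drawn irrelevant item (d-bar not in Du)
    h: hinge regularization / margin parameter (1 by default)
    """
    return max(0.0, h - f_d + f_dbar)

# The loss (and its gradient) is zero once the relevant item outscores the
# irrelevant one by at least the margin h, so already-satisfied pairs stop
# contributing to the update.
```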
As stated above, L(rank(d)) cancels out and is not relevant for this particular project. The gradient is found by taking the partial derivative of the objective function with respect to a user's vector, and user vectors are updated using that gradient. Weston et al. (2013) describe a process for stochastic gradient descent (SGD) which is not exactly applicable here, since the item vectors V are static and predetermined by doc2vec rather than learned by minimizing the objective function. I used mini-batch gradient descent instead, in part substituting vectorized tensor operations for iterative functions.
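Below is a loop-based sketch of one mini-batch update for the linear model under the pairwise hinge loss described above. The project's actual implementation vectorizes this over tensors; the function and argument names here are illustrative.

```python
import numpy as np

def minibatch_update(U, V, pairs, alpha=0.01, h=1.0):
    """One mini-batch update of the linear model's user vectors.

    U: |U| x m array of user vectors (updated in place)
    V: |D| x m array of fixed doc2vec item vectors
    pairs: list of (user, relevant_item, irrelevant_item) index triples
    """
    step = alpha / max(len(pairs), 1)        # scale by batch size so larger
                                             # batches don't inflate the step
    for u, d, dbar in pairs:
        margin = h - U[u] @ V[d] + U[u] @ V[dbar]
        if margin > 0:                       # hinge active: push U[u] toward
            U[u] += step * (V[d] - V[dbar])  # V[d] and away from V[dbar]
    return U
```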
The gradient for the nonlinear function is similar to the linear model's, but uses the particular interest unit per user that is most relevant to the selected relevant item d. The equation above describes the update process for a particular interest unit i of a given user Uu. Note that only one interest unit is updated per instance of this equation while the others are ignored. Tweaking the hinge regularization parameter, represented here simply as "1", is key for optimization.
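The corresponding per-pair update for the nonlinear model might look like the following sketch: only the interest unit with the highest dot product against the relevant item d is moved, and the rest are left alone.

```python
import numpy as np

def nonlinear_update(U_hat, V, u, d, dbar, alpha=0.1, h=1.0):
    """Per-pair update for the nonlinear model (one user, one item pair).

    U_hat: |U| x T x m tensor of user interest vectors (updated in place)
    V: |D| x m array of fixed doc2vec item vectors
    """
    i = int(np.argmax(U_hat[u] @ V[d]))          # interest unit most aligned with d
    margin = h - U_hat[u, i] @ V[d] + U_hat[u, i] @ V[dbar]
    if margin > 0:
        U_hat[u, i] += alpha * (V[d] - V[dbar])  # the other T - 1 units are untouched
    return U_hat
```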
To skip the document vectorization step, I've chosen a dataset from Kaggle with user interactions on the Brazilian news site Globo. Each document has already been vectorized with vector length m = 250.
- Vector size (m): 250
- Unique articles: 364,047
- Relevant Articles: 46,033
- Unique users: 322,897
- User-article interactions: 2,951,986
The graph below shows the cumulative number of users who have read at least n articles (articles on the x-axis, users on the y-axis). I used this to choose n ≥ 10, with 8 articles per user going into the training set.
The `NonlinearModel` class can produce linear and nonlinear models depending on the value of `T`. Much of the boolean broadcasting and indexing used in the nonlinear model becomes irrelevant when `T` is set to 1. First, user vectors are initialized by taking the simple mean of the document vectors for the documents relevant to each user.
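A minimal sketch of that initialization step, assuming a hypothetical `relevant_items` mapping from user index to the indices of that user's documents:

```python
import numpy as np

def init_user_vectors(V, relevant_items, num_users):
    """Initialize each user vector as the mean of their relevant document vectors.

    V: |D| x m doc2vec matrix
    relevant_items: hypothetical dict mapping user index -> list of item indices
    """
    U = np.zeros((num_users, V.shape[1]))
    for u, items in relevant_items.items():
        U[u] = V[items].mean(axis=0)   # simple mean of the user's documents
    return U
```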
Then, the user vectors are further optimized with gradient descent. A set number of items per user is used in each iteration of gradient descent; this is given by `size`, where all items are unique per user. Increasing `batch_size` repeats those unique relevant items (∈ Du) but pairs them with new items not relevant to the user (∉ Du). The learning rate is also scaled by batch size for consistency, since gradients from each pair of items are summed at each iteration; otherwise, larger batches would produce larger gradients. `test_size` is used for validation at each step, computing the objective function J from the updated user vectors, the relevant document vectors (which remain constant), and a new randomly chosen set of document vectors not relevant to users. `gd_algorithm` tells whether or not to use RPROP.
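The interaction between `size` and `batch_size` can be sketched roughly as follows for a single user; the helper name and sampling details are illustrative, not the class's actual API.

```python
import numpy as np

def build_batch(relevant, num_items, size, batch_size, rng=np.random.default_rng()):
    """Sample relevant/irrelevant item pairs for one user.

    relevant: indices of the items in Du for this user
    size: number of unique relevant items drawn per iteration
    batch_size: total pairs; relevant items repeat, irrelevant items stay fresh
    """
    d = rng.choice(relevant, size=size, replace=False)   # unique items from Du
    d = np.resize(d, batch_size)                         # repeated to fill the batch
    not_relevant = np.setdiff1d(np.arange(num_items), relevant)
    dbar = rng.choice(not_relevant, size=batch_size, replace=False)  # fresh negatives
    return d, dbar
```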
Optimization with h = 1 and alpha = 0.01 is shown below.
The key first step of the nonlinear model is "user interest partitioning" in which user interests i are initialized. Collaborative algorithms tend to initialize i randomly but for the purposes of this project, in which item vectors V are predefined, it makes more sense to initialize U based on V before optimization with gradient descent.
First, user interest vectors are initialized on a per-user basis using a modified k-means clustering algorithm, with k equal to the number of latent vectors per user, i.e. T. The function is broadcastable and runs simultaneously for all users.
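A loop-based sketch of that per-user initialization, written as a plain k-means-style loop with k = T; the broadcast version used in the project runs over all users at once, and the helper names here are hypothetical.

```python
import numpy as np

def init_interest_units(V, relevant_items, T, iters=10, rng=np.random.default_rng()):
    """Per-user k-means-style initialization of T interest vectors.

    V: |D| x m doc2vec matrix
    relevant_items: hypothetical dict mapping user index -> list of item indices
    """
    U_hat = {}
    for u, items in relevant_items.items():
        X = V[items]                                       # this user's documents
        centers = X[rng.choice(len(X), T, replace=False)]  # seed with T documents
        for _ in range(iters):
            # assign each document to its nearest interest centroid, then recenter
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            for t in range(T):
                if np.any(labels == t):
                    centers[t] = X[labels == t].mean(axis=0)
        U_hat[u] = centers
    return U_hat
```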
Once initialized, the vectors are updated at each step of gradient descent using the equations in the math section above. Gradient calculation is not quite as straightforward as for the linear model, since only one interest unit should be updated per user-item interaction. A |U| x T x `size` x m tensor is used to start, in which the pairwise gradients are repeated across the user interest axis T times, then multiplied by a boolean matrix along the third axis and summed to produce a |U| x T x m gradient tensor. The boolean matrix is determined per user, per item, by the maximum dot product between the user's interest vectors and the relevant item. This can be modified by `use_vdbar_for_interest_unit`, which considers the overall cost per item pair. Additionally, the hyperparameter `readj_interval` determines how often the best interest unit is recalculated.
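A small vectorized sketch of that masked gradient, with tiny placeholder shapes and the hinge condition omitted for brevity:

```python
import numpy as np

# Shapes: U_hat is |U| x T x m, Vd and Vdbar are |U| x size x m
# (the relevant / irrelevant item vectors sampled for each user).
num_users, T, size, m = 4, 3, 5, 250
U_hat = np.random.randn(num_users, T, m)
Vd = np.random.randn(num_users, size, m)
Vdbar = np.random.randn(num_users, size, m)

# Per user and per relevant item, find the interest unit with the highest
# dot product; build a boolean mask of shape |U| x T x size.
scores = np.einsum('utm,usm->uts', U_hat, Vd)
mask = scores == scores.max(axis=1, keepdims=True)

# Repeat the pairwise gradient (Vd - Vdbar) across the T axis, zero it out
# where the interest unit was not the winner, then sum over the item axis
# to get the |U| x T x m gradient tensor.
pair_grad = (Vd - Vdbar)[:, None, :, :]           # |U| x 1 x size x m
grad = (pair_grad * mask[..., None]).sum(axis=2)  # |U| x T x m
```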
The biggest challenge in optimizing the nonlinear model is the low signal-to-noise ratio. The randomness of item pairings, while essential for generalized optimization, makes taking effective steps in gradient descent extremely difficult. Some effective ways around this problem are as follows:
- Interest Unit Readjustment - Using higher values for `readj_interval` allows interest units to adjust effectively before being thrown off by more noisiness in the data. However, convergence with this strategy is very slow.
- Learning Rate - Much higher learning rates than for the linear model help to escape local minima.
- Batch Size - Larger batch sizes help to control noisiness.
- Regularization - Adjusting regularization ended up being the most effective way to ensure fast, consistent convergence. The regularization parameter in the hinge loss adjustment can be tweaked to change the gradient, although the default parameter value of 1 is still used in the end calculation. Too high a value leads to divergence. Too low a value prevents optimization.
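To make the last point concrete, here is a sketch of how a larger margin h might be used only to decide which pairs contribute to the gradient, while the reported objective J keeps the default margin of 1; the function and its defaults are illustrative.

```python
def hinge_active(U_u, V_d, V_dbar, h=5.0):
    """Return True if this item pair still contributes to the gradient under margin h.

    A larger h keeps more pairs 'active' (non-zero gradient), which helps cut
    through the noise of random item pairings; the objective J itself is still
    evaluated with the default margin of 1.
    """
    return (h - U_u @ V_d + U_u @ V_dbar) > 0
```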
Optimization with h = 5 and alpha = 0.1 is shown below.
- Doc2Vec training/re-training
- dimensionality reduction
- data preparation
- user optimization
- sorting algorithms
- Expand on evaluation (calculate p-values)
- Write up "implementation"
- Medium writeup
- Variable sizes per user in training set (requires forward-filling (or back-filling) with NaNs)