-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathassignment-4.tex
275 lines (200 loc) · 13.5 KB
/
assignment-4.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
% This version of CVPR template is provided by Ming-Ming Cheng.
% Please leave an issue if you found a bug:
% https://github.com/MCG-NKU/CVPR_Template.
\documentclass[final]{cvpr}
\usepackage{times}
\usepackage{epsfig}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}
% Include other packages here, before hyperref.
\usepackage{csquotes}
% If you comment hyperref and then uncomment it, you should delete
% egpaper.aux before re-running latex. (Or just hit 'q' on the first latex
% run, let it finish, and you should be clear).
\usepackage[pagebackref=true,breaklinks=true,colorlinks,bookmarks=false]{hyperref}
\def\cvprPaperID{****} % *** Enter the CVPR Paper ID here
\def\confYear{2021}
%\setcounter{page}{4321} % For final version only
\newcommand{\q}[1]{\enquote{#1}}
\newcommand{\Set}[1]{\left\{ #1 \right\}}
\usepackage{amsmath}
\usepackage{amssymb}
\newcommand{\R}{\mathbb{R}}
\begin{document}
%%%%%%%%% TITLE
\title{
Assignment 4 - Neural Network Recommendation System \\~\\
\large{Team name: M2 Robo}
}
\author{
Chan Kwan Yin\\
3035466978 \\
Team leader
\and
Lee Chun Yin\\
3035469140\\
\and
Chiu Yu Ying\\
3035477630
}
\maketitle
\clearpage
\section{Background}
Among all collaborative filtering techniques, matrix factorization (MF) become a standard approach to the model-based recommendation. Many research teams have put their efforts on enhancing it to break the accuracy record.
However, it has several limitations such as the cold-start problem -- how the system provide recommendations to new users, how to recommend when new item is added without feature vector or embedding \cite{FMF}. Also, the use of dot product for recommending popular items will lead to the tendency of recommending popular items to those users without specific user interests. With the use of inner product of User and item feature embeddings, MF limits the capture of the complex relations of the users and items interactions \cite{he2017neural}.
In this assignment, we aim to implement the new approach, Neural Collaborative Filtering (NCF), proposed by He team in 2017 \cite{he2017neural}. They suggested that their design of deep neural network can overcome the limitation of MF.
\section{Neural Collaborative Filtering (NCF) model}
This is the model proposed by He team in 2017 \cite{he2017neural}. Their proposed model uses a combination of matrix factorization and multi-layer perceptron techniques, and the hidden layers from the two techniques are combined by a fully-connected layer. However, their paper mainly focuses on the prediction of \textit{implicit user feedback}, that is, they are predicting the values of $y_{ui}$ where $y_{ui}=1$ when the obervation (user $u$, item $i$) is observed, and $0$ when the
observation (user $u$, item $i$) is not observed. Thus, at the final stage of their proposed model, the output of the fully-connected layer is passed into a sigmoid function, such that the model only outputs values within $[0, 1]$.
The problem of predicting implicit feedback is different from the problem we wish to solve in the Netflix dataset, because in the Netflix problem we wish to predict actual ratings from $[1, 5]$.
Nevertheless, we still think that He's paper provides a good starting point - We modify He's model by removing the final sigmoid function and changing the loss functions used directly, such that the neural network solves a regression problem instead of the original binary
classification problem.
The model proposed by He is split into 3 parts - Generalized Matrix Factorization, Multi-Layer Perceptron, and Neural matrix factorization. In fact, the third part is a combination of the previous two parts. For this part of our project, we will follow the same workflow as He and implement the 3 parts separately.
\begin{figure}[h]
\includegraphics[width=0.5\textwidth]{./NeuCF.PNG}
\caption{Neural Collaborative Filtering Model Design \cite{he2017neural}}
\end{figure}
\subsection{Model - Input Layer}
The input of the model are the user ID and item ID transformed to a binarized vector with one-hot encoding.
\subsection{Model - Embedding Layer}
For each user and each movie, we train two sets of embedding latent vectors. This is because one set of latent
vectors will be passed into the itemwise multiplication part
of the model (in He’s paper, it is called Generalized Matrix
Factorization (GMF)), and the other set of latent vectors will
be passed into the neural network part of the model (in He’s
paper, it is called Multi-Layer Perceptron (MLP)).
\subsection{Model - Generalized Matrix Factorization (GMF)}
In this part of the model, the user latent vector and item latent vector are combined via itemwise multiplication, similar to that in the SVD matrix factorization model. The output of this part is another vector with the same size at the latent vectors and is passed on to the final output layer of the model.
\subsection{Model - Multi-Layer Perceptron (MLP)}
In this part of the model, the user latent vector and item latent vector are first concatenated. Afterwards, the whole concatenated vector is passed through several layers of fully connected hidden linear layers, using the ReLU activation function. the output of this part is a vector dependng on the output size of the last hidden layer and is passed on to the final output layer of the model.
\subsection{Model - Output Layer}
The final output layer takes in the vectors produced by the GMF and MLP parts of the model and concatenates them. The output of this part, which is the final output of the whole model, is a weighted sum of the vector elements, where the weights are model parameters to be learned. The final output layer produces a predicted score $\hat{y_{ui}}$.
\subsection{Model - Formula}
\bigskip
\textbf{ The formula of the GMF is as follow:}
$$\hat{y_{ij}} = f(P^T v_i Q^T v_j|P,Q,\Theta_f)$$
The variables with m users and n items are denoting as:
\begin{itemize}
\item $P \in \R^{(m \times K)}$ latent factor matrix for users from embedding layer.
\item $Q \in \R ^{(n \times K)}$ latent factor matrix for items from embedding layer.
\item $f$ interaction function.
\item $\Theta_{f}$ model parameters of the interaction function $f$
\end{itemize}
\bigskip
\textbf{ For MLP $f$, it can be formulated as:}
$$f(P^T v_i, Q^T v_j) = \phi_{output}(\phi_X(...\phi_2(\phi_1(P^T v_i, Q^T v_j))...))$$
The mapping functions are as follow:
\begin{itemize}
\item $\phi_{output}$: the mapping function for output layer
\item $\phi_x$ the activation function for the xth hidden neural FC layer. $x$ is the number of the layers
\end{itemize}
\section{Technical Details}
The training data set provided by Netflix consists of more than 100 million ratings with 17770 movies and 480189 users.
% Such a huge data set would consume a significant amount of training time and memory
% ($O(m^2 n)$, since a correlation matrix between users is to be constructed),
% which is not possible for our hardware available.
% Therefore, only a subset of data is used for evaluation.
% To be specific, only first $10000$ movies and first $1000$ users that appear in the data set are considered.
\subsection{Data preprocessing}
The movies are loaded into a list with columns of Movie ID, User ID and Rating. Then, it will be fed into the test loader with batch size.
% Approximately $80\%$ data are then reformatted into a rating matrix $R \in \mathbb R^{m \times n}$ for training,
% where $r_{ij}$ is the rating of user $i$ on movie $j$;
% the rest are retained for performance evaluation.
% The missing data are being skipped in our model.
%In general, our algorithm runs through all rows for $\epsilon+1$ times
% where $\epsilon$ is the number of epochs of gradient descent.
% The model involves 5 parameters,
% namely $\mu \in \mathbb R, \mathbf b \in \mathbb R^m, \mathbf c \in \mathbb R^n, P \in \mathbb R^{mk}, Q \in \mathbb R^{nk}$.
% The blocking term is $nk$, so this model is scalable to the full dataset given that $nk$ is not beyond memory limit.
% We have not used the full model in this assignment due to time constraints,
% but we will do so in the final report.
\subsection{Optimization and scheduler}
We select Adadelta as our optimizer.
% To have a better performance on the converge of loss function, we implement a scheduler with step size = 1 and $\gamma$ = 0.7.
To have a better performance on the converge of loss function,we implement a scheduler with different combinations of hyperparameters of scheduler
\subsection{Hyperparameters selection for scheduler}
There are two hyperparameters, namely
\begin{itemize}
\item step size
\item $\gamma$
\end{itemize}
\subsection{Hyperparameters selection for model}
In this MLP model, there are five hyperparameters to be selected, namely
\begin{itemize}
\item Learning Rate ($\alpha$)
\item Regularization ($\lambda$)
\item Rank of factorization ($k$)
\item Number of epoch ($x$)
\item Batch Size ($N$)
% \item Log Interval ($t$)
\end{itemize}
\subsubsection{Learning Rate ($\alpha$)}
The learning rate indicates the speed at which the model learns. It controls how much to change the modal to respond the estimate error each time when the model weights are updated.
Rate with too small value may cause a long training time while a large value may allow the modal learning too fast to have less accurate or even not meaningful result.
\subsubsection{Regularization ($\lambda$)}
% Regularization plays a crucial role in preventing the overfitting issue when training model. It aims to reduce the complexity of model to achieve its function. When the regularization parameter is large, the decay in weights during SGD update will be increased and therefore the weights of the hidden units will be negligible (close to zero).
% If it is set to be too large, each change to the parameters are cancelled by the regularization term. If it is set to be too small, overfitting may result.
\subsubsection{Rank of factorization ($k$)}
Intuitively, each rank represents some characteristic of a user or a movie. Users and movies that interact strongly on a rank would be affected.
A large value of $k$ increases the training time, memory consumption and probably the chance of overfitting,
while a low value of $k$ reduces the representativeness of the model and hence reduced accuracy.
\subsubsection{Number of epoch ($x$)}
The number of epochs defines the number of times that the modal will work through the whole training dataset.
Since we already report the training score for each epoch, it is adjusted to a value such that the training score appears to converge negligibly.
% \subsubsection{Log-interval ($t$)}
\subsubsection{Batch size ($N$)}
The batch size defines the number of samples to be passed through before the update of model parameters.
The small batch size usually can give a better performance since the model can start learning with a small subset and have more cycles to improve.
However, the smaller batch size costs the slower learning and hence more training time.
\subsection{Predictive test set score (RMSE)}
The model is evaluated by computing the RMSE between predicted and actual rating values:
$$ \text{RMSE} = \sqrt{\sum_{(i, j) \in E} \frac{{(\hat r_{ij} - r_{ij})}^2}{\left| E \right|}} $$
where $E$ is the set of retained evaluation data.
\section{Model performance}
\subsection{Findings}
The following table exhaustively lists our test results. Train RMSE and Test RMSE refer to the RMSE value calculated for the training dataset and testing dataset respectively.
\begin{tabular}{|c|c|c|c|c|c|c|c|c|}
\hline
No. & $k$ & $\alpha$ & $\gamma$ & $N$ & Epochs & Train RMSE & Test RMSE
\\ \hline
1 & $100$ & $1$ & $0.7$ & $65535$ & $9$ & $1.16384$ & $1.08611$
\\ \hline
2 & $300$ & $1$ & $0.7$ & $65535$ & $9$ & $1.16491$ & $1.08696$
\\ \hline
3 & $100$ & $0.1$ & $0.9$ & $65535$ & $9$ & $1.16536$ & $1.08816$
\\ \hline
4 & $100$ & $1$ & $0.7$ & $4096$ & $9$ & $1.16368$ & $1.08977$
\\ \hline
5 & $100$ & $1$ & $0.7$ & $1048576$ & $8$ & $5.14467$ & $2.23113$
\\ \hline
\end{tabular}
\begin{figure}
\includegraphics[width=0.5\textwidth]{screenshot20210415234153.png}
\caption{RMSE graph for model 1}
\end{figure}
\begin{figure}
\includegraphics[width=0.5\textwidth]{screenshot20210415234346.png}
\caption{RMSE graph for model 2}
\end{figure}
\begin{figure}
\includegraphics[width=0.5\textwidth]{screenshot20210415234450.png}
\caption{RMSE graph for model 3}
\end{figure}
\begin{figure}
\includegraphics[width=0.5\textwidth]{screenshot20210415234538.png}
\caption{RMSE graph for model 4}
\end{figure}
\begin{figure}
\includegraphics[width=0.5\textwidth]{screenshot20210415234604.png}
\caption{RMSE graph for model 5}
\end{figure}
\hspace{10em}
\subsection{Discussion}
When comparing the results of tuning the value of rank from 100 to 300, we can see the RMSE curves have a similar shape in general. After epoch 0, both MSE values drop significantly by around 4.5 units and then the values remain constant. The shape becomes smoother than previous results by tuning learning rate to 0.1 and $\gamma$ to 0.9. The curve of decreasing the batch size to 4096 has a similar shape with the first and second one but the extent of drop is much smaller (by 0.3). By increasing the batch size to 1048576, the RMSE drops constantly (remains a straight line) from $\text{mse} = 13.17$ to $\text{mse} = 5.144$.
Overall, the first one with (rank = 100, learning rate = 1.0, gamma = 0.7, batch size = 65535) has the lowest mse (mse = 1.163) after training. The highest belongs to the (rank = 100; learning rate = 1.0, gamma = 0.7, batch size = 1048576)
{\small
\bibliographystyle{ieee_fullname}
\bibliography{egbib}
}
\end{document}