% This version of CVPR template is provided by Ming-Ming Cheng.
% Please leave an issue if you found a bug:
% https://github.com/MCG-NKU/CVPR_Template.
\documentclass[final]{cvpr}
\usepackage{times}
\usepackage{epsfig}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}
% Include other packages here, before hyperref.
\usepackage{csquotes}
\usepackage{mathtools}
% If you comment hyperref and then uncomment it, you should delete
% egpaper.aux before re-running latex. (Or just hit 'q' on the first latex
% run, let it finish, and you should be clear).
\usepackage[pagebackref=true,breaklinks=true,colorlinks,bookmarks=false]{hyperref}
\def\cvprPaperID{****} % *** Enter the CVPR Paper ID here
\def\confYear{2021}
%\setcounter{page}{4321} % For final version only
\newcommand{\q}[1]{\enquote{#1}}
\newcommand{\Set}[1]{\left\{ #1 \right\}}
\DeclarePairedDelimiter{\norm}{\lVert}{\rVert}
\begin{document}
%%%%%%%%% TITLE
\title{
Assignment 2 - KNN Recommendation System \\~\\
\large{Team name: M2 Robo}
}
\author{
Chan Kwan Yin\\
3035466978 \\
Team leader
\and
Lee Chun Yin\\
3035469140\\
\and
Chiu Yu Ying\\
3035477630
}
\maketitle
\clearpage
\section{Background}
In the KNN ($k$-nearest neighbours) model, we assume that similar users will give close ratings on similar movies.
To predict the rating $\hat r_{ij}$ of user $i$ on movie $j$ ($1 \le i \le m, 1 \le j \le n$),
the $k$ most similar users ($n_1, \ldots, n_k$) to user $i$ are computed based on similar choices of ratings,
and $\hat r_{ij}$ is estimated based on the known values $r_{n_1j}, \ldots, r_{n_kj}$.
\subsection{Similarity (Nearness) metrics}
The similarity metric $d : \mathbb R^n \times \mathbb R^n \to \mathbb R$ used in the KNN model satisfies
\begin{equation} \label{eq:metric} \begin{cases}
d(x, y) = 0 &\iff x = y \\
d(x, y) > 0 &\iff x \ne y \\
d(x, y) = d(y, x) &\forall x, y
\end{cases} \end{equation}
While $d(i, j) < d(i, k)$ implies that $j$ is more similar to $i$ than $k$ is,
the metric need not satisfy the triangle inequality.
This enables the use of correlation coefficients in addition to common norm metrics.
\subsubsection{Pearson correlation}
Suppose user $u_i$ rates movies according to a distribution $R_i \sim (\mu_i,\sigma_i)$.
The correlation between user $u_i$ and another user $u_j$
is then the coefficient between the two distributions $R_i$ and $R_j$:
$$ \rho_{ij} = \frac{E[(R_i - \mu_i)(R_j - \mu_j)]}{\sigma_i \sigma_j} $$
To compute this in practice, we restrict attention to the $n$ movies that both users $u_i$ and $u_j$ have rated, and estimate the covariance and variances from the sample:
$$ E[(R_i-\mu_i)(R_j-\mu_j)] \approx \frac{1}{n} \sum_k (r_{ik} - \mu_i) (r_{jk} - \mu_j)$$
$$ \sigma_i \approx \sqrt{\frac{1}{n} \sum_k (r_{ik} - \mu_i)^2} , \quad
\sigma_j \approx \sqrt{\frac{1}{n} \sum_k (r_{jk} - \mu_j)^2} $$
Note that the range of the Pearson correlation coefficient is $[-1,1]$.
To make the coefficient satisfy the conditions in Equation~\ref{eq:metric}
and avoid negative values, we define
$$ d(i, j) = \frac12 \left(1 - \rho_{ij}\right) $$
such that $d$ has the range $[0, 1]$.
A smaller value implies higher similarity between users.
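For concreteness, a minimal numpy sketch of this metric follows; the function name \texttt{pearson\_distance} and its arguments are our own illustration, not part of any library.
\begin{verbatim}
import numpy as np

def pearson_distance(x, y):
    # x, y: 1-D arrays of the ratings the two users gave
    # to the movies they have in common
    rho = np.corrcoef(x, y)[0, 1]  # sample Pearson correlation
    return 0.5 * (1.0 - rho)       # map [-1, 1] to [0, 1]
\end{verbatim}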
\subsubsection{Spearman Correlation}
Unlike the Pearson correlation, which operates on the raw ratings,
the Spearman correlation first replaces each entry of a vector by its rank,
which makes the metric insensitive to monotone rescaling and reduces the effect of non-uniform rating distributions.
$$ \rho_{ij} = \frac{
\operatorname{Cov}({\operatorname{rank}_i, \operatorname{rank}_j})
}{
\sigma_{\operatorname{rank}_i} \sigma_{\operatorname{rank}_j}
} $$
Similar to Pearson Correlation, the metric based on Spearman correlation is defined as
$$ d(i, j) = \frac12 \left(1 - \rho_{ij}\right) $$
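A corresponding sketch, assuming \texttt{scipy} is available for the rank transform (ties receive average ranks):
\begin{verbatim}
import numpy as np
from scipy.stats import rankdata

def spearman_distance(x, y):
    # rank-transform, then reuse the Pearson-based
    # distance on the ranks
    rx, ry = rankdata(x), rankdata(y)
    rho = np.corrcoef(rx, ry)[0, 1]
    return 0.5 * (1.0 - rho)
\end{verbatim}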
\subsubsection{Euclidean Distance}
We take the $2$-norm squared:
$$ d(i, j) = \sum_{l=1}^n \left( r_{il} - r_{jl} \right)^2 $$
\subsubsection{Taxicab as a similarity metric}
We take the $1$-norm:
$$ d(i, j) = \sum_{l=1}^n \left| r_{il} - r_{jl} \right| $$
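Both norm-based metrics are one-liners in numpy; note that, unlike the correlation-based metrics, they are not bounded above:
\begin{verbatim}
import numpy as np

def euclidean_sq_distance(x, y):
    return np.sum((x - y) ** 2)    # squared 2-norm

def taxicab_distance(x, y):
    return np.sum(np.abs(x - y))   # 1-norm
\end{verbatim}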
\subsubsection{Cosine similarity}
This can be thought of as the cosine of the angle between the two (centred) vectors of ratings.
We define
$$ \rho(i, j) = \frac{\sum_{l=1}^n \tilde r_{il} \tilde r_{jl}}{ \norm{\tilde{\mathbf r}_{i\cdot}} \norm{\tilde{\mathbf r}_{j\cdot}} } $$
We further rescale the metric to the range $[0, 1]$ using the same strategy as in correlations:
$$ d(i, j) = \frac12 \left(1 - \rho(i, j)\right) $$
This metric is sensitive to the exact rating values (rather than just their relative differences);
a negative centred rating could imply that the user dislikes the movie.
Therefore, we propose two mechanisms for centring the ratings,
using $\tilde r_{ij} = r_{ij} - \overline{\mathbf r_{\cdot\cdot}}$
and $\tilde r_{ij} = r_{ij} - 3$ respectively,
where $3$ is the standard rating for \q{neutral}.
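A sketch covering both centring mechanisms; the \texttt{center} argument is our own parameterization:
\begin{verbatim}
import numpy as np

def cosine_distance(x, y, center):
    # center: the global mean rating, or the neutral score 3
    xc, yc = x - center, y - center
    rho = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))
    return 0.5 * (1.0 - rho)       # map [-1, 1] to [0, 1]
\end{verbatim}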
\subsection{Aggregation of ratings}
After the $k$ nearest neighbour users have been selected,
$\hat r_{ij}$ can be estimated by aggregating $\Set{ r_{n_1j}, r_{n_2j}, \ldots, r_{n_kj} }$.
A naive aggregation method is to simply take the arithmetic mean of these rating values
without considering other factors such as the actual similarity.
However, if few of the truly similar neighbours have rated the movie,
the prediction becomes dominated by the less similar ones.
We therefore believe it is important to weight each neighbour $v \in P(i)$,
the neighbourhood of user $i$, by a similarity weight $w_{iv}$
(e.g.\ $w_{iv} = 1 - d(i, v)$ for the correlation-based metrics):
$$ \hat r_{ij} = \mu_j + \frac{\sum_{v \in P(i)} w_{iv} (r_{vj}-\mu_j)}
{\sum_{v \in P(i)} \left| w_{iv} \right|} $$
where $\mu_j$ is the mean rating of movie $j$.
This formula increases the effect of more similar users on the rating prediction.
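A sketch of this weighted aggregation for a single (user, movie) pair; all names are our own illustration:
\begin{verbatim}
import numpy as np

def predict_rating(neighbour_ratings, weights, mu_j):
    # neighbour_ratings: r_{n_1 j}, ..., r_{n_k j}
    # weights: similarity weights w_{iv}
    # mu_j: mean rating of movie j
    num = np.sum(weights * (neighbour_ratings - mu_j))
    den = np.sum(np.abs(weights))
    return mu_j + num / den
\end{verbatim}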
\section{Technical Details}
The training data set provided by Netflix consists of more than 100 million ratings by 480189 users on 17770 movies.
Such a huge data set would consume a significant amount of training time and memory
($O(m^2 n)$, since a correlation matrix between users is to be constructed),
which is infeasible on the hardware available to us.
Therefore, only a subset of the data is used for evaluation.
To be specific, only the first $10000$ movies and the first $1000$ users that appear in the data set are considered.
\subsection{Data preprocessing}
The first 1000 movies are loaded into a numpy array with columns Movie ID, User ID and Rating.
The movie IDs and user IDs are re-indexed from 0 for ease of indexing.
Approximately $80\%$ of the data are then reformatted into a rating matrix $R \in \mathbb R^{m \times n}$ for training,
where $r_{ij}$ is the rating of user $i$ on movie $j$;
the rest are retained for performance evaluation.
Both the retained entries and the missing entries are imputed with the mean rating of the corresponding movie.
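A sketch of this preprocessing step, assuming \texttt{data} is the numpy array described above (IDs already re-indexed from 0) and \texttt{n\_users}, \texttt{n\_movies} are known:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
mask = rng.random(len(data)) < 0.8        # ~80% for training
train, test = data[mask], data[~mask]

R = np.full((n_users, n_movies), np.nan)
R[train[:, 1].astype(int),
  train[:, 0].astype(int)] = train[:, 2]  # r_{ij}: user i, movie j

# assumes every movie has at least one training rating
movie_mean = np.nanmean(R, axis=0)        # per-movie mean rating
R = np.where(np.isnan(R), movie_mean, R)  # impute missing entries
\end{verbatim}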
\subsection{Hyperparameter selection}
In this KNN model, there are three hyperparameters to be selected, namely
\begin{itemize}
\item Value of $k$
\item Similarity metric
\item Aggregation function
\end{itemize}
\subsubsection{Choice of $k$}
A large value of $k$ would reduce accuracy, as users with lower similarity are selected.
On the other hand, a small value of $k$ would be biased towards the few most similar users.
For the initial experiment, we only tested with $k=10$.
As a baseline model, we also set $k=m$, i.e.\
we evaluate the RMSE obtained by taking the plain arithmetic mean over all users.
\subsubsection{Choice of metric}
Since the ratings are discrete in nature,
little difference is expected between the different metrics,
with the exception of the cosine metric,
which is sensitive to the exact value.
\subsubsection{Choice of Aggregation function}
For every metric, we compute both the unweighted and weighted aggregates and compare the results.
\subsection{Predictive test set score (RMSE)}
The model is evaluated by computing the RMSE between predicted and actual rating values:
$$ \text{RMSE} = \sqrt{\sum_{(i, j) \in E} \frac{{(\hat r_{ij} - r_{ij})}^2}{\left| E \right|}} $$
where $E$ is the set of retained evaluation data.
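Given aligned arrays of predictions and ground truth over $E$, this is a one-liner:
\begin{verbatim}
import numpy as np

def rmse(predicted, actual):
    return np.sqrt(np.mean((predicted - actual) ** 2))
\end{verbatim}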
\section{Model performance}
The following tables exhaustively list our test results.
The RMSE of the baseline is $1.1977149830409672$.
\vspace{1em}
\begin{tabular}{| c | c | c |}
\hline
\multicolumn{3}{|c|}{\textbf{Euclidean}}\\
\hline
$k$ & \textbf{Aggregation function} & \textbf{RMSE}\\
\hline
$10$ & Unweighted mean & $1.1743212065844748$\\
\hline
$10$ & Weighted & $1.174185480169039$\\
\hline
\end{tabular}\\
\vspace{1em}
\begin{tabular}{| c | c | c |}
\hline
\multicolumn{3}{|c|}{\textbf{Taxicab}}\\
\hline
$k$ & \textbf{Aggregation function} & \textbf{RMSE}\\
\hline
$10$ & Unweighted mean & $1.1761011327320081$\\
\hline
$10$ & Weighted & $1.1758026291697672$\\
\hline
\end{tabular}\\
\vspace{1em}
\begin{tabular}{| c | c | c |}
\hline
\multicolumn{3}{|c|}{\textbf{Cosine with $3$-corrected ratings}}\\
\hline
$k$ & \textbf{Aggregation function} & \textbf{RMSE}\\
\hline
$10$ & Unweighted mean & $1.1807121450228732$\\
\hline
$10$ & Weighted & $1.1800525348636057$\\
\hline
\end{tabular}\\
\vspace{1em}
\begin{tabular}{| c | c | c |}
\hline
\multicolumn{3}{|c|}{\textbf{Cosine with mean-corrected ratings}}\\
\hline
$k$ & \textbf{Aggregation function} & \textbf{RMSE}\\
\hline
$10$ & Unweighted mean & $1.1064212349492457$\\
\hline
$10$ & Weighted & $1.104635214204885$\\
\hline
\end{tabular}\\
\vspace{1em}
\begin{tabular}{| c | c | c |}
\hline
\multicolumn{3}{|c|}{\textbf{Pearson}}\\
\hline
$k$ & \textbf{Aggregation function} & \textbf{RMSE}\\
\hline
$10$ & Unweighted mean & $1.1977283650557216$\\
\hline
$10$ & Weighted & $1.1977314956919407$\\
\hline
\end{tabular}\\
\vspace{1em}
\begin{tabular}{| c | c | c |}
\hline
\multicolumn{3}{|c|}{\textbf{Spearman}}\\
\hline
$k$ & \textbf{Aggregation function} & \textbf{RMSE}\\
\hline
$10$ & Unweighted mean & $1.1981137810682911$\\
\hline
$10$ & Weighted & $1.1981166551209421$\\
\hline
\end{tabular}
\vspace{1em}
\subsection{Discussion}
Most of the results are very close to, or even worse than, the baseline value.
\subsubsection{Cosine metric}
Among all metrics, the cosine metric with mean-corrected ratings
appears to have the best result.
This is probably because the cosine metric is effective in characterizing
how different a user is from the \q{normal} user;
for example, if a movie is rated with an average of $4$,
the contrast between ratings $3$ and $5$ is much more significant than
that between $1$ and $3$,
as reflected by the fact that the former pair differs by $0.707$ under the metric
while the latter differs by $0.242$.
\subsubsection{Aggregation method}
It is observed that the weighted average performs similarly to, or even worse than, the unweighted mean,
except in the case of the cosine metric.
One possible reason is that many neighbourhood users did not actually rate the evaluated movie,
so the weighting incorrectly gives extra significance to imputed data.
\section{Future work}
We noticed several problems in our design and plan to resolve them in the future weeks.
\subsection{Do not use imputed values for prediction}
Currently, missing data are imputed using the mean of the ratings on the same movie.
These imputed values do not actually represent opinions of similar users,
and they should not be considered.
We propose to aggregate only real ratings from neighbours,
and to fall back to the movie mean rating only if none of the $k$ neighbours rated the movie.
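A sketch of the proposed fallback, where \texttt{observed} marks which neighbour ratings are real rather than imputed (all names are our own):
\begin{verbatim}
import numpy as np

def aggregate_real_only(neighbour_ratings, observed, mu_j):
    # use only real ratings; fall back to the movie mean
    # when none of the k neighbours rated the movie
    if not np.any(observed):
        return mu_j
    return np.mean(neighbour_ratings[observed])
\end{verbatim}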
\subsection{Top $Q$ optimization}
Hong and Tsamis proposed considering only the neighbourhood that
intersects with the users with the top $Q < m$ number of ratings~\cite{Alpher01}.
They suggested that this improves performance as
the distance matrix can be reduced from $O(m^2)$ to $O(Qm)$.
We noticed that this also reduces the chance of the aforementioned problem
where a substantial proportion of the top $k$ neighbours did not rate the movie
whose rating is to be predicted.
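A sketch of how the candidate set would shrink under this scheme; \texttt{counts} and $Q$ are assumptions on our side:
\begin{verbatim}
import numpy as np

# counts[u]: number of ratings given by user u
top_q = np.argsort(counts)[-Q:]  # the Q most active users
# distances are then computed only from each user to these
# Q candidates, shrinking the matrix from m x m to m x Q
\end{verbatim}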
{\small
\bibliographystyle{ieee_fullname}
\bibliography{egbib}
}
\end{document}