-
Notifications
You must be signed in to change notification settings - Fork 48
/
Copy pathintro.tex
263 lines (205 loc) · 17.5 KB
/
intro.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
\input{common/header.tex}
\inbpdocument
\chapter{Introduction}
\label{ch:intro}
%\begin{quotation}
%``I only work on intractable nonparametrics - Gaussian processes don't count.'' \\
%\hspace*{\fill} Sinead Williamson, personal communication
%\end{quotation}
\begin{quotation}
``All models are wrong, but yours are stupid too.'' \\
\hspace*{\fill} \citet{mlhipster}
\end{quotation}
Prediction, extrapolation, and induction are all examples of learning a function from data.
There are many ways to learn functions, but one particularly elegant way is by probabilistic inference.
Probabilistic inference takes a group of hypotheses (a \emph{model}), and weights those hypotheses based on how well their predictions match the data.
This approach is appealing for two reasons.
First, keeping all hypotheses that match the data helps to guard against over-fitting.
Second, comparing how well a dataset is fit by different models gives a way of finding which sorts of structure are present in that data. % models having different types of structure is a way to find which sorts of structure are present in a dataset.
This thesis focuses on constructing models of functions.
%The types of structure examined in this thesis
Chapter \ref{ch:kernels} describes how to model functions having many different types of structure, such as additivity, symmetry, periodicity, changepoints, or combinations of these, using Gaussian processes (\gp{}s).
Chapters \ref{ch:grammar} and \ref{ch:description} show how such models can be automatically constructed from data, and then automatically described.
Later chapters explore several extensions of these models.
%will describe how to model functions having many different types of structure, such as additivity, symmetry, periodicity, changepoints, or combinations of these, using Gaussian processes (\gp{}s).
%To be able to learn a wide variety of structures, we would like to have an expressive language of models of functions.
%We would like to be able to represent simple kinds of functions, such as linear functions or polynomials.
%We would also like to have models of arbitrarily complex functions, specified in terms of high-level properties such as how smooth they are, whether they repeat over time, or which symmetries they have.
%Chapters 2 and 3 will
%
%All of these models of function can be constructed using Gaussian processes (\gp{}s).%, a tractable class of models of functions.
%
%This chapter will introduce the basic properties of \gp{}s.
%Chapter \ref{ch:kernels} will describe how to model these different types of structure using \gp{}s.
This short chapter introduces the basic properties of \gp{}s, and provides an outline of the thesis.
%Chapter \ref{ch:grammar} will show how searching over many
\section{Gaussian process models}
Gaussian processes are a simple and general class of models of functions.
%
\begin{figure}[t]
\begin{centering}
\begin{tabular}{cc}
\hspace{-3mm}\raisebox{2cm}{$f(x)$}
\includegraphics[width=0.46\textwidth]{\introfigsdir/fuzzy-0} &
\includegraphics[width=0.46\textwidth]{\introfigsdir/fuzzy-1} \\
\hspace{-3mm}\raisebox{2cm}{$f(x)$}
\includegraphics[width=0.46\textwidth]{\introfigsdir/fuzzy-2} &
\includegraphics[width=0.46\textwidth]{\introfigsdir/fuzzy-3} \\[-1mm]
$x$ & $x$
\end{tabular}
\end{centering}
\caption[A one-dimensional Gaussian process posterior]
{A visual representation of a Gaussian process modeling a one-dimensional function.
Different shades of red correspond to deciles of the predictive density at each input location.
Coloured lines show samples from the process -- examples of some of the hypotheses included in the model.
\emph{Top left:} A \gp{} not conditioned on any datapoints.
\emph{Remaining plots:} The posterior after conditioning on different amounts of data.
All plots have the same axes.
}
\label{fig:gp-post}
\end{figure}
%
%Figure \ref{fig:gp-post} shows a Gaussian process distribution, as it is conditioned on more and more observations.
To be precise, a \gp{} is any distribution over functions such that any finite set of function values $f(\vx_1), f(\vx_2), \ldots f(\vx_N)$ have a joint Gaussian distribution \citep[chapter 2]{rasmussen38gaussian}.
A \gp{} model, before conditioning on data, is completely specified by its mean function,
%
\begin{align}
%\mathbb{E}(f(x)) = \mu(x)
\expectargs{}{f(\vx)} = \mu(\vx)
\end{align}
%
and its covariance function, also called the \emph{kernel}:
%
\begin{align}
\textrm{Cov} \left[f(\vx), f(\vx') \right] = \kernel(\vx,\vx')
\end{align}
%
It is common practice to assume that the mean function is simply zero everywhere, since uncertainty about the mean function can be taken into account by adding an extra term to the kernel.
After accounting for the mean, the kind of structure that can be captured by a \gp{} model is entirely determined by its kernel.
The kernel determines how the model generalizes, or extrapolates to new data.
There are many possible choices of covariance function, and we can specify a wide range of models just by specifying the kernel of a \gp{}.
For example, linear regression, splines, and Kalman filters are all examples of \gp{}s with particular kernels.
However, these are just a few familiar examples out of a wide range of possibilities. %of the many possible models we can express through choosing a kernel.
One of the main difficulties in using \gp{}s is constructing a kernel which represents the particular structure present in the data being modelled.
\subsection{Model selection}
The crucial property of \gp{}s that allows us to automatically construct models is that we can compute the \emph{marginal likelihood} of a dataset given a particular model, also known as the \emph{evidence} \citep{mackay1992bayesian}.
The marginal likelihood allows one to compare models, balancing between the capacity of a model and its fit to the data~\citep{rasmussen2001occam,mackay2003information}.
%discover the appropriate amount of detail to use, due to Bayesian Occam's razor
%
%Choosing a kernel, or kernel parameters, by maximizing the marginal likelihood will typically result in selecting the \emph{least} flexible model which still captures all the structure in the data.
%For example, if a kernel has a parameter which controls the smoothness of the functions it models, the maximum-likelihood estimate of that parameter will usually correspond to the smoothest family of functions consistent with the observed function values.
%
The marginal likelihood under a \gp{} prior of a set of function values $\left[ f(\vx_1), f(\vx_2), \ldots f(\vx_N) \right] \defeq \vf(\vX)$ at locations $\vX$ is given by:
% given by the rows of $\vX$:
%
\begin{align}
p(\funcval | \vX, \mu(\cdot), k(\cdot, \cdot)) & = \N{\funcval}{\mu(\vX)}{k(\vX, \vX)} \label{eq:gp_marg_lik} \\
& = (2\pi)^{-\frac{N}{2}} \times \underwrite{ |k(\vX, \vX)|^{-\frac{1}{2}}\,}{\parbox{2.5cm}{\parbox{4cm}{\footnotesize controls model capacity}}} \nonumber \\
& \qquad \times \underwrite{\exp \left\{ -\frac{1}{2} \left( \funcval - \boldsymbol\mu(\vX) \right)\tra k(\vX, \vX)\inv \left(\funcval - \boldsymbol\mu(\vX) \right) \right\}}{\footnotesize encourages fit with data} \nonumber
\end{align}
%
This multivariate Gaussian density is referred to as the \emph{marginal} likelihood because it implicitly integrates (marginalizes) over all possible functions values $\vf( \bar \vX)$, where $\bar \vX$ is the set of all locations where we have not observed the function.
% Talk about the meaning of the different terms in the equation?
\subsection{Prediction}
%Even though we don't need to consider locations other than at the data when computing the marginal likelihood,
We can ask the model which function values are likely to occur at any location, given the observations seen so far.
By the formula for Gaussian conditionals (given in appendix~\ref{ch:appendix-gaussians}), the predictive distribution of a function value $f(\vx^\star)$ at a test point $\testpoint$ has the form:
%
\begin{align}
p( f(\testpoint) | \funcval, \vX, \mu(\cdot), k(\cdot, \cdot))
= \mathcal{N} \big( f(\testpoint) \given & \underwrite{\mu(\testpoint) + k(\testpoint, \vX) k(\vX, \vX)\inv \left( \funcval - \mu(\vX) \right) }{\footnotesize predictive mean follows observations}, \nonumber \\
& \underwrite{k(\testpoint, \testpoint) - k(\testpoint, \vX) k(\vX, \vX)\inv k(\vX, \testpoint)}{\footnotesize predictive variance shrinks given more data}
\big)
\label{eq:predictive}
\end{align}
These expressions may look complex, but only require a few matrix operations to evaluate.
Sampling a function from a \gp{} is also straightforward: a sample from a \gp{} at a finite set of locations is just a single sample from a single multivariate Gaussian distribution, given by \cref{eq:predictive}.
\Cref{fig:gp-post} shows prior and posterior samples from a \gp{}, as well as contours of the predictive density.
Our use of probabilities does not mean that we are assuming the function being learned is stochastic or random in any way; it is simply a consistent method of keeping track of uncertainty.
\subsection{Useful properties of Gaussian processes}
\label{sec:useful-properties}
There are several reasons why \gp{}s in particular are well-suited for building a language of regression models:
\begin{itemize}
\item {\bf Analytic inference.}
Given a kernel function and some observations, the predictive posterior distribution can be computed exactly in closed form.
This is a rare property for nonparametric models to have.
\item {\bf Expressivity.}
Through the choice of covariance function, we can express a wide range of modeling assumptions.
Some examples will be shown in \cref{ch:kernels}.
\item {\bf Integration over hypotheses.}
The fact that a \gp{} posterior, given a fixed kernel, lets us integrate exactly over a wide range of hypotheses means that overfitting is less of an issue than in comparable model classes.
For example, compared to neural networks, relatively few parameters need to be estimated, which lessens the need for the complex optimization or regularization schemes.
%
%In contrast, much of the neural network literature is devoted to techniques for regularization and optimization.
%TODO: add some citations.
\item {\bf Model selection.}
A side benefit of being able to integrate over all hypotheses is that we can compute the marginal likelihood of the data given a model.
This gives us a principled way of comparing different models.
\item {\bf Closed-form predictive distribution.}
The predictive distribution of a \gp{} at a set of test points is simply a multivariate Gaussian distribution.
This means that \gp{}s can easily be composed with other models or decision procedures.
%For example, in reinforcement learning applications,
\item {\bf Easy to analyze.}
It may seem unsatisfying to restrict ourselves to a limited model class, as opposed to trying to do inference in the set of all computable functions.
However, simple models can be used as well-understood building blocks for constructing more interesting models. %in diverse settings.
For example, consider linear models.
Although they form an extremely limited model class, they are simple, easy to analyze, and easy to incorporate into other models or procedures.
Gaussian processes can be seen as an extension of linear models which retain these attractive properties~\citep[chapter 2]{rasmussen38gaussian}.
%Linear models may seem like a hopelessly simple model class, but they're arguably the most useful modeling tools in existence.
%\citet{rasmussen38gaussian} give a thorough introduction to \gp{}s, including many different types of analysis of the properties of this model.
\end{itemize}
\subsection{Limitations of Gaussian processes}
There are several issues which make \gp{}s sometimes difficult to use:
\begin{itemize}
\item {\bf Slow inference.}
Computing the matrix inverse in \cref{eq:gp_marg_lik,eq:predictive} takes $\mathcal{O}(N^3)$ time, making exact inference prohibitively slow for more than a few thousand datapoints.
However, this problem can be addressed by approximate inference schemes~\citep{snelson2006sparse, quinonero2005unifying, hensman2013gaussian}.
%Most \gp{} software packages implement several of these methods
\item {\bf Light tails of the predictive distribution.}
The predictive distribution of a standard \gp{} model is Gaussian.
We may sometimes with to use non-Gaussian predictive likelihoods, for example in order to be robust to outliers, or to perform classification.
Using non-Gaussian likelihoods requires approximate inference.
Fortunately, mature software packages exist~\citep{GPy, GPML, VanRiiHarJylVeh14} which can automatically perform approximate inference for a wide variety of non-Gaussian likelihoods, and also implement sparse approximations.
%\item {\bf Symmetric}
%The predictive distribution of a \gp{} is always symmetric about its mean function.
%This means that \gp{} models are inappropriate for modeling non-negative functions, such as likelihoods.
%However, this problem can be addressed by simply exponentiating a \gp{} model, giving rise to a \emph{log-Gaussian process}.
\item {\bf The need to choose a kernel.}
The flexibility of \gp{} models raises the question of which kernel to use for a given problem.
%means that we are also faced with the difficult task of choosing a kernel.
Choosing a useful kernel is equivalent to learning a useful representation of the input.
Kernel parameters can be set automatically by maximizing the marginal likelihood, but until recently, human experts were required to choose the parametric form of the kernel. %, usually from a small set of standard kernels.
\Cref{ch:grammar} will show a way in which kernels can be automatically constructed for a given dataset.
\end{itemize}
\section{Outline and contributions of thesis}
The main contribution of this thesis is to develop a method to automatically model, visualize, and describe a variety of statistical structures in data, by searching through an open-ended language of regression models.
This thesis also includes a set of related results showing how Gaussian processes can be extended or composed with other models.
%Furthermore, the fact that the marginal likelihood is available means that we can evaluate how much evidence the data provides for one structure over another, allowing an automatic search to construct models for us.
Chapter~\ref{ch:kernels} is a tutorial showing how to build a wide variety of structured models of functions by constructing appropriate covariance functions.
%TODO: Expand this description.
We will also show how \gp{}s can produce nonparametric models of manifolds with diverse topological structures, such as cylinders, toruses and M\"obius strips.
Chapter~\ref{ch:grammar} shows how to search over an open-ended language of models, built by adding and multiplying different kernels.
Since we can evaluate each model by the marginal likelihood, we can automatically construct custom models for each dataset by a straightforward search procedure.
We will show how the nature of \gp{}s allow the resulting models to be visualized by decomposing them into diverse, interpretable components, each capturing a different type of structure.
Our experiments show that capturing such high-level structure sometimes allows one to extrapolate beyond the range of the data.
One benefit of using a compositional model class is that the resulting models are relatively interpretable.
Chapter~\ref{ch:description} demonstrates a system which automatically describes the structure implied by a given kernel on a given dataset, generating reports with graphs and English-language text describing the resulting model.
%We will show several automatic analyses of time-series.
Combined with the automatic model search developed in chapter~\ref{ch:grammar}, this system represents the beginnings of what could be called an ``automatic statistician'', capable of some aspects of model-building and explanation currently performed by experts.
Chapter~\ref{ch:deep-limits} analyzes deep neural network models by characterizing the prior over functions obtained by composing \gp{} priors to form \emph{deep Gaussian processes}.
%, and relates them to existing neep neural network architecures.
We show that, as the number of layers increase, the amount of information retained about the original input diminishes to a single degree of freedom.
A simple change to the network architecture fixes this pathology.
We relate these models to neural networks, and as a side effect derive several forms of \emph{infinitely deep kernels}.
Chapter~\ref{ch:additive} examines a more limited, but much faster way of discovering structure using \gp{}s.
Specifying a kernel having many different types of structure, we use kernel parameters to discard whichever types of structure are \emph{not} found in the current dataset.
The particular model class we examine is called \emph{additive Gaussian processes}, a model summing over exponentially-many \gp{}s, each depending on a different subset of the input variables.
We give a polynomial-time inference algorithm for this model, and relate it to other model classes.
For example, additive \gp{}s are shown to have the same covariance as a \gp{} that uses \emph{dropout}, a recently developed regularization technique for neural networks.
Chapter~\ref{ch:warped} develops a Bayesian clustering model in which the clusters have nonparametric shapes, called the infinite warped mixture model.
The density manifolds learned by this model follow the contours of the data density, and have interpretable, parametric forms in the latent space.
The marginal likelihood lets us infer the effective dimension and shape of each cluster separately, as well as the number of clusters.
\outbpdocument{
\bibliographystyle{plainnat}
\bibliography{references.bib}
}