Recap: Gaussian Processes

A Gaussian process is a distribution over functions with the defining property that any finite collection of function values is jointly Gaussian:

\[\left(\begin{array}{c} f(\mathbf{x}_1) \\ f(\mathbf{x}_2) \\ \vdots \\ f(\mathbf{x}_n)\end{array}\right) \sim \mathcal{N}\left(\boldsymbol{\mu}, K\right)\]

\[\boldsymbol{\mu}_i = \mathcal{E}\left[f(\mathbf{x}_i)\right], \hspace{1cm} K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) \]
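A minimal NumPy sketch of this defining property (my own illustration; a squared-exponential kernel and a zero mean are assumed for \(k\) and \(\boldsymbol{\mu}\)):

```python
import numpy as np

# Evaluating the GP at a finite set of inputs: by definition, the vector of
# function values is a single draw from the multivariate Gaussian N(mu, K).
X = np.linspace(0, 1, 50)[:, None]                         # inputs x_1, ..., x_n
mu = np.zeros(len(X))                                       # zero prior mean (assumed)
K = np.exp(-0.5 * (X - X.T) ** 2 / 0.1 ** 2)                # k(x_i, x_j), squared-exponential
f = np.random.default_rng(0).multivariate_normal(mu, K + 1e-8 * np.eye(len(X)))
```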

Recap: GP Prediction & Learning

Predictions of \(f\) at new point(s) \(X_*\) with observation noise \(\sigma\): \[f_* \sim \mathcal{N}\left(\bar{f}_*, K_{f_*, f_*}\right)\] \[\bar{f}_* = \mu_{X_*} + K_{X_*, X}\left[K_{X,X}+\sigma^2 I\right]^{-1} \left(\mathbf{y} - \mu_X\right)\] \[K_{f_*,f_*} = K_{X_*, X_*} - K_{X_*, X}\left[K_{X,X}+\sigma^2 I\right]^{-1} K_{X,X_*}\]

  • For learning the kernel hyperparameters \(\theta\), maximise the log marginal likelihood (up to constants) \[ \log p\left(\mathbf{y}|\theta\right) \sim - \mathbf{y}^T\left(K_\theta+\sigma^2 I\right)^{-1}\mathbf{y}-\log\left|K_\theta + \sigma^2 I\right|\] (see the sketch below)
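A minimal NumPy sketch of the prediction and learning formulas above (my own illustration; `rbf_kernel`, `gp_predict` and `log_marginal_likelihood` are hypothetical helpers, and a zero prior mean is assumed):

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """k(a, b) = variance * exp(-|a - b|^2 / (2 * lengthscale^2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def gp_predict(X, y, X_star, sigma=0.1):
    """Posterior mean and covariance at X_star (zero prior mean assumed)."""
    K = rbf_kernel(X, X) + sigma**2 * np.eye(len(X))        # K_{X,X} + sigma^2 I
    K_s = rbf_kernel(X_star, X)                             # K_{X*,X}
    mean = K_s @ np.linalg.solve(K, y)
    cov = rbf_kernel(X_star, X_star) - K_s @ np.linalg.solve(K, K_s.T)
    return mean, cov

def log_marginal_likelihood(X, y, sigma=0.1):
    """The learning objective above, up to additive constants."""
    K = rbf_kernel(X, X) + sigma**2 * np.eye(len(X))
    _, logdet = np.linalg.slogdet(K)
    return -y @ np.linalg.solve(K, y) - logdet
```

Hyperparameter learning would then maximise `log_marginal_likelihood` over the kernel parameters, e.g. by grid search or gradient ascent.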

Problem

“How can Gaussian processes possibly replace neural networks? Have we thrown the baby out with the bathwater?” David MacKay (1998)

  • Neural networks are complex models with many parameters and design choices, good at finding hierarchical relationships
  • Gaussian processes are probabilistic, non-parametric function approximators which use distances to the training datapoints
  • How can these complementary models be combined?

Observation: Equivalence of NN and GP

An infinitely wide single hidden layer neural network is equivalent to a Gaussian Process!

  • ???

Explanation

  • The statement is meant for a single output, i.e. \[[y^1(\mathbf{x}_1), \dots, y^1(\mathbf{x}_n)]^T \sim \mathcal{N}\left(\mu, K\right)\]

More confusion

  • If you see the single-output network as a function \(y = f(x)\), then that function is a Gaussian process with some kernel, and the kernel determines how strongly outputs at nearby inputs are correlated.
  • Gaussian processes are random, so where is the randomness in the NN? It is in the distribution over the weights (see the next slide).
  • Explanation in Deep Neural Networks as Gaussian Processes, Lee et al. (2018)

Central Limit theorem

  • Assumption: at initialisation the weights and biases are i.i.d. with finite variance \[y(\mathbf{x}) = b^1 + \sum_{j=1}^{N_1}W^1_{j}\,x_j^1(\mathbf{x})\] \[x_j^1(\mathbf{x}) = \phi\left(b_j^0+ \sum_{k=1}^{d_{\mathrm{in}}}W_{jk}^0\, x_k\right)\]
  • The hidden activations \(x_j^1(\mathbf{x})\) are i.i.d. across \(j\)
  • As \(N_1 \rightarrow \infty\), the output is a sum of infinitely many i.i.d. terms, so by the central limit theorem it is Gaussian, provided the weight variance is scaled as \(\sigma_W^2/N_1\) (see the sketch after this list)
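This limit can be checked empirically. The sketch below (my own illustration, not code from any of the papers) samples many random single-hidden-layer tanh networks and inspects the joint distribution of the outputs at two fixed inputs; as the width grows it approaches a bivariate Gaussian whose covariance is the induced kernel:

```python
import numpy as np

def random_network_outputs(x1, x2, width, n_samples=5000,
                           sigma_w=1.0, sigma_b=1.0, phi=np.tanh):
    """Sample (y(x1), y(x2)) for many random single-hidden-layer networks.

    Weights are scaled by sigma_w / sqrt(fan_in) so the output variance stays
    finite as the width N_1 grows.
    """
    d_in = len(x1)
    X = np.stack([x1, x2])                                 # shape (2, d_in)
    rng = np.random.default_rng(0)
    ys = np.empty((n_samples, 2))
    for s in range(n_samples):
        W0 = rng.normal(0, sigma_w / np.sqrt(d_in), (width, d_in))
        b0 = rng.normal(0, sigma_b, width)
        W1 = rng.normal(0, sigma_w / np.sqrt(width), width)
        b1 = rng.normal(0, sigma_b)
        hidden = phi(X @ W0.T + b0)                        # x_j^1(x), shape (2, width)
        ys[s] = hidden @ W1 + b1                           # y(x1), y(x2)
    return ys

# Empirical covariance across random networks ~ sigma_b^2 + sigma_W^2 C(x, x')
samples = random_network_outputs(np.ones(3), -np.ones(3), width=1000)
print(np.cov(samples.T))
```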

Central Limit theorem (cont.)

  • Any collection \(\{ y(\mathbf{x}_1) , \dots, y(\mathbf{x}_n) \}\) is therefore jointly Gaussian, with kernel \[K^y(x,x') = \mathcal{E}\left[y(x)y(x')\right]=\sigma_b^2+\sigma_W^2C(x,x')\] where \(C\) depends on the activation function and the weight/bias distribution.
  • Takeaway: there exists a kernel function on the input space, which can in principle be learned.
  • Input-output pairs of a wide trained neural network can be used to fit a Gaussian process that predicts how the network behaves in the vicinity of the training points
  • The function approximator therefore does not have to be a neural network

Adding depth

Rather than applying the Gaussian process kernel to the original input space, it is applied to the output of a network:

\[k\left(\mathbf{x}_i, \mathbf{x}_j\right) \rightarrow k\left(g(\mathbf{x}_i), g(\mathbf{x}_j)\right)\]

  • Two sets of parameters, the NN weights \(\omega\) and the kernel parameters \(\theta\), are jointly learned for the task by maximising \[\log p(\mathbf{y}|\gamma) \sim - \mathbf{y}^T \left(K_\gamma + \sigma^2 I\right)^{-1}\mathbf{y}-\log \left|K_\gamma + \sigma^2 I\right|\] with \(\gamma =\{ \omega, \theta\}\)
  • Training is just backprop through this joint objective (see the sketch below)
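A minimal PyTorch sketch of this joint objective (my own illustration; an RBF base kernel is used instead of the spectral mixture kernel, the class and method names are made up, and no pre-training or KISS-GP approximation is included):

```python
import torch
import torch.nn as nn

class DeepKernelGP(nn.Module):
    """Exact-GP negative log marginal likelihood with a kernel on g(x)."""

    def __init__(self, d_in, d_feat=2):
        super().__init__()
        # g: the deep feature extractor (parameters omega)
        self.g = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(),
                               nn.Linear(64, d_feat))
        # kernel parameter theta and the noise, stored as logs for positivity
        self.log_lengthscale = nn.Parameter(torch.zeros(()))
        self.log_noise = nn.Parameter(torch.tensor(-2.0))

    def kernel(self, Z1, Z2):
        # RBF kernel k(g(x_i), g(x_j)) on the learned features
        sq = (Z1.unsqueeze(1) - Z2.unsqueeze(0)).pow(2).sum(-1)
        return torch.exp(-0.5 * sq / torch.exp(self.log_lengthscale) ** 2)

    def neg_log_marginal_likelihood(self, X, y):
        Z = self.g(X)
        K = self.kernel(Z, Z) + torch.exp(self.log_noise) ** 2 * torch.eye(len(X))
        L = torch.linalg.cholesky(K)
        alpha = torch.cholesky_solve(y.unsqueeze(-1), L)
        # y^T (K + sigma^2 I)^{-1} y + log|K + sigma^2 I|, up to constants
        return y @ alpha.squeeze(-1) + 2 * torch.log(torch.diagonal(L)).sum()

# "Just backprop": omega and theta receive gradients from the same objective.
model = DeepKernelGP(d_in=5)
X, y = torch.randn(100, 5), torch.randn(100)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    model.neg_log_marginal_likelihood(X, y).backward()
    opt.step()
```

The exact Cholesky here costs \(\mathcal{O}(n^3)\); the KISS-GP structure used in the experiments is what brings this down to the scalable costs quoted in the summary.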

Adding Depth (cont.)

The figure is somewhat misleading, as the training points in the \(\infty\) (GP) layer are omitted; also, each output should have its own GP layer.

Kernels

  • Kernels should be expressive, though the abstraction away from the original input space makes interpretation hard; the learned encoder space, however, can still be meaningful.
  • Suggested kernel: spectral mixture (SM) base kernel, sketched below \[k\left(\mathbf{x},\mathbf{x}'\right) = \sum_{q=1}^Q a_q \frac{\left|\Sigma_q\right|^{1/2}}{\left(2\pi\right)^{D/2}}\mathrm{e}^{-\frac{1}{2}\left|\Sigma_q^{1/2}\left(\mathbf{x}-\mathbf{x'}\right)\right|^2}\cos\left\langle\mathbf{x}-\mathbf{x}', 2\pi \boldsymbol{\mu}_q\right\rangle\]
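A NumPy sketch of this kernel with diagonal \(\Sigma_q\) (my own illustration; the function name and parameter layout are made up, and in deep kernel learning \(a_q\), \(\Sigma_q\), \(\mu_q\) would be learned jointly with the network):

```python
import numpy as np

def spectral_mixture_kernel(X1, X2, weights, means, variances):
    """SM base kernel with diagonal Sigma_q.

    weights:   (Q,)    mixture weights a_q
    means:     (Q, D)  spectral means mu_q
    variances: (Q, D)  diagonal entries of Sigma_q
    """
    Q, D = means.shape
    tau = X1[:, None, :] - X2[None, :, :]                 # x - x', shape (n1, n2, D)
    K = np.zeros((len(X1), len(X2)))
    for q in range(Q):
        norm = np.sqrt(np.prod(variances[q])) / (2 * np.pi) ** (D / 2)
        quad = np.einsum('ijd,d->ij', tau**2, variances[q])   # |Sigma_q^{1/2}(x - x')|^2
        K += weights[q] * norm * np.exp(-0.5 * quad) * np.cos(2 * np.pi * (tau @ means[q]))
    return K

# Example: Q = 2 components on the 2-D features produced by the network g.
rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 2))
K = spectral_mixture_kernel(Z, Z, weights=np.ones(2),
                            means=rng.normal(size=(2, 2)),
                            variances=np.ones((2, 2)))
```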

Regression Experiments

  • Test on various regression data sets.
  • For speed, KISS-GP with Kronecker/Toeplitz structure is used
  • These methods are limited to low-dimensional kernel input spaces
  • The DNNs have architecture [d-1000-500-50-2] or [d-1000-1000-500-50-2], so the kernel operates on a suitably low-dimensional (2-D) feature space
  • For \(Q\), the number of SM mixture components, 4 is chosen for \(n < 10\)k and 6 otherwise
  • The deep kernels are pre-trained before the joint optimisation

Results

Olivetti

Task: predict the orientation of the faces

Olivetti results

Kernel finds structure

Mapping discontinuities

Kernels work best on smooth problems; the DNN can help deal with sharp features such as discontinuities

Summary

  • Deep kernels are drop-in replacements for regular kernels
  • A very powerful technique for regression problems
  • Scalable: \(\mathcal{O}(n)\) training and \(\mathcal{O}(1)\) prediction per test point (via KISS-GP)