Recap: Gaussian Processes

A Gaussian process is a distribution over functions with the defining property that any finite collection of function values is jointly Gaussian:

\[\left(\begin{array}{c} f(\mathbf{x}_1) \\ f(\mathbf{x}_2) \\ \vdots \\ f(\mathbf{x}_n)\end{array}\right) \sim \mathcal{N}\left(\boldsymbol{\mu}, K\right)\]

\[\boldsymbol{\mu}_i = \mathcal{E}\left[f(\mathbf{x}_i)\right], \hspace{1cm} K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) \]
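A minimal NumPy sketch of this defining property (my own illustration; a squared-exponential kernel and a zero mean are assumed for \(k\) and \(\boldsymbol{\mu}\)):

```python
import numpy as np

# Evaluating the GP at a finite set of inputs: by definition, the vector of
# function values is a single draw from the multivariate Gaussian N(mu, K).
X = np.linspace(0, 1, 50)[:, None]                         # inputs x_1, ..., x_n
mu = np.zeros(len(X))                                       # zero prior mean (assumed)
K = np.exp(-0.5 * (X - X.T) ** 2 / 0.1 ** 2)                # k(x_i, x_j), squared-exponential
f = np.random.default_rng(0).multivariate_normal(mu, K + 1e-8 * np.eye(len(X)))
```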

Recap: GP Prediction & Learning

Predictions of \(f\) at new point(s) \(X_*\) with observation noise \(\sigma\): \[f_* \sim \mathcal{N}\left(\bar{f}_*, K_{f_*, f_*}\right)\] \[\bar{f}_* = \mu_{X_*} + K_{X_*, X}\left[K_{X,X}+\sigma^2 I\right]^{-1} \left(\mathbf{y} - \mu_X\right)\] \[K_{f_*,f_*} = K_{X_*, X_*} - K_{X_*, X}\left[K_{X,X}+\sigma^2 I\right]^{-1} K_{X,X_*}\]

  • For learning the kernel hyperparameters \(\theta\), maximise the log marginal likelihood (up to constants) \[ \log p\left(\mathbf{y}|\theta\right) \sim - \mathbf{y}^T\left(K_\theta+\sigma^2 I\right)^{-1}\mathbf{y}-\log\left|K_\theta + \sigma^2 I\right|\] (see the sketch below)
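A minimal NumPy sketch of the prediction and learning formulas above (my own illustration; `rbf_kernel`, `gp_predict` and `log_marginal_likelihood` are hypothetical helpers, and a zero prior mean is assumed):

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """k(a, b) = variance * exp(-|a - b|^2 / (2 * lengthscale^2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def gp_predict(X, y, X_star, sigma=0.1):
    """Posterior mean and covariance at X_star (zero prior mean assumed)."""
    K = rbf_kernel(X, X) + sigma**2 * np.eye(len(X))        # K_{X,X} + sigma^2 I
    K_s = rbf_kernel(X_star, X)                             # K_{X*,X}
    mean = K_s @ np.linalg.solve(K, y)
    cov = rbf_kernel(X_star, X_star) - K_s @ np.linalg.solve(K, K_s.T)
    return mean, cov

def log_marginal_likelihood(X, y, sigma=0.1):
    """The learning objective above, up to additive constants."""
    K = rbf_kernel(X, X) + sigma**2 * np.eye(len(X))
    _, logdet = np.linalg.slogdet(K)
    return -y @ np.linalg.solve(K, y) - logdet
```

Hyperparameter learning would then maximise `log_marginal_likelihood` over the kernel parameters, e.g. by grid search or gradient ascent.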

Problem

“How can Gaussian processes possibly replace neural networks? Have we thrown the baby out with the bathwater?” David MacKay (1998)

  • Neural networks are complex models with many parameters and design choices, good at finding hierarchical relationships
  • Gaussian processes are probabilistic, non-parametric function approximators which use distances to the training datapoints
  • How can these complementary models be combined?

Observation: Equivalence of NN and GP

An infinitely wide single hidden layer neural network is equivalent to a Gaussian Process!

  • ???

Explanation

  • The statement is meant for a single output, i.e. \[[y^1(\mathbf{x}_1), \dots, y^1(\mathbf{x}_n)]^T \sim \mathcal{N}\left(\mu, K\right)\]

More confusion

  • If you see the single-output network as a function \(y = f(x)\), then that function is a Gaussian process with some kernel, and the kernel determines how strongly outputs at nearby inputs are correlated.
  • Gaussian processes are random, so where is the randomness in the NN? It is in the distribution over the weights (see the next slide).
  • Explanation in Deep Neural Networks as Gaussian Processes, Lee et al. (2018)

Central Limit theorem

  • Assumption: at initialisation the weights and biases are i.i.d. with finite variance \[y(\mathbf{x}) = b^1 + \sum_{j=1}^{N_1}W^1_{j}\,x_j^1(\mathbf{x})\] \[x_j^1(\mathbf{x}) = \phi\left(b_j^0+ \sum_{k=1}^{d_{\mathrm{in}}}W_{jk}^0\, x_k\right)\]
  • The hidden activations \(x_j^1(\mathbf{x})\) are i.i.d. across \(j\)
  • As \(N_1 \rightarrow \infty\), the output is a sum of infinitely many i.i.d. terms, so by the central limit theorem it is Gaussian, provided the weight variance is scaled as \(\sigma_W^2/N_1\) (see the sketch after this list)
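This limit can be checked empirically. The sketch below (my own illustration, not code from any of the papers) samples many random single-hidden-layer tanh networks and inspects the joint distribution of the outputs at two fixed inputs; as the width grows it approaches a bivariate Gaussian whose covariance is the induced kernel:

```python
import numpy as np

def random_network_outputs(x1, x2, width, n_samples=5000,
                           sigma_w=1.0, sigma_b=1.0, phi=np.tanh):
    """Sample (y(x1), y(x2)) for many random single-hidden-layer networks.

    Weights are scaled by sigma_w / sqrt(fan_in) so the output variance stays
    finite as the width N_1 grows.
    """
    d_in = len(x1)
    X = np.stack([x1, x2])                                 # shape (2, d_in)
    rng = np.random.default_rng(0)
    ys = np.empty((n_samples, 2))
    for s in range(n_samples):
        W0 = rng.normal(0, sigma_w / np.sqrt(d_in), (width, d_in))
        b0 = rng.normal(0, sigma_b, width)
        W1 = rng.normal(0, sigma_w / np.sqrt(width), width)
        b1 = rng.normal(0, sigma_b)
        hidden = phi(X @ W0.T + b0)                        # x_j^1(x), shape (2, width)
        ys[s] = hidden @ W1 + b1                           # y(x1), y(x2)
    return ys

# Empirical covariance across random networks ~ sigma_b^2 + sigma_W^2 C(x, x')
samples = random_network_outputs(np.ones(3), -np.ones(3), width=1000)
print(np.cov(samples.T))
```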

Central Limit theorem (cont.)

  • Any collection \(\{ y(\mathbf{x}_1) , \dots, y(\mathbf{x}_n) \}\) is therefore jointly Gaussian, with kernel \[K^y(x,x') = \mathcal{E}\left[y(x)y(x')\right]=\sigma_b^2+\sigma_W^2C(x,x')\] where \(C\) depends on the activation function and the weight/bias distribution.
  • Takeaway: there exists a kernel function on the input space, which can in principle be learned.
  • Input-output pairs of a wide trained neural network can be used to fit a Gaussian process that predicts how the network behaves in the vicinity of the training points
  • The function approximator therefore does not have to be a neural network

Adding depth

Rather than applying the Gaussian process kernel to the original input space, it is applied to the output of a network:

\[k\left(\mathbf{x}_i, \mathbf{x}_j\right) \rightarrow k\left(g(\mathbf{x}_i), g(\mathbf{x}_j)\right)\]

  • Two sets of parameters, the NN weights \(\omega\) and the kernel parameters \(\theta\), are jointly learned for the task by maximising \[\log p(\mathbf{y}|\gamma) \sim - \mathbf{y}^T \left(K_\gamma + \sigma^2 I\right)^{-1}\mathbf{y}-\log \left|K_\gamma + \sigma^2 I\right|\] with \(\gamma =\{ \omega, \theta\}\)
  • Training is just backprop through this joint objective (see the sketch below)
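A minimal PyTorch sketch of this joint objective (my own illustration; an RBF base kernel is used instead of the spectral mixture kernel, the class and method names are made up, and no pre-training or KISS-GP approximation is included):

```python
import torch
import torch.nn as nn

class DeepKernelGP(nn.Module):
    """Exact-GP negative log marginal likelihood with a kernel on g(x)."""

    def __init__(self, d_in, d_feat=2):
        super().__init__()
        # g: the deep feature extractor (parameters omega)
        self.g = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(),
                               nn.Linear(64, d_feat))
        # kernel parameter theta and the noise, stored as logs for positivity
        self.log_lengthscale = nn.Parameter(torch.zeros(()))
        self.log_noise = nn.Parameter(torch.tensor(-2.0))

    def kernel(self, Z1, Z2):
        # RBF kernel k(g(x_i), g(x_j)) on the learned features
        sq = (Z1.unsqueeze(1) - Z2.unsqueeze(0)).pow(2).sum(-1)
        return torch.exp(-0.5 * sq / torch.exp(self.log_lengthscale) ** 2)

    def neg_log_marginal_likelihood(self, X, y):
        Z = self.g(X)
        K = self.kernel(Z, Z) + torch.exp(self.log_noise) ** 2 * torch.eye(len(X))
        L = torch.linalg.cholesky(K)
        alpha = torch.cholesky_solve(y.unsqueeze(-1), L)
        # y^T (K + sigma^2 I)^{-1} y + log|K + sigma^2 I|, up to constants
        return y @ alpha.squeeze(-1) + 2 * torch.log(torch.diagonal(L)).sum()

# "Just backprop": omega and theta receive gradients from the same objective.
model = DeepKernelGP(d_in=5)
X, y = torch.randn(100, 5), torch.randn(100)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    model.neg_log_marginal_likelihood(X, y).backward()
    opt.step()
```

The exact Cholesky here costs \(\mathcal{O}(n^3)\); the KISS-GP structure used in the experiments is what brings this down to the scalable costs quoted in the summary.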

Adding Depth (cont.)

The figure is somewhat misleading, as the training points in the \(\infty\) (GP) layer are omitted; also, each output should have its own GP layer.

Kernels

  • Kernels should be expressive, though the abstraction away from the original input space makes interpretation hard; the learned encoder space, however, can still be meaningful.
  • Suggested kernel: spectral mixture (SM) base kernel, sketched below \[k\left(\mathbf{x},\mathbf{x}'\right) = \sum_{q=1}^Q a_q \frac{\left|\Sigma_q\right|^{1/2}}{\left(2\pi\right)^{D/2}}\mathrm{e}^{-\frac{1}{2}\left|\Sigma_q^{1/2}\left(\mathbf{x}-\mathbf{x'}\right)\right|^2}\cos\left\langle\mathbf{x}-\mathbf{x}', 2\pi \boldsymbol{\mu}_q\right\rangle\]
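A NumPy sketch of this kernel with diagonal \(\Sigma_q\) (my own illustration; the function name and parameter layout are made up, and in deep kernel learning \(a_q\), \(\Sigma_q\), \(\mu_q\) would be learned jointly with the network):

```python
import numpy as np

def spectral_mixture_kernel(X1, X2, weights, means, variances):
    """SM base kernel with diagonal Sigma_q.

    weights:   (Q,)    mixture weights a_q
    means:     (Q, D)  spectral means mu_q
    variances: (Q, D)  diagonal entries of Sigma_q
    """
    Q, D = means.shape
    tau = X1[:, None, :] - X2[None, :, :]                 # x - x', shape (n1, n2, D)
    K = np.zeros((len(X1), len(X2)))
    for q in range(Q):
        norm = np.sqrt(np.prod(variances[q])) / (2 * np.pi) ** (D / 2)
        quad = np.einsum('ijd,d->ij', tau**2, variances[q])   # |Sigma_q^{1/2}(x - x')|^2
        K += weights[q] * norm * np.exp(-0.5 * quad) * np.cos(2 * np.pi * (tau @ means[q]))
    return K

# Example: Q = 2 components on the 2-D features produced by the network g.
rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 2))
K = spectral_mixture_kernel(Z, Z, weights=np.ones(2),
                            means=rng.normal(size=(2, 2)),
                            variances=np.ones((2, 2)))
```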

Regression Experiments

  • Test on various regression data sets.
  • For speed, KISS-GP with Kronecker/Toeplitz structure is used
  • These methods are limited to low-dimensional kernel input spaces
  • The DNNs have architecture [d-1000-500-50-2] or [d-1000-1000-500-50-2], so the kernel operates on a suitably low-dimensional (2-D) feature space
  • For \(Q\), the number of SM mixture components, 4 is chosen for \(n < 10\)k and 6 otherwise
  • The deep kernels are pre-trained before the joint optimisation

Results

Olivetti

Task: predict the orientation of the faces

Olivetti results

Kernel finds structure

Mapping discontinuities

Kernels work best on smooth problems; the DNN can help deal with sharp features such as discontinuities

Summary

  • Deep kernels are drop-in replacements for regular kernels
  • A very powerful technique for regression problems
  • Scalable: \(\mathcal{O}(n)\) training and \(\mathcal{O}(1)\) prediction per test point (via KISS-GP)