Recap Gaussian Processes

GPs are flexible probabilistic models

\[f(\mathbf{X}) = \left[f(\mathbf{x}_1), f(\mathbf{x}_2), \dots, f(\mathbf{x}_n)\right]^T \sim \mathcal{N}\left(\boldsymbol{\mu}, K\right)\] \[K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j; \theta)\]

  • Predictions \(f_*\) at new point(s) \(\mathbf{x}^*\) with noise \(\sigma\) \[f_* \sim \mathcal{N}\left(\bar{f_*}, K_{f_*, f_*}\right)\] \[\bar{f_*} = \mu_{X_*} + K_{X_*, X}\left[K_{X,X}+\sigma^2 I\right]^{-1} \mathbf{y}\] \[K_{f_*,f_*} = K_{X_*, X_*} - K_{X_*, X}\left[K_{X,X}+\sigma^2 I\right]^{-1} K_{X,X_*}\]
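The predictive equations above translate almost line-by-line into NumPy. A minimal sketch; the RBF kernel, its hyperparameters, and the noise level are illustrative assumptions, not from the slides:

```python
import numpy as np

def rbf(A, B, lengthscale=0.5, variance=1.0):
    # Squared-exponential kernel k(x, z; theta); hyperparameters are illustrative.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_predict(X, y, Xs, noise=1e-3):
    # Posterior mean and covariance at test inputs Xs (zero prior mean assumed).
    K = rbf(X, X) + noise**2 * np.eye(len(X))   # K_{X,X} + sigma^2 I
    Ks = rbf(Xs, X)                             # K_{X*,X}
    Kss = rbf(Xs, Xs)                           # K_{X*,X*}
    mean = Ks @ np.linalg.solve(K, y)           # posterior mean \bar{f}_*
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)   # posterior covariance K_{f*,f*}
    return mean, cov
```

In practice one would factor \(K + \sigma^2 I\) once (Cholesky) rather than calling `solve` twice; this is exactly the cost discussed below.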

Recap Kernel Learning

Maximise the log marginal likelihood \[ \log p\left(\mathbf{y}|\theta\right) = -\frac{1}{2}\mathbf{y}^T\left(K_\theta+\sigma^2 I\right)^{-1}\mathbf{y}-\frac{1}{2}\log\left|K_\theta + \sigma^2 I\right| - \frac{n}{2}\log 2\pi\]

The rub

Calculation of

\[\left(K + \sigma^2 I\right)^{-1} \mathbf{y}\]

and

\[\log|K + \sigma^2 I|\]

is expensive for large datasets.

Cholesky factorisation: \(\mathcal{O}(n^3)\) flops, \(\mathcal{O}(n^2)\) storage

Prediction for a single point: mean \(\mathcal{O}(n)\), variance \(\mathcal{O}(n^2)\)
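Both expensive quantities fall out of one Cholesky factorisation. A sketch, with an illustrative noise level:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def cholesky_quantities(K, y, noise=0.1):
    # One O(n^3) factorisation of K + sigma^2 I serves both the solve and the log-det.
    A = K + noise**2 * np.eye(len(K))
    c, low = cho_factor(A)                    # O(n^3) flops, O(n^2) storage
    alpha = cho_solve((c, low), y)            # (K + sigma^2 I)^{-1} y in O(n^2)
    logdet = 2.0 * np.log(np.diag(c)).sum()   # log|K + sigma^2 I| from the factor
    return alpha, logdet
```

The \(\mathcal{O}(n^3)\) factorisation dominates; everything that follows in these notes is about avoiding it.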

Kernel Approximations: Inducing Point Methods

Idea: do the heavy lifting on a small subset of \(m\) inducing points and then generalise to the full kernel

\[U = \left[\mathbf{u_1}\mathbf{u_2}\dots \mathbf{u_m}\right]\]

SoR (Subset of Regressors)

\[\hat{k}_{SoR}(\mathbf{x},\mathbf{z}) = K_{\mathbf{x},U} K_{U,U}^{-1}K_{U,\mathbf{z}}\]

As \(K_{U,U}\) depends on the hyperparameters \(\theta\), kernel learning can be done directly on the approximate kernel.

If \(m \ll n\), the inversion is much faster, since only the \(m \times m\) matrix \(K_{U,U}\) needs to be inverted

Other methods (e.g. FITC) lead to full-rank approximations
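A sketch of the SoR kernel; the `rbf` kernel and jitter term are illustrative assumptions. A handy sanity check: with \(U = X\) the approximation is exact.

```python
import numpy as np

def rbf(A, B, lengthscale=0.5):
    # Illustrative squared-exponential kernel.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def sor_kernel(X, Z, U, jitter=1e-10):
    # \hat{k}_SoR(x, z) = K_{x,U} K_{U,U}^{-1} K_{U,z};
    # the jitter on the diagonal stabilises the m x m inverse.
    Kuu = rbf(U, U) + jitter * np.eye(len(U))
    return rbf(X, U) @ np.linalg.solve(Kuu, rbf(U, Z))
```

Only the \(m \times m\) matrix \(K_{U,U}\) is ever factored, which is the source of the speedup when \(m \ll n\).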

Kernel Approximations: Drawbacks

  • Loss of accuracy

  • Not necessarily cheap:

    Flops: \(\mathcal{O}(m^2n+m)\)

    Storage: \(\mathcal{O}(mn+m^2)\)

Speedup: Structure Exploitation

  • If data sits on a product space and the kernel factorises, calculations can be sped up

\[X \in \mathcal{X}_1 \times \cdots \times \mathcal{X}_P\] and \[k\left(\mathbf{x}_i, \mathbf{x}_j\right) = \prod_{p = 1}^P k_p\left(\mathbf{x}_i^p, \mathbf{x}_j^p\right)\]

The kernel matrices factorise into Kronecker products: \[K = K_1 \otimes K_2 \otimes \dots \otimes K_P\]

\[m = \prod_{p=1}^P n_p\] Useful properties:

\[K = \bigotimes_{p = 1}^{P} K_p \rightarrow K^{-1} = \bigotimes_{p = 1}^{P} K_p^{-1} \]
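The inverse-of-Kronecker identity is easy to verify numerically; the factor sizes below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_spd(n):
    # Random symmetric positive-definite matrix standing in for a kernel factor.
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

K1, K2 = random_spd(3), random_spd(4)
K = np.kron(K1, K2)                                    # full 12 x 12 matrix
Kinv = np.kron(np.linalg.inv(K1), np.linalg.inv(K2))   # only 3x3 and 4x4 inverses

assert np.allclose(K @ Kinv, np.eye(12))
```

Only the small factors are ever inverted or stored, which is where the \(\mathcal{O}(P m^{2/P})\) storage figure below comes from.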

Structure Exploitation

Kronecker Methods:

Costs: Flops: \(\mathcal{O}(Pm^{1+\frac{1}{P}})\) Storage: \(\mathcal{O}(P m^{2/P})\)

Drawbacks:

  • Needs a very specific data structure (virtual data can help)

  • Needs specific kernel structure (approximations usually okay)

  • Does get expensive in high dimensions

Toeplitz methods

  • If the data are evenly spaced on a 1D grid and the kernel is stationary, \(k(x, x') = k\left(x-x'\right)\), then \(K\) is constant along its diagonals (Toeplitz) and linear systems can be solved using the fast Fourier transform

  • Costs: Flops: \(\mathcal{O}(m\log m)\), Storage: \(\mathcal{O}(m)\)

  • Drawbacks: Data has to be 1D and evenly spaced
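The \(\mathcal{O}(m \log m)\) matrix–vector product behind this comes from embedding the Toeplitz matrix in a circulant one, which the FFT diagonalises. A sketch for a symmetric Toeplitz matrix given by its first column:

```python
import numpy as np

def toeplitz_matvec(col, v):
    # Multiply the symmetric Toeplitz matrix with first column `col` by `v`
    # via circulant embedding: the top m entries of the circulant product
    # equal the Toeplitz product.
    m = len(col)
    c = np.concatenate([col, col[-2:0:-1]])      # circulant first column, length 2m-2
    vp = np.concatenate([v, np.zeros(m - 2)])    # zero-pad v to the same length
    full = np.fft.ifft(np.fft.fft(c) * np.fft.fft(vp)).real
    return full[:m]
```

Combined with conjugate gradients, this fast matvec solves \(K^{-1}\mathbf{y}\) without ever forming \(K\) densely.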

Structured Kernel Interpolation (SKI)

  • Observation: putting the inducing points on a grid does not by itself do enough, as the cross-covariances \(K_{X,U}\) must still be calculated
  • Idea: approximate the cross-covariances using a fine but structured grid of \(\mathbf{u}\)s, with \(m \gg n\) rather than \(m \ll n\)
  • Interpolate locally between the grid points \(\mathbf{u}_\alpha, \mathbf{u}_\beta\) adjacent to \(\mathbf{x}_i\), for example: \[ \tilde{k}(\mathbf{x}_i, \mathbf{u}_j) = \omega_i k(\mathbf{u}_\alpha, \mathbf{u}_j) + (1-\omega_i)k(\mathbf{u}_\beta, \mathbf{u}_j)\]
  • \(\hat{K}_{X,U} = WK_{U,U}\), where \(W\), the matrix of interpolation weights \(\omega\), is sparse!
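A sketch of building \(W\) for linear interpolation on a regular 1D grid. The grid spacing is an illustrative assumption, and \(W\) is stored densely here for clarity; in practice it is a sparse matrix with two nonzeros per row:

```python
import numpy as np

def interpolation_weights(x, grid):
    # Each input x_i gets weights on its two neighbouring grid points
    # u_alpha (left) and u_beta (right): omega_i and (1 - omega_i).
    h = grid[1] - grid[0]
    left = np.clip(((x - grid[0]) / h).astype(int), 0, len(grid) - 2)
    omega = (x - grid[left]) / h                  # fraction of the way to the right point
    W = np.zeros((len(x), len(grid)))             # dense for clarity; sparse in practice
    rows = np.arange(len(x))
    W[rows, left] = 1.0 - omega
    W[rows, left + 1] = omega
    return W
```

Because each row sums to one and has only two nonzeros, multiplying by \(W\) costs \(\mathcal{O}(n)\) regardless of the grid size \(m\).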

SKI (cont.)

  • For greater accuracy, use cubic interpolation, i.e. 4 interpolation points per input.
  • \[\hat{K}_{X,X} \approx K_{X,U}K^{-1}_{U,U}K_{U,X} \approx W K_{U,U}K_{U,U}^{-1}K_{U,U}W^T = WK_{U,U}W^{T} =: K_{SKI}\]
  • The structure of \(K_{U,U}\) dictates the speed: computing \(K^{-1}_{SKI}\mathbf{y}\) for inference or \(\log |K_{SKI}|\) for structure learning has the same \(\mathcal{O}\) as the Kronecker or Toeplitz methods
  • Learning and inference therefore share the same computational complexity; there is already a speedup even without Kronecker/Toeplitz structure

SKI (cont.)

  • SKI is argued to be a general framework, in which approximations such as SoR/FITC correspond to approximating the true kernel via different grids and interpolation weights
  • The authors argue that the local interpolation in \(W\) has a regularising effect
  • The number of inducing points needed for a good representation depends on the expressiveness of the kernel being approximated
  • The loss in accuracy from local interpolation is counterbalanced by the much larger number of interpolation points.
  • GPs + sparse interpolation + Kronecker/Toeplitz = KISS-GP (Kernel Interpolation for Scalable Structured Gaussian Processes)

Experiments

Generate a large 1000x1000 covariance matrix from a known kernel by sampling \(x_i \sim \mathcal{N}(0, 25)\). The inputs are not evenly spaced.

  • Already 40 inducing points are enough

Comparison of Strategies

Different strategies for constructing \(W\): linear interpolation, cubic splines, k-means, full (dense) \(W\)

Speedup

Kernel Interpolation

SKI benefits greatly from the Kronecker/Toeplitz treatment; the large \(m\) allows genuine structure learning, rather than just smoothing

Reconstructing a 2D Kronecker kernel: KISS-GP needs 0.67 hours vs. 7.2 hours for FITC

  • Additional insight: SKI can help to sample from high-dimensional kernels!

Sound Reconstruction

A 1D problem, so the Toeplitz method applies and yields an extreme speedup

Summary

  • Structure learning and inference using GPs can become scalable to large data sets
  • Using more interpolation points than there are data points, but placing them cleverly and connecting them sparsely, gives dramatic speedups
  • The extra Kronecker/Toeplitz speedups only work in low dimensions (\(D < 5\))