Recap Gaussian Processes

GPs are flexible probabilistic models

\[f(\mathbf{X}) = \left[f(\mathbf{x}_1), f(\mathbf{x}_2), \dots, f(\mathbf{x}_n)\right]^T \sim \mathcal{N}\left(\boldsymbol{\mu}, K\right)\] \[K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j; \theta)\]

  • Predictions \(f_*\) at new point(s) \(\mathbf{x}^*\) with noise \(\sigma\) \[f_* \sim \mathcal{N}\left(\bar{f_*}, K_{f_*, f_*}\right)\] \[\bar{f_*} = \mu_{X_*} + K_{X_*, X}\left[K_{X,X}+\sigma^2 I\right]^{-1} \mathbf{y}\] \[K_{f_*,f_*} = K_{X_*, X_*} - K_{X_*, X}\left[K_{X,X}+\sigma^2 I\right]^{-1} K_{X,X_*}\]
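The predictive equations above translate almost line-by-line into NumPy. A minimal sketch; the RBF kernel, its hyperparameters, and the noise level are illustrative assumptions, not from the slides:

```python
import numpy as np

def rbf(A, B, lengthscale=0.5, variance=1.0):
    # Squared-exponential kernel k(x, z; theta); hyperparameters are illustrative.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_predict(X, y, Xs, noise=1e-3):
    # Posterior mean and covariance at test inputs Xs (zero prior mean assumed).
    K = rbf(X, X) + noise**2 * np.eye(len(X))   # K_{X,X} + sigma^2 I
    Ks = rbf(Xs, X)                             # K_{X*,X}
    Kss = rbf(Xs, Xs)                           # K_{X*,X*}
    mean = Ks @ np.linalg.solve(K, y)           # posterior mean \bar{f}_*
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)   # posterior covariance K_{f*,f*}
    return mean, cov
```

In practice one would factor \(K + \sigma^2 I\) once (Cholesky) rather than calling `solve` twice; this is exactly the cost discussed below.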

Recap Kernel Learning

Maximise the log marginal likelihood \[ \log p\left(\mathbf{y}|\theta\right) = -\frac{1}{2}\mathbf{y}^T\left(K_\theta+\sigma^2 I\right)^{-1}\mathbf{y}-\frac{1}{2}\log\left|K_\theta + \sigma^2 I\right| - \frac{n}{2}\log 2\pi\]

The rub

Calculation of

\[\left(K + \sigma^2 I\right)^{-1} \mathbf{y}\]

and

\[\log|K + \sigma^2 I|\]

is expensive for large datasets.

Cholesky factorisation: \(\mathcal{O}(n^3)\) flops, \(\mathcal{O}(n^2)\) storage

Prediction for a single point: mean \(\mathcal{O}(n)\), variance \(\mathcal{O}(n^2)\)
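Both expensive quantities fall out of one Cholesky factorisation. A sketch, with an illustrative noise level:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def cholesky_quantities(K, y, noise=0.1):
    # One O(n^3) factorisation of K + sigma^2 I serves both the solve and the log-det.
    A = K + noise**2 * np.eye(len(K))
    c, low = cho_factor(A)                    # O(n^3) flops, O(n^2) storage
    alpha = cho_solve((c, low), y)            # (K + sigma^2 I)^{-1} y in O(n^2)
    logdet = 2.0 * np.log(np.diag(c)).sum()   # log|K + sigma^2 I| from the factor
    return alpha, logdet
```

The \(\mathcal{O}(n^3)\) factorisation dominates; everything that follows in these notes is about avoiding it.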

Kernel Approximations: Inducing Point Methods

Idea: do the heavy lifting on a small subset of \(m\) inducing points and then generalise to the full kernel

\[U = \left[\mathbf{u_1}\mathbf{u_2}\dots \mathbf{u_m}\right]\]

SoR (Subset of Regressors)

\[\hat{k}_{SoR}(\mathbf{x},\mathbf{z}) = K_{\mathbf{x},U} K_{U,U}^{-1}K_{U,\mathbf{z}}\]

As \(K_{U,U}\) depends on the hyperparameters \(\theta\), kernel learning can be done directly on the approximate kernel.

If \(m \ll n\), the inversion is much faster, since only the \(m \times m\) matrix \(K_{U,U}\) needs to be inverted

Other methods (e.g. FITC) lead to full-rank approximations
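A sketch of the SoR kernel; the `rbf` kernel and jitter term are illustrative assumptions. A handy sanity check: with \(U = X\) the approximation is exact.

```python
import numpy as np

def rbf(A, B, lengthscale=0.5):
    # Illustrative squared-exponential kernel.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def sor_kernel(X, Z, U, jitter=1e-10):
    # \hat{k}_SoR(x, z) = K_{x,U} K_{U,U}^{-1} K_{U,z};
    # the jitter on the diagonal stabilises the m x m inverse.
    Kuu = rbf(U, U) + jitter * np.eye(len(U))
    return rbf(X, U) @ np.linalg.solve(Kuu, rbf(U, Z))
```

Only the \(m \times m\) matrix \(K_{U,U}\) is ever factored, which is the source of the speedup when \(m \ll n\).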

Kernel Approximations: Drawbacks

  • Loss of accuracy

  • Not necessarily cheap:

    Flops: \(\mathcal{O}(m^2n+m)\)

    Storage: \(\mathcal{O}(mn+m^2)\)

Speedup: Structure Exploitation

  • If data sits on a product space and the kernel factorises, calculations can be sped up

\[X \in \mathcal{X}_1 \times \cdots \times \mathcal{X}_P\] and \[k\left(\mathbf{x}_i, \mathbf{x}_j\right) = \prod_{p = 1}^P k_p\left(\mathbf{x}_i^p, \mathbf{x}_j^p\right)\]

The kernel matrices factorise into Kronecker products: \[K = K_1 \otimes K_2 \otimes \dots \otimes K_P\]

\[m = \prod_{p=1}^P n_p\] Useful properties:

\[K = \bigotimes_{p = 1}^{P} K_p \rightarrow K^{-1} = \bigotimes_{p = 1}^{P} K_p^{-1} \]
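The inverse-of-Kronecker identity is easy to verify numerically; the factor sizes below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_spd(n):
    # Random symmetric positive-definite matrix standing in for a kernel factor.
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

K1, K2 = random_spd(3), random_spd(4)
K = np.kron(K1, K2)                                    # full 12 x 12 matrix
Kinv = np.kron(np.linalg.inv(K1), np.linalg.inv(K2))   # only 3x3 and 4x4 inverses

assert np.allclose(K @ Kinv, np.eye(12))
```

Only the small factors are ever inverted or stored, which is where the \(\mathcal{O}(P m^{2/P})\) storage figure below comes from.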

Structure Exploitation

Kronecker Methods:

Costs: Flops: \(\mathcal{O}(Pm^{1+\frac{1}{P}})\) Storage: \(\mathcal{O}(P m^{2/P})\)

Drawbacks:

  • Needs a very specific data structure (virtual data can help)

  • Needs specific kernel structure (approximations usually okay)

  • Does get expensive in high dimensions

Toeplitz methods

  • If the data are evenly spaced on a 1D grid and the kernel is stationary, \(k(x, x') = k\left(x-x'\right)\), then \(K\) is constant along its diagonals (Toeplitz) and linear systems can be solved using the fast Fourier transform

  • Costs: Flops: \(\mathcal{O}(m\log m)\), Storage: \(\mathcal{O}(m)\)

  • Drawbacks: Data has to be 1D and evenly spaced
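The \(\mathcal{O}(m \log m)\) matrix–vector product behind this comes from embedding the Toeplitz matrix in a circulant one, which the FFT diagonalises. A sketch for a symmetric Toeplitz matrix given by its first column:

```python
import numpy as np

def toeplitz_matvec(col, v):
    # Multiply the symmetric Toeplitz matrix with first column `col` by `v`
    # via circulant embedding: the top m entries of the circulant product
    # equal the Toeplitz product.
    m = len(col)
    c = np.concatenate([col, col[-2:0:-1]])      # circulant first column, length 2m-2
    vp = np.concatenate([v, np.zeros(m - 2)])    # zero-pad v to the same length
    full = np.fft.ifft(np.fft.fft(c) * np.fft.fft(vp)).real
    return full[:m]
```

Combined with conjugate gradients, this fast matvec solves \(K^{-1}\mathbf{y}\) without ever forming \(K\) densely.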

Structured Kernel Interpolation (SKI)

  • Observation: putting the inducing points on a grid does not by itself do enough, as the cross-covariances \(K_{X,U}\) must still be calculated
  • Idea: approximate the cross-covariances using a fine but structured grid of \(\mathbf{u}\)s, with \(m \gg n\) rather than \(m \ll n\)
  • Interpolate locally between the grid points \(\mathbf{u}_\alpha, \mathbf{u}_\beta\) adjacent to \(\mathbf{x}_i\), for example: \[ \tilde{k}(\mathbf{x}_i, \mathbf{u}_j) = \omega_i k(\mathbf{u}_\alpha, \mathbf{u}_j) + (1-\omega_i)k(\mathbf{u}_\beta, \mathbf{u}_j)\]
  • \(\hat{K}_{X,U} = WK_{U,U}\), where \(W\), the matrix of interpolation weights \(\omega\), is sparse!
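A sketch of building \(W\) for linear interpolation on a regular 1D grid. The grid spacing is an illustrative assumption, and \(W\) is stored densely here for clarity; in practice it is a sparse matrix with two nonzeros per row:

```python
import numpy as np

def interpolation_weights(x, grid):
    # Each input x_i gets weights on its two neighbouring grid points
    # u_alpha (left) and u_beta (right): omega_i and (1 - omega_i).
    h = grid[1] - grid[0]
    left = np.clip(((x - grid[0]) / h).astype(int), 0, len(grid) - 2)
    omega = (x - grid[left]) / h                  # fraction of the way to the right point
    W = np.zeros((len(x), len(grid)))             # dense for clarity; sparse in practice
    rows = np.arange(len(x))
    W[rows, left] = 1.0 - omega
    W[rows, left + 1] = omega
    return W
```

Because each row sums to one and has only two nonzeros, multiplying by \(W\) costs \(\mathcal{O}(n)\) regardless of the grid size \(m\).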

SKI (cont.)

  • For greater accuracy, use cubic interpolation, i.e. 4 interpolation points per input.
  • \[\hat{K}_{X,X} \approx K_{X,U}K^{-1}_{U,U}K_{U,X} \approx W K_{U,U}K_{U,U}^{-1}K_{U,U}W^T = WK_{U,U}W^{T} =: K_{SKI}\]
  • The structure of \(K_{U,U}\) dictates the speed: computing \(K^{-1}_{SKI}\mathbf{y}\) for inference or \(\log |K_{SKI}|\) for structure learning has the same \(\mathcal{O}\) as the Kronecker or Toeplitz methods
  • Learning and inference therefore share the same computational complexity; there is already a speedup even without Kronecker/Toeplitz structure

SKI (cont.)

  • SKI is argued to be a general framework, in which approximations such as SoR/FITC correspond to approximating the true kernel via different grids and interpolation weights
  • The authors argue that the local interpolation in \(W\) has a regularising effect
  • The number of inducing points needed for a good representation depends on the expressiveness of the kernel being approximated
  • The loss in accuracy from local interpolation is counterbalanced by the much larger number of interpolation points.
  • GPs + sparse interpolation + Kronecker/Toeplitz = KISS-GP (Kernel Interpolation for Scalable Structured Gaussian Processes)

Experiments

Generate a large 1000x1000 covariance matrix from a known kernel by sampling \(x_i \sim \mathcal{N}(0, 25)\). The inputs are not evenly spaced.

  • Already 40 inducing points are enough

Comparison of Strategies

Different strategies for constructing \(W\): linear interpolation, cubic splines, k-means, full (dense) \(W\)

Speedup

Kernel Interpolation

SKI benefits greatly from the Kronecker/Toeplitz treatment; the large \(m\) allows genuine structure learning, rather than just smoothing

Reconstructing a 2D Kronecker kernel: KISS-GP needs 0.67 hours vs. 7.2 hours for FITC

  • Additional insight: SKI can help to sample from high-dimensional kernels!

Sound Reconstruction

A 1D problem, so the Toeplitz method applies and yields an extreme speedup

Summary

  • Structure learning and inference using GPs can become scalable to large data sets
  • Using more interpolation points than there are data points, but placing them cleverly and connecting them sparsely, gives dramatic speedups
  • The extra Kronecker/Toeplitz speedups only work in low dimensions (\(D < 5\))