Review Gaussians

Gaussians are the simplest stable probability distribution with infinite support

\[ \log p = C_1 x^2 + C_2 x + \mbox{const} = - A \left(x - \mu_x\right)^2 + \mbox{const}\]

where \(A > 0.\)
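
Completing the square makes the identification explicit (a small worked step in the notation above, with \(C_1 < 0\)):

\[ C_1 x^2 + C_2 x = C_1\left(x + \frac{C_2}{2 C_1}\right)^2 - \frac{C_2^2}{4 C_1}, \qquad A = -C_1, \quad \mu_x = -\frac{C_2}{2 C_1}. \]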

Other reasons:

  • The sum of many independent random variables with finite variance is approximately Gaussian in the asymptotic limit (central limit theorem).
  • A Gaussian also maximises the entropy for a given mean and variance.

The natural extension

A two-dimensional Gaussian could look like this

\[\log p(x_1, x_2) = -\left(x_1, x_2\right) \left(\begin{array}{cc} C_{11} & C_{12} \\ C_{21} & C_{22}\end{array}\right) \left(\begin{array}{c} x_1 \\ x_2 \end{array}\right)+ \mbox{const}\] which really is a product of two one-dimensional Gaussians in a rotated basis \[\log p(x_1, x_2) = - A_1 x'^2_1 - A_2 x'^2_2 + \mbox{const}\]

  • Note that the \(C\) matrix is not arbitrary: it must have strictly positive eigenvalues, otherwise the probability distribution blows up (or is flat along some direction) and cannot be normalised.
  • This property is called positive-definiteness.

One, two, many

Naturally this can be extended to more variables; these distributions are called multivariate Gaussians

\[\log p(\hat{x}) \sim -\left(x_1, \dots, x_n\right) \left(\begin{array}{ccc} C_{11} & \dots & C_{1n} \\ \vdots & \ddots & \vdots \\ C_{n1} & \dots & C_{nn}\end{array}\right) \left(\begin{array}{c} x_1\\ \vdots \\ x_n \end{array}\right)\]

  • The eigenvalues have to be positive, but apart from that the coefficients can be chosen freely (many degrees of freedom)

Properties 1

Multivariate Gaussians have some wonderful mathematical properties:

Knowing the covariance matrix \(\Sigma\) and the mean vector \(\hat{\mu}\) is enough to write down the pdf

\[ p(\hat{x}) = \frac{1}{\sqrt{\left(2\pi\right)^d \left|\Sigma\right|}} \exp\left[-\frac{1}{2} \left(\hat{x}-\hat{\mu}\right)^T \Sigma^{-1} \left(\hat{x} - \hat{\mu}\right) \right] \]

This means there are simple sufficient statistics.
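
A minimal sketch in Python of what this means in practice: the sample mean and sample covariance, computed from hypothetical data, are all we need to evaluate the fitted pdf.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Sketch: the sample mean and sample covariance are sufficient statistics,
# i.e. all we need to write down and evaluate the fitted Gaussian pdf.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))          # hypothetical data: n = 1000 points, d = 3

mu_hat = X.mean(axis=0)                 # empirical mean vector
Sigma_hat = np.cov(X, rowvar=False)     # empirical covariance matrix

# Evaluate the multivariate Gaussian log density at a new point
x_new = np.zeros(3)
print(multivariate_normal(mean=mu_hat, cov=Sigma_hat).logpdf(x_new))
```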

Properties 2

  • Marginalising over (ignoring) any subset of variables leads to another multivariate Gaussian
  • If \(\hat{x} = \left(x_1, x_2\right),\) then integrating over \(x_2\) leads to a multivariate normal for \(x_1\) \[\int \exp\left(-\left(x_1, x_2 \right)\left(\begin{array}{cc} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{array}\right)^{-1} \left(\begin{array}{c} x_1 \\ x_2 \end{array}\right) \right) d x_2 \sim \exp\left(- x_1^T \Sigma^{-1}_{11} x_1 \right). \]
  • So even if you do not know all variables, the ones you have access to are still Gaussian. We can just pick out the blocks of the covariance matrix we care about, provided the remaining variables are left unobserved (a small sketch follows after this list).
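
A minimal sketch of the marginalisation property, with an illustrative random covariance: the marginal over a subset of variables is read off directly from the corresponding blocks.

```python
import numpy as np

# Sketch of the marginalisation property: with a block covariance, the
# marginal of x1 is N(mu_1, Sigma_11); we simply pick out the blocks.
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T + 1e-6 * np.eye(4)      # random positive-definite covariance
mu = np.arange(4.0)

keep = [0, 1]                           # the variables we keep (x1)
mu_1 = mu[keep]
Sigma_11 = Sigma[np.ix_(keep, keep)]    # marginal covariance = the (1,1) block

# Empirical check: samples of the full Gaussian, restricted to `keep`,
# reproduce this mean and covariance up to sampling noise.
samples = rng.multivariate_normal(mu, Sigma, size=200_000)[:, keep]
print(np.allclose(samples.mean(axis=0), mu_1, atol=0.05))
print(np.allclose(np.cov(samples, rowvar=False), Sigma_11, atol=0.05))
```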

Properties 3

If the other variables \(\hat{x}_2\) are determined, then the distribution for \(\hat{x}_1\) is still Gaussian, but with shifted mean and new covariance.

  • \[ \log p(\hat{x}_1) = - \left(\hat{x}_1 - \hat{\mu} \right)^T \Sigma^{-1} \left(\hat{x}_1 - \hat{\mu} \right) + \mbox{const} \] where \[\hat{\mu} = \hat{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}\left(\hat{x}_2 - \hat{\mu}_2\right)\] \[\Sigma = \Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.\]
  • Variance (uncertainty) and expected values depend on our knowledge.
  • It is the simplest probabilistic model we can imagine (a conditioning sketch follows after this list).
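
A minimal sketch of these conditioning formulas, with illustrative block matrices:

```python
import numpy as np

# Sketch of Gaussian conditioning: given observed x2, the conditional of x1
# is Gaussian with the shifted mean and reduced covariance written above.
def condition(mu1, mu2, S11, S12, S22, x2_obs):
    """Return mean and covariance of p(x1 | x2 = x2_obs)."""
    gain = S12 @ np.linalg.inv(S22)     # Sigma_12 Sigma_22^{-1}
    mu_cond = mu1 + gain @ (x2_obs - mu2)
    Sigma_cond = S11 - gain @ S12.T     # Sigma_21 = Sigma_12^T
    return mu_cond, Sigma_cond

# Illustrative 2+1 split of a three-dimensional Gaussian
mu1, mu2 = np.zeros(2), np.zeros(1)
S11, S22 = np.eye(2), np.array([[1.0]])
S12 = np.array([[0.5], [0.3]])
print(condition(mu1, mu2, S11, S12, S22, x2_obs=np.array([2.0])))
```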

Application of Gaussian Multivariates

In a large data set, the features are often approximately jointly Gaussian

\[\left(\begin{array}{ccccc} x_{11} & x_{12} & x_{13} & \mathrm{NA} & x_{15} \\ \mathrm{NA} & x_{22} & x_{23} & x_{24} & x_{25} \\ \vdots & & \ddots & & \vdots \\ x_{n1} & x_{n2} & x_{n3} & x_{n4} & x_{n5}\end{array} \right)\]

Many algorithms do not work with missing values. Instead of imputing the mean or median, one can infer the maximum-likelihood value of a missing entry from the entries that are available for that data point, using the empirical mean and covariance of the multivariate Gaussian (Zoubin, Deutsch, NIPS93); a sketch follows below.
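
A minimal sketch of this idea (a simplification of the full EM treatment; \(\mu\) and \(\Sigma\) are assumed to be empirical estimates, e.g. from the complete rows):

```python
import numpy as np

# Sketch of maximum-likelihood imputation: each missing entry of a row is
# replaced by the Gaussian conditional mean given the observed entries of
# that row. mu and Sigma are assumed to be empirical estimates.
def impute_row(x, mu, Sigma):
    """Fill NaNs in x with the conditional mean given the observed entries."""
    miss = np.isnan(x)
    obs = ~miss
    if not miss.any():
        return x
    S_mo = Sigma[np.ix_(miss, obs)]
    S_oo = Sigma[np.ix_(obs, obs)]
    x_filled = x.copy()
    x_filled[miss] = mu[miss] + S_mo @ np.linalg.solve(S_oo, x[obs] - mu[obs])
    return x_filled
```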

Gaussian Processes

  • So far, the matrix \(\Sigma\) is very information-rich: every time a new value \(x_i\) is added to the multivariate Gaussian, a new positive eigenvalue appears, which could take any value. Too much information (TMI) to specify freely.

Gaussian Processes 2

  • Assume that all the \(f(x_i)\) are of the same type (say temperature) and live on some index set (for example the spatial coordinates on a map).
  • When the covariance function is a function of the underlying index set, we have a Gaussian process \[ \Sigma_{ij} = k(x_i, x_j) \]
  • These functions have to be chosen such that \(\Sigma\) is still positive-semidefinite. Such functions are called kernel functions (a sampling sketch follows after this list).
  • \[\left[f(x_1)\dots f(x_n)\right]^T \sim \mathcal{N}\left(\vec{\mu}, \Sigma(\{x\})\right)\]
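
A minimal sketch of this definition: build \(\Sigma_{ij} = k(x_i, x_j)\) on a grid of inputs and draw one joint sample (the squared-exponential kernel used here is just an illustrative choice).

```python
import numpy as np

# Sketch of the GP definition: build Sigma_ij = k(x_i, x_j) on an index set
# and draw one joint sample of f(x_1), ..., f(x_n). The kernel passed in
# below is an illustrative squared exponential; any valid kernel would do.
def gp_sample(xs, kernel, rng):
    Sigma = np.array([[kernel(xi, xj) for xj in xs] for xi in xs])
    Sigma += 1e-9 * np.eye(len(xs))     # jitter for numerical stability
    return rng.multivariate_normal(np.zeros(len(xs)), Sigma)

rng = np.random.default_rng(0)
xs = np.linspace(0.0, 5.0, 50)
f = gp_sample(xs, lambda a, b: np.exp(-0.5 * (a - b) ** 2), rng)
```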

Kernel examples 1

  • The index set is generally multi-dimensional

The Exponentiated Quadratic \[k_{SE}(x,x') = \sigma^2 \exp\left( -\frac{(x - x')^2}{2 l^2} \right)\]

Kernel Examples 2

The Ornstein-Uhlenbeck process

\[k_{OU}(x,x') = \sigma^2 \exp\left( -\frac{|x - x'|}{2 l} \right)\]

Kernel Examples 3

The periodic kernel

\[k_{per} (x, x') = \sigma^2 \exp\left(- \frac{2 \sin^2\left(\pi |x-x'|/p\right)}{l^2}\right)\]

Many other kernels are possible; they do not have to act on spatial coordinates, e.g. string kernels in genetics.

Generally kernels have hyperparameters: in the examples above, the length scale \(l\), the amplitude \(\sigma^2\) and the period \(p\) (implementations are sketched below).
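
Minimal sketches of the three kernels above, with their hyperparameters exposed as arguments:

```python
import numpy as np

# Sketches of the three kernels above, with their hyperparameters
# (length scale l, amplitude sigma2, period p) exposed as arguments.
def k_se(x, xp, sigma2=1.0, l=1.0):
    """Exponentiated quadratic (squared exponential) kernel."""
    return sigma2 * np.exp(-(x - xp) ** 2 / (2.0 * l ** 2))

def k_ou(x, xp, sigma2=1.0, l=1.0):
    """Ornstein-Uhlenbeck (exponential) kernel."""
    return sigma2 * np.exp(-np.abs(x - xp) / (2.0 * l))

def k_per(x, xp, sigma2=1.0, l=1.0, p=1.0):
    """Periodic kernel."""
    return sigma2 * np.exp(-2.0 * np.sin(np.pi * np.abs(x - xp) / p) ** 2 / l ** 2)
```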

Kernel Fitting

By maximising the (marginal) log likelihood, one can find optimal kernel parameters \(\theta\)

\[\log p(D \mid \theta) = - \frac{1}{2}\hat{x}^T \Sigma^{-1}_\theta \hat{x} - \frac{1}{2}\log\left|\Sigma_\theta\right| + \mbox{const} \]

The \(\log\left|\Sigma_\theta\right|\) term depends on \(\theta\) and automatically penalises kernel complexity. See also Automatic Relevance Determination (ARD).
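
A minimal sketch of such a fit, assuming an SE kernel with hyperparameters \(\theta = (\log\sigma^2, \log l)\), a small noise term for stability, and illustrative data:

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of kernel fitting: maximise the GP log marginal likelihood over the
# hyperparameters theta = (log sigma2, log l) of an SE kernel. The data and
# the small noise term are illustrative.
def neg_log_marginal_likelihood(theta, x, y, noise=1e-2):
    sigma2, l = np.exp(theta)
    Sigma = sigma2 * np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / l ** 2)
    Sigma += noise * np.eye(len(x))
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * y @ np.linalg.solve(Sigma, y) + 0.5 * logdet

x = np.linspace(0.0, 5.0, 30)
y = np.sin(x) + 0.1 * np.random.default_rng(0).normal(size=30)
res = minimize(neg_log_marginal_likelihood, x0=np.zeros(2), args=(x, y))
print(np.exp(res.x))                    # fitted (sigma2, l)
```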

Quick summary

  • Gaussian Processes are non-parametric models

  • Predictions at a new point \(x^*\) take into account all other \(n\) points

  • Predictions themselves are Gaussian, with an \(x^*\)-dependent mean and variance (a prediction sketch follows below)
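
A minimal sketch of GP prediction, assuming noise-free observations and an SE kernel for brevity:

```python
import numpy as np

# Sketch of GP prediction: the posterior at new points x* is Gaussian, with a
# mean and variance obtained by conditioning on all n training points.
# Noise-free observations and an SE kernel are assumed for brevity.
def gp_predict(x_train, y_train, x_star, sigma2=1.0, l=1.0, jitter=1e-8):
    def k(a, b):
        return sigma2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / l ** 2)
    K = k(x_train, x_train) + jitter * np.eye(len(x_train))
    k_star = k(x_train, x_star)
    mean = k_star.T @ np.linalg.solve(K, y_train)
    var = np.diag(k(x_star, x_star)) - np.einsum('ij,ij->j', k_star, np.linalg.solve(K, k_star))
    return mean, var

x_tr = np.linspace(0.0, 5.0, 20)
mean, var = gp_predict(x_tr, np.sin(x_tr), x_star=np.array([2.5, 6.0]))
```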

Gaussian Process Regression

What happens if we have related fields?

Gaussian Process Regression Networks

The model:

\[\hat{y} = \tilde{W}(x) \left(\hat{f}(x) + \sigma_f \hat{\epsilon}\right) + \sigma_y \hat{z}\]

  • The output \(\hat{y}\) is \(p\)-dimensional
  • There are \(q\) internal fields \(\hat{f}\) that are independent Gaussian processes
  • The matrix \(\tilde{W}\) translates between the latent fields \(\hat{f}\) and the observables \(\hat{y}\)
  • The matrix entries \(W_{ij}\) are also independent Gaussian processes, so the way output and latent field are related may change in space
  • All \(q + p\times q\) Gaussian processes have inferable kernel parameters \(\theta_w, \theta_f\) and noise scale \(\sigma_f\), as well as the output variance \(\sigma_y\) (a generative sketch follows after this list)
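
A minimal generative sketch of this model on a 1-D input grid; all kernels are squared exponential and all settings are illustrative, not the paper's:

```python
import numpy as np

# Generative sketch of a GPRN on a 1-D input grid: q latent GPs f_j(x) and
# p*q weight GPs W_ij(x) are drawn, then combined as
# y(x) = W(x) (f(x) + sigma_f * eps) + sigma_y * z.
# All kernels are squared exponential and all settings are illustrative.
rng = np.random.default_rng(0)
n, p, q = 100, 3, 2
sigma_f, sigma_y = 0.1, 0.05
x = np.linspace(0.0, 5.0, n)

def se_cov(x, l):
    return np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / l ** 2) + 1e-8 * np.eye(len(x))

f = rng.multivariate_normal(np.zeros(n), se_cov(x, l=0.5), size=q)          # (q, n)
W = rng.multivariate_normal(np.zeros(n), se_cov(x, l=2.0), size=(p, q))     # (p, q, n)

noisy_f = f + sigma_f * rng.normal(size=f.shape)
y = np.einsum('pqn,qn->pn', W, noisy_f) + sigma_y * rng.normal(size=(p, n))  # (p, n)
```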

Properties

  • Division into signal and noise
  • Signal transmitted through \(q\) channels \[k_{y_i}(x, x') = \sum_{j = 1}^{q}W_{ij}(x) \left[k_{f_j}(x,x')+\sigma_f^2\right]W_{ij}(x') + \sigma_y^2\]
  • As \(W\) is trained on data, covariances for new points \(\hat{y}(x^*)\) depend not only on distances to other points, but on the data directly (a sketch of this induced covariance follows after this list)
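
A minimal numerical sketch of the induced covariance above for one output dimension \(i\); the weight functions and latent kernels used here are illustrative:

```python
import numpy as np

# Sketch of the induced covariance for output i: sum over the q channels of
# W_ij(x) [k_fj(x, x') + sigma_f^2] W_ij(x'), plus the output noise.
# The weight functions and latent kernels below are illustrative.
def k_y(i, x, xp, W, k_f, sigma_f, sigma_y, q):
    """W(i, j, x) -> weight value; k_f[j](x, xp) -> latent kernel value."""
    total = sum(W(i, j, x) * (k_f[j](x, xp) + sigma_f ** 2) * W(i, j, xp)
                for j in range(q))
    return total + sigma_y ** 2

k_f = [lambda a, b: np.exp(-0.5 * (a - b) ** 2),
       lambda a, b: np.exp(-0.5 * (a - b) ** 2 / 4.0)]
W = lambda i, j, x: np.sin(x + i + j)
print(k_y(0, 0.3, 0.7, W, k_f, sigma_f=0.1, sigma_y=0.05, q=2))
```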

Observations

  1. The amplitude is non-stationary: \(\sigma^2(x, x') = \sum_{j = 1}^q W_{ij}(x)W_{ij}(x')\)
  2. Each of the \(q\) kernels may have its own length scale, and \(W\) allows the effective length scale of \(\hat{y}\) to vary between the smallest and largest of those of \(\hat{f}\)
  3. The kernels might be differently structured (periodic/smooth etc.), so there can be spatial variation of the overall behaviour.
  4. The noise covariance is non-stationary

ARD

One can either vary \(q\) manually and compare \(p(D|q)\), or add trainable activations \(k_{f_i} \rightarrow a_i k_{f_i}\) that switch off unneeded channels.

  • \(q < p\) becomes a dimensionality reduction
  • \(q > p\) allows for interesting kernel behaviour, such as length-scale and structure switching

Inference

Optimise

\[p(D | \hat{w}, \hat{f}, \sigma_y) = \prod_i \mathcal{N}\left(\hat{y}(x_i); W(x_i)\hat{f}(x_i), \sigma_y^2 I\right)\]

Ideally we are not looking for the optimal values of \(W\) and \(\hat{f}\), but rather distributions to account for epistemic uncertainty.

In practice, this means Markov chain Monte Carlo (MCMC) or variational Bayes (VB). In the paper, standard methods for both are used.

In MCMC, \[p(\hat{y}(x^*)|D) = \lim_{J\rightarrow \infty} \frac{1}{J} \sum_{i = 1}^J p(\hat{y}(x^*)|W^i_*, f^i_*, \sigma_f, \sigma_y)\]
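
A minimal sketch of this Monte Carlo average at a single test input \(x^*\), assuming the posterior samples \(W^i_*, f^i_*\) are already available from the sampler:

```python
import numpy as np

# Sketch of the Monte Carlo predictive average: given J posterior samples of
# the weights W_* and latent values f_* at the test input, the predictive
# density is the average of the Gaussian likelihood terms.
def predictive_density(y_star, W_samples, f_samples, sigma_y):
    """W_samples: (J, p, q), f_samples: (J, q), all at the test input x*."""
    p = W_samples.shape[1]
    densities = []
    for W, f in zip(W_samples, f_samples):
        diff = y_star - W @ f
        log_norm = -0.5 * p * np.log(2.0 * np.pi * sigma_y ** 2)
        densities.append(np.exp(log_norm - 0.5 * diff @ diff / sigma_y ** 2))
    return np.mean(densities)
```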

Costs

Non-parametric methods tend to be quite expensive to train: \(O(N^3)\) for a single covariance (one latent \(f\)) and \(O(q N^3)\) for \(q\) latent covariances.

Sampling scales as \(O(Npq)\)

VB is also dominated by \(O(N^3)\); each EM iteration is more expensive than an MCMC iteration, but one tends to need fewer of them.

GPRNs scale favorably in the number of output dimensions \(p\) compared to other multi-output methods (\(O(pqN^3)\) vs \(O(p^3 N^3)\)), making them a tool of choice for high-dimensional data, such as gene expression experiments.

Experiments

Over several datasets the GPRN shows very good performance (see the table in the paper).

Swiss Jura experiment

259 measurements of cadmium, nickel and zinc concentrations in a 14.5 \(\mathrm{km}^2\) area, which are believed not to be independent. Standard Gaussian processes only relate cadmium to cadmium, zinc to zinc, …

The model strongly prefers \(q = 2\) latent nodes.

Covariances are informative

Zinc - Cadmium covariance

Summary

  • GPRNs are flexible non-parametric models that work well in high-dimensional settings
  • Not just for prediction at new points, but also for learning relationships between the observed fields
  • They successfully model non-stationary signal and noise dependencies, with the freedom to allow for different kernel behaviours
  • Bayesian methods which account for model uncertainty