[Derivation Scribbles] Basic GP Regression Formula

Derivations for the Gaussian process predictive distribution, for a single-output GP observed with additive Gaussian noise.
Categories: Gaussian Process, Derivation Scribbles
Author: Rui-Yang Zhang
Published: October 13, 2024

These notes on Gaussian process regression are adapted from https://docs.jaxgaussianprocesses.com/.

Block Matrix Inversion

The first thing we need to establish is the block matrix inversion identity. Consider an invertible matrix \(\Sigma\) that can be written as

\[ \Sigma = \begin{bmatrix}\Sigma_{AA} & \Sigma_{AB} \\\Sigma_{BA} & \Sigma_{BB} \\\end{bmatrix} \]

where \(\Sigma_{AA}, \Sigma_{AB}, \Sigma_{BA}, \Sigma_{BB}\) are blocks of appropriate dimensions, with \(\Sigma_{AA}\), \(\Sigma_{BB}\), and the relevant Schur complements non-singular. We then have the block matrix inversion identity stated below.

\[ \begin{split} \Sigma^{-1} &= \begin{bmatrix}\Sigma_{AA} & \Sigma_{AB} \\\Sigma_{BA} & \Sigma_{BB} \\\end{bmatrix}^{-1} \\ &= \begin{bmatrix} (\Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA})^{-1} & -(\Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA})^{-1} \Sigma_{AB} \Sigma_{BB}^{-1}\\ -\Sigma_{BB}^{-1} \Sigma_{BA}(\Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA})^{-1} & (\Sigma_{BB} - \Sigma_{BA}\Sigma_{AA}^{-1}\Sigma_{AB})^{-1} \end{bmatrix}. \end{split} \]
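As a sanity check, here is a minimal numerical verification of this identity, assuming NumPy and a small randomly generated positive-definite \(\Sigma\); all names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_A, d_B = 3, 4
M = rng.normal(size=(d_A + d_B, d_A + d_B))
Sigma = M @ M.T + np.eye(d_A + d_B)                      # random SPD matrix, so all blocks behave

S_AA, S_AB = Sigma[:d_A, :d_A], Sigma[:d_A, d_A:]
S_BA, S_BB = Sigma[d_A:, :d_A], Sigma[d_A:, d_A:]

# Schur complements of Sigma_BB and Sigma_AA respectively
schur_A = S_AA - S_AB @ np.linalg.solve(S_BB, S_BA)
schur_B = S_BB - S_BA @ np.linalg.solve(S_AA, S_AB)

block_inv = np.block([
    [np.linalg.inv(schur_A),
     -np.linalg.inv(schur_A) @ S_AB @ np.linalg.inv(S_BB)],
    [-np.linalg.inv(S_BB) @ S_BA @ np.linalg.inv(schur_A),
     np.linalg.inv(schur_B)],
])

assert np.allclose(block_inv, np.linalg.inv(Sigma))      # the identity holds numerically
```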

Marginal and Conditional Gaussians

Consider a multivariate Gaussian random vector \(x = (x_A, x_B)^T\), where \(x_A\) is \(d_A\)-dimensional, \(x_B\) is \(d_B\)-dimensional, and \(x\) is \(d = d_A + d_B\)-dimensional. The mean vector and covariance matrix of the multivariate Gaussian are set as follows:

\[ x = \begin{bmatrix} x_A \\ x_B \end{bmatrix} \sim N_d \left( \mu, \Sigma\right) = N_d \left( \begin{bmatrix} \mu_A \\ \mu_B \end{bmatrix}, \begin{bmatrix}\Sigma_{AA} & \Sigma_{AB} \\\Sigma_{BA} & \Sigma_{BB} \\\end{bmatrix}\right). \]

The marginal distributions of \(x_A\) and \(x_B\) can be read off directly by selecting the corresponding blocks of the mean and covariance above, i.e.

\[ \begin{split} x_A &\sim N_{d_A}(\mu_A, \Sigma_{AA}), \\ x_B &\sim N_{d_B}(\mu_B, \Sigma_{BB}). \end{split} \]
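A quick Monte Carlo sketch of this marginalisation rule, assuming NumPy and an arbitrary toy mean and covariance: the empirical moments of the \(x_A\) block of joint samples should match \(\mu_A\) and \(\Sigma_{AA}\) up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0, 0.5])                          # mu_A = first two entries, mu_B = last entry
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + np.eye(3)                              # a valid covariance matrix

x = rng.multivariate_normal(mu, Sigma, size=200_000)     # samples of (x_A, x_B)
x_A = x[:, :2]                                           # keep only the A block of each sample

# Empirical moments of x_A should match mu_A and Sigma_AA up to Monte Carlo error
print(x_A.mean(axis=0), mu[:2])
print(np.cov(x_A, rowvar=False), Sigma[:2, :2])
```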

The conditional distributions are a bit trickier, and we derive them below. By symmetry, we derive the conditional distribution \(x_A | x_B\) and simply state \(x_B | x_A\). Using \(p(\cdot)\) to denote the density of a random variable, we have

\[ \begin{split} p(x_A | x_B) &= \frac{p(x_A, x_B)}{p(x_B)} \\ &\propto \exp\left\{ -\frac{1}{2} (x - \mu)^T\Sigma^{-1}(x - \mu) \right\}. \end{split} \]

Focusing on the quadratic form inside the exponential, we first denote

\[ \Sigma^{-1} = \begin{bmatrix} V_{AA} & V_{AB} \\ V_{BA} & V_{BB} \end{bmatrix} \]

which then yields

\[ \begin{split} &\quad (x - \mu)^T\Sigma^{-1}(x - \mu) \\ &= \begin{bmatrix} x_A - \mu_A \\ x_B - \mu_B \end{bmatrix}^T \begin{bmatrix} V_{AA} & V_{AB} \\ V_{BA} & V_{BB} \end{bmatrix}\begin{bmatrix} x_A - \mu_A \\ x_B - \mu_B \end{bmatrix} \\ &= \begin{bmatrix} (x_A - \mu_A)^T V_{AA} + (x_B - \mu_B)^T V_{BA} \\ (x_A - \mu_A)^T V_{AB} + (x_B - \mu_B)^T V_{BB} \end{bmatrix}^T\begin{bmatrix} x_A - \mu_A \\ x_B - \mu_B \end{bmatrix} \\ &= (x_A - \mu_A)^T V_{AA} (x_A - \mu_A) + (x_A - \mu_A)^T V_{AB} (x_B - \mu_B) \\ &\quad + (x_B - \mu_B)^T V_{BA} (x_A - \mu_A) + (x_B - \mu_B)^T V_{BB} (x_B - \mu_B). \end{split} \]

We can keep the terms involving \(x_A\) and absorb the rest into the normalising constant. Since \(\Sigma^{-1}\) is symmetric, \(V_{AA}\) is symmetric and \(V_{AB} = V_{BA}^T\), so we can simplify the above into

\[ \begin{split} &\quad x_A^T V_{AA} x_A - 2 x_A^T V_{AA} \mu_A + 2x_A^T V_{AB} (x_B - \mu_B) \\ &= x_A^T V_{AA} x_A - 2 x_A^T [ V_{AA} \mu_A - V_{AB} (x_B - \mu_B)] \\ &= (x_A - \mu')^T V_{AA}(x_A - \mu')+ C \end{split} \]

for some constant \(C\) independent of \(x_A\) and the newly defined

\[ \mu' = \mu_A - V_{AA}^{-1}V_{AB} (x_B - \mu_B). \]

Therefore, using the values of \(V_{AA}, V_{AB}\) from the block matrix inversion formula earlier, we have

\[ \begin{split} \mu' &= \mu_A - V_{AA}^{-1}V_{AB} (x_B - \mu_B) \\ &= \mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(x_B - \mu_B) \\ V_{AA}^{-1} &= \Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA} \\ \end{split} \]

and via symmetry, we have the conditional distributions

\[ \begin{split} x_A | x_B &\sim N_{d_A}(\mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(x_B - \mu_B), \Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA}), \\ x_B | x_A &\sim N_{d_B}(\mu_B + \Sigma_{BA}\Sigma_{AA}^{-1}(x_A - \mu_A), \Sigma_{BB} - \Sigma_{BA}\Sigma_{AA}^{-1}\Sigma_{AB}). \end{split} \]
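The derivation can again be checked numerically. The sketch below, assuming NumPy and a random positive-definite \(\Sigma\), confirms that the precision-matrix blocks \(V_{AA}, V_{AB}\) recover the conditional covariance and conditional mean shift stated above.

```python
import numpy as np

rng = np.random.default_rng(2)
d_A, d_B = 2, 3
M = rng.normal(size=(d_A + d_B, d_A + d_B))
Sigma = M @ M.T + np.eye(d_A + d_B)                      # random SPD covariance
S_AA, S_AB = Sigma[:d_A, :d_A], Sigma[:d_A, d_A:]
S_BA, S_BB = Sigma[d_A:, :d_A], Sigma[d_A:, d_A:]

V = np.linalg.inv(Sigma)                                 # precision matrix
V_AA, V_AB = V[:d_A, :d_A], V[:d_A, d_A:]

# Conditional covariance: V_AA^{-1} = Sigma_AA - Sigma_AB Sigma_BB^{-1} Sigma_BA
assert np.allclose(np.linalg.inv(V_AA), S_AA - S_AB @ np.linalg.solve(S_BB, S_BA))

# Conditional mean coefficient: -V_AA^{-1} V_AB = Sigma_AB Sigma_BB^{-1}
assert np.allclose(-np.linalg.solve(V_AA, V_AB), S_AB @ np.linalg.inv(S_BB))
```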

Gaussian Process Regression

Suppose we have a single-output Gaussian process \(f \sim \mathcal{GP}(\mu, k)\), where \(\mu\) is the mean function and \(k\) is the kernel function. The input domain of this GP is assumed to be \(\mathbb{R}^d\). Suppose we have made \(m\) noisy observations of \(f\) at input locations \(X \in \mathbb{R}^{m \times d}\) with values \(y \in \mathbb{R}^m\), where the noise is independent additive Gaussian noise of variance \(\sigma^2\), i.e. \(y = f(X) + \xi\) with \(\xi_i \sim N(0, \sigma^2)\) for all \(i = 1, 2, \ldots, m\). Denote the observed data as \(\mathcal{D} = \{ X, y \}\).

Under our modelling assumptions, we could write down the (log) likelihood of the \(m\) observations \(y\) under our GP prior \(f \sim \mathcal{GP}(\mu, k)\). Since \(y = f(X) + \xi\), we have

\[ y | X \sim N_m \left( \mu(X), k(X, X) + \sigma^2 I_m \right), \] parametrised by hyperparameters \(\theta\) (e.g. the observation noise \(\sigma\), and the lengthscale and variance of the kernel \(k\)), which gives us the following log likelihood

\[ \log p(y|X) = - \frac{m}{2}\log(2\pi) - \frac{1}{2}\log | k(X, X) + \sigma^2 I_m | - \frac{1}{2} \left( y - \mu(X) \right)^T ( k(X, X) + \sigma^2 I_m)^{-1}\left( y - \mu(X) \right), \] which we maximise w.r.t. \(\theta\) to obtain the maximum likelihood estimators of the (hyper)parameters.
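A minimal sketch of this log likelihood computation, assuming NumPy, a squared-exponential kernel, and toy one-dimensional data; the helper names (`rbf`, `log_marginal_likelihood`) are illustrative and not part of any library API. The Cholesky factorisation supplies both the inverse and the log-determinant, a standard choice for numerical stability.

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel: k(x, x') = variance * exp(-(x - x')^2 / (2 lengthscale^2))
    sq_dists = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def log_marginal_likelihood(X, y, mean_fn, noise_var, **kernel_params):
    m = len(y)
    Ky = rbf(X, X, **kernel_params) + noise_var * np.eye(m)  # k(X, X) + sigma^2 I_m
    L = np.linalg.cholesky(Ky)                               # Ky = L L^T
    r = y - mean_fn(X)                                       # y - mu(X)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, r))      # Ky^{-1} (y - mu(X))
    log_det = 2.0 * np.sum(np.log(np.diag(L)))               # log |Ky| via the Cholesky factor
    return -0.5 * m * np.log(2 * np.pi) - 0.5 * log_det - 0.5 * r @ alpha

# Toy 1-d data: noisy observations of a sine function
rng = np.random.default_rng(0)
X = np.linspace(0.0, 5.0, 20)
y = np.sin(X) + 0.1 * rng.normal(size=X.shape)
print(log_marginal_likelihood(X, y, mean_fn=np.zeros_like, noise_var=0.01, lengthscale=1.0))
```

In practice this quantity would be handed to a numerical optimiser and maximised over \(\theta\) (noise variance, lengthscale, kernel variance).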

Next, conditional on these observations, we wish to know the distribution of the GP at test points \(X_* \in \mathbb{R}^{n \times d}\), i.e. the conditional distribution of \(y_* = f(X_*)\) given \(\mathcal{D}\). This can be achieved by first modelling \(y_*\) and \(y\) jointly, then conditioning on \(y\). To apply the conditional distribution formula above, we denote for simplicity the Gram matrices

\[ K = k(X, X) \in \mathbb{R}^{m \times m}, \qquad K_* = k(X, X_*) \in \mathbb{R}^{m \times n}, \qquad K_{**}=k(X_*, X_*) \in \mathbb{R}^{n \times n}. \]

The joint distribution of the noisy training observations \(y\) and the noise-free test values \(y_* = f(X_*)\) is

\[ \begin{bmatrix} y \\ y_* \end{bmatrix} \Bigg|~ X, X_* \sim N_{m+n}\left( \begin{bmatrix} \mu(X) \\ \mu(X_*) \end{bmatrix}, \begin{bmatrix} K + \sigma^2 I_m & K_* \\ K_*^T & K_{**} \end{bmatrix}\right), \]

and applying the conditional distribution formula above gives us

\[ \begin{split} y_* ~|X_*, \mathcal{D}, \sigma^2 &\sim N_{n}(\mu_{y_* | \mathcal{D}}, K_{y_* | \mathcal{D}}), \\ \mu_{y_* | \mathcal{D}} &= \mu(X_*) + K_*^T (K + \sigma^2 I_m)^{-1} (y - \mu(X)),\\ K_{y_* | \mathcal{D}} &= K_{**} - K_*^T (K + \sigma^2 I_m)^{-1}K_*. \end{split} \]

In the common scenario where we assume \(\mu = 0\), we further have the following GP predictive distribution

\[ \begin{split} y_* ~|X_*, \mathcal{D}, \sigma^2 &\sim N_{n}(\mu_{y_* | \mathcal{D}}, K_{y_* | \mathcal{D}}), \\ \mu_{y_* | \mathcal{D}} &= K_*^T (K + \sigma^2 I_m)^{-1} y,\\ K_{y_* | \mathcal{D}} &= K_{**} - K_*^T (K + \sigma^2 I_m)^{-1}K_*. \end{split} \]
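Putting the zero-mean predictive equations into code, here is a minimal sketch assuming NumPy, the same illustrative squared-exponential kernel as before, and toy one-dimensional data; it is not a reference implementation of any particular library.

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel, as in the previous sketch
    sq_dists = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_predict(X, y, X_star, noise_var, **kernel_params):
    m = len(y)
    K = rbf(X, X, **kernel_params) + noise_var * np.eye(m)   # K + sigma^2 I_m
    K_star = rbf(X, X_star, **kernel_params)                 # K_* = k(X, X_*), shape (m, n)
    K_ss = rbf(X_star, X_star, **kernel_params)               # K_** = k(X_*, X_*)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))      # (K + sigma^2 I_m)^{-1} y
    V = np.linalg.solve(L, K_star)                            # L^{-1} K_*
    mean = K_star.T @ alpha                                   # K_*^T (K + sigma^2 I_m)^{-1} y
    cov = K_ss - V.T @ V                                      # K_** - K_*^T (K + sigma^2 I_m)^{-1} K_*
    return mean, cov

rng = np.random.default_rng(1)
X = np.linspace(0.0, 5.0, 20)
y = np.sin(X) + 0.1 * rng.normal(size=X.shape)
X_star = np.linspace(0.0, 5.0, 50)
mean, cov = gp_predict(X, y, X_star, noise_var=0.01, lengthscale=1.0)
print(mean.shape, cov.shape)  # (50,), (50, 50)
```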