Derivation Scribbles: Basic Gaussian Process Regression Formula

Derivations for the Gaussian process predictive distribution. Single-output GP, observe with additive Gaussian noise.
Gaussian Process
Derivation Scribbles
Author

Rui-Yang Zhang

Published

October 13, 2024

Gaussian Process Regression, adapted from https://docs.jaxgaussianprocesses.com/.

Block Matrix Inversion

The first thing we need to establish is the block matrix inversion identity. Consider an invertible matrix \(\Sigma\) that can be written as

\[ \Sigma = \begin{bmatrix}\Sigma_{AA} & \Sigma_{AB} \\\Sigma_{BA} & \Sigma_{BB} \\\end{bmatrix} \]

where \(\Sigma_{AA}, \Sigma_{AB}, \Sigma_{BA}, \Sigma_{BB}\) are matrices of the right dimension and sufficiently non-singular. Next, we have the block matrix inversion identity stated below.

\[ \begin{split} \Sigma^{-1} &= \begin{bmatrix}\Sigma_{AA} & \Sigma_{AB} \\\Sigma_{BA} & \Sigma_{BB} \\\end{bmatrix}^{-1} \\ &= \begin{bmatrix} (\Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA})^{-1} & -(\Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA})^{-1} \Sigma_{AB} \Sigma_{BB}^{-1}\\ -\Sigma_{BB}^{-1} \Sigma_{BA}(\Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA})^{-1} & (\Sigma_{BB} - \Sigma_{BA}\Sigma_{AA}^{-1}\Sigma_{AB})^{-1} \end{bmatrix}. \end{split} \]

Marginal and Conditional Gaussians

Consider a multivariate Gaussian distribution \(x = (x_A, x_B)^T\) where \(x_A\) is \(d_A\) dimensional, \(x_B\) is \(d_B\) dimensional, and \(x\) is \(d = d_A + d_B\) dimensional. The mean vector and covariance matrix of the multivariate Gaussian is set to be as follows:

\[ x = \begin{bmatrix} x_A \\ x_B \end{bmatrix} \sim N_d \left( \mu, \Sigma\right) = N_d \left( \begin{bmatrix} \mu_A \\ \mu_B \end{bmatrix}, \begin{bmatrix}\Sigma_{AA} & \Sigma_{AB} \\\Sigma_{BA} & \Sigma_{BB} \\\end{bmatrix}\right). \]

It is easy to notice that the marginal distributions \(x_A\) and \(x_B\) can be obtained by selecting the needed entries of the above equation, i.e. 

\[ \begin{split} x_A &\sim N_{d_A}(\mu_A, \Sigma_{AA}), \\ x_B &\sim N_{d_B}(\mu_B, \Sigma_{BB}). \end{split} \]

The conditional distributions are a bit tricky, which we will derive below. Due to symmetry, we will derive the conditional distribution \(x_A | x_B\) and just state \(x_B | x_A\). Using \(p(\cdot)\) to denote the density of a random variable, we have

\[ \begin{split} p(x_A | x_B) &= \frac{p(x_A, x_B)}{p(x_B)} \\ &\propto \exp\left\{ -\frac{1}{2} (x - \mu)^T\Sigma^{-1}(x - \mu) \right\}. \end{split} \]

Focusing on the terms inside the second exponential, we first denote

\[ \Sigma^{-1} = \begin{bmatrix} V_{AA} & V_{AB} \\ V_{BA} & V_{BB} \end{bmatrix} \]

which then yield

\[ \begin{split} &\quad (x - \mu)^T\Sigma^{-1}(x - \mu) \\ &= \begin{bmatrix} x_A - \mu_A \\ x_B - \mu_b \end{bmatrix}^T \begin{bmatrix} V_{AA} & V_{AB} \\ V_{BA} & V_{BB} \end{bmatrix}\begin{bmatrix} x_A - \mu_A \\ x_B - \mu_b \end{bmatrix} \\ &= \begin{bmatrix} (x_A - \mu_A)^T V_{AA} + (x_B - \mu_B)^T V_{BA} \\ (x_A - \mu_A)^T V_{AB} + (x_B - \mu_B)^T V_{BB} \end{bmatrix}^T\begin{bmatrix} x_A - \mu_A \\ x_B - \mu_b \end{bmatrix} \\ &= (x_A - \mu_A)^T V_{AA} (x_A - \mu_A) + (x_A - \mu_A)^T V_{AB} (x_B - \mu_B) \\ &\quad + (x_B - \mu_B)^T V_{BA} (x_A - \mu_A) + (x_B - \mu_B)^T V_{BB} (x_B - \mu_B). \end{split} \]

We can keep terms with \(x_A\) and put the rest into the normalising constant. As \(V_{AA}\) is square and \(V_{AB}= V_{BA}^T\), we can simplify our above equation into

\[ \begin{split} &\quad x_A^T V_{AA} x_A - 2 x_A^T V_{AA} \mu_A + 2x_A^T V_{AB} (x_B - \mu_B) \\ &= x_A^T V_{AA} x_A - 2 x_A^T [ V_{AA} \mu_A +V_{AB} (x_B - \mu_B)] \\ &= (x_A - \mu')^T V_{AA}(x_A - \mu')+ C \end{split} \]

for some constant \(C\) independent of \(x_A\) and the newly defined

\[ \mu' = \mu_A - V_{AA}^{-1}V_{AB} (x_B - \mu_B). \]

Therefore, using the values of \(V_{AA}, V_{AB}\) from the block matrix inversion formula earlier, we have

\[ \begin{split} \mu' &= \mu_A - V_{AA}^{-1}V_{AB} (x_B - \mu_B) \\ &= \mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(x_B - \mu_B) \\ V_{AA}^{-1} &= \Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA} \\ \end{split} \]

and via symmetry, we have the conditional distributions

\[ \begin{split} x_A | x_B &\sim N_{d_A}(\mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(x_B - \mu_B), \Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA}), \\ x_B | x_A &\sim N_{d_B}(\mu_B + \Sigma_{BA}\Sigma_{AA}^{-1}(x_A - \mu_A), \Sigma_{BB} - \Sigma_{BA}\Sigma_{AA}^{-1}\Sigma_{AB}). \end{split} \]

Gaussian Process Regression

Consider we have a single-output Gaussian process \(f \sim \mathcal{GP}(\mu, k)\) where \(\mu\) is the mean function and \(k\) is the kernel function. The support of this GP is assumed to be \(\mathbb{R}^d\). Consider we have made \(m\) observations of this GP \(f\) where the observations are made at locations \(X \in \mathbb{R}^m\) with values \(y \in \mathbb{R}^m\) and the observations are noisy with independent additive Gaussian noise of variance \(\sigma^2\), i.e. \(y = f(X) + \xi\) with \(\xi_i \sim N(0, \sigma^2) ~\forall i = 1, 2, \ldots, m\). Denote the existing observations as \(\mathcal{D} = \{ X, y \}\).

Next, conditional on these observations, we wish to know the distributions of the GP at test points \(X_* \in \mathbb{R}^n\), i.e. the conditional distribution \(y_* = f(X_*) ~| \mathcal{D}\). This can be achieved by first model \(y_*\) and \(y\) jointly, then condition on \(y\). Using the conditional distribution formula above, we denote for simplicity the Gram matrices

\[ K = k(X, X), \qquad K_* = k(X, X_*), \qquad K_{**}=k(X_*, X_*), \]

which gives us

\[ \begin{split} y_* ~|X_*, \mathcal{D}, \sigma^2 &\sim N_{n}(\mu_{y_* | \mathcal{D}}, K_{y_* | \mathcal{D}}), \\ \mu_{y_* | \mathcal{D}} &= \mu(X) + K_*^T (K + \sigma^2 I_n)^{-1} y,\\ K_{y_* | \mathcal{D}} &= K_{**} - K_*^T (K + \sigma^2 I_n)^{-1}K_*. \end{split} \]

In the common scenario where we assume \(\mu = 0\), we further have the following GP predictive distribution

\[ \begin{split} y_* ~|X_*, \mathcal{D}, \sigma^2 &\sim N_{n}(\mu_{y_* | \mathcal{D}}, K_{y_* | \mathcal{D}}), \\ \mu_{y_* | \mathcal{D}} &= K_*^T (K + \sigma^2 I_n)^{-1} y,\\ K_{y_* | \mathcal{D}} &= K_{**} - K_*^T (K + \sigma^2 I_n)^{-1}K_*. \end{split} \]