Spatial-Temporal GP (2)

A series of blog posts on spatial-temporal Gaussian processes. Exploiting the Kronecker structure of temporal GP regression with 1d space.
Author

Rui-Yang Zhang

Published

October 31, 2024

In this blog post, I will walk through how one could exploit the Kronecker structure of temporal Gaussian process (GP) regression with one-dimensional space + one-dimensional time inputs and one-dimensional output. This is the second in a series of blog posts on spatial-temporal Gaussian processes.

Separable Kernels

Recall from the last post that we fitted a temporal GP on a one-dimensional spatial and one-dimensional temporal grid. Since we define the overall kernel as a product of a spatial and a temporal component, i.e. $k = k_s \times k_t$, the Gram matrix has the Kronecker structure $K = K_s \otimes K_t$, shown visually below.

[Figure: Gram matrices]

Such kernels are known as separable kernels, and in this post we will explore how one could exploit this structure to obtain significant computational speed-ups.
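As a quick illustration of this structure, the sketch below checks that the Gram matrix of a separable kernel evaluated on the full space-time grid (ordered so that time varies fastest) equals the Kronecker product of the spatial and temporal Gram matrices. The grids, the squared-exponential kernel, and its hyperparameters are arbitrary choices for illustration, not those used in the post series.

```python
import jax.numpy as jnp

def rbf(x1, x2, lengthscale=1.0):
    """Squared-exponential kernel on 1D inputs (illustrative choice)."""
    return jnp.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale**2)

xs = jnp.linspace(0.0, 1.0, 4)         # spatial grid (N_s = 4)
ts = jnp.linspace(0.0, 2.0, 3)         # temporal grid (N_t = 3)

K_s = rbf(xs, xs)                      # N_s x N_s spatial Gram matrix
K_t = rbf(ts, ts)                      # N_t x N_t temporal Gram matrix

# Full space-time grid, ordered so that time varies fastest.
grid = jnp.stack(jnp.meshgrid(xs, ts, indexing="ij"), axis=-1).reshape(-1, 2)
K_full = rbf(grid[:, 0], grid[:, 0]) * rbf(grid[:, 1], grid[:, 1])
print(jnp.allclose(K_full, jnp.kron(K_s, K_t), atol=1e-6))   # True
```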

Kronecker Facts

Before describing how one could leverage the Kronecker structure, we first state several relevant and helpful facts about matrices with a Kronecker structure.

Consider two matrices $A \in \mathbb{R}^{n_1 \times n_2}$ and $B \in \mathbb{R}^{m_1 \times m_2}$. The Kronecker product $K = A \otimes B \in \mathbb{R}^{n_1 m_1 \times n_2 m_2}$ is defined by

$$A \otimes B = \begin{bmatrix} a_{11} B & \cdots & a_{1 n_2} B \\ \vdots & \ddots & \vdots \\ a_{n_1 1} B & \cdots & a_{n_1 n_2} B \end{bmatrix}.$$

The Kronecker product operator is bilinear and associative, so we have

$$A \otimes (B + C) = A \otimes B + A \otimes C, \qquad (B + C) \otimes A = B \otimes A + C \otimes A,$$
$$(kA) \otimes B = A \otimes (kB) = k(A \otimes B), \qquad A \otimes (B \otimes C) = (A \otimes B) \otimes C.$$

More interesting (and relevant here) properties are the ones related to inverse, Cholesky decomposition, and determinant.

First, we have the inverse property

$$(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$$

for any invertible $A, B$.

Next, we have the mixed-product property

$$(A_1 \otimes B_1)(A_2 \otimes B_2) = (A_1 A_2) \otimes (B_1 B_2).$$

Note that if we have the Cholesky decomposition $A = L_A L_A^*$ for a lower triangular matrix $L_A$ and its conjugate transpose $L_A^*$, we have

$$A \otimes B = (L_A L_A^*) \otimes (L_B L_B^*) = (L_A \otimes L_B)(L_A^* \otimes L_B^*).$$

Similarly, if we have the eigendecomposition $A = Q_A \Lambda_A Q_A^T$ for a diagonal matrix $\Lambda_A$, we have

$$A \otimes B = (Q_A \Lambda_A Q_A^T) \otimes (Q_B \Lambda_B Q_B^T) = (Q_A \otimes Q_B)(\Lambda_A \otimes \Lambda_B)(Q_A^T \otimes Q_B^T).$$

Finally, if we have square matrices $A \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{m \times m}$, then

$$|A \otimes B| = |A|^m |B|^n$$

for matrix determinants.
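These identities are easy to check numerically. Below is a minimal sketch using jnp.kron on two small symmetric positive-definite matrices; the matrices themselves are made up for illustration.

```python
import jax.numpy as jnp

# Two small, arbitrary symmetric positive-definite matrices.
A = jnp.array([[2.0, 0.5],
               [0.5, 1.0]])                      # n x n with n = 2
B = jnp.array([[3.0, 1.0, 0.0],
               [1.0, 2.0, 0.5],
               [0.0, 0.5, 1.5]])                 # m x m with m = 3
AB = jnp.kron(A, B)

# Inverse: (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}
print(jnp.allclose(jnp.linalg.inv(AB),
                   jnp.kron(jnp.linalg.inv(A), jnp.linalg.inv(B)), atol=1e-5))

# Determinant: |A ⊗ B| = |A|^m |B|^n
print(jnp.allclose(jnp.linalg.det(AB),
                   jnp.linalg.det(A) ** 3 * jnp.linalg.det(B) ** 2))

# Cholesky: chol(A ⊗ B) = chol(A) ⊗ chol(B)
print(jnp.allclose(jnp.linalg.cholesky(AB),
                   jnp.kron(jnp.linalg.cholesky(A), jnp.linalg.cholesky(B)),
                   atol=1e-5))
```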

Kronecker Matrix-Vector Product

We will first show a useful algorithm for computing matrix-vector products of the form $(A \otimes B)z$, where $A$ and $B$ are square matrices of size $N_A \times N_A$ and $N_B \times N_B$ respectively and $z \in \mathbb{R}^N$ with $N = N_A N_B$. Note that this algorithm generalises to matrix-vector products where the matrix is a Kronecker product of $D$ square matrices.

We will use the $\operatorname{vec}$ operator, which stacks the columns of a matrix vertically to obtain a single column vector, i.e. for $A = [a_1, a_2, \ldots, a_k]$ with the $a_i$ being column vectors, we have

$$\operatorname{vec}(A) = \begin{bmatrix} a_1 \\ \vdots \\ a_k \end{bmatrix}.$$
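Concretely, since JAX NumPy flattens in row-major order while $\operatorname{vec}$ stacks columns, $\operatorname{vec}(Z)$ corresponds to Z.T.flatten(). A tiny illustration with an arbitrary matrix:

```python
import jax.numpy as jnp

Z = jnp.array([[1., 2.],
               [3., 4.],
               [5., 6.]])      # columns are [1, 3, 5] and [2, 4, 6]
vec_Z = Z.T.flatten()          # vec(Z) = [1, 3, 5, 2, 4, 6]
print(vec_Z)
```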

A useful property relating the $\operatorname{vec}$ operator and the Kronecker product is the following:

$$(A \otimes B)\operatorname{vec}(Z) = \operatorname{vec}[B Z A^T] = \operatorname{vec}[B(AZ^T)^T].$$

Back to the product of interest $(A \otimes B)z$: here $z$ is a column vector, so to apply the vec formula above we need to reshape $z$ into a matrix of a sensible shape. Using JAX (and JAX NumPy) notation, we have

```python
step1 = z.reshape(N_A, N_B)   # rows are consecutive blocks of z, i.e. Z^T
step2 = A @ step1             # A Z^T
step3 = step2.T               # Z A^T
step4 = B @ step3             # B Z A^T
result = step4.T.flatten()    # vec(B Z A^T) = (A ⊗ B) z
```

where .reshape and .flatten in JAX NumPy operate in row-major order while $\operatorname{vec}$ stacks columns, so the reshape actually produces $Z^T$ rather than $Z$; this is also why we transpose before flattening to get the final result. In terms of computational time, the naive implementation of $(A \otimes B)z$ costs $\mathcal{O}((N_A N_B)^2)$, whereas the Kronecker implementation only costs $\mathcal{O}(N_A N_B (N_A + N_B))$. The Kronecker implementation will be used whenever it is applicable.
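For convenience later on, the same steps can be wrapped into a helper function and checked against the naive implementation. The function name kron_mv and the test sizes below are my own choices for illustration.

```python
import jax
import jax.numpy as jnp

def kron_mv(A, B, z):
    """Compute (A ⊗ B) z without ever forming the Kronecker product."""
    N_A, N_B = A.shape[0], B.shape[0]
    Zt = z.reshape(N_A, N_B)      # Z^T, since vec stacks columns
    out = B @ (A @ Zt).T          # B Z A^T
    return out.T.flatten()        # vec(B Z A^T) = (A ⊗ B) z

# Quick check against the naive implementation.
key_A, key_B, key_z = jax.random.split(jax.random.PRNGKey(0), 3)
A = jax.random.normal(key_A, (4, 4))
B = jax.random.normal(key_B, (3, 3))
z = jax.random.normal(key_z, (4 * 3,))
print(jnp.allclose(kron_mv(A, B, z), jnp.kron(A, B) @ z, atol=1e-5))  # True
```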

Kronecker Matrix-Matrix Product

One could also easily extend the above Kronecker matrix-vector product to a Kronecker matrix-matrix product in the following way. Consider the matrix-matrix product $(A \otimes B)Z$, where $A$ and $B$ are square matrices of size $N_A \times N_A$ and $N_B \times N_B$ respectively and $Z \in \mathbb{R}^{N \times M}$. We break the matrix $Z$ down into its $M$ columns and apply the Kronecker matrix-vector product to each column. This gives a computational time of $\mathcal{O}(N_A N_B (N_A + N_B) M)$, as opposed to the $\mathcal{O}((N_A N_B)^2 M)$ of the naive implementation. We could also exploit vectorisation functionality such as jax.vmap to further speed up this product, as sketched below.
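A sketch of this, assuming the kron_mv helper defined above (the function name kron_mm is again my own):

```python
import jax
import jax.numpy as jnp

def kron_mm(A, B, Z):
    """Compute (A ⊗ B) Z by applying kron_mv to each column of Z."""
    # Map over the columns of Z (input axis 1) and stack results as columns.
    return jax.vmap(lambda z: kron_mv(A, B, z), in_axes=1, out_axes=1)(Z)

# Quick check against the naive implementation.
key_A, key_B, key_Z = jax.random.split(jax.random.PRNGKey(1), 3)
A = jax.random.normal(key_A, (4, 4))
B = jax.random.normal(key_B, (3, 3))
Z = jax.random.normal(key_Z, (4 * 3, 5))
print(jnp.allclose(kron_mm(A, B, Z), jnp.kron(A, B) @ Z, atol=1e-5))  # True
```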

Standard GP Sampling, Training, and Prediction

For a Gaussian process $f \sim \mathcal{GP}(\mu, k)$, where $\mu$ is the mean function and $k$ is the kernel function, we can draw a sample of $f$ at test locations $X^* = (x_1^*, x_2^*, \ldots, x_k^*)$ via

$$f^* = \mu(X^*) + \sqrt{k(X^*, X^*)}\;\xi, \qquad \xi \sim \mathcal{N}_k(0, I_k),$$

where $k(X^*, X^*)$ is the Gram matrix and the square root denotes the lower Cholesky factor.

Suppose we have made $m$ observations of this GP $f$ at locations $X \in \mathbb{R}^m$ with values $y \in \mathbb{R}^m$, where the observations are corrupted by independent additive Gaussian noise of variance $\sigma^2$, i.e. $y = f(X) + \xi$ with $\xi_i \sim \mathcal{N}(0, \sigma^2)$ for $i = 1, 2, \ldots, m$. Denote the existing observations as $\mathcal{D} = \{X, y\}$.

To train the model using data (or conduct MLE), we need to optimise the log likelihood

$$\log p(y \mid X) = -\frac{m}{2}\log(2\pi) - \frac{1}{2}\log\left|k(X,X) + \sigma^2 I_m\right| - \frac{1}{2} y^T \left(k(X,X) + \sigma^2 I_m\right)^{-1} y.$$

In addition, we have the (conditional) predictive distribution

$$y^* \mid X^*, \mathcal{D}, \sigma^2 \sim \mathcal{N}_n\left(\mu_{y^*|\mathcal{D}},\, K_{y^*|\mathcal{D}}\right),$$
$$\mu_{y^*|\mathcal{D}} = K_*^T \left(K + \sigma^2 I_m\right)^{-1} y, \qquad K_{y^*|\mathcal{D}} = K_{**} - K_*^T \left(K + \sigma^2 I_m\right)^{-1} K_*,$$

where $K = k(X, X)$, $K_* = k(X, X^*)$, and $K_{**} = k(X^*, X^*)$,

which also implies that if we wish to draw a posterior sample we would have

$$f^* = \mu_{y^*|\mathcal{D}}(X^*) + \sqrt{k_{y^*|\mathcal{D}}(X^*, X^*)}\;\xi, \qquad \xi \sim \mathcal{N}_k(0, I_k).$$
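As a reference point for what the Kronecker tricks will speed up, here is a minimal sketch of these standard computations in JAX, assuming a squared-exponential kernel on one-dimensional inputs; the kernel choice, hyperparameters, and function names are illustrative and not taken from the post series.

```python
import jax.numpy as jnp
from jax.scipy.linalg import cho_factor, cho_solve

def sq_exp_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel on 1D inputs (illustrative choice)."""
    sq_dists = (X1[:, None] - X2[None, :]) ** 2
    return variance * jnp.exp(-0.5 * sq_dists / lengthscale**2)

def log_marginal_likelihood(X, y, noise_var):
    m = X.shape[0]
    K_noisy = sq_exp_kernel(X, X) + noise_var * jnp.eye(m)
    chol = cho_factor(K_noisy, lower=True)
    alpha = cho_solve(chol, y)                       # (K + σ² I)^{-1} y
    log_det = 2.0 * jnp.sum(jnp.log(jnp.diag(chol[0])))
    return -0.5 * m * jnp.log(2 * jnp.pi) - 0.5 * log_det - 0.5 * y @ alpha

def predict(X, y, X_star, noise_var):
    m = X.shape[0]
    K_noisy = sq_exp_kernel(X, X) + noise_var * jnp.eye(m)
    K_star = sq_exp_kernel(X, X_star)                # K_*
    K_ss = sq_exp_kernel(X_star, X_star)             # K_**
    chol = cho_factor(K_noisy, lower=True)
    mean = K_star.T @ cho_solve(chol, y)             # μ_{y*|D}
    cov = K_ss - K_star.T @ cho_solve(chol, K_star)  # K_{y*|D}
    return mean, cov
```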

“GP Does Not Scale”

It is a consensus / folklore that GPs do not scale. This is mostly due to the fact that the training and sampling of a GP involve inversions and Cholesky decompositions of $m \times m$ matrices, where $m$ is the number of observations. The most commonly used algorithms for matrix inversion and Cholesky decomposition have $\mathcal{O}(m^3)$ time complexity and are serial in nature (so they do not enjoy the GPU speed-ups that are prevalent in machine learning), so even a moderately sized data set will induce prohibitive costs.

It is still ongoing research to devise tricks and algorithms to make GPs more scalable, and several notable approaches exist.

Here, we will look at one such approach: a method that exploits the Kronecker structure. Assume we have a 1D space + 1D time temporal GP with $N_s$ spatial grid points and $N_t$ temporal grid points. The naive implementation will have a time complexity of $\mathcal{O}((N_s N_t)^3)$, whereas a Kronecker-aware implementation will only have a time complexity of $\mathcal{O}(\max\{N_s, N_t\}^3)$. Below, we will clarify the precise ways we can leverage the matrix structure to achieve computational speed-ups.

Kronecker Product Gaussian Process

The contents here are largely based on Saatchi (2011).

Sampling from a GP

Naive

$$f = \mu(X) + \sqrt{K_s \otimes K_t}\;\xi, \qquad \xi \sim \mathcal{N}(0, I_{N_s N_t})$$

Kronecker

$$f = \mu(X) + \left(\sqrt{K_s} \otimes \sqrt{K_t}\right)\xi, \qquad \xi \sim \mathcal{N}(0, I_{N_s N_t})$$

where we can use the Kronecker matrix-vector product.
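A minimal sketch of this, assuming the kron_mv helper from earlier and that $\mu(X)$, $K_s$, and $K_t$ are already computed; the small jitter added before the Cholesky decompositions is a standard numerical safeguard of my own, not part of the post series.

```python
import jax
import jax.numpy as jnp

def sample_prior_kron(key, mu, K_s, K_t, jitter=1e-6):
    """Draw a prior sample f = μ + (√K_s ⊗ √K_t) ξ without forming K_s ⊗ K_t."""
    N_s, N_t = K_s.shape[0], K_t.shape[0]
    L_s = jnp.linalg.cholesky(K_s + jitter * jnp.eye(N_s))
    L_t = jnp.linalg.cholesky(K_t + jitter * jnp.eye(N_t))
    xi = jax.random.normal(key, (N_s * N_t,))
    # √(K_s ⊗ K_t) ξ = (L_s ⊗ L_t) ξ, applied via the Kronecker mat-vec product.
    return mu + kron_mv(L_s, L_t, xi)
```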

GP Likelihood

Naive

$$\log p(y \mid X) = -\frac{m}{2}\log(2\pi) - \frac{1}{2}\log\left|K_s \otimes K_t + \sigma^2 I_m\right| - \frac{1}{2} y^T \left(K_s \otimes K_t + \sigma^2 I_m\right)^{-1} y.$$

Kronecker

There are two places where we need to leverage the Kronecker structure:

  • the determinant $|K_s \otimes K_t + \sigma^2 I_m|$
  • the inverse $(K_s \otimes K_t + \sigma^2 I_m)^{-1}$.

Consider the eigendecompositions $K = Q\Lambda Q^T$, $K_s = Q_s \Lambda_s Q_s^T$, and $K_t = Q_t \Lambda_t Q_t^T$. We know that $QQ^T = I$ and hence $|Q|\,|Q^T| = 1$, so since

$$K + \sigma^2 I_m = Q\Lambda Q^T + Q(\sigma^2 I_m)Q^T = Q(\Lambda + \sigma^2 I_m)Q^T,$$
$$K_s \otimes K_t + \sigma^2 I_m = (Q_s \otimes Q_t)(\Lambda_s \otimes \Lambda_t + \sigma^2 I_m)(Q_s^T \otimes Q_t^T),$$

we have

$$\left|K + \sigma^2 I_m\right| = |Q|\left|\Lambda + \sigma^2 I_m\right||Q^T| = \left|\Lambda_s \otimes \Lambda_t + \sigma^2 I_m\right|,$$
$$\left(K + \sigma^2 I_m\right)^{-1} = Q\left(\Lambda + \sigma^2 I_m\right)^{-1}Q^T = (Q_s \otimes Q_t)\left(\Lambda_s \otimes \Lambda_t + \sigma^2 I_m\right)^{-1}(Q_s^T \otimes Q_t^T),$$

where the remaining term $\Lambda_s \otimes \Lambda_t + \sigma^2 I_m$ is a diagonal matrix, and we can leverage the Kronecker matrix-vector (and matrix-matrix) products whenever necessary.
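Putting the pieces together, a sketch of the Kronecker-aware log marginal likelihood, reusing the kron_mv helper from earlier; the function and variable names are my own.

```python
import jax.numpy as jnp

def kron_log_marginal_likelihood(K_s, K_t, y, noise_var):
    """Log likelihood for K = K_s ⊗ K_t + σ² I, without forming K explicitly."""
    m = y.shape[0]                                   # m = N_s * N_t
    lam_s, Q_s = jnp.linalg.eigh(K_s)
    lam_t, Q_t = jnp.linalg.eigh(K_t)
    # Diagonal of Λ_s ⊗ Λ_t + σ² I: all pairwise eigenvalue products plus σ².
    lam = jnp.kron(lam_s, lam_t) + noise_var
    log_det = jnp.sum(jnp.log(lam))
    # (K + σ² I)^{-1} y = (Q_s ⊗ Q_t) diag(lam)^{-1} (Q_s^T ⊗ Q_t^T) y
    alpha = kron_mv(Q_s.T, Q_t.T, y)
    alpha = kron_mv(Q_s, Q_t, alpha / lam)
    return -0.5 * m * jnp.log(2 * jnp.pi) - 0.5 * log_det - 0.5 * y @ alpha
```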

Footnotes

  1. Frequentist MLE is equivalent to a Bayesian MAP with flat priors.

  2. since we are using only a summary statistic (MAP with flat prior) for the parameters in the posterior predictive, instead of the full marginal posterior.