In this blog post, I will walk through how one could exploit the Kronecker structure of temporal Gaussian process (GP) regression with one-dimensional space + one-dimensional time inputs and a one-dimensional output. This is the second in a series of blog posts on spatial-temporal Gaussian processes.
Separable Kernels
Recall from the last post that we have fitted a temporal GP on a one-dimensional spatial and one-dimensional temporal grid. There, we defined the overall kernel as a product of a spatial and a temporal component, i.e.

$$k\big((s, t), (s', t')\big) = k_s(s, s')\,k_t(t, t').$$

Such kernels are known as separable kernels, and in this post we will explore how one could exploit this structure to obtain significant computational speed-ups.
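To make the structure concrete, here is a small numerical check (a sketch using made-up RBF kernels, lengthscales and grid sizes, not the exact kernels from the last post): if the grid is ordered with space as the outer index and time as the inner index, the full Gram matrix of a separable kernel equals the Kronecker product of the two smaller Gram matrices.

```python
import jax.numpy as jnp

def rbf(x1, x2, lengthscale):
    # Squared-exponential kernel between two sets of 1D points.
    d = x1[:, None] - x2[None, :]
    return jnp.exp(-0.5 * (d / lengthscale) ** 2)

s = jnp.linspace(0.0, 1.0, 4)    # spatial points (N_s = 4)
t = jnp.linspace(0.0, 1.0, 5)    # temporal points (N_t = 5)
K_s, K_t = rbf(s, s, 0.5), rbf(t, t, 0.2)

# Full space-time grid, space as the outer index, time as the inner index.
S, T = jnp.meshgrid(s, t, indexing="ij")
x_s, x_t = S.ravel(), T.ravel()
K_full = rbf(x_s, x_s, 0.5) * rbf(x_t, x_t, 0.2)     # separable kernel on the grid

print(jnp.allclose(K_full, jnp.kron(K_s, K_t), atol=1e-6))  # True
```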
Kronecker Facts
Before describing how one could leverage the Kronecker structure, we first state several relevant and helpful facts about matrices with Kronecker structure.
Consider two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$. The Kronecker product operator $\otimes$ produces the block matrix

$$A \otimes B = \begin{pmatrix} a_{11} B & \cdots & a_{1n} B \\ \vdots & \ddots & \vdots \\ a_{m1} B & \cdots & a_{mn} B \end{pmatrix} \in \mathbb{R}^{mp \times nq}.$$
First, we have the inverse property: if $A$ and $B$ are square and invertible, then $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$.
Next, we have the mixed-product property: $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$, whenever the matrix products $AC$ and $BD$ are defined.
Finally, if we have square matrices $A \in \mathbb{R}^{N_A \times N_A}$ and $B \in \mathbb{R}^{N_B \times N_B}$, then $\det(A \otimes B) = \det(A)^{N_B} \det(B)^{N_A}$.
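These facts are easy to sanity-check numerically. The snippet below (with small random matrices of my own choosing) verifies the mixed-product and determinant identities using jnp.kron.

```python
import jax
import jax.numpy as jnp

kA, kB, kC, kD = jax.random.split(jax.random.PRNGKey(0), 4)
A, C = jax.random.normal(kA, (3, 3)), jax.random.normal(kC, (3, 3))
B, D = jax.random.normal(kB, (4, 4)), jax.random.normal(kD, (4, 4))

# Mixed-product property: (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD)
print(jnp.allclose(jnp.kron(A, B) @ jnp.kron(C, D),
                   jnp.kron(A @ C, B @ D), atol=1e-4))

# Determinant: det(A ⊗ B) = det(A)^{N_B} det(B)^{N_A} with N_A = 3, N_B = 4
print(jnp.allclose(jnp.linalg.det(jnp.kron(A, B)),
                   jnp.linalg.det(A) ** 4 * jnp.linalg.det(B) ** 3, rtol=1e-3))
```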
Kronecker Matrix-Vector Product
We will first show a useful algorithm to compute matrix-vector products of the form $(A \otimes B)\,z$, with $A \in \mathbb{R}^{N_A \times N_A}$, $B \in \mathbb{R}^{N_B \times N_B}$ and $z \in \mathbb{R}^{N_A N_B}$, without ever forming the $N_A N_B \times N_A N_B$ matrix $A \otimes B$. We will use the $\operatorname{vec}$ operator, which stacks the columns of a matrix into a single vector. A property about the $\operatorname{vec}$ operator that we need is

$$(A \otimes B)\,\operatorname{vec}(X) = \operatorname{vec}(B X A^\top),$$

where, because .reshape and .flatten in JAX Numpy use row-major ordering, the $\operatorname{vec}$ operation is practically transpose then reshape - which is also why we transpose before flattening to get the final result in the sketch below. In terms of computational time, the naive implementation of $(A \otimes B)\,z$ costs $\mathcal{O}(N_A^2 N_B^2)$ operations (and as much memory to store $A \otimes B$), whereas the reshape-multiply-transpose algorithm below costs only $\mathcal{O}(N_A N_B (N_A + N_B))$.
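A minimal JAX sketch of this algorithm (the function name kron_matvec and the square-matrix assumption are mine):

```python
import jax.numpy as jnp

def kron_matvec(A, B, z):
    # Compute (A ⊗ B) z without forming the Kronecker product.
    # Assumes square A (N_A x N_A), square B (N_B x N_B), and a
    # row-major-flattened z of length N_A * N_B.
    N_A, N_B = A.shape[0], B.shape[0]
    X = z.reshape(N_A, N_B)
    step1 = A @ X         # (N_A, N_B)
    step2 = step1.T       # (N_B, N_A)
    step3 = B @ step2     # (N_B, N_A)
    step4 = step3.T       # (N_A, N_B), equals A X Bᵀ
    return step4.flatten()
```

For small $N_A$ and $N_B$ this can be checked directly against jnp.kron(A, B) @ z.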
Kronecker Matrix-Matrix Product
One could also easily extend the above Kronecker matrix-vector product to a Kronecker matrix-matrix product $(A \otimes B)\,C$, where $C \in \mathbb{R}^{N_A N_B \times M}$: simply apply the matrix-vector product to each column of $C$, which in JAX can be vectorised with jax.vmap, as sketched below.
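A possible one-liner, assuming the kron_matvec function sketched above:

```python
import jax

# Map kron_matvec over the columns of C (axis 1) and stack the results as columns.
kron_matmat = jax.vmap(kron_matvec, in_axes=(None, None, 1), out_axes=1)

# Usage: kron_matmat(A, B, C) is equivalent to jnp.kron(A, B) @ C.
```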
Standard GP Sampling, Training, and Prediction
For a Gaussian process $f \sim \mathcal{GP}(0, k)$, the function values $\mathbf{f} = (f(x_1), \ldots, f(x_N))^\top$ at a set of inputs are jointly Gaussian, $\mathbf{f} \sim \mathcal{N}(\mathbf{0}, K)$ with $K_{ij} = k(x_i, x_j)$, so a prior sample can be drawn as $\mathbf{f} = L\boldsymbol{\epsilon}$, where $L L^\top = K$ is a Cholesky decomposition and $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, I)$.
Suppose we have made noisy observations $\mathbf{y} = \mathbf{f} + \boldsymbol{\eta}$ at these inputs, with $\boldsymbol{\eta} \sim \mathcal{N}(\mathbf{0}, \sigma^2 I)$, so that $\mathbf{y} \sim \mathcal{N}(\mathbf{0}, K + \sigma^2 I)$.
To train the model using data (or conduct MLE¹), we need to optimise the log marginal likelihood

$$\log p(\mathbf{y}) = -\tfrac{1}{2}\,\mathbf{y}^\top (K + \sigma^2 I)^{-1}\mathbf{y} - \tfrac{1}{2}\log\det(K + \sigma^2 I) - \tfrac{N}{2}\log 2\pi,$$
and prediction at test inputs $X_*$ takes the form $\mathbf{f}_* \mid \mathbf{y} \sim \mathcal{N}\big(K_{*}(K + \sigma^2 I)^{-1}\mathbf{y},\; K_{**} - K_{*}(K + \sigma^2 I)^{-1}K_{*}^\top\big)$, which also implies that if we wish to draw a posterior sample we would have to decompose (e.g. via Cholesky) the posterior covariance $K_{**} - K_{*}(K + \sigma^2 I)^{-1}K_{*}^\top$.
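For reference, here is a naive JAX sketch of the log marginal likelihood (a Cholesky-based evaluation is one standard way of doing this, not necessarily what the previous post used):

```python
import jax.numpy as jnp
from jax.scipy.linalg import cho_factor, cho_solve

def gp_log_marginal_likelihood(K, y, noise_var):
    # Naive O(N^3) evaluation via a Cholesky factorisation of K + σ²I.
    N = y.shape[0]
    L, lower = cho_factor(K + noise_var * jnp.eye(N), lower=True)
    alpha = cho_solve((L, lower), y)                  # (K + σ²I)⁻¹ y
    logdet = 2.0 * jnp.sum(jnp.log(jnp.diag(L)))      # log det(K + σ²I)
    return -0.5 * y @ alpha - 0.5 * logdet - 0.5 * N * jnp.log(2.0 * jnp.pi)
```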
“GP Does Not Scale”
It is a consensus / folklore that GPs do not scale. This is mostly due to the fact that training and sampling of a GP involve inversions and Cholesky decompositions of the $N \times N$ matrix $K + \sigma^2 I$, which cost $\mathcal{O}(N^3)$ time and $\mathcal{O}(N^2)$ memory.
It is still ongoing research to devise tricks and algorithms to make GPs more scalable. Some notable approaches include:
- Inducing Points
- Vecchia Approximations
- SPDE Approach
- Efficient Posterior Sampling
- Conjugate Gradients.
Here, we will look at one such approach: the Kronecker-structure-exploiting method. Assume we have a 1D space + 1D time temporal GP with $N_s$ spatial points and $N_t$ temporal points, so that there are $N = N_s N_t$ inputs in total and, with the separable kernel above, the covariance matrix factorises as $K = K_s \otimes K_t$.
Kronecker Product Gaussian Process
The contents here are largely based on Saatchi (2011).
Sampling from a GP
Naive

Draw $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, I)$ and compute $\mathbf{f} = \operatorname{chol}(K_s \otimes K_t)\,\boldsymbol{\epsilon}$, which requires the Cholesky decomposition of an $N_s N_t \times N_s N_t$ matrix at cost $\mathcal{O}(N_s^3 N_t^3)$.

Kronecker

By the mixed-product property, $\operatorname{chol}(K_s) \otimes \operatorname{chol}(K_t)$ is lower triangular and its product with its own transpose equals $K_s \otimes K_t$, i.e. $\operatorname{chol}(K_s \otimes K_t) = \operatorname{chol}(K_s) \otimes \operatorname{chol}(K_t)$. We therefore only need Cholesky decompositions of the two small matrices and can compute

$$\mathbf{f} = \big(\operatorname{chol}(K_s) \otimes \operatorname{chol}(K_t)\big)\,\boldsymbol{\epsilon},$$

where we can use the Kronecker matrix-vector product.
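A sketch of the Kronecker sampler in JAX, reusing the hypothetical kron_matvec from above:

```python
import jax
import jax.numpy as jnp

def sample_gp_kron(key, K_s, K_t):
    # Draw one prior sample with covariance K_s ⊗ K_t without ever forming it.
    L_s = jnp.linalg.cholesky(K_s)
    L_t = jnp.linalg.cholesky(K_t)
    eps = jax.random.normal(key, (K_s.shape[0] * K_t.shape[0],))
    return kron_matvec(L_s, L_t, eps)   # (L_s ⊗ L_t) ε = chol(K_s ⊗ K_t) ε
```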
GP Likelihood
Naive

Evaluate the log marginal likelihood directly,

$$\log p(\mathbf{y}) = -\tfrac{1}{2}\,\mathbf{y}^\top (K_s \otimes K_t + \sigma^2 I)^{-1}\mathbf{y} - \tfrac{1}{2}\log\det(K_s \otimes K_t + \sigma^2 I) - \tfrac{N_s N_t}{2}\log 2\pi,$$

which again costs $\mathcal{O}(N_s^3 N_t^3)$.

Kronecker

There are two places where we need to leverage the Kronecker structure:

- the log determinant $\log\det(K_s \otimes K_t + \sigma^2 I)$,
- the inverse in the quadratic term $(K_s \otimes K_t + \sigma^2 I)^{-1}\mathbf{y}$.

Note that the added noise term $\sigma^2 I$ means we cannot apply the inverse and determinant properties from above directly, since $K_s \otimes K_t + \sigma^2 I$ is no longer a pure Kronecker product.
Consider eigendecompositions $K_s = Q_s \Lambda_s Q_s^\top$ and $K_t = Q_t \Lambda_t Q_t^\top$, where $Q_s, Q_t$ are orthogonal and $\Lambda_s, \Lambda_t$ are diagonal. Using the mixed-product property, we have

$$K_s \otimes K_t + \sigma^2 I = (Q_s \otimes Q_t)\big(\Lambda_s \otimes \Lambda_t + \sigma^2 I\big)(Q_s \otimes Q_t)^\top,$$

so the log determinant becomes

$$\log\det(K_s \otimes K_t + \sigma^2 I) = \sum_{i=1}^{N_s}\sum_{j=1}^{N_t} \log\big(\lambda_{s,i}\,\lambda_{t,j} + \sigma^2\big),$$

and the quadratic term becomes

$$(K_s \otimes K_t + \sigma^2 I)^{-1}\mathbf{y} = (Q_s \otimes Q_t)\big(\Lambda_s \otimes \Lambda_t + \sigma^2 I\big)^{-1}(Q_s \otimes Q_t)^\top\,\mathbf{y},$$

where the remaining term $\big(\Lambda_s \otimes \Lambda_t + \sigma^2 I\big)^{-1}$ is diagonal, so applying it is just an element-wise division, while the multiplications by $Q_s \otimes Q_t$ and its transpose can be carried out with the Kronecker matrix-vector product.
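Putting it together, here is a sketch of the Kronecker-structured log marginal likelihood in JAX (again reusing the hypothetical kron_matvec; jnp.linalg.eigh returns the eigenvalues and orthonormal eigenvectors of a symmetric matrix):

```python
import jax.numpy as jnp

def kron_gp_log_marginal_likelihood(K_s, K_t, y, noise_var):
    # Eigendecompose the two small covariance matrices.
    lam_s, Q_s = jnp.linalg.eigh(K_s)
    lam_t, Q_t = jnp.linalg.eigh(K_t)

    # Eigenvalues of K_s ⊗ K_t + σ²I, in the same row-major order as kron_matvec.
    lam = jnp.outer(lam_s, lam_t).flatten() + noise_var

    # (Q_s ⊗ Q_t)ᵀ y = (Q_sᵀ ⊗ Q_tᵀ) y via the Kronecker matrix-vector product.
    y_rot = kron_matvec(Q_s.T, Q_t.T, y)

    quad = jnp.sum(y_rot ** 2 / lam)     # yᵀ (K + σ²I)⁻¹ y
    logdet = jnp.sum(jnp.log(lam))       # log det(K + σ²I)
    N = y.shape[0]
    return -0.5 * quad - 0.5 * logdet - 0.5 * N * jnp.log(2.0 * jnp.pi)
```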