This blog post is largely based on some notes written by Chris Sherlock.
Kalman Filter
The filtering problem of HMM is about learning the hidden state \(X_t\) given the observations \(y_{1:t}\)2, i.e. finding the distribution \(p(x_t | y_{1:t})\). Closely related to filtering is the prediction problem where we try to learn \(X_{t+1}\) using observations \(y_{1:t}\), i.e. finding \(p(x_{t+1}|y_{1:t})\). Since we assume we know the relationship between \(X_t\) and \(X_{t+1}\), solving the filtering problem will allow us to solve the prediction problem.
The Kalman filter, as the name might suggest, is Kalman’s solution to the filtering problem assuming the linear Gaussian HMMs. We will derive the method below, which only uses (a bit tedious but) elementary linear algebra and probability theory.
We define the distribution \(X_t |y_{1:t} \sim N(\tilde{\mu}_t, \tilde{\Sigma}_t)\) for \(t = 1,2, \ldots\) where the Gaussianity follows from the closedness of Gaussian random variables under linear operators and \(\tilde{\mu}_t, \tilde{\Sigma}_t\) are things that we will figure out the expressions of subsequently.
First, we know from the conditional independence structure of an HMM that3 \[ X_t | x_{0:t-1}, y_{1:t-1} = \Phi x_{t-1} + N(0, B). \] If we integrate out \(X_{0:t-1}\), and especially \(X_{t-1}\), we would have \[ {\color{blue} X_t | y_{1:t-1} \sim N(\Phi\tilde{\mu}_{t-1} , \Phi \tilde{\Sigma}_{t-1} \Phi^T + B) =: N({\mu}_{t}, {\Sigma}_t).} \] This gives us the solution to the prediction problem subject to knowing the filtering problem. Next, we will solve the filtering problem.
Again, using the model structure, we know that \[ Y_t | y_{1:t-1}, x_{0:t} = H x_t + N(0,R) \] which helps us to realise the covariance structure of \(Y_t\) and \(X_t\) conditional on \(y_{t-1}\). This gives us the following distribution \[ \begin{bmatrix} X_t \\ Y_t \end{bmatrix} \bigg| y_{t-1} = N\left( \begin{bmatrix} \mu_t \\ H\mu_t \end{bmatrix}, \begin{bmatrix} \Sigma_t & \Sigma_t H^T \\ H\Sigma_t & H \Sigma_t H^T + R \end{bmatrix} \right) \] using the conditional distribution \(X_t | y_{t-1}\) above.
Afterwards, we use the formula of conditional distribution of multivariate Gaussian (see here for a derivation) to write down the filtering distribution: \[ {\color{blue} \begin{split} X_t | y_{1:t} &\sim N(\tilde{\mu}_t, \tilde{\Sigma}_t) \\ \tilde{\mu}_t &= \mu_t + \Sigma_t H^T (H \Sigma_t H^T + R)^{-1} (y_t - H \mu_t) =: \mu_t + K_t (y_t - H \mu_t) \\ \tilde{\Sigma}_t &= \Sigma_t - \Sigma_t H^T (H \Sigma_t H^T + R)^{-1}H \Sigma_t =: \Sigma_t - K_t H \Sigma_t \\ K_t &:= \Sigma_t H^T (H \Sigma_t H^T + R)^{-1} \end{split}} \] where \(K_t\) is often called the Kalman gain as it could be viewed as the weighting of the assimilated observation \(y - H \mu_t\) we use to update our prediction.
In summary, we can break down the Kalman filter procedure after observing a new observation \(y_t\) into two steps: Given current filtering distribution \(X_{t-1} | y_{1:t-1} \sim N(\tilde{\mu}_{t-1}, \tilde{\Sigma}_{t-1})\),
- Propagate: \(X_{t} | y_{1:t-1} \sim N(\mu_t, \Sigma_t)\) with propagated mean \(\mu_t = \Phi \tilde{\mu}_{t-1}\); propagated covariance \(\Sigma_t = \Phi \tilde{\Sigma}_{t-1}\Phi^T + B\).
- Assimilate: \(X_{t} | y_{1:t} \sim N(\tilde{\mu}_{t}, \tilde{\Sigma}_{t})\) with Kalman gain \(K_t = \Sigma_t H^T (H \Sigma_t H^T + R)^{-1}\); filtered mean \(\tilde{\mu}_t = \mu_t + K_t (y_t - H \mu_t)\); filtered covariance \(\tilde{\Sigma}_t = \Sigma_t - K_t H \Sigma_t\).
Remark
- If we use a fixed covariance matrix \(P^b\) instead of the time-varying \(\Sigma_t\) for the above procedure, we would recover the optimal interpolation (OI) (Edwards et al. 2015) of oceanographic / meterological data assimilation.
- If we use a non-linear observation operator \(h(\cdot)\) and use its Jacobian \(H_t := \nabla h(x_t)\) instead of \(H\) when hidden state is \(x_t\), we would recover the extended Kalman filter (EKF) (Edwards et al. 2015).
Ensemble Kalman Filter
In Kalman filter, we update the filtering distribution using the propagated mean and covariance. However, this is not necessary - we could use the average of samples from the latest filtering distribution instead. This will result in the ensemble Kalman filter (EnKF).
Assume we draw \(M\) i.i.d. samples from the filtering distribution \(X_{t-1} | y_{1:t-1}\) and denote them by \(\tilde{X}^{m}_{t-1}\) for \(m = 1, 2, \ldots, M\). Subsequently, we would obtain the propagated samples \[ {\color{blue} X^m_{t} = \Phi \tilde{X}^{m}_{t-1} + N(0,B) } \] where different Gaussian noises are injected to each sample. It can be shown that the mean and variance of those propagated samples are identical to those of the exact propagated distribution.
Subsequently, we simulate \[ {\color{blue} Y_t^m = H X_t^m + N(0, R). } \] Those \(Y_t^m\) would replace the role of \(Y_t | y_{1:t-1}, x_{0:t}\) in our computation. Recall that our Kalman gain can be written as \[ K_t = \operatorname{Cov}(X_t, Y_t | y_{t-1}) \operatorname{Cov}(Y_t, Y_t | y_{t-1})^{-1}. \] Using our samples \(X_t^m\) and \(Y_t^m\), we can compute their empirical covariance and estimate the Kalman gain, given by \[ {\color{blue} \hat{K}_t = \operatorname{Cov}(X_t^{1:m}, Y_t^{1:m}) \operatorname{Cov}(Y_t^{1:m}, Y_t^{1:m})^{-1}. } \] This would then allow us to update our existing samples \(X_t^m\), which is given by \[ {\color{blue} \tilde{X}_t^m = X_t^m + \hat{K}_t(y_t - Y_t^m) } \] for each \(m = 1, 2, \ldots, M\).
In summary, we can break down the ensemble Kalman filter procedure after observing a new observation \(y_t\) into two steps: Given current samples \(\tilde{X}^{m}_{t-1}\) from the filtering distribution \(X_{t-1} | y_{1:t-1} \sim N(\tilde{\mu}_{t-1}, \tilde{\Sigma}_{t-1})\),
- Propagate: For each \(m\), we propagate by \(X^m_{t} = \Phi \tilde{X}^{m}_{t-1} + N(0,B)\).
- Assimilate: For each \(m\), we simulate \(Y_t^m = H X_t^m + N(0, R)\) and estimate the Kalman gain \(\hat{K}_t = \operatorname{Cov}(X_t^{1:m}, Y_t^{1:m}) \operatorname{Cov}(Y_t^{1:m}, Y_t^{1:m})^{-1}\). Subsequently, we assimilate our samples by \(\tilde{X}_t^m = X_t^m + \hat{K}_t(y_t - Y_t^m)\).
References
Footnotes
An implicit notation throughout this blog post is that capital letters such as \(X_t\) represents random variables, while the lower cases like \(x_t\) represent the realisations of a random variable, thus is a constant.↩︎
We use the notation \(1:t\) to represent \(1, 2, \ldots, t\).↩︎
We will slightly abuse the notation by including things such as \(N(0,Q)\) with properly defined random variables to ease of exposition.↩︎