$\newcommand{\P}{\mathbb{P}}$ $\newcommand{\E}{\mathbb{E}}$ $\newcommand{\S}{\mathcal{S}}$ $\newcommand{\var}{\mathrm{Var}}$ $\newcommand{\bmu}{\boldsymbol{\mu}}$ $\newcommand{\bSigma}{\boldsymbol{\Sigma}}$ $\newcommand{\btheta}{\boldsymbol{\theta}}$ $\newcommand{\bpi}{\boldsymbol{\pi}}$ $\newcommand{\indep}{\perp\!\!\!\perp}$ $\newcommand{\bp}{\mathbf{p}}$ $\newcommand{\bx}{\mathbf{x}}$ $\newcommand{\bX}{\mathbf{X}}$ $\newcommand{\by}{\mathbf{y}}$ $\newcommand{\bY}{\mathbf{Y}}$ $\newcommand{\bz}{\mathbf{z}}$ $\newcommand{\bZ}{\mathbf{Z}}$ $\newcommand{\bw}{\mathbf{w}}$ $\newcommand{\bW}{\mathbf{W}}$ $\newcommand{\bv}{\mathbf{v}}$ $\newcommand{\bV}{\mathbf{V}}$

TUTORIAL 4b

Kalman filtering


Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: Dec 3, 2020
Copyright: © 2020 Sebastien Roch


1 Background: linear-Gaussian models

The model: We consider a Markov process $\{\bX_t\}_{t=0}^T$ with state space $\S = \mathbb{R}^{d_0}$ of the following form

$$ \bX_{t+1} = F \,\bX_t + \bW_t $$

where the $\bW_t$'s are i.i.d. $\mathcal{N}(\mathbf{0}, Q)$ and $F$ and $Q$ are known $d_0 \times d_0$ matrices. We denote the initial state by $\bX_0 = \bx_0$, which is assumed known. We assume that the process $\{\bX_t\}_{t=1}^T$ is not observed; instead, an auxiliary observed process $\{\bY_t\}_{t=1}^T$ with state space $\mathbb{R}^{d}$ satisfies

$$ \bY_t = H\,\bX_t + \bV_t $$

where the $\bV_t$'s are i.i.d. $\mathcal{N}(\mathbf{0}, R)$ and $H \in \mathbb{R}^{d \times d_0}$ and $R \in \mathbb{R}^{d \times d}$ are known matrices. Graphically, the model can be represented as follows, where as before each variable is a node and its conditional distribution depends only on its parent nodes.

[Figure: graphical model of a hidden Markov model (HMM).]
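Equivalently, spelling out the conditional distributions that define the model,

$$ \bX_{t+1} \,|\, \bX_t = \bx \sim \mathcal{N}(F\bx,\, Q), \qquad \bY_t \,|\, \bX_t = \bx \sim \mathcal{N}(H\bx,\, R). $$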

Our goal is to infer the unobserved states given the observed process. Specifically, we look at the filtering problem. Quoting Wikipedia:

The task is to compute, given the model's parameters and a sequence of observations, the distribution over hidden states of the last latent variable at the end of the sequence, i.e. to compute P(x(t)|y(1),…,y(t)). This task is normally used when the sequence of latent variables is thought of as the underlying states that a process moves through at a sequence of points of time, with corresponding observations at each point in time. Then, it is natural to ask about the state of the process at the end. This problem can be handled efficiently using the forward algorithm.

The forward algorithm: We derive the forward algorithm in the discrete case. As usual, it can be adapted to the continuous case, such as the linear-Gaussian model. We use the notation $\bx_{1:s} = (\bx_1, \ldots,\bx_s)$. The joint distribution (of what is referred to as a hidden Markov model) takes the form

$$ \P[\bX_{1:T}, \bY_{1:T}] = \prod_{s=1}^{T} \P[\bX_s|\bX_{s-1}] \,\P[\bY_s|\bX_s]. $$

Our goal is to compute, as a function of $\bx_t$,

$$ \P[\bX_t = \bx_t|\bY_{1:t} = \by_{1:t}] = \frac{\P[\bX_t = \bx_t, \bY_{1:t} = \by_{1:t}]}{\P[\bY_{1:t} = \by_{1:t}]}. $$

Because the denominator does not depend on $\bx_t$, it suffices to compute the numerator; the denominator is then recovered by summing the numerator over $\bx_t$.

We give a recursion for the numerator $\alpha_t(\bx_t)$ as a function of $\bx_t$ that takes advantage of the conditional independence relations in the model. Observe that

$$ \begin{align*} \alpha_t(\bx_t) &= \P[\bX_t = \bx_t, \bY_{1:t} = \by_{1:t}]\\ &= \sum_{\bx_{t-1} \in \S} \P[\bX_{t-1} =\bx_{t-1}, \bX_t = \bx_t, \bY_{1:t} = \by_{1:t}]\\ &= \sum_{\bx_{t-1} \in \S} \P[\bX_{t-1} = \bx_{t-1}, \bY_{1:t-1} = \by_{1:t-1}] \\ &\qquad\qquad \times\P[\bX_t = \bx_t|\bX_{t-1} = \bx_{t-1}] \,\P[\bY_t = \by_t | \bX_t = \bx_t]\\ &= \sum_{\bx_{t-1} \in \S} \alpha_{t-1}(\bx_{t-1}) \,\P[\bX_t = \bx_t|\bX_{t-1} = \bx_{t-1}] \,\P[\bY_t = \by_t | \bX_t = \bx_t]. \end{align*} $$

The two conditional probabilities on the last line are known from the model. The recursion is initialized with $\alpha_1(\bx_1) = \P[\bX_1 = \bx_1|\bX_0 = \bx_0] \,\P[\bY_1 = \by_1|\bX_1 = \bx_1]$.
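For concreteness, here is a minimal sketch of this recursion for a finite state space $\{1,\ldots,n\}$. It is an illustration only: the function forward and the inputs P, e, a1 are hypothetical names, not part of the code used later. Here P[i,j] is the transition probability from state i to state j, e[t][j] is $\P[\bY_t = \by_t|\bX_t = j]$ for the observed sequence, and a1[j] is $\alpha_1(j)$.

function forward(P, e, a1)
    # alpha[j,t] = alpha_t(j) = P[X_t = j, Y_{1:t} = y_{1:t}]
    n, T = length(a1), length(e)
    alpha = zeros(n, T)
    alpha[:,1] = a1
    for t = 2:T
        for j = 1:n
            # alpha_t(j) = sum_i alpha_{t-1}(i) P[i,j] * P[Y_t = y_t | X_t = j]
            alpha[j,t] = sum(alpha[i,t-1]*P[i,j] for i=1:n)*e[t][j]
        end
    end
    return alpha
end

In practice, one normalizes alpha[:,t] at each step to avoid numerical underflow; the normalized vector is exactly the filtering distribution $\P[\bX_t = \cdot\,|\,\bY_{1:t} = \by_{1:t}]$.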

Returning to the linear-Gaussian case: In the linear-Gaussian case, the joint distribution is multivariate Gaussian and the conditional densities are specified entirely by their means and covariance matrices. See [Bis, Section 13.3] for the details. Let $\bmu_t$ and $\bSigma_t$ be the mean and covariance matrix of $\bX_t$ conditioned on $\bY_{1:t}$. The recursions for these quantities are the following:

$$ \begin{align*} \bmu_t &= F\,\bmu_{t-1} + K_{t} (\bY_{t} - H F \bmu_{t-1})\\ \bSigma_t &= (I_{d_0 \times d_0} - K_t H) P_{t-1} \end{align*} $$

where

$$ \begin{align*} P_{t-1} &= F \,\bSigma_{t-1} F^T + Q\\ K_t &= P_{t-1} H^T (H P_{t-1} H^T + R)^{-1} \end{align*} $$

The matrix $K_t$ is known as the Kalman gain matrix, and $P_{t-1}$ is the covariance of $\bX_t$ conditioned on $\bY_{1:t-1}$, that is, the covariance of the one-step prediction. The recursion above is known as Kalman filtering.
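To get a feel for these formulas, consider the scalar case $d_0 = d = 1$ and write $F = f$, $H = h$, $Q = q$, $R = r$, $\bSigma_{t-1} = \sigma_{t-1}^2$. Then

$$ p_{t-1} = f^2 \sigma_{t-1}^2 + q, \qquad k_t = \frac{p_{t-1}\, h}{h^2\, p_{t-1} + r}, $$

$$ \mu_t = f\,\mu_{t-1} + k_t\,(y_t - h f \mu_{t-1}), \qquad \sigma_t^2 = (1 - k_t h)\, p_{t-1}. $$

When the observation noise $r$ is large, $k_t \approx 0$ and the filter essentially keeps the prediction $f\mu_{t-1}$; when $r$ is small, $k_t \approx 1/h$ and $\mu_t \approx y_t/h$, i.e. the filter essentially trusts the observation.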

2 Location tracking

We apply Kalman filtering to location tracking. We imagine that we get noisy observations about the successive positions of an object. Think of GPS measurements for instance. We seek to get a better estimate of the location of the object using the method above. See for example this blog post on location tracking.

We model the true location as a linear-Gaussian system over the two-dimensional position $(z_{1t}, z_{2t})_t$ and velocity $(\dot{z}_{1t}, \dot{z}_{2t})_t$, sampled at time intervals of length $\Delta$. Formally, the system is

$$ \bX_t = (z_{1t}, z_{2t}, \dot{z}_{1t}, \dot{z}_{2t}), \quad F = \begin{pmatrix} 1 & 0 & \Delta & 0\\ 0 & 1 & 0 & \Delta\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1 \end{pmatrix}. $$

In words, the velocity is unchanged, up to Gaussian perturbation. The position changes proportionally to the velocity in the corresponding dimension.
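Written out componentwise, the state equation reads

$$ z_{i,t+1} = z_{it} + \Delta\,\dot{z}_{it} + W_{it}, \qquad \dot{z}_{i,t+1} = \dot{z}_{it} + W_{i+2,t}, \qquad i = 1, 2, $$

where $W_{jt}$ denotes the $j$-th component of $\bW_t$.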

The observations $(\tilde{z}_{1t}, \tilde{z}_{2t})_t$ are modeled as

$$ \bY_t = (\tilde{z}_{1t}, \tilde{z}_{2t}), \quad H = \begin{pmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ \end{pmatrix}. $$

In words, we only observe the positions, up to Gaussian noise.
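Componentwise,

$$ \tilde{z}_{it} = z_{it} + V_{it}, \qquad i = 1, 2, $$

where $V_{it}$ denotes the $i$-th component of $\bV_t$.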

3 Implementing the Kalman filter
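The cells below rely on a few standard packages. The notebook's setup cell is not shown above, so the following is an assumption about that setup: a minimal set of packages that makes the cells run.

using LinearAlgebra   # Diagonal
using Random          # rand!
using Distributions   # MvNormal
using Plots           # plot, scatter!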

In [2]:
function lgSamplePath(ss, os, F, H, Q, R, x_1, T)
    # Sample a path of length T from the linear-Gaussian model:
    # ss = state dimension, os = observation dimension, F = dynamics matrix,
    # H = observation matrix, Q = state noise covariance,
    # R = observation noise covariance, x_1 = initial state.
    x, y = zeros(ss,T), zeros(os,T);
    x[:,1] = x_1;
    ey = zeros(os);
    rand!(MvNormal(zeros(os),R),ey); # observation noise V_1
    y[:,1] = H*x[:,1] .+ ey;
    for t = 2:T
        ex = zeros(ss);
        rand!(MvNormal(zeros(ss),Q),ex); # state noise W_{t-1}
        x[:,t] = F*x[:,t-1] .+ ex;
        ey = zeros(os);
        rand!(MvNormal(zeros(os),R),ey); # observation noise V_t
        y[:,t] = H*x[:,t] .+ ey;
    end
    return x, y
end
Out[2]:
lgSamplePath (generic function with 1 method)
In [3]:
ss = 4; # state size: (z1, z2, z1dot, z2dot)
os = 2; # observation size: (z1tilde, z2tilde)
F = [1 0 1 0; 0 1 0 1; 0 0 1 0; 0 0 0 1]; # dynamics with Δ = 1
H = [1 0 0 0; 0 1 0 0]; # observe positions only
Q = 0.01*Diagonal(ones(ss)); # state noise covariance
R = 3*Diagonal(ones(os)); # observation noise covariance
initmu = [8; 10; 1; 0]; # initial state: position (8,10), velocity (1,0)
initV = 3*Diagonal(ones(ss)); # initial covariance for the filter
T = 50; # number of time steps
x, y = lgSamplePath(ss, os, F, H, Q, R, initmu, T);
In [4]:
plot(x[1,:], x[2,:], 
    legend=false, xlims=(minimum(x[1,:])-5,maximum(x[1,:])+5),
    ylims=(minimum(x[2,:])-5,maximum(x[2,:])+5),
    markershape=:xcross, alpha=0.5)
scatter!(y[1,:],y[2,:])
Out[4]:
In [5]:
function kalmanUpdate(ss, A, C, Q, R, yt, prevmu, prevV)
    # One step of the Kalman filter; in the notation above, A = F and C = H.
    mupred = A*prevmu; # predicted mean F*mu_{t-1}
    Vpred = A*prevV*A' .+ Q; # predicted covariance P_{t-1}
    e = yt .- C*mupred; # error (innovation)
    S = C*Vpred*C' .+ R; # innovation covariance H*P_{t-1}*H' + R
    Sinv = inv(S);
    K = Vpred*C'*Sinv; # Kalman gain matrix
    munew = mupred .+ K*e; # updated mean mu_t
    Vnew = (Diagonal(ones(ss)) .- K*C)*Vpred; # updated covariance Sigma_t
    return munew, Vnew
end
Out[5]:
kalmanUpdate (generic function with 1 method)
In [6]:
function kalmanFilter(ss, os, y, A, C, Q, R, initmu, initV, T)
    # Run the Kalman filter on the observations y, starting from (initmu, initV).
    mu = zeros(ss, T);
    V = zeros(ss, ss, T);
    mu[:,1] = initmu;
    V[:,:,1] = initV;
    for t=2:T
        mu[:,t], V[:,:,t] = kalmanUpdate(ss, A, C, Q, R, 
            y[:,t], mu[:,t-1], V[:,:,t-1])
    end
    return mu, V
end
Out[6]:
kalmanFilter (generic function with 1 method)
In [7]:
mu, V = kalmanFilter(ss, os, y, F, H, Q, R, initmu, initV, T)
plot!(mu[1,:], mu[2,:], 
    markershape=:rect, linewidth=2)
Out[7]:
In [8]:
dobs = x[1:2,:] - y[1:2,:] # position errors of the raw observations
mse_obs = sqrt(sum(dobs.^2)) # root of the sum of squared errors
Out[8]:
16.94065141843803
In [9]:
dfilt = x[1:2,:] - mu[1:2,:] # position errors of the filtered estimates
mse_filt = sqrt(sum(dfilt.^2)) # root of the sum of squared errors
Out[9]:
12.254009107771452

We indeed observe a reduction: the filtered estimates are closer to the true positions than the raw observations are.

4 Missing data [optional]

We can also allow for the possibility that some observations are missing. Imagine for instance losing GPS signal while going through a tunnel. The recursions above remain valid, with the only modification that, at those times $t$ where there is no observation, the update step is skipped: the predicted mean $F\bmu_{t-1}$ and covariance $P_{t-1}$ are used as $\bmu_t$ and $\bSigma_t$. Julia has a convenient missing type to handle this situation.
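As a quick illustration of how missing behaves (the vector v below is just a toy example, not part of the filter code):

v = [1.0, missing, 3.0]   # element type is Union{Missing, Float64}
ismissing.(v)             # element-wise check: false, true, false
any(ismissing, v)         # true
coalesce.(v, 0.0)         # replace missing entries: [1.0, 0.0, 3.0]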

We generate a sample path as above, but mask the observations at times $t=10,\ldots,20$.

In [10]:
ss = 4 # state size
os = 2 # observation size
F = [1 0 1 0; 0 1 0 1; 0 0 1 0; 0 0 0 1]
H = [1 0 0 0; 0 1 0 0]
Q = 0.001*Diagonal(ones(ss))
R = 0.1*Diagonal(ones(os))
initmu = [8; 10; 1; 0]
initV = 0.1*Diagonal(ones(ss))
T = 30
x, y = lgSamplePath(ss, os, F, H, Q, R, initmu, T);
In [11]:
y = convert(Array{Union{Missing,Float64}}, y)
for i=10:20
    y[1,i]=missing
    y[2,i]=missing
end
In [12]:
plot(x[1,:], x[2,:], 
    legend=false, xlims=(minimum(x[1,:])-5,maximum(x[1,:])+5),
    ylims=(minimum(x[2,:])-5,maximum(x[2,:])+5),
    markershape=:xcross, alpha=0.5)
scatter!(y[1,:],y[2,:])
Out[12]:
In [13]:
function kalmanUpdate(ss, A, C, Q, R, yt, prevmu, prevV)
    # One step of the Kalman filter, allowing for a missing observation.
    mupred = A*prevmu # predicted mean
    Vpred = A*prevV*A' .+ Q # predicted covariance P_{t-1}
    if all(ismissing, yt)
        # no observation at this time: skip the update step
        return mupred, Vpred
    else
        e = yt .- C*mupred # error (innovation)
        S = C*Vpred*C' .+ R # innovation covariance
        Sinv = inv(S)
        K = Vpred*C'*Sinv # Kalman gain matrix
        munew = mupred .+ K*e
        Vnew = (Diagonal(ones(ss)) .- K*C)*Vpred
        return munew, Vnew
    end
end
Out[13]:
kalmanUpdate (generic function with 1 method)
In [14]:
mu, V = kalmanFilter(ss, os, y, F, H, Q, R, initmu, initV, T)
plot!(mu[1,:], mu[2,:], 
    markershape=:rect, linewidth=2)
Out[14]: