# Particle filters for high-dimensional geoscience applications: A review

**Funding information**European Research Council via the CUNDA project under the European Union Horizon 2020 programme, and the Deutsche Forschungsgemeinschaft through grant CRC 1294;

## Abstract

Particle filters contain the promise of fully nonlinear data assimilation. They have been applied in numerous science areas, including the geosciences, but their application to high-dimensional geoscience systems has been limited due to their inefficiency in high-dimensional systems in standard settings. However, huge progress has been made, and this limitation is disappearing fast due to recent developments in proposal densities, the use of ideas from (optimal) transportation, the use of localization and intelligent adaptive resampling strategies. Furthermore, powerful hybrids between particle filters and ensemble Kalman filters and variational methods have been developed. We present a state-of-the-art discussion of present efforts of developing particle filters for high-dimensional nonlinear geoscience state-estimation problems, with an emphasis on atmospheric and oceanic applications, including many new ideas, derivations and unifications, highlighting hidden connections, including pseudo-code, and generating a valuable tool and guide for the community. Initial experiments show that particle filters can be competitive with present-day methods for numerical weather prediction, suggesting that they will become mainstream soon.

## 1 INTRODUCTION

Data assimilation for geoscience applications, such as weather or ocean prediction, is a slowly maturing field. Even the linear data assimilation problem cannot be solved adequately because of the size of the problem. Typically, global-scale numerical weather prediction needs estimation of over 10^{9} state variables, assimilating over 10^{7} observations every 6–12 hr. Existing methods like 4DVar do not provide accurate uncertainty estimates and need efficient pre-conditioners, while Ensemble Kalman Filters (EnKFs) heavily rely on somewhat *adhoc* fixes like localization and inflation to find accurate estimates. Hybrids of variational and ensemble Kalman filter methods are a step forward, although localization and inflation are still needed in realistic applications. An extra complication is localization over time needed in ensemble smoothers like the Ensemble Kalman Smoother and four-dimensional ensemble-variational data assimilation system (4DEnsVar) when the fluid flow is strong: what is local at observation time is not necessary local at the start of the assimilation window because the observation influence is advected with the flow. Furthermore, the recent surge of papers on accurate treatment of observation errors shows that a long way is still ahead of us to solve even the (close to) linear data assimilation problem.

Although these problems are formidable, another difficulty arises from the fact that the problem is typically nonlinear, and, with increasing model resolution and more complex observation operators, increasingly so. Both variational and Kalman-filter-like methods have difficulty handling nonlinear problems. Variational methods can easily fail when the cost function is multimodal, and are hampered by the assumption that the prior probability density function (pdf) of the state is assumed to be Gaussian. EnKFs make the explicit assumption that the prior pdf and the likelihood of the observations as function of the state are Gaussian, or, somewhat equivalently, assume that the analysis is a linear combination of prior state and observations. Both methods have been shown to fail for nonlinear data assimilation problems in low-dimensional systems, and both have been reported to have serious difficulties in numerical weather prediction at the convective scale where the model resolution is only a few km. Particle filters hold the promise of fully nonlinear data assimilation without any assumption on prior or likelihood, and recent textbooks like Reich and Cotter (2015), Nakamura and Potthast (2015), and van Leeuwen *et al.* (2015) provide useful introductions to data assimilation in general, and particle filters in particular.

Other fully nonlinear data assimilation methods are Markov chain Monte-Carlo (MCMC) methods that draw directly from the posterior in a sequential way, so one sample after the other, after a burn-in period; e.g. Robert and Cassela (2004) or van Leeuwen *et al.* (2015) give a geophysics-friendly introduction. The samples are correlated, often 100*%* when the new sample is not accepted, making them very inefficient in high-dimensional systems. This is why we concentrate on particle filtering here.

*N*of model states , called particles, that represent the prior probability density function (pdf)

*p*(

**x**

^{n}), as

*n*− 1 to time

*n*with the typically nonlinear model equations

*f*(..) denotes the deterministic model, and

*β*^{n}is a random forcing representing missing physics, discretization errors, etc. In this paper we assume this model noise to be additive, but one could also consider multiplicative noise in which

*β*^{n}is a function of the state of the system. We assume that the pdf from which the

*β*^{n}are drawn is known; typically a Gaussian

*N*(0,

**Q**).

*ϵ*^{n}are random vectors representing measurement errors and possibly representation errors. Again we assume that these errors have known characteristics, often Gaussian, so for example,

*ϵ*^{n}∼

*N*(0,

**R**). These observations are assimilated by multiplying the prior pdf above with the likelihood of each possible state, that is, the probability density

*p*(

**y**

^{n}|

**x**

^{n}) of the observation vector given each possible model state, following Bayes' theorem:

*p*(

**x**

^{n}|

**y**

^{n}) is the posterior pdf, the holy grail of data assimilation. To avoid confusion, it is good to realise that the true state is not a random variable when we apply Bayes' theorem. It is a realization of a process, which could be random or deterministic, from which we then take noisy observations. Instead, Bayes' theorem is a statement of what we think the true state might be. Since the pdf of the

*ϵ*^{n}is known and Bayes' theorem is a statement for each possible state

**x**

^{n}to be the true state,

*p*(

**y**

^{n}|

**x**

^{n}) is the pdf of

**y**

^{n}given that the true state vector would be

**x**

^{n}. In general, since for a given state

**x**

^{n}the observation

**y**

^{n}is equal to the observation error

**shifted by**

*ϵ***H**(

**x**

^{n}), we find (e.g. van Leeuwen 2015):

*w*

_{i}are given by:

Since all terms are known explicitly, we can just calculate this as a number. The self-normalization in the last part of Equation 7 is consistent with the notion that, for a proper representation of a pdf, the sum of the weights should be equal to one, so that the integral over the whole state space of the particle representation of the pdf is equal to one. Figure 1 depicts the working of this filter.

*n*+ 1 gives a weighted representation of the prior at time

*n*+ 1. Assimilating the observation at time

*n*+ 1 by Bayes' theorem leads to a modification of the weights (e.g. Doucet

*et al.*2001 or van Leeuwen 2009):

Even in low-dimensional applications, the variation of the weights increases with the number of assimilation steps. Eventually one particle has a much higher weight than all the others. To prevent this, resampling can be used before propagation to obtain equally weighted particles. This duplicates high-weight particles and abandons low-weight particles. After resampling, some of the particles have identical values, but if the model contains a stochastic component and independent random forcings are used for different particles, diversity is restored; e.g. Doucet *et al.* (2001) or van Leeuwen (2009) give details. Algorithm 1 illustrates the steps.

A simple resampling scheme using only one draw from a uniform distribution *U* is presented in Algorithm 2.

In high-dimensional problems, the weights vary enormously even at one observation time, and typically one particle obtains a much higher weight than all the others. Snyder *et al.*, 2008 (2008, 2015) have shown that the number of particles needed to avoid a weight collapse, in which one particle gets weight 1 and the rest of the weights very close to zero, has to grow exponentially with the dimension of the observations **y** for a large class of particle filters. If the weights collapse, all particles are identical after resampling, and all diversity is lost. From this discussion it becomes clear that, for particle filters to work, we need to ensure that their weights remain similar.

In this review we will discuss four basic ways to make progress on this fundamental problem of weight degeneracy. In the first one, we explore the so-called proposal-density freedom to steer particles through state space such that they obtain very similar weights, e.g. Doucet *et al.* (2001). As pointed out by e.g. Snyder *et al.* (2008), there are fundamental problems when applying these techniques to the high-dimensional geoscience applications. We will examine the issue in detail and discuss so-called equal-weight particle filters, which point towards new ways to formulate and attack the degeneracy problem.

The second approach transforms the prior particles into particles from the posterior, either in one go, or via a more smooth transformation process (Reich, 2013). While the one-step approaches can be shown to fail in high-dimensional settings, they do lend themselves very naturally to localization. The more smooth multi-step transition variants seem to be able to avoid the degeneracy problem without localization, and are an interesting new development.

The third, more straightforward from the geoscience experience, approach is to introduce localization in particle filters. While initial implementations were discouraging (e.g. van Leeuwen 2009), new formulations have shown remarkable successes, such that localized particle filters are now tested in global operational numerical weather prediction systems (e.g. Potthast *et al.*
2019).

The fourth approach is to abandon the idea of using pure particle filters and combine them with EnKFs. This should not be confused with using EnKFs in proposal densities. Several variants exist, such as second-order exact filters, in which only the first two moments are estimated, sequential versions in which first an EnKF is used and the posterior EnKF ensemble is used as input for the particle filter, or vice versa, and combinations in which localized weights are calculated and, dependent on the effective ensemble size, a full particle filter, an EnKF, or a combination of both is used.

These four variants form the basis of the following four sections. Each section contains a critical discussion of the approximations and remaining major issues. It should be noted that the pseudo-code provided does not give the most efficient implementation of the different particle filters, but is rather an illustration of the computational steps involved. Efficient pseudo-code for some of the more complex schemes can be found in Vetra-Carvalho *et al.* (2018). The paper is closed with a concluding section and an outlook of what possible next steps could be.

## 2 PROPOSAL DENSITY PARTICLE FILTERS

Ideally we draw independent samples directly from the posterior pdf because the samples would all have equal weight automatically. This can only be done, however, when the shape of the posterior pdf is known and when it is easy to draw from the posterior. An example of this is a Gaussian prior combined with a linear Gaussian likelihood. Under these assumptions the posterior is also Gaussian and the mean and covariance can be calculated directly from the prior using the Kalman update equations. EnKFs make use of this result and draw directly from that pdf, which is why all posterior particles have equal weights in an EnKF.

The standard particle filter draws particles from the prior. These then have to be modified to become particles of the posterior via the weighting with the likelihood. This is a general procedure in statistics called importance sampling: one draws from an approximation of the pdf one is interested in, and corrects for this via so-called importance weights.

In the introduction we argued that drawing from the prior leads to weights that vary too much: typically, in high-dimensional problems with numerous independent observations one particle gets weight 1, and all other particles have a weight very close to zero. However, we could explore the idea of importance sampling on the transition from one time to the next. When the numerical model is not deterministic but stochastic we have the freedom to change the model equations to move the particles to those parts of state space where we want them to be, for instance closer to the observations.

*n*, so Bayes' theorem at time

*n*is given by Equation 4. If the model is stochastic, we can write the prior as

*p*(

**x**

^{n}|

**x**

^{n − 1}) is the transition density, the pdf of the state at time

*n*when the state at time

*n*− 1 is known. For instance, if the model error is additive and the model equation is given by Equation 2, it holds that

**∼**

*β**N*(0,

**Q**), and we find

*n*− 1 we have a set of weighted particles as in Equation 1, but with weights instead of 1/

*N*. We can evaluate the expression Equation 9 for the prior as a weighted mixture of transition densities

*n*− 1 and assume that Equation 12 is exact. This is not necessarily a good approximation, especially when the number of particles is small. On the other hand, it is consistent with the particle filter approximation in the first place, and one of the few things one can do. By Bayes' formula 4, the posterior can then be written as:

*i*, and we know that this leads to ensemble collapse for high-dimensional systems. However, now the prior particles at time

*n*are allowed to arise from following a different model equation. This works as follows. We can multiply and divide Equations 12 and 13 by a so-called proposal density

*q*(

**x**

^{n}|

**x**

^{n − 1},

**y**

^{n}), leading to:

Now realise that drawing from corresponds to running the original stochastic model. We could instead draw from , which would correspond to a model equation from our choosing. Figure 2 illustrates the basic idea.

*g*(.,.) is now the deterministic part and is the stochastic part. These can be freely chosen, and examples of these will be given below. Note that we allowed

*g*(..) to depend on the observations at the future time. This means that we generate the prior particles at time

*n*by making one draw from for each

*i*where

*n*from the alternative model

*q*(

**x**

^{n}|

**x**

^{n − 1},

**y**

^{n}) and account for this by changing the weights of the particles. Equations 14 and 15 can be written as

We see that the weights now contain two factors, the likelihood weight, which also appears in the standard particle filter, and a proposal weight. These two weights have opposing effects. If we use a proposal density that strongly pushes the model towards the observations, the likelihood weight will be large because the difference between observations and model states becomes smaller, but the proposal weight becomes smaller because the model is pushed away from where it wants to go, so will be small. On the other hand, a weak pushing towards the observations keeps the proposal weight high, but leads to a small likelihood weight. This suggests that there is an optimum weight related to an optimal position for each particle as function of its position at time *n* − 1. This will be explored in equal-weight formulations of the particle filter. Figure 3 shows how typical proposal-density particle filters work. Equal-weight particle filters are discussed later.

### 2.1 A simple relaxation scheme

*m*for the state vector to emphasise that there are several model time steps between observation times.

**T**is a relaxation matrix of our choice. In this example, the deterministic part consists of the first two terms on the right-hand side of the equation, while the third term denotes the random part. Let us assume the pdf of the random forcing is Gaussian with mean zero and covariance . Then we can immediately write for the proposal density

**x**

^{m}is just a shift in the mean of the pdf of . For the original model, we assume that the random part is Gaussian with zero mean and covariance

**Q**, so that

Note that the normalization factors of the Gaussians do not have to be calculated explicitly if we stipulate that the sum of the weights has to be equal to one. The scheme is depicted by Algorithm 3.

Simple as the scheme is, it does not solve the degeneracy problem. However, it can be used as a simple scheme when several model time steps are used between observation times, because the proposal is independent of the proposal at other time steps. This scheme can easily be used in combination with other schemes that work at observation time, to be discussed next.

### 2.2 Weighted Ensemble Kalman Filter

*et al.*2010) the stochastic EnKF of Burgers

*et al.*(1998) is used as follows. The EnKF update can be written as:

*K*is the ensemble Kalman gain and

*ϵ*_{i}∼

*N*(0,

**R**), with

**R**the observational error covariance. Using the expression for the forecast in the Kalman filter update equation, we find:

Strictly speaking, this is correct only if the Kalman gain is calculated using the ensemble covariance of *f*(**x**^{n − 1}), so without the model errors *β*^{n}, otherwise the proposal is not Gaussian. We can calculate the weights of the particles in a similar way to the previous example. Algorithm 4 shows the algorithmic steps.

The behaviour of this filter has been studied extensively by Morzfeld *et al.* (2017). In high-dimensional systems, this filter will be degenerate, consistent with the theory of Snyder *et al.* (2015), and as proven in the next section. The only way to make this work is to include localization, not only at the EnKF level, but also at the level of the particle filter (e.g. Morzfeld *et al.* 2017).

### 2.3 Optimal proposal density

In the class of particle filters in which the proposal density of each particle is dependent only on that particle, an optimal proposal density can be derived, as in e.g. Doucet *et al.* (2001). They defined optimality as the proposal density that gives a minimal variance of the weights, and Snyder *et al.*
(2015) provide an elegant proof of this optimality. In this section we generalize this result and show that the optimal proposal density is optimal even when each particle has its own proposal density which is allowed to depend on all previous particles, so a proposal of the form .

*et al.*(2015) concentrate on the case that one is interested in – an optimal representation of

*p*(

*x*

^{n},

*x*

^{n − 1}|

*y*

^{n}) in a sequential algorithm – so in a sequential smoother. To this end they introduce the random variable

*q*that minimizes the variance in the weights

*w*

^{∗}, with the expectation taken over the density from which we draw the particles, so the proposal

*q*.

Here we show that the optimal proposal density is also optimal for the strict filtering case, so when we are interested in minimal variance of the weights at time *n* only. Specifically, the question is: given the set of particles at *t* = *n* − 1 drawn from *p*(**x**_{n − 1}|**y**_{1:n − 1}), which proposal density of the form gives minimal variance of the weights at time *n*?

*i*as function of the state at time

*n*as:

*n*− 1. Note that the second equality follows from Bayes' theorem, as follows:

*I*,

**X**

^{n}) where Prob(

*I*=

*i*) = 1/

*N*and, conditionally on

*I*=

*i*, . Furthermore, define the associated random variable

*q*that minimizes the variance of

*W*, we use the well-known law of total variance (derived in the Appendix for completeness):

*q*,

*q*. Moreover, and thus the first term in var

_{W}(

*W*) is

*W*is almost surely constant in

**X**

^{n}, that is if and only if

**x**

^{n}. Because both

*p*and

*q*are densities (in

**x**

^{n}),

*cst*= 1. Combining these results, we have a lower bound for var(

*W*) that is determined by the variance of over

*i*, with equality if and only if

Note that this is a new result, as previous proofs only considered proposal densities of the form , and we extended it to more general proposal densities of the form .

This remarkable result shows that firstly the optimal proposal density, so , does indeed lead to the lowest variance in the weights for the class of particle filters in which the transition density is of the form . Secondly, it shows that we can predict the variance in the weights without doing the actual experiment, for any number of particles, provided we can compute , and thirdly the weights are *independent of the position of the particles* **x**^{n}. Unfortunately, this variance is zero only when the observations are not dependent on the state at time *n* − 1, which is never the case in the geosciences.

*H*=

**H**is linear. By the same argument that is used to derive the Kalman filter update, we find

**T**=

**QH**

^{T}(

**HQH**

^{T}+

**R**)

^{−1}is the Kalman-like gain with the background covariance

**Q**, and the weights are proportional to

This shows two things. First, in this special case, the simple relaxation scheme of Section 2.1 is equal to the optimal proposal when the relaxation matrix **T** is chosen as above. Second, comparing the weights of the optimal proposal with the weights of the standard filter, they both depend on the squared distance , and , respectively, but in the standard particle filter the distance is defined w.r.t. **R** and in the optimal proposal the distance it is defined is w.r.t. **HQH**^{T} + **R**. Hence the weights with the optimal proposal are more similar, but the improvement is substantial only if **Q** is large, and the analysis of weight collapse by Snyder *et al.* (2008) still applies.

*et al.*(2015) show that the optimal proposal is the proposal of this form with minimal variance in the weights in this case too, which can also easily be seen by applying the above to

*m*<

*n*.

Looking back at the filters described in the previous sections, we find the following. The relaxation scheme uses a simple proposal density that is of the form , so the theory holds, and that proposal will lead to degenerate results. This is indeed the finding of van Leeuwen (2010). The WEKF has a proposal that depends on all particles at time *n* − 1 through the Kalman gain **K**, so the proposal is of the form . Hence also this filter will perform worse than the optimal proposal and hence will be degenerate for high-dimensional systems. This was first explored in detail by Morzfeld *et al.* (2017).

### 2.4 Implicit Particle Filter

The Implicit Particle Filter is an indirect way to draw from the optimal proposal, even over several time steps. Often the assumption is made that the model errors of both original model and proposal density are Gaussian, and the observation operator **H** is linear. In this case, a draw from the optimal proposal is a draw from a multivariate Gaussian, and we know how to do that.

However, when **H** is nonlinear, or when the proposal is used over several model time steps, the density to draw from is not now Gaussian. Chorin *et al.* (2010) realised that one could still draw from a Gaussian and then apply a transformation to that draw to find samples from the optimal proposal density. The method is explained here for one time step, but the extension to multiple time steps is straightforward. Figure 4 illustrates the basic idea.

*q*(

**) =**

*ξ**N*(0,

**I**), and we can write the transformation as in which

**J**

_{i}is the Jacobian of the transformation from

**x**

^{n}to

**. That transformation is found implicitly, hence the name of the filter, by defining**

*ξ*

*ξ*_{i}for each particle, solving for

**x**

^{n}in

Interestingly, while the optimal proposal density shows that the weights are only dependent on the position of the particles at the previous time, so on via *φ*_{i}, the implicit map makes the weights also dependent on the positions at the current time *n*, so on via the Jacobian of the transformation between ** ξ** and

**x**. Only when the Jacobian is a constant, so when

*F*

_{i}is quadratic in

**x**

_{i}, this dependence disappears.

*et al.*(2012) suggest a random map of the form

**P**is a chosen covariance matrix, ideally the covariance of the posterior pdf, and

*λ*

_{i}is a scalar. This transforms the problem into solving a highly nonlinear scalar equation for

*λ*

_{i}, which is a much simpler problem than finding directly. This map can be shown to be a bijection when has only closed contours in the high-probability regions; otherwise one would first have to choose a closed contour area and then perform the map. In general, when the optimal proposal (over several time steps if needed) is multimodal, the transformation from the state variable to a Gaussian is not monotonic, and the Implicit Particle Filter needs to be adapted, for example, by using a separate Gaussian for each mode. The algorithm is given in Algorithm 5.

Of further interest is that is the same as the solution to a 4D-Var problem well known in meteorology. But it is a special 4D-Var as the initial position of each particle is fixed and it has to be a weak-constraint 4D-Var. The latter condition is needed as a strong-constraint 4D-Var would have no possibility to move a particle in state space as its initial condition is fixed.

However, this filter will also suffer from weight collapse in high-dimensional applications as it is still a sampling scheme for the optimal proposal density. The following sections will discuss ways to improve on the optimal proposal.

### 2.5 Equal weights by resampling at time *n* − 1

This says that, assuming the pdf at the previous time can be approximated by a set of *N* particles, the analysis distribution is a mixture of the optimal proposal pdfs with mixture weights *α*_{i}.

If we can compute the optimal proposal density and the weights *α*_{i} in closed form, we can also draw samples *directly from this mixture density*. For this, we first draw an index *I* from the discrete distribution with weights *α*_{i}, Prob(*I* = *j*) = *α*_{j}, followed by a draw from the corresponding pdf . Doing this *N* times will lead to *N* different particles with equal weights because each of them is an independent draw directly from the posterior. If the index *I* is equal to a value *j* more than once, the particle is propagated from time *n* − 1 to time *n* with independent random forcing for each of these draws. This simple scheme provides better samples than the optimal proposal density because all particles are different at time *n* by construction.

However, this does not solve the problem of weight collapse because drawing the index *I* is nothing other than resampling the particles at time *n* − 1 with weights proportional to . If , the variance of these weights is exactly equal to the lower bound that we found in Section 2.3. The main difference is that the collapse now happens at time *n* − 1. The only advantage is that all particles will be different at time *n*.

*α*

_{i}in closed form, we can still use the importance sampling idea to draw from the mixture

*p*(

**x**

^{n}|

**y**

^{n}) by drawing pairs (

*I*,

**X**

^{n}) consisting of an index

*I*and a state

**X**

^{n}at time

*n*. We choose a proposal distribution for the index and proposal distributions for the state. Then we draw the index

*I*

_{i}with and conditionally on

*I*

_{i}=

*j*we draw from . Finally, we compute weights by

*p*(

**x**

^{n}|

**y**

^{n}) whereas the indices

*I*

_{i}can be discarded after the weights have been computed. We could produce an evenly weighted approximation by a further resampling step, or take the weights into account during the next iteration.

*N*, the algorithm contains a hidden weighting and resampling step of particles at time

*n*− 1. It thus remains susceptible to weight collapse in high dimensions.

*I*,

**X**

^{n}) is due to Pitt and Shephard (1999) who called it

*Auxiliary Particle Filter*(the index

*I*is an auxiliary variable that is discarded at the end). They discuss, in addition, approximations of the optimal proposal density and the optimal weights

*α*

_{i}. One of their suggestions is to use for the index

*I*the proposal with weights

*n*are propagated by a simplified model, for example, by omitting stochastic terms or with simplified subgrid-scale parametrizations or thermodynamics. If

*I*

_{i}=

*j*and the state at time

*n*is proposed from , the weights become

### 2.6 Equivalent-Weights Particle Filter

The EWPF (van Leeuwen, 2010; Ades and van Leeuwen, 2013) uses the idea to obtain a more evenly weighted set of particles by not sampling from the exact posterior, but allowing for a small error. It starts with determining the weight of each particle at the mode of for each particle *i*, . Note that these weights are equal to the weights obtained in the optimal proposal density. In the optimal proposal density case, the weights do not depend on the position **x**^{n} of the particle, but note that the proposal used here will be different.

The particles are not moved to these modes, but the weights are used to define a target weight. This target weight *w*_{target} is chosen such that a certain fraction *ρ* of particles can reach that weight. To this end we sort the weights in magnitude from high to low in an array and set . For instance, with 100 particles and a fraction of *ρ* = 0.8, we would find .

**x**

^{n}in

*i*that can reach this weight. There are many solutions of this equation, but we choose the one which is on the line through and and is closest to . Denote this position as . Note that this is a purely deterministic move, so a stochastic part still has to be added. The final position of these particles is then determined by adding a very small random perturbation

**from a chosen density, so**

*ξ*This stochastic move ensures that the proposal has full support and is not a delta function centred at . The density of *ξ*_{i} should on the one hand have most of its mass concentrated around 0 in order not to change the weights of the particles too much, and on the other hand it should be relatively constant since we divide by the value of the proposal density. Both requirements cannot be fulfilled exactly, but we can take some error in the sampling into account and choose a narrow uniform distribution. The scheme is depicted in Algorithm 6 for the special case of Gaussian model errors and a linear observation operator. If these conditions do not hold, one will typically need iterations to solve for *a*_{i} and *b*_{i}.

It is common knowledge (e.g. Doucet *et al.* 2001), that the proposal should be wider or at least as wide as the target, while the width of the stochastic part of the proposal is chosen very small here. The reason that we can do this is that the position of the centres of these proposal densities are typically further away from the observations than, for example, in the optimal proposal because the target weight forces particles away from their optimal positions, so away from the observations. This means that the deterministic moves of the particles ensure a large spread in the full proposal.

A formal way to avoid such an error has been described by Ades and van Leeuwen (2015b). They choose the proposal to be a mixture of a uniform density and a Gaussian which is also used in Algorithm 6. Both have small variance, and the mixture coefficient of the uniform density is chosen to be much larger than that of the Gaussian. This means that drawing from the Gaussian and also drawing from its tails becomes highly unlikely. In practice, since we always work with small ensemble sizes, the chance of filter degeneracy by drawing from the Gaussian, and then drawing from the tail of the Gaussian, is indeed highly unlikely.

Finally, the full weights for the new particles are calculated and the whole ensemble is resampled, including those particles that were unable to reach the target weight. Because of the target-weight construction, the weights of the particles are very similar, and filter degeneracy is avoided. This filter has been used in a reduced-gravity ocean model by Ades and van Leeuwen (2015b), and in the same system studied for the gravity-wave production by the scheme in Ades and van Leeuwen (2015a). It has also been applied in a climate model by Browne and van Leeuwen (2015).

**X**and particle index

*I*, but rather the state as function of the weight

*W*and index

*I*, so

**X**(

*W*,

*I*). Specifically,

*W*|

*I*has values in two ranges. For the particles with

*I*=

*i*that can reach the target weight, we find

*w*|

*I*=

*w*

_{target}+

*ϵ*

_{i}in which

*ϵ*

_{i}is a small perturbation from the target weight due to the small stochastic move discussed above. For those particles that cannot reach the target weight, their weights are very close to zero. So we find

*H*is linear and the errors in the observations and the model equations are Gaussian, we find , but if any of these three conditions does not hold this is not necessarily so. However, we do know that by construction . Since the sum of the weights should be equal to 1, we find that

*w*

_{target}≈1/(

*Nρ*), and hence

*E*

_{I}[

*W*] = 1/

*N*, as expected. Furthermore

This expression shows that the variance in the weights ranges between 0 for *ρ* = 1, so when all particles are kept, to (*N* − 1)/*N*^{2}≈1/*N* for *ρ* = 1/*N*, so when one particle is kept. We can compare this with the optimal proposal when the number of independent observations is large. In that case one particle will have a weight very close to one, and the rest will have weights very close to zero. The variance in the weights is then (*N* − 1)/*N*^{2}≈1/*N*, indeed equal to the *ρ* = 1/*N* case in the EWPF scheme, as expected. However, the EWPF can reduce that variance, even to zero, depending on the choice of the tuning parameter *ρ*.

When this tuning parameter is chosen close to one, the target weight will be low, and hence particles will be moved further away from the mode of the optimal proposal density. In practice this means that the particles are pushed further away from each other, leading to a wider posterior pdf. A small value for the fraction will have the opposite effect. Since we do not know *apriori* what the width of the posterior should be, this is a clear drawback of this method. We will come back to this later.

### 2.7 Implicit Equal-Weights Particle Filter

In the IEWPF we set the target weight equal to the minimum of the optimal proposal weights for all particles. Then, the position of each particle is set to the mode of the optimal proposal density plus a scaled random perturbation. The scale factor is chosen such that the weight of each particle is equal to the target weight. Note that in the standard setting, no resampling is needed, but Zhu *et al.* (2016) gives other possibilities.

*q*(

**) instead of the original**

*ξ**q*(

**x**

^{n}|

**x**

^{n − 1},

**y**

^{n}), following the same procedure as in the Implicit Particle Filter. We define a relation

**P**is a measure of the width of that pdf, is a standard Gaussian-distributed random vector, and

*α*

_{i}is a scalar.

*α*

_{i}such that all particles get the same weight

*w*

_{target}, so the scalar

*α*

_{i}is determined for each particle from:

This target weight is equal to the lowest weight over all particles in an optimal proposal. This ensures that the filter is not degenerate in systems with arbitrary dimensions and an arbitrary number of independent observations. The resulting equation for each *α*_{i} is nonlinear and complex because it will contain the Jacobian of the transformation from *ξ*^{n} to **x**^{n}, similar to the Implicit Particle Filter. The Jacobian will contain the derivative of *α*_{i} to *ξ*_{i}, which is the main source of the complexity in this scheme. Algorithm 7 depicts the scheme for the case of a linear observation operator. A nonlinear observation operator will lead to more complicated equations for the *α*s.

The scheme is similar to the optimal proposal density using the Implicit Particle Filter by first determining the mode of the proposal and then adding a random vector. The difference is that in the IEWPF the size of the vector is determined such that the each particle reaches the target weight. It turns out that this construction excludes part of state space for all but one particle. For each particle the excluded part is different, so the ensemble samples the whole space, but the individual particles do not. Details of the method can be found in Zhu *et al.*
(2016).

Analysing the scheme in more detail, the proposal density used in this scheme is of one dimension lower than that of the state itself. The *direction* of the random vector in state space is determined by the proposal density, but the *size* of the random vector is then found deterministically, dependent on that direction. So the proposal density misses one degree of freedom for all but one particle – the particle with the lowest weight that has *α*_{i} = 1. Although missing one degree of freedom in a very high-dimensional system might seem acceptable, it does lead to a bias. Figure 5 shows how the implicit equal-weights particle filter works.

### 2.8 Discussion

We first note that the optimal proposal is only optimal in a very limited sense, as has been known a long time with the invention of the auxiliary particle filter. We have seen that it is not difficult to generate particle filters that even have zero variance in the weights. In the optimal proposal setting, one forces Prob(*I* = *i*) = 1/*N*, while the simple choice leads to an equal-weight particle filter. Furthermore, schemes have been introduced that consider the state as a function of the state at the previous time and the weight the state at the current time should obtain, so instead of working with *W*(**X**,*I*) we choose **X**(*W*,*I*), which opens up a whole new range of efficient particle filters in high-dimensional systems.

The EWPF and the IEWPF are by construction particle filters that are not degenerate in high-dimensional systems and do not rely on localization. However, it is easy to see that both filters are biased, or inconsistent. In the limit of an infinite number of particles, the target-weight constructions will prevent the schemes to converge to the full posterior pdf. The schemes are only of interest when the ensemble size is limited. As long as the bias from the target-weight construction is smaller than the Monte-Carlo error, this bias is of no direct consequence. It will be clear that the number of possible methods that have this property is huge, and much more research is needed to explore the best possibilities.

## 3 TRANSPORTATION PARTICLE FILTERS

In resampling particle filters, the prior particles are first weighted to represent the posterior and then transformed to unweighted particles simply by duplicating high-weight particles and abandoning low-weight particles. In transformation particle filters, one tries to find a transformation that moves particles from the prior to particles of the posterior in a deterministic manner. A related approach, which uses random transformation steps, is based on tempering the likelihood, which we also discuss in this section.

### 3.1 One-step transportation

**X**

^{a}, and in which

**D**is a transformation matrix. The only conditions on

**D**are that

*d*

_{ij}≥ 0, and . These three conditions leave a lot of freedom for all

*N*

^{2}elements of

*D*, and a useful way to determine them is to ensure minimal overall movement in state space of the particles from prior to posterior. This leads to an optimal transportation problem and is typically solved by minimizing a cost function that penalizes movement of particles.

We can see immediately that this method will not work when the weights are degenerate as the solution will be degenerate and all particles have no other choice than move to the prior particle with weight (close to) one. However, the strength of this filter is that it allows for localization in a very natural way by making the weights, and hence the matrix **D**, space dependent. The method will be discussed in more detail in Section 4 on localization. Here we provide the basic algorithm in Algorithm 8.

The ETPF provides a direct map from prior to posterior particles without explicitly constructing a transformation map. An alternative approach has been suggested in Moselhy and Marzouk (2012), where an approximate transportation map is constructed such that belongs to certain family of maps and is chosen such that the Kullbeck–Leibler divergence between the pdf generated by and the posterior pdf is minimized. Spantini *et al.* (2018) gives an efficient implementation in the context of filtering and smoothing for low-dimensional systems.

### 3.2 Tempering of the likelihood

*et al.*2006, Beskos

*et al.*2014) one factorizes the likelihood as follows:

*γ*

_{i}< 1 and ensuring that the sum of the

*γ*s is equal to 1. Then the weighting of the particle filter is first done with the first factor, so

The reason for this is that the likelihood is much less peaked, and hence the degeneracy can be avoided when *γ*_{1} is small enough. Figure 6 illustrates the basic idea.

The particles are resampled, and now the weighting is performed using the second factor, followed by resampling, etc. In this way the scheme slowly moves all particles towards the high-probability regions of the posterior. Of course, after resampling several particles will be identical, so one needs to jitter the particles, so perturbing them slightly, to regain diversity.

This jittering should be a move of the particles that preserves the posterior pdf. It could be implemented as a MCMC method with the posterior as the target density (e.g. exploring resample-move strategies; Doucet *et al.* 2001). However, a problem is that in sequential filtering we only have a representation of the posterior density in terms of the present particles, and this representation is very poor due to the small number of particles. Possible avenues are to fit a pdf of a certain shape to the present particles, e.g. a Gaussian mixture model, and use that as target density.

A problem in the geosciences is that this posterior fit needs to preserve the delicate balances between the model variables that are present in each particle, and an extra complication is that these balances can even be nonlinear. Also the transition kernel of the Markov chain should somehow preserve these balances. An example of its use in the geosciences is the Multiple Data Assimilations (MDA) method of Emerick and Reynolds (2013), in which the intermediate pdfs are assumed to be Gaussian. Evensen (2018) also gives a comparison of this method to other iterative implementations of the Ensemble Kalman Filter/Smoother.

*et al.*(2014) does not have this problem. In that case the prior at observation time can be written as (Equation 9)

*n*− 1 for ease of presentation. In this case the MCMC method that has the posterior as invariant density is easy to find as the transition densities defined above, followed by an accept/reject step.

When several model time steps are performed between observation times, one can also perform tempering in the time domain, as explored in van Leeuwen (2003) and van Leeuwen (2009) in the Guided Particle Filter. The idea is to assimilate the observations ahead of time, with using as likelihood *p*(**y**^{∗}|**x**^{m})^{γ}), in which **y**^{∗} is taken equal to the value **y**^{n}, and *γ* < <1. Here *m* < *n* is the present time of the model. This is then followed by a resampling step. The procedure can be followed over several time instances during the forward integration of the particles, increasing *γ*_{i} each time. At the observation time, *γ* = 1 is used. This will force the particles towards the observations and does not need extra jittering because each particle will see a different model noise realization ** β** in the model integration after the resampling steps.

Of course one has to compensate for the fact that the transition density has been changed, and the way to do that is to realise that we have used importance sampling. Instead of sampling from , we sample from a pdf , in which **y**^{∗} is equal to **y**^{n} taken at time *m*, and with larger observation uncertainty related to *γ*. This means that we have to compensate for the weights created by this sampling, so we need to introduce particle weights
at each model time step we use this scheme.

The scheme generates extra weights during the model integration, but corrects for them at each new time when we resample, ensuring much better positioned particles at the actual observation time *n*. It has been used in a reduced-gravity primitive equation model in van Leeuwen (2003), but not in high-dimensional settings.

### 3.3 Particle flow filters

*p*(

**x**), to equal-weight particles representing the posterior,

*p*(

**x**|

**y**). In other words, one seeks a differential equation

*s*≥ 0 with the flow map defining the desired transformation. If the initial conditions of the differential Equation 64 are chosen from a pdf

*p*

_{0}(

**x**), then the solutions follow a distribution characterized by the Liouville equation

*p*

_{0}(

**x**) =

*p*(

**x**) and final condition .

*s*

_{final}= 1. We now take the limit of more and more tempering steps by choosing

*γ*

_{i}= 1/

*n*= Δ

*s*with , so , or (Daum and Huang 2011, 2013, Reich 2011). This leads to

*f*

_{s}are available for certain pdfs such as Gaussians and Gaussian mixtures (Reich, 2012). These particle flow filters can be viewed as a continuous limit of the tempering methods described in the previous subsection, avoiding the need for resampling and jittering. Note that the elliptic partial differential equation 67 does not determine

*f*

_{s}uniquely. Optimal choices in the sense of minimizing the

*L*

_{2}(

*p*

_{s})–norm of

*f*

_{s}lead to the theory of optimal transportation (Villani 2008; Reich and Cotter 2015).

Figure 7 shows the basic idea behind particle flow filters.

*β*

^{j}a random forcing term drawn from

*N*(0,

**I**). One can show that in the limit of

*j*→

*∞*these samples will be samples from the posterior. The corresponding Fokker–Planck equation for this stochastic PDE reads (e.g. Reich and Cotter 2015)

**f**

_{s}(

**x**) is given by:

(Lu *et al.* 2019). We come back to this method below.

*p*

_{s}(

**x**

_{i}), which is typically unknown as we only know the particle representation of

*p*

_{s}(

**x**). One way to solve this issue is to explore kernel embedding. A numerical implementation of the two formulations 69 and 71 can be based on a reproducing-kernel Hilbert space (RKHS) with reproducing kernel

*K*(.,.), typically taken as a Gaussian. In the following, we will therefore assume that the kernel is symmetric

*K*(

**x**,

**z**) =

*K*(

**z**,

**x**). The inner product in satisfies the reproducing property

*p*

_{s}by

**f**

_{s}by

*N*particles

**x**

_{j}move under the differential equations

*p*

_{s}||

*p*(·|

**y**)} between

*p*

_{s}and the posterior pdf (Reich and Cotter, 2015), it is natural to introduce the following particle approximation of the Kullback–Leibler divergence:

**x**

_{l}} minimizing . Details on the numerical implementation of this approach can be found in Pathiraja and Reich (2019).

*p*

_{s}, and hence the prior and the posterior, to be of the form Equation 73. Alternatively, one can embed the vector field of the flow in an appropriate reproducing kernel Hilbert space and not the density itself. With that we can derive a practical implementation of the Stein variational formulation 71 as follows. First, note that the change in KL due to the flow field

**f**

_{s}can easily be found as:

**x**in the RKHS. Note that here is different from the Hilbert space used earlier. Maximizing this change in KL as function of the flow field

**f**

_{s}is not trivial in general. However, with the reproducing kernel property of

**f**

_{s}, we have

**f**

_{s}. One now chooses

**f**

_{s}along this direction, which gives the steepest descent, as

*N*particles

**x**

_{j}.

The intuition behind Stein variational descent is that the first term in Equation 82 pulls the particles towards the mode of the posterior, while the second term acts as a repulsive force that allows for particle diversity. Liu and Wang (2016) derived this formulation for a steady-state problem, and Pulido and van Leeuwen (2018) have extended the method to sequential particle filters. The scheme is given in Algorithm 9.

The free parameter of these methods is the reproducing kernel *K*(.,.), which needs to be chosen such that the particles sample the posterior and that physical (and potentially other) balances are retained. One also needs to select a proper time-stepping scheme, typically chosen as a forward Euler scheme with variable time step *ϵ*, which can now be viewed as the step length in a gradient descent optimization algorithm.

### 3.4 Discussion

Viewing particle filters as a transportation problem from equal-weight particles of the prior to equal-weight particles of the posterior has led to an interesting set of filters. None of them have been implemented yet in high-dimensional settings, but some of them are ready to do so. The strong involvement of the machine learning community in problems of this kind also suggests rapid progress here. Finally we mention that the equal-weight particle filters from Section 2 can be viewed as one-step transportation filters that explore the proposal density freedom, and in fact transform equal-weight prior particles at time *n* − 1 to equal-weight posterior particles at observation time *n*.

## 4 LOCALIZATION IN PARTICLE FILTERS

Localization is a standard technique in Ensemble Kalman filtering to increase the rank of the ensemble perturbation matrix, allowing for more observations to be assimilated, and to suppress spurious correlations where real correlations are very small, but ensemble correlations are larger because of sampling noise. Localization limits the influence of each observation to a localization area that is much smaller than the full model domain. This idea can easily be incorporated when calculating the particle weights locally, as pioneered by Bengtsson *et al.* (2003) and van Leeuwen (2003), and used in a high-dimensional parameter estimation problem in Vossepoel and van Leeuwen (2006). The difficulty, as we shall see, lies in the resampling step: how does one generate 'smooth' global particles from locally resampled particles? Smooth is not well-defined here, but it is related to the particles having realistic physical relations (balances) between the model variables. For example, if geostrophic balance is dominant, the resampling procedure should not generate particles that are completely out of geostrophic balance as that would lead to spurious adjustment processes via spurious gravity waves. Up to now localization is mainly used in connection with the standard particle filter, while more advanced proposals, apart from the optimal proposal, have not been explored. Farchi and Bocquet (2018) provide an excellent review of localization in particle filtering, treating a subset of the methods presented here, but including interesting extensions of the methods they describe.

*k*as

**x**

^{k}. Hence, in contrast to other sections, a superscript here denotes not the time index, but the grid point. Note that in geoscience applications each grid point typically has several model variables, so

**x**

^{k}is a vector in general. Physically it makes sense to assume that the posterior of the state at this grid point depends only on a subset of the observations. Let us denote that subset as

**y**

^{[k]}. We can then write

**x**

^{(k)}:

**x**

^{(k)∖k}to denote all those grid points in that part of the state vector excluding grid point

*k*. Then we can rewrite the above as an integral over the joint pdf:

The weights thus depend only on the local observations **y**^{[k]} and the local prior particles , so that the variance of the weights will be much smaller. Figure 8 illustrates how this local weighting could look for two different particles.

The approximation Equation 83 is not unrealistic: a temperature observation in New York is not expected to change our pdf of the temperature in London at the moment of the observation. Of course, there will be an effect at later times, but that is not relevant here. The same assumption underlies the use of localization in EnKFs, and in variational methods when the background-error covariance is constructed.

However, mathematically it does not follow from the assumption that under the prior the values of the state at grid points separated by more than a certain distance are independent. There can be an indirect flow of information from observations far apart over observations between neigbouring grid points. In EnKFs, the Kalman gain is generally a dense matrix **HP**^{f}**H** + **R**, in which P^{f} is the forecast error covariance, is sparse, because its inverse (**HP**^{f}**H** + **R**)^{−1} can be dense. On the other hand, if **HP**^{f}**H** + **R** is diagonally dominant, then often its inverse is too.

Repeating the localization procedure for all grid points, we obtain all marginals of the posterior pdf. However, because the weights change from one grid point to the next, it is non-trivial to obtain a consistent posterior for pairs of state values (**x**^{k},**x**^{ℓ}) (and similarly for triplets, etc.). This can easily be seen using Figure 5: we would like to retain the left particle in the central upper half of the domain, and abandon elsewhere. That would mean that wherever it is abandoned, we need to replace it with another particle, perhaps partly with the particle in the right part of the figure. At the boundary between particles, a discontinuity will exist, which will lead to unphysical behaviour when this new particle is propagated forward in time.

This means that, to obtain global particles that can be forwarded with the model equations, one would need to somehow smoothly glue different particles together. This is a major problem and has hampered localization in particle filtering since the early 2000s. However, recently clever smoothing schemes have been constructed that seem to work well in high-dimensional geophysical applications. We will report on those below.

Another issue is that the localization area cannot be too large to avoid filter collapse. As a rule of thumb, when there are more than (say) ten independent observations inside a local area, the particle filter will still tend to be degenerate for the number of (10–1,000) particles one can typically afford. This means that, when the observation density is high, the localization areas have to become unphysically small, or observations have to be discarded. This issue might be solved using tempering techniques as discussed earlier, but is often avoided by artificially enforcing a minimal weight of the particles, or by changing the observations, for instance by projecting them on a lower-dimensional space favoured by the prior.

Setting a minimal weight or projecting observations to a lower-dimensional space favoured by the prior has a consequence that not all information will be extracted from the observations, as observations that are very different from the existing particles will be largely ignored. This is not directly equivalent to the standard quality-control measures used by operational weather forecasting centres, in which observations that are a few standard deviations away from the forecast are ignored. The issue here is that a distance of less then one standard deviation for a few observations can already lead to weight collapse, and artificially setting minimum values for the weights avoids that.

### 4.1 Localization based on resampling

Several localization schemes have been proposed and discussed in the review by van Leeuwen (2009) and those will not be repeated here. The most obvious thing to do is to weight and resample locally, and somehow glue the resampled particles together via averaging at the edges between resampled local particles (van Leeuwen, 2003). In the following, several schemes in this category are discussed.

#### 4.1.1 Localized Particle Filter

*j*, the observations close to that grid point are found and the weight of each particle

*i*is calculated based on the likelihood of only those observations:

**y**

_{j}denotes the set of observations within the localization area. Note the change of notation from the previous section, related to the explicit use of the particle index in all the following. This is followed by resampling via Stochastic Universal Resampling to provide ensemble members with

*i*= 1,…,

*N*for each grid point

*j*.

*G*(..) is a distance weighting function, e.g. a Gaussian or an approximation of that,

*d*

_{j,k}is the distance between grid points

*j*and

*k*, for each observation

**y**

_{k}at grid point

*k*in the neighbourhood of grid point

*j*. The parameter

*h*is a distance radius, another tuning parameter. This formulation can be used for each grid point

*j*, but also for each block of grid points

*j*. They note that

*G*can also be a Gaussian of a Gaussian, such that it works directly on .

*j*for each particle

*i*by averaging over the

*Np*neighbouring points within the localization area around grid point

*j*:

*j*

_{k}for

*k*= 1,…,

*N*

_{p}denotes the grid point index for those points in the localization area around grid point

*j*. The resampling via Stochastic Universal Resampling is done such that the weights are sorted before the resampling, so that high-weight particles are joined up to reduce spurious gradients.

*α*a tuning parameter. Note that by choosing

*α*= 1/2 and , we recover the scheme by Penny and Miyoshi (2016).

While these schemes have been shown to solve the degeneracy problem in intermediate dimensional systems with fixed balances, like the barotropic vorticity model, it is unclear how they will perform in complex systems such as the atmosphere in which fronts can easily be smoothed out, and nonlinear balances broken, e.g. discussion in van Leeuwen (2009).

#### 4.1.2 Local Particle Filter

A different scheme that involves a very careful process of ensuring smooth posterior particles and retaining nonlinear relations has recently been proposed by Poterjoy (2016). An important difference with the state-space localization methods discussed above is that observations are assimilated sequentially to avoid the discontinuity issues of the state-space localization. This makes the algorithm non-parallel, so slower than the state-space localization methods, but Farchi and Bocquet (2018) demonstrate that a lower root-mean square error (RMSE) can be achieved.

*y*

_{1}of the observation vector, as

These weights are then normalized by their sum . Then the ensemble is resampled according to these normalized weights to form particles .

The scalar *α* is an important parameter is this scheme, with *α* = 1 leading to standard weighting, and *α* = 0 leading to all weights being equal to 1 (before normalization). Its importance lies in the fact that the weights are always larger than 1 − *α*, so even a value close to 1, say *α* = 0.99, leads to a minimum weight of 0.01 that might seem small, but it means that particles that are more then 1.7 observational standard deviations away from the observations have their weights cut off to a value close to 1 − *α*. This limits the influence the observation can have on the ensemble. Furthermore, the influence of *α* does depend on the size of the observational error, which is perhaps not what one would like. It is included to avoid loosing any particle.

*j*. For each member

*i*, a weight is calculated as

*ρ*(..) is the localization function with localization radius

*r*. These weights are normalized with their sum over the particles, so a normalized weight

*ω*

_{i}for this grid point is obtained. Note, again, the role played by

*α*. Then the posterior mean for this observation at this grid point is calculated as

**x**

_{i,j}is the state at grid point

*j*of particle

*i*. Next a number of scalars are calculated that ensure smooth posterior fields (Poterjoy, 2016) as detailed in Algorithm 10.

*k*

_{i}is the index of the

*i*'s sampled particle. This procedure is followed for each grid point so that at the end an updated set of particles is obtained that have incorporated the first observation. As a next step the whole process is repeated for the next observation, with the small change that is multiplied by from the previous observation, until all observations have been assimilated. In this way, the full weight of all observations is accumulated in the algorithm. Now the importance of

*α*comes to full light: without

*α*the ensemble would collapse because the s would be degenerate when observations are accumulated.

The final estimate shows that each particle at grid point *j* is the posterior mean at that point plus a contribution from the deviation of the posterior resampled particle from that mean and a contribution from the deviation of the prior particle from that mean. So each particle is a mixture of posterior and prior particles, and departures from the prior are suppressed. When *α* = 1, so for a full particle filter, we find for grid points at the observation location, for which *ρ*(1,*j*,*r*) = 1, that *c*_{j} = 0, so *r*_{2j} = 0, and *r*_{1j}≈1, so indeed the scheme gives back the full particle filter. The basic elements of the scheme are depicted in Algorithm 10.

At grid points between observations, it can be shown that the particles have the correct first- and second-order moments, but higher-order moments are not conserved. Farchi and Bocquet (2018) generate a scheme that is quite similar, but they ensure correct first and second moments by exploring the localized covariances between observed and unobserved grid points directly in a regression step.) To remedy this, a probabilistic correction is applied at each grid point, as follows. The prior particles are dressed by Gaussians with width 1 and weighted by the likelihood weights to generate the correct posterior pdf. The posterior particles are dressed in the same way, each with weight 1/*N*. Then, the cumulative density functions (cdfs) for the two densities are calculated using a trapezoidal rule integration. A cubic spline is used to find the prior cdf values at each prior particle *i*, denoted by *cdf*_{i}. Then a cubic spline is fitted to the other cdf, and the posterior particle *i* is found as the inverse of its cdf at value *cdf*_{i}. Poterjoy (2016) gives details. The result of this procedure is that higher-order moments are brought back into the ensemble between observation points.

This scheme, although rather complicated, is one of the two local particle filter schemes that has been applied to a high-dimensional geophysical system based on primitive equations in Poterjoy and Anderson (2016). The other is the Localized Adaptive Particle Filter (LAPF) discussed below. (van Leeuwen 2003 applied a local particle filter to a high-dimensional quasi-geostrophic system, but that system is quite robust to sharp gradients as it does not allow for gravity waves.)

#### 4.1.3 Localized Adaptive Particle Filter

The LAPF is based on the localized version of the ensemble transform Equation 60 following the LETKF described in Hunt *et al.* (2007) and Reich (2013), with localization in observation space, and resampling in the spirit of Gaussian Mixture filters (Stordal *et al.* 2011). Localization is carried out around each grid point, and a transform matrix **D** is calculated for each localization box. We note that, as for the LETKF, the weights given by Equation 7 depend continuously on the box location and the observations.

In a first step, the observations are projected into the space spanned by the prior particles. As mentioned above, this will reduce the information extracted from the observations, but is perhaps less *adhoc* than setting a lower bound on the weights, as for instance used in the LPF. The LAPF carries out local resampling using universal resampling (e.g. van Leeuwen 2009).

In a second step, a careful adaptive sampling is carried out in ensemble space around each of the *N* temporary particles. This scheme runs as follows:

(a) Resampling is carried out based on a (radial) basis function centred at each particle. A simple case would be a Gaussian mixture, where the covariance of each of the centred Gaussians is taken as a scaled version *c***P** of the local dynamical ensemble covariance **P**.

(b) The scaling factor *c* is individually calculated for each box based on the local observation- minus background-error statistics. For details we refer to Potthast *et al.* (2019). By this, the LAPF guarantees to obtain a spread of the analysis ensemble which is consistent with the local dynamical observation minus background (o–b) statistics and the observation-error covariance **R**. Further standard tools from the LETKF literature to control ensemble spread can be employed if needed.

(c) To obtain sufficient smoothness of the fields in physical space, the LAPF uses *N* global random draws to generate the resampling vectors around each particle in the space of ensemble coefficients. In combination with the fact that the LAPF draws in each box around each particle only – in a globally uniform way modulated by the ensemble covariance **P** and the factor *c* only – consistency and balance of the fields is achieved with sufficient precision. The scheme is depicted in Algorithm 11.

The LAPF is the first particle filter that has been implemented and tested in an operational numerical weather prediction context, and we provide a short description of the procedure. The method has been implemented in the DACE (Data Assimilation Coding Environment) system of Deutscher Wetterdienst (DWD; Potthast *et al.* 2019). The DACE environment includes a Local Ensemble Transform Kalman Filter (LETKF) based on Hunt *et al.* (2007), both for the global ICON model system and the convection-permitting COSMO model system of DWD (Schraff *et al.* 2016), both of which are run operationally at DWD1 and build a basis, framework and reference for the LAPF particle filter implementation.

The ensemble data assimilation system is equipped with a variety of tools to control the spread of the ensemble, such as *multiplicative inflation* and *additive inflation*, *relaxation to prior spread* (RTPS), *relaxation to prior perturbations* (RTPP) and stochastic schemes to add spread to soil moisture and sea surface temperature (SST) when needed (details in Schraff *et al.*
2016).

Tests with the LAPF for the global ICON model with 40 particles of 40 km global resolution have been successfully and stably run over a duration of one month. Extensive tests on how many particles form the basis for resampling in each localization box have been carried out; the numbers vary strongly over the globe and all heights of the atmosphere, ranging from 1 to *N*, with relatively flat distribution. Diagnostics and tuning of the system is under development and is discussed in Potthast *et al.* (2019). Results show that the quality of the LAPF does not yet reach the scores of the operational global LETKF-EnVAR system, but the system runs stably and forecast scores are about 10–15% behind the current operational system.

### 4.2 Local Ensemble Transform Particle Filter

*N*×

*N*stochastic transition matrix

**D**with the following properties: all entries

*d*

_{ij}of

**D**are non-negative and

*j*

^{∗}∈ {1,…,

*N*} according to the probability vector for each

*j*= 1,…,

*N*. The

*j*th forecast particle is then replaced by and the new particles ,

*j*= 1,…,

*N*, provide an equally weighted set of particles from the posterior distribution. Note that multinomial resampling corresponds to the simple choice

*N*→

*∞*(McCann, 1995; Reich, 2013). This fact is utilized by the ETPF and one defines

*et al.*2017). Section 5.3 gives more details.

*k*, we extract the values of the forecast particle at that grid point and denote them by . Using the observations local to this grid point, we calculate localized importance weights for . Then Equation 98 gives rise to a localized transformation matrix

*k*with the set defined by

Note that the transport cost (distance)
can be replaced by any other localized cost function. Chen and Reich (2015) give more details. The transport problem (Equation 100) at each grid point can be computationally expensive. Less expensive approximations, such as the Sinkhorn approximation, and their implementation into the localized ETPF (LETPF) are discussed in de Wiljes *et al.* (2017). Farchi and Bocquet (2018) have extended this algorithm to block weighting, similar to their extension of the Local Particle Filter.

The latter authors also defined a local transform particle filter in state space. This involves a transformation, at each grid point, from prior to posterior particles by a transformation, which essentially becomes an anamorphosis step. The prior and posterior probability densities need to be known as continuous densities, and Farchi and Bocquet (2018) use kernel density estimation with the particles as basis. The interesting suggestion is that, since the transformation is deterministic and expected to be smooth over the space coordinates, no specific smoothing is needed after the transformation. We refer to their paper for details on this methodology.

### 4.3 Space–time particle filters

The idea to run a particle filter over the spatial domain was introduced by van Leeuwen (2009), and the first algorithm, the Location Bootstrap Filter, was published by Briggs *et al.* (2013). The Space–Time Particle Filter by Beskos *et al.* (2017) improves on this algorithm by removing the jitter step, as explained below. In the following we assume observations at every grid point, but the algorithms can easily be adapted to other observation networks.

The Location Particle Filter of Briggs *et al.* (2013) runs as follows. The grid points are ordered 1,…,*L*, such that points *l* and *l* + 1 are neighbouring grid points for each *l* ∈ 1,…,*L*. In each grid point *l* we have a sample **x**_{i,l} for *i* ∈ 1,…,*N*, and *l* denotes the grid point number. We start the spatial particle filter at location *l* = 1 by calculating the weight *p*(**y**_{1}|**x**_{i,1}) (where the time index is suppressed) for each prior particle *i*, and perform resampling using these weights over the whole spatial domain. This means that the resampled particles are now samples of *p*(**x**^{1:L}|**y**^{1}). A small amount of jitter is added to avoid identical particles. The choice of this jitter density is again not clear for geophysical applications; more research is needed on this issue.

Then, the algorithm moves to the next grid point, calculates the weights *p*(**y**_{2}|**x**_{i,2}), and resamples the full state particles using this weight, generating samples from *p*(**x**_{1:L}|**y**_{1},**y**_{2}). Again some jitter is needed to avoid ensemble collapse, and the algorithm moves to the next grid point, until all grid points are treated this way. Algorithm 12 describes the computational steps.

Note that the algorithm does not suffer from artificial sharp gradients because all resampled particles are global particles, but the algorithm will be very sensitive to the choice of the jitter density used after updating the ensemble in each grid point. Furthermore, when prior and posterior are very different, the algorithm will perform poorly, and Briggs *et al.* (2013) propose a smoother variant that employs copulas for numerical efficiency. We will not discuss that variant here.

Beskos *et al.* (2017) introduce the Space–Time Particle Filter. Instead of using a jitter density to avoid identical particles, they exploit the spatial transition density , in which *n* is the time index and *l* the spatial index. (In fact, Beskos *et al.* (2017) allow for a proposal density, but we will explain the algorithm using the prior spatial pdf as proposal.) So they exploit the pdf of the state at time *n* and grid point *l*, , conditioned on all previous grid points at the same time *n*, and conditioned on all grid points at time *n* − 1, denoted . They do this by introducing a set of *M* local particles *j*, for each global particle *i*, with *i* ∈ 1,…,*N*.

*i*they run the following algorithm over the whole grid:

- Starting from location
*l*= 1, the*M*local particle filters grow in dimension when moving over the grid towards the final position*L*. At the first grid point, the prior particles at that grid point are used, weighted with the local likelihood*p*(**y**_{1}|**x**_{1}) and resampled. Let us call these particles , in which*j*is the index of the local particle, and 1 is the index of the grid point. - The mean of the unnormalized weights is calculated.
- For the next grid point, each of these
*M*resampled particles is propagated to that grid point by drawing from . Since each of the*M*particles is drawn independently, they will differ and no jittering is needed. - Then the unnormalized weights
*p*(**y**_{2}|**x**_{2}) are calculated, and their mean , followed by a resampling step. - This process is repeated until
*l*=*L*, so until the whole space is covered. - Finally, the total weight is calculated, which is the unnormalized weight of the first global particle.

Algorithm 13 summarizes the scheme.

This procedure is followed *N* times for each global particle *i* independently. These global particles are then resampled according to the weight *G*_{i}. It is still possible that this filter is degenerate; Beskos *et al.* (2017) gives details and potential solutions.

The importance of this filter lies in the fact that there is a formal proof that it converges to the correct posterior for an increasing number of particles, unlike any of the other algorithms discussed. Furthermore, the authors show that degeneracy can be avoided if the number of particles grows as the square of the dimension of the system – much faster convergence than e.g. the optimal proposal density.

### 4.4 Discussion

Following in the footsteps of EnKFs, exploring localization in particle filters is a rapidly growing field. But localization in particle filters is not trivial as there is no automatic smoothing via smoothed sample covariances as in EnKFs. Most local particle filters impose explicit spatial smoothing, which can affect delicate balances in the system. Worth mentioning in this context is the localization introduced by Robert and Künsch (2017), who process observations sequentially in their hybrid Ensemble Kalman Filter–Particle Filter approach such that the second-order properties of the particle filter part remain correct. This method is discussed in the next chapter. The ETPF and the LAPF come closest to the EnKF by using a linear transportation matrix to transforms the prior ensemble into a posterior ensemble, and this matrix can be made smoothly varying with space. All of these smoothing operations rely on forming linear combinations of particles, so can potentially harm nonlinear balances in the model. Furthermore, it should be noted that the smoothing operation does not necessarily follow Bayes' theorem, so it might result in an extra approximation of the true posterior pdf. However, when the ensemble size is small, this approximation might be negligible compared to the Monte-Carlo noise from the finite ensemble size.

The Location Particle Filter and the Space–Time Particle Filter avoid this smoothing and rely on statistical connections between different grid points. The former does this via the prior pdf, defined by the prior particles. When the number of particles is low, this pdf is estimated rather poorly. Furthermore, the method needs jittering of the global particles to avoid ensemble collapse after every resampling step after each new observation is assimilated. This jittering pdf can be chosen arbitrarily, for instance a smooth Gaussian, but it does violate Bayes' theorem. As mentioned above, this error might be negligible when the ensemble size is small. The latter method explores the transition density over space and time, leading to consistent estimates of the spatial relations between grid points. Another potential issue of both methods is that, if the spatial field is two- (or higher) dimensional, as in geoscience applications, it is unclear how to order the grid points, and potentially large jumps might be created between neighbouring grid points that are treated as far apart by the algorithm. This needs further investigation.

## 5 HYBRIDS BETWEEN PARTICLE FILTERS AND ENSEMBLE KALMAN FILTERS

As mentioned in the previous section, there are two issues with localization. Firstly, particle filters that employ resampling need to ensure smooth updates in space so that the newly formed global particles do not encounter strong adjustments to physical balances due to artificial gradients from glueing particles together. Present-day localized particle schemes concentrate on this issue.

Secondly, the localization area cannot contain too many independent observations, and as a rule of thumb ten independent observations is often too many, to avoid weight collapse. As mentioned, this demand can be in strong contrast with physical considerations of appropriate length-scales. This is one of the main reasons to consider hybrids between particle filters and EnKFs within a localization scheme. In the following, several recent hybrid methods are presented.

### 5.1 Adaptive Gaussian Mixture Filter

A bridging formulation allows to smoothly transition between an ensemble Kalman filter and a particle filter analysis update. One such formulation is the adaptive Gaussian mixture filter (Stordal *et al.*
2011).

**P**

^{f}of the ensemble by multiplying with a so-called bandwidth parameter 0 <

*h*≤ 1 such that

Note that this is just a shorthand notation for updating each centre for the prior Gaussians. For computational efficiency, the analysis equations in the (adaptive) Gaussian mixture filter (Hoteit *et al.*
2008; Stordal *et al.* 2011) were proposed to use a factorized covariance matrix in the form , as can be obtained from a singular value decomposition of the ensemble perturbation matrix and used, for example, in the Singular Evolutive Interpolated Kalman (SEIK) filter (Pham, 2001) and Error-Subspace Transform Kalman Filter (ESTKF; Nerger *et al.* 2012). However, the particular form of the Kalman filter update equations is not crucial here.

*N*

^{−1}as

*α*is the bridging parameter. We obtain a transition between the EnKF and the particle filter by varying both

*α*and

*h*. For

*α*= 0 and

*h*= 1, we obtain the uniform weights of the EnKF, while for

*α*= 1 and

*h*= 0 we obtain the particle filter weights. Stordal

*et al.*(2011) proposed to adaptively estimate an optimal value of

*α*by setting where is the effective sample size.

The update formulation of the adaptive Gaussian mixture filter reduces the risk of ensemble degeneracy, but cannot fully avoid it. To this end, we can combine the filter with a resampling step as in other particle filters.

### 5.2 Ensemble Kalman Particle Filter

*α*∈ (0,1). In the first step the Stochastic Ensemble Kalman filter of Burgers

*et al.*(1998) is applied, and in the second step a particle filter. When the parameter

*α*is close to 0, the scheme is like a full particle filter, while for

*α*close to 1 it is essentially the ensemble Kalman filter. Figure 9 illustrates the idea.

*ϵ*_{i}∼

*N*(0,

**R**/

*α*) and

**K**

_{α}is the normal gain, but with

**R**divided by

*α*. Thus, the particles can be seen as draws from

*p*(

**y**|

**x**)

^{1 − α}analytically. This is easy because the EnKF posterior is a Gaussian mixture and the likelihood is a Gaussian, so the full posterior is a Gaussian mixture too. This leads to a full posterior

Note that the normalization constants in *γ*_{i} do not have to be calculated as we know that they should fulfil .

*N*samples from the distribution of the mixture coefficients

*γ*

_{i}and then draw from the selected Gaussian mixture components:

*k*

_{i}denotes the resampled particle index

*i*and . The variables

*ξ*_{i}can again be generated in two steps by

*ϵ*_{1.i}and

*ϵ*_{i,2}are independent draws from

*N*(0,

**R**/

*α*) and

*N*(0,

**R**/(1 −

*α*)), respectively.

The scheme is very closely related to a Gaussian mixture model, as the EnKF step forces the prior for the particle filter to be a Gaussian mixture. The strong point of this scheme is that the width of each Gaussian follows naturally from the stochastic part of the EnKF, while it is *adhoc* in standard Gaussian mixture models. Furthermore, while the standard Gaussian mixture model uses the observation covariance matrix **R**, this filter uses an inflated **HP**^{SEnKF}**H**^{T} + **R**/(1 − *α*), which will lead to a better weight distribution. Finally, the starting points of the centres of the prior Gaussians will be closer the observations, suggesting more uniform weights. The pseudocode of the scheme is presented in Algorithm 14.

In an extension of the scheme, Frei and Künsch (2013) suggest forming a tempering scheme, alternatively using the EnKF and the particle filter. The resampling step of the particle filter is not problematic in this case as the Kalman filter will diversify identical particles in each next iteration. The paper also discusses approximate schemes for non-Gaussian observation errors and nonlinear observation operators.

*et al.*(2018), a variant of this method has been introduced which is based on the LETKF instead of the stochastic variant and in which the update is in ensemble space:

**W**equal 1. The matrix

**W**can be split into

**W**

^{μ}corresponds to computing the centres

*μ*_{i},

**W**

^{α}to the resampling and

**W**

^{ξ}to the added noise

*ξ*_{i}. In the transform variant,

**W**

^{ξ}is deterministic and chosen such that the sample covariance of

**X**

^{PI}equals the covariance of the Gaussian mixture Equation 115. It thus belongs also to the class of second-order exact filters discussed in the next section.

Robert *et al.* (2018) apply a localized transform Ensemble Kalman Particle Filter in the KENDA (Kilometer-Scale Ensemble Data Assimilation) system with a set-up similar to the one used operationally by MeteoSwiss. This system computes the weight matrices **W** only on a coarse grid and then interpolates these matrices to the original grid. Therefore the discontinuities introduced by resampling are smoothed out, but in a way that is possibly optimal for the EnKF and not for the EnKPF. In Robert and Künsch (2017) a different localization method for the EnKPF was developed which proceeds by sequentially assimilating observations *y*^{k}, limiting the state components influenced by *y*^{k} to a subset. It smoothes out the discontinuities that occur when a resampled particle in the region influenced by *y*^{k} is connected to a background particle outside of this region. The smoothing is done in such a way that the second-order properties of the smoothed particle remain correct.

### 5.3 Second-order exact filters

#### 5.3.1 Merging particle filter

*et al.*(2007) explores the sampling aspect of the resampling step. The method draws a set of

*q*ensembles each of size

*N*from the weighted prior ensemble at the resampling step. Then these sets are merged via a weighted average to obtain a new set of particles that has the correct mean and covariance but is more robust than the standard particle filter. Define

**x**

_{i,j}as ensemble member

*i*in ensemble

*j*. The new merged ensemble members are generated via

*α*

_{j}have to be real and need to fulfil the two conditions

*q*> 3, there is no unique solution for the

*α*s, while for

*q*= 3 one finds

We can make the weights space-dependent in high-dimensional systems and, since the new particles are merged previous particles, the resulting global particles are expected to be smooth. The scheme is depicted in Algorithm 15.

#### 5.3.2 Nonlinear Ensemble Transform Filter

*et al.*2006; Tödter and Ahrens, 2015). For the associated ensemble perturbations, we can derive from Equation 126 with

**w**= (

*w*

_{1},…,

*w*

_{N})

^{T}and

**W**= diag(

**w**) that

**A**=

**W**−

**ww**

^{T}, for example, by a singular value decomposition as

**A**=

**VΛ**

*V*

^{T}. This leads to

**A**

^{1/2}=

**VΛ**

^{1/2}

**V**

^{T}and posterior perturbations are then given by

The computations of this filter are very similar to those in ensemble square-root Kalman filters like the ETKF (Hunt *et al.* 2007) or ESTKF (Nerger *et al.* 2012). As such, we can can also localize the filter in the same way. The localized NETF has been successfully applied to a high-dimensional geophysical system based on primitive equations in Tödter *et al.* (2016). In addition, the filter can be easily extended to a smoother by applying the filter transform matrix (the term in parentheses in Equation 132) to previous analysis times (Kirchgessner *et al.* 2017). The scheme is depicted in Algorithm 16.

#### 5.3.3 Nonlinear ensemble adjustment filter

*et al.*1998; Houtekamer and Mitchell, 1998). In this filter, we generate a set of perturbed model observations

*w*

_{i}(

**y**

_{k}) is computed from the likelihood of the perturbed measured ensemble member

**H**(

**x**

_{i}) . When we now define

*P*

^{a}is given by Equation 126. This update equation only yields the correct first and second moments of the posterior distribution in the limit of a large ensemble.

#### 5.3.4 Second-order exact ETPF

*et al.*2017). For this, we approximate

**Δ**being a symmetric

*N*×

*N*matrix. Using in Equation 137 and requiring that the result is equal to

**A**leads to the condition

**Δ**in the form of a continuous-time algebraic Riccati equation and there are known solution methods for this type of equation (e.g. de Wiljes

*et al.*2017). Note that still satisfies Equation 96. However, no longer holds, in general.

### 5.4 Hybrid LETPF–LETKF

*k*, i.e.

*α*∈ (0,1), but now the particle filter is employed first, followed by the ensemble Kalman filter. This is similar to tempering in just two steps. When the likelihood is Gaussian, the posterior is expected to be more Gaussian than the prior. Hence it makes sense to use a particle filter in the first step, and to try to use an EnKF in the second step of the tempering procedure.

**R**

^{k}, then the factorization is equivalent to scaling this matrix by 1/

*α*and 1/(1 −

*α*), respectively. Hence, one can, for example, first apply an LETPF to the forecast particles with inflated covariance matrix

**R**

^{k}/

*α*in order to obtain new particle values

*k*. One then applies the LETKF to these intermediate particles with inflated covariance matrix

**R**

^{k}/(1 −

*α*). The choice of

*α*is, of course, crucial. Numerical experiments indicate (Chustagulprom

*et al.*2016) that

*α*> 0 can lead to substantial improvements over a purely LETKF-based implementation and that the choice of

*α*can be based on the effective sample size of the associated LETPF. However, more refined selection criteria for the parameter

*α*are needed to make the hybrid LETPF-LETKF method widely applicable.

### 5.5 Hybrid EnVar PF

Based on the localized adaptive particle filter (LAPF) described in Section 4.1.3, a hybrid particle filter-based ensemble variational data assimilation system (PfVar) can also be constructed. The idea is to replace the LETKF-based ensemble in an EnVar by an LAPF-based ensemble.

We briefly discuss a practical numerical weather prediction example here. Following Buehner *et al.* (2013), the operational EnVAR system of DWD for the ICON model with 13 km global resolution and 6.5 km resolution of its two-way nested area over Europe is using the ensemble of the global 40-member LETKF for its dynamic covariance matrix with a ratio of 70:30 towards the classical NMC-based covariance matrix of the three-dimensional variational data assimilation system with 3 hr cycling interval. The LETKF ensemble is replaced by the LAPF ensemble, where the quality control of the variational high-resolution run is used for the ensemble data assimilation system under consideration. In the current system, no recentring of the ensemble with respect to the variational mean estimator is carried out, leading to a form of weak coupling of the systems.

In a quasi-operational set-up (without a high-resolution nest), the hybrid PfVAR is running stably for a period of one month. The observation minus background statistics show very promising behaviour in several case-studies which are under investigation at DWD (Walter *et al.* 2018). In the current state of tuning, the forecast quality of the PfVAR seems comparable to the forecasts based on the LETKF-based EnVAR. These new results studied in combination with Robert *et al.* (2018) show that today's particle filters are approaching the quality of state-of-the-art operational ensemble data assimilation systems and are already becoming important tools on all scales of NWP.

### 5.6 Discussion

Hybrid particle-ensemble Kalman filter schemes, especially when implemented adaptively, can avoid weight collapse in the particle filter part of the hybrid in any situation. The price paid is that not all information from the observations is extracted when the posterior pdf is severely non-Gaussian, but in many situations this is not the dominant source of error. The reason why these schemes are competitive is that they do take into account some non-Gaussianity via the particle filter, while the particle filter alone is very inefficient compared to the EnKF when the posterior is actually close to a Gaussian. So the objective is not necessarily to make the *α* as small as possible, but indeed to find an optimal *α* to ensure that the EnKF is used whenever possible. The same is true for the bridging parameter in the Adaptive Gaussian Mixture Filter.

The second-order exact filters are hybrids of a different kind, focussing on obtaining the posterior mean and the covariance correct given the limited prior ensemble. These methods are expected to be quite competitive to the hybrid filters discussed above, and the relative performance will depend strongly on the measure used to define what is best. For instance, RMSE are expected to be better for the second-order exact filters, while full ensemble measures like rank histograms and continuous ranked probability scores might benefit from the hybrid schemes.

One question that emerges when comparing the EnKPF and the LETPF–LETKF hybrid is which should be used first, the particle filter or the ensemble Kalman filter? Different experimental results seem to indicate that either ordering can be superior. The PF-first methods have the advantage of a theoretical justification via a two-step tempering interpretation in which the particle filter step makes the prior for the EnKF much more Gaussian. Applying the EnKF first will bring the particles closer to the observations, leading to better weight balance in the particle filter. At this moment it is unclear which order is best; much more research is needed.

## 6 CONCLUSIONS AND DISCUSSION

The largest issue of standard particle filters was until recently their degeneracy in high-dimensional settings: when the number of independent observations is large and the number of particles is limited (of order 10–1,000 for geophysical applications), one particle gets weight one, and all others get weight zero.

Two developments have revived the interest in particle filters: efficient proposal densities and localization, while hybrids with EnKFs and recently transportation filters enhance confidence in the usefulness of particle filters in high-dimensional settings. The new development is particle flow methods, whose popularity in the large machine-learning community ensures rapid progress here, too. It is unclear at this moment how competitive these new ideas will be. It is clear that developments on particle filters have been very fast, and the first tests of both localized and hybrid particle-EnKF filters in operational numerical weather prediction have been performed and show highly encouraging results.

This paper discussed these new developments and demonstrates that particle filters are useful in even the largest dimensional geophysical data assimilation problems and will allow us to make large steps towards fully nonlinear data assimilation. The emphasis was here on explaining and connecting existing and new ideas, including new understanding of the optimality of the optimal proposal density and equal-weight filters.

From the presentation it has become clear that the field is too young to provide solid guidance on which method will be most fruitful for which problem. Given that most data assimilation practitioners will have an implementation of a local EnKF in some form, localized particle filters seem to be the fastest way to make progress. However, one has to keep in mind that the resampling step needs smoothing that is more complex than in an EnKF, although exciting new variants like the ETPF and LAPF allow for smooth updates in a very natural way. Furthermore, with the small ensemble sizes now practical (10–100), more than ten independent observations in a localization area may already lead to filter degeneracy, forcing us to look into methods that limit the weights from below. This is another *adhoc* procedure that limits information extraction from observations, but it is unclear how severe this issue is.

Even easier are implementations of hybrid PF-EnKF filters, but it is still unclear what these filters target. At the moment their value lies in bringing more non-Gaussianity into EnKFs, but at the same time ensure that an EnKF is used when that is warranted.

We discussed two main variants that try to avoid localization because of the issues discussed above: the equal-weight particle filters and transportation particle filters. The equal-weight variants, which avoid weight collapse by construction, do not have a complete mathematical foundation yet. We know these schemes are biased, but since they are tailored to high-dimensional problems with small ensemble sizes, the bias error might be smaller than the Monte-Carlo error from the small ensemble size. Transportation particle filters still have to demonstrate their full potential in geoscience applications, but initial experiments with, for example, mapping particle filters on low-to-moderate dimensional systems together with the way they are formulated suggest they could become mainstream competitive schemes.

All in all, huge progress has been made in particle filtering, and initial attempts to implement the schemes into full-scale numerical weather prediction models have succeeded, with promising initial results. This shows that particle filters can no longer be ignored for high-dimensional geoscience applications.

## ACKNOWLEDGEMENTS

PJvL thanks the European Research Council (ERC) for funding the CUNDA project under the European Union Horizon 2020 research and innovation programme. The research of SR has been partially funded by Deutsche Forschungsgemeinschaft (DFG) through grant CRC 1294 “Data Assimilation (project A02)”. We thank three anonymous reviewers for their critical comments, leading to a much improved paper.

##
REFERENCES

## APPENDIX

## LAW OF TOTAL VARIANCE

*E*

_{A}[

*B*] as denoting the expectation of

*B*under pdf

*p*(

*a*):

_{X}[

**X**] leads to: