Twin experiments with the equivalent weights particle filter and HadCM3

This article investigates the use of a particle filter for data assimilation with a full‐scale coupled ocean–atmosphere general circulation model. Synthetic twin experiments are performed to assess the performance of the equivalent weights filter in such a high‐dimensional system. Artificial two‐dimensional sea surface temperature fields are used as observational data every day. Results are presented for different values of the free parameters in the method. Measures of the performance of the filter are root mean square errors, trajectories of individual variables in the model and rank histograms. Filter degeneracy is not observed and the performance of the filter is shown to depend on the ability to keep maximum spread in the ensemble.


The forecasting problem
One of the major goals of modelling in the geosciences is to use numerical models to make predictions of the system in question. There are two major components to such a prediction: firstly, ensuring that the model is representative of the dynamics of the system, and secondly, the model must be initialised to capture the current state, from which the prediction can begin.
To initialise a climate model, observations alone are not enough. Observation networks do not currently (and will probably never) have the capability to observe all of the state variables at every single grid point, and so climate model initialisation is a classic under-determined problem. Data assimilation (DA) is the process by which the observations of the real system will update an initial guess of the state in order to initialise the model.
Bayes' Theorem is the mathematical description of the data assimilation problem. If x represents a model state, and y some observations of it, then This states that the probability density function (PDF) of the state given some observations (the posterior PDF) can be written as the product of the PDF of the state itself (the prior) and the PDF of the observations given the state (the likelihood) normalised by the probability of the observations themselves. Different DA methods have varying approaches to finding a representation of the posterior PDF. For example, a variational DA method will simply try and find the mode of the posterior (e.g. Cohn, 1997).

Particle filters and the posterior PDF
A particle filter is a fully nonlinear DA method which seeks to represent the full posterior PDF. For an overview of particle filters, see e.g. van Leeuwen (2009). They represent the prior PDF as an ensemble of delta functions, and propagate this using the dynamical model.
Each of these ensemble members, or particles, has an associated weight referring to its contribution in the posterior PDF, i.e.
where x i and ω i are a pair denoting the ensemble member and its weight for each of the m ensemble members. The weights must satisfy which leads to an issue when ω j ≈ 1 for some particle j ∈ 1, . . ., m. This situation is known as filter degeneracy and, in such a case, the posterior PDF reduces to effectively being represented by the single ensemble member x j . Hence the mean of p(x | y) ≈ x j and all estimated higher-order moments of the distribution will be close to 0. It is well known that for a simple (no proposal density or resampling) particle filter, the number of ensemble members needed to avoid filter degeneracy grows exponentially with the number of independent observations of the system used Snyder et al., 2008). This shows that for high-dimensional systems, without using a more sophisticated method, these sizes of ensembles are not possible. In this context this drawback is referred to as the curse of dimensionality.
Other data assimilation methods make certain assumptions which make the representation of the posterior tractable. Famously, the Kalman filter results from assuming that the posterior PDF is Gaussian, and hence can be represented by its first two moments (Meinhold and Singpurwalla, 1983). Since a climate model is strongly nonlinear due to several feedbacks, it is unlikely that these Gaussian assumptions are valid in the initialisation of a climate model. For example, Figure 1 shows a reconstruction of the marginal PDF of eastward surface winds in the mid-Atlantic. In Figure 1(a) it can be seen that in the early states of the model evolution the PDF is unimodal and the Gaussian approximation would suffice. However, Figure 1(b) shows the PDF of the same variable after the model has evolved further in time. These marginal PDFs appear to be bimodal and hence exploring DA methods that do not assume Gaussian PDFs is of interest.
In this article we shall consider the equivalent weights particle filter (EWPF; van Leeuwen, 2010; van Leeuwen, 2013, 2015) which makes use of 'clever' proposal densities and resampling in order to avoid the curse of dimensionality. The advantages of using such a particle filter scheme include the lack of assumptions about the posterior PDF, its fully nonlinear nature, and the fact that, if it can still avoid filter degeneracy in a system of the size we consider here, it will quantify the uncertainty in the posterior PDF and hence a subsequent forecast.

Current climate model initialisation strategies
An excellent overview of the initialisation techniques for climate models used for seasonal to decadal prediction was given by Meehl et al. (2014), in relation to those models used as part of the Coupled Model Intercomparison Project Phase 5 (CMIP5). Many of the models considered are simply nudged towards some reanalysis product. Others are initialised with a form of three-dimensional variational assimilation (3DVar) in the ocean. In a few cases, the Ensemble Kalman Filter (EnKF) and Ensemble Optimal Interpolation (EnOI) are used. Many of the models are initialised using different methods for the atmosphere and the ocean.
When a variational method is used, only a mode of the PDF is found and so there is very limited information about the uncertainty in the solution. 4DVar applied to a coupled model with components having different time-scales has difficulties as it is unclear how long the assimilation window should be. For example, the accuracy of the tangent linear and adjoint models will be much worse for a faster, more nonlinear component when applied over the same period as a slower component in the coupled model. This is the case in a coupled ocean-atmosphere model where the atmosphere evolves substantially quicker than the ocean, and is the subject of much current research (e.g. Smith et al., 2015).
The EnKF methods assume that the posterior can be classified by the mean and covariance only, i.e. the posterior is assumed to be Gaussian. This may not be the case and so it is worth investigating whether this is a valid assumption. Use of the EnKF in a high-dimensional system typically requires localisation and inflation of error covariance matrices (e.g. Hamill et al., 2001), something that is not required in the EWPF which we investigate in this article. Many researchers are attempting to address these complications with the EnKF, for example Bocquet (2011) has developed a version of the EnKF which does not use inflation.
In this article we consider the model a fully coupled system. In this sense, we do not treat the ocean and atmosphere separately. Hence we use covariance information from one to influence the other. This is useful in guarding against having unbalanced initial states for the forecast, and hence should reduce the chance of encountering initialisation shock (Chen and Zebiak, 1997;Balmaseda et al., 2009).
There has been some work done to apply particle filters to climate models in the field of paleoclimatology. Dubinkina et al. (2011) and Goosse et al. (2012aGoosse et al. ( , 2012b have applied the Sequential Importance Resampling (SIR) particle filter to LOVECLIM, a more simple climate model than HadCM3 † which we shall introduce in section 2. Using a particle filter on a system the size of LOVECLIM one might expect to encounter filter degeneracy, however as Ades and van Leeuwen (2013) showed, filter degeneracy is a consequence of the number of independent observations, not the size of the state vector. In the experiments with LOVECLIM, the number of independent observations used were small, due to considering spatial averages over very large areas. Apart from this, the weights of the particles had a specified minimal value to avoid filter degeneracy. The negative result of this procedure is that not all information is extracted from the observations. Recently Dubinkina and Goosse (2013) have applied a nudging proposal particle filter to LOVECLIM, i.e. simply not using the weight-equalising step of the EWPF (Eq. (8)), with the justification that filter degeneracy does not occur in the low-dimensional system. In this article the size of the observation vector (27 370)

Article structure
The rest of this article is organised as follows. In section 2 we discuss the properties of the coupled climate model we use in this study. In section 3 we give a brief description of the particle filter method we consider. In section 4 we describe how we generate two different model evolution error covariance matrices for use in subsequent experiments. Section 5 describes the technical details of the set-up of the experiments, section 6 gives the results from various assimilation experiments, and finally in section 7 we finish with some conclusions about the efficacy of applying such a particle filter to initialise a coupled climate model.

HadCM3
HadCM3 is a coupled ocean-atmosphere general circulation model (GCM) which has been extensively used in Intergovernmental Panel on Climate Change (IPCC) reports (Gordon et al., 2000;Solomon, 2007;Stocker et al., 2013) and does not require flux adjustments (Collins et al., 2001). The atmosphere component has five prognostic variables: surface pressure, zonal and meridional velocities, potential temperature and humidity. These are stored on a 3.75 • × 2.5 • staggered B-grid with 19 vertical levels. The ocean component has four distinct prognostic variables: zonal and meridional velocities, temperature and salinity. These are stored on a 1.25 • × 1.25 • staggered B-grid with 20 vertical levels, horizontally aligned with the atmospheric grid.
The total number of state variables in the model is 2 314 430. The model's timestepping scheme works on a 24 h cycle. Firstly it integrates the atmosphere over the 24 h period with a timestep of 30 min. Then the ocean is integrated over the same 24 h period with a timestep of 60 min. The coupling between the ocean and atmosphere occurs once every 24 h, or equivalently every 72 model timesteps.
We consider the climate model HadCM3 to act as the function f in the equation where x k represents the model state at time k. β k is a stochastic term which represents missing or incorrect physics in the model, the distribution of which we will have to specify (section 4). We denote the covariance of β k by Q. If β k were not present, and the model were purely deterministic, this would restrict the ability to modify the state of the system and hence would limit the data assimilation methods we wish to use. Specifically, β k is required in order to satisfy the conditions to use the EWPF.

The equivalent weights particle filter
In order to describe a method in which the ensemble member in a particle filter can be directed in state space, the concept of a proposal density must be introduced. Rewriting Bayes' theorem (1) as where q(x, y) is the proposal density, is possible so long as q(x, y) has support larger than p(x). The key concept is that the proposal density can be a function not just of the model state x, but also (future) observations y. In this sense the ensemble members can be directed towards the future observations by using a proposal density which modifies the increment of the deterministic model to achieve this goal. This is the basis of particle filters such as the Implicit Particle Filter (Chorin and Tu, 2009) and the EWPF. When the weights, ω i , in a particle filter are calculated, they can be decomposed into those coming from the prior (p(x)), the likelihood (p(y | x)) and the proposal density q(x, y). The component of the weight associated with the proposal density is weighted by a term involving the distribution of the model error term, β. If β ≡ 0 then these weights would become infinite, thus there would be no freedom to direct the particles anywhere other than to where the deterministic model suggests.
The EWPF is a fully nonlinear DA method which works in a two-stage process which we describe subsequently. For a comprehensive overview of the EWPF see van Leeuwen (2010) and Ades and van Leeuwen (2013). Suppose that an observation occurs at timestep n. Then for each model timestep k before an observation occurs (k ∈ {0, . . . , n − 2}), the model state of each ensemble member, x k i , is updated via the equation where β k is a random vector, y n is the next observation in time, H is the observation operator that maps the model state onto observation space with associated observation error covariance matrix R and A is a relaxation term. In this work we consider where the matrix Q corresponds to the model evolution error covariance matrix. In this article we consider p constant, however it is possible to make p a function of the time between observations. The second stage of the EWPF involves updating each ensemble member at the observation time n via the formula where α i are scalars computed so as to make the weights of the particles equal. This is done for a given proportion (0 < κ ≤ 1) of the ensemble which can make the desired weight. After this equivalent weights step, all ensemble members are resampled from the members with the desired weight using stochastic universal sampling (Baker, 1987). It is important to realise that the covariance of the model state plays no role in the EWPF (apart from the beginning of the experiments, but that is forgotten quickly). Instead, the covariance of the error in the model evolution Q is crucial.

Specification of model evolution error covariance matrices
We must specify a model evolution error covariance matrix Q to use within the EWPF. As in this work we are considering twin experiments, it is safe to assume that there are no biases in the model, and hence we can take the mean of β k to be 0. Now we consider the model evolution error to be Gaussian with zero mean and covariance Q so that In order to estimate Q, we consider a long model run and assume that the variance in the model evolution error will be related to the variance in the model itself. Considering the daily variance in the model over a 5 year control run, we can decompose the covariances in the model as follows.
where is a correlation matrix and is a diagonal matrix of variances. Due to its large size, we only compute a localised version of Cov(x), taking covariances of each variable within two gridpoints of each other. This has the effect of removing spurious long-range correlations that arise due to under-sampling and makes the calculation computationally feasible. Note that in this coupled system there are no known control variable transforms which allow efficient application of functions of covariance matrices as used in fields such as atmospheric DA.
As we can only compute the highly localised entries in Cov(x), we seek a method to increase the length-scales that will be present in the model evolution error covariance matrix. Repeatedly applying the dimensionless correlation matrix has this effect.
Hence we model Q as The short length-scales present in Q defined in this manner means that the information from observations will be passed over longer length-scales indirectly via the nudging term (7) and the propagation of these terms in the dynamical model (4).
Unfortunately very little previous work has been done in the field to quantify model errors. We have taken a pragmatic approach and ensured the model errors are large enough to influence the evolution of the system, but not too large to destroy the physics represented in the deterministic evolution. Hence the coefficient, or proportionality, is chosen so that the size of the random perturbations for each of the separate variables is, in the L 2 -norm, less than 0.1 times that of the deterministic model update. We shall refer to this matrix as Q 1 . Note that Eq. (11) does not directly give the decomposition of Q into standard deviations and a correlation matrix, however such a decomposition is unnecessary for the computations. For efficiency, matrix vector multiplication of a vector by the matrix Q 1/2 has been implmented using the sparse BLAS library LIBRSB (Martone et al., 2010).
Note that there is very little experience in modelling Q. In this article we shall test different variants of the model evolution error covariance matrix in order to investigate the sensitivity of the assimilation results to Q. The modelling of Q we have just described assumes that the variance of the errors in the model were related to the variances in the model. However the variances in the model are dominated by the seasonal cycle. It is unrealistic to suggest that the errors in the model also follow directly the seasonal cycle and so we have removed the seasonal cycle from the samples we used to determine it. This was done by subtracting a running mean from each sample, with the mean taken 7 days either side of the sample. This matrix we shall refer to as Q 2 . Figure 2 gives a representation of some of the correlations contained within . For instance we see from Figure 2(a, b) the effect of Ekman transport (e.g. Price et al., 1987), noting the change in sign at the Equator. Figure 2(c, d) capture the strong relationship between SSTs and near surface humidity, in that increasing the sea surface temperature permits more moisture to be transferred to the atmosphere. As seen in Figure 2(d), the areas either side of the Equator in the Western Pacific showing very little correlation may be due to a precipitation feedback mechanism present in HadCM3.
It can be seen from Figure 2 that the correlations contained within Q 1 and Q 2 have broadly the same structure, with those calculated without the seasonal cycle showing stronger correlations. However the variances within these two matrices have a somewhat different structure. Figure 3 shows the unscaled standard deviations in 1 2 that correspond to variability in surface air pressure, surface eastward winds and subsurface sea water temperatures respectively, for both Q 1 and Q 2 . In the case of Q 2 , Figure 3(b) showing the standard deviations of the surface pressure variables no longer has the large variance in the Southern Ocean that is seen with Q 1 in Figure 3(a). Figure 3(b) retains the large variance over the North Pacific as seen in Figure 3(a), but the variances over Europe are reduced.
The variances of the eastward winds shown in Figure 3(d) are not a maximum over the North Atlantic and North Pacific as in Figure 3(c). Instead the maximum is in the northern Indian Ocean and the Maritime Continent.
Comparing the variances in the ocean temperatures in Figure 3(e, f), we can see that they have broadly the same spatial distribution. However the scaling is very different. In Q 1 which has the seasonal cycle (Figure 3(e)), the Kuroshio current off the east coast of Japan is the strongest signal and dominates the variance that you can see. In Q 2 where we have removed the seasonal cycle (Figure 3(f)), the Kuroshio still remains the area of   highest variability, but it is markedly less strong in comparison with the flow in the rest of the oceans.

EMPIRE coupling
To allow HadCM3 to be used within a DA system, the EMPIRE ‡ coupling has been used (Browne and Wilson, 2015). This involves inserting a small number of message passing interface (MPI) commands (Gropp et al., 1996) into the model in order to send the state vector to a different piece of software which acts to implement the EWPF while considering the model as a 'black box'. In doing so the climate model remains a separate executable from the DA system.

Twin experiment truth run
In order to perform an identical twin experiment, we must first run the model in order to obtain a truth which we wish to recover using the EWPF. For statistical consistency, the model is evolved ‡ Employing Message Passing Interface for Researching Ensembles. stochastically according to Eq. (4). That is, at each step, k, of the truth run, stochastic noise, β k , is added to the deterministic model where β k ∼ N (0, Q). Synthetic observations from this truth run will be taken as defined in section 5.3 and used within the data assimilation method to try to constrain the system.
When assimilating real data, the model used to evolve the state would then be imperfect. That is, the approximation made in defining the model error evolution covariance Q would likely be incorrect, and so the model and reality would evolve in different ways. For using the EWPF with real data, the approximation of Q becomes crucial, in a similar way to approximating the background error covariance matrix B and the observation error covariance matrix R required for many classical data assimilation techniques.

Observing network
We observe only the sea surface temperature (SST) in the coupled system. The SSTs are stored as the top-level ocean temperature variable. As we are considering only twin experiments, we are free to choose full coverage of SST observations, and so the observation operator will be simply the linear operator that restricts the whole state space to the upper-level ocean temperature variables.
We assume that the observation errors are Gaussian so that, for a given model state x, the corresponding observation y will be of the form Thus η ∼ N (0, R) is a realisation of observation noise. Note here that, written in this form, the observation operator is assumed unbiased and linear. For simplicity this work will consider only uncorrelated observation errors, i.e. R is assumed diagonal. This is unlikely to be true in practice, and future work will be devoted to investigating the effect of assimilating correlated observations using this model together with the EWPF. Moreover, we assume that the variance in the observations are spatially uniform so that with σ taken to be √ 0.3 K. This gives a total of 27 370 independent observations every day.

Creating an initial ensemble
The generation of the initial ensemble is ad hoc. However, as the particle filter is a sequential method, an initial ensemble has to be generated only once, after which the ensemble will just evolve over time, updated whenever observations are available, year after year. Because of the random forcing, the memory of the system is limited and the exact details of the initial ensemble become irrelevant.
A typical method of creating an initial ensemble for a twin experiment would be to have the ensemble perturbed around the truth. Assuming that the initial PDF is Gaussian with covariance B then given some truth state x t , each of the m ensemble members x i would be created as follows.
for all ensemble members i ∈ 1, . . . , m, where ξ i are realisations of an uncorrelated random variable. A more statistically consistent manner of generating the initial ensemble would be to ensure that the initial distribution is not centred on the truth, but rather the truth follows the same stochastic distribution as each ensemble member. This requires some reference state x r then and where ξ t , ξ i ∼ N (0, I) for all i ∈ 1, . . . , m. In this way the truth would be statistically indistinguishable from any of the ensemble members as x t , x i ∼ N (x r , B). With HadCM3 we do not have B, instead we experimented with the approximation that B = γ 2 Q, for some scalar γ . However this technique has been unable to produce a realistic amount of spread in the initial ensemble. If the coefficient γ is large enough to create sufficient spread, the perturbation to the reference state γ Q 1/2 ξ i destroys the delicate balances present in the model (e.g. hydrostatic and geostrophic balance), causing the model to crash.
The strategy which we adopt in this article is to take each initial ensemble member to be an instantaneous state of a long control run. In order for each ensemble member to be plausibly representative of the truth, they are all sampled at the same time of year. The advantage of this procedure is to eliminate the seasonal cycle in our initial conditions. Initialising the ensemble in such a way gives each ensemble member a dynamically consistent (balanced) state, and the desired spread is achieved by ensuring the samples are taken enough years apart within the control run. Note that the initial state for the truth is also taken as one of the instantaneous states from the control run.

Further technical aspects
The linear solve required in the equivalent weights step, is performed with HSL MA87 (Hogg et al., 2009), a direct linear solver optimized for shared memory systems.

Results
In this section we compare the results of assimilation runs with different parameters chosen in the EWPF, namely the strength of the nudging p and the proportion of particles, κ, kept by the equivalent weights step, for the more physically realistic Q matrix Q 2 . Corresponding results for Q 1 , i.e. the model-error covariance matrix dominated by the seasonal cycle, can be found in the Appendix.
We have run the model for six months, beginning on 1 December 1859, assimilating artificial SST data at the end of each day. Hence there were 180 assimilation cycles. We have arbitrarily chosen an ensemble of 32 members as this was as many as could be practically afforded for the experiments. Ideally, to reduce statistical noise in the results, we would run all the experiments multiple times with different random forcing. However, due to the computational complexity of these runs, we have been limited to a single instance. For example, running 32 ensemble members required the use of 1152 processors for approximately 3 h on ARCHER, the UK national supercomputer. Browne and Wilson (2015) give full details on the timings of the run.

Root mean squared errors
Throughout this section we shall use root mean squared error (RMSE) to mean the square root of the spatial mean of the square of the error between the ensemble mean and the truth. Figure 4 shows RMSEs for three different variables with different parameters used in the EWPF. In green is shown the RMSE for a stochastically forced ensemble, i.e. simply forecasting from the initial conditions. Figure 4(a) shows the RMSEs for the observed variables. We can see that sufficient nudging brings the ensemble closer to the truth than the stochastic ensemble. The stronger the nudging, the more reduced the RMSE is, until the nudging term becomes too large and the model breaks. In this case we were unable to increase the size of the nudging beyond what is shown here without the model crashing. This shows the sensitive nature of the model to perturbations.
The long-term performance of the assimilation is improved when κ = 1 when compared to κ < 1. In Figure 4(b, c), only the assimilation runs with κ = 1 give RMSEs comparable with the stochastic ensemble. Figure 4(b) shows the RMSEs of the specific humidity at the surface. Note that in this variable the assimilated ensemble behaves in a similar manner to the stochastic ensemble. This is indicative of the fact that the atmosphere is poorly constrained by the SSTs. Subsequent results will provide more evidence that the atmosphere is not well constrained and is behaving like the stochastic ensemble. Figure 4(c) shows the RMSEs of the zonal sea water flow at the surface. In this case we can see that only assimilation runs with strong nudging and κ = 1 follow the stochastic ensemble. In the other case, when κ < 1, the ensembles lose their spread as the resampling step replaces some particles with copies of other, better performing particles. This has been observed for many other cases of κ < 1 which we do not include in Figure 4 below. With the stochastic error term that we have introduced to HadCM3, this does not provide a sufficient spread in the remaining ensemble to encompass the truth. In this way, the RMSE increases markedly as the truth then lies outside of the ensemble. This is evidenced further by the rank histograms shown later in section 6.3. To show the effect of using the EWPF for assimilating SSTs on the other variables in the coupled ocean-atmosphere system, Figures 5-9 are plots of RMSEs for different levels in the model across different physical variables. For all these plots, we are displaying the results of the experiment with p = 3.5d7 and κ = 1. Figure 5 shows the seawater temperatures for levels of increasing depth. It can be see that the SSTs are significantly constraining these variables for at least the top 50 m of the ocean, at which point the influence of the observations becomes much less significant. At the 96 m level, the seawater temperatures appear to behave in a similar manner to the stochastic ensemble. Figure 6 shows the seawater meridional flow for levels of increasing depth. The effect of the assimilation on the nearsurface fields is clearly stronger than those deeper in the ocean, with visible reductions in RMSE seen to a depth of 35 m. Below this depth, the RMSEs of the EWPF and the stochastic ensemble are qualitatively similar. Figure 7 shows the seawater zonal flow for levels of increasing depth. If we compare this to Figure 6, we can see that the zonal flow is much less constrained by the SSTs. The reasons for this are currently not well understood and require further investigation into the physics underlying it. The RMSE plots shown in Figures 6  and 7 are global averages, and to understand this behaviour would mean looking into regional differences, which is beyond the scope of this article.  assimilating SSTs with the EWPF is not constraining the atmospheric temperatures. However for this specific experiment (κ = 1) there remains sufficient ensemble spread to avoid the errors becoming worse than the stochastic ensemble. Figure 9 shows the atmospheric humidity for levels of increasing height. At the surface, there is improvement in the RMSE over the stochastic ensemble. As we might expect, this improvement is reduced as we consider levels higher in the atmosphere further from the SST observations. Figure 10 shows trajectories of two different observed variables and one unobserved variable for the same parameter configurations as shown in Figure 4. Figure 10(a-c) show a SST variable in the Southern Hemisphere and Figure 10(d-f) a SST variable in the Northern Hemisphere. Figure 10(g-i) show a surface wind variable that is not observed. In yellow is the stochastic ensemble, blue the results of the assimilation with the specified parameters, and red the truth run.

Trajectories of individual state variables
Considering Figure 10(a-c), we can see that all the assimilation runs are constraining the ensemble to be a lot closer the the truth than the stochastic ensemble. However, when κ = 0.8 in  Figure 10(a), we see that the truth is quite often outside the ensemble and the spread of the ensemble is poor at times, when compared with σ , the observation error standard deviation. In Figure 10(c) we can see that there is a better spread but the ensemble does not completely track the truth as it is not nudged as strongly as the other two assimilation runs shown. Figure 10(b) is performing best in this case as the ensemble is following the truth while retaining a good degree of spread. For the other observed SST variable shown in Figure 10(d-f), we can see that the final two ensembles with κ = 1 are performing in a broadly similar manner. However, Figure 10(d) with κ = 0.8 again shows too small a spread in the ensemble and for a disproportionately large number of timesteps the truth falls outside the ensemble.
Considering now the trajectories of ensembles with κ = 1, we can see that the observed variables follow the truth well. However, after day 150, when the truth diverges from the stochastic ensemble in the second observed SST trajectory (Figure 10(e, f)), it can be seen that the spread in the ensemble increases. This shows an increase in the uncertainty of the assimilation, which we may not see with other assimilation methods that do not use an ensemble.
In the final row, Figure 10(g-i) show trajectories of an unobserved zonal wind variable at the surface. Consistent with the previous trajectories, Figure 10 reduced spread compared to the others. However, none of these assimilated ensembles appear to accurately track the truth, further showing that the atmospheric component of the model is not well constrained by the SST observations.

Rank histograms
In Figure 11, the columns correspond to those in Figure 10, i.e. the different parameter values in the EWPF. Figure 11(a-c) are rank histograms of the observed variables, Figure 11(d-f) rank histograms of eastward sea water flow in the deep ocean, and Figure 11(g-i) rank histograms of northward wind in the high atmosphere. The histograms are formed by summing the rank histograms at a number of well-separated grid points through all the timesteps of the experiment. In the ocean these are 10 latitude and 19 longitude grid points apart, getting increasingly sparse towards the poles. In the atmosphere these are 4 latitude and 16 longitude grid points apart, again getting increasingly sparse towards the poles. Figure 11(a, d, g) with κ = 0.8 show clearly that the ensemble is under-dispersive in all the variables. This is vastly improved as shown in the other plots with κ = 1. Figure 11(h, i) appear to be flat. This shows again that, in the atmosphere, the truth is indistinguishable from the ensemble members. This is consistent with the atmosphere being unconstrained by the observations.
In the experiments with κ = 1, for the observed SST variables (Figure 11(b, c)), the rank histograms appear slightly underdispersive. The remaining plots (Figure 11(e, f)) show that the ensembles are underdispersive in other ocean variables. This is no doubt compounded by the small amount of stochastic noise we were able to introduce to the system.

Delta function representation of marginal PDFs
In this section we show the delta function representation of the (marginal) posterior distribution as defined in Eq. (2). The height of each particle in Figure 12 corresponds to the weight of the particle. These marginal PDFs are shown after resampling, so by definition these particles have weight 1/m. Had filter degeneracy occurred, no spread would be seen in the ensemble. This is not the case as, by construction, the equivalent weights step (8) avoids this and hence keeps the ensemble well spread. Figure 12 shows the delta function representation of marginal PDFs arising from the assimilation run with p = 3.5d7 and κ = 1 in the experiment with Q = Q 2 . Figure 12(a-c) show the representations of specific humidity at the surface, and sea water temperature at depths 47 m and 3347 m, respectively. Figure 12(d-f) show the same ocean eastward velocity variable at three different points in time.
These marginal PDFs are the main motivation for using the nonlinear particle filter to initialise the climate model. From these  Figure 11. Rank histograms of model variables (a, b, c) observed SST, (d, e, f) ocean meridional flow at 3347 m, and (g, h, i) air meridional flow at 0.2p * when using parameters (a, d, g) p = 3.5d7, κ = 0.8, (b, e, h) p = 3.5d7, κ = 1, and (c, f, i) p = 3d7, κ = 1 in the equivalent weights particle filter for the case that Q = Q 2 .
PDFs we can get an estimate of the uncertainty in the posterior PDF at a given point, not just a maximum a posteriori estimate that we would get from using a variational method. With only 32 particles, it is unclear if any of these PDFs are truly multimodal, however Figure 12(d-f) suggest they may be bimodal.

Conclusions
In this article we have applied a fully nonlinear DA method to a coupled climate model. The EWPF was applied to a system with 2 314 430 variables and 27 370 independent observations per timestep. The particle filter did not degenerate at any stage thanks to the equivalent weights step. This model is over 35 times larger than any other model previously using the method (Ades and van Leeuwen, 2015). We have been able to constrain the model well in the observed variables. This constraining of model variables extended into the ocean temperatures a number of model levels below the observed SST fields, but had little effect in the atmosphere and on the non-temperature prognostic variables in the ocean. This suggests that, to further constrain the other variables, we require different observations to be used. For instance, observations from ARGO profiles to have observations in the deep ocean may be of great benefit for stronger uncertainty quantification of a subsequent prediction.
The best results we have obtained have come in the cases in which we applied the strongest nudging. However, the amount of nudging we could use was restricted by the numerical sensitivity of the model. Too much nudging would lead to catastrophic failure of the model and hence the assimilation process could not continue. Applying the EWPF to a model with less numerical sensitivity may allow stronger nudging to be employed, although this should still be smaller than the increment from the dynamical model.
We found it hard to attain an appropriate spread in the ensemble. Indeed, simply to achieve a reasonable spread in a stochastically forced ensemble required a non-standard method of initialising it. Because of this lack of spread available from the stochastic forcing, we achieved the best results when we kept all of the ensemble members in the equivalent weights step. This is different behaviour than has been observed in the equivalent weights filter when compared to its application in previous models van Leeuwen, 2013, 2015). We conjecture that this is due to an increase in degrees of freedom in this larger model, allowing equivalent weights to be achieved with a relatively smaller change to the model state, however this requires much more thorough investigation.
The EWPF, being a sequential DA, avoids any potential difficulties associated with the different time-scales in the various components of the coupled climate system. For instance, in a 4DVar DA method, the time period over which observations of the ocean and atmosphere are assimilated would typically be different. Naively one would think that the window of observations in the relatively fast-moving atmosphere should be shorter than that of the relatively slow-moving ocean. This can bring challenges to solving the fully coupled variational DA problem, leading to 'weakly coupled' variational solution methods (Smith et al., 2015). Other sequential methods such as the EnKF would similarly avoid the issues of multiple time-scales which variational methods face.
The performance of the whole assimilation system is dependent on the specification of the model evolution error covariance matrix. In this article, doing twin experiments, we have been able to prescribe this. However in reality a great deal of work would be required to approximate this matrix. This is an area which has largely been neglected by the wider community up to now, and which needs much research. The EWPF has been applied to a high-dimensional system and as a result the posterior PDF has been approximated, which is not possible with single-run DA methods like a simple nudging, or variational methods like 4DVar. The computational cost of this method is similar to using an EnKF with a stochastic model. This cost would be comparable to the computational cost of 4DVar, where the repeated iterations and running of the slower tangent linear and adjoint models would offset the cost of the ensemble of model runs. An additional issue with 4DVar is that, if the ocean and atmosphere are treated as one in the adjoint calculations, their different time-scales lead to difficulties when specifying the optimal optimization window length. To avoid this problem and to avoid having to generate an adjoint of the full ocean-atmosphere system, two separate adjoints are often used. Treating the coupled system together minimizes the effects of initialisation shock which is known to affect methods that treat the ocean and atmosphere separately.
The use of the EWPF, a fully nonlinear DA method, has shown that some marginal PDFs could be multimodal. This is a result of both the dynamics of the model and the observations which we have used. In other circumstances, the posterior PDF may be Gaussian, in which case using a method such as the EnKF would be advantageous as it is designed specifically for these cases. However if the variables of interest clearly do not follow a Gaussian distribution (e.g. bounded humidity variables), then the EWPF should be seriously considered as a DA method to be used.
Incorporating a new DA scheme into a GCM could be a daunting task. However the use of the EMPIRE (Browne and Wilson, 2015) coupling system allowed us to do so extremely quickly within a particularly small team. This system, along with the open-source codes implementing the EWPF, should encourage the community that it is possible to quickly and easily use fully nonlinear DA methods with new models of ever increasing size and complexity.

Results with Q = Q 1
In this Appendix we show the results using the model error covariance matrix which retains the seasonal cycle. When using a different model error covariance matrix, the values used for the nudging parameter p changes accordingly. We have found that with Q 1 larger values of p were possible that for Q 2 . Hence in this Appendix we show results for three of the best performing values of p which we experimented with, in a similar manner to section 6. Figure A1 shows the RMSEs for a different variables in the case Q = Q 1 . In green is shown the RMSE for a stochastically forced ensemble, i.e. simply forecasting from the initial conditions. Figure A1(a) shows the RMSEs for the observed variables. We can see that sufficient nudging brings the ensemble closer to the truth than the stochastic ensemble. The best performing assimilations are those that have a strong nudging and set the proportion of particles kept in the equivalent weights step, κ, to 1.0. In this case we were unable to increase the size of the nudging beyond what is shown here without the model crashing. This shows the sensitive nature of the model to perturbations. Figure A1(c) shows the RMSEs of the northward sea water velocities at the surface. In this case we can see that only assimilation runs with strong nudging and κ = 1 follow the stochastic ensemble. In the other case, when κ < 1, the ensembles lose their spread as the resampling step replaces some particles with copies of other, better performing, particles. This has been observed for many other cases of κ < 1 which we do not include in Figure A1. With the stochastic error term that we have introduced to HadCM3, this does not provide a sufficient spread in the remaining ensemble to encompass the truth. In this way, the RMSE increases markedly as the truth then lies outside the ensemble. This is evidenced further by the rank histograms shown in section A3. Figures A2-A6 correspond to the same experiments as reported in Figures 5-9. Note that the behaviour of all of these are qualitatively similar to those with Q 2 . Figure A7 shows the trajectories of the same points as in Figure 10 but for Q = Q 1 . Figure A7(a) shows that with κ < 1 we have lost the appropriate spread in the ensemble. After around 130 days, the truth diverges from the ensemble and the nudging is not strong enough to bring the ensemble back to the truth. Figure A7(d) shows similar behaviour, but note that, in this variable, in the  1.00 1.02 1.04 1.06 1.08 1.10 1.12 1.14 1.16 Figure A9. As Figure 12, but for Q = Q 1 , p = 4d7 and κ = 1.

A2. Trajectories of individual state variables
final 30 days, the truth is far away from the stochastic ensemble. In this case, although the spread is still poor, the ensemble is significantly closer to the truth than the stochastically forced ensemble. Figure A7(g) behaves broadly similarly to Figure 10(g) in that the truth is not well constrained and the spread in the ensemble has been reduced. As in Figure 10, Figure A7(h, i) show how the unobserved wind variables are poorly constrained by the SST observations.
In Figure A7(c), a single particle in the ensemble can be seen to drop substantially below the rest of the ensemble. Even with κ = 1, we still perform Universal Importance Resampling after each equivalent weights step. This provides a small chance that the particle will be replaced by a copy of another from the ensemble. This can be seen happening at around 112 days where the trajectory of the low particle is brought up back into the main body of the ensemble. This resampling was found to happen less than 0.1% of the possible times it could occur.

A3. Rank histograms
In Figure A8, the columns correspond to those in Figure A7, i.e. the different parameter values in the EWPF but with the different model evolution error covariance matrix Q = Q 1 . Figure A8(a-c) are rank histograms of the observed variables, Figure A8(d-f) rank histograms of eastward sea water flow in the deep ocean and Figure A8(g-i) rank histograms of northward wind in the high atmosphere.
Considering Figure A8(a, d, g) where κ = 0.8, we can see that these are marginally more underdispersive than the corresponding plots in Figure 11. Figure A8(b, c), rank histograms of the observed variables, show signs of being underdispersive, unlike the corresponding plots Figure 11(b, c). The remaining Figure A8(e, f, h, i) appear to be qualitatively similar to their counterparts in Figure 11. Figure A9 shows the delta function representation of marginal PDFs arising from the assimilation run with p = 4d7 and κ = 1 for the experiment with Q = Q 1 . The variables are the same as shown in Figure 12. If we compare these figures to Figure 12, we can see that the spread of the sea water temperature variables close to the surface ( Figure A9(b)) is increased. As in the case that Q = Q 2 , the variables shown in Figure A9(e, f) appear to be bimodal, however in this case with Q = Q 1 , this bimodality occurs later as it is not seen in Figure A9(d).