Time‐correlated model error in the (ensemble) Kalman smoother

Data assimilation is often performed in a perfect‐model scenario, where only errors in initial conditions and observations are considered. Errors in model equations are increasingly being included, but typically using rather adhoc approximations with limited understanding of how these approximations affect the solution and how these approximations interfere with approximations inherent in finite‐size ensembles. We provide the first systematic evaluation of the influence of approximations to model errors within a time window of weak‐constraint ensemble smoothers. In particular, we study the effects of prescribing temporal correlations in the model errors incorrectly in a Kalman smoother, and in interaction with finite‐ensemble‐size effects in an ensemble Kalman smoother. For the Kalman smoother we find that an incorrect correlation time‐scale for additive model errors can have substantial negative effects on the solutions, and we find that overestimating of the correlation time‐scale leads to worse results than underestimating. In the ensemble Kalman smoother case, the resulting ensemble‐based space–time gain can be written as the true gain multiplied by two factors, a linear factor containing the errors due to both time‐correlation errors and finite ensemble effects, and a nonlinear factor related to the inverse part of the gain. Assuming that both errors are relatively small, we are able to disentangle the contributions from the different approximations. The analysis mean is affected by the time‐correlation errors, but also substantially by finite‐ensemble effects, which was unexpected. The analysis covariance is affected by both time‐correlation errors and an in‐breeding term. This first thorough analysis of the influence of time‐correlation errors and finite‐ensemble‐size errors on weak‐constraint ensemble smoothers will aid further development of these methods and help to make them robust for e.g. numerical weather prediction.

Data assimilation is often performed in a perfect-model scenario, where only errors in initial conditions and observations are considered. Errors in model equations are increasingly being included, but typically using rather ad hoc approximations with limited understanding of how these approximations affect the solution and how these approximations interfere with approximations inherent in finite-size ensembles. We provide the first systematic evaluation of the influence of approximations to model errors within a time window of weak-constraint ensemble smoothers. In particular, we study the effects of prescribing temporal correlations in the model errors incorrectly in a Kalman smoother, and in interaction with finite-ensemble-size effects in an ensemble Kalman smoother. For the Kalman smoother we find that an incorrect correlation time-scale for additive model errors can have substantial negative effects on the solutions, and we find that overestimating of the correlation time-scale leads to worse results than underestimating. In the ensemble Kalman smoother case, the resulting ensemble-based space-time gain can be written as the true gain multiplied by two factors, a linear factor containing the errors due to both time-correlation errors and finite ensemble effects, and a nonlinear factor related to the inverse part of the gain. Assuming that both errors are relatively small, we are able to disentangle the contributions from the different approximations. The analysis mean is affected by the time-correlation errors, but also substantially by finite-ensemble effects, which was unexpected. The analysis covariance is affected by both time-correlation errors and an in-breeding term. This first thorough analysis of the influence of time-correlation errors and finite-ensemble-size errors on weak-constraint ensemble smoothers will aid further development of these methods and help to make them robust for e.g. numerical weather prediction. KEYWORDS data assimilation; ensembles; model error; statistical methods

INTRODUCTION
Data assimilation (DA) combines incomplete and imperfect sources of information of a system to obtain a better estimate of that system, including uncertainties. These sources -models and observations for example -can be represented as random variables with given probability density functions (pdfs). The reader is referred to e.g. van  and Ash et al. (2017), for recent introductions.
The general solution to the DA problem is given by Bayes' theorem (Bayes and Price, 1763): variables given the observations -is the result of multiplying prior and likelihood. The denominator p(y) is the marginal pdf of the observations and does not depend on the state variables. Hence, DA can be regarded as a multiplication problem involving different sources of information.
We seldom work with pdfs in practice because, as soon as the dimension of the system is larger than (say) 3, we have difficulty storing and propagating full pdfs. Most DA methods rely on estimating statistics of the posterior pdf and are based -to different degrees -on assumptions of Gaussianity in the error sources. Most methods are optimal when the evolution of the system and the observation process are linear. Variational methods like 4D-Var (Le Dimet and Talagrand, 1986;Talagrand and Courtier, 1987) work with the mode of the posterior pdf, whereas methods based on the Kalman filter (KF; Kalman, 1960;Kalman and Bucy, 1961) and its ensemble implementations (e.g. the EnKF; Evensen, 1994;Burgers et al., 1998) work with the first two moments.
DA often works with systems that evolve in time. This time dependence can be handled in two ways. Filters update the state variables only at the observed times, whereas smoothers update whole trajectories of the state variables in a given assimilation window, using simultaneously all the observations available during that time frame. In this work we focus on the latter.
Consider that the true evolution of the system is generated by a true model, and that the forecast step of the DA process is generated by a forecast model. If these two models perfectly match, we say we work in a strong-constraint (SC) framework. This situation reduces the smoothing problem to searching for the initial conditions of the assimilation window. In geosciences, however, there is always a mismatch between the true model and the forecast model. This mismatch is called model error, and it can arise from the discretization of the underlying partial differential equations describing the system, the parametrization of processes that cannot be explicitly resolved, the lack of knowledge of some physical processes, and many other sources. The reader is referred to Howles et al. (2017) for a further discussion. Sometimes model error is small enough to be ignored compared to other uncertainties, but this is not always the case.
Model error can be simulated in different forms within the DA forecast step. It can be inserted as a random additive term at every given number of model time steps, or as a random multiplicative factor in the tendencies of the model equations (Palmer et al., 2009). More indirect ways include using different parametrization schemes for different ensemble members, which means we can sometimes represent model error even without clearly knowing its statistics. Including any existing model error is particularly important in ensemble forecasting, since it is needed to produce a good estimate of the actual forecast uncertainty.
Even when it is present in the DA forecast step, model error is often not treated explicitly in the DA analysis step (e.g. Bonavita et al., 2016). Treating model error in the DA analysis step is known as weak-constraint (WC) framework; the reader is referred to Sasaki (1970) and Tremolet (2006) for an introduction. A WC smoother poses a considerably harder problem for several reasons. Firstly, model error statistics are often unknown, and although some methods have arisen to estimate them (e.g. Todling, 2015;Zhu et al., 2018), this is a challenging task. Secondly, the size of the control variable and the computational expense of the problem is larger than in the SC case. This is because not only the initial conditions but intermediate jumps need to be estimated as well. To reduce this burden one can consider "effective" model errors over a large number of time steps (Tremolet, 2006).
Since specifying the correct model error statistics is not trivial, it is important to understand the consequences of using an incorrect time-autocorrelation of the model error in a WC smoother. This is a question we aim to answer in this paper. Under linear model evolution and linear observation operators, the problem we study is equivalent to the inner loops of an ensemble of WC 4D-Vars with perturbed predicted observations and additive model errors for each trajectory. Since the (full) nonlinear problem is often solved as a sequence of successively linearized problems, our results are also relevant for the nonlinear case.
For many dynamical systems of interest, the background-error covariance used in the prior p(x) evolves with time. One way to find a time-evolving estimator of this covariance is to use an ensemble, as in the case of ensemble Kalman filters; Vetra-Carvalho et al. (2017) give a review on the implementation of several different flavours. Furthermore, the use of hybrid ensemble-variational methods has grown in the last years, and these are seen by many as a pathway to the future for e.g. numerical weather prediction. The reader is referred to Goodliff et al. (2015) and the references therein for an overview.
As discussed, sample information is important in many DA methods, but the finite-sized nature of an ensemble method introduces error. Besides the direct sampling (Monte-Carlo) error, there is a more subtle indirect sampling error which comes from the use of sample covariance statistics in the gain required in the analysis step of the smoothers. These issues were recognised by Houtekamer and Mitchel (1998) and analysed by van Leeuwen (1999), and studied in more detail by Sacher and Bartello (2008) and Furrer and Bengsston (2007) in the filtering setting. In this paper we identify and quantify the effects of both direct and indirect sampling errors in the ensemble Kalman smoother. Furthermore, we study the interactions between these finite-size sample effects and the errors arising from incorrect specification of the temporal-correlation, or memory, of the model error. This is done within a single (forecast/assimilation) time window. Future work will explore the effect of cycling, and the extension into nonlinear systems.
The paper is structured as follows. In section 2 we derive the exact analysis solution for a time window of the ensemble smoother in the presence of autocorrelated additive Gaussian model error. For an exponential autocorrelation memory function, we illustrate the propagation of information from observations inside the assimilation window. This exact solution serves as benchmark for the work in section 3 where we introduce two sources of imperfection: a mis-specification of the autocorrelation memory of the model error, as well as the use of finite-size ensembles. Section 4 illustrates the difference between direct and indirect sampling errors arising from an ensemble using numerical examples. Section 5 provides a summary and conclusions. This work is heavy in equations. To aid the reader we have underlined the most important expressions which often have significance throughout the whole work.

KALMAN SMOOTHER WITH TEMPORAL-CORRELATED MODEL ERROR
Let us consider a system akin to that of Howles et al. (2017), although our ultimate purpose is different. We denote the state variable at t = 0 as x 0 ∈  N x . This random variable x 0 follows a multivariate Gaussian distribution (MGD) , where 0,b x ∈  N x is its mean, B ∈  N x ×N x is its covariance matrix, and the superscript b stands for background. Over one time step the state variable evolves as The linear operator M (t−1)→t ∈  N x ×N x has the property for 0 ≤ t 1 < t 2 < t 3 . The variable v t ∈  N x is the model error jump at time step t. This random variable has a MGD v t ∼ N (0, Q), with mean 0 ∈  N x and model error covariance matrix Q ∈  N x ×N x . The model error jumps can be correlated in time: where 0 ≤ ≤ 1 represents the memory, |i−j| is the absolute difference between time steps i and j, and is a characteristic memory time-scale of the system. The function takes the value 1 when |i−j| = 0, and decreases monotonically towards 0 as |i − j| increases. We keep our work general for any function that fulfils these two conditions, but for more specific examples we use an exponentially decaying memory: Observations are taken every Δ obs time steps. The lth observation y l ∈  N y is obtained as: where H l ∈  N y ×N x is the lth linear observation operator, and l ∈  N y is the observational error. This random variable follows a zero-mean MGD l ∼ N (0, R), where R ∈  N y ×N y is the observational-error covariance. The random variables x 0 , and v t are assumed to be statistically independent of each other. For brevity in the derivations, we consider the time window t = {0, 1, · · · , − 1, } to contain only one observation at t = , i.e. at the end of the time window. This can be generalized to L observation in two ways. The first is to realise that -in the case of linear model and observation operators -observations at different times can be assimilated sequentially. This is akin to the serial EnKF of Whitaker and Hamill (2002). In the case of assimilating all observational times at once, the following derivations are still valid when using the extended expressions of the Appendix. The collection of model error jumps can be written as one This random variable follows a MGD v 1∶ ∼ N ( 0, Q 1∶ ) , with mean 0 ∈  N x and covariance Q 1∶ ∈  N x × N x , which is a block-matrix in which each element contains Q multiplied by a memory coefficient. It is helpful to write Q 1∶ as the Kronecker (outer) product: where 1∶ ∈  , is a Toeplitz matrix of memory coefficients: It is useful to note two limiting cases: (i) The zero-memory case occurs when → 0, yielding (|i − j|, ) = 0 ∀ |i − j| > 0. Then 1∶ becomes the identity matrix, and Q 1∶ becomes a block-diagonal with Q in each diagonal block-element. This corresponds to completely independent model error jumps. This is shown schematically in the top row of Figure 1.
(ii) The infinite-memory case occurs when → ∞, yielding (|i − j|, ) = 1 ∀ |i − j|. Then 1∶ becomes a matrix of ones, and Q 1∶ becomes a block-matrix with Q in every block-element. This corresponds to fixed model error jumps. This is shown schematically in the bottom row of Figure 1.

The weak-constraint Kalman smoother
Now we describe the process performed by the WC Kalman smoother over one time window. Consider the extended control variable z 0∶ ∈  (1+ )N x containing initial conditions and model error jumps: The posterior pdf p(z 0∶ |y) is simply: where the prior and This matrix can be written in blocks as: For later reference we note that we can formulate the problem in terms of the state variables x too via: This equation exploits the linearity of the model, showing that each model error jump propagates independently of the rest of the variables to the end of the assimilation window where the observation is located. It allows us to write the like- . Note that y is given, so the likelihood as function of HM 0∶ z 0∶ is given by N (y, R). It is useful to compute the first two moments of x ,b as: Performing the product in Equation 11 and doing some factorizations allows us to write the posterior as: ] .
(17) Therefore, the posterior pdf is: where ∈  N y ×N y is the total covariance in observation space at the end of the assimilation window: Finally, we note that in this Gaussian case 0∶ ,a z is also the minimizer of the cost function of the problem: which is nothing else but the minus logarithm of the numerator of Equation 11. This solution is standard knowledge (e.g. Jazwinski, 1970;Howles et al., 2017). What follows, however, is our contribution. We examine in detail the effect that temporal correlation of model error has on the analysis values over the whole time window. We will study the solution in both initial condition-model jump space and in state-trajectory space.

Solution in terms of initial conditions and model error jumps
We now separate the components of the solution into those related to x 0 and those related to v 1∶ . We start by expanding Equation 22 as: with B ∈  N x ×N x and ∈  N x ×N x defined as: B ∈  N x ×N x results from the evolution of B, and ∈  N x ,N x is the effective contribution of the model error. Explicitly, is a double sum: Using the short-hand notationM = ∑ j=1 M j→ we can write the two limits: (27) These limits show that when , and hence the memory, increases, the contribution of model error to the total covariance contains more terms and is expected to be larger. This, of course, is not surprising as a larger memory means that each random jump is felt long into the future.
The gain can be written as K 0∶ acts on x 0 and the second Once more, the two limits of interest are: (31) In the zero-memory case, only one term in the sum (30) remains and it is different at every time step. In the infinite-memory case, all terms in Equation 30 have the same coefficient = 1, and K j v is exactly the same for all time steps. With these expressions for the gains, we can formulate the full solution to the problem. We define d ∈  N y as departures of observations from evolved background: The analysis mean can then be written: and the analysis covariance becomes:

Solution in terms of state variables
We now express the solution in terms of the state variables at different times: We can compute x t,a = M 0∶t z 0∶t,a as The second term actually contains a double sum: Finally we compute the first two moments of x t,a . The mean is and the covariance where B t and t are defined as in Equation 25 but for a general t.

Illustration in the scalar case
The effect of the temporally correlated model errors, encoded in ( ), on the assimilation results can be analysed in more detail in the univariate case. This will allow us to gain an understanding of the relative order of magnitude of the different contributions. Let the error variances be b 2 , q 2 and r 2 for background, model, and observational errors, respectively. We observe directly and the model is a constant m inside the assimilation window. A value of m = 0 maps any state variable to zero. Values in 0 < m < 1 constitute compressions of the state variable, the case m = 1 is the identity, and m > 1 is an expansion. Negative values have the same behaviour, but the difference is that they cause the state variable to alternate signs in consecutive time steps. Hence, we only consider m ≥ 0. We will analyse the results for the zeroand infinite-memory cases, which apply for any ( ). In the finite non-zero cases, we have to specify the exact memory dependence. We restrict ourselves to exponential-memory dependences here. For the scalar case the model operators are just powers of m: The scalar versions of Equations 22, 28 and 30 are and we used the notation: The zero-memory and infinite-memory cases are given as .
(45) Note that in the latter case the gain is exactly the same for all time steps. Using

a property of geometric sums), the limits of zero-memory and infinite-memory of Equation 46 become
where clearly the second expression is larger than the first, except for = 1 in which case 2 (m, ) = 1 in both limits. As function of m we have: • When m → 0, 2 (m, 0) → 1 and 2 (m, ∞) → 1.
For an exponentially decaying memory, we use Wolfram Mathematica to get a closed-form expression The for m = 1 renders an undetermined form (0 ÷ 0) and one must take a limit. For the exponential-memory case, we substitute 2 ,exp from Equation 48 into Equation 42 to get ) .
When the solution is expressed in terms of the state variables, we need the scalar version of Equation 37 -i.e. the gain acting on x t -which now becomes which has the following two limits: and substituting again 2 ,exp from Equation 48 into Equation 37 we have explicitly .
(52) In the univariate case, 0 ≤ K ≤ 1 (for all gains). As K grows, the influence of the observation on the analysis increases.
For illustration we choose an assimilation window with = 3 time steps and variances b 2 = 5, q 2 = 0.25 and r 2 = 0.1. In Figure 3 we plot the gains (a-d) K j v,exp and (e-h) K t x,exp . Each column shows a different time step from (a, e) t = 0 to (d, h) t = 3. In each panel we plot the gain for every combination (c) We see a different behaviour for compressive m < 1 and expansive m > 1 models, and for those close to persistence with m = 1.
We investigated different combinations of values for the variances, and the general behaviour was similar. For example, in Figure 4 we plot the case for b 2 = 4, q 2 = 4 and r 2 = 1. Since we increased the observational variance, the impact of the observation is smaller on all panels.

Analysing the gains
In the final section on the univariate case, we shed some more light on the behaviour of the gains K 0 x , K t x (0 < t < ) and K x . The summary of this analysis is given in Figure 5, which also includes the limiting expressions of these gains as m, and change.
These analytical results on the dependence on m can be understood as follows. Schematically, we can write In general, the numerator of K t x arises from the covariance between the state at time t and the state at observation time . This covariance consists of two terms, the propagation of the background covariance from time 0, and the accumulated propagated model errors. The denominator consists of the total propagated state error at observation time, and the observation errors. The former consists again of a part related to the background covariance at time 0 propagated to time , and the accumulated propagated model errors.
For small m, most terms in the numerator will be small, since they are proportional to some positive power of m. There is one model error term that is not propagated, and that term is proportional to (| − t|, ), which is small for < 1. The denominator is dominated by terms that do not contain propagation, specifically the observation error and the model error in the last model time step. Combining numerator and denominator, we see directly that the gain will be small. This means that observations at the end of a window have small influence on the state at other times when m is small. This is simply because the propagated state covariance will be very small at observation time, so the propagated state is much more accurate than the observation. An exception is when is large, in For large m, the largest terms in the expression for K t x are those that contain the propagation of the background covariance all the way to the observation time. The term in the numerator has m to the power +t, but the denominator has m to the power 2 , and hence, again,the gain will be small. This means that observations at the end of a window have small influence on the state at other times when m is large. Although the observation is much more accurate than the propagated state, the propagation of the innovation backwards in time, i.e. in the direction where the model contracts, leads to a small influence.

INEXACT TIME CORRELATION AND FINITE ENSEMBLE SIZE
We have discussed the effect of temporally correlated model error has on a time window of the Kalman smoother. In this section we consider the effect of two sources of imperfection: an incorrectly prescribed memory time-scale for the model error, and covariance information coming from samples with a finite-ensemble size.

Errors from inexact time autocorrelation
Assume that we know Q exactly, but we do not know the real memory time-scale . Hence, the DA forecast model uses a guess g , and the model error used in the DA is the result of this mis-specification. While 1∶ has functions (|i − j|, ), the guess 1∶ g has functions (|i − j|, g ). Their difference and We need a measure of the "magnitude" of the difference 1∶ Φ relative to the "magnitude" of 1∶ . We use the metric where the norm ‖.‖ is the maximum eigenvalue. Finding the analytical dependence of Equation 56 on and g is not simple. Hence, we have resorted to a numerical experiment to illustrate the behaviour of 2 ( , g , ) as function of the mis-specification of the memory term. Results are shown in Figure 6 for an assimilation window of size = 10, using an exponential-memory function. This figure is generated in the following manner: • We choose an to produce a matrix 1∶ . • We choose an g to produce a guess matrix 1∶ g . • We evaluate Equations 54 and 56. The resulting value is saved. • We repeat these steps for every pair , and we populate a matrix. The resulting matrix is plotted for different values of (horizontal axis) and of g (vertical axis). The colour bar spans the interval 0 (dark purple) to 1 (light blue), and everything above 1 is plotted in white. Black lines show the contours corresponding to 2 = {0, 0.2, 0.4, 0.6, 0.8, 1.0}. This figure shows that having g < results in larger 2 than the contrary situation. Hence underestimating the temporal correlation time-scale leads to larger errors then overestimating it. In general, there is a large region in which the magnitude is smaller than unity. The value does not simply depend on the difference | − g | or the ratio ∕ g , but their individual values as well. We performed experiments with different values of and they yielded similar results, so they are not included.
The corresponding analysis mean and covariance react to the mis-specification of the memory scale as follows. Interestingly, both z 0∶ ,b and z 0∶ ,b g share the same expectation: and the only difference is in the covariances: and the rest of the problem is the same. This means that we have the posterior pdf: Therefore, substituting 1∶ with 1∶ g renders a different D 0∶ g , g and K 0∶ z,g , and these are combined in a nonlinear fashion to produce guess analysis solutions. The differences with the exact solutions in Equation 12 are not simple to analyse. Before doing so let us first consider the noise introduced by finite-size ensembles.

Direct and indirect errors coming from ensemble statistics
Consider an ensemble of N e elements sampled from the background pdf N . The n e th member is One can generate a sample n e from the observational error pdf N (0, R). Following Burgers et al. (1999), but perturbing the observational departure from the background instead of the observations for statistical consistency, we find d n e = y − (Hx n e + n e ) where we have again, but now for each ensemble member x n e = M 0∶ z 0∶ ,b n e . One can then apply Equation 19 using the sample expressions (62) and (63) to get z 0∶ ,a Note that Equation 64 uses the exact gain K 0∶ z . The associated n e th cost function uses the true covariance D 0∶ø : , with the moments defined as in Equations 19 and 20. Clearly, the estimators coming from any finite-size ensemble constructed in this way will have direct, i.e. Monte-Carlo, sampling errors. Now we discuss the more subtle indirect sampling errors. These come from using the sample estimator B e instead of B in the computation of the analysis values. This is a natural step of the Kalman filter/smoother and it helps in incorporating flow-dependent information since this covariance matrix usually evolves in time and it is often impossible to compute it exactly. In this case, the associated cost function is identical to Equation 65 but with D 0∶ replaced by D 0∶ e . This block-matrix is This leads to an ensemble-based total covariance and an ensemble-based gain Then one can construct each analysis member as z 0∶ ,a where we emphasize that the empirical gain K 0∶ z,e has been used. An ensemble constructed in this manner will have both direct and indirect sampling errors. This indirect sampling error is taken into consideration, for instance, in the creation of the EnKF-N of Bocquet et al. (2014). From now on, we will use z 0∶ ,a n e as defined in Equation 69 and not in Equation 64.

Effects of the two sources of imperfection
We now combine the two sources of imperfection and perform a more detailed examination if their consequences. From the last two subsections recall that the subindex g indicates variables related to the guess model error Q 1∶ g , whereas the subindex e indicates variables related to the sample covariance B e . It is clear that some elements will have two subindices, for example K 0∶ ge . For each ensemble member we decompose z 0∶ ,b g,n e and d g,n e as the sum of this expectation and a perturbation: g,n e d g,n e = d + g,n e (70) where 0∶ ,b g,n e is a sample from N ( 0, D 0∶ ) , d defined as in Equation 32 and the perturbation g,n e is g,n e = − ( Hence, we can write the (imperfect) analysis value for each ensemble member z 0∶ ,a g,n e as: The first parenthesis is the update for the means, while the second parenthesis is the update for the perturbations. The empirical gain K 0∶ z,ge appears in both. The structure of Equation 72 makes it difficult to disentangle the contributions from the two sources of error. To proceed we follow what van Leeuwen (1999) and Sacher and Bartello (2005) did for the EnKF. We express D ge as a departure from the exact D 0∶ : The sample covariance B e is a random matrix. Since x 0,b has a MVG pdf, B e follows a Wishart distribution with N x − 1 degrees of freedom. B ∈  N x ×N x is also a random matrix, but it can have both positive and negative values in its entries, and its distribution is not simple. 1∶ Φ is not random since it comes from a wrong prescribed time-scale g .
We can write the ratio 0∶ where we have used the mixed-product property of the Kronecker product to get the bottom-right element. This property states that if A, B, C and D are matrices of such size that one can form the matrix products AC and BD, then (A ⊗ B) (C ⊗ D) = (AC)⊗(BD) (e.g. Golub and Loan, 1983;Horn and Johnson, 1991). The sample-based total covariance at the end of the assimilation window ge is where g = HM 0∶ 0∶ D ( M 0∶ ) T H T is the contribution from the sampling error. Finally the ensemble-based gain K 0∶ z,e can be written as Therefore the ensemble gain is a the real gain multiplied by two factors. The left factor contains the error in the construction of the covariance matrix D e , and the right factor contains the respective error in the total covariance ge at the time of the observation.

Small error approximations
Every expression has been exact up to this point. To continue we require an approximation: we consider that the B e is "not too far" from B, and that g is "not too far" from . To be more precise: We perform a Taylor expansion for the right factor (inverse) in Equation 77, and use Equation 78 to neglect all terms after the linear:  . The guess analysis covariance is A 0∶ z,g contains the exact gain, which means that the reduction in uncertainty due to the DA step is correct. The only source of error is the incorrect time-scale g in D 0∶ g . Finally we have the indirect sampling errors which are formed of two terms: one linear and one a nonlinear product. These expressions allow us to calculate the imperfections in the analysis mean and covariance. Since both K 0∶ z,ge and K 0∶ z,g are constant for all members of a given ensemble, the sample mean is given bȳ where the overbar denotes the arithmetic average. The sam- is more complicated to compute. In a finite-size sample the observational-error covariance R e = R+ is not exact -this comes from perturbing Hx in Equation 63 -and spurious correlations between can arise. The full expression for A 0∶ z,ge is: The inexact R e does not participate in the gain; it only appears in the direct sampling effect. We consider that we have either a second-order exact sampling scheme (Pham, 2002) or a large-enough sample such that both → 0 and ge → 0. With these assumptions, several terms in Equation 90 are null (or at least negligible). If we take the remaining terms, substitute K 0∶ z,ge from Equations 81 and 82, and keep only leading-order terms, we can express A 0∶ z,ge approximately as a departure from the true A 0∶ : with the exact part found in Equation 20, and the error part is The first term in Equation 92 is direct sampling error. It corresponds to the reduction of 0∶ D due to the action of the exact gain. The second and third terms are indirect sampling noise. The second term is the a reduction of D 0∶ due to the use of the inexact part of the gain, and it is linear in 0∶ D . The last term is quadratic in 0∶ D and is called in-breeding, to be discussed later.

3.5
Behaviour of the sample estimators Table 1 summarizes the additive elements of the approximate ensemble analysis mean and covariance in the small error approximation. We separate the exact part and the direct and indirect sampling errors. Both estimators have an exact part and errors coming from both sampling (direct and indirect) and the mis-specification of the memory. The errors arising from 0∶ Φ are not random, and hence do not depend on sample size. The random vari-ables̄0 ∶ ,a g and̄g have zero expected value but incorrect covariances A 0∶ z,g ∕N e and g ∕N e respectively due to the guess for the temporal correlations denoted by g. The presence of K 0∶ z,g d causes a bias because of the dependence on 0∶ Φ . If B were not random, but instead were fixed as a result of an incorrect static estimator of B, it would have similar consequences as 0∶ Φ . In the last part of this section we consider 0∶ Φ = 0 (correct memory), and focus only on the behaviour of random errors coming from B as N e grows. We take the expected value = 0 ∈  ( +1)N x . The third term also has zero expected value. This can be seen by writing explicitly: For the latter we can see this explicitly via: The second term is zero since K 0∶ z, and̄are statistically independent. Hence We can write the first term of (96) as: which is not an easy expression. An application of Basu's theorem (Basu, 1955) states that ifz and z are the sample mean and sample covariance coming from a MGD, they are statistically independent (e.g. Ghosh, 2002). Hence the expected value of their product is the product of their expected values. However, in Equation 98 we have the expected value of products involving transformations of both̄0 ∶ ,b and B e , so the validity of independence may depend on the particular structure of the matrices involved. After the previous examination, we can finally state that which implies that small sampling errors can only produce bias through the nonlinear product K 0∶ z, HM 0∶̄0 ∶ ,b . According to the knowledge of the authors, this has not been noticed in the literature before, and is also true for ensemble Kalman filters when the time dimension is disregarded. In the experiments discussed in the next section we show that this effect can be substantial, and much larger than the estimated error covariances.
The expected value of the sample analysis covariance is: We know that both E z, ] = 0. However, the last term is quadratic in 0∶ D so its expected value is not zero. Therefore which shows that in-breeding (the last term) leads to a consistent underestimation of the real analysis covariance, a result previously found to hold for ensemble Kalman filters (Houtekamer and Mitchell, 1998;van Leeuwen, 1999;Sacher and Bartello, 2005).

ILLUSTRATION WITH A NUMERICAL EXPERIMENT
In this simple example we illustrate the difference between direct and indirect sampling errors. We use a one-step assimilation window and estimate initial conditions and one model jump. In this case the temporal correlation errors play no role. No cycling is performed in this experiment.
We let the size of the system be N x = 250, H = I, M = mI, B = b 2 I, R = r 2 I, Q = q 2 I. The total covariance becomes = 2 I, with 2 = m 2 b 2 + q 2 + r 2 . We let all variances be the same b = q = r = 1, and m = 1. Let the prior mean of x 0 be 0,b = 0. For simplicity we use an observation with the same value in all components: y = 3, where 3 ∈  N x is a vector with a 3 in every position. The exact posterior moments are: 0∶1,a z Figure 7 shows the results of this experiment. Each panel shows a different statistic -(a, b) the means and (c)-(f) the different elements of the covariance matrices. For each panel, the horizontal axis is the ensemble size (in logarithmic scale), and the vertical axis is the value of the estimator. The thick grey line indicates the analytical value in each panel. The blue lines represent the percentiles resulting from using B in the gains, and the red lines represent those resulting from using B e in the gains.
It is clear that the blue estimators only contain direct sampling errors. The spread of the percentiles is symmetric around the expected value, and this spread reduces consistently as the ensemble size increases. The red estimators, on the contrary, present a more complicated behaviour. Looking at the means, there is some bias in the estimation for up to N e ≈ 500, and it has different sign for and 0,a z and 1,a z , and the estimators have large spread. After N e ≈ 500, the spread of the estimator becomes symmetric. This behaviour comes from the nonlinear product K 0∶ z, HM 0∶̄0 ∶ ,b , which presents a serious bias for small and moderate sample sizes.
In Figure 7c-e, we have the diagonal elements of the analysis covariance blocks. In this case the bias in the estimation of the values is again substantial when using B e . In particular we see small spread but large bias in the estimators (slightly larger for A 1,1 z ), which reduces slowly as the sample size grows. This is due to the in-breeding term 0∶ The off-diagonal elements (Figure 7f) show no bias in the estimation, probably because cross-products of the random matrix elements are uncorrelated.

SUMMARY AND DISCUSSION
Model errors have been ignored in atmospheric data assimilation far too long. Efforts have been made to correct this in special cases, but it is important to have a basic understanding of the effect of a more general form of model error.
In this work we have provided the first systematic exploration of the solution over one time window of the (ensemble) Kalman smoother in the presence of temporal-correlated model errors, and the consequences of assuming a guess correlation time-scale that is inaccurate. We have provided exact expressions for the analysis mean and covariance for very general time correlation functions in the model errors. In the univariate case, we performed a deeper exploration of the information flow from observations at the end of the assimilation window to the model variables at different time steps. This information flow is strongly dependent of the magnitude of the model (compressing, identity or expanding) and the magnitude of temporal correlation in the model error.
We have then moved to situations in which our knowledge of the temporal correlations in the model errors in imperfect. First, we considered using a guess for the model error memory, as this is the case often encountered in practice. For instance, fully independent model errors (zero memory) are often represented as fixed by intervals, or fixed in the whole assimilation window (infinite memory). This is done to reduce the computational expense of the problem (Tremolet, 2006). We derived exact expressions for the biases introduced this way, and found that this practical solution can lead to serious errors in the obtained solutions, and an overestimation of the temporal correlations in the model error leads to worse results than an underestimation.
Next we formulated the exact solution for each ensemble member in the case of an finite-size ensemble. We identified the direct and indirect sources of sampling error. The direct sampling errors arise even when using the exact B in the computation of the gain in the problem. The indirect sampling errors come from the effect of using B e in the computation of the gain. In particular, we find that the ensemble-based gain is the exact gain left-and right-multiplied by two factors: The left factor comes from the error in the joint background-model error statistics, while the right factor comes from error in the total covariance. For small errors in both the background and the model error specification, we are able to create an approximate expression for the analysis value for each ensemble member. In this expression we identify the exact solution, direct errors, indirect linear errors, and indirect nonlinear errors.
Finally we computed the finite-size sample analysis mean and sample analysis covariance. We showed that the mis-specification of the model error memory leads to a wrong analysis covariance, and to the presence of a bias in the analysis sample mean. The sampling errors vanish as the sample size tends to infinity, but this occurs slowly because of a nonlinear product and can lead to a bias in the ensemble mean in small-to-moderate sample sizes, which has not been reported before. In the case of the covariance, the mis-specification of the memory leads to a bias, and the sampling errors do not vanish, instead they tend to a negative offset of the analysis covariance. This is the so-called in-breeding which leads to underestimation of covariances. Although some of these results had been established for the EnKF, this is the first time this is explored within a smoother, and it is done while also exploring the interaction with mis-specified model error temporal correlations.
It is important to remember that the Kalman smoother and its ensemble approximation (EnKS) are sequential algorithms. This is, the solution to the problem includes applying these algorithms serially on subsequent time windows. This paper has analysed the behaviour over one window, as the extension to multiple windows is straightforward (in the linear case). We have assumed that the true model has temporal error correlations, but the observations do not.
It is true that we only consider these correlations inside the time window and ignore temporal correlations over the boundary of two time windows. In principle one could update the trajectories in the previous window when observations in the new window become available. Another approach is to use overlapping windows (e.g. Bocquet and Sakov, 2014, provide a discussion). Nonetheless, the trajectories in the latest window would not be affected, since the starting point -the state of the system given all observations up to the start of the window -does not change. Therefore, improving trajectories in previous windows would not be useful when the emphasis is on forecasting, so the results of this paper are especially important for that case. If the emphasis is on reanalysis then ignoring temporal correlations over window boundaries would become important if the temporal correlations are long compared to the window length.
Our next step will be to move to more realistic systems, for instance in the presence of nonlinear model operators and observational processes. In these cases the effect of cycling is not straightforward, and this will be explored in detail. Our experiments will use a system similar to that of Bonavita et al. (2016), but with additive model error instead of multiplicative one (in first instance). Since the solution of the nonlinear problem is a recursion of linearized problems, the results of this paper will provide guidance, but it is clear that many more numerical simulations will be needed in that case.