A review of operational methods of variational and ensemble-variational data assimilation

Variational and ensemble methods have been developed separately by various research and development groups and each brings its own beneﬁts to data assimilation. In the last decade or so, various ways have been developed to combine these methods, especially with the aims of improving the background-error covariance matrices and of improving efﬁciency. The ﬁeldhasbecomeconfusing,eventomanyspecialists,andsothereisnowaneedtosummarize themethodsinordertoshowhowtheywork,howtheyarerelated,whatbeneﬁtstheybring, whytheyhavebeendeveloped,howtheyperform,andwhatimprovementsarepending.This articlestartswithareminderofbasicvariationalandensembletechniquesandshowshow theycanbecombinedtogivetheemergingensemble-variational(EnVar)andhybridmeth-ods.Akeypartofthearticleincludesdetailsofhowlocalizationiscommonlyrepresented. Therehasbeenaparticularpushtodevelopfour-dimensionalmethodsthatarefreeof linearizedforecastmodels.Thisarticleattemptstoprovidederivationsoftheformulations ofmostpopularschemes.Theseareotherwisescatteredthroughouttheliteratureorabsent. Itbuildsonthenomenclatureusedtodistinguishbetweenmethods,anddiscussesfurther possibledevelopmentstothemethods,includingtherepresentationofmodelerror.


Data assimilation and uncertainty
Dealing with uncertainty is at the heart of data assimilation (DA). Forecast models (here those used in numerical weather prediction, NWP) use initial conditions that are imperfect, and are based upon imperfect representations of physical processes. It is well known that a free-running model will accumulate errors until its forecast is no longer useful (Tribbia and Baumhefner, 2004). The only way to restore usefulness is to allow the model to be influenced by observations (Leith, 1993). DA (Daley, 1991;Kalnay, 2002;Rabier, 2005) is the procedure of maintaining the link between evolving models and reality by updating the model with fresh observations. Researchers are working towards formal and robust mathematical methods that work in ways that are consistent with the model, the data, and their degree of uncertainty. DA is related to approaches used in other fields and which go by different names, e.g. state estimation (Wunsch, 2012); optimization (Biegler, 1997); history matching (Emerick, 2012); retrieval production (Rodgers, 2000); inverse modelling (Tarantola, 2005)), and there are additionally many ways of solving a DA problem. Most known DA methods are based on probabilistic theories (most, if not all, exploit Bayes' Theorem (Lorenc, 1986)) and each is made practical by making approximations.
Traditionally there are three Bayesian-based strategies that allow the DA problem to be solved in approximate (hence suboptimal) ways. These may be categorized as the following. (i) Variational DA (Var, or its specific implementation 3D-Var or 4D-Var) implements an algorithm to minimize a cost function. In its basic form, Var gives a single, (quasi-) optimal analysis state based on an a priori state (the background or forecast), some observations, and the prescribed Gaussian uncertainty statistics for the background and observations. Var requires linearized versions of the observation operators and 4D-Var requires a linearized version of the forecast model. (ii) Ensemble DA, based on the ensemble Kalman filter (EnKF), offers a priori error statistics from an ensemble instead of from a prescribed source, which changes as the system evolves, and does not require use of linearized models. (iii) Monte-Carlo methods additionally allow the assimilation of information from sources that have non-Gaussian errors. A box with a straight (curved) top represents a 3D (4D) scheme (the relevant background-error covariance matrix is specified in the upper portion). A box with a curved (straight) base represents a Var (sequential) scheme. Multiple overlaid boxes are for ensemble schemes. The numbers of control variables are specified and, where localization is used, values are given for the (Buehner, 2005) (B05) and/or (Lorenc, 2003) (L03) methods. The pure Var schemes are boxes 1 , 2 and 3 . The lower dashed box encloses the pure ensemble schemes and the upper dashed box encloses the EnVar schemes. The broad dashed lines show the development path of the theory (for instance hybrid En4DVar (box 11 ) is built from pure En4DVar, 7 , which is itself built from 4D-Var, 2 , and an EnKS scheme, 5 ). The solid arrows show the basic flow of information in a working system, where multiple (single) arrows represent the transfer of an ensemble of states (single state). Alternative names of schemes are in square braces. Acronyms and symbols are defined in the specified sections of this article.
Approaches (i) and (ii) have been successful in NWP, but have limitations (sections 1.2 and 1.3). In recent years this has led to an explosion of techniques that combine them in the hope of maximizing the benefits and eliminating the inadequacies of the separate methods. This includes so-called 'pure EnVar' schemes (section 1.5) and 'hybrid EnVar' schemes (section 1.6). The modern DA toolbox now includes a spectrum of methods that are often subtly different in their name, but may be profoundly different in their treatment of uncertainty (especially in nonlinear systems), in their efficiency, and in the way that the equations are solved (e.g. Buehner et al., 2015b;Liu and Xue, 2016). The boxes in Figure 1 represent the main methods, including Var, ensemble, pure EnVar and hybrid EnVar methods and how they are linked (the Figure caption gives details). The suggested nomenclature to distinguish the methods (Lorenc, 2013) is shown and the Figure will be explained throughout the article.
The aim of this article is to firstly remind ourselves of the basic forms of (i) and (ii), to then review the methods that combine them (pure EnVar and hybrid EnVar), and to summarize the nomenclature. The similarities, differences, advantages, disadvantages and successes of each new method are reviewed. It is beyond the scope of this article to review in depth the numerous ways of implementing each traditional method (such as parallelizable implementations of 4D-Var, and the many implementations of the EnKF such as the stochastic and squareroot formulations (Ehrendorfer, 2007;Houtekamer and Zhang, 2016). A review of (iii) is given by van Leeuwen (2009).
The structure of this article is as follows. In the remainder of this section we outline the basic principles of each traditional and combined approach, and mention the advantages, disadvantages and limitations of each. In section 2 we derive common forms of pure variational methods. In section 3 we review the basic principles of pure EnKF-related methods. In section 4 we review how DA systems can be formulated by replacing the fixed background-error statistics of Var with ensemble-derived statistics, while maintaining the variational framework. Broadly, there are two approaches -one that requires the linear models of 4D-Var and another that does not. In section 5 we review hybrid methods, which are DA systems that merge the prescribed and ensemble-derived background-error covariances, again in ways that may or may not require linearized models. For reasons of simplicity, up to this point in this article the inevitable ensemble sampling noise is not considered. In section 6 we review state-space localization methods which reduce sampling noise. In section 7 we give localized forms of the methods discussed in sections 3-5, and we present a summary of accepted nomenclature. In section 8 we discuss how they could handle model error. In section 9 we compare how they perform in practice, and in section 10 we summarize which schemes are used at leading centres and groups. In section 11 we provide a brief summary and some closing comments. In the appendices we give some important derivations which may help understanding of the theory and efficient implementation of the new assimilation schemes.

Data assimilation challenges
The objective of DA is to produce information about the posterior probability density function (PDF). This is produced as a first moment (the analysis), or as information about the second moment (e.g. from an ensemble). These can be used as initial conditions for weather forecasts. Observations influence the posterior by matching synthetic versions of the observations (found from the model state via observation operators) to the real observations. This process depends on the nature of each observation (whether it is in situ or remotely sensed) and the time of each observation relative to the analysis time. Most observations nowadays are asynoptic (or asynchronous), meaning that respecting the detailed temporal nature of the problem is essential. This problem poses many challenges, including the following.
• Analyses must be produced even in unobserved regions and times, hence the need for prior information (from a background forecast or an ensemble). • All data are imperfect and uncertainties between different pieces of data can be correlated. The statistics of this information are needed for DA to work well, so uncertain data, for example, should be given lower weight than more precise data. In Var the background-and observationalerror covariance matrices are specified and in the EnKF the background-error covariance matrix is estimated from the ensemble. The background-error covariances can significantly affect the analysis (e.g. Navon et al., 2005) and if they are incorrectly specified then the DA is suboptimal. An incorrect background-error covariance matrix can actually lead to analyses being worse than the prior (Morss and Emanuel, 2002). • All NWP models are imperfect. This is accounted for in weak-constraint 4D-Var (section 2.1) and in EnKF methods that include stochastic processes in the NWP model (section 3.2). • Many modes can be excited by DA; these are associated with processes of very different space/time-scales. Modifying the right modes is an issue that goes right back to Richardson's forecast in the 1920s. The use of 4D methods, an appropriate background-error covariance matrix, and initialization (Temperton, 1988) all help with this issue. • All DA problems for NWP are large and so methods must be able to deal with huge numbers of degrees of freedom (typically ≥ O(10 7 ) pieces of prior information and similar numbers of observations). Building DA systems that are practicable is often the task that takes the most effort.
The methods dealt with in this review are designed to deal with these points in different ways.

Variational data assimilation
Variational DA (Talagrand and Courtier, 1987;Schlatter, 2000) is a tool used to estimate a single initial state and a single trajectory of an NWP model. * These are found by minimizing a cost function to optimize the fit of (i) the initial conditions to the background and (ii) the model's version of the observations to the actual observations over a time window. Variational methods have been used in NWP for many years (Courtier et al., 1994;Park and Zupanski, 2003;Rawlins et al., 2007) as they allow assimilation of a wide range of observations, including those remotely sensed (e.g. from satellite and radar). Variational methods are restricted in their ability to quantify adequately the flow-dependent background and model error statistics, which are prescribed in Var by a parametrized scheme (Trémolet, 2007;Bannister, 2008). Such schemes will almost inevitably be incapable of representing properly the true space and time structure of error statistics. The Var equations are derived in section 2. In 4D-Var the background-error covariance statistics do evolve with the flow within each window, and weak constraint formulations can account for model errors. Normally only the first moment (the mode) of the analysis is found and 4D-Var relies on linearized and adjoint versions of the numerical model.

Ensemble data assimilation
Instead of specifying a covariance model, the EnKF (Evensen, 1994;Hamill, 2006;Ehrendorfer, 2007;Houtekamer and Zhang, 2016) uses an ensemble of possible forecasts (section 3) that contains valuable flow-dependent information about the background-error statistics. The EnKF is designed to produce an ensemble of analyses, which will similarly contain information about the analysis-error statistics, and can be cycled to future analysis times. The EnKF does not require linearized or adjoint versions of the model, nor of the observation operators. One of the first centres to use an EnKF operationally was Environment Canada (Buehner et al., 2010a(Buehner et al., ,2010b. Due to cost, the number of ensemble members is too small in practice. Error statistics found from small ensembles will suffer sampling error, and so the statistics will not necessarily be representative of the state. This shows itself in the form of ensemble-derived correlations that are noisy, and variances that are systematically too small (van Leeuwen, 1999;Houtekamer and Mitchell, 2001). The issue with the correlations is problematic when errors between two variables are only weakly correlated in reality (e.g. variables that are widely separated in space). † The issue with the variances will cause the DA to believe that the forecast ensemble is more accurate that it really is. This will place too much trust in the ensemble in comparison with the observations. Left to its own devices, the ensemble will collapse effectively to a single trajectory and the observations will be ignored. This problem in DA is called 'filter divergence' (Houtekamer and Mitchell, 1998). These problems can be lessened with a procedure called localization (which dampens the sample covariances at long range; Gaspari and Cohn, 1999) and ensemble inflation (which artificially increases the spread of an ensemble; Anderson and Anderson, 1999;Hamill et al., 2001) by either multiplicative or additive processes (Kalnay et al., 2007;Whitaker et al., 2008). The ad hoc nature of these fixes is largely unsatisfactory, although methods have been investigated that use adaptive (rather than imposed) moderation functions (Bishop and Hodyss, 2007, 2009a, 2011 and inflation factors that are based on innovation statistics (Bowler et al., 2008).
The EnKF is applied sequentially. The Ensemble Kalman Smoother (EnKS) on the other hand can assimilate observations at times later than the nominal analysis time (Evensen and van Leeuwen, 2000). Elements of the EnKF and the EnKS are covered in section 3.

Non-Gaussian Monte-Carlo assimilation
Monte-Carlo techniques form a range of methods that include the particle filter (PF). The PF represents a probability distribution by an ensemble (the particles) but, unlike in the EnKF, non-Gaussianity of the distribution is represented. PFs do suffer from degeneracy problems (where the filter ignores all but one particle). For this reason, PFs have to be implemented in special ways which are not used currently for operational purposes (van Leeuwen, 2009). PFs are not considered further in this article.

Combined 4D-Var/EnKF assimilation schemes ('EnVar')
Most of this article is about combining Var with the EnKF. Such schemes usually run a Var and an EnKF scheme in parallel with information being exchanged between them, which can be one-way or two-way.
One-way coupling can be achieved by shifting each EnKF analysis member for the ensemble mean to equal the Var analysis ('recentring'). In this configuration the EnKF does not transmit information to Var. This kind of coupling has been used, for example, at the Met Office (Bowler et al., 2008).
Another one-way coupling may be achieved by using the ensemble as a source of background-error covariance information (instead of or in addition to climatological covariances) in a Var system to give some flow dependency. This is the basis of En3/4DVar and 3/4DEnVar methods (section 4). Two-way coupling may be achieved, e.g. by combining En3/4DVar or 3/4DEnVar with recentring.

Hybrid data assimilation
Var is suboptimal partly due to the unrealistic imposed error covariance matrix, and the EnKF is suboptimal partly due to under-sampling. Hybrid DA has been developed to accentuate the best features of each source of background-error information, namely the statistically robust background-error covariances of Var and the flow-dependency of the EnKF. Mathematical approaches have been developed to do this, and most are solved with a (deterministic) Var-style method but with added ensemble information. The simplest solves a Var-like problem, but replaces the static background-error covariance matrix with a weighted average of itself with the error covariance matrix from the ensemble. Efficient methods have been developed to deal with the very large matrices used in real problems, while at the same time attending to the ensemble sampling error problem (section 5).

Notation
The schemes reviewed in this article try to tackle the above challenges, but they become complex, so a consistent notation is required. We use a notation as close as possible to that of (Ide et al., 1997) and to subsequent literature, although our priority here is to use a common notation throughout the article. To help readers follow the equations, a summary of notation is provided in Table 1.

Pure variational data assimilation
Let the n-element vectors x(t) and x b represent a state of the model at time t and the background state at time 0 respectively (quantities with no time argument are implicitly at t = 0, and underlined quantities will represent augmented quantities throughout the time window). State x(t) is found by propagating x(t − 1) forward by one time step δt by the model, M t−1,t : where the n-element perturbation η(t) is to offset model error introduced over t − 1 → t. Let the p t -element vectors y o t and y x t be the set of observations made at time t, and their modelled counterparts respectively. y x t is related to x(t) via the observation operator H t : 2.1. 4D-Var cost functions

A full-form (non-incremental) cost function
Var has been the workhorse of DA for many years (Le Dimet and Talagrand, 1986;Talagrand and Courtier, 1987;Thepaut and Courtier, 1991;Zupanski, 1997;Rabier et al., 2000;Rawlins et al., 2007). The 4D-Var problem may be posed as a cost function, which is a functional of the T + 1 states in x = {x, x(1), . . . , x(T)} (over the time window t = 0 to T) and then varying x to minimize J (Zupanski, 1997): where a 2 A −1 ≡ a T A −1 a. The interpretation of this cost function is as follows.
• The first term (the background term, J b ) measures the deviation between x and x b . This is calculated in the L 2 norm described by the n×n background-error covariance matrix B 0 . • The second term (the observation term, J o ) measures the deviation between y o t and y x t . This is calculated in the L 2 norm described by the p t ×p t observation-error covariance matrix R t and is summed over the window. Equation (3) does not allow for observations that depend on the model state at more than one time (e.g. precipitation accumulation which at time t depends upon earlier states), although it may be adapted accordingly.
• The third term (the model error term, J Q ) measures the deviation between x(t) and M t−1,t {x(t − 1)}. This is calculated in the L 2 norm described by the n× n modelerror covariance matrices Q t . In J Q we have assumed that model errors are uncorrelated in time (white noise approximation) and have a priori values of zero.
This minimization may be regarded as an inverse problem, which can be solved given the ability to solve the forward problems (M t 1 ,t 2 and H t ) and then minimizing J(x). Cost function (3) is weak-constraint, meaning that x does not have to be exactly a model trajectory (Sasaki, 1970;Trémolet, 2006). This means that we can formally account for imperfect models in the DA problem.
A common approximation follows by setting Q t = 0 which forces the states to follow a model trajectory exactly. This is called strong constraint 4D-Var, and such systems have been used operationally for some years (e.g. Rabier et al., 2000;Rabier, 2005;Gauthier et al., 2007;Rawlins et al., 2007). In this approximation only the state at t = 0 needs to be determined as all subsequent states follow from the model equations ((1) with η = 0), and the J Q term no longer appears. Table 1. Summary of key notation used in this article, where n is the size of the state space, p is the total number of observations over the assimilation window (p t at time t), T is the number of time steps, N is the number of ensemble members, and ∀t here means 'all times'.

Symbol
Description Size State vector: time t, 0, ∀t n, n, (T +1)n ( b background, a analysis, (i)   In general, an underlined symbol represents that quantity at all relevant times over the assimilation window, and a δ prefix in brackets indicates that the quantity may be represented as a perturbation.

An incremental cost function
Cost function (3) is a full-fields approach to DA. A more practical approach is incremental DA, which deals with perturbations made to known reference states (Courtier et al., 1994) and has some advantages over full-fields DA.
The cost function has a quadratic form if M t 1 ,t 2 and H t are linear (which they are not in general). In practical terms, a non-quadratic cost function is difficult to minimize, especially if it possesses multiple local minima. A quadratic problem emerges by linearizing M t 1 ,t 2 and H t about a guess (or reference) state, x g , and formulating the problem in terms of perturbations (increments) to x g . In this article we will take x g = x b for simplicity. The result is a modified cost function which is an approximation to the original but is easier to minimize.
The 4D state x b is found by integrating the model over the window using (1) with the assumption that η(t) = 0: This can be perturbed to give a general state x(t) as follows: Substituting these into (1) gives: where H t = ∂y x t /∂x(t). Substituting (9) into (3) gives the incremental cost function: These equations represent incremental weak-constraint 4D-Var ( 1 in Figure 1). Minimizing J inc (over an inner loop) gives particular values of (δx, δη) = {δx, δη(1), . . . , δη(T)}. The strong-constraint 4D-Var version is obtained by setting all δη(t) = 0 in (10) and (11), which is valid when model error is negligible over the window ( 2 in Figure 1). For practical reasons Var uses a B 0 -matrix (and weakconstraint 4D-Var uses Q t matrices) that is (are) essentially the same for each cycle, although flow-dependence can be added artificially to the way that B 0 is modelled by, for example, cycling variance information (Dee, 2002) and the use of (linearized) nonlinear balance relationships (Fisher, 2003;Barker et al., 2004).

Further simplifications
Strong-constraint 4D-Var may be further simplified to exclude part of the time dimension (Courtier et al., 1998;Gauthier et al., 1999;Lorenc et al., 2000;Gustafsson et al., 2001). The most popular is called 3D-FGAT (3D-Var with the First Guess at the Appropriate Time, but also known as just 3D-Var), which emerges from the strong-constraint version of (10) with the assumption that M t 1 ,t 2 ≈ I ∀t 1 , t 2 ( 3 in Figure 1): (Lee et al., 2004;Lawless, 2010), where δx is here valid at the centre of the observation window. The 3D aspect is due to the loss of the time dimension through the neglect of all M t 1 ,t 2 and the FGAT aspect is due to the preservation of time in δd t ((12) still holds for d t ). Notably, 3D-FGAT was used to produce the ERA-40 reanalysis (Uppala et al., 2005). Note that even operational 4D-Var systems in practice use approximations, e.g. the Jacobian M 0,t does not normally contain linearizations of the complete model, as many physical processes are problematic to linearize (Xu, 1996).

The derivative of the 4D-Var cost function
The minimum of J inc is found iteratively (e.g. Liu and Nocedal, 1989) with algorithms that require calculation of the gradient of (10), ∇J inc . This is a column vector of the same structure as the argument of J inc in (10), i.e. (δx, δη), but of derivatives as follows: where the ith components [•] i of the sub-vectors are: There are n components of each sub-vector, so ∇J inc has (T + 1)n components in total. (For strong-constraint 4D-Var and 3D-Var, elements exist with respect to only δx in (14) so ∇J inc has just n components.) The gradient is important because in phase space −∇J inc points in the direction of greatest descent of J inc . Components of ∇J inc are derived in Appendix A to be: 4D-Var is limited in its scope for computational parallelizability. The estimation of δx and δη(t) -as in formulation (10) -cannot be parallelized temporally (e.g. for δη(t): this requires integration of the forward linear model from earlier times in (11), and the integration of the adjoint model from future times in (16)). There are alternative formulations though that may allow parallelization of 4D-Var in time (M. Fisher, 2011;personal communication;Desroziers and Berre, 2012) .

Control variable transforms in Var
The cost function (10), with definitions (11) and (12), requires the matrices B 0 , R t , and Q t to be known explicitly, but even modern computers are incapable of dealing with such large matrices. The method of control variable transforms (CVTs) is a 'trick' to represent B 0 and Q t without needing to know them explicitly. ‡ Instead of dealing with (10) -a function of (δx, δη) -an alternative cost function is formulated in terms of new variables ('control variables'). There are number of key studies that describe the choice of control variables used for meteorological DA (e.g. Parrish and Derber, 1992;Derber and Bouttier, 1999;Berre, 2000), for ocean DA (Weaver et al., 2005), and in general (Bannister, 2008;Ménétrier and Auligné, 2015).
The following transformations may be made for δx, and δη(t) in (10): Here δχ var and δγ var (t) are 'control variables', which are associated with the model space variables δx and δη(t) respectively, and U and V t are the CVTs. Vectors δχ var and δγ var (t) need not have n elements each, but for the purposes of this article we shall assume that each is an n-element vector.
Substituting (17) and (18) into (10) gives the preconditioned cost function: and substituting into (11) gives: (d t is still defined by (12)). The aim is to have control variables that have error covariances I, which is achieved by choosing U and V t such that U T B −1 0 U = I and V T t Q −1 t V t = I. Setting these in (19) gives: The gradients in control space follow from (15) and (16) with the chain rule: and Cost function (21) is minimized with respect to (δχ var , δγ var ), which is an easier and better conditioned problem than minimizing (10) with respect to (δx, δη) (Lorenc et al., 2000;Haben et al., 2011) and useful physical constraints can be built into U (Ingleby, 2001;Bannister, 2008). Importantly the two problems are equivalent given: and The optimum values of (δχ var , δγ var ) relate to (δx, δη) with the CVTs (17) and (18). In practice U and V t are modelled (leading to implied covariance matrices above) by making simplifying assumptions about how errors are thought to be related (Bannister, 2008), and the CVT viewpoint will be key to understanding EnVar methods (section 4 onwards). In most of the remainder of this article we do not consider model error.

Pure ensemble data assimilation systems
Var has been a success in NWP but is limited by the B 0matrix, which influences the quality of the analysis . Despite its importance, B 0 has to be estimated -usually crudely -due to its large size. It is also assumed to be static (at the start of each cycle), even though in reality backgrounderror statistics change in time. Ensemble methods (e.g. the EnKF) circumvent this problem by representing background uncertainty with an ensemble. We show briefly how the EnKF approximates background-error statistics and mention some basic EnKF formulations. However, to avoid complications here, we leave discussion of localization until section 6.

Background uncertainty represented by an ensemble
Consider a population of N background forecasts, § each labelled with index (i) x b (i) (all valid at t = 0). The n × n background-error covariance matrix P b is approximated by this population: Here P b is the true (and full-rank) matrix, approximated by B 0 in Var and by P b (N) in the EnKF. The overbar in (27) indicates the sample average. P b (N) is approximate (rank-N −1) due to sampling error as N is finite (section 6). A compact way of writing P b (N) is given in (26) where the columns of the n×N matrix X b are Evensen, 1994).

Ensemble filters
Substituting P b (N) into the Kalman filter (KF) update equation (Jazwinski, 1970;Evensen, 2003) gives: where The ensemble of analyses x a (i) can be cycled to the next observation time.
The key idea is that (28) does not need the explicit n×n matrix P b (N) . Equation (28) is a stochastic formulation of the EnKF (Houtekamer and Mitchell, 1998;Burgers et al., 1998) -socalled because the observations are perturbed stochastically to give the correct spread of the analysis ensemble. Other important variations of the EnKF (e.g. the ensemble transform Kalman filter) use deterministic formulations (Anderson, 2001;Bishop et al., 2001;Tippett et al., 2003) which do not require perturbed observations. (It is outside the scope of this article to review the many flavours of ensemble filters.) The EnKF is 4 in Figure 1.
The last line of (28) is written so that it does not require the adjoint of the observation operator. The linear operation HX b is also often approximated by the nonlinear operations The EnKF can therefore be implemented without the need for linearized operators (this includes the linearized forecast model in the case of the ensemble smoother -section 3.3). This represents a significant reduction of development effort compared to Var (Lorenc, 2003). Many studies show that the EnKF is a useful tool in NWP, often comparable or better than Var (e.g. Zhang et al., 2006;Meng and Zhang, 2007;Bonavita et al., 2008;Zhang, 2008a,2008b;Szunyogh et al., 2008, Whitaker et al., 2008Hamill et al., 2011aHamill et al., ,2011bZhang et al., 2011, Houtekamer et al., 2014. The differences between EnKF and Var can be most evident in data-sparse regions (Whitaker et al., 2008). The background-error covariance matrix in the ordinary KF is assumed to include predictability errors (propagation of analysis error from the previous analysis, generically MP a M T ) and a model error contribution (Q). Since the model producing the x b (i) is imperfect, model error should be allowed to affect the spread of the ensemble background states (the system simulation approach of Houkekamer et al., 1996). There are many strategies for accounting for model error, including covariance inflation  and additive error . These approaches have been tested in an EnKF with a simplified primitive-equation model to represent the model error introduced due to unresolved scales (Hamill and Whitaker, 2005). However the difficulty is choosing the degree of inflation and the structure of the additive error. Hamill and Whitaker (2005) chose to prescribe inflation factors a little over unity (although more advanced approaches are possible, e.g. Raynaud et al., 2012), and they chose additive errors from three options: random samples of differences between forecasts of different resolutions, random samples of differences between forecasts and climatology, and random samples of forecast tendencies. It is also possible to relax the analysis ensemble spread closer to the spread of the forecast ensemble (Zhang et al., 2004).

Ensemble smoothers
In the EnKF, x b (i) , y o (i) , and x a (i) are valid at t = 0, but information is propagated from one time to the next via cycling (sequential DA). An EnKS (Evensen, 2007) on the other hand introduces the time dimension into the analysis problem itself so it can deal directly with observations spanning a time window (0 ≤ t ≤ T). We discuss two ways of implementing an EnKS, which later influence how EnVar is developed.

An EnKS with a 3D state vector
The EnKF (28) may be adapted by keeping its 3D state vector (valid at the start of the window), but incorporating the forecast model into the observation operator (as in 4D-Var) for observations made at t > 0: where is the time-distributed observation operator, and H M is its Jacobian: where H t (•) means use the quantity to the right as an argument and H t and H t are defined in section 2. The above also serves to define the operators H, M, H and M. Furthermore the linear step H M X b can be avoided by approximating . This EnKS is 5 in Figure 1.

An EnKS with a 4D state vector
An EnKS may be alternatively formulated with 4D state vectors (van Leeuwen and Evensen, 1996), contained in the (T + 1)n×N- (27)): and a similar ensemble exists for the analysis, X a . An EnKS can be derived by underlining all quantities (e.g. X b → X b ) in (28): , and H(x b (i) ) and H are defined in (30) and (31). Again the linear step HX b can be avoided with the approximation . This EnKS is 6 in Figure 1. An EnKS may also be posed in a formulation which considers observations in batches over the window (Evensen and van Leeuwen, 2000).
The EnKS formulations with 3D and 4D state vectors will be important in the next section where they form the basis of methods that we shall call En4DVar and 4DEnVar respectively.

Pure EnVar: combining an ensemble with Var
The last decade has seen experimentation with combining the EnKF with Var. Perhaps the simplest means of doing this is to use a Var analysis to recentre the EnKF analysis ensemble. This (one-way) coupling has been used by Zhang et al. (2009) with the simple Lorenz 96 model (Lorenz, 1996), and by Bowler et al. (2008) with MOGREPS (the Met Office Global and Regional Ensemble Prediction System).
However, in this section we look at one-way coupling the other way round and review two (related) methods of using the EnKF ensemble forecasts to define the background-error statistics in Var (collectively known as 'EnVar'). We focus on 4D methods (3D methods follow in the same way as in section 2.1.3). The two methods differ in the use -or not -of the linear/adjoint models as used in 4D-Var. Following the nomenclature recommended by Lorenc (2013), we call a pure EnVar system that uses a linear model 'En4DVar' (the labels 'E4DVar', '4DVarBen', and '4DVarBenkf' have also been used by authors). In all such labels, '4DVar' appears, indicating that it uses the same linearized machinery as 4D-Var does. Also following Lorenc (2013), we call a pure EnVar system that does not use a linear model '4DEnVar'. En4DVar and 4DEnVar are discussed below (without localization at this stage). Bear in mind that this nomenclature is not consistent with all the literature, especially in articles prior to Lorenc (2013). Liu et al. (2008Liu et al. ( , 2009) for instance called their method En4DVar even though it does not use a linear model, but this has been changed to the recommended 4DEnVar in later articles (e.g. Liu and Xiao, 2013).
Note that pure En4DVar and pure 4DEnVar use background errors from one source (here the ensemble), but further coupling can be achieved by merging the ensemble statistics with B 0 of Var. This forms another family of methods called 'hybrid EnVar' (Lorenc, 2013) (section 5) but again this term is not always used consistently in the literature.

Pure En4DVar: requiring linear/adjoint models
A deterministic analysis x a (essentially the ensemble mean of (29)) may be found using a Var technique. Consider the cost function (Liu et al., 2008): which is a function of a new N-element control variable χ ens , and . The latter of the two equations uses the time-compact (underlined) notation. This control variable comprises coefficients that multiply the ensemble perturbations in the linear combination as follows: (Lorenc, 2003). Here X b acts as a CVT, which is taken from a parallel EnKF system (or even from a population of NMCstyle forecast differences , where the National Meteorological Centre (NMC) method approximates forecast errors with a set of lagged forecasts (Parrish and Derber, 1992)). To see how (34) (20) [where in (21) and (20) no model error is considered, δγ var (t) = 0]). These systems have the same form, except that the CVT U in (17) becomes X b in (36). The implied B-matrix of En4DVar is then akin to (24), i.e. X b X b T , as consistent with (26). Minimizing (34) is efficient as it is well conditioned, and χ ens has only N elements. At the minimum of (34) χ ens = χ a ens , which leads to the analysis x a = x b +X b χ a ens . The gradient of (34) (Appendix A) is required for the minimization and has the forms: where, as in 4D-Var, the linear and adjoint operators appear. This method is formally called En4DVar ( 7 in Figure 1) and the 3D counterpart is called En3DVar ( 8 ). An accepted alternative name for this method is 4D/3DVarBen (Lorenc, 2013), which translates as 4D/3D-Var with background-error covariance matrix implied from an ensemble. Our recommended nomenclature (broadly in line with Lorenc (2013)) is summarized in Figure 1 (but note the comments at the start of section 4). Minimizing J En4DVar when the model and observation operators are linear is equivalent to the mean EnKS solution in section 3.3.1 (Appendix C).

Pure 4DEnVar: avoiding linear/adjoint models
Just as the linear and adjoint operators can be avoided in the EnKF and EnKS (section 3), a similar approximation can be applied to En4DVar. For p t observations at time t, the combination H t M 0,t X b can be translated to the explicit p t ×N-element matrix Y x t using the nonlinear operators H t and M 0,t (Liu et al., 2008): where y x t is the model observation vector at time t based on the ensemble mean, y x t = H t {M 0,t (x b )} and Y x has dimension p×N. The (time-compact) cost function and gradient of this formulation are adapted from (35) and (37) using (39): (32)) may be computed in advance, which means that this method is akin to using an EnKS with a 4D state vector (section 3.3.2). In (40) and (41), the CVT is δx = X b χ ens . This method is formally called 4DEnVar ( 9 in Figure 1) (Desroziers et al., 2014;Fairbairn et al., 2014), and is similar to methods of Hunt et al. (2004) and Tian et al. (2008). The 3D counterpart is 3DEnVar (although it is essentially the same as En3DVar) and alternative names are given in sections 4 and 7.5.

Comments on EnVar methods
For En4DVar and 4DEnVar the implied background-error covariance at t = 0 is X b X b T , and between times t 1 and En4DVar and 4DEnVar are each used typically to find a single deterministic analysis per analysis cycle, and the perturbed ensemble members, X b , are used in these methods only to define the CVT. There are advantages and disadvantages of EnVar over traditional Var: • They do not require a background-error covariance model as Var does. This is important when modelling background errors in B 0 involving processes that are too complicated, nonlinear or when geophysical balances are not relevant. • The control vector χ ens has N elements while the control vector δχ in 3D-Var or strong-constraint 4D-Var has n elements. As N n, this represents an efficient problem that has shrunk in size to reflect the low rank of P b (N) . • 4DEnVar does not need linearized models, but 4D-Var and En4DVar do. If NWP centres no longer have to develop, maintain and run linear and adjoint models, this results in a significant cost gain, especially if linearizing highly nonlinear processes with M can be avoided (e.g. Xu, 1996;Stiller and Ballard, 2009;Stiller, 2009). As a consequence however, other quantities that rely on the adjoint will no longer be available, such as adjoint sensitivity analysis (e.g. Benedetti et al., 2003) or Hessian singular vector calculations (e.g. Lawrence et al., 2009). • As with the EnKF/S, EnVar has a suitability for parallel processing for computing the matrix Y x . Such systems have limitless parallelizability with the number of ensemble members. • The low-rank property of the implied background-error covariance matrix in EnVar means that sampling error problems will inevitably arise when N is small. This usually requires some kind of mitigation such as localization (section 6).

Generating an ensemble within EnVar
The EnVar forms given in sections 4.1 and 4.2 produce only a single analysis (the mode of the posterior), ¶ while the EnKF/S produce an ensemble of possible analyses. However EnVar permits further constraints to the analysis, such as a tangent linear normal mode constraint (Kleist et al., 2009;Wang et al., 2013) or a variationally-based initialization term applied with an extra term-normally called J C -added to the cost function ¶ Instead of a single analysis, such minimization problems can, if required, be repeated for each ensemble member i, e.g. (Liu and Xiao, 2013), where the following substitutions are made in (35):  (Clayton et al., 2012;Ge et al., 2012), which can help the analysis to be close to a defined balance.
We also mention a couple of related EnVar schemes that are capable of producing an ensemble. The 'Maximum Likelihood Ensemble Filter' (MLEF) (Zupanski, 2005) can also bypass use of the linearized model. MLEF, which preconditions the variational problem on the Hessian, has been demonstrated in a 3D context. This preconditioning replaces (36) with the following CVT: where in 3D S = R −1/2 HX b T R −1/2 HX b , which is written in a way so that the H (as before) can be approximated with two runs of the nonlinear observation operator. Hessian preconditioning not only allows a very efficient minimization but it also allows calculation of an analysis ensemble, so the method can be used without a separate ensemble generator. The 'Ensemble Variational Integrated localized (or Lanczos)' (EVIL) scheme of Auligné et al. (2016) is another method (discussed further in section 5.5 where it is applied to a hybrid setting).

Hybrid methods
According to the definition in Lorenc (2013), a hybrid method refers to one that blends B 0 from Var with P b (N) from an ensemble (recall B 0 is full-rank but quasi-static and crudely modelled, and P b (N) is flow-dependent but rank deficient). Six ways of combining these matrices are discussed here. The first is an ensemble-based recalibration of the B 0 -matrix in Var; the second and third are (equivalent) ways of averaging B 0 and P b (N) ; the fourth averages the gain matrices instead of the covariance matrices; the fifth returns to averaging B 0 and P b (N) but provides an ensemble of analyses as part of the variational iterations; and the sixth combines B 0 with P b (N) , but not by averaging them. Hybridization introduces climatological error information into EnsVar (where the hybrid system becomes less sensitive to ensemble size than the pure EnKF; Zhang et al., 2009). The use of hybrid covariances has long been thought to be useful (Gustafsson, 2007;Bishop and Satterfield, 2013). Here we shall consider the former viewpoint (equivalent to hybridization of, for example, 4D-Var with En4DVar). From this perspective, the hybrid background-error covariance matrix, here denoted B h , replaces the usual B 0 -matrix in Var. Localization is not discussed until section 6 since at this stage it is a distraction from the hybridization procedure.

Ensemble recalibration of B 0 in Var
Some operational centres use a parallel ensemble prediction system to continuously recalibrate the model of B 0 in their Var system (this blending of ensemble and climatological information would arguably classify this as a hybrid method, so we should call it a B h -matrix). Météo-France and the ECMWF, for example, use the spread of their ensemble prediction systems (initialized not from an EnKF but from ensembles of perturbed 4D-Var assimilations; Belo Berre et al., 2007Berre et al., , 2009Isaksen et al., 2010) to model the 'variances of the day' of B h (Bonavita et al., 2011(Bonavita et al., , 2012Raynaud et al., 2012). The ensemble is known to have significant sampling errors (and so needs to be filtered) and is known to be under-spread (and so needs to be inflated).
Recently the ECMWF and Météo-France extended their hybrid schemes to include on-line calibration of the spatial correlations in their B-matrices. At the ECMWF, e.g. these spatial structures are described by a 'wavelet diagonal' scheme (Fisher, 2003), which can represent different correlation length-scales for different geographical regions. The original implementation of this scheme used a climatological calibration, which implied static background-error length-scales, but these are known in reality to be flow-dependent (e.g. longer length-scales in higher pressure systems). The ECMWF has experimented with two ways of calibrating their wavelet scheme, namely using samples of background errors from either (i) a 12 day window (25 ensemble members, twice per day, leading to 600 samples), or (ii) a combination of current members (25 members within an 8 h window -200 samples) plus samples representing climatology (400 samples spread over the seasons, leading again to 600 samples in total) (Bonavita et al., 2016). The 600 samples in each case are considered necessary to calibrate the wavelet model without filtering or localizing. Even though the forms of their B hmatrix models (Derber and Bouttier, 1999;Fisher, 2003) are not changed by recalibration, important flow-dependence is added as a consequence of recalibration with either (i) or (ii) above, resulting in modestly improved forecasts right out to 10 days in some quantities (Bonavita et al., 2016).

The explicit average of error covariance matrices
The classic hybrid background-error covariance matrix, B h , is defined according to Hammill and Snyder (2000) as the weighted average of B 0 and P b (N) : where β (0 ≤ β ≤ 1) is a tunable factor controlling the weighting of the static and flow-dependent covariances, and B h replaces B 0 in (10). The use of explicit matrices is obviously impractical for large systems and so methods of using B h implicitly are introduced.

The implicit average of error covariance matrices
This method merges the CVTs of (17) and (36) where (34)). In (44) the augmented control vector (δχ var , χ ens ) has n+N elements where δχ var is associated with B 0 (as δχ var in section 2.3), and χ ens associated with P b (N) (as χ ens in section 4). Recall that n is the size of the state, N is the number of ensemble members, U is the original n×n CVT of section 2.3, and X b is the n×N matrix comprising the ensemble perturbations The hybrid En4DVar cost function can be compactly written as where χ h = δχ var χ ens .
For conciseness, the formula for the gradient of J HEn4DVar with respect to χ h is left to section 7, where localization is also considered. The CVT for (46) relates δχ var and χ ens to δx at t = 0: (cf. (17)). The hybrid CVT which consolidates U and X b is called U h , which is the n×(n+N) matrix √ 1 − βU √ βX b . In the χ h representation, background errors have covariance χ h χ h T b = I, which implies the following error covariance in the model representation (δx): where • b performs an expectation over the background population. Equations (49) and (43) are identical, which shows that minimizing (44) with respect to δχ h (with CVT (48)), and then transforming the resulting control variable increment into model space with (48), is the same as minimizing the strongconstraint 4D-Var cost function with B 0 → B h . This is hybrid En4DVar (11 in Figure 1, 12 for En3DVar). A benefit of form (44) over (43) is efficiency as it avoids prohibitively large matrices. Wang et al. (2007b) provide an alternative derivation of (49). Although this hybrid method of representing the background-error covariance matrix has been applied to strong-constraint 4D-Var, it may also be used with the weak-constraint formulation (21). When β = 1 (β = 0), this system is identical to pure En4DVar (pure Var) (section (4.1)).

Combining gain matrices
The most common type of hybrid schemes average B 0 with P b (N) , but an alternative is to follow Penny (2014) by averaging the Kalman gains of pure Var (gain K var ) and of pure EnKF (gain K ens ) to give K h . For 3D schemes, e.g. where and β i are scalars. K ens and K var capture the way that the respective schemes work but are not computed explicitly. Penny (2014) applied this averaging to 3D-Var and an LETKF with the Lorenz 96 model (Lorenz, 1996) and the choices β 1 = 1, β 2 = α, and β 3 = −α. This leads to the hybrid gain: and the hybrid analysis . This analysis is equivalent to first running the EnKF to give the ensemble mean x a ens , running Var with this as the background to give x a var , and then doing the following weighted average to give the hybrid analysis x a h =x a ens +αx a var . Penny (2014) found that this kind of hybrid improves the analysis over those of the separate pure schemes. However the benefit of this hybrid is that it requires little extra coding to existing EnKF and Var schemes.
It is possible to allow the β to vary with vertical level, as is done by Buehner et al. (2013). Some authors put the √ 1−β and √ β terms on the ensemble and climatological terms respectively instead of the way round in (48). Other authors (e.g. Wang et al., 2008b;, prefer to set the β weights in the cost function instead of in the CVT. In this case the background 'Var' and 'ens' terms would contain the extra factors 1/(1−β) and 1/β respectively and the √ 1−β and √ β would be omitted in (48).

EVIL
Most EnVar systems use a separate EnKF-style system to generate X b for each cycle. The fact that the EnVar and EnKF are separate can lead to inconsistencies, and hence to sub-optimality in hybrid EnVar. EVIL (Ensemble Variational Integrated Localized [or Lanzos]) (Auligné et al., 2016) represents a modification to hybrid EnVar which uses information gained in the Var minimization procedure to estimate an analysis ensemble, instead of the usual single analysis. EVIL is based on the fact that conjugate gradient (CG)-based minimization algorithms are closely related to Lanczos methods (Paige and Saunders, 1975;Fisher and Courtier, 1995;El Akkraoui et al., 2013). In outline, the gradient descent vectors from q iterations of the CG procedure form a Krylov sub-space in which the Hessian of the cost function is tridiagonal (q here takes the role of N). Typically q ∼ O(50), so such a Hessian can be easily diagonalized. The eigenvectors of the Hessian in control space -stored in the n×q matrix Z q -and the eigenvectors -in the q×q diagonal matrix -are called Ritz pairs. In general in quadratic systems, the analysis-error covariance matrix, A, is the inverse of the Hessian. This leads to the following approximation for an n×q analysis ensemble, X a in 3D (Auligné et al., 2016): where Z q −1 q Z T q is the analysis-error covariance (in control space) found from the CG/Lanczos procedure, Y o is a p×q matrix of stochastically perturbed observations, and U h is the hybrid CVT (48). (Note that the 3D objects like X b , H, etc. could be extended to 4D with the underline notation of sections 3.3.1 and 3.3.2 as appropriate.) The EnVar minimization would provide the Ritz pair information and (54) would provide the machinery to compute an analysis ensemble -then propagated in time for the next forecast ensemble. EVIL is 10 in Figure 1.
Equation (54) is a particular implementation of EVIL called stochastic-EVIL (S-EVIL). Other implementations are deterministic EVIL (D-EVIL), which uses the Ritz pairs to compute a square-root of the analysis-error covariance matrix, and a resampling EVIL (R-EVIL) which can yield additional ensemble members (Auligné et al., 2016). It may also be applied to bi-conjugate gradient algorithms giving dual representations of the Ritz pairs.
The number of iterations, q, may need to be relatively high to obtain a reasonable representation of the correct eigenspectrum of A, and Auligné et al. (2016) suggest that several hundred may be needed in their cut-down (single-level) system. This may not yet be affordable to run routinely, but is an option for future systems. The ability of Z q −1 q Z T q to represent the analysis-error covariance matrix will need to be checked thoroughly, as will the performance of the method with nonlinear operators where the link between the Hessian and the analysis-error covariance matrix is no longer strictly valid.

The ensemble reduced-rank Kalman filter
In (43), B 0 is able to influence how the DA affects all degrees of freedom, including the subspace spanned by the ensemble. In Petrie and Bannister (2011), another hybrid scheme is proposed which, instead of averaging, essentially uses P b (N) in the ensemble subspace and uses B 0 elsewhere. The general method is based on the 'reduced-rank Kalman filter' (or the 'simplified Kalman filter') developed at ECMWF in the 1990s for use with Hessian singular vectors (Fisher, 1998;Fisher and Andersson, 2001;Beck and Ehrendorfer, 2005), but Petrie and Bannister (2011) adapted it for use with an ensemble and called it the 'Ensemble Reduced-Rank Kalman Filter' (EnRRKF).
The cost function for the EnRRKF is a function of the n-element control vector χ EnRRKF : where χ EnRRKF = χ s +χ s , The first part of χ EnRRKF in (56) is χ s (n-elements) which represents the subspace (the part that has flow-dependent covariances) and is non-zero only in the first N elements. The second part isχ s (n-elements) which is the remainder (the part that has climatological covariances) and is non-zero only in the last n−N elements. Due to the zero terms, summing these vectors in (56) does not lose any information. The first line of (55) has three terms due to errors that lie (i) exclusively inside the N-dimensional subspace, (ii) inside and outside this subspace, and (iii) exclusively outside this subspace respectively. The CVT for the EnRRKF is: where U is the same CVT used in pure Var (17) and X is a special n×n matrix (XX T = I and X T X = I) (do not confuse X with X b ). X has the following special properties: when δx lies in the ensemble subspace, X −1 U −1 δx is a vector which can be non-zero only in the first N elements (the last n−N elements are zero), and when δx lies outside the ensemble subspace, X −1 U −1 δx is a vector which can be non-zero only in the last n−N elements (the first N elements are zero). This reflects the structures of χ s and χ s and can be achieved using Householder transforms (Fisher, 1998). Although P b χ −1 appears in (55), it acts only on vectors that are always zero in the last n−N elements. This means that only part of the inverse matrix P b χ −1 (the left-most n×N part) needs to be known, which is achievable for N ∼ O(10)-O(10 2 ). This reduced P b χ −1 matrix is derived in Petrie and Bannister (2011) and has a complicated, but calculable form. To our knowledge, the EnRRKF has not yet been tested, so it is impossible to assess it. Since the flow-dependent part of the error covariance is nearly full-rank in the N-dimensional space (or at least a K-dimensional subspace (K < N) can be defined where P b (K) is full rank), there is reason to suppose that localization in the EnRRKF is not essential. Unlike for the other hybrid methods, the length of the EnRRKF control vector is no larger than that used for pure Var. Note that the implied background-error covariance matrix for the EnRRKF is not of the form UX(UX) T as the χ EnRRKF does not have uncorrelated background-error covariances.

Sampling noise and localization
In ensemble applications in NWP, N n, where n is ≥ O(10 7 ), but N is typically < O(10 3 ). This means that P b (N) (26) is undersampled. The consequences of under-sampling are two-fold: P b (N) usually underestimates the variance (leading to filter divergence) and P b (N) is severely rank deficient; the rank of P b (N) computed with (26) is ≤ (N −1), which leads to the introduction of analysis noise (Houtekamer and Mitchell, 1998;van Leeuwen, 1999;Hamill et al., 2001;Houtekamer and Mitchell, 2005;Ehrendorfer, 2007;Meng and Zhang, 2012;Houtekamer and Zhang, 2016). In normal situations when N n, filter divergence can be mitigated with inflation (Anderson and Anderson, 1999) and rank deficiency can be mitigated with localization (Hamill et al., 2001;Lorenc, 2003).
Localization plays a fundamental role in EnKF-based systems, but before this section, for simplicity, it has been missing from the equations. This section is about how localization is included.
Different means of reducing far-field sampling errors by localization are used. One common method modifies the sample covariances by a Schur product in model space (Lorenc, 2003;Hamill et al., 2001). This is the type of localization that is reviewed below. Another method works in observation space by limiting the observations used to update a particular point to those within a prescribed radius, and/or by gradually reducing the influence of the observations as a function of distance from the point (Houtekamer and Mitchell, 2005). The latter method is used for instance in the local ensemble Kalman filter Szunyogh et al., 2005) and with other ensemble DA techniques (Bishop et al., 2001;Pham, 2001;Szunyogh et al., 2008), but presents problems with interpreting observations with non-local observation operators (Fertig et al., 2008), such as satellite radiance measurements.

Schur product localization applied in model space
Sampling noise due to small N is most significant when the true correlation values are small. Since true correlation values are expected to be small between points that are separated by a large distance, a common approach is to filter computed correlation values according to the distance separating the points. Consider a single field and let the sample covariance between points p and q (at positions r p and r q respectively) be the matrix element . This is multiplied by a moderation function c(r p , r q ) which is unity when r p = r q and goes to zero when |r p −r q | → ∞ (Gaspari and Cohn, 1999). This is localization. In the context of (26), the localized covariance function is where δx b (i) (r p ) is the ith background perturbation (27) at r p . Define an n×n localization matrix C to have elements C pq = c(r p , r q ) and define the localized background-error covariance matrix to be P b (N) , whose matrix elements are P b . This can be written in the following compact way: where the • symbol is the Schur (element-by-element or Hadamard) product, and where a covariance with a hat indicates its localized form. (This hat notation will apply to any quantity: covariance matrices and cost functions, etc.) This localization can be generalized to the multivariate situation (Bishop and Hodyss, 2009a,2009b,2011Bannister, 2015). Matrix C is mathematically a correlation matrix, and should be chosen such that P b (N) remains a true covariance (Gaspari and Cohn, 1999).
The Schur product localization discussed above may be written implicitly, which allows localization to be used in DA without the need for the explicit covariance matrices in (59). The two known forms are labelled 'B05' and 'L03' after Buehner (2005) and Lorenc (2003) respectively (as is done in Wang et al., 2007b).

The 'B05' representation of the localization Schur product
In a similar way to the decomposition of P b (N) as P b (N) = X b X b T (26), suppose similarly that C can be decomposed as C = U C U C T (where the n×M matrix U C is a kind of 'square-root' of C), and P b (N) can be decomposed as P b (N) = X b X b T (where the n×NM matrix X b is defined below). Appendix D shows that P b (N) can be written as: (Buehner, 2005), where δx b (i) / √ N −1 is the vector occupying the ith column of X b , u C (i) is the vector occupying the ith column of U C , and the diag operator evaluates to an n×n diagonal matrix whose diagonal comprises its n-element vector argument. In other words, the columns of X b are formed in (61) from every possible Schur-product-pair of columns of X b and U C , and X b can be thought of as a matrix of NM scaled effective ensemble members whose covariance X b X b T is the localized backgrounderror covariance P b (N) . As NM > N, P b (N) would be expected to have a higher rank (maximum rank ∼ NM) than P b (N) (maximum rank ∼ N). This approach may be used to localize any of the pure EnVar or hybrid schemes discussed in sections 4 and 5 by changing (36) to where χ ens is the corresponding NM-element control vector.

The 'L03' representation of the localization Schur product
Another compact representation of a localized ensemble background-error covariance matrix is often used for the variational formulations of sections 4 and 5 (Lorenc, 2003). This representation is explained with reference to localizing the En4DVar system (34) ( 7 in Figure 1), which involves N new control vectors: and α (i) is the n-element control vector associated with the ith ensemble member, which are assembled into the n×N matrix A, and 1 N is the column vector whose N elements all contain 1. Equation (65) is an unpreconditioned form of L03 localization. A preconditioned form can be devised with N alternative vectors χ (1) to χ (N) assembled into the matrix V: where χ (i) (n-elements) is the preconditioned version of α (i) , V (n×N) is the preconditioned version of A and U C is an n×n potentially full-rank square-root of C, the localization matrix in (59). The relationships between the preconditioned and unpreconditioned variables are α (i) = U C χ (i) and A = U C V.
The unpreconditioned and preconditioned forms are identical provided that U C U C T = C (as in section 6.2). In a similar way to the 4D-Var control variable δχ var in section 2.3, the components of χ (i) in (66) have unit background covariance within and between control vectors, The key to understanding the L03 formulation is given in Appendix E, which shows that the background-error covariance matrix implied by (68) is, as for B05, P b (N) •C. There are Nn control vector elements in total in this formulation. Equation (68) is just an extension of (36), where each control vector element in (36) itself becomes a vector. Following Lorenc (2003), we choose to assemble the χ (i) vectors in a matrix (here V) as in (67). Other authors (e.g. Wang and Lei, 2014;Lorenc et al., 2015) instead choose to assemble these vectors as a long Nn-element vector, but the two representations are equivalent.

Comments on localization
Compact representations B05 and L03 can help to reduce noise in the sample error covariance matrix. However localization (and inflation) are only a partial solution to the under-sampling problem because the moderation functions can modify some important properties of the raw ensemble, e.g. its balance properties (Lorenc, 2003;Houtekamer and Mitchell, 2005;Kepert, 2009;Greybush et al., 2011;Bannister, 2015). This is one justification for developing the hybrid methods (section 5). Hybrid methods still require localization so either B05 or L03 are used in most hybrid systems (section 7 summarises localization in most ensemble-related methods described so far).
In B05 and L03, the specification of the matrix U C remains, which can still be a huge matrix. There are ways of modelling U C instead of storing large matrices. The approach is identical to that used to model the spatial component of B 0 in Var, e.g. using homogeneous and isotropic functions (Berre, 2000;Bannister, 2008;Errera and Menard, 2012), or more generally using using recursive filters (Hayden and Purser, 1995), wavelets (Fisher, 2003;Deckmyn and Berre, 2005;Bannister, 2007), or spectral localization (Buehner and Charron, 2007). Methods also exist that generate adaptive moderation functions, which depend upon the flow itself (Bishop and Hodyss 2007,2009a,2009b,2011. The method of Buehner (2012) in particular is a cross between spatial and spectral localization. It is useful in the context of B05 localization, where, instead of (61) member i and waveband j this gives δ j (i) = S −1 j Sδx b (i) , where S is a spectral transform, and j is the diagonal bandpass filter associated with band j. Under this scheme, the alternative version of (61) is where U Cj is a square-root of a localization matrix associated with band j, i.e. U Cj U Cj T = C j . This form allows band (i.e. scaleselective) localization, which is useful because less localization (i.e. larger localization length-scales) is needed for lower wavebands (i.e. larger error scales) than for higher wavebands (Buehner, 2012). This method effectively increases the potential rank of the background-error covariances by a factor of J, which results in the need for fewer ensemble members for similar noise level (Buehner, 2012). This method is extended in (Buehner and Shlyaeva, 2015). B05 and L03 localizations are equivalent, but their efficiencies depend upon the chosen rank of U C . Consider for instance an implementation of localization in pure En4DVar.
• The total number of control vector elements required for B05 is NM, making up χ ens in (62). Elements of χ ens are associated with NM perturbation vectors in X b . For lower-rank implementations of U C (where M n), B05 is the more efficient approach. B05 is used, for example, in the Environment Canada global and regional deterministic DA systems (Buehner et al., 2015a;Caron et al., 2015), and also in the recent work of Liu and Xue (2016).
• The total number of control vector elements required for L03 is Nn (comprising V in (66)- (68)), which are associated with the N perturbation vectors in X b . For full-rank implementations of U C , L03 is the more efficient approach (since X b does not have to be stored) as long as U C can be modelled efficiently. L03 is used, for example, in the Met Office's global hybrid system (Clayton et al., 2012).
We note that it is arguably more straightforward to account for model error in 4DEnVar schemes that use B05 localization (section 8).

Summary of the EnKF, pure EnVar, hybrid EnVar
We have presented many options for approximating the DA problem which incorporate an ensemble in one way or another. These have been classified as either pure ensemble (EnKF/S, section 3), pure EnVar (essentially Var approaches to finding the mean state of the pure ensemble system, section 4) and hybrid EnVar (extending EnVar with added information from the B 0 -matrix of Var, section 5). The B05 (section 6.2) and L03 (section 6.3) localization formulations are now applied to these DA systems to give the final forms of the equations that operational DA can deal with. It is possible (within bounds) to mix-and-match a DA method with a localization method, leaving us with a relatively large number of combinations.

Localization in the EnKF
The most common way of localizing P b (N) in an EnKF ( 4 in Figure 1) is by restricting the observations that are allowed to affect each analysis point (discussion in section 6). However, to localize in model space, B05 is the appropriate implicit method. As described in section 6.2, a localized version of (28) is found by replacing P b → P b in the first line, which is equivalent to replacing X b → X b in the last line, where X b is defined in (61) (not shown).
To avoid linearized operators in the resulting expression, H X b can be approximated by the nonlinear operations (i) is the ith column of X b , and x b is the ensemble mean).

Localization in the EnKS (3D state vector)
The EnKS with a 3D state vector (where observations are predicted by combining the forecast model with the observation operator (29), 5 in Figure 1) may also be localized with B05; in fact it would be difficult to use the observation space localization mentioned in section 6 with this EnKS, as the presence of the forecast model means that even point observations for t > 0 cannot be associated with points in the model domain at t = 0. This localized EnKS is found by a identical procedure to that described in section 7.1, applied to (29) (not shown).
To avoid linearized operators in the resulting expression,

Localization in the EnKS (4D state vector)
The EnKS with a 4D state vector ( 6 in Figure 1) uses X b instead of X b (32), which eliminates the forecast model from within the DA (33). The localization must also be extended to the time dimension, C → C, so U C → U C , where C = U C U C T . This version of the localized EnKS is found by a similar procedure to that described in sections 7.1 and 7.2.1, applied to (33), by replacing X b → X b (not shown).
As above, each column of H X b can be approximated by the

Comments on the implied localized background-error covariances in these systems
Due to the way that the different localizations work in the above 3D and 4D versions of the EnKS, they do not necessarily share the same implied localized background-error covariances, even when the forecast model is linear. In order to examine how they differ, consider the implied error covariance matrix between fields at time t. This is studied by looking at the analysis increment (propagated to time t) due to observations at (only) t. Under these conditions the analysis increment of the 3D state vector form of the EnKS (29) (X b → X b ) is: N) . As usual, the hat (no hat) means 'localized' ('unlocalized'). This analysis allows us to see that P b (N) (t) is the propagated localized background-error covariance implied in the localized EnKS with 3D state vector, which is just the localized covariance at t = 0 propagated with the linear model.
The analysis increment of the 4D state vector form of the EnKS (33) (X b → X b ) due to observations at (only) t is now studied. It exists at all times in the window, but the increment at t can be isolated with the operator comprising T + 1 blocks, each n × n: I t = 0 · · · I · · · 0 , where all blocks are 0, except for the one corresponding to time t, which is I. Noting that for this case H X b = H t X b (t), and I t X b = X b (t), the time t This analysis allows us to see that the localized covariance matrix implied here at time t is The differences between this covariance and the one for the 3D state vector are that (i) here no linearized model is used and (ii) the localization at time t, C(t) here needs to be specified (via U C ), rather than being propagated from C(0). This difference also separates the localized En4DVar and 4DEnVar schemes (below).

Localization and hybridization in En4DVar
In this section, En4DVar (the method in section 4.1 of finding the mean of the EnKS variationally, making use of the linear models of 4D-Var) is localized with B05 and with L03 (11 in Figure 1, and 11 is for hybrid En3DVar). The equations shown are also hybridized with the climatological B 0 .

B05 localization in hybrid En4DVar
The En4DVar equations hybridized and localized with B05 emerge from a straightforward application of concepts in section 5.3 and 6.2 to (35). The cost function, CVT and gradient formulae for this case are given as: where H M is defined in (31), d is defined after (33), and we define the augmented (hybridized) control vector as χ h = ( δχ var χ ens ), which has n+NM elements in total. Here χ ens is the NM-element vector associated with the N ensemble members in X b combined with the M localization members in U C (61). The differences with the original form (35) due to localization (75) include the extended length of the control vector (χ ens → χ ens ), and X b → X b .

L03 localization in hybrid En4DVar
The cost function for En4DVar with L03 localization (66) was used as the example in section 6.3. Incorporating the hybrid scheme from section 5.3 leads to the cost function, CVT and gradient formulae for this case given as: where we define the augmented (hybridized) control vector for L03 as χ h = ( δχ var V ), which has n+Nn elements in total. Note that the ensemble part, V, comprises N n-element vectors ( χ (i) , assembled here as columns of matrix V, so it should strictly be called a control matrix). The gradient of (75) with respect to V (forming the n×N sub-matrix) gives the lower part of (77), which has matrix elements Appendix B derives this part of the gradient, where , which is the gradient of the observation term of the strong-constraint 4D-Var cost function (15). The 4D nature of the problem is wrapped up in ∇ δx J 4DVar o . This hybrid En4DVar system is essentially that of Clayton et al. (2012), but with a number of differences in the detail. Clayton et al. (2012) have an additional term in the cost function, J c (a term to penalize imbalance). Another is that Clayton et al. (2012) perform localization in the space of 'parameter perturbations' instead of model variable perturbations. These parameters are related to the model perturbations via relationships that include balance relations. This is a way (in addition to J c ) of introducing balance into the analysis increments which would have been disrupted if localization was done directly in the space of model variables.

Localization and hybridization in 4DEnVar
In this section, 4DEnVar (the method in section 4.2 of finding the mean of the EnKS variationally, without the need for the linear models) is localized with B05 and with L03 (13 in Figure 1). As for En4DVar, the equations shown are also hybridized with B 0 .

B05 localization in hybrid 4DEnVar
The 4DEnVar equations hybridized and localized with B05 emerge from an application of concepts in sections 5.3 and 6.2 to (40). The cost function, CVT and gradient formulae for this case are given as: where H is defined in (31), d is defined after (33), and the hybridized control vector has the same structure as in section 7.3.1. Apart from the 4D nature of the ensemble-related matrices like X b , notice that U (now underlined) now needs to be a 4D operator to make up for the lack of the linearized forecast model to propagate perturbations from t = 0. This is an important difference between hybrid En4DVar and hybrid 4DEnVar. The are possible ways to structure U and some are listed below.
• Let U be a persistence model of climatological perturbations (Desroziers et al., 2014;Lorenc et al., 2015): where U is the t = 0 CVT as used in (17) in section 2.3. The preceding matrix contains T + 1 blocks, each comprising the n×n identity matrix. The contribution from √ 1−βUδχ var in (79) then behaves as 3D-FGAT (section 2.1.3 and Poterjoy and Zhang, 2015) and the climatological component of the implied background-error covariance between any two state-vector components is independent of time. Even though this is classified as a 4D method, it does not exploit the climatological covariance propagation property of 4D-Var. This is the simplest option which is used with most, if not all, applications of hybrid 4DEnVar at present.
• A more sophisticated form of U is to extend a 3D covariance model used to represent U to 4D. A strategy for modelling 3D background-error covariances is to exploit the normal modes of the system (e.g. Žagar et al., 2004), where the control vector represents coefficients that weight perturbations as linear combinations of 3D normal modes. This could in principle be extended to 4D (where columns of U would be related to the evolving normal modes of the system) although would be limited by linearity of the forecast model. • A further option is to replace the climatological term of 4DEnVar with that of En4DVar, but use a highly simplified model -e.g. with advection only, M adv 0,t . The computational cost issues associated with use of linear models (and adjoints) would remain, but such a simple model would require little maintenance.
It might also be possible to propagate the localization square-root matrix in 4DEnVar (either with the full, linearized or simplified model). Column i of the square-root matrix U C (section 6.2, call u C (i) ) would then, e.g. comprise the pre-computed column vector u C (i) = u C (i) (0), M adv 0,1 u C (i) (0), . . . , M adv T−1,T u C (i) (T −1) . This, and ideas of P. Arbogast (2016; personal communication) are possible ways of dealing with the 4D aspects of localization.

L03 localization in hybrid 4DEnVar
The hybrid 4DEnVar equations with L03 localization emerge from an application of concepts in sections 5.3 and 6.3 to (40). The cost function, CVT and gradient formulae for this case are given as: where the total control 'vector' χ h = (δχ var , C) has n+n(T +1)N elements in total. In these forms, virtually all objects are underlined, indicating that they are 4D. This is the way that localized hybrid 4DEnVar avoids the Jacobian of the forecast model (i.e. the forecast part of the problem is dealt with in the preparation of X b ). The gradient ∇ δx J 4DVar o that appears (as N repeated columns) in (84) is not found from (15); that equation gives ∇ δx(0) , the gradient with respect to the initial perturbation in 4D-Var. Instead ∇ δx J 4DVar o is the following n(T +1)-element vector that does not require the model adjoint: where d t is defined by (12). Note that V is a much larger matrix than V is in hybrid En4DVar with L03 localization. The other difficulty over En4DVar is that a 4D localization must be modelled via U C (discussion in section 7.2.3). This, and the 'U problem' are important outstanding issues in the development of hybrid 4DEnVar (the discussion in section 7.4.1 still stands for L03 localization).

Comments on the Jacobian of the observation operator and on the implied localized background-error covariances in these systems
Hybrid 4DEnVar in (78) and (82) retains the Jacobian H. This could be avoided whenever H X b appears where it can be approximated with differences between runs of the nonlinear observation operator; e.g. for perturbations at time t at column i of X b the approximation would be There is no obvious way of avoiding H elsewhere. The ensemble part of the implied localized background-error covariances between localized En4DVar and localized 4DEnVar differ in the same way as those between the 3D and 4D versions of the EnKS (section 7.2.3). Having a 3D state vector (where future states are generated by the model within the DA), En4DVar has the implied covariance at time t: P b (N) (t) = M 0,t C•P b (N) M T 0,t (quantities in the brackets are valid at t = 0), and having a 4D state vector, 4DEnVar has the implied covariance at time t: This result is independent of whether the B05 or the L03 formulation is used. Figure 2 gives a pictorial representation of how the background-error covariances evolve throughout the assimilation window in each 4D scheme discussed in this article.

Summary of nomenclature
Figure 1 serves to summarize the nomenclature, including the names recommended by (Lorenc, 2013) and in this article, and alternative names. It also summarizes the basic features of the localized schemes including the number of control variables (the Figure caption gives more information). As is the case throughout this article, we recognise, but do not elaborate on, the spectrum of methods that count as an EnKF (e.g. ETKF, etc.).

Model error with (hybrid) En4DVar and 4DEnVar
No assimilation is optimum until it accounts correctly for all sources of error. Model error is an important source, which is a particular concern when the assimilation window is 'long'. We have already considered model error with weak-constraint 4D-Var (section 2.1), and here we consider how model error might be represented in En4DVar and 4DEnVar.

Model error in (hybrid) En4DVar
Accounting for model error in methods that use the linearized model, e.g. hybrid En4DVar, may be achieved by augmenting the control variable with the variables associated with model error, δη(1), . . . , δη(T), adding a model error term as in (10), and modifying the prediction operator of the observations at time t. For instance hybrid En4DVar with B05 localization (14 in Figure 1) has the modified model observations (based on (8)): The new terms accumulate the estimated model error contributions up to and including time t. A similar procedure is valid for L03 localization. * *

Model error in 4DEnVar
Accounting for model error in methods that do not use the linearized model, like hybrid 4DEnVar, is more in line with the requirements of many operational centres, but is arguably harder to do and there are few examples in the literature. The key is to build flexibility into a scheme to relax the need for the solution to be synthesized from model trajectories over the window. One possibility, based around the B05 localization is to allow a different linear combination of members at each time (Amezcua et al., 2017;Goodliff et al., 2017). Instead of one control vector associated with the ensemble, χ ens in (79), there would be T +1 such vectors. The versions of (78) and (79) become (87) and (88) respectively, J H4DEnVar δχ var , χ ens (0), . . . , χ ens (T) where there are n+(T +1)NM elements to the control vector in total (or (T +1)NM for the non-hybrid version). The last term of (87) penalizes model trajectory misfits, which have covariance Q t . There is an inconsistency in system (87) and (88), namely that model error has been accounted for in the part associated with χ ens (t), but not in the part associated with δχ var (the latter is akin to 3DFGAT). Clearly accounting for model error effectively in hybrid 4DEnVar will require a great deal of research.
Another point of interest is the nature of the 4D ensemble comprising X b (and hence the 'localized' version X b ). Changing only the linear combination of members at each time step does not allow significant deviations from the 'model attractor', which would be required when the model is in error. An obvious way of dealing with this is to modify the way that the ensemble is prepared by allowing the trajectories to deviate stochastically. There are many well-documented ways of doing this (Palmer and Williams, 2010), e.g. using multiple physical parametrization schemes (Stensrud et al., 2000;Berner et al., 2011), using the stochastic kinetic energy backscatter scheme (Shutts, 2005), perturbing physics tendencies (Buizza et al., 1999;Bouttier et al., 2012;Fresnay et al., 2012), or perturbing physics parameters (Bowler et al., 2008;Hacker et al., 2011;Gebhardt et al., 2011;Vié et al., 2012;Baker et al., 2014). Timing errors may be represented by adding ensemble members with a slightly different validity time to the background time . Including additive model error during model integration has been done by many authors in the context of the EnKF to good effect (e.g. Hamill and Whitaker, 2005;Kalnay et al., 2007).

Which method to use, and how do the methods compare in use?
There is now a variety of known DA methods, so here we compare the merits and the relative performance of each as reported in the literature.

Factors to be considered
The decision on which DA method to use depends upon many factors, some of which are listed here.
• Parallelizability: Methods that do not require time integration within the DA algorithm have the edge on parallelizability, since the forward and adjoint integrations need to be done serially. Suitable methods include the EnKS (4D state vector), 4DEnVar, and (hybrid) 4DEnVar. Although the EnKS (3D state vector) and (hybrid) En4DVar do time integration within the DA algorithm this can be allocated to separate processors.
• Existing 4D-Var infrastructure. If linearized and adjoint versions of the forecast model are already available (e.g. used already in 4D-Var), then upgrading to (hybrid) En4DVar may well be beneficial (section 9.2). • Systems with highly nonlinear processes or frequently upgraded forecast models. It is well known that highly nonlinear models (e.g. with moist processes, switches, etc.) present difficulties with linearizing. There are also complications over linearizing models that have, for example, semi-Lagrangian dynamical cores. Anyway most models are time consuming to linearize and are expensive to run, so methods that avoid the need for linearized and adjoint models are desirable. Methods like the EnKS (4D state vector) and (hybrid) 4DEnVar are desirable in this case. • Known background-error statistics. In systems that have background-error covariance statistics that are easily modelled and whose variability is negligible then 4D-Var (strong-and weak-constraint) remains useful. This is not normally the case for weather forecasting, even though 4D-Var is historically successful. • Model error is important. If the model error accumulated over the required assimilation time window is significant compared to other sources of error that are accounted for already, then a method with model error should be considered. Of the methods shown in Figure 1, weak-constraint 4D-Var (section 2.1.2), weak-constraint (hybrid) EnVar (section 8), and the EnKF account for model error. There are other possibilities, e.g. methods that use a perfect model, but absorb model error into the R t -matrix (Howes, 2016). • Deterministic or probabilistic. Probabilistic forecasts are increasingly in demand (e.g. Palmer, 2012). If the information contained in a single forecast is not adequate, and measures of analysis and forecast uncertainty are required, then the EnKF, EnKS (and other flavours) are useful. Of course (hybrid) En4DVar and 4DEnVar need an ensemble system running alongside them anyway. It is possible to run these methods many times, by perturbing inputs like the background state and the observations (Liu and Xiao, 2013) or use a scheme like EVIL (section 5.5), which does not need a separate ensemble. • Non-instantaneous or 'rate-of-change' observations. If observations are to be assimilated that are a function of the state at multiple times (e.g. precipitation accumulation) then methods that have 4D state vectors are the natural choice, e.g. 4D-Var, EnKS and (hybrid) En4DVar/4DEnVar. The equations describing these systems will need to be modified to allow for such observations.

Systems studied in literature and their relative performance
There are now a large number of studies testing and comparing the DA methods discussed in this article. In this section we summarize the main findings. We consider three kinds of study: (i) investigations in simple models (Table 2), (ii) investigations with operational-scale models, but with simulated observations (Table 3), and (iii) as (ii), but with real data (Table 4). The nomenclature used here is consistent with Figure 1, even though some articles use their own. The main outcomes are that the 4D methods tend to be better than 3D methods, hybrid methods tend to be better than pure Var, ensemble, or EnVar methods, (and hybrids are more robust when N is small) but 4DEnVar still needs some development effort if it is to become better than En4DVar in all systems; only one study found hybrid 4DEnVar gave better performance than hybrid En4DVar (Gustafsson and Bojarova, ). However 4DEnVar is much cheaper to run than En4DVar so, if the performance of 4DEnVar could be improved through further work, it could then allow the cost savings in computation to be shifted to models with higher resolution and with more ensemble members. The results from the simplified systems are in general in agreement with those from the full systems, confirming the value of the simplified studies, and all studies have helped ultimately guide centres to decide on which methods to use operationally.

Summary of variational/ensemble methods used in NWP centres
It is often reported that most, if not all, NWP centres have the capability to combine their ensemble and variational systems. Table 5 lists the capabilities of a number of leading (inter)national NWP centres and research groups. Obviously this list is a snapshot and the true capabilities will change as developments progress.

Summary and outlook
Data assimilation needs to work well to serve NWP. State-ofthe-art Var systems assume that background information obeys Gaussian statistics, with a covariance, P b , estimated from a quasistatic representation, from an ensemble, or a combination of both. The use of ensemble information in Var systems using the raft of methods discussed in this article to give appropriate flow dependence of P b , has been challenging, but largely successful. This article is intended to be a pedagogical review of Var methods and the practical ways of introducing ensemble information into them to take advantage of the efficiency of Var and the flow-dependence of an ensemble, while minimizing their drawbacks (Buehner et al., 2015b, and section 1). Broadly speaking, this is presently done in three ways.
1. Use the ensemble to recalibrate the (co)variances of the B 0 -matrix used as an approximation for P b in 4D-Var (section 5.1). 2. Use the ensemble to define P b by preconditioning the control vector on the ensemble itself (so-called pure EnVar methods). This describes the analysis increment as a linear combination of ensemble perturbations (where the control vector contains the coefficients) and implies a backgrounderror covariance matrix equal to the sample covariance of the ensemble, P b (N) . This can be done using methods that exploit (En4DVar) or avoid (4DEnVar) the linear/adjoint forecast model (section 4). 3. Average B 0 with P b (N) (so-called hybrid EnVar). This is usually done with an augmented control vector comprising a part that is associated with the static part (as used in traditional Var) and a part that is associated with the ensemble (the same as that used in 1. here). Alternative methods are reviewed such as the a hybrid gain method, the EVIL method, and an ensemble reduced-rank Kalman filter (section 5.3).
These methods use the Var machinery to solve a DA problem and most require a separate ensemble system, which many NWP centres have anyway for ensemble prediction purposes. Practical methods that use ensemble information rely on localization to mitigate some of the effects of sampling error. The mathematical complexities that localization brings can mask the workings of the equations and so the methods are shown first without localization, but it is introduced later with two Schur product methods documented by Buehner (2005) (B05) and Lorenc (2003) EnVar using 2. and 3. here has been shown to be fruitful and the key reasons are summarized as: • (Hybrid) En4DVar and (hybrid) 4DEnVar are efficient ways of finding the deterministic (single) analysis. • The methods provide flow-dependency to the P b used in the DA.  • As these methods use information from an ensemble, there is still sensitivity to the ensemble size, N, especially when it is small, but hybrid methods have been shown to reduce the sensitivity to N. • (Hybrid) EnVar methods are competitive with, or outperform, the pure Var or ensemble counterparts.
There are outstanding issues that still need to be solved. For instance in hybrid 4DEnVar the 4D ensemble is averaged with 3D-FGAT and not with 4D-Var because the latter needs the linearized model (section 7.4). Possible ways forward involve extensions to U (the control variable transform) or to reintroduce a linear model, albeit highly simplified (section 7.4.1). Furthermore in (hybrid) 4DEnVar the way that 4D localization is done is thought to be crucial to its effectiveness. It is relatively straightforward to do 4D localization with functions that are separable in space and time, but there are reasons why this is not appropriate (features that are together at t = 0 -so unaffected by localization -would remain correlated at t > 0, yet may be further apart, and so would be wrongly damped by localization). Such a problem may be dealt with by advecting the localization with similar simplified linear model or by using an adaptive scheme (e.g. Bishop and Hodyss, 2007,2009a,2009b,2011. How to account for model error in (hybrid) 4DEnVar is also an ongoing problem (section 8), especially if it introduces the need for a model error covariance matrix (which brings us back to a covariance calibration problem, as for B 0 ).
There is scope to test the methods covered in this review to situations where pure Var can struggle. One application is quantitative precipitation forecasting at the convective scale. There are already some encouraging results in this area: e.g. hybrid En4DVar can outperform pure ensemble and Var methods in summer convection , and hybrid En3DVar is better than pure ensemble and Var methods for analysing precipitation positions . Furthermore hybrid 3DEnVar (unlike 3D-Var) can improve the position of a hurricane (Wang, 2011). However, there are some less encouraging studies showing that, for example, humidity is particularly sensitive to sampling error and is not improved with 4DEnVar over 3DEnVar (Liu et al., 2009). Operational centres like the Met Office use EnVar systems for their global models, but have yet to develop similar systems for their regional models. It remains an important question how much these techniques can improve analyses and forecasts in their high-resolution regional models. Environment Canada though has already shown that 4DEnVar can lead to better forecasts of moisture-related quantities like precipitable water and equitable threat score for precipitation (Caron et al., 2015). It should also be noted that all of the methods may be suitable for application to other geophysical problems like coupled systems and estimating surface fluxes of chemical species.
Finally a note on nomenclature. As the number of methods grows, it becomes important to label each succinctly and accurately. Lorenc (2013) has attempted to do this, and authors have tended to follow. We support the view that the 'hybrid' label should be used only for methods that combine ensemble and static aspects in their background errors, but we do think that it should be necessary to specify this when a method does use information from both sources (unlike point 5. in Lorenc, 2013)); see Figure 1. It is also useful to distinguish between 3D and 4D pure ensemble methods. In this case it is sensible to specify 4D ensemble methods as smoothers (EnKS) rather than as filters (EnKF). To add to the complexity, there are also methods that run ensembles of 4D-Var (Belo Berre et al., 2007Berre et al., , 2009Isaksen et al., 2010) or of 4DEnVar (Fairbairn et al., 2014) and a nomenclature is needed here. One suggestion is to use En-4DVar and En-4DEnVar respectively to indicate that the 'En' (separated from the main method with a hyphen) represents independent runs of the system specified.
where y (p-elements), H (p×n-elements), X (n×N-elements), U (n×n-elements), the symmetric matrix S (p×p-elements), and 1 N (N-elements, each of value 1). This is the structure of cost functions that use the L03 formulation of localizing ensemblederived background-error covariances ((66) in section 6.3). The matrix derivative is defined as where V ij is the ijth element of V. In order to derive a matrix expression for this gradient, we use the chain rule. Let v = {X • [UV]} 1 N , where v has n elements. ∇ V J can be found using the vector derivative result in Appendix A: The chain rule allows the derivative with respect to Z to be found: ∂J/∂v l which appears in (B4) is the lth component of (B3), and ∂v l /∂V ij can be found by first expanding out the definition of v in terms of V given above: Substituting this into the chain rule gives: which can be confirmed to be the ijth element of the matrix expression: where ∇ v J · · · ∇ v J is the n×N matrix of repeated columns of ∇ v J.
√ N −1 is the ith column of X b and the diag operator gives an n×n diagonal matrix whose diagonal elements comprise the vector argument. There are NM columns of X b , which may be thought of as NM effective ensemble members whose covariance has the desired property. Evaluating X b X b T from (D3) gives: (1) ) . . .
The i, jth element of (D4) is given below, which is shown to be the i, jth element of P b (N) •C: (D5)

Appendix E: The 'L03' method of localizing covariances
This appendix shows that the CVT given in (68): implies a background-error covariance matrix that is localized. For N ensemble members and a state size n, here X b (n×N-elements) is the matrix of scaled background deviations (i.e. X b ji = (x b (i) −x b ) j / √ N −1), U C (n×n-elements) is a squareroot of localization matrix C, V (n×N-elements) is the matrix whose columns are the mutually uncorrelated control vectors χ (j) (i.e. V ij = ( χ (j) ) i -the ith component of the jth control vector), and 1 N is the N-element column vector of 1s. The kth element of δx from (E1) is: Calculating the covariance between elements k and k of δx: The left-hand side is matrix element k, k of δxδx T and the right-hand side of the last line is matrix element k, k of P b (N) • C. This shows that the error covariance matrix implied by the CVT is the error covariance matrix sampled from the forecast ensemble, P b (N) localized with C.