A conjugate BFGS method for accurate estimation of a posterior error covariance matrix in a linear inverse problem

One effective data assimilation/inversion method is the four‐dimensional variational method (4D‐Var). However, it is a non‐trivial task for a conventional 4D‐Var to estimate a posterior error covariance matrix. This study proposes a method to estimate a posterior error covariance matrix applied to the linear inverse problem of an atmospheric constituent. The method was constructed within a 4D‐Var framework using a quasi‐Newton method with the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm. The proposed method was constructed such that conjugacy among the set of increment vector pairs was ensured. It is theoretically demonstrated that, when this conjugate property is coupled with preconditioning, an analytical solution of a posterior error covariance matrix could be obtained from the same number of vector pairs as observations. Furthermore, to accelerate the speed of convergence, the method can be coupled with an ensemble approach. By performing a simple advection test, it was confirmed that the proposed method could obtain an analytical matrix of the posterior error covariance within the same number of iterations as the observations. Furthermore, the method was also evaluated using an atmospheric CO2 inverse problem, which demonstrated its practical utility. The evaluation revealed that the proposed method could provide accurate estimates not only of the diagonal but also of the off‐diagonal elements of the posterior error covariance matrix. Although far more expensive than optimal state estimation, the computational efficiency was found to be reasonable for practical use, especially in conjunction with an ensemble approach. The accurate estimation of a posterior error covariance matrix resulting from the proposed method could provide valuable quantitative information regarding the uncertainties of estimated variables as well as the observational impacts, which would be beneficial for designing observation networks. Furthermore, error correlations derived from the estimated off‐diagonal elements could benefit the interpretation of optimised parameter variations.

from the estimated off-diagonal elements could benefit the interpretation of optimised parameter variations.

4D-Var, posterior error covariance, BFGS formula, inverse analysis INTRODUCTION
Various data assimilation and inversion methods have been used to optimise model variables, such as the initial model states or boundary conditions, in the field of atmospheric sciences. The four-dimensional variational method (4D-Var: e.g. Sasaki, 1969) is an effective data assimilation/inversion method because it can obtain optimal model variables without dealing explicitly with a model's operator matrix, which is generally too large to store in a computer's memory. Based on Bayes' theorem, 4D-Var determines optimal (posterior) model variables through an iterative approach using a prescribed prior estimate and its error covariance, observations, and the error covariance of model-observation mismatches. Ideally, a posterior estimate should have an associated error covariance matrix (a posterior error covariance matrix), although obtaining it is a non-trivial task for conventional 4D-Var approaches.
In this study, we developed a new method to estimate a posterior error covariance matrix within a 4D-Var framework that we apply to an inverse problem of a given atmospheric constituent. In such an inverse problem, we estimate the fluxes of an atmospheric constituent at the Earth's surface from observations of its atmospheric concentrations. Here, the target tracer is an inert, long-lived tracer such as carbon dioxide (CO 2 ) and nonlinear chemical processes are not considered. Furthermore, the atmospheric processes considered are all transport-related ones, such as advection, diffusion and convection, and the tracer is treated as a passive tracer. Therefore, the model is linear and results in a linear inverse problem.
Generally, a 4D-Var seeks a set of model variables that minimises a cost function, defined as where x and x pri are the control vector and its prior estimate, respectively, and y o denotes observations. The sizes of the vectors x and y o are n and m respectively. B and R are the error covariance matrices of the prior estimates (n × n) and model-observation mismatches (m × m), respectively.

M(.)
is an operator representing a forward model calculation, including an observation operator that transforms variables from the model space into variables in the observation space. In an inverse problem for a given atmospheric constituent, the control vector, x, and the observation vector, y o , generally represent surface fluxes and atmospheric concentrations, respectively, and contain not only spatial but also temporal dimensions for the entire analysis period. Thus, the model operator M(.) describes all the model calculations from the beginning to the end. As mentioned, our inverse problem assumes that the model is linear; that is, M(.) can be replaced by an m × n matrix, M. Here, the linear model matrix, M, can be also regarded as the tangent linear operator. Note here that the observation operator included in M(.) is also assumed to be linear because we only consider linear spatio-temporal interpolation. Therefore, Equation (1) can be rewritten as (3) which satisfies Here, M T represents an adjoint model that performs a backward calculation from concentrations to fluxes. When a cost function is derived from the logarithm of the Gaussian probability density distributions, the posterior error covariance matrix, P a (n × n), is equivalent to the inverse matrix of the Hessian A. The Hessian is the square matrix of the second-order partial derivatives of the cost function, and in the linear case with the cost function of Equation (2), Therefore, In practice, the sizes of the matrices are too large to be stored explicitly in computer memory, rendering the inverse calculation of (B −1 + M T R −1 M) −1 infeasible. To avoid the calculation of (B −1 + M T R −1 M) −1 , 4D-Var uses either a conjugate gradient method or a quasi-Newton method to search optimal model variables iteratively. Each iterative calculation requires the cost function value and its gradient vector. Therefore, in practice, the posterior error covariance matrix is not calculated explicitly. Rabier and Courtier (1992) proposed a randomisation method to estimate a posterior error covariance matrix in which an ensemble of perturbed pseudo-observations was employed to approximate the observational part of the Hessian M T R −1 M. Then, the posterior error covariance matrix was obtained by inverting the approximated Hessian matrix using the Sherman-Morrison-Woodbury formula (Sherman and Morrison, 1949). Fisher and Courtier (1995) compared the randomisation method with two other methods: the Lanczos method (Lanczos, 1950) and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method (Broyden, 1969;Nocedal, 1980;Liu and Nocedal, 1989). The Lanczos method is closely related to a conjugate gradient method, whereas the BFGS method represents one familiar technique of a quasi-Newton method. Therefore, these two methods could provide a posterior error covariance matrix as a by-product of an optimisation. Fisher and Courtier (1995) demonstrated some superiority of the Lanczos method, following comparison of the randomisation, Lanczos and BFGS methods. Another approach for estimating a posterior error covariance matrix is to use a Monte Carlo method in which the posterior error covariance matrix is estimated from an ensemble spread of posterior estimates. In fact, the idea of estimating an error covariance matrix from an ensemble is the basis of the ensemble Kalman filter (e.g. Evensen, 1994), which is another prominent data assimilation method besides 4D-Var. Within a 4D-Var framework, Chevallier et al. (2007) used the Monte Carlo method to estimate a posterior error covariance applied to a CO 2 inverse problem, in which 4D-Var iterative calculations were performed for each ensemble member. This method was also adopted for another CO 2 inverse system by Liu et al. (2014). However, the Monte Carlo method has trouble estimating off-diagonal elements of a posterior error covariance matrix, because of its stochastic nature. It is well known that stochastic methods usually suffer from sampling errors (e.g. Fisher and Courtier, 1995). Bousserez et al. (2015), who also targeted a linear inverse problem as we do, proposed a new BFGS method that is combined with the Monte Carlo method (described briefly in Appendix A). In their comparison, the conventional BFGS method was found to be poor at estimating posterior errors (square root of diagonal elements of a posterior error covariance matrix) compared with stochastic methods (i.e. the Monte Carlo and randomisation methods). Through the combination with the Monte Carlo method, their new BFGS method exhibited a performance comparable to stochastic methods for determining posterior errors.
This study proposes a BFGS-based method different from Bousserez et al. (2015), with a particular focus on the linearity of the inverse problem. In general, the quasi-Newton method used with the BFGS algorithm allows some nonlinearity in the optimisation and this is one advantage of the quasi-Newton method. However, our BFGS method has been limited to an exactly linear case. This study, then, leverages a property called "conjugacy." To our knowledge, the conjugacy has not been focused on conventional studies of atmospheric constituent inversions though the problem is linear. Optimisation methods which exploit conjugacy have been employed to minimise nonlinear cost functions in the context of data assimilation. In particular, several studies (Tshimanga, 2007;Fisher et al., 2008;Tshimanga et al., 2008;Gürol et al., 2014) proposed to use a matrix derived from conjugate vector pairs in preconditioning for accelerating convergence of subsequent assimilation cycles in the so-called incremental 4D-Var (Courtier et al., 1994). In this study, we, however, limit our focus to a linear inverse problem, and we carefully constructed an algorithm that exactly achieves the conjugacy for accurate estimation of a posterior error covariance matrix.
For quantitative evaluation of inversion uncertainties, the estimation of a posterior error covariance matrix requires high accuracy. In an inverse analysis, posterior errors derived from diagonal elements of a posterior error covariance matrix provide valuable information because they would allow for the quantification of the uncertainties of optimised fluxes and the evaluation of the strengths of observational constraints (e.g. Hungershoefer et al., 2010;Niwa et al., 2012;Saeki et al., 2013). Furthermore, error correlations derived from off-diagonal elements are also useful for the interpretation of estimated flux variations. An inverse analysis often generates erroneous dipoles in the estimated fluxes due to insufficient observational coverage (e.g. Bousquet et al., 1999). Error correlations enable us to identify such erroneous dipoles.
For an achievement of the high accuracy with a reasonable computation time, we also introduced an ensemble approach as in the method of Bousserez et al. (2015) but achieving the conjugacy. Furthermore, we investigated its feasibility in a realistic inverse problem of an atmospheric constituent and found that the method provides accurate estimation of a posterior error covariance not only for the diagonal elements but also for off-diagonal elements. In this article, we first describe the method starting from the basic BFGS formula to the techniques introduced for the conjugacy. Then, we evaluate the performance of the proposed method when applied to a simple toy model and a state-of-the-art atmospheric inverse system.

Quasi-Newton method with the BFGS algorithm
Here, we introduce the principle of the quasi-Newton method combined with the BFGS formula. Readers are encouraged to see Nocedal and Wright (2006) for treatments of the fundamental mathematics and for background theories. Hereafter, we omit of the control vector, x, for simplicity. Also note that, according to conventions, we will use vectors denoted as d and y in subsequent equations to represent a decent direction vector and a gradient increment vector, respectively, which are different from the innovation vector, d o , and the observation vector, y o , in the previous equations.
Generally, an optimisation method performs iterative calculations as where k is the iteration counter, and k is the step length of the line search in the descent direction d k − 1 that minimises J(x k − 1 + k d k − 1 ). Determination of the descent direction d k is method dependent. The Newton method determines d k using the Hessian A and the gradient of the cost function g k ( If the problem is linear and the cost function is quadratic, then the optimal control vector can be determined through one iteration with = 1. This is verified by the fact that Equation (3) is obtained by Equations (6-8). However, in our problem, the matrix size of the Hessian is so large that it renders the Newton method infeasible. Instead, we use a quasi-Newton method that substitutes an approximated inverse Hessian, H k (n × n), as At each iteration, the approximated inverse Hessian is updated by the BFGS formula as where , and I is the identity matrix (n × n). In the above BFGS formula, an arbitrary positive-definite symmetric matrix can be used for the first approximation of the inverse Hessian H 0 . In practice, H k is calculated via a two-loop recursion scheme using all the available vector pairs of (y, p), such that the full matrix of the approximated inverse Hessian is not stored in a computer's memory (e.g. Nocedal and Wright, 2006).

2.2
Estimation of a posterior error covariance matrix with the BFGS formula Based on the previous algorithm, each column of an approximated inverse Hessian can be obtained by replacing g k of Equation (9) with a unit vector such as (1, 0, … , 0) T , (0, 1, … , 0) T , … , (0, 0, … , 1) T . Therefore, if N pairs of (y, p) vector pairs are obtained, then the inverse Hessian approximated by N vector pairs, H N , can be derived via a two-loop recursion scheme (Nocedal, 1980) as follows.
Algorithm 1 (BFGS-based estimation of the posterior error covariance matrix) Let v be the jth unit vector as According to Equation (6), this approximated inverse Hessian, H N , can be considered the posterior error covariance in the linear case and under the Gaussian assumption. We should note here that H N is determined only by the set of the vector pairs of (y, p) and H 0 . Therefore, the approximation accuracy of H N depends on the properties of the (y, p) vector pairs and H 0 , as well as the available number of (y, p) pairs. In the next section, we outline the algorithms to prepare (y, p) vector pairs that have a preferred property of conjugacy. Furthermore, we choose an appropriate positive definite matrix for H 0 . Note here that, in Algorithm 1, the order of (y, p) vector pairs does not affect the result. This implies that we do not necessarily need to perform N iterations of the quasi-Newton method, but simply need to prepare N vector pairs, which enables us to employ the ensemble approach as described in Section 3.2.2.

Conjugacy
As mentioned in Section 1, our BFGS method is specifically designed for a linear problem, such as the inverse analysis of an atmospheric constituent. Furthermore, our method needs to fulfil the property of mutual conjugacy among the increment vectors; that is, which can be rewritten as This conjugacy property presents a favourable condition for the BFGS formula. If the conjugacy is satisfied, then an iterative calculation converges with the analytical solution in at most n iterations, and it is expected that we can obtain the analytical inverse Hessian after n iterations (Nocedal and Wright, 2006) as, Therefore, the conjugacy is expected to enhance the efficiency of the posterior error covariance estimation. Thus, we impose the vector pairs of (y, p) used in the BFGS formula to satisfy Equation (12). To this end, we applied the so-called exact line search and orthogonalization, which are described in Section 3.2.

Preconditioning
Equation (13) indicates that we need n conjugate vector pairs to obtain an analytical inverse Hessian; however, the necessary number of vector pairs could be further reduced by introducing preconditioning. In an optimisation, preconditioning is often applied to accelerate convergence. One of the most popular preconditioning approaches is that of Lorenc (1988), which transforms the control vector asx Then, the transformed Hessian can be written as where X ≔ B 1/2 M T R −1/2 . Because n > m, in general, this Hessian has at most m eigenvalues other than 1 (recall here that m is the number of observations), which effectively accelerates the convergence speed.
In this study, we also use this preconditioning to estimate a posterior error covariance matrix. By applying the preconditioning to the BFGS method, the transformed approximated inverse

the non-transformed approximated inverse Hessian is
The matricesH 0 and H 0 are required to be positive-definite symmetric matrices which approximateÃ and A, respectively. It is reasonable to setH 0 = I due to its simplicity and Equation (15). In this case, the above equation becomes identical to the non-transformed BFGS form with H 0 = B since H 0 = B 1∕2H 0 B 1∕2 . Therefore, in practice, by simply setting H 0 = B in Algorithm 1, we can introduce the preconditioning of Lorenc (1988). Here, it is noted that H 0 = B is also a reasonable choice because of Equation (5).
When the preconditioning of Lorenc (1988) is applied and the conjugacy is satisfied, the BFGS formula is more favourable than Equation (13); that is, we can obtain the analytical inverse Hessian from only m (y, p) vector pairs, that is, This important property can be derived from the fact that y k andp k for any k ≤ m can be represented by a linear combination of the same m orthogonal vectors, when these vector pairs are obtained through iterations of the preconditioned quasi-Newton method, from x 0 = 0. That is, where {u 1 , u 2 , … , u m } is a set of singular vectors of X. Therefore, Detailed proofs of Equations 20 and 21 are given in Appendix B.
Here, we prove Equation (19) as follows. For 1 ≤ k ≤ m, the conjugacy in the preconditioned spacẽ whereÃ is the transformed Hessian, leads tõ Becausẽ Equation (23) gives an interesting property (Dennis and Moré, 1977;Fisher and Courtier, 1995) that Furthermore, an arbitrary vector, u, that is orthogonal to every member of {u 1 , u 2 , … , u m }, satisfies whereP ≔ (p 1 ,p 2 , … ,p m ) and a ≔ (a 1 , a 2 , … , a m ) T . and Equations (25) and (26). This leads to and thereforeH mÃ = I, since Equation (29) must be satisfied for an arbitrary vector b. Substituting In an inversion problem for an atmospheric constituent, the number of observations, m, is usually much smaller than the number of flux parameters, n. Therefore, preconditioning is indispensable given the fact that only m (y, p) vector pairs are required to obtain the analytical solution of a posterior error covariance matrix. However, this is only true if the conjugacy is perfectly satisfied.

Preparation of conjugate vector pairs
In order to prepare (y, p) vector pairs to estimate a posterior error covariance matrix with the BFGS formula, we first perform an optimising calculation. Although the calculation itself is not necessarily the same approach as the estimation of a posterior error covariance, we employ the quasi-Newton method with the BFGS formula herein. To ensure the conjugacy, we introduce the exact line search to an existing quasi-Newton algorithm. Furthermore, we add an ensemble approach to increase the number of (y, p) vector pairs similarly to Bousserez et al. (2015), but also introducing orthogonalization for the conjugacy.

POpULar with the exact line search
In an optimisation, a line search determines k in Equation (7) Generally, in a nonlinear problem, it requires additional model simulations within iterative quadratic or cubic interpolations to satisfy adequate conditions, such as the so-called Wolfe conditions (e.g. Nocedal and Wright, 2006). In a weakly nonlinear problem, k = 1 is usually used for the initial estimate of k because it is considered adequate and consequently does not require extra simulations at most iterations. However, this is not the exact value that minimises J(x k − 1 + k d k − 1 ). If a problem is linear, then the cost function is a convex quadratic and the optimal k is analytically determined with the Hessian A as This is the exact line search. It is known that, if the problem is linear and the exact line search is employed, the quasi-Newton method with the BFGS formula gives conjugate vector pairs (Dennis and Moré, 1977). In fact, that is equivalent to the linear conjugate gradient method (e.g. Nazareth, 1979), which also satisfies the conjugacy (e.g. Nocedal and Wright, 2006). The practical calculation of Equation (30) is non-trivial because of the huge matrix size of the Hessian A (n × n). To avoid dealing explicitly with A, Equation (30) is rewritten as where In Equation (32), we have to perform the adjoint calculation as denoted by M T ; meanwhile we can replace the calculation of g k by which no longer includes the adjoint calculation. The above calculation of the exact line search originates from the method of Derber and Rosati (1989), in which the line search is applied in the linear conjugate gradient method. In this study, we applied this exact line search in the Preconditioned Optimizing Utility for Large-dimensional analyses (POpULar: Fujii and Kamachi, 2003;Fujii, 2005), which uses the quasi-Newton method with the BFGS formula and the preconditioning of Lorenc (1988). One salient feature of POpULar is that it implicitly applies the preconditioning (variables in the preconditioned space, e.g.x andg, do not appear), while not requiring the calculation of B −1 nor B 1/2 , which are included in the cost function and gradient calculations, or the variable transformation (Equation (14)). Therefore, it is reasonably flexible for modelling B and for introducing off-diagonal elements to consider error correlations. The original algorithm of POpULar is designed for nonlinear problems, but in this study, we impose linearity on the algorithm and introduce the exact line search. The original algorithm of POpULar can be found in Fujii (2005) and we only describe the "linear" POpULar as follows.
In the iterative calculation, variables including ancillary variables of h ≔ Bg, e ≔ B −1 d are updated as follows: Algorithm 2 (Linear POpULar).
where k is determined by the exact line search of Equation (31); In practice, the above updates of d k and e k are simultaneously calculated using the two-loop recursion scheme as the original POpULar does (Fujii, 2005). It is noted that the original POpULar algorithm replaces the conjugate gradient method of Derber and Rosati (1989) with the quasi-Newton method with a usual (non-exact) line search in order to deal with nonlinear problems. Therefore, the above linear POpULar forms an intermediate between Derber and Rosati (1989) and the original POpULar.
An iterative calculation of the quasi-Newton method often uses the "limited-memory" BFGS algorithm (Nocedal, 1980;Liu and Nocedal, 1989), which uses a limited number of latest (y, p) vector pairs instead of the full k pairs. However, we here employ the "full-memory" BFGS algorithm to satisfy the conjugacy among all the vector pairs of (y, p). In many cases nowadays, as well as an inverse problem that optimises only two-dimensional data of surface fluxes, a 4D-Var calculation can retain all the vectors in storage (but not necessarily in memory) thanks to recent progress in computer technology. Through iterative calculations of the described linear POpULar, conjugate vector pairs are obtained and they are substituted into the BFGS formula of Algorithm 1 with H 0 = B for estimation of a posterior error covariance matrix.

3.2.2
Ensemble with orthogonalization Bousserez et al. (2015) combined the BFGS algorithm with a Monte Carlo method, which functionally increases the number of (y, p) vector pairs. Increasing the number of (y, p) vector pairs according to the ensemble corresponds to an increase in the number of iterations using a single calculation. The total computational cost might not be changed, but the ensemble with application of parallel computation can effectively reduce the wall-clock-time, because each ensemble member can be calculated independently of the others. Here, we use a similar approach to Bousserez et al. (2015). However, we newly introduce orthogonalization to ensure the conjugacy among all the vector pairs produced from the ensemble.
To generate the ensemble members, we prepare M sets of d o (=y o − Mx pri ) that are generated simply from random values, where M is the number of ensemble members. Randomly generating d o is justified because the posterior error covariance matrix is independent of the observational vector y o and the prior vector x pri (Equation (6)), and we use the ensemble only to increase the number of (y, p) vector pairs. Thus, we do not need to use actual observations or prior fluxes here. Furthermore, in contrast to the Monte Carlo method of Bousserez et al. (2015), it is not required that we perturb the observations according to R, nor perturb x pri according to B.
After the ensemble calculation, we introduce an orthogonalization procedure that ensures the conjugacy among the vector pairs from all of the ensemble members; within each ensemble member, the (y, p) vector pairs are mutually conjugate, owing to the exact line search. However, when extending the operation beyond a member, this is no longer the case. The conjugacy is not satisfied between different members, because the optimisation itself is independent of other members. In fact, similar orthogonalization is implemented in previous optimisation algorithms as "re-orthogonalization" with the Lanczos method (Fisher, 1998;Gürol et al., 2014). They modify a vector set whose conjugacy should be theoretically achieved but is deteriorated by round-off errors. Those round-off errors are typical of the Lanczos method (Golub and van Loan, 1996).
The orthogonalization is performed as follows. First, we organise all the p and y vectors into rows to form respective matrices and Y = (y 1 , y 2 , y 3 · · · , y N ), where N is the total number of (y, p) pairs of all the ensemble members. Then, we multiply P by Y to create a matrix, Q, expressed as Because Q is a symmetric matrix owing to the fact that p T y = p T Ap, and A is symmetric, it can be decomposed as where U and L are the eigenvector and eigenvalue matrices (N × N) of Q, respectively. We should note here that the matrix size of Q (N × N) is generally not as large compared with that of H or B (n × n). Theoretically, Q is positive definite and all the eigenvalues are positive. However, too small or negative eigenvalues arise from computational errors. Therefore, those eigenvalues are filtered out from L to ensure computational stability. In this study, we filtered out eigenvalues whose ratio against the maximum eigenvalue was less than 1.0 × 10 −6 . Finally, the p and y vectors are transformed as where L ′ consists of the effective eigenvalues of L, whose size is N ′ × N ′ for N ′ < N, and U ′ is the corresponding eigenvector matrix, whose size is N × N ′ . Here, N ′ is the total number of effective eigenvalues after the filtering. The transformed matrices P ′ and Y ′ (n × N ′ ) satisfy and the conjugacy property is satisfied as To estimate the inverse Hessian in Algorithm 1, these modified vectors of y ′ and p ′ and N ′ replace y, p, and N, respectively. Then, as described in Section 3.1.2, with the preconditioning of H 0 = B, the analytical inverse Hessian can be obtained if N ′ = m. It is theoretically true that the exact line search is not necessarily required when one performs the described orthogonalization, which alone achieves conjugacy among all the vector pairs. However, we recommend the combination of the exact line search and the orthogonalization because we have experienced that their combination stabilises the estimation of the posterior error covariance matrix in the experiments that are described below. Furthermore, it is also encouraged because the exact line search does not require any additional computation (the calculation of g k that requires the forward and adjoint calculations in a conventional BFGS method is replaced by Equation (32) and g k is obtained in the following simple calculation of Equation (33)) and it is sometimes more efficient than the usual line search owing to the fact that an iterative interpolation calculation is not needed to obtain an acceptable value of k .

EXPERIMENTS
To evaluate the efficiency and accuracy of our proposed method, we conducted a set of numerical experiments. First, we performed a simple one-dimensional (1D) advection test that was intended to confirm whether the proposed BFGS method worked as theoretically expected. Then, we performed experiments using the case of atmospheric CO 2 inversion with a state-of-the-art three-dimensional (3D) model at two different resolutions. The lower-resolution experiment was designed so that we could obtain an analytical matrix of the posterior error covariance to quantify the accuracy of a matrix estimated using the BFGS method. The higher-resolution experiment, in which the numbers of flux parameters were increased in both space and time, was performed to evaluate the practical usability of the method and also to demonstrate values of the posterior error covariance matrix in terms of observational constraint information.
In the evaluations of the aforementioned experiments, we compared our new BFGS method (hereafter denoted as "the conjugate BFGS method") with another BFGS method proposed by Bousserez et al. (2015), to elucidate the importance of the conjugacy we proposed. Bousserez et al. (2015) used the BFGS formula in conjunction with a Monte Carlo-based ensemble and also introduced cyclic calculations. However, their approach featured a regular quasi-Newton method in which a usual (non-exact) line search was employed, and, thus, the conjugacy is not imposed onto the vector pairs of (y, p). In this study, we modified the cyclic method of Bousserez et al. (2015) (details are described in Appendix A), and combined it with the original POpULar using a non-exact line search (hereafter denoted as "the non-conjugate BFGS method"). In this method, the conjugacy is unsatisfied, but the same preconditioning of H 0 = B as in the proposed method is applied. Note that the cyclic method was not used in the conjugate BFGS method because it is not effective when the conjugacy is satisfied (see Appendix A).

Simple 1D advection test
For the simple 1D advection test, we made a toy model that advects a Gaussian distribution with a constant wind (10 m⋅s −1 ), as shown in Figure 1. The 1D model's domain ranged from 0 to 1,000 km and the analysis period was 900 min. The interval of the spatial discretization was set as 10 km, that is, n = 100. Here, the numerical advection was calculated using the van Leer Scheme I (van Leer, 1977) and its adjoint model was also constructed so that the adjoint relationship of (Mx In this experiment, we had five observational locations within the domain (denoted by crosses in Figure 1) and each made three observations during the period with a 300 min interval (t = 300, 600, 900 min); consequently, m = 15. In this experiment, the parameter to be optimised was the initial state (n = 100) and its prior estimate was set to zero everywhere in the domain. For simplicity, the error covariance matrices of B and R were set as diagonal matrices and their diagonal elements were uniquely set to 0.2 and 0.9, respectively. In this experiment, the ensemble was not used because m was so small that a single calculation was enough to achieve convergence.

CO 2 inversion by NICAM-TM 4D-Var
The CO 2 inversion experiment used the 4D-Var inverse system, named the NICAM-TM 4D-Var (Niwa et al., 2017a;2017b), which is based on the Non-hydrostatic ICosahedral Atmospheric Model (NICAM: Tomita and Satoh, 2004;Satoh et al., 2008;)-based Transport Model (NICAM-TM: Niwa et al., 2011). The NICAM adopts an icosahedral grid system consisting of hexagonal and pentagonal grids, instead of the conventional latitude-longitude square grid. The NICAM-TM 4D-Var inverse system uses the forward and adjoint tracer transport models of the NICAM-TM (Niwa et al., 2017a). These models, which are designed "offline", only calculate tracer transport using prescribed meteorological data. Therefore, they are computationally efficient since the number of iterations required for a 4D-Var calculation is feasible. The forward and adjoint transport models have nonlinear and linear options in their advective calculations. The nonlinearity in the advection calculation stems from having a numerical limiter equipped to achieve tracer monotonicity (Thuburn, 1996;Miura, 2007). In the linear option, that numerical limiter is turned off, achieving the complete linearity of the model as well as the adjoint relationship of (Mx) T d o = x T (M T d o ), within the limits of computer machine accuracy; however, monotonicity is not assured and some negative values arise. Here, we used the linear option of the NICAM-TM for the strict linear requirement of the conjugate BFGS method. NICAM-TM 4D-Var originally uses POpULar for optimisation (Niwa et al., 2017b). In this study, we replaced it with the linear POpULar as described in Section 3.2.1.
Using NICAM-TM 4D-Var, we performed the lowand high-resolution CO 2 inversion experiments adopting the settings of the identical twin experiment by Niwa et al. (2017b). For the low-resolution experiment, we used a reduced horizontal resolution of 480 km (number of grid cells = 2,562) and only the monthly mean flux for July 2010 was optimised. For the high-resolution test, we used a horizontal resolution of 240 km (number of grid cells = 10,242) and the mean fluxes for all 12 months in 2010 were optimised. Therefore, the number of flux parameters to be optimised for the low-and high-resolution experiments were 2,562 (= n low ) and 122,904 (= n high ), respectively.
For the prior flux error covariance B, we prescribed the diagonal elements based on the prior flux data used in Niwa et al. (2017b). Specifically, a terrestrial flux error (square root of diagonal element of B) represented the square root sum of the squares of the gross primary production (GPP) and respiration (RE) values multiplied by 0.5 (this number was empirically determined). The values of GPP and RE were derived from the terrestrial biosphere model, Vegetation Integrative Simulator for Trace Gases (VISIT: Ito and Inatomi, 2012). The ocean-flux error was defined as the multi-year standard deviation of prior ocean-flux data (Iida et al., 2015). Figure 2a shows the spatial pattern of the prepared prior flux errors for the low-resolution experiment.
To consider error correlations, the off-diagonal elements of B were defined as where l i,j is the horizontal length between the different flux locations whose element indices are denoted as i and j, and L c is the correlation scale length. Here, L c was set to 500 and 1,000 km for terrestrial and ocean areas, respectively. In practice, the full calculation of Equation (42) is computationally expensive, therefore we truncate it by setting B i j = 0 for l i,j > 4L c . In fact, under a spatially and temporally sparse observation network, introducing off-diagonal elements that consider error correlations can improve posterior flux estimates (e.g. Niwa et al., 2017b). The synthetic observational data of atmospheric CO 2 concentrations assumed weekly flask sampling at 65 ground-based stations (see Niwa et al. (2017b) for the detailed list). Therefore, the numbers of observations for the low-(1-month-long: 4 weeks) and high-(1-year-long: 52 weeks) resolution experiments were 260 (= m low ) and 3,380 (= m high ), respectively. The geographic locations of the 65 observation sites are shown in Figure 2. For the error covariance matrix of the model-observation mismatches, R, we assumed a diagonal matrix and we set a value of 1 ppm 2 for all the diagonal elements.

Diagnostic measures
For the robust evaluation of the accuracy of the estimated posterior error covariance matrices, we used the three diagnostic measures defined in the following items. In the second and third measures below, we denote the estimated and analytically obtained posterior error covariance matrices, the sizes of which are n × n, as P est a and P anl a , respectively.

• Total uncertainty
This measure represents the integrated uncertainty.
Vector a represents the spatial integration operator vector, for example, a T x represents the global total amount of surface fluxes (scalar) for the global inversion problem. If x contains different time states, a can be integrated further in time.
This quantifies the difference between P est a and P anl a using the square root of the sum of the squared differences of all the elements. (45) This measure also represents a difference but in another aspect; it was first introduced and named divergence by Ueno and Nakamura (2016). Divergence is non-negative and it becomes zero if P est a = P anl a .

• Divergence
In the simple 1D advection test and the low-resolution CO 2 inversion experiment, we can easily obtain P anl a because of its small matrix size (n = 100 and 2,562 for the simple 1D advection test and the low-resolution inversion experiment, respectively). We obtained each P anl a by analytically calculating Equation (6), by performing forward simulations with a pulse flux at each model grid cell and their corresponding adjoint simulations n times, followed by the inverse calculation of the n × n matrix. In contrast, for the high-resolution CO 2 inversion experiment, P anl a was not available due to its huge matrix size. Therefore, we only used the measure of the total uncertainty (Equation (43)). Figure 2b shows the distribution of the posterior flux errors derived using the analytical calculation in the low-resolution CO 2 inversion experiment. The posterior flux error is defined as the square root of a diagonal element of the posterior error covariance matrix, that is, i ≔ √ P a ii .

Simple 1D advection test
In Figure 3, we show the results of the simple advection test using the conjugate and non-conjugate BFGS methods. As a reference, we show the cost function changes along iterations in the parameter (initial state) optimisation in Figure 3a, in which the results by the exact and non-exact line searches, that is, by the linear POpULar and the original POpULar, respectively, are presented. Note here that the calculation of the parameter optimisation is independent of the estimation of the posterior error covariance matrix. Figure 3a shows that this experiment was so simple that the convergence of the optimisation was achieved within just two iterations, and this was true regardless of whether the exact line search was used. However, in the estimation of the posterior error covariance matrix, the convergence required many more iterations and the exactness of the line search, that is, conjugacy became important. Figure 3b-d show the diagnostic measures of the estimated posterior error covariance matrices that were derived from the (y, p) vector pairs of the optimisation calculations. At least 15 iterations were required for convergence with the conjugate BFGS method. This number is consistent with the number of iterations theoretically required by the preconditioning (m, i.e. 15 in this case, see Section 3.1.2). However, with the non-conjugate BFGS method, convergence was not achieved, even after 15 iterations. Furthermore, the calculation was stopped before any convergence due to its computational instability (not shown). After 15 iterations, the conjugate BFGS method achieved almost zero values for the Frobenius norm and the divergence, and reached 6.3 for the total uncertainty; this was the same as the analytically obtained value, indicating that the conjugate BFGS method worked as theoretically expected. In the non-conjugate BFGS method, the cyclic method was used with the non-exact line search and it improved the performance to some extent (Appendix A), but never reaching the level of the conjugate BFGS method. This fact demonstrates the significant advantage of the exact line search used in the conjugate BFGS method. Note here that the ensemble was not used given the small size of the problem. Figure 4 presents distributions of the posterior error and how the estimated distribution changed with number of iterations, compared to the analytically obtained distribution. At the 15th iteration, the conjugate BFGS method produced an error distribution that is no longer distinguishable from the analytical one. However, the non-conjugate BFGS method could not reproduce the analytical distribution well.
The off-diagonal elements of the estimated posterior error covariance matrix were also investigated using a  Figure 5 presents the posterior error correlations at the 15th iteration derived by the conjugate and non-conjugate BFGS methods, compared with the ones that were analytically obtained. The error correlations estimated by the conjugate BFGS method were coincident with the analytical set, again demonstrating the theoretical realisation of the conjugate BFGS method. However, the error correlations were poorly calculated using the non-conjugate BFGS method, and erroneous correlations were produced for remote locations.

Low-resolution CO 2 inversion experiment
As demonstrated above, just replacing the usual non-exact line search by the exact one achieved the pronounced improvements for the estimation of the posterior error covariance matrix. This manifests the importance of the conjugacy we proposed. In this section, we show the results of the CO 2 inversion experiments, in which the

F I G U R E 5 Posterior error correlations between x and x + Δx
in the simple 1D advection test, derived after 15 iterations using (a) the analytical calculation, and by the (b) conjugate (this study) and (c) non-conjugate BFGS methods number of observations was much larger and therefore we required the introduction of the ensemble.

Single case
First, we show how the methods performed without the ensemble, that is, using only one member as in the previous simple 1D advection test. Figure 6a presents the cost function reduction associated with increased numbers of iterations. It shows that the exactness of the line search does not largely affect the convergence speed in the optimisation of the control variables (i.e. the flux estimation), which is the same behaviour as in the simple 1D advection test (here, the optimisation used the linear and original POpULar algorithms for the conjugate and non-conjugate cases, respectively). In both cases, approximately 15 iterations are required for convergence. Figure 6b-d show how closely the estimated matrix matches the analytical one, as the number of iterations increased. As in the simple 1D advection test, the posterior error covariance estimation requires more iterations for convergence than the flux estimation. Furthermore, the convergence speed of the conjugate BFGS method is much faster than that of the non-conjugate one, as was the case in the simple 1D advection test. Nevertheless, the conjugate BFGS method did not fully achieve convergence within 100 iterations. The estimated total uncertainty almost reached the true value of 0.32 PgC⋅year −1 after 100 iterations and the divergence was also nearly zero after 70 iterations; however, the Frobenius norm did not achieve convergence, even after 100 iterations. This result also indicates that the Frobenius norm is the strictest measure of the three. Figure 7 shows the relative "error" distributions of the posterior errors estimated using the conjugate BFGS method with 10, 50 and 100 iterations, and that from the non-conjugate BFGS method with 100 iterations; these values are represented as percentages of the analytical values, that is, Figure 7 demonstrates that, in the conjugate BFGS method, the posterior errors were insufficiently estimated during the first stage, especially in regions with dense observations (e.g. Europe), but that such insufficiency was gradually mitigated as the iteration proceeded. After 100 iterations, the posterior errors were nearly compatible with the analytical ones. The estimate of the off-diagonal elements was similarly improved (Figure 8). Here, we evaluated the off-diagonal elements using the error correlations defined by Equation (45), and j of Equation (45) was fixed to a grid-point flux on East Asia; the distributions of r i,j along i are presented. Figure 8 shows that the distribution of the error correlation gradually approached the analytical results in the conjugate BFGS method. However, both the Figures 7 and 8 show that the non-conjugate BFGS method did not sufficiently estimate the posterior error covariance, even after 100 iterations.
F I G U R E 7 Accuracy of the posterior errors estimated by the conjugate BFGS method at the (a) 10th, (b) 50th, and (c) 100th iterations, and (d) that by the non-conjugate BFGS method at the 100th iteration, which are all derived from a single member in the low-resolution CO 2 inversion. Also shown are those for the 50-ensemble member cases with 15 iterations: (e) conjugate, (f) non-conjugate. Here, the error is defined as As mentioned earlier, the conjugate BFGS method did not fully achieved convergence even at the 100th iteration. This is not surprising because convergence theoretically requires 260 iterations (= m low ), at most, as described in Section 3.1.2. Nevertheless, the state of the model nearly approached convergence at the 100th iteration, indicating that the number of effective eigenvalues of the second term of the preconditioned Hessian (Equation (15)) was much smaller than 260. However, this slow convergence might not be suitable for practical use, although it depends on the degree of accuracy required. Therefore, we adopt the ensemble approach to accelerate convergence. Figure 9 shows how the use of the ensemble accelerated the convergence speed. Here, we used 50 ensemble members that were arbitrarily generated. As described in Section 3.2.2, all the vector pairs of (y, p) in the conjugate BFGS method were modified to achieve the conjugacy, using the orthogonalization in conjunction with the exact line search. We also performed the calculation using the non-conjugate BFGS method with 50 ensemble members. Figure 9 demonstrates that, when the 50 ensemble members are used in the conjugate BFGS method, the Even with the strictest measure, that is, the Frobenius norm, convergence was achieved within approximately 15 iterations and the value reached near zero. As in the single-member case, this convergence speed was much faster than that of the non-conjugate BFGS method. The With the conjugate BFGS method, we investigated the sensitivity to the number of ensemble members. When the number of the ensemble members was reduced from 50 to 10, the required number of iterations for convergence increased threefold; the total uncertainty reached the analytical value at the 39th iteration and the Frobenius norm approximately converged after 60 iterations. This demonstrates that we can obtain a faster convergence speed with a larger number of ensemble members. However, the iteration numbers required for convergence in the 10-member case are not five times larger than those of the 50-member case, suggesting that the efficiency of the ensemble decreases as the number of members employed increases.

Ensemble case
As described in Section 3.2.2, in the conjugate BFGS method, all the (y, p) pairs were transformed to (y ′ , p ′ ) for conjugacy, and some vector pairs were filtered out as part of this transformation. Consequently, the available number of (y ′ , p ′ ), N ′ , must have been smaller than the original available number of (y, p), N. Figure 10a shows the total uncertainty change with N ′ in the 10and 50-member cases. It clearly demonstrates that the convergence speed against N ′ is almost the same for the 10-and 50-member cases, indicating that N ′ determines the estimation accuracy of the posterior error covariance matrix. Figure 10b shows the change in the ratio N ′ /N associated with the number of iterations. It indicates that the transformation filters out more vectors as the iteration proceeds, and that its extent is enhanced with larger ensemble members, which in fact induced the decrease of the ensemble efficiency. In summary, these results suggest that the matrix, Q, of Equation (36) produced from more iterations would be better conditioned than that produced by a larger ensemble, if N were the same. In practice, if the total computational costs were limited, it would be better to perform a large number of iterations with a small number of ensemble members. In contrast, if one wanted to prioritise computational speed and if massive parallel computing were feasible, a large number of ensemble members might be preferable.

High-resolution CO 2 inversion experiment
Finally, we investigate the feasibility of the conjugate BFGS method in a practical problem and also demonstrate the utility of the posterior error covariance matrix. In this experiment, we used a typical spatio-temporal parameter size that adopted the half-grid size of the previous low-resolution experiment, that is, approximately 240 km, and we estimated 12 monthly mean fluxes for the entire year instead of only 1 for the month of July.
In this experiment, approximately 20 iterations were required for the convergence of the flux optimisation ( Figure 11a). As already demonstrated, the estimation of the posterior error covariance matrix required many more iterations than the flux estimation, and the same was true even when 50 ensemble members were employed (Figure 11b). With 50 members, the conjugate BFGS method dramatically reduced the total uncertainty within a few iterations, although this reduction subsequently slowed. With a larger ensemble size of 200, the convergence feature was similar but the total uncertainty was significantly smaller than with the 50 members.
When one sees the number of (y ′ , p ′ ) vector sets, N ′ , the convergence appears to require N ′ > 1,000 (Figure 11c), which corresponds to 46 iterations using the 200 ensemble members. This suggests the required value of N ′ is only 10 times larger than that of the low-resolution experiment (Figure 10a). This number is similar to the increased ratio of the number of the observations, that is, m high /m low = 13 (from 4 to 52 weeks), rather than that of the flux parameters, that is, n high /n low = 48 (4 times in space and 12 times in time). In fact, this is consistent with the theoretical fact that the number of observations determines the number of iterations required for convergence; at most m iterations are required if the conjugacy is satisfied and the preconditioning of Lorenc (1988) is used (Section 3.1.2).
Thus, although a larger number of ensemble members are required than in the low-resolution experiment, we confirmed the feasibility of the developed method at the high resolution, which is practically applied using real observations. A posterior error covariance matrix estimated for a 1-year period would provide useful information that varies seasonally. One valuable diagnostic parameter that can be derived from the posterior error covariance matrix is the error reduction ratio, which is defined as where pos and pri are posterior and prior errors, respectively, and i denotes each flux component. This explains how errors have decreased from their corresponding prior errors and it is useful to diagnose where, when, and by how much the observations constrain flux estimates, in the inverse analysis. Monthly distributions of the error reduction ratio defined by Equation (47) are shown in Figure 12.
In the series of the distributions, we find clear temporal variation in the observational impacts over terrestrial areas. In boreal summer (May-October), the error reduction ratios are notable, whereas they are marginal in other months. This is especially true for inland areas located far from the observation sites. This is attributable to the larger prior errors seen in terrestrial areas in the summer, which have enough room to be constrained by the observations. In contrast, in the winter, the prior errors of terrestrial areas are comparatively small, resulting in strong prior constraints. Furthermore, the stratified atmosphere of winter might have limited the observational impacts near the observation sites. Ocean fluxes are tightly constrained by small prior errors and, consequently, the observational impacts are generally limited. Figure 13 shows the posterior error correlations against the flux in East Asia for July and January, which are similar to those in Figure 8e, but instead are derived from the high-resolution experiment and focus on Asian regions. Compared with Figure 8e, the spatial pattern for July looks generally similar but shows more detailed structures in East Asia (Figure 13a). Surrounding areas around target flux have positive correlations. In fact, those correlations F I G U R E 12 (a-l) Distributions of the error reduction ratio for January-December 2010, as derived by the conjugate BFGS method with 200 ensemble members and 100 iterations in the high-resolution CO 2 inversion experiment. Magenta triangles represent the locations of the observational sites were inherited from the prior error covariance, which was defined by Equation (42) using L c = 500 km. Beyond these positive correlations, terrestrial grids exhibit negative correlations, followed by positive correlations in part, just like a ripple. In January, the positive correlations surrounding the target flux become pronounced, indicating a larger contribution from the prior error correlation. This is consistent with the smaller error reduction ratios shown by Figure 12, that is, weaker observational constraints in January than in July for this region. Furthermore, negative correlations show a different spatial pattern in July, located on the western side of the positive correlations. These posterior error correlation distributions are determined by the prior errors, airflow patterns, model-observation mismatch errors, and timings and locations of observations. The former two factors differ by seasons, causing such different correlation distributions.
To evaluate the global features of the posterior error correlations, we shrank the error covariance matrix through aggregation as where G is an operator matrix that spatially integrates fluxes by regions and P agg a is the aggregated posterior error covariance matrix. Figure 14 shows the posterior errors and their correlations among aggregated regions for each month, which were derived from diagonal and off-diagonal elements of P agg a , respectively. Here, the aggregation was performed by integrating the fluxes into 13 areas comprising 11 terrestrial regions (A-K) defined by the TransCom 3 project (Gurney et al., 2002;Baker et al., 2006), the global ocean (L) and the global total (M). In Figure 14, we found a number of existing negative correlations at each month. This indicates that, even after the aggregation, the fluxes were not independently estimated. Specifically, error correlations persistently exist throughout the year between tropical America (C) and temperate South America (D), between northern (E) and southern Africa (F) and between tropical (I) and temperate Asia (H), thus indicating that it is difficult to clearly distinguish fluxes within the continents of South America, Africa and southern Asia. Meanwhile, some correlations differ according to season. We found significant negative correlations between boreal (A) and temperate North America (B) for May-September, and between boreal (G) and temperate Asia (H) for July; they are not pronounced in the other seasons. Plotting error correlations against the global total (M) are also useful, which are depicted in the bottom row of each panel in Figure 14. They are positive values and represent the size of each regional flux estimate's contribution to the global total uncertainty. Figure 14 shows that significant positive correlations against the global total uncertainty persisted in the regions of South America (C and D), Africa (E and F) and tropical Asia (I). They were in fact caused by significant observational gaps in these regions. Such positive correlations quantify the observational insufficiency for the global CO 2 flux estimate. This would help us to interpret the information contained in the global flux estimates and also to consider the future design of global observational networks.

DISCUSSION
This study proposed a BFGS-based method for estimating a posterior error covariance matrix that is novel in its focus on the conjugacy among (y, p) vector pairs for September, and (f) November, estimated using the conjugate BFGS method in the high-resolution CO 2 inversion experiment; regional labels listed in (a). The numbers along the diagonal represent the posterior errors (PgC⋅month −1 ) and the colours denote the error correlations a linear inverse problem of an atmospheric constituent; it achieves a high accuracy and efficient (theoretically expected) convergence speed. In fact, the basic idea of our method, using the conjugate vector pairs in the BFGS formula, is the same as "the quasi-Newton Limited-Memory Preconditioner (QN-LMP)" referred to by Tshimanga et al. (2008). However, differently from the study of Tshimanga et al. (2008), which aims at acceleration of optimisation, we devoted the method to estimate a posterior error covariance matrix as accurate as possible for elaborating uncertainties of an atmospheric constituent inversion. According to Tshimanga (2007), if the information of all the previous iterations is exploited, QN-LMP is analytically equivalent to "the Ritz LMP," which is recommended by Tshimanga et al. (2008). The Ritz-LMP method uses approximated eigenvalues and eigenvectors, which are so-called Ritz pairs (Golub and van Loan, 1996). The drawback of QN-LMP is that it requires twice as much computer memory as Ritz-LMP. This is true for our method, because it employs two series of the vectors (y, p). However, it is compatible with the ensemble approach that we additionally introduced, because the inevitable non-conjugacy between different members can be modified by the orthogonalization with the simple multiplication of p by y (Equation (36)). In addition to the requirement of the two series of the vectors, we found that the method needs a few hundred ensemble members for practical use, which is much larger than that of Bousserez et al. (2015) (a few members). Furthermore, we fully store the information obtained from all the previous iterations (i.e. not "limited-memory"). However, storing all these data is feasible in a practical inverse analysis of an atmospheric constituent, because the control vector consists of surface fluxes (two-dimensional in space). Fisher and Courtier (1995) and Meirink et al. (2008) used eigenvalues and eigenvectors of the Hessian obtained by the combination of the conjugate gradient method and the Lanczos method and they calculated the posterior error covariance matrix with an equation equivalent to "the spectral Limited-Memory Preconditioner LMP" referred to by Tshimanga et al. (2008). In practice, their eigenvalues and eigenvectors are approximated by Ritz pairs from a limited number of iterations and are not exact ones. If the Ritz-LMP is applied, their method would provide an equivalent posterior error covariance matrix to that from our method. It is also noted that, in the Lanczos method, the orthogonality among the approximated eigenvectors tends to deteriorate as the number of the vectors increases due to computational errors (Golub and van Loan, 1996).
One limitation of our method is that it imposes linearity on a model. This linearity should be ensured in a discretised programme level. In this application, using a model that only deals with linear processes (e.g. atmospheric transport) is not enough because such a model often includes a nonlinear numerical scheme in discretised programmes, intentionally added for a given reason, such as ensuring tracer monotonicity. Any such nonlinear numerical scheme should be turned off in advance to ensure compatibility with our BFGS-based method. Furthermore, its corresponding adjoint model is required so that the adjoint relationship is satisfied.
Application of the conjugate BFGS method to a nonlinear problem (e.g. with a nonlinear numerical scheme) would not work well without modification. First, the exact line search is not applicable to a nonlinear problem. This is readily understood from the result of Niwa et al. (2017b) that the linear conjugate gradient method, which uses the exact line search and hence is equivalent to the conjugate BFGS method, failed to optimise fluxes when applied to a nonlinear model. In addition, the orthogonalization procedure for the conjugacy is also not applicable because Q (=P T Y) becomes asymmetric when the problem is nonlinear. Modification of the conjugate BFGS method for nonlinear problems is, thus, a future study.
Nevertheless, it may well be that a nonlinear (well-behaved) numerical scheme improves model accuracy and consequently provides better estimates in an inverse analysis. In fact, Niwa et al. (2017b) found this to be true for NICAM-TM 4D-Var, which has both linear and nonlinear options. Meanwhile, Niwa et al. (2017b) also found that the nonlinear effect is limited to a small degree, indicating that the cost function structure of the nonlinear case, and hence its Hessian, might not differ largely from that of the linear case. Therefore, in an intrinsically linear inverse problem, it would be acceptable to, on the one hand, use a linear model for posterior error covariance estimation and, on the other hand, use a nonlinear model for flux estimation.
The most prominent feature of our method is the highly accurate (almost analytical) estimation with a reasonably feasible computational cost. As shown in Figure 9, the non-conjugate BFGS method can achieve almost the same accuracy as our method, but it requires several times more iterations. Furthermore, the non-conjugate BFGS method requires a greater computational cost when calculating the BFGS formula. Our method first performs inexpensive orthogonalization and reduces the number of (y, p) vector pairs from N to N ′ . For instance, the reduction ratio was about 90% at the 60th iteration in the low-resolution CO 2 inversion experiment, with 50 members (Figure 10). However, the non-conjugate BFGS method fully uses N vector pairs, and when the cyclic method is employed, the practical number of the vector pairs introduced in the BFGS formula is L × N, where L is the number of cycles. Therefore, the calculation of the BFGS formula differs by two or more orders between the conjugate and non-conjugate methods. This is critical when the number of model parameters, n, is large, and in fact, it is computationally demanding for the typical size of an inverse analysis.
Another virtue of our method is its flexibility regarding the initial ensemble perturbation. Because of its stochastic nature, the Monte Carlo method has to perform the perturbation along with B and R, which often requires eigenvalue decomposition of those matrices (Chevallier et al., 2007). However, our method generates ensemble members using plain random numbers, and therefore, it does not require the decomposition of any large-size matrix. Although eigenvalue decomposition should be performed in Equation (37), its matrix size is much smaller than B. Furthermore, we coupled the conjugate BFGS method with the optimisation scheme named POpULar, in which the implicit preconditioning is employed to avoid the calculations of B −1 and B 1/2 and the exact lines search is implemented without any additional numerical cost. Thus, the required computational techniques are completely simple. The preconditioning is imperative for efficient estimation of a posterior error covariance matrix. As described in Section 3.1.2, the preconditioning assures convergence within at most the number of observations, which is generally smaller than the number of the parameters in an inverse problem of an atmospheric constituent.
Nevertheless, the method's computational cost should be further reduced in the future because it is more expensive than the flux estimation. A wall-clock-time can be reduced by introducing an ensemble with parallel computing but its efficiency decreases as the number of members increases, as shown by Figure 10b. One prospective way to improve the ensemble efficiency might lie in the initial ensemble perturbation; however, the convergence speed was not significantly altered by the different initial perturbations we have tested so far (changing perturbation magnitudes or perturbing along with B and R, as the Monte Carlo method does). Further investigations and developments might be required but they are left for future study.

CONCLUSIONS
This study proposed to use a method based on the BFGS formula for estimating a posterior error covariance matrix in the inversion of an atmospheric constituent. By limiting the method to an exactly linear case in use and ensuring the conjugacy, we achieved a high accuracy and efficiency. The method can use the ensemble technique to accelerate the convergence speed and hence, it is compatible with massive parallel computing. Our method employed the exact line search to satisfy the conjugacy among the available (y, p) vector pair set. When using the ensemble approach, further transformation of the (y, p) vectors through eigenvalue decomposition, ensured conjugacy beyond an ensemble member. We proved that the conjugacy combined with the preconditioning yields a preferable property: only the same number of vector pairs as the observations are required for it to converge with the analytical solution of a posterior error covariance matrix. This fact was confirmed by conducting a simple 1D advection test. The test clearly demonstrated the importance of the conjugacy for efficient convergence. The experiment of the CO 2 inverse problem showed that the method has the ability to produce a satisfactory accurate posterior error covariance matrix within a reasonable number of iterations, not only for diagonal but also for off-diagonal elements. Furthermore, compared with another ensemble that does not achieve the conjugacy, our method is superior in terms of convergence speed. Such accurate estimation of a posterior error covariance matrix would be useful for quantitative evaluation of observational impacts and for design of optimal observational networks. Furthermore, it could also yield insights regarding posterior flux estimates by elucidating their independency. was also supported by the JSPS KAKENHI Grant Number 19K03976, the ISM Cooperative Research Program (2019-ISMCRP-2030) and the FLAGSHIP2020, MEXT, within priority study 4 (advancement of meteorological and global environmental predictions utilising observational "Big Data"). Y. N. thanks the staff members at JAM-STEC, RIKEN, and the University of Tokyo for developing the NICAM. The calculations of this study were performed on the FUJITSU PRIMEHPC FX100 supercomputer system of the Meteorological Research Institute. In this study, we modified the method of Bousserez et al. (2015) and have referred to this form as "the non-conjugate BFGS method" in the evaluation of our conjugate BFGS method. Bousserez et al. (2015) similarly used the BFGS formula, but they proposed using a diagonal matrix derived from the diagonal elements of the approximated inverse Hessian at the previous iteration, that is,

ORCID
where k is the iteration counter. Note here that, in practice, one may use a different initial approximation of the inverse Hessian at each iteration if it is a positive definite symmetric matrix. It is expected that Equation (A1) improves the initial approximation of the inverse Hessian as the iteration proceeds. Furthermore, Bousserez et al. (2015) also introduced a method that performs the BFGS calculation cyclically, whereby the same (y, p) vector pairs are used several times while the initial approximation of the inverse Hessian is updated using the previous cycle as Equation (A1). From this cyclic method, the initial approximation of the inverse Hessian is further improved as the cycle proceeds, resulting in an improved estimate of a posterior error covariance matrix.
Here, we adopt the approach of the cyclic method in the BFGS calculation of Algorithm 1, but modify the initial approximation of the inverse Hessian. Instead of Equation (A1), we simply use the full matrix of the approximated inverse Hessian of the previous cycle as where l represents the counter of the cycles and changes from 1 to L. This is achieved by simply reusing the (y, p) vector sets L times, which uses almost the same equation as Equation (10); where i ranges from 1 to M( ≔ L × N). Note here that N is the available number of (y, p) vector pairs. When an ensemble is used, N is the total number of (y, p) pairs from all the ensemble members. The above (y, p) vectors satisfy y i = y j and p i = p j for i = j (mod N) and 1 ≤ j ≤ N.
(A4) This modification is intended to retain the off-diagonal elements of the approximated inverse Hessian matrix. In the method of Bousserez et al. (2015), Equation (A1) resets the off-diagonal elements to zero at each cycle, which might constrict the improvement of the accuracy of the initial approximation of the inverse Hessian, when the off-diagonal elements are significantly large. This is especially true when a prior error covariance matrix is designed as a non-diagonal matrix. For the initial approximation in Equation (A3), we set H 0 = B as the conjugate BFGS method proposed in this study.
Note here that the method we proposed does not employ this cyclic calculation. That is because if the conjugacy is satisfied, then adding the same vector pairs into the BFGS formula does not make a difference. This is proven later. In this method, the (y, p) vector pairs are prepared similarly through iterative calculations of the BFGS-based quasi-Newton method as the conjugate BFGS method, but with a usual (non-exact) line search; that is, we used the original POpULar (Fujii, 2005). Bousserez et al. (2015) combined their BFGS method with a Monte Carlo method. We also coupled the above modified method with an ensemble approach, but not with the Monte-Carlo method. When an ensemble is used, we perturb observations around M(x pos ), where x pos is the posterior flux vector obtained by a single optimisation calculation, that is, where i is an ensemble member index and d o, i is the perturbation vector that is generated by simple random values (the same as our proposed method). Bousserez et al. (2015) perturbed observations according to R and also perturbed x pri according to B as well. However, we found that this observation-only perturbation was enough, because there was almost no difference between the above simple perturbation and the Monte Carlo perturbation in the CO 2 inversion experiment (not shown).
In the analysis that compared our proposed method with the non-conjugate BFGS method (described in the main text), we coherently set L = 10. One may increase this number although a larger number requires a proportionately more computational cost. Figures A1 and A2 shows how this cyclic calculation improved the results.
Furthermore, we here prove that the cyclic method does not affect the estimate of a posterior error covariance Frobenius norm, and (c) divergence in the low-resolution CO 2 inversion. The panels show the results from a single member with the non-conjugate BFGS method without the cyclic calculation (solid line) and with 10 cycles (dotted line) (same as "non-conjugate" in Figure 6b-d) (a) (b) (c) F I G U R E A2 Same as Figure 9, but the grey solid line indicates the non-conjugate BFGS method without the cyclic calculation. The grey dotted line is the same as Figure 9 (the non-conjugate BFGS method with 10 cycles). Both the two lines are derived from 50 ensemble members whose diagonal elements ( 1 , 2 , … , m ) are singular values of X. Then, is represented by a linear combination of the right-singular vectors as Therefore,g 0 can be rewritten as This gradient is used as the first search direction, consequently, Here, let us make the induction hypothesis thatx k ′ ,p k ′ and g k ′ −1 are expressed as linear combinations of {u 1 , u 2 , … , u m } for k ′ ≤ k. Then,g k can be also expressed as a linear combination of {u 1 , u 2 , … , u m } as follows: where c ′ k,i = (1 + 2 i )c k,i and b k, i = c ′ k,i + i a i . Consequently, y k (=g k −g k−1 ) is also expressed as a linear combination of {u 1 , u 2 , … , u m }. Moreover,d k could be a linear combination of {u 1 , u 2 , … , u m }, ifd k is derived byg k ′ ,p k ′ and y k ′ that are all expressed as linear combinations of {u 1 , u 2 , … , u m } for 1 ≤ k ′ ≤ k. For instance, by the preconditioned BFGS formula, the next search directiond k can be derived by Equation (17) as