Volume 147, Issue 740 p. 3719-3734
RESEARCH ARTICLE
Open Access

Randomised preconditioning for the forcing formulation of weak-constraint 4D-Var

Ieva Daužickaitė (corresponding author)
School of Mathematical, Physical and Computational Sciences, University of Reading, Reading, UK
Correspondence: Department of Mathematics and Statistics, Pepper Lane, Whiteknights, Reading, RG6 6AX, UK. Email: i.dauzickaite@pgr.reading.ac.uk

Amos S. Lawless
School of Mathematical, Physical and Computational Sciences, University of Reading, Reading, UK
National Centre for Earth Observation, Reading, UK

Jennifer A. Scott
School of Mathematical, Physical and Computational Sciences, University of Reading, Reading, UK
Scientific Computing Department, STFC Rutherford Appleton Laboratory, Didcot, UK

Peter Jan van Leeuwen
School of Mathematical, Physical and Computational Sciences, University of Reading, Reading, UK
Department of Atmospheric Science, Colorado State University, Fort Collins, Colorado
First published: 20 August 2021

Funding information: Engineering and Physical Sciences Research Council, EP/L016613/1; H2020 European Research Council, 694509; NERC National Centre for Earth Observation

Abstract

There is growing awareness that errors in the model equations cannot be ignored in data assimilation methods such as four-dimensional variational assimilation (4D-Var). If allowed for, more information can be extracted from observations, longer time windows are possible, and the minimisation process is easier, at least in principle. Weak-constraint 4D-Var estimates the model error and minimises a series of quadratic cost functions, which can be achieved using the conjugate gradient (CG) method; minimising each cost function is called an inner loop. CG needs preconditioning to improve its performance. In previous work, limited-memory preconditioners (LMPs) have been constructed using approximations of the eigenvalues and eigenvectors of the Hessian in the previous inner loop. If the Hessian changes significantly in consecutive inner loops, the LMP may be of limited usefulness. To circumvent this, we propose using randomised methods for low-rank eigenvalue decomposition and use these approximations to construct LMPs cheaply using information from the current inner loop. Three randomised methods are compared. Numerical experiments in idealized systems show that the resulting LMPs perform better than the existing LMPs. Using these methods may allow more efficient and robust implementations of incremental weak-constraint 4D-Var.

1 INTRODUCTION

In numerical weather prediction, data assimilation provides the initial conditions for the weather model and hence influences the accuracy of the forecast (Kalnay, 2002). Data assimilation uses observations of a dynamical system to correct a previous estimate (background) of the system's state. Statistical knowledge of the errors in the observations and the background is incorporated in the process. A variational data assimilation method called weak-constraint 4D-Var provides a way to also take the model error into account (Trémolet, 2006), which can lead to a better analysis (e.g., Trémolet, 2007).

We explore the weak-constraint 4D-Var cost function. In its incremental version, the state is updated by a minimiser of the linearised version of the cost function. The minimiser can be found by solving a large sparse linear system. The process of solving each system is called an inner loop. Because the second derivative of the cost function, the Hessian, is symmetric positive-definite, the systems may be solved with the conjugate gradient (CG) method (Hestenes and Stiefel, 1952), the convergence rate of which depends on the eigenvalue distribution of the Hessian. Limited-memory preconditioners (LMPs) have been shown to improve the convergence of CG when minimising the strong-constraint 4D-Var cost function (Fisher, 1998; Tshimanga et al., 2008). Strong-constraint 4D-Var differs from weak-constraint 4D-Var by making the assumption that the dynamical model has no error.

LMPs can be constructed using approximations to the eigenvalues and eigenvectors (eigenpairs) of the Hessian. The Lanczos and CG connection (section 6.7 of Saad, 2003) can be exploited to obtain approximations to the eigenpairs of the Hessian in one inner loop, and these approximations may then be employed to construct the preconditioner for the next inner loop (Tshimanga et al., 2008). This approach does not describe how to precondition the first inner loop, and the number of CG iterations used on the ith inner loop limits the number of vectors available to construct the preconditioner on the (i+1)th inner loop. Furthermore, the success of preconditioning relies on the assumption that the Hessians do not change significantly from one inner loop to the next.

In this article, we propose addressing these drawbacks by using easy-to-implement subspace iteration methods (see chapter 5 of Saad, 2011) to obtain approximations of the largest eigenvalues and corresponding eigenvectors of the Hessian in the current inner loop. The subspace iteration method first approximates the range of the Hessian by multiplying it with a start matrix (for approaches to choosing it see, e.g., Duff and Scott, 1993) and the speed of convergence depends on the choice of this matrix (e.g., Gu, 2015). A variant of subspace iteration, which uses a Gaussian random start matrix, is called Randomised Eigenvalue Decomposition (REVD). REVD has been popularised by probabilistic analysis (Halko et al., 2011; Martinsson and Tropp, 2020). It has been shown that REVD, which is equivalent to one iteration of the subspace iteration method, can often generate a satisfactory approximation of the largest eigenpairs of a matrix that has rapidly decreasing eigenvalues. Because the Hessian is symmetric positive-definite, a randomised Nyström method for computing a low-rank eigenvalue decomposition can also be used. It is expected to give a higher quality approximation than REVD (e.g., Halko et al., 2011). We explore these two methods and another implementation of REVD, which is based on the ritzit implementation of the subspace method (Rutishauser, 1971). The methods differ in the number of matrix–matrix products with the Hessian. Even though more computations are required to generate the preconditioner in the current inner loop compared with using information from the previous inner loop, the randomised methods are block methods and hence easily parallelisable.

In Section 2, we discuss the weak-constraint 4D-Var method and, in Section 3, we consider LMPs and ways to obtain spectral approximations. The three randomised methods are examined in Section 4. Numerical experiments with linear advection and Lorenz 96 models are presented in Section 5, followed by a concluding discussion in Section 6.

2 WEAK-CONSTRAINT 4D-VAR

We are interested in estimating the state evolution of a dynamical system $x_0, x_1, \ldots, x_N$, with $x_i \in \mathbb{R}^n$, at times $t_0, t_1, \ldots, t_N$. Prior information about the state at $t_0$ is called the background and is denoted by $x^b \in \mathbb{R}^n$. It is assumed that $x^b$ has Gaussian errors with zero mean and covariance matrix $B \in \mathbb{R}^{n \times n}$. Observations of the system at time $t_i$ are denoted by $y_i \in \mathbb{R}^{q_i}$ and their errors are assumed to be Gaussian with zero mean and covariance matrix $R_i \in \mathbb{R}^{q_i \times q_i}$ ($q_i \ll n$). An observation operator $\mathcal{H}_i$ maps the model variables into the observed quantities at the correct location, i.e. $y_i = \mathcal{H}_i(x_i) + \zeta_i$, where $\zeta_i$ is the observation error. We assume that the observation errors are uncorrelated in time.

The dynamics of the system are described using a nonlinear model $\mathcal{M}_i$ such that
$$x_{i+1} = \mathcal{M}_i(x_i) + \eta_{i+1}, \qquad (1)$$

where $\eta_{i+1}$ is the model error at time $t_{i+1}$. The model errors are assumed to be uncorrelated in time and to be Gaussian with zero mean and covariance matrix $Q_i \in \mathbb{R}^{n \times n}$.

The forcing formulation of the nonlinear weak-constraint 4D-Var cost function, in which we solve for the initial state and the model error realizations, is
$$J(x_0, \eta_1, \ldots, \eta_N) = \frac{1}{2}(x_0 - x^b)^T B^{-1}(x_0 - x^b) + \frac{1}{2}\sum_{i=0}^{N}\left(y_i - \mathcal{H}_i(x_i)\right)^T R_i^{-1}\left(y_i - \mathcal{H}_i(x_i)\right) + \frac{1}{2}\sum_{i=1}^{N}\eta_i^T Q_i^{-1}\eta_i, \qquad (2)$$

where $x_i$ satisfies the model constraint in Equation (1) (Trémolet, 2006). The analysis (approximation of the state evolution over the time window) $x_0^a, x_1^a, \ldots, x_N^a$ can be obtained from the minimiser of Equation (2) using the constraints in Equation (1).

2.1 Incremental 4D-Var

One way to compute the analysis is to approximate the minimum of Equation (2) with an inexact Gauss–Newton algorithm (Gratton et al., 2007), where a sequence of quadratic cost functions is minimised. In this approach, we update $x_0$ and the model error as
$$p^{(j+1)} = p^{(j)} + \delta p^{(j)}, \qquad (3)$$
where $p^{(j)} = (x_0^{(j)T}, \eta_1^{(j)T}, \ldots, \eta_N^{(j)T})^T$ is the $j$th approximation and $\delta p^{(j)} = (\delta x_0^{(j)T}, \delta\eta_1^{(j)T}, \ldots, \delta\eta_N^{(j)T})^T$. The $j$th approximation of the state $x^{(j)} = (x_0^{(j)T}, \ldots, x_N^{(j)T})^T$ is calculated with Equation (1) using $p^{(j)}$. The update $\delta p^{(j)}$ is obtained by minimising the following cost function:
$$J^{\delta}(\delta p^{(j)}) = \frac{1}{2}\,\big\|\delta p^{(j)} - b^{(j)}\big\|_{D^{-1}}^2 + \frac{1}{2}\,\big\|H^{(j)}(L^{-1})^{(j)}\delta p^{(j)} - d^{(j)}\big\|_{R^{-1}}^2, \qquad (4)$$
where $\|a\|_{A^{-1}}^2 = a^T A^{-1} a$ and the covariance matrices are block-diagonal, that is, $D = \mathrm{diag}(B, Q_1, \ldots, Q_N) \in \mathbb{R}^{n(N+1) \times n(N+1)}$ and $R = \mathrm{diag}(R_0, \ldots, R_N) \in \mathbb{R}^{q \times q}$, where $q = \sum_{i=0}^{N} q_i$. We use the notation (following Gratton et al., 2018) $H^{(j)} = \mathrm{diag}(H_0^{(j)}, \ldots, H_N^{(j)}) \in \mathbb{R}^{q \times n(N+1)}$, where $H_i^{(j)}$ is the linearised observation operator, and
$$(L^{-1})^{(j)} = \begin{pmatrix} I & & & & \\ M_{0,0}^{(j)} & I & & & \\ M_{0,1}^{(j)} & M_{1,1}^{(j)} & I & & \\ \vdots & \vdots & \ddots & \ddots & \\ M_{0,N-1}^{(j)} & M_{1,N-1}^{(j)} & \cdots & M_{N-1,N-1}^{(j)} & I \end{pmatrix}, \qquad (5)$$
$$b^{(j)} = \begin{pmatrix} x^b - x_0^{(j)} \\ -\eta_1^{(j)} \\ \vdots \\ -\eta_N^{(j)} \end{pmatrix}, \qquad (6)$$
$$d^{(j)} = \begin{pmatrix} y_0 - \mathcal{H}_0(x_0^{(j)}) \\ y_1 - \mathcal{H}_1(x_1^{(j)}) \\ \vdots \\ y_N - \mathcal{H}_N(x_N^{(j)}) \end{pmatrix}, \qquad (7)$$
where $M_{i,l}^{(j)} = M_l^{(j)} \cdots M_i^{(j)}$ and $M_i^{(j)}$ is the linearised model, that is, $M_{i,l}^{(j)}$ denotes the linearised model integration from time $t_i$ to $t_{l+1}$; $(L^{-1})^{(j)} \in \mathbb{R}^{n(N+1) \times n(N+1)}$; $x^{(j)}, \delta x^{(j)}, b^{(j)} \in \mathbb{R}^{n(N+1)}$; and $d^{(j)} \in \mathbb{R}^{q}$. The outer loop consists of updating Equation (3), calculating $x^{(j)}$, $b^{(j)}$, and $d^{(j)}$, and linearising $\mathcal{M}_i$ and $\mathcal{H}_i$ for the next inner loop.
The minimum of the quadratic cost function (Equation 4) can be found by solving the linear system
$$A^{(j)}\delta p^{(j)} = D^{-1}b^{(j)} + (L^{-T})^{(j)}(H^T)^{(j)}R^{-1}d^{(j)}, \qquad (8)$$
$$A^{(j)} = D^{-1} + (L^{-T})^{(j)}(H^T)^{(j)}R^{-1}H^{(j)}(L^{-1})^{(j)} \in \mathbb{R}^{n(N+1) \times n(N+1)}, \qquad (9)$$
where $A^{(j)}$ is the Hessian of Equation (4), which is symmetric positive-definite. These large sparse systems are usually solved with the conjugate gradient (CG) method, the convergence properties of which depend on the spectrum of $A^{(j)}$ (see Section 3.1 for a discussion). In general, clustered eigenvalues result in fast convergence. We consider a technique to cluster the eigenvalues of $A^{(j)}$ in the following section. From now on we omit the superscript $(j)$.

2.2 Control-variable transform

A control-variable transform, also called first-level preconditioning, maps the variables $\delta p$ to $\delta\tilde{p}$, the errors of which are uncorrelated (see, e.g., Section 3.2 of Lawless, 2013). This can be denoted by the transformation $D^{1/2}\delta\tilde{p} = \delta p$, where $D = D^{1/2}D^{1/2}$ and $D^{1/2}$ is the symmetric square root. The update $\delta\tilde{p}$ is then the solution of
$$A_{\mathrm{pr}}\,\delta\tilde{p} = D^{-1/2}b + D^{1/2}L^{-T}H^TR^{-1}d, \qquad A_{\mathrm{pr}} = I + D^{1/2}L^{-T}H^TR^{-1}HL^{-1}D^{1/2}. \qquad (10)$$
Here, $A_{\mathrm{pr}}$ is the sum of the identity matrix and a rank-$q$ positive semidefinite matrix. Hence, it has a cluster of $n(N+1) - q$ eigenvalues at one and $q$ eigenvalues that are greater than one. Thus, the spectral condition number $\kappa = \lambda_{\max}/\lambda_{\min}$ (here, $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest eigenvalues of $A_{\mathrm{pr}}$, respectively) is equal to $\lambda_{\max}$. We discuss employing second-level preconditioning to reduce the condition number while also preserving the cluster of eigenvalues at one. In the subsequent sections, we use notation that is common in numerical linear algebra. Namely, we use $A$ for the Hessian with first-level preconditioning, $x$ for the unknown, and $b$ for the right-hand side of the system of linear equations. Thus, we denote Equation (10) by
$$Ax = b, \qquad (11)$$
where the right-hand side $b$ is known and $x$ is the required solution. We assume throughout that $A$ is symmetric positive-definite.
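
For concreteness, the following is a minimal Python/NumPy sketch (not the authors' code) of applying the first-level preconditioned Hessian of Equation (10) to a vector; operationally each factor would be applied matrix-free through tangent-linear and adjoint model integrations rather than stored as a dense array.

```python
import numpy as np

def apply_Apr(v, D_half, L_inv, H, R_inv):
    """Apply A_pr = I + D^{1/2} L^{-T} H^T R^{-1} H L^{-1} D^{1/2} to a vector v.

    For illustration the factors are dense arrays; in practice each product
    corresponds to (parallelisable) model or observation-operator sweeps.
    """
    w = R_inv @ (H @ (L_inv @ (D_half @ v)))   # forward sweep to observation space
    w = D_half @ (L_inv.T @ (H.T @ w))         # adjoint sweep back to control space
    return v + w                               # identity plus rank-q SPD term
```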

3 PRECONDITIONING WEAK-CONSTRAINT 4D-VAR

3.1 Preconditioned conjugate gradients

The CG method (see, e.g., Saad, 2003) is a popular Krylov subspace method for solving systems of the form in Equation (11). A well-known bound for the error at the $i$th CG iteration, $\epsilon_i = x - x_i$, is
$$\frac{\|\epsilon_i\|_A}{\|\epsilon_0\|_A} \le 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^i, \qquad (12)$$
where $\kappa$ is the spectral condition number and $\|\epsilon_i\|_A^2 = \epsilon_i^T A \epsilon_i$ (see, e.g., section 5.1 of Nocedal and Wright, 2006). Note that this bound describes the worst-case convergence and only takes into account the largest and smallest eigenvalues. The convergence of CG also depends on the distribution of the eigenvalues of $A$ (as well as the right-hand side $b$ and the initial guess $x_0$); eigenvalues clustered away from zero suggest rapid convergence (lecture 38 of Trefethen and Bau, 1997). Otherwise, CG can display slow convergence and preconditioning is used to try to tackle this problem (chapter 9 of Saad, 2003). Preconditioning aims to map the system in Equation (11) to another system that has the same solution, but different properties that imply faster convergence. Ideally, the preconditioner $P$ should be cheap both to construct and to apply, and the preconditioned system should be easy to solve.
If $P$ is a symmetric positive-definite matrix that approximates $A^{-1}$ and is available in factored form $P = CC^T$, the following system is solved:
$$C^TAC\hat{x} = C^Tb, \qquad (13)$$
where $\hat{x} = C^{-1}x$. Split preconditioned CG (PCG) for solving Equation (13) is described in Algorithm 1 (see, for example, algorithm 9.2 of Saad, 2003). Note that every CG iteration involves one matrix–vector product with $A$ (the product $Ap_{j-1}$ is stored in step 3 and reused in step 5) and this is expensive in weak-constraint 4D-Var, because it involves running the linearised model throughout the assimilation window through the factor $L^{-1}$.

Algorithm 1. Split preconditioned CG for solving $Ax = b$
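
The listing of Algorithm 1 is not reproduced here; the sketch below is a minimal NumPy version of split preconditioned CG consistent with the description above (a single product with $A$ per iteration, computed once and reused), assuming dense arrays for $A$ and the factor $C$. It is an illustration, not the operational implementation.

```python
import numpy as np

def split_pcg(A, b, C, maxit=100, tol=1e-6):
    """Split preconditioned CG (Algorithm 1 sketch) for A x = b with P = C C^T.

    Works in the transformed variable xhat = C^{-1} x, i.e. solves
    C^T A C xhat = C^T b, and returns x = C xhat.
    """
    n = b.shape[0]
    xhat = np.zeros(n)
    rhat = C.T @ b                      # residual of the transformed system
    p = rhat.copy()
    rho = rhat @ rhat
    tol2 = tol**2 * rho                 # relative residual stopping criterion
    for _ in range(maxit):
        w = C @ p                       # map search direction to original space
        Aw = A @ w                      # the single (expensive) product with A
        q = C.T @ Aw                    # = (C^T A C) p
        alpha = rho / (p @ q)
        xhat += alpha * p
        rhat -= alpha * q
        rho_new = rhat @ rhat
        if rho_new <= tol2:
            break
        beta = rho_new / rho
        rho = rho_new
        p = rhat + beta * p
    return C @ xhat                     # solution of the original system
```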

3.2 Limited-memory preconditioners

In weak-constraint 4D-Var, the preconditioner $P$ approximates the inverse Hessian. Hence, $P$ can be obtained using quasi-Newton methods for unconstrained optimization that construct an approximation of the Hessian matrix, which is updated regularly (see, for example, chapter 6 of Nocedal and Wright, 2006). A popular method to approximate the Hessian is Broyden–Fletcher–Goldfarb–Shanno (BFGS, named after the researchers who proposed it), but it is too expensive in terms of storage and of updating the approximation. Instead, the so-called block BFGS method (derived by Schnabel, 1983) uses only a limited number of vectors to build the Hessian approximation, and when new vectors are added older ones are dropped. This is an example of a limited-memory preconditioner (LMP), and the one considered by Tshimanga et al. (see Tshimanga et al., 2008; Gratton et al., 2011; Tshimanga, 2007) in the context of strong-constraint 4D-Var. An LMP for an $n_A \times n_A$ symmetric positive-definite matrix $A$ is defined as follows:
$$P_k = \left(I_{n_A} - S(S^TAS)^{-1}S^TA\right)\left(I_{n_A} - AS(S^TAS)^{-1}S^T\right) + S(S^TAS)^{-1}S^T, \qquad (14)$$
where $S$ is an $n_A \times k$ ($k \le n_A$) matrix with linearly independent columns $s_1, \ldots, s_k$, and $I_{n_A}$ is the $n_A \times n_A$ identity matrix (Gratton et al., 2011). $P_k$ is symmetric positive-definite and, if $k = n_A$, then $(S^TAS)^{-1} = S^{-1}A^{-1}S^{-T}$ and $P_k = A^{-1}$. In data assimilation we have $k \ll n_A$, hence the name LMP. $P_k$ is called a balancing preconditioner in Tang et al. (2009).

A potential problem for practical applications of Equation (14) is the need for expensive matrix–matrix products with $A$. Simpler formulations of Equation (14) are obtained by imposing more conditions on the vectors $s_1, \ldots, s_k$. Two formulations that Tshimanga et al. (2008) call spectral-LMP and Ritz-LMP have been used, for example, in ocean data assimilation in the Regional Ocean Modeling System (ROMS: Moore et al., 2011) and in the variational data assimilation software for the Nucleus for European Modelling of the Ocean (NEMO) ocean model (NEMOVAR: Mogensen et al., 2012), and in coupled climate reanalysis in the Coupled ECMWF ReAnalysis (CERA: Laloyaux et al., 2018).

To obtain the spectral-LMP, let $v_1, \ldots, v_k$ be orthonormal eigenvectors of $A$ with corresponding eigenvalues $\lambda_1, \ldots, \lambda_k$. Set $V = (v_1, \ldots, v_k)$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_k)$, so that $AV = V\Lambda$ and $V^TV = I_k$. If $s_i = v_i$, $i = 1, \ldots, k$, then the LMP in Equation (14) is the spectral-LMP $P_k^{sp}$ (it is called a deflation preconditioner in Giraud and Gratton, 2006), which can be simplified to
$$P_k^{sp} = I_{n_A} - \sum_{i=1}^{k}\left(1 - \lambda_i^{-1}\right)v_iv_i^T. \qquad (15)$$
Then $P_k^{sp} = C_k^{sp}(C_k^{sp})^T$ with (presented in section 2.3.1 of Tshimanga, 2007)
$$C_k^{sp} = \prod_{i=1}^{k}\left(I_{n_A} - \left(1 - \left(\sqrt{\lambda_i}\right)^{-1}\right)v_iv_i^T\right). \qquad (16)$$
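
As an illustration of Equations (15) and (16), the sketch below builds and applies the factor $C_k^{sp}$ from $k$ (approximate) eigenpairs. It assumes the vectors are orthonormal, in which case the product in Equation (16) collapses to a single low-rank update; it is a sketch, not the authors' implementation.

```python
import numpy as np

def spectral_lmp_factor(V, lam):
    """Dense factor C_k^sp of Equation (16), so that P_k^sp = C C^T.

    Columns of V are orthonormal (approximate) eigenvectors and lam the
    corresponding (approximate) eigenvalues; orthonormality lets the product
    over i collapse into one sum.
    """
    n = V.shape[0]
    return np.eye(n) - V @ np.diag(1.0 - 1.0 / np.sqrt(lam)) @ V.T

def apply_spectral_lmp_factor(v, V, lam):
    """Apply C_k^sp to a vector without forming the n_A x n_A matrix."""
    return v - V @ ((1.0 - 1.0 / np.sqrt(lam)) * (V.T @ v))
```

In practice only the matrix-free application is needed inside PCG; the dense form above is for small illustrative problems.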
In many applications, including data assimilation, exact eigenpairs are not known, and their approximations, called Ritz values and vectors, are used (we discuss these in the following section). If $u_1, \ldots, u_k$ are orthogonal Ritz vectors, then the following relation holds: $U^TAU = \Theta$, where $U = (u_1, \ldots, u_k)$, $\Theta = \mathrm{diag}(\theta_1, \ldots, \theta_k)$, and $\theta_i$ is a Ritz value. Setting $s_i = u_i$, $i = 1, \ldots, k$, the Ritz-LMP $P_k^{Rt}$ is
$$P_k^{Rt} = \left(I_{n_A} - U\Theta^{-1}U^TA\right)\left(I_{n_A} - AU\Theta^{-1}U^T\right) + U\Theta^{-1}U^T. \qquad (17)$$
Each application of $P_k^{Rt}$ requires a matrix–matrix product with $A$. If the Ritz vectors are obtained by the Lanczos process (described in Section 3.4 below), then Equation (17) can be simplified further, so that no matrix–matrix products with $A$ are needed (see section 4.2.2 of Gratton et al., 2011 for details).

An important property is that, if an LMP is constructed using $k$ vectors, then at least $k$ eigenvalues of the preconditioned matrix $C^TAC$ will be equal to one, and the remaining eigenvalues will lie between the smallest and largest eigenvalues of $A$ (see theorem 3.4 of Gratton et al., 2011). Moreover, if $A$ has a cluster of eigenvalues at one, then LMPs preserve this cluster. This is crucial when preconditioning Equation (10): because the LMPs preserve the $n(N+1) - q$ smallest eigenvalues of $A_{\mathrm{pr}}$ that are equal to one, the CG convergence can be improved by decreasing the largest eigenvalues. Hence, it is preferable to use the largest eigenpairs or their approximations.

In practice, both spectral-LMP and Ritz-LMP use Ritz vectors and values to construct the LMPs. It has been shown that the Ritz-LMP can perform better than spectral-LMP in a strong-constraint 4D-Var setting by correcting for the inaccuracies in the estimates of eigenpairs (Tshimanga et al., 2008). However, Gratton et al. (2011) (their theorem 4.5) have shown that, if the preconditioners are constructed with Ritz vectors and values that have converged, then the spectral-LMP acts like the Ritz-LMP.

3.3 Ritz information

Calculating or approximating all the eigenpairs of a large sparse matrix is impractical. Hence, only a subset is approximated to construct the LMPs. This is often done by extracting approximations from a subspace, and the Rayleigh–Ritz (RR) procedure is a popular method for doing this.

Assume that $\mathcal{Z} \subseteq \mathbb{R}^{n_A}$ is an invariant subspace of $A$, that is, $Az \in \mathcal{Z}$ for every $z \in \mathcal{Z}$, and that the columns of $Z \in \mathbb{R}^{n_A \times m}$, $m < n_A$, form an orthonormal basis for $\mathcal{Z}$. If $(\lambda, \hat{y})$ is an eigenpair of $K = Z^TAZ \in \mathbb{R}^{m \times m}$, then $(\lambda, v)$, where $v = Z\hat{y}$, is an eigenpair of $A$ (see, e.g., theorem 1.2 in chapter 4 of Stewart, 2001). Hence, eigenvalues of $A$ whose eigenvectors lie in the subspace $\mathcal{Z}$ can be extracted by solving a small eigenvalue problem.

However, the computed subspace $\tilde{\mathcal{Z}}$, with an orthonormal basis given by the columns of $\tilde{Z}$, is generally not invariant. Hence, only approximations $\tilde{v}$ to the eigenvectors $v$ belong to $\tilde{\mathcal{Z}}$. The RR procedure computes approximations $u$ to $\tilde{v}$. We give the RR procedure in Algorithm 2, where the eigenvalue decomposition is abbreviated as EVD. Approximations to the eigenvalues $\lambda$ are called Ritz values $\theta$, and the $u$ are the Ritz vectors. Eigenvectors of $\tilde{K} = \tilde{Z}^TA\tilde{Z}$, which is the projection of $A$ onto $\tilde{\mathcal{Z}}$, are denoted by $w$ and are called primitive Ritz vectors.

Algorithm 2. Rayleigh–Ritz procedure for computing approximations of eigenpairs of a symmetric matrix $A$
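
A compact NumPy sketch of the Rayleigh–Ritz procedure of Algorithm 2, under the assumption that $A$ is symmetric and the columns of the supplied basis are orthonormal:

```python
import numpy as np

def rayleigh_ritz(A, Z):
    """Rayleigh-Ritz procedure (Algorithm 2 sketch) for symmetric A.

    Z has orthonormal columns spanning the (approximate) subspace. Returns
    Ritz values theta and Ritz vectors U = Z W, where the columns of W
    (primitive Ritz vectors) are eigenvectors of K = Z^T A Z.
    """
    K = Z.T @ (A @ Z)                 # projection of A onto the subspace
    theta, W = np.linalg.eigh(K)      # EVD of the small matrix
    idx = np.argsort(theta)[::-1]     # order from largest to smallest
    theta, W = theta[idx], W[:, idx]
    U = Z @ W                         # Ritz vectors of A
    return theta, U
```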

3.4 Spectral information from CG

Tshimanga et al. (2008) use Ritz pairs of the Hessian in one inner loop to construct LMPs for the following inner loop, i.e. information on $A^{(0)}$ is used to precondition $A^{(1)}$, and so on. Success relies on the Hessians not changing significantly from one inner loop to the next. Ritz information can be obtained from the Lanczos process that is connected to CG, hence information for the preconditioner can be gathered at a negligible cost.

The Lanczos process is used to obtain estimates of a few extremal eigenvalues and corresponding eigenvectors of a symmetric matrix $A$ (section 10.1 of Golub and Van Loan, 2013). It produces a sequence of tridiagonal matrices $T_j \in \mathbb{R}^{j \times j}$, the largest and smallest eigenvalues of which converge to the largest and smallest eigenvalues of $A$. Given a starting vector $f_0$, it also computes an orthonormal basis $f_0, \ldots, f_{j-1}$ for the Krylov subspace $\mathcal{K}_j = \mathrm{span}\{f_0, Af_0, \ldots, A^{j-1}f_0\}$. Ritz values $\theta_i$ are obtained as eigenvalues of the tridiagonal matrix, which has the following structure:
$$T_j = \begin{pmatrix} \gamma_1 & \tau_1 & & \\ \tau_1 & \gamma_2 & \tau_2 & \\ & \ddots & \ddots & \ddots \\ & & \tau_{j-1} & \gamma_j \end{pmatrix}. \qquad (18)$$
The Ritz vectors of $A$ are $u_i = F_jw_i$, where $F_j = (f_0, \ldots, f_{j-1})$ and the eigenvector $w_i$ of $T_j$ is a primitive Ritz vector. Eigenpairs of $T_j$ can be obtained using a symmetric tridiagonal QR algorithm or Jacobi procedures (see, e.g., section 8.5 of Golub and Van Loan, 2013).
Saad (see section 6.7.3 of Saad, 2003) discusses obtaining the entries of $T_j$ when solving $Ax = b$ with CG. At the $j$th iteration of CG, the new entries of $T_j$ are calculated as follows:
$$\gamma_j = \begin{cases} \dfrac{1}{\alpha_j}, & j = 1, \\[6pt] \dfrac{1}{\alpha_j} + \dfrac{\beta_{j-1}}{\alpha_{j-1}}, & j > 1, \end{cases} \qquad (19)$$
$$\tau_j = \frac{\sqrt{\beta_j}}{\alpha_j}, \qquad (20)$$
and the vector $f_j = r_j/\|r_j\|$, where $\|r_j\|^2 = r_j^Tr_j$ and $\alpha_j$, $\beta_j$, and $r_j$ are as in Algorithm 1. Hence, obtaining eigenvalue information requires normalizing the residual vectors and finding eigenpairs of the tridiagonal matrix $T_j$. In data assimilation, the dimension of $T_j$ is small, because the cost of matrix–vector products restricts the number of CG iterations in the previous inner loop. Hence its eigenpairs can be calculated cheaply. However, caution has to be taken to avoid ‘ghost' values, that is, repeated Ritz values, due to the loss of orthogonality in CG (section 10.3.5 of Golub and Van Loan, 2013). This can be addressed using a complete reorthogonalization in every CG iteration, which is done in the CONGRAD routine used at the European Centre for Medium Range Weather Forecasts (ECMWF, 2020). This makes every CG iteration more expensive, but CG may converge in fewer iterations (Fisher, 1998).
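
For illustration, Equations (19) and (20) can be accumulated into $T_j$ from the stored CG coefficients as in the sketch below (the coefficient names follow the text; this is not the operational CONGRAD code):

```python
import numpy as np

def lanczos_tridiagonal(alphas, betas):
    """Assemble the Lanczos tridiagonal T_j from CG coefficients
    (Equations 19 and 20); alphas has length j and betas length j-1."""
    j = len(alphas)
    T = np.zeros((j, j))
    for i in range(j):
        T[i, i] = 1.0 / alphas[i]                     # gamma_{i+1}, first term
        if i > 0:
            T[i, i] += betas[i - 1] / alphas[i - 1]   # gamma_{i+1}, second term
        if i < j - 1:
            T[i, i + 1] = T[i + 1, i] = np.sqrt(betas[i]) / alphas[i]  # tau_{i+1}
    return T

# Ritz values are the eigenvalues of T_j; Ritz vectors combine its eigenvectors
# with the stored normalised residuals F_j = (f_0, ..., f_{j-1}):
#   theta, W = np.linalg.eigh(T);  U = F @ W
```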

4 RANDOMISED EIGENVALUE DECOMPOSITION

If the Hessian in one inner loop differs significantly from the Hessian in the previous inner loop, then it may not be useful to precondition the former with an LMP that is constructed with information from the latter. Employing the Lanczos process to obtain eigenpair estimates and use them to construct an LMP in the same inner loop is too computationally expensive, because each iteration of the Lanczos process requires a matrix–vector product with the Hessian, thus the cost is similar to the cost of CG. Hence, we explore a different approach.

Subspace iteration is a simple procedure to obtain approximations to the largest eigenpairs (see, e.g., chapter 5 of Saad, 2011). It is easily understandable and can be implemented in a straightforward manner, although its convergence can be very slow if the largest eigenvalues are not well separated from the rest of the spectrum. The accuracy of subspace iteration may be enhanced by using a RR projection.

Such an approach is used in Randomised Eigenvalue Decomposition (REVD: see, e.g., Halko et al., 2011). This takes a Gaussian random matrix, that is, a matrix whose entries are independent standard normal random variables (zero mean and unit variance), and applies one iteration of the subspace iteration method with RR projection, hence obtaining a rank-$m$ approximation $A \approx Z_1(Z_1^TAZ_1)Z_1^T$, where $Z_1 \in \mathbb{R}^{n_A \times m}$ is orthogonal. We present REVD in Algorithm 3. An important feature of REVD is the observation that the accuracy of the approximation is enhanced with oversampling (which is also called "using guard vectors" in Duff and Scott, 1993), that is, working on a larger space than the required number of Ritz vectors. Halko et al. (2011) claim that setting the oversampling parameter to 5 or 10 is often sufficient.

Algorithm 3. Randomised eigenvalue decomposition, REVD
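
A minimal NumPy sketch of REVD as described above (Gaussian start matrix, orthonormalisation of the sample matrix, and a Rayleigh–Ritz step), returning the $k$ largest Ritz pairs with oversampling parameter $l$:

```python
import numpy as np

def revd(A, k, l=5, rng=None):
    """Randomised EVD (Algorithm 3 sketch): rank-(k+l) approximation
    A ~ Z1 (Z1^T A Z1) Z1^T from one subspace iteration with a Gaussian
    start matrix; returns the k largest Ritz pairs."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    G = rng.standard_normal((n, k + l))   # Gaussian start matrix
    Y = A @ G                             # sample the range of A (first product)
    Z1, _ = np.linalg.qr(Y)               # orthonormal basis of the sample space
    K = Z1.T @ (A @ Z1)                   # second product with A (Rayleigh-Ritz)
    theta, W = np.linalg.eigh(K)
    theta, W = theta[::-1], W[:, ::-1]    # sort descending
    U = Z1 @ W                            # Ritz vectors
    return theta[:k], U[:, :k]
```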

Randomised algorithms are designed to minimise the communication instead of the flop count. The expensive parts of Algorithm 3 are the two matrix–matrix products $AG$ and $AZ_1$ in steps 2 and 4; that is, in each of these steps, the matrix $A$ has to be multiplied with $(k+l)$ vectors, which in serial computations would be essentially the cost of $2(k+l)$ iterations of unpreconditioned CG. However, note that these matrix–matrix products can be parallelised.

In weak-constraint 4D-Var, $A$ is the Hessian, hence it is symmetric positive-definite and its eigenpairs can also be approximated using a randomised Nyström method (algorithm 5.5 of Halko et al., 2011), which is expected to give much more accurate results than REVD (Halko et al., 2011). We present the Nyström method in Algorithm 4, where the singular-value decomposition is abbreviated as SVD. It considers a more elaborate rank-$m$ approximation than REVD: $A \approx (AZ_1)(Z_1^TAZ_1)^{-1}(AZ_1)^T = FF^T$, where $Z_1 \in \mathbb{R}^{n_A \times m}$ is orthogonal (obtained in the same way as in REVD, e.g. using a tall-skinny QR (TSQR) decomposition: Demmel et al., 2012) and $F = (AZ_1)(Z_1^TAZ_1)^{-1/2} \in \mathbb{R}^{n_A \times m}$ is an approximate Cholesky factor of $A$, which is found in step 6. The eigenvalues of $FF^T$ are the squares of the singular values of $F$ (see section 2.4.2 of Golub and Van Loan, 2013). In numerical computations we store the matrices $E^{(1)} = AZ_1$ and $E^{(2)} = Z_1^TE^{(1)} = Z_1^TAZ_1$ (step 4), perform the Cholesky factorization $E^{(2)} = C^TC$ (step 5), and obtain $F$ by solving the triangular system $FC = E^{(1)}$.

Algorithm 4. Randomised eigenvalue decomposition for symmetric positive semidefinite $A$, Nyström
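
A sketch of the randomised Nyström method following the storage scheme described above ($E^{(1)}$, $E^{(2)}$, a Cholesky factorization, and a triangular solve); in ill-conditioned cases a small symmetric shift of $E^{(2)}$ may be needed before the factorization, a detail not addressed here.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def nystrom_evd(A, k, l=5, rng=None):
    """Randomised Nystrom EVD (Algorithm 4 sketch) for symmetric positive
    semidefinite A: A ~ (A Z1)(Z1^T A Z1)^{-1}(A Z1)^T = F F^T."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    G = rng.standard_normal((n, k + l))
    Z1, _ = np.linalg.qr(A @ G)             # orthonormal basis of the sample space
    E1 = A @ Z1                             # E^(1) = A Z1 (second product with A)
    E2 = Z1.T @ E1                          # E^(2) = Z1^T A Z1
    C = cholesky(E2, lower=False)           # E^(2) = C^T C, C upper triangular
    F = solve_triangular(C, E1.T, trans='T', lower=False).T   # solve F C = E^(1)
    U, s, _ = np.linalg.svd(F, full_matrices=False)
    return s[:k] ** 2, U[:, :k]             # eigenvalues = squared singular values
```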

The matrix–matrix product with $A$ at step 4 of Algorithms 3 and 4 is removed in Rutishauser's implementation of subspace iteration with RR projection, called ritzit (Rutishauser, 1971). It can be derived in the following manner (see chapter 14 of Parlett, 1998). Assume that $G_3 \in \mathbb{R}^{n_A \times m}$ is an orthogonal matrix and that the sample matrix is $Y_3 = AG_3 = Z_3R_3$, where $Z_3 \in \mathbb{R}^{n_A \times m}$ is orthogonal and $R_3 \in \mathbb{R}^{m \times m}$ is upper triangular. Then a projection of $A^2$ onto the column space of $G_3$ is $\hat{K} = Y_3^TY_3 = R_3^TZ_3^TZ_3R_3 = R_3^TR_3$. Then $K_3 = R_3R_3^T = R_3R_3^TR_3R_3^{-1} = R_3\hat{K}R_3^{-1}$, which is similar to $\hat{K}$ and hence has the same eigenvalues. This leads to another implementation of REVD, presented in Algorithm 5. This is a single-pass algorithm, meaning that $A$ has to be accessed just once, and, to the best of our knowledge, this method has not been considered in the context of randomised eigenvalue approximations.

Algorithm 5. Randomised eigenvalue decomposition based on ritzit, REVD_ritzit
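
The sketch below follows the ritzit derivation given above: one block product with $A$, a QR decomposition of the sample matrix, and an EVD of $K_3$. Because $K_3$ is similar to a projection of $A^2$, Ritz values of $A$ are recovered here as square roots of its eigenvalues; this recovery step is our reading of the method, not a quotation of the paper's Algorithm 5.

```python
import numpy as np

def revd_ritzit(A, k, l=5, rng=None):
    """Single-pass REVD based on ritzit (Algorithm 5 sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    G3, _ = np.linalg.qr(rng.standard_normal((n, k + l)))  # orthonormal start matrix
    Y3 = A @ G3                                            # the only product with A
    Z3, R3 = np.linalg.qr(Y3)                              # sample matrix QR
    K3 = R3 @ R3.T                                         # similar to G3^T A^2 G3
    delta, W3 = np.linalg.eigh(K3)
    delta, W3 = delta[::-1], W3[:, ::-1]                   # sort descending
    theta = np.sqrt(np.maximum(delta, 0.0))                # approximate eigenvalues of A
    U = Z3 @ W3                                            # Ritz vectors
    return theta[:k], U[:, :k]
```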

Note that the Ritz vectors given by Algorithms 3, 4, and 5 are different. Although Algorithm 5 accesses the matrix $A$ only once, it requires an additional orthogonalisation of a matrix of size $n_A \times (k+l)$.

In Table 1, we summarise some properties of the Lanczos, REVD, Nyström, and REVD_ritzit methods when they are used to compute Ritz values and vectors to generate a preconditioner for $A$ in incremental data assimilation. Note that the cost of applying the spectral-LMP depends on the number of vectors $k$ used in its construction and is independent of which method is used to obtain them. The additional cost of using randomised algorithms arises only once per inner loop, when the preconditioner is generated. We recall that in these algorithms the required EVD or SVD of the small matrix can be obtained cheaply, and that the most expensive parts of the algorithms are the matrix–matrix products of $A$ and the $n_A \times (k+l)$ matrices. If enough computational resources are available, these can be parallelised. In the best-case scenario, all $k+l$ matrix–vector products can be performed at the same time, making the cost of the matrix–matrix product equivalent to the cost of one iteration of CG plus communication between the processors.

TABLE 1. A summary of the properties of the different methods of obtaining $k$ Ritz vectors and values to generate the preconditioner for an $n_A \times n_A$ matrix $A$ in the $i$th inner loop; here $l$ is the oversampling parameter, while * applies for CG with reorthogonalization

| | Lanczos | REVD | Nyström | REVD_ritzit |
|---|---|---|---|---|
| Information source | Previous inner loop | Current inner loop | Current inner loop | Current inner loop |
| Preconditioner for the first inner loop | No | Yes | Yes | Yes |
| $k$ dependence on the previous inner loop | Bounded by the number of CG iterations | Independent | Independent | Independent |
| Matrix–matrix products with $A$ | None | 2 products with $n_A \times (k+l)$ matrices | 2 products with $n_A \times (k+l)$ matrices | 1 product with an $n_A \times (k+l)$ matrix |
| QR decomposition | None | None | None | $Y_3 \in \mathbb{R}^{n_A \times (k+l)}$ |
| Orthogonalisation | None | $Y \in \mathbb{R}^{n_A \times (k+l)}$ | $Y \in \mathbb{R}^{n_A \times (k+l)}$ | $G \in \mathbb{R}^{n_A \times (k+l)}$ |
| Cholesky factorization | None | None | $E^{(2)} \in \mathbb{R}^{(k+l) \times (k+l)}$ | None |
| Triangular solve | None | None | $FC = E^{(1)}$ for $F \in \mathbb{R}^{n_A \times (k+l)}$ | None |
| Deterministic EVD | $T_k \in \mathbb{R}^{k \times k}$ * | $K \in \mathbb{R}^{(k+l) \times (k+l)}$ | None | $K_3 \in \mathbb{R}^{(k+l) \times (k+l)}$ |
| Deterministic SVD | None | None | $F \in \mathbb{R}^{n_A \times (k+l)}$ | None |

When a randomised method is used to generate the preconditioner, an inner loop is performed as follows. Estimates of the Ritz values of the Hessian and the corresponding Ritz vectors are obtained with a randomised method (Algorithm 3, 4, or 5) and used to construct an LMP. Then the system of Equation (11) with the exact Hessian A is solved with PCG (Algorithm 1) using the LMP. The state is updated in the outer loop using the PCG solution.
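
Putting the pieces together, one inner loop with a randomised LMP could be sketched as follows, reusing the illustrative helpers defined earlier (revd_ritzit, spectral_lmp_factor, split_pcg) on a synthetic stand-in for the first-level preconditioned Hessian; the matrix, sizes, and seed are ours, not the paper's.

```python
import numpy as np

# Synthetic SPD stand-in for the first-level preconditioned Hessian:
# identity plus a rank-q positive semidefinite term (cf. Equation 10).
rng = np.random.default_rng(0)
n, q = 500, 40
W = rng.standard_normal((n, q))
A = np.eye(n) + W @ W.T
b = rng.standard_normal(n)

theta, U = revd_ritzit(A, k=15, l=5, rng=rng)   # Ritz pairs of the current Hessian
C = spectral_lmp_factor(U, theta)               # randomised spectral-LMP, P = C C^T
x = split_pcg(A, b, C, maxit=100, tol=1e-6)     # preconditioned inner-loop solve
```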

5 NUMERICAL EXPERIMENTS

We demonstrate our proposed preconditioning strategies using two models: a simple linear advection model, to explore the spectra of the preconditioned Hessian, and the nonlinear Lorenz 96 model (Lorenz, 1996), to explore the convergence of split preconditioned CG (PCG). We perform identical twin experiments, where $x^t = ((x_0^t)^T, \ldots, (x_N^t)^T)^T$ denotes the reference trajectory. The observations and background state are generated by adding noise to $\mathcal{H}_i(x_i^t)$ and $x_0^t$ with covariance matrices $R$ and $B$, respectively. We use direct observations, thus the observation operator $\mathcal{H}_i$ is linear.

We use covariance matrices $R_i = \sigma_o^2 I_{q_i}$, where $q_i$ is the number of observations at time $t_i$; $Q_i = \sigma_q^2 C_q$, where $C_q$ is a Laplacian correlation matrix (Johnson et al., 2005); and $B = \sigma_b^2 C_b$, where $C_b$ is a second-order auto-regressive correlation matrix (Daley, 1993).

We assume that first-level preconditioning has already been applied (recall Equation 10). In data assimilation, using Ritz-LMP as formulated in Equation (17) is impractical because of the matrix products with A and we cannot use a simple formulation of Ritz-LMP when the Ritz values and vectors are obtained with randomised methods. Hence, we use spectral-LMP. However, as we mentioned in Section 3.2, the spectral-LMP that is constructed with well-converged Ritz values and vectors acts like Ritz-LMP. When we consider the second inner loop, we compare the spectral-LMPs using information from the randomised methods with the spectral-LMP constructed using information obtained with the Matlab function eigs in the previous inner loop. eigs returns a highly accurate estimate of a few largest or smallest eigenvalues and corresponding eigenvectors. We will use the term randomised LMP to refer to the spectral-LMPs that are constructed with information from the randomised methods, and deterministic LMP to refer to the spectral-LMP that is constructed with information from eigs.

The computations are performed with Matlab R2017b. Linear systems are solved using the Matlab implementation of PCG (function pcg), which was modified to allow split preconditioning to maintain the symmetric coefficient matrix at every loop.

5.1 Advection model

First, we consider the linear advection model:
$$\frac{\partial u(z,t)}{\partial t} + \frac{\partial u(z,t)}{\partial z} = 0, \qquad (21)$$
where $z \in [0, 1]$ and $t \in (0, T)$. An upwind numerical scheme is used to discretise Equation (21) (see, e.g., chapter 4 of Morton and Mayers, 1994). To allow us to compute all the eigenvalues (described in the following section), we consider a small system with the linear advection model. The domain is divided into $n = 40$ equally spaced grid points, with grid spacing $\Delta z = 1/n$. We run the model for 51 time steps ($N = 50$) with time-step size $\Delta t = 1/N$, hence $A$ is a $2{,}040 \times 2{,}040$ matrix. The Courant number is $C = 0.8$ (the upwind scheme is stable for $C \in [0, 1]$). The initial conditions are Gaussian,
$$u(z, 0) = 6\exp\left(-\frac{(z - 0.5)^2}{2 \times 0.1^2}\right),$$
and the boundary conditions are periodic, $u(1, t) = u(0, t)$.
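
The following is a sketch of the discretisation just described (periodic upwind scheme with Courant number 0.8 and the Gaussian initial condition), not of the assimilation code itself:

```python
import numpy as np

def upwind_step(u, courant=0.8):
    """One explicit upwind step for u_t + u_z = 0 on a periodic grid
    (stable for Courant numbers in [0, 1])."""
    return u - courant * (u - np.roll(u, 1))

n, N = 40, 50
z = np.arange(n) / n
u = 6.0 * np.exp(-(z - 0.5) ** 2 / (2 * 0.1 ** 2))   # Gaussian initial condition
for _ in range(N):
    u = upwind_step(u)
```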

We set $\sigma_o = 0.05$, $\sigma_q = 0.05$, and $\sigma_b = 0.1$. $C_q$ and $C_b$ have length-scales equal to $10\Delta z$. Every fourth model variable is observed at every fifth time step, ensuring that there is an observation at the final time step (100 observations in total). Because the model and the observation operator $\mathcal{H}_i$ are linear, the cost function (Equation 2) is quadratic and its minimiser is found in the first loop of the incremental method.

5.1.1 Eigenvalues of the preconditioned matrix

We apply the randomised LMPs in the first inner loop. Note that if the deterministic LMP is used, it is unclear how to precondition the first inner loop. We explore what effect the randomised LMPs have on the eigenvalues of A . The oversampling parameter l is set to 5 and the randomised LMPs are constructed with k = 25 vectors.

The Ritz values of A given by the randomised methods are compared with those computed using eigs (Figure 1a). The Nyström method produces a good approximation of the largest eigenvalues, while REVD gives a slightly worse approximation, except for the largest five eigenvalues. The REVD_ritzit method underestimates the largest eigenvalues significantly. The largest eigenvalues of the preconditioned matrices are smaller than the largest eigenvalue of A (Figure 1b). However, the smallest eigenvalues of the preconditioned matrices are less than one and hence applying the preconditioner expands the spectrum of A at the lower boundary (Figure 1c), so that theorem 3.4 of Gratton et al. (2011), which considers the non-expansiveness of the spectrum of the Hessian after preconditioning with an LMP, does not hold. This happens because the formulation of spectral-LMP is derived assuming that the eigenvalues and eigenvectors are exact, while the randomized methods provide only approximations. Note that, even though REVD_ritzit gives the worst approximations of the largest eigenvalues of the Hessian, using the randomised LMP with information from REVD_ritzit reduces the largest eigenvalues of the preconditioned matrix the most and the smallest eigenvalues are close to one. Using the randomised LMP with estimates from Nyström gives similar results. Hence, the condition number of the preconditioned matrix is lower when the preconditioners are constructed with REVD_ritzit or Nyström compared with REVD.

FIGURE 1. Advection problem. (a) The 25 largest eigenvalues of $A$ (eigs) and their estimates given by the randomised methods; the largest eigenvalues and their estimates given by REVD and Nyström coincide. (b) The largest eigenvalues of $A$ (no LMP, the same as eigs in (a)) and of $(C_{25}^{sp})^TAC_{25}^{sp}$, where $C_{25}^{sp}$ is constructed with the Ritz values in (a) and the corresponding Ritz vectors. (c) The smallest eigenvalues of $A$ and of $(C_{25}^{sp})^TAC_{25}^{sp}$. (d) Quadratic cost function value versus PCG iteration when solving systems with $A$ and $(C_{25}^{sp})^TAC_{25}^{sp}$

The values of the quadratic cost function at the first ten iterations of PCG are shown in Figure 1d. Using the randomised LMP that is constructed with information from REVD is detrimental to the PCG convergence compared with using no preconditioning. Using information from the Nyström and REVD_ritzit methods results in similar PCG convergence and low values of the quadratic cost function are reached in fewer iterations than without preconditioning. The PCG convergence may be explained by the favourable distribution of the eigenvalues after preconditioning using Nyström and REVD_ritzit, and the smaller-than-one eigenvalues when using REVD. These results, however, do not necessarily generalize to an operational setting, as this system is well conditioned while operational settings are not. This will be investigated further in the next section.

5.2 Lorenz 96 model

We next use the Lorenz 96 model to examine what effect the randomised LMPs have on PCG performance. In the Lorenz 96 model, the evolution of the $n$ components $X_j$, $j \in \{1, 2, \ldots, n\}$, of $x_i$ is governed by a set of $n$ coupled ODEs:
$$\frac{\mathrm{d}X_j}{\mathrm{d}t} = -X_{j-2}X_{j-1} + X_{j-1}X_{j+1} - X_j + F, \qquad (22)$$
where $X_{-1} = X_{n-1}$, $X_0 = X_n$, $X_{n+1} = X_1$, and $F = 8$. The equations are integrated using a fourth-order Runge–Kutta scheme (Butcher, 1987). We set $n = 80$ and $N = 150$ (the size of $A$ is $12{,}080 \times 12{,}080$) and observe every tenth model variable at every tenth time step (120 observations in total), ensuring that there are observations at the final time step. The grid-point distance is $\Delta X = 1/n$ and the time step is set to $\Delta t = 2.5 \times 10^{-2}$.
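
A short sketch of the Lorenz 96 right-hand side (Equation 22) and a fourth-order Runge–Kutta step with the stated time step; the tangent-linear and adjoint models needed for the Hessian are not shown, and the initial perturbation below is purely illustrative.

```python
import numpy as np

def lorenz96_rhs(X, F=8.0):
    """Right-hand side of Equation (22) with cyclic indexing."""
    return (np.roll(X, -1) - np.roll(X, 2)) * np.roll(X, 1) - X + F

def rk4_step(X, dt=2.5e-2, F=8.0):
    """One fourth-order Runge-Kutta step of the Lorenz 96 model."""
    k1 = lorenz96_rhs(X, F)
    k2 = lorenz96_rhs(X + 0.5 * dt * k1, F)
    k3 = lorenz96_rhs(X + 0.5 * dt * k2, F)
    k4 = lorenz96_rhs(X + dt * k3, F)
    return X + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0

X = 8.0 + 0.01 * np.random.default_rng(1).standard_normal(80)  # perturbed rest state
for _ in range(150):
    X = rk4_step(X)
```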
For the covariance matrices we use $\sigma_o = 0.15$ and $\sigma_b = 0.2$. $C_b$ has a length-scale equal to $2\Delta X$. Two setups are used for the model-error covariance matrix:
  • $\sigma_q = 0.1$ and $C_q$ has length-scale $L_q = 2\Delta X$ (the same as $C_b$);
  • $\sigma_q = 0.05$ and $C_q$ has length-scale $L_q = 0.25\Delta X$.

In our numerical experiments, the preconditioners have a very similar effect in both setups. Hence, in the following sections we present results for the case $\sigma_q = 0.1$ and $L_q = 2\Delta X$ (Figures 2, 4, and 5); the exception is Figure 3, which compares both setups.

The first outer loop is performed and no second-level preconditioning is used in the first inner loop, where PCG is run for 100 iterations or until the relative residual norm reaches $10^{-6}$. In the following sections, we use randomised and deterministic LMPs in the second inner loop. PCG has the same stopping criteria as in the first inner loop.

5.2.1 Minimising the inner loop cost function

In Figure 2, we compare the performance of the randomised LMPs with the deterministic LMP. We also consider the effect of varying k, the number of vectors used to construct the preconditioner. We set the oversampling parameter to l = 5 . Because results from the randomized methods depend on the random matrix used, we perform 50 experiments with different realizations for the random matrix. We find that the different realizations lead to very similar results (see Figure 2a).

FIGURE 2. A comparison of the value of the quadratic cost function at every PCG iteration when the spectral-LMP is constructed with $k \in \{5, 10, 15\}$ Ritz values and vectors obtained with the randomised methods in the current inner loop, and with the function eigs in the previous inner loop. We also show no second-level preconditioning (no LMP), which is the same in all four panels. For the randomised methods, (a) shows 50 experiments for $k = 5$ and the remaining panels display means over 50 experiments. Here $\sigma_q = 0.1$ and $L_q = 2\Delta X$

Independently of the $k$ value, there is an advantage in using second-level preconditioning. The reduction in the value of the quadratic cost function is faster using randomised LMPs compared with the deterministic LMP, with REVD_ritzit performing the best after the first few iterations. The more information we use in the preconditioner (i.e., the higher the $k$ value), the sooner REVD_ritzit overtakes the other methods. The performance of the REVD and Nyström methods is similar. Note that, as $k$ increases, the storage (see Table 1) and the work per PCG iteration increase. Examination of the Ritz values given by the randomised methods shows that REVD_ritzit gives the worst estimate of the largest eigenvalues, as was the case when using the advection model. We calculated the smallest eigenvalue of the preconditioned matrix $(C_5^{sp})^TAC_5^{sp}$ using eigs. When $C_5^{sp}$ is constructed using REVD_ritzit or Nyström, the smallest eigenvalue of $(C_5^{sp})^TAC_5^{sp}$ is equal to one, whereas using REVD it is approximately 0.94. This may explain why the preconditioner constructed using REVD does not perform as well as the other randomised preconditioners, but it is not entirely clear why the preconditioner that uses REVD_ritzit shows the best performance.

The PCG convergence when using the deterministic LMP and the randomised LMP with information from REVD_ritzit with different $k$ values is compared in Figure 3 for both setups of the model-error covariance matrix. We also show an additional case where the model-error covariance matrix is constructed setting $\sigma_q = \sigma_b/100 = 0.002$ and $L_q = 0.25\Delta X$. In this case, the performance of the REVD and Nyström methods is very similar, outperforming no preconditioning after the first 10–15 iterations, with better performance for higher $k$ values (results not shown). Moreover, REVD_ritzit again outperforms the deterministic LMP from the first PCG iterations. For the deterministic LMP in Figure 3, varying $k$ has little effect, especially in the initial iterations. However, for REVD_ritzit, in general increasing $k$ results in a greater decrease of the cost function. Setting $k = 5$ gives better initial results compared with $k = 10$ in the $\sigma_q = 0.002$ case, but the larger $k$ value performs better after that. Also, at any iteration of PCG we obtain a lower value of the quadratic cost function using the randomised LMP with $k = 5$ compared with the deterministic LMP with $k = 15$, which uses exact eigenpair information from the Hessian of the previous loop.

FIGURE 3. A comparison of the values of the quadratic cost function at every PCG iteration when using the deterministic LMP with information from the previous loop (eigs) and the randomised LMP with information from REVD_ritzit for different $k$ values (5, 10, and 15). No second-level preconditioning is also shown (case (a) is the same as in Figure 2). In cases (a), (b), and (c) the model-error covariance matrices are constructed using the parameters $\sigma_q$ and $L_q$

5.2.2 Effect of the observation network

To understand the sensitivities of results from the different LMPs to the observation network, we consider a system with the same parameters as in the previous section, where we had 120 observations, but we now observe the following:
  • every fifth model variable at every fifth time step (480 observations in total);
  • every second variable at every second time step (3,000 observations in total).

The oversampling parameter is again set to l = 5 and we set k = 5 and k = 15 for both observation networks. Since the number of observations is equal to the number of eigenvalues that are larger than one and there are more observations than in the previous section, there are more eigenvalues that are larger than one after first-level preconditioning. Because all 50 experiments with different Gaussian matrices in the previous section were close to the mean, we perform 10 experiments for each randomised method, solve the systems, and report the means of the quadratic cost function.

The results are presented in Figure 4. Again, the randomised LMPs perform better than the deterministic LMP. However, if the preconditioner is constructed with a small amount of information about the system ( k = 5 for both systems and k = 15 for the system with 3,000 observations), then there is little difference in the performance of different randomised LMPs. Also, when the number of observations is increased, more iterations of PCG are needed to get any improvement in the minimisation of the quadratic cost function when using the deterministic LMP over using no second-level preconditioning.

FIGURE 4. As in Figure 2, but for two systems with $q$ observations; 10 experiments are done for each randomised method and the mean values plotted

When comparing the randomised and deterministic LMPs with different values of k for these systems, we obtain similar results to those in Figure 3a, that is, it is more advantageous to use the randomised LMP constructed with k = 5 than the deterministic LMP constructed with k = 15 .

5.2.3 Effect of oversampling

We next consider the effect of increasing the value of the oversampling parameter $l$. The observation network is as in Section 5.2.1 (120 observations in total). We set $k = 15$ and perform the second inner loop 50 times for every value of $l \in \{5, 10, 15\}$ with all three randomised methods. The standard deviation of the value of the quadratic cost function at every iteration is presented in Figure 5.

FIGURE 5. Standard deviation of the quadratic cost function at every iteration of PCG when the spectral-LMP is constructed with different randomised methods. For every randomised method we perform 50 experiments. Here $\sigma_q = 0.1$ and $L_q = 2\Delta X$

For all the methods, the standard deviation is greatest in the first iterations of PCG. It is reduced when the value of $l$ is increased, and the largest reduction happens in the first iterations. However, REVD_ritzit is the least sensitive to the increase in oversampling. With all values of $l$, REVD_ritzit has the largest standard deviation in the first few iterations, but it still gives the largest reduction of the quadratic cost function. Hence, large oversampling is not necessary if REVD_ritzit is used.

6 CONCLUSIONS AND FUTURE WORK

We have proposed a new randomised approach to second-level preconditioning of the incremental weak-constraint 4D-Var forcing formulation. It can be preconditioned with an LMP that is constructed using approximations of eigenpairs of the Hessian. Previously, by using the Lanczos and CG connection, these approximations were obtained at a very low cost in one inner loop and then used to construct the LMP in the following inner loop. We have considered three methods (REVD, Nyström, and REVD_ritzit) that employ randomisation to compute the approximations. These methods can be used to construct the preconditioner cheaply in the current inner loop, with no dependence on the previous inner loop, and are parallelisable.

Numerical experiments with the linear advection and Lorenz-96 models have shown that the randomised LMPs constructed with approximate eigenpairs improve the convergence of PCG more than deterministic LMPs with information from the previous loop, especially after the initial PCG iterations. The quadratic cost function reduces more rapidly when using a randomised LMP rather than a deterministic LMP, even if the randomised LMP is constructed with fewer vectors than the deterministic LMP. Also, for randomised LMPs, the more information about the system we use (i.e., the more approximations of eigenpairs are used to construct the preconditioner), the greater the reduction in the quadratic cost function, with a possible exception in the first PCG iterations for low k values and very small model error. Using more information to construct a deterministic LMP may not result in larger reduction of the quadratic cost function, especially in the first iterations of PCG, which is in line with the results in Tshimanga et al. (2008). However, if not enough information is included in the randomised LMP, then preconditioning may have no effect on the first few iterations of PCG.

Of the randomised methods considered, the best overall performance was for REVD_ritzit. However, if we run a small number of PCG iterations, the preconditioners obtained with different randomised methods give similar results. In the case of a very small model error, using REVD and Nyström is useful after the initial iterations of PCG, whereas REVD_ritzit improves the reduction of the quadratic cost function from the start. The performance is independent of the choice of random Gaussian start matrix and it may be improved with oversampling.

In this work we apply randomised methods to generate a preconditioner, which is then used to accelerate the solution of the exact inner loop problem (Equation 11) with the PCG method (as discussed in Section 4). A different approach has been explored by Bousserez and Henze (2018) and Bousserez et al. (2020), who presented and tested a randomised solution algorithm called the Randomized Incremental Optimal Technique (RIOT) in data assimilation. RIOT is designed to be used instead of PCG and employs a randomised eigenvalue decomposition of the Hessian (using a different method from the ones presented in this article) to construct directly the solution x in Equation (11), which approximates the solution given by PCG.

The randomised preconditioning approach can also be employed to minimise other quadratic cost functions, including the strong-constraint 4D-Var formulation. Further exploration of other single-pass versions of randomised methods for eigenvalue decomposition, which are discussed in Halko et al. (2011), may be useful. In particular, the single-pass version of the Nyström method is potentially attractive. If a large number of Ritz vectors are used to construct the preconditioner, more attention can be paid to choosing the value of the oversampling parameter l in the randomised methods. In some cases, a better approximation may be obtained if l depends linearly on the target rank of the approximation (Nakatsukasa, 2020).

AUTHOR CONTRIBUTIONS

Ieva Daužickaitė: conceptualization, formal analysis, investigation, methodology, software, validation, visualization, writing-original draft, writing-review and editing. Amos S. Lawless: investigation; methodology; supervision; validation; writing – review and editing. Jennifer A. Scott: funding acquisition; investigation; methodology; project administration; resources; supervision; validation; writing – review and editing. Peter Jan van Leeuwen: investigation; methodology; supervision; validation; writing – review and editing.

ACKNOWLEDGEMENTS

We are grateful to Dr Adam El-Said for his code for the weak-constraint 4D-Var assimilation system. We also thank two anonymous reviewers, whose comments helped us to improve the article.

CONFLICT OF INTEREST

The authors declare no conflict of interest.
