Randomised preconditioning for the forcing formulation of weak-constraint 4D-Var
Funding information: Engineering and Physical Sciences Research Council, EP/L016613/1; H2020 European Research Council, 694509; NERC National Centre for Earth Observation
Abstract
There is growing awareness that errors in the model equations cannot be ignored in data assimilation methods such as four-dimensional variational assimilation (4D-Var). If allowed for, more information can be extracted from observations, longer time windows are possible, and the minimisation process is easier, at least in principle. Weak-constraint 4D-Var estimates the model error and minimises a series of quadratic cost functions, which can be achieved using the conjugate gradient (CG) method; minimising each cost function is called an inner loop. CG needs preconditioning to improve its performance. In previous work, limited-memory preconditioners (LMPs) have been constructed using approximations of the eigenvalues and eigenvectors of the Hessian in the previous inner loop. If the Hessian changes significantly in consecutive inner loops, the LMP may be of limited usefulness. To circumvent this, we propose using randomised methods for low-rank eigenvalue decomposition and use these approximations to construct LMPs cheaply using information from the current inner loop. Three randomised methods are compared. Numerical experiments in idealised systems show that the resulting LMPs perform better than the existing LMPs. Using these methods may allow more efficient and robust implementations of incremental weak-constraint 4D-Var.
1 INTRODUCTION
In numerical weather prediction, data assimilation provides the initial conditions for the weather model and hence influences the accuracy of the forecast (Kalnay, 2002). Data assimilation uses observations of a dynamical system to correct a previous estimate (background) of the system's state. Statistical knowledge of the errors in the observations and the background is incorporated in the process. A variational data assimilation method called weak-constraint 4D-Var provides a way to also take the model error into account (Trémolet, 2006), which can lead to a better analysis (e.g., Trémolet, 2007).
We explore the weak-constraint 4D-Var cost function. In its incremental version, the state is updated by a minimiser of the linearised version of the cost function. The minimiser can be found by solving a large sparse linear system. The process of solving each system is called an inner loop. Because the second derivative of the cost function, the Hessian, is symmetric positive-definite, the systems may be solved with the conjugate gradient (CG) method (Hestenes and Stiefel, 1952), the convergence rate of which depends on the eigenvalue distribution of the Hessian. Limited-memory preconditioners (LMPs) have been shown to improve the convergence of CG when minimising the strong-constraint 4D-Var cost function (Fisher, 1998; Tshimanga et al., 2008). Strong-constraint 4D-Var differs from weak-constraint 4D-Var by making the assumption that the dynamical model has no error.
LMPs can be constructed using approximations to the eigenvalues and eigenvectors (eigenpairs) of the Hessian. The Lanczos and CG connection (section 6.7 of Saad, 2003) can be exploited to obtain approximations to the eigenpairs of the Hessian in one inner loop, and these approximations may then be employed to construct the preconditioner for the next inner loop (Tshimanga et al., 2008). This approach does not describe how to precondition the first inner loop, and the number of CG iterations used on the ith inner loop limits the number of vectors available to construct the preconditioner on the $(i+1)$th inner loop. Furthermore, the success of preconditioning relies on the assumption that the Hessians do not change significantly from one inner loop to the next.
In this article, we propose addressing these drawbacks by using easy-to-implement subspace iteration methods (see chapter 5 of Saad, 2011) to obtain approximations of the largest eigenvalues and corresponding eigenvectors of the Hessian in the current inner loop. The subspace iteration method first approximates the range of the Hessian by multiplying it with a start matrix (for approaches to choosing it see, e.g., Duff and Scott, 1993) and the speed of convergence depends on the choice of this matrix (e.g., Gu, 2015). A variant of subspace iteration, which uses a Gaussian random start matrix, is called Randomised Eigenvalue Decomposition (REVD). REVD has been popularised by probabilistic analysis (Halko et al., 2011; Martinsson and Tropp, 2020). It has been shown that REVD, which is equivalent to one iteration of the subspace iteration method, can often generate a satisfactory approximation of the largest eigenpairs of a matrix that has rapidly decreasing eigenvalues. Because the Hessian is symmetric positive-definite, a randomised Nyström method for computing a low-rank eigenvalue decomposition can also be used. It is expected to give a higher quality approximation than REVD (e.g., Halko et al., 2011). We explore these two methods and another implementation of REVD, which is based on the ritzit implementation of the subspace method (Rutishauser, 1971). The methods differ in the number of matrix–matrix products with the Hessian. Even though more computations are required to generate the preconditioner in the current inner loop compared with using information from the previous inner loop, the randomised methods are block methods and hence easily parallelisable.
In Section 2, we discuss the weak-constraint 4D-Var method and, in Section 3, we consider LMPs and ways to obtain spectral approximations. The three randomised methods are examined in Section 4. Numerical experiments with linear advection and Lorenz 96 models are presented in Section 5, followed by a concluding discussion in Section 6.
2 WEAK-CONSTRAINT 4D-VAR
We are interested in estimating the state evolution of a dynamical system ${\mathbf{x}}_{0},{\mathbf{x}}_{1},\dots ,{\mathbf{x}}_{N}$, with ${\mathbf{x}}_{i}\in {\mathbb{R}}^{n}$, at times ${t}_{0},{t}_{1},\dots ,{t}_{N}$. Prior information about the state at ${t}_{0}$ is called the background and is denoted by ${\mathbf{x}}^{\mathrm{b}}\in {\mathbb{R}}^{n}$. It is assumed that ${\mathbf{x}}^{\mathrm{b}}$ has Gaussian errors with zero mean and covariance matrix $\mathbf{B}\in {\mathbb{R}}^{n\times n}$. Observations of the system at time ${t}_{i}$ are denoted by ${\mathbf{y}}_{i}\in {\mathbb{R}}^{{q}_{i}}$ and their errors are assumed to be Gaussian with zero mean and covariance matrix ${\mathbf{R}}_{i}\in {\mathbb{R}}^{{q}_{i}\times {q}_{i}}$ (${q}_{i}\ll n$). An observation operator ${\mathscr{H}}_{i}$ maps the model variables into the observed quantities at the correct location, i.e. ${\mathbf{y}}_{i}={\mathscr{H}}_{i}({\mathbf{x}}_{i})+{\mathit{\zeta}}_{i}$, where ${\mathit{\zeta}}_{i}$ is the observation error. We assume that the observation errors are uncorrelated in time.
The evolution of the state is described by a dynamical model ${\mathscr{M}}_{i}$,

$${\mathbf{x}}_{i+1}={\mathscr{M}}_{i+1}({\mathbf{x}}_{i})+{\mathit{\eta}}_{i+1},\qquad i=0,\dots ,N-1,\tag{1}$$

where ${\mathit{\eta}}_{i+1}$ is the model error at time ${t}_{i+1}$. The model errors are assumed to be uncorrelated in time and to be Gaussian with zero mean and covariance matrix ${\mathbf{Q}}_{i}\in {\mathbb{R}}^{n\times n}$.
In the forcing formulation, weak-constraint 4D-Var estimates ${\mathbf{x}}_{0}$ and the model errors ${\mathit{\eta}}_{1},\dots ,{\mathit{\eta}}_{N}$ by minimising the cost function

$$J({\mathbf{x}}_{0},{\mathit{\eta}}_{1},\dots ,{\mathit{\eta}}_{N})=\frac{1}{2}{\Vert {\mathbf{x}}_{0}-{\mathbf{x}}^{\mathrm{b}}\Vert}_{{\mathbf{B}}^{-1}}^{2}+\frac{1}{2}\sum _{i=0}^{N}{\Vert {\mathbf{y}}_{i}-{\mathscr{H}}_{i}({\mathbf{x}}_{i})\Vert}_{{\mathbf{R}}_{i}^{-1}}^{2}+\frac{1}{2}\sum _{i=1}^{N}{\Vert {\mathit{\eta}}_{i}\Vert}_{{\mathbf{Q}}_{i}^{-1}}^{2},\tag{2}$$

where ${\Vert \mathbf{v}\Vert}_{\mathbf{M}}^{2}={\mathbf{v}}^{\mathrm{T}}\mathbf{M}\mathbf{v}$ and ${\mathbf{x}}_{i}$ satisfies the model constraint in Equation (1) (Trémolet, 2006). The analysis (approximation of the state evolution over the time window) ${\mathbf{x}}_{0}^{\mathrm{a}},{\mathbf{x}}_{1}^{\mathrm{a}},\dots ,{\mathbf{x}}_{N}^{\mathrm{a}}$ can be obtained from the minimiser of Equation (2) using the constraints in Equation (1).
2.1 Incremental 4D-Var
2.2 Control-variable transform
3 PRECONDITIONING WEAK-CONSTRAINT 4D-VAR
3.1 Preconditioned conjugate gradients
Algorithm 1. Split preconditioned CG for solving $\mathbf{A}\mathbf{x}=\mathbf{b}$
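The listing of Algorithm 1 is omitted from this extract. As a minimal sketch (not the paper's implementation), split preconditioned CG can be written as follows; the function name is illustrative, and the preconditioned Hessian ${\mathbf{C}}^{\mathrm{T}}\mathbf{A}\mathbf{C}$ is formed explicitly only for clarity, whereas in data assimilation all products would be applied matrix-free.

```python
import numpy as np

def split_pcg(A, b, C, tol=1e-10, maxit=200):
    """Split preconditioned CG: solve (C^T A C) y = C^T b, return x = C y.
    Equivalent to PCG on A x = b with preconditioner C C^T."""
    At = C.T @ A @ C              # formed explicitly only for this small demo
    bt = C.T @ b
    y = np.zeros_like(bt)
    r = bt - At @ y               # residual of the preconditioned system
    p = r.copy()
    rs = r @ r
    for _ in range(maxit):
        Ap = At @ p
        alpha = rs / (p @ Ap)
        y = y + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(bt):
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return C @ y
```

With $\mathbf{C}=\mathbf{I}$ this reduces to standard CG; a nontrivial factor $\mathbf{C}$ (e.g., an LMP factor) changes the eigenvalue distribution seen by CG, while the returned $\mathbf{x}=\mathbf{C}\mathbf{y}$ still solves the original system.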
3.2 Limited-memory preconditioners
A potential problem for practical applications of Equation (14) is the need for expensive matrix–matrix products with $\mathbf{A}$. Simpler formulations of Equation (14) are obtained by imposing more conditions on the vectors ${\mathbf{s}}_{1},\dots ,{\mathbf{s}}_{k}$. Two formulations that Tshimanga et al. (2008) call spectral-LMP and Ritz-LMP have been used, for example, in ocean data assimilation in the Regional Ocean Modeling System (ROMS: Moore et al., 2011) and in the variational data assimilation software for the Nucleus for European Modelling of the Ocean (NEMO) ocean model (NEMOVAR: Mogensen et al., 2012), and in coupled climate reanalysis in the Coupled ECMWF ReAnalysis (CERA: Laloyaux et al., 2018).
An important property is that if an LMP is constructed using k vectors then at least k eigenvalues of the preconditioned matrix ${\mathbf{C}}^{\mathrm{T}}\mathbf{A}\mathbf{C}$ will be equal to one, and the remaining eigenvalues will lie between the smallest and largest eigenvalues of $\mathbf{A}$ (see theorem 3.4 of Gratton et al., 2011). Moreover, if $\mathbf{A}$ has a cluster of eigenvalues at one, then LMPs preserve this cluster. This is crucial when preconditioning Equation (10): because the LMPs preserve the $n(N+1)-q$ smallest eigenvalues of ${\mathcal{A}}^{\mathrm{pr}}$ that are equal to one, the CG convergence can be improved by decreasing the largest eigenvalues. Hence, it is preferable to use the largest eigenpairs or their approximations.
In practice, both spectral-LMP and Ritz-LMP use Ritz vectors and values to construct the LMPs. It has been shown that the Ritz-LMP can perform better than spectral-LMP in a strong-constraint 4D-Var setting by correcting for the inaccuracies in the estimates of eigenpairs (Tshimanga et al., 2008). However, Gratton et al. (2011) (their theorem 4.5) have shown that, if the preconditioners are constructed with Ritz vectors and values that have converged, then the spectral-LMP acts like the Ritz-LMP.
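Equations (14) and (17) are not reproduced in this extract. As an illustration only, the sketch below builds the split factor of a spectral-LMP in the form $\mathbf{C}=\mathbf{I}+\sum_{i=1}^{k}({\theta}_{i}^{-1/2}-1){\mathbf{u}}_{i}{\mathbf{u}}_{i}^{\mathrm{T}}$, one standard way of writing the spectral-LMP from orthonormal eigenpair estimates $({\theta}_{i},{\mathbf{u}}_{i})$ (cf. Tshimanga et al., 2008); the exact formulation used in the paper may differ, and the function name is an assumption.

```python
import numpy as np

def spectral_lmp_factor(theta, U):
    """Assumed split factor of the spectral-LMP:
    C = I + sum_i (theta_i**-0.5 - 1) u_i u_i^T,
    built from Ritz values theta (shape (k,)) and orthonormal
    Ritz vectors in the columns of U (shape (n, k))."""
    n = U.shape[0]
    # Broadcasting scales column i of U by (theta_i**-0.5 - 1).
    return np.eye(n) + (U * (1.0 / np.sqrt(theta) - 1.0)) @ U.T
```

With exact eigenpairs, the $k$ targeted eigenvalues of ${\mathbf{C}}^{\mathrm{T}}\mathbf{A}\mathbf{C}$ become one and the remaining eigenvalues are unchanged, consistent with theorem 3.4 of Gratton et al. (2011); with the approximate eigenpairs returned by randomised methods, the spectrum can expand slightly below one, as observed in Section 5.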
3.3 Ritz information
Calculating or approximating all the eigenpairs of a large sparse matrix is impractical. Hence, only a subset is approximated to construct the LMPs. This is often done by extracting approximations from a subspace, and the Rayleigh–Ritz (RR) procedure is a popular method for doing this.
Assume that $\mathcal{Z}\subset {\mathbb{R}}^{{n}_{A}}$ is an invariant subspace of $\mathbf{A}$, that is, $\mathbf{A}\mathbf{z}\in \mathcal{Z}$ for every $\mathbf{z}\in \mathcal{Z}$, and the columns of $\mathbf{Z}\in {\mathbb{R}}^{{n}_{A}\times m}$, $m<{n}_{A}$, form an orthonormal basis for $\mathcal{Z}$. If $(\lambda ,\widehat{\mathbf{y}})$ is an eigenpair of $\mathbf{K}={\mathbf{Z}}^{\mathrm{T}}\mathbf{A}\mathbf{Z}\in {\mathbb{R}}^{m\times m}$, then $(\lambda ,\mathbf{v})$, where $\mathbf{v}=\mathbf{Z}\widehat{\mathbf{y}}$, is an eigenpair of $\mathbf{A}$ (see, e.g., theorem 1.2 in chapter 4 of Stewart, 2001). Hence, eigenvalues of $\mathbf{A}$ that lie in the subspace $\mathcal{Z}$ can be extracted by solving a small eigenvalue problem.
However, generally the computed subspace $\tilde{\mathcal{Z}}$ with orthonormal basis as columns of $\tilde{\mathbf{Z}}$ is not invariant. Hence, only approximations $\tilde{\mathbf{v}}$ to the eigenvectors $\mathbf{v}$ belong to $\tilde{\mathcal{Z}}$. The RR procedure computes approximations $\mathbf{u}$ to $\tilde{\mathbf{v}}$. We give the RR procedure in Algorithm 2, where the eigenvalue decomposition is abbreviated as EVD. Approximations to eigenvalues $\lambda $ are called Ritz values $\theta $, and $\mathbf{u}$ are the Ritz vectors. Eigenvectors of $\tilde{\mathbf{K}}={\tilde{\mathbf{Z}}}^{\mathrm{T}}\mathbf{A}\tilde{\mathbf{Z}}$, which is the projection of $\mathbf{A}$ on to $\tilde{\mathcal{Z}}$, are denoted by $\mathbf{w}$ and are called primitive Ritz vectors.
Algorithm 2. Rayleigh–Ritz procedure for computing approximations of eigenpairs of symmetric $\mathbf{A}$
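The body of Algorithm 2 is not shown in this extract; the procedure described above amounts to a few lines (the function name is illustrative):

```python
import numpy as np

def rayleigh_ritz(A, Z):
    """Rayleigh-Ritz: extract Ritz pairs of symmetric A from the
    subspace spanned by the orthonormal columns of Z."""
    K = Z.T @ A @ Z                # projection of A onto the subspace
    theta, W = np.linalg.eigh(K)   # Ritz values, primitive Ritz vectors
    U = Z @ W                      # Ritz vectors of A
    return theta, U
```

If the subspace is invariant, the Ritz pairs are exact eigenpairs of $\mathbf{A}$; otherwise they are the best approximations available from that subspace.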
3.4 Spectral information from CG
Tshimanga et al. (2008) use Ritz pairs of the Hessian in one inner loop to construct LMPs for the following inner loop, i.e. information on ${\mathbf{A}}^{(0)}$ is used to precondition ${\mathbf{A}}^{(1)}$, and so on. Success relies on the Hessians not changing significantly from one inner loop to the next. Ritz information can be obtained from the Lanczos process that is connected to CG, hence information for the preconditioner can be gathered at a negligible cost.
4 RANDOMISED EIGENVALUE DECOMPOSITION
If the Hessian in one inner loop differs significantly from the Hessian in the previous inner loop, then it may not be useful to precondition the former with an LMP that is constructed with information from the latter. Employing the Lanczos process to obtain eigenpair estimates and using them to construct an LMP in the same inner loop is too computationally expensive, because each iteration of the Lanczos process requires a matrix–vector product with the Hessian, thus the cost is similar to the cost of CG. Hence, we explore a different approach.
Subspace iteration is a simple procedure to obtain approximations to the largest eigenpairs (see, e.g., chapter 5 of Saad, 2011). It is easily understandable and can be implemented in a straightforward manner, although its convergence can be very slow if the largest eigenvalues are not well separated from the rest of the spectrum. The accuracy of subspace iteration may be enhanced by using a RR projection.
Such an approach is used in Randomised Eigenvalue Decomposition (REVD: see, e.g., Halko et al., 2011). This takes a Gaussian random matrix, that is, a matrix with independent standard normal random variables with zero mean and variance equal to one as its entries, and applies one iteration of the subspace iteration method with RR projection, hence obtaining a rank m approximation $\mathbf{A}\approx {\mathbf{Z}}_{1}({\mathbf{Z}}_{1}^{\mathrm{T}}\mathbf{A}{\mathbf{Z}}_{1}){\mathbf{Z}}_{1}^{\mathrm{T}}$, where ${\mathbf{Z}}_{1}\in {\mathbb{R}}^{{n}_{A}\times m}$ is orthogonal. We present REVD in Algorithm 3. An important feature of REVD is the observation that the accuracy of the approximation is enhanced with oversampling (which is also called “using guard vectors” in Duff and Scott, 1993), that is, working on a larger space than the required number of Ritz vectors. Halko et al. (2011) claim that setting the oversampling parameter to 5 or 10 is often sufficient.
Algorithm 3. Randomised eigenvalue decomposition, REVD
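Since the listing of Algorithm 3 is omitted here, the following sketch follows the description above (Gaussian start matrix, orthonormalisation of the sample matrix, RR projection); the function name, interface, and oversampling convention are illustrative only.

```python
import numpy as np

def revd(A, k, l=5, rng=None):
    """Randomised EVD: one subspace iteration with RR projection.
    Returns the k largest Ritz values/vectors using k+l Gaussian samples."""
    rng = np.random.default_rng(rng)
    nA = A.shape[0]
    G = rng.standard_normal((nA, k + l))   # Gaussian start matrix
    Y = A @ G                              # first product with A: sample the range
    Z1, _ = np.linalg.qr(Y)                # orthonormal basis of the sampled range
    K = Z1.T @ (A @ Z1)                    # second product with A: RR projection
    theta, W = np.linalg.eigh(K)
    theta, W = theta[::-1], W[:, ::-1]     # sort descending
    return theta[:k], Z1 @ W[:, :k]
```

The extra $l$ columns are the oversampling ("guard vectors"); they improve the accuracy of the leading $k$ Ritz pairs at modest extra cost.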
Randomised algorithms are designed to minimise the communication instead of the flop count. The expensive parts of Algorithm 3 are the two matrix–matrix products $\mathbf{A}\mathbf{G}$ and $\mathbf{A}{\mathbf{Z}}_{1}$ in steps 2 and 4, that is, in each of these steps, matrix $\mathbf{A}$ has to be multiplied with $(k+l)$ vectors, which in serial computations would be essentially the cost of $2(k+l)$ iterations of unpreconditioned CG. However, note that these matrix–matrix products can be parallelised.
In weak-constraint 4D-Var, $\mathbf{A}$ is the Hessian, hence it is symmetric positive-definite and its eigenpairs can also be approximated using a randomised Nyström method (algorithm 5.5 of Halko et al., 2011), which is expected to give much more accurate results than REVD (Halko et al., 2011). We present the Nyström method in Algorithm 4, where singular-value decomposition is abbreviated as SVD. It considers a more elaborate rank m approximation than in REVD: $\mathbf{A}\approx (\mathbf{A}{\mathbf{Z}}_{1}){({\mathbf{Z}}_{1}^{\mathrm{T}}\mathbf{A}{\mathbf{Z}}_{1})}^{-1}{(\mathbf{A}{\mathbf{Z}}_{1})}^{\mathrm{T}}=\mathbf{F}{\mathbf{F}}^{\mathrm{T}}$, where ${\mathbf{Z}}_{1}\in {\mathbb{R}}^{{n}_{A}\times m}$ is orthogonal (obtained in the same way as in REVD, e.g. using a tall skinny QR (TSQR) decomposition: Demmel et al., 2012) and $\mathbf{F}=(\mathbf{A}{\mathbf{Z}}_{1}){({\mathbf{Z}}_{1}^{\mathrm{T}}\mathbf{A}{\mathbf{Z}}_{1})}^{-1/2}\in {\mathbb{R}}^{{n}_{A}\times m}$ is an approximate Cholesky factor of $\mathbf{A}$, which is found in step 6. The eigenvalues of $\mathbf{F}{\mathbf{F}}^{\mathrm{T}}$ are the squares of the singular values of $\mathbf{F}$ (see section 2.4.2 of Golub and Van Loan, 2013). In numerical computations we store matrices ${\mathbf{E}}^{(1)}=\mathbf{A}{\mathbf{Z}}_{1}$ and ${\mathbf{E}}^{(2)}={\mathbf{Z}}_{1}^{\mathrm{T}}{\mathbf{E}}^{(1)}={\mathbf{Z}}_{1}^{\mathrm{T}}\mathbf{A}{\mathbf{Z}}_{1}$ (step 4), perform the Cholesky factorization of ${\mathbf{E}}^{(2)}={\mathbf{C}}^{\mathrm{T}}\mathbf{C}$ (step 5), and obtain $\mathbf{F}$ by solving the triangular system $\mathbf{F}\mathbf{C}={\mathbf{E}}^{(1)}$.
Algorithm 4. Randomised eigenvalue decomposition for symmetric positive semidefinite $\mathbf{A}$, Nyström
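The listing of Algorithm 4 is likewise omitted; a sketch following the computational steps described above (store ${\mathbf{E}}^{(1)}$ and ${\mathbf{E}}^{(2)}$, Cholesky-factorise ${\mathbf{E}}^{(2)}$, triangular solve for $\mathbf{F}$, then an SVD of $\mathbf{F}$) is given below, with an illustrative function name.

```python
import numpy as np

def nystrom_evd(A, k, l=5, rng=None):
    """Randomised Nystrom EVD for symmetric positive (semi)definite A.
    Uses the rank-(k+l) approximation A ~ F F^T with
    F = (A Z1)(Z1^T A Z1)^{-1/2}."""
    rng = np.random.default_rng(rng)
    nA = A.shape[0]
    G = rng.standard_normal((nA, k + l))
    Z1, _ = np.linalg.qr(A @ G)            # first product with A, as in REVD
    E1 = A @ Z1                            # second product with A
    E2 = Z1.T @ E1
    C = np.linalg.cholesky(E2).T           # E2 = C^T C, C upper triangular
    F = np.linalg.solve(C.T, E1.T).T       # F C = E1  =>  F = E1 C^{-1}
    U, s, _ = np.linalg.svd(F, full_matrices=False)
    # Eigenvalues of F F^T are the squared singular values of F.
    return s[:k] ** 2, U[:, :k]
```

The small Cholesky factorisation and SVD act on $(k+l)\times(k+l)$ and ${n}_{A}\times(k+l)$ matrices, respectively, so the dominant cost remains the two products with $\mathbf{A}$.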
The matrix–matrix product with $\mathbf{A}$ at step 4 of Algorithms 3 and 4 is removed in Rutishauser's implementation of subspace iteration with RR projection called ritzit (Rutishauser, 1971). It can be derived in the following manner (see chapter 14 of Parlett, 1998). Assume that ${\mathbf{G}}_{3}\in {\mathbb{R}}^{{n}_{A}\times m}$ is an orthogonal matrix and the sample matrix is ${\mathbf{Y}}_{3}=\mathbf{A}{\mathbf{G}}_{3}={\mathbf{Z}}_{3}{\mathbf{R}}_{3}$, where ${\mathbf{Z}}_{3}\in {\mathbb{R}}^{{n}_{A}\times m}$ is orthogonal and ${\mathbf{R}}_{3}\in {\mathbb{R}}^{m\times m}$ is upper triangular. Then a projection of ${\mathbf{A}}^{2}$ onto the column space of ${\mathbf{G}}_{3}$ is $\widehat{\mathbf{K}}={\mathbf{Y}}_{3}^{\mathrm{T}}{\mathbf{Y}}_{3}={\mathbf{R}}_{3}^{\mathrm{T}}{\mathbf{Z}}_{3}^{\mathrm{T}}{\mathbf{Z}}_{3}{\mathbf{R}}_{3}={\mathbf{R}}_{3}^{\mathrm{T}}{\mathbf{R}}_{3}$. Then ${\mathbf{K}}_{3}={\mathbf{R}}_{3}{\mathbf{R}}_{3}^{\mathrm{T}}={\mathbf{R}}_{3}{\mathbf{R}}_{3}^{\mathrm{T}}{\mathbf{R}}_{3}{\mathbf{R}}_{3}^{-1}={\mathbf{R}}_{3}\widehat{\mathbf{K}}{\mathbf{R}}_{3}^{-1}$, which is similar to $\widehat{\mathbf{K}}$ and hence has the same eigenvalues. This leads to another implementation of REVD presented in Algorithm 5. This is a single-pass algorithm, meaning that $\mathbf{A}$ has to be accessed just once, and, to the best of our knowledge, this method has not been considered in the context of randomised eigenvalue approximations.
Algorithm 5. Randomised eigenvalue decomposition based on ritzit, REVD_ritzit
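As the listing of Algorithm 5 is omitted from this extract, the sketch below follows the derivation above: the Ritz values are recovered as square roots of the eigenvalues of ${\mathbf{K}}_{3}={\mathbf{R}}_{3}{\mathbf{R}}_{3}^{\mathrm{T}}$, and the Ritz vectors are taken as ${\mathbf{Z}}_{3}\mathbf{W}$. The function name is illustrative and details may differ from the paper's Algorithm 5.

```python
import numpy as np

def revd_ritzit(A, k, l=5, rng=None):
    """Single-pass randomised EVD based on Rutishauser's ritzit:
    A is accessed only once, at the cost of an extra orthogonalisation
    of the Gaussian start matrix."""
    rng = np.random.default_rng(rng)
    nA = A.shape[0]
    G3, _ = np.linalg.qr(rng.standard_normal((nA, k + l)))  # orthonormal start
    Y3 = A @ G3                          # the only product with A
    Z3, R3 = np.linalg.qr(Y3)
    K3 = R3 @ R3.T                       # similar to Y3^T Y3, a projection of A^2
    t, W = np.linalg.eigh(K3)
    t, W = t[::-1], W[:, ::-1]           # sort descending
    theta = np.sqrt(np.maximum(t, 0.0))  # Ritz values of A from those of A^2
    return theta[:k], Z3 @ W[:, :k]
```

Because the RR projection here acts on the random subspace rather than on the sampled range, the leading Ritz values tend to underestimate the largest eigenvalues, which matches the behaviour reported for REVD_ritzit in Section 5.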
Note that the Ritz vectors given by Algorithms 3, 4, and 5 are different. Although Algorithm 5 accesses the matrix $\mathbf{A}$ only once, it requires an additional orthogonalisation of a matrix of size ${n}_{A}\times (k+l)$.
In Table 1, we summarise some properties of the Lanczos, REVD, Nyström, and REVD_ritzit methods when they are used to compute Ritz values and vectors to generate a preconditioner for $\mathbf{A}$ in incremental data assimilation. Note that the cost of applying spectral-LMP depends on the number of vectors k used in its construction and is independent of which method is used to obtain them. The additional cost of using randomised algorithms arises only once per inner loop when the preconditioner is generated. We recall that in these algorithms the required EVD or SVD of the small matrix can be obtained cheaply and the most expensive parts of the algorithms are the matrix–matrix products of $\mathbf{A}$ and the ${n}_{A}\times (k+l)$ matrices. If enough computational resources are available, these can be parallelised. In the best-case scenario, all $k+l$ matrix–vector products can be performed at the same time, making the cost of the matrix–matrix product equivalent to the cost of one iteration of CG plus communication between the processors.
| | Lanczos | REVD | Nyström | REVD_ritzit |
|---|---|---|---|---|
| Information source | Previous inner loop | Current inner loop | Current inner loop | Current inner loop |
| Preconditioner for the first inner loop | No | Yes | Yes | Yes |
| k dependence on the previous inner loop | Bounded by the number of CG iterations | Independent | Independent | Independent |
| Matrix–matrix products with $\mathbf{A}$ | None | 2 products with ${n}_{A}\times (k+l)$ matrices | 2 products with ${n}_{A}\times (k+l)$ matrices | 1 product with an ${n}_{A}\times (k+l)$ matrix |
| QR decomposition | None | None | None | ${\mathbf{Y}}_{3}\in {\mathbb{R}}^{{n}_{A}\times (k+l)}$ |
| Orthogonalisation | None | $\mathbf{Y}\in {\mathbb{R}}^{{n}_{A}\times (k+l)}$ | $\mathbf{Y}\in {\mathbb{R}}^{{n}_{A}\times (k+l)}$ | $\mathbf{G}\in {\mathbb{R}}^{{n}_{A}\times (k+l)}$ |
| Cholesky factorization | None | None | ${\mathbf{E}}^{(2)}\in {\mathbb{R}}^{(k+l)\times (k+l)}$ | None |
| Triangular solve | None | None | $\mathbf{F}\mathbf{C}={\mathbf{E}}^{(1)}$ for $\mathbf{F}\in {\mathbb{R}}^{{n}_{A}\times (k+l)}$ | None |
| Deterministic EVD | ${\mathbf{T}}_{k}\in {\mathbb{R}}^{k\times k}$* | $\mathbf{K}\in {\mathbb{R}}^{(k+l)\times (k+l)}$ | None | ${\mathbf{K}}_{3}\in {\mathbb{R}}^{(k+l)\times (k+l)}$ |
| Deterministic SVD | None | None | $\mathbf{F}\in {\mathbb{R}}^{{n}_{A}\times (k+l)}$ | None |
When a randomised method is used to generate the preconditioner, an inner loop is performed as follows. Estimates of the Ritz values of the Hessian and the corresponding Ritz vectors are obtained with a randomised method (Algorithm 3, 4, or 5) and used to construct an LMP. Then the system of Equation (11) with the exact Hessian $\mathbf{A}$ is solved with PCG (Algorithm 1) using the LMP. The state is updated in the outer loop using the PCG solution.
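As an illustration of this inner-loop flow, the self-contained sketch below combines a REVD-style eigenpair estimate (Algorithm 3 style), an assumed spectral-LMP split factor $\mathbf{C}=\mathbf{I}+\sum_{i}({\theta}_{i}^{-1/2}-1){\mathbf{u}}_{i}{\mathbf{u}}_{i}^{\mathrm{T}}$, and split preconditioned CG. The synthetic matrix is only a stand-in for the Hessian; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, l = 200, 10, 5

# Synthetic SPD "Hessian" with a few dominant eigenvalues (stand-in only)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = np.concatenate([np.linspace(50, 5, 15), np.ones(n - 15)])
A = (Q * lam) @ Q.T
b = rng.standard_normal(n)

# 1. Ritz pairs of the current Hessian via REVD
G = rng.standard_normal((n, k + l))
Z1, _ = np.linalg.qr(A @ G)
theta, W = np.linalg.eigh(Z1.T @ A @ Z1)
theta, U = theta[::-1][:k], (Z1 @ W[:, ::-1])[:, :k]

# 2. Spectral-LMP split factor (assumed form)
C = np.eye(n) + (U * (1.0 / np.sqrt(theta) - 1.0)) @ U.T

# 3. Solve (C^T A C) y = C^T b with CG, then map back: x = C y
At, bt = C.T @ A @ C, C.T @ b
y = np.zeros(n)
r = bt.copy()
p = r.copy()
rs = r @ r
for it in range(500):
    Ap = At @ p
    a = rs / (p @ Ap)
    y = y + a * p
    r = r - a * Ap
    rs_new = r @ r
    if np.sqrt(rs_new) < 1e-8 * np.linalg.norm(bt):
        break
    p = r + (rs_new / rs) * p
    rs = rs_new
x = C @ y
```

In the incremental method, $\mathbf{x}$ would then be used to update the state in the outer loop before the next inner loop begins.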
5 NUMERICAL EXPERIMENTS
We demonstrate our proposed preconditioning strategies using two models: a simple linear advection model to explore the spectra of the preconditioned Hessian and the nonlinear Lorenz 96 model (Lorenz, 1996) to explore the convergence of split preconditioned CG (PCG). We perform identical twin experiments, where ${\mathbf{x}}^{t}={({({\mathbf{x}}_{0}^{t})}^{\mathrm{T}},\dots ,{({\mathbf{x}}_{N}^{t})}^{\mathrm{T}})}^{\mathrm{T}}$ denotes the reference trajectory. The observations and background state are generated by adding noise to ${\mathscr{H}}_{i}({\mathbf{x}}_{i}^{t})$ and ${\mathbf{x}}_{0}^{t}$ with covariance matrices $\mathbf{R}$ and $\mathbf{B}$, respectively. We use direct observations, thus the observation operator ${\mathscr{H}}_{i}$ is linear.
We use covariance matrices ${\mathbf{R}}_{i}={\sigma}_{\mathrm{o}}^{2}{\mathbf{I}}_{{q}_{i}}$, where ${q}_{i}$ is the number of observations at time ${t}_{i}$, ${\mathbf{Q}}_{i}={\sigma}_{q}^{2}{\mathbf{C}}_{q}$, where ${\mathbf{C}}_{q}$ is a Laplacian correlation matrix (Johnson et al., 2005), and $\mathbf{B}={\sigma}_{\mathrm{b}}^{2}{\mathbf{C}}_{\mathrm{b}}$, where ${\mathbf{C}}_{\mathrm{b}}$ is a second-order auto-regressive correlation matrix (Daley, 1993).
We assume that first-level preconditioning has already been applied (recall Equation 10). In data assimilation, using Ritz-LMP as formulated in Equation (17) is impractical because of the matrix products with $\mathbf{A}$ and we cannot use a simple formulation of Ritz-LMP when the Ritz values and vectors are obtained with randomised methods. Hence, we use spectral-LMP. However, as we mentioned in Section 3.2, the spectral-LMP that is constructed with well-converged Ritz values and vectors acts like Ritz-LMP. When we consider the second inner loop, we compare the spectral-LMPs using information from the randomised methods with the spectral-LMP constructed using information obtained with the Matlab function eigs in the previous inner loop. eigs returns a highly accurate estimate of a few largest or smallest eigenvalues and corresponding eigenvectors. We will use the term randomised LMP to refer to the spectral-LMPs that are constructed with information from the randomised methods, and deterministic LMP to refer to the spectral-LMP that is constructed with information from eigs.
The computations are performed with Matlab R2017b. Linear systems are solved using the Matlab implementation of PCG (function pcg), which we modified to allow split preconditioning, so that the coefficient matrix remains symmetric in every inner loop.
5.1 Advection model
We set ${\sigma}_{\mathrm{o}}=0.05$, ${\sigma}_{q}=0.05$, and ${\sigma}_{\mathrm{b}}=0.1$. ${\mathbf{C}}_{q}$ and ${\mathbf{C}}_{\mathrm{b}}$ have length-scales equal to $10\mathrm{\Delta}z$. Every fourth model variable is observed at every fifth time step, ensuring that there is an observation at the final time step (100 observations in total). Because the model and the observation operator ${\mathscr{H}}_{i}$ are linear, the cost function (Equation 2) is quadratic and its minimiser is found in the first loop of the incremental method.
5.1.1 Eigenvalues of the preconditioned matrix
We apply the randomised LMPs in the first inner loop. Note that if the deterministic LMP is used, it is unclear how to precondition the first inner loop. We explore what effect the randomised LMPs have on the eigenvalues of $\mathbf{A}$. The oversampling parameter l is set to 5 and the randomised LMPs are constructed with $k=25$ vectors.
The Ritz values of $\mathbf{A}$ given by the randomised methods are compared with those computed using eigs (Figure 1a). The Nyström method produces a good approximation of the largest eigenvalues, while REVD gives a slightly worse approximation, except for the largest five eigenvalues. The REVD_ritzit method underestimates the largest eigenvalues significantly. The largest eigenvalues of the preconditioned matrices are smaller than the largest eigenvalue of $\mathbf{A}$ (Figure 1b). However, the smallest eigenvalues of the preconditioned matrices are less than one and hence applying the preconditioner expands the spectrum of $\mathbf{A}$ at the lower boundary (Figure 1c), so that theorem 3.4 of Gratton et al. (2011), which considers the non-expansiveness of the spectrum of the Hessian after preconditioning with an LMP, does not hold. This happens because the formulation of the spectral-LMP is derived assuming that the eigenvalues and eigenvectors are exact, while the randomised methods provide only approximations. Note that, even though REVD_ritzit gives the worst approximations of the largest eigenvalues of the Hessian, using the randomised LMP with information from REVD_ritzit reduces the largest eigenvalues of the preconditioned matrix the most and the smallest eigenvalues are close to one. Using the randomised LMP with estimates from Nyström gives similar results. Hence, the condition number of the preconditioned matrix is lower when the preconditioners are constructed with REVD_ritzit or Nyström compared with REVD.
The values of the quadratic cost function at the first ten iterations of PCG are shown in Figure 1d. Using the randomised LMP that is constructed with information from REVD is detrimental to the PCG convergence compared with using no preconditioning. Using information from the Nyström and REVD_ritzit methods results in similar PCG convergence and low values of the quadratic cost function are reached in fewer iterations than without preconditioning. The PCG convergence may be explained by the favourable distribution of the eigenvalues after preconditioning using Nyström and REVD_ritzit, and the smaller-than-one eigenvalues when using REVD. These results, however, do not necessarily generalise to an operational setting, as this system is well conditioned while operational settings are not. This will be investigated further in the next section.
5.2 Lorenz 96 model
- ${\sigma}_{q}=0.1$ and ${\mathbf{C}}_{q}$ has length-scale ${L}_{q}=2\mathrm{\Delta}X$ (the same as ${\mathbf{C}}_{\mathrm{b}}$);
- ${\sigma}_{q}=0.05$ and ${\mathbf{C}}_{q}$ has length-scale ${L}_{q}=0.25\mathrm{\Delta}X$.
In our numerical experiments, the preconditioners have a very similar effect in both setups. Hence, in the following sections we present results for the case ${\sigma}_{q}=0.1$ and ${L}_{q}=2\mathrm{\Delta}X$ in Figures 2-5, with the exception of Figure 3.
The first outer loop is performed and no second-level preconditioning is used in the first inner loop, where PCG is run for 100 iterations or until the relative residual norm reaches $1{0}^{-6}$. In the following sections, we use randomised and deterministic LMPs in the second inner loop. PCG has the same stopping criteria as in the first inner loop.
5.2.1 Minimising the inner loop cost function
In Figure 2, we compare the performance of the randomised LMPs with the deterministic LMP. We also consider the effect of varying k, the number of vectors used to construct the preconditioner. We set the oversampling parameter to $l=5$. Because results from the randomised methods depend on the random matrix used, we perform 50 experiments with different realisations of the random matrix. We find that the different realisations lead to very similar results (see Figure 2a).
Independently of the k value, there is an advantage in using second-level preconditioning. The reduction in the value of the quadratic cost function is faster using randomised LMPs compared with deterministic LMPs, with REVD_ritzit performing the best after the first few iterations. The more information we use in the preconditioner (i.e., the higher the k value), the sooner REVD_ritzit overtakes the other methods. The performance of the REVD and Nyström methods is similar. Note that, as k increases, the storage (see Table 1) and work per PCG iteration increase. Examination of the Ritz values given by the randomised methods shows that REVD_ritzit gives the worst estimate of the largest eigenvalues, as was the case for the advection model. We calculated the smallest eigenvalue of the preconditioned matrix ${({\mathbf{C}}_{5}^{\text{sp}})}^{\mathrm{T}}\mathbf{A}{\mathbf{C}}_{5}^{\text{sp}}$ using eigs. When ${\mathbf{C}}_{5}^{\text{sp}}$ is constructed using REVD_ritzit or Nyström, the smallest eigenvalue of ${({\mathbf{C}}_{5}^{\text{sp}})}^{\mathrm{T}}\mathbf{A}{\mathbf{C}}_{5}^{\text{sp}}$ is equal to one, whereas using REVD it is approximately 0.94. This may explain why the preconditioner constructed using REVD does not perform as well as the other randomised preconditioners, but it is not entirely clear why the preconditioner that uses REVD_ritzit shows the best performance.
The PCG convergence when using the deterministic LMP and the randomised LMP with information from REVD_ritzit with different k values is compared in Figure 3 for both setups of the model-error covariance matrix. We also show an additional case where the model-error covariance matrix is constructed setting ${\sigma}_{q}={\sigma}_{\mathrm{b}}/100=0.002$ and ${L}_{q}=0.25\mathrm{\Delta}X$. In this case, the performance of the REVD and Nyström methods is very similar, outperforming no preconditioning after the first 10–15 iterations, with better performance for higher k values (results not shown). Moreover, REVD_ritzit again outperforms the deterministic LMP from the first PCG iterations. For the deterministic LMP in Figure 3, varying k has little effect, especially in the initial iterations. However, for REVD_ritzit, in general increasing k results in a greater decrease of the cost function. Setting $k=5$ gives better initial results compared with $k=10$ in the ${\sigma}_{q}=0.002$ case, but the larger k value performs better after that. Also, at any iteration of PCG we obtain a lower value of the quadratic cost function using the randomised LMP with $k=5$ compared with the deterministic LMP with $k=15$, which uses exact eigenpair information from the Hessian of the previous loop.
5.2.2 Effect of the observation network
- every fifth model variable at every fifth time step (480 observations in total);
- every second variable at every second time step (3,000 observations in total).
The oversampling parameter is again set to $l=5$ and we set $k=5$ and $k=15$ for both observation networks. The number of eigenvalues that are larger than one is equal to the number of observations; since there are more observations than in the previous section, more eigenvalues are larger than one after first-level preconditioning. Because all 50 experiments with different Gaussian matrices in the previous section were close to the mean, we perform 10 experiments for each randomised method, solve the systems, and report the means of the quadratic cost function.
The results are presented in Figure 4. Again, the randomised LMPs perform better than the deterministic LMP. However, if the preconditioner is constructed with a small amount of information about the system ($k=5$ for both systems, and $k=15$ for the system with 3,000 observations), then there is little difference in the performance of the different randomised LMPs. Also, when the number of observations is increased, the deterministic LMP needs more PCG iterations before it improves on using no second-level preconditioning.
When comparing the randomised and deterministic LMPs with different values of k for these systems, we obtain similar results to those in Figure 3a, that is, it is more advantageous to use the randomised LMP constructed with $k=5$ than the deterministic LMP constructed with $k=15$.
5.2.3 Effect of oversampling
We next consider the effect of increasing the value of the oversampling parameter l. The observation network is as in Section 5.2.1 (120 observations in total). We set $k=15$ and perform the second inner loop 50 times for every value of $l\in \{5,10,15\}$ with all three randomised methods. The standard deviation of the value of the quadratic cost function at every iteration is presented in Figure 5.
For all the methods, the standard deviation is greatest in the first iterations of PCG. It is reduced when the value of l is increased, and the largest reduction happens in the first iterations. However, REVD_ritzit is the least sensitive to the increase in oversampling. With all values of l, REVD_ritzit has the largest standard deviation in the first few iterations, but it still gives the largest reduction of the quadratic cost function. Hence, large oversampling is not necessary if REVD_ritzit is used.
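To illustrate where the oversampling parameter l enters, the following is a minimal sketch of a basic randomised eigenvalue decomposition in the spirit of REVD (cf. Halko et al., 2011). The function and variable names are illustrative and not taken from the article's implementation; in practice the Hessian would be available only through matrix–vector products rather than as an explicit array.

```python
import numpy as np

def randomised_evd(A, k, l, rng=None):
    """Sketch of a basic randomised EVD: approximate the k largest
    eigenpairs of a symmetric positive definite matrix A using a
    random Gaussian start matrix with k + l columns, where l is the
    oversampling parameter."""
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    Omega = rng.standard_normal((n, k + l))  # random Gaussian start matrix
    Y = A @ Omega                            # sample the range of A
    Q, _ = np.linalg.qr(Y)                   # orthonormal basis for the sample
    T = Q.T @ A @ Q                          # small (k+l) x (k+l) projection
    vals, vecs = np.linalg.eigh(T)           # exact EVD of the projection
    idx = np.argsort(vals)[::-1][:k]         # keep the k largest Ritz values
    return vals[idx], Q @ vecs[:, idx]       # approximate eigenpairs of A
```

Increasing l enlarges the sketch, which typically reduces the variability of the approximation across draws of Omega at the cost of extra matrix–vector products with A; this is consistent with the reduced standard deviations seen in Figure 5.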
6 CONCLUSIONS AND FUTURE WORK
We have proposed a new randomised approach to second-level preconditioning of the incremental weak-constraint 4D-Var forcing formulation. The inner-loop minimisation can be preconditioned with an LMP that is constructed using approximations of eigenpairs of the Hessian. Previously, these approximations were obtained at very low cost in one inner loop, by exploiting the connection between the Lanczos method and CG, and were then used to construct the LMP in the following inner loop. We have considered three methods (REVD, Nyström, and REVD_ritzit) that employ randomisation to compute the approximations. These methods can be used to construct the preconditioner cheaply in the current inner loop, with no dependence on the previous inner loop, and are parallelisable.
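For reference, once approximate eigenpairs are available (from any of the three methods), a spectral LMP can be assembled as in the following sketch, which uses the spectral form $P = I + V({\Lambda}^{-1}-I){V}^{T}$ (cf. Tshimanga et al., 2008). The function name is illustrative; the eigenvectors are assumed orthonormal.

```python
import numpy as np

def spectral_lmp(vals, vecs):
    """Spectral limited-memory preconditioner built from k (approximate)
    eigenpairs of the Hessian: P = I + V (Lambda^{-1} - I) V^T.
    Eigenvalues captured by the columns of V are mapped towards one,
    clustering the spectrum of the preconditioned Hessian."""
    n, k = vecs.shape
    return np.eye(n) + vecs @ (np.diag(1.0 / vals) - np.eye(k)) @ vecs.T
```

With exact eigenpairs, the captured eigenvalues of the preconditioned Hessian are mapped exactly to one; with the randomised approximations they are mapped approximately to one, which is what accelerates PCG.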
Numerical experiments with the linear advection and Lorenz-96 models have shown that the randomised LMPs constructed with approximate eigenpairs improve the convergence of PCG more than deterministic LMPs with information from the previous loop, especially after the initial PCG iterations. The quadratic cost function reduces more rapidly when using a randomised LMP rather than a deterministic LMP, even if the randomised LMP is constructed with fewer vectors than the deterministic LMP. Also, for randomised LMPs, the more information about the system we use (i.e., the more approximations of eigenpairs are used to construct the preconditioner), the greater the reduction in the quadratic cost function, with a possible exception in the first PCG iterations for low k values and very small model error. Using more information to construct a deterministic LMP may not result in larger reduction of the quadratic cost function, especially in the first iterations of PCG, which is in line with the results in Tshimanga et al. (2008). However, if not enough information is included in the randomised LMP, then preconditioning may have no effect on the first few iterations of PCG.
Of the randomised methods considered, the best overall performance was for REVD_ritzit. However, if we run a small number of PCG iterations, the preconditioners obtained with the different randomised methods give similar results. In the case of a very small model error, using REVD and Nyström is useful after the initial iterations of PCG, whereas REVD_ritzit improves the reduction of the quadratic cost function from the start. The performance is largely insensitive to the choice of random Gaussian start matrix, and it may be improved with oversampling.
In this work we apply randomised methods to generate a preconditioner, which is then used to accelerate the solution of the exact inner loop problem (Equation 11) with the PCG method (as discussed in Section 4). A different approach has been explored by Bousserez and Henze (2018) and Bousserez et al. (2020), who presented and tested a randomised solution algorithm called the Randomized Incremental Optimal Technique (RIOT) in data assimilation. RIOT is designed to be used instead of PCG and employs a randomised eigenvalue decomposition of the Hessian (using a different method from the ones presented in this article) to directly construct the solution $\mathbf{x}$ in Equation (11), which approximates the solution given by PCG.
The randomised preconditioning approach can also be employed to minimise other quadratic cost functions, including the strong-constraint 4D-Var formulation. Further exploration of other single-pass versions of randomised methods for eigenvalue decomposition, which are discussed in Halko et al. (2011), may be useful. In particular, the single-pass version of the Nyström method is potentially attractive. If a large number of Ritz vectors is used to construct the preconditioner, the choice of the oversampling parameter l in the randomised methods deserves more attention. In some cases, a better approximation may be obtained if l depends linearly on the target rank of the approximation (Nakatsukasa, 2020).
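To indicate why the single-pass idea is attractive, the following sketch shows a single-pass Nyström approximation in the spirit of the general recipe in Halko et al. (2011): the matrix A is accessed only once, through the product $Y=A\Omega$, after which approximate eigenpairs are recovered from the sketch alone. The names are illustrative and this is not the article's implementation; A is assumed symmetric positive definite, and a small diagonal shift is added purely for numerical stability of the Cholesky factorisation.

```python
import numpy as np

def single_pass_nystrom(A, k, l, rng=None):
    """Single-pass Nystrom sketch for an SPD matrix A. A is touched
    only once, via Y = A @ Omega; the approximation A ~ Y B^{-1} Y^T
    with B = Omega^T A Omega is then converted into approximate
    eigenpairs without revisiting A."""
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    Omega = rng.standard_normal((n, k + l))
    Y = A @ Omega                              # the single pass over A
    B = Omega.T @ Y                            # = Omega^T A Omega (SPD)
    B = 0.5 * (B + B.T)                        # symmetrise for stability
    shift = 1e-12 * np.trace(B)                # tiny jitter for Cholesky
    L = np.linalg.cholesky(B + shift * np.eye(k + l))
    F = np.linalg.solve(L, Y.T).T              # A ~ F F^T
    U, s, _ = np.linalg.svd(F, full_matrices=False)
    return (s ** 2)[:k], U[:, :k]              # approximate eigenpairs
```

By contrast, the two-pass methods above need a second set of products with A (e.g., to form the projected matrix), so a single-pass variant halves the dominant cost when Hessian–vector products are expensive, at the price of reduced numerical robustness.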
AUTHOR CONTRIBUTIONS
Ieva Daužickaitė: conceptualization; formal analysis; investigation; methodology; software; validation; visualization; writing – original draft; writing – review and editing. Amos S. Lawless: investigation; methodology; supervision; validation; writing – review and editing. Jennifer A. Scott: funding acquisition; investigation; methodology; project administration; resources; supervision; validation; writing – review and editing. Peter Jan van Leeuwen: investigation; methodology; supervision; validation; writing – review and editing.
ACKNOWLEDGEMENTS
We are grateful to Dr Adam El-Said for his code for the weak-constraint 4D-Var assimilation system. We also thank two anonymous reviewers, whose comments helped us to improve the article.
CONFLICT OF INTEREST
The authors declare no conflict of interest.