Multigrid preconditioners for the mixed finite element dynamical core of the LFRic atmospheric model

Due to the wide separation of time‐scales in geophysical fluid dynamics, semi‐implicit time integrators are commonly used in operational atmospheric forecast models. They guarantee the stable treatment of fast (acoustic and gravity) waves, while not suffering from severe restrictions on the time‐step size. To propagate the state of the atmosphere forward in time, a nonlinear equation for the prognostic variables has to be solved at every time step. Since the nonlinearity is typically weak, this is done with a small number of Newton or Picard iterations, which in turn require the efficient solution of a large system of linear equations with 𝒪(106−109) unknowns. This linear solve is often the computationally most costly part of the model. In this article an efficient linear solver for the LFRic next‐generation model currently being developed by the Met Office is described. The model uses an advanced mimetic finite element discretisation which makes the construction of efficient solvers challenging as compared to models using standard finite‐difference and finite‐volume methods. The linear solver hinges on a bespoke multigrid preconditioner of the Schur‐complement system for the pressure correction. By comparing it to Krylov subspace methods, the superior performance and robustness of the multigrid algorithm is demonstrated for standard test cases and realistic model set‐ups. In production mode, the model will have to run in parallel on hundreds of thousands of processing elements. As confirmed by numerical experiments, one particular advantage of the multigrid solver is its excellent parallel scalability due to its avoidance of expensive global reduction operations.


Introduction
Operational models for numerical climate-and weatherprediction have to solve the equations of fluid dynamics in a very short time.State-of-the art implementations rely on accurate spatial discretisations and efficient timestepping algorithms.To make efficient use of modern supercomputers they have to exploit many levels of parallelism (such as SIMD/SIMT, threading on a single node and message passing on distributed memory systems) and scale to 100,000s of processing elements.Semi-implicit time integrators are commonly employed since they allow the stable treatment of fast acoustic waves, which carry very little energy but have to be included in a fully compressible formulation.The implicit treatment of the acoustic modes allows the model to run with a relatively long timestep.The size of the timestep is only restricted by advection which horizontally is generally around one order of magnitude slower than acoustic oscillations and vertically two orders of magnitude slower.The main computational cost of semi-implicit models is the repeated solution of a very large sparse system of linear equations.While standard iterative solution algorithms exist, the linear system is ill-conditioned, which leads to the very slow convergence of Krylov-subspace methods.This is a particularly serious problem for massively parallel implementations due to their very large number of global reduction operations (arising from vector dot-products and norms in each Krylov-iteration).Preconditioners, which solve an approximate version of the linear system, overcome this issue and dramatically reduce the number of Krylov iterations and global communications.The construction of an efficient preconditioner is non-trivial and requires careful exploitation of the specific properties of the system which is to be solved.For global atmospheric models, two key features which have to be taken into account are (1) the high aspect ratio arising from the shallow domain and (2) the finite speed of sound in compressible formulations, which limits the effective distance over which different points in the domain are correlated during one timestep.Typically the preconditioner is based on a Schur-complement approach.This reduces the problem to an elliptic equation for the pressure correction, which can then be solved with standard methods.
Multigrid.Hierarchical methods such as multigrid algorithms [1] are often employed for the solution of elliptic systems since they have a computational complexity which grows linearly with the number of unknowns.[2] contains a recent review of linear solvers in atmospheric modelling (see also [3]).There the performance of a multigrid solver based on the tensor-product algorithm described in [4] was applied to a simplified model system which is representative for the linear system for the pressure correction.The key idea is to use a vertical line-relaxation smoother together with semi-coarsening in the horizontal direction only.Furthermore, due to the finite speed of sound in compressible models, it is sufficient to use a relatively small number of multigrid levels of L ≈ log 2 CFL h where CFL h = c s ∆t/∆x is the horizontal acoustic Courant number.The much higher vertical acoustic Courant number CFL v = c s ∆t/∆z does not cause any problems as vertical sound propagation is treated exactly by the line-relaxation smoother.Since advective transport is about an order of magnitude slower than acoustic pressure oscillations, CFL h ≈ 10 and L ≈ 4 irrespective of the model resolution.As was demonstrated in [2,5], this "shallow" multigrid works well, and avoids expensive global communications.It also significantly simplifies the parallel decomposition, since it is only necessary to (horizontally) partition the coarsest grid, which still has a large number of cells and allows a relative fine-grained domain decomposition.For example, one coarse grid cell could be assigned to a node on a supercomputer, exploiting additional shared-memory parallelism on the cells of the 4 L−1 ≈ 4 3 = 64 fine grid cells.
The tensor-product multigrid algorithm was applied to more realistic model equations in [6] and its performance on a cluster with 16,384 GPUs was demonstrated in [7].
Solvers for finite element discretisations.One challenge of standard latitude-longitude models, which is becoming more severe with increasing model resolution, is the convergence of grid-lines at the poles.Due to the resulting small grid cells at high latitudes this leads to severe timestep constrictions, slow solver convergence and poor parallel scalability due to global coupling at the poles.To overcome this problem, there has been a push towards using different meshes which avoid this issue (see review in [8]).However, to ensure the accurate discretisation of the continuous equations and the exact conservation of certain physical quantities on those nonorthogonal grid requires advanced discretisations.While low-order finite-volume methods [9,10,11] and highorder collocated spectral element methods [12,13] exist, the mimetic finite-element approach developed in [14,15,16] for the shallow water equations is particularly attractive since it generalises to arbitrary discretisation order, has good wave dispersion properties, avoids spurious computational modes and allows the approximate conservation of certain physical quantities in the discrete equations.At lowest order on orthogonal meshes it reduces to the extensively studied C-grid staggering in the horizontal direction.
This paper builds on the work in [17] which describes the recently developed Gungho dynamical core employed in the LFRic Model.The mimetic finite-element discretisation used there is combined with a vertical discretisation [18,19], which is similar to Charney-Phillips staggering, and a mass-conserving finite-volume advection scheme.
A particular challenge of mimetic finite-element discretisations is the significantly more complex structure of the discretised linear equation system, which has to be solved repeatedly at every timestep.For traditional finite-difference and finite-volume discretisations on structured grids the Schur-complement can be formed, and -provided the resulting pressure equation is solved to high-enough accuracy -the preconditioner is exact.However, this is not possible for the finite element discretisations considered here since the velocity mass-matrix is not (block-) diagonal.Instead, the linear system is preconditioned by constructing an approximate Schur-complement using velocity mass-lumping.As demonstrated for a gravity-wave system in [20], this method is efficient if one V-cycle of the same bespoke tensor-product multigrid algorithm is used to solve the pressure system.As shown there, the method also works for next-to-lowest-order discretisations if a p-refinement is used on the finest level of the multigrid hierarchy.
In this paper it is shown how the method can be extended to solve the full equations of motion, the Euler equations for a perfect gas in a rotating frame.The efficiency of the multigrid algorithm is demonstrated by alternatively solving the pressure correction equation with a Krylov-subspace method.As will be shown by running on 100,000s of processing cores and solving problems with more than 1 billion (10 9 ) unknowns, multigrid also improves the parallel scalability since -in contrast to the Krylov-method -the multigrid V-cycle does not require any global reductions.Implementation.To achieve optimal performance, an efficient implementation is required.In a continuously diversifying hardware landscape the code has to be performance portable.In general, the LFRic model uses an implementation which is based on the separation-ofconcerns approach described in [21].The composability of iterative methods and preconditioners is exploited to easily swap components of the complex hierarchical solver in an object oriented Fortran 2003 framework (see Section 6 in [21]).
Structure.This paper is organised as follows.After putting the research in context by reviewing related work in Section 2, the mixed finite element discretisation is described and the construction of a Schur-complement preconditioner for the linear system is discussed in Section 3. The properties of the elliptic pressure operator are used to construct a bespoke multigrid preconditioner.After outlining the parallel implementation in the LFRic framework in Section 4, numerical results for performance and parallel scalability are presented in Section 5. Conclusions are drawn and future work discussed in Section 6.

Context and related work
Semi-implicit timestepping methods.One of the perceived drawbacks of semi-implicit models is the additional complexity required for solving a large nonlinear problem.Iterative solvers introduce global communications, which potentially limits scalability and performance; this can become a serious issue for operational forecast systems which run on large supercomputers and have to deliver results on very tight timescales.Nevertheless, the comprehensive review of linear solvers techniques for atmospheric applications in [2] shows that semi-implicit models deserve serious consideration.Looking at actively developed dynamical cores which target massively parallel supercomputers, of the 11 nonhydrostatic implementations compared in the recent Dynamical Core Model Development Project (DCMIP16) [22], three are semi-implicit: the current spectral model used by ECMWF in their Integrated Forecasting Suite (IFS) [23,24], FVM, the next-generation finite volume version of IFS [25] and the Canadian GEM finite difference code [26].To solve the linear equation system, those models use different approaches.IFS employs a global spectral transform, which diagonalises the operator in Fourier spaces.Although this approach inherently involves expensive all-to-all communications, scalability can be improved by exploiting properties of the spherical harmonics [27].The solution of the pressure system in IFS-FVM with a Generalised Conjugate Residual (GCR) method is described in [28]; the preconditioner exactly inverts the vertical part of the operator, similar to the line relaxation strategy which is used in the smoother for the multigrid algorithm in this work.To solve the three dimensional elliptic boundary value problem in the GEM model, a vertical transform is used to reduce it to a set of decoupled two-dimensional problems, which are solved iteratively [29].
All of the above semi-implicit models use lowest order finite difference/volume discretisations or are fully spectral.In contrast, actively developed massively parallel high order spectral element codes include NUMA [30], which uses an implicit-explicit (IMEX) time integrator, the CAM-SE/HOMME dynamical core [31] used in the ACME climate model and Tempest [32,33].Collocating the quadrature points with nodal points in the Continuous Galerkin (CG) formulation of NUMA results in a diagonal velocity mass matrix, which allows the construction of a Schur-complement pressure system.This system is then solved with an iterative method.This is in contrast to mixed finite element approach employed here, for which the velocity mass matrix is non-diagonal.To address this issue, the outer system is solved iteratively and preconditioned with an approximate Schurcomplement based on a lumped mass matrix.It should be noted, however, that the construction of an efficient linear solver for the Discontinuous Galerkin (DG) version of NUMA is significantly more challenging since the numerical flux augments the velocity mass matrix by artificial diffusion terms; overcoming this problem is a topic of current research and hybridisable DG methods appear to be particularly suitable [34,35].As discussed below, applying a similar hybridised approach to the mixed finite element formulation is a promising direction for future work.While the fully implicit version of NUMA has been optimised on modern chip architectures [36], the massively parallel scaling tests in [37] are reported for the horizontally explicit vertically implicit (HEVI) variant of the model, in which only the vertical couplings are treated implicitly.The same HEVI time integrator can also be used by the Tempest dynamical core.Again this vertically implicit solver is equivalent to the block-Jacobi smoother in the fully implicit multigrid algorithm and the preconditioner in [25].
The discretisation used by the semi-implicit GUSTO code developed at Imperial College is based on [18,38,39] and is very similar to the one used in this work.In contrast to LFRic, which is developed for operational use, GUSTO is a research model implemented in the Firedrake Python code generation framework [40].It uses the iterative solvers and preconditioners from the PETSc library [41] to solve the linear system.By default, the elliptic pressure operator is inverted with a black-box algebraic multigrid (AMG) algorithm.While in [20] AMG has been shown to give comparable performance to the bespoke geometric multigrid preconditioners developed here, using off-the-shelf AMG libraries in the LFRic code is not feasible due to their incompatible parallelisation strategy.It would also introduce undesirable software dependencies for a key component of the model.

Parallel multigrid and atmospheric models.
Multigrid algorithms allow the solution of ill-conditioned elliptic PDEs in a time which is proportional to the number of unknowns in the system.Due to this algorithmically optimal performance, they are often the method of choice for large scale applications in geophysical modelling.The hypre library [42] contains massively parallel multigrid implementations, including BoomerAMG, and has been shown to scale to 100,000s of cores [43].Similarly, the scalability of the AMG solver in the DUNE library [44] has been demonstrated in [45] and [46] describes another highly parallel AMG implementation.In [47] massively parallel multigrid methods based on hybrid hierarchical grids are used to solve problems with 10 12 unknowns on more than 200,000 compute cores.
While those results clearly show the significant potential of parallel multigrid algorithms, it is evident from the review in [2] that they are rarely used in semi-implicit atmospheric models.An exception is the recent implementation of the MPAS model.In [5] it is shown that a semiimplicit methods with a multigrid solver can be competitive with fully explicit time integrators.A conditional semi-coarsening multigrid for the ENDGame dynamical core [48] used by the Met Office is described in [49] and currently implemented in the Unified Model code.An-other recent application of multigrid in a non-hydrostatic model is given in [50].Two-dimensional multigrid solvers for the Poisson-equation on different spherical grids relevant for atmospheric modelling are compared in [51].More importantly, the work in [52] showed that domaindecomposition based multigrid algorithms can be used to solve the Euler equations with 0.77 • 10 12 unknowns.This paper received the 2016 Gordon Prize for it showed that the code scales to 10 million cores and achieves 7.95 PetaFLOP performance on the TaihuLight supercomputer.Note, however, that none of the models described in this section is based on advanced finite element discretisations which are used in this work.

Methods
In the following, the mimetic finite element discretisation of the Euler equations in the LFRic dynamical core is reviewed.By exploiting the structure of the pressure correction equation in the approximate Schur-complement solver, an efficient tensor-product multigrid algorithm is constructed.

Continuous equations
The dynamical core of the model solves the Euler equations for a perfect gas in a rotating frame At every point in time the state of the atmosphere x = (u, ρ, θ, Π) is described by the three dimensional fields for velocity u, density ρ, potential temperature θ and (Exner-) pressure Π.In Eq. (1) Φ is the geopotential such that ∇Φ = −g where the g denotes the gravitational acceleration and the Coriolis vector is denoted by Ω. R is the gas constant per unit mass and κ = (c p − R)/R, where c p is the specific heat at constant pressure; p 0 is a reference pressure.The equations are solved in a domain D which either describes the global atmosphere or a local area model; for further details on the relevant boundary conditions see [17].

Finite element discretisation
To discretise Eq. ( 1) in space, the mimetic finite element discretisation from [14,18] is used.For this, four principal function spaces W i , i = 0, 1, 2, 3, of varying degrees of continuity are constructed.Those function spaces are related by the de Rham complex [53] W 0 Pressure and density are naturally discretised in the entirely discontinuous space W 3 , while the space W 2 , which describes vector-valued fields with a continuous normal component, is used for velocity.At order p on hexahedral elements the space W 2 is the Raviart-Thomas space RT p and W 3 is the scalar Discontinuous Galerkin space Q DG p .As will be important later on, note that the space 2 can be written as the direct sum of a component W z 2 which only contains vectors pointing in the vertical direction and the space W h 2 such that the elements of W h 2 are purely horizontal vector fields.In the absence of orography those two spaces are orthogonal in the sense that Note that W z 2 is continuous in the vertical direction and discontinuous in the tangential direction, whereas W h 2 is continuous in the horizontal direction only.To discretise the potential temperature field, an additional space W θ is introduced.W θ is the scalar-valued equivalent of W z 2 and has the same continuity.The lowest order (p = 0) function spaces are shown in Figure 1.Choosing suitable basis functions v j (χ) ∈ W 2 , σ j (χ) ∈ W 3 and w j (χ) ∈ W θ which depend on the spatial coordinate χ, the discretised fields at the n-th model timestep t can be written as follows: For each quantity a the corresponding vector of unknowns is written as a = [a 1 , a 2 , . . .].

Linear system
The time discretisation described in [17] is semi-implicit and uses a finite-volume transport scheme.This requires the solution of a non-linear equation to obtain the state-vector x n+1 = ( u n+1 , ρ n+1 , θ n+1 , Π n+1 ) at the next timestep.The non-linear system can be written compactly as Eq. ( 3) is solved iteratively with a quasi-Newton method.
For this a sequence of states (4) The linear operator L( x * ), which needs to be inverted in every Newton step, is an approximation to the Jacobian of R. Following [48], it is obtained by linearisation around a reference state x * = (0, ρ * , θ * , Π * ), which is updated at every timestep.Introducing x = ( u , ρ , θ , Π ) and following [17], the linear system in Eq. ( 4) can be written down in matrix form as (5) The exact form of the individual operators in Eq. ( 5) is given in [17] and the matrix D ρ * is defined as where the mass flux f * is defined as the product of the reference density ρ * sampled at velocity nodal points pointwise multiplied by the velocity field u.Note, that the expression in [17] for R Π , their (81), is incorrect and it should be such that both the linearised left hand side and nonlinear right hand side are non-dimensionalised.To interpret the different operators, it is instructive to also write down the continuum equivalent of the equations for the state (u , ρ , θ , Π ) (7) where τ u,ρ,θ = 1 2 are off-centering parameters.ρ * , θ * and Π * are the continuous reference profiles around which the equation is linearised.In contrast to [48], the horizontal couplings are not neglected in P θ * θ2 , and will only be dropped in the approximate Schur-complement constructed in Section 3.4.The block-diagonal entries in the 4 × 4 matrix in Eq. ( 5) are modified mass matrices of the W 2 , W 3 and W θ spaces, possibly weighted by reference profiles (M Π * 3 , similar to the off-diagonal M ρ * 3 ).The term in the upper left corner of the matrix in Eq. ( 5) is the velocity mass matrix augmented by contributions from Rayleigh damping and the implicit treatment of the Coriolis term, where and M ρ * 3 do not contain couplings to unknowns in neighbouring cells, and M θ only couples between unknowns in the same vertical column, M µ,C 2 contains couplings in all directions.This prevents the exact solution of Eq. ( 5) with a Schur-complement approach as in [48] since the inverse of the M µ,C 2 is dense.Instead, the system in Eq. ( 5) is solved with an iterative Krylov subspace solver, which only requires application of the sparse operator L( x * ) itself.The solver is preconditioned with the approximate Schur complement described in the following section.

Schur complement preconditioner
To obtain an approximate solution of the linear system in Eq. ( 5), first all instances of the mass matrix M θ are replaced by a lumped, diagonal version Mθ , such that the diagonal entries of of Mθ are the row-sums of M θ .As in [48] only the part of P θ * θ2 which acts on the vertical part of the velocity field is kept.The resulting operator P θ * ,z θ2 maps from the subspace W z 2 ⊂ W 2 to W θ .Algebraically, the following steps correspond to multiplication by the upper block-triangular matrix (Step 1), solution of the block-diagonal matrix (Step 2) and backsubstitution through multiplication by the lower blocktriangular matrix (Step 3) in the Schur-complement approach [54].
Step 1a: Use to eliminate θ.In the resulting system 3 × 3 system, replace the matrix M µ,C 2 2 .This leads to a system for u , ρ and Π only: Note that -in contrast to Eq. ( 5) -the (block-) diagonal entries of the 3 × 3 system in Eq. ( 10) have sparse inverses, and it is possible to form the exact Schur complement.
Step 1b: Similarly, eliminate density from Eq. ( 10) using to obtain a 2 × 2 system for u , Π .Finally, eliminate velocity with to get an equation for the pressure increment only, The Helmholtz operator H : W 3 → W 3 is defined as Step 2: Approximately solve the Helmholtz equation Eq. ( 14) for Π .For this one multigrid V-cycle as described in Section 3.6 is used.

Structure of the Helmholtz operator
Understanding the structure of the Helmholtz operator H is crucial for the construction of a robust multigrid algorithm for the approximate solution of Eq. ( 14).The tensor-product multigrid method which will be used here was first described for simpler equations and discretisations in [4] and applied to mixed finite element problems in atmospheric modelling in [20].First consider the sparsity pattern of the Helmholtz operator H.In each cell of the grid, it contains couplings to its four direct horizontal neighbours.In addition, it couples to the two cells immediately above and the two cells immediately below.Including the self-coupling, this results in a 9-cell stencil, independent of the order of discretisation p (see Figure 2).Secondly, it is important to take into account how the components of H depend on the timestep size ∆t, the horizontal grid spacing ∆x and the vertical grid spacing ∆z.For this, first note that the weighted weak derivative D ρ * can be decomposed into a vertical and horizontal part Using this decomposition, the Helmholtz operator can be written as the sum of four terms In order to interpret the different parts of H it is constructive to derive the corresponding Schur-complement operator for the continuous linear system which is given in Eq. (7).To be consistent with the substitution above, the third equation of Eq. ( 7) is replaced by θ + τ θ ∆tu z ∂ z θ * = r θ .The resulting pressure operator is where ∇ h is the horizontal derivative and µ = 1 + τ u τ θ ∆t 2 N 2 with the Brunt-Väisälä frequency N 2 = g(∂ z θ * )/θ * .Up to scaling by the local volume of the grid cell it is therefore possible to identify With the definitions of the linear operators in [17], it is easy to see that the different parts of H depend on the grid spacing and timestep size as Further observe that with T * = Π * θ * and the ratio of the specific heat capacities κ/(1 − κ) = c p /c v the (squared) speed of sound is given by c Assuming that the reference profiles are slowly varying, this implies that the ratios d z 1 : h 0 , d z 2 : h 0 and d h 2 : h 0 scale as Combining this with Eq. ( 18) results in the estimates where CFL v and CFL h are the vertical and horizontal acoustic Courant numbers.Note also that the relative size of D z 2 and D h 2 is given by the squared aspect ratio (∆x/∆z) 2 and the relative size of D z 1 and D z 2 decreases ∝ ∆z as the vertical grid spacing goes to zero.
To proceed further, the Helmholtz operator is split into two parts, H = H z + δH such that H z contains the couplings to neighbouring cells in the vertical direction only.If the degrees of freedom are ordered consecutively in the vertical direction, H z is a block-diagonal matrix.Each block describes the couplings in one vertical column; furthermore, solution of the system H z Π = r Π for some right hand side r Π requires the independent solution of a block-pentadiagonal system in each column.Following the scaling arguments above and observing that the operators D z 1 and D z 2 only contribute to H z , it can be seen that for high aspect ratios ∆z ∆x the dominant part of the operator H is given by H z .This observation is crucial for the following construction of a robust tensor-product multigrid algorithm.

Multigrid
Starting from some initial guess, an approximate solution of Eq. ( 14) can be obtained with a block-Jacobiiteration.To avoid the expensive block-pentdiagonal solve, the next-to-nearest neighbour couplings in H z are dropped to obtain a block-tridiagonal H z , see Figure 2.With this matrix, one iteration of the block-Jacobi method is where ω is an over-relaxation factor.The shorthand Π → BlockJacobi(H, B, Π , ω, n Jac ) is used for n Jac applications of the block-Jacobi iteration in Eq. (20).Multiplication by H −1 z in Eq. ( 20) corresponds to the solution of block-tridiagonal linear system, which can be carried out independently in each vertical column.The tridiagonal solve can be done, for example, with the Thomas algorithm [55].When applying H to Π to calculate the residual B − H Π in Eq. ( 20), the vertical terms D z 2 and D z 1 in Eq. ( 16) are treated exactly.
It is well known that stationary methods such as the Jacobi iteration converge extremely slowly since they only reduce the high-frequency error components.This issue is overcome by multigrid methods [1], which construct a hierarchy of grids with associated (nested) finite element function spaces, in particular 3 .Following the tensor-product approach in [4], the grid is only coarsened in the horizontal direction.By applying a small number of smoother iterations on each level, the error is reduced at all length scales.In the following the index ∈ {1, . . ., L} is used to label the multigrid level, with L = 1 corresponding to the fine grid level on which the solution is to be found.Let { Π ( ) } and {B ( ) } be the set of solution vectors and right hand sides on all levels, with Π (1) = Π and B (1) = B. Since the function spaces are nested, the obvious prolongation P : from a coarse space to the next-finer multigrid level is the natural injection: for all points χ ∈ D.
Eq. ( 21) naturally induces a restriction R : W on the corresponding dual spaces (denoted by an asterisk * ): R : for all functions Π ( +1) ∈ W ( +1) 3 . The corresponding linear operator acting on the vector R ( ) representing the dual 1-form r ( ) is written as Restrict R ( ) .The Helmholtz-operators on the coarse levels are constructed by representing the reference profiles on those levels and re-discretising the operator.This is more efficient than assembling it via the expensive Galerkin triple matrixproduct.
Based on those ingredients, it is now possible to write down the recursive multigrid V-cycle in Algorithm 1. Starting from some initial guess Π = Π (1) and right hand side B (1) = B on the finest level, this reduces the error by recursively solving the residual equation on the next-coarsest level.The natural way of (approximately) solving the equation on the coarsest level = L is by inverting the matrix directly or applying some Krylovsubspace method.In a parallel implementation this requires expensive global communications.As explained in [5,2] this is not necessary in our case.To see this, observe that the relative size of the zero order term H 0 and the second derivative in the horizontal direction D h 2 is proportional to the squared, inverse grid spacing ∆x.Since the grid spacing doubles on each subsequent level, the relative size of the two terms in the Helmholtz operator reduces to ν 2 C,h 4 −( −1) .As the vertical terms are treated exactly in the block-Jacobi smoother, the condition number of the Helmholtz operator will be O(1) on levels log 2 ν C,h + 1.Hence, it is sufficient to pick L = log 2 ν C,h + 1 and simply apply a few iterations of the block-Jacobi smoother.For typical atmospheric applications, ν C,h ≈ 10, and hence it is sufficient to work with L ≈ 4 levels.As demonstrated in [2] this shallow multigrid approach also greatly reduces global communications.

Computational complexity
Although the multigrid method requires additional calculations on the coarse levels, its computational complexity is proportional to the number of unknowns.The time spent in one multigrid V-cycle in Alg. 1 is dominated by two contributions: the multiplication with the matrix H ( ) and the vertical solve, i.e. the application of ( H ( ) z ) −1 .Those operations are required in the residual calculation and the block-Jacobi algorithm, which is used for both for pre-/post-smoothing and for the approximate solution of the coarse level system with n coarse block-Jacobi iterations.Let C H and C Hz be the cost per unknown for those two operations.For a pressure system with N unknown the computational cost per multigrid V-cycle is where the approximation in the last line is valid for 4 −L+1 1.In contrast, solving the Helmholtz pressure system with n iter iterations of a BiCGStab method, preconditioned by While the multigrid V-cycle contains more nearestneighbour parallel communication in the form of halo exchanges, those can easily be overlapped with computations via asynchronous MPI calls.In contrast, the BiCGStab solver requires 4 global sums per iteration (including one for monitoring convergence in the residual norm).While communication avoiding variants of Krylov solvers exist (see the overview in [56] and recent work on BiCGStab [57]), this will not overcome the fundamental issue.Several BiCGStab iterations with global communications are required to achieve the same reduction in the pressure residual as in a multigrid V-cycle.As the numerical results in Section 5.2 demonstrate, this reduces the scalability of Krylov-subspace solvers for the pressure correction equation.

Implementation
As described in [21], the LFRic code is designed around a separation-of-concerns philosophy originally introduced in this context in [58].It provides well-defined abstractions for isolating high-level science code from computational issues related to low-level optimisation and parallelisation with the aim of achieving performanceportability on different parallel hardware platforms.The model developer (atmospheric scientist or numerical algorithm specialist) writes two kinds of code: • Local kernels, which describe the operations which are executed in one vertical column of the mesh.
• High level algorithms which orchestrate the kernel calls.
The PSYClone code generation system [59] automatically generates optimised wrapper subroutines for the parallel execution of the kernels over the grid.PSYClone can generate code for execution on distributed memory machines via MPI and shared memory parallelisation with OpenMP, as well as mixed-mode parallelisation; it also supports threaded implementations on GPUs.Depending on the data dependencies, which are specified by attaching access descriptors to the kernels, appropriate MPI calls (e.g. for halo-exchanges) are automatically inserted into the generated code.
As is common in atmospheric modelling codes and consistent with the tensor-product multigrid algorithm described in the paper, the grid is only partitioned in the horizontal direction, and the elementary kernels operate on contiguous data stored in individual vertical columns.By using a structured memory layout in the vertical direction, any costs of indirect addressing in the horizontal direction from logically unstructured grids can be hidden; this was already observed in [60] and confirmed for tensor-multigrid solvers in [6].
To implement the solvers described in this paper, the user will have to write the iterative solver algorithm and kernels for applying the appropriate finite element matrices or carrying out the block-tridiagonal solves in each column.

Solver API
For the complex solvers and preconditioners described in this paper the parameter space is very large: different Krylov solvers for the outer mixed system in Eq. ( 4) and for the Helmholtz pressure equation in Eq. ( 14) will lead to varying overall runtimes.The performance of the multigrid preconditioner depends on the number of levels, value over-relaxation parameter, the number of pre-and post-smoothing steps and the choice of the coarse level solver.To explore different configurations and allow the easy switching of components, an object oriented framework for iterative solvers has been implemented in LFRic [21].Similar to the DUNE Iterative Solver Template Library [44] and PETSc [41], this avoids re-implementation of the iterative solver and aids reproducibility.Based on this library of iterative solvers, the user has to provide problem specific linear operators and preconditioner objects.
For this, three abstract data types are defined in the code: • A vector type which supports linear algebra operations such as axpy updates (y → y + αx) and dotproducts (β = x, y = j x j y j ) • A linear operator type which acts on vectors x and implements the operation y → Ax • A preconditioner type which implements the operation x → P y, where x approximately solves the equation Ax = y This allows the implementation of different Krylov subspace solvers, which are parametrised over the linear operator A and corresponding preconditioner P , both of which are derived from their respective abstract base types.So far the following solvers of the general form K(A,P ,[ ]) (where is a tolerance on the target residual reduction) are available in LFRic: • Conjugate Gradient CG(A,P , ) • Generalised minimal residual GMRES(A,P , ) • Stabilised Biconjugate Gradient BiCGStab(A,P , ) • Generalised Conjugate Residual GCR(A,P , ) • A null-solver (preconditioner only) PreOnly(A,P ) All solvers operate on instances of a concrete field vector type, which is derived from the abstract base vector type and contains a collection of dof-vectors.More specifically, to implement the linear solver with Schur-complement preconditioner for the linear system described in Section 3.3, an operator A mixed which represents the matrix in Eq. ( 5) and acts on a field vector for the state x = ( u , ρ , θ , Π ) was created.The corresponding Schur complement preconditioner P mixed acts on field vectors of the same form and, as discussed in Section 3.4, contains a call to a Krylov-subspace method K H (A H , P H ) for solving the Helmholtz problem in Eq. ( 14).Here the operator A H represents the Helmholtz operator in Eq. ( 15) and P H is a preconditioner; both A H and P H act on a single-component field vector objects of the form x H = ( Π ).Both the multigrid preconditioner P (MG) H (L, ω, n pre , n post ) described in Section 3.6 and a single-level method P (Jac) H (ω, n Jac ), which corresponds to n Jac applications of the block-Jacobi iteration in Eq. ( 20), were implemented.
Hence the general nested solver can be written as

Results
To identify the most promising preconditioner, first the performance of different solvers for the pressure correction in Eq. ( 14) is explored on a relatively small number of compute cores in Section 5.

Algorithmic performance and pressure solver comparison
In all cases a GCR solver with a tolerance of = 10 −6 is used to solve the mixed system in Eq. ( 4); sometimes this will also be referred to as the "outer solve" below.Although other methods are available for this option, an investigation of the mixed solver is not the focus of this paper and so only this method, as used in [17], is considered.To test the algorithmic performance of the solver the model is run on the baroclinic wave test case of [61], which models the development of mid-latitude atmospheric wave dynamics.Apart from the semi-implicit solver, the model setup is the same as described in [17] with the following exceptions: 1. To improve long timestep stability the continuity equation is handled in an (iterated) implicit manner, instead of explicit as in [17], that is their (22) becomes with 2. To improve the accuracy of the advection operator over non-uniform meshes a two-dimensional horizontal polynomial reconstruction of the potential temperature field is used instead of the one-dimensional reconstruction of [17].This reconstruction follows the method of [10], except here the polynomial is always evaluated at fixed points in space instead of at Gauss points of the swept area as in [10].
The model is run on a C192 mesh (6 × 192 × 192 cells) with ≈ 50km horizontal resolution and 30 levels in the vertical following a quadratic stretching such that the smallest vertical grid spacing is ≈ 200m.This results in 39.3 • 10 6 total degrees of freedom, with 6.6 • 10 6 pressure unknowns.The timestep is ∆t = 1200s, which results in a horizontal wave CFL number of CFL h = c s ∆t/∆x ≈ 7.9 with c s = 340 m/s.The vertical CFL number is CFL v = c s ∆t/∆x ≈ 1800 (near the surface).
Different methods are used to solve the pressure correction equation in Eq. ( 14): Following the notation in Eq. ( 22), the solver configurations are summarised in Table 1.The Krylov-solver with H = 10 −6 corresponds to the solver setup used in [17].Two pre-and post-smoothing steps with an overrelaxation parameter ω = 0.8 are used in the multigrid algorithm; the number of smoothing steps on the coarsest level is n coarse = 4.The test is run for 8 days of simulation time (768 timesteps) on 64 nodes of the CrayXC40 with 6 MPI ranks per node and 6 OpenMP threads per rank.
Table 2 lists the average number of outer, mixed solver iterations per semi-implicit solve and the average number of iterations per pressure solve for each of the setups described above and summarised in Table 1.Note that in each timestep the mixed solver is called four times to solve a non-linear problem and Table 2 shows the average times for a single linear solve; those times are visualised in Figure 3.For completeness, results for the 1-level multigrid method are also reported.Although in essence this simply corresponds to four applications of the line smoother, since n pre + n post = 4 this guarantees that the same number of fine-level smoother iterations is used for all multigrid results.The multigrid methods (as long as more than one level is used) result in a significant reduction in the time taken for each linear solve and, compared to the Krylov methods, require roughly the same number of outer iterations.In particular, solving the pressure correction equation to a relatively tight tolerance of H = 10 −6 does not reduce the number of outer GCR iterations.This implies that the main error in the approximate Schur complement solve is due to mass lumping, and not due an inexact solution of the pressure equation.Overall, a single multigrid V-cycle with L = 3 levels gives the best performance.Increasing the number of multigrid levels further does not provide any advantage since the problem is already well-conditioned Krylov( H = 10 after two coarsening steps, and adding further coarse levels will make the method slightly more expensive.Although standalone line relaxation provides a cheap pressure solver (without any global sums) this is offset by a significant increase in the number of outer iterations such that the overall cost is not competitive with the multigrid method.In fact (looking at the final column of Table 2), 10 iterations of the block-Jacobi solver are slightly more expensive than the four-level multigrid V-cycle with two pre-/post-smoothing steps; this is not too far off the theoretical cost estimate in Eq. (3.7).As the results for the Krylov-MG show, there is no advantage in wrapping the multigrid V-cycle in a Krylov solver.Results (not shown) using the alternative iterative methods listed in Section 4.1 for the pressure solver with H = 10 −2 show similar performance to the BiCGstab solver presented in the pressure solver.Since the linear problem is not symmetric, the Conjugate Gradient method does not converge.

Massively parallel scalability
To examine parallel scalability, two representative pressure solvers are chosen, namely BiCGStab with a prescribed tolerance of H = 10 −2 (denoted "Kr2" hereafter) and a standalone multigrid V-cycle with 3-levels (denoted "MG").For comparison, BiCGstab with a prescribed tolerance of H = 10  [62], an analysis of solver algorithms is combined with a performance model for the computation and communication costs.According to this analysis, strong scaling is expected to break down when the local number of degrees is smaller than O(10 4 −10 5 ).It is therefore expected that the largest node count considered in the present paper is well in the strong scaling limit.Table 3 summarises the resulting local problem sizes for all node counts.The average time spent in the mixed-and pressuresolver are shown in Table 4.Note that those times are now reported for an entire timestep, i.e. aggregated over four linear solves.This will allow a quantitative comparison to the overall communication cost per timestep reported below.
It should be noted that the Aries network deployed on the Cray XC40 uses an adaptive routing algorithm for the messages [63].This ensures the best utilisation of all the network links across the machine, but may not be optimal for a given task.by messages is not the same each time resulting in variation in the time taken.Ideally, a statistical measure, such as the minimum or mean would be taken over many measurements to discount such variation.However, jobs running on such large numbers of nodes are computationally expensive and the large number of potential paths through the network would require a large (and expensive) statistical sample, which was not feasible for this study.Consequently, only a single result for each setup is reported.
Figure 4 quantifies the strong scaling of the time spent in the inner pressure solver.Rather than plotting the absolute times (which can be read off from Table 4), the parallel efficiency and the relative cost of the Krylovsubspace solvers (Kr2, Kr6) compared to multigrid (MG) are shown.Let T As the medium blue, unhatched bars show, the multigrid solver scales extremely well to 3456 nodes.Both Krylov solvers (with the dark blue hatching for Kr2 and light hatching for Kr6) show significantly worse scaling; that the trend is not smooth can be attributed to network variability.The upper panel of Figure 4 shows the relative performance of the multigrid solver and the two Krylov methods for increasing node counts.More specifically, the ratio R = Kr/MG is obtained by dividing the time spent in one of the Krylov solvers by the time for the multigrid solver.The MG solver is much faster than either Kr2 and Kr6, especially for large node counts where the superior scalability of the multigrid algorithm pays off: for 3456 nodes, the MG solver is more than 11× faster than Kr2 and 25× faster than Kr6.As will be discussed below, this can be at least partially explained by the fact that multigrid does not rely on costly global reductions which are required in each iteration of the Krylov-subspace solvers.However, even for 384 nodes the relative advantage of MG is more than 4× compared to the Kr2 solver and 11× compared to Kr6.
Figure 5 shows the same quantities as in Figure 4, but now for the outer, mixed solve.Again the parallel efficiency is given in the lower panel and the upper panel quantifies the relative advantage of the MG pressure solver, compared to Kr2 and Kr6.Compared to Figure 4, the strong parallel efficiency of the outer solve drops to less than 40% for all pressure solvers.This can be explained by the additional parallel communications in the outer solver, which have a particularly strong impact for MG.As can be seen from Table 4 this is to be expected since the multigrid solver accounts for a relatively smaller fraction of the outer solve time.However, as can be read off from the top panel of Figure 5, the MG pressure solver still improves performance relative to Kr2 and Kr6: on the largest node count multigrid leads to a 4× reduction in runtime compared to Kr2 (9× for Kr6) and even on 384 nodes the relative advantage of MG is 3× for Kr2 (7× for Kr6).
To understand the differences in parallel scalability for the different pressure solvers, Figure 6 shows the communication costs for increasing numbers of nodes.This includes both local communications in halo exchanges (top panel) and all-to-all communications in global reductions (bottom panel).The numbers were collected with the CrayPAT profiler in sampling mode.For technical reasons it was not possible to limit the measurements to the solver routine.Instead, the measured data is aggregated across all the calls to the relevant MPI library functions for the whole model run.However, from the profiling data, the time spent in the semi-implicit solver varies from 58 − 72% of the time per timestep for MG, 79 − 87% for Kr2 and 89 − 92% for Kr6.Moreover almost all the calls to the global sum take place in the solver routines.Thus it is reasonable to conclude that the communication costs are dominated by the solver.
The lower panel in Figure 6 shows τ , the time spent in the global sums per timestep.On the largest node count, the approximate times are τ ≈ 2s for MG (blue, unhatched), τ ≈ 7s for Kr2 (dark blue with hatching) and τ ≈ 12s for Kr6 (light blue with hatching).Evidently Kr2 and Kr6 spend much more time in global communications.This is readily explained by the fact that the number of global sums in the pressure solve is proportional to the number of Krylov solver iterations.In contrast, the MG V-cycle does not require any global sums and the superior scaling of the MG pressure solver itself (Figure 4) can be largely accounted for by this ab-sence of the global communication.Moreover, from Figure 6 it can be seen that especially for Kr6 the variation in the cost of the global sum is large.To quantify this, define the variation ν t as where Although the trend in the data is not entirely clear due to the network variability, the Kr2 solver (yellow with hatching) spends the least amount of time in nearest-neighbour communications.Since Kr6 (brown with hatching) requires more iterations to converge, it will also require more halo exchanges.The nearestneighbour communication cost of the MG V-cycle (red) lies somewhere between those two extremes, which can be attributed to additional halo exchanges on the coarser multigrid levels.Note, however, that limiting the number of levels to L = 3 in the shallow multigrid approach avoids the inclusion of very coarse levels with a poor computation-to-communication ratio.The cost of halo exchange decreases with the number of nodes.This is again plausible since (as long as the message size is not too small) local communication is bandwidth bound and so scales with the amount of data that is sent.The third column of Table 3 shows that the size of the halo reduces by a factor of three as the number of nodes increases from 384 to 3456.
The Multigrid solver scales better than might be expected from [62] which predicts that for the largest node count the local problems are so small that strong scaling should break down.This can be at least partly attributed the fact that the analysis in [62] is for a three-dimensional decomposition, whereas in the work the decomposition is only two-dimensional.Surface-to-volume scaling implies that for small local volumes, there is less communication in 2-d than 3-d.

Robustness with respect to timestep size
Finally, the impact of variations in the horizontal acoustic Courant number CFL h on the performance of the model is studied.As can be inferred from the analysis in Section 3.5, the condition number of the Helmholtz operator H in Eq. ( 15) depends strongly on CFL h .In other words, for a fixed grid spacing the condition number will increase with the timestep size, which can potentially make the pressure solver more costly.timestep sizes of varying length.Apart from this, the setup is the same as in Section 5.2.Table 5 shows the number of outer, mixed and pressure solver iterations for CFL h between 4 and 8. Results for CFL h = 4, 6 are averaged over the for the first 100 timesteps.The final column of Table 5 shows the average time spent in the linear solver in each timestep.Unfortunately if was not possible to chose larger timestep sizes due to instabilities due to the CFL limit imposed by the explicit vertical advection scheme.For the Kr2 and Kr6 solvers the number of inner, pressure iterations is also given.It can be observed that the number of outer iterations is largely insensitive to increases in the timestep size, in particular using one multigrid V-cycle for the pressure solve gives good results for the range of CFL h considered here.Any increase in the wall-clock time for the configurations which use Krylov-subspace pressure solvers can be clearly attributed to the growing number of inner iterations.Relative to the MG setup, which only requires exactly one multigrid V-cycle in the pressure solve, the cost of the Kr2 and Kr6 solvers increases for larger values of CFL h .This is readily explained by the fact that the number of pressure solver iterations grows with the condition number.Empirically it can be observed that this number of iterations approximately doubles when CFL h is increased from 4 to 8.While values of CFL h ≈ 10 are typical atmospheric simulations, even for smaller timestep sizes the multigrid pressure solver shows a clear advantage compared to Krylov-subspace methods.

Conclusion
This paper describes the construction of a Schurcomplement preconditioner with a multigrid pressure solver for the mixed-finite element LFRic numerical forecast model.Due to the presence of a velocity mass matrix-matrix an additional outer solver iteration is necessary, which makes the solver significantly more complex than in simpler finite-difference models on struc-tured latitude-longitude grids.By exploiting the structure of the Helmholtz-operator on the highly anisotropic global grid, it is possible to build a highly-efficient bespoke multigrid method using the tensor-product approach in [4].Using only a relatively small number of multigrid levels further improves parallel scalability.The numerical results presented here confirm the conclusions from earlier studies in idealised contexts, e.g.[20]: compared to Krylov-subspace methods, solving the pressure correction equation with a single multigrid Vcycle leads to significantly better overall performance.Running a baroclinic test case with more than 1 billion (10 9 ) unknowns on 124416, multigrid reduces the time spent in the outer, mixed solve by a factor of around 4×. Since it requires significantly less global reductions, multigrid also scales better in parallel.In contrast to Krylov-subspace based pressure solvers, multigrid is robust with respect to changes in the timestep size.
There are several avenues for future work.While multigrid has been demonstrated to be consistently better than any other considered solver, there are likely further small, problem specific optimisations which can be achieved for example by tuning parameters or including additional terms in the approximate Helmholtz operator by using a modified mass-lumping strategy.While the focus in this paper was on the lowest-order discretisation, [20] have demonstrated that the approach can be easily extended to higher orders by including an additional p-refinement step in the multigrid hierarchy.LFRic already provides the necessary code infrastructure.Following the promising results shown here, extending the approach from global simulations to local area models will require careful treatment of boundary conditions but is likely to lead to similar performance gains.Finally, an obvious downside of the approximate Schur-complement approach pursued here is the necessity of an additional outer solve of the mixed problem.As has been recently shown in [64], this can be avoided by using a hybridised mixed method.In fact, for a gravity-wave testcase has already been implemented in LFRic.It currently lacks an efficient preconditioner for the pressure correction equation, which is actively developed.
[36] Daniel S Abdi, Francis X Giraldo, Emil M Constantinescu, Lester E Carr III, Lucas C Wilcox, and Timothy C Warburton.Acceleration of the implicitexplicit non-hydrostatic unified model of the atmosphere (NUMA) on manycore processors.arXiv preprint arXiv:1702.04316,2017.

Figure 1 :
Figure 1: Function spaces used in the finite element discretisation

Figure 2 :
Figure 2: Stencil of the Helmholtz operator H (all gray cells) and of the operator H z (dark gray cells)

− 6 )Figure 3 :
Figure 3: Breakdown of solver times for the baroclinic wave test case on the C192 mesh described in Section 5.1.Both the total time per solve and the time spent in the pressure solver (blue) is shown.
average total time spent in the mixed, outer solve for each timestep; T (p) solve is the average time spent in the inner pressure solve.

Figure 4 :
Figure 4: Strong scaling of the time spent in the inner pressure solver for the C1152 mesh test case in Section 5.2.The lower panel shows the parallel efficiency PE compared to 384 nodes as defined in Eq. (24).The horizontal dashed line corresponds to perfect scaling where PE = 1.The relative cost of the Krylov solvers (Kr2 and Kr6) compared to the multigrid V-cycle (MG) is shown at the top.
(p) solve (N ) be the time spent in the pressure solve on N nodes.The parallel efficiency PE relative

Figure 5 :
Figure 5: Strong scaling of the time spent in the outer, mixed solver for the C1152 mesh test case in Section 5.2.The lower panel shows the parallel efficiency PE compared to 384 nodes as defined in Eq. (24).The horizontal hashed line shows perfect scaling where PE = 1.The relative cost of the Krylov pressure solvers (Kr2 and Kr6) compared to the multigrid V-cycle (MG) is shown at the top.

Figure 6 :
Figure 6: Strong scaling of the average communication costs per timestep for the C1152 test case in Section 5.2.The upper panel shows the cost of MPI send/receive operations during halo-exchange; lower panel shows the time spent in global sums.All times are measured with the CrayPAT profiling tool.

Table 1 :
Pressure solver configurations used in Section 5.1 5.3.All tests were run on the Met Office CrayXC40 supercomputer using the Aries interconnect.Each node is comprised of dual socket, 18-core Broadwell Intel Xeon processors, i.e. 36 CPU cores per node.The model was compiled with the Intel 17 Fortran compiler (version 17.0.0.098).
1, before presenting massively parallel scaling tests for a smaller subset of solver configurations in Section 5.2.Finally, robustness with respect to the timestep size is quantified in Section

Table 2 :
Number of iterations and time per linear solve for both the outer mixed solve (t ).All numbers are averaged over the total run of the baroclinic wave test case on the C192 mesh described in Section 5.1; times are given in seconds. solve

Table 2 .
The only exceptions are the the Jacobi & precondition-only methods which take approximately twice as long to run as using BiCGstab for

Table 3 :
Node configurations and local problem sizes for parallel strong scaling runs in Section 5.2.
−6(denoted "Kr6") is also included in the following study.The scaling tests are run on a C1152 cubed sphere mesh (6 × 1152 × 1152 cells) with ≈ 9 km horizontal resolution and 30 vertical levels with the same stretched grid and vertical CFL v as in Section 5.1.The total number of degrees of freedom is 1.4 • 10 9 with 2.4 • 10 8 pressure unknowns.The timestep is set to 205 seconds, such that the horizontal wave CFL number is again CFL h = c s ∆t/∆x ≈ 8.0, as in Section 5.1.In a strong scaling experiment the model is run for 400 timesteps 1 on up to 3456 nodes of the Cray XC40, keeping the global problem size fixed.Each node is configured to run with 6 MPI ranks and 6 OpenMP threads per rank; for the largest node count this corresponds to 3456 × 36 = 124416 threads.In this case the local state vector has around 10000 degrees of freedom and there are only around 2000 pressure unknowns per core.In

Table 4 :
Strong scaling of the time spent in the linear solver on different numbers of nodes for the C1152 mesh test case in Section 5.2.All results are measured in seconds and are averaged over 400 timesteps.T t min and t max are the smallest and largest measured global communication times across all four considered node counts.For the different methods the numerical values are ν t = 0.18 for MG, ν t = 0.17 for Kr2 and ν t = 0.36 for Kr6.Apart from the longer time to solution, the large variation in global communication costs due to network variability is another disadvantage of the Kr6 method.Again, this can be explained by the significant fraction of time spent in global sums during the pressure solve.For MG, global sums are only required in the outer solve and consequently the run-time variation is much smaller.

Table 5 :
To quantify the impact of this, the model was run on 384 nodes with Average number of solver iterations for the outer mixed and inner pressure solves for increasing horizontal acoustic Courant number.Results are reported for the C1152 test case in Section 5.2.The final column lists the average time for the spent in the mixed, outer solve for each timestep.