Comparison of dimensionally-split and multi-dimensional atmospheric transport schemes for long time-steps

Dimensionally split advection schemes are attractive for atmospheric modelling due to their efﬁciency and accuracy in each spatial dimension. Accurate long time steps can be achieved without signiﬁcant cost using the ﬂux-form semi-Lagrangian technique. The dimensionally split scheme used in this paper is constructed from the one-dimensional Piecewise Parabolic Method and extended to two dimensions using COSMIC splitting. The dimensionally split scheme is compared with a genuinely multi-dimensional, method-of-lines scheme which, with implicit time-stepping, is stable for Courant numbers signiﬁcantly larger than 1. Two-dimensional


Introduction
Many traditional weather and climate models use latitude-longitude meshes, but new models are being developed on quasiuniform meshes in order to better exploit modern computers (e.g. Skamarock and Gassmann, 2011;Staniforth and Thuburn, 2012;Weller et al., 2012;Lauritzen et al., 2014;Katta et al., 2015). Latitude-longitude meshes need measures to enable long time steps near the poles such as polar filtering, semi-implicit and semi-Lagrangian methods (Davies et al., 2005). These methods necessarily have large domains of dependence near the poles which lead to poor parallel scaling at high resolution. Quasiuniform meshes avoid parallel scaling bottlenecks near the poles but lead to non-orthogonal meshes (e.g. the cubed-sphere) or different structures (e.g. icosahedra). Therefore accurate and efficient transport (or advection) schemes on non-orthogonal meshes are required. There is an abundance of desirable properties of advection schemes, including: 1. Inherent local conservation of the advected quantity. 2. Stability in the presence of large Courant numbers. 3. Accuracy in the presence of large Courant numbers. 4. High-order accuracy. 5. Low computational cost, good parallel scaling and multitracer efficiency. 6. Low phase and dispersion errors (advection of all wavenumbers of the advected quantity at close to the correct speed). 7. Low diffusion errors (maintaining amplitude of all wavenumbers of the advected quantity). 8. Boundedness, monotonicity, positivity and maintaining correlations between multiple advected tracers.
Four (important) properties are listed together in the final item because they will not be addressed here. In this paper, we address the issue of conservative, accurate, efficient advection schemes on logically rectangular, non-orthogonal meshes which are stable in the presence of large Courant numbers. These schemes would be particularly relevant for cubed-sphere meshes and for terrain-following meshes. Another novel aspect of this paper is that, for simplicity, we test advection schemes entirely on planar meshes rather than on the sphere, proposing test cases to challenge advection schemes on non-orthogonal meshes without the need to implement meshes in spherical geometry. We assume that a large fraction of the numerical errors associated with terrain-following co-ordinates and the cubed-sphere grid are due to the mesh non-orthogonality, distortion and discontinuity. Other sources of numerical errors, such as those associated with the representation of spherical geometry, are not addressed here. Dimensionally split schemes (operating separately in each spatial dimension) are attractive for atmospheric modelling due to their efficiency and high accuracy in each spatial dimension (e.g. Leonard et al., 1996;Lin and Rood, 1996;Brassington and Sanderson, 1999;Putman and Lin, 2007;Katta et al., 2015). Inherent conservation is guaranteed by using the flux-form semi-Lagrangian (FFSL or forward in time) technique (e.g. Colella and Woodward, 1984) which integrates the dependent variable over a swept distance upstream of every face in order to calculate fluxes in and out of cells. Accurate long time steps can be achieved without significant cost by calculating cumulative mass fluxes along lines of cells (Leonard et al., 1995). One-dimensional schemes can be used with operator splitting to create dimensionally split, temporally second-order-accurate schemes (e.g. Leonard et al., 1996) on logically rectangular, multidimensional meshes. Dimensionally split schemes have been found to give good accuracy on nonorthogonal meshes such as the cubed-sphere if special treatment is applied over cube edges to improve accuracy: for example, Putman and Lin (2007) use the average of two one-sided schemes at cube edges. Katta et al. (2015) use a multi-dimensional transport scheme with dimensionally split reconstructions. They create ghost cells outside each cube panel boundary and achieve accuracy between second and fourth order. Guo et al. (2014) report error growth around cube edges using Strang splitting.
Meshes also become non-orthogonal when orography is represented with terrain-following layers. The problem is commonly alleviated by smoothing terrain-following layers to reduce non-orthogonality away from the ground (e.g. Schär et al., 2002), or by using floating Lagrangian vertical coordinates (Lin, 2004). The special treatment applied to dimensionally split schemes for cubed-spheres cannot be applied for terrainfollowing layers because sloping orography appears throughout the domain. Dimension splitting may account for some of the errors over orography reported by Kent et al. (2014), although they do not cite this as a reason for errors.
Dimension-splitting errors on distorted meshes can be eliminated by using genuinely multi-dimensional advection schemes. These can be either FFSL (i.e. swept area; e.g. Lashley, 2002;Lipscomb and Ringler, 2005;Miura, 2007;Thuburn et al., 2014), method of lines (MOL, discretising space and time separetely; e.g. Weller et al., 2009;Skamarock and Gassmann, 2011;Katta et al., 2015;Shaw et al., 2017) or conservative semi-Lagrangian (with conservative re-mapping; e.g. Iske and Kaser, 2004;Zerroukat et al., 2004;Lauritzen et al., 2010). Few FFSL and MOL multi-dimensional schemes have been extended to work with Courant numbers significantly larger than 1, a notable exception being Ullrich and Norman (2014), whose FFSL method is stable for Courant numbers up to around 2.5. FFSL multi-dimensional schemes could be extended to handle larger time steps by integrating the upstream swept volume over a large upstream volume, interacting with a large number of upstream cells. However, the integration cost would be proportional to the time step since an upstream swept volume would overlap with more cells. MOL multi-dimensional schemes can be extended to work with Courant numbers larger than 1 by using implicit time-stepping. This will increase the computational cost per tracer advected since the solution of a matrix equation would be needed for every advected tracer. Other disadvantages of implicit time-stepping are the large phase errors when long time steps are used (e.g. Durran and Blossey, 2012;Lock et al., 2014), the difficulty of achieving monotonicity, and the parallelization cost of linear solvers. This is in contrast to semi-Lagrangian or FFSL schemes which maintain accuracy with long time steps (Purnell, 1976;Pudykiewicz and Staniforth, 1984;Leonard et al., 1995) although monotonocity with long time steps is still challenging (Bott, 2010). Conservative semi-Lagrangian (e.g. Zerroukat et al., 2004) naturally extends to long time steps but the conservative remapping is complicated and expensive, particularly on nonrectangular meshes, and will not be investigated here. Lauritzen et al. (2014) described how the FFSL technique with a long time step can be made equivalent to the conservative semi-Lagrangian.
It is therefore not clear what approach should be taken for achieving long time steps when advecting multiple tracers on distorted meshes. In this paper, we show the effect of dimensionsplitting errors using a FFSL dimensionally split scheme on a number of test cases which use distorted and undistorted meshes, and compare with a genuinely multi-dimensional implicit MOL scheme using large and small Courant numbers.
The theoretical properties of dimensionally split advection schemes are often tested on uniform, orthogonal meshes (e.g. Leonard et al., 1996) which is inadequate for a scheme that will eventually be used on a cubed-sphere. Developing a transport scheme to the extent that it can be used on a multi-panel cubedsphere with special treatment of cube edges is a considerable undertaking. Hence there is a need for more challenging advection test cases which are simpler to implement, without the need for spherical meshes. We therefore propose some modifications of existing test cases to use distorted meshes, or distorted co-ordinate systems, on a logically rectangular, two-dimensional plane. These test cases will mimic the problems encountered at cubed-sphere edges.
The long-time-step-permitting, dimensionally split scheme and the long-time-step-permitting multi-dimensional scheme are defined in section 2. In section 3 we present results of three advection test cases on distorted meshes in two-dimensional planes using Courant numbers above and below 1. These are: the solid-body rotation test case of Leonard et al. (1996), modified to use a mesh (or co-ordinate system) with distortions similar to a cubed-sphere (section 3.1); the horizontal advection test case over orography (Schär et al., 2002), examining sensitivity to time step, resolution and mountain height, all on the basic terrain following mesh without smoothing of terrain following layers (section 3.2); and a modification of the deformational flow test case of Lauritzen et al. (2012) for a periodic rectangular plane (section 3.3). Some estimates are made of computational cost in section 3.4 and final conclusions are drawn in section 4.

Transport schemes
We present two conservative advection schemes suitable for long time steps (stable for Courant numbers significantly larger than 1) for solving the linear advection equation: where the dependent variable φ is advected by a divergence free velocity field u(x, t). The dimensionally split scheme is the piecewise parabolic method (PPM; Colella and Woodward, 1984) which uses the FFSL approach extended to long time steps following Leonard et al. (1995) with COSMIC splitting Leonard et al. (1996) to extend PPM to two dimensions. The multidimensional scheme uses the method-of-lines approach (treating space and time independently). The second-order accurate spatial discretistaion of Shaw et al. (2017) is combined with an explicit, second-order Runge-Kutta time-stepping scheme and with an implicit, Crank-Nicholson time-stepping scheme to allow long time steps. Neither scheme has monotonicity or positivity preservation. The dimensionally split scheme is third-order accurate in one dimension but the COSMIC splitting reduces the temporal order to two. The multi-dimensional scheme uses cubic interpolations but uses cell centroids as Gauss points so the order of accuracy is limited to two. When explicit time-stepping is used, the cost of both schemes is similar (section 3.4). These schemes are therefore considered suitable for comparison. The code for dimensionally splitting scheme is available at https://github.com/yumengch/COSMIC-splitting. The multidimensional scheme is implemented using OpenFOAM (2016) and is available at https://github.com/AtmosFOAM/. The set-up of the test cases is available at https://github.com/hilaryweller0/ multiDadvectCases (all accessed 26 July 2017).

One-dimensional PPM with long time steps
We describe a long-time-step version of PPM for solving the one-dimensional advection equation: Colella and Woodward (1984) defined PPM with monotonicity constraints and for variable resolution but for simplicity (and for comparison with the multi-dimensional scheme) we will define PPM without monotonicity constraints and for a fixed resolution x, and time step t. With these restrictions, PPM should be fourth-order accurate in one dimension for vanishing time step. We define the dependent variable, φ (n) i , to be the mean value of φ in cell i at time-level n where x i = i x and t = n t. Since PPM is a flux-form finite-volume method, φ (n) i is updated using: where X C (φ) is the conservative advection operator for φ in the x direction. The fluxes, u i±1/2 φ i±1/2 , are found by integrating a piecewise polynomial p, along the distance travelled in each time step upwind of cell boundary x i±1/2 . The polynomial is defined in each cell i, such that: by where ξ = (x − x i−1/2 )/ x and In order to cope with long time steps, we follow Leonard et al. (1995) and divide the Courant number into a signed integer part, c N , and a remainder, c r . The departure point of location x i− 1 2 is thus computed as x d = x i− 1 2 − u i− 1 2 t, and the departure cell, > 0, the flux through x i−1/2 between times n t and (n + 1) t is: where M i− 1 2 is the cumulative mass from the start point to position This departure point calculation assumes that the velocity is uniform on the computational mesh which has a first-order error which could be particularly damaging for long Courant numbers, when the wrong departure cell could be found. The departure point calculation could be improved by using a second-order predictor-corrector departure point calculation (e.g. Melvin et al., 2010). The velocity is derived from a stream function and the Jacobian of the co-ordinate transform: For stability, the time step is restricted by the deformational Courant number: (Pudykiewicz and Staniforth, 1984) such that c d ≤ 1.

COSMIC splitting
COSMIC operator splitting (Leonard et al., 1996) allows singlestage, one-dimensional schemes such as PPM to be used stably in two or more dimensions whilst retaining conservation, constancy preservation and second-order accuracy (on orthogonal meshes). As we are now considering two spatial dimensions, we define φ ij , u ij and v ij , the values of φ and the velocity components, u and v in cell (i, j) where x = i x and y = j y. COSMIC splitting uses both advective (A) and conservative (C) advection operators in the x and y directions: , u e = u i+1/2,j and u w = u i−1/2,j are the values of φ, u and v at the (n, s, e and w) cell boundaries. If COSMIC is being used to extend PPM to two spatial dimensions, then φ n,s,e,w are calculated from Eq. (8). Assuming C-grid staggering, v n , v s , u e and u w are dependent variables. Instead of using cell-centred velocity (Lin and Rood, 1996) or upwind velocity (Leonard et al., 1996), the advective operators are calculated in a similar manner to Lin (2004). Mesh distortions can be included in the advection equation with a co-ordinate transform with Jacobian J: The operators X C , Y C , X A and Y A are then: which are combined to update φ ij in each cell by: where |J| ij is the determinant of Jacobian at the centre of cell (i, j), and x and y give the cell size in uniform computational domain.

Multi-dimensional method-of-lines (MOL) scheme
The MOL scheme uses the finite volume method for arbitrary meshes and is implemented in OpenFOAM (2016). This uses a cubic upwind spatial discretization (Weller and Shahrokhi, 2014;Shaw et al., 2017) combined with either implicit Crank-Nicholson in time or explicit RK2 (see below). Although the interpolation uses a cubic polynomial, cell centre values are approximated as cell average values and face centre values are approximated as face average values so the method is limited to second-order accuracy in space. The advection scheme uses Gauss's divergence theorem to approximate the divergence term of the advection equation: where V is the cell volume, the summation is over all faces, f , of cell c, φ f is the value of the dependent variable, φ interpolated onto face f , u f is the velocity at face f and S f is the vector normal to face f with magnitude equal to the area of face f (i.e. the face area vector). The advection scheme uses a fit to a 2D (or 3D) polynomial using an upwind-biased stencil of cells ( Figure 1) in order to interpolate from known cell values onto a face. In 2D, the cubic polynomial is: Stencil of upwind-biased cells for interpolating onto face f using a 2D cubic polynomial for a rectangular mesh structure for the multi-dimensional scheme. The tracer, φ, is stored at cell centres (grey dots). 'up' and 'down' refer to upwind and downwind cells relative to the central face, f , and x -y is the co-ordinate system relative to f . it cannot be set with a stencil that is narrow in the direction of the flow.) Coefficients a i are set from a least-squares fit to the cell data in the stencil. The least-squares problem involves a 9 × m matrix singular value decomposition (where m is the size of the stencil) for every face and for both orientations of each face. However this is purely a geometric calculation and is therefore a pre-processing activity since the mesh is fixed. This generates a set of weights for calculating φ f from the cell values in the stencil, leaving m multiplies for each face for each call of the advection operator. The stencils are found for three-dimensional, arbitrarily structured meshes by finding the face(s) closest to upwind of the face we are interpolating onto, taking the two cells either side of the upwind face(s) and then taking the vertex neighbours of those central cells ( Figure 1). For each face there are two possible stencils depending on the upwind direction. Both of the stencils are stored and the interpolation weights for both stencils are calculated.
In order to ensure that the fit is accurate in the cells either side of face f and to ensure that the values in these adjacent cells have the strongest control over φ f , rows associated with these values in the least-squares fit matrix are weighted a factor of 1000 relative to the other rows (following Lashley, 2002). This does not affect the order of accuracy. Mathematically, an arbitrarily large value of the weight can be used to ensure that the fit goes exactly through the upwind and downwind cell. However if a value too large is used, the singular value problem becomes ill-conditioned. The stabilization procedures described by Shaw et al. (2017) are in the code but are not activated for these test cases because the meshes are sufficiently regular. The value φ f is then calculated as a higher-order correction to first-order upwind: where w c are the weights for each cell of the stencil calculated from the least-squares fit, with w up reduced by 1 to make the fit a correction on upwind.

Implicit and explicit time-stepping
Assuming that the velocity field and mesh are constant in time, the explicit Runge-Kutta 2 (Heun) time-stepping is defined as The trapezoidal implicit or Crank-Nicolson time-stepping leads to a matrix equation which needs to be solved to find all the φs at the next time step. In order to ensure that the matrix is diagonally dominant for arbitrary time steps, the cubic interpolation applied is a deferred correction on firstorder upwind so that only the coefficient corresponding to the upwind cells are included in the matrix. This means that more than one implicit solves are needed per time step so that the higher-order terms are solved to be second-order accurate in time. If the Courant number is less than or close to 1, we use two implicit solves per time step. Consequently, assuming that the velocity field and mesh are constant in time, the time-stepping scheme is defined as: For Courant numbers larger than 1, we use four implicit solves per time step, although sensitivity to this choice has not been investigated.

Matrix solvers and tolerances
If this implicit scheme is applied on a logically rectangular, twodimensional mesh with horizontal and vertical Courant numbers c x and c z , then the diagonal coefficients of the matrix would be 1 + c x /2 + c z /2 and, assuming two upwind directions, there would be exactly two off-diagonal elements, −c x /2 and −c z /2. Consequently the matrix is very sparse, asymmetric and diagonally dominant for all time steps. It is solved using the OpenFOAM bi-conjugate gradient solver using DILU pre-conditioning to a tolerance of 10 −8 every iteration. Sensitivity to the solver or solver tolerance have not been investigated. Information about the number of solver iterations for different test cases in given in section 3.4.

Results of test cases in planar geometry
In order to make test cases as simple as possible without the need for incorporating spherical geometry, multi-panel meshes or nonrectangular cells, our computational domain consists of a periodic two-dimensional plane with deformations and discontinuities in the co-ordinate system (or mesh) mimicking the kind of distortions which are produced near cubed-sphere edges. We also use a two-dimensional vertical slice test case over orography using terrain-following co-ordinates (or terrain-following meshes). A range of test cases are undertaken on uniform and distorted meshes using the dimensionally split scheme and the multidimensional scheme in order to evaluate the influences of mesh distortions, the validity of using a dimensionally split scheme on a distorted mesh, and the schemes' accuracy and stability for long time steps.

Solid-body rotation
The solid-body rotation test case of Leonard et al. (1996) is used to compare the accuracy of the dimensionally split and multi-dimensional schemes on orthogonal and non-orthogonal meshes. We define this test case on a domain that is 10 4 × 10 4 m 2 . The velocity is defined by numerically differentiating the streamfunction which can be defined at mesh vertices, x, by: where x c is the centre of the domain, and A = 5π/3000 s −1 so that the angular velocity is 2A. The initial tracer takes a Gaussian distribution in order to ensure that all advection schemes achieve their theoretical order of accuracy: where x φ = x c + r cφ j is the initial centre of the tracer distribution, r cφ = 2500 m, r φ = 500 m and j is the unit vector in the y direction. The analytic solution has the same tracer distribution, but with the centre of the tracer at and the Gaussian rotates anti-clockwise exactly one revolution in 600 s. The solid-body rotation test case is performed on a uniform, orthogonal mesh and on a non-orthogonal mesh on a plane with non-orthogonality similar to that of a cubed-sphere mesh. For the dimensionally split scheme, non-orthogonality is achieved using the co-ordinate transform: where y m = 5000 m and f is the equation for a y position of uniform Y half way up the domain. In order to create angles of 120 • in the mesh, similar to a cubed-sphere, f is given by: where x m = 5000 m. For a 50 × 50 mesh this gives the x and y coordinate locations as shown in Figure 2. The multi-dimensional scheme model uses Cartesian co-ordinates and a distorted mesh rather than a non-orthogonal co-ordinate system on a Cartesian mesh. However, this does not affect the numerical results assuming that the co-ordinate transforms are implemented in a consistent way to the distorted mesh in Cartesian co-ordinates. For the dimensionally split scheme, bi-periodic boundary conditions are applied. For the multi-dimensional scheme, it was more straightforward to impose fixed-value boundary conditions of φ = 0 where the velocity is into the domain and zero normal gradient where the velocity is out of the domain. However, φ remains almost zero near the boundaries so these boundary conditions do not affect the accuracy.
Results of this test case on the orthogonal and non-orthogonal meshes of 100 × 100 cells with t = 1 s are shown in Figure 3 for both advection schemes (which gives a maximum Courant number close to 1). The contours show the tracer value every 100 s and the colours show the errors summed every 100 s. The dimensionally split scheme outperforms the multi-dimensional scheme on both meshes due to the higher-order accuracy of the split scheme. The dimensionally split scheme introduces a small error at 300 s where the tracer goes through the change in direction of the mesh which would be ameliorated if we were using monotonicity constraints. The second-order, multi-dimensional scheme shows phase lag but errors are almost entirely insensitive to the mesh distortions. The results in Figure 3 use implicit timestepping but the results using explicit time-stepping are similar (not shown).
The multi-dimensional and dimensionally split schemes take very different approaches to handling large Courant numbers. The multi-dimensional scheme uses implicit time-stepping whereas the dimensionally split scheme uses a FFSL approach, integrating over a line of cells in order to calculate the flux across a face. Implicit schemes are known to suffer from phase errors (e.g. Durran and Blossey, 2012;Lock et al., 2014) for long time steps, whereas the accuracy of semi-Lagrangian is less sensitive to timestep (Pudykiewicz and Staniforth, 1984). Therefore we present results of both schemes on orthogonal and non-orthogonal meshes for time steps ten times those used in Figure 3 ( t = 10 s) giving maximum Courant numbers of around 10 using 100 × 100 cells in Figure 4. The error of the multi-dimensional scheme is again much larger than the dimensionally split scheme on both meshes. The dimensionally split scheme is accurate at large Courant numbers despite the first-order calculation of departure points. However, the dimensionally split scheme introduces oscillations on the non-orthogonal mesh, particularly where the mesh changes direction, whereas for the multi-dimensional scheme, errors are not strongly affected by the non-orthogonality. Figure 4 clearly shows phase errors of the implicit time-stepping but, despite the large Courant number, the well-resolved part of the profile is propagating at close to the correct speed. Dispersion analysis (Lock et al., 2014) shows that high-frequency oscillations (which are poorly resolved in time and space) will be slowed dramatically, but fast-moving features which are well-resolved in space will propagate at a much more realistic speed, supporting the results shown in Figure 4. The explicit multi-dimensional scheme is not stable for this time step. In order to compare convergence with resolution of the different numerical methods, we use the 2 and ∞ error norms defined in the usual way: where φ T is the analytic solution and the integrations and maxima are over the whole domain, with volume V. Figure 5 shows convergence with resolution of the 2 and ∞ error measures for meshes of 50 × 50, 100 × 100, 200 × 200 and 400 × 400 cells with time steps scaled in order to maintain a maximum Courant number of 1 ( t = 2, 1, 0.5, 0.25 s) or scaled to achieve a maximum Courant number of 10 ( t = 20, 10, 5, 2.5 s). The error norms are calculated at t = 500 s, when the tracer has made 5/6 of one revolution in order to avoid error cancellation.
On both orthogonal and non-orthogonal meshes, the multidimensional scheme has second-order convergence once errors are low enough to avoid error saturation (stable errors are bounded at around 1). With a large Courant number, both schemes are less accurate. For the multi-dimensional scheme, this is due to phase errors of the implicit time-stepping whereas for the multi-dimensional scheme the first-order errors in calculating the departure point and trajectory will be emerging and there could also be significant errors from the second-order COSMIC splitting. On the orthogonal mesh, the dimensionally split scheme has third-order convergence for the Courant number close to 1 and second-order for the larger Courant number (consistent with the results of Colella and Woodward, 1984;Leonard et al., 1996). On the non-orthogonal mesh, the dimensionally split scheme has second-order convergence for both Courant numbers. Results of the explicit multi-dimensional scheme are also shown in Figure 5 for the shorter time step on the non-orthogonal mesh. The explicit RK2 scheme is more accurate than the implicit (Crank-Nicolson) scheme with the same order of convergence, but does not come close to the accuracy of the dimensionally split scheme. The different time-stepping schemes of the two models also affect convergence with time step. The 2 and ∞ error measures as a function of time step for meshes of 100 × 100 cells are shown in Figure 6. The dimensionally split scheme, which uses FFSL time-stepping, has errors reducing as time step increases, up to a Courant number of 2 ( t = 2 s), whereas the multi-dimensional scheme using implicit method-of-lines to treat space and time separately always has error reducing as time steps reduce. The FFSL technique discretizes space and time together and the error is not very sensitive to time step. However, the shorter the time step, more time steps need to be taken, and so errors can actually accumulate more from taking more time steps. (This is consistent with the order of accuracy of semi-Lagrangian being x p / t for interpolation using polynomials of degree p, as described by Durran, 2010). The explicit version of the multi-dimensional scheme has errors that decrease as time step increases due to the spatial discretization errors that are being introduced at every time step.
In summary, the dimensionally split scheme has excellent behaviour at large and small Courant numbers on the orthogonal meshes with up to third-order convergence for small Courant numbers, and the errors increase and order of convergence decreases on non-orthogonal meshes. In contrast, the multidimensional scheme is insensitive to the orthogonality, converges with second-order as expected, and suffers from phase errors at large Courant numbers due to the use of Crank-Nicolson implicit time-stepping.

Horizontal advection over orography
Non-orthogonal meshes (or co-ordinate systems) are usually necessary for representing orography with terrain-following coordinates. This is a challenge for dimensionally split schemes. Horizontal-vertical split schemes are commonly used in this context (e.g. Dennis et al., 2012;Wan et al., 2013) since most weather/climate models use higher resolution in the vertical than in the horizontal. We present results of the Schär et al. (2002) horizontal advection over orography test case for a range of resolutions and Courant numbers for the dimensionally split and multi-dimensional schemes.
All simulations use basic terrain-following co-ordinates (BTF; Gal-Chen and Somerville, 1975) in order to present a challenging test case that maximizes the non-orthogonality that can be introduced by terrain-following coordinates. The transformation is given by: where H is the domain height and h is the terrain height. The test case uses a domain of width 300 km, height, H = 25 km and a with the maximum mountain height, h 0 = 3 km, half-width a = 25 km and wavelength λ = 8 km. These values give a maximum terrain gradient of close to 45 • . The wind is given by a streamfunction which is defined at vertices so that the wind field is discretely divergence-free. The streamfunction at vertices is calculated analytically from the wind profile: with u 0 = 10 m s −1 , z 1 = 4 km and z 2 = 5 km. The initial tracer position is given by: initial tracer centre (x 0 , z 0 ) = (−50 km, 9 km) and halfwidths A x = 25 km, A z = 9 km. At time t = 5000 s the tracer is above the mountain and the simulation finishes at t = 10 000 s by which time the analytic solution is centred at (50, 8 km). The tracer advection over orography is shown in Figure 7 for the split and multi-dimensional schemes at a resolution of x = 1 km, z = 500 m and for a range of Courant numbers. The multi-dimensional scheme uses implicit time-stepping. The horizontal Courant number is defined as u 0 t/ x and ranges from 0.25 to 10. The maximum Courant number is the maximum of the multi-dimensional Courant number which is defined for cell with faces f as (section 2.3 defines the variables), and ranges from 0.74 to 29.6. The time-step restriction for the split scheme is based on the deformational Courant number, c d ≤ 1, (Eq. (11)). The maximum deformational Courant number is also given in Figure 7. The contours in Figure 7 show the tracer values at 0, 5 000 s and 10 000 s after initialization and the colours show the errors from the analytic solution. For horizontal Courant numbers less than 1 (maximum Courant number up to 3), both schemes give accurate results with the dimensionally split scheme tending to give oscillations and the multi-dimensional scheme producing more diffusion. For larger Courant numbers, when the deformational Courant number is greater than 1, the split scheme is unstable, while the multi-dimensional scheme produces large phase errors due to the errors associated with implicit time-stepping. The term responsible for the large deformational Courant number is ∂u/∂z where the velocity shears from u 0 = 10 m s −1 at z = z 2 to zero at z = z 1 . We examine the convergence with resolution in Figure 8 which shows the 2 and ∞ error norms as a function of x. These simulations all use a maximum Courant number less than 1, a horizontal Courant number of 0.25, a maximum deformational Courant number of about 0.2 and fixed ratios of x, z and t. The width and height of the orography is kept constant. Both schemes give similar accuracy with the dimensionally split scheme having faster convergence with resolution.
In summary, the dimensionally split scheme has good accuracy over orography for modest Courant numbers but larger errors for larger Courant numbers, and is unstable when the deformational Courant number is greater than 1, whereas the multi-dimensional scheme with implicit time-stepping is stable for all Courant numbers. The multi-dimensional scheme is second-order convergent and the dimensionally split scheme converges faster.

Deformational flow
For deformational flow, there is no analytical solution and therefore we follow the approach of Nair and Lauritzen (2010) and Lauritzen et al. (2012) and define an evolving velocity field that reverses direction half-way through the simulation, taking the tracer back to the initial conditions. Error norms can then be calculated by comparing the final and initial tracer fields. The deformational velocity field is added to a fixed solid-body rotation that does not reverse so as to avoid error cancellation between the forward and backward periods (solid body here meaning horizontal wind on a periodic domain).
We define a Cartesian version of the deformational, nondivergent test case of Lauritzen et al. (2012) with a domain between −π and π in the x direction (L x = 2π ) and between −π/2 and π/2 in the y direction (L y = π ) with periodic boundary conditions in the x direction and zero gradient, and zero flow boundary conditions in the y direction. The streamfunction adapted to Cartesian co-ordinate is: where ψ = 10 and T = 5 is the time for one complete revolution of the periodic domain for the solid-body rotation part of the flow. In order to test the order of convergence of the advection schemes, we use the infinitely smooth Gaussian distribution for the initial tracer concentration: where x T = (x, y), x T 0 = (5L x /12, 0), x T 1 = (7L x /12, 0) and A = 1/5.
A mesh with distortion similar to a cubed-sphere is defined by the co-ordinate transform: where f is:  The initial conditions and a 120 × 60 mesh with distortion defined by this co-ordinate transform is shown in Figure 9. The tracer concentrations after 1, 2, 3, 4 and 5 time units are shown in Figure 10 using the dimensionally split and multidimensional schemes (with explicit time-stepping, although the results using implicit time-stepping look identical) on a nonorthogonal mesh of 480 × 240 cells, a time step of 0.0025 units (ie 2000 time steps to reach 5 time units) which gives a maximum Courant number of 1.03. The tracer is stretched out, wound up, advected around and then wound back into its original position with some numerical errors. Both advection schemes preserve fine filaments but suffer from some dispersion errors which generate small oscillations around zero behind sharp gradients in the direction of the flow since neither scheme is monotonic or positive preserving. The dimensionally split scheme returns the tracer to a more accurate final solution.
Sensitivity to orthogonality, resolution, time step and time scheme are explored using a range of simulations with maximum Courant numbers shown in Table 1. The Courant numbers on the uniform, orthogonal meshes (with no mesh distortions) are 70% of those on the non-orthogonal mesh since the non-orthogonal mesh has clustering of mesh points. The deformational Courant number is less than 1 for all simulations. The convergence with resolution of the 2 and ∞ error norms at the final time are shown in Figure 11 for both advection schemes using a maximum Courant number close to 1 and for a maximum Courant number . The non-orthogonal mesh/co-ordinate system for the deformational flow with 120 × 60 cells and the initial tracer conditions. close to 10. We will first consider the behaviour of the schemes at modest Courant number (maximum close to 1). The explicit multi-dimensional scheme is stable at this Courant number and errors are almost identical to those of the implicit scheme. All of the schemes give first-order convergence with resolution at coarse resolutions due to error saturation (if the scheme is stable, errors are bounded above by about 1). At higher resolution, the split scheme converges with nearly third-order for both meshes whereas the multi-dimensional scheme approaches second-order. For large Courant numbers (maximum close to 10), the split scheme converges with first-order for both mesh types due to the crude estimation of the departure point (section 2.1), whereas the multi-dimensional scheme is much less accurate but approaches second-order at high resolution. The dimensionally split scheme is sensitive to mesh distortions only for large Courant numbers.
We inspect sensitivity to time step of all schemes on orthogonal and non-orthogonal meshes of 120 × 60 cells in Figure 12 using the time steps shown in Table 1 giving maximum Courant numbers ranging from 0.25 to 10 (deformational Courant numbers always less than 0.32). Again this shows that the explicit time-stepping gives almost identical errors to the implicit timestepping for time steps at which it is stable. Figure 12 demonstrates potentially useful properties of the FFSL time-stepping used by the split scheme. Ignoring errors in calculating the departure point, semi-Lagrangian schemes have errors proportional to t −1 (Durran, 2010) which explains the reduction in error as time step increases for the split scheme. Once the maximum Courant number reaches 1 or 2 ( t = 0.01 or 0.02), errors of the split scheme do grow with the time step, due to the deformational nature of the flow and first-order errors in calculating the departure points. In contrast, using the multidimensional scheme which uses method-of-lines time-stepping, errors increase as the time step is increased. Comparisons between schemes in Figures 10 and 11 used maximum Courant numbers of 1 and 10 which showed that the split scheme on the orthogonal Figure 10. Results of the deformational flow test case after 1, 2, 3, 4 and 5 time units on the non-orthogonal meshes of 480 × 240 cells using the explicit dimensionally split (left) and multi-dimensional (right) schemes. mesh gave better accuracy than the multi-dimensional scheme. However, Figure 12 shows that this advantage disappears at lower Courant numbers since the multi-dimensional scheme (methodof-lines) gets more accurate with lower Courant numbers, whereas the split scheme (semi-Lagrangian) gets less accurate since more time steps have to be taken. Both schemes are stable for all time steps considered.
In summary, the dimensionally split scheme has good accuracy for deformational flow independent of mesh orthogonality.

Computational cost
We cannot compare CPU time, wall-clock time or parallel efficiency of the two advection schemes because the multidimensional scheme is written in C++ using OpenFOAM and the split scheme is written in Python; the two codes have been run on different hardware and the split advection scheme code has not been parallelised. Instead we consider the number of multiply operations performed per cell per time step for calculating the fluxes of each tracer. We appreciate that this is not a good predictor of efficiency as it does not consider memory read and write requirements or cache coherency. However all data that are multiplied have to be fetched from memory and so, assuming that all data can be arranged optimally in memory to enable fewest cache misses, the number of multiplies should be related to the wall-clock time. Neither advection scheme does significantly more work per memory fetch than the other. For both schemes we consider the number of multiplies needed to calculate the flux of φ at each face, i.e. Eq. (8) for the dimensionally scheme and Eq. (24) for the multi-dimensional scheme. We do not consider the computational cost of updating the cell averages from the fluxes using Gauss's theorem (Eqs (3) and (22)) as these are the same for each scheme. For the multidimensional scheme, we do include an estimate of the amount of work done by the linear equation solver, but we do not consider the scalability of this solver.

Dimensionally split scheme
The computational cost of PPM with COSMIC splitting is not strongly time-step-dependent due to the swept area approach of FFSL. One-dimensional PPM uses four cells for interpolation to find face values used by the reconstruction. Assuming as much as possible is pre-computed, this interpolation uses three multiplies on a non-uniform mesh. The reconstruction then uses six multiplies to calculate the flux on each face. Only one additional memory access and one additional multiply are needed per cell for Courant numbers greater than 1 (Eq. 9) making ten multiplies per cell for applying PPM in one direction for a Courant number greater than 1. The COSMIC splitting in two dimensions involves four applications of the one-dimensional PPM. This makes 40 multiplies per cell in total for applying PPM with COSMIC splitting in two dimensions. In three dimensions, COSMIC splitting requires 12 applications of PPM (Leonard et al., 1996) leading to 120 multiplies.

Multi-dimensional scheme
The number of multiplies involved in the multi-dimensional scheme includes the number of multiplies to update the higherorder advection and the number of multiplications for each iteration of the linear equation solver. We will also consider the cost of an explicit version of the multi-dimensional scheme using an RK2 time-stepping scheme (e.g. as used by Shaw et al., 2017) which is stable and accurate up to a Courant number of 1 or 2 for this spatial discretization and gives very similar results to the implicit scheme.
On a logically rectangular two-dimensional mesh, there are 12 cells in each stencil for each face (Figure 1). Each cell has four faces and the interpolation onto each face is used to calculate the flux between two cells. This leads to 24 multiplies per cell for each RK2 stage and hence 48 multiplies per cell per time step. When using implicit time-stepping, there will be 24 multiplies per cell for every evaluation of the right-hand side of the matrix for the higher-order correction on first-order upwind.
We have not explored the sensitivity of the accuracy and stability to the stencil size and shape in three dimensions. The three-dimensional equivalent of the stencil of quadrilaterals in Figure 1 is likely to contain 36 (rather than 12) cells, although it may be possible to omit some corner cells and use a stencil of 20 cells. For a stencil of 36 cells, the number of multiplies per cell per time step would be 36 × 3 × 2 = 216.
The implicit version of the multi-dimensional scheme (section 2.3.1) requires the solution of an asymmetric, diagonally dominant matrix with three non-zero elements per row for a logically rectangular two-dimensional mesh. The pre-conditioner is implemented in file DILUPreconditioner.C in OpenFOAM 3.0.1 and the solver in file PBiCG.C. From these files, we estimate that, for a mesh of quadrilaterals, the solver will use 24 multiplies (or divides) per cell, per solver iteration, including pre-conditioning. The average number of iterations of the preconditioned bi-CG solver per time step for each of the simulations is shown in Table 2 (including the number of solver iterations in each outer iteration). These simulations use two outer iterations per time step when the maximum Courant number is ≤ 1.1 (as in Eqs (27) and (28)) but, for stability, use four outer iterations per time step for the larger Courant numbers. This partly explains the greater number of iterations for larger Courant numbers. A solver tolerance of 10 −8 is used for each of the outer solves. Table 2 shows that the total number of iterations per time step reduces slightly as resolution increases. The total number of iterations for a complete simulation is reduced by using larger Courant numbers because the number of iterations per time step increases less than linearly with increasing Courant number. In fact, simulations with larger Courant numbers are considerably cheaper because there are fewer evaluations of the right-hand side of the matrix equation.

Cost comparisons
Combining the number of solver iterations, the number of multiply operations per solve and the number of multiply operations in calculating the explicit higher-order part of the advection scheme, the total number of multiply operations per cell per time step for the multi-dimensional scheme is shown in Table 3 for the non-orthogonal mesh of 120 × 60 cells for the deformational flow test case. Table 3 also shows the number of multiplies for using the explicit, RK2 version of the multidimensional scheme and the dimensionally split scheme. Table 3 shows that the implicit scheme always uses more multiply operations but particularly uses more multiplies for large Courant numbers. The explicit version of the multi-dimensional scheme always uses 48 multiply operations but is not stable for all time steps whereas the dimensionally split scheme (using FFSL time-stepping) is stable for all Courant numbers (at this spatial resolution) and always uses the fewest number of multiply operations per cell per time step.
There is considerable flexibility in the solver configuration: the number of outer iterations per time step determines how frequently the high-order correction on the right-hand side of the matrix equation is updated, and the solver tolerance per outer iteration could be modified by using a weaker tolerance on all but the final matrix solve per time step. These options have not been explored. It may also be beneficial to create more non-zero matrix entries rather than having the higher-order correction entirely a deferred correction on first-order upwind, but such a   change would need to ensure that the matrix remains diagonally dominant. We also measured the wall-clock computation cost, comparing the implicit and explicit versions of the multi-dimensional scheme running the deformational test case in serial on a non-orthogonal mesh with a maximum Courant number of 1. At the resolution of 120 × 60, the model using implicit time-stepping is 5% more expensive that the explicit model. This drops to 2.5% at the resolution of 960 × 480. Despite the implicit scheme having four times as many multiplies as calculated for Table 3, the difference in wall-clock time is negligible. This demonstrates both the danger of only looking at the number of multiplies and the danger of using results of timing code. The linear equation solvers on OpenFOAM have undergone years of optimization whereas the stencils for calculating the cubic fit have not been optimized and entail unpredictable memory access patterns and likely frequent cache misses.

Summary and conclusions
We examine the errors associated with using a dimensionally split advection scheme and a multi-dimensional advection scheme on distorted meshes. The dimensionally split scheme is the piecewise polynomial method (PPM; Colella and Woodward, 1984) with COSMIC splitting (Leonard et al., 1996) which extends it to two spatial dimensions. PPM converges with third-order in one dimension and COSMIC splitting enables second-order convergence in two dimensions. PPM is a FFSL scheme and so can handle large Courant numbers accurately without significant additional computational cost, with a timestep restriction based on the deformational Courant number. The second-order accurate multi-dimensional scheme is split in space and time (method-of-lines) and uses a cubic polynomial fit over a stencil of cells for spatial discretisation and explicit RK2 or Crank--Nicolson in time to retain stability for large Courant numbers. The multi-dimensional scheme uses cell centroids as Gauss points so the order of accuracy is limited to two. We use versions of both schemes without any monotonicity constraints in order to compare the handling of multi-dimensionality of the two schemes and order of convergence rather than comparing the limiters of the two schemes. When run explicitly, the cost of the two schemes is similar. These schemes are therefore considered suitable for comparison.
Three two-dimensional advection test cases on Cartesian planes are proposed without the complexities of a spherical domain or multi-panel meshes but with distorted meshes to mimic the distortions near cubed-sphere edges or the distortions of terrainfollowing co-ordinates. We therefore propose that these test cases could be used in the initial testing of advection schemes, before the generation of meshes on the sphere. The first test case is an extension of the Leonard et al. (1996) solid-body rotation using a distorted mesh. The second test case is the established horizontal advection over orography (Schär et al., 2002) using a basic terrain-following mesh in order to maximize the distortions due to terrain-following coordinates. The third test case is the deformational flow test case of Lauritzen et al. (2012) adapted to a planar, Cartesian domain and a distorted mesh. We use the version of this test case with smooth initial conditions (the sum of two Gaussians) in order to examine order of convergence.
The dimensionally split scheme is most accurate on orthogonal meshes but only loses a little accuracy on highly distorted meshes, despite a first-order departure point calculation. The multi-dimensional scheme is almost entirely insensitive to mesh distortion and asymptotes to second-order convergence at high resolution. As is expected for implicit time-stepping, phase errors occur when using long time steps but the spatially well-resolved features are advected at the correct speed and the multidimensional scheme is always stable. On orthogonal meshes, the dimensionally split scheme remains highly accurate when large Courant numbers are used (as long as the deformational Courant number is low). However, on distorted meshes, the split scheme loses accuracy at large Courant numbers.
Computational cost is estimated by the number of multiplies. The matrix solver associated with the implicit time-stepping of the multi-dimensional scheme means that it is always more expensive than the split scheme with cost increasing with Courant number. We have not investigated sensitivity to solver and pre-conditioner choices and it may be possible to do better. However in practice we found that the implicit solution contributed only 2.5-5% of the total wall-clock computation time. The FFSL method enables long time steps without any matrix solutions and this will always be difficult to beat. We also use the multi-dimensional scheme with explicit time-stepping, making the cost similar to the dimensionally split scheme.
The dimensionally split scheme is hard to beat, providing close to third-order convergence, even in the presence of mesh distortions, and can be very cheaply extended for large Courant numbers.