From reliable weather forecasts to skilful climate response: A dynamical systems approach

While weather forecasting models can be tested by performing and evaluating many hindcasts, the limited observational record restricts the degree to which climate projections can be evaluated. Therefore a question of interest is: to what degree can we evaluate the potential skill of a climate model's response to forcing by assessing the reliability of short‐range weather and seasonal forecasts produced by the same model? We address this question using a dynamical systems framework. We use linear response theory to provide the mean climate response of a general dynamical system to a small external forcing. We relate this response to the reliability of initial value forecasts. We find that, in order to capture the mean climate response, the forecast model must correctly represent the slowest evolving modes of variability in the system. The reliability of forecasts on seasonal and longer time‐scales, which is sensitive to the representation of these slow modes, could therefore indicate if the forecast model has the correct climate sensitivity and so will respond correctly to an applied external forcing. In this way, the skill of initialized forecasts could act as an ‘emergent constraint’ on climate sensitivity. However, we also highlight that unreliable seasonal forecasts do not necessarily indicate an incorrect climate projection. This is because correctly representing rapidly evolving modes is also necessary for reliable seasonal forecasts.


INTRODUCTION
The "seamless prediction" paradigm has gained much momentum in recent years. Since accurate weather and climate prediction both depend on the representation of fast physical processes in the climate system, it has been proposed that the weather and climate communities should work in collaboration to produce and evaluate Earth-system models (Brunet et al., 2010). The nonlinear nature of the atmospheric equations of motion results in upscale transfer of errors from smaller to larger scales (Lorenz, 1969;Palmer, 2001;Tribbia and Baumhefner, 2004). However slower evolving modes can also provide predictability on shorter time-scales (Hoskins, 2013;Vannitsem and Lucarini, 2016). A unified approach to evaluating weather and climate models can therefore provide insight for both communities by evaluating models across a wide range of time-scales (Phillips et al., 2004;Hurrell et al., 2009). This development has been made possible by advances in computational resources, which has allowed the resolution of global models to approach that of the numerical weather prediction community (Haarsma et al., 2016;Roberts et al., 2018), and which has enabled weather forecast models to be integrated out towards the long lead times of interest to the climate community.
Following the seamless prediction paradigm, several authors have initialized Earth-system models for production of short-range forecasts, where verification is possible, before using these models for climate applications (Phillips et al., 2004;Hurrell et al., 2009). This approach has proven useful in diagnosing fast physics biases (e.g. Hannay et al., 2009;Medeiros et al., 2012;Williams et al., 2013), as models are observed to develop biases over days to weeks that are similar to their climatic biases. This indicates that the subgrid-scale physics schemes are the source of the biases in the climate model, and allows attribution of the error to a particular parametrization scheme. Palmer et al. (2008) and Weisheimer and Palmer (2014) go a step further and argue that the reliability of a model on shorter time-scales is a necessary criterion for skilful climate projections. The authors argue that, due to the nonlinearity of the climate system, certain processes (such as radiative feedbacks) influence both seasonal and climate time-scales. The fidelity of these processes should be evaluated by considering short-range seasonal forecasts, before using these forecasts to calibrate climate change projections Weisheimer and Palmer, 2014;Palmer and Weisheimer, 2018). This is an attractive proposal: instead of waiting decades to verify climate projections, models can be evaluated today using initialized forecasts. This information can be used to constrain Earth's climate sensitivity by excluding models which perform poorly with respect to this metric, providing information about likely anthropogenic climate change. In other words, it is proposed that the skill of initialized forecasts could be used as an 'emergent constraint' (Klein and Hall, 2015), that is, an empirical relationship between characteristics of a model's behaviour in today's climate and that model's climate sensitivity. However, no rigorous theory has been presented to justify linking the reliability of short-range forecasts with the accuracy of climate change projections. In this paper, we seek to address this problem.
A natural framework for considering the impact of a small forcing perturbation on the state of a system is linear response theory. For example, the fluctuation-dissipation theorem (FDT) recognises the intimate relationship between natural fluctuations that excite the system of interest, and the damping effect of these fluctuations on the system's response to an external forcing (Kubo, 1966). Through this relationship, the properties of the fluctuations in the system can be used to predict the system's response to a forcing. Leith (1975) was the first to apply these ideas to the climate system, expressing the climatic response of a system in equilibrium to a constant external forcing. Response theory and FDT have been applied to the climate system with some success (Cionni et al., 2004;Gritsun and Branstator, 2007;Cooper and Haynes, 2011;Cooper et al., 2013;Achatz et al., 2013). Further work has considered a general linear response theory for systems far from equilibrium (Ruelle, 1998;Lucarini and Sarno, 2011;Lucarini et al., 2014;Gritsun and Lucarini, 2017), demonstrating that many of the classical results at equilibrium still hold.
On the other hand, a natural framework for considering predictability of initial value forecasts on short time-scales is dynamical systems theory (e.g. Smith et al., 1999 give a summary). There is a wealth of literature describing the use of dynamical systems theory to understand the atmospheric predictability problem (e.g. Lorenz, 1996;Palmer, 1999;Franzke et al., 2015), with simple dynamical systems such as those of Lorenz (1963) and Lorenz (1996) becoming widely used (e.g. Arnold et al., 2013;Trevisan and Legnani, 1995;Smith et al., 1999;James, 2002;Wilks, 2005;Vissio and Lucarini, 2018). The framework is particularly useful for understanding how small perturbations grow in time, and project most strongly in preferred directions. In this manuscript, we use the phrase mode to refer to a preferred pattern of growth (or variability) in the system in question. Such a mode will have an associated time-scale of growth. Examples could include baroclinic instability or the El Nino-Southern Oscillation. On the other hand, processes are physical phenomena, such as cloud microphysics or percolation of water through soil. While processes typically occur on a particular time-scale and in a particular location, they may impact the evolution of modes with a wide range of space-and time-scales.
In this paper, we seek to draw together the existing body of literature on response theory and dynamical systems to explore whether there is a link between the reliability of weather forecasts and the response to external forcing. This will enable us to explore the possibility of using initial value forecasts as an emergent constraint for assessing the climate characteristics of a model in the way described by Palmer et al. (2008). In Section 2, we review the existing literature concerning error growth in initialized forecasts, and thereby define the properties of a forecast model that will enable it to produce reliable weather forecasts. In Section 3, the response of a dynamical system to an applied forcing is derived as an analogy for the climate change problem, and the resultant response equations validated in the Lorenz (1963) dynamical system. In Section 4 we draw on results from the previous sections to discuss the relationship between forecast reliability and climate sensitivity: in Section 5 the more complex two-scale Lorenz (1996) dynamical system is introduced as a testbed to evaluate this proposed relationship. Finally, the results are discussed and conclusions are drawn in Section 6.

THE DYNAMICAL SYSTEMS FRAMEWORK
Our discussion is largely framed using dynamical systems theory. A dynamical system is described by a set of nonlinear ordinary differential equations: where the system's m-dimensional state vector is denoted in bold x. The state vector characterises the state of the dynamical system: given an initial condition, x 0 , and the equations of motion, F, the future trajectory of the system is completely determined.
Error growth in dynamical systems, with a view to the atmospheric forecasting problem, is expounded in Smith et al. (1999). Here we provide only a brief summary. First we define the Jacobian, J, of the dynamical system in Equation 1: which will be written The Jacobian gives the linearized equations of motion for the system. It summarizes the instantaneous local growth of infinitessimal uncertainties in the system, and thereby governs how small perturbations to the dynamical system, such as errors in the initial conditions, = x − x 0 , will evolve in time: Correctly representing the Jacobian at all locations in phase space will ensure correct initial error growth in the forecast model.
Note that, since J is non-normal in general, it has non-orthogonal eigenvectors. Non-normality means that, even if J has only negative eigenvalues, the system can still display instantaneous error growth (Smith et al., 1999). Furthermore, the fastest growing mode will not necessarily be excited by a perturbation along the leading eigenvector. The instantaneous growth rate will evolve in time, and asymptote towards the leading eigenvalue of J. For this reason, it is also helpful to introduce the tangent propagator, which maps a perturbation at t 0 to that observed at a later time, t (Molteni and Palmer, 1993;Smith et al., 1999): Note that we have made no linearity assumptions in defining M and that, following this definition, M is defined for all t, without requiring t to remain small. However, while t does remain small, M can be approximated by: .
The magnitude of the perturbation, t , is given by While M is not necessarily symmetric, M T M is symmetric by construction, such that the eigenvalues of M T M form an orthogonal basis set. The singular value decomposition of M is M = U V T . The matrix V contains the right singular vectors of M, which describe the perturbation directions which will maximize the rate of growth of || t ||. The leading left singular vectors contained in U define the direction of this optimal growth. The diagonal matrix, , contains the singular values, i . The rate of growth of || t || is governed by the eigenvalues of M T M, given by 2 i , also known as the local finite-time Lyapunov exponents (Smith et al., 1999). In the limit t → ∞, the finite time Lyapunov exponents converge to the global Lyapunov exponents.

Reliability in dynamical systems
Let us now consider reliability in the dynamical systems framework, as this was highlighted by Palmer et al. (2008) as a potential predictor of the climate prediction skill of Earth-system models. A probabilistic forecast is reliable if the forecast probability distribution function (pdf) is a good indicator of the uncertainty in the forecast. In other words, given a particular forecast pdf, the verification (i.e. observation) will behave as if it is drawn from that predicted distribution (Wilks, 2006) at a range of lead times. To assess reliability, statistics are aggregated from many initial start dates to test whether the verification follows the distribution predicted by the ensemble. However, this sampling also samples over the system's climate distribution. Stratification of forecasts by predicted ensemble spread (Leutbecher and Palmer, 2008;Christensen et al., 2015a) is a first step towards removing this averaging over the attractor, as a partial test of state-dependent reliability. A probabilistic forecast is generated by creating an ensemble (or set) of forecasts, where individual members of this ensemble explore the sources of uncertainty present in the forecast (Slingo and Palmer, 2011). Dynamical systems theory can be used to consider the rate of growth of small perturbations between different ensemble members, and is therefore a natural framework for discussing reliability. For a reliable forecast, the expected divergence of two randomly selected ensemble members must match the expected rate at which the observed truth diverges from a randomly selected ensemble member. Figure 1 shows a schematic of the scenario of interest, indicating the evolution of the ensemble forecast and observed truth, given a degree of uncertainty in the initial conditions. Note that, since the truth and model forecast live in different spaces, one must first perform an operation of projection to compare the two in this way. We assume for simplicity that the initial condition uncertainty is known, and is correctly accounted for in the ensemble forecast. While this is a major assumption, it allows us to focus on the dynamical properties of the forecast system. The reliability of the estimated initial condition uncertainty can be assessed separately (Rodwell et al., 2016).
Using the superscript t for the "true" system, and f for the forecast model, consider the difference between the truth and one ensemble member: which includes the impact of a potential state-dependent model error, b(x, t) (i.e. the difference between the dark red and pale blue dashed lines in Figure 1), as well as the random component of error growth due to the uncertainty in the FIGURE 1 Schematic of reliability in a dynamical system framework. The blue oval indicates initial condition uncertainty, the centre of which is marked by the reference point x 0 . Pale blue lines represent ensemble forecast members, while the dark red line represents the observed truth, with time increasing towards the right. The dashed pale blue and dashed dark red lines represent the time evolution of the reference point x 0 using the forecast model and "true" equations respectively: the difference between the dashed lines is due to model error [Colour figure can be viewed at wileyonlinelibrary.com] initial conditions. In contrast, two randomly selected ensemble members will diverge following where f (0) was their initial separation. All methods for evaluating reliability assess whether the verification behaves similarly to the members of the ensemble forecast. From Equations 7 and 8, it is evident that assessing reliability, that is, by comparing the rate of divergence of ensemble members with the rate of divergence of the verification from the ensemble, will indicate whether the forecast tangent propagator, M f (t, 0, x 0 ) is adequately representing the "true" tangent propagator, M t (t, 0, x 0 ), as long as the model error term b(x, t) is small.

PREDICTING THE FORCED RESPONSE OF A DYNAMICAL SYSTEM
We next consider the climate change problem: the goal is to predict how the climate of a dynamical system will respond to an imposed forcing. Consider applying a small forcing perturbation F to the dynamical system in Equation 1.
The unperturbed system could represent the Earth's climate system before 1850, while the perturbed system would include the additional forcing from anthropogenic emissions of greenhouse gases. In general the forcing perturbation, F, is allowed to be a function of the state of the system, F(x), but this need not be the case.
The FDT seeks a linear relationship between a small applied forcing and the mean climate response (Gritsun and Branstator, 2007). The theory predicts that such a response of a system to a forcing can be written, following the notation in Cooper and Haynes (2011): where the system's pdf is written (x), and we assume that the pdf is differentiable. The angle brackets indicate the expectation taken over the attractor, for example, through taking an ensemble mean over a large number of initial conditions. The response at time t, x(t), is determined by the integrated effect of forcing perturbations at all earlier times, t − . Note that the "linearity" of the response theory stems from the assumption that the forcing perturbation is sufficiently small for the response to be linearly proportional to the forcing. It makes no assumptions about the dynamics of the system. Leith (1975) made the assumption that (x) can be approximated by a Gaussian, and so expressed the climatic response of a system in equilibrium to a constant external forcing as: where C and C 0 are the covariance matrices at lag and 0 respectively.

Ruelle's response theory
For some systems the unforced climatological pdf is undefined. For example, a climatological pdf does not exist for any chaotic system which explores a strange attractor. For these systems, Ruelle (1998) derived a generalized response theory which showed that many of the classical results still hold. Here we will re-derive the linear response theory of Ruelle, making some simplifications, and using the notation already defined in this paper to highlight the links to dynamical systems theory. As before, let a dynamical system be representeḋx = F(x). Let a small forcing perturbation be turned on at time t 0 . The system can now be written: where the small forcing perturbation has explicitly been split into a time-dependent and space-dependent component. Following Ruelle (1998), we can write the change in the system's trajectory, x, over the time window (t 0 , t) due to the forcing as which includes the familiar tangent propagator (equivalently, linear resolvent operator), acts as a Green's function, mapping the influence of a small perturbation at time t ′ onto the state of the system at some later time, t. Next, defining the expectation value of an observable of interest, , as 1 and, following Ruelle (1998), we can write the change in the expected observable due to the forcing perturbation as a function of time, t, as where we have also extended the integral in time to consider the response due to all past perturbations, and the expectation is calculated over the climate distribution of the unperturbed system, (x). Instead of proceeding for a general observable, consider only the response of the mean, = x. While this is a simplification, even constraining the mean response of the Earth system (the climate sensitivity) is of substantial interest. The equation simplifies to If we make the assumption that the function F(x) in Equation 1 has no time dependence, we can replace the pair Changing the variable of integration, we find Increasing the new index of integration, , represents looking further into the past, mapping the impact of past perturbations on to the present state of the system at a time lag = 0.
We will ultimately consider the case of a time-independent forcing. However it is instructive to first consider the more general case of time-varying forcing. We proceed by taking the Fourier transform of Equation 17 where we have written the Fourier transform of (t) as̃( ).
We have arrived at a key result of response theory, namely that the response of a dynamical system to a small forcing perturbation is local in frequency space, such that where the linear susceptibility,̃( ), is As a thought experiment, if the system were linear, we note that the tangent propagator could be written M( , x) = exp [L ], where L is independent of x. Only the forcing, f(x), depends on x. Writing the forcing averaged over the attractor as ⟨f⟩ and integrating, where we have evaluated the upper limit using the fact that the unperturbed system is in a steady state: perturbations must be damped over long time lags.
Returning to the nonlinear case, it is difficult to proceed more generally from Equation 20. However, we can first consider the case of a state independent forcing perturbation, f(x) = f. Defining M( ) as the integral of M( , x) over the attractor, the linear susceptibility reduces tõ The response to forcing of a system is given by the integral under the M( ) curve, where M( ) expresses how small perturbations to the system affect the average evolution of the system. For a forcing that is constant in time, the linear susceptibility can be writteñ For a forcing that is constant in time, the response is also constant in time, as predicted by Equation 19.

Response to finite forcing in a nonlinear system: the Lorenz '63 System
The extent to which the response Equation 24 is true for a non-infinitesimal external forcing is evaluated in an example dynamical system, the Lorenz (1963; L63) system: The system is a simplified model of Rayleigh-Bénard convection, where X describes the intensity of the convection, Y is proportional to the difference in temperature between ascending and descending plumes, and Z describes the deviation of the vertical temperature profile from its unperturbed state. The parameters , r and b are the Prandtl number, the Rayleigh number and a parameter characterizing the size of the convective cell. The standard values of = 10, r = 28 and b = 8∕3 produce the familiar Lorenz "butterfly" attractor, shown in Figure 2a. The L63 system is commonly used in atmospheric science as a testbed both for atmospheric predictability studies, and for testing new mathematical techniques, including data assimilation and sensitivity analysis (e.g. Miller et al., 1994;Trevisan and Legnani, 1995; The response of the Lorenz system to a forcing perturbation. A sample of the trajectory of the (a) unforced and (b) forced system (grey lines) and their respective fixed points (black crosses) in the X-Y plane. The corresponding marginal probability density functions of (c) the unforced system and (d) the forced system in the X direction. The Lorenz system is forced following Equation 26 using f = 10 and = 3 ∕4 Smith et al., 1999;Lucarini, 2009;Van Leeuwen, 2010;Marzban, 2013). Consider the impact of a small forcing perturbation to the evolution of the L63 system. Following Palmer (1999), a possible small forcing perturbation to that system can be written: The L63 system is integrated for one million model time units (MTU) using a fourth-order Runge-Kutta scheme with a timestep of 0.01. Figure 2a shows a sample of the trajectory from an unforced simulation with f = 0. The response of the Lorenz system to a forcing of f = 10 and = 3 ∕4 is shown in Figure 2b. The time series are smoothed using a 0.7 model time unit running average for both the forced and unforced systems, and the pdfs of the smoothed time series are shown in Figure 2c,d. While (b) indicates that the overall "shape" of the attractor is unchanged, (d) shows that the system now spends more time in one wing of the attractor than the other.
We test the linearity of the true response to increasing forcing by producing a set of forced simulations, each with a different value of f . Figure 3 shows the observed response in each of the X, Y and Z directions as a function of f . For X and Y, the climate response is remarkably linear with magnitude of the imposed forcing, even up to large values of f = 10. For Z, the climate response is not linear: for small f , Z is zero, while for larger f , the response is small but positive. Note that, interestingly, the forcing on X is negative, but the response is positive (Palmer and Weisheimer, 2011).
To test whether the response Equation 24 can predict the mean climate response of the Lorenz system, we must estimate the tangent operator averaged over the attractor for the unforced system, M( ), as a function of lag, . A long unforced simulation is produced spanning 10 8 MTU. Every 2 MTU, a set of three simulations are branched from the first, each of duration 10 MTU. The starting conditions of these three simulations are generated from the parent simulation, with an additional initial condition perturbation of ( , 0, 0), (0, , 0), and (0, 0, ) respectively, where is set to 0.01. The growth of the initial condition perturbation is used to estimate M( , x) following Equation 4. The mean is calculated over the samples as an estimate of the average over the attractor, M( ). We also calculate the mean for ten sub-samples, to indicate the uncertainty in our estimate.
Having estimated M( ), the predicted climate response is calculated following Equation 24. Figure 4a-c shows the projection of the forcing through the tangent propagator, M( )f, as a function of time lag, in each of the X, Y and Z directions. The integration of this quantity over is shown in (d)-(f), which will tend towards the solution of the response equation as the sample size increases, and as the upper limit, , is increased towards ∞.
We compare this predicted response to the observed response in the L63 system shown in Figure 3. In each panel, the circles with error bars at i, j, and k indicate the predicted response for f = 10, estimated at = 4, 6 and 9 MTU respectively. The error bars show uncertainty in the predicted response: the bold capped error bars indicate the ±1 interval, while the thin uncapped bars indicate the maximum and minimum estimates from the subsample. The response equation gives a good prediction of the climatic response to an imposed forcing in the Lorenz system in the X and Y directions, even for forcings as large as f = 10 (note for comparison that the standard deviations of X, Y, and Z are 9.0, 9.0 and 8.6 respectively). In the Z direction, the response equation predicts a climatic response of Z = 0, which is observed for small f . However by f = 10, the response in Z has become nonlinear.
To achieve a good estimate of (0), we must estimate M( ) out to large lags . Unfortunately, as the lag increases, the sample size needed to adequately sample the attractor becomes prohibitively large, even for a system as simple as the Lorenz system. This problem is well known (Eyink et al., 2004). Several approaches have been proposed to estimate M( ) at long lags, and thereby calculate the response, though each approach is limited by necessary assumptions and approximations (Eyink et al., 2004;Thuburn, 2005;Abramov and Majda, 2007;Cooper and Haynes, 2011).

THE PROPOSED LINK BETWEEN WEATHER FORECASTS AND CLIMATE PREDICTION
Equation 24 provides a useful framework for considering the mean response of a dynamical system to a forcing perturbation. It indicates that a forecast model which correctly represents the tangent propagator for each state on the attractor, and therefore also its integral M( ), will respond to forcing in a similar manner to the "true" system. Section 2.1 discussed how error growth statistics in initial value forecasts, assessed by measuring forecast reliability, are also dependent on correct representation of the tangent propagator. Does this indicate that the reliability of existing simulations performed by weather forecasting centres could be used to evaluate if that forecast model would produce the correct forced climate response?
For a short-range initial value forecast on time-scales of hours to days, the divergence of ensemble members will be dominated by the fastest growing modes (the leading singular vectors of M( ) for a of hours to days), so capturing these modes correctly is sufficient for reliable short-range forecasts. Slower growing modes will also result in ensemble divergence, but on longer time-scales. Short-range weather forecasts are not long enough to detect the fidelity of these slower growing modes in a forecast model. However, correctly predicting the response to forcing requires an accurate representation of M( ) for these longer time-scales. Furthermore, for the L63 system, Figure 4 shows that the behaviour of M( ) at the shortest time-scales (< 0.5 MTU) is very different to the behaviour at longer time-scales, so in that system at least, short-range reliability does not necessarily indicate long-range reliability.
Following an alternative line of reasoning, we have seen that for a linear system the response These equations suggest that it is correctly representing the inverse of the (effective) Jacobian that is important for a forecast model to capture the response of a system to an external forcing. While the eigenvectors of J −1 eff are the same as those of J eff , the eigenvalues are the inverse. Therefore the larger magnitude eigenvalues of J eff , which contribute to the most rapid growth of perturbations in initial value forecasts, will have little influence on the response of a system to a forcing, FIGURE 5 A schematic suggesting the required characteristics of a forecast model for reliable forecasts and for skilful climate prediction. While reliable longer-range forecasts could indicate that slow growing modes are correct, as needed for skilful climate prediction, unreliable longer-range forecasts could be due to either incorrect fast or slow growing modes. The precise time-scales corresponding to "shorter"-and "longer"-range forecasts will be system-dependent [Colour figure can be viewed at wileyonlinelibrary.com] as 1∕ will be very small. Instead, the response to forcing will in general be dominated by the eigenvalues corresponding to the slowest evolving modes. Therefore, in order to judge how well a forecast model will predict the response to an external forcing, it is these slow modes that must be verified through initial value forecasts, not the fastest modes. This is a much harder task (Vannitsem and Lucarini, 2016).
We therefore propose a more nuanced link between the reliability of initial value forecasts and the forecast model's response to an imposed forcing. While the reliability of short-range forecasts (on the scale of hours to days) could provide some information on the fidelity of the forecast model's fastest growing modes, it is the fidelity of the slower growing modes that are important for a correct response to focing: the integral over M( ) extends to many decades or centuries, and it is this long tail that will dominate the response. It is possible that the reliability of forecasts on longer time-scales (e.g. seasonal or longer) could provide information about the evolution of these slower modes. This would provide information relevant to the forced response of the system. However, unreliable longer-range forecasts would not necessarily indicate that the slow-growing modes have not been captured, as incorrect representation of fast growing modes would infect the forecast at longer time-scales. It is therefore important to also validate forecasts at shorter time-scales to assess sources of unreliability at longer time-scales. Figure 5 summarizes the necessary characteristics of the forecast model for a particular result.
A specific exception to this argument comes if the forcing projects directly on a growing mode of the system (a singular vector of M( ) for the of interest). In that case, the response of the forecast model to forcing depends solely on correctly capturing the corresponding mode. Since singular vectors evolve in time, it is possible this situation will not arise in the case of the Earth system.

TESTING THE THEORY IN THE LORENZ '96 SYSTEM
To test the proposed link between reliability of shorter and longer time-scale forecasts and the response of a system to forcing, a second dynamical system is considered for which we can also develop a forecasting model. The system of choice is the second system proposed in Lorenz (1996), which we call the L96 system. This system describes a set of coupled equations for two types of variables arranged in a ring: where the variables have cyclic boundary conditions; X k+K = X k and Y j+JK = Y j . The X k variables are large-amplitude, low-frequency variables, each of which is coupled to many small-amplitude, high-frequency Y j variables. The parameter values used in this study are shown in Table 1; for these values, the system undergoes chaotic behaviour. The variables are scaled such that one model time unit is approximately equal to five atmospheric days (Lorenz, 1996). Lucarini and Sarno (2011) have demonstrated that the simplified one-scale L96 system is a suitable testbed for considering linear response theories, providing some justification for the use of the more sophisticated, albeit still simple, two-scale version to validate this theory. The L96 equations are integrated for 10,000 MTU with a time step of 0.001 MTU and F 0 using a fourth-order Runge-Kutta scheme. Note that 1 MTU is equivalent to approximately 5 "atmospheric days," deduced by comparing error-doubling times in the real atmosphere and the L96 system. This control simulation is used to derive the forecast models described below.
Two types of forecast model are developed for the L96 system: a linear inverse model, and a dynamical (nonlinear) stochastically parametrized forecast model. Both forecast  Time-scale ratio c 10 models are constructed assuming only the large-scale X variables are predicted, while the Y variables are unresolved.

A linear inverse model (LIM) is a statistical model that is
constructed to capture the covariance properties of a system. It is empirically derived from an observed time series of a dynamical system, and leads to an accurate linear representation of the system, which cannot necessarily be achieved by linearizing the governing equations of motion (Branstator and Haupt, 1998). This linear approximation to the full nonlinear equations of motion can give significant insight into the system's dynamics. For example, LIMs are used to great effect in a climate system context to model the sea surface temperatures in the Pacific (Penland and Sardeshmukh, 1995;Newman, 2007) and Atlantic Oceans (Hawkins and Sutton, 2009;Zanna, 2012). To construct a LIM, following Penland and Sardeshmukh (1995), we rewrite Equation 1 as:̇x = x + n(x), where F(x) has been decomposed into a linear component, x and a nonlinear component n(x), and x is the state vector of the system, x = [X; Y]. The aim is to construct a forecast model of L96 given observations of the large-scale X variables. The simplest trial model is a linear system driven by Gaussian white noise. We therefore approximate Equation 28 asẊ = LX + , where now only the "resolved" X variables are considered, and the linear feedback matrix, L is the appropriate subset of . The nonlinear terms have been replaced with white noise forcing, . The most likely solution to Equation 29 is given by the propagator, G: The linear feedback matrix, L, can be estimated from the time series of X as where C and C 0 are the covariance matrices at lag and 0 respectively. The properties of the white noise forcing can be calculated through the fluctuation-dissipation relation (Penland and Sardeshmukh, 1995): Penland and Sardeshmukh (1995) present a series of tests to validate whether a LIM is a good model of the chosen system. In particular, if a LIM is a good model of the system, the matrix L should be independent of the lag used to estimate it. This is tested for the L96 model in Figure 6, which shows the 2-norm of L as a function of the lag, , used to estimate it (Penland and Sardeshmukh, 1995). Below a critical value of providing support for the use of a LIM. A further requirement is that the fast processes need to be sufficiently mixed (Penland, 2003), such that the rapidly interacting nonlinear terms can be represented as noise. If a LIM does not provide good forecasts, it is an indication that this assumption might not be valid for a particular system.
Because of the linearity of the LIM, the response equation can be written: where the right-hand side has been simplified by recognising that the forcing perturbation is independent of the state, X. Note that this is also consistent with the result from classical FDT, since within the LIM framework, we can rewrite Equation 12 as since the real part of the eigenvalues of L are negative.
Equation 33 can be used to compute the response of the LIM to forcing. The control simulation of the unperturbed L96 system with F 0 was coarsened by selecting every fifth data point (every 0.005 MTU) to represent the discreteness of observations of the system. Following the results of the sensitivity testing shown in Figure 6, we use = 0.1 MTU to fit L and Q. Table 2 indicates the properties of the fitted L, summarized by its eigenvalues, = ± i . There are two purely damped modes, and three oscillatory modes (each consisting of a complex conjugate pair of eigenvalues), such that the total number of modes is K = 8.

5.2
The stochastically parametrized forecast model A second type of forecast model is constructed, more similar to the type of model used to make modern weather and climate predictions. This second model assumes knowledge of the underlying dynamical equations that describe the L96 system, which can be used to make a forecast. However, due to limited computer resources, a compromise is necessary and the effect of the small scale, Y variables on the X variables must be parametrized. This is equivalent to a weather or climate model where the equations of motion (the Navier-Stokes equations) are represented in the atmospheric simulator, but the sub-grid processes such as convection, gravity waves, and turbulence, must be parametrized due to computational constraints.
The stochastically parametrized forecast model used here is described in Arnold et al. (2013). In short, the impact of the Y variables on the X variables are parametrized using a cubic function of X. The parameters in this cubic function are fitted using observations taken from the "true" L96 system. In the full system, there is considerable variability in the sub-grid Y forcing which is not captured by this simple cubic equation. The uncertainty in the parametrization is represented by using a multiplicative stochastic perturbation. The reduced equations of motion for the forecast X * (t) of X(t) can be written: where the stochastic perturbation e k (t) is independently generated for each X k , and follows a zero-mean first-order autoregressive process: where is the first order autoregressive parameter, 2 e is the variance of the stochastic perturbation, and z k (t) is unit variance Gaussian distributed white noise. The cubic fit parameters, b i , and the magnitude and autocorrelation of the stochastic perturbation are fitted using data from the L96 control integration (Arnold et al., 2013). Two stochastic models are considered. Both models use the magnitude of the stochastic perturbation fitted from the true system. A white noise forecast model ( = 0) is compared to a red noise forecast model, for which is fitted from the "true" tendencies.

Climate projection
The impact of a change in forcing on the mean climate of the L96 system is considered, following Christensen et al. (2015b). Two state-independent perturbations are introduced, both constant in time. The first is constant across the X variables (symmetric: F s ), while the second is applied only to X 1 (asymmetric: F a ), as shown in Table 1. In addition to the control simulation with F 0 , two further 10,000 MTU integrations are produced for each forcing perturbation in turn. The response of the mean of the L96 system to the change in forcing is shown in Figure 7 (black line), while the error bars indicate the variability in response due to the finite sample size 2 . The symmetric forcing produces a symmetric response, while the asymmetric forcing breaks the symmetry of the L96 equations, resulting in an asymmetric response as shown.
Each forecast model is integrated for 10,000 MTU with F 0 to produce estimates of the unperturbed 'climate' of the L96 system. To use the LIM as a forward model, Equation 29 is integrated using the first-order Stratonovich scheme described in Penland and Matrosova (1994) 3 with a time step of 0.005 MTU. For the stochastic dynamical models, a second-order Runge-Kutta time-stepping scheme is used with the same time step of 0.005 MTU. Two further integrations are produced with each of the forecast models for the two forcing perturbations respectively, F s and F a . The forecast models are not changed or re-tuned prior to making this "climate change" prediction: analogously to prectice in the climate modelling community, the models are tuned and verified using the present-day observed climate, before that same model is used to produce climate change projections. The differences in mean between the unperturbed and perturbed climates for each L96 forecast model are shown in Figure 7. Figure 7 shows the response to forcing predicted by the LIM compared to the true response (i.e. the response of the mean of the full L96 system) for both (a) symmetric and (c) asymmetric forcing perturbations. Figure 7 (a) shows that, while the LIM can predict that the response of the true L96 system to a symmetric forcing will be symmetric, the predicted magnitude is too large. The LIM also fails to predict the response to the asymmetric forcing F a . Figure 7 also compares the observed response of the white and red stochastically parametrized forecast models to that of the true system for both (b) symmetric and (d) asymmetric forcing perturbations. For both the symmetric and asymmetric cases, both stochastic dynamical models do a reasonable job at predicting the shape of the response of the mean. For the symmetric forcing, the red noise model (red line with downward triangles) predicts the correct response to within sampling uncertainty, while the white noise model (yellow line with upward triangles) predicts too weak a response compared to that observed (black line). For the asymmetric forcing perturbation shown in (d), both parametrized models significantly outperform the LIM. The error bars indicate no significant difference between the response of the red noise forecast model and the true system, such that the red noise model slightly outperforms the white noise case.

Reliability
Initialized forecasts in the L96 system are now considered. Given access to such "weather" forecasts, would it have been possible to predict whether the forecast models would correctly capture the response of the "true" L96 system to forcing?
We first use the LIM operator L to provide an estimate of the key time-scales and modes in the L96 system. From Table 2, the fastest time-scale modes in the linearized model have a time-scale of order a few "atmospheric days," while the slow modes have time-scales on the order of a few "atmospheric weeks." The symmetric forcing perturbation, F s is aligned with one of the eigenvectors of L (not shown). 4 Since the symmetric forcing projects onto one of the modes of the system, we anticipate that (to leading order) we need capture only that single mode to predict the response to F s . In contrast, the asymmetric forcing perturbation, F a is not aligned with any eigenvectors of L. The response of the forecast model to this perturbation will therefore be dominated by the slowest evolving modes, as argued above.
The three forecast models were used to produce short-range ensemble weather forecasts. Initial conditions were selected every 10 MTU from the full "true" L96 simulation to give a sample of 300 start dates. A 40-member ensemble forecast out to 3 MTU was produced from each start date for each model. The ensemble members for a given start date differ only in their stochastic forcing: it is assumed that the initial state is known perfectly, and so the initial conditions are not perturbed. Figure 8 shows the reliability of the three forecast models as a function of lead time. Figure 8a shows the standard deviation ("spread") of the ensemble forecasts compared to the root mean square error (RMSE) in the ensemble mean as a function of lead time for each forecast model. For a reliable forecast, the ensemble spread should match the RMSE in the ensemble mean (Wilks, 2006;Leutbecher and Palmer, 2008). From this summary diagnostic, both the LIM and the red noise parametrized forecast model are reliable at a range of lead times.
It is clear that variability in the L96 system has a narrow range of time-scales when compared to the Earth system, where processes in the stratosphere, ocean, ice sheets, and land surface introduce a wide spectrum of time-scales into tropospheric motions. Nevertheless, we will consider the reliability of comparatively "fast" and "slow" time-scales within the L96 system, to evaluate the predictions from Section 4. The "fast" ("slow") time-scales in the L96 system will be dominated by the dynamics of the Y (X) variables respectively. We select two lead times of particular interest. The eigenvalue of L that corresponds to F s , which we denote s , has a real part with magnitude 6.69, corresponding to an e-folding time-scale of 0.15 MTU. 5 We observe that error saturation in the parametrized stochastic models occurs as late as 3 MTU (Figure 8a), corresponding to the time-scale of the slowest modes in the system. To consider how well the L96 forecast models will respond to the forcing perturbations F s and F a , we therefore consider the reliability of forecasts at 0.15 MTU and 2.8 MTU (i.e. just before saturation) respectively. Considering the reliability at these time-scales will probe the relevant growing modes in the forecast model, compared to the full L96 system. The vertical lines on Figure 8a indicate the lead times of particular interest: 0.15 and 2.8 MTU. At each of these lead times, a reliability diagram is generated for forecasts of the event "X > X," where X is the climatological mean of the full L96 system, and statistics are aggregated from across all X k : the calibration functions are shown in (b) and (c) while the refinement distributions are shown in (d) and (e) (Wilks, 2006). For a reliable forecast, the calibration function, which shows the observed frequency of an event against predicted probability, should lie on the one-to-one line. The refinement distribution provides additional information about the confidence of the forecast model, indicating how often the forecast diverges from climatology (Wilks, 2006). At a lead time of 0.15 MTU ((b) and (d)), the LIM and red noise parametrized models both produce reliable forecasts: the conditional observed probability closely matches the forecast probability. Both forecast models show high confidence, with forecast probabilities that often deviate from the mean. In contrast, at a lead time of 2.8 MTU ((c) and (e)), while both the LIM and red noise parametrized models still show high reliability, the LIM shows low confidence, rarely producing forecasts that deviate from the climatological forecast: the LIM has saturated to climatology. The white noise parametrized model is unreliable at both short and long lead times, producing systematically under-dispersive forecasts.
These results show that the LIM produces reliable forecasts at a range of lead times. However, the LIM response to forcing, as shown in Figure 7, was poor. This implies that reliability is not an indicator of a good climate response. To interpret these results, we return to Equations 7 and 8. In Section 2.1, we saw that evaluating reliability could only indicate whether the forecast tangent propagator, M f (t, 0, x 0 ) adequately represents the 'true' tangent propagator if the model error term b(x, t) is small. However, Figure 8 shows that error growth rates in the LIM are too high, since the two dynamical models show predictability out to substantially longer lead times than the LIM. In fact, the dominant modes in the LIM have time-scales of 1.0 and 1.5 MTU, after which the LIM ensemble forecast approaches the climatological spread (the slower 2.5 MTU mode rapidly decays). The model error term in the LIM must be significant. The rapid growth of perturbations in the LIM compensates for the large model error, to produce reliable forecasts. Therefore, reliability is not always a good indicator of climatological response. Only if the model error is relatively small, or equivalently, if the error growth rates in the forecast model are similar to those in the true system, can reliability provide useful information about that forecast model's climatic response to forcing, when evaluated at a suitable lead time.
At 0.15 MTU, the red noise dynamical model is more reliable than the white noise dynamical model, indicating a better representation of s (the eigenvalue of L that corresponds to F s ). The red noise model is also more skilful at predicting the climatic response of the corresponding symmetric forcing perturbation, F s . The asymmetric forcing perturbation did not project onto any one eigenmode of the system, so we expect the slowest growing modes to be of most importance for predicting the correct climate response. At longer lead times, the red noise forecast maintained its reliability, indicating a good representation of the slowest growing modes of the system. This forecast model also predicted a good response to the asymmetric forcing. We find that a reliable forecast model has also responded well to an imposed forcing, as predicted by the theory.
At longer lead times, the white noise parametrized model produced unreliable forecasts. However, it is not possible to discern the cause of this unreliability. Forecasts from the white noise model were unreliable at short lead times, indicating a poor representation of the fastest growing modes in the L96 system. Errors in the fast-growing modes would infect the forecast at longer time-scales, making it difficult to assess the representation of the slower growing modes in that forecast system. From Figure 7, the white noise parametrized model was in fact able to capture the response of the L96 system to the asymmetric forcing, indicating that the model captures the slowest growing modes in the L96 system. This is as suggested by the theory, and indicated in Figure 5.
Physically, in the L96 system, the long time-scale predictability comes from the correct representation of the low-frequency X variables. These are correctly represented in both the red and white noise dynamical models. In contrast, the LIM provides a statistical representation of both the largeand small-scale dynamics: it is a linearized representation of the system, with the effect of the nonlinear dynamics incorporated into the noise. While the LIM captures the impact of the nonlinear processes in a statistical sense, it seems that the details of the nonlinear processes are important for setting the slowest evolving modes in the L96 system. The fastest evolving modes come from the unresolved sub-grid Y variables. The increased forecast reliability indicates that the correlated, red noise dynamical model captures the statistical effect of these small-scale variables on the slow scales better than the white noise model.

5.5
Implications for the development of stochastic forecasting models Recent years have seen an increase in the adoption of stochastic parametrization schemes by weather forecasting centres (Palmer et al., 2009;Sanchez et al., 2016). The stochastic approach is designed to represent uncertainty in forecasts due to the limitations of the forecast model. Without stochastic parametrizations, ensemble forecasts are unreliable. The use of stochastic schemes improves the growth of perturbations in the model, substantially improving the reliability of weather forecasts (Berner et al., 2008;Palmer et al., 2009;Christensen et al., 2015c;Leutbecher et al., 2017). The improved reliability of weather forecasts indicates an improved representation of fast-evolving modes in models. However, improving the statistics of fast processes through stochastic parametrization also systematically impacts the evolution of slowly evolving modes in climate models. For example, stochastic parametrization schemes have been shown to transform the representation of the slowly evolving El Niño-Southern Oscillation (ENSO) in climate models . In a follow-on study from Christensen et al. (2017), Berner et al. (2018) use LIM as a tool to study tropical Pacific variability in the Community Climate System Model (CCSM) version 4, and find that a simulation including a stochastic parametrization scheme had a better representation of the time-scales associated with the slowest evolving modes than a deterministic (control) simulation. Given this improvement in low-frequency variability, it is possible that including stochastic schemes in a climate model will improve the response of that model to an applied external forcing. In other words, stochastic parametrizations could lead to improved estimates of climate sensitivity.

DISCUSSION AND CONCLUSIONS
In this paper, linear response theory was reviewed and presented using the notation of dynamical systems theory. Assuming a forcing constant in time and space, and considering only the mean response of a dynamical system to an imposed forcing, we show that where ⟨x⟩ is the mean response of the system averaged over the attractor, M( ) is the tangent propagator, averaged over the attractor, which maps a small forcing perturbation at a given time onto the response at time later, and f is the applied forcing. For a climate model to correctly represent the climate sensitivity of the Earth system, it must have a good representation of M( ).
In an ensemble forecast, the tangent propagator, M, maps small initial condition perturbations forward in time, embodying how the ensemble will evolve. Reliability is the property of an ensemble forecast which evaluates whether the verification behaves as if drawn from the predicted distribution, that is, whether the ensemble members are diverging at the correct rate. It is therefore intuitive to link reliability with response to forcing, as proposed by Palmer et al. (2008).
However, we note that evaluating the reliability of a short-range weather forecast can only evaluate M( ) for ≤ the length of the forecast. In contrast, the response equation requires correct representation of M( ) out to = ∞. Furthermore, we show evidence that that the dominant response of a system to an applied forcing is governed by the system's slowest evolving modes (i.e. M( ) for large ).
Therefore, instead of looking to short-range weather forecasts, we propose that the reliability of forecasts at the time-scale of the slowest modes in the system will be of particular interest. In the atmosphere, slowly evolving modes are associated with large scales, while rapidly evolving modes are associated with smaller scales. It is possible that considering seasonal and longer time-scale ensemble forecasts, averaged over large spatial and temporal scales, could be used to constrain projections of climate change. A reliable seasonal forecast indicates that slow modes in the atmosphere-ocean-land system have been well-captured by the forecast model on the annual time-scale; in order to evaluate lower-frequency modes present in the ocean, even longer time-scales would need to be considered.
This theory therefore provides some justification for using the reliability of initialized forecasts as an emergent constraint on climate sensitivity, which until now had been postulated but not proved . However, we highlight some important caveats, which we demonstrate by producing forecasts for the simple Lorenz (1996) system. We note that an unreliable seasonal (or longer time-scale) forecast is not sufficient to deduce that the slow modes in the system are poorly captured by the forecast model. This is because a reliable seasonal forecast also requires the rapidly evolving modes to be captured. Consideration of weather forecasts on scales of hours to days could provide some insight into the representation of these fast modes. Furthermore, we demonstrate that it is possible for a forecast system to produce reliable forecasts and yet be unable to capture the response of the system to an external forcing. This is the case if the forecast model has a large model error, and yet correctly accounts for forecast uncertainty introduced by this error.
There are several limitations of the approach presented above which are worth further discussion. Firstly, the approach has considered the linear climate response to a small perturbation. It is not clear how this assumption affects the applicability of this theory to the climate system, where the applied external forcing (anthropogenic greenhouse gas emissions) far exceeds natural variation as indicated by the palaeoclimate records. However, Lucarini et al. (2017) show that linear response theory is a useful framework for the climate system, despite the large magnitude of the applied forcing. Secondly, this discussion has restricted itself to the response of the mean of the climate system. It is also of great interest how the variability of the climate system will respond to an external forcing, with implications for changes in climate extremes such as droughts and floods or extremes of temperature. No evidence has been presented to indicate that the reliability of short-range forecasts is a good indicator of the response of higher-order modes of the climate system to greenhouse gas forcing. Our discussion has assumed stationarity of the system of interest, when in reality the non-stationarity of the climate system should be accounted for (e.g. Lucarini and Sarno, 2011). It is also evident that, if the short-range forecast model only includes an atmosphere and ocean, then assessing the reliability of forecasts will only evaluate the response of those aspects of the model to forcing. Longer time-scales present in the cryo-or biosphere will not be evaluated, but are likely very important for predicting the correct climate sensitivity of the Earth-system (Hurrell et al., 2009). Finally, we note that, while model bias appears explicitly in evaluating reliability, it also appears implicitly in the response equation, as the integral is performed over the biased model attractor, which will naturally affect the predicted response.
The results presented here suggest a focus in climate model evaluation should be on the fidelity of the slowest modes in the model. However, changes to parametrization of a particular fast process made during model development will almost certainly impact the model across a range of time-scales (e.g. Neale et al., 2008;Christensen et al., 2017) due to the interaction of different scales of motion in the climate system. In Section 5.5 we discuss evidence that stochastic parametrization of fast, small-scale processes can improve the representation of both fast and slow modes in forecast models. It is precisely this multi-scale interaction which motivates the seamless evaluation of weather and climate models. While other studies have used very short time-scale forecasts to evaluate parametrization schemes and thereby constrain climate sensitivity (Rodwell and Palmer, 2007), we propose that this apparent relationship is primarily because changing a parametrization scheme can adversely impact both the faster and the slower evolving modes in the forecast model (Berner et al., 2018).
In conclusion, linear response theory provides insight into the link between the reliability of weather and seasonal forecasts, and the climatic response of a forecast model. The purpose of this manuscript is to discuss the implications that this theory has for predicting a priori whether a given climate model will be able to predict climate sensitivity, or in other words, the correct response of the climate system to an applied external forcing. We conclude that, by framing the problem in terms of a general dynamical system, there is some justification for the use of initialized forecasts to validate climate models.