On the properties of ensemble forecast sensitivity to observations

Evaluating the impacts of observations on the skill of numerical weather prediction (NWP) is important. The Ensemble Forecast Sensitivity to Observation (EFSO) provides an efficient approach to diagnosing observation impacts, quantifying how much each observation improves or degrades a subsequent forecast relative to a given verification reference. This study investigates the sensitivity of EFSO impact estimates to the choice of the verification reference, using a global NWP system consisting of the Nonhydrostatic ICosahedral Atmospheric Model (NICAM) and the Local Ensemble Transform Kalman Filter (LETKF). The EFSO evaluates observation impacts with the moist total energy norm and with recently proposed observation-based verification metrics. The results show that each type of observation contributes mainly to the improvement of forecast departures of the observed variable, possibly due to the limitation of localization in the EFSO. The EFSO overestimates the fraction of beneficial observations when verified with subsequent analyses, especially for shorter lead times such as 6 h. This overestimation can be avoided to some extent by verifying with observations, analyses from other data assimilation (DA) systems, or analyses of an independent run with the same DA system. In addition, this study demonstrates two important issues possibly leading to overestimating observation impacts. First, observation impacts would be overestimated if we applied relaxation-to-prior methods to the initial conditions of the ensemble forecasts in the EFSO; therefore, the ensemble forecasts in the EFSO should be independent of the ensemble forecasts in the DA cycle. Second, the deterministic baseline forecasts of the EFSO, which represent the forecast without DA, should be initialized by the ensemble mean of the first guess at the analysis time, not by the previous analysis.

of observations such as quality control, data thinning, and bias correction. Data denial experiments, with and without some of the observation data (Kelly et al., 2007), are a basic approach to estimating observation impacts. However, data denial experiments have become more complex in modern DA systems because many types of observations are assimilated. In addition, data denial experiments require computationally expensive long-term DA cycles to gain statistical significance (Geer, 2016).
The Forecast Sensitivity to Observation (FSO) method is a computationally inexpensive approach to estimating observation impacts. The FSO quantifies how much each observation improves or degrades a subsequent forecast. Langland and Baker (2004) proposed an adjoint FSO formulation for variational DA systems. The adjoint FSO has been implemented in operational systems (Zhu and Gelaro, 2008; Cardinali, 2009; Lorenc and Marriott, 2014). The Langland and Baker (2004) approach is a powerful tool for diagnosing observation impacts, but it requires adjoint operators of the forecast model. Liu and Kalnay (2008) and Li et al. (2010) proposed an Ensemble FSO (EFSO) for the Local Ensemble Transform Kalman Filter (LETKF: Hunt et al., 2007) without using the adjoint operators. Kalnay et al. (2012, hereafter K12) simplified the Liu and Kalnay (2008) formulation to render it applicable to any kind of ensemble Kalman filter (EnKF: Evensen, 1994). Ota et al. (2013, hereafter O13) implemented the EFSO in the National Centers for Environmental Prediction (NCEP)'s operational global EnKF system. Kotsuki et al. (2019) used the EFSO to estimate the impacts of assimilated observations on the predictability of the record-breaking heavy rainfall in Japan in July 2018. Hotta et al. (2017, hereafter H17) proposed a proactive quality control method (PQC) that detects detrimental observations with the EFSO. Lien et al. (2018) proposed an efficient approach to assimilating a new observing system using the EFSO. Recently, Buehner et al. (2018) proposed a hybrid FSO, in which DA uses the adjoint model and the FSO employs ensemble forecasts, for a hybrid ensemble-variational data assimilation system.
The FSO methods estimate observation impacts by measuring the errors of two forecasts with and without DA. Most FSO studies for NWP systems have measured errors with dry and moist energy norms relative to reference atmospheric fields. Subsequent analysis fields of the DA system (hereafter, subsequent analyses) are usually used as the reference. It is also possible to use analysis fields from other DA systems as the reference (Lien et al., 2018). Recently, Todling (2013) proposed using observation departures (i.e. observations minus forecasts) for the verification metrics. Sommer and Weissmann (2016, hereafter SW16) and Necker et al. (2018, hereafter N18) tested the observation-based metrics with the Deutscher Wetterdienst (DWD) operational regional ensemble DA system. Similarly, Cardinali (2018, hereafter C18) introduced an observation-based metric for the adjoint FSO of the European Centre for Medium-Range Weather Forecasts (ECMWF)'s global system. SW16, N18 and C18 reported that the commonly used dry and moist energy norms have the following disadvantages compared with the observation-based metrics: (a) the analysis error would be correlated with the short-term forecast error, (b) the analysis is potentially affected by model bias, and (c) the energy norms, which are dominated mainly by errors in the troposphere, insufficiently evaluate improvements in stratospheric fields.
This study aims to investigate the following three specific issues of the EFSO using a global atmospheric DA system (Terasaki et al., 2015), which comprises the Nonhydrostatic ICosahedral Atmospheric Model (NICAM: Tomita and Satoh, 2004; Satoh et al., 2008) and the Local Ensemble Transform Kalman Filter (LETKF: Hunt et al., 2007).
1. The sensitivity of EFSO impact estimates to the choice of the verification reference. We estimate observation impacts relative to four references: subsequent analyses, analyses of an independent run of the NICAM-LETKF, the ERA-Interim reanalysis (Dee et al., 2011), and observation departures.
2. The role of relaxation-to-prior methods applied to the initial conditions of ensemble forecasts in the EFSO, a previously undiscussed issue. Relaxation-to-prior methods are commonly used for covariance inflation, inflating the analysis ensemble perturbations after DA. It is unclear whether we need to apply relaxation-to-prior methods to the ensemble forecasts in the EFSO (cf. section 2.4).
3. The role of baseline forecasts that represent forecasts without DA. Gasperoni and Wang (2015, hereafter GW15) conducted deterministic forecasts from the previous analysis time to the verification time. However, GW15's approach may overestimate observation impacts, as discussed later. This study proposes an alternative baseline forecast initialized from the ensemble mean of the first guess (cf. section 2.5).
This article is organized as follows. Section 2 describes the NICAM-LETKF system and the EFSO formulation. Section 3 describes the experimental design. The results are presented and discussed in section 4. Finally, a summary is presented in section 5.

NICAM-LETKF system
As the NWP component, we use the NICAM with the sixth grid-division level, corresponding to a horizontal resolution of 112 km, and 38 vertical layers up to about 40 km. The Arakawa and Schubert scheme (Arakawa and Schubert, 1974) and Berry's parametrization (Berry, 1967) were used for the cumulus parametrization and large-scale condensation schemes, respectively. The LETKF is used as the DA component. The LETKF updates the ensemble mean \bar{x} and perturbations X \equiv \{x^{(1)} - \bar{x}, x^{(2)} - \bar{x}, \ldots, x^{(m)} - \bar{x}\} with a transform matrix, where m, x^{(i)} and \bar{x} are the ensemble size, the ith ensemble state vector and the ensemble-mean state vector, respectively. The analysis equations of the LETKF are given by:

\bar{x}^a = \bar{x}^f + X^f \tilde{P}^a Y^T R^{-1} d,
X^a = X^f [(m-1)\tilde{P}^a]^{1/2},

where H, d, Y, R and \tilde{P}^a denote the linear observation operator, the observation departure (d \equiv y^o - H\bar{x}^f), the ensemble perturbation matrix in the observation space (Y \equiv HX^f), the observation error covariance matrix, and the analysis error covariance matrix in the ensemble space, respectively. The superscripts o, f and a denote the observation, forecast (prior) and analysis (posterior), respectively. Here, the analysis error covariance matrix in the ensemble space \tilde{P}^a is defined by:

\tilde{P}^a \equiv [(m-1)I + Y^T R^{-1} Y]^{-1}.

We apply covariance localization to the observation error covariance so that the errors of distant observations have less impact (Miyoshi and Yamane, 2007). We can operate the analysis updates not only at model grid points, but also at observation points:

\bar{y}^a = H\bar{x}^f + Y \tilde{P}^a Y^T R^{-1} d,
Y^a = Y [(m-1)\tilde{P}^a]^{1/2}.

Namely, we can obtain the ensemble-mean analysis state \bar{y}^a and perturbation Y^a in the observation space within the LETKF. In realistic problems, there would be nonlinear observation operators such as satellite simulators, for which the following approximations are needed:

H\bar{x}^f \approx \frac{1}{m}\sum_{i=1}^{m} \mathcal{H}(x^{f(i)}),
Y \approx \mathcal{H}(\bar{x}^f \mathbf{1} + X^f) - \left[\frac{1}{m}\sum_{i=1}^{m} \mathcal{H}(x^{f(i)})\right] \mathbf{1},

where \mathcal{H} is the nonlinear observation operator, and \mathbf{1} denotes a row vector with all m elements equal to 1. The EnKF needs variance inflation to mitigate the under-dispersive ensemble (Houtekamer and Mitchell, 1998).
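To make the ensemble-space update above concrete, the following minimal numpy sketch implements the analysis equations for a single local patch with a diagonal R. This is illustrative code of our own (not from the NICAM-LETKF); localization and inflation are omitted for brevity, and all names are assumptions.

```python
import numpy as np

def letkf_update(Xf, Yf, d, R_diag):
    """Toy LETKF analysis update in ensemble space.
    Xf: (n, m) prior perturbations; Yf: (p, m) prior perturbations in
    observation space; d: (p,) departures y_o - H(xbar_f);
    R_diag: (p,) diagonal observation error variances."""
    m = Xf.shape[1]
    Rinv = 1.0 / R_diag
    # P~a = [(m-1) I + Y^T R^-1 Y]^-1 in the m-dimensional ensemble space
    Pa = np.linalg.inv((m - 1) * np.eye(m) + Yf.T @ (Rinv[:, None] * Yf))
    # Mean-update weight vector: w = P~a Y^T R^-1 d
    w = Pa @ (Yf.T @ (Rinv * d))
    # Symmetric square root Wa = [(m-1) P~a]^(1/2)
    evals, evecs = np.linalg.eigh((m - 1) * Pa)
    Wa = evecs @ np.diag(np.sqrt(np.maximum(evals, 0.0))) @ evecs.T
    dxbar = Xf @ w   # analysis increment for the ensemble mean
    Xa = Xf @ Wa     # analysis perturbations
    return dxbar, Xa
```

With zero departures the increment vanishes, and the symmetric square root preserves the zero column-sum of the perturbations, as required by the transform-matrix formulation.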
Two relaxation-to-prior (RTP) methods were implemented with the NICAM-LETKF (Kotsuki et al., 2017c). The first is the RTP spread (RTPS: Whitaker and Hamill, 2012), which relaxes the reduction of the ensemble spread of the ith variable of the state vector x as follows:

X^a_{(i)} \leftarrow X^a_{(i)} \left[ \alpha_{RTPS} \frac{\sigma^f_i - \sigma^a_i}{\sigma^a_i} + 1 \right],

where \sigma and \alpha_{RTPS} denote the ensemble spread and the relaxation parameter of the RTPS, respectively. The other RTP method used is the RTP perturbation (RTPP: Zhang et al., 2004), which mixes the prior and posterior ensemble perturbations as follows:

X^a \leftarrow (1 - \alpha_{RTPP}) X^a + \alpha_{RTPP} X^f,

where \alpha_{RTPP} denotes the relaxation parameter of the RTPP. We investigate the sensitivity of the estimated observation impacts to these RTP methods.
Table 1 summarizes the observations assimilated in this study. The NICAM-LETKF assimilates the satellite radiances of the Advanced Microwave Sounding Unit-A (AMSU-A) and Global Satellite Mapping of Precipitation (GSMaP) precipitation in addition to conventional observations (CONV) from the operational system of the NCEP (a.k.a. NCEP PREPBUFR). Due to the relatively low model top height (approximately 40 km), the present NICAM-LETKF assimilates only three channels (6−8) of the AMSU-A radiances and excludes channels that are sensitive to the stratosphere and mesosphere (Terasaki and Miyoshi, 2017). The GSMaP is global surface precipitation data estimated by passive microwave radiometers (Kubota et al., 2007; Ushio et al., 2009). This study assimilates a near-real-time version of the GSMaP data with the Gaussian transformation (GT) method (Lien et al., 2013; 2016a; 2016b; Kotsuki et al., 2017b).
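As a concrete illustration of the two relaxation methods, the following is a minimal numpy sketch (our own code; the default α values are those used in this study). The RTPS rescales the analysis perturbations variable by variable, whereas the RTPP blends whole perturbation matrices.

```python
import numpy as np

def rtps(Xf, Xa, alpha=0.95):
    """Relaxation-to-prior spread (Whitaker and Hamill, 2012): relax the
    posterior ensemble spread of each state variable back toward the
    prior spread. Xf, Xa: (n, m) prior/posterior perturbations."""
    sf = Xf.std(axis=1, ddof=1)  # prior ensemble spread per variable
    sa = Xa.std(axis=1, ddof=1)  # posterior ensemble spread per variable
    factor = alpha * (sf - sa) / np.maximum(sa, 1e-12) + 1.0
    return Xa * factor[:, None]

def rtpp(Xf, Xa, alpha=0.80):
    """Relaxation-to-prior perturbation (Zhang et al., 2004): blend the
    prior and posterior ensemble perturbations."""
    return (1.0 - alpha) * Xa + alpha * Xf
```

With alpha = 1, `rtps` restores the prior spread exactly, while `rtpp` returns the prior perturbations; with alpha = 0, both leave the analysis perturbations unchanged.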

EFSO
The EFSO estimates the forecast error differences at forecast time t derived from observations assimilated at analysis time 0. Assume that the NICAM-LETKF computes an analysis every 6 h. The EFSO compares the forecasts from two analysis times, −6 and 0 h, to the verification time t (x^f_{t|-6} and x^f_{t|0}), with any reference close to the truth (Figure 1a). The baseline forecast x^f_{t|-6} represents the forecast without DA, and is discussed further in section 2.5.
The forecast errors are defined by:

e_{t|0} \equiv x^f_{t|0} - x^{ref}_t, \quad e_{t|-6} \equiv x^f_{t|-6} - x^{ref}_t,
(\Delta e^2)^{true} \equiv e^T_{t|0} C e_{t|0} - e^T_{t|-6} C e_{t|-6},

where C is the positive-definite matrix that defines the forecast error norm, x^{ref}_t is the verification reference, and the superscript 'true' denotes the true error difference. The moist total energy (MTE) norm is given by:

e^T C e = \frac{1}{2} \int_S \left\{ \int_v \left[ u'^2 + v'^2 + \frac{C_p}{T_r} T'^2 + \frac{L^2}{C_p T_r} q'^2 \right] dv + R_d T_r \left( \frac{p'_s}{p_r} \right)^2 \right\} dS,

where S represents the target region, v is the vertical sigma coordinate, and T', p'_s, u', v' and q' denote the temperature, surface pressure, zonal wind, meridional wind and specific humidity components of the forecast error e, respectively. R_d, C_p and L are the gas constant of dry air, the specific heat at constant pressure, and the latent heat of condensation per unit mass, respectively. p_r and T_r are the reference surface pressure and temperature, set to 10^5 Pa and 280 K, respectively. PE, KE and ME denote the potential, kinetic and moist energy components of this norm, respectively (cf. O13).
K12 extended the adjoint-based FSO to ensemble DA:

(\Delta e^2)^{EFSO} \approx \frac{1}{m-1} d^T R^{-1} Y^a (X^f_{t|0})^T C (e_{t|0} + e_{t|-6}). (15)

With localization, the impact of the lth observation on the forecast at the jth model grid point is given by:

(\Delta e^2)^{EFSO}_{l,j} \approx \frac{1}{m-1} \rho_{l,j} [R^{-1} d]_l [Y^a (X^f_{t|0})^T]_{l,j} [C (e_{t|0} + e_{t|-6})]_j, (16)

where \rho_{l,j} is the localization function at the jth model grid point for the lth observation. The superscript EFSO denotes the forecast error difference estimated by the EFSO. See appendix A for a detailed derivation. This study defines an observation-based verification norm as follows:

e^o_t \equiv H x^f_t - y^o_t = -d_t, (17)
(e^2_o)_t \equiv \frac{1}{N_p} (e^o_t)^T R_t^{-1} e^o_t = \frac{1}{N_p} d_t^T R_t^{-1} d_t, (18)

where N_p is the number of observations. Hereafter, this metric is referred to as the normalized observation departure (NOD). As with Equations 15 and 16, the impact of the lth observation at time 0 on the forecast departure of the nth observation at time t can be approximated by:

(\Delta e^2_o)^{EFSO}_{l,n} \approx \frac{1}{m-1} \rho_{l,n} [R^{-1} d]_l [Y^a (H X^f_{t|0})^T]_{l,n} [R_t^{-1} (e^o_{t|0} + e^o_{t|-6})]_n, (19)

where \rho_{l,n} is the localization function between the lth and nth observations. Here, the observation error covariance R_t is assumed to be diagonal for the verifying observations. Following SW16, this study employs the observations used in subsequent data assimilation cycles. We use the same localization function for Equations 16 and 19, as defined later in section 3.
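Once the forecast error vector at the verification time is available, the localized impact estimate of Equation 16 reduces to a few matrix products. The following numpy sketch is our own illustration (names and shapes are assumptions, not the operational code); C is taken as diagonal and folded into the vector `Ce_sum` = C(e_{t|0} + e_{t|-6}). Negative values mean the observation reduced the forecast error.

```python
import numpy as np

def efso_impacts(d0, R0_diag, Ya0, Xf_t, Ce_sum, rho):
    """Per-observation, per-grid-point EFSO impact estimates.
    d0: (p,) departures at analysis time 0;
    R0_diag: (p,) diagonal observation error variances;
    Ya0: (p, m) analysis perturbations in observation space;
    Xf_t: (n, m) forecast perturbations at verification time t;
    Ce_sum: (n,) the vector C (e_{t|0} + e_{t|-6});
    rho: (p, n) localization weights.
    Returns a (p, n) array of impacts (Delta e^2)_{l,j}."""
    m = Ya0.shape[1]
    sens = d0 / R0_diag                    # [R^-1 d]_l
    A = Ya0 @ (Xf_t * Ce_sum[:, None]).T  # (p, n): [Y^a X^f_t^T]_{l,j} [C(e+e)]_j
    return rho * sens[:, None] * A / (m - 1)
```

Summing the returned array over the grid index recovers the total impact of each observation; summing over both indices (with rho = 1 everywhere) recovers the non-localized estimate of Equation 15.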

EFSO with RTP methods
Here, we discuss the role of RTP methods applied to the initial conditions of ensemble forecasts in the EFSO. In the LETKF, the Kalman gain K is approximated by

K \approx \rho \circ \left[ \frac{1}{m-1} X^a_0 (Y^a_0)^T \right] R^{-1},

where \circ and \rho are the Schur product and the localization matrix, respectively (cf. appendix A). The forecast perturbation matrix X^f_{t|0} is given by M X^a_0, where M is the tangent linear forecast model; the fully nonlinear model \mathcal{M} is commonly used to approximate M in the EFSO. Following the original formulation, the Kalman gain K used in the EFSO should be consistent with the K in the LETKF (Equation A2). Therefore, we hypothesize that the EFSO needs no relaxation to X^a_0, in contrast to the DA cycles that need the relaxation (X^a_0 → X^a_{0,RLX}) to avoid under-dispersive ensembles. This study experimentally examines the changes in the estimated observation impacts (Δe²)^EFSO with and without the relaxation.
This issue is important in terms of computational cost: if the two impact estimates with and without the relaxation are significantly different, it is necessary to perform computationally expensive independent ensemble forecasts only for the EFSO. Care should also be taken when the analysis perturbation X^a_0 is altered in other ways, such as by additive inflation methods. Multiplicative inflation methods applied to the prior perturbation/covariance in DA cycles are free from this issue; however, multiplicative inflation should not be applied to the forecast perturbation in the EFSO (Equation 16).

Interpretation of forecasts in the EFSO
Finally, we bring to light a previously undiscussed issue: how the two forecasts x^f_{t|0} and x^f_{t|-6} in the EFSO should be interpreted and initialized.
Since K12 did not define these forecasts explicitly, previous studies have interpreted them differently. For example, H17 used the means of ensemble forecasts. Ensemble square root filters, including the LETKF, update the ensemble-mean first guess; therefore, it would be more straightforward to represent x^f_{t|0} and x^f_{t|-6} by deterministic forecasts from the ensemble-mean first guess and analysis. The deterministic forecast is also superior to the ensemble forecast in terms of computational cost.
The difference of the two forecasts of GW15 (\mathcal{M}_{t|-6}(\bar{x}^a_{-6}) and \mathcal{M}_{t|0}(\bar{x}^a_0)) is caused both by the different assimilated observations and by the difference between deterministic and ensemble forecasts from the previous (−6) to the target (0) analysis times (\mathcal{M}_{0|-6}(\bar{x}^a_{-6}) or \overline{\mathcal{M}_{0|-6}(x^a_{-6})}; see also Figure 8). To exclude the second factor, this study performs deterministic forecasts from the ensemble-mean first guess as the baseline forecast to represent the forecast without DA (x^f_{t|-6} = \mathcal{M}_{t|0}(\overline{\mathcal{M}_{0|-6}(x^a_{-6})})). The other forecast, with DA, is initialized by the ensemble-mean analysis (x^f_{t|0} = \mathcal{M}_{t|0}(\bar{x}^a_0)). These two forecasts have the same forecasting period from the initial time to the verification time. Section 4.6 discusses the sensitivities of the EFSO impact estimates to the choice of the baseline forecast.

EXPERIMENTAL SETTINGS
We first describe the experimental settings of the DA system. The ensemble size is set to 40. For the covariance localization (Equations 16 and 19), the Gaussian function is used with standard deviations LS_h = 400 km in the horizontal and LS_v = 0.4 natural-log pressure in the vertical. The localization function is replaced by zero beyond 2\sqrt{10/3} \cdot LS_h, which corresponds to the influence radius of the Gaussian function (Gaspari and Cohn, 1999; Hamill et al., 2001). The relaxation parameters \alpha_{RTPS} and \alpha_{RTPP} are set to 0.95 and 0.80, respectively (Kotsuki et al., 2017c). The LETKF assimilates temporally distributed observations over a 6 h time window centred on the analysis time (a.k.a. 4D-LETKF: Fertig et al., 2007; Miyoshi et al., 2010). Ensemble forecasts are performed for 9 h, and observation data in the previous 6 h period are assimilated (cf. fig. 3a of Terasaki and Miyoshi, 2017). Temporal localization (Miyoshi and Yamane, 2007) is not applied in this study, enabling us to assume the same transform matrix during the 6 h time window. Namely, this assumption enables the LETKF to compute analysis perturbations in the observation space (Y^a) distributed over the 6 h time window. Note that we conducted preliminary experiments with and without the temporal localization, and concluded that the analysis accuracy was insensitive to the temporal localization (not shown).
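For illustration, the horizontal localization weight described above can be written as a short function; this sketch (our own code, not from the NICAM-LETKF) applies the Gaussian weight with standard deviation LS_h and zeroes it beyond the 2\sqrt{10/3}\cdot LS_h influence radius:

```python
import numpy as np

def gaussian_localization(dist_km, ls_h=400.0):
    """Gaussian localization weight with the cutoff used in this study:
    weights are set to zero beyond 2*sqrt(10/3)*LS_h (about 1461 km for
    LS_h = 400 km; Gaspari and Cohn, 1999; Hamill et al., 2001)."""
    cutoff = 2.0 * np.sqrt(10.0 / 3.0) * ls_h
    w = np.exp(-0.5 * (dist_km / ls_h) ** 2)
    return np.where(dist_km < cutoff, w, 0.0)
```

The same functional form, with LS_v = 0.4 in natural-log pressure, would apply in the vertical.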
Next, we detail the experimental settings of the EFSO. This study sets 6 and 12 h as the forecast verification times (FTs). H17 compared the EFSO with 6 and 24 h forecasts and pointed out that the EFSO was insensitive to these lead times. The localization scale in the EFSO is the same as that in the LETKF. The advection of the localization centre (O13) was not important in our preliminary experiments (cf. Table B1 of appendix B); therefore, the advection scheme is not used hereafter. We use ADPUPA, SATWND and AMSU-A radiances for the observation-based verification metric. The NOD metric (Equations 17−19) is sensitive to the normalizing matrix R and the number of verifying future observations (N_p). Therefore, we perform the EFSOs separately with ADPUPA, SATWND, or AMSU-A radiances as the verification reference. For the AMSU-A-based verification, we apply the same bias correction coefficients used in the DA cycle (i.e. for 6 h forecasts) to the 12 h forecasts. After applying the bias correction, the mean departures (in K) for 6 and 12 h forecasts are −0.003 and 0.010 for channel 6, −0.030 and −0.029 for channel 7, and −0.084 and −0.085 for channel 8, respectively. This suggests that the bias correction coefficients for 6 h forecasts are applicable to 12 h forecasts in the NICAM-LETKF. Table 2 summarizes the experimental settings. The control experiment (CTRL) assimilates CONV, GSMaP and AMSU-A observations. We also conduct a twin experiment (TWIN-EXP) with the same experimental settings as CTRL, but starting from different initial ensemble data. The initial ensembles for CTRL and TWIN-EXP were obtained from the 1st−40th and 41st−80th members of a long-term 128-member NICAM-LETKF experiment. The TWIN-EXP is used as an independent verification reference. A data denial experiment (DNY-EXP) is conducted by excluding AMSU-A radiances from DA. Experiments with different RTP methods are also compared: RTPS for CTRL and RLX-EXP1, and RTPP for RLX-EXP2 and RLX-EXP3.
RLX-EXP1 and RLX-EXP3 apply the RTP methods to the initial conditions of ensemble forecasts in the EFSO, whereas CTRL and RLX-EXP2 do not as described in section 2.4 (cf. Figure 1b).
The data assimilation experiments of CTRL and TWIN-EXP are performed for 2 months, from 0000 UTC 1 June to 1800 UTC 31 July 2014. DNY-EXP and RLX-EXP1−3 are started from 0000 UTC 1 July, following the 1-month DA cycle of CTRL. The results obtained over the last 4 weeks (4−31 July) are used for discussion.
Figure 2 shows the estimates of the average 6 h forecast error differences (Δe²_MTE)^EFSO in CTRL verified with the subsequent analyses of CTRL and the ERA-Interim reanalysis. We show the results for KE, PE and ME separately to understand physically how each type of observation improves the atmospheric fields. For example, AMSU-A mainly reduces KE, suggesting that AMSU-A radiances improve the wind fields even though AMSU-A is sensitive to atmospheric temperature profiles. The GSMaP mainly improves the moisture fields. Among the three energy components KE, PE and ME, the largest improvement is observed for KE.

Estimated impacts verified with atmospheric states
In terms of the overall impact, ADPUPA makes the largest contribution to the forecast error reduction, followed by SATWND and then AMSU-A or ASCATW. The estimated impacts of AIRCFT and AMSU-A are smaller than those reported by O13 and H17. The impacts of AMSU-A are smaller probably because the present NICAM-LETKF assimilates only three channels of AMSU-A (Terasaki and Miyoshi, 2017), whereas operational centres assimilate 10 channels. The impacts of AIRCFT would also be smaller because the public PREPBUFR contains a limited number of aircraft observations (Akira Yamazaki, personal communication). The estimated MTE forecast error differences (Δe²_MTE)^EFSO generally agree well with the true differences (Δe²_MTE)^true (Table B1 in appendix B), suggesting accurate impact estimates by the EFSO.
Between the two verification references (subsequent analyses and the ERA-Interim reanalysis), relatively large differences in the estimated impacts are found for ADPUPA, SATWND and ASCATW, for which verification with their own analyses generally shows smaller amplitudes (Figure 2c). This may be because the quadratic error norm is used to measure the forecast errors. NICAM forecasts would be closer to subsequent analyses than to the ERA-Interim reanalysis; therefore, the total error reduction relative to the ERA-Interim reanalysis is larger than that relative to subsequent analyses. On the other hand, the forecast error reductions in ME are smaller when we verify with the ERA-Interim reanalysis (Figure 2b). This may be due to a dry bias of the NICAM-LETKF in the lower troposphere relative to ERA-Interim (Kotsuki et al., 2018). Due to this bias in humidity, improved humidity fields relative to subsequent analyses do not guarantee improvements relative to the ERA-Interim reanalysis.
Figure 3 shows the estimates of the average 12 h forecast error differences (Δe²_NOD)^EFSO in CTRL, verified with ADPUPA, SATWND and AMSU-A radiances. The EFSO diagnoses that each type of observation generally improves the departures of the observed variable. For example, assimilation of the AMSU-A radiances significantly reduces the forecast departure of the AMSU-A radiances, but not necessarily those of other observations (Figure 3c). This feature would come from a limitation of the EFSO formulation rather than reflecting the true observation impacts.

Estimated impacts verified with observations
When we assimilate an observation, its impact can propagate away from the observation location. However, the EFSO limits the estimated impact to the observation's immediate vicinity because of localization (Equation 19). For example, assume that we evaluate the impact of an assimilated observation A using an observation B as the verification reference. If the localization function between observations A and B is zero (\rho_{l,n} = 0 in Equation 19), the EFSO automatically diagnoses that observation A has no impact on the observation departure for B. This artefactual mechanism would be particularly relevant to the AMSU-A radiances, which observe temperature profiles in the upper atmosphere. In fact, observations near the surface (ADPSFC, SFCSHP and ASCATW) have negligible impact on the EFSO verified with AMSU-A radiances (Figure 3c).
It is important to evaluate whether assimilating observations improves or degrades the forecast departures of other observed quantities. The EFSO with observation-based metrics is a useful tool for this purpose, although its interpretation has to be done carefully, acknowledging the limitations described above. In the NICAM-LETKF system, no observation degraded the departures of other observed quantities.

Impact of AMSU-A radiances
We investigate the impacts of AMSU-A radiances, which have a large coverage across the entire globe, on the upper troposphere and stratosphere. Figure 4 shows the spatial patterns of the estimated 6 h forecast MTE differences (J/kg) above 350 hPa relative to the ERA-Interim reanalysis in CTRL. Estimated impacts of AMSU-A radiances over land are smaller than those over the ocean (Figure 4a), possibly because other observations (e.g. radiosondes) have dominant impacts over land. In general, assimilating AMSU-A radiances has beneficial impacts across the globe, especially in the Southern Hemisphere (SH). However, assimilating AMSU-A radiances would degrade forecasts in the Tropics (TR; Figure 4a). The detrimental impacts in the TR may be related to a temperature bias between the NICAM-LETKF analysis and the ERA-Interim reanalysis. DA systems perform bias corrections for satellite radiances so that the time-mean observation-minus-first-guess innovation becomes statistically zero (e.g. Bormann and Bauer, 2010). The NICAM-LETKF analysis has a warm temperature bias in the stratosphere over the TR relative to the ERA-Interim reanalysis (Kotsuki et al., 2018). Assimilating the bias-corrected AMSU-A radiances would therefore degrade the analyses and subsequent forecasts relative to the ERA-Interim reanalysis.
We compare the observation impacts estimated by the EFSO with the data denial experiment for AMSU-A radiances (DNY-EXP). The EFSO impact estimates are not exactly equivalent to the impact estimates of the data denial experiments: a data denial experiment estimates the impact of observation data by comparing two DA cycles with and without the data, providing a direct estimate, whereas the EFSO provides indirect impact estimates. Figure 5 compares the spatial patterns of the time-mean MTE differences above 350 hPa in CTRL and DNY-EXP, i.e. (e²_{t|0,MTE})^true_CTRL minus (e²_{t|0,MTE})^true_DNY-EXP, where the subscripts denote the names of the experiments. Here, we consider that the ERA-Interim reanalysis provides better analyses than the NICAM-LETKF owing to the assimilation of more satellite data. Figure 5 suggests that the forecast fields of CTRL generally become closer to the ERA-Interim reanalysis in the SH than those of DNY-EXP. However, some regions are degraded by assimilating the AMSU-A radiances, such as the TR and near New Zealand. We find some similarities between Figures 4a and 5, such as beneficial impacts in the north and southeast Pacific Ocean and detrimental impacts in the TR and near New Zealand. Therefore, the detrimental evaluations by the EFSO can be used to detect detrimental observations in general. However, the opposite results are found in some areas, such as the North Atlantic and northwest Pacific Ocean.

Fraction of beneficial observations
Here we investigate the fraction of beneficial observations (FBO), or the percentage of positively contributing observations, in CTRL verified with different references. We define the FBO by N_beneficial/(N_beneficial + N_detrimental), where N_beneficial and N_detrimental are the numbers of assimilated beneficial and detrimental observations, respectively. Observations having no impact, that is, \sum_j (\Delta e^2)^{EFSO}_{l,j} = 0, are excluded. The estimated MTE impacts of individual observations assimilated at 0000 UTC on 11 July 2014 are shown in Figure 6. When we verify relative to the ERA-Interim reanalysis (Figure 6a,b), both beneficial and detrimental observations are seen over the globe. However, we see mainly beneficial observations when we verify relative to the subsequent analyses, especially for the shorter 6 h forecast (Figure 6c,d). Verifying relative to subsequent analyses would overestimate the FBO. However, this is not the case when we verify relative to the analyses of TWIN-EXP (Figure 6e,f). Table 3 summarizes the time-mean FBO. The FBO is 51.9−55.4% when we verify relative to the analyses of TWIN-EXP, the ERA-Interim reanalysis, or the AMSU-A radiances. Previous studies also found that the FBO was just over 50% (Lorenc and Marriott, 2014; SW16; N18). With a toy chaotic model (Lorenz, 1996; Lorenz and Emanuel, 1998), a similar FBO was also reported (Kotsuki et al., 2017a). However, the FBO is remarkably higher when subsequent analyses are used as the reference (56.1−58.9%). Interestingly, such a high FBO is not observed when we verify relative to the analyses of TWIN-EXP (54.0−54.4%).
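The FBO statistic is straightforward to compute once the per-observation impacts are summed over grid points. The following is a minimal sketch (our own code), using the convention that a negative Δe² means the observation was beneficial; zero-impact observations are excluded, as above.

```python
import numpy as np

def fraction_beneficial(impacts):
    """Fraction of beneficial observations.
    impacts: per-observation EFSO impacts, already summed over grid
    points; negative = beneficial (forecast error reduced), positive =
    detrimental, exactly zero (e.g. fully localized away) = excluded."""
    total = np.asarray(impacts, dtype=float)
    n_beneficial = np.sum(total < 0.0)
    n_detrimental = np.sum(total > 0.0)
    return n_beneficial / (n_beneficial + n_detrimental)
```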
The EFSO regards the verification reference as the truth. However, the errors in subsequent analyses may be positively correlated with the propagated analysis increments for short-term forecasts. When they are positively correlated, assimilated observations may tend to be evaluated as beneficial. K12 and H17 reported that the FBO increased as the lead time shortened. This would be more significant when we use the subsequent analyses as the reference. The propagated analysis increments may be more strongly correlated with the errors in the subsequent analyses than with the errors in TWIN-EXP. Unfortunately, it is almost impossible to investigate such correlations since we never know the true state in practice. Therefore, it is difficult to investigate whether the FBO is overestimated in the presence of a positive error correlation between the propagated analysis increments and the errors in the subsequent analyses. We need to investigate this issue further with observing system simulation experiments, in which we know the true state, but such additional experiments are beyond the scope of this study. This study experimentally revealed that the overestimated FBO can be mitigated by verifying relative to future observations, analyses of different DA systems, or analyses of an independent DA cycle with the same system.
[Figure 5 caption: Global pattern of the difference of time-mean 6 h forecast MTE (J/kg) above 350 hPa relative to the ERA-Interim reanalysis between CTRL and DNY-EXP (i.e. with and without AMSU-A radiances), averaged over 4 weeks from 4 to 31 July 2014. Green and brown colours represent improvement and degradation due to assimilation of AMSU-A radiances, respectively.]
There are significant reductions in the FBO between FT = 6 and 12 h when we verify relative to the subsequent analyses. Small reductions are also seen even when we verify relative to the analyses of TWIN-EXP or the ERA-Interim reanalysis, perhaps due to the assimilation of similar observations. On the other hand, no such reduction is observed when we verify relative to the AMSU-A radiances. Observation-based metrics may be more appropriate for estimating the FBO at shorter lead times since they can avoid an overestimation of the FBO.

Estimated impacts with and without RTP methods
Here, we discuss the second issue, that is, whether we need to apply the RTP methods to the initial conditions of the ensemble forecasts in the EFSO. The ERA-Interim reanalysis is used as the verification reference. Figure 7 compares the 6 and 12 h forecast MTE error differences: the real reduction (Δe²_MTE)^true and the estimated reduction (Δe²_MTE)^EFSO. First, we discuss the RTPS (Figure 7a,b). The EFSO impact estimates in CTRL (without RTPS) agree well with the truth at FT = 6 h, while the EFSO slightly underestimates the forecast error reductions at FT = 12 h. On the other hand, the EFSO in RLX-EXP1 (with RTPS) overestimates the forecast error reductions at both FTs. With the RTPP, the EFSO in RLX-EXP3 significantly overestimates the forecast error reductions (Figure 7c,d).
As discussed in section 2.4, we hypothesized from the original formulation that the EFSO needs no relaxation to X^a_0. As hypothesized, observation impacts can be overestimated if we apply the RTP methods to the initial condition X^a_0 in the EFSO. The overestimation with the RTPP is more significant than that with the RTPS.
The different error growth characteristics of the RTPP and RTPS may cause the different degrees of overestimation of the observation impacts. The RTPS provides a less physically balanced analysis than the RTPP; therefore, the ensemble spread of the RTPS analysis grows more slowly than that of the RTPP (Whitaker and Hamill, 2012; Kotsuki et al., 2017c). Owing to this slower spread growth, the overestimation with the RTPS is less significant than that with the RTPP.
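For reference, the two relaxation methods can be sketched with their standard formulas (RTPP: Zhang et al., 2004; RTPS: Whitaker and Hamill, 2012). This is a minimal illustration with hypothetical array names; perturbations are assumed to be mean-removed, with ensemble members along the second axis:

```python
import numpy as np

def rtpp(Xa_pert, Xb_pert, alpha=0.5):
    """Relaxation to prior perturbations (Zhang et al., 2004):
    blend analysis perturbations with background perturbations."""
    return (1.0 - alpha) * Xa_pert + alpha * Xb_pert

def rtps(Xa_pert, Xb_pert, alpha=0.5):
    """Relaxation to prior spread (Whitaker and Hamill, 2012):
    rescale analysis perturbations so the analysis spread is
    relaxed toward the background spread."""
    sig_a = np.std(Xa_pert, axis=1, ddof=1)  # analysis spread
    sig_b = np.std(Xb_pert, axis=1, ddof=1)  # background spread
    factor = alpha * (sig_b - sig_a) / sig_a + 1.0
    return Xa_pert * factor[:, None]
```

The RTPP replaces part of the analysis perturbation structure with the (more balanced) background perturbations, whereas the RTPS only rescales amplitudes, which is consistent with the different spread growth discussed above.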

Impact of the choice of baseline forecast
Finally, we investigate the sensitivity to the choice of the baseline forecast, as discussed in section 2.5. Here, we consider three deterministic forecasts at the verification time (Figure 9c). The results imply that the mean of the ensemble forecasts (M̄_{0|−6}(x^a_{−6})) is statistically better than the deterministic forecast (M_{0|−6}(x^a_{−6})). Figure 10 compares the spatial patterns of the root-mean-square difference (RMSD) of zonal wind (U; m/s) for M̄_{0|−6}(x^a_{−6}) and M_{0|−6}(x^a_{−6}) at the target analysis time relative to the ERA-Interim reanalysis. The mean of the ensemble forecasts (M̄_{0|−6}(x^a_{−6})) is significantly better than the deterministic forecast (M_{0|−6}(x^a_{−6})) in regions similar to those where M_{t|0}(x^f_0) is better than M_{t|−6}(x^a_{−6}) (Figures 9d and 10c). Relatively large differences are observed in the tropics (TR), where convective systems are dominant. We may need to consider this choice of the baseline deterministic forecast carefully, especially when applying the EFSO to convective events.
The above discussion suggests that the actual forecast MTE error differences (Δe²_MTE)^true are overestimated if we use M_{0|−6}(x^a_{−6}) to represent the forecast without DA. Therefore, our deterministic forecasts would be more appropriate than GW15's approach for representing forecasts with and without DA. This does not necessarily mean that the estimated forecast MTE error difference (Δe²_MTE)^EFSO is always overestimated, because other uncertainties, such as the approximation of (Δe²_MTE)^true by (Δe²_MTE)^EFSO, can be sources of underestimation. Schraff et al. (2016) proposed an independent update of a deterministic run, which introduces analysis increments into the deterministic forecasts based on the Kalman gain for the analysis ensemble mean. The independently updated deterministic runs can be compared with M_{0|−6}(x^a_{−6}).
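The independent deterministic update of Schraff et al. (2016) amounts to applying the ensemble-derived Kalman gain to the innovation of the deterministic background. The following is a minimal ETKF-style sketch under simplifying assumptions (global update, no localization or inflation; the array names are hypothetical), not the actual KENDA implementation:

```python
import numpy as np

def kenda_det_update(xb_det, Xb_pert, Yb_pert, d_det, R):
    """Independent deterministic update (after Schraff et al., 2016).

    xb_det  : (n,)   deterministic background state
    Xb_pert : (n, k) background ensemble perturbations in model space
    Yb_pert : (p, k) background ensemble perturbations in obs space
    d_det   : (p,)   innovation of the deterministic run, y - H(xb_det)
    R       : (p, p) observation-error covariance
    """
    k = Xb_pert.shape[1]
    Rinv = np.linalg.inv(R)
    # Analysis weight covariance in ensemble space (ETKF form,
    # cf. Hunt et al., 2007): [(k-1)I + Y^T R^-1 Y]^-1
    Pa_tilde = np.linalg.inv((k - 1) * np.eye(k) + Yb_pert.T @ Rinv @ Yb_pert)
    w_det = Pa_tilde @ Yb_pert.T @ Rinv @ d_det
    return xb_det + Xb_pert @ w_det
```

Because the weights `w_det` are computed from the ensemble but applied to the deterministic innovation, the deterministic run receives analysis increments consistent with the ensemble Kalman gain while remaining a separate trajectory.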

Implications for other DA systems
The present NICAM-LETKF system assimilates fewer observations than typical operational systems. The discussions on the covariance relaxation and the baseline forecasts (sections 4.5 and 4.6) would be insightful for other ensemble-based systems, since the experimental results generally agree with the theoretical implications. The discussion on the FBO (section 4.4) addresses generic properties of both ensemble- and adjoint-based FSOs. With respect to the MTE-based impact estimates (section 4.1), the estimated impacts of each observation type would differ from those in operational systems. Compared to operational systems (e.g. O13; LM14; H17; C18; Buehner et al., 2018), AIRCFT and AMSU-A radiances generally show smaller contributions in the NICAM-LETKF, possibly because fewer observations are assimilated (Figure 2d).
As for the observation-based impact estimates (section 4.2), a limitation of the EFSO should be discussed. Each type of observation mainly improves the departures of the observed variable in the NICAM-LETKF (Figure 3a-c). For example, AMSU-A (−214.8 × 10⁻³) accounts for about 96.1% of the total forecast error reduction for the AMSU-A departures (−223.6 × 10⁻³). Other observations would also reduce the AMSU-A departures in operational systems, in which many more observations sensitive to the upper troposphere and stratosphere (e.g. more AIRCFT, all-sky microwave imagers, the Infrared Atmospheric Sounding Interferometer, and the Advanced Technology Microwave Sounder) are assimilated. In addition, adjoint-based FSO does not apply localization and would better capture propagated impacts. In contrast, the localization of the EFSO limits the impacts of observations to their immediate vicinity. Increasing the ensemble size or improving the localization method would be needed to capture such propagated impacts better than the present framework does.
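To make the role of localization explicit, the per-observation EFSO estimate (after Kalnay et al., 2012) can be sketched in a few lines. This is a schematic illustration with hypothetical array names, assuming a diagonal R and an optional Schur-product localization; it is not the actual NICAM-LETKF implementation:

```python
import numpy as np

def efso_impacts(dy, Rinv_diag, Ya0, Xf, C_e_sum, rho=None):
    """Per-observation EFSO impact estimates (schematic).

    dy        : (p,)   innovations y - H(x^b)
    Rinv_diag : (p,)   inverse observation-error variances (diagonal R)
    Ya0       : (p, K) analysis ensemble perturbations in obs space
    Xf        : (n, K) forecast ensemble perturbations at verification time
    C_e_sum   : (n,)   norm-weighted sum of forecast errors e_{t|0} + e_{t|-6}
    rho       : optional (p, n) localization weights
    Returns a length-p array; negative values indicate beneficial obs.
    """
    K = Ya0.shape[1]
    cross = (Ya0 @ Xf.T) / (K - 1)  # (p, n) obs-forecast error covariance
    if rho is not None:
        cross = rho * cross         # Schur-product localization
    return dy * Rinv_diag * (cross @ C_e_sum)
```

The localization weights `rho` taper `cross` to zero away from each observation, which is exactly why impacts propagated far from the observation location cannot be captured by the localized EFSO.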

SUMMARY
The EFSO is a powerful and computationally feasible tool to estimate observation impacts. This study compared EFSO impact estimates using multiple verification references. We implemented the EFSO with two error norms (MTE and NOD) in the NICAM-LETKF. The main findings are summarized as follows:

1. We visualized the dry and moist energy errors with separated KE, PE, and ME errors. This separation may help understand specific contributions from each type of observation.

2. With observation-based impact estimates, each type of observation mainly contributes to the improvement of the forecast departures of the observed variables. For example, the assimilation of the AMSU-A radiances significantly reduces the forecast departure of the AMSU-A radiances, but only slightly reduces the forecast departures of other observations. This feature would come from a limitation of the localization in the EFSO. Increasing the ensemble size or improving the localization method would help estimate propagated observation impacts better. Nevertheless, the EFSO with observation-based metrics can be a useful tool to evaluate whether assimilating observations improves or degrades the forecast departures of other observed quantities.

3. The EFSO overestimates the fraction of beneficial observations when we verify relative to subsequent analyses, especially for shorter lead times such as FT = 6 h. We could mitigate this overestimation to some extent by verifying with observations, analyses from other DA systems, or analyses of an independent run with the same DA system.

4. Observation impacts are overestimated if we apply the RTPS or RTPP to the initial conditions of the ensemble forecasts in the EFSO. Therefore, the ensemble forecasts in the EFSO should be independent of the ensemble forecasts in the DA cycle, especially for the RTPP.

5. Deterministic baseline forecasts should be initialized by the ensemble mean of the first guess at the analysis time, not by the previous analysis.
The fraction of beneficial observations was just over 50% in the NICAM-LETKF. It would be worth designing additional QC methods, such as PQC (H17), for further forecast improvements.

APPENDIX B: VERIFICATION OF THE EFSO
This appendix validates the forecast error differences estimated by the EFSO. Table B1 compares the correlations between the estimated and true MTE forecast error differences, that is, (Δe²_MTE)^EFSO and (Δe²_MTE)^true. The MTE forecast error differences at 0000 and 1200 UTC tend to be larger than those at 0600 and 1800 UTC, mainly because most radiosondes are launched only at 0000 and 1200 UTC. Computing correlation coefficients using forecast error differences at different verification times altogether would lead to overestimated correlations. To avoid this issue, here we compute the correlations separately for observations assimilated at 0000 and 1200 UTC and for those assimilated at 0600 and 1800 UTC. The EFSO impact estimates show good agreement with the true error differences in general (correlation coefficients > 0.817). Table B1 also suggests that the advection of the localization centre (O13) does not have a significant effect on the correlation coefficients for short lead times, as noted by H17. Table B2 compares the estimated and true NOD forecast error differences, that is, (Δe²_NOD)^EFSO and (Δe²_NOD)^true. As described in section 3, we perform the EFSOs separately with ADPUPA, SATWND, or AMSU-A radiances.

TABLE B1 Comparison of estimated MTE forecast error differences (J/kg) by the EFSO with true differences, sampled over 4 weeks from 4 to 31 July 2014
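The separate correlation computation described above can be sketched as follows. This is a minimal illustration with hypothetical inputs; `est` and `true` would hold the estimated and true forecast error difference samples at each verification time:

```python
import numpy as np

def split_correlations(times_utc, est, true):
    """Correlate estimated and true forecast error differences
    separately for 00/12 UTC and 06/18 UTC verification times,
    avoiding the inflated correlation that arises when the
    radiosonde-driven 00/12 UTC samples (larger magnitudes) are
    pooled with the 06/18 UTC samples."""
    t = np.asarray(times_utc)
    r = {}
    for label, hours in (("00/12", (0, 12)), ("06/18", (6, 18))):
        m = np.isin(t, hours)
        r[label] = np.corrcoef(np.asarray(est)[m], np.asarray(true)[m])[0, 1]
    return r
```

Pooling both groups mixes two populations with different variances, so the pooled correlation mostly reflects the between-group magnitude difference rather than the within-group agreement; computing them separately removes this artefact.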