Comparison of the BMA and EMOS statistical methods for probabilistic quantitative precipitation forecasting

The main approach to probabilistic weather forecasting has been ensemble forecasting, in which probability information is derived from several numerical model runs with perturbed initial conditions, physical schemes or dynamic cores of the numerical weather prediction (NWP) models. However, ensemble forecasts usually tend to be under-dispersive. Statistical post-processing has therefore become an essential component of any ensemble prediction system that aims to improve the quality of numerical weather forecasts by generating calibrated and sharp predictive distributions of future weather quantities. In the present paper, different versions of the ensemble model output statistics (EMOS) and Bayesian model averaging (BMA) post-processing methods are used to calibrate 24, 48 and 72 hr forecasts of 24 hr accumulated precipitation. The ensemble employs the Weather Research and Forecasting (WRF) model with eight different configurations run over Iran for six months (September 2015–February 2016). The results reveal that the BMA and EMOS censored, shifted gamma (EMOS-CSG) techniques are substantially successful at improving the raw WRF ensemble forecasts; however, each approach improves different aspects of forecast quality. The BMA method is more accurate, skilful and reliable than the EMOS-CSG method, but has poorer discrimination. Moreover, it has better resolution in predicting the probability of high-precipitation events than the EMOS-CSG method.


| INTRODUCTION
Precipitation forecasting is an essential tool for the optimal management of water resources and for flood forecasting. Numerical weather prediction (NWP) models play a major role in weather forecasting. They predict the future state of the weather by mathematically modelling the atmosphere's behaviour based on its current condition. However, the accuracy of NWP models remains a challenge, and its improvement is a main goal of operational prediction centres.
In weather forecasting with NWP models, there are several sources of uncertainty, such as the intrinsically chaotic behaviour of the atmospheric dynamical system and errors in the observational data and initial conditions, that are almost impossible to remove (Yang et al., 2012). To quantify these uncertainties, ensemble systems are used. Instead of only one deterministic forecast, an ensemble system comprises several forecasts obtained by perturbing the initial conditions, physical schemes or dynamic core of the NWP models. Ensemble systems are widely used by the meteorological community, especially for medium-range weather forecasts as well as for short- and even ultra-short-range weather prediction (Toth and Kalnay, 1993; Houtekamer and Derome, 1995; Molteni et al., 1996; Toth and Kalnay, 1997; Buizza et al., 2005; Charron et al., 2010; Iversen et al., 2011; Bouallégue et al., 2013).
Probabilistic forecasts of flood and extreme precipitation can be created with an ensemble system. However, ensemble forecasts are often under-dispersive and uncalibrated, especially for meteorological parameters near ground level. To overcome this problem, several statistical methods (Wilks, 2006; Wilks and Hamill, 2007; Williams et al., 2014; Vannitsem et al., 2019; Vannitsem et al., 2020) can be used to post-process the ensemble forecasts. In ensemble post-processing, the biases in both location and dispersion are removed using a historical database of ensemble forecast errors, and a predictive probability density function (PDF) is then estimated. The most popular ensemble post-processing methods are Bayesian model averaging (BMA) (Sloughter et al., 2007, 2010; Soltanzadeh et al., 2011; Baran, 2014) and ensemble model output statistics (EMOS) (Wilks and Hamill, 2007; Thorarinsdottir and Gneiting, 2010; Scheuerer, 2014; Baran and Lerch, 2015; Scheuerer and Hamill, 2015).
In the BMA method, a PDF is first fitted to every ensemble member according to the behaviour of the considered weather parameter, and the predictive PDF is then estimated as a weighted average of the PDFs fitted to the ensemble members, with weights estimated according to the performance of each member. In the EMOS method, the parameters of a single predictive distribution are linear functions of the ensemble members. In both methods, the coefficients are estimated from forecasts and their corresponding observations over a training period. Baran et al. (2014) compared the BMA and EMOS post-processing methods for probabilistic forecasting of 2 m temperature and 10 m wind speed at the Hungarian Meteorological Service. They concluded that both post-processing methods improve the calibration of probabilistic forecasts, and that the BMA performs slightly better than the EMOS method. Schmeits and Kok (2010) compared the raw ensemble, BMA and extended logistic regression (LR) for probabilistic precipitation forecasts in the Netherlands; the results revealed that the difference in skill between the BMA and LR is not statistically significant. Díaz et al. (2019) used the BMA and EMOS post-processing methods for 2 m temperature probabilistic forecasts and showed that these post-processing methods result in a significant decrease in the continuous ranked probability score (CRPS) of probabilistic forecasts.
In the present paper, 24, 48 and 72 hr ahead probabilistic forecasts of 24 hr accumulated precipitation are evaluated over Iran. The BMA and EMOS methods are used to post-process the ensemble forecasts and estimate the predictive PDF. In the BMA, a gamma distribution is fitted as the predictive PDF of the individual ensemble members. In the EMOS, the generalized extreme-value (GEV) distribution and the censored, shifted gamma (CSG) distribution are fitted. To implement the BMA and EMOS models, the ensembleBMA and ensembleMOS (Yuen et al., 2018) R packages are used. The ensemble system has eight members created by running the Weather Research and Forecasting (WRF) model over Iran with different physical configurations during the autumn and winter of 2015–16. The results are verified by rank and probability integral transform (PIT) histograms, the reliability diagram, the relative operating characteristic (ROC) diagram, the ROC skill score (RSS), the Brier score (BS), the Brier skill score (BSS) and the CRPS (Wilks, 2011).
The paper is organized as follows. Section 2 introduces the configurations and domains of the WRF model. Section 3 describes the BMA and EMOS methods. The results of the different verification tools are discussed in Section 4. Finally, the summary and conclusions are provided in Section 5.

| THE ENSEMBLE SYSTEM AND DATA
Forecasts of 24 hr accumulated precipitation at 24, 48 and 72 hr lead times from the WRF model with eight different configurations are used. The different physical configurations of the WRF model are summarized in Table 1.
The initial conditions of the WRF runs are the global forecast system (GFS) forecasts with a 0.5° horizontal resolution. The WRF model is run with two nested domains. The larger domain has a 15 km horizontal resolution and covers an area in the southwest of the Middle East from 10° to 51°N and from 20° to 80°E. The smaller domain has a horizontal resolution of 5 km and covers the area between 23°–41°N and 42°–65°E (the whole of Iran). The initialization time is 1800 UTC between September 1, 2015 and February 28, 2016. The WRF forecasts are bilinearly interpolated to synoptic meteorological stations spread all over Iran. Figure 1 shows the two nested domains of the WRF model and the locations of all the synoptic meteorological stations used in the study. In our implementation, as in Gneiting et al. (2005), the training period is a sliding window of the most recent N days before the forecast for which the ensemble output and the verifying observations were available.
To determine a suitable length for the sliding training period, the CRPS of the probabilistic forecasts is computed for sliding training periods of 50 to 110 days for the BMA method. The same test data set is used to evaluate all the training periods; that is, of all the available 180 days between September 1, 2015 and February 29, 2016, the first 110 days are taken as the training period and the remaining 70 days, covering December 21, 2015 to February 29, 2016, as the test period. As shown in Figure 2, the CRPS decreases as the training period lengthens from 50 to 87 days, so an 87 day window is selected as the training period. Accordingly, the period from November 28, 2015 to February 28, 2016 is considered as the test period, and a sliding window of the 87 previous days is used as the training period to fit the PDFs of the post-processing models.
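The sliding-window bookkeeping described above can be sketched in a few lines of Python. This is an illustrative sketch, not part of the paper's implementation; the function name and the 0-based day indexing are assumptions.

```python
import numpy as np

def sliding_training_indices(forecast_day, window=87):
    """Return the indices of the `window` most recent days before
    `forecast_day` (0-based day index) to use as the training set."""
    start = forecast_day - window
    if start < 0:
        raise ValueError("not enough history for the requested window")
    return np.arange(start, forecast_day)

# Example: with day 0 = September 1, 2015, the forecast for day 88
# (November 28, 2015, the start of the test period) is trained on days 1-87.
idx = sliding_training_indices(88, window=87)
```

As the forecast day advances through the test period, the window slides forward by one day at a time, so the coefficients are refitted daily on the most recent 87 days.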
The observational data used in the present study consist of the 24 hr cumulative precipitation measured at 0600 UTC in 349 irregularly spaced synoptic meteorological stations spread across Iran.

| BMA
The BMA method provides probabilistic forecasts as predictive PDFs of the weather quantity based on an ensemble prediction system (Sloughter et al., 2007). Consider the ensemble forecast $f_1, \ldots, f_M$ for precipitation $x$ at a given time and location. Supposing that each ensemble member $f_k$ corresponds to a component PDF $h_k(x \mid f_k; \theta_k)$, where $\theta_k$ are the parameters to be estimated, the BMA predictive PDF of $x$ is:

$$p(x \mid f_1, \ldots, f_M) = \sum_{k=1}^{M} w_k \, h_k(x \mid f_k; \theta_k),$$

where $w_k$ is the weight based on the predictive performance of forecast $f_k$ during the training period, and $\sum_{k=1}^{M} w_k = 1$.

[Figure 2: Comparison of the lengths of the training periods for the accumulated precipitation forecast using the continuous ranked probability score (CRPS).]
To model precipitation given its properties (a positive probability of being exactly 0, and skewness), Sloughter et al. (2007) proposed, as the component PDF, a mixture of a discrete component at 0 and a gamma distribution; the BMA model for the forecast PDF of the cube root of precipitation accumulation is therefore:

$$h_k(x \mid f_k) = p(x = 0 \mid f_k)\, I[x = 0] + p(x > 0 \mid f_k)\, g_k(x \mid f_k)\, I[x > 0],$$

in which $I$ is the indicator function and $p(x = 0 \mid f_k)$ is specified by:

$$\operatorname{logit} p(x = 0 \mid f_k) = a_{0k} + a_{1k} f_k^{1/3} + a_{2k}\, \delta_k,$$

where $\delta_k = 1$ if $f_k = 0$, and 0 otherwise; and $g_k(x \mid f_k)$ is a gamma PDF with mean $\mu_k = b_{0k} + b_{1k} f_k^{1/3}$ and variance $\sigma_k^2 = c_0 + c_1 f_k$. The parameters $a_{0k}$, $a_{1k}$ and $a_{2k}$ are estimated by logistic regression with $f_k^{1/3}$ and $\delta_k$ as the two predictor variables.

[Figure 5: Reliability diagrams including bootstrap confidence intervals (CI) for the probabilistic forecasts of 24 hr cumulative precipitation for a 24 hr lead time using the raw output of the ensemble system (red), Bayesian model averaging (BMA) (blue), ensemble model output statistics-censored, shifted gamma (EMOS-CSG) (green) and ensemble model output statistics-generalized extreme-value (GEV) (brown) methods.]
The parameters $b_{0k}$ and $b_{1k}$ are determined by linear regression of the cube root of the non-zero observations on the corresponding cube roots of the ensemble members, while $w_k$, $c_0$ and $c_1$ are estimated by maximum likelihood (Fisher, 1922) from the training data.
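To make the mixture concrete, the following is a minimal Python sketch of the BMA predictive CDF on the cube-root scale. It is an illustration only: the coefficient values in the example are hypothetical, and `scipy` is used in place of the ensembleBMA package the paper actually employs.

```python
import numpy as np
from scipy.stats import gamma

def bma_precip_cdf(x, members, weights, a, b, c):
    """Predictive CDF of the Sloughter et al. (2007) BMA mixture,
    evaluated on the cube-root precipitation scale.
    a[k] = (a0k, a1k, a2k): logistic coefficients for P(x = 0 | f_k).
    b[k] = (b0k, b1k): gamma mean, mu_k = b0k + b1k * f_k**(1/3).
    c = (c0, c1): gamma variance, var_k = c0 + c1 * f_k."""
    cdf = 0.0
    for k, f in enumerate(members):
        delta = 1.0 if f == 0 else 0.0
        eta = a[k][0] + a[k][1] * f ** (1 / 3) + a[k][2] * delta
        p0 = 1.0 / (1.0 + np.exp(-eta))          # P(x = 0 | f_k)
        mu = b[k][0] + b[k][1] * f ** (1 / 3)    # gamma mean
        var = c[0] + c[1] * f                    # gamma variance
        shape, scale = mu ** 2 / var, var / mu   # moment matching
        gk = gamma.cdf(x, a=shape, scale=scale) if x > 0 else 0.0
        cdf += weights[k] * (p0 + (1.0 - p0) * gk)
    return cdf

# Toy coefficients (hypothetical): logit 0 gives P(x = 0) = 0.5 per member,
# so the mixture places probability 0.5 on zero precipitation.
members, weights = [0.0, 8.0], [0.5, 0.5]
a = [[0.0, 0.0, 0.0]] * 2
b, c = [[1.0, 0.5]] * 2, [0.5, 0.1]
p_zero = bma_precip_cdf(0.0, members, weights, a, b, c)
```

The discrete mass at zero enters the CDF directly, while the gamma component contributes only for positive amounts; this is the structure the PIT discussion in Section 4 relies on.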
The cdf function of the ensembleBMA package in R is used to compute the validation criteria. As mentioned in the package documentation, if the model has been applied to a power transformation of the data, the output is transformed back appropriately.

| EMOS
The EMOS predictive PDF is a single parametric distribution whose parameters are functions of the ensemble members. Scheuerer (2014) modelled precipitation by a GEV distribution that is left-censored at 0:

$$F(y) = \begin{cases} G(y), & y \ge 0, \\ 0, & y < 0, \end{cases}$$

where $G(y)$ is the cumulative distribution function (CDF) of the GEV distribution with parameters $\mu$, $\sigma$ and $\xi$, which characterize the location, scale and shape of the GEV; the probability $G(0)$ is assigned to the event of zero precipitation. The ensemble members are linked to the mean $m$ and scale $\sigma$ of the underlying GEV distribution via the following equations:

$$m = \alpha_0 + \alpha_1 f_1 + \cdots + \alpha_M f_M + \alpha_{M+1}\, \overline{I_{\{f = 0\}}}, \qquad \sigma = \beta_0 + \beta_1\, \mathrm{MD}(f),$$

where $\overline{I_{\{f = 0\}}}$ and $\mathrm{MD}(f)$ are the ratio of zero-precipitation members and Gini's mean difference, respectively. The parameters $\alpha_0, \alpha_1, \cdots, \alpha_{M+1}, \beta_0$ and $\beta_1$ are estimated by minimizing the mean CRPS over the specified training period.

[Figure 6: Reliability diagrams including bootstrap confidence intervals (CI) for the probabilistic forecasts of 24 hr cumulative precipitation for a 48 hr lead time using the raw output of the ensemble system (red), Bayesian model averaging (BMA) (blue), ensemble model output statistics-censored, shifted gamma (EMOS-CSG) (green) and ensemble model output statistics-generalized extreme-value (GEV) (brown) methods.]
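A short Python sketch of the two ingredients above, the left-censored GEV CDF and Gini's mean difference, follows. It is illustrative only: `scipy` stands in for the ensembleMOS package, scipy's GEV shape convention is c = -xi, and the 1/M^2 normalization of the mean difference is an assumption (conventions differ between references).

```python
import numpy as np
from scipy.stats import genextreme

def censored_gev_cdf(y, mu, sigma, xi):
    """CDF of a GEV left-censored at zero (as in Scheuerer, 2014): all
    probability mass of the underlying GEV below zero is assigned to
    the point y = 0.  Note scipy's shape convention c = -xi."""
    if y < 0:
        return 0.0
    return genextreme.cdf(y, c=-xi, loc=mu, scale=sigma)

def gini_mean_difference(f):
    """Gini's mean difference of the ensemble, MD(f), here with a
    1/M^2 normalization over all ordered pairs (an assumption)."""
    f = np.asarray(f, dtype=float)
    return np.abs(f[:, None] - f[None, :]).mean()

p_zero = censored_gev_cdf(0.0, mu=1.0, sigma=1.0, xi=0.1)  # P(no rain)
md = gini_mean_difference([0.0, 2.0])
```

The mean difference feeds the scale equation, so a more spread-out ensemble widens the predictive distribution, which is exactly the mechanism that corrects under-dispersion.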
As an alternative to the above-mentioned approach, Scheuerer and Hamill (2015) suggested modelling precipitation by the CSG distribution with shape $k$, scale $\theta$ and shift $\delta$ parameters as follows:

$$\tilde{F}(y) = \begin{cases} F_k\!\left(\dfrac{y + \delta}{\theta}\right), & y \ge 0, \\ 0, & y < 0, \end{cases}$$

where $F_k$ is the CDF of a gamma distribution with unit scale and shape parameter $k$. Baran and Nemoda (2016) proposed linking the ensemble members to the mean $\mu$ and variance $\sigma^2$ of the underlying gamma distribution via the following equations:

$$\mu = a_0 + a_1 f_1 + \cdots + a_M f_M, \qquad \sigma^2 = b_0 + b_1 \bar{f},$$

where $\bar{f}$ is the ensemble mean, so that $k = \mu^2/\sigma^2$ and $\theta = \sigma^2/\mu$. The parameters $a_0, a_1, \cdots, a_M, b_0$ and $b_1$ are estimated by minimizing the mean CRPS over the specified training period.

[Figure 7: Reliability diagrams including bootstrap confidence intervals (CI) for the probabilistic forecasts of 24 hr cumulative precipitation for a 72 hr lead time using the raw output of the ensemble system (red), Bayesian model averaging (BMA) (blue), ensemble model output statistics-censored, shifted gamma (EMOS-CSG) (green) and ensemble model output statistics-generalized extreme-value (GEV) (brown) methods.]

[Figure 8: Relative operating characteristic (ROC) diagrams for the probabilistic forecasts of 24 hr cumulative precipitation using the raw ensemble forecast, and the Bayesian model averaging (BMA), ensemble model output statistics-censored, shifted gamma (EMOS-CSG) and ensemble model output statistics-generalized extreme-value (GEV) methods for the 24, 48 and 72 hr lead times.]
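The CSG distribution can be sketched directly from a standard gamma CDF. The following Python fragment is an illustration under the moment-matching relations stated above; `scipy` again stands in for the paper's R toolchain, and the example parameter values are hypothetical.

```python
from scipy.stats import gamma

def csg_cdf(y, k, theta, delta):
    """CDF of the censored, shifted gamma (CSG) distribution of
    Scheuerer and Hamill (2015): a gamma distribution with shape k and
    scale theta, shifted left by delta and left-censored at zero."""
    if y < 0:
        return 0.0
    return gamma.cdf((y + delta) / theta, a=k)

def gamma_params_from_moments(mu, var):
    """Shape k and scale theta of a gamma distribution with the given
    mean and variance: k = mu**2 / var, theta = var / mu."""
    return mu ** 2 / var, var / mu

k, theta = gamma_params_from_moments(2.0, 1.0)   # k = 4.0, theta = 0.5
p_zero = csg_cdf(0.0, k, theta, delta=0.5)       # censoring mass at zero
```

The shift delta moves part of the gamma's mass below zero, and the censoring collects that mass at y = 0, which is how the model produces a positive probability of a dry day.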

| VERIFICATION RESULTS OF ENSEMBLE CALIBRATION USING THE BMA AND EMOS
In the present research, several verification methods, namely the PIT and rank histograms, reliability and ROC diagrams, RSS, BS, BSS and CRPS, are used to assess the performance of the ensemble post-processing methods described in Section 3. The test period is from November 28, 2015 to February 28, 2016. In this period, the predictive PDFs of 24 hr accumulated precipitation are estimated by the BMA and EMOS methods, and the observational data are available at 349 irregularly spaced synoptic meteorological stations all over Iran (Figure 1). Figure 3 shows the verification rank histograms of the raw ensemble forecasts of precipitation for the 24, 48 and 72 hr lead times. They are far from uniform and show that the ensemble system is under-dispersive; in other words, the ensemble members are not spread out enough and resemble each other too much (Wilks, 2011).
The PIT histogram (Wilks, 2011) is a common tool to assess the calibration of PDF forecasts: it is the histogram of the predictive CDF evaluated at the verifying observations. The PIT histograms of the BMA and the two versions of the EMOS predictive PDFs are shown in Figure 4. In comparison with the rank histogram of the raw ensemble forecast (Figure 3), it is clear that all the implemented post-processing methods substantially reduce the under-dispersion exhibited by the raw ensemble forecast. However, the PIT histograms of the BMA and EMOS-CSG models are more uniform than that of the EMOS-GEV model. The results of Kolmogorov-Smirnov tests for uniformity of the PIT values are presented in Table 2. The BMA models produce the best PIT histograms for all the lead times. However, the under-representation of the last bin in the PIT histograms shows that the BMA model has a positive bias in predicting 0 values. This is because most of the observations are 0, and, to plot the PIT, the PIT value for a zero observation is chosen randomly from the interval between 0 and the value of the CDF at 0. For an ideal model, the CDF at 0 (when the observation is 0) is expected to be near 1, so the highest frequency of CDF values at 0 should fall in the last bin. In the BMA model, the highest frequency of CDF values at 0 is not in the last bin (between 0.9 and 1). The EMOS-CSG models are also relatively good for the 48 and 72 hr forecasts, while the fit of the EMOS-GEV model is rather poor for the 48 and 72 hr forecasts.

[Table 3: Areas under the relative operating characteristic curves (AUCs) and confidence intervals (CIs) of the AUCs for the probabilistic forecasts of 24 hr cumulative precipitation using the raw ensemble forecast, and the Bayesian model averaging (BMA), ensemble model output statistics-censored, shifted gamma (EMOS-CSG) and ensemble model output statistics-generalized extreme-value (GEV) methods for the 24, 48 and 72 hr lead times.]
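The randomized PIT for a distribution with a point mass at zero can be sketched as follows. This is an illustrative fragment; the toy predictive CDF and its numbers are hypothetical.

```python
import numpy as np

def pit_value(predictive_cdf, obs, rng):
    """Randomized PIT for a forecast distribution with a point mass at
    zero: for a zero observation, draw uniformly from [0, F(0)];
    otherwise evaluate F at the observation."""
    if obs == 0.0:
        return rng.uniform(0.0, predictive_cdf(0.0))
    return predictive_cdf(obs)

rng = np.random.default_rng(0)
# Toy predictive CDF with P(x = 0) = 0.3 (hypothetical numbers).
F = lambda y: min(1.0, 0.3 + 0.7 * y)
p_dry = pit_value(F, 0.0, rng)   # somewhere in [0, 0.3]
p_wet = pit_value(F, 0.5, rng)   # F(0.5) = 0.65
```

If the forecasts are calibrated, these PIT values are uniform on [0, 1]; the randomization step is what makes zero observations compatible with that uniformity check.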
The reliability diagram (Wilks, 2011) is a common diagnostic graph to assess the reliability and resolution of a probabilistic forecast. In the reliability diagram, the relative frequency of an observed event is plotted against its forecast probability; the curve of a well-calibrated probabilistic forecast lies close to the bisector and far from the "no resolution" line. The democratic voting method (Wilks, 2011) is used to determine the probabilistic forecast obtained from the raw ensemble. Figures 5-7 show the reliability diagrams of the probabilistic forecasts of 24 hr accumulated precipitation estimated by democratic voting (raw ensemble forecast), the BMA, the EMOS-CSG and the EMOS-GEV methods for the 24, 48 and 72 hr lead times at thresholds of 0.1, 2.5, 5, 10, 15 and 25 mm. As shown, the predictive PDFs estimated by the raw ensemble and the EMOS-GEV method are not well calibrated and are close to the "no resolution" line, especially at high thresholds. The reliability curve gets close to the bisector when the BMA and EMOS-CSG methods are used at the thresholds of 0.1, 2.5, 5 and 10 mm; in other words, the reliability for light and medium 24 hr accumulated rainfall is considerable.

[Figure 9: Relative operating characteristic skill score (RSS) for the probabilistic forecasts of 24 hr accumulated precipitation of the Bayesian model averaging (BMA), ensemble model output statistics-censored, shifted gamma (EMOS-CSG) and ensemble model output statistics-generalized extreme-value (GEV) methods with respect to the raw ensemble forecasts at different thresholds for the 24, 48 and 72 hr lead times.]

[Figure 10: Brier score (BS) for the probabilistic forecasts of 24 hr cumulative precipitation of the Bayesian model averaging (BMA), ensemble model output statistics-censored, shifted gamma (EMOS-CSG) and ensemble model output statistics-generalized extreme-value (GEV) methods at different thresholds for the 24, 48 and 72 hr lead times.]
At the thresholds of 15 and 25 mm, the reliability curve of the BMA method is better calibrated than that of the EMOS-CSG method. The BMA and EMOS-CSG methods have higher resolution than the raw output of the ensemble system and the EMOS-GEV method at all lead times. Generally, the BMA method provides more reliable 24, 48 and 72 hr probabilistic forecasts than the EMOS-CSG and EMOS-GEV methods and the raw ensemble forecast over the studied domain.
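The bookkeeping behind a reliability diagram, democratic-voting probabilities binned against observed frequencies, can be sketched in Python. All numbers in the example are toy values, and the equal-width binning is an assumption.

```python
import numpy as np

def democratic_vote(ensemble, threshold):
    """Raw-ensemble event probability: the fraction of members at or
    above the threshold (democratic voting)."""
    return np.mean(np.asarray(ensemble) >= threshold, axis=-1)

def reliability_points(probs, obs, n_bins=10):
    """Per probability bin, the mean forecast probability and the
    observed relative frequency of the event (the reliability curve)."""
    probs, obs = np.asarray(probs, float), np.asarray(obs, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    which = np.clip(np.digitize(probs, edges) - 1, 0, n_bins - 1)
    return [(probs[which == b].mean(), obs[which == b].mean())
            for b in range(n_bins) if np.any(which == b)]

p = democratic_vote(np.array([0.0, 1.0, 6.0, 12.0]), 5.0)  # 2 of 4 members
pts = reliability_points([0.05, 0.05, 0.95, 0.95], [0, 0, 1, 1])
```

A perfectly reliable forecast yields points on the bisector: within each bin, the observed frequency matches the mean forecast probability.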
The ROC diagram (Wilks, 2011) measures the ability of the forecast to discriminate between events and non-events, and plots the hit rate (H) against the false-alarm rate (F). Corresponding pairs (F, H) are depicted on the graph and connected by line segments. The points (0, 0) and (1, 1) correspond to "never forecasting" and "always forecasting", respectively. For forecasts with good discrimination, the ROC curve approaches the top-left corner; for forecasts with poor discrimination, it lies close to the F = H diagonal (Wilks, 2006).
The ROC diagrams of the calibrated forecasts using the BMA, EMOS-CSG and EMOS-GEV methods, and of the raw ensemble forecast, for the 24, 48 and 72 hr lead times at different thresholds are shown in Figure 8. Table 3 shows the 95% confidence intervals for the area under the ROC curve (AUC) for the BMA, EMOS-CSG and EMOS-GEV methods; the confidence intervals are determined by bootstrapping. The ROC diagrams show that, for all lead times and thresholds, the BMA and EMOS methods discriminate better than the raw ensemble forecasts. The order of the methods from good to poor discrimination is EMOS-CSG, BMA and EMOS-GEV. Table 3 shows that the discrimination of the EMOS-GEV method is significantly poorer than that of the BMA and EMOS-CSG methods.
The RSS compares the area under the ROC curve (AUC) of a probabilistic forecast with that of a reference forecast, and is defined as:

$$\mathrm{RSS} = \frac{A - A_{\mathrm{ref}}}{1 - A_{\mathrm{ref}}},$$

where $A$ and $A_{\mathrm{ref}}$ are the AUCs of the forecast and the reference forecast, respectively, and a perfect forecast has an AUC of 1. The RSS values of the post-processed forecasts, with the raw ensemble as the reference, are shown in Figure 9. A higher RSS indicates better discrimination compared with the raw ensemble forecasts. The RSS values at all thresholds for the 24, 48 and 72 hr forecasts of the BMA and EMOS-CSG methods are higher than those of the EMOS-GEV method. The RSS values are not high at the low thresholds because the raw ensemble forecasts already have good discrimination there, as shown in Figure 8.
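The (F, H) pairs and the skill score can be computed as below. This is an illustrative sketch with toy data; the skill-score form with a perfect AUC of 1 follows the definition given above.

```python
import numpy as np

def hit_false_alarm(probs, obs, p_threshold):
    """Hit rate H and false-alarm rate F at one probability threshold:
    a warning is issued when the forecast probability >= p_threshold."""
    probs, obs = np.asarray(probs, float), np.asarray(obs, int)
    warn = probs >= p_threshold
    H = np.sum(warn & (obs == 1)) / max(np.sum(obs == 1), 1)
    F = np.sum(warn & (obs == 0)) / max(np.sum(obs == 0), 1)
    return H, F

def roc_skill_score(auc, auc_ref):
    """RSS in the usual skill-score form, with a perfect AUC of 1."""
    return (auc - auc_ref) / (1.0 - auc_ref)

H, F = hit_false_alarm([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0], 0.5)
rss = roc_skill_score(0.9, 0.8)
```

Sweeping `p_threshold` from 1 down to 0 traces the ROC curve from (0, 0) to (1, 1); the AUCs of the candidate and reference curves then feed the RSS.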
It should be noted that the ROC is summarized over all stations and the whole verification period. Therefore, the ROC and RSS results may suffer from the effect described by Hamill and Juras (2006).
The BS (Brier, 1950) is a common verification measure for probabilistic forecasts of binary events (Wilks, 2011). It is the mean square of the differences between the forecast probabilities and the corresponding binary events:

$$\mathrm{BS} = \frac{1}{N} \sum_{n=1}^{N} (y_n - o_n)^2,$$

where $N$ is the number of events, $y_n$ is the forecast probability and $o_n$ is the binary outcome of the $n$th event ($o_n = 1$ if the event is observed, and $o_n = 0$ otherwise). A lower BS is desirable; BS = 0 indicates a perfect forecast. The BSS assesses the performance of a probabilistic forecast relative to a reference forecast:

$$\mathrm{BSS} = \frac{\mathrm{BS}(\mathrm{ref}) - \mathrm{BS}}{\mathrm{BS}(\mathrm{ref}) - \mathrm{BS}(\mathrm{perfect})} = 1 - \frac{\mathrm{BS}}{\mathrm{BS}(\mathrm{ref})},$$

where BS(perfect) = 0 and the raw ensemble forecast is taken as the reference forecast in the present study. The BSS of a perfect probabilistic forecast is 1.
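The two scores translate directly into code. A minimal sketch with toy values:

```python
import numpy as np

def brier_score(probs, obs):
    """BS: mean squared difference between forecast probabilities and
    binary outcomes (0/1)."""
    probs, obs = np.asarray(probs, float), np.asarray(obs, float)
    return np.mean((probs - obs) ** 2)

def brier_skill_score(bs, bs_ref):
    """BSS relative to a reference forecast (BS of a perfect forecast = 0);
    positive values mean an improvement over the reference."""
    return 1.0 - bs / bs_ref

bs = brier_score([0.5, 0.5], [1, 0])   # climatology-like 50% forecasts
bss = brier_skill_score(0.1, 0.2)      # halving the reference BS
```

In the paper's setting, `bs_ref` is the BS of the democratic-voting probabilities of the raw ensemble, so a positive BSS means the post-processed forecast beats the raw ensemble at that threshold.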
The BS and BSS values for the probabilistic forecasts of 24 hr accumulated precipitation at different thresholds for the 24, 48 and 72 hr lead times are presented in Figures 10 and 11. As shown in Figure 11, the BSS values of the BMA, EMOS-CSG and EMOS-GEV methods for the 24, 48 and 72 hr forecasts are positive, and these forecasts are thus better calibrated than the raw ensemble forecasts. The BMA and EMOS-CSG methods are more skilful than the EMOS-GEV method, especially for the 24 hr lead time. The CRPS is calculated to assess the overall performance of the probabilistic forecasts:

$$\mathrm{CRPS}(F, x) = \int_{-\infty}^{\infty} \left( F(y) - I_{\{y \ge x\}} \right)^2 \mathrm{d}y,$$

where $F(y)$ is the predictive CDF; $x$ is the verifying observation; and $I_{\{y \ge x\}}$ is a Heaviside function that equals 1 when $y \ge x$, and 0 otherwise. Table 5 gives the CRPS values of the probabilistic forecasts estimated by the BMA and EMOS models and the raw ensemble, together with 95% bootstrap confidence intervals for the CRPS values. All three methods improve on the raw ensemble forecasts, and the CRPS values of the BMA and EMOS-CSG methods are significantly lower than those of the EMOS-GEV method at all lead times.
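For an arbitrary predictive CDF, the CRPS integral can be approximated numerically. The sketch below is illustrative (the integration range and grid are assumptions, adequate for a precipitation CDF that is effectively 1 well below the upper limit); it also checks the well-known property that, for a deterministic forecast, the CRPS reduces to the absolute error.

```python
import numpy as np

def crps_numeric(predictive_cdf, obs, upper=200.0, n=20001):
    """Trapezoidal approximation of
    CRPS(F, x) = integral of (F(y) - 1{y >= x})**2 dy over [0, upper]."""
    y = np.linspace(0.0, upper, n)
    Fy = np.array([predictive_cdf(v) for v in y])
    Hy = (y >= obs).astype(float)            # Heaviside term
    g = (Fy - Hy) ** 2
    return float(np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(y)))

# Deterministic forecast of 3 mm (a step CDF) against an observation of
# 5 mm: the CRPS reduces to the absolute error of 2 mm.
step_cdf = lambda v: 1.0 if v >= 3.0 else 0.0
crps = crps_numeric(step_cdf, 5.0)
```

In practice, closed-form CRPS expressions exist for the gamma, GEV and CSG families and are what CRPS-minimization routines use; the numerical form above is only a generic cross-check.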

| CONCLUSIONS
In the present paper, different versions of the ensemble model output statistics (EMOS) and Bayesian model averaging (BMA) methods are compared in order to improve the calibration of precipitation forecasts of the Weather Research and Forecasting (WRF) system in Iran. First, the WRF raw ensemble was shown to be under-dispersive and thus uncalibrated. Second, the BMA and two versions of the EMOS were applied, and standard measures for comparing them were computed.
Overall, the results showed that the BMA and EMOS censored, shifted gamma (EMOS-CSG) techniques are significantly successful at improving the raw WRF ensemble forecasts. However, each approach improves different aspects of forecast quality. The BMA method is more accurate, skilful and reliable than the EMOS-CSG method, but has poorer discrimination. Moreover, the BMA method gave better resolution in predicting the probability of high-precipitation events than the EMOS-CSG method.
Future studies could investigate modern post-processing techniques such as machine-learning or blending approaches. As discussed by Vannitsem et al. (2020), post-processing methods such as the BMA and EMOS are easy to implement, whereas neural networks or analogue methods are much more flexible. Another possibility, therefore, is to assess the trade-off between complexity and flexibility.