Statistical post-processing of heat index ensemble forecasts: is there a royal road?

We investigate the effect of statistical post-processing on the probabilistic skill of discomfort index (DI) and indoor wet-bulb globe temperature (WBGTid) ensemble forecasts, both calculated from the corresponding forecasts of temperature and dew point temperature. Two different methodological approaches to calibration are compared. In the first case, we start with joint post-processing of the temperature and dew point forecasts and then create calibrated samples of DI and WBGTid using samples from the obtained bivariate predictive distributions. This approach is compared with direct post-processing of the heat index ensemble forecasts. For this purpose, a novel ensemble model output statistics model based on a generalized extreme value distribution is proposed. The predictive performance of both methods is tested on the operational temperature and dew point ensemble forecasts of the European Centre for Medium-Range Weather Forecasts and the corresponding forecasts of DI and WBGTid. For short lead times (up to day 6), both approaches significantly improve the forecast skill. Among the competing post-processing methods, direct calibration of heat indices exhibits the best predictive performance, very closely followed by the more general approach based on joint calibration of temperature and dew point temperature. Additionally, a machine learning approach is tested and shows comparable performance for the case when one is interested only in forecasting heat index warning level categories.


Introduction
In this century, extreme temperatures impacted 97 million people causing over 40 billion Euro economic damage (EM-DAT, 2019). In particular, heat stress has detrimental effect on human health and human activities. In order to mitigate the effect of heat stress, dedicated warning systems are developed (Morabito et al., 2019;McGregor et al., 2015). These systems require meteorological forecasts from which relevant indicators expressing the impact of heat stress on humans are determined (Di Napoli et al., 2019). A large number of different indices of various complexity exists and even a comparison of 165 different ones demonstrated that there is no universal acceptance of a single index (de Freitas and Grigorieva, 2017). In this paper, the focus is on two indices which can be easily derived from standard outputs of weather forecast models and are commonly used: the discomfort index (DI) and the indoor version of the wet-bulb globe temperature (WBGTid). Both are function of temperature and dew point temperature, but have been chosen as they differ in their formulation and complexity which may have an impact on the final findings.
In our study, forecasts of DI and WBGTid are derived from the ensemble prediction system run at the European Centre for Medium-Range Weather Forecasts (ECMWF). Ensemble predictions provide different scenarios of the future recognising the uncertainties inherit in weather forecasting (Leutbecher and Palmer, 2008). The derived probabilistic forecasts are an essential tool to support forecast-based decision making (Fundel et al., 2019). It has been already established that heat indices forecasts based on ensembles are skillful with respect to climatology up to 10 days (Pappenberger et al., 2015). However, the value of these forecast can be increased by correcting the forecasts through so called statistical post-processing taking account of inefficiencies in representing the aforementioned uncertainties and systematic biases. This also adjusts for the fact that the forecasts are produced representing an average over a spatial area (a grid cell) whilst being used and applied locally on individual locations.
Statistical post-processing methods include parametric models such as Bayesian model averaging (BMA; Raftery et al., 2005) and ensemble model output statistics (EMOS; Gneiting et al., 2005), providing full predictive distributions of the weather variables at hand. We focus on the EMOS approach, which specifies the predictive distribution by a single parametric law with parameters connected to the ensemble members via appropriate link functions. Postprocessing methods are often applied to individual weather quantities (Wilks, 2018) although the dependence structure between different weather variables is of crucial importance in the calculation of heat stress indicators. This is acknowledged and handled if bivariate EMOS models (Schuhen et al., 2012;Baran and Möller, 2017) or other approaches to joint calibrations such as the ensemble copula coupling (ECC) method of Schefzik et al. (2013) are deployed. More advanced methods (see e.g. Ben Bouallègue et al., 2016;Barnes et al., 2019) exist, but are not investigated here. This paper explores the fundamental question whether it is more efficient to base the calibration on the two input forecasts (temperature and dew point temperature) or rather to Figure 1: An overview of the different post-processing approaches calibrate directly the end-product (the heat index). We apply these two different methodological approaches to the calibration of DI and WBGTid ensemble forecasts. On the one hand, we first calibrate ensemble forecast vectors of temperature and dew point temperature, either with a bivariate EMOS model or with ECC combined with independent univariate EMOS models. The calibrated DI and WBGTid forecasts are derived from these calibrated input forecasts. On the other hand, we develop a new EMOS model for post-processing both DI and WBGTid ensemble forecasts, where the predictive PDFs follow a generalized extreme value (GEV) distribution. The benchmark method consists in a simple adjustment of the weather quantities to account for the scale mismatch between model output and point observations. A general overview of the applied procedures is given in Figure 1. The different approaches to post-processing are ranked in terms of performance; their implementation and limitations are also discussed.
Finally, we envisage the situation where one is interested only in the prediction of the warning level category of a heat index. In this case, statistical calibration can be considered as a classification problem resulting in forecast probabilities for the different categories. Besides obtaining the predictive distributions using the above-mentioned calibration methods, we also test the forecast skill of a general machine learning based approach.
The paper is organized as follows. Section 2 provides the fundamental formulae for the calculation of DI and WBGTid, followed by a description of the temperature and dew point temperature data sets. The applied calibration methods (including the novel GEV EMOS approach) are given in Section 3, and the verification procedure is discussed in Section 4. Section 5 presents the results of our study devoting separate sections to bivariate and univariate post-processing, and to calibration of discrete forecasts. Some figures are placed in the Appendix in order to improve readability. Finally, lessons learned and recommendations can be found in Section 6.

Heat index definitions
We start with the formal definition of the two heat indices of interest, which are both functions of two weather quantities: temperature and dew point (DP) temperature. Discomfort index I d is calculated from temperature T ( • C) and relative humidity (RH) H r (%) as (2.1) RH can easily be expressed using temperature and DP temperature T dp ( • C) using Magnus formula with constants suggested by Sonntag (1990) H r := 100 exp 17.62 T dp 243.12 + T dp − 17.62 T 243.12 + T . (2.2) Indoor wet-bulb globe temperature T W BGid is calculated as where T pwb denotes the psychrometric wet-bulb (PWB) temperature ( • C). One can obtain this quantity using Bernard's formula (Bernard and Pourmoghani, 1999) providing it as a solution of equation where P d = 6.106 exp 17.27 T dp 237.3 + T dp and P w = 6.106 exp 17.27 T pwb 237.3 + T pwb are the saturation vapour pressures (hPa) at DP temperature T dp and PWB temperature T pwb , respectively (Lemke and Kjellstrom, 2012). It results that WBGTid is also a function of temperature and dew point and can be computed using e.g. the HeatStress package of R (Casanueva, 2019 Both DI and WBGTid observations are negatively skewed (skewnesses are −0.415 and −0.286, respectively). This should be taken into account in the choice of the parametric distribution family for EMOS post-processing. The left-skewed character of both indices is visible in Figure 2 showing the corresponding climatological histograms together with the PDFs of generalized extreme value (GEV), skew normal and split normal distributions (with matching means, variances and skewnesses). These three parametric distributions are potential candidates for EMOS modeling (see Section 3.3).

Ensemble forecasts
The operational ensemble run at ECMWF comprises 50 perturbed members. We focus on forecasts initialized at 12 UTC and we consider 1 -15 day lead times. For each observation, we associate the forecast at the nearest grid point. In order to account for the difference between model and station elevations, orographic correction is applied to both raw temperature and dew point forecasts. Practically, we add 0.0065∆ e ( • C) to the forecasts, with ∆ e the altitude difference between station and model representation. Temperature and dew point forecasts are used to derive the corresponding raw ensemble forecasts of DI and WBGTid.
Raw forecasts provide information on a grid (here the model-grid resolution is ∼ 18 km), whereas observations are point measurements. This scale mismatch leads to representativeness error in the forecast. The raw forecast uncertainty, valid at the model-resolution scale, can be increased in order to capture the temperature variability at smaller spatial scale. The method followed here is a simple down-scaling step, inspired by the pioneer work of Saetra et al. (2004): it consists in adding to each member a draw from a centered Gaussian distribution with standard deviation This formula is derived from the analysis of 2 m temperature measurements of a high-density observation network (Ben Bouallègue, 2020). The spread adjustment is applied independently to raw temperature and dew point ensemble forecasts, followed by a reordering of the adjusted forecasts using the rank of the raw forecasts before the calculation of DI and WBGTid forecasts (see Figure 1). These corrected forecasts are referred to as the adjusted ensemble and are used as a reference for the computation of skill scores (see Section 4.2).

Calibration methods
Our main approach to calibration is the computationally efficient EMOS method that has demonstrated excellent predictive performance for a wide range of weather quantities. In what follows, we denote f 1 , f 2 , . . . , f K the ensemble forecast of a given weather quantity for a given location, time and lead time, and f the ensemble mean. S 2 and MD denote the ensemble variance and the ensemble mean absolute difference, respectively, defined as Multivariate ensemble forecasts are distinguished by bold notation f 1 , f 2 , . . . , f K , and, in this case, the ensemble variance is replaced by the ensemble covariance matrix The 50 members of the operational ECMWF ensemble system are considered as statistically indistinguishable and, in this sense, exchangeable. In the following, if we have M ensemble members divided into K exchangeable groups, where the kth group contains M k ≥ 1 ensemble members ( K k=1 M k = M), notation f k (f k ) will be used for the corresponding group mean (vector).

Normal EMOS model
Calibration of temperature and dew point ensemble forecasts is based on the EMOS model suggested by Gneiting et al. (2005). The predictive distribution is a Gaussian distribution N (µ, σ 2 ) with mean µ and variance σ 2 linked to the ensemble members via equations When an ensemble contains groups of statistically indistinguishable ensemble members, members within a given group should share the same parameters (Gneiting, 2014;Wilks, 2018) and the equation for the mean in (3.1) is replaced by

Bivariate normal EMOS model
The separate univariate EMOS calibration of temperature and dew point ensemble forecasts does not take into account the obvious dependence between these two weather quantities. In order to overcome this limitation, one can consider the joint post-processing of these two variables using an EMOS model with a bivariate normal predictive distribution N 2 µ, Σ), where the mean vector µ and covariance matrix Σ are expressed as respectively. Here A ∈ R 2 , whereas B 1 , . . . , B K and C, D are two-by-two real parameter matrices, where C is lower triangular. This way, CC ⊤ is symmetric and non-negative definite. In the case of existence of groups of exchangeable ensemble members, the bivariate generalization of (3.2) is Note that a bivariate EMOS model with a different parametrization of the covariance matrix has already been successfully applied for the calibration of wind vector ensemble forecasts (Schuhen et al., 2012).

Generalized extreme value EMOS model
In contrast to temperature and dew point (where the symmetric Gaussian distribution provides a reasonable fit), both DI and WBGTid observation distributions result in left-skewed histograms, calling for predictive PDFs with negative skewness. Potential candidates are the following: a skew normal, a split normal or a GEV distribution (see Figure 2). For these three competing laws, the GEV seems to best capture the tail behaviour of the observations (at least for the DI and WBGTid observations at hand).
A GEV distribution GEV µ, σ, ξ with location µ, scale σ > 0 and shape ξ can be characterized by its cumulative distribution function (CDF) if 1 + ξ(x−µ)/σ > 0 and zero otherwise, which for ξ < 1 has a finite mean. For calibrating ensemble forecasts of DI and WBGTid, we suggest modeling location and scale parameters by µ = α + β 1 f 1 + · · · + β K f K and σ := γ 2 + δ 2 MD (3.6) with α, β 1 , . . . , β K , γ, δ ∈ R, whereas the shape parameter ξ is considered to be independent of the ensemble members. Similar to (3.2) and (3.4), the exchangeable version of (3.6) considers the location µ as an affine function of the group means: The expression of the mean (or location) as an affine function of the ensemble is a general feature in EMOS post-processing regardless of the weather quantity at hand (see e.g. Gneiting et al., 2005;Thorarinsdottir and Gneiting, 2010;Baran and Nemoda, 2016), whereas the dependence of the scale on the ensemble mean difference is similar to the censored GEV model of Scheuerer (2014) for calibrating precipitation ensemble forecasts. Exploratory tests with the data set at hand show that the proposed model significantly outperforms the GEV EMOS model with scale depending on the ensemble mean, that is when σ = γ 2 + δ 2 f (see e.g. Lerch and Thorarinsdottir, 2013), and results in slightly (and not always significantly) better predictive performance than its natural modifications involving the ensemble variance S 2 , such as

Parameter estimation
Parameters of the normal and GEV EMOS models described in Sections 3.1 -3.3 are estimated according to the optimum score estimation principle of Gneiting and Raftery (2007). The estimates of the unknown parameters are obtained by optimizing the mean value of a proper scoring rule over an appropriate training data-set. The most popular proper scoring rules are the logarithmic (or ignorance) score (LogS; Good, 1952), that is the negative logarithm of the predictive PDF at the verifying observation and the continuous ranked probability score (CRPS; Gneiting and Raftery, 2007;Wilks, 2011). The CRPS corresponding to a (predictive) CDF F (y) and a real value (observation) x is defined as where I H denotes the indicator of a set H, whereas X and X ′ are independent random variables with CDF F and finite first moment. The last representation in (3.8) implies that the CRPS can be expressed in the same unit as the observation. Both CRPS and LogS are negatively oriented scoring rules (the smaller the better), and the optimization with respect to LogS gives back the maximum likelihood (ML) estimation of the parameters. Furthermore, the CRPS for both normal and GEV distributions can be expressed in closed form (for the exact formulae see Thorarinsdottir and Gneiting (2010) and Friedrichs and , respectively), thereby allowing for efficient parameter estimation procedures often being more robust than the ML approach.
The logarithmic score is obviously well defined also in the case of multivariate predictive distributions. Further, the direct multivariate extension of the CRPS is the energy score (ES) introduced by Gneiting and Raftery (2007). Given a CDF F on R d and a d-dimensional vector x, the energy score is defined as where · denotes the Euclidean distance and, similar to the univariate case, X and X ′ are independent random vectors having distribution F . In most cases (including the case of the Gaussian model), ES cannot be given in a closed form, so it should be replaced by a Monte Carlo approximation based on a large sample drawn from F (Gneiting et al., 2008). In our study, we apply ML estimation for bivariate models, whereas parameters of all univariate EMOS models are obtained by optimizing the CRPS. In the latter case, tests using ML estimation have not demonstrated significantly better results.
The appropriate choice of training data is also an important issue in statistical postprocessing. In EMOS modelling, the standard approach is the use of rolling training periods where the parameter estimates are obtained using ensemble forecasts and corresponding validating observations for the preceding n calendar days. Once the appropriate training period length is chosen, there are two general approaches to spatial selection of training data (Thorarinsdottir and Gneiting, 2010). In the local approach, model parameters are estimated using only training data of the given station, resulting in distinct parameter estimates for the different stations. The global (regional) approach uses training data of the whole ensemble domain and all observation stations share the same set of EMOS parameters. Global modelling requires a far shorter training period but is usually unsuitable for large and heterogeneous data-sets like the one at hand. Alternatively, one can combine the advantages of local and global parameter estimation using a semi-local approach (Lerch and Baran, 2017): the training data for a given station is enhanced with data from stations with similar characteristics. Following Baran et al. (2019), clustering based semi-local modelling where regional parameter estimation is also considered. The clusters are derived using k-means clustering of feature vectors depending both on the station climatology and the forecast errors of the raw ensemble mean over the training period.

Ensemble copula coupling
The ensemble copula coupling (ECC) approach of Schefzik et al. (2013) is a general multivariate post-processing method, where after univariate calibration, the rank order information in the raw ensemble is used to restore correlations between the calibrated forecasts. In our case, we first perform univariate EMOS calibration of temperature and dew point forecasts using the normal EMOS model of Section 3.1. Then we draw N random samples of the same size as the raw ensemble from both individual EMOS predictive distributions (ECC-R; see e.g. Schefzik, 2016) and reorder them to have the same rankings as the corresponding raw temperature and dew point forecasts. The obtained bivariate sample is considered as a sample from a calibrated bivariate predictive distribution of temperature and dew point.
In practice, when applying the ECC approach, 20 samples of the size of the raw ensemble (50) are drawn from the EMOS predictive distributions of temperature and dew point. Each sample is then reordered according to the rank structure of the corresponding raw ensemble forecasts resulting in 1000 draws from the ECC predictive distribution, which can be used for forecast evaluation. In a similar way, verification scores of the bivariate EMOS models are also estimated using 1000 samples from the bivariate predictive distributions.

Proper scores
For a given lead time, competing forecasts in terms of probability distribution are compared with the help of the mean CRPS in the univariate case and mean ES value in the bivariate case over all forecast cases in the verification data. The skill of univariate forecasts for dichotomous events (observation x exceeding a given threshold y) is assessed with the mean Brier score (BS). For a predictive CDF F (y), the Brier score is defined as The integral of BS over all possible thresholds results in the CRPS. The goodness of fit of probabilistic forecasts to extreme values of the univariate weather quantity is evaluated using the threshold-weighted continuous ranked probability score (twCRPS) where ω(y) ≥ 0 is a weight function (Gneiting and Ranjan, 2011). The case where ω(y) ≡ 1 corresponds to the CRPS defined in (3.8). In the following, we set ω(y) = I {y≥r} in order to focus exclusively on values above a given threshold r.

Skill scores
The improvement in terms of a given score S F for a forecast F with respect to a reference forecast F ref can be quantified using the corresponding skill score where S F and S F ref denote the mean score values corresponding to forecasts F and F ref , respectively, and S F perf corresponds to the mean score value of a perfect deterministic forecast (with here S F perf = 0). Based on (4.1), we compute the continuous ranked probability skill score (CRPSS), the energy skill score (ESS), the Brier skill score (BSS) and the threshold-weighted continuous ranked probability skill score (twCRPSS). Skill scores are positively oriented (the larger the better), and in Section 5 the adjusted ensemble is used as a reference.

Calibration assessment
A simple graphical tool for assessing calibration of univariate ensemble forecasts is the verification rank histogram, which is the histogram of ranks of verifying observations with respect to the corresponding ensemble forecasts (see e.g. Wilks, 2011, Section 8.7.2). In the case of a perfectly calibrated K-member ensemble, the ranks follow a uniform distribution on {1, 2, . . . , K + 1}, and the deviation from uniformity can be quantified by the reliability index ∆ defined by where ρ r is the relative frequency of rank r (Delle Monache et al., 2006). One can also generalize the verification rank histogram to multivariate ensemble forecasts. Thorarinsdottir et al. (2016) suggested several options in order to define ranks in a multivariate context. We apply the average ranking given by the average of the univariate ranks and the multivariate ordering proposed by Gneiting et al. (2008). In case of a probabilistic forecast, one can plot the verification rank histogram and calculate corresponding reliability index on the basis of a large number of ensembles sampled from the predictive PDF.
Calibration assessment of univariate predictive distributions is examined with the help of probability integral transform (PIT) histograms (the continuous counterparts of the verification rank histograms). By definition, PIT is the value of the predictive CDF at the validating observation, with a possible randomization at points of discontinuity (Gneiting and Ranjan, 2013). In case of perfect calibration, PIT follows a uniform law on the [0, 1] interval.

Forecast value
The forecast value compares mean expenses of users having access to different levels of information; it has the form of a skill score (see Section 4.2). The framework is the one of a simple binary decision setting: facing a potential threatening event (i.e. a heat index exceeding a warning level), a user can take (or not) preventive action. This protective action has a cost C, whereas the user can endure a loss L if the event actually occurs but no protective action was taken anticipatively. The ratio C/L is called cost-loss ratio and is considered as a characteristic of the user in the decision-making process. The user takes action based on the forecast: if the forecast probability is greater than the user's cost-loss ratio then preventive action with cost C is taken, otherwise the risk to face a loss L in case of event is taken. This rational behaviour is sometimes referred to as the Bayes action (see Rodwell et al. (2019) for a discussion on the concept and applications). When only climatological information is available, the user's cost-loss ratio is simply compared to the event base rate. In this framework, the forecast value is defined as the ratio of two mean expense differences involving the following elements: the mean expense of the user taking the Bayes action with respect to the forecast (E F ), the mean expense of a user making decision based on the observation climatology (E F clim ), and the mean expense of an omniscient user who is taking action only when the event actually occurs (E F perf ). Formally, the forecast value is defined as which can be expressed in a simple form using the elements of a contingency table (see e.g. Wilks, 2001). In our implementation, the mean expenses of the different type of users are computed independently at each station. This means that we consider a climatology (estimated as the sample base rate) varying from one station to another.

Results
We consider two different methodological approaches to the calibration of DI and WBGTid.
In the first case, we start with joint post-processing of temperature and dew point ensemble forecasts, either using directly a bivariate normal EMOS model or applying first univariate normal EMOS models to both components and then ECC to restore the dependence structure. Samples from the bivariate predictive distributions are then used to obtain samples from the predictive distributions of DI and WBGTid. The second approach is the direct post-processing of DI and WBGTid ensemble forecasts with the help of the novel GEV model introduced in Section 3.3 (see Figure 1).  Figure 3: CRPSS of semi-local EMOS models with respect to the local EMOS for temperature (left) and dew point (right) together with 95 % confidence intervals.

Bivariate calibration of temperature and dew point forecasts
We start with some considerations about the implementation of the EMOS approaches.
The 50 ensemble members are regarded as exchangeable; their separate univariate and joint bivariate calibration is performed using normal EMOS models with distribution means linked to the ensemble means via (3.2) and (3.4), respectively, with K = 1. Thus, the EMOS models require the estimation of 4 free parameters in the univariate case and 13 in the bivariate case. A large number of free parameters calls either for long training periods in local estimation or for a semi-local approach. Both local and clustering-based semi-local parameter estimations with 5, 10, 15, 25, 50 and 75 clusters are tested. Clustering of stations is based on 40 dimensional features equally representing the properties of both temperature and dew point forecasts and observations. In accordance with the results of Lerch and Baran (2017), the forecast skill does not really depend on the number of features applied, provided that one works with a sufficiently large number of features. Moreover, the longer the lead time of the post-processed forecasts, the more training data are needed to outperform the raw ensemble forecasts (see discussion in Feldmann et al., 2019). As a trade-off, the length of the training period is fixed to 60 days. This choice leaves 79 calendar days (period 14 July -30 September 2017) as a verification period when taking into account the maximal lead time of 15 days.
First, calibration of temperature and dew point forecast is performed separately. Figure  3 shows the CRPSS of different semi-local EMOS models with respect to local EMOS. Local post-processing is optimal up to day 5, then semi-local approaches exhibit significantly better predictive performance. Hence, we propose to apply different types of EMOS configurations for different forecast lead times: local EMOS at the beginning of the forecast range and semi-local configuration(s) for longer lead times. Practically, local EMOS is considered for days 1 -5 both for temperature and dew point, while for days 6 -10 we use semi-local EMOS with 75 clusters. For dew point, this latter model is also kept for days 11 -15,   whereas for temperature we reduce the number of clusters to 5. The different mixtures of EMOS configurations are reported in Table 1. In the following, the method referred to as ECC relies on the reported mixed approach for the univariate post-processing step.
In the same spirit as Figure 3, the optimal training configuration is investigated by inspecting the multivariate performance in terms of ESS for the semi-local bivariate EMOS models with respect to the local one. As reported in Table 1, optimal performance in terms of ES can be obtained by applying local modelling for days 1 -5, semi-local with 75 clusters for days 10 -15, and semi-local with 5 clusters for longer lead times (referred to as 2D EMOS optimal ). However, this mixture of EMOS configurations is sub-optimal when applied to derive predictive distributions of DI and WBGTid (see Section 5.2). Better forecast skill can be obtained using local EMOS for days 1 -4, semi-local with 50 clusters for days 5 -7, and semi-local with 5 clusters until the end of the forecast range (referred to as 2D EMOS ).
In Figure 4, the different multivariate post-processing approaches are compared in terms of a multivariate skill score, ESS. Bivariate EMOS and ECC post-processing approaches significantly outperform both the raw ensemble and the adjusted ensemble up to day 9, while the advantage of the adjusted ensemble with respect to the raw ensemble disappears only after day 10. All performance differences shrink with forecast lead time. Bivariate EMOS generally outperforms ECC but both methods have similar performance between day 5 and day 10. When significant difference in mean ES between these two approaches is tested, it appears that the proportion of stations with significant difference (positive or negative) is less than 7% at day 1 but increases to more than 30% at day 10 (not shown).
Reliability of the bivariate forecast is now investigated. Figure 5 presents reliability indices (corresponding to average ranking and multivariate ordering) as a function of the forecast lead time. The reliability indices are rather consistent with one another: similar patterns are found for the two types of ranking (average and multivariate). One exception is the difference between raw and adjusted ensemble results which is much smaller when average ranks are considered. The raw ensemble results in strongly U-shaped histograms at short lead time, while the ensemble adjustment allows to improve reliability in particular with respect to the multivariate ranking. 2D EMOS exhibits the better reliability (lowest reliability indices) up to day 8 and fairly good results for longer lead times. However, EMOS post-processed forecasts are slightly biased towards the end of the forecast range, producing low multivariate ranks of the verifying observations (see Figure 10 and 11 of Appendix A). ECC forecasts are highly hump shaped at short lead times, rather well calibrated at medium lead times, and U-shaped for long lead times. Post-processing approaches applied here fail to improve reliability of the ensemble beyond day 10, which is in line with their predictive performance in terms of the ES.
In a nutshell, bivariate post-processing can improve multivariate forecasts of temperature and dew point, and in particular reliability, in a multivariate sense, is increased. However, the post-processing methods applied here do not explicitly account for any relationship between temperature and dew point respecting the laws of physics. This limitation results sometimes in situations where a dew point forecast is greater than a temperature forecast. In that case, the former is set equal to the latter in order to avoid physical inconsistencies and allow the computation of the heat indices. This situation occurs less than 3 % of the cases when applying 2D EMOS and can be more frequent when simply adjusting the ensemble forecasts.

Post-processing of discomfort index and indoor wet-bulb globe temperature forecasts
Calibrated DI and WBGTid forecasts can be derived from the jointly calibrated temperature and dew point forecasts, but DI and WBGTid ensemble forecasts can also be directly postprocessed using the GEV EMOS model introduced in Section 3.3. The GEV model has only 5 free parameters in the case of a single group of exchangeable ensemble members. Considering a 60-day rolling training period and the same verification period 14 July -30 September 2017 as before, we inspect the CRPSS of various semi-local EMOS configurations with respect to local EMOS for DI and WBGTid (not shown). Based on this exploratory analysis, we propose to adapt the EMOS configuration as a function of the forecast lead time as follows: for both indices local EMOS is applied between day 1 and 4, semi-local EMOS with 75 and 5 clusters for days 5 -8 and 9 -15, respectively. This model is referred to as GEV EMOS (see Table 1).
In Figure 6, results in terms of CRPSS indicate that post-processing significantly improves both the raw and adjusted ensembles up to day 7 for DI and day 6 for WBGTid. For both indices, 2D EMOS post-processed forecasts have the best predictive skill, followed closely by GEV EMOS forecasts, whereas ECC shows the worst predictive performance among the calibration approaches. However, the differences between models are minor and often nonsignificant. For lead times up to day 7, the differences in CRPS between the competing EMOS forecasts are significant at a 5 % level for less than 6 % of the stations. This later number increases to more than 15% when differences are computed against the adjusted ensemble forecasts (not shown).
Verification rank and PIT histograms help diagnosing reliability issue in the forecast.  Results for DI are shown in Figures 12 of Appendix B, while histograms for WBGTid, which are very similar, are not shown. For both indices, the raw ensemble forecasts are highly underdispersive for day 1; however, the dispersion improves with the forecast lead time. Moreover, raw forecasts exhibit a bimodal character, which is in line with the bivariate histograms of raw temperature and dew point forecasts. Only local GEV EMOS calibration removes this bimodality and in general, similar to bivariate forecasts, the longer the lead time, the smaller the effect of post-processing.
Another perspective is now taken focusing on the forecast skill at the right tail of the observation distributions. Figure 7 (top panel) depicts BSS and twCRPSS considering the event DI exceeding r = 27 • C. This threshold corresponds to the level at which most of the population suffers from heat-induced discomfort (Stathopoulou et al., 2005). In terms of BSS, all EMOS approaches outperform the raw and adjusted ensemble up to day 10 with comparable skill. However, in terms of twCRPS, the GEV EMOS model clearly outperforms the other post-processing methods. For lower thresholds, the differences between the EMOS approaches is less pronounced, while for higher thresholds the results are noisier and the significance bars are larger (not shown). For WBGTid, a threshold r = 27.8 • C is considered. It corresponds to the threshold associated with the first warning level (green flag) (Patel et al., 2013). Figure 7 (bottom panel) shows BSS and twCRPSS: post-processed forecasts significantly outperform the raw and adjusted ensemble forecasts up to at least day 8, whereas ECC exhibits positive skill score up to day 10. Considering larger thresholds, small BS and twCRPS values combined with large uncertainty in the scores leads to quite noisy results which are difficult to interpret (not shown). The change in EMOS configuration (from local to semi-local) is clearly visible in Figure 7: jumps appear at day 5 and 8 for 2D EMOS, at day 5 and 9 for GEV EMOS, while the curves corresponding to ECC remain smooth.
One more step towards refined performance analysis is taken by looking at the forecast value. For the two events considered above, forecast value is plotted as a function of the user cost-loss ratio in Figure 8. As a general feature, not only the forecast value, but also the number of users that can benefit from the forecast, decreases with the forecast lead time. At day 10, the forecast of WBGTid has no value for a user who needs high certainty about the event occurrence (high cost-loss ratios) but it is still valuable for a user who is risk adverse (low cost-loss ratios). The post-processed forecasts have value for some users even   Stathopoulou et al. (2005) and Patel et al. (2013), respectively. beyond day 10 (not shown). As a purely statistical characteristic of the score, the forecast value reaches a maximum for a user with cost-loss ratio equal to the climatological base rate. This is true locally, but the maximum value varies from one location to another and the corresponding pick is therefore smoothed out when aggregated over different locations.
In any case, the event for DI is more probable than the event for WBGTid, which explains that the maximum is reached for different C/L values in the two cases. The raw ensemble has negative value for extreme C/L for both DI and WBGTid forecasts at 4 and 10 days, but the ensemble adjustment allows to make the forecasts valuable for (almost) all types of users. Forecast post-processing increases the forecast value further in most of the cases. There are also slight differences in terms of forecast value between the different calibration approaches.

Predictive distributions of categories of heat indices
Heat indices are often grouped into categories according to the severity of the effect of heat stress on human activity. Table 2 provides a classification of DI and heat categories of WBGTid based on Stathopoulou et al. (2005) and Patel et al. (2013), respectively. Instead of continuous heat indices, we focus now on the corresponding categories taking values {1, 2, 3, 4, 5, 6} for DI and {1, 2, 3, 4, 5} for WBGTid. In this case, a predictive distribution is specified by a probability mass function (PMF) and post-processing reduces to a classification problem, which is a typical application area of machine learning methods. Here, predictive distributions are obtained with the help of multilayer perceptron neural networks (MLP NN; Hastie et al., 2009  Predictive PMFs of the DI and WBGTid categories are obtained applying MLP NNs with a single hidden layer and cross-entropy as loss function. In our case study, the different categories are highly imbalanced: the proportions of observations in the first category in Table 2 is ∼ 67 % for DI and ∼ 99 % for WBGTid, whereas the last category in Table 2 represents less than 1 o / oo in both cases. For this reason, we also consider a two step procedure (referred to as dual net). In the first step a binary classification is performed, which for each feature vector results in the distribution (p 1 , 1 − p 1 ) of the binary outcome of belonging to the first (richest) category or not. Then a second net is trained using only feature vectors coming from the remaining κ classes, where κ = 5 for DI and κ = 4 for WBGTid. For each feature set this step yields a distribution (q 1 , . . . , q κ ) of the corresponding classes and the final distribution of index categories is obtained as p 1 , (1 − p 1 )q 1 , . . . , (1 − p 1 )q κ .
Similar to the EMOS approaches, different training configurations are envisaged for different lead times. We tested the performance of single nets with 15,20,25,30,35,40 and 50 neurons in the hidden layer, both with the nine-dimensional primary features and with extended feature vectors containing forecast errors of either 1 or 5 days. In the case of dual nets, we used 15, 25 and 40 neurons in the hidden layer of both components and considered the primary features plus the 1 day forecast error. In general, a large number of neurons leads to overfitting, and according to our observations, for longer lead times the use of forecast errors does not improve the predictive performance, while the dual net can handle better the larger uncertainty in the forecasts. For DI categories, the following configuration is selected: a single net with 40 neurons making use of the forecast errors for 5 days as additional predictors for days 1 -4, single net with 15 neurons for days 5 -8, and dual net with 15 -15 neurons for days 9 -15 (see Table 3). For WBGTid categories, single  Figure 9: CRPSS with respect to the adjusted ensemble of DI (left) and WBGTid (right) categories together with 95 % confidence intervals.
networks with medium number of neurons are preferred, and moreover, the use of forecast errors has a longer lasting effect. So, the selected configuration is the following: for days 1 -8 we consider 30 neurons and 5-day forecast errors, for days 9 -12 only the primary features and 35 neurons, whereas for the last 3 days we just reduce the number of neurons to 30 (see Table 3). In the following discussion these two blends are referred to as MLP NN .
In Figure 9, CRPS is applied to forecast and observation categories. All calibration approaches significantly outperform both the raw and the adjusted forecasts of DI categories till day 6. The results for the different EMOS models are in line with those of the continuous forecasts ( Figure 6, left panel). The robust and rather generic MLP NN method is outperformed by the parametric models; however, the differences are often non-significant. A different ranking of the calibration methods can be observed in the right panel of Figure 9, which shows a high level of similarities with BSS and twCRPSS plots in the bottom panel of Figure 7. The GEV and 2D EMOS forecast skill jumps at the lead times where the training configurations are changed. After day 7, 2D EMOS loses its skill against the adjusted ensemble while the machine learning approach shows a reasonable predictive performance up to day 10. Among the different post-processing approaches, MLP NN provides the sharpest predictive PMF in terms of the average width of the 90 % central predictive intervals and in terms of variance for longer lead times too (not shown). Due to the extremely imbalanced nature of WBGTid observation categories, during the training of a neural net one can easily face the situation where some categories are completely missing from the training data. This problem is less pronounced for the other investigated post-processing methods as the EMOS predictive PMFs are calculated using continuous forecasts and observations. Thus, the WBGTid results of the different classification approaches should be interpreted cautiously.

Conclusion
This study focuses on improving forecast heat indices such as discomfort index (DI) and indoor wet-bulb globe temperature (WBGTid) driven by temperature and dew point forecasts as inputs through statistical calibration. Post-processing based on ensemble model output statistics (EMOS) approaches enable us to significantly improve continuous forecasts of heat indices, at least up to day 6. In particular, DI and WBGTid forecast reliability is clearly increased in that case. Among different flavours of EMOS approaches, the generalized extreme value (GEV) model has the best overall performance, followed by the bivariate EMOS and ensemble copula coupling. The GEV EMOS is tailored to the end products (DI and WBGTid), whereas calibration based on joint post-processing of temperature and dew point forecasts can be applied to any heat index depending on these two weather quantities. For longer forecast ranges (typically beyond the first week in the forecast), post-processing does not improve the forecast and can even lead to a deterioration of the forecast skill which is explained by the non-stationarity of the forecast errors.
Moving from continuous forecasts to predicted categories of heat indices, we further compare the traditional approach of post-processing to multilayer perceptron neural networks, demonstrating that both methods have comparable predictive power. Furthermore, we evidenced that local EMOS post-processing is outperformed by cluster-based EMOS approaches after day 4 if a limited training period is used. Decreasing the number of clusters with the forecast lead time helps extending the benefit of EMOS post-processing for longer lead times. As a limit, a single cluster approach is equivalent to a global training approach which is not suitable for forecasting over a large domain.
The present work highlights several avenues of potential future research. For example, as an alternative training data set to the one above, one could consider the use of reforecasts. This is not investigated here but should be considered to address efficiently the calibration of medium to extended range forecasts. When accessing large training data sets, appropriate predictors could be defined to account for seasonality and/or weather regime dependent forecast errors. In this situation, machine learning approaches seem more flexible to deal with an increased number of predictors.
The assessment of forecast performance spans from the multivariate ES to the forecast value for a range of users defined by their cost-loss ratio, from the generic univariate CRPS to a focus on extreme events with BS and twCRPS. The cascade of verification results indicate that the ranking of the post-processing methods might differ from one score to another: the best method for one aspect of the forecast performance is not necessary the best solution for all applications. While an improvement with EMOS post-processing is expected to be effective up to day 6, the ensemble demonstrates potential skill beyond day 10 in terms of forecast value for some users.
The comparison of bivariate calibration approaches with direct post-processing leads us to formulate the following recommendations. Direct post-processing is generally more efficient; targeting one specific event/user could be even more appropriate. However, post-processing of the input variables appears to be a competitive approach. In that case, calibration of the dependence structure gives the best results. The simplest approach, which consist in the calibration of each component separately combined using the empirical copula of the ensemble, has reasonable performance as well. For more complex heat indices, calibration of the input variables can be envisaged if no EMOS model exists for the end product, but as the above discussion tends to indicate, there is no royal road to post-processing of heat indices.   Figure 12: PIT histograms of the GEV EMOS post-processed, simulated verification rank histograms of the 2D EMOS post-processed, and verification rank histograms of the adjusted and raw DI ensemble forecasts for days 1, 4, 7 and 10.