Assessment of MME methods for seasonal prediction using WMO LC‐LRFMME hindcast dataset

Different multi‐model ensemble (MME) methods were investigated for their potential to improve the skill of 1‐month lead seasonal forecast products, based on six models from Global Producing Centers (GPCs) for long‐range forecasts (LRFs) designated by the World Meteorological Organization (WMO). We first compared the hindcast performance of seven MME methods (simple composite method, SCM; simple linear regression, SLR; multiple linear regression, MLR; best selection anomaly, BSA; multilayer perceptron, MLP; radial basis function, RBF; genetic algorithm, GA) for the global 2‐m temperature and precipitation for 1983–2009. The reference datasets for 2‐m temperature and precipitation are the ERA‐Interim from European Centre for Medium‐Range Weather Forecasts (ECMWF) and Global Precipitation Climatology Project (GPCP) for hindcast verification. For real‐time verification, the data from the National Centers for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR) reanalysis 1 for 2‐m temperature and climate anomaly monitoring system and outgoing longwave radiation precipitation index (CAMS OPI) for precipitation are used. According to our analysis, GA was the most successful MME method in predicting both the global 2‐m temperature and precipitation for all four seasons. GA also showed good performance in predicting the 2‐m temperature and precipitation over the 13 regional climate outlook forum (RCOF) regions in all four seasons, but the range in performance among the RCOF regions varied significantly. In a real‐time forecast period (MAM 2012‐DJF 2015/16), GA outperformed in terms of time‐averaged anomaly pattern correlation coefficient (ACC) and root‐mean‐square error (RMSE) of the 2‐m temperature, although the forecast skill difference (0.02) between GA and SCM was not statistically significant. For the precipitation, both SCM and GA also reveal better performance than other MME methods. During the very strong El Niño event in 2015, individual models show better performance than other years. Nonetheless, these two MME methods outperform all the individual models.


K E Y W O R D S
global producing centres, GPC, Lead Centre, long-range forecasts, multi-model ensemble, seasonal forecasts, WMO, WMO LC-LRFMME

| INTRODUCTION
Dynamical predictions on a seasonal timescale, which are based on modelling the physical processes in the atmosphere, ocean, cryosphere, and land surface, are a viable tool for anticipating the near-time evolution of the climate system. However, the skill of dynamical seasonal predictions has remained limited, partially because of systematic errors in models arising from factors such as model parameterization schemes (Zadra et al., 2018), and the model's inability to produce an atmospheric response to oceanic boundary conditions (Kang et al., 2004;Park et al., 2018).
Over the last two decades, multi-model ensemble (MME) approaches that combine forecasts from different climate models have been introduced to reduce the influence of errors associated with model biases (e.g., Krishnamurti et al., 1999Krishnamurti et al., , 2000Shukla et al., 2000;Peng et al., 2002;Palmer et al., 2004;Hagedorn et al., 2005;Wang et al., 2009;Mishra et al., 2019). The equal-weighed MME method generally shows consistently better prediction skill each individual model does because the errors from individual models tend to offset each other (Kharin and Zwiers, 2002;Peng et al., 2002;Hagedorn et al., 2005;DelSole, 2007). This is particularly important in longterm forecasts because errors in the model are gradually accumulated and increased. Meanwhile, unequalweighted MME forecasts that give more weights to the models with consistently better performance have also been suggested (Krishnamurti et al., 2000;Yun and Krishnamurti, 2002;Ahn and Lee, 2016;Mishra et al., 2019). Many operational centres for seasonal prediction have now adopted the MME approach to provide more reliable forecasts to users (Palmer et al., 2004;Kirtman et al., 2014;Min et al., 2014Min et al., , 2017Kim et al., 2016). Recent MME studies have expanded its scope to study extreme weather conditions such as droughts, typhoons, monsoons and heat waves using seasonal forecast data (Sohn et al., 2011;Kim et al., 2017;Kim, 2018;Kim and Chan, 2018;Park et al., 2018;Shin and Moon, 2018).
The World Meteorological Organization Lead Center for Long-Range Forecast MME (WMO LC-LRFMME), established in 2009 and jointly operated by the Korea Meteorological Administration (KMA) and the National Oceanic and Atmospheric Administration (NOAA)/NCEP, provides a conduit for sharing model forecasts from global producing centres (GPCs) for long-range forecasts (LRFs).
The LC-LRFMME was initiated to provide a wellcalibrated MME system and a "one-stop shop" for GPC information (Graham et al., 2011;Kim et al., 2016). The WMO LC-LRFMME issues global probabilistic and deterministic seasonal forecasts that are openly available via its website (https://www.wmolc.org/).
In this study, we evaluate the methods for creating deterministic forecasts, that is, forecasts of the MME anomalies, by combining forecasts from GPC-LRFs. The study results are important for improving probabilistic forecasts, as well. The WMO LC-LRFMME provides the probabilistic forecasts using the probabilistic MME method developed by Min et al. (2009) and currently implemented at the APEC Climate Center (APCC). In this method, predicted probabilities are approximated with a Gaussian probability distribution function (PDF). The first parameter of the predicted Gaussian PDF, the mean, can be improved based on the results from this study, while the second parameter, the variance, can be assessed by accounting for both ensemble spread and calibration error as has been demonstrated for Korea by Min et al. (2011) and for Northern Eurasia by Kryzhov (2012).
Besides WMO LC-LRFMME, various deterministic MME methods are used operationally in several operational centres, particularly APCC, Copernicus Climate Change Services (C3S), and North American MME (NMME). Since its establishment in 2005, APCC has issued monthly 3-month global predictions of the 2-m temperature and precipitation. APCC uses four deterministic MME methods (simple composite mean, SCM; stepwise pattern projection method, SPM; multiple linear regression, MLR; synthetic superensemble method, SSE) for combining the ensemble means of participating models that are operationally exploited for seasonal forecasts using three Atmospheric General Circulation Models (AGCMs) and 10 Coupled General Circulation Models (CGCMs) from leading climate forecasting centres and institutes in 10 countries (Kryjov et al., 2006;Kirtman et al., 2014). C3S was implemented by the European Center for Medium-Range Weather Forecasts (ECMWF) as part of the delegation agreement with the European Union and publishes seasonal forecast products every month. Current participating centres in C3S are ECMWF, The Met Office, Météo-France, the German Meteorological Service (Deutscher Wetterdienst, DWD), and the Euro-Mediterranean Center on Climate Change (Centro Euro-Mediterraneo sui Cambiamenti Climatici, CMCC). Simple-averaged MME with equal weighting is used with a correction for bias and variance estimated from the full set of hindcasts available for each model, which is called the "standard bias correction technique" (Stockdale, 2012). NMME is a real-time experimental operational forecast system consisting of coupled models from North American modelling centres, including NOAA/NCEP, NOAA/Geophysical Fluid Dynamics Laboratory (GFDL), International Research Institute for Climate and Society (IRI), National Center for Atmospheric Research (NCAR), National Aeronautics and Space Administration (NASA), and Canadian Meteorological Center (CMC), starting in August 2011. Among other methods, NMME also uses simple-averaged MME with anomaly forecasts for each model based on the climatology from the hindcasts (Kirtman et al., 2014).
WMO LC-LRFMME provides four deterministic MME predictions (SCM; regular multiple regression, MLR; singular value decomposition, SVD; genetic algorithm, GA) for a 1-month lead seasonal forecast provided by 13 WMO-designated GPCs (Kim et al., 2016). Several years have passed since WMO LC-LRFMME started providing MME forecasts; nevertheless, the skill of MME methods using GPCs of WMO LC-LRFMME has not been assessed in an operational context. Although many MME techniques for seasonal prediction have been developed for research and/or operational purposes, the MME techniques were evaluated individually, sporadically and in isolation. The purpose of this study is to assess an overall performance of the several MME methods using the world's leading models for seasonal forecasting. We first introduce the various deterministic MME methods. Next, we assess the general performance of different MME predictions with the bias-corrected ensemble mean forecast worldwide and then highlight the MME method with the highest prediction skill in the hindcast period. We also examine the performance of the selected MME over 13 regional climate outlook forum (RCOF) regions for the hindcast period. Finally, we apply the MME method to real-time forecasting and present the prediction skills of the MME.
2 | DATA AND METHOD

| Data and verification metrics
The MME predictions using six models from five GPCs participating in WMO LC-LRFMME are used (Table 1). The other GPC hindcast data of which the periods are not long enough (e.g., ECMWF and Toulouse) or do not match others (e.g., Beijing, CPTEC, Exeter, Pretoria, Seoul, and Offenbach) are not included in the analysis. Moscow is the only uncoupled model, while the others are coupled models (Melbourne, Montreal [CANCM3 and CANCM4], Tokyo, and Washington). The hindcasts cover the 27 years from 1983 to 2009, and their real-time forecasts from 2012 to 2015 (12 independent predictions) are analysed. The 1-month lead retrospective and realtime MME forecasts for the seasons of March-April-May (MAM) initialized on February, June-July-August (JJA) initialized on May, September-October-December (SON) initialized on August, and December-January-February (DJF) initialized on November are evaluated. All prediction skills are focused on the 2-m temperature and precipitation. The datasets are interpolated into a grid with a resolution of 2.5 in both longitude and latitude. The T A B L E 1 WMO GPCs whose predictions are used in this study with summarized information for the forecast system configurations GPC name (model) Organization anomalies are calculated as departures from the climatology of each model over the training period in the leaveone-out cross-validation scheme to provide comparable results (Wilks, 1995). All of the non-equal-weights combination methods are calculated in the leave-one-out crossvalidation mode. Following official recommendations of the WMO Joint CBS-CCl Expert Team on Operational Predictions from Sub-seasonal to Longer-time Scales (ET-OPSLS) adopted at the meeting held on 11-15 April 2016 in Beijing, China, we use different datasets for hindcast and real-time forecast verification. The data used for verification of the retrospective forecast are 2-m temperature from the ERA-Interim reanalysis (Dee et al., 2011) of European Centre for Medium-Range Weather Forecasts (ECMWF) and precipitation from the Global Precipitation Climatology Project (GPCP) merging data from rain gauge stations, satellites, and sounding observation (Adler et al., 2003). Verification of the real-time forecasts was performed on the near real-time 2-m temperature from NCEP/NCAR reanalysis 1 (Kalnay et al., 1996) which is a joint product from NCEP and NCAR and precipitation from Climate Anomaly Monitoring System and Outgoing longwave radiation Precipitation Index (CAMS OPI) merging observation from rain gauges with satellite estimates to obtain the quasi real-time monthly mean global precipitation dataset (Janowiak and Xie, 1999). The anomalies from observation are also leave-one-out cross-validated data (e.g., WMO, 2010;Lee et al., 2011Lee et al., , 2013aMin et al., 2014Min et al., , 2017Mishra et al., 2019). The Niño 3.4 index based on optimum interpolation (OI) version 2 monthly mean sea surface temperature (SST) (Reynolds et al., 2002) obtained from the National Oceanographic and Atmospheric Administration (NOAA)/Climate Diagnostics Center (CPC) (available at https://www.cpc.ncep.noaa.gov/data/indices/) is used to present the relationship between forecast skill and El Niño-Southern Oscillation (ENSO) variability. OISST version 2 was the officially suggested dataset for verifying sea surface temperature at the meeting of CBS for Expert Team on Develop a Verification System for Long-range Forecasts held on 16-20 August 1999 in Reading, United Kingdom (the final report is available at: https://www. wmo.int/pages/prog/www/DPS/Reports/ET-LRF-DevVS_ ecmwf1999.html).
The seasonal forecast skill of the various MME methods is assessed using the SD, the temporal correlation coefficient (TCC), the anomaly pattern correlation coefficient (ACC), and the root-mean-square error (RMSE). The SD, TCC, and RMSE are calculated for each grid point and then globally averaged. We also estimate the time-averaged ACC and RMSE worldwide. The scores are defined as follows: where N is the number of years of hindcast period, n is the number of grids, F and O denote forecast and observation and F 0 and O 0 are the anomaly for forecast and observation, respectively. j denotes each grid point. According to the agreements on the need to have a more coherent approach to verification of long-range forecast, the Standard Verification System (SVS) for LRF (SVS-LRF) has arranged the procedures for the practical details of producing and exchanging appropriate verification scores (WMO, 2010). The mean square skill score (MSSS) is recommended by SVS-LRF for deterministic verification. The MSSS is based on the mean squared error (MSE) between a set of paired hindcast (MSE j ) and observations (MSEc j ).
The overall MSSS is computed as below: MSSS can be expanded (Murphy, 1998) as.
where r FO j is the product moment correlation of the forecasts and observations at point j as Here, S F j and S O j are the SD of forecast and observation, respectively. The first three terms of the decomposition of MSSS j are related to phase errors (through the correlation), amplitude errors (through the ratio of the forecast to observed variances) and overall bias error, respectively, of the forecast. These values can be helped to adjust or weight the forecasts used as inputs to regional and local forecasts Good model performance with strong correlation and small amplitude error can be considered suitable for use in prediction.

| MME prediction methods
We suggest seven experimental deterministic MME methods to merge the six seasonal forecast systems. Each method is described below and summarized in Table 1.

| Simple composite method (SCM)
SCM is the simplest and most commonly used way to combine the bias-corrected seasonal forecast with equal weights (see Equation (10)), where F 0 i is the forecast anomaly of the ith model, and N is the number of individual models involved in MME. SCM is widely used in many projects or operational centres (e.g., APCC, DEME-TER, NMME, WMO LC-LRFMME), as recommended by WMO (Palmer et al., 2004;WMO, 2007WMO, , 2008Kirtman et al., 2014;Kim et al., 2016;Min et al., 2017).

| Linear regression methods
This approach uses empirically weighted MME with coefficients calculated for simple linear regression (SLR) and MLR. SLR is given by Equation (11), where a i is the regression coefficient of the ith model.
MLR is built using covariance matrices to minimize the mean-square error during the training period and is calculated by where b i is the MLR coefficient of the ith model, which is obtained by minimizing the difference between the individual models and observations (Krishnamurti et al., 1999(Krishnamurti et al., , 2000Yun et al., 2003;Ke et al., 2009).

| The best selection anomaly (BSA) method
The best selection anomaly (BSA) method is a linear method to merge models. This method takes the forecasted anomaly of the ith model Fz ij − F ij À Á , which has the highest correlation with observations among the individual models for the jth point in the cross-validation process (KMA, 2009a(KMA, , 2009b.

| Artificial neural network (ANN) method
The artificial neural network (ANN) is a nonlinear MME prediction method that is used for nonlinear regression and its classification problems. Previous studies demonstrate that the seasonal prediction estimated by ANN outperforms that of each participating model, and that it is more advantageous than MLR in its ability to extract useful information in the case of nonlinear cause-effect relationships (Yun and Krishnamurti, 2002;Park et al., 2005;Ahn et al., 2012). ANN consists of three types of layer: input, hidden, and output. The forecast data are entered into the input layer, and the MME results are generated from the output layer. There is more than one hidden layer between the input and output layers (see figure 6 in Yun and Krishnamurti, 2002).
We choose the multilayer perceptron (MLP) algorithm, which is the most widely used algorithm to calculate optimal weighting (Marius-Constantin et al., 2009). The jth neuron in the hidden layer is assigned the value, y j , where N is the number of hidden neurons, w ij and c j are the weighting and bias parameters, respectively, and x i is the forecast data normalized by four times the SD of observation σ in the training period from the input layer, as the input range is from −1.0 to 1.0. This output from the jth neuron (y j ) is entered into the output layer as follows: where H is the number of output neurons, andw j andc are the weighting and bias from the hidden layer to the output layer, respectively. This process is repeated to minimize the cost function, J, where O t is the observation from time step t, and T is the training period. The final forecast data from ANN using the MLP method are given by where σ is the observation SD in the training period (Park et al., 2005). In addition, we use the radial basis function (RBF), which has the same structure as MLP. In the process from the input to the hidden layer of RBF, the jth neuron in the hidden layer is derived by.
where μ j and σ j are the mean and SD of each neuron, respectively. The final output of RBF is also calculated as Equation (8) (Lee et al., 2008).

| Genetic algorithm (GA) method
The concept of GA was initially designed by Holland (1975) as a probabilistic technique that explores the optimum solution to a given problem that maximizes or minimizes a particular function by algorithmizing the process of developing and producing the next generation to preserve and survive their genes in an environment of survival of the fittest. GA is applied to MME to find the optimal weight among the models during the training period. Randomly generated initial weights of single models are calculated in the first step of GA by using the hindcast data of each model except the target year. These weights are evaluated by a fitness function to check whether they are optimal or not. If the weight does not provide the optimal solution, the algorithm continues to the next generation through operators such as crossover, mutation, and replacement until an optimal solution is obtained (Henao, 2011). Here, the minimum RMSE between observation and model anomalies is used as the fitness function You et al., 2012;Ahn and Lee, 2016). Ahn and Lee (2016) showed that the MME method applying GA to both single-model and MME improved the performance for both winter temperature and precipitation, even over higher-latitude land areas, compared to conventional SCM. This improvement over land seems to be attributable to GA's greater efficiency in finding an optimum solution in a complex region where nonlinear physical properties are evident. Ratnam et al. (2019) also revealed an improvement in seasonal forecast of the 2-m temperature by applying GA over several regions of South America, North America, Australia and Russia compared to the unweighted ensemble mean.

| COMPARISON OF MM2E PREDICTION METHODS
We first show the general characteristics and performance of the various deterministic MME methods using the 27 years of hindcasts for 1983-2009. There are four linear (SCM, SLR, MLR, and BSA) and three nonlinear (MLP, RBF, and GA) MME methods.
3.1 | Annual variation of MME prediction methods Figure 1 shows the annual mean of SDs of the seasonal 2-m temperature in observation and seven MME predictions, together with the observational annual climatology pattern. The annual mean SD of each MME prediction is calculated by averaging the SD of 1-month lead predictions for each season. The temporal variability in observation over the land tends to be much larger than that over the ocean due to the larger thermal inertia of the oceans because of the different surface and atmospheric feedbacks that occur over the land compared with the ocean (Stouffer et al., 1994;Bell et al., 2000).
The variability in the model is calculated based on the ensemble mean, and the noise is averaged out; hence, the variability in the ensemble mean is generally smaller than the observed one. Nevertheless, most MME methods represent a contrast of the temporal variability between land and ocean shown in the observation, particularly using the MLP and MLR methods which are highly correlated to the observation (spatial ACCs equal 0.85 and 0.83, correspondingly). There are differences in variability among the MME predictions, most of which are found over land rather than over the ocean because for a 1-month lead, the SST prediction skill for all models is generally high; thus, the differences among the various models are small over the ocean except SLR and RBF. The variability in MME predictions in the equatorial eastern Pacific is relatively higher than that in other regions. Previous studies have shown that most of the interannual variability in the skill of seasonal predictions is affected by oceanic boundary conditions, particularly those associated with the ENSO (Wang et al., 2009;Barnston et al., 2010;Min et al., 2014). The variability associated with oceanic conditions, such as ENSO, is retained in the MME predictions, leading to a higher variability in the equatorial eastern Pacific.
For precipitation, the largest variability in the observed precipitation spans the tropical region. A similar tendency is found in the MME predictions but with a smaller amplitude ( Figure 2) because of ensemble averaging. The results from Figures 1 and 2 reveal that the MME predictions capture the spatial patterns of the observed variability over the ocean quite well, but that the amplitudes of variations from the MME predictions are smaller than those of the observations for both the temperature and precipitation.  During the strong 1997/98 El Niño event, the spread of forecast skill among MMEs for precipitation for the decaying phase (MAM 1998) is narrower than that for the maturing phase (DJF 1997), which means that the forecast skills of MMEs are relatively certain after the ENSO events. Previous studies reveal that the atmospheric response associated with the ENSO is lagged by a few months, not only in observation but also in climate models (Trenberth and Paolino, 1981;Trenberth et al., 1998;Kumar and Hoerling, 2003;Buli c and Kucharski, 2012). This may be due to the delayed atmospheric response to the ENSO, in that the oceanic boundary conditions initialized from late January to February (mature phase of El Niño) and used for seasonal prediction for spring, lead to a larger atmospheric response in spring (Kumar and Hoerling, 2003) and a higher forecast skill, regardless of the MME method. The averaged MME forecast performances of both variables during the ENSO events are always better than those during non-ENSO events for all four seasons. Especially, the ACC differences between the ENSO and non-ENSO precipitation events for JJA and DJF are statistically significant at the 5% significance level. The variability in performance among the MME predictions is consistent regardless of the ENSO events (as shown in Figure S13). Although the ENSO is an important of predictability, the forecast skills of the MME predictions do still vary for both variables. For example, for all seasons GA tends to outperform the other MME methods for both the 2-m temperature and precipitation in terms of not only ACC but also MSSS (Figures 3, 4, and S1-S4). Other oceanic forcing such as El Niño Modoki (Ashok et al., 2007) and Indian Ocean Dipole (Saji et al., 1999) also have an improving effect on MME prediction performance (Figures S14 and S15), but their impacts are lower than the ENSO.

| Interannual and seasonal variation of MME prediction skill
The global averaged TCCs of the MME predictions for the 2-m temperature and precipitation are shown in Figure 5a,b, together with the mean of single model skills (MSMS), which is estimated as the average of the skills of individual models. Generally, the skills of MME predictions for the 2-m temperature are much higher than those for precipitation, and the seasonal skill for the 2-m temperature does not vary greatly. Also, the differences of forecast skills of MME among the seasons are not large and do not show large seasonal dependency. The MSMS for the 2-m temperature is also largely better than that of precipitation. Most MME predictions show higher correlation than the corresponding (based on TCC as a skill metric) MSMS for both the 2-m temperature and precipitation during all four seasons.
F I G U R E 2 Same as Figure 1 but for precipitation. The unit of climatology for precipitation is mmÁday -1 However, the MLP method reveals lower correlations than MSMS for both variables during all four seasons, indicating that not all MME methods improve the skill compared with MSMS. Figure 6a,b show the global averaged RMSE for the 2-m temperature and precipitation, respectively. Generally, the forecast errors of MME are larger in boreal winter than in boreal summer, because of higher variability in winter in extratropical Northern Hemisphere land regions, particularly for the 2-m temperature. The forecast errors averaged over separate single model forecasts for the 2-m temperature are 0.81, 0.65, 0.79, and 1.01 during the MAM, JJA, SON, and DJF seasons, respectively. Most MME predictions show lower RMSEs than the corresponding (based on RMSE as a skill metric) MSMSs do, except for MLP during MAM, SON, and DJF. Similar to the 2-m temperature, the RMSEs of all MME predictions for precipitation are smaller than the corresponding MSMS, that is, the MME prediction tends to reduce the forecast errors.

| Overall performance of the MME prediction in the retrospective period
As shown in the Taylor diagram (Taylor, 2005) in Figure S16, the MME predictions mostly underestimate the interannual variation of both the 2-m temperature and precipitation for all four seasons worldwide. The TCC and normalized SD for each MME prediction show very low seasonal variation. Figure 7 is presented to comprehensively determine the 4-season-averaged TCC and RMSE of each MME and individual model for the 2-m temperature and precipitation. First, for the 2-m temperature, the skill and RMSE of the MME predictions generally have a linear relationship, as opposed to that of single models. Second, the performance of MMEs is superior to that of the individual models (e.g., Min et al., 2014;Kim et al., 2016). GA has good forecast performance, similar to SCM, among all MMEs and individual models with the highest TCC (0.51), which is significant at the 99% confidence level, and the lowest RMSE (0.74). The worst is MLP, which has a TCC of 0.41 and RMSE of 0.80.

F I G U R E 4 Same as Figure 3 but for precipitation
For precipitation, the TCC and RMSE do not show a distinct linear relationship, but there is a clear difference in performance among the MMEs and single models. GA also exhibits good performance for precipitation with TCC of 0.20 and RMSE of 0.52. MLP is also ranked as the worst MME prediction for precipitation with poor forecast skill (0.12) and large forecast error (0.55). The forecast performance of GA is consistently comparable to that of SCM, which is widely used in many operational centres for each variable and each season.
The mean square skill score (MSSS) which is recommended by the WMO Standardize Verification System (SVS) for LRFs can be decomposed as phase errors (through the correlation between forecast and observation), amplitude errors (through the ratio of the forecast to observed SDs) and overall bias error, respectively. If the model or MME performance is good with a strong correlation and small amplitude error, the forecast can be considered suitable for use in forecasts. A good forecast is achieved when the ratio of the forecasted to observed SDs is close to one, overall bias through the difference between the forecast and observation is near zero, and the MSSS is close to one. Negative or zero MSSS values indicate that deterministic forecasts are worse than or the same as climatology forecast (WMO, 2010(WMO, , 2018Setiawan et al., 2017). In terms of MSSS, GA outperforms the other MMEs for the 2-m temperature in both boreal summer and winter (as shown in Figures S1 and S2). The correlation does not vary significantly among all MMEs ( Figures S5 and S6). However, the ratio of SDs for RBF is close to zero worldwide ( Figures S9 and S10). As a result, the MSSS values of RBF are lower than those of the other methods. For the precipitation, the skilful MSSS mainly shows in the Maritime continent and equatorial Pacific F I G U R E 5 Global averaged temporal correlations between observations and predictions of (a) 2-m temperature and (b) precipitation by each MME method and single-model's mean (SMM) for MAM, JJA, SON, and DJF for the period 1983-2009. Dashed lines indicate that the estimated score is statistically significant at the 5% level using the one-tailed Student's t test F I G U R E 6 Global averaged RMSE of (a) 2-m temperature and (b) precipitation predictions of each MME method and single-model's mean (SMM) for MAM, JJA, SON, and DJF for the same period for both JJA and DJF seasons ( Figures S3 and S4) for all MMEs and it is related to the correlation. The large ratio of the SDs causes MSSS to be negative, as clearly demonstrated by BSA of the precipitation for the JJA and DJF (Figures S11 and S12).

| PERFORMANCE OF THE GA METHOD OVER RCOF REGIONS
As discussed above, the prediction skill of GA is almost equal to that of SCM for the hindcast period worldwide. As the performance of SCM over the RCOF regions was already revealed in a previous study (Kim et al., 2016), we investigate the forecast performance of the GA method over the RCOF regions in this section.
RCOF regions have been designated by the WMO and the corresponding domains are listed in Table 2. RCOF produces consensus-based, user-relevant climate outlook products in real-time to reduce climate-related risks and support sustainable development for the region in question (WMO, 2016). One of the primary sources of information for the climate outlook of RCOF is the real-time seasonal forecast produced by WMO LC-LRFMME. Figure 8 shows the scatter diagram between TCC and normalized RMSE (NRMSE) of the most reliable MME prediction for the hindcast period, GA, for the 2-m temperature and precipitation for each season, averaged over the domain of each RCOF region. Here, we calculate the NRMSE, which is obtained from the RMSE divided by the SD of the observation, to compare the forecast errors of the 2-m temperature and precipitation together. The seasonality of the forecast performance over most RCOF regions is not large for both 2-m temperature and precipitation. The most consistent result is that the forecast skills and RMSE for the 2-m temperature are better than those for precipitation, showing a larger TCC. In addition, the forecast performance still varies based on region or variable, even though the most reasonable MME prediction method (i.e., GA) is used. In terms of MSSS as well, GA performs better than the other methods in most RCOF regions, especially for boreal winter 2 m temperature and precipitation ( Figure S21). The forecast skills of both the variables decrease toward the extratropical RCOF regions. Previous studies have revealed that the prediction skill for individual models is greater over the tropics than the extratropics and greater over ocean than land (Wang et al., 2009;Peng et al., 2011;Kim et al., 2012;Lee et al., 2014;Ham et al., 2019). GA also outperforms in tropical RCOF regions, such as ASEANCOF, PICOF, and PRESAO, which are more affected by the interannual variations under oceanic boundary conditions, for example, ENSO (e.g., Ropelewski and Halpert, 1989;Mason and Goddard, 2001;Jin et al., 2008). However, the skills over extratropical RCOF regions, not affected or weakly affected by the ENSO, remain relatively low (Alexander et al., 2002;Palmer et al., 2004;Kryjov, 2012;Kumar et al., 2013;Min et al., 2014). This conforms to results from many previous studies that the skillful forecasts are F I G U R E 7 Diagram of temporal correlation-RMSE of each MME method of (a) 2-m temperature and (b) precipitation averaged over four seasons (MAM, JJA, SON, and DJF). The red crosses indicate each individual model concentrated in the seasons and regions affected by the ENSO (e.g., Ropelewski and Halpert, 1989;Mason and Goddard, 2001;Jin et al., 2008) as well as in the extratropical regions strongly related to the ENSO via teleconnections during its peak phases (Wang et al., 2009;Barnston et al., 2010).

| REAL-TIME FORECAST
Next, we investigate the seasonal predictability of different MME systems for the 1-month lead seasonal 2-m temperature and precipitation, using real-time forecasts for the period during MAM 2012-DJF 2015/16. Figure 9 shows the real-time forecast skills of the two MME predictions (GA and SCM) for the real-time forecast period and the spread of the skill of all MMEs for the 2-m temperature and precipitation, in terms of ACCs. We focus solely on these two MME methods because we previously assessed GA as the most skilful MME method for the hindcast period, and SCM is widely used in many operational centres. Overall, the real-time forecast skill for both the 2-m temperature and precipitation increased in recent years, as it was affected by the very strong El Niño in 2015. As mentioned above for the hindcast period, the real-time forecast skills of all MME methods are also highly related to ENSO variability. The forecast skill among all seven MMEs ranges from 0.35 to 0.63 for the 2-m temperature and from 0.77 to 0.82 for the precipitation in DJF 2015. The forecast skills for GA and SCM are not always better than those of the other MME predictions for the period of real-time forecast; however, they consistently show relatively higher performance, regardless of the season. The time-averaged ACCs of GA and SCM are 0.43 and 0.42 (0.31 and 0.32) for the 2-m temperature (precipitation). There are few skill differences between GA and SCM in the real-time forecast period, although GA consistently slightly outperforms other MME methods for the hindcast period.
The forecast errors, in terms of RMSEs, of the 2-m temperature for the same period are illustrated in Figure 10a. The most notable feature is that the skill differences among the MME methods during DJF are larger than those during the other seasons. The RMSE of the MLP shows the greatest seasonal variation, with a mean and SD of 1.44 and 0.56, respectively. The increased forecast error during DJF may be due to the wider error spread of individual models compared with that in the other seasons (figure not shown). In general, the forecast error of both GA and SCM is lower (with time-averaged RMSE of 0.94 for GA and 0.95 for SCM) than that of other MMEs, even in winter. Figure 10b shows the forecast error for precipitation for the real-time forecast period. Among the MME methods, the time-averaged forecast error ranges from 0.80 to 1.09. The forecast errors of GA and SCM with time-averaged RMSE of 0.88 and 0.85, respectively, are relatively low among those of the MME methods. However, SLR shows the lowest error with a time-averaged RMSE of 0.80.
These two MME systems present lower errors than those of the other MMEs in most cases, when the error spreads of MMEs are quite large (e.g., precipitation for SON 2013). The errors of the two MME predictions are not largely different in the real-time forecast; the time averaged RMSEs of GA and SCM are both 0.94 for the 2-m temperature and 0.88 and 0.85 for precipitation, respectively. Individual models show better performance during the very strong El Niño event in 2015 than in other years.

| DISCUSSION AND SUMMARY
The MME methods have been developed and evaluated individually, sporadically and in isolation, so it is necessary to assess their overall performance comprehensively and coherently. We evaluated the global skills of MMEs for the hindcast period and suggested the most reasonable MME method. For the assessed variables, most MME predictions present reasonable skills in terms of spatial patterns of annual variation and temporal evolution for all four seasons for the 27-year hindcast periods, particularly during the El Niño events. The ENSO variability was one of the main sources of predictability, not only in the individual models but also in the MME prediction (Lee et al., 2011(Lee et al., , 2013a(Lee et al., , 2013bMin et al., 2014Min et al., , 2017 between the observations and predictions for 1-month lead seasonal mean (a) 2-m temperature and (b) precipitation for GA (red) and SCM (blue) for the same real-time period over the globe. The grey shaded area shows the spread of scores for all MME methods for the same period forecast period, the two MME systems (GA and SCM) reveal consistently higher skills and lower errors compared to the other methods for the global 2-m temperature and precipitation.
The following study caveats should be noted. First, the forecast skills vary depending on the region or variable, even in the most skilful MME system. Also, the forecast skill for precipitation remains poorer than that for the 2-m temperature, even using various MME methods. This is due to the complex physics of local-scale precipitation, resulting in more noise and therefore unpredictable characteristics (Gong et al., 2003;Min et al., 2017). In addition, the forecast errors are quite large in the extratropics for both the 2-m temperature and precipitation. Second, the MME predictions outperform those by single models in both hindcast and real-time forecast. The differences of forecast skills between GA and SCM were not statistically significant, but GA remains the most skilful MME prediction for the 2-m temperature in the real-time forecast.
The limited seasonal prediction skills of the MME methods can be partially explained using the Empirical Orthogonal Function (EOF) analysis. The first mode of EOF of observation for the global 2-m temperature is close to the global warming pattern, and the first principal component (PC1) is highly correlated (0.66) with the global warming signal. This mode explains about 15.8% of the total variance for 1983-2009 (Figures S17a and S18a). The second EOF mode of observation for the global 2-m temperature, which explains 13.8% of the variance, mainly shows the Arctic Oscillation (AO)-related impact over the land in the Northern Hemisphere (Kryzhov and Gorelits, 2015;He et al., 2017). The correlation coefficient between PC2 and the AO index obtained from CPC (available at http://www.cpc.ncep.noaa.gov/ products/precip/CWlink/daily_ao_index/ao_index.html) is 0.52 (Figures S19a and S20a). The first EOF modes of most MME methods except for RBF are related to global warming. The second EOF mode of the MMEs (except for RBF) is associated with the ENSO rather than with the AO (Figures S19b-h and S20b-h).
Several studies have attributed the difficulty in assessing MME methods to the small samples available to calculate optimal weights (Kharin and Zwiers, 2002;Kryjov et al., 2006;DelSole and Shukla, 2009;Rodrigues et al., 2014;Min et al., 2017) and to overfitting during the training period (Davis, 1978;Michaelsen, 1987;Min et al., 2014). Nevertheless, it is relevant that we explore the possibility of finding MME techniques other than SCM for use in operational seasonal prediction. For RCOF regions, it is also important to utilize an objective MME approach instead of a consensus-based subjective approach for developing seasonal forecasts. GA shows good performance in the RCOF regions, especially in the tropical RCOF regions, such as ASEANCOF, PICOF, and PRESAO, which are directly affected by oceanic boundary conditions (e.g., ENSO). As noted in the introduction, the study results will be helpful to improve the seasonal outlook for RCOF regions, in terms of both deterministic and probabilistic predictions. Improved deterministic forecast can be used as the mean of forecast PDF, and the variance of forecast PDF can be assessed by accounting for both ensemble spread and calibration error as suggested by Min et al. (2011) andKryzhov (2012).