A metrics‐based analysis of seasonal daily precipitation and near‐surface temperature within seven Coordinated Regional Climate Downscaling Experiment domains

We compare ensemble mean daily precipitation and near‐surface temperatures from regional climate model simulations over seven Coordinated Regional Climate Downscaling Experiment domains for the winter and summer seasons. We use Taylor diagrams to show the domain‐wide pattern similarity between the model ensemble and the observational data sets. We use the Climatic Research Unit (CRU) and the University of Delaware gridded observations, with ERA‐Interim reanalysis data as an additional observationally based estimate of historical climatology. Taylor diagrams determine the relative skill of the seven sets of simulations and quantify these results in terms of centered pattern root‐mean‐square error and correlation coefficient. Results suggest that there is good agreement between the models and the CRU, in terms of their respective seasonal cycles, as shown in Taylor diagrams and bias plots. There is also good agreement between both gridded observation sets. In addition, downscaled ERA‐Interim precipitation is closer to observations than raw ERA‐Interim precipitation. Domains located in the low latitudes and those having high topography appear to have larger biases, especially for precipitation.


| INTRODUCTION
The Coordinated Regional Climate Downscaling Experiment (CORDEX) is a program sponsored by the World Climate Research Programme to coordinate the generation and assessment of regional climate simulations for multiple domains around the globe (Giorgi and Gutowski, 2015). The main goal of CORDEX is "to advance and coordinate the science and application of regional climate downscaling through global partnerships" (Gutowski et al., 2016). Advancing the science of climate downscaling requires metrics to assess progress in that direction. Here, we apply a set of metrics to multiple CORDEX simulations in multiple CORDEX regions and provide an initial evaluation of CORDEX ensemble simulations for these regions.
While there have been multiple studies involving metrics analysis for individual CORDEX regions, cross-domain metrics analysis of CORDEX RCMs is lacking. Evans et al. (2012) used metrics for the analysis of Weather Research and Forecasting (WRF) model physics ensembles over Australia. Lucas-Picher et al. (2013) analyzed the Aire Limitée Adaptation dynamique Développement International simulations over North America within the CORDEX framework. Nikulin et al. (2012) and Solman et al. (2013) used CORDEX model ensembles over Africa and South America, respectively; both studies show that ERA-Interim precipitation is deficient in the low latitudes. In addition, a series of regional studies over Africa using CORDEX RCM data demonstrated that the models adequately capture annual and seasonal rainfall characteristics and that an ensemble approach is desirable, as individual models have biases (Endris et al., 2013; Kalognomou et al., 2013; Gbobaniyi et al., 2014; Shongwe et al., 2014). Glisan and Gutowski Jr. (2014) and Lindsay et al. (2014) highlight temperature and precipitation biases in reanalyses, including the ERA-Interim, across the Arctic. In this study, we quantify the performance of CORDEX RCMs over multiple domains through a series of metrics.
There are two general categories for metrics: (a) statistical climatology and (b) process- or phenomenological-based metrics. The phenomenological metric has a regional focus, dealing with features that are unique to a given area, such as the monsoons, low-level jets, and extratropical systems. However, we want to evaluate model performance over multiple regions with very different climate processes; thus, we focus on statistical climatology metrics that apply to all of our analysis regions.
Data available for this study were ensemble averages for each domain, so this study analyzes the net behavior of groups of models to show their collective behavior. In addition, some models have performed worse in specific regions. Thus, a benefit of using multiple models in a given region is the calculation of an ensemble mean, which puts less emphasis on outlier models through averaging. We specifically consider monthly average two-meter air temperature and precipitation. We compare model ensemble average results to two gridded observational data sets and a reanalysis product. We use Taylor diagrams and bias portrait plots to present the results. The paper is organized as follows: Section 2 describes the data sets, models, and domains, as well as the analysis methodology. Section 3 describes our results, and Section 4 summarizes our findings and gives our conclusions.

| Gridded observations and reanalysis
We use two gridded observational data sets for seasonal precipitation and two-meter air temperature. The Climatic Research Unit Time-series, version 3.21 (CRU TS3.21) is a historical, global gridded monthly averaged data set on a 0.5° × 0.5° grid, available from 1901 to 2012 (Mitchell and Jones, 2005). The data set uses observations from over 4,000 global weather stations. Temperature and precipitation at each station are converted to anomalies from the 1961-1990 station average using the Climate Anomaly Method (CAM; Peterson et al., 1998). The value of each grid box is the average of all anomalies for stations contained within it. The CAM uses additional measures to assess and correct errors that arise, for example, from poor measurements and urbanization biases.
We also use the University of Delaware (UDEL) Terrestrial Temperature and Precipitation data set, version 4.01. The UDEL data set uses the Global Historical Climatology Network and annual and monthly mean station observations of air temperature and total precipitation from Matsuura and Willmott (2009); averaged seasonal temperature and precipitation were bilinearly interpolated to a 0.5° × 0.5° grid.
In addition to CRU and UDEL, we use daily fields from the European Centre for Medium-Range Weather Forecasts ERA-Interim (EI) reanalysis (Dee et al., 2011). The ERA-Interim reanalysis is a global product spanning 1979 through the present. Atmospheric and surface variables are provided on a T255 spectral resolution grid. While the temperature field is constrained by station observations, the precipitation is not. Rather, EI precipitation is a product of the underlying forecast model, accumulated over a 12-hr forecast segment. For the analyses here, CRU is the reference observational data set. Specifically, we treat EI as a "model" constrained by observations and UDEL as another observation set to compare against the reference CRU in our analysis. As with UDEL, EI is also bilinearly interpolated onto the CRU grid.
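The bilinear interpolation onto the CRU grid mentioned above can be illustrated with a minimal sketch. It assumes a regular source grid with ascending latitude and longitude axes and ignores longitude wrap-around and missing values; the function name `bilinear` and its arguments are hypothetical and are not taken from the tools used in the study.

```python
import bisect

def bilinear(lats, lons, field, lat, lon):
    """Interpolate field[i][j], defined at (lats[i], lons[j]), to (lat, lon).

    Assumes both axes are ascending and hold at least two points each.
    """
    if not (lats[0] <= lat <= lats[-1] and lons[0] <= lon <= lons[-1]):
        raise ValueError("target point outside source grid")
    # Index of the lower-left corner of the enclosing grid cell.
    i = min(max(bisect.bisect_right(lats, lat) - 1, 0), len(lats) - 2)
    j = min(max(bisect.bisect_right(lons, lon) - 1, 0), len(lons) - 2)
    # Fractional position of the target inside the cell (0..1).
    ty = (lat - lats[i]) / (lats[i + 1] - lats[i])
    tx = (lon - lons[j]) / (lons[j + 1] - lons[j])
    # Interpolate along longitude on both rows, then along latitude.
    south = field[i][j] * (1 - tx) + field[i][j + 1] * tx
    north = field[i + 1][j] * (1 - tx) + field[i + 1][j + 1] * tx
    return south * (1 - ty) + north * ty
```

Calling this function for every 0.5° × 0.5° target cell center would place a coarser field on the reference grid; a production workflow would more likely rely on an established regridding library.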

| Models
The regional climate models used in this study are forced using ERA-Interim lateral boundary conditions, sea surface temperature, and sea ice extent. For each region, we use the ensemble average of RCMs that provided a simulation; Table 1 lists the individual models used to calculate the ensemble mean for each CORDEX domain. Some models provided results for multiple regions, though they differ in detail from domain to domain. As with the observations, we extract the December-January-February (DJF) and June-July-August (JJA) averaged two-meter temperature and precipitation fields.
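As an illustration of the seasonal averaging step, the sketch below pools monthly means into DJF and JJA averages. It is a simplified example, not the authors' processing chain: it assumes each record is a `(month, value)` pair, weights every month equally (no days-per-month weighting), and does not treat December as belonging to the following year's winter. The names `SEASONS` and `seasonal_means` are hypothetical.

```python
SEASONS = {"DJF": (12, 1, 2), "JJA": (6, 7, 8)}

def seasonal_means(monthly):
    """Return {season: mean value} from an iterable of (month, value) pairs.

    A season with no matching months maps to None.
    """
    records = list(monthly)  # allow generators to be scanned once per season
    result = {}
    for name, months in SEASONS.items():
        values = [value for month, value in records if month in months]
        result[name] = sum(values) / len(values) if values else None
    return result
```

The same function can be applied per grid cell to monthly two-meter temperature or precipitation before any spatial statistics are computed.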

| Study design
Our analysis spans 1991-2007 and focuses on DJF and JJA seasons for seven CORDEX domains (Figure 1). The DJF and JJA ensemble averages for temperature and precipitation were computed for each domain. We also find the seasonal averages for the observation-based sets used. From these averages, we compute the seasonal, domain-averaged difference of the model ensemble mean versus the CRU reference observational data set. We present these biases on a portrait plot, which summarizes the relative magnitudes of bias.
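The domain-averaged bias described above can be sketched as follows. This is a minimal illustration under stated assumptions: both fields are already on the same grid and flattened to lists, missing cells (for example, ocean points in a land-only observation set) are marked with `None`, and the optional cosine-latitude weighting is our assumption, since the text does not say whether area weighting was applied. The name `domain_mean_bias` is hypothetical.

```python
import math

def domain_mean_bias(model, ref, lats=None):
    """Mean of (model - ref) over cells where both values are present.

    If `lats` (degrees, one per cell) is given, cells are weighted by
    cos(latitude); otherwise all cells are weighted equally.
    """
    total = 0.0
    weight_sum = 0.0
    for k, (m, r) in enumerate(zip(model, ref)):
        if m is None or r is None:
            continue  # skip cells missing in either data set
        w = math.cos(math.radians(lats[k])) if lats is not None else 1.0
        total += w * (m - r)
        weight_sum += w
    if weight_sum == 0.0:
        raise ValueError("no overlapping valid grid cells")
    return total / weight_sum
```

One such value per domain, season, and data set pair (model vs. CRU, UDEL vs. CRU, EI vs. CRU) would fill the cells of a bias portrait plot.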
We are also interested in the domain-wide pattern similarity between the models and the observational data sets. The biases calculated cannot properly capture this behavior, so we use Taylor diagrams, which provide a succinct graphical representation of how well a set of patterns matches observations (Taylor, 2001). For our study, we use Taylor diagrams to determine the relative skill of seven CORDEX domains against baseline observations; hence, there are seven points on the diagram for each season and variable. The Taylor diagrams quantify these results in terms of the spatial root-mean-square error (RMSE) and spatial correlation coefficient (CC) with respect to the reference data set. The RMSE is calculated by normalizing the spatial standard deviation for each domain's model ensemble average with the reference (CRU) spatial standard deviation.
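For reference, the statistics that a Taylor diagram summarizes can be sketched as below, using the standard definitions from Taylor (2001): the normalized spatial standard deviation, the spatial correlation coefficient, and the centered pattern RMSE normalized by the reference standard deviation. This is an unweighted illustration on flattened, co-located fields and may differ in detail from the calculation used in this study; the name `taylor_stats` is hypothetical.

```python
import math

def taylor_stats(model, ref):
    """Return (normalized std, correlation, normalized centered RMSE).

    `model` and `ref` are equal-length sequences of grid-point values.
    Neither field may be spatially constant (zero standard deviation).
    """
    n = len(ref)
    if n != len(model) or n < 2:
        raise ValueError("need two equal-length fields with at least 2 points")
    m_mean = sum(model) / n
    r_mean = sum(ref) / n
    # Removing each field's mean makes the statistics independent of bias.
    m_anom = [m - m_mean for m in model]
    r_anom = [r - r_mean for r in ref]
    sd_m = math.sqrt(sum(a * a for a in m_anom) / n)
    sd_r = math.sqrt(sum(a * a for a in r_anom) / n)
    corr = sum(a * b for a, b in zip(m_anom, r_anom)) / (n * sd_m * sd_r)
    crmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(m_anom, r_anom)) / n)
    return sd_m / sd_r, corr, crmse / sd_r
```

Because the means are removed first, these statistics carry no information about the domain-mean bias, which is why the bias portrait plots are needed alongside the Taylor diagrams. The three quantities satisfy E'² = σ_m² + σ_r² − 2σ_mσ_rR, which is the geometric relation that lets them share one diagram.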

| Two-meter temperature
In DJF (Figure 2a), we have a large grouping of data points close to the CRU reference curve. We find RMSE (CC) ranging from 0.84 to 1.26 (0.88 to 0.99) for the downscaled models. Both EI and UDEL are similar to the CRU observations; the EI RMSE (CC) spreads are between 0.88 and 1.34 (0.93 and 1.00). This suggests that the ensemble of models across the seven domains is doing a relatively good job of simulating the amplitude of spatial variability and the  (Figure 3a). A similar range is found when comparing the EI with CRU: −2.01 to +2.64 °C. In terms of the biases calculated from the model ensemble mean and CRU, we find the largest negative bias in the South Asia domain (−10.99 °C) and the largest positive bias in the Arctic (1.47 °C). All model ensemble mean biases are negative, except for the Arctic. For three of the seven regions (Africa, North America, South Asia), the bias in the model ensemble mean falls outside of the UDEL-CRU range of differences. Viewing the UDEL-CRU differences as a rough measure of observational uncertainty, the models' ensemble means show mixed results in their agreement with observations. Specifically, the model-CRU biases tend to be larger, and the Africa and South America biases are opposite in sign to the UDEL-CRU reference bias.
Compared with DJF, JJA is somewhat more concentrated around the CRU reference line where the normalized RMSE is equal to 1 (Figure 2b). In fact, data points are difficult to differentiate, as they tend to overlap and fall along the 0.99 CC line. EI points are tightly clustered around the reference line, suggesting it is producing the same behavior as the CRU observations. Noticeably, the model and UDEL points that fall outside this clustering about the reference line are those for Europe, South America, and South Asia. In general, the degree to which the spreads of both RMSE and CC are small and fairly close to the reference field (save a few outliers) indicates that the seasonal pattern is in good agreement with CRU near-surface temperatures. In terms of biases (Figure 3b), we find a similar spread between UDEL and CRU (−3.39 to +1.73 °C) and EI and CRU (−2.77 to −0.96 °C). When the model-CRU biases are calculated, we find the largest negative biases across Africa (−6.05 °C) and South Asia (−5.19 °C). Overall, the model ensemble is producing cooler temperatures across all seven CORDEX domains. Also, similar to DJF, three of the seven regions during JJA have biases outside the UDEL-CRU range, and they are the same three regions: Africa, North America, and South Asia. For these regions, in both DJF and JJA, the ensembles have substantial cool biases compared with observations.

| Precipitation
In DJF, most of the UDEL points are tightly clustered to the left of the reference line (Figure 4a). We find the smallest spread for both the RMSE and CC, suggesting UDEL is well matched with CRU's seasonal precipitation pattern. The CORDEX simulation data points across the seven domains are distributed to the left and right of the reference curve, with the Arctic and South Asia domains showing the largest spread; Africa shows similar, though not as pronounced, behavior. The spread of RMSE and CC for the CORDEX simulations is somewhat larger than UDEL, but smaller than the EI. In JJA, we see similar behavior as in DJF, with both UDEL and models grouped about the reference line (Figure 4b).
The most obvious difference between the precipitation and two-meter temperature Taylor plots is the expanded normalized standard deviation axis. This extended axis is a function of how poorly the ERA-Interim precipitation compares to the CRU observations, as indicated by the large RMSE values. Moreover, we found that three regions had the largest RMSE values in both DJF and JJA: Africa, South America, and South Asia.
Domain-averaged DJF daily precipitation bias indicates UDEL has a similar precipitation rate to CRU (Figure 3a); biases range from −0.07 (Africa) to 1.56 mm day⁻¹ (South Asia). When we compare EI with CRU, we find that EI has less domain-wide precipitation than CRU, with as much as 21.45 mm day⁻¹ (East Asia) less than observations. The model versus CRU biases are more diverse. The largest negative model bias is found in South Asia (−27.27 mm day⁻¹) and the largest positive is found in Europe (9.72 mm day⁻¹).
In JJA, we see similar behavior between UDEL and CRU as seen in DJF; biases are between −1.24 and 0.51 mm day⁻¹ (Figure 3b). In comparison, EI is underestimating precipitation, with the largest negative bias being −13.35 mm day⁻¹ in South Asia. However, EI overestimates precipitation in East Asia with a bias of 21.8 mm day⁻¹. Across five of the domains, the model ensemble is overestimating precipitation, ranging from 3.04 (South Asia) to 9.06 mm day⁻¹ (Europe). The largest negative model bias is found in South America, at −4.84 mm day⁻¹. ERA-Interim errors tend to be larger in the domains that include the deep tropics, such as Africa, South America, and South Asia.

| SUMMARY AND DISCUSSION
In this study, we use metrics to quantify the RCM ensemble performance across seven CORDEX domains. Specifically, we use bias portrait plots and Taylor diagrams to compare seasonal mean two-meter temperature and daily precipitation model performance against the CRU observational set. We also use UDEL observations and ERA-Interim reanalysis in our comparison against CRU.
FIGURE 3 Seasonal two-meter temperature (°C) and precipitation (mm day⁻¹) biases for the seven CORDEX domains in DJF (a) and JJA (b). Blue (red) shading represents negative (positive) biases vs. the reference observation set, CRU. Darker shading indicates larger bias
In terms of near-surface temperature, the model ensemble means across most of the seven domains compared relatively well with CRU. We found a small spread in both the RMSE and CCs, suggesting that the models are in phase and simulate the seasonal patterns well for both DJF and JJA. This behavior is reflected as a grouping of points about the reference line in the Taylor diagram. The largest domain-wide seasonal biases generally are found when comparing the model ensemble means to the CRU observations; the simulations tend to be colder for both seasons. ERA-Interim and UDEL have similar magnitudes of bias, with both having warm and cold biases for specific regions.
Seasonal daily mean precipitation behavior across the seven CORDEX domains is more mixed than the temperature comparisons. We find that both UDEL and model ensemble means are more in phase with the seasonal behavior found in the CRU, with UDEL having a smaller spread for both the RMSE and CC; this is true for both DJF and JJA. The departure from this behavior in the daily precipitation comparisons is the EI. ERA-Interim precipitation underperforms in three of the CORDEX domains, specifically Africa (Kalognomou et al., 2013), South America, and South Asia. We believe this behavior is a function of the EI being a model product and not constrained with precipitation observations; Weedon et al. (2014) confirm the EI precipitation bias, and we suggest ERA-Interim reanalysis rainfall should not be used as a proxy for observed rainfall.
Precipitation biases show more variability than the two-meter temperature analysis. UDEL versus CRU biases are relatively small, showing us that both observation sets are in good agreement. The same cannot be said for the EI. We found large negative biases, especially in the Asian domains, of up to 21 mm day⁻¹. The EI also generally underproduces precipitation in both seasons. In relative terms, the model-CRU biases are closer to those biases found with the EI, with the exception that 65% of the domains have a positive bias; the models are over-simulating precipitation. This may be a function of the relative coarseness of the model domains compared with observations. Future work that would add further insight into the behavior of the downscaled model ensemble mean would be to compare the biases against the UDEL-CRU bias difference for each region.
Overall, our results suggest that the performance of the models versus the CRU is in phase with the seasonal cycle, as shown in the Taylor diagrams and bias plots. We do find less robust agreement in some CORDEX domains that encompass large swaths of the lower latitude regions, such as South and East Asia; there is less variability in the tropical climate system (i.e., not much difference between warm and cold seasons) and model convective parameterization schemes are tested more in the tropics than in nontropical domains. Domains that also include very high topography appear to have larger temperature and precipitation biases. South Asia seems to be an outlier domain for these specific regions, as it covers a large latitudinal and longitudinal area with wide-ranging, extreme topography. Given such a spread among the domains, the Taylor diagrams provide a succinct way of comparing multiple data points, giving a visual representation of important statistical information that is easy to analyze.
FIGURE 4 DJF (a) and JJA (b) seasonal average daily precipitation Taylor diagram for the seven CORDEX analysis regions. The UDEL observations (red dots), ERA-Interim reanalysis (blue dots), and model simulations (black dots) are compared to the CRU observations. Numbers next to each dot denote the specific CORDEX domain. Numbers next to domain names represent the number of members in the model ensemble average