Evaluation of CMIP5 retrospective simulations of temperature and precipitation in northeastern Argentina

It is generally agreed that models that better simulate historical and current features of climate should also be the ones that more reliably simulate future climate. This article describes the ability of a selection of global climate models (GCMs) of the Coupled Model Intercomparison Project Phase 5 (CMIP5) to represent the historical and current mean climate and its variability over northeastern Argentina, a region that exhibits frequent extreme events. Two types of simulations are considered: Long‐term simulations for 1901–2005 in which the models respond to climate forcing (e.g. changes in atmospheric composition and land use) and decadal simulations for 1961–2010 that are initialized from observed climate states. Monthly simulations of precipitation and temperature are statistically evaluated for individual models and their ensembles.


Introduction
Southeastern South America presents large departures from the mean climate at different time scales leading to frequent extreme events, including floods, droughts, and heat waves that affect natural and human systems (Magrin et al., 2014). Several observational studies have found that extreme events have increased in their frequency and severity. This trend may become more pronounced in the coming decades as discussed by Seneviratne et al. (2012), Cavalcanti et al. (2015) and Carril et al. (2016). If so, these changes are likely to disrupt hydrological systems affecting food production in one of the most fertile plains in the world. Research on regional climate variability and change then becomes a necessity in order to transfer scientific knowledge to decision-making processes.
Projections of future climate are produced with numerical simulations based on global climate models (GCMs), whose reliability depends on their ability to reproduce historical and current features of climate. To this end, the World Climate Research Programme's Working Group on Coupled Modelling promoted a set of experiments known as the fifth phase of the Coupled Model Intercomparison Project (CMIP5; Taylor et al., 2012a), following a previous version known as CMIP3 (Meehl et al., 2007). In general, models have less skill to simulate precipitation than they do for temperature because the temperature is obtained from a thermodynamical balance, while precipitation results from simplified parameterizations approximating actual processes ; see also references therein). These authors showed that the CMIP5 models reproduce global scale patterns of surface temperature and that at regional scales, the CMIP5 simulations of temperature and large-scale precipitation have improved over those of CMIP3.
Studies focused on southeastern South America have reported that CMIP5 models properly simulate regional mean temperature (Kumar et al., 2014). de Barros Soares et al. (2017) have shown that most CMIP5 models capture the positive trends in temperature. CMIP5 models simulate precipitation with lower relative errors and smaller inter-model dispersion than CMIP3 in annual and summer precipitation patterns (Gulizia and Camilloni, 2015; Díaz and Vera, 2017). Yet in the eastern portion of southeastern South America, a CMIP5 multi-model ensemble underestimates the annual cycle of precipitation as well as drought events Rivera, 2013, 2016). Consistent with these findings, Maenza et al. (2017) found that while 15 CMIP5 models and their ensemble capture the low-frequency variations of wet season precipitation, they underestimate precipitation values over the western Pampas of Argentina. Trends in precipitation for the period 1902-2005 have been discussed by Vera and Díaz (2015), who reported that an ensemble of CMIP5 models identified the positive change of precipitation, but with a weaker magnitude than that in observations. Furthermore, ensembles of regional climate models coordinated under the project CLARIS-LPB (see Solman et al., 2013; Sánchez et al., 2015; Boulanger et al., 2016 and references therein) show that the spatial and temporal patterns provide more detail although they still reflect the biases of the driving GCMs. Many studies that evaluate the performance of CMIP5 models have focused on long-term simulations. However, previous observational studies in southeastern South America have shown that decadal variability plays a key role in the region's climate (Baethgen and Goddard, 2013; Grimm and Saboia, 2015).
Figure 1. Topography map of the southern portion of South America together with the main rivers that drain into the La Plata River. The red box highlights the study region in northeastern Argentina.
The CMIP5 decadal simulations are expected to reproduce the decadal variability since the simulations do not only respond to external forcing but also consider internal interactions within the climate system (Taylor et al., 2012a). Thus, decadal simulations potentially track the actual trajectory of climate change and explore climate predictability on decadal to multi-decadal time scales.
Argentina is a leading cereal and meat exporter, considered one of the breadbaskets of the world, and plays a strategic role in global food security (Fischer et al., 2014). This study focuses on an extensive portion of the Pampas in northeastern Argentina, which presents three remarkable features: (1) it produces more than 80% of the agricultural yield (mostly soybeans, maize, sunflower, and wheat) and cattle raising of the country (Agrofy News, 2017; Ministry of Agroindustry, 2017); (2) it experiences hydro-climate variability on different time scales (i.e. interannual and decadal) and extreme events that may increase in frequency and severity in the coming years; and (3) it concentrates more than 90% of Argentina's population, which together with the consequent concentration of economic activities makes it particularly vulnerable to hydro-climate variability and extremes. The study region is mostly flat with almost no slopes and is delimited by 65-58°W and 36-26°S (see Figure 1). The region is relatively homogeneous from a climatic perspective, identified as a subtropical humid climate according to the Köppen-Trewartha climate classification (Jacob et al., 2012; Gallardo et al., 2016). The regional precipitation is mostly uniformly distributed throughout the year and the thermal gradient is latitudinal (Berbery and Barros, 2002; Caffera and Berbery, 2006).
The main objective of this article is to evaluate the ability of CMIP5 models to represent the historical and present climate over northeastern Argentina. In doing so, measures of statistical and spatio-temporal properties of historical long-term and decadal simulations are assessed and subsets of the best performing models are identified. Section 2 describes the CMIP5 and its models, the data sets used and the evaluation methodology. Sections 3 and 4 assess to what extent the GCMs reproduce the observed precipitation and temperature in long-term and decadal simulations. Finally, Section 5 presents the conclusions.

Data and methodology
CMIP5 models
CMIP5 provides an unprecedented collection of data sets produced with climate models. The CMIP5 experiments were performed in two groups: (1) long-term simulations with century time scales and (2) near-term simulations (also known as decadal simulations) with decadal time scales of about 10-30 years (Taylor et al., 2012a). Long-term experiments simulate the response to climate forcing like the atmospheric composition (including CO2) due to volcanic and anthropogenic influence, solar radiation, emissions or concentrations of short-lived species, natural and anthropogenic aerosols, and land use. The decadal experiments not only respond to climate forcing as the long-term experiments do but are also initialized from observed states of the climate system. Thus, the decadal variability can enhance or moderate anthropogenic trends of climate change at regional scales, mainly in short-term periods (Baethgen and Goddard, 2013). In this way, decadal simulations potentially track the actual trajectory of climate change (Taylor et al., 2012a). This paper evaluates a set of 27 GCMs of the CMIP5 from different modelling centres. Model types include ocean-atmosphere coupled models (AOGCMs), earth system models (ESM), and models with chemical atmospheric processes (ChemAO and ChemESM). Table 1 presents the 27 selected GCMs and their features: responsible institution, model type, horizontal and vertical resolution together with a main citation. Of the 27 GCMs, we evaluate 25 with historical long-term simulations and 7 with decadal simulations. Two GCMs (CanCM4 and CFSv2-2011) only perform decadal simulations and were included to strengthen the multi-model ensembles of decadal simulations. Following the coding established by the WCRP working group on coupled modelling, we use the simulations identified as r1i1p1, where r is the realization number, i is the initialization method indicator, and p is the perturbed physics number. More details are found in Taylor et al. (2012b).
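The ensemble member coding described above can be illustrated with a short sketch. The helper name `parse_member` and the returned dictionary keys are hypothetical and chosen only for illustration; the sketch assumes labels of the form r<N>i<M>p<L> with positive integer indices.

```python
import re

# Hypothetical helper (not part of any CMIP5 tool): split an ensemble
# member label such as "r1i1p1" into its three integer indices.
_MEMBER_RE = re.compile(r"^r(\d+)i(\d+)p(\d+)$")


def parse_member(label):
    """Return realization, initialization and physics indices of a label."""
    match = _MEMBER_RE.match(label)
    if match is None:
        raise ValueError(f"not an ensemble member label: {label!r}")
    realization, initialization, physics = (int(g) for g in match.groups())
    return {
        "realization": realization,
        "initialization": initialization,
        "physics": physics,
    }


# The member used throughout this study.
member = parse_member("r1i1p1")
```

Here `member` holds the value 1 for each of the three indices, which corresponds to the first realization, the first initialization method and the first physics version.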

CMIP5 data and observations
The model evaluation is performed on monthly precipitation (pr) and monthly surface air temperature (tas) derived from historical long-term and decadal simulations. The period 1971-2000 is used to evaluate the climatology (mean annual cycle and spatial distribution of annual means) of the present climate for both long-term and decadal simulations. This period is chosen as it is the last normal (30-year) period available in both simulations.
The observational data set of temperature and precipitation used in this study is the CRU TS 3.20 from the Climatic Research Unit, University of East Anglia (Harris et al., 2014). CRU TS 3.20 consists of monthly gridded data with a 0.5° × 0.5° spacing extending from January 1901 to December 2011. Other gridded data sets are available, such as the Global Precipitation Climatology Centre data set (GPCC; Schneider et al., 2011) and the University of Delaware Air Temperature and Precipitation database (UDEL; Matsuura and Willmott, 2009). All gridded products for southeastern South America face the problem of sparse gauge coverage at the beginning of the 20th century (e.g. Barreiro et al., 2014). According to Lovino (2015), CRU TS 3.20 fits the observed monthly mean precipitation and temperature over the study region well, although the most significant biases occur in the first decades of the 20th century.

Evaluation methodology
Model performance is assessed first by contrasting mean annual cycles and spatial patterns against observations. Second, the GCMs' skill to simulate precipitation and temperature is evaluated with statistical metrics. The metrics used are the mean bias error (MBE), the mean absolute error (MAE), the root mean square error (RMSE), the centred root mean square error (CRMS), the Nash-Sutcliffe efficiency (NSE) and the Pearson correlation coefficient (r). A description of these metrics can be found in Déqué (2012). The NSE is presented in Nash and Sutcliffe (1970) and further discussed in Moriasi et al. (2007).
MBE indicates whether a GCM simulation overestimates or underestimates the observed data. MAE and RMSE represent the magnitude of the error. NSE varies between -∞ and 1, 1 being the perfect score. Moriasi et al. (2007) suggested that model performance can be evaluated as 'satisfactory' if NSE > 0.5 and 'very good' if NSE > 0.75.
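As an illustration of the metrics listed above, the following minimal sketch computes them for two equal-length series using their standard textbook definitions. The function name `evaluation_metrics` and the toy values are assumptions made for this example and are not taken from the study.

```python
import math


def evaluation_metrics(sim, obs):
    """Return MBE, MAE, RMSE, CRMS, NSE and Pearson r for two series."""
    n = len(obs)
    if n != len(sim) or n < 2:
        raise ValueError("sim and obs must have the same length (at least 2)")
    sim_mean = sum(sim) / n
    obs_mean = sum(obs) / n
    errors = [s - o for s, o in zip(sim, obs)]
    sim_anom = [s - sim_mean for s in sim]
    obs_anom = [o - obs_mean for o in obs]
    sim_ss = sum(a * a for a in sim_anom)
    obs_ss = sum(a * a for a in obs_anom)
    if sim_ss == 0.0 or obs_ss == 0.0:
        raise ValueError("constant series: NSE and r are undefined")
    squared_error = sum(e * e for e in errors)
    return {
        # Sign of MBE shows over- (positive) or underestimation (negative).
        "MBE": sum(errors) / n,
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(squared_error / n),
        # CRMS removes the mean of each series before comparing them.
        "CRMS": math.sqrt(
            sum((a - b) ** 2 for a, b in zip(sim_anom, obs_anom)) / n
        ),
        # NSE compares the squared error with the observed variance.
        "NSE": 1.0 - squared_error / obs_ss,
        "r": sum(a * b for a, b in zip(sim_anom, obs_anom))
        / math.sqrt(sim_ss * obs_ss),
    }


# Toy series (not from the paper): the simulation is 1 unit too warm.
obs = [10.0, 14.0, 18.0, 22.0]
sim = [11.0, 15.0, 19.0, 23.0]
scores = evaluation_metrics(sim, obs)
```

For this toy case the constant offset gives MBE = MAE = RMSE = 1, while CRMS = 0 and r = 1 because the anomalies coincide, and NSE = 0.95. With these definitions RMSE² = MBE² + CRMS², so a purely systematic error affects the RMSE and the NSE but not the CRMS or the correlation.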
For the spatial evaluation, we compute the MBE, the RMSE and the spatial correlation between the mean annual fields of simulated and observed variables. To compute the spatial metrics, simulated data are interpolated to a 1° longitude × 1° latitude grid through inverse distance weighting (IDW; Shepard, 1968). The CRU data set is also used at 1° grid spacing. We choose 1° grid spacing as an intermediate resolution among the different resolutions of the GCMs and the observed data set (see Table 1).
The results are shown for the individual models and multi-model ensemble means. For long-term simulations, we compute the multi-model ensemble means for the 25 studied GCMs and for subsets of models selected according to the best performance to simulate the regional climate. For decadal simulations, we compute the multi-model ensemble means for the seven GCMs examined here.
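The inverse distance weighting step can be sketched as follows. This is only an illustration under stated assumptions: the power parameter of 2, the use of plain Euclidean distance in degrees and the function name `idw_value` are choices made for this example, since the text does not specify these details.

```python
import math


def idw_value(target_lon, target_lat, points, power=2.0):
    """Interpolate a value at one target location from (lon, lat, value) points.

    Weights are 1 / distance**power (assumed power of 2). Distances are
    Euclidean in degrees, a simplification made for this sketch.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for lon, lat, value in points:
        distance = math.hypot(lon - target_lon, lat - target_lat)
        if distance == 0.0:
            # A source point at the target location is returned unchanged.
            return value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("at least one source point is required")
    return weighted_sum / weight_total


# Toy example (values are not from the paper): four model grid points
# surrounding a target node of the 1-degree grid at 61 W, 31 S.
model_points = [
    (-62.0, -30.0, 20.0),
    (-60.0, -30.0, 22.0),
    (-62.0, -32.0, 18.0),
    (-60.0, -32.0, 20.0),
]
interpolated = idw_value(-61.0, -31.0, model_points)
```

Because the four source points are equidistant from the target, the weights are equal and the interpolated value is their arithmetic mean. Applying the same function at every node of the target grid would produce the regridded field.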
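A minimal sketch of how a multi-model ensemble mean and a performance-based subset could be computed is given below. The model names and values are hypothetical, and the selection rule shown (a minimum NSE) is only one possible criterion.

```python
def nash_sutcliffe(sim, obs):
    """Nash-Sutcliffe efficiency of a simulated series against observations."""
    obs_mean = sum(obs) / len(obs)
    squared_error = sum((s - o) ** 2 for s, o in zip(sim, obs))
    obs_ss = sum((o - obs_mean) ** 2 for o in obs)
    return 1.0 - squared_error / obs_ss


def select_models(series_by_model, obs, min_nse):
    """Names of the models whose NSE reaches the given threshold."""
    return [
        name
        for name, sim in series_by_model.items()
        if nash_sutcliffe(sim, obs) >= min_nse
    ]


def ensemble_mean(series_by_model, members):
    """Unweighted mean across the chosen models at every time step."""
    if not members:
        raise ValueError("the ensemble needs at least one member")
    length = len(series_by_model[members[0]])
    return [
        sum(series_by_model[name][t] for name in members) / len(members)
        for t in range(length)
    ]


# Hypothetical models and toy values, used only to show the mechanics.
obs = [10.0, 14.0, 18.0, 22.0]
series_by_model = {
    "model_a": [11.0, 15.0, 19.0, 23.0],  # 1 unit too warm
    "model_b": [14.0, 18.0, 22.0, 26.0],  # 4 units too warm
    "model_c": [9.0, 13.0, 17.0, 21.0],   # 1 unit too cold
}
all_mean = ensemble_mean(series_by_model, list(series_by_model))
selected = select_models(series_by_model, obs, min_nse=0.8)
subset_mean = ensemble_mean(series_by_model, selected)
```

In this toy case the subset retains `model_a` and `model_c`, whose opposite biases cancel in the subset mean, whereas the mean over all three models keeps a warm bias inherited from `model_b`.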

Temporal analysis of temperature
The evaluation metrics for the CMIP5 models that performed historical long-term simulations (Table 2) indicate that the regional mean temperature is noticeably well simulated. All models have correlations above 0.92 and low errors (the averages of the individual models' errors are MAE = 2.3°C and RMSE = 2.7°C). The error values are in the range of those reported by Flato et al. (2013) for southeastern South America. GCMs that achieve the best performances were chosen for further analysis. The selection criterion was that they must satisfy a minimum NSE of 0.8. The selected threshold is in the range of 'very good' values as recommended by Moriasi et al. (2007) for general performance ratings. Nine models, shaded in Table 2, meet the requirement. The selected models have high correlations with observations and very low statistical errors (1.40°C < MAE < 1.72°C and 1.78°C < RMSE < 2.10°C). The 9-model ensemble achieved the best evaluation metrics, exceeding those of the individual members. Of the individual models, CESM1-BGC and CCSM4 (rows 5 and 21) show the best skill scores, with an efficiency of 0.86, the lowest errors, and correlations of 0.94. Figure 2 presents the time series of areal-averaged annual and spring mean temperature for the ensemble of the 25 models and the ensemble of the subset of 9 models with the best performance. In both time series, the 25-model ensemble presents a systematic error, which is significantly reduced with the 9-model ensemble. These results are consistent with the reduction of the MBE from 0.68 to 0.02°C (Table 2). Figure 2(a) shows that the ensembles depict lower variability than observations, but still reproduce the trend in regional annual mean temperature in the latter part of the period. This warming trend is captured by all the selected models, although with different magnitudes (not shown).
It has been reported that northeastern Argentina registered the most significant positive trends in spring mean temperature (Lovino, 2015). Figure 2(b) shows that the 9-model ensemble correctly recognizes this trend. Similar results were reported by de Barros Soares et al. (2017) who showed that most CMIP5 historical simulations reproduce the observed annual and spring temperature trends in southeastern South America.
Scatterplots of the simulated and observed temperature of the two multi-model ensembles are presented in Figure 3. For reference, similar plots are shown for the two single models that exhibit the best evaluation metrics, the CCSM4 and CESM1-BGC. Figures 3(a) and (b) show that the ensembles achieve a low dispersion around the linear regression line (R² = 0.93). The dispersion of the data is slightly higher in cold months than in warm months. The regression line in Figure 3(a) shows that the 25-model ensemble overestimates temperature in warm months. In the 9-model ensemble (Figure 3(b)), the regression line fits the 1 : 1 line better than in the 25-model ensemble, reaching a slope value m = 1.03. Although the ensembles fit the data well, their variability is reduced given that the ensembles diminish the dispersion of simulated temperatures, keeping their monthly values in a narrow band when compared with observations. This produces a month-to-month discontinuity in the scatterplots, mainly in transition months between cold and warm seasons (for instance, see green dots representing April and October temperatures). The 9-model ensemble improves the representation of the monthly observed variability and reduces the discontinuity displayed by the scatterplot of the 25-model ensemble. Regarding the best-performance individual models, Figures 3(c) and (d) show that the two models are close to the 1 : 1 line with observations, with a very good dispersion given by the determination coefficient R² ∼ 0.88. Like the multi-model ensembles, the reference models fit mean temperatures better than cold or hot extremes.
Table 2. Statistical evaluation metrics between the areal-averaged time series of mean temperature observed and simulated by historical long-term simulations. The shaded rows correspond to the models that achieve the best evaluation metrics, performing the 9-model ensemble mean.
Figure 4 depicts the Taylor diagram of the regional mean temperature. Taylor diagrams summarize the degree of correspondence between the observed and simulated fields in terms of their correlation, their CRMS and the ratio of their variances (Taylor, 2001). The reference point 'a' in Figure 4 shows that observations have a standard deviation of 4.8°C. The diagram shows graphically that the 9-model ensemble mean (point 2) presents the best scores and correctly represents the reference standard deviation. The 25-model ensemble mean (point 1) exhibits a performance similar to that of the individual models with better skill scores. All analysed GCMs are in phase with observations, with correlation coefficients varying in the range 0.92-0.94. Also, most of the models form a cluster of points as they exhibit standard deviations between 5 and 7°C and CRMS close to 2°C. In this cluster, CCSM4 and CESM1-BGC (blue points) stand out as the closest models to the reference point. In general terms, the evaluation metrics computed in northeastern Argentina displayed by the Taylor diagrams are consistent with those found in most regions of the world (see Kumar et al., 2014), with slightly lower skills than in North America or southeastern Asia.
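The three statistics summarized by a Taylor diagram are linked by the law of cosines, CRMS² = σ_sim² + σ_obs² − 2 σ_sim σ_obs r (Taylor, 2001), which is what allows them to be shown in a single two-dimensional plot. The sketch below checks this relation numerically; the function name `taylor_statistics` and the toy series are assumptions made for illustration, and population standard deviations (division by n) are used.

```python
import math


def taylor_statistics(sim, obs):
    """Return (sim_std, obs_std, r, crms) using population statistics."""
    n = len(obs)
    if n != len(sim) or n < 2:
        raise ValueError("sim and obs must have the same length (at least 2)")
    sim_mean = sum(sim) / n
    obs_mean = sum(obs) / n
    sim_anom = [s - sim_mean for s in sim]
    obs_anom = [o - obs_mean for o in obs]
    sim_std = math.sqrt(sum(a * a for a in sim_anom) / n)
    obs_std = math.sqrt(sum(a * a for a in obs_anom) / n)
    if sim_std == 0.0 or obs_std == 0.0:
        raise ValueError("constant series: correlation is undefined")
    r = sum(a * b for a, b in zip(sim_anom, obs_anom)) / (n * sim_std * obs_std)
    crms = math.sqrt(sum((a - b) ** 2 for a, b in zip(sim_anom, obs_anom)) / n)
    return sim_std, obs_std, r, crms


# Toy series (not from the paper): the simulation exaggerates the amplitude.
obs = [12.0, 18.0, 24.0, 20.0, 14.0, 10.0]
sim = [13.0, 21.0, 27.0, 22.0, 13.0, 9.0]
sim_std, obs_std, r, crms = taylor_statistics(sim, obs)

# Right-hand side of the law-of-cosines relation behind the diagram.
law_of_cosines = sim_std ** 2 + obs_std ** 2 - 2.0 * sim_std * obs_std * r
```

In the diagram, the radial distance of a point is its standard deviation, the azimuthal angle is the arccosine of the correlation, and the distance to the reference point equals the CRMS, so `crms ** 2` and `law_of_cosines` agree up to rounding.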
So far we have seen that models tend to reproduce statistical features of the observations. Next, we explore basic aspects of the observed and simulated climatology, including the mean annual cycle and spatial patterns. Figure 5(a) shows that the 25-model ensemble mean correctly fits the annual cycle, slightly overestimating temperature from November to April, in agreement with Flato et al. (2013) for all southern South America. The inter-model range has an average amplitude of 8°C, increasing to 10°C in austral winter (June to September). Figure 5(c) indicates that the inter-model range of the nine selected GCMs is remarkably reduced to almost 2-3°C. The 9-model ensemble mean improves the representation of the annual cycle with respect to the 25-model ensemble, mainly in warm months. Also, note that all the GCMs identify the observed annual cycle (Figures 5(b) and (d)) with different systematic errors that are quantified in Table 2 by the MBE scores. Table 2 shows that 19 of the 25 models have positive MBEs, indicating their tendency to overestimate the average regional temperature. As Figure 5(b) shows, most of these models mainly overestimate the temperature in warm months. Figure 5(d) shows that most of the selected models reduce this error, accurately fitting observed mean values between March and December and only overestimating them in January and February. These results suggest that the nine models that achieve the best evaluation metrics and their multi-model ensemble mean are able to properly simulate the mean annual cycle of regional temperature.
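The mean annual cycle and the inter-model range discussed above can be sketched as follows. The sketch assumes a monthly series that starts in January and covers whole years; the function names and the toy values are hypothetical.

```python
def mean_annual_cycle(monthly_values):
    """Average each calendar month over all years of a monthly series.

    Assumes the series starts in January and spans whole years.
    """
    if not monthly_values or len(monthly_values) % 12 != 0:
        raise ValueError("series must cover a whole number of years")
    n_years = len(monthly_values) // 12
    return [sum(monthly_values[month::12]) / n_years for month in range(12)]


def inter_model_range(cycles):
    """Spread (maximum minus minimum) across models for each calendar month."""
    return [
        max(cycle[month] for cycle in cycles)
        - min(cycle[month] for cycle in cycles)
        for month in range(12)
    ]


# Toy data (not from the paper): two years of monthly values for one model.
year_one = [float(month) for month in range(12)]
year_two = [value + 2.0 for value in year_one]
cycle_a = mean_annual_cycle(year_one + year_two)

# A second hypothetical model that is uniformly 3 units warmer.
cycle_b = [value + 3.0 for value in cycle_a]
spread = inter_model_range([cycle_a, cycle_b])
```

Here `cycle_a` is the month-by-month average of the two years and `spread` equals 3 in every month, since the two toy models differ by a constant offset.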

Spatial analysis of temperature
The mean observed and simulated temperature fields are presented in Figure 6. Observations (Figure 6(a)) reveal a mostly south-north gradient with a temperature range of about 6°C. The 9-model ensemble mean (Figure 6(b)) exhibits a smoothed gradient resulting from a bias of about −2°C towards the south and a bias of about +1°C towards the north. Two of the models in the ensemble are shown in Figures 6(c) and (d). They are CCSM4 and CESM1-BGC, which were already shown to have the best temporal performance. The two models are able to simulate the observed field with only a slight overestimation towards the north. The spatial evaluation metrics of the mean annual temperature fields for the nine GCMs and their ensemble are presented in Table 3. Spatial correlations vary between 0.62 and 0.96, with six models reaching values above 0.9. In general, the GCMs with higher horizontal resolution obtain the best skill scores. For example, the CCSM4 and CESM1-BGC models exhibit spatial correlations of about r = 0.96, with a slightly warm MBE between 0.2 and 0.4°C, consistent with the overestimation of the spatial fields discussed above. In this case, the 9-model ensemble mean achieves spatial scores similar to those of the two models with the best performance. These results suggest that the 9-model ensemble does not improve on the ability of the best performing individual models to reproduce the annual average temperature field.

Temporal analysis of precipitation
Metrics for the areal-averaged precipitation evaluation are presented in Table 4. The skill scores are lower than those for temperature (for reference see Flato et al., 2013). The correlation coefficients range from 0.31 to 0.64. The RMSE and MAE metrics tend to be rather uniform among models, with RMSE varying between 42 and 53 mm and MAE between 32 and 42 mm. These errors are high but similar to those found in southeastern South America (Gulizia and Camilloni, 2015) and in other regions of the world (Kumar et al., 2014). For instance, the RMSE values are lower than in Africa but greater than in North America (see also Sheffield et al., 2013). The MBE metric suggests that most GCMs (18 out of 25) tend to underestimate precipitation, consistent with the results reported by Díaz and Vera (2017) for the La Plata Basin and Maenza et al. (2017) for the western Pampas of Argentina (south of the study region).
Statistical errors of precipitation are high, resulting in low Nash-Sutcliffe efficiencies. The correlation coefficient is more useful as a measure of the phase between time series, and the errors can be corrected using statistical methods (e.g. downscaling). Seven models with correlations higher than 0.6 are chosen for further analysis (see shaded rows in Table 4). Their statistical scores are remarkably better than for the rest of the models. This improvement is also reflected on the 7-model ensemble, which has a temporal correlation above 0.7, while the mean errors are reduced and the efficiency increases considerably.
The annual and summer precipitation time series of the 25-model ensemble and the 7-model ensemble along with observations are presented in Figure 7. The figure shows that the multi-model ensembles have lower temporal variability than observed precipitation. In the case of annual precipitation (Figure 7(a)), the 25-model ensemble underestimates the observed precipitation by 12% on average. The 7-model ensemble reduces the underestimation to just 1%. In the case of summer precipitation (Figure 7(b)), the 25-model ensemble underestimates the observed precipitation by 7% whereas the 7-model ensemble overestimates it by 13% on average. The percentages are estimated as series, respectively. A common feature of both ensembles is that they do not accurately reproduce the positive trend in annual and summer precipitation, but weakly recognize the positive pattern of change. The seven individual models do present a trend, but it is smaller than that in the observations. As mentioned above, most of the historical simulations of the CMIP5 models can recognize the right sign of the summer precipitation changes in southeastern South America, although weaker than observed (Vera and Díaz, 2015). Our results suggest that the annual and summer mean trends in precipitation are also weakly simulated in the northeast region of Argentina. Scatterplots of simulated precipitation versus the observed precipitation are presented in Figure 8. The scatterplots of the multi-model ensembles (Figures 8(a) and (b)) reflect the difficulties that some models have in representing observed precipitation. As with temperature, the ensembles reduce the temporal variability, flattening the dispersion of the data around the simulated monthly mean values. Thus, the scatterplot of the 25-model ensemble (Figure 8(a)) shows two quasi-horizontal bands and a severe dry bias for months with more than 100 mm of rain.
Also, low monthly precipitation values (less than 20 mm of rain) are not simulated by the multi-model ensemble. These biases are reduced when considering the 7-model ensemble mean (Figure 8(b)), with a closer fit to the 1 : 1 reference line, mainly in rainy months. In general, the individual models in the 7-model ensemble underestimate the mean precipitation in rainy months and overestimate it in months with low rainfall, a feature also reproduced in most of the 25 studied models (not shown). This behaviour can be seen in Figures 8(c) and (d) with the scatterplots of two of the best performing models, CCSM4 and CanESM2. As the linear regression lines show, the two models underestimate precipitation higher than 150 mm and overestimate precipitation lower than 50 mm. The data dispersion is lower for low precipitation values and increases for high precipitation values, resulting in determination coefficients between 0.36 and 0.39. These results suggest that while individual GCMs are able to recognize the temporal variability of the observed series, they present difficulties in simulating both the rainy months and those months of low precipitation. Figure 9 summarizes the performance of the models in a Taylor diagram. The 25-model ensemble mean (point 1) improves the CRMS and the correlation values of each GCM but underestimates the standard deviation of the reference value. The temporal variability is also reduced, as discussed earlier. The 7-model ensemble mean (point 2) notably reduces the difference in the standard deviation while maintaining the performance in correlation and CRMS. The individual standard deviation values indicate that six out of seven selected models exceed the observed variability. The CCSM4 model (point f) fits the observations (point a) best, with a slightly higher standard deviation and a correlation above 0.6.
The annual cycle of observed precipitation and all model simulations are plotted in Figure 10. Figure 10 (2016) found a similar result in the eastern portion of southern South America, reporting that CMIP5 models reproduce the shape of the annual cycle of precipitation but underestimate monthly totals throughout the year. A similar finding was reported by Maenza et al. (2017) in the western Pampas of Argentina. It is also evident in Figure 10 Figure 10(c) indicates that the 7-model ensemble mean properly reproduces the mean annual cycle of precipitation in the study region. The relative precipitation maximum in March, which is a particular feature of the region, is not simulated by the 7-model ensemble but is captured by some individual models (see Figure 10(b)).

Spatial analysis of precipitation
The observed and 7-model ensemble mean precipitation fields are presented in Figures 11(a) and (b) along with the two best performing models (CanESM2 and CCSM4) in Figures 11(c) and (d). Observations show a W-E gradient and slight increase towards the northeast (Figure 11(a)). The 7-model ensemble, as well as two of the best performing GCMs, tend to show an SW-NE gradient, mostly due to a negative bias towards the SW. The 7-model ensemble mean (Figure 11(b)) underestimates the regional precipitation field and reduces the spatial variability. The CanESM2 model (Figure 11(c)) achieves the best individual representation of the observed field as it reproduces the observed spatial variability, while the CCSM4 model smooths the gradient with positive bias in the whole region (Figure 11(d)).
The spatial skill scores of the annual simulated precipitation fields for the seven best performing models are presented in Table 5. Most of the GCMs present correlations in the range 0.6-0.7. The exception is the MIROC4h, which exhibits the best correlation (r = 0.8), probably due to its high resolution (0.56° × 0.56°), despite underestimating the annual precipitation by about 30%. The 7-model ensemble reduces the statistical errors of each individual GCM but it does not improve their spatial correlations. The CanESM2 model exhibits high correlation and an MBE of about −50 mm year⁻¹, which represents 5% of the areal mean precipitation. This result is consistent with its ability to simulate the annual mean field of precipitation discussed above. The CCSM4 model overestimates the areal-averaged mean precipitation by about 150 mm year⁻¹, which represents 15% of the observed field. These results suggest that the multi-model ensemble does not improve the ability of individual models to simulate spatial precipitation fields. The best-performance individual models can recognize spatial features that the multi-model ensemble does not simulate, mainly the high spatial variability of the precipitation field.
Table 4. Statistical evaluation metrics between the areal-averaged time series of monthly precipitation observed and simulated by historical long-term simulations. The shaded rows correspond to the models that achieve the best evaluation metrics, performing the 7-model ensemble mean.
Figure 8. Scatterplots of simulated versus observed precipitation for the multi-model ensembles (a, b) and models with the best evaluation metrics (CCSM4 in c, CanESM2 in d). The linear regression line is plotted and its slope m is presented together with the determination coefficient R² in the bottom-right corner of each scatterplot.
Figure 9. Taylor diagram of the regional precipitation simulated by historical long-term simulations (period 1901-2005). The blue dots correspond to the models that achieve the best evaluation metrics and perform the 7-model ensemble. CRMS: centred root mean square error.

Decadal simulations
Table 6 presents the GCMs with decadal simulations assessed in this study, including their evaluation metrics. The evaluation focuses on those GCMs that show greater ability to simulate the regional climate in the historical long-term experiment. Only five of the subset of best models present decadal simulations (indicated with 'a' in Table 1). Table 6 shows that the evaluation metrics of the decadal simulations do not show great differences from those found in the evaluation of long-term simulations (Tables 2 and 4).
In the case of precipitation, the NSE shows a slight improvement with mostly positive values between 0.1 and 0.25, although the statistical errors and the correlations remain in the range of the historical long-term simulations. Other studies have shown that the decadal simulations have been an advance in the modelling and predictability of the large-scale SST patterns that force the region's climate (e.g. Gonzalez and Goddard, 2016; Meehl et al., 2016). While it is well known that the study region is strongly forced by large-scale SST patterns (Seager et al., 2010; Barreiro et al., 2014), our results suggest that the performance of decadal simulations is similar to that of the historical long-term simulations.
In the case of temperature, the evaluation metrics present values very similar to those of the historical long-term simulations as in precipitation. However, noted that the maximum errors of decadal simulations (MAE = 2.99 ∘ C and RMSE = 3.43 ∘ C; Table 6) are lower than the maximum errors of the historical long-term simulations (MAE = 4.08 ∘ C and RMSE = 4.67 ∘ C; Table 2). Thus, the magnitude of the errors is slightly reduced in decadal simulations.   As in the historical long-term simulations, the multi-model ensemble improves the evaluation metrics of precipitation and temperature, over those of the individual GCMs (see Table 6). According to the Table 6, the CCSM4 model again shows the greater ability to simulate the regional mean temperature and precipitation. The CFSv2-2011 and the CMCC-CM models also exhibit good performance for temperature, while the CanCM4 model is among the best for precipitation.
The Taylor diagram of mean temperature (Figure 12(a)) is similar to that obtained for the long-term simulations (Figure 4): all GCMs simulate temperature with standard deviations greater than the reference value, correlation coefficients in the range 0.92-0.95 and CRMS close to 2 °C. The multi-model ensemble (point I) approaches the reference point as it improves both the correlation and CRMS values. Standard deviations differ among models in the Taylor diagram of precipitation (Figure 12(b)): certain models considerably underestimate the reference value of 49 mm (point A) while others overestimate it. The models with the highest standard deviation also exhibit the highest correlation coefficients (points B and C, r ∼ 0.65). The multi-model ensemble (point I) improves on the correlation and the CRMS of each GCM, although it strongly underestimates the reference standard deviation. The mean annual cycles of temperature and precipitation from observations and simulations are shown in Figure 13. Figure 13(a) shows that the multi-model ensemble temperature fits the observed annual cycle very well, consistent with a low MBE of only −0.1 °C (see Table 6). Most of the GCMs (Figure 13(b)) reproduce the annual cycle but with systematic errors that are quantified by the MBE values in Table 6. Five of the seven analysed models present positive MBEs of less than 1 °C. Only two models have negative errors, and these exceed 1.6 °C in magnitude. The multi-model ensemble precipitation (Figure 13(c)) underestimates observed precipitation during the rainy season (between September and April) while slightly overestimating precipitation in the dry season. Each GCM simulates the annual cycle of precipitation with a different degree of success: while some GCMs adequately reproduce the annual cycle (e.g. CCSM4 and CanCM4), others are far from the observed distribution (e.g. FGOALS-g2 and CMCC-CM).
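The three quantities displayed in a Taylor diagram (standard deviation, correlation and CRMS) can be computed as in the minimal Python sketch below. It uses synthetic series rather than the data of this study, and the function name taylor_stats, the noise levels and the five-member ensemble are assumptions made purely for illustration. Under the assumption that model errors are partly independent, the sketch also illustrates one way an ensemble mean can gain correlation while losing variance, the behaviour described above for precipitation.

```python
import numpy as np

def taylor_stats(sim, obs):
    """Return (std_sim, std_obs, r, crms) for two equal-length series."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    std_sim = sim.std()  # population standard deviation (ddof=0)
    std_obs = obs.std()
    r = np.corrcoef(sim, obs)[0, 1]  # Pearson correlation
    # Centred RMS difference: RMS of the difference between the anomalies
    crms = np.sqrt(np.mean(((sim - sim.mean()) - (obs - obs.mean())) ** 2))
    return std_sim, std_obs, r, crms

# Synthetic example (assumed values, not data from the paper):
# each "model" shares part of the observed signal plus independent noise.
rng = np.random.default_rng(0)
obs = 50.0 * rng.standard_normal(600)
members = [0.6 * obs + 40.0 * rng.standard_normal(600) for _ in range(5)]
ensemble = np.mean(members, axis=0)

for label, series in [("member 1", members[0]), ("ensemble", ensemble)]:
    s, o, r, c = taylor_stats(series, obs)
    print(f"{label}: std={s:.1f} (ref {o:.1f}), r={r:.2f}, CRMS={c:.1f}")
```

The three statistics are linked by the law of cosines, CRMS² = σ_sim² + σ_obs² − 2 σ_sim σ_obs r, which is what allows them to be shown on a single two-dimensional diagram. Averaging the members cancels part of their independent noise, so in this synthetic case the ensemble has a higher correlation but a smaller standard deviation than any single member.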

Conclusions
This paper evaluated the ability of CMIP5 GCMs to simulate the observed spatio-temporal behaviour of the climate in northeastern Argentina. The GCMs were selected from the two main core sets of CMIP5 simulations: long-term and decadal. Monthly simulations of precipitation and temperature were evaluated against the observational CRU TS 3.20 data set. The simulations of each GCM and the multi-model ensembles were evaluated through the MBE, the MAE, the RMSE, the NSE coefficient, and the Pearson correlation. These metrics were computed for both areal-averaged time series and spatial fields. We also determined the ability of GCMs to simulate the annual mean fields and the mean annual cycle of temperature and precipitation for the normal 30-year period 1971-2000. Inspection of the results allowed selection and evaluation of model subsets that better represent the regional climate.
The temperature observations show that the region: (1) experienced a warming trend in annual mean temperature, with a strong signal of positive change during austral spring; (2) presents a marked annual cycle of temperature with mean values of 25 °C in summer and 12 °C in winter; and (3) has a spatial south-north gradient that ranges from 15 to 21 °C. These three features are correctly simulated by all GCMs that performed historical long-term simulations. All models reach correlations in the range 0.92-0.94 for the 1901-2005 time series. Among them, the CCSM4 and CESM1-BGC models have the highest ability to simulate regional temperature, as they achieve the best evaluation metrics and properly reproduce the features of the climatology. Furthermore, the ensemble of the nine most skilful models improves on the performance of the individual GCMs. The 9-model ensemble reaches a correlation of 0.96, reduces by more than 20% the positive systematic errors shown by the all-model ensemble in the 1901-2005 time series, and fits the climatological annual cycle well.
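For reference, the five evaluation metrics named above (MBE, MAE, RMSE, NSE and Pearson correlation) can be sketched in Python as follows. The formulas are the commonly used textbook definitions, assumed here rather than quoted from this paper; in particular, the sign convention (simulation minus observation, so a negative MBE means underestimation) and the function name evaluation_metrics are assumptions of this sketch.

```python
import numpy as np

def evaluation_metrics(sim, obs):
    """Common skill metrics for a simulated series against observations.

    Assumed convention: error = simulation - observation.
    """
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    err = sim - obs
    mbe = err.mean()                    # mean bias error
    mae = np.abs(err).mean()            # mean absolute error
    rmse = np.sqrt((err ** 2).mean())   # root mean square error
    # NSE: 1 is a perfect fit; 0 means no better than the observed mean
    nse = 1.0 - (err ** 2).sum() / ((obs - obs.mean()) ** 2).sum()
    r = np.corrcoef(sim, obs)[0, 1]     # Pearson correlation
    return {"MBE": mbe, "MAE": mae, "RMSE": rmse, "NSE": nse, "r": r}

# Toy series (assumed values): a simulation with a constant +1 bias
metrics = evaluation_metrics([2, 3, 4, 5, 6], [1, 2, 3, 4, 5])
print(metrics)
```

In the toy series the simulation is perfectly correlated with the observations yet has a non-zero bias and an NSE below 1, which shows why correlation alone is not sufficient to rank models and why the error metrics are reported alongside it.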
The precipitation observations show that the region: (1) experienced a positive trend in annual and summer precipitation after the 1960s (Barros et al., 2008; Lovino et al., 2014); (2) has a marked intra-annual variability, with the warm season (October to April) accounting for more than 80% of the annual precipitation; and (3) presents a spatial west-east gradient. The set of CMIP5 historical long-term simulations reproduces the regional precipitation features with varying degrees of success. The ensemble built from the seven most skilful models (in terms of precipitation) shows better skill than the individual GCMs, reducing the all-model biases by more than 25%. In addition, the 7-model ensemble accurately reproduces the annual precipitation cycle, improving on the other GCMs, which underestimate rainfall in the rainy season. However, the 7-model ensemble does not clearly identify the positive trend in annual and summer precipitation. Spatially, the best individual models and the 7-model ensemble mean correctly reproduce the large-scale features of the observed annual precipitation pattern but fail to replicate the smaller-scale spatial variability.
The seven models with better precipitation skill had already been identified as those that best resemble the South American monsoon, with the most accurate quantitative estimates of summer rainfall in the monsoon area (Hsu et al., 2013). Furthermore, Jones and Carvalho (2013) showed that most of the selected GCMs ably reproduce large-scale features of the South American Monsoon System such as its seasonal amplitude, onset and duration. Thus, the ability of the selected GCMs to reproduce the monsoon features may explain their high performance in simulating precipitation in northeastern Argentina. Four of the seven models selected for precipitation also achieve the best evaluation metrics for temperature: CCSM4, CESM1-BGC, CESM1-FASTCHEM, and NorESM1-M.
It is widely known that decadal variability, strongly forced by large-scale SST patterns, plays a key role in modulating the climate of northeastern Argentina (e.g. Barreiro et al., 2014 and references therein). Therefore, it would be expected that the inclusion of this external forcing in the decadal simulations would lead to improvements in skill. Nevertheless, our results indicate that decadal simulations do not improve on the evaluation metrics achieved by the long-term simulations, suggesting that initializing decadal simulations with climate observations does not necessarily lead to a better representation of regional climate. Still, some individual models are successful. The CCSM4 model achieves the best evaluation metrics for both temperature and precipitation, demonstrating the greatest ability to simulate the regional climate on decadal time scales. This model is also among the best performers in the evaluation of long-term simulations.
Reliable projections of future climate require models that adequately represent the characteristics of the regional climate system. The evaluation of the long-term simulations allowed us to select a subset of 9 GCMs that best reproduce the mean temperature fields and a subset of 7 GCMs that most skilfully represent regional precipitation. The multi-model ensemble means computed using the selected GCMs achieve the greatest skill in representing historical and present climate, as they reach the best evaluation metrics, exceeding the performance of each GCM and of the multi-model ensemble built from all the GCMs. It is expected that these models, with their greater capacity to represent the regional historical climate, will be more appropriate for simulating the future climate.
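The idea of building an ensemble mean from a subset of the most skilful models can be sketched as below. This is only an illustration: ranking by RMSE alone is an assumption of the sketch (the subsets in this study were chosen by inspecting several metrics together), and the function name subset_ensemble_mean and the model labels are hypothetical.

```python
import numpy as np

def subset_ensemble_mean(sims, obs, k):
    """Average the k models with the lowest RMSE against obs.

    sims: dict mapping model name -> 1-D series; obs: 1-D series.
    Ranking by RMSE alone is an assumption made for this sketch.
    """
    obs = np.asarray(obs, dtype=float)
    rmse = {
        name: np.sqrt(np.mean((np.asarray(s, dtype=float) - obs) ** 2))
        for name, s in sims.items()
    }
    best = sorted(rmse, key=rmse.get)[:k]  # names of the k most skilful models
    mean = np.mean([np.asarray(sims[n], dtype=float) for n in best], axis=0)
    return best, mean

# Hypothetical model labels and toy values, not data from the paper
obs = [1.0, 2.0, 3.0]
sims = {"model_a": [1.0, 2.0, 3.0],
        "model_b": [2.0, 3.0, 4.0],
        "model_c": [10.0, 10.0, 10.0]}
best, mean = subset_ensemble_mean(sims, obs, k=2)
print(best, mean)
```

Excluding the poorly performing model keeps its large error out of the average, which is the rationale for preferring the selected-model ensembles over the all-model ensemble.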