Evaluation of precipitation datasets against local observations in southwestern Iran

This study provides a comprehensive evaluation of a great variety of state‐of‐the‐art precipitation datasets against gauge observations over the Karun basin in southwestern Iran. In particular, we consider (a) gauge‐interpolated datasets (GPCCv8, CRU TS4.01, PREC/L, and CPC‐Unified), (b) multi‐source products (PERSIANN‐CDR, CHIRPS2.0, MSWEP V2, HydroGFD2.0, and SM2RAIN‐CCI), and (c) reanalyses (ERA‐Interim, ERA5, CFSR, and JRA‐55). The spatiotemporal performance of each product is evaluated against monthly precipitation observations from 155 gauges distributed across the basin during the period 2000–2015. This way, we find that overall the GPCCv8 dataset agrees best with the measurements. Most datasets show significant underestimations, which are largest for the interpolated datasets. These underestimations are usually smallest at low altitudes and increase towards more mountainous areas, although there is large spread across the products. Interestingly, no overall performance difference can be found between precipitation datasets for which gauge observations from Karun basin were used, versus products that were derived without these measurements, except in the case of GPCCv8. In general, our findings highlight remarkable differences between state‐of‐the‐art precipitation products over regions with comparatively sparse gauge density, such as Iran. Revealing the best‐performing datasets and their remaining weaknesses, we provide guidance for monitoring and modelling applications which rely on high‐quality precipitation input.


| INTRODUCTION
In many regions of the world, changes in water availability have severe impacts on society and economy. These changes will be intensified by climate change in some regions (Kirtman et al., 2013). This jeopardizes water and food security, especially in developing countries with highly agriculture-oriented economies (Vaghefi et al., 2019; Hameed et al., 2020). In this context, a lack of reliable precipitation information, which is key to monitoring these water dynamics, has been a serious barrier to supporting decision-makers. To compensate for this deficit, different sources of data, including gauge-based, satellite-based, and reanalysis products, have been utilized by researchers to monitor and predict extreme events across the world (Naumann et al., 2014; AghaKouchak et al., 2015; Zhan et al., 2016; Balsamo et al., 2018). However, without comprehensive and comparative validation against ground-based observations the usefulness of these data products is unclear.
In spite of this need for accurate ground-based precipitation datasets, reliable gauge station datasets are not widely available. Further, existing gauge observations often suffer from artificial disturbances because the stations' reliability varies with time due to, for example, instrument deterioration or relocations. As a result, substantial parts of the data records may be questionable or missing.
To address these problems, some studies have focused on reconstruction, quality control and homogeneity of the time series (e.g., González-Rouco et al., 2001 in southwestern Europe; Beaulieu et al., 2008 in Québec, Canada; Vicente-Serrano et al., 2010 in north-eastern Spain).
As a result of sparse gauge measurements, datasets of near-global coverage have been generated with various approaches. Some make direct use of gauge measurements together with statistical techniques for interpolating the observations (e.g., Chen et al., 2002; Xie et al., 2007; Becker et al., 2013; Harris et al., 2014). Others use remote sensing from satellites with high spatiotemporal resolution and near real-time availability, making them suitable especially for data-sparse or ungauged basins. Because of the indirect nature of the precipitation estimates from satellites, the products are subject to a variety of potential errors (Brocca et al., 2014; Moazami et al., 2014; Koster et al., 2016; Sun et al., 2018; Salmani-Dehaghi and Samani, 2019). The satellite estimates are therefore often blended with gauge data (Ashouri et al., 2015; Funk et al., 2015), which also enhances their usefulness in areas with insufficient gauge coverage.
Besides these direct and indirect measurements of precipitation, there are also modelling approaches such as reanalyses, which assimilate meteorological observations from various sources (for example, ground-based stations, ships, airplanes, and satellites; Parker, 2016) into forecasts from numerical weather prediction models to infer precipitation estimates (e.g., Saha et al., 2010, 2014; Dee et al., 2011; Kobayashi et al., 2015; Copernicus Climate Change Service, 2017). To alleviate the major biases in reanalyses, which arise because they make no direct use of gauge information, they are sometimes combined with bias adjustment techniques (Weedon et al., 2011; Berg et al., 2018) and often merged with other data products such as satellite remote sensing and interpolated gauge datasets (Beck et al., 2017a, 2017b). In regions with poor station coverage, satellite rainfall estimates and reanalysis products may compensate for the lack of gauge stations, provided that the underlying models are meaningfully calibrated. This is particularly challenging in mountainous regions (Dinku et al., 2008; Hu et al., 2016; Alijanian et al., 2017; Beck et al., 2017a).
This growing number of precipitation datasets derived through various approaches implies a need for a comparative performance assessment. In this study, we compare the performance of various precipitation datasets derived from multiple sources against gauge measurements over Karun basin, southwestern Iran. The outcome can inform dataset developers about respective strengths and weaknesses, and also provide guidance to users for their choice from the variety of state-of-the-art products. This is especially important in relatively data-sparse regions such as Iran. Although there have been some studies on the evaluation of precipitation products over Iran's climatic zones (Moazami et al., 2014; Katiraie-Boroujerdy et al., 2013, 2019; Ghajarnia et al., 2015; Sharifi et al., 2016; Khodadoust Siuki et al., 2017; Alijanian et al., 2017, 2019; Hosseini-Moghari et al., 2018; Dezfooli et al., 2018; Saeidizand et al., 2018), comprehensive evaluations of reanalysis, satellite-based, and interpolated precipitation data are lacking.
Sections 2 and 3 introduce the study area and the considered precipitation datasets. Section 4 illustrates the methodology, and Section 5 presents results and discussion. Finally, in Section 6 the conclusions of this study are presented.

| STUDY AREA
Karun basin, with an area of 65,230 km², is one of the largest basins in Iran (Figure 1), hosting the Karun River with an average annual discharge rate of 575 m³·s⁻¹. Variation in topography is significant over the basin; surface elevation varies from zero at the Persian Gulf coast to 4,400 m over the Zagros mountain chains. The basin encompasses various climate zones; this climate variability is controlled by geographical latitude, proximity to the Persian Gulf, and elevation. Average annual precipitation over the basin is about 632 mm, with a large spatial variability illustrated by values ranging from 153 mm in southern plain regions to >2000 mm in mountainous regions. Daily temperature varies over the basin from a minimum of −30.6°C at Koohrang station to a maximum of 52.2°C at Ahvaz station.

| DATA
In this section, we introduce the ground-truth reference data (Section 3.1) which we use to validate a multitude of established gridded state-of-the-art precipitation datasets (Section 3.2).

| Reference data
In this study, in situ data from rain gauges operated by IRIMO (Islamic Republic of Iran Meteorological Organization) are utilized as a ground reference for evaluating the selected gridded precipitation products (Figure 1). The dataset combines two sources, 3-hourly synoptic stations and daily rain gauges, represented by triangles and diamonds, respectively. It should be noted that the synoptic stations are more reliable because their observation process is less prone to human error. As shown in Figure S1, the distribution of gauges across altitudes matches that of the grid cells which cover Karun basin. No statistical post-processing has been applied to the gauge measurements. We focus on the time period 2000-2015 in this study since, according to the plotted station time series, the overall availability of the station measurements is highest in these years. As most gridded datasets come with a monthly resolution, we derive monthly estimates of the gauge data by accumulating daily values.
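The accumulation of daily gauge values into monthly totals can be sketched as follows. This is a minimal illustration, not the authors' processing code: the function name `monthly_totals` and the rule that a month with any missing day is flagged as missing are assumptions, since the text does not describe how gaps were handled.

```python
import calendar
from collections import defaultdict
from datetime import date


def monthly_totals(daily):
    """Accumulate daily precipitation (mm) into monthly totals.

    daily: dict mapping datetime.date -> daily precipitation in mm,
           or None where the observation is missing.
    Returns a dict mapping (year, month) -> monthly total in mm.
    A month is set to None unless every calendar day has a value
    (hypothetical completeness rule, not taken from the paper).
    """
    sums = defaultdict(float)
    valid_days = defaultdict(int)
    months = set()
    for day, value in daily.items():
        key = (day.year, day.month)
        months.add(key)
        if value is not None:
            sums[key] += value
            valid_days[key] += 1

    totals = {}
    for year, month in sorted(months):
        days_in_month = calendar.monthrange(year, month)[1]
        complete = valid_days[(year, month)] == days_in_month
        totals[(year, month)] = sums[(year, month)] if complete else None
    return totals
```

A stricter or more lenient completeness threshold would change how many station-months enter the evaluation, so this choice should be reported alongside any results.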

| Gridded data products for evaluation
The gridded products are grouped into three categories: entirely gauge-based datasets, merged datasets using gauge data among other sources, and reanalysis datasets not making any direct use of gauge information. A summary of all individual datasets and their respective characteristics is shown in Table 1.

| Interpolated gauge data
The four interpolated gauge datasets evaluated in this study are GPCCv8 (Global Precipitation Climatology Centre; Schneider et al., 2014, 2018), CPC-Unified (Climate Prediction Center Unified; Xie et al., 2007), PREC/L (PRECipitation REConstruction over Land; Chen et al., 2002), and CRU TS4.01 (Climatic Research Unit; Harris et al., 2014). They share a spatial resolution of 0.5°, except for GPCCv8, which provides 0.25°. These products provide gridded gauge analyses derived from quality-controlled station data at a daily (CPC-Unified) or monthly (GPCCv8, PREC/L and CRU TS4.01) temporal resolution. PREC/L is based on an advanced method of optimal interpolation (OI) and is derived from gauge observations from over 17,000 stations collected in the Global Historical Climatology Network version 2 and the Climate Anomaly Monitoring System datasets (Chen et al., 2002). CPC-Unified is derived by combining all information sources available at CPC, comprising 16,000 quality-controlled daily stations, and by taking advantage of the OI objective analysis technique. GPCC employs an extraordinarily large number of gauges, around 85,000 stations (Schneider et al., 2014, 2018). Further, it provides the number of gauges used for each grid cell, and an uncertainty estimate deduced from ordinary kriging (Yamamoto, 2000). CRU TS4.01 provides precipitation and other meteorological variables from 1901 to near-present, including over 4,000 individual weather station records (Harris et al., 2014).

| Merged multi-source data
We use three high-resolution merged products, namely CHIRPS2.0 (Climate Hazards group Infrared Precipitation with Stations), PERSIANN-CDR (Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks-Climate Data Record), and SM2RAIN-CCI (SM2RAIN-Climate Change Initiative, published in July 2015). CHIRPS is a quasi-global rainfall dataset, spanning 50°S-50°N. It extends from 1981 to near-present and blends 0.05° resolution satellite imagery with in situ station data to create gridded rainfall time series. PERSIANN-CDR provides daily rainfall estimates at a spatial resolution of 0.25° over the latitude band of 60°S-60°N from 1983 to the near-present. PERSIANN-CDR is produced from the PERSIANN algorithm using GridSat-B1 infrared satellite data, and the training of the artificial neural network is done using NCEP stage IV radar data. The biases of PERSIANN-CDR are adjusted using 2.5° monthly GPCP (Global Precipitation Climatology Project) precipitation data (Adler et al., 2003; Ashouri et al., 2015). Brocca et al. (2014) developed a global scale rainfall product, SM2RAIN, by translating soil moisture obtained from satellite soil moisture data into precipitation estimates. Recently, the SM2RAIN method has been applied to the satellite-derived ESA CCI soil moisture product with a spatial resolution of 0.25° for the period of 1998-2015 (Ciabatta et al., 2018). The algorithm was calibrated against the Global Precipitation Climatology Centre Full-Data daily dataset (Ciabatta et al., 2018). Note that even though these datasets rely primarily on satellite-derived information, they also indirectly employ gauge measurements for bias adjustment (PERSIANN-CDR) or algorithm calibration (SM2RAIN-CCI).
FIGURE 1 Topographical map of the Karun basin in Iran. The location of precipitation gauge stations utilized in this study is marked with purple diamonds (daily rain gauges) and triangles (synoptic stations)
Further, we evaluate two datasets, HydroGFD2.0 (Hydrological Global Forcing Data) and MSWEP V2 (Multi-Source Weighted-Ensemble Precipitation), which are produced by merging gauge, satellite and reanalysis data. HydroGFD2.0 is methodologically based on Weedon et al. (2011), and is produced in near real-time (Berg et al., 2018; Arheimer et al., 2019). The HydroGFD2.0 dataset covers the period 1979 to present at a daily time scale and with a spatial resolution of 0.5°. Additionally, we use the MSWEP V2 dataset from Beck et al. (2019). It provides data with high spatial (0.1°) and temporal (3-hourly) resolution, and is computed by merging precipitation estimates based on gauge, satellite, and reanalysis data. In addition, frequency and systematic biases in the precipitation data were corrected (Beck et al., 2019).

| Reanalysis data
The reanalyses evaluated in this study are CFSR, JRA-55, ERA-Interim, and ERA5. CFSR is described by Saha et al. (2010, 2012). JRA-55 covers a period of more than 55 years from 1959 with a spatial resolution of 0.5625° and 3-hourly time steps. It is based on four-dimensional variational data assimilation (4D-Var) with Variational Bias Correction (VarBC) for satellite radiances (Kobayashi et al., 2015). ERA-Interim starts in 1979 and is continuously updated, providing 3-hourly data with a spatial resolution of 0.75°. The data assimilation system used to produce ERA-Interim is based on a 2006 release of the integrated forecasting system (IFS), including 4D-Var with a 12-hr analysis window (Dee et al., 2011). ERA5 is the successor to ERA-Interim. Using the latest IFS version, it provides hourly data on many atmospheric, land-surface and sea-state parameters together with estimates of uncertainty on a 30-km grid from 1979 to near-present (Copernicus Climate Change Service, 2017).

| METHODOLOGY
Before comparing the various precipitation products, they are re-gridded to a common 0.5° spatial resolution where necessary. This is done using the Climate Data Operators (CDO; Schulzweida, 2019), namely through conservative remapping, which preserves the water mass (Jones, 1999). This method is widely used for remapping precipitation datasets (Chen and Knutson, 2008; Nikulin et al., 2012). To examine the effect of this re-gridding, we also compute our analyses with the native resolutions of the precipitation datasets.
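To illustrate why conservative remapping preserves water mass, the sketch below aggregates a fine regular latitude-longitude grid to a coarser one by area-weighted averaging. It is a simplified stand-in under stated assumptions and not the CDO implementation: the function name `coarsen_conservative` is hypothetical, the coarse cells are assumed to be exact integer multiples of the fine cells, and cell area is approximated by the cosine of the cell-centre latitude. CDO's conservative remapping handles arbitrary overlapping grids.

```python
import math


def coarsen_conservative(field, lats, factor):
    """Area-weighted block average of a lat-major 2D field.

    field:  list of rows, one row per latitude band (e.g. mm per month)
    lats:   centre latitude in degrees of each row of `field`
    factor: number of fine cells per coarse cell along each axis
    """
    n_lat, n_lon = len(field), len(field[0])
    if n_lat % factor or n_lon % factor:
        raise ValueError("grid size must be a multiple of factor")

    # Relative cell area on a regular lat-lon grid ~ cos(latitude).
    weights = [math.cos(math.radians(lat)) for lat in lats]

    coarse = []
    for i0 in range(0, n_lat, factor):
        row = []
        for j0 in range(0, n_lon, factor):
            weighted_sum = 0.0
            weight_total = 0.0
            for i in range(i0, i0 + factor):
                for j in range(j0, j0 + factor):
                    weighted_sum += weights[i] * field[i][j]
                    weight_total += weights[i]
            row.append(weighted_sum / weight_total)
        coarse.append(row)
    return coarse
```

Because each coarse value is the area-weighted mean of the fine cells it contains, the area-integrated precipitation is identical before and after aggregation, which is the property the text refers to as preserving the water mass.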
Gauge measurements within each grid box of the respective datasets were averaged arithmetically to produce representative time series, and to form a reference for the evaluation of the considered precipitation datasets. To assess the performance, monthly time series were computed for the gauges (G) and the datasets (M). Based on these time series, six measures were calculated for each grid cell of the gridded products: mean and maximum values, spatial and temporal correlation coefficient (CC), relative error (RE), and absolute error (AE). In these measures, i, n, and the overbar indicate time, the number of months, and an average over time, respectively. Further, P denotes monthly precipitation data from gauge measurements (G) or the considered datasets (M). After calculating these measures, the average of all grid cell values is taken to obtain a single representative value for each metric and each dataset over the basin. In addition to temporal correlation, we also infer spatial correlations of observed versus modelled grid cell averages.
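The per-cell measures can be sketched as below. The exact equations are not reproduced in this copy of the text, so the forms used here are common textbook definitions and should be read as assumptions: CC is the Pearson correlation, RE is the summed difference relative to the summed gauge precipitation in percent, and AE is the mean difference in mm per month (kept signed so that negative values indicate underestimation). All function names are hypothetical.

```python
import math


def cell_mean_series(gauge_series):
    """Arithmetic mean across the gauges of one grid cell, month by month."""
    n_months = len(gauge_series[0])
    n_gauges = len(gauge_series)
    return [sum(g[i] for g in gauge_series) / n_gauges for i in range(n_months)]


def temporal_cc(p_g, p_m):
    """Pearson correlation between gauge (G) and dataset (M) monthly series."""
    n = len(p_g)
    mean_g = sum(p_g) / n
    mean_m = sum(p_m) / n
    cov = sum((g - mean_g) * (m - mean_m) for g, m in zip(p_g, p_m))
    var_g = sum((g - mean_g) ** 2 for g in p_g)
    var_m = sum((m - mean_m) ** 2 for m in p_m)
    return cov / math.sqrt(var_g * var_m)


def relative_error(p_g, p_m):
    """Assumed RE: total difference relative to total gauge precipitation (%)."""
    return 100.0 * sum(m - g for g, m in zip(p_g, p_m)) / sum(p_g)


def absolute_error(p_g, p_m):
    """Assumed AE: mean difference M - G in mm per month (signed)."""
    return sum(m - g for g, m in zip(p_g, p_m)) / len(p_g)
```

Basin-wide values would then follow by averaging each measure over all grid cells that contain at least one gauge, as described in the text.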
We also examine the accuracy of products in different climate regimes. These are characterized by long-term average aridity and temperature, and computed from ERA-Interim data for each grid cell. Thereby, aridity is computed as the ratio of mean annual net radiation to mean annual precipitation, converted to the same units by normalization with the latent heat of vaporization (Budyko, 1974; Orth and Destouni, 2018). Figure S2 indicates the basin climate determined by aridity values in each grid cell along with the respective stations' temperature.

| RESULTS AND DISCUSSION
Figure 2 illustrates mean monthly precipitation averages across the basin. It shows that the main differences between the datasets occur during the regional wet season (November to April) while all datasets capture the summer dry season. Most datasets, except for the reanalyses and HydroGFD2.0, underestimate precipitation during the wet season. GPCCv8, reanalyses, and merged products generally demonstrate seasonal variability of precipitation better than the other interpolated gauge products. Underestimation in the datasets incorporating satellite estimates during winter might be due to a systematic bias related to snow-covered surfaces (Gebregiorgis et al., 2017); this is despite the fact that they are calibrated with gauge observations. Moreover, such underestimation in SM2RAIN-CCI might be further due to surface soil moisture saturation (Brocca et al., 2013). We repeat this analysis for low, medium, and high elevation grid cells (Figure S3). The overall similar results indicate no or little importance of elevation for the seasonal performance patterns.

| All-basin summary evaluation of precipitation products
Results for the statistical analysis are presented as basin averages in Table 2. In terms of mean values, ERA-Interim, CFSR, and GPCCv8 agree best with the observed gauge values, while reanalyses perform generally well. The spread is large especially between the interpolated datasets, which is likely due to the different selection of gauges included in each individual dataset. Even though HydroGFD2.0 uses monthly anomalies from CPC-Unified, we find different mean values here because the climatology is derived from CHPclim, which includes more stations (Funk et al., 2015).
As seen in Figure 2, most products underestimate precipitation over the basin. The underestimation of precipitation in CHIRPS2.0 could stem from the low density of employed stations. The underestimation found in PERSIANN-CDR might be related to the bias-adjustment, which is based on the GPCP dataset with a rather coarse 2.5° resolution. Such underestimation over mountainous regions in Iran has also been reported for other precipitation datasets which include satellite estimates (Moazami et al., 2016; Alijanian et al., 2017; Katiraie-Boroujerdy et al., 2017). It might be associated with infrared sensors having difficulties detecting warm cloud processes, especially over mountains (Dinku et al., 2018; O and Kirstetter, 2018). The underestimation in SM2RAIN-CCI could be related to the fact that once the surface soil is saturated after intense or long-lasting precipitation events, no additional increase is possible such that no additional precipitation can be deduced from the soil moisture data (Brocca et al., 2013). According to Alijanian et al. (2017), MSWEP V1 highly overestimated precipitation values over different climatic zones of Iran, while this study interestingly indicates contrary results for MSWEP V2. This might be related to a reduction of the bias correction factor for Iran due to suspected issues with observed runoff data (Beck et al., 2019).
FIGURE 2 Comparison of mean monthly precipitation (mm per month) for different datasets (symbols) across all grid cells of Karun basin (2000-2015). Red lines denote datasets which did not employ gauge data, while blue lines indicate datasets which use some gauge records, and turquoise lines refer to datasets which extensively used gauge observations
TABLE 2 Statistical measures of the evaluated monthly datasets over Karun basin (2000-2015)
Furthermore, for maximum precipitation, HydroGFD2.0, GPCCv8, and ERA5 agree best with in situ measurements whereas the other interpolated datasets, as well as MSWEP V2 and SM2RAIN-CCI, show large underestimations. Considering the mean relative error over Karun basin, the CFSR reanalysis shows the best agreement with gauge data, followed by GPCCv8. Merged datasets outperform some reanalyses, which may be attributed to a large overestimation of these reanalyses at low-altitude regions (Figure S3).
FIGURE 3 Mean precipitation (mm per month) over the basin for the 2000-2015 period. Background colours indicate precipitation of the respective gridded product while the circles denote observations
The spatial correlation results indicate the best agreement of GPCCv8 and CFSR with gauge data. The other reanalysis datasets (except JRA-55) show similarly high correlations, whereas MSWEP V2 and PREC/L show particularly poor performance. We also explore the performance of the products in their original spatial resolution (shown in parentheses in Table 2). Overall, we find a minor role of the spatial resolution, except for spatial correlation, where results seem to be more sensitive.

| Spatial evaluation of precipitation datasets
In addition to the all-basin evaluation of the considered datasets in the previous section, in this section we analyse the spatial variability of datasets versus gauge measurements. Figure 3 presents dataset versus gauge-based mean precipitation across the Karun basin. The above-mentioned underestimation mostly occurs in mountainous regions. In general, reanalysis datasets along with GPCCv8 and HydroGFD2.0 agree best with gauge-derived rainfall patterns over the basin, confirming results from Table 2. Aside from the overall biases, we find that MSWEP V2 and PREC/L have the most difficulty in capturing the spatial precipitation patterns, including the contrast between low areas and the mountains (relative errors of the products at each station are presented in Figure S4).
FIGURE 4 Temporal correlation between different global precipitation datasets and local gauge data
In Figure 4, we analyse the spatial pattern of the temporal correlations evaluated in Table 2. We find overall highest correlations for GPCCv8 and MSWEP V2. For the latter, this was reported earlier by Alijanian et al. (2017), who attributed this fact to either applying high weights to gauge observations in MSWEP, or the inclusion of reanalysis precipitation data. Most datasets show high spatial variability in the correlation results across the basin, except for GPCCv8 and MSWEP V2 where correlation is generally high, and for PREC/L where correlation is low everywhere. Figure 5 illustrates the absolute error of the considered precipitation datasets with respect to climate in the considered grid cells. Further, the number of gauges located within each grid cell is reflected by the size of the points; smallest circles indicate one gauge per grid cell, while largest circles refer to nine gauges. Performance of the datasets in each grid cell is shown through colour coding. We find no systematic difference in the results indicated by small versus large circles. This suggests that the different number of gauges in each grid cell has only a negligible impact on our conclusions.
FIGURE 5 Bias values (mm per month) of datasets and the number of gauges within each grid cell (size of the circles) in the different climatic zones of the basin (circles with no colour indicate missing data)

| Performance in different climates and elevations
The previously mentioned underestimation in CHIRPS2.0, PERSIANN-CDR and interpolated gauge (CPC-Unified and CRU TS4.01) datasets, as well as MSWEP V2, is mostly found in regions with comparatively cold and wet climates, that is, regions with elevations above 2,500 m as opposed to the low-altitude surfaces. GPCCv8 and ERA5 generally show more similar performance across the different climates. Most datasets agree better with gauge observations in dry and warm climate. This underlines the importance of considering the orographic and geographic effects on the quality of precipitation data (Hu et al., 2016). Similar results are found for relative errors of the datasets (Figure S5), while the climate sensitivity of correlation performance interestingly is overall lower (Figure S6).
FIGURE 6 (a) Absolute error, (b) relative error and (c) temporal correlation as a function of elevation. Symbols indicate the elevations of the grid cells. Lines are smoothed with a LOWESS filter
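The climate regimes discussed above are characterised in the methodology by aridity, the ratio of mean annual net radiation to mean annual precipitation after conversion to common units with the latent heat of vaporization. A minimal sketch of that unit conversion follows. The function name, the input units (net radiation as a mean flux in W m⁻², precipitation in mm per year), and the constant values are assumptions for illustration; the reanalysis fields used in the study may be stored in different units.

```python
SECONDS_PER_YEAR = 365.25 * 86400.0  # mean year length in seconds
LATENT_HEAT = 2.45e6                 # J per kg of water, approximate value


def aridity_index(net_radiation_w_m2, precip_mm_per_year,
                  latent_heat=LATENT_HEAT):
    """Ratio of net radiation (as water-equivalent depth) to precipitation.

    Annual energy (J m^-2) divided by the latent heat (J kg^-1) gives
    kg m^-2 per year, which equals mm of water per year, so the ratio
    to precipitation in mm per year is dimensionless.
    """
    energy_per_year = net_radiation_w_m2 * SECONDS_PER_YEAR   # J m^-2 yr^-1
    evaporative_equivalent = energy_per_year / latent_heat    # mm yr^-1
    return evaporative_equivalent / precip_mm_per_year
```

In the Budyko framework, values above 1 indicate water-limited (drier) conditions and values below 1 indicate energy-limited (wetter) conditions.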
The elevation effect on the agreement between gauge data and precipitation products is analysed in Figure 6, using each grid cell's elevation according to a 0.5° digital elevation map (Amante and Eakins, 2009). Overall, the performance of the considered products varies similarly with altitude. Reanalysis datasets tend to overestimate precipitation at low altitudes while they show no or little bias at higher altitudes, despite the challenging topographic complexity associated with high elevation regions (Isotta et al., 2015). All other datasets show largest biases at medium altitudes and slightly improved results for the lowest and highest elevation regions. Interestingly, the temporal correlation analysis (Figure 6c) reveals opposite results; highest correlations are found at medium altitudes and performances are degraded in the lowest and highest grid cells. These performance differences are controlled by altitude-dependent (a) topographic complexity and (b) climate (Figure S7).
Figure 7 illustrates which dataset agrees best with gauge observations in each grid cell and climate. For this purpose, a ranking of datasets is computed in each grid cell according to (a) temporal correlation and (b) absolute error. Then, the dataset with the lowest sum of the two ranks is selected as the best dataset for a particular grid cell. GPCCv8 performs best in 15 grid cells, thereby clearly outperforming all other datasets. Among the remaining datasets, CFSR and ERA5 stand out with the best agreement against gauge data in three grid cells each. The outstanding performance of GPCCv8 in this context is probably due to the fact that this dataset employs some of the gauge data used as reference in this study. While this is true for other (interpolated) datasets, they probably could use the gauge information more efficiently in their derivation procedure.

| CONCLUSIONS
Known discrepancies between state-of-the-art precipitation datasets have motivated us to assess the accuracy of a great variety of state-of-the-art precipitation datasets against gauge observations in a comparatively observation-sparse region, Karun basin in Iran. Thereby, we analysed the spatiotemporal variability and find characteristic strengths and weaknesses of each dataset, while no single dataset is superior in all respects.
The overall best agreement with observations was found for the GPCCv8 dataset. However, GPCCv8 is likely biased towards better results as it might include a large part of the gauge data used for the evaluation. Therefore, the result rather points out that a comprehensive gauge selection is most important for the quality of any large-scale precipitation dataset.
While merged products include gauge data for calibration, reanalyses are independent of the gauge data since they do not assimilate surface gauge precipitation. Among the latter, ERA5, ERA-Interim, and CFSR outperform JRA-55 over Karun basin. Given this rather independent nature of the reanalyses, the results do show value in such data, particularly in regions where no gauge data are available.
Comparing older, established datasets with more recent products, we find mixed results. While ERA5 shows overall improved agreement with gauge measurements over ERA-Interim, MSWEP V2 shows significant biases, albeit with the opposite sign to the previous version (Alijanian et al., 2017). Further, MSWEP V2 and HydroGFD2.0, the most recent of the considered products, do not outperform previously released datasets.
FIGURE 7 Datasets showing best agreement with gauge data in each grid cell with respect to correlation and absolute error. The number of gauges within each grid cell is denoted by the size of the points
A caveat for our analysis is potential errors in gauge measurements. Especially at higher altitudes, precipitation under-catch in such measurements is a known issue (Mekonnen et al., 2015). Therefore, the underestimations we find in most considered datasets are likely even more considerable than shown here. In addition, it should be noted that the gauge measurements used in this study have (partly) been employed in the derivation of (some of) the considered datasets. Despite this, we find no consistently improved agreement between the gauge measurements and datasets that use more of them versus datasets that do not use them.
The identification of such important shortcomings of state-of-the-art datasets highlights potential avenues for future development. This way, the present study contributes to more reliable, high-quality precipitation datasets which are key to hydro-climatological monitoring and modelling, especially given the potential increase of related extremes in the context of climate change.