A cross‐checked global monthly weather station database for precipitation covering the period 1901–2010

Comprehensive monthly weather station databases are the foundation for many gridded climate data products, and they are widely used to characterize regional climate conditions, track climate change and research the impact of climate on natural and managed ecosystems. However, weather station databases are often regional in coverage, and they can have extensive gaps in station coverage over time. They may also contain errors in climate records, station coordinates or elevation. Here, we assemble a comprehensive monthly weather station database for precipitation from multiple reputable data sources. We use digital elevation models and nearby stations to search for inconsistencies in reported station locations and recorded precipitation values. We also estimated missing values in weather station time series using a linear model approach based on interpolated anomaly surfaces. The resulting station records were ranked into ten classes, according to the completeness of records, the reliability of missing value estimations and other criteria. We corrected incomplete or erroneous location and elevation information for 12% of all available station records. A total of 23% of monthly records that had missing values could be estimated with high or moderate confidence. We sub‐sampled our global database of more than 80,000 stations with various spatial filters, so that only the highest quality station for a given area was retained. Our contribution significantly enhances global data coverage compared to individual databases currently available. Even when accepting only the stations within the top two quality ranks in our combined database, and applying the coarsest spatial filter of one station per approximately 1,600 km2, the remaining station count of more than 20,000 stations exceeds the largest alternative database (without a spatial filter applied) by more than 50%.


| INTRODUCTION
Comprehensive monthly weather station databases are the foundation for characterizing regional climate conditions, and for tracking climate change over time. For the purpose of climatic characterizations, monthly summaries represent a good compromise between capturing seasonal climate variation without having to manage large amounts of daily weather data. Once these monthly weather summaries have been recorded for 30 years, also referred to as a climate normal period, calculating an average allows inference of long-term expectations of climate conditions that is not usually biased by cyclical or random anomalies (Guttman, 1989;Arguez and Vose, 2011). With a sufficient density of weather station data for a region, interpolation methods can be used to derive grids of baseline climate data for complex landscapes, modelling various climate phenomena, such as changes in temperature along elevation gradients, orographic precipitation and rain shadows (Hutchinson, 1995;Daly et al., 2002). Once the baseline climatology of a region has been established, additional questions can be addressed with monthly time series records, such as how the climate has changed in the past, or how the climatology of a region may change in the future (Sáenz-Romero et al., 2010;Ramirez-Villegas et al., 2013).
Many gridded climate data products that are widely used to research the impact of climate variability and climate change on natural and managed ecosystems rely on monthly weather station databases. For the United States, the Parameter-elevation Relationships on Independent Slopes Model (PRISM) is a well-regarded database of gridded climate that benefits from the extensive network of weather stations available for this region. Gridded climate products with global coverage also include the Climate Research Unit Time Series (CRU-TS) database from the University of East Anglia (Harris et al., 2014), a gridded database from the University of Delaware (Willmott and Matsuura, 1995), the Global Precipitation Climatology Centre (GPCC) product (Becker et al., 2013) or the Precipitation REConstruction Land (PRECL) database from NOAA (Chen et al., 2002). These databases with monthly historical resolution are limited to low spatial resolutions (0.5 degree or coarser). Alternative products, with high spatial resolutions (30 arc-seconds or approximately 1 km), are usually restricted to 30-year normal summaries and provide no interannual historical data, for example WorldClim (Hijmans et al., 2005).
For researchers that develop climate grids, there are a number of important challenges in assembling the required regional or global weather station databases. First, the placement of the weather stations is usually biased towards population centres or agricultural lands, whereas climate conditions of mountainous or desert areas are usually not well documented (New et al., 1999;Menne et al., 2012). Another important limitation of weather station data is temporal coverage. Before the 1950s, the density of weather stations tends to be low, reaching its highest global density around the 1970s before declining again (Menne et al., 2012). Additionally, many of these stations were operational only for a few years, with extensive gaps in the records or only operated seasonally, especially in mountainous regions. Finally, it is not uncommon to encounter errors in recorded climate values, errors of unit conversions in countries using the Imperial system and mistakes associated with locations, such as inaccuracies in the reported coordinates or elevation. Before the widespread use of global positioning systems, coordinates were typically recorded to the nearest minute, implying a location error of hundreds of metres, which can be problematic on mountainous terrain where the elevation and topographic gradients are an important determinant of the weather patterns.
In order to support researchers that rely on monthly weather station databases to develop interpolated grids or other climate data products, we assemble and cross-check a monthly weather station database for precipitation that combines several regional and global databases that are publicly available, including the Global Historic Climate Network (GHCN) v2 database (Lawrimore et al., 2011;Menne et al., 2012), the station database corresponding to the Climate Research Unit (CRU) Time Series v3.21 interpolated data set (Harris et al., 2014), the WMO-CLINO database of the World Meteorological Organization (WMO, 1996), the FAOCLIM 2.0 database of the Food and Agriculture Organization of the United Nations (Bogaert et al., 1995), the R-HydroNET database of the Regional Hydrometeorological Data Network for Latin America (Vorosmarty et al., 1998), the European Climate Assessment (ECA) database (Tank et al., 2002;Besselaar et al., 2015) and the United States Forest Service (USFS) database (Rehfeldt, 2006). We use duplicate entries, digital elevation models and nearby stations to search for inconsistencies in recorded climate values in weather station records that may be due to unit conversion errors, location and elevation inaccuracies.
This study provides a consolidated database of more than 80,000 weather stations (without duplicates) ranked into ten quality classes, according to the completeness of records, the reliability of missing value estimates and other criteria. Specifically, we contribute the following corrections and enhancements for users of monthly precipitation databases: (1) when errors could be corrected without ambiguity, corrections were made and indicated by flags in the database. Alternatively, the records were flagged with the lowest quality score for removal; (2) we estimate missing values in monthly station time series records with a linear model approach based on correlations with global interpolated anomaly surfaces, where deviations of monthly weather station values from 1961 to 1990 normal climate were interpolated. The reliability of estimated values was documented with model fit statistics and quality scores. Missing value estimates are primarily provided to estimate adjusted long-term averages (e.g. 30-year climate normals); (3) lastly, we provide subsamples of this database that select the best station records for pre-determined three-dimensional filters (with different intervals for latitude, longitude, and elevation). The sub-sampled data sets retain only the most reliable and complete station records for a given area, with global coverage and with as little spatial sample bias as possible, which are meant for generating interpolated data products.

| Databases used
The public weather station databases that were used in this study (Table 1) have already been subjected to rigorous quality control methodology. The CRU and GHCN databases have been screened for duplicate records, outliers, tests for violation of logical or physical relations between variables (Tmax < Tmin), unrealistic peaks or dips in time series, spatial consistency tests by comparing with surrounding stations, etc. (New et al., 1999;Durre et al., 2010). We use both the daily and monthly versions of the GHCN database v2, with the daily database containing additional records of stations with shorter time series and more gaps in the record. The FAOCLIM 2 database from FAO was established in the 1980s to evaluate the global agricultural production potential in developing countries and provides additional regional coverage in Central America, agricultural areas of South America and the Sahel zone of Africa. The ECA monthly database provides good additional coverage for mountainous regions in Europe, such as the Alps, the Carpathians, the Balkans, the Caucasus and the Scandinavian mountains. The R-HydroNET database for South America provides useful additions for Amazonian precipitation data, and the USFS database has excellent additional coverage for mountainous regions in North America, including the United States, Canada and Mexico.
After removal of duplicates from the combined weather station database (as described below), temporal coverage used in this study extends from the beginning of the last century to 2010, reaching their highest spatial density from the 1960s to the 1990s for most regions of the world (Figure 1).
The drop of station coverage in recent years is partially due to several databases not including recent records (Figure 1). Excellent temporal and spatial coverage for the 1961-1991 period is one reason why baseline grids are often developed for this 1961-1990 normal period (New et al., 1999;Menne et al., 2012). Another reason why 1961-1990 normal period is a useful reference period is that it largely precedes anthropogenic climate change (Tett et al., 1999;Lawrimore et al., 2011) and can therefore be used as a reference period when future climate projections are expressed as an anomaly (e.g. +2°C warming relative to a reference period). In the database that we develop in this study, we rank weather stations higher that have complete records for this 1961-1990 normal period.

| Elevation match with DEM
As a first check of station records, we compared the reported elevation for each weather station against a digital elevation model (DEM) of 1 km resolution (Gesch et al., 1999).
Missing elevation values in weather station records, usually indicated by flags in the elevation field, such as −9999, −999, −99 or 9,999 were replaced with the DEM value. Further, we recorded the difference between the reported elevation value and the DEM, and performed a more detailed inspection of any station that had a difference exceeding ±250 m. We checked those stations for potential errors of unit conversions, potential errors of location that may have led to an elevation discrepancy or implausible elevation values given the topography in the vicinity of the recorded station. In case of discrepancies between the DEM and reported elevation values, we normally accept the reported elevation of the weather station in mountainous terrain. Here, uncertainties in the location of weather station, for example if reported to the nearest minute, can usually explain a discrepancy with the DEM. In flat terrain, elevation differences exceeding ±250 m could usually be explained by unit conversions or other errors, and the reported elevation was replaced with the DEM value. Determining the correct value for weather stations is important, because the elevation value is normally used as a covariate in any climate modelling or interpolation effort.

| Outlier detection and missing value estimation
A useful check of weather station records is to determine consistency of records with other nearby stations to detect recording errors or unit conversion issues. In this study, we used two approaches. First, all stations were checked against one or several nearby stations within a 10 km distance (increasing the radius in 10 km increments if no station was found). Differences in monthly records were calculated, and then, the stations were sorted based on the maximum difference per station pair. A histogram of the differences revealed a bimodal distribution, with far outliers exceeding a difference value of 1,000 mm. This represented a problem primarily confined to a small percentage of stations from the CRU database and was recorded as a station quality flag.
Second, we compare station values to interpolated grids of monthly anomalies from 1901 to 2010 of the CRU-TS data product (Mitchell and Jones, 2005;Harris et al., 2014). Our criterion for potentially problematic station records was low correlations of station data with the CRU-TS monthly anomalies for the location of the weather station. For this purpose, we calculated correlation coefficients for each month of the year between weather station records and the corresponding CRU grid cell. Stations with low correlation coefficients were flagged as potentially problematic.
If the correlation between CRU-TS data and weather station records were high, we used a simple linear model with a fixed intercept at zero to predict missing values in the temporal record of station data. The weather station record needed to have at least 20 years for the  period and the linear model an R 2 > 0.7 for to be considered for estimating missing values. A less reliable missing value estimation, resulting in a lower quality rank, was based on linear models with R 2 values between 0.5 and 0.7 and at least 27 years of data for the 1901-2010 period. A special case for missing value estimation was stations located in desert areas, where a linear model could not be established due to the majority of monthly precipitation values being zero. For stations with at least 10 years of monthly data for the 1901-2010 period, but located in desert areas, we filled any missing values in the observed weather station data with the corresponding CRU-TS values directly (i.e. not using a linear model). These estimates were flagged as filled and assigned a lower quality score (see next section), allowing users of our database to select various quality criteria to filter the database according to their needs. T A B L E 1 Databases included in this study with statistics describing their spatial and temporal coverage. The databases are ordered by preference, based on documented quality control efforts, accuracy of location information, temporal coverage and overlap with other databases. The latest data used were 2010 as most databases were incomplete beyond this date ( Figure 1) 1961-1990 1961-1990

| Quality criteria
For each station, we assigned a quality score based on the completeness of the station record, the quality of the linear model to estimate missing values and a number of other criteria ( Table 2). The best station quality score (1) was assigned to stations that had at least 90% complete data for the 1961-1990 period, either as monthly time series, or reported as average for the 1961-1990 normal period (i.e. from WMO or R-HydroNET databases). The next best score (2) was assigned to the stations that had at least 66% of the data for the 1961-1990 period complete and where missing values could be estimated with a linear model that had an R 2 of at least 0.7. The following score (3) was assigned to stations with a similar criteria, but between 33% and 66% of the data of the 1961-1990 was complete (i.e. 10 years), plus a total of 25% of the data complete for the 1901-2010 period (i.e. 28 years of data in total), and an R 2 of at least 0.7 for estimation of missing values. The fourth score (4) was given to records that did not report monthly time series, but only 1961-1990 normal period averages (i.e. from WMO and R-HydroNET) and with completeness of annual records between 66% and 90%. The next score (5) was given to station records that did not cover the 1961-1990 period well, but that still contained a substantial time series with at least 25% of the data complete for 1901-2010 time series (i.e. 27 years), and with a total of 90% of data either observed or estimated for 61-90 time series with R 2 > 0.7. Score (6) was assigned to stations with at least 25% of the data complete for 1901-2010 time series (i.e. 27 years and missing values estimable with R 2 > 0.5. Score (7) includes all seasonal stations that covered three to ten months of the year and that otherwise covered at least quality score 6. The score of (8) was applied to entries that did not have monthly time series but only a 1961-1990 normal period average with between 33% and 66% completeness of the data (i.e. applicable to some entries of the WMO database) or data completeness was not reported (applicable to the USFS database). A score of (9) was given to stations that exhibited far outliers in monthly data that were inconsistent with nearby stations. Finally, the score of (10) was given to all remaining stations that did not fulfil any of the above-stated requirements, usually stations with very short time series. The monthly consolidated database contains all entries from our source databases (Table 1), without duplicates removed at this stage. In addition to the original entries, we report the DEM value for the station location, a linear model estimate for any missing value for the 1901-2010 period, the R 2 value for the linear model estimate and a flag that indicates whether the precipitation value was recorded, estimated, filled with CRU estimates for desert stations or not estimable. Additional columns specify per cent of complete records for the 1901-2010 period, the per cent of complete records for the 1961-1990 period, a database quality score according to Table 1, a station quality score according to Table 2 and a combined quality score that ranks database scores within station quality scores (e.g. 23 would indicate a station quality score of 2 for an FAO database entry). Smaller numbers indicate overall higher quality records.

| Duplicate removal and database subsets
Duplicate stations among the databases were common and were removed based on reported weather station IDs and or based on identical latitude and longitude values. Most databases used station IDs derived from those of the World Meteorological Organization. Where possible, we parsed the ID field to generate station IDs that conformed to the WMO format. This occasionally resulted in duplicate station IDs among different databases that were located in different states, countries or continents that used similar but independent ID schemes. To avoid these false-positive duplicate detections, we assigned to each station a global code related to the country, state or province where they were located. The final ID-based duplicate removal retained the station with the highest overall quality score for a given station ID in the same jurisdiction. This step did not remove all duplicates, as some databases did not use the WMO station IDs for some or all of their records.
The second duplicate removal step was location-based. We generated grids of 2.5 arcminutes (~5 km), 5 arcminutes (~10 km), 10 arcminutes (~20 km) and 20 arcminutes (~40 km). In addition, we want to retain stations in the same general area that are located at different elevations. For this purpose, we created elevation intervals of 100 m for each of the above grids. To sub-sample the original database at different spatial densities and to remove any additional duplicates that were missed in the previous step, we retained a single station with the highest overall quality score in each of the three-dimensional grid cells.

| Recorded station elevation versus DEM
From the total consolidated record of 123,000 stations (no duplicates removed), 9% had missing values for elevation indicated by a flag of −9999 or similar. In addition, several databases contained a sizable number of records that had zero recorded for elevation. (0.11% of CRU, 3.5% of GHCN, 0.4% of FAO, 0.05% of ECA, and 1.5 of dGHCN). Not all of these zero values were plausible measurements, for example indicated in blue in Figure 2, typically located across India, Australia and Brazil. Here, zero values were presumably used when the record should indicate missing values. For simplicity, we replaced all zero values with records from the digital elevation model, even when zero values were plausible measurements, that is for the Netherlands and other coastal locations. The rationale for this replacement was that many databases had at least some records, where an elevation value of zero actually indicated a missing value as well. Recorded elevation values of zero that represented a correct measurement were usually located near the coast, where the DEM replacement resulted in a very similar elevation estimate.
For all other records, we screened for substantial discrepancies between the digital elevation model and the recorded station elevation. This yielded a number of stations where conversions from the imperial system to the metric system were either omitted, or applied twice (Figure 3. rows of green points). We found that most databases were affected by this type of error, but to varying degrees (0.7% of CRU, 4.4% of GHCN, 0.6% in FAO, 0.1% of ECA, 0.4% of dGHCN and 0.3% of USFS). Even within local regions, only a subset of stations had these conversion errors. The errors were corrected by either multiplying by 3.281

Score
Data requirements for station quality score Stations with a quality score of at least 6, but that contained individual monthly observations that were identified as a far outlier and as inconsistent with nearby stations.

10
All remaining stations that did not meet at least quality criteria 8, usually containing short time series or high numbers of missing values. or dividing by 0.305. We only carried out the corrections for countries that in some point in their history used the Imperial system, and where these errors were almost exclusively located ( Figure 2). For stations with elevation conversion issues, we also checked if precipitation conversions may have been incorrect, but we could not find instances where precipitation may have plausibly been incorrectly converted from inches to millimetres. Lastly, other stations with large elevation discrepancies were flagged, but retained unchanged. These stations are usually located in mountainous regions (Figure 2. red circles), and the recorded elevation is likely a more reliable indicator of the true elevation of the weather station than the DEM value for these locations. A large elevation difference was therefore not incorporated into the quality score, but is reported as a separate elevation difference statistic for each station.

| Missing value estimation
For the purpose of calculating long-term climate averages, we provide missing value estimations that may be used in lieu of accepting a certain number of missing values in estimation of climate normals. The missing value estimation relies on a linear model with interpolated CRU-TS anomaly grids that have a coarse resolution (30 arcminutes), but nevertheless often yield strong correlations with recorded weather station data. The interpolated grids allow missing value estimation because of spatial interpolation from nearby stations that have records for the missing target value. While the correlations of station values with the monthly interpolated grids are often quite high and suitable for prediction (Figure 4a), the relationship is also often biased with a slope considerably deviating from the diagonal (e.g. Figure 4c). A moderate proportion of missing values could be estimated based on a linear model with R 2 values between 0.5 and 0.7 (Figure 4b, and Table 3).

| Final spatially sub-sampled databases
Spatial and elevational sub-sampling by elevation intervals and various grid sizes (2.5, 5, 10, and 20 arcminutes) is meant to provide users with databases where local duplications (nearby stations) are removed, retaining only a single station with the  highest overall quality for a given grid cell. As the grid size for subsamples increases, we therefore retain slightly higher proportions of high-quality stations while the overall database size decreases (Table 4). The WMO and CRU databases have the highest proportion of high-quality station records, but their database size is relatively small compared to our combined and sub-sampled databases. Even when sampling one station at the coarsest 20 arcminute grid resolution, we retain more than 20,000 stations with a high-quality score of 1 or 2 ( Table 4). The spatial distribution of stations using the coarsest subsample at 20 arcminute resolution is shown in Figure 5, where the quality is indicated by a colour legend. Low-quality stations are typically restricted to mountainous areas and specific countries. For example, most of India's weather station coverage has gaps after 1970 in all databases, leading to intermediate quality scores. For North America, there are equal number of high and low-quality stations, but we should note that the low-quality stations primarily represent Quality 8 stations from the USFS database. The USFS database contains 1961-1990 normal averages, without reporting the number of observations that went into these estimates. The low-quality score in this case was given for lack of information, that is where duplicate records are available, other databases with more information would be preferred.
Virtually, all databases included in this study contributed a significant amount of stations to the final combined database ( Table 5). The individual contributions to some degree reflect their size of each database (cf . Table 1), with the daily GHCN database contributing the largest number of stations. However, with spatial filters applied that subsample the best stations for a three-dimensional grid cell (area and elevation band), the contributions of individual databases become more equalized. Also, the best quality stations (right section of Table 5) are sourced from all databases except USFS, which lacked documentation to achieve a high-quality score in this compilation.

| Applications and limitations
In this data management and data cleaning effort, we made a number of subjective choices that are guided by particular applications that this database may be useful for, namely for the development of long-term climate normal surfaces that can serve as reference periods for ecological research on adaptation of organisms with climate, biological response of organisms to interannual climate variability and response of organisms to historical and future climate trends. As a useful normal reference period, we advocate the use of the 1961-1990 climate normal, which strikes a good balance between excellent global weather station coverage, and largely preceding a strong anthropogenic warming signal (Tett et al., 1999;Lawrimore et al., 2011;Estrada et al., 2013). Therefore, our station quality ranks specifically take this period into account. Nevertheless, users that are interested in other periods can easily modify the ranking system. All records of the combined database were retained, and all decision criteria for quality ranks for each station are included to specify other preferences.
For spatial interpolation of weather station data, we recommend the use of station subsets, where only one station with the best quality score is selected for various grid sizes. For spatial interpolation, multiple records in close proximity are generally not needed, and the disadvantage of potentially poor quality records may outweigh the advantages of dense spatial coverage. For generating interpolated surfaces of climate normal periods other than the 1961-1990 period, or for generating surfaces with a monthly resolution, for example to study response of organisms to climate variability or climate trends at a monthly time step, we recommend using an anomaly or delta approach, described for example by Mitchell and Jones (2005) and Wang et al. (2006). Relying on our missing value estimation for stations up to a quality score of 8, an adjusted 1961-1990 normal average can be obtained for a large majority of the stations contained in this database (75% of stations). Deviations from this 1961-1990 normal estimate F I G U R E 5 Map of stations coloured by quality score (blue = high quality, red = poor records, for details see Table 2), for the subsample where one station is selected for each 20 arcminute grid cell (approximately 1,600 km 2 ) and 100 m elevation interval  can then be calculated for all observed values in station time series. Interpolated monthly anomalies or interpolated normal anomalies can then provide robust climate estimates for years outside the 1961-1990 period, even if weather station coverage is not as dense.
The complete global database of monthly precipitation records from 1901 to 2010 is available from http://doi. org/10.5281/zenodo.3520885. This repository also contains regional files, climate normal summaries for the  period and subsamples based on spatial filters, where the highest quality stations are selected for a grid cell. Our general recommendations for developing global interpolated climate products are to use the 20 arcminute sub-sample and station records with a quality score of at least 8. For the  period, this subset would include stations with the following summary statistics (derived from data underlying Tables 3 and 4): 69% of records complete, 17% of monthly records being estimated with high confidence (linear model estimates with an R 2 > 0.7), 6% of monthly records being estimated with moderate confidence (linear model estimates with an R 2 between 0.5 and 0.7) and 2% of records being filled with CRU estimates for desert stations. For the development of local climatology products, a higher density station coverage may be used (i.e. one station per 2.5, 5, or 10 arcminute grid cell), especially if local station coverage is generally poor.

| CONCLUSION
The databases that we generated in this study should be of value to a variety of users who create gridded precipitation data or other climate data products derived from weather station data. We corrected a sizable number of errors in reported station elevations due to unit conversions, or due to missing values being reported as zero. Elevation errors can be detrimental for the quality of interpolated surfaces, as elevation is almost always used as a predictor variable or covariate in interpolation techniques for climate data. In addition, the estimation of missing values with linear models should render some stations useful that did not have appropriate coverage to calculate a specific 30-year climate normal due to missing values, but that had enough records from other years available to infer long-term climate conditions. Finally, the sub-sampling procedure guarantees that poorer records from various source databases are replaced by other, better quality records for nearby locations within the same elevation band. Our contribution significantly enhances the global data coverage compared to individual databases currently available (Table 5). This applies in particular for high-quality stations. The combined database contains almost 40,000 stations within the quality ranks 1 and 2, that is 41% of 97,112 stations (Table 4). Even when applying the strictest spatial filter, selecting one station per 20 arcminutes (or approximately 1,600 km 2 ), the resulting station count of more than 20,000 stations in the combined database exceeds the count of the top two quality stations in any individual database (without a spatial filter applied) by at least 58%.