Sensitivity of trends to estimation methods and quantification of subsampling effects in global radiosounding temperature and humidity time series

Climate trends estimated using historical radiosounding time series may be significantly affected by the choice of the regression method to use, as well as by a subsampling of the dataset often adopted in specific applications. These are contributions to the uncertainty of trend estimations, which have been quantified in literature, although on specific pairs of regression methods, and in the not very recent past characterized by smaller trends in temperature than those observed over the last two decades. This paper investigates the sensitivity of trend estimations to four linear regression methods (parametric and nonparametric) and to the artificial subsampling of the same dataset using historical radiosounding time series from 1978 onwards, available in the version 2 of the Integrated Global Radiosonde Archive (IGRA). Results show that long‐term decadal trends may have not negligible uncertainties related to the choice of the regression method, the percentage of data available, the amount of missing data and the number of stations selected in the dataset. The choice of the regression methods increases uncertainties in the decadal trends ranging from −0.10 to −0.01 K⋅da–1 for temperature in the lower stratosphere at 100 hPa and from 0.2 to 0.8% da–1 for relative humidity (RH) in the middle troposphere at 300 hPa. Differences can also increase up to 0.4 K⋅da–1 at 300 hPa when the amount of missing data exceeds 50% of the original dataset for temperature, while for RH, significant differences are observed in the lower troposphere at 925 hPa for almost all datasets. Finally, subsampling effects on trend estimation are quantified by artificially reducing the size of the IGRA dataset: Results show that subsampling effects on trend estimations when at least 60 stations, up to 76% of data available, are considered for temperature and at least 40 stations for RH.


| INTRODUCTION
Historical radiosounding observations have proven to be extremely successful for climate monitoring and for trend detection by providing high vertical resolution profiles of Essential Climate Variables (ECVs). Radiosounding profiles of temperature, humidity and wind have been used both to characterize the structure of weather systems and to detect changes in climate evolution. Radio soundings are not the only source of upper air data but offer a few potential advantages compared to satellite observing platforms. The satellite microwave sounding units (MSUs) have provided global data since 1979 with spatial coverage far exceeding that of radiosounding networks, especially in the Tropics, Southern Hemisphere (SH) and oceanic regions but with a coarse vertical resolution, especially in the upper troposphere (UT) and lower stratosphere (LT) regions (Christy and Goodridge, 1995;Gaffen, 1998;Christy et al., 2003;Ladstädter et al., 2011). In addition, MSU datasets are relatively short in sampling time and potentially influenced by several issues such as changes in cloud amount and tropospheric humidity (Simmons et al., 2014).
At the global scale, more than 1,000 radiosoundings are approximately launched at day and night time each day to characterize the state of the atmosphere (Ladstädter et al., 2011), although with large sparse coverage in the SH and occasional launches performed on ships to fill the gap over the oceans. The use of radiosounding datasets for climate change studies implies the handling of inhomogeneities caused by frequent changes in instruments, sensor types, observation practices or change in data processing systems (Thorne et al., 2005;Karl et al., 2006;Miloshevich et al., 2006;Simmons et al., 2014;Madonna et al., 2020aMadonna et al., , 2020b. Radiosondes from different manufacturers are affected by systematic errors, which differ from one type to another (e.g., Dirksen et al., 2014;Ingleby, 2017).
Even though random errors in the trend estimations can be reduced by increasing the number of measurements (Weatherhead et al., 1998;Whiteman et al., 2011), the systematic errors persist regardless of the number of observations introducing biases in climate applications (Ingleby, 2017). In addition, changes in the utilized sensor type at different locations over time were almost not documented before 2000 and not always adequately after, leading to artificial trends or jumps in the radiosounding data records (Dai et al., 2010). Finally, radiosounding observation data routinely deal with data continuity problems. In this respect, a recent study of Ferreira et al., (2019) provides a detailed analysis of the completeness of radiosounding humidity observations available in IGRA version 2 (Durre et al., 2018), with several statistics which reveal that the length and completeness of humidity data vary widely among stations, and the vertical resolution, vertical extent and completeness of soundings have improved considerably over time since 1947 to the present.
Regular temporal sampling and continuity in historical radiosounding data profiles are an important aspect of trend estimation and interpretation (Schlegel and Smit, 2016). Gaffen et al. (2000) quantified the impact of geographical distribution and data quality on upper-air temperature trends that were estimated using National Oceanic and Atmospheric Administration (NOAA) radiosounding data for the period 1959-1995. Trend estimations were based on a comparison between the robust and nonparametric median of the pairwise slopes method and the classic parametric least-squares regression method: Results show similar trend estimations, with differences generally less than ±0.03 KÁdecade -1 for the period 1959-1995, although differences in stratospheric trends may rarely exceed 0.10 KÁdecade -1 for the period 1970-1995. Referring to different datasets obtained from MSU, radiosonde measurements and reanalysis from European Centre for Medium-Range Weather Forecasts (ECMWF) and National Centers for Environmental Prediction (NCEP), Santer et al. (2000) compared two linear trend estimation methods, one based on minimization of absolute deviations (LA) and another based on minimization of squared deviations (LS). Results indicate that trend estimates with LA are less sensitive to the noise than trend estimates of the LS method. They also found that differences between the two linear methods are generally less than 0.05 KÁdecade -1 over 1959-1996 and exceed 0.10 KÁdecade -1 for lower troposphere and 0.15 KÁdecade -1 for the lower stratosphere over 1973-1993. The sensitivity of estimated trends to regression methods is an uncertainty contribution, which adds to the sampling uncertainty in time, due to gaps (e.g., missing data) in the data records, and in space, due to the need to select the most reliable subset of stations (Rosen et al., 2003;Agudelo and Curry, 2004;Free and Seidel, 2005;McCarthy, 2008;Ladstädter et al., 2011). It is worth pointing out that uncertainties due to temporal and spatial gaps in historical radiosounding data records have been often neglected in literature or reduced by using temporal averaging, on monthly or yearly scales, or data gridding using the interpolation or extrapolation method, which can reduce sampling bias but not remove it completely (McCarthy, 2008).
From NCEP reanalysis and MSU satellite data and from seven radiosonde networks available for the period 1960-1997, Free and Seidel (2005) found that sampling errors affecting trend estimation depend upon the atmospheric pressure level and variability, the network size and the climate region. Results found that sampling errors range from less than 0.001 KÁdecade -1 to more than 0.10 KÁdecade -1 and differ noticeably from one dataset to another. Comparing datasets from GPS radio occultation and radiosounding observations, Ladstädter et al. (2011) also found that sampling errors from radiosounding datasets are generally small (less than 0.3 K) between 50 S and 50 N but are larger at higher latitudes due to greater variability of the atmosphere and to the small number of stations in the SH.
Most of above-mentioned papers used radiosounding data until 2000 and, therefore, are characterized by trend rates slower than those observed over the last two decades. Moreover, beyond the paper cited above, many of the previous studies have estimated trends in temperature or humidity from historical radiosounding by using the parametric linear regression method, which is not a robust method and may be adversely affected by outliers, by non-Gaussian behaviour of the underlying data distribution and non-stationarity (Lanzante, 1996;Zaman et al., 2001;Mudelsee, 2019).
The outlined scenario shows that further studies and reviews are needed to properly quantify trend estimation uncertainties. Using historical radiosounding time series available from the IGRA (version 2) for the period 1978-2018, this paper provides a quantitative analysis of the uncertainties in the estimation of decadal trends of temperature and humidity due to the use of linear regression methods. It also provides a quantitative estimation of the uncertainty introduced by the spatial and temporal subsampling effects on decadal trends using a novel approach, which may be considered useful for the design of a measurements network, as well as for climate applications.
The paper is structured as follows: Section 2 discusses the radiosounding datasets and the related available metadata provided by IGRA. Section 3 outlines data analysis, subsampling strategies and the linear regression methods applied to estimate trends. Section 4 discusses the uncertainty contributions that may derive from the use of different linear estimation methods by comparing parametric simple linear regression method and nonparametric regression methods. Section 5 investigates the sensitivity of trends due to missing data. Section 6 describes the effect of spatial subsampling on trend estimations. Finally, a discussion and recommendations for future works are provided in Section 7.

| RADIOSONDE DATASETS
The radiosounding time series used for the analysis presented in this paper is a part of IGRA (version 2) (Durre et al., 2018). To date, IGRA is the largest and most comprehensive dataset of quality-assured radiosonde observations freely available. It consists of a collection of historical and near real-time radiosounding and pilot-balloon observations taken from 2,700 globally distributed stations (including ships, buoys and remote islands over the oceans) with varying periods of record, aiming to enhance spatial and temporal coverage, particularly prior to the 1970s (see also https://www.ncdc.noaa.gov/ data-access/weather-balloon/integrated-global-radiosondearchive). Approximately 1,000 of the IGRA stations are currently reporting data. Compared to the former release (Durre et al., 2006), IGRA version 2 increased its data volume by 30%, extending the data back in time to as early as 1905 and improving the spatial coverage, for example, in South America and Africa. Among the list of stations available from the IGRA data repository, a substantial subset of globally distributed radiosounding stations (656 stations) was used in this paper (Figure 1), namely, those with metadata available since 2000 to present. Ferreira et al. (2019) found that the total number, spatial distribution and temporal completeness of IGRA stations vary considerably over the years due to several factors, and the number of vertical levels and the vertical extent in radiosonde reports depend on the standard and mandatory pressure levels provided according to the World Meteorological Organization (WMO) recommendations (WMO, 1996) to the Met services, as well as on the number of reported significant levels. Finally, due to various reasons, missing data are frequently found. In this work, monthly radiosounding time series from 1978 onwards on 16 mandatory pressure levels (1,000,925,850,700,500,400,300,250,200,150,100,70,50,30,20 and 10 hPa) are selected for temperature, while for RH, only data at pressures higher than 300 hPa were considered as RH sensors below 300 hPa are prone to systematic uncertainties.
The data availability of measurements in different zonal regions at each pressure level from 1,000 to 10 hPa was investigated to quantify the amount of missing data and to identify the periods where they are more frequent. Missing data are naturally present in the selected dataset due to a lack of continuity in the operations that contingently occurs at each station. This implies that, to provide a reliable assessment of the uncertainties affecting the estimation of trends using radiosounding data, the dataset selected must be investigated without restrictions on where the missing data are located, although it is known that these may affect trend estimates. Figure 1 shows the spatial distribution of radiosonde stations with an amount of missing monthly data equivalent to 0, 5, 10, 20 and 30 years, reported as the percentage of available data with respect to the expected full sampling (i.e., 100, 90, 76, 51 and 25%). Hereinafter, the percentage of available data at each pressure level will be indicated as P x , that is, if we are referring to the number of stations with at least 90% of the data available at a certain pressure level for a specific test, this will be reported as P 90 . In the following, "da -1 ," instead of "decade -1 ," is used to ease the reading of the text and results provided.
In Figure 1, the total number of stations recording large amounts of data (e.g., at least 90%) becomes extremely limited at the top of the atmosphere. At 925 hPa (Figure 1, top-left panel and Figure 2), no station among the total number of stations at the global scale reaches the full expected sampling (i.e., have full time series without gaps). This is particularly because the 925 hPa pressure level was only adopted at the global scale by the end of 1991, although it has been planned since 1977 (WMO, 1977;Oakley, 1993).
No stations among the total number of stations (n = 144) at the Tropics have at least P 63 of monthly data for all levels from 150 to 10 hPa owing to the paucity or complete lack of observations and to sensor limitations to produce suitable records (Figure 2, middle panel). In the SH (n = 85), the number of stations with P 63 decrease substantially from 30 stations to less than 10 stations between 200 and 150 hPa (Figure 2,

hPa
F I G U R E 1 Spatial distribution of radiosounding stations with 0, 5, 10, 20 and 30 years of missing data. Missing data are reported as the percentage of months where data are available to the total number of months, that is, X = 100, 90, 76, 51 or 25%. Each dot indicates a station, and each colour indicates the percentage P x of data available at the station. Map are shown for pressure levels at 925, 850, 500, 300, 100 and 30 hPa. In brackets, the total number of stations with a certain percentage of data is reported to allow a reliable climate trend detection (Weatherhead et al., 1998;Schlegel and Smit, 2016). Over the Northern Hemisphere (NH) (n = 427), where most of the radiosonde stations operate (70% of radiosonde launches are located between 30 and 60 N), the number of available observations for all P x are more abundant (with a minimum of 10 stations reporting data at all pressure levels) ( Figure 2). To evaluate the sensitivity of decadal trends to regression methods and to the effects of subsampling, the samples P 100 , P 90 , P 76 and P 51 are considered. Note that observations corresponding to P 51 were considered to quantify the uncertainties for those applications based on a subsample of the IGRA dataset, including stations with a significant missing data. This extends several previous studies investigating trend differences solely considering one single percentage of data availability .

| Statistical methods
Regression is a statistical method used to estimate the trend component of a climate data record (Mudelsee, 2019). A multitude of methods has been provided in the literature to estimate trends. Regression model building is broadly used in literature to study the relationship between a dependent variable and a set of independent variables. Outliers from observation measurements in linear regression models are often encountered, and this is also true for radiosounding observations. Outliers can affect the variability associated with the heavy-tailed and skewed nature of data distributions but lead to a dramatic change of the magnitude of linear regression coefficients estimated and even the direction of coefficient signs (Choi, 2009;Wilks, 2011). At present, a reference dataset (Thorne et al., 2017), other than GRUAN (The Global Climate Observing System [GCOS]-Reference Upper-Air Network; https://www.gruan.org), which comprises a small number of high-quality stations (15-20) with considerable data records since 2010, does not exist. Therefore, the analysis of trend estimations has been typically carried out on different datasets, selecting different station networks and periods of record using different regression methods. However, it appears still challenging to provide a "best" trend estimation for the different ECVs from radiosounding measurements due to the several reasons discussed earlier.
The calculation of trends takes into account time series autocorrelation (Weatherhead et al., 1998;Weatherhead et al., 2002), which is often present in climate data records (e.g., temperature). In this paper, the autocorrelation was taken into account by considering the first-order autoregressive model (Weatherhead where y t are the monthly anomaly time series, t the time variable assigned to y t , α a constant term, x t the linear trend function, β the linear trend and u t the residual term that is assumed to be autoregressive of the order of 1 [AR (1)] when monthly data are considered (Weatherhead et al., 2002;Hartmann et al., 2013) and may be related as follow: where ρ determines the sign of this autocorrelation effect and is thus designated as the coefficient of autocorrelation. ε t (the white noise series) are the independent random variables with mean zero and common variance σ 2 ε : This approach is similar to that one used in Intergovernmental Panel on Climate Change (IPCC) (Hartmann et al., 2013). We also assumed that −1 < ρ < 1, so the noise process u t is stationary. This model allows the noise to be autocorrelated among successive observations, such that ρ = Corr(u t , u t − 1 ) (Weatherhead et al., 1998;Mudelsee, 2019).

| Methods of estimating trends
Trend estimation methods applied to estimate linear trends are broad and fall generally into two main categories: parametric and nonparametric. The aim of this section is to extend several previous studies investigating trend estimation uncertainties, focusing on the difference between pairs of regression methods (most often a parametric vs. a nonparametric, e.g., Finger et al., 1995;Gaffen, 1994;Gaffen et al., 2000;Parker, 1985;Santer et al., 2000). The sensitivities of regression methods are then assessed at each mandatory level for each P x for the following methods: 1. Simple linear regression (hereinafter LIN) is probably the most commonly considered analysis method when examining the relationship between a quantitative outcome and a single quantitative explanatory variable. The term « simple » is used to clarify that only a single explanatory variable is considered. As discussed above, LIN is based on statistical significance via a Student's t test and consists of a parametric technique not resistant to outliers (Lanzante, 1996) and has been demonstrated in many investigations to be less accurate and non-robust for skewed and heteroskedastic datasets (Lanzante, 1996;Wilks, 2011;Dervilis et al., 2015). 2. LANZANTE (hereinafter LAN) is a resistant and nonparametric regression based on the median of pairwise slopes regression (Hoaglin et al., 1986(Hoaglin et al., , 1893Lanzante, 1996). It involves the computation of the slopes between every possible pair of points in the time series, taking the median value as the trend estimate. It was first used in the assessment of hydroclimatic trends (Lettenmaier et al., 1994). It is called the Theil-Sen estimator and can be significantly more accurate than simple linear regression for skewed and heteroskedastic data and shows performances similar to non-robust least squares even for normally distributed data in terms of statistical power (Theil, 1950;Siegel and Benson, 1982;Helsel and Hirsch, 1992). The breakdown bound was estimated at about 29% by Theil (1950) and was extended to 50% by Siegel and Benson (1982), making the regression method very robust. In addition, if the errors are normally distributed, and no outliers are present, the estimators were found very similar to classic least squares (Lanzante, 1996). 3. Least absolute deviation (hereinafter LAD) is a resistant and nonparametric regression method fitting the paired data to the linear model using a robust and resistant LAD method (Rice and White, 1964;Barrodale, 1968;Wong and Schneider Jr, 1989;Calitz and Rüther, 1996;Santer et al., 2000). The technique is based on an algorithm by Barrodale and Roberts (1974). Comparisons with the classical leastsquares regression have been made to demonstrate that the proposed algorithm is more resistant to the existence of outliers and gives more intuitive results with less sensitivity to outliers (Wong and Schneider Jr, 1989). In addition, Calitz and Rüther (1996) found that the algorithm gives similar results to the classical least-squares regression in the absence of outliers. Wong and Schneider Jr (1989) reached similar conclusions and showed that the algorithm has no distributional or independence assumptions. 4. LMROB (hereinafter LMR) is a robust and nonparametric regression method based on an estimator for linear regression models (Finger, 2010;Koller and Stahel, 2011;Susanti et al., 2014). The procedure proposed in the algorithm has been implemented in the R-package "robustbase". It provides different robust regression techniques, as well as robust univariate and multivariate methods (Rousseeuw and Hubert, 2011). Apart from robust regression techniques for linear regression analysis, it includes robust methods for nonlinear and generalized linear models. It uses a bi-square redescending score function and returns a highly robust and efficient estimator with 50% breakdown point and 95% asymptotic efficiency for normal errors. The breakdown point is a measure of resistance to misbehaviour of the datasets, while efficiency measures the performance relative to some standard, such as a parametric estimator or a test assuming a Gaussian distribution. This method is strongly recommended for the robust estimation of linear regression due to its high efficiency when the errors have normal distribution (Finger, 2010).
The trend estimation methods used to estimate linear trends were selected from among the most frequently considered in climate studies using in-situ observations. Decadal trends are estimated from monthly anomaly time series at all selected stations and at all pressure levels. Resistant procedures are generally less efficient when the underlying distribution is Gaussian but provide much better results when data contain outliers or when the distribution is non-Gaussian.

| Subsampling strategies
Estimated subsampling effects may differ considerably among investigations and can be dependent on the approach adopted to subsample a global dataset (e.g., Rosen et al., 2003;Free and Seidel, 2005;McCarthy, 2008). Referring to the Geophysical Fluid Dynamics Laboratory model output, Oort (1978) was among the first to quantify the sampling errors from global tropospheric upper-air temperatures due to spatial gaps. Results found that differences range from 0.50 to 1.0 K at individual pressure levels for an 855-station network and conclude that the network was generally sufficient for quantifying the statistics and trends of largescale circulation in the NH but not in the SH. According to the Angell 63-station network (Angell and Korshover, 1975), Trenberth and Olson (1991) estimated subsampling errors using ECMWF operational analyses and found differences between the sampling effects estimated at the global scale in the NH and the SH. From global analyses of MSU and reanalysis datasets, Agudelo and Curry (2004) found that the difference between tropospheric trends in the global dataset and sampling dataset, according to (Lanzante et al., 2003) the 87-station network, is up to 0.08 KÁda -1 . They concluded that the network overestimated temperature decadal trends due to the shortage of observations over the oceans.
Several studies have been focused on determining sampling effects estimated from different particular network configurations and sizes (e.g., Trenberth and Olson, 1991;Santer et al., 2000;Agudelo and Curry, 2004;Free and Seidel, 2005;McCarthy, 2008). In addition, varying sampling density was tested to determine the theoretical size requirements for a network to provide sampling errors below a predefined threshold. McCarthy (2008) found that, to keep subsampling effects below 0.05 KÁda -1 in the troposphere and 0.10 KÁda -1 in the stratosphere, a radiosonde station at approximately every 30 longitude and 15 latitude north of 30 N is required. Results concluded that radiosonde network requirements are inadequate for monitoring humidity in the Tropics and SH due to the large inhomogeneous station distribution. In addition, the geographical coverage of radiosonde stations changed over time, and the periods of records vary among the stations and are characterized by several gaps. The effects of missing data in trend estimations have been investigated by Schlegel and Smit (2016). Results found that the percentage of missing data (% NA) in the time series of datasets and measurement accuracy are the variables that mainly affect accurate trend estimations.
The strategy adopted in this work to quantify spatial subsampling effects is to calculate the differences between decadal trends estimated for different subsets of radiosounding stations artificially selected versus the complete dataset. The different subsets of the stations are randomly selected for each P x and each latitude to reduce the effects of the irregular spatial distribution of the stations ( Figure 1) and their different time coverage at different pressure levels ( Figure 2). Temporal subsampling effects or effects due to missing data are also quantified using the differences between decadal trends estimated for two different values of P x (i.e., P x1 -P x2 ). To analyse the statistical significance of the sensitivity of trend estimation to regression methods and to data subsampling, the rank-based nonparametric Wilcoxon-Mann-Whitney (WMW) test (Siegel and Benson, 1982) is used. The WMW test is used to test whether differences of estimated trends from two regression methods (or from two samples of datasets) are significantly different from zero with a 95% confidence interval. The main reason for using the nonparametric WMW statistical test is that it tends to be more powerful and better suited for non-normally distributed data compared to parametric tests such as Student's t test (Sy and Quesada, 2020). Figure 3 shows the comparison of the temperature decadal trend differences due to the use of different linear regression methods with respect to the LIN method, that is, (LAD − LIN; black curve), (LMR − LIN; red curve) and (LAN − LIN; blue curve) for the datasets corresponding to P 100 , P 90, P 76 and P 51 , respectively. Due to the observation availability, P 100 and P 90 datasets are only considered in the NH (Figures 1 and 2). Figure 4 shows the same as Figure 3 but for RH.

| SENSITIVITY OF TRENDS TO THE CHOICE OF LINEAR REGRESSION METHODS
For both temperature and RH, the variability of the difference along the vertical profile for all the regression methods increases substantively from P 100 to P 51 .
Decadal trend differences (K·10 y -1 ) Decadal trend differences (K·10 y -1 ) Decadal trend differences (K·10 y -1 ) Decadal trend differences (K·10 y -1 ) Decadal trend differences (K·10 y -1 ) Decadal trend differences (K·10 y -1 ) Decadal trend differences (K·10 y -1 ) Decadal trend differences (K·10 y -1 ) However, the comparison between the regression methods for temperature shows that decadal trends estimated with LIN significantly differ from the trends estimated with the three nonparametric methods, particularly between 400 and 200 hPa and at pressure levels below 100 hPa. Beyond the spatial and temporal sampling, the spread in the trends estimates obtained with different regression methods is influenced by several factors, such as: (a) sensitivity of the regression methods due to the specific nature and features of each dataset, that is, variability, outliers and measurements uncertainty (e.g., Lanzante, 1996;Weatherhead et al., 1998;Weatherhead et al., 2002); (b) existence and timing of the largest change points, that is, structural breaks in the time series due to various reasons, such as instrumental effects or station re-locations (e.g., Gaffen et al., 2000;Santer et al., 2000); and (c) length, stability and number of missing data in the data records (e.g., Gaffen, 1994;Weatherhead et al., 1998;Gaffen et al., 2000;Santer et al., 2000;Schlegel and Smit, 2016). Decadal trend differences (%·10 y -1 ) Decadal trend differences (%·10 y -1 ) Decadal trend differences (%·10 y -1 ) Decadal trend differences (%·10 y -1 ) Decadal trend differences (%·10 y -1 ) Decadal trend differences (%·10 y -1 ) Decadal trend differences (%·10 y -1 ) Decadal trend differences (%·10 y -1 ) Decadal trend differences (%·10 y -1 ) Decadal trend differences (%·10 y -1 ) The P 100 dataset shows that trend differences among regression methods vary from −0.10 KÁda -1 (LAN/LIN) to −0.01 KÁda -1 (at LAD/LIN and at LMR/LIN trend differences) below 100 hPa. The spread, compared to P 90, is larger and can be due to several aspects, such as selecting fewer stations, while with more complete data records, the non-normality and skewness of the distribution may increase the uncertainty for LMR and LAD methods (e.g., Wong and Schneider Jr, 1989;Finger, 2010). Conversely, if the datasets are normally distributed and homoscedastic, the LAN method may be affected by a large type I error (arriving at a false positive) and thus may be penalized with respect to other methods. Finally, if outliers and/or large change points are present in the datasets, both LMR and LAD regression methods may be more sensitive (i.e., less resistant) than LAN and thus exhibit small differences with respect to the LIN method. Nevertheless, the values of the trends for the P 100 dataset give a clear idea of the uncertainty outcome from the trend estimation due to the use of one regression method or another for the considered dataset.
Similar results are obtained in the NH for both P 90 and P 76 datasets, with maximum differences around 0.15 KÁda -1 at 300 hPa. Considering P 76 , at the global scale, differences among regression methods may also exceed 0.15 KÁda -1 at 300 hPa, despite the large number of observations available. At the Tropics, regression methods give similar results at all pressure levels with a maximum difference of around 0.05 KÁda -1 at 250 hPa. In SH, the differences between regression methods are around 0.10 KÁda -1 at 925 and 250 hPa, with large discrepancies ranging from −0.20 KÁda -1 (LAD/LIN) to 0.08 KÁda -1 (LAN/LIN) at 70 hPa, mainly due to limited amount of observations and the consequent limited spatial representativeness of the dataset.
Similar results for P 51 are shown in the bottom panel of Figure 3: Differences among the regression methods become significant at all latitudes compared to P 100 , P 90 and P 76 , with trend differences exceeding 0.40 KÁda -1 at 250 hPa in the NH. In the Tropics, at pressure levels smaller than 300 hPa, where observations become limited, the spread among regression methods varies from 0.20 KÁda -1 (LAD/LIN) to 0.10 KÁda -1 (LAN/LIN) at 200 hPa. In SH, large trend differences are also estimated at pressure levels below 300 hPa, with spread among methods varying from −0.15 KÁda -1 (LMR/LIN) to 0.05 KÁda -1 (LAD/LIN).
Comparing trend differences estimated from P 76 and P 51 at the global scale and over the NH (Figure 3 middle and bottom panels), results show that differences among regression methods estimated from P 51 are four times larger than those estimated from P 76 at 250 hPa. Comparing also trend differences for both percentage of datasets P 76 and P 51 (Figure 3) in the Tropics at 200 hPa, differences estimated from P 51 datasets are at least two times larger than differences estimated from P 76 . This means that differences among the regression methods maybe at least two times smaller if the coverage of available observations increased from 51 to 76% of the total sampling. This is consistent with the fact that trends calculated using data records with significant temporal gaps of missing data (>20 years) are typically affected by large sampling uncertainties. In other words, missing data values increase the possibility to detect a trend in a stationary time series (type-2 error, i.e., finding false negatives), also increasing the discrepancies among different regression methods.
On the other hand, adding more stations with large temporal gaps to the stations with smaller gaps (e.g., from P 76 to P 51 ) may substantially increase the spreads among regression methods and, therefore, induce larger uncertainties in trend estimations. In order to reduce uncertainties derived from the choice of an estimation method, the stations with the most complete data records must be selected. This is evident for the trends estimated at the global scale and in the NH at pressure levels lower than 300 hPa in the UTLS regions, where measurement coverage is more limited than at other atmospheric levels. Table 1 summarizes the results shown in Figure 2 at four atmospheric layers: lower troposphere, middle troposphere, upper troposphere and lower stratosphere (hereinafter, LT, MT, UT and LS, respectively). The averaged differences among the regression methods are smaller for P 100 and P 90 (i.e., less than −0.02 KÁda -1 for P 100 and around 0.06 KÁda -1 for P 90 in the UT) but statistically significant. Differences also range around ±0.05 KÁda -1 at all latitudes except in SH, where differences become statistically significant and can reach up to −0.10 KÁda -1 in the LS for P 76 . From P 51, differences are statistically significant as well and can reach up to 0.15 KÁda -1 in the UT, likely related to the large variability of the dynamical processes governing stratospheretroposphere exchange, as well as to the paucity of radiosonde records in the UTLS region.
In Figure 4 and Table 2, a similar comparison as in Figure 3 and Table 1, respectively, is reported but for RH. Comparing trend differences at 300 hPa for all P x , a large spread among regression methods from 0.20%Áda -1 (LMR/LIN) to 0.80%Áda -1 (LAN/LIN) is obtained for P 100 . A large spread at 925 hPa is shown in Figure 4 for all P x . For P 76, the trend differences range within −0.50%Áda -1 (LMR/LIN) and −0.05%Áda -1 (LAN/LIN) in the NH, while in the SH, the differences are larger and range from less than −1.0%Áda -1 (LMR/LIN) to −0.50%Áda -1 (LAD/LIN). The large differences at 300 hPa and 925 hPa  T A B L E 2 Same as Table 1  for RH can be also due to the large water vapour variability in the boundary layer in LT and also to the restricted number of observations (McCarthy, 2008;Ladstädter et al., 2011). Results provided in Table 2 further show that trend differences in the LT and MT are statistically significant, with values ranging within 0.06 and 0.20%Á da -1 in the MT and within 0.01%Áda -1 and 0.10%Áda -1 in the LT for both P 100 and P 90 datasets. For P 51 , the maximum spread occurs in the MT, with significant differences exceeding 0.30%Áda -1 in the SH, while in the Tropics, differences are generally smaller and nonstatistically significant. The large spread among regression methods observed at all latitudes at 925 hPa, shown in Figures 3 and 4, and at 250 hPa for temperature Decadal trend differences (%·10 y -1 ) Decadal trend differences (%·10 y -1 ) Decadal trend differences (%·10 y -1 ) Decadal trend differences (%·10 y -1 ) Decadal trend differences (%·10 y -1 ) F I G U R E 5 Sensitivity of temperature (KÁdecade -1 ) and relative humidity (%Ádecade -1 ) decadal trends to the missing data effects. The effect is quantified using the differences between decadal trends estimated for two different values of P x (i.e., P x1 − P x2 ). Due to the limited observation available, trend differences calculated between P 100 and P 90 datasets (i.e., P 100 − P 90 ) are only shown for the Northern Hemisphere (NH, 20 N < latitude < 90 N) (top panels (a) for temperature and (b) for relative humidity datasets). Differences between decadal trends estimated for P 76 and for P 51 (i.e., P 76 − P 51 ) for temperature ((c), middle panels) and for relative humidity ((d), bottom panels) are shown at the global scale and at all the latitudinal belts: Tropics (jlatitudej < 20 ), Northern Hemisphere (NH, 20 N < latitude < 90 N) and Southern Hemisphere (SH, 90 S < latitude < 20 S). Dots are representative of the median values for each of the different regression methods (different colours), while horizontal bars are representative of the first and third quartiles of the corresponding probability distribution datasets can be mainly explained by the small amount of radiosonde data available, as mentioned above.

| SENSITIVITY OF TRENDS TO TEMPORAL SUBSAMPLING EFFECTS
In this section, we quantify the sensitivity of decadal trends to temporal subsampling effects (i.e., missing data or changes in the sampling frequency). The effects are estimated for temperature and for RH time series in the NH using differences between trends for P 100 and for P 90 at all pressure levels (Figure 5a,b, top panels). The effects are also estimated at all latitudinal belts using the difference between P 76 and P 51 for temperature (Figure 5c, middle panels) and for RH (Figure 5d, bottom panels). In Figure 5, dots are representative of the median values for each regression method (different colours), while horizontal bars are representative of the first and third quartiles of the corresponding probability distribution. Based on trend differences estimated between P 100 and P 90 for temperature (Figure 5a), it can be inferred that the temporal subsampling effects are larger at 300 hPa and at pressure levels below 150 hPa, where there is a substantive increase in the sampling frequency. At 300 hPa, the trend differences due to the missing data and the high local variability can exceed 0.20 KÁda -1 for LIN, while for the nonparametric regression methods, the differences are less than 0.10 KÁda -1 . At pressure levels below 150 hPa, the spread among regression methods becomes larger, with differences ranging within ±0.10 KÁda -1 . From Figure 5b, it is also possible to provide similar results as Figure 5a but for RH. Results show large temporal subsampling effects exceeding 0.50%Áda -1 at 300 hPa, mainly due to the small number of available observations and the large atmospheric variability (McCarthy, 2008;Ladstädter et al., 2011).
The effects of missing data are also obtained using P 76 and P 51 . Figure 5c show statistically significant discrepancies among the regression methods at 925 hPa and at pressure levels below 200 hPa (with values exceeding 0.40 KÁda -1 at 30 hPa), which are mainly due to the substantial increase of the missing data in the time series. These effects are also observed for RH (Figure 5d), with values exceeding 1.0%Áda -1 at 925 hPa in the Tropics. For temperature in the SH, a large spread is observed at 925 hPa and at atmospheric levels below 150 hPa and at 925 and 400 hPa for RH, mainly linked to the sparseness of radiosounding stations and small amount of available data. Table 3 further shows how the temporal subsampling effect at different atmospheric layers is generally small for all regression methods (around −0.01 KÁda -1 in the LT, 0.05 KÁda -1 in the MT and in the UT and within ±0.02 KÁda -1 in the LS), when the amount of missing data is limited (i.e., P 100 − P 90 ). Also for RH, the averaged trend differences due to missing data are generally small (around −0.02%Áda -1 in the LT and 0.10%Áda -1 in the MT). Similar results are obtained using P 76 and P 51 (Table 3, bottom rows), but in the NH and at the global scale. The effects of missing data are statistically significant in the LS, with trend differences exceeding −0.20 KÁda -1 . Table 3 (bottom rows) reports analogous results for RH. Results show that, except for SH where uncertainties are larger (more than 0.30%Áda -1 in MT and 0.40%Áda -1 in LT), temporal subsampling effects are generally small (less than −0.12%Áda -1 in the LT and no more than ±0.11%Áda -1 in the MT). These uncertainties are mainly due to spatial and temporal inhomogeneities in the station distribution.
In summary, the analysis reported in this section allows users to quantify the temporal and spatial subsampling uncertainties for their own climate application based on radiosounding data. To reduce the uncertainty on a trend estimation due to missing data below ±0.15 KÁda -1 for temperature, selecting a dataset with at least 76% of data coverage (i.e., time series observations with at least 30-years of data coverage) is recommended, while for RH, no clear correlation can be found between the estimated uncertainties and the missing data effects. Figure 6 shows the sensitivity of temperature decadal trends to the effects of spatial subsampling, that is, uncertainty on trend estimations due to the selection of different subsets of stations. Due to the limited observation availability and spatial coverage of the dataset in the Tropics and SH, the effects of spatial subsampling are quantified over the NH only. Spatial subsampling effects are estimated using the difference between decadal trends calculated for a subset of radiosounding stations artificially selected (from 20 to 100 stations) versus the decadal trends obtained for the complete dataset. In Figure 6, the median values for each regression method are reported, along with the first and third quartiles (horizontal bars).

| SENSITIVITY OF TRENDS TO SPATIAL SUBSAMPLING EFFECTS
At all pressure levels below 400 hPa, the spatial subsampling effect becomes relevant for all P x , with a large spread among the regression methods. It becomes obvious that subsampling effects are larger when a small number of stations is considered (<40 stations). Therefore, when increasing the number of stations, a  convergence process is observed, with trend differences progressively decreasing to small values. This is true for P 100 , P 90 and P 76 datasets, as shown in Figure 6. However, an increasing the number of stations (e.g., from 20 to more than 100 stations), coupled with a limited amount of missing data, allows us to further reduce the noise coming from an inappropriate spatial and temporal sampling, thus reducing the uncertainties due to the choice Decadal trend differences (K·10 y -1 ) F I G U R E 6 Sensitivity of temperature decadal trends (KÁdecade -1 ) to the spatial subsampling effects. The effects are estimated as the difference between trends estimated from each subset of radiosounding stations artificially selected (ranging from 20 to 100) versus the complete network in the Northern Hemisphere (NH, 20 N < latitude < 90 N) only (due to the limited observation availability) at all mandatory pressure levels for P 100 (top panels), for P 90 (top-middle panels), for P 76 (middle-bottom panels) and for P 51 (bottom panels). The dots are representative of the median values for each of different regression methods (see different colours), while horizontal bars are representative of the first and third quartiles of the corresponding probability distribution of a regression method. Above 400 hPa in the LT, where a dense distribution of observations is available, the subsampling effect becomes smaller, with differences generally less than ±0.01 KÁda -1 , while differences increase when fewer than 20 stations are considered (exceeding 0.10 KÁda -1 at 850 hPa). Results provided in Table 4 also show that the spatial subsampling effects are generally small (lower than ±0.10 KÁda -1 in the LT and no more than 0.20 KÁda -1 in the MT for all P 100 and P 90 datasets), but it is statistically significant, especially when trends are calculated from a moderate number of stations (e.g., less than 80). From P 76 and P 51 datasets, the effects of the spatial subsampling effects are also quite small and therefore negligible, with values ranging around ±0.10 KÁda -1 . If fewer than 20 stations are considered, the differences become statistically significant, with values exceeding 0.2 KÁda -1 for both the MT and the UT levels.
In Figure 7, similar results as Figure 6 are shown for RH. The spatial subsampling relative effects estimated are generally large at 1000 hPa and at 300 hPa, with values ranging within ±0.50%Áda -1 for all P x and the number of stations considered. The calculated differences are mainly due to both the high variability of the humidity field in the LT and UT and to the biases and individual data processing typical of different radiosonde types. The most significant effects are observed below 400 hPa, with significant trend differences exceeding −1.0%Áda -1 for all P x when 20 stations are considered, while they decrease asymptotically from 20 to 100 stations. Table 5 shows the relative effect of spatial subsampling in the LT, which is generally limited with values ranging around ±0.10%Áda -1 for almost all P x , independent of the number of selected stations. Instead, in the MT, the differences become large and statistically significant for almost all datasets, especially for P 51 , when the number of stations considered is fewer than 80: In this case, differences can be up to −1.0%Áda -1 when 20 stations are selected.

| DISCUSSION AND CONCLUSION
An extension of the analysis available in literature for the quantification of uncertainties in the estimation of decadal trends is presented in this paper. The effects of different regression methods, of missing data and of subsampling have been studied using historical radiosounding time series of temperature and RH from 656 stations in the period 1978-2018, available in version 2 of IGRA (Durre et al., 2018). The strongest finding of this paper is that accurate estimations of long-term decadal trends in radiosounding time series may primarily concern the choice of the regression method. When trends are calculated on the entire IGRA dataset (i.e., P 100 ), differences among regression methods may vary from −0.10 KÁda -1 to −0.01 KÁda -1 below 100 hPa for temperature and from 0.2 to 0.8%Áda -1 at 300 hPa for RH. Differences can also increase up to 0.4 KÁda -1 at 300 hPa when the amount of missing data exceeds 50% of the original dataset for temperature, while for RH, significant differences are observed in the LT at 925 hPa for almost all datasets. Such findings are consistent with previous studies available in the literature (e.g., Santer et al., 2000), but they are more complete because they are often limited to considering differences between pairs of regression methods and/or considering one percentage of data availability from datasets characterized by trend rates slower than those observed over the last two decades. Another important aspect of this paper is that this is, to our knowledge, the most recent work based on trends calculated using radiosounding datasets, which are now measuring using improved sensors in the epoch of the "climate change," and therefore, it comes at the right moment as an improvement of past efforts.
In the past, analysis of uncertainties in trend estimations has been typically carried out on different datasets, selecting different station networks, periods of record and regression methods. From the different samples of dataset considered, as mentioned above, the spread among regression methods is influenced by several factors, such as: (a) sensitivity of the regression methods due to the specific nature and features of each dataset (Lanzante, 1996;Weatherhead et al., 1998;Weatherhead et al., 2002); (b) existence and timing of the largest change points Santer et al., 2000); and (c) length, stability and amount of missing data in the data records (Schlegel and Smit, 2016). Further studies based on a comparison of several trend estimation methods using a larger number of stations with long data records are still a valid solution to improve our knowledge of climate trends and to quantify the related uncertainties.
Finally, two additional caveats are worth noting. First, the uncertainty contributions due to the effect of spatial subsampling biases and, second, the sensitivity of decadal trends to temporal subsampling effects (i.e., missing data). For instance, as we mentioned above, missing data are an intrinsic dataset characteristic due to a lack of continuity in operations contingently occurring at each measuring station. Although the completeness of historical radiosounding observations is improving over the years (Ferreira et al., 2019), missing data in the time series are still frequent. This implies that, to provide a reliable assessment of the uncertainties affecting the

Note:
The effects of spatial subsampling are calculated as the difference between decadal trends for a subset of radiosounding stations artificially selected (from 20 to 100 stations) versus the decadal trends for the complete dataset, for each of different regression methods in the Northern Hemisphere (NH, 20 N < latitude < 90 N), only due to limited data availability in the Tropics and SH. Decadal trend differences statistically significant with 95% confidence intervals, that is, differences that passed the WMW test at 0.05 level, are reported in bold.
estimation of trends using radiosounding data, the separate roles of temporal and spatial sampling effects due to the missing data must be considered. The debate related to the approach to consider for reducing biases due to sampling errors has been ongoing for several years, and it is a topic of several papers in literature (e.g., Santer et al., 2000;Rosen et al., 2003;Agudelo and Curry, 2004;Free and Seidel, 2005;McCarthy, 2008). Most of these papers address spatial sampling errors with regard to several factors (i.e., global datasets considered, approach F I G U R E 7 Same as Figure 6 but for relative humidity decadal trends (%Ádecade -1 ) T A B L E 5 Same as Table 4  adopted, network configurations and sizes, etc.). In our study, the existing approaches were extended considering: (a) different sub-networks of stations to quantify the impact on decadal trends and (b) uncertainties due to the missing data and to the choice of regression methods, identifying the minimum amount of IGRA stations, which any user could use without a significant increase of the uncertainty due to the spatial sampling effects.
Our results found that enlarging the number of stations, including those with an acceptable level of missing data, can allow us to reduce the trend estimation uncertainties among different regression methods. In analogy, uncertainties are smaller in regions where data are denser (e.g., NH) and time series with a large amount of missing data are filtered out. This can be because missing data within the time series showing no significant trends increases the probability of committing a type-2 error (Schlegel and Smit, 2016). Our results also suggest that the use of the P 76 dataset is related to a small uncertainty in the estimation of trends due to the choice of the regression method, while the P 51 dataset may increase the uncertainty up to 0.4 KÁda -1 . Uncertainties also critically depend on the coupling of the number of stations selected for an application and the amount of missing data. Selecting a sufficient number of stations (e.g., at least 60) with a limited amount of missing data (e.g., at least P 76 ) can reduce the noise coming from an inappropriate spatial and temporal sampling and, therefore, the uncertainties due to the choice of the trend estimation method. For example, for RH, 40 stations with the coverage of the P 76 dataset should be selected at minimum.
There are some limitations to this work. First, the dataset selected in this paper has been investigated without considering restrictions on the persistence and the position of the missing data within each time series (i.e., at the beginning, in the end, or in the middle), with the aim to quantify the uncertainty affecting the results of any study based on the IGRA radiosounding data as a function of its data completeness in time. Second, the different subsets of the stations are randomly selected for each P x and each latitude with the aim to reduce the effects due to the irregular spatial distribution of the stations and their different time coverage at different pressure levels. Nevertheless, this work contributes to increasing the attention and encourages further studies to improve the quantification of uncertainties in climate trends.