Panel and multivariate methods for tests of trend equivalence in climate data series
Abstract
We explain panel and multivariate regressions for comparing trends in climate data sets. They impose minimal restrictions on the covariance matrix and can embed multiple linear comparisons, which is a convenience in applied work. We present applications comparing post-1979 modeled and observed temperature trends in the tropical lower- and mid-troposphere. Results are sensitive to the sample length. In data spanning 1979–1999, observed trends are not significantly different from zero or from model projections. In data spanning 1979–2009, the observed trends are significant in some cases but tend to differ significantly from modeled trends. Copyright © 2010 Royal Meteorological Society
1. Introduction
Many issues of interest in climate analysis involve comparisons of trends across different data sets. This note explains regression-based methods that yield asymptotically valid parameter variances and covariances while providing a flexible testing framework. Obtaining linear trend coefficients is easy using ordinary least squares (OLS). Obtaining unbiased estimates of the parameter variances and covariances (collectively referred to as the covariance matrix) is more challenging, because the regression residuals may be autocorrelated within each panel, and both heteroskedastic (unequal variance) and correlated across panels. Regressions that use sequenced groups of time series observations are referred to as panel estimators (Davidson and MacKinnon, 2002). They are convenient when panels are unbalanced, i.e. they do not all have the same numbers of observations, but they impose restrictions on the covariance matrix. A nonparametric method introduced by Vogelsang and Franses (2005) handles autocorrelation of unknown dimension; however, it is only applicable to balanced panels.
We explain both methods and the trade-offs between them. In Section 3, we apply them to a comparison of model temperature projections and observations in the tropical troposphere. We test trend significance as well as model-data equivalence. For discussions of the importance of modeling and climatological measurement issues related to the tropical atmosphere, see Karl et al. (2006), Santer et al. (2005, 2008), and Douglass et al. (2007).
2. Methods
2.1. Introduction: two-equation case
Suppose y1t and y2t are two temperature series (t = 1, …, T) and τt is a linear time trend. The trends can be estimated by OLS from

(1) y1t = a1 + b1τt + e1t

(2) y2t = a2 + b2τt + e2t

and compared using the test score

(3) t = (b̂1 − b̂2) / [V̂(b̂1) + V̂(b̂2) − 2cov(b̂1, b̂2)]^1/2

where V̂(b̂i) (i = 1,2) denotes an autocorrelation-robust variance estimator for b̂i, and cov(b̂1, b̂2) is the estimated covariance between the trend terms.
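To make the two-equation comparison concrete, the following sketch (simulated series, not the paper's data) estimates each trend by OLS and computes an autocorrelation-robust standard error with a hand-rolled Newey–West (Bartlett-kernel) estimator. The cross-covariance term in the test score is set to zero here only because the simulated series are generated independently; with real data it must be estimated.

```python
import numpy as np

def trend_with_hac(y, lags=6):
    """OLS trend; Newey-West (Bartlett-kernel) standard error for the slope."""
    T = len(y)
    X = np.column_stack([np.ones(T), np.arange(1.0, T + 1)])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ (X.T @ y)
    u = y - X @ beta
    Z = X * u[:, None]               # moment contributions x_t * u_t
    S = Z.T @ Z
    for L in range(1, lags + 1):     # add Bartlett-weighted autocovariances
        w = 1.0 - L / (lags + 1.0)
        G = Z[L:].T @ Z[:-L]
        S += w * (G + G.T)
    V = XtX_inv @ S @ XtX_inv        # sandwich estimator
    return beta[1], np.sqrt(V[1, 1])

def ar1_noise(rho, sd, T, rng):
    e = np.zeros(T)
    for t in range(1, T):
        e[t] = rho * e[t - 1] + rng.normal(scale=sd)
    return e

rng = np.random.default_rng(42)
T = 372                               # monthly observations, as in 1979-2009
y1 = 0.0020 * np.arange(1, T + 1) + ar1_noise(0.8, 0.1, T, rng)
y2 = 0.0010 * np.arange(1, T + 1) + ar1_noise(0.8, 0.1, T, rng)

b1, s1 = trend_with_hac(y1)
b2, s2 = trend_with_hac(y2)
# cross-covariance set to zero: the simulated series are independent
t_diff = (b1 - b2) / np.sqrt(s1**2 + s2**2)
print(f"b1={b1:.4f}, b2={b2:.4f}, t_diff={t_diff:.2f}")
```

With autocorrelated data, the naive OLS standard errors would be far too small; the HAC correction inflates them toward honest values.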
(4)

(5)

2.2. Panel regressions
(6) y = a1(1 1)′ + a2(0 1)′ + b1(τ τ)′ + d2(0 τ)′ + e

where y = (y1′ y2′)′ stacks the two series. (1 1)′ denotes two stacked T-length vectors of ones. (0 1)′ denotes a vector of T zeros stacked on T ones. This is called an indicator or a ‘dummy variable,’ since it indicates (value = 1) whether the dependent variable is y2. (τ τ)′ denotes a 2T-length vector consisting of two T-length time trends, and (0 τ)′ is (τ τ)′ times (0 1)′. A test of d̂2 = 0 in Equation (6) can be shown to be equivalent to testing b̂1 = b̂2 (Kmenta, 1986; Supporting Information). Hence, the t-statistic on d̂2 in Equation (6) yields the test score (3).
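The equivalence between the dummy-interaction t-test and the slope-comparison test can be verified numerically. In this sketch (arbitrary simulated numbers), the coefficient on the (0 τ)′ regressor in the stacked regression exactly equals the difference between the two separately estimated OLS slopes:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 120
tau = np.arange(1.0, T + 1)
y1 = 0.5 + 0.02 * tau + rng.normal(size=T)   # series 1, true slope 0.02
y2 = 0.1 + 0.05 * tau + rng.normal(size=T)   # series 2, true slope 0.05

# Separate OLS trend fits
X1 = np.column_stack([np.ones(T), tau])
b1 = np.linalg.lstsq(X1, y1, rcond=None)[0][1]
b2 = np.linalg.lstsq(X1, y2, rcond=None)[0][1]

# Stacked regression: y on (1 1)', (0 1)', (tau tau)', (0 tau)'
y = np.concatenate([y1, y2])
ones = np.ones(2 * T)
d = np.concatenate([np.zeros(T), np.ones(T)])   # indicator for series 2
tt = np.concatenate([tau, tau])
coef = np.linalg.lstsq(np.column_stack([ones, d, tt, d * tt]), y, rcond=None)[0]
d2_hat = coef[3]                                 # coefficient on (0 tau)'

assert np.isclose(d2_hat, b2 - b1)               # d2 equals b2 - b1 exactly
```

Because the stacked design is fully interacted, its coefficients reproduce the separate regressions exactly; what the stacking adds is a joint covariance matrix in which cross-series restrictions can be tested.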
(7)
(8)
(9)
(10) yit = b0 + b1τt + b2(obsit × τt) + b3obsit + eit

where obsit = 1 if panel i is an observational (rather than model-generated) series and 0 otherwise. When obsit = 0, dyit/dτ = b̂1, and when obsit = 1, dyit/dτ yields (b̂1 + b̂2). Thus, a t-statistic on b̂1 will test whether the model trend is zero, and a test of the linear restriction b̂1 + b̂2 = 0 indicates the significance of the observed slope. The t-statistic on b̂2 tests whether the trend in observations differs significantly from the trend in models.
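The linear restriction b̂1 + b̂2 = 0 can be tested from the estimated coefficient covariance matrix, since var(b̂1 + b̂2) = V̂(b̂1) + V̂(b̂2) + 2cov(b̂1, b̂2). The sketch below uses the classical OLS covariance for brevity (the paper's estimators instead allow autocorrelation and cross-panel correlation), with simulated stand-ins for a model series and an observed series:

```python
import numpy as np

rng = np.random.default_rng(7)
T = 200
tau = np.arange(1.0, T + 1)
# Hypothetical two-panel setup: one "model" series, one "observed" series
y_mod = 0.030 * tau + rng.normal(size=T)
y_obs = 0.012 * tau + rng.normal(size=T)

y = np.concatenate([y_mod, y_obs])
obs = np.concatenate([np.zeros(T), np.ones(T)])   # obs dummy
tt = np.concatenate([tau, tau])
X = np.column_stack([np.ones(2 * T), obs, tt, obs * tt])

beta, res, *_ = np.linalg.lstsq(X, y, rcond=None)
u = y - X @ beta
s2 = (u @ u) / (len(y) - X.shape[1])
V = s2 * np.linalg.inv(X.T @ X)     # classical OLS covariance (no HAC, for brevity)

b1, b2 = beta[2], beta[3]           # trend, and obs-minus-model trend difference
var_sum = V[2, 2] + V[3, 3] + 2 * V[2, 3]
t_obs_trend = (b1 + b2) / np.sqrt(var_sum)   # tests observed slope = 0
print(f"observed trend {b1 + b2:.4f}, t = {t_obs_trend:.1f}")
```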
(11) yit = b0 + b1τt + b2(obsit × τt) + b3obsit + b4(obs2it × τt) + b5obs2it + eit

where obs2it = 1 only for the second observational system. The estimated model trend is b̂1. The trend in observations from system 1 is b̂1 + b̂2 and from system 2 is b̂1 + b̂2 + b̂4. The t-statistic on b̂4 tests whether the trend in the second observation system differs from that in the first, and so forth.
Suppose the error term in each panel i follows an AR1 process,

(12) eit = ρiei,t−1 + uit.

Each panel can be quasi-differenced using its Prais–Winsten transformation matrix

(13) Ai = [[√(1 − ρi²), 0, …, 0], [−ρi, 1, 0, …, 0], …, [0, …, −ρi, 1]],

after which the stacked errors u are assumed to have the covariance matrix

(14) E(uu′) = Ω = [[σ11I1, σ12I, …, σ1NI], [σ21I, σ22I2, …, σ2NI], …, [σN1I, …, σNNIN]]

where σij denotes the covariance between series i and j, Ii denotes an identity matrix with dimension T, and σii denotes the variance of series i. There are N(N − 1)/2 covariances σij (i ≠ j) in Equation (14) that need to be estimated, in addition to the variances and AR1 parameters. If some panels j are shorter than others (Tj < T), then the dimensions of the Ai matrices need to be adjusted accordingly. Some commercial statistical packages, such as STATA, can accommodate unbalanced data sets.

2.3. Higher order autocorrelations and multivariate trend models
VF05 also derive an F-type form, which has higher power and is slightly easier to compute. It is obtained as follows. Denote V = U′, where U is the matrix of residuals, and take the columns vj, for j = 1, …, T, each of length N. Define the vector of partial sums ŝt = Σ(j≤t) vj. Then, VF05 show that
(15)
(16)

The VF05 approach improves on the panel method by providing robust trend variances and covariances regardless of the autocorrelation order and the structure of heteroskedasticity. However, it requires balanced panels, which can be a limitation in some cases.
The VF05 statistic, like all test statistics, has better size properties as the sample size increases. Rejection probabilities also increase as ρ → 1. Monte Carlo simulations in VF05 show that for T = 100, when q = 1 and ρ > 0.8, just under 10% of test scores exceed the 95th percentile, indicating a tendency to over-reject a true null, although this is an improvement over earlier alternatives. Each panel in our full sample has well over 100 observations, but a high ρ value. Hence, VF05 scores that are close to the critical values may overstate significance.
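Under our reading of VF05, the nonparametric ingredient is a covariance estimate built from partial sums of the detrended residuals, which avoids choosing an autocorrelation order. The following is a loose sketch of that building block only; it is not the full VF05 test, whose statistic and nonstandard critical values are given in their paper.

```python
import numpy as np

def partial_sum_cov(resids):
    """Partial-sum covariance matrix C = T^-2 * sum_t s_t s_t',
    where s_t is the running sum of the N-vector of residuals.
    A sketch of the nonparametric building block in VF05-type tests."""
    T, N = resids.shape
    s = np.cumsum(resids, axis=0)                        # s_t, t = 1..T
    return (s[:, :, None] * s[:, None, :]).sum(axis=0) / T**2

rng = np.random.default_rng(3)
T, N = 372, 4
tau = np.arange(1.0, T + 1)
X = np.column_stack([np.ones(T), tau])
Y = 0.001 * tau[:, None] + rng.normal(size=(T, N))       # four trending series
B = np.linalg.lstsq(X, Y, rcond=None)[0]                 # joint OLS trend fits
U = Y - X @ B                                            # detrended residuals
C = partial_sum_cov(U)
print(C.shape)
```

Because C is a sum of rank-one outer products, it is symmetric positive semidefinite by construction, and no kernel bandwidth or lag order is chosen anywhere.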
3. Empirical application
3.1. Data
We used the same archive of climate model simulations as used by Santer et al. (2008). The available group now includes 57 runs from 23 models. Each source provides data for both the lower troposphere (LT) and mid-troposphere (MT). Each model uses prescribed forcing inputs up to the end of the twentieth century climate experiment (20C3M; Santer et al., 2005). Projections forward use the A1B emission scenario. Table I lists the models, the number of runs in each ensemble mean, and other details. We used four observational temperature series: two satellite-borne microwave sounding unit (MSU)-derived series and two balloon-borne radiosonde series. We use monthly data starting in 1979, covering the tropics from 20°N to 20°S. The MSU observations come from the University of Alabama-Huntsville (UAH; Spencer and Christy, 1990) and Remote Sensing Systems Inc. (RSS; Mears et al., 2003). The HadAT radiosonde series is an MSU-equivalent published on the Hadley Centre web site (http://hadobs.metoffice.com/hadat/msu_equivalents.html; Thorne et al., 2005). The Radiosonde Innovation Composite Homogenization (RICH) series is published by Haimberger et al. (2008) and is available at ftp://srvx6.img.univie.ac.at/pub/rich_gridded_2009.nc. We used the RICH-gridded data and MSU weights supplied by John Christy (personal communication) to construct MSU-equivalent series (see Supporting Information for details).
| Panel | Model/obs name | Extra forcings | No. of runs | LT trend (SD) | MT trend (SD) | AR coeffs LT/MT |
|---|---|---|---|---|---|---|
| 1 | BCCR BCM2.0 | O | 1 | 0.210** (0.058) | 0.211** (0.053) | 1,2/1 |
| 2 | CCCMA3.1-T47 | NA | 5 | 0.363** (0.021) | 0.380** (0.020) | 1,2,3,5/1,3 |
| 3 | CCCMA3.1-T63 | NA | 1 | 0.419** (0.041) | 0.444** (0.039) | 1,6/1,6 |
| 4 | CNRM3.0 | O | 1 | 0.258** (0.085) | 0.326** (0.111) | 1,6/1,3,6 |
| 5 | CSIRO3.0 | | 1 | 0.162* (0.083) | 0.30 (0.083) | 1,3/1,3 |
| 6 | CSIRO3.5 | | 1 | 0.305** (0.103) | 0.288** (0.109) | 1,2,6/1,2,6 |
| 7 | GFDL2.0 | O, LU, SO, V | 1 | 0.229** (0.099) | 0.225** (0.104) | 1,6/1,6 |
| 8 | GFDL2.1 | O, LU, SO, V | 1 | 0.188 (0.115) | 0.193 (0.126) | 1/1,4,5 |
| 9 | GISS_AOM | | 2 | 0.127 (0.091) | 0.123 (0.095) | 1/1 |
| 10 | GISS_EH | O, LU, SO, V | 6 | 0.277** (0.047) | 0.261** (0.043) | 1/1 |
| 11 | GISS_ER | O, LU, SO, V | 5 | 0.258** (0.065) | 0.230** (0.043) | 1,3,4,6/1,4 |
| 12 | IAP_FGOALS1.0 | | 3 | 0.273* (0.037) | 0.259** (0.028) | 1/1 |
| 13 | ECHAM4 | | 1 | 0.290** (0.033) | 0.270** (0.028) | 1,4/1 |
| 14 | INMCM3.0 | SO, V | 1 | 0.185** (0.076) | 0.186** (0.081) | 1,4,6/1,6 |
| 15 | IPSL_CM4 | | 1 | 0.203** (0.077) | 0.202** (0.082) | 1,3,6/1,3,6 |
| 16 | MIROC3.2_T106 | O, LU, SO, V | 1 | 0.100 (0.078) | 0.102 (0.084) | 1,6/1,6 |
| 17 | MIROC3.2_T42 | O, LU, SO, V | 3 | 0.280** (0.037) | 0.284** (0.039) | 1/1 |
| 18 | MPI2.3.2a | SO, V | 5 | 0.277** (0.060) | 0.232** (0.057) | 1,2/1,2,6 |
| 19 | ECHAM5 | O | 4 | 0.227** (0.044) | 0.224** (0.045) | 1/1 |
| 20 | CCSM3.0 | O, SO, V | 7 | 0.320** (0.050) | 0.285** (0.044) | 1/1,6 |
| 21 | PCM_B06.57 | O, SO, V | 4 | 0.178* (0.043) | 0.142** (0.023) | 1,2,3/1,2 |
| 22 | HADCM3 | O | 1 | 0.204** (0.060) | 0.186** (0.063) | 1,2,4,6/1,6 |
| 23 | HADGEM1 | O, LU, SO, V | 1 | 0.258** (0.058) | 0.270** (0.056) | 1/1 |
| 24 | UAH | | | 0.070 (0.058) | 0.040 (0.062) | 1,2/1,2 |
| 25 | RSS | | | 0.157** (0.058) | 0.117* (0.065) | 1,2/1,2 |
| 26 | HadAT | | | 0.097* (0.053) | 0.020 (0.066) | 1,2/1,2 |
| 27 | RICH | | | 0.114** (0.050) | 0.072 (0.059) | 1,2/1,2 |
- Each row refers to model ensemble mean (rows 1–23) or observational series (rows 24–27). All models forced with twentieth century greenhouse gases and direct sulfate effects. Rows 10, 11, 19, 22, and 23 also include indirect sulfate effects. ‘Extra forcings’ column indicates which models included other forcing: ozone depletion (O), solar changes (SO), land use (LU), and volcanic eruptions (V). NA: information not supplied to PCMDI. ‘No. of runs’ indicates the number of individual realizations in the ensemble mean. LT and MT trends based on linear regression allowing six AR terms. Standard errors in brackets. AR coeffs: the AR lags that were significant (p < 0.05) for LT/MT layers, respectively.
- * Significant at 10% level.
- ** Significant at 5% level.
Our data start in January 1979 and end in December 2009. Thus, we have N = 27 panels, each with 372 monthly observations. Figure 1 displays the (smoothed) MSU series and the mean of the PCM model runs for comparison.
Figure 1. UAH (thin dashed) and RSS (thin solid) satellite series 1979:1 to 2008:9. Thick line: Model 21 ensemble mean. Series smoothed using Hodrick–Prescott filter with smoothing parameter λ = 200. Top: LT; bottom: MT.
Douglass et al. (2007) and Santer et al. (2008) focused on trends from 1979 to about 1999, with some series extending a few years further. To compare with these results, we first look at data ending in 1999, and then extend the sample to 2009. Since our panels are balanced, we can generate results using both the VF05 and panel regression methods, but since the results are so similar, we report only the VF05 results for the shorter 1979–1999 sample.
Table I summarizes the data. The 1979–2009 trends in °C per decade are shown for the LT and MT levels, with accompanying standard errors, for all ensemble means and observational series. Each series was centered, and the trend regression allowed for a six-lag AR process, denoted AR6. Table I (final column) shows that in 17 of the 23 models and in all 4 observational series, autocorrelation at lags greater than one was observed in at least one atmospheric layer. Hence, an AR1 error specification is likely inadequate. Extended autocorrelation lags were also observed in the individual model runs.
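The lag diagnostic behind the final column of Table I can be sketched as a two-step check: fit the linear trend, then regress the residuals on six of their own lags and flag the lags whose coefficients are significant. The series below is simulated AR1 data, so lag 1 should reliably be flagged; function and variable names are illustrative, not the paper's.

```python
import numpy as np

def significant_ar_lags(y, max_lag=6, t_crit=1.96):
    """Fit a linear trend, regress residuals on their first max_lag lags,
    and return the lags whose coefficients have |t| > t_crit (~5% level)."""
    T = len(y)
    X = np.column_stack([np.ones(T), np.arange(1.0, T + 1)])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    u = y - X @ b
    # Lag matrix: column k holds u_{t-k}
    Z = np.column_stack([u[max_lag - k:T - k] for k in range(1, max_lag + 1)])
    v = u[max_lag:]
    rho, *_ = np.linalg.lstsq(Z, v, rcond=None)
    e = v - Z @ rho
    s2 = (e @ e) / (len(v) - max_lag)
    se = np.sqrt(s2 * np.diag(np.linalg.inv(Z.T @ Z)))
    return [k + 1 for k in range(max_lag) if abs(rho[k] / se[k]) > t_crit]

# AR(1) noise with rho = 0.8 around a trend: lag 1 should be flagged
rng = np.random.default_rng(5)
u = np.zeros(500)
for t in range(1, 500):
    u[t] = 0.8 * u[t - 1] + rng.normal()
y = 0.01 * np.arange(500) + u
print(significant_ar_lags(y))
```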
All climate models were forced with twentieth century greenhouse gas and sulfate levels: other assumed forcings are listed in Table I.
3.2. Multivariate trend test results
We weighted each model by the number of runs in its ensemble to adjust for the effect of combining runs into an average, although our conclusions would be unchanged if we weighted each model equally.
Table II presents tests of trend significance for the observational series. On data ending in 1999, the VF05 test shows the four observational series are insignificant at both the LT and MT layers, individually and averaged together (column ‘Obs’). Extending the data to 2009 raises the F2 score of combined significance at the LT layer from 12.50 to 76.66, thus attaining significance at 5%. All observed LT series are individually significant, except UAH, which is significant at 10%. At the MT layer, extending the sample raises the combined F2 score from 5.06 to 23.77, which is significant at 10%. UAH and Hadley series are insignificant, RICH is marginal, and RSS is individually significant at 5%.
| Tests of trend significance | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| | Obs | MSU | UAH | RSS | BAL | HAD | RICH | Models |
| LT | | | | | | | | |
| VF method | | | | | | | | |
| 1979–1999 F2 | 12.50 | | 3.98 | 25.47* | | 7.85 | 15.79 | |
| 1979–2009 F2 | 76.66** | | 27.92* | 118.79** | | 55.16** | 93.12** | |
| Panel method 1979–2009 | | | | | | | | |
| Trend (°C per decade) | 0.110** | 0.120** | 0.079 | 0.159** | 0.105** | | | 0.272** |
| Standard error | 0.050 | 0.059 | 0.060 | 0.058 | 0.047 | | | 0.013 |
| p | 0.027 | 0.042 | 0.186 | 0.006 | 0.026 | | | 0.000 |
| MT | | | | | | | | |
| VF method | | | | | | | | |
| 1979–1999 F2 | 5.06 | | 1.55 | 19.36 | | 0.27 | 10.08 | |
| 1979–2009 F2 | 23.77* | | 6.21 | 62.96** | | 0.26 | 41.43* | |
| Panel method 1979–2009 | | | | | | | | |
| Trend (°C per decade) | 0.057 | 0.079 | 0.041 | 0.117** | 0.043 | | | 0.253** |
| Standard error | 0.051 | 0.057 | 0.056 | 0.057 | 0.049 | | | 0.012 |
| p | 0.272 | 0.166 | 0.466 | 0.039 | 0.389 | | | 0.000 |

- VF method: shown are Vogelsang and Franses (2005) F2 test scores. The 90% critical value is 20.14, the 95% critical value is 41.53, and the 99% critical value is 83.96. Panel method refers to panel regression results: shown are the trend in °C per decade, the standard error of the trend, and the p value of a test of H0: trend = 0. See text for discussion of column groupings. Headings: Obs, average of all observational series; MSU, combined satellite record; UAH, University of Alabama-Huntsville; RSS, Remote Sensing Systems; BAL, combined balloon (radiosonde) series; HAD, HadAT balloon series; RICH, Haimberger balloon series; Models, average of 23 ensemble means.
- * Significant at 10% level.
- ** Significant at 5% level.
Trend comparison results are listed in Table III. The second column (‘Obs’) shows that at both the LT and MT layers, on data ending in 1999, the difference between models and observations is only marginally significant, echoing the findings of Santer et al. (2008). However, with the addition of another decade of data the results change, such that the differences between models and observations now exceed the 99% critical value. As shown in Table I and Section 3.3, the model trends are about twice as large as observations in the LT layer, and about four times as large in the MT layer.
| Tests of difference from models | | | | | | | |
|---|---|---|---|---|---|---|---|
| | Obs | MSU | UAH | RSS | BAL | RSS versus UAH | BAL versus MSU |
| LT | | | | | | | |
| VF method | | | | | | | |
| 1979–1999 | 24.96* | | | | | 1990.10** | 4.51 |
| 1979–2009 | 188.55** | | | | | 399.85** | 2.06 |
| Panel (p) 1979–2009 | 0.002** | 0.012** | 0.002** | 0.059* | 0.001** | 0.000** | 0.880 |
| MT | | | | | | | |
| VF method | | | | | | | |
| 1979–1999 | 35.48* | | | | | 1203.37** | 10.18 |
| 1979–2009 | 257.67** | | | | | 229.35** | 13.91 |
| Panel (p) 1979–2009 | 0.000** | 0.003** | 0.000** | 0.019** | 0.000** | 0.000** | 0.243 |

- VF method: Vogelsang and Franses (2005) F2 test scores; the 90% critical value is 20.14, the 95% critical value is 41.53, and the 99% critical value is 83.96. Panel (p) refers to panel regression results: shown are the p values of a test of whether the indicated trend difference = 0. See text for discussion of column groupings. For descriptions of the headings, see the footnote to Table II.
- * Significant at 10% level.
- ** Significant at 5% level.
At both the LT and MT layers, on data ending in either 1999 or 2009, the VF05 tests show that the balloon data are not significantly different from the MSU data, but within the satellite category, the RSS and UAH data are significantly different. Possible reasons for RSS/UAH differences include treatment of intersatellite calibration, orbital decay, and other processing issues (Santer et al., 2005; Karl et al., 2006; Christy and Norris, 2009).
3.3. Panel regressions tests
In cases where one or more series is not of full length, the VF05 test will not work. The panel-corrected standard error estimator in the STATA program (command xtpcse) allows an unbalanced panel in the estimate of Equation (14); however, it imposes an AR1 assumption. For comparison purposes, we report these results on data ending in 2009. We again weighted each observation by the number of runs in the ensemble mean. None of the conclusions depend on this step.
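For intuition, the AR1 correction underlying the panel estimator can be sketched in single-series Prais–Winsten form: estimate ρ from OLS residuals, quasi-difference the data, and re-fit the trend. This is a simplified illustration, not STATA's full panel-corrected (xtpcse) estimator, which additionally models cross-panel covariances.

```python
import numpy as np

def prais_winsten_trend(y):
    """Two-step Prais-Winsten trend fit: OLS, estimate AR1 rho from the
    residuals, quasi-difference the data, re-estimate the trend.
    A simplified single-series sketch of the AR1 correction."""
    T = len(y)
    X = np.column_stack([np.ones(T), np.arange(1.0, T + 1)])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    u = y - X @ b
    rho = (u[1:] @ u[:-1]) / (u[:-1] @ u[:-1])
    A = np.eye(T) - rho * np.eye(T, k=-1)   # quasi-differencing matrix
    A[0, 0] = np.sqrt(1.0 - rho**2)         # Prais-Winsten first-row scaling
    bs, *_ = np.linalg.lstsq(A @ X, A @ y, rcond=None)
    return bs[1], rho

rng = np.random.default_rng(11)
T = 372
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.8 * u[t - 1] + rng.normal(scale=0.1)
y = 0.002 * np.arange(1, T + 1) + u

slope, rho_hat = prais_winsten_trend(y)
print(f"slope={slope:.4f}, rho={rho_hat:.2f}")
```

The quasi-differencing leaves the slope estimate roughly unchanged but makes the transformed residuals approximately serially uncorrelated, so the reported standard errors are closer to honest; any remaining higher-order autocorrelation is what the AR1 assumption ignores.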
In Table II, the panel estimator at the LT layer shows that the observations as a group (column 2) exhibit a significant trend of 0.110 °C per decade, compared to a model trend (column 9) of 0.272 °C per decade. The balloon and MSU series are each jointly significant (p = 0.026 and 0.042, respectively). In the MT layer, the model trend (0.253 °C per decade) remains significant. The mean observed trend is only 0.057 °C per decade. The panel-estimated standard error implies that it is insignificant (p = 0.272), while the VF05 score implies significance at 10%. Among observational series only RSS is individually significant, echoing the VF05 results. The MSU and balloon series are each jointly insignificant. Figures 2 and 3 show the trend magnitudes.
Figure 2. Modeled and estimated trends (1979–2009, °C per decade) in the tropics, LT layer. 95% confidence intervals shown.
Figure 3. Modeled and estimated trends (1979–2009, °C per decade) in the tropics, MT layer. 95% confidence intervals shown.
In Table III, the p values of the test scores on a hypothesis of equality between the indicated trends are shown in the bottom row. On data ending in 2009, the trend differences between models and observations (column 2) are significant in both the LT (p = 0.002) and MT (p = 0.000) layers, as was the case with the VF05 tests. The model-observation difference is significant for all data products at both layers, except for the RSS series in the LT layer (p = 0.059).
In the last columns of Table III, we test the differences among the observational series. As was the case with the VF05 tests, the balloons and MSU series are not significantly different from each other (p = 0.880), but within the MSU category, the RSS and UAH series are significantly different (p = 0.000).
4. Discussion and conclusions
Econometric tools are increasingly being applied to climate data sets (Fomby and Vogelsang, 2002; Mills, 2010). We present two econometric methods for trend comparisons between data sets. Both add flexibility for multivariate comparisons and provide improved treatment of complex error structures. The multivariate testing method of Vogelsang and Franses (2005) yields a more robust estimator of the covariance matrix, but requires balanced data panels. Panel regression methods can accommodate comparisons of series of unequal lengths, but software limitations typically restrict treatment of within-panel autocorrelation to the AR1 case. In our example, the two methods yielded similar conclusions, indicating that the AR1 approximation in the panel model was likely not overly restrictive. In general, however, for the purpose of multivariate trend comparisons in climatology, we particularly recommend that the VF05 method enter the empirical toolkit.
In our example on temperatures in the tropical troposphere, on data ending in 1999, we find the trend differences between models and observations are only marginally significant, partially confirming the view of Santer et al. (2008) against Douglass et al. (2007). The observed temperature trends themselves are statistically insignificant. Over the 1979–2009 interval, in the LT layer, observed trends are jointly significant and three of four data sets have individually significant trends. In the MT layer, two of four data sets have individually significant trends and the trends are jointly insignificant or marginal depending on the test used. Over the interval 1979–2009, model-projected temperature trends are two to four times larger than observed trends in both the LT and MT and the differences are statistically significant at the 99% level.
Our methods assume that the trends are linear. We found no evidence of nonlinearity in the observed data, but some in the modeled data in the MT layer. In addition, the fact that the results are sensitive to the end date suggests that they might also be sensitive to the start date. Since the satellite data are unavailable prior to 1979, we cannot extend these series earlier. Interpretation of trend comparisons should, therefore, make reference to the time period analyzed, which, ideally, should have some intrinsic interest. In this case, the 1979–2009 interval is a 31-year span during which the upward trend in surface data strongly suggests a climate-scale warming process. As noted in the studies cited in Section 1, comparing models to observations in the tropical troposphere is an important aspect of testing explanations of the origins of surface warming.