Direct statistical downscaling of monthly streamflow from atmospheric variables in catchments with differing contributions from snowmelt

The climate–hydrological modelling chain is the most widely used approach for assessing likely future changes in streamflow but is also associated with large uncertainties. We present an alternative approach for modelling monthly streamflow directly from large‐scale atmospheric variables using statistical downscaling to streamflow for 42 catchments with flow generation regimes ranging from largely snowmelt‐driven to entirely rainfall‐driven. It aims to find the most appropriate combination of predictors and methods, which can best include information pertaining to snow storage and provide robust streamflow simulations under changing climate conditions. The effects of selection of probable predictors (five groups of potential variables from reanalysis data, three moving windows, calibration periods of two different lengths) and methods (eight combinations of pre‐processing methods and statistical models) were considered. Results indicate that the following selection of probable predictors gave a good and robust model performance: a combination of 13 independent climate variables and their lagged information, a 3‐month moving window and a calibration period of 31 years. Three combinations of methods performed best: (a) principal component analysis (PCA) and the relevance vector machine (RVM) statistical model using gridded data, (b) PCA and RVM using interpolated data and (c) the Pearson correlation analysis and RVM using interpolated data. Monthly streamflow was well reproduced for most catchments, with a median NSE (Nash and Sutcliffe efficiency) of 0.82 and PBIAS (percentage bias) of 0 in the calibration period and a median NSE over 0.60 and PBIAS less than −6 in the validation period. The best performance of the methods was associated with snow‐dominated catchments in inland regions. We also established that lagged information on atmospheric variables gave better results than snow variables from reanalysis data. The poorest results were associated with rainfall‐dominated catchments, and this was strongly correlated with the inter‐ versus the intra‐annual variability in monthly streamflow.

strongly correlated with the inter-versus the intra-annual variability in monthly streamflow.

K E Y W O R D S
empirical downscaling, machine learning, monthly streamflow, multi-linear regression, Norway, predictor selection 1 | INTRODUCTION Rapid climate change urgently requires credible climate and hydrological projections for developing adaptation strategies. This is especially true for the northern high latitudes, that are expected to warm more rapidly than mid-latitudes under climate change (IPCC, 2014). Consequently, significant changes in the hydrological cycle are likely to occur in northern Europe, including increases in mean and extreme precipitation (Jacob et al., 2014), changes in soil moisture and runoff (Donnelly et al., 2017;Ruosteenoja et al., 2018) and shifts in snow regimes (Raisanen, 2016).
To date, most hydrological impact studies of climate change involve a three-step approach (Olsson et al., 2016); the so-called climate-hydrological modelling chain. The first step is to generate climate projections using global climate models (GCMs). The GCM outputs are either dynamically downscaled by regional climate models (RCMs) or statistically downscaled for use in regional scale analyses. The second step is to simulate streamflow or hydrological processes using hydrological models and time series derived from the downscaled climate projections. The final step is to analyse the projected changes in the hydrological cycle using the simulations, and these analyses are then applied in developing adaptation strategies. This three-step approach has been, for example, applied in the Norwegian government report "Climate in Norway 2100" (Hanssen-Bauer et al., 2017).
Several weaknesses and large uncertainties exist within such a complex climate-hydrological modelling chain. First, the RCMs are computationally expensive, so it is not possible to apply them to all GCM outputs. For example, within the World Climate Research Program Coordinated Regional Downscaling Experiment (CORDEX) initiative for Europe (www.euro-cordex.net), several RCMs could only provide downscaling of one GCM per RCM. Second, post-processing methods, which statistically downscale the coarse-resolution GCM/RCM outputs into finer resolution climate information and correct biases that exist in climate model outputs (Mueller and Seneviratne, 2014), are often criticized for having too simple statistical methods and implicit assumptions (Ehret et al., 2012;Hewitson et al., 2014;Maraun, 2016;Maraun et al., 2017). Last, but not least, the choice of RCMs, statistical post-processing methods, and hydrological models, both their structures and their parameterizations, contributes to substantial uncertainties in impact projections (Hagemann et al., 2011;Lawrence and Haddeland, 2011;Dobler et al., 2012;Chen et al., 2013;Meresa and Romanowicz, 2017).
To avoid these problems, one alternative is to apply a direct statistical downscaling of GCM outputs to hydrological variables, such as streamflow, assuming that the statistical relationship is also valid under changing climatic conditions in the future. The direct downscaling method generally applies synoptic-scale atmospheric variables that are better resolved by GCMs than the local or regionalscale temperature and precipitation series (Cannon and Whitfield, 2002). By directly linking GCM outputs and hydrological variables, uncertainty introduced by different RCMs, post-processing methods and hydrological models can, in principle, be avoided. In addition, such a method is simple and computationally efficient, so it can be easily applied to a full ensemble of GCM outputs to account for the uncertainty contributed by the GCMs (Das and Nanduri, 2018). However, the direct downscaling method has also been criticized for its over-simplification of the hydrological cycle and its lack of water stores and transfers within the soils and groundwater of a catchment (Xu, 1999). Additionally, some early studies indicated a poor performance of the direct downscaling methods (Wilby et al., 1999) and discouraged the further application of this approach. Furthermore, the direct downscaling method generally cannot take into account other important factors affecting streamflow variability, such as variations in land use and soil cover over time.
Nevertheless, recent studies applying direct downscaling methods using advanced statistical models show reasonable results for monthly streamflow (Noori et al., 2011;Sachindra et al., 2013;Tofiq and Guven, 2014;Okkan and Inan, 2015;Sachindra et al., 2015) and seasonal flows (Landman et al., 2001;Ghosh and Mujumdar, 2008;Das and Nanduri, 2018) across the world. Some studies have even succeeded in simulating streamflow with a higher temporal resolution, such as daily or 5-day streamflow (Cannon and Whitfield, 2002;Tisseuil et al., 2010). Joshi et al. (2016) found that the direct downscaling using Sparse Bayesian Learning outperformed the climate-hydrological modelling approaches with respect to lowflow indices and Sidibe et al. (2020) found that this method was more efficient than hydrological models for poor-data regions.
Most of the abovementioned studies focus on one catchment, where streamflow is mainly generated from rainfall. Only Cannon and Whitfield (2002) and Tisseuil et al. (2010) provided a comprehensive analysis of various downscaling methods for 21 and 51 catchments in different flow regimes, respectively. The number of potential variables, which were selected directly from the reanalysis data based on literature review or experiences, varied substantially in these studies, ranging from 1 to 26. The time steps of the variables were generally consistent with the time step of predictands, but some studies (e.g., Cannon and Whitfield (2002)) also applied information on antecedent conditions to describe lagged effects, for example, snow accumulation, soil and groundwater retention. The gridded climate variables can be used either directly as probable predictors or interpolated relative to the catchment considered, and they were usually standardized or normalized prior to the downscaling process. Principal component analysis (PCA) and correlation analysis were the most used pre-processing methods for reducing the number of probable predictors. Various machine learning techniques have been tested in recent studies, and they were often compared with linear regression models. There were no clear guidelines for selecting calibration and validation periods. In general, the training data accounted for 70-80% of the time series and the rest was used as the testing subset (Tisseuil et al., 2010;Joshi et al., 2013;Tofiq and Guven, 2014;Okkan and Inan, 2015). Das and Nanduri (2018), however, used 90% of the data as the training subset in order to maximize the number of data available for training the nonlinear models. Detailed information about these studies is listed in Appendix A.
There has been very little previous work using direct downscaling methods to predict hydrological variables in Norway, and no studies have shown that such methods can be applied to the full range of seasonal flow regimes found in Norway, ranging from snow-dominated inland regions to rainfall-dominated regions along the coast. The only available previous study was reported by the Norwegian Meteorological Institute (Engen-Skaugen et al., 2007) and considered precipitation and temperature as separate univariate predictors of monthly streamflow in several catchments. They found that the best statistical relationships between monthly rainfall and monthly streamflow were during the winter months in catchments with predominantly rainfall-driven flow regimes. Overall, poor results were reported for most catchments, particularly in cases where seasonal flow regimes are strongly controlled by snowmelt processes. This is not surprising, as one expects a one-to-one relationship between monthly streamflow and precipitation or temperature to be most evident when rainfall rather than snowmelt drives streamflow. We, therefore, reconsider here the general downscaling approach using more advanced statistical methods, a wider range of atmospheric variables as potential variables, and a larger selection of catchments than Engen-Skaugen et al. (2007).
Unlike many previous studies that have only applied the direct downscaling method to one catchment, we selected up to 42 unregulated catchments distributed across the country, having flow regimes ranging from rainfall-dominated to snow-dominated (Section 2).We selected the methods from the previous direct downscaling studies with most promising results (Section 3). In addition, we tested different techniques for the selection of predictors, and these were rarely reported in other studies. As a result, we were able to present an extensive statistical downscaling analysis relating large-scale climate information to catchment-scale streamflow in Norway (Section 4). We discuss and compare our results with other direct downscaling studies in Section 5. Finally, we summarize the findings and explain the further application of the method for projecting climate impacts on monthly runoff in the gauged catchments in Norway (Section 6). The objectives of this study can be formulated as (a) to apply different predictors and downscaling methods to 42 Norwegian catchments representing a range of flow regimes, and (b) to find the most appropriate combination of predictors, pre-processing methods and statistical models for further use in the assessment of climate change impact on streamflow in Norway.

| STUDY AREA
Since statistical relationships derived between GCM outputs and streamflow cannot take into account flow regulation, we selected 42 catchments without significant regulation representing different regions in Norway ( Figure 1a). All the daily discharge data for these catchments have been quality checked by the Norwegian Water Resources and Energy Directorate (NVE) for their suitability for analysing monthly streamflow, and daily flow records are continuously available for the period 1960-2010 for the 42 catchments. In addition, 35 of the 42 catchments have longer time series starting in 1950 or earlier. The size of the catchments varies from 7 to 14,156 km 2 , and 83% of them are smaller than 1,000 km 2 .
The distribution of the catchments represents various climate and hydrological regimes, geographic conditions and landscape types in Norway. In general, these catchments are located in three climate regimes (the oceanic climate, continental climate and polar tundra climate) according to the Köppen-Geiger climate classification scheme (Peel et al., 2007). The mean annual temperature during 1971-2000 is highest (up to 7 C) along the coast of southern Norway, while it is as low as −4 C in the high mountains and northern Norway (Hanssen-Bauer et al., 2017). The annual precipitation also shows a distinct spatial distribution, ranging from 300 mm in south-eastern Norway and inland areas in northern Norway to more than 3,500 mm in the west. Consequently, the catchments in the north and in the mountainous regions receive large amounts of snow, with a snow ratio (snowfall/total precipitation) larger than 0.5 (Figure 1b), while the catchments along the coast mainly receive rainfall with a snow ratio less than 0.25.
The spatial patterns of rainfall and snow accumulation exert a strong control on the distribution of annual and seasonal streamflow in Norway. Although the average annual runoff is estimated to be 1,100 mm from 1971 to 2000 for the whole of Norway, this varies from under 400 mm in several catchments in the inland regions of eastern Norway and northernmost Norway (Finnmark) to over 3,500 mm in areas of western Norway (Hanssen-Bauer et al., 2017). The seasonal distribution of streamflow varies considerably from catchment to catchment, depending on the location, the median elevation and distribution of elevation in the catchment. However, it is generally found that catchments in or proximal to coastal regions have the highest streamflow in autumn and winter months during periods of higher rainfall. Catchments in inland regions and throughout Finnmark, on the other hand, have high streamflow due to snowmelt in late spring and summer months and very low streamflow in winter months.
The differences in primary runoff source, that is, rainfall versus snowmelt, leads also to pronounced differences in inter-annual versus. intra-annual variability in monthly streamflow. A streamflow variability index can be calculated as the ratio given by the average inter-annual variance in monthly streamflow (i.e., year-to-year variation F I G U R E 1 (a) The digital elevation model (DEM) of the mainland Norway and the location of the 42 catchments as well as the grid points of the National Center for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR) reanalysis data and (b) the snow ratio and streamflow variability index of the catchments for individual months) divided by the average intra-annual variance in monthly streamflow (i.e., month-to-month variation in a given year). Figure 1b shows that catchments with snow-dominated flow regimes tend to have low values of the variability index, whereas catchments with rainfall-dominated flow regimes have high values. The highest values are found along the coast in western and mid-to northern Norway. Detailed information on the 42 catchments is listed in Appendix B.
3 | METHOD AND DATA

| Data
As in previous direct downscaling studies, we selected the National Center for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR) reanalysis data (Kalnay et al., 1996) from 1949 to 2010 to test the downscaling models. The NCEP/NCAR reanalysis data were aggregated into a monthly time step, at 2.5 resolution covering the study domain of 2.5-32.5 E longitude and 55-72.5 N latitude ( Figure 1). Monthly time series of streamflow were obtained from NVE. The effect of the calibration period length, that is, the subset used as training data, was investigated in this study. This has not been considered in past studies, but can be a crucial issue when considering use of other reanalysis products with a shorter time series length, for example, ERA-Interim reanalysis (Dee et al., 2011) starting in 1979. The monthly time series of streamflow and the NCEP/NCAR data, which cover the 41-year period of 1950-1990 and the 31-year period of 1960-1990, were used for model calibration. The remaining data, covering the 20-year period from 1991 to 2010 were used for the model validation. In other words, we used 67% and 60% of the data as the training subset and 33% and 40% of data for validation, respectively.
The procedure for selecting appropriate variables for downscaling streamflow involves identifying potential variables, testing for correlations and developing approaches that account for snow storage. This produces five groups of potential variables: group 1, 22 variables; group 2, 13 independent variables; group 3, 13 independent variables, plus lagged information; group 4, 22 variables plus snow; and group 5, 13 independent variables plus snow. We explain each group of potential variables below.
First, up to 22 climate variables were identified as potential variables based on the previous studies (Appendix A). They are 2 m air temperature (T_2m), air temperature (T) at 500, 850, 1,000 hPa, surface air temperature (T_surface), surface skin temperature (T_skin), geopotential height (hgt) at 200, 500, 700, 850, 1,000 hPa, surface pressure, sea level pressure, relative humidity (rhum) at 500, 700, 850, 1,-000 hPa, 2 m specific humidity (shum_2m), specific humidity (shum) at 500, 850, 1,000 hPa as well as the average of specific humidity at all hPa levels (shum_all). Precipitation was not included as a probable variable, because the main objective of this study is to test direct downscaling methods using atmospheric variables that are better resolved by GCMs and not necessarily those used by hydrological models as input.
Since some of the variables can be highly correlated, we computed the Pearson correlation coefficient (Pearson, 1895) for each pair of them ( Figure 2). The correlation matrix indicates high positive correlations between the temperature variables and some of the specific humidity and pressure variables. Based on this matrix, we selected 13 relatively independent variables with correlation coefficients less than 0.7 as group 2 of potential variables. This group includes T at 500, 1,-000 hPa, hgt at 500, 850, 1,000 hPa, surface pressure, rhum at 500, 700, 850, 1,000 hPa and shum at 500, 850, 1,000 hPa.
Because streamflow in the summer half-year is determined not only by the summer climatic conditions but also by the amount of snow storage in the previous winter in many catchments in Norway, it is important to consider the antecedent conditions contributing to snow accumulation. We applied the same lagged information suggested by Cannon and Whitfield (2002) in this study, that is, the average of the variables for the previous two 3-month periods as model inputs. The lagged information was only estimated for the 13 variables because it leads to an excessively large number of variables if used with the 22 variables (i.e., 22 × 3 = 66 variables).
The fourth and fifth groups of variables include snow water equivalent (SWE), which is available from the NCEP/NCAR reanalysis data. We assumed that the difference in the SWE between two adjacent months could also be an indicator of snowmelt in summer and can possibly be used as a probable variable to replace the lagged information. In order to test this assumption, we added the snow information to the first two groups of potential variables.

| Modelling structure
The major novelty of this study is the application of various combinations of probable predictors, pre-processing methods and statistical models. Investigation of these combinations allows us to find the most appropriate downscaling method for Norway, where seasonal flow regimes cover a wide range of combinations of snow versus rainfall as the dominant streamflow generation process. The combinations of probable predictors involve the five groups stated in Section 3.1, sampled using three alternative moving windows (1-, 3-, 5-month moving averages) and two calibration periods. We consider eight combinations of pre-processing and statistical models: two types of predictor data (gridded vs. interpolated data), two techniques for reducing the number of probable predictors (Pearson correlation analysis and PCA) and two statistical models (multi-linear regression and relevance vector machine (RVM)).
In order to explain the methods and data in detail, we use one small catchment (catchment ID 24.9-Tingvatn, shown in red in Figure 3) in the south of Norway as an example. First, we selected six grid points from the reanalysis data adjacent to this catchment (the blue dots in Figure 3). For large catchments with a drainage area larger than 10,000 km 2 , 10 grid points were selected. The number of grid points were determined based on a preanalysis for all catchments, and the six and 10 grid points gave relatively better downscaling results than other numbers.
In previous downscaling studies, the gridded data were used directly, for example, in Sachindra et al. (2013), or were interpolated to the catchment, for example, in Tisseuil et al. (2010). These two choices of data sources lead to substantial differences in the number of probable predictors. For the current example, the number of gridded probable predictors is six times the number of interpolated ones ( Figure 3). If we select group 1 of potential variables, there will be in total 6 × 22 probable predictors from the gridded data, while only 22 from the interpolated data. The advantage of using the interpolated data is the shorter computational time, but the correlation between the predictors and streamflow can be weaker than that between the climate data from the nearest grid point to the catchment. Both the gridded and interpolated climate data as well as the streamflow data were standardized by subtracting the mean and dividing by the standard deviation in the calibration period. The standardization of the probable predictors in the validation period was performed with the mean and standard deviation corresponding to the calibration period of the data set. Sachindra et al. (2013) simulated monthly streamflow using 12 statistical models, each derived separately from the historical climate and streamflow data for the calendar month of interest. The previous studies as well as our preliminary test showed that use of 12 models, one for F I G U R E 2 The correlation matrix of the 22 potential variables, which are 2 m air temperature (T_2m), air temperature (T) at 500, 850, 1,000 hPa, surface air temperature (T_surface), surface skin temperature (T_skin), geopotential height (hgt) at 200, 500, 700, 850, 1,000 hPa, surface pressure, sea level pressure, relative humidity (rhum) at 500, 700, 850, 1,000 hPa, 2 m specific humidity (shum_2m), specific humidity (shum) at 500, 850, 1,000 hPa as well as the average of specific humidity at all hPa levels (shum_all) each month, gave better monthly streamflow than use of one model that predicts streamflow for all months simultaneously. Such monthly statistical models may be valid for the historical period for which they were developed, when changes in climate are not substantial. However, climate projections show that there is a likely shift in seasonal climatic conditions in Norway by the end of this century under the high emissions RCP scenario (RCP 8.5) (Hanssen-Bauer et al., 2017). Thus, the statistical downscaling models derived from historical data should be capable of accommodating such a shift if they are to be used for climate impact assessments in Norway.
In order to improve the model robustness for use with future projections, we therefore applied 1-, 3-and 5-month moving windows to the probable predictors and streamflow to derive the monthly statistical models (Figure 3). The 1-month moving window performs exactly the same as the method presented in Sachindra et al. (2013), i.e., 12 statistical models are derived from the climate variables and streamflow in the individual months separately. The 3-and 5-moving window methods include 3-and 5-month data to derive the models for the centred month. For example, the statistical model for February is derived from the data in January, February and March using the 3-month moving window. Such a moving window method was also applied to correct the bias of RCM outputs for Norway to improve the robustness of transfer functions (Wong et al., 2016).
As discussed above, there is a large number of probable predictors, especially when using data from grid points. To reduce the dimensionality of the data, two techniques for selecting the best explanatory predictors (Pearson correlation analysis and PCA) were applied ( Figure 3). For the entire study period , the Pearson correlation coefficients between each probable predictor and streamflow were calculated. Ideally, the probable predictors that showed the best, statistically significant (95% confidence level, p = .05) correlations with the streamflow should be considered as potential predictors, which are used for calibrating statistical models for each calendar month. However, for some catchments in winter months, dozens of predictors would be selected based on this criterion, especially using the gridded data. In summer months, no potential predictors were found for some catchments, mainly due to lack of snow storage information in the probable predictors. In order to solve these problems, we selected a maximum of 15 probable predictors with the highest correlation coefficients as the potential predictors. This maximum number of predictors was determined based on a preliminary downscaling run for all catchments. For the PCA, we selected the first eight principal components (PCs) as the potential predictors because they can explain more than 95% of the variances in almost all our cases.
The potential predictors were then used in two statistical models (Figure 3): the multi-linear regression model and the machine learning technique RVM (https://cran.rproject.org/web/packages/kernlab/kernlab.pdf). We applied the train() function from the caret package in R (http:// topepo.github.io/caret/index.html) to fit linear regression with backward selection, which starts with all predictors in the model, iteratively removes the least contributive predictors until all predictors are statistically significant (Kassambara, 2018). The RVM method was chosen because it was applied in several previous direct downscaling studies F I G U R E 3 The structure of the downscaling analysis procedure. The NCEP-NCAR grid points for a single catchment in southern Norway (in red) are also shown and always showed good model performance (Ghosh and Mujumdar, 2008;Okkan and Inan, 2015;Das and Nanduri, 2018). Both models used 10-fold cross-validation to estimate the mean squared error and to select the best model based on the smallest error.
The simulated monthly streamflow in the calibration and validation periods were evaluated using the nondimensional Nash and Sutcliffe (1970) efficiency criterion (NSE) and the percentage bias (PBIAS) in streamflow. According to the model performance ratings suggested by Moriasi et al. (2015), NSE ≥ 0.7 and PBIAS ≤ 10 are considered to be "good" simulation results, while NSE ≥ 0.50 and PBIAS ≤ 15 are considered to be "satisfactory" for monthly streamflow simulations.

| Effects of selection of different probable predictors
In this section, we analysed the effects of the potential variables, moving windows and calibration periods, which can influence the selection of probable predictors, on the simulated streamflow in terms of the NSE and the PBIAS values (Figures 4 and 5). Each boxplot in the figures consists of the criteria generated by eight combinations of pre-processing and statistical models for 35 catchments (calibration period 1950-1990, Figure 4) and 42 catchments (calibration period 1960-1990, Figure 5), respectively.
F I G U R E 4 Comparison of Nash and Sutcliffe efficiency (NSE) and percentage bias (PBIAS) in streamflow using five groups of potential variables (22 variables, 13 variables, 13 variables plus lagged information, 22 variables plus snow information and 13 variables plus snow information) and three moving windows (1-, 3-and 5-month moving windows) in the calibration  and validation (1991)(1992)(1993)(1994)(1995)(1996)(1997)(1998)(1999)(2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010) periods. Each boxplot consists of the results generated by eight combinations of pre-processing methods and statistical models for 35 catchments When the 41-year time series were used for calibration (red boxes in Figure 4), the statistical models developed using data from a single month (1-month moving window) gave better results in terms of both the NSE and the PBIAS than did the models based on 3-and 5-month moving windows. The median NSE for all catchments varied from 0.81 to 0.84, from 0.71 to 0.73 and from 0.66 to 0.67 during the calibration period using the 1-, 3-and 5-month moving windows, respectively. The PBIAS also showed a similar pattern during the calibration period, with a median value of 0 for the 1-month moving window and from −2 to −1 for the 3-and 5-month moving windows.
During the validation period (blue boxes), the model performance did not differ as substantially as it did in the calibration period between different moving windows. The median NSE varied from 0.57 to 0.68, from 0.57 to 0.61, from 0.54 to 0.56 using the 1-, 3-and 5-month moving windows, respectively. The absolute value of the median PBIAS was between 4 and 8 using all three moving windows. The loss of model performance between the calibration and the validation periods was largest when the 1-month moving window was used. Such results indicated that although the models derived from the single monthly data gave the best model performance during the calibration period, they were not as stable in the validation period as were the models that used the broader windows and thus sampled a wider range of conditions. Considering model performance versus transferability in different periods as well as the possible seasonal shift under a warming climate in Norway, the 3-month moving window seemed to be an appropriate compromise method for projecting future streamflow using climate change scenarios.
Compared with the effects of moving windows, the five groups of potential variables did not influence the model performance substantially. The largest difference among the medians of NSE from the different groups was F I G U R E 5 As in Figure 4, but for the calibration period 1960-1990 and 42 catchments 0.11 using the 1-month moving window during the validation period. Despite the minor differences, the best model performance was found using group 3, consisting of 13 variables plus lagged information, and there was almost no difference between the model performance using the 22 and 13 variables. The additional snow information only improved the streamflow simulations marginally in the calibration period as compared with the groups of 22 and 13 variables, but it leaded to the poorest model performance in the validation period when the 1-month moving window was used instead of a broader window.
A similar pattern of results for the various combinations of moving windows and potential variables were found when only 31-year time series were used for calibration ( Figure 5). The model performance during the 31-year calibration period was generally better than in the 41-year period. The median NSE varies from 0.83 to 0.87, from 0.73 to 0.75, from 0.67 to 0.69 using the 1-, 3-, and 5-month moving windows, respectively. Although the calibration periods cover different lengths in Figures 4 and 5, the model performances in the validation period are similar, with the median NSE ranging from 0.58 to 0.66, from 0.59 to 0.61, from 0.54 to 0.56 using different moving windows. The PBIAS values are also comparable for the two calibration period lengths considered. This result suggests that a 31-year calibration period may be sufficient for deriving robust statistical downscaling models for Norwegian catchments.

| The best downscaling methods for Norwegian catchments
Based on the best model performance shown in Section 4.1, we analysed the eight combinations of preprocessing and statistical models using the 13 variables plus lagged information, a 3-month moving window and a calibration period from 1960 to 1990 ( Figure 6). In the calibration period, the two combinations of methods using PCA and RVM for both gridded and interpolated data gave the best model performance in terms of both median NSE (0.85) and PBIAS (0). The second best combination consists of the interpolation, Pearson correlation analysis and RVM with NSE of 0.82 and PBIAS of 0. The use of multi-linear regression model showed poorer performance than the ones using RVM and the methods using PCA and multi-linear regression models gave the poorest results in terms of NSE.
In the validation period, the model performance is degraded by 0.07-0.23 in terms of the NSE values and by up to 6 percentage points in terms of the PBIAS. The combination of interpolation, correlation analysis and RVM showed the best results in terms of median NSE (0.68) while the two combinations using PCA and RVM performed best in terms of median PBIAS (0). Taking the performance in both calibration and validation periods into account, we suggest that all three abovementioned combinations are appropriate to statistically downscale monthly streamflow in Norwegian catchments. F I G U R E 6 Comparison of the Nash and Sutcliffe efficiency (NSE) and percent bias (PBIAS) using 13 variables plus lagged information and a 3-month moving window in the calibration (1960-1990) and validation (1991-2010) periods. Each boxplot consists of the results for 42 catchments For the sake of simplicity, we focused on the combination of gridded data, PCA and RVM in the following results. Figure 7 shows the spatial distribution of NSE values. In the calibration period, all the catchments had NSE values larger than 0.7, with the exception of a few catchments along the western coast. In the validation period, the spatial variability in the NSE values was more pronounced than in the calibration period. The downscaling models did not satisfactorily simulate the streamflow in several catchments along the western and southern coastlines, and the poorest NSE values (NSE < 0.5) were mainly found near the coast in western, mid-and northern Norway. In contrast, the statistical downscaling models were robust in most inland areas with good model performance for both calibration and validation periods.
This distinct spatial pattern was also seen in the relationship between the snow ratio of the catchments and the NSE values for the validation period (Figure 8a), having an r 2 value of 0.49. The snow-dominated catchments tended to have higher NSE values than the rainfalldominated catchments. In contrast, the rainfall-dominated catchments tended to have lower NSE values due to higher inter-annual streamflow variability in the monthto-month pattern of flow, as defined in Section 2. Figure 8b shows a high correlation between the calculated variability index and the overall performance of the downscaling approach. A similar spatial pattern of model performance as well as the relationships can be found using other combinations of methods (results not shown).
We also compared the observed and simulated monthly streamflow for one rainfall-dominated catchment (catchment ID 97.1) along the western coast and one snow-dominated catchment (catchment ID 2.614) to consider how the model performance differs in the two catchments, given their very different seasonal flow regimes (Figure 9). The statistical downscaling models can give good simulation results in terms of NSE values Legend NSE < 0.5 0.5 -0.6 0.6 -0.7 0.7 -0.8 > 0.8 Norway (a) Calibration (1960Calibration ( -1990 (b) Validation (1991Validation ( -2010 F I G U R E 7 The spatial distribution of NSE values for the estimates of monthly streamflow developed using 13 variables plus lagged information, a 3-month moving window and a combination of gridded data, principal component analysis and the relative vector machine model. Results are shown for (a) the calibration period ; and (b) the validation period (1991)(1992)(1993)(1994)(1995)(1996)(1997)(1998)(1999)(2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010) (b) (a) F I G U R E 8 Relationship between the snow ratio of the 42 catchments and the NSE (a) and the relationship between the streamflow variability index and NSE (b) during the validation period for the monthly streamflow estimates obtained using 13 variables plus lagged information, a 3-month moving window, a calibration period of 1960-1990 and the combination of gridded data, principal component analysis and the relative vector machine model and the PBIAS for both catchments in the calibration period. It seems, however, more difficult to reproduce the complex variability of monthly streamflow which varies highly both between months in a given year and between years (catchment ID 97.1) than monthly streamflow in a catchment in which a single, snowmeltdriven peak dominates the annual pattern. The poorer simulation results for the rainfall-dominated catchment are associated with an underestimation of winter low flows and an overestimation of the autumn peaks. These results indicate a high sensitivity to the input data and a relatively poor temporal transferability of the downscaling models for all months in this rainfall-dominated catchment.
Finally, it is interesting to know the most important climate variables that influence streamflow generation in Norway. Thus, we analysed how often the 13 potential variables and their lagged information were selected as the real predictors by the method of interpolation, Pearson correlation analysis and linear models. The linear models were chosen due to their simplicity in interpreting the predictors and this combination of methods gave a relatively good model performance for the validation period. On average, there are three real predictors used by each linear model. Figure 10 shows that the most frequently used climate variables are the specific humidity at 1,000 hPa and temperature at 1,000 hPa and its lagged information. The various geopotential heights form the secondary important group of climate variables and the relative humidity variables play the least important role in predicting streamflow in Norway.

| DISCUSSION
This study has investigated the effects of five groups of potential variables for developing downscaling models, which are rarely reported in other studies but are important for simulations in snow-dominated catchments. Our results confirm that lagged information is necessary and useful for obtaining good models in snow-dominated catchments, as suggested by Cannon and Whitfield (2002). The use of lagged information also outperforms the use of snow storage information available from the reanalysis F I G U R E 9 Observed and simulated monthly discharge for two selected catchments in both calibration and validation periods (left panels) and the long-term average monthly discharge in the validation period (right panels). Here the statistical downscaling models were derived from 13 variables plus lagged information, a 3-month moving window, a calibration period of 1960-1990 and the combination of gridded data, principal component analysis and the relative vector machine model data (see Figures 4 and 5 between the variable groups "13 Var + lag" and "13 Var + snow"). This can be partly explained by the fact that the large-scale atmospheric variables are better resolved than are the local land-surface process variables, such as snow storage, by the coarse resolution climate models.
Recent studies published during the last decade have used dozens of potential variables to simulate the streamflow. Our results show, however, that by using only 13 independent variables one obtains practically the same simulation results as those generated by the full set of 22 variables. The advantages of using a small number of variables are the shorter computational time to calibrate the models and the avoidance of overfitting the statistical models; thus, a preliminary selection of potential variables, in which relationships between variables are identified, is recommended prior to the pre-processing of the climate data.
The test of moving windows is a novelty relative to previous studies of the direct downscaling of streamflow. Before we investigated the use of moving windows, we constructed one statistical model using the entire time series to simulate the relationship between atmospheric variables and streamflow in all months. The model performance (not shown) was poor in terms of the NSE values and was, therefore, rejected for further use in developing climate impact projections. We then applied the method suggested by Sachindra et al. (2013) in which 12 statistical models are developed using the time series corresponding to each month separately, that is, the use of a 1-month moving window. This method improved the simulated monthly streamflow substantially in the calibration period, but the model performance is reduced dramatically in the validation period. Because our ultimate goal is to project streamflow under a changing climate, this method is most likely not robust enough for non-stationary conditions. As a compromise solution, we retained the concept of 12 statistical models, but tested the use of 3-month and 5-month moving windows to select a broader seasonal sample of data. This solution improves the transferability of the statistical models and provides good model performance in both the calibration and validation periods. Hence, we would also recommend testing the use of different moving windows in similar downscaling studies.
Regarding pre-processing methods, interpolation of variables between grid points appears to be advantageous because it reduces the number of probable predictors. However, we should again highlight that most of the catchments considered in this study are much smaller than the grid cells of the reanalysis data. For larger catchments, different results may be obtained if the gridded climate data is interpolated to the hydrological stations (such as Joshi et al. (2013)) or to the centroids of the catchments (such as Huang et al. (2013)) because the distance between the outlet and centroid of the catchment is long. Thus, use of original gridded data may be preferred for large catchments in other regions.
PCA is a common method for reducing the dimension of input data and has been used in several downscaling studies. Our results showed the best model performance using the combination of PCA and RVM and the poorest performance using PCA and linear models. Hence, it F I G U R E 1 0 Histogram of the 13 potential variables plus their lagged information, which were selected as the real predictors by the combination of interpolation, correlation analysis and multi-linear regression model for the 42 Norwegian catchments would be good to test the combination of PCA with different statistical models. However, an advantage of non-PC models is that the predictors used retain their physical meaning and their roles can therefore be interpreted in the final results. This is often not true for sets of variables after they are reduced into their corresponding PCs. In addition, our results confirm the conclusion obtained by previous studies (Cannon and Whitfield, 2002;Tisseuil et al., 2010;Joshi et al., 2013;Sachindra et al., 2013), that the nonlinear models generally give better simulation results than do linear models.
The downscaling models have a poorer performance for rainfall-dominated catchments along the coast of western, mid-, and northern Norway, but give good simulation results for most snow-dominated catchments in inland areas. This result can be somewhat surprising, as it is often assumed that it is more difficult to apply direct downscaling to snow-dominant catchments, due to the weaker correspondence between the dominant atmospheric conditions in a given season and streamflow. It is also in contrast with early work on direct downscaling of runoff in Norway using precipitation and temperature as individual univariate predictors (Engen-Skaugen et al., 2007).
In the present study, however, we have included the use of lagged information and a range of possible atmospheric variables, and this has led to a superior performance in the snow-dominated catchments, particularly relative to catchments in which rainfall is entirely dominant in streamflow generation. The weaker performance of the models for the rainfall-driven catchments may be partly due to the poor representation of coarse gridded data for the transitional areas between the ocean and the land surface. However, it also reflects the higher variability in the runoff time series in rainfall-dominated catchments; and, indeed, it is in these catchments that calibration of hydrological models using observed data is most difficult (e.g., Lawrence et al. (2009)). In addition, it may well be that atmospheric variables with a higher spatial resolution than the NCEP/NCAR data used here are necessary for modelling monthly streamflow in the smaller, rainfall-dominated catchments along the western coast of Norway. To test the effect of resolution, we applied the ERA-Interim data at the resolution of 0.75 for the same catchments using the best combination of methods obtained in this study. The median NSE of the simulated monthly discharge for the 42 gauges is 0.83 and 0.61 for the calibration period 1981-2005 and validation period 2006-2015. These results are comparable with the ones using NCEP/NCAR data, and thus the best methods and results found in this study should be robust. Finally, it may be that the inclusion of precipitation as a predictor variable would improve the model performance in these catchments for the higher resolution data.

| CONCLUSION
This paper presents a methodological study to systematically analyse various combinations of predictors and methods for downscaling of monthly streamflow from atmospheric variables. We tested different predictors (five alternative groups of potential variables, three possible choices of moving windows, two calibration periods of different lengths), and methods (eight combinations of pre-processing methods and statistical models) for developing monthly streamflow for 42 Norwegian catchments. Out of the tested data, we found that group 3, consisting of 13 independent variables plus their lagged information, represents the most suitable set of potential variables for simulating monthly streamflow, especially for snow-dominated catchments. The use of a 3-month moving window to derive the statistical models for the centred month gave good simulation results and improved the stability of model performance in different periods. The validation results based on a 31-year calibration period were comparable with results using a 41-year calibration period, indicating that a 31-year time series was sufficient for training the statistical models for the range of catchments considered here. Out of the tested methods, we found that the best combinations of preprocessing and statistical models were (a) PCA combined with RVM using gridded data, (b) PCA and RVM using interpolated data, and (c) the Pearson correlation analysis and RVM using interpolated data. The temperature and specific humidity variables were the most frequently selected variables for simulating streamflow generation in Norway for linear models.
Based on the best combinations of predictors and methods, the simulated monthly streamflow for the 42 catchments generally show a good agreement with the observations, with a median NSE over 0.82 and PBIAS of 0 in the calibration period and a median NSE over 0.60 and PBIAS less than −6 in the validation period. The downscaling models had a poorer performance for some rainfall-dominated catchments along the western coast of western, mid-and northern Norway coastline, but gave good simulation results for most snow-dominated catchments in the inland areas. Thus, we conclude that the direct statistical downscaling approach is capable of reproducing the recent observed changes in monthly streamflow in most of the Norwegian catchments we have considered.
Based on the conclusion, we should be able to apply the direct downscaling models for the same gauged catchments using input data derived from climate models, rather than reanalysis data. Similarly to other climate impact studies using hydrological models, we will firstly train models for each catchment using ERA-Interim reanalysis data that are upscaled (i.e., aggregated) to match the GCM resolution. Secondly, we will run the calibrated downscaling models using GCM projections that have been bias corrected relative to the upscaled ERA-Interim reanalysis data using an anomaly approach for multi variables (e.g., wind, temperature, humidity and pressure). An example of this bias correction method can be found in Pontoppidan et al. (2018). Thirdly, changes in monthly runoff can be estimated by comparing the simulation results in the scenario and reference periods. The ultimate goal of our work is to compare the projected changes using the direct downscaling methods with those simulated using hydrological models and the downscaled bias-corrected climate data to assess the similarities and differences in the outcomes from these two distinct approaches.  A P P E N D I X A (Continued)