Missing data imputation of high‐resolution temporal climate time series data

Analysis of high‐resolution data offers greater opportunity to understand the nature of data variability, behaviours, trends and to detect small changes. Climate studies often require complete time series data which, in the presence of missing data, means imputation must be undertaken. Research on the imputation of high‐resolution temporal climate time series data is still at an early phase. In this study, multiple approaches to the imputation of missing values were evaluated, including a structural time series model with Kalman smoothing, an autoregressive integrated moving average (ARIMA) model with Kalman smoothing and multiple linear regression. The methods were applied to complete subsets of data from 12 month time series of hourly temperature, humidity and wind speed data from four locations along the coast of Western Australia. Assuming that observations were missing at random, artificial gaps of missing observations were studied using a five‐fold cross‐validation methodology with the proportion of missing data set to 10%. The techniques were compared using the pooled mean absolute error, root mean square error and symmetric mean absolute percentage error. The multiple linear regression model was generally the best model based on the pooled performance indicators, followed by the ARIMA with Kalman smoothing. However, the low error values obtained from each of the approaches suggested that the models competed closely and imputed highly plausible values. To some extent, the performance of the models varied among locations. It can be concluded that the modelling approaches studied have demonstrated suitability in imputing missing data in hourly temperature, humidity and wind speed data and are therefore recommended for application in other fields where high‐resolution data with missing values are common.


| INTRODUCTION
Climatic conditions such as precipitation, temperature, humidity, wind speed, wind gust and sea level pressure have been used over time in many meteorological, energy application, agricultural, ecological and hydrological studies (Firat et al., 2012;Xu et al., 2013;Lara-Estrada et al., 2018). Weather stations across the world continue to record and monitor various climatic parameters for climate classification, planning, modelling and management purposes (Firat et al., 2012;Lara-Estrada et al., 2018). In recent years, the threats of global warming and climate change (World Bank, 2012) have sparked a resurgent interest in the analysis and inference of climatic variables and related subjects in the natural, social and political sciences. Climate data are typically processed and analysed at low-resolution levels such as daily, weekly, monthly and yearly resolution (Firat et al., 2012, Kanda et al., 2018. In contrast, processing and analysing data at a high-resolution scale such as h minutes (h ≤ 60) results in the availability of an appreciable number of points even when the overall time period under investigation is short; e.g. a (leap) year-long hourly time series consists of 8,784 data points. Moreover, the analysis of high-resolution data offers greater ability to understand the nature of data variability, behaviours, trends and to detect small changes (Pincetl et al., 2015). In many instances, high-resolution climatic data are incomplete. Yozgatligil et al. (2013) attributed the gaps to faulty measuring instruments, which are caused by recording errors or malfunctioning equipment, or instances of meteorological extremities and remoteness, routine maintenance and sensor calibration (Coble et al., 2012). In effect, missing observations in climate data often occur consecutively for long periods of time (Simolo et al., 2010).
Climate studies require complete time series data which, in the presence of missing data, means that imputation must be undertaken. A literature search identified that several imputation methods have been applied to relatively low-resolution climate data, typically of daily, monthly and yearly resolutions. The unconditional mean imputation is very simple in application; however, results suggest that it is not robust and often results in an underestimation of the standard errors (Yozgatligil et al., 2013). The expectation maximization (EM) algorithm developed by Dempster et al. (1977) has been applied to extrapolate missing data in an analysis of monthly temperature time series, e.g. by Firat et al. (2012). Simolo et al. (2010) proposed the completion method with the ability to preserve the probability density function in imputing missing values in a daily precipitation dataset. Xu et al. (2013) applied the point estimation model of Biased Sentinel Hospitals-based Area Disease Estimation (P-BSHADE) to interpolate missing data in an annual temperature dataset. In a comparative study on the estimation of missing values in daily temperature and precipitation data, Kanda et al. (2018) reported that multiple regression using the least absolute deviation performed best based on four performance indicators, namely mean absolute error (MAE), root mean squared error (RMSE), coefficient of efficiency and skill score. Other methods such as the regularized EM algorithm (Schneider 2001), the Fourier fit, the artificial neural network (McCandless et al. 2011;Kashani and Dinpasho, 2012), the multilayer perceptron neural network, the EM-Markov chain Monte Carlo (Yozgatligil et al., 2013) and the Bayesian network (Lara-Estrada et al., 2018) have been applied in the imputation of missing observations in daily and monthly precipitation, temperature and humidity data.
The analysis of low-resolution climate data is commonly based on aggregation from higher-resolution datasets. It is important to note that, for incomplete datasets, the estimation of fundamental statistics such as the mean and covariance is challenging, mostly biased and can be misleading (Schneider, 2001). Therefore, in a field where missing observations are common, there should be clarity about how low-resolution data are derived and how missing values are handled. The effective handling of missing observations in finer-resolution data would guarantee a precise estimation of parameters to guide decision making processes. To the best of our knowledge, in the literature on the imputation of missing data in meteorological settings only low-resolution climate datasets were considered. Although there is some literature on imputation for high-resolution data in other areas, such as electricity consumption (Hyndman and Fan, 2015), audio signal processes (Smaragdis et al., 2009) and electrochemical ozone measurements (Pang et al., 2017), the underlying mechanisms of their data generating processes are different from those of climate datasets. The gap that needs to be addressed includes investigation and imputation of missing data in high-resolution climate datasets. This study evaluated the performance of univariate time series models by state-space methods and multiple linear regression models driven by other climatic variables to impute missing values in a 12 month time series of hourly measurements from four locations in Western Australia (WA). The climatic variables considered were temperature, humidity and wind speed.
In dealing with the imputation of missing data, the fundamental principles are to understand and use the nature of the data including the cause for the missing data occurrences. Climatic data are generally characterized by properties such as autocorrelation between time lags, seasonality, periodic trends, cycles and the homogeneity effect over geographical areas. These characteristics are profound in high-resolution data and could be useful information for developing imputation modelling schemes to "fill in" the periods of missingness (Moritz et al., 2015). According to Rubin (1976), missingness in data may be completely at random (MCAR, data are missing independently of both observed and unobserved data), random (MAR, given the observed data, data are missing independently of unobserved data) or non-random (MNAR, missing observations are related to values of unobserved data). There are also cases where the missing mechanism is censored (De Jong et al., 2016). Most imputation methods assume MCAR and MAR because their missing data mechanisms are said to be ignorable, implying that the imputation modelling does not require special models for the cause of missingness and the likely values for the gaps (Rubin, 1976;Moritz et al., 2015). However, a clearer understanding of the cause of missingness helps to infer the distribution of the data gaps in the dataset and would enable the design of reasonable simulators on test data for validation purposes (Moritz et al., 2015).
The development of MCAR and MAR imputation schemes relies on the nature of the relationship between variables. For instance, MAR driven models involving highly correlated variables typically yield highly plausible imputation values (van Buuren and Groothuis-Oudshoorn, 2011). In the case of univariate time series, MCAR and MAR imputations are similar and draw inferences on the behaviour of the variable over time (Moritz et al., 2015). In the modelling phase, there is a need to address the high levels of stochasticity in most climate time series. This can be achieved by applying models whose parameters evolve over time (Harvey, 1989). The classical autoregressive integrated moving average (ARIMA) and seasonal ARIMA models have been applied to the imputation of missing data in univariate time series (Yodah et al., 2013). However, structural time series models, also known as unobserved component models, have evolutionary properties to study dynamic phenomena including imputations (Moritz et al., 2015). Both additive and multiplicative time series can be modelled structurally. For instance, multiplicative models can be additively decomposed via the logarithm of model components and each component can be fitted with an independent autoregressive moving average or ARIMA process (Pollock, 2008). To capture the complexity of the behaviour of climate time series over time adequately and to fine-tune the parameter estimation process, smoothing functions with the ability to extract the signals in the time series can be jointly modelled with the statespace representation of time series models for prediction and imputation purposes (Moritz and Bartz-Beielstein, 2017). The common smoothing functions in time series applications are the Kalman filter and the locally weighted scatterplot smoother (lowess) curve (Bianchi et al., 1999;Moritz et al., 2015, Moritz andBartz-Beielstein, 2017). The Kalman filter is often preferred because it uses maximum likelihood to estimate the smoothing parameters and tends to give optimal results since the mean square errors of one-step-ahead predictions are minimized and also for its ability to provide sequentially updated estimates (Bianchi et al., 1999).
Alternatively, regression-based approaches such as the ordinary, lagged, Bayesian multiple linear regression and dynamic regression models are well-known modelling techniques for prediction and in cases of highly correlated variables produce better imputation estimates (Petris, 2010;van Buuren and Groothuis-Oudshoorn, 2011). Climate variables are generally correlated, e.g. temperature and humidity, and, in instances where weather stations collect multiple climatic conditions, imputations for missing observations in one missing variable can be inferred from the other observed variables.

| METHODS
This section presents the study area, data description and simplified versions of the modelling techniques evaluated in the imputation. Advanced time series models with state-space representation amenable to Kalman filter and smoothing algorithms (Pollock, 2008) are discussed. More detailed information about the time series modelling techniques presented below can be found in Harvey (1989), Harvey and Peters (1990), Pollock (2008) and Jalles (2009).

| Study area and data description
WA has a coastline stretching 12,889 km and 10 different climatic zones according to the Köppen-Geiger climate classification (see Figure 1). The Australian Government Bureau of Meteorology collects weather observation data from stations in 14 districts across WA. In this study, climate data from four locations ( Figure 1) were analysed between March 1, 2011 and February 29, 2012. These coastal weather stations were chosen because remote camera monitoring also occurs at these sites as part of ongoing research on recreational fishing (Steffe et al., 2017) and each site is located in one of the four marine bioregions off the WA coastline: Esperance in the South Coast region, Perth in the West Coast region, Exmouth in the Gascoyne Coast region and Broome in the North Coast (Ryan et al., 2015).
Precipitation, temperature, humidity, wind speed and direction and sea level pressure were provided by the Australian Bureau of Meteorology at an hourly resolution. The overall time period covered was March 1, 2011 to February 29, 2012. The imputation focused on missing observations in the temperature, humidity and wind speed time series data while the other variables were used as predictors in the modelling phase. The climatic variables analysed in this study will be used as a principal source of inference in building models to impute missing data from the remote cameras that record recreational boating activity related variables.
The study methodology adopted for the purposes of this study was five-fold cross-validation. In this scheme, missing data were imputed for five different folds of missing patterns and the resulting imputations were compared to the true values. For each of the four locations the time interval of greatest duration without missing data in the original dataset was identified. These complete sub-samples of hourly temperature, humidity and wind speed data formed the basis of our analysis. The assumption that data were missing at random (MAR) was imposed and artificial gaps of missing observations were created in the complete sub-samples to mimic the general nature of missing patterns in the records provided by the Australian Bureau of Meteorology. The percentage of missing data was set to 10%, as in the datasets provided for the four locations none recorded more than 10% missing observations. Table 1 presents the dates and data points for the four sub-samples and the number of missing values (NA) created.
The distribution of the lengths of gaps of missing observations in the five missing patterns is displayed in Figure 2.

| Autoregressive integrated moving averages (ARIMAs)
ARIMA models are perhaps the most used time series technique in an application. The models are theoretically and statistically sound, flexible and make few assumptions (Ho and Xie, 1998). For a univariate time series y t , an ARIMA(p, d, q) is defined as: where a t is a white noise process, p and q denote the order of the autoregressive and moving average processes, B is the backshift operator and d is the order of differencing. The vector θ = (θ 1 , …, θ p ) denotes the vector of p parameters for the autoregressive process and ϕ = (ϕ 1 , …, ϕ q ) is the vector of q parameters for the moving average process. These explain the nature of the autocorrelation between lagged observations. A non-stationary time series, for instance a time series with trends, seasonality or both, can be modelled using ARIMA, with an appropriate order of differencing. ARIMA models, however, are based on autocorrelation instead of the structural view of level, trend and seasonality.

| Structural time series models
The advantage of using a structural time series model is the ability to relax the formulation to adapt to changes in series levels over time. The evolutionary nature of the components of climatic time series implies a deviation from the assumption of a fixed pattern and behaviour over time. All the components such as cycle, trend, seasonality and irregular may have stochastic features and complicate the modelling effort. The assumption that these components evolve randomly over time is the basis for structural time series modelling (Jalles, 2009). Here Y t is defined as: F I G U R E 2 The distribution of the lengths of gaps measured in hours in the five folds in the four locations where μ t , ∁ t , S t and ε t are the global trend, cycle, seasonal and irregular components of the time series. The trend and cycle components can be combined into a single level component, and upon applying logarithms the additive decomposition becomes: where y t = ln Y t , γ t = ln μ t ∁ t ð Þ, τ t = ln S t and ξ t = ln ε t . Each component can be considered and formulated as a stochastic process with random disturbances (Pollock, 2008;Jalles, 2009). Each additive component was z-transformed and modelled by an independent ARIMA process, where ς γ (z), ς τ (z) and ξ(z) are the z-transforms of independent white noise processes: The z-transformation ensured that the varying signals associated with the components were effectively dealt with and allowed the design of filters for parameter estimation. The unit-root factor (1 − z) p for the trend component is a factor of the autoregressive polynomial θ γ (z) whereas the autoregressive polynomial θ τ (z) contains the factor (1 + z + Á Á Á + z s − 1 ), where s represents the number of periods in a seasonal cycle, which was 24, representing daily seasonality. The sum of ARIMA processes yields an ARIMA process (Pollock, 2008) and was represented in the state-space form.
2.4 | ARIMA(p, d, q) in state-space forms, Kalman filter and smoothing State-space forms offer great flexibility in extracting time series data features. They are generally used for prediction, smoothing and likelihood evaluation purposes (De Jong and Penzer, 2000). They also offer convenient frameworks for incorporating smoothing functions in a wide range of time series models to improve prediction generally. They are specified by two sets of equations, namely the observation and state equations represented in Equations (9) and (10) respectively: where, for a state vector of length k, H t is a vector of length k, ε tÑ ID 0, σ 2 ε t , Z t is a k × k matrix, R t is a k × 1 matrix and ω t~N ID(0, W). Considering Equation (1), let m = max (p + d, q + 1). Then: Also, for j < m and ϕ 0 = − 1, let: Then, the state vector This form of state-space representation of an ARIMA (p, d, q) model has been found to have both computational and conceptual advantages (Harvey, 1989;Hamilton, 1994). The formulation guarantees that the ARIMA models are responsive to the application of the Kalman filter and smoothing for the estimation of the model parameters and the extraction of unobserved components.
The estimation and updating of the model parameters constitute the Kalman filter and these filters were obtained by considering a Gaussian process with the initial state Y t~N (η t , V t ), with mean and variance at time t + 1: The mean and variance of the joint distribution of Y T t + 1 , y T t + 1 À Á T are given respectively as: The mean and variance of the conditional distribution of Y t + 1 j y t + 1 are obtained as: The prediction and the updating of estimates by the Kalman filter can therefore be summarized as follows: e t + 1 = y t + 1 − H t + 1 η t + 1jt prediction error ð Þ The Kalman filter operates recursively where current best estimates are updated when new observations are available. These models were implemented using the "imputeTS" package, version 2.7 (Moritz and Bartz-Beielstein, 2017), making use of the options "StructTS" and "auto.arima" of the function "na.kalman" in the R software version 3.5.1 (R Core Team, 2013).

| Multiple regression modelling
The multiple regression models in this study were formulated with both continuous and circular predictors.
Precipitation, temperature, humidity, wind gust and sea level pressure were treated as continuous variables. Following SenGupta and Ugwuowo (2006), wind direction as a circular variable was trigonometrically transformed by: cos π direction 180 , sin π direction 180 The resultant variables representing wind speed together with the continuous variables were used in the multiple regression modelling. The estimate of a missing observation at time t was imputed using the model: where X represents the vector of p predictors with parameters β.
F I G U R E 4 The mean absolute error, root mean square error and symmetric mean absolute error values for the performance evaluation of imputing the five missing folds of temperature, humidity and wind speed from the different estimation techniques for the four study locations To overcome the possible violation of the model assumption of non-autocorrelated errors, heteroscedasticity and autocorrelation robust standard errors were used (Newey and West, 1987). For imputation purposes, independent variables that are highly correlated with a dependent variable with missing observations can be modelled to obtain highly plausible imputations (van Buuren and Groothuis-Oudshoorn, 2011). Figure 3 depicts the distributional characteristics and the nature and strength of the relationship between the study variables. The distributional characteristics, the nature and strength of the relationship differed with respect to location. The strength of the relationships could inform how important a predictor was in the regression imputation models. For instance, from Figure 3, except for Broome, high correlation values were observed between temperature and humidity and each variable would serve as a good predictor for imputing the other.
The nature of the relationship between variables presented in Figure 3 especially between the resultant variables (sine and cosine) of transformation does not look linear. However, the linear regression modelling technique according to Berman (1998) has a non-parametric interpretation as the average linear trend across all pairs of observations and the relationships between variables do not have to look linear and could be applied to any monotonic trend. The multiple linear regression models were formulated with the response and predictor variables depicted in Table 2.

| MODEL PERFORMANCE EVALUATION
Using the out-of-sample splits in the five-fold cross-validation, the estimation accuracy of the models in the F I G U R E 4 (Continued) imputations was assessed by means of three model performance metrics: the MAE, RMSE and the symmetric mean absolute percentage error (SMAPE). Equations (27)-(29) respectively represent the metrics: where y = P T t = 1 y t =T,ŷ t and y t are the imputed and actual values for time t and T is the length of the missing data in the time series. The MAE and RMSE are useful performance indicators (McCandless et al., 2011;Kanda et al., 2018;Lara-Estrada et al., 2018). Both the MAE and RMSE range from 0 to +∞, and lower values indicate high levels of agreement between observed and estimated values. The MAE and RMSE values have the same units as the variables measured. SMAPE is a variation of mean absolute percentage error (MAPE) proposed by Makridakis (1993) where, instead of dividing the error by the observed value, the observed average is used instead. This ensures that symmetry is achieved and thus is applicable to variables with meaningful zero values, for instance a 0 C temperature value. Smaller percentage values indicate high levels of agreement between observed and imputed values.

| RESULTS AND DISCUSSION
The estimation accuracy of the methods was assessed based on the performance indicators described in the preceding section. Generally, the performance indicators agreed in the selection of the best imputation method in the five-fold gaps of missing observations (see Figure 4). There were instances, however, where the measures ranked the methods differently, notably the RMSE compared to the other indicators. For example, in assessing the imputations of wind speed in Perth, the methods were ranked differently for fold 2, while both the MAE and SMAPE ranked multiple linear regression as the best and RMSE ranked the structural model with Kalman smoothing as the best. Similarly, for humidity in Broome, RMSE ranked multiple linear regression as the best while the MAE and SMAPE both ranked the ARIMA with Kalman smoothing as the best method in folds 2, 3 and 5 (see Figure 4). In cases where the indicators followed divergent paths in ranking the models, the MAE was used in assessing the models (Legate and McCabe, 1999;Kanda et al., 2018). The best methods were determined based on the pooled averaged values of the performance indicators over the five folds. Table 3 presents the pooled performance indicators' average values for the five-fold missing observations and the rankings of the methods in imputing missing values for temperature, humidity and wind speed at the four locations. Multiple linear regression was consistently ranked as the best among the methods especially for the more southerly locations Perth and Esperance. The performance of multiple linear regression at these locations could be attributed to the relatively strong relationships between the studied variables compared to the strength Note: The most appropriate models as ranked by the performance indicators for the study variables with respect to location are in bold and the underlined text were ranked second.
of the relationships in the northern parts of WA (see Figure 3). However, in Broome (in the north of WA) multiple linear regression was ranked as the worst model for all the variables except for humidity, based on the MAE. From Figure 3, the Pearson correlation coefficients between the variables suggest a very weak relationship between the variables in Broome and possibly contributed to the poor performance of the multiple regression model (Petris, 2010;van Buuren and Groothuis-Oudshoorn, 2011). Broome has a tropical climate defined by dry and wet seasons, which is a complete departure from the typical four seasons, namely summer, winter, autumn and spring, in the other locations. For instance, the Pearson correlation between temperature and humidity was r = − 0.111, significantly lower than the strength of correlation in the other locations (see Figure 3). The locations appeared to influence the choice of the preferred model. Kashani and Dinpasho (2012) concluded that climate data in different land topologies were significantly different. The geomorphological and vegetation features are significantly different between these locations (found in different climate zones) and will interact differently in the mechanisms leading to the formations of these climatic variables (Kanda et al., 2018). The nature of the relationships between the study variables in Figure 3 also suggests that the data structures in the locations are different. The choice of model was guided by the data characteristics which were apparently influenced by the locations. In the models' choice, the ARIMA with Kalman smoothing performed best in Broome which is characterized by hot semi-arid desert conditions (Figure 1). Similarly, multiple linear regression performed best in Perth in the hot summer Mediterranean zone for all the variables studied. The density plots (see Figures 5-8) of temperature and wind speed suggest that the series were to some extent similar for the locations in the southern part of WA (Esperance and Perth) and similarly for those in the north (Exmouth and Broome).
F I G U R E 5 Density plots comparing the distribution of the observed and the five missing folds of imputed values for temperature, humidity and wind speed in Perth From Table 3, the average MAE associated with the imputed temperature missing values by the methods was 0.25 C with a magnitude of error less than 0.30 C for all the four locations. The difference in the MAE values between the top competing models was 0.01, 0.04,0.03 and 0.02 C for Esperance, Perth, Exmouth and Broome, respectively. For humidity, the average MAE was 1.320% with a magnitude of error of less than 3.400% for all locations. The differences in the MAE values between the top two models were 0.180%, 0.340%, 0.002% and 0.090% for Esperance, Perth, Exmouth and Broome, respectively. In the case of wind speed, the average MAE value association with the imputations was 0.560 kmÁh −1 with a magnitude of error less than 0.900 kmÁh −1 for all locations. For the two best performing models, the differences between the MAE values were 0.006, 0.060, 0.007 and 0.008 kmÁh −1 for Esperance, Perth, Exmouth and Broome respectively. The relatively low magnitude of error values for the variables suggested that the methods were closely competing and imputed highly plausible values.
To check the validity and plausibility of the imputations performed by the methods further, the density plots of the imputations of the five-fold gaps of missing observations were overlaid on that of the observed. Figures 5-8 depict the overlaid density plots for the temperature, humidity and wind speed with respect to location. The preservation of the distribution of the variable with missing values to enable the derivation of reliable sample estimators is a key objective of data imputation (Brick and Kalton, 1996). The density plots showed a close resemblance between the distribution of the observed and the imputed values for the study variables in the four locations. Figures 5-8 suggest that the imputed values were highly plausible to allow any analyses to be conducted as if the data were complete. The resemblance of the imputed values in the five-fold gaps of missing observations to the observed data was also assessed using the correlation coefficient (R) (see Table 4). The coefficients of determination scores observed by the methods were 0.946 ≤ R ≤ 0.988, F I G U R E 6 Density plots comparing the distribution of the observed and the five missing folds of imputed values for temperature, humidity and wind speed in Broome 0.747 ≤ R ≤ 0.989 and 0.957 ≤ R ≤ 0.971 for temperature, humidity and wind speed respectively for the locations. Except, for humidity in Esperance where the ARIMA with Kalman smoothing seemed more biased, recording low scores of 0.747 ≤ R ≤ 0.872, the models generally recorded high R values suggesting that the imputed values in the five-fold gaps of missing observations by the models were very similar to the observed values.
The multiple linear regression model was the best model in most cases, followed by the ARIMA with Kalman smoothing. Multiple linear regression models of highly correlated response and predictor variables have been found to impute highly plausible values (van Buuren and Groothuis-Oudshoorn, 2011;Kashani and Dinpasho, 2012;Kanda et al., 2018). The comparatively poor performance of the structural time series models to the ARIMA models could be due to the fact that the underlying structures of the series studied were simple and had relatively short cycles. The lengths of the series studied were relatively short and did not exhibit any complexity in structure. For instance, the series only cut across two and three seasons and in effect features such as trend, seasonality and cyclic behaviours were obscured. This reduced the level of complexity of the time series and to some extent undermined the potential of the structural time series models. The structural time series models are known to be efficient in modelling series with complex structure, where prior analysis of the structure underlying the generating system is carried out before the model fit (Jalles, 2009). The structural time series model explicitly modelled the trend, seasonal, error terms and other relevant components. There were some cases especially with wind speed in Exmouth where the structural model outperformed the ARIMA model. Observing the distributional characteristics of the variables in Figures 5-8 suggests that the series for wind speed was more erratic compared to temperature and humidity. The structural models, although ranked as the worst performing models in most cases, still performed F I G U R E 7 Density plots comparing the distribution of the observed and the five missing folds of imputed values for temperature, humidity and wind speed in Exmouth satisfactorily in modelling and imputing the missing observations in relatively short climate time series datasets.

| Limitations and opportunities of the modelling techniques
Missing data are common in instrumentation measurements and missingness can be persistent (Simolo et al., 2010). The choice of imputation technique, all things being equal, should allow for an easy and efficient modelling technique. In this study, the multiple linear regression model was the least complex model and in most cases performed best. However, in the absence of predictors or if missing data are persistent in predictors, the method cannot be used. Data scarcity is common especially in developing countries and access to datasets for important predictor variables may not be feasible. Also, the modelling assumptions need to be considered, and if they are not satisfied the use of multiple linear regression modelling may not be applicable. The use of heteroscedasticity and autocorrelation may have worked well because of the lengths of the series studied; for relatively long series it could easily result in model misspecification leading to less accurate predictions. Alternatively, the univariate time series models by the state-space method, despite the level of complexity, are time dependent and are independent of other variables. The techniques competed closely with multiple linear regression even for very short cycle time series datasets. The plausibility of the imputed values obtained from these techniques makes them ideal for any situation. The advances in statistical software and packages would enable easy implementation of these techniques.

| CONCLUSION
The methods studied have demonstrated their suitability in imputing missing data in high-resolution temperature, humidity and wind speed data. However, the study only used sub-samples of relatively short time series and this could have contributed to the general performance of the univariate time series modelling approaches. It is recommended that longer climate time series datasets with varying patterns of complexity are studied to assess the T A B L E 4 The correlation coefficient between the imputed values and the observed values with respect to the estimation techniques for the five missing folds techniques further under several varying scenarios of missing data. The assumption that data were missing at random may not be applicable to all causes of missing data, for instance missingness caused by routine maintenance or calibration. To ensure that the missing at random assumption holds, it is recommended that such activities are carried out in a randomized fashion.

ACKNOWLEDGEMENT
This study was funded and supported by the Government of Western Australia Department of Primary Industries and Regional Development (DPIRD). The authors are grateful to the Australian Bureau of Meteorology for providing the data for the study. The authors wish to thank Dr Ainslie Denham, Dr Timothy Green and Mr Mark Pagano for their input during the internal review process by DPIRD. Thanks are also due to the two anonymous reviewers for their constructive criticisms and the useful suggestions made.