An evaluation of surface meteorology and fluxes over the Iceland and Greenland Seas in ERA5 reanalysis: The impact of sea ice distribution

The Iceland and Greenland Seas are a crucial region for the climate system, being the headwaters of the lower limb of the Atlantic Meridional Overturning Circulation. Investigating the atmosphere–ocean–ice processes in this region often necessitates the use of meteorological reanalyses—a representation of the atmospheric state based on the assimilation of observations into a numerical weather prediction system. Knowing the quality of reanalysis products is vital for their proper use. Here we evaluate the surface‐layer meteorology and surface turbulent fluxes in winter and spring for the latest reanalysis from the European Centre for Medium‐Range Weather Forecasts, i.e., ERA5. In situ observations from a meteorological buoy, a research vessel, and a research aircraft during the Iceland–Greenland Seas Project provide unparalleled coverage of this climatically important region. The observations are independent of ERA5. They allow a comprehensive evaluation of the surface meteorology and fluxes of these subpolar seas and, for the first time, a specific focus on the marginal ice zone. Over the ice‐free ocean, ERA5 generally compares well to the observations of surface‐layer meteorology and turbulent fluxes. However, over the marginal ice zone, the correspondence is noticeably less accurate: for example, the root‐mean‐square errors are significantly higher for surface temperature, wind speed, and surface sensible heat flux. The primary reason for the difference in reanalysis quality is an overly smooth sea‐ice distribution in the surface boundary conditions used in ERA5. Particularly over the marginal ice zone, unrepresented variability and uncertainties in how to parameterize surface exchange compromise the quality of the reanalyses. A parallel evaluation of higher‐resolution forecast fields from the Met Office's Unified Model corroborates these findings.


INTRODUCTION
The subpolar seas of the North Atlantic are critically important for the global climate system as they are the source of the dense waters of the Atlantic Meridional Overturning Circulation (AMOC). Investigating coupled atmosphere-ocean processes, in particular surface turbulent heat and momentum fluxes, is a key step toward improving our understanding of the role that the North Atlantic subpolar seas play within the AMOC (e.g., Buckley and Marshall, 2016). The dominant contribution to the AMOC is from east of Greenland (Pickart and Spall, 2007), as is the largest variability in volume transport (Lozier et al., 2019), pointing to the Norwegian, Barents, Greenland, Iceland, and Irminger Seas as key locations for the formation of dense water masses. Ocean circulation paradigms have shifted over the years: from when it was thought that the Iceland and Greenland Seas were the primary source of dense water via open ocean convection (e.g., Swift and Aagaard, 1981), to a view that consistent ocean cooling and densification around the rim current of the Nordic Seas was dominant (e.g., Mauritzen, 1996), to a shift back to the importance of the Iceland and Greenland Seas due to the discovery of the North Icelandic Jet (Jónsson and Valdimarsson, 2004; Våge et al., 2011; Semper et al., 2019) and of areas of dense water in the northwest Iceland and western Greenland Seas (Våge et al., 2018). Exactly where, when, and how the water mass transformations take place, and how the dense water feeds the AMOC, are active areas of research. These were key questions posed at the inception of the Iceland-Greenland Seas Project (IGP): a coordinated atmosphere-ocean project encompassing a rare wintertime field campaign to observe, analyze, and model the coupled climate system in this region (see Renfrew et al., 2019a for an overview).
Here, we make use of several atmospheric datasets gathered during the IGP field campaign that together provide unparalleled coverage of the region to evaluate a state-of-the-art meteorological reanalysis product. We focus on the surface-layer meteorology and surface fluxes, the salient fields for atmosphere-ocean-sea ice coupling.
Meteorological reanalyses are generated from the assimilation of observations into a consistent version of a numerical weather prediction (NWP) forecast system by optimally blending short-range forecasts with millions of observations through data assimilation. As the quality of NWP systems has increased tremendously over recent decades (Bauer et al., 2015), so too has the quality of meteorological reanalyses (Hersbach et al., 2020). Reanalyses are an excellent tool for the analysis of the climate system (e.g., Papritz and Spengler, 2017), especially in regions with a paucity of in situ observations such as the Iceland and Greenland Seas (e.g., Jung et al., 2016). However, it is vital to have knowledge of the quality of reanalysis products before they are used for particular applications. This is particularly important for the polar and subpolar regions, where NWP systems have numerous well-known weaknesses, for example, in the representation of stable boundary layers, mixed-phase clouds, sea-ice characteristics, and surface exchange over heterogeneous surfaces or in the use of observations (Bourassa et al., 2013; Vihma et al., 2014; Jung et al., 2016; Lawrence et al., 2019). All of these processes will impact the quality of surface-layer meteorological variables and surface fluxes, raising questions as to how accurate these fields will be in reanalyses, analyses, and forecasts. Here we address this through an evaluation of ERA5, the latest global reanalysis product produced by the European Centre for Medium-Range Weather Forecasts (ECMWF), against independent observations from the IGP. Our focus is on ERA5 (Hersbach et al., 2020) as this is a relatively new product that has been produced to replace and improve upon the popular ERA-Interim reanalyses (ERA-I; Dee et al., 2011), using enhanced observations and a recent improved version of the ECMWF Integrated Forecasting System.
A number of evaluations of meteorological reanalyses have been carried out recently for the whole Arctic (Lindsay et al., 2014; Bromwich et al., 2016), for the Arctic Ocean (Lüpkes et al., 2010; Jakobson et al., 2012; Wesslén et al., 2013; Wang et al., 2019), and for the subpolar seas (Harden et al., 2011; Moore et al., 2016). All of the above evaluations used ERA-I output (or operational output from the same model cycle in Renfrew et al., 2009), while several also evaluated other products such as the regional Arctic System Reanalyses (ASR; Bromwich et al., 2016; Moore et al., 2015; Wesslén et al., 2013) or other global reanalyses (e.g., Jakobson et al., 2012; Lindsay et al., 2014; Jones et al., 2016; Graham et al., 2019a, 2019b). A number of errors in surface-layer meteorology have been revealed in these studies; in particular, all reanalyses tend to have wind speeds that are biased low over land stations, especially for moderate-to-strong winds in regions of steep or complex orography (Moore et al., 2015; Jones et al., 2016; Nygård et al., 2016), although higher resolution partly ameliorates this problem (DuVivier and Cassano, 2013; Moore et al., 2016). ERA-I usually performs comparatively well against other global reanalyses (e.g., Jakobson et al., 2012; Lindsay et al., 2014; Jones et al., 2016). The limited evaluations of ERA5 so far indicate that it also performs well against independent radiosonde observations and for radiative fluxes in spring and summer over the Arctic Ocean (Graham et al., 2019a, 2019b), and outperforms ERA-I for global oceanic wind fields when compared with scatterometer observations (Belmonte Rivas and Stoffelen, 2019).
Focusing on evaluations for the subpolar seas, ERA-I generally does well at representing surface-layer temperatures, winds, humidity, and turbulent fluxes, although with more scatter in relative humidity and turbulent fluxes, and with similar findings for the equivalent operational ECMWF analyses evaluated in Renfrew et al. (2009). For example, comparing against 2 years of meteorological buoy observations in the central Iceland Sea, the ERA-I biases (root-mean-square errors, RMSE) in air temperature, relative humidity, wind speed, and sensible heat flux were 0.43 (0.82) K, −5.5% (8.4%), 0.12 (1.6) m/s, and −8.3 (15.8) W/m², respectively. Moore et al. (2008) report comparable discrepancies in air temperature and wind speed against 5 months of buoy observations from the Irminger Sea for the North American Regional Reanalyses (NARR). Dukhovskoy et al. (2017) find similar differences in wind speed from the ASR, the Climate Forecast System Reanalysis (CFSR), and satellite-derived products when compared against the same buoy observations. However, if QuikScat scatterometer winds are taken as truth, they find the biases (RMSE) in the ASR and CFSR winds to be <0.5 (1) m/s in the subpolar North Atlantic and Nordic Seas. Closer to the steep orography of coastal Greenland, there are challenges in representing orographic flows and 10-m wind speed biases (RMSE) increase to approximately −2 (3-4) m/s (Harden et al., 2011; Moore et al., 2015).
Reviewing previous evaluations of reanalyses, it is clear that there are some gaps in knowledge. Over the subpolar seas, there have been no specific evaluations of reanalysis products over sea ice or the marginal ice zone (MIZ), the zone of more variable sea-ice conditions where waves and swell impact the sea ice. Renfrew et al. (2009) show a handful of aircraft observations over the MIZ that illustrate substantial differences in surface temperature, air temperature, and wind speed between the various models and the observations, but there are too few data points for a quantitative analysis. All of the Arctic Ocean evaluations currently available are for near-100% ice concentrations, meaning that the quality of reanalyses over any Arctic MIZ is currently unknown.
FIGURE 1 Map of the Iceland and Greenland Seas with sea-ice fraction averaged over the field campaign period. Overlaid are the positions of the low-level components of the research flights (colored by flight number), the track of the research vessel (thin black line), and the position of the meteorological buoy (star). Some key locations are noted.
Our observations come from three separate platforms (a meteorological buoy, a research vessel, and a research aircraft), all used during the IGP to make observations of the atmospheric surface layer (Renfrew et al., 2019a). Our meteorological buoy was in the NW Iceland Sea (see Figure 1) for 78 days in an open ocean location, closer to the sea ice than the central Iceland Sea buoy of Harden et al. (2015). Our research vessel, the NRV Alliance, traversed the Iceland and Greenland Seas for 43 days, penetrating the MIZ on several occasions. Meanwhile, our research aircraft primarily targeted the NW Iceland Sea and the MIZ in particular, with observations from nine flights included here. Combining data from these three platforms allows us to make a comprehensive evaluation of ERA5 over the winter to spring period, and for the first time we are able to contrast ice-free ocean and MIZ conditions. In Section 2 we describe the observations, model output, and methods employed. Section 3 provides an evaluation of ERA5 for the ice-free ocean and for the MIZ, revealing contrasts in accuracy. In Section 4 we explore why this is the case, aided by an evaluation of higher-resolution limited-area analyses and forecasts from the Met Office Unified Model. Section 5 provides conclusions and recommendations.
Note: To obtain wind speeds from the research vessel and aircraft, data on the location and platform motion had to be combined with the measurements from the named instruments (e.g., Renfrew et al., 2008; Duscha et al., 2020). For brevity, accuracy estimates are only given for the derived wind speed. The uncertainty in aircraft winds given here is higher than in previous studies such as those of Fiedler et al. (2010) and Weiss et al. (2011) due to the post-flight calibration that was required. Note that an uncertainty in dewpoint temperature of ±0.5 K is equivalent to an uncertainty of ±0.08 g/kg in specific humidity and ±3% in relative humidity (RH) at the air temperatures observed.

Observations from a meteorological buoy
A Seawatch Wavescan meteorological buoy was deployed on 17 February 2018 in the NW Iceland Sea at 70°38.38′ N, 15°24.58′ W. It worked well for 78 days before breaking loose from its anchor. Hourly observations of air temperature, relative humidity (RH), air pressure, solar radiation, wind speed, and wind direction were made at a height of ∼3 m (see Table 1 for instrumentation details and estimates of accuracy for all platforms). In addition, observations of sea surface temperature (SST), ocean currents, and wave height, period, and direction were recorded. All variables were quality controlled, with outliers and nonphysical measurements removed. Quality control procedures revealed the air pressure to be erroneous for about half of the deployment, so the mean-sea-level pressure from ERA5 is used when needed to derive other variables. Unfortunately, the SST was not measured reliably, so instead we use the shallowest (8 m) ocean temperature from an adjacent ocean mooring (see Renfrew et al., 2019a). At this time of the year, and in this location, the ocean surface layer is generally well mixed, so this substitution is reasonable when comparing mean values (e.g., Våge et al., 2018). However, at 8 m depth, the variability in temperature will be reduced compared with that at the surface. A comparison with observations at the meteorological station on Jan Mayen revealed that the air temperature was erroneous during a short cold period in April (likely due to icing). Note that the buoy observations were not made available to meteorological forecast centers and so are independent of ERA5.

Observations from the research vessel
A time series of surface-layer meteorological variables was generated from the 43-day cruise of the NATO research vessel Alliance in February-March 2018 in the Iceland and southern Greenland Seas (Figure 1; see Renfrew et al., 2019a). Temperature, pressure, and RH were taken from the WeatherPak shipboard systems mounted at ∼15 m above sea level on the bow mast (see Table 1). Unfortunately, these systems had some technical problems, so a careful quality control procedure was implemented, with timing, linear regression, and bias checks against independent measurements from the boat deck, which were then used to fill in several small gaps in the WeatherPak time series. Due to instrument problems with the WeatherPak anemometers, and to avoid periods of sheltering by the ship's superstructure, here we take wind speed and wind direction from the lowest bin (40 m) of a Doppler wind lidar (a Leosphere WindCube v2 8.66) located on the boat deck. A novel correction algorithm for translational motions of the ship, as well as established corrections for the pitch, roll, and yaw of the Alliance based on inertial motion unit measurements, were implemented (Duscha et al., 2020). SST was measured by a bow temperature sensor with checks against 2-m measurements from an underway conductivity-temperature-depth (CTD) system. Underway salinity measurements were also used to confirm when a few SST measurements were erroneous, and interpolated CTD data were used to cover a few short episodes of missing SST data. Periods of time in port were removed from the time series. All variables were quality controlled, with outliers and nonphysical measurements removed. Here we use 10-min averaged data. Note that data from radiosondes released from the Alliance were sent to forecast centers in real time and so were available for assimilation into operational systems and reanalyses. However, the ship-based measurements used here are independent of ERA5.

Observations from the research aircraft
Surface-layer observations are also available from our coordinated aircraft campaign in February and March 2018. We used the British Antarctic Survey's instrumented DH6 Twin Otter aircraft for 14 science missions (Flights 292-306), several in the vicinity of the Alliance, and more than half flying over the MIZ. A summary of the IGP meteorological field campaign is given in Renfrew et al. (2019a). A number of minor technical issues arose during the quality control of the aircraft data: the radar altimeter was not functioning on the first three flights and so was substituted by a calibrated GPS altitude; icing on the turbulence probe prevented calculation of 50-Hz winds on flights 292 and 297, so substitute horizontal winds were calculated using pitot tube and inertial navigation unit measurements; the 1-Hz temperature data were not available on flight 297, and high-frequency humidity data were missing for part of flight 294 due to a mission scientist blunder. The airborne surface temperature is based on a downward-pointing infrared thermometer which needs to be calibrated. Here we follow Cook and Renfrew (2015) and apply a constant offset for each flight determined by a comparison against ERA5 SSTs over open water. We also checked that the corrected surface temperatures were consistent for co-located data points and physically realistic with respect to the sea-ice cover. It is worth noting that the Heimann infrared thermometer is only accurate to within ±1 K (cf. Table 1). Minor flight-dependent timing adjustments were made to the 50-Hz thermistor, humidity sensor, and GPS altimeter data on all the flights to account for their positions on the aircraft and the instrument response times, based on lagged correlations with vertical velocity observations. In addition, there was initially a problem with partitioning the horizontal wind into components. 
A careful analysis of adjacent flight legs with reciprocal headings allowed us to apply a small correction to the true air speed and heading (∼1°) and thus derive accurate wind components for all flights. The aircraft-based observations are independent of ERA5.
Here we use observations from the nine successful marine flights. These include over 400 min of flying in the atmospheric surface layer (over 230 min over the MIZ), typically at 20-50 m above sea level. We have divided our surface-layer legs into "runs" of 150 s (approximately 9 km), and we calculate mean and turbulent quantities for each run. A run length of 150 s was chosen following sensitivity testing; it is a reasonable compromise between capturing the vast majority of the fluxes and accommodating the heterogeneous surface conditions (see Grunwald et al., 1996; Elvidge et al., 2016). The mean variables used here are air temperature, RH, specific humidity, wind speed, surface temperature, and ice fraction (derived from albedo and surface temperature; see Elvidge et al., 2016). The turbulent variables used are momentum flux, sensible heat flux, and latent heat flux, calculated using the eddy covariance method following Petersen and Renfrew (2009). A strict quality control procedure is applied with covariances, co-spectra, and ogives all checked. One concern with this technique is the relatively large sampling error when measuring turbulence for a relatively short time. This sampling error is typically around 30-40% of the magnitude of the flux (e.g., Drennan et al., 2007; Petersen and Renfrew, 2009; Weiss et al., 2011). To compensate for this, the data are usually averaged together to obtain robust results, for example, into wind speed bins. Here we directly compare covariance fluxes to model output. We use this approach because there is not currently a widely accepted bulk flux algorithm for estimating surface fluxes over the MIZ. However, this approach is unusual; more commonly, meteorological observations are used to derive bulk flux estimates from an offline algorithm, with these bulk fluxes then compared with model fluxes (e.g., Renfrew et al., 2002). Our approach means that the sampling error needs to be taken into account.
In other words, for a comparison to be valid, there need to be sufficient data points for the statistical quantities to be robust; we believe this to be the case, with the possible exception of the aircraft-based fluxes over water.
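The eddy covariance calculation for a single run can be sketched as follows. This is a minimal textbook illustration, not the processing chain of Petersen and Renfrew (2009): the function name, the constant air density, and the use of simple mean removal (with no detrending, co-spectral, or ogive checks) are all assumptions made for the example.

```python
import numpy as np

# Assumed constants for illustration only (not taken from the paper).
RHO_AIR = 1.3    # kg/m^3, rough near-surface air density in cold conditions
CP_AIR = 1004.0  # J/(kg K), specific heat of dry air at constant pressure
L_VAP = 2.5e6    # J/kg, latent heat of vaporization


def eddy_covariance_fluxes(u, v, w, theta, q, rho=RHO_AIR):
    """Return (momentum flux, SHF, LHF) for one run of high-rate samples.

    u, v, w : wind components (m/s); theta : temperature (K);
    q : specific humidity (kg/kg). All arrays share the same length,
    e.g. 150 s at 50 Hz = 7500 samples.
    """
    u, v, w, theta, q = (np.asarray(x, dtype=float) for x in (u, v, w, theta, q))
    # Perturbations about the run mean (hypothetical choice: no detrending).
    up, vp, wp = u - u.mean(), v - v.mean(), w - w.mean()
    tp, qp = theta - theta.mean(), q - q.mean()
    # Magnitude of the momentum flux from both horizontal covariances.
    tau = rho * np.hypot(np.mean(up * wp), np.mean(vp * wp))
    shf = rho * CP_AIR * np.mean(wp * tp)  # W/m^2, positive upward
    lhf = rho * L_VAP * np.mean(wp * qp)   # W/m^2, positive upward
    return tau, shf, lhf
```

Because each run yields a single covariance estimate, the 30-40% sampling error discussed above applies to every value returned by a function of this kind.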

ERA5 reanalysis
ERA5 is the fifth-generation ECMWF atmospheric reanalysis (Hersbach et al., 2020). ERA5 is produced using cycle 41r2 of the Integrated Forecasting System (IFS) model, using a four-dimensional variational data assimilation scheme. The reanalysis benefits from a relatively high-resolution grid with 137 vertical levels and a horizontal grid spacing of 0.28125° (∼31 km, or TL639 triangular truncation).
The time frequency of atmospheric reanalysis parameters is 1 hr, and we use instantaneous meteorological variables and hourly mean surface fluxes. Besides a higher spatiotemporal resolution, ERA5 has a number of additional advantages over its predecessor, ERA-I (whose production stopped in 2019). The ERA5 data assimilation is enhanced by using not only satellite radiances, but also ozone, aircraft, and surface pressure data in the variational data assimilation scheme. ERA5 also assimilates a number of humidity-sensitive satellite channels using the all-sky approach instead of the clear-sky approach used in ERA-I, thus providing new information during cloudy and precipitating conditions. In addition, various reprocessed datasets and recent instruments that could not be ingested in ERA-I are included in ERA5. These improvements result, among other things, in more consistent sea-surface temperature and sea-ice cover compared with ERA-I. The evolution of SST and sea-ice cover in ERA5 is based on a number of products over different periods of time (Hersbach et al., 2020), including the Operational Sea Surface Temperature and Sea Ice Analysis (OSTIA; Donlon et al., 2012). OSTIA provides daily updated SST and sea-ice fields, primarily sourced from satellite observations, with a horizontal resolution of 1/20° (∼6 km). OSTIA is also used in the ECMWF's operational forecasting system. The ERA5 SST does not vary during the day, although there is no observable diurnal signal in SST in this region anyway.

Met Office analyses
The Met Office Unified Model (MetUM) is a state-of-the-art, nonhydrostatic atmospheric model used for operational weather forecasting and as a component in climate models. Here we analyze limited-area simulations made using version 10.6 of the MetUM and a standard parameterization configuration generally following that used operationally in the limited-area km-scale RAL1-M configuration (Bush et al., 2020). This configuration has proven to be reasonably accurate at simulating cases of cold-air outbreaks and polar lows in this area (e.g., Sergeev et al., 2017; Renfrew et al., 2019b). It employs daily updated sea-ice and sea-surface temperature fields from OSTIA (as used in ERA5). Here, the model domain covers an area of approximately 1,000 × 1,500 km across the Iceland and Greenland Seas (see figure 13 in Renfrew et al., 2019a). The setup has a horizontal grid spacing of 0.02° (∼2.2 km) and 70 vertical levels, the lowest of which is at a height of 2.5 m over the ocean. The limited-area model is forced at its lateral boundaries by a global MetUM simulation which employs a horizontal grid spacing of ∼10 km (N1280) with 70 vertical levels and also generally follows operational settings. We use instantaneous hourly model output from the simulations initialized at 0000 UTC that day.
Note: The variables are: temperature at 2 m, T2m (°C); sea surface temperature, SST (°C); specific humidity at 2 m, q2m (g/kg); relative humidity at 2 m, RH2m (%); wind speed at 10 m, U10m (m/s); wind direction, WD (deg); surface momentum flux (N/m²); surface sensible heat flux, SHF (W/m²); and surface latent heat flux, LHF (W/m²). Note that the wind direction time series was filtered to remove data where the difference was greater than 270°. The observed surface turbulent fluxes are calculated using the COARE3 algorithm. Recall that the observations of SST are from a depth of 8 m and so are shown in brackets. The nondimensional correlation coefficient and linear regression slope are shown in italics when statistically significant. The number of data points, N, plus the bias (model − observations) and RMSE are shown for all data. The bias is bold when statistically significant at the 95% level using a one-sided T-test.

Comparison methodology
We used the COARE3 bulk flux algorithm (cf. Fairall et al., 2003) to adjust the meteorological observations from the buoy and research vessel to standard levels (e.g., 2-m temperature, 10-m wind) and to estimate surface turbulent fluxes. We matched the model output to the observations as follows. For the buoy and research vessel observations, we used linear spatial interpolation and matched hourly observations to model output. For the aircraft observations, we used linear interpolation in the vertical to the height of the observations; horizontally and in time, we used the nearest neighbour and nearest hour for ERA5, and linear interpolation in both space and time for the MetUM.
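The two horizontal matching strategies described above can be illustrated with a short sketch. It assumes model output held as NumPy arrays on a regular, ascending latitude-longitude grid; the function names and array layout are hypothetical, and details such as vertical interpolation to flight level and the native model grids are ignored.

```python
import numpy as np


def nearest_index(grid, value):
    """Index of the grid coordinate closest to `value`."""
    return int(np.argmin(np.abs(np.asarray(grid, dtype=float) - value)))


def match_nearest(field, times, lats, lons, t_obs, lat_obs, lon_obs):
    """Nearest hour and nearest grid point; field has shape (time, lat, lon)."""
    return field[nearest_index(times, t_obs),
                 nearest_index(lats, lat_obs),
                 nearest_index(lons, lon_obs)]


def match_bilinear(field2d, lats, lons, lat_obs, lon_obs):
    """Bilinear interpolation of a (lat, lon) field to an observation point.

    Assumes lats and lons are regular and ascending, and that the point
    lies inside the grid.
    """
    lats = np.asarray(lats, dtype=float)
    lons = np.asarray(lons, dtype=float)
    # Lower-left corner of the grid cell containing the observation.
    i = int(np.clip(np.searchsorted(lats, lat_obs) - 1, 0, len(lats) - 2))
    j = int(np.clip(np.searchsorted(lons, lon_obs) - 1, 0, len(lons) - 2))
    wy = (lat_obs - lats[i]) / (lats[i + 1] - lats[i])
    wx = (lon_obs - lons[j]) / (lons[j + 1] - lons[j])
    return ((1 - wy) * (1 - wx) * field2d[i, j]
            + (1 - wy) * wx * field2d[i, j + 1]
            + wy * (1 - wx) * field2d[i + 1, j]
            + wy * wx * field2d[i + 1, j + 1])
```

A linear interpolation in time could be built in the same spirit by applying `match_bilinear` to the two bracketing hourly fields and weighting the results by the time offset.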
The meteorological buoy was located in the ice-free ocean, whereas both the research vessel and aircraft crossed from the open ocean into the MIZ on numerous occasions (Figure 1). For the following comparison, we divided both of these time series into subsets for "over water" and "over the MIZ." For the research vessel, an hourly time series of satellite-derived ice fraction is taken from the OSTIA grid point nearest to the position of the Alliance. The Alliance is designated as over the MIZ when the ice fraction is >0. This is a pragmatic approach given that in situ observations of ice fraction are not available. For the aircraft, ice fraction is estimated using an albedo derived from shortwave radiation observations (after Elvidge et al., 2016). As above, we designate data as "over the MIZ" when the ice fraction is >0. Note that using an alternative, temperature-based ice fraction produces very similar results.
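As an illustration of this subsetting, the sketch below splits matched model-observation pairs by ice fraction and computes comparison statistics for each subset. The function names are hypothetical, and regressing the model values on the observations is an assumption, since the text does not state which variable is the predictor; the significance testing used in the tables is not included.

```python
import numpy as np


def comparison_stats(model, obs):
    """N, bias (model - observations), RMSE, correlation, and regression slope."""
    model = np.asarray(model, dtype=float)
    obs = np.asarray(obs, dtype=float)
    diff = model - obs
    return {
        "N": int(obs.size),
        "bias": float(diff.mean()),
        "rmse": float(np.sqrt(np.mean(diff ** 2))),
        "r": float(np.corrcoef(obs, model)[0, 1]),
        # Assumed convention: least-squares fit of model against observations.
        "slope": float(np.polyfit(obs, model, 1)[0]),
    }


def split_by_surface(model, obs, ice_fraction):
    """Statistics for 'over water' (ice fraction = 0) and 'over the MIZ' (> 0)."""
    model = np.asarray(model, dtype=float)
    obs = np.asarray(obs, dtype=float)
    over_miz = np.asarray(ice_fraction, dtype=float) > 0
    return {
        "water": comparison_stats(model[~over_miz], obs[~over_miz]),
        "miz": comparison_stats(model[over_miz], obs[over_miz]),
    }
```

With the threshold at zero, any point with even a small ice fraction is counted as MIZ, which mirrors the pragmatic definition given above.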

AN EVALUATION OF ERA5 FOR THE ICELAND AND GREENLAND SEAS REGION
Surface-layer meteorology and surface turbulent fluxes from ERA5 generally compare well to observations from the meteorological buoy. Figure 2 shows a very good correspondence over time for 2-m air temperature (T2m) and 10-m wind speed (U10m). All the major variability is captured, and the timing of the changes is generally in very good agreement.
There are a few periods of larger difference, for example, when the maxima and minima are not captured. The correspondence illustrated here is generally representative of the other variables. Figure 3 shows scatter plots for the buoy observations versus ERA5 output, and Table 2 gives selected statistics, including the correlation coefficient and slope of a linear regression fit, the bias (model − observations), and the RMSE. The correspondence in T2m, RH2m, specific humidity (q2m), and the surface heat fluxes is very good, with low biases and relatively low RMSE (e.g., smaller than the standard deviation of the observations). There is a dry bias of −4.7% in RH2m (or −0.14 g/kg in q2m). The SST comparison has a small bias (0.07 K); however, recall that the observations are from a depth of 8 m, which likely inhibits the observed variability compared with the
Note: Variables and statistics are the same as Table 2. Note that the ship-based SST observations over the MIZ will not be representative of a grid-box value and so are bracketed. The number of data points, N, the bias (model − observations), and the RMSE are shown separately for observations over water (i.e., ice-free ocean) and over the MIZ. The bias is bold when statistically significant at the 95% level using a one-sided T-test.
data into quartiles by observed significant wave height; the U10m biases (regression slopes) for each quartile are then 0.14 m/s (0.8), 0.50 m/s (0.91), 0.53 m/s (1.09), and 1.52 m/s (1.13). There is a clear worsening in correspondence with increasing significant wave height, suggesting that the wind and momentum flux biases may be entirely due to buoy sheltering. There is also a bias of −15° in wind direction; i.e., ERA5 has winds coming from a more easterly direction. Overall, our buoy comparison is qualitatively and quantitatively similar to a comparison of ERA-I with a buoy in the central Iceland Sea by Harden et al. (2015), and to a comparison of NARR, ASR, and CFSR output with buoy observations in the Irminger Sea and the central Iceland Sea (Dukhovskoy et al., 2017).
Observations from the Alliance as it traversed the Iceland and Greenland Seas are shown in a time series in Figure 4 and illustrate it penetrating the MIZ on eight occasions (see also Figure 1). The proximity of the Alliance to sea ice results in greater variability in T2m and SST than at the buoy. ERA5 generally captures the timing of this variability well, although it does fail to capture some of the cold extremes and appears poorer for SST at times, especially close to the MIZ. The correspondence in U10m is generally very good.
A quantitative evaluation of ERA5 over water and over the MIZ is presented in Figure 5 and Table 3 for the ship-based observations and in Figure 6 and Table 4 for the aircraft-based observations. The tables give selected statistics for each time series as well as the bias and RMSE separately for over water and over the MIZ. The scatter plots are shaded to represent open water (blue) or the MIZ (white) for the ship, and the ice fraction (blue to white) for the aircraft. Generally, the correspondences, as measured by the correlation coefficient and linear regression slope, are good and similar for the ship- and aircraft-based comparisons. The correspondences for RH and the turbulent fluxes are noticeably worse for the aircraft comparison, partly due to the sampling issues discussed in Section 2.4 and the small size of the data subset. We now discuss the comparisons for over water and for over the MIZ in turn.
Over water, the ERA5 biases against ship-based observations are generally small and the RMSE are small compared with the standard deviation of the observations (see Table 3). In comparison with the buoy results, the correlation, slope, and RMSE are similar for T2m, RH2m, q2m, the sensible heat flux (SHF), and the latent heat flux (LHF). The bias is higher for T2m (0.49 K compared with 0.05 K) and for the SHF, while there is considerable scatter in the SST comparison, all likely due to the proximity to sea ice. In contrast to the buoy comparison, for U10m the linear regression slope is low due to high wind speeds being underpredicted (Figure 5c), and this contributes to a bias of −0.20 m/s over water. Similar to the buoy results, there is an easterly bias of −7° in wind direction and high accuracy in the surface flux estimates (the RMSE is less than half the observed standard deviation). In the aircraft comparison, there are 69 data points over water (only 44 for the turbulent fluxes). For the meteorological variables, the accuracy is generally similar to that found for the buoy
Note: The variables are: temperature, T (K); surface temperature, Tsfc (K); specific humidity, q (g/kg); relative humidity, RH (%); wind speed, U (m/s); momentum flux (N/m²); sensible heat flux, SHF (W/m²); latent heat flux, LHF (W/m²); and ice fraction. All variables are at flight level, except for Tsfc and the ice fraction. Flight-level ERA5 LHF are not available. The observed surface turbulent fluxes are calculated using the eddy covariance method; there is higher uncertainty in the comparison of these over water due to there being relatively few data points (hence the bias and RMSE are bracketed). The mean, standard deviation, and non-dimensional correlation coefficient and linear regression slope (in italics) are shown for all of the observations. The number of data points, N, the bias (model − observations), and the RMSE are shown separately for observations over water (i.e., ice-free ocean) and over the MIZ. Points are defined as over the MIZ when the observed ice fraction >0. The bias is bold when statistically significant at the 95% level using a one-sided T-test. The Tsfc bias over water is not shown because ERA5 data are used for calibrating the airborne observations.
Li et al., 2013). Overall, the correspondence over water is very good, largely consistent between the buoy, ship, and aircraft comparisons, and similar to previous evaluations of ERA-I for the subpolar seas (e.g., Harden et al., 2015).
Over the MIZ, there are typically 84 data points in the ship-based comparison and 88 in the aircraft comparison. Figures 5 and 6 illustrate that ERA5 is less accurate over the MIZ than over water. For example, there is a clear increase in scatter with increasing ice fraction (paler dots) in Figure 6. Examining the statistics (Tables 3 and 4), the RMSE are greater over the MIZ than over water for all meteorological variables (except for RH/q in the aircraft comparison) and for all the surface fluxes (except for momentum in the aircraft comparison).
The two exceptions, for RH/q and the momentum flux, are both due to the aircraft RMSE over water being surprisingly large (when compared with the ship or buoy comparisons), primarily due to the large variances for these variables over water and the relatively small size of the data subset. For some variables, the RMSE over the MIZ are particularly large, for example, 2.94 K for T sfc and 2.42 m/s for U from the aircraft. Note that the RMSE for SST are similar over the MIZ and over water in the ship comparison. This difference reflects that the Alliance was on the fringes of the MIZ and actively avoiding sea ice, whereas the aircraft went much deeper into the MIZ. In general, the accuracies between the aircraft- and ship-based comparisons over the MIZ are consistent, but there are quantitative differences due to the aircraft observations being from flight level (20-70 m) or derived differently. Turning to the biases, these are larger over the MIZ for all of the ship-based comparisons, except T 2m and wind direction; but this finding is not consistent with the aircraft-based comparison. In short, combining the comparisons from the three observing platforms demonstrates that ERA5 is significantly less accurate over the MIZ than over water for both the surface-layer meteorology and surface turbulent fluxes. This is clearly demonstrated by contrasting the RMSE over water/over the MIZ:
• For air temperature, surface temperature, and wind speed: 0.78/1.00 K, 0.47/2.94 K, and 1.77/2.42 m/s from the aircraft comparison (Table 4)
• For momentum flux, sensible heat flux, and latent heat flux: 0.077/0.098 N/m², 20.9/35.2 W/m², and 14.2/20.1 W/m² from the ship comparison (Table 3)
In the next section we examine the causes of this lower accuracy over the MIZ.
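The split statistics quoted in this section (N, bias as model minus observations, and RMSE, reported separately over water and over the MIZ) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' processing code: the function name `split_stats` and the sample values are hypothetical, and only the classification rule (a point is over the MIZ when the observed ice fraction is greater than zero) is taken from the text.

```python
import math


def split_stats(model, obs, ice_fraction):
    """Return N, bias (model - observations) and RMSE over water and over the MIZ."""
    groups = {"water": [], "MIZ": []}
    for m, o, f in zip(model, obs, ice_fraction):
        # Rule from the text: a point is over the MIZ when the observed ice fraction > 0.
        key = "MIZ" if f > 0 else "water"
        groups[key].append(m - o)

    stats = {}
    for key, diffs in groups.items():
        n = len(diffs)
        if n == 0:
            # No points in this class: statistics are undefined.
            stats[key] = {"N": 0, "bias": math.nan, "rmse": math.nan}
            continue
        bias = sum(diffs) / n
        rmse = math.sqrt(sum(d * d for d in diffs) / n)
        stats[key] = {"N": n, "bias": bias, "rmse": rmse}
    return stats


# Hypothetical temperatures (K) and ice fractions, for illustration only (not IGP data).
obs_T = [270.0, 271.0, 272.0, 262.0, 260.0, 265.0]
era5_T = [270.5, 270.5, 272.0, 264.0, 259.0, 267.0]
ice = [0.0, 0.0, 0.0, 0.3, 0.9, 0.6]

result = split_stats(era5_T, obs_T, ice)
for surface, values in result.items():
    print(surface, values)
```

Keeping the two subsets separate is what allows a small overall bias to coexist with a markedly larger RMSE over the MIZ, which is the contrast drawn in the bullet points above.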

INVESTIGATING THE REDUCED ACCURACY OF ERA5 OVER THE MARGINAL ICE ZONE
There are a number of possible reasons why the surface-layer meteorology and the surface fluxes from ERA5 are less accurate over the MIZ. There is an increase in the heterogeneity of many surface properties over the MIZ compared with over the ice-free ocean, for example, in surface temperature, surface roughness, and albedo, as is evident from our aircraft-based observations (e.g., Figure 6f). Perhaps ERA5 cannot represent this heterogeneity due to limitations in the data assimilated, or perhaps there are deficiencies in model parameterizations (e.g., in surface exchange), which may be more acute during meteorological conditions that are more prevalent over the MIZ. Here we investigate these possibilities primarily by focusing on some of the aircraft observations.
It is instructive to consider a case study. Here we compare the observations and ERA5 output off SE Greenland on 8 March 2018, when the aircraft spent a considerable amount of time over the pack ice and twice crossed the ice edge. Figure 7a shows the aircraft-observed ice fraction plotted over a Sentinel synthetic aperture radar (SAR) backscatter image, while Figure 7b shows the same data overlaid on ice fraction derived from the Advanced Microwave Scanning Radiometer 2 (AMSR2) (Spreen et al., 2008). The ice fraction is plotted using the same color bar for the aircraft and satellite-derived observations. Most of the pack ice is highly concentrated, with some leads and polynyas, as well as some narrow filaments of sea ice at an otherwise very narrow ice edge zone. The AMSR2 data correspond well to the SAR image, capturing the shape of the well-defined ice edge and the coherent patches of lower ice fraction, and also match the aircraft ice fraction observations reasonably well. Note that the seemingly different observations from the easternmost SW to NE leg are only just below an ice fraction of 0.8. In contrast to this, Figures 7d,f,h show sea ice from the satellite-derived OSTIA analysis that is assimilated into ERA5. This has a much smoother sea-ice distribution. The gradient in ice fraction across the OSTIA MIZ is spread out over 50-80 km and does not match the abrupt ice edge seen in the aircraft observations, the SAR image, or the AMSR2 data. The OSTIA product has a grid size of 1/20° (∼6 km) and has recently undergone an upgrade in its data assimilation algorithm to capture fine-scale fronts in SST (Fiedler et al., 2019), so it should be able to resolve the observed MIZ gradient. The smoothness of the sea-ice field is due to the relatively coarse resolution of the input data, that is, the OSI-SAF 401 data (Tonboe et al., 2017), which is based on SSMI observations from the 19- and 37-GHz channels, which have along-track resolutions of 69 and 37 km, respectively.
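One way to put a number on how far an ice edge is "spread out" along a transect is to measure the distance between nearly open water and nearly consolidated ice. The sketch below is illustrative only: the function `edge_width_km`, the thresholds of 0.15 and 0.80, and both synthetic transects are assumptions introduced here, not definitions or data from this study.

```python
def edge_width_km(distance_km, ice_fraction, low=0.15, high=0.80):
    """Width of the ice-edge zone along a transect running from open water into pack ice.

    The width is the distance from the last point with ice fraction <= low
    to the first point with ice fraction >= high. Both thresholds are
    illustrative assumptions, not values from the paper.
    """
    # First point that counts as pack ice.
    i_high = next(i for i, f in enumerate(ice_fraction) if f >= high)
    # Last point before it that still counts as (nearly) open water.
    i_low = max(i for i in range(i_high) if ice_fraction[i] <= low)
    return distance_km[i_high] - distance_km[i_low]


# Synthetic transect sampled every 5 km from 0 to 200 km.
x = [5.0 * i for i in range(41)]

# Abrupt edge at 100 km, standing in for a sharply defined ice edge.
sharp = [0.0 if xi < 100.0 else 0.95 for xi in x]

# Linear ramp from 0 at 65 km to 0.95 at 135 km, standing in for a smoothed analysis.
smooth = [min(max((xi - 65.0) / 70.0, 0.0), 1.0) * 0.95 for xi in x]

print(edge_width_km(x, sharp))   # limited by the 5 km sampling interval
print(edge_width_km(x, smooth))  # tens of kilometres
```

The result depends on the chosen thresholds and on the sampling interval, so a width obtained this way is best used for comparing products on the same transect rather than as an absolute measure.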
The aircraft observations illustrate a clear division between conditions over the sea ice and over water. There is a sharp increase in T, U, and SHF progressing across the ice edge into open water, with the SHF rising from 0 to ∼100 W/m² over 30 km, for example. There are also sharp increases in T sfc, RH, q, and LHF (not shown). In contrast, the gradients from ERA5 are much weaker and smoother; for example, the SHF rises from 0 to ∼100 W/m² over ∼80 km. It is evident that the overly smooth sea-ice field in ERA5 leads to overly smooth surface-layer meteorology and flux fields.

Figure 8 illustrates another case study from 16 March 2018. As before, the AMSR2 sea-ice distribution matches the SAR image and aircraft observations well, whereas the OSTIA sea-ice distribution is too smooth, with an ice edge that is smeared out over 60-100 km instead of 10-20 km. Again, there is an increase in observed U, T, and SHF across the MIZ, with a sharp increase at the ice edge in the southernmost leg. The pattern is broadly captured in ERA5, but with weaker gradients and an overly smooth distribution. These cases illustrate that ERA5 does not represent the sea-ice distribution across the MIZ very well and that this directly impacts the simulated surface-layer meteorology and fluxes. Looking across all the aircraft data over the MIZ, the linear regression slope for ice fraction is only 0.64, confirming the smearing out of ice fraction seen in Figures 7 and 8, and there are also low regression slopes for T, T sfc, U, momentum flux, SHF, and LHF (not shown). Using all of the IGP aircraft-observed ice fraction data, it is clear that the AMSR2 ice fraction is more accurate than the OSTIA ice fraction; for example, the RMSE are 0.17 for AMSR2 and 0.19 for OSTIA, and the linear regression slopes are 1.00 and 0.75, respectively.
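The link between a smeared-out ice edge and a regression slope below one can be illustrated with a synthetic transect. In this sketch, which is an assumption-laden toy example rather than a reproduction of the analysis, an abrupt "observed" ice edge is smoothed with a boxcar filter (the hypothetical helper `moving_average`, with a 75 km window chosen for illustration) and the smoothed field is regressed on the original.

```python
def moving_average(values, half_width):
    """Boxcar mean over +/- half_width points, truncated at the ends of the transect."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half_width): i + half_width + 1]
        out.append(sum(window) / len(window))
    return out


def regression_slope(obs, model):
    """Least-squares slope of the model values regressed on the observations."""
    n = len(obs)
    mean_obs = sum(obs) / n
    mean_model = sum(model) / n
    sxy = sum((o - mean_obs) * (m - mean_model) for o, m in zip(obs, model))
    sxx = sum((o - mean_obs) * (o - mean_obs) for o in obs)
    return sxy / sxx


# Synthetic "observed" ice fraction: abrupt edge at 100 km on a 5 km transect.
x = [5.0 * i for i in range(41)]
observed = [0.0 if xi < 100.0 else 0.95 for xi in x]

# Synthetic "analysis": the same field smeared over a 75 km window (15 points).
smoothed = moving_average(observed, half_width=7)

slope_identity = regression_slope(observed, observed)
slope_smoothed = regression_slope(observed, smoothed)
print(round(slope_identity, 3), round(slope_smoothed, 3))
```

A perfect product gives a slope of one, while smoothing pulls values on both sides of the edge towards the middle and so reduces the slope. The size of the reduction here depends entirely on the assumed window and transect, so it should not be compared directly with the slopes of 0.64 and 0.75 reported above.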
ERA5 has a grid size of about 30 km and is thus limited in its representation of spatial gradients. To examine whether this was the decisive factor, we carried out a parallel evaluation of output from a set of MetUM forecasts that have a grid size of 2.2 km (see Section 2.5 for model details). Figure 9 shows MetUM output for the 8 and 16 March 2018 case studies. Note that the OSTIA surface boundary conditions used in these forecasts are from 2 days earlier than those used in ERA5 (although this makes little qualitative difference). In both case studies, the MetUM suffers from similar problems to ERA5: the spatial gradients are smeared out into an overly smooth distribution, and the abrupt increases in T, U, and SHF at the ice edge are not captured (compare Figure 9 with Figures 7 and 8). Note that, in the 8 March case, the MetUM is uniformly about 1 K too cold and the winds are too strong over the ice (which is also the case for ERA5), although the MetUM does capture the high winds over water in the easternmost leg (unlike ERA5). In the 16 March case, the MetUM is uniformly about 2 K too warm (as is also the case for ERA5). Table 5 provides an evaluation of the MetUM for all the marine flights. The mean, standard deviation, correlation coefficient, and linear regression slope are generally very similar to those of the ERA5 comparison for the meteorological variables (cf. Table 4). The mean fluxes are higher, giving a better match for the momentum flux, but a worse match for the SHF. The bias and RMSE are shown separately for over water and over the MIZ and generally follow the same qualitative pattern as those of ERA5; for example, there is a negative bias in T over water and a positive bias in T over the MIZ. As for ERA5, the RMSE are greater over the MIZ than over water for all the meteorological variables (except q and wind direction). For the turbulent heat fluxes, the RMSE over the MIZ are large, but as discussed earlier, over water the large variance and relatively small dataset make this comparison unreliable. Note that an evaluation of the MetUM forecasts against the buoy observations gives RMSE over water of 21 and 18 W/m² for the SHF and LHF, respectively, compared with 48 and 31 W/m² over the MIZ (Table 5). This implies that the MetUM heat fluxes are less accurate over the MIZ than over water, in keeping with our findings for ERA5.

FIGURE 9 Spatial maps of sea-ice distribution from 8 March (left) and 16 March 2018 (right) with observations or MetUM output overlaid. Panels show satellite-derived sea-ice fraction contours: (a) from AMSR2, with ice fraction observations overlaid; and (b-d) from OSTIA, with MetUM output overlaid for flight-level air temperature, T (K), wind speed (m/s), and sensible heat flux (W/m²) as indicated. Recall that Figures 7 and 8 show aircraft observations of the same quantities.

Note (Table 5): […] Table 4. In brief, the mean, standard deviation, and non-dimensional correlation coefficient and linear regression slope (in italics) are shown for all of the observations. The number of data points, N, the bias (model − observations), and the RMSE are shown separately for observations over water and over the MIZ. The bias is bold when statistically significant at the 95% level using a one-sided t-test.

In short, the ERA5 and MetUM comparisons over the MIZ are remarkably similar and have the same major deficiencies. This suggests a common cause: the overly smooth sea-ice distribution in the surface boundary conditions. The evidence points to this being the primary reason for less accurate simulations over the MIZ. However, there are other factors to consider:
• The biases in the SHF and LHF over the MIZ are relatively large in magnitude for both ERA5 (Table 3) and the MetUM (Table 5). This raises questions about the surface exchange parameterization over the MIZ, which are being pursued in a separate study. Recent work has demonstrated that an improved surface exchange scheme for momentum can significantly improve forecasts for surface-layer meteorology and fluxes over the MIZ, regionally and globally (Renfrew et al., 2019b).
• The atmospheric conditions (e.g., static stability) may be different over the MIZ and over water. Even if this is the situation, it seems unlikely to be the dominant factor, especially as the aircraft-based comparison uses data from legs that often cover both the MIZ and open water.
• The models may not properly resolve the heterogeneity and sharp contrasts of the MIZ. The ERA5 grid size makes it impossible to fully represent the detailed sea-ice distribution seen in the AMSR2 product; however, it should be able to represent more detail than it currently does. The limiting factor appears to be the very smooth OSTIA sea-ice distribution, which is based on the OSI-SAF 401 product.
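To see why an error in the prescribed ice fraction can translate directly into an error in the sensible heat flux, consider a simple ice-fraction-weighted ("mosaic") blend of bulk fluxes over ice and over open water. This is a generic textbook-style sketch and is not the surface exchange scheme of ERA5 or the MetUM: the function names, the single exchange coefficient, the air density, and all temperatures and wind speeds below are assumed values chosen for illustration.

```python
RHO_AIR = 1.3      # air density (kg/m^3), assumed
CP_AIR = 1005.0    # specific heat of air at constant pressure (J/kg/K)
C_H = 1.5e-3       # heat exchange coefficient, assumed equal over ice and water


def bulk_shf(wind_speed, t_surface, t_air):
    """Bulk sensible heat flux (W/m^2), positive upward."""
    return RHO_AIR * CP_AIR * C_H * wind_speed * (t_surface - t_air)


def gridbox_shf(ice_fraction, wind_speed, t_air, t_ice, t_water):
    """Ice-fraction-weighted blend of the fluxes over ice and over open water."""
    shf_ice = bulk_shf(wind_speed, t_ice, t_air)
    shf_water = bulk_shf(wind_speed, t_water, t_air)
    return ice_fraction * shf_ice + (1.0 - ice_fraction) * shf_water


# Hypothetical cold-air conditions near the ice edge (illustration only).
U, T_AIR, T_ICE, T_WATER = 10.0, 258.0, 259.0, 271.35

shf_sharp = gridbox_shf(0.9, U, T_AIR, T_ICE, T_WATER)    # ice fraction as "observed"
shf_smooth = gridbox_shf(0.5, U, T_AIR, T_ICE, T_WATER)   # ice fraction as "smoothed"
print(round(shf_sharp, 1), round(shf_smooth, 1))
```

Because the open-water flux is much larger than the flux over ice in these assumed conditions, the blended flux is very sensitive to the ice fraction, so a smoothed ice field gives a smoothed flux field even if the exchange coefficients themselves are perfect. In practice the exchange coefficients also differ between ice and water and vary with stability, which is the separate parameterization question raised in the first bullet above.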

CONCLUSIONS
A comprehensive evaluation of surface-layer meteorology and surface turbulent fluxes in ERA5 for winter conditions over the Iceland and Greenland Seas has been presented. Observations from three platforms (a meteorological buoy, a research vessel, and a research aircraft) provide unparalleled coverage of both the ice-free ocean and the marginal ice zone (MIZ) that is independent of the reanalyses and forecasts. These observations allow the first evaluation of meteorological reanalyses that focuses on the MIZ. In general, ERA5 performs well: it captures the temporal variability very well and the spatial variability qualitatively well. The biases are significantly less than the observed standard deviations for all variables. Over water, ERA5 performs very well and broadly in line with previous evaluations of ERA-I for the subpolar seas (e.g., Harden et al., 2015). Over the MIZ, ERA5 is less accurate for almost all variables. This is clearly demonstrated by contrasting the RMSE over water versus over the MIZ:
• For air temperature, surface temperature, and wind speed: 0.78/1.00 K, 0.47/2.94 K, and 1.77/2.42 m/s from the aircraft comparison (Table 4)
• For momentum flux, sensible heat flux, and latent heat flux: 0.077/0.098 N/m², 20.9/35.2 W/m², and 14.2/20.1 W/m² from the ship comparison (Table 3).
There is also a bias in surface temperature over the MIZ of about 1 K in the aircraft comparison. A parallel evaluation of a set of forecasts from a 2-km configuration of the MetUM yields similar findings.
The primary cause of the lower accuracy over the MIZ is an overly smooth sea-ice distribution in the prescribed surface boundary conditions. These use the OSTIA SST and sea-ice analysis, which takes sea-ice concentration from the OSI-SAF 401 product. The OSTIA sea-ice concentration gradient is too weak compared with aircraft observations, SAR imagery, or satellite observations from AMSR2. This has an impact on the surface-layer meteorology and fluxes, which also have gradients that are too weak across the MIZ. It is likely that the surface exchange parameterization over the MIZ also has some limitations, but these appear secondary.
Our findings suggest the hypothesis that a more accurate and precise sea-ice concentration would yield a better performance from meteorological reanalyses or forecasts for surface-layer meteorology and fluxes in the marginal ice zone. There is evidence from idealized modeling studies that the atmospheric surface layer is strongly impacted by the sea-ice distribution, both locally and for hundreds of kilometers downstream (e.g., Liu et al., 2006; Gryschka et al., 2008; Chechin et al., 2013; Müller et al., 2017; Batrak and Müller, 2018), and case studies have shown that an improved sea-ice distribution can improve the surface-layer meteorology (e.g., Outten et al., 2009; Smith et al., 2013). Verifying this hypothesis for the IGP data or, more generally, for the subpolar North Atlantic region should be a next step and would provide further motivation for improving the sea-ice data used as initial conditions in meteorological reanalyses and forecasts.

ACKNOWLEDGMENTS
This study was part of the Iceland Greenland Seas Project. Funding was from the NERC AFIS grant (NE/N009754/1), the ALERTNESS (Advanced models and weather prediction in the Arctic: enhanced capacity from observations and polar process representations) project (Research Council of Norway project number 280573), the Trond Mohn Foundation (BFS2016REK01), and the National Science Foundation grant OCE-1558742. The Leosphere WindCube v2 and the Wavescan buoy are part of the OBLO (Offshore Boundary Layer Observatory) infrastructure funded by the Research Council of Norway (project number 227777). We would like to thank all those involved in the fieldwork associated with the IGP, including L. Papritz, E. Kolstad, and M. Hallerstig. We would also like to thank the reviewers for their comments. Fully quality-controlled meteorological datasets from the buoy, research vessel, and aircraft are available at CEDA (www.ceda.ac.uk), and for the buoy data at thredds.met.no. This paper contains modified Copernicus Climate Change Service Information (2020). The MetUM simulations were carried out on MONSooN, a collaborative facility supplied under the Joint Weather and Climate Research Programme, a strategic partnership between the Met Office and the NERC. Visualisation software used: MATLAB, Matplotlib: A 2D Graphics Environment (J.D. Hunter, Computing in Science & Engineering, vol. 9, no. 3, pp. 90-95, 2007)