Adapting the SAL method to evaluate reflectivity forecasts of summer precipitation in the central United States

The Structure Amplitude Location (SAL) method was originally developed to evaluate forecast accumulated‐precipitation fields through identification and comparison of objects in both the forecast and the observed fields. This study describes a small modification for use with instantaneous composite‐reflectivity forecasts, where objects' minimum size and reflectivity thresholds are prescribed. Both the original and modified SAL methods are used to evaluate daily 0000 UTC 12‐km North American Model (NAM) forecasts, against NCEP/EMC 4‐km Stage IV accumulated‐precipitation estimates, during the summer of 2015 for a central US domain. Results show substantial sensitivity to the reflectivity threshold. This is likely related to sampling more signal from convective cell cores, and progressively ignoring stratiform rain areas, as threshold increases. Setting the threshold too high (40 dBZ) yields only 7% of time periods on which error scores can be computed, as opposed to 94% using a low threshold (5 dBZ). The primary difference between the two methods is a larger structural error in SAL using reflectivity, likely related to the unresolved convective peaks in the 12‐km NAM forecasts; this error is smoothed out when accumulated precipitation is evaluated. SAL using reflectivity also reveals a diurnal cycle of skill, with minimum skill occurring around 1800–2200 UTC (early to late afternoon local time, before average convective activity reaches its maximum) and maximum skill occurring around 1000 UTC (just before sunrise). We conclude that both methods yield useful results, but results presented herein may not be generalisable to other verification domains or SAL formulations.


Introduction
The Structure Amplitude Location (SAL) method (Wernli et al., 2008; hereafter W08) evaluates accumulated-precipitation fields by identifying objects in both a forecast and an observed field at a given time, and decomposing differences (i.e. error) into three components. This procedure avoids the double penalisation for timing and location errors inherent in verification measures such as root mean square error. The errors are normalised by the size of the domain and the domain-wide accumulation so that many cases on the same grid can be compared.
The power of SAL also lies in its ability to evaluate the type of error. The structural component S (between −2 and 2) considers the gradient around the object and its size. For instance, a negative S component may indicate too high a radial gradient in the forecast objects, such as a forecast of convective cells when a stratiform area is observed. The amplitude component A (between −2 and 2) considers domain-wide accumulation. Finally, the location component L (between 0 and 2) consists of two parts. One part (L1) measures location differences in centres of mass for the domain-wide observed and forecast fields; the other part (L2) accounts for location differences of all objects weighted by their integrated precipitation. However, as with all so-called objective schemes, there is a subjective element. SAL scores may on occasion be highly sensitive to the choice of minimum threshold: a bimodal distribution of precipitation may be identified as either one object or two, depending on the threshold (termed the 'camel effect' in W08).
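To make the normalisation of the A component and of the first location term concrete, the following minimal Python sketch assumes the W08 definitions as they are commonly stated: A is the difference in domain-mean values divided by their average, and L1 is the distance between the domain-wide centres of mass divided by the largest distance across the domain. The function names, the list-of-lists field layout, and the use of gridpoint indices as coordinates are illustrative assumptions, not part of the original method description.

```python
import math


def domain_mean(field):
    """Mean of all gridpoint values in a 2-D field (list of rows)."""
    values = [v for row in field for v in row]
    return sum(values) / len(values)


def amplitude(forecast, observed):
    """A component: normalised difference of domain means, in [-2, 2]."""
    d_fc, d_ob = domain_mean(forecast), domain_mean(observed)
    return (d_fc - d_ob) / (0.5 * (d_fc + d_ob))


def centre_of_mass(field):
    """Value-weighted mean (x, y) position, in gridpoint index units."""
    total = sum(v for row in field for v in row)
    y = sum(j * v for j, row in enumerate(field) for v in row) / total
    x = sum(i * v for row in field for i, v in enumerate(row)) / total
    return x, y


def location_l1(forecast, observed):
    """L1: centre-of-mass separation scaled by the domain diagonal, in [0, 1]."""
    ny, nx = len(forecast), len(forecast[0])
    # Assumption: the largest distance in the domain is the corner-to-corner
    # distance between gridpoint centres.
    d_max = math.hypot(nx - 1, ny - 1)
    (x_fc, y_fc), (x_ob, y_ob) = centre_of_mass(forecast), centre_of_mass(observed)
    return math.hypot(x_fc - x_ob, y_fc - y_ob) / d_max
```

For a forecast that places three times the observed amount in the opposite corner of the domain, this sketch gives A close to 1 and L1 equal to 1, i.e. a substantial overestimate and the largest possible displacement of the centre of mass.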
SAL has shown its flexibility in previous studies, such as where a potential vorticity anomaly component replaced S (Madonna et al., 2015), or where SAL was used to evaluate interpolated tracer-plume data sets (Dacre, 2011). Given the connection between radar reflectivity and precipitation accumulation, the authors have adapted SAL for use with composite reflectivity for evaluating forecasts of moist convection in the Great Plains of the United States. This study addresses the differences between the original SAL formulation (hereafter SAL acpc ) and its slight modification for evaluating instantaneous composite reflectivity (hereafter SAL cref ) in Section 2. To compare the two SAL methods over an extended period, North American Model (NAM) forecasts of composite reflectivity and 24-h accumulated precipitation were verified over the central United States with radar observations and multisensor estimates of precipitation, respectively. The method and data sources are detailed in Section 3, and results are presented and analysed in Section 4. We discuss interpretation of SAL cref , along with concluding statements, in Section 5.

SAL modification
The SAL cref method has three modified aspects: (1) instantaneous composite reflectivity is used instead of accumulated precipitation; (2) the minimum area of the object (herein termed its footprint) is specified; and (3) the minimum object threshold is explicitly set in dBZ. The latter two changes, while small, were motivated by challenges presented by the noisy nature of the instantaneous reflectivity field, and the nonlinear relationship between precipitation and reflectivity that precludes simple extrapolation of SAL acpc to a reflectivity field.
SAL acpc deals with a smoothed field (for instance, precipitation accumulated where frontal systems have traversed), whereas SAL cref is an instantaneous snapshot of the reflectivity field. The noisier nature of a reflectivity field hence necessitates a minimum footprint. Smaller footprint and threshold parameters yield more, smaller objects in SAL cref than in SAL acpc . Note that the increased likelihood of multiple objects in at least one data set means the L component is more likely to be larger (due to a non-zero L2 component), and increases the potential frequency of the camel effect. As S is computed with the average of all objects' scaled volume, a large structural error will occur if, for example, the observations contain a quasi-linear convective system in which many strong convective cores are joined by weak stratiform rain (e.g. in the trailing stratiform mode) and so appear as one object, while the simulated field has less stratiform precipitation and appears as numerous cell objects.
In addition, it is desirable that the footprint be large enough to ignore radar clutter and clear-air backscatter, but not so large that growing convective cells suddenly appear or disappear as objects between forecast hours and cause a large step increase or decrease in SAL error. Preliminary testing with object identification on a 4-km grid found a footprint of ∼200 gridpoints (3200 km²) subjectively related best to the field, with little variation when the footprint was varied between 100 and 500 gridpoints (1600 and 8000 km²). When a footprint value is not applied, the resultant large number of objects in the composite-reflectivity field degrades the signal-to-noise ratio (i.e. the camel effect occurs more often). Hence, all SAL cref computations in this study use a footprint of 200 gridpoints (3200 km²).
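The combined effect of a reflectivity threshold and a minimum footprint on object identification can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: a gridpoint is treated as active when its value is at or above the threshold, objects are groups of active gridpoints joined through their four edge neighbours, and the function name find_objects and the list-of-lists field layout are hypothetical. The text does not specify which connectivity or inequality the authors used.

```python
from collections import deque


def find_objects(field, threshold, footprint):
    """Return objects as lists of (row, col) gridpoints.

    Assumptions of this sketch: a gridpoint is active if value >= threshold,
    objects are 4-connected, and objects with fewer than `footprint`
    gridpoints are discarded.
    """
    ny, nx = len(field), len(field[0])
    seen = [[False] * nx for _ in range(ny)]
    objects = []
    for j0 in range(ny):
        for i0 in range(nx):
            if seen[j0][i0] or field[j0][i0] < threshold:
                continue
            # Breadth-first flood fill from this unvisited active gridpoint.
            seen[j0][i0] = True
            queue, cells = deque([(j0, i0)]), []
            while queue:
                j, i = queue.popleft()
                cells.append((j, i))
                for dj, di in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    jj, ii = j + dj, i + di
                    if (0 <= jj < ny and 0 <= ii < nx
                            and not seen[jj][ii]
                            and field[jj][ii] >= threshold):
                        seen[jj][ii] = True
                        queue.append((jj, ii))
            if len(cells) >= footprint:
                objects.append(cells)
    return objects
```

On the 4-km grid described above, a call such as find_objects(field, threshold=15, footprint=200) would correspond to a 15-dBZ threshold with the 200-gridpoint (3200 km²) footprint; raising the footprint removes small isolated echoes, while raising the threshold tends to split broad objects into their stronger cores.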
A constant reflectivity threshold is set to (1) focus further on moist convection as the threshold increases and (2) ignore clear-air backscatter and/or radar clutter. An increase in threshold should not degrade the quality of the S- and A-component scores if the threshold allows a large sample size of cases: as S is computed using a weighted mean, the splitting of a larger object into its convective cores would not detract from the component's rationale. Note that the A component is not sensitive to the threshold. However, as mentioned before, the camel effect is compounded as the number of objects increases, and L is more likely to be non-zero and/or large. If the threshold becomes too large, SAL cref ignores an increasingly large number of cases in which the minimum threshold is not reached in either the model or observations, as noted in W08, yielding results biased towards well-forecast cases.
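The weighted-mean character of S can be illustrated with a short Python sketch. It assumes the W08 formulation as commonly stated: each object has a scaled volume equal to its integrated value divided by its maximum value, the field-wide volume is the mean of these weighted by each object's integrated value, and S is the normalised difference between forecast and observed volumes. Objects are represented here simply as lists of positive gridpoint values; that layout and the function names are assumptions made for illustration.

```python
def scaled_volume(objects):
    """Weighted mean of per-object scaled volumes V_n = R_n / max(object).

    `objects` is a list of objects, each a list of positive gridpoint values.
    Each object's weight is its integrated value R_n.
    """
    totals = [sum(obj) for obj in objects]
    volumes = [total / max(obj) for total, obj in zip(totals, objects)]
    return sum(t * v for t, v in zip(totals, volumes)) / sum(totals)


def structure(forecast_objects, observed_objects):
    """S component: normalised difference of scaled volumes, in [-2, 2]."""
    v_fc = scaled_volume(forecast_objects)
    v_ob = scaled_volume(observed_objects)
    return (v_fc - v_ob) / (0.5 * (v_fc + v_ob))
```

With a flat forecast object [2, 2, 2, 2] and a peaked observed object [1, 1, 4, 1, 1], both of which integrate to 8, the sketch returns a positive S (about 0.67): the forecast object is too flat even though the integrated amounts agree.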

Data and methods
We evaluated the 12-km NAM model using 0000 UTC initialisations between 1 April 2015 and 31 August 2015 inclusive, verifying 24-h precipitation accumulation between 12 and 36 forecast hours, and composite reflectivity forecasts valid hourly from forecast hour 12 up to, but excluding, forecast hour 36 (the 36th hour was omitted so that the first and last hours were not counted twice in the statistics).
A 4-km grid was arbitrarily defined inside the continental United States (Figure 1) as a common grid to which observations and forecast data were interpolated linearly. (Stage IV and NAM data were interpolated using the SciPy function scipy.interpolate.griddata, because of the non-rectangular grid. Composite reflectivity observations, due to the large data set, were interpolated using the more efficient SciPy class scipy.interpolate.RectBivariateSpline with no smoothing (i.e. interpolation only), which is possible because of the regular latitude-longitude grid of the data set.) This is a similar method to W08. As in Barrett et al. (2015), our preliminary testing found negligible sensitivity of the reflectivity and precipitation fields to our re-projection and interpolation methods. We may expect the NAM forecast model (in which convection is parameterised, and whose output is afterwards interpolated to a finer grid) to develop objects that are too flat (S > 0). This may be perceived as a limitation of our methodology, but served as a check for our tests.
Verification of forecast reflectivity was performed with composite NEXRAD Level III radar reflectivity from archives at the Iowa State University (https://mesonet.agron.iastate.edu/docs/nexrad_composites/, accessed 1 February 2016). Base reflectivity product data are composited through the GEMPAK program nex2img, after which suspected false echoes are removed through comparison with the Net Echo Top product. Gridded accumulated-precipitation data sets (NCEP/EMC 4-km Stage IV), created using rain gauge and radar observations, were obtained from the Earth Observing Laboratory (http://data.eol.ucar.edu/).
Six of the 153 days in our period had missing archived NAM forecast data, and one other day had missing accumulated-precipitation data. These days were removed, leaving 146 days for SAL acpc and 147 days for SAL cref . We ran SAL cref for four thresholds: 5, 15, 30, and 40 dBZ. As the SAL methodology requires identification of at least one object in both the observed and the forecast fields, times or periods that resulted in spurious SAL-component scores (i.e. exactly 0 or ±2) were removed. This did not affect the number of SAL acpc days, but reduced the 3672 instances of composite reflectivity to 3458, 3408, 2560, and 254 times, for 5-, 15-, 30-, and 40-dBZ thresholds, respectively. The number of ignored instances of composite reflectivity increases with threshold because higher thresholds eliminate more areas of reflectivity, and are therefore more likely to remove every object from a field. This is particularly drastic for the 40-dBZ threshold, where only 7% of times contained objects in both forecast and observational data sets.
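The sample-size figures quoted above can be checked with simple arithmetic. The short Python sketch below uses only numbers stated in the text (153 days, 24 hourly valid times per day, and the retained counts per threshold); the variable names are illustrative.

```python
DAYS = 153           # 1 April to 31 August 2015 inclusive
HOURS_PER_DAY = 24   # forecast hours 12 to 35
total_times = DAYS * HOURS_PER_DAY

# Number of times retained at each reflectivity threshold (dBZ), from the text.
retained = {5: 3458, 15: 3408, 30: 2560, 40: 254}
percent_retained = {
    dbz: round(100 * count / total_times) for dbz, count in retained.items()
}

for dbz, pct in percent_retained.items():
    print(f"{dbz:>2} dBZ: {retained[dbz]} of {total_times} times retained ({pct}%)")
```

The product 153 × 24 reproduces the 3672 instances, and the retained fractions round to 94%, 93%, 70%, and 7% for the 5-, 15-, 30-, and 40-dBZ thresholds, respectively.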

Accumulated precipitation (SAL acpc )
We found that NAM forecasts only weakly overestimate accumulated precipitation on the majority of days in the data set (Figure 2). Objects were too flat, which is likely related to the 12-km horizontal resolution of the NAM forecasts. The positive correlation between S and A components is unsurprising, as discussed in W08, due to the physical relationship between larger objects and larger domain-wide accumulation. There is no obvious relationship between these two components and L-component error.

Composite reflectivity (SAL cref )
Due to the high volume of data, we focus on the 30-h NAM reflectivity forecast (i.e. 0600 UTC on day 2) as an example here. This is around the time of maximum thunder occurrence for summer months in the central United States (Easterling and Robinson, 1985). At lower thresholds (5 and 15 dBZ; Figure 3(a) and (b)), the positive correlation between S and A components is similar to that for SAL acpc (Figure 2). However, at 30 dBZ (Figure 3(c)), the line of best fit (not shown) is more parallel with the x-axis: the A-component error remains positive regardless of S error for almost all points. This suggests the stronger signal ratio from convective cells over stratiform precipitation results in positive A error not simply because of the size of objects, but from radial reflectivity gradients in the objects. As discussed in W08, positive S and negative A (bottom right quadrant) can occur when a forecast misses an observed convective cell. We note the S-A relationships shown here, valid at forecast-hour 30 for all four thresholds, are consistent throughout the whole 24-h period. We also find the variation in all three components, represented by the 'spread' of points and their colours in Figure 3, is smallest around 24 forecast hours (0000 UTC) and largest between 30 and 36 forecast hours (0600 and 1200 UTC; not shown). In other words, systematic errors in simulated composite reflectivity dominate SAL statistics in the early night period, while random errors dominate towards sunrise.
Median S-component error, denoted by the black vertical broken line in Figures 2 and 3, is similar (0.6-0.7) in both SAL acpc (Figure 2) and SAL cref (Figure 3), but only when the threshold of the latter is set at 15 dBZ or higher. However, despite the larger S component when using the 5-dBZ threshold (Figure 3(a)), the A component is around the same as at other thresholds (∼0.5; Figure 3(b)-(d)). This suggests that weak stratiform (<15 dBZ) precipitation areas are forecast with too large an areal coverage and too weak a magnitude.
When median S and A components are plotted for each hour over the whole data set, we find a diurnal loop as shown in Figure 4. Note that the 40-dBZ threshold is not discussed here. The earlier peak in A may represent forecast cell initiation that grows too quickly, while the later peak in S may be related to upscale growth (forecast reflectivity objects are too stratiform). The progression of the trajectories towards the origin suggests an increase in forecast skill towards and after sunset, as diurnal convection decays and nocturnal systems develop. This signal that nocturnal systems are better forecast may be due to the propensity of mesoscale convective systems to occur at night (Markowski and Richardson, 2010 and references therein), whose length scales are larger than (typically daytime) single-cell convection, and whose predictability is therefore theoretically larger (Lorenz, 1969; Clark et al., 2007). Yan and Gallus (submitted to Monthly Weather Review) found NAM forecasts of precipitation to be more skillful between midnight and early morning, and least skillful near noon. While this corroborates results presented here, we note that our location error (L) is an order of magnitude smaller than structural (S) and amplitudinal (A) error, whereas displacement error was the main source of low forecast skill in Yan and Gallus. However, as SAL components are normalised, we expect L to be small because of our large domain size.
As each SAL component is normalised (i.e. by how wrong a forecast could possibly be in each component), the authors propose that the total of absolute SAL component values (taSAL) at each time or time period allows an estimate of forecast skill (taSAL = |S| + |A| + |L|). Ultimately subjective in nature because of its formulation, a skill score may be considered an objective method to reflect how the human eye subjectively perceives numerous error components: quantifying how the total error limits the utility of a forecast. For example, the Model Evaluation Tool (MET; http://www.dtcenter.org/met/users/) software package Method for Object-Based Diagnostic Evaluation (MODE) weights its error components subjectively and empirically, each potentially with different values.
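The taSAL definition above translates directly into code. The following Python sketch is a minimal illustration; the function name ta_sal and the range checks are assumptions added here, with the ranges taken from the component bounds given in the Introduction.

```python
def ta_sal(s, a, l):
    """Total of absolute SAL component values: |S| + |A| + |L|.

    With S and A in [-2, 2] and L in [0, 2], the result lies in [0, 6],
    where 0 indicates no error in any component.
    """
    if not (-2 <= s <= 2 and -2 <= a <= 2 and 0 <= l <= 2):
        raise ValueError("SAL components outside their normalised ranges")
    return abs(s) + abs(a) + abs(l)
```

For purely illustrative component values of S = 0.6, A = 0.5, and L = 0.05, the sketch gives a taSAL of about 1.15 out of a possible 6, and the small L term contributes little to the total.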
Herein, we propose a simple combination of equal weighting to all three components as a starting point. A different weighting combination should be tested in subsequent studies. The median taSAL values at each forecast time are shown in Figure 5(a) for 5-, 15-, and 30-dBZ reflectivity thresholds. The decrease of error with increased dBZ threshold may be related to a smaller area within each object over which scaled volume can vary, and to objects being restricted to convective cores, both of which lower the 'area of freedom' for potential dBZ values. Additionally, as noted in Section 2 and W08, the smaller error could simply be due to the increasing neglect of missed events and false alarms as the threshold is raised. Note that, because stratiform precipitation is increasingly ignored at the larger thresholds associated with lower error, more power in the SAL signal is given to less predictable convective precipitation (Zhang et al., 2003). Hence, one might instead have expected reduced skill for convective-object forecasts. We also present in Figure 5(b) the average percentage of pixels in both the observed (darker shades) and forecast (paler shades) fields that are active, i.e. above the thresholds denoted by the same colour scheme and belonging to an object larger than the prescribed footprint. The average is computed for each hour over the whole summer and, because of the propensity of thunderstorms within our chosen domain, serves as a proxy for convective activity. Figure 5(b) shows that peak activity occurs at 2300 UTC (6 pm local time) in simulated data, and at 0400 UTC (11 pm local time) in the observed data, with both displaying a sinusoidal pattern similar to that in taSAL (Figure 5(a)).

Figure 5 caption (partial): Active pixels are defined as those belonging to an object at the given threshold and footprint at a given time. The vertical black line in panel (a) denotes 0000 UTC on day 2 (forecast hour 24), for reference.

The 5-h difference in convective maxima between observed and forecast fields suggests that forecast error could well contain a large timing component. Furthermore, the peak taSAL error occurs earlier, while observed and forecast convective activity is still growing, suggesting that forecast error may develop within growing late afternoon convection.

Conclusions
This study presented modifications to the original SAL methodology (SAL acpc ) to verify composite reflectivity fields (SAL cref ), instead of accumulated precipitation. We evaluated NAM forecasts for a summer (April-August inclusive) season in the central United States with both SAL methods to gauge the modifications' impact. The two methods draw different conclusions from their respective fields. SAL cref demonstrated a larger positive S-component error, likely related to the inability of the convection-parameterising forecast model to resolve peak maxima associated with convective cells, and the smoother nature of the accumulated-precipitation field in SAL acpc . The positive correlation between S and A components is expected because of the physical relationship between object size and domain-wide composite reflectivity. However, this correlation is not apparent when the minimum threshold of SAL cref is raised to 30 dBZ. SAL cref reveals a diurnal cycle of skill, with forecasts best in the early morning and worst around noon, with similar patterns in all three SAL components. The lag in convective activity after the largest total SAL error suggests that timing and simulated character of developing convection may be responsible for a substantial proportion of forecast error. These results show the need to set a threshold and footprint small enough to give a sufficient sample size, but large enough to capture the signal of interest, be it convective or stratiform in nature. Further work may investigate whether SAL scores of hourly maximum composite reflectivity forecasts show less sensitivity to threshold, because of the smoother nature of that field. Our results also reiterate that interpretation of SAL must be isolated to the threshold and field chosen. For instance, use of a large domain reduces the impact of the L component as object displacements are normalised by the diagonal length.
Further investigation is needed into the optimal weighting of SAL components to create a skill score. A remaining limitation of the SAL method relating to moist convection is its inability to measure error in orientation of objects. The authors are aware of the MET package MODE, another object-based evaluation system that considers orientation, but which lacks some simplicity and portability of SAL. A fourth component that considers the mode and orientation of convection may improve SAL's utility for reflectivity fields.