Using machine learning to predict fire-ignition occurrences from lightning forecasts

Lightning-caused wildfires are a significant contributor to burned areas, and lightning ignitions remain one of the most unpredictable aspects of the fire environment. There is a clear connection between fuel moisture and the probability of ignition; however, the mechanisms are poorly understood and predictive methods are underdeveloped. Establishing a lightning-ignition relationship would help in developing a model to complement early warning systems designed for fire control and prevention. A machine learning (ML) approach was used to define a predictive model for wildfire ignition based on lightning forecasts and environmental conditions. Three binary classifiers were adopted: a decision tree, an AdaBoost and a Random Forest. Results are promising, with both ensemble methods (Random Forest and AdaBoost) exhibiting an out-of-sample accuracy of 78%. Data provided by a Western Australia wildfire database allowed a comprehensive verification of over 145 lightning-ignited wildfires in regions of Australia during 2016. This highlighted that in at least 71% of the cases the ML models correctly predicted the occurrence of an ignition when a fire was actually initiated. The super-learner developed is planned for use in an operational context to enhance the information available for fire management.


| INTRODUCTION
Although human activity is widely recognized as the primary cause of wildfires in most ecosystems (Amatulli et al., 2007; Archibald et al., 2009; Ganteaume et al., 2013), lightning-caused ignitions are not uncommon, and the majority of burned areas in remote regions of the Boreal and Arctic biomes have a lightning origin (Krider et al., 1980; Granström, 1993). Lightning is a major cause of wildland fires in Canada (Hanes et al., 2019); in Alberta, for example, it is responsible for igniting 45% of reported wildfires and 71% of the area burned (Blouin et al., 2016). There is a clear connection between fuel dryness, its availability and the probability of a lightning ignition (Flannigan and Haar, 1986), even if complex non-environmental variables can play a role (Anderson, 2002). Simplified conceptual models associating a danger of ignition with lightning have been proposed to derive generic danger levels. These can be used as one of various components to assess fire danger in rating systems (e.g. the lightning activity level in the National Fire Danger Rating System) (Fuquay, 1979). Still, the mechanisms associated with ignitions are poorly understood, and predictive methods to identify a cause-effect relationship have been limited to specific regions (Flannigan and Wotton, 1991) and applied as interpretative tools for the analysis of past events (Wierzchowski et al., 2002; Wotton and Martell, 2005).
In recent years, efforts have been devoted to the forecast of lightning activity using both physical models of various complexity (Price and Rind, 1992; Grewe et al., 2001; MacGorman et al., 2001; Allen and Pickering, 2002; McCaul Jr et al., 2009; Lopez, 2016; Finney et al., 2018) and statistical methods/machine learning (ML) approaches (Bates et al., 2018; Mostajabi et al., 2019). To a certain degree, lightning activity can be predicted with a reasonable level of confidence because it is linked to particular weather conditions. A key ingredient is intense convection: the rising of plumes of moist air in response to instability in the atmosphere, usually associated with the development of cumulonimbus clouds. Such atmospheric conditions often involve collisions between hydrometeors such as graupel, hail, ice particles and liquid water droplets. The collisions may cause electric charges in those particles to separate, which leads to the creation of oppositely charged layers inside the clouds. The resulting electric fields can lead to the powerful discharges observed as lightning (Takahashi, 1978). One of the most significant challenges in assessing lightning's influence on ignition lies in actually predicting whether a thunderstorm is going to be wet or dry, that is, accompanied by precipitation or not. Weather conditions and remotely sensed soil moisture data, as provided by microwave sensors, are ideally placed to discriminate between these two situations. Other environmental variables could also be either confounding or triggering factors. Ultimately, as the physical mechanisms to predict lightning ignitions are still largely unknown, the present paper explores the potential of ML techniques to provide a data-driven solution that could be applied globally.
The aim is to derive a product able to link lightning predictions with the observed occurrence of fires and to provide a quantitative measure to identify those lightning flashes that could be conducive to fires.
As the availability of fire databases, including ignition causes, is limited, the study does not try to predict lightning-ignition events directly, but rather focuses on a methodology to identify the environmental factors present at the time a fire was observed following lightning flashes predicted within the previous 24 hr. The result is a forecasting product that could add valuable information to existing raw lightning forecasts in aid of early warning systems designed for fire control and prevention.
| DATA
| Lightning prediction
Using a parameterization developed by Lopez (2016), the European Centre for Medium-range Weather Forecasts (ECMWF) has been providing a lightning density forecast to its users since 2018. This is defined as the average number of flashes in a given time interval. Experiments show that the forecasts for lightning can have useful skill at least up to day 3. The skill depends on the averaging window, so that when the integration is performed over the next 24 hr and for large regions, very good agreement with observations can be achieved (Lopez, 2016). Lightning flashes can be categorized into either cloud-to-ground (CG) or intra- and intercloud (IC) discharges. Lopez's parameterization, in its operational implementation, predicts the total lightning flash (IC + CG) density F_T (flashes·km⁻²·day⁻¹) as:

F_T = α Q_r √CAPE Z_base

where α is a tuneable coefficient (currently set to 37.5 to match the climatological annual global mean flash rate from the Tropical Rainfall Measuring Mission (TRMM) satellite Lightning Imaging Sensor (LIS)/Optical Transient Detector (OTD) climatology, as described by Cecil et al., 2014); Z_base is the convective cloud base height (km); Q_r is a proxy for the charging rate resulting from the collisions between graupel particles and other types of hydrometeors; and CAPE is the convective available potential energy (J·kg⁻¹). An estimate of the cloud-to-ground density F_CG is calculated from the total flash density by a simple conversion:

F_CG = F_T / (1 + F(H_c))

where H_c is the thickness of the convective cloud above the freezing level and F(H_c) is the IC:CG partition function proposed by Price and Rind (1993), valid for 5.5 < H_c < 14 km. A database of CG lightning flash density (averaged over a 6 hr time window) was then computed using the TCo399 configuration of the ECMWF's Integrated Forecasting System (IFS) (roughly corresponding to a grid size of 28 km) for the full year 2016.
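The two-step calculation can be sketched in Python. The function names are ours; the partition polynomial is reproduced from Price and Rind (1993), and the exact operational functional form should be checked against Lopez (2016):

```python
import numpy as np

def total_flash_density(q_r, cape, z_base, alpha=37.5):
    """Total (IC + CG) flash density F_T (flashes km^-2 day^-1).

    Sketch of the Lopez (2016) parameterization as described in the
    text: the charging-rate proxy q_r times the square root of CAPE
    (J kg^-1) times the convective cloud base height z_base (km),
    scaled by the tuneable coefficient alpha (37.5 operationally).
    """
    return alpha * q_r * np.sqrt(cape) * z_base

def cg_flash_density(f_total, h_c):
    """Cloud-to-ground share of the total flash density.

    F_CG = F_T / (1 + F(H_c)), with F(H_c) the IC:CG partition of
    Price and Rind (1993), a fourth-order polynomial in the cloud
    thickness above the freezing level H_c (km), valid for
    5.5 < H_c < 14 km.  Clipping H_c to that range is our assumption.
    """
    h = np.clip(h_c, 5.5, 14.0)
    ratio = 0.021 * h**4 - 0.648 * h**3 + 7.49 * h**2 - 36.54 * h + 63.09
    return f_total / (1.0 + ratio)
```

By construction F_CG is always a fraction of F_T, since the IC:CG ratio is positive over the valid H_c range.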

| Lightning-ignition database
While there are local databases of wildfire ignitions categorized as having been initiated by lightning, they do not provide the global coverage needed for the present study. A training data set was constructed by identifying wildfire ignition points (IPs) and combining these data with the ECMWF's CG lightning density product to produce a data set of fire ignition points that could potentially be attributed to lightning (IPLs) (Figure 1).
The technique used to construct the IPL data set relied on the use of fire radiative power (FRP) (W·m⁻²) observations from the operational Global Fire Assimilation System (GFAS), which has been available on a daily basis since 2003 (Kaiser et al., 2012; Di Giuseppe et al., 2016a, 2016b). The procedure uses a series of decision nodes to identify new ignitions. We first exclude points that are burning as the result of previous fires (1) or are spreading from neighbouring areas (2). The procedure then removes fires that can be attributed to agricultural practices, either by being shorter than a day, by being of low intensity (FRP < 0.5 W·m⁻²) or by being placed in an agricultural biome (3). The agricultural biomes are defined as the combination of agricultural with organic soils (AGOS) and agricultural (AG) in the GFAS database (Van der Werf et al., 2017).
Any newly active fires with no fire spread and no agricultural origin are deemed IPs. Depending on the CG lightning density, points identified as IPs (6) are then split into IPLs and non-IPLs. As non-IPL observations are several orders of magnitude more numerous than IPLs, only a randomly selected sample (of the same size as the IPL sample) is used to train the binary classifiers described below in Section 3.
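The splitting and rebalancing step can be sketched as follows (the function name, zero density threshold and random seed are illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def build_training_set(cg_density, features, threshold=0.0):
    """Label ignition points as IPL (forecast CG lightning density
    above the threshold) or non-IPL, then under-sample the far larger
    non-IPL class down to the IPL sample size."""
    is_ipl = cg_density > threshold
    ipl_idx = np.flatnonzero(is_ipl)
    non_idx = np.flatnonzero(~is_ipl)
    # random under-sampling: keep as many non-IPLs as there are IPLs
    non_idx = rng.choice(non_idx, size=len(ipl_idx), replace=False)
    keep = np.concatenate([ipl_idx, non_idx])
    return features[keep], is_ipl[keep].astype(int)  # 1 = IPL, 0 = non-IPL
```

The returned arrays are balanced by construction, which is the property the classifiers in Section 3 rely on.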

| Environmental predictors
A comprehensive set of environmental variables was considered as candidate predictors of lightning ignition. The selection is based on capturing not only the start of a fire event but also its sustainability. The probability of fuel ignition by lightning has been shown to be relatively independent of fuel moisture (Latham and Williams, 2001), while its survival is highly dependent on it (Wotton and Martell, 2005), which is why the concept of dry lightning is important. To capture these conditions, a crucial aspect is physical consistency between the environmental factors and the lightning prediction. Given the global nature of the present study, and the focus on developing a predictive model that could be used in real time, the majority of the meteorological fields are taken from the IFS operational analysis as either raw outputs or derived parameters (Table 1).
F I G U R E 1 Flow chart representing the methodology employed in the construction of the data set that identifies fire ignition points possibly caused by lightning. Note: FRP, fire radiative power.

One exception is the soil moisture field. Even if great progress has been made in soil moisture initialization in weather prediction (Albergel et al., 2012), this variable is still often used as a tuning parameter to adjust the prediction of screen-level diagnostics (Hess, 2001; Di Giuseppe et al., 2011). Thus, despite recent efforts to include remote observations in soil moisture initialization in order to better characterize the water balance of the surface (De Rosnay et al., 2013), the simplicity of the land surface schemes used in weather forecast models makes them inappropriate to fully represent hydrological processes across the globe (Zsoter et al., 2019). For this reason, remote observations of soil moisture are also tested as possible predictors. The near real-time neural network soil moisture product from the Soil Moisture and Ocean Salinity (SMOS) satellite is used in the surface assimilation cycle of the ECMWF (Sabater et al., 2017). Data are available in swath segments on the ISEA 4H9 grid, with grid data provided as discrete global grid (DGG) points, only available along the satellite track with a nominal resolution of 15 km. Soil moisture retrievals undergo quality control and filtering identical to that applied before ingestion in the ECMWF's IFS data assimilation. Data are discarded if any of these four conditions are met:
• DGG ID number < 0.
Daily swath files are then regridded onto the 0.1° grid resolution using the inverse-distance-to-a-power interpolation algorithm (GDAL/OGR contributors, 2020). Three days of daily swath files were missing between March 14 and 16, 2016; missing data were imputed using the previous days' observations. The availability time for these data varies because it depends on the availability of the satellite observations and the survival rate through the reprocessing and quality-assurance stages. However, data are usually available every 3-5 days.
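A minimal stand-in for the inverse-distance-to-a-power regridding is sketched below (the paper uses the GDAL implementation; the neighbour count `k` and `power` here are assumed settings, and the function name is ours):

```python
import numpy as np
from scipy.spatial import cKDTree

def idw_regrid(obs_lon, obs_lat, obs_val, grid_lon, grid_lat,
               k=4, power=2.0):
    """Inverse-distance-weighted interpolation of swath observations
    onto a regular lon/lat grid: each grid cell is a weighted mean of
    its k nearest observations with weights 1/d**power."""
    tree = cKDTree(np.column_stack([obs_lon, obs_lat]))
    glon, glat = np.meshgrid(grid_lon, grid_lat)
    targets = np.column_stack([glon.ravel(), glat.ravel()])
    dist, idx = tree.query(targets, k=k)
    dist = np.maximum(dist, 1e-9)          # avoid division by zero
    w = 1.0 / dist**power
    vals = (w * obs_val[idx]).sum(axis=1) / w.sum(axis=1)
    return vals.reshape(glon.shape)
```

Because the output is a weighted average, a field that is constant along the swath stays constant on the target grid, a useful sanity check for any regridding step.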
As low fuel moisture, especially in the duff layer, is known to be an important agent for lightning ignition (Wotton and Martell, 2005), an additional parameter is considered which combines the drought code (DC) and the duff moisture code (DMC) of the Fire Weather Index model (Van Wagner, 1974, 1987). The DC averages the moisture content of deep, compact organic layers of 10-20 cm depth with a long-term response (about 50 days) to weather variations. The DMC averages the moisture content of loosely compacted organic layers of moderate depth (5-10 cm) and is characterized by a medium-term response (about 10-12 days) to weather variations. A vegetation drought sensitivity index (VDI) was devised by Météo-France as a combination of DC and DMC and is used in its operational practice (personal communication) to identify areas where fires can develop due to dry/hot and windy conditions (for the interpretation, see Table 2). The algorithm to calculate the VDI is publicly available (Vitolo et al., 2017, 2018), and the ECMWF produces daily VDI layers as part of its experimental products. Inevitably, the definition of the environmental parameters has a margin of discretion. Figure 2 highlights how lower level atmospheric parameters such as 2 m temperature, 2 m relative humidity (RH) and precipitation have good discriminating power, with the probability density function (PDF) for the IPLs shifted towards drier and warmer conditions, as would be expected when fires ignite. Soil moisture is also a strong candidate predictor, whether using the IFS prediction or SMOS observations; the use of an averaging window does not seem to add information. Among the categorical parameters, the VDI places the non-IPL situations mostly in category 1 (no fire vulnerability), while the majority of the ignitions fall into categories 2-5, corresponding to higher levels of vulnerability.

| Validation data set
A large data set of lightning-ignition fires on lands throughout Western Australia managed by the Department of Biodiversity, Conservation and Attractions Western Australia was made available to the ECMWF through a scientific collaboration for the purposes of the study (for the geographical extension of the verification database, see Figure 3). The data are collected using lookout towers and spotter aircraft, with manual cause classification for fire events. The database reports several attributes, including fire cause and fire discovery time. Even though the detection network is used several times each day and the likelihood of a prolonged delay is small, there could still be a delay compared with the actual ignition time. In 2016, a total of 147 events attributed to lightning were observed.

| BINARY CLASSIFIERS FOR LIGHTNING-IGNITION PREDICTION
The study used supervised learning to find a mapping function between weather and environmental variables and a specific target. The target was a binary label identifying whether a fire event occurred and whether it was caused by lightning. The ground truth used to train the model was, in turn, the result of the procedure used to construct the data set of fire IPs possibly caused by lightning. We used models falling under the so-called binary classification algorithms, allowing only two distinct class decisions (i.e. two possible outcomes): IPLs (corresponding to 1) or non-IPLs (corresponding to 0). To build a robust prediction system, three models of increasing complexity were tested: a single decision tree (DT), a Random Forest (RF) and adaptive boosting (AdaBoost, AD) applied to the DT. The reason for employing multiple models of increasing complexity is to explore the trade-off between the complexity of the ML approach and the accuracy of prediction. Kotsiantis (2011) notes that DT techniques have been widely used to build classification models; such models closely resemble human reasoning and are easy to understand. The goal of a DT is to create a set of rules (i.e. a tree) that minimizes a loss function measuring the discrepancy between the predicted and the true values (Venkatasubramaniam et al., 2017). The algorithm works by analysing a data set and constructing a set of rules, which subsequently divides the data set into smaller and smaller subgroups until it can make a prediction that falls into one of the expected classes (Song and Lu, 2015). In general, decision trees are very sensitive to the data on which they are trained. Thus, small changes to the training set can result in significantly different tree structures and, in turn, quite different predictions.
To overcome this limitation, an RF uses ensemble methods that combine several DT classifiers to produce better predictive performance than a single DT classifier. An added benefit is that the trees in an RF can be run in parallel, allowing a very large number of input variables to be handled while minimizing overfitting (Biau, 2012). Another strategy to improve a simple DT is to apply adaptive boosting (AdaBoost) to it. The AdaBoost algorithm sequentially trains a set of single-variable DTs (Ying et al., 2012). It uses base learners of individual decision stumps (depth = 1), or small DTs, and trains them over several iterations, each trying to correct its predecessor. By doing so it "boosts" its predictive capability. This method, which can be applied to any classification algorithm, builds on top of a classifier rather than being a classifier by itself. Unlike in the other models, variable selection applies to the base learner rather than to the ensemble learning algorithm.
The supervised learning uses data covering the whole of 2016 where all environmental parameters are available and the CG lightning density is > 0. This environmental conditions data set is then sieved into a non-IPL class (labelled 0) and an IPL class (labelled 1). Data are then divided into 85% training data and 15% testing data. The same data sets are used with all the models in order to assess and compare performance accurately. There is a large disparity in size between the two classes, with a significantly larger number of samples in the non-IPL class. This is known to lead to lower sensitivity when detecting the less populated class (Cheng et al., 2015). In order to rebalance the training data set, fixed under-sampling (Lemaître et al., 2017) was performed to generate a data set containing equally sized classes.
Models were built in two steps. First, feature selection was performed combined with a 10-fold cross-validation (CV). An optimization procedure was then applied, referred to as hyperparameter tuning (or simply tuning), which allows the final predictive skill of the models to be improved. For details of these procedures and the parameter settings, see Appendix A1 in the supplemental data online.

| Feature selection and model assessment
A very relevant aspect of the definition of the prediction model is the analysis of the predictors selected by the recursive feature elimination (RFE) with CV. This step provides information on which physical processes have the largest importance when detecting lightning ignition-prone conditions (for the methodology, see Appendix A1 in the supplemental data online). The best variable combination for the deterministic DT is achieved with 10 predictors, while all 15 predictors are deemed important by the RF (Table 3). The AdaBoost, a boosting method applied to the DT, relies on the variables selected by the base estimator. The prediction score for the DT tends to decrease with the inclusion of variables such as soil type, low-level vegetation cover, and the three-day, five-day or 15-day average soil moistures. Looking at the PDF of soil type in Figure 2, for instance, it is evident that this variable does not discriminate well between IPLs and non-IPLs. Soil moisture, on the other hand, has very good predictive skill. This is consistent across several averaging periods; the information therefore becomes redundant when carried by several variables. For the ensemble RF, the use of all variables leads to the highest prediction score.
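The RFE-with-CV step might look like the following sketch, wrapped here around the single DT (the balanced-accuracy scoring is an assumed choice; see Appendix A1 for the actual settings):

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

def select_features(X, y, cv=10):
    """Recursive feature elimination with k-fold cross-validation:
    features are dropped one at a time (step=1) and the CV score
    picks the best-performing subset."""
    selector = RFECV(DecisionTreeClassifier(random_state=0),
                     step=1, cv=cv, scoring="balanced_accuracy")
    selector.fit(X, y)
    # support_ flags the retained features; ranking_ == 1 marks them too
    return selector.support_, selector.ranking_
```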
Each estimator is trained on the initial set of features and the importance of each feature is obtained through a "feature importance" attribute (Pedregosa et al., 2011). This attribute provides a score that indicates how valuable each feature was during the construction of the model. The features listed in Table 3 are ordered based on the feature importance attribute; Figure 4 provides a quantitative comparison of the importance of each feature across the three models. For instance, for the AdaBoost model, the top three variables were volumetric soil water layer 1 (swvl1), 20 day average soil moisture (SM r20) and high-level vegetation cover (cvh). There is a high level of consistency between the DT and RF, with the two most important predictors being 2 m RH and 2 m temperature (t2m). The variables excluded from the DT are soil type, low-level vegetation cover and other soil moisture variables, which are generally ranked low in the RF.
Model accuracy can be improved by performing hyperparameter optimization on relevant elements of the model design (for details of the optimization performed, see Appendix A1 in the supplemental data online). A key aspect of optimization in DT algorithms is pruning, which builds a simpler model to help generalization and avoid overfitting to the training data. AdaBoost allows further optimization of the DT by identifying the optimum depth of the base trees. This is done empirically by assessing the performance of various depths (≤ 10) and comparing the CV mean test-fold score; the highest score is used to choose the optimum depth for the base estimator. The RF instead can be optimized by changing the number of trees in the forest and the subsampling (bootstrapping) of the training data set for each tree. After tuning, predictive model performance can be assessed using the independent test data set and the following metrics (Pedregosa et al., 2011). The test data set uses unbalanced data, for which Table 4 reports the results from a randomly sampled month (i.e. April) and shows the four elements of the contingency table (namely TP, TN, FP and FN), as well as: specificity, which is the TN rate; precision, which is the ability of the model not to label a negative outcome as positive; sensitivity, which is the ability of the model to detect all the positive outcomes; balanced accuracy (a combination of sensitivity and specificity); and the F1 score (a combination of sensitivity and precision). The use of balanced accuracy and F1 score avoids inflated performance estimates on imbalanced data sets (He and Ma, 2013).
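The scores above follow directly from the four contingency-table counts; a minimal sketch (function and variable names are ours):

```python
def classification_scores(tp, tn, fp, fn):
    """Scores of the kind reported in Table 4, from the
    confusion-matrix counts (TP, TN, FP, FN)."""
    sensitivity = tp / (tp + fn)          # true-positive rate (recall)
    specificity = tn / (tn + fp)          # true-negative rate
    precision = tp / (tp + fp)
    balanced_accuracy = 0.5 * (sensitivity + specificity)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return dict(sensitivity=sensitivity, specificity=specificity,
                precision=precision,
                balanced_accuracy=balanced_accuracy, f1=f1)
```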
As expected, the out-of-sample accuracy (calculated on the independent test data) is smaller than the in-sample accuracy (training data). The difference between in- and out-of-sample accuracies is relatively small, which suggests none of the models is overfitting. While the scores in Table 4 provide a useful indication of the expected accuracy of the model, the practical significance of the elements of the confusion matrix requires some additional comments. It must be stressed that the training data set collects all points where a fire event occurred, and the aim of the model is to attribute that ignition to a lightning strike, or not. This means that the model is trained both to detect conditions that are prone to fire and to correctly attribute the origin of the fire to lightning. Thus, the high number of FP (and consequently low F1 score) should be read with this perspective in mind. The model can correctly predict that a fire occurred but incorrectly tag it as generated by a lightning event. This might be due to meteorological conditions that do not allow one to discern between lightning and other causes of fires, as well as to a possibly incorrect lightning forecast. Thus, while from an ML perspective an FP is an erroneous result, in the present application it is expected, given the uncertainties in how the IPL data set was built, that is, by associating a cause-effect relationship between lightning and fires occurring at the same time.
In summary, while we report all the scores for completeness, we stress that an FP does not have the same relevance as an FN, where a lightning ignition occurred but was not picked up by the model. FNs could lead to serious consequences in the use of this model in an operational context. A more detailed analysis of the implications these verification scores have in a real context is performed in the next section, where observed fires are analysed.

| Lightning-ignition prediction accuracy in observed fire events
Several factors are important to consider when verifying the ability of the ML algorithms to forecast a lightning-ignition event. Possibly the most important is whether the forecast of CG lightning is accurate enough to actually trigger the predictive model in the first instance. The second aspect is the instant at which to calculate predictions from the models, because there can be a delay between the time of a lightning ignition and the date of discovery (Dowdy and Mills, 2012). To address these two aspects, we adopted strategies to understand the relative impacts of both conditions. Following the schematic depicted in Figure 5, the verification addresses three perspectives:
• If the models are correctly triggered by the CG lightning density forecast, looking at the discovery day and a time window of five days before this date, what is the combined success of the CG lightning density product and each individual model?
• Considering only the events correctly predicted by the CG lightning forecast, what is the individual success of each model's predictions?
• If the CG lightning forecast had been perfect and triggered all the wildfire events on the discovery date (DD), what would the predictive success be for each individual model?
In order to assess the success of predictions, we used the sensitivity score, as defined in Section 4.1.

F I G U R E 5 Methodology for the verification of lightning-caused wildfire events highlighting the three perspectives of verification implemented. Perspective 1 examines the overall capability of the system to identify lightning ignition as the combined success of the lightning forecast and the machine learning (ML) models. Perspective 2 assesses the capability of every ML model to predict lightning ignition when the lightning forecast is correct. Perspective 3 overrides the lightning forecast and assesses the capability of the three ML models to predict lightning ignition in all cases.

If we consider verification perspective 1, that is, the overall capability of the system to predict a lightning ignition, we find that 104 of the 147 lightning events coincided with a lightning forecast on the same day the fire was discovered. This increases to 136 if we also consider lightning forecasts occurring in any of the five days before the DD, with the largest contribution (18 events) given by the day before the discovery date (DD - 1). For the combined success of the CG lightning density product with the three models' predictions, see Table 5. The results show that, considering only the DD, 43 events are not correctly predicted by the lightning forecast. However, by extending the search to the days before the DD, we reduce this to only 11 undetected events. Looking at the sensitivity of the models in Table 5, there is also an increase in the score as the number of detected events increases. While the increased number of lightning "triggerings" is an obvious consequence of widening the temporal search for a lightning event, the sensitivity increase also arises from the capability of the models to use environmental conditions prone to fire.
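The trigger step of perspective 1, checking for a forecast flash on the discovery day or in the five days before it, can be sketched as (names and the day-index representation are illustrative):

```python
def detected_within_window(discovery_day, forecast_days, window=5):
    """True if a CG lightning forecast fell on the discovery day (DD)
    or within `window` days before it, i.e. the trigger step of
    verification perspective 1.  Days are integer indices
    (e.g. day of year)."""
    return any(discovery_day - window <= d <= discovery_day
               for d in forecast_days)
```

Counting the events for which this returns False, first with `window=0` and then with `window=5`, reproduces the 43-versus-11 comparison of undetected events described above.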
If we only consider events that were correctly predicted by lightning and the model performance of the classifiers (Table 5), the sensitivity increases initially, with a decrease in the score as the number of detected events increases. This decrease is due to ambiguity between the environmental conditions on the day of discovery and on the "trigger" day. Lastly, assuming a perfect lightning forecast always able to predict a strike when it happened, what would the predictive capability of the models be? In this third perspective the sensitivities of all three models are between 0.71 and 0.76 when using the discovery day as the ignition day. Degraded performance is recorded if a temporal delay is used (data not shown). Overall, the three models have an accuracy in the range of 71-76%. This is a good performance, sitting between a random unbiased binary classifier and a perfect binary classifier, which by definition have accuracies of 50% and 100%, respectively. Interestingly, the best predictions in this case are provided by the RF model (Table 6). Performance differences between the models are limited, with variations arising from the contrasting approaches of the various models, the different variables used and their importance, and how each model arrives at a prediction.

| CONCLUSIONS
The aim of the study was to develop a lightning-ignition model. The analysis performed established a definable lightning-ignition relationship by only looking at three of the more popular machine learning (ML) methods. We acknowledge that further improvements could be made by using other ensemble methods, for example, a gradient boosting classifier, which is similar to AdaBoost but identifies the shortcomings of the prediction not by weights but by a loss function. Although an empirical approach was adopted to identify predictors that were assumed to be most relevant in a lightning-ignition relationship, further variables (such as wind speed and solar radiation) could be added to explore whether supplementary variables add extra benefit. Further, the soil moisture variables used for the study were based on the top layer (0-7 cm). Laboratory studies have shown that some soils report burn depths much greater than this range (Benscoter et al., 2011), which could suggest that the inclusion of an extra feature such as the depth profile of the soil moisture might be worthwhile. Lastly, the inclusion of a predictor related to fuel availability beyond the climatological vegetation cover adopted here could be beneficial. Still, the results are encouraging, as for most of the real cases all models can correctly point out those environmental conditions favourable to an ignition. This shows how empirical models have great potential to overcome some of the limitations of process-based models.

T A B L E 5 Number of lightning trigger misses (false negatives, FN), considering up to a five-day temporal window before the discovery date (DD), and sensitivity for the decision tree (DT), Random Forest (RF) and AdaBoost (AD) models, for verification perspective 1 (Ver 1) and verification perspective 2 (Ver 2).
It also suggests that future developments will benefit greatly from the adoption of a pragmatic approach that combines traditional model formulation with artificial intelligence methods for model parameter estimation and process optimization (Reichstein et al., 2019). For the purpose of developing an operational product from the analysis, the most interesting information comes from the analysis of the missed events, because it has been shown that the skill in lightning prediction plays a significant role. If lightning is not correctly predicted, the ML models will fall into the non-IPL class (where an IPL is a fire ignition point that could potentially be attributed to lightning) regardless of whether or not the environmental conditions were prone to a lightning ignition. Moreover, the definition of the data set of IPLs used to train the binary classifiers assumes that the cloud-to-ground (CG) lightning density product is accurate in space and time; together with the removal of ignition points (IPs) of agricultural nature and short duration, this leaves the potential to miss smaller lightning-ignited fires that are suppressed by rain or other activities. Thus, the quality of the lightning forecast affects not only the triggering of the model but also the definition of the classification. There are, however, clear advantages in using a forecast model: (1) the forecast has global coverage while observations are spatially sparse; (2) there is dynamic consistency and spatial coherence with the other environmental variables; and (3) there is an opportunity to extend the prediction into the future. Still, it is clear from the analysis performed that a system accounting for the inherent stochasticity of the prediction has to be put in place to deliver a product that could usefully complement early warning systems designed for fire control and prevention.
Weather forecast errors are characterized by nonlinear growth, meaning that a small inaccuracy can lead to very different physical states (Lorenz, 2000). Accounting for these errors is traditionally done through the use of ensemble prediction systems, where several simulations are performed starting from slightly different initial conditions and model configurations (Molteni et al., 1996; Buizza et al., 1999). An ensemble system allows the forecast to be interpreted probabilistically rather than deterministically, so uncertainties can be assigned to the final prediction; this can boost confidence in the decision process during emergency situations, as a cost-loss analysis can also be associated with the different scenarios (Cloke et al., 2017).

The use of an ensemble system, which is very relevant for this application, also implies a substantial increase in computational time and could jeopardize the feasibility of delivering products in real time. From a preliminary test performed on the European Centre for Medium-range Weather Forecasts (ECMWF) supercomputer facility, we estimated that, with a model run time of < 10 min per ensemble member and the possibility of running parallel tasks, even a probabilistic product could be delivered with a delay of around 1 hr from when the lightning forecast becomes available.
Although the model was designed to distinguish the IPs due to lightning from other causes of ignition, further work will test the ability of the model to also classify non-IPs as if they were non-IPLs. This goes in the direction of making the model work in an operational setting and benefit decision-making for fire management and prevention. In this context, land use/cover and the temporal evolution of fire radiative power at a location and its neighbouring area would become active features of the model. Also, a probabilistic outcome, in terms of the probability of ignition due to lightning, would suit tools for decision-making better than the current deterministic outcome.
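One route to the probabilistic outcome suggested above: tree ensembles in scikit-learn already expose per-class probabilities through `predict_proba`, so a hard 0/1 label can be replaced by a probability of ignition. This is a minimal sketch on synthetic data (an assumption, not the study's data set).

```python
# Minimal sketch, assuming synthetic data: turning the deterministic
# class label into a probability of lightning ignition via predict_proba.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=15, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Column 1 holds the probability of the positive (IPL) class,
# rather than a hard 0/1 prediction
p_ignition = clf.predict_proba(X[:5])[:, 1]
print(p_ignition)
```

Such probabilities could then be thresholded, or binned into danger levels, by downstream decision-making tools.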

| Steps to build a binary classifier model
The construction of the binary classifiers in the present study has four steps. The first consists of splitting the initial data set into training (85%) and testing (15%) sets. In this stage, fixed under-sampling (Lemaître et al., 2017) was performed on the training portion to generate a data set containing equally sized classes. The training set was used to train the model through cross-validation (CV), while the testing set was used to assess the out-of-sample accuracy.
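The first step can be sketched as follows. The 85/15 split and equal-sized classes follow the text; the synthetic data, random seeds, and the plain-NumPy under-sampler (standing in for imbalanced-learn's RandomUnderSampler cited above) are illustrative assumptions.

```python
# Sketch of step 1: split the data set, then under-sample only the
# training portion so both classes are equally sized.
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 15))              # 15 environmental variables
y = (rng.random(1000) < 0.2).astype(int)     # imbalanced: ~20% IPL class

# 85% training / 15% testing, stratified to preserve class ratios
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42)

# Fixed under-sampling: keep n_min samples of each class in the training set
counts = Counter(y_train)
n_min = min(counts.values())
idx = np.concatenate([
    rng.choice(np.flatnonzero(y_train == c), size=n_min, replace=False)
    for c in counts])
X_bal, y_bal = X_train[idx], y_train[idx]
print(Counter(y_bal))   # both classes now equally sized
```

Note that the test set is left untouched, so the out-of-sample accuracy is assessed on the original class distribution.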
The second step is feature (variable) selection, that is, identifying which combinations of the 15 environmental variables are best able to predict the conditions present in the training data set; these are identified using a combination of recursive feature elimination (RFE) and CV. The RFE works by recursively considering smaller and smaller sets of variables. First, the estimator is trained on the initial set of variables, and the importance of each variable is obtained through a "feature importance" attribute (Pedregosa et al., 2011) that relates importance to how a variable was used during the construction of the model: essentially, the more a variable is used to make key decisions, the higher its relative importance. The least important variables are removed from the current set and the procedure is repeated recursively on the reduced set of features until the desired number of features is reached. The CV is combined with the RFE to identify the best set of variables: at each recursion the CV is performed, and the recursion with the highest CV score is deemed to yield the optimum set of variables for a model.
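The RFE-plus-CV combination described above is implemented in scikit-learn as RFECV. A minimal sketch, assuming synthetic data in place of the study's 15 environmental variables:

```python
# Sketch of step 2: recursive feature elimination with cross-validation.
# The data set (600 samples, 15 features, 5 of them informative) is an
# illustrative assumption.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=15, n_informative=5,
                           n_redundant=0, random_state=0)

selector = RFECV(
    estimator=DecisionTreeClassifier(random_state=0),
    step=1,               # drop the least important feature each recursion
    cv=5,                 # the CV score decides the optimum subset
    scoring="accuracy")
selector.fit(X, y)

print("optimum number of features:", selector.n_features_)
print("selected feature mask:", selector.support_)
```

The feature importances come from the tree's `feature_importances_` attribute, matching the "feature importance" mechanism described in the text.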
The third step is the definition of a CV methodology. Normally, when developing an ML model, data are split into training and test sets, with the training set used to learn/train the model and the test set to assess it. The test set is used to appraise the model's performance using evaluation metrics such as accuracy. This method, however, can be unreliable, as the accuracy obtained for one test set can differ from the accuracy obtained for another. The CV provides a solution to this: it evaluates and compares a model by splitting the data into segments, one used to train the model and the other to validate it (the iterations over the various partitions replace the need for a separate "validation" set). A k-fold CV is the basic form of CV, where the training data are divided into k equally (or nearly equally) sized segments or folds. Subsequently, k iterations of training and validation are performed, and within each iteration a different fold is held out for validation while the remaining k − 1 folds are used for learning (Refaeilzadeh et al., 2009). For the present study, the number of folds chosen was 10, with the overall mean CV test fold score taken as the assessment metric for models.
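The 10-fold scheme above, with the mean test-fold score as the assessment metric, can be sketched with scikit-learn's `cross_val_score`; the synthetic data are an assumption for illustration.

```python
# Sketch of step 3: 10-fold cross-validation, with the mean test-fold
# accuracy taken as the assessment metric.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=15, random_state=0)

# Each of the 10 iterations holds out a different fold for validation
# and trains on the remaining 9 folds
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=10, scoring="accuracy")
print("per-fold accuracy:", np.round(scores, 3))
print("mean CV test fold score:", scores.mean())
```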
The final step is hyper-parameter tuning. A hyper-parameter is a parameter whose value is used to control the learning process (Feurer and Hutter, 2019). Thus, hyper-parameter tuning is the problem of choosing a set of optimal hyper-parameters for a learning algorithm (Ghawi and Pfeffer, 2019). The easiest way to obtain good accuracy in a model is to test and compare different parameter combinations (DeCastro-García et al., 2019), and this is how hyper-parameter tuning was approached. Hyper-parameter tuning for the study was undertaken by creating different unique models with different sets of hyper-parameters using tuning tools from the Python scikit-learn package (Pedregosa et al., 2011). Within this package, we used the GridSearchCV function for both the decision tree (DT) and AdaBoost models, and the RandomizedSearchCV and GridSearchCV functions for the Random Forest (RF). Model assessment was undertaken using the highest mean CV test fold score, with the hyper-parameters of that model considered the optimum for the learning process.
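The grid-search procedure can be sketched for the decision tree as follows. The grid values here are illustrative assumptions, not the settings the paper reports in its tables; only the use of GridSearchCV with 10-fold CV follows the text.

```python
# Sketch of step 4: hyper-parameter tuning for the decision tree with
# GridSearchCV. Grid values are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=15, random_state=0)

# Each combination in the grid defines one candidate model
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 10],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=10, scoring="accuracy")
search.fit(X, y)

# The highest mean CV test-fold score selects the optimum hyper-parameters
print("best mean CV test fold score:", search.best_score_)
print("optimum hyper-parameters:", search.best_params_)
```

`RandomizedSearchCV` works the same way but samples a fixed number of combinations from the grid, which is cheaper when the grid is large, as for the Random Forest.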
TABLE 7 Parameters used to create the parameter grid for the GridSearchCV for the decision tree, with optimal parameters for the learning algorithm highlighted in bold

In the spirit of reproducibility, Tables 7-10 provide the main parameter settings for all three classifiers used in hyper-parameter optimization. In particular, Tables 7 and 10 define the parameter grid for the GridSearchCV of the DT and AdaBoost, respectively. Similarly, Table 8 defines the parameter grid for the RF RandomizedSearchCV. Finally, Table 9 shows the results of the optimum depth assessment for the base estimator for AdaBoost.