Evaluating performance of four species distribution models using Blue-tailed Green Darner Anax guttatus (Insecta: Odonata) as model organism from the Gangetic riparian zone

In this paper we evaluated the performance of four species distribution models: generalized linear model (GLM), maximum entropy (MAXENT), random forest (RF) and support vector machines (SVM), using the distribution of the dragonfly Blue-tailed Green Darner Anax guttatus in the Gangetic riparian zone between Bijnor and Kanpur barrage, Uttar Pradesh, India. We used forest cover type, land use land cover, and five bioclimatic variable layers (annual mean temperature, isothermality, temperature seasonality, mean temperature of driest quarter, and precipitation seasonality) to build the models. We found that the GLM generated the highest values for AUC, TSS, specificity and sensitivity, and the lowest values for omission error and commission error; the RF model generated the highest Kappa statistic, while the MAXENT model generated the lowest variance in variable importance. We suggest that researchers should not rely on any single algorithm. Instead, they should test the performance of all available models for their species and area of interest, and choose the best one to build a species distribution model.


INTRODUCTION
Species distribution models (SDMs) are tools that integrate information about species occurrence or abundance with environmental estimates of a landscape to predict the distribution of a species across landscapes. When applied in a geographic information system (GIS), SDMs can produce spatial predictions of occurrence likelihood at locations where information on species distribution was previously unavailable (Václavík & Meentemeyer 2009). Though various types of algorithms are used to build different SDMs (Elith et al. 2006), they share a common general approach (Hirzel et al. 2002): (i) the study area is divided into grid cells at a specified resolution; (ii) species presence localities (and sometimes absence localities) are used as the dependent variable; (iii) several environmental variables (e.g., temperature, precipitation, soil type, aspect, land cover type) are collected for each grid cell as predictor variables; and (iv) the suitability of each cell for the species is defined as a function of the environmental variables (Stanton et al. 2012). Species distribution prediction is central to applications in ecology, evolution and conservation science (Elith et al. 2006) across terrestrial, freshwater, and marine realms. It remains an open question, however, which model should be selected for particular organisms and habitats of interest, particularly when few samples are available for large under-sampled areas (Mi et al. 2017).
Riparian zones are broadly defined as terrestrial landscapes with characteristic vegetation associated with temporary or permanent aquatic ecosystems (Meragiaw et al. 2018). These areas are highly complex biophysical systems, and their ecological functions are maintained by strong spatio-temporal connectivity with adjacent riverine and upland systems (Décamps et al. 2009). It has been observed that species distribution models are used more often for terrestrial environments than for aquatic or riparian ecosystems. Globally, odonates are used as model organisms to study climate change, data simulation, environmental assessment and management, effects of urbanization, landscape planning, habitat monitoring and evaluation, and conservation of rare species (Bried & Samways 2015). To date, no work has been done on the comparative use of species distribution models in India using insects as model organisms in riparian or freshwater ecosystems. With this background, in the present work we evaluated the effectiveness of four species distribution models using odonates from the Gangetic riparian zone as model organisms.

Study area and field data collection
For the study, we selected Anax guttatus (Burmeister, 1839), commonly called Blue-tailed Green Darner (Image 1), as the model insect species. It is a dragonfly (suborder Anisoptera Selys, 1854) under the family Aeshnidae Leach, 1815 and superfamily Aeshnoidea Leach, 1815 (Dijkstra et al. 2013). The species can be identified in the field by its large size, highly active behaviour, green colour of the thorax and of the first, second, and third abdominal segments, and the presence of turquoise blue colour on the dorsal part of the second abdominal segment (Subramanian 2005).
We conducted the study during May 2019 along the River Ganga from Bijnor, Uttar Pradesh to Kanpur, Uttar Pradesh (Fig. 1). The river flows through an alluvial plain and covers a length of about 450km in this stretch. For the study we selected four sites, and the distance between successive sites was about 150km. At each site we chose a 10km river stretch and recorded the presence of Blue-tailed Green Darner. We collected a total of 10 sighting locations.

Data processing and analysis
We derived the thematic layer of LULC (N.R.S.C. 2016) from multi-temporal Advanced Wide Field Sensor (AWiFS) images with 56m spatial resolution using digital and rule-based image classification methods. We derived the forest cover type layer (F.S.I. 2009) from IRS P6 Linear Imaging Self Scanning Sensor (LISS III) images with 23.5m spatial resolution using a combined method of digital and on-screen visual image classification. We took the bioclimatic layers from WorldClim gridded climatic data (Fick & Hijmans 2017) with 1km spatial resolution. For analysis, we took a 2km buffer zone from the river bank and resampled all the layers to 1km spatial resolution.
We used the 'stack' function of the package 'raster' (Hijmans 2019) to stack all 19 available bioclimatic variable layers together with the forest cover and land use land cover (LULC) layers. We then used the 'pairs' function of the same package to find the correlation coefficients between the stacked layers. We selected the variables which had a correlation coefficient less than 0.60 (Pozzobom et al. 2020), and stacked the selected layers again with the 'stack' function. The selected layers were LULC, forest cover and five bioclimatic layers: annual mean temperature (Bio 1), isothermality (Bio 3), temperature seasonality (Bio 4), mean temperature of driest quarter (Bio 9), and precipitation seasonality (Bio 15).
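The correlation screening described above can be illustrated with a minimal Python sketch. It is not the authors' R workflow: the greedy, order-dependent selection rule, the helper names 'pearson' and 'select_uncorrelated', and the toy cell values are all assumptions made for illustration; only the 0.60 threshold comes from the text.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def select_uncorrelated(layers, threshold=0.60):
    """Greedily keep layers whose |r| with every already-kept layer is below threshold.

    `layers` maps a layer name to its cell values (same cell order in every layer).
    Layers are visited in insertion order, so earlier layers take priority; this
    ordering rule is an assumption of the sketch, not something the paper states.
    """
    kept = []
    for name, values in layers.items():
        if all(abs(pearson(values, layers[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy values for four layers over six grid cells.
toy_layers = {
    "bio1": [24.1, 24.5, 25.0, 25.2, 25.9, 26.3],
    "bio5": [38.0, 38.6, 39.1, 39.5, 40.2, 40.8],  # nearly collinear with bio1
    "bio15": [110, 95, 120, 101, 117, 99],
    "lulc": [3, 1, 2, 2, 1, 3],
}
selected = select_uncorrelated(toy_layers)
print(selected)
```

In this toy case the second temperature layer is dropped because it tracks the first almost perfectly, while the weakly correlated layers are retained.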
We built four species distribution models: generalized linear model (GLM), maximum entropy (MAXENT) model, random forest (RF) model, and support vector machines (SVM).
GLM is an extension of classic linear regression modelling in which an iterative weighted linear regression technique is used to obtain maximum-likelihood estimates of the parameters. Observations are assumed to follow a distribution from the exponential family, and systematic effects are made linear by a suitable transformation, which allows for the analysis of non-linear effects among variables and of non-normal distributions of the response variable (McCullagh & Nelder 1989; Chefaoui & Lobo 2008; Shabani et al. 2016).
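As an illustration of the iterative weighted regression idea, the following Python sketch fits a binomial GLM with a logit link to a single predictor by iteratively reweighted least squares. It is a minimal sketch under stated assumptions, not the fitting routine used by the 'SSDM' package: the function name 'fit_logistic_irls', the single-predictor restriction, the fixed iteration count, and the presence/absence data are all hypothetical.

```python
from math import exp


def fit_logistic_irls(x, y, iterations=25):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by iteratively reweighted least squares."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Current fitted probabilities and IRLS weights w = p * (1 - p).
        p = [1.0 / (1.0 + exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Working response for the logit link: z = eta + (y - p) / w.
        z = [b0 + b1 * xi + (yi - pi) / wi for xi, yi, pi, wi in zip(x, y, p, w)]
        # Weighted least squares of z on (1, x): solve the 2x2 normal equations.
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx * swx
        b0 = (swxx * swz - swx * swxz) / det
        b1 = (sw * swxz - swx * swz) / det
    return b0, b1


# Hypothetical data: forest cover fraction per cell and presence (1) / absence (0).
# The classes overlap, so a finite maximum-likelihood estimate exists.
cover = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
present = [0, 0, 1, 0, 1, 0, 1, 1]
intercept, slope = fit_logistic_irls(cover, present)
fitted = [1.0 / (1.0 + exp(-(intercept + slope * c))) for c in cover]
print(round(intercept, 3), round(slope, 3))
```

Each pass is an ordinary weighted linear regression on a transformed response, which is why the method is described as iterative weighted linear regression. Perfectly separable data would make the estimates diverge, so the sketch should not be used on such data without a stopping rule.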
RF modeling is a machine learning technique which is a bootstrap-based classification and regression trees method (Cutler et al. 2007). It is used to model species distributions from both the abundance and the presence-absence data (Howard et al. 2014). It is insensitive to data distribution (Hill et al. 2017) and also takes a large number of potentially collinear variables; it is robust to over-fitting which makes it very useful for prediction (Prasad et al. 2006;Segal 2004).
MAXENT modelling is a general-purpose machine learning method that estimates a target probability distribution by finding the probability distribution of maximum entropy, and it has several aspects that make it well-suited for species distribution modelling. It is relatively insensitive to the spatial errors associated with location data and needs few locations to build useful models (Baldwin 2009), and it is one of the most accurate and trusted modelling methods for presence-only distribution data (Huerta & Peterson 2008; Srinivasulu & Srinivasulu 2016).
SVM modelling is developed from the theory of statistical learning, in which the error associated with sample size is minimized and the upper limit of the error involved in model generalization is narrowed. This addresses the problems of nonlinearity, over-learning and the curse of dimensionality during modelling (Fielding & Bell 1997; Howley & Madden 2005; Huang & Wang 2006). It can be used on small data sets as it is independent of any distributional assumptions or asymptotic arguments (Wilson 2008).
We used the 'load_var' function to load and normalize the environmental variables, the 'load_occ' function to load the species occurrence data, and the 'modelling' function to build each model with 100 iterations, all from the package 'SSDM' (Schmitt et al. 2017). We then plotted the resulting models.
We evaluated and compared the four models using the area under the receiver operating characteristic curve (AUC), Cohen's Kappa, true skill statistic (TSS), model sensitivity, model specificity, omission error, and commission error.
The area under the receiver operating characteristic curve or AUC measures the ability of a model to discriminate between the sites where a species is present and the sites where a species is absent (Fielding & Bell 1997;Elith et al. 2006) and it provides a single measure of overall accuracy that is independent of a particular threshold (Fielding & Bell 1997). The evaluation criteria for the AUC statistic are as follows: excellent (0.90-1.00), very good (0.8-0.9), good (0.7-0.8), fair (0.6-0.7), and poor (0.5-0.6) (Swets 1988;Duan et al. 2014).
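The threshold-independent nature of AUC can be seen in a short Python sketch that computes it as the probability that a randomly chosen presence site receives a higher predicted score than a randomly chosen absence site. This is an illustrative sketch, not the evaluation code of the 'SSDM' package; the function names, the toy scores, and the treatment of class boundaries (lower bound inclusive, anything below 0.6 labelled poor) are assumptions.

```python
def auc(presence_scores, absence_scores):
    """Probability that a random presence site outscores a random absence site.

    Ties count as half, which makes this equal to the area under the ROC curve.
    """
    wins = 0.0
    for p in presence_scores:
        for a in absence_scores:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    return wins / (len(presence_scores) * len(absence_scores))


def auc_rating(value):
    """Map an AUC value to the rating scale quoted in the text.

    Assumption: each class includes its lower bound, and values below 0.6
    (including those below 0.5, which the scale does not cover) are 'poor'.
    """
    if value >= 0.9:
        return "excellent"
    if value >= 0.8:
        return "very good"
    if value >= 0.7:
        return "good"
    if value >= 0.6:
        return "fair"
    return "poor"


# Hypothetical predicted suitability at four presence and four absence sites.
toy_auc = auc([0.9, 0.8, 0.6, 0.4], [0.7, 0.3, 0.2, 0.1])
print(toy_auc, auc_rating(toy_auc))

# Ratings for the AUC values reported in the Results of this paper.
reported_auc = {"GLM": 0.983, "RF": 0.833, "MAXENT": 0.829, "SVM": 0.667}
ratings = {model: auc_rating(value) for model, value in reported_auc.items()}
print(ratings)
```

No threshold appears anywhere in the calculation: only the ranking of presence scores relative to absence scores matters.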
The Kappa statistic is calculated at the optimal threshold and measures the performance of the model using all the information in the confusion matrix (Duan et al. 2014). It ranges from −1 to +1, where +1 indicates perfect agreement and values of zero or less indicate a performance no better than random (Cohen 1960; Allouche et al. 2006). The evaluation criteria for the Kappa statistic are as follows: excellent (0.85-1.0), very good (0.7-0.85), good (0.55-0.7), fair (0.4-0.55), and fail (<0.4) (Monserud & Leemans 1992; Duan et al. 2014).
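A minimal Python sketch of Cohen's Kappa computed from a two-by-two confusion matrix is given here for illustration. The counts are hypothetical and are not the evaluation data of this study; the function names and the inclusive lower bounds of the rating classes are assumptions.

```python
def cohens_kappa(tp, fp, fn, tn):
    """Cohen's Kappa from true/false positive and negative counts.

    Kappa = (observed agreement - chance agreement) / (1 - chance agreement).
    Undefined (division by zero) when chance agreement equals 1.
    """
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1 - expected)


def kappa_rating(value):
    """Map a Kappa value to the rating scale quoted in the text.

    Assumption: each class includes its lower bound.
    """
    if value >= 0.85:
        return "excellent"
    if value >= 0.7:
        return "very good"
    if value >= 0.55:
        return "good"
    if value >= 0.4:
        return "fair"
    return "fail"


# Hypothetical counts: 5 of 6 presences and 5 of 6 absences predicted correctly.
toy_kappa = cohens_kappa(tp=5, fp=1, fn=1, tn=5)
print(round(toy_kappa, 3), kappa_rating(toy_kappa))
```

Because the chance-agreement term depends on how many presences and absences are in the evaluation set, Kappa responds to prevalence in a way that sensitivity and specificity do not.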
The true skill statistic (TSS) is expressed as Sensitivity + Specificity − 1 (Allouche et al. 2006) and ranges from −1 to +1, where +1 indicates a perfectly performing model with no error, 0 indicates a model with totally random error and −1 indicates a model with total error (Marcot 2012; Ruete & Leynaud 2015).
The model sensitivity denotes the proportion of correctly predicted presences, thus quantifying omission errors (Ward 2007; Shabani et al. 2016), and model specificity denotes the proportion of correctly predicted absences, thus quantifying commission errors (Shabani et al. 2016).
Omission error (1 − sensitivity) is the under-prediction or false-negative result, in which areas are classified as unsuitable when they are not. Commission error (1 − specificity) is the over-prediction or false-positive result, in which areas are classified as suitable when they are not (Ward 2007). For a good SDM, both the omission error and the commission error should be low. To report model performance and variable importance, we tabulated the 'evaluation' and 'variable.importance' slots of each model object built with the package 'SSDM' (Schmitt et al. 2017) using 'knitr::kable(Modelname@evaluation)' and 'knitr::kable(Modelname@variable.importance)', respectively.
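The relationships among sensitivity, specificity, TSS, omission error and commission error can be checked with a short Python sketch. The function name and the confusion-matrix counts are hypothetical; the sensitivity and specificity values in the second part are those reported in the Results of this paper, and the derived quantities follow directly from the definitions above.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Threshold-dependent metrics from a two-by-two confusion matrix."""
    sensitivity = tp / (tp + fn)  # proportion of presences predicted as presences
    specificity = tn / (tn + fp)  # proportion of absences predicted as absences
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "omission_error": 1 - sensitivity,
        "commission_error": 1 - specificity,
        "tss": sensitivity + specificity - 1,
    }


# Hypothetical counts, for illustration only.
toy = confusion_metrics(tp=5, fp=1, fn=1, tn=5)

# (sensitivity, specificity) per model, as reported in the Results.
reported = {
    "GLM": (1.0, 0.965),
    "RF": (0.833, 0.833),
    "MAXENT": (0.833, 0.825),
    "SVM": (0.667, 0.667),
}
derived = {
    model: {
        "tss": round(se + sp - 1, 3),
        "omission_error": round(1 - se, 3),
        "commission_error": round(1 - sp, 3),
    }
    for model, (se, sp) in reported.items()
}
for model, values in derived.items():
    print(model, values)
```

The derived TSS, omission and commission values agree with the figures reported in the Results, which confirms that the reported metrics are internally consistent.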
We chose five probability classes (0 to <0.20, 0.20 to <0.40, 0.40 to <0.60, 0.60 to <0.80 and 0.80 to 1.00) and used the 'ratify' function of the package 'raster' (Hijmans 2019) to determine what percentage of the area each model classified as the best and the worst habitat. We performed all analyses in ArcMap 10.3.1, QGIS 2.14.7 and the R language and environment for statistical computing (R Core Team 2019).
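The class-wise area summary can be sketched in Python as follows. This is an illustration of the binning logic only, not the 'raster' workflow used in the study: the function name, the toy probabilities, and the assumption that every grid cell covers the same area (so that cell counts translate directly into area percentages) are all hypothetical.

```python
from bisect import bisect_right


def class_percentages(probabilities, breaks=(0.20, 0.40, 0.60, 0.80)):
    """Percentage of cells in each class: [0, 0.2), [0.2, 0.4), ..., [0.8, 1.0].

    Assumes equal-area cells, so the share of cells equals the share of area.
    """
    counts = [0] * (len(breaks) + 1)
    for p in probabilities:
        # bisect_right places a value equal to a break into the upper class.
        counts[bisect_right(breaks, p)] += 1
    total = len(probabilities)
    return [100.0 * c / total for c in counts]


# Hypothetical predicted occurrence probabilities for ten grid cells.
cells = [0.05, 0.10, 0.19, 0.20, 0.35, 0.55, 0.79, 0.80, 0.95, 1.00]
shares = class_percentages(cells)
print(shares)
```

The first and last entries of the result correspond to the "worst" (0 to <0.20) and "best" (0.80 to 1.00) classes compared across models in the Results.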

RESULTS
The plot for each of the four models is given in Fig. 2. We found that the AUC value was highest for GLM (0.983), followed by RF (0.833), MAXENT (0.829) and SVM (0.667). The value of the Kappa statistic was highest for RF (0.667), followed by GLM (0.356), SVM (0.333) and MAXENT (0.049). The value of TSS was highest for GLM (0.965), followed by RF (0.666), MAXENT (0.658) and SVM (0.334). Model sensitivity was 1 for GLM, 0.833 for both MAXENT and RF, and 0.667 for SVM. Model specificity was highest for GLM (0.965), followed by RF (0.833), MAXENT (0.825) and SVM (0.667). The omission error was lowest for GLM (0.00); it was 0.167 for both the MAXENT and RF models and 0.333 for SVM. The commission error was lowest for GLM (0.035), followed by RF (0.167), MAXENT (0.175) and SVM (0.333) (Table 1, Fig. 3).
For the GLM, RF, and SVM models, forest cover had the highest variable importance (Fig. 4). Overall, the variation in variable importance was lowest in the MAXENT model (SD = 3.367), followed by GLM (SD = 24.344), RF (SD = 30.868) and SVM (SD = 37.071) (Fig. 5).
By comparative analysis, we found that GLM classified 1.62% of the total area as the best habitat (occurrence probability 0.80 to 1) and 65.50% as the worst (occurrence probability 0 to <0.20). The MAXENT model classified 10.08% of the total area as the best and 77.70% as the worst. The RF model classified 5.39% of the total area as the best and 23.79% as the worst. The SVM model classified 4.53% of the total area as the best and 27.68% as the worst (Table 3, Fig. 6).

DISCUSSION
Freshwater ecosystems, which include rivers, lakes, peatlands, swamps, fens, and springs, are highly dynamic and host a great diversity of life forms, particularly freshwater endemic species (He et al. 2019; Tickner et al. 2020). They are among the most threatened ecosystems (He et al. 2019), as globally wetlands are vanishing more rapidly than forests and freshwater species are declining faster than terrestrial or marine populations (Tickner et al. 2020). Therefore, for proper conservation management, we should understand the distribution of plants and animals inhabiting aquatic ecosystems. Species distribution models can play an important role in such efforts, because they can produce credible, defensible and repeatable information and provide tools for mapping habitats to inform decisions (Sofaer et al. 2019). Species distribution models can forecast the potential impacts of future environmental changes (Howard et al. 2014) and predict how species will respond (Buckley et al. 2010). Yet debate remains over the most robust species distribution modelling approaches for making projections (Howard et al. 2014), because these models are sensitive to data inputs and methodological choices. This makes it important to assess the reliability and utility of the model predictions (Sofaer et al. 2019).
In the present study we compared the GLM, MAXENT, RF, and SVM approaches. We found that GLM generated the highest values for AUC, TSS, specificity and sensitivity, and the lowest values for omission error and commission error. The value of the Kappa statistic was highest for the RF model. The MAXENT model weighted all variables roughly equally, whereas the other three models put more emphasis on forest cover.
The success of a model depends on many factors, such as sample size, spatial extent of the study area, and the number of ecologically and statistically significant variables which affect the distribution of the species of interest. We acknowledge that there were some limitations to the current work: our sample size was small (only 10 presence locations), we used only seven variables, we tested only four species distribution models, and we selected a species whose distribution depends on other factors, such as the physicochemical parameters of water and availability of resources. We did not include such variables as this study was preliminary. Collins & McIntyre (2015) reviewed 30 studies on species distribution modelling of odonates across the world, and found that 43% used GLM, 33% MAXENT and 20% RF models. Other models used were BIOMOD, generalized additive model (GAM), generalized boosted model (GBM), artificial neural networks (ANN), multivariate adaptive regression splines (MARS), classification tree analysis (CTA), flexible discriminant analysis (FDA), boosted regression trees (BRT), surface range envelopes (SRE), and mixture discriminant analysis (MDA). Different species distribution models produce different results (Shabani et al. 2016), and the same model can give different results for different species and areas. We urge researchers not to rely on just one model; rather, they should compare the different available species distribution models and select the best one. Our study is the first in India in which an insect was used for comparative evaluation of species distribution models in a riverine riparian zone. We recommend that further ...
The Journal of Threatened Taxa (JoTT) is dedicated to building evidence for conservation globally by publishing peer-reviewed articles online every month at a reasonably rapid rate at www.threatenedtaxa.org.
All articles published in JoTT are registered under Creative Commons Attribution 4.0 International License unless otherwise mentioned. JoTT allows unrestricted use, reproduction, and distribution of articles in any medium by providing adequate credit to the author(s) and the source of publication.
