Estimating the completeness of orchid checklists and atlases: a case study from southern Italy

: Checklists and atlases are important tools for knowledge of the biodiversity of a geographic unit. Nevertheless, they often suffer from bias due to preferential sampling. It is important to assess the level of completeness of the data collected during such research to allow comparison of the biodiversity of different areas, or to use them for macroecology, biogeography or conservation purposes. This assessment is not trivial, especially when information from heterogeneous sources is used (e.g., herbaria specimens, field observations, literature data). The author suggests some simple methods to assess the completeness of floristic database and to represent the distribution of the completeness at a scale level appropriate to the size of the studied area or, on another hand, to the precision level of the available data. Such information is useful to direct the surveys identifying less explored areas or habitats and thereby correcting the sampling biases. Adding information about sampling effort or completeness could be very useful to make floristic research more objective.


INTRODUCTION
Floristic inventories or check lists and atlases are important tools for assessing biodiversity and addressing its conservation (Vallet et al. 2012). They are often the result of careful and time-consuming researches conducted in specific geographic units, focused on vascular plants or on smaller taxonomic group such as Orchidaceae, one of the largest and most widespread family of flowering plants (Dressler 1981;WCSPF 2019). The presence and distribution of species of this family have been assessed at different scales as most of them are rare, threatened or endangered (Cribb et al. 2003). A checklist is a "card collection" aiming at listing all the taxa belonging to the studied taxonomic group and reporting whether they are observed, collected or reported in literature for a given area (e.g., Mathew & George 2015;Aung et al. 2020;Popovich et al. 2020). The taxa are typically identified at species or subspecies level, some sites of growth are reported together with other information on the habitats, variety, rarity, ecology, chorology, systematic or taxonomic issues. Atlases are more focused on the geographic distribution of the taxa, instead. To be accomplished they require a field work aiming not only at listing all the different taxonomic entities, but also at detecting as more sites of growth as possible for each taxon. The result of such work is a checklist with cartographic references or distribution maps and, sometimes, their elaborations (e.g., Crain & Fernández 2020;Efimov 2020). Due to the long time needed for exhaustive surveys, at a local scale this kind of research is increasingly carried out by non academics, the so called 'citizen scientists'. This is particularly true for the inventories and atlases of the European terrestrial orchids, often published in specialized journals (e.g., Galesi & Lorenz 2010;Frangini et al. 2019;Katopodi & Tsiftsis 2019;Marrero et al. 2019).
The huge amount of work, even when results in detailed distribution maps, almost never follows an experimental design, and currently data are affected by bias caused by a preferential sampling approach, e.g., data collector tends to sample protected areas or to collect more data along the roads (Croce & Nazzaro 2017). Furthermore, none of the above mentioned floristic studies is usually provided with a clear reference to the sampling effort or to the level of completeness of the surveys. The absence of a repeatable background and of a standardized approach is not a trivial issue, as such collections of data are of great value for macroecology, ecology, biogeography or conservation research (Soberón et al. 2000(Soberón et al. , 2007Rocchini et al. 2011;Weigelt et al. 2020).
In order to make inventories and atlases useful tools for biogeographical or ecological research it is thus necessary to take into account these issues and support floristic works with appropriate measures of the degree of uncertainty (Rocchini et al. 2011). In the same context, maps of floristic richness should be accompanied by maps of knowledge, "maps of ignorance" or maps of completeness. These can be realized considering that the number of species (namely the species richness) recorded in a given period and in a given area is partial and lower than the real number of species present (Gotelli & Colwell 2011). The more the sample effort increases the more the number of observed species approaches the theoretical, real number of species. On the contrary, a sampling activity carried over a too long time could detect the species turnover (e.g., for habitat change due to socio-economic or ecological reasons or for climate changing) resulting in an overestimation of the number of species than the existing habitats could theoretically host in a given time. The real floristic richness and its distribution in an area can be estimated with different methods (Gotelli & Colwell 2001;Vallet et al 2012). The most suitable for the kind of data recorded in the field by orchidologists is the use of 'sample based species rarefaction-curves' (Gotelli & Colwell 2001). Given that the sampling order in an area is not important, data are resampled and curves are built. While the shape of accumulation curves depends upon the order in which the samples are considered, the rarefaction curves show smoother lines facilitating the comparison among entire datasets or subsets. A species rarefaction curve is plotted starting from the mean number of species of the smallest sample size. Then the mean number of species is calculated for all combinations of the next sample size (i.e., the mean number of species of two random samples, then three random samples, etc.). This paper analyzes some typical aspects of local scale inventories and atlases hitherto neglected. Here, we propose simple approaches, accessible even for the non-academic, citizen scientists to answer the following specific questions: i. How can the richness of a floristic database be assessed and how can different database be compared?
ii. Which richness estimator is more suitable for terrestrial European orchids, given its intrinsic difficulties of observation in field?
iii. When is the sampling of an area sufficiently complete?
iv. How can completeness maps be realised and how they can be useful to identify where to address further explorations?

Source of the data and study areas
We used three datasets reporting the presence of orchids in three areas in southern Italy, in northern Campania region, (Figure 1; Table 1)

Data collection
Only the observations geolocated with a precision level lower than 100 m (punctual data according to Croce & Nazzaro 2017) were included in the analysis. Nomenclature was revised and, when needed, standardised and hybrids were excluded from the analysis. To avoid the oversampling bias (i.e., a single population of plants sampled in different sampling units) the records have been clumped to represent the presence of the taxa in 100 x 100 m squares, connected to the geographic grid of the used coordinates system (WGS 84 / UTM zone 33N, EPSG 32633). Each sampling unit (plot) is univocally identified, therefore, by the geographic position of the square and by the sampling date so that two sampling activities that took place in two different date but inside the same square have been considered as two different plots. In this way, I take into account the sampling effort in terms of time, very important for species requiring observations at different times to be correctly observed and identified. In the end I get, for each dataset, a matrix taxon × plot that I used for the elaborations and further analysis.

Data analysis
To compare the three datasets in terms of sampling effort and observed specific richness (S obs ), I have mapped the specific richness for each area using a grid with 1 km 2 resolution (i.e., 1 x 1 km UTM cells) intersecting the study areas (i.e., the three geographic units as defined above) and calculating both the number of plots and the number of observed species in each cell. A regression analysis between the number of plots and the number of species per each cell has been performed to correlate the sampling effort to the observed species richness and therefore to validate the density of plots as an indicator of the sampling effort. Then for each area I built a sample-based rarefaction curve using the plots as samples. The curves have been limited to the lower number of plots in the three datasets for a better comparison of the observed species richness and its pattern among the three studied areas. Being drawn with resampling statistical methods, the curves allow the calculation of the 95% confidence limits or the standard deviations.
Among the methods used to estimate the species richness of an area starting from presence-absence data, the most appropriate for floristic inventories and atlases is the relation between number of species and sampling effort (Vallet et al. 2012). This relation is investigated mainly using non parametric estimators, less sensitive to the sampling effort (Palmer 1990;Brose et al. 2003). Such indexes give an estimate of the species richness for a given geographic unit, based upon the considered sample and, therefore, upon its species assemblage. Once an estimate value is obtained, the completeness for each of the three datasets can be calculated by means of the completeness index proposed by Soberón et al. (2000). Such index (C) is expressed as a percentage value of the ratio between the number of observed species (S obs ) and the number of estimated species (S est ): C = S obs /S est The most used non parametric estimators for presence/absence data or incidence data are Jackknife, Chao, Bootstrap, and ICE (Gotelli & Colwell 2011;Vallet et al. 2012). While the first of these indexes could represent a good compromise (Brose et al. 2003), several other authors prefer to compare more than one index (Martinez-Sanz et al. 2010;Bruno et al. 2012;Garcia-Marquez et al. 2012;Vallet et al. 2012;Archer 2019). It is therefore noted that the Jackknife estimator gives higher values of estimated richness and, accordingly, lower completeness values than the Bootstrap estimator (Garcia-Marquez et al. 2012). Nevertheless, it is particularly effective in estimating the richness of small sample size (Hortal et al. 2006). Another very used estimator is Chao2 (Ugland et al. 2003;Chao & Chiu 2016;Idohou et al. 2015;Asase & Peterson 2016) that gives more emphasis to the presence of singletons species (i.e., present in only one plot of the set or subset) or doubletons (i.e., present in only two plots). Considered that many orchid species are locally rare and the number of rare species increases with decreasing the size of the sampled area, I calculated the completeness index (C) choosing as value of estimated richness (S est ) the maximum value between Chao2 (S Chao2 ) and Jackknife1 (S jack1 ) estimates. For each of the three study areas I calculated the total value of completeness (C) and the completeness of each cell of the 1 km 2 UTM grid, using the plots as sampling units. Only for Roccamonfina area the completeness has been calculated also for each cell of a 4 km 2 , 9 km 2 , 16 km 2 , 25 km 2 , and 36 km 2 UTM grid intersecting the study area. Then I aggregated the data into 1 x 1 km cells and the obtained taxon × cells matrix has been used to recalculate the estimated species richness and the completeness of each study area. This was intended to test the reliability of such atlases built mapping the presence of the species in grids with cells of 1 km 2 or more, to estimate the species richness of the study areas. In order to test the estimators robustness when even larger sample units are used, the above mentioned aggregation method has been repeated using grids of 4 km 2 , 9 km 2 , 16 km 2 , 25 km 2 , and 36 km 2 cells, only for the larger area of Roccamonfina volcano. In other terms, I used increasing size cells as sampling units. Such cells size can be useful to analyse atlases produced with bibliographic data whose precise geolocation is not possible. The completeness of each cell, for all the grids of different cells size, has been classified into four levels: 0-25 %, 25-50 %, 50-75 %, and 75-100 %. The cell with less than six plots have not been analysed and have been classified as "not evaluable" (n.e.). These limits have been set considering for all the datasets used an average number of five plot sampled in a day. According to the method used in Bruno et al. (2012), the cells with completeness >65% have been considered sufficiently studied squares (SSS).
Once I knew the less explored cells, to which priority in the future research should be given, I could assess the level of completeness of our datasets among different habitats. So, I assigned a kind of vegetation to each plot on the basis of the collected field information and therefore I estimated the completeness of each vegetation type for each study area as explained above.
The cartographic elaborations have been performed by the software Qgis3 (QGIS Development Team 2019),

J TT
the rarefaction curves and the calculation of the richness estimators have been produced by means of the software Estimates 8.20 (Colwell 2013) performing 1000 permutations. Statistical analyses have been performed using the software PAST (Hammer et al. 2001). All the used software is open source or free.

RESULTS
In Table 1 the data about the three study areas are reported, including the list of the taxa considered. The Roccamonfina area has the highest species richness, average number of records/plot and plot/km 2 . Vairanese and W-Matese show comparable values of the number of records/plot (higher values for W-Matese) and number of plots/km 2 (higher values for Vairanese). Nevertheless, the distribution of the number of plots ( Figure 2a) and observed species richness (Figure 2b) in the 1km 2 cells is extremely heterogeneous with a very high standard deviation of the plots/cells ratio (6.9 for Roccamonfina, 6.5 for Vairanese and 5.9 for Matese areas). Such values underline a sampling effort not uniformly distributed in the studied areas. The regression analysis (Figure 3) shows, for all the three areas, a statistically significant (p <0.001) positive correlation between the number of plots and the number of species inside the 1 km 2 cells. The two variables are statistically correlated according to the Kendall's tau test.
The rarefaction curves ( Figure 4) indicate a similar pattern for all the three areas: limited to 121 plots, they show slight differences with a higher species richness for the W-Matese area (32.84 average observed species) followed by the Vairanese area (32 average observed species) and the Roccamonfina volcano (31.32 average observed species).
The total estimated floristic richness, computed using the plots as sampling units ( Table 2) for each of the three areas, gives completeness values between 78.2% (Vairanese) and 88.5% (Roccamonfina). Using the 1 km 2 cells as sampling units (Table 3), we get identical values for Roccamonfina area, a slightly higher value for Vairanese area and slightly lower for W-Matese area.
The completeness of the 1km 2 cells in the three areas (Figure 2c) is distributed in a similar way in the Roccamonfina and Vairanese areas (Table 4): the 35.6% and 33.3% of the 1 km 2 cells, respectively, have a completeness higher than 65% and therefore are considered as Sufficiently Studied Squares (SSS). For the W-Matese area only the 25% of the 1 km 2 cells are SSS. It is relevant, for each area, the great number of cells with data not allowing further elaborations ('n.e.' cells).
The estimated richness for the Roccamonfina area, calculated using sampling units of increasing size ( Figure  5) shows a general stability of the two estimators chosen, always with higher values for Jackknife1 estimator (51.82-53.47) compared to Chao2 estimator (48.11-49.34). Both the estimators feature variations included within 1.65 unity, a value lower than the standard deviations calculated by the software. The completeness of the cells of increasing size, calculated for Roccamonfina area (Table 5) using the plots as sampling units, gives a gradual increase of the number of SSS, up to over 50% of the 9 km 2 cells and 80% of the 36 km 2 cells.
In Table 6 the observed and estimated species richness and the completeness of the different habitats using the plots as sampling units are reported. For the Roccamonfina area the completeness of the habitats is high except for agricultural environments. The chestnut orchards host the higher species richness (38 species, 82% of the whole area), followed by the open habitats such as meadows and shrublands (33 species). In the other study areas the completeness is relatively low for the broadleaved woodlands of Vairanese and open habitats of the W-Matese, indicating a still not adequate sampling for such habitats. For a better comparison of the species richness among the different habitats, considering that more than 70% of the plots are located inside chestnut orchards, the rarefaction curves were plotted for Roccamonfina habitats (Figure 6), limited to 100 plots. The richness curve rises in a steeper way in the chestnut orchards but it is overtaken by artificial habitats around 30 plots and by open habitats around 50 plots. The richness of broadleaved woodlands and chestnut coppices is always lower, as expected.

DISCUSSION
The higher species richness is correlated to the sampling effort, expressed as number of plots, as well as the ecological features of the areas and their extension. This parameter is known, in ecology as the species/area relationship (SAR -Preston 1962) and it could be used to compare and estimate species richness of floristic atlases only under certain conditions that, if disregarded impede its extrapolation (Vallet et al. 2012). The correlation analysis here performed confirms that the higher is the number of sampling units (plot) in an area, the higher will be the observed species richness. Comparing the richness of the three studied areas plotted by rarefaction curves, highlights that with the same sampling effort (i.e., the same number of plots), the richest area can host a relatively lower number of species than the less rich area. Nevertheless, such kind of analysis requires the same exhaustivity of the studies for each area. The overall completeness of the study areas gives values close to 90% and consistently above 70%. Also, very interesting is the data emerging from the estimates of the richness and the completeness calculated using the 1 km 2 cells of the UTM grid as sampling units. Such size could be very useful to study larger areas or to include lower precision data in the analysis and the completeness values did not differ significantly from the resulting estimates obtained using 100 x 100 m sampling units (plots). For the Roccamonfina area, in addition, even using increasing size cells as sampling units, the estimates do not vary significantly. This result can be taken into account whenever we have to choose the better grid resolution to draw atlases from non punctual data (e.g., literature data or observations with low location accuracy). The elaborations should follow, in this case, a reverse path: starting from a large sampling unit (e.g., a 10 x 10 km cells UTM grid), decreasing the size of the sampling units and calculating the completeness for the study area. Since small size cells will have more probability to hold 'singletons' (unique presence data) for a bigger number of species, the used estimators will

J TT
give higher estimates of richness and, therefore, lower values of completeness. For the same reason linked to the presence of singletons, in our study the number of sufficient studied squares (SSS) increases as their size become bigger. In the case of Roccamonfina area, using a grid of 9 km 2 cells, a half of them are classified as SSS. The distribution of the completeness for a grid of 1 km 2 cells (Table 4), on the other hand, is comparable for Roccamonfina and Vairanese, with more than 33% of the squares classified as SSS while for W-Matese area this value reaches only 25%. To assess whether these rates represent a good result (i.e., the area is exhaustively well studied), we can refer to the choice of the limit of 65% to consider a cell as sufficiently studied. In Bruno et al. (2012) this completeness limit has been chosen to select a useful number of squares to perform further analysis. These authors, for all the four considered taxonomic groups, get lower portion of squares SSS compared to the portion we get for our studied areas. Nevertheless, the absolute number of SSS for both Vairanese and W-Matese areas (respectively six and five squares) is too low and recall the need to continue the study in these two areas. The stratified analysis by habitat types underlines firstly what habitats need more studies or are less suitable for orchids. For example, agricultural habitats for Roccamonfina would need further sampling since their completeness is only 55% (Table 6). It could be expected that, adding further sampling, the completeness would increase even without an increasing of the species   richness. These habitats are in fact less suitable to host orchids as they are affected by frequent and strong ecological changes (e.g., soil tillage, switching to other crops, supply of nutrients). Such considerations could be made for the broadleaved woodlands of the Vairanese area, mostly represented by Holm oaks woodlands with very low light in the understory since orchids abundance is highly correlated to light regime (Djordjević & Tsiftsis 2020;Hrivnák et al. 2020). On the contrary we expect that the low completeness value for the open habitats of the W-Matese area is due to a high theoretical richness of such habitats, not fully detected by the sampling activity. In other words, the sampling effort for the open habitats of the W-Matese area is still insufficient.

J TT
Also, the rarefaction curves allow ecological considerations ( Figure 6). The chestnut orchards represent an ecosystem made of a mosaic between woodlands and meadows, so they are a suitable habitat for the most heliophilous species as well as for the nemoral ones. This explains why their average species richness increases steeply even with a few plots (it is possible to observe more than 20 species in one plot). Nevertheless, on a larger scale, the richness of chestnut orchards is higher than the richness in open habitats only because of the higher area occupied by the former. When the curves are limited to 50 plots, surprisingly the richest habitats are the artificial areas. This result can be explained with the apophyte behavior of many orchids species (Adamowski 2006) and with the fact that we considered the roadsides as artificial habitats. Such environments can host many species characteristics of open habitats such as meadows and grasslands, and constitute important refuge areas for native species (Auestad et al. 2011).
Overall, the analysis of the three datasets allowed the sampling effort to be evaluated and gave useful indications to where and how to conduct the future researches. Moreover, some suggestions on the use of statistical tools to compare different study areas were given. For two areas (Roccamonfina and Vairanese), there is a sufficient level of knowledge of how the orchids richness is distributed, if we assume that a low completeness value in two squares out of three could be due to the lack of suitable habitats (i.e., urban areas or intensive agriculture areas) and to the difficult to locate a sufficient number of sampling units or plots. The squares with no data or with a lower completeness should be regarded as the highest priority areas for the future floristic research. Sampling these areas could increase the level of knowledge (i.e., the completeness value) and could lead to detect new species for the squares or for the studied area. The analysis of the floristic richness and the completeness of every habitat in a less known area would be very useful to prioritize, in each cell of a chosen grid, where to focus the research.

CONCLUSIONS
In conclusion, this study highlights that the quality of a floristic research can benefit from the evaluation of the completeness. Its calculation allows the creation of knowledge/ignorance maps for orchids at different scale using grids at different resolutions (e.g., from cells of 1 km 2 for small islands and reserves to cells of 100 km 2 for regions). A randomized and stratified sampling design would reduce the sampling bias, enable the use of abundance indices rather than presence/absence data and allow the investigation on the relation between species richness and environmental variables. It is often necessary, however, to take into account a large amount of data lacking accuracy or uniformity as is the case of data from literature or collected by different and sometimes occasional contributors (e.g., in citizen science projects).
In any case it is desirable in each modern floristic study and particularly orchids distribution study, a quantitative analysis of the work expressing the results not only as the total number of species observed and their distribution but focusing more on the sampling methods and on the distribution of the knowledge. Even if a sampling design avoiding preferential sampling would be desirable but not always possible (e.g., when using data from online platforms or literature), the proposed methods would help the authors to evaluate the sampling effort, identify the less studied areas or postpone the publication of their checklists and atlases until an acceptable level of exhaustivity, or completeness, would be reached.