Optimising big data from citizen science projects for biodiversity research

Periodic Reporting for period 1 - OptimCS (Optimising big data from citizen science projects for biodiversity research)

Reporting period: 2020-11-01 to 2022-10-31

Citizen science (research conducted in whole or in part by people for whom science is not their profession) is increasingly valuable for society, ecology, and conservation. Natural resource and landscape management based on the best available science increasingly relies, at least in part, on citizen science data to make informed and adaptive decisions that support biodiversity conservation. The data collection power of citizen science is enormous, but because citizen science at this scale is a new development in ecology and conservation, the process is still highly inefficient. The largest inefficiency is that, to date, the most ‘successful’ citizen science projects generally have a haphazard sampling regime, leaving redundancies and gaps in the resulting data. Can this enormous amount of effort be directed more efficiently? What steps can be taken upstream in citizen science projects to maximise the efficiency of analyses of downstream datasets? This project builds a workflow that maximises the information content citizen scientists contribute to our collective knowledge of biodiversity. It does so by developing algorithms that predict the highest-‘valued’ sites in time and space for biodiversity sampling, so that citizen scientists' effort can be directed more efficiently. The project had two objectives.

Objective 1: Develop algorithms to optimise sampling by citizen scientists in time and space.

Objective 2: Experimentally determine the willingness of citizen scientists to sample more strategically.

Objective 1: Develop algorithms to optimise sampling by citizen scientists in time and space.

I developed a framework for guiding biodiversity sampling using eBird citizen science data, demonstrated in the state of Florida. The framework combines land-cover variables, including tree cover, water cover, and habitat heterogeneity, with observed species richness and the number of checklists in a structural equation model (SEM). The SEM showed a strong influence of land cover on species richness, with urban cover the most strongly supported predictor, followed by habitat heterogeneity, tree cover, and water cover. The number of checklists at a site was also predicted by the percentage of urban cover and by habitat heterogeneity, suggesting that these two land-cover attributes influence where people submit eBird checklists. The number of checklists was also higher in grid cells with higher species richness. I found that relatively few samples are needed to reach 95% sampling completeness when diversity estimation focuses on dominant species: 43, 64, 96, 123, 172, and 176 samples for grain sizes of 5 × 5, 10 × 10, 15 × 15, 20 × 20, 25 × 25, and 30 × 30 km, respectively.
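
To make the idea of "samples needed to reach 95% completeness" concrete, the following is a minimal sketch only. This summary does not state which completeness estimator was used, so the sketch assumes a commonly used sample-coverage estimator for incidence data (in the style of Chao and Jost), which is not the dominant-species measure reported above and does not reproduce the reported numbers. The function names `incidence_coverage` and `checklists_needed` are hypothetical.

```python
from collections import Counter


def incidence_coverage(checklists):
    """Estimate sample coverage (0..1) from checklists given as sets of species.

    Assumption: coverage = 1 - (Q1/U) * (T-1)Q1 / ((T-1)Q1 + 2*Q2), where
    T is the number of checklists, U the total number of species incidences,
    and Q1/Q2 the number of species seen on exactly one/two checklists.
    """
    t = len(checklists)
    # Incidence frequency: number of checklists on which each species appears.
    counts = Counter(sp for checklist in checklists for sp in set(checklist))
    u = sum(counts.values())
    if t == 0 or u == 0:
        return 0.0
    q1 = sum(1 for c in counts.values() if c == 1)
    q2 = sum(1 for c in counts.values() if c == 2)
    if q1 == 0:
        # No species seen only once: the sample is treated as complete.
        return 1.0
    denom = (t - 1) * q1 + 2 * q2
    # Simplification: when the adjustment is undefined (0/0), use 1.0.
    adjust = (t - 1) * q1 / denom if denom > 0 else 1.0
    return 1.0 - (q1 / u) * adjust


def checklists_needed(checklists, target=0.95):
    """Return the smallest n such that the first n checklists reach the target
    coverage, or None if the target is never reached."""
    for n in range(1, len(checklists) + 1):
        if incidence_coverage(checklists[:n]) >= target:
            return n
    return None
```

In practice the result depends on the order of the checklists, so one would likely average over many random orderings within each grid cell rather than rely on a single pass.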

Additionally, I quantified the differences in bias between unstructured and semi-structured citizen science data. I found strong evidence that large-bodied birds were over-represented in the unstructured citizen science dataset; moderate evidence that common species were over-represented; strong evidence that species occurring in large groups were over-represented; and no evidence that colorful species were over-represented. These results suggest that unstructured citizen science data are biased relative to semi-structured data, likely as a result of species detectability and the inherent recording process. Importantly, in programs like iNaturalist the detectability process is two-fold: first, an individual organism needs to be detected, and second, it needs to be photographed, which is likely easier for many large-bodied species. The results indicate that caution is warranted when using unstructured citizen science data in ecological modelling. They also highlight body size as a fundamental trait that can serve as a covariate when modelling opportunistic species occurrence records, representing detectability or identifiability in unstructured citizen science datasets.
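
As a purely descriptive illustration of what "over-represented" can mean here, the sketch below compares each species' share of records in two datasets. It is not the statistical analysis behind the results above, which this summary does not describe; the function name `log_representation_ratio` and the input format (species name mapped to record count) are assumptions made for the example.

```python
import math


def log_representation_ratio(unstructured_counts, semi_structured_counts):
    """Return log(share in unstructured / share in semi-structured) per species.

    Positive values mean the species makes up a larger share of the
    unstructured records; negative values mean a smaller share.
    Only species present in both datasets are compared.
    """
    shared = set(unstructured_counts) & set(semi_structured_counts)
    total_u = sum(unstructured_counts[sp] for sp in shared)
    total_s = sum(semi_structured_counts[sp] for sp in shared)
    if total_u == 0 or total_s == 0:
        return {}
    ratios = {}
    for sp in shared:
        share_u = unstructured_counts[sp] / total_u
        share_s = semi_structured_counts[sp] / total_s
        # Skip species with a zero share, where the log ratio is undefined.
        if share_u > 0 and share_s > 0:
            ratios[sp] = math.log(share_u / share_s)
    return ratios
```

A per-species index like this could then be related to traits such as body size, for example by regressing it on log body mass, although that step is beyond this sketch.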

Objective 2: Experimentally determine the willingness of citizen scientists to sample more strategically.

Using a novel experimental design, I found that individuals were indeed willing to sample more strategically and that biodiversity sampling can be improved through behavioral nudges. I developed a framework to estimate the ‘priority’ of a given citizen science sample based on species richness, and then ran a controlled study in which participants were asked to contribute according to where the ‘highest priority’ grid cells were. Relative to the available area in a study region, low priority cells accounted for 36% (Lake Macquarie) and 61% (Wingecarribee) of sampling in the control regions, for only 12% (Wollongong) and 23% (Central Coast) in regions presented with a dynamic map, and for only 15% (Blue Mountains) and 22% (Hornsby) in regions presented with a dynamic map and a leaderboard. High priority cells accounted for 32% (Lake Macquarie) and 15% (Wingecarribee) of sampling in the control regions, for 58% (Wollongong) and 8% (Central Coast) in regions presented with a dynamic map, and for 73% (Hornsby) and 67% (Blue Mountains) in regions presented with a dynamic map and a leaderboard.
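
The percentages above are shares of sampling by priority class. A minimal sketch of that kind of tally is shown below, assuming each sample is recorded with the identifier of the grid cell it falls in and each cell has already been assigned a priority label. The function name `priority_shares` and the data layout are hypothetical, and the sketch computes plain shares with no adjustment for the available area of each class.

```python
from collections import Counter


def priority_shares(sample_cells, cell_priority):
    """Return the percentage of samples that fall in each priority class.

    sample_cells: iterable of grid-cell ids, one per submitted sample.
    cell_priority: dict mapping grid-cell id to a label such as "low" or "high".
    """
    tally = Counter(cell_priority[cell] for cell in sample_cells)
    total = sum(tally.values())
    if total == 0:
        return {}
    return {level: 100.0 * n / total for level, n in tally.items()}
```

For example, four samples in cells labelled low, high, high, and medium would give shares of 25%, 50%, and 25% for the low, high, and medium classes respectively.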

The results presented here represent the state of the art in biodiversity citizen science sampling and are expected to contribute to a new way of thinking about how biodiversity is sampled with citizen science participants. Citizen science data will continue to play an important role in biodiversity monitoring in the future. Despite their promise, there remains reluctance to use these data, stemming in large part from their gaps and redundancies. To increase the utility of these data, a key goal is to understand how biodiversity sampling should be conducted in space and time. I found that relatively few samples are necessary to reach 95% completeness, which allows relatively robust comparisons of species diversity across the landscape. The results therefore highlight the potential of citizen science data for informed comparisons of species diversity in space and/or time. However, the sampling effort required depends on the monitoring goal, for example whether all species or only the more common species are targeted. The generalizable workflow presented here quantifies the sampling effort needed to estimate species diversity with citizen science data and shows how citizen science sampling effort might be targeted toward better estimates of biodiversity. These results are expected to benefit wider society through the collection of better biodiversity data, improving our understanding of biodiversity change into the future.
