Citizen science provides large amounts of information about everything from galaxies to animal species, making it a boon for scientists.
However, crowd-sourced data is inconsistent: more reports come from densely populated regions and fewer from hard-to-access spots, posing a challenge for scientists who need uniformly distributed data.
There is a huge bias in the data set because the data is collected by volunteers.
Di Chen, Doctoral Student, Department of Computer Science, Cornell University
Chen is also the first author of “Bias Reduction via End-to-End Shift Learning: Application to Citizen Science,” which will be presented at the AAAI Conference on Artificial Intelligence on January 27th, 2019, in Honolulu.
“Since this is highly motivated by their personal interest, the distribution of this kind of data is not what scientists want,” stated Chen. “All the data is actually distributed along main roads and in urban areas because most people don’t want to drive 200 miles to help us explore birds in a desert.”
In order to compensate, Chen created a deep learning model along with Carla Gomes, professor of computer science and director of the Institute for Computational Sustainability. This model compares the population densities of numerous locations and effectively rectifies location biases in citizen science. Chen and Gomes tested their model on the data obtained from the Cornell Lab of Ornithology’s eBird, which gathers over 100 million bird sightings submitted yearly by birdwatchers across the globe.
“When I communicate with conservation biologists and ecologists, a big part of communicating about these estimates is convincing them that we are aware of these biases and, to the degree possible, controlling for them,” stated Daniel Fink, a senior research associate at the Lab of Ornithology who is working with Chen and Gomes on this research. “This gives [biologists and ecologists] a better reason to trust these results and actually use them, and base decisions on them.”
Historically, scientists have known about the problems associated with citizen science data and have attempted a wide range of techniques to deal with them, including other kinds of statistical models. While projects that offer incentives to tempt volunteers to travel to distant locations or seek out less-popular species have shown promise, they can be costly and difficult to carry out at scale.
A large data set like eBird is valuable for machine learning, in which huge amounts of data are used to train computers to make predictions. However, a model trained on the eBird data would make skewed predictions owing to the location biases. The characteristics of the data further complicate any bias adjustment: each bird sighting in the system includes 16 distinct pieces of information, making the correction computationally challenging.
Gomes and Chen addressed the problem with a deep learning model that adjusts for population variations across regions by comparing their density ratios. Deep learning is a type of machine learning that is particularly effective at classification tasks.
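The intuition behind density-ratio correction can be illustrated with a minimal sketch: samples from the biased collection are reweighted by the ratio of the distribution scientists want to the distribution volunteers actually produce. All region names, occurrence rates, and proportions below are hypothetical, and the ratios are assumed known here for clarity; the paper's model instead learns them end to end with a neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true bird occurrence rates in two region types.
true_rate = {"urban": 0.2, "remote": 0.6}

# Target distribution: scientists want even coverage of both region types.
p_target = {"urban": 0.5, "remote": 0.5}
# Biased collection: volunteers report mostly from urban areas near roads.
p_biased = {"urban": 0.9, "remote": 0.1}

# Simulate biased citizen-science reports.
n = 100_000
regions = rng.choice(["urban", "remote"], size=n,
                     p=[p_biased["urban"], p_biased["remote"]])
sightings = np.array([rng.random() < true_rate[r] for r in regions])

# Naive estimate of the overall occurrence rate ignores the sampling bias.
naive = sightings.mean()

# Importance weights: density ratio p_target(x) / p_biased(x) per sample.
weights = np.array([p_target[r] / p_biased[r] for r in regions])
corrected = np.average(sightings, weights=weights)

# True rate under even coverage: 0.5 * 0.2 + 0.5 * 0.6 = 0.4
print(f"naive: {naive:.3f}, corrected: {corrected:.3f}")
```

The naive estimate lands near 0.24 because urban reports dominate, while the reweighted estimate recovers roughly the 0.4 rate that even geographic coverage would yield.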
Right now the data we get is essentially biased because the birds don’t just stay around cities, so we need to factor that in and correct that. We need to make sure the training data is going to match what you would have in the real world.
Carla Gomes, Professor, Department of Computer Science; Director, Institute for Computational Sustainability, Cornell University
After testing a number of models, Chen and Gomes found that their deep learning algorithm predicts where bird species are likely to be found more effectively than other machine learning or statistical models.
Although they both worked with eBird, their findings could be applied to any kind of citizen science project, stated Gomes.
“There are many, many applications that rely on citizen science, and this problem is prevalent, so you really need to correct for it, whether people are classifying birds, galaxies or other situations where data biases can skew the learned model,” she stated.
The National Science Foundation supported the study.