Of late, big data from numerous social media applications have turned the web into a user-generated warehouse of information in a constantly increasing number of areas. Because tweets and their metadata are fairly easy to access, Twitter has become a prevalent source of data for studies of many phenomena. These include, for example, political and social unrest, Twitter as a tool for crisis communication, political campaigns, and the use of social media data to forecast stock market prices.
However, studies that rely on social media data are frequently skewed by the presence of bots. Bots can be defined as non-personal, automated accounts that post content to online social networks. Twitter’s popularity as a tool in public debate has made it an ideal target for automated scripts and spammers. It has been estimated that about 5–10% of all users are bots, and that these accounts produce about 20–25% of all tweets posted.
Digital humanities researchers at the University of Eastern Finland and Linnaeus University in Sweden have created a new application that relies on machine learning to detect Twitter bots. The application can detect auto-generated tweets regardless of the language used. The team collected a total of 15,000 tweets in Swedish, Finnish, and English for the study. Finnish and Swedish tweets were used mostly for training, while tweets in English were used to assess the language independence of the application. The application is lightweight, making it feasible to classify massive amounts of data quickly and relatively efficiently.
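The article does not say which features or which model the application uses. As a rough illustration of the train-on-some-languages, test-on-another idea only, the Python sketch below assumes a handful of surface features that do not depend on vocabulary (link count, hashtag and mention count, share of digits, share of capitals) and a simple nearest-centroid classifier. The feature set, the function names, and the example tweets are all invented for this sketch and are not taken from the researchers' work.

```python
import re
from statistics import mean

URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"[#@]\w+")


def features(tweet):
    """Surface features that ignore vocabulary (an assumed, hypothetical choice)."""
    n = max(len(tweet), 1)
    return [
        float(len(URL_RE.findall(tweet))),      # number of links
        float(len(TAG_RE.findall(tweet))),      # number of hashtags and mentions
        sum(ch.isdigit() for ch in tweet) / n,  # share of digits
        sum(ch.isupper() for ch in tweet) / n,  # share of capital letters
    ]


def train(tweets, labels):
    """Store one mean feature vector (centroid) per label."""
    rows_by_label = {}
    for tweet, label in zip(tweets, labels):
        rows_by_label.setdefault(label, []).append(features(tweet))
    return {
        label: [mean(column) for column in zip(*rows)]
        for label, rows in rows_by_label.items()
    }


def predict(model, tweet):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    x = features(tweet)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))

    return min(model, key=lambda label: squared_distance(model[label]))


# Invented toy data: Finnish and Swedish tweets for training.
train_tweets = [
    "Sää nyt: 3°C, tuuli 5 m/s https://example.com/fi #sää #Helsinki",
    "Nyhet 12:00 https://example.com/sv #nyheter #Sverige",
    "Olipa mukava ilta ystävien kanssa, kiitos kaikille",
    "Vilken fin dag vi hade vid sjön idag",
]
train_labels = ["bot", "bot", "human", "human"]
model = train(train_tweets, train_labels)

# Invented English tweets, unseen during training, to probe language independence.
english_bot = "Weather now: 7°C, wind 4 m/s https://example.com/en #weather #London"
english_human = "Had a lovely walk in the park with my sister today"
print(predict(model, english_bot), predict(model, english_human))
```

Because none of the features look at individual words, a model fitted on Finnish and Swedish examples can be applied to English text unchanged. A real system would need far more data and a properly evaluated model.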
“This enhances the quality of data – and paints a more accurate picture of the reality,” Professor of English Mikko Laitinen from the University of Eastern Finland notes.
According to Professor Laitinen, bots are relatively harmless, whereas trolls do harm by spreading fake news and fabricated stories. This is why there is a demand for increasingly sophisticated tools for social media monitoring.
“This is a complex issue and requires interdisciplinary approaches. For instance, we linguists are working together with machine learning specialists. This type of work also calls for determination and investments in research infrastructures that serve as a platform for researchers from different fields to collaborate on.”
Mikko Laitinen, Professor of English, University of Eastern Finland
According to Professor Laitinen, it is vital for scientists to have access to data from social media.
“Currently, data are the property of American technology conglomerates, and a source of their income. In order for researchers to gain access to this data, cooperation at the national and international levels, and especially the involvement of the EU are needed.”