ChatGPT, the large language model (LLM) from OpenAI, recently brought artificial intelligence (AI) firmly into public debate. Like other LLMs, ChatGPT owes its wealth of knowledge to the texts it was trained on, including open access academic articles. In fact, the histories and fates of open access science and AI are closely linked, but their relationship is far from simple.
The rise of open access science can be traced back to the dawn of the Internet and the World Wide Web, which enabled researchers worldwide to share and access scientific knowledge freely online. It was further propelled by the growing movement to reform academic journal publishing and the need to find new funding models for it.
Meanwhile, the value of AI is heavily dependent on access to large datasets. However, using such data raises a host of ethical and legal issues, such as plagiarism, copyright infringement, data purity, and the reliability of sources. There is much to consider in this complex and ever-evolving field.
What is Open Access Science?
Open Access (OA) refers to the principles and practices that allow research outputs to be distributed online without access barriers or fees. In the traditional publishing model, publishers acquire the copyright of articles from authors in exchange for publishing and distributing the articles worldwide through subscriptions to libraries. However, many electronic journals have adopted the OA concept, making articles freely accessible and reusable on the internet.
The OA movement has progressed in two directions: gold OA, which involves publishing articles as OA from the time of initial publication with the author's consent, and green OA, which consists of sharing articles through self-archiving or institutional repositories before or after publication.
As of March 2021, over 100 research funders and 800 universities had registered open-access mandates, listed in the Registry of Open Access Repository Mandates and Policies. In 2022, President Biden’s administration issued a mandate requiring US federal agencies to have policies in place by the end of 2025 that make all results (papers, documents, and data) from US government-funded research publicly available immediately upon publication.
Despite controversies surrounding the peer review system and the prevalence of predatory journals, high-quality open access journals continue to emerge. The main advantage of open access journals is the free access to scientific papers, regardless of affiliation with a subscribing library, and improved access for the general public, especially in developing countries.
The Budapest Open Access Initiative claims that OA results in lower costs for research in academia and industry. However, some argue that OA may increase the overall cost of publication and lead to further exploitation in academic publishing.
The debate over OA is complex and controversial. Commercial publishers and nonprofit scientific societies have long argued for keeping a one-year embargo before articles become open access, stating that it is crucial for protecting the subscription revenues that cover editing and production costs and fund society activities. Opponents of paywalls counter that they obstruct the free flow of information, enable price gouging by some publishers, and force US taxpayers to pay twice: once to fund the research and again to access the results. Since the late 1990s, these critics have urged Congress and the White House to require free and immediate open access to government-funded research.
OA and AI
The current moment is a testament to the potential of AI technologies such as machine learning (ML) and natural language processing (NLP) in science.
The recent rise of the pretrain-then-fine-tune modeling paradigm has produced large domain-adapted language models such as BioBERT and SciBERT. These models are crucial resources for advancing the state of the art in scientific NLP tasks such as information extraction, information retrieval, knowledge base population, question answering, and summarization.
Open-access data repositories provide the high-quality data that is essential for AI applications and are crucial for the rapid progress of the field.
Semantic Scholar and CORD-19
Semantic Scholar, an AI-powered research tool, is making scientific breakthroughs easier by helping scholars find and understand critical research findings.
The Semantic Scholar team created the COVID-19 Open Research Dataset (CORD-19) in collaboration with NIH, Microsoft, and research groups. CORD-19 aims to serve as a blueprint for addressing global challenges and highlights the potential of AI and NLP in advancing scientific research.
With nearly two million views, the dataset has become the basis of the most popular Kaggle competition, demonstrating the importance of scientific collaboration and open access to scientific data in accelerating discovery.
The success of CORD-19 suggests that if scientific literature were widely available for automated analysis, it could potentially speed up advancements in all areas of research.
OA AI Development
AI and open access science have a close relationship. AI often interacts with open access scientific data, and in many cases, AI is being developed with open access principles in mind.
Kaggle, now part of Google, provides a platform for data scientists and machine learning experts to access and share datasets, build and test models, and collaborate on challenges.
The BigScience Workshop, modeled on large-scale research collaborations in fields such as particle physics, genetics, and astronomy, brought over 1,000 researchers together to train the BLOOM language model using specialized hardware.
The Turing Way is a community promoting responsible and collaborative practices in data science. The group has co-written a handbook with tools and best practices available in multiple languages.
The Mozilla Festival's Building Trustworthy AI Working Group collaborates on AI projects with over 400 members from various countries.
In short, open access science makes research articles and data freely available to the public, and AI can draw on that material as a resource for scientific discovery and innovation.
Given access to large volumes of scientific data and literature, AI algorithms can analyze the material and extract insights, potentially leading to breakthroughs.