Combatting the spread of scientific falsehoods with NLP

Thursday, September 05, 2024

It’s widely recognized that misinformation poses serious risks. Yet despite efforts by social media companies and major international groups — including the World Health Organization and the United Nations — misleading information remains a significant problem on the web.

Combating lies online is difficult and time-consuming. There is simply too much misinformation for human fact-checkers to debunk, and automated fact-checking technologies still need significant improvement. Moreover, it’s been known for years that lies spread faster than accurate information on social media networks. Despite these challenges, natural language processing (NLP) technologies offer promise for limiting the impact of misinformation.

There are many kinds of misinformation online, but one common technique is to manipulate the findings of scientific studies. This method is particularly effective because it leans on specialized, authoritative research to support a false claim, giving the falsehood “a kernel of truth.”

Indeed, the misrepresentation of scientific studies was a common tactic used during the COVID-19 pandemic to spread lies about the origin of the virus, alternative treatments and the efficacy of vaccines. But it can be used in many other contexts. “It’s an especially dangerous type of misinformation because it’s hard for non-experts to debunk,” said Iryna Gurevych, adjunct professor of natural language processing at the Mohamed bin Zayed University of Artificial Intelligence. “Recognizing fallacies that are based on misrepresented science is hard for people.”

Gurevych and coauthors from MBZUAI and other institutions have written a study that takes a step toward combating misinformation that misinterprets evidence from scientific research. Their findings were recently presented at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), one of the largest annual gatherings of natural language processing researchers.

First-of-its-kind dataset

Gurevych and her coauthors’ study is based on a dataset they compiled called MISSCI, which comprises real-world examples of misinformation gathered from a fact-checking website.

To create the MISSCI dataset, the researchers hired annotators to work through fact-checking articles written by human fact-checkers for a website called HealthFeedback, which collaborates with scientists to review health and medical claims. In total, the researchers provided the annotators with 208 links to scientific publications that had been misrepresented by the claims addressed in the fact-checking articles.

The annotators manually reconstructed the misleading arguments from the fact-checking articles according to a specific method developed by Gurevych and her team. They then classified the reasoning errors, known as fallacies, into nine classes, such as “fallacy of exclusion” and “false equivalence.”

The dataset compiled by the annotators provided the basis for evaluating large language models (LLMs), and it is the first of its kind to use real-world examples of misrepresented scientific studies.
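To give a concrete sense of what one reconstructed example might look like, here is a minimal sketch in Python. The field names and sample text are illustrative assumptions, not the actual MISSCI schema; they simply mirror the elements described above: a claim, the publication context, the fallacious premise and the fallacy class.

```python
from dataclasses import dataclass

@dataclass
class MisrepresentedArgument:
    """Hypothetical record for one reconstructed misleading argument.

    The field names are illustrative, not the MISSCI schema; they mirror
    the elements described in the article.
    """
    claim: str                 # the false claim circulating online
    publication_context: str   # what the cited study actually reports
    fallacious_premise: str    # the unstated leap needed to reach the claim
    fallacy_class: str         # one of the nine annotated classes

example = MisrepresentedArgument(
    claim="A published study proves the treatment cures the disease in humans.",
    publication_context="The study reports a small in-vitro experiment, "
                        "not a clinical trial in humans.",
    fallacious_premise="Effects observed in cell cultures are equivalent to "
                       "effects in patients.",
    fallacy_class="false equivalence",  # one of the two classes named in the article
)
print(example.fallacy_class)
```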

Evaluating performance

The researchers were then interested in evaluating the reasoning abilities of two popular LLMs: GPT-4, developed by OpenAI, and LLaMA 2, developed by Meta. Their goal was to use the models to automatically generate the fallacious premises that lead from a scientific publication to an inaccurate claim, and to classify those premises into categories.

For the classification task, the LLMs were given a claim, a fallacious premise and the context of the scientific publication, and were prompted to assign the fallacy behind that premise to one of the nine categories. In a separate task, the models were asked to generate the fallacious premises on their own.

The researchers tested the two models in several configurations: providing them, for example, with the definition of the fallacy, with the logic of the fallacy, or with the logic of the fallacy together with an example of it.
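As a rough illustration of this setup, the sketch below assembles a classification prompt under those configurations. The prompt wording, the helper names and the example texts are assumptions for illustration, not the prompts used in the study, and only two of the nine fallacy classes are named in this article.

```python
# Illustrative sketch of a fallacy-classification prompt. The wording and the
# truncated class list are assumptions; only "fallacy of exclusion" and
# "false equivalence" are named in the article.

FALLACY_CLASSES = ["fallacy of exclusion", "false equivalence"]  # plus seven more in the study

def build_prompt(claim, publication_context, fallacious_premise, class_info=None):
    """Assemble a classification prompt.

    class_info optionally maps each candidate class to whatever the tested
    configuration supplies for it: its definition, its logic, or its logic
    plus an example.
    """
    lines = [
        f"Claim: {claim}",
        f"What the publication actually reports: {publication_context}",
        f"Premise used to get from the publication to the claim: {fallacious_premise}",
        "Which fallacy class best describes this premise?",
    ]
    for name in FALLACY_CLASSES:
        hint = f" ({class_info[name]})" if class_info and name in class_info else ""
        lines.append(f"- {name}{hint}")
    return "\n".join(lines)

# Example of a "definition" configuration; a real run would send this prompt
# to GPT-4 or LLaMA 2 and parse the predicted class from the reply.
prompt = build_prompt(
    claim="A published study proves the treatment cures the disease in humans.",
    publication_context="The study reports a small in-vitro experiment, "
                        "not a clinical trial in humans.",
    fallacious_premise="Effects observed in cell cultures are equivalent to "
                       "effects in patients.",
    class_info={"false equivalence": "treating two different things as if "
                                     "they were the same"},
)
print(prompt)
```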

In predicting the fallacy class, GPT-4 outperformed LLaMA 2 overall and across nearly every individual fallacy class. When given the logic of the fallacy along with an example, and supplied with correct fallacious premises, GPT-4 predicted the correct fallacy class 77% of the time, while LLaMA 2 did so 66% of the time.

Interestingly, when human evaluators examined the outputs generated by the models, they found that even when a model generated a plausible fallacious premise, it could still misclassify the fallacy.

Other fact-checking approaches

A current approach to fact-checking with NLP is to use what are known as knowledge bases, which are repositories of factual information. With this method, a system compares a claim to information in the knowledge base and predicts whether the claim is true. Techniques like this, which focus on attacking a claim with counter-evidence, fail when counter-evidence isn’t available, which is the case for most real-world false claims, Gurevych explained.
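To make that failure mode concrete, here is a minimal toy sketch of such a counter-evidence pipeline, assuming a naive keyword retriever and a stub in place of a trained entailment model (both are placeholders, not any real system). The point is structural: the verdict depends entirely on relevant evidence being present in the knowledge base.

```python
# Toy sketch of counter-evidence-based fact-checking. The keyword retrieval
# and the stub entailment function stand in for real components.

def check_claim(claim, knowledge_base, entailment_fn):
    """knowledge_base: list of factual statements.
    entailment_fn(evidence, claim) -> 'supports' | 'refutes' | 'neutral';
    in a real system this would be a trained entailment (NLI) model."""
    # naive keyword overlap in place of a real retriever
    words = set(claim.lower().split())
    evidence = [s for s in knowledge_base if words & set(s.lower().split())]
    if not evidence:
        # The failure mode described above: without relevant counter-evidence
        # in the knowledge base, no verdict can be reached.
        return "NOT ENOUGH INFO"
    verdicts = {entailment_fn(e, claim) for e in evidence}
    if "refutes" in verdicts:
        return "REFUTED"
    if "supports" in verdicts:
        return "SUPPORTED"
    return "NOT ENOUGH INFO"

# Toy usage with a stub standing in for a trained entailment model.
kb = ["Vaccines underwent large clinical trials before approval."]
stub_nli = lambda evidence, claim: "refutes" if "never" in claim else "neutral"
print(check_claim("Vaccines were never tested in clinical trials.", kb, stub_nli))
# -> REFUTED; with an empty knowledge base the same call returns NOT ENOUGH INFO.
```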

In this study, instead of countering a claim with evidence, Gurevych and her coauthors’ method attacks the reasoning that links the scientific source to the claim, using LLMs to identify the fallacies required to arrive at the false claim from the source.

A significant challenge

Fact-checking is difficult for both humans and machines. “Discovering fallacies is something that people struggle with, and it makes experimentation difficult,” Gurevych said. “We see that LLMs are not at the stage yet where they can accurately address these complex reasoning questions.”

In the future, Gurevych and colleagues intend to analyze other models according to the same framework and develop methods to account for scientific evidence contained in more than one document.

That said, the task itself, determining whether a claim is true based on evidence, is often contested and is fundamental to how knowledge is generated. “The question of whether evidence supports a claim is at the heart of any empirical science and one of the reasons scientists have developed methods like peer review,” Gurevych said. “It’s an extremely difficult question to answer.”