Fighting fake news with Uni of Penn's Cognitive Computation Group
20.09.2020 | 6 min read
Earlier this year, 10Clouds worked with the University of Pennsylvania’s Cognitive Computation Group on its COVID-19 information verification platform. The product was built to challenge misinformation. During the COVID-19 pandemic, it’s very easy to fall victim to unconfirmed reports, deliberately designed fake news, or simply misinformed sources. This creates chaos, uninformed opinions and conflict among people.
The problem of course reaches far beyond COVID-19, and the platform created for the University of Pennsylvania is a solution that can later be enhanced to be more universal and cover a vast array of different topics. We spoke to Sihao Chen to find out more about the platform and its goals.
We were very excited to work with you on the COVID-19 project. But we understand that your ultimate aim is to fight disinformation in a range of different areas of life. Could you tell us a bit more about your mission?
Indeed, the problem existed well before COVID-19.
We are living in an era where generating and publishing content is remarkably easy. That ease also comes with growing concerns about misinformation, disinformation and bias.
Some of them, such as health and political misinformation, have already had direct and detrimental effects on our society. The “infodemic” from the COVID-19 pandemic is but another unfortunate example.
We are not the only ones trying to address the problem. In fact, experts from both industry and the research world have dedicated years of effort to automated solutions for “fact checking”, in hopes of delivering “truth” to the end users sitting in front of their TVs, computers or phones. However, defining “truth” is not always easy. Take "Should we wear face masks" in the context of COVID-19 as an example. As the implications of wearing face masks vary depending on the aspects one is looking at, the media attitude and even the official guidance have shifted dramatically over the past few months. “Fact checking” alone would not solve the entire problem here, as there exist many different, yet valid, perspectives on the question. In such cases, we realize that the current tools or information channels are often limited in terms of the diversity of perspectives they provide, even if they are considered “trustworthy”. If one is interested in becoming more informed about a topic, listening to different voices is essential to start thinking critically on the matter.
This became our motivation for the project.
As researchers in Natural Language Understanding, we seek to develop systems to understand the “perspectives” behind textual information. Such systems can be used to tell distinct perspectives apart from redundant ones, and keep a reasonably sized set of relevant, trustworthy, and timely information that can be delivered to the users.
This is in no way an easy task. We view the COVID-19 infodemic as a case study to define and address the key research challenges raised by the need to navigate our way through the information-polluted space. For this reason, we collaborated with 10Clouds on the COVID-19 information platform, powered by our prototype machine learning models for “perspective understanding”. We are currently using the site to evaluate our models’ performance from the end users’ perspective and identify potential areas of improvement. This project has now seen initial success, and has sparked our future research in this exciting direction.
It would be great to hear about how you're using computational methods to address the issue of our contaminated information supply. Could you give us an overview of how your platform works?
The core idea behind our current platform is the “perspective understanding” of web content. We build a series of machine learning models for (1) identifying or abstracting the key argument from a document/webpage, (2) deciding whether the key argument is indeed a relevant perspective to the user query, and (3) judging whether two perspectives are semantically different, and thus have different implications. The machine learning models are mostly trained with debate forum data, covering different discussion-worthy topics.
Apart from the “perspective understanding” module, we also have a few different models for retrieving COVID-19 relevant web content (from a list of trustworthy sources) and categorizing web pages based on the aspects of the pandemic they discuss.
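To make the three steps concrete, here is a minimal Python sketch of how such a pipeline could be wired together. It is not the group's implementation: the interview describes trained machine learning models, whereas this sketch swaps in crude word-overlap (Jaccard) heuristics purely for illustration. All function names and thresholds are hypothetical.

```python
from __future__ import annotations

import re


def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for a learned text representation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; 0.0 when both texts are empty."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def extract_key_argument(document: str) -> str:
    """Step 1 stand-in: treat the first sentence as the key argument."""
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return sentences[0]


def is_relevant(argument: str, query: str, threshold: float = 0.1) -> bool:
    """Step 2 stand-in: relevant if the argument shares enough words with the query."""
    return jaccard(argument, query) >= threshold


def is_distinct(a: str, b: str, threshold: float = 0.6) -> bool:
    """Step 3 stand-in: two perspectives differ if their word overlap is low."""
    return jaccard(a, b) < threshold


def select_perspectives(documents: list[str], query: str) -> list[str]:
    """Keep one key argument per distinct, query-relevant perspective."""
    kept: list[str] = []
    for doc in documents:
        argument = extract_key_argument(doc)
        if not is_relevant(argument, query):
            continue
        if all(is_distinct(argument, other) for other in kept):
            kept.append(argument)
    return kept
```

Called with a handful of pages and the query "Should we wear face masks", the sketch drops off-topic pages and near-duplicate arguments and returns the remaining key arguments in their original order. A real system would replace each stand-in with a trained model, since word overlap cannot tell whether two similarly worded arguments actually disagree.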
It’s worth noting that none of the models are tailored to COVID-19, since we hope to apply the same methodology to any subject. And as we identify more challenges in trying to combat information pollution, more modules will be added to the platform and evaluated for their effectiveness.
How will a user be able to understand the source of a given piece of information about COVID-19 and be able to make an informed assessment of its reliability?
This is also a question that is getting more attention lately.
How do we, as users who are not media professionals, spot information pollution and learn to judge the trustworthiness of what we see? This is especially tricky, considering that even the most professional media with a transparent fact-checking process could have a political leaning.
A good starting point would be reading and hearing more voices on the subject. However, this is not as easy as it sounds. Information is not always organized as formal debates, where people of different backgrounds and interests will offer their concise opinions, corroborated by supporting facts and evidence.
As part of our mission, we try to deliver information in a format similar to a “debate”, where the users are able to view different perspectives from various sources in one convenient place. However, this alone does not guarantee that every piece of information we show is “reliable”. Our short-term goal is to reduce the redundancy of the information presented, so that a critical reader takes less time to form an educated opinion on the subject, and so is able to assess the reliability of information on a comparative basis. Our current solution is to include only sources that are known to have a professional and transparent fact-checking process. As part of our future research, we aim to automate the comparative analysis approach, and build systems for inferring the reliability of information based on similar perspectives or evidence that are known to be well-corroborated or trustworthy (e.g. peer-reviewed).
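As a rough illustration of the current approach described here (only vetted sources, with perspectives laid out side by side), the following Python sketch filters pages against an allowlist and groups them by perspective. The domain list, the function names, and the assumption that each page already carries a perspective label from an upstream model are all hypothetical and not taken from the platform itself.

```python
from __future__ import annotations

from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical allowlist; the platform's real list of vetted sources is not given.
TRUSTED_DOMAINS = {"example.org", "example.com"}


def is_trusted(url: str) -> bool:
    """True if the URL's host (ignoring a leading 'www.') is on the allowlist."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host in TRUSTED_DOMAINS


def debate_view(items: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (url, perspective) pairs by perspective, keeping trusted sources only.

    The perspective labels are assumed to come from an upstream model.
    """
    view: dict[str, list[str]] = defaultdict(list)
    for url, perspective in items:
        if is_trusted(url):
            view[perspective].append(url)
    return dict(view)
```

Each key in the returned dictionary is one perspective, and its value lists the vetted pages that express it, which is roughly the "debate" layout described above. Note that an allowlist only filters by source; it says nothing about the reliability of an individual claim, which is the open research question mentioned in the answer.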
Which new sectors are you looking to target and what are your plans for the next year?
The fight against information pollution has been an ongoing part of my Ph.D. study, and I plan to continue this effort. Beyond what we have discussed here, there are many other aspects of information pollution that need automated support. The reliability of sources would be an example. For the rest of my Ph.D. study (and maybe my research career), apart from working towards the solutions, I would also like to bring more attention to the problems I’m trying to address. Hopefully what I have shared so far has convinced you that this is an important problem.