Pandemic Misinformation Spreads Fast. Can Machine Learning Help?

As information on the COVID-19 pandemic spreads at a rate rivaling that of the virus itself, separating trustworthy information from sensational and false news is paramount. While the pandemic has illuminated the dangers of misleading health stories, researchers were investigating methods to detect misinformation long before the outbreak. They hope these algorithmic tools will help social media users find the best information and avoid articles that are intentionally misleading.

In a recent study published in IEEE Access, Yue Liu et al. proposed a new machine learning algorithm that distinguishes reliable from unreliable health-related information. Liu et al.’s study focused on China because of its aging population and the prevalent use of e-health services to acquire medical knowledge. The researchers reviewed 4,381 health-related articles and classified each as either “reliable” or “unreliable.” Articles deemed reliable were written by doctors or medical researchers, or published by government-sponsored organizations. Conversely, articles carrying health-related rumors propagated through Chinese social media platforms like WeChat or Weibo were classified as unreliable. Through this categorization method, the researchers obtained 2,296 pieces of reliable health information and 2,085 pieces of unreliable information.
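To make that categorization concrete, here is a minimal sketch of what such source-based labeling might look like in code. The field names and source categories are illustrative assumptions, not the authors’ actual pipeline.

```python
# Toy source-based labeling rule mirroring the study's criteria: articles
# from doctors, medical researchers, or government-sponsored organizations
# count as "reliable"; rumor posts from social media count as "unreliable".
# The source_type values below are hypothetical stand-ins.
RELIABLE_SOURCES = {"doctor", "medical_researcher", "government_org"}

def label_article(article: dict) -> str:
    """Assign a reliability label based on the article's source metadata."""
    return "reliable" if article["source_type"] in RELIABLE_SOURCES else "unreliable"

articles = [
    {"title": "New findings on hypertension", "source_type": "medical_researcher"},
    {"title": "Repost this cancer cure to your family!", "source_type": "wechat_post"},
]
print([label_article(a) for a in articles])  # ['reliable', 'unreliable']
```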

The researchers began building their algorithm by generating summary statistics of the keywords and writing styles in each type of story. Prior analysis suggested that unreliable articles are generally designed to go viral: they tended to focus on hot-button issues like cancer, weight loss, and vaccines. Conversely, reliable articles covered a wider range of topics, aiming to highlight disease facts, present medical findings, or dispel rumors. The authors also differentiated reliable articles from unreliable ones through title length. Unreliable articles tended to use exaggerated language and click-bait terms, resulting in longer titles, while reliable articles used more concise titles to state a fact or raise a question. The average title of a reliable health-related article ran 18.4 words, compared to 26.1 words for an unreliable story. The pattern reversed for body text: reliable articles averaged 1,980 words overall, while unreliable articles averaged 1,473.
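As a rough illustration, a statistic like average title length per class could be computed with the sketch below. The data layout is an assumption, and the whitespace splitting is a simplification: Chinese text would first need a word segmenter before word counts are meaningful.

```python
from statistics import mean

# Toy labeled corpus; in the study, each record is a Chinese health article.
corpus = [
    {"title": "Study links exercise to lower blood pressure", "label": "reliable"},
    {"title": "Doctors hate this weird detox trick, repost before it is deleted",
     "label": "unreliable"},
]

def avg_title_length(articles, label):
    """Mean title length, in words, for articles carrying the given label."""
    return mean(len(a["title"].split()) for a in articles if a["label"] == label)

for label in ("reliable", "unreliable"):
    print(label, avg_title_length(corpus, label))
```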

The researchers used the manually sorted “reliable” and “unreliable” articles to train and validate a machine learning model that can automatically sort health information. The team trained the model to flag articles containing specific words or phrases that were common in each category of the manually sorted set. For example, the researchers found abundant use of words like “cancer,” “detoxification,” and “natural,” along with imperative phrases like “repost [it to your family]” or “[click to] buy,” in unreliable sources. The model in turn learned to associate words like “research,” “doctor,” and “relieve” with reliable articles. It also learned to associate square brackets with reliable sources, as brackets were an indicator of proper citation.
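The paper’s exact model isn’t reproduced here, but a minimal sketch of the general idea, bag-of-words features feeding a standard classifier, shows how such word associations get learned. The scikit-learn pipeline and the English stand-in texts are my assumptions, not the authors’ setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set; the real study used thousands of
# manually sorted Chinese articles, not these English stand-ins.
texts = [
    "research shows doctors can relieve symptoms [1]",
    "natural detoxification cures cancer, repost it to your family",
    "doctor-led research published with proper citations [2]",
    "click to buy this natural cancer cure now",
]
labels = ["reliable", "unreliable", "reliable", "unreliable"]

# Bag-of-words features let the classifier learn positive or negative
# weights for cue words such as "research" and "doctor" versus
# "detoxification", "repost", and "buy".
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["new research from a doctor on how to relieve pain"]))
```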

After training the model on 80% of the collected articles, the researchers tested its effectiveness on the remaining 20%, using a quantitative scoring system to identify which articles were reliable and which were not. The model earned a precision score of 83.74%, meaning it worked well at detecting unreliable articles, though it sometimes produced false positives, incorrectly classifying reliable articles as unreliable. Liu et al. then compared their model with a simpler pre-built machine learning model called FastText, which takes word sequences as input and automatically predicts reliability. While less labor-intensive than the researchers’ custom model, FastText received a lower precision score of 70.98%.
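Under the same scikit-learn assumption, the 80/20 split and precision calculation might look like the sketch below, with “unreliable” treated as the positive class the detector is hunting for. The corpus, again, is a toy stand-in.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus; the study used 4,381 labeled Chinese articles.
texts = [
    "research shows doctors can relieve symptoms [1]",
    "natural detoxification cures cancer, repost it to your family",
    "doctor-led research published with proper citations [2]",
    "click to buy this natural cancer cure now",
    "medical researchers report new findings on vaccines [3]",
    "miracle weight loss secret, click to buy today",
    "government health agency dispels vaccine rumor [4]",
    "this natural trick melts fat, repost to your friends",
    "doctors explain how to relieve back pain [5]",
    "shocking cancer cure they do not want you to know",
]
labels = ["reliable", "unreliable"] * 5

# 80/20 train/test split, mirroring the paper's evaluation protocol.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

# Precision with "unreliable" as the positive class: of the articles the
# model flags as unreliable, what fraction truly are?
precision = precision_score(y_test, model.predict(X_test), pos_label="unreliable")
print(f"precision: {precision:.2%}")
```

The FastText comparison would amount to replacing this hand-built pipeline with fastText’s supervised text classifier, which learns its own word representations directly from the raw word sequences.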

These findings suggest that machine learning can be a tool for detecting fictitious health information. The authors’ approach does have limitations: it relies on the researchers’ assumptions about what is and is not reliable, and it was tested on a limited set of health-related stories written for a Chinese audience. Nevertheless, the accuracy ratings found in this study illustrate that reliable and unreliable health stories carry distinct signatures. For the first time, the possibilities machine learning offers policymakers and consumers for preventing and containing an epidemic feel within our grasp.


Liu, Yue, Ke Yu, Xiaofei Wu, Linbo Qing, and Yonghong Peng. 2019. “Analysis and Detection of Health-Related Misinformation on Chinese Social Media.” IEEE Access 7 (October). https://doi.org/10.1109/ACCESS.2019.2946624
