FAKE news... A hot topic that has been around for many years. It has recently exploded and grown even hotter amid the fraud claims surrounding the 2020 U.S. election and the COVID-19 pandemic.
What exactly is Fake news? Fake news is false or misleading information presented as news. It often aims to damage the reputation of a person (see image below) or an entity or make money through advertising revenue.
Particularly for business, why is it relevant to analyze, characterize, and PREDICT fake news? While Fake news is considered a serious threat to real journalism, it is also connected to stock market fluctuations and massive trades. For example, a few years ago, a Fake news story claiming that Barack Obama had been injured in an explosion wiped out 130 billion dollars in stock value.
As I mentioned in my first post, in this and future posts I'll dive deeper into relevant examples of Data Analytics applied to real-life situations. In this post, I'm going to describe how to apply Advanced Visualization techniques to visually explore and characterize Fake and True (real) news articles, and to quickly extract insights that can reinforce PREDICTIONS made by leveraging, for example, Neural Natural Language Modeling for binary text classification with the RUIMTEHOL R package.
To facilitate the exposition, I've divided it into three sections:
Data preparation in TRIFACTA
QUANTEDA visualization tools: Descriptive Analytics
Neural Natural Language modeling — RUIMTEHOL R package: Predictive Analytics
Now, it's time to unlock another real-world use case. I hope it'll be informative and insightful for you. Please don't forget to leave your comments below and share.
Data preparation in TRIFACTA
The dataset for this example is from a Kaggle competition; it comprises two files in CSV format containing news from November 2016 to December 2017 (Fake news articles file 61.32 MB, 23,500 Fake news articles; True news articles file 52.33 MB, 21,417 True news articles). Both files have the same features/columns: news title/headline, the text of the news, news subject, and dates.
It is worth noting the common misconception that Analytics techniques are only suited to high volumes of data, or "big data." This example (like the use case to be discussed in a future post) will show that most Analytics methods and techniques are flexible enough to adapt smoothly to both small data volumes and "big data." The same flexibility holds for modern data preparation workflows and tools like TRIFACTA Data Wrangling Tools and Software (available on-premise and as Dataprep by Trifacta on the Google Cloud Platform). However, in this particular example, why use TRIFACTA to prepare the data instead of an Excel spreadsheet or a programmed script?
As a future post's dataset will show, a small file doesn't mean a simple one. In collaborative tools like TRIFACTA, users of any level of expertise (only domain knowledge and general math/statistics are required) can visually assess the data and intuitively interact with it throughout the preparation process, knowing exactly what's going on with each data point at any time.
Data preparation in TRIFACTA, including blending disparate and complex files, is now an experience that is easily auditable, explainable, flexible, and reproducible, regardless of file size. And the process can be fully automated. Yes, the once time-consuming and tedious data preparation job can now be done better, easier, and faster. Achieving the same goals with obsolete, cryptic programmed scripts or Excel spreadsheets is impractical and lengthy.
The figure above depicts True news text data loaded in TRIFACTA; on the right is the recipe with the logic implemented to cleanse the data (removing punctuation and unnecessary symbols, etc.) and perform all the necessary transformations (transforming to lowercase, for example), including the addition of an extra column with the tag or label (class) TRUE.
The figure above illustrates the same for Fake news data; the tag or label (class) FAKE column has also been included. The figure below shows the result of blending both refined files into a single dataset. Only the necessary features have been kept: a document/news ID, news title, news text, and the label/class, namely TRUE or FAKE. This tag/class will be used later in the workflow, together with the news text, to train a Machine Learning model for a binary classification task. It's also possible to connect directly to the refined, blended data or download it in the appropriate format. The data is now ready for visual exploration/Descriptive Analytics, Predictive Analytics, and more.
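For readers who prefer code, the same cleansing-and-blending recipe can be sketched in plain R. This is only a minimal approximation of the TRIFACTA steps described above; the file names (`True.csv`, `Fake.csv`) and column names are assumptions based on the Kaggle dataset description.

```r
# Hedged sketch of the TRIFACTA recipe in R (dplyr + stringr);
# file and column names are assumptions, not confirmed by the post.
library(dplyr)
library(stringr)

prep <- function(path, label) {
  read.csv(path, stringsAsFactors = FALSE) %>%
    transmute(
      title = str_to_lower(title),
      # remove punctuation/symbols and lowercase, mirroring the recipe
      text  = str_to_lower(str_replace_all(text, "[[:punct:]]+", " ")),
      label = label
    )
}

# Blend both refined files into a single dataset with an ID column
news <- bind_rows(prep("True.csv", "TRUE"), prep("Fake.csv", "FAKE")) %>%
  mutate(doc_id = row_number(), .before = 1)
```

The resulting `news` data frame has the same four kept features as the blended TRIFACTA output: ID, title, text, and the TRUE/FAKE label.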
It's important to point out that if any issue is detected downstream or additional features are required to be included in the analysis, it is pretty easy to quickly go back and edit the recipe(s), fix the problem, and move on again in the pipeline.
With the data ready, it's time for visual exploration to gain some insight; it's time for Descriptive Analytics. Here is where QUANTEDA's powerful visualization functions and tools come in handy. Let's build some informative and insightful visualizations!
QUANTEDA visualization tools: Descriptive Analytics
The QUANTEDA package is a fast, flexible, comprehensive framework for graphics and quantitative text analysis. The package is designed for R users needing to apply Natural Language Processing (NLP) to texts, from documents to final analysis. Its capabilities match or exceed those provided in many end-user software applications, many of which are expensive and not open source. Therefore, the package greatly benefits researchers, students, and other analysts with fewer financial resources. I was particularly interested in some of the package's rich graphic toolbox functions to visually explore the Fake/True news dataset and quickly surface relevant insight.
Now for some examples of QUANTEDA graphic tools to visually assess Fake/True news text. The figure below compares Fake news and True news Word Clouds. It can be seen that the Fake news word distribution is sharper than the True news one: Fake news tends to use fewer words and repeats some of them more frequently ("trump," for example) than True news does. The "transition" zone (in light red and green) between low-frequency words (in red) and higher-frequency words (in blue) is notably broader in True news. See the Word Cloud on the right.
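A comparison cloud like the one in the figure can be sketched with a few lines of quanteda. This assumes a blended data frame `news` with `text` and `label` columns, as produced by the data preparation step; the exact options used for the post's figure are not shown there, so the settings below are illustrative.

```r
# Minimal quanteda sketch of a FAKE-vs-TRUE comparison word cloud;
# `news` (with text/label columns) is an assumption from the prep step.
library(quanteda)
library(quanteda.textplots)

corp <- corpus(news, text_field = "text")

dfmat <- tokens(corp, remove_punct = TRUE) %>%
  tokens_remove(stopwords("en")) %>%   # drop common function words
  dfm() %>%
  dfm_group(groups = label)            # one "document" per class

# comparison = TRUE draws one cloud region per class
textplot_wordcloud(dfmat, comparison = TRUE, max_words = 200)
```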
The figure below depicts the comparison of Fake news and True news Sentiment Word Clouds; the sentiments considered are fear, anticipation, trust, surprise, and sadness. More subtle differences between Fake and True news can now be observed; for example, in the sadness sector, Fake news stresses words like "black," while True news stresses "tax"; in the anticipation zone, Fake news shows up with the word "boiler," True news with "white house," and so on. Therefore, it could be said that, relying on specific sentiments, Fake news tries to create or exploit a particular idea or opinion, sometimes making the reading experience quite uneasy; the aim of True news, on the other hand, is more informative and fact-based.
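One hedged way to reproduce the numbers behind such a sentiment comparison is a dictionary lookup against the NRC Word-Emotion Association Lexicon. The sketch below assumes the NRC dictionary shipped with the `quanteda.sentiment` package and the blended `news` data frame from the preparation step; it tallies emotion-bearing words per class rather than drawing the cloud itself.

```r
# Hedged sketch: count words per NRC emotion category for each class.
# Assumes `news` (text/label columns) and quanteda.sentiment's NRC lexicon.
library(quanteda)
library(quanteda.sentiment)

dfmat <- corpus(news, text_field = "text") %>%
  tokens(remove_punct = TRUE) %>%
  dfm() %>%
  dfm_group(groups = label)   # collapse to one row per class

# Keep only the five emotions discussed in the figure
nrc <- data_dictionary_NRC[c("fear", "anticipation", "trust",
                             "surprise", "sadness")]

# Rows: FAKE / TRUE; columns: counts of words mapped to each emotion
dfm_lookup(dfmat, dictionary = nrc)
```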
To further corroborate some of the above observations, I built the Wordfish plots shown in the figure below. In a Wordfish plot, it is possible to track the words most often associated with a type of content or a specific subject: the higher the words are in the plot, the more likely they are related to it.
It's clear from the plots, and consistent with the previous observations, that True news is richer in vocabulary, resulting in a broader word distribution that is quite notable in the Wordfish graph on the right (see figure below).
For easier comparison, the two figures below show the Fake news and True news Wordfish plots again. Consider, for example, the words "black" and "police": as can be seen, this pair of words appears very close together in the Fake news plot, and quite near the top; in the True news Wordfish plot, on the other hand, these words are notably separated, and the word "black" shows up much lower in the plot.
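A Wordfish plot like these can be sketched with `quanteda.textmodels`. As before, the `news` data frame is an assumption from the preparation step, and the trimming threshold is illustrative; the same steps would be repeated on the TRUE subset for the second plot.

```r
# Hedged sketch: fit Wordfish on the Fake news subset and plot the
# feature-level scaling. `news` (text/label) is an assumed input.
library(quanteda)
library(quanteda.textmodels)
library(quanteda.textplots)

fake_dfm <- news %>%
  subset(label == "FAKE") %>%
  corpus(text_field = "text") %>%
  tokens(remove_punct = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  dfm() %>%
  dfm_trim(min_termfreq = 10)   # drop rare terms to keep estimation stable

wf <- textmodel_wordfish(fake_dfm)

# Feature plot: words higher up are more strongly associated with the
# estimated latent dimension
textplot_scale1d(wf, margin = "features")
```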
In summary, the visual exploration so far, using some of QUANTEDA's powerful visualization tools, suggests that Fake news focuses mainly on particularly sensitive topics and specific words (sometimes making the reading experience pretty uneasy) and, at best, on partially true opinions, while real news is more informative, richer in vocabulary, and very often based on verifiable facts.
After the visual exploration and gaining some insight, the next step is to build a content-based news classifier, training a Machine Learning binary classification algorithm to predict if a new news article is Fake or True. This task can be accomplished by performing Neural Natural Language Modeling with the RUIMTEHOL R package.
Neural Natural Language modeling — RUIMTEHOL R package: Predictive Analytics
RUIMTEHOL is a comprehensive R package that wraps the StarSpace C++ library. It's a Neural Natural Language Modeling toolkit that allows you to carry out:
Text classification
Finding sentence or document similarity
Content-based recommendations (e.g. recommend text/music based on the content)
Collaborative filtering-based recommendations (e.g. recommend text/music based on interest)
And much more
As mentioned at the end of the last section, a binary text classification task will be carried out. As described before, each news article has been labeled with the tag/class FAKE or TRUE. Next, a binary classification model is constructed, which can be used to tag fresh news articles (articles not considered in the modeling). For a detailed discussion and explanation of the modeling process using RUIMTEHOL functions, please click this LINK.
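The training step described above can be sketched with ruimtehol's `embed_tagspace()` function. The hyperparameters below are illustrative placeholders, not the ones used for the post's model, and `news` is again the assumed blended data frame from the preparation step.

```r
# Hedged sketch of the binary FAKE/TRUE classifier with ruimtehol
# (StarSpace wrapper); hyperparameters are illustrative assumptions.
library(ruimtehol)

set.seed(123456)
model <- embed_tagspace(
  x = news$text,    # article text
  y = news$label,   # class: "FAKE" or "TRUE"
  dim = 50,         # embedding dimension
  epoch = 20,
  loss = "softmax",
  similarity = "cosine",
  ngrams = 2,
  minCount = 5      # ignore very rare tokens
)

# Score a fresh article (not seen during training): the returned
# similarities back one label and argue against the other
predict(model, "hypothetical new article text goes here", k = 2)
```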
The figure below illustrates an example of a Fake news prediction. The model correctly classifies a new Fake news article (news text included in the figure for reference) as FAKE. The tabulated similarity values can be interpreted as prediction probabilities: the closer to 1 (here 0.9999447), the higher the likelihood this news sample is FAKE; a value of almost -1 (-0.999511) indicates a very low probability that this news article is TRUE.
The figure below illustrates an example of a True news prediction. Again, the model correctly classifies the new True news article (news text included in the figure) as TRUE. As before, the tabulated similarity values can be interpreted as prediction probabilities: a value close to 1 (0.9999447) indicates a high likelihood the news sample is TRUE, and a value around -1 (-0.999511) indicates a very low probability the news is FAKE.
Reading the text samples depicted in the last two figures, it is possible to identify several features similar to those surfaced during the visual exploration step in the previous section, characteristics that typify Fake and True news, respectively. As would be expected, that intuition and those observations are quite consistent with, and reinforce, the predictions achieved through the Neural Natural Language model just implemented.
Summary
Wrapping up: a dataset from a Kaggle competition, comprising samples of Fake and True news articles (23,500 and 21,417 articles, respectively), was cleaned, structured, reshaped, and blended in TRIFACTA software. The resulting refined data was fed into some of the powerful graphic tools available in the QUANTEDA R package, which delivered a handful of compelling and insightful visualizations. The refined data was also used to train a Neural Natural Language algorithm available in the RUIMTEHOL R package; a binary classification model was implemented to predict whether fresh news articles (not considered in the modeling process) were True or Fake. Reader intuition and the results of both Visual Analytics and Predictive Analytics are quite consistent and reinforce each other.
In future posts, I'll present and discuss more interesting and relevant real-life use cases. Please stay tuned and don't miss them. And kindly leave your comments below and share. Thank you!