Aaditya

| Technical Review: ABCOM Team | Copy Editor: Anushka Devasthale | Last Updated: Aug 25, 2020 | Level: Intermediate | Banner Image Source : Internet |


Are you tired of watching fake news spread on social media for the sake of publicity? How can you tell whether a news item is fake just by looking at its headline? Learn to perform text classification using the NLTK library to answer such questions.

Everyone would agree that 2020 has been a year of unprecedented events. The corona outbreak at the very beginning of the year forced most of us to stay at home, missing out on our daily activities. As the famous saying goes, an empty mind is a devil's workshop: with little work to occupy us, many of us turned to social media to amuse ourselves.

It would not be an exaggeration to say that social media is the easiest channel for spreading fake news. To gain popularity and followers on these platforms, many people resort to sensational and ostentatious posts, including fake ones. There is also a large group that forwards whatever it receives without verification. It is of utmost importance to check whatever we forward. To help with this tedious task, we present an application of text classification, built on the NLTK library, that classifies news as “real” or “fake” by examining its headline.

Creating a Project

Create a new Google Colab project and rename it to Fake News Classifier. If you are new to Colab, then check out this short video tutorial on Google Colab.

Import the required libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

Loading Dataset

The dataset for this project is taken from the Kaggle site. It is split into two files: one contains fake news (Fake.csv), and the other contains real news (True.csv).

Use the wget command to download the two files from my GitHub.

!wget https://github.com/abcom-mltutorials/FakeNews/blob/master/572515_1037534_compressed_Fake.csv.zip?raw=true
!wget https://github.com/abcom-mltutorials/FakeNews/blob/master/572515_1037534_compressed_True.csv.zip?raw=true

Unzip the two downloaded files:

!unzip "/content/572515_1037534_compressed_Fake.csv.zip?raw=true"
!unzip "/content/572515_1037534_compressed_True.csv.zip?raw=true"

Load the file data into the pandas dataframes:

fake = pd.read_csv("Fake.csv")
true = pd.read_csv("True.csv")

Examine the data by printing a few records.

print(fake.head())
print(true.head())

The above code gives the following output:

image01

Both data frames contain four columns. You will use only the title column for the model development. The title column contains news headlines.

The following code prints a single headline from both the dataframes:

print("Fake news headline: "+fake.iloc[0,0])
print("True news headline: "+true.iloc[0,0])

The above code gives the following output:

Fake news headline:  Donald Trump Sends Out Embarrassing New Year’s Eve Message; This is Disturbing
True news headline: As U.S. budget fight looms, Republicans flip their fiscal script

You can check the number of records in each dataframe by inspecting its shape attribute:

print(fake.shape)
print(true.shape)

You will see that the fake news dataset contains 23481 records, and the true dataset contains 21417 records. We will combine these two datasets while building our model. A few of the classifiers we are going to use cannot handle such a massive dataset, so I am going to reduce each dataset to 50% of its original size. Use the following commands to keep the first half of the records in both dataframes:

fake = fake[:11740]
true = true[:10708]
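The hard-coded slice bounds above are simply half of each dataset's record count; a quick arithmetic check in plain Python:

```python
# integer halves of the record counts reported above
fake_count, true_count = 23481, 21417
fake_half = fake_count // 2   # 11740
true_half = true_count // 2   # 10708
print(fake_half, true_half)   # 11740 10708
```

If you prefer to avoid magic numbers, fake.iloc[:len(fake)//2] would compute the same bound directly from the dataframe.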

Preprocessing Data

We will carry out several preprocessing steps to make the data ready for model training.

Adding is_fake Column

We have two datasets, fake and true. Before combining them for model training, we need to add an is_fake column to both.

Use the following code to add an is_fake column and its appropriate values for both the dataframes.

fake["is_fake"] = "fake"   # label every fake record
true["is_fake"] = "true"   # label every true record

Now we are ready to join both the data frames. Use the merge command to join the dataframes.

news = pd.merge(fake,true, how = "outer")

With how = "outer", the merge keeps all the rows from both dataframes; it behaves like a union (outer join) in set theory.
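Since the two dataframes share the same columns, an outer merge here simply stacks the rows; pd.concat gives the same result. A minimal sketch with two made-up one-row frames:

```python
import pandas as pd

# two toy frames with the same columns (illustrative data only)
a = pd.DataFrame({"title": ["headline one"], "is_fake": ["fake"]})
b = pd.DataFrame({"title": ["headline two"], "is_fake": ["true"]})

merged = pd.merge(a, b, how="outer")            # union of the rows
stacked = pd.concat([a, b], ignore_index=True)  # equivalent here

print(len(merged), len(stacked))  # 2 2
```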

You can plot the class distribution in the combined dataset using the following code:

classes = news["is_fake"]
print(classes.value_counts())
classes.hist()
plt.xlabel("Classes in is_fake")
plt.ylabel("Number of records")
plt.show()

The output is as follows:

image02

Encoding Columns

We will now use the LabelEncoder from sklearn to encode our is_fake field to 0’s and 1’s.

# convert class labels to binary values,
# 0 = fake and 1 = true
encoder = LabelEncoder()
is_fake = encoder.fit_transform(classes) 
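LabelEncoder assigns codes in alphabetical order of the class names, which is why “fake” becomes 0 and “true” becomes 1. A minimal check:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["fake", "true", "fake", "true"])

print(list(encoder.classes_))  # ['fake', 'true'] (sorted alphabetically)
print(list(codes))             # [0, 1, 0, 1]
```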

Preparing Headlines

We extract the headlines into an array from the title column:

headlines = news["title"]
print(headlines[:10])

Use a regular expression to remove punctuation from each headline (recent pandas versions require regex=True for pattern replacement):

headlines = headlines.str.replace(r'[^\w\d\s]', ' ', regex=True)
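The character class [^\w\d\s] matches anything that is not a word character, digit, or whitespace, and each match is replaced with a space. The same pattern applied with plain Python's re module, on a made-up headline:

```python
import re

headline = "Trump's 'Embarrassing' Message!"
cleaned = re.sub(r'[^\w\d\s]', ' ', headline)
print(cleaned)  # punctuation replaced by spaces
```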

Convert the headlines to lowercase.

headlines = headlines.str.lower()

Next, you remove the stopwords. Words such as "the," "is," "at," and "which" are stopwords, and removing them does not affect the meaning of the headlines. To remove the stopwords, first download the list of stopwords from nltk, then create a set of English stopwords:

# Removing stopwords from news headlines
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

Use the apply method to run a lambda function on every headline, keeping only those words that are not in stop_words and joining the remaining words back into a single string:

headlines = headlines.apply(lambda x : " ".
                           join(word for word in x.split()
                           if word not in stop_words))
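The filtering step can be seen in isolation with a hypothetical three-word stop list (nltk's English list is far longer):

```python
# hypothetical tiny stop list for illustration
stop_words = {"the", "is", "at"}

headline = "the senate is at odds"
filtered = " ".join(word for word in headline.split()
                    if word not in stop_words)
print(filtered)  # senate odds
```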

Removing stopwords can help improve performance, as fewer and only meaningful tokens remain, which may increase classification accuracy. Even search engines like Google remove stopwords for fast and relevant retrieval of data.

In English sentences, many times, you find words such as “sends” and “send” or “embarrassing” and “embarrass.” In both cases, the base words “send” and “embarrass” are required for machine learning. The process of removing these suffixes and extracting the base words is called stemming. We use nltk’s PorterStemmer class to extract stems in every headline:

# Remove affixes to give stems using a Porter stemmer
ps = nltk.PorterStemmer()
headlines = headlines.apply(lambda x: ' '.join(ps.stem(word)
                           for word in x.split()))

We instantiate the PorterStemmer class and use it in the lambda function to extract the stem from each word giving a list of stems for a sentence. Run this for every headline in headlines using the apply method. By using PorterStemmer, we have successfully removed all the affixes, and only root words remain. For instance, "sends" becomes "send" and "embarrassing" becomes “embarrass.”
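To see the idea without running nltk, here is a deliberately naive stemmer that strips two common suffixes; the real Porter stemmer applies a much larger set of ordered rules:

```python
def naive_stem(word):
    # strip a couple of common suffixes, keeping a minimum stem length;
    # PorterStemmer handles many more rules (plurals, -ed, -ation, ...)
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(naive_stem("sends"))         # send
print(naive_stem("embarrassing"))  # embarrass
```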

Generating Features

The features for our model training will be the words in our headlines dataset. To extract the words, you use the word tokenizer from the nltk library, which relies on the punkt tokenizer models. Download them using the following statement.

nltk.download("punkt")

You will use the word_tokenize method from this library to extract words. Use the following code fragment to do so:

# creating a collection of all the words
all_words = []
 
for line in headlines:
   words = word_tokenize(line)
   for word in words:
       all_words.append(word)
 
print("Number of words: ", len(all_words))

You will notice that the total number of word tokens is 206497. Naturally, this count includes many repetitions. We extract the unique words and their frequencies using nltk's FreqDist:

all_words = nltk.FreqDist(all_words)
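nltk's FreqDist is a subclass of Python's collections.Counter, so its behaviour can be previewed with the standard library alone:

```python
from collections import Counter

# toy token stream with repetitions
tokens = ["trump", "video", "trump", "say", "trump", "video"]
freq = Counter(tokens)

print(len(freq))            # 3 unique words
print(freq.most_common(2))  # [('trump', 3), ('video', 2)]
```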

You may examine the total word count and print the top fifteen common words and their frequency by using the following code:

print('Number of words: {}'.format(len(all_words)))
print('Most common words: {}'.format(all_words.most_common(15)))

The output is:

Number of words: 10174
Most common words: [('trump', 11274), ('video', 4022), ('u', 3198), ('say', 1980), ('republican', 1602), ('hous', 1584), ('senat', 1416), ('obama', 1401), ('watch', 1279), ('white', 1130), ('presid', 981), ('clinton', 978), ('tweet', 970), ('bill', 919), ('democrat', 886)]

With this, we have narrowed down our features from a whopping 206497 tokens to just 10174 unique words. You can observe that the word “trump” occurs most frequently, with a count of 11274, followed by the word “video” with a count of 4022, and so on.

Even a feature size of 10174 would be large for training some of the ML algorithms, so we narrow the feature list down to the 2300 most common words. Note that slicing the keys of a FreqDist does not return the words sorted by frequency, so we use its most_common method:

word_features = [word for word, count in all_words.most_common(2300)]

We have now extracted the most common words (features). The headline will be classified as fake or true, depending on how many of these features are present in a given headline.

We need to create a training dataset for training our classifiers. I am going to use multiple classifiers for training to select the one with the best accuracy. To create the training dataset, we first write a function to find the features in a given headline.

The function defined below determines whether a word from word_features is contained in a given headline.

def find_features(headline):
   words = word_tokenize(headline)
   features = {}
   for word in word_features:
       features[word] = (word in words)
 
   return features
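With a hypothetical three-word feature list, the dictionary the function returns looks like this (a standalone sketch using a plain split instead of word_tokenize, so it is not meant to replace the function above):

```python
word_features = ["trump", "video", "tweet"]  # hypothetical tiny feature list

def sketch_find_features(headline):
    words = headline.split()  # word_tokenize in the real pipeline
    return {word: (word in words) for word in word_features}

features = sketch_find_features("trump sends new tweet")
print(features)  # {'trump': True, 'video': False, 'tweet': True}
```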

We will test this function on one of the headlines:

features = find_features(headlines[0])
for key, value in features.items():
   if value == True:
       print(key)

This gives the following output:

donald
trump
send
embarrass
new
year
eve
messag
disturb

Thus, the first headline in our dataset contains the above most commonly used words (features).

Now, perform the above process for all the headlines. First, zip the headlines with the is_fake labels to get a list of (headline, label) tuples. Then define a seed for reproducibility and shuffle the list. Finally, call the find_features function on each headline in the list of tuples.

# Do it for all the headlines
headlines = list(zip(headlines, is_fake))
# define a seed for reproducibility
seed = 1
np.random.seed(seed)
np.random.shuffle(headlines)
 
# call find_features function for each headline
featuresets = [(find_features(headline), category)
               for (headline, category) in headlines]

Note: If you rerun this code cell, the headlines variable has already been converted into a list of (headline, label) tuples, so the zip and find_features calls will fail. If that happens, rerun the notebook from the preprocessing cells.

Finally, split the featuresets into training and testing datasets using train_test_split:

training, testing = train_test_split(featuresets,
                                    test_size = 0.25,
                                    random_state=seed)

Now, we are ready to train the model on different classifiers.

Scikit-Learn Classifiers

The Scikit-Learn library provides the implementation of several ML algorithms. We will try a few of them on our above-generated dataset. Import these algorithms into your project using the following statements:

from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

Note: I have also imported a few evaluation metrics.

To apply these classifiers to your training dataset, you wrap them in the SklearnClassifier class. The following statement shows you how to wrap LogisticRegression in SklearnClassifier.

news_model = SklearnClassifier(LogisticRegression())

You train the model on the training dataset by calling its train method and passing the dataset as a parameter to it.

news_model.train(training)

After the training is over, you can check the model’s accuracy on the test dataset using the following statement:

accuracy = nltk.classify.accuracy(news_model, testing)*100

Having understood how to wrap a classifier in SklearnClassifier, we will try several algorithms on our dataset. We will first create two arrays containing the classifiers that we want to test along with their names. Explaining the functioning of each type of classifier is beyond the scope of this tutorial.

names = ["Logistic Regression",
        "K Nearest Neighbors",
        "Decision Tree",
        "Random Forest", 
        "SGD Classifier",
        "Naive Bayes",
        "SVM Linear"]
 
classifiers = [
   LogisticRegression(),
   KNeighborsClassifier(),
   DecisionTreeClassifier(),
   RandomForestClassifier(),
   SGDClassifier(max_iter = 100),
   MultinomialNB(),
   SVC(kernel = 'linear')
]

We now use SklearnClassifier to wrap each model, train it on the training dataset, and print the model’s accuracy using the following code fragment:

models = list(zip(names, classifiers))
 
for name, model in models:
   news_model = SklearnClassifier(model)
   news_model.train(training)
   accuracy = nltk.classify.accuracy(news_model, testing)*100
   print("{} Accuracy: {}".format(name, accuracy))

This is the output:

Logistic Regression Accuracy: 92.96151104775481
K Nearest Neighbors Accuracy: 63.934426229508205
Decision Tree Accuracy: 87.08125445473985
Random Forest Accuracy: 90.68068424803991
SGD Classifier Accuracy: 93.01496792587312
Naive Bayes Accuracy: 91.21525302922309
SVM Linear Accuracy: 92.8367783321454

Note that some of the classifiers take a long time to train (these scikit-learn implementations run on the CPU, so a GPU runtime will not speed them up), so be patient while running the above code.

Based on the accuracy metrics produced, you may select an appropriate classifier for your application. Accuracy alone is not always an indication of the best-performing algorithm, so I am going to show you another technique for developing a better-performing model: the VotingClassifier.

Using VotingClassifier

As the name suggests, the VotingClassifier collects a vote from each model when classifying a data point. Consider a case where five classifiers give the following predictions:

LogisticRegression 0
KNeighborsClassifier 1
DecisionTreeClassifier 0
RandomForestClassifier 1
SGDClassifier 0

As three of the five classifiers predicted the given news as fake, the model's final output would be fake news (majority voting). In the case of a tie, the model picks the label that comes first in ascending sort order of the class labels. You use the VotingClassifier as follows:

from sklearn.ensemble import VotingClassifier
# build the (name, classifier) pairs afresh for the ensemble
estimators = list(zip(names, classifiers))
ensemble = SklearnClassifier(VotingClassifier(estimators = estimators, voting = 'hard', n_jobs = -1))
ensemble.train(training)
accuracy = nltk.classify.accuracy(ensemble, testing)*100
print("Voting Classifier: Accuracy: {}".format(accuracy))

This gives output similar to the following:

Voting Classifier: Accuracy: 92.8367783321454

Thus, the VotingClassifier conceptually combines different classifiers and uses a majority vote (hard voting), or the average predicted probabilities (soft voting) to predict the class. This kind of voting balances the weaknesses of the individual classifiers.
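Hard voting reduces to picking the most common prediction; a minimal sketch with the five hypothetical votes from the earlier example:

```python
from collections import Counter

votes = [0, 1, 0, 1, 0]  # one prediction per classifier (0 = fake)
majority = Counter(votes).most_common(1)[0][0]
print(majority)  # 0 -> classified as fake
```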

Model Evaluation

To further evaluate the model's performance, we will generate a classification report and a confusion matrix on the entire test dataset. First, unpack the testing set into headline features and labels. Then call the ensemble's classify_many method on the headline features alone (without the labels) and save the results in the prediction variable:

headline_features, labels = zip(*testing)
prediction = ensemble.classify_many(headline_features)
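The zip(*testing) idiom transposes a list of (features, label) pairs into two parallel tuples; for example:

```python
# toy (features, label) pairs mirroring the shape of the testing set
pairs = [({"trump": True}, 0), ({"video": False}, 1)]

features, labels = zip(*pairs)
print(labels)  # (0, 1)
```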

Now, use the prediction and labels from above to make the classification_report:

print(classification_report(labels, prediction))

The output is:

image03

To print the confusion matrix, use the following statement:

pd.DataFrame(
   confusion_matrix(labels, prediction),
   index = [['actual', 'actual'], ['fake', 'true']],
   columns = [['predicted', 'predicted'], ['fake', 'true']])

The output is:

image04

The confusion matrix shows that 2647 out of 2941 fake news items are correctly classified as fake, and 2566 out of 2671 true news items are correctly classified as true. The test set contains 5612 records in total.
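As a sanity check, the overall accuracy can be recomputed from these confusion-matrix counts:

```python
correct_fake, total_fake = 2647, 2941
correct_true, total_true = 2566, 2671

total = total_fake + total_true
accuracy = (correct_fake + correct_true) / total * 100

print(total)               # 5612
print(round(accuracy, 2))  # 92.89
```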

Now that we know that our model is reasonably good at classifying fake news, let's use it to sort some real-time news.

Inference on Unseen Data

Pass the news item to be classified to the model and get a binary result, where 1 signifies the news is true and 0 represents fake news.

To classify any news, use the classify method on features extracted using our previously defined find_features method. Remember that the find_features method accepts a string argument - our news headline is a string. The following statement shows you how to use the classify method on our trained ensemble model.

print(ensemble.classify(find_features("Alia Bhatt’s Sadak 2 the most disliked trailer on YouTube amid nepotism debate, fans demand justice for Sushant Singh Rajput")))

The above news item is a recent one, picked up from the Internet. Our model classifies it as true news. The Sadak 2 trailer was indeed the most disliked one, and it is equally true that fans demanded justice for Rajput.

Consider the classification of two more recent news items on the Air India plane crash in Kerala, India.

print(ensemble.classify(find_features("India Air crash survivor recounts final minutes in plane")))

In my run, the model classified this as fake, which is acceptable, as claims about what happened in the final minutes of a plane crash are always questionable.

Now try one more headline on the same incident. The news of a plane crash in Kerala last week became trending, and here is another headline on that incident.

print(ensemble.classify(find_features("Kerala plane crash: 92 injured passengers discharged from hospitals after 'obtaining complete fitness'")))

The model classifies this as true news. Note that our model is probably trained on news in a specific domain. Try feeding news from some other field to see how the model performs on those.

Classifying News Feed

The real practical use of this model is to classify the streaming news, whereby we will extract the real news for our further use. I will not go into the details of how to capture news from a live stream. You will need to study the API for the desired live stream.

Store all the captured news in a list like the one shown below:

# Make a list of news
newslist = []
newslist.append("NASA tweets beautiful pictures of Mars. They may make you gasp in wonder")
newslist.append("How visually impaired woman beat the odds to crack UPSC exam. She’s inspiring many including Mohammad Kaif")
newslist.append("Russia registers the world's first Covid-19 vaccine, Putin says his daughter was given a shot")
newslist.append("2020 is the year to stay alive, and don't think of profit or loss")
newslist.append("Democratic presidential candidate Joe Biden and running mate Kamala Harris have attacked 'whining' President Donald Trump as an incompetent leader who has left the US 'in tatters'.")

Then make a dataframe news_classification and insert this list as a column in it:

news_classification = pd.DataFrame(newslist, columns=["News"])

Now make the predictions using the same process that was used for a single news item, in a for loop over the list of news. Store the results in a list named is_true:

is_true = []
for i in newslist:
   is_true.append(ensemble.classify(find_features(i)))

Insert this list as a second column of the existing news_classification dataframe. Print the data frame to see the result. Remember, 0 signifies fake news, and 1 represents true news.

news_classification["is_true"] = is_true
news_classification

The output is as follows:

image05

If you read all the above news items, you would probably agree that four of the five are true and one is fake. The model has given us one wrong result: the second item, about the success of a visually impaired woman in the UPSC exam, was incorrectly classified as fake. All the other items are correctly classified, so we may reasonably trust the model in classifying real-time news as well. This approach can be used for any text classification problem. Feel free to try it out on a dataset of your interest.

Summary

In this tutorial, you learned how to develop a text classification model using the NLTK library to classify news as fake or real. You used a dataset from Kaggle for model training and learned how to preprocess it. You used several functions from the nltk library to remove punctuation, stopwords, and affixes from the text, and learned how to extract the unique words (features) from it. You then defined a function that checks for those features in every headline of the shuffled dataset, creating a dictionary of Booleans that you named featuresets, and split it into training and testing datasets. Next, you imported various classifiers from the Scikit-Learn library and trained and tested them on the dataset to compare their classification accuracy. Finally, you applied the VotingClassifier technique to balance out the weaknesses of the individual classifiers, and the classification report and confusion matrix helped us judge performance. In the end, you tested the model on real-time news to check its usability in a real-life scenario. The techniques you have learned in this tutorial are applied in many NLP projects. Use them in your next exciting NLP project.

Source: Download the project source from our Repository.
