Sentiment Analysis on Covid-19 Tweets
Sentiment analysis (or opinion mining) is a natural language processing technique used to determine whether data is positive, negative or neutral. Sentiment analysis is often performed on textual data to help businesses monitor brand and product sentiment in customer feedback, and understand customer needs.
Sentiment analysis models focus on polarity (positive, negative, neutral) but can also detect feelings and emotions (angry, happy, sad, etc.), urgency (urgent, not urgent) and even intentions (interested vs. not interested).
Sentiment analysis is extremely important because it helps businesses quickly understand the overall opinions of their customers. By automatically sorting the sentiment behind reviews, social media conversations, and more, you can make faster and more accurate decisions. ~ definitions are adapted from https://monkeylearn.com/sentiment-analysis/
With the current Covid-19 pandemic, we see a lot of studies and research going on in the world today around Covid-19. In this article, using a dataset of tweets about Covid-19 obtained from Kaggle, a text classifier was built to classify the sentiment of the tweets. The Kaggle dataset was curated from Twitter by scraping the Twitter API for all mentions of Covid-19.
To build a text classifier, we need to vectorize the text that will be used for training. There are different ways text can be vectorized. Some of the common techniques are Bag of Words, Term Frequency-Inverse Document Frequency (TF-IDF), and word embeddings such as Word2Vec and GloVe.
This article uses Bag of Words and TF-IDF. Let us take a moment to explain these two terms:
Bag of Words: In Bag of Words, a text is represented as the bag of its words. We disregard word order and grammar but keep multiplicity.
Given the following text: Barack Obama was the 44th president of the United States.
A JSON representation of its Bag of Words would be {Barack: 1, Obama: 1, was: 1, the: 2, 44th: 1, president: 1, of: 1, United: 1, States: 1}. The frequency of occurrence of each word in the text/corpus is shown.
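To make this concrete, here is a minimal sketch using scikit-learn's CountVectorizer (an illustrative choice; the notebook's exact vectorization code may differ):

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["Barack Obama was the 44th president of the United States."]

# Keep the original casing so the output matches the example above.
vectorizer = CountVectorizer(lowercase=False)
bow = vectorizer.fit_transform(text)

# Map each word to its count in the text (requires scikit-learn >= 1.0).
counts = {word: int(count)
          for word, count in zip(vectorizer.get_feature_names_out(),
                                 bow.toarray()[0])}
print(counts)
# {'44th': 1, 'Barack': 1, 'Obama': 1, 'States': 1, 'United': 1,
#  'of': 1, 'president': 1, 'the': 2, 'was': 1}
```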
Term Frequency-Inverse Document Frequency (TF-IDF): TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.
How is TF-IDF calculated?
TF-IDF for a word in a document is calculated by multiplying two different metrics:
- The term frequency of a word in a document. There are several ways of calculating this frequency, with the simplest being a raw count of instances a word appears in a document. Then, there are ways to adjust the frequency, by length of a document, or by the raw frequency of the most frequent word in a document.
- The inverse document frequency of the word across a set of documents. This measures how common or rare a word is in the entire document set. The closer it is to 0, the more common the word is. This metric can be calculated by taking the total number of documents, dividing it by the number of documents that contain the word, and taking the logarithm.
- So, if a word is very common and appears in many documents, this number will approach 0; if the word is rare, it grows larger, as sketched in the example after this list. ~ Adapted from https://monkeylearn.com/sentiment-analysis/
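For example, here is a small hand-rolled sketch of these formulas (the mini document set is made up for illustration; in practice, scikit-learn's TfidfVectorizer applies a smoothed variant of the same idea):

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

def tf(word, doc):
    # Term frequency: raw count of the word in one document.
    return doc.split().count(word)

def idf(word, docs):
    # Inverse document frequency: log(total docs / docs containing the word).
    # Approaches 0 for words that appear in every document.
    n_containing = sum(1 for d in docs if word in d.split())
    return math.log(len(docs) / n_containing)

for word in ["the", "cat"]:
    scores = [round(tf(word, d) * idf(word, docs), 3) for d in docs]
    print(word, scores)
# "the" appears in 2 of 3 docs, so its idf (and its scores) stay low;
# "cat" appears in only 1 doc, so it scores higher where it occurs.
```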
After vectorizing the text into features, the labels (sentiments) were also converted into numbers using label encoding. Word vectorization and label encoding are pre-processing steps required on the text data before model training and evaluation.
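A minimal sketch of the label-encoding step, using scikit-learn's LabelEncoder (an assumed choice; the example labels are the dataset's five sentiment classes):

```python
from sklearn.preprocessing import LabelEncoder

# A few example labels drawn from the dataset's five sentiment classes.
sentiments = ["Extremely Negative", "Negative", "Neutral",
              "Positive", "Extremely Positive", "Neutral"]

encoder = LabelEncoder()
y = encoder.fit_transform(sentiments)

print(list(encoder.classes_))
# ['Extremely Negative', 'Extremely Positive', 'Negative', 'Neutral', 'Positive']
print(y)  # [0 2 3 4 1 3] -- one integer label per tweet
```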
The Jupyter notebook showing the Python code and results is posted below:
Three different models were built. According to George E. P. Box, “All models are wrong but some are useful”. It is imperative to build multiple models in any ML project and pick the best based on your defined criteria. In this case, we are picking the best based on accuracy on the test data.
The three models built are Logistic Regression, Naive Bayes, and a basic neural network using Keras with a TensorFlow backend. The Keras model was trained for 5 epochs (model performance got worse with more). The network has one Dense layer with 10 neurons, using Rectified Linear Unit (ReLU) activation. Since there are more than 2 output labels, softmax activation was used on the output layer. Because the labels are integer-encoded, sparse categorical cross-entropy was used as the loss function, with the Adam optimizer. Details can be seen in the code.
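Below is a minimal sketch of this setup. It assumes `X_train`/`X_test` are the TF-IDF (or Bag of Words) feature matrices and `y_train`/`y_test` are the encoded labels from the steps above; the exact layer sizes and training arguments in the notebook may differ.

```python
from tensorflow import keras
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Classical baselines, fit directly on the vectorized features.
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
nb = MultinomialNB().fit(X_train, y_train)
print("Logistic Regression:", accuracy_score(y_test, logreg.predict(X_test)))
print("Naive Bayes:", accuracy_score(y_test, nb.predict(X_test)))

# Basic neural network: one Dense layer of 10 ReLU units,
# then a softmax over the 5 sentiment classes.
model = keras.Sequential([
    keras.layers.Input(shape=(X_train.shape[1],)),
    keras.layers.Dense(10, activation="relu"),
    keras.layers.Dense(5, activation="softmax"),
])
# Integer-encoded labels -> sparse categorical cross-entropy.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Densify the sparse feature matrices for Keras (memory permitting).
model.fit(X_train.toarray(), y_train, epochs=5,
          validation_data=(X_test.toarray(), y_test))
```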
Results and Interpretation
- 44,955 tweets were used for this model
- About 60,000 were classified as Extremely Negative
- A little less than 80,000 were Extremely Positive
- A little less than 120,000 were Negative
- A little above 80,000 were Neutral
- About 120,000 were Positive
- The most common tweet locations in the dataset were the United States and England.
- Accuracy using Logistic Regression/Bag of Words: 62%
- Accuracy using Logistic Regression/TF-IDF: 58%
- Accuracy using Neural Nets: 67.5%
Neural Nets gave the best accuracy on the test data.
The goal of this article was to discuss sentiment analysis and to build a text classifier.
I hope you enjoyed reading the article!
ABOUT ME
I am a Data Science/MLOps practitioner. I am passionate about teaching Data Science and MLOps to aspiring individuals. I currently run two active AI/ML meetups with a very good friend who is himself a Data/MLOps Platform Engineer. We also run Data Science/MLOps trainings. I also have two advanced degrees, in Electrical Engineering and Data Analytics.