The NLP Journey: Sentiment Analysis with Logistic Regression.

HIMANSHU
4 min read · Feb 4, 2021

Objective of the article: Create an NLP model to determine the sentiment of a tweet (positive or negative) using Logistic Regression.

Package: NLTK (an open-source Python library for NLP)

Training Dataset: twitter_samples (5,000 positive tweets, 5,000 negative tweets)

Brief Introduction: Sentiment Analysis is a Natural Language Processing technique used to detect positive or negative sentiment in a text. Sentiment Analysis can be carried out using Logistic Regression, Naïve Bayes, Support Vector Machines, Deep Learning, etc. It uses a supervised machine learning method to train a classifier, which in turn predicts the sentiment of the text. A basic layout for the supervised machine learning method is shown below:

(Image source: https://www.coursera.org/learn/classification-vector-spaces-in-nlp)

A basic layout for sentiment analysis using LR is as below:

‘X’ represents the features, which need to be extracted from the text to train the classifier.

This article uses tweets as the text and Logistic Regression as the classifier.

Steps for Implementation:

  1. Data Preparation.
  2. Pre-processing of Tweets.
  3. Frequency distribution vector.
  4. Logistic Regression.
  5. Feature Extraction.
  6. Training the Model.
  7. Prediction and Testing.

Step 0: Import required Packages
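A minimal sketch of the imports used throughout the following steps (assuming NumPy alongside NLTK; the article’s exact imports may differ):

```python
import re
import string

import numpy as np
import nltk
from nltk.corpus import stopwords, twitter_samples
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

# One-time downloads of the tweet corpus and the stop-word list
nltk.download('twitter_samples')
nltk.download('stopwords')
```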

Step 1: Data Preparation
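The twitter_samples corpus ships the 5,000 positive and 5,000 negative tweets as JSON files. A sketch of loading them and splitting into train and test sets (the 4,000/1,000 split per class is an assumption, not necessarily the article’s exact split):

```python
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

# Split each class 80/20 into train and test
train_pos, test_pos = all_positive_tweets[:4000], all_positive_tweets[4000:]
train_neg, test_neg = all_negative_tweets[:4000], all_negative_tweets[4000:]

train_x = train_pos + train_neg
test_x = test_pos + test_neg

# Labels: 1 for positive, 0 for negative
train_y = np.append(np.ones((len(train_pos), 1)),
                    np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)),
                   np.zeros((len(test_neg), 1)), axis=0)
```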

Step 2: Preprocessing

Preprocessing of data plays an important role in every machine learning project. The data needs to be cleaned and formatted before training. For NLP, preprocessing involves the following steps:

A. Removing hyperlinks, Twitter marks, and styles: Tweets contain many substrings, like hashtags, retweet marks, and hyperlinks, that do not contribute to sentiment analysis. Removing these substrings is carried out with the re (regular expression) library in Python.
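For example (a sketch; the function name remove_twitter_artifacts is mine):

```python
def remove_twitter_artifacts(tweet):
    tweet = re.sub(r'^RT[\s]+', '', tweet)             # old-style retweet mark
    tweet = re.sub(r'https?://[^\s\n\r]+', '', tweet)  # hyperlinks
    tweet = re.sub(r'#', '', tweet)                    # '#' sign, keeping the word
    return tweet
```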

B. Tokenizing: Tokenizing means splitting a string into individual words, without spaces or tabs. Each word is also converted to lowercase; for sentiment analysis, Sectumsempra = sectumsempra = SECTUMSEMPRA. This is done using NLTK’s tokenize module.
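NLTK’s TweetTokenizer can do both at once, for instance:

```python
tokenizer = TweetTokenizer(preserve_case=False,  # lowercase everything
                           strip_handles=True,   # drop @mentions
                           reduce_len=True)      # shorten e.g. 'soooooo' to 'sooo'
tweet_tokens = tokenizer.tokenize("SECTUMSEMPRA!!! @someone check this out")
# tokens come back lowercased, with the @handle removed
```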

C. Removing stop words and punctuation: Stop words are words that don’t add significant meaning to the text. NLTK has a collection of stop words which, together with Python’s list of punctuation characters, can be used to remove both. However, a few words in the default list may still matter for a particular model, so a custom filter can be built instead.
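For example, filtering tokens against NLTK’s English stop-word list and Python’s string.punctuation:

```python
stopwords_english = stopwords.words('english')

clean_tokens = [word for word in tweet_tokens
                if word not in stopwords_english     # drop stop words
                and word not in string.punctuation]  # drop punctuation
```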

D. Stemming: Stemming is the process of reducing a word to its most general form, its stem. It is done to reduce the vocabulary size. NLTK provides modules for this; PorterStemmer is one of them. An example of stemming:
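For instance, with NLTK’s PorterStemmer:

```python
stemmer = PorterStemmer()
print(stemmer.stem('learning'), stemmer.stem('learned'), stemmer.stem('learns'))
# all three reduce to the stem 'learn'
```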

A helper function for pre-processing a tweet, combining the steps above:
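A possible version of such a helper (a sketch assembling the regex cleanup, tokenization, stop-word/punctuation removal, and stemming described above):

```python
def process_tweet(tweet):
    """Return the cleaned, tokenized, de-stopped, and stemmed words of a tweet."""
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')

    tweet = re.sub(r'^RT[\s]+', '', tweet)
    tweet = re.sub(r'https?://[^\s\n\r]+', '', tweet)
    tweet = re.sub(r'#', '', tweet)

    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    return [stemmer.stem(word) for word in tokenizer.tokenize(tweet)
            if word not in stopwords_english and word not in string.punctuation]
```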

Let’s test a processed tweet:
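For instance:

```python
print(process_tweet("RT @user: I am loving this #NLP journey! :) https://example.com"))
# expected output along the lines of: ['love', 'nlp', 'journey', ':)']
```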

Step 3: Frequency distribution vector.

Each tweet should be encoded as a vector. This is done by creating a dictionary that maps each (word, class) pair to the number of times the word appears in tweets of that class (positive or negative). From this frequency distribution, each tweet gets a positive frequency, i.e. the number of times its words appear in positive tweets, and a negative frequency, i.e. the number of times its words appear in negative tweets, plus a bias term. So the vector is of dimension 3: [bias, positive freq, negative freq]. Let’s create a helper function for the same:
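A possible build_freqs helper (the function name is an assumption):

```python
def build_freqs(tweets, labels):
    """Map each (word, sentiment) pair to its count across all tweets."""
    freqs = {}
    for tweet, y in zip(tweets, np.squeeze(labels).tolist()):
        for word in process_tweet(tweet):
            pair = (word, y)  # y is 1.0 (positive) or 0.0 (negative)
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs

freqs = build_freqs(train_x, train_y)
```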

Step 4: Logistic Regression

A quick overview of Logistic Regression:

As Logistic Regression is itself a huge topic and beyond the scope of this article, the following steps give a nutshell view.

A. Logistic Regression takes a linear regression and applies a sigmoid function to it.

B. The cost function for logistic regression is the average of the log loss across all training examples:

C. The weight vector, theta, is updated using gradient descent.
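Written out in standard notation (my summary of the three steps, with m training examples, learning rate alpha, and predictions h):

```latex
h_\theta(x) = \sigma(x\theta) = \frac{1}{1 + e^{-x\theta}}

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log h_\theta(x^{(i)})
            + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right) \right]

\theta := \theta - \frac{\alpha}{m} X^{\top}(h - y)
```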

Let’s create a function for the same:
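A sketch of these steps in NumPy:

```python
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_descent(x, y, theta, alpha, num_iters):
    """Batch gradient descent for logistic regression."""
    m = x.shape[0]
    for _ in range(num_iters):
        h = sigmoid(np.dot(x, theta))                     # predictions, shape (m, 1)
        J = -(np.dot(y.T, np.log(h))                      # average log loss
              + np.dot((1 - y).T, np.log(1 - h))) / m
        theta = theta - (alpha / m) * np.dot(x.T, h - y)  # gradient step
    return float(J), theta
```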

Step 5: Feature Extraction

Let’s create a function for feature extraction. The function will take a single tweet, perform preprocessing to get a list of words, and then look up each word in the frequency distribution dictionary. Finally, it will return a feature vector of dimension (1, 3), which will be used for training the model.
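A possible implementation, reusing process_tweet and the freqs dictionary from earlier:

```python
def extract_features(tweet, freqs):
    """Return a (1, 3) feature vector: [bias, positive freq, negative freq]."""
    x = np.zeros((1, 3))
    x[0, 0] = 1  # bias term
    for word in process_tweet(tweet):
        x[0, 1] += freqs.get((word, 1.0), 0)  # counts from positive tweets
        x[0, 2] += freqs.get((word, 0.0), 0)  # counts from negative tweets
    return x
```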

Step 6: Training the Model
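Training then amounts to stacking the feature vectors of all training tweets and running gradient descent. The hyperparameters below are illustrative values, not necessarily the article’s; because the features are raw counts, a very small learning rate is typical here:

```python
X = np.zeros((len(train_x), 3))
for i, tweet in enumerate(train_x):
    X[i, :] = extract_features(tweet, freqs)

J, theta = gradient_descent(X, train_y, np.zeros((3, 1)),
                            alpha=1e-9, num_iters=1500)
```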

Step 7: Prediction and Testing
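A sketch of prediction and of measuring accuracy on the held-out test set:

```python
def predict_tweet(tweet, freqs, theta):
    return sigmoid(np.dot(extract_features(tweet, freqs), theta))

# A prediction above 0.5 is read as positive sentiment
y_hat = np.array([predict_tweet(t, freqs, theta) > 0.5
                  for t in test_x]).reshape(-1, 1)
print('Test accuracy:', np.mean(y_hat == test_y))
```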

The model’s accuracy on the test set turns out to be almost perfect. Now, let me check the sentiment of a tweet of mine.

So, as Thomas Shelby replies, “Sad.”
