13.2 Project 2: Sentiment Analysis with Naive Bayes

In this project, we will develop a sentiment analysis model using the Naive Bayes algorithm. Sentiment analysis is a common application of Natural Language Processing (NLP) and Machine Learning, and it involves determining the sentiment expressed in a piece of text, such as a review or tweet.

13.2.1 Problem Statement

The goal of this project is to build a model that can accurately classify text as positive or negative based on the sentiment expressed in it. This can be useful in a variety of contexts, such as understanding customer feedback or analyzing social media posts.

13.2.2 Dataset

We will use the IMDB movie reviews dataset for this project. This dataset consists of 50,000 movie reviews from the Internet Movie Database (IMDB), each labeled as either positive (1) or negative (0). The dataset is divided evenly with 25,000 reviews intended for training and 25,000 for testing.

13.2.3 Implementation

Let's start by loading the dataset and examining its structure.

from sklearn.datasets import load_files
import numpy as np
# Make sure the path points to the location where your training data is stored.
# If the data is in the same directory as the script, you can use "aclImdb/train/".
# categories limits loading to the labeled folders (the archive also ships an
# unlabeled "unsup" folder), and encoding decodes the raw bytes into str so the
# regular expressions used later can operate on the text.
reviews_train = load_files("aclImdb/train/", categories=["neg", "pos"], encoding="utf-8")
# Extract text data and labels from the loaded dataset
text_train, y_train = reviews_train.data, reviews_train.target
# Print the number of documents in the training data
print("Number of documents in training data: {}".format(len(text_train)))
# Print the distribution of samples per class
print("Samples per class (training): {}".format(np.bincount(y_train)))

Code breakdown:

The code first imports the load_files function from sklearn.datasets and the numpy library. It then calls load_files() to load the training data from the aclImdb/train/ directory, where each subfolder name becomes a class label. The result is unpacked into text_train, which holds the review texts, and y_train, which holds the sentiment labels (0 for negative, 1 for positive). Finally, the code prints the number of documents in the training data and the number of samples per class.

Next, we will preprocess the data by removing HTML tags, converting all text to lowercase, and replacing punctuation with spaces while keeping emoticons.

import re
def preprocess_text(text):
    # Remove HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # Find emoticons before punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, replace non-word characters with spaces, then append the emoticons
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text
# Preprocess each text in text_train
text_train = [preprocess_text(text) for text in text_train]

Code breakdown:

The code first imports the re module, which provides regular expression operations, and defines preprocess_text(), which takes a string and returns a cleaned string. Inside the function, re.sub() first removes HTML tags. re.findall() then collects emoticons such as :), ;-( and =D so that they are not lost in the next step. The text is converted to lowercase and every run of non-word characters is replaced with a single space. The collected emoticons are appended to the end of the string with their hyphens removed, so :-) becomes :). Finally, a list comprehension applies preprocess_text() to every review in text_train. Note that CountVectorizer's default tokenizer keeps only tokens of two or more word characters, so the appended emoticons are discarded during vectorization unless a custom token_pattern is supplied.
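To see what the preprocessing does to a single review, the following minimal sketch repeats the function from above and applies it to a made-up review string. The sample text is an assumption for illustration and is not taken from the IMDB data.

```python
import re

def preprocess_text(text):
    # Same logic as the function defined earlier in this section
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text

# Hypothetical review text, invented for this example
sample = "This movie was <br />GREAT :-) I loved it!"
cleaned = preprocess_text(sample)
print(cleaned)  # this movie was great i loved it :)
```

The HTML tag and punctuation are gone, the text is lowercase, and the emoticon survives at the end of the string without its hyphen.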

We will then split the data into training and testing sets, holding out 20% of the reviews to evaluate the model.

from sklearn.model_selection import train_test_split
# Split the preprocessed text data and corresponding labels into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(text_train, y_train, test_size=0.2, random_state=42)

Code breakdown:

The code first imports the train_test_split function from sklearn.model_selection and uses it to split the data into training and testing subsets. The test_size parameter specifies that 20% of the data should be used for testing, and the random_state parameter fixes the seed of the shuffle so that the split is reproducible. The resulting subsets are assigned to the X_train, X_test, y_train, and y_test variables.

Next, we will convert the text data into numerical feature vectors using the Bag of Words technique.

from sklearn.feature_extraction.text import CountVectorizer
# Initialize CountVectorizer with stop_words='english' to remove common English words
vectorizer = CountVectorizer(stop_words='english')
# Fit and transform the training data
X_train = vectorizer.fit_transform(X_train)
# Transform the testing data
X_test = vectorizer.transform(X_test)

Code breakdown:

The code imports CountVectorizer from sklearn.feature_extraction.text and creates an instance with stop_words='english', which drops common English words such as "the" and "and". fit_transform() learns the vocabulary from the training reviews and returns a sparse matrix of word counts, with one row per review and one column per vocabulary word. transform() then encodes the test reviews with that same vocabulary, so words that appear only in the test set are ignored.
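To make the Bag of Words idea concrete, here is a minimal sketch that builds count vectors by hand with the standard library. The two documents are invented for illustration, and the whitespace tokenization is a simplification: CountVectorizer additionally lowercases the text, tokenizes with a regular expression, can remove stop words, and returns a sparse matrix.

```python
from collections import Counter

# Hypothetical, already preprocessed documents
docs = ["great movie great cast", "boring movie"]

# The vocabulary is the sorted set of all words seen in the corpus
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Each document becomes a vector of word counts, one entry per vocabulary word
vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['boring', 'cast', 'great', 'movie']
for vector in vectors:
    print(vector)  # [0, 1, 2, 1] then [1, 0, 0, 1]
```

Word order is discarded: only how often each vocabulary word occurs in a document is kept.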

Finally, we will train a Naive Bayes classifier on the training data and evaluate its performance on the testing data.

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
# Initialize Multinomial Naive Bayes classifier
clf = MultinomialNB()
# Train the classifier on the training data
clf.fit(X_train, y_train)
# Predict labels for the testing data
y_pred = clf.predict(X_test)
# Compute accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}".format(accuracy))

Code breakdown:

The code first imports the MultinomialNB class from sklearn.naive_bayes and the accuracy_score function from sklearn.metrics. It creates a MultinomialNB classifier called clf, fits it to the training data with clf.fit(), and predicts the sentiment of the test reviews with clf.predict(). Finally, accuracy_score() compares the predictions with the true labels, and the code prints the resulting accuracy, which is the fraction of test reviews classified correctly.
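For intuition about what a multinomial Naive Bayes classifier computes, the sketch below implements the idea from scratch: a log prior per class plus a sum of smoothed log word likelihoods. It is a simplified illustration under stated assumptions, not scikit-learn's implementation. The function names train_naive_bayes and predict and the toy corpus are hypothetical, and alpha=1.0 is add-one (Laplace) smoothing, which is also the default of MultinomialNB.

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = {word for doc in docs for word in doc.split()}
    class_doc_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_doc_counts}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    model = {}
    for c in class_doc_counts:
        total_words = sum(word_counts[c].values())
        log_prior = math.log(class_doc_counts[c] / len(docs))
        # Smoothing gives unseen words a small non-zero probability
        log_likelihood = {
            word: math.log((word_counts[c][word] + alpha)
                           / (total_words + alpha * len(vocab)))
            for word in vocab
        }
        model[c] = (log_prior, log_likelihood)
    return model

def predict(model, doc):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c, (log_prior, log_likelihood) in model.items():
        # Words outside the training vocabulary are skipped
        scores[c] = log_prior + sum(
            log_likelihood[word] for word in doc.split() if word in log_likelihood
        )
    return max(scores, key=scores.get)

# Hypothetical toy corpus (1 = positive, 0 = negative)
docs = ["great fun film", "loved this great film",
        "boring slow film", "awful boring plot"]
labels = [1, 1, 0, 0]

model = train_naive_bayes(docs, labels)
print(predict(model, "great film"))   # 1
print(predict(model, "boring plot"))  # 0
```

Working with log probabilities avoids numerical underflow when many small probabilities are multiplied, which matters for long reviews.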

This project provides a practical application of machine learning in the field of NLP. It demonstrates how to use the Naive Bayes algorithm to perform sentiment analysis on movie reviews. The code provided can be used as a starting point for further exploration and experimentation.
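As one possible next step, the vectorizer and classifier can be bundled into a single scikit-learn pipeline so that raw review strings can be passed straight to fit() and predict(). The sketch below uses a tiny invented corpus in place of the IMDB data, so the texts, labels, and variable names are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus (1 = positive, 0 = negative)
toy_texts = [
    "a wonderful, moving film",
    "great acting and a great story",
    "terrible plot and awful acting",
    "boring, awful, a waste of time",
]
toy_labels = [1, 1, 0, 0]

# The pipeline vectorizes the text and then applies Naive Bayes
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(toy_texts, toy_labels)

# Classify new, unseen review strings
new_reviews = ["a great and wonderful story", "awful and boring"]
predictions = model.predict(new_reviews)
print(predictions)  # [1 0]
```

Because the vocabulary is learned inside fit(), the pipeline guarantees that new text is encoded with the same vocabulary the classifier was trained on.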