Natural Language Processing with Python Updated Edition, Chapter 13

13.3 Building and Training Sentiment Analysis Models

Building and training sentiment analysis models is a crucial step in developing a sentiment analysis dashboard. These models analyze text and classify its sentiment as positive, negative, or neutral. In this section, we discuss how to build and train sentiment analysis models using machine learning and deep learning techniques, with example code to guide you through the process.

13.3.1 Choosing the Right Model

Choosing the right model for sentiment analysis depends on several factors, including the size of the dataset, the complexity of the text data, and the desired accuracy. We will explore two approaches: traditional machine learning models and deep learning models.

1. Traditional Machine Learning Models

Traditional machine learning models, such as Logistic Regression, Support Vector Machines (SVM), and Naive Bayes, are effective for text classification tasks and are relatively easy to implement and interpret.
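To make the comparison concrete, here is a minimal sketch that fits the three model families named above on the same TF-IDF features. The toy corpus, the choice of `LinearSVC` as the SVM variant, and the variable names are assumptions made for illustration; they are not the chapter's dataset or pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus, for illustration only (not the chapter's dataset).
texts = [
    "I love this product, it works great",
    "Absolutely fantastic experience, highly recommend",
    "Great value and excellent quality",
    "Terrible service, I want a refund",
    "Worst purchase ever, completely broken",
    "Awful quality and very disappointing",
]
labels = ["positive", "positive", "positive", "negative", "negative", "negative"]

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
    "Naive Bayes": MultinomialNB(),
}

new_reviews = ["excellent product, great quality", "terrible and broken"]

predictions = {}
for name, classifier in candidates.items():
    # Each pipeline turns raw text into TF-IDF features, then classifies them.
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    pipeline.fit(texts, labels)
    predictions[name] = list(pipeline.predict(new_reviews))
    print(f"{name}: {predictions[name]}")
```

Because all three classifiers share the same `fit`/`predict` interface, swapping one for another is a one-line change, which is part of what makes these models easy to experiment with.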

2. Deep Learning Models

Deep learning models, such as Recurrent Neural Networks (RNNs), Long Short-Term Memory Networks (LSTMs), and Bidirectional Encoder Representations from Transformers (BERT), can capture complex patterns in text data and often achieve higher accuracy. However, they require more computational resources and training time.

13.3.2 Implementing Machine Learning Models

Let's start by implementing a machine learning model for sentiment analysis. We will use Logistic Regression as an example.

logistic_regression.py:

import pickle

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Load balanced training data
with open('data/processed_data/X_train_balanced.pickle', 'rb') as file:
    X_train = pickle.load(file)
with open('data/processed_data/y_train_balanced.pickle', 'rb') as file:
    y_train = pickle.load(file)

# Load test data
with open('data/processed_data/X_test.pickle', 'rb') as file:
    X_test = pickle.load(file)
test_data = pd.read_csv('data/processed_data/test_data_preprocessed.csv')
y_test = test_data['sentiment']

# Train a Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Save the trained model
with open('models/logistic_regression_model.pickle', 'wb') as file:
    pickle.dump(model, file)

# Evaluate the model on the test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(classification_report(y_test, y_pred))

In this script, we train a Logistic Regression model on the balanced training data and evaluate its performance on the test set. The trained model is saved for future use.
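For intuition about what the trained model does at prediction time, the binary case can be sketched in a few lines: a weighted sum of the features is passed through a sigmoid, and the result is thresholded at 0.5. The weights, bias, and feature values below are made-up numbers for illustration, not values learned from the chapter's data.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for a single feature vector."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)


# Hypothetical parameters for a three-feature model (illustration only).
weights = [1.5, -2.0, 0.3]
bias = -0.1

# Hypothetical feature vector for one review.
review_features = [1.0, 0.0, 2.0]

probability = predict_proba(review_features, weights, bias)
label = "positive" if probability >= 0.5 else "negative"
print(f"P(positive) = {probability:.3f} -> {label}")
```

Positive weights push the prediction toward the positive class and negative weights push it away, which is why the coefficients of a fitted Logistic Regression model are comparatively easy to interpret.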

13.3.3 Implementing Deep Learning Models

Next, let's implement a deep learning model for sentiment analysis. We will use an LSTM model as an example.

lstm_model.py:

import pickle

import pandas as pd
from sklearn.metrics import accuracy_score, classification_report
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

# Load preprocessed data
train_data = pd.read_csv('data/processed_data/train_data_preprocessed.csv')
test_data = pd.read_csv('data/processed_data/test_data_preprocessed.csv')

# Extract features and labels (1 for 'positive', 0 for everything else)
X_train = train_data['review']
y_train = train_data['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)
X_test = test_data['review']
y_test = test_data['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)

# Tokenize and pad sequences
tokenizer = Tokenizer(num_words=5000, oov_token='<OOV>')
tokenizer.fit_on_texts(X_train)
word_index = tokenizer.word_index

X_train_sequences = tokenizer.texts_to_sequences(X_train)
X_test_sequences = tokenizer.texts_to_sequences(X_test)

max_length = 200
X_train_padded = pad_sequences(X_train_sequences, maxlen=max_length, padding='post', truncating='post')
X_test_padded = pad_sequences(X_test_sequences, maxlen=max_length, padding='post', truncating='post')

# Build the LSTM model
embedding_dim = 100
model = Sequential([
    Embedding(input_dim=5000, output_dim=embedding_dim, input_length=max_length),
    LSTM(128, return_sequences=True),
    Dropout(0.2),
    LSTM(64),
    Dropout(0.2),
    Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train_padded, y_train, epochs=10, batch_size=32, validation_split=0.2)

# Save the trained model and tokenizer
model.save('models/lstm_model.h5')
with open('models/tokenizer.pickle', 'wb') as file:
    pickle.dump(tokenizer, file)

# Evaluate the model on the test set
y_pred_prob = model.predict(X_test_padded)
# Flatten the (n, 1) probability column to a 1-D array of 0/1 labels
y_pred = (y_pred_prob > 0.5).astype(int).ravel()
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(classification_report(y_test, y_pred))

In this script, we build and train an LSTM model for sentiment analysis. The labels are encoded as binary targets (1 for positive, 0 for everything else), so this model distinguishes positive reviews from all others. The text data is tokenized and padded to a fixed length, and the model is trained on the padded sequences. The trained model and tokenizer are saved for future use. The model is then evaluated on the test set to measure its performance.
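The tokenize-and-pad step is easier to follow on a tiny example. The sketch below is a simplified, pure-Python stand-in for the idea behind `Tokenizer` and `pad_sequences` with `padding='post'` and `truncating='post'`; the helper names are hypothetical, and unlike the Keras tokenizer it only lowercases and splits on whitespace.

```python
from collections import Counter


def build_word_index(texts, oov_token="<OOV>"):
    """Assign integer ids by descending frequency; id 1 is reserved for OOV."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index


def texts_to_sequences(texts, word_index, num_words, oov_token="<OOV>"):
    """Replace words with ids; unseen or out-of-budget words become OOV."""
    oov_id = word_index[oov_token]
    sequences = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            word_id = word_index.get(word, oov_id)
            ids.append(word_id if word_id < num_words else oov_id)
        sequences.append(ids)
    return sequences


def pad_post(sequence, max_length, pad_value=0):
    """Truncate at the end, then right-pad with pad_value up to max_length."""
    trimmed = sequence[:max_length]
    return trimmed + [pad_value] * (max_length - len(trimmed))


# Hypothetical training texts (illustration only).
train_texts = ["the movie was great", "the plot was dull"]
word_index = build_word_index(train_texts)

# "awful" never appeared in training, so it maps to the OOV id.
sequences = texts_to_sequences(["the movie was awful"], word_index, num_words=5000)
padded = [pad_post(seq, max_length=6) for seq in sequences]

print(word_index)
print(padded)
```

The vocabulary is built from the training texts only, which is why the script above fits the tokenizer on `X_train` and reuses it for `X_test`, and why the fitted tokenizer has to be saved alongside the model.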

13.3.4 Hyperparameter Tuning

Hyperparameter tuning is essential for optimizing the performance of machine learning and deep learning models. It involves selecting the best combination of hyperparameters for the model.

Example: Hyperparameter Tuning using GridSearchCV

We can use GridSearchCV from scikit-learn to perform hyperparameter tuning for the Logistic Regression model.

hyperparameter_tuning.py:

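import pickle

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Load the balanced training data (same files as in logistic_regression.py)
with open('data/processed_data/X_train_balanced.pickle', 'rb') as file:
    X_train = pickle.load(file)
with open('data/processed_data/y_train_balanced.pickle', 'rb') as file:
    y_train = pickle.load(file)

# Define hyperparameters to tune
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['lbfgs', 'liblinear']
}

# Initialize GridSearchCV
grid_search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring='accuracy', verbose=1)

# Perform hyperparameter tuning
grid_search.fit(X_train, y_train)

# Print the best parameters and the corresponding score
print(f'Best Parameters: {grid_search.best_params_}')
print(f'Best Score: {grid_search.best_score_}')

# Save the best model
best_model = grid_search.best_estimator_
with open('models/best_logistic_regression_model.pickle', 'wb') as file:
    pickle.dump(best_model, file)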

In this script, we define a grid of hyperparameters for the Logistic Regression model and use GridSearchCV to find the best combination. The best model is saved for future use.
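It helps to know how much work a grid search implies before running it. The sketch below enumerates the same grid by hand with `itertools.product` and counts the cross-validation fits; it illustrates the bookkeeping only and is not how GridSearchCV is implemented internally.

```python
from itertools import product

# Same grid and fold count as in hyperparameter_tuning.py.
param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "solver": ["lbfgs", "liblinear"],
}
cv_folds = 5

# Every candidate is one element of the Cartesian product of the value lists.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is trained once per fold.
total_fits = len(combinations) * cv_folds
print(f"{len(combinations)} candidates x {cv_folds} folds = {total_fits} fits")
for candidate in combinations[:3]:
    print(candidate)
```

With five values of `C` and two solvers, the search trains 10 candidates on 5 folds each, and by default GridSearchCV then refits the best candidate once on the full training set. The cost grows multiplicatively with every hyperparameter added to the grid, so it is worth keeping each value list short.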

13.3.5 Evaluating Model Performance

Evaluating the performance of sentiment analysis models is essential to understand their strengths and weaknesses. We will use several metrics: accuracy, precision, recall, F1-score, and the confusion matrix.

Example: Model Evaluation

evaluate_model.py:

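import pickle

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import (ConfusionMatrixDisplay, accuracy_score,
                             classification_report, confusion_matrix)

# Load the test data (same files as in logistic_regression.py)
with open('data/processed_data/X_test.pickle', 'rb') as file:
    X_test = pickle.load(file)
test_data = pd.read_csv('data/processed_data/test_data_preprocessed.csv')
y_test = test_data['sentiment']

# Load the best Logistic Regression model
with open('models/best_logistic_regression_model.pickle', 'rb') as file:
    best_model = pickle.load(file)

# Predict on the test set
y_pred = best_model.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(classification_report(y_test, y_pred))

# Plot confusion matrix. The label order is taken from the fitted model
# rather than hard-coded as [0, 1], because y_test comes straight from the
# 'sentiment' column and may hold string labels.
cm = confusion_matrix(y_test, y_pred, labels=best_model.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=best_model.classes_)
disp.plot(cmap=plt.cm.Blues)
plt.show()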

In this script, we evaluate the best Logistic Regression model on the test set and print various metrics. We also plot the confusion matrix to visualize the performance.
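The numbers printed by the evaluation script all derive from the four cells of a binary confusion matrix. As a worked example, the sketch below computes them by hand for a small set of made-up labels; the `binary_metrics` helper and the sample data are hypothetical and exist only to show the arithmetic.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1,
    }


# Made-up labels for illustration: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Precision asks how many predicted positives were correct, while recall asks how many actual positives were found. On an imbalanced test set these two can diverge sharply even when accuracy looks high, which is why the classification report is worth reading alongside the accuracy figure.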

In this section, we covered the essential steps of building and training sentiment analysis models. We discussed how to choose the right model, implemented both traditional machine learning models (Logistic Regression) and deep learning models (LSTM), and performed hyperparameter tuning using GridSearchCV.

Additionally, we evaluated the model performance using various metrics and visualizations. By following these steps, we have developed robust sentiment analysis models that can classify the sentiment of text data.