Naive Bayes Practical Example with PyTorch

In this example, we will implement a Naive Bayes classifier using PyTorch for a spam detection task on the SMS Spam Collection dataset. PyTorch does not have a built-in Naive Bayes implementation, so we will manually construct the classifier by calculating the necessary probabilities.
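
For reference, this is the decision rule we will implement (a standard multinomial Naive Bayes, written here in LaTeX). For a message with word counts x_1, ..., x_n over a vocabulary of n words, the classifier picks the class C with the largest log-posterior:

\[
\hat{C} = \arg\max_{C} \Big( \log P(C) + \sum_{i=1}^{n} x_i \log P(w_i \mid C) \Big)
\]

where P(C) is the class prior and P(w_i | C) is the per-class word probability. The PyTorch code below computes exactly these two terms.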


1. Install Dependencies

First, ensure that PyTorch, pandas, and scikit-learn (used here for vectorization, data splitting, and evaluation) are installed. You can install them via pip:

pip install torch pandas scikit-learn

2. Load and Preprocess the Data

We will use the SMS Spam Collection dataset again. Our goal is to preprocess the data into a format that can be used with PyTorch.

Step 1: Import Libraries

import pandas as pd
import torch
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

Step 2: Load the Dataset

# Load the dataset (the raw file is tab-separated with no header row)
df = pd.read_csv('sms_spam_collection.csv', delimiter='\t', header=None)
df.columns = ['label', 'message']

# Display the first few rows
print(df.head())

3. Preprocess the Text Data

We need to convert the text messages into a bag-of-words representation using CountVectorizer; the resulting feature matrix will then be converted into PyTorch tensors.

Step 1: Convert Text to Features

# Initialize the CountVectorizer to transform text into a bag-of-words model
vectorizer = CountVectorizer(stop_words='english')

# Convert the messages into numeric form
X = vectorizer.fit_transform(df['message']).toarray()

# Labels (spam/ham)
y = df['label'].map({'ham': 0, 'spam': 1}).values # Map ham to 0 and spam to 1
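
To make the bag-of-words step concrete, here is a minimal, self-contained sketch on a hypothetical two-message corpus (the messages and variable names are illustrative only):

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus for illustration
toy_messages = ["win a free prize", "are we meeting for lunch"]

toy_vectorizer = CountVectorizer(stop_words='english')
toy_X = toy_vectorizer.fit_transform(toy_messages).toarray()

# Each row is a message; each column counts one vocabulary word
print(toy_vectorizer.get_feature_names_out())  # ['free' 'lunch' 'meeting' 'prize' 'win']
print(toy_X)
# [[1 0 0 1 1]
#  [0 1 1 0 0]]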

Step 2: Split the Data

We will split the dataset into training and test sets and convert them to PyTorch tensors.

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.long)  # class indices, so torch.long
y_test_tensor = torch.tensor(y_test, dtype=torch.long)

4. Define the Naive Bayes Model in PyTorch

In PyTorch, we will manually implement the Naive Bayes classifier by calculating the prior probabilities and likelihoods for each feature given the class.

Step 1: Build the Naive Bayes Classifier

class NaiveBayesClassifier:
    def __init__(self):
        self.class_probs = None
        self.feature_probs_given_class = None

    def fit(self, X, y):
        # Prior probability P(C) of each class, from class frequencies
        class_counts = torch.bincount(y)
        self.class_probs = class_counts.float() / class_counts.sum()

        # Total count of each word within each class
        feature_counts_given_class = torch.zeros((len(class_counts), X.shape[1]))
        for c in range(len(class_counts)):
            feature_counts_given_class[c] = torch.sum(X[y == c], dim=0)

        # Likelihood P(w|C) with Laplace (add-one) smoothing; adding the
        # vocabulary size X.shape[1] to the denominator keeps each row a
        # valid probability distribution
        self.feature_probs_given_class = (feature_counts_given_class + 1) / (
            torch.sum(feature_counts_given_class, dim=1, keepdim=True) + X.shape[1]
        )

    def predict(self, X):
        # Work in log space to avoid numerical underflow
        log_class_probs = torch.log(self.class_probs)
        log_feature_probs = torch.log(self.feature_probs_given_class)
        # log P(C) + sum_i x_i * log P(w_i|C) for every message/class pair
        log_probs = torch.matmul(X, log_feature_probs.T) + log_class_probs
        return torch.argmax(log_probs, dim=1)

Explanation:

  • P(C): The prior probability of each class (spam/ham), estimated from the class frequencies in the training data.
  • P(w|C): The probability of each vocabulary word given the class, estimated from per-class word counts. Laplace (add-one) smoothing prevents zero probabilities for words that never appear in a class; see the worked formula below.
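
For completeness, this is the smoothed estimate that fit computes, written in LaTeX, where |V| is the vocabulary size (X.shape[1] in the code):

\[
P(w_i \mid C) = \frac{\mathrm{count}(w_i, C) + 1}{\sum_{j=1}^{|V|} \mathrm{count}(w_j, C) + |V|}
\]

With every word given a pseudo-count of 1, no word has zero probability, and the probabilities still sum to 1 within each class.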

5. Train the Model

Now that we have defined the model, we will train it on the training data.

# Initialize the model
nb_classifier = NaiveBayesClassifier()

# Train the model
nb_classifier.fit(X_train_tensor, y_train_tensor)
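
As a quick sanity check, you can inspect the learned priors. Since ham messages heavily outnumber spam in this dataset, the prior for class 0 (ham) should be noticeably larger:

# Inspect the learned class priors P(ham), P(spam)
print(nb_classifier.class_probs)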

6. Make Predictions and Evaluate the Model

Step 1: Make Predictions

After training the model, we can make predictions on the test data.

# Make predictions on the test data
y_pred_tensor = nb_classifier.predict(X_test_tensor)
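
The trained model can also classify a brand-new message. The sample text below is purely illustrative; the important detail is to reuse the already-fitted vectorizer via transform (not fit_transform) so the message is mapped onto the same vocabulary:

# Classify a new, unseen message (sample text is illustrative)
new_message = ["Congratulations! You have won a free prize, call now"]
new_X = torch.tensor(vectorizer.transform(new_message).toarray(), dtype=torch.float32)
print(nb_classifier.predict(new_X))  # an output of tensor([1]) means spam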

Step 2: Convert Predictions and Evaluate

Convert the predictions back to a NumPy array for evaluation with scikit-learn.

# Convert predictions to a NumPy array
y_pred = y_pred_tensor.numpy()

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Detailed classification report
print(classification_report(y_test, y_pred, target_names=['ham', 'spam']))

7. Interpret the Results

Accuracy

The accuracy is the fraction of test messages classified correctly. Note that the SMS Spam Collection is imbalanced (ham messages far outnumber spam), so a high accuracy alone does not guarantee good spam detection; the per-class metrics below give a fuller picture.

Classification Report

The classification report provides detailed insights into the model's performance:

  • Precision: Of the messages predicted as spam, the fraction that actually are spam.
  • Recall: Of the actual spam messages, the fraction the model correctly identifies.
  • F1-Score: The harmonic mean of precision and recall, a single balanced measure of performance. The exact formulas are given below.
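
In terms of true positives (TP), false positives (FP), and false negatives (FN) for the spam class, the formulas are:

\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]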

8. Summary

In this article, we implemented a Naive Bayes classifier using PyTorch for text classification. We covered:

  1. Preprocessing the SMS Spam Collection dataset.
  2. Building a Naive Bayes classifier from scratch in PyTorch.
  3. Training the model on the preprocessed data.
  4. Evaluating the model's performance using metrics like accuracy and F1-score.

Although PyTorch does not have built-in support for Naive Bayes, we constructed the classifier manually by computing the necessary probabilities. This example illustrates the flexibility and power of PyTorch for custom machine learning models.

In the next section, we will explore more advanced algorithms and compare them with this classifier.