Naive Bayes Practical Example with TensorFlow

In this example, we will use TensorFlow Probability to implement a Naive Bayes classifier for text classification. Since TensorFlow has no built-in Naive Bayes estimator, we will lean on the TensorFlow Probability library, which supplies the probability distributions and numerical tools needed to build one ourselves.

We will use the SMS Spam Collection dataset for a spam detection task.


1. Install Dependencies

First, ensure that TensorFlow, TensorFlow Probability, and pandas are installed. You can install them via pip:

pip install tensorflow tensorflow-probability pandas
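To confirm the installation, import both libraries and print their versions. TensorFlow Probability releases are built against specific TensorFlow versions, so a mismatch will typically surface here as an import error:

import tensorflow as tf
import tensorflow_probability as tfp

# A matched pair of versions should import cleanly
print("TF:", tf.__version__, " TFP:", tfp.__version__)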

2. Load and Preprocess the Data

We will use the same SMS Spam Collection dataset from the previous example. The dataset contains SMS messages labeled as spam or ham (not spam).

Step 1: Import Libraries

import pandas as pd
import tensorflow as tf
import tensorflow_probability as tfp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

Step 2: Load the Dataset

# Load the dataset
df = pd.read_csv('sms_spam_collection.csv', delimiter='\t', header=None)
df.columns = ['label', 'message']

# Display the first few rows
print(df.head())

3. Preprocess the Text Data

Before feeding the data into our Naive Bayes classifier, we need to preprocess it into numeric features using CountVectorizer.

Step 1: Convert Text to Features

We will use CountVectorizer to create a bag-of-words representation of the SMS messages.

# Initialize CountVectorizer to transform text into a bag-of-words model
vectorizer = CountVectorizer(stop_words='english')

# Convert the messages into numeric form
X = vectorizer.fit_transform(df['message']).toarray()

# Labels (spam/ham)
y = df['label'].map({'ham': 0, 'spam': 1}).values # Map ham to 0 and spam to 1
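To make the bag-of-words idea concrete, here is what the vectorizer produces on a tiny, made-up corpus (the three messages and the names toy/toy_X are hypothetical):

# Fit a throwaway vectorizer on a toy corpus
toy = CountVectorizer(stop_words='english')
toy_X = toy.fit_transform([
    "free prize winner",
    "meeting at noon",
    "free free entry"
])

print(toy.get_feature_names_out())   # the learned vocabulary, in alphabetical order
print(toy_X.toarray())               # one row per message, one column per word count

Each row counts how often each vocabulary word occurs in the corresponding message; this is exactly the shape of data our classifier will consume.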

Step 2: Split the Data

We will split the dataset into training and test sets.

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
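A quick shape check confirms the split (exact row counts depend on your copy of the dataset):

# Rows are messages, columns are vocabulary terms; the column counts must match
print("Train:", X_train.shape, " Test:", X_test.shape)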

4. Define the Naive Bayes Model with TensorFlow Probability

Step 1: Create a TensorFlow Probability Naive Bayes Classifier

We will build a Naive Bayes classifier on top of TensorFlow. The approach here models the per-class feature likelihoods as a multinomial distribution over word counts, since CountVectorizer produces integer counts of how often each word appears in a message.

tfd = tfp.distributions  # alias for TensorFlow Probability's distributions module (used in the sketch below)

class NaiveBayesClassifier:
    def __init__(self):
        self.class_probs = None
        self.feature_probs_given_class = None

    def fit(self, X, y):
        X = tf.cast(X, tf.float32)
        y = tf.convert_to_tensor(y, dtype=tf.int32)

        # Prior P(C): the relative frequency of each class in the training data
        class_counts = tf.math.bincount(y)
        self.class_probs = tf.cast(class_counts, tf.float32) / tf.cast(
            tf.reduce_sum(class_counts), tf.float32
        )

        # Likelihood P(X|C): the total count of each word within each class
        feature_counts_given_class = tf.math.unsorted_segment_sum(
            data=X,
            segment_ids=y,
            num_segments=len(class_counts)
        )

        # Laplace smoothing: add 1 to every word count; the denominator grows by the vocabulary size
        vocab_size = tf.cast(tf.shape(feature_counts_given_class)[1], tf.float32)
        self.feature_probs_given_class = (feature_counts_given_class + 1.0) / (
            tf.reduce_sum(feature_counts_given_class, axis=1, keepdims=True) + vocab_size
        )

    def predict(self, X):
        X = tf.cast(X, tf.float32)

        # Work with log probabilities to avoid numerical underflow
        log_class_probs = tf.math.log(self.class_probs)
        log_feature_probs_given_class = tf.math.log(self.feature_probs_given_class)

        # Per-document, per-class log posterior (up to a constant):
        # the sum over words of count * log P(word|C), plus log P(C)
        log_probs = tf.einsum('ij,kj->ik', X, log_feature_probs_given_class) + log_class_probs
        return tf.argmax(log_probs, axis=1)

# Initialize the model
nb_classifier = NaiveBayesClassifier()

# Train the model
nb_classifier.fit(X_train, y_train)
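The class above uses only core TensorFlow ops; to show where TensorFlow Probability itself comes in, here is an alternative prediction routine (a sketch of ours, with the hypothetical helper name predict_with_tfp) that expresses the same per-class likelihood through tfd.Multinomial. The multinomial coefficient it adds is identical across classes for a given message, so the argmax is unchanged:

def predict_with_tfp(model, X):
    X = tf.cast(X, tf.float32)
    total_counts = tf.reduce_sum(X, axis=1, keepdims=True)   # words per message, shape [N, 1]

    # One Multinomial per (message, class) pair: batch shape broadcasts to [N, num_classes]
    likelihood = tfd.Multinomial(
        total_count=total_counts,
        probs=model.feature_probs_given_class                # shape [num_classes, vocab_size]
    )

    log_likelihood = likelihood.log_prob(X[:, tf.newaxis, :])    # shape [N, num_classes]
    log_posterior = log_likelihood + tf.math.log(model.class_probs)
    return tf.argmax(log_posterior, axis=1)

Calling predict_with_tfp(nb_classifier, X_test) should agree with nb_classifier.predict(X_test) on every message.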

Explanation:

  • P(C): The prior probability of each class, computed from the relative frequencies of spam and ham in the training data.
  • P(X|C): The likelihood of each word given the class, estimated from the per-class word counts and smoothed with Laplace (add-one) smoothing so that a word unseen in one class never forces a zero probability; the toy example below walks through the arithmetic.
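To see the smoothing at work, here is a tiny, made-up example (the counts and the three-word vocabulary are hypothetical) that mirrors the arithmetic in fit:

import numpy as np

# Hypothetical per-class word counts: 2 classes, 3-word vocabulary
counts = np.array([[4., 0., 1.],    # class 0 (ham)
                   [1., 3., 0.]])   # class 1 (spam)
vocab_size = counts.shape[1]        # the smoothing denominator grows by this amount

smoothed = (counts + 1) / (counts.sum(axis=1, keepdims=True) + vocab_size)
print(smoothed)                     # every word now has a nonzero probability in both classes
print(smoothed.sum(axis=1))         # each row sums to 1, a proper distribution per class

Without the added 1, a word never seen in spam during training would drive the entire spam posterior to zero for any message containing it.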

5. Make Predictions and Evaluate the Model

Once the model is trained, we can make predictions on the test data and evaluate its performance.

Step 1: Make Predictions

# Predict on the test data and convert the resulting tensor to a NumPy array for scikit-learn
y_pred = nb_classifier.predict(X_test).numpy()

Step 2: Evaluate Performance

We will evaluate the model using accuracy, precision, recall, and F1-score.

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Detailed classification report
print(classification_report(y_test, y_pred, target_names=['ham', 'spam']))
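As a final usage check, the trained model can score messages it has never seen; the two example messages below are made up:

# Classify a couple of hypothetical new messages
new_messages = [
    "Congratulations! You have won a free prize, call now",
    "Are we still meeting for lunch tomorrow?"
]
new_X = vectorizer.transform(new_messages).toarray()   # reuse the vocabulary fitted earlier
predictions = nb_classifier.predict(new_X).numpy()

for message, label in zip(new_messages, predictions):
    print(f"{'spam' if label == 1 else 'ham'}: {message}")

Note that transform (not fit_transform) is used here, so the new messages are mapped onto the training vocabulary rather than a new one.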

6. Interpret the Results

Accuracy

The accuracy metric shows how many predictions were correct out of the total predictions. A high accuracy indicates that the model is correctly distinguishing between spam and ham messages.

Classification Report

The classification report provides more detailed insights (the short check after this list recomputes each metric from the confusion matrix):

  • Precision: The ratio of true positive predictions (correctly predicted spam) to all predicted positives.
  • Recall: The ratio of true positives to all actual positives (how well the model identifies spam).
  • F1-Score: The harmonic mean of precision and recall, providing a balanced evaluation of the model’s performance.
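To make these definitions concrete, the raw counts can be pulled out of the confusion matrix and the three metrics recomputed by hand; the results should match the report above:

from sklearn.metrics import confusion_matrix

# For binary labels 0/1, ravel() yields [TN, FP, FN, TP]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

precision = tp / (tp + fp)   # of the messages flagged as spam, the fraction that truly were
recall = tp / (tp + fn)      # of the actual spam messages, the fraction we caught
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall
print(f"Precision: {precision:.3f}  Recall: {recall:.3f}  F1: {f1:.3f}")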

7. Summary

In this article, we implemented a Naive Bayes classifier using TensorFlow Probability. We covered:

  1. Loading and preprocessing the SMS Spam Collection dataset.
  2. Training a Naive Bayes classifier using TensorFlow Probability to predict whether a message is spam or ham.
  3. Evaluating the model's performance using accuracy and classification metrics.

While TensorFlow does not have a built-in Naive Bayes implementation, TensorFlow Probability provides a flexible framework for building probabilistic models like Naive Bayes. In the next section, we will explore the implementation of Naive Bayes using PyTorch.