Naive Bayes Practical Example with scikit-learn
In this practical example, we will use Naive Bayes to perform spam detection on the SMS Spam Collection dataset. We will use Multinomial Naive Bayes, which is well-suited for text classification tasks. This example demonstrates how to:
- Load and preprocess the data.
- Train a Naive Bayes classifier using scikit-learn.
- Evaluate the model’s performance using accuracy and other metrics.
1. Install Dependencies
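To get started, ensure that scikit-learn and pandas are installed in your environment. The command below also installs nltk, which the code in this example does not call, so it is optional here. You can install the packages via pip: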
pip install scikit-learn pandas nltk
2. Load and Preprocess the Data
We will use the SMS Spam Collection dataset, which contains SMS messages labeled as spam or ham (not spam). You can download it from Kaggle or the UCI Machine Learning Repository.
Step 1: Import Libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
Step 2: Load the Dataset
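The code below assumes the file is tab-separated with no header row, one message per line in the form label, tab, message. If your copy is comma-separated or has a header row, adjust the delimiter and header arguments to match: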
# Load the dataset
df = pd.read_csv('sms_spam_collection.csv', delimiter='\t', header=None)
df.columns = ['label', 'message']
# Display first few rows
print(df.head())
The dataset contains two columns: label (spam or ham) and message (the SMS text message).
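Before working with the full file, it can help to check the loading logic on a few rows. The sketch below reads three made-up messages from an in-memory string in the same tab-separated layout; the sample text and the names `sample` and `df_sample` are illustrative assumptions, not part of the dataset. Passing `quoting=csv.QUOTE_NONE` is an optional choice here so that quote characters inside a message are kept as ordinary text.

```python
import csv
import io

import pandas as pd

# Three made-up rows in the "label<TAB>message" layout (not real dataset rows).
sample = (
    "ham\tOk, see you at 5 then\n"
    "spam\tWINNER! Reply \"CLAIM\" to collect your prize\n"
    "ham\tCan you call me later?\n"
)

df_sample = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["label", "message"],
    quoting=csv.QUOTE_NONE,  # keep quote characters as part of the message
)

# Count how many messages carry each label.
label_counts = df_sample["label"].value_counts()
print(df_sample.shape)
print(label_counts)
```

Running `value_counts()` on the real label column is a quick way to see how many ham and spam messages were loaded.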
3. Preprocess the Text Data
Since Naive Bayes requires numeric input, we need to convert the SMS messages into a bag-of-words representation using CountVectorizer.
Step 1: Convert Text to Features
# Initialize the CountVectorizer to transform text into a bag-of-words model
vectorizer = CountVectorizer(stop_words='english')
# Convert the messages into numeric form
X = vectorizer.fit_transform(df['message'])
# Labels (spam/ham)
y = df['label'].map({'ham': 0, 'spam': 1}) # Map ham to 0 and spam to 1
Step 2: Split the Data
We will split the dataset into training and test sets to evaluate the model’s performance.
# Split the dataset into training (80%) and test (20%) sets; stratify=y keeps the spam/ham ratio the same in both
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
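One caveat with the order used above: the vectorizer is fitted on all messages before the split, so words that occur only in test messages still become features. An alternative, sketched here under the assumption that you split the raw text first, is to wrap the vectorizer and classifier in a scikit-learn pipeline so the vocabulary is learned from the training messages only. The messages, labels, and variable names below are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up stand-in data; in the tutorial these would be df['message'] and y.
messages = [
    "win a free prize now",
    "free cash prize winner",
    "claim your free reward today",
    "urgent cash offer just for you",
    "lunch at noon tomorrow",
    "see you at the meeting",
    "can you call me later",
    "running late, be there soon",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = ham

# Split the raw text first; stratify keeps the class ratio in both parts.
msg_train, msg_test, lab_train, lab_test = train_test_split(
    messages, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on the training messages only.
spam_pipeline = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
spam_pipeline.fit(msg_train, lab_train)

test_predictions = spam_pipeline.predict(msg_test)
learned_vocab = spam_pipeline.named_steps["countvectorizer"].vocabulary_
print(sorted(learned_vocab))
```

With this arrangement, `predict` accepts raw strings directly, because the pipeline applies the fitted vectorizer before the classifier.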
4. Train the Naive Bayes Model
Now that the data is preprocessed, we will train the Multinomial Naive Bayes classifier.
# Initialize the Naive Bayes classifier
nb_classifier = MultinomialNB()
# Train the classifier on the training data
nb_classifier.fit(X_train, y_train)
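To show what fitting does internally, the sketch below recomputes the per-class word probabilities by hand and compares them with the values scikit-learn stores in `feature_log_prob_`. For each class, the smoothed probability of a word is (count of the word in that class + alpha) / (total word count in that class + alpha × vocabulary size), where alpha=1.0 is the default additive smoothing. The small count matrix is a made-up example.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: rows are messages, columns are 3 words.
counts = np.array([
    [2, 1, 0],  # ham
    [1, 0, 0],  # ham
    [0, 1, 3],  # spam
    [0, 0, 2],  # spam
])
classes = np.array([0, 0, 1, 1])  # 0 = ham, 1 = spam

alpha = 1.0  # additive (Laplace) smoothing, the MultinomialNB default
model = MultinomialNB(alpha=alpha)
model.fit(counts, classes)

# Recompute the smoothed log-probabilities of each word given each class.
manual_log_prob = []
for c in (0, 1):
    word_totals = counts[classes == c].sum(axis=0)
    smoothed = (word_totals + alpha) / (word_totals.sum() + alpha * counts.shape[1])
    manual_log_prob.append(np.log(smoothed))
manual_log_prob = np.array(manual_log_prob)

print(np.allclose(manual_log_prob, model.feature_log_prob_))
```

Smoothing matters because a word that never appears in one class would otherwise get probability zero and rule that class out for any message containing it.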
5. Make Predictions and Evaluate the Model
After training, we can make predictions on the test data and evaluate the model's performance.
Step 1: Make Predictions
# Make predictions on the test data
y_pred = nb_classifier.predict(X_test)
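The same pattern applies to messages the model has never seen: convert them with the already fitted vectorizer using `transform` (not `fit_transform`, which would rebuild the vocabulary) and then call `predict`. The sketch below is self-contained and uses made-up training and new messages; all names in it are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data: 1 = spam, 0 = ham.
train_messages = [
    "win a free prize now",
    "free cash prize winner",
    "lunch at noon tomorrow",
    "see you at the meeting tomorrow",
]
train_labels = [1, 1, 0, 0]

demo_vectorizer = CountVectorizer()
demo_model = MultinomialNB()
demo_model.fit(demo_vectorizer.fit_transform(train_messages), train_labels)

# New, unseen messages: reuse the fitted vocabulary with transform().
new_messages = ["claim your free prize", "meeting moved to noon"]
new_features = demo_vectorizer.transform(new_messages)

new_predictions = demo_model.predict(new_features)
spam_probabilities = demo_model.predict_proba(new_features)[:, 1]

for text, label, prob in zip(new_messages, new_predictions, spam_probabilities):
    print(f"{text!r}: label={label}, P(spam)={prob:.2f}")
```

Words that were not in the training vocabulary, such as "claim" here, are simply ignored by `transform`.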
Step 2: Evaluate Performance
We will evaluate the model using accuracy, precision, recall, and F1-score.
# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
# Detailed classification report
print(classification_report(y_test, y_pred, target_names=['ham', 'spam']))
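A confusion matrix is a useful companion to the report because it shows the raw counts behind the metrics. The sketch below uses made-up label lists rather than the tutorial's `y_test` and `y_pred`; with the real variables the call would be the same.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = ham, 1 = spam.
true_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Rows are actual classes, columns are predicted classes, in the order [ham, spam].
matrix = confusion_matrix(true_labels, predicted_labels, labels=[0, 1])
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print(f"ham kept: {tn}, ham flagged as spam: {fp}, spam missed: {fn}, spam caught: {tp}")
```

For a spam filter, the false positives (ham flagged as spam) are often the costlier mistake, so it is worth looking at that cell directly.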
6. Interpret the Results
Accuracy
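Accuracy is the fraction of test messages that the model labeled correctly. High accuracy alone does not show that the model separates spam from ham well: ham messages far outnumber spam in this dataset, so a model that labels every message as ham would still score well. Read accuracy together with the per-class metrics in the classification report.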
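To see how accuracy behaves when one class dominates, the sketch below scores a baseline that always predicts the most frequent class. The 87/13 split between ham and spam is a made-up illustration, not the measured ratio of the dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 87 ham (0) and 13 spam (1).
y_demo = np.array([0] * 87 + [1] * 13)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this baseline

# Always predict the majority class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_pred = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, baseline_pred)
baseline_spam_recall = recall_score(y_demo, baseline_pred)  # recall for class 1

print(f"accuracy: {baseline_accuracy:.2f}, spam recall: {baseline_spam_recall:.2f}")
```

The baseline reaches high accuracy while catching no spam at all, which is why a trained model's accuracy is best compared against a baseline like this one.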
Classification Report
The classification report lists the following metrics for each class. For the spam class:
- Precision: The ratio of true positive predictions (correctly predicted spam) to all predicted positives.
- Recall: The ratio of true positives to all actual positives (how well the model identifies spam).
- F1-Score: The harmonic mean of precision and recall, providing a balanced evaluation of the model’s performance.
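The three definitions above can be checked by computing them from raw counts and comparing with scikit-learn's functions. The label lists are made up for illustration, with spam (1) treated as the positive class.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 0 = ham, 1 = spam.
true_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted_labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

pairs = list(zip(true_labels, predicted_labels))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam predicted as spam
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham predicted as spam
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam predicted as ham

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(precision, precision_score(true_labels, predicted_labels))
print(recall, recall_score(true_labels, predicted_labels))
print(f1, f1_score(true_labels, predicted_labels))
```

Each printed pair should agree, since the scikit-learn functions default to treating label 1 as the positive class for binary targets.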
7. Summary
In this example, we used Multinomial Naive Bayes to build a simple spam detection model with the SMS Spam Collection dataset. We covered the following steps:
- Loading and preprocessing the data using CountVectorizer.
- Training a Naive Bayes classifier with scikit-learn.
- Evaluating the model’s performance using accuracy and classification metrics.
Naive Bayes is a strong baseline for text classification tasks like spam detection: it is simple, trains quickly on word counts, and needs little tuning. In the next section, we will explore how to implement Naive Bayes using other popular machine learning libraries like TensorFlow and PyTorch.