Intro to Machine Learning with Spammy Emails, Python, and SciKit Learn

Introduction

Machine learning is capturing significant attention among technologists and innovators, driven by a desire to shift from descriptive analytics, which focuses on understanding what happened in the past, toward predicting what is likely to occur in the future and prescribing actions to take in response to those predictions. In this article I focus on the use case of classifying email messages as either spam or ham with supervised machine learning using Python and SciKit Learn.

Key Terms for Machine Learning and Text Analytics

Spam: Unwanted email (7)

Ham: Non-spam, legitimate email (7)

Document: An arbitrarily defined collection of words representing a complete record under investigation in a text analytics experiment, such as a sentence, paragraph, chapter, blog post, news article, or tweet. It is up to the practitioner to decide what constitutes a document.

doc0 = "The author's name is Adam McQuistan."
doc1 = "The author's profession is software engineering."
doc2 = "The author's interests outside of software are golf, football, and fishing"

Text Corpus: A collection of documents used in a text analytics project (5)

corpora = [ doc0, doc1, doc2 ]

Tokenization: A text analytics preprocessing technique where a document is broken down into component words or even word roots (3,4)

single_words = doc1.split()
print(single_words)
['The', "author's", 'profession', 'is', 'software', 'engineering.']

import string

def make_normalized_token(word):
    translator = str.maketrans('', '', string.punctuation)
    return word.translate(translator).lower()

normalized_tokens = [make_normalized_token(w) for w in single_words]
print(normalized_tokens)
['the', 'authors', 'profession', 'is', 'software', 'engineering']

Vocabulary Building: An extension of tokenization whereby the collective words in a document or corpus are assembled into a unique collection (5)

def make_vocabulary(doc):
    doc_vocab = set()
    for word in doc.split():
        doc_vocab.add(make_normalized_token(word))
    return doc_vocab

corpora_vocab = set()
for doc in corpora:
    doc_vocab = make_vocabulary(doc)
    corpora_vocab = corpora_vocab.union(doc_vocab)

print(corpora_vocab)
{'adam',
 'and',
 'are',
 'authors',
 'engineering',
 'fishing',
 'football',
 'golf',
 'interests',
 'is',
 'mcquistan',
 'name',
 'of',
 'outside',
 'profession',
 'software',
 'the'}

Stop Words: words that are so ubiquitous in language that they are considered uninformative or lacking in predictive strength for numerical analysis in the field of NLP (1).

In my example vocabulary I am going to say that the words (is, of, and, are, the) are low value stop words and remove them from the vocabulary.

stop_words = {'is', 'of', 'and', 'are', 'the'}
corpora_vocab = corpora_vocab - stop_words
print(corpora_vocab)
{'adam',
 'authors',
 'engineering',
 'fishing',
 'football',
 'golf',
 'interests',
 'mcquistan',
 'name',
 'outside',
 'profession',
 'software'}

Bag of Words: A feature extraction technique for text analytics where the frequency of words in a document, or collection of documents (corpus), is represented in a data structure (usually a sparse matrix) and then used to train an algorithm or model for predicting characteristics of a text document such as an email, customer review, or tweet (2)

bag_of_words = {}
for i, doc in enumerate(corpora):
    doc_counts = {}
    doc_words = [make_normalized_token(w) for w in doc.split()]
    for token in corpora_vocab:
        doc_counts[token] = doc_words.count(token)
    doc_name = 'doc{}'.format(i)
    bag_of_words[doc_name] = doc_counts

print(bag_of_words)
{'doc0': {'engineering': 0,
  'outside': 0,
  'golf': 0,
  'interests': 0,
  'authors': 1,
  'profession': 0,
  'adam': 1,
  'mcquistan': 1,
  'software': 0,
  'football': 0,
  'fishing': 0,
  'name': 1},
 'doc1': {'engineering': 1,
  'outside': 0,
  'golf': 0,
  'interests': 0,
  'authors': 1,
  'profession': 1,
  'adam': 0,
  'mcquistan': 0,
  'software': 1,
  'football': 0,
  'fishing': 0,
  'name': 0},
 'doc2': {'engineering': 0,
  'outside': 1,
  'golf': 1,
  'interests': 1,
  'authors': 1,
  'profession': 0,
  'adam': 0,
  'mcquistan': 0,
  'software': 1,
  'football': 1,
  'fishing': 1,
  'name': 0}}
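
The same per-document counts can be produced a little more compactly with collections.Counter from the standard library. This is only a minimal sketch: the count_tokens helper and the small sample vocabulary are my own illustrative assumptions, and the normalization is re-declared so the snippet runs on its own.

```python
import string
from collections import Counter

# Assumption: same normalization as make_normalized_token above
# (strip punctuation, lowercase), repeated so the sketch is self-contained.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def count_tokens(doc, vocabulary):
    """Return {token: count} for every token in `vocabulary` (hypothetical helper)."""
    tokens = [w.translate(_PUNCT_TABLE).lower() for w in doc.split()]
    counts = Counter(tokens)
    # Counter returns 0 for missing keys, so absent vocabulary words count as 0
    return {token: counts[token] for token in sorted(vocabulary)}

doc = "The author's profession is software engineering."
vocab = {'authors', 'engineering', 'golf', 'profession', 'software'}
print(count_tokens(doc, vocab))
```

Counter tallies the document in a single pass, rather than calling list.count once per vocabulary word.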

Sparse Matrix: An array or matrix data structure that is largely composed of elements containing no value or zero, and is therefore often represented in more compact, memory-efficient formats. Sparse matrices are common in text analytics because a corpus may contain a large number of distinct words, yet any individual document has a zero count for most of them (8)

As an example, if I convert my bag_of_words dictionary into a matrix where the columns represent the words in the vocabulary and each row represents a document then the count of each word from the vocab would look something like this.

import pandas as pd

sparsematrix_df = pd.DataFrame(bag_of_words).T
print(sparsematrix_df)
  engineering outside golf interests authors profession adam mcquistan software football fishing name
doc0 0 0 0 0 1 0 1 1 0 0 0 1
doc1 1 0 0 0 1 1 0 0 1 0 0 0
doc2 0 1 1 1 1 0 0 0 1 1 1 0

Note that the majority of data points, defined as the intersection of rows and columns, are zeros here. For a small dataset like this, representing the data this way isn't a big deal, but for the much larger datasets typical of text analytics projects, storing every zero explicitly makes a dense matrix a computationally inefficient data structure.

Another way to represent this sparse matrix is to store only the non-zero values, for example in a dictionary whose keys are (row, column) pairs, here (document, vocabulary word), and whose values are the non-zero counts of that word in that document.

sparsematrix_dct = {}
for docname, doc in sparsematrix_df.iterrows():
    for word in sparsematrix_df.columns:
        if doc[word] != 0:
            sparsematrix_dct[(docname, word)] = doc[word]

print(sparsematrix_dct)
{('doc0', 'authors'): 1,
 ('doc0', 'adam'): 1,
 ('doc0', 'mcquistan'): 1,
 ('doc0', 'name'): 1,
 ('doc1', 'engineering'): 1,
 ('doc1', 'authors'): 1,
 ('doc1', 'profession'): 1,
 ('doc1', 'software'): 1,
 ('doc2', 'outside'): 1,
 ('doc2', 'golf'): 1,
 ('doc2', 'interests'): 1,
 ('doc2', 'authors'): 1,
 ('doc2', 'software'): 1,
 ('doc2', 'football'): 1,
 ('doc2', 'fishing'): 1}
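
Libraries such as SciPy commonly store sparse matrices in a compressed sparse row (CSR) layout, which keeps only the non-zero values plus two index arrays. The sketch below builds that layout by hand in pure Python to show the idea; to_csr is a hypothetical helper of mine, not a SciPy function, and the rows are copied from the toy count matrix above.

```python
def to_csr(dense_rows):
    """Convert a list of equal-length rows into CSR-style arrays (illustrative helper)."""
    data, indices, indptr = [], [], [0]
    for row in dense_rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)     # the non-zero value itself
                indices.append(col)    # the column that value lives in
        indptr.append(len(data))       # offset in `data` where the next row starts
    return data, indices, indptr

# Rows copied from the doc0..doc2 count matrix shown earlier
dense = [
    [0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0],
]
data, indices, indptr = to_csr(dense)
print(len(data), "stored values instead of", sum(len(r) for r in dense))
print(indptr)
```

Row i of the matrix is recovered from data[indptr[i]:indptr[i + 1]] and the matching slice of indices, so no zeros ever need to be stored.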

Term Frequency - Inverse Document Frequency (tf-idf): A rescaling technique for representing token (aka word) frequency with the aim of measuring the extent to which features (words, tokens) are informative. Specifically, higher weight, or emphasis, is given to terms that occur often in one document but rarely across the remaining collection of documents, which makes those terms highly descriptive features of the document(s) in which they prominently appear. This is a purely unsupervised technique, so the degree to which a term is determined to be significant to a document's content doesn't necessarily mean it is linked to a given topic (11).

Smaller values represent terms that are of low specificity to a given document and/or are commonly used across many of the documents.

Larger values represent terms that are more highly specific to the document relative to the rest of the corpus.

import numpy as np

def calc_tfidf(sparse_df):
    # start from an all-NaN frame shaped like the input counts
    tfidf_df = pd.DataFrame(data=np.full(sparse_df.shape, np.nan),
                            index=sparse_df.index,
                            columns=sparse_df.columns)
    words = sparse_df.columns
    N = sparse_df.shape[0]  # total number of documents

    for docname, doc in sparse_df.iterrows():
        other_docs = sparse_df[sparse_df.index != docname]
        for word in words:
            tf = doc[word]
            # number of documents containing the word, including this one
            Nw = (other_docs[word] > 0).sum() + bool(tf)
            # smoothed idf: log10((N + 1) / (Nw + 1)) + 1
            tfidf = tf * (np.log10((N + 1) / (Nw + 1)) + 1)
            tfidf_df.loc[docname, word] = tfidf

    return tfidf_df

tfidf_df = calc_tfidf(sparsematrix_df)
print(tfidf_df)
  engineering outside golf interests authors profession adam mcquistan software football fishing name
doc0 0.00000 0.00000 0.00000 0.00000 1.0 0.00000 1.30103 1.30103 0.000000 0.00000 0.00000 1.30103
doc1 1.30103 0.00000 0.00000 0.00000 1.0 1.30103 0.00000 0.00000 1.124939 0.00000 0.00000 0.00000
doc2 0.00000 1.30103 1.30103 1.30103 1.0 0.00000 0.00000 0.00000 1.124939 1.30103 1.30103 0.00000

Now if I want to find the most informative words that characterize the content of doc0, I could transpose the tfidf_df dataframe and sort in descending order to see the most informative words at the top.

print("doc0: {}\n".format(doc0))
print(tfidf_df.T.sort_values('doc0', ascending=False))
  doc0 doc1 doc2
adam 1.30103 0.000000 0.000000
mcquistan 1.30103 0.000000 0.000000
name 1.30103 0.000000 0.000000
authors 1.00000 1.000000 1.000000
engineering 0.00000 1.301030 0.000000
outside 0.00000 0.000000 1.301030
golf 0.00000 0.000000 1.301030
interests 0.00000 0.000000 1.301030
profession 0.00000 1.301030 0.000000
software 0.00000 1.124939 1.124939
football 0.00000 0.000000 1.301030
fishing 0.00000 0.000000 1.301030
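
The three distinct non-zero values in these tables follow directly from the smoothed formula used in calc_tfidf, tf * (log10((N + 1) / (Nw + 1)) + 1), with N = 3 documents. Here is a quick arithmetic check using only the standard library; the smoothed_idf name is just an illustrative choice.

```python
import math

def smoothed_idf(n_docs, n_docs_with_term):
    """log10-based smoothed idf, matching the formula in calc_tfidf above."""
    return math.log10((n_docs + 1) / (n_docs_with_term + 1)) + 1

N = 3  # documents in the toy corpus
for term, doc_freq in [('adam', 1), ('software', 2), ('authors', 3)]:
    # each of these terms occurs once in a document that contains it, so tf = 1
    print(term, round(1 * smoothed_idf(N, doc_freq), 6))
```

A word found in one document scores highest, a word shared by two scores lower, and a word present in all three falls back to its raw count. Scikit Learn's TfidfTransformer uses a similar smoothed form but with the natural logarithm, and by default it also L2-normalizes each row, so its numbers will differ from this table.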

n-Grams: A feature extraction technique for text analytics where contiguous sequences of n words are identified and each sequence is treated as a single feature (12)

  • unigrams: the regular bag of words approach, where only single terms are assessed as features
  • bigrams: contiguous terms grouped two at a time are assessed as features
  • trigrams: contiguous terms grouped three at a time are assessed as features
  • n-grams: contiguous terms grouped n at a time are assessed as features
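
To make the grouping concrete, here is a minimal sketch that slides a window of size n over the normalized tokens of doc1. The ngrams helper is my own illustrative function, not part of any library.

```python
def ngrams(tokens, n):
    """Return the list of contiguous n-token tuples in `tokens` (illustrative helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['the', 'authors', 'profession', 'is', 'software', 'engineering']
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

In Scikit Learn the CountVectorizer class exposes this through its ngram_range parameter, for example ngram_range=(1, 2) to extract both unigrams and bigrams.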

Lemmatization: A form of word normalization in which a human-curated dictionary is used to reduce an inflected word to its root form by removing prefixes or inflectional endings while still producing a linguistically proper word (15)

Stemming: A form of word normalization that reduces a word to its root form using crude rules or heuristics, whereby common prefixes and suffixes are chopped off, which may leave an incomplete final word (15)
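
The difference between the two is easiest to see side by side. The sketch below is a deliberately tiny toy: the suffix list and the lemma dictionary are made up for illustration, and real stemmers and lemmatizers rely on far larger rule sets and dictionaries.

```python
# Toy data, invented for this example only
SUFFIXES = ('ing', 'ies', 'es', 's', 'ed')
LEMMAS = {'studies': 'study', 'studying': 'study', 'was': 'be', 'mice': 'mouse'}

def crude_stem(word):
    """Chop the first matching suffix, even if the result is not a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def dictionary_lemma(word):
    """Look the word up in a curated mapping; fall back to the word itself."""
    return LEMMAS.get(word, word)

for w in ['studies', 'studying', 'mice']:
    print(w, '->', crude_stem(w), '|', dictionary_lemma(w))
```

The stemmer turns "studies" into the non-word "stud" and leaves "mice" untouched, while the dictionary lookup maps them to "study" and "mouse".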

Supervised Learning: An approach to predictive modeling (aka machine learning) whereby a model is built from a prelabeled dataset in which one or more input predictor variables (features) are paired with known outcomes (9)

Unsupervised Learning: An approach to predictive modeling (aka machine learning) whereby a model is built to learn patterns from unclassified (unlabeled) or naive datasets (10)

Logistic Regression: A prediction model that is a classifier of two possible outcomes based off the linear combination of input predictor variables (features) (13)
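
As a rough sketch of what "linear combination of input predictor variables" means in practice, the snippet below pushes a weighted sum of features through the logistic (sigmoid) function to get a probability. The weights, bias, and the two imagined word-count features are hypothetical values chosen for illustration, not coefficients from any fitted model.

```python
import math

def predict_proba(features, weights, bias):
    """Probability of the positive class from a linear combination of features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))   # logistic (sigmoid) function

# Hypothetical weights for two word-count features, e.g. "free" and "meeting"
weights, bias = [1.5, -2.0], -0.5
p_spam = predict_proba([3, 0], weights, bias)   # "free" x3, "meeting" x0
p_ham = predict_proba([0, 2], weights, bias)    # "free" x0, "meeting" x2
print(round(p_spam, 3), round(p_ham, 3))
```

A probability above 0.5 would be classified as the positive class (spam in this article's labeling) and below 0.5 as the negative class (ham).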

Cross-Validation: A method of assessing the robustness and generalizability of a model by splitting the training dataset into several subsets (aka folds), then training multiple models, each on a different combination of folds and each tested on the fold held out from its training (14)
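
The fold bookkeeping can be sketched in a few lines of pure Python. This is only an illustration of the idea under the assumption of simple, unshuffled k-fold splitting; k_fold_indices is a hypothetical helper of mine.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    # spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

for train_idx, test_idx in k_fold_indices(10, 3):
    print("train:", train_idx, "test:", test_idx)
```

In practice Scikit Learn offers this through sklearn.model_selection, for example the KFold class and the cross_val_score function.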

Example of Building and Assessing Spam / Ham Prediction Models

At this point I've covered enough theory to lay a foundation for speaking relatively freely in the common terms and concepts to be expected in a quality discussion of machine learning, particularly in the space of text analytics and spam / ham classification. This means I can shift focus to a more practical demonstration of using machine learning to solve a real world problem such as the spam / ham classification use case being presented here.

As mentioned previously, I am utilizing the wildly popular Python programming language and the equally venerable Scikit Learn machine learning library built upon Python to classify a couple of different publicly available email datasets labeled as spam or ham. One dataset comes from the Apache SpamAssassin project and the other from Athens University, which features a collection of differently sourced email datasets but primarily focuses on the confiscated Enron emails.

I will start with the SpamAssassin dataset, which I download in compressed form using HTTPie, a Python-based command line HTTP client.

http --download https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2
http --download https://spamassassin.apache.org/old/publiccorpus/20021010_hard_ham.tar.bz2
http --download https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2
http --download https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham_2.tar.bz2
http --download https://spamassassin.apache.org/old/publiccorpus/20030228_hard_ham.tar.bz2
http --download https://spamassassin.apache.org/old/publiccorpus/20030228_spam.tar.bz2
http --download https://spamassassin.apache.org/old/publiccorpus/20030228_spam_2.tar.bz2
http --download https://spamassassin.apache.org/old/publiccorpus/20050311_spam_2.tar.bz2
ls -l
-rw-r--r--  1 adammcquistan  staff  1677144 Mar 18 20:13 20021010_easy_ham.tar.bz2
-rw-r--r--  1 adammcquistan  staff  1021126 Mar 18 20:20 20021010_hard_ham.tar.bz2
-rw-r--r--  1 adammcquistan  staff  1192582 Mar 18 20:21 20021010_spam.tar.bz2
-rw-r--r--  1 adammcquistan  staff  1077892 Mar 18 20:21 20030228_easy_ham_2.tar.bz2
-rw-r--r--  1 adammcquistan  staff  1029898 Mar 18 20:21 20030228_hard_ham.tar.bz2
-rw-r--r--  1 adammcquistan  staff  1183768 Mar 18 20:21 20030228_spam.tar.bz2

Next I extract each to their own directories.

mkdir 20021010_easy_ham 
mkdir 20021010_hard_ham 
mkdir 20021010_spam 
mkdir 20030228_easy_ham_2 
mkdir 20030228_hard_ham 
mkdir 20030228_spam
mkdir 20030228_spam_2
mkdir 20050311_spam_2
tar xf 20021010_easy_ham.tar.bz2 --directory 20021010_easy_ham --strip-components 1
tar xf 20021010_hard_ham.tar.bz2 --directory 20021010_hard_ham --strip-components 1
tar xf 20021010_spam.tar.bz2 --directory 20021010_spam --strip-components 1
tar xf 20030228_easy_ham_2.tar.bz2 --directory 20030228_easy_ham_2 --strip-components 1
tar xf 20030228_hard_ham.tar.bz2 --directory 20030228_hard_ham --strip-components 1
tar xf 20030228_spam.tar.bz2 --directory 20030228_spam --strip-components 1
tar xf 20030228_spam_2.tar.bz2 --directory 20030228_spam_2 --strip-components 1
tar xf 20050311_spam_2.tar.bz2 --directory 20050311_spam_2 --strip-components 1

Then I have a peek at an email message to gain an understanding of the raw data.

cat 20021010_easy_ham/2551.3b1f94418de5bd544c977b44bcc7e740
From rssfeeds@jmason.org  Thu Oct 10 12:32:34 2002
Return-Path: <rssfeeds@example.com>
Delivered-To: yyyy@localhost.example.com
Received: from localhost (jalapeno [127.0.0.1])
	by jmason.org (Postfix) with ESMTP id 89EE616F03
	for <jm@localhost>; Thu, 10 Oct 2002 12:32:33 +0100 (IST)
Received: from jalapeno [127.0.0.1]
	by localhost with IMAP (fetchmail-5.9.0)
	for jm@localhost (single-drop); Thu, 10 Oct 2002 12:32:33 +0100 (IST)
Received: from dogma.slashnull.org (localhost [127.0.0.1]) by
    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g9A84QK14194 for
    <jm@jmason.org>; Thu, 10 Oct 2002 09:04:26 +0100
Message-Id: <200210100804.g9A84QK14194@dogma.slashnull.org>
To: yyyy@example.com
From: newscientist <rssfeeds@example.com>
Subject: Critical US satellites could be hacked
Date: Thu, 10 Oct 2002 08:04:26 -0000
Content-Type: text/plain; encoding=utf-8
X-Spam-Status: No, hits=-959.4 required=5.0
	tests=AWL,DATE_IN_PAST_03_06,T_NONSENSE_FROM_00_10,
	      T_NONSENSE_FROM_10_20,T_NONSENSE_FROM_20_30,
	      T_NONSENSE_FROM_30_40,T_NONSENSE_FROM_40_50,
	      T_NONSENSE_FROM_50_60,T_NONSENSE_FROM_60_70,
	      T_NONSENSE_FROM_70_80,T_NONSENSE_FROM_80_90,
	      T_NONSENSE_FROM_90_91,T_NONSENSE_FROM_91_92,
	      T_NONSENSE_FROM_92_93,T_NONSENSE_FROM_93_94,
	      T_NONSENSE_FROM_94_95,T_NONSENSE_FROM_95_96,
	      T_NONSENSE_FROM_96_97,T_NONSENSE_FROM_97_98,
	      T_NONSENSE_FROM_98_99,T_NONSENSE_FROM_99_100
	version=2.50-cvs
X-Spam-Level: 

URL: http://www.newsisfree.com/click/-3,8708820,1440/
Date: Not supplied

Military communications could be jammed or intercepted and satellites thrown 
off course or destroyed, a new US study warns

Given these are email messages I will use the email module from the Python standard library to parse the email messages and extract the body of the message for analysis.

import email
import os
import re

p = os.path.join('20021010_easy_ham', '2551.3b1f94418de5bd544c977b44bcc7e740')
with open(p) as fp:
    mail = email.message_from_file(fp)
print(mail.get_payload())
'URL: http://www.newsisfree.com/click/-3,8708820,1440/\nDate: Not supplied\n\nMilitary communications could be jammed or intercepted 
and satellites thrown \noff course or destroyed, a new US study warns\n\n\n'

In addition to parsing the messages with the email module, I also clean the payload body a bit in a preprocessing step by removing any improperly parsed email header lines (such as those starting with sent, from, to, cc, date, or url) as well as empty lines in the body. I wrap this parsing and cleaning functionality into a reusable function that returns a list of two-element tuples, each holding the label type of the message (i.e., spam or ham) and the cleaned message body.

def parse_messages(base_dir, msg_type, verbose=False):
    labeled_messages = []
    r_url = re.compile(r'url\s*:', flags=re.IGNORECASE)
    r_to = re.compile(r'to\s*:', flags=re.IGNORECASE)
    r_from = re.compile(r'from\s*:', flags=re.IGNORECASE)
    r_date = re.compile(r'date\s*:', flags=re.IGNORECASE)
    r_sent = re.compile(r'sent\s*:', flags=re.IGNORECASE)
    r_cc = re.compile(r'cc\s*:', flags=re.IGNORECASE)
    
    
    for f in os.listdir(base_dir):
        filepath = os.path.join(base_dir, f)
        with open(filepath) as fp:
            try:
                mail = email.message_from_file(fp)

                semi_cleaned = []
                message_payload = mail.get_payload()
                if isinstance(message_payload, list):
                    # some appear to be duplicated so just take the first
                    message_payload = message_payload[0].get_payload()

                for line in message_payload.split('\n'):
                    line = line.strip()
                    exclude = not line or \
                                r_url.match(line) or \
                                r_to.match(line) or \
                                r_from.match(line) or \
                                r_date.match(line) or \
                                r_sent.match(line) or \
                                r_cc.match(line)
                    if not exclude:
                        semi_cleaned.append(line)
                labeled_messages.append((msg_type, '\n'.join(semi_cleaned)))
            except Exception as e:
                if verbose:
                    print('Skipping ' + filepath)
                    print(str(e))
    return labeled_messages


HAM_TYPE = 0
SPAM_TYPE = 1

ham_messages1 = parse_messages('20021010_easy_ham', HAM_TYPE)
ham_messages2 = parse_messages('20021010_hard_ham', HAM_TYPE)
ham_messages3 = parse_messages('20030228_easy_ham_2', HAM_TYPE)
ham_messages4 = parse_messages('20030228_hard_ham', HAM_TYPE)

spam_messages1 = parse_messages('20021010_spam', SPAM_TYPE)
spam_messages2 = parse_messages('20030228_spam', SPAM_TYPE)
spam_messages3 = parse_messages('20030228_spam_2', SPAM_TYPE)
spam_messages4 = parse_messages('20050311_spam_2', SPAM_TYPE)
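
As an aside, parse_messages simply takes the first part of a multipart message. A possible alternative, sketched here under my own assumptions, is to use the walk() method of the standard library's Message class to look for the first text/plain part. The first_plain_text helper and the two-part sample message are invented for illustration.

```python
import email

def first_plain_text(mail):
    """Return the first text/plain payload in a message, or '' if none (hypothetical helper)."""
    for part in mail.walk():                      # visits the message and all sub-parts
        if part.get_content_type() == 'text/plain':
            return part.get_payload()
    return ''

# A made-up two-part message, used only to exercise the helper
raw = (
    'Subject: demo\n'
    'MIME-Version: 1.0\n'
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    '\n'
    '--XYZ\n'
    'Content-Type: text/html\n'
    '\n'
    '<p>hello</p>\n'
    '--XYZ\n'
    'Content-Type: text/plain\n'
    '\n'
    'hello\n'
    '--XYZ--\n'
)
print(first_plain_text(email.message_from_string(raw)))
```

Here the HTML part comes first, so indexing the payload list at 0 would have returned markup, whereas walking the parts finds the plain text version.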

Next I combine the individual spam and ham messages to see how many of each there are.

ham_messages = ham_messages1 + ham_messages2 + ham_messages3 + ham_messages4
spam_messages = spam_messages1 + spam_messages2 + spam_messages3 + spam_messages4

print("Number of ham message {}".format(len(ham_messages)))
print("Number of spam message {}".format(len(spam_messages)))
Number of ham message 4124
Number of spam message 3322

Lastly, I combine the spam and ham labeled datasets into one collection and then randomize their order so that both classes are spread evenly throughout it, which helps when splitting the data to build an unbiased machine learning model.

combined_messages = spam_messages + ham_messages

# seed the generator itself; np.random.seed() does not affect default_rng()
rng = np.random.default_rng(seed=23)

rng.shuffle(combined_messages)

n_test = 1000
n_train = len(combined_messages) - n_test
train = combined_messages[:n_train]
test = combined_messages[n_train:]

print("Training sample size: {}".format(len(train)))
print("Testing sample size: {}".format(len(test)))
Training sample size: 6446
Testing sample size: 1000

I then re-split both the training and testing datasets into independent and dependent variables so that the predictors (aka independent variables, here the message text) are isolated from the labels. The training pair is used to estimate the logistic regression coefficients, which are chosen to minimize the cumulative error (log loss) between the predicted email classes and the true ones in y_train, while the testing pair, with its labels in y_test, is held back for evaluation.

y_train, train_txt = zip(*train)
y_test, test_txt = zip(*test)

To be sure that the training and testing datasets have a relatively equal distribution of spam and ham labeled emails, I'll enlist the help of a bar chart from the matplotlib library.

import matplotlib.pyplot as plt
import pandas as pd

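# sort_index() orders the counts by label (0 = ham, 1 = spam) to match x below
train_cnts = pd.Series(data=y_train).value_counts().sort_index()
test_cnts = pd.Series(data=y_test).value_counts().sort_index()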

x = ['ham', 'spam']
colors = ['blue', 'green']

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 6), sharey=True)
ax1.bar(x, train_cnts.values, color=colors)
ax1.set_title('Training')

ax2.bar(x, test_cnts.values, color=colors)
ax2.set_title('Testing')

plt.show()
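
The same check can also be made numerically rather than visually. A minimal sketch, assuming labels of 0 for ham and 1 for spam as defined above; class_proportions and the short demo label lists are illustrative stand-ins for y_train and y_test.

```python
from collections import Counter

def class_proportions(labels):
    """Map each class label to its share of the dataset (illustrative helper)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in sorted(counts)}

# Made-up labels (0 = ham, 1 = spam), standing in for y_train and y_test
y_train_demo = [0, 0, 0, 1, 1, 0, 1, 0]
y_test_demo = [0, 1, 0, 1]
print(class_proportions(y_train_demo))
print(class_proportions(y_test_demo))
```

If the spam share differs noticeably between the two subsets, reshuffling or a stratified split is worth considering.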

Next I generate a bag of words representation of the training text dataset using the CountVectorizer feature extraction class from the Scikit Learn library. Recall from the earlier discussion of the Bag of Words text feature extraction technique that a vocabulary must be generated from the result of tokenizing a body of text such as a document or a corpus.

from sklearn.feature_extraction.text import CountVectorizer

v = CountVectorizer()

Next I fit the CountVectorizer, which tokenizes the input text and generates the vocabulary.

v.fit(train_txt)

The fitted instance of CountVectorizer will have a computed field named vocabulary_, which is a dictionary with each tokenized term represented as a key and the corresponding value representing its index in sorted order. To better visualize this I generate a sorted list of (term, index) items from the vocabulary_ dictionary and display it below.

vocab = sorted([(t, i) for t, i in v.vocabulary_.items()], key=lambda term: term[1])
print("Terms in vocabulary: {}".format(len(vocab)))
print("Every 5,000th term, index element of the vocabulary:\n{}".format(vocab[::5000]))
Terms in vocabulary: 69215
Every 5,000th term, index element of the vocabulary:
[('00', 0), ('256f4b655a55d56db6f498eb9ce6f50c', 5000), ('65gs', 10000), ('allfreecars', 15000), ('buzzmeter', 20000), ('d9fdce0f5763bbb219ee56cced16f298', 25000), 
('endpoints', 30000), ('grasshopper', 35000), ('ite', 40000), ('mhyxqvy2a7hthl11koqfb2gvagyr', 45000), ('pc9ppjwvyj4siekgdw5kzxjzdgfuzcb0agf0ihroaxmgbwf0zxjpywwgaxm8yni', 50000),
('rename', 55000), ('steroid', 60000), ('valenti', 65000)]

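There is also a handy get_feature_names() method of the CountVectorizer class which you can use to get the sorted list of terms (aka features) comprising the vocabulary from the dataset used to fit the CountVectorizer class. Note that in Scikit Learn 1.0 and later this method is named get_feature_names_out() and returns a NumPy array; get_feature_names() was removed in version 1.2.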

features = v.get_feature_names()
print("Number of features: {}".format(len(features)))
print("Every 5,000th feature:\n{}".format(features[::5000]))
Number of features: 69215
Every 5,000th feature:
['00', '256f4b655a55d56db6f498eb9ce6f50c', '65gs', 'allfreecars', 'buzzmeter', 'd9fdce0f5763bbb219ee56cced16f298', 'endpoints', 'grasshopper', 'ite', 
'mhyxqvy2a7hthl11koqfb2gvagyr', 'pc9ppjwvyj4siekgdw5kzxjzdgfuzcb0agf0ihroaxmgbwf0zxjpywwgaxm8yni', 'rename', 'steroid', 'valenti']
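
To make the relationship between the vocabulary mapping and the sorted feature list more concrete, here is a pure-Python approximation. It assumes CountVectorizer's default settings (lowercasing and a token pattern that keeps runs of two or more word characters); build_vocabulary and the two sample documents are my own illustrative inventions, and this is a sketch rather than the library's actual code.

```python
import re

# Assumption: mirrors CountVectorizer's default behaviour (lowercasing and the
# token pattern r"(?u)\b\w\w+\b"); a sketch, not the library's implementation.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def build_vocabulary(docs):
    """Map each distinct token to its position in alphabetical order."""
    terms = sorted({tok for doc in docs for tok in TOKEN_RE.findall(doc.lower())})
    return {term: index for index, term in enumerate(terms)}

docs = ["Free money now", "Meeting at 10 am, bring the free agenda"]
vocab_demo = build_vocabulary(docs)
print(vocab_demo)
```

Digits sort ahead of letters, which is why terms such as '00' sit at index 0 in the real vocabulary above.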

You use the transform method of the fitted CountVectorizer class to generate the actual counts for each term from the vocabulary in each document of a given dataset, effectively building an efficient sparse matrix representation of the bag of words. In this case the result is a SciPy sparse matrix that uses row / column indexing similar to the toy example I introduced sparse matrices with earlier, except the row is the document index and the column is the index of the word in the vocabulary.

X_train = v.transform(train_txt)
print("X_train is of type {}".format(type(X_train)))
print("X_train contents:\n{}".format(X_train))
X_train is of type <class 'scipy.sparse.csr.csr_matrix'>
X_train contents:
  (0, 14345)	1
  (0, 15269)	2
  (0, 15334)	1
  (0, 16507)	1
  (0, 18542)	1
  (0, 19173)	1
  (0, 20120)	1
  (0, 20730)	1
  (0, 20845)	1
  (0, 22402)	1
  (0, 26718)	1
  (0, 27257)	1
  (0, 29297)	1
  (0, 29654)	1
  (0, 29725)	1
  (0, 29727)	2
  (0, 31692)	1
  (0, 31964)	1
  (0, 32912)	1
  (0, 34439)	1
  (0, 34589)	1
  (0, 35553)	3
  (0, 35797)	2
  (0, 36279)	1
  (0, 37592)	1
  :	:
  (6445, 42921)	1
  (6445, 45168)	1
  (6445, 46230)	1
  (6445, 47151)	1
  (6445, 48177)	2
  (6445, 48412)	1
  (6445, 48702)	1
  (6445, 48938)	1
  (6445, 49366)	1
  (6445, 50345)	1
  (6445, 51632)	1
  (6445, 53588)	1
  (6445, 54230)	1
  (6445, 58358)	1
  (6445, 62088)	5
  (6445, 62213)	1
  (6445, 62242)	1
  (6445, 62666)	2
  (6445, 63615)	1
  (6445, 64737)	1
  (6445, 66042)	1
  (6445, 66385)	1
  (6445, 66450)	1
  (6445, 66573)	2
  (6445, 67110)	1

At this point I am ready to make my first predictive model to classify an email body as either a legitimate message (ham) or unwanted spam. Two-class datasets like this are perfect for a Logistic Regression, and it just so happens that the Scikit Learn library offers up a robust classifier named LogisticRegression.

from sklearn.linear_model import LogisticRegression

logreg_clf = LogisticRegression()
logreg_clf.fit(X_train, y_train)

Fitting the Logistic Regression model generates the coefficients such that the errors between the true dependent variables and the predicted outcomes are minimized. Once the model is trained, the next thing to do is evaluate the model's accuracy on the training set, to see how well the model is able to learn the dataset and thus explain the variation of the outcomes, as well as on the testing dataset, to assess its robustness and ability to generalize to new messages it hasn't seen before.

X_test = v.transform(test_txt)

print("Training score: {}".format(logreg_clf.score(X_train, y_train)))
print("Testing score: {}".format(logreg_clf.score(X_test, y_test)))
Training score: 0.9992243251628917
Testing score: 0.992

Now let's pretend I didn't actually just run the score(...) method on the test dataset and instead I'll show how to use the model to predict the presence of spam or ham using the X_test data.

pred = logreg_clf.predict(X_test)
print(pred[:5])
accuracy = sum(y_test == pred) * 100.0 / len(y_test)
print("Prediction accuracy: {}%".format(accuracy))
array([0, 0, 0, 0, 1])
Prediction accuracy: 99.2%
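
Accuracy on its own doesn't say which kind of mistake the model makes, and for a spam filter a legitimate message flagged as spam is often considered the more costly error. A small sketch of counting each outcome by hand; confusion_counts is an illustrative helper and the labels are made up, not results from the model above.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

# Made-up labels (0 = ham, 1 = spam), not output from logreg_clf
y_true_demo = [0, 0, 1, 1, 0, 1]
y_pred_demo = [0, 1, 1, 0, 0, 1]
tp, tn, fp, fn = confusion_counts(y_true_demo, y_pred_demo)
accuracy_demo = (tp + tn) / len(y_true_demo)
print(tp, tn, fp, fn, round(accuracy_demo, 3))
```

Here fp counts ham wrongly flagged as spam and fn counts spam that slipped through, while accuracy is simply the share of correct predictions.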

It turns out that this model is pretty darn good for that particular SpamAssassin dataset and I am doubtful I will be able to improve it much from here. Instead, I will now move on to another dataset featuring the Enron messages that will hopefully prove more challenging and thus provide for a more thorough investigation into other important machine learning concepts (16).

Again, I download the data files in compressed form over the internet using the HTTPie client.

http --download http://www.aueb.gr/users/ion/data/enron-spam/preprocessed/enron1.tar.gz
http --download http://www.aueb.gr/users/ion/data/enron-spam/preprocessed/enron2.tar.gz
http --download http://www.aueb.gr/users/ion/data/enron-spam/preprocessed/enron3.tar.gz
http --download http://www.aueb.gr/users/ion/data/enron-spam/preprocessed/enron4.tar.gz
http --download http://www.aueb.gr/users/ion/data/enron-spam/preprocessed/enron5.tar.gz
http --download http://www.aueb.gr/users/ion/data/enron-spam/preprocessed/enron6.tar.gz
tar xf enron1.tar.gz
tar xf enron2.tar.gz
tar xf enron3.tar.gz
tar xf enron4.tar.gz
tar xf enron5.tar.gz
tar xf enron6.tar.gz
ls -l
drwx------     5 adammcquistan  staff      160 May 15  2006 enron1
-rw-r--r--     1 adammcquistan  staff  1802573 Mar 18 21:30 enron1.tar.gz
drwx------     5 adammcquistan  staff      160 May 15  2006 enron2
-rw-r--r--     1 adammcquistan  staff  2905627 Mar 18 21:30 enron2.tar.gz
drwx------     5 adammcquistan  staff      160 May 15  2006 enron3
-rw-r--r--     1 adammcquistan  staff  4569634 Mar 18 21:30 enron3.tar.gz
drwx------     5 adammcquistan  staff      160 May 15  2006 enron4
-rw-r--r--     1 adammcquistan  staff  2533019 Mar 18 21:31 enron4.tar.gz
drwx------     5 adammcquistan  staff      160 May 15  2006 enron5
-rw-r--r--     1 adammcquistan  staff  2396886 Mar 18 21:31 enron5.tar.gz
drwx------     5 adammcquistan  staff      160 May 15  2006 enron6
-rw-r--r--     1 adammcquistan  staff  3137204 Mar 18 21:31 enron6.tar.gz

It is always a good practice to get a peek at the raw data especially when it is in a format that is human readable as is the case here. Sometimes you will need specialized utilities if the raw data is binary but I don't have that problem today.

p = os.path.join('enron1', 'ham', '5172.2002-01-11.farmer.ham.txt')
with open(p) as fp:
    mail = email.message_from_file(fp)
print(mail.get_payload())
i tried calling you this am but your phone rolled to someone else ' s voicemail . can you call me when you get a chance ?
- - - - - original message - - - - -
from : farmer , daren j .
sent : thursday , january 10 , 2002 2 : 06 pm
to : hill , garrick
subject : re : tenaska iv
rick ,
i ' ve had a couple of meetings today . i ' m sorry i ' m just getting back to you . i tried to call but the voice mail said that you were 
unavailable . so , give me a call when you get a chance . d - - - - - original message - - - - - from : hill , garrick sent : wednesday , january 09 , 2002 6 : 11 pm to : farmer , daren j . subject : re : tenaska iv i ' ll call you on thursday . . . what ' s a good time ? - - - - - original message - - - - - from : farmer , daren j . sent : wednesday , january 09 , 2002 3 : 03 pm to : hill , garrick cc : olsen , michael subject : tenaska iv rick , we need to talk about the ability of ena to continue its the current role as agent of tenaska iv . 1 ) since the end on november , ena has not been able to complete gas trading transactions . we cannot find any counterparties to trade physical gas in texas . this , of course , is due to the bankruptcy .
as a result , we are not able to sale tenaska ' s excess fuel . we did contact brazos to ask if they would buy a portion of the gas at a gas daily price , but they do not want it ( gas daily pricing has been
below the firm contract price for a while ) . in december , we had to cut 10 , 000 / day from the 7 th through the 27 th . for january , we haven ' t had to cut yet , but i am sure that the pipe will ask us
to do this in the near future . 2 ) for november activity ( which was settled in dec ) , ena owes tenaska iv for the excess supply that we sold . however , due to the bankruptcy , we could not make payments out . ena could not pay the
suppliers or the pipeline . james armstrong paid the counterparties directly . i think that he should continue to do this for dec and jan . we should not transfer any funds from tenaska iv to ena . i don ' t know how enron ' s ownership in the plant factors out in the bankruptcy preceding . but we need to determine how to go forward with the fuel management . please give me a call or e - mail me . we can get together sometime thurs or fri morning .

From inspecting the above email you can see that it represents a dialog back and forth between two individuals, all represented in one message file. For the purpose of spam / ham classification I am really only interested in identifying whether the first message sent is spam or ham, so I will do additional preprocessing of these emails to keep only the first (earliest) email in the thread by splitting on the text "- - - - - original message - - - - -" and taking the last segment, as shown below.

def filter_original(messages, split_on="- - - - - original message - - - - -"):
    original_messages = []
    for msg_type, msg in messages:
        msg_parts = msg.split(split_on)
        if len(msg_parts) > 1:
            msg = msg_parts[-1]
        original_messages.append((msg_type, msg))
    return original_messages

spam_messages = filter_original(parse_messages('enron1/spam', SPAM_TYPE))
ham_messages = filter_original(parse_messages('enron1/ham', HAM_TYPE))

print("Ham messages: {}".format(len(ham_messages)))
print("Spam messages: {}".format(len(spam_messages)))
Ham messages: 3672
Spam messages: 1485
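
To make the splitting behavior easier to see, here is a minimal sketch on a made-up thread. The helper name earliest_message and the toy thread text are my own assumptions for illustration; the sketch uses rsplit rather than the split-and-index approach of filter_original, but is meant to keep the same part of the message.

```python
# Marker text as it appears in the tokenized Enron messages.
MARKER = "- - - - - original message - - - - -"


def earliest_message(thread, marker=MARKER):
    """Return the text after the last marker, i.e. the oldest message in the thread."""
    # rsplit with maxsplit=1 splits only on the final occurrence of the marker;
    # if the marker is absent, the whole text comes back unchanged.
    return thread.rsplit(marker, 1)[-1]


# Hypothetical three-message thread, newest reply first.
thread = (
    "sounds good . "
    "- - - - - original message - - - - - "
    "what ' s a good time ? "
    "- - - - - original message - - - - - "
    "we need to talk about tenaska iv ."
)

print(earliest_message(thread).strip())
print(earliest_message("no quoted replies here"))
```

The first call prints only the oldest message in the toy thread, and the second shows that a message without any quoted replies passes through untouched.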

Just like before with the SpamAssassin dataset, I must combine the spam and ham messages into one data structure, randomize it, then re-split it into training and testing subsets for model building and validation.

combined_messages = spam_messages + ham_messages
rng = np.random.default_rng()

rng.shuffle(combined_messages)

n_test = 750
n_train = len(combined_messages) - n_test
train = combined_messages[:n_train]
test = combined_messages[n_train:]

y_train, train_txt = zip(*train)
y_test, test_txt = zip(*test)

print("Training count: {}".format(len(train)))
print("Testing count: {}".format(len(test)))
Training count: 4407
Testing count: 750

As stated earlier, it is generally a good idea to check that the classes are distributed in roughly equal proportions across the training and testing subsets. A quick plot will suffice for this assessment.

train_cnts = pd.Series(data=y_train).value_counts()
test_cnts = pd.Series(data=y_test).value_counts()

x = ['ham', 'spam']
colors = ['blue', 'green']

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 6), sharey=True)
ax1.bar(x, train_cnts.values, color=colors)
ax1.set_title('Training')

ax2.bar(x, test_cnts.values, color=colors)
ax2.set_title('Testing')

plt.show()
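
If a numeric check is preferred over eyeballing a bar chart, the class shares can also be computed directly. The sketch below uses only the standard library; the function name class_proportions and the toy label lists are hypothetical stand-ins for y_train and y_test.

```python
from collections import Counter


def class_proportions(labels):
    """Map each class label to its share of the given labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy labels standing in for y_train and y_test (hypothetical values).
toy_train = ["ham"] * 7 + ["spam"] * 3
toy_test = ["ham"] * 3 + ["spam"] * 1

for name, labels in [("train", toy_train), ("test", toy_test)]:
    shares = class_proportions(labels)
    print(name, {label: round(share, 2) for label, share in sorted(shares.items())})
```

For the Enron data, both subsets should come out near 71% ham, since 3672 of the 5157 messages are ham. If the shares drift apart, one option is Scikit Learn's train_test_split function with its stratify parameter, which keeps class proportions the same in both subsets.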

Starting simple is usually the best approach, so I will use the CountVectorizer to perform bag of words feature extraction on this dataset and then use the result to train a new Logistic Regression model.

v = CountVectorizer()
v.fit(train_txt)
X_train = v.transform(train_txt)
X_test = v.transform(test_txt)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

bow_val = logreg.score(X_train, y_train)
bow_test = logreg.score(X_test, y_test)

print("Training score: {}".format(bow_val))
print("Testing score: {}".format(bow_test))
Training score: 0.9963694122986159
Testing score: 0.9826666666666667

Again, the model is quite accurate, but it is exhibiting a bit more overfitting to the training dataset than may be desired: the training score is about 0.996 while the testing score is about 0.983. There are a few things I can try to see if another model can be built that generalizes to unseen data a little better.

The first thing I will try is simply using an additional parameter to the CountVectorizer class to specify that English based stop words should be filtered out from the vocabulary.

v_sw = CountVectorizer(stop_words='english')
v_sw.fit(train_txt)

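# vocabulary_ maps each term to its column index, so its length is the vocabulary size
# (get_feature_names() has been removed from newer Scikit Learn releases)
n_basic = len(v.vocabulary_)
n_sw = len(v_sw.vocabulary_)

print("Basic Bag of Words Vocab Size: {}".format(n_basic))
print("Stop Words Filtered Bag of Words Vocab Size: {}".format(n_sw))
Basic Bag of Words Vocab Size: 45070
Stop Words Filtered Bag of Words Vocab Size: 44765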

From this I can see that around 300 terms were removed from the stop word filtered feature set, so now I will see how that affects the fitting and evaluation of the Logistic Regression model.

X_train_sw = v_sw.transform(train_txt)
X_test_sw = v_sw.transform(test_txt)

logreg_sw = LogisticRegression()
logreg_sw.fit(X_train_sw, y_train)

sw_val = logreg_sw.score(X_train_sw, y_train)
sw_test = logreg_sw.score(X_test_sw, y_test)

print("Training score: {}".format(sw_val))
print("Testing score: {}".format(sw_test))
Training score: 0.9963694122986159
Testing score: 0.9813333333333333
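
As a side illustration of what stop_words='english' actually does to a vocabulary, here is a small sketch on two made-up sentences. The documents and variable names are assumptions for demonstration only, and the exact set of removed terms depends on the stop word list shipped with the installed Scikit Learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, used only for illustration.
toy_docs = [
    "the gas is due to the bankruptcy",
    "we need to talk about the fuel",
]

# vocabulary_ is a dict of term -> column index; the keys are the learned terms.
basic_vocab = set(CountVectorizer().fit(toy_docs).vocabulary_)
filtered_vocab = set(CountVectorizer(stop_words="english").fit(toy_docs).vocabulary_)

print("Basic vocabulary:   ", sorted(basic_vocab))
print("Filtered vocabulary:", sorted(filtered_vocab))
print("Removed terms:      ", sorted(basic_vocab - filtered_vocab))
```

Filtering only drops columns for words on a fixed list of a few hundred very common terms, which is why a vocabulary of roughly 45,000 terms shrinks by only about 300.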

It does not look like filtering out Scikit Learn's predefined list of English stop words improves the generalizability of the model. However, thus far I've been a little sloppy in how I am validating the two models, since I picked an arbitrary split of training and testing data to work with. Not only might that particular split skew the accuracy assessment, I am also running the risk of skewed distributions of the spam and ham classes among the training and testing samples being chosen, although this is pretty low risk as shown by the relatively equal proportions of spam and ham in the bar graphs above.

A more stringent technique which accounts for the aforementioned issues is cross-validation. The Scikit Learn library provides a helper function for performing cross-validation named cross_val_score(...). It takes a classifier model as the first parameter, followed by the independent and dependent variables of the training dataset as the second and third. There are other optional parameters as well, the most commonly used being cv, which sets the number of folds to use.

The cross-validation procedure splits the given training data into K roughly equal portions, usually referred to as folds, where K is the value passed via the cv parameter. For each iteration from 1 to K the model is fit on K - 1 of the folds and then tested on the remaining fold, so every fold serves as the validation set exactly once.

For example, say I have the below training dataset.

Independent Variable    Dependent Variable
1                       0
2                       1
3                       0
4                       1
5                       0
6                       1
7                       0
8                       1
9                       0

If I specify a cv value of 3 then the data will be chunked into 3 folds and for each iteration there will be 6 elements in the subset that is used to fit the model and 3 elements used to validate the model.

First Iteration:

Independent Variable    Dependent Variable    Train / Validate
1                       0                     Validate
2                       1                     Validate
3                       0                     Validate
4                       1                     Train
5                       0                     Train
6                       1                     Train
7                       0                     Train
8                       1                     Train
9                       0                     Train

Second Iteration:

Independent Variable    Dependent Variable    Train / Validate
1                       0                     Train
2                       1                     Train
3                       0                     Train
4                       1                     Validate
5                       0                     Validate
6                       1                     Validate
7                       0                     Train
8                       1                     Train
9                       0                     Train

Third Iteration:

Independent Variable    Dependent Variable    Train / Validate
1                       0                     Train
2                       1                     Train
3                       0                     Train
4                       1                     Train
5                       0                     Train
6                       1                     Train
7                       0                     Validate
8                       1                     Validate
9                       0                     Validate
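
The three iterations above can be reproduced with a few lines of plain Python. This is only a sketch of the idea using contiguous, unshuffled folds; the helper name k_fold_indices is hypothetical and this is not how Scikit Learn implements its splitters.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, validate_idx) pairs for k contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validate_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, validate_idx
        start = stop


# The nine-row toy dataset from the text.
X = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0]

for iteration, (train_idx, validate_idx) in enumerate(k_fold_indices(len(X), k=3), start=1):
    print("Iteration {}: train={} validate={}".format(
        iteration, [X[i] for i in train_idx], [X[i] for i in validate_idx]))
```

Note that when cross_val_score is given a classifier and an integer cv, Scikit Learn uses stratified folds that preserve the class proportions, so its actual row assignments can differ from the contiguous chunks sketched here.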

With cross-validation available for scoring classifier models on different subsets of the training data, I would now like to use it with 5 folds to assess the two feature extraction techniques.

from sklearn.model_selection import cross_val_score

clf = LogisticRegression()

scores = cross_val_score(clf, X_train, y_train, cv=5)
scores_sw = cross_val_score(clf, X_train_sw, y_train, cv=5)

bw_val = np.mean(scores)
sw_val = np.mean(scores_sw)

print("Basic Bag Of Words: {}".format(bw_val))
print("Stop Word Filtered Bag of Words: {}".format(sw_val))
Basic Bag Of Words: 0.9705035249059897
Stop Word Filtered Bag of Words: 0.9754942461282659

This is actually a pretty interesting find. From the cross-validation I see that the stop word filtered feature set does produce a slightly better Logistic Regression model. It also shows that both feature sets run the risk of producing overfitted models, since the cross-validation scores for both are a couple of percentage points lower than the earlier scores computed on the training set itself.
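
One refinement worth considering: in the code above the vectorizer is fit on all of the training text before cross-validation, so each validation fold's words are already part of the vocabulary. A possible alternative, sketched below on a tiny made-up corpus, is to wrap the vectorizer and classifier in a Scikit Learn pipeline so that each fold builds its vocabulary from its own training portion. The corpus, labels, and variable names here are assumptions for illustration, and I have not measured how much this changes the Enron scores.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus; the texts and labels are illustrative only.
toy_txt = [
    "win a free prize now", "cheap meds online today",
    "free money click here", "claim your free prize",
    "meeting moved to thursday", "please review the gas contract",
    "call me about the pipeline", "lunch on friday morning",
]
toy_y = ["spam"] * 4 + ["ham"] * 4

# The vectorizer is a pipeline step, so it is re-fit inside every fold
# using only that fold's training documents.
pipe = make_pipeline(CountVectorizer(), LogisticRegression())
toy_scores = cross_val_score(pipe, toy_txt, toy_y, cv=2)

print("Per-fold accuracy: {}".format(toy_scores))
print("Mean accuracy: {}".format(toy_scores.mean()))
```

With the real data the same pattern would take train_txt and y_train in place of the toy lists. The scores on eight toy documents mean nothing in themselves; the point is only the shape of the workflow.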

Next I would like to see if using the TF-IDF feature extraction technique results in a more robust model. I can do this by swapping out the CountVectorizer for the TfidfVectorizer from the Scikit Learn library.

from sklearn.feature_extraction.text import TfidfVectorizer

v_tfidf = TfidfVectorizer()
v_tfidf.fit(train_txt)

X_train_tfidf = v_tfidf.transform(train_txt)
X_test_tfidf = v_tfidf.transform(test_txt)

clf_tfidf = LogisticRegression()
scores_tfidf = cross_val_score(clf_tfidf, X_train_tfidf, y_train, cv=5)

clf_tfidf.fit(X_train_tfidf, y_train)

tfidf_val = np.mean(scores_tfidf)
tfidf_test = clf_tfidf.score(X_test_tfidf, y_test)

print("TF-IDF Validation Score: {}".format(tfidf_val))
print("TF-IDF: {}".format(tfidf_test))
TF-IDF Validation Score: 0.9714105543844477
TF-IDF: 0.9853333333333333

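In the case of the model trained on TF-IDF features, I see a slight improvement in the test score (about 0.985 versus 0.983 for the basic bag of words, a difference of two messages out of 750). This suggests it may do a slightly better job of generalizing to new, unseen spam / ham emails than the other feature extraction techniques, although a gap this small could also come down to the particular split, and its cross-validation score (about 0.971) is slightly below that of the stop word filtered features (about 0.975). Below is a graph representing all the validation and scoring results of the three techniques for feature extraction and their effects on model building.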

fig, ax = plt.subplots(figsize=(12, 8))

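# bw_val, sw_val and tfidf_val are all 5-fold cross-validation means, so the three
# validation bars are measured the same way (bow_val is the plain training score)
train = np.array([ bw_val, sw_val, tfidf_val ]) * 100
test = np.array([ bow_test, sw_test, tfidf_test ]) * 100

w = 0.4
half_w = 0.2
x_mp = np.arange(len(train))
x_train = x_mp - half_w
x_test = x_mp + half_w

train_bars = ax.bar(x_train, train, width=w, label='validation (5-fold CV)')
test_bars = ax.bar(x_test, test, width=w, label='test')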

ax.set_ylabel('Accuracy (%)')
ax.set_title('Feature Extraction Comparison')
ax.set_xticks(x_mp)
ax.set_xticklabels(['BOW', 'SW', 'TF-IDF'])
ax.legend()

def annotate_values(ax, bars, dx=0):
    for bar in bars:
        y = bar.get_height()
        x = bar.get_x() + dx
        ax.annotate("{:.3f}".format(y),
                    xy=(x, y),
                    xytext=(0, 3),
                    textcoords='offset points',
                    ha='center',
                    va='bottom')


annotate_values(ax, train_bars, dx=half_w)     
annotate_values(ax, test_bars, dx=half_w)

plt.ylim((20, 110))

fig.tight_layout()
plt.show()

 

References

1. Stop Words
2. Bag of Words
3. Lexical Analysis
4. Natural Language Processing
5. Text Corpus
6. Spam / Ham Dataset
7. Predictive Modeling
8. Sparse Matrix
9. Supervised Learning
10. Unsupervised Learning
11. tf-idf
12. n-gram
13. Logistic Regression
14. Cross-Validation
15. Stemming and Lemmatization
16. Another Spam Ham Dataset

Conclusion

In this article I have covered quite a bit of ground surrounding the topics of Text Analytics and using Machine Learning to make predictions about a text classification problem, in this case identifying spam or ham email messages (aka spam filtering). Although there is a moderate amount of content here already, I have only scratched the surface of the larger subject, and I invite the interested reader to dive deeper into the references section, which was used to strengthen the theoretical portions of this discussion.

As always, I thank you for reading and feel free to ask questions or critique in the comments section below.


theCodingInterface