Bayesian Classification for Machine Learning for Computational Linguistics

Using token probabilities for classification

(C) 2017-2019 by Damir Cavar

Download: This and various other Jupyter notebooks are available from my GitHub repo.

Version: 1.2, September 2019

This is a tutorial related to the discussion of a Bayesian classifier in the textbook Machine Learning: The Art and Science of Algorithms that Make Sense of Data by Peter Flach.

This tutorial was developed as part of my course material for the course Machine Learning for Computational Linguistics in the Computational Linguistics Program of the Department of Linguistics at Indiana University.

Creating a Training Corpus

Assume that we have a set of e-mails that are annotated as spam or ham, as described in the textbook.

There are $4$ e-mails labeled ham and $1$ e-mail labeled spam, so our corpus contains $5$ texts in total.

If we picked an e-mail from the collection at random, the probability of picking a spam e-mail would be $1 / 5$.

Spam e-mails might differ from ham e-mails in just a few words. Here is a sample e-mail constructed with typical spam keywords:

In [1]:
spam = [ """Our medicine cures baldness. No diagnostics needed.
We guarantee Fast Viagra delivery.
We can provide Human growth hormone. The cheapest Life
Insurance with us. You can Lose weight with this treatment.
Our Medicine now and No medical exams necessary.
Our Online pharmacy is the best.  This cream Removes
wrinkles and Reverses aging.
One treatment and you will Stop snoring.  We sell Valium
and Viagra.
Our Vicodin will help with Weight loss. Cheap Xanax.""" ]


The data structure above is a list that contains a single string. Triple double quotes delimit a multi-line string. We can output the length of the list spam this way:

In [2]:
print(len(spam))

1


We can create a list of ham mails in a similar way:

In [3]:
ham = [ """Hi Hans, hope to see you soon at our family party.
When will you arrive.
All the best to the family.
Sue""",
"""Dear Ata,
did you receive my last email related to the car insurance
offer? I would be happy to discuss the details with you.
Please give me a call, if you have any questions.
John Smith
Super Car Insurance""",
"""Hi everyone:
This is just a gentle reminder of today's first 2017 SLS
Colloquium, from 2.30 to 4.00 pm, in Ballantine 103.
Rodica Frimu will present a job talk entitled "What is
so tricky in subject-verb agreement?". The text of the
abstract is below.
If you would like to present something during the Spring,
The current online schedule with updated title
information and abstracts is available under:
http://www.iub.edu/~psyling/SLSColloquium/Spring2017.html
See you soon,
Peter""",
"""Dear Friends,
As our first event of 2017, the Polish Studies Center
presents an evening with artist and filmmaker Wojtek Sawa.
7:30 p.m. in the Global and International Studies
Building room 1100 for a presentation by Wojtek Sawa
on his interactive  installation art piece The Wall
Speaks–Voices of the Unheard. A reception will follow
the event where you will have a chance to meet the artist
and discuss his work.
Best,"""]


The ham-mail list contains $4$ e-mails:

In [4]:
print(len(ham))

4


We can access a particular e-mail via index from either spam or ham:

In [5]:
print(spam[0])

Our medicine cures baldness. No diagnostics needed.
We guarantee Fast Viagra delivery.
We can provide Human growth hormone. The cheapest Life
Insurance with us. You can Lose weight with this treatment.
Our Medicine now and No medical exams necessary.
Our Online pharmacy is the best.  This cream Removes
wrinkles and Reverses aging.
One treatment and you will Stop snoring.  We sell Valium
and Viagra.
Our Vicodin will help with Weight loss. Cheap Xanax.

In [6]:
print(ham[3])

Dear Friends,
As our first event of 2017, the Polish Studies Center
presents an evening with artist and filmmaker Wojtek Sawa.
7:30 p.m. in the Global and International Studies
Building room 1100 for a presentation by Wojtek Sawa
on his interactive  installation art piece The Wall
Speaks–Voices of the Unheard. A reception will follow
the event where you will have a chance to meet the artist
and discuss his work.
Best,


We can lower-case the e-mail using the string method lower():

In [7]:
print(ham[3].lower())

dear friends,
as our first event of 2017, the polish studies center
presents an evening with artist and filmmaker wojtek sawa.
7:30 p.m. in the global and international studies
building room 1100 for a presentation by wojtek sawa
on his interactive  installation art piece the wall
speaks–voices of the unheard. a reception will follow
the event where you will have a chance to meet the artist
and discuss his work.
best,


We can loop over all e-mails in spam or ham and lower-case the content:

In [8]:
for text in ham:
    print(text.lower())

hi hans, hope to see you soon at our family party.
when will you arrive.
all the best to the family.
sue
dear ata,
did you receive my last email related to the car insurance
offer? i would be happy to discuss the details with you.
please give me a call, if you have any questions.
john smith
super car insurance
hi everyone:
this is just a gentle reminder of today's first 2017 sls
colloquium, from 2.30 to 4.00 pm, in ballantine 103.
rodica frimu will present a job talk entitled "what is
so tricky in subject-verb agreement?". the text of the
abstract is below.
if you would like to present something during the spring,
the current online schedule with updated title
information and abstracts is available under:
http://www.iub.edu/~psyling/slscolloquium/spring2017.html
see you soon,
peter
dear friends,
as our first event of 2017, the polish studies center
presents an evening with artist and filmmaker wojtek sawa.
7:30 p.m. in the global and international studies
building room 1100 for a presentation by wojtek sawa
on his interactive  installation art piece the wall
speaks–voices of the unheard. a reception will follow
the event where you will have a chance to meet the artist
and discuss his work.
best,


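We can use the tokenizer from NLTK to split the lower-cased text into single tokens (words and punctuation marks). Note that word_tokenize relies on NLTK's Punkt tokenizer data, which may have to be downloaded once, for example with nltk.download('punkt'):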

In [9]:
from nltk import word_tokenize

print(word_tokenize(ham[0].lower()))

['hi', 'hans', ',', 'hope', 'to', 'see', 'you', 'soon', 'at', 'our', 'family', 'party', '.', 'when', 'will', 'you', 'arrive', '.', 'all', 'the', 'best', 'to', 'the', 'family', '.', 'sue']


We can count the number of tokens and types in a lower-cased text:

In [10]:
from collections import Counter

myCounts = Counter(word_tokenize("This is a test. Will this test teach us how to count tokens?".lower()))

print(myCounts)
print("number of  types:", len(myCounts))
print("number of tokens:", sum(myCounts.values()))

Counter({'this': 2, 'test': 2, 'is': 1, 'a': 1, '.': 1, 'will': 1, 'teach': 1, 'us': 1, 'how': 1, 'to': 1, 'count': 1, 'tokens': 1, '?': 1})
number of  types: 13
number of tokens: 15


Now we can create a frequency profile of ham and spam words given the two text collections:

In [11]:
hamFP = Counter()
spamFP = Counter()

for text in spam:
    spamFP.update(word_tokenize(text.lower()))

for text in ham:
    hamFP.update(word_tokenize(text.lower()))

print("Ham:\n",  hamFP)
print("-" * 30)
print("Spam:\n", spamFP)

Ham:
Counter({'the': 14, ',': 11, '.': 11, 'to': 8, 'you': 8, 'a': 6, 'will': 4, 'is': 4, 'of': 4, 'and': 4, 'with': 3, 'please': 3, ':': 3, '2017': 3, 'in': 3, 'hi': 2, 'see': 2, 'soon': 2, 'our': 2, 'family': 2, 'best': 2, 'dear': 2, 'car': 2, 'insurance': 2, '?': 2, 'would': 2, 'discuss': 2, 'me': 2, 'if': 2, 'have': 2, 'first': 2, 'from': 2, 'present': 2, 'event': 2, 'studies': 2, 'artist': 2, 'wojtek': 2, 'sawa': 2, 'on': 2, 'p.m.': 2, 'his': 2, 'hans': 1, 'hope': 1, 'at': 1, 'party': 1, 'when': 1, 'arrive': 1, 'all': 1, 'sue': 1, 'ata': 1, 'did': 1, 'receive': 1, 'my': 1, 'last': 1, 'email': 1, 'related': 1, 'offer': 1, 'i': 1, 'be': 1, 'happy': 1, 'details': 1, 'give': 1, 'call': 1, 'any': 1, 'questions': 1, 'john': 1, 'smith': 1, 'super': 1, 'everyone': 1, 'this': 1, 'just': 1, 'gentle': 1, 'reminder': 1, 'today': 1, "'s": 1, 'sls': 1, 'colloquium': 1, '2.30': 1, '4.00': 1, 'pm': 1, 'ballantine': 1, '103.': 1, 'rodica': 1, 'frimu': 1, 'job': 1, 'talk': 1, 'entitled': 1, '': 1, 'what': 1, 'so': 1, 'tricky': 1, 'subject-verb': 1, 'agreement': 1, "''": 1, 'text': 1, 'abstract': 1, 'below': 1, 'like': 1, 'something': 1, 'during': 1, 'spring': 1, 'let': 1, 'know': 1, 'current': 1, 'online': 1, 'schedule': 1, 'updated': 1, 'title': 1, 'information': 1, 'abstracts': 1, 'available': 1, 'under': 1, 'http': 1, '//www.iub.edu/~psyling/slscolloquium/spring2017.html': 1, 'peter': 1, 'friends': 1, 'as': 1, 'polish': 1, 'center': 1, 'presents': 1, 'an': 1, 'evening': 1, 'filmmaker': 1, 'join': 1, 'us': 1, 'january': 1, '26': 1, '5:30': 1, '7:30': 1, 'global': 1, 'international': 1, 'building': 1, 'room': 1, '1100': 1, 'for': 1, 'presentation': 1, 'by': 1, 'interactive': 1, 'installation': 1, 'art': 1, 'piece': 1, 'wall': 1, 'speaks–voices': 1, 'unheard': 1, 'reception': 1, 'follow': 1, 'where': 1, 'chance': 1, 'meet': 1, 'work': 1})
------------------------------
Spam:
Counter({'.': 13, 'our': 4, 'and': 4, 'we': 3, 'with': 3, 'medicine': 2, 'no': 2, 'viagra': 2, 'can': 2, 'the': 2, 'you': 2, 'weight': 2, 'this': 2, 'treatment': 2, 'will': 2, 'cures': 1, 'baldness': 1, 'diagnostics': 1, 'needed': 1, 'guarantee': 1, 'fast': 1, 'delivery': 1, 'provide': 1, 'human': 1, 'growth': 1, 'hormone': 1, 'cheapest': 1, 'life': 1, 'insurance': 1, 'us': 1, 'lose': 1, 'now': 1, 'medical': 1, 'exams': 1, 'necessary': 1, 'online': 1, 'pharmacy': 1, 'is': 1, 'best': 1, 'cream': 1, 'removes': 1, 'wrinkles': 1, 'reverses': 1, 'aging': 1, 'one': 1, 'stop': 1, 'snoring': 1, 'sell': 1, 'valium': 1, 'vicodin': 1, 'help': 1, 'loss': 1, 'cheap': 1, 'xanax': 1})


The probability of randomly picking a spam or a ham e-mail is the number of e-mails in that class divided by the total number of e-mails:

In [12]:
total = len(spam) + len(ham)

spamP = len(spam) / total
hamP  = len(ham) / total

print("probability to pick spam:", spamP)
print("probability to pick  ham:", hamP)

probability to pick spam: 0.2
probability to pick  ham: 0.8


We will need the total token count of each class to calculate the relative frequency of the tokens, that is, to generate likelihood estimates. As a crude way of reserving some probability mass for unknown tokens, we add one to each total.

In [13]:
totalSpam = sum(spamFP.values()) + 1
totalHam  = sum(hamFP.values()) + 1

print("total spam counts + 1:", totalSpam)
print("total  ham counts + 1:", totalHam)

total spam counts + 1: 87
total  ham counts + 1: 251


We can now convert the counts in the frequency profiles to relative frequencies:

In [14]:
hamFP  = Counter({token: frequency / totalHam  for token, frequency in hamFP.items()})
spamFP = Counter({token: frequency / totalSpam for token, frequency in spamFP.items()})

print(hamFP)
print("-" * 30)
print(spamFP)

Counter({'the': 0.055776892430278883, ',': 0.043824701195219126, '.': 0.043824701195219126, 'to': 0.03187250996015936, 'you': 0.03187250996015936, 'a': 0.02390438247011952, 'will': 0.01593625498007968, 'is': 0.01593625498007968, 'of': 0.01593625498007968, 'and': 0.01593625498007968, 'with': 0.01195219123505976, 'please': 0.01195219123505976, ':': 0.01195219123505976, '2017': 0.01195219123505976, 'in': 0.01195219123505976, 'hi': 0.00796812749003984, 'see': 0.00796812749003984, 'soon': 0.00796812749003984, 'our': 0.00796812749003984, 'family': 0.00796812749003984, 'best': 0.00796812749003984, 'dear': 0.00796812749003984, 'car': 0.00796812749003984, 'insurance': 0.00796812749003984, '?': 0.00796812749003984, 'would': 0.00796812749003984, 'discuss': 0.00796812749003984, 'me': 0.00796812749003984, 'if': 0.00796812749003984, 'have': 0.00796812749003984, 'first': 0.00796812749003984, 'from': 0.00796812749003984, 'present': 0.00796812749003984, 'event': 0.00796812749003984, 'studies': 0.00796812749003984, 'artist': 0.00796812749003984, 'wojtek': 0.00796812749003984, 'sawa': 0.00796812749003984, 'on': 0.00796812749003984, 'p.m.': 0.00796812749003984, 'his': 0.00796812749003984, 'hans': 0.00398406374501992, 'hope': 0.00398406374501992, 'at': 0.00398406374501992, 'party': 0.00398406374501992, 'when': 0.00398406374501992, 'arrive': 0.00398406374501992, 'all': 0.00398406374501992, 'sue': 0.00398406374501992, 'ata': 0.00398406374501992, 'did': 0.00398406374501992, 'receive': 0.00398406374501992, 'my': 0.00398406374501992, 'last': 0.00398406374501992, 'email': 0.00398406374501992, 'related': 0.00398406374501992, 'offer': 0.00398406374501992, 'i': 0.00398406374501992, 'be': 0.00398406374501992, 'happy': 0.00398406374501992, 'details': 0.00398406374501992, 'give': 0.00398406374501992, 'call': 0.00398406374501992, 'any': 0.00398406374501992, 'questions': 0.00398406374501992, 'john': 0.00398406374501992, 'smith': 0.00398406374501992, 'super': 0.00398406374501992, 'everyone': 
0.00398406374501992, 'this': 0.00398406374501992, 'just': 0.00398406374501992, 'gentle': 0.00398406374501992, 'reminder': 0.00398406374501992, 'today': 0.00398406374501992, "'s": 0.00398406374501992, 'sls': 0.00398406374501992, 'colloquium': 0.00398406374501992, '2.30': 0.00398406374501992, '4.00': 0.00398406374501992, 'pm': 0.00398406374501992, 'ballantine': 0.00398406374501992, '103.': 0.00398406374501992, 'rodica': 0.00398406374501992, 'frimu': 0.00398406374501992, 'job': 0.00398406374501992, 'talk': 0.00398406374501992, 'entitled': 0.00398406374501992, '': 0.00398406374501992, 'what': 0.00398406374501992, 'so': 0.00398406374501992, 'tricky': 0.00398406374501992, 'subject-verb': 0.00398406374501992, 'agreement': 0.00398406374501992, "''": 0.00398406374501992, 'text': 0.00398406374501992, 'abstract': 0.00398406374501992, 'below': 0.00398406374501992, 'like': 0.00398406374501992, 'something': 0.00398406374501992, 'during': 0.00398406374501992, 'spring': 0.00398406374501992, 'let': 0.00398406374501992, 'know': 0.00398406374501992, 'current': 0.00398406374501992, 'online': 0.00398406374501992, 'schedule': 0.00398406374501992, 'updated': 0.00398406374501992, 'title': 0.00398406374501992, 'information': 0.00398406374501992, 'abstracts': 0.00398406374501992, 'available': 0.00398406374501992, 'under': 0.00398406374501992, 'http': 0.00398406374501992, '//www.iub.edu/~psyling/slscolloquium/spring2017.html': 0.00398406374501992, 'peter': 0.00398406374501992, 'friends': 0.00398406374501992, 'as': 0.00398406374501992, 'polish': 0.00398406374501992, 'center': 0.00398406374501992, 'presents': 0.00398406374501992, 'an': 0.00398406374501992, 'evening': 0.00398406374501992, 'filmmaker': 0.00398406374501992, 'join': 0.00398406374501992, 'us': 0.00398406374501992, 'january': 0.00398406374501992, '26': 0.00398406374501992, '5:30': 0.00398406374501992, '7:30': 0.00398406374501992, 'global': 0.00398406374501992, 'international': 0.00398406374501992, 'building': 0.00398406374501992, 
'room': 0.00398406374501992, '1100': 0.00398406374501992, 'for': 0.00398406374501992, 'presentation': 0.00398406374501992, 'by': 0.00398406374501992, 'interactive': 0.00398406374501992, 'installation': 0.00398406374501992, 'art': 0.00398406374501992, 'piece': 0.00398406374501992, 'wall': 0.00398406374501992, 'speaks–voices': 0.00398406374501992, 'unheard': 0.00398406374501992, 'reception': 0.00398406374501992, 'follow': 0.00398406374501992, 'where': 0.00398406374501992, 'chance': 0.00398406374501992, 'meet': 0.00398406374501992, 'work': 0.00398406374501992})
------------------------------
Counter({'.': 0.14942528735632185, 'our': 0.04597701149425287, 'and': 0.04597701149425287, 'we': 0.034482758620689655, 'with': 0.034482758620689655, 'medicine': 0.022988505747126436, 'no': 0.022988505747126436, 'viagra': 0.022988505747126436, 'can': 0.022988505747126436, 'the': 0.022988505747126436, 'you': 0.022988505747126436, 'weight': 0.022988505747126436, 'this': 0.022988505747126436, 'treatment': 0.022988505747126436, 'will': 0.022988505747126436, 'cures': 0.011494252873563218, 'baldness': 0.011494252873563218, 'diagnostics': 0.011494252873563218, 'needed': 0.011494252873563218, 'guarantee': 0.011494252873563218, 'fast': 0.011494252873563218, 'delivery': 0.011494252873563218, 'provide': 0.011494252873563218, 'human': 0.011494252873563218, 'growth': 0.011494252873563218, 'hormone': 0.011494252873563218, 'cheapest': 0.011494252873563218, 'life': 0.011494252873563218, 'insurance': 0.011494252873563218, 'us': 0.011494252873563218, 'lose': 0.011494252873563218, 'now': 0.011494252873563218, 'medical': 0.011494252873563218, 'exams': 0.011494252873563218, 'necessary': 0.011494252873563218, 'online': 0.011494252873563218, 'pharmacy': 0.011494252873563218, 'is': 0.011494252873563218, 'best': 0.011494252873563218, 'cream': 0.011494252873563218, 'removes': 0.011494252873563218, 'wrinkles': 0.011494252873563218, 'reverses': 0.011494252873563218, 'aging': 0.011494252873563218, 'one': 0.011494252873563218, 'stop': 0.011494252873563218, 'snoring': 0.011494252873563218, 'sell': 0.011494252873563218, 'valium': 0.011494252873563218, 'vicodin': 0.011494252873563218, 'help': 0.011494252873563218, 'loss': 0.011494252873563218, 'cheap': 0.011494252873563218, 'xanax': 0.011494252873563218})


We can now compute the default probability that we assign to unknown words as $1 / \mathit{totalSpam}$ or $1 / \mathit{totalHam}$, respectively. Whenever we encounter a token that is not in a frequency profile, we assign the corresponding default probability to it.

In [15]:
defaultSpam = 1 / totalSpam
defaultHam  = 1 / totalHam

print("default spam probability:", defaultSpam)
print("default  ham probability:", defaultHam)

default spam probability: 0.011494252873563218
default  ham probability: 0.00398406374501992
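The scheme above reserves exactly one slot of size 1 / total for unknown tokens. A common alternative, not used in this tutorial, is add-one (Laplace) smoothing, where every vocabulary entry and one unknown slot get an extra count. The following is only a minimal sketch under that assumption; the toy counts, the function name laplace_probability and the parameter alpha are made up for illustration.

```python
from collections import Counter

# Toy counts standing in for one class frequency profile (hypothetical data).
counts = Counter({"viagra": 2, "cheap": 1, "our": 4})

# The vocabulary would normally be the union of the tokens seen in all classes;
# here one token is added by hand to play the role of a ham-only word.
vocabulary = set(counts) | {"family"}

def laplace_probability(token, counts, vocabulary, alpha=1.0):
    """Add-alpha estimate of P(token | class); unseen tokens get a nonzero share."""
    total = sum(counts.values())
    # The extra "+ 1" slot in the denominator reserves mass for unknown tokens.
    return (counts[token] + alpha) / (total + alpha * (len(vocabulary) + 1))

print(laplace_probability("our", counts, vocabulary))     # seen in this class
print(laplace_probability("family", counts, vocabulary))  # seen only in the other class
print(laplace_probability("planet", counts, vocabulary))  # never seen at all
```

With this estimate the probabilities of all vocabulary entries plus one unknown slot sum to one, which is not guaranteed when several different unknown tokens each receive 1 / total.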


We can test an unknown document by calculating how likely it is to have been generated by the hamFP distribution or by the spamFP distribution. We tokenize the lower-cased document and compute the product of the likelihoods of all its tokens. We then scale this likelihood by the probability of randomly picking a ham or a spam e-mail. Let us calculate this score for the hypothesis that the unknown e-mail is spam:
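Written as a formula, and assuming that the tokens are independent of each other given the class (the naive Bayes assumption), the decision rule described here can be sketched as follows. The symbols $t_1, \dots, t_n$ for the tokens of the unknown e-mail and $c$ for the class are notation introduced only for this sketch:

$$
\hat{c} = \underset{c \in \{\mathit{spam},\, \mathit{ham}\}}{\arg\max} \; P(c) \prod_{i=1}^{n} P(t_i \mid c)
$$

Here $P(c)$ corresponds to spamP or hamP, and $P(t_i \mid c)$ to the entry for $t_i$ in spamFP or hamFP, or to the default probability if $t_i$ is unknown.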

In [16]:
unknownEmail = """Dear ,
we sell the cheapest and best Viagra on the planet. Our delivery is guaranteed confident and cheap.
"""

tokens = word_tokenize(unknownEmail.lower())

result = 1.0
for token in tokens:
    result *= spamFP.get(token, defaultSpam)

print(result * spamP)

9.669490943645368e-37


Since this number is very small, and products of many probabilities quickly approach the limits of floating-point precision, a better strategy is to sum up the log-likelihoods:

In [17]:
from math import log

resultSpam = 0.0
for token in tokens:
    resultSpam += log(spamFP.get(token, defaultSpam), 2)
# use base 2 for the prior as well, so that all terms share one base
resultSpam += log(spamP, 2)

print(round(resultSpam, 4))

-119.6379

In [18]:
resultHam = 0.0
for token in tokens:
    resultHam += log(hamFP.get(token, defaultHam), 2)
# use base 2 for the prior as well, so that all terms share one base
resultHam += log(hamP, 2)

print(round(resultHam, 4))

-139.7313
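The advantage of summing logarithms becomes more visible with longer texts. The following small sketch uses a made-up per-token probability and token count (both are assumptions for illustration, not values from the corpus above) to show the plain product collapsing to zero while the log sum stays usable.

```python
from math import log

# Hypothetical values: every token gets the same probability.
p = 0.01
n_tokens = 200

product = 1.0
log_sum = 0.0
for _ in range(n_tokens):
    product *= p          # shrinks towards 0 and eventually underflows
    log_sum += log(p, 2)  # stays a moderate negative number

print(product)
print(log_sum)
```

Once the product has underflowed for both classes, the two scores can no longer be compared, whereas the log sums still can.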


The log score for spam is larger (less negative) than the one for ham. Our simple classifier therefore guesses that this e-mail is spam.

In [19]:
# a tie is resolved in favor of ham
if resultHam >= resultSpam:
    print("e-mail is ham")
else:
    print("e-mail is spam")

e-mail is spam
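The individual steps of this tutorial can be bundled into a few reusable functions. The following is only a sketch of one possible arrangement: the function names train, log_score and classify, the toy texts, and the whitespace tokenizer are all assumptions made for illustration, and word_tokenize could be passed in instead of str.split.

```python
from collections import Counter
from math import log

def train(texts, tokenize):
    """Return (relative frequencies, default probability) for one class."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text.lower()))
    total = sum(counts.values()) + 1  # +1 reserves mass for unknown tokens
    probabilities = {token: count / total for token, count in counts.items()}
    return probabilities, 1 / total

def log_score(tokens, probabilities, default, prior):
    """Sum of base-2 log-likelihoods plus the base-2 log prior."""
    score = log(prior, 2)
    for token in tokens:
        score += log(probabilities.get(token, default), 2)
    return score

def classify(text, models, priors, tokenize):
    """Pick the label with the highest log score."""
    tokens = tokenize(text.lower())
    scores = {label: log_score(tokens, *models[label], priors[label])
              for label in models}
    return max(scores, key=scores.get)

# Whitespace splitting keeps the sketch free of external dependencies.
tokenize = str.split
spam_texts = ["we sell cheap viagra and cheap medicine",
              "cheap delivery of our medicine"]
ham_texts = ["see you at the family party",
             "call me about the car insurance"]

models = {"spam": train(spam_texts, tokenize), "ham": train(ham_texts, tokenize)}
n = len(spam_texts) + len(ham_texts)
priors = {"spam": len(spam_texts) / n, "ham": len(ham_texts) / n}

print(classify("cheap medicine delivery", models, priors, tokenize))
print(classify("see you at the party", models, priors, tokenize))
```

The two toy collections were chosen to contain the same number of tokens. With the 1 / total default, the class with fewer training tokens assigns a higher probability to every unknown word, so very unbalanced collections can tilt the decision towards the smaller class.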


There are numerous ways to improve the algorithm and the tutorial. Please send me your suggestions.