Python NLTK: Texts and Frequencies

(C) 2017-2019 by Damir Cavar <dcavar@iu.edu>

Version: 0.3, September 2019

Download: This and various other Jupyter notebooks are available from my GitHub repo.

This is a brief introduction to NLTK for simple frequency analysis of texts. I created this notebook for intro to corpus linguistics and natural language processing classes at Indiana University between 2017 and 2019.

For this to work, the folder that contains this notebook needs a subfolder data with a file HOPG.txt. This file contains "A House of Pomegranates" by Oscar Wilde, a collection of fairy tales, taken as raw text from Project Gutenberg.

Simple File Processing

Reading a text into memory in Python is fairly simple. We open a file, read from it, and close the file again:

In [30]:
ifile = open("data/HOPG.txt", mode='r', encoding='utf-8')
text = ifile.read()
ifile.close()
print(text[:200], "...")
A HOUSE OF POMEGRANATES




Contents:

The Young King
The Birthday of the Infanta
The Fisherman and his Soul
The Star-child




THE YOUNG KING




[TO MARGARET LADY BROOKE--THE RANEE OF SARAWAK]


It  ...

The optional parameters in the open function above define the mode of operations on the file and the encoding of the content. Setting the mode to r declares that reading is the only operation we will perform on the file in the following code. Setting the encoding to utf-8 tells Python to decode the content of the file as Unicode text stored in the UTF-8 encoding.
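
The open, read, and close steps can also be written with a with block, which closes the file automatically, even if reading raises an error. The following is a minimal sketch. It writes a small temporary file (the name sample.txt and its content are made up for illustration) so that it runs without data/HOPG.txt:

```python
import os
import tempfile

# Create a small throwaway file; the name and content are illustrative only.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "sample.txt")
with open(path, mode='w', encoding='utf-8') as ofile:
    ofile.write("A HOUSE OF POMEGRANATES\n")

# The with block closes the file as soon as the block is left.
with open(path, mode='r', encoding='utf-8') as ifile:
    sample = ifile.read()

print(sample[:7], "...")
print(ifile.closed)
```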

We can now import the NLTK module in Python to work with frequency profiles and n-grams using the tokens or words in the text.

In [31]:
import nltk

We can now lowercase the text, which means normalizing all characters to lower case:

In [32]:
text = text.lower()

To generate a frequency profile from the text, we can use the NLTK class FreqDist. Since we pass it a string, it counts the individual characters:

In [33]:
myFD = nltk.FreqDist(text)

We can remove certain characters from the distribution, or alternatively replace these characters in the text variable:

In [34]:
for x in ":,.-[];!'\"\t\n/ ?":
    del myFD[x]
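
Instead of deleting unwanted characters one by one, we could keep only alphabetic characters before counting. This is a minimal sketch that uses collections.Counter from the standard library as a stand-in for nltk.FreqDist (which is a subclass of Counter) and a made-up sample string:

```python
from collections import Counter

sample = "the young king: it was the night before the day!"

# Keep only alphabetic characters; str.isalpha() is False for
# punctuation, digits and whitespace.
letters = [c for c in sample.lower() if c.isalpha()]
letterFD = Counter(letters)

print(letterFD.most_common(3))
```

Filtering by character class avoids having to list every punctuation symbol that might occur in the text.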

We can print out the frequency profile by looping through the returned data structure:

In [36]:
for x in myFD:
    print(x, myFD[x])
a 11231
h 10802
o 9408
u 3269
s 8093
e 17372
f 3089
p 1884
m 3271
g 2666
r 7603
n 8843
t 12521
c 2693
y 2168
k 1026
i 8307
b 1812
d 7249
l 5270
w 3665
x 48
v 1122
q 81
j 103
z 61

To relativize the frequencies, we need to compute the total number of characters. This assumes that we removed all punctuation symbols.

In [37]:
total = float(sum(myFD.values()))
print(total)
133657.0

We can now generate a probability distribution over characters:

In [38]:
relfrq = [ x/total for x in myFD.values() ]
print(relfrq)
[0.08402852076584093, 0.0808188123330615, 0.07038913038598801, 0.024458127894536014, 0.06055051362816762, 0.1299744869329702, 0.02311139708357961, 0.014095782488010355, 0.024473091570213306, 0.019946579677832064, 0.056884413087230745, 0.06616189200715264, 0.09368009157769515, 0.020148589299475522, 0.016220624434186013, 0.007676365622451499, 0.06215162692563801, 0.013557090163627793, 0.05423584249234982, 0.03942928540966803, 0.0274209356786401, 0.0003591282162550409, 0.008394622054961581, 0.0006060288649303815, 0.0007706292973806085, 0.0004563921081574478]
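
A plain list of probabilities loses the link between a probability and its character. The following small sketch keeps that link in a dictionary. It again uses collections.Counter as a stand-in for the frequency distribution and a made-up string as data; the name relfrq_by_char is chosen here for illustration:

```python
from collections import Counter

charFD = Counter("abracadabra")
total = sum(charFD.values())

# Map each character to its relative frequency.
relfrq_by_char = {c: n / total for c, n in charFD.items()}

for c, p in sorted(relfrq_by_char.items()):
    print(c, round(p, 3))

# The relative frequencies of a complete distribution sum to 1
# (up to floating point rounding).
print(sum(relfrq_by_char.values()))
```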

We need the math log function:

In [39]:
from math import log

We can define the entropy function according to the equation $I = - \sum_x P(x) \log_2 P(x)$ as:

In [40]:
def entropy(p):
    res = 0.0
    for x in p:
        res += x * log(x, 2)
    return -res

We can now compute the entropy of the character distribution:

In [41]:
print(entropy(relfrq))
4.124824125135032
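
As a sanity check for an entropy function, a uniform distribution over four outcomes has an entropy of 2 bits, and a fair coin has 1 bit. The sketch below uses math.log2 and skips zero probabilities, for which the logarithm is undefined. The function name entropy_bits is chosen here for illustration:

```python
from math import log2

def entropy_bits(p):
    """Entropy in bits of a list of probabilities."""
    # Terms with probability 0 contribute nothing and are skipped,
    # since log2(0) is undefined.
    return -sum(x * log2(x) for x in p if x > 0)

print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes
print(entropy_bits([0.5, 0.5]))                # fair coin
print(entropy_bits([0.5, 0.5, 0.0]))           # zero entry is ignored
```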

We might be interested in the point-wise entropy of the characters in this distribution, that is, the contribution $-P(x) \log_2 P(x)$ of each single character to the total entropy. We can compute that in the following way:

In [42]:
entdist = [ -x * log(x, 2) for x in relfrq ]
print(entdist)
[0.3002319806578157, 0.29330480822209687, 0.2694850339615855, 0.13093762006835424, 0.24497024185626542, 0.38260584969385675, 0.12561626066788154, 0.08666922421352652, 0.13099613411450642, 0.11265259339509692, 0.2352638524022937, 0.2592127456210649, 0.32002184459468397, 0.11350057702872672, 0.09644826809662171, 0.053929238563860095, 0.24910770047189124, 0.08411915007238721, 0.2280405439219216, 0.18392139623527803, 0.1422756742696596, 0.004109580806277946, 0.05789199083219161, 0.006477433994507777, 0.007969598004767611, 0.005064783367910896]

We could now compute the variance over this point-wise entropy distribution...
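
One way to do that is with the standard statistics module. This is a minimal sketch on a short made-up list of values; with the real data, entdist would be passed instead. Note that pvariance treats the list as the whole population, while statistics.variance treats it as a sample:

```python
from statistics import mean, pvariance

# Illustrative values only, not taken from the text above.
values = [0.5, 0.5, 0.25, 0.25]

mu = mean(values)
var = pvariance(values)  # population variance: mean squared deviation

print(mu, var)
```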

From Characters to Words/Tokens

We see that the frequency profile is for the characters in the text, not the words or tokens. In order to generate a frequency profile over words/tokens in the text, we need to utilize a tokenizer. NLTK provides basic tokenization functions. We will use the word_tokenize function to generate a list of tokens:

In [47]:
tokens = nltk.word_tokenize(text)

We can now generate a frequency profile from the token list:

In [48]:
myTokenFD = nltk.FreqDist(tokens)

The frequency profile can be printed out in the same way as above, by looping over the tokens and their frequencies:

In [51]:
for token in list(myTokenFD.items())[:10]:
    print(token[0], token[1])
print("...")
a 673
house 26
of 1108
pomegranates 6
contents 1
: 33
the 2552
young 112
king 70
birthday 8
...
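
The loop above prints tokens in the order in which they first occur in the text. Since FreqDist is a subclass of collections.Counter, its most_common method returns the entries ordered by decreasing frequency. The following small sketch uses Counter, with a whitespace split standing in for nltk.word_tokenize, on a made-up sentence:

```python
from collections import Counter

sample = "the young king and the old king and the page"
sampleTokens = sample.split()  # crude stand-in for a real tokenizer

sampleFD = Counter(sampleTokens)

# most_common(n) returns the n most frequent (token, count) pairs.
for token, count in sampleFD.most_common(3):
    print(token, count)
```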

Counting N-grams

NLTK provides simple methods to generate n-gram models or frequency profiles over n-grams from any kind of list or sequence. We can for example generate a bi-gram model, that is an n-gram model for n = 2, from the text tokens:

In [62]:
myTokenBigrams = nltk.ngrams(tokens, 2)

The result is a Python generator object. We can loop over it, as in the examples above, or convert it to a list and print part of that list:

In [63]:
bigrams = list(myTokenBigrams)
print(len(bigrams))
print(bigrams[:30], "...")
38126
[('a', 'house'), ('house', 'of'), ('of', 'pomegranates'), ('pomegranates', 'contents'), ('contents', ':'), (':', 'the'), ('the', 'young'), ('young', 'king'), ('king', 'the'), ('the', 'birthday'), ('birthday', 'of'), ('of', 'the'), ('the', 'infanta'), ('infanta', 'the'), ('the', 'fisherman'), ('fisherman', 'and'), ('and', 'his'), ('his', 'soul'), ('soul', 'the'), ('the', 'star-child'), ('star-child', 'the'), ('the', 'young'), ('young', 'king'), ('king', '['), ('[', 'to'), ('to', 'margaret'), ('margaret', 'lady'), ('lady', 'brooke'), ('brooke', '--'), ('--', 'the')] ...
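
To see how such bigrams come about, they can be built in plain Python by pairing each token with its successor. This is a sketch with a short made-up token list; a list of k tokens yields k - 1 bigrams:

```python
sampleTokens = ["a", "house", "of", "pomegranates"]

# Pair every token with the token that follows it.
sampleBigrams = list(zip(sampleTokens, sampleTokens[1:]))

print(sampleBigrams)
print(len(sampleTokens), len(sampleBigrams))
```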

The frequency profile from these bigrams can be generated in the same way as from the token list above:

In [76]:
myBigramFD = nltk.FreqDist(bigrams)

If we want to know some more general properties of the frequency distribution, we can print out information about it:

In [77]:
print(myBigramFD)
<FreqDist with 17766 samples and 38126 outcomes>
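
In this summary, samples are the distinct bigrams (types) and outcomes are all counted occurrences (tokens). Both numbers can be recomputed from any Counter-like object. The sketch below uses collections.Counter as a stand-in and a made-up bigram list:

```python
from collections import Counter

sampleBigrams = [("the", "king"), ("king", "and"), ("and", "the"),
                 ("the", "king")]
sampleFD = Counter(sampleBigrams)

samples = len(sampleFD)            # number of distinct bigrams
outcomes = sum(sampleFD.values())  # total number of counted bigrams

print(samples, outcomes)
```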

The bigrams and their corresponding frequencies can be printed using a loop:

In [78]:
for bigram in list(myBigramFD.items())[:20]:
    print(bigram[0], bigram[1])
print("...")
('a', 'house') 4
('house', 'of') 5
('of', 'pomegranates') 4
('pomegranates', 'contents') 1
('contents', ':') 1
(':', 'the') 2
('the', 'young') 107
('young', 'king') 24
('king', 'the') 1
('the', 'birthday') 3
('birthday', 'of') 3
('of', 'the') 354
('the', 'infanta') 33
('infanta', 'the') 1
('the', 'fisherman') 5
('fisherman', 'and') 3
('and', 'his') 53
('his', 'soul') 51
('soul', 'the') 1
('the', 'star-child') 41
...

Pretty printing the bigrams is possible as well:

In [79]:
for ngram in list(myBigramFD.items())[:20]:
    print(" ".join(ngram[0]), ngram[1])
print("...")
a house 4
house of 5
of pomegranates 4
pomegranates contents 1
contents : 1
: the 2
the young 107
young king 24
king the 1
the birthday 3
birthday of 3
of the 354
the infanta 33
infanta the 1
the fisherman 5
fisherman and 3
and his 53
his soul 51
soul the 1
the star-child 41
...

Instead of running the frequency profile through a loop we can also use a list comprehension construction in Python to generate a list of tuples with the n-gram and its frequency:

In [80]:
ngrams = [ (" ".join(ngram), myBigramFD[ngram]) for ngram in myBigramFD ]
print(ngrams[:100])
[('a house', 4), ('house of', 5), ('of pomegranates', 4), ('pomegranates contents', 1), ('contents :', 1), (': the', 2), ('the young', 107), ('young king', 24), ('king the', 1), ('the birthday', 3), ('birthday of', 3), ('of the', 354), ('the infanta', 33), ('infanta the', 1), ('the fisherman', 5), ('fisherman and', 3), ('and his', 53), ('his soul', 51), ('soul the', 1), ('the star-child', 41), ('star-child the', 1), ('king [', 1), ('[ to', 4), ('to margaret', 1), ('margaret lady', 1), ('lady brooke', 1), ('brooke --', 1), ('-- the', 4), ('the ranee', 1), ('ranee of', 1), ('of sarawak', 1), ('sarawak ]', 1), ('] it', 2), ('it was', 52), ('was the', 20), ('the night', 3), ('night before', 1), ('before the', 8), ('the day', 7), ('day fixed', 1), ('fixed for', 1), ('for his', 7), ('his coronation', 2), ('coronation ,', 3), (', and', 1271), ('and the', 287), ('king was', 1), ('was sitting', 1), ('sitting alone', 1), ('alone in', 1), ('in his', 31), ('his beautiful', 1), ('beautiful chamber', 1), ('chamber .', 1), ('. his', 8), ('his courtiers', 1), ('courtiers had', 1), ('had all', 2), ('all taken', 1), ('taken their', 1), ('their leave', 1), ('leave of', 1), ('of him', 8), ('him ,', 146), (', bowing', 1), ('bowing their', 1), ('their heads', 5), ('heads to', 1), ('to the', 149), ('the ground', 15), ('ground ,', 6), (', according', 1), ('according to', 1), ('the ceremonious', 1), ('ceremonious usage', 1), ('usage of', 1), ('day ,', 3), ('and had', 9), ('had retired', 2), ('retired to', 2), ('the great', 25), ('great hall', 2), ('hall of', 1), ('the palace', 18), ('palace ,', 5), (', to', 6), ('to receive', 1), ('receive a', 1), ('a few', 9), ('few last', 1), ('last lessons', 1), ('lessons from', 1), ('from the', 69), ('the professor', 1), ('professor of', 1), ('of etiquette', 1), ('etiquette ;', 1), ('; there', 2), ('there being', 1), ('being some', 1)]

We can generate an increasing frequency profile using the sorted function with the second element of each tuple, that is the frequency, as the sort key. Since sorted returns a new list and leaves ngrams unchanged, we store the result in a new variable:

In [82]:
sorted_ngrams = sorted(ngrams, key=lambda x: x[1])
print(sorted_ngrams[:20])
print("...")
[('pomegranates contents', 1), ('contents :', 1), ('king the', 1), ('infanta the', 1), ('soul the', 1), ('star-child the', 1), ('king [', 1), ('to margaret', 1), ('margaret lady', 1), ('lady brooke', 1), ('brooke --', 1), ('the ranee', 1), ('ranee of', 1), ('of sarawak', 1), ('sarawak ]', 1), ('night before', 1), ('day fixed', 1), ('fixed for', 1), ('king was', 1), ('was sitting', 1)]
...
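
The difference between sorted and the list method sort can be seen on a short made-up list: sorted returns a new list, while sort reorders the list in place and returns None.

```python
pairs = [("a house", 4), ("of the", 354), ("king the", 1)]

# sorted() builds a new list; pairs keeps its original order.
ascending = sorted(pairs, key=lambda x: x[1])
print(ascending)
print(pairs)

# list.sort() reorders the list itself and returns None.
result = pairs.sort(key=lambda x: x[1])
print(result)
print(pairs)
```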

We can slightly speed up this sorted call by using the itemgetter() function from the operator module instead of a lambda:

In [83]:
from operator import itemgetter

We can now define the sort key for sorted using the itemgetter function, selecting with 1 the second element in the tuple. Remember that the indexing of elements in lists or tuples in Python starts at 0.

In [84]:
sorted_ngrams = sorted(ngrams, key=itemgetter(1))
print(sorted_ngrams[:20])
print("...")
[('pomegranates contents', 1), ('contents :', 1), ('king the', 1), ('infanta the', 1), ('soul the', 1), ('star-child the', 1), ('king [', 1), ('to margaret', 1), ('margaret lady', 1), ('lady brooke', 1), ('brooke --', 1), ('the ranee', 1), ('ranee of', 1), ('of sarawak', 1), ('sarawak ]', 1), ('night before', 1), ('day fixed', 1), ('fixed for', 1), ('king was', 1), ('was sitting', 1)]
...

A decreasing frequency profile can be generated by passing reverse=True to sorted. As before, we store the returned list:

In [85]:
sorted_ngrams = sorted(ngrams, key=itemgetter(1), reverse=True)
print(sorted_ngrams[0:20])
print("...")
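
Many bigrams share the same frequency, so it can help to define the order within ties. One option is a key that sorts by decreasing frequency and then alphabetically. This is a sketch on made-up pairs:

```python
pairs = [("his soul", 3), ("and his", 3), ("of the", 9), ("a house", 1)]

# Negating the count sorts it in decreasing order, while the
# bigram string breaks ties in increasing alphabetical order.
ranked = sorted(pairs, key=lambda x: (-x[1], x[0]))

for bigram, count in ranked:
    print(bigram, count)
```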

We can pretty-print the decreasing frequency profile:

In [87]:
sorted_ngrams = sorted(ngrams, key=itemgetter(1), reverse=True)
for t in sorted_ngrams[:20]:
    print(t[0], t[1])
