(C) 2017-2019 by Damir Cavar <dcavar@iu.edu>
Version: 0.3, September 2019
Download: This and various other Jupyter notebooks are available from my GitHub repo.
This is a brief introduction to NLTK for simple frequency analysis of texts. I created this notebook for intro to corpus linguistics and natural language processing classes at Indiana University between 2017 and 2019.
For this to work, we expect a subfolder data in the folder with this notebook, containing the file HOPG.txt. This file contains the novel "A House of Pomegranates" by Oscar Wilde, taken as raw text from Project Gutenberg.
Reading a text into memory in Python is fairly simple. We open a file, read from it, and close the file again:
ifile = open("data/HOPG.txt", mode='r', encoding='utf-8')
text = ifile.read()
ifile.close()
print(text[:200], "...")
The optional parameters in the open function above define the mode of operation on the file and the encoding of its content. For example, setting the mode to r declares that reading from the file is the only operation we will perform in the following code. Setting the encoding to utf-8 declares that the content of the file is encoded using the Unicode encoding scheme UTF-8.
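As an aside, the same open/read/close sequence above can be written with a with block, which closes the file automatically even if reading fails; this is just an equivalent sketch:
# equivalent to the open/read/close lines above; the file is closed automatically
with open("data/HOPG.txt", mode='r', encoding='utf-8') as ifile:
    text = ifile.read()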
import nltk
We can now lowercase the text, that is, normalize all characters to lower case:
text = text.lower()
To generate a frequency profile from the text file, we can use the NLTK function FreqDist:
myFD = nltk.FreqDist(text)
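To take a quick look at the most frequent characters, we can use the most_common method of the FreqDist object, for example for the top 10; a minimal sketch:
print(myFD.most_common(10))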
We can remove certain characters from the distribution, or alternatively replace these characters in the text variable:
for x in ":,.-[];!'\"\t\n/ ?":
del myFD[x]
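The alternative mentioned above, replacing (here: deleting) these characters in the text variable itself before counting, could look like the following sketch; cleaned and myFD_alt are illustrative names not used elsewhere in this notebook:
# hypothetical alternative: strip the punctuation characters from the text itself before counting
cleaned = text.translate(str.maketrans("", "", ":,.-[];!'\"\t\n/ ?"))
myFD_alt = nltk.FreqDist(cleaned)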
We can print out the frequency profile by looping through the returned data structure:
for x in myFD:
    print(x, myFD[x])
To relativize the frequencies, we need to compute the total number of characters. This assumes that we have removed all punctuation symbols.
total = float(sum(myFD.values()))
print(total)
We can now generate a probability distribution over characters:
relfrq = [ x/total for x in myFD.values() ]
print(relfrq)
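As a quick sanity check, these relative frequencies should sum to (approximately) 1.0:
print(sum(relfrq))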
We need the math log function:
from math import log
We can define the entropy function according to the equation $I = - \sum_{x} P(x) \log_2 P(x)$ as:
def entropy(p):
    res = 0.0
    for x in p:
        res += x * log(x, 2)
    return -res
We can now compute the entropy of the character distribution:
print(entropy(relfrq))
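For comparison, a uniform distribution over the same character inventory would give the maximum possible entropy $\log_2(N)$, where $N$ is the number of distinct characters; a minimal sketch:
# maximum entropy for a uniform distribution over the observed characters
print(log(len(myFD), 2))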
We might be interested in the point-wise entropy of the characters in this distribution, that is, the entropy contribution of each single character. We can compute that in the following way:
entdist = [ -x * log(x, 2) for x in relfrq ]
print(entdist)
We could now compute the variance over this point-wise entropy distribution...
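One way to do that is with the pvariance function from Python's statistics module, which computes the population variance of the point-wise entropy values; a minimal sketch:
from statistics import pvariance
print(pvariance(entdist))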
We see that the frequency profile is for the characters in the text, not the words or tokens. In order to generate a frequency profile over words/tokens in the text, we need to utilize a tokenizer. NLTK provides basic tokenization functions. We will use the word_tokenize function to generate a list of tokens:
tokens = nltk.word_tokenize(text)
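Note that word_tokenize relies on NLTK's pre-trained Punkt tokenizer models. If they are not installed yet, the call above raises a LookupError; a one-time download fixes this (assuming network access):
nltk.download('punkt')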
We can now generate a frequency profile from the token list:
myTokenFD = nltk.FreqDist(tokens)
The frequency profile can be printed out in the same way as above, by looping over the tokens and their frequencies:
for token in list(myTokenFD.items())[:10]:
    print(token[0], token[1])
print("...")
We can also generate bi-grams, that is pairs of adjacent tokens, from the token list using the NLTK ngrams function:
myTokenBigrams = nltk.ngrams(tokens, 2)
The resulting bi-gram list can be printed out in a loop, as in the examples above, or we can print the entire data structure, that is the list generated from the Python generator object:
bigrams = list(myTokenBigrams)
print(len(bigrams))
print(bigrams[:30], "...")
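The same approach generalizes to higher-order n-grams; for example, trigrams can be generated by passing 3 as the second argument to ngrams (a sketch):
trigrams = list(nltk.ngrams(tokens, 3))
print(trigrams[:10], "...")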
The frequency profile from these bigrams can be generated in the same way as from the token list above:
myBigramFD = nltk.FreqDist(bigrams)
If we want to know some more general properties of the frequency distribution, we can print out information about it:
print(myBigramFD)
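The same information is available programmatically: N returns the total number of counted bi-gram tokens and B the number of distinct bi-gram types; a minimal sketch:
print(myBigramFD.N(), myBigramFD.B())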
The bigrams and their corresponding frequencies can be printed using a loop:
for bigram in list(myBigramFD.items())[:20]:
    print(bigram[0], bigram[1])
print("...")
Pretty printing the bigrams is possible as well:
for ngram in list(myBigramFD.items())[:20]:
    print(" ".join(ngram[0]), ngram[1])
print("...")
Instead of running the frequency profile through a loop, we can also use a list comprehension in Python to generate a list of tuples with each n-gram and its frequency:
ngrams = [ (" ".join(ngram), myBigramFD[ngram]) for ngram in myBigramFD ]
print(ngrams[:100])
We can generate an increasing frequency profile using the built-in sorted function with the second element of each tuple, that is the frequency, as the sort key. Note that sorted returns a new list, so we assign the result back to ngrams:
ngrams = sorted(ngrams, key=lambda x: x[1])
print(ngrams[:20])
print("...")
We can increase the speed of this sorted call by using the itemgetter() function in the operator module:
from operator import itemgetter
We can now define the sort key for sorted using the itemgetter function, selecting the second element of each tuple with the index 1. Remember that the enumeration of elements in lists or tuples in Python starts at 0.
ngrams = sorted(ngrams, key=itemgetter(1))
print(ngrams[:20])
print("...")
A decreasing frequency profile can be generated using another parameter to sorted:
ngrams = sorted(ngrams, key=itemgetter(1), reverse=True)
print(ngrams[:20])
print("...")
We can pretty-print the decreasing frequency profile:
ngrams = sorted(ngrams, key=itemgetter(1), reverse=True)
for t in ngrams[:20]:
    print(t[0], t[1])
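Using the entropy function defined earlier, the same computation can be applied to the token distribution instead of the character distribution; a sketch reusing myTokenFD and entropy from above:
tokentotal = float(sum(myTokenFD.values()))
tokenrelfrq = [ x/tokentotal for x in myTokenFD.values() ]
print(entropy(tokenrelfrq))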
(C) 2017-2019 by Damir Cavar <dcavar@iu.edu>