Lexical Clustering

(C) 2016-2020 by Damir Cavar <dcavar@iu.edu>

Version: 1.2, September 2020

Download: This and various other Jupyter notebooks are available from my GitHub repo.

This notebook provides simple examples of the vectorization of distributional properties of lexical items using Python 3.x. The examples show how lexical properties can be derived by applying common clustering methods to the resulting distributional vector space. This material is used in my graduate classes on Natural Language Processing, Corpus and Computational Linguistics at Indiana University Bloomington.

Vectorization of Distributional Properties

In the following we will map out lexical distributional properties, by which we mean various kinds of positional or contextual features of words in a text.

Loading a Text into Memory

We will use the collection of fairy tales "A House of Pomegranates" by Oscar Wilde. The following code reads the text into memory. We open the file, read from it, and close it again:

In [8]:
ifile = open("data/HOPG.txt", mode='r', encoding='utf-8')
text = ifile.read()
ifile.close()

Using NLTK

We will use the NLTK module to generate frequency profiles and n-gram models using the tokens in the text.

In [11]:
import nltk

We will use the tokenization and lemmatization components from NLTK. These are not the most accurate or best-performing components available. For more efficient and accurate lemmatization, consider Python modules like spaCy.
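
As an illustration of such an alternative, the following sketch lemmatizes the same raw text with spaCy. This is only a sketch: it assumes that spaCy and its small English model en_core_web_sm are installed, which is not part of this notebook's requirements.

In [ ]:
# Sketch only: assumes spaCy and its small English model are installed,
# e.g. via: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
spacyLemmas = [ token.lemma_ for token in doc ]
print(spacyLemmas[:20])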

Tokenization

We will need the tokens from the text, that is mainly all individual words and punctuation marks separated as individual elements in a token list:

In [12]:
tokens = nltk.word_tokenize(text)

The variable tokens now contains all tokens as they occur in the text. This means that the token list will contain a the, a The, and maybe even a THE. To conflate these variants of "the" into one token representation the, we will use lemmatization in the next section.

Lemmatization for Dimensionality Reduction

NLTK provides a WordNet-based lemmatizer. In the following we import the NLTK WordNetLemmatizer class:

In [13]:
from nltk.stem import WordNetLemmatizer

We instantiate a lemmatizer:

In [14]:
lemmatizer = WordNetLemmatizer()

The lemmatizer correctly converts the plural form dogs to the singular lemma dog, as shown in the example below:

In [20]:
print(lemmatizer.lemmatize("dogs"))
dog

Unfortunately, the lemmatizer does not normalize a capitalized The:

In [21]:
print(lemmatizer.lemmatize("The"))
The
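
A simple workaround, sketched below, is to lowercase every token before lemmatizing it. Note that this also strips the capitalization of proper names, which may or may not be desirable for a given application.

In [ ]:
# Sketch: lowercasing before lemmatization collapses "the", "The", and "THE"
# into one representation; proper names are lowercased too as a side effect.
print(lemmatizer.lemmatize("The".lower()))
lowerLemmas = [ lemmatizer.lemmatize(token.lower()) for token in tokens ]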

Independent of this problem, we can apply the lemmatizer to all tokens, reducing at least the inflected forms to their lemmas, in the following way:

In [22]:
lemmas = [ lemmatizer.lemmatize(token) for token in tokens ]

We can print out the first 100 lemmas:

In [23]:
print(lemmas[0:100])
['A', 'HOUSE', 'OF', 'POMEGRANATES', 'Contents', ':', 'The', 'Young', 'King', 'The', 'Birthday', 'of', 'the', 'Infanta', 'The', 'Fisherman', 'and', 'his', 'Soul', 'The', 'Star-child', 'THE', 'YOUNG', 'KING', '[', 'TO', 'MARGARET', 'LADY', 'BROOKE', '--', 'THE', 'RANEE', 'OF', 'SARAWAK', ']', 'It', 'wa', 'the', 'night', 'before', 'the', 'day', 'fixed', 'for', 'his', 'coronation', ',', 'and', 'the', 'young', 'King', 'wa', 'sitting', 'alone', 'in', 'his', 'beautiful', 'chamber', '.', 'His', 'courtier', 'had', 'all', 'taken', 'their', 'leave', 'of', 'him', ',', 'bowing', 'their', 'head', 'to', 'the', 'ground', ',', 'according', 'to', 'the', 'ceremonious', 'usage', 'of', 'the', 'day', ',', 'and', 'had', 'retired', 'to', 'the', 'Great', 'Hall', 'of', 'the', 'Palace', ',', 'to', 'receive', 'a', 'few']

Due to the weaknesses of the NLTK WordNet-based lemmatizer for generic lemmatization (note, for example, that was was lemmatized to wa above), I provide a token list of lemmatized tokens produced with an alternative lemmatizer.
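
One common improvement, sketched below, is to pass part-of-speech information to the lemmatize() method; without a POS argument the WordNet lemmatizer treats every token as a noun, which is why was came out as wa above. This sketch is not the alternative lemmatizer mentioned above, and it assumes that the NLTK POS tagger data has been downloaded.

In [ ]:
# Sketch: POS-aware lemmatization with the same WordNet lemmatizer.
# Assumes the NLTK POS tagger data is available for nltk.pos_tag.
from nltk.corpus import wordnet

def wordnetPOS(treebankTag):
    # map Penn Treebank tags onto the WordNet POS constants
    if treebankTag.startswith("J"):
        return wordnet.ADJ
    if treebankTag.startswith("V"):
        return wordnet.VERB
    if treebankTag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

posLemmas = [ lemmatizer.lemmatize(token, wordnetPOS(tag)) for token, tag in nltk.pos_tag(tokens) ]
print(posLemmas[:20])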

Using Functional Items as Distributional Features

Distributional properties of lexical items can be associated with various contextual cues. In a Distributional Semantics approach the core hypothesis is that the meaning of a specific word is determined by the meaning of the words in its context. Imagine the two different uses of bats:

The bats were flying out of the cave.

The bats were made of solid wood.

Baseball bats are more likely to be made of solid wood than to fly out of caves. The mammals of the order Chiroptera, on the other hand, live in caves and fly in and out of them.

The general idea in Distributional Semantics is that the meaning of bat can be determined by the words in its context. If a word had only one specific meaning, its meaning could in principle be defined by the words frequently occurring in its context. We can also think of it the other way around: the meaning of a word could be defined as a probability function that predicts the words in its context. This is a common interpretation in word-embedding approaches. It is obviously an oversimplification and conceptually wrong, but an approximation that has proven helpful in some NLP applications and models.

The core problem is of course that bat can refer to several things, at least the two above, and the context can help us determine which meaning is appropriate in a specific instance.

Vectorization

In [28]:
bigrams = list( nltk.ngrams(lemmas, 2) )
In [29]:
bigrams[:10]
Out[29]:
[('A', 'HOUSE'),
 ('HOUSE', 'OF'),
 ('OF', 'POMEGRANATES'),
 ('POMEGRANATES', 'Contents'),
 ('Contents', ':'),
 (':', 'The'),
 ('The', 'Young'),
 ('Young', 'King'),
 ('King', 'The'),
 ('The', 'Birthday')]
In [150]:
bigramFD = nltk.FreqDist(bigrams)
In [151]:
functionwords = """
I me my myself we our ours ourselves you you're you've you'll you'd your yours yourself yourselves
he him his himself she she's her hers herself it it's its itself they them their theirs themselves
what which who whom this that that'll these those am is are was were be been being have has had
having do does did doing a an the and but if or because as until while of at by for with about
against between into through during before after above below to from up down in out on off over under
again further then once here there when where why how all any both each few more most other some such
no nor not only own same so than too very s t can will just don don't should should've now d ll m o re ve y
ain aren aren't couldn couldn't didn didn't doesn doesn't hadn hadn't hasn hasn't haven haven't isn
isn't ma mightn mightn't mustn mustn't needn needn't shan shan't shouldn shouldn't wasn wasn't weren
weren't won won't wouldn wouldn't
"""

Since the function words above are given as one whitespace-separated string, we first split the string into a list. To guarantee fast lookup, we then convert this list into a dictionary (or hash map), storing each word's list position as the value. The list position will serve as the unique scalar index in the vector space.

In [152]:
functionwordsList = functionwords.split()
functionWordsHash = { functionwordsList[i]:i for i in range(len(functionwordsList)) }

We will store word vectors in a dictionary, with the key being the word, the value being the context vector. The vector space consists of vectors for all non-feature tokens (or words).

We can now write a function that converts the distributional properties of words into a vector space. Assume that the function takes as parameters an n-gram frequency model and a list of features of length $f$. Depending on the size of $n$ in the n-gram model, the length of the vector representation for every word would then be $(n - 1) \times f$ for left or right contexts only, and $(n - 1) \times f \times 2$ for both contexts. The following function, however, ignores the inner positions of each n-gram: for any $n > 1$ it only considers the left-peripheral and right-peripheral tokens and the features, so the resulting vectors have length $f$ or $2 \times f$. With the bigram model and the six test features used below, this yields 12-dimensional vectors.

In [168]:
def makeVectorSpace(tokenTuples, features, left=True, right=True):
    # tokenTuples: an n-gram frequency model (e.g. an nltk.FreqDist over tuples)
    # features:    a list of feature words defining the vector dimensions
    if not tokenTuples or not features:
        return {}
    n = len(next(iter(tokenTuples)))
    if n < 2:
        return {}
    f = len(features)
    featuresHash = { features[x]:x for x in range(f) }

    if left and right:
        vectorLength = 2 * f
    elif left or right:
        vectorLength = f
    else:
        return {}

    vectorModel = {}
    for x in tokenTuples:
        if x[-1] in featuresHash and x[0] not in featuresHash:
            # feature word in the right context of the target token
            if not right:
                continue
            feat, t = x[-1], x[0]
            offset = f if left else 0
        elif x[0] in featuresHash and x[-1] not in featuresHash:
            # feature word in the left context of the target token
            if not left:
                continue
            feat, t = x[0], x[-1]
            offset = 0
        else:
            continue
        tokenVector = vectorModel.get(t, [0] * vectorLength)
        fPos = featuresHash[feat] + offset
        tokenVector[fPos] += tokenTuples.get(x, 0)
        vectorModel[t] = tokenVector
    return vectorModel

Let us create some sample text. In this case testText is a small text that is already tokenized (tokens are separated by spaces) and orthographically normalized (sentence-initial words are lowercased unless they are proper names).

In [166]:
testText = """John Smith met Susan Peters in Paris where she lived a happy life over the last ten years .
for many years she lived in Berlin .
the city was too big for her .
why she moved to Paris is unclear .
in Paris she has a small apartment . """

We can create an ngram model from the text, with $n=2$, and extract a vectorization as follows:

In [178]:
testNgrams = nltk.FreqDist(nltk.ngrams(nltk.word_tokenize(testText), 2))
testFeatures = ['the', 'a', 'in', 'on', 'where', 'over']
wordVectors = makeVectorSpace(testNgrams, testFeatures)
for k, v in wordVectors.items():
    print(k, v)
Paris [0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0]
Peters [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
she [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
lived [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
happy [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
life [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
last [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Berlin [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
. [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0]
city [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
has [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
small [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

The vector space above contains absolute context frequencies. We can relativize the vector space by dividing every cell of a vector by that vector's sum:

In [179]:
relVecSpace = {}
for x, y in wordVectors.items():
    total = sum(y)
    relVecSpace[x] = [ i/total for i in y ]

Print the relativized vector space:

In [180]:
for k, v in relVecSpace.items():
    print(k, v)
Paris [0.0, 0.0, 0.6666666666666666, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3333333333333333, 0.0]
Peters [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
she [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
lived [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.5, 0.0, 0.0, 0.0]
happy [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
life [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
last [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Berlin [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
. [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.5, 0.0, 0.0, 0.0]
city [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
has [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
small [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Clustering Vector Spaces

Given the distributional vector space above, I will show in the following how to use different clustering algorithms to group the lexical vectors using geometrical similarity metrics.
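
To make the notion of geometric similarity concrete, the following sketch compares two of the relative-frequency vectors computed above with Euclidean distance, the metric that K-Means minimizes internally, and with cosine similarity, a common choice in distributional semantics. It assumes the relVecSpace dictionary from the previous cells.

In [ ]:
# Sketch: two common geometric comparisons of context vectors.
from math import sqrt

def euclidean(u, v):
    # straight-line distance between two vectors (smaller = more similar)
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    # cosine of the angle between two vectors (larger = more similar)
    dot = sum(a * b for a, b in zip(u, v))
    norms = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

print(euclidean(relVecSpace["Paris"], relVecSpace["Berlin"]))
print(cosine(relVecSpace["Paris"], relVecSpace["Berlin"]))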

For the clustering experiments we will use the scikit-learn module. If you have not installed scikit-learn yet, follow the instructions on the module's website. To import the K-Means clustering algorithm, use the following import statement:

In [184]:
from sklearn.cluster import KMeans

More efficient and useful representations of, and operations on, vectors and vector space models can be achieved with the numpy module. We import it as np:

In [209]:
import numpy as np

The word labels can be stored in a memory-efficient tuple. The vector space can be converted into a numpy array using the array function and a list comprehension over the individual word vectors.

In [210]:
labels = tuple(wordVectors.keys())
vectors = np.array([np.array(x) for x in wordVectors.values()])

The vector labels are now stored in a tuple of strings:

In [211]:
print(labels)
('Paris', 'Peters', 'she', 'lived', 'happy', 'life', 'last', 'Berlin', '.', 'city', 'has', 'small')

Instead of providing KMeans with a numpy matrix, it is also possible to use a pandas data frame. To use this method, we import pandas as pd:

In [212]:
import pandas as pd

To create a new pandas data frame, we use the vectors from our vector space model, set the column names to the function word features (repeated once for the left and once for the right context), and use the word labels as the row index.

In [213]:
df = pd.DataFrame(vectors, columns=testFeatures * 2, index=labels)
In [214]:
df
Out[214]:
the a in on where over the a in on where over
Paris 0 0 2 0 0 0 0 0 0 0 1 0
Peters 0 0 0 0 0 0 0 0 1 0 0 0
she 0 0 0 0 1 0 0 0 0 0 0 0
lived 0 0 0 0 0 0 0 1 1 0 0 0
happy 0 1 0 0 0 0 0 0 0 0 0 0
life 0 0 0 0 0 0 0 0 0 0 0 1
last 1 0 0 0 0 0 0 0 0 0 0 0
Berlin 0 0 1 0 0 0 0 0 0 0 0 0
. 0 0 0 0 0 0 1 0 1 0 0 0
city 1 0 0 0 0 0 0 0 0 0 0 0
has 0 0 0 0 0 0 0 1 0 0 0 0
small 0 1 0 0 0 0 0 0 0 0 0 0

For more transparency we can append -l and -r to the column names to mark the function words as left or right context cues, respectively:

In [222]:
df.columns = [ x+"-l" for x in testFeatures ] + [ x+"-r" for x in testFeatures ]
df
Out[222]:
the-l a-l in-l on-l where-l over-l the-r a-r in-r on-r where-r over-r
Paris 0 0 2 0 0 0 0 0 0 0 1 0
Peters 0 0 0 0 0 0 0 0 1 0 0 0
she 0 0 0 0 1 0 0 0 0 0 0 0
lived 0 0 0 0 0 0 0 1 1 0 0 0
happy 0 1 0 0 0 0 0 0 0 0 0 0
life 0 0 0 0 0 0 0 0 0 0 0 1
last 1 0 0 0 0 0 0 0 0 0 0 0
Berlin 0 0 1 0 0 0 0 0 0 0 0 0
. 0 0 0 0 0 0 1 0 1 0 0 0
city 1 0 0 0 0 0 0 0 0 0 0 0
has 0 0 0 0 0 0 0 1 0 0 0 0
small 0 1 0 0 0 0 0 0 0 0 0 0

We can use the values from the data frame as a matrix for KMeans:

In [201]:
m = df.values

We can now apply KMeans to the data matrix:

In [202]:
km = KMeans(n_clusters=4)
km.fit(m)
Out[202]:
KMeans(n_clusters=4)
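
Note that K-Means chooses its initial centroids randomly, so the cluster IDs in the output below can differ between runs. A seeded variant, sketched here with the arbitrary value random_state=42 (not used for the output below), makes the assignment reproducible:

In [ ]:
# Sketch: fixing the random seed makes the cluster assignments reproducible.
kmSeeded = KMeans(n_clusters=4, random_state=42)
kmSeeded.fit(m)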

The cluster labels can be used to generate a new data frame that pairs each word with its cluster ID:

In [224]:
results = pd.DataFrame([df.index, km.labels_]).T
results.columns=("word", "cluster")
results
Out[224]:
word cluster
0 Paris 2
1 Peters 1
2 she 0
3 lived 1
4 happy 3
5 life 0
6 last 0
7 Berlin 0
8 . 1
9 city 0
10 has 0
11 small 3

On this small sample the clusters will not make much sense. We need a larger text or corpus to extract more detailed distributional properties. In the following section I will introduce a larger, normalized and lemmatized text and run various lexical clustering experiments on it.

Clustering on Larger Texts