K-means Clustering Algorithm

 

K-means clustering is an algorithm to separate multidimensional data into subclasses.  The number of subclasses, must be specified to run the algorithm.  K-means is an iterative algorithm which groups together things that are closest together.  For example the following picture is of two dimensional data with 4 subclasses.   K-means will work on data which is any number of dimensions as long as all data points have the same number of dimensions. 

The K-means Algorithm:

  1. Given the number of clusters (k) choose k random start points from the data.
  2. Divide the data into k categories by deciding which of the k start points each other point is closest to.  This leaves you with k subclasses.
  3. Now for each subclass, calculate the center point of all the points in that subclass.
  4. Using these centers figure out which sub class each point is now in.
  5. Using the new subclasses repeat steps 3 and 4 until the data stabilizes into k categories.

 

A Sample Clustering Application:

This application allows the user to load a text file and creates a 5 dimensional vector for each word in the file.  Then, the user can cluster the words based on these 5 dimensional vectors.  The vector is as follows:

[frequency, length, startSentence, endSentence, middleSentence]

The parameters of the vector are defined as follows:

  1. frequency - the number of times the word occurs in the text.
  2. length - the length of the word.
  3. startSentence - the percentage of the time the word occurs as the first word in the sentence*.
  4. endSentence - the percentage of the time the word ends the sentence*.
  5. middleSentence - the percentage of time the word is neither the last word nor the first word of the sentence*.

 

*Note:  To determine if a word is at the beginning, end or middle of a sentence, sentences are found by looking for a period, question mark, or exclamation point ( . ! ?).  It is assumed that these mark the ends of sentences.  I do look for one exception.  If the words mr or mrs have periods following them, it assumed that these do not the mark ends of sentences.  Therefore other abbreviations may cause problems.  Cases where a question mark is not at the end of a sentence because it is in a quotation including the following:

"What did you get for Christmas?", John asked.

would also not be handled correctly.  It is assumed that these cases would not occur very often and should not change the results very much.

 

Running the Application:

This application needs the python interpreter to run.  First make sure that python is installed and that your path includes the python executable.   To run the application, first download the three files to a folder on your computer.  From inside this directory type the following:

        python ClusteringApp.py

Doing so will bring up a window similar to this:

 

From the menu select File -> Open and select a text file to open.  Then click the cluster button and finally the plot button.  Depending on the size of the file, the cluster function may take quite a bit of time.  The plot consists of only the first two dimensions of the data.  The y-axis is word length and the x-axis is word frequency.  Each number in the plot represents a word and if two words have the same number they are in the same subclass and they are also shown on the plot in the same color. Clicking on the numbers in the left hand select box will show the corresponding words in that category.  These words will appear in the select box in the upper right corner of the window.  Clicking on a word will show the corresponding vector at the bottom of the window.

 

Download:

kMeansClustering.zip