Python Clustering with Scikit-learn

(C) 2017-2020 by Damir Cavar

Download: This and various other Jupyter notebooks are available from my GitHub repo.

This is a tutorial related to the discussion of clustering in the textbook Machine Learning: The Art and Science of Algorithms that Make Sense of Data by Peter Flach, and various other textbooks.

This tutorial was developed as part of my course material for the course Machine Learning for Natural Language Processing in Computational Linguistics at Indiana University at Bloomington.

K-means Clustering

We will use the array objects from the Python module numpy:

In [1]:
import numpy

X = numpy.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

To use the K-means clustering algorithm from Scikit-learn, we import it and specify the number of clusters (that is, the k) and a random state for the initialization of the cluster centroids. We assume that the data can be grouped into two clusters:

In [2]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=0)

We can now apply the clustering algorithm to the datapoints in $X$:

In [3]:
kmeans.fit(X)

print(kmeans.labels_)
[0 0 0 1 1 1]

The output above shows the assignment of datapoints to clusters.
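
To see which datapoint ended up in which cluster, we can pair the rows of X with the fitted labels. This is a small sketch, not part of the original notebook:

for point, label in zip(X, kmeans.labels_):
    # print every datapoint together with its assigned cluster label
    print(point, "-> cluster", label)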

We can now use the model to make predictions for other datapoints:

In [4]:
print(kmeans.predict([[0, 0], [4, 4]]))
[0 1]
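
KMeans.predict assigns each new point to the cluster with the nearest centroid. As a rough sketch (not part of the original notebook; the variable name new_points is just for illustration), we can reproduce the prediction with numpy:

new_points = numpy.array([[0, 0], [4, 4]])
# Euclidean distance of every new point to every fitted centroid
distances = numpy.linalg.norm(new_points[:, numpy.newaxis, :] - kmeans.cluster_centers_, axis=2)
# the index of the closest centroid should match kmeans.predict(new_points)
print(distances.argmin(axis=1))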

We can also output the centroids of the two clusters:

In [5]:
print(kmeans.cluster_centers_)
[[ 1.  2.]
 [ 4.  2.]]
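
Each centroid is simply the mean of the datapoints assigned to that cluster. As a quick check (not from the original notebook), we can recompute the centroids with numpy and compare them to cluster_centers_:

# recompute each centroid as the per-cluster mean of the datapoints; we use (0, 1) because n_clusters=2
print(numpy.array([X[kmeans.labels_ == label].mean(axis=0) for label in (0, 1)]))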

K-Nearest Neighbors

If we want to use the K-Nearest Neighbors classifier in Scikit-learn, we need to import the KNeighborsClassifier from the neighbors submodule:

In [6]:
from sklearn.neighbors import KNeighborsClassifier

We instantiate a KNN classifier with k set to 3:

In [7]:
KNNClassifier = KNeighborsClassifier(n_neighbors=3)

We use the following dataset X and class-vector y:

In [8]:
X = [[0, 1], [1, 1], [2, 4], [3, 4]]
y = [0, 0, 1, 1]

We train the classifier:

In [9]:
KNNClassifier.fit(X, y)
Out[9]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

We ask the classifier to suggest a class for an unseen vector:

In [10]:
print(KNNClassifier.predict([[1.1, 0.9]]))
[0]

It can also give us the estimated probability of a data-point belonging to each of the classes:

In [11]:
print(KNNClassifier.predict_proba([[2.9, 3.1]]))
[[ 0.33333333  0.66666667]]
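
These probabilities are simply the class fractions among the k = 3 nearest neighbors of the query point. As a sketch (not part of the original notebook), we can inspect those neighbors with the classifier's kneighbors method:

# indices of the 3 nearest training points for the query [2.9, 3.1]
distances, indices = KNNClassifier.kneighbors([[2.9, 3.1]])
print(indices)
# their class labels: one neighbor of class 0 and two of class 1, hence 1/3 and 2/3
print([y[i] for i in indices[0]])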

We might not have class assignments for a sample set. If we just want to find the closest data-point within a sample set, we can use the NearestNeighbors class from the same submodule. Here is a sample set:

In [12]:
samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]

We can now fit the nearest neighbor model with k = 1:

In [13]:
from sklearn.neighbors import NearestNeighbors

KNNClassifier = NearestNeighbors(n_neighbors=1)
KNNClassifier.fit(samples)
Out[13]:
NearestNeighbors(algorithm='auto', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=1, n_neighbors=1, p=2, radius=1.0)

We could ask for the nearest neighbor of a concrete data-point:

In [14]:
print(KNNClassifier.kneighbors([[1., 1., 1.]]))
(array([[ 0.5]]), array([[2]]))

The returned result [[0.5]] and [[2]] means that the nearest neighbor is the third sample in samples (index 2) and that the distance between the two is 0.5. One can also query for the nearest neighbors of multiple data-points at once. In this case the output of the distances is suppressed:

In [15]:
X = [[0., 1., 0.], [1., 0., 1.]]

KNNClassifier.kneighbors(X, return_distance=False)
Out[15]:
array([[1],
       [2]])
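
The distances reported by kneighbors are plain Euclidean distances (the default Minkowski metric with p=2, as shown in the output of the fit above). As a small sanity check, not part of the original notebook, we can recompute the distance 0.5 between the query [1., 1., 1.] and the third sample by hand:

# Euclidean distance between the query point and samples[2] = [1., 1., .5]
print(numpy.linalg.norm(numpy.array([1., 1., 1.]) - numpy.array(samples[2])))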