Implementation of a K Means Clustering to classify documents by language

Recently i’ve been interested in machine learning, and made some sample implementations to understand better the subject. In particular recently i’ve implemented a simplified version of the K-Means Clustering classifier and then I decided to apply this algorithm to a more practical task.

I’ve implemented the K-Means Clustering to classify some text by the language it’s written in. In brief, you can provide some different text to the algorithm, decide in how many groups you want them to be classified and then run the algorithm. It will partition the text depending on the different relative frequency of the letters in it, trying to recognize some structure in the different languages. It’s not perfect, but works, and there’s a live version to try here.

The larger the text is, the more the algorithm will be accurate. In fact, with short text the result will be quite random.

Graphical representation of a sample K Means Clustering classifier

Moving to the second lesson of this tutorial, i’ve learnt about the K Means Clustering classifier. Basically, We’re giving the algorithm some points of the space and it will partition the elements in K different sets. The algorithm is really easy, I suggest you to read the tutorial for further information.

The only thing I want to explain here about this algorithm is that, given a certain dataset (in our case a set of points) you should already know in how many sets you should partition it. Otherwise, the algorithm will get to a solution which may be inaccurate. To better understand this, try to use my little implementation (the link is below) making 6 sets of close points, and try to run the algorithm with K different from 6. You’ll understand why it’s important to have an accurate guess of the K value.

I modified my recent implementation of the K Nearest Neighbour to use this algorithm.

Here’s the link.

Graphical representation of a sample K Nearest Neighbour classifier

Today i read the first lines of a promising tutorial about machine learning. It starts introducing the subject and showing the first javascript example of classifier. I’ve never read anything about that, so my knowledge is still very limited (yes, I stopped before the end of the first lesson because I wanted to implement this). For technical details, I send you to the tutorial.

Anyway, what I’ve understood so far is that an important part of machine learning consists in classifiers. Classifiers are algorithm meant for “recognizing” (classify) objects by some of their features. Actually you feed them with a known dataset (already classified) and hope they will be able to classify any new object you throw in them by just recognizing their features and comparing them with the known features in the dataset. I won’t go technical on this (I can’t yet), I suggest you to read the linked tutorial if you’re really interested.

There are plenty of classifiers, but one of the simplest is the K Nearest Neighbour Classifier.It works just by representing the n features (which must be numeric) on the axis of a n-dimensional space. When you give it an element to classify, it finds the K nearest known elements (of the given dataset) and finds which class occurs most of the times in this K elements. The element is then assumed to belong to this class, and thus is classified.

I wanted to try to implement my own version of this simple algorithm, so I wrote this little Javascript app which takes some input points (x and y are the numeric features), each of them with a color (the known class), and then generates new random elements (x and y) and classifies them with the K Nearest Neighbour (with K=5 for now, but i’m changing that often), coloring the points (putting them in a class). The result is, after some time, that the entire space is colored in a way dependent by the dataset (your original input). That’s not much useful, but it’s surely funny. At least it has been for me.

Notice that this implementation uses as dataset of the current step every element in the canvas, even those generated randomly and then classified in previous steps. This is of course not very clever for this kind of classifier, but it makes the result less predictable, and fits better my purposes (I have no purposes).

Here’s the link, and here’s a screenshot: