Hierarchical clustering of blog posts fetched through an RSS feed

Today I tried to implement a simple webapp that retrieves RSS feeds from a given URL, reads the content of the entries, and uses hierarchical clustering to group them into categories by content similarity.

Actually, the implementation is really poor, and I’m not even sure it works. Anyway, I had wanted to try some kind of text classifier for a long time, and here we are. It extracts the text from the RSS feed, indexes the words in it, and uses them as features for the algorithm. Short posts generally mean bad results, especially without any kind of generalization of the features (stemming is probably the word I’m looking for). In fact, the results are hard to interpret, and I suspect they’re close to random.
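
To make the "words as features" idea concrete, here is a minimal bag-of-words sketch in JavaScript. It is not the code from the webapp: the function names (tokenize, buildVocabulary, toVector) and the rule of dropping words shorter than three characters are assumptions made for illustration. Each post becomes a vector of word counts over a shared vocabulary, which is the kind of input a clustering routine can work with.

```javascript
// Hypothetical helpers for illustration; not taken from the original webapp.

// Split a post into lowercase word tokens, dropping punctuation
// and words shorter than three characters (an arbitrary cutoff).
function tokenize(text) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((word) => word.length > 2);
}

// Collect every distinct word across all posts into a sorted vocabulary.
function buildVocabulary(posts) {
  const seen = new Set();
  for (const post of posts) {
    for (const word of tokenize(post)) {
      seen.add(word);
    }
  }
  return Array.from(seen).sort();
}

// Turn one post into a vector of word counts, one slot per vocabulary word.
function toVector(post, vocabulary) {
  const counts = new Map();
  for (const word of tokenize(post)) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return vocabulary.map((word) => counts.get(word) || 0);
}

// Example usage with made-up post texts.
const examplePosts = ["Cats and dogs", "Dogs chase cats", "Stock markets fell"];
const exampleVocabulary = buildVocabulary(examplePosts);
const exampleVectors = examplePosts.map((post) => toVector(post, exampleVocabulary));
console.log(exampleVocabulary);
console.log(exampleVectors);
```

With short posts, most slots in each vector are zero and two posts rarely share a word, which is one plausible reason the clusters look random.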

Anyway, here’s a list of what I (sort of) learned along the way:

  • what hierarchical clustering is (not how it works, though)
  • d3.js graph library basics (very basic basics)
  • how to use NetBeans to develop webapps

That’s not so bad for a spare afternoon and evening.

As I said, I didn’t implement the algorithm myself; I used a library from GitHub, clusterfck.
Oh, and I also used the jFeed jQuery plugin to parse the RSS, slightly modified so that it fetches the content of the entries and doesn’t crash while trying to detect IE.
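
For the curious, the general idea behind agglomerative hierarchical clustering can be sketched in a few lines of JavaScript. This is not clusterfck's code or API. It is a naive illustration that assumes Euclidean distance and average linkage, and the names (hierarchicalCluster, leafIndices) are hypothetical. Every vector starts as its own cluster, and the two closest clusters are merged repeatedly until a single tree remains.

```javascript
// Naive agglomerative clustering sketch; an illustration, not clusterfck's implementation.

// Euclidean distance between two equal-length numeric vectors.
function euclidean(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += (a[i] - b[i]) * (a[i] - b[i]);
  }
  return Math.sqrt(sum);
}

// Average linkage: mean distance over every pair of items from the two clusters.
function averageLinkage(c1, c2) {
  let total = 0;
  for (const a of c1.items) {
    for (const b of c2.items) {
      total += euclidean(a, b);
    }
  }
  return total / (c1.items.length * c2.items.length);
}

// Merge the two closest clusters until one tree remains.
// Leaves carry the index of the original vector; inner nodes carry left/right children.
function hierarchicalCluster(vectors) {
  let clusters = vectors.map((v, index) => ({ items: [v], index: index }));
  while (clusters.length > 1) {
    let best = { i: 0, j: 1, distance: Infinity };
    for (let i = 0; i < clusters.length; i++) {
      for (let j = i + 1; j < clusters.length; j++) {
        const d = averageLinkage(clusters[i], clusters[j]);
        if (d < best.distance) {
          best = { i: i, j: j, distance: d };
        }
      }
    }
    const left = clusters[best.i];
    const right = clusters[best.j];
    const merged = {
      items: left.items.concat(right.items),
      left: left,
      right: right,
      distance: best.distance
    };
    clusters = clusters.filter((_, k) => k !== best.i && k !== best.j);
    clusters.push(merged);
  }
  return clusters[0]; // undefined when no vectors were given
}

// List the original vector indices found under a node of the tree.
function leafIndices(node) {
  if (node.index !== undefined) {
    return [node.index];
  }
  return leafIndices(node.left).concat(leafIndices(node.right));
}

// Example usage: two obvious groups of 2D points.
const exampleTree = hierarchicalCluster([[0, 0], [0, 1], [10, 10], [10, 11]]);
console.log(leafIndices(exampleTree.left), leafIndices(exampleTree.right));
```

The nested left/right structure is a tree, so it is the sort of thing a d3.js tree or dendrogram layout could draw after converting it to the shape that layout expects.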

Here’s the link.