5 interesting things (11/9/2015)

Density-based clustering – the clearest and most practical guide I have read about density-based clustering.
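To make the idea concrete, here is a minimal from-scratch sketch of DBSCAN, the canonical density-based algorithm. This is my own toy illustration (not code from the guide); `eps` is the neighborhood radius and `min_pts` the density threshold.

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Toy DBSCAN sketch: returns {point index: cluster id}, with -1 for noise."""
    labels = {}
    cluster = -1

    def neighbors(i):
        # Brute-force eps-neighborhood (includes the point itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    for i in range(len(points)):
        if i in labels:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:      # not dense enough: tentatively noise
            labels[i] = -1
            continue
        cluster += 1                  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels.get(j) == -1:   # previously noise: becomes a border point
                labels[j] = cluster
            if j in labels:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(nbrs)
    return labels
```

Points in dense regions chain together into clusters, while isolated points stay labeled -1, which is the property that makes density-based methods robust to outliers and to non-convex cluster shapes.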


Word segment – this Python library, trained on a corpus of over a trillion words, aims to segment text into words, e.g. "thisisatest" into "this is a test". I tried a random example – "helloworld" – and it didn't split it at all. Other examples I tried ("mynameis<x>", "ilivein<y>", etc.) worked well. Besides the segmentation functionality it also offers unigram and bigram counts, which can be useful for all kinds of applications without having to fetch, clean, and process the data yourself. Numbers do not appear in the unigram counts; I find that interesting for needs other than splitting.
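The core idea behind such libraries can be sketched in a few lines: recursively try every split point and keep the segmentation whose words have the highest unigram probability product. The tiny `UNIGRAMS` table below is made up for illustration; the real library backs this with its trillion-word counts.

```python
from functools import lru_cache
from math import prod

# Toy unigram table standing in for real corpus statistics (made-up counts).
UNIGRAMS = {"this": 60, "is": 80, "a": 100, "test": 40, "hello": 50, "world": 50}
TOTAL = sum(UNIGRAMS.values())

def score(word):
    # Unseen words get a probability that shrinks fast with length,
    # so the search prefers known words over long unknown chunks.
    return UNIGRAMS.get(word, 10 / 10 ** len(word)) / TOTAL

def splits(text, max_len=10):
    # All (first_word, remainder) candidates, with words capped at max_len chars.
    return [(text[:i], text[i:]) for i in range(1, min(len(text), max_len) + 1)]

@lru_cache(maxsize=None)
def segment(text):
    """Return the word tuple maximizing the product of unigram probabilities."""
    if not text:
        return ()
    return max(
        ((first,) + segment(rest) for first, rest in splits(text)),
        key=lambda words: prod(score(w) for w in words),
    )
```

With a vocabulary this small, `segment("thisisatest")` already recovers `("this", "is", "a", "test")`; quality at scale comes almost entirely from the size of the count tables, not the algorithm.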


Funny haha?! Predicting whether a joke is funny based on the words it contains, using the Naive Bayes classifier from the NLTK package. It is a good and funny beginners' tutorial. NLTK contains additional classifiers besides the Naive Bayes classifier, e.g. the decision tree classifier; it would also be interesting to see how they perform on this dataset.
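For intuition about what the tutorial's classifier is doing, here is a minimal from-scratch bag-of-words Naive Bayes with Laplace smoothing. This is my own sketch on invented toy data, not the tutorial's code or NLTK's API.

```python
from collections import Counter
from math import log

def train_nb(labeled_docs):
    """labeled_docs: list of (list_of_words, label). Returns model counts."""
    label_counts = Counter(label for _, label in labeled_docs)
    word_counts = {label: Counter() for label in label_counts}
    for words, label in labeled_docs:
        word_counts[label].update(words)
    vocab = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocab

def classify(model, words):
    """Pick the label maximizing log prior + smoothed log likelihoods."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best, best_score = None, float("-inf")
    for label, n_docs in label_counts.items():
        score = log(n_docs / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            # Add-one (Laplace) smoothing so unseen words don't zero everything out.
            score += log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best
```

NLTK's `NaiveBayesClassifier` works on feature dicts rather than raw word lists, but the underlying probability model is the same, which is why word choice alone carries so much signal for the funny/not-funny split.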


Scaling decision forests lessons learned – some of those lessons are specific to decision forests, some are about scaling, and some are general good practices. I wish there were a more profound discussion of dealing with missing data/features – not specifically in this post, but in general.
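One common practical recipe for missing features in tree models, sketched below under my own assumptions (not from the post): impute each column's median and append a 0/1 "was missing" indicator per column, so the forest can still split on missingness itself when it is informative.

```python
from statistics import median

def impute_with_indicator(rows):
    """Median-impute None values column by column; append a 0/1
    missing-indicator column for each original feature."""
    n_cols = len(rows[0])
    medians = []
    for c in range(n_cols):
        observed = [r[c] for r in rows if r[c] is not None]
        medians.append(median(observed) if observed else 0.0)
    out = []
    for r in rows:
        filled = [medians[c] if r[c] is None else r[c] for c in range(n_cols)]
        flags = [1 if r[c] is None else 0 for c in range(n_cols)]
        out.append(filled + flags)
    return out
```

The indicator columns matter because "value is missing" is often a signal in its own right (e.g. a sensor that only fails under load), and plain imputation silently erases it.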


Redshift howto – a very extensive guide to Redshift, mostly admin- and configuration-related material. I missed a chapter on tools that sit on top of Redshift, such as re:dash.


