5 interesting things (04/06/2015)

Pitfalls when mining Wikipedia – this research was done on somewhat old data (2013), but it shows a fairly easy way to clean Wikipedia dumps. Although the authors suggest a way to reproduce their process, I suspect that the structure of some of the dumps has changed a bit since then.

An interesting question for me in this context is the relation between the content and the views of the same entities across different languages. This could be calculated by joining page count results with Wikidata dumps once they are complete. Having crunched the Wikidata dumps a bit, at the moment they do not feel mature enough and do not include data about all the articles in Wikipedia.
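
To make the join idea concrete, here is a minimal sketch in Python. Everything in it is a toy assumption: the entity ids (`Q_A`, `Q_B`), titles and view counts are made up, and the two dictionaries only stand in for whatever you would actually parse out of the Wikidata sitelinks and the page count files. The function name `views_by_entity` is hypothetical as well.

```python
from collections import defaultdict

# Hypothetical toy data: entity id -> {language: article title}.
# In practice this would come from parsing Wikidata sitelinks.
sitelinks = {
    "Q_A": {"en": "Alpha", "de": "Alpha_(Begriff)", "he": "אלפא"},
    "Q_B": {"en": "Beta", "de": "Beta"},
}

# Hypothetical toy data: (language, article title) -> view count.
# In practice this would come from aggregated page count files.
page_views = {
    ("en", "Alpha"): 1200,
    ("de", "Alpha_(Begriff)"): 300,
    ("en", "Beta"): 50,
    ("fr", "Gamma"): 10,  # no matching entity, so it is ignored by the join
}


def views_by_entity(sitelinks, page_views):
    """Join sitelinks with page views: entity id -> {language: views}."""
    result = defaultdict(dict)
    for entity_id, links in sitelinks.items():
        for lang, title in links.items():
            views = page_views.get((lang, title))
            if views is not None:  # keep only pages that have a count
                result[entity_id][lang] = views
    return dict(result)


joined = views_by_entity(sitelinks, page_views)
for entity_id, per_lang in joined.items():
    print(entity_id, per_lang)
```

Articles without a page count entry simply drop out of the result, which is also a quick way to see how much of Wikipedia the dump you are using really covers.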


Weather prediction with Amazon Machine Learning – a nice, simple way to start playing with Amazon Machine Learning.


How to evaluate machine learning models – a series of 5 posts by Dato.


Mean shift clustering – a clustering approach with a few advantages over common methods. E.g. you don’t have to define the number of clusters in advance, but you do have to tune the bandwidth. On the other hand, it is slower than other algorithms.
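
To see where the bandwidth comes in, here is a rough pure-Python sketch of mean shift with a flat kernel. It is not the code from the linked post and not any library's implementation, just an illustration: the function name `mean_shift`, its parameters and the toy points are my own assumptions. Every point is repeatedly moved to the mean of the points within `bandwidth` of it, and points that end up at (nearly) the same place form a cluster.

```python
import math


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift. Returns (centers, labels)."""
    modes = []
    for p in points:
        current = p
        for _ in range(max_iter):
            # All data points inside the window around the current position.
            neighbours = [q for q in points if math.dist(current, q) <= bandwidth]
            if not neighbours:  # defensive guard against rounding issues
                break
            # Shift the window to the mean of its neighbours.
            new = tuple(sum(coord) / len(neighbours) for coord in zip(*neighbours))
            shift = math.dist(current, new)
            current = new
            if shift < tol:
                break
        modes.append(current)

    # Merge modes that landed closer than one bandwidth to each other.
    centers, labels = [], []
    for m in modes:
        for i, c in enumerate(centers):
            if math.dist(m, c) < bandwidth:
                labels.append(i)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels


# Toy data: two well-separated blobs; the number of clusters is never given.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, labels = mean_shift(points, bandwidth=1.0)
print(len(centers), labels)
```

The nested scan over all points for every shift of every point is the reason for the slowness mentioned above: this naive version does quadratic work per iteration. The bandwidth decides what counts as a neighbour, so a value that is too large merges everything into one cluster and one that is too small leaves every point on its own.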


The future of Spark – a post following the Strata + Hadoop conference in London.

