Pitfalls when mining Wikipedia – this research was done on somewhat old data (2013), but it shows a fairly easy way to clean Wikipedia dumps. Although the authors suggest a way to repeat their process, I suspect the structure of some of the dumps has changed a bit since then.
An interesting question for me in this scope is the relation between the content/views of the same entities across different languages. This could be calculated by joining page-count results with Wikidata dumps once they are complete. Having crunched the Wikidata dumps a bit, at the moment they do not feel mature enough and do not include data about all the articles in Wikipedia.
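The cross-language join described above can be sketched in a few lines. This is a minimal illustration with made-up data: the item IDs, sitelinks, and view counts below are all hypothetical, standing in for what you would actually extract from a Wikidata dump and the pagecount files.

```python
# Hypothetical sitelinks extracted from a Wikidata dump:
# item id -> {language wiki: article title}
sitelinks = {
    "Q42": {"enwiki": "Douglas Adams", "dewiki": "Douglas Adams"},
    "Q64": {"enwiki": "Berlin", "dewiki": "Berlin", "frwiki": "Berlin"},
}

# Hypothetical page-count totals: (wiki, title) -> views
pagecounts = {
    ("enwiki", "Berlin"): 1200,
    ("dewiki", "Berlin"): 900,
    ("frwiki", "Berlin"): 300,
    ("enwiki", "Douglas Adams"): 150,
}

# Join: views of the same entity across languages, keyed by item id.
# Missing (wiki, title) pairs default to 0 views.
views_by_item = {
    item: {
        wiki: pagecounts.get((wiki, title), 0)
        for wiki, title in links.items()
    }
    for item, links in sitelinks.items()
}

print(views_by_item["Q64"])
# {'enwiki': 1200, 'dewiki': 900, 'frwiki': 300}
```

In practice both sides of the join are far too large for in-memory dicts, so the same keying (item id on one side, (wiki, title) on the other) would be done in something like Spark, but the shape of the join is the same.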
Weather prediction with Amazon Machine Learning – a nice, simple way to start playing with Amazon Machine Learning.
How to evaluate machine learning models – a series of five posts by Dato.
Mean shift clustering – a clustering approach with a few advantages over common methods, e.g. you don't have to define the number of clusters in advance, though you do have to tune the bandwidth. On the other hand, it is slower than other algorithms.
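The trade-off above (no cluster count, but a bandwidth to tune) is easy to see with scikit-learn's `MeanShift`. A minimal sketch on synthetic data: the blob locations and the `quantile` value are arbitrary choices for the example, and `estimate_bandwidth` is scikit-learn's helper for picking a bandwidth from the data instead of tuning it by hand.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.RandomState(0)
# Three well-separated synthetic blobs; the algorithm is never
# told how many clusters there are.
X = np.concatenate([
    rng.normal(loc, 0.3, size=(50, 2))
    for loc in ([0, 0], [3, 3], [0, 4])
])

# estimate_bandwidth derives a bandwidth from pairwise distances;
# quantile is the main knob mean shift leaves you with.
bandwidth = estimate_bandwidth(X, quantile=0.2)
labels = MeanShift(bandwidth=bandwidth).fit_predict(X)

print(len(np.unique(labels)))  # number of clusters discovered
```

Note the slowness mentioned above: the reference implementation is roughly quadratic in the number of samples, which is why scikit-learn also exposes a `bin_seeding` option to speed up seeding on larger datasets.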
The future of Spark – a post following the Strata + Hadoop conference in London.