5 interesting things (26/06/2015)

Spark Summit 2015 Highlights in Tweets – Spark Summit was held in San Francisco this week (15-17 June). The European Spark Summit will be held on 27-29 October, and the call for papers is open. This presentation summarizes the highlights of the different talks. I believe most of the talks and presentations will be available online soon, if they are not already.

How to implement neural network – somehow, neural networks are not fully supported by scikit-learn (this will probably happen in the future), and there are several alternatives such as PyBrain (not sure if it is maintained), PyLearn2, Theano and so on. This tutorial is a neural network DIY. It is both a technical Python guide and a step-by-step neural network refresher. http://peterroelants.github.io/posts/neural_network_implementation_part01/

Designing large scale Python applications – a talk by Marc-André Lemburg, Python core developer, at the PyWaw Summit. He also gave a keynote, “Past, present and future of Python”, which I intend to listen to when I have some time.


The art of command line – I’m a geek, I admit it, and I like working from the command line (we love you Tom), both for small data science things (sum, average, uniq, etc.) and for checking what is going on in my system (atop, ps, etc.). I feel I should keep this repository no more than a click away.


Mattermost – Mattermost is an open source alternative to Slack. In one of the organizations I worked in, introducing Slack changed the entire communication in the organization. People shared more of their ideas, thoughts and results, and got involved in additional projects. Still, some aspects bothered me, such as people preferring Slack over documentation, the reliance on external servers, and the difficulty of searching and backing up. I therefore find it very exciting to have an open source alternative, which I expect will improve both products.



5 interesting things (19/06/2015)

Interviewing as an art – Finding smart, talented people with the right vibe for your company is always hard. Being on the other side is also hard, not only in the sense of being good enough but also in “interviewing” the workplace: is the job interesting and challenging? Are the colleagues nice and talented, and will you like working with them? There are many posts and theories about what you should ask in order to find the right candidate, and fewer posts and discussions about how to approach the candidate, make him or her comfortable and get the maximum out of a stressful interview. I find this post especially sensitive and inviting.


Airbnb Airflow – a framework by Airbnb which helps create, schedule and monitor data pipelines.

Seaborn cheat-sheet – seaborn is the statistical data visualization package that complements and improves the PyData ecosystem. This is a simple tutorial that can also be used as a cheat-sheet.

Data Science IPython notebooks – a collection of links to IPython notebooks related to data science. It includes notebooks from Kaggle, AWS, PyCon, etc. A link to some seaborn notebook could also be added.

Pitfalls of A/B testing – tl;dr – statistics is tricky and you should remember the underlying assumptions. Also a good chance to mention A/A testing as a sanity check.
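
To make the A/A idea concrete, here is a minimal sketch (my own illustration, not taken from the linked post). Both groups are drawn from the same conversion rate, so any "significant" result is a false positive, and the share of such results should land near the chosen significance level. The conversion rate, sample size, number of runs and the pooled two-proportion z-test are all assumptions picked for the example.

```python
import math
import random


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # No variation at all (all or no conversions): nothing to detect.
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def aa_false_positive_rate(rate=0.1, n=500, runs=500, alpha=0.05, seed=0):
    """Run many A/A tests and return the share flagged as significant.

    All default values are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        # Both arms share the same true conversion rate.
        conv_a = sum(rng.random() < rate for _ in range(n))
        conv_b = sum(rng.random() < rate for _ in range(n))
        if two_proportion_p_value(conv_a, n, conv_b, n) < alpha:
            hits += 1
    return hits / runs


if __name__ == "__main__":
    print("share of 'significant' A/A tests:", aa_false_positive_rate())
```

If the share comes out far from alpha on your real pipeline, something in the assignment, logging or test assumptions is probably off.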

5 interesting things (04/06/2015)

Pitfalls when mining Wikipedia – this research was done on somewhat old data (2013), but it shows a fairly easy way to clean Wikipedia dumps. Although they suggest a way to repeat their process, I suspect that the structure of some of the dumps has changed a bit.

An interesting question for me in this scope is the relation between the content and views of the same entities across different languages. This can be calculated by joining page count results with Wikidata dumps once they are complete. Having crunched the Wikidata dumps a bit, at the moment they do not feel mature enough and do not include data about all the articles in Wikipedia.


Weather prediction with Amazon Machine Learning – a nice, simple way to start playing with Amazon Machine Learning.


How to evaluate machine learning models – a series of 5 posts by Dato.


Mean shift clustering – a clustering approach with a few advantages over common methods. E.g., you don’t have to define the number of clusters in advance, but you do have to tune the bandwidth. On the other hand, it is slower than other algorithms.
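
A rough sketch of the idea, assuming a flat (uniform) kernel and 2D points; this is my own toy version, not the code from the linked post. Every point is repeatedly moved to the mean of the data points within `bandwidth` of it, and points that end up at (almost) the same place share a cluster. The merge threshold of `bandwidth / 2` and the sample data are arbitrary choices for illustration.

```python
import math


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift over a list of (x, y) tuples.

    Returns (centers, labels). The number of clusters is not given
    up front; it falls out of the bandwidth.
    """
    modes = []
    for start in points:
        pos = start
        for _ in range(max_iter):
            # Data points inside the window around the current position.
            window = [q for q in points if math.dist(pos, q) <= bandwidth]
            new_pos = (
                sum(q[0] for q in window) / len(window),
                sum(q[1] for q in window) / len(window),
            )
            shift = math.dist(pos, new_pos)
            pos = new_pos
            if shift < tol:
                break
        modes.append(pos)

    # Merge modes that converged to nearly the same location.
    centers, labels = [], []
    for mode in modes:
        for i, center in enumerate(centers):
            if math.dist(mode, center) <= bandwidth / 2:
                labels.append(i)
                break
        else:
            centers.append(mode)
            labels.append(len(centers) - 1)
    return centers, labels


if __name__ == "__main__":
    data = [(0, 0), (0.1, 0.2), (-0.2, 0.1), (5, 5), (5.1, 4.9), (4.8, 5.2)]
    centers, labels = mean_shift(data, bandwidth=1.0)
    print(centers, labels)
```

Each iteration scans all points for every point being shifted, which is where the slowness comes from. Changing `bandwidth` changes how many clusters come out, which is the tuning mentioned above.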


The future of Spark – a post following the Strata + Hadoop conference in London.