Hello Dask – Dask is a few-months-old addition to the PyData toolbox. It aims to enable lightweight parallel computation. The idea behind it is that it builds a graph that describes a data flow and is evaluated lazily. In this post Jake VanderPlas presents his use of Dask on OpenStreetMap data and refers to Rob Story’s talk from the PyData conference, which presents several tools in the PyData ecosystem alongside big data tools (I usually don’t have the patience to watch technical videos, but this one is great).
https://jakevdp.github.io/blog/2015/08/14/out-of-core-dataframes-in-python/
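To make the "graph that is lazily evaluated" idea concrete, here is a minimal pure-Python sketch. It is not Dask itself: the graph is a plain dict of task tuples, loosely modeled on the dict-of-tuples format Dask documents, and the `evaluate` helper and key names are hypothetical, written only for illustration.

```python
from operator import add, mul

# A task graph as a plain dict: each key maps either to a literal value
# or to a task tuple of the form (callable, arg1, arg2, ...).
# String arguments that match a key refer to other nodes in the graph.
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),
    "result": (mul, "sum", 10),
    "unused": (mul, "x", 1000),  # never requested, so never computed
}


def evaluate(graph, key, log):
    """Toy recursive evaluator (an assumption, not Dask's scheduler)."""
    node = graph[key]
    if isinstance(node, tuple) and callable(node[0]):
        func, *args = node
        resolved = [
            evaluate(graph, a, log) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        log.append(key)  # record which tasks actually ran
        return func(*resolved)
    return node  # literal value, nothing to run


# Building the dict above did no work; computation starts only here.
log = []
value = evaluate(graph, "result", log)
print(value, log)
```

Only the tasks that "result" depends on are executed; "unused" is described in the graph but never runs. A real scheduler would additionally cache shared intermediate results and run independent tasks in parallel.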
Benchmarks for ML implementations – A comparison of several open source implementations of machine learning algorithms on a binary classification task. The data set used was the airline data set, and the task was to predict whether a flight was delayed by more than 15 minutes. The results are interesting on several levels: the influence of a bigger training set, the scalability of the tools (Python fails on big training data), and an empirical comparison between different algorithms.
https://github.com/szilard/benchm-ml
Folium – I recently created some visualizations based on geo-data and looked for the perfect (preferably Python) tool for it. Matplotlib has Basemap, and I could also use some Python port to d3.js or Vincent, but for all of them maps and geo-data are just another type of chart, like a bar chart or a pie chart. Folium, which is a wrapper of Leaflet.js, focuses on interactive maps and does it well.
https://github.com/python-visualization/folium
Graph Databases for Beginners: Data Modeling Pitfalls to Avoid – Everything in this post is true, and the pitfalls described there are indeed pitfalls when modeling data, but I don’t think they are unique to graph databases; they are also relevant to relational models.
http://neo4j.com/blog/data-modeling-pitfalls/
Data driven city – I’m fascinated by how public data can help improve life for all of us. It can be governmental data about budgets that exposes irregularities, data about air pollution, education, etc. This data exists in most governments, government offices and municipalities, but it needs to be formatted and cleaned. It takes work to set up and maintain, but it may be profitable. This field might not seem attractive to many entrepreneurs because it is not strictly business, but there are many cities around the world fighting the same difficulties. I believe that the next big jump there is yet to be made, and this post is one example of it.
http://blog.dominodatalab.com/optimizing-chicagos-services-with-the-power-of-analytics/