Visualization – Data scientist toolkit

Data scientists are said to have better development knowledge than the average statistician and better statistical knowledge than the average developer. However, together with those skills one also needs marketing skills – the ability to communicate your not-so-simple job and results to other people. Those people can be the CTO or VP R&D, team members, customers, or sales and marketing people. They don’t necessarily share your knowledge or dive into the details as fast as you do.

One of the best ways to make data and results accessible is creating visualizations – automatically, of course. In this post I’ll review several visualization tools, mostly for Python, with some additional sidekicks.

Matplotlib – probably the best-known Python visualization package. It includes most of the standard charts – bar charts, pie charts, scatter plots, the ability to embed images, etc. Since it has many users, there are many questions, examples, and documentation around the web. However, the downside for me is that it is more complex than it should be. I have used it in several projects and have not yet acquired the intuition to fully utilize it.
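To give a feel for the standard charts mentioned above, here is a minimal matplotlib sketch (made-up data, and the non-interactive Agg backend so it runs headless):

```python
# Minimal matplotlib sketch: a bar chart and a scatter plot saved to a file.
# The data here is made up for illustration.
import matplotlib
matplotlib.use("Agg")  # render to file only, no display needed
import matplotlib.pyplot as plt

labels = ["A", "B", "C"]
values = [3, 7, 5]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.bar(range(len(values)), values)
ax1.set_xticks(range(len(labels)))
ax1.set_xticklabels(labels)
ax1.set_title("Bar chart")

ax2.scatter([1, 2, 3, 4], [1, 4, 9, 16])
ax2.set_title("Scatter plot")

fig.savefig("example.png")
```

Even this small example hints at the complexity mentioned above – figures, axes, ticks, and tick labels are all separate objects that must be wired together by hand.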

Matplotlib has several extensions, including –

graphviz – designated for drawing graphs: graph drawing software with a Python package. pygraphviz is a Python package for Graphviz which provides a drawing layer and graph layout algorithms. The first downside is that you need to install the Graphviz software itself. I have done it several times on several different machines (most of them running Ubuntu); it never went smoothly, and I was not able to do it from the command line alone, which makes it problematic if one wants to deploy on remote machines. I believe it can be done, but at the moment I find this process an irksome overhead.

Sidekicks –

  • PyDot – implements the DOT graph description language. PyDot is basically an interface to the Graphviz dot layout. The main advantage of DOT files is standardization – one can create a DOT file in one process and use it in another. DOT is an intuitive language which focuses on drawing the graph rather than computing it. I would say that it is the last step in the chain.
  • Networkx – a package for creating and manipulating graphs. It implements many graph algorithms, such as shortest path, clustering, minimum spanning tree, etc. Graphs created in NetworkX can be drawn using either matplotlib or pygraphviz, and NetworkX can also export DOT files.
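A minimal NetworkX sketch of the algorithms side (the graph and weights here are made up):

```python
# Build a small weighted graph and run one of NetworkX's built-in algorithms.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("a", "b", 1.0),
    ("b", "c", 2.0),
    ("a", "c", 4.0),
    ("c", "d", 1.0),
])

# Shortest path by total edge weight: a -> b -> c -> d costs 4.0,
# while the direct-looking a -> c -> d costs 5.0.
path = nx.shortest_path(G, "a", "d", weight="weight")
print(path)  # ['a', 'b', 'c', 'd']
```

The same `G` object could then be handed to the drawing layer (matplotlib or pygraphviz) – the analysis and the rendering are cleanly separated.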

Vincent – a relatively new Python visualization package. Vincent translates Python to Vega, which is a visualization grammar. I like it because it is easy, interactive, and simple to output either as JSON or as HTML. However, I’m not sure that Vincent and Vega are mature enough at this point to answer all the needs. It is important to mention that Vega is actually a wrapper above D3, which is an amazing tool with a growing community.
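To illustrate what a "visualization grammar" means, here is a sketch of a Vega-style bar chart specification built as a Python dict – the kind of JSON a tool like Vincent emits. The field names follow the early Vega grammar and are illustrative only; later Vega versions changed the spec format.

```python
# Illustrative sketch of a declarative Vega-style chart spec: data, scales,
# and marks are described as JSON rather than drawn imperatively.
# Field names are an assumption based on early Vega versions.
import json

spec = {
    "width": 300,
    "height": 150,
    "data": [{"name": "table",
              "values": [{"x": "A", "y": 3}, {"x": "B", "y": 7}, {"x": "C", "y": 5}]}],
    "scales": [
        {"name": "x", "type": "ordinal", "range": "width",
         "domain": {"data": "table", "field": "x"}},
        {"name": "y", "range": "height", "nice": True,
         "domain": {"data": "table", "field": "y"}},
    ],
    "marks": [{"type": "rect", "from": {"data": "table"}}],
}

print(json.dumps(spec, indent=2))
```

The appeal is that the whole chart is plain data – easy to generate, store, and ship to a browser where D3 does the actual rendering.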

Additional related tools I’m not (yet) experienced with –

  • xlsxwriter – creates Excel files (xlsx format), including embedding charts in those files.
  • plot.ly – a much-talked-about data collaboration and graphing tool which has a Python client. I try to keep my data as private as possible and don’t want to depend on an internet connection (for example, when creating a graph with a lot of data), so that is the downside of this tool for me. However, the social \ collaborative aspect of this product is also an important part, and graphing is only one aspect of it.
  • Google charts – same downside as plot.ly – I like to be as independent as possible. However, compared to plot.ly it looks more mature and has far more options and chart types at this stage, and there is also a sandbox to play with. Plot.ly has the advantage over Google Charts in ease of use for non-programmers.
  • Bokeh – Nice, interactive charts on large data sets. Maybe the next big thing for plotting in Python.

5 interesting things (2/6/2014)

Tell me your name and I’ll tell you your age – playing with data and a conditional distribution

http://fivethirtyeight.com/features/how-to-tell-someones-age-when-all-you-know-is-her-name/

Analysis of Over 2,000 Computer Science Professors at Top Universities – while this data is interesting, I would say that it is only an introduction. It is not an anthropological study, but it is still curious to ask: how many of those professors are women? What are their ages? Is the median age of a woman when she joins similar to that of a man? What is the distribution over the different sub-fields (i.e. are women more drawn to theory than men)? What is the percentage of foreign professors, or professors who did not take their undergraduate studies in the US?

This is a nice starting point and there are many other things that can be done with this data. 

http://jeffhuang.com/computer_science_professors.html

Good runner, bad investor? – Empirical evidence for \ a variation on Kahneman and Tversky’s “cognitive biases”, but a nice one.

http://www.nytimes.com/2014/04/23/upshot/what-good-marathons-and-bad-investments-have-in-common.html

Link shortening – short, elegant links satisfy both technological constraints (Twitter’s 140-character limit) and the human preference for short links that fit on a line. Apparently those links cause a lot of overhead due to redirection, harm user experience, and are worth a lot of money.

http://t37.net/why-link-shorteners-harm-your-readers-and-destroy-the-web.html

Hacker News aggregation – I’m probably the last one to find out, but for me it is an easier way to follow new links on Hacker News.

http://www.hndigest.com/

5 interesting things (23/5/2014)

Is that a baby bump in your status? Janet Vertesi tried to keep her pregnancy away from the social networks. Did she succeed? What was the price? (High.)

http://time.com/83200/privacy-internet-big-data-opt-out/

Spurious Correlation – “marriages in Alabama are causing deaths by electrocution” – the nightmare of every data scientist. Are two variables correlated? Is one the cause of the other, or is it the other way around? This post takes it one step further –

http://www.popsci.com/article/science/algorithm-reveals-link-between-sour-cream-and-traffic-accidents

History of Machine Learning – another way to view machine learning and the progress made over the years. I’m wondering about the exact starting point \ initial formation of machine learning; I think it is a bit earlier than what is stated there. The reason to read this article is given in Pirkei Avot: “Know from where you came and where you are going and before whom you are destined to give account and reckoning” (3:1).

www.erogol.com/brief-history-machine-learning/

Breaking Python 2 for 1 – I missed the first post when it was published, so I got 2 for 1 now. Two fascinating blog posts about what’s under the hood of CPython: ctypes, the garbage collector, etc. I really appreciate the hands-on approach; it reveals a side of Python which I’m not exposed to in my daily work (and for most projects it is probably not good practice to mess with it).

http://blog.hakril.net/articles/0-understanding-python-by-breaking-it.html

http://blog.hakril.net/articles/1-understanding-python-by-breaking-it—lvl-2.html

Where is Waldo? – find Waldo in just 42 lines of code (not surprising, given that 42 is the meaning of life). I have no background in image processing, but these tools really make it easy, almost out of the box (and it was a great flashback to my childhood).

http://machinelearningmastery.com/using-opencv-python-and-template-matching-to-play-wheres-waldo/

Learning: Spatial database

A while ago I was asked a question about spatial databases and figured out I didn’t know anything about them. So I’m using this platform to document my learning on this topic.

What is Spatial data?

The first lesson in “database 101” is that the design of a database should fit the queries made on it. The special thing about spatial databases is that they hold data about objects in a geometric space. Most traditional databases don’t index the information in a way that is optimized for spatial queries. Examples of such queries are –

  • How far is location A from location B?
  • How to get from location A to location B?
  • What is the closest restaurant to some location?
  • What path do people usually take in some park \ market?
  • Is location A inside city B?
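To make the last query type concrete, here is the classic ray-casting point-in-polygon test in pure Python. A real spatial database answers this with indexed geometric predicates; this sketch only shows the underlying idea, with a made-up rectangular "city":

```python
# "Is location A inside city B?" via ray casting: count how many times a
# horizontal ray from the point crosses the polygon's edges; an odd count
# means the point is inside.

def point_in_polygon(x, y, polygon):
    """polygon is a list of (x, y) vertices; returns True if (x, y) is inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a rightward horizontal ray from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

city = [(0, 0), (4, 0), (4, 3), (0, 3)]  # a rectangular "city"
print(point_in_polygon(2, 1, city))   # True
print(point_in_polygon(5, 1, city))   # False
```

Running this check naively against every city polygon is exactly what a spatial index is designed to avoid.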

We can divide spatial queries into a few groups (based on Wikipedia):

  • Spatial Measurements: Computes line length, polygon area, the distance between geometries, etc.
  • Spatial Functions: Modify existing features to create new ones, for example by providing a buffer around them, intersecting features, etc.
  • Spatial Predicates: Allow true/false queries about spatial relationships between geometries. Examples include “do two polygons overlap?” and “is there a residence located within a mile of the area where we are planning to build the landfill?”
  • Geometry Constructors: Creates new geometries, usually by specifying the vertices (points or nodes) which define the shape.
  • Observer Functions: Queries which return specific information about a feature such as the location of the center of a circle.
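As an example of the first group (spatial measurements), here is the great-circle (haversine) distance between two latitude/longitude points in pure Python. The coordinates and the expected distance are approximate, for illustration only:

```python
# Haversine formula: distance along the Earth's surface between two
# (latitude, longitude) points, treating the Earth as a sphere.
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Tel Aviv to Jerusalem, roughly 54 km as the crow flies
print(haversine_km(32.0853, 34.7818, 31.7683, 35.2137))
```

Spatial databases expose this kind of measurement as a built-in function, so the computation happens close to the index rather than in application code.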

What is next?

This post is actually a preview of two additional posts – a theoretical one and a practical one.

The theoretical one will compare the different spatial indices and the needs each of them answers. That post will talk about the most common index of this kind, the R-tree, some of its extensions, and presumably some additional related theory.
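As a small taste before that post: an R-tree indexes objects by their minimum bounding rectangles (MBRs), and its core primitive is a cheap rectangle-overlap test that lets a query prune whole subtrees. A minimal sketch of that primitive (the rectangles are made up):

```python
# The building block of an R-tree search: do two axis-aligned minimum
# bounding rectangles overlap? If a query rectangle does not overlap a
# node's MBR, the entire subtree under that node can be skipped.

def mbr_overlap(a, b):
    """a and b are rectangles given as (min_x, min_y, max_x, max_y)."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

query = (1, 1, 3, 3)
print(mbr_overlap(query, (2, 2, 5, 5)))  # True  – descend into this subtree
print(mbr_overlap(query, (4, 4, 6, 6)))  # False – whole subtree pruned
```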

The practical one will present some of the current solutions for spatial data, among them PostGIS (used with PostgreSQL), MySQL, and an additional NoSQL solution (probably either Neo4j or MongoDB).

Besides the basic curiosity of learning a new thing, I believe that the field of “location intelligence”, i.e. using and analyzing spatial information as part of the decision-making process, is an emerging field yet to be discovered. In other words, location intelligence is another layer of business intelligence (BI), and as technology evolves and data is gathered, we can now use this data better than ever before.

Some of the current commercial applications of location intelligence –

  • Route planning – such as Waze
  • Geo targeting – for example using IP address to display relevant ads. 
  • Travel planning – hotel, restaurant, attractions such as GetYourGuide
  • Sales analysis – such as SpatialKey

 

5 interesting things (16/5/2014)

MailPin – turn an email into a web page – For me it is a cool tool that I probably won’t use, but it is cool, and that is also an important thing these days.
http://mailp.in/
 
10 years of the LiMux project – “How Munich switched 15,000 PCs from Windows to Linux”. Interesting both on the technological aspect and on the sociological \ anthropological aspect.
 
Why Python is Slow: Looking Under the Hood – I really like posts which help me understand better what I’m doing, and this one is also very well written. Via Python Weekly.
 
How to Marry The Right Girl: A Mathematical Solution – I must admit that I don’t like these kinds of posts, or at least their headers. I feel they are very sexist, not to say misanthropic. However, there was one point in this post which I relate to, but in a different scope – job interviews. Having done some job interviews lately, my gut feeling is that being in the first group really harms the chances of getting hired. Many interviewers don’t really know exactly what they are looking for and refine their requirements only after several interviews. To conclude, the content was interesting but could have been written in a different tone.
 
 
The Economics of Kickstarter projects: Crowd-funding is one of the latest trends. As with everything, there are both advantages and disadvantages. One of the main advantages, as I see it, is the global exposure of one’s ideas. The downside of this advantage is that it makes it easier to steal ideas (some may call it inspiration). Moreover, it is not really clear whether it is always as profitable as expected. More about it in the attached post.

Coursera R programming course

I recently took the “R Programming” course on Coursera. This is the second course I have taken there. Before that I took the “Machine Learning” course, which was a much heavier course with respect to what was taught, the course length, and the assignments and other requirements. A few thoughts:
  • In the world of SciPy, NumPy, Pandas, and others in Python, I don’t really see the advantage of R. Those libraries have almost the same capabilities, while Python is a much stronger and more common, and therefore better supported and documented, tool.
  • Visualizing the data – for me, one of the best practices for gaining a better understanding of the data and introducing it to other people is creating a visualization. I really missed that part in the Coursera course; I think they should have gone that extra mile – it would have made the data processing more meaningful.
  • I’m not an R expert now, not even close. However, I gained some background for the next time I need to handle code in R or consider using R in a project. I saw some similarity to Matlab syntax, so this knowledge might be relevant there too.
  • The language’s background, advantages, disadvantages, limits, theory (scoping, typing), etc. are important, yet asking in a quiz which university the language developers came from is not that interesting.
This is my code repository for this course – not a brilliant implementation, but it does the job – https://github.com/tomron/rprog