5 interesting things (26/08/2015)

Best programming language ever!!!!!1 – I really agree with this guy’s approach and the way he presents it. There is no better tool, only better choices.

http://coding-geek.com/the-best-programming-language/

Apache Spark @ TripAdvisor – the title of this post is “Using Apache Spark for Massively Parallel NLP” but this is a bit misleading. Apache Spark is discussed only in the last third of the post, and it does not dive into the technical details as I expected, nor into how Spark integrates with the rest of their architecture. However, the methodology they present of asking users questions, and the inherent user bias the author points at, are interesting. I’ll wait for the next post, where he says he will talk more about the algorithm.

http://engineering.tripadvisor.com/using-apache-spark-for-massively-parallel-nlp/

Let me guess where you’re from? I love this post because it uses an interesting dataset (a list of people who have Wikipedia pages, by country) and the author explains the process and the tools he used step by step – getting the data, choosing the model, optimizing it, fitting it to production constraints and deploying it. A real pleasure to read.

http://nxn.se/post/127065307170/let-me-guess-where-youre-from

Do you need a data scientist? I agree with many points in this post – data science and data engineering are sometimes mixed up. I think that since “data scientist” has become some kind of buzzword, organizations might try to attract employees with this title. Regarding the point of having a problem to solve vs. wanting to do something cool with data – yes, you have to have concrete tasks for a person hired as a data scientist, but you also have to leave time for innovation and vision based on data, and existing employees don’t always have those capabilities and knowledge.

http://yanirseroussi.com/2015/08/24/you-dont-need-a-data-scientist-yet/

AWS CLI Jungle – this tool is a kind of wrapper for the AWS CLI that aims to make it more intuitive. The lack of wildcard support disturbs me as a user and I am glad they solved it. It currently implements functionality for ec2 and elb; I hope they will also implement s3 functionality soon.

https://github.com/achiku/jungle


Redmonk programming language ranking

I usually don’t find those rankings very interesting, but the Redmonk ranking, which compares programming languages’ popularity on GitHub (by projects) with their popularity on Stack Overflow (by questions), was interesting for me.

What can we see in the chart below – very popular languages (Java, Python, JavaScript, PHP) are very close to the y=x line – their popularity on GitHub is almost the same as their popularity on Stack Overflow. Other languages which are very close to this line are R, Perl, Scala, Shell and Objective-C.

Languages that are below the line are languages that are more popular on GitHub than on Stack Overflow. A very extreme example in this category is VimL (Vim script). Why is that? One possible reason is that those languages are very well documented, or that they have very limited use cases so the question scope is restricted. Another possible reason is the belief that there is no community, so people don’t ask questions and don’t get answers, or ask under another tag. For example, VimL questions are sometimes asked with the Vim tag and without the VimL or vimscript tag. See this search.

The languages above the line are languages which are more popular on Stack Overflow than on GitHub. Why is that? The opposite reasons – languages which are not documented very well, or which add new features frequently, so users have many questions about them. Another possibility is very common languages that are taught in university, so many students who are not very experienced ask a lot about them. A very extreme example of that is SQL. As well, XML is not a programming language and is mostly used for configuration, so it is quite clear why there are no projects on GitHub that are strictly XML. I think that both XQuery and DOT are above the line for the same reason.

It was therefore surprising for me to see CSS very close to the line (by the way, I couldn’t find HTML in the chart), but it possibly depends on the methodology of counting GitHub projects.

Some of the languages are not really programming languages but rather a technology, e.g. Arduino, DOT, XML. It would also be interesting to see such a comparison for general technologies which are not programming languages – MongoDB, Elasticsearch, Hadoop, etc. And to see it over time – maybe documentation gets better, a technology reaches a stable state, etc.

Is Python random really random?

The question I started with, as the title states, was quite simple – “is Python random really random?”

Let’s divide the question into two – Python random and really random.

Python random – a module from the Python standard library which implements pseudo-random number generators for various distributions. This module allows one to generate random integers in a given range, shuffle a sequence, take a sample from a sequence and so on. This post is a sanity check of whether I can trust Python random or not.
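As a quick illustration of those basics (my own snippet, seeded so the runs are reproducible):

```python
import random

random.seed(42)                       # fix the seed so results are reproducible
print(random.randint(1, 6))           # a random integer in the range [1, 6]
deck = list(range(10))
random.shuffle(deck)                  # shuffle a sequence in place
print(deck)
print(random.sample(range(100), 3))   # a sample of 3 distinct values
```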

Really random – this is a matter of statistics. If we have a coin and we tossed it 10 times and got heads 4 times and tails 6 times, what is our confidence that this coin is really fair? And if we got heads 8 times and tails 2 times? Or if we tossed it 100 times and got heads 40 times and tails 60 times?
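That confidence can be made precise with the exact binomial tail probability – the chance that a fair coin gives a result at least as extreme as the one observed. A small sketch of my own (`math.comb` needs Python 3.8+):

```python
from math import comb

def binom_tail(n, k, p=0.5):
    # P(X >= k) for X ~ Binomial(n, p): sum the upper tail of the distribution
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 8 heads out of 10 tosses of a supposedly fair coin
print(binom_tail(10, 8))     # ~0.0547 – suspicious, but not damning
# 80 heads out of 100 tosses – much stronger evidence the coin is biased
print(binom_tail(100, 80))
```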

So of course the bigger the sample is, the higher the confidence we have whether the coin \ dice \ generator is random or not. This test actually asks whether a random sequence Python random produces has a mean like a “real” random distribution. A repeating sequence of 0, 1, 0, 1, … will pass this test but it is clearly not random.
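To make that caveat concrete, here is the alternating sequence passing a mean-only check (my own snippet):

```python
from statistics import mean

alternating = [0, 1] * 50000   # 0, 1, 0, 1, ... – clearly not random
print(mean(alternating))       # 0.5 – exactly what a fair coin would give
```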

from random import randint
from collections import Counter
from math import sqrt, exp, pi

sizes = [10, 100, 1000, 10000, 100000]

def normpdf(x, mean, sd):
    # density of the normal distribution at x
    var = float(sd) ** 2
    denom = (2 * pi * var) ** 0.5
    num = exp(-(float(x) - float(mean)) ** 2 / (2 * var))
    return num / denom

COIN_SIDES = 2
for size in sizes:
    sample = Counter(randint(1, COIN_SIDES) for _ in range(size))
    expected_mean = size / COIN_SIDES
    expected_variance = size * (1.0 / COIN_SIDES) * (1 - 1.0 / COIN_SIDES)
    expected_stdev = sqrt(expected_variance)
    p = 100 * normpdf(max(sample.values()), expected_mean, expected_stdev)
    q = 100 * normpdf(min(sample.values()), expected_mean, expected_stdev)
    print("size: %s, p(observed_mean > %s) = %s%%"
          % (size, "{0:.3f}".format(1.0 / COIN_SIDES), "{0:.3f}".format(p)))
    print("size: %s, p(observed_mean < %s) = %s%%"
          % (size, "{0:.3f}".format(1.0 / COIN_SIDES), "{0:.3f}".format(q)))
    print("samples: %s" % sample)

And the output – for sample size 100000, the probability that the observed mean is greater \ less than 0.5 is 0.251%. For a p-value of 5%, i.e. the probability that the observed mean is less or greater than 0.5, a sample of size ~1000 is enough.

As said before, it is only a small sanity check, but Python random passed it well.

5 interesting things (20/08/2015)

Beautiful Lies – in the good case, charts can be unclear due to too much data, no distinct colors, missing labels etc.; in a worse case they can be misleading. A few tips and examples of such –

http://flowingdata.com/2015/08/11/real-chart-rules-to-follow

RDBMS under the hood – a big part of this post deals with complexity and data structures. This introduction provides the ground for the rest of the post about different processes in the RDBMS, their costs and different optimizations. While I learned a big part of the post’s content as an undergraduate, I wish I had also learned the more interesting part about optimizations.

http://coding-geek.com/how-databases-work/

Scaling and sharding @ Pinterest – another link to a post about RDBMS. RDBMS is sometimes considered less sexy compared to NoSQL and fancy big data solutions. This post explains why Pinterest chose MySQL as their data warehouse solution and how they scale it. Compared to the mumbo jumbo that happens in many companies, the fact that this solution has worked for them for 3 years now is amazing.

https://engineering.pinterest.com/blog/sharding-pinterest-how-we-scaled-our-mysql-fleet/
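One detail I liked in their approach is packing the shard id into each object id, so any id can be routed to its shard without a lookup table. A rough sketch of the idea – the bit widths and field order here are my own illustrative choices, not necessarily Pinterest’s exact layout:

```python
# Illustrative 64-bit id layout: shard id | type id | local id.
SHARD_BITS, TYPE_BITS, LOCAL_BITS = 16, 10, 36

def make_id(shard, type_id, local_id):
    # pack the three fields into a single integer id
    return (shard << (TYPE_BITS + LOCAL_BITS)) | (type_id << LOCAL_BITS) | local_id

def parse_id(obj_id):
    # unpack: the shard id comes straight out of the id, no lookup needed
    local_id = obj_id & ((1 << LOCAL_BITS) - 1)
    type_id = (obj_id >> LOCAL_BITS) & ((1 << TYPE_BITS) - 1)
    shard = obj_id >> (TYPE_BITS + LOCAL_BITS)
    return shard, type_id, local_id

oid = make_id(3429, 1, 7075733)
print(parse_id(oid))   # -> (3429, 1, 7075733)
```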

What Python cannot do – I was actually very optimistic about this post but ended up a bit disappointed. Some of the points it raises are relevant, the others I find less relevant. By less relevant points I refer to disadvantages other programming languages share – Java also has issues with versions. “Several Python modules lack good support” – really? With any open source (and commercial packages too), documentation and quality are never perfect. “Errors can be identified only on run time” – dynamic typing can be both a pro and a con of Python; it is a feature of the language. “Slower than other compiled languages” – this is misleading, as Python is not compiled, and comparing it to C++ and C is weird.

http://www.allaboutweb.biz/what-is-it-that-python-cannot-do/

Visual tutorial of Kalman filters – a very interesting and well-explained tutorial about Kalman filters.

http://www.bzarg.com/p/how-a-kalman-filter-works-in-pictures

5 interesting things (17/08/2015)

Hello Dask – Dask is a few-months-old addition to the PyData toolbox. It aims to enable lightweight parallel computation. The idea behind it is that it produces a graph which describes a data flow and is lazily evaluated. In this post Jake VanderPlas presents his usage of Dask on OpenStreetMap data and refers to Rob Story’s talk from the PyData conference, which presents several tools in the PyData eco-system and big data tools (I usually don’t have patience to watch technical videos but this one is great).

https://jakevdp.github.io/blog/2015/08/14/out-of-core-dataframes-in-python/
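The lazily-evaluated graph idea can be sketched in a few lines of plain Python – a toy of my own, not Dask’s actual API:

```python
class Delayed:
    """A toy node in a lazily-evaluated computation graph (not Dask's API)."""

    def __init__(self, func, *args):
        self.func, self.args = func, args   # nothing is computed yet

    def compute(self):
        # resolve dependencies recursively, then apply the function
        resolved = [a.compute() if isinstance(a, Delayed) else a
                    for a in self.args]
        return self.func(*resolved)

# build a small task graph for (1 + 2) * (3 + 4) – still nothing runs
add = lambda a, b: a + b
total = Delayed(lambda x, y: x * y, Delayed(add, 1, 2), Delayed(add, 3, 4))
print(total.compute())   # evaluation happens only here: 21
```

The point is that building `total` is cheap; work happens only on `compute()`, which is what lets Dask schedule, parallelize and stream the graph.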

Benchmarks for ML implementations – a comparison of several open source implementations of machine learning algorithms on a binary classification task. The data-set used was the airlines data-set, and the task was to predict whether a flight was delayed by more than 15 minutes. The results are interesting on several levels – the influence of a bigger training set, the scalability of tools (Python fails on big training data), and an empirical comparison between different algorithms.

https://github.com/szilard/benchm-ml

Folium – I have recently created some visualizations based on geo-data and looked for the perfect (preferably Python) tool for it. Matplotlib has Basemap, and I could also use some porting to d3.js or Vincent, but for all of them maps \ geo-data are just another type of data, like a bar chart or a pie chart. Folium, which is a wrapper of Leaflet.js, focuses on interactive maps and does it well.

https://github.com/python-visualization/folium

Graph Databases for Beginners: Data Modeling Pitfalls to Avoid – Everything in this post is true and the pitfalls described there are indeed pitfalls when modeling data, but I don’t think they are unique to graph databases; they are also relevant to relational models.

http://neo4j.com/blog/data-modeling-pitfalls/

Data driven city – I’m fascinated by how public data can help improve the life of all of us. It can be governmental data about budgets to expose irregularities, data about air pollution, education, etc. This data exists in most governments, government offices and municipalities, and needs to be formatted and cleaned. It takes work to build and maintain, but it may be profitable. This field might not seem attractive to many entrepreneurs because it is not strictly business, but there are many cities around the world fighting the same difficulties. I believe that the next big jump there is yet to be made, and this post is one example of it.

http://blog.dominodatalab.com/optimizing-chicagos-services-with-the-power-of-analytics/

5 interesting things (10/08/2015)

rrrepo – at first glance it looked to me like a checklist site, and then I understood it is a repository of links. I tried to register, explore and create a site, and I think they have many usability issues; some issues, for example –
  • After registering I expected to get some mail confirming my registration and to be automatically logged into the site. It was unclear how to log in.
  • I wanted to create a repo with a name which already exists and to navigate to it, but I was redirected somewhere else.
  • Exploring new repositories is not very clear.
  • Privacy settings.
  • When logged in, the user is redirected to the news feed and cannot visit the home page or log out.
It gives me a bit of déjà vu of urli.st, which was closed a while ago, but with a nicer UI. It is currently at a very, very early stage, with many improvements to be done.

Data Science blog list – I like to explore new ideas and the thoughts of others, so this is a good starting point. Regarding the previous link (rrrepo), it could be a nice feature to import git repositories like this one, which is mainly links, into a nice dashboard like the ones rrrepo creates.

Composing music with recurrent neural networks – while traditional online music services recommend users existing music, maybe we can just generate new music for users?

Trust me, I’m a data scientist? – Data scientists often explain their results to colleagues, users and clients who are not familiar with the full terminology of data science but want to make sure they can rely on the presented results. This post can help make them trust you.

Streaming 101 – a 2-post series by O’Reilly about streaming systems and data. This is only the first post in the series and it is packed with streaming-related terminology, ideas and processing patterns. A good jump into the water for newbies.

5 interesting things (8/8/2015)

NBA shot charts – Well, I’m in love 🙂 I like visualizations, I like sports, and this post gives both the motivation and the how –

NBA players KNN – going on with the sports theme, Vik Paruchuri tried to predict the points an NBA player will contribute based on his nearest neighbors. I find this analysis a bit lacking – first, his results are not there; was this process successful or not? He looks for the most similar players to LeBron James but does not show who they are; for me that’s interesting to know, some gut feeling about the result. He uses only numeric columns, which is OK, but there are also categorical variables (e.g. position) – why doesn’t he use them as well \ how can he use them in the future? And last but not least, some visualization – for each player show me the KNN, some histogram of the predictions vs. true values. Something… I feel he wrote only half a post.
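The core idea is simple enough to sketch – predict a player’s points as the average of his k nearest neighbors in stat space. The players and numbers below are made up for illustration; this is not Paruchuri’s code (`math.dist` needs Python 3.8+):

```python
from math import dist

# hypothetical rows: (player, [minutes, assists, rebounds], points per game)
players = [
    ("A", [36.0, 7.0, 8.0], 27.0),
    ("B", [34.0, 6.5, 7.5], 25.0),
    ("C", [20.0, 2.0, 3.0], 8.0),
    ("D", [18.0, 1.5, 4.0], 7.0),
    ("E", [30.0, 5.0, 6.0], 19.0),
]

def knn_predict(stats, k=2):
    # average the points of the k players closest in Euclidean distance
    nearest = sorted(players, key=lambda p: dist(p[1], stats))[:k]
    return sum(p[2] for p in nearest) / k

print(knn_predict([35.0, 6.8, 7.8]))   # nearest to A and B -> 26.0 points
```

In a real version the features would be normalized first – here the unscaled minutes column dominates the distance.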

RPython and pandashells – as I see it, both tools make a step toward the other. It’s an interesting question whether they will both survive. As a software developer I don’t usually see the need to switch to R.

https://github.com/robdmc/pandashells

http://kirbyfan64.github.io/posts/the-magic-of-rpython.html

http://rpython.readthedocs.org/en/latest/

Winning as a service – Twitter vs. Twitter, using Twitter’s power (+ coding + creativity 🙂 to win online Twitter lotteries and competitions.

Engineers to managers – this post aroused too many thoughts to TL;DR it in only a few sentences. I’m not sure if the timeline is correct or if I’ll adopt it myself, but the ideas and changes about becoming a manager are inspiring and real food for thought for me.

Music Information Retrieval from a Multicultural Perspective

Yesterday I attended a “Music Information Retrieval from a Multicultural Perspective” meetup at the SoundCloud offices. It was one of the most thought-provoking meetups I have attended in a long time. Prof. Xavier Serra, who gave the talk, has a lot of experience with music information retrieval and with community projects alongside industrial projects – Dunya, AcousticBrainz, CompMusic and more.

For me music information retrieval is a completely new domain. I had several thoughts about it in the past, such as – when you upload a clip to YouTube, how do they make sure you don’t violate any copyright? How do they do it fast and scalably? How do you recommend someone new music? Can you just design it like any other recommendation system? How do you process the signal to differentiate and recognize the different instruments? And so on.
As a data scientist and software engineer, it is clear to me that when choosing which features to extract, we are skewed. We have a gut feeling and we usually follow it. Even when choosing which algorithm or tool to use, we usually turn either to something we have already succeeded with and have experience in, or to a new algorithm \ technology we want to learn. We are sometimes blind to those biases.
Big parts of software development and data science are done in the western world, i.e. the US, Europe, etc., so there are cultural biases in the engineers and the way they think. Our gut feeling is biased. In this case our gut feeling is biased toward how we in the western world understand what music is and how it is built – scores, notes, intonation, etc. Given the examples in the lecture and the stories behind them, it is clear that the way music is interpreted and experienced differs between cultures.
It is quite shocking to think about science that way – although we think we are quite rational and following some scientific \ research \ independent process, we are biased to begin with. The implications for me are immediate – how recommendation systems work (not only for music). Simply think of movies: maybe the categorization into genres is foreign to other cultures? Maybe they care about different features in products?
I’ll try to bear it in mind next time I design a recommendation system.

Data Driven – creating a data culture

I have just read “Data Driven – creating a data culture” by Hilary Mason and DJ Patil.

While computer science is considered a new science or a new discipline, data science is an even newer one. Well, is it? Is it a discipline? What are the goals of data scientists and data science teams within different organizations? Where should they be placed? What are their responsibilities and how is their success measured?

This reminded me of the course “Scientific Thinking” I took as an undergrad, and specifically of Weber’s “Wissenschaft als Beruf” and Kuhn’s “The Structure of Scientific Revolutions”.

For me it seems that data science is in the phase of Kuhn’s pre-paradigm, and Mason and Patil try to move it into the phase of normal science. They discuss the place and limitations of data science and the characteristics of data scientists. Maybe most importantly, they talk about how to measure success and how to set goals for the data science process.

So what about Weber?

In this lecture, Weber talks about the advantages of choosing an academic career versus an industry career. He also compares the university systems in Europe and in the US. Almost 100 years have passed since then and the world has slightly changed, including the role of work in our life. I think that for many people, working in the industry as a data scientist has both the advantages of being a scientist (doing some research) and of working in the industry (salary, technology)… But I’m not sure if it is really a “science”. For me personally this title or definition is not important – I love what I do and it creates measurable value for the organizations I work for.

I see many of the processes suggested in this book as trying to bridge between academic life and industry life.

Overall, this handbook was OK, but I expected a bit more.