5 interesting things (23/6/2014)

Celery best practices – the Python Celery package is on my “todo list” for when I have a relevant task. However, some of the thoughts and ideas discussed in this post are more general software engineering advice.

https://denibertovic.com/posts/celery-best-practices/

Keep calm and learn d3.js – d3.js is also something on my “todo list”. I have used it here and there but would like to use it more thoroughly (and also gain some deeper experience with JS). This is a very nice tutorial to get started with d3.js.

http://slides.com/kentenglish/getting-started-with-d3-js-using-the-toronto-parking-ticket-data

DBSCAN well explained – unsupervised learning sometimes feels like no man’s land, and k-means is almost always the choice when picking a clustering algorithm. This post not only does a great job of explaining the algorithm itself but also gives great examples and shows how to tune the parameters.

http://cjauvin.blogspot.ca/2014/06/dbscan-blues.html
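To get a feel for the two parameters the post tunes, here is a minimal scikit-learn sketch (the data, `eps` and `min_samples` values are made up for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one far-away outlier (toy data)
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.2, (20, 2)),
               rng.normal(5, 0.2, (20, 2)),
               [[100.0, 100.0]]])

# eps = neighborhood radius; min_samples = neighbors needed for a core point
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))  # two clusters (0, 1) and the outlier labeled -1
```

Unlike k-means, there is no need to specify the number of clusters up front, and points that fit nowhere are explicitly marked as noise (label -1).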

Machine learning in Airbnb – this post is a case study of machine learning as it is used in Airbnb. I like reading such posts because it is interesting to learn about the challenges other organizations face, their solutions, and how they take existing tools and packages and adjust and optimize them to their needs (in this post – scikit-learn, and rewriting an R export function in C++).

http://nerds.airbnb.com/architecting-machine-learning-system-risk/

Current events – world cup predictions – as I love both sports and machine learning (and statistics and so forth), I follow the several posts which try to apply ML techniques to predicting World Cup results. The most interesting post I have read so far is Nate Silver’s. It takes many features into account and explains them clearly.

http://fivethirtyeight.com/features/its-brazils-world-cup-to-lose/


Andrew Ng and PyStruct meetup

Yesterday I attended the “Andrew Ng and pyStruct” meetup. 

http://www.meetup.com/berlin-machine-learning/events/179264562/

I was lucky enough to get a spot at the meetup, thanks to the Germany-Portugal game that took place at the same time 🙂

The first part, by Andrew Ng, was a video meetup joining three locations – Paris, Zurich and Berlin. Andrew Ng is a co-founder of Coursera and a machine learning guru. He teaches the ML course on Coursera, which is one of the platform’s most popular courses (I took it myself – it is a very good, well-structured introduction to machine learning, and a new session started yesterday). He teaches at Stanford, and soon he will be leaving for Baidu Research.

The talk included 15-20 minutes of introduction to deep learning, recent results, applications and challenges. He mainly focused on scaling up deep learning algorithms to billions of features \ properties. The rest of the talk was Q&A, mostly regarding the theoretical aspects of deep learning, future challenges, etc. For me, one of the most important things he said was that “innovation is a result of team work”.

Some well-known applications of deep learning are speech recognition, image processing, etc.

In the end he suggested taking the Stanford deep learning tutorial – http://deeplearning.stanford.edu/tutorial/

T.R – there are currently two Python packages I know of which deal with deep learning –

 The next talk was given by Andreas Mueller. You can find his slides here.

Mueller introduced structured prediction, which is a natural extension, or generalization, of regression problems to a structured output rather than just a number \ class. Structured learning has an advantage over other supervised learning algorithms in that it can learn several properties at once and exploit the correlations between those properties.

Example – customer data with several properties per customer: gender, marital status, has children, owns a car, etc. One can guess that the “married” and “has children” properties are highly correlated, and that learning those properties together gives a better chance of getting good results. It is better than LDA in the sense that it has fewer classes (not every combination of the variables is a class) and it requires less training data.
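The correlation intuition can be seen with plain numpy on a made-up customer table (the numbers are invented just to illustrate; this is not PyStruct itself):

```python
import numpy as np

# Toy data: each row is a customer, columns = (married, has_children)
y = np.array([[1, 1], [1, 1], [1, 0],
              [0, 0], [0, 0], [0, 1]])

# Independent model: predict each label on its own
p_married = y[:, 0].mean()           # 0.5
p_children = y[:, 1].mean()          # 0.5
p_indep = p_married * p_children     # 0.25 if the labels were independent

# Joint (structured) view: empirical P(married AND has_children)
p_joint = np.mean((y[:, 0] == 1) & (y[:, 1] == 1))  # 2/6 ~ 0.33

print(p_indep, p_joint)
```

Since the joint probability is larger than the product of the marginals, the labels carry information about each other – exactly what a structured model can use and an independent per-label classifier throws away.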

Other examples include pixel classification – classifying each pixel in an image to an object – OCR, etc.

He then talked about PyStruct – a Python package for structured prediction. Actually, there is not much to add that is not already written in the documentation.

5 interesting things (14/6/2014)

My little helper – a search engine for code examples and use cases. I haven’t yet tried it “live” when I needed something, but I hope I will remember it next time.

https://sourcegraph.com/

Httpie and Percol – two unrelated tools, but I see a lot of similarity between them, as both try to change the common way we do things on the command line. Httpie tries to make curl-style requests more human-understandable, and percol tries to make filtering with pipes more interactive. Reminds me a bit of editing in Sublime.

https://github.com/jakubroztocil/httpie

https://github.com/mooz/percol


Python 3 is good for you – a bit long but very interesting overview of 10 features new in Python 3. There have recently been a lot of posts around the web discussing whether Python 3 is better than Python 2.x, whether Python 3 should be rolled back and buried forever, etc. This is one of the most informative posts I have read (although it could be shorter and better summarized). I think one of the main reasons organizations don’t currently move to Python 3 – besides the fact that people and organizations don’t love change – is that it is an expensive process (mainly compatibility), and even this post does not succeed in making a convincing case for its added value.

http://asmeurer.github.io/python3-presentation/slides.html#1

Are we humans or are we dancers… sorry, computers

http://wired.com/2014/01/how-to-hack-okcupid/all

Kernel tricks – a very clear post about the kernel trick, which also clarifies additional machine learning terminology; the examples are very good. I would say this is a very good post for beginners-to-intermediates in machine learning. Going the extra mile would be writing something similar about PCA, as it involves a lot of similar ideas.

http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html
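The core idea can be checked numerically in a few lines: for the polynomial kernel (x·y + 1)², computing the kernel directly gives the same value as an inner product in an explicit 6-dimensional feature space, without ever building that space (the sample vectors here are arbitrary):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the 2-D polynomial kernel (x.y + 1)^2
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def poly_kernel(x, y):
    # Same inner product, computed in the original 2-D space
    return (np.dot(x, y) + 1) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(y)), poly_kernel(x, y))  # both are 4.0
```

This is why kernelized algorithms can work in very high (even infinite) dimensional feature spaces at the cost of the original dimension.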

Visualization – Data scientist toolkit

Data scientists are said to have better development knowledge than the average statistician and better statistics knowledge than the average developer. However, together with those skills one also needs marketing skills – the ability to communicate your not-so-simple job and results to other people. Those people can be the CTO or VP R&D, team members, customers, or sales and marketing people. They don’t necessarily share your knowledge or dive into the details as fast as you do.

One of the best ways to make data and results accessible is creating visualizations – automatically, of course. In this post I’ll review several visualization tools, mostly for Python, with some additional sidekicks.

Matplotlib – probably the best-known Python visualization package. It includes most of the standard charts – bar charts, pie charts, scatter plots – plus the ability to embed images, etc. Since it has so many users, there are many questions, examples and docs around the web. However, the downside for me is that it is more complex than it should be. I have used it in several projects and have not yet acquired the intuition to fully utilize it.

Matplotlib has several extensions, including –

graphviz – designated for drawing graphs: graph-drawing software with a Python package. pygraphviz is a Python package for graphviz which provides a drawing layer and graph layout algorithms. The first downside is that you need to install the graphviz software itself. I have done it several times on several different machines (most of them running Ubuntu); it never went smoothly, and I was not able to do it from the command line alone, which makes it problematic if one wants to deploy it on remote machines. I believe it can be done, but at the moment I find this process an irksome overhead.

Sidekicks –

  • PyDot – implements the DOT graph description language. PyDot is basically an interface for interacting with Graphviz’s dot layout. The main advantage of DOT files is standardization – one can create a DOT file in one process and use it in another. DOT is an intuitive language which focuses on drawing the graph rather than computing it. I would say it is the last step in the chain.
  • Networkx – a package for creating and manipulating graphs. It implements many graph algorithms, such as shortest path, clustering, minimum spanning tree, etc. Graphs created in Networkx can be drawn using either matplotlib or pygraphviz, and can also be exported as DOT files.
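A small Networkx sketch of the algorithmic side (the graph and its edge weights are invented for the example):

```python
import networkx as nx

# Tiny weighted toy graph
G = nx.Graph()
G.add_weighted_edges_from([("a", "b", 1.0), ("b", "c", 2.0),
                           ("a", "c", 5.0), ("c", "d", 1.0)])

# Cheapest route a -> d goes through b (cost 4) rather than
# taking the direct, expensive a-c edge (cost 6)
path = nx.shortest_path(G, "a", "d", weight="weight")
print(path)  # ['a', 'b', 'c', 'd']
```

The same `G` object can then be handed to matplotlib or pygraphviz for drawing, which is what makes the package a convenient hub for graph work.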

Vincent – a relatively new Python visualization package. Vincent translates Python to Vega, which is a visualization grammar. I like it because it is easy and interactive, and it is simple to output either JSON or HTML. However, I’m not sure that Vincent and Vega are mature enough at this point to answer all needs. It is worth mentioning that Vega is actually a wrapper above D3, which is an amazing tool with a growing community.

Additional related tools I’m not (yet) experienced with –

  • xlsxwriter – creates Excel files (xlsx format), including embedding charts in those files.
  • plot.ly – a much-talked-about collaborative data and graphing tool which has a Python client. I try to keep my data as private as possible and don’t want to depend on an internet connection (for example, when creating a graph with a lot of data), so that is the downside of this tool for me. However, the social \ collaborative aspect of the product is also an important part, and graphing is only one aspect of it.
  • Google Charts – same downside as plot.ly: I like to be as independent as possible. However, compared to plot.ly it looks more mature and has far more options and chart types at this stage, and there is also a sandbox to play with. Plot.ly has an advantage over Google Charts in ease of use for non-programmers.
  • Bokeh – Nice, interactive charts on large data sets. Maybe the next big thing for plotting in Python.

5 interesting things (2/6/2014)

Tell me your name and I’ll tell you your age – playing with data and a conditional distribution.

http://fivethirtyeight.com/features/how-to-tell-someones-age-when-all-you-know-is-her-name/

Analysis of Over 2,000 Computer Science Professors at Top Universities – while this data is interesting, I would say it is only an introduction. It is not an anthropological study, but it is still curious to ask: how many of those professors are women? What are their ages? Is the median age of a woman when she joins similar to that of a man? What is the distribution over the different sub-fields (i.e., do women lean more toward theory than men)? What is the percentage of foreign professors, or of professors who did not do their undergraduate studies in the US?

This is a nice starting point and there are many other things that can be done with this data. 

http://jeffhuang.com/computer_science_professors.html

Good runner, bad investor? – an empirical variation on Kahneman and Tversky’s “cognitive biases”, but a nice one.

http://www.nytimes.com/2014/04/23/upshot/what-good-marathons-and-bad-investments-have-in-common.html

Link shortening – short, elegant links satisfy both technological constraints (Twitter’s 140-character limit) and the human preference for short links that fit on a line. Apparently those links cause a lot of overhead due to redirection, harm user experience, and are worth a lot of money.

http://t37.net/why-link-shorteners-harm-your-readers-and-destroy-the-web.html

Hacker News aggregation – I’m probably the last one to find out about it, but for me it is an easier way to follow new links on Hacker News.

http://www.hndigest.com/