EuroPython 2014 – Python under the hood

This is a summary of a few talks that deal with “under the hood” topics; the talks partially overlap. The topics include memory allocation and management, inheritance, overridden built-in methods, etc.

Relevant talks –
The magic of attribute access by Petr Viktorin
Performance Python for Numerical Algorithms by Yves
Metaprogramming, from Decorators to Macros by Andrea Crotti
Everything you always wanted to know about Memory in Python but were afraid to ask by Piotr Przymus
Practical summary –
  • __slots__ attribute – limits the memory allocated for objects in Python by replacing the per-instance __dict__ with a fixed set of attributes.
  • Strings – the empty string and strings of length 1 are saved as constants. Use intern (or sys.intern in Python 3.x) on strings to avoid allocating several string objects with the same value; this makes memory usage more efficient and string comparison quicker. More about this topic here.
  • Numerical algorithms – the way a 2-dimensional array is allocated and stored (row-wise or column-wise) has a great impact on the performance of different algorithms (even for a simple sum function).
  • Working with the GPU – there are packages that process some of the data on the GPU. It is efficient when the data is big and less efficient when the data is small, since copying all the data to the GPU has some overhead.
  • CPython uses the C “malloc” function to re-allocate space when lists, dictionaries and sets grow or shrink. On one hand this function can be overridden; on the other hand one can try to avoid costly operations that trigger re-allocation, or use more efficient data structures, e.g. a list instead of a dictionary where possible.
  • Note the garbage collector! Python's memory management is based on reference counting, backed by a collector for reference cycles. Overriding the __del__ method may disrupt the garbage collector.
  • Suggested profiling and monitoring tools – psutil, memory_profiler, objgraph, RunSnakeRun + Meliae, valgrind
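To make the __slots__ and intern points concrete, here is a minimal sketch using only the standard library. The class names and the sample string are made up for illustration and do not come from the talks; actual memory savings depend on the Python version and are not measured here.

```python
import sys


class PointDict:
    """Regular class: every instance carries its own __dict__."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


class PointSlots:
    """Same data, but __slots__ replaces the per-instance __dict__."""

    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


regular = PointDict(1, 2)
slotted = PointSlots(1, 2)

has_dict_regular = hasattr(regular, "__dict__")
has_dict_slotted = hasattr(slotted, "__dict__")

# A slotted instance rejects attributes that are not listed in __slots__.
try:
    slotted.z = 3
    extra_attribute_allowed = True
except AttributeError:
    extra_attribute_allowed = False

# Build two equal strings at run time so they start as separate objects,
# then intern them: equal interned strings are one shared object.
first = sys.intern("".join(["euro", "python", " 2014"]))
second = sys.intern("".join(["euro", "python", " 2014"]))
same_object = first is second

print(has_dict_regular, has_dict_slotted, extra_attribute_allowed, same_object)
# prints: True False False True
```

The trade-off with __slots__ is flexibility: the attribute set is fixed up front, so it fits best for small classes that are instantiated many times.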
Bottom line of all of those – “knowledge itself is power”, i.e. knowing the internals and the impact of what we are doing can bring significant improvements.
There are always several ways to do things, and each has pros and cons that fit specific cases. Some of those are simple to implement and use and can contribute to a great improvement in both running time and memory usage. On the other hand, some of those suggestions are really a “shot in the foot”, causing memory leaks and other unexpected behavior, so beware.

Graph Databases, a little connected tour

EuroPython talk by: Francisco Fernández Castaño
 
 
The talk was classified as novice and so it was – very basic graph database ideas.
 
Castaño started by presenting the general idea of graph databases and showing some use cases.
 
One of the best-known use cases – social media data, e.g. friends-of-friends.
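To illustrate what a friends-of-friends query computes, here is a rough sketch in plain Python over an adjacency dictionary. It does not use Neo4j or Cypher; the function name and the sample data are invented for this example.

```python
def friends_of_friends(graph, person):
    """Return people two hops away who are not already direct friends."""
    direct = graph.get(person, set())
    two_hops = set()
    for friend in direct:
        two_hops |= graph.get(friend, set())
    # Drop the person themselves and anyone they already know.
    return two_hops - direct - {person}


# Hypothetical undirected friendship graph.
friendships = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave", "erin"},
    "dave": {"bob", "carol"},
    "erin": {"carol"},
}

suggestions = friends_of_friends(friendships, "alice")
print(sorted(suggestions))  # prints: ['dave', 'erin']
```

A graph database expresses this kind of traversal directly in its query language instead of in hand-written loops.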
 
Then he presented Neo4j, which is a graph database written in Java but originally written in Python. Neo4j is known to be very scalable and to support ACID transactions. Another nice property of Neo4j is the ability to extend the given REST API for your own needs.
 
The next part of the talk focused on the Cypher language, which is the way to query the Neo4j database. Neo4j has some nice UI properties and I really missed some reference or example for that in this talk.
 
Last but not least, a tip given in the talk: you can try Neo4j without having to install it on your machine – http://www.graphenedb.com/ (free sandbox of up to 1k nodes and 10k edges).

Log everything with logstash and elasticsearch

EuroPython talk by: Peter Hoffmann, @peterhoffmann

 
A talk by Peter Hoffmann from Blue Yonder, which is one of the sponsors of the conference.

 

Another very good talk by an experienced speaker. However, the name is kind of misleading. Yes, logstash and elasticsearch were mentioned, but the main concept of the talk was really the logging chain and centralized logging; logstash and elasticsearch are two tools along this chain, and there are alternatives for every step of it.

This talk gave me a lot to think about regarding logging at the company level and how it should be run, monitored, etc. There is also some tension to resolve / define about what is logged and what error messages / outputs an app should provide (“logging best practices”).

The video is already available on the EuroPython site and I hope that the slides will be available soon too.
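As a small illustration of the first link in such a logging chain, here is a sketch of an application that emits each log record as one JSON object, a format that log shippers can usually parse without custom patterns. It uses only the standard library; the field names and the logger name are my own assumptions, not something shown in the talk.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        event = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(event)


# An in-memory stream stands in for a log file or a socket to a log shipper.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("shop.orders")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example output out of the root logger

logger.info("order %s accepted", 42)

event = json.loads(stream.getvalue())
print(event["level"], event["message"])  # prints: INFO order 42 accepted
```

Deciding on such a fixed set of fields up front is one way to approach the “what should an app log” question.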

Full Stack Python

EuroPython talk by: Matt Makai, @mattmakai 

 
I expected this talk to be full of buzzwords and it was… but in a good sense. 

 
Makai is the builder of the Full Stack Python site; as such he spoke about what you need from the moment you have an idea for a Python web app until you deploy it, including all the essential steps – WSGI server, hosting, logging, etc.
 
Every layer on the site includes relevant links and tutorials. A good starting point for a Python web-app developer.
 
The talk was interesting due to Makai’s enthusiasm to teach and to share his knowledge (and of course to promote his site, and that’s legit as well) and his professional knowledge. There are really few people who can speak so fluently for 25 minutes.
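For readers who have not met the WSGI layer yet, here is a minimal sketch of a WSGI application, called directly the way a WSGI server would call it. It relies only on the standard library's wsgiref helpers; the path and response text are made up, and this is not taken from the talk or the site.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A WSGI app is a callable taking the request environ and a callback."""
    path = environ.get("PATH_INFO", "/")
    body = "Hello from {}\n".format(path).encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


# Simulate what a WSGI server does: build an environ and call the app.
environ = {}
setup_testing_defaults(environ)
environ["PATH_INFO"] = "/ideas"  # hypothetical request path

captured = {}


def start_response(status, headers):
    captured["status"] = status
    captured["headers"] = headers


response_body = b"".join(application(environ, start_response))
print(captured["status"], response_body)
# prints: 200 OK b'Hello from /ideas\n'
```

For a quick local try, the same callable can be served with wsgiref.simple_server.make_server; a deployed app would normally sit behind a dedicated WSGI server instead.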
 

Andrew Ng and PyStruct meetup

Yesterday I attended the “Andrew Ng and pyStruct” meetup. 

http://www.meetup.com/berlin-machine-learning/events/179264562/

I was lucky enough to get a place at the meetup thanks to the Germany-Portugal game that took place at the same time 🙂

The first part, by Andrew Ng, was a video meetup joining 3 locations – Paris, Zurich and Berlin. Andrew Ng is a co-founder of Coursera and a Machine Learning guru. He teaches the ML course on Coursera, which is one of the most popular courses there (I took it myself and it is a very good and structured introduction to machine learning; a new session started yesterday). He teaches at Stanford and will soon be leaving for Baidu research.

The talk included 15-20 minutes of introduction to deep learning, recent results, applications and challenges. He mainly focused on scaling up deep learning algorithms to use billions of features / properties. The rest of the talk was a question-and-answer session, mostly regarding the theoretical aspects of deep learning, future challenges, etc. For me one of the most important things he said was “innovation is a result of team work”.

Some known applications of deep learning are speech recognition, image processing, etc.

In the end he suggested taking the Stanford deep learning tutorial – http://deeplearning.stanford.edu/tutorial/

T.R – there are currently 2 python packages I know which deal with deep learning – 

 The next talk was given by Andreas Mueller. You can find his slides here.

Mueller introduced structured prediction, which is a natural extension or generalization of regression problems to a structured output rather than just a number / class. Structured learning has an advantage over other supervised learning algorithms as it can learn several properties at once and use the correlations between those properties.

Example – customer data with several properties of customers – gender, marital status, has children, owns a car, etc. One can guess that the married and has-children properties are highly correlated and that when learning about those properties together there is a better chance of getting good results. It is better than LDA in the sense that it has fewer classes (not every combination of the variables is a class) and it requires less training data.

Other examples include pixel classification (assigning each pixel to an object in the image), OCR, etc.

He then talked about PyStruct – a Python package for structured prediction. Actually there is not much to add that is not written in the documentation. 

Visualization – Data scientist toolkit

Data scientists are said to have better development knowledge than the average statistician and better statistical knowledge than the average developer. However, together with those skills one also needs marketing skills – the ability to communicate your not-so-simple job and results to other people. Those people can be the CTO or VP R&D, team members, customers or sales and marketing people. They don’t necessarily share your knowledge or dive into the details as fast as you.

One of the best ways to make data and results accessible is creating visualizations, automatically of course. In this post I’ll review several visualization tools, mostly for Python, with some additional sidekicks.

Matplotlib – probably the best-known Python visualization package. Includes most of the standard charts – bar charts, pie charts, scatter plots, the ability to embed images, etc. Since many people use it, there are many questions, examples and pieces of documentation around the web. However, the downside for me is that it is more complex than it should be. I have used it in several projects and I have not yet acquired the intuition to fully utilize it.

Matplotlib has several extensions including –

graphviz – graph drawing software with a Python package. pygraphviz is a Python package for graphviz which provides a drawing layer and graph layout algorithms. The first downside is that you need to download the graphviz software. I have done it several times on several different machines (most of them running Ubuntu); it never went smoothly and I was not able to do it from the command line alone, which makes it problematic if one wants to deploy it on remote machines. I believe that it can be done, but at the moment I find this process an irksome overhead.

Sidekicks –

  • PyDot – implements the DOT graph description language. PyDot is basically an interface to interact with the PyGraphviz dot layout. The main advantage of DOT files and data is standardization – one can create a DOT file in one process and use it in another process. DOT is an intuitive language which focuses on drawing the graph and not on calculating the graph. I would say that it is the last step in the chain.
  • Networkx – a package for working with and manipulating graphs. Implements many graph algorithms such as shortest path, clustering, minimum spanning tree, etc. The graphs created in Networkx can be drawn using either matplotlib or pygraphviz, and it can also create DOT files.
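To show how simple the DOT format is, here is a sketch that writes DOT text for an undirected graph by hand, without PyDot or any other package. The function name and the edge list are invented for the example, and node names containing double quotes would need escaping, which this sketch does not handle.

```python
def to_dot(edges, name="G"):
    """Build DOT text for an undirected graph from (source, target) pairs."""
    lines = ["graph {} {{".format(name)]
    for source, target in edges:
        # "--" is the DOT edge operator for undirected graphs.
        lines.append('    "{}" -- "{}";'.format(source, target))
    lines.append("}")
    return "\n".join(lines)


# Hypothetical triangle graph.
edges = [("a", "b"), ("b", "c"), ("a", "c")]
dot_text = to_dot(edges)
print(dot_text)
```

The resulting text can be saved to a file and handed to any DOT-aware tool, for example rendered with the Graphviz command `dot -Tpng graph.dot -o graph.png`.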

Vincent – a relatively new Python visualization package. Vincent translates Python to Vega, which is a visualization grammar. I like it because it is easy, interactive and simple to output either as JSON or as HTML. However, I’m not sure that Vincent and Vega are mature enough at this point to answer all needs. It is important to mention that Vega is actually a wrapper above D3, which is an amazing tool with a growing community.

Additional related tools I’m not (yet) experienced with –

  • xlsxwriter – creating Excel files (xlsx format), including embedding charts in those files.
  • plot.ly – a much-talked-about data collaboration and graphing tool which has a Python client. I try to keep my data as private as possible and don’t want to be dependent on an internet connection (for example, when creating a graph with a lot of data), so this is the downside of this tool for me. However, the social / collaborative aspect of this product is also an important part, and the graphing is only one aspect of it.
  • Google charts – same downside as plot.ly – I like to be as independent as possible. However, compared to plot.ly it looks more mature and has far more options and chart types at this stage, and there is also a sandbox to play with. Plot.ly has an advantage over Google charts in ease of use for non-programmers.
  • Bokeh – Nice, interactive charts on large data sets. Maybe the next big thing for plotting in Python.