setdefault vs get vs defaultdict

You have a python dictionary, you want to get the value of specific key in the dictionary, so far so good, right?

And then a KeyError –

Traceback (most recent call last):
File “<stdin>”, line 1, in <module>
KeyError: 1 

Hmmm, well if this key does not exist in the dictionary I can use some default value like None, 10, empty string. What’s my options of doing so?

I can think of 3 –

  • get method
  • setdefault method
  • defaultdict data structure
    get method

Let’s investigate first –

key, value = "key", "value"
data = {}
x = data.get(key,value)
print x, data #value {}
data= {}
x = data.setdefault(key,value)
print x, data #value {'key': 'value'}

Well, we get almost the same result, x obtains the same value and in get data is not changed while in setdefault data changes. When does it become a problem?

key, value = "key", "value"
data = {}
x = data.get(key,[])append(value)
print x, data #None {}
data= {}
x = data.setdefault(key,[]).append(value)
print x, data None {'key': ['value']}

So, when we are dealing with mutable data types the difference is clearer and error prone.

When to use each? mainly depends on the content of your dictionary and its’ size.

We can time the differences but it does not really matter as they produce different output and it was not significant for any direction anyhow.

And for defaultdict –

from collections import defaultdict
data = defaultdict(list)
print data[key] #[]
data[key].append(value)
print data[key] #['value']

setdefault sets the default value to a specific key we access to while defaultdict is the type of the data variable and set this default value to every key we access to.

So, if we get roughly the same result I timed the processes for several dictionary sizes (left most column) and run each 1000 times (code below) –

dict size default value method time
100 list setdefault 0.0229508876801
defaultdict 0.0204179286957
set setdefault 0.0209970474243
defaultdict 0.0194549560547
int setdefault 0.0236239433289
defaultdict 0.0225579738617
string setdefault 0.020693063736
defaultdict 0.0240340232849
10000 list setdefault 2.09283614159
defaultdict 2.31266093254
set setdefault 2.12825512886
defaultdict 3.43549799919
int setdefault 2.04997992516
defaultdict 1.87312483788
“” setdefault 2.05423784256
defaultdict 1.93679213524
100000 list setdefault 22.4799249172
defaultdict 29.7850298882
set setdefault 23.5321040154
defaultdict 41.7523541451
int setdefault 26.6693091393
defaultdict 23.1293339729
string setdefault 26.4119689465
defaultdict 23.6694099903

Conclusions and summary –

  • Working with sets is almost always more expensive time-wise than working with lists
  • As the dictionary size grows simple types – string and int perform better with defaultdict then with setdefault while set and list perform worse.
  • Main conclusion – choosing between defaultdict and setdefault also mainly depends in the type of the default value.
  • In this test I tested a particular use case – accessing each key twice. Different use cases \ distributions such as assignment, accessing to the same key over and over again, etc. may have different properties.
  • There is no firm conclusion here just investigating some of interpreter capabilities.

Code –

import timeit
from collections import defaultdict
from itertools import product

def measure_setdefault(n, defaultvalue):
 data = {}
 for i in xrange(0,n):
 x = data.setdefault(i,defaultvalue)
 for i in xrange(0,n):
 x = data.setdefault(i,defaultvalue)

def measure_defaultdict(n,defaultvalue):
 data = defaultdict(type(defaultvalue))
 for i in xrange(0,n):
 x = data[i]
 for i in xrange(0,n):
 x = data[i]

if __name__ == '__main__':
 import timeit
 number = 1000
 dict_sizes = [100,10000, 100000]
 defaultvalues = [[], 0, "", set()]
 for dict_size, defaultvalue in product(dict_sizes, defaultvalues):
 print "dict_size: ", dict_size, " defaultvalue: ", type(defaultvalue)
 print "\tsetdefault:", timeit.timeit("measure_setdefault(dict_size, defaultvalue)", setup="from __main__ import measure_setdefault, dict_size, defaultvalue", number=number)
 print "\\tdefaultdict:", timeit.timeit("measure_defaultdict(dict_size, defaultvalue)", setup="from __main__ import measure_defaultdict, dict_size, defaultvalue", number=number)

EuroPython 2014 – Python under the hood

This is actually a summary of few talks which deals with “under the hood” topics, the talks partially overlap. Those topics include – memory allocation and management, inheritance, over-ridden built-in methods, etc.

Relevant talks –
The magic of attribute access by Petr Viktorin
Performance Python for Numerical Algorithms by Yves
Metaprogramming, from Decorators to Macros by Andrea Crotti
Everything you always wanted to know about Memory in Python but were afraid to ask by Piotr Przymus
Practical summary –
  • __slots__ argument – limited the memory allocation for objects in Python by overriding the __dict__ attribute.
  • Strings – the empty strings and strings of length=1 are saved as constants. Use intern (or sys.intern in python 3.x) on strings to avoid allocating string variables with the same values this will help making memory usage more efficiency and quicker string comparison. More about this topic here .
  • Numerical algorithms – the way a 2-dimensional array is allocated (row wise or column wise) and store has a great impact on the performance of different algorithms (even for simple sum function).
  • Working with the GPU – there are packages that process some of the data on the GPU. It is efficient when the data is big and less efficient when the data is small since copying all the data to the GPU has some overhead.
  • Cython use c “malloc” function for re-allocating space when list \ dictionaries \ set grow or shrink. On one hand this function can be overridden, on the other hand one can try to avoid costly processes which cause space allocation operations or to use more efficient data structures, e.g list instead of dictionary where it is possible.
  • Note the garbage collector! Python garbage collector is based on reference count. Over-ridding the __dell__ function may disrupt the garbage collector.
  • Suggested Profiling and monitoring tools – psutilmemory_profilerobjgraphRunSnakeRun + Meliaevalgrind
Bottom line of all of those – “knowledge itself is power”. I.e. knowing the internal and the impact of what we are doing can bring to a significant improvements.
There are always several ways to do things and each has cons and pros fitted to the specific case. Some of those are simple to implement and use and can donate to a great improvement on both running time and memory usage. On the other hand some of those suggestion are really “shoot in the foot” – causing memory leaks and other unexpected behavior, beware.

5 interesting things (03/08/2014)

Goooooooooaaaaaalllll – world cup is over but Python and soccer are forever :). This post tries to identify goals and interesting events on soccer games based on the volume on youtube. This post only touches the subject but what a nice start!

http://zulko.github.io/blog/2014/07/04/automatic-soccer-highlights-compilations-with-python/

Scheduling with Celery – and I thought I can only make a soup with celery.. Scheduling is not the main goal of Celery but it can be used as such, cron style. More about it  –

http://www.caktusgroup.com/blog/2014/06/23/scheduling-tasks-celery/

Markov chains – the two links below complement one another. One has a great visualization and the other explain things a bit more deeply. 

http://setosa.io/blog/2014/07/26/markov-chains/

http://www.analyticsvidhya.com/blog/2014/07/markov-chain-simplified

My Ig-Noble candidate – 

http://www.bbc.com/news/magazine-20578627

Hotels by WiFi – geeky but important those days.

http://www.hotelwifitest.com

Graph Databases, a little connected tour

EuroPython talk by: Francisco Fernández Castaño
 
 
The talk was classified as novice and so it was – very basic graph database ideas.
 
Castaño started by presenting the general idea of graph databases and showing some use cases.
 
One of the most known use cases – social media data, friends-of-friends.
 
Then he presented Neo4j which is graph database written in Java but originally was written in Python. Neo4j is known to be very scalable and to support ACID transactions. Another nice property about Neo4j is the ability to extend the give rest API but your own needs.
 
The next part of talked focused on the cypher language – which is the way to query the Neo4j database. Neo4j have some nice UI properties and I really missed some reference or example for that on this talk.
 
Last but not least tip given in the talk, you can try Neo4j without having to install it on your machine – http://www.graphenedb.com/ (free sand box of up to 1k nodes and 10k edges).

The Shogun Machine Learning Toolbox

EuroPython talk by: Heiko Strathmann, herrstrathmann.de
 
 
Strathmann presented the Shogun toolbox. Shogun is a Machine Learning toolbox which implement most of the common supervised and unsupervised algorithms. Shogun is implemented in C++ and interacts with Python as well as with Java, Ruby, Matlab, Lau, etc. Shogun is meant to run on single machine and no distributed.
 
Shogun is an open source project started on 2004. It has 8 core developers + 20 contributors and it is fairly active. It also has a collebaration with Google summer of code – 29 projects so far, 8 on going.
 
The C++ maybe the most significant advantage of Shogun over SciPy. It is simply faster and enables more efficient memory allocation and run time optimization and well as multi-language support . Therefore enabling training and classifying data based on larger number of samples and higher dimension.
 
The binding \ interfaces to other languages is done using – http://www.swig.org

Building, bug tracing and deployment is done using – http://buildbot.net/.

 
Summarizing – Specially when going larger Shogun seems like a good alternative to SciPy (competition makes both products better). Shogun site offers some tutorials include notebooks and demos.

Log everything with logstash and elasticsearch

EuroPython talk by: Peter Hoffmann@peterhoffmann

 
A talk by Peter Hoffmann from Blue Yonder which is one of the sponsors of the conference.

 

Another very good talk by an experienced speaker. However the name  is kind of misleading. Yes – logstash and elasticsearch were mentioned however to main concept of the talk was really logging chain and centralized logging while logstash and elasticsearch are two tools along this chain and there are some alternatives in every step of this chain.

This talk gave me a lot to think about the logging on the company aspect and how should they run, monitor, etc. Also there is some tension to solve \ define about what is logged and what error message \ outputs an app should provide (“logging best practices”).

The video is already available in the EuroPython site and I hope that the slides would be available too soon.

Full Stack Python

EuroPython talk by: Matt Makai, @mattmakai 

 
I expected this talk to be full with buzzwords and it was… but in a good sense. 

 
Makai is the builder of Full Stack Python site as such he spoke about what you need from the moment you have an idea to a python web app until you deploy including all the essential steps – wsgi server, hosting, logging etc.
 
Every layer in the site include relevant link and tutorials. Good starting point for a python web-app developer.
 
The talk was interesting due to Makai enthusiasm to teach and to share his knowledge (and of course to promote his site and that’s legit as well) and his professional knowledge. There are really few people the can speak so fluently for 25 minutes.
 

5 interesting things (18/7/2014)

A\B Testing – No need to say that A\B is a very hot buzz word in the industry. A\B tests were used long ago in psychology but today there are much more accessible and easy to set. The following series of posts (one is still not publish) describe 5 simple guide lines for A\B testing.

http://sl8r000.github.io/ab_testing_statistics/

Shellify – decorator which turns Python models into shell. I don’t see an everyday usage to it, maybe on developing but it has a high coolness factor which is important as well.

https://bitbucket.org/johannestaas/shellify

Code review without your glasses – I like this post because in my humble opinion it is very creative, beyond the box. It is also a good reminder about about what to notice in code review.

http://robertheaton.com/2014/06/20/code-review-without-your-eyes/

Side kick – the last link reminded me of rubber duck debugging and on the way I found this – http://rubberduckreview.com/

Deployment academy – a series of posts by Rainforest in their blog. This time a post about “zero downtime database migrations”. The post is simple and easy to understand and answer a common issues in the life of a developer. Of course, practices and specific problems differ from one organization to another but the core ideas are as is.

https://blog.rainforestqa.com/2014-06-27-zero-downtime-database-migrations/

Awesome *

Are we going back?

In the past week I have bumped into two github repositories such as awesome-python and awesome-sysadmin. Both repositories do a great job and compose and interesting list \ index of relevant tools.

However, this made me feel that we are going back. Those indices reminded me the pre-search-engines days or shell I say the BME days (before modern era ;). Specifically this reminded me of Alta Vista (but also other indices sites – do you recall Lycos?) where all the links where indexed under some category and sub categories and one should have dig in those categories to find what he looked for.

Aren’t the search engines today strong enough to answer the query “python machine learning package”?. Is human indexing really our resort? I don’t really think so. I believe in the power of the human behind the machine and their ability to build a good enough searching engines. Those indices can be very useful but I would rather have them built automatically and not manually.