You have a Python dictionary and you want to get the value of a specific key. So far so good, right?
And then a KeyError –
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 1
```
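For completeness, here is a minimal sketch of how that error comes up, and the fallback idea the rest of the post explores:

```python
data = {}  # empty dictionary, so key 1 is missing
try:
    value = data[1]  # raises KeyError: 1
except KeyError:
    value = None  # fall back to a default instead of crashing
print(value)  # None
```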
Hmmm, well, if this key does not exist in the dictionary I can use some default value like None, 10, or an empty string. What are my options for doing so?
I can think of 3 –
- get method
- setdefault method
- defaultdict data structure
get method
Let’s investigate first –
```python
key, value = "key", "value"

data = {}
x = data.get(key, value)
print x, data
# value {}

data = {}
x = data.setdefault(key, value)
print x, data
# value {'key': 'value'}
```
Well, we get almost the same result: x obtains the same value, but with get the dictionary is unchanged, while with setdefault the dictionary changes. When does this become a problem?
```python
key, value = "key", "value"

data = {}
x = data.get(key, []).append(value)
print x, data
# None {}

data = {}
x = data.setdefault(key, []).append(value)
print x, data
# None {'key': ['value']}
```
So, when we are dealing with mutable data types the difference is clearer and more error-prone: with get, the appended value is silently lost.
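A related pitfall worth noting (a small illustrative sketch, not from the original examples): setdefault does not copy the default you pass in, so reusing one mutable object as the default for several keys makes all of those keys share it.

```python
data = {}
shared = []  # one list object reused as the default for every call

# setdefault stores the very same object under both keys
data.setdefault("a", shared).append(1)
data.setdefault("b", shared).append(2)

print(data["a"])  # [1, 2] - both appends landed in the same list
print(data["a"] is data["b"])  # True - both keys point at one object
```

Passing a fresh `[]` at each call site avoids this, at the cost of constructing a new list every time.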
When to use each? It mainly depends on the content of your dictionary and its size.
We could time the difference between get and setdefault, but it does not really matter much: they produce different output, and the timing was not significantly in favor of either direction anyhow.
And for defaultdict –
```python
from collections import defaultdict

data = defaultdict(list)
print data[key]
# []
data[key].append(value)
print data[key]
# ['value']
```
setdefault sets the default value for one specific key we access, while defaultdict is the type of the data variable itself and applies the default value to every key we access.
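To make the "every key" part concrete, here is a small sketch (a common counting idiom, not from the original post): with defaultdict the factory runs for any missing key we touch, while a plain dict needs the default spelled out at each setdefault call site.

```python
from collections import defaultdict

counts = defaultdict(int)  # int() is the factory: every missing key starts at 0
for word in ["a", "b", "a"]:
    counts[word] += 1  # missing keys are created on first access
print(dict(counts))  # {'a': 2, 'b': 1}

# the plain-dict equivalent repeats the default at the call site
counts2 = {}
for word in ["a", "b", "a"]:
    counts2[word] = counts2.setdefault(word, 0) + 1
print(counts2)  # {'a': 2, 'b': 1}
```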
Since setdefault and defaultdict give roughly the same result, I timed the two approaches for several dictionary sizes (left-most column), running each 1000 times (code below) –
| dict size | default value | method | time |
|---|---|---|---|
| 100 | list | setdefault | 0.0229508876801 |
| 100 | list | defaultdict | 0.0204179286957 |
| 100 | set | setdefault | 0.0209970474243 |
| 100 | set | defaultdict | 0.0194549560547 |
| 100 | int | setdefault | 0.0236239433289 |
| 100 | int | defaultdict | 0.0225579738617 |
| 100 | string | setdefault | 0.020693063736 |
| 100 | string | defaultdict | 0.0240340232849 |
| 10000 | list | setdefault | 2.09283614159 |
| 10000 | list | defaultdict | 2.31266093254 |
| 10000 | set | setdefault | 2.12825512886 |
| 10000 | set | defaultdict | 3.43549799919 |
| 10000 | int | setdefault | 2.04997992516 |
| 10000 | int | defaultdict | 1.87312483788 |
| 10000 | string | setdefault | 2.05423784256 |
| 10000 | string | defaultdict | 1.93679213524 |
| 100000 | list | setdefault | 22.4799249172 |
| 100000 | list | defaultdict | 29.7850298882 |
| 100000 | set | setdefault | 23.5321040154 |
| 100000 | set | defaultdict | 41.7523541451 |
| 100000 | int | setdefault | 26.6693091393 |
| 100000 | int | defaultdict | 23.1293339729 |
| 100000 | string | setdefault | 26.4119689465 |
| 100000 | string | defaultdict | 23.6694099903 |
Conclusions and summary –
- Working with sets is almost always more expensive time-wise than working with lists
- As the dictionary size grows, the simple types (string and int) perform better with defaultdict than with setdefault, while set and list perform worse.
- Main conclusion – choosing between defaultdict and setdefault also depends mainly on the type of the default value.
- In this test I tested a particular use case – accessing each key twice. Different use cases / distributions, such as assignment or accessing the same key over and over again, may have different properties.
- There is no firm conclusion here, just an investigation of some of the interpreter's capabilities.
Code –
```python
import timeit
from collections import defaultdict
from itertools import product

def measure_setdefault(n, defaultvalue):
    data = {}
    for i in xrange(0, n):
        x = data.setdefault(i, defaultvalue)
    for i in xrange(0, n):
        x = data.setdefault(i, defaultvalue)

def measure_defaultdict(n, defaultvalue):
    data = defaultdict(type(defaultvalue))
    for i in xrange(0, n):
        x = data[i]
    for i in xrange(0, n):
        x = data[i]

if __name__ == '__main__':
    number = 1000
    dict_sizes = [100, 10000, 100000]
    defaultvalues = [[], 0, "", set()]
    for dict_size, defaultvalue in product(dict_sizes, defaultvalues):
        print "dict_size: ", dict_size, " defaultvalue: ", type(defaultvalue)
        print "\tsetdefault:", timeit.timeit(
            "measure_setdefault(dict_size, defaultvalue)",
            setup="from __main__ import measure_setdefault, dict_size, defaultvalue",
            number=number)
        print "\tdefaultdict:", timeit.timeit(
            "measure_defaultdict(dict_size, defaultvalue)",
            setup="from __main__ import measure_defaultdict, dict_size, defaultvalue",
            number=number)
```
Thanks for posting the code!
There is an important difference between `measure_setdefault()` and `measure_defaultdict()` which will skew the timing results.
- `defaultdict()` constructs a new value object for each key accessed.
- `dict.setdefault()` does not construct a new value object. It does _not_ copy the `defaultvalue` given; it keeps the same reference.
This will make `setdefault()` look much faster for complex objects because `defaultdict` is spending time constructing new values while `setdefault()` is not.
A proper comparison would be to do the same value construction, `type(defaultvalue)()`, before the first `setdefault()`. Really, we want to test default _types_, not default _values_.
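Concretely, such a fairer harness might look roughly like this (a sketch with hypothetical names, in Python 3 syntax, not the original code): by calling the default *type* to build a fresh value at each setdefault call, both functions pay the same construction cost.

```python
from collections import defaultdict

def measure_setdefault_fair(n, defaulttype):
    # build a fresh default object per access, mirroring what defaultdict
    # does internally when it encounters a missing key
    data = {}
    for i in range(n):
        x = data.setdefault(i, defaulttype())
    for i in range(n):
        x = data.setdefault(i, defaulttype())
    return data

def measure_defaultdict(n, defaulttype):
    # the factory runs only on actual misses (the first pass);
    # the second pass finds every key already present
    data = defaultdict(defaulttype)
    for i in range(n):
        x = data[i]
    for i in range(n):
        x = data[i]
    return dict(data)
```

Note that setdefault still evaluates its default argument even when the key already exists (the second pass), which is itself part of the real cost difference between the two approaches.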
(I know this post is 8 years old, but Google is listing this as the first answer for `defaultdict vs setdefault`, so I felt I should comment.)