5 interesting things (26/08/2014)

Gooey – Command line to application! Very cool. It works by mapping ArgumentParser arguments to GUI objects, which makes it easy to build instant tools to pass around the office. Future tasks include supporting more themes, different argument parsers (e.g. docopt, http://docopt.org/), etc. I believe there is much potential in this project.

https://github.com/chriskiehl/Gooey
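
To get a sense of how the mapping works, here is a rough sketch based on the project's README at the time (the tool name and arguments are made up, and the exact API may have changed since):

from argparse import ArgumentParser
from gooey import Gooey  # assumes: pip install Gooey

@Gooey  # wraps the CLI entry point and renders the parser as a GUI
def main():
    parser = ArgumentParser(description="My instant office tool")
    parser.add_argument("filename", help="file to process")
    parser.add_argument("--verbose", action="store_true", help="print more details")
    args = parser.parse_args()
    print args.filename, args.verbose

if __name__ == "__main__":
    main()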

Datamash – a stats command line tool. I like working from the command line, at least to begin with, to get a feel for the data before doing more complex things, and this tool is really what I need. However, the documentation is missing \ not clear enough \ open source style.

http://www.gnu.org/software/datamash/

In that sense, q, http://harelba.github.io/q/, is also very cool and answers the need to do things within the same framework across different tasks.

Getting started with Python internals – how to dive into the deep water. A very enriching post which also includes links to other interesting posts.

http://akaptur.github.io/blog/2014/08/03/getting-started-with-python-internals/

python-ftfy – unicode strings can be very painful when they include special characters such as umlauts (ö), HTML entities (&gt;) and everything else users can think of :). The goal of this package is to at least partially ease this pain.

https://github.com/LuminosoInsight/python-ftfy
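
A minimal usage sketch (assuming ftfy is installed; fix_text is the package's main entry point, and the mojibake string below is a made-up example):

# -*- coding: utf-8 -*-
import ftfy  # assumes: pip install ftfy

# a classic mojibake case - UTF-8 text that was decoded with the wrong codec
print repr(ftfy.fix_text(u'âœ” No problems'))  # u'\u2714 No problems'
# stray HTML entities are unescaped as well
print repr(ftfy.fix_text(u'5 &gt; 3'))         # u'5 > 3'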

Fraud detection Lyft vs Uber – I think the interesting part here is the data visualization as a tool to understand the problem better.

setdefault vs get vs defaultdict

You have a Python dictionary and you want to get the value of a specific key in the dictionary. So far so good, right?
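
For example, looking up a key that was never inserted:

data = {}
print data[1]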

And then a KeyError –

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 1

Hmmm, well, if the key does not exist in the dictionary I can use some default value like None, 10 or an empty string. What are my options for doing so?

I can think of 3 –

  • get method
  • setdefault method
  • defaultdict data structure

get vs setdefault

Let’s investigate first –

key, value = "key", "value"
data = {}
x = data.get(key, value)
print x, data  # value {}
data = {}
x = data.setdefault(key, value)
print x, data  # value {'key': 'value'}

Well, we get almost the same result: x obtains the same value in both cases, but with get the dictionary is not changed while with setdefault it is. When does it become a problem?

key, value = "key", "value"
data = {}
x = data.get(key, []).append(value)
print x, data  # None {}
data = {}
x = data.setdefault(key, []).append(value)
print x, data  # None {'key': ['value']}

So, when we are dealing with mutable data types the difference becomes clearer and more error prone: with get the appended value is silently lost, since the default list is never stored in the dictionary, while setdefault stores it.

When should we use each? It mainly depends on the content of your dictionary and its size.

We could time get against setdefault, but it does not really matter since they produce different results, and the difference was not significant in either direction anyhow.

And for defaultdict –

from collections import defaultdict

key, value = "key", "value"
data = defaultdict(list)
print data[key]  # [] - accessing a missing key inserts the default
data[key].append(value)
print data[key]  # ['value']

setdefault sets the default value only for the specific key we access, while defaultdict is the type of the data variable itself and sets the default value for every missing key we access.
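
A small illustration of that difference (the keys here are arbitrary):

from collections import defaultdict

plain = {}
plain.setdefault("a", [])  # inserts a default only for the key "a"
print "b" in plain         # False - other keys are untouched

auto = defaultdict(list)
auto["a"]                  # merely reading a missing key inserts the default
auto["b"]
print sorted(auto.keys())  # ['a', 'b']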

Since we get roughly the same result, I timed the two approaches for several dictionary sizes (left-most column), running each 1000 times (code below) –

dict size  default value  method       time (seconds, 1000 runs)
100        list           setdefault   0.0229508876801
100        list           defaultdict  0.0204179286957
100        set            setdefault   0.0209970474243
100        set            defaultdict  0.0194549560547
100        int            setdefault   0.0236239433289
100        int            defaultdict  0.0225579738617
100        string         setdefault   0.020693063736
100        string         defaultdict  0.0240340232849
10000      list           setdefault   2.09283614159
10000      list           defaultdict  2.31266093254
10000      set            setdefault   2.12825512886
10000      set            defaultdict  3.43549799919
10000      int            setdefault   2.04997992516
10000      int            defaultdict  1.87312483788
10000      string         setdefault   2.05423784256
10000      string         defaultdict  1.93679213524
100000     list           setdefault   22.4799249172
100000     list           defaultdict  29.7850298882
100000     set            setdefault   23.5321040154
100000     set            defaultdict  41.7523541451
100000     int            setdefault   26.6693091393
100000     int            defaultdict  23.1293339729
100000     string         setdefault   26.4119689465
100000     string         defaultdict  23.6694099903

Conclusions and summary –

  • Working with sets is almost always more expensive time-wise than working with lists
  • As the dictionary size grows, the simple types (string and int) perform better with defaultdict than with setdefault, while set and list perform worse.
  • Main conclusion – choosing between defaultdict and setdefault also depends mainly on the type of the default value.
  • In this test I examined a particular use case – accessing each key twice. Different use cases \ access distributions, such as assignment or accessing the same key over and over again, may behave differently.
  • There is no firm conclusion here, just an investigation of some of the interpreter's behavior.

Code –

import timeit
from collections import defaultdict
from itertools import product

def measure_setdefault(n, defaultvalue):
    # access every key twice - the first pass inserts the default, the second only reads
    data = {}
    for i in xrange(0, n):
        x = data.setdefault(i, defaultvalue)
    for i in xrange(0, n):
        x = data.setdefault(i, defaultvalue)

def measure_defaultdict(n, defaultvalue):
    data = defaultdict(type(defaultvalue))
    for i in xrange(0, n):
        x = data[i]
    for i in xrange(0, n):
        x = data[i]

if __name__ == '__main__':
    number = 1000
    dict_sizes = [100, 10000, 100000]
    defaultvalues = [[], 0, "", set()]
    for dict_size, defaultvalue in product(dict_sizes, defaultvalues):
        print "dict_size: ", dict_size, " defaultvalue: ", type(defaultvalue)
        print "\tsetdefault:", timeit.timeit("measure_setdefault(dict_size, defaultvalue)", setup="from __main__ import measure_setdefault, dict_size, defaultvalue", number=number)
        print "\tdefaultdict:", timeit.timeit("measure_defaultdict(dict_size, defaultvalue)", setup="from __main__ import measure_defaultdict, dict_size, defaultvalue", number=number)

EuroPython 2014 – Python under the hood

This is actually a summary of a few talks which deal with “under the hood” topics; the talks partially overlap. Those topics include memory allocation and management, inheritance, overriding built-in methods, etc.

Relevant talks –
The magic of attribute access by Petr Viktorin
Performance Python for Numerical Algorithms by Yves
Metaprogramming, from Decorators to Macros by Andrea Crotti
Everything you always wanted to know about Memory in Python but were afraid to ask by Piotr Przymus
Practical summary –
  • __slots__ attribute – limits the memory allocated to objects in Python by replacing the per-instance __dict__ with a fixed set of attributes (see the sketch after this list).
  • Strings – the empty string and strings of length 1 are saved as constants. Use intern (or sys.intern in Python 3.x) on strings to avoid allocating multiple string objects with the same value; this makes memory usage more efficient and string comparison quicker (also shown in the sketch below). More about this topic here.
  • Numerical algorithms – the way a 2-dimensional array is allocated and stored (row-wise or column-wise) has a great impact on the performance of different algorithms (even for a simple sum function).
  • Working with the GPU – there are packages that offload some of the processing to the GPU. This is efficient when the data is big and less so when the data is small, since copying the data to the GPU has some overhead.
  • CPython uses the C “malloc” function to re-allocate space when lists \ dictionaries \ sets grow or shrink. On one hand this function can be overridden; on the other hand, one can try to avoid costly operations that trigger re-allocation, or use more efficient data structures, e.g. a list instead of a dictionary where possible.
  • Mind the garbage collector! Python's garbage collection is based on reference counting. Overriding the __del__ method may disrupt the garbage collector.
  • Suggested profiling and monitoring tools – psutil, memory_profiler, objgraph, RunSnakeRun + Meliae, valgrind.
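
To make the first two bullets concrete, here is a minimal sketch (the Point class and the interned strings are made-up examples, Python 2 syntax as in the rest of the post):

class Point(object):
    __slots__ = ("x", "y")  # no per-instance __dict__, only these attributes
    def __init__(self, x, y):
        self.x, self.y = x, y

p = Point(1, 2)
print hasattr(p, "__dict__")  # False - the usual attribute dictionary is gone
try:
    p.z = 3                   # only the slotted attributes can be set
except AttributeError:
    print "no new attributes allowed"

# interning: both names point to the same string object,
# so comparison can be done by identity
s1 = intern("some repeated value")
s2 = intern("some repeated value")
print s1 is s2                # True
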
The bottom line of all of these – “knowledge itself is power”, i.e. knowing the internals and the impact of what we are doing can bring significant improvements.
There are always several ways to do things, and each has pros and cons that fit the specific case. Some of them are simple to implement and use and can contribute to a great improvement in both running time and memory usage. On the other hand, some of these suggestions can really shoot you in the foot, causing memory leaks and other unexpected behavior, so beware.

5 interesting things (03/08/2014)

Goooooooooaaaaaalllll – the World Cup is over but Python and soccer are forever :). This post tries to identify goals and interesting events in soccer games based on the audio volume of YouTube videos. The post only scratches the surface, but what a nice start!

http://zulko.github.io/blog/2014/07/04/automatic-soccer-highlights-compilations-with-python/

Scheduling with Celery – and I thought I could only make soup with celery... Scheduling is not the main goal of Celery, but it can be used as such, cron style. More about it –

http://www.caktusgroup.com/blog/2014/06/23/scheduling-tasks-celery/
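
To get a feel for the cron-style usage, here is a minimal sketch in the Celery 3.x configuration style (the broker URL, module name and schedule below are made-up assumptions):

from celery import Celery
from celery.schedules import crontab

app = Celery("tasks", broker="redis://localhost:6379/0")  # broker URL is an assumption

@app.task
def send_report():
    print "sending report"

# run every day at 7:30 (assumes this file is the module "tasks")
app.conf.CELERYBEAT_SCHEDULE = {
    "daily-report": {
        "task": "tasks.send_report",
        "schedule": crontab(hour=7, minute=30),
    },
}

# start a worker with the embedded beat scheduler: celery -A tasks worker -B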

Markov chains – the two links below complement one another. One has a great visualization and the other explains things a bit more deeply.

http://setosa.io/blog/2014/07/26/markov-chains/

http://www.analyticsvidhya.com/blog/2014/07/markov-chain-simplified
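
For a feel of what those links describe, here is a minimal simulation sketch of a two-state Markov chain (the states and probabilities are made up):

import random

# transition probabilities between two made-up states
transitions = {
    "sunny": [("sunny", 0.8), ("rainy", 0.2)],
    "rainy": [("sunny", 0.4), ("rainy", 0.6)],
}

def step(state):
    # draw the next state according to the transition probabilities
    r, acc = random.random(), 0.0
    for next_state, p in transitions[state]:
        acc += p
        if r < acc:
            return next_state
    return next_state  # guard against floating point rounding

state, path = "sunny", []
for _ in xrange(10):
    state = step(state)
    path.append(state)
print path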

My Ig Nobel candidate –

http://www.bbc.com/news/magazine-20578627

Hotels by WiFi – geeky but important these days.

http://www.hotelwifitest.com