5 interesting things (04/01/2015)
JSON as Python modules – making life simpler by making it possible to do “import x” for x.json.
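The trick above can be sketched with a Python 3 import hook. Everything below – the JsonFinder class, the demo_settings.json file, and its keys – is my own hypothetical illustration of the idea, not the linked package's actual API:

```python
import json
import os
import sys
import tempfile
import importlib.abc
import importlib.util

class JsonFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Treat x.json in a given directory as an importable module x."""
    def __init__(self, directory):
        self.directory = directory

    def find_spec(self, name, path=None, target=None):
        candidate = os.path.join(self.directory, name + ".json")
        if os.path.isfile(candidate):
            return importlib.util.spec_from_loader(name, self, origin=candidate)
        return None  # not ours; let the next finder try

    def create_module(self, spec):
        return None  # use the default module creation

    def exec_module(self, module):
        # expose the JSON object's keys as module attributes
        with open(module.__spec__.origin) as f:
            module.__dict__.update(json.load(f))

# demo: write a JSON file, register the finder, and import the file as a module
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "demo_settings.json"), "w") as f:
    json.dump({"host": "localhost", "port": 8080}, f)

sys.meta_path.insert(0, JsonFinder(tmpdir))
import demo_settings
print(demo_settings.host, demo_settings.port)  # localhost 8080
```

The finder only claims names that match an existing .json file, so regular imports are unaffected.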
Data Science Ontology – just having fun with d3.js.
NIPS Experiment – there was a lot of chatter regarding the experiment done by the NIPS committee. Nonetheless, it creates a very uncomfortable feeling regarding committees.
5 interesting things (14/12/2014)
Databases compression – summarizes the high-level picture and the things to know about each database very well. A good starting point when evaluating several solutions.
Common Crawl meetup
- Monthly builds (they currently prefer bigger monthly builds over more frequent ones)
- Latest build was 220TB with crawl data of 2.98 billion web pages.
- The data includes 3 types of files –
- WARC files of the raw crawl data
- WAT files which include the metadata for the data stored in the WARC (about 1/3 of the raw crawl data)
- WET files which hold the plaintext from the data stored in the WARC (about 15% of the raw crawl data).
- The delay between the publication of a page and its crawl \ publish time is approximately a month to a month and a half.
- No incremental dumps are planned at the moment.
- The data is currently skewed to dot-com domains. They plan to improve this; changes will hopefully be seen in the January dump.
- They crawl using Apache Nutch – an open source web crawler – and cooperate with Blekko in order to avoid spam.
What can you do with Common Crawl
5 interesting things (06/11/2014)
Toolz – we all know the term design patterns. Toolz provides implementation patterns – everyday utilities every developer needs: iterators, dictionaries, etc. Two links – a blog post with common use cases and the Toolz documentation.
http://matthewrocklin.com/blog/work/2014/07/04/Streaming-Analytics/
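To give a feel for the pattern, here is a plain-Python sketch of two such everyday utilities, modeled loosely on toolz's frequencies and groupby (the function bodies here are my own, not toolz's implementation):

```python
def frequencies(seq):
    """Count occurrences of each item (toolz.frequencies-style)."""
    counts = {}
    for item in seq:
        counts[item] = counts.get(item, 0) + 1
    return counts

def groupby_key(key, seq):
    """Group items of seq by key(item) (toolz.groupby-style)."""
    groups = {}
    for item in seq:
        groups.setdefault(key(item), []).append(item)
    return groups

print(frequencies("mississippi"))          # {'m': 1, 'i': 4, 's': 4, 'p': 2}
print(groupby_key(len, ["a", "ab", "b"]))  # {1: ['a', 'b'], 2: ['ab']}
```

The point of the library is that such small, composable utilities stop being rewritten ad hoc in every project.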
Running my first EMR – lessons learned
Today I was trying to run my first EMR job; here are a few lessons I learned during this day. I have previously run Hadoop streaming MapReduce jobs, so I was familiar with the MapReduce state of mind. However, I was not familiar with the EMR environment.
I used boto – Amazon's official Python interface.
Operating system: Debian 5.0 (Lenny)
Applications: Hadoop 0.20 and 0.18 (default); Hive 0.5, 0.7 (default), 0.7.1; Pig 0.3 (on Hadoop 0.18), 0.6 (on Hadoop 0.20)
Languages: Perl 5.10.0, PHP 5.2.6, Python 2.5.2, R 2.7.1, Ruby 1.8.7
File system: ext3 for root and ephemeral
Kernel: Red Hat
- Does not include json – json is new in version 2.6.
- collections is new in Python 2.4, but not all of its members were added in that version –
| member | description | added in |
|---|---|---|
| namedtuple() | factory function for creating tuple subclasses with named fields | 2.6 |
| deque | list-like container with fast appends and pops on either end | 2.4 |
| Counter | dict subclass for counting hashable objects | 2.7 |
| OrderedDict | dict subclass that remembers the order entries were added | 2.7 |
| defaultdict | dict subclass that calls a factory function to supply missing values | 2.5 |
- dict comprehensions are also a relatively late addition (Python 2.7)
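For reference, a quick tour of some of the listed members (run on a modern Python – several of these were not yet available on the Python 2.5.2 EMR image above):

```python
from collections import namedtuple, deque, Counter, defaultdict

Point = namedtuple("Point", ["x", "y"])   # 2.6+: tuple with named fields
p = Point(1, 2)

d = deque([1, 2, 3])                      # 2.4+: fast appends/pops on both ends
d.appendleft(0)

counts = Counter("abracadabra")           # 2.7+: counts hashable objects

groups = defaultdict(list)                # 2.5+: supplies missing values
groups["vowels"].append("a")

print(p.x, list(d), counts.most_common(1), dict(groups))
# 1 [0, 1, 2, 3] [('a', 5)] {'vowels': ['a']}
```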
output="s3n://<my-bucket>/output/"+str(int(time.time()))
4. Why did my process fail – one option which produces a relatively understandable explanation is – conn.describe_jobflow(jobid).laststatechangereason
5. cache_files – enables you to import files you need for the MapReduce process. It is super important to “specify a fragment”, i.e. to specify the local file name –
cache_files=['s3n://<file-location>/<file-name>#<local-file-name>']
5 interesting things (04/09/2014)
C3.js – I have previously written a post about the importance of visualization in the skill set of a data scientist. C3.js is a JavaScript chart library based on d3.js which seems, at least at a glimpse, to be simple and intuitive. I would like to see a Python client for it, but that is for the future to come.
nvd3 also does something similar – charts based on d3.js – and it also has a Python client which I have worked with a bit. Comparing the two, c3.js seems a little bit more mature than nvd3, ignoring the lack of a Python client, but I’m sure that gap will be filled soon.
Harvard Visualization course – I went through some of the slides and it was fascinating but what is even more exciting is the great collection of links about visualization examples, theory and tools. Great work.
http://www.cs171.org/#!index.md
textract – I needed high flexibility of input types in a project I am working on, and of course I wanted to deal with them as transparently as possible without locking myself to all the relevant packages or adjusting my code to the API of each package. Fortunately somebody already did it – textract. The package is not perfect and there are some “glitches”, mostly concerning the parsers themselves (line splitting, non-ascii, etc.) and not the unified API textract provides. However, it is a very good start.
http://textract.readthedocs.org/en/latest/
Visualizing Garbage Collection Algorithms – both a very cool visualization and good explanations. Design-wise I think the visualization should be larger, but the concept itself is very neat.
http://spin.atomicobject.com/2014/09/03/visualizing-garbage-collection-algorithms/
SmartCSV – makes CSV reading more structured by defining a model and validating it while reading. Enables skipping rows (and soon also skipping columns). It is an ongoing project, and feature requests and issues are currently addressed quickly.
5 interesting things (26/08/2014)
Gooey – command line to application! Very cool. Works by mapping ArgumentParser to GUI objects, and makes it easier to build instant tools to pass around the office. Future tasks include supporting more themes, different argument parsers (e.g. docopt http://docopt.org/), etc. I believe there is much potential in this project.
https://github.com/chriskiehl/Gooey
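As a sketch of the kind of program Gooey can wrap, here is a plain ArgumentParser (the @Gooey decorator itself is omitted since it launches a GUI; the tool name and flags below are hypothetical examples, not from the project):

```python
import argparse

def build_parser():
    # Gooey introspects add_argument calls like these to build widgets:
    # positional file arguments become file pickers, store_true flags
    # become checkboxes, and so on.
    parser = argparse.ArgumentParser(description="hypothetical demo tool")
    parser.add_argument("infile", help="input file to process")
    parser.add_argument("--verbose", action="store_true", help="chatty output")
    return parser

args = build_parser().parse_args(["data.txt", "--verbose"])
print(args.infile, args.verbose)  # data.txt True
```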
Datamash – a stats command line tool. I like working from the command line, at least to begin with and to feel the data before doing complex things, and this tool is really what I need. However, the documentation is missing \ not clear enough \ open source style.
http://www.gnu.org/software/datamash/
In that sense, q, http://harelba.github.io/q/, is also very cool and answers the need to do different tasks within the same framework.
Getting started with Python internals – how to dive into the deep water. A very enriching post which also includes links to other interesting posts.
http://akaptur.github.io/blog/2014/08/03/getting-started-with-python-internals/
python-ftfy – unicode strings can be very painful when they include special characters such as umlauts (ö), HTML entities (&gt;) and everything else users can think of :). The goal of this package is to at least partly ease this pain.
https://github.com/LuminosoInsight/python-ftfy
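A sketch of the classic mojibake round trip that packages like ftfy automate (the place name below is just an example I made up; ftfy's job is to detect and undo this kind of mis-decoding without being told the encodings):

```python
# UTF-8 bytes of "Tübingen" mistakenly decoded as Latin-1 produce mojibake
garbled = "TÃ¼bingen"

# undo it manually: re-encode with the wrong codec, decode with the right one
fixed = garbled.encode("latin-1").decode("utf-8")
print(fixed)  # Tübingen
```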
Fraud detection: Lyft vs Uber – I think the interesting part here is the data visualization as a tool to understand the problem better.
setdefault vs get vs defaultdict
You have a Python dictionary and you want to get the value of a specific key in the dictionary; so far so good, right?
And then a KeyError –
Traceback (most recent call last):
File “<stdin>”, line 1, in <module>
KeyError: 1
Hmmm, well, if this key does not exist in the dictionary I can use some default value like None, 10, or an empty string. What are my options for doing so?
I can think of 3 –
- get method
- setdefault method
- defaultdict data structure
get vs setdefault
Let’s investigate first –
key, value = "key", "value"
data = {}
x = data.get(key,value)
print x, data #value {}
data= {}
x = data.setdefault(key,value)
print x, data #value {'key': 'value'}
Well, we get almost the same result: x obtains the same value in both, but with get the data dict is not changed while with setdefault it is. When does this become a problem?
key, value = "key", "value"
data = {}
x = data.get(key,[]).append(value)
print x, data #None {}
data= {}
x = data.setdefault(key,[]).append(value)
print x, data #None {'key': ['value']}
So, when we are dealing with mutable data types the difference is clearer and more error-prone.
When to use each? It mainly depends on the content of your dictionary and its size.
We could time the differences, but it does not really matter as they produce different output, and the difference was not significant in either direction anyhow.
And for defaultdict –
from collections import defaultdict
data = defaultdict(list)
print data[key] #[]
data[key].append(value)
print data[key] #['value']
setdefault sets the default value for the specific key we access, while defaultdict is the type of the data variable and sets this default value for every key we access.
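One more subtlety worth noting before timing anything: setdefault evaluates its default argument eagerly on every call, so passing a single mutable object around can silently share it between keys. A short Python 3 illustration (my own example, not from the original post):

```python
data = {}
shared = []
# both keys end up pointing at the very same list object
data.setdefault("a", shared).append(1)
data.setdefault("b", shared).append(2)
print(data)                    # {'a': [1, 2], 'b': [1, 2]}
print(data["a"] is data["b"])  # True

# a fresh list per call avoids the sharing
safe = {}
safe.setdefault("a", []).append(1)
safe.setdefault("b", []).append(2)
print(safe)                    # {'a': [1], 'b': [2]}
```

defaultdict sidesteps this by calling the factory (e.g. list) anew for each missing key.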
So, given that we get roughly the same result, I timed the processes for several dictionary sizes (leftmost column), running each 1000 times (code below) –
| dict size | default value | method | time |
|---|---|---|---|
| 100 | list | setdefault | 0.0229508876801 |
| 100 | list | defaultdict | 0.0204179286957 |
| 100 | set | setdefault | 0.0209970474243 |
| 100 | set | defaultdict | 0.0194549560547 |
| 100 | int | setdefault | 0.0236239433289 |
| 100 | int | defaultdict | 0.0225579738617 |
| 100 | string | setdefault | 0.020693063736 |
| 100 | string | defaultdict | 0.0240340232849 |
| 10000 | list | setdefault | 2.09283614159 |
| 10000 | list | defaultdict | 2.31266093254 |
| 10000 | set | setdefault | 2.12825512886 |
| 10000 | set | defaultdict | 3.43549799919 |
| 10000 | int | setdefault | 2.04997992516 |
| 10000 | int | defaultdict | 1.87312483788 |
| 10000 | string | setdefault | 2.05423784256 |
| 10000 | string | defaultdict | 1.93679213524 |
| 100000 | list | setdefault | 22.4799249172 |
| 100000 | list | defaultdict | 29.7850298882 |
| 100000 | set | setdefault | 23.5321040154 |
| 100000 | set | defaultdict | 41.7523541451 |
| 100000 | int | setdefault | 26.6693091393 |
| 100000 | int | defaultdict | 23.1293339729 |
| 100000 | string | setdefault | 26.4119689465 |
| 100000 | string | defaultdict | 23.6694099903 |
Conclusions and summary –
- Working with sets is almost always more expensive time-wise than working with lists
- As the dictionary size grows, the simple types – string and int – perform better with defaultdict than with setdefault, while set and list perform worse.
- Main conclusion – choosing between defaultdict and setdefault also depends mainly on the type of the default value.
- In this test I tested a particular use case – accessing each key twice. Different use cases \ distributions, such as assignment or accessing the same key over and over again, may have different properties.
- There is no firm conclusion here, just an investigation of some of the interpreter's behavior.
Code –
import timeit
from collections import defaultdict
from itertools import product

def measure_setdefault(n, defaultvalue):
    data = {}
    # access each key twice: once to insert the default, once to read it back
    for i in xrange(0, n):
        x = data.setdefault(i, defaultvalue)
    for i in xrange(0, n):
        x = data.setdefault(i, defaultvalue)

def measure_defaultdict(n, defaultvalue):
    data = defaultdict(type(defaultvalue))
    for i in xrange(0, n):
        x = data[i]
    for i in xrange(0, n):
        x = data[i]

if __name__ == '__main__':
    number = 1000
    dict_sizes = [100, 10000, 100000]
    defaultvalues = [[], 0, "", set()]
    for dict_size, defaultvalue in product(dict_sizes, defaultvalues):
        print "dict_size: ", dict_size, " defaultvalue: ", type(defaultvalue)
        print "\tsetdefault:", timeit.timeit("measure_setdefault(dict_size, defaultvalue)", setup="from __main__ import measure_setdefault, dict_size, defaultvalue", number=number)
        print "\tdefaultdict:", timeit.timeit("measure_defaultdict(dict_size, defaultvalue)", setup="from __main__ import measure_defaultdict, dict_size, defaultvalue", number=number)
EuroPython 2014 – Python under the hood
This is actually a summary of a few talks dealing with “under the hood” topics; the talks partially overlap. Those topics include memory allocation and management, inheritance, overridden built-in methods, etc.
- __slots__ attribute – limits the memory allocated for objects in Python by doing away with the per-instance __dict__ attribute.
- Strings – the empty string and strings of length 1 are saved as constants. Use intern (or sys.intern in Python 3.x) on strings to avoid allocating multiple string variables with the same value; this helps make memory usage more efficient and string comparison quicker. More about this topic here.
- Numerical algorithms – the way a 2-dimensional array is allocated and stored (row-wise or column-wise) has a great impact on the performance of different algorithms (even for a simple sum function).
- Working with the GPU – there are packages that process some of the data on the GPU. It is efficient when the data is big and less efficient when the data is small since copying all the data to the GPU has some overhead.
- CPython uses C's “malloc” function for re-allocating space when lists \ dictionaries \ sets grow or shrink. On one hand this function can be overridden; on the other hand one can try to avoid costly processes which cause space allocation, or use more efficient data structures, e.g. a list instead of a dictionary where possible.
- Mind the garbage collector! Python's garbage collector is based on reference counting. Overriding the __del__ method may disrupt the garbage collector.
- Suggested profiling and monitoring tools – psutil, memory_profiler, objgraph, RunSnakeRun + Meliae, valgrind
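A minimal sketch of the __slots__ point above (modern Python 3 syntax; the class names are illustrative):

```python
class Plain(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

class Slotted(object):
    __slots__ = ("x", "y")   # no per-instance __dict__ is allocated
    def __init__(self, x, y):
        self.x, self.y = x, y

p, s = Plain(1, 2), Slotted(1, 2)
print(hasattr(p, "__dict__"), hasattr(s, "__dict__"))  # True False

# the trade-off: attributes outside __slots__ cannot be added
try:
    s.z = 3
except AttributeError:
    print("cannot add attributes outside __slots__")
```

Dropping the per-instance dict saves memory when many small objects are created, at the cost of the dynamic attribute assignment shown failing above.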