5 interesting things (04/01/2015)

Is Wikipedia a microcosm of our world? The news of 2014 as reflected in Wikipedia.

JSON as Python modules – making life simpler by making it possible to do "import x" for x.json.
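I have not checked how the linked project actually implements it, but a minimal sketch of the idea, using Python 2's old-style meta-path importer protocol (module and file names here are hypothetical), could look roughly like this:

import imp
import json
import os
import sys

class JsonImporter(object):
    # meta-path importer that exposes x.json (in the working directory) as module x
    def find_module(self, name, path=None):
        return self if os.path.exists(name + '.json') else None

    def load_module(self, name):
        module = sys.modules.setdefault(name, imp.new_module(name))
        # assumes the JSON top level is an object; its keys become module attributes
        module.__dict__.update(json.load(open(name + '.json')))
        return module

sys.meta_path.append(JsonImporter())
# after this, "import settings" will load settings.json if it exists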
 
Python the fast way – living on the edge with Python 🙂 Some hints on how to make your code run faster. Another possibility – trying different interpreters. It was nice to see the assembly instructions; I was expecting something more advanced.

Data Science Ontology – just having fun with d3.js.

NIPS Experiment – there was quite a lot of chatter regarding the experiment done by the NIPS committee. Nonetheless, it leaves a very uncomfortable feeling regarding committees.

5 interesting things (14/12/2014)

Twitter analytics with Spark – I really enjoyed this post for several reasons. First, because the need and the path that led to the solution are clear and well explained. Second, it uses a novel approach, mining a social network for non-trivial uses. Moreover, querying the app produced very nice results, which is not trivial. I missed some reasoning \ explanation in the app, i.e. a better understanding of the connections between two organizations. I also wonder how quickly the relations change. The technical part was interesting as well, although I haven't gotten to work with Spark so far.
 
Teaching creatively – I really love to teach, and I try to adjust the methods, content and examples to the audience, to be as creative as I can and to be passionate about the content. I believe that this way the audience will remember what I talked about. Being hands-on also helps to remember and understand. This post, which demonstrates creativity on so many levels, is amazing and very inspiring.

Databases compression – summarizes the high-level picture and the things to know about each database very well. A good starting point when evaluating several solutions.
 
Experiments at AirBnB – although I usually prefer to read about individuals who did something cool with data, I like AirBnB's data blog as there is always something interesting to read about. This time it is about experiments, the very trendy "A/B testing". It raises
 
DevOps bookmarks – DevOps tools and frameworks. An aggregator – reminds me of the awesome-* lists, but with a very tempting implementation.

Common Crawl meetup

Yesterday I attended the Big Data Beers meetup. The meetup included 2 talks by Common Crawl employees – Lisa Green and Stephen Merity. Both talks were great and the connection between them was empowering.
The meetup was sponsored by Data Artisans, who are working on Apache Flink. Too bad I don't have time to go to their meetup today.
What is Common Crawl?
Common Crawl is an NGO that makes web data accessible to everyone at little or no cost. They crawl the web and release a monthly build which is stored in AWS S3 under the public data sets. They respect robots.txt and nofollow flags and basically try to be good citizens of the internet cosmos. As Lisa Green said in her talk – they believe that the web is "a digital copy of our world" and the greatest data set, and their mission is to make it available to everyone.
Technicalities
  • Monthly build (they currently prefer bigger monthly builds over more frequent ones)
  • Latest build was 220TB with crawl data of 2.98 billion web pages.
  • The data includes 3 types of files –
    • WARC files of the raw crawl data
    • WAT files which include the metadata for the data stored in the WARC (about 1/3 of the raw crawl data)
    • WET files which hold the plaintext from the data stored in the WARC (about 15% of the raw crawl data).
  • The delay between publication of a page and its crawl \ publication time is approximately a month to a month and a half.
  • No incremental dumps are planned at the moment.
  • The data is currently skewed toward dot-com domains. They plan to improve this; changes will hopefully be seen in the January dump.
  • They crawl using Apache Nutch, an open source web crawler, and cooperate with Blekko in order to avoid spam.
Common Crawl encourages researchers, universities and commercial companies to use their data. If you ask politely they will even grant you some Amazon credit.
The Talks
Lisa Green talked about the general idea of open data (government data, commercial data, web data) and gave some examples of using open data in general. The example I liked the most – using Orange cell phone data to identify Ebola spreading patterns, see here. This was a very inspiring introduction to Stephen Merity's more technical talk.
Stephen Merity spoke about the more technical parts and gave amazing examples of how to do computations that are both fast and cheap (spot instances rock). He showed interesting data about computing PageRank on the entire Common Crawl data set, some NLP stuff and other interesting insights about their data.
Another relevant talk in the area is Jordan Mendelson's talk from Berlin Buzzwords – "Big Data for Cheapskates" (if you are in a hurry, start from the 18th minute).
Slides are available in –

What can you do with Common Crawl

Treating it as a data set, there is a lot to explore (a minimal sketch of reading the plaintext records follows the list) –
1. Language detection – train language detection models for specific domains.
2. Named Entity Recognition.
3. Investigate the relations between different domains and the web structure – identify competitors, PageRank, etc.
4. Investigate the relations between different technologies – which JS libraries appear together, changes in technology usage over time.
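As a starting point, here is a minimal sketch of iterating over the plaintext records of a locally downloaded WET file. It assumes the warcio package (my choice, not something Common Crawl mandates) and a hypothetical file name:

from warcio.archiveiterator import ArchiveIterator

# WET files store the extracted plaintext as "conversion" records
with open('CC-MAIN-example.warc.wet.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != 'conversion':
            continue
        url = record.rec_headers.get_header('WARC-Target-URI')
        text = record.content_stream().read().decode('utf-8', 'replace')
        # feed (url, text) into language detection, NER, link analysis, etc.
        print url, len(text)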

5 interesting things (06/11/2014)

Geeks' pleasure – fooling math. Creating two images (and actually any type of document) with the same MD5 hash.
Django vs Flask vs Pyramid – Python has a great open source community which is growing rapidly. One of the advantages is having several solutions to the same or to nearby problems. This post compares 3 well-known Python web frameworks – Django, Flask and Pyramid.
From my point of view, having worked with both Django and Flask: Flask is like riding a motorcycle while Django is a tank. Django is more tightly coupled with an SQL backend and the relevant dependencies and plugins, while Flask allows quick, light functionality.
The Science of Crawl – half related to a project I am currently doing. These two posts concern problems that everyone who has indexed and \ or crawled data has faced.
The invisible wall – 25 years later and the invisible wall still separates east from west in Germany. Besides me living in Berlin, the visualizations convey the point very well.
http://www.washingtonpost.com/blogs/worldviews/wp/2014/10/31/the-berlin-wall-fell-25-years-ago-but-germany-is-still-divided/

Toolz – we all know the term design patterns. Toolz provides implementation patterns – everyday utilities every developer needs for iterators, dictionaries, functions, etc. Two links – a blog post with common use cases and the Toolz documentation.

http://matthewrocklin.com/blog/work/2014/07/04/Streaming-Analytics/

http://toolz.readthedocs.org/en/latest/
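For a taste of the style, a tiny sketch of a few utilities I find handy (the data is made up):

from toolz import frequencies, groupby, pipe

words = ['cat', 'mouse', 'dog', 'cat', 'horse', 'dog', 'cat']

print frequencies(words)   # counts per word, e.g. {'cat': 3, 'dog': 2, 'mouse': 1, 'horse': 1}
print groupby(len, words)  # {3: ['cat', 'dog', 'cat', 'dog', 'cat'], 5: ['mouse', 'horse']}

# pipe threads a value through a sequence of functions
print pipe(words, set, sorted)  # ['cat', 'dog', 'horse', 'mouse']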

Running my first EMR – lessons learned

Today I was trying to run my first EMR job; here are a few lessons I learned during the day. I have previously run Hadoop streaming MapReduce jobs, so I was familiar with the MapReduce state of mind. However, I was not familiar with the EMR environment.

I used boto – Amazon's official Python interface.

1. AMI version – the default AMI version is 1.0.0, the first release. This means the following specifications –

Operating system: Debian 5.0 (Lenny)

Applications: Hadoop 0.20 and 0.18 (default); Hive 0.5, 0.7 (default), 0.7.1; Pig 0.3 (on Hadoop 0.18), 0.6 (on Hadoop 0.20)

Languages: Perl 5.10.0, PHP 5.2.6, Python 2.5.2, R 2.7.1, Ruby 1.8.7

File system: ext3 for root and ephemeral

Kernel: Red Hat

For me Python 2.5.2 means –
  • Does not include the json module – new in version 2.6.
  • collections is new in Python 2.4, but not all the containers were added in that version –
    • namedtuple() – factory function for creating tuple subclasses with named fields – new in version 2.6.
    • deque – list-like container with fast appends and pops on either end – new in version 2.4.
    • Counter – dict subclass for counting hashable objects – new in version 2.7.
    • OrderedDict – dict subclass that remembers the order entries were added – new in version 2.7.
    • defaultdict – dict subclass that calls a factory function to supply missing values – new in version 2.5.

 
Therefore specifying the ami_version can be critical. Version 2.2.0 worked fine for me.
2. Must process all the input!
Naturally we want to process all the input. However, for testing I went over only the first n lines and then added a break to make things run faster. I was not consuming all the lines and therefore got an error. More about it here –
3. The output folder must not exist. This is the same as in Hadoop streaming MapReduce; for me the way to avoid it was to add a timestamp –

output="s3n://<my-bucket>/output/"+str(int(time.time()))

4. Why my process failed – one option which produces a relatively understandable explanation is – conn.describe_jobflow(jobid).laststatechangereason

5. cache_files – enables you to import files you need for the MapReduce process. It is super important to "specify a fragment", i.e. specify the local file name –

cache_files=['s3n://<file-location>/<file-name>#<local-file-name>']
Otherwise you will obtain the following error –
“Streaming cacheFile and cacheArchive must specify a fragment”
6. Status – there are 7 different statuses your flow may have – COMPLETED, FAILED, TERMINATED, RUNNING, SHUTTING_DOWN, STARTING and WAITING. The right order of statuses if everything goes well is STARTING -> RUNNING -> SHUTTING_DOWN -> COMPLETED (a short end-to-end sketch putting these lessons together follows).
SHUTTING_DOWN may take a while – even for a very simple flow I measured about 1 minute in the SHUTTING_DOWN state.
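Putting the lessons together, a minimal sketch of the flow using boto's EMR module (bucket, file and job names are placeholders, not my real ones):

import time
from boto.emr.connection import EmrConnection
from boto.emr.step import StreamingStep

conn = EmrConnection()  # credentials taken from the environment / boto config

step = StreamingStep(
    name='my streaming step',
    mapper='s3n://<my-bucket>/code/mapper.py',
    reducer='s3n://<my-bucket>/code/reducer.py',
    input='s3n://<my-bucket>/input/',
    # timestamped output folder so it never already exists (lesson 3)
    output='s3n://<my-bucket>/output/' + str(int(time.time())),
    # the fragment after '#' is the local file name (lesson 5)
    cache_files=['s3n://<my-bucket>/data/lookup.txt#lookup.txt'])

jobid = conn.run_jobflow(
    name='my first EMR job',
    log_uri='s3n://<my-bucket>/logs/',
    ami_version='2.2.0',  # lesson 1 - do not stay on the ancient default AMI
    steps=[step])

# poll the flow state (lesson 6) and print the failure reason if any (lesson 4)
while True:
    flow = conn.describe_jobflow(jobid)
    print flow.state
    if flow.state in ('COMPLETED', 'FAILED', 'TERMINATED'):
        print flow.laststatechangereason
        break
    time.sleep(30)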
Resources I used –

http://boto.readthedocs.org/en/latest/emr_tut.html

5 interesting things (04/09/2014)

C3.js – I have previously written a post about the importance of visualization in a data scientist's skill set. C3.js is a JavaScript chart library based on d3.js which seems, at least at a glance, to be simple and intuitive. I would like to see a Python client for it, but that is for the future to come.

http://c3js.org/

nvd3 also does something similar – charts based on d3.js – and also has a Python client which I have worked with a bit. Comparing the two, c3.js seems a little more mature than nvd3, ignoring the lack of a Python client, but I'm sure that gap will be filled soon.

nvd3.org

Harvard visualization course – I went through some of the slides and it was fascinating, but what is even more exciting is the great collection of links about visualization examples, theory and tools. Great work.

http://www.cs171.org/#!index.md

textract – I needed high flexibility of input types in a project I am working on, and of course I wanted to handle them as transparently as possible without locking myself into all the relevant packages or adjusting my code to the API of each package. Fortunately somebody already did it – textract. The package is not perfect and there are some "glitches", mostly concerning the parsers themselves (line splitting, non-ascii, etc.) and not the unified API textract provides. However, it is a very good start.

http://textract.readthedocs.org/en/latest/
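Usage is as simple as it gets – a minimal sketch (the file name is hypothetical):

import textract

# textract picks the parser according to the file extension
text = textract.process('report.pdf')
print text[:200]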

Visualizing Garbage Collection Algorithms – both a very cool visualization and good explanations. Design-wise I think the visualization should be larger, but the concept itself is very neat.

http://spin.atomicobject.com/2014/09/03/visualizing-garbage-collection-algorithms/

SmartCSV – making CSV reading more structured by defining a model and validating it while reading. Enables skipping rows (and soon also skipping columns). It is an ongoing project, and feature requests and issues are currently addressed quickly.

 

5 interesting things (26/08/2014)

Gooey – from command line to application! Works by mapping ArgumentParser to GUI objects. Very cool, and it makes it easier to build instant tools to play with around the office. Future tasks include supporting more themes, different argument parsers (e.g. docopt http://docopt.org/), etc. I believe there is much potential in this project.

https://github.com/chriskiehl/Gooey
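A minimal sketch of how it is used, assuming a hypothetical toy tool built on ArgumentParser:

from argparse import ArgumentParser
from gooey import Gooey

@Gooey  # wraps the ArgumentParser-based CLI in a generated GUI
def main():
    parser = ArgumentParser(description='Toy tool')
    parser.add_argument('input_file', help='file to process')
    parser.add_argument('--verbose', action='store_true')
    args = parser.parse_args()
    print args.input_file, args.verbose

if __name__ == '__main__':
    main()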

Datamash – a stats command line tool. I like working from the command line, at least to begin with, to get a feel for the data before doing complex things, and this tool is really what I need. However, the documentation is missing \ not clear enough \ open-source style.

http://www.gnu.org/software/datamash/

In that sense, q, http://harelba.github.io/q/, is also very cool and answers the need to do things and speak in the same framework across different tasks.

Getting started with Python internals – how to dive into the deep water. A very enriching post which also includes links to other interesting posts.

http://akaptur.github.io/blog/2014/08/03/getting-started-with-python-internals/

python-ftfy – unicode strings can be very painful when they include special characters such as umlauts (ö), HTML entities (&gt;) and everything else users can think of :). The goal of this package is to at least partly ease this pain.

https://github.com/LuminosoInsight/python-ftfy
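A minimal sketch of what it does (the input string is made up; the output comment is roughly what I would expect):

# -*- coding: utf-8 -*-
import ftfy

broken = u'The match in KÃ¶ln ended with &gt; 3 goals'
print ftfy.fix_text(broken)  # roughly: u'The match in Köln ended with > 3 goals'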

Fraud detection at Lyft vs Uber – I think the interesting part here is the data visualization as a tool to understand the problem better.

setdefault vs get vs defaultdict

You have a Python dictionary and you want to get the value of a specific key in the dictionary – so far so good, right?

And then a KeyError –

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 1

Hmmm, well, if this key does not exist in the dictionary I can use some default value like None, 10 or an empty string. What are my options for doing so?

I can think of 3 –

  • get method
  • setdefault method
  • defaultdict data structure

Let's first investigate get and setdefault –

key, value = "key", "value"
data = {}
x = data.get(key,value)
print x, data #value {}
data= {}
x = data.setdefault(key,value)
print x, data #value {'key': 'value'}

Well, we get almost the same result: x obtains the same value, and with get data is not changed while with setdefault data changes. When does it become a problem?

key, value = "key", "value"
data = {}
x = data.get(key, []).append(value)
print x, data  # None {}
data = {}
x = data.setdefault(key, []).append(value)
print x, data  # None {'key': ['value']}

So, when we are dealing with mutable data types, the difference is clearer and more error-prone.

When to use each? It mainly depends on the content of your dictionary and its size.

We could time the differences, but it does not really matter much here as they produce different output, and it was not significant in either direction anyhow.

And for defaultdict –

from collections import defaultdict
data = defaultdict(list)
print data[key] #[]
data[key].append(value)
print data[key] #['value']

setdefault sets the default value for a specific key we access, while defaultdict is the type of the data variable and sets this default value for every key we access.

So, since we get roughly the same result, I timed the processes for several dictionary sizes (leftmost column), running each 1000 times (code below) –

dict size | default value | setdefault      | defaultdict
100       | list          | 0.0229508876801 | 0.0204179286957
100       | set           | 0.0209970474243 | 0.0194549560547
100       | int           | 0.0236239433289 | 0.0225579738617
100       | string        | 0.020693063736  | 0.0240340232849
10000     | list          | 2.09283614159   | 2.31266093254
10000     | set           | 2.12825512886   | 3.43549799919
10000     | int           | 2.04997992516   | 1.87312483788
10000     | string        | 2.05423784256   | 1.93679213524
100000    | list          | 22.4799249172   | 29.7850298882
100000    | set           | 23.5321040154   | 41.7523541451
100000    | int           | 26.6693091393   | 23.1293339729
100000    | string        | 26.4119689465   | 23.6694099903

(times are total seconds for 1000 runs)

Conclusions and summary –

  • Working with sets is almost always more expensive time-wise than working with lists.
  • As the dictionary size grows, simple types – string and int – perform better with defaultdict than with setdefault, while set and list perform worse.
  • Main conclusion – choosing between defaultdict and setdefault also depends mainly on the type of the default value.
  • In this test I tested a particular use case – accessing each key twice. Different use cases \ distributions, such as assignment or accessing the same key over and over again, may have different properties.
  • There is no firm conclusion here, just an investigation of some of the interpreter's behavior.

Code –

import timeit
from collections import defaultdict
from itertools import product

def measure_setdefault(n, defaultvalue):
    # access each key twice - once inserting the default, once reading it back
    data = {}
    for i in xrange(0, n):
        x = data.setdefault(i, defaultvalue)
    for i in xrange(0, n):
        x = data.setdefault(i, defaultvalue)

def measure_defaultdict(n, defaultvalue):
    data = defaultdict(type(defaultvalue))
    for i in xrange(0, n):
        x = data[i]
    for i in xrange(0, n):
        x = data[i]

if __name__ == '__main__':
    number = 1000
    dict_sizes = [100, 10000, 100000]
    defaultvalues = [[], 0, "", set()]
    for dict_size, defaultvalue in product(dict_sizes, defaultvalues):
        print "dict_size: ", dict_size, " defaultvalue: ", type(defaultvalue)
        print "\tsetdefault:", timeit.timeit("measure_setdefault(dict_size, defaultvalue)", setup="from __main__ import measure_setdefault, dict_size, defaultvalue", number=number)
        print "\tdefaultdict:", timeit.timeit("measure_defaultdict(dict_size, defaultvalue)", setup="from __main__ import measure_defaultdict, dict_size, defaultvalue", number=number)

EuroPython 2014 – Python under the hood

This is actually a summary of a few talks which deal with "under the hood" topics; the talks partially overlap. Those topics include memory allocation and management, inheritance, overridden built-in methods, etc.

Relevant talks –
The magic of attribute access by Petr Viktorin
Performance Python for Numerical Algorithms by Yves
Metaprogramming, from Decorators to Macros by Andrea Crotti
Everything you always wanted to know about Memory in Python but were afraid to ask by Piotr Przymus
Practical summary –
  • __slots__ attribute – limits the memory allocated for objects in Python by replacing the per-instance __dict__ (see the short sketch after this list).
  • Strings – the empty string and strings of length 1 are saved as constants. Use intern (or sys.intern in Python 3.x) on strings to avoid allocating several string variables with the same value; this helps make memory usage more efficient and string comparison quicker. More about this topic here.
  • Numerical algorithms – the way a 2-dimensional array is allocated and stored (row-wise or column-wise) has a great impact on the performance of different algorithms (even for a simple sum function).
  • Working with the GPU – there are packages that process some of the data on the GPU. It is efficient when the data is big and less efficient when the data is small since copying all the data to the GPU has some overhead.
  • CPython uses the C "malloc" function for re-allocating space when lists \ dictionaries \ sets grow or shrink. On one hand this function can be overridden; on the other hand one can try to avoid costly operations which cause space re-allocation, or use more efficient data structures, e.g. a list instead of a dictionary where possible.
  • Mind the garbage collector! Python's garbage collector is based on reference counting. Overriding the __del__ method may disrupt the garbage collector.
  • Suggested profiling and monitoring tools – psutil, memory_profiler, objgraph, RunSnakeRun + Meliae, valgrind.
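A short sketch of the first two points – this is just the idea, not code from the talks:

class Point(object):
    # __slots__ replaces the per-instance __dict__, saving memory for many small objects
    __slots__ = ('x', 'y')

    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(1, 2)
# p.z = 3  # would raise AttributeError - there is no __dict__ to hold extra attributes

# interning makes equal strings share one object, so comparison can be an identity check
a = intern('some-long-repeated-token')  # use sys.intern in Python 3.x
b = intern('some-long-repeated-token')
print a is b  # True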
The bottom line of all of these – "knowledge itself is power", i.e. knowing the internals and the impact of what we are doing can bring significant improvements.
There are always several ways to do things, and each has pros and cons fitted to the specific case. Some of them are simple to implement and use and can contribute to a great improvement in both running time and memory usage. On the other hand, some of these suggestions can really shoot you in the foot – causing memory leaks and other unexpected behavior – so beware.

5 interesting things (03/08/2014)

Goooooooooaaaaaalllll – the World Cup is over but Python and soccer are forever :). This post tries to identify goals and interesting events in soccer games based on the audio volume of YouTube videos. The post only touches the subject, but what a nice start!

http://zulko.github.io/blog/2014/07/04/automatic-soccer-highlights-compilations-with-python/

Scheduling with Celery – and I thought I could only make soup with celery... Scheduling is not the main goal of Celery, but it can be used as such, cron style. More about it –

http://www.caktusgroup.com/blog/2014/06/23/scheduling-tasks-celery/
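For example, a minimal sketch of a cron-style schedule using celery beat (the broker URL, task and timing are made up; this follows the Celery 3.x configuration names):

from celery import Celery
from celery.schedules import crontab

app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task
def send_report():
    print 'sending the daily report'

# picked up by the "celery beat" process, which sends the task every day at 07:30
app.conf.CELERYBEAT_SCHEDULE = {
    'daily-report': {
        'task': 'tasks.send_report',
        'schedule': crontab(hour=7, minute=30),
    },
}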

Markov chains – the two links below complement one another. One has a great visualization and the other explains things a bit more deeply.

http://setosa.io/blog/2014/07/26/markov-chains/

http://www.analyticsvidhya.com/blog/2014/07/markov-chain-simplified
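For the impatient, the whole idea fits in a few lines – a toy two-state weather model (the transition probabilities are made up):

import random

# transition probabilities between states
transitions = {
    'sunny': {'sunny': 0.8, 'rainy': 0.2},
    'rainy': {'sunny': 0.4, 'rainy': 0.6},
}

def next_state(state):
    # sample the next state according to the current state's transition probabilities
    r, acc = random.random(), 0.0
    for candidate, p in transitions[state].items():
        acc += p
        if r < acc:
            return candidate
    return candidate

state, walk = 'sunny', []
for _ in range(10):
    state = next_state(state)
    walk.append(state)
print walk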

My Ig Nobel candidate –

http://www.bbc.com/news/magazine-20578627

Hotels by WiFi – geeky but important these days.

http://www.hotelwifitest.com