Graph Databases, a little connected tour

EuroPython talk by: Francisco Fernández Castaño
 
 
The talk was classified as novice, and so it was: very basic graph database ideas.
 
Castaño started by presenting the general idea of graph databases and showing some use cases.
 
One of the best-known use cases is social media data, e.g. friends-of-friends queries.
 
Then he presented Neo4j, a graph database written in Java (though it was originally written in Python). Neo4j is known to be very scalable and to support ACID transactions. Another nice property of Neo4j is the ability to extend the given REST API to your own needs.
 
The next part of the talk focused on the Cypher language, which is the way to query a Neo4j database. Neo4j has some nice UI features, and I really missed a reference or example of those in this talk.
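Cypher itself stayed fairly abstract in the talk, so here is a minimal friends-of-friends sketch of my own (not from the talk), run from Python with the official neo4j driver; the Person label, FRIEND relationship type, and credentials are all assumptions:

    from neo4j import GraphDatabase

    # assumed setup: a local Neo4j instance with Person nodes and FRIEND relationships
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    FRIENDS_OF_FRIENDS = """
    MATCH (me:Person {name: $name})-[:FRIEND]-()-[:FRIEND]-(fof)
    WHERE fof <> me
    RETURN DISTINCT fof.name AS name
    """

    with driver.session() as session:
        for record in session.run(FRIENDS_OF_FRIENDS, name="Alice"):
            print(record["name"])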
 
Last but not least, a tip given in the talk: you can try Neo4j without having to install it on your machine at http://www.graphenedb.com/ (free sandbox of up to 1k nodes and 10k edges).

The Shogun Machine Learning Toolbox

EuroPython talk by: Heiko Strathmann, herrstrathmann.de
 
 
Strathmann presented the Shogun toolbox. Shogun is a machine learning toolbox which implements most of the common supervised and unsupervised algorithms. Shogun is implemented in C++ and interfaces with Python as well as with Java, Ruby, Matlab, Lua, etc. Shogun is meant to run on a single machine and is not distributed.
 
Shogun is an open-source project started in 2004. It has 8 core developers plus around 20 contributors and is fairly active. It also collaborates with Google Summer of Code: 29 projects so far, 8 ongoing.
 
The C++ core may be the most significant advantage of Shogun over SciPy: it is simply faster and enables more efficient memory allocation and runtime optimization, as well as multi-language support. This allows training and classifying on a larger number of samples and in higher dimensions.
 
The bindings/interfaces to other languages are generated using SWIG – http://www.swig.org
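As a taste of what the SWIG-generated Python interface looks like, here is a minimal sketch of my own for training a kernel SVM; the class names follow the 2014-era modular interface (modshogun) and may differ in other releases:

    import numpy as np
    from modshogun import RealFeatures, BinaryLabels, GaussianKernel, LibSVM

    X = np.random.randn(2, 100)                         # one example per column
    y = np.where(np.random.randn(100) > 0, 1.0, -1.0)   # labels in {-1, +1}

    features = RealFeatures(X)
    labels = BinaryLabels(y)
    kernel = GaussianKernel(features, features, 1.0)    # Gaussian kernel, width 1.0

    svm = LibSVM(1.0, kernel, labels)                   # regularization C, kernel, labels
    svm.train()
    predictions = svm.apply(features).get_labels()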

Building, bug tracing, and deployment are done using http://buildbot.net/.

 
Summarizing – especially when scaling up, Shogun seems like a good alternative to SciPy (competition makes both products better). The Shogun site offers some tutorials, including notebooks and demos.

Log everything with logstash and elasticsearch

EuroPython talk by: Peter Hoffmann, @peterhoffmann

 
A talk by Peter Hoffmann from Blue Yonder, one of the sponsors of the conference.

 

Another very good talk by an experienced speaker. However, the name is kind of misleading. Yes, logstash and elasticsearch were mentioned, but the main concept of the talk was really the logging chain and centralized logging; logstash and elasticsearch are just two tools along this chain, and there are alternatives at every step of it.

This talk gave me a lot to think about regarding logging at the company level and how logs should be run, monitored, etc. There is also some tension to resolve/define around what is logged and what error messages/outputs an app should provide (“logging best practices”).
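To make the "chain" idea concrete, here is a small sketch of my own (not from the talk) of its first link: emitting structured JSON logs with the stdlib, one object per line, which a shipper such as logstash can then collect:

    import json
    import logging

    class JsonFormatter(logging.Formatter):
        """Emit one JSON object per line -- easy for a log shipper to parse."""
        def format(self, record):
            return json.dumps({
                "timestamp": self.formatTime(record),
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            })

    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("app")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("user %s logged in", "alice")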

The video is already available on the EuroPython site, and I hope the slides will be available soon too.

Full Stack Python

EuroPython talk by: Matt Makai, @mattmakai 

 
I expected this talk to be full of buzzwords and it was… but in a good sense.

 
Makai is the builder of the Full Stack Python site, and as such he spoke about what you need from the moment you have an idea for a Python web app until you deploy it, including all the essential steps – WSGI server, hosting, logging, etc.
 
Every layer in the site includes relevant links and tutorials. A good starting point for a Python web-app developer.
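As a taste of the lowest layer covered, here is the canonical minimal WSGI application (my example, not from the talk); any WSGI server can host it, e.g. gunicorn hello:app:

    # hello.py -- serve with e.g.: gunicorn hello:app
    def app(environ, start_response):
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"Hello from a bare WSGI app\n"]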
 
The talk was interesting due to Makai's enthusiasm for teaching and sharing his knowledge (and of course for promoting his site, which is legit as well) and his professional expertise. There are really few people who can speak so fluently for 25 minutes.
 

5 interesting things (18/7/2014)

A/B testing – Needless to say, A/B testing is a very hot buzzword in the industry. A/B tests were used long ago in psychology, but today they are much more accessible and easier to set up. The following series of posts (one is still not published) describes 5 simple guidelines for A/B testing (a toy significance-test sketch follows the link).

http://sl8r000.github.io/ab_testing_statistics/
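At the heart of most A/B write-ups is a simple significance test; here is a toy two-proportion z-test of my own (not from the posts), using only the stdlib:

    from math import erf, sqrt

    def two_proportion_ztest(conversions_a, n_a, conversions_b, n_b):
        """Two-sided z-test for the difference between two conversion rates."""
        p_a, p_b = conversions_a / n_a, conversions_b / n_b
        p_pool = (conversions_a + conversions_b) / (n_a + n_b)
        se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        z = (p_b - p_a) / se
        p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal tail, two-sided
        return z, p_value

    # variant B converts 24% vs. 20% for A over 1000 users each -> p ~ 0.03
    print(two_proportion_ztest(200, 1000, 240, 1000))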

Shellify – a decorator which turns Python modules into shells. I don't see everyday usage for it, maybe during development, but it has a high coolness factor, which is important as well (a stdlib sketch of the general idea follows the link).

https://bitbucket.org/johannestaas/shellify
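I haven't dug into shellify's actual API, so this is not it; the stdlib cmd module gives a feel for the general idea of exposing Python functions as an interactive shell:

    import cmd

    class ToolShell(cmd.Cmd):
        """An interactive shell exposing a couple of functions as commands."""
        prompt = "(tool) "

        def do_greet(self, arg):
            """greet NAME -- say hello to NAME."""
            print("hello, %s" % (arg or "world"))

        def do_exit(self, arg):
            """exit -- leave the shell."""
            return True

    if __name__ == "__main__":
        ToolShell().cmdloop()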

Code review without your glasses – I like this post because, in my humble opinion, it is very creative, out-of-the-box thinking. It is also a good reminder of what to notice in code review.

http://robertheaton.com/2014/06/20/code-review-without-your-eyes/

Sidekick – the last link reminded me of rubber duck debugging, and along the way I found this – http://rubberduckreview.com/

Deployment academy – a series of posts on the Rainforest blog. This time, a post about “zero downtime database migrations”. The post is simple and easy to understand and addresses a common issue in the life of a developer. Of course, practices and specific problems differ from one organization to another, but the core ideas hold as-is (see the sketch after the link).

https://blog.rainforestqa.com/2014-06-27-zero-downtime-database-migrations/
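The gist, as I read it, is the expand/backfill/contract pattern; the sketch below is my own summary, with a hypothetical db connection object and MySQL-flavored SQL (UPDATE ... LIMIT), so adapt it to your driver and database:

    # `db` is a hypothetical DB-API-style connection object (assumption, not real API)

    # 1. Expand: add the new column as nullable, so currently-running code keeps working.
    db.execute("ALTER TABLE users ADD COLUMN email_normalized TEXT")

    # 2. Backfill in small batches to avoid long-held locks.
    while db.execute(
        "UPDATE users SET email_normalized = LOWER(email) "
        "WHERE email_normalized IS NULL LIMIT 1000"
    ).rowcount:
        pass

    # 3. Deploy application code that reads and writes only the new column.
    # 4. Contract: once nothing touches the old shape, enforce constraints / drop leftovers.
    db.execute("ALTER TABLE users MODIFY email_normalized TEXT NOT NULL")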

Awesome *

Are we going back?

In the past week I have bumped into two GitHub repositories, awesome-python and awesome-sysadmin. Both repositories do a great job of composing an interesting list/index of relevant tools.

However, this made me feel that we are going backwards. Those indices reminded me of the pre-search-engine days, or shall I say the BME days (before the modern era ;). Specifically, this reminded me of AltaVista (but also other index sites – do you recall Lycos?) where all the links were indexed under categories and subcategories, and one had to dig through those categories to find what one was looking for.

Aren’t today's search engines strong enough to answer the query “python machine learning package”? Is human indexing really our last resort? I don't really think so. I believe in the power of the humans behind the machines and their ability to build good enough search engines. Those indices can be very useful, but I would rather have them built automatically and not manually.

5 interesting things (23/6/2014)

Celery best practices – the Python celery package is on my “todo list” for when I have a relevant task. However, some of the thoughts and ideas discussed in this post apply to software engineering more generally (a small sketch follows the link).

https://denibertovic.com/posts/celery-best-practices/
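One practice worth making concrete is retrying flaky tasks with a delay rather than failing outright; a minimal sketch of my own (not the post's; deliver_report is a hypothetical helper):

    from celery import Celery

    app = Celery("tasks", broker="redis://localhost:6379/0")

    @app.task(bind=True, max_retries=3)
    def send_report(self, address):
        try:
            deliver_report(address)  # hypothetical helper doing the actual work
        except ConnectionError as exc:
            # retry later with a delay instead of failing the task immediately
            raise self.retry(exc=exc, countdown=60)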

Keep calm and learn d3.js – d3.js is also something which is on my “todo list”. I have used it sporadically here and there but would like to use it better (and also to gain deeper experience with JS). In any case, this is a very nice tutorial to get started with d3.js.

http://slides.com/kentenglish/getting-started-with-d3-js-using-the-toronto-parking-ticket-data

DBSCAN well explained – unsupervised learning sometimes feels like no man's land, and k-means is almost always the choice when picking a clustering algorithm. This post not only does a great job of explaining the algorithm itself, it also gives great examples and shows how to adjust the parameters (a quick scikit-learn run follows the link).

http://cjauvin.blogspot.ca/2014/06/dbscan-blues.html
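For a quick hands-on feel, a minimal scikit-learn run (my example, not the post's): the two knobs to tune are eps and min_samples, and points labeled -1 are treated as noise:

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    # two interleaved half-moons -- a shape k-means handles badly
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    # eps bounds the neighborhood radius; min_samples sets the density threshold
    labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
    print(set(labels))  # cluster ids, plus -1 for noise points if any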

Machine learning in Airbnb – this post is a case study of machine learning as it is used in Airbnb. I like reading such posts because it is interesting to learn about the challenges other organizations face, about their solutions, and about how they use existing tools and packages and adjust and optimize them to their needs (in this post – scikit-learn and rewriting an R export function in C++).

http://nerds.airbnb.com/architecting-machine-learning-system-risk/

Current events – World Cup predictions – as I love both sports and machine learning, statistics, and so forth, I have been following the several posts which try to employ ML techniques to predict World Cup results. The most interesting post I have read so far is Nate Silver's. It takes many features into account and explains them clearly.

http://fivethirtyeight.com/features/its-brazils-world-cup-to-lose/

Andrew Ng and PyStruct meetup

Yesterday I attended the “Andrew Ng and pyStruct” meetup. 

http://www.meetup.com/berlin-machine-learning/events/179264562/

I was lucky enough to get a place at the meetup thanks to the Germany–Portugal game that happened at the same time 🙂

The first part, by Andrew Ng, was a video meetup joining 3 locations – Paris, Zurich, and Berlin. Andrew Ng is a co-founder of Coursera and a machine learning guru. He teaches the ML course on Coursera, which is one of its most popular courses (I took it myself, and it is a very good and structured introduction to machine learning; a new session started yesterday). He teaches at Stanford, and soon he will be leaving for Baidu Research.

The talk included a 15–20 minute introduction to deep learning: recent results, applications, and challenges. He mainly focused on scaling up deep learning algorithms to use billions of features/properties. The rest of the talk was Q&A, mostly regarding the theoretical aspects of deep learning, future challenges, etc. For me, one of the most important things he said was “innovation is a result of team work”.

Some known applications of deep learning are speech recognition, image processing, etc.

In the end he suggested taking Stanford deep learning tutorial – http://deeplearning.stanford.edu/tutorial/

T.R – there are currently 2 Python packages I know of which deal with deep learning – 

 The next talk was given by Andreas Mueller. You can find his slides here.

Mueller introduced structured prediction, which is a natural extension or generalization of regression problems to a structured output rather than just a number/class. Structured learning has an advantage over other supervised learning algorithms in that it can learn several properties at once and use the correlations between those properties.

Example – customer data, with several properties per customer: gender, marital status, has children, owns a car, etc. One can guess that the married and has-children properties are highly correlated, and that learning those properties together gives a better chance of good results. It is better than LDA in the sense that it has fewer classes (not every combination of the variables is a class) and it requires less training data.

Other examples include pixel classification – classifying each pixel to an object in the image – OCR, etc.

He then talked about PyStruct – a Python package for structured prediction. Actually, there is not much to add that is not written in the docs.
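For a taste, here is a minimal sketch along the lines of the OCR chain-CRF example in the PyStruct docs, as I recall it (double-check the names against the package itself):

    from pystruct.datasets import load_letters
    from pystruct.learners import FrankWolfeSSVM
    from pystruct.models import ChainCRF

    letters = load_letters()
    X, y, folds = letters["data"], letters["labels"], letters["folds"]
    X_train = [x for x, f in zip(X, folds) if f == 1]
    y_train = [t for t, f in zip(y, folds) if f == 1]

    # each sample is a word, i.e. a chain of letter images; the chain CRF also
    # learns transition weights between neighboring letters
    ssvm = FrankWolfeSSVM(model=ChainCRF(), C=0.1, max_iter=10)
    ssvm.fit(X_train, y_train)
    print(ssvm.score(X_train, y_train))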

5 interesting things (14/6/2014)

My little helper – a search engine for code examples and use cases. I haven't yet tried it “live” when I needed something, but I hope I will remember it next time.

https://sourcegraph.com/

Httpie and Percol – two unrelated tools, but I see a lot of similarity between them, as both try to change the common way we do things on the command line. Httpie tries to make curl-style requests more human-understandable, and percol tries to make filtering via pipes more interactive. Reminds me a bit of editing in Sublime.

https://github.com/jakubroztocil/httpie

https://github.com/mooz/percol

 

Python 3 is good for you – a bit long but very interesting overview of 10 features which are new in Python 3 (a few samples after the link). There have recently been a lot of posts around the web discussing whether Python 3 is better than Python 2.x, whether Python 3 should be rolled back and buried forever, etc. This is one of the most informative posts I have read (although it could be shorter and better summarized). I think one of the main reasons organizations don't currently move to Python 3, besides the fact that people and organizations don't love change, is that it is an expensive process (mainly compatibility-wise), and even this post does not succeed in making a convincing case for the added value.

http://asmeurer.github.io/python3-presentation/slides.html#1
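For a taste, a few of the Python-3-only conveniences of the kind the post covers (my picks, not necessarily its exact list):

    # extended iterable unpacking
    first, *rest = [1, 2, 3, 4]

    # keyword-only arguments
    def connect(host, *, timeout=10):
        return host, timeout

    # delegating generators with `yield from`
    def chained(*iterables):
        for iterable in iterables:
            yield from iterable

    print(first, rest, connect("db", timeout=3), list(chained("ab", "cd")))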

Are we humans or are we dancers… sorry, computers

http://wired.com/2014/01/how-to-hack-okcupid/all

Kernel tricks – a very clear post about the kernel trick, which also clarifies additional machine learning terminology, and the examples are very good. I would say this is a very good post for beginners-to-intermediates in machine learning (a worked toy example follows the link). Going the extra mile would be writing something similar about PCA, as it has a lot of similar ideas.

http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html
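The core of the trick in a few lines (my own toy example, not from the post): for the degree-2 polynomial kernel, a cheap computation in the original space equals the inner product in the lifted feature space.

    import numpy as np

    x = np.array([1.0, 2.0])
    z = np.array([0.5, -1.0])

    def phi(v):
        """Explicit degree-2 polynomial feature map for 2-d input."""
        return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

    # the trick: k(x, z) = (x . z)^2 matches the lifted inner product
    # without ever computing phi explicitly
    assert np.isclose(np.dot(phi(x), phi(z)), np.dot(x, z) ** 2)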