5 interesting things (31/05/2015)

Vis.js demo – (yet another) JavaScript visualization library. I love the relative graph approach; I find it easier for Python developers than d3.js (which I also like).
Prediction.io – a machine learning server for developers and data scientists, built on Apache Spark, HBase and Spray. I haven’t used it but it seems very interesting.
Django performance – 4 little tricks – I’m familiar and experienced with Django but have never dived deep. These may be very basic tricks, but for me they were good to keep in mind.
Deep dive into Elasticsearch storage – as I currently work a lot with Elasticsearch (should I say Elastic..) I found this post very interesting.
Amazon machine learning – the new buzz in the neighborhood

6 interesting things (20/2/2015)

Making an exception but I really came across some interesting things –

A\A Testing – a well known idea in machine learning (train, cross-validation, test) and in other research best practices, now presented as a take-off on the buzzword term “A\B testing”. Still, it is well explained and can be eye opening at the right moment.
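The point of an A\A test can be demonstrated with a short simulation: split identically distributed data into two “variants” and see how often a naive significance test cries wolf. This is my own standard-library sketch, not code from the linked post:

```python
import random
import statistics

random.seed(0)

def aa_test(n=500, trials=200, threshold=1.96):
    """Split identically distributed data into two 'variants' many times
    and count how often a naive z-test flags a significant difference."""
    false_positives = 0
    for _ in range(trials):
        a = [random.gauss(0, 1) for _ in range(n)]
        b = [random.gauss(0, 1) for _ in range(n)]
        se = (statistics.pvariance(a) / n + statistics.pvariance(b) / n) ** 0.5
        z = (statistics.mean(a) - statistics.mean(b)) / se
        if abs(z) > threshold:
            false_positives += 1
    return false_positives / trials

print(aa_test())  # should hover around 0.05 - the significance level itself
```

If the false-positive rate is far from the nominal significance level, something in the experiment pipeline is broken, which is exactly what an A\A test checks.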
Intro to into – easier conversion between somewhat complex data types in Python.
Reading this I was also exposed
Getting started with Spark in Python – a very, very clear tutorial covering all the required steps to get started. I can’t wait to find a good enough excuse to work with Spark.
Fuzzy – Fast Python phonetic algorithms. Nothing new or too fancy, just came across it this week and found it useful.
Typo Distance – finds the typo distance between two strings. It uses a QWERTY layout but you can configure a different layout pretty easily. The algorithm is quite heavy and time consuming, and there is some room for improvement (although it is not actively maintained) – for example, adding a max parameter which stops the computation once the typo distance exceeds the allowed distance.
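The suggested max parameter could look like the early exit below. Note this is a plain Levenshtein sketch of the idea, not Typo Distance’s keyboard-weighted cost model:

```python
def bounded_edit_distance(s, t, max_dist):
    """Levenshtein distance with an early exit: once every cell in the
    current row exceeds max_dist, the final distance cannot come back
    below it, so we can stop and return max_dist + 1."""
    if abs(len(s) - len(t)) > max_dist:
        return max_dist + 1
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        if min(curr) > max_dist:
            return max_dist + 1
        prev = curr
    return prev[-1]

print(bounded_edit_distance("kitten", "sitting", 5))  # 3
print(bounded_edit_distance("abc", "xyz", 1))         # 2, i.e. "more than 1"
```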

Topy – a Python script to fix typos in text, using rule sets developed by the RegExTypoFix project from Wikipedia. The basic rule set is English but other rule sets are also available. Having tried it, I’m positive about it, but it is not baked \ mature enough yet and I would like it to be easier to use from code rather than only as a command line tool.

https://github.com/intgr/topy
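The underlying idea – a list of (regex, replacement) rules applied in order – can be sketched with the standard library. The rules below are toy examples of mine, not taken from the actual RegExTypoFix rule set:

```python
import re

# Toy (pattern, replacement) rules in the spirit of RegExTypoFix:
RULES = [
    (re.compile(r"\bteh\b"), "the"),
    (re.compile(r"\brecieve\b"), "receive"),
    (re.compile(r"\boccured\b"), "occurred"),
]

def fix_typos(text):
    # apply each rule in order; later rules see earlier substitutions
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

print(fix_typos("I recieve teh mail"))  # I receive the mail
```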

5 interesting things (03/02/2015)

spaCy – yet another Python library for text processing? Maybe. I haven’t tried it, but it seems like another tool taking part in this growing world. I’m looking for a package that will do a good job on short texts such as tweets and so on.
Moto – not what you thought.. not short for automotive but mocking + boto. “A library that allows your python tests to easily mock out the boto library”. Looks very neat but still has some way to go.
Nothing like CLI – examples showing that for certain use cases command line tools can be much faster than a Hadoop cluster.
Self hell – elegance in code although I’m not sure how efficient it is.
NLP made visual with Neo4j – the concept is cool – creating an interesting visualization with a tool that is not the trivial \ designated one for the job. I doubt how much it scales.

5 interesting things (04/01/2015)

Is Wikipedia a microcosm of our world? News of 2014 as they are reflected in Wikipedia.

Json as python modules – making life simpler by making it possible to do “import x” for x.json.
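A minimal version of the trick can be built on Python 3’s import machinery: a meta-path finder that resolves `import x` to an x.json file in the working directory. This is my own sketch, not the linked project’s implementation, and the file name used in the comment is hypothetical:

```python
import importlib.abc
import importlib.machinery
import json
import os
import sys

class JsonLoader(importlib.abc.Loader):
    def __init__(self, path):
        self.path = path

    def create_module(self, spec):
        return None  # fall back to default module creation

    def exec_module(self, module):
        # expose the JSON object's top-level keys as module attributes
        with open(self.path) as f:
            module.__dict__.update(json.load(f))

class JsonFinder(importlib.abc.MetaPathFinder):
    def find_spec(self, name, path, target=None):
        candidate = os.path.join(os.getcwd(), name + ".json")
        if os.path.exists(candidate):
            return importlib.machinery.ModuleSpec(name, JsonLoader(candidate))
        return None  # let the regular import machinery handle it

sys.meta_path.append(JsonFinder())
# Now, given e.g. a cfg_demo.json in the working directory,
# `import cfg_demo` would expose its keys as attributes.
```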
 
Python the fast way – living on the edge with Python 🙂 Some hints on how to make your code run faster. Another possibility – checking different interpreters. It was nice to see the assembly commands, though I was expecting something more advanced.
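For a taste of the kind of hints such posts give, here is one classic micro-benchmark (my own example, not one taken from the post) using the stdlib timeit module:

```python
import timeit

# A classic hint: a list comprehension usually beats repeated .append
# in an explicit loop, since the loop body stays in C.
loop = timeit.timeit(
    "out = []\nfor i in range(1000):\n    out.append(i * 2)", number=500)
comp = timeit.timeit("[i * 2 for i in range(1000)]", number=500)
print(f"loop: {loop:.4f}s  comprehension: {comp:.4f}s")
```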

Data Science Ontology – just having fun with d3.js.

Nips Experiment – there was relatively a lot of chatter regarding the experiment done by the NIPS committee. Nonetheless, it creates a very uncomfortable feeling regarding committees.

5 interesting things (14/12/2014)

Twitter analytics with Spark – I really enjoyed this post for several reasons. First, the need and the way that led to the solution are clear and well explained. Second, it uses a novel approach, mining a social network for non-trivial uses. Moreover, querying the app yields very nice results, which is not trivial. I missed some reasoning \ explanation in the app, i.e. a better understanding of the connections between two organizations. I also wonder about the velocity at which the relations change. The technical part was interesting as well, although I haven’t gotten to work with Spark so far.
 
Teaching creatively – I really love to teach, and I try to adjust the methods, content and examples to the audience, to be as creative as I can, and to be passionate about the content. I believe that this way the audience will remember what I talked about. Being hands-on also helps to remember and understand. This post, which demonstrates creativity on so many levels, is amazing and very inspiring.

Databases compression – summarizes the high level and the things to know about each database very well. A good starting point when evaluating several solutions.
 
Experiments at AirBnB – although I usually prefer to read about private people that did something cool with data, I like AirBnB’s data blog as there is always something interesting to read about. This time it is about experiments, the very trendy “AB testing”.
 
DevOps bookmarks – devops tools and frameworks. An aggregator; reminds me of the awesome-* lists, but a very tempting implementation.

Common Crawl meetup

Yesterday I attended the Big Data Beers meetup. The meetup included two talks by Common Crawl employees – Lisa Green and Stephen Merity. Both talks were great and the connection between them was empowering.
The meetup was sponsored by Data Artisans, who are working on Apache Flink. Too bad I don’t have time to go to their meetup today.
What is Common Crawl?
Common Crawl is an NGO that makes web data accessible to everyone at little or no cost. They crawl the web and release a monthly build which is stored in AWS S3 under the public data sets. They respect robots.txt and nofollow flags and basically try to be good citizens of the internet cosmos. As Lisa Green said in her talk – they believe that the web is “a digital copy of our world” and the greatest data set, and their mission is to make it available to everyone.
Technicalities
  • Monthly build (they currently prefer bigger monthly builds over more frequent builds).
  • The latest build was 220TB with crawl data of 2.98 billion web pages.
  • The data includes 3 types of files –
    • WARC files with the raw crawl data
    • WAT files which include the metadata for the data stored in the WARC (about 1/3 of the raw crawl data)
    • WET files which hold the plaintext from the data stored in the WARC (about 15% of the raw crawl data).
  • The delay between the publication of a page and its crawl \ public time is approximately one to one and a half months.
  • No incremental dumps are planned at the moment.
  • The data is currently skewed towards dot-com domains. They plan to improve this; changes will hopefully be seen in the January dump.
  • They crawl using Apache Nutch, an open source web crawler, and cooperate with Blekko in order to avoid spam.
Common Crawl encourages researchers, universities and commercial companies to use their data. If you ask politely they will even grant you some Amazon credit.
The Talks
Lisa Green talked about the general idea of open data (government data, commercial data, web data) and gave some examples of using open data in general. The example I liked the most – using Orange cell phone data to identify Ebola spreading patterns, see here. This was a very inspiring introduction to Stephen Merity’s more technical talk.
Stephen Merity spoke about the more technical parts and gave amazing examples of how to do computations both fast and cheap (spot instances rock). He showed interesting data about computing PageRank on the entire Common Crawl corpus, some NLP stuff and other interesting insights about their data.
Another relevant talk in the area is Jordan Mendelson’s talk from Berlin Buzzwords – “Big Data for Cheapskates” (if you are in a hurry, start from the 18th minute).
Slides are available in –

What can you do with Common Crawl

Treating it as a data set there is a lot to explore –
1. Language detection – train language detection models, including for specific domains.
2. Named Entity Recognition.
3. Investigate the relations between different domains and the web structure – identify competitors, PageRank, etc.
4. Investigate the relations between different technologies – which JS libraries appear together, changes of technology usage over time.
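Most of these start with iterating over the WET (plaintext) dumps. Here is a toy record splitter of mine; the sample string stands in for a real WET file (which would be gzipped and far larger), and the format handling is deliberately simplified:

```python
# Minimal, simplified WET record splitting; `sample` stands in for the
# decompressed contents of a real Common Crawl WET file.
sample = (
    "WARC/1.0\r\nWARC-Type: conversion\r\n"
    "WARC-Target-URI: http://example.com/\r\n\r\nHello crawl\r\n\r\n"
    "WARC/1.0\r\nWARC-Type: conversion\r\n"
    "WARC-Target-URI: http://example.org/\r\n\r\nSecond page\r\n\r\n"
)

def wet_records(text):
    # records start with a "WARC/1.0" header block, separated from the
    # plaintext body by a blank line
    for chunk in text.split("WARC/1.0")[1:]:
        headers, _, body = chunk.partition("\r\n\r\n")
        uri = next((line.split(": ", 1)[1]
                    for line in headers.splitlines()
                    if line.startswith("WARC-Target-URI")), None)
        yield uri, body.strip()

for uri, body in wet_records(sample):
    print(uri, "->", body)
```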

5 interesting things (06/11/2014)

Geeks’ pleasure – fooling math: creating two images (and actually any type of document) with the same md5 hash.
Django vs Flask vs Pyramid – Python has a great open source community which is growing rapidly. One of the advantages is having several solutions to the same or nearby problems. This post compares 3 well known Python web frameworks – Django, Flask and Pyramid.
From my point of view, having worked with both Django and Flask: Flask is like riding a motorcycle while Django is a tank. Django is more tightly coupled with a SQL backend and the relevant dependencies and plugins, while Flask allows quick, light functionality.
The Science of Crawl – partly related to a project I’m currently doing. These two posts concern problems that everyone who has indexed and \ or crawled data has faced.
The invisible wall – 25 years later and the invisible wall still separates east from west in Germany. Besides me living in Berlin, the visualizations convey the point very well.
http://www.washingtonpost.com/blogs/worldviews/wp/2014/10/31/the-berlin-wall-fell-25-years-ago-but-germany-is-still-divided/

Toolz – we all know the term design patterns; Toolz provides implementation patterns – everyday utilities every developer needs for iterators, dictionaries, etc. Two links – a blog post with common use cases and the Toolz documentation.

http://matthewrocklin.com/blog/work/2014/07/04/Streaming-Analytics/

http://toolz.readthedocs.org/en/latest/
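For readers without toolz installed, two of its everyday utilities map closely onto the standard library. These are stdlib near-equivalents of toolz.frequencies and toolz.groupby, not toolz itself:

```python
from collections import Counter
from itertools import groupby

# toolz.frequencies is roughly collections.Counter:
words = "the quick fox jumps over the lazy dog the".split()
print(Counter(words).most_common(1))  # [('the', 3)]

# toolz.groupby(len, names) is roughly itertools.groupby on sorted input
# (unlike toolz, itertools.groupby requires the input sorted by the key):
names = sorted(["Alice", "Bob", "Dan", "Edith"], key=len)
by_length = {k: list(g) for k, g in groupby(names, key=len)}
print(by_length)  # {3: ['Bob', 'Dan'], 5: ['Alice', 'Edith']}
```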

Running my first EMR – lessons learned

Today I ran my first EMR job flow; here are a few lessons I learned during the day. I had previously run Hadoop streaming MapReduce, so I was familiar with the MapReduce state of mind, but I was not familiar with the EMR environment.

I used boto – Amazon’s official Python interface.

1. AMI version – the default AMI version is 1.0.0, the first release. This means the following specifications –

Operating system: Debian 5.0 (Lenny)

Applications: Hadoop 0.20 and 0.18 (default); Hive 0.5, 0.7 (default), 0.7.1; Pig 0.3 (on Hadoop 0.18), 0.6 (on Hadoop 0.20)

Languages: Perl 5.10.0, PHP 5.2.6, Python 2.5.2, R 2.7.1, Ruby 1.8.7

File system: ext3 for root and ephemeral

Kernel: Red Hat

For me Python 2.5.2 means –
  • No json module – json is new in version 2.6.
  • collections exists since Python 2.4, but not all of its members were there from the start –
    • namedtuple() – factory function for creating tuple subclasses with named fields (new in 2.6)
    • deque – list-like container with fast appends and pops on either end (new in 2.4)
    • Counter – dict subclass for counting hashable objects (new in 2.7)
    • OrderedDict – dict subclass that remembers the order entries were added (new in 2.7)
    • defaultdict – dict subclass that calls a factory function to supply missing values (new in 2.5)

 
Therefore, specifying the ami_version parameter can be critical. Version 2.2.0 worked fine for me.
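A common way to survive such version gaps is guarded imports. The sketch below runs on any modern Python (where the fallbacks are never hit); simplejson is mentioned only as the usual pre-2.6 fallback package:

```python
try:
    import json
except ImportError:  # Python < 2.6 shipped without a json module
    import simplejson as json  # the usual third-party fallback back then

try:
    from collections import namedtuple
except ImportError:  # namedtuple arrived in 2.6
    namedtuple = None  # a vendored backport would go here

Point = namedtuple("Point", "x y")
print(json.dumps(Point(3, 4)._asdict()))  # {"x": 3, "y": 4}
```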
2. You must process all the input!
Naturally we want to process all the input. However, for testing I went over only the first n lines and then added a break to make things run faster. I was not consuming all the lines and therefore got an error. More about it here –
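The fix boils down to draining the stream instead of breaking out of it. A word-count mapper sketch of mine (not my original job’s code):

```python
import io
import sys

def mapper(stream, limit=None):
    """Word-count mapper that only processes the first `limit` lines but
    still consumes the rest of the stream - `continue` instead of `break`
    is the whole trick, so the framework sees all the input as read."""
    for i, line in enumerate(stream):
        if limit is not None and i >= limit:
            continue  # drain remaining lines instead of breaking out
        for word in line.split():
            sys.stdout.write(word + "\t1\n")

# In a real streaming step this would be: mapper(sys.stdin, limit=100)
mapper(io.StringIO("to be or not\nto be\n"), limit=1)
```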
3. The output folder must not exist. This is the same as in Hadoop streaming MapReduce; for me the way to avoid it was to add a timestamp –

output="s3n://<my-bucket>/output/"+str(int(time.time()))

4. Why did my process fail – one option which produces a relatively understandable explanation is – conn.describe_jobflow(jobid).laststatechangereason

5. cache_files – enables you to import files you need for the map reduce process. It is super important to “specify a fragment”, i.e. specify the local file name –

cache_files=['s3n://<file-location>/<file-name>#<local-file-name>']
Otherwise you will obtain the following error –
“Streaming cacheFile and cacheArchive must specify a fragment”
6. Status – there are 7 different statuses your flow may have – COMPLETED, FAILED, TERMINATED, RUNNING, SHUTTING_DOWN, STARTING and WAITING. The right order of statuses, if everything goes well, is STARTING -> RUNNING -> SHUTTING_DOWN -> COMPLETED.
The SHUTTING_DOWN stage may take a while; even for a very simple flow I measured about 1 minute of SHUTTING_DOWN.
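A polling loop over those statuses might look like the sketch below. The get_state callable is stubbed here; with boto it would wrap something like conn.describe_jobflow(jobid).state (an assumption on my side – check the boto docs for the exact attribute):

```python
import time

# The happy-path ordering from the post, plus the terminal states:
HAPPY_PATH = ["STARTING", "RUNNING", "SHUTTING_DOWN", "COMPLETED"]
TERMINAL = {"COMPLETED", "FAILED", "TERMINATED"}

def wait_for_jobflow(get_state, poll_seconds=30):
    """Poll a state-returning callable until a terminal state is reached."""
    while True:
        state = get_state()
        if state in TERMINAL:
            return state
        time.sleep(poll_seconds)

# Stubbed demo - no AWS involved:
states = iter(HAPPY_PATH)
print(wait_for_jobflow(lambda: next(states), poll_seconds=0))  # COMPLETED
```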
Resources I used –

http://boto.readthedocs.org/en/latest/emr_tut.html

5 interesting things (04/09/2014)

C3.js – I have previously written a post about the importance of visualization in the skill set of a data scientist. C3.js is a JavaScript chart library based on d3.js which seems, at least at a glimpse, to be simple and intuitive. I would like to see a Python client for it, but that is for the future to come.

http://c3js.org/

nvd3 also does something similar – charts based on d3.js – and it also has a Python client which I have worked with a bit. Comparing the two, c3.js seems a little bit more mature than nvd3, ignoring the lack of a Python client, but I’m sure that gap will be filled soon.

nvd3.org

Harvard Visualization course – I went through some of the slides and it was fascinating, but what is even more exciting is the great collection of links about visualization examples, theory and tools. Great work.

http://www.cs171.org/#!index.md

textract – I needed high flexibility of input types in a project I’m working on, and of course I wanted to handle it as transparently as possible without locking myself to all the relevant packages or adjusting my code to the API of each package. Fortunately somebody already did it – textract. The package is not perfect and there are some “glitches”, mostly concerning the parsers themselves (line splitting, non-ascii, etc.) and not the unified API textract provides. However, it is a very good start.

http://textract.readthedocs.org/en/latest/

Visualizing Garbage Collection Algorithms – both a very cool visualization and good explanations. Design wise I think the visualization should be larger, but the concept itself is very neat.

http://spin.atomicobject.com/2014/09/03/visualizing-garbage-collection-algorithms/

SmartCSV – makes CSV reading more structured by defining a model and validating it while reading. It enables skipping rows (and soon also skipping columns). It is an ongoing project and feature requests and issues are currently addressed quickly.
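The model-plus-validation idea can be approximated with the stdlib csv module. The MODEL dict and validators below are my own toy example, not SmartCSV’s actual API:

```python
import csv
import io

# Hypothetical per-column validators, loosely in the spirit of SmartCSV:
MODEL = {"name": str.isalpha, "age": str.isdigit}

def read_valid_rows(fileobj, model=MODEL, skip_invalid=True):
    """Yield only rows whose columns pass their validators."""
    for row in csv.DictReader(fileobj):
        if all(check(row[col]) for col, check in model.items()):
            yield row
        elif not skip_invalid:
            raise ValueError("invalid row: %r" % (row,))

data = io.StringIO("name,age\nAlice,30\nBob,abc\n")
print(list(read_valid_rows(data)))  # only the Alice row survives
```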

 

5 interesting things (26/08/2014)

Gooey – command line to application! Very cool. Works by mapping ArgumentParser to GUI objects, and it makes it easier to build instant tools to play around the office. Future tasks include supporting more themes, different argument parsers (e.g. docopt http://docopt.org/), etc. I believe there is much potential in this project.

https://github.com/chriskiehl/Gooey
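The ArgumentParser definitions Gooey introspects are plain argparse. The parser below is a toy example of mine, with the @Gooey decorator omitted so the sketch runs without the library installed:

```python
import argparse

def build_parser():
    """The kind of parser Gooey introspects to build its GUI."""
    parser = argparse.ArgumentParser(description="Toy tool")
    parser.add_argument("input_file")
    parser.add_argument("--verbose", action="store_true")
    return parser

args = build_parser().parse_args(["data.txt", "--verbose"])
print(args.input_file, args.verbose)  # data.txt True
```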

Datamash – a stats command line tool. I like working from the command line, at least to begin with, to feel the data before doing complex things, and this tool is really what I need. However, the documentation is missing \ not clear enough \ open source style.

http://www.gnu.org/software/datamash/

In that sense, q, http://harelba.github.io/q/, is also very cool and answers the need to do things within the same framework across different tasks.

Getting started with Python internals – how to dive into the deep water. A very enriching post which also includes links to other interesting posts.

http://akaptur.github.io/blog/2014/08/03/getting-started-with-python-internals/

python-ftfy – unicode strings can be very painful when they include special characters such as umlauts (ö), html tags (&gt;) and everything else users can think of :). The goal of this package is to at least partly ease this pain.

https://github.com/LuminosoInsight/python-ftfy
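A taste of the problem ftfy solves: the most common mojibake is UTF-8 bytes mis-decoded as Latin-1, which can sometimes be reversed by hand (ftfy detects and fixes this automatically and far more robustly):

```python
import html

# UTF-8 bytes mis-decoded as Latin-1 turn "ö" into "Ã¶":
broken = "SchrÃ¶dinger &gt; cat"

fixed = broken.encode("latin-1").decode("utf-8")  # undo the mis-decoding
fixed = html.unescape(fixed)                      # and the stray HTML entity
print(fixed)  # Schrödinger > cat
```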

Fraud detection: Lyft vs Uber – I think the interesting part here is the data visualization as a tool to understand the problem better.