Common Crawl meetup

Yesterday I attended the Big Data Beers meetup. It included two talks by Common Crawl employees, Lisa Green and Stephen Merity. Both talks were great, and the way they connected to each other was empowering.
The meetup was sponsored by Data Artisans, who are working on Apache Flink. Too bad I don’t have time to go to their meetup today.
What is Common Crawl?
Common Crawl is an NGO that makes web data accessible to everyone at little or no cost. They crawl the web and release a monthly build, which is stored in AWS S3 under public data sets. They respect robots exclusion rules and nofollow flags, and basically try to be good citizens in the internet cosmos. As Lisa Green said in her talk, they believe that the web is “a digital copy of our world” and the greatest data set, and their mission is to make it available to everyone.
Technicalities
  • Monthly build (they currently prefer bigger monthly builds over more frequent builds)
  • The latest build was 220 TB, with crawl data for 2.98 billion web pages.
  • The data includes 3 types of files –
    • WARC files of the raw crawl data
    • WAT files which include the metadata for the data stored in the WARC (about 1/3 of the raw crawl data)
    • WET files which hold the plaintext from the data stored in the WARC (about 15% of the raw crawl data).
  • The delay between a page's publication and its crawl / public release is approximately a month to a month and a half.
  • No incremental dumps are planned at the moment.
  • The data is currently skewed toward .com domains. They plan to improve this, and the changes will hopefully be visible in the January dump.
  • They crawl using Apache Nutch, an open source web crawler, and cooperate with Blekko in order to avoid spam.
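To get a feel for the three file types, here is a minimal sketch of reading WARC-style records (WAT and WET files use the same record layout). It assumes the plain WARC 1.0 structure: a version line, header lines, a blank line, then a payload of `Content-Length` bytes. The sample records, `make_record` and `iter_warc_records` are made up for illustration. Real Common Crawl files are gzip-compressed, and a dedicated WARC library is probably a more robust choice.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) pairs from a binary stream of WARC-style records."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload length is given in bytes by the Content-Length header.
        length = int(headers.get("Content-Length", 0))
        yield headers, stream.read(length)


def make_record(uri, text):
    """Build a hypothetical WET-like record for the demo below."""
    body = text.encode("utf-8")
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: conversion\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


sample = make_record("http://example.com/", "Hello plain text") + make_record(
    "http://example.org/", "Second page"
)
records = list(iter_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Target-URI"], len(payload))
```

Reading the payload by length rather than scanning for a delimiter matters because page content can itself contain blank lines.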
Common Crawl encourages researchers, universities and commercial companies to use their data. If you ask politely they will even grant you some Amazon credit.
The Talks
Lisa Green talked about the general idea of open data (government data, commercial data, web data) and gave some examples of using open data in general. The example I liked the most was using Orange cell phone data to identify Ebola spreading patterns, see here. This was a very inspiring introduction to Stephen Merity's more technical talk.
Stephen Merity spoke about the more technical parts and gave some amazing examples of how to do computations both fast and cheap (spot instances rock). He showed interesting data about computing PageRank on the entire Common Crawl data, some NLP stuff and other interesting insights about their data.
Another relevant talk in the area is Jordan Mendelson's talk from Berlin Buzzwords, “Big Data for Cheapskates” (if you are in a hurry, start from the 18th minute).
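For readers who have not seen PageRank before, here is a toy sketch of the textbook power-iteration version on a tiny link graph. This is not the code from the talk; the domains, the damping factor of 0.85 and the iteration count are illustrative assumptions, and running it on the full crawl would need a distributed setup.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Textbook power-iteration PageRank over a dict of node -> outgoing links."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph between four domains.
links = {
    "a.com": ["b.com"],
    "b.com": ["a.com"],
    "c.com": ["a.com"],
    "d.com": ["a.com", "b.com"],
}
ranks = pagerank(links)
for domain, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{domain}: {score:.4f}")
```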
Slides are available in –

What can you do with Common Crawl

Treated as a data set, there is a lot to explore –
1. Language detection – train language detection models on it, for example for specific domains.
2. Named Entity Recognition.
3. Investigate the relations between different domains and the web structure – identify competitors, page rank, etc.
4. Investigate the relations between different technologies – which JavaScript libraries appear together, changes in technology usage over time.
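As a small illustration of the last idea, here is a sketch that counts which JavaScript libraries appear together on the same page. It assumes you have already extracted a list of library names per page (for example from script tags); the `pages` data and the `cooccurrence` helper are made up for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(pages):
    """Count how often each pair of libraries appears on the same page."""
    counts = Counter()
    for libs in pages:
        # Deduplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(libs)), 2):
            counts[pair] += 1
    return counts


# Hypothetical per-page library lists.
pages = [
    ["jquery", "bootstrap"],
    ["jquery", "underscore", "backbone"],
    ["bootstrap", "jquery"],
]
pair_counts = cooccurrence(pages)
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

Running the same count per monthly build would give a rough view of how technology usage changes over time.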

5 interesting things (06/11/2014)

Geeks' pleasure – fooling math. Creating two images (and actually any type of document) with the same MD5 hash.
Django vs Flask vs Pyramid – Python has a great open source community which is growing rapidly. One of the advantages is having several solutions to the same or to nearby problems. This post compares 3 well-known Python web frameworks – Django, Flask and Pyramid.
From my point of view, having worked with Django and Flask: Flask is like riding a motorcycle while Django is a tank. Django is more tightly coupled with a SQL backend and the relevant dependencies and plugins, while Flask allows quick, light functionality.
The Science of Crawl – half related to a project I am currently working on. Those two posts deal with problems that everyone who has indexed and / or crawled data has faced.
The invisible wall – 25 years later, the invisible wall still separates east from west in Germany. Beyond the personal interest of living in Berlin, the visualizations convey the point very well.
http://www.washingtonpost.com/blogs/worldviews/wp/2014/10/31/the-berlin-wall-fell-25-years-ago-but-germany-is-still-divided/

Toolz – we all know the term design patterns. Toolz provides implementation patterns: everyday utilities every developer needs for iterators, dictionaries, etc. Two links – a blog post with common use cases and the Toolz documentation.

http://matthewrocklin.com/blog/work/2014/07/04/Streaming-Analytics/

http://toolz.readthedocs.org/en/latest/