Common Crawl meetup

Yesterday I attended the Big Data Beers meetup. The meetup included 2 talks by Common Crawl employees, Lisa Green and Stephen Merity. Both talks were great and the connection between them was empowering.
The meetup was sponsored by Data Artisans, who are working on Apache Flink. Too bad I don’t have time to go to their meetup today.
What is Common Crawl?
Common Crawl is an NGO that makes web data accessible to everyone at little or no cost. They crawl the web and release a monthly build, which is stored in AWS S3 under public data sets. They respect robots exclusion rules and nofollow flags and basically try to be good citizens in the internet cosmos. As Lisa Green said in her talk, they believe that the web is “a digital copy of our world” and the greatest data set, and their mission is to make it available to everyone.
  • Monthly builds (they currently prefer bigger monthly builds over more frequent builds).
  • The latest build was 220TB, with crawl data of 2.98 billion web pages.
  • The data includes 3 types of files:
    • WARC files of the raw crawl data
    • WAT files, which hold the metadata for the data stored in the WARC files (about 1/3 of the raw crawl data)
    • WET files, which hold the plaintext from the data stored in the WARC files (about 15% of the raw crawl data).
  • The delay between the publication of a page and its crawl / public release is roughly a month to a month and a half.
  • No incremental dumps are planned at the moment.
  • The data is currently skewed toward .com domains. They plan to improve this, and the changes will hopefully be visible in the January dump.
  • They crawl using Apache Nutch, an open source web crawler, and cooperate with Blekko in order to avoid spam.
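To get a feeling for the numbers in the list above, here is a small back-of-the-envelope sketch. It assumes (my assumption, not something stated in the talk) that the 220TB figure refers to the raw WARC data and that 1TB is 10^12 bytes; the function name is made up for illustration.

```python
# Figures quoted in the list above.
BUILD_SIZE_TB = 220       # assumed here to be the raw WARC data
PAGES = 2.98e9            # crawled web pages in the build
WAT_FRACTION = 1 / 3      # WAT metadata, "about 1/3 of the raw crawl data"
WET_FRACTION = 0.15       # WET plaintext, "about 15% of the raw crawl data"


def estimate_sizes(build_tb=BUILD_SIZE_TB, pages=PAGES):
    """Return rough size estimates derived from the quoted figures."""
    wat_tb = build_tb * WAT_FRACTION
    wet_tb = build_tb * WET_FRACTION
    # 1 TB is taken as 10**12 bytes and 1 KB as 10**3 bytes.
    kb_per_page = build_tb * 1e12 / pages / 1e3
    return {"wat_tb": wat_tb, "wet_tb": wet_tb, "kb_per_page": kb_per_page}


if __name__ == "__main__":
    for name, value in estimate_sizes().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions the WAT files come to roughly 73TB, the WET files to 33TB, and an average page takes about 74KB of raw crawl data. If all you need is the text, working on the WET files means reading a fraction of the full build.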
Common Crawl encourages researchers, universities and commercial companies to use their data. If you ask politely they will even grant you some Amazon credit.
The Talks
Lisa Green talked about the general idea of open data (government data, commercial data, web data) and gave some examples of using open data in general. The example I liked the most was using Orange cell phone data to identify Ebola spreading patterns, see here. This was a very inspiring introduction to Stephen Merity's more technical talk.
Stephen Merity spoke about the more technical parts and gave some amazing examples of how to do computations that are both fast and cheap (spot instances rock). He showed interesting data about computing PageRank on the entire Common Crawl data, some NLP stuff and other interesting insights about their data.
Another relevant talk in this area is Jordan Mendelson's talk from Berlin Buzzwords, “Big Data for Cheapskates” (if you are in a hurry, start from the 18th minute).
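For readers who have not seen PageRank before, here is a minimal sketch of the idea on a toy graph. This is not the code from the talk: it is the textbook power iteration, and the domain names, the damping factor of 0.85 and the iteration count are my own illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank.

    links maps each node to the list of nodes it links to.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node gets the "random jump" share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy link graph between domains.
TOY_LINKS = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com"],
    "d.com": ["c.com"],
}

if __name__ == "__main__":
    ranks = pagerank(TOY_LINKS)
    for domain in sorted(ranks, key=ranks.get, reverse=True):
        print(f"{domain}: {ranks[domain]:.3f}")
```

On this toy graph c.com ends up on top, since three of the four domains link to it. Doing the same on billions of pages is a very different engineering problem, which is where the cheap spot instances come in.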
Slides are available in –

What can you do with Common Crawl

Treating it as a data set, there is a lot to explore –
1. Language detection – train language detection models, for example for specific domains.
2. Named Entity Recognition.
3. Investigate the relations between different domains and the web structure – identify competitors, PageRank, etc.
4. Investigate the relations between different technologies – which JS libraries appear together, changes in technology usage over time.
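As a taste of the last idea, here is a minimal sketch of counting which JS libraries appear together. It assumes you have already extracted a list of library names per page (for example from script tags); that extraction step is not shown, and the sample data below is made up.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(pages):
    """Count how many pages each pair of libraries appears on together.

    pages is an iterable of per-page lists of library names.
    """
    counts = Counter()
    for libs in pages:
        # Deduplicate and sort so each pair has one canonical form.
        for pair in combinations(sorted(set(libs)), 2):
            counts[pair] += 1
    return counts


# Hypothetical sample: the libraries detected on four pages.
SAMPLE_PAGES = [
    ["jquery", "bootstrap"],
    ["jquery", "bootstrap", "d3"],
    ["angular", "bootstrap"],
    ["jquery"],
]

if __name__ == "__main__":
    for pair, count in cooccurrence(SAMPLE_PAGES).most_common():
        print(pair, count)
```

Running the same count per monthly build would give the usage-over-time view as well.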

Andrew Ng and PyStruct meetup

Yesterday I attended the “Andrew Ng and pyStruct” meetup.

I was lucky enough to get a place at the meetup thanks to the Germany-Portugal game that took place at the same time 🙂

The first part, by Andrew Ng, was a video meetup shared between 3 locations: Paris, Zurich and Berlin. Andrew Ng is a co-founder of Coursera and a machine learning guru. He teaches the ML course on Coursera, which is one of the most popular courses there (I took it myself and it is a very good, structured introduction to machine learning; a new session started yesterday). He teaches at Stanford and will soon be leaving for Baidu research.

The talk included 15-20 minutes of introduction to deep learning, recent results, applications and challenges. He mainly focused on scaling up deep learning algorithms to use billions of features / properties. The rest of the talk was a Q&A session, mostly about the theoretical aspects of deep learning, future challenges, etc. For me, one of the most important things he said was “innovation is a result of team work”.

Some well-known applications of deep learning are speech recognition, image processing, etc.

In the end he suggested taking the Stanford deep learning tutorial –

T.R – I currently know of 2 Python packages that deal with deep learning –

The next talk was given by Andreas Mueller. You can find his slides here.

Mueller introduced structured prediction, which is a natural extension or generalization of regression problems to a structured output rather than just a number / class. Structured learning has an advantage over other supervised learning algorithms, as it can learn several properties at once and use the correlations between those properties.

Example: customer data with several properties per customer, such as gender, marital status, has children, owns a car, etc. One can guess that the married and has-children properties are highly correlated, and that learning those properties together gives a better chance of getting good results. It is better than LDA in the sense that it has fewer classes (not every combination of the variables is a class) and it requires less training data.
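To make the customer example concrete, here is a tiny sketch of what "using the correlations" means at prediction time. It is not PyStruct code and not from the talk: the scores are made up, and the inference is a brute-force search over all label combinations, which only works for a handful of properties.

```python
from itertools import product

# Hypothetical per-property scores for one customer (higher is better).
UNARY = {
    "married": {True: 1.0, False: 0.0},
    "has_children": {True: 0.4, False: 0.5},
}

# Hypothetical compatibility scores: married and has_children tend to agree.
# Value pairs that are not listed get a score of 0.0.
PAIRWISE = {
    ("married", "has_children"): {
        (True, True): 0.5,
        (False, False): 0.5,
    },
}


def independent_prediction(unary):
    """Pick the best value of each property on its own."""
    return {var: max(scores, key=scores.get) for var, scores in unary.items()}


def joint_prediction(unary, pairwise):
    """Pick the assignment with the best total (unary + pairwise) score."""
    variables = list(unary)
    best, best_score = None, float("-inf")
    for values in product(*(unary[v] for v in variables)):
        assignment = dict(zip(variables, values))
        score = sum(unary[v][assignment[v]] for v in variables)
        for (a, b), table in pairwise.items():
            score += table.get((assignment[a], assignment[b]), 0.0)
        if score > best_score:
            best, best_score = assignment, score
    return best


if __name__ == "__main__":
    print("independent:", independent_prediction(UNARY))
    print("joint:      ", joint_prediction(UNARY, PAIRWISE))
```

Looked at separately, the has-children score slightly favours "no". Once the correlation with "married" is taken into account, the joint prediction flips to "yes". Note also that treating every combination of k binary properties as its own class would need 2^k classes, while the structured model only needs scores per property and per pair.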

Other examples include pixel classification (assigning each pixel to an object in the image), OCR, etc.

He then talked about PyStruct, a Python package for structured prediction. There is not much to add beyond what is already written in the documentation.