What is Common Crawl?
Common Crawl is an NGO that makes web data accessible to everyone at little or no cost. They crawl the web and release a monthly build, which is stored in AWS S3 as a public data set. They respect robots exclusion rules and nofollow flags, and generally try to be good citizens of the internet cosmos. As Lisa Green said in her talk, they believe that the web is “a digital copy of our world” and the greatest data set, and their mission is to make it available to everyone.
- Monthly build (they currently prefer bigger monthly builds over more frequent builds).
- Latest build was 220TB with crawl data of 2.98 billion web pages.
- The data includes 3 types of files:
- WARC files of the raw crawl data
- WAT files which include the metadata for the data stored in the WARC (about 1/3 of the raw crawl data)
- WET files which hold the plaintext from the data stored in the WARC (about 15% of the raw crawl data).
- The delay between a page's publication and the crawl / public release is approximately a month to a month and a half.
- No incremental dumps are planned at the moment.
- The data is currently skewed toward .com domains. They plan to improve this, and the changes will hopefully be visible in the January dump.
- They crawl using Apache Nutch, an open source web crawler, and cooperate with Blekko in order to avoid spam.
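To make the three file types listed above a bit more concrete, here is a minimal sketch of reading WARC-style records, assuming an uncompressed stream. WARC, WAT and WET files share the same record layout: a `WARC/1.0` version line, header lines, a blank line, then `Content-Length` bytes of payload. The helper names `read_warc_records` and `make_record` and the sample URLs are made up for illustration. The real files are gzip-compressed, so in practice the stream would probably come from something like `gzip.open` rather than an in-memory buffer.

```python
import io


def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC-style stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload is read by length, so blank lines inside it are safe.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def make_record(record_type, uri, body):
    """Build one hypothetical record, only to have sample data to parse."""
    body_bytes = body.encode("utf-8")
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {record_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body_bytes)}\r\n"
        "\r\n"
    )
    return head.encode("utf-8") + body_bytes + b"\r\n\r\n"


# Two WET-like records (plaintext extracted from a page), with made-up URLs.
sample = (
    make_record("conversion", "http://example.com/", "Hello plain text")
    + make_record("conversion", "http://example.org/a", "Second page")
)
records = list(read_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Target-URI"], len(payload))
```

For real work a dedicated WARC library is likely a better choice than hand-rolled parsing. The sketch is only meant to show that a record is a small set of headers plus a length-prefixed payload.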
Common Crawl encourages researchers, universities and commercial companies to use their data. If you ask politely they will even grant you some Amazon credit.
Lisa Green talked about the general idea of open data (government data, commercial data, web data) and gave some examples of using open data in general. The example I liked the most: using Orange cell phone data to identify Ebola spreading patterns, see here. This was a very inspiring introduction to Stephen Merity's more technical talk.
Stephen Merity spoke about the more technical parts and gave amazing examples of how to do computations both fast and cheaply (spot instances rock). He showed interesting data about computing PageRank on the entire Common Crawl data, some NLP stuff and other interesting insights about their data.
Another relevant talk in this area is Jordan Mendelson's talk from Berlin BuzzWords, “Big Data for Cheapskates” (if you are in a hurry, start from the 18th minute).
Slides are available in –
What can you do with Common Crawl
Treated as a data set, there is a lot to explore:
1. Language detection: train language detection models, for example for specific domains.
2. Named Entity Recognition.
3. Investigate the relations between different domains and the web structure: identify competitors, page rank, etc.
4. Investigate the relations between different technologies – which js libraries appear together, changes of technology usage over time.
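As a toy illustration of the page rank idea in item 3, here is a minimal sketch of PageRank by power iteration on a tiny domain-level link graph. This is the textbook algorithm, not the method shown in the talk, and it would not scale to the full crawl as written. The domain names, the `pagerank` function and its parameters are assumptions made up for the example. In a real run the link graph would have to be extracted from the crawl metadata first.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration over a dict of node -> outgoing links."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical domain-level link graph: domain -> domains it links to.
links = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com"],
    "d.com": ["c.com"],
}
ranks = pagerank(links)
for domain in sorted(ranks, key=ranks.get, reverse=True):
    print(domain, round(ranks[domain], 4))
```

The same dict-of-lists shape would also work for item 4, with JS libraries as nodes and "appear on the same page" as edges, although plain co-occurrence counts are probably a simpler starting point there.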