Common Crawl for NLP

It's a non-profit that makes web data freely accessible to anyone

Each crawl archive contains billions of pages:

The October crawl archive is 3.72 billion web pages
(~254 terabytes uncompressed)

Released totally free, without additional intellectual property restrictions

Common Crawl File Formats

  • WARC
    + Raw HTTP response headers
    + Raw HTTP responses

  • WAT
    + HTML head data
    + HTTP header fields
    + Extracted links / script tags

  • WET
    + Extracted text
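
A minimal sketch of reading these formats, assuming the third-party warcio library and a locally downloaded segment (the filename below is a placeholder): WARC "response" records carry the raw HTTP responses listed above.

    from warcio.archiveiterator import ArchiveIterator

    # Iterate a downloaded Common Crawl WARC segment (placeholder filename)
    with open("CC-MAIN-example.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                url = record.rec_headers.get_header("WARC-Target-URI")
                html = record.content_stream().read()   # raw HTTP payload (HTML bytes)
                print(url, len(html))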

Common Crawl WET


WET (Web Extracted Text) is released as part of the crawl archive each month

The data aims to cover the widest range of use cases


No distinction between header / navigation / content:

  • Does not remove boilerplate
  • Does not re-format text as it appears in the browser
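
As a sketch of what this means in practice (again assuming warcio and a placeholder filename), the extracted text sits in WET "conversion" records and still contains header, navigation and footer text:

    from warcio.archiveiterator import ArchiveIterator

    # Iterate a downloaded WET file; each "conversion" record holds extracted text
    with open("CC-MAIN-example.warc.wet.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "conversion":
                url = record.rec_headers.get_header("WARC-Target-URI")
                text = record.content_stream().read().decode("utf-8", errors="replace")
                # `text` is raw extracted text: boilerplate is NOT removed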

Common Crawl for NLP


The web is largely unannotated,
so how are people using it for NLP?


(a) Use extracted text for unsupervised algorithms

(b) Filter it into a semi-annotated or annotated dataset
(big data ⇒ filter ⇒ curated smaller dataset)

Examples of Previous Work


Unsupervised Algorithms

+ N-gram & language models
+ GloVe: Global Vectors for Word Representation


    Filtering

    + Web tables for gazetteers
    + Dirt Cheap Web-Scale Parallel Text

      N-gram Counts & Language Models from the Common Crawl
      Christian Buck, Kenneth Heafield, Bas van Ooyen
      (Edinburgh, Stanford, Owlin BV)


      975 billion deduplicated tokens

      Improvement over Google N-grams:
      • Inclusion of low count entries
      • Deduplication to reduce boilerplate
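
      To make the counting step concrete, here is a toy illustration of order-3 n-gram counting over lines of text; the authors' actual pipeline is distributed and disk-based, so this is only a sketch of the idea.

        from collections import Counter

        def ngram_counts(lines, n=3):
            # Count n-grams (default trigrams) across whitespace-tokenised lines
            counts = Counter()
            for line in lines:
                tokens = line.split()
                for i in range(len(tokens) - n + 1):
                    counts[tuple(tokens[i:i + n])] += 1
            return counts

        counts = ngram_counts(["the cat sat on the mat", "the cat sat down"])
        print(counts[("the", "cat", "sat")])   # 2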

      N-gram Counts & Language Models from the Common Crawl



      "The advantages of structured text do not outweigh the extra computing power needed to process them."
      -- N-gram Counts & Language Models from the Common Crawl (Buck et al.)

      N-gram Counts & Language Models from the Common Crawl


      English (23TB), German (1.02TB), Spanish (986GB), French (912GB), Japanese (577GB), Russian (537GB), Polish (334GB), Italian (325GB) ...

      Only 0.14% of the corpus was Finnish, yet it still yielded a useful 47GB corpus.

      42 languages with >10GB

      73 languages with >1GB

      N-gram & Language Models


      Sentence-level deduplication removed 80% of the English corpus (lower for other languages),
      in line with Bergsma et al. (2010)


      Before preprocessing (English): 23.62 TB


      After preprocessing (English): 5.14 TB

      (59 billion lines, 975 billion tokens)
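
      A sketch of sentence-level deduplication as described above, assuming in-memory hashing of each normalised line is acceptable; the paper's pipeline streams far more data than fits in RAM, so treat this as illustrative only.

        import hashlib

        def dedup(lines):
            # Yield each line the first time its (stripped, lower-cased) form is seen
            seen = set()
            for line in lines:
                key = hashlib.sha1(line.strip().lower().encode("utf-8")).digest()
                if key not in seen:
                    seen.add(key)
                    yield line

        unique_lines = list(dedup(["Buy now!", "buy now!", "Some actual content."]))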

      N-gram & Language Models

      "...even though the web data is quite noisy even limited amounts give improvements."


      N-gram & Language Models


      Project data was released at
      http://statmt.org/ngrams

      • Raw text split by language

      • Deduped text split by language

      • Resulting language models

      GloVe: Global Vectors for Word Representation

      Jeffrey Pennington, Richard Socher, Christopher D. Manning


      Word vector representations:

      king - queen = man - woman

      king - man + woman = queen

      (produces dimensions of meaning)

      Unsupervised methods: perfect for Common Crawl
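
      The analogy arithmetic above can be checked directly with cosine similarity; this is an illustrative helper assuming `vectors` is a dict mapping words to NumPy arrays (e.g. loaded from the pre-trained GloVe files).

        import numpy as np

        def analogy(vectors, a, b, c, topn=1):
            # Return the word(s) whose vector is closest to a - b + c (cosine similarity)
            query = vectors[a] - vectors[b] + vectors[c]
            query = query / np.linalg.norm(query)
            scored = []
            for word, vec in vectors.items():
                if word in (a, b, c):
                    continue
                scored.append((float(vec @ query) / float(np.linalg.norm(vec)), word))
            return [word for _, word in sorted(scored, reverse=True)[:topn]]

        # analogy(vectors, "king", "man", "woman")  ->  ["queen"], ideally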



      GloVe: Global Vectors for Word Representation


      Trained on the non-zero entries of a
      global word-word co-occurrence matrix

      Populating the matrix requires a single pass over the corpus;
      subsequent training is far faster


      GloVe = O(|C|⁰⋅⁸)

      On-line window-based methods = O(|C|) 
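
      A minimal sketch of the co-occurrence statistic GloVe trains on: a symmetric context window with 1/distance weighting (the real implementation uses sparse, out-of-core storage, so this toy version is only illustrative).

        from collections import defaultdict

        def cooccurrence(tokens, window=5):
            # Accumulate distance-weighted word-word co-occurrence counts
            counts = defaultdict(float)
            for i, word in enumerate(tokens):
                for j in range(max(0, i - window), i):
                    context = tokens[j]
                    weight = 1.0 / (i - j)        # nearer context words count more
                    counts[(word, context)] += weight
                    counts[(context, word)] += weight
            return counts

        counts = cooccurrence("the cat sat on the mat".split())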

      GloVe On Various Corpora     

      • Semantic: "Athens is to Greece as Berlin is to _?"
      • Syntactic: "Dance is to dancing as fly is to _?"

      GloVe over Big Data

      GloVe trained on 42 billion tokens from Common Crawl beat word2vec trained on 100 billion tokens (Google News)


      The largest GloVe model, demonstrating scalability, uses
      840 billion tokens from Common Crawl

      Source code and pre-trained models at
      http://www-nlp.stanford.edu/projects/glove/
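
      The released vectors are plain text, one word per line followed by its vector components, so loading them needs nothing more than this kind of sketch (the filename is the 42B-token model as listed on the project page).

        import numpy as np

        def load_glove(path):
            # Parse "word v1 v2 ... vd" lines into a word -> vector dict
            vectors = {}
            with open(path, encoding="utf-8") as f:
                for line in f:
                    parts = line.rstrip().split(" ")
                    vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
            return vectors

        # vectors = load_glove("glove.42B.300d.txt")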

      Examples of Previous Work


      Unsupervised Algorithms

      + N-gram & language models
      + GloVe: Global Vectors for Word Representation


        Filtering

        + Web tables for gazetteers
        + Dirt Cheap Web-Scale Parallel Text

        Gazetteers for NER


        + Want the widest variety of topics possible

        + Aim to keep them modern / up to date

        + Capture relationships between similar words
           (disambiguation)

        Google Sets


        Web tables as a source of gazetteers + relations


        Querying ["cat"],

        returns ["dog", "bird", "horse", "rabbit", ...]

        Querying ["cat", "ls"],

        returns ["cd", "head", "cut", "vim", ...]


        Web Data Commons Web Tables



        Extracted 11.2 billion tables from WARC files,
        filtered to keep relational tables via a trained classifier


        Only 1.3% of the original data was kept,
        yet it remains hugely valuable

        Web Data Commons Web Tables


        Resulting dataset:
        11.2 billion tables ⇒ 147 million relational web tables


        Popular column headers: name, title, artist, location, model, manufacturer, country ...


        Released at webdatacommons.org/webtables/
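
        As a sketch of how such tables could feed NER gazetteers, assume each extracted relational table has already been parsed into rows keyed by its column headers (the header names below mirror the popular ones listed above):

          from collections import defaultdict

          def build_gazetteers(tables, wanted=("name", "artist", "location", "country")):
              # Collect cell values from columns whose header matches a wanted category
              gazetteers = defaultdict(set)
              for table in tables:              # each table: list of {header: cell} rows
                  for row in table:
                      for header, value in row.items():
                          header = header.strip().lower()
                          if header in wanted and value:
                              gazetteers[header].add(value.strip())
              return gazetteers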



        Web-Scale Parallel Text


        Dirt Cheap Web-Scale Parallel Text from the Common Crawl (Smith et al.)

        "...nothing more than a set of common two-letter language codes ... [we] mined 32 terabytes ... in just under a day"

        Processed all text from URLs of the style:
        website.com/[langcode]/
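
        A rough sketch of that URL-matching idea: URLs that differ only in the two-letter language code are treated as candidate translation pairs (the language-code set here is a small illustrative subset, not the paper's full list).

          import re
          from collections import defaultdict

          LANG_CODES = {"en", "fr", "de", "es", "it", "fi"}        # illustrative subset
          LANG_RE = re.compile(r"^(https?://[^/]+)/([a-z]{2})/(.*)$")

          def candidate_pairs(urls):
              # Bucket URLs by (site, remaining path); pages sharing a bucket
              # but differing in language code are candidate parallel documents
              buckets = defaultdict(dict)
              for url in urls:
                  match = LANG_RE.match(url)
                  if match and match.group(2) in LANG_CODES:
                      site, lang, path = match.groups()
                      buckets[(site, path)][lang] = url
              return {key: langs for key, langs in buckets.items()
                      if "en" in langs and len(langs) > 1}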

        Web-Scale Parallel Text



        [Table: parallel data mined per language pair (source = foreign language, target = English)]

        Web-Scale Parallel Text


        Both EuroParl & United Nations are large and well-curated parallel texts,

        but both have very specific domains & genres.


        Web-Scale Parallel Text


        "...resulting in improvements of up to 1.5 BLEU on standard test sets, and 5 BLEU on test sets outside of the news domain."

        Minimal cleaning & filtering still resulted in a substantial improvement in SMT performance


        Manual inspection across three languages:

        80% of the data contained good translations



        Common Crawl and the NeIC


        We're excited about a potential
        collaboration with NeIC


        Tackling big data is challenging yet fundamentally important for advances in scientific computing:

        NeIC is a forerunner in enabling large-scale computational experimentation in NLP

        Thank you


        Stephen Merity
        stephen@commoncrawl.org
        commoncrawl.org
