
Corpus Factory Method

The Corpus Factory performs the following steps to collect a corpus for a language:

  • Download a Wikipedia dump and parse it to get a Wiki corpus
  • Generate a frequency list for the language from the Wiki corpus
  • Build queries from the mid-frequency words in the frequency list
  • Send the queries to Google or Yahoo, and download the search hit pages
  • Clean the corpus
    • Remove boilerplate text (HTML tags and advertisements)
    • Using the Wiki frequency list, compute the ratio of frequent words to non-frequent words and determine whether a page contains continuous text (i.e. is meaningful)
    • Remove duplicates
  • Tokenise and (if tools are available) lemmatise and part-of-speech tag
  • Load into our corpus query tool, the Sketch Engine
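
The frequency-list, query-building and cleaning steps above can be illustrated with a minimal Python sketch. It is not the Corpus Factory implementation: the function names, the whitespace-level token lists, the frequency band limits, the number of words per query and the exact-match duplicate check are all assumptions made for illustration. The paper should be consulted for the settings actually used.

```python
import random
from collections import Counter


def frequency_list(tokens):
    """Return (word, count) pairs sorted by descending frequency."""
    return Counter(t.lower() for t in tokens).most_common()


def build_queries(freq_list, start, end, words_per_query, n_queries, seed=0):
    """Build queries by sampling words from the band freq_list[start:end].

    The band limits and query length are hypothetical parameters; the
    idea is only to skip the most frequent and the rarest words.
    """
    band = [word for word, _ in freq_list[start:end]]
    rng = random.Random(seed)
    return [" ".join(rng.sample(band, words_per_query)) for _ in range(n_queries)]


def frequent_word_ratio(tokens, frequent_words):
    """Ratio of frequent to non-frequent tokens in a downloaded page."""
    frequent = sum(1 for t in tokens if t.lower() in frequent_words)
    non_frequent = len(tokens) - frequent
    if non_frequent == 0:
        # No non-frequent tokens: infinite ratio, or 0.0 for an empty page.
        return float("inf") if frequent else 0.0
    return frequent / non_frequent


def remove_duplicates(pages):
    """Drop exact duplicate pages, keeping the first occurrence of each.

    Assumption: exact string matching only; near-duplicate detection
    would need a different technique.
    """
    seen = set()
    unique = []
    for page in pages:
        if page not in seen:
            seen.add(page)
            unique.append(page)
    return unique
```

A page would then be kept only if its ratio exceeds some threshold chosen for the language, for example `frequent_word_ratio(page_tokens, top_words) >= min_ratio`, where `top_words` and `min_ratio` are hypothetical names for the set of frequent Wiki words and the chosen cut-off.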

Full details are given in the paper by Kilgarriff et al., presented at LREC 2010.