Corpus Factory Method
Corpus Factory performs the following steps to collect a corpus for a language:
- Download the Wikipedia dump and parse it to get a Wiki corpus
- Generate a frequency list for the language from the Wiki corpus
- Build queries from the mid-frequency words in the frequency list
- Send the queries to Google or Yahoo, and download the search hit pages
- Clean the corpus:
  - Remove boilerplate text (HTML tags and advertisements)
  - Using the Wiki frequency list, compute the ratio of frequent words to non-frequent words and use it to decide whether a page is continuous (i.e. meaningful) text
  - Remove duplicates
- Tokenise and (if tools are available) lemmatise and part-of-speech tag
- Load into our corpus query tool, the Sketch Engine
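
The query-building and cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the Corpus Factory implementation: the function names, the regex tokeniser, the rank window, the number of words per query and the exact-match duplicate check are all assumptions made for the example. The parameters actually used are described in the paper.

```python
import random
import re
from collections import Counter


def tokenize(text):
    """Naive tokeniser (assumption): lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def frequency_list(wiki_text):
    """Count how often each word occurs in the Wiki corpus text."""
    return Counter(tokenize(wiki_text))


def mid_frequency_words(freq, start_rank, end_rank):
    """Return the words ranked start_rank..end_rank-1 by frequency.

    The rank window is a hypothetical parameter chosen by the caller.
    """
    ranked = [word for word, _ in freq.most_common()]
    return ranked[start_rank:end_rank]


def build_queries(seeds, words_per_query, n_queries, rng_seed=0):
    """Combine random seed words into up to n_queries distinct queries."""
    rng = random.Random(rng_seed)
    queries = set()
    # Bounded number of attempts so the loop always terminates.
    for _ in range(n_queries * 10):
        if len(queries) >= n_queries:
            break
        queries.add(tuple(sorted(rng.sample(seeds, words_per_query))))
    return [" ".join(q) for q in sorted(queries)]


def frequent_to_nonfrequent_ratio(page_text, frequent_words):
    """Ratio of frequent to non-frequent tokens in a downloaded page."""
    tokens = tokenize(page_text)
    frequent = sum(1 for t in tokens if t in frequent_words)
    rare = len(tokens) - frequent
    if rare == 0:
        return float("inf") if frequent else 0.0
    return frequent / rare


def remove_duplicates(pages):
    """Drop pages whose whitespace- and case-normalised text was already seen.

    Exact-match only (assumption); near-duplicate detection is not shown.
    """
    seen = set()
    unique = []
    for page in pages:
        key = " ".join(page.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(page)
    return unique
```

In use, a page would be kept only if its ratio is above some threshold chosen for the language; that threshold is left to the caller here because the text above does not state one.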
Full details are given in the paper by Kilgarriff et al. at LREC 2010.
