UKWaC British English web corpus
The corpus was prepared by Adriano Ferraresi. The process is described in Ferraresi et al. (LREC 2008).
All material is taken from the .uk domain. It was part-of-speech tagged and lemmatised using TreeTagger, a leading part-of-speech tagger that has been trained for a number of languages. The corpus uses the Penn Treebank tagset.
The grammatical relation definitions prepared by David Tugwell for other English corpora were used.
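As an illustration of the tagging and lemmatisation described above, here is a minimal sketch of reading tagged text. It assumes one token per line with three tab-separated columns (word, part-of-speech tag, lemma), which is the usual TreeTagger output layout, and it assumes that lines wrapped in angle brackets are structural markup. The Token type, the parse_tagged_lines function and the sample lines are invented for this example and are not part of the corpus distribution.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    word: str   # surface form as it appears in the text
    pos: str    # part-of-speech tag, e.g. "DT", "NN", "NNS"
    lemma: str  # base form assigned by the lemmatiser


def parse_tagged_lines(lines: Iterable[str]) -> Iterator[Token]:
    """Yield a Token for each tab-separated word/tag/lemma line.

    Blank lines and lines that look like structural markup
    (assumed here to start with "<" and end with ">") are skipped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or (line.startswith("<") and line.endswith(">")):
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # this sketch simply ignores malformed lines
        yield Token(*parts)


# Hypothetical sample, not taken from the corpus itself.
sample = [
    "<s>\n",
    "The\tDT\tthe\n",
    "web\tNN\tweb\n",
    "corpora\tNNS\tcorpus\n",
    "</s>\n",
]

tokens = list(parse_tagged_lines(sample))
lemmas = [t.lemma for t in tokens]
```

With the sample above, lemmas holds the base form of each word, so a search for the lemma "corpus" would also match the plural "corpora".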
Changelog
v2.0 (8 April 2010)
- fixed tokenisation problems (e.g. broken URLs and e-mail addresses)
- fixed character encoding problems
