UKWaC: British English web corpus

The corpus was prepared by Adriano Ferraresi. The process is described in Ferraresi et al. (LREC 2008).

All material is taken from the .uk domain. It was part-of-speech tagged and lemmatised using TreeTagger, a leading part-of-speech tagger which has been trained for a number of languages. The tagging uses the Penn Treebank tagset.
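
As a rough illustration of what tagged and lemmatised text looks like in practice, the sketch below reads TreeTagger-style output, where each token sits on its own line as three tab-separated fields: word form, part-of-speech tag and lemma. The sample lines, the `Token` type and the `parse_vertical` function are hypothetical and written by hand for this example; they are not taken from the corpus, and the actual distribution format of UKWaC may differ.

```python
from typing import Iterable, List, NamedTuple


class Token(NamedTuple):
    """One annotated token (hypothetical helper type for this sketch)."""
    word: str
    pos: str
    lemma: str


def parse_vertical(lines: Iterable[str]) -> List[Token]:
    """Collect tokens from lines of the form word<TAB>tag<TAB>lemma.

    Lines that do not have exactly three tab-separated fields
    (blank lines, structural markup such as <s>) are skipped.
    """
    tokens = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue
        tokens.append(Token(*parts))
    return tokens


# Hand-written sample, assumed for illustration only.
sample = [
    "<s>",
    "The\tDT\tthe",
    "cats\tNNS\tcat",
    "were\tVBD\tbe",
    "here\tRB\there",
    "</s>",
]

tokens = parse_vertical(sample)
lemmas = [t.lemma for t in tokens]
print(lemmas)
```

Keeping the three fields together in one record makes it straightforward to search by lemma or by tag rather than by surface form.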

The grammatical relations use the definitions prepared by David Tugwell for other English corpora.

Changelog

v2.0 (8 April 2010)

  • fixed tokenisation problems (e.g. broken URLs and e-mail addresses)
  • fixed character encoding problems