deTenTen

German TenTen corpus.

The corpus is double-tagged with RFTagger (attribute tag, tagset reference) and TreeTagger (attribute tt_tag, tagset reference).
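
Because the two taggers write to separate attributes, a query can address either one by name. The following is a minimal sketch, assuming the corpus is searched with CQL-style token expressions of the form [attribute="regex"]. The helper name cql_token is hypothetical, and the tag values NN and N.* are illustrative assumptions that should be checked against the respective tagset references.

```python
import re


def cql_token(attribute: str, pattern: str) -> str:
    """Build a single-token CQL expression such as [tt_tag="NN"].

    `attribute` is a positional attribute name (here `tag` for RFTagger
    or `tt_tag` for TreeTagger); `pattern` is a regular expression over
    the attribute's values. This helper is hypothetical, not part of any
    corpus tool.
    """
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", attribute):
        raise ValueError(f"invalid attribute name: {attribute!r}")
    if '"' in pattern:
        # Keep the sketch simple: refuse values that would need quoting.
        raise ValueError("pattern must not contain a double quote")
    return f'[{attribute}="{pattern}"]'


# Assumed example values; verify them against the tagset references.
treetagger_query = cql_token("tt_tag", "NN")  # TreeTagger attribute
rftagger_query = cql_token("tag", "N.*")      # RFTagger attribute
print(treetagger_query)
print(rftagger_query)
```

Keeping the attribute name as a parameter makes it easy to run the same pattern against both annotation layers and compare the results.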

Changelog

v2.0 (28 April 2011)

  • fixed part-of-speech tagging problems that caused major data loss in the previous version
  • 2.8 billion tokens

v1.0 (30 November 2010)

  • initial version -- 1.2 billion tokens