skTenTen

Slovak TenTen corpus.

The corpus was tagged by the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences. Information about the tagging, including the tagset reference, can be found here (in Slovak).

Apart from the standard word, tag and lemma attributes, the corpus contains an extra attribute called amblevel, an integer indicating the level of ambiguity of each word form. Its value is the number of possible POS tags for the given word form, from which the disambiguator selected one.
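
As an illustration of how amblevel might be used, the following minimal Python sketch reads tokens from tab-separated vertical text and counts how many tokens have each ambiguity level. The column order (word, tag, lemma, amblevel), the sample lines, the placeholder tags and the function names are all assumptions made for the example; they are not taken from the corpus itself.

```python
from collections import Counter

# ASSUMPTION: attribute order in the vertical file; the real corpus may differ.
COLUMNS = ("word", "tag", "lemma", "amblevel")


def parse_vertical(lines):
    """Yield one dict per token line, skipping structural markup.

    Simplification: any line starting with "<" is treated as markup,
    so a literal "<" token would be skipped as well.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("<"):
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            continue  # ignore malformed lines
        token = dict(zip(COLUMNS, fields))
        token["amblevel"] = int(token["amblevel"])
        yield token


def amblevel_histogram(tokens):
    """Count tokens per ambiguity level."""
    return Counter(t["amblevel"] for t in tokens)


def ambiguous_tokens(tokens, threshold=2):
    """Return tokens whose word form had at least `threshold` candidate tags."""
    return [t for t in tokens if t["amblevel"] >= threshold]


# Hypothetical sample data with placeholder words and tags.
SAMPLE = [
    "<s>",
    "word1\tTAG_A\tlemma1\t1",
    "word2\tTAG_B\tlemma2\t3",
    "word3\tTAG_C\tlemma3\t2",
    "</s>",
]

if __name__ == "__main__":
    tokens = list(parse_vertical(SAMPLE))
    print(amblevel_histogram(tokens))
    print([t["word"] for t in ambiguous_tokens(tokens)])
```

A token with amblevel 1 had only one candidate tag, so the disambiguator had nothing to choose between. Higher values mark word forms where the selected tag is one of several candidates.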

Word sketches have been prepared by Vladimír Benko.

Changelog

v1.0 (13 September 2011)

  • initial version -- 876 million tokens