zhTenTen

zhTenTen is the Simplified Chinese corpus in the TenTen family of web corpora.

The corpus was processed with the Stanford Chinese Word Segmenter and the Stanford Log-linear Part-Of-Speech Tagger, using the standard Chinese Penn Treebank models.
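
To illustrate what segmented and tagged text can look like downstream, here is a minimal sketch that turns one line of tagger-style output into one-token-per-line (word, tab, tag) form. It is not the pipeline used to build zhTenTen. The function names `tagged_to_pairs` and `pairs_to_vertical`, the `#` separator, and the sample sentence are assumptions made for illustration; the separator in real tagger output depends on its configuration.

```python
def tagged_to_pairs(line, sep="#"):
    """Split 'word<sep>TAG word<sep>TAG ...' into (word, tag) pairs.

    `sep` is an assumed separator; adjust it to match the tagger output.
    """
    pairs = []
    for token in line.split():
        # Split on the LAST separator so a word containing `sep` survives.
        word, found, tag = token.rpartition(sep)
        if not found:
            # No separator in this token: keep it as an untagged word.
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs


def pairs_to_vertical(pairs):
    """Render pairs as one 'word<TAB>tag' line per token."""
    return "\n".join(f"{word}\t{tag}" for word, tag in pairs)


if __name__ == "__main__":
    # Hypothetical sample line with Chinese Penn Treebank style tags.
    sample = "我#PN 爱#VV 北京#NR"
    print(pairs_to_vertical(tagged_to_pairs(sample)))
```

Splitting on the last separator is a deliberate choice so that a token whose word form itself contains the separator character still yields the right tag.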

The sketch grammar was prepared by Simon Smith.

Changelog

v1.0 (2 December 2011)

  • initial version -- 2.1 billion tokens