zhTenTen
Simplified Chinese TenTen corpus.
The corpus has been processed with the Stanford Chinese Word Segmenter and the Stanford Log-linear Part-Of-Speech Tagger, using the standard Chinese Penn Treebank models.
The sketch grammar was prepared by Simon Smith.
Changelog
v1.0 (2 December 2011)
- initial version (2.1 billion tokens)
