TenTen corpora
TenTen is a new generation of Web corpora. These corpora are created by Web crawling and processed with our latest boilerplate-cleaning and de-duplication tools. The name "TenTen" refers to the target size of the corpora, which is 10^10 (10 billion) words.
Available corpora:
- zhTenTen (Chinese, Simplified)
- enTenTen (English)
- deTenTen (German)
- itTenTen (Italian)
- noTenTen (Norwegian)
- ptTenTen (Portuguese)
- skTenTen (Slovak)
- esTenTen (Spanish)
Newly available corpora:
