Sketch Grammar development corpora
This page provides links to vertical files with samples of corpora for various languages. These samples can be used for developing sketch grammars. Most samples have around 1 million words.
To develop a sketch grammar, you need to:
- download a corpus sample (right click and select "Download link as...")
- create a corpus with an appropriate configuration template in Corpus Architect
- upload the sample to the corpus
- compile the corpus
- upload your sketch grammar (you will probably repeat this step as you refine the grammar)
Some corpora can be used with preloaded configuration templates; others need custom configuration templates. Details are specified below.
| Language | Corpus | Tagset | Download | Config. template |
| --- | --- | --- | --- | --- |
| English | ukWaC | tagset | english1M.vert (14MB) | TreeTagger for English |
| French | frWaC | tagset | french1M.vert (16MB) | TreeTagger for French |
| German | bigDeWaC | tagset¹ | german1M.vert (28MB) | custom (see below) |
| Italian | itWaC | tagset | itwac1M.vert (16MB) | TreeTagger for Italian |
| Slovene | FidaPLUS | tagset | slovene1M.vert (15MB) | custom (see below) |
| Spanish | Spanish Web Corpus | tagset | spanish1M.vert (13MB) | TreeTagger for Spanish |
¹ The German corpus is tagged with both TreeTagger (4th column) and RFTagger (2nd column). The tagset link in the table points to the TreeTagger tagset reference. The RFTagger tagset is much richer and should be more suitable for developing a sketch grammar, but it is completely undocumented. The meaning of its tags should not be too difficult to guess. Here is the full list of tags.
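Because the RFTagger tags are undocumented, one way to get a feel for them is to list each tag from the sample together with a few words that carry it. The following is a minimal sketch, not an official tool. It assumes that the vertical file has one token per line with tab-separated columns in the order given in the German custom template (word, tag, lempos, tt_tag), that structure lines look like XML tags, and that the file is UTF-8 encoded. The function names and the sample lines are invented for illustration and are not taken from the corpus.

```python
from collections import Counter, defaultdict

# Assumed column order, copied from the German custom template:
# word, tag (RFTagger), lempos, tt_tag (TreeTagger).
COLUMNS = ("word", "tag", "lempos", "tt_tag")


def iter_tokens(lines, columns=COLUMNS):
    """Yield one dict per token line, skipping structure lines such as <s>."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Structure lines are assumed to be tag-like and to contain no tab.
        if line.startswith("<") and line.endswith(">") and "\t" not in line:
            continue
        fields = line.split("\t")
        if len(fields) < len(columns):
            continue  # malformed line: skip it rather than guess
        yield dict(zip(columns, fields))


def tag_inventory(lines, attribute="tag", max_examples=3):
    """Count the values of one attribute and collect a few example words."""
    counts = Counter()
    examples = defaultdict(list)
    for token in iter_tokens(lines):
        value = token[attribute]
        counts[value] += 1
        word = token["word"]
        if len(examples[value]) < max_examples and word not in examples[value]:
            examples[value].append(word)
    return counts, examples


# Invented lines in the assumed layout (not copied from german1M.vert).
SAMPLE = [
    "<s>",
    "Haus\tN.Reg.Nom.Sg.Neut\tHaus-n\tNN",
    "Auto\tN.Reg.Nom.Sg.Neut\tAuto-n\tNN",
    "geht\tVFIN.Full.3.Sg.Pres.Ind\tgehen-v\tVVFIN",
    "</s>",
]

counts, examples = tag_inventory(SAMPLE)
for tag, frequency in counts.most_common():
    print(tag, frequency, ", ".join(examples[tag]))
```

To run it on the downloaded sample, pass an open file instead of `SAMPLE`, for example `tag_inventory(open("german1M.vert", encoding="utf-8"))`. Passing `attribute="tt_tag"` gives the same overview for the TreeTagger column.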
Custom configuration templates
German
- Corpus attributes
  - word, tag, lempos, tt_tag
- WPOSLIST
  - ,adjective,ADJ.*,adposition,AP.*,adverb,ADV.*,conjunction,KO.*,determiner,(ART.*|PPOS.*),interjection,ITJ.*,noun,N.*,numeral,CARD.*,particle,PTK.*,pronoun,P[DIPRWA].*,verb,V.*,full stop,\$.
- LPOSLIST
  - ,adjective,-j,adposition,-i,adverb,-r,conjunction,-c,noun,-n,numeral,-m,pronoun,-p,verb,-v
- TAGSETDOC
  - http://www.ims.uni-stuttgart.de/projekte/corplex/TagSets/stts-table.html
Slovene
- Corpus attributes
  - word, tag, lempos
- WPOSLIST
  - ,samostalnik,S.*,glagol,G.*,pridevnik,P.*,prislov,R.*,zaimek,Z.*,predlog,D.*,veznik,V.*,členek,L.*,medmet,M.*,števnik,K.*,okrajšava,O.*,neuvrščeno,N.*
- LPOSLIST
  - ,samostalnik,-s,glagol,-g,pridevnik,-p,prislov,-r,zaimek,-z,predlog,-d,veznik,-v,členek,-l,medmet,-m,števnik,-k,okrajšava,-o,neuvrščeno,-n
- TAGSETDOC
  - http://www.sketchengine.co.uk/tagsets/slovene.html
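The LPOSLIST pairs each part-of-speech label with the suffix that ends a lempos value. A quick check that every lempos in a sample ends in one of the listed suffixes can catch a mismatch between the data and the template before the corpus is compiled. The sketch below assumes that a lempos is a lemma followed by a hyphen and a one-letter code, and that the LPOSLIST value starts with its delimiter character. The function names and the example lempos values are invented for illustration.

```python
# LPOSLIST value copied from the Slovene custom template.
SLOVENE_LPOSLIST = (
    ",samostalnik,-s,glagol,-g,pridevnik,-p,prislov,-r,zaimek,-z,"
    "predlog,-d,veznik,-v,členek,-l,medmet,-m,števnik,-k,"
    "okrajšava,-o,neuvrščeno,-n"
)


def parse_lposlist(value):
    """Map each lempos suffix to its label.

    Assumption: the first character declares the delimiter, and the
    remaining items alternate between a label and a suffix.
    """
    delimiter, body = value[0], value[1:]
    items = body.split(delimiter)
    if len(items) % 2 != 0:
        raise ValueError("expected alternating label/suffix items")
    return dict(zip(items[1::2], items[0::2]))


def split_lempos(lempos, suffixes):
    """Return (lemma, label), or (lempos, None) if the suffix is unknown."""
    # Split on the last hyphen so that hyphenated lemmas stay intact.
    lemma, separator, code = lempos.rpartition("-")
    suffix = separator + code
    if lemma and suffix in suffixes:
        return lemma, suffixes[suffix]
    return lempos, None


suffixes = parse_lposlist(SLOVENE_LPOSLIST)
# Invented lempos values, not copied from slovene1M.vert.
for lempos in ["hiša-s", "delati-g", "e-pošta-s", "hiša-x", "hiša"]:
    print(lempos, split_lempos(lempos, suffixes))
```

Values that come back with `None` as the label are the ones to inspect: either the sample uses a suffix that the LPOSLIST does not list, or the lempos column is not in the expected position.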
