Sketch Grammar development corpora

This page provides links to vertical files with samples of corpora for various languages. These samples can be used for developing sketch grammars. Most samples have around 1 million words.

In order to develop a sketch grammar you need to:

  1. download a corpus sample (right-click the link and select "Download link as...")
  2. create a corpus in Corpus Architect with the appropriate configuration template
  3. upload the sample to the corpus
  4. compile the corpus
  5. upload your sketch grammar (you will probably repeat this step as you revise the grammar)
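
Before uploading a sample in step 3, it can help to check that the download is complete and has the expected number of columns. The following is a minimal sketch, not part of Sketch Engine. It assumes the usual vertical layout (one token per line, attributes separated by tabs, structure tags such as `<s>` on lines of their own); the function name `summarize_vertical` and the inline sample are made up for illustration.

```python
import io
from collections import Counter

def summarize_vertical(lines):
    """Return (token count, structure-line count, {column count: lines})."""
    tokens = 0
    structures = 0
    widths = Counter()
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        # Assumption: a line that starts with "<" and ends with ">" is a
        # structure tag (e.g. <doc ...>, <s>, </s>), not a token.
        if line.startswith("<") and line.endswith(">"):
            structures += 1
            continue
        tokens += 1
        widths[len(line.split("\t"))] += 1
    return tokens, structures, dict(widths)

# Tiny invented sample standing in for a downloaded .vert file.
sample = io.StringIO(
    '<doc id="1">\n<s>\nThe\tDT\tthe-x\ndog\tNN\tdog-n\n</s>\n</doc>\n'
)
tokens, structures, widths = summarize_vertical(sample)
print(tokens, structures, widths)
```

For a real sample, pass an open file instead of the `StringIO` object, for example `open("english1M.vert", encoding="utf-8")` (the encoding is an assumption). A token count far from 1 million, or more than one distinct column count, suggests a truncated download or a mismatch with the attribute list of the chosen template.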

Some corpora work with preloaded configuration templates; others need custom ones. Details are given below.

Language  Corpus              Tagset               Download               Config. template
English   ukWaC               tagset               english1M.vert (14MB)  TreeTagger for English
French    frWaC               tagset               french1M.vert (16MB)   TreeTagger for French
German    bigDeWaC            tagset (see note 1)  german1M.vert (28MB)   custom (see below)
Italian   itWaC               tagset               itwac1M.vert (16MB)    TreeTagger for Italian
Slovene   FidaPLUS            tagset               slovene1M.vert (15MB)  custom (see below)
Spanish   Spanish Web Corpus  tagset               spanish1M.vert (13MB)  TreeTagger for Spanish

1 The German corpus is tagged with both TreeTagger (4th column) and RFTagger (2nd column). The tagset link in the table points to the reference for the TreeTagger tagset. The RFTagger tagset is much richer and should be more appropriate for developing a sketch grammar, but it is completely undocumented. The meaning of its tags should not be too difficult to guess, though. Here is the full list of tags.
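
Since the RFTagger tags are undocumented, one way to get a feel for them is to count which values actually occur in the 2nd column of the sample. The sketch below assumes tab-separated columns and structure tags on their own lines; the helper `tag_inventory` and the three sample tokens (including their tag and lempos values) are invented for illustration and are not taken from the corpus.

```python
from collections import Counter

def tag_inventory(lines, column):
    """Count the values found in a 0-based column, skipping structure lines."""
    counts = Counter()
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or (line.startswith("<") and line.endswith(">")):
            continue
        fields = line.split("\t")
        if column < len(fields):
            counts[fields[column]] += 1
    return counts

# Invented lines in the column order word, tag, lempos, tt_tag.
sample = [
    "<s>\n",
    "Der\tART.Def.Nom.Sg.Masc\tder-x\tART\n",
    "Hund\tN.Reg.Nom.Sg.Masc\tHund-n\tNN\n",
    "der\tART.Def.Nom.Sg.Masc\tder-x\tART\n",
    "</s>\n",
]

rf_tags = tag_inventory(sample, 1)  # 2nd column: RFTagger
tt_tags = tag_inventory(sample, 3)  # 4th column: TreeTagger
for tag, count in rf_tags.most_common():
    print(count, tag)
```

Running the same function over `german1M.vert` and comparing the two columns side by side may make the richer tags easier to interpret, because each one can be read against the documented TreeTagger tag of the same token.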

Custom configuration templates

German

Corpus attributes
word, tag, lempos, tt_tag
WPOSLIST
,adjective,ADJ.*,adposition,AP.*,adverb,ADV.*,conjunction,KO.*,determiner,(ART.*|PPOS.*),interjection,ITJ.*,noun,N.*,numeral,CARD.*,particle,PTK.*,pronoun,P[DIPRWA].*,verb,V.*,full stop,\$.
LPOSLIST
,adjective,-j,adposition,-i,adverb,-r,conjunction,-c,noun,-n,numeral,-m,pronoun,-p,verb,-v
TAGSETDOC
http://www.ims.uni-stuttgart.de/projekte/corplex/TagSets/stts-table.html
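
The WPOSLIST value above reads as a delimiter character followed by alternating labels and tag patterns. The sketch below illustrates that reading so that a list can be checked offline before the corpus is compiled. It rests on assumptions: that the first character is the delimiter, that each pattern must match a whole tag, and that the first matching pair wins. The functions `parse_poslist` and `classify` are hypothetical helpers, and the example tags are STTS-style tags chosen because TAGSETDOC points to the STTS table.

```python
import re

def parse_poslist(value):
    """Split a WPOSLIST-style value into (label, pattern) pairs."""
    delimiter, body = value[0], value[1:]
    fields = body.split(delimiter)
    if len(fields) % 2:
        raise ValueError("expected alternating label/pattern fields")
    return list(zip(fields[0::2], fields[1::2]))

def classify(tag, pairs):
    """Return the label of the first pattern matching the whole tag."""
    for label, pattern in pairs:
        if re.fullmatch(pattern, tag):
            return label
    return None

WPOSLIST = (
    r",adjective,ADJ.*,adposition,AP.*,adverb,ADV.*,conjunction,KO.*,"
    r"determiner,(ART.*|PPOS.*),interjection,ITJ.*,noun,N.*,numeral,CARD.*,"
    r"particle,PTK.*,pronoun,P[DIPRWA].*,verb,V.*,full stop,\$."
)

pairs = parse_poslist(WPOSLIST)
for tag in ["NN", "ADJA", "PPOSAT", "VVFIN", "$.", "XY"]:
    print(tag, "->", classify(tag, pairs))
```

A tag that maps to `None` (here the made-up catch-all `XY`) is not covered by any pattern, which is a quick way to spot gaps in the list.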

Slovene

Corpus attributes
word, tag, lempos
WPOSLIST
,samostalnik,S.*,glagol,G.*,pridevnik,P.*,prislov,R.*,zaimek,Z.*,predlog,D.*,veznik,V.*,členek,L.*,medmet,M.*,števnik,K.*,okrajšava,O.*,neuvrščeno,N.*
LPOSLIST
,samostalnik,-s,glagol,-g,pridevnik,-p,prislov,-r,zaimek,-z,predlog,-d,veznik,-v,členek,-l,medmet,-m,števnik,-k,okrajšava,-o,neuvrščeno,-n
TAGSETDOC
http://www.sketchengine.co.uk/tagsets/slovene.html
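
The LPOSLIST value pairs each part-of-speech label with a suffix of the lempos attribute. As a rough illustration, the sketch below turns the Slovene list into a suffix lookup and splits a lempos value into lemma and label. It assumes the first character of the value is the delimiter and that the suffix is simply appended to the lemma; `parse_lposlist`, `split_lempos`, and the example values are made up for illustration.

```python
def parse_lposlist(value):
    """Map each lempos suffix (e.g. "-s") to its label."""
    delimiter, body = value[0], value[1:]
    fields = body.split(delimiter)
    if len(fields) % 2:
        raise ValueError("expected alternating label/suffix fields")
    return dict(zip(fields[1::2], fields[0::2]))

def split_lempos(lempos, suffixes):
    """Return (lemma, label), or (lempos, None) if no suffix matches."""
    for suffix, label in suffixes.items():
        if lempos.endswith(suffix):
            return lempos[: -len(suffix)], label
    return lempos, None

LPOSLIST = (
    ",samostalnik,-s,glagol,-g,pridevnik,-p,prislov,-r,zaimek,-z,"
    "predlog,-d,veznik,-v,členek,-l,medmet,-m,števnik,-k,"
    "okrajšava,-o,neuvrščeno,-n"
)

suffixes = parse_lposlist(LPOSLIST)
print(split_lempos("hiša-s", suffixes))
print(split_lempos("delati-g", suffixes))
```

Applying `split_lempos` to the lempos column of `slovene1M.vert` and counting the values that come back with `None` would show whether every suffix used in the sample is listed.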