
Getting started with Corpus Building at Faculty of Informatics, MU Brno

If you are interested in general instructions on compiling corpora that are not specific to FI MU, Brno, please go to Preparing Corpus Overview.

If you are new to SkE at FI MU, you might need to create several user accounts. To arrange this, contact rychly at gmail dot com.

You will probably need access to two servers: sketchengine.co.uk and corpora.fi.muni.cz. Depending on what you are going to do, you will need a user account in the corpadm group to access the machine and a separate administrative account for Sketch Engine. In addition, if you are reading this page as a guest, you will also need an account for the Trac wiki at trac.sketchengine.co.uk. That makes up to five different accounts.

Once you have obtained all your accounts, you can start creating your corpus.

Create the following pages: http://trac.sketchengine.co.uk/wiki/Corpora/<YourCorpusName?> and http://trac.sketchengine.co.uk/wiki/Private/Corpora/<YourCorpusName?>. The first one should contain the information about the corpus that a regular user wants to know (what it is about, how big it is, whom to contact...). The second should contain information that will help whoever continues your work on the corpus (where the sources are, where your scripts are, etc.).

Then log in to the SkE server:

  • ssh <username>@sketchengine.co.uk

or

  • ssh <username>@corpora.fi.muni.cz

Familiarize yourself with the directory structure on the SkE server:

  • /corpora/registry/ - registry files
  • /corpora/manatee/ - compiled corpora (binaries)
  • /corpora/vert/ - vertical files (analogous to /nlp/corpora/priprava_dat/)
  • /corpora/wsdef/ - files with WS definitions
  • /var/ske/registry/preloaded/ - registry files that the administration system processes (mostly symbolic links to /corpora/registry/)
  • /var/ske/registry/preloaded/default/ - default corpora (those displayed as 'default' in the admin system); if you put something here, always make a symlink to /var/ske/registry/preloaded/

(This layout is valid for sketchengine.co.uk; the tree on corpora.fi.muni.cz is similar, but most of it starts with /nlp/corpora.)
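
Since the preloaded directory mostly holds symbolic links to /corpora/registry/, making a corpus visible to the administration system probably comes down to creating such a link. The following is only a sketch under that assumption: link_registry is a hypothetical helper name, not an existing SkE script, and mycorpus is a placeholder. The two directory variables default to the sketchengine.co.uk paths listed above and can be overridden for the other server.

```shell
#!/bin/sh
# Sketch: link a registry file into the directory read by the admin system.
# Defaults are the sketchengine.co.uk paths; override them for
# corpora.fi.muni.cz or for a dry run in a scratch directory.
REGISTRY_DIR="${REGISTRY_DIR:-/corpora/registry}"
PRELOADED_DIR="${PRELOADED_DIR:-/var/ske/registry/preloaded}"

# link_registry is a hypothetical helper, not part of SkE.
link_registry() {
    corpus="$1"
    if [ ! -f "$REGISTRY_DIR/$corpus" ]; then
        echo "no registry file for $corpus in $REGISTRY_DIR" >&2
        return 1
    fi
    # -s: create a symbolic link; -f: replace a stale link of the same name
    ln -sf "$REGISTRY_DIR/$corpus" "$PRELOADED_DIR/$corpus"
}

# Usage (mycorpus is a placeholder corpus name):
#   link_registry mycorpus
```

Keeping the real file in the registry directory and only linking it means there is a single copy to edit.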

Before creating corpora, study SkE/PreparingCorpusOverview.

In addition to the regular lines, the config file of your corpus should contain:

MINOR "1"

  • the corpus is not shown in the user's corpus list until the user clicks "more corpora"

LANGUAGE "language"

  • language of the corpus

INFOHREF "url"

  • URL of a page with information about the corpus
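
A quick way to confirm that a config file carries these three attributes is to search it for them. This is a minimal sketch, not an existing SkE tool: check_registry is a hypothetical helper name, and it assumes each attribute sits at the start of its own line, followed by whitespace and the value.

```shell
#!/bin/sh
# check_registry is a hypothetical helper, not part of SkE.
# It prints one "missing: ATTR" line per absent attribute and
# returns non-zero if any of the three is missing.
check_registry() {
    file="$1"
    status=0
    for attr in MINOR LANGUAGE INFOHREF; do
        # Assumed line shape: ATTR <whitespace> value
        if ! grep -Eq "^[[:space:]]*${attr}[[:space:]]" "$file"; then
            echo "missing: $attr"
            status=1
        fi
    done
    return $status
}

# Usage (mycorpus is a placeholder corpus name):
#   check_registry /corpora/registry/mycorpus
```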

Useful scripts

genws.sh CORPUS WSDEF_FILE

  • generates word sketches (WS)

thes.sh CORPUS

  • generates the thesaurus (genws.sh must be run first)

install_corpus_for_ws.sh USER CORPUS

  • enables USER to create WS definitions for the corpus CORPUS in CorpusBuilder
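
Because thes.sh depends on the output of genws.sh, the two are naturally run in sequence. The wrapper below is a sketch of that ordering only: build_ws_and_thes is a hypothetical name, it assumes both scripts are on your PATH and signal failure with a non-zero exit status, and the WS definition file name in the usage line is a placeholder.

```shell
#!/bin/sh
# build_ws_and_thes is a hypothetical wrapper, not part of SkE.
# It runs genws.sh and, only if that succeeds, thes.sh.
build_ws_and_thes() {
    corpus="$1"
    wsdef="$2"
    # thes.sh needs the word sketches, so stop if genws.sh fails
    genws.sh "$corpus" "$wsdef" && thes.sh "$corpus"
}

# Usage (corpus name and definition file are placeholders):
#   build_ws_and_thes mycorpus /corpora/wsdef/mycorpus.wsdef
```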

 More utilities

To make your corpus accessible from Sketch Engine, add it to your user account in the Sketch Engine admin system: http://www.sketchengine.co.uk/admin/