
The Corpus Configuration File: Overview

For the software to be able to use a corpus, there are a number of things it needs to know. These are specified in the corpus configuration file, which is located in the registry directory and whose filename is the corpus identifier on the system (in simple cases this is the corpus name, e.g. 'bnc').

Types of information held in the corpus config file are:

A simple corpus config file is:

PATH  /corpora/test1
ATTRIBUTE  word
ATTRIBUTE  tag
ATTRIBUTE  lemma

STRUCTURE doc {
    ATTRIBUTE title
    ATTRIBUTE region
}
STRUCTURE p
STRUCTURE s

This shows, firstly, the structure of a corpus config file. It contains a set of feature-value pairs where the feature, on the left, must be one of a set of words that the system recognises and knows how to interpret. All features are explained in this set of documentation pages and the full list (with minimal documentation) is [SkE/Config/FullDoc here].

Note: The configuration file uses a general ATTRIBUTE value syntax. If the value contains anything other than lower-case letters, you have to enclose it in quotes or apostrophes, like this: ATTRIBUTE "Complex-value".

The example states that

  • The location of the indexed corpus data on the system is /corpora/test1
  • The vertical file contains three columns, contents of which will be called 'word', 'tag' and 'lemma'
  • The text is divided into structural units of type 'doc', 'p' and 's'. Units of type 'doc' have associated attributes 'title' and 'region'.

As the example illustrates,

  • each nonempty line begins with a feature name and then gives its value
  • a value can be simple, or it can be complex, in which case it is further specified in a block enclosed in { ... }.
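
To make this layout concrete, here is a minimal Python sketch that reads a file of this shape into a nested list. It is an illustration only, not the parser the software itself uses: the names parse_config and SAMPLE and the dictionary layout of each entry are hypothetical. It assumes that an opening brace is separated from the value by whitespace and that a closing brace stands alone on its line, as in the example above.

```python
import shlex

# The simple corpus config file from the example above.
SAMPLE = """\
PATH  /corpora/test1
ATTRIBUTE  word
ATTRIBUTE  tag
ATTRIBUTE  lemma

STRUCTURE doc {
    ATTRIBUTE title
    ATTRIBUTE region
}
STRUCTURE p
STRUCTURE s
"""


def parse_config(text):
    """Parse feature-value lines, with optional { ... } blocks, into nested entries.

    Hypothetical helper: each entry is a dict with the keys
    'feature', 'value' and 'children'.
    """
    root = []
    stack = [root]  # stack[-1] is the list that receives the next entry
    for raw in text.splitlines():
        # shlex strips surrounding quotes or apostrophes from a value
        tokens = shlex.split(raw)
        if not tokens:
            continue  # skip empty lines
        if tokens == ["}"]:
            if len(stack) == 1:
                raise ValueError("unmatched closing brace")
            stack.pop()
            continue
        opens_block = tokens[-1] == "{"
        if opens_block:
            tokens = tokens[:-1]
        if not tokens:
            raise ValueError("opening brace without a feature name")
        entry = {
            "feature": tokens[0],
            "value": tokens[1] if len(tokens) > 1 else None,
            "children": [],
        }
        stack[-1].append(entry)
        if opens_block:
            stack.append(entry["children"])
    if len(stack) != 1:
        raise ValueError("unclosed block")
    return root


config = parse_config(SAMPLE)
for entry in config:
    print(entry["feature"], entry["value"], [c["value"] for c in entry["children"]])
```

Under these assumptions, the entry for STRUCTURE doc carries the two nested ATTRIBUTE entries 'title' and 'region' as its children, while simple values such as PATH have no children. A quoted value such as ATTRIBUTE "Complex-value" is returned without its quotes.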