The Corpus Configuration File: Overview
For the software to be able to use a corpus, there are a number of things it needs to know. They are specified in the corpus configuration file, a file located in the registry directory whose filename is the corpus identifier on the system (in simple cases this is the corpus name, e.g. 'bnc').
Types of information held in the corpus config file are:
- Basic facts about the corpus
- Locations of data on the file system
- Attributes of words and structures
- To control the search interface:
  - Options for word classes and lemmas
  - Subcorpus creation interface
- To control concordance display
- Management, technical, processing details
A simple corpus config file is:
PATH /corpora/test1
ATTRIBUTE word
ATTRIBUTE tag
ATTRIBUTE lemma
STRUCTURE doc {
    ATTRIBUTE title
    ATTRIBUTE region
}
STRUCTURE p
STRUCTURE s
This shows the general structure of a corpus config file. It contains a set of feature-value pairs, where the feature, on the left, must be one of the keywords that the system recognises and knows how to interpret. All features are explained in this set of documentation pages, and the full list (with minimal documentation) is [SkE/Config/FullDoc here].
Note: The configuration file uses a general ATTRIBUTE value syntax. If the value contains anything other than lower-case letters, you have to enclose it in quotes or apostrophes, like this: ATTRIBUTE "Complex-value".
The example states that
- The location of the indexed corpus data on the system is /corpora/test1
- The vertical file contains three columns, whose contents will be called 'word', 'tag' and 'lemma'
- The text is in structural units of type 'doc', 's' and 'p'. Units of type 'doc' have associated attributes 'title' and 'region'.
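To make the mapping between columns and attribute names concrete, here is a minimal Python sketch. It assumes the usual vertical layout of one token per line with tab-separated columns and XML-like structure tags on their own lines. The sample text, the tag values and the function name read_tokens are hypothetical and serve only as illustration; this is not how the system itself reads the file.

```python
# Attribute names, in the order the config file declares them.
ATTRIBUTES = ["word", "tag", "lemma"]

# Hypothetical vertical-file content (assumed layout: one token per line,
# tab-separated columns, structure tags on separate lines).
SAMPLE = (
    '<doc title="Sample" region="UK">\n'
    "<p>\n"
    "<s>\n"
    "Dogs\tNNS\tdog\n"
    "bark\tVVP\tbark\n"
    "</s>\n"
    "</p>\n"
    "</doc>\n"
)


def read_tokens(text, attributes=ATTRIBUTES):
    """Return one dict per token line, keyed by attribute name."""
    tokens = []
    for line in text.splitlines():
        # Skip blank lines and structure tags. This simple check would
        # also skip a token whose word form starts with "<".
        if not line or line.startswith("<"):
            continue
        columns = line.split("\t")
        tokens.append(dict(zip(attributes, columns)))
    return tokens


if __name__ == "__main__":
    for token in read_tokens(SAMPLE):
        print(token)
```

The first column is paired with 'word', the second with 'tag' and the third with 'lemma', simply because that is the order of the ATTRIBUTE lines in the example config file.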
As the example illustrates,
- each non-empty line begins with a feature name and then gives its value
- values can be simple, or can be further specified in a block enclosed in { ... }.
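The two points above can be sketched as a small Python reader for this syntax. It is an illustrative approximation under stated assumptions, not the parser the system uses: the function name parse_config and the (feature, value, children) tuple layout are hypothetical, quoting is handled with the standard library's shlex module, and a block is assumed to open with a trailing { and close with a } on its own line, as in the example.

```python
import shlex

# The example config file from this page.
EXAMPLE = """\
PATH /corpora/test1
ATTRIBUTE word
ATTRIBUTE tag
ATTRIBUTE lemma
STRUCTURE doc {
    ATTRIBUTE title
    ATTRIBUTE region
}
STRUCTURE p
STRUCTURE s
"""


def parse_config(text):
    """Parse feature-value lines into (feature, value, children) tuples."""
    root = []
    stack = [root]  # innermost open block is stack[-1]
    for raw in text.splitlines():
        parts = shlex.split(raw)  # strips "..." and '...' quoting
        if not parts:
            continue  # empty line
        if parts == ["}"]:
            if len(stack) == 1:
                raise ValueError("unmatched '}'")
            stack.pop()
            continue
        opens_block = parts[-1] == "{"
        if opens_block:
            parts = parts[:-1]
        if not parts:
            raise ValueError("block opened without a feature name")
        feature, value = parts[0], " ".join(parts[1:])
        children = []
        stack[-1].append((feature, value, children))
        if opens_block:
            stack.append(children)
    if len(stack) != 1:
        raise ValueError("unclosed '{'")
    return root


if __name__ == "__main__":
    for feature, value, children in parse_config(EXAMPLE):
        print(feature, value, children)
```

Each line becomes one tuple, and the two ATTRIBUTE lines inside the braces end up as children of the STRUCTURE doc entry rather than as top-level attributes.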
