The Corpus Configuration File: Overview
For the software to be able to use a corpus, there are a number of things it needs to know. They are specified in the corpus configuration file, which is located in the registry directory and whose filename is the corpus identifier on the system (in simple cases, the corpus name, e.g. 'bnc').
Types of information held in the corpus config file are:
- Basic facts about the corpus
- Locations of data on the file system
- Attributes of words and structures
- To control the search interface:
  - Options for word classes and lemmas
  - Subcorpus creation interface
- To control concordance display
- Management, technical, processing details
A simple corpus config file is:
PATH /corpora/test1
ATTRIBUTE word
ATTRIBUTE tag
ATTRIBUTE lemma
STRUCTURE doc {
    ATTRIBUTE title
    ATTRIBUTE region
}
STRUCTURE p
STRUCTURE s
This shows the structure of a corpus config file: a set of feature-value pairs, where the feature, on the left, must be one of a set of words that the system recognises and knows how to interpret. All features are explained in this set of documentation pages, and a full list (with minimal documentation) is also available.
The example states that
- The location of the indexed corpus data on the system is /corpora/test1
- The vertical file contains three columns, whose contents will be called 'word', 'tag' and 'lemma'
- The text is in structural units of type 'doc', 's' and 'p'. Units of type 'doc' have associated attributes 'title' and 'region'.
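To make the column mapping concrete, here is a minimal Python sketch. It assumes a tab-separated vertical file with one token per line and structures written as XML-like tags. The sample lines and the helper name parse_vertical_line are hypothetical, invented for illustration, and are not part of the software.

```python
# Positional attribute names, in the order of the ATTRIBUTE lines
# in the example config file.
ATTRIBUTES = ["word", "tag", "lemma"]


def parse_vertical_line(line, attributes=ATTRIBUTES):
    """Map one token line of a vertical file to {attribute: value}.

    Assumption: columns are tab-separated. Lines starting with '<' are
    treated as structure tags (e.g. <doc ...>, <p>, </s>) and yield None.
    This is a simplification: a real token that begins with '<' would
    need extra handling.
    """
    line = line.rstrip("\n")
    if not line or line.startswith("<"):
        return None
    return dict(zip(attributes, line.split("\t")))


# Hypothetical sample data, invented for illustration.
sample = [
    '<doc title="Example" region="UK">',
    "<p>",
    "<s>",
    "Dogs\tNNS\tdog",
    "bark\tVBP\tbark",
    "</s>",
    "</p>",
    "</doc>",
]

# Keep only token lines; structure tags are skipped.
tokens = [t for t in map(parse_vertical_line, sample) if t is not None]
```

Here the first column of each token line is labelled 'word', the second 'tag' and the third 'lemma', following the order in which the ATTRIBUTE lines appear in the config file.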
As the example config file illustrates,
- each nonempty line begins with a feature name and then gives its value
- a value can be simple, or it can be complex, in which case it is further specified in a block enclosed in { ... }.
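The two rules above can be sketched as a small Python reader. This is an illustration under simplifying assumptions (one feature per line, a block opened by a trailing '{' and closed by a line containing only '}'), not the parser the software itself uses. The function name parse_config is hypothetical.

```python
def parse_config(text):
    """Parse feature-value lines into (feature, value, children) tuples.

    Simplifying assumptions (not taken from the software): one feature
    per line, a block opens with a trailing '{' on the feature line and
    closes with a line that contains only '}'.
    """
    root = []
    stack = [root]  # stack[-1] is the list currently being filled
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # empty lines carry no information
        if line == "}":
            if len(stack) == 1:
                raise ValueError("unmatched '}'")
            stack.pop()
            continue
        opens_block = line.endswith("{")
        if opens_block:
            line = line[:-1].rstrip()
        parts = line.split(None, 1)  # feature name, then the rest as value
        if not parts:
            raise ValueError("block opened without a feature name")
        feature = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        children = []
        stack[-1].append((feature, value, children))
        if opens_block:
            stack.append(children)  # following lines belong to this block
    if len(stack) != 1:
        raise ValueError("unclosed '{' block")
    return root


EXAMPLE = """\
PATH /corpora/test1
ATTRIBUTE word
ATTRIBUTE tag
ATTRIBUTE lemma
STRUCTURE doc {
    ATTRIBUTE title
    ATTRIBUTE region
}
STRUCTURE p
STRUCTURE s
"""

config = parse_config(EXAMPLE)
```

Simple values such as PATH end up with an empty list of children, while the complex value of STRUCTURE doc carries its two nested ATTRIBUTE entries.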
