Sketch Engine

The Corpus Configuration File: Overview

For the software to be able to use a corpus, there are a number of things it needs to know. These are specified in the corpus configuration file, which is located in the registry directory and whose filename is the corpus identifier on the system (in simple cases this is the corpus name, e.g. 'bnc').

Types of information held in the corpus config file are:

  • Basic facts about the corpus
  • Locations of data on the file system
  • Attributes of words and structures
    • Dynamic attributes
  • To control the search interface:
    • Options for word classes and lemmas
    • Subcorpus creation interface
  • To control concordance display
  • Management, technical, processing details

A simple corpus config file is:

PATH  /corpora/test1
ATTRIBUTE  word
ATTRIBUTE  tag
ATTRIBUTE  lemma

STRUCTURE doc {
    ATTRIBUTE title
    ATTRIBUTE region
}
STRUCTURE p
STRUCTURE s

The example shows, first, the structure of a corpus config file: a set of feature-value pairs in which the feature, on the left, must be one of the keywords that the system recognises and knows how to interpret. All features are explained in this set of documentation pages, which also includes a full list of features with minimal documentation.

The example states that

  • The location of the indexed corpus data on the system is /corpora/test1
  • The vertical file contains three columns, whose contents will be called 'word', 'tag' and 'lemma'
  • The text is in structural units of type 'doc', 's' and 'p'. Units of type 'doc' have associated attributes 'title' and 'region'.
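
To make the link between the ATTRIBUTE lines and the columns concrete, here is a minimal Python sketch. It assumes the usual vertical layout of one token per line with tab-separated columns, and structures written as XML-like tags on lines of their own. The sample data, the name SAMPLE_VERTICAL and the helper read_tokens are all invented for illustration and are not part of the Sketch Engine software.

```python
# Hypothetical vertical-file fragment (invented data, for illustration only).
# Assumption: one token per line, columns separated by tabs, and structures
# ('doc', 'p', 's') written as XML-like tags on their own lines.
SAMPLE_VERTICAL = (
    '<doc title="Example" region="UK">\n'
    "<p>\n"
    "<s>\n"
    "The\tDT\tthe\n"
    "cats\tNNS\tcat\n"
    "sleep\tVBP\tsleep\n"
    "</s>\n"
    "</p>\n"
    "</doc>\n"
)

# Column names, in the order of the ATTRIBUTE lines in the config file.
ATTRIBUTES = ["word", "tag", "lemma"]


def read_tokens(vertical_text, attributes=ATTRIBUTES):
    """Return one dict per token line, keyed by attribute name."""
    tokens = []
    for line in vertical_text.splitlines():
        # Simplification: any line starting with '<' is treated as a
        # structure tag, so a literal '<' token would be skipped too.
        if not line or line.startswith("<"):
            continue
        columns = line.split("\t")
        tokens.append(dict(zip(attributes, columns)))
    return tokens


tokens = read_tokens(SAMPLE_VERTICAL)
```

In this sketch the first column of each token line is labelled 'word', the second 'tag' and the third 'lemma', simply because that is the order in which the attributes are declared.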

As the example illustrates,

  • each nonempty line begins with a feature name and then gives its value
  • a value can be simple, or it can be complex, with further details specified in a block enclosed in { ... }.
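
The two points above can be sketched as a small Python reader. This is only an illustration of the feature-value layout, not the parser the system actually uses: the function name parse_config is hypothetical, and the sketch assumes that a block opens with '{' at the end of a line and closes with '}' on a line of its own, as in the example config.

```python
SAMPLE_CONFIG = """\
PATH  /corpora/test1
ATTRIBUTE  word
ATTRIBUTE  tag
ATTRIBUTE  lemma

STRUCTURE doc {
    ATTRIBUTE title
    ATTRIBUTE region
}
STRUCTURE p
STRUCTURE s
"""


def parse_config(text):
    """Parse config text into a list of (feature, value, block) triples.

    'block' is a (possibly empty) list of nested triples taken from a
    following { ... } block.
    """
    entries = []
    stack = [entries]  # stack[-1] is the list currently being filled
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # empty lines carry no information
        if line == "}":
            if len(stack) == 1:
                raise ValueError("unmatched '}'")
            stack.pop()
            continue
        opens_block = line.endswith("{")
        if opens_block:
            line = line[:-1].rstrip()
        parts = line.split(None, 1)  # feature name, then the rest as value
        feature = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        block = []
        stack[-1].append((feature, value, block))
        if opens_block:
            stack.append(block)
    if len(stack) != 1:
        raise ValueError("unclosed '{'")
    return entries


entries = parse_config(SAMPLE_CONFIG)
```

Here the entry for STRUCTURE doc carries a nested block holding its two ATTRIBUTE lines, while the entries for p and s have empty blocks.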

Lexical Computing Ltd
