Corpus Configuration File: All Features
About the corpus
NAME
name of the corpus; defaults to the corpus config filename
ENCODING
corpus encoding (should be one that Tcl supports); the default encoding is "iso8859-1"
LOCALE
locale code of the corpus language (and region). This value is used in query evaluation (of regular expressions) and in sorting concordance lines. The default is the standard POSIX locale ("C")
RIGHTTOLEFT
indicates whether the language of the corpus is written in a right-to-left script (e.g. Arabic)
ALIGNED
for parallel corpora only: the name of an aligned corpus. Both corpora should have a structural tag named "align", with a one-to-one correspondence between the respective token sequences of the "align" structure
NEWVERSION
for old versions of corpora only: the name of the new version of the corpus
DEFAULTATTR
default attribute for CQL query evaluation. It is also used to map attribute alias "-" in the web API.
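As a rough illustration, the options above might appear at the top of a config file as follows. All values (corpus names, locale) are hypothetical placeholders, and the "#" comment lines are there only for explanation, on the assumption that the config parser ignores them.

```
# hypothetical values, for illustration only
NAME "Example English Corpus"
ENCODING "utf-8"
LOCALE "en_GB.UTF-8"
DEFAULTATTR word
# parallel corpora only: name of the aligned corpus
ALIGNED "example_corpus_de"
```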
Location features
PATH
full path of the corpus home directory which contains all data files
INFO
arbitrary corpus information like source, size etc. There is no automatic processing of this data. If the value begins with the "@" character the rest is taken as a full path of a file containing INFO data
INFOHREF
link to arbitrary documentation on the web
VERTICAL
full path of the source vertical text. It is used only by the "encodevert" program. If the value starts with "|", the rest is treated as a shell command and the vertical text is taken from the standard output of that command
WSBASE
path to the compiled word sketches data files; defaults to PATH/WSATTR-ws (prefix). Use "none" to disable the WordSketch menu items
WSDEF
path to the word sketches grammar definition file
WSTHES
path to word sketches thesaurus data files, defaults to PATH/WSATTR-thes
TAGSETDOC
URL of the tagset documentation, so users can quickly refer to it from a button next to the CQL box in the search interface. If absent, the button does not appear in the interface
SUBCDEF
path for the subcorpus definition file. See Subcorpus config documentation. A subcorpus definition file allows you to share subcorpora with all users of the corpus
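A minimal sketch of how the location options might look together. Every path and URL here is a made-up placeholder; only the leading "|" and "@" conventions and the "none" value come from the descriptions above.

```
# hypothetical paths and URLs, for illustration only
PATH "/corpora/example/"
# leading "|": the vertical text is read from the command's standard output
VERTICAL "|zcat /corpora/src/example.vert.gz"
# leading "@": the INFO text is read from the given file
INFO "@/corpora/example/info.txt"
INFOHREF "https://example.org/corpus-docs"
TAGSETDOC "https://example.org/tagset"
# disable the WordSketch menu items
WSBASE "none"
```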
Structures and Attributes
ATTRIBUTE
This provides the definition of a positional attribute. At least one positional attribute should be defined. The first defined attribute is the default one (in most cases it is the word form and the name of this attribute is "word"). Apart from this, the order of ATTRIBUTE definitions matters in two places only: during the initial encoding, where the nth ATTRIBUTE in the corpus config file names the contents of the nth column in the vertical file, and in the list of attributes displayed in the concordance "View options" form. Some features of SkE require attributes called 'tag', 'lemma', 'lempos', 'lc'. Attribute names must start with an alphabetic character or underscore, and subsequent characters must be alphanumeric or underscore, i.e. ('a'..'z'|'A'..'Z'|'_')('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
STRUCTURE
This provides the definition of a structural tag. Structures can themselves have attributes (structural attributes as opposed to the positional attribute described above). Structure names must start with an alphabetic character or underscore and subsequent characters must be alphanumerical (including underscore). i.e. the same criteria as ATTRIBUTE names above.
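The following sketch shows how positional attributes might be declared in column order, together with a structure carrying two structural attributes. The attribute names "word", "lemma" and "tag" are taken from the text above; the structure "doc" and its attributes are hypothetical.

```
# a line of the vertical file is assumed to hold three columns, e.g.:
#   brushes  brush  NNS
# 1st column
ATTRIBUTE word
# 2nd column
ATTRIBUTE lemma
# 3rd column
ATTRIBUTE tag

# a structure with two structural attributes (hypothetical names)
STRUCTURE doc {
    ATTRIBUTE id
    ATTRIBUTE author
}
```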
ATTRIBUTE and STRUCTURE options can be repeated and enriched with an additional information block, for example with:
- MULTIVALUE
indicates whether the attribute can hold multiple values
- DEFAULTVALUE
default value for this attribute if not present in the source vertical
[since manatee 2.30] if not overridden by this configuration option, the default DEFAULTVALUE is set to "===NONE===".
- MULTISEP
defines the separator between the values of a multivalue attribute; if it is empty (""), the value is split into individual characters
- HIERARCHICAL
states that the attribute should be treated as hierarchical. Its value is the separator of the fields in hierarchy (can be any string). For structural attributes (header fields) only.
- ATTRDOC
optional link to the attribute values documentation. For structural attributes (header fields) only.
- ATTRDOCLABEL
name for the ATTRDOC link.
- NUMERIC
indicates that attribute values will be sorted according to their numeric value. For structural attributes (header fields) only.
For advanced topics on attributes and structures, see below.
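A possible use of the additional information block is sketched here. The attribute and structure names, separators and URLs are hypothetical, and the spelling "yes" for switched-on options is an assumption based on the TRANSQUERY example elsewhere on this page.

```
# hypothetical names and values, for illustration only
ATTRIBUTE tag {
    MULTIVALUE yes
    MULTISEP "|"
}
STRUCTURE doc {
    # a value such as "fiction/novel" would be treated as a two-level hierarchy
    ATTRIBUTE genre {
        DEFAULTVALUE "unknown"
        HIERARCHICAL "/"
    }
    ATTRIBUTE year {
        NUMERIC yes
    }
    ATTRIBUTE source {
        ATTRDOC "https://example.org/sources"
        ATTRDOCLABEL "Source list"
    }
}
```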
Controlling display in concordances
SHORTREF
the attribute of a structure to display as a default reference in the left hand column of a concordance. Defaults to the first attribute of the first structure or "#" (token number) if no attribute of a structure exists
STRUCTATTRLIST
comma-separated list of references that will be used to determine the References list in view options. Defaults to all attributes of the structures specified in the config file.
FULLREF
comma-separated list of references which will be displayed as a full reference at the bottom of the window when the user clicks on the SHORTREF for a concordance line. Defaults to the value of STRUCTATTRLIST.
HARDCUT
maximum number of query result lines in query evaluation, default=0 meaning no limit
MAXCONTEXT
maximum number of positions in context for displaying and saving concordance, default=100 (if you want unlimited context use MAXCONTEXT=0)
MAXDETAIL
maximum number of positions in the detail view (at the bottom of conc view), default=MAXCONTEXT
STRUCTCTX
display the whole structure in the detail view (at the bottom of conc view)
WRAPDETAIL
name of the structure that will cause line wrap in the detail context window (new in bonito 2.76), default none
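The display options above might be combined as in the following sketch. The structure and attribute names and the numeric limits are hypothetical, and references are assumed to be written as structure.attribute, as in the SUBCORPATTRS example on this page.

```
# hypothetical structure/attribute names and limits
SHORTREF "doc.id"
STRUCTATTRLIST "doc.id,doc.author,doc.year"
FULLREF "doc.id,doc.author"
HARDCUT "100000"
MAXCONTEXT "200"
MAXDETAIL "400"
WRAPDETAIL "p"
```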
In an additional information block of a STRUCTURE option there can be arbitrarily many ATTRIBUTE options (with possible additional option blocks), which can include the following:
LABEL
label used in references instead of <STRUCTURE>.<ATTRIBUTE>
DISPLAYTAG
if "1" (the default), an XML tag such as <s> or <p> is displayed in concordances; set it to "0" to hide the tag. Use the other DISPLAY... options to modify the concordance output
DISPLAYCLASS
a class for the included text; it can be used to change the style of text in a structure, provided the given class is also added to the cascading style sheet view.css on the server, for example
DISPLAYCLASS "bold"
could be used to display headings in bold
DISPLAYBEGIN
text displayed in place of the start tag; for example, one can display a quotation mark instead of <q> and </q>
the special value "_EMPTY_" means display nothing and eat spaces; it is used for <g/>:
STRUCTURE g { DISPLAYTAG 0 DISPLAYBEGIN "_EMPTY_" }
[since manatee 2.28] structure attributes can be displayed using the %(attribute_name) syntax, e.g. if you'd like the structure to be marked by the text "STR-" concatenated with the id attribute of structure str, use the following syntax:
STRUCTURE str {
    ATTRIBUTE id
    DISPLAYTAG 0
    DISPLAYBEGIN "STR-%(id)"
}
DISPLAYEND
same as DISPLAYBEGIN, but for the end tag
MAXLISTSIZE
in text types, if an attribute has more than 22 possible options, an input text field with autolookup is offered to the user rather than a list of checkboxes. MAXLISTSIZE changes this default threshold. Example:
STRUCTURE document { ATTRIBUTE id ATTRIBUTE domain { MAXLISTSIZE "30" } }
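As a further sketch of the options in this list, the quotation-mark case mentioned under DISPLAYBEGIN and a LABEL for a structural attribute could be written as below. The structure names, the attribute name and the label text are hypothetical, and the guillemets assume that the config file is saved in an encoding that contains them.

```
# hypothetical structures, attribute name and label text
STRUCTURE q {
    DISPLAYTAG 0
    DISPLAYBEGIN "«"
    DISPLAYEND "»"
}
STRUCTURE doc {
    ATTRIBUTE author {
        LABEL "Author"
    }
}
```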
Controlling the subcorpus creation interface
SUBCORPATTRS
comma-separated list of document attributes used for creating subcorpora. Use "|" instead of comma to display attributes on the same row in the subcorpus creation form. Example:
SUBCORPATTRS "bncdoc.alltyp|bncdoc.alltim,bncdoc.wridom|bncdoc.wrimed"
With this setting, the subcorpus form contains 2 rows:
1: alltyp and alltim
2: wridom and wrimed
If SUBCORPATTRS is not defined, all attributes will be shown in the 'Text Type' part of the concordance form (usually not the desired outcome)
Word classes and lemmas
WPOSLIST
list of pairs providing a mapping between a user-friendly name for a word class and a regular expression matching the POS tags which are instances of it. The first character of the string is the separator used to separate values in the rest of the string. If specified, users can select items like 'noun', 'verb' from a menu when specifying right or left context for a concordance search. Example for the TreeTagger English tagset (a modified version of the Penn tagset):
WPOSLIST ",adjective,JJ.?,adverb,RB.?,conjunction,CC,determiner,DT,noun,N.*,noun singular,NN,noun plural,NNS,preposition,IN,pronoun,PP,verb,V..?|MD"
LPOSLIST
list of pairs providing a mapping between a word class suffix, and a user-friendly name for the word class. Only makes sense when there is a mechanism in place for relating lemmas to lemmas-with-a-word-class-suffix, so that, for example, brush (noun) and brush (verb) can get different word sketches. The first character of the string is a separator used to separate values in the rest of the string.
Example from BNC:
LPOSLIST ",adjective,-j,adverb,-a,conjunction,-c,noun,-n,preposition,-p,pronoun,-d,verb,-v"
WSPOSLIST
LPOSLIST of word sketch POSes. It has the same format as LPOSLIST and defaults to it; the difference is that LPOSLIST is used after the Lemma box in the Concordance form, whereas WSPOSLIST is used in the Word Sketch and Thesaurus forms
WSATTR
attribute name for which word sketches are computed, defaults to "lempos" if the corpus has that attribute, or "lemma" if the corpus has that attribute, or DEFAULTATTR otherwise
WSSTRIP
number of characters to strip from the end of a word in word sketch listings; defaults to 2 if WSATTR is "lempos", or 0 otherwise
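A minimal sketch of the word sketch options, assuming a corpus with a "lempos" attribute. The selection of word classes is hypothetical; the suffixes follow the BNC LPOSLIST example above.

```
# hypothetical selection of word classes for the Word Sketch and Thesaurus forms
WSPOSLIST ",noun,-n,verb,-v,adjective,-j"
WSATTR "lempos"
# strip the two-character "-x" suffix in word sketch listings
WSSTRIP "2"
```

With these settings, a lempos such as "brush-n" would be listed as "brush".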
Dynamic attributes
DYNAMIC
if this option exists, the attribute is a dynamic one and the value of this option is the name of the C function which defines the attribute. One case is the dynamic attribute 'lemma', where the field given in the vertical file is 'lempos', built from lemma + '-' + a letter indicating the word class, so that the lemma "intend" corresponds to the lempos "intend-v". Then lemma is a dynamic attribute whose associated function strips the last two characters from the lempos. The mechanism is used in the BNC to support querying word sketches, which are word-class specific and are therefore defined for a lempos.
Here is the definition of the 'lemma' dynamic attribute. The embedded features used are documented below.
ATTRIBUTE lemma {
    DYNAMIC striplastn
    DYNLIB internal
    ARG1 "2"
    FUNTYPE i
    FROMATTR lempos
    TYPE index
}
DYNLIB
dynamic library (shared library) containing the given function; see the Dynamic Attributes documentation
FUNTYPE
type of the given function, i.e. which extra arguments it takes
- 0 -- no extra argument
- c -- one char extra argument
- s -- one (const char*) extra argument
- i -- one int extra argument
- cc -- two char extra arguments
- ii -- two int extra arguments
- ss -- two (const char*) extra arguments
- ci -- two extra arguments, first char, second int
- cs, sc, si, ic, is -- likewise
ARG1
the first optional fixed parameter
ARG2
the second optional fixed parameter
FROMATTR
the name of the attribute from which the dynamic attribute is created
DYNTYPE
type of the dynamic attribute, possible values are plain, lexicon, index (default)
- plain -- only displaying is enabled
- lexicon -- displaying and counting (frequency distribution) are enabled
- index -- all features including querying are enabled
TRANSQUERY
use the transformation function for queries as well (multivalues are not supported). Example:
ATTRIBUTE lc { DYNAMIC lowercase DYNLIB internal ARG1 "C" FUNTYPE s FROMATTR word TYPE index TRANSQUERY yes }
This means that, for query [lc="Test"] we apply the function "lowercase" to the argument "Test" to search for "test"; without TRANSQUERY, we would search for "Test" and find nothing
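As a further sketch, a dynamic attribute that only needs to be displayed and counted, not queried, might set DYNTYPE to "lexicon". The attribute name "lemma_lc" is hypothetical; the function name and arguments are copied from the "lc" example above, and the option is written as DYNTYPE, following the option list, although the examples on this page write TYPE.

```
# hypothetical attribute: lowercased lemma, for display and frequency counts only
ATTRIBUTE lemma_lc {
    DYNAMIC lowercase
    DYNLIB internal
    ARG1 "C"
    FUNTYPE s
    FROMATTR lemma
    DYNTYPE lexicon
}
```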
Wordcount
The structure attribute "wordcount" represents the number of words in a structure. The value is calculated during compilation, so the attribute should not be present in the vertical file.
STRUCTURE doc { ATTRIBUTE wordcount }
Data representation options
FD_FGD
Type of an attribute which should be used for any corpus whose main binary file (.text) is bigger than 500MB (approx. 250M tokens, depending on lexicon size)
ATTRIBUTE word { TYPE "FD_FGD" }
file64
A structure type. Enables the range file (.rng) to address 2^64 corpus positions. The information is read from file. Use when there are more than 2^32 tokens in the corpus.
STRUCTURE g { TYPE "file64" }
map64
A structure type. Same meaning as file64 with one exception: the information is mapped into memory when working with the corpus. Do not use it when the range file is very large, to avoid allocating too much system memory.
STRUCTURE doc { TYPE "map64" }
Notes
- There must be a new line character at the end of the configuration file.
