Corpus Configuration File: All Features
Pavel - first the problem cases:
REBUILDUSER:: not used anymore
PROCESS:: (used only in corpus builder)
definition of a process for preprocessing text before encoding a corpus, or for postprocessing corpus data ??Unclear: please add a simple example, or take this option out??
TYPE::
the type of an attribute, set inside an ATTRIBUTE block. The value "FD_FGD" is intended for very large corpora: it should be used for any corpus whose main binary file (.text) is bigger than 500MB (approx. 250M tokens, depending on lexicon size):
ATTRIBUTE word {
TYPE "FD_FGD"
}
About the corpus
NAME::
name of the corpus; defaults to the corpus config filename ??AK: is the default correct??
ENCODING::
corpus encoding (should be one that Tcl supports); the default encoding is "iso8859-1"
LOCALE::
locale code of the language (and region) of the corpus; this value is used in query evaluation (for regular expressions) and in sorting concordance lines. The default is the standard POSIX locale ("C")
ALIGNED::
for parallel corpora only: the name of an aligned corpus. Both corpora should have a structural tag named "align", with a one-to-one correspondence between the respective token sequences of the "align" structures
Location features
PATH::
full path of the corpus home directory which contains all data files
INFO::
arbitrary corpus information such as source, size etc. This data is not processed automatically. If the value begins with the "@" character, the rest is taken as the full path of a file containing the INFO data
VERTICAL::
full path of the source vertical text; it is used only by the "encodevert" program. If the value starts with "|", the rest is treated as a shell command, and the vertical text is taken from the standard output of that command
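The two forms of the VERTICAL value can be illustrated with a short Python sketch. This is not how encodevert is implemented: the function name read_vertical is hypothetical, the UTF-8 file encoding is an assumption made for the example, and the "|" form assumes a POSIX shell is available.

```python
import subprocess


def read_vertical(value: str) -> str:
    """Return the vertical text named by a VERTICAL value (illustration only)."""
    if value.startswith("|"):
        # The rest of the value is a shell command; the vertical text
        # is whatever the command writes to standard output.
        result = subprocess.run(
            value[1:], shell=True, capture_output=True, text=True, check=True
        )
        return result.stdout
    # Otherwise the value is the full path of the vertical file.
    # Assumption: the file is UTF-8 encoded.
    with open(value, encoding="utf-8") as handle:
        return handle.read()
```

A value such as "|zcat corpus.vert.gz" (a made-up file name) would then be read from the output of the command rather than from a file.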
WSBASE::
path to the word sketch data files; defaults to PATH/WSATTR-ws. Use "none" to disable the Word Sketch menu items
WSTHES::
path to word sketches thesaurus data files, defaults to PATH/WSATTR-thes
TAGSETDOC::
URL of the tagset documentation, so users can quickly refer to it from a button next to the CQL box in the search interface. If absent, the button does not appear in the interface
Attributes
ATTRIBUTE::
definition of a positional attribute. At least one positional attribute should be defined. The first defined attribute is the default one (in most cases it is the word form, and the name of this attribute is "word"). The order of the ATTRIBUTE options is used in two places only: during the initial encoding, where the nth ATTRIBUTE in the corpus config file names the contents of the nth column of the vertical file, and in the list of attributes shown in the concordance "View options" form. Otherwise the order is not important. Some features of SkE require attributes called 'tag', 'lemma', 'lempos' and 'lc'; for a part-of-speech-tagged, lemmatised corpus, using these names lets such features work without additional work.
DEFAULTATTR::
default attribute for CQL query evaluation. It is also used to map the attribute alias "-" in the API; this has no visible effect in the interface
STRUCTURE::
definition of a structural tag
ATTRIBUTE and STRUCTURE options can be repeated and enriched with an additional information block, for example with:
MULTIVALUE::
indicates whether the attribute can have multiple values
MULTISEP::
defines the multivalue separator; if it is empty (""), the value is split into characters
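The effect of MULTISEP on a single attribute value can be sketched in a few lines of Python. The function name split_multivalue and the sample values are made up for illustration; the sketch only mirrors the rule stated above and says nothing about how the values are stored internally.

```python
def split_multivalue(value: str, multisep: str) -> list[str]:
    """Split one attribute value into its component values."""
    if multisep == "":
        # An empty separator means every character is a separate value.
        return list(value)
    return value.split(multisep)
```

For instance, with MULTISEP "|" the value "noun|verb" yields the two values "noun" and "verb", while with MULTISEP "" the value "abc" yields "a", "b" and "c".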
For advanced topics on attributes and structures, see below.
Controlling display in concordances
SHORTREF::
the attribute of a structure to display as a default reference in the left hand column of a concordance. Defaults to the first attribute of the first structure or "#" (token number) if no attribute of a structure exists
FULLREF::
comma-separated list of references which will be displayed as a full reference at the bottom of the window when the user clicks on the SHORTREF for a concordance line. Defaults to the first attribute of the first structure in the corpus config file
HARDCUT::
maximum number of result lines returned by query evaluation. The default is 0, meaning no limit
MAXCONTEXT::
maximum number of positions in the context when displaying or saving a concordance (KWICLines API). The default is 0, meaning no limit
MAXDETAIL::
maximum number of positions in the detailed context (shown at the bottom of the concordance view). Defaults to MAXCONTEXT
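How the defaults of the three limits interact can be shown with a small Python sketch. The function names effective_limits and apply_limit are hypothetical, and the sketch assumes only what is stated above: 0 means no limit, and MAXDETAIL falls back to MAXCONTEXT when it is not set.

```python
def effective_limits(hardcut=0, maxcontext=0, maxdetail=None):
    """Fill in the documented defaults for the three limits."""
    if maxdetail is None:
        # MAXDETAIL defaults to whatever MAXCONTEXT is.
        maxdetail = maxcontext
    return {"HARDCUT": hardcut, "MAXCONTEXT": maxcontext, "MAXDETAIL": maxdetail}


def apply_limit(items, limit):
    """Truncate a sequence to `limit` items; 0 means no limit."""
    items = list(items)
    return items if limit == 0 else items[:limit]
```

With MAXCONTEXT 100 and no MAXDETAIL, the detailed context is therefore also limited to 100 positions.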
The additional information block of a STRUCTURE option can contain arbitrarily many ATTRIBUTE options (each with its own optional block). The following options can be used in these blocks: LABEL applies to an attribute of a structure, while the DISPLAY... options apply to the structure itself.
LABEL::
label used in references instead of <STRUCTURE>.<ATTRIBUTE>
DISPLAYTAG::
if "1" (the default), the XML tag of the structure (<s>, <p>, ...) is displayed in concordances; set it to "0" to hide the tag. Use the other DISPLAY... options to modify the concordance output
DISPLAYCLASS::
a class applied to the text inside the structure. It can be used to change the style of that text, but the given class must also be added to the cascading style sheet view.css on the server. For example,
DISPLAYCLASS "bold"
could be used to display headings in bold
DISPLAYBEGIN::
text displayed in place of the begin tag; for example, one can display a quotation mark instead of <q> and </q>. The special value "_EMPTY_" means display nothing and eat spaces; it is used for <g/>:
STRUCTURE g { DISPLAYTAG 0 DISPLAYBEGIN "_EMPTY_" }
DISPLAYEND::
same as DISPLAYBEGIN, but for the end tag
Controlling the subcorpus creation interface
SUBCORPATTRS::
comma-separated list of document attributes used for creating subcorpora. Use "|" instead of a comma to display attributes on the same row in the subcorpus creation form. ??Open question: what is the default behaviour if this option is absent?? Example:
SUBCORPATTRS "bncdoc.alltyp|bncdoc.alltim,bncdoc.wridom|bncdoc.wrimed"
With this value the subcorpus form contains two rows: the first with alltyp and alltim, the second with wridom and wrimed
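The row layout encoded in a SUBCORPATTRS value can be recovered with a short Python sketch. The function name subcorp_form_rows is made up for the example; it only illustrates the comma and "|" convention described above, not the actual form-building code.

```python
def subcorp_form_rows(subcorpattrs: str) -> list[list[str]]:
    """Return one list of attribute names per row of the subcorpus form."""
    # Commas separate rows; "|" separates attributes shown on the same row.
    return [row.split("|") for row in subcorpattrs.split(",")]


rows = subcorp_form_rows(
    "bncdoc.alltyp|bncdoc.alltim,bncdoc.wridom|bncdoc.wrimed"
)
for number, row in enumerate(rows, start=1):
    print(number, row)
```

For the BNC example this gives two rows of two attributes each.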
Word classes and lemmas
WPOSLIST::
list of pairs providing a mapping between a user-friendly name for a word class and a regular expression matching the POS tags which are instances of it. The first character of the string is the separator used to separate values in the rest of the string. If specified, users can select items like 'noun', 'verb' from a menu when specifying right or left context for a concordance search. Example for the TreeTagger English tagset (a modified version of the Penn tagset):
WPOSLIST ",adjective,JJ.?,adverb,RB.?,conjunction,CC,determiner,DT,noun,N.*,noun singular,NN,noun plural,NNS,preposition,IN,pronoun,PP,verb,V..?|MD"
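The "first character is the separator" convention can be made concrete with a Python sketch that splits such a string into (name, pattern) pairs and finds the word classes a given tag belongs to. The names parse_poslist and classes_for_tag are hypothetical, and the sketch assumes that a pattern must match the whole tag; the matching rules actually used by the query engine may differ.

```python
import re


def parse_poslist(value: str) -> list[tuple[str, str]]:
    """Split a WPOSLIST-style string into (name, pattern) pairs."""
    if not value:
        return []
    # The first character is the separator for the rest of the string.
    separator, body = value[0], value[1:]
    fields = body.split(separator)
    # Fields alternate: name, pattern, name, pattern, ...
    return list(zip(fields[0::2], fields[1::2]))


def classes_for_tag(poslist: str, tag: str) -> list[str]:
    """Return the word class names whose pattern matches the whole tag."""
    return [
        name
        for name, pattern in parse_poslist(poslist)
        if re.fullmatch(pattern, tag)  # assumption: whole-tag match
    ]
```

Because the separator is chosen per string, a list whose patterns contain commas can simply start with a different character, for example ";".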
LPOSLIST::
list of pairs providing a mapping between a word class suffix, and a user-friendly name for the word class. Only makes sense when there is a mechanism in place for relating lemmas to lemmas-with-a-word-class-suffix, so that, for example, brush (noun) and brush (verb) can get different word sketches. The first character of the string is a separator used to separate values in the rest of the string.
Example from BNC:
LPOSLIST ",adjective,-j,adverb,-a,conjunction,-c,noun,-n,preposition,-p,pronoun,-d,verb,-v"
WSPOSLIST::
list of word sketch parts of speech, in the same format as LPOSLIST, to which it defaults. LPOSLIST is used after the Lemma box in the Concordance form, whereas WSPOSLIST is used in the Word Sketch and Thesaurus forms
WSATTR::
attribute name for which word sketches are computed, defaults to "lempos" if the corpus has that attribute, or "lemma" if the corpus has that attribute, or DEFAULTATTR otherwise
WSSTRIP::
number of characters to strip from the end of a word in word sketch listings; defaults to 2 if WSATTR is "lempos", or 0 otherwise. This slightly duplicates the dynamic attribute mechanism described below, but a dynamic attribute is not easy to use here
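The fallback chain for WSATTR and the dependent default for WSSTRIP can be summarised in a Python sketch. The function names default_wsattr and default_wsstrip are made up for the example, and the list of attribute names stands in for whatever the corpus actually defines.

```python
def default_wsattr(attributes: list[str], defaultattr: str) -> str:
    """Pick the WSATTR default: lempos, else lemma, else DEFAULTATTR."""
    for candidate in ("lempos", "lemma"):
        if candidate in attributes:
            return candidate
    return defaultattr


def default_wsstrip(wsattr: str) -> int:
    """Pick the WSSTRIP default: 2 for lempos, 0 for anything else."""
    return 2 if wsattr == "lempos" else 0
```

A corpus with the attributes word, lempos and lemma would thus compute word sketches on lempos and strip two characters in listings, unless the config file says otherwise.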
Dynamic attributes
DYNAMIC::
if this option exists, the attribute is a dynamic one, and the value of this option is the name of the C function which defines the attribute. One case is the dynamic attribute 'lemma', where the field given in the vertical file is 'lempos', built from the lemma + '-' + a letter indicating the word class, so that the lemma "intend" maps to the lempos "intend-v". Then lemma is a dynamic attribute, with the associated function stripping off the last two characters. The mechanism is used in the BNC to support querying word sketches, which are word class specific and so are defined for a lempos.
Here is the definition of the 'lemma' dynamic attribute. The embedded features used are documented below.
ATTRIBUTE lemma {
DYNAMIC striplastn
DYNLIB internal
ARG1 "2"
FUNTYPE i
FROMATTR lempos
TYPE index
}
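What this definition computes can be illustrated with a Python stand-in. The function below borrows the name striplastn from the config, but it is only a sketch of the behaviour described above (drop the last n characters), not the C function from the internal library.

```python
def striplastn(value: str, n: int) -> str:
    """Return `value` without its last `n` characters."""
    # Guard against n == 0, because value[:-0] would be the empty string.
    return value[:-n] if n > 0 else value


# FROMATTR lempos, ARG1 "2": each lempos value is mapped to a lemma value.
for lempos in ("intend-v", "brush-n", "brush-v"):
    print(lempos, "->", striplastn(lempos, 2))
```

Note that "brush-n" and "brush-v" both map to the lemma "brush", which is why word sketches are defined on lempos rather than on lemma.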
DYNLIB::
dynamic library containing given function
FUNTYPE::
type of given function
- 0 -- no extra argument
- c -- one char extra argument
- s -- one (const char*) extra argument
- i -- one int extra argument
- cc -- two char extra arguments
- ii -- two int extra arguments
- ss -- two (const char*) extra arguments
- ci -- two extra arguments, first char, second int
- cs, sc, si, ic, is -- likewise
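The FUNTYPE codes can be read as a signature for the extra arguments, one letter per argument. The Python sketch below converts the string values of ARG1 and ARG2 accordingly. The name convert_args and the conversion rules (first character for "c", int() for "i") are assumptions made for illustration; the config reader may convert the values differently.

```python
# One converter per FUNTYPE letter (assumed conversion rules).
CONVERTERS = {
    "c": lambda text: text[0],  # one char
    "s": str,                   # string (const char*)
    "i": int,                   # int
}


def convert_args(funtype: str, arg1=None, arg2=None) -> tuple:
    """Convert ARG1/ARG2 strings to the types named by FUNTYPE."""
    if funtype == "0":
        return ()  # no extra argument
    if not 1 <= len(funtype) <= 2 or any(c not in CONVERTERS for c in funtype):
        raise ValueError(f"unknown FUNTYPE: {funtype!r}")
    raw = (arg1, arg2)
    return tuple(CONVERTERS[code](raw[i]) for i, code in enumerate(funtype))
```

For the 'lemma' definition above, FUNTYPE i with ARG1 "2" yields the single integer argument 2.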
ARG1::
the first optional fixed parameter
ARG2::
the second optional fixed parameter
FROMATTR::
the name of the attribute from which the dynamic attribute is created
DYNTYPE::
type of the dynamic attribute, possible values are plain, lexicon, index (default)
- plain -- only displaying is enabled
- lexicon -- displaying and counting (frequency distribution) are enabled
- index -- all features including querying are enabled
TRANSQUERY::
apply the transformation function to query values as well (multivalues are not supported). Example:
ATTRIBUTE lc {
DYNAMIC lowercase
DYNLIB internal
ARG1 "C"
FUNTYPE s
FROMATTR word
TYPE index
TRANSQUERY yes
}
With this definition, the query [lc="Test"] applies the lowercase function to the argument "Test" and searches for "test"; without TRANSQUERY, it would search for "Test" and find nothing.
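The effect of TRANSQUERY can be reproduced with a toy index in Python. Everything here is a simplified stand-in: build_index and search are hypothetical names, Python's str.lower plays the role of the lowercase function, and a dictionary replaces the real index.

```python
def build_index(words, transform):
    """Index token positions by the transformed (dynamic) value."""
    index = {}
    for position, word in enumerate(words):
        index.setdefault(transform(word), []).append(position)
    return index


def search(index, value, transform, transquery):
    """Look up a query value, transforming it first if TRANSQUERY is on."""
    key = transform(value) if transquery else value
    return index.get(key, [])


words = ["Test", "the", "test"]
lc_index = build_index(words, str.lower)

print(search(lc_index, "Test", str.lower, transquery=True))
print(search(lc_index, "Test", str.lower, transquery=False))
```

With transquery=True the query value "Test" becomes "test" and finds both tokens; with transquery=False it is looked up unchanged and matches nothing, because the index only contains lowercased values.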
