Preparing a Corpus for the Sketch Engine
Input format
The input format is so-called "vertical" or "word-per-line (WPL)" text. It is a plain text file in the selected character encoding, without any formatting (word-processing markup). Words are written in a column, i.e. each line contains one word, number or punctuation mark. Optional annotation is on the same line as the respective word, separated by the tab character. For example, the following sentence:
Suddenly, however, their posture changed.
is in vertical text:
Suddenly
,
however
,
their
posture
changed
.
and with tag and lemma annotation (modified "Lancaster" tagset from the SUSANNE corpus):
Suddenly	RR	suddenly
,	YC	-
however	RR	however
,	YC	-
their	APPGh2	their
posture	NN1n	posture
changed	VVDv	change
.	YF	-
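The word-per-line layout above can be illustrated with a minimal Python sketch. The tokenizer here is a deliberately naive assumption (a single regular expression that separates words from punctuation); it is not the tokenizer used by Sketch Engine, and the function names are hypothetical.

```python
import re

def tokenize(text):
    """Naive tokenizer (an assumption for illustration only):
    runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def to_vertical(tokens):
    """Write one token per line, as required by the vertical format."""
    return "".join(token + "\n" for token in tokens)

sentence = "Suddenly, however, their posture changed."
print(to_vertical(tokenize(sentence)), end="")
```

Annotated columns (tag, lemma) would be appended to each line after a tab character; producing them requires a tagger, which is outside the scope of this sketch.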
XML tags are used for structural annotation (such as sentence or paragraph boundaries, headlines etc.). There is also the "glue" tag <g/>, which signifies that there should be no space between two tokens. For example:
<doc id="G10" n=32>
<head type=min>
FEDERAL
CONSTITUTION
<g/>
,
1789
</head>
<p n=1>
"
<g/>
we
the
People
There can be any number of attributes associated with words and with XML markup.
The input format is the one defined at the University of Stuttgart in the 1990s and widely used in the corpus linguistics community.
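To make the role of the glue tag concrete, here is a small Python sketch that turns vertical lines back into running text. It assumes that any line starting with "<" and ending with ">" is structural markup, and that the word form is the first tab-separated column; the function name is hypothetical.

```python
def join_tokens(vertical_lines):
    """Rebuild running text from vertical lines, honouring <g/>."""
    out = []
    glue = True  # no space before the very first token
    for line in vertical_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line == "<g/>":
            glue = True  # suppress the space before the next token
            continue
        if line.startswith("<") and line.endswith(">"):
            continue  # other structural tags carry no text (assumption)
        word = line.split("\t", 1)[0]  # first column is the word form
        if not glue:
            out.append(" ")
        out.append(word)
        glue = False
    return "".join(out)

lines = ['<doc id="G10" n=32>', "<head type=min>", "FEDERAL", "CONSTITUTION",
         "<g/>", ",", "1789", "</head>", "<p n=1>", '"', "<g/>", "we", "the", "People"]
print(join_tokens(lines))  # FEDERAL CONSTITUTION, 1789 "we the People
```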
Corpus Configuration File
A corpus configuration is stored in one plain text file. The name of this file is the ID of the corpus throughout the system. The configuration consists of attribute-value pairs. Each nonempty line begins with an option name and ends with an option value enclosed in quotation marks. Values consisting of only lowercase letters can be written without quotation marks.
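The name/value layout described above can be read with a few lines of Python. This is only a sketch under stated assumptions: it uses the standard `shlex` module to handle the quotation marks, it covers only flat one-line options, and the function name and sample values are hypothetical. It is not the parser used by the system.

```python
import shlex

def parse_options(text):
    """Return (name, value) pairs for flat one-line options."""
    pairs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # empty lines are ignored
        parts = shlex.split(line)  # removes the surrounding quotes
        name = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        pairs.append((name, value))
    return pairs

config = 'NAME "Test corpus"\nENCODING "iso8859-1"\n\nDEFAULTATTR lc\n'
for name, value in parse_options(config):
    print(name, "=", value)
```

A list of pairs is used instead of a dictionary so that an option appearing more than once is not silently overwritten.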
Global options are described below.
- PATH
- full path of the corpus home directory which contains all data files
- INFO
- arbitrary corpus information such as source, size etc. There is no automatic processing of this data. If the value begins with the "@" character, the rest is taken as the full path of a file containing the INFO data
- NAME
- name of the corpus
- ENCODING
- corpus encoding (should be one which Tcl supports), default encoding is "iso8859-1"
- VERTICAL
- full path of the source vertical text; it is used only by the "encodevert" program. If the value starts with "|", the rest is treated as a shell command and the vertical text is taken from the standard output of that command
- DEFAULTATTR
- default attribute for query evaluation
- ATTRIBUTE
- definition of a positional attribute. At least one positional attribute should be defined; the first defined attribute is the default one (in most cases it is the word form and the name of this attribute is "word")
- STRUCTURE
- definition of a structural tag
- PROCESS
- definition of a process for preprocessing text before encoding a corpus or postprocessing corpus data
- ALIGNED
- name of an aligned corpus; both corpora should have a structural tag named "align" with one-to-one correspondence of the respective token sequences
- SHORTREF
- attribute of a structure to display as the default reference in a concordance; defaults to the first attribute of the first structure, or "#" (token number) if no structure attribute exists
- FULLREF
- comma-separated list of references which will be displayed as a full reference in Sketch Engine
- HARDCUT
- maximum number of query result lines
- MAXCONTEXT
- maximum number of positions in context
- MAXDETAIL
- maximum number of positions in detail
- REBUILDUSER
- comma-separated list of users with permission to rebuild the corpus; the special value "*" means any user can rebuild the corpus
- WPOSLIST
- list of pairs providing a mapping between a user-friendly name for a word class and a regular expression matching the POS tags which are instances of it. If specified, users can select items like "noun", "verb" from a menu when specifying right or left context for a concordance search. The first character of the string is a separator used to separate values in the rest of the string. Example for TreeTagger English tagset (modified version of Penn tagset):
WPOSLIST ",adjective,AJ.,adverb,AV.,conjunction,CJ.,determiner,AT0,noun,NN., noun singular,NN1,noun plural,NN2,preposition,PRP,pronoun,DPS,verb,VV."
- LPOSLIST
- list of pairs providing a mapping between a user-friendly name for a word class and the word class suffix. Only makes sense when there is a lempos attribute (lemmas with a word class suffix) available in a corpus, so that, for example, "brush as noun" (brush-n) and "brush as verb" (brush-v) can get different word sketches. The first character of the string is a separator used to separate values in the rest of the string. Example from BNC:
LPOSLIST ",adjective,-j,adverb,-a,conjunction,-c,noun,-n,preposition,-p,pronoun,-d,verb,-v"
- TAGSETDOC
- URL of the corpus tagset summary page; the link is displayed on the concordance entry form
- SUBCORPATTRS
- comma-separated list of document attributes used for creating subcorpora; use "|" instead of a comma to display attributes on the same row in the subcorpus creation form
- WSATTR
- attribute name for which word sketches are computed, defaults to "lempos" if the corpus has that attribute, or "lemma" if the corpus has that attribute, or DEFAULTATTR otherwise
- WSPOSLIST
- LPOSLIST of word sketch POSes
- WSSTRIP
- number of characters to strip from the end of a word in word sketch listings; defaults to 2 if WSATTR is "lempos", or 0 otherwise
- WSBASE
- path to word sketches data files; defaults to PATH/WSATTR-ws. Use "none" to disable WordSketch menu items
- WSTHES
- path to word sketches thesaurus data files, defaults to PATH/WSATTR-thes
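The WPOSLIST and LPOSLIST values use their first character as the separator for everything that follows. A hedged Python sketch of how such a string could be split into pairs, together with the effect of WSSTRIP on a lempos value, is shown here. The function names are hypothetical and this is not the system's own code.

```python
def parse_pos_list(value):
    """Split a WPOSLIST/LPOSLIST string into (name, pattern-or-suffix) pairs."""
    if not value:
        return []
    sep, rest = value[0], value[1:]  # first character is the separator
    items = rest.split(sep)
    if len(items) % 2 != 0:
        raise ValueError("expected an even number of items")
    return list(zip(items[0::2], items[1::2]))

def strip_suffix(lempos, wsstrip=2):
    """Drop the last `wsstrip` characters, as WSSTRIP is described to do."""
    return lempos[:-wsstrip] if wsstrip > 0 else lempos

lposlist = ",adjective,-j,adverb,-a,noun,-n,verb,-v"
print(parse_pos_list(lposlist))
print(strip_suffix("brush-n"))  # brush
```

Because the separator is declared by the string itself, a list whose patterns contain commas can pick a different first character without any escaping.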
ATTRIBUTE and STRUCTURE options can be repeated and extended with an additional information block. Additional attribute options are described below.
- LOCALE
- locale code of the language (and region) used; this value is used in query evaluation (of regular expressions) and in sorting of concordance lines. The default locale is the standard POSIX locale ("C")
- MULTIVALUE
- indicates whether the attribute can hold multiple values
- MULTISEP
- defines the multivalue separator; if it is empty (""), the value is split into characters
- DYNAMIC
- if this option is present, the attribute is a dynamic one and the value of this option is the name of the C function which defines the dynamic attribute
- DYNLIB
- dynamic library containing the given function
- FUNTYPE
- type of the given function
- 0 -- no extra argument
- c -- one char extra argument
- s -- one (const char*) extra argument
- i -- one int extra argument
- cc -- two char extra arguments
- ii -- two int extra arguments
- ss -- two (const char*) extra arguments
- ci -- two extra arguments, first char, second int
- cs, sc, si, ic, is -- likewise
- ARG1
- the first optional fixed parameter
- ARG2
- the second optional fixed parameter
- FROMATTR
- the name of the attribute from which the given attribute is created
- DYNTYPE
- type of the dynamic attribute; possible values are plain, lexicon and index (default)
- plain -- only displaying is enabled
- lexicon -- displaying and counting (frequency distribution) are enabled
- index -- all features including querying are enabled
- TRANSQUERY
- use transformation function for queries (multivalues not supported)
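The FUNTYPE codes listed above follow a simple pattern: each character names the type of one extra argument. The following Python sketch spells that reading out; it is an interpretation of the list, with hypothetical names, rather than code taken from the system.

```python
# One entry per FUNTYPE character, as listed above.
ARG_TYPES = {"c": "char", "s": "const char*", "i": "int"}

def extra_arg_types(funtype):
    """Return the C types of the extra arguments implied by a FUNTYPE code."""
    if funtype == "0":
        return []  # no extra argument
    if not 1 <= len(funtype) <= 2:
        raise ValueError("FUNTYPE takes at most two extra arguments")
    if any(ch not in ARG_TYPES for ch in funtype):
        raise ValueError("unknown FUNTYPE code: " + funtype)
    return [ARG_TYPES[ch] for ch in funtype]

for code in ("0", "i", "ci", "ss"):
    print(code, extra_arg_types(code))
```

Under this reading, ARG1 supplies the value for the first extra argument and ARG2 the value for the second.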
In the additional information block of a STRUCTURE option there can be arbitrarily many ATTRIBUTE options (each with its own optional information block).
- LABEL
- label used in references instead of <STRUCTURE>.<ATTRIBUTE>
- DISPLAYTAG
- if "1" (the default), an XML tag like <s>, <p>, ... is displayed; set it to "0" to hide the tag, and use the other DISPLAY... options to modify concordance output
- DISPLAYCLASS
- a class of included text
- DISPLAYBEGIN
- _EMPTY_
- DISPLAYEND
Examples
If your vertical text contains only words and no annotation, a configuration can be very simple:
Example 1
PATH /corpora/test1
ATTRIBUTE word
If you omit VERTICAL, you have to specify a source file for the encodevert command:
% encodevert -c test1 /corpora/src/test1.vertical
Adding VERTICAL to the configuration simplifies the encodevert command:
% encodevert -c test2
Select an appropriate ENCODING so that characters are displayed properly in Sketch Engine. For each attribute you can specify a LOCALE for proper sorting and handling of regular expression character classes. The default "C" locale corresponds to English. The following example uses ISO Latin 2 encoding and a Czech locale.
Example 2
PATH /corpora/test1
VERTICAL "/corpora/src/test1.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word {
LOCALE "cs_CZ.ISO8859-2"
}
If your vertical text contains a POS tag for each token (word), also specify a second attribute.
Example 3
PATH /corpora/test1
VERTICAL "/corpora/src/test1.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word {
LOCALE "cs_CZ.ISO8859-2"
}
ATTRIBUTE pos
If your vertical text contains sentence boundaries annotated with <s> and </s> and document boundaries annotated with <doc> and </doc>, add structure definitions.
Example 4
PATH /corpora/test2
VERTICAL "/corpora/src/test2.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word
STRUCTURE doc
STRUCTURE s
If your <doc> annotation contains document meta-information about the author and the date of publication in the form <doc author="Lewis Carroll" date="1876">, add structure attribute definitions.
Example 5
PATH /corpora/test3
VERTICAL "/corpora/src/test3.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word
STRUCTURE doc {
ATTRIBUTE author
ATTRIBUTE date
}
STRUCTURE s
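To show which parts of a structural tag the STRUCTURE and nested ATTRIBUTE definitions refer to, here is a small Python sketch that extracts the structure name and its attributes from one line of vertical text. It assumes attribute values are either double-quoted or unquoted without spaces (both forms occur in the examples on this page); the function name is hypothetical.

```python
import re

def parse_structure_tag(line):
    """Return (structure name, {attribute: value}) for an opening tag."""
    match = re.match(r"<(\w+)", line)
    if match is None:
        raise ValueError("not an opening structural tag: " + line)
    attrs = {}
    # Values are either "quoted" or a run of non-space characters.
    for key, raw in re.findall(r'(\w+)=("[^"]*"|[^\s>]+)', line):
        attrs[key] = raw.strip('"')
    return match.group(1), attrs

print(parse_structure_tag('<doc author="Lewis Carroll" date="1876">'))
print(parse_structure_tag('<doc id="G10" n=32>'))
```

In Example 5, "doc" is the STRUCTURE name, and "author" and "date" are the ATTRIBUTE names declared inside its block.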
If your POS attribute contains ambiguous tags such as NN1-VVB in the BNC, and you would like [pos="NN1"] queries to find them, add a multivalue configuration.
Example 6
PATH /corpora/test4
ENCODING "iso8859-2"
ATTRIBUTE word
ATTRIBUTE pos {
MULTIVALUE yes
MULTISEP "-"
}
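The effect of MULTIVALUE and MULTISEP in Example 6 can be pictured with a short Python sketch: the stored value is split on the separator and a query matches if it equals one of the parts. This illustrates the idea only, with hypothetical function names; it does not reproduce how the index is actually built.

```python
def split_multivalue(value, multisep):
    """Split an attribute value into its component values."""
    if multisep == "":
        return list(value)  # empty separator: split into characters
    return value.split(multisep)

def matches(value, wanted, multisep="-"):
    """True if `wanted` is one of the component values."""
    return wanted in split_multivalue(value, multisep)

print(split_multivalue("NN1-VVB", "-"))  # ['NN1', 'VVB']
print(matches("NN1-VVB", "NN1"))         # True
print(matches("NN1-VVB", "NN"))          # False
```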
If you would like to add a dynamic attribute, add a new attribute definition. In the following example the vertical text contains words only (one column), but the corpus has an additional attribute lc generated from the word attribute. The values of lc consist of the respective words transformed into lowercase letters. The transformation function is an internal function named "lowercase" (its definition can be found in the stddynfun.c file). It accepts two arguments: the first is a word and the second is a locale (in this corpus "cs_CZ"). DEFAULTATTR ensures that lc will be used when evaluating queries that do not name an attribute. TRANSQUERY ensures that the transformation function will be applied to the query string before query evaluation.
Example 7
PATH /corpora/test1
VERTICAL "/corpora/src/test1.vertical"
ENCODING "iso8859-2"
DEFAULTATTR lc
ATTRIBUTE word {
LOCALE "cs_CZ"
}
ATTRIBUTE lc {
LOCALE "cs_CZ"
DYNAMIC lowercase
DYNLIB internal
FUNTYPE s
FROMATTR word
ARG1 "cs_CZ"
TRANSQUERY yes
}
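The interplay of the dynamic lc attribute and TRANSQUERY in Example 7 can be sketched in Python as follows. The sample word list and function name are hypothetical, and Python's built-in `str.lower` stands in for the locale-aware "lowercase" function, which is an assumption.

```python
# Hypothetical values of the "word" attribute at four corpus positions.
words = ["Praha", "praha", "PRAHA", "Brno"]

# The dynamic attribute: lc is derived from word, position by position.
lc = [w.lower() for w in words]

def search_lc(query, transquery=True):
    """Return positions whose lc value equals the query."""
    if transquery:
        query = query.lower()  # same transformation applied to the query
    return [pos for pos, value in enumerate(lc) if value == query]

print(search_lc("Praha"))                    # [0, 1, 2]
print(search_lc("Praha", transquery=False))  # []
```

Without the query transformation, a capitalised query can never match a lowercased attribute, which is why the example sets TRANSQUERY together with DEFAULTATTR lc.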
The transformation function of a dynamic attribute can also be an external function. DYNLIB then gives the full path to a dynamic library. The following example lists two dynamic attributes which add a lemma and a morphological annotation to a corpus. Both transformation functions (tags and lemmata) return ambiguous values separated by a comma.
Example 8
PATH /corpora/test1
VERTICAL "/corpora/src/test1.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word {
LOCALE "cs_CZ"
}
ATTRIBUTE lemma {
LOCALE "cs_CZ"
DYNAMIC lemmata
DYNLIB /corpora/bin/alibfun.so
ARG1 0
FUNTYPE i
FROMATTR word
MULTIVALUE yes
MULTISEP ","
}
ATTRIBUTE tag {
DYNAMIC tags
DYNLIB /corpora/bin/alibfun.so
FUNTYPE 0
FROMATTR word
MULTIVALUE yes
MULTISEP ","
}
Parallel corpora are handled as two separate corpora. ALIGNED gives the name of the parallel part. Both corpora should have a structure named "align" with one-to-one correspondence of the respective token sequences. The following example shows two configuration files -- one for each corpus.
Example 9a (paren)
PATH /corpora/par-en
VERTICAL "/corpora/src/par-en.vertical"
ENCODING "iso8859-1"
ATTRIBUTE word
STRUCTURE doc {
ATTRIBUTE id
}
STRUCTURE s
STRUCTURE align
ALIGNED parcs
Example 9b (parcs)
PATH /corpora/par-cs
VERTICAL "/corpora/src/par-cs.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word {
LOCALE "cs_CZ"
}
STRUCTURE doc {
ATTRIBUTE id
}
STRUCTURE s
STRUCTURE align
ALIGNED paren
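Since the two corpora must have matching "align" structures, a quick sanity check before encoding is to count the opening <align> tags in both vertical texts. The Python sketch below does only that; equal counts are a necessary condition, not proof of a correct alignment. The function name and sample data are hypothetical.

```python
def count_structures(vertical_lines, name="align"):
    """Count opening tags of the given structure in vertical text."""
    opening = "<" + name
    count = 0
    for line in vertical_lines:
        line = line.strip()
        # Matches <align> and <align attr=...>, but not </align>.
        if line == opening + ">" or line.startswith(opening + " "):
            count += 1
    return count

en = ["<align>", "Hello", "</align>", "<align>", "Goodbye", "</align>"]
cs = ["<align>", "Ahoj", "</align>", "<align>", "Nashledanou", "</align>"]
print(count_structures(en) == count_structures(cs))  # True
```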
The final example is part of a BNC configuration. It shows the use of INFO and FULLREF.
Example 10
PATH /corpora/bnc
INFO "British National Corpus"
VERTICAL /corpora/src/bnc.vert
ENCODING "iso8859-1"
DEFAULTATTR lc
FULLREF "bncdoc.id,bncdoc.author,bncdoc.title,bncdoc.date,bncdoc.info"
ATTRIBUTE word
ATTRIBUTE tag {
MULTIVALUE y
MULTISEP "-"
}
ATTRIBUTE lc {
DYNAMIC lowercase
DYNLIB internal
FUNTYPE s
ARG1 "C"
FROMATTR word
TRANSQUERY yes
}
STRUCTURE bncdoc {
ATTRIBUTE id
ATTRIBUTE date
ATTRIBUTE year {
DYNAMIC firstn
DYNLIB internal
FUNTYPE i
ARG1 4
FROMATTR date
}
ATTRIBUTE author {
MULTIVALUE y
MULTISEP ";"
}
ATTRIBUTE title
ATTRIBUTE info
ATTRIBUTE allava
ATTRIBUTE alltim
ATTRIBUTE alltyp
ATTRIBUTE wriaag
ATTRIBUTE wriad
ATTRIBUTE wriase
}
STRUCTURE stext {
ATTRIBUTE org
}
STRUCTURE text {
ATTRIBUTE org
}
STRUCTURE s {
ATTRIBUTE n
}
STRUCTURE p {
ATTRIBUTE rend
}
STRUCTURE body
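In Example 10 the structure attribute year is derived from date by the internal function "firstn" with ARG1 4. Assuming that "firstn" returns the first n characters of its input (the name suggests this, but the page does not define it), the derivation can be pictured as below. The date value is a hypothetical sample.

```python
def firstn(value, n):
    """Assumed behaviour of the internal firstn: the first n characters."""
    return value[:n]

date = "1991-05"         # hypothetical value of bncdoc.date
year = firstn(date, 4)   # FUNTYPE i: one int extra argument, ARG1 = 4
print(year)              # 1991
```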
