Sketch Engine

Preparing a Corpus for the Sketch Engine

Input format

The input format is so-called "vertical" or "word-per-line (WPL)" text. It is a plain text (ASCII) file in the selected character encoding, without any formatting (word-processing options). Words are written in a column, i.e. each line contains one word, number or punctuation mark. Optional annotation is on the same line as the respective word, separated by the tab character. For example, the following sentence:

Suddenly, however, their posture changed.

is in vertical text:

Suddenly 
, 
however 
, 
their 
posture 
changed 
.

and with tag and lemma annotation (a modified "Lancaster" tagset from the SUSANNE corpus):

Suddenly	RR	suddenly 
,	YC	- 
however	RR	however 
,	YC	- 
their	APPGh2	their 
posture	NN1n	posture 
changed	VVDv	change 
.	YF	-  
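As a rough illustration of the column layout described above, the following Python sketch splits vertical text into tokens. The names Token and parse_vertical are hypothetical and not part of Sketch Engine; the sketch assumes tab-separated columns, tolerates trailing spaces, and does not treat structural tag lines specially.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    word: str               # first column: the word form
    attrs: Tuple[str, ...]  # remaining columns, e.g. (tag, lemma)


def parse_vertical(text: str) -> List[Token]:
    """Split word-per-line text into tokens; columns are tab-separated."""
    tokens = []
    for line in text.splitlines():
        line = line.rstrip(" \r")  # tolerate trailing spaces and CRs
        if not line:
            continue               # skip empty lines
        columns = line.split("\t")
        tokens.append(Token(columns[0], tuple(columns[1:])))
    return tokens


sample = "Suddenly\tRR\tsuddenly \n,\tYC\t- \nhowever\tRR\thowever \n"
tokens = parse_vertical(sample)
for token in tokens:
    print(token.word, token.attrs)
```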

XML tags are used for structural annotation (such as sentence or paragraph boundaries, headlines etc.). There is also a "glue" tag <g/>, which signifies that there should be no space between two tokens. For example:

<doc id="G10" n=32> 
<head type=min> 
FEDERAL 
CONSTITUTION 
<g/> 
, 
1789 
</head> 
<p n=1> 
" 
<g/> 
we 
the 
People  

There can be any number of attributes associated with words and with XML markup.
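To show what the glue tag means in practice, here is a minimal Python sketch that rebuilds running text from vertical lines. The function name detokenize is hypothetical, and the sketch assumes that any line starting with "<" and ending with ">" is a structural tag carrying no text of its own.

```python
def detokenize(lines):
    """Join vertical-text tokens with spaces, honouring the <g/> glue tag."""
    out = []
    glue = True  # no space is wanted before the very first token
    for raw in lines:
        item = raw.strip()
        if not item:
            continue
        if item == "<g/>":
            glue = True   # suppress the space before the next token
            continue
        if item.startswith("<") and item.endswith(">"):
            continue      # assumption: other tags are structure only
        word = item.split("\t")[0]  # first column is the word form
        if not glue:
            out.append(" ")
        out.append(word)
        glue = False
    return "".join(out)


vertical = [
    "<head type=min>", "FEDERAL", "CONSTITUTION", "<g/>", ",", "1789",
    "</head>", "<p n=1>", '"', "<g/>", "we", "the", "People",
]
text = detokenize(vertical)
print(text)
```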

The input format is as defined at the University of Stuttgart in the 1990s and as widely used in the corpus linguistics community.

Corpus Configuration File

A corpus configuration is stored in one text (ASCII) file. The name of this file is the ID of the corpus throughout the system. The configuration consists of attribute-value pairs. Each nonempty line begins with an option name and ends with an option value enclosed in quotation marks. Values consisting of only lowercase letters can be written without quotation marks.

Global options are described in the following table.

PATH
full path of the corpus home directory which contains all data files
INFO
arbitrary corpus information like source, size etc. There is no automatic processing of this data. If the value begins with the "@" character the rest is taken as a full path of a file containing INFO data
NAME
name of the corpus
ENCODING
corpus encoding (should be one which Tcl supports), default encoding is "iso8859-1"
VERTICAL
full path of the source vertical text; it is used only by the "encodevert" program. If the value starts with "|", the rest is treated as a shell command, and the vertical text is taken from the standard output of that command
DEFAULTATTR
default attribute for query evaluation
ATTRIBUTE
definition of a positional attribute; at least one positional attribute should be defined. The first defined attribute is the default one (in most cases it is the word form and the name of this attribute is "word")
STRUCTURE
definition of a structural tag
PROCESS
definition of a process for preprocessing text before encoding a corpus or postprocessing corpus data
ALIGNED
name of an aligned corpus, both corpora should have a structural tag named "align" with one to one correspondence of respective token sequences
SHORTREF
attribute of a structure to display as the default reference in a concordance; defaults to the first attribute of the first structure, or "#" (token number) if no structure attribute exists
FULLREF
comma separated list of references which will be displayed as a full reference in Sketch Engine
HARDCUT
maximum number of query result lines
MAXCONTEXT
maximum number of positions in context
MAXDETAIL
maximum number of positions in detail
REBUILDUSER
comma separated list of users with permission to rebuild the corpus, special value "*" means any user can rebuild the corpus
WPOSLIST
list of pairs providing a mapping between a user-friendly name for a word class, and a regular expression matching the POS-tags which are instances of it. If specified, users can select items like "noun", "verb" from a menu when specifying right or left context for a concordance search. The first character of the string is a separator used to separate values in the rest of the string. Example for TreeTagger English tagset (modified version of Penn tagset):
WPOSLIST ",adjective,AJ.,adverb,AV.,conjunction,CJ.,determiner,AT0,noun,NN.,
noun singular,NN1,noun plural,NN2,preposition,PRP,pronoun,DPS,verb,VV."
LPOSLIST
list of pairs providing a mapping between a word class suffix, and a user-friendly name for the word class. Only makes sense when there is a lempos attribute (lemmas-with-a-word-class-suffix) available in a corpus, so that, for example, "brush as noun" (brush-n) and "brush as verb" (brush-v) can get different word sketches. The first character of the string is a separator used to separate values in the rest of the string. Example from BNC:
LPOSLIST ",adjective,-j,adverb,-a,conjunction,-c,noun,-n,preposition,-p,pronoun,-d,verb,-v"
TAGSETDOC
URL to corpus tagset summary page, the link is displayed on the concordance entry form
SUBCORPATTRS
comma separated list of document attributes used for creating subcorpora; use "|" instead of a comma to display attributes on the same row in the subcorpus creation form
WSATTR
attribute name for which word sketches are computed, defaults to "lempos" if the corpus has that attribute, or "lemma" if the corpus has that attribute, or DEFAULTATTR otherwise
WSPOSLIST
LPOSLIST of word sketch POSes
WSSTRIP
number of characters to strip from the end of a word in word sketch listings; defaults to 2 if WSATTR is "lempos", or 0 otherwise
WSBASE
path to word sketches data files, defaults to PATH/WSATTR-ws; use "none" to disable WordSketch menu items
WSTHES
path to word sketches thesaurus data files, defaults to PATH/WSATTR-thes
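The separator convention used by WPOSLIST and LPOSLIST (the first character of the string separates the remaining values) can be sketched in Python as follows. The function names are hypothetical, the sample strings are shortened versions of the examples above, and matching a tag against the whole regular expression is an assumption of this sketch rather than documented behaviour.

```python
import re


def parse_pos_list(value):
    """Split a WPOSLIST/LPOSLIST string into (first, second) pairs."""
    if not value:
        raise ValueError("empty list")
    sep, body = value[0], value[1:]  # the first character is the separator
    items = body.split(sep)
    if len(items) % 2 != 0:
        raise ValueError("expected an even number of items")
    return list(zip(items[0::2], items[1::2]))


def classes_for_tag(wposlist, tag):
    """Return the user-friendly class names whose regex matches the tag."""
    return [
        name
        for name, pattern in parse_pos_list(wposlist)
        if re.fullmatch(pattern, tag)  # assumption: whole-tag match
    ]


wposlist = ",noun,NN.,noun singular,NN1,verb,VV."
lposlist = ",noun,-n,verb,-v"

print(parse_pos_list(lposlist))
print(classes_for_tag(wposlist, "NN1"))
```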

ATTRIBUTE and STRUCTURE options can be repeated and enriched with an additional information block. Additional attribute options are described in the following table.

LOCALE
locale code of the language (and region) used; this value is used in query evaluation (of regular expressions) and in sorting of concordance lines. The default locale is the standard POSIX locale ("C")
MULTIVALUE
indicates whether the attribute can hold multiple values
MULTISEP
defines the multivalue separator; if it is empty (""), the value is split into individual characters
DYNAMIC
if this option exists, the given attribute is a dynamic one, and the value of this option is the name of the C function which defines the dynamic attribute
DYNLIB
dynamic library containing given function
FUNTYPE
type of given function
  • 0 -- no extra argument
  • c -- one char extra argument
  • s -- one (const char*) extra argument
  • i -- one int extra argument
  • cc -- two char extra arguments
  • ii -- two int extra arguments
  • ss -- two (const char*) extra arguments
  • ci -- two extra arguments, first char, second int
  • cs, sc, si, ic, is -- likewise
ARG1
the first optional fixed parameter
ARG2
the second optional fixed parameter
FROMATTR
the name of the attribute from which the given attribute is created
DYNTYPE
type of the dynamic attribute, possible values are plain, lexicon, index (default)
  • plain -- only displaying is enabled
  • lexicon -- displaying and counting (frequency distribution) are enabled
  • index -- all features including querying are enabled
TRANSQUERY
use transformation function for queries (multivalues not supported)

In the additional information block of a STRUCTURE option there can be arbitrarily many ATTRIBUTE options (each with its own additional option block, if needed).

LABEL
label used in references instead of <STRUCTURE>.<ATTRIBUTE>
DISPLAYTAG
if "1" (the default), an XML tag like <s>, <p>, ... is displayed; set it to "0" to hide the tag. Use the other DISPLAY... options to modify the concordance output
DISPLAYCLASS
a class of included text
DISPLAYBEGIN
_EMPTY_
DISPLAYEND

Examples

If your vertical text contains only words and no annotation, a configuration can be very simple:

Example 1

PATH /corpora/test1
ATTRIBUTE word

If you omit VERTICAL, you have to specify a source file for the encodevert command:

% encodevert -c test1 /corpora/src/test1.vertical

Adding VERTICAL to the configuration simplifies the encodevert command:

% encodevert -c test2
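The two forms a VERTICAL value can take (a plain path, or "|" followed by a shell command) can be illustrated with a short Python sketch. This is not how encodevert is implemented; read_vertical is a hypothetical helper, and the sketch assumes a POSIX shell is available.

```python
import subprocess


def read_vertical(value, encoding="iso8859-1"):
    """Return the vertical text named by a VERTICAL option value."""
    if value.startswith("|"):
        # The rest of the value is a shell command; the vertical text
        # is whatever the command writes to its standard output.
        result = subprocess.run(
            value[1:], shell=True, check=True, capture_output=True
        )
        return result.stdout.decode(encoding)
    # Otherwise the value is the full path of the source file.
    with open(value, encoding=encoding) as source:
        return source.read()


# Hypothetical command standing in for e.g. a decompression pipeline.
text = read_vertical("|printf 'we\\nthe\\nPeople\\n'")
print(text.splitlines())
```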

Select an appropriate ENCODING so that characters are displayed properly in Sketch Engine. For each attribute you can specify a LOCALE for proper sorting and for handling of regular expression character classes. The default "C" locale corresponds to English. The following example uses the ISO Latin 2 encoding and the Czech locale.

Example 2

PATH /corpora/test1
VERTICAL "/corpora/src/test1.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word {
	LOCALE "cs_CZ.ISO8859-2"
}

If your vertical text contains POS tagging for each token (word), specify a second attribute as well.

Example 3

PATH /corpora/test1
VERTICAL "/corpora/src/test1.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word {
	LOCALE "cs_CZ.ISO8859-2"
}
ATTRIBUTE pos

If your vertical text contains sentence boundaries annotated with <s> and </s> and document boundaries annotated with <doc> and </doc>, add structure definitions.

Example 4

PATH /corpora/test2
VERTICAL "/corpora/src/test2.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word 
STRUCTURE doc
STRUCTURE s

If your <doc> annotation contains document meta-information about the author and the date of publication in the form <doc author="Lewis Carroll" date="1876">, add structure attribute definitions.

Example 5

PATH /corpora/test3
VERTICAL "/corpora/src/test3.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word 
STRUCTURE doc {
	ATTRIBUTE author
	ATTRIBUTE date
}
STRUCTURE s
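The block syntax used in this example (an option followed by "{", nested options, and a closing "}") could be read with a small line-based parser. The Python sketch below is only an illustration of the file layout, not the parser Sketch Engine uses; parse_config and the dictionary layout are hypothetical, and the sketch assumes one option per line with "}" on a line of its own.

```python
import re

# NAME, then an optional quoted or bare value, then an optional "{".
LINE = re.compile(r'^(\S+)(?:\s+(?:"([^"]*)"|([^\s{"]+)))?\s*(\{)?$')


def parse_config(text):
    """Parse configuration text into a nested list of option entries."""
    root = []
    stack = [root]  # stack[-1] is the option list currently being filled
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "}":
            if len(stack) == 1:
                raise ValueError("unbalanced closing brace")
            stack.pop()
            continue
        match = LINE.match(line)
        if match is None:
            raise ValueError("cannot parse line: " + line)
        name, quoted, bare, brace = match.groups()
        entry = {
            "name": name,
            "value": quoted if quoted is not None else bare,
            "options": [],
        }
        stack[-1].append(entry)
        if brace:
            stack.append(entry["options"])  # descend into the block
    if len(stack) != 1:
        raise ValueError("missing closing brace")
    return root


config = parse_config('''
PATH /corpora/test3
VERTICAL "/corpora/src/test3.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word
STRUCTURE doc {
	ATTRIBUTE author
	ATTRIBUTE date
}
STRUCTURE s
''')
doc = config[4]
print(doc["value"], [option["value"] for option in doc["options"]])
```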

If your POS attribute contains ambiguous tags like NN1-VVB in the BNC, and you would like [pos="NN1"] queries to find such tags, add a multivalue configuration.

Example 6

PATH /corpora/test4
ENCODING "iso8859-2"
ATTRIBUTE word 
ATTRIBUTE pos {
	MULTIVALUE yes
	MULTISEP "-"
}
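The effect of MULTIVALUE and MULTISEP can be pictured with a short Python sketch: an ambiguous tag is split on the separator, and a query value matches if it equals any of the parts. The function names are hypothetical and the sketch says nothing about how the values are actually indexed.

```python
def split_multivalue(value, multisep):
    """Split an attribute value into its component values."""
    if multisep == "":
        return list(value)  # empty separator: split into characters
    return value.split(multisep)


def matches(value, wanted, multisep="-"):
    """True if the wanted value is one of the components of value."""
    return wanted in split_multivalue(value, multisep)


print(split_multivalue("NN1-VVB", "-"))
print(matches("NN1-VVB", "NN1"))
```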

If you would like to add a dynamic attribute, add a new attribute definition. In the following example the vertical text contains words only (one column), but the corpus has an additional attribute lc generated from the word attribute. Values of lc consist of the respective words converted to lowercase. The transformation function is an internal function named "lowercase" (its definition can be found in the stddynfun.c file). It accepts two arguments: the first is a word and the second is a locale (in this corpus "cs_CZ"). DEFAULTATTR ensures that lc is used when evaluating queries that do not name an attribute. TRANSQUERY ensures that the transformation function is applied to the query string before the query is evaluated.

Example 7

PATH /corpora/test1
VERTICAL "/corpora/src/test1.vertical"
ENCODING "iso8859-2"
DEFAULTATTR lc
ATTRIBUTE word {
	LOCALE "cs_CZ"
}
ATTRIBUTE   lc {
	LOCALE "cs_CZ"

	DYNAMIC    lowercase
	DYNLIB     internal
	FUNTYPE    s
	FROMATTR   word
	ARG1       "cs_CZ"
	TRANSQUERY yes
}
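The interplay of a dynamic attribute and TRANSQUERY can be sketched in Python. This is a simplified model, not the code in stddynfun.c: the lowercase function here only mirrors the two-argument shape described above and relies on Python's Unicode-based str.lower() rather than on the named locale, and the sample words are made up.

```python
def lowercase(word, locale_name):
    """Stand-in for the internal "lowercase" function (assumption:
    Unicode lowercasing is close enough for this illustration)."""
    return word.lower()


# Made-up corpus positions; lc is derived from word (FROMATTR word).
words = ["Praha", "je", "Hlavní", "město"]
lc = [lowercase(word, "cs_CZ") for word in words]


def find(query, transquery=True):
    """Return positions whose lc value equals the query."""
    if transquery:
        # TRANSQUERY: transform the query the same way as the attribute.
        query = lowercase(query, "cs_CZ")
    return [position for position, value in enumerate(lc) if value == query]


print(find("HLAVNÍ"))
print(find("HLAVNÍ", transquery=False))
```

Without the query transformation an uppercase query can never match, because every stored lc value is already lowercase.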

A transformation function of a dynamic attribute can also be an external function. DYNLIB then gives the full path to a dynamic library. The following example lists two dynamic attributes which add a lemma and a morphological annotation to a corpus. Both transformation functions (tags and lemmata) return ambiguous values separated by a comma.

Example 8

PATH /corpora/test1
VERTICAL "/corpora/src/test1.vertical"
ENCODING "iso8859-2"

ATTRIBUTE   word {
	LOCALE "cs_CZ"
}
ATTRIBUTE   lemma {
	LOCALE "cs_CZ"
	DYNAMIC	lemmata
	DYNLIB	/corpora/bin/alibfun.so
	ARG1	0
	FUNTYPE	i
	FROMATTR	word

	MULTIVALUE	yes
	MULTISEP	","
}
ATTRIBUTE   tag {
	DYNAMIC	tags
	DYNLIB	/corpora/bin/alibfun.so
	FUNTYPE	0
	FROMATTR	word

	MULTIVALUE	yes
	MULTISEP	","
}

Parallel corpora are handled as two separate corpora. ALIGNED indicates the name of the parallel part. Both corpora should have a structure named "align" with one to one correspondence of respective token sequences. The following example shows two configuration files -- one for each corpus.

Example 9a (paren)

PATH /corpora/par-en
VERTICAL "/corpora/src/par-en.vertical"
ENCODING "iso8859-1"
ATTRIBUTE word 
STRUCTURE doc {
	ATTRIBUTE id
}
STRUCTURE s
STRUCTURE align
ALIGNED	  parcs

Example 9b (parcs)

PATH /corpora/par-cs
VERTICAL "/corpora/src/par-cs.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word {
	LOCALE "cs_CZ"
}
STRUCTURE doc {
	ATTRIBUTE id
}
STRUCTURE s
STRUCTURE align
ALIGNED	  paren
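One way to picture the one-to-one correspondence of "align" structures is the Python sketch below. It assumes that the i-th <align> structure in one corpus corresponds to the i-th in the other, which is an interpretation of the requirement above rather than documented behaviour; the function names and the sample sentences are made up.

```python
def align_segments(lines):
    """Collect the word forms inside each <align>...</align> structure."""
    segments, current = [], None
    for raw in lines:
        item = raw.strip()
        if item == "<align>" or item.startswith("<align "):
            current = []
        elif item == "</align>" and current is not None:
            segments.append(current)
            current = None
        elif current is not None and item and not item.startswith("<"):
            current.append(item.split("\t")[0])  # first column: word
    return segments


def pair_aligned(first, second):
    """Pair segments by position; both corpora need the same count."""
    if len(first) != len(second):
        raise ValueError("align structures are not in 1:1 correspondence")
    return list(zip(first, second))


english = ["<align>", "Good", "morning", "</align>",
           "<align>", "Thanks", "</align>"]
czech = ["<align>", "Dobré", "ráno", "</align>",
         "<align>", "Díky", "</align>"]
pairs = pair_aligned(align_segments(english), align_segments(czech))
for left, right in pairs:
    print(" ".join(left), "|", " ".join(right))
```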

The final example is a part of a BNC configuration. It shows usage of INFO and FULLREF.

Example 10

PATH   /corpora/bnc
INFO   "British National Corpus"
VERTICAL /corpora/src/bnc.vert
ENCODING "iso8859-1"

DEFAULTATTR lc

FULLREF "bncdoc.id,bncdoc.author,bncdoc.title,bncdoc.date,bncdoc.info"

ATTRIBUTE   word
ATTRIBUTE   tag {
	MULTIVALUE y
	MULTISEP   "-"
}

ATTRIBUTE   lc {
	DYNAMIC lowercase
	DYNLIB  internal
	FUNTYPE s
	ARG1    "C"
	FROMATTR word
	TRANSQUERY	yes
}
	
STRUCTURE   bncdoc {
	ATTRIBUTE id
	ATTRIBUTE date
	ATTRIBUTE year {
		DYNAMIC firstn
		DYNLIB  internal
		FUNTYPE i
		ARG1    4
		FROMATTR date
	}
	ATTRIBUTE author {
		MULTIVALUE y
		MULTISEP   ";"
	}
	ATTRIBUTE title
	ATTRIBUTE info

	ATTRIBUTE allava
	ATTRIBUTE alltim
	ATTRIBUTE alltyp

	ATTRIBUTE wriaag
	ATTRIBUTE wriad
	ATTRIBUTE wriase
}

STRUCTURE   stext {
	ATTRIBUTE org
}
STRUCTURE   text {
	ATTRIBUTE org
}

STRUCTURE   s {
	ATTRIBUTE n
}

STRUCTURE   p {
	ATTRIBUTE rend
}
STRUCTURE   body 
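Two details of this configuration can be sketched in Python: the year attribute derived from date, and the FULLREF list. The behaviour of the internal firstn function is inferred here from its name and from ARG1 being 4 (the first four characters of the date), so treat it as an assumption; full_reference is a hypothetical helper and the attribute values are made up.

```python
def firstn(value, n):
    """Assumed behaviour of "firstn": keep the first n characters."""
    return value[:n]


def full_reference(fullref, attrs):
    """Look up each comma separated FULLREF item in the attribute values."""
    return [(name, attrs.get(name, "")) for name in fullref.split(",")]


# Made-up values for one document; not taken from the BNC.
attrs = {"bncdoc.id": "X01", "bncdoc.date": "1991-09"}

year = firstn(attrs["bncdoc.date"], 4)
reference = full_reference("bncdoc.id,bncdoc.date,bncdoc.title", attrs)
print(year)
print(reference)
```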

New draft version: wiki:SkE/PreparingCorpusOverview


Lexical Computing Ltd
