Sketch Engine
Preparing Corpus Text for the Sketch Engine

The input format is "vertical" or "word-per-line (WPL)" text, as defined at the University of Stuttgart in the 1990s. Each line contains exactly one token: a word, a number or a punctuation mark. The file is plain text in a selected character encoding, without any formatting.

For example, the sentence

Suddenly, however, their posture changed.

is written in vertical text as:

Suddenly 
, 
however 
, 
their 
posture 
changed 
.
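The conversion above can be sketched in a few lines of Python. This is only an illustration under a simplifying assumption: a token is either a run of word characters or a single punctuation character. The names `TOKEN_RE` and `to_vertical` are hypothetical, and a real tokeniser will need language-specific rules (abbreviations, clitics, numbers with separators and so on).

```python
import re

# Assumption: a token is a run of word characters, or any single
# non-space character that is not a word character (punctuation).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def to_vertical(sentence: str) -> str:
    """Return the sentence as word-per-line text, one token per line."""
    return "\n".join(TOKEN_RE.findall(sentence))

print(to_vertical("Suddenly, however, their posture changed."))
```

This sketch does not record where spaces were absent in the original text, so the surface form cannot be restored exactly from its output.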

If the input text is part-of-speech-tagged and lemmatised, two additional tab-separated columns are provided, for tag and lemma, as here (showing tags from a Penn-style tagset):

Suddenly	RB	suddenly
<g/> 
,	,	, 
however	RB	however
<g/>
,	,	, 
their	PP$	their 
posture	NN	posture 
changed	VVD	change 
<g/>
.	SENT	.

The "glue" tag <g/> specifies that there should be no space between the two tokens around it, as between a word and the following punctuation mark (in Latin and other Western scripts).
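To make the effect of the glue tag concrete, here is a minimal Python sketch that rebuilds running text from vertical lines. It assumes the input contains only token lines (with the word form in the first tab-separated column) and <g/> lines; other structural tags are not handled. The function name `detokenize` is hypothetical and is not part of the Sketch Engine.

```python
def detokenize(vertical_lines):
    """Join tokens with single spaces, except where a <g/> line precedes a token."""
    pieces = []
    glue = False  # True when the previous non-empty line was <g/>
    for line in vertical_lines:
        line = line.strip()
        if not line:
            continue
        if line == "<g/>":
            glue = True
            continue
        word = line.split("\t")[0]  # the first column is the word form
        if pieces and not glue:
            pieces.append(" ")
        pieces.append(word)
        glue = False
    return "".join(pieces)

lines = [
    "their\tPP$\ttheir",
    "posture\tNN\tposture",
    "changed\tVVD\tchange",
    "<g/>",
    ".\tSENT\t.",
]
print(detokenize(lines))
```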

Sometimes an attribute has multiple or disjunctive values, for example if the POS-tagger was undecided between classifying a word as a noun (NN) or a lexical verb (VV), or if a word is associated with two grammatical relations. This can be encoded using a separator character, as specified in the Corpus Configuration file (attributes MULTIVALUE and MULTISEP). Here the separator is ";":

brush	NN;VV	brush
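A preprocessing script that needs to read such a line could split the tag column on the separator, as in the following Python sketch. It assumes exactly three tab-separated columns (word, tag, lemma) and ";" as the separator, as in the example above. The function `parse_token_line` and its `multisep` parameter are hypothetical names; how the Sketch Engine itself interprets the values is governed by the Corpus Configuration file, not by this code.

```python
def parse_token_line(line, multisep=";"):
    """Split a word/tag/lemma line and expand a multi-valued tag column."""
    word, tag, lemma = line.rstrip("\n").split("\t")
    return {"word": word, "tags": tag.split(multisep), "lemma": lemma}

print(parse_token_line("brush\tNN;VV\tbrush"))
```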

XML tags are used for structural annotation, including document, sentence or paragraph boundaries, headlines etc., and can have associated attribute-value pairs. For example:

<doc id="G10" n=32> 
<head type=min> 
FEDERAL 
CONSTITUTION 
<g/> 
, 
1789 
</head> 
<p n=1> 
" 
<g/> 
we 
the  
People  
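When checking a vertical file before loading it, it can help to tell structural tags apart from token lines. The Python sketch below does this with two regular expressions and accepts attribute values with or without quotes. The names `TAG_RE`, `ATTR_RE` and `classify` are hypothetical, and the patterns are an assumption about what a tag line looks like, not the parser used by the Sketch Engine.

```python
import re

# Assumption: a tag line consists of a single tag and nothing else.
TAG_RE = re.compile(r"<(/?)(\w+)([^<>]*?)(/?)>")
# Attribute values may be quoted ("G10") or bare (32).
ATTR_RE = re.compile(r'(\w+)=("[^"]*"|[^\s"]+)')

def classify(line):
    """Return a tuple describing one line of vertical text."""
    line = line.strip()
    m = TAG_RE.fullmatch(line)
    if m is None:
        # Anything that is not a tag is a token line with tab-separated columns.
        return ("token", line.split("\t"))
    closing, name, attr_text, empty = m.groups()
    if closing:
        return ("close", name)
    attrs = {key: value.strip('"') for key, value in ATTR_RE.findall(attr_text)}
    return ("empty" if empty else "open", name, attrs)

sample = ['<doc id="G10" n=32>', "<head type=min>", "FEDERAL",
          "CONSTITUTION", "<g/>", ",", "1789", "</head>"]
for line in sample:
    print(classify(line))
```

Keeping opening, closing and empty tags distinct makes it straightforward to add a check that every opened structure is eventually closed.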

Any number of attributes can be associated with words. The 'standard' ones are lemma and POS-tag, but the framework can also be used to record thesaurus category, grammatical function, and other kinds of markup. Sometimes this markup is most suitably associated with a word, and sometimes with a structure such as phrase, sentence or paragraph. (The choice of encoding affects which searches can easily be made.) For the special case of text type or 'header' information, see SkE/PrepareHeaders.

Navigation

  • Up to SkE/PreparingCorpusOverview
  • Headers: SkE/PrepareHeaders
  • Corpus Configuration file

Lexical Computing Ltd
