Sketch Engine

Text Types, Headers and Subcorpora in the Sketch Engine

Overview

For many kinds of language study, text type is important. If we wish to describe the behaviour of a word, phrase, or grammatical construction, it is always relevant to ask whether it occurs across the varieties of the language, or whether it occurs mostly in one dialect or one domain, or is restricted to informal language. (We follow Biber (1989) in using 'text type' as a cover term for the many ways in which we might classify one text, or discourse, as being of a different type from another.)

The Sketch Engine supports research into text type distinctions by making it easy for users to constrain searches to particular text types, and by providing analyses of the frequency of a word, phrase or other unit by text type. (See the 'text type' options in the main concordance window and, once a concordance is being viewed, 'text type' options under the 'frequency' function.)

These functions only work well if

  • the documents in the corpus have been classified for text type
  • the corpus has been prepared in a way that makes the text type information accessible to the software.

The basic method is this. The corpus is, we assume, structured as a set of documents. In the vertical file there is a structural unit, let's call it <doc>, for each document. Text type information is associated with the <doc> element as a series of XML attribute-value pairs. For example, if the text type features are 'region' and 'domain', and a particular document is Australian and about sport, we might have

<doc region="Aus" domain=sport>

(Where the values of features comprise only alphanumeric characters, quotation marks are optional.) This is the document's header, and text type information is sometimes also called 'header information'. In addition, each feature-value pair specifies a subcorpus comprising all the documents that have that pair, so text type information will sometimes also be called 'subcorpus information'.

There may be any number of feature-value pairs. In contrast to approaches to document headers found in, for example, the Text Encoding Initiative, SkE document headers are flat lists of feature-value pairs, not structured objects.
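
As a rough illustration of the flat header format described above, the following Python sketch renders a dictionary of text type features as a <doc> opening tag. The function name doc_header is our own and is not part of the Sketch Engine; the sketch always quotes values, which is safe even where quotation marks would be optional.

```python
from xml.sax.saxutils import quoteattr


def doc_header(features):
    """Render a flat dict of text type features as a <doc> opening tag.

    Hypothetical helper for illustration; not a Sketch Engine function.
    """
    # quoteattr() escapes &, < and > and wraps the value in quotation marks.
    attrs = " ".join(
        f"{name}={quoteattr(str(value))}" for name, value in features.items()
    )
    return f"<doc {attrs}>" if attrs else "<doc>"


print(doc_header({"region": "Aus", "domain": "sport"}))
```

Because the header is a flat list, a plain dictionary with one entry per feature is enough to represent it; no nesting is needed.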

For the Sketch Engine to make the information available for searching, it needs to know about the features: for how they are specified see SkE/CorpusConfig.

Recommendations on text type feature design

Usually, when a corpus is being prepared for the Sketch Engine, it already comes with some header information. The simplest thing to do is to format that information as feature-value pairs without further review. This often does not work well. Information may have been included in headers for a number of reasons and will often include copyright status or a log of who did what and when, and will often not be complete or consistent. While there is no harm in including the copyright or log information in headers, it is not likely to be of use for linguistic research.

The person preparing the corpus needs to ask "what subcorpora would the users like to be able to specify, in order to constrain their searches?"

And then "is that information already in the headers, and if it is not explicit, is it implicit?"

Not too many subcorpora, and keep them large

Most corpora only support a limited number of linguistically useful subcorpora, and if subcorpora are to be used to constrain searches, a subcorpus must be quite large, or most searches will return no hits. This fits with a user interface consideration: we want to present the user with a limited number of options, all of which they understand, in a single screen. For all of these reasons, as a rule of thumb, we suggest that the team preparing the corpus focus on not more than ten features which are likely to be useful for creating subcorpora, with each feature having not more than ten values, and each feature-value pair accounting for at least 2-3% of the whole corpus.
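
The rule of thumb can be checked mechanically before a corpus is loaded. The Python sketch below is one possible way to do so. It assumes that each document is available as a pair of a token count and a dictionary of features, and that the size of a subcorpus is measured in tokens; the function name audit_features and the default thresholds are our own choices, with the lower bound of 2% taken from the guideline above.

```python
from collections import Counter, defaultdict


def audit_features(docs, max_features=10, max_values=10, min_share=0.02):
    """Return warnings for features that break the rule of thumb.

    docs is a list of (token_count, features) pairs, where features is a
    flat dict such as {"mode": "written"}. Hypothetical helper.
    """
    total = sum(tokens for tokens, _ in docs)
    sizes = defaultdict(Counter)  # feature name -> value -> token count
    for tokens, features in docs:
        for name, value in features.items():
            sizes[name][value] += tokens

    warnings = []
    if len(sizes) > max_features:
        warnings.append(f"{len(sizes)} features (more than {max_features})")
    for name, values in sizes.items():
        if len(values) > max_values:
            warnings.append(f"{name}: {len(values)} values (more than {max_values})")
        for value, tokens in values.items():
            if total and tokens / total < min_share:
                warnings.append(f"{name}={value}: {tokens / total:.1%} of corpus")
    return warnings


docs = [(990, {"mode": "written"}), (10, {"mode": "spoken"})]
for warning in audit_features(docs):
    print(warning)
```

A warning is a prompt for review rather than an error: a small subcorpus may still be worth keeping if users are known to need it.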

For example, for the English component of the NCI (New Corpus for Ireland), the features and their possible values are

genre:     imaginative, informative
mode:      spoken, written
region:    Irish, British, American
ie region: North, South, East, West, u
     (applies only to Irish English; all else is u(nclassified))
genre2:    arts/culture, business/finance, drama, fiction, govt,
           hard/applied-science, information, leisure, 
           non-fiction, politics, religion/philosophy, 
           social-science, u
medium:    book, conversation, newspaper, official-govt, 
           periodical, unpublished, website, u
subcorpusbysource:  
           bnc, gigaword, lexmc, limerick-corp, nitcs, web

While the list displays a range of anomalies, it also shows an attempt to bring a range of kinds of material from a range of types of sources (listed as values of the last feature; we used several existing corpora as input) into a coherent and usable whole.

Implicit information

There is much information available that is implicit. Here are two examples. First: if date information is available, then a feature for decade can be built. If the corpus spans several decades, then this will be a useful feature for exploring language change. The date feature by itself will have too many values, each accounting for too little data, to be useful.
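
Deriving a decade from a date can be sketched in a few lines of Python. This is only an illustration: it assumes the date is held as free text containing a four-digit year between 1000 and 2099, and it borrows the value 'u' for documents with no recognisable year. The function name decade_feature is hypothetical.

```python
import re

# Matches a four-digit year from 1000 to 2099 standing on its own.
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def decade_feature(date_text):
    """Map a date string to a decade label such as '1990s', or 'u' if none."""
    match = YEAR.search(date_text or "")
    if match is None:
        return "u"  # unclassified: no year could be found
    year = int(match.group(1))
    return f"{year // 10 * 10}s"


print(decade_feature("1994-03-17"))
print(decade_feature("17 March 2003"))
```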

Second: many corpora are built up from a number of newspapers as well as other sources, with the name of the newspaper held somewhere in the header or filename. But the newspaper name is not directly useful to users for building subcorpora. A 'medium' feature which takes the value 'newspaper' for all newspaper material will be useful and can be inferred from information that is available.
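
The inference from newspaper name to 'medium' might look like the following Python sketch. The set of newspaper names is illustrative only and would in practice be compiled from the sources actually used in the corpus; the function name medium_feature and the fallback value 'u' are likewise our assumptions.

```python
# Illustrative list only; a real one comes from the corpus's own sources.
KNOWN_NEWSPAPERS = {"irish times", "irish independent"}


def medium_feature(source_name, newspapers=KNOWN_NEWSPAPERS):
    """Return 'newspaper' if the source is a known newspaper, else 'u'."""
    # Normalise case and whitespace so header and filename variants match.
    normalised = " ".join(source_name.lower().split())
    return "newspaper" if normalised in newspapers else "u"


print(medium_feature("  Irish   Times "))
print(medium_feature("Ulysses"))
```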

In sum: the people preparing the corpus need to consider what subcorpora will be useful to their users, and then work out how corresponding features can be built for all or most documents, given the information available in the document headers, filenames, or anywhere else.

Lexical Computing Ltd
