Sketch Engine

Getting started with the Sketch Engine

1. Background

The Sketch Engine is a web-based program which takes as its input a corpus of any language with an appropriate level of linguistic mark-up. The Sketch Engine has a number of language-analysis functions, the core ones being:

  • the Concordancer (which is very fast and offers a high level of functionality)
  • the Word Sketch program (which will be described later)

For the purposes of this guide, we use examples based on the Sketch Engine loaded with a sample corpus of English, the British National Corpus (BNC). For more information about the Sketch Engine, see Kilgarriff et al. 2004 in Proc. EURALEX. For more information about the BNC, see http://www.natcorp.ox.ac.uk/

2. Home page

The software is on the Sketch Engine website: http://www.sketchengine.co.uk/

Follow the links from this page to either set up an account, or log in. The "home" screen looks like this:

Sketch Engine home page

Here you can select your corpus (or use a couple of other tools). Here we want to explore the BNC, so we click on that.

3. Generating a concordance

We then see the following:

Search form

The six buttons along the top will take you to other parts of the program, or back to the start ("Home"). You enter the main search term in the query box.

If, like the BNC, the corpus is lemmatized, the terms will match the lemma as well as the word. If you enter save, the Sketch Engine will generate a concordance of all of the following:

  • save-saved-saves-saving (verb)
  • save-saves (noun - what goalkeepers make)
  • save (preposition: everyone was killed save Franco himself)
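To make the lemma matching concrete, here is a minimal Python sketch. It is not how the Sketch Engine is implemented; the Token type, the simple_query function and the ten hand-tagged tokens are all hypothetical and invented for illustration. It only shows why a query for save on a lemmatized corpus returns the verb, noun and preposition uses together.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    lemma: str
    pos: str


# Hypothetical hand-tagged tokens, invented for illustration only.
TOKENS = [
    Token("He", "he", "pronoun"),
    Token("saved", "save", "verb"),
    Token("money", "money", "noun"),
    Token("a", "a", "determiner"),
    Token("fine", "fine", "adjective"),
    Token("save", "save", "noun"),
    Token("all", "all", "pronoun"),
    Token("died", "die", "verb"),
    Token("save", "save", "preposition"),
    Token("Franco", "Franco", "noun"),
]


def simple_query(tokens, term):
    """Return positions whose word form or lemma equals the term, ignoring case."""
    term = term.lower()
    return [
        i for i, tok in enumerate(tokens)
        if tok.word.lower() == term or tok.lemma.lower() == term
    ]


hits = simple_query(TOKENS, "save")
for i in hits:
    print(TOKENS[i].word, TOKENS[i].pos)
```

The inflected form saved is found because its lemma is save, even though the word form itself differs from the query.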

You can also enter phrases in the query box.

Hit <return>, or click on Make Concordance, to see the concordance.

To make more specific searches, click on the + beside Keyword, Context or Text Types and more options become available. First, we click on Keyword:

Search form, keywords

The Lemma box: here you can enter the lemma with a particular part of speech (e.g. save, noun). (Here and below we assume the corpus is, like the BNC, lemmatized and part-of-speech tagged. If it is not, not all of these options are available.)

The Phrase box: here you can enter any multiword expression, such as a compound noun or preposition (like business school or in preference to) or a longer string, such as you must be joking or weapons of mass destruction. Note also that any large corpus will have assorted errors in the markup, which occasionally result in search terms not appearing when you think they should. In this case, try putting your search term in the Phrase box instead of the Lemma box, as this field is matched directly against the text, without further analysis.

The Word Form box: this allows you to search for a specific word form, such as burns, and you can optionally specify that you are looking for burns as a verb or burns as a plural noun. You can make your search case-sensitive by checking the "match case" box: this lets you search for Bush rather than bush, or for pole but not Pole.

The CQL box is for inputting complex queries using Corpus Query Language, described in Corpus Querying and Grammar Writing available from the Sketch Engine home page. "Default attribute" controls how CQL queries will be understood. The "tagset summary" box gives details of the part-of-speech tags used in the tagging.

If you do not want to specify context any more precisely, you are now ready to hit the "Make Concordance" button and see the concordance. You will find more information about manipulating the output in Section 4 below. If however you would like to limit your search to a specific context or text type, read on.

The Context section

Close the Keyword section by clicking the "-" beside Keyword, and open the Context section by clicking the + beside Context and you see the following:

Search form, context

Here you can specify the right and/or left context of your search word, within a window of up to ten items on either side of the search word (though in practice you are unlikely to need such a large window). As context, you can specify either a particular word or one or more word classes (POS). Here are some examples:

  1. you want to search for the string shake (verb) followed by head (noun), to find instances such as she shook her head, if you agree shake your head, and shaking their heads in disbelief... You can do the following:
    • either key "shake" in the Lemma box and specify the POS "verb". Then key "head" in the Right Context lemma box and specify a window size (say 3 tokens)
    • or key "head" in the Lemma box and specify the POS "noun". Then key "shake" in the Left Context lemma box and specify a window size (say 3 tokens)

The results will be the same whichever route you take.

  2. you want to search for the verb taste followed by any adjective; since a following adjective may appear either in position 1 (it tastes horrible), position 2 (it tastes really delicious), or even position 3 (it didn't taste quite so good), key "taste" in the Lemma box and specify the POS "verb". Then - in the Right Context area - select "adjective" from the POS list and specify a window size of 3 tokens. This generates a concordance of 482 lines in the BNC. You can further refine your search by specifying two POS in the Context section. In this case, if you select both "adjective" and "adverb" you will get a smaller concordance of 127 lines, with examples such as it tastes bloody awful and it tastes surprisingly good. (In order to select more than one POS from this Context list, you need to hold down the "Ctrl" key while clicking it; the POS you have selected will now have a blue background.)
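The idea of a context window can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the right_context_hits function and the toy tagged tokens are hypothetical and say nothing about how the Sketch Engine evaluates the query internally. A hit is a node token with the requested lemma and POS that has the context lemma and POS somewhere in the next few tokens.

```python
# Hypothetical (word, lemma, POS) triples, invented for illustration only.
TOKENS = [
    ("She", "she", "pronoun"),
    ("shook", "shake", "verb"),
    ("her", "her", "determiner"),
    ("head", "head", "noun"),
    ("and", "and", "conjunction"),
    ("shook", "shake", "verb"),
    ("the", "the", "determiner"),
    ("old", "old", "adjective"),
    ("dusty", "dusty", "adjective"),
    ("rug", "rug", "noun"),
]


def right_context_hits(tokens, lemma, pos, ctx_lemma, ctx_pos, window):
    """Positions of node tokens with a matching token in the right-hand window."""
    hits = []
    for i, (_, tok_lemma, tok_pos) in enumerate(tokens):
        if tok_lemma != lemma or tok_pos != pos:
            continue
        # The window covers the `window` tokens immediately after the node.
        following = tokens[i + 1 : i + 1 + window]
        if any(l == ctx_lemma and p == ctx_pos for _, l, p in following):
            hits.append(i)
    return hits


hits = right_context_hits(TOKENS, "shake", "verb", "head", "noun", window=3)
print(hits)
```

With a window of 3, the first shook is kept because head falls inside the window; the second is dropped because no head follows it within three tokens.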

You can clear any boxes simply by hitting the "Concordance" button on the top of the screen.

There are many more complex searches you can carry out using this feature - it is worth trying things out to see what is possible. For example, you could further refine the first search here (with head=Lemma and shake=Left Context) by also specifying a POS in the Right Context. Thus specifying "adverb" in the Right Context will generate lines such as shook his head disapprovingly, whereas specifying "noun" will generate shook their heads in agreement. There are very many searches one might try, though in practical lexicography (where severe time constraints operate) most searches are relatively simple.

Context searches can also be used to exclude unwanted items: thus you could key "weapons of" in the Phrase box, then exclude "destruction" by keying it into the Right Context Lemma box then selecting "None" from the "Query Type" drop-down list. This returns a concordance for any lines containing the string "weapons of" without the word "destruction".
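Exclusion works like a context search turned upside down: a hit is kept only if the unwanted item does not appear in the window. The following Python sketch is a simplification under our own assumptions (the phrase_hits_excluding function is hypothetical, the sample text is invented, and it matches plain word forms rather than lemmas).

```python
def phrase_hits_excluding(words, phrase, excluded, window):
    """Start positions of `phrase` with no `excluded` word in the right-hand window."""
    hits = []
    n = len(phrase)
    for i in range(len(words) - n + 1):
        if words[i : i + n] != phrase:
            continue
        # Look only at the tokens that follow the end of the phrase.
        following = words[i + n : i + n + window]
        if excluded not in following:
            hits.append(i)
    return hits


# Invented sample text, already split into lower-case word forms.
WORDS = "weapons of mass destruction and weapons of war".split()

hits = phrase_hits_excluding(WORDS, ["weapons", "of"], "destruction", window=3)
print(hits)
```

Only the second occurrence of weapons of survives, because destruction appears within three tokens of the first.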

The Text Type section

Close the Context section and open the Text Type section and you see the following:

Search form, text types

Here you can limit your search to a part of the corpus. If you want to see how a word behaves in the spoken part of the corpus, enter the word in the search box (or combine with other search specifications as described above: Text Type can be open at the same time as Keyword and Context) and tick the boxes for "Spoken context governed" and "Spoken demographic". Your concordance will contain only spoken-language examples. The searches which can be specified depend on the composition and header information in the corpus.

4. Manipulating your concordance output

Once you have generated a concordance, there are several options for increasing its usefulness.

The concordance screen looks like this:

Concordances

The top section

As before, the first row of six buttons running along the top will take you to other parts of the program. The buttons on the second row allow you to work on this concordance.

The blue-tinted box in the top right-hand corner tells you which corpus you are using, and how many hits match your search item. (Here the search item is haunt, and there are 1098 concordance lines.)

Second row buttons

View Options: lets you toggle between standard KWIC concordance view (which appears by default) and full sentence view, and also takes you to a new screen that allows you to change the concordance view in various ways. To summarise its functions:

  • the Attributes column allows you to change from the default display (in which only the text is visible in the concordance line) to a number of alternative views in which you can see POS-tags, lemmatized forms, and any other fields of information, either for the node word only ("KWIC tokens only") or for every word in the concordance line ("For each token"). The function will rarely be needed for lexicography, but it can be useful for finding out why an unexpected corpus line has matched a query, as the cause is sometimes an incorrect POS-tag or lemmatization
  • the Structures column allows you to change from the default display to show the beginning and end tags for structures such as sentences, paragraphs and documents. Again, this is unlikely to be needed in mainstream lexicography
  • the References column dictates the type of information regarding the source texts which appears (in blue) at the left-hand end of the concordance line. The default is an identifier for the document that the concordance line is taken from. Any other fields of information about corpus documents can be selected and the value that the concordance line has for that field will then be seen. For example, if the corpus encodes whether a document is imaginative writing or not, and the feature (e.g. "genre") is selected in the References column, then any concordance line that comes from an "imaginative" text will be identified in the left hand column.
  • the Page Size box (bottom left) allows you to specify a longer page length for the display: the default is that each page of concordances contains 20 lines, but you can increase this as far as 500 lines. (This will slow down initial retrieval of the concordance.)
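As a rough guide to what the page size means in practice, the arithmetic can be sketched as follows. The page_count and lines_on_page functions are hypothetical helpers of our own; the figures used are the 1098 hits for haunt and the page sizes of 20 and 500 mentioned above.

```python
import math


def page_count(hits, page_size):
    """Number of pages needed to display all concordance lines."""
    return math.ceil(hits / page_size)


def lines_on_page(hits, page_size, page):
    """Number of lines shown on a given page (pages are numbered from 1)."""
    start = (page - 1) * page_size
    return max(0, min(page_size, hits - start))


print(page_count(1098, 20), page_count(1098, 500))
print(lines_on_page(1098, 20, 55))
```

At the default size the concordance for haunt runs to 55 pages, the last of which is only partly filled; at the maximum size it fits on three.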

The Sample button: useful if you are looking at a very frequent search item. It allows you to create a random sample of the corpus lines, to any figure you specify. Thus, if you search for play=verb and decide that you do not want to analyse 37,632 lines, use the Sample button to reduce this to a manageable number.
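Random sampling of this kind can be illustrated with Python's standard random module. This is a sketch of the general technique, not of the Sketch Engine's own sampling code: the sample_lines function, the sample size of 250 and the seed are all our assumptions.

```python
import random


def sample_lines(total_lines, sample_size, seed=None):
    """Pick a random subset of line numbers, returned in corpus order."""
    rng = random.Random(seed)
    size = min(sample_size, total_lines)
    # rng.sample draws without replacement, so no line is chosen twice.
    return sorted(rng.sample(range(total_lines), size))


# 37,632 is the number of lines for play=verb; 250 is an arbitrary target.
sample = sample_lines(37632, 250, seed=1)
print(len(sample), sample[:5])
```

Because the lines are drawn without replacement and at random, the reduced concordance should still reflect the range of uses in the full one.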

The Sort buttons: you can either use the three small icons on a submenu to do a simple sort (one place to the left, sort by node word, or one place to the right) or use the Sort screen to specify a more complex sort procedure. Sorting is often a quick way of revealing patterns: a right sort of haunt shows 9 instances of haunt me for [TIME-PHRASE] in the BNC.
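The simple left and right sorts can be pictured as ordinary sorting on the word next to the node. In the Python sketch below the concordance lines are invented (they are not BNC data) and the helper names are hypothetical; the point is only that a right sort brings lines with the same following word together.

```python
# Invented concordance lines as (left context, node word, right context).
LINES = [
    ("memories that", "haunt", "me for years"),
    ("the ghosts which", "haunt", "the old house"),
    ("it would", "haunt", "him for the rest of his life"),
    ("images that still", "haunt", "Me for days"),
]


def first_word(text):
    """First word of a context string, lower-cased; empty if there is none."""
    words = text.split()
    return words[0].lower() if words else ""


def last_word(text):
    """Last word of a context string, lower-cased; empty if there is none."""
    words = text.split()
    return words[-1].lower() if words else ""


def sort_right(lines):
    """Sort by the word one place to the right of the node."""
    return sorted(lines, key=lambda line: first_word(line[2]))


def sort_left(lines):
    """Sort by the word one place to the left of the node."""
    return sorted(lines, key=lambda line: last_word(line[0]))


for left, node, right in sort_right(LINES):
    print(f"{left:>20} {node} {right}")
```

After the right sort the two haunt me for lines sit next to each other, which is how a recurring pattern becomes visible at a glance.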

The Frequency button allows you to view two types of frequency information regarding your search term:

  • Multilevel frequency distribution shows the frequency of each form of a given lemma. To see how this works, make a concordance for forge-verb: when the concordance arrives, go to the Frequency screen and the "Multilevel frequency distribution" option. The (default) "first level" shows you the frequencies of the forms "forge", "forged", "forging" and "forges". The second and third levels allow more complex searches of this type: for example if you check "second level" and select "1R" (=word one position to right of node word) you will see which words appear in this position and how frequent each of these words is.
  • Text type frequency distribution shows how your search term is distributed through the texts in the corpus. You may find, for example, that a word like police appears significantly more often in newspaper texts than in other text types. This is a potentially useful tool which could show you - for example - that a particular medical term is not restricted to specialised medical discourse. As with the references column in the View Options screen, the actual values you can select depend on the corpus you are using, and how it has been set up in the Sketch Engine.
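The first and second levels of a multilevel frequency distribution amount to counting, first by word form, then by word form together with the word in position 1R. A minimal Python sketch, using invented hits for forge-verb (the data and variable names are hypothetical, not BNC figures):

```python
from collections import Counter

# Invented hits: (form of the node word, word one position to the right).
HITS = [
    ("forged", "a"),
    ("forge", "links"),
    ("forging", "ahead"),
    ("forged", "links"),
    ("forges", "ahead"),
    ("forged", "a"),
]

# First level: frequency of each form of the lemma.
first_level = Counter(form for form, _ in HITS)

# Second level: frequency of each (form, 1R word) combination.
second_level = Counter(HITS)

for form, count in first_level.most_common():
    print(form, count)
for (form, right), count in second_level.most_common():
    print(form, right, count)
```

Each extra level simply adds one more attribute to the key being counted, which is why the second-level totals always add up to the first-level ones.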

The Collocation button allows you to generate lists of words that co-occur frequently with your node word (its "collocates"). Where word sketches are available, they give a more sophisticated account of collocates in most cases.

Moving around the concordance

You can move from one part of the concordance to another either by specifying a number in the "Page" box and hitting "Go", or by clicking on Next, Last, First or Previous.

Finding out about a particular concordance line

If you click on one of the node words, more of its context appears in the pane at the bottom of the screen, thus:

Working with concordances

and you can further expand the context by clicking on expand left and/or expand right.

To get information about the source-text a particular concordance line comes from, click the document-id code at the left-hand end of the relevant line (assuming you have not changed the "View option" relating to "references", as described above). This brings up "header" information in the bottom pane.

5. The Word Sketch function

A Word Sketch is a corpus-based summary of a word's grammatical and collocational behaviour.

Click on Word Sketch on the Home page, and this takes you to the Word Sketch entry form, which looks like this:

Word sketches form

Note that in the current version, this function works only with full corpora; you cannot (currently) view Word Sketches for subcorpora.

Choose a lemma and specify its part of speech using the drop-down list. Word Sketches are available for nouns, verbs, and adjectives, but not for other word classes. They also depend on the availability of substantial amounts of data, so if you try to create a Word Sketch for a fairly rare item (e.g. coagulate) you will see a message saying there is no Word Sketch available. (This is entirely logical: the point of the Word Sketches is to provide helpful summaries when there is too much corpus data to scan efficiently using a concordance; but there are only 19 concordance lines for coagulate, so it is easy enough to analyse them all "manually".) In general, you need several hundred instances of a word to make a useful word sketch.

The following screen-shot shows (part of) a Word Sketch for the noun challenge:

Word sketches results

Each column shows the words that typically combine with challenge in a particular grammatical relation (or "gramrel"). Most of these gramrels are self-explanatory. For example, "object_of" lists - in order of statistical significance rather than raw frequency - the verbs that most typically occupy the verb slot in cases where challenge is the object of a verb. Most of the data is lexicographically relevant, though one might query the adjectival modifier larval: it turns out that "larval challenge" is a technical term used in parasitology, discussed in a BNC document.

You can at any time switch between Concordance mode and Word Sketch mode, and this is a useful way of getting more information about a particular word combination. Thus, if you want to look at examples of the string "pose + challenge", simply click on the number next to "pose" in the object_of list (93) and you will be taken directly to a concordance showing all instances of this combination.

6. The Thesaurus function

The software checks to see which words occur with the same collocates as other words, and on the basis of this data it generates a "distributional thesaurus". The thesaurus function lists, for each adjective, noun or verb, the other words most similar to it in their use in the language.

Click on the Thesaurus button on the Home page, or in the bar at the top of any page, and then input the word you are interested in.
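The principle behind a distributional thesaurus, that words sharing many collocates count as similar, can be shown with a toy example. The similarity measure actually used by the Sketch Engine is not described in this guide; the sketch below substitutes plain Jaccard overlap, and the collocate profiles, the "modifies" relation name and the function names are all invented for illustration.

```python
# Invented collocate profiles: each word maps to a set of (gramrel, collocate) pairs.
PROFILES = {
    "clever": {
        ("and/or", "cunning"),
        ("and/or", "devious"),
        ("modifies", "idea"),
        ("modifies", "boy"),
    },
    "intelligent": {
        ("and/or", "mature"),
        ("modifies", "being"),
        ("modifies", "idea"),
        ("modifies", "boy"),
    },
    "heavy": {
        ("modifies", "rain"),
        ("modifies", "load"),
    },
}


def jaccard(a, b):
    """Shared items divided by all items; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(word, profiles):
    """Other words ranked by how much of their collocate profile they share."""
    scores = [
        (other, jaccard(profiles[word], profile))
        for other, profile in profiles.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = most_similar("clever", PROFILES)
print(ranking)
```

In this toy data intelligent ranks above heavy for clever because the two adjectives share collocates in the same grammatical relations, while heavy shares none.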

7. The Sketch Difference function

Sketch Difference is a neat way of comparing two very similar words: it shows those patterns and combinations that the two items have in common, and also those patterns and combinations that are more typical of, or unique to, one word rather than the other. Click on any word in a Thesaurus entry for a word, and you will be taken straight to a screen showing the Sketch Difference between the two words. Alternatively, you can click on Word Sketch Difference on the Home page, or hit the "Sketch-Diff" button at the top of any screen, and this will take you to the sketch difference entry form.

Suppose you want to compare clever and intelligent. In the thesaurus entry for clever, intelligent comes top of the list: it is the most similar word. Click on intelligent and you are taken to a new screen, which is in three main parts: the first part shows "Common Patterns" (those combinations where clever and intelligent behave quite similarly), the second and third parts show "clever only patterns" and "intelligent only patterns". Part of it is shown here:

Sketch difference

In the "Common Patterns" part, there are four numbers next to each collocate. The first two indicate the frequency of co-occurrence with the first and second lemma; the last two show the salience scores for the collocate with each lemma. All collocates are sorted according to the maximum of the two salience scores and coloured according to the difference between the scores.
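The ordering and colouring rule for the "Common Patterns" part can be sketched as follows. The rows and their numbers are invented, the Row type and function names are hypothetical, and we assume the list runs from highest to lowest maximum salience, which the description above does not state explicitly.

```python
from typing import NamedTuple


class Row(NamedTuple):
    collocate: str
    freq_first: int         # co-occurrence frequency with the first lemma
    freq_second: int        # co-occurrence frequency with the second lemma
    salience_first: float   # salience score with the first lemma
    salience_second: float  # salience score with the second lemma


# Invented figures for illustration only.
ROWS = [
    Row("boy", 12, 9, 4.1, 3.8),
    Row("idea", 30, 2, 7.5, 1.2),
    Row("being", 1, 25, 0.9, 8.3),
]


def ordered(rows):
    """Sort by the larger of the two salience scores (assumed: highest first)."""
    return sorted(
        rows,
        key=lambda r: max(r.salience_first, r.salience_second),
        reverse=True,
    )


def difference(row):
    """Positive values lean towards the first lemma, negative towards the second."""
    return row.salience_first - row.salience_second


for row in ordered(ROWS):
    print(row.collocate, round(difference(row), 1))
```

A difference near zero marks a collocate that both words share about equally, while a large positive or negative value marks one that leans strongly towards one of the two.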

Try this out, and look at the difference in the "and/or" lists: people can be "intelligent and generous/mature/stable" etc, but they are often "clever and devious/cunning/dangerous".

Lexical Computing Ltd
