= Getting started with the Sketch Engine =

== 1. Background ==

The Sketch Engine is a web-based program which takes as its input a corpus of any language with an appropriate level of linguistic mark-up. It has a number of language-analysis functions, the core ones being:

 * the Concordancer (which is very fast and offers a high level of functionality)
 * the Word Sketch program (described in Section 5 below)

For the purposes of this guide, we use examples based on the Sketch Engine loaded with a sample corpus of English, the British National Corpus (BNC).

For more information about the Sketch Engine, see [attachment:wiki:SkE/DocsIndex:sketch-engine-elx04.pdf Kilgarriff et al 2004 in Proc EURALEX]. For more information about the BNC, see [http://www.natcorp.ox.ac.uk/ http://www.natcorp.ox.ac.uk/]

== 2. Home page ==

The software is on the Sketch Engine website: [http://www.sketchengine.co.uk/ http://www.sketchengine.co.uk/]

Follow the links from this page to either set up an account, or log in. The "home" screen looks like this:

[[Image(ske_homepage.jpg)]]

Here you can select your corpus (or use a couple of other tools). We want to explore the BNC, so we click on that.

== 3. Generating a concordance ==

We then see the search form:

[[Image(ske_searchform.jpg)]]

The six buttons along the top will take you to other parts of the program, or back to the start ("Home").

You enter the main search term in the query box. If, like the BNC, the corpus is lemmatized, the term will match the lemma as well as the word form. If you enter '''save''', the Sketch Engine will generate a concordance of all of the following:

 * '''save-saved-saves-saving''' (verb)
 * '''save-saves''' (noun - what goalkeepers make)
 * '''save''' (preposition: ''everyone was killed '''save''' Franco himself'')

You can also enter phrases in the query box. Press Enter, or click on '''Make Concordance''', to see the concordance.
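
The lemma-matching behaviour described above can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the token list, the POS labels and the function name `find_hits` are invented for illustration and say nothing about how the Sketch Engine is actually implemented.

```python
# Toy lemmatized corpus: each token is (word form, lemma, part of speech).
# All tokens and POS labels here are invented for illustration.
TOKENS = [
    ("keeper", "keeper", "noun"),
    ("made", "make", "verb"),
    ("two", "two", "numeral"),
    ("saves", "save", "noun"),
    ("she", "she", "pronoun"),
    ("saved", "save", "verb"),
    ("it", "it", "pronoun"),
    ("all", "all", "pronoun"),
    ("save", "save", "preposition"),
    ("one", "one", "numeral"),
]


def find_hits(term, tokens):
    """Return the positions of tokens whose word form or lemma equals `term`."""
    term = term.lower()
    return [
        position
        for position, (form, lemma, _pos) in enumerate(tokens)
        if form.lower() == term or lemma == term
    ]


# "save" matches the noun, the verb and the preposition via the lemma;
# "saved" only matches the single word form.
print(find_hits("save", TOKENS))
print(find_hits("saved", TOKENS))
```

Searching on the lemma is what makes a single query for '''save''' return every inflected form, whatever its part of speech.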

To make more specific searches, click on the '''+''' beside '''Keyword''', '''Context''' or '''Text Types''' and more options become available. First, we click on '''Keyword''':

[[Image(ske_searchform_keywords.jpg)]]

''The Lemma box'': here you can enter the lemma with a particular part of speech (e.g. '''save, noun'''). (Here and below we assume the corpus is, like the BNC, lemmatized and part-of-speech tagged. If it is not, not all of these options are available.)

''The Phrase box'': here you can enter any multiword expression, such as a compound noun or preposition (like '''business school''' or '''in preference to''') or a longer string, such as '''you must be joking''' or '''weapons of mass destruction'''. Note also that any large corpus will have assorted errors in the markup, which occasionally result in search terms not appearing when you think they should. In this case, try putting your search term in the Phrase box instead of the Lemma box, as this field is matched directly against the text, without further analysis.

''The Word Form box'': this allows you to search for a specific word form, such as '''burns''', and you can optionally specify that you are looking for '''burns''' as a verb or '''burns''' as a plural noun. You can make your search case-sensitive by checking the "match case" box: this lets you search for '''Bush''' but not '''bush''', or for '''pole''' but not '''Pole'''.

''The CQL box'' is for inputting complex queries using Corpus Query Language, described in ''Corpus Querying and Grammar Writing'', available from the Sketch Engine home page. "Default attribute" controls how CQL queries will be understood. The "tagset summary" box gives details of the part-of-speech tags used in the tagging.

If you do not want to specify context any more precisely, you are now ready to hit the "Make Concordance" button and see the concordance. You will find more information about manipulating the output in Section 4 below.
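
As an aside on the CQL box, the Lemma, Phrase and Word Form searches described above can each be written as a CQL query. The Python sketch below only builds the query strings. The helper names `token` and `phrase` are hypothetical, the attribute names `word`, `lemma` and `tag` are assumed to be defined in your corpus, and the tag patterns `N.*` and `V.*` are assumptions too: check the "tagset summary" box for the tags your corpus really uses.

```python
def token(word=None, lemma=None, tag=None):
    """Build one CQL token expression, e.g. [lemma="save" & tag="N.*"].

    The attribute names and tag patterns are assumptions about the corpus.
    """
    conditions = []
    if word is not None:
        conditions.append(f'word="{word}"')
    if lemma is not None:
        conditions.append(f'lemma="{lemma}"')
    if tag is not None:
        conditions.append(f'tag="{tag}"')
    return "[" + " & ".join(conditions) + "]"


def phrase(words):
    """Build a CQL query matching the given word forms in sequence."""
    return " ".join(token(word=w) for w in words)


# Lemma box: "save" as a noun (tag pattern assumed).
print(token(lemma="save", tag="N.*"))
# Phrase box: a multiword expression matched on word forms.
print(phrase(["in", "preference", "to"]))
# Word Form box: "burns" as a verb (tag pattern assumed).
print(token(word="burns", tag="V.*"))
```

Pasting the printed strings into the CQL box should behave like the corresponding form fields, provided the attribute names and tags match your corpus.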

If however you would like to limit your search to a specific context or text type, read on.

=== The Context section ===

Close the Keyword section by clicking the "-" beside Keyword, and open the Context section by clicking the + beside Context and you see the following:

[[Image(ske_searchform_context.jpg)]]

Here you can specify the right and/or left context of your search word, within a window of up to ten items on either side of the search word (though in practice you are unlikely to need such a large window). As context, you can specify __either__ a particular word __or__ one or more word classes (POS). Here are some examples:

 1. You want to search for the string '''shake''' (verb) followed by '''head''' (noun), to find instances such as ''she shook her head'', ''if you agree shake your head'', and ''shaking their heads in disbelief...'' You can do the following:
   * ''either'' key "shake" in the Lemma box and specify the POS "verb". Then key "head" in the Right Context lemma box and specify a window size (say 3 tokens)
   * ''or'' key "head" in the Lemma box and specify the POS "noun". Then key "shake" in the Left Context lemma box and specify a window size (say 3 tokens)
   The results will be the same whichever route you take.
 2. You want to search for the verb '''taste''' followed by ''any'' adjective; since a following adjective may appear either in position 1 (''it tastes horrible''), position 2 (''it tastes really delicious''), or even position 3 (''it didn't taste quite so good''), key "taste" in the Lemma box and specify the POS "verb". Then - in the Right Context area - select "adjective" from the POS list and specify a window size of 3 tokens. This generates a concordance of 482 lines in the BNC.

You can further refine your search by specifying ''two'' POS in the Context section.
In this case, if you select both "adjective" and "adverb" you will get a smaller concordance of 127 lines, with examples such as ''it tastes bloody awful'' and ''it tastes surprisingly good''. (In order to select more than one POS from this Context list, you need to hold down the "Ctrl" key while clicking each one; the POS you have selected will now have a blue background.) You can clear any boxes simply by hitting the "Concordance" button at the top of the screen.

There are many more complex searches you can carry out using this feature - it is worth trying things out to see what is possible. For example, you could further refine the first search here (with '''head'''=Lemma and '''shake'''=Left Context) by ''also'' specifying a POS in the Right Context. Thus specifying "adverb" in the Right Context will generate lines such as ''shook his head __disapprovingly__'', whereas specifying "noun" will generate ''shook their heads in __agreement__''. There are very many searches one might try, though in practical lexicography (where severe time constraints operate) most searches are relatively simple.

Context searches can also be used to ''exclude'' unwanted items: thus you could key "weapons of" in the Phrase box, then exclude "destruction" by keying it into the Right Context Lemma box and selecting "None" from the "Query Type" drop-down list. This returns a concordance of all lines containing the string "weapons of" ''without'' the word "destruction".

=== The Text Type section ===

Close the Context section and open the Text Type section and you see the following:

[[Image(ske_searchform_texttypes.jpg)]]

Here you can limit your search to a part of the corpus. If you want to see how a word behaves in the spoken part of the corpus, enter the word in the search box (or combine with other search specifications as described above: Text Type can be open at the same time as Keyword and Context) and tick the boxes for "Spoken context governed" and "Spoken demographic".
Your concordance will contain only spoken-language examples. The searches which can be specified depend on the composition and header information in the corpus.

== 4. Manipulating your concordance output ==

Once you have generated a concordance, there are several options for increasing its usefulness. The concordance screen looks like this:

[[Image(ske_concordances.jpg)]]

=== The top section ===

As before, the first row of six buttons running along the top will take you to other parts of the program. The buttons on the second row allow you to work on this concordance. The blue-tinted box in the top right-hand corner tells you which corpus you are using, and how many hits match your search item. (Here the search item is '''haunt''', and there are 1098 concordance lines.)

=== Second row buttons ===

''View Options'': lets you toggle between the standard KWIC concordance view (which appears by default) and full sentence view, and also takes you to a new screen that allows you to change the concordance view in various ways. To summarise its functions:

 * the ''Attributes'' column allows you to change from the default display (in which only the text is visible in the concordance line) to a number of alternative views in which you can see POS-tags, lemmatized forms, and any other fields of information, either for the node word only ("KWIC tokens only") or for every word in the concordance line ("For each token"). The function will rarely be needed for lexicography, but it can be useful for finding out why an unexpected corpus line has matched a query, as the cause is sometimes an incorrect POS-tag or lemmatization
 * the ''Structures'' column allows you to change from the default display to show the beginning and end tags for structures such as sentences, paragraphs and documents.
   Again, this is unlikely to be needed in mainstream lexicography
 * the ''References'' column dictates the type of information regarding the source texts which appears (in blue) at the left-hand end of the concordance line. The default is an identifier for the document that the concordance line is taken from. Any other fields of information about corpus documents can be selected and the value that the concordance line has for that field will then be seen. For example, if the corpus encodes whether a document is imaginative writing or not, and the feature (e.g. "genre") is selected in the ''References'' column, then any concordance line that comes from an "imaginative" text will be identified in the left-hand column.
 * the ''Page Size'' box (bottom left) allows you to specify a longer page length for the display: the default is that each page of concordances contains 20 lines, but you can increase this as far as 500 lines. (This will slow down initial retrieval of the concordance.)

''The Sample button'': useful if you are looking at a very frequent search item. It allows you to create a random sample of the corpus lines, to any figure you specify. Thus if you search for '''play'''=verb and decide that you do not want to analyse 37,632 lines, use the Sample button to reduce this to a manageable number.

''The Sort buttons'': you can ''either'' use the three small icons on a submenu to do a simple sort (one place to the left, sort by node word, or one place to the right) ''or'' use the Sort screen to specify a more complex sort procedure. Sorting is often a quick way of revealing patterns: a right sort of '''haunt''' shows 9 instances of ''haunt me for [TIME-PHRASE]'' in the BNC.
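
The effect of the Sort and Sample buttons can be sketched in a few lines of Python. This is a toy illustration, not the Sketch Engine's own code: the concordance lines are invented, the function names `sort_right` and `sample` are hypothetical, and keeping the sampled lines in their original order is simply a choice made for this sketch.

```python
import random

# Each concordance line is (left context, node word, right context),
# with the contexts held as lists of tokens. The lines are invented.
LINES = [
    (["will", "come", "back", "to"], "haunt", ["me", "for", "years"]),
    (["the", "ghosts", "that"], "haunt", ["the", "old", "house"]),
    (["memories", "still"], "haunt", ["her", "today"]),
    (["it", "would"], "haunt", ["me", "for", "ever"]),
]


def sort_right(lines, positions=1):
    """Sort lines by the first `positions` tokens to the right of the node."""
    return sorted(lines, key=lambda line: [t.lower() for t in line[2][:positions]])


def sample(lines, size, seed=None):
    """Return a random sample of at most `size` lines, in original order."""
    if size >= len(lines):
        return list(lines)
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(lines)), size))
    return [lines[i] for i in chosen]


# After a right sort, the two "haunt me for ..." lines end up next to
# each other, which is how sorting makes recurring patterns visible.
for left, node, right in sort_right(LINES, positions=3):
    print(" ".join(left), f"<{node}>", " ".join(right))

print(len(sample(LINES, 2, seed=1)))
```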

''The Frequency button'' allows you to view two types of frequency information regarding your search term:

 * ''Multilevel frequency distribution'' shows the frequency of each form of a given lemma. To see how this works, make a concordance for '''forge'''=verb: when the concordance arrives, go to the Frequency screen and the "Multilevel frequency distribution" option. The (default) "first level" shows you the frequencies of the forms "forge", "forged", "forging" and "forges". The second and third levels allow more complex searches of this type: for example if you check "second level" and select "1R" (=word one position to right of node word) you will see which words appear in this position and how frequent each of these words is.
 * ''Text type frequency distribution'' shows how your search term is distributed through the texts in the corpus. You may find, for example, that a word like '''police''' appears significantly more often in newspaper texts than in other text types. This is a potentially useful tool which could show you - for example - that a particular medical term is not restricted to specialised medical discourse. As with the ''References'' column in the ''View Options'' screen, the actual values you can select depend on the corpus you are using, and how it has been set up in the Sketch Engine.

''The Collocation button'' allows you to generate lists of words that co-occur frequently with your node word (its "collocates"). Where word sketches are available, they give a more sophisticated account of collocates in most cases.

=== Moving around the concordance ===

You can move from one part of the concordance to another either by specifying a number in the "Page" box and hitting "Go", or by clicking on __Next__, __Last__, __First__ or __Previous__.
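
Returning briefly to the multilevel frequency distribution described under the Frequency button, its first and second levels amount to simple counting. The Python sketch below assumes each hit has been reduced to a pair of node word form and the word one position to its right ("1R"). The data and the function names `first_level` and `second_level` are invented for illustration.

```python
from collections import Counter

# Each hit is (node word form, word one position to the right, i.e. "1R").
# The hits are invented and are not real BNC counts.
HITS = [
    ("forged", "a"),
    ("forging", "ahead"),
    ("forged", "links"),
    ("forge", "a"),
    ("forged", "a"),
    ("forges", "ahead"),
]


def first_level(hits):
    """Count how often each form of the node word occurs."""
    return Counter(form for form, _right in hits)


def second_level(hits):
    """Count each combination of node word form and 1R word."""
    return Counter(hits)


print(first_level(HITS).most_common())
print(second_level(HITS).most_common(1))
```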

=== Finding out about a particular concordance line ===

If you click on one of the node words, more of its context appears in the pane at the bottom of the screen, thus:

[[Image(ske_concordances2.jpg)]]

and you can further expand the context by clicking on __expand left__ and/or __expand right__.

To get information about the source-text a particular concordance line comes from, click the document-id code at the left-hand end of the relevant line (assuming you have not changed the "View option" relating to "references", as described above). This brings up "header" information in the bottom pane.

== 5. The Word Sketch function == #wordsketchid

A Word Sketch is a corpus-based summary of a word's grammatical and collocational behaviour. Click on __Word Sketch__ on the Home page, and this takes you to the Word Sketch entry form, which looks like this:

[[Image(ske_wsform.jpg)]]

Note that in the current version, this function works only with full corpora; you cannot (currently) view Word Sketches for subcorpora.

Choose a lemma and specify its part of speech using the drop-down list. Word Sketches are available for nouns, verbs, and adjectives, but not for other word classes. They also depend on the availability of substantial amounts of data, so if you try to create a Word Sketch for a fairly rare item (e.g. '''coagulate''') you will see a message saying there is no Word Sketch available. (This is entirely logical: the point of the Word Sketches is to provide helpful summaries when there is too much corpus data to scan efficiently using a concordance; but there are only 19 concordance lines for '''coagulate''', so it is easy enough to analyse them all "manually".) In general, you need several hundred instances of a word to make a useful word sketch.

The following screen-shot shows (part of) a Word Sketch for the noun '''challenge''':

[[Image(ske_wsresults.jpg)]]

Each column shows the words that typically combine with '''challenge''' in a particular grammatical relation (or "gramrel"). Most of these gramrels are self-explanatory. For example, "object_of" lists - in order of statistical significance rather than raw frequency - the verbs that most typically occupy the verb slot in cases where '''challenge''' is the object of a verb. Most of the data is lexicographically relevant, though one might query the adjectival modifier ''larval'': it turns out that "larval challenge" is a technical term used in parasitology, discussed in a BNC document.

You can at any time switch between Concordance mode and Word Sketch mode, and this is a useful way of getting more information about a particular word combination. Thus, if you want to look at examples of the string "pose + '''challenge'''", simply click on the number next to "pose" in the '''object_of''' list (__93__) and you will be taken directly to a concordance showing all instances of this combination.

== 6. The Thesaurus function == #distributionalthesaurusid

The software checks to see which words occur with the same collocates as other words, and on the basis of this data it generates a "distributional thesaurus". The thesaurus function lists, for each adjective, noun or verb, the other words ''most similar'' to it in their use in the language. Click on the Thesaurus button on the Home page, or in the bar at the top of any page, and then input the word you are interested in.

== 7. The Sketch Difference function == #sketchdiffid

Sketch Difference is a neat way of comparing two very similar words: it shows those patterns and combinations that the two items have in common, and also those patterns and combinations that are more typical of, or unique to, one word rather than the other.
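
The basic idea of a three-way comparison can be sketched in Python as below. This is a deliberately simplified model: it treats each word's collocates as a plain set, whereas the real function works with frequencies and salience scores. The collocate lists and the function name `sketch_difference` are invented for illustration.

```python
def sketch_difference(collocates_a, collocates_b):
    """Split two collocate sets into shared and word-specific items."""
    common = collocates_a & collocates_b
    return {
        "common": sorted(common),
        "only_a": sorted(collocates_a - common),
        "only_b": sorted(collocates_b - common),
    }


# Invented collocate sets for two near-synonyms (not real corpus data).
strong = {"tea", "wind", "argument", "leader"}
powerful = {"computer", "engine", "argument", "leader"}

result = sketch_difference(strong, powerful)
for part, words in result.items():
    print(part, words)
```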

Click on any word in a Thesaurus entry for a word, and you will be taken straight to a screen showing the Sketch Difference between the two words. Alternatively, you can click on __Word Sketch Difference__ on the Home page, ''or'' hit the "Sketch-Diff" button at the top of any screen, and this will take you to the sketch difference entry form.

Suppose you want to compare '''clever''' and '''intelligent'''. In the thesaurus entry for '''clever''', '''intelligent''' comes top of the list: it is the most similar word. Click on '''intelligent''' and you are taken to a new screen that is in three main parts: the first part shows "Common Patterns" (those combinations where '''clever''' and '''intelligent''' behave quite similarly), the second and third parts show "clever only patterns" and "intelligent only patterns". Part of it is shown here:

[[Image(ske_sketchdiff.jpg)]]

In the "Common Patterns" part, there are four numbers next to each collocate. The first two indicate the frequency of co-occurrence with the first and second lemma, the last two show the salience scores for the collocate with both lemmas. All collocates are sorted according to the maximum of the two salience scores and coloured according to the difference between the scores.

Try this out, and look at the difference in the "and/or" lists: people can be "intelligent and generous/mature/stable" etc, but they are often "clever and devious/cunning/da