Corpus Querying and Grammar Writing for the Sketch Engine
The four sections on this page describe:
- 1. The corpus query language
- 2. Grammatical relation definitions
- 3. Grammatical relations file: example
- 4. Macros in m4
1. The Corpus Query Language (CQL)
The language was developed at the Corpora and Lexicons group, IMS, University of Stuttgart in the early 1990s; see IMS Corpus Workbench. The CQL as used in Sketch Engine is an extension of the original language and differs from it in several ways. This documentation describes the CQL as implemented in manatee 2.45 (released October 2011).
- A query consists of a regular expression over attribute expressions and/or structures.
- The attributes used in the examples provided below are word and tag. These examples assume that in our corpus every word has an associated part-of-speech tag referred to as tag.
Simple attribute-value queries
- The general form to query a positional attribute value is
[attr="value"]
- For example, very often, you only want to look for a given word (e.g. teapot), so attr would be word and the value would be teapot
[word="teapot"]
- You might want to broaden the search, for example you want to find all occurrences of words beginning with confus. The full form is
[word="confus.*"]
but you can make use of the so-called default attribute (in this example selected as word), so we can simply use
"confus.*"
The default attribute can be changed using the drop-down list under the CQL box.
- Case is significant to the query processor. If you want a case-insensitive search, include (?i) in the string:
"(?i)on"
- We often want a wild-card word: any single word, it doesn't matter which. We use the "match any token" operator [] (similar to the dot for "match any character" in regular expressions over strings):
"confus.*" [] "by"
This query finds all sequences of a word beginning with confus, followed by any word, followed by by.
- We search for exactly two words between confus.* and by with
"confus.*" []{2} "by"
- We search for between 0 and 3 words between confus.* and by with
"confus.*" []{0,3} "by"
- The following comparison operators are also possible: =, !=, <=, >=, !<=, !>=, ==, !==.
For the <=, >=, !<= and !>= operators, the attribute value is compared in such a way that alphabetical parts of the value are compared lexicographically and numerical parts numerically. This feature is intended mainly for structure attributes, so that one can search for <doc id>="AB2010CD"> (the operator here is >=) and the result will include documents with an id such as "BB0000CD", "AB2011CD" or "AB2010CE".
The == and !== operators differ from their single-character counterparts in how they treat regular expression meta-characters. Normally such characters have to be escaped with a backslash to be matched literally, i.e. to find all dots, one needs to query for [word="\."]. Since this can be cumbersome, one can use == and !==, which evaluate the value as a fixed string and not as a regular expression. Note that even with == and !==, two characters still need to be escaped: the quote (") and the backslash (\).
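The mixed comparison used by <=, >=, !<= and !>= can be pictured with a short Python sketch. This is only an illustration of the rule as described above, not manatee's implementation; the function names mixed_key and ge are made up for the example.

```python
import re

def mixed_key(value):
    """Split a value into digit runs and non-digit runs for comparison."""
    parts = re.findall(r"[0-9]+|[^0-9]+", value)
    # Tag each part so that a number is never compared directly with text:
    # numeric parts compare as integers, other parts as strings.
    return [(0, int(p), "") if p[0] in "0123456789" else (1, 0, p)
            for p in parts]

def ge(value, bound):
    """Rough analogue of attr >= "bound" (an assumption, see lead-in)."""
    return mixed_key(value) >= mixed_key(bound)

ids = ["BB0000CD", "AB2011CD", "AB2010CE", "AB2009CD", "AA9999ZZ"]
selected = [i for i in ids if ge(i, "AB2010CD")]
print(selected)
```

Under these assumptions "AB999CD" sorts below "AB2010CD", because 999 is compared with 2010 as a number rather than character by character.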
Regular expressions at positional attribute level
- We use regular expressions at two levels in our query system: at the level of attribute expressions and within string values.
- The regular expression operators available are:
- disjunction (|),
- Kleene star (*, as in our "confus.*" example above, this matches any number of repetitions, including 0),
- plus operator (+, matching 1 or more repetitions),
- optionality operator (?, optional, i.e. matches 0 or 1 occurrence)
- the interval operator
{n, k}
matches between n and k repetitions. If k is omitted, at least n repetitions are matched. If the interval has the form
{n}
exactly n repetitions are matched. The examples below will clarify this.
- Each attribute expression is -- roughly speaking -- evaluated against the word (and/or other, additional attributes) at a given corpus position. It has the form
[Boolean expression]
that is, an attribute expression is a boolean expression surrounded by brackets.
- A boolean expression is a set of attribute value tests, combined with the usual boolean expression operators conjunction (&), disjunction (|) and negation (!). Parentheses may be used in the usual way.
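One way to picture the evaluation of an attribute expression is as a predicate applied to a single token. The Python sketch below models a token as a dictionary of attribute values and assumes that a value test must match the whole attribute value (which is why the examples above write "confus.*" rather than "confus"). The helper names attr_test, all_of, any_of and negated are hypothetical.

```python
import re

def attr_test(attr, pattern, negate=False):
    """Predicate for attr="pattern", or attr!="pattern" when negate is set."""
    regex = re.compile(pattern)
    def check(token):
        matched = regex.fullmatch(token[attr]) is not None
        return matched != negate
    return check

def all_of(*preds):
    """Conjunction, the & operator."""
    return lambda token: all(p(token) for p in preds)

def any_of(*preds):
    """Disjunction, the | operator."""
    return lambda token: any(p(token) for p in preds)

def negated(pred):
    """Negation, the ! operator."""
    return lambda token: not pred(token)

# Corresponds to [tag="V.*" & word=".*ed"]
verb_in_ed = all_of(attr_test("tag", "V.*"), attr_test("word", ".*ed"))
# Corresponds roughly to [!(tag="N.*" | tag="JJ")]
not_nominal = negated(any_of(attr_test("tag", "N.*"), attr_test("tag", "JJ")))

token = {"word": "confused", "tag": "VBN"}
print(verb_in_ed(token), not_nominal(token))
```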
Queries searching for structures
- It is possible to use structures in your search. If s is a valid structure in your corpus, then <s> matches the beginning of the structure, </s> matches its end and <s/> matches the whole structure including all tokens inside it.
- Just as positional attributes are included in the query, one can restrict the search to particular structures by their structure attribute values. The following will find the beginnings of all documents with an id of 2011 in which a noun occurs at the very beginning of the document, followed by an arbitrary number of unspecified words, and finally by a verb.
<doc id="2011"> [tag="N.*"] []* [tag="VB.*"]
- The N-th structure (in the order as appearing in the corpus) might be selected using the <doc #N> syntax, e.g. to retrieve the fifth document, one would use:
<doc #5>
- The negation of the previous query ("all documents except for the fifth") is possible as well:
<doc !#5>
Combining queries using within and containing operators
- If the corpus has sentence, paragraph or document markup, then rather than constraining the match by specifying a number of tokens, we can require it to lie within a unit (here s, for sentence). We search for a word beginning with confus, followed by by, within a sentence with:
"confus.*" []* "by" within <s/>
- A generalization of the previous example is "QUERY within QUERY" so that you can match e.g. all noun phrases within a sequence starting and ending with a verb:
[tag="N.*"]+ within [tag="VB.*"] []* [tag="VB.*"]
Note that while the entire expression is matched, only the first query before within is highlighted in the concordance as the node or KWIC.
- As the counterpart to the within query, there is also a containing query with the obvious semantics. You can, for example, match all sentences containing more than one noun:
<s/> containing []* [tag="N.*"] []* [tag="N.*"] []*
- Similarly, you can generalize to "QUERY containing QUERY" and construct a query matching a sequence starting and ending with a verb and containing at least one noun:
[tag="VB.*"] []* [tag="VB.*"] containing [tag="N.*"]
- Both of the within/containing queries support a shortcut of within/containing NUMBER which expands to within/containing []{NUMBER}.
- The within and containing operators might be mutually nested in an arbitrary way, making it possible to formulate complex queries like the following one which tries to look up particles:
[tag="PR.*"] within [tag="V.*"] [tag="AT0"]? [tag="AJ0"]* [tag="(PR.?|N.*)"] [tag="PR.*"] within <s/>
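The semantics of within and containing can be sketched as filters over match spans. In the Python sketch below a span is an inclusive (start, end) pair of corpus positions; all span values are invented for the illustration, and the two function names simply mirror the operators.

```python
def within(matches, containers):
    """Keep matches that lie entirely inside at least one container span."""
    return [m for m in matches
            if any(c[0] <= m[0] and m[1] <= c[1] for c in containers)]

def containing(matches, inner):
    """Keep matches that enclose at least one of the inner spans."""
    return [m for m in matches
            if any(m[0] <= i[0] and i[1] <= m[1] for i in inner)]

sentences = [(0, 4), (5, 9)]            # hypothetical <s/> spans
candidates = [(1, 3), (3, 6), (7, 8)]   # hypothetical query matches
nouns = [(2, 2)]                        # hypothetical noun positions

print(within(candidates, sentences))    # the span (3, 6) crosses a boundary
print(containing(sentences, nouns))     # only the first sentence has a noun
```

Because both functions take and return lists of spans, they can be nested freely, which mirrors the nesting of within and containing in a query.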
Contextual queries using meet and union operators
- meet queries represent a specific type of contextual query: let's say you want to match every noun that has a verb within three tokens to its left or right (a -3/+3 context). You can achieve this using the following query:
(meet [tag="N.*"] [tag="VB.*"] -3 3)
Only the first part ([tag="N.*"]) is highlighted as KWIC in the concordance; the [tag="VB.*"] is used as a contextual filter in the search.
- union queries can be used to collect the results of meet queries. E.g. if you'd like to extend the previous example by all adjectives surrounded by a verb in -2/+2 context, you can do that in the following way:
(union (meet [tag="N.*"] [tag="VB.*"] -3 3) (meet [tag="A.*"] [tag="VB.*"] -2 2))
- Both meet and union may occur wherever a positional attribute might be placed and can be combined with within/containing queries as demonstrated in the example below:
containing (meet [lemma="have"] [tag="P.*"] -5 5) containing (meet [tag="N.*"] [lemma="blue"])
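As a rough model of meet and union, the following Python sketch works on a toy tag sequence. It assumes Penn-style tags, so "JJ.*" stands in for the "A.*" of the example above, and the helper names are hypothetical.

```python
import re

def positions(tags, pattern):
    """Corpus positions whose tag matches the pattern as a whole."""
    return [i for i, t in enumerate(tags) if re.fullmatch(pattern, t)]

def meet(kwic, context, left, right):
    """Keep KWIC positions that have a context token at an offset in [left, right]."""
    ctx = set(context)
    return [p for p in kwic
            if any(p + off in ctx for off in range(left, right + 1))]

def union(*results):
    """Merge several result lists into one sorted list without duplicates."""
    return sorted(set().union(*results))

tags = ["DT", "NN", "IN", "DT", "JJ", "NN", "VBZ",
        "JJ", "CC", "DT", "NN", "IN", "DT", "NN"]
nouns = positions(tags, "N.*")
verbs = positions(tags, "VB.*")
adjectives = positions(tags, "JJ.*")

noun_hits = meet(nouns, verbs, -3, 3)
adj_hits = meet(adjectives, verbs, -2, 2)
print(union(noun_hits, adj_hits))
```

Only the KWIC positions are returned; the verb positions act purely as a filter, as in the concordance.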
Global conditions
- A global conditions part can be appended to a query; it imposes additional global constraints on positional attribute values. To make use of it, the relevant positions must be prefixed by a numeric label, such as 1:[word="car"].
- Global conditions are introduced using the & operator and may occur only at the very end of the query.
- The example below would retrieve all neighbouring pairs of words with the same tag:
1:[] 2:[] & 1.tag = 2.tag
- A frequency function might be used to further limit the search:
1:[] 2:[] & 1.tag = 2.tag & f(1.tag) > 1000
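The two global-condition queries above can be mimicked on a toy tag sequence. In this Python sketch the frequency function f is approximated by counting tags in the toy data, and the threshold is lowered accordingly; same_tag_pairs is a hypothetical name.

```python
from collections import Counter

def same_tag_pairs(tags, min_freq=0):
    """Positions i where tokens i and i+1 share a tag occurring more than min_freq times."""
    freq = Counter(tags)  # stands in for the corpus-wide f(1.tag)
    return [i for i in range(len(tags) - 1)
            if tags[i] == tags[i + 1] and freq[tags[i]] > min_freq]

tags = ["DT", "JJ", "JJ", "NN", "NN", "VBZ", "JJ", "JJ"]

print(same_tag_pairs(tags))              # like 1:[] 2:[] & 1.tag = 2.tag
print(same_tag_pairs(tags, min_freq=2))  # adds a frequency condition
```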
Some query examples
Look for...
- thank starting with either upper or lower case:
"[tT]hank"
- a word beginning with confuse, followed by a preposition or a personal pronoun:
"confuse.*" [tag="IN" | tag="PP"]
"confuse.*" ([tag="IN"] | [tag="PP"])
"confuse.*" [tag="IN|PP"]
The three alternatives have the same effect, but are handled at a different level of evaluation: the first at the level of boolean expressions, the second at the level of attribute expressions, and the third at the level of regular expressions over the character alphabet.
- the same, but with at most 10 words in between:
"confuse.*" []{0,10} [tag="IN" | tag="PP"]
- the same, but without full stops in between:
"confuse.*" [word!="\."]{0,10} [tag="IN" | tag="PP"]
The backslash is needed to escape the dot, otherwise it will be treated as the matchall symbol of the regular expressions at the level of strings. If the backslash is omitted, all one-character tokens are excluded.
- a sequence of an adjective, a noun, a conjunction and another noun:
[tag="JJ.*"] [tag="N.*"] "and|or" [tag="N.*"]
- a noun, followed by either is or was, followed by a verb ending in ed:
[tag="N.*"] "is|was" [tag="V.*" & word=".*ed"]
- similar, but is or was followed by a past participle (which is described by a particular POS tag, VBN):
[tag="N.*"] "is|was" [tag="VBN"]
- catch or caught, followed by a determiner, any number of adjectives and a noun, or a noun, followed by was or were, followed by caught:
"catch|caught" [tag="DT"] [tag="JJ"]* [tag="N.*"] | [tag="N.*"] "was|were" "caught"
- look or bring, followed by either up or down with at most 10 non-verbs in between:
"look|bring" [tag != "VB.*"]{0,10} "up|down"
2. Grammatical Relation Definitions
To build word sketches, we need to specify grammatical relations. For this we need to provide a simple grammar -- a collection of definitions that allow the system to automatically identify possible relations of words to the keyword.
Example
As an example, suppose the keyword is the verb "graze" and we are considering the following instances:
- ...as sheep graze a Gloucestershire pasture...
- I'd hoped he'd still be grazing that pasture there.
- ...took the flock to the monsoon settlement to graze the mountain pastures.
We can capture the relation of object of the verb in the following pattern (using the Modified Penn Treebank Tagset):
1:"VB.?" [tag="DT|PRP\$"]{0,1} "JJ.?"{0,3} "NN.*"{0,2} 2:"NN.*"
This pattern captures cases where the keyword (indicated by the prefix "1:") can be any verb (e.g. VBZ, VBD, VBN or VBG), possibly followed by a determiner or possessive pronoun (i.e. this tag occurs either 0 or 1 times), a string of 0-3 adjectives and 0-2 nouns, and finally a noun which is taken to be the head of the object noun phrase. The fact that this final noun is the word we want to capture is indicated by the prefix "2:". In all cases "pasture" will be recovered as the object of "graze".
The attribute "tag" is taken as the default attribute and it can therefore be omitted (except in disjunctions).
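To see how the labelled positions yield a keyword/collocate pair, here is a small backtracking matcher in Python. It is only a sketch of the idea, not how Sketch Engine evaluates gramrels: tokens are (word, tag) pairs with hand-assigned tags, each pattern element is a hypothetical (label, tag regex, min, max) tuple, and word forms are returned where a real word sketch would use lemmas.

```python
import re

def match_at(tokens, pos, pattern, labels=None):
    """Match pattern at pos; return a {label: position} dict, or None."""
    labels = labels or {}
    if not pattern:
        return labels
    label, regex, lo, hi = pattern[0]
    # Count consecutive tokens whose tag matches this element (at most hi).
    n = 0
    while (n < hi and pos + n < len(tokens)
           and re.fullmatch(regex, tokens[pos + n][1])):
        n += 1
    # Try the longest repetition first, then back off to shorter ones.
    for count in range(n, lo - 1, -1):
        found = dict(labels)
        if label is not None and count == 1:
            found[label] = pos
        result = match_at(tokens, pos + count, pattern[1:], found)
        if result is not None:
            return result
    return None

def relations(tokens, pattern):
    """Collect (word at label 1, word at label 2) for every match."""
    pairs = []
    for pos in range(len(tokens)):
        m = match_at(tokens, pos, pattern)
        if m is not None:
            pairs.append((tokens[m[1]][0], tokens[m[2]][0]))
    return pairs

# 1:"VB.?" [tag="DT|PRP\$"]{0,1} "JJ.?"{0,3} "NN.*"{0,2} 2:"NN.*"
OBJECT = [(1, r"VB.?", 1, 1), (None, r"DT|PRP\$", 0, 1),
          (None, r"JJ.?", 0, 3), (None, r"NN.*", 0, 2), (2, r"NN.*", 1, 1)]

sentence = [("as", "IN"), ("sheep", "NNS"), ("graze", "VBP"),
            ("a", "DT"), ("Gloucestershire", "NNP"), ("pasture", "NN")]
print(relations(sentence, OBJECT))
```

The matcher first lets "NN.*"{0,2} consume both "Gloucestershire" and "pasture", then backs off by one token so that the final labelled element can claim "pasture" as the head noun.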
We can add further definitions for the same relation to capture different constructions which realise the same underlying relation. For example, the subject of a passive verb plays the same role as the object of an active one:
- ...pastures would be grazed and never ploughed.
- When the original pasture is grazed again...
We can add the following query to the definition of the gramrel:
2:"NN.*" [tag="RB"|tag="VM"]{0,4} [lempos="be-v"] 1:"VBN"
Here the verb in the passive construction (that is, a past participle following the verb "be") is again marked as the keyword with the prefix "1:". The subject is marked with the prefix "2:" and is thus taken to be the underlying object of the verb. Between the subject and the verb "be" we allow the possibility of a string of adverbs ("RB") and/or modal verbs ("VM").
Nota bene
Grammars that use the pattern-matching approach described here will always be less than perfect -- there will be cases where they fail to capture the relation between two words, and cases where the grammar incorrectly supposes a relation exists. Such "noise" in the system is in most cases of little importance as the word sketches only display relations which occur much more often than expected. Therefore, one soon reaches the point where further accuracy in the definitions no longer noticeably improves the word sketch.
3. Grammatical Relations File: Example
The input to the program which compiles the sketches (compilecorp or genws; see compiling corpora) is a word sketch definition file. It is a text (ASCII) file containing queries for each grammatical relation (gramrel).
- Comments are lines beginning with the hash character (#). Empty lines are ignored.
- Lines beginning with the equals character (=) are gramrel names. A gramrel name can contain any character except the slash (/), which is reserved for separating the names of dual gramrels (see below). Trailing white space is stripped off.
- The gramrel name is followed by gramrel queries, with each query on a separate line.
- A regular gramrel query has to contain two labelled positions with labels "1:" and "2:". Each query should be on one line: use a backslash (\) at the end of a line to split a query into multiple lines.
- Lines beginning with star (*) are processing directives. They modify handling of the lines that follow them:
- *DEFAULTATTR sets the default attribute for query evaluation. This directive is active to the end of the file or to the next *DEFAULTATTR directive.
- *STRUCTLIMIT limits query results to a structure, for example sentence. The sequence of tokens in the result cannot cross boundaries of the structure. This directive is active to the end of the file or to the next *STRUCTLIMIT directive.
- *FIXORDER specifies the ordering of grammatical relations for display in the interface. It is possible to specify only the first n relation names; the rest will be sorted randomly.
- *SYMMETRIC evaluates queries also with the "1:" and "2:" labels swapped. This directive is active up to the next gramrel line.
- *DUAL is similar to *SYMMETRIC but it affects gramrels. It defines two gramrels from the same set of gramrel queries. Gramrel names are separated by a slash (/). All queries are evaluated for the first gramrel and then for the second gramrel with the "1:" and "2:" labels swapped.
- *TRINARY is used for trinary relations. These are translated into regular binary relations with different names. A name of a trinary gramrel should contain "%s" and respective queries should contain the third label "3:". A value of the word sketch base attribute on the position labeled "3:" is then substituted for "%s" in the gramrel name.
- *UNARY says that the following gramrel is a unary relation. Only one label is used for unary gramrel queries.
- *SEPARATEPAGE indicates that the following *TRINARY relation should be displayed on a separate page with links from the main word sketch page. The optional parameter is the name of the aggregated gramrel; it defaults to the relation name with %s replaced by '*'.
- *COLLOC specifies a created value for the collocation. It can contain '%' substitution strings, in the form %(n.attr), where n is the numeric label used in the query, and attr is the attribute name. It uses the created value for the collocation instead of the attribute given by the WSATTR option.
- *CONSTRUCTION indicates that the following gramrel should be displayed in the 'Constructions' list.
The example is for French. We assume a default feature of tag and a lemma feature. The tagset is a simple one with
- N for nouns
- All verbs start with V. Past participles are V:pp, infinitives are V:inf
- ADJ for adjectives
- ADV for adverbs
- DET for determiners
- PRO for pronouns
- PRP for prepositions
French words used: et (and), ou (or), de (of), être (the verb be), avoir (the verb have)
*STRUCTLIMIT s
*DEFAULTATTR tag
*FIXORDER sujet sujet_de objet objet_de
*DUAL
=objet/objet_de
1:"V.*" "ADJ|ADV|DET"{0,3} 2:"N"
#"The first argument is a verb, then there are between 0 and 3 adverbs,
#adjectives and determiners, then the second argument is a noun." In
# this simple example, no other constructions are covered.
*DUAL
=sujet/sujet_de
2:"N" "ADV|PRO"{0,2} "V.*"{0,2} "ADV|PRO"{0,2} 1:"V.*"
2:"N" "ADV|PRO"{0,2} 1:"V.*"
#First clause covers cases with auxiliaries, second covers simple verbs.
*SYMMETRIC
=et_ou
1:"N" "ADJ.?"{0,3} [lemma="et|ou"] "DET.*"? "AD[JV]"{0,3} 2:"N"
1:"V.*" [lemma="et|ou"] [lemma="de|être|avoir"|tag="ADV"]? 2:"V.*"
1:"ADJ" "AD[JV]"{0,3} [lemma="et|ou"|word=","] "AD[JV]"{0,3} 2:"ADJ"
1:"ADJ" "N" 2:"ADJ"
#Conjunction: one clause each for nouns and verbs (simple cases only covered),
#two clauses for adjectives to cover the case where both adjectives are next
#to each other, and the case where one comes before the head noun and the
#other comes after. Note that the comma (and other punctuation) is a regular
#token which can be searched on (counter-intuitively) as a "word".
*DUAL
=adj_sujet_de/adj_sujet
1:"N" "ADV"{0,2} [lemma="être"] "ADV"{0,2} 2:"ADJ" "[^AN].*"
*DUAL
=prédicat_de/prédicat
1:"N" "ADV"{0,2} [lemma="être"] "AD[JV]|DET"{0,3} 2:"N" "[^AN].*"
*DUAL
=modifier/modifié
2:"ADJ" "AD[JV]"{0,3} 1:"N"
1:"N" "ADJ"? 2:"ADJ"
1:"V.*" 2:"ADV"
2:"ADV" 1:"V.*"
=infin_comp
1:"V.*" "ADV"{0,3} 2:"V:inf"
*TRINARY
=pp_%s
1:"N|ADJ|V.*" 3:"PRP" "DET|ADJ"{0,3} 2:"N"
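As a rough illustration of what *DUAL and *TRINARY do to definitions like those above, the following Python sketch swaps the "1:" and "2:" labels and fills in a trinary gramrel name. The function names are hypothetical, and the sketch models only the behaviour described in the directive list, not the actual compiler.

```python
import re

def swap_labels(query):
    """Exchange the 1: and 2: position labels in a gramrel query."""
    swap = {"1": "2", "2": "1"}
    return re.sub(r"(?<!\w)([12]):", lambda m: swap[m.group(1)] + ":", query)

def expand_dual(names, queries):
    """Turn one *DUAL definition into two gramrels with mirrored labels."""
    first, second = names.split("/")
    return {first: list(queries),
            second: [swap_labels(q) for q in queries]}

def trinary_name(template, value):
    """Substitute the value found at the 3: position for %s in the name."""
    return template.replace("%s", value)

objet = expand_dual("objet/objet_de", ['1:"V.*" "ADJ|ADV|DET"{0,3} 2:"N"'])
print(objet["objet_de"])
print(trinary_name("pp_%s", "de"))
```

*SYMMETRIC can be pictured the same way, except that the swapped queries are added to the same gramrel instead of defining a second one.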
Example of usage for directives *CONSTRUCTION, *SEPARATEPAGE, *UNARY and *COLLOC:
*CONSTRUCTION
*UNARY
=wh_word
1:[] [tag="AVQ"|tag="DTQ"|tag="PNQ"]
*SEPARATEPAGE pp_X
*TRINARY
=pp_%s
1:[tag="N.."|tag="AJ."] 3:"PR." 2:"N.."
=pp_pp
*COLLOC "%(3.word)_%(2.word)-p"
1:[tag="N.."|tag="AJ."] 3:"PR." 2:"N.."
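The line-based format described in this section can be read with a few lines of Python. The sketch below is a simplified reader written for illustration, not the parser used by compilecorp or genws: it attaches every directive to the nearest gramrel and so ignores the different scopes of directives such as *DEFAULTATTR and *STRUCTLIMIT. The function name and record layout are assumptions.

```python
def parse_gramrels(text):
    """Parse a sketch grammar into a list of gramrel records."""
    # Join each line ending in a backslash with the line that follows it.
    lines, buffer = [], ""
    for raw in text.splitlines():
        if raw.endswith("\\"):
            buffer += raw[:-1]
            continue
        lines.append(buffer + raw)
        buffer = ""

    gramrels, pending, current = [], [], None
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # empty line or comment
        if stripped.startswith("*"):
            # A directive placed between a name and its first query
            # (like *COLLOC above) belongs to that gramrel.
            if current is not None and not current["queries"]:
                current["directives"].append(stripped)
            else:
                pending.append(stripped)
        elif stripped.startswith("="):
            current = {"name": stripped[1:], "directives": pending,
                       "queries": []}
            gramrels.append(current)
            pending = []
        elif current is not None:
            current["queries"].append(stripped)
    return gramrels

sample = r'''*DEFAULTATTR tag
# object relation
*DUAL
=objet/objet_de
1:"V.*" "ADJ|ADV|DET"{0,3} \
2:"N"

=pp_pp
*COLLOC "%(3.word)_%(2.word)-p"
1:"N" 3:"PRP" 2:"N"
'''

gramrels = parse_gramrels(sample)
for g in gramrels:
    print(g["name"], g["directives"], g["queries"])
```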
4. Macros in m4
As we continue to expand the grammar to cover more relations and more patterns for each relation, we will soon find ourselves repeating the same pattern many times. To keep the grammar simpler and easier to manage, we can write a macro in the language m4 for each recurring element. So for the object pattern from section 2,
1:"VB.?" [tag="DT|PRP\$"]{0,1} "JJ.?"{0,3} "NN.*"{0,2} 2:"NN.*"
we could define a noun phrase macro as follows:
define(`noun_phrase',
`"DT|PRP\$"{0,1} "JJ.?"{0,3} "NN.*"{0,2} 2:"NN.*"')
We could also abstract away from the use of the actual tag for the lexical verb in our first definition by making the definition:
define(`lex_verb', `"VB.?"')
Writing grammars in this way also allows us to make them independent of any particular tagset. If we want to use a different tagset we simply need to redefine the basic definitions while the higher level structures remain unchanged.
Using these two definitions we can now express our original clause for capturing the object of a verb as:
1:lex_verb noun_phrase
In an m4 file the additional macro definitions are placed before the relation definitions and between the lines:
divert(-1)
...
divert
The program m4 (a standard Unix utility) is then run over the file to give a 'full-form' version which is used to build word sketches.
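What the m4 pass does to the object clause can be imitated in Python with a plain whole-word substitution. This is a simplified stand-in for m4, which among other things rescans expanded text and supports macro arguments; the macro table below is written to reproduce the object pattern from section 2, and expand is a hypothetical helper.

```python
import re

def expand(text, macros):
    """Replace each whole-word macro name in text by its definition."""
    names = "|".join(re.escape(name) for name in macros)
    # A function is used as the replacement so that backslashes in the
    # definitions are inserted literally.
    return re.sub(r"\b(%s)\b" % names, lambda m: macros[m.group(1)], text)

macros = {
    "lex_verb": r'"VB.?"',
    "noun_phrase": r'"DT|PRP\$"{0,1} "JJ.?"{0,3} "NN.*"{0,2} 2:"NN.*"',
}

print(expand("1:lex_verb noun_phrase", macros))
```

Switching to a different tagset then only requires changing the entries in the macro table, while the clause 1:lex_verb noun_phrase stays the same.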
Where to get more sketch grammars?
There are two ways to download an existing sketch grammar from Sketch Engine:
- either, if you are creating a new corpus for a language with a preloaded tagger (e.g. by selecting TreeTagger in the second step), you will see a list of available sketch grammars (to use and/or download) in the third step of the process
- or, if you open a corpus with word sketches and show the word sketch for an arbitrary lemma, you can click on the name of any grammatical relation to open the whole word sketch grammar definition file, which you can then download and reuse.
