Corpus Querying and Grammar Writing for the Sketch Engine

  1. The corpus query language
  2. Grammatical relation definitions
  3. Grammatical relations file: example
  4. Macros in m4

1. The Corpus Query Language

The language was developed by the Corpora and Lexicons group at IMS, University of Stuttgart, in the early 1990s (see IMS Corpus Workbench). This page is a variant of the documentation prepared there.

  • A query consists of a regular expression over attribute expressions.
    • The attributes used in these examples are word and tag. Every word has an associated part-of-speech tag, tag.
  • Very often you only want to look for a given word, for example all occurrences of words beginning with confus. The full form is
        [word="confus.*"]
    
    but for simple searches on the default attribute (here: word) we can simply use
        "confus.*"
    
    The default attribute can be changed using the drop-down list under the CQL box.
  • Case is significant to the query processor. For a case-insensitive search, include (?i) in the string:
        "(?i)on"
    
  • We often want a wild-card word: any single word, it doesn't matter which. We use the "match any token" operator [] (similar to the dot for "match any character" in regular expressions over strings):
        "confus.*" [] "by"
    
    This query finds all sequences of a word beginning with confus, followed by any word, followed by by. The match-any operator must not be the first expression in a query.

    • We search for exactly two words between confus.* and by with
          "confus.*" []{2} "by"
      
    • We search for between 0 and 3 words between confus.* and by with
          "confus.*" []{0,3} "by"
      
  • If the corpus has sentence, paragraph or document markup, we can constrain the match to lie within one such unit rather than specifying a number of tokens. Here s stands for sentence. We search for a word beginning with confus followed by by within a sentence with:
        "confus.*" []* "by" within <s>
    
    The within statement is always the last component of the query.
  • XML tags may be used to access boundaries of structural attributes (sentences, for example):
        <s> [tag="N.*"] []* [tag="VB.*"]
    
    Here, a noun (any tag beginning with N) must occur at the beginning of a sentence, followed by an arbitrary number of unspecified words, and finally by a verb.
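
The token-level operators introduced above (the match-any token [] and the intervals) can be mimicked in a few lines of Python. This is only an illustrative sketch, not the way the query processor is implemented: the function names matches_at and find_all and the sample sentence are hypothetical, and it assumes that each string is matched against the whole token.

```python
import re

def matches_at(tokens, start, pattern):
    """Return the end index of a match of `pattern` beginning at `start`, or None.

    `pattern` is a list of (regex_or_None, min_count, max_count) triples;
    None stands for the match-any token [].
    """
    def step(pos, idx):
        if idx == len(pattern):
            return pos
        regex, lo, hi = pattern[idx]
        # Count how many consecutive tokens satisfy this element (at most hi).
        count = 0
        while (count < hi and pos + count < len(tokens)
               and (regex is None or re.fullmatch(regex, tokens[pos + count]))):
            count += 1
        # Try the longest repetition first, then back off down to lo.
        for n in range(count, lo - 1, -1):
            end = step(pos + n, idx + 1)
            if end is not None:
                return end
        return None
    return step(start, 0)

def find_all(tokens, pattern):
    """Collect one match (as a token list) for every start position that matches."""
    hits = []
    for i in range(len(tokens)):
        end = matches_at(tokens, i, pattern)
        if end is not None:
            hits.append(tokens[i:end])
    return hits

tokens = "He was confused by it and she was confusing things further by talking".split()
query = [("confus.*", 1, 1), (None, 0, 3), ("by", 1, 1)]   # "confus.*" []{0,3} "by"
print(find_all(tokens, query))
# [['confused', 'by'], ['confusing', 'things', 'further', 'by']]
```

Changing the middle element to (None, 2, 2), the counterpart of []{2}, keeps only the second hit, because exactly two tokens must then stand between the two words.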

A more formal account

  • We use regular expressions at two levels in our query system: at the level of attribute expressions and within string values.
  • The regular expression operators available are concatenation (as usual), disjunction (|), the Kleene star (*, matching any number of repetitions, including 0), the plus operator (+, matching 1 or more repetitions) and the interval operator. An interval of the form
        {n, k}
    
    matches between n and k repetitions. If k is omitted, at least n repetitions are matched. If the interval has the form
        {n}
    
    exactly n repetitions are matched. The examples below will clarify this.
  • Each attribute expression is -- roughly speaking -- evaluated against the word (and/or other, additional attributes) at a given corpus position. It has the form
        [Boolean expression]
    
    that is, an attribute expression is a boolean expression surrounded by brackets.
  • A boolean expression is a set of attribute value tests, combined with the usual boolean operators: conjunction (&), disjunction (|) and negation (!). Parentheses may be used in the usual way.
  • As noted above, we assume that the two attributes word and tag are defined for the corpus. These two attributes can now be used in attribute-value tests. Here, such a test has the form
        attribute_name operator string
    
    where attribute_name is either word or tag (in this demo version), operator is either the "match" operator = or the non-match operator !=, and string is either a "plain" string or a (POSIX egrep) regular expression, both enclosed in double quotes.

Some query examples

Look for...

  • thank starting with either upper or lower case:
        "[tT]hank"
    
  • a word beginning with confuse, followed by a preposition or a personal pronoun:
        "confuse.*" [tag="IN" | tag="PP"]
        "confuse.*" ([tag="IN"] | [tag="PP"]) 
        "confuse.*" [tag="IN|PP"]
    
    The three alternatives have the same effect, but are handled at a different level of evaluation: the first at the level of boolean expressions, the second at the level of attribute expressions, and the third at the level of regular expressions over the character alphabet.
  • the same, but with at most 10 words in between:
        "confuse.*" []{0,10} [tag="IN" | tag="PP"]
    
  • the same, but without full stops in between:
        "confuse.*" [word!="\."]{0,10} [tag="IN" | tag="PP"]
    
    The backslash is needed to escape the dot, otherwise it will be treated as the matchall symbol of the regular expressions at the level of strings. If the backslash is omitted, all one-character tokens are excluded.
  • a sequence of an adjective, a noun, a conjunction and another noun:
        [tag="JJ.*"] [tag="N.*"] "and|or" [tag="N.*"]
    
  • a noun, followed by either is or was, followed by a verb ending in ed:
        [tag="N.*"] "is|was" [tag="V.*" & word=".*ed"]
    
  • similar, but is or was followed by a past participle (which is described by a special POS tag):
        [tag="N.*"] "is|was" [tag="VBN"]
    
  • catch or caught, followed by a determiner, any number of adjectives and a noun, or a noun, followed by was or were, followed by caught:
        "catch|caught" [tag="DT"] [tag="JJ"]* [tag="N.*"] | [tag="N.*"] "was|were" "caught" 
    
  • look or bring, followed by either up or down with at most 10 non-verbs in between:
        "look|bring" [tag != "VB.*"]{0,10} "up|down"
    

2. Grammatical Relation Definitions

To build word sketches, we need to specify grammatical relations. For this we need to provide a simple grammar -- a collection of definitions that allow the system to automatically identify possible relations of words to the keyword.

Example

As an example, suppose the keyword is the verb "graze" and we are considering the following instances:

  • ...as sheep graze a Gloucestershire pasture...
  • I'd hoped he'd still be grazing that pasture there.
  • ...took the flock to the monsoon settlement to graze the mountain pastures.

We can capture the relation of object of the verb in the following pattern (using the Modified Penn Treebank Tagset):

  1:"VB.?" [tag="DT|PRP\$"]{0,1} "JJ.?"{0,3} "NN.*"{0,2} 2:"NN.*"

This pattern captures cases where the keyword (indicated by the prefix "1:") can be any verb form whose tag matches VB.? (for example VB, VBZ, VBD, VBN or VBG), possibly followed by a determiner or possessive pronoun (i.e. this tag occurs either 0 or 1 times), a string of 0-3 adjectives and 0-2 nouns, and finally by a noun which is taken to be the head of the object noun phrase. The prefix "2:" indicates that this final noun is the word we want to capture. In all three examples "pasture" will be recovered as the object of "graze".

The attribute "tag" is taken as the default attribute and it can therefore be omitted (except in disjunctions).
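
To see the pattern at work, here is a small Python sketch that applies it to the first example sentence. It is not how word sketches are actually computed: the tags assigned to the sample words, the name object_pairs and the trick of encoding the tag sequence as one string are all assumptions made for the illustration.

```python
import re

# Assumed encoding: every tag is followed by one space, so token boundaries
# are spaces and "\S" plays the role of "." inside a single tag.
OBJECT_OF_VERB = re.compile(
    r"(?<!\S)"                  # start at a token boundary
    r"(?P<verb>VB\S? )"         # 1:"VB.?"
    r"(?:(?:DT|PRP\$) )?"       # [tag="DT|PRP\$"]{0,1}
    r"(?:JJ\S? ){0,3}"          # "JJ.?"{0,3}
    r"(?:NN\S* ){0,2}"          # "NN.*"{0,2}
    r"(?P<noun>NN\S* )"         # 2:"NN.*"
)

def object_pairs(tagged):
    """Return (verb, noun) word pairs for a list of (word, tag) pairs."""
    tag_string = "".join(tag + " " for _, tag in tagged)
    pairs = []
    for m in OBJECT_OF_VERB.finditer(tag_string):
        # The number of spaces before a group's start is that token's index.
        verb_idx = tag_string.count(" ", 0, m.start("verb"))
        noun_idx = tag_string.count(" ", 0, m.start("noun"))
        pairs.append((tagged[verb_idx][0], tagged[noun_idx][0]))
    return pairs

sentence = [("as", "IN"), ("sheep", "NNS"), ("graze", "VBP"), ("a", "DT"),
            ("Gloucestershire", "NNP"), ("pasture", "NN")]
print(object_pairs(sentence))
# [('graze', 'pasture')]
```

Note that finditer returns non-overlapping matches only, which is enough for a single short sentence but is a simplification compared with a real corpus query.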

We can add further definitions for the same relation to capture different constructions which realise the same underlying relation. For example, the subject of a passive verb plays the same role as the object of an active one:

  • ...pastures would be grazed and never ploughed.
  • When the original pasture is grazed again...

We can add the following query to the definition of the gramrel:

  2:"NN.*" [tag="RB"|tag="VM"]{0,4} [lempos="be-v"] 1:"VBN"

Here the verb in the passive construction (that is, a past participle following the verb "be") is again marked as the keyword with the prefix "1:". The subject is marked with the prefix "2:" and is thus taken to be the underlying object of the verb. Between the subject and the verb "be" we allow the possibility of a string of adverbs ("RB") and/or modal verbs ("VM").

Nota bene

Grammars that use the pattern-matching approach described here will always be less than perfect -- there will be cases where they fail to capture the relation between two words, and cases where the grammar incorrectly supposes a relation exists. Such "noise" in the system is in most cases of little importance, as the word sketches only display relations which occur much more often than expected. Beyond a certain point, therefore, further accuracy in the definitions yields little improvement in the word sketch.

3. Grammatical Relations File: Example

The input to the genws program is a word sketch definition file. It is a text (ASCII) file containing the queries for each grammatical relation (gramrel).

  • Comments are lines beginning with the hash character (#). Empty lines are ignored.
  • Lines beginning with the equals character (=) are gramrel names. A gramrel name can contain any character except the slash (/), which separates the two names of a dual gramrel (see below). Trailing white space is stripped off.
  • The gramrel name is followed by gramrel queries, with each query on a separate line.
  • A regular gramrel query has to contain two labelled positions with the labels "1:" and "2:". Each query should be on one line; use a backslash (\) at the end of a line to split a query over multiple lines.
  • Lines beginning with star (*) are processing directives. They modify handling of the lines that follow them:
    • *DEFAULTATTR sets the default attribute for query evaluation. This directive is active to the end of the file or to the next *DEFAULTATTR directive.
    • *STRUCTLIMIT limits query results to a structure, for example sentence. The sequence of tokens in the result cannot cross boundaries of the structure. This directive is active to the end of the file or to the next *STRUCTLIMIT directive.
    • *SYMMETRIC evaluates queries also with the "1:" and "2:" labels swapped. This directive is active up to the next gramrel line.
    • *DUAL is similar to *SYMMETRIC but it affects gramrels. It defines two gramrels from the same set of gramrel queries. Gramrel names are separated by a slash (/). All queries are evaluated for the first gramrel and then for the second gramrel with the "1:" and "2:" labels swapped.
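
The line types listed above can be told apart with a short reader such as the following Python sketch. It is a hypothetical illustration of the file layout, not the genws parser: the name parse_gramrels and the returned dictionary layout are invented here, and the sketch assumes that a directive without an argument (such as *DUAL) applies to the gramrel line that follows it, as in the example file below.

```python
def parse_gramrels(text):
    """Group the lines of a definition file into gramrels (illustrative only)."""
    # Join lines that end with a backslash into one logical line.
    logical, pending = [], ""
    for raw in text.splitlines():
        if raw.rstrip().endswith("\\"):
            pending += raw.rstrip()[:-1]
            continue
        logical.append(pending + raw)
        pending = ""

    settings = {}   # directives with a value: DEFAULTATTR, STRUCTLIMIT
    flags = []      # directives waiting for the next gramrel line, e.g. DUAL
    gramrels = []
    for line in logical:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                                  # empty line or comment
        if line.startswith("*"):
            name, _, value = line[1:].partition(" ")
            if value:
                settings[name] = value.strip()
            else:
                flags.append(name)
        elif line.startswith("="):
            gramrels.append({"name": line[1:].rstrip(), "flags": flags,
                             "settings": dict(settings), "queries": []})
            flags = []
        elif gramrels:
            gramrels[-1]["queries"].append(line)      # a query of the current gramrel
    return gramrels

sample = """\
*STRUCTLIMIT s
*DEFAULTATTR tag

*DUAL
=objet/objet_de
\t1:"V.*" "ADJ|ADV|DET"{0,3} 2:"N"
# a comment line
"""
for rel in parse_gramrels(sample):
    print(rel["name"], rel["flags"], rel["settings"], rel["queries"])
```

For a dual gramrel, splitting the parsed name on the slash gives the two gramrel names.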

The example is for French. We assume a default feature of tag and a lemma feature. The tagset is a simple one with

  • N for nouns
  • All verbs start with V. Past participles are V:pp, infinitives are V:inf
  • ADJ for adjectives
  • ADV for adverbs
  • DET for determiners
  • PRO for pronouns
  • PRP for prepositions

French words used: et (and), ou (or), de (of), être (the verb be), avoir (the verb have)

*STRUCTLIMIT s
*DEFAULTATTR tag

*DUAL
=objet/objet_de
	1:"V.*" "ADJ|ADV|DET"{0,3} 2:"N"
#"The first argument is a verb, then there are between 0 and 3 adverbs, 
#adjectives and determiners, then the second argument is a noun."  In 
# this simple example, no other constructions are covered.

*DUAL
=sujet/sujet_de
	2:"N" "ADV|PRO"{0,2} "V.*"{0,2}  "ADV|PRO"{0,2} 1:"V.*"
	2:"N"  "ADV|PRO"{0,2} 1:"V.*"
#First clause covers cases with auxiliaries, second covers simple verbs.

*SYMMETRIC
=et_ou
	1:"N" "ADJ.?"{0,3} [lemma="et|ou"] "DET.*"? "AD[JV]"{0,3} 2:"N"
	1:"V.*" [lemma="et|ou"] [lemma="de|être|avoir"|tag="ADV"]? 2:"V.*"
	1:"ADJ" "AD[JV]"{0,3} [lemma="et|ou"|word=","] "AD[JV]"{0,3} 2:"ADJ"
	1:"ADJ" "N" 2:"ADJ"
#Conjunction: one clause each for nouns and verbs (simple cases only covered),
#two clauses for adjectives to cover the case where both adjectives are next 
#to each other, and the case where one comes before the head noun and the 
#other comes after.  Note that the comma (and other punctuation) is a regular 
#token which can be searched on (counter-intuitively) as a "word".

*DUAL
=adj_sujet_de/adj_sujet
	1:"N" "ADV"{0,2} [lemma="être"] "ADV"{0,2} 2:"ADJ" "[^AN].*"

*DUAL
=prédicat_de/prédicat
	1:"N"  "ADV"{0,2} [lemma="être"] "AD[JV]|DET"{0,3} 2:"N" "[^AN].*"
*DUAL
=modifier/modifié
	2:"ADJ"  "AD[JV]"{0,3} 1:"N"
	1:"N"  "ADJ"? 2:"ADJ"
	1:"V.*" 2:"ADV"
	2:"ADV" 1:"V.*"

=infin_comp
	1:"V.*" "ADV"{0,3} 2:"V:inf"

*TRINARY
=pp_%s 
	1:"N|ADJ|V.*" 3:"PRP" "DET|ADJ"{0,3} 2:"N"

*TRINARY is used for trinary relations. These are translated into regular binary relations with different names. The name of a trinary gramrel should contain "%s", and its queries should contain the third label "3:". The value of the word sketch base attribute at the position labelled "3:" is then substituted for "%s" in the gramrel name.
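
The name substitution can be pictured with the following Python sketch. The function split_trinary and the sample French triples are hypothetical; they only illustrate how one trinary gramrel such as pp_%s fans out into several binary relations, one per value found at position "3:".

```python
from collections import defaultdict

def split_trinary(template, triples):
    """Turn (word1, word3, word2) matches into binary relations named after word3."""
    relations = defaultdict(list)
    for first, third, second in triples:
        # The value at position "3:" replaces "%s" in the gramrel name.
        relations[template.replace("%s", third)].append((first, second))
    return dict(relations)

matches = [("maison", "de", "campagne"), ("parler", "à", "ami"), ("tasse", "de", "thé")]
print(split_trinary("pp_%s", matches))
# {'pp_de': [('maison', 'campagne'), ('tasse', 'thé')], 'pp_à': [('parler', 'ami')]}
```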

*UNARY (not illustrated here) says that the following gramrel is a unary relation. Only one label is used for unary gramrel queries.

4. Macros in m4

As we continue to expand the grammar to cover more relations and more patterns for each relation, we will soon find ourselves repeating the same pattern many times. To keep the grammar simpler and easier to manage, we can write a macro for each recurring element in the macro language m4. So, for the example

  1:"VB.?" [tag="DT|PRP\$"]{0,1} "JJ.?"{0,3} "NN.*"{0,2} 2:"NN.*"

we could define a noun phrase macro as follows:

  define(`noun_phrase',
         `"DT|PRP\$"{0,1} "JJ.?"{0,3} "NN.*"{0,2} 2:"NN.*"')

We could also abstract away from the use of the actual tag for the lexical verb in our first definition by making the definition:

  define(`lex_verb', `"VB.?"')

Writing grammars in this way also allows us to make them independent of any particular tagset. If we want to use a different tagset we simply need to redefine the basic definitions while the higher level structures remain unchanged.

Using these two definitions we can now express our original clause for capturing the object of a verb as:

  1:lex_verb noun_phrase

In an m4 file the additional macro definitions are placed before the relation definitions and between the lines:

  divert(-1)
  ...
  divert

The program m4 (a standard Unix utility) is then run over the file to give a 'full-form' version which is used to build word sketches.
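
To preview what the full-form version of a clause looks like without running m4, the expansion can be imitated with plain textual substitution. The Python sketch below is only a stand-in for m4 (it does not rescan expanded text and knows nothing about m4 quoting or divert); the dictionary macros and the function expand are hypothetical, and the macro bodies repeat the tag patterns of the object-of-verb query from section 2.

```python
import re

# Macro bodies copied from the object-of-verb pattern; the names are the
# ones used in the definitions above.
macros = {
    "noun_phrase": '"DT|PRP\\$"{0,1} "JJ.?"{0,3} "NN.*"{0,2} 2:"NN.*"',
    "lex_verb": '"VB.?"',
}

def expand(line, macros):
    """Replace every whole-word macro name in `line` with its body."""
    return re.sub(r"[A-Za-z_][A-Za-z0-9_]*",
                  lambda m: macros.get(m.group(0), m.group(0)),
                  line)

print(expand("1:lex_verb noun_phrase", macros))
# 1:"VB.?" "DT|PRP\$"{0,1} "JJ.?"{0,3} "NN.*"{0,2} 2:"NN.*"
```

Because the labels "1:" and "2:" are not part of a macro name, they pass through unchanged and end up in front of the expanded tag patterns.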

Sketch Engine
Bringing Corpora to the Masses

Lexical Computing Ltd
