Proposal: A New General Service List

Adam Kilgarriff, Sept 2010

Michael West’s General Service List has been doing good service since it was published by Longman in 1953 (and in earlier versions, since 1936). It was a very good piece of work, based on the leading resources of the day. It has been the basis for many other projects and activities, including Coxhead’s Academic Word List of the most used words across academic English that are not in the GSL.

The goal of the GSL is to present the list of words that learners should learn. For West, frequency is in principle just one of several factors; the others include ease or difficulty of learning, necessity, cover, stylistic level, and issues relating to intensive and emotional words (West, ix-x). (See also the discussion by David Lee of the different readings of ‘core vocabulary’.) However, frequency, being the one quantifiable factor, was the main factor.

West was also well aware that his corpus consisted of printed, written material, and so tended not to include greetings, colloquialisms and other forms and items found predominantly in speech. He excluded months, days of the week and numbers.

The GSL contains around 2000 word families.

It is old and has problems and anomalies:

  • some items are old (shilling)
  • others just seem wrong (yield)
  • it was pre-computer:
    • the exact word forms to be included in each word family are not made fully explicit
    • there are no good and complete computer-tractable versions of it available
    • some versions only give the headwords for the word families, not all the members
    • the version referenced in Wikipedia (from Salford.ac.uk) aims to give all the information available in the printed book but is error-strewn: for example, sour and sourly are given as separate headwords but should be in the same family. There are many more such errors.

There have been a number of attempts to update or replace it, though none have been widely adopted. Perhaps this is because these attempts have been largely based on secondary sources such as other word lists (see eg Billuroğlu and Neufeld’s BNL) and have not been informed by contemporary corpus linguistics.

Multiwords

Many lexical units are multiword ones: phrasal verbs, compound nominals and multiword conjunctions (according to, as well as) are three common subtypes in English. We won’t attempt to count them.

Homonymy/polysemy

Words also have different meanings, of which some are rare and some common. Although it is sometimes counter-intuitive, our lists will be lists of words, not meanings, because we can’t count meanings reliably.

Forms, Lemmas and word families

Our counts will be mainly of lemmas, so

  • nouns: singular and plural both lemmatised to the same lemma
  • verbs: base, -ing, -s, past tense, past participle etc lemmatised to the same lemma
  • uses of the word as a name are kept out of the count (as far as the POS-tagger can correctly identify them)

Our lemmas will not be word-class-specific: we shall treat brush (noun) and brush (verb) as the same lemma.
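
The counting scheme above can be sketched as follows. This is only an illustration: the input format (one lemma/POS pair per token), the tag names and the function name count_lemmas are assumptions, not part of any particular tagger.

```python
from collections import Counter

# Assumed input: one (lemma, pos) pair per token, as a lemmatiser and
# POS-tagger might emit. The tag names below are illustrative only.
PROPER_NOUN_TAGS = {"NP", "NNP", "NNPS"}

def count_lemmas(tagged_tokens):
    """Count lemmas, merging word classes and skipping tokens tagged as names."""
    counts = Counter()
    for lemma, pos in tagged_tokens:
        if pos in PROPER_NOUN_TAGS:
            continue  # uses of a word as a name stay out of the count
        counts[lemma.lower()] += 1  # the POS is ignored: noun and verb share a lemma
    return counts

tokens = [
    ("brush", "NN"),   # "brushes" (noun, plural)
    ("brush", "VB"),   # "brushed" (verb, past tense)
    ("Brush", "NNP"),  # "Brush" used as a surname
    ("run", "VB"),     # "running"
]
lemma_counts = count_lemmas(tokens)
```

Here the noun and verb uses of brush are counted together, while the surname is left out, to the extent that the tagger has identified it correctly.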

We will review possible shortcomings of this scheme after the basic list has been prepared. Issues to consider include:

  • adding in ‘run-on entries’ like -ly adverbs and nouns in -ness
  • spelling variants (eg AmE vs BrE)
  • -er and -est for adjectives
  • lemmatisation of pronouns
  • and plenty of others

The method

I want to propose a New GSL based on the words that are used whatever kind of use we are making of English. Such words should be found in most documents for (almost) all kinds of language.

I am proposing to implement this by gathering many Brown-style corpora, eg each of 1m words in 500 samples of 2000 words (or maybe 5 times bigger: 5m words each, in 2500 samples of 2000 words), and saying:

a word belongs in the New GSL if, for every corpus, it occurs in at least 95% of the documents

Of course, we try this out and see how many words it gives us! The core database for exploring this question has a row for each lemma, a column for each corpus and, in each cell, the number of samples in that corpus containing that lemma (out of a total of 500 if we use 1m corpora, or 2500 if we use 5m corpora). This database is to be published online so that others can use it to explore text type.
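
A minimal sketch of the database and the 95% criterion, assuming each corpus is held in memory as a list of samples and each sample as a list of lemmas. The function names sample_counts and new_gsl and the toy data are hypothetical.

```python
from collections import defaultdict

def sample_counts(corpora):
    """Build the lemma-by-corpus table.

    `corpora` maps a corpus name to a list of samples; each sample is a
    list of lemmas. Returns {lemma: {corpus: number of samples containing it}}.
    """
    table = defaultdict(dict)
    for name, samples in corpora.items():
        for sample in samples:
            for lemma in set(sample):  # count each sample once per lemma
                table[lemma][name] = table[lemma].get(name, 0) + 1
    return table

def new_gsl(table, corpus_sizes, threshold=0.95):
    """Return lemmas found in at least `threshold` of the samples of every corpus."""
    return sorted(
        lemma
        for lemma, row in table.items()
        if all(row.get(name, 0) >= threshold * size
               for name, size in corpus_sizes.items())
    )

# Toy data: two tiny "corpora" of three samples each
# (real ones would have 500 or 2500 samples).
toy = {
    "fiction": [["the", "cat", "sit"], ["the", "dog", "run"], ["the", "cat", "run"]],
    "news": [["the", "vote", "run"], ["the", "bank", "run"], ["the", "run", "cat"]],
}
sizes = {name: len(samples) for name, samples in toy.items()}
table = sample_counts(toy)
core = new_gsl(table, sizes)
```

Keeping the threshold as a parameter makes it easy to see how the size of the list changes as the 95% figure is raised or lowered.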

Where we do not find samples as long as 2000 words, we shall compose smaller samples into 2000-word chunks.

Where samples are longer than 2000 words, we truncate them at the end of the first sentence that takes us over 2000 words (and ignore the Sinclair objection that the beginnings of documents are different to the ends of them).
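
The two sampling rules could be sketched as below. The sentence splitter is deliberately naive, and several details are assumptions the proposal does not settle: words are counted by splitting on whitespace, a composed chunk is truncated in the same way as a long sample, and leftover text too short for a full chunk is dropped. The function names truncate and compose are hypothetical.

```python
import re

SAMPLE_SIZE = 2000  # words per sample, as in the Brown corpus

def truncate(text, size=SAMPLE_SIZE):
    """Cut a long text at the end of the first sentence that passes `size` words."""
    # Naive sentence splitter (an assumption): break after ., ! or ? plus whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept, n_words = [], 0
    for sentence in sentences:
        kept.append(sentence)
        n_words += len(sentence.split())
        if n_words > size:
            break
    return " ".join(kept)

def compose(short_texts, size=SAMPLE_SIZE):
    """Concatenate short texts into chunks of at least `size` words each."""
    chunks, current, n_words = [], [], 0
    for text in short_texts:
        current.append(text)
        n_words += len(text.split())
        if n_words >= size:
            chunks.append(truncate(" ".join(current), size))
            current, n_words = [], 0
    return chunks  # leftover text too short for a full chunk is dropped

# Small sizes are used here only to keep the example readable.
demo = "One two three. Four five six. Seven eight nine."
short = truncate(demo, size=4)
chunks = compose(["a b.", "c d.", "e f g.", "h."], size=3)
```

A real implementation would use a proper sentence splitter and would need a decision on which short texts may be combined into one chunk (for example, only texts from the same source or text type).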

General issues about data

  • Project is closely related to the New Model Corpus (NMC) project, see AK/WorkInProgress
  • Desirable to merge the projects, at least to the extent of making all the samples used in GSL available as a corpus in SkE, also using the same samples where that works
  • As for NMC, it would be very nice if all samples were under Creative Commons so there was no copyright issue about distributing
  • Goal is, as far as possible, to use sets which are already collected,
    • and where they are already widely known and used, so much the better: there’s little reason to leave them out
  • As ever, within each sample, we go for as wide a spread of sources/text types as possible
  • some items come under several categories in list below
  • one criterion is: cheap to find, collect and count. Given that, ‘the more the merrier’: no need to leave anyone’s pet text type out if the data is available
  • It is not yet clear to me whether, eg, we have one (say) 5m sample for ‘essays’ which we subdivide between disciplines, or, where there is enough data to hand, 5m per discipline. The latter approach supports more research questions.

The text types and corpora: candidates

  • The Brown family
    • Brown (we assume the list won’t have changed in 50 years, tho need to look back over the lists to see if this is true)
    • LOB and company (1901 and 1931 versions)
    • FLOB and Frown
  • blog
    • just one for blog, or multiple? If so how?
  • newspaper
    • just one or subdivide?
    • by title (eg Daily Mail) and/or by section (Sports vs editorial vs … like Brown did)
    • WSJ as in Penn Treebank
  • spoken
    • spoken BNC (conversational)
    • spoken BNC (context-governed)
    • BASE
    • MICASE
    • chatshow transcripts
    • film transcripts (scripted but speech-like)
    • lectures eg from MIT (automatically transcribed)
  • fiction
    • from BNC
    • from Gutenberg - classics
    • from self-publishing sites, which will be mostly science fiction/fantasy
    • Oxford Children’s Corpus
  • academic-written
    • essays (BAWE)
    • journal papers
    • use Iztok Kosem’s corpus, one/some for each of a longish list of disciplines?
    • Other data from PICAE?
  • writing for children
    • Oxford Children’s Corpus (fiction – noted above – and nonfiction parts)
  • children’s language
    • CHILDES db
  • NNS
    • Cambridge Learner Corpus material
    • note overlap/interaction with English Profile Project
    • ICLE
    • ask David Wible or John Milton?
  • piggy-backing on other diverse corpora
    • BNC
    • COCA
    • OEC
    • ANC
  • specialist-other
    • the domain corpora Avinesh is generating for all DANTE domains
    • software documentation

AK 18.9.10