wiki:SkE/SimpleMaths

The statistic we use for keywords is a variant on 'word W is k times as frequent in subcorpus X as in subcorpus Y'.

The calculation is this:

  • Make a frequency list for each of the two (sub)corpora being compared, X and Y
    • for each word
      • normalise its frequency in X and in Y to per-million
      • add 100 to each normalised figure
        • (NB we add this N=100 to solve two problems:
          1. if a word does not occur in one of the corpora, its frequency per million there is zero, and you cannot divide by zero. If we add a number N, e.g. 100, to zero, we can do the division
          2. by varying N, we can get a keyword list that favours lower-frequency words (low N) or higher-frequency words (high N). Some researchers are interested in lower-frequency (content) words and others in higher-frequency (grammar) words, so the user can control the kind of comparison they are making by adjusting N. N=100 is a mid-point that gives a good mix of high- and low-frequency keywords.)
      • divide the number for corpus X by the number for corpus Y, to give the score.
    • Sort the words according to the score.
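
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Sketch Engine implementation: the function names `per_million` and `keyword_scores`, the parameter name `n`, and the toy counts are all hypothetical, and sorting highest score first is an assumption (the text only says to sort by score).

```python
from collections import Counter


def per_million(counts):
    """Normalise raw counts to frequency per million tokens."""
    total = sum(counts.values())
    if total == 0:
        return {}  # empty (sub)corpus: every word has frequency zero
    return {word: count * 1_000_000 / total for word, count in counts.items()}


def keyword_scores(counts_x, counts_y, n=100.0):
    """Score each word as (fpm in X + n) / (fpm in Y + n).

    Returns (word, score) pairs sorted highest score first
    (the sort direction is an assumption of this sketch).
    """
    fpm_x = per_million(counts_x)
    fpm_y = per_million(counts_y)
    words = set(fpm_x) | set(fpm_y)
    scores = {
        # A word missing from one corpus has frequency 0 there;
        # adding n keeps the division defined.
        word: (fpm_x.get(word, 0.0) + n) / (fpm_y.get(word, 0.0) + n)
        for word in words
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Toy counts (invented for illustration); each corpus has 1,000,000 tokens.
counts_x = Counter({"the": 50_000, "corpus": 400, "rest": 949_600})
counts_y = Counter({"the": 50_000, "rest": 950_000})

for word, score in keyword_scores(counts_x, counts_y, n=100):
    print(f"{word}\t{score:.3f}")
```

In this toy data "corpus" occurs 400 times per million in X and never in Y, so with N=100 its score is (400 + 100) / (0 + 100) = 5. Rerunning with a smaller or larger `n` shows the effect described in point 2: with `n=1` the same word scores 401, while with `n=10000` it scores only 1.04.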

For more details, see the attachment.

Attachments