The statistic we use for keywords is a variant of 'word W is k times as frequent in subcorpus X as in subcorpus Y'.
The calculation is this:
- For each of the two (sub)corpora being compared, X and Y:
  - make a frequency list
  - normalise each frequency to per-million
  - add 100 to each normalised figure
    - (NB we add this N=100 to solve two problems:
      - if a word does not occur in one of the corpora, its frequency per million is zero, and you cannot divide by zero. If we add a number N, e.g. 100, to zero, we can do the division
      - by varying N, we can get a keyword list that tends towards lower-frequency words (with low N) or higher-frequency words (with high N). Some researchers are interested in lower-frequency (content) words and others in higher-frequency (grammar) words, so the user can regulate the kind of comparison they are making by adjusting N. N=100 is a mid-point that gives a good mix of high- and low-frequency keywords.)
- For each word:
  - divide the number for corpus X by the number for corpus Y, to give the score.
- Sort the words according to the score.
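The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the actual implementation: the function names `simple_score` and `keyword_scores` are hypothetical, each corpus is assumed to be an already tokenised list of words, and a word missing from one corpus is treated as having a frequency of zero there.

```python
from collections import Counter


def simple_score(fpm_x, fpm_y, n=100.0):
    """Ratio of two per-million frequencies after adding n to each."""
    return (fpm_x + n) / (fpm_y + n)


def keyword_scores(tokens_x, tokens_y, n=100.0):
    """Return (word, score) pairs, highest score first.

    tokens_x and tokens_y are lists of word tokens (an assumption of
    this sketch; the source does not say how the text is tokenised).
    """
    if not tokens_x or not tokens_y:
        raise ValueError("both corpora must contain at least one token")

    # Make a frequency list for each corpus.
    counts_x = Counter(tokens_x)
    counts_y = Counter(tokens_y)

    scores = {}
    for word in set(counts_x) | set(counts_y):
        # Normalise to per-million; Counter returns 0 for unseen words.
        fpm_x = counts_x[word] * 1_000_000 / len(tokens_x)
        fpm_y = counts_y[word] * 1_000_000 / len(tokens_y)
        # Add n to both figures and divide X by Y to give the score.
        scores[word] = simple_score(fpm_x, fpm_y, n)

    # Sort the words according to the score.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Effect of varying n, with made-up per-million figures:
rare = (10, 0)           # 10 per million in X, absent from Y
common = (10000, 5000)   # frequent in both, twice as frequent in X
for n in (1, 100, 1000):
    print(n, simple_score(*rare, n), simple_score(*common, n))
```

With these illustrative figures, the rare word outscores the common word at N=1, and the common word outscores the rare word at N=100 and N=1000, which is the behaviour the note on varying N describes.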
For more details see attachment.
Attachments
- liverpool[1].txt (4.3 KB), added by adam 3 years ago.
