README for http://www.sketchengine.co.uk/KSC/
This directory contains a number of known-similarity corpus sets, designed for testing metrics for corpus similarity. For theory, documentation etc, see the paper:
@InProceedings{wvlc:96,
author = "Adam Kilgarriff",
title = "Using word frequency lists to measure corpus
homogeneity and similarity between corpora",
booktitle = "Proceedings, {ACL SIGDAT} workshop on very large corpora",
year = 1997,
address = "Beijing and Hong Kong",
month = "August"
}
also available (electronically or as hard copy) as techreport ITRI-97-07: ftp://ftp.itri.bton.ac.uk/pub/reports/ITRI-97-07.ps.gz
Each file is gzipped and tar'd. Once unzipped and untarred, you will have a directory, eg acc_gua, containing between 3 and 10 files such as
acc0_gua acc1_gua acc2_gua ...
acc and gua are acronyms for two distinct text types (in this case, "Accountancy" (a periodical) and The Guardian). The digit represents the number of tenths of the first text type, so acc2_gua is two tenths Accountancy, eight tenths Guardian. The value of the data set is that we can say with confidence that, eg, acc1_gua is more like acc2_gua than acc0_gua is like acc3_gua. See paper for fuller elucidation.
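The naming scheme and the kind of judgement it licenses can be sketched in Python. This is only an illustration, not code from the paper: the function names are hypothetical, the pattern assumes lower-case alphabetic text-type codes as in the acc/gua example, and the nesting test (one pair's range of tenths lies inside the other's) is an assumption generalised from the single example above. See the paper for the precise criterion.

```python
import re
from itertools import combinations

# Assumed name shape: <type1><tenths>_<type2>, e.g. "acc2_gua".
NAME_RE = re.compile(r"^([a-z]+)(\d+)_([a-z]+)$")

def parse_name(name):
    """Split e.g. 'acc2_gua' into ('acc', 2, 'gua')."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"unexpected corpus name: {name!r}")
    return m.group(1), int(m.group(2)), m.group(3)

def proportions(name):
    """Return the share of each text type implied by the file name."""
    first, tenths, second = parse_name(name)
    return {first: tenths / 10, second: (10 - tenths) / 10}

def gold_judgments(levels):
    """Yield ((a, b), (c, d)): corpora a and b are assumed to be more
    alike than corpora c and d, because the range a..b of tenths lies
    inside the range c..d (and the two pairs differ)."""
    pairs = list(combinations(sorted(levels), 2))
    for a, b in pairs:
        for c, d in pairs:
            if c <= a and b <= d and (a, b) != (c, d):
                yield (a, b), (c, d)
```

For instance, proportions("acc2_gua") gives two tenths acc and eight tenths gua, and gold_judgments([0, 1, 2, 3]) includes the pair ((1, 2), (0, 3)), which is the acc1_gua/acc2_gua versus acc0_gua/acc3_gua example above.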
All corpora comprise 100,000 words in total.
All data is taken from the BNC (see http://info.ox.ac.uk/bnc ). BNC POS-tagging, sentence markup and BNC document references have been retained. The markup uses minimal SGML, with just 4 elements (w, s, bncDoc and c (for punctuation)) and can be easily removed by, eg, the following line of perl:
s/<[^>]+>//g
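For those not using perl, the same substitution can be written in Python. This is a minimal sketch: the function name is hypothetical, and the sample string is made up for illustration and is not taken from the corpora.

```python
import re

# Same pattern as the perl line: anything from '<' to the next '>'.
TAG_RE = re.compile(r"<[^>]+>")

def strip_markup(text):
    """Remove every SGML tag, leaving only the text between tags."""
    return TAG_RE.sub("", text)

# Made-up sample using the w, s and c elements described above.
sample = "<s n=1><w AT0>The <w NN1>profit <w VVD>rose<c PUN>."
print(strip_markup(sample))  # prints: The profit rose.
```

Note that this also discards the POS tags and bncDoc references, so strip markup only when the metric under test needs plain text.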
No chunk of text occurs in more than one of a set of known-similarity corpora.
Sentences are truncated at the beginnings and ends of corpus 'slices', but nowhere else. Slices comprise 10,000 words (drawn from consecutive documents and files of the appropriate source in the BNC) so truncations are relatively few.
The corpora were built by taking, eg, 2 slices of Accountancy and then adding 8 of Guardian. This means that the beginning of the corpus is of one text type, the end of another. For metrics which involve eg cross-validation, this must be borne in mind before further sampling is undertaken.
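One possible way to allow for this ordering, sketched in Python, is to cut the corpus into fixed-size chunks and shuffle the chunks before dividing them into folds. This is a suggestion under stated assumptions, not the procedure used in the paper: the function names, the chunk size and the number of folds are all hypothetical choices, and tokens is assumed to be a list of words already read from one corpus file.

```python
import random

def shuffled_chunks(tokens, chunk_size=1000, seed=0):
    """Cut the token list into consecutive chunks and shuffle them, so
    that the two text types are no longer grouped at opposite ends."""
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, len(tokens), chunk_size)]
    random.Random(seed).shuffle(chunks)  # seeded, so runs are repeatable
    return chunks

def folds(chunks, k=5):
    """Deal the shuffled chunks round-robin into k folds."""
    return [chunks[i::k] for i in range(k)]
```

With chunks drawn at random, each fold should contain both text types in roughly the proportions of the whole corpus. A chunk that straddles a slice boundary will mix the two types, which is harmless for this purpose.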
Enjoy!
Adam Kilgarriff adam.kilgarriff@gmail.com
