The SiBol/Port corpus
The SiBol/Port (Siena-Bologna, Portsmouth) corpus is a corpus of English broadsheet newspapers.
Source data
The corpus consists of 787000 English newspaper articles from the years 1993, 2005 and 2010. The newspapers included are The Times, The Guardian, The Daily Telegraph, The Sunday Times and The Sunday Telegraph.
The authors of the text collection are Alan Scott Partington (Bologna University), John Morley (Siena University), Anna Marchi (Lancaster University) and Charlotte Taylor (Portsmouth University). The SiBol/Port Corpus Linguistics Project
Processing
- raw textual data parsed into a document structure
- tokenized using unitok with the English model
- cleaned by removing duplicate documents using onion
- tagged by TreeTagger using the Penn Treebank tagset and the English parameter file (utf-8)
- compiled in the Sketch Engine using the English sketch grammar for word sketches
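The deduplication step can be illustrated with a minimal sketch. It is not onion's actual implementation. It only shows the general idea of dropping a document when most of its token 7-grams have already been seen in earlier documents. The function names and the 0.5 threshold are assumptions made for this example. Only the n-gram length of 7 comes from the settings recorded in the changelog.

```python
from typing import Iterable, List, Set, Tuple

Ngram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int = 7) -> Set[Ngram]:
    """Return the set of token n-grams in a document (empty if too short)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(
    docs: Iterable[Tuple[str, List[str]]],
    n: int = 7,
    threshold: float = 0.5,  # hypothetical value, chosen for illustration
) -> List[str]:
    """Return the ids of documents kept after n-gram based deduplication.

    A document is dropped when the share of its n-grams already seen in
    previously kept documents exceeds the threshold.
    """
    seen: Set[Ngram] = set()
    kept: List[str] = []
    for doc_id, tokens in docs:
        grams = ngrams(tokens, n)
        # Documents shorter than n tokens have no n-grams and are always kept.
        dup_ratio = len(grams & seen) / len(grams) if grams else 0.0
        if dup_ratio > threshold:
            continue
        seen |= grams
        kept.append(doc_id)
    return kept


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog near the river bank".split()
    c = "stock markets fell sharply on monday after the central bank raised rates".split()
    # The second copy of `a` is dropped as a duplicate.
    print(deduplicate([("a1", a), ("a2", list(a)), ("c", c)]))
```

Because the set of seen n-grams grows with each kept document, the result depends on document order: the first copy of a repeated article is kept and later copies are dropped.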
Changelog
(1 Dec 2011)
- recompiled and installed on the production server
v1.1 (9 Nov 2011)
- changed deduplication settings to "-n 7 -m" -- 385 million tokens in 787000 newspaper articles
- set name to "SiBol/Port" to better reflect the data collections included
v1.0 (31 October 2011)
- initial version -- 332 million tokens in 643000 newspaper articles
Attachments
- sibolport_graph_by_title.png (36.6 KB)
- sibolport_graph_by_year.png (27.5 KB)


