Sketch Engine

Greek Web as Corpus

- http://beta.sketchengine.co.uk/auth/corpora/run.cgi/first_form?corpname=preloaded/gkwac

Description

GkWaC is a 100-million-word collection of POS-tagged texts downloaded from the Internet. It was prepared by Milos Husak of Masaryk University, Brno, for Lexical Computing Ltd., in collaboration with the Greek publishers Patakis and the Greek software company Neurolingo.

POS-tagging

Tokenization and part-of-speech tagging use the NeuroLingo Collection Analyzer, which provides the following information:

word
lemma
tag
morph
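
As a rough illustration, the four attributes above can be read into a small record. This is a minimal sketch that assumes one token per line with tab-separated attributes in the order listed; the page does not state the file layout, and the names Token and parse_token as well as the TAG and MORPH values are hypothetical placeholders, not real GkWaC annotations.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """One corpus position with the four attributes named on this page."""
    word: str
    lemma: str
    tag: str
    morph: str


def parse_token(line: str) -> Token:
    """Split one tab-separated token line into its four attributes.

    Assumption: attributes appear in the order word, lemma, tag, morph.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    return Token(*fields)


# Placeholder tag and morph values, invented for illustration only.
example = parse_token("σπίτια\tσπίτι\tTAG\tMORPH\n")
print(example.word, example.lemma)
```

Raising an error on a wrong field count, rather than padding silently, makes malformed lines visible early.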

- Tagset summary

NeuroLingo Collection Analyzer
http://www.neurolingo.gr/

Sketch Grammar

The sketch grammar, used to generate the Greek word sketches and the distributional thesaurus, was developed by Mavina Pantazara and Christos Tsalidis of Neurolingo.

Structure

The corpus is divided into documents (<doc></doc>), each identified by its id and also carrying information about its url, genre, year and epoch of publishing. Each document is further structured using the following tags:

paragraphs         <p></p>
sentences          <s></s>
headers            <h></h>
lists              <ul></ul>
list lines         <li></li>
non-Greek words    <non-greek></non-greek>
glue               <g/>
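
The structure tags above can be used to rebuild the running text of each sentence. The following is a minimal sketch, assuming each tag and each token sits on its own line, that the word form is the first tab-separated field, and that <g/> means "no space before the next token"; none of this is spelled out on the page. The function name sentences, the sample lines, and the TAG and MORPH values are hypothetical.

```python
import re
from typing import Iterable, Iterator, List

# Matches a line that consists of a single structure tag, e.g. <p>, </doc>, <doc id="1">.
STRUCT_RE = re.compile(r"^</?[A-Za-z][\w-]*(\s[^<>]*)?/?>$")


def sentences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the surface text of each <s>...</s> block.

    Tokens are joined with a space unless a <g/> (glue) line
    precedes the token, in which case no space is inserted.
    """
    parts: List[str] = []
    inside = False
    glue = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "<s>" or line.startswith("<s "):
            parts, inside, glue = [], True, False
        elif line == "</s>":
            if inside:
                yield "".join(parts)
            inside = False
        elif line == "<g/>":
            glue = True
        elif STRUCT_RE.match(line):
            continue  # other structure tags carry no text of their own
        elif inside and line:
            word = line.split("\t", 1)[0]
            if parts and not glue:
                parts.append(" ")
            parts.append(word)
            glue = False


# Invented sample; the id value and the TAG/MORPH fields are placeholders.
sample = [
    '<doc id="1">',
    "<p>",
    "<s>",
    "Καλημέρα\tκαλημέρα\tTAG\tMORPH",
    "<g/>",
    ",\t,\tTAG\tMORPH",
    "<non-greek>",
    "world\tworld\tTAG\tMORPH",
    "</non-greek>",
    "<g/>",
    "!\t!\tTAG\tMORPH",
    "</s>",
    "</p>",
    "</doc>",
]
result = list(sentences(sample))
print(result)
```

The glue flag is only cleared when a token is consumed, so a <g/> followed by another structure tag still suppresses the space before the next word.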

Text gathering

The texts were downloaded using BootCaT from a URL list generated from a list of Greek words provided by Patakis.

documents            :           96861
max doc per server   :             250

date                 :    October 2007
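
The page gives only the per-server limit, not how it was enforced. One plausible way to apply such a cap to a URL list is sketched below; the function name cap_per_server, the choice to key on the lower-cased host name, and the example URLs are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List
from urllib.parse import urlsplit


def cap_per_server(urls: Iterable[str], limit: int = 250) -> List[str]:
    """Keep at most `limit` URLs per host, preserving input order."""
    seen: Counter = Counter()
    kept: List[str] = []
    for url in urls:
        host = urlsplit(url).netloc.lower()
        if seen[host] < limit:
            seen[host] += 1
            kept.append(url)
    return kept


# Hypothetical URLs, with a small limit so the effect is visible.
urls = [
    "http://example.org/a",
    "http://example.org/b",
    "http://EXAMPLE.org/c",
    "http://example.net/x",
]
capped = cap_per_server(urls, limit=2)
print(capped)
```

Capping per host keeps a few very large sites from dominating the corpus.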

BootCaT, WebBootCaT
http://nlp.fi.muni.cz/publications/euralex2006_pomikale_rychly/WebBootCaT.pdf
