
RoWaC (Romanian Web as Corpus)

This corpus was gathered from the web by Monica Macoveiciuc (Alexandru Ioan Cuza University, Iasi) using two methods based on WebBootCat and Heritrix. The text collected with these tools was then processed to remove unwanted content. First version: August 2009. A programme of additions and improvements over a number of years is anticipated.

The corpus was part-of-speech tagged and lemmatized using TTL (Tokenizing, Tagging and Lemmatizing free running texts), developed by RACAI (Research Institute for Artificial Intelligence, Romanian Academy).

Word sketches were prepared by Monica Macoveiciuc.

MM, AK, August 2009