
HindiWaC

The corpus was prepared using the Corpus Factory method. Full details are given in Kilgarriff et al., LREC 2010.

Changelog

v2.0 (6th Jan 2012)

The corpus is tagged using the POS tagger downloaded from http://ltrc.iiit.ac.in/showfile.php?filename=downloads/shallow_parser.php.

The tagset details are described in  http://ltrc.iiit.ac.in/tr031/posguidelines.pdf

We wrote a simple sketch grammar for Hindi and generated the first Hindi word sketches. If you would like to contribute, please contact us.

v3.0 (17th Jan 2012)

We re-collected the Hindi Web Corpus in 2011. The corpus size is (size to be added).

The corpus is tagged using a new POS tagger (91.31% accuracy), lemmatizer, and morphological analyzer downloaded from http://sivareddy.in/downloads

The tagset details are described in  http://ltrc.iiit.ac.in/tr031/posguidelines.pdf

The sketch grammar has been revised with new rules that make use of postposition markers, which are crucial in Hindi dependency parsing. More rules are to be added. We invite collaboration from interested parties.
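To illustrate why postposition markers help, here is a minimal Python sketch. It is not the actual HindiWaC sketch grammar. It assumes tokens tagged with NN/NNP (nouns), PSP (postposition) and VM (main verb), and a hypothetical two-entry mapping from marker to relation. The function name, the relation labels and the example sentence are all made up for illustration.

```python
# Hypothetical illustration only: this is NOT the HindiWaC sketch grammar.
# Assumed tags: NN/NNP = noun, PSP = postposition, VM = main verb.

# Assumed mapping from a postposition marker to a grammatical relation label.
MARKER_RELATION = {
    "ने": "subject_of",  # ergative marker
    "को": "object_of",   # accusative/dative marker
}


def extract_relations(tagged):
    """Return (relation, noun, verb) triples for one tagged sentence.

    Simplification: every marked noun is attached to the first main verb.
    """
    verb = next((word for word, tag in tagged if tag == "VM"), None)
    if verb is None:
        return []
    triples = []
    # Look at each token together with the token that follows it.
    for (word, tag), (nxt, nxt_tag) in zip(tagged, tagged[1:]):
        if tag in ("NN", "NNP") and nxt_tag == "PSP" and nxt in MARKER_RELATION:
            triples.append((MARKER_RELATION[nxt], word, verb))
    return triples


# "Ram read the book", as a made-up tagged sentence.
sentence = [
    ("राम", "NNP"), ("ने", "PSP"),
    ("किताब", "NN"), ("को", "PSP"),
    ("पढ़ा", "VM"),
]
print(extract_relations(sentence))
```

In this toy setting the marker alone is enough to tell the subject from the object, even though both nouns precede the verb. Real rules would need to handle many more markers and ambiguous cases.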