Sketch Engine

Ticket #11 (new bug report)

Opened 1 year ago

Last modified 1 year ago

Hindi tokenization

Reported by: jan
Assigned to: jan
Priority: normal
Component: WebBootCaT
Version:
Keywords:
Cc:

Description

Reported by Niels Ott:

I've just been talking with Prashanth, and it turns out that WebBootCaT doesn't tokenize Hindi properly.

I tried it myself with a small Hindi corpus. Apparently, every Hindi character is treated as a separate word. My guess is that all Hindi word characters are being taken as delimiters for some reason.

Hindi uses whitespace as its word delimiter, so rough whitespace-based tokenization should work.

Here's some Hindi you can copy-paste as seed terms:

हाथ पर आसमान
लोग ऊँची उड़ान रखते हैं
हाथ पर आसमान रखते हैं
शहर वालों की सादगी देखो-
अपने दिल में मचान रखते हैं

I'm not exactly sure why this happens, but I think it's worth fixing. :-)
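
To make the report concrete, here is a minimal Python sketch of the symptom and of the rough whitespace tokenization suggested above. It is not WebBootCaT's code: both function names are hypothetical, and the ASCII-biased pattern is only one assumed way a tokenizer could end up emitting every Devanagari character as its own word.

```python
import re
import string


def ascii_biased_tokenize(text):
    """Hypothetical buggy tokenizer (an assumption, not WebBootCaT's code).

    Runs of ASCII letters and digits form words; every other
    non-whitespace character becomes a token of its own, so each
    Devanagari letter and vowel sign is split off separately.
    """
    return re.findall(r"[A-Za-z0-9]+|\S", text)


def whitespace_tokenize(text):
    """Rough tokenizer: split on whitespace, trim ASCII punctuation."""
    tokens = (tok.strip(string.punctuation) for tok in text.split())
    # Drop tokens that consisted of punctuation only.
    return [tok for tok in tokens if tok]


if __name__ == "__main__":
    seed = "हाथ पर आसमान"
    print(ascii_biased_tokenize(seed))  # one token per character
    print(whitespace_tokenize(seed))    # one token per word
```

The whitespace version keeps base letters and their combining vowel signs together simply because it never looks inside a word. Trimming only ASCII punctuation is a simplification; Devanagari marks such as the danda would need to be added to the stripped set.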

Change History

07/09/07 22:22:33 changed by jan

  • type changed from support request to bug report.


Lexical Computing Ltd
