Helping Our Own (HOO)
On the one hand, identifying grammatical and other linguistic errors in a text is a major challenge for language technology.
On the other, the majority of authors of computational linguistics papers are not native speakers of English. For many of them, and for some native speakers too, writing publication-quality papers in English that follow the conventions of the field is very hard, and failure to do so may block good researchers from getting their work published.
In this project we aim to bring the research problem and the practical problem together by defining a task, 'error correction for draft computational linguistics papers', and encouraging researchers (who may well be the very researchers who hope to benefit from successful solutions) to participate in a shared task.
We follow the usual shared-task methodology: we define the task in some detail and prepare datasets of manually corrected draft CL papers, one set for participants to use for developing their algorithms, and another set to be used for evaluation. We announce a schedule and encourage participation. Participants are then given a short period to download the evaluation dataset, process it with their tools, and return the output, which is then scored against the manual 'gold standard'. Finally, we hold a workshop to present findings, compare methods, and plan the way forward.
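To make the scoring step concrete, here is a minimal sketch of how system output might be compared with a gold standard. The metric is an assumption: this page does not specify how HOO output is scored, so the sketch uses exact-match precision, recall and F-score over sets of edits. The `Edit` record, its fields, the `score` function and the sample data are all hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Edit(NamedTuple):
    """Hypothetical representation of one correction (not the HOO format)."""
    start: int       # character offset where the corrected span begins
    end: int         # character offset where the corrected span ends
    correction: str  # replacement text ("" for a deletion)


def score(system: set[Edit], gold: set[Edit]) -> tuple[float, float, float]:
    """Return (precision, recall, F-score) using exact-match edits.

    Assumption: an edit counts as correct only if its span and its
    replacement text both match a gold-standard edit exactly.
    """
    matched = len(system & gold)
    precision = matched / len(system) if system else 0.0
    recall = matched / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Invented example data: three gold edits, four system edits, two in common.
gold = {Edit(10, 12, "an"), Edit(40, 45, "which"), Edit(80, 80, "the")}
system = {
    Edit(10, 12, "an"),
    Edit(40, 45, "that"),   # right span, wrong correction: no credit
    Edit(80, 80, "the"),
    Edit(95, 99, ""),       # spurious deletion
}

precision, recall, f_score = score(system, gold)
print(f"P={precision:.2f} R={recall:.2f} F={f_score:.2f}")
```

Exact matching is the strictest option. A real evaluation might also give credit for detecting an error at the right location even when the proposed correction differs from the gold standard.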
We think that the ACL Anthology Reference Corpus may be a particularly useful resource, as it embodies the target text type (though not without errors). It is also available through a user-friendly corpus interface. We also think that this task, domain- and register-specific error correction, may contrast in interesting ways with 'vanilla', general-purpose error correction. We shall take Microsoft Word's grammar checker as a reference system.
HOO has been supported by the Generation Challenges project and will culminate at ENLG 2011.
Links
- Paper making the case for HOO
  - Robert Dale and Adam Kilgarriff: Helping Our Own
  - International Natural Language Generation Conference 2010, Dublin, Ireland
- The coding scheme to be used in the project
  - Diane Nicholls: The Cambridge Learner Corpus - error coding and analysis for lexicography and ELT
  - Corpus Linguistics 2003, Lancaster, UK
- Schedule
- First data sample to follow shortly
Organisers
Attachments
- 2010_DaleKilg_INLG_HOO.pdf (81.3 KB), added by ak 14 months ago.
