multiged-2023

Publication

Abstract

This corpus consists of texts written by non-native learners and was used in the first shared task on Multilingual Grammatical Error Detection (MultiGED). We provide training, development, and test data for each of the five languages: Czech, English, German, Italian, and Swedish.

Some of these datasets have already been used in Grammatical Error Detection/Correction (GED/GEC) research, but we also release two new datasets: REALEC (English) and SweLL-gold (Swedish). Where possible, we use the same train/dev/test splits as previous work (GECCC, FCE, Falko-MERLIN) and create new splits only when necessary (REALEC, MERLIN, SweLL).

All datasets are derived from annotated second language learner essays.