Automatic Detection of Language and Annotation Model Information in CoNLL Corpora

Publication

Abstract

We introduce AnnoHub, an ongoing effort to automatically complement existing language resources with metadata about the languages they cover and the annotation schemes (tagsets) that they apply, to provide a web interface for their curation and evaluation by domain experts, and to publish them as an RDF dataset and as part of the (Linguistic) Linked Open Data (LLOD) cloud. In this paper, we focus on tabular formats with tab-separated values (TSV), a de facto standard for annotated corpora popularized by the CoNLL Shared Tasks. By extension, other formats for which a converter to CoNLL and/or TSV formats exists can be processed analogously. We describe our implementation and its evaluation against a sample of 93 corpora from the Universal Dependencies, v.2.3.

Keywords

LLOD, CoNLL, OLiA