Polishing the gold – how much revision do we need in treebanks?

Publication

Abstract

We present the second version of PetroGold, a gold-standard treebank for the oil & gas domain in the Portuguese language. The corpus went through a series of revisions guided by three methods tested in the literature: inter-annotator disagreement, inconsistent n-grams and verification rules.

We perform an intrinsic evaluation in which the model scores 90.92%, 89.09% and 84.07% in UAS (unlabeled attachment score), LAS (labeled attachment score) and CLAS (content-word labeled attachment score), respectively, with CLAS 1.11% higher than in the first version. We also run an experiment showing that simplifying the annotation of prepositional verbal arguments has a negative impact on the intrinsic evaluation. We conclude by discussing the results and future work.
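As an illustration of how the three metrics relate, the sketch below computes them on a toy five-token sentence. It is a simplified reading of the metrics, not the evaluation script used for PetroGold: it assumes gold tokenization, reports plain accuracy rather than F1, and the `FUNCTIONAL_RELATIONS` set and the `attachment_scores` helper are assumptions introduced here for the example.

```python
from typing import Dict, List, Tuple

# Each token is (head_index, dependency_relation); head 0 means the root.
Token = Tuple[int, str]

# Assumption: relations treated as function words (plus punctuation),
# which CLAS leaves out so that only content words are scored.
FUNCTIONAL_RELATIONS = {"aux", "case", "cc", "clf", "cop", "det", "mark", "punct"}


def attachment_scores(gold: List[Token], pred: List[Token]) -> Dict[str, float]:
    """Return UAS, LAS and a simplified CLAS for one aligned token sequence."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")

    pairs = list(zip(gold, pred))
    # UAS: the predicted head matches the gold head.
    uas_hits = sum(g[0] == p[0] for g, p in pairs)
    # LAS: both the head and the relation label match.
    las_hits = sum(g == p for g, p in pairs)
    # CLAS: same as LAS, restricted to tokens whose gold relation
    # (ignoring any subtype such as "aux:pass") is not functional.
    content = [(g, p) for g, p in pairs
               if g[1].split(":")[0] not in FUNCTIONAL_RELATIONS]
    clas_hits = sum(g == p for g, p in content)

    return {
        "uas": uas_hits / len(pairs),
        "las": las_hits / len(pairs),
        "clas": clas_hits / len(content) if content else 0.0,
    }


# Toy data: the parser attaches every token correctly but mislabels
# the last one (obj predicted as obl).
gold = [(2, "det"), (3, "nsubj"), (0, "root"), (5, "det"), (3, "obj")]
pred = [(2, "det"), (3, "nsubj"), (0, "root"), (5, "det"), (3, "obl")]
scores = attachment_scores(gold, pred)
print(scores)
```

In this toy case the single label error costs one of five tokens in LAS but one of three content words in CLAS, which is why CLAS is typically the lowest of the three scores.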

Keywords

Natural Language Processing; Language resources; Corpora reviewing; Treebank