Tagger Voting for Urdu

Publication at the Faculty of Mathematics and Physics

2012

Abstract

In this paper, we focus on improving part-of-speech (POS) tagging for Urdu by using existing tools and data for the language. In our experiments, we use Humayoun’s morphological analyzer, the POS tagging module of an Urdu Shallow Parser, and our own SVM Tool tagger trained on manually annotated CRULP data.

We convert the output of the taggers to a common format and, more importantly, unify their tagsets. On an independent test set, our tagger outperforms the other tools by far.

We gain some further improvement by implementing a voting strategy that lets us consider not only our own tagger but also the suggestions of the other tools. The final tagger reaches an accuracy of 87.98%.
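
The abstract does not spell out the voting strategy, so the following is only a minimal sketch of one common variant, weighted majority voting with a fallback tagger, and not the method used in the paper. The tagger names, the weights, and the tags `NN` and `VB` are hypothetical placeholders, and the sketch assumes all taggers already emit tags from the unified tagset.

```python
from collections import defaultdict


def vote(suggestions, weights, fallback):
    """Choose one tag for a token from the tags proposed by several taggers.

    suggestions: dict mapping tagger name -> proposed tag (None = no proposal)
    weights:     dict mapping tagger name -> vote weight (hypothetical values)
    fallback:    name of the tagger whose proposal wins a tie
    """
    scores = defaultdict(float)
    for tagger, tag in suggestions.items():
        if tag is not None:
            scores[tag] += weights.get(tagger, 0.0)
    if not scores:
        return None  # no tagger proposed anything

    best = max(scores.values())
    winners = [tag for tag, score in scores.items() if score == best]

    # On a tie, prefer the fallback tagger's proposal if it is among the winners.
    fallback_tag = suggestions.get(fallback)
    if fallback_tag in winners:
        return fallback_tag
    return sorted(winners)[0]  # deterministic choice otherwise


# Hypothetical weights: the strongest tagger gets a larger vote.
WEIGHTS = {"svm": 1.5, "analyzer": 1.0, "parser": 1.0}

if __name__ == "__main__":
    token_votes = {"svm": "NN", "analyzer": "VB", "parser": "VB"}
    print(vote(token_votes, WEIGHTS, fallback="svm"))
```

With these weights, the two weaker tools can overrule the strongest tagger only when they agree with each other; in practice the weights would have to be tuned on held-out data.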

Keywords

voting, taggers, Urdu