A Tagged Corpus and a Tagger for Urdu

Publication at Faculty of Mathematics and Physics

2014

Abstract

In this paper, we describe the release of a sizeable monolingual Urdu corpus automatically tagged with part-of-speech tags. We extend the work of Jawaid and Bojar (2012), who use three different taggers and then apply a voting scheme to disambiguate among the choices suggested by the individual taggers.
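As a rough illustration of how a voting scheme over three taggers can work, the sketch below takes a per-token majority vote. It is not the scheme of Jawaid and Bojar (2012), whose details are not given here: the function names, the tie-breaking rule (fall back to one designated tagger) and the tag labels are all assumptions made for the example.

```python
from collections import Counter


def vote(tags, fallback_index=0):
    """Pick one tag for a token from the tags proposed by each tagger.

    Assumption: a tag proposed by more than one tagger wins; if all
    taggers disagree, the tag of the tagger at fallback_index is used.
    """
    tag, count = Counter(tags).most_common(1)[0]
    if count > 1:
        return tag
    return tags[fallback_index]


def combine(tagger_outputs, fallback_index=0):
    """Combine token-aligned tag sequences, one sequence per tagger."""
    return [
        vote(list(token_tags), fallback_index)
        for token_tags in zip(*tagger_outputs)
    ]


if __name__ == "__main__":
    # Hypothetical outputs of three taggers for a two-token sentence.
    outputs = [
        ["NN", "VB"],
        ["NN", "ADJ"],
        ["PN", "ADJ"],
    ]
    print(combine(outputs))
```

In practice the fallback tagger would presumably be the one with the best standalone accuracy, but that choice is also an assumption of this sketch.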

We run this complex ensemble on a large monolingual corpus and release the tagged corpus. Additionally, we use this data to train a single standalone tagger, which we hope will significantly simplify Urdu processing.

The standalone tagger achieves an accuracy of 88.74% on test data.

Keywords

tagged corpus, tagger, Urdu