Twitter, a social media platform, has experienced substantial growth over the last few years. As a result, a huge number of tweets from various communities are available and are used in NLP applications such as opinion mining, information extraction and sentiment analysis.
One of the key pre-processing steps in such NLP applications is Part-of-Speech (POS) tagging. POS tagging of Twitter data (also called noisy text) differs from conventional POS tagging because of the informal nature of the text and the presence of Twitter-specific elements.
Resources for POS tagging of tweets are mostly available for English. However, the availability of tagset- and language-independent statistical taggers gives resource-poor languages such as Urdu an opportunity to extend the coverage of NLP tools to this new domain, for which little effort has been reported.
The aim of this study is twofold. The first is to investigate how well statistical taggers developed for POS tagging of structured text fare in the domain of tweet POS tagging.
The second is to examine how these taggers can be used to overcome the bottleneck of a manually annotated corpus for this new domain. To this end, the Stanford and MorphoDiTa taggers were trained on a gold-standard corpus of 500 Urdu tweets and were used for semi-automatic corpus annotation in a bootstrapped fashion.
Five bootstrapping iterations were performed for both taggers. At the end of each iteration, tagger performance was evaluated against the development set, and 100 automatically tagged, manually corrected tweets were added to the training set to retrain both models.
Finally, at the end of the last iteration, tagger performance was evaluated against the test set. The Stanford tagger achieved 93.8% precision, 92.9% recall and 93.3% F-measure,
whereas the MorphoDiTa tagger achieved 93.5% precision, 92.6% recall and 93% F-measure. A thorough error analysis of the output of both taggers is also presented.
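
The bootstrapped annotation procedure described above can be illustrated with a minimal Python sketch. It is not the setup used in the study: a most-frequent-tag baseline stands in for the Stanford and MorphoDiTa taggers, token accuracy stands in for precision, recall and F-measure, and the names train, tag, accuracy, bootstrap and correct_fn are hypothetical. The manual correction step is represented by a caller-supplied function.

```python
from collections import Counter, defaultdict


def train(tagged_sents):
    """Train a stand-in tagger (most frequent tag per word).

    Assumption: this baseline only illustrates the loop; it is not
    the Stanford or MorphoDiTa tagger.
    """
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sent in tagged_sents:
        for word, tag_ in sent:
            counts[word][tag_] += 1
            all_tags[tag_] += 1
    default = all_tags.most_common(1)[0][0]  # fallback for unseen words
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lexicon, default


def tag(model, words):
    """Tag a list of words with the stand-in model."""
    lexicon, default = model
    return [(w, lexicon.get(w, default)) for w in words]


def accuracy(model, gold_sents):
    """Token-level accuracy of the model on gold-tagged sentences."""
    correct = total = 0
    for sent in gold_sents:
        predicted = tag(model, [w for w, _ in sent])
        correct += sum(p == g for p, g in zip(predicted, sent))
        total += len(sent)
    return correct / total


def bootstrap(seed, batches, dev, correct_fn, iterations=5):
    """Semi-automatic annotation loop.

    seed:       gold-tagged sentences used for the initial model
    batches:    untagged batches (lists of word lists), one per iteration
    dev:        gold-tagged development sentences
    correct_fn: hypothetical stand-in for a human correcting one
                automatically tagged sentence
    """
    training = list(seed)
    model = train(training)
    dev_scores = []
    for batch in batches[:iterations]:
        dev_scores.append(accuracy(model, dev))          # evaluate on dev set
        auto_tagged = [tag(model, words) for words in batch]
        training.extend(correct_fn(s) for s in auto_tagged)  # add corrected data
        model = train(training)                          # retrain
    return model, dev_scores
```

Under these assumptions, the study's setting would correspond to a seed of 500 gold-tagged tweets, five batches of 100 tweets each, and a single final evaluation of the returned model on the held-out test set.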