Towards Silver Standard Dependency Treebank of Urdu Tweets

Publication

Abstract

A manually annotated corpus is a prerequisite for several natural language processing applications, including parsing. Nevertheless, annotated corpora are not always available for resource-poor languages, especially when the domain under consideration is noisy user-generated data found on social media platforms such as Twitter.

To overcome this lack of hand-annotated corpora, researchers have focused their attention on semi-automatic corpus annotation methods. This paper describes experiments carried out using semi-automatic methods such as self-training and co-training in an attempt to create a silver-standard dependency treebank of Urdu tweets.

Six iterations of each approach were performed under the same experimental conditions using MaltParser and the Parsito parser, both statistical data-driven parsers. For the self-training experiments, the best-performing MaltParser model was trained on 1250 Urdu tweets and achieved 70.2% LA, 74.4% UAS, and 63% LAS.

The best-performing Parsito model was also trained on 1250 Urdu tweets and achieved 70.8% LA, 74.8% UAS, and 63.4% LAS. For the co-training experiments, the best-performing MaltParser model was trained on 1500 Urdu tweets and achieved 70.5% LA, 74.4% UAS, and 63.2% LAS.

The best-performing Parsito model was also trained on 1500 Urdu tweets and achieved 70.5% LA, 74.3% UAS, and 63% LAS. Although there was not much difference between the results of the two approaches, the co-training results were slightly better for both parsers, so co-training was used to generate a silver-standard dependency treebank of 4500 Urdu tweets.
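For readers unfamiliar with the reported metrics, the following is a minimal sketch of how such scores are typically computed per token. It assumes that LA denotes label accuracy, UAS the unlabeled attachment score, and LAS the labeled attachment score; the abstract does not expand these abbreviations. The function name `attachment_scores` and the toy sentence are hypothetical illustrations, not the authors' evaluation code or data.

```python
from typing import Dict, List, Tuple

# Each token is represented as (head index, dependency label).
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], pred: List[Token]) -> Dict[str, float]:
    """Return LA, UAS and LAS as percentages over aligned tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must be the same length")
    if not gold:
        raise ValueError("at least one token is required")

    total = len(gold)
    # LA: the predicted label matches, regardless of the head.
    label_ok = sum(g[1] == p[1] for g, p in zip(gold, pred))
    # UAS: the predicted head matches, regardless of the label.
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: both head and label match.
    both_ok = sum(g == p for g, p in zip(gold, pred))

    return {
        "LA": 100.0 * label_ok / total,
        "UAS": 100.0 * head_ok / total,
        "LAS": 100.0 * both_ok / total,
    }


# Hypothetical four-token sentence: one wrong label, one wrong head.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (3, "punct")]
scores = attachment_scores(gold, pred)
print(scores)
```

Under these assumed definitions, LAS can never exceed either LA or UAS, which is consistent with the figures reported above.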

Keywords

co-training, dependency parsing, manual annotation, silver-standard, self-training, tweets, Universal Dependencies, Urdu