Linguistically-augmented Perplexity-based Data Selection for Language Models

Publication at Faculty of Mathematics and Physics

2015

Abstract

This paper explores the use of linguistic information for the selection of data to train language models. We start from the state-of-the-art method in perplexity-based data selection and extend it to use word-level linguistic units (i.e., lemmas, named entity categories and part-of-speech tags) instead of surface forms.
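As a rough illustration of the idea, the sketch below ranks candidate sentences by their perplexity under a model trained on in-domain text, where each sentence is a list of word-level units (here, lemmas rather than surface forms). It is a minimal sketch under stated assumptions and not the paper's implementation: the add-one smoothed unigram model, the function names and the toy data are all hypothetical, and the abstract does not specify the exact selection criterion.

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Add-one smoothed unigram model over token lists (an assumed stand-in
    for a real language model); returns a function token -> probability."""
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen tokens
    return lambda tok: (counts[tok] + 1) / (total + vocab)

def perplexity(prob, sentence):
    """Per-token perplexity of a non-empty token list under `prob`."""
    log_sum = sum(math.log(prob(tok)) for tok in sentence)
    return math.exp(-log_sum / len(sentence))

def rank_by_perplexity(prob, candidates):
    """Indices of candidate sentences, lowest perplexity first."""
    return sorted(range(len(candidates)),
                  key=lambda i: perplexity(prob, candidates[i]))

# Hypothetical toy data, already converted to lemmas.
in_domain = [["the", "cat", "sleep"], ["the", "dog", "sleep"]]
candidates = [["stock", "market", "fall"], ["the", "cat", "sleep"]]

model = train_unigram(in_domain)
ranking = rank_by_perplexity(model, candidates)
```

The same ranking function can be applied to any representation of the sentences (surface forms, lemmas, named entity categories or part-of-speech tags), which yields one ranking per representation.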

We then present two methods that combine the different types of linguistic knowledge as well as the surface forms: (1) naïve selection of the top-ranked sentences selected by each method, and (2) linear interpolation of the datasets selected by the different methods. The paper presents detailed results and analysis for four languages with different levels of morphological complexity (English, Spanish, Czech and Chinese).
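The two combination strategies can be sketched as follows. This is only an illustrative reading of the abstract: the first function takes the union of the top-ranked sentence indices from each ranking, and the second assumes that "linear interpolation" means a weighted sum of the probabilities assigned by models trained on each selected dataset. The function names, the weights and the example values are hypothetical.

```python
def naive_combination(rankings, top_n):
    """Union of the top_n sentence indices from each ranking,
    kept in first-seen order and without duplicates."""
    selected = []
    seen = set()
    for ranking in rankings:
        for idx in ranking[:top_n]:
            if idx not in seen:
                seen.add(idx)
                selected.append(idx)
    return selected

def interpolate(models, weights):
    """Linear interpolation: p(tok) = sum_i weights[i] * models[i](tok).
    Each model is a function token -> probability."""
    if len(models) != len(weights):
        raise ValueError("need exactly one weight per model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return lambda tok: sum(w * m(tok) for w, m in zip(weights, models))

# Hypothetical rankings of three sentences under two representations.
by_surface_form = [2, 0, 1]
by_lemma = [1, 2, 0]
combined = naive_combination([by_surface_form, by_lemma], top_n=2)

# Hypothetical constant models, used only to show the arithmetic.
mixed = interpolate([lambda tok: 0.2, lambda tok: 0.6], [0.5, 0.5])
```

In practice the interpolation weights would be tuned on held-out in-domain data; the equal weights here are placeholders.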

The interpolation-based combination outperforms the purely statistical baseline in all the scenarios, resulting in language models with lower perplexity. In relative terms the improvements are similar regardless of the language, with perplexity reductions achieved in

Keywords

linguistically augmented perplexity based data selection language models