Word and Sentence Boundaries in Automatic Text Processing


Abstract

This paper explores the major linguistic challenges involved in preprocessing a corpus of theses and dissertations from the Oil and Gas domain. Besides raising specific questions related to this domain and to scientific texts, we measured the extent to which these issues hinder automatic processing.

We built a gold standard corpus for tokenization and sentence segmentation that includes several difficult cases and is now available to the Portuguese NLP community. This corpus can be used to evaluate automatic tokenization methods and to improve the quality of subsequent processing steps.

Keywords

Natural Language Processing; Computational linguistics; Preprocessing; Tokenization; Text segmentation