
TrTok: A Fast and Trainable Tokenizer for Natural Languages

Publication at Faculty of Mathematics and Physics |
2012

Abstract

We present a universal data-driven tool for segmenting and tokenizing text. The presented tokenizer lets the user define which positions in the text should be considered as candidate token and sentence boundaries.

These candidates are then judged by a classifier which is trained from provided tokenized data. The features passed to the classifier are also defined by the user, making, e.g., the inclusion of abbreviation lists trivial.
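The pipeline described above can be sketched as follows. This is a minimal illustrative sketch, not TrTok's actual code or configuration format: the feature set (abbreviation membership, following capitalization), the perceptron learner, and all names are assumptions chosen for brevity. The idea it demonstrates is the one stated in the abstract: candidate boundary positions are proposed, user-defined features are extracted for each, and a classifier trained from tokenized data decides which candidates are real boundaries.

```python
# Hypothetical sketch of classifier-based boundary detection with
# user-defined features; names and features are illustrative, not TrTok's.

ABBREVIATIONS = {"Dr", "Mr", "Mrs", "e.g", "etc"}  # user-supplied list

def features(text, i):
    """User-defined boolean features for a candidate boundary at index i."""
    words_before = text[:i].split()
    prev_word = words_before[-1].rstrip(".") if words_before else ""
    following = text[i + 1:].lstrip()
    return (
        prev_word in ABBREVIATIONS,                  # abbreviation precedes dot
        bool(following) and following[0].isupper(),  # capital letter follows
    )

def candidates(text):
    """Candidate sentence boundaries: here, every '.' position."""
    return [i for i, c in enumerate(text) if c == "."]

def train(examples, epochs=20):
    """Tiny perceptron trained from (features, label) pairs derived
    from already-tokenized training data."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for f, label in examples:
            pred = 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0
            if pred != label:
                for j, fj in enumerate(f):
                    w[j] += (label - pred) * fj
                b += label - pred
    return w, b

def is_boundary(w, b, f):
    """Classifier's yes/no judgement for one candidate."""
    return sum(wi * fi for wi, fi in zip(w, f)) + b > 0

# Toy training examples, as they would be extracted from tokenized data:
# (abbreviation_before, capital_after) -> is a sentence boundary?
examples = [
    ((True, True), 0),    # "Dr. Smith"  -> no boundary
    ((False, True), 1),   # "home. He"   -> boundary
    ((True, False), 0),
    ((False, False), 0),
]
```

Swapping in a richer feature function or a different classifier leaves the rest of the pipeline untouched, which is the customizability the abstract refers to.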

This level of customizability makes the tokenizer a versatile tool, which we show is capable of sentence detection in English text as well as word segmentation in Chinese text. In the case of English sentence detection, the system outperforms previous methods.

The software is available as an open-source project on GitHub.