
Splitting and Identifying Czech Compounds: A Pilot Study

Publication at Faculty of Mathematics and Physics | 2021

Abstract

We present pilot experiments on splitting and identifying Czech compound words. We created an algorithm that measures the linguistic similarity of two words by finding the shortest path through a matrix of estimated mutual correspondences between their phonemic strings.
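The shortest-path idea can be illustrated with a minimal sketch. It is not the authors' implementation: the cost values, the vowel/consonant grouping, the gap penalty, and the function names `correspondence_cost`, `shortest_path_distance`, and `similarity` are all assumptions made for illustration, and plain spelling stands in for a real phonemic transcription. The sketch fills a matrix of pairwise correspondence costs and finds the cheapest monotone path through it by dynamic programming.

```python
# Hypothetical sketch: similarity of two strings via the cheapest path
# through a matrix of pairwise correspondence costs.
# Assumption: characters stand in for phonemes; costs are illustrative.

VOWELS = set("aeiouyáéíóúůýě")


def correspondence_cost(p, q):
    """Estimated cost of aligning symbol p with symbol q (assumed values)."""
    if p == q:
        return 0.0
    # Two vowels or two consonants are treated as a closer correspondence.
    if (p in VOWELS) == (q in VOWELS):
        return 0.5
    return 1.0


def shortest_path_distance(a, b, gap=1.0):
    """Cost of the cheapest path from the top-left to the bottom-right cell."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    # The first column and first row can only be reached by skipping symbols.
    for i in range(1, rows):
        dist[i][0] = i * gap
    for j in range(1, cols):
        dist[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j - 1] + correspondence_cost(a[i - 1], b[j - 1]),
                dist[i - 1][j] + gap,  # skip a symbol of a
                dist[i][j - 1] + gap,  # skip a symbol of b
            )
    return dist[-1][-1]


def similarity(a, b):
    """Map the path cost to a score in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - shortest_path_distance(a, b) / longest
```

Under these assumed costs, `similarity("voda", "vody")` is 0.875, since the two strings differ only in one vowel-to-vowel correspondence. A score of this kind could, for example, help match a compound component such as "vodo" to a candidate parent word such as "voda".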

Additionally, we implemented a neural compound-splitting tool (Czech Compound Splitter) using the Marian Neural Machine Translation framework and trained it on a data set containing 1,164 hand-annotated compounds and about 280,000 synthetically created compounds. In compound splitting, the first solution (the similarity algorithm) achieved an accuracy of 28% and the second (the Czech Compound Splitter) achieved 54% on a separate validation data set.

In compound identification, the Czech Compound Splitter achieved an accuracy of 91%.