Diacritics are essential in languages that use them, because they can change the meaning of a sentence. Writing without diacritics makes sentences ambiguous; nevertheless, people often omit them for several reasons, such as typing speed, convenience, or texting on devices that do not support diacritics.
As a result, such texts are difficult to process in downstream natural language processing (NLP) tasks such as machine translation, sentiment analysis, or question answering. Diacritics restoration is therefore critical for further use or processing in NLP-related tasks.
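
To make the ambiguity concrete, the following is a minimal sketch of how diacritics are lost, using only Python's standard `unicodedata` module. The function name `strip_diacritics` and the sample words are illustrative assumptions and are not taken from the study.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks, simulating text typed without diacritics."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers tone and vowel marks.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # The Vietnamese letter "đ" has no Unicode decomposition, so map it explicitly.
    stripped = stripped.replace("đ", "d").replace("Đ", "D")
    return unicodedata.normalize("NFC", stripped)


# Six distinct Vietnamese words that differ only in their diacritics.
words = ["ma", "má", "mà", "mả", "mã", "mạ"]
collapsed = {strip_diacritics(w) for w in words}
print(collapsed)
```

All six words collapse to the same string once the marks are removed, which is the ambiguity a restoration model has to resolve from context.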
In this study, we propose a method that combines a convolutional neural network (CNN) and a bidirectional gated recurrent unit (Bi-GRU) to restore diacritics. In addition, we use a residual block to address the vanishing gradient problem of recurrent neural networks.
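
As a reminder of why a residual connection helps with vanishing gradients, the standard formulation can be written as below. Here $F$ denotes whichever layers the block wraps and $\mathbf{x}$ its input; the abstract does not specify which layers these are, so the symbols are generic assumptions rather than the exact design used in this study.

```latex
\[
\mathbf{y} = \mathbf{x} + F(\mathbf{x}),
\qquad
\frac{\partial \mathbf{y}}{\partial \mathbf{x}}
  = \mathbf{I} + \frac{\partial F(\mathbf{x})}{\partial \mathbf{x}}
\]
```

Because of the identity term $\mathbf{I}$, the gradient has a direct path through the block even when $\partial F / \partial \mathbf{x}$ is small.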
We applied the model to restoring diacritics in Vietnamese, which has the highest ratio of diacritics in words. The approach achieves a character accuracy of 98.63% and a word accuracy of 94.77%.
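
The two metrics reported above can be sketched as follows. This is one plausible definition, assuming that restoration preserves the number of characters and words, so that prediction and reference can be compared position by position; the exact evaluation protocol is not given here, and the function names and sample sentence are illustrative.

```python
import unicodedata


def character_accuracy(predicted: str, reference: str) -> float:
    """Fraction of character positions where prediction and reference agree."""
    # Normalize to NFC so each accented letter is a single code point.
    predicted = unicodedata.normalize("NFC", predicted)
    reference = unicodedata.normalize("NFC", reference)
    if len(predicted) != len(reference):
        raise ValueError("prediction and reference must have the same length")
    if not reference:
        return 1.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def word_accuracy(predicted: str, reference: str) -> float:
    """Fraction of whitespace-separated words restored exactly."""
    pred_words = unicodedata.normalize("NFC", predicted).split()
    ref_words = unicodedata.normalize("NFC", reference).split()
    if len(pred_words) != len(ref_words):
        raise ValueError("prediction and reference must have the same word count")
    if not ref_words:
        return 1.0
    matches = sum(p == r for p, r in zip(pred_words, ref_words))
    return matches / len(ref_words)


reference = "tôi là sinh viên"
predicted = "tôi la sinh viên"  # one word left without its diacritic
char_acc = character_accuracy(predicted, reference)
word_acc = word_accuracy(predicted, reference)
print(char_acc, word_acc)
```

A single wrong character makes a whole word count as incorrect, which is why word accuracy is typically lower than character accuracy on the same output.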