Language Technologies for Research in Humanities

Class at the Faculty of Mathematics and Physics

NPFL131

Syllabus

Working with large texts, we will learn the basic text-processing methods needed to extract non-trivial information from them. For Czech we will use works by Karel Čapek; for Classical Chinese, selected texts from https://github.com/kanripo; for other languages, works chosen according to the students' focus.

importance and statistical properties of Big Data
unix shell; most basic commands
more unix commands and basic Perl to manipulate texts
text editors
quantitative analysis of text
comparing texts and visualizing differences
search using regular expressions
using regular expressions to batch edit text
diacritic removal, sentence segmentation, tokenization
getting information on Chinese characters from the Unihan database
rule-based automatic part-of-speech identification
creating your own corpus

"NLP workflow engines" - GATE, OpenNLP, Treex
calling REST APIs

using UDPipe and selecting the appropriate model when more than one is available for the language
visualization of analysis and results
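Several of the topics above (diacritic removal, tokenization, quantitative analysis of text) can be tried out in a few lines. The following is only a minimal sketch in Python rather than the shell and Perl tools named in the syllabus; the function names and the sample sentence are made up for illustration and are not course material.

```python
import re
import unicodedata
from collections import Counter


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text: str) -> list[str]:
    """Naive tokenizer: lowercase the text and keep runs of word characters."""
    return re.findall(r"\w+", text.lower())


def word_frequencies(text: str) -> Counter:
    """Count how often each token occurs in the text."""
    return Counter(tokenize(text))


# Illustrative sentence only (not taken from any corpus used in the class).
sample = "Čapek psal hry. Čapek psal i romány."

if __name__ == "__main__":
    print(strip_diacritics(sample))
    print(word_frequencies(sample).most_common(3))
```

The tokenizer here simply discards punctuation, which is enough for frequency counts but too crude for sentence segmentation; that task needs rules of its own or a trained tool.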

Annotation

You will learn to use tools and procedures for the automatic processing of large-scale texts in different languages efficiently. The skills acquired will facilitate independent scientific work with language data in any area of the humanities.