Language Data Resources

Class at Faculty of Mathematics and Physics


1. Introduction - motivation for building language data resources - typology of language data, usage - principles of annotation - using annotated data for evaluation in Natural Language Processing tasks

2. Corpora - corpus typology, tag sets - example corpora, Czech National Corpus - parallel corpora - searching in corpora

3. Treebanks - constituency and dependency syntactic structures, convertibility - deep syntactic trees - treebank examples

4. Computer lexicography - types of lexical information - examples of lexical data (inflectional and derivational lexicons, wordnets, valency lexicons, translation lexicons etc.)

5. Other types of language data resources - named entity corpora, sentiment corpora, dialog corpora, etc.

6. Copyright perspective on building language data resources; licenses


The goal of the course is to provide students with a survey of the field of Language Data Resources. Selected types of linguistic annotation will be described, with emphasis on annotating corpus data and lexical data. Students will gain practice in using software tools for processing such data, especially in the Python programming language. Leading projects for English, Czech, and some other languages will be used for illustration.