Teachers: Mgr. Václav Cvrček, PhD. (vaclav.cvrcek@ff.cuni.cz) & Mgr. Lucie Chlumská (chlumska@trnka.ff.cuni.cz)
Room: Computer lab, Národní 37 (Room 7)
Structure of this course
The course consists of 13 lessons (2 hours each).
1. Introduction to corpus linguistics, registration, structure of corpora in the Czech National Corpus project and the British National Corpus
2. CQL (corpus query language), basic regular expressions (use of wildcard characters, repetition operators etc.)
3. Advanced regular expressions (logical operators, positive and negative filters, using the graphical interface for creating complex queries)
4. Word - lemma - tag, principles of lemmatisation and morphological tagging of corpora (stochastic methods, rule-based methods of disambiguation); using lemmas and tags in complex queries
5. The concept of representativeness in corpora of spoken and written language, Heaps' law and its consequences for corpus size
6. Internal structure of the corpus (opus, document, sentence), text types (fiction, newspapers, science; formal, informal speech), working with subcorpora
7. Collocations and statistical methods for their identification; multi-word units in the description of language (lexicon, grammar)
8. Word sketches, co-occurrences and semantic prosody; corpus-based syntagmatic and paradigmatic approach to language units
9. Parallel and multilingual corpora (InterCorp)
10. Basic statistics for corpus linguistics (mean, standard deviation, chi-square, normal distribution, correlation, Zipf's laws)
11. Corpus-based vs. corpus-driven approach; corpus applications: phonology (graphemics), morphology (language system vs. prototypes)
12. Corpus applications: lexicography (automatic term recognition, collocation dictionaries), syntax, stylometry
13. Presentation of students' work, discussion
Evaluation
Credits: 5 (Z)
a) active participation in lessons
b) presentation of individual corpus-based research
Credits: 10 (Z+PP)
a) active participation in lessons
b) presentation of individual corpus-based research
c) essay describing the methods used and the conclusions drawn in the research (emphasis is placed on the interpretation of facts)
Objectives
This course is for all prospective users of language corpora. Its main objective is to show the advantages of a corpus-based description of language and to teach students how to use corpora for their own linguistic research. The practical part of the course includes working with the Bonito corpus client and an introduction to the structure and philosophy of the corpora bundled in the Czech National Corpus project (namely SYN2005, Oral2008 and InterCorp) and of the British National Corpus. In the theoretical part, we will examine some basic notions of corpus linguistics, such as collocations, representativeness of corpora and word sketches.