
Work with Language Corpora

Class at Faculty of Arts


1. What is a corpus? How is it built, and what can it be used for? Basic typology of corpora. Czech and German corpora (DeReKo, CNK, InterCorp), introduced through a presentation and joint work. Setting up the conditions for working with corpora (laptops, internet access, registration for COSMAS II and CNK). Searching for simple expressions and brainstorming together about the possible uses of corpora, their limits, and the interpretation of results.

2. Text processing in corpora (presentation), a brief history of the discipline, basic terminology (token, type, lemma, parser, disambiguation, corpus vs. corpus manager, concordance, etc.). Basic search workflow in COSMAS II - archive selection, corpus selection, formulation of the simplest queries.

3. DeReKo (IDS Mannheim) - COSMAS II - basic functions of the corpus search engine; practising basic search in COSMAS II, regular characters, settings (Optionen)

4. Advanced search and its practice in COSMAS II; multi-word phrases

5. Extension functions from the COSMAS II menu; working with tagged corpora

6. Cooccurrence analysis - how to enter queries and evaluate the generated data; practising CQL (Corpus Query Language); phrases and collocations - a special task for corpus linguistics

7. Associated applications - CCDB, SOM, etc.; DWDS

8. Czech National Corpus - basic information, basic search

10. Czech National Corpus - search practice, explanation of extension functions

11. InterCorp

12. SyD, Morfio, Treq, KWords, WaG

13. Final discussion.
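
To make some of the terminology from session 2 concrete (token, type, concordance), the following is a minimal Python sketch. It is not how COSMAS II or the CNK tools work internally; the tokenizer is deliberately naive, the function names `tokenize` and `kwic` and the sample sentence are invented for illustration, and lemmatization is left out because it requires morphological analysis.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a naive, illustrative tokenizer)."""
    return re.findall(r"\w+", text.lower())


def kwic(tokens, keyword, window=2):
    """Build keyword-in-context lines with `window` tokens on each side of a hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Hypothetical sample text, not taken from any real corpus.
sample = "The corpus is large. A corpus manager searches the corpus."

tokens = tokenize(sample)   # every running word counts as one token
types = Counter(tokens)     # each distinct word form is one type

print("tokens:", len(tokens))
print("types:", len(types))
for line in kwic(tokens, "corpus"):
    print(line)
```

A real corpus manager adds much more on top of this (lemmas, tags, metadata, indexed search), but the distinction between counting running words and counting distinct word forms is the same.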


Work with Language Corpora is a course in which Bachelor's students become acquainted with language corpora and the ways they can be used in linguistic practice. The course focuses on practical work and prepares students for the Seminar in Corpus Linguistics, which follows in the NMgr. programme.

In addition to introducing the basic types and properties of corpora, the course emphasizes how corpora can be applied in a linguist's everyday work. It concentrates on corpora of Czech (CNK) and German (DeReKo and DWDS), including their associated applications: Kookkurrenzanalyse, CCDB, and SOM for German, and Treq, SyD, Morfio, WaG, etc. for Czech.

As for practical use, the course takes into account what students need when doing research and writing seminar papers and theses in other disciplines.
