
English Language and Corpus Linguistics I: Corpora, tools and statistics

Class at Faculty of Arts


Teacher: Mgr. Lucie Chlumská, Room: S131 (at the back of the courtyard), Jana Palacha 2

The course consists of 12-13 lessons (2 academic hours each).

1. introduction to the course, registration for the corpora (CNC, BNC, COCA)

2. introduction to corpus linguistics, types of corpora, basic queries, regular expressions (use of wildcard characters, repetition operators etc.), KonText interface

3. principles of lemmatisation and morphological tagging of corpora (stochastic methods, rule-based methods of disambiguation), CQL (corpus query language), using lemmas and tags in complex queries

4. advanced regular expressions (logical operators), positive and negative filters, creating subcorpora based on different metadata, the concept of representativeness in corpora of spoken and written language

5. the InterCorp parallel corpus, searching for translation equivalents, false friends in translation, creating subcorpora on a parallel corpus

6. collocations and statistical methods for their identification

7. corpora in translation studies, translation universals in English

8. BNC - about the corpus, BNC Web interface

9. English spoken corpora (incl. spoken part of the BNC), corpus-based vs. corpus-driven approach

10. COCA, COHA and other corpora in Mark Davies’ interface, querying

11. British and American English - case studies

12. other English corpora and interfaces, building a corpus, AntConc (clusters, keywords)

13. presentation of students' work, discussion

Evaluation

Credits: 5 (Z)

a) active participation in lessons

b) presentation of individual corpus-based research


This course is intended mainly for students of English as a first introduction to corpus linguistics and corpus-based research. Its main objective is to show the advantages of corpus-based analysis and description of language and to teach students how to use corpora in their own linguistic research. The course is practical (hands-on): students work at computers throughout. They learn to work with several corpus interfaces (KonText, BNC-Web, COCA etc.), i.e. how to form a complex query and how to analyse the results using basic statistics and the functions of each interface. The course also introduces the structure and philosophy of the corpora maintained within the Czech National Corpus project (especially the English-Czech part of the InterCorp parallel corpus), the British National Corpus, and the American corpora COCA, COHA and Time. In addition, several freely available corpus-based tools will be introduced, such as Treq (for the analysis of translation equivalents) and KWords (for the analysis of keywords in texts). At least two lessons will be reserved for an introduction to free software tools that let users assemble and analyse their own corpora (e.g. AntConc and LancsBox).

As a necessary theoretical background, some of the basic notions of corpus linguistics (such as collocations, representativeness of corpora, n-grams, word sketches etc.) will be explained.
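
To give a rough idea of what two of these notions mean in practice, the following minimal Python sketch counts n-grams in a toy text and scores one word pair with two association measures often used for collocations, the MI-score and logDice. It is an illustration only: the example sentence, the function names and the choice of measures are assumptions made here, not course material, and real corpus interfaces compute such scores on lemmatised, tagged corpora of many millions of tokens.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mi_score(pair_freq, freq_x, freq_y, corpus_size):
    """MI-score: log2 of observed over expected co-occurrence frequency."""
    expected = freq_x * freq_y / corpus_size
    return math.log2(pair_freq / expected)


def log_dice(pair_freq, freq_x, freq_y):
    """logDice: 14 + log2 of the Dice coefficient of the two words."""
    return 14 + math.log2(2 * pair_freq / (freq_x + freq_y))


# Toy "corpus" (hypothetical example text, tokenised on whitespace).
text = "strong tea and strong coffee but powerful computers and strong tea"
tokens = text.split()

unigram_freq = Counter(tokens)
bigram_freq = Counter(ngrams(tokens, 2))

# Score the adjacent pair "strong tea" as a candidate collocation.
pair = ("strong", "tea")
mi = mi_score(bigram_freq[pair], unigram_freq["strong"],
              unigram_freq["tea"], len(tokens))
ld = log_dice(bigram_freq[pair], unigram_freq["strong"], unigram_freq["tea"])

print(f"{' '.join(pair)}: frequency={bigram_freq[pair]}, "
      f"MI={mi:.2f}, logDice={ld:.2f}")
```

On a text this small the scores carry no linguistic weight; the point is only how a co-occurrence frequency is compared with the frequencies of the individual words.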
