From the Corpus as Open Source for Investigation to Commercial Products

Publication at Catholic Theological Faculty | 2007

Abstract

Corpora have developed from pure texts into sophisticated tagged tools. They provide quick and seemingly indisputable answers to complicated questions about language. This paper proposes reasons why those answers do not necessarily describe the language, using examples from the Czech representative corpus SYN2000: (1) Texts are often represented in a form in which nobody has written or published them (or could have). (2) The tagging is far from plausible as a means of labelling language phenomena, both in conception and in implementation. (3) Even if statistical data were proof enough, an explanation of language cannot be derived from the data alone; it must be obtained by interpreting the data. (4) Linguistic inquiry is impeded rather than supported by the development of corpus-based tools, because a researcher cannot modify or test how appropriate their settings are for each individual case (as with the use of WordSketches). Thus, the more sophisticated the corpus tools, the less the guarantee of scientifically plausib