Characterizing Czech Internet Texts through Multi-dimensional Analysis

Publication at Faculty of Arts

2020

Abstract

The rapid development of computational methods in natural language processing in the final decades of the 20th century gave significant impetus to most linguistic disciplines, including register research. Register, as a concept defined by functional variation, depends on the conscious or unconscious choices speakers make during communication.

To detect these choices, it is necessary to measure as many potentially relevant linguistic variables as possible. The present paper builds on the studies of Douglas Biber (e.g. 1986, 1987, 1990), who developed a methodology known as multi-dimensional analysis (MDA).

The paper aims to apply MDA to Czech internet texts. Data were obtained from Araneum Bohemicum Maximum, a web-crawled corpus of Czech internet texts (Benko, 2014), and sampled for annotation purposes.

Each of the 1,000 text samples was then manually assigned to one of the web registers (Biber & Egbert, 2016). An exploratory factor analysis, adapted to Czech data (Cvrček et al., 2018a, 2018b), is then used to discover the dimensions of variation.

The distribution of text factor scores within individual registers can be considered a measure of the appropriateness of the categorization. The modality of the data distribution reflects several principles on which the categorization is based (overlaps, fuzziness of borders, etc.). Methodological issues, including hybrid registers (proposed by Biber & Egbert, 2016) and other options for non-discrete categorization of internet texts, will be considered with respect to Egbert et al. (2015), Asheghi et al. (2016), and Santini (2007).
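As an illustration of how per-text scores on one dimension can be compared across registers, the following is a minimal sketch and not the procedure used in the paper. It assumes the summed z-score approach commonly associated with MDA: each feature is standardized across the sample, and the standardized values of features with salient loadings are added (positive loadings) or subtracted (negative loadings). The feature names, loadings, register labels, counts, and the 0.3 cutoff are all hypothetical toy values.

```python
from statistics import mean, pstdev

# Hypothetical toy data: normalized feature rates for four annotated texts.
texts = [
    {"register": "narrative", "past_verbs": 62.0, "pronouns_1p": 30.0, "nouns": 210.0},
    {"register": "narrative", "past_verbs": 55.0, "pronouns_1p": 24.0, "nouns": 230.0},
    {"register": "informational", "past_verbs": 20.0, "pronouns_1p": 5.0, "nouns": 310.0},
    {"register": "informational", "past_verbs": 28.0, "pronouns_1p": 9.0, "nouns": 290.0},
]

# Hypothetical loadings of each feature on a single dimension.
loadings = {"past_verbs": 0.71, "pronouns_1p": 0.55, "nouns": -0.62}


def zscores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s if s else 0.0 for v in values]


def dimension_scores(texts, loadings, cutoff=0.3):
    """Sum signed z-scores of all features whose |loading| reaches the cutoff."""
    scores = [0.0] * len(texts)
    for feature, loading in loadings.items():
        if abs(loading) < cutoff:
            continue  # feature is not salient on this dimension
        sign = 1.0 if loading > 0 else -1.0
        for i, z in enumerate(zscores([t[feature] for t in texts])):
            scores[i] += sign * z
    return scores


def summarize_by_register(texts, scores):
    """Return {register: (mean score, standard deviation)} for one dimension."""
    grouped = {}
    for text, score in zip(texts, scores):
        grouped.setdefault(text["register"], []).append(score)
    return {reg: (mean(vals), pstdev(vals)) for reg, vals in grouped.items()}


scores = dimension_scores(texts, loadings)
summary = summarize_by_register(texts, scores)
for register, (avg, spread) in summary.items():
    print(f"{register}: mean={avg:.2f}, sd={spread:.2f}")
```

Under these assumptions, a register whose texts cluster tightly around one mean score would look well delimited on that dimension, while a wide or multi-peaked spread would point to overlap or hybridity.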

References:

ASHEGHI, N. R. – Sharoff, S. – Markert, K. (2016): Crowdsourcing for web genre annotation. Language Resources and Evaluation, 50, 1–39.

BENKO, V. (2014): Aranea: Yet Another Family of (Comparable) Web Corpora. In: Petr Sojka – Aleš Horák – Ivan Kopeček – Karel Pala (eds.), TSD 2014. New York: Springer International Publishing, 257–264.

BIBER, D. (1986): On the investigation of spoken/written differences. Studia Linguistica, 40, 1–38.

BIBER, D. (1987): A textual comparison of British and American writing. American Speech, 62, 99–119.

BIBER, D. (1990): Methodological Issues Regarding Corpus-Based Analyses of Linguistic Variation. Literary and Linguistic Computing, 5(4), 257–269.

BIBER, D. – Egbert, J. (2016): Register Variation on the Searchable Web: A Multi-Dimensional Analysis. Journal of English Linguistics, 44(2), 95–137.

CVRČEK, V. – Komrsková, Z. – Lukeš, D. – Poukarová, P. – Řehořková, A. – Zasina, A. J. (2018a): From Extra- to Intratextual Characteristics: Charting the Space of Variation in Czech through MDA. Corpus Linguistics and Linguistic Theory.

CVRČEK, V. – Komrsková, Z. – Lukeš, D. – Poukarová, P. – Řehořková, A. – Zasina, A. J. (2018b): Variabilita češtiny: multidimenzionální analýza. Slovo a slovesnost, 79(3), 293–321.

EGBERT, J. – Biber, D. – Davies, M. (2015): Developing a bottom-up, user-based method of web register classification. Journal of the Association for Information Science and Technology, 66(9), 1817–1831.

SANTINI, M. (2007): Characterizing Genres of Web Pages: Genre Hybridism and Individualization. In: Proceedings of the 40th Annual Hawaii International Conference on System Sciences (HICSS). Waikoloa: IEEE.

Keywords

multidimensional analysis, register variation, internet