Corpus PSP: introduction and possibilities of data mining for the purposes of corpus stylistics

Publication at Faculty of Arts

2011

Abstract

In the first part of our paper, we introduce the newly founded PSP corpus. This corpus comprises cahiers of the Chamber of Deputies of the Parliament of the Czech Republic collected during the 2006-2010 parliamentary term.

The corpus includes 7 million text words. We take advantage of publicly known information about the speakers.

Thus we can identify objective speaker characteristics such as gender, age, level of education and others, which allows us to distinguish between the influence of genre, author and theme. We also discuss certain trends observable in corpus linguistics, especially the expansion of corpus-based methods into other branches of linguistics, which is connected to the subsequent creation of small specialized corpora.

Small corpora are intended for very specific uses and need a specific approach to data mining. Since not all linguists have access to a team of programmers and technicians, we suggest alternative methods for using such corpora in linguistic research.

The primary goal of the PSP corpus is to collect material for corpus-based stylistic analysis. In the second part of our paper, we demonstrate several methods of such analysis.

For the analysis, we chose two speakers with similar parameters and extracted lists of word-form types and 2- to 5-grams from the PSP corpus. Our method is based on comparing the most frequent word-form types that are common to both speakers.
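
The extraction step described above can be illustrated with a minimal sketch. It assumes each speaker's text is already tokenized into a list of word forms; the function names, the `top_k` cutoff and the toy tokens are hypothetical and are not taken from the paper. Reading "common to both speakers" as "among the most frequent types of each speaker" is likewise only one possible interpretation.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(tokens, n_min=2, n_max=5):
    """Count every n-gram for n from n_min to n_max (inclusive)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        counts.update(ngrams(tokens, n))
    return counts


def shared_top_types(tokens_a, tokens_b, top_k=100):
    """Relative frequencies of word-form types that are among the
    top_k most frequent types of BOTH speakers (an assumed reading
    of the comparison described in the abstract)."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    top_a = {w for w, _ in freq_a.most_common(top_k)}
    top_b = {w for w, _ in freq_b.most_common(top_k)}
    return {
        w: (freq_a[w] / len(tokens_a), freq_b[w] / len(tokens_b))
        for w in sorted(top_a & top_b)
    }


# Toy input; real input would be the tokenized speeches of two speakers.
speaker_a = ["a", "a", "b"]
speaker_b = ["a", "c", "c", "c"]
shared = shared_top_types(speaker_a, speaker_b, top_k=2)
```

Types whose relative frequencies differ strongly between the two speakers would then be candidates for distinguishing features.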

We focused on features that distinguish one speaker from the other. It was possible to identify preferences in the usage of certain invariant structures.

Furthermore, we compare the frequencies of several of the highest-ranking 5-grams; certain structural and content similarities between individual chunks of text could be identified as well. Such similarities could be used for the automatic recognition of these chunks in the future.

We suggest using metrics from information theory for measuring the difference between n-grams.
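
The abstract does not say which information-theoretic metric is meant. As one possible illustration, the sketch below computes the Jensen-Shannon divergence between two n-gram count distributions. The choice of this metric, the function name and the toy counts are assumptions made for the example, not the authors' stated method.

```python
import math
from collections import Counter


def js_divergence(counts_p, counts_q):
    """Jensen-Shannon divergence (log base 2) between two count
    distributions; 0 for identical distributions, 1 for disjoint ones."""
    total_p = sum(counts_p.values())
    total_q = sum(counts_q.values())
    if total_p == 0 or total_q == 0:
        raise ValueError("both distributions must be non-empty")
    divergence = 0.0
    for key in set(counts_p) | set(counts_q):
        p = counts_p.get(key, 0) / total_p
        q = counts_q.get(key, 0) / total_q
        m = (p + q) / 2  # mixture distribution
        if p > 0:
            divergence += 0.5 * p * math.log2(p / m)
        if q > 0:
            divergence += 0.5 * q * math.log2(q / m)
    return divergence


# Toy 2-gram counts for two hypothetical speakers.
speaker_a = Counter({("x", "y"): 3, ("y", "z"): 1})
speaker_b = Counter({("x", "y"): 1, ("y", "z"): 3})
score = js_divergence(speaker_a, speaker_b)
```

Unlike the plain Kullback-Leibler divergence, this measure is symmetric and stays finite when an n-gram occurs for only one speaker, which is common in small corpora.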

Keywords

Corpus PSP introduction possibilities data mining for purposes corpus stylistics