Syntactically annotated corpus: lifelong work?

Publication

Abstract

Syntactic analysis of Czech from the centering perspective (e.g. Grosz, Weinstein, Joshi, 1995; Walker, Joshi, Prince, 1998) is based on partly automated and partly manual annotation of the so-called centers of attention (Sidner, 1981; Brennan, Friedman, Pollard, 1987).

Existing corpora (CNC, PDT) can be used only as a source of individual texts, since the aim of the project "Centering and Czech - syntactic analysis" is to describe the general principles by which Czech text is constructed. Centering theory has several basic features that determine the parameters of such a corpus.

The first is that centering focuses on modeling local relationships in text. The resulting requirement greatly affects the size of the corpus: critical utterances make up only half of its positions (clearly, the relation between two immediately adjacent utterances need not be symmetric). The remaining positions are utterances needed for annotation, namely the immediately preceding utterances. These cannot themselves be treated as critical tokens, because no immediately preceding utterances are available for them.
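The size argument can be illustrated with a minimal sketch. It assumes, purely for illustration, that the corpus is stored as isolated pairs of adjacent utterances (a preceding utterance and a critical one); the abstract does not specify a storage format. The names `Position` and `build_positions` are hypothetical, and the Czech sentences are invented examples, not corpus data.

```python
from dataclasses import dataclass


@dataclass
class Position:
    text: str
    critical: bool  # True only for utterances annotated as critical tokens


def build_positions(pairs):
    """Flatten (preceding, critical) utterance pairs into corpus positions.

    Assumption: each critical utterance is stored together with its own
    immediately preceding utterance, which serves only as context.
    """
    positions = []
    for preceding, critical in pairs:
        positions.append(Position(preceding, critical=False))
        positions.append(Position(critical, critical=True))
    return positions


# Invented example pairs (hypothetical, for illustration only).
pairs = [
    ("Petr koupil knihu.", "Dal ji Marii."),
    ("Marie přišla domů.", "Byla unavená."),
]

positions = build_positions(pairs)
critical_share = sum(p.critical for p in positions) / len(positions)
```

Under this assumption, every critical utterance brings one context-only utterance with it, so the share of critical positions is one half regardless of how many pairs are collected.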

Another characteristic that affects the corpus is the methodological approach of centering theory: in determining the relationship between utterances, it focuses on parts of the utterances, namely their noun phrases (I use the term noun phrase for all nominal expressions). In addition to these two basic criteria, which follow from the applied theory, there are further complications and questions: how to handle the corpus technically, in terms of linking the annotated texts to the PDT and CNC corpora; how many tokens the corpus should contain; and which stylistic distinctions can be omitted to facilitate processing and which should be preserved.

Keywords

Centering, Corpus, Czech National Corpus (CNC), Prague Dependency Treebank (PDT), Syntax