The number of internet sources grows steadily from day to day, making it possible to access enormous collections of data and thus to build corpora faster and more cheaply. The popularity of web-crawled corpora has been growing over the last decade, one obvious reason being that similar web-crawling methods are applicable to different languages, which makes it possible to compile a range of comparable corpora such as the WaC family (Baroni et al. 2009), the TenTen family (Jakubíček et al. 2013) or the Aranea family (Benko 2014).
With the development of new tools, it is also possible to create smaller, special-purpose web corpora almost instantly and without advanced computer skills (Anthony 2018; Baroni et al. 2006). However, web crawling has palpable drawbacks as well.
First, the metadata of source texts tend to be scarce, with the exception of some technical information (domain, URL, time, size, etc.). Second, precisely because of this missing metadata, it is often unclear which varieties of language a web corpus represents, and in what proportions.
In comparison with traditional corpora, which contain detailed information about all their texts and whose structure is carefully designed, web corpora have an uncontrolled composition. The aim of the present paper is to compare the ranges of variability covered by two different types of corpora (traditional vs. web-crawled) by means of a multidimensional analysis (MDA) of register variability in the Czech language (Cvrček et al. 2018b, Cvrček et al. 2018c).
We show to what extent a corpus of one type can overlap with the other and identify complementary domains that are covered by only one of them. To provide this kind of comparison, both corpora are projected onto the multidimensional (MD) model, revealing overlaps and discrepancies in the coverage of each dimension of variation.
In contrast to lexically driven methods of corpus comparison (Kilgarriff 2001, 2012; Piperski 2018), we thus present a comparison based on features from all linguistic levels. At the same time, this MDA-based methodology places much less emphasis on domain-specific lexical differences: it is designed to assess the suitability and comprehensiveness of a corpus with respect to language description, rather than with respect to lexicography.