W2C - Web To Corpus

Publication

Abstract

W2C is a collection of software and data. The software part greatly simplifies the creation of new text corpora for a given language, using text materials freely available on the Internet.

Special attention was given to the filtering components, which keep the quality of the material very high. The data part contains corpora for more than 100 languages, with around 10 million words in each.

This language data resource is especially useful to researchers who specialize in developing multilingual technologies.

Keywords