The InterCorp corpus, release 10

Publication

Abstract

- Total number of word forms in foreign language texts: 1,483 million, including 258 million in the core and 1,225 million in collections
- Total number of tokens in Czech texts: 192 million, including 102 million in the core and 89 million in collections
- A new collection: translations of the Bible (Old and New Testament) in 18 languages
- The Project Syndicate collection is updated with new texts published in the previous two years
- More reliable linguistic annotation for many languages (taggers now process text without formatting and other markup)
- Texts in languages other than the one specified have been removed from the Acquis collection
- Catalan is now annotated with tags and lemmas
- Bulgarian and Dutch are now also annotated with lemmas
- Hungarian is now tagged by RFTagger (formerly by HunPOS)
- Because of technical issues with the tagger, Lithuanian is not annotated with tags and lemmas; it has not been annotated since release 7