Docria: Processing and Storing Linguistic Data with Wikipedia

Publication

Abstract

The availability of user-generated content has increased significantly over time. Wikipedia is one example of a corpus that spans a huge range of topics and is freely available.

Storing and processing such corpora requires flexible document models, as they may contain malicious and incorrect data. Docria is a library that attempts to address this issue by providing a solution that works for small to large corpora, from laptops running Python interactively in a Jupyter notebook to clusters running map-reduce frameworks with optimized compiled code.

Docria is available as open-source code.