
Media Text Classification Tool: COVID-19class


Abstract

The functional unit is a specialized media text processing tool (COVID19class for short). The tool implements various methods for sentiment and category classification of media documents.

Text classification is a prerequisite for evaluating trends in media messages and the reactions or attitudes of their readers. The tool was developed in support of the objectives of project TL04000176.

The implementation, and in particular the machine learning models used, is based on an archive of news articles and social media discussion posts collected by Newton Media, s.r.o., or on already prepared datasets available on the Internet (Yelp, CSFD). Classical sentiment classification is implemented using several variants of recognition algorithms, including neural networks [1].
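The abstract does not list the concrete algorithms, so the following is only a minimal sketch of one classical variant, assuming a TF-IDF representation and a scikit-learn linear classifier; the toy example texts stand in for the real corpora.

# Minimal sketch of one classical sentiment-classification variant
# (TF-IDF features + a linear classifier); the concrete features and
# algorithms used in COVID19class are assumptions here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = ["Skvely film, doporucuji.", "Naprosta ztrata casu."]   # toy examples
labels = ["positive", "negative"]

sentiment_clf = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
    ("model", LogisticRegression(max_iter=1000)),
])
sentiment_clf.fit(texts, labels)
print(sentiment_clf.predict(["Docela dobre, ale konec zklamal."]))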

The current models for COVID-related discussions are trained on the Czech annotated dataset created by the FSV UK team. From the user's point of view, COVID19class enables the classification of text articles and discussion posts in large text databases (tested on databases of up to 60,000 posts).
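A hedged sketch of how such batch classification over a large post database might look with pandas; the CSV layout, the "text" column name and the classify_database helper are illustrative assumptions, and sentiment_clf refers to a classifier trained as sketched above.

# Hypothetical batch classification of a large post database with pandas;
# the file format and column names are illustrative assumptions.
import pandas as pd

def classify_database(csv_path, clf, chunk_size=5000):
    """Stream a large CSV of posts and append predicted labels chunk by chunk."""
    results = []
    for chunk in pd.read_csv(csv_path, chunksize=chunk_size):
        chunk["predicted_sentiment"] = clf.predict(chunk["text"].astype(str))
        results.append(chunk)
    return pd.concat(results, ignore_index=True)

# Example (hypothetical file name): labelled = classify_database("posts.csv", sentiment_clf)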

The tool allows the selection of classifiers and provides statistical data on individual classification steps. It can also classify texts according to clusters of words related to a certain topic or an aspect of it.

Such word clusters can be created manually or automatically, e.g. using clustering methods, topic detection methods or community detection. The functional unit is implemented in the Python 3 programming language.
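As an illustration of automatic cluster creation, the sketch below derives word clusters with Gensim's LDA topic model; the tokenised toy posts, the number of topics and the choice of LDA specifically (rather than another clustering or community-detection method) are assumptions.

# Sketch of automatic word-cluster creation via topic detection with Gensim LDA;
# preprocessing and the number of topics are illustrative choices.
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["vakcina", "ockovani", "davka"],
    ["rouska", "respirator", "opatreni"],
    ["vakcina", "opatreni", "ockovani"],
]  # toy tokenised discussion posts

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# Each topic yields a word cluster that can then be used to tag texts,
# e.g. by counting how many cluster words a post contains.
for topic_id in range(lda.num_topics):
    cluster = [word for word, _ in lda.show_topic(topic_id, topn=5)]
    print(topic_id, cluster)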

The tool is controlled via the command line. It is implemented using the libraries Gensim [2], scikit-learn [3], Matplotlib [4], pandas [5] and NumPy [6].
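The concrete command-line options of COVID19class are not given in the abstract; the following argparse sketch only illustrates what such an interface could look like, and all flag names and choices are assumptions.

# Illustrative command-line interface; the actual options of COVID19class
# are not documented in the abstract, so these flags are hypothetical.
import argparse

def parse_args():
    parser = argparse.ArgumentParser(prog="covid19class")
    parser.add_argument("--input", required=True,
                        help="file with texts or discussion posts to classify")
    parser.add_argument("--classifier", default="mlp",
                        choices=["mlp", "logreg", "svm"],
                        help="which classifier variant to use")
    parser.add_argument("--output", default="predictions.csv",
                        help="where to write predicted labels and statistics")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(args)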

In addition to the command line, the tool also allows processing parameters to be controlled via configuration files. For the English Yelp reviews, an overall sentiment classification accuracy of 84% (F1 score) was achieved on the test data.

For the Czech CSFD reviews, the tool achieved an overall accuracy of 77% (F1 score) on the test data. However, for discussion posts on news with COVID topics annotated by FSV UK students, we did not exceed an accuracy of 68% (F1 score) on the test data.

This accuracy reflects the small size of the available corpus (833 discussion posts) and is consistent with the results obtained on the previous two corpora when they were artificially limited to a similar size. The best results were obtained with the MLPClassifier [7].
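A minimal sketch of training an MLPClassifier [7] and reporting a macro F1 score with scikit-learn; the toy data, TF-IDF features and hyperparameters are assumptions and do not reproduce the reported numbers.

# Sketch of MLPClassifier training and F1 evaluation; all data and
# hyperparameters below are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

texts = [
    "ockovani funguje", "skvela zprava", "naprosty nesmysl", "hrozne opatreni",
    "vyborne reseni", "spatny napad", "doporucuji", "zklamani",
]  # toy annotated discussion posts
labels = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

mlp_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("mlp", MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=0)),
])
mlp_clf.fit(X_train, y_train)
print("macro F1:", f1_score(y_test, mlp_clf.predict(X_test), average="macro"))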

The tool was also used to study the effect of the size of the embedding vectors and of the number of training epochs on classification accuracy. The tool can also be used to create word clouds for a lexicon of sentiment words.
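The following sketch illustrates the kind of experiment described above, sweeping the embedding vector size and the number of training epochs of Word2Vec embeddings in Gensim (version 4 API); the grid values and toy corpus are assumptions, and the downstream accuracy measurement is only indicated by a comment.

# Sweep over embedding size and training epochs with Gensim Word2Vec;
# the grid values and toy corpus are illustrative assumptions.
from gensim.models import Word2Vec

tokenised_posts = [["vakcina", "funguje"], ["rouska", "nepomaha"], ["ockovani", "doporucuji"]]

for vector_size in (50, 100, 300):
    for epochs in (5, 20, 50):
        w2v = Word2Vec(sentences=tokenised_posts, vector_size=vector_size,
                       epochs=epochs, min_count=1, workers=1, seed=0)
        # ...train a classifier on averaged word vectors and record its F1 here...
        print(vector_size, epochs, w2v.wv["vakcina"][:3])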

References

[1] Luis Pedro Coelho, Willi Richert: Building Machine Learning Systems with Python, Second Edition, 2015.
[2] Gensim: https://radimrehurek.com/gensim/
[3] Scikit-learn: https://scikit-learn.org/
[4] Matplotlib: https://matplotlib.org/
[5] Pandas: https://pandas.pydata.org/
[6] NumPy: https://numpy.org/
[7] Multilayer Perceptron: https://scikit-learn.org/stable/modules/neural_networks_supervised.html