A modified algorithm of the latent semantic analysis for text processing in the Russian language

Publication

Abstract

The paper presents a methodology for analyzing texts in the Russian language. The methodology is based on the Latent Semantic Analysis (LSA) algorithm.

Several disadvantages of the classical method are considered, and modified methods of extracting N-grams from the text are proposed. Compared with the standard method, the modified method reduces the number of extracted N-grams and increases the meaningfulness of the retrieved collection.
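
The abstract does not spell out the modification itself, so the following is only a minimal sketch of the general idea of pruning N-grams. It assumes one possible rule (discard any N-gram that begins or ends with a stop word); the stop-word list, the function names, and the sample sentence are hypothetical and are not taken from the paper.

```python
from typing import List, Set, Tuple

# Hypothetical stop-word list, used only for illustration.
STOP_WORDS: Set[str] = {"и", "в", "на", "с", "по"}


def extract_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Standard extraction: every run of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def filter_ngrams(ngrams: List[Tuple[str, ...]],
                  stop_words: Set[str]) -> List[Tuple[str, ...]]:
    """Assumed pruning rule: drop N-grams that start or end with a stop word."""
    return [g for g in ngrams
            if g[0] not in stop_words and g[-1] not in stop_words]


tokens = "кошка сидит на окне и смотрит в сад".split()
all_bigrams = extract_ngrams(tokens, 2)
kept_bigrams = filter_ngrams(all_bigrams, STOP_WORDS)
print(len(all_bigrams), len(kept_bigrams))
```

A rule this aggressive discards most bigrams in the sample sentence; it is meant only to show how a filter shrinks the collection before any matrix is built.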

Reducing the collection size reduces the dimension of the TF-IDF matrix and accelerates the execution of the SVD method. The advantages of the developed machine learning algorithm are demonstrated on simple sentences.
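
The link between collection size and matrix dimension can be illustrated with a small sketch. It assumes a term-by-document TF-IDF matrix with the plain weighting tf * log(N / df) and uses NumPy's `numpy.linalg.svd`; the documents, the reduced vocabulary, and the helper `tfidf_matrix` are hypothetical, and single lemmas stand in for N-grams.

```python
import math
import numpy as np


def tfidf_matrix(docs, vocabulary):
    """Build a term-by-document matrix: one row per term, one column per document."""
    n_docs = len(docs)
    matrix = np.zeros((len(vocabulary), n_docs))
    for j, doc in enumerate(docs):
        for i, term in enumerate(vocabulary):
            tf = doc.count(term) / len(doc)           # relative term frequency
            df = sum(1 for d in docs if term in d)    # document frequency
            idf = math.log(n_docs / df) if df else 0.0
            matrix[i, j] = tf * idf
    return matrix


# Hypothetical, already lemmatized documents.
docs = [
    ["кошка", "сидеть", "окно"],
    ["собака", "сидеть", "двор"],
    ["кошка", "смотреть", "сад"],
]
full_vocab = sorted({t for d in docs for t in d})
reduced_vocab = ["кошка", "собака", "сад"]  # assumed result of pruning

for vocab in (full_vocab, reduced_vocab):
    A = tfidf_matrix(docs, vocab)
    # Thin SVD: A = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    print(A.shape, U.shape, s.shape, Vt.shape)
```

The number of rows of the matrix equals the vocabulary size, so a smaller collection gives the SVD a smaller input to factorize.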

The ideas discussed make it possible to parallelize text processing effectively at the lemmatization step.
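
As a rough illustration of parallel lemmatization, the sketch below assumes that each document can be lemmatized independently of the others and distributes the work with Python's `concurrent.futures`. The toy lemma dictionary and the functions `lemmatize` and `lemmatize_corpus` are hypothetical stand-ins; a real morphological analyzer for Russian would replace the dictionary lookup, and the paper's own parallelization scheme may differ.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Dict, List

# Toy lemma dictionary (hypothetical); a real analyzer would replace it.
LEMMAS: Dict[str, str] = {
    "кошки": "кошка",
    "сидят": "сидеть",
    "окна": "окно",
    "собаки": "собака",
}


def lemmatize(doc: str) -> List[str]:
    """Lowercase, split on whitespace, and map each word to its lemma."""
    return [LEMMAS.get(word, word) for word in doc.lower().split()]


def lemmatize_corpus(docs: List[str], executor: Executor) -> List[List[str]]:
    """Lemmatize documents independently; map() keeps the input order."""
    return list(executor.map(lemmatize, docs))


docs = ["Кошки сидят", "Собаки сидят", "окна"]
with ThreadPoolExecutor(max_workers=2) as pool:
    lemmatized = lemmatize_corpus(docs, pool)
print(lemmatized)
```

Because `lemmatize_corpus` accepts any `Executor`, a `ProcessPoolExecutor` can be passed instead when the lemmatizer is CPU-bound; threads are used here only to keep the example simple.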

Keywords

Stylometrics, Plagiarism, Authorship Attribution