Complexity: Software tools for linguistic based analysis of genetic sequences

Publication

Abstract

The program “Complexity” analyzes genetic sequences using methods of mathematical linguistics, which offer an alternative bioinformatics tool for the qualitative analysis of genetic sequences. It accepts nucleotide and protein sequences in standard bioinformatics formats, decomposes each sequence into potential “words” of length n (so-called Shannon n-grams), and analyzes these words statistically.
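As an illustration of the n-gram decomposition described above, the following minimal Python sketch counts every overlapping word of length n in a sequence. The function name `ngram_counts` and the choice of overlapping windows are assumptions made for this example; they are not taken from the program itself.

```python
from collections import Counter


def ngram_counts(sequence, n):
    """Count every overlapping word of length n in the sequence.

    Hypothetical helper for illustration; the program's own
    decomposition may differ (e.g. in how it handles case or gaps).
    """
    seq = sequence.upper()
    # Slide a window of width n over the sequence, one position at a time.
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


counts = ngram_counts("ATGATGCA", 3)
for word, count in counts.most_common():
    print(word, count)
```

A sequence of length L yields L - n + 1 overlapping words, so the eight-letter example produces six 3-grams in total.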

The program computes linguistic measures such as linguistic complexity, the linguistic complexity proposed by E.N. Trifonov, Shannon's entropy, Markov's model of entropy, and the Wootton-Federhen index.
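Two of these measures can be sketched in a few lines of Python. The code below assumes common textbook formulations: Shannon entropy computed over n-gram frequencies, and a Trifonov-style linguistic complexity taken as the product of vocabulary usage (observed distinct words divided by the maximum possible number) over word lengths 1 to `max_n`. The function names, the parameter `max_n`, and the default alphabet size of 4 are assumptions for this example; the exact definitions used by the program may differ.

```python
import math
from collections import Counter


def shannon_entropy(sequence, n=1):
    """Shannon entropy (in bits) of the n-gram frequency distribution."""
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def vocabulary_usage(sequence, n, alphabet_size=4):
    """Fraction of the possible distinct n-grams actually observed."""
    windows = len(sequence) - n + 1
    if windows <= 0:
        raise ValueError("sequence is shorter than the word length n")
    observed = len({sequence[i:i + n] for i in range(windows)})
    # The number of distinct words is capped both by the alphabet
    # and by the number of windows that fit in the sequence.
    possible = min(alphabet_size ** n, windows)
    return observed / possible


def linguistic_complexity(sequence, max_n, alphabet_size=4):
    """Product of vocabulary usage over word lengths 1..max_n (assumed form)."""
    result = 1.0
    for n in range(1, max_n + 1):
        result *= vocabulary_usage(sequence, n, alphabet_size)
    return result


print(shannon_entropy("ACGT"))           # uniform composition
print(linguistic_complexity("ACGT", 2))  # every possible word is used
print(linguistic_complexity("AAAA", 2))  # highly repetitive sequence
```

Under these assumed definitions, a repetitive sequence scores close to 0 and a sequence that uses its full vocabulary scores 1.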

Further functions detect potentially amphipathic peptides in proteins, draw random selections of a given size from the investigated text, filter repetitiveness in sequences, and compare results with a random model using Monte Carlo simulations. The software is available in an English version with a manual.
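The comparison with a random model can be illustrated by a simple Monte Carlo sketch. It assumes that the random model is a shuffle of the observed sequence (which preserves its composition) and that the statistic of interest is the 2-gram Shannon entropy; both choices, along with the function names and the number of trials, are assumptions for this example and are not taken from the program.

```python
import math
import random
from collections import Counter


def ngram_entropy(sequence, n):
    """Shannon entropy (in bits) of the n-gram frequency distribution."""
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def monte_carlo_low_entropy_p(sequence, n=2, trials=200, seed=0):
    """Empirical probability that a shuffled sequence has entropy
    no higher than the observed one (assumed shuffle-based null model)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = ngram_entropy(sequence, n)
    letters = list(sequence)
    at_or_below = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if ngram_entropy("".join(letters), n) <= observed:
            at_or_below += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_or_below + 1) / (trials + 1)


p_value = monte_carlo_low_entropy_p("AT" * 50)
print(p_value)
```

A small value suggests that the sequence is more repetitive than expected from its composition alone. Shuffling leaves single-letter frequencies unchanged, which is why the sketch uses 2-grams rather than 1-grams.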

Keywords

Complexity; bioinformatics; mathematic linguistic methods; qualitative analysis of genetic sequences; detection of potentially amphipathic peptides