One common structural feature of "words" in protein sequences and human texts

Publication at Faculty of Science | 2014

Abstract

The frequently discussed analogy between genetic and human texts is explored by comparing the alternation of polar and non-polar amino-acid residues in proteins with the alternation of consonants and vowels in human texts. In human languages, the use of possible consonant-vowel combinations is influenced by how pronounceable those combinations are.

Similarly, the oligopeptide composition of proteins is influenced by the requirements of protein folding and stability. One special type of structure often present in proteins is the amphipathic alpha-helix, in which polar and non-polar amino acids alternate with a period of 3.5 residues, not unlike the alternation of consonants and vowels.
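To make the 3.5-residue period concrete, the following minimal sketch reduces a protein sequence to a two-letter polar/non-polar alphabet and compares it with an idealized amphipathic pattern. It is an illustration only, not the procedure used in the study: the residue grouping in `NONPOLAR`, the function names, and the half-turn rule for assigning a residue to the non-polar face are all assumptions made here.

```python
# Hypothetical two-letter alphabet: 'N' = non-polar, 'P' = polar.
# Assumption: this residue grouping is illustrative; published hydrophobicity
# scales draw the polar/non-polar boundary differently.
NONPOLAR = set("AVLIMFWC")


def encode(protein: str) -> str:
    """Reduce a one-letter amino-acid sequence to a string of 'N'/'P'.

    Any symbol not listed in NONPOLAR (including unknown residues) maps to 'P'.
    """
    return "".join("N" if aa in NONPOLAR else "P" for aa in protein.upper())


def ideal_pattern(length: int) -> str:
    """Idealized amphipathic pattern with a period of 3.5 residues.

    A period of 3.5 = 7/2 means residue i sits at fraction (2*i mod 7)/7 of a
    turn; residues in the first half-turn are assigned to the non-polar face.
    """
    return "".join("N" if (2 * i) % 7 < 3.5 else "P" for i in range(length))


def agreement(protein: str) -> float:
    """Fraction of positions where the encoded sequence matches the ideal pattern."""
    observed = encode(protein)
    if not observed:
        return 0.0
    expected = ideal_pattern(len(observed))
    return sum(o == e for o, e in zip(observed, expected)) / len(observed)
```

Because the period is 7/2, the idealized pattern repeats every seven residues. A fuller comparison would also try different starting phases, which this sketch omits.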

In this study, we evaluated the contribution of amphipathic alternations to protein sequence texts (20-24%). This proportion is lower than the corresponding values for alternating words in human texts (57-89%).

The proteomes (full sets of proteins for selected organisms) were transformed into ranked sequences of n-grams (words of length n), including periodic amphipathic structures. Similarly, human texts were transformed into sequences of alternating consonants and vowels.
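The transformation described above can be sketched for human text as follows. This is a rough illustration under stated assumptions, not the authors' pipeline: the vowel set is limited to the English letters a, e, i, o, u (with y and any other alphabetic character treated as a consonant), n-grams are counted with overlap, and the function names are hypothetical.

```python
from collections import Counter

# Assumption: English vowels only; 'y' and all other letters count as consonants.
VOWELS = set("aeiou")


def cv_pattern(word: str) -> str:
    """Map each letter to 'V' (vowel) or 'C' (consonant); non-letters are skipped."""
    return "".join(
        "V" if ch in VOWELS else "C" for ch in word.lower() if ch.isalpha()
    )


def ranked_ngrams(sequence: str, n: int) -> list[tuple[str, int]]:
    """Count overlapping n-grams and return them sorted by descending frequency."""
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    return counts.most_common()


def is_alternating(pattern: str) -> bool:
    """True if no two neighbouring symbols in the pattern are equal."""
    return all(a != b for a, b in zip(pattern, pattern[1:]))


if __name__ == "__main__":
    pattern = cv_pattern("banana")
    print(pattern, is_alternating(pattern))
    print(ranked_ngrams(pattern, 2))
```

The same `ranked_ngrams` and `is_alternating` helpers would apply to any two-letter encoding, so a protein sequence reduced to polar/non-polar symbols could be ranked in the same way.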

Analysis of the vocabularies shows that in both types of texts (human languages and proteins) the alternating words are dominant or highly preferred, which strengthens the analogy between the two. The contribution of amphipathic words to the upper parts of the ranked lists for the 10 analyzed proteomes varies between 58 and 74%.

In human texts, the corresponding values range between 90 and 100%.