We need to find the most suitable prosody formalism for the task of machine learning. The target application is a prosody generative module for text-to-speech synthesis.
This module will learn prosody marks (parameters or symbols) from large corpora. Formalism we are looking for should be general, perceptually relevant, restorable, automatically obtained, objective and learnable.
Main formalisms for the pitch description are briefly described and compared, namely Fujisaki model, ToBI, Intsint, Tilt and “Glissando threshold” adaptation. The most suitable method of pitch description for the task of machine learning is “Glissando threshold” adaptation with an additional simplification.