Statistical learning of materials properties or functions so far starts with a largely silent, nonchallenged step: the choice of the set of descriptive parameters (termed descriptor). However, when the scientific connection between the descriptor and the actuating mechanisms is unclear, the causality of the learned descriptor-property relation is uncertain.
Thus, a trustful prediction of new promising materials, identification of anomalies, and scientific advancement are doubtful. We analyze this issue and define requirements for a suitable descriptor.
For a classic example, the energy difference of zinc blende or wurtzite and rocksalt semiconductors, we demonstrate how a meaningful descriptor can be found systematically.