Group recommendations are a sub-domain of recommender systems (RS), where the final recommendations should comply with the preferences of all members of the group. Usually, group recommendations are built on top of common "single-user" RS by aggregating either the models or the predictions for multiple users, with some notions of fairness and relevance in mind.
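To make the idea of aggregating per-user predictions concrete, the following is a minimal sketch only. It assumes two textbook aggregation strategies (averaging and "least misery", i.e. the minimum score over group members); the function names, the score dictionaries, and the tie-breaking rule are hypothetical and are not taken from the evaluated models.

```python
from statistics import mean


def aggregate_scores(user_scores, strategy="average"):
    """Combine per-user relevance scores into one score per item.

    user_scores: dict mapping user -> dict mapping item -> estimated relevance
    (assumed to come from some underlying single-user RS).
    """
    if strategy == "average":
        agg = mean   # relevance-oriented: average satisfaction of the group
    elif strategy == "least_misery":
        agg = min    # fairness-oriented: score of the least satisfied member
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Only items scored for every group member can be aggregated.
    items = set.intersection(*(set(s) for s in user_scores.values()))
    return {i: agg(s[i] for s in user_scores.values()) for i in items}


def recommend_for_group(user_scores, k=2, strategy="average"):
    """Return the top-k items by aggregated score (ties broken by item id)."""
    group_scores = aggregate_scores(user_scores, strategy)
    return sorted(group_scores, key=lambda i: (-group_scores[i], i))[:k]


scores = {
    "u1": {"a": 5, "b": 3, "c": 4},
    "u2": {"a": 1, "b": 3, "c": 4},
}
top_average = recommend_for_group(scores, k=2, strategy="average")
top_least_misery = recommend_for_group(scores, k=2, strategy="least_misery")
```

In this toy group, item "a" is the favourite of one member but disliked by the other, so averaging still ranks it second while least misery replaces it with the agreed-upon item "b".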
So far, group recommendations have usually been evaluated off-line, either as a tightly coupled pair with the underlying RS or in a decoupled fashion, where the relevance scores estimated by the underlying RS serve as the ground truth. Both evaluation types may suffer from different biases that give an unwarranted advantage to some classes of group recommending strategies.
In the experimental part, we evaluate several recent group recommendation models and show that the evaluation process itself significantly affects their perceived usability. While coupled evaluation favors group RS that tend to select per-user best items, decoupled evaluation favors strategies that aim to find items with (some degree of) overall agreement.
We further evaluate the methods with respect to several variants of an inverse-propensity-based de-biasing scenario in order to reduce the popularity bias of coupled evaluation. In this case too, if groups of similar users are considered, the magnitude of de-biasing has a determining effect on the ordering of the individual methods.
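As an illustration of how the magnitude of inverse-propensity de-biasing can change an off-line metric, here is a minimal sketch. It assumes that an item's propensity is proportional to its popularity raised to a power gamma, and it uses a self-normalised weighted hit rate; the propensity model, the exponent gamma, the metric, and all names are assumptions made for illustration and are not specified above.

```python
from collections import Counter


def item_propensities(interactions, gamma=1.0):
    """Estimate item propensities from (user, item) interaction pairs.

    Assumption: propensity is (relative popularity) ** gamma.
    gamma = 0 disables de-biasing; larger gamma de-biases more strongly.
    """
    counts = Counter(item for _, item in interactions)
    max_count = max(counts.values())
    return {item: (c / max_count) ** gamma for item, c in counts.items()}


def ips_hit_rate(recommended, held_out, propensities):
    """Self-normalised, inverse-propensity-weighted share of held-out items hit."""
    weights = {item: 1.0 / propensities[item] for item in held_out}
    total = sum(weights.values())
    hits = sum(w for item, w in weights.items() if item in recommended)
    return hits / total if total else 0.0


# Toy data: item "a" is popular, item "b" is rare.
interactions = [("u1", "a"), ("u2", "a"), ("u3", "a"), ("u4", "a"), ("u1", "b")]
recommended = {"a"}       # a list that only contains the popular item
held_out = {"a", "b"}     # items the group actually interacted with

plain = ips_hit_rate(recommended, held_out, item_propensities(interactions, gamma=0.0))
debiased = ips_hit_rate(recommended, held_out, item_propensities(interactions, gamma=1.0))
```

With these toy numbers, a list that hits only the popular item scores 0.5 without de-biasing and 0.2 with gamma = 1, which shows how the chosen magnitude can reorder methods that differ in how strongly they favour popular items.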