Text is one of the most prevalent types of digital data that people create as they go about their lives. Digital footprints of people's language use in social media posts have been shown to allow inferences about their age and gender.
However, the even more prevalent and potentially more sensitive text from instant messaging services has remained largely uninvestigated. We analyze language variations in instant messages with regard to individual differences in age and gender by replicating and extending the methods used in prior research on social media posts.
Using a dataset of 309,229 WhatsApp messages from 226 volunteers, we identify unique age- and gender-linked language variations. We use cross-validated machine learning algorithms to predict volunteers' age (MAE_Md = 3.95, r_Md = 0.81, R²_Md = 0.49) and gender (Accuracy_Md = 85.7%, F1_Md = 0.67, AUC_Md = 0.82) significantly above baseline levels and identify the most predictive language features.
We discuss implications for psycholinguistic theory, present opportunities for application in author profiling, and suggest methodological approaches for making predictions from small text data sets. Given the recent shift towards private messaging as the dominant form of communication and the weakening of user data protection, we highlight rising threats to individual privacy rights in instant messaging.
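
As an illustration of how the reported age-prediction metrics (MAE, r, R²) could be summarized across cross-validation folds, the following is a minimal sketch. It assumes that the "Md" subscript denotes a median over held-out folds and that R² is the coefficient of determination, which on held-out data need not equal the square of r. The function names and the toy fold data are hypothetical and are not taken from the study.

```python
from statistics import mean, median


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted values."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def pearson_r(y_true, y_pred):
    """Pearson correlation between true and predicted values."""
    mt, mp = mean(y_true), mean(y_pred)
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov / (var_t * var_p) ** 0.5


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mt = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mt) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def median_scores(folds):
    """Summarize per-fold metrics by their median.

    folds: list of (y_true, y_pred) pairs, one pair per held-out fold.
    """
    return {
        "MAE_Md": median([mae(t, p) for t, p in folds]),
        "r_Md": median([pearson_r(t, p) for t, p in folds]),
        "R2_Md": median([r_squared(t, p) for t, p in folds]),
    }


# Hypothetical toy data: true and predicted ages for three held-out folds.
folds = [
    ([20, 30, 40], [22, 29, 37]),
    ([25, 35, 45], [25, 38, 44]),
    ([18, 28, 50], [21, 27, 46]),
]
scores = median_scores(folds)
print(scores)
```

Reporting the median rather than the mean makes the summary less sensitive to a single unusually easy or hard fold, which matters when the sample of authors is small.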