In the last five years, there has been a significant focus in Natural Language Processing (NLP) on developing larger Pretrained Language Models (PLMs) and introducing benchmarks such as SuperGLUE and SQuAD to measure their abilities in language understanding, reasoning, and reading comprehension. These PLMs have achieved impressive results on these benchmarks, even surpassing human performance in some cases.
This has led to claims of superhuman capabilities and the provocative idea that certain tasks have been solved. In this position paper, we take a critical look at these claims and ask whether PLMs truly have superhuman abilities and what the current benchmarks are really evaluating.
We show that these benchmarks have serious limitations affecting the comparison between humans and PLMs and provide recommendations for fairer and more transparent benchmarks.