MMLU doesn't match on lm-evaluation-harness

#2
by yixinsong - opened

I evaluated the 1.7B models with the lm-evaluation-harness framework.

[image: lm-evaluation-harness MMLU results]

I am curious what causes the performance difference between lighteval and lm-evaluation-harness.

yixinsong changed discussion status to closed

Same question.

Hugging Face TB Research org

Hi, we use a different implementation of MMLU: the cloze version vs. MC (multiple choice), where we consider the log probabilities of the entire answer sequences instead of just the single answer letters. You can find more details in this blog post and in Appendix G.2 of this paper.
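For illustration, here is a rough sketch of the difference (this is not the lighteval implementation; the model name and the example question are placeholders). Cloze-style scoring compares the log-probability of each full answer string as a continuation of the question, while MC-style scoring only compares the single letter labels:

```python
# Minimal sketch, assuming a causal LM from transformers; token-boundary edge
# cases are ignored for brevity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM-1.7B"  # placeholder model name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probs at each position for predicting the *next* token.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_logprobs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the tokens belonging to the continuation (approximate split).
    n_cont = full_ids.shape[1] - ctx_ids.shape[1]
    return token_logprobs[0, -n_cont:].sum().item()

# Hypothetical example question.
question = "Question: What is the capital of France?\nAnswer:"
choices = [" Paris", " London", " Berlin", " Madrid"]

# Cloze version: pick the choice whose full answer text is most likely.
cloze_pred = max(choices, key=lambda c: continuation_logprob(question, c))

# MC version (for contrast): list the options and score only the letter labels.
mc_prompt = (
    "Question: What is the capital of France?\n"
    "A. Paris\nB. London\nC. Berlin\nD. Madrid\nAnswer:"
)
letters = [" A", " B", " C", " D"]
mc_pred = max(letters, key=lambda l: continuation_logprob(mc_prompt, l))

print("cloze:", cloze_pred, "| mc:", mc_pred)
```

The two formulations can rank models quite differently, especially for small models that have not yet learned to map answers to letter labels.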

To reproduce our results, you can follow the guidelines here: https://huggingface.co/HuggingFaceFW/ablation-model-fineweb-edu#evaluation
