MMLU doesn't match on lm-evaluation-harness

#2
by yixinsong - opened

I evaluated the 1.7B models with the lm-evaluation-harness framework.

[image: lm-evaluation-harness MMLU results]

I am curious what causes the performance difference between lighteval and lm-evaluation-harness.

yixinsong changed discussion status to closed

Same question.

Hugging Face TB Research org

Hi, we use a different implementation of MMLU: the cloze version vs. MC (multiple choice), where we consider the log probabilities of the entire answer sequences instead of just the single answer letters. You can find more details in this blog post and in Appendix G.2 of this paper.
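For illustration, here is a rough sketch of the difference (this is not the lighteval implementation; the model name and the example question are placeholders). Cloze-style scoring compares the log-probability of each full answer string as a continuation of the question, while MC-style scoring only compares the single letter labels:

```python
# Minimal sketch, assuming a causal LM from transformers; token-boundary edge
# cases are ignored for brevity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM-1.7B"  # placeholder model name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probs at each position for predicting the *next* token.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_logprobs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the tokens belonging to the continuation (approximate split).
    n_cont = full_ids.shape[1] - ctx_ids.shape[1]
    return token_logprobs[0, -n_cont:].sum().item()

# Hypothetical example question.
question = "Question: What is the capital of France?\nAnswer:"
choices = [" Paris", " London", " Berlin", " Madrid"]

# Cloze version: pick the choice whose full answer text is most likely.
cloze_pred = max(choices, key=lambda c: continuation_logprob(question, c))

# MC version (for contrast): list the options and score only the letter labels.
mc_prompt = (
    "Question: What is the capital of France?\n"
    "A. Paris\nB. London\nC. Berlin\nD. Madrid\nAnswer:"
)
letters = [" A", " B", " C", " D"]
mc_pred = max(letters, key=lambda l: continuation_logprob(mc_prompt, l))

print("cloze:", cloze_pred, "| mc:", mc_pred)
```

The two formulations can rank models quite differently, especially for small models that have not yet learned to map answers to letter labels.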

To reproduce our results, you can follow the guidelines here: https://huggingface.co/HuggingFaceFW/ablation-model-fineweb-edu#evaluation
