SELM-Zephyr
See our paper at https://huggingface.co/papers/2405.19332.
Self-Exploring Language Models: Active Preference Elicitation for Online Alignment.
This model is a fine-tuned version of HuggingFaceH4/mistral-7b-sft-beta, trained on synthetic data derived from the HuggingFaceH4/ultrafeedback_binarized dataset.
| Model | AlpacaEval 2.0 (LC WR) | MT-Bench (Average) |
|---|---|---|
| SELM-Zephyr-7B-iter-3 | 24.00 | 7.48 |
| SELM-Zephyr-7B-iter-2 | 23.40 | 7.72 |
| SELM-Zephyr-7B-iter-1 | 20.28 | 7.42 |
| DPO-Zephyr-7B | 14.45 | 7.28 |
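Since the model is fine-tuned from the Zephyr SFT base, prompts should follow the Zephyr chat format. The helper below is a minimal sketch of that format, written here for illustration (the function name is hypothetical; in practice you would let the tokenizer's `apply_chat_template` handle this):

```python
def format_zephyr_prompt(messages):
    """Render a list of chat messages in the Zephyr-style format
    (<|role|> header, </s> end-of-turn), ending with an open
    assistant turn for the model to complete.

    This is an illustrative sketch; prefer the tokenizer's built-in
    apply_chat_template in real use."""
    parts = []
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}</s>")
    parts.append("<|assistant|>\n")  # leave the assistant turn open
    return "\n".join(parts)


prompt = format_zephyr_prompt(
    [{"role": "user", "content": "What is online alignment?"}]
)
print(prompt)
```

The rendered string can then be passed to the model for generation; with `transformers`, passing the message list directly to a `text-generation` pipeline achieves the same result via the bundled chat template.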
The following hyperparameters were used during training: