Exciting news for the Multilingual Synthetic Data Community!
I've taken inspiration from the MAGPIE paper on Llama-3-8B-instruct and extended its capabilities. Here's what's new!
The MAGPIE paper showcased that if you use the instruction-tuned version (Llama-3-8B-instruct) to generate synthetic instructions and then fine-tune the base version (Llama-3-8B) on that dataset, you can improve even the instruction-tuned version.
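For anyone curious how this looks in practice, here is a minimal sketch of MAGPIE-style generation against a local ollama server (my own illustration, not the code from the paper or the gist; the endpoint, template string, and helper names are assumptions): the instruct model is given only the pre-query part of its chat template, so whatever it completes is itself a fresh user instruction, and a second call produces the matching response.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local ollama endpoint

# Pre-query chat template for Llama-3-8B-instruct: the prompt ends exactly where
# a user message would normally begin, so the model invents an instruction.
PRE_QUERY = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"


def ollama_raw(prompt: str, model: str = "llama3") -> str:
    """Call ollama in raw mode so it does not re-apply its own chat template."""
    payload = {"model": model, "prompt": prompt, "raw": True, "stream": False}
    return requests.post(OLLAMA_URL, json=payload, timeout=300).json()["response"]


def generate_pair(model: str = "llama3") -> dict:
    # 1) The instruct model completes the empty user turn -> a synthetic instruction.
    #    Cut at the end-of-turn token in case the model keeps generating.
    instruction = ollama_raw(PRE_QUERY, model).split("<|eot_id|>")[0].strip()
    # 2) Feed that instruction back as a normal user turn to get the answer.
    full_prompt = (
        PRE_QUERY + instruction
        + "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )
    response = ollama_raw(full_prompt, model).split("<|eot_id|>")[0].strip()
    return {"instruction": instruction, "output": response}
```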
While reading a script by Sebastian Raschka, PhD, I wondered: could these advancements be replicated in other languages? Specifically, could they benefit non-English datasets?
And the answer is YES, at least for Spanish! I've successfully adapted the technique to Spanish, demonstrating the model's flexibility and multilingual capabilities.
To make this accessible, I created a basic script (heavily inspired by Sebastian Raschka's) that lets you generate similar datasets with ollama models (initially phi and llama3) automatically and upload them to the Hugging Face Hub.
[Script](https://gist.github.com/mrm8488/4650a5e3cc45523798a527a3446eb312)
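If you just want to see the overall shape of such a pipeline, the sketch below strings the generation helper from above into a dataset and pushes it to the Hub with the `datasets` library. It is only an outline under my own assumptions (the repo id is a placeholder, the filtering is deliberately crude, and how the gist actually steers generations toward Spanish is best checked in the script itself).

```python
from datasets import Dataset  # pip install datasets

# Generate a batch of synthetic instruction/response pairs (helper sketched above).
records = [generate_pair(model="llama3") for _ in range(5000)]

# Very basic filtering: drop empty or suspiciously short generations.
records = [r for r in records if len(r["instruction"]) > 20 and len(r["output"]) > 20]

# Push to the Hub (requires a prior `huggingface-cli login`); the repo id is a placeholder.
Dataset.from_list(records).push_to_hub("your-username/synthetic_instructions_es")
```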
Explore the datasets generated using our new script!
- [Llama-3-8B](https://huggingface.co/datasets/mrm8488/dataset_llama3_5000_samples_es_4231_filtered)
- [Phi-3-medium](https://huggingface.co/datasets/mrm8488/dataset_phi3-medium_5000_samples_es_3906_filtered)
- [Phi-3-mini](https://huggingface.co/datasets/mrm8488/dataset_phi3_5000_samples_es_3282_filtered)
Note: These datasets have basic filtering. Apply additional quality filters before using them to fine-tune large language models.
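As a starting point for that extra filtering, a minimal sketch of loading one of the datasets and applying a simple length-based filter might look like this (the column names are assumptions on my part; check the dataset card for the actual schema):

```python
from datasets import load_dataset

ds = load_dataset("mrm8488/dataset_llama3_5000_samples_es_4231_filtered", split="train")

# Column names ("instruction"/"output") are assumed; adjust to the real schema.
ds = ds.filter(lambda ex: len(ex["output"]) > 50 and len(ex["instruction"]) > 10)
print(ds)
```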
Inspiration and base script:
https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb
https://www.linkedin.com/feed/update/urn:li:activity:7210982019751661568/