Exciting news for the Multilingual Synthetic Data Community!
I've taken inspiration from the MAGPIE paper on Llama-3-8B-instruct and extended its capabilities. Here's what's new!
The MAGPIE paper showcased that if you use the instruction-tuned version (Llama-3-8B-instruct) to generate synthetic instructions and then fine-tune the base version (Llama-3-8B) on that dataset, you can improve even the instruction-tuned version.
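For anyone curious how this looks in practice, here is a minimal sketch of MAGPIE-style generation against a local ollama server (my own illustration, not the code from the paper or the gist; the endpoint, template string, and helper names are assumptions): the instruct model is given only the pre-query part of its chat template, so whatever it completes is itself a fresh user instruction, and a second call produces the matching response.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local ollama endpoint

# Pre-query chat template for Llama-3-8B-instruct: the prompt ends exactly where
# a user message would normally begin, so the model invents an instruction.
PRE_QUERY = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"


def ollama_raw(prompt: str, model: str = "llama3") -> str:
    """Call ollama in raw mode so it does not re-apply its own chat template."""
    payload = {"model": model, "prompt": prompt, "raw": True, "stream": False}
    return requests.post(OLLAMA_URL, json=payload, timeout=300).json()["response"]


def generate_pair(model: str = "llama3") -> dict:
    # 1) The instruct model completes the empty user turn -> a synthetic instruction.
    #    Cut at the end-of-turn token in case the model keeps generating.
    instruction = ollama_raw(PRE_QUERY, model).split("<|eot_id|>")[0].strip()
    # 2) Feed that instruction back as a normal user turn to get the answer.
    full_prompt = (
        PRE_QUERY + instruction
        + "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )
    response = ollama_raw(full_prompt, model).split("<|eot_id|>")[0].strip()
    return {"instruction": instruction, "output": response}
```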
While reading a script by Sebastian Raschka, PhD, I wondered: could these advancements be replicated in other languages? Specifically, could they benefit non-English datasets?
And the answer is YES, at least for Spanish! I've successfully adapted the technique to Spanish, demonstrating the model's flexibility and multilingual capabilities.
To make this accessible, I created a basic script (heavily inspired by Sebastian Raschka's) that lets you generate similar datasets with ollama models (initially phi and llama3) automatically and upload them to the Hugging Face Hub.
[Script](https://gist.github.com/mrm8488/4650a5e3cc45523798a527a3446eb312)
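If you just want to see the overall shape of such a pipeline, the sketch below strings the generation helper from above into a dataset and pushes it to the Hub with the `datasets` library. It is only an outline under my own assumptions (the repo id is a placeholder, the filtering is deliberately crude, and how the gist actually steers generations toward Spanish is best checked in the script itself).

```python
from datasets import Dataset  # pip install datasets

# Generate a batch of synthetic instruction/response pairs (helper sketched above).
records = [generate_pair(model="llama3") for _ in range(5000)]

# Very basic filtering: drop empty or suspiciously short generations.
records = [r for r in records if len(r["instruction"]) > 20 and len(r["output"]) > 20]

# Push to the Hub (requires a prior `huggingface-cli login`); the repo id is a placeholder.
Dataset.from_list(records).push_to_hub("your-username/synthetic_instructions_es")
```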
Explore the datasets generated using our new script!
- [Llama-3-8B](https://huggingface.co/datasets/mrm8488/dataset_llama3_5000_samples_es_4231_filtered)
- [Phi-3-medium](https://huggingface.co/datasets/mrm8488/dataset_phi3-medium_5000_samples_es_3906_filtered)
- [Phi-3-mini](https://huggingface.co/datasets/mrm8488/dataset_phi3_5000_samples_es_3282_filtered)
Note: These datasets have basic filtering. Apply additional quality filters before using them to fine-tune large language models.
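As a starting point for that extra filtering, a minimal sketch of loading one of the datasets and applying a simple length-based filter might look like this (the column names are assumptions on my part; check the dataset card for the actual schema):

```python
from datasets import load_dataset

ds = load_dataset("mrm8488/dataset_llama3_5000_samples_es_4231_filtered", split="train")

# Column names ("instruction"/"output") are assumed; adjust to the real schema.
ds = ds.filter(lambda ex: len(ex["output"]) > 50 and len(ex["instruction"]) > 10)
print(ds)
```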
Inspiration and base script:
https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb
https://www.linkedin.com/feed/update/urn:li:activity:7210982019751661568/