deepset

company

AI & ML interests

Semantic Search, Language Models, Domain Adaptation, Question Answering

deepset's activity

anakin87 
posted an update about 2 months ago
𝐍𝐞𝐰 𝐈𝐭𝐚𝐥𝐢𝐚𝐧 𝐒𝐦𝐚𝐥𝐥 𝐋𝐚𝐧𝐠𝐮𝐚𝐠𝐞 𝐌𝐨𝐝𝐞𝐥𝐬: 𝐆𝐞𝐦𝐦𝐚 𝐍𝐞𝐨𝐠𝐞𝐧𝐞𝐬𝐢𝐬 𝐜𝐨𝐥𝐥𝐞𝐜𝐭𝐢𝐨𝐧 💎🌍🇮🇹

I am happy to release two new language models for the Italian language!

💪 Gemma 2 9B Neogenesis ITA
anakin87/gemma-2-9b-neogenesis-ita
Building on the impressive work by VAGO Solutions, I applied Direct Preference Optimization with a mix of Italian and English data.
Using Spectrum, I trained 20% of model layers.
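
In practice, the recipe looks roughly like this (a minimal sketch with transformers and trl; the base checkpoint id, layer patterns, dataset id, and hyperparameters below are illustrative assumptions, not the exact setup used for this model):

```python
# Sketch: freeze everything, unfreeze only the ~20% of layers selected by Spectrum,
# then run Direct Preference Optimization with trl.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_model_id = "VAGOsolutions/SauerkrautLM-gemma-2-9b-it"  # assumed VAGO Solutions checkpoint
model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# Spectrum outputs the high signal-to-noise parameter groups to train;
# these name patterns are placeholders for illustration.
trainable_patterns = ["model.layers.33.", "model.layers.34.", "model.layers.35."]
for name, param in model.named_parameters():
    param.requires_grad = any(pattern in name for pattern in trainable_patterns)

# Preference dataset with "prompt", "chosen", "rejected" columns (mixed Italian/English).
preference_data = load_dataset("my-org/ita-eng-preferences", split="train")  # hypothetical id

args = DPOConfig(output_dir="gemma-2-9b-neogenesis-ita", beta=0.1,
                 per_device_train_batch_size=1, num_train_epochs=1)
trainer = DPOTrainer(model=model, args=args, train_dataset=preference_data,
                     processing_class=tokenizer)
trainer.train()
```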

📊 Evaluated on the Open ITA LLM leaderboard (mii-llm/open_ita_llm_leaderboard), this model achieves strong performance.
To beat it on this benchmark, you'd need a 27B model 😎


🤏 Gemma 2 2B Neogenesis ITA
anakin87/gemma-2-2b-neogenesis-ita
This smaller variant is fine-tuned from the original Gemma 2 2B it (instruction-tuned) model by Google.
Through a combination of Supervised Fine-Tuning and Direct Preference Optimization, I trained 25% of the layers using Spectrum.

📈 Compared to the original model, it shows improved Italian proficiency, which is notable given its small size.


Both models were developed during the recent #gemma competition on Kaggle.
📓 Training code: https://www.kaggle.com/code/anakin87/post-training-gemma-for-italian-and-beyond


🙏 Thanks @FinancialSupport and mii-llm for the help during evaluation.
anakin87 
posted an update about 2 months ago
Hey, it has been a while... I was busy participating in the 💎 𝐆𝐞𝐦𝐦𝐚 𝐜𝐨𝐦𝐩𝐞𝐭𝐢𝐭𝐢𝐨𝐧!

Here's the idea: Gemma open models have a large vocabulary size (256K), so improving them for a specific language or cultural context should be pretty affordable - no need for continued pre-training.

My submission: 💎🌍🇮🇹 𝐍𝐞𝐨𝐠𝐞𝐧𝐞𝐬𝐢𝐬 - 𝐏𝐨𝐬𝐭-𝐓𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐆𝐞𝐦𝐦𝐚 𝐟𝐨𝐫 𝐈𝐭𝐚𝐥𝐢𝐚𝐧 𝐚𝐧𝐝 𝐛𝐞𝐲𝐨𝐧𝐝
📓 Kaggle notebook: https://www.kaggle.com/code/anakin87/post-training-gemma-for-italian-and-beyond

In this notebook, I show how I improve the performance of Gemma 2 2B on Italian via Post-Training.
I believe this method is adaptable to other languages and model sizes.

𝘒𝘦𝘺 𝘚𝘵𝘦𝘱𝘴
📊 Choose reference metrics
🧑‍🔬 Data curation for Instruction Fine Tuning: identify existing datasets + generate synthetic data
🏋️‍♂️ Efficient Instruction Fine Tuning with Spectrum
🧑‍🔬 Data curation for Preference Tuning: identify existing datasets + generate synthetic data
👍👎 Efficient Direct Preference Optimization with Spectrum
📈 Evaluation


🤗 HF中国镜像站 collection (with models and datasets): anakin87/gemma-neogenesis-67824b7bf13ac9cfe091fe2e

I'm also planning a 🎁 Gemma Giveaway (on LinkedIn - https://www.linkedin.com/in/stefano-fiorucci) in the next few days - sharing techniques, datasets, and models I used for my project... so stay tuned! 📻
anakin87 
posted an update 3 months ago
Tulu 3 SFT Mixture by AllenAI is a massive, high-quality, multilingual dataset for fine-tuning Language Models.

Unfortunately, it was missing the "language" column.

I added it using good old fastText.

Check out the dataset here 👉 anakin87/tulu-3-sft-mixture-with-language
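
For reference, adding such a column takes only a few lines with fastText's lid.176 model and the datasets library (a minimal sketch assuming the conversations live in a "messages" column; the actual preprocessing may differ):

```python
# Sketch: tag each example with a detected language using fastText's lid.176 model.
import fasttext
from datasets import load_dataset

# lid.176.bin is available from https://fasttext.cc/docs/en/language-identification.html
lid_model = fasttext.load_model("lid.176.bin")

def detect_language(example):
    # Use the first user turn as a proxy for the conversation language;
    # fastText's predict() doesn't accept newlines, so strip them.
    text = example["messages"][0]["content"].replace("\n", " ")
    labels, _scores = lid_model.predict(text)
    example["language"] = labels[0].replace("__label__", "")
    return example

dataset = load_dataset("allenai/tulu-3-sft-mixture", split="train")
dataset = dataset.map(detect_language)
```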

anakin87 
posted an update 4 months ago
🐝🐝🐝 𝐀 𝐒𝐰𝐚𝐫𝐦 𝐨𝐟 𝐀𝐠𝐞𝐧𝐭𝐬 𝐰𝐢𝐭𝐡 𝐋𝐥𝐚𝐦𝐚 3.2, 𝐆𝐏𝐓-4𝐨 𝐦𝐢𝐧𝐢 𝐚𝐧𝐝 𝐂𝐥𝐚𝐮𝐝𝐞 3.5 𝐒𝐨𝐧𝐧𝐞𝐭

𝐓𝐋;𝐃𝐑: I reimplemented the Swarm concept using Haystack, but made it work with both open and proprietary models 💫

✍️ blog article: https://haystack.deepset.ai/blog/swarm-of-agents
📓 notebook: https://haystack.deepset.ai/cookbook/swarm


Some time ago OpenAI published Swarm: an educational framework for building multi-agent systems.

Their approach focuses on two main concepts:
・ 𝐑𝐨𝐮𝐭𝐢𝐧𝐞𝐬: Each agent follows specific 📜 instructions and uses 🛠️ tools to execute them.
・ 𝐇𝐚𝐧𝐝𝐨𝐟𝐟𝐬 🤝: Agents can transfer control to one another using tool/function calling.


When I first read these ideas, I thought: 𝘴𝘪𝘮𝘱𝘭𝘦 𝘣𝘶𝘵 𝘱𝘰𝘸𝘦𝘳𝘧𝘶𝘭! And they pair well with the recent unified tool support in Haystack.

🧑‍💻 So, I decided to re-implement these concepts using Haystack, and in just a few lines of code, I had a working prototype.

🆒 Bonus feature: this implementation isn't tied to a single model provider - different agents can be powered by different models!

I replicated the ACME customer service example from the original article, with 3 Agents:
🐝 Triage Agent - Llama 3.2 running on Ollama
🐝 Sales Agent - Anthropic Claude 3.5 Sonnet
🐝 Issues and Repairs Agent - OpenAI GPT-4o mini
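
To make the handoff idea concrete, here is a tiny framework-agnostic sketch (not the actual Haystack implementation): each agent carries its own routine, model, and tools, and a handoff is just a tool that returns the name of the next agent.

```python
# Minimal illustration of routines + handoffs; model names and logic are illustrative.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    name: str
    model: str                       # each agent can use a different provider/model
    instructions: str                # the agent's routine
    tools: list[Callable] = field(default_factory=list)

def transfer_to_sales() -> str:
    """Handoff tool: switch control to the Sales Agent."""
    return "Sales Agent"

def transfer_to_issues_and_repairs() -> str:
    """Handoff tool: switch control to the Issues and Repairs Agent."""
    return "Issues and Repairs Agent"

agents = {
    "Triage Agent": Agent("Triage Agent", "llama3.2:3b",
                          "Route the user to the right department.",
                          [transfer_to_sales, transfer_to_issues_and_repairs]),
    "Sales Agent": Agent("Sales Agent", "claude-3-5-sonnet", "Sell ACME products."),
    "Issues and Repairs Agent": Agent("Issues and Repairs Agent", "gpt-4o-mini",
                                      "Help with product issues and refunds."),
}

# Orchestration idea: when the active agent's model calls a handoff tool,
# the returned name becomes the new active agent and the chat history is carried over.
current = "Triage Agent"
chosen_tool = agents[current].tools[0]   # pretend the LLM chose transfer_to_sales
current = chosen_tool()
print(f"Control handed off to: {current}")
```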


Want to see the full implementation and give it a try? Check out the blog post and notebook! ✨

anakin87 
posted an update 5 months ago
Ok, you're finally convinced that synthetic data works... ⚗️

𝐍𝐨𝐰 𝐲𝐨𝐮 𝐰𝐚𝐧𝐭 𝐭𝐨 𝐠𝐞𝐧𝐞𝐫𝐚𝐭𝐞 𝐚𝐧 𝐢𝐧𝐬𝐭𝐫𝐮𝐜𝐭𝐢𝐨𝐧 𝐝𝐚𝐭𝐚𝐬𝐞𝐭 𝐟𝐨𝐫 𝐟𝐢𝐧𝐞-𝐭𝐮𝐧𝐢𝐧𝐠 𝐢𝐧 𝐚 𝐥𝐚𝐧𝐠𝐮𝐚𝐠𝐞 𝐨𝐭𝐡𝐞𝐫 𝐭𝐡𝐚𝐧 𝐄𝐧𝐠𝐥𝐢𝐬𝐡.
But how do you get started?

I explore how to do this with Magpie in my new article
https://huggingface.co/blog/anakin87/multilingual-magpie

---

🐦‍⬛ 𝐖𝐡𝐚𝐭 𝐢𝐬 𝐌𝐚𝐠𝐩𝐢𝐞?

It's a recent technique for creating synthetic instruction datasets.

Magpie is based on a simple but ingenious idea 👇
if you prompt an instruction-tuned model with a pre-query template, you can make it generate a plausible user query/instruction

Here's an example:
model: Llama-3-8B-Instruct
pre-query template: "<|begin_of_text|><|start_header_id|>user<|end_header_id|>"
generated user instruction: "What are some of the responsibilities of a commercial pilot?"

You can then feed this instruction back into the same model to get the assistant response.

By repeating this process, it's possible to generate large synthetic datasets with relatively little effort.
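
In code, the core loop looks roughly like this (a minimal sketch with transformers; generation settings are illustrative, not the exact Magpie pipeline):

```python
# Sketch of the Magpie idea: sample an instruction from the pre-query template,
# then feed it back as a user turn to get the assistant response.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Step 1: prompt with only the pre-query template -> the model invents a user instruction.
pre_query_template = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>"
inputs = tokenizer(pre_query_template, return_tensors="pt", add_special_tokens=False).to(model.device)
out = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=1.0)
instruction = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Step 2: feed the generated instruction back as a normal user turn to get the response.
messages = [{"role": "user", "content": instruction}]
chat_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(chat_inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
response = tokenizer.decode(out[0][chat_inputs.shape[1]:], skip_special_tokens=True)

print({"instruction": instruction, "response": response})
```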

🪄 The authors demonstrate that using these datasets for Supervised Fine Tuning (SFT) can yield strong performance, even competitive with the original instruct model.


🧗𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐧𝐠 𝐧𝐨𝐧-𝐄𝐧𝐠𝐥𝐢𝐬𝐡 𝐝𝐚𝐭𝐚

Most Language Models are primarily trained on English texts, so they tend to produce data in English.

How can we overcome this?

Earlier approaches were complex or costly.

Then @mrm8488 found a simple solution: add the target language to the pre-query template.
For Spanish, the template becomes "<|begin_of_text|><|start_header_id|>user<|end_header_id|>spanish:".
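
In the sketch above, that's a one-line change:

```python
# Only the pre-query template changes; the generation loop stays the same.
pre_query_template = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>spanish:"
```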

This method works for Spanish and German!

❌ Unfortunately, it does not work well for other languages (🇮🇹, 🇳🇱, ...)

👇
anakin87 
posted an update 6 months ago
🕵🏻 𝐀𝐠𝐞𝐧𝐭𝐢𝐜 𝐑𝐀𝐆 𝐰𝐢𝐭𝐡 🦙 𝐋𝐥𝐚𝐦𝐚 3.2

I was excited to explore Llama 3.2, but as a simple 🇪🇺 EU guy, I don't have access to Meta's multimodal models 😿

🤔 So I thought: why not challenge the small 3B text model with Agentic RAG?

🎯 The plan:
- Build a system that tries to answer questions using a knowledge base.
- If the documents don't contain the answer, use Web search for additional context.
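
The control flow boils down to a simple fallback (a hedged sketch; the helper functions below are hypothetical stand-ins for the Haystack RAG pipeline and the DuckDuckGo web search, not real components):

```python
# Sketch of the agentic fallback; the helpers are hypothetical placeholders.

def answer_from_knowledge_base(question: str) -> str | None:
    """RAG over the local documents; return None if they don't contain the answer."""
    return None  # pretend the knowledge base wasn't enough

def web_search(question: str) -> list[str]:
    """Fetch additional context from the web (e.g. the free DuckDuckGo API)."""
    return ["<web snippet 1>", "<web snippet 2>"]

def answer_with_context(question: str, context: list[str]) -> str:
    """Generate the final answer with Llama-3.2-3B-Instruct, grounded in the snippets."""
    return "final answer based on " + ", ".join(context)

def agentic_rag(question: str) -> str:
    answer = answer_from_knowledge_base(question)
    if answer is not None:
        return answer
    # The knowledge base couldn't answer: fall back to web search for extra context.
    return answer_with_context(question, web_search(question))

print(agentic_rag("example question"))
```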


Check out my experimental notebook here: 📓 https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/llama32_agentic_rag.ipynb


My stack:
🏗️ haystack (https://haystack.deepset.ai/): open-source LLM orchestration framework
🦙 meta-llama/Llama-3.2-3B-Instruct
🦆🌐 free DuckDuckGo API, integrated with Haystack

✨ 𝘛𝘩𝘦 𝘳𝘦𝘴𝘶𝘭𝘵𝘴? 𝘌𝘯𝘤𝘰𝘶𝘳𝘢𝘨𝘪𝘯𝘨 - 𝘢 𝘧𝘦𝘸 𝘮𝘰𝘯𝘵𝘩𝘴 𝘢𝘨𝘰, 𝘵𝘩𝘪𝘴 𝘭𝘦𝘷𝘦𝘭 𝘰𝘧 𝘱𝘦𝘳𝘧𝘰𝘳𝘮𝘢𝘯𝘤𝘦 𝘧𝘳𝘰𝘮 𝘢 𝘴𝘮𝘢𝘭𝘭 𝘮𝘰𝘥𝘦𝘭 𝘸𝘰𝘶𝘭𝘥'𝘷𝘦 𝘣𝘦𝘦𝘯 𝘶𝘯𝘵𝘩𝘪𝘯𝘬𝘢𝘣𝘭𝘦!
This probably reflects the impressive IFEval score of the model (comparable to Llama 3.1 8B).