Open R1

Enterprise

community

https://github.com/huggingface/open-r1

Activity Feed

AI & ML interests

None defined yet.

Recent Activity

lewtun updated a model 14 minutes ago

open-r1/OlympicCoder-32B

lewtun updated a model 16 minutes ago

open-r1/OlympicCoder-7B

lewtun new activity 28 minutes ago

open-r1/codeforces-cots:Why is there a discrepancy between the 'Solutions' subset and the 'Solutions_py' subset?

View all activity

Articles

Mini-R1: Reproduce Deepseek R1 „aha moment“ a RL tutorial

Jan 31

• 43

open-r1's activity

lewtun

updated a model 14 minutes ago

open-r1/OlympicCoder-32B

Text Generation • Updated 14 minutes ago • 434 • 58

lewtun

updated a model 16 minutes ago

open-r1/OlympicCoder-7B

Text Generation • Updated 16 minutes ago • 641 • 74

lewtun

in open-r1/codeforces-cots 28 minutes ago

Why is there a discrepancy between the 'Solutions' subset and the 'Solutions_py' subset?

#2 opened about 6 hours ago by

waple

burtenshaw

posted an update 43 minutes ago

Post

Still speed running Gemma 3 to think. Today I focused on setting up gpu poor hardware to run GRPO.

This is a plain TRL and PEFT notebook which works on mac silicone or colab T4. This uses the 1b variant of Gemma 3 and a reasoning version of GSM8K dataset.

🧑‍🍳 There’s more still in the oven like releasing models, an Unsloth version, and deeper tutorials, but hopefully this should bootstrap your projects.

Here’s a link to the 1b notebook: https://colab.research.google.com/drive/1mwCy5GQb9xJFSuwt2L_We3eKkVbx2qSt?usp=sharing

lewtun

updated a dataset about 1 hour ago

open-r1/codeforces-cots

Viewer • Updated about 1 hour ago • 155k • 783 • 26

lewtun

updated a Space about 1 hour ago

R1-distilled leaderboard

⚡

Display LLM leaderboard scores for open-r1 models

burtenshaw

posted an update about 7 hours ago

Post

419

everybody and their dog is fine-tuning Gemma 3 today, so I thought I'd do a longer post on the tips and sharp edges I find. let's go!

1. has to be install everything form main and nightly. this is what I'm working with to get unsloth and TRL running

git+https://github.com/huggingface/transformers@main
git+https://github.com/huggingface/trl.git@main
bitsandbytes
peft

plus this with --no-deps

git+https://github.com/unslothai/unsloth-zoo.git@nightly
git+https://github.com/unslothai/unsloth.git@nightly

2. will brown's code to turn GSM8k into a reasoning dataset is a nice toy experiment https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb

3. with a learning rate of 5e-6 rewards and loss stayed flat for the first 100 or so steps.

4. so far none of my runs have undermined the outputs after 1 epoch. therefore, I'm mainly experimenting with bigger LoRA adapters.

from trl import GRPOConfig

training_args = GRPOConfig(
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 1,
    num_generations = 2,
    max_prompt_length = 256,
    max_completion_length = 1024 - 256,
    num_train_epochs = 1,
    max_steps = 250,
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "none",
)

5. vision fine-tuning isn't available in TRL's GRPOTrainer, so stick to text datasets. but no need to load the model differently in transformers or Unsloth

from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained("google/gemma-3-4b-it)

if you want an introduction to GRPO, check out the reasoning course, it walks you through the algorithm, theory, and implementation in a smooth way.

https://huggingface.co/reasoning-course

2 replies

guipenedo

updated 3 datasets about 16 hours ago

fdaudens

posted an update about 22 hours ago

Post

352

Ever wanted 45 min with one of AI’s most fascinating minds? Was with @thomwolf at HumanX Vegas. Sharing my notes of his Q&A with the press—completely changed how I think about AI’s future:

1️⃣ The next wave of successful AI companies won’t be defined by who has the best model but by who builds the most useful real-world solutions. "We all have engines in our cars, but that’s rarely the only reason we buy one. We expect it to work well, and that’s enough. LLMs will be the same."

2️⃣ Big players are pivoting: "Closed-source companies—OpenAI being the first—have largely shifted from LLM announcements to product announcements."

3️⃣ Open source is changing everything: "DeepSeek was open source AI’s ChatGPT moment. Basically, everyone outside the bubble realized you can get a model for free—and it’s just as good as the paid ones."

4️⃣ Product innovation is being democratized: Take Manus, for example—they built a product on top of Anthropic’s models that’s "actually better than Anthropic’s own product for now, in terms of agents." This proves that anyone can build great products with existing models.

We’re entering a "multi-LLM world," where models are becoming commoditized, and all the tools to build are readily available—just look at the flurry of daily new releases on HF中国镜像站.

Thom's comparison to the internet era is spot-on: "In the beginning you made a lot of money by making websites... but nowadays the huge internet companies are not the companies that built websites. Like Airbnb, Uber, Facebook, they just use the internet as a medium to make something for real life use cases."

Love to hear your thoughts on this shift!

1 reply

burtenshaw

posted an update about 24 hours ago

Post

786

Here’s a notebook to make Gemma reason with GRPO & TRL. I made this whilst prepping the next unit of the reasoning course:

In this notebooks I combine together google’s model with some community tooling

- First, I load the model from the HF中国镜像站 hub with transformers’s latest release for Gemma 3
- I use PEFT and bitsandbytes to get it running on Colab
- Then, I took Will Browns processing and reward functions to make reasoning chains from GSM8k
- Finally, I used TRL’s GRPOTrainer to train the model

Next step is to bring Unsloth AI in, then ship it in the reasoning course. Links to notebook below.

https://colab.research.google.com/drive/1Vkl69ytCS3bvOtV9_stRETMthlQXR4wX?usp=sharing

2 replies

edbeeching

authored a paper about 24 hours ago

Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning

Paper • 2503.07572 • Published 3 days ago • 26

lewtun

authored a paper about 24 hours ago

Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning

Paper • 2503.07572 • Published 3 days ago • 26

thomwolf

posted an update 1 day ago

Post

1175

We've kept pushing our Open-R1 project, an open initiative to replicate and extend the techniques behind DeepSeek-R1.

And even we were mind-blown by the results we got with this latest model we're releasing: ⚡️OlympicCoder ( open-r1/OlympicCoder-7B and open-r1/OlympicCoder-32B)

It's beating Claude 3.7 on (competitive) programming –a domain Anthropic has been historically really strong at– and it's getting close to o1-mini/R1 on olympiad level coding with just 7B parameters!

And the best part is that we're open-sourcing all about its training dataset, the new IOI benchmark, and more in our Open-R1 progress report #3: https://huggingface.co/blog/open-r1/update-3

Datasets are are releasing:
- open-r1/codeforces
- open-r1/codeforces-cots
- open-r1/ioi
- open-r1/ioi-test-cases
- open-r1/ioi-sample-solutions
- open-r1/ioi-cots
- open-r1/ioi-2024-model-solutions

eliebak

posted an update 1 day ago

Post

898

Google just dropped an exciting technical report for the brand-new Gemma3 model! 🚀 Here are my personal notes highlighting the most intriguing architectural innovations, design choices, and insights from this release:

1) Architecture choices:
> No more softcaping, replace by QK-Norm
> Both Pre AND Post Norm
> Wider MLP than Qwen2.5, ~ same depth
> SWA with 5:1 and 1024 (very small and cool ablation on the paper!)
> No MLA to save KV cache, SWA do the job!

2) Long context
> Only increase the rope in the global layer (to 1M)
> Confirmation that it's harder to do long context for smol models, no 128k for the 1B
> Pretrained with 32k context? seems very high
> No yarn nor llama3 like rope extension

3) Distillation
> Only keep te first 256 logits for the teacher
> Ablation on the teacher gap (tl;dr you need some "patience" to see that using a small teacher is better)
> On policy distillation yeahh (by
@agarwl_
et al), not sure if the teacher gap behave the same here, curious if someone have more info?

4) Others
> Checkpoint with QAT, that's very cool
> RL using improve version of BOND, WARM/WARP good excuse to look at
@ramealexandre
papers
> Only use Zero3, no TP/PP if i understand correctly ?
> Training budget relatively similar than gemma2

1 reply

lewtun

in open-r1/codeforces-cots 1 day ago

Update README.md

#1 opened 1 day ago by

lhoestq

lewtun

in open-r1/OlympicCoder-32B 1 day ago

Size of the weights > 140 GB for a 32 GB model?

#2 opened 1 day ago by

stelterlab

Remove fp32 weights

#4 opened 1 day ago by

lewtun

Remove fp32 weights

#3 opened 1 day ago by

lewtun

Open R1

AI & ML interests

Recent Activity

Articles

Open R1: Update #3

Open R1: Update #2

Open-R1: Update #1

Mini-R1: Reproduce Deepseek R1 „aha moment“ a RL tutorial

open-r1's activity

open-r1/OlympicCoder-32B

open-r1/OlympicCoder-7B

Why is there a discrepancy between the 'Solutions' subset and the 'Solutions_py' subset?

open-r1/codeforces-cots

R1-distilled leaderboard

open-r1/ioi-test-cases

open-r1/ioi

open-r1/ioi-2024-model-solutions

Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning

Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning

Update README.md

Size of the weights > 140 GB for a 32 GB model?

Remove fp32 weights

Remove fp32 weights

AI & ML interests

Recent Activity

Articles

Open R1: Update #3

Open R1: Update #2

Open-R1: Update #1

Mini-R1: Reproduce Deepseek R1 „aha moment“ a RL tutorial

Team members 32

open-r1's activity

Why is there a discrepancy between the 'Solutions' subset and the 'Solutions_py' subset?

R1-distilled leaderboard

Update README.md

Size of the weights > 140 GB for a 32 GB model?

Remove fp32 weights

Remove fp32 weights