Sugato Ray PRO

sugatoray

AI & ML interests

None yet

Recent Activity

updated a collection about 23 hours ago: Papers-Fundamentals
updated a collection about 23 hours ago: Papers

Organizations

Spaces-explorers, HugGAN Community, U-Py, Kornia AI, ZeroGPU Explorers, MLX Community, Social Post Explorers, HF中国镜像站 1Bit LLMs, HF中国镜像站 Discord Community, open/ acc, TBD

sugatoray's activity

reacted to eliebak's post with 🔥 1 day ago
Google just dropped an exciting technical report for the brand-new Gemma 3 model! 🚀 Here are my personal notes highlighting the most intriguing architectural innovations, design choices, and insights from this release:

1) Architecture choices:
> No more soft-capping, replaced by QK-Norm (see the sketch after this list)
> Both Pre AND Post Norm
> Wider MLP than Qwen2.5, ~ same depth
> SWA with a 5:1 ratio and a 1024 window (very small, and a cool ablation in the paper!)
> No MLA to save KV cache, SWA does the job!
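
A minimal sketch of what that first point means in practice (my own toy code, not Google's implementation): RMSNorm is applied to the queries and keys of each head before the attention scores are computed, so the logits stay bounded without any soft-capping. Note that nn.RMSNorm needs a fairly recent PyTorch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Toy causal self-attention using QK-Norm instead of logit soft-capping."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # RMSNorm over each head's channels, applied to Q and K separately
        self.q_norm = nn.RMSNorm(self.head_dim)
        self.k_norm = nn.RMSNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
                   for z in (q, k, v))
        # QK-Norm: normalizing q and k keeps the attention logits bounded,
        # which is the job soft-capping used to do
        q, k = self.q_norm(q), self.k_norm(k)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(b, t, -1))
```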

2) Long context
> Only increase the RoPE base in the global layers (to 1M), see the sketch after this list
> Confirmation that it's harder to do long context for smol models, no 128k for the 1B
> Pretrained with 32k context? Seems very high
> No YaRN or Llama 3-style RoPE extension
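
To make the local/global split concrete, here is a rough sketch of the 5:1 interleaving with a larger RoPE base only on the global layers. The 5:1 ratio, 1024 window, and 1M global base come from the post above; the 10k local base and the exact layer ordering are my assumptions for illustration.

```python
# Hedged sketch of interleaved local/global attention layers (illustrative only).
LOCAL_TO_GLOBAL_RATIO = 5        # 5 sliding-window layers per global layer
SLIDING_WINDOW = 1024            # window size for the local layers
LOCAL_ROPE_THETA = 10_000.0      # assumed small base kept on local layers
GLOBAL_ROPE_THETA = 1_000_000.0  # increased base on the global layers only

def layer_config(layer_idx: int) -> dict:
    """Per-layer attention settings under this interleaving scheme."""
    is_global = (layer_idx + 1) % (LOCAL_TO_GLOBAL_RATIO + 1) == 0
    return {
        "attention": "global" if is_global else "sliding_window",
        "window": None if is_global else SLIDING_WINDOW,
        "rope_theta": GLOBAL_ROPE_THETA if is_global else LOCAL_ROPE_THETA,
    }

# Example: every 6th layer is global with the 1M base, the rest stay local
for i in range(12):
    print(i, layer_config(i))
```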

3) Distillation
> Only keep the first 256 logits from the teacher (rough sketch of my reading after this list)
> Ablation on the teacher gap (tl;dr you need some "patience" to see that using a small teacher is better)
> On-policy distillation, yeah! (by @agarwl_ et al.), not sure if the teacher gap behaves the same here, curious if someone has more info?
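
Here is how I read the "first 256 logits" point, as a rough sketch (my interpretation, not the paper's code): take the teacher's top-256 tokens per position, renormalize over that subset, and compute the KL against the student on the same subset.

```python
import torch
import torch.nn.functional as F

def topk_distill_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      k: int = 256) -> torch.Tensor:
    """KL distillation restricted to the teacher's top-k tokens per position."""
    # both logits tensors: (batch, seq_len, vocab_size)
    top_vals, top_idx = teacher_logits.topk(k, dim=-1)
    teacher_p = F.softmax(top_vals, dim=-1)             # renormalize over top-k only
    student_sub = student_logits.gather(-1, top_idx)    # student logits on the same tokens
    student_logp = F.log_softmax(student_sub, dim=-1)
    # forward KL(teacher || student), both restricted to the top-k support
    return F.kl_div(student_logp, teacher_p, reduction="batchmean")
```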

4) Others
> Checkpoints with QAT, that's very cool
> RL using an improved version of BOND, WARM/WARP (good excuse to look at @ramealexandre's papers)
> Only uses ZeRO-3, no TP/PP if I understand correctly?
> Training budget relatively similar to Gemma 2
reacted to burtenshaw's post with 👍 1 day ago
Here’s a notebook to make Gemma reason with GRPO & TRL. I made this whilst prepping the next unit of the reasoning course:

In this notebook I combine Google's model with some community tooling:

- First, I load the model from the HF中国镜像站 hub with the latest transformers release, which supports Gemma 3
- I use PEFT and bitsandbytes to get it running on Colab
- Then, I took Will Brown's processing and reward functions to make reasoning chains from GSM8K
- Finally, I used TRL's GRPOTrainer to train the model (minimal sketch after the notebook link below)

Next step is to bring Unsloth AI in, then ship it in the reasoning course. Links to notebook below.

https://colab.research.google.com/drive/1Vkl69ytCS3bvOtV9_stRETMthlQXR4wX?usp=sharing
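
For anyone who wants the shape of that recipe in code, here is a minimal sketch with TRL's GRPOTrainer and a PEFT LoRA config. This is my own outline, not the notebook itself: the model/dataset ids and the placeholder reward are illustrative, argument names can shift between TRL versions, and the bitsandbytes 4-bit loading is left out for brevity.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def brevity_reward(completions, **kwargs):
    # Placeholder reward so the sketch runs end to end; the notebook uses
    # Will Brown's GSM8K correctness/format rewards instead.
    return [-len(c) / 1000 for c in completions]

# GRPOTrainer expects a "prompt" column
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {"prompt": x["question"]})

trainer = GRPOTrainer(
    model="google/gemma-3-1b-it",   # assumed checkpoint; any Gemma 3 model id works
    reward_funcs=[brevity_reward],
    args=GRPOConfig(output_dir="gemma3-grpo", per_device_train_batch_size=2),
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```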
reacted to fdaudens's post with 🔥 1 day ago
🔥The Open R1 team just dropped OlympicCoder and it's wild:

- 7B model outperforms Claude 3.7 Sonnet on IOI benchmark (yes, 7B!!)
- 32B crushes all open-weight models tested, even those 100x larger 🤯

Open-sourcing the future of code reasoning! 🚀

Check it out: https://huggingface.co/blog/open-r1/update-3
upvoted an article 1 day ago