Elie Bakouch

eliebak

AI & ML interests

Training LLMs @ πŸ€—

Organizations

HFδΈ­ε›½ι•œεƒη«™'s profile picture HuggingFaceBR4's profile picture HFδΈ­ε›½ι•œεƒη«™ H4's profile picture Blog-explorers's profile picture HFδΈ­ε›½ι•œεƒη«™ TB Research's profile picture huggingPartyParis's profile picture Nanotron Research's profile picture MLX Community's profile picture HFδΈ­ε›½ι•œεƒη«™ SMOL's profile picture HuggingFaceFW's profile picture HuggingFaceFW-Dev's profile picture LLHF's profile picture llmc's profile picture SLLHF's profile picture Argilla Warehouse's profile picture nltpt's profile picture smol-explorers's profile picture Open Science's profile picture HFδΈ­ε›½ι•œεƒη«™ Science's profile picture open/ acc's profile picture Open R1's profile picture

Posts 2

Google just dropped an exciting technical report for the brand-new Gemma3 model! πŸš€ Here are my personal notes highlighting the most intriguing architectural innovations, design choices, and insights from this release:

1) Architecture choices:
> No more softcapping, replaced by QK-norm (see the sketch after this list)
> Both Pre AND Post Norm
> Wider MLP than Qwen2.5, ~ same depth
> SWA with a 5:1 ratio and a 1024-token window (very small, plus a cool ablation in the paper!)
> No MLA to save KV cache, SWA does the job!
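
A minimal PyTorch sketch of the first and last points: QK-norm in place of softcapping, plus the 5:1 sliding-window/global interleave with a 1024 window. This is my own illustration, not the Gemma3 code; module names, dimensions, and the 12-layer toy stack are made up, only the QK-norm idea, the 1024 window, and the 5:1 ratio come from the notes above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class QKNormAttention(nn.Module):
    """Causal self-attention where Q and K are RMS-normalized per head before
    the dot product, bounding the logits without tanh softcapping."""
    def __init__(self, dim, n_heads, window=None):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        self.q_norm = RMSNorm(self.head_dim)
        self.k_norm = RMSNorm(self.head_dim)
        self.window = window  # None -> global attention, int -> sliding-window size

    def forward(self, x):
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        q, k = self.q_norm(q), self.k_norm(k)  # QK-norm in place of softcapping
        idx = torch.arange(T, device=x.device)
        mask = idx[None, :] <= idx[:, None]  # causal mask
        if self.window is not None:
            mask = mask & ((idx[:, None] - idx[None, :]) < self.window)  # sliding window
        out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        return self.out(out.transpose(1, 2).reshape(B, T, -1))

# 5:1 interleave: five sliding-window layers (window 1024) for every global layer.
layers = nn.ModuleList(
    QKNormAttention(dim=512, n_heads=8, window=None if (i + 1) % 6 == 0 else 1024)
    for i in range(12)
)
```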

2) Long context
> Only increases the RoPE base in the global layers (to 1M, see the sketch after this list)
> Confirmation that it's harder to do long context for smol models, no 128k for the 1B
> Pretrained with 32k context? Seems very high
> No YaRN or Llama3-style RoPE extension
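
A tiny sketch of what "only increase the RoPE base in the global layers" means in practice. The function name and the 10k local base are my assumptions; the 1M base for global layers is from the notes above.

```python
import torch

def rope_inv_freq(head_dim: int, base: float) -> torch.Tensor:
    """Standard RoPE inverse frequencies: base^(-2i/d) for i = 0..d/2-1."""
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

local_inv_freq  = rope_inv_freq(head_dim=128, base=10_000)     # sliding-window layers
global_inv_freq = rope_inv_freq(head_dim=128, base=1_000_000)  # global layers only

# The larger base makes the lowest frequencies rotate much more slowly,
# so distant positions stay distinguishable at long context lengths.
print(local_inv_freq[-1], global_inv_freq[-1])
```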

3) Distillation
> Only keep the first 256 logits for the teacher (see the sketch after this list)
> Ablation on the teacher gap (tl;dr you need some "patience" to see that using a small teacher is better)
> On-policy distillation yeahh (by @agarwl_ et al.), not sure if the teacher gap behaves the same here, curious if someone has more info?
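
A sketch of what a truncated-teacher distillation loss could look like. I'm reading "256 logits" as the teacher's top 256 per token, which is an assumption on my part, and this is not the actual Gemma3 training code.

```python
import torch
import torch.nn.functional as F

def topk_distill_loss(student_logits, teacher_logits, k=256, temperature=1.0):
    """Cross-entropy against the teacher's top-k logits, renormalized over those k."""
    # logits: (batch, seq, vocab)
    top_vals, top_idx = teacher_logits.topk(k, dim=-1)
    teacher_p = F.softmax(top_vals / temperature, dim=-1)  # teacher mass on its top-k
    student_logp = F.log_softmax(student_logits / temperature, dim=-1).gather(-1, top_idx)
    return -(teacher_p * student_logp).sum(dim=-1).mean()

# Usage (hypothetical models):
# loss = topk_distill_loss(student(tokens), teacher(tokens).detach(), k=256)
```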

4) Others
> Checkpoint with QAT, that's very cool
> RL using an improved version of BOND, WARM/WARP, a good excuse to look at @ramealexandre's papers
> Only uses ZeRO-3, no TP/PP if I understand correctly? (see the sketch below)
> Training budget relatively similar to Gemma2
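
To make the ZeRO-3 bullet concrete: in open-source terms, "ZeRO-3 only, no TP/PP" is plain data parallelism with parameters, gradients, and optimizer states fully sharded across ranks. Gemma3 of course trains on Google's own stack, so this DeepSpeed-style config is just an analogy with placeholder values.

```python
# Illustrative DeepSpeed config: ZeRO stage 3, no tensor or pipeline parallelism.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                  # shard params, grads, and optimizer states
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    # No TP/PP: every rank runs the full model graph; only the ZeRO-3 shards differ.
}

# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config
# )
```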

Articles 5

Open R1: Update #2