
Steven Zheng

Steveeeeeeen

AI & ML interests

speech & audio

Recent Activity

updated a dataset about 22 hours ago
Steveeeeeeen/mls_eng_10k
liked a Space about 24 hours ago
sesame/csm-1b
updated a dataset 1 day ago
Steveeeeeeen/whisper-leaderboard-evals

Organizations

HF中国镜像站, Dynamic-SUPERB, Dynamic-SUPERB-Private, HF中国镜像站 for Audio, huggingPartyParis, MLX Community, TTS AGI, Whisper Multilingual Distillation, Audio Collabs, open/ acc, MultiLlasa, fluxions-hf

Steveeeeeeen's activity

upvoted 2 articles 1 day ago

Open-R1: a fully open reproduction of DeepSeek-R1

reacted to eliebak's post with 🔥 3 days ago
Google just dropped an exciting technical report for the brand-new Gemma3 model! 🚀 Here are my personal notes highlighting the most intriguing architectural innovations, design choices, and insights from this release:

1) Architecture choices:
> No more softcapping, replaced by QK-Norm
> Both Pre AND Post Norm
> Wider MLP than Qwen2.5, ~ same depth
> SWA with a 5:1 ratio and a 1024-token window (very small, and a cool ablation in the paper!)
> No MLA to save KV cache, SWA does the job!
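A minimal sketch of that 5:1 local:global interleaving (an assumed mask construction for illustration, not Gemma's actual implementation): every sixth layer attends globally, the rest use a 1024-token sliding window.

```python
import numpy as np

def attention_mask(seq_len, layer_idx, local_ratio=5, window=1024):
    """Illustrative SWA mask: 5 of every 6 layers use a sliding window,
    every 6th layer is a global (full causal) attention layer."""
    i = np.arange(seq_len)[:, None]    # query positions
    j = np.arange(seq_len)[None, :]    # key positions
    causal = j <= i
    if (layer_idx + 1) % (local_ratio + 1) == 0:
        return causal                  # global layer: full causal attention
    return causal & (i - j < window)   # local layer: sliding window
```

At 2048 tokens, a local layer already masks out keys more than 1024 positions back, while the global layer still sees the full causal context.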

2) Long context
> Only increase the RoPE base in the global layers (to 1M)
> Confirmation that it's harder to do long context for smol models, no 128k for the 1B
> Pretrained with 32k context? Seems very high
> No YaRN nor Llama3-style RoPE extension
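A quick sketch of what raising the RoPE base does (hypothetical helper; only the 1M figure comes from the post): a larger base slows the rotation frequencies, so distant positions stay distinguishable at long range.

```python
def rope_inv_freqs(head_dim, base):
    """Inverse frequencies used by rotary position embeddings."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

# Local layers keep a short-context base; global layers use a much larger
# base (1M per the post) so the rotations wrap far more slowly.
local_freqs = rope_inv_freqs(128, base=10_000)
global_freqs = rope_inv_freqs(128, base=1_000_000)
```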

3) Distillation
> Only keep the first 256 logits from the teacher
> Ablation on the teacher gap (tl;dr you need some "patience" to see that using a small teacher is better)
> On-policy distillation yeahh (by @agarwl_ et al), not sure if the teacher gap behaves the same here, curious if someone has more info?
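An illustrative take (assumed details, not the report's exact recipe) on keeping only the teacher's top 256 logits: renormalize both teacher and student over those 256 vocabulary ids and compute a KL loss there.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def topk_distill_loss(teacher_logits, student_logits, k=256):
    """KL(teacher || student) restricted to the teacher's top-k token ids
    (illustrative sketch of truncated-logit distillation)."""
    idx = np.argsort(teacher_logits, axis=-1)[..., -k:]   # top-k vocab ids
    t = softmax(np.take_along_axis(teacher_logits, idx, axis=-1))
    s = softmax(np.take_along_axis(student_logits, idx, axis=-1))
    return float(np.sum(t * (np.log(t) - np.log(s))))
```

Storing only 256 values per position instead of the full vocabulary makes caching teacher outputs far cheaper.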

4) Others
> Checkpoints with QAT, that's very cool
> RL using an improved version of BOND; WARM/WARP are a good excuse to look at @ramealexandre's papers
> Only uses ZeRO-3, no TP/PP if I understand correctly?
> Training budget relatively similar to Gemma2
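QAT (quantization-aware training) simulates quantization in the forward pass while keeping full-precision weights for updates; a toy sketch of the fake-quantization step (illustrative only, not Gemma's scheme):

```python
import numpy as np

def fake_quant(w, bits=4):
    """Round weights onto a symmetric low-bit grid in the forward pass;
    real QAT uses a straight-through estimator for the backward pass."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 7 levels each side for 4-bit
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale
```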
upvoted an article 3 days ago
upvoted an article 5 days ago

Introducing EuroBERT: A High-Performance Multilingual Encoder Model

By EuroBERT and 3 others
upvoted an article 8 days ago

LLM Inference on Edge: A Fun and Easy Guide to run LLMs via React Native on your Phone!

New activity in hf-audio/open_asr_leaderboard 11 days ago

Hey!
You're on the right track! The output you're seeing is still tokenized with special markers. To convert this into natural language text, you need to use your tokenizer's decode function.
You can find an example here: https://huggingface.co/HKUSTAudio/Llasa-1B
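In the transformers library that decode step is just `tokenizer.decode(ids, skip_special_tokens=True)`; here is a toy sketch of what it does under the hood (hypothetical vocabulary, for illustration):

```python
# Toy vocabulary standing in for a real tokenizer (hypothetical ids/tokens).
vocab = {0: "<s>", 1: "Hello", 2: "world", 3: "</s>"}
special_tokens = {"<s>", "</s>"}

def decode(ids, skip_special_tokens=True):
    """Map token ids back to strings, optionally dropping special markers."""
    tokens = [vocab[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in special_tokens]
    return " ".join(tokens)
```

With a real model you would load the matching tokenizer via `AutoTokenizer.from_pretrained(...)` and call its `decode` directly.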