gg-hf (gg-hf)

posted an update 1 day ago

Post

1094

Gemma3 family is out! Reading the tech report, and this section was really interesting to me from a methods/scientific fairness pov.

Instead of doing over-hyped comparisons, they clearly state that **results are reported in a setup which is advantageous to their models**.
(Which everybody does, but people usually don't say)

For a tech report, it makes a lot of sense to report model performance when used optimally!
On leaderboards on the other hand, comparison will be apples to apples, but in a potentially unoptimal way for a given model family (like some user interact sub-optimally with models)

Also contains a cool section (6) on training data memorization rate too! Important to see if your model will output the training data it has seen as such: always an issue for privacy/copyright/... but also very much for evaluation!

Because if your model knows its evals by heart, you're not testing for generalization.

alvarobartt

posted an update 16 days ago

Post

2837

🔥 Agents can do anything! @microsoft Research just announced the release of Magma 8B!

Magma is a new Visual Language Model (VLM) with 8B parameters for multi-modal agents designed to handle complex interactions across virtual and real environments; and it's MIT licensed!

Magma comes with exciting new features such as:
- Introduces the Set-of-Mark and Trace-of-Mark techniques for fine-tuning
- Leverages a large amount of unlabeled video data to learn the spatial-temporal grounding and planning
- A strong generalization and ability to be fine-tuned for other agentic tasks
- SOTA in different multi-modal benchmarks spanning across UI navigation, robotics manipulation, image / video understanding and spatial understanding and reasoning
- Generates goal-driven visual plans and actions for agentic use cases

Model: microsoft/Magma-8B
Technical Report: Magma: A Foundation Model for Multimodal AI Agents (2502.13130)

giffmana

authored a paper 20 days ago

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Paper • 2502.14786 • Published 21 days ago • 129

Xenova

posted an update about 1 month ago

Post

9430

We did it. Kokoro TTS (v1.0) can now run 100% locally in your browser w/ WebGPU acceleration. Real-time text-to-speech without a server. ⚡️

Generate 10 seconds of speech in ~1 second for $0.

What will you build? 🔥
webml-community/kokoro-webgpu

The most difficult part was getting the model running in the first place, but the next steps are simple:
✂️ Implement sentence splitting, allowing for streamed responses
🌍 Multilingual support (only phonemization left)

Who wants to help?

9 replies

·

clefourrier

authored a paper about 1 month ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 203

reach-vb

authored a paper about 1 month ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 203

Xenova

authored a paper about 1 month ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 203

lvwerra

authored a paper about 1 month ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 203

Xenova

posted an update about 2 months ago

Post

6443

Introducing Kokoro.js, a new JavaScript library for running Kokoro TTS, an 82 million parameter text-to-speech model, 100% locally in the browser w/ WASM. Powered by 🤗 Transformers.js. WebGPU support coming soon!
👉 npm i kokoro-js 👈

Try it out yourself: webml-community/kokoro-web
Link to models/samples: onnx-community/Kokoro-82M-ONNX

You can get started in just a few lines of code!

import { KokoroTTS } from "kokoro-js";

const tts = await KokoroTTS.from_pretrained(
  "onnx-community/Kokoro-82M-ONNX",
  { dtype: "q8" }, // fp32, fp16, q8, q4, q4f16
);

const text = "Life is like a box of chocolates. You never know what you're gonna get.";
const audio = await tts.generate(text,
  { voice: "af_sky" }, // See `tts.list_voices()`
);
audio.save("audio.wav");

Huge kudos to the Kokoro TTS community, especially taylorchu for the ONNX exports and Hexgrad for the amazing project! None of this would be possible without you all! 🤗

The model is also extremely resilient to quantization. The smallest variant is only 86 MB in size (down from the original 326 MB), with no noticeable difference in audio quality! 🤯

10 replies

·

lvwerra

authored a paper about 2 months ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published Jan 14 • 56

Xenova

posted an update 2 months ago

Post

8355

First project of 2025: Vision Transformer Explorer

I built a web app to interactively explore the self-attention maps produced by ViTs. This explains what the model is focusing on when making predictions, and provides insights into its inner workings! 🤯

Try it out yourself! 👇
webml-community/attention-visualization

Source code: https://github.com/huggingface/transformers.js-examples/tree/main/attention-visualization

akhaliq

posted an update 3 months ago

Post

13209

Google drops Gemini 2.0 Flash Thinking

a new experimental model that unlocks stronger reasoning capabilities and shows its thoughts. The model plans (with thoughts visible), can solve complex problems with Flash speeds, and more

now available in anychat, try it out: akhaliq/anychat

3 replies

·

Xenova

posted an update 3 months ago

Post

4249

Introducing Moonshine Web: real-time speech recognition running 100% locally in your browser!
🚀 Faster and more accurate than Whisper
🔒 Privacy-focused (no data leaves your device)
⚡️ WebGPU accelerated (w/ WASM fallback)
🔥 Powered by ONNX Runtime Web and Transformers.js

Demo: webml-community/moonshine-web
Source code: https://github.com/huggingface/transformers.js-examples/tree/main/moonshine-web

5 replies

·

Narsil

posted an update 3 months ago

Post

1414

Performance leap: TGI v3 is out. Processes 3x more tokens, 13x faster than vLLM on long prompts. Zero config !

3x more tokens.

By reducing our memory footprint, we’re able to ingest many more tokens and more dynamically than before. A single L4 (24GB) can handle 30k tokens on llama 3.1-8B, while vLLM gets barely 10k. A lot of work went into reducing the footprint of the runtime and its effect are best seen on smaller constrained environments.
13x faster

On long prompts (200k+ tokens) conversation replies take 27.5s in vLLM, while it takes only 2s in TGI. How so ? We keep the initial conversation around, so when a new reply comes in, we can answer almost instantly. The overhead of the lookup is ~5us. Thanks @Dani ël de Kok for the beast data structure.
Zero config

That’s it. Remove all the flags your are using and you’re likely to get the best performance. By evaluating the hardware and model, TGI carefully selects automatic values to give best performance. In production, we don’t have any flags anymore in our deployments. We kept all existing flags around, they may come in handy in niche scenarios.

Read more: https://huggingface.co/docs/text-generation-inference/conceptual/chunking

Xenova

posted an update 3 months ago

Post

3287

Introducing TTS WebGPU: The first ever text-to-speech web app built with WebGPU acceleration! 🔥 High-quality and natural speech generation that runs 100% locally in your browser, powered by OuteTTS and Transformers.js. 🤗 Try it out yourself!

Demo: webml-community/text-to-speech-webgpu
Source code: https://github.com/huggingface/transformers.js-examples/tree/main/text-to-speech-webgpu
Model: onnx-community/OuteTTS-0.2-500M (ONNX), OuteAI/OuteTTS-0.2-500M (PyTorch)

reach-vb

posted an update 3 months ago

Post

5601

VLMs are going through quite an open revolution AND on-device friendly sizes:

1. Google DeepMind w/ PaliGemma2 - 3B, 10B & 28B: google/paligemma-2-release-67500e1e1dbfdd4dee27ba48

2. OpenGVLabs w/ InternVL 2.5 - 1B, 2B, 4B, 8B, 26B, 38B & 78B: https://huggingface.co/collections/OpenGVLab/internvl-25-673e1019b66e2218f68d7c1c

3. Qwen w/ Qwen 2 VL - 2B, 7B & 72B: Qwen/qwen2-vl-66cee7455501d7126940800d

4. Microsoft w/ FlorenceVL - 3B & 8B: https://huggingface.co/jiuhai

5. Moondream2 w/ 0.5B: https://huggingface.co/vikhyatk/

What a time to be alive! 🔥

clefourrier

authored a paper 3 months ago

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

Paper • 2412.03304 • Published Dec 4, 2024 • 18

giffmana

authored a paper 3 months ago

PaliGemma 2: A Family of Versatile VLMs for Transfer

Paper • 2412.03555 • Published Dec 4, 2024 • 129

Xenova

posted an update 4 months ago

Post

4120

We just released Transformers.js v3.1 and you're not going to believe what's now possible in the browser w/ WebGPU! 🤯 Let's take a look:
🔀 Janus from Deepseek for unified multimodal understanding and generation (Text-to-Image and Image-Text-to-Text)
👁️ Qwen2-VL from Qwen for dynamic-resolution image understanding
🔢 JinaCLIP from Jina AI for general-purpose multilingual multimodal embeddings
🌋 LLaVA-OneVision from ByteDance for Image-Text-to-Text generation
🤸‍♀️ ViTPose for pose estimation
📄 MGP-STR for optical character recognition (OCR)
📈 PatchTST & PatchTSMixer for time series forecasting

That's right, everything running 100% locally in your browser (no data sent to a server)! 🔥 Huge for privacy!

Check out the release notes for more information. 👇
https://github.com/huggingface/transformers.js/releases/tag/3.1.0

Demo link (+ source code): webml-community/Janus-1.3B-WebGPU

akhaliq

posted an update 4 months ago

Post

13718

QwQ-32B-Preview is now available in anychat

A reasoning model that is competitive with OpenAI o1-mini and o1-preview

try it out: akhaliq/anychat

1 reply

·

gg-hf

AI & ML interests

Recent Activity

gg-hf's activity

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Towards Best Practices for Open Datasets for LLM Training

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

PaliGemma 2: A Family of Versatile VLMs for Transfer

AI & ML interests

Recent Activity

Team members 117

gg-hf's activity