
Tony Zhao

tianchez

https://www.tianchez.com

AI & ML interests

Multimodal Agent, Generative AI

Recent Activity

upvoted a collection 8 days ago

VLM-R1-models

new activity 13 days ago

omlab/VLM-R1-Referral-Expression: Apply for community grant: Personal project (gpu)

replied to their post 18 days ago

Introducing VLM-R1! GRPO helped DeepSeek R1 learn to reason. Can it also help VLMs perform better on general computer vision tasks? The answer is YES, and it generalizes better than SFT. We trained Qwen 2.5 VL 3B on RefCOCO (a visual grounding task) and evaluated it on RefCOCO Val and RefGTA (an OOD task). https://github.com/om-ai-lab/VLM-R1



tianchez's activity

upvoted a collection 8 days ago

VLM-R1-models

Collection

A collection of VLM-R1 Models • 6 items • Updated 8 days ago • 2

New activity in omlab/VLM-R1-Referral-Expression 13 days ago

Apply for community grant: Personal project (gpu)

#3 opened 13 days ago by tianchez

replied to their post 18 days ago

looks very cool!

reacted to their post with 👍 19 days ago

New activity in omlab/VLM-R1-Referral-Expression 21 days ago

Fixes 500 error for some users

#1 opened 23 days ago by Tonic

reacted to their post with ❤️🔥 24 days ago

liked a Space 25 days ago

VLM R1 Referral Expression

💬

Highlight described objects in images

published a Space 26 days ago

VLM R1 Referral Expression

💬

Highlight described objects in images

reacted to merve's post with 👍 27 days ago

Post

4723

Your weekly recap of open AI is here, and it's packed with models! merve/feb-14-releases-67af876b404cc27c6d837767

👀 Multimodal
> OpenGVLab released InternVideo 2.5 Chat models, new video LMs with long context
> AIDC released Ovis2 model family along with Ovis dataset, new vision LMs in different sizes (1B, 2B, 4B, 8B, 16B, 34B), with video and OCR support
> ColQwenStella-2b is a multilingual visual retrieval model that is SOTA for its size
> Hoags-2B-Exp is a new multilingual vision LM with contextual reasoning and long-context video understanding

💬 LLMs
A lot of math models!
> The Open-R1 team released OpenR1-Math-220k, a large-scale math reasoning dataset, along with OpenR1-Qwen-7B, a Qwen2.5 math model fine-tuned on the dataset
> Nomic AI released a new Nomic Embed multilingual retrieval model, an MoE with 500M params and 305M active params, outperforming other models
> DeepScaleR-1.5B-Preview is a new DeepSeek-R1-Distill fine-tune using distributed RL on math
> LIMO is a new fine-tune of Qwen2.5-32B-Instruct on math

🗣️ Audio
> Zonos-v0.1 is a new family of text-to-speech models, which contains the model itself and embeddings

🖼️ Vision and Image Generation
> We have ported Apple's DepthPro to transformers for your convenience!
> illustrious-xl-v1.0 is a new illustration generation model

3 replies

reacted to their post with 🚀 27 days ago

posted an update 27 days ago

Post

4122

Introducing VLM-R1!

GRPO helped DeepSeek R1 learn to reason. Can it also help VLMs perform better on general computer vision tasks?

The answer is YES, and it generalizes better than SFT. We trained Qwen 2.5 VL 3B on RefCOCO (a visual grounding task) and evaluated it on RefCOCO Val and RefGTA (an OOD task).

https://github.com/om-ai-lab/VLM-R1

3 replies

upvoted an article about 1 month ago

Article • Replicating DeepSeek R1 for Information Extraction • Jan 31 • 38

authored 7 papers about 2 months ago

SPARTA: Efficient Open-Domain Question Answering via Sparse Transformer Matching Retrieval

Paper • 2009.13013 • Published Sep 28, 2020

How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection

Paper • 2308.13177 • Published Aug 25, 2023

Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head

Paper • 2403.06892 • Published Mar 11, 2024 • 1

OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding

Paper • 2407.04923 • Published Jul 6, 2024 • 1

OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer

Paper • 2406.16620 • Published Jun 24, 2024 • 2

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration

Paper • 2411.16044 • Published Nov 25, 2024 • 1

OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network

Paper • 2209.05946 • Published Sep 10, 2022 • 1