ahmed-masry (Ahmed Masry)

Happy to announce AlignVLM 📏 – a novel approach to bridging vision and language latent spaces for multimodal understanding in Vision-Language Models (VLMs) 🌍📄🖼

🔗 Read the paper: AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding (2502.01341)

🧐 What’s the challenge?
Aligning visual features with language embeddings remains a major bottleneck in VLMs. Existing connectors such as multi-layer perceptrons (MLPs) often introduce noise that degrades performance. ❌

🎯 Our Solution: ALIGN Connector
We propose AlignVLM, a method that maps each vision feature to a weighted average of the LLM's text embeddings, so the result stays in a space that the LLM can effectively interpret. ✅
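
To make the "weighted average of LLM text embeddings" idea concrete, here is a minimal NumPy sketch. It is an illustration under assumptions, not the released implementation: the function name `align_connector`, the single linear projection `w_proj`, the toy dimensions, and the random weights are all hypothetical stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def align_connector(vision_feats, w_proj, text_embeddings):
    """Hypothetical sketch: map vision features to convex combinations
    of text embeddings.

    vision_feats:    (n_patches, d_vision)
    w_proj:          (d_vision, vocab_size)  assumed single linear layer
    text_embeddings: (vocab_size, d_llm)     the LLM's embedding table
    """
    logits = vision_feats @ w_proj       # (n_patches, vocab_size)
    weights = softmax(logits, axis=-1)   # each row is non-negative and sums to 1
    return weights @ text_embeddings, weights

# Toy sizes and random weights, purely for illustration.
rng = np.random.default_rng(0)
n_patches, d_vision, vocab_size, d_llm = 4, 16, 32, 8
vision_feats = rng.normal(size=(n_patches, d_vision))
w_proj = rng.normal(size=(d_vision, vocab_size)) * 0.1
text_embeddings = rng.normal(size=(vocab_size, d_llm))

out, weights = align_connector(vision_feats, w_proj, text_embeddings)
print(out.shape)  # (4, 8)
```

Because the softmax weights are non-negative and sum to 1, every output row is a convex combination of rows of `text_embeddings`, which is one way to read the claim that the vision features remain in a space the LLM can interpret.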

🔬 How does it perform?
We compared ALIGN against common connectors like MLPs, Perceiver Resampler, and Ovis trained under similar configurations. The results? ALIGN outperforms them all 🏆 on diverse document understanding tasks 📄.

📊 Meet the AlignVLM Model Family!
We trained Llama 3.2 (1B, 3B) and Llama 3.1 (8B) using our connector and benchmarked them against various models. The results:
✅ AlignVLM surpasses all Base VLMs trained under similar configurations.
✅ Our models also perform competitively against Instruct VLMs such as Qwen2-VL and InternVL-2.5 🚀.

🤔 What about robustness to noise?
We injected Gaussian noise (μ=0, σ=3) into the vision encoder’s outputs before feeding them to the connector:
✅ ALIGN Connector: Minimal drop (↓1.67%) – indicating high robustness!
❌ MLP Connector: Severe degradation (↓25.54%) – struggling with noisy inputs.
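
The perturbation described above can be sketched in a few lines of NumPy. This is only an assumed setup for illustration: the helper name `add_gaussian_noise`, the seed, and the placeholder feature shape are hypothetical, and the snippet does not reproduce the reported numbers.

```python
import numpy as np

def add_gaussian_noise(features, sigma=3.0, mean=0.0, seed=0):
    # Add i.i.d. Gaussian noise (mean μ, standard deviation σ) to every
    # element of the vision encoder's output.
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=mean, scale=sigma, size=features.shape)
    return features + noise

# Placeholder "vision encoder outputs": 1000 patches with 64 dimensions each.
features = np.zeros((1000, 64))
noisy = add_gaussian_noise(features, sigma=3.0)

# The noisy features would then be passed to the connector under test
# (ALIGN or MLP) in place of the clean ones.
print(noisy.shape)
```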

Code & model weights coming soon! Stay tuned! 🔥

posted an update about 1 month ago

authored a paper about 1 month ago

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding

Paper • 2502.01341 • Published Feb 3 • 36

upvoted a paper about 1 month ago

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding

Paper • 2502.01341 • Published Feb 3 • 36

commented on a paper about 1 month ago

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding

Paper • 2502.01341 • Published Feb 3 • 36 • 2

liked a dataset 2 months ago

MAmmoTH-VL/MAmmoTH-VL-Instruct-12M

Viewer • Updated Jan 5 • 37M • 5.23k • 46

New activity in ahmed-masry/unichart-qa-data 2 months ago

Dataset Source?

4

#3 opened 2 months ago by veason

liked a dataset 3 months ago

ServiceNow/BigDocs-Bench

Viewer • Updated Feb 6 • 598k • 2.56k • 13

authored a paper 3 months ago

BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks

Paper • 2412.04626 • Published Dec 5, 2024 • 14

upvoted a paper 3 months ago

BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks

Paper • 2412.04626 • Published Dec 5, 2024 • 14

New activity in ahmed-masry/ColFlor 4 months ago

Update README.md

1

#1 opened 4 months ago by omkar334
