VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity
Abstract
Visual reasoning is central to human cognition, enabling individuals to interpret and abstractly understand their environment. Although recent Multimodal Large Language Models (MLLMs) have demonstrated impressive performance across language and vision-language tasks, existing benchmarks primarily measure recognition-based skills and inadequately assess true visual reasoning capabilities. To bridge this critical gap, we introduce VERIFY, a benchmark explicitly designed to isolate and rigorously evaluate the visual reasoning capabilities of state-of-the-art MLLMs. VERIFY compels models to reason primarily from visual information, providing minimal textual context to reduce reliance on domain-specific knowledge and linguistic biases. Each problem is accompanied by a human-annotated reasoning path, making VERIFY the first benchmark to provide an in-depth evaluation of models' decision-making processes. Additionally, we propose novel metrics that assess visual reasoning fidelity beyond mere accuracy, highlighting critical imbalances in current model reasoning patterns. Our comprehensive benchmarking of leading MLLMs uncovers significant limitations, underscoring the need for a balanced and holistic approach to both perception and reasoning. For a preview and to try the benchmark yourself, visit our project page (https://verify-eqh.pages.dev/).
Community
Visual reasoning remains a significant challenge for today’s MLLMs.
We present VERIFY, a benchmark featuring a set of carefully crafted, complex visual questions designed to assess the limits of these models.
- 🧠 Can you solve it? Test your reasoning skills or see how top models tackle the challenge! 🎯 Try it now
- 📊 Discover the insights. Explore our full results and see how different models stack up against our benchmark, revealing their strengths and weaknesses. Check out the details here: 🔗 Project Page
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization (2025)
- DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding (2025)
- Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models (2025)
- Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning (2025)
- MMSciBench: Benchmarking Language Models on Multimodal Scientific Problems (2025)
- Visual Reasoning Evaluation of Grok, Deepseek Janus, Gemini, Qwen, Mistral, and ChatGPT (2025)
- Cognitive Paradigms for Evaluating VLMs on Visual Reasoning Task (2025)