VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity
Abstract
Visual reasoning is central to human cognition, enabling individuals to interpret and abstractly understand their environment. Although recent Multimodal Large Language Models (MLLMs) have demonstrated impressive performance across language and vision-language tasks, existing benchmarks primarily measure recognition-based skills and inadequately assess true visual reasoning capabilities. To bridge this critical gap, we introduce VERIFY, a benchmark explicitly designed to isolate and rigorously evaluate the visual reasoning capabilities of state-of-the-art MLLMs. VERIFY compels models to reason primarily from visual information, providing minimal textual context to reduce reliance on domain-specific knowledge and linguistic biases. Each problem is accompanied by a human-annotated reasoning path, making VERIFY the first benchmark to provide an in-depth evaluation of models' decision-making processes. Additionally, we propose novel metrics that assess visual reasoning fidelity beyond mere accuracy, highlighting critical imbalances in current model reasoning patterns. Our comprehensive benchmarking of leading MLLMs uncovers significant limitations, underscoring the need for a balanced and holistic approach to both perception and reasoning. For a preview and to try the benchmark yourself, visit our project page (https://verify-eqh.pages.dev/).
Community
Visual reasoning remains a significant challenge for today’s MLLMs.
We present VERIFY, a benchmark featuring a set of carefully crafted, complex visual questions designed to assess the limits of these models.
- 🧠 Can you solve it? Test your reasoning skills or see how top models tackle the challenge! 🎯 Try it now
- 📊 Discover the insights. Explore our full results and see how different models stack up against our benchmark, revealing their strengths and weaknesses. Check out the details here: 🔗 Project Page
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization (2025)
- DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding (2025)
- Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models (2025)
- Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning (2025)
- MMSciBench: Benchmarking Language Models on Multimodal Scientific Problems (2025)
- Visual Reasoning Evaluation of Grok, Deepseek Janus, Gemini, Qwen, Mistral, and ChatGPT (2025)
- Cognitive Paradigms for Evaluating VLMs on Visual Reasoning Task (2025)