CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era
Abstract
Image captioning has been a longstanding challenge in vision-language research. With the rise of LLMs, modern Vision-Language Models (VLMs) generate detailed and comprehensive image descriptions. However, how to benchmark the quality of such captions remains an open problem. This paper addresses two key questions: (1) How well do current VLMs actually perform on image captioning, particularly compared to humans? We built CapArena, a platform with over 6,000 pairwise caption battles and high-quality human preference votes. Our arena-style evaluation marks a milestone: leading models like GPT-4o match or even surpass human performance, while most open-source models lag behind. (2) Can automated metrics reliably assess detailed caption quality? Using the human annotations from CapArena, we evaluate traditional and recent captioning metrics, as well as VLM-as-a-Judge. Our analysis reveals that while some metrics (e.g., METEOR) show decent caption-level agreement with humans, their systematic biases lead to inconsistent model rankings. In contrast, VLM-as-a-Judge demonstrates robust discernment at both the caption and model levels. Building on these insights, we release CapArena-Auto, an accurate and efficient automated benchmark for detailed captioning that achieves 94.3% correlation with human rankings at just $4 per test. Data and resources will be open-sourced at https://caparena.github.io.
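The arena-style evaluation described in the abstract ranks models from pairwise battle outcomes. As a minimal sketch of how such pairwise votes can be turned into a leaderboard, the snippet below applies a standard Elo-style rating update; the battle records, K-factor, and model names are illustrative assumptions, not CapArena's actual data or protocol.

```python
# Minimal Elo-style ranking from pairwise caption battles.
# Battle records, K-factor, and base rating are illustrative
# assumptions, not CapArena's actual data or protocol.
from collections import defaultdict

def elo_rankings(battles, k=32, base=1000.0):
    """battles: list of (winner, loser) model-name pairs, in vote order."""
    ratings = defaultdict(lambda: base)
    for winner, loser in battles:
        # Expected win probability for the winner under the Elo model.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        ratings[winner] += k * (1.0 - expected)
        ratings[loser] -= k * (1.0 - expected)
    return sorted(ratings.items(), key=lambda kv: -kv[1])

# Hypothetical battle outcomes between three captioning models.
battles = [("GPT-4o", "Model-A"), ("GPT-4o", "Model-B"), ("Model-A", "Model-B")]
for model, rating in elo_rankings(battles):
    print(f"{model}: {rating:.1f}")
```

With these toy votes, the undefeated model ends up at the top of the leaderboard; in practice, arena platforms aggregate thousands of such votes so the ranking is insensitive to battle order.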
Community
- The first large-scale, human-centered empirical study benchmarking advanced VLMs' performance on detailed captioning.
- In-depth analysis of existing captioning metrics and VLM-as-a-Judge.
- A new efficient automated evaluation benchmark for detailed captioning.
Librarian Bot (automated): the following similar papers were recommended by the Semantic Scholar API.
- Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning (2025)
- VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation (2025)
- What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Coverage of MLLMs (2025)
- Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption (2025)
- Image Embedding Sampling Method for Diverse Captioning (2025)
- LongCaptioning: Unlocking the Power of Long Video Caption Generation in Large Multimodal Models (2025)
- VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models (2025)