ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering
Abstract
Precisely evaluating semantic alignment between text prompts and generated videos remains a challenge in Text-to-Video (T2V) generation. Existing text-to-video alignment metrics like CLIPScore produce only coarse-grained scores without fine-grained alignment details, and thus fail to align with human preferences. To address this limitation, we propose ETVA, a novel method for Evaluating Text-to-Video Alignment via fine-grained question generation and answering. First, a multi-agent system parses prompts into semantic scene graphs to generate atomic questions. Then we design a knowledge-augmented multi-stage reasoning framework for question answering, where an auxiliary LLM first retrieves relevant common-sense knowledge (e.g., physical laws), and a video LLM then answers the generated questions through a multi-stage reasoning mechanism. Extensive experiments demonstrate that ETVA achieves a Spearman's correlation coefficient of 58.47 with human judgment, much higher than existing metrics, which attain only 31.0. We also construct a comprehensive benchmark specifically designed for text-to-video alignment evaluation, featuring 2k diverse prompts and 12k atomic questions spanning 10 categories. Through a systematic evaluation of 15 existing text-to-video models, we identify their key capabilities and limitations, paving the way for next-generation T2V models.
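Since the abstract only outlines the pipeline, the sketch below shows one plausible way such a question-generation-and-answering scorer could be wired together. This is a minimal, hypothetical illustration: the function names, the callable-based interface, and the stub models are assumptions for readability, not the authors' published implementation.

```python
from typing import Callable, List

# Hypothetical sketch of an ETVA-style scoring loop. The two model stages
# (multi-agent question generation, knowledge-augmented video QA) are passed
# in as callables; all names here are illustrative assumptions.

def etva_score(
    prompt: str,
    video_path: str,
    generate_questions: Callable[[str], List[str]],   # multi-agent QG stage
    retrieve_knowledge: Callable[[str], str],         # auxiliary LLM retrieval
    answer_question: Callable[[str, str, str], bool], # video-LLM QA stage
) -> float:
    """Score a video as the fraction of atomic yes/no questions answered 'yes'."""
    questions = generate_questions(prompt)            # e.g., derived from a scene graph
    if not questions:
        return 0.0
    correct = 0
    for q in questions:
        context = retrieve_knowledge(q)               # common-sense facts, e.g., physics
        if answer_question(video_path, q, context):   # multi-stage reasoning happens here
            correct += 1
    return correct / len(questions)

# Toy stubs so the sketch runs end to end.
if __name__ == "__main__":
    score = etva_score(
        prompt="a glass of water freezing in slow motion",
        video_path="sample.mp4",
        generate_questions=lambda p: ["Is there a glass?", "Does the water freeze?"],
        retrieve_knowledge=lambda q: "Water freezes at 0 °C.",
        answer_question=lambda video, q, ctx: True,
    )
    print(f"ETVA-style alignment score: {score:.2f}")
```

Framing the score as a per-question accuracy is what makes the metric fine-grained: each atomic question can be inspected individually, unlike a single CLIPScore-style similarity value.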
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- GRADEO: Towards Human-Like Evaluation for Text-to-Video Generation via Multi-Step Reasoning (2025)
- Aligning Vision to Language: Text-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning (2025)
- T2I-FineEval: Fine-Grained Compositional Metric for Text-to-Image Evaluation (2025)
- VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos (2025)
- VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation (2025)
- MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation (2025)
- FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding (2025)