EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees
Abstract
An ideal model evaluation should achieve two goals: identifying where the model fails and providing actionable improvement guidance. Toward these goals for Language Model (LM) evaluations, we formulate the problem of generating a weakness profile, a set of weaknesses expressed in natural language, given an LM's performance on every individual instance in a benchmark. We introduce a suite of quantitative assessments to compare different weakness profiling methods. We also propose EvalTree, a weakness profiling method. It constructs a capability tree where each node represents a capability described in natural language and is linked to a subset of benchmark instances that specifically evaluate this capability; it then extracts nodes where the LM performs poorly to generate a weakness profile. On the MATH and WildChat benchmarks, we show that EvalTree outperforms baseline weakness profiling methods by identifying weaknesses more precisely and comprehensively. Weakness profiling further enables weakness-guided data collection: collecting training data guided by EvalTree-identified weaknesses improves LM performance more than other data collection strategies. We also show how EvalTree exposes flaws in Chatbot Arena's human-voter-based evaluation practice. To facilitate future work, we release our code and an interface that allows practitioners to interactively explore the capability trees built by EvalTree.
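To make the node-extraction step concrete, the sketch below shows one plausible way to walk a capability tree and report the most general capabilities on which an LM's accuracy falls below a threshold. This is a minimal illustration, not the paper's implementation: the `CapabilityNode` structure, the fixed accuracy threshold, and all names here are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class CapabilityNode:
    """A capability described in natural language, linked to the
    benchmark instances that specifically evaluate it. (Hypothetical
    structure, for illustration only.)"""
    capability: str
    instance_ids: list[int]  # indices into the benchmark
    children: list["CapabilityNode"] = field(default_factory=list)

def accuracy(node: CapabilityNode, correct: dict[int, bool]) -> float:
    """Fraction of the node's linked instances the LM answered correctly."""
    return sum(correct[i] for i in node.instance_ids) / len(node.instance_ids)

def extract_weaknesses(node: CapabilityNode, correct: dict[int, bool],
                       threshold: float = 0.5) -> list[str]:
    """Return the most general nodes whose accuracy is below the
    threshold; otherwise recurse into the children."""
    if accuracy(node, correct) < threshold:
        return [node.capability]  # report this whole subtree as one weakness
    weaknesses: list[str] = []
    for child in node.children:
        weaknesses.extend(extract_weaknesses(child, correct, threshold))
    return weaknesses

# Toy usage: a two-level tree over a six-instance benchmark.
leaf_a = CapabilityNode("applying modular arithmetic", [0, 1, 2])
leaf_b = CapabilityNode("solving linear equations", [3, 4, 5])
root = CapabilityNode("mathematical reasoning", list(range(6)), [leaf_a, leaf_b])
per_instance = {0: False, 1: False, 2: True, 3: True, 4: True, 5: True}
print(extract_weaknesses(root, per_instance))  # ['applying modular arithmetic']
```

Stopping at the first node below the threshold keeps the profile concise: a weak subtree is reported once, at the most general capability level, rather than as many overlapping leaf-level weaknesses.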
Community
The following papers, similar to this one, were recommended by the Semantic Scholar API:
- An Empirical Analysis of Uncertainty in Large Language Model Evaluations (2025)
- GenTool: Enhancing Tool Generalization in Language Models through Zero-to-One and Weak-to-Strong Simulation (2025)
- LLM-Powered Benchmark Factory: Reliable, Generic, and Efficient (2025)
- Adaptively evaluating models with task elicitation (2025)
- Towards Reasoning Ability of Small Language Models (2025)
- AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models (2025)
- Revitalizing Saturated Benchmarks: A Weighted Metric Approach for Differentiating Large Language Model Performance (2025)