Evaluating VLMs for generating temporal value functions for robotics tasks
Reliable task completion prediction enables self-supervised deployment, allowing robots to adapt and improve autonomously.
Generative Value Learning (GVL) was recently introduced, leveraging the knowledge embedded in vision-language models (VLMs) to predict a proxy for the true value function.
Building on the GVL framework, we introduce a leaderboard that benchmarks both open- and closed-source VLMs for robotics tasks and datasets.
❗ The goal of the leaderboard is not the highest score across all datasets, but the score closest to the (unobservable) ground truth.
To prevent contamination, we define three hidden tasks (two involving robotic actions and one involving human demonstration videos), which can be evaluated upon request. These datasets are highly curated, with a 100% completion rate.
We will release the accompanying code and specialized models in the coming days.
We welcome pull requests for evaluating new models across all datasets.
Explore how models predict task progress over time. Each episode shows a robotics task with the model's predicted value function progression compared to the ground truth temporal ordering.
Leaderboard table (columns: #, Model, Overall Score broken down into Zero-Context VOC and 2-Episode-Context VOC, and Shuffled); the data loads dynamically on the live page.
Following the original GVL work, this metric, the Value-Order Correlation (VOC), computes the rank correlation between predicted values and the chronological order of the input expert video:
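As a compact statement (a sketch, assuming Spearman's \( \rho \) as the rank statistic): for an episode of \( T \) chronologically ordered frames with predicted values \( \hat{v}_1, \dots, \hat{v}_T \),

\[
\mathrm{VOC} = \rho\big( (\hat{v}_1, \dots, \hat{v}_T),\; (1, 2, \dots, T) \big),
\]

so \( \mathrm{VOC} = 1 \) means the predicted values increase monotonically with true task progress, while values near \( 0 \) indicate no relationship to it.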
We model robotics tasks as goal-conditioned partially observed Markov decision processes.
The temporal value function \( V : \mathcal{O} \times \mathcal{G} \to [0, 1] \) maps observations and goal specifications to progress values, where initial observations have value \( 0 \) and goal-satisfying observations have value \( 1 \).
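As a concrete illustration, here is a minimal sketch of how a VOC score could be computed, assuming Spearman's rank correlation via SciPy; `predict_values` is a hypothetical stand-in for the actual VLM query (not part of the released codebase), and the frame shuffling mirrors GVL's recipe of hiding temporal order from the model:

```python
# Minimal VOC sketch. Assumptions: Spearman's rank correlation as the rank
# statistic, and a hypothetical `predict_values(frames, goal)` callable
# standing in for the actual VLM query.
import random
from typing import Callable, List, Sequence

from scipy.stats import spearmanr


def value_order_correlation(
    frames: Sequence,
    goal: str,
    predict_values: Callable[[Sequence, str], List[float]],
    seed: int = 0,
) -> float:
    """Rank correlation between predicted values and chronological order.

    Frames are shuffled before the model call (as in GVL), so the VLM
    cannot read task progress off the input ordering.
    """
    order = list(range(len(frames)))
    random.Random(seed).shuffle(order)
    shuffled_frames = [frames[i] for i in order]

    # One predicted progress value in [0, 1] per shuffled frame.
    preds = predict_values(shuffled_frames, goal)

    # Map predictions back to chronological positions before correlating.
    chronological = [0.0] * len(frames)
    for value, original_index in zip(preds, order):
        chronological[original_index] = value

    voc, _p_value = spearmanr(chronological, range(len(frames)))
    return voc
```

Shuffling the frames before the model call forces the VLM to estimate progress from visual content rather than from input position, which is central to GVL's evaluation setup.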