Task Completion Leaderboard
Evaluating Vision Language Models for generating temporal value functions for robotics tasks
(Plot: Predicted Value Function, Gemma-3-27b-it.)
Leaderboard
| Rank | Model | Type | Size | Overall Score | Zero Context VOC | 2-Episode Context VOC |
|---|---|---|---|---|---|---|
What problem do we solve?
Robots learn and decide better when they have a sense of progress. Generative Value Learning (GVL) turns a general vision–language model (VLM) into a progress estimator for demonstration videos, so that earlier frames read as “less done” and later frames as “more done”.
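As a rough illustration, here is a minimal sketch of GVL-style progress estimation. The `query_vlm` helper and the `context_episodes` parameter are hypothetical stand-ins, not part of our released code; shuffling the frames follows GVL's approach of breaking temporal order so the model must judge progress from frame content rather than position.

```python
import random

def predict_progress(frames, query_vlm, context_episodes=()):
    """GVL-style progress estimation (sketch).

    `query_vlm` is a hypothetical helper that sends a text prompt plus
    images to a VLM and returns one completion percentage per frame.
    `context_episodes` holds optional in-context demonstrations
    (e.g. two episodes for the 2-episode-context setting).
    """
    # Shuffle frames so the model must judge progress from content,
    # not from the order in which frames are presented.
    order = list(range(len(frames)))
    random.shuffle(order)
    shuffled = [frames[i] for i in order]

    prompt = (
        "For each frame, estimate task completion as a percentage "
        "from 0 (not started) to 100 (fully complete)."
    )
    raw = query_vlm(prompt, images=shuffled, context=context_episodes)

    # Map predictions back to the original temporal order.
    progress = [0.0] * len(frames)
    for shuffled_pos, original_idx in enumerate(order):
        progress[original_idx] = raw[shuffled_pos]
    return progress
```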
Datasets & contamination control
To prevent contamination, we define two hidden tasks involving robotic actions and human demonstration videos, evaluated only on request. Every demonstration in these curated datasets ends in successful task completion (a 100% completion rate).
Evaluation & scoring
Building on the GVL framework, we introduce a leaderboard benchmarking open- and closed-source VLMs across robotics tasks, in both zero-shot (zero-context) and few-shot (2-episode context) settings, to test generalization.
We report Value-Order Correlation (VOC): Spearman rank correlation between the model’s progress ordering and the video’s true time order (+1 perfect, 0 random, −1 reversed).
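Concretely, VOC can be computed with SciPy's `spearmanr`. This small sketch (an illustration, not our exact evaluation code) correlates the predicted per-frame progress with the true frame indices:

```python
from scipy.stats import spearmanr

def value_order_correlation(predicted_progress):
    """Spearman rank correlation between predicted progress values and
    the true temporal order of the frames (+1 perfect, 0 random, -1 reversed)."""
    true_order = range(len(predicted_progress))
    voc, _p_value = spearmanr(predicted_progress, true_order)
    return voc

# A mostly monotone prediction scores close to +1:
print(value_order_correlation([5, 10, 8, 40, 60, 95]))  # ≈ 0.94
```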
Important note
We aim for alignment with (unobservable) ground-truth task progress, not merely the highest aggregate score.
Contributions
Code and specialized models will be released soon. Contributions (PRs) that add models and datasets are welcome.
Have your own VLM or dataset?
Join the research community. We'll evaluate your model on our benchmarks or integrate your dataset into our evaluation framework.