Task Completion Leaderboard
Evaluating Vision Language Models for generating temporal value functions for robotics tasks
(Plot: Predicted Value Function, Gemma-3-27b-it.)
Leaderboard
| Rank | Model | Type | Size | Overall Score | Zero Context VOC | 2-Episode Context VOC |
|---|---|---|---|---|---|---|
What problem do we solve?
Robots learn and decide better when they have a sense of progress. Generative Value Learning (GVL) turns a general vision–language model (VLM) into a progress estimator for demonstration videos, so that earlier frames read as “less done” and later frames as “more done”.
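As a rough illustration, here is a minimal sketch of GVL-style progress estimation. The `query_vlm` helper and the `context_episodes` parameter are hypothetical stand-ins, not part of our released code; shuffling the frames follows GVL's approach of breaking temporal order so the model must judge progress from frame content rather than position.

```python
import random

def predict_progress(frames, query_vlm, context_episodes=()):
    """GVL-style progress estimation (sketch).

    `query_vlm` is a hypothetical helper that sends a text prompt plus
    images to a VLM and returns one completion percentage per frame.
    `context_episodes` holds optional in-context demonstrations
    (e.g. two episodes for the 2-episode-context setting).
    """
    # Shuffle frames so the model must judge progress from content,
    # not from the order in which frames are presented.
    order = list(range(len(frames)))
    random.shuffle(order)
    shuffled = [frames[i] for i in order]

    prompt = (
        "For each frame, estimate task completion as a percentage "
        "from 0 (not started) to 100 (fully complete)."
    )
    raw = query_vlm(prompt, images=shuffled, context=context_episodes)

    # Map predictions back to the original temporal order.
    progress = [0.0] * len(frames)
    for shuffled_pos, original_idx in enumerate(order):
        progress[original_idx] = raw[shuffled_pos]
    return progress
```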
Datasets & contamination control
To prevent contamination, we define two hidden tasks involving robotic actions and human demonstration videos, evaluated only on request. Every demonstration in these curated datasets ends in successful task completion (a 100% completion rate).
Evaluation & scoring
Building on the GVL framework, we introduce a leaderboard benchmarking open- and closed-source VLMs across robotics tasks, in both zero-shot (zero-context) and few-shot (2-episode context) settings, to test generalization.
We report Value-Order Correlation (VOC): Spearman rank correlation between the model’s progress ordering and the video’s true time order (+1 perfect, 0 random, −1 reversed).
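Concretely, VOC can be computed with SciPy's `spearmanr`. This small sketch (an illustration, not our exact evaluation code) correlates the predicted per-frame progress with the true frame indices:

```python
from scipy.stats import spearmanr

def value_order_correlation(predicted_progress):
    """Spearman rank correlation between predicted progress values and
    the true temporal order of the frames (+1 perfect, 0 random, -1 reversed)."""
    true_order = range(len(predicted_progress))
    voc, _p_value = spearmanr(predicted_progress, true_order)
    return voc

# A mostly monotone prediction scores close to +1:
print(value_order_correlation([5, 10, 8, 40, 60, 95]))  # ≈ 0.94
```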
Important note
We aim for alignment with (unobservable) ground-truth task progress, not merely the highest aggregate score.
Contributions
Code and specialized models will be released soon. Contributions (PRs) that add models and datasets are welcome.
Have your own VLM or dataset?
Join the research community. We'll evaluate your model on our benchmarks or integrate your dataset into our evaluation framework.