Evaluating VLMs for generating temporal value functions for robotics tasks
Reliable task completion prediction enables self-supervised deployment, allowing robots to adapt and improve autonomously.
Generative Value Learning (GVL) was recently introduced, leveraging the knowledge embedded in vision-language models (VLMs) to predict a proxy for the true value function.
Building on the GVL framework, we introduce a leaderboard that benchmarks both open- and closed-source VLMs for robotics tasks and datasets.
❗ The goal of the leaderboard is not the highest score across all datasets, but the score closest to the (unobservable) ground truth.
To prevent contamination, we define three hidden tasks (two involving robotic actions and one involving human demonstration videos), which can be evaluated upon request. These datasets are highly curated, with a 100% completion rate.
We will release the accompanying code and specialized models in the coming days.
We welcome pull requests for evaluating new models across all datasets.
Explore how models predict task progress over time. Each episode shows a robotics task with the model's predicted value function progression compared to the ground truth temporal ordering.
Leaderboard table (columns: #, Model, Overall Score broken down into Zero-Context VOC and 2-Episode-Context VOC, and Shuffled); the data loads dynamically on the live page.
Following the original GVL work, this metric, the Value-Order Correlation (VOC), computes the rank correlation between predicted values and the chronological order of the input expert video:
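As a compact statement (a sketch, assuming Spearman's \( \rho \) as the rank statistic): for an episode of \( T \) chronologically ordered frames with predicted values \( \hat{v}_1, \dots, \hat{v}_T \),

\[
\mathrm{VOC} = \rho\big( (\hat{v}_1, \dots, \hat{v}_T),\; (1, 2, \dots, T) \big),
\]

so \( \mathrm{VOC} = 1 \) means the predicted values increase monotonically with true task progress, while values near \( 0 \) indicate no relationship to it.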
We model robotics tasks as goal-conditioned partially observed Markov decision processes.
The temporal value function \( V : \mathcal{O} \times \mathcal{G} \to [0, 1] \) maps observations and goal specifications to progress values, where initial observations have value \( 0 \) and goal-satisfying observations have value \( 1 \).
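As a concrete illustration, here is a minimal sketch of how a VOC score could be computed, assuming Spearman's rank correlation via SciPy; `predict_values` is a hypothetical stand-in for the actual VLM query (not part of the released codebase), and the frame shuffling mirrors GVL's recipe of hiding temporal order from the model:

```python
# Minimal VOC sketch. Assumptions: Spearman's rank correlation as the rank
# statistic, and a hypothetical `predict_values(frames, goal)` callable
# standing in for the actual VLM query.
import random
from typing import Callable, List, Sequence

from scipy.stats import spearmanr


def value_order_correlation(
    frames: Sequence,
    goal: str,
    predict_values: Callable[[Sequence, str], List[float]],
    seed: int = 0,
) -> float:
    """Rank correlation between predicted values and chronological order.

    Frames are shuffled before the model call (as in GVL), so the VLM
    cannot read task progress off the input ordering.
    """
    order = list(range(len(frames)))
    random.Random(seed).shuffle(order)
    shuffled_frames = [frames[i] for i in order]

    # One predicted progress value in [0, 1] per shuffled frame.
    preds = predict_values(shuffled_frames, goal)

    # Map predictions back to chronological positions before correlating.
    chronological = [0.0] * len(frames)
    for value, original_index in zip(preds, order):
        chronological[original_index] = value

    voc, _p_value = spearmanr(chronological, range(len(frames)))
    return voc
```

Shuffling the frames before the model call forces the VLM to estimate progress from visual content rather than from input position, which is central to GVL's evaluation setup.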