Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement

Princeton University

RSS 2026

Diagram: a generator policy proposes rollouts; a gradient-free visual verifier scores them; selected rollouts steer the policy and feed back as training data. — **Overview of VERITAS.**
A pre-trained generalist policy acts as a stochastic generator, sampling multiple short-horizon action chunks at each decision step. A gradient-free visual verifier scores these candidates based on task alignment and physical plausibility, and the highest-scoring action is executed, yielding immediate performance gains at inference time. Successful verifier-guided rollouts are logged and reused for offline policy improvement, forming a data flywheel that distills verification-time reasoning into the policy and enables continual improvement with minimal human supervision.

Abstract

Robots deployed in the real world should learn from their experience and improve over time. This requires a mechanism for practicing and learning from feedback. In this paper, we propose VERITAS, a generator–verifier framework that gives generalist robot policies inference-time steering and self-improvement. We use a pre-trained generalist robot policy as a “generator” and pair it with a gradient-free “visual verifier” that evaluates candidate actions at inference time.

This framework enables inference-time steering that improves policy performance without additional training. We demonstrate that inference-time verification consistently outperforms vanilla generalists without training on additional demonstration data.

Additionally, we demonstrate that the verified rollouts provide effective supervision for offline policy improvement: policies fine-tuned on verified self-generated trajectories achieve consistent performance gains. Notably, we find that post-training with verified rollouts achieves data efficiency comparable to that of expert demonstrations while requiring no human intervention. Our results highlight inference-time verification as a practical and scalable mechanism for improving robotic policies during deployment.

Method

The loop is simple. A pre-trained generalist policy emits candidate action trajectories from the current observation. A gradient-free visual verifier evaluates each candidate against task semantics — no extra training, no labels — and the top-scoring trajectory is executed. Top-scoring rollouts are also retained as supervision; fine-tuning on them distills the verification-time reasoning into the policy.

i. Generator proposes. A pre-trained policy emits candidate action trajectories from the current observation.
ii. Verifier scores. A gradient-free visual verifier evaluates candidates without additional training or labeled data.
iii. Policy improves. Top-scoring rollouts become supervision; fine-tuning the policy on these verified rollouts improves its performance.
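The propose-and-score steps above can be sketched as best-of-N selection over sampled action chunks. This is a minimal illustration under stated assumptions, not the authors' implementation: `steer_step`, `sample_chunk`, `score_chunk`, `num_candidates`, and the toy verifier are hypothetical names, and the toy score merely stands in for a visual verifier that would judge task alignment and physical plausibility.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: a short-horizon action chunk as a flat list.
ActionChunk = List[float]


def steer_step(
    observation: object,
    sample_chunk: Callable[[object], ActionChunk],
    score_chunk: Callable[[object, ActionChunk], float],
    num_candidates: int = 8,
) -> Tuple[ActionChunk, float]:
    """Sample several candidate chunks and return the highest-scoring one."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    # Generator: draw candidates from the (stochastic) policy.
    candidates = [sample_chunk(observation) for _ in range(num_candidates)]
    # Verifier: score each candidate; no gradients are involved.
    scores = [score_chunk(observation, chunk) for chunk in candidates]
    # Selection: execute the top-scoring candidate.
    best = max(range(num_candidates), key=scores.__getitem__)
    return candidates[best], scores[best]


def toy_verifier(observation: object, chunk: ActionChunk) -> float:
    """Toy stand-in score: prefer chunks whose entries are close to 0.5."""
    return -sum((a - 0.5) ** 2 for a in chunk)
```

In this sketch, the policy and the verifier are passed in as plain callables, so the same selection loop could wrap any stochastic policy and any scoring function without retraining either.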

Simulation Results

Under verifier-in-the-loop steering, the policy completes a diverse set of manipulation tasks in simulation. Four representative rollouts are shown below; in each clip, the executed actions are selected at inference time by the visual verifier.

Put eggplant in basket.

Put carrot on plate.

Put spoon on towel.

Stack green block on yellow block.

Simulation results. Verified rollouts across SIMPLER tasks.

Bar plot of simulation success rates comparing the raw policy, inference-time steering, and the policy fine-tuned on verifier-curated autonomous data. — **Simulation success rates of the raw policy, inference-time steering, and the same policy fine-tuned on verifier-curated autonomous data.**
While verification improves performance at inference time through action steering, fine-tuning on the collected verifier-curated rollouts successfully distills these gains back into the policy weights.
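The curation step that feeds fine-tuning can be sketched as a simple filter over logged rollouts. This is an illustrative sketch only: the `Rollout` fields, the `success` flag, the aggregate `verifier_score`, and the `min_score` threshold are all assumptions made for the example, since the page does not specify how rollouts are stored or how success is determined.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Rollout:
    """Hypothetical log record for one verifier-guided episode."""
    instruction: str
    observations: List[object]
    actions: List[List[float]]
    verifier_score: float  # assumed aggregate score for the episode
    success: bool          # assumed success label for the episode


def curate(rollouts: List[Rollout], min_score: float = 0.0) -> List[Rollout]:
    """Keep successful rollouts whose score meets the (assumed) threshold."""
    return [r for r in rollouts if r.success and r.verifier_score >= min_score]


def to_training_pairs(
    rollouts: List[Rollout],
) -> List[Tuple[str, object, List[float]]]:
    """Flatten rollouts into (instruction, observation, action) examples."""
    pairs = []
    for r in rollouts:
        for obs, act in zip(r.observations, r.actions):
            pairs.append((r.instruction, obs, act))
    return pairs
```

The resulting tuples have the shape of an ordinary imitation-learning dataset, which is why the curated rollouts could be used in place of teleoperated demonstrations during fine-tuning.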

Real-world Results

Pick up the mouse and place it into the green bowl

Insert the marker into the mug

Put the carrot on the plate

Put the tape into the wooden box

VERITAS in real-world deployment. Verifier-in-the-loop rollouts on a real robot under inference-time steering.

Bar chart showing real-world success rates: VERITAS improves over the base policy on average by 35%, while PIVOT's naïve action primitives fail to complete tasks. — **Inference-time verification yields large real-world gains.**
VERITAS yields an average improvement of 35% in real-world deployment without any policy fine-tuning. These gains show that inference-time computation alone can improve generalist policy performance. Note that PIVOT's naïve action primitives cannot achieve successful outcomes, whereas a trained policy with a strong action prior performs effectively, which underscores the importance of good action priors.

Chart comparing policies fine-tuned on verifier-curated autonomous rollouts versus human teleoperation data, showing comparable data efficiency. — **Verifier-curated autonomous rollouts can match the data efficiency of costly human teleoperation.**
By converting deployment-time execution into effective training data, our approach enables scalable policy improvement without requiring continuous expert supervision. These findings suggest that execution-time verification not only improves performance online but also offers a practical pathway for continual policy improvement.

BibTeX

If this work is useful for your research, please consider citing:

@inproceedings{veritas,
  title     = {Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement},
  author    = {Zhang, Mingtong and Shah, Dhruv},
  booktitle = {Proceedings of Robotics: Science and Systems (RSS)},
  year      = {2026}
}