We give embodied agents a moment to think. Instead of committing to the first action a multimodal LLM produces, we draw a handful of candidates and let a small, specifically trained verifier pick the most reliable one. The result: consistent gains on out-of-distribution embodied tasks (up to +36% relative on the hardest multi-object scenarios), with no change to the underlying policy.
Multimodal LLMs make surprisingly capable embodied agents: hand them an instruction like “bring me a banana” and they will navigate the kitchen and pick the right thing. But the moment the world tilts, performance falls off a cliff. A paraphrased instruction (“a yellow curved fruit”), an extra object, or a longer chain of subtasks is enough to derail them.
We trace this fragility to a simple fact: the agent commits to a single greedy action at every step. There is no chance to consider an alternative or catch a small error before it compounds. Humans don’t act this way. We weigh a few options, mentally check each, and then move. We verify before acting.
At every timestep, the policy proposes N candidate actions, each with a chain-of-thought rationale. A separate generative verifier reads each candidate, writes its own short rationale, and emits a verdict (yes / no). We average M verdicts per candidate to get a stable score, then execute the highest-scoring action. The base policy is untouched.
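To make the selection loop concrete, here is a minimal sketch in Python. The `policy.sample` and `verifier.judge` interfaces are illustrative assumptions rather than our actual implementation; the N=8, M=5 setting matches the configuration discussed below.

```python
# Minimal sketch of verifier-guided action selection (interfaces are
# illustrative assumptions, not the actual API). At each timestep the policy
# proposes N candidate actions with rationales, a verifier scores each one
# M times, and the top-scoring action is executed.
from dataclasses import dataclass

N_CANDIDATES = 8   # actions sampled from the policy per step
M_VERDICTS = 5     # verifier verdicts averaged per candidate

@dataclass
class Candidate:
    action: str
    rationale: str   # the policy's chain-of-thought for this action

def select_action(policy, verifier, observation, instruction):
    # 1. Sample N candidate actions, each with its own chain-of-thought.
    candidates = [
        policy.sample(observation, instruction)  # -> Candidate (assumed)
        for _ in range(N_CANDIDATES)
    ]

    # 2. Score each candidate with M independent verifier verdicts.
    #    verifier.judge(...) is assumed to return True ("yes") or False ("no");
    #    the verifier's own short rationale is ignored here.
    def score(cand: Candidate) -> float:
        verdicts = [
            verifier.judge(observation, instruction, cand.action, cand.rationale)
            for _ in range(M_VERDICTS)
        ]
        return sum(verdicts) / M_VERDICTS

    # 3. Execute the highest-scoring action; the base policy itself is untouched.
    return max(candidates, key=score).action
```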
Using the same MLLM as a zero-shot verifier doesn’t help. In fact, it slightly hurts. General language understanding is not the same as the ability to judge whether this action, in this scene, advances the goal. Verification is a distinct skill, and it has to be learned.
Training a verifier requires paired examples of correct and incorrect actions, each annotated with a rationale explaining why. No standard embodied benchmark provides this, so we synthesize it. Starting from a small set of expert demonstrations, we use an LLM to do two things for every action: (1) annotate the correct action with a chain-of-thought rationale, and (2) generate a plausible failed twin (the kind of mistake a real policy would actually make) paired with a rationale explaining the error. Finetuning an MLLM on this paired corpus yields a verifier tuned to the failure modes that matter, without any new human annotation.
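A rough sketch of that synthesis loop, assuming a generic `annotate_llm(prompt) -> str` callable and a list of expert demonstration steps; the prompts, helper names, and output format are placeholders, not our exact pipeline.

```python
# Illustrative sketch of the verifier-training data synthesis described above.
# For each expert action, an LLM (1) writes a rationale for the correct action
# and (2) proposes a plausible failed twin with a rationale explaining the error.
def build_verifier_dataset(expert_demos, annotate_llm):
    examples = []
    for obs, instr, gold_action in expert_demos:
        # (1) Rationale for the correct action -> positive example ("yes").
        good_rationale = annotate_llm(
            f"Instruction: {instr}\nAction: {gold_action}\n"
            "Explain briefly why this action makes progress toward the goal."
        )
        examples.append((obs, instr, gold_action, good_rationale, "yes"))

        # (2) A plausible failed twin -> negative example ("no").
        bad = annotate_llm(
            f"Instruction: {instr}\nCorrect action: {gold_action}\n"
            "Propose a realistic mistaken action a policy might take instead, "
            "and explain why it is wrong. Format: ACTION || RATIONALE"
        )
        bad_action, bad_rationale = [s.strip() for s in bad.split("||", 1)]
        examples.append((obs, instr, bad_action, bad_rationale, "no"))
    return examples
```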
On both benchmarks, the pattern is the same: chain-of-thought is a strong baseline; an off-the-shelf verifier doesn’t move it; our finetuned verifier does.
The first four task columns (Rephrasing, Context, Irrelevant Text, Referring Expressions) fall under Paraphrastic Robustness; the last four (Multiple Rearrange, Novel Objects, Multiple Objects, Conditional) fall under Behavioral Generalization.

| Approach | Average | Rephrasing | Context | Irrelevant Text | Referring Expressions | Multiple Rearrange | Novel Objects | Multiple Objects | Conditional |
|---|---|---|---|---|---|---|---|---|---|
| **Prior work** | | | | | | | | | |
| LLaRP (LLaMa-7B) | 46 | 92 | 34 | 32 | 26 | 47 | 95 | 0 | 39 |
| SemLang (LLaVA-1.5-7B) | 58 | 92 | 46 | 66 | 31 | 80 | 97 | 2 | 46 |
| **Policy only (Qwen-2.5-VL-3B-Instruct)** | | | | | | | | | |
| No-CoT | 58 | 93 | 39 | 72 | 48 | 68 | 97 | 17 | 28 |
| w/ CoT | 65 | 98 | 50 | 85 | 59 | 64 | 97 | 25 | 42 |
| **w/ CoT policy + Verifier (Qwen-2.5-VL-3B-Instruct)** | | | | | | | | | |
| + Zero-shot Verifier | 64 | 98 | 50 | 85 | 48 | 65 | 97 | 30 | 40 |
| + Finetuned Verifier (VeGAS) | 71 (+6) | 99 | 52 | 92 | 62 | 82 | 97 | 34 | 48 |

| Approach | Average | Base | Common Sense | Complex Instructions | Long Horizon | Spatial | Visual Appearance |
|---|---|---|---|---|---|---|---|
| Qwen-3B w/ CoT | 44 | 62 | 40 | 58 | 22 | 34 | 48 |
| + Zero-shot Verifier | 44 | 64 | 41 | 53 | 24 | 35 | 46 |
| + Finetuned Verifier (VeGAS) | 49 (+5) | 67 | 46 | 62 | 34 | 43 | 41 |
| Gemma-4B w/ CoT | 38 | 45 | 39 | 44 | 15 | 36 | 47 |
| + Zero-shot Verifier | 48 | 66 | 55 | 61 | 27 | 32 | 49 |
| + Finetuned Verifier (VeGAS) | 51 (+13) | 67 | 56 | 67 | 25 | 37 | 53 |
A 3B verifier improves much larger policies it was never trained with. Pair it with an off-the-shelf 27–72B MLLM acting zero-shot, and the verifier still picks better actions than the policy alone, despite being up to ~20× smaller. Verification skill, learned once, transfers across model families and scales.
A natural question: are the gains just from sampling more actions? We compare against self-consistency (majority vote over samples) under the same total compute budget. Verification scales better.
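For reference, the self-consistency baseline amounts to something like the following, using the same hypothetical `policy.sample` interface as in the sketch above; the sample count is a parameter chosen to match the total LLM-call budget.

```python
# Self-consistency baseline for the compute-matched comparison: sample several
# actions and take a majority vote, with no verifier in the loop (sketch only).
from collections import Counter

def self_consistency_action(policy, observation, instruction, n_samples):
    actions = [policy.sample(observation, instruction).action
               for _ in range(n_samples)]
    return Counter(actions).most_common(1)[0][0]
```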
And latency stays reasonable: even with N=8 candidates and M=5 verifications each (48× the LLM calls of greedy decoding), wall-clock latency only doubles, because every sample runs in parallel.
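One way to see why: all candidate generations can go out as a single batched request, and all N×M verifier calls as a second, so each step costs two batched rounds rather than 48 sequential calls. A sketch under the assumption of a batched `generate(prompts) -> outputs` interface, which is not our actual serving stack:

```python
# Batched version of verifier-guided action selection. Prompt templates and the
# `generate` interface are assumptions made for illustration.
def select_action_batched(policy, verifier, obs, instr, n=8, m=5):
    # Round 1: one batched request for all N candidate actions.
    candidates = policy.generate(
        [f"Observation: {obs}\nInstruction: {instr}\nNext action:"] * n
    )

    # Round 2: one batched request for all N x M verifier verdicts.
    verdicts = verifier.generate([
        f"Observation: {obs}\nInstruction: {instr}\n"
        f"Proposed action: {a}\nDoes this action advance the goal? Answer yes or no:"
        for a in candidates for _ in range(m)
    ])

    # Average the M verdicts per candidate and execute the best-scoring action.
    scores = [
        sum(v.strip().lower().startswith("yes")
            for v in verdicts[i * m:(i + 1) * m]) / m
        for i in range(n)
    ]
    return candidates[scores.index(max(scores))]
```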
@article{singhi2026vegas,
title = {Think Twice, Act Once: Verifier-Guided Action Selection for Embodied Agents},
author = {Singhi, Nishad and Bialas, Christian and Jauhri, Snehal and
Prasad, Vignesh and Chalvatzaki, Georgia and
Rohrbach, Marcus and Rohrbach, Anna},
journal = {arXiv preprint},
year = {2026}
}