arxiv:2503.02505

ROCKET-2: Steering Visuomotor Policy via Cross-View Goal Alignment

Published on Mar 4
Abstract

We aim to develop a goal specification method that is semantically clear, spatially sensitive, and intuitive for human users to guide agent interactions in embodied environments. Specifically, we propose a novel cross-view goal alignment framework that allows users to specify target objects using segmentation masks from their own camera views rather than the agent's observations. We highlight that behavior cloning alone fails to align the agent's behavior with human intent when the human and agent camera views differ significantly. To address this, we introduce two auxiliary objectives, a cross-view consistency loss and a target visibility loss, which explicitly enhance the agent's spatial reasoning ability. Building on this framework, we develop ROCKET-2, a state-of-the-art agent trained in Minecraft that achieves a 3x to 6x improvement in inference efficiency. We show that ROCKET-2 can directly interpret goals from human camera views for the first time, paving the way for better human-agent interaction.
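As a rough illustration, the sketch below shows one way behavior cloning could be combined with the two auxiliary objectives named in the abstract. The specific loss forms, weights, and tensor layouts here are assumptions for illustration only; the paper defines the actual formulations.

```python
import torch
import torch.nn.functional as F

def total_loss(policy_logits, expert_actions,
               agent_view_embed, human_view_embed,
               visibility_logits, visibility_labels,
               w_consistency=1.0, w_visibility=1.0):
    """Illustrative combination of behavior cloning with the two auxiliary
    objectives mentioned in the abstract. Loss forms and weights are
    assumptions, not the paper's exact formulation."""
    # Behavior cloning: imitate the demonstrated actions.
    bc = F.cross_entropy(policy_logits, expert_actions)

    # Cross-view consistency: encourage the goal representation inferred
    # from the agent's view to agree with the one from the human's view.
    consistency = F.mse_loss(agent_view_embed, human_view_embed)

    # Target visibility: predict whether the specified target object is
    # visible in the agent's current observation.
    visibility = F.binary_cross_entropy_with_logits(
        visibility_logits, visibility_labels.float())

    return bc + w_consistency * consistency + w_visibility * visibility
```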

Community

Paper author

🐲🐲🐲 ROCKET-2 is the first end-to-end trained agent capable of dealing damage to the Ender Dragon using a bow — all in a zero-shot setting.
rocket-dragon.gif
🐑🐑🐑 ROCKET-2 can also accurately and smoothly eliminate a specified sheep, much like a human player.
rocket-sheep.gif

Paper author

🌲🌲🌲 It can also reliably locate and chop down target trees, even in complex forest environments.
rocket-tree.gif

Paper author

rocket-emerald.gif

rocket-boat.gif

rocket-nether.gif

Paper author

rocket-wither-skeleton.gif

rocket-chest.gif

rocket-boat.gif

rocket-approach.gif

Paper author

rocket-build.gif

Paper author

ROCKET-2 is the first Minecraft agent to demonstrate bridge-building capabilities.

rocket-bridge.gif

Paper author

rocket-obsidian.gif

Paper author

rocket-craft.gif

Paper author

Demo


Models citing this paper 2


Collections including this paper 1