Observer-Actor: Active Vision Imitation Learning with Sparse-View
Gaussian Splatting
At test time, we optimize the camera pose toward an optimal viewpoint, one that is closest to the demonstration view and minimally occluded, by leveraging 3D Gaussian Splatting reconstructed from sparse-view images for view-conditioned imitation learning.
Abstract
We propose Observer-Actor (ObAct), a novel framework for active vision imitation learning in which the observer moves to optimal visual observations for the actor. We study ObAct on a dual-arm robotic system equipped with wrist-mounted cameras. At test time, ObAct dynamically assigns observer and actor roles: the observer arm constructs a 3D Gaussian Splatting (3DGS) representation from three images, virtually explores this representation to find an optimal camera pose, then moves to this pose; the actor arm then executes a policy using the observer's observations. This formulation enhances the clarity and visibility of both the object and the gripper in the policy's observations. As a result, we enable the training of ambidextrous policies on observations that remain closer to the occlusion-free training distribution, leading to more robust policies. We study this formulation with two existing imitation learning methods -- trajectory transfer and behavior cloning -- and experiments show that ObAct significantly outperforms static-camera setups: trajectory transfer improves by 145% without occlusion and 233% with occlusion, while behavior cloning improves by 75% and 143%, respectively.
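To make the test-time procedure concrete, the sketch below outlines it in Python. It is a hypothetical outline, not the released implementation: every interface here (capture_fn, fit_3dgs_fn, score_fn, observer, actor, policy) is an assumed placeholder for the corresponding system component.

```python
# Hypothetical sketch of the ObAct test-time loop. All interfaces are assumed
# placeholders standing in for the real system components, not the authors' code.

def obact_test_time(capture_fn, fit_3dgs_fn, score_fn,
                    observer, actor, policy,
                    demo_view_pose, candidate_poses):
    """Run one active-vision episode: reconstruct, pick a view, act."""
    # 1. Capture a sparse set of wrist-camera images of the scene.
    images, camera_poses = capture_fn()

    # 2. Fit a 3D Gaussian Splatting (3DGS) reconstruction from the sparse views.
    scene = fit_3dgs_fn(images, camera_poses)

    # 3. Virtually explore the reconstruction: score each candidate camera pose
    #    by closeness to the demonstration view and absence of occlusion.
    best_pose = max(candidate_poses,
                    key=lambda pose: score_fn(scene, pose, demo_view_pose))

    # 4. Move the observer arm so its wrist camera reaches the selected view.
    observer.move_camera_to(best_pose)

    # 5. The actor arm executes the view-conditioned policy on live observations.
    while not policy.done():
        observation = observer.read_camera()
        actor.apply(policy.act(observation))
```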
Method
Comparing Active Vision and Fixed Camera Setups

Active vision for imitation learning in a mug-handle pickup task across five scenarios. When a static camera struggles (top row) due to self-occlusion, obstacle occlusion, robot occlusion, small object parts, or unseen object poses, alternative camera placements (bottom row; indicated by colored frustums in the top row) provide clearer observations. In our method, at test time an observer robot (the robot on the right in these examples) computes and moves to such an optimal view using its wrist-mounted camera, after which an actor robot (the robot on the left) performs the task conditioned on this view.
Observer-Actor Framework Overview

(1) Train: The operator selects a demonstration optimal view, moves the observer arm to this view, and records a demonstration. This process is repeated as required by the imitation learning method.
(2) Test: The robots capture six views of the scene to construct a 3DGS representation. View optimization within this representation identifies the test-time optimal view. The observer arm then moves to this view, after which the actor arm executes the task.
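One way to read the view-optimization step is as a search over candidate camera poses scored against the demonstration optimal view. The following Python sketch illustrates that idea under assumptions: the weighted SE(3) distance, the occ_weight trade-off, and the occlusion_fn callable (which would render the 3DGS from a candidate pose) are ours, not the paper's exact formulation.

```python
import numpy as np

def pose_distance(T_a, T_b, rot_weight=0.1):
    """Weighted SE(3) distance between two 4x4 camera poses."""
    t_err = np.linalg.norm(T_a[:3, 3] - T_b[:3, 3])
    # Rotation geodesic distance from the trace of the relative rotation.
    R_rel = T_a[:3, :3].T @ T_b[:3, :3]
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return t_err + rot_weight * np.arccos(cos_angle)

def select_optimal_view(candidate_poses, demo_pose, occlusion_fn, occ_weight=1.0):
    """Pick the candidate closest to the demonstration view with least occlusion.

    occlusion_fn(pose) is assumed to render the 3DGS from `pose` and return a
    scalar occlusion score in [0, 1] (e.g. the fraction of task-relevant pixels
    hidden by other geometry).
    """
    costs = [pose_distance(pose, demo_pose) + occ_weight * occlusion_fn(pose)
             for pose in candidate_poses]
    return candidate_poses[int(np.argmin(costs))]
```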
Images of Optimal Views

Top row: demonstration optimal views. Middle row: test-time optimal views in 3DGS with gripper mask overlay. Bottom row: real-world test-time optimal views. Red boxes indicate the task-relevant object parts. Test-time optimal views are recovered in the 3DGS to match the demonstration optimal viewpoints while remaining minimally occluded.
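The gripper mask overlay suggests a simple way to quantify occlusion for a candidate view: render masks of the task-relevant object part and of potential occluders (such as the gripper or arm) and measure their overlap. The snippet below is only a plausible sketch of such a check; the mask conventions and the fallback for an invisible target are assumptions.

```python
import numpy as np

def occlusion_score(object_mask, occluder_mask):
    """Fraction of task-relevant object pixels covered by an occluder.

    Both inputs are boolean HxW masks rendered from the same candidate view:
    `object_mask` marks the task-relevant object part (e.g. a mug handle),
    `occluder_mask` marks geometry that may hide it (e.g. the gripper or arm).
    """
    visible_target = object_mask.sum()
    if visible_target == 0:
        return 1.0  # target not visible at all: treat as fully occluded
    overlap = np.logical_and(object_mask, occluder_mask).sum()
    return overlap / visible_target
```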
Experiments
We present experiments in scenes where the demonstration images are occluded. For videos with active vision (AV), we remove the repetitive six-view exploration segments. For Trajectory Transfer (TT) experiments, we collected a single demonstration and, after pose estimation, executed the SE(3)-equivariant trajectory in an open-loop manner. For Behavior Cloning (BC) experiments, we collected 70 demonstrations to train the view-conditioned policy and another 70 to train the fixed-camera policy. We compare both approaches in the same test environment.
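For context, trajectory transfer of this kind typically maps the demonstrated end-effector trajectory through the relative transform between the object's demonstration pose and its estimated test-time pose. The sketch below illustrates that SE(3)-equivariant mapping; the variable names and world-frame convention are assumptions rather than the paper's exact code.

```python
import numpy as np

def transfer_trajectory(demo_traj, T_obj_demo, T_obj_test):
    """Re-express a demonstrated end-effector trajectory for a new object pose.

    demo_traj  : list of 4x4 end-effector poses recorded in the demonstration.
    T_obj_demo : 4x4 object pose during the demonstration (world frame).
    T_obj_test : 4x4 object pose estimated at test time (world frame).

    The trajectory is transformed by the relative object motion, so the motion
    relative to the object is preserved (SE(3) equivariance).
    """
    T_rel = T_obj_test @ np.linalg.inv(T_obj_demo)
    return [T_rel @ T_ee for T_ee in demo_traj]
```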
Using our method, both the TT and BC approaches achieve significantly higher success rates across all tasks compared to baselines without AV, highlighting the advantages of leveraging optimal viewpoints for imitation learning.
Dynamic Role Assignment
The roles of the observer and actor arms are not fixed; instead, they are determined automatically based on the object configuration at test time. The same downstream policy is used for both arms without needing to be retrained.
Below, we show an example of two consecutive timed runs, where the roles of the arms are swapped based on the object’s position. In the first run, the left arm acts as the observer, while in the second run, it takes on the role of the actor.
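The exact assignment rule is not spelled out here, but a natural heuristic is to make the arm that can reach the object most easily the actor and the other arm the observer. The sketch below encodes a distance-based version of that heuristic purely as an illustration; the real criterion may involve full reachability or workspace checks.

```python
import numpy as np

def assign_roles(object_position, left_base, right_base):
    """Illustrative role assignment: the arm whose base is closer to the object
    becomes the actor, the other the observer. This is only a placeholder rule.
    """
    d_left = np.linalg.norm(np.asarray(object_position) - np.asarray(left_base))
    d_right = np.linalg.norm(np.asarray(object_position) - np.asarray(right_base))
    if d_left <= d_right:
        return {"actor": "left", "observer": "right"}
    return {"actor": "right", "observer": "left"}
```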
Currently, the active vision pipeline takes around 76 seconds. For a detailed breakdown of its components, please refer to our paper. We expect that as the efficiency of 3DGS continues to advance, the required processing time will decrease significantly.
Object Generalization
Our method generalizes to unseen objects within the same category, even under occlusion.
Object Tracking
We demonstrate object-tracking capabilities by moving the object during task execution and applying a visual servoing controller to maintain tracking once the test-time optimal viewpoint is reached.
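Keeping the object in view after reaching the test-time optimal viewpoint can be framed as standard image-based visual servoing: track the object's image-plane position and command a motion proportional to its drift from the desired pixel location. The sketch below is a minimal proportional law, not the controller used in the paper; the gain and the tracked feature are assumptions, and mapping the output to a camera twist would additionally require the camera intrinsics and an interaction matrix.

```python
import numpy as np

def visual_servo_step(tracked_px, target_px, gain=0.5):
    """One step of a simple image-based visual servoing law.

    tracked_px : (u, v) current pixel position of the tracked object feature.
    target_px  : (u, v) desired pixel position (e.g. where the feature sat in
                 the demonstration optimal view).
    Returns a 2D image-plane velocity command proportional to the pixel error.
    """
    error = np.asarray(target_px, dtype=float) - np.asarray(tracked_px, dtype=float)
    return gain * error
```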