Object Perception and Manipulation

To comprehend its environment, a robot must first model individual objects from multiple perspectives: their 2D visual appearance, their 3D geometry, and their spatial pose during interaction.

For this purpose, we collected two large-scale datasets, HAMMER and HouseCat6D, for novel view synthesis, depth estimation, and object/grasp pose estimation.

(Left) Hardware for collecting the data. (Right) Data modalities and quality compared to other datasets.
HouseCat6D configurations.
Supported by this data acquisition, we investigated robotic grasping with MonoGraspNet and category-level object pose estimation with SecondPose.
Grasp representation in MonoGraspNet.
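
For readers unfamiliar with the terminology, a 6-DoF parallel-jaw grasp is commonly parameterized by a gripper rotation, a translation, an opening width, and a quality score. The sketch below illustrates only this generic parameterization; the Grasp6D class and its fields are illustrative assumptions and not MonoGraspNet's actual representation or code.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Grasp6D:
    """A generic parallel-jaw grasp expressed in the camera frame."""
    rotation: np.ndarray     # 3x3 rotation matrix of the gripper
    translation: np.ndarray  # 3-vector of the gripper center, in meters
    width: float             # jaw opening, in meters
    score: float             # predicted grasp quality in [0, 1]

    def to_matrix(self) -> np.ndarray:
        """Homogeneous 4x4 transform, handy for passing to a motion planner."""
        T = np.eye(4)
        T[:3, :3] = self.rotation
        T[:3, 3] = self.translation
        return T
```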
SecondPose aligns features in a category.
Taking a further step toward high-level interactive applications, we study object rearrangement with SG-Bot, which puts individual object modeling to practical use.
SG-Bot three-step pipeline.
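
To make the three steps concrete, below is a minimal, self-contained sketch of the perceive-imagine-act pattern behind scene-graph-driven rearrangement. The hard-coded scene, the fixed goal layout, and the stubbed pick/place calls are purely illustrative assumptions, not SG-Bot's actual interfaces.

```python
from dataclasses import dataclass

# Toy scene-graph node: an object with a name and a 3D position (meters).
@dataclass
class ObjectNode:
    name: str
    position: tuple  # (x, y, z) in the table frame

# Step 1 -- perceive: in a real system this graph would come from segmentation
# and pose estimation; here it is hard-coded for illustration.
initial_scene = [
    ObjectNode("mug",   (0.40, 0.10, 0.02)),
    ObjectNode("plate", (0.55, -0.20, 0.01)),
]

# Step 2 -- imagine: a goal scene describing where each object should end up
# (in SG-Bot this step is generative; here it is a fixed target layout).
goal_scene = {
    "mug":   (0.30, -0.05, 0.02),
    "plate": (0.30,  0.15, 0.01),
}

# Step 3 -- act: match objects between the two scenes by name and execute
# pick-and-place motions (stubbed out as print statements).
def pick(obj: ObjectNode) -> None:
    print(f"pick  {obj.name} at {obj.position}")

def place(obj: ObjectNode, target: tuple) -> None:
    print(f"place {obj.name} at {target}")

for obj in initial_scene:
    target = goal_scene.get(obj.name)
    if target is not None and target != obj.position:
        pick(obj)
        place(obj, target)
```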

In the long run, we believe world models play an important role in achieving AGI, as they learn the dynamics of actions and world states in a latent space. This mirrors human cognition, where observations are encoded into latent signals to infer actions for interacting with the world. In open-vocabulary robotic manipulation tasks, world models learn the causal relationship between language descriptions and object repositioning. We introduce TokenWM, a framework that constructs world states from foundation visual features, thereby enhancing representational capacity and significantly improving manipulation performance.

TokenWM is a recurrent world model built on foundation visual encoders.
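
The sketch below shows, in broad strokes, how a recurrent world model can be layered on top of a frozen visual backbone: observations are encoded into latent features, and a recurrent cell rolls the world state forward under candidate actions. The stub encoder, the GRU dynamics, and all dimensions are assumptions for illustration; they do not reproduce TokenWM's actual architecture.

```python
import torch
import torch.nn as nn

class FrozenVisualEncoder(nn.Module):
    """Stand-in for a pretrained foundation visual encoder; here just a small
    frozen CNN so the example runs without any model downloads."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        for p in self.parameters():
            p.requires_grad = False  # the visual backbone stays frozen

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.net(image)

class RecurrentWorldModel(nn.Module):
    """Rolls the latent world state forward given the current visual feature
    and an action, and predicts the next feature and a scalar reward."""
    def __init__(self, feat_dim: int = 256, action_dim: int = 7, hidden: int = 512):
        super().__init__()
        self.rnn = nn.GRUCell(feat_dim + action_dim, hidden)
        self.to_feat = nn.Linear(hidden, feat_dim)  # predicted next visual feature
        self.to_reward = nn.Linear(hidden, 1)       # predicted reward / progress

    def forward(self, feat, action, hidden_state):
        hidden_state = self.rnn(torch.cat([feat, action], dim=-1), hidden_state)
        return self.to_feat(hidden_state), self.to_reward(hidden_state), hidden_state

# Roll out a short imagined trajectory from a single camera frame.
encoder, wm = FrozenVisualEncoder(), RecurrentWorldModel()
image = torch.randn(1, 3, 128, 128)        # current camera frame
h = torch.zeros(1, 512)                    # recurrent world state
feat = encoder(image)
for t in range(5):
    action = torch.randn(1, 7)             # e.g. a 7-DoF end-effector command
    feat, reward, h = wm(feat, action, h)  # imagine the next latent state
```

In a real system, the frozen stand-in encoder would be replaced by a pretrained foundation backbone, and the predicted features and rewards would supervise both the latent dynamics and a downstream manipulation policy.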
