Design and Implementation of a Multimodal Combination Framework for Robotic Grasping

Keywords: Robotic Grasping, Point Cloud, ChatGPT, Wearable Device, Multimodal Combination, Human-Robot Interaction.

Abstract

Robotic grasping plays a crucial role in manipulation tasks. However, due to the complexity of human-robot interaction, service robots still face significant challenges in handling task-oriented operations in real-world environments. To address this issue and better meet practical interaction needs, we propose a multimodal combination framework for robotic grasping. It leverages language text to facilitate communication, and detects and grasps target objects based on point clouds and feedback. The framework comprises several multimodal components, including ChatGPT, stereo cameras, and wearable devices, which handle instruction processing, grasp detection, and motion execution, respectively. To enable effective interaction, ChatGPT facilitates basic communication between humans and robots and responds to instructions. In addition, the robot detects 6-DoF grasps of objects from point clouds captured by the stereo cameras; these grasps are combined with the feedback provided by ChatGPT to further satisfy human requirements. Finally, we utilize wearable devices to teach the robot generalized motor skills, enabling it to learn the corresponding movements and perform them effectively in various scenarios, further improving its manipulation abilities. Experimental results from simulated conversations and real-scene tasks show that the proposed framework provides logical communication, stable grasping, and effective motion.

Published
2024-01-22