
Visual-Tactile Fusion for Multimodal Semantic Communication with Foundation Models

Integrating vision and touch is key to understanding the physical world, but it faces two main challenges: effective multimodal fusion and high-fidelity tactile representation. This paper proposes a foundation-model-based multimodal semantic communication framework built on visual-tactile fusion. First, a multimodal enhancement fusion network extracts deep features from video to improve tactile recognition and semantic understanding. Second, a CLIP-driven framework, grounded in a tactile knowledge base, enhances the accuracy of tactile information transmission. An end-to-end model with joint source-channel coding further improves transmission efficiency. Finally, we introduce a tactile generative reconstruction method using ImageBind, which ensures high similarity in both visual features and pressure distribution. Experimental results confirm the effectiveness of our approach in semantic tactile reconstruction. Overall, the proposed method enables efficient, low-bit-rate communication with high semantic fidelity, offering a promising solution for visual-tactile fusion in real-world applications.
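At a high level, the pipeline described in the abstract pairs a visual-tactile fusion module with joint source-channel coding before transmission. The sketch below is a minimal, hypothetical PyTorch illustration of that structure only; the module names (`VisualTactileFusion`, `JSCCCodec`), dimensions, and the AWGN channel model are assumptions made for exposition, and the CLIP/ImageBind embeddings are stood in by random tensors. It is not the authors' implementation.

```python
# Minimal, hypothetical sketch of a visual-tactile fusion + JSCC pipeline.
# Module names, dimensions, and the channel model are illustrative assumptions;
# CLIP/ImageBind features are stubbed with random tensors.
import torch
import torch.nn as nn


class VisualTactileFusion(nn.Module):
    """Fuses visual (video) and tactile feature embeddings via cross-attention."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(embed_dim * 2, embed_dim)

    def forward(self, visual: torch.Tensor, tactile: torch.Tensor) -> torch.Tensor:
        # Let tactile features attend to visual context, then merge both streams.
        attended, _ = self.attn(query=tactile, key=visual, value=visual)
        return self.proj(torch.cat([attended, tactile], dim=-1))


class JSCCCodec(nn.Module):
    """Toy joint source-channel coder: compress, pass through AWGN, decode."""
    def __init__(self, embed_dim: int = 512, channel_dim: int = 64, snr_db: float = 10.0):
        super().__init__()
        self.encoder = nn.Linear(embed_dim, channel_dim)
        self.decoder = nn.Linear(channel_dim, embed_dim)
        self.snr_db = snr_db

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x)
        # Simulate an AWGN channel at the configured SNR.
        signal_power = z.pow(2).mean()
        noise_power = signal_power / (10 ** (self.snr_db / 10))
        z_noisy = z + torch.randn_like(z) * noise_power.sqrt()
        return self.decoder(z_noisy)


if __name__ == "__main__":
    batch, seq_len, dim = 2, 16, 512
    visual_emb = torch.randn(batch, seq_len, dim)   # stand-in for CLIP video features
    tactile_emb = torch.randn(batch, seq_len, dim)  # stand-in for tactile embeddings

    fused = VisualTactileFusion(dim)(visual_emb, tactile_emb)
    received = JSCCCodec(dim)(fused)                # fused features after the noisy channel
    print(received.shape)                           # torch.Size([2, 16, 512])
```

In a full system, the received features would then drive the generative reconstruction stage (e.g., an ImageBind-conditioned decoder) to recover the tactile signal.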


Zhuorui Wang, Mingkai Chen, Nanjing University of Posts and Telecommunications; Xiaoming He, Nanjing University of Posts and Telecommunications; Haitao Zhao, Nanjing University of Posts and Telecommunications; Yun Lin, Harbin Engineering University; Mariam Hussain, National Defense University (NDU), Islamabad; Shahid Mumtaz, Nottingham Trent University, Nottingham NG1 4FQ, U.K.

