Abstract:
A terrain segmentation algorithm based on the fusion of information from multimodal text-visual large models was proposed to enhance the intelligent perception capability of robots in dynamic and complex environments. The algorithm integrated simple linear iterative clustering (SLIC) for image data preprocessing, contrastive language-image pre-training (CLIP) and the segment anything model (SAM) for mask generation, and the Dice coefficient for post-processing. Initially, the original input image was preprocessed with SLIC to obtain image segmentation blocks, and the quality of subsequent masks was improved by adding prompt points, which significantly enhanced terrain classification accuracy. Subsequently, the CLIP large model, pre-trained on text-image data, was used to match the input visual images with predefined terrain text information, leveraging its interpretability and zero-shot learning capabilities to generate sets of terrain prompt points. The SAM large model then generated mask data with semantic labels from these prompt-point sets, and the Dice coefficient was applied in post-processing to select usable masks. Using the Cityscapes dataset as a terrain segmentation sample, the superiority of the proposed algorithm over mainstream segmentation algorithms under both supervised and unsupervised learning frameworks was validated. Without the need for labeled data, the algorithm achieved a mask generation rate of 76.58% and an IoU (intersection over union) of 90.14%. For the terrain perception task of a quadruped robot, a U-Net encoder-decoder network was added as a quantitative validation module. Using the generated masks as a dataset, a lightweight terrain segmentation model was constructed, deployed on the edge computing device of the quadruped robot, and terrain segmentation experiments were conducted in a real-world environment. The experimental results demonstrated that the two mask optimization methods proposed in this paper improved the model’s mean IoU (MIoU) by 2.36% and 2.56%, respectively, with the final lightweight model achieving an MIoU of 96.34%, demonstrating reliable terrain segmentation accuracy. The segmentation algorithm effectively guided the robot to navigate quickly and safely from the starting point to the target location, while avoiding non-geometric obstacles such as grasslands. © 2025 Editorial Board of Journal of Graphics. All rights reserved.
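The abstract reports mask selection by Dice coefficient and evaluation by IoU/MIoU. The following is a minimal NumPy sketch of those two metrics and of a threshold-based mask filter; the reference mask, the select_usable_masks helper, and the 0.8 threshold are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def dice_coefficient(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Dice = 2*|A ∩ B| / (|A| + |B|) for two binary masks."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / denom

def iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """IoU = |A ∩ B| / |A ∪ B| (intersection over union)."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0
    return np.logical_and(a, b).sum() / union

def select_usable_masks(candidate_masks, reference_mask, dice_threshold=0.8):
    """Keep candidate masks (e.g. SAM outputs) whose Dice score against a
    reference region (e.g. the SLIC-derived block that produced the prompt
    points) exceeds a threshold; both the reference and the 0.8 default are
    assumptions made for illustration only."""
    return [m for m in candidate_masks
            if dice_coefficient(m, reference_mask) >= dice_threshold]
```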
Source:
Journal of Graphics
ISSN: 2095-302X
Year: 2025
Issue: 3
Volume: 46
Page: 558-567