Abstract:
Recently, CNN-Transformer hybrid networks have been proposed to resolve either the heavy computational burden of CNNs or the difficulty of training Transformer-based networks. In this work, we design an efficient and effective CNN-Transformer hybrid network for human pose estimation, namely CTHPose. Specifically, a Polarized CNN Module is employed to extract features with plentiful visual semantic clues, which benefits the convergence of the subsequent Transformer encoders. A Pyramid Transformer Module is utilized to build long-term relationships between human body parts with a lightweight structure and lower computational complexity. Establishing long-term relationships requires a large field of view in the Transformer, which leads to a large computational workload. Hence, instead of the entire feature map, we introduce a reorganized small sliding window to provide the required large field of view. Finally, a Heatmap Generator is designed to reconstruct the 2D heatmaps from the 1D keypoint representation, which balances parameters and FLOPs while obtaining accurate predictions. In quantitative comparison experiments with CNN estimators, CTHPose significantly reduces the number of network parameters and GFLOPs while also providing better detection accuracy. Compared with mainstream pure Transformer networks and state-of-the-art CNN-Transformer hybrid networks, this network also achieves competitive performance and, from the visual perspective, is more robust to clothing pattern interference and overlapping limbs. © 2024, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
ISSN: 0302-9743
Year: 2024
Volume: 14429 LNCS
Page: 327-339
Language: English
0.402 (JCR@2005)
ESI Highly Cited Papers on the List: 0