Rethinking the grouping strategy in bottom-up multi-person pose estimation

Sina Moghimi

Abstract

Grouping keypoints into distinct human instances remains a central challenge in multi-person pose estimation, particularly under occlusion and dense crowding. We propose an embedding-based grouping strategy that encodes all keypoints of a single person into a compact 34-dimensional vector. This embedding is predicted at each pixel location by a transformer-based network that processes visual features, alongside a stacked Hourglass network that predicts keypoint presence heatmaps. Because each keypoint is associated with its person-level embedding, the method removes the need for heuristic post-processing during grouping. The shared embedding structure also enables occlusion recovery through voting among visible keypoints. Experiments on the COCO dataset demonstrate competitive accuracy and improved robustness in occluded scenes compared to existing bottom-up approaches.
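The grouping and voting idea can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the 34-dimensional vector holds (x, y) coordinates for the 17 COCO keypoints, and it swaps in a simple greedy threshold for grouping and a per-joint median for voting. All names (`group_keypoints`, `recover_pose`, `threshold`, `demo`) are hypothetical.

```python
from statistics import median

NUM_JOINTS = 17  # assumption: 34 = 17 COCO keypoints * (x, y)


def embedding_distance(a, b):
    """Mean absolute difference between two flat 34-d embeddings."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def group_keypoints(detections, threshold=5.0):
    """Greedily group detections whose person-level embeddings agree.

    detections: list of (joint_id, (x, y), embedding) tuples.
    threshold: hypothetical tolerance in pixels, not taken from the paper.
    """
    groups = []
    for det in detections:
        for group in groups:
            # Compare against the first member of each existing group.
            if embedding_distance(det[2], group[0][2]) < threshold:
                group.append(det)
                break
        else:
            # No existing group matched, so start a new person.
            groups.append([det])
    return groups


def recover_pose(group):
    """Per-joint median vote over the embeddings of the visible keypoints."""
    pose = []
    for j in range(NUM_JOINTS):
        xs = [det[2][2 * j] for det in group]
        ys = [det[2][2 * j + 1] for det in group]
        pose.append((median(xs), median(ys)))
    return pose


def demo():
    """Two synthetic people, each with only three visible joints."""
    person_a = [float(v) for j in range(NUM_JOINTS) for v in (10 + j, 20 + j)]
    person_b = [float(v) for j in range(NUM_JOINTS) for v in (200 + j, 50 + j)]
    detections = []
    for joint_id in (0, 5, 6):
        for person, noise in ((person_a, 0.5), (person_b, -0.5)):
            # Odd-numbered joints get a slightly perturbed embedding.
            emb = [v + noise * (joint_id % 2) for v in person]
            xy = (person[2 * joint_id], person[2 * joint_id + 1])
            detections.append((joint_id, xy, emb))
    groups = group_keypoints(detections)
    return groups, [recover_pose(g) for g in groups]
```

Calling `demo()` groups the six detections into two people and returns a full 17-joint pose for each, including joints that were never detected, because every visible keypoint votes for the location of every joint.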

ISSN: 2307-8162

International Journal of Open Information Technologies