The theoretical exploration of gesture and posture as forms of non-verbal communication can be traced back to the 17th century (Knowlson 1965). However, its integration into the study of the visual arts has been sporadic, as exemplified by Warburg’s analysis of the pathos formula influenced by ancient traditions (Warburg 1998). This scattered focus may be due to the historical reliance on manual data processing and the lack of a systematic vocabulary for categorizing gestures. In response to this research deficit, our study focuses on the quantitative investigation of gesture and posture in the visual arts. This research extends our previous work in Springstein et al. (2022) and Schneider / Vollmer (2023). Our approach involves two modules: Human Pose Estimation (HPE), to identify keypoints in human figures, and Human Pose Retrieval (HPR), to classify the resulting pose configurations based on their proximity in an embedding space. Previous applications of HPE in art history, such as in Impett / Süsstrunk (2016) and Madhu et al. (2023), have been limited to specific data sets, highlighting the need for broader research in this field.
Fig. 1: Our approach to HPE first localizes human figures by bounding boxes; these boxes are then analyzed for keypoints (Springstein et al. 2022).
Following the top-down strategy (Li et al. 2021; Wang / Sun et al. 2021), human figures in HPE are first detected via bounding boxes; these boxes are then analyzed for keypoints. In this way, a machine-readable abstraction of the human skeleton is generated. Fig. 1 shows the overall architecture.
The first phase is based on the Detection Transformer framework (DETR; Carion et al. 2020). A convolutional neural network backbone extracts feature descriptors from the input image, which are enhanced with positional encodings. These descriptors are then converted into a visual feature sequence for processing by a transformer encoder. The encoder’s output is used in the cross-attention modules of the transformer decoder; the decoder’s output is fed into two multilayer perceptron heads: one for distinguishing figures from the background, and the other for regressing the coordinates of the bounding box. The second phase mirrors the first, with the distinction that the head is now tasked with predicting the coordinates of 17 keypoints for each previously identified bounding box.
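To make the two-head design concrete, the following minimal PyTorch sketch maps a DETR-style decoder output to class logits and bounding-box coordinates. Layer sizes, class count, and all names are illustrative assumptions, not the configuration used in our experiments.

```python
import torch
import torch.nn as nn

class DetectionHeads(nn.Module):
    """Two MLP heads on top of a DETR-style transformer decoder: one
    classifies each query as figure vs. background, the other regresses
    normalized bounding-box coordinates (cx, cy, w, h)."""

    def __init__(self, d_model: int = 256, num_classes: int = 2, hidden: int = 256):
        super().__init__()
        self.class_head = nn.Linear(d_model, num_classes)  # figure / background
        self.box_head = nn.Sequential(
            nn.Linear(d_model, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # (cx, cy, w, h), squashed to [0, 1]
        )

    def forward(self, decoder_output: torch.Tensor):
        # decoder_output: (batch, num_queries, d_model)
        logits = self.class_head(decoder_output)
        boxes = self.box_head(decoder_output).sigmoid()
        return logits, boxes
```

In the second phase, the same pattern applies, with the regression head predicting 17 keypoint coordinates per box instead of four box parameters.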
To augment our training data, we incorporate a semi-supervised learning (SSL) approach that exploits the teacher-student paradigm (Xu et al. 2021). In this paradigm, the teacher, whose weights are an exponential moving average of the student’s (Tarvainen et al. 2017), serves as a pseudo-label generator for unlabeled data, producing annotations for bounding boxes and keypoints.
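A minimal sketch of the mean-teacher update, assuming standard PyTorch modules; the momentum value and function name are illustrative.

```python
import torch

@torch.no_grad()
def update_teacher(teacher: torch.nn.Module, student: torch.nn.Module,
                   momentum: float = 0.999):
    """Mean-teacher update (Tarvainen et al. 2017): each teacher weight is
    an exponential moving average of the corresponding student weight."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)
    # The updated teacher then predicts bounding boxes and keypoints on
    # unlabeled images; confident predictions serve as pseudo-labels.
```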
We employ five data sets for model training: COCO 2017 (123,287 images; Lin et al. 2014) is used as the real-world data source. We extend this with a stylized version, created to assess the efficiency of style transfer (ST) techniques (Chen et al. 2021). Art-historical material is included via the People-Art data set, which provides bounding boxes for human figures (4,851 images; Westlake et al. 2016). Also included is our PoPArt data set, introduced in Schneider / Vollmer (2023), which contains keypoints on 2,454 images. Unlabeled data are from ART500K (318,869 images; Mao et al. 2017).
Tab. 1: Results of the first phase of HPE, in which bounding boxes of human figures are detected.
Tab. 2: Results of the second phase of HPE, in which the coordinates of 17 keypoints are predicted for each identified bounding box.
As shown in Tab. 1, the evaluation results indicate a significant improvement in bounding box detection via SSL, outperforming previous approaches (Kadish et al. 2021; Gonthier et al. 2022) in terms of Average Precision (AP) and Average Recall (AR). The difference is even more pronounced for keypoint estimation (Tab. 2).[1] This highlights the advantage of including domain-specific material in training, as opposed to relying on synthetic imagery alone (Madhu et al. 2023).
Fig. 3: In HPR, a query is filtered and transformed into a 320-dimensional embedding (Sun et al. 2020). This embedding is then classified using a support set.
Our three-stage HPR approach translates keypoints into semantic descriptors and classifies them using a compact support set. Fig. 3 shows the overall architecture.
Initially, the HPE-generated whole-body skeleton is split into upper- and lower-body configurations, which form the query q. These configurations are then encoded into 320-dimensional embeddings, serving as descriptors for the identified gestures or postures. Traditional methods that focus on absolute keypoint positions (So / Baciu 2005) or angle-based metrics (Chen et al. 2011) fail to produce viewpoint-invariant embeddings. We address this issue by utilizing the Pr-VIPE architecture (Sun et al. 2020), which generates probabilistic embeddings from viewpoint-augmented keypoints in two-dimensional space.
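The split into upper- and lower-body configurations can be sketched as follows; the exact keypoint partition shown here is an illustrative assumption based on the COCO keypoint order (Lin et al. 2014), not necessarily the partition used in our pipeline.

```python
import numpy as np

# COCO keypoint indices 0-16: nose, eyes, ears, shoulders, elbows,
# wrists, hips, knees, ankles (Lin et al. 2014).
UPPER = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # head, shoulders, arms
LOWER = [11, 12, 13, 14, 15, 16]            # hips, knees, ankles

def split_skeleton(keypoints: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Split a (17, 2) whole-body skeleton into the upper- and
    lower-body configurations that are embedded separately."""
    return keypoints[UPPER], keypoints[LOWER]
```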
Each embedding is then classified. In the absence of a pre-existing data set that explicitly catalogues art-historically significant postures, we construct a taxonomy based on Iconclass (van de Waal 1973–1985). This taxonomy comprises four notation groups (31A23, 31A25, 31A26, 31A27) with 69 relevant sub-notations. Inspired by One-shot Learning (OSL), we identify a representative image for each sub-notation from Wikidata, establish its ground-truth annotation, and produce its embedding; unlike standard OSL methods (e.g., Jadon / Jadon 2020), which typically require separate classifier training, our methodology re-uses the Pr-VIPE embeddings. This support set S is essential to categorizing the configurations: the cosine distance d measures how far the query embedding lies from each embedding in the support set, as sketched below. This ensures a fine-grained representation of the body, even in scenarios with incomplete configuration estimates, and avoids the constrained, semantically ambiguous classification found in agglomerative clustering approaches (Impett / Süsstrunk 2016).
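A minimal sketch of the support-set classification, assuming precomputed Pr-VIPE embeddings; the variable names and top-k interface are illustrative.

```python
import numpy as np

def classify(query_emb: np.ndarray, support_embs: np.ndarray,
             support_labels: list[str], k: int = 1):
    """Assign Iconclass sub-notations by cosine distance d to the
    support set S (one embedding per sub-notation)."""
    q = query_emb / np.linalg.norm(query_emb)
    s = support_embs / np.linalg.norm(support_embs, axis=1, keepdims=True)
    d = 1.0 - s @ q                      # cosine distance to each support item
    nearest = np.argsort(d)[:k]
    return [(support_labels[i], float(d[i])) for i in nearest]
```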
For data acquisition, we queried 644,155 art-historical objects from Wikidata’s SPARQL endpoint.[2] This involved a two-step process: extracting ‘class entities,’ i.e., subclasses of the nodes “visual artwork” (wd:Q4502142) or “artwork series” (wd:Q15709879), followed by querying for ‘object entities’ with associated two-dimensional reproductions (wdt:P18).
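A hedged sketch of this process against the public Wikidata endpoint; the SPARQL shown is an illustrative reconstruction, not the exact query used in the study.

```python
import requests

# Illustrative reconstruction: traverse subclasses of "visual artwork"
# (wd:Q4502142) and retrieve object entities with an image (wdt:P18).
QUERY = """
SELECT ?object ?image WHERE {
  ?class wdt:P279* wd:Q4502142 .
  ?object wdt:P31 ?class ;
          wdt:P18 ?image .
}
LIMIT 100
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "hpr-example/0.1"},  # hypothetical agent string
)
for row in response.json()["results"]["bindings"]:
    print(row["object"]["value"], row["image"]["value"])
```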
Our evaluation targets the embedding space at both aggregate and individual levels. The data set comprises the 644,155 objects from Wikidata processed through the HPE and HPR pipelines, of which 385,481 were identified as relevant, yielding a total of 2,355,592 figures. In the absence of a labeled test data set, the evaluation of HPR is purely qualitative.
Fig. 4: The dimensionally reduced embedding space clearly reveals two closely spaced groups representing upper- and lower-body configurations.
The embedding space is analyzed by reducing the 320 dimensions of the whole-body skeleton embeddings to two. To avoid the spurious clusters commonly associated with standard dimension reduction techniques such as UMAP (McInnes et al. 2018), we apply Pairwise Controlled Manifold Approximation Projection (PaCMAP; Wang / Huang et al. 2021). Fig. 4 illustrates the resulting embedding space, with two apparent cluster-like formations corresponding predominantly to upper- and lower-body configurations, specifically arm and leg postures. These are further defined using the Iconclass-annotated support set. This classification not only facilitates navigation within the embedding space, but also helps to quickly identify potential cluster formations. Notably, inaccuracies in whole-body skeleton estimates are most prevalent in the sparsely populated intermediate regions that merge towards the center.
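The reduction can be reproduced with the reference PaCMAP implementation; the random input below merely stands in for the (N, 320) whole-body embeddings.

```python
# pip install pacmap
import numpy as np
import pacmap

# Stand-in for the whole-body Pr-VIPE embeddings (random, so the sketch
# runs self-contained; in practice these come from the HPR pipeline).
embeddings = np.random.rand(5000, 320).astype(np.float32)

# Reduce 320 dimensions to 2 for visualization (cf. Fig. 4).
reducer = pacmap.PaCMAP(n_components=2)
xy = reducer.fit_transform(embeddings)  # (5000, 2) plot coordinates
```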
Fig. 5: Retrieval results for the query figure (left) from Tissot’s Le Coup de Lance (1886–1894), with the estimated keypoints shown in green.
For the HPR of individual queries, we create an index structure containing the figures’ 320-dimensional embeddings. Hierarchical Navigable Small World (HNSW; Malkov / Yashunin 2020), an approximate k-nearest-neighbor method with polylogarithmic complexity, is utilized for this purpose. It demonstrates enhanced precision and recall compared to other approximate-nearest-neighbor methods such as Faiss (Johnson et al. 2021), as affirmed by Aumueller et al. (2023). For our analysis, we choose a figure from James Tissot’s Le Coup de Lance (1886–1894): the crucified thief to Christ’s right. Fig. 5 displays the results at a short distance from the query. Most of them show figures from crucifixion scenes, primarily characterized by upper-body configurations in either a T- or Y-shape. An interesting misrecognition occurs in Jacques-Louis David’s Napoleon at the Great St Bernard (1801; third row, third image from the left). Here, the HPE misinterprets the front part of a rearing horse as a human figure with arms outstretched and legs bent, a posture that in the HPR closely resembles that of the depicted thief.
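A minimal sketch of the index construction with the hnswlib library; the parameters (M, ef_construction, k) and random data are illustrative, whereas in the study the index holds the embeddings of all 2,355,592 detected figures.

```python
import hnswlib
import numpy as np

dim = 320
embeddings = np.random.rand(1000, dim).astype(np.float32)  # stand-in data

index = hnswlib.Index(space="cosine", dim=dim)  # cosine distance, as in HPR
index.init_index(max_elements=len(embeddings), ef_construction=200, M=16)
index.add_items(embeddings, np.arange(len(embeddings)))

# Retrieve the 20 approximate nearest figures to a query embedding.
labels, distances = index.knn_query(embeddings[0], k=20)
```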
In conclusion, our three-stage HPR, in conjunction with semi-supervised HPE, provides a solid foundation for the quantitative study of gesture and posture in the visual arts. By capturing the human skeleton in a viewpoint-invariant 320-dimensional embedding, and storing upper- and lower-body configurations separately, our methodology allows for fine-grained exploration and typification of the human body, even with partial configuration estimates.
[1] In contrast to Springstein et al. (2022), the analysis is performed on the fully annotated PoPArt data set.