The theoretical exploration of gesture and posture as forms of non-verbal communication can be traced back to the 17th century (Knowlson 1965). However, its integration into the study of the visual arts has been sporadic, as exemplified by Warburg’s analysis of the pathos formula influenced by ancient traditions (Warburg 1998). This scattered focus may be due to the historical reliance on manual data processing and the lack of a systematic vocabulary for categorizing gestures. In response to this research deficit, our study focuses on the quantitative investigation of gesture and posture in the visual arts. This research extends our previous work in Springstein et al. (2022) and Schneider / Vollmer (2023). Our approach involves two modules: Human Pose Estimation (HPE), to identify keypoints in human figures, and Human Pose Retrieval (HPR), to classify the resulting pose configurations based on their proximity in an embedding space. Previous applications of HPE in art history, such as in Impett / Süsstrunk (2016) and Madhu et al. (2023), have been limited to specific data sets, highlighting the need for broader research in this field.
Fig. 1: Our approach to HPE first localizes human figures by bounding boxes; these boxes are then analyzed for keypoints (Springstein et al. 2022).
Following the top-down strategy (Li et al. 2021; Wang / Sun et al. 2021), human figures in HPE are first detected via bounding boxes; these boxes are then analyzed for keypoints. In this way, a machine-readable abstraction of the human skeleton is generated. Fig. 1 shows the overall architecture.
The first phase is based on the Detection Transformer framework (DETR; Carion et al. 2020). A convolutional neural network backbone extracts feature descriptors from the input image, which are enhanced with positional encodings. These descriptors are then converted into a visual feature sequence for processing by a transformer encoder. The encoder’s output is used in the cross-attention modules of the transformer decoder; the decoder’s output is fed into two multilayer perceptron heads: one for distinguishing figures from the background, and the other for regressing the coordinates of the bounding box. The second phase mirrors the first, with the distinction that the head is now tasked with predicting the coordinates of 17 keypoints for each previously identified bounding box.
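To make the two-head design concrete, the following minimal PyTorch sketch maps a DETR-style decoder output to class logits and bounding-box coordinates. Layer sizes, class count, and all names are illustrative assumptions, not the configuration used in our experiments.

```python
import torch
import torch.nn as nn

class DetectionHeads(nn.Module):
    """Two MLP heads on top of a DETR-style transformer decoder: one
    classifies each query as figure vs. background, the other regresses
    normalized bounding-box coordinates (cx, cy, w, h)."""

    def __init__(self, d_model: int = 256, num_classes: int = 2, hidden: int = 256):
        super().__init__()
        self.class_head = nn.Linear(d_model, num_classes)  # figure / background
        self.box_head = nn.Sequential(
            nn.Linear(d_model, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # (cx, cy, w, h), squashed to [0, 1]
        )

    def forward(self, decoder_output: torch.Tensor):
        # decoder_output: (batch, num_queries, d_model)
        logits = self.class_head(decoder_output)
        boxes = self.box_head(decoder_output).sigmoid()
        return logits, boxes
```

In the second phase, the same pattern applies, with the regression head predicting 17 keypoint coordinates per box instead of four box parameters.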
To augment our training data, we incorporate a semi-supervised learning (SSL) approach that exploits the teacher-student paradigm (Xu et al. 2021). In this paradigm, the teacher, whose weights are an exponential moving average of the student’s (Tarvainen et al. 2017), serves as a pseudo-label generator for unlabeled data, producing annotations for bounding boxes and keypoints.
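A minimal sketch of the mean-teacher update, assuming standard PyTorch modules; the momentum value and function name are illustrative.

```python
import torch

@torch.no_grad()
def update_teacher(teacher: torch.nn.Module, student: torch.nn.Module,
                   momentum: float = 0.999):
    """Mean-teacher update (Tarvainen et al. 2017): each teacher weight is
    an exponential moving average of the corresponding student weight."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)
    # The updated teacher then predicts bounding boxes and keypoints on
    # unlabeled images; confident predictions serve as pseudo-labels.
```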
We employ five data sets for model training: COCO 2017 (123,287 images; Lin et al. 2014) is used as the real-world data source. We extend this with a stylized version, created to assess the efficiency of style transfer (ST) techniques (Chen et al. 2021). Art-historical material is included via the People-Art data set, which provides bounding boxes for human figures (4,851 images; Westlake et al. 2016). Also included is our PoPArt data set, introduced in Schneider / Vollmer (2023), which contains keypoints on 2,454 images. Unlabeled data are from ART500K (318,869 images; Mao et al. 2017).
Tab. 1: Results of the first phase of HPE, in which bounding boxes of human figures are detected.
Tab. 2: Results of the second phase of HPE, in which the coordinates of 17 keypoints are predicted for each identified bounding box.
As shown in Tab. 1, the evaluation results indicate a significant improvement in bounding box detection via SSL, outperforming previous approaches (Kadish et al. 2021; Gonthier et al. 2022) in terms of Average Precision (AP) and Average Recall (AR). The difference is even more pronounced for keypoint estimation (Tab. 2).[1] This highlights the advantage of including domain-specific material in training, as opposed to relying on synthetic imagery alone (Madhu et al. 2023).
Fig. 3: In HPR, a query is filtered and transformed into a 320-dimensional embedding (Sun et al. 2020). This embedding is then classified using a support set.
Our three-stage HPR approach translates keypoints into semantic descriptors and classifies them using a compact support set. Fig. 3 shows the overall architecture.
Initially, the HPE-generated whole-body skeleton is split into upper- and lower-body configurations, which form the query q. These configurations are then encoded into 320-dimensional embeddings, serving as descriptors for the identified gestures or postures. Traditional methods that focus on absolute keypoint positions (So / Baciu 2005) or angle-based metrics (Chen et al. 2011) fail to produce viewpoint-invariant embeddings. We address this issue by utilizing the Pr-VIPE architecture (Sun et al. 2020), which generates probabilistic embeddings from viewpoint-augmented keypoints in two-dimensional space.
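The split into upper- and lower-body configurations can be sketched as follows; the exact keypoint partition shown here is an illustrative assumption based on the COCO keypoint order (Lin et al. 2014), not necessarily the partition used in our pipeline.

```python
import numpy as np

# COCO keypoint indices 0-16: nose, eyes, ears, shoulders, elbows,
# wrists, hips, knees, ankles (Lin et al. 2014).
UPPER = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # head, shoulders, arms
LOWER = [11, 12, 13, 14, 15, 16]            # hips, knees, ankles

def split_skeleton(keypoints: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Split a (17, 2) whole-body skeleton into the upper- and
    lower-body configurations that are embedded separately."""
    return keypoints[UPPER], keypoints[LOWER]
```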
Each embedding is then classified. In the absence of a pre-existing data set that explicitly catalogues art-historically significant postures, we construct a taxonomy based on Iconclass (van de Waal 1973–1985). This taxonomy comprises four notation groups (31A23, 31A25, 31A26, 31A27) with 69 relevant sub-notations. Inspired by One-shot Learning (OSL), we identify a representative image for each sub-notation from Wikidata, establish its ground-truth annotation, and produce its embedding; unlike standard OSL methods (e.g., Jadon / Jadon 2020), which typically require separate classifier training, our methodology re-uses the Pr-VIPE embeddings. This support set S is essential to categorizing the configurations: the cosine distance d measures how far the query embedding lies from each embedding in the support set, as sketched below. This ensures a fine-grained representation of the body, even in scenarios with incomplete configuration estimates, and avoids the constrained, semantically ambiguous classification found in agglomerative clustering approaches (Impett / Süsstrunk 2016).
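A minimal sketch of the support-set classification, assuming precomputed Pr-VIPE embeddings; the variable names and top-k interface are illustrative.

```python
import numpy as np

def classify(query_emb: np.ndarray, support_embs: np.ndarray,
             support_labels: list[str], k: int = 1):
    """Assign Iconclass sub-notations by cosine distance d to the
    support set S (one embedding per sub-notation)."""
    q = query_emb / np.linalg.norm(query_emb)
    s = support_embs / np.linalg.norm(support_embs, axis=1, keepdims=True)
    d = 1.0 - s @ q                      # cosine distance to each support item
    nearest = np.argsort(d)[:k]
    return [(support_labels[i], float(d[i])) for i in nearest]
```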
For data acquisition, we queried 644,155 art-historical objects from Wikidata’s SPARQL endpoint.[2] This involved a two-step process: extracting ‘class entities,’ i.e., subclasses of the nodes “visual artwork” (wd:Q4502142) or “artwork series” (wd:Q15709879), followed by querying for ‘object entities’ with associated two-dimensional reproductions (wdt:P18).
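A hedged sketch of this process against the public Wikidata endpoint; the SPARQL shown is an illustrative reconstruction, not the exact query used in the study.

```python
import requests

# Illustrative reconstruction: traverse subclasses of "visual artwork"
# (wd:Q4502142) and retrieve object entities with an image (wdt:P18).
QUERY = """
SELECT ?object ?image WHERE {
  ?class wdt:P279* wd:Q4502142 .
  ?object wdt:P31 ?class ;
          wdt:P18 ?image .
}
LIMIT 100
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "hpr-example/0.1"},  # hypothetical agent string
)
for row in response.json()["results"]["bindings"]:
    print(row["object"]["value"], row["image"]["value"])
```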
Our evaluation targets the embedding space at both aggregate and individual levels. The data set comprises the 644,155 objects from Wikidata processed through the HPE and HPR pipelines, of which 385,481 were identified as relevant, yielding a total of 2,355,592 figures. In the absence of a labeled test data set, the evaluation of HPR is purely qualitative.
Fig. 4: The dimensionally reduced embedding space clearly reveals two closely spaced groups representing upper- and lower-body configurations.
The embedding space is analyzed by reducing the 320 dimensions of the whole-body skeleton embeddings to two. To avoid the spurious clusters commonly associated with standard dimension reduction techniques such as UMAP (McInnes et al. 2018), we apply Pairwise Controlled Manifold Approximation Projection (PaCMAP; Wang / Huang et al. 2021). Fig. 4 illustrates the resulting embedding space, with two apparent cluster-like formations corresponding predominantly to upper- and lower-body configurations, specifically arm and leg postures. These are further defined using the Iconclass-annotated support set. This classification not only facilitates navigation within the embedding space, but also helps to quickly identify potential cluster formations. Notably, inaccuracies in whole-body skeleton estimates are most prevalent in the sparsely populated intermediate regions that merge towards the center.
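The reduction can be reproduced with the reference PaCMAP implementation; the random input below merely stands in for the (N, 320) whole-body embeddings.

```python
# pip install pacmap
import numpy as np
import pacmap

# Stand-in for the whole-body Pr-VIPE embeddings (random, so the sketch
# runs self-contained; in practice these come from the HPR pipeline).
embeddings = np.random.rand(5000, 320).astype(np.float32)

# Reduce 320 dimensions to 2 for visualization (cf. Fig. 4).
reducer = pacmap.PaCMAP(n_components=2)
xy = reducer.fit_transform(embeddings)  # (5000, 2) plot coordinates
```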
Fig. 5: Retrieval results for the query figure (left) from Tissot’s Le Coup de Lance (1886–1894), with the estimated keypoints shown in green.
For the HPR of individual queries, we create an index structure containing the figures’ 320-dimensional embeddings. Hierarchical Navigable Small World (HNSW; Malkov / Yashunin 2020), an approximate k-nearest-neighbor method with polylogarithmic complexity, is utilized for this purpose. It demonstrates enhanced precision and recall compared to other approximate-nearest-neighbor methods such as Faiss (Johnson et al. 2021), as affirmed by Aumueller et al. (2023). For our analysis, we choose a figure from James Tissot’s Le Coup de Lance (1886–1894): the crucified thief to Christ’s right. Fig. 5 displays the results at a short distance from the query. Most of them show figures from crucifixion scenes, primarily characterized by upper-body configurations in either a T- or Y-shape. An interesting misrecognition occurs in Jacques-Louis David’s Napoleon at the Great St Bernard (1801; third row, third image from the left). Here, the HPE misinterprets the front part of a rearing horse as a human figure with arms outstretched and legs bent, a posture that in the HPR closely resembles that of the depicted thief.
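A minimal sketch of the index construction with the hnswlib library; the parameters (M, ef_construction, k) and random data are illustrative, whereas in the study the index holds the embeddings of all 2,355,592 detected figures.

```python
import hnswlib
import numpy as np

dim = 320
embeddings = np.random.rand(1000, dim).astype(np.float32)  # stand-in data

index = hnswlib.Index(space="cosine", dim=dim)  # cosine distance, as in HPR
index.init_index(max_elements=len(embeddings), ef_construction=200, M=16)
index.add_items(embeddings, np.arange(len(embeddings)))

# Retrieve the 20 approximate nearest figures to a query embedding.
labels, distances = index.knn_query(embeddings[0], k=20)
```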
In conclusion, our three-stage HPR, in conjunction with semi-supervised HPE, provides a solid foundation for the quantitative study of gesture and posture in the visual arts. By capturing the human skeleton in a viewpoint-invariant 320-dimensional embedding, and storing upper- and lower-body configurations separately, our methodology allows for fine-grained exploration and typification of the human body, even with partial configuration estimates.
[1] In contrast to Springstein et al. (2022), the analysis is performed on the fully annotated PoPArt data set.