Deep Learning for the Identification of Ex-libris Stamps (Zoshoin) in Old Japanese Books

1. Introduction
Ex-libris stamps (Zoshoin) in old Japanese books provide valuable information about a book's provenance, such as its past owners. Identifying ex-libris stamps is therefore an essential task in librarians' cataloguing process. Identification, however, is challenging because librarians must find the identical or most similar stamp in a stamp database. Our research therefore aims to develop a stamp database system that retrieves the stamps most similar to a query using computer vision and deep learning techniques, thereby helping librarians and literature experts identify stamps with less effort and cost. Deep learning techniques have already been employed to segment and retrieve Japanese ex-libris stamps: Li et al. extract individual characters from stamps and search for similar stamps based on character similarity [3]. In contrast, our approach uses deep learning-based visual features of the whole stamp to search by visual similarity. A similar task also arises in other cultures, such as digital sigillography [1], and the deep learning method developed in this paper could be applied to these cultural artefacts as well.

2. Methods
Dataset. The National Institute of Japanese Literature created the ex-libris stamp dataset with 16,619 entries. Each entry has metadata, a cropped image around the stamp, and the page image containing the stamp. The metadata includes a label, the transcription of the characters on the stamp, and the proposed system uses this label as the retrieval target. After grouping entries by label and performing data cleaning, the dataset contains 5,973 distinct labels. We created four experimental datasets by varying the minimum number of images per label. Table 1 shows that labels with more than four images cover only six percent of the dataset, and about 80% of labels have only one image. These experimental datasets allow us to analyse the relationship between the number of images per label and retrieval performance. Our goal was to create a system that analyses the images directly without much alteration. For this reason, we kept pre-processing to a minimum, only performing data cleaning, and relied on augmentations such as rotation and translation.
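The construction of the experimental datasets can be sketched as follows: group entries by label and keep only the labels meeting a minimum image count. This is a minimal illustration with hypothetical record and function names, not the authors' actual pipeline code.

```python
from collections import Counter

def filter_by_min_images(records, min_images):
    """Keep only entries whose label has at least `min_images` images.

    `records` is a list of (image_id, label) pairs. Applying this with
    thresholds 1..4 would yield four nested experimental datasets.
    """
    counts = Counter(label for _, label in records)
    return [(img, lab) for img, lab in records if counts[lab] >= min_images]

# Hypothetical toy data: label "A" has three images, label "B" has one.
records = [("img1", "A"), ("img2", "A"), ("img3", "A"), ("img4", "B")]
print(filter_by_min_images(records, 2))
# → [('img1', 'A'), ('img2', 'A'), ('img3', 'A')]
```

With a threshold of 1 every entry is kept, matching the observation that singleton labels dominate the full dataset.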

System.
We aim to develop a system that receives a cropped stamp image as a query and retrieves the most similar images from the database. In the preparation stage, we encode every entry in the database into a feature vector using a deep learning model, specifically a VGG16 variant [2]. The model is pre-trained on ImageNet and fine-tuned on a stamp label classification task for 20 epochs. After training, we remove the final classification layers and use the adaptive average pooling layer to extract feature vectors. We tested other deep learning models, but their performance was very similar, so we chose VGG16 because it gave slightly better results. In the retrieval stage, a query image is encoded with the same model, and similarity is computed as the cosine similarity between the query's feature vector and those of the database entries. The results are sorted in descending order of similarity, and the Top-N most similar images with distinct labels are returned.
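The retrieval stage described above can be sketched as follows: normalise the feature vectors, compute cosine similarities, sort in descending order, and keep only the first occurrence of each label. This is a minimal NumPy sketch with hypothetical function and variable names; the feature vectors would come from the fine-tuned VGG16.

```python
import numpy as np

def retrieve_top_n(query_vec, db_vecs, db_labels, n=10):
    """Return the top-n most similar distinct labels by cosine similarity.

    db_vecs: (num_entries, dim) array of feature vectors;
    db_labels: one label string per database entry.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
    sims = d @ q                       # cosine similarity per entry
    order = np.argsort(-sims)          # indices in descending similarity
    results, seen = [], set()
    for i in order:
        if db_labels[i] not in seen:   # keep distinct labels only
            seen.add(db_labels[i])
            results.append((db_labels[i], float(sims[i])))
        if len(results) == n:
            break
    return results

# Toy 2-D feature vectors: two entries share label "A".
db = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
labels = ["A", "B", "A"]
print(retrieve_top_n(np.array([1.0, 0.0]), db, labels, n=2))
```

Because duplicate labels are collapsed, the Top-N list shown to the librarian always contains N distinct candidate stamps.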


Table 1. Four experimental datasets for the minimum number of images per label


Figure 1. Examples of cropped stamp images and their respective labels.

3. Results
We used Top-N accuracy to evaluate the model’s performance on a test set. For a single query, this metric is 1 if the correct stamp appears among the first N retrieved stamps and 0 otherwise. In the following experiments, we chose N = 10 to reflect the needs of a librarian’s cataloguing process: the correct stamp does not have to appear at the top, but it should be listed among a small number of candidates, such as 10, and it remains the librarian’s responsibility to pick the correct one. To train the VGG16 model on the four experimental datasets, each dataset was split randomly into training (60%), validation (20%), and test (20%) sets, and the model was trained for 20 epochs. Table 2 summarizes the performance on the experimental datasets. Dataset D2 achieved a Top-10 accuracy of 0.908 and a Top-1 accuracy of 0.835. In addition, Dataset D4 showed the best result, which suggests that having more images per label leads to higher performance. The following figures show examples of stamp retrieval. Figure 2 shows a successful case where the correct label is at the top. Figure 3 shows a more difficult case, where the target stamp partially overlaps with other stamps. Figure 4 shows the most challenging case, where stamps vary greatly in shape within a single label.
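The Top-N accuracy metric described above averages the per-query hit indicator over the test set. A minimal sketch, with hypothetical function and variable names:

```python
def top_n_accuracy(retrieved_labels_per_query, true_labels, n=10):
    """Fraction of queries whose correct label appears in the first n results.

    retrieved_labels_per_query: one ranked label list per query;
    true_labels: the correct label for each query.
    """
    hits = sum(
        1
        for retrieved, true in zip(retrieved_labels_per_query, true_labels)
        if true in retrieved[:n]
    )
    return hits / len(true_labels)

# Toy example: the first query's correct label is ranked second,
# the second query's correct label is missing from the list.
retrieved = [["A", "B", "C"], ["X", "Y", "Z"]]
true = ["B", "Q"]
print(top_n_accuracy(retrieved, true, n=3))  # → 0.5
```

Note that Top-1 accuracy is the special case n = 1, so both numbers in Table 2 come from the same ranked lists.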


Table 2. Performance of Top-1 and Top-10 for experimental datasets.


Figure 2. An example of stamp retrieval. A successful case where the correct label is at the top.


Figure 3. An example of stamp retrieval. A partially successful case where the correct label is in second place.


Figure 4. An example of stamp retrieval. A failed case where the correct label is not in the list.

4. Discussion
The results show that our proposed system can provide librarians with a short list of candidate stamps for a query stamp, with a Top-10 accuracy above 0.9. This means that in more than 90 percent of cases, users can find the correct stamp in a list of 10 labels. Future work includes data pre-processing techniques such as data augmentation, feature extraction using contrastive learning, and object detection to simultaneously extract and identify an ex-libris stamp from a full-page image. The last direction would be especially beneficial for librarians in reducing the burden of stamp identification.

5. Acknowledgements
We would like to thank the National Institute of Japanese Literature, particularly Prof. Kigoshi Shunsuke and Prof. Matsunaga Ryusei, for providing the data and advice as literature experts.

Irene Gentilini (irene.gentilini99@gmail.com), Alma Mater Studiorum Università di Bologna, Italy; National Institute of Informatics, Japan and Asanobu Kitamoto (kitamoto@nii.ac.jp), National Institute of Informatics, Japan