PicAxe: Creating an Open-source Image Extraction Tool for Large and Diverse Corpora of Text-Image PDF Documents

Humanities researchers want to analyze images that exist in text-image environments (journal articles, books, newspapers, letters) with computational tools. To do so, researchers must typically separate images from accompanying text. Researchers can manually extract images from digital text-image environments (e.g., with “screenshot,” “snipping,” and “crop” tools on Mac or PC; “crop” and “marquee” tools in Adobe and Microsoft products). Manual extraction becomes time-consuming, tedious, and prone to error for researchers who want to extract hundreds to thousands of images from hundreds to thousands of PDF files.

We are developing a Python-based image extraction tool, called PicAxe, that humanities researchers can easily apply to large corpora of diverse text-image PDF documents. The tool automatically extracts images, such as diagrams, photographs, graphs, and tables, from large corpora of PDF files. It will be integrated into the Giles Ecosystem (Damerow et al. 2017) so that researchers with no coding experience can upload large corpora of digital documents for storage, OCR, and text and image extraction. We will release the code under an open-source license, and the source will remain accessible via GitHub.

Currently, our tool (1) converts PDF file pages into individual binary .png files, (2) applies pytesseract to detect text in the .png files and remove it, (3) applies a combination of Otsu thresholding and contour detection to the text-free .png files to identify the boundaries of the remaining image marks, (4) filters out small-area boundaries and merges the rest to reduce noise and identify whole individual images, and (5) extracts any images and text from the original PDF pages that fall within the identified boundaries (Figure 1). Any researcher can download and use the code from GitHub (https://github.com/acguerr1/imageextraction).

Figure 1. PicAxe image extraction workflow (as of June 2024).
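As a concrete illustration of this five-step workflow, here is a minimal sketch, assuming PyMuPDF for page rasterization, pytesseract for text detection, and OpenCV for Otsu thresholding and contour detection. The function name and the MIN_AREA threshold are our illustrative choices, not PicAxe's actual identifiers, and the box-merging in step (4) is simplified here to area filtering alone.

```python
# Minimal sketch of a PicAxe-style pipeline; names and thresholds
# are illustrative assumptions, not PicAxe's actual code.
import cv2
import numpy as np
import pytesseract
import fitz  # PyMuPDF, used here to rasterize pages

MIN_AREA = 5000  # assumed noise threshold; tune per corpus

def extract_page_images(pdf_path, page_num, dpi=300):
    # (1) Rasterize one PDF page to a grayscale array.
    doc = fitz.open(pdf_path)
    pix = doc[page_num].get_pixmap(dpi=dpi, colorspace=fitz.csGRAY)
    page = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width)
    scrubbed = page.copy()

    # (2) Detect text with Tesseract and white out each detected word.
    data = pytesseract.image_to_data(page, output_type=pytesseract.Output.DICT)
    for i, word in enumerate(data["text"]):
        if word.strip():
            x, y, w, h = (data[k][i] for k in ("left", "top", "width", "height"))
            scrubbed[y:y + h, x:x + w] = 255

    # (3) Otsu threshold the text-free page, then find contours of what remains.
    _, binary = cv2.threshold(scrubbed, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    # (4) Drop small (noisy) regions. PicAxe additionally merges nearby
    # boxes into whole figures; that step is omitted here for brevity.
    boxes = [cv2.boundingRect(c) for c in contours]
    # (5) Crop the surviving regions from the untouched page render.
    return [page[y:y + h, x:x + w] for x, y, w, h in boxes if w * h >= MIN_AREA]
```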

Our tool is different from existing free or open-source tools for automatic image extraction (e.g., PyMuPDF, Pillow, and pdfplumber) in three ways:

  1. Other tools are inaccessible to humanities researchers who lack the time or programming skills necessary to understand and implement them; by integrating our tool into Giles, humanities researchers will be able to use it without writing code themselves.
  2. Other tools do not take into account that PDF files encode image data differently depending on how the file was produced. Many older documents exist digitally as scans or photographs of hardcopies; age and method of digital preservation affect the quality of text and image data, which can appear warped, tilted, incomplete, or marred by aberrations. Newer text-image documents are frequently generated and disseminated as “born digital” PDFs in which images are embedded as XObjects. While extracting individual XObjects from newer PDFs is straightforward, images are often, and inconspicuously, embedded as multiple XObjects, such that existing tools extract nonsensical or unwanted fractions of images (see the sketch following this list). Existing tools therefore cannot reliably extract images from corpora that include older PDF files of variable scan quality or newer PDF files that encode single images as multiple XObjects.
  3. Other tools do not take into account that documents from different time periods and cultures use different text and image layouts and styles. Computer scientists are actively improving document layout analysis and image detection in text-image environments using machine learning (Binmakhashen / Mahmoud 2019; Subramani et al. 2020; Yu et al. 2023). However, these tools are trained on stylistically homogeneous PDF file data (e.g., only PDFs of “born digital” scientific journal articles). It is not clear that these extraction algorithms will be useful for the large corpora of stylistically heterogeneous text-image environments common to humanities research.
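To illustrate the multiple-XObject problem in point (2), the sketch below, assuming PyMuPDF and a hypothetical filename, lists every image XObject each page references. In many born-digital PDFs a single visible figure corresponds to several of these entries, which is why naive XObject-by-XObject extraction returns fragments.

```python
# Minimal sketch: enumerate the image XObjects each page references.
# PyMuPDF is assumed; "example.pdf" is a hypothetical filename.
import fitz  # PyMuPDF

doc = fitz.open("example.pdf")
for page in doc:
    # get_images(full=True) yields one tuple per image XObject reference.
    for img in page.get_images(full=True):
        xref = img[0]
        info = doc.extract_image(xref)  # raw image bytes plus basic metadata
        print(f"page {page.number}: xref {xref}, "
              f"{info['width']}x{info['height']} {info['ext']}")
    # A figure that appears once on the page may show up here as several
    # XObjects (e.g., image strips or separate mask layers).
```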

In summary, our tool aims to manage the variation in the extractability of image data inherent to the PDF file format and to the stylistically diverse text-image environments with which humanities researchers commonly work.

So far, we have tested the tool on a corpus of 286 scientific journal articles and book chapters published from 1929 to 1974, representing an early history of the microbial biofilm concept. The tool generally extracts diagrams, photographs, and graphs successfully, but under-extracts tables, for a document-level accuracy of ~76% compared to manual extraction. This aggregate figure does not fully reflect extraction accuracy, as the mix of over- and under-extraction of individual images varies widely from PDF to PDF and corpus to corpus.

We are working to improve the tool. It currently extracts ~516 images per hour, and we aim to shorten overall extraction time. We are testing additional corpora and adding functionality, including automatic removal of scanned page borders and line breaks that prevent successful image extraction. The tool will also generate a .csv file with relevant metadata for the extracted images (a sketch of such a record follows). We seek input on improving the tool.
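As a hedged illustration of the planned metadata output, the sketch below writes one row per extracted image; every column name and value here is an assumption of ours, not a finalized PicAxe schema.

```python
# Illustrative sketch of the planned per-image metadata CSV.
# All column names and values are assumptions, not a finalized schema.
import csv

rows = [
    {
        "source_pdf": "example_1954.pdf",          # hypothetical source document
        "page": 3,                                  # page the image was found on
        "bbox": "120,88,640,410",                   # detected bounding box (x, y, w, h)
        "image_file": "example_1954_p3_img1.png",   # extracted image file
    },
]

with open("image_metadata.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["source_pdf", "page", "bbox", "image_file"])
    writer.writeheader()
    writer.writerows(rows)
```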

Appendix A

Bibliography
  1. Binmakhashen, Galal M. / Mahmoud, Sabri A. (2019): “Document layout analysis: a comprehensive survey”, in: ACM Computing Surveys 52, 6: 1–36. https://doi.org/10.1145/3355610.
  2. Damerow, Julia / Peirson, Erick B. R. / Laubichler, Manfred D. (2017): “The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents”, in: Journal of Open Research Software 5: 26. https://doi.org/10.5334/jors.164.
  3. Subramani, Nishant / Matton, Alexandre / Greaves, Malcolm / Lam, Adrian (2020): “A survey of deep learning approaches for OCR and document understanding” [Paper presentation], in: ML Retrospectives, Surveys & Meta-analyses (ML-RSA), Vancouver, Canada. https://doi.org/10.48550/arXiv.2011.13534.
  4. Yu, Fengchang / Huang, Jiani / Luo, Zhuoran / Zhang, Li / Lu, Wei (2023): “An effective method for figures and tables detection in academic literature”, in: Information Processing & Management 60, 3: 103286. https://doi.org/10.1016/j.ipm.2023.103286.
Anna Clemencia Guerrero (acg@santafe.edu), Santa Fe Institute, United States of America; Aaron Dinner (dinner@uchicago.edu), University of Chicago, United States of America; Krishna Kamath (kamathk@uchicago.edu), University of Chicago, United States of America; Julia Damerow (jdamerow@asu.edu), Arizona State University, United States of America