Machine-Readable Figures from Historical Scientific Literature

document layout analysis and machine learning methods to digitize historical articles

Funded by a NASA grant and in collaboration with the World Wide Telescope project, my group is working to digitize archival scientific images from papers published before ~1950.

Our current pipeline is outlined below. Blue squares indicate processing modules, and green squares indicate data products. Squares with dashed outlines indicate modules still under development. The yellow region denotes modules that can be cross-checked against results from the Zooniverse citizen science project.

Overall pipeline of this project (Naiman, 2020).

We can combine these results with an image-finding algorithm to perform a preliminary layout analysis, as seen below.

On the left: the text layout is analyzed to select text likely associated with the figure caption (green squares) of each figure (red squares); text likely associated with the x and y labels is shown in blue. On the right: the results of optical character recognition performed on Figure 1, which is labeled by the left-most green box (J. P. Naiman et al., 2022; J. P. Naiman et al., 2023).
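As a rough illustration of the caption-matching idea, the sketch below pairs each detected figure box with the nearest text box that sits just beneath it and overlaps it horizontally. The box format, gap threshold, and function name are illustrative assumptions, not the published algorithm.

```python
# A minimal, self-contained sketch of associating text blocks with detected
# figure regions by proximity; boxes are (x0, y0, x1, y1) in pixel coordinates
# with y increasing downward. The gap threshold is a placeholder.
def associate_captions(figure_boxes, text_boxes, max_gap=50):
    """Pair each figure with the closest text box starting just below it."""
    pairs = {}
    for fi, (fx0, fy0, fx1, fy1) in enumerate(figure_boxes):
        best, best_gap = None, max_gap
        for ti, (tx0, ty0, tx1, ty1) in enumerate(text_boxes):
            gap = ty0 - fy1                            # vertical gap below the figure
            overlaps = min(fx1, tx1) > max(fx0, tx0)   # horizontal overlap
            if overlaps and 0 <= gap < best_gap:
                best, best_gap = ti, gap
        pairs[fi] = best
    return pairs

# Example: one figure with a caption block directly beneath it.
figures = [(100, 100, 500, 400)]
texts   = [(120, 420, 480, 460),   # likely caption (just below the figure)
           (100, 600, 500, 650)]   # body text further down the page
print(associate_captions(figures, texts))   # {0: 0}
```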

There is a lot of fun stuff we can do with this data. For example, we can classify figures using both their images and their captions.

The top panels show three different image categories, and the bottom panels show word clouds generated from collections of images of those types. Generally, combining these datasets in a multi-modal neural network can increase the accuracy of image classification by 10-20%.
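To make the multi-modal idea concrete, here is a minimal sketch (assuming PyTorch) of a classifier that fuses a small convolutional encoding of the figure raster with a bag-of-words encoding of its caption. The layer sizes, vocabulary handling, and three-class setup are placeholders, not the architecture used in this project.

```python
# A toy multi-modal classifier: image features and caption features are
# concatenated before a single linear classification head.
import torch
import torch.nn as nn

class MultiModalClassifier(nn.Module):
    def __init__(self, vocab_size=5000, n_classes=3):
        super().__init__()
        # Image branch: a small convolutional encoder for the figure raster.
        self.image_branch = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),        # -> (batch, 32)
        )
        # Text branch: bag-of-words caption vector passed through a dense layer.
        self.text_branch = nn.Sequential(nn.Linear(vocab_size, 64), nn.ReLU())
        # Fusion head: concatenate both feature vectors, then classify.
        self.head = nn.Linear(32 + 64, n_classes)

    def forward(self, image, caption_bow):
        fused = torch.cat([self.image_branch(image),
                           self.text_branch(caption_bow)], dim=1)
        return self.head(fused)

# Example forward pass with random stand-in data.
model = MultiModalClassifier()
images = torch.randn(4, 1, 128, 128)     # grayscale figure crops
captions = torch.rand(4, 5000)           # bag-of-words caption vectors
print(model(images, captions).shape)     # torch.Size([4, 3])
```

Concatenation is the simplest possible fusion scheme; the point here is only to show how the image and caption signals enter a single classifier.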

For images with physical coordinates (like those of the sky), we can go a step further and map pixels to points in space.

The figure below shows the basic steps of turning an image tagged as a “sky image with axis labeled” into an image that includes metadata translating between its pixel and physical coordinates.

The left panel shows the results of the square-finding and image-dewarping algorithm (cyan square), coordinate tagging (right ascension and declination lines in red), the tick-finding algorithm (green circles), and coordinate OCR (red boxes, with raw OCR output as red text). These are then parsed into physical coordinates, as shown in the right panel, where the "most likely" pixel-to-right-ascension coordinates are highlighted in cyan. These, combined with their associated tick marks (green circles), are used to translate between pixel coordinates and right ascension. A similar process is applied to the declination measurements.
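The last translation step can be sketched with a simple linear fit: given the OCR'd tick values and the pixel positions of their matched tick marks, we can map any pixel column to right ascension and back. The numbers below are stand-ins, and the real pipeline would store this relationship as image metadata (e.g., a WCS) rather than as bare functions.

```python
import numpy as np

# OCR'd right-ascension tick labels (degrees) and the pixel x-positions of the
# tick marks they were matched to; these values are placeholders.
ra_ticks_deg = np.array([150.0, 150.5, 151.0])
tick_px      = np.array([780.0, 430.0, 80.0])   # RA often increases leftward

# Fit a linear pixel -> RA relation from the tick/label matches.
slope, intercept = np.polyfit(tick_px, ra_ticks_deg, 1)

def pixel_to_ra(x_px):
    """Map a pixel column to right ascension (degrees) via the fitted relation."""
    return slope * x_px + intercept

def ra_to_pixel(ra_deg):
    """Invert the relation to find the pixel column of a given right ascension."""
    return (ra_deg - intercept) / slope

print(pixel_to_ra(430.0))    # ~150.5 deg, recovering the middle tick value
print(ra_to_pixel(150.75))   # pixel column falling between two of the ticks
```

The same kind of fit, applied to the declination ticks along the vertical axis, completes the pixel-to-sky mapping.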

References

2023

  1. The Digitization of Historical Astrophysical Literature with Highly-Localized Figures and Figure Captions
    J. P. Naiman, Peter K. G. Williams, and Alyssa Goodman
    International Journal on Digital Libraries, Feb 2023

2022

  1. Figure and Figure Caption Extraction for Mixed Raster and Vector PDFs: Digitization of Astronomical Literature with OCR Features
    J. P. Naiman, Peter K. G. Williams, and Alyssa Goodman
    In Linking Theory and Practice of Digital Libraries: 26th International Conference on Theory and Practice of Digital Libraries, TPDL 2022, Padua, Italy, September 20–23, 2022, Proceedings, Feb 2022

2020

  1. The Reading Time Machine: Transforming Astrophysical Literature into Actionable Data
    J. P. Naiman
    Jan 2020