Machine-Readable Figures from Historical Scientific Literature
Document layout analysis and machine learning methods to digitize historical scientific articles
Funded by a NASA grant and in collaboration with the World Wide Telescope project, my group is working to digitize archival scientific images from papers published before ~1950.
Our current pipeline is outlined below. Blue squares indicate different processing modules and green squares indicate data products. Squares with dashed outlines indicate modules under development. The yellow region denotes modules that can be cross-checked against the Zooniverse citizen science project results.
We can use these results and an image-finding algorithm for preliminary layout analysis, as seen below.
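A simple way to flag candidate figure regions during layout analysis is to look for bands of a binarized page whose ink density is too high for ordinary text lines. The sketch below is a hypothetical illustration of that idea (the thresholds, function name, and toy page are invented, not part of our pipeline):

```python
import numpy as np

def find_figure_rows(page, ink_thresh=0.3, min_run=5):
    """Locate vertical bands of a binarized page (1 = ink) whose ink
    density suggests a figure rather than sparse text lines.

    Returns a list of (start_row, end_row) bands.
    """
    density = page.mean(axis=1)      # fraction of ink pixels per row
    dense = density > ink_thresh     # rows dominated by ink
    bands, start = [], None
    for i, flag in enumerate(dense):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_run:
                bands.append((start, i))
            start = None
    if start is not None and len(dense) - start >= min_run:
        bands.append((start, len(dense)))
    return bands

# Toy page: a sparse "text" row above a solid "figure" band.
page = np.zeros((20, 10))
page[2, ::3] = 1     # sparse text-like row (too short a run to count)
page[8:15, :] = 1    # dense figure-like band
print(find_figure_rows(page))   # -> [(8, 15)]
```

Real pages need more care (columns, halftone noise, rotated plates), but the projection-profile idea is a common starting point for separating figures from text.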
There’s a bunch of fun stuff we can do with this data. For example, we can classify figures by their images and their captions.
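As a toy illustration of caption-based classification, one could start with simple keyword rules before moving to a learned model. The class names and keywords below are invented for the sketch, not the labels our pipeline actually uses:

```python
# Hypothetical keyword rules; a real classifier would be trained on
# labeled captions (e.g. from the citizen-science results).
CAPTION_RULES = {
    "sky image": ("nebula", "cluster", "field", "plate", "exposure"),
    "spectrum": ("spectrum", "spectra", "wavelength"),
    "plot": ("curve", "diagram", "versus", "magnitude"),
}

def classify_caption(caption):
    """Return the first class whose keyword appears in the caption."""
    text = caption.lower()
    for label, keywords in CAPTION_RULES.items():
        if any(k in text for k in keywords):
            return label
    return "other"

print(classify_caption("Fig. 3. Photographic plate of the Orion nebula field."))
# -> sky image
```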
We can further analyze images with physical coordinates (like those of the sky) to map pixels to points in space.
The figure below shows the basic steps of turning an image tagged as a “sky image with axis labeled” into an image with metadata that translates between its pixel and physical coordinates.
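At its simplest, the pixel-to-sky step amounts to fitting a linear map per axis from OCR'd tick labels to their pixel positions. This is only a sketch under that linear assumption (real plates may need rotation and distortion terms, and the tick values here are invented):

```python
import numpy as np

def fit_axis(tick_pixels, tick_coords):
    """Least-squares linear map coord = slope * pixel + intercept,
    fit from axis-tick pixel positions and their labeled values."""
    slope, intercept = np.polyfit(tick_pixels, tick_coords, 1)
    return slope, intercept

# Hypothetical ticks read off a plate: x axis in RA (deg), y axis in Dec (deg).
ra_map = fit_axis([100, 300], [85.0, 84.0])
dec_map = fit_axis([50, 250], [-5.0, -6.0])

def pixel_to_sky(x, y):
    """Translate a pixel position into (RA, Dec) using the fitted axes."""
    ra = ra_map[0] * x + ra_map[1]
    dec = dec_map[0] * y + dec_map[1]
    return ra, dec

print(pixel_to_sky(200, 150))   # roughly (84.5, -5.5)
```

In practice the recovered mapping would be stored as standard FITS WCS metadata so that downstream astronomy tools can use it directly.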
References

2023
- The Digitization of Historical Astrophysical Literature with Highly-Localized Figures and Figure Captions. International Journal on Digital Libraries, Feb 2023.

2022
- Figure and Figure Caption Extraction for Mixed Raster and Vector PDFs: Digitization of Astronomical Literature with OCR Features. In Linking Theory and Practice of Digital Libraries: 26th International Conference on Theory and Practice of Digital Libraries, TPDL 2022, Padua, Italy, September 20–23, 2022, Proceedings, Feb 2022.

2020
- The Reading Time Machine: Transforming Astrophysical Literature into Actionable Data. Jan 2020.