In collaboration with the World Wide Telescope project, I've been working on the digitization of archival scientific images from papers published before ~1950.

Our current pipeline is outlined below. Blue squares are processing modules, green squares are data products, and dashed outlines mark modules still under development. The yellow region denotes modules that can be cross-checked against results from the Zooniverse citizen science project.

We can use these results and an image-finding algorithm for preliminary layout analysis, as seen below.
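To give a flavor of what the image-finding step involves, here's a minimal sketch of one such algorithm using OpenCV (the actual module may work differently, and every threshold below is an illustrative placeholder): binarize the page, dilate so a figure's dense ink merges into a single blob, and keep blobs that cover a sizable fraction of the page.

```python
import cv2

def find_figure_regions(page_path, min_area_frac=0.02):
    """Rough figure-candidate detection on a scanned page.

    Heuristic sketch, not the pipeline's actual module: figures are
    large, dense blobs of ink, while text lines are small sparse ones.
    """
    page = cv2.imread(page_path, cv2.IMREAD_GRAYSCALE)
    # Invert so ink is white, then binarize; Otsu's method picks the
    # threshold automatically, which helps with uneven archival scans.
    _, binary = cv2.threshold(255 - page, 0, 255,
                              cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    # Dilate aggressively so a figure's ink merges into one blob.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 15))
    merged = cv2.dilate(binary, kernel, iterations=3)
    contours, _ = cv2.findContours(merged, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    page_area = page.shape[0] * page.shape[1]
    boxes = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w * h > min_area_frac * page_area:  # big blob => figure candidate
            boxes.append((x, y, w, h))
    return boxes
```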

On the left: the text layout is analyzed to select the text likely associated with the caption (green squares) of each figure (red squares). Text likely associated with x and y axis labels is shown in blue. On the right: the results of optical character recognition (OCR) performed on Figure 1, labeled by the left-most green box.
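Associating captions with figures can be done with a simple geometric heuristic. The sketch below is an assumption about the rule rather than the pipeline's exact logic: take the nearest text box that sits below a figure and overlaps it horizontally, then run pytesseract over the cropped region.

```python
import pytesseract
from PIL import Image

def caption_for_figure(fig_box, text_boxes):
    """Pick the text box most likely to be a figure's caption.

    Assumed heuristic: captions sit directly below the figure and
    overlap it horizontally. Boxes are (x, y, w, h), y increasing down.
    """
    fx, fy, fw, fh = fig_box
    candidates = [
        (x, y, w, h) for (x, y, w, h) in text_boxes
        if y > fy + fh                    # below the figure
        and x < fx + fw and x + w > fx    # horizontally overlapping
    ]
    # Nearest candidate below the figure wins.
    return min(candidates, key=lambda b: b[1], default=None)

def ocr_box(page_path, box):
    """Run OCR on a single cropped region of the page."""
    x, y, w, h = box
    crop = Image.open(page_path).crop((x, y, x + w, y + h))
    return pytesseract.image_to_string(crop)
```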

There's a bunch of fun stuff we can do with this data. For example, we can classify figures by their images and their captions.

The figure below shows three different image categories; the bottom panels show word clouds generated from the captions of collections of images of each type. Generally, combining image and caption data in a single neural network can increase image-classification accuracy by 10-20%.
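As a rough illustration of that fusion idea, here is a toy two-branch PyTorch model that concatenates features from the figure image with a bag-of-words vector built from its caption; the layer sizes and vocabulary size are placeholders, not the architecture behind the 10-20% figure.

```python
import torch
import torch.nn as nn

class ImageCaptionClassifier(nn.Module):
    """Toy two-branch classifier: one branch embeds the figure image,
    one embeds a bag-of-words caption vector, and a shared head mixes
    the two."""

    def __init__(self, vocab_size=2000, n_classes=3):
        super().__init__()
        self.image_branch = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # -> (batch, 32)
        )
        self.caption_branch = nn.Sequential(
            nn.Linear(vocab_size, 64), nn.ReLU(),
        )
        self.head = nn.Linear(32 + 64, n_classes)

    def forward(self, image, caption_bow):
        # Concatenating the two feature vectors before classification
        # is what lets caption words sharpen the image label.
        feats = torch.cat([self.image_branch(image),
                           self.caption_branch(caption_bow)], dim=1)
        return self.head(feats)

# Dummy forward pass on a batch of 4 grayscale figures + caption vectors:
model = ImageCaptionClassifier()
logits = model(torch.randn(4, 1, 128, 128), torch.randn(4, 2000))
```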

For images with physical coordinates (like those of the sky), we can further analyze them to map pixels to points in space.
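The end product of that mapping is world-coordinate metadata attached to the image. The exact metadata format isn't shown here, but a FITS WCS via astropy is the standard representation for this kind of thing; a minimal sketch with made-up header values:

```python
from astropy.wcs import WCS

# Hypothetical header: a tangent-plane projection with the reference
# pixel at the image center. All numbers are illustrative.
w = WCS(naxis=2)
w.wcs.ctype = ["RA---TAN", "DEC--TAN"]
w.wcs.crpix = [512, 512]       # reference pixel (x, y)
w.wcs.crval = [83.6, 22.0]     # RA, Dec at that pixel, in degrees
w.wcs.cdelt = [-1e-3, 1e-3]    # degrees per pixel

# Once such metadata is attached, pixel <-> sky conversion is one call:
sky = w.pixel_to_world(100, 200)
print(sky.ra.deg, sky.dec.deg)
```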

The figure below shows the basic steps of turning an image tagged as a "sky image with axis labeled" into an image that includes metadata translating between its pixel and physical coordinates. The left panel shows the results of the square-finding and image-dewarping algorithm (cyan square), coordinate tagging (right ascension and declination lines in red), the tick-finding algorithm (green circles), and the coordinate OCR results (red boxes, with the raw OCR output as red text). These are then parsed into physical coordinates, as shown in the right panel, where the "most likely" pixel-to-right-ascension coordinates are highlighted in cyan. These, combined with their associated tick marks (green circles), are used to translate between pixel coordinates and right ascension. A similar process is applied to the declination measurements.
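One plausible way to pick out those "most likely" coordinates (a consensus-fit sketch, not necessarily the method used here): propose a linear pixel-to-coordinate mapping from each pair of ticks, count how many of the OCR'd values agree with it, and refit on the agreeing set. This tolerates the occasional OCR misread of a tick label.

```python
import numpy as np

def fit_axis(tick_pixels, tick_values, max_residual=0.5):
    """Fit a linear pixel -> coordinate mapping from detected ticks.

    tick_pixels: pixel positions of ticks along one axis (from the
    tick-finding step); tick_values: coordinate values parsed from
    the OCR results. The max_residual threshold is illustrative.
    """
    px = np.asarray(tick_pixels, dtype=float)
    val = np.asarray(tick_values, dtype=float)
    best = None
    for i in range(len(px)):
        for j in range(i + 1, len(px)):
            if px[j] == px[i]:  # skip coincident ticks
                continue
            # Propose a line through each pair of ticks...
            slope = (val[j] - val[i]) / (px[j] - px[i])
            intercept = val[i] - slope * px[i]
            # ...and count how many other ticks agree with it.
            inliers = np.abs(val - (slope * px + intercept)) < max_residual
            if best is None or inliers.sum() > best[0]:
                best = (inliers.sum(), inliers)
    # Refit on the consistent ("most likely") ticks only.
    slope, intercept = np.polyfit(px[best[1]], val[best[1]], 1)
    return slope, intercept
```

Fitting the declination ticks the same way along the other axis completes the pixel-to-sky translation, which can then be written into metadata like the WCS sketch above.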