Context: I am a digitization librarian working in the Local History and Genealogy department of a public library. We use CONTENTdm for our public access repository through Ohio Memory (a joint collaboration of the Ohio History Connection and the State Library of Ohio), and I use the CONTENTdm Project Client to upload materials to our collections. I ran into an issue recently while uploading yearbooks to CONTENTdm: my 10,000-page monthly OCR license was being eaten up over the course of uploading ten items. The total number of images was probably no more than 2,500, and all of them should have been small enough to fit easily inside the image constraints of the ABBYY engine being used. Instead of consuming roughly 25% of my license, the upload would burn through 100% of it four or five books in (I upload yearbooks in batches of one decade at a time).
The first time this occurred, I simply pushed through the upload process, leaving the last five or six books without an OCR transcription, and went back and added the transcripts the next month after my license reset. The next time a decade's worth of books was ready for ingest, I started checking the individual files to see what the problem was. The issue became readily apparent in Adobe Bridge, with all the metadata it exposes; in this case, showing the original JPEG file:
The images had the expected pixel dimensions of about 3,000 by 4,200 pixels, but the resolution tag on each image was a mere 72 PPI. At that pixel density the image was a whopping 43.4″ by 59.4″, far larger than the maximum size allowed by the ABBYY FineReader engine. The CONTENTdm documentation notes that the OCR engine's maximum page size is A4, roughly 8.27″ by 11.69″; if an image is larger than that, it charges a number of pages equal to how many A4 pages the image is equivalent to at its tagged resolution. At 72 PPI, A4 is a mere 595 by 842 pixels, so by the formula on OCLC's help page a 3124 by 4276 pixel image at 72 PPI eats 27 pages of OCR license. That is a recipe for chewing up a 10,000-page license very quickly.
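The arithmetic can be sketched in a few lines. This is my reading of the area-based formula on OCLC's help page (physical area at the tagged PPI divided by the area of an A4 page, rounded up), so treat the exact rounding as an assumption; it does reproduce the 27-page figure above.

```python
import math

A4_IN = (8.27, 11.69)  # A4 page size in inches, the documented OCR maximum

def ocr_pages(px_w, px_h, ppi):
    """Estimate OCR pages charged for one image: its physical area at
    the tagged PPI divided by the area of an A4 page, rounded up."""
    area_sq_in = (px_w / ppi) * (px_h / ppi)
    return math.ceil(area_sq_in / (A4_IN[0] * A4_IN[1]))

print(ocr_pages(3124, 4276, 72))   # tagged 72 PPI  -> 27 pages
print(ocr_pages(3124, 4276, 400))  # same pixels tagged 400 PPI -> 1 page
```

The second call shows why the fix below works: the pixel data never changes, but at 400 PPI the same image is well under A4 and costs a single page.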
The fix for this is luckily fairly straightforward. I was working in Photoshop with the JPEG images generated by our KIRTAS Kabis IIIW. First I converted all the JPEGs to our archival TIFF format, both for long-term preservation and to avoid the "JPEG Options" dialog pop-up that appears in Photoshop. Once the TIFFs were created, I recorded a simple batch action to process all the images through Photoshop's "Image Size" dialog. The key here is to change the resolution of the image without resampling it. If you try to upscale a 72 PPI image to 400 PPI and resample in the process, Photoshop will do its best to create a much more massive image that won't help your situation at all.
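For anyone without Photoshop, the same metadata-only change can be scripted. This is a hedged sketch using the Pillow library, not the batch action I actually used: it re-saves each JPEG as a TIFF with a corrected PPI tag while leaving the pixel data untouched, i.e. no resampling.

```python
from pathlib import Path

from PIL import Image  # Pillow

def retag_resolution(src_dir, dst_dir, ppi=400):
    """Re-save each JPEG in src_dir as a TIFF in dst_dir with the given
    PPI tag. Pixel dimensions are left exactly as-is -- no resampling."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).glob("*.jpg")):
        with Image.open(path) as img:
            # The dpi argument only writes the resolution metadata;
            # the image data itself is copied over unchanged.
            img.save(dst / (path.stem + ".tif"), format="TIFF", dpi=(ppi, ppi))
```

Pillow writes uncompressed TIFFs by default, which suits an archival master; point the function at a folder of scanner output and it emits corrected TIFFs alongside it.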
The final result is a TIFF image with the exact same pixel dimensions as the original, but now with the appropriate pixel density of 400 PPI and a physical size in inches that easily fits inside A4, no longer destroying my OCR license. The next step is to identify what in our image-editing workflow is converting these images to 72 PPI, to prevent this from happening in the first place. Big thanks to Glenn Fleishman's article "How to set a higher dpi without changing an image's resolution" at Macworld for helping me understand what was happening here; it is definitely worth a read!
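Until the 72 PPI step in the workflow is tracked down, a pre-upload audit can catch problem files before they hit the license. This is a small illustrative sketch (again assuming Pillow) that flags any image whose physical size at its embedded PPI exceeds A4:

```python
from pathlib import Path

from PIL import Image  # Pillow

A4_IN = (8.27, 11.69)  # A4 in inches, the OCR engine's maximum page size

def audit(folder):
    """Return (and print) the names of images whose physical size at
    their embedded PPI exceeds A4, and would therefore cost extra OCR pages."""
    flagged = []
    for path in sorted(Path(folder).glob("*.tif*")):
        with Image.open(path) as img:
            # Images with no resolution tag are treated as 72 PPI,
            # which is how they tend to be interpreted downstream.
            xdpi, ydpi = img.info.get("dpi", (72, 72))
            w_in = img.width / float(xdpi)
            h_in = img.height / float(ydpi)
            if w_in > A4_IN[0] or h_in > A4_IN[1]:
                print(f"{path.name}: {w_in:.1f} x {h_in:.1f} in - exceeds A4")
                flagged.append(path.name)
    return flagged
```

Running this over a batch folder before ingest would have surfaced the mistagged yearbook scans immediately, instead of five books into the upload.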