PDF Ingest in Digital.Grinnell // The Grinnell College Digital Library Application Developer's Blog

A set of 21 PDF objects was ingested into Digital.Grinnell’s Faculty Scholarship collection using IMI on 22-July-2019. Unfortunately, none of these PDFs contained OCR (optical character recognition) or “text recognition” data, so none of them generated a valid FULL_TEXT datastream. FULL_TEXT datastreams are required to make PDFs and similar text content searchable and discoverable in Digital.Grinnell.

In order to confirm that the lack of OCR was in fact the problem, I ran a little test on https://digital.grinnell.edu/islandora/object/grinnell:26702, one of the 21 objects.

In my test I…

signed in to Digital.Grinnell as an admin,
opened the object (see address above) in my browser,
clicked Manage to see all the object details,
clicked Datastreams to see the list of all the object’s datastreams,
clicked the download link for the OBJ datastream, which downloaded a copy of the PDF file to my workstation.
Once the PDF was downloaded, I opened it on my workstation in Adobe Acrobat Pro,
clicked Tools and then Text Recognition,
then chose In This File*.
After a few minutes I had a new PDF with OCR’d, searchable text.
I saved that new PDF on my workstation,
went back into the Manage tab in my browser,
clicked replace in the OBJ datastream line,
then uploaded the new PDF file to Digital.Grinnell.

Once the upload was complete, the system automatically generated new derivatives for the object, which now has a valid FULL_TEXT datastream. This should make the content searchable and discoverable.

*Note that if I had multiple PDFs to process, I believe I could have selected the In Multiple Files option to save some time and OCR several PDFs in one operation.
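
To double-check an object's datastreams without clicking through the Manage tab, a small helper can build the datastream addresses directly. This is only a sketch: it assumes the usual Islandora 7 URL layout of /islandora/object/PID/datastream/DSID/view (or /download), and the function name `datastream_url` is my own invention for illustration. Depending on permissions, a datastream may only be visible to a signed-in user.

```python
from urllib.parse import quote

# Site root taken from the object address used in this post
BASE_URL = "https://digital.grinnell.edu"


def datastream_url(pid, dsid, action="view", base_url=BASE_URL):
    """Build the address of one datastream.

    Hypothetical helper; assumes the Islandora 7 URL layout
    /islandora/object/<PID>/datastream/<DSID>/<action>.
    """
    if action not in ("view", "download"):
        raise ValueError("action must be 'view' or 'download'")
    # Leave the colon in the PID unescaped, as it appears in the object address
    safe_pid = quote(pid, safe=":")
    return f"{base_url}/islandora/object/{safe_pid}/datastream/{quote(dsid)}/{action}"


# Address for downloading the original PDF (the OBJ datastream)
print(datastream_url("grinnell:26702", "OBJ", "download"))
# Address for viewing the extracted text (the FULL_TEXT datastream)
print(datastream_url("grinnell:26702", "FULL_TEXT"))
```

Opening the second address in a browser should show the extracted text if the FULL_TEXT datastream is valid.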

The lesson to be learned here: always run “Text Recognition” on a PDF BEFORE it is ingested into Digital.Grinnell. But if you forget, this procedure, in the hands of any Digital.Grinnell admin, can save the day! 😄
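
One way to catch a PDF with no text layer before ingest is to try extracting its text and see whether anything comes back. The sketch below is not part of the Digital.Grinnell workflow. It assumes Poppler's `pdftotext` command is installed and on the PATH, and the function names and the 20-character threshold are arbitrary choices of mine.

```python
import subprocess


def looks_searchable(extracted_text, min_chars=20):
    """Return True if the text has at least min_chars non-whitespace characters.

    The threshold is an arbitrary assumption; a scanned PDF with no OCR
    typically yields nothing but whitespace and page breaks.
    """
    visible = "".join(extracted_text.split())
    return len(visible) >= min_chars


def extract_text(pdf_path):
    """Run Poppler's pdftotext (assumed to be installed) and return its output."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the text to stdout instead of a file
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def needs_ocr(pdf_path):
    """Return True if the PDF appears to have no usable text layer."""
    return not looks_searchable(extract_text(pdf_path))
```

Calling `needs_ocr("scan.pdf")` (a made-up file name) on each PDF before ingest would flag the ones that still need a pass through Text Recognition.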

And that’s a wrap. Until next time…