A set of 21 PDF objects was ingested into Digital.Grinnell’s Faculty Scholarship collection using IMI on 22-July-2019. Unfortunately, none of these PDFs contained OCR (optical character recognition) or “text recognition” data, so none of them generated a valid FULL_TEXT datastream. FULL_TEXT datastreams are required to make PDFs, and similar text content, searchable and discoverable in Digital.Grinnell.

In order to confirm that the lack of OCR was in fact the problem, I ran a little test on https://digital.grinnell.edu/islandora/object/grinnell:26702, one of the 21 objects.

In my test I…

  • signed in to Digital.Grinnell as an admin,
  • opened the object (see address above) in my browser,
  • clicked Manage to see all the object details,
  • clicked Datastreams to see the list of all the object’s datastreams,
  • clicked the download link corresponding to the OBJ datastream - this allowed me to download a copy of the PDF file to my workstation.
  • Once the PDF was downloaded I opened it on my workstation in Adobe Acrobat Pro,
  • clicked Tools and Text Recognition,
  • then chose In This File*.
  • After a few minutes I had a new PDF with OCR’d and searchable text.
  • I saved that new PDF on my workstation,
  • went back into the Manage tab in my browser,
  • clicked replace in the OBJ datastream line,
  • then uploaded the new PDF file to Digital.Grinnell.

Once the upload was complete, the system automatically generated new derivatives for the object, which now has a valid FULL_TEXT datastream. That should make the content searchable and discoverable.
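To double-check the result without clicking through the Manage tab, something like the following sketch could be used. It assumes the common Islandora 7 URL pattern `/islandora/object/<PID>/datastream/<DSID>/view`, and it assumes the FULL_TEXT datastream is readable without signing in; neither was confirmed as part of this test, and the function names are purely illustrative.

```python
import urllib.error
import urllib.request
from urllib.parse import quote

BASE_URL = "https://digital.grinnell.edu"


def datastream_url(pid, dsid="FULL_TEXT", base_url=BASE_URL):
    """Build the (assumed) Islandora 7 view URL for one datastream of an object."""
    # Keep the ':' in the PID unescaped, matching the object URLs used in this post.
    return f"{base_url}/islandora/object/{quote(pid, safe=':')}/datastream/{dsid}/view"


def datastream_has_content(pid, dsid="FULL_TEXT", base_url=BASE_URL, timeout=30):
    """Return True if the datastream can be fetched and is not just whitespace."""
    try:
        with urllib.request.urlopen(datastream_url(pid, dsid, base_url), timeout=timeout) as resp:
            return bool(resp.read().strip())
    except urllib.error.URLError:
        # Covers HTTP errors (404, 403, ...) as well as connection problems.
        return False


if __name__ == "__main__":
    print(datastream_url("grinnell:26702"))
```

Calling `datastream_has_content("grinnell:26702")` would then report whether any text came back. A `False` result could also mean the datastream requires a login, so it is a hint rather than proof.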

*Note that if I had multiple PDFs to process I believe I could have selected the In Multiple Files option to save some time and OCR several PDFs in one operation.

The lesson-to-be-learned here is to… always run "Text Recognition" on a PDF BEFORE it is ingested into Digital.Grinnell. But if you forget, this procedure, in the hands of any Digital.Grinnell admin, can save the day! 😄
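One way to catch un-OCR'd PDFs before ingest is a quick check for an existing text layer. The sketch below is not part of the procedure above. It assumes Poppler's `pdftotext` command is installed on the workstation, and the 20-character threshold is an arbitrary guess, not a tested value.

```python
import shutil
import subprocess
import sys


def looks_searchable(text, min_chars=20):
    """Treat a PDF as searchable if enough non-whitespace characters were extracted."""
    return sum(1 for ch in text if not ch.isspace()) >= min_chars


def extract_text(pdf_path):
    """Extract a PDF's text layer with Poppler's pdftotext ('-' sends output to stdout)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install the Poppler utilities first")
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Report one status line per PDF path given on the command line.
    for path in sys.argv[1:]:
        status = "OK" if looks_searchable(extract_text(path)) else "NEEDS OCR"
        print(f"{status}\t{path}")
```

Saved under a hypothetical name such as `check_ocr.py`, it could be run as `python check_ocr.py *.pdf` against a folder of files waiting for ingest. Anything flagged NEEDS OCR would go through Text Recognition first.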

And that’s a wrap. Until next time…