PDF Ingest in Digital.Grinnell
A set of 21 PDF objects was ingested into Digital.Grinnell’s Faculty Scholarship collection using IMI on 22-July-2019. Unfortunately, none of these PDFs contained OCR (optical character recognition) or “text recognition” data, so none of them generated a valid FULL_TEXT datastream. FULL_TEXT datastreams are required to make PDFs, and similar text content, searchable and discoverable in Digital.Grinnell.
To confirm that the lack of OCR was in fact the problem, I ran a little test on https://digital.grinnell.edu/islandora/object/grinnell:26702, one of the 21 objects.
In my test I…
- signed in to Digital.Grinnell as an admin,
- opened the object (see address above) in my browser,
- clicked "Manage" to see all the object details,
- clicked "Datastreams" to see the list of all the object’s datastreams,
- clicked the "download" link corresponding to the OBJ datastream; this allowed me to download a copy of the PDF file to my workstation.
- Once the PDF was downloaded I opened it on my workstation in Adobe Acrobat Pro,
- clicked "Tools" and "Text Recognition",
- then I chose* "In This File".
- After a few minutes I had a new PDF with OCR’d and searchable text.
- I saved that new PDF on my workstation,
- went back into the "Manage" tab in my browser,
- clicked "replace" in the OBJ datastream line,
- then uploaded the new PDF file to Digital.Grinnell.
Once the upload was complete, the system automatically generated new derivatives for the object, which now has a valid FULL_TEXT datastream. This should make the content searchable and discoverable.
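As a quick follow-up check, a short script could confirm that the regenerated FULL_TEXT datastream is no longer empty. This is only a minimal sketch, not something I ran as part of the test above. It assumes Digital.Grinnell exposes the usual Islandora 7 datastream paths (`/islandora/object/<PID>/datastream/<DSID>/view` and `/download`) and that the datastream is readable by whoever runs the script. The helper names `datastream_url` and `fetch_full_text` are hypothetical.

```python
from urllib.request import urlopen

# Assumption: the site follows the standard Islandora 7 URL layout.
BASE_URL = "https://digital.grinnell.edu"


def datastream_url(pid: str, dsid: str, action: str = "view") -> str:
    """Build the URL for viewing or downloading one datastream of an object."""
    if action not in ("view", "download"):
        raise ValueError(f"unsupported action: {action!r}")
    return f"{BASE_URL}/islandora/object/{pid}/datastream/{dsid}/{action}"


def fetch_full_text(pid: str, timeout: float = 30.0) -> str:
    """Fetch the FULL_TEXT datastream as text.

    urlopen raises urllib.error.HTTPError if the datastream is missing
    or access to it is restricted.
    """
    with urlopen(datastream_url(pid, "FULL_TEXT"), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_full_text("grinnell:26702")` and checking whether the result is blank after `.strip()` would tell you if the new derivative actually holds text. The same `datastream_url` helper with `"OBJ"` and `"download"` gives the address behind the "download" link used in the steps above.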
*Note that if I had multiple PDFs to process, I believe I could have selected the "In Multiple Files" option to save some time and OCR several PDFs in one operation.
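For a larger batch, a scripted alternative to Acrobat could do the same job. The sketch below swaps in OCRmyPDF, an open-source command-line tool, which is not what I used above. It assumes `ocrmypdf` is installed and on the PATH; the helper names `plan_batch` and `ocr_batch` and any folder names you pass in are hypothetical.

```python
import subprocess
from pathlib import Path


def plan_batch(input_dir: Path, output_dir: Path) -> list[tuple[Path, Path]]:
    """Pair every PDF in input_dir with a same-named target in output_dir."""
    return [(src, output_dir / src.name) for src in sorted(input_dir.glob("*.pdf"))]


def ocr_batch(input_dir: Path, output_dir: Path) -> None:
    """Run OCRmyPDF on each PDF, writing searchable copies to output_dir."""
    output_dir.mkdir(parents=True, exist_ok=True)
    for src, dst in plan_batch(input_dir, output_dir):
        # --skip-text leaves pages that already contain text untouched.
        subprocess.run(["ocrmypdf", "--skip-text", str(src), str(dst)], check=True)
```

Writing the OCR'd copies to a separate folder keeps the original files untouched, so the results can be spot-checked before anything is uploaded.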
The lesson to be learned here is to always run "Text Recognition" on a PDF BEFORE it is ingested into Digital.Grinnell.
But if you forget, this procedure, in the hands of any Digital.Grinnell admin, can save the day! 😄
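To make that lesson easier to follow, a pre-ingest check could flag PDFs that have no text layer. This is a rough sketch that assumes Poppler's `pdftotext` command is installed. The helper names and the 20-character threshold are my own illustrative choices, not part of any Digital.Grinnell workflow.

```python
import shutil
import subprocess
from pathlib import Path


def looks_searchable(extracted_text: str, min_chars: int = 20) -> bool:
    """Return True if the text has at least min_chars non-whitespace characters."""
    return sum(1 for ch in extracted_text if not ch.isspace()) >= min_chars


def extract_text(pdf_path: Path) -> str:
    """Extract text with pdftotext; the '-' argument sends output to stdout."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (Poppler) is not installed")
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout


def needs_ocr(pdf_path: Path) -> bool:
    """Flag a PDF whose extracted text is effectively empty."""
    return not looks_searchable(extract_text(pdf_path))
```

An image-only scan typically yields nothing but page-break characters, which count as whitespace here, so `needs_ocr` would return True for it. A PDF with only a few stray characters of text would also be flagged, which errs on the side of running OCR.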
And that’s a wrap. Until next time…