Exporting, Editing & Replacing MODS Datastreams: Updated Technical Details
Attention: This post supersedes posts/070-exporting-editing-replacing-mods-datastreams-technical-details.
A 7-Step Workflow
This document is a follow-up, with technical details, to Exporting, Editing & Replacing MODS Datastreams, post 069, in my blog. In case you missed it, that post was written specifically for metadata editors working on the 2020 Grinnell College Libraries review of Digital Grinnell MODS metadata.
Attention: This document uses a shorthand ./ in place of the frequently referenced //STORAGE/LIBRARY/ALLSTAFF/DG-Metadata-Review-2020-r1/ directory. For example, ./social-justice is equivalent to the Social Justice collection sub-directory at //STORAGE/LIBRARY/ALLSTAFF/DG-Metadata-Review-2020-r1/social-justice.
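The shorthand can be expanded mechanically. Here is a minimal sketch, assuming a hypothetical helper named expand_shorthand that is not part of any tool mentioned in this post:

```shell
# Root of the review share, copied from the note above.
REVIEW_ROOT="//STORAGE/LIBRARY/ALLSTAFF/DG-Metadata-Review-2020-r1"

# expand_shorthand (hypothetical helper): swap a leading "./" for the full share path.
expand_shorthand() {
  printf '%s/%s\n' "$REVIEW_ROOT" "${1#./}"
}

# Prints the full path of the Social Justice collection sub-directory.
expand_shorthand ./social-justice
```

Inside the Apache container the same sub-directories appear under /mnt/metadata-review instead, so REVIEW_ROOT would need to change there.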
Briefly, the seven steps in this workflow are:
1. Export all grinnell:* MODS datastreams using drush islandora_datastream_export. This step, last performed on April 14, 2020, was responsible for creating all of the grinnell_<PID>_MODS.xml exports found in ./<collection-PID>.
2. Execute my Map-MODS-to-MASTER Python 3 script on iMac MA8660 to create a mods.tsv file for each collection, along with associated grinnell_<PID>_MODS.log and grinnell_<PID>_MODS.remainder files for each object. The resultant ./<collection-PID>/mods.tsv files are tab-separated-value (.tsv) files, and they are key to this process.
3. Edit the MODS .tsv files. Refer to Exporting, Editing, & Replacing MODS Datastreams for details and guidance.
4. Use drush islandora_mods_via_twig in each ready-for-update collection to generate new .xml MODS datastream files. For a specified collection, this command will find and read the ./<collection-PID>/mods-imvt.tsv file and create one ./<collection-PID>/ready-for-datastream-replace/grinnell_<PID>_MODS.xml file for each object.
5. Execute the drush islandora_datastream_replace command once for each collection. This command will process each ./<collection-PID>/ready-for-datastream-replace/grinnell_<PID>_MODS.xml file and replace the corresponding object’s MODS datastream with the contents of the .xml file. The digital_grinnell branch version of the islandora_datastream_replace command also performs an implicit update of the object’s “Title”, a transform of the new MODS to DC (Dublin Core), and a re-indexing of the new metadata in Solr.
6. Execute an optional follow-up drush command as documented in Islandora MODS Post Processing. This portion of the workflow will help to reduce duplication of effort for objects that are shared between two or more collections.
7. Configure and run the main.py script as described in the README.md file at my reduce-MODS-remainders repository. This portion of the workflow will analyze all of the *.remainder files left behind by workflow Step 2 for objects in a given collection.
The remainder of this document provides technical details, frequently in the form of command lines used to build and use the aforementioned tools.
Step 1a - Installation of Drush islandora_datastream_export and islandora_datastream_replace Commands
To help implement this process efficiently and effectively I first turned to Exporting, Editing, & Replacing MODS Datastreams, a workflow developed by the good folks at The California Historical Society. I initiated the workflow by installing two Drush tools on my local/development instance of ISLE on my Mac workstation.
The command line process in my local host/workstation terminal looked like this:
Apache=isle-apache-ld
docker exec -w /var/www/html/sites/all/modules/islandora/ ${Apache} git clone https://github.com/Islandora-Labs/islandora_datastream_exporter.git --recursive
docker exec -w /var/www/html/sites/all/modules/islandora/ ${Apache} git clone https://github.com/pc37utn/islandora_datastream_replace.git --recursive
docker exec -w /var/www/html/sites/all/modules/islandora/ ${Apache} bash -c 'chown -R islandora:www-data *'
docker exec -w /var/www/html/sites/default ${Apache} drush en islandora_datastream_exporter islandora_datastream_replace -y
docker exec -w /var/www/html/sites/default ${Apache} drush cc drush -y
Local tests of these commands were successful, so I proceeded to install them in the production instance of Digital Grinnell at dgdocker1.grinnell.edu. Before doing that I needed to change the definition of Apache to reflect the production instance of our Apache container, like so: Apache=isle-apache-dg.
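Since the only difference between the local and production command sequences is the container name, the switch can be captured in one place. This is a sketch only; apache_container is a hypothetical helper and the environment labels "local" and "production" are my own:

```shell
# apache_container (hypothetical helper): map an environment label to the
# Apache container name used in this post.
apache_container() {
  case "$1" in
    local)      echo "isle-apache-ld" ;;
    production) echo "isle-apache-dg" ;;
    *)          echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# Example: Apache=$(apache_container production)
```

Failing on an unknown label, rather than defaulting, guards against running the install commands in the wrong container.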
Created a Fork of Islandora Datastream Replace
I also chose to “fork” the islandora_datastream_replace project so that I could do a little Digital.Grinnell customization of it. The fork I’m working with is here and my work is limited to the digital_grinnell branch of that fork.
In the digital_grinnell branch I modified the behavior of the islandora_datastream_replace command so that it implicitly performs an UpdateFromMODS operation that lives in our idu, or Islandora Drush Utilities, module. The UpdateFromMODS operation, performed immediately after each datastream replace operation, does the following:
- Updates the object “Title”, one of its properties, to match the new value of /mods:mods/mods:titleInfo[not(@type)]/mods:title.
- Invokes the iduF DCTransform operation, which runs the default XSLT transform of the new MODS to DC (Dublin Core) and creates a new “DC” datastream for the object.
- The iduF DCTransform operation also concludes with an implicit iduF IndexSolr operation to ensure that the new object metadata is properly indexed in Solr.
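To preview the title that the first bullet would pick up from an exported file, something like the following can help. It is a sketch, not how idu does it: it assumes xmllint (from libxml2) is installed, mods_title is a hypothetical helper, and the XPath uses local-name() only to avoid registering the mods: namespace prefix.

```shell
# mods_title (hypothetical helper): print the untyped title from a MODS .xml file.
# Equivalent in intent to /mods:mods/mods:titleInfo[not(@type)]/mods:title.
mods_title() {
  xmllint --xpath "string(/*[local-name()='mods']/*[local-name()='titleInfo'][not(@type)]/*[local-name()='title'])" "$1"
}

# Example: mods_title /mnt/metadata-review/social-justice/grinnell_<PID>_MODS.xml
```

If a record has more than one untyped titleInfo element, string() returns only the first one in document order.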
Step 1b - Installation of Drush islandora_datastream_export and islandora_datastream_replace Commands in Production
To install the commands in production I opened a terminal to dgdocker1.grinnell.edu as user islandora and executed the following commands there:
Apache=isle-apache-dg
docker exec -w /var/www/html/sites/all/modules/islandora/ ${Apache} git clone https://github.com/Islandora-Labs/islandora_datastream_exporter.git --recursive
docker exec -w /var/www/html/sites/all/modules/islandora/ ${Apache} git clone https://github.com/DigitalGrinnell/islandora_datastream_replace.git --recursive
docker exec -w /var/www/html/sites/all/modules/islandora/ ${Apache} bash -c 'chown -R islandora:www-data *'
docker exec -w /var/www/html/sites/all/modules/islandora/islandora_datastream_replace ${Apache} git checkout digital_grinnell
docker exec -w /var/www/html/sites/default ${Apache} drush en islandora_datastream_exporter islandora_datastream_replace -y
docker exec -w /var/www/html/sites/default ${Apache} drush cc drush -y
Step 1c - Mounting //STORAGE to DGDocker1
Attention! This step, and some that come later, require that the network storage path //STORAGE/LIBRARY/ALLSTAFF/DG-Metadata-Review-2020-r1 be accessible to our production instance of Digital.Grinnell. To make that possible I had to run this sequence on DGDocker1:
docker exec -it isle-apache-dg bash
mount -t cifs -o username=mcfatem //storage.grinnell.edu/LIBRARY/ALLSTAFF/DG-Metadata-Review-2020-r1 /mnt/metadata-review
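Because later steps fail in confusing ways when the share is not mounted, a quick check before running them can save time. A minimal sketch, assuming a hypothetical helper named check_review_mount and treating "directory exists and is not empty" as a stand-in for "the share is mounted":

```shell
# check_review_mount (hypothetical helper): succeed only if the directory
# exists and contains at least one entry. Defaults to /mnt/metadata-review.
check_review_mount() {
  dir="${1:-/mnt/metadata-review}"
  if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
    echo "ok: $dir is available"
  else
    echo "missing or empty: $dir" >&2
    return 1
  fi
}

# Example, inside the container: check_review_mount && echo "safe to proceed"
```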
Step 1d - Using Drush islandora_datastream_export
Unfortunately, the islandora_datastream_export results in my local test were woefully incomplete… NONE of the child objects with a compound parent were exported. I’m still not entirely sure why child objects were omitted since the query I used should have captured all objects. In testing I did find that this seems to be a flaw in the islandora_datastream_export command, and specifically in its implementation of any Solr query.
Fortunately, the aforementioned command also has a SPARQL query option, and after some trial-and-error I got it to work properly. To do so I created an export.sh bash script, shown below, and used it on dgdocker1.grinnell.edu like so:
docker exec -it isle-apache-dg bash
source export.sh
The export.sh script is:
# Run from the directory that holds ri-query.txt and collections.list.
Apache=isle-apache-dg
Target=/utility-scripts
# wget https://gist.github.com/McFateM/5bd7e5b0fa5d2928b2799d039a4c0fab/raw/collections.list
while read -r collection
do
  # Build this collection's SPARQL query from the template and copy it into the container.
  cp -f ri-query.txt query.sparql
  sed -i "s|COLLECTION|${collection}|g" query.sparql
  docker cp query.sparql ${Apache}:${Target}/${collection}.sparql
  rm -f query.sparql
  q=${Target}/${collection}.sparql
  echo "Processing collection '${collection}'; Query is '${q}'..."
  # Create the export directory, then export every MODS datastream the query returns.
  docker exec -w ${Target} ${Apache} mkdir -p /mnt/metadata-review/${collection}
  docker exec -w /var/www/html/sites/default/ ${Apache} drush -u 1 islandora_datastream_export --export_target=/mnt/metadata-review/${collection} --query=${q} --query_type=islandora_datastream_exporter_ri_query --dsid=MODS
done < collections.list
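The template substitution at the heart of that loop can be tried on its own, without touching Docker. In this sketch render_query is a hypothetical helper, and the one-line template in the usage comment is made up for illustration; it is not the contents of my ri-query.txt.

```shell
# render_query TEMPLATE COLLECTION (hypothetical helper): print TEMPLATE with
# every literal COLLECTION placeholder replaced by the collection name.
render_query() {
  sed "s|COLLECTION|$2|g" "$1"
}

# Example with a made-up template line:
#   printf 'member of <info:fedora/grinnell:COLLECTION>\n' > /tmp/template.txt
#   render_query /tmp/template.txt social-justice
```

Writing to stdout instead of editing a copy in place makes it easy to eyeball a query before it is copied into the container. A collection name containing "|" or "&" would confuse sed, so this assumes simple names like social-justice.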
In the case of the Digital Grinnell social-justice collection, for example, this script produced 32 .xml files, the correct number. Each collection’s set of exported .xml files can be found in the collection-specific subdirectory of //STORAGE/LIBRARY/ALLSTAFF/DG-Metadata-Review-2020-r1/ and all have filenames of the form: grinnell_<PID>_MODS.xml. Note that objects which have no MODS datastream were not exported.
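Checking the export count against the expected number of objects, as I did for social-justice, is easy to script. A minimal sketch, where count_mods_exports is a hypothetical helper and the find options assume GNU or BSD find:

```shell
# count_mods_exports DIR (hypothetical helper): count the grinnell_<PID>_MODS.xml
# files directly inside DIR, ignoring any subdirectories.
count_mods_exports() {
  find "$1" -maxdepth 1 -type f -name 'grinnell_*_MODS.xml' | wc -l | tr -d ' '
}

# Example: count_mods_exports /mnt/metadata-review/social-justice
```

The -maxdepth 1 limit matters later in the workflow, when a ready-for-datastream-replace subdirectory holds files with the same names.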
Step 2 - Map-MODS-to-MASTER Python 3 Script
The Map-MODS-to-MASTER script was developed, in Python 3, on iMac MA8660 at ~/GitHub/Map-MODS-to-MASTER to facilitate generation of mods.tsv and accompanying .log files for each Digital Grinnell collection from the .xml files found in subdirectories of //STORAGE/LIBRARY/ALLSTAFF/DG-Metadata-Review-2020-r1/.
The Map-MODS-to-MASTER project can be found in the master branch of https://github.com/DigitalGrinnell/Map-MODS-to-MASTER. I chose to execute it using PyCharm from iMac MA8660 since the directory holding all of the .xml files and folders is already mapped to /Volumes/LIBRARY/ALLSTAFF/DG-Metadata-Review-2020-r1 on that iMac. Note that this //STORAGE location was chosen because the ./ALLSTAFF directory, and its subordinates, are accessible to all staff in the Grinnell College Libraries.
It should not be necessary to run this script ever again…NEVER. However, if it becomes necessary to look back at this code and process, details can be found in Map-MODS-to-MASTER. Note: If it should ever become necessary to repeat the Map-MODS-to-MASTER process, it might be wise to look at replacing the Python 3 script with a new Drush command, maybe islandora_map_mods_to_master, written in PHP and installed directly into the production instance of Digital.Grinnell.
Step 3 - Editing the MODS .tsv Files
Please refer to Exporting, Editing, & Replacing MODS Datastreams, post 069 in my blog, for details and guidance.
Step 4 - Run drush islandora_mods_via_twig
As each individual collection mods-imvt.tsv file is made ready-for-update, it will be necessary to run a drush islandora_mods_via_twig command to process the .tsv data. Running --help with that command produces:
[islandora@dgdocker1 ~]$ docker exec -it isle-apache-dg bash
root@122092fe8182:/# cd /var/www/html/sites/default/
root@122092fe8182:/var/www/html/sites/default# drush -u 1 islandora_mods_via_twig --help
Generate MODS .xml files from the mods-imvt.tsv file for a specified collection.
Examples:
drush -u 1 islandora_mods_via_twig social-justice Process ../social-justice/mods-imvt.tsv, for example.
Arguments:
collection The name of the collection to be processed. Defaults to "social-justice".
Aliases: imvt
So, my command sequence to run islandora_mods_via_twig for the “Social Justice” collection, as an example, was:
[islandora@dgdocker1 ~]$ docker exec -it isle-apache-dg bash
root@122092fe8182:/# cd /var/www/html/sites/default/
root@122092fe8182:/var/www/html/sites/default# drush -u 1 islandora_mods_via_twig social-justice
When the islandora_mods_via_twig command is run, it processes the corresponding //STORAGE/LIBRARY/ALLSTAFF/DG-Metadata-Review-2020-r1/<collection-PID>/mods-imvt.tsv file and creates one //STORAGE/LIBRARY/ALLSTAFF/DG-Metadata-Review-2020-r1/<collection-PID>/ready-for-datastream-replace/grinnell_<PID>_MODS.xml file for each object.
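Since the command is run once per collection, and only for collections that have a mods-imvt.tsv file, it can help to list which collections are ready. A sketch, with ready_collections as a hypothetical helper that treats the mere presence of mods-imvt.tsv as "ready":

```shell
# ready_collections ROOT (hypothetical helper): print the name of every
# collection subdirectory of ROOT that contains a mods-imvt.tsv file.
ready_collections() {
  for tsv in "$1"/*/mods-imvt.tsv; do
    [ -f "$tsv" ] || continue   # skip the unexpanded pattern when nothing matches
    basename "$(dirname "$tsv")"
  done
}

# Example dry run, printing the commands rather than executing them:
#   for c in $(ready_collections /mnt/metadata-review); do
#     echo drush -u 1 islandora_mods_via_twig "$c"
#   done
```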
Step 5 - Run drush islandora_datastream_replace
The whole point of this process is to arrive here with a set of reviewed and modified .xml files in a collection-specific //STORAGE/LIBRARY/ALLSTAFF/DG-Metadata-Review-2020-r1/<collection-PID>/ready-for-datastream-replace/ subdirectory, so that we can replace existing object MODS datastreams with new data. We use the drush islandora_datastream_replace command to do this.
Running --help for the aforementioned command produced this:
root@122092fe8182:/var/www/html/sites/default# drush -u 1 islandora_datastream_replace --help
Replaces a datastream in all objects given a file list in a directory.
Examples:
drush -u 1 islandora_datastream_replace --source=/mnt/metadata-review/social-justice/ready-for-datastream-replace
--dsid=MODS --namespace=grinnell
Replacing MODS datastream for objects in --source using the digital_grinnell branch of code.
Options:
--dsid The datastream id of the datastream. Required.
--namespace The namespace of the pids. Required.
--source The directory to get the datastreams and pid# from. Required.
Aliases: idre
It’s worth noting that this command looks for any files named MODS in whatever ABSOLUTE directory is named with the --source parameter. The command shown below was executed inside the Apache container, isle-apache-dg, on node DGDocker1, in order to process Digital Grinnell’s social-justice collection.
root@122092fe8182:/var/www/html/sites/default# drush -u 1 islandora_datastream_replace --source=/mnt/metadata-review/social-justice/ready-for-datastream-replace --dsid=MODS --namespace=grinnell
The same command could have been executed directly from node DGDocker1 like so:
docker exec -w /var/www/html/sites/default isle-apache-dg drush -u 1 islandora_datastream_replace --source=/mnt/metadata-review/social-justice/ready-for-datastream-replace --dsid=MODS --namespace=grinnell
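Before replacing anything in production, it is worth previewing which objects a --source directory would touch. The sketch below assumes that a file named grinnell_<number>_MODS.xml corresponds to the PID grinnell:<number>. That matches the filename pattern described in this post, but the helpers filename_to_pid and preview_replace are hypothetical and do not call the replace command's own code.

```shell
# filename_to_pid FILE (hypothetical helper): turn <namespace>_<number>_MODS.xml
# into <namespace>:<number>. Prints nothing for names that do not match.
filename_to_pid() {
  basename "$1" | sed -n 's/^\([A-Za-z0-9-]*\)_\([0-9][0-9]*\)_MODS\.xml$/\1:\2/p'
}

# preview_replace DIR (hypothetical helper): list the PIDs for every
# *_MODS.xml file directly inside DIR.
preview_replace() {
  for f in "$1"/*_MODS.xml; do
    [ -f "$f" ] || continue
    filename_to_pid "$f"
  done
}

# Example: preview_replace /mnt/metadata-review/social-justice/ready-for-datastream-replace
```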
Step 6 - Reduce Duplicates
As mentioned in the summary above, this is an optional but recommended follow-up drush command described in Islandora MODS Post Processing.
Step 7 - Analyze and Restore “remainder” Fields
As mentioned in the summary above, this step is designed to analyze all of the *.remainder files left behind by workflow Step 2 for objects in a given collection, and provide commands to restore legitimate elements after review. Follow instructions found in the README.md file at my reduce-MODS-remainders repository.
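For a quick sense of how much review work a collection holds before running main.py, the remainder files can be listed directly. This sketch is independent of the reduce-MODS-remainders code; nonempty_remainders is a hypothetical helper, and it assumes that a zero-length .remainder file means nothing was left over for that object.

```shell
# nonempty_remainders DIR (hypothetical helper): list the
# grinnell_<PID>_MODS.remainder files in DIR that contain at least one byte.
nonempty_remainders() {
  find "$1" -maxdepth 1 -type f -name 'grinnell_*_MODS.remainder' -size +0 | sort
}

# Example: nonempty_remainders /mnt/metadata-review/social-justice | wc -l
```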
And that’s a wrap. Until next time, stay safe and wash your hands! 😄