Creating a CollectionBuilder-CSV Instance from Our Migration Collection
This post is essentially a CollectionBuilder-CSV follow-up to Creating a Migration Collection, intended to document the path I’ve taken and the decisions I made when creating a first cut of Digital.Grinnell content using the aforementioned CollectionBuilder-CSV.
CB-CSV_DG-01
With a few notable exceptions, everything mentioned below will be visible in a new public GitHub repo at Digital-Grinnell/CB-CSV_DG-01.
Corresponding Google Sheet
One of the exceptions: the project’s metadata CSV in a time-stamped tab at https://docs.google.com/spreadsheets/d/1ic4PxHDbuzDrmf4YtauhC4vEQJxt3QSH8bYfLBCM3Gc/.
Other worksheets/tabs in the Google Sheet contain:
Sheet 1 = the initial imported demo data from the CollectionBuilder-CSV Metadata Template
mods.csv = original MODS metadata .csv as exported from Digital.Grinnell and described in Creating a Migration Collection.
Creating the Development Environment
The steps required to create a CollectionBuilder-CSV instance are well-documented in the CollectionBuilder Docs. I selected the CollectionBuilder-CSV version of CB, and because my development environment will be on a Mac, I consistently selected Mac-specific options throughout the process, including the use of Homebrew whenever it was an option.
An ordered list of documentation sections I followed includes:
- CollectionBuilder Docs
- Templates
- GitHub
- I skipped this section because I am already familiar with GitHub use.
- Git Setup
- I skipped this section because Git is already properly configured on my Mac.
- Get a Text Editor
- I skipped this section because Visual Studio Code is already installed and includes all of the extensions mentioned in the document.
- Install Ruby
- Ruby on Mac - I used all of the recommended commands with Ruby version 3.1.3.
- Install Jekyll
- Optional Software
- Set Up Project Repository
- I worked through all of the sub-sections here to produce the CB-CSV_DG-01 repository. Most of the work was done using VSCode and included initial commits and a push, with project-specific information added to the README.md file.
Adding Metadata
The next task here involves properly populating the project’s metadata CSV in the ready-for-CB tab at https://docs.google.com/spreadsheets/d/1ic4PxHDbuzDrmf4YtauhC4vEQJxt3QSH8bYfLBCM3Gc/. At the time of this writing that tab is just a raw copy of our original MODS metadata .csv as exported from Digital.Grinnell and described in Creating a Migration Collection. Essentially the ready-for-CB tab needs to contain the data that it currently holds, but transformed into a structure matching that of Sheet 1, the initial imported demo data from the CollectionBuilder-CSV Metadata Template.
Using Open Refine?
My first thought was to use Open Refine to manipulate the CSV structure and data, and I do have this tool installed on all of my Macs in case it is needed. However, I’m not a huge fan of Open Refine, in part because it is Java-based and, in my opinion, bloated and cumbersome. My bigger concern is that capturing the transform “process” isn’t a natural thing in Open Refine, and I expect to repeat this same process many times over as we work through each collection of Digital.Grinnell objects. Also, over time I expect to refine and improve the transformation process, so I’d like to have the logic captured in a repeatable script rather than in a GUI-driven tool.
Python?
Of course! My intent is to create a Python script capable of reading and writing Google Sheet data and structures so that I can create, manage, improve, and above all, repeat my transforms. CollectionBuilder is Jekyll-based so it does not involve Hugo, but the Python scripts in my Hugo Front Matter Tools should still provide a good starting point for crafting scripts to help with this.
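To illustrate the shape of the transform I have in mind, here is a minimal local-CSV sketch. The column names in the mapping are hypothetical stand-ins; the real mods.csv export has many more fields, and the actual transform-mods-csv-to-ready-for-CB script reads and writes Google Sheet tabs rather than local files.

```python
import csv

# Hypothetical mapping from Digital.Grinnell MODS export columns to
# CollectionBuilder demo-schema columns; the real mapping is far larger.
MODS_TO_CB = {
    "PID": "objectid",
    "mods_titleInfo_title": "title",
    "mods_abstract": "description",
}

def transform_row(mods_row: dict) -> dict:
    """Map one exported MODS record onto the CB column structure."""
    return {cb: mods_row.get(mods, "") for mods, cb in MODS_TO_CB.items()}

def transform_csv(mods_path: str, cb_path: str) -> None:
    """Read an exported mods.csv and write a ready-for-CB CSV."""
    with open(mods_path, newline="", encoding="utf-8") as src:
        rows = [transform_row(r) for r in csv.DictReader(src)]
    with open(cb_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=list(MODS_TO_CB.values()))
        writer.writeheader()
        writer.writerows(rows)
```

Keeping the column mapping in one dictionary like this is what makes the process easy to refine and repeat as each new collection is migrated.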
Update: I’m going to pivot the effort described above and take a slightly different approach. My first efforts in Python produced transform-mods-csv-to-ready-for-CB plus a new Google Sheet at https://docs.google.com/spreadsheets/d/1ic4PxHDbuzDrmf4YtauhC4vEQJxt3QSH8bYfLBCM3Gc/. The head of the README.md file describes where I “was” going with the effort:
This script, evolved from rootstalk-google-sheet-to-front-matter.py from my https://github.com/Digital-Grinnell/hugo-front-matter-tools project, is designed to read all exported MODS records from the mods.csv tab of https://docs.google.com/spreadsheets/d/1ic4PxHDbuzDrmf4YtauhC4vEQJxt3QSH8bYfLBCM3Gc/ and transform that data into a new ready-for-CB tab of the same Google Sheet, but using the column heading/structure of the CollectionBuilder demo Sheet1 tab.
The Pivot
Essentially, rather than making the ready-for-CB tab conform to CollectionBuilder’s out-of-the-box metadata schema, I’m going to make my initial CollectionBuilder configuration conform to the schema reflected in the mods.csv tab of the aforementioned Google Sheet. Wish me luck.
Not Going to Pivot After All
Ok, I changed my mind, again. I started re-structuring my exported MODS data into a brand new CollectionBuilder configuration as suggested in the “Pivot”, and realized that my initial approach of scripting the transformation of MODS data to match CB’s out-of-the-box “demo” schema was a better idea after all. So, I did just that, and this morning I have a working script, transform-mods-csv-to-ready-for-CB, and a transformed set of data now in the aforementioned Google Sheet at https://docs.google.com/spreadsheets/d/1ic4PxHDbuzDrmf4YtauhC4vEQJxt3QSH8bYfLBCM3Gc/.
Time now to reconfigure CB-CSV_DG-01 and point it at the new .csv data.
Metadata Changes
With our metadata stored in the latest time-stamped tab at https://docs.google.com/spreadsheets/d/1ic4PxHDbuzDrmf4YtauhC4vEQJxt3QSH8bYfLBCM3Gc/, I exported the Google Sheet to a transformed.csv file and dropped that file into this repo’s _data directory as prescribed in the README.md documentation.
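The export-and-drop step can itself be scripted. The sketch below uses Google’s CSV-export endpoint for a single worksheet/tab; the helper names are mine, not part of any existing script, and downloading this way assumes the Sheet is at least link-readable.

```python
import urllib.request
from urllib.parse import quote

SHEET_ID = "1ic4PxHDbuzDrmf4YtauhC4vEQJxt3QSH8bYfLBCM3Gc"

def export_url(tab: str, sheet_id: str = SHEET_ID) -> str:
    """Build Google's CSV-export URL for one worksheet/tab of a Sheet."""
    return (f"https://docs.google.com/spreadsheets/d/{sheet_id}"
            f"/gviz/tq?tqx=out:csv&sheet={quote(tab)}")

def download_tab(tab: str, dest: str = "_data/transformed.csv") -> None:
    """Fetch the tab as CSV and save it into the repo's _data directory."""
    urllib.request.urlretrieve(export_url(tab), dest)
```

Scripting the export keeps _data/transformed.csv reproducible from the Sheet, which matters since this step will be repeated for every collection.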
Config Changes
After exporting and depositing the transformed.csv file, I incrementally made changes to the project’s _config.yml file, and others as described below, regenerating a new CB instance after each change.
Change Collection Settings: metadata
I changed the last line of _config.yml from…
##########
# COLLECTION SETTINGS
#
# Set the metadata for your collection (the name of the CSV file in your _data directory that describes the objects in your collection)
# Use the filename of your CSV **without** the ".csv" extension! E.g. _data/demo-metadata.csv --> "demo-metadata"
metadata: demo-metadata
…to…
metadata: transformed
The result wasn’t great; it included a host of “Notice” messages like the one shown below.
╭─mcfatem@MAC02FK0XXQ05Q ~/GitHub/CB-CSV_DG-01 ‹main●›
╰─$ bundle exec jekyll s
Configuration file: /Users/mcfatem/GitHub/CB-CSV_DG-01/_config.yml
Source: /Users/mcfatem/GitHub/CB-CSV_DG-01
Destination: /Users/mcfatem/GitHub/CB-CSV_DG-01/_site
Incremental build: disabled. Enable with --incremental
Generating...
Error cb_helpers: Item for featured image with objectid 'demo_001' not found in configured metadata 'transformed'. Please check 'featured-image' in '_data/theme.yml'
Notice cb_page_gen: record '0' in 'transformed', 'grinnell:10023' is being sanitized to create a valid filename. This may cause issues with links generated on other pages. Please check the naming convention used in 'objectid' field.
...
Clearly, the objectid values I’m writing to the metadata .csv file need to be improved. I’m making that change in the transform-mods-csv-to-ready-for-CB script now.
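The fix boils down to replacing characters CollectionBuilder won’t accept in an objectid (like the colon in grinnell:10023). A minimal sketch of such a sanitizer, assuming CB’s recommendation of lowercase letters, digits, hyphens, and underscores only:

```python
import re

def sanitize_objectid(pid: str) -> str:
    """Turn a Digital.Grinnell PID like 'grinnell:10023' into a
    CollectionBuilder-safe objectid: lowercase, with any character
    other than a-z, 0-9, hyphen, or underscore replaced by '_'."""
    return re.sub(r"[^a-z0-9_-]", "_", pid.lower())
```

This produces the grinnell_NNNNN form of identifier used throughout the rest of this post.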
After Sanitizing the objectid Field
After the change I got this output…
Auto-regeneration: enabled for '/Users/mcfatem/GitHub/CB-CSV_DG-01'
Server address: http://127.0.0.1:4000
Server running... press ctrl-c to stop.
[2022-12-13 12:52:26] ERROR `/items/grinnell:23517.html' not found.
Regenerating: 1 file(s) changed at 2022-12-13 13:01:21
_data/transformed.csv
Error cb_helpers: Item for featured image with objectid 'demo_001' not found in configured metadata 'transformed'. Please check 'featured-image' in '_data/theme.yml'
...done in 0.492276 seconds.
…and the output looks better, but there’s still an issue with ‘demo_001’ as a “featured item”.
Changing the Home Page (Featured Image) in theme.yml
In CB the theme.yml file is home to settings that “…help configure details of individual pages in the website”. I made the following changes to that file.
I changed the last line of this “HOME PAGE” snippet from…
##########
# HOME PAGE
#
# featured image is used in home page banner and in meta markup to represent the collection
# use either an objectid (from an item in this collect), a relative location of an image in this repo, or a full url to an image elsewhere
featured-image: demo_001
…to…
featured-image: grinnell_23345
That change automatically regenerated my local site, but this time with the following error…
Regenerating: 1 file(s) changed at 2022-12-13 13:14:09
_data/theme.yml
Error cb_helpers: Item for featured image with objectid 'grinnell_23345' does not have an image url in metadata. Please check 'featured-image' in '_data/theme.yml' and choose an item that has 'object_location' or 'image_small'
...done in 0.499817 seconds.
This is actually indicative of a MUCH bigger issue… Apparently the object_location values that I’m providing, as links to the original objects in Digital.Grinnell, are not acceptable. They need to have something like /datastream/OBJ/view appended to them in order to work correctly.
Pointing CB to Digital.Grinnell Storage
It’s now time to clone the filename function for a new obj function in transform-mods-csv-to-ready-for-CB so that it references exported “OBJ” objects with URLs from DG like https://digital.grinnell.edu/islandora/object/grinnell:23517/datastream/OBJ/view. I’m also going to add a thumbnail function to populate the image_thumb AND image_small metadata columns.
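A sketch of what those helpers might look like follows. The function names mirror the prose above, but the exact signatures in my script may differ, and this assumes every migrated object carries both an OBJ and a TN (thumbnail) datastream in Islandora.

```python
BASE = "https://digital.grinnell.edu/islandora/object"

def obj(pid: str) -> str:
    """URL of the object's OBJ (original file) datastream."""
    return f"{BASE}/{pid}/datastream/OBJ/view"

def thumbnail(pid: str) -> str:
    """URL of the object's TN (thumbnail) datastream."""
    return f"{BASE}/{pid}/datastream/TN/view"

def add_image_columns(row: dict, pid: str) -> dict:
    """Populate object_location, image_thumb, and image_small
    on one ready-for-CB metadata row."""
    row["object_location"] = obj(pid)
    row["image_thumb"] = thumbnail(pid)
    row["image_small"] = thumbnail(pid)
    return row
```

Reusing the TN datastream for both image_thumb and image_small is a shortcut; a dedicated small-image derivative could be substituted later without changing the row structure.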
With those columns completed, the local site at http://127.0.0.1:4000/ is working, but it’s not pretty. One lesson learned… the featured-image listed in _data/theme.yml MUST have a valid image_small element in order to display correctly. The error message shown above will persist until image_small is resolved.
Next, Pushing Local Changes to Azure
At this time I’m going to follow the guidance in Tutorial: Publish a Jekyll site to Azure Static Web Apps to create a shared/visible instance of the new CB site.
Done. The process was super-simple and the results as-expected. You can see the site built from the main branch of https://github.com/Digital-Grinnell/CB-CSV_DG-01 at https://purple-river-002460310.2.azurestaticapps.net/.
The GitHub Action driving the build and deployment of the main branch reads like this:
name: Azure Static Web Apps CI/CD

on:
  push:
    branches:
      - main
  pull_request:
    types: [opened, synchronize, reopened, closed]
    branches:
      - main

jobs:
  build_and_deploy_job:
    if: github.event_name == 'push' || (github.event_name == 'pull_request' && github.event.action != 'closed')
    runs-on: ubuntu-latest
    name: Build and Deploy Job
    steps:
      - uses: actions/checkout@v2
        with:
          submodules: true
      - name: Build And Deploy
        id: builddeploy
        uses: Azure/static-web-apps-deploy@v1
        with:
          azure_static_web_apps_api_token: ${{ secrets.AZURE_STATIC_WEB_APPS_API_TOKEN_PURPLE_RIVER_002460310 }}
          repo_token: ${{ secrets.GITHUB_TOKEN }} # Used for Github integrations (i.e. PR comments)
          action: "upload"
          ###### Repository/Build Configurations - These values can be configured to match your app requirements. ######
          # For more information regarding Static Web App workflow configurations, please visit: https://aka.ms/swaworkflowconfig
          app_location: "/" # App source code path
          api_location: "" # Api source code path - optional
          output_location: "_site" # Built app content directory - optional
          ###### End of Repository/Build Configurations ######
  close_pull_request_job:
    if: github.event_name == 'pull_request' && github.event.action == 'closed'
    runs-on: ubuntu-latest
    name: Close Pull Request Job
    steps:
      - name: Close Pull Request
        id: closepullrequest
        uses: Azure/static-web-apps-deploy@v1
        with:
          azure_static_web_apps_api_token: ${{ secrets.AZURE_STATIC_WEB_APPS_API_TOKEN_PURPLE_RIVER_002460310 }}
          action: "close"
We Need a 2nd Azure Instance
So, I’ve created a new develop branch with its own GitHub Action and deployment to Azure at:
Introducing Oral Histories
So the convention in CollectionBuilder is that for every value of display_template there should be a corresponding .html template in _layouts/item with the same name, and that template will be used to render the object and control the object’s individual page behavior. The documentation also says that any display_template value that does not have a corresponding _layouts/item template will assume a type of item.
I decided to test that in the new develop branch. So, I first changed the display_template of our “grinnell_19423” object, an oral history interview, from audio to test. Sure enough, it rendered as an item as promised.
Next, I copied _layouts/item/audio.html, the audio template, and gave the copy a name of _layouts/item/oral-history.html. I made no changes to the template. Then I altered our transformed.csv data to give “grinnell_19423” a display_template value of oral-history, matching the name of the new template. Did it work? You betcha! The object is now rendered like an audio object since that’s what oral-history.html does.
Kudos to the authors of CollectionBuilder. Well played!
Will .obj Filename Extensions Work?
There’s evidence in the few tests I’ve run thus far that the extension on the end of a filename makes no difference in how, or if, the object is rendered. Let’s test that a little further by changing the extension on a couple of cloned objects stored in Azure to .obj, and altering the transformed.csv file to point to them instead of to Digital.Grinnell. The objects I’m going to alter are “grinnell_19423”, an oral-history type with a .mp3 extension, and “grinnell_16934”, an image with a .jpg extension. The new URLs for these cloned items are:
- https://migrationtestcollection.blob.core.windows.net/migration-test/grinnell_19423_OBJ.obj
- https://migrationtestcollection.blob.core.windows.net/migration-test/grinnell_16934_OBJ.obj
Huzzah, it works! Beautimous!
That solves the need for applying proper filename extensions to Digital.Grinnell objects that have none! The only problem with this .obj approach is that downloaded objects might not behave as expected, but we’ll cross that bridge when we come to it.
I’m sure there will be more here soon, but for now… that’s a wrap.