This blog post chronicles portions of a process used to restore, and subsequently WARC (create a web archive of), a Communication Department website that had been retired. The website content of interest included material describing plans for the recently completed HSSC and Administration Building projects.

During the restoration and WARC process a .md document named was created and its contents are presented here.

Creating a WARC from a Clone of

In July 2022 a new “clone” of the original web site project – a WordPress copy that contains posts and supporting information regarding campus construction projects including the HSSC and Admissions Center – was created. That clone can be found, and administered, from

Turning Redirection Off

At present, both the and my clone at are redirected to That’s not the site that we want to WARC, so I need to turn redirection off in the /clone site so that it can be WARC’d, I hope.

First Attempt

On the page I will “disable” all redirects by selecting all of the listed URLs, then under Bulk Actions I’ll choose Disable and click the Apply button. Done.


Unfortunately, still redirects to :frown:

Second Attempt

So, now I’m going to delete the old redirection using the same process, but selecting Delete rather than Disable. Done.


Nope, still redirected.

Third Attempt

I’m going to visit settings and try to point to a new address like So, I’ll change the Site Address (URL) field from to


Nope, no longer redirected but I got a big SORRY message saying the site could not be found.
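
A minimal sketch of how each attempt above could be checked from a terminal rather than a browser, which may cache old redirects. The `redirect_target` function name is hypothetical; it only reads HTTP response headers on standard input and prints the `Location` value, if there is one.

```shell
# Print the value of the Location header, if any, from HTTP response
# headers read on standard input. Header names are matched without
# regard to case, and carriage returns (CRLF line endings) are removed.
redirect_target() {
  tr -d '\r' | awk 'tolower($1) == "location:" { print $2; exit }'
}
```

Assuming curl is installed, something like `curl -sI "https://example.org/clone/" | redirect_target` would print the redirect destination, or nothing if the server does not redirect. The `example.org` address is a placeholder, not the real clone.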

Enlisting Help from DLAC

Clearly, my attempts to “clone” the old comm site in a form that could be successfully archived had failed, so I turned to the Digital Liberal Arts Collaborative (DLAC) and their Reclaim Hosting admin powers.

The next section of this document is a thread of emails captured as a PDF document and subsequently converted to Markdown format for publication here.

DLAC Email Thread

Thread elements are in reverse-chronological order.

Subject: Re: email
Date: Friday, July 29, 2022 at 3:09:11 PM Central Daylight Time
From: Pelzel, Morris
To: Rodrigues, Elizabeth, McFate, Mark

OK, should be ready to go …


Dr. Morris Pelzel

From: Rodrigues, Elizabeth
Sent: Friday, July 29, 2022 11:07 AM
To: Pelzel, Morris; McFate, Mark
Subject: Re: email

Thanks, Mo, and I’m sorry about the jargon. A WARC is a web archive file format created through a process of crawling a site.

If we could clone the site to dg-dev directly, I think that would be our best bet for a next thing to try. Basically, we want to be able to crawl the site as it was originally published in wordpress.

Elizabeth Rodrigues, PhD

From: Pelzel, Morris
Sent: Friday, July 29, 2022 11:04 AM
To: Rodrigues, Elizabeth; McFate, Mark
Subject: Re: email

Hi Mark and Liz,

I’m back in town and taking a look at this. I’m trying to get clear for myself exactly what it is that you want to do, so it may be best for us to meet in person sometime next week to sort things out. When you refer to WP “modules” Mark, I assume you mean plug-ins?

In general, we handle redirects, backups, restorations, migrations, and the like, in cPanel, and not in WordPress itself. It’s just cleaner and simpler to do it that way.

Perhaps the issue is that we set up the clone as a subdirectory instead of a subdomain. As a subdirectory, the clone remains part of the original domain, so the redirect cannot be removed. If we instead created it as a subdomain, then it would appear in the list of domains in the cPanel Domains module, and we could then remove the redirects for that subdomain.

But would it not be easier just to clone the site directly in dg- We should be able to clone a WP site from one cPanel account (comms) into another (dg-dev). Then we should be able to turn off any redirects.

Let me know if I am on the right track here.

Also … I do not know (and perhaps do not need to know) what WARC is.



From: Rodrigues, Elizabeth
Sent: Wednesday, July 27, 2022 4:50 PM
To: McFate, Mark; Pelzel, Morris
Subject: Re: email

And I’d add that the pain point here is the redirect that Comm currently has set up. It doesn’t appear to be changeable from within the cloned copy, and when Mark tried reconstructing the site on his own subdomain using Updraft, the homepage worked but all the links still pointed back to the cloned comm site with the apparently baked in redirect.

Is getting a copy with no redirect possible? Or does comm have to stop the redirect from within their own cPanel long enough for us to copy it?

By redirect, I mean now redirects to We have confirmed that the WP site has unique content, and on top of that, WARCing the redirected address leads to WARCing the whole college site…as we learned.

Thanks for any insight you have!

From: McFate, Mark
Sent: Wednesday, July 27, 2022 2:57 PM
To: Pelzel, Morris
Cc: Rodrigues, Elizabeth
Subject: Re: email

Good afternoon, Mo.

I’ve been waiting on some ITS changes to DG today and turned my attention back to for a bit. In that site’s wp-admin I tried turning off, then deleting, the “Redirection” module, but that had no effect.

So, I tried changing the site’s “settings” to have it resolve to a different URL, and that didn’t work. Then I tried changing it to resolve to my new https://dg- address, but that also failed.

Liz suggested trying the “Updraft” module to migrate the site and provided an article with guidance. Once I’d completed the prescribed backup process, I tried to restore the backup into, but was warned that the free version of “Updraft” is for “backup only”, and not to be used for “migration”. The migration add-on costs extra, or one must purchase Updraft Premium. 8^(

Well, I didn’t like that answer so I proceeded with the restoration anyway. The outcome was interesting… I got a copy of the old home page at, but all of the navigation was still redirected to their new site, and some nav elements didn’t work at all.

So, that was not a site that I can WARC as intended.

The other effect of restoring from backup was that I lost access to, since that address always asked me to login and then took me back to again. So, I opened the cPanel for and uninstalled WordPress, and have since re-installed a pristine copy and I have wp-admin access there once again.

Through all of this we looked at different means of properly “migrating” the old WordPress site at to my new domain at, but everything I’ve found so far suggests that there is no easy DIY process, there are only $$$$ options available. Even Reclaim’s own discussion about migration suggests the same…

So, I’m wondering if you have a recommendation for me…. How can we easily get the WordPress content that’s in migrated to

Thanks for any advice you can offer. Take care.

-Mark M.

From: McFate, Mark
Date: Monday, July 25, 2022 at 10:44 AM
To: Pelzel, Morris
Subject: Re: email

Ok, thanks Mo. No worries, and no rush. Take care.

-Mark M.

From: Pelzel, Morris
Date: Monday, July 25, 2022 at 10:42 AM
To: McFate, Mark
Subject: email


I’m setting up the domain you requested. If you just received an email about your password, please ignore it…I accidentally left a check box checked (that should have been unchecked).

I’ll send you more information in a moment.


Attempting to WARC

DLAC was able to properly clone the old comm site into my WordPress space, without redirection, so my hope was restored. I set about creating a WARC of that site…

First wget from My MacBook Pro

wget --warc-file=living-and-learning-community-web-archive --recursive --level=5 --warc-cdx --page-requisites --html-extension --convert-links --execute robots=off --directory-prefix=. -x /solr-search --wait=10 --random-wait
FINISHED --2022-08-01 12:08:09--
Total wall clock time: 18m 19s
Downloaded: 94 files, 11M in 1m 13s (156 KB/s)

Second wget from iMac

wget --warc-file=living-and-learning-community-web-archive --recursive --level=10 --warc-cdx --page-requisites --html-extension --convert-links --execute robots=off --directory-prefix=. -x /solr-search --wait=10 --random-wait
FINISHED --2022-08-01 14:04:25--
Total wall clock time: 16m 13s
Downloaded: 94 files, 11M in 7.3s (1.54 MB/s)


Since both wget operations returned 94 files, it seemed safe to assume that they constituted a complete archive.
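
Two runs reporting the same file count do not necessarily hold the same URLs, so a stricter comparison is to diff the URL lists in the two .cdx indexes. The sketch below assumes the layout that wget writes with `--warc-cdx`: a ` CDX ...` header line followed by one line per record, with the original URL as the first whitespace-separated field. The function names are hypothetical.

```shell
# List the unique URLs recorded in a CDX index.
# Assumption: line 1 is the " CDX ..." header and the URL is the first
# whitespace-separated field on every later line.
cdx_urls() {
  awk 'NR > 1 && NF > 0 { print $1 }' "$1" | sort -u
}

# Print URLs found in only one of two CDX files. Lines unique to the
# first file are unindented; lines unique to the second start with a tab.
cdx_diff() {
  local a b
  a=$(mktemp) && b=$(mktemp) || return 1
  cdx_urls "$1" > "$a"
  cdx_urls "$2" > "$b"
  comm -3 "$a" "$b"
  rm -f "$a" "$b"
}
```

Empty output from `cdx_diff first.cdx second.cdx` would mean both captures recorded the same set of URLs. It would not show that either capture is complete, only that the two agree.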

On the iMac the process produced the following .cdx index and .warc.gz compressed archive…

╭─markmcfate@MAD25W812UJ1G9 ~ ‹ruby-2.3.0›
╰─$ ls -alh living*
-rw-r--r--  1 markmcfate  staff    35K Aug  1 14:04 living-and-learning-community-web-archive.cdx
-rw-r--r--  1 markmcfate  staff   9.1M Aug  1 14:04 living-and-learning-community-web-archive.warc.gz
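
A quick sanity check on a .warc.gz file, sketched under the assumption that gzip and grep are available (they are on macOS). The hypothetical `warc_summary` function first tests that the archive decompresses cleanly, then prints an approximate count of `response` records by counting `WARC-Type: response` header lines.

```shell
# Check that a .warc.gz file decompresses cleanly, then count the
# "WARC-Type: response" header lines it contains. The count is
# approximate: a payload line with the same text would also be counted.
warc_summary() {
  local file="$1"
  gzip -t "$file" || { echo "corrupt: $file" >&2; return 1; }
  # -a makes grep treat the stream as text even though payloads may be
  # binary; LC_ALL=C avoids multibyte decoding errors on such data.
  gzip -dc "$file" | LC_ALL=C grep -a -c '^WARC-Type: response'
}
```

For example, `warc_summary living-and-learning-community-web-archive.warc.gz` prints a single number that can be compared against the file count wget reported.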

WARC is NOT Complete

Unfortunately, the WARC mentioned above is woefully incomplete because the WordPress reconstruction of the old site is also incomplete. A lot of the old content regarding projects like the HSSC, Admissions Center, and campus Landscaping was still “published” in the site, but excluded from navigation so the wget... command used to produce the WARC was unable to “find” it.

I enlisted the help of Donna D., an original author of the site, to reassemble things as best we could. Now that that’s done (8-Sep-2022) I’m kicking off a new WARC process on iMac 8660, like so…

╭─markmcfate@MAD25W812UJ1G9 ~/ ‹ruby-2.3.0›
╰─$ time wget --warc-file=living-and-learning-community-web-archive --recursive --level=10 --warc-cdx --page-requisites --html-extension --convert-links --execute robots=off --directory-prefix=. -x /solr-search --wait=10 --random-wait
Opening WARC file ‘living-and-learning-community-web-archive.warc.gz’.

/solr-search: Scheme missing.
--2022-09-08 13:00:47--
Resolving (
Connecting to (||:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘./’

     0K .......... .......... .......... .......... ..........  314K
    50K .......... .......... .......... .......... ....       6.78M=0.2s

2022-09-08 13:00:48 (569 KB/s) - ‘./’ saved [96460] 
Converting links in ./ nothing to do.
Converting links in ./ nothing to do.
Converting links in ./ 100.
Converted links in 198 files in 2.6 seconds.
wget --warc-file=living-and-learning-community-web-archive --recursive         30.23s user 31.62s system 0% cpu 2:59:10.63 total
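
The `/solr-search: Scheme missing.` line in the log above suggests that wget read `/solr-search` as a second URL rather than as a path to skip. Lowercase `-x` is `--force-directories` and takes no argument; the option that excludes directories is uppercase `-X`, or `--exclude-directories`. If excluding that path was the intent, the command might be assembled as in the sketch below. The `warc_command` function is hypothetical and only prints the command so it can be reviewed first; the URL is supplied by the caller.

```shell
# Print (do not run) a wget command line like the one in the log, with
# the path exclusion passed through --exclude-directories. The -x
# (--force-directories) flag is left out here because --recursive
# already creates a directory hierarchy by default.
warc_command() {
  local name="$1" url="$2"
  echo wget "--warc-file=$name" --warc-cdx --recursive --level=10 \
    --page-requisites --html-extension --convert-links \
    --execute robots=off --directory-prefix=. \
    --exclude-directories=/solr-search \
    --wait=10 --random-wait "$url"
}
```

Running `warc_command my-archive "https://example.org/"` prints one line that can be pasted into a terminal. Both arguments in that example are placeholders.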


The output from the above operation includes /Users/markmcfate/ and a corresponding .cdx file both stored on iMac 8660. Both files have also been copied to my OneDrive so the .gz file also exists at /Users/markmcfate/Library/CloudStorage/OneDrive-GrinnellCollege/iMac-Home-Folder-07-Sep-2022/living-and-learning-community-web-archive.warc.gz.

Moving WARCs to //Storage

I’ve struck through the mention of OneDrive above because on my Macs I just don’t trust OneDrive anymore. Today I created a new WARCs folder in my OneDrive, or at least I thought I did, in order to consolidate my storage of WARC archives. Well, the folder structure that I see in my OneDrive on iMac 8660 doesn’t look the same as on my GC MacBook, MA01713. So, like I said, I just don’t trust it.

I do have a reliable home for WARCs in //Storage, the college’s age-old network storage, so that’s where I’m going to put these precious files, at least for now. So I’ve mounted //Storage/Library/mcfatem on my MacBook as verified below…

╭─mcfatem@MAC02FK0XXQ05Q /Volumes/Library/mcfatem
╰─$ pwd

Note: I keep a mount link in Finder on every Mac I have. It reads something like this under the Go | Connect to Server... menu: smb://storage/library/mcfatem.

Capturing a WARC of

While working on this WARC process I discovered that a very old website,, was still “active” (although the site certificates were now invalid) but overdue to be retired. So, I assumed it would be a good idea to capture a WARC. Due to the expired certificate I used this command on iMac 8660 to capture the site:

time wget --warc-file=rootstalk-archive-WARC --recursive --level=10 --warc-cdx --page-requisites --html-extension --convert-links --execute robots=off --directory-prefix=. -x /solr-search --wait=10 --random-wait --no-check-certificate

The result is a pair of files, almost 2.5 GB in size, named:

-rwx------@ 1 markmcfate  staff   1.0M Sep 14 23:48 rootstalk-archive-WARC.cdx
-rwx------@ 1 markmcfate  staff   2.4G Sep 14 23:48 rootstalk-archive-WARC.warc.gz

//Storage WARC Contents

All of the aforementioned WARC capture files, and more, are now stored in //Storage as shown below…

╭─markmcfate@MAD25W812UJ1G9 /Volumes/mcfatem/warcs ‹ruby-2.3.0›
╰─$ ls -alh
total 6481568
drwx------+ 1 markmcfate  GRIN\Domain Users    16K Sep 16 10:07 .
drwx------+ 1 markmcfate  GRIN\Domain Users    16K Jul 13 10:23 ..
-rwx------@ 1 markmcfate  staff               361K Sep  8 15:59 living-and-learning-community-web-archive.cdx
-rwx------@ 1 markmcfate  staff               556M Sep  8 15:59 living-and-learning-community-web-archive.warc.gz
-rwx------+ 1 markmcfate  staff                74K Oct 27  2021 mime-and-me.warc.cdx
-rwx------+ 1 markmcfate  staff               101M Oct 27  2021 mime-and-me.warc.warc.gz
-rwx------@ 1 markmcfate  staff               1.0M Sep 14 23:48 rootstalk-archive-WARC.cdx
-rwx------@ 1 markmcfate  staff               2.4G Sep 14 23:48 rootstalk-archive-WARC.warc.gz
-rwx------@ 1 markmcfate  staff               621K Sep 14 14:36 wget-log
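
Since these files were copied across a network mount, one way to confirm that a copy matches its original is to compare checksums. This is a minimal sketch using the POSIX `cksum` utility; the `same_file` function name is hypothetical.

```shell
# Succeed only if two files have the same CRC and byte count.
# Reading via stdin keeps the file name out of cksum's output, so the
# two results can be compared as plain strings.
same_file() {
  [ "$(cksum < "$1")" = "$(cksum < "$2")" ]
}
```

A check might look like `same_file ~/rootstalk-archive-WARC.warc.gz /Volumes/mcfatem/warcs/rootstalk-archive-WARC.warc.gz && echo match`, where the first path is an assumed local location. A stronger hash, such as `shasum -a 256`, could be substituted for `cksum`.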

Verified Using

I turned to ReplayWeb in order to confirm the validity of the two “new” WARCs listed above. Using that tool I was able to successfully load and subsequently browse both of the most recently captured WARCs, specifically…

Note that, because of its size, the ReplayWeb rendering of rootstalk-archive-WARC.warc.gz takes a very, very, very long time to load and even longer to render!


After the COVID-19 pandemic was declared a thing of the past in June 2023, Grinnell College made the decision to archive the coronavirus portion of its website,

My first attempts to capture that site caught way too much stuff, presumably because the coronavirus page includes a global menu that opens up all of So, I added the --no-parent option to my wget command, but in that form the command didn’t catch much at all. I found it necessary to also drop the trailing slash at the end of from my original command, and then the capture looked reasonable.

So, the wget that I ultimately used was this:

mcfatem@MAD25W812UJ1G9 ~ % wget --warc-file=coronavirus-pages-web-archive --recursive --level=10 --warc-cdx --page-requisites --html-extension --convert-links --execute robots=off --directory-prefix=. -x /solr-search --wait=10 --random-wait --no-parent

Note that there’s no trailing slash on the https://... specification!

Like the WARCs that came before, this capture has been copied to //Storage/Library/mcfatem/warcs/ for safe-keeping.