Preserving the Personal Landing Page

In going through my CV on this very landing page this year I came across the same problem that so many of us digital preservationists deal with in our day-to-day professional lives: linked content was starting to get wonky. The Blog Posts section of my CV has a series of posts I put together for a former employer that had undergone a website redesign since I departed the organization. My posts were still up, not having succumb to bonified link rot, but the new styling of the website made the posts look a bit off and some images were already starting to break

Luckily I had an excellent resource at my disposal to help address the problem: Archive-It. The organization I worked for had been a member of the Archive-It sponsored Community Webs program and as a result I had spent a fair bit of time selecting and crawling websites related to the geographic region of the organization, including the website of the organization itself. As a result I had access to fully crawled, monthly snapshots of the organization website.

Instead of my blog posts looking more and more janky by the month (and almost certainly eventually breaking outright), I could instead share those posts as they were meant to be seen, with images still extant and formatted to look attractive in the CSS as it looked when I wrote them.

Recently while continuing to tweak my CV page, I realized that a number of other areas could benefit from a similar treatment, this time just using the standard Wayback Machine. Pretty much anything that was linked to could benefit by being converted into a link to a specific entry in the Wayback Machine rather than as a link to a live site, such as:

  • Landing pages that described my involvement in mentoring programs
  • Pages on organization websites listing awards or grants received
  • Conference websites that put my presentation or poster in the context of the larger event
  • Digital projects developed as part of professional service work

One of the big advantages here is being able to pick a moment in time to link to. Linking to an organizational website that lists all of the recipients for a given award is fine, and the content I want folks to see is in that page somewhere, but instead being able to link back to the 2020 version of that site where the award listing relevant to me is at the top of the list because it’s the most recent one is a far better situation.

This strategy even worked in some surprising ways. I was part of a statewide community of practice related to digitization whose WordPress website kept all the past meeting information on one big, long list. Not necessarily ideal if I’m trying to prevent folks from having to hunt for the data relevant to me. Instead I was able to pull up the org’s listserv archives which are all shared as basic HTML files. Using the Wayback Machine Chrome extension, you can do one-time crawls for any site you are currently viewing. It was incredibly easy to do a quick crawl on the listserv email related to the meeting that I was presenting at, and it was even lightning fast because all that was being crawled was a short page of un-styled HTML (example). The web archived listserv email then acts essentially as a conference landing page, providing context about when the event occurred and who else was presenting.

This strategy isn’t without downsides of course. Web archives via the Wayback Machine take a hot minute to load, and web crawls can be imperfect tools that may fail to grab an image for instance. Also, no organization or strategy is a perfect digital preservation solution and there are no guarantees that the Internet Archive will be able to preserve and share these web archives forever.

These cons I think are vastly outweighed by the fact that your landing page becomes far more durable and accurate to the moment in time to which you are referencing and gets closer to that “set it and forget it” vibe. This has been particularly handy as I’ve started to see changes occur in projects that I started but no longer have any control of; I don’t want to link people out to a version of a digital project that may have been totally overhauled by a different person, I want folks to see my own work as I intended it to be seen.

In the future I’m looking forward to factoring this work directly into updates as they occur on my CV and have a sneaking suspicion there are even more links to hunt down and swap out with web archives around this landing page.

Migrating Digital Assets from Islandora to Preservica

About four months ago I was hired as the Digital Asset Management Lead at the University of Rochester, River Campus Libraries. One of the big areas of work that I am responsible for is establishing a sustainable digital preservation service using (among other resources) Preservica Cloud Professional edition. The organization currently holds digital assets in a wide variety of different repository platforms, but of particular concern is our use of Islandora, based on Drupal 7, which will be reaching it’s end-of-life by the end of 2022. While it is possible that the organization could make the jump to Drupal 8/9, that isn’t a path the organization is interested in pursuing, so instead I’m working on other strategies to migrate that content, one of which is moving it into Preservica.

To learn how to engage in this migration work I’m using a small collection (one that I have learned is now almost traditionally used as the “authorized test collection” when engaging with these sorts of migrations), the John McGraw Civil War Papers. As this Islandora instance is intended to be shut down, due to the inevitable link rot, I link here to a representative page preserved in the Internet Archive Wayback Machine and a finding aid is also available for the collection.

This collection contained 78 complex digital objects in Islandora. The access copies of these digital objects were PDFs and they were stored with both MODS and Dublin Core metadata associated with them. While Islandora also hosts the preservation quality TIFF images, I have been unable to find a means to programmatically export these images from Islandora, and as such am securing the preservation masters from the digitization lab which originally photographed the materials (special thanks to Lisa Wright, Digital Scholarship Digitization Specialist in the Digital Scholarship department of the River Campus Libraries). The PDF and metadata XML are being exported from Islandora using the BagIt Utility Module.

In order to facilitate this migration process I’ve written a small set of Python scripts in order to successively manipulate the Islandora bags and preservation masters into something that Preservica can ingest. As an implementation team, we’re still working out how exactly we want to ingest digital assets into Preservica as a standard workflow (and to be honest, our instance of Preservica has yet to have a single authentic ingest outside of testing) but our intentions are the following:

  • To use the PAX standard to structure our digital assets for ingest so that we can have a “many to one” relationship among our SIPs (that is, one PDF and a whole bunch of TIFFs)
  • To use the OPEX standard to formulate our metadata, allowing the submission of an XML record that contains multiple local identifiers, multiple metadata fragments, and allows us to leverage the ArchivesSpace to Preservica synchronization tool so that digital assets in Preservica can be associated with archival objects in ArchivesSpace, and so that the finding aid hierarchy can be replicated in Preservica
  • Use the OPEX Incremental Workflow in Preservica so that folders of digital assets can simply be “dumped” into a staging area and left to be automatically ingested

Many thanks to Richard Smith’s post on the Preservica site, “Using OPEX and PAX for Ingesting Content,” for providing an accessible means of understanding these various different interwoven standards. The aforementioned Python scripts used to form the SIP for this collection can be found in a GitHub repository. Setting aside the lengthy preamble above, my goal is to lay out the purpose of each of the individual 18 functions to detail what I was trying to accomplish with each one.

I should note that I am trained as a librarian and definitely not a software engineer. I’m part of that class of library-critters that finally found problems in the course of their work that needed some code to surmount, instead of trying to tackle the problem manually, and as a result I finally took a MOOC on Python after saying that I would get around to it for years. To sum up, there are almost certainly better ways to execute just about everything I’m going to talk about.

Thus, this document assumes the following:

  • A basic understanding of Python
  • A fairly comprehensive understanding of Preservica, especially:
    • The OPEX metadata standard and ingest method
    • The PAX asset structure standard
    • The ArchivesSpace – Preservica synchronization workflows
  • Enough understanding of Islandora to work with the BagIt Utility Module

My last disclaimer is that I haven’t actually taken the SIP generated by these scripts and tried to ingest it into Preservica yet; that is coming next week. So, fair chance this gets an “Editor’s Note: something went WRONG” with some addendums on what I changed to make Preservica take the SIP intelligibly.

Python Import Statements and variables

import os
import os.path
import shutil
import pathlib
import re
import xml.etree.ElementTree as ET
from datetime import datetime
from bdbag import bdbag_api
from bagit import BagValidationError
from bdbag.bdbagit import BaggingInterruptedError
from pyrsistent import thaw
from zipfile import ZipFile
from openpyxl import Workbook
from os.path import basename

This script set ends up using a pretty varied set of Python modules, some of them sparingly, some of them constantly. the os and shutil modules are used constantly to rename and move files and folders. The bdbagit module is a Python module specifically created for the validation and manipulation of bags created by the BagIt standard. Extracting data from MODS and Dublin Core metadata is pretty common, hence the XML ElementTree module. Others are used sparingly, such as datetime just to create a unique, time-based, identifier for each SIP, as well as the ZipFile module when creating the PAX archives. During the process of staging the data, the access PDFs and the preservation TIFFs need to be compared to make sure they match on a one-to-one basis (reader, they did not) so an Excel spreadsheet was created, and openpyxl was used to generate that. All of this will be gone into greater detail in the various specific functions

proj_path = 'M:/IDT/DAM/McGraw_Preservica_Ingest'
ds_files = 'preservation_masters'

The “proj_path” variable is used constantly to reference the position on one of our network drives where this project folder was housed. The “ds_files” variable is used little just to define the original name of the folder that the preservation masters came over from Digital Scholarship in. In general all of the pathing throughout these scripts is represented by variables representing directories separated by instances of ” ‘/’ “.

  • proj_path represents the location of the project folder on a network drive. It contains the container folder that has all the digital assets in it, the python scripts, and some other sidecar files created by the scripts
  • container is a subdirectory of proj_path and is the folder that will eventually be uploaded via the OPEX incremental workflow and contains all the digital assets and is where they are manipulated to prepare them for ingest
  • bags_dir is a subdirectory of container that initially contains all the access copies of the digital assets and the metadata files, which are manipulated in this folder before being merged with the preservation assets in container

This ends up looking like “variable + ‘/’ + variable + ‘/’ variable” which simplifies visualizing how deep into the directory structure you are, instead of including those slashes in with the variables themselves. Once the scripts start working on this content in earnest a fairly typical structure would be:

  • proj_path
    • container
      • directory1
      • directory2
      • directory3
      • bags_dir
        • bag1
        • bag2
        • bag3

def create_container()

In advance of this function, the folder full of preservation masters provided by Digital Scholarship is copied into proj_path. The name of the folder is stored in the ds_files global variable.

def create_container():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'a')
    now = datetime.now()
    date_time = now.strftime('%Y-%m-%d_%H-%M-%S')
    project_log_hand.write(date_time + '\n')
    container = 'container_' + date_time
    os.rename(proj_path + '/' + ds_files, proj_path + '/' + container)
    project_log_hand.write(container + '\n')
    print('Container directory: {}'.format(container))
    project_log_hand.close()

First up, we need to create the container folder that will be the wrapper for the entire SIP that gets sent off to Preservica. The OPEX incremental method is nice because you let Preservica know what files to expect in a manifest in the OPEX metadata, that way it doesn’t start trying to process a digital asset before it has all the components of it (such as metadata), reducing errors on ingest. My understanding is that using a container folder to hold the whole SIP is sort of a way of cheating this a bit, letting Preservica know to not start doing anything until this one container folder is ingested, and it just so happens to contain a myriad of digital assets in it, that also will benefit from that. This saves having to build a manifest for many different assets.

This function opens up a “project_log.txt” file which will be referenced in every single subsequent function. The idea here was to park variables in a text file so that they would be portable and durable outside of the code itself, and let me set the project aside for days at a time and come back to it. Inside the text file we store the current date and time on the first line, and then “container_” with that date/time variable appended to it; this string corresponds to container. Then we rename the folder that Digital Scholarship provided that contains all the preservation masters into the variable stored in container followed by a print statement letting us know what that container variable is called. If I made it right now it would be “container_2022-02-18_18-07-37” reflecting the current date and time.

folder_ds_files()

def folder_ds_files():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    folder_name = ''
    folder_count = 0
    file_count = 0
    for file in os.listdir(path = proj_path + '/' + container):
        file_root = file.split('-')
        file_root = file_root[0]
        if  file_root == folder_name:
            shutil.move(proj_path + '/' + container + '/' + file, proj_path + '/' + container + '/' + folder_name + '/' + file)
            file_count += 1
        else:
            folder_name = file_root
            os.mkdir(proj_path + '/' + container + '/' + folder_name)
            folder_count += 1
            shutil.move(proj_path + '/' + container + '/' + file, proj_path + '/' + container + '/' + folder_name + '/' + file)
            file_count += 1
    for folder in os.listdir(path = proj_path + '/' + container):
        count = 0
        for file in os.listdir(proj_path + '/' + container + '/' + folder):
            count += 1
        if count > 99:
            os.rename(proj_path + '/' + container + '/' + folder, proj_path + '/' + container + '/' + folder + "-001-" + str(count))
        elif count > 9:
            os.rename(proj_path + '/' + container + '/' + folder, proj_path + '/' + container + '/' + folder + "-001-0" + str(count))
        else:
            os.rename(proj_path + '/' + container + '/' + folder, proj_path + '/' + container + '/' + folder + "-001-00" + str(count))
    print('Created and renamed {} subdirectories and moved {} files into them'.format(folder_count, file_count))
    project_log_hand.close()

After the first line calling the function declaration, we have the standard first set of lines that opens “project_log.txt” and reads some subset of the first three lines to extract the variables mentioned above which include date_time, container, and bags_dir. Following this we have an empty string variable that will be used in a for loop shortly, as well as a couple incrementing counter variables that are present in most functions to be passed into a concluding print statement to show just how many actions the function was able to take.

The folder of preservation assets that Digital Scholarship is passing along is one directory that contains all the assets, but we need those assets to be in separate subdirectories. The first for loop takes care of that process, thanks to consistent file-naming conventions. Each file has a file name that is a prefix, representing the physical location of the analogue asset in the archives as represented by “Collection Number – Box Number – Folder Number – Date” which is followed by a suffix in the form of a sequence number representing the pages. This for loop looks at the prefix of the file, compares it to what is stored in the folder_name variable, and if it’s different, it stores the prefix in the folder_name variable, creates a new directory named identically to the prefix, and then moves the file in the iteration into the folder. In the next round of the for loop, if the prefix is identical to what is stored in folder_name, it simply moves the file in.

The prefix is used as part of a unique identifier in the asset metadata that we’ll be using to match up preservation and access copies later, but it’s not quite formed identically yet. The identifier in the metadata has the prefix plus a range of three digit sequence numbers reflecting how many pages total there are. For instance “D528_1_4_1864_02_28-001-002” represents an asset from collection D528, in box 1, folder 4, dated February 28, 1864, which contains two pages. Currently our folder names do not have the information reflecting the two pages. The next for loop counts how many files are in each folder and then renames the folder to append that page count information to the end of the folder name. This includes logic to handle how many leading zeroes should be included. If the for loop counts more than 99 files in the folder, it will not have any leading zeroes, but if it counts 9 or less, than there will be two leading zeroes.

create_bags_dir()

def create_bags_dir():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    date_time = vars[0]
    date_time = date_time.strip()
    container = vars[1]
    container = container.strip()
    bags_dir = 'bags_' + date_time
    os.mkdir(proj_path + '/' + container + '/' + bags_dir)
    project_log_hand.close()
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'a')
    project_log_hand.write(bags_dir + '\n')
    print('Created bags directory: {}'.format(bags_dir))
    project_log_hand.close()

This function starts with the standard call to “project_log.txt” and will ultimately add the third and final line of information to the text file, the name of the folder containing the access assets and metadata referred to as bags_dir. The function uses the information in the date_time variable to name the new subdirectory, creates the subdirectory, and writes the variable information into “project_log.txt”

Once the bags_dir directory is created the exported bags from Islandora should be copied over into the newly created folder.

extract_bags()

def extract_bags():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    bags_dir = vars[2]
    bags_dir = bags_dir.strip()
    num_bags = 0
    for file in os.listdir(path = proj_path + '/' + container + '/' + bags_dir):
        bdbag_api.extract_bag(proj_path + '/' + container + '/' + bags_dir + '/' + file, output_path = proj_path + '/' + container + '/' + bags_dir, temp=False)
        print('extracting bag: {}'.format(file))
        num_bags += 1
    for bag in os.listdir(path = proj_path + '/' + container + '/' + bags_dir):
        if bag.endswith('.zip'):
            print('removing zipped bag: {}'.format(bag))
            os.remove(proj_path + '/' + container + '/' + bags_dir + '/' + bag)
    print('Extracted {} bags'.format(str(num_bags)))
    project_log_hand.close()

At this point I’m going to stop referring to the fact that every function opens with the call to “project_log.txt”; you can assume each subsequent function will do this, pulling out the variables that represent the names of the container directory and the bags_dir directory as appropriate. I’m also going to skip mentioning that each function contains one or more counter variables to be used by a concluding print statement to summarize what the function did. Finally, many of the functions have print statements in the for loops themselves so that progress can be monitored as the function iterates through each file or directory.

As mentioned previously, Islandora exports digital assets as bags, per the BagIt specification. This specification was created by the Library of Congress to provide a means of ensuring that when one is transferring large amounts of data over long distances, you can be assured that what is being sent and what is received is identical. This is accomplished by a simple directory structure containing the assets and a set of text files that contain lists of all the included files (providing an inventory or manifest) as well as checksums for each file (so you can compare the fixity before and after transfer). These can either be zipped bags or unzipped bags, which is just turning the bag directory structure into a zip file/archive.

With that context, this function takes the zipped bags exported by Islandora, unzips them using the bdbag Python library, specifically the “bdbag_api.extract_bag(source path, output path) line of the code. This takes each bag, unzips it, and deposits it into the bags_dir directory.

The above for loop creates an unzipped version of the bag, but the zipped version is still hanging around. The second for loop looks for anything in bags_dir that has a “.zip” file extension and deletes it.

validate_bags()

def validate_bags():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    bags_dir = vars[2]
    bags_dir = bags_dir.strip()
    error_log_handle = open(proj_path + '/' + 'validation_error_log.txt', 'a')
    num_bags = 0
    for directory in os.listdir(path = proj_path + '/' + container + '/' + bags_dir):
        print('attempting to validate {}'.format(directory))
        num_bags += 1
        try:
            bdbag_api.validate_bag(proj_path + '/' + container + '/' + bags_dir + '/' + directory, fast = False)
        except BagValidationError:
            error_log_handle.write('Bag Validation Error | Directory: ' + directory + '\n')
        except BaggingInterruptedError:
            error_log_handle.write('Bagging Interruped Error | Directory: ' + directory + '\n')
        except RuntimeError:
            error_log_handle.write('Runtime Error | Directory: ' + directory + '\n')
    print('Validated {} bags'.format(str(num_bags)))
    error_log_handle.close()
    project_log_hand.close()

Now that we have all the access assets and metadata inside of unzipped bags in bags_dir, we next need to make sure that what we got from Islandora hasn’t been corrupted in transfer in any way. This for loop attempts to use the checksums in the bag text files to validate the assets each bag contains, again using the bdbag Python library, specifically the “bdbag_api.validate_bag(path-to-bag)” function. If any of three different types of errors are raised (BagValidationError, BaggingInterrupedError, or Runtime Error) then those get written to a new sidecar file called “validation_error_log” that is located in proj_path. Any bags that raise errors can then be worked with individually to see if a non-corrupt version can be pulled from Islandora.

create_id_ss()

def create_id_ss():
    wb = Workbook()
    ws = wb.active
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    bags_dir = vars[2]
    bags_dir = bags_dir.strip()
    project_log_hand.close()
    ws['A1'] = 'pres_file_name'
    ws['B1'] = 'acc_file_name'
    ws['C1'] = 'bag_id'
    pres_file_list = []
    for folder in os.listdir(path = proj_path + '/' + container):
        if folder.startswith('bags_'):
            continue
        else:
            pres_file_list.append(folder)
    bag_dict = dict()
    for bag in os.listdir(path =  proj_path + '/' + container + '/' + bags_dir):
        tree = ET.parse(proj_path + '/' + container + '/' + bags_dir + '/' + bag + '/data/MODS.xml')
        identifier = tree.find('{http://www.loc.gov/mods/v3}identifier').text
        bag_dict[identifier] = bag
    for item in pres_file_list:
        if item in bag_dict.keys():
            ws.append([item, item, bag_dict[item]])
        else:
            ws.append([item, '', ''])
    for item in bag_dict.keys():
        if item not in pres_file_list:
            ws.append(['', item, bag_dict[item]])
    wb.save('pres_acc_bag_ids.xlsx')
    print('Created pres_acc_bag_ids.xlsx')

This function introduces a new Python library to the mix, openpyxl, which is used to read, manipulate, and create Excel spreadsheets. Since many of these digitization projects were completed years ago, memories get hazy of what was done in the course of the projects and what assets may have been used in Islandora, and what ended up being left out. Digital Scholarship is providing a complete inventory of their work, which may or may not have a one-to-one corollary inside the exported bags from Islandora. So we need some way to compare the two lists to see how well they match up.

This function creates a spreadsheet object workbook (“wb = Workbook()”) and a worksheet in that workbook (“ws = wb.active”) that has three columns (a worksheet is one of the tabs at the bottom of an Excel file). The worksheet has three columns and the column headings are “pres_file_name”, “acc_file_name”, and “bag_id”. The first for loop iterates through the contents of container (skipping bags_dir) and creates a list of all the folder names that container contains.

Next up we create a blank dictionary and loop through all of the bags in bags_dir, specifically pulling out the MODS metadata record and looking at the <identifier> field, which should hopefully match the folder names mentioned in the previous paragraph. The loop uses the <identifier> information as the key in the dictionary and the folder name of the bag as the value.

Next up we need to write all this information into the spreadsheet so we can look it over in detail. The idea here is that in the first for loop, it first attempts to find a match between container folder names and tries to match them up to the MODS identifiers. If it can it adds a line to the spreadsheet with the folder name (representing the preservation assets) in the first column, the MODS <identifier> (representing the access assets/metadata) in the second column, and the bag directory name (also representing the access assets/metadata) to the third column.

If the first for loop can’t find a match, then it simply adds a line to the spreadsheet with the container folder name leaving the second and third columns blank. These represent preservation assets that Digital Scholarship provided that have no corollary in Islandora.

We still have one piece of the puzzle left though, those exported bags from Islandora that showed up without preservation masters from Digital Scholarship. The last for loop looks at each of the keys in the dictionary (which again are drawn from the MODS <identifier> fields) and if that key doesn’t show up in the list of container folder names, then it appends another line to the spreadsheet with the first column blank, the second column being the dictionary key (MODS <identifier>) and the third column being the dictionary value (bags_dir folder).

Finally it’s time to actually look over the spreadsheet and see what went wrong. In some cases there were assets that Digital Scholarship imaged that simply weren’t used in Islandora (in this particular collection, a number of envelopes). For some there were file-naming mismatches due to clerical error, or because a blank analogue (a physical piece of paper, or back of a piece of paper, with no writing on it) was imaged for the preservation masters but not included in the PDF which resulted in “prefix-001-002” and “prefix-001-003” being the same intellectual content but different file names. Going into depth here is likely not altogether useful, but the spreadsheet was analyzed, corrections were made, and the files/folders in container and bags_dir were manipulated so the scripts could continue.

representation_preservation()

def representation_preservation():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    folder_count = 0
    file_count = 0
    rep_pres = 'Representation_Preservation'
    for directory in os.listdir(path = proj_path + '/' + container):
        if directory.startswith('bags_'):
            continue
        path = proj_path + '/' + container + '/' + directory + '/' + rep_pres
        os.mkdir(path)
        folder_count += 1
        for file in os.listdir(path = proj_path + '/' + container + '/' + directory):
            if file == rep_pres:
                continue
            else:
                file_name = file.split('.')
                file_name = file_name[0]
                os.mkdir(path + '/' + file_name)
                shutil.move(proj_path + '/' + container + '/' + directory + '/' + file, path + '/' + file_name + '/' + file)
            file_count += 1
    print('Created {} Representation_Preservation directories | Moved {} files into created directories'.format(folder_count, file_count))
    project_log_hand.close()

All of the above functions involve setting the stage for what ideally should be a set of functions that can be run automatically in sequence rather than one at a time. With this folder we start to build the directory structure necessary for the PAX that will represent the contents of each digital object.

Throughout these remaining functions you’ll see reference to a path that looks like “proj_path + ‘/’ + container + ‘/’ + directory” which represents the working folder for each digital asset. The PAX structure requires that each of these directories has a “Representation_Preservation” and “Representation_Access” subdirectory, which are fairly self explanatory. Each of these “Representation_” folders then contains sub directories that contain a single file. “Representation_Preservation” will have a series of subdirectories that each contain a single TIFF. “Representation_Access” will contain a single subdirectory with a single PDF. If this is sounding confusing I would again recommend reading Smith’s blog post for an overview on PAX/OPEX.

The first for loop here makes sure to skip the bags_dir and then creates the new “Representation_Preservation” subdirectory. The second for loop then looks at all the files in a given directory, stores the file name (minus the extension) in a variable, and creates a subdirectory for each file, before moving the corresponding file into the newly created folder.

process_bags()

def process_bags():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    bags_dir = vars[2]
    bags_dir = bags_dir.strip()
    num_bags = 0
    error_log_str = ''
    error_log_handle = open(proj_path + '/' + 'validation_error_log.txt', 'r')
    error_log = error_log_handle.read()
    for line in error_log:
        error_log_str = error_log_str + line
    for directory in os.listdir(path = proj_path + '/' + container + '/' + bags_dir):
        #skips any directories that raised errors during validation
        if error_log_str.find(directory) != -1 :
            continue
        else:
            print('attempting to revert bag: {}'.format(directory))
            obj_file_name = ''
            #converts the bags back into normal directories, removing bagit and manifest files
            bdbag_api.revert_bag(proj_path + '/' + container + '/' + bags_dir + '/' + directory)
            #removes unnecessary files generated by Islandora
            unneccesary_files = ['foo.xml', 'foxml.xml', 'JP2.jp2', 'JPG.jpg', 'POLICY.xml', 'PREVIEW.jpg', 'RELS-EXT.rdf', 'RELS-INT.rdf', 'TN.jpg', 'HOCR.html', 'OCR.txt', 'MP4.mp4', 'PROXY_MP3.mp3']
            for file in os.listdir(path = proj_path + '/' + container + '/' + bags_dir + '/' + directory):
                if file in unneccesary_files:
                    os.remove(proj_path + '/' + container + '/' + bags_dir + '/' + directory + '/' + file)
                if re.search('^OBJ', file):
                    obj_file_name = file
                    extension = obj_file_name.split('.')
                    extension = extension[1]
                    extension = extension.strip()
                elif re.search('^PDF', file):
                    obj_file_name = file
                    extension = obj_file_name.split('.')
                    extension = extension[1]
                    extension = extension.strip()
            #use xml.etree to identify filename from MODS.xml
            tree = ET.parse(proj_path + '/' + container + '/' + bags_dir + '/' + directory + '/MODS.xml')
            identifier = tree.find('{http://www.loc.gov/mods/v3}identifier').text
            #rename the OBJ file to original filename pulled from MODS.xml
            os.rename(proj_path + '/' + container + '/' + bags_dir + '/' + directory + '/' + obj_file_name, proj_path + '/' + container + '/' + bags_dir + '/' + directory + '/' + identifier + '.' + extension)
        num_bags += 1
    error_log_handle.close()
    print('Processed {} bags'.format(str(num_bags)))
    project_log_hand.close()

The bags exported by Islandora need a bit more massaging to get them ready for their “Representation_Access” folder. This function starts off by opening the “validation_error_log.txt” file created in validate_bags() and puts all the contents into a string variable. Then as the bags in bags_dir are looped through, if any of the bags show up in the error log string, they are skipped.

Next we use the bdbag Python library to revert the bag into a simple directory, of files, removing the sidecar files (containing the manifests and checksums) and bag directory structure. After reverting the bags, a for loop is called to look at the contents of the reverted bags and make a decision about what to do with each file. Islandora exports a lot of files in the bags that are only relevant to Islandora which can be deleted, or things like thumbnails that aren’t necessary since Preservica will create those for us. The list of these files that can be deleted is stored in the “unnecessary_files” variable, and if a file is in the list, it gets deleted.

Next up in the conditional logic we look use an incredibly simple regular expression to look for any files that are either named “OBJ” or “PDF”. We then store the file name in one variable and the file extension in a second variable. We’re going to be renaming these files and now we have the “before” for the os.rename function.

Now we find the “after” by looking at the MODS record XML, finding our previously used <identifier> field and storing it in a variable. Next up we rename the access asset from either “OBJ.pdf” or “PDF.pdf” to the contents of <identifier> plus “.pdf” (which was stored in a variable in the above paragraph for this step).

representation_access()

def representation_access():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    bags_dir = vars[2]
    bags_dir = bags_dir.strip()
    folder_count = 0
    rep_acc = 'Representation_Access'
    for directory in os.listdir(path = proj_path + '/' + container):
        if directory.startswith('bags_'):
            print('bags_ folder found - skipped')
        else:
            path = proj_path + '/' + container + '/' + directory + '/' + rep_acc
            os.mkdir(path)
            print('created {}'.format(path))
        folder_count += 1
    print('Created {} Representation_Access directories'.format(folder_count))
    project_log_hand.close()

This function is the corollary to representation_preservation(), now looping through the directories in contain to create a “Representation_Access” folder so that the access assets prepared in process_bags() can then be moved into the newly created “Representation_Access” folder.

access_id_path()

def access_id_path():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    bags_dir = vars[2]
    bags_dir = bags_dir.strip()
    access_id_hand = open(proj_path + '/' + 'access_ids.txt', 'a')
    access_count = 0
    for directory in os.listdir(path = proj_path + '/' + container + '/' + bags_dir):
        tree = ET.parse(proj_path + '/' + container + '/' + bags_dir + '/' + directory + '/MODS.xml')
        identifier = tree.find('{http://www.loc.gov/mods/v3}identifier').text
        access_id_hand.write(identifier + '|')
        path = proj_path + '/' + container + '/' + bags_dir + '/' + directory
        access_id_hand.write(path + '\n')
        access_count += 1
        print('logged {} and {}'.format(identifier, path))
    print('Logged {} paths and identifiers in access_ids.txt'.format(access_count))
    project_log_hand.close()
    access_id_hand.close()

So now our preservation assets are in the directories they should be in to create a well-formed PAX, but our access assets are still in bags_dir and in need of being merged, but as of yet we don’t have a mapping between the two to merge them. This function opens up a new sidecar text file called “access_ids.txt” that will hold a set of identifiers that are present in both the preservation masters as well as in the bags, and also the path to the unzipped bags.

The function then loops through each subdirectory in bags_dir and writes to “access_ids.txt” the identifier found in the MODS XML record in the <identifier> field, as well as the path to the current subdirectory in the iteration. This is formatted as a pipe (“|”) delimitated text file.

merge_access_preservation()

def merge_access_preservation():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    bags_dir = vars[2]
    bags_dir = bags_dir.strip()
    access_id_hand = open(proj_path + '/' + 'access_ids.txt', 'r')
    access_id_list = access_id_hand.readlines()
    rep_acc = 'Representation_Access'
    file_count = 0
    for directory in os.listdir(path = proj_path + '/' + container):
        if directory.startswith('bags_'):
            continue
        else:
            for line in access_id_list:
                print('merging {} and {}'.format(directory,line))
                access_info = line.split('|')
                identifier = access_info[0]
                identifier = identifier.strip()
                path = access_info[1]
                path = path.strip()
                if identifier == directory:
                    for file in os.listdir(path = path):
                        if file.endswith('.xml'):
                            shutil.move(path + '/' + file, proj_path + '/' + container + '/' + directory + '/' + file)
                            file_count += 1
                        else:
                            file_name = file.split('.')
                            file_name = file_name[0]
                            os.mkdir(proj_path + '/' + container + '/' + directory + '/' + rep_acc + '/' +  file_name)
                            shutil.move(path + '/' + file, proj_path + '/' + container + '/' + directory + '/' + rep_acc + '/' + file_name + '/' + file)
                            file_count += 1
    print('Moved {} access and metadata files'.format(file_count))
    project_log_hand.close()
    access_id_hand.close()

This is a bit of a tortured function, but I couldn’t figure out a better way to write it. This function opens up the previously created “access_ids.txt” and reads all of the lines into a list using the file_handle.readlines() function. First we start by looping through every directory in container (first iteration), as usual skipping bag_dir when it shows up.

Now, as we loop through each directory in container, we then loop through the entire list of lines from “access_ids.txt” (second iteration), splitting each line and turning the identifier and path to the access assets into their own variables again. As we loop through these, if the identifier is identical to the name of the directory we are currently iterating through the next for loop triggers.

This next iteration (third iteration) identifies all of the metadata files (which end in “.xml”) and moves them over to the directory from our first iteration. Then it also takes the access asset itself, creates the subdirectory inside “Representation_Access” that is equivalent to the file name, and then moves the access asset into this newly created subdirectory inside “Representation_Access”.

I worry that these successive layers of nested iteration, and the fact that for each directory in container the entire “access_ids.txt” file needs to be read through, will make for a slow process once larger data sets are being used but I don’t necessarily see an alternative.

cleanup_bags()

def cleanup_bags():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    bags_dir = vars[2]
    bags_dir = bags_dir.strip()
    shutil.rmtree(proj_path + '/' + container + '/' + bags_dir)
    os.remove('access_ids.txt')
    print('Deleted "{}" directory and access_ids.txt'.format(bags_dir))
    project_log_hand.close()

This next function is a much simpler one, now cleaning up the staging area that we used to process the digital assets from Islandora. First we delete bags_dir and all of its contents, then we delete the now unnecessary “access_ids.txt” file.

pax_metadata()

def pax_metadata():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    dir_count = 0
    for directory in os.listdir(path = proj_path + '/' + container):
        opex1 = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><opex:OPEXMetadata xmlns:opex="http://www.openpreservationexchange.org/opex/v1.0"><opex:Properties><opex:Title>'
        tree = ET.parse(proj_path + '/' + container + '/' + directory + '/DC.xml')
        root = tree.getroot()
        opex2 = tree.find('{http://purl.org/dc/elements/1.1/}title').text
        opex3 = '</opex:Title><opex:Identifiers>'
        id_list = []
        opex4 = ''
        for id in root.findall('{http://purl.org/dc/elements/1.1/}identifier'):
            id_list.append(id.text)
        for item in id_list:
            if item.startswith('ur'):
                opex4 += '<opex:Identifier type="code">' + item + '</opex:Identifier>'
            else:
                other_identifiers = item.split(':')
                label = other_identifiers[0]
                label = label.strip()
                value = other_identifiers[1]
                value = value.strip()
                opex4 += '<opex:Identifier type="' + label + '">' + value + '</opex:Identifier>'
        opex5 = '</opex:Identifiers></opex:Properties><opex:DescriptiveMetadata><LegacyXIP xmlns="http://preservica.com/LegacyXIP"><AccessionRef>catalogue</AccessionRef></LegacyXIP>'
        opex6 = ''
        for file in os.listdir(path = proj_path + '/' + container + '/' + directory):
            if file.endswith('.xml'):
                temp_file_hand = open(proj_path + '/' + container + '/' + directory + '/' + file, 'r')
                lines = temp_file_hand.readlines()
                for line in lines:
                    opex6 += line
                temp_file_hand.close()
        opex7 = '</opex:DescriptiveMetadata></opex:OPEXMetadata>'
        filename = directory + '.pax.zip.opex'
        pax_md_hand = open(proj_path + '/' + container + '/' + directory + '/' + directory + '.pax.zip.opex', 'a')
        pax_md_hand.write(opex1 + opex2 + opex3 + opex4 + opex5 + opex6 + opex7)
        pax_md_hand.close()
        print('created {}'.format(filename))
        dir_count += 1
    print('Created {} OPEX metdata files for individual assets'.format(dir_count))
    project_log_hand.close()

Each digital asset gets packaged as a PAX object, which is a zip archive, and has a corresponding OPEX metadata file that corresponds with it. The OPEX file that corresponds to each PAX is in particular a fairly complex one, as can be seen by the size of this function, which creates the XML file. The overall strategy of this entire function is build the different chunks of XML using different for loops, then take all of the variables storing these strings, concatenate them, and write them into a file with a suffix of “.pax.zip.opex”. All of the PAX zip archives have a suffix of “.pax.zip” and opex metadata files are recognized by a file extension of “.opex” hence “.pax.zap.opex”.

The opex1 variable contains a static string, the XML file headers and the opening tag for the XIP title element.

The opex2 variable contains the title of the digital asset. This is found by opening the Dublin Core metadata XML file, and finding the <title> field.

The opex3 variable contains a static string, the closing title tag, and the opening of the identifiers tag.

The opex4 variable is built by looping through the aforementioned Dublin Core XML file, in this case looking for values in the <identifier> field. In this case there should be at least one, but can be more than that, so each value is appended to the “id_list” variable. Once the list is created then the identifiers, surrounded by static values for their opening and closing tags, are concatenated onto the opex4 string variable.

The only exception to this process is that the Islandora ID (which is generally structured as “ur:#####” where the hash signs indicate an integer) will always have an XML attribute type of “code”, whereas the other identifiers will have the label from the Dublin Core metadata used as the XML attribute type. Having the Islandora identifier always be labelled as “code” has to do with the ArchivesSpace – Preservica integration, as it requires a unique identifier to initialize, and that unique identifier must have a label of “code”.

The opex5 variable is a static string, which contains the closing tags for the Identifiers and Properties sections of the metadata, and the opening tag for the Descriptive Metadata portion, which also includes a small XIP metadata line: “<LegacyXIP xmlns=”http://preservica.com/LegacyXIP”><AccessionRef>catalogue</AccessionRef></LegacyXIP>&#8221; which again is necessary for the ArchivesSpace – Preservica integration.

The opex6 variable works with another for loop, this time looking for extra metadata files in the directory (identified by an “.xml” file extension) and adding those to the OPEX metadata file as well. This allows you to add multiple different metadata fragments in the same XML document, allowing the Preservica XIP, MODS, DC, and FITS, all to be added at the same time. In this case each file is opened, all of the lines are read into a list, and then each line concatenated with the opex6 variable.

The opex7 variable is a static string holding the closing tags for the Descriptive Metadata section and the entire OPEX XML document.

Finally, the script opens up a new file named after the current directory in the top level for loop, and an extension of “.pax.zip.opex”, writes opex1 through opex7 to the file, and closes it, before moving to the next directory in container.

stage_pax_content()

def stage_pax_content():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    project_log_hand.close()
    pax_count = 0
    rep_count = 0
    for directory in os.listdir(path = proj_path + '/' + container):
        print(directory)
        os.mkdir(proj_path + '/' + container + '/' + directory + '/pax_stage')
        pax_count += 1
        shutil.move(proj_path + '/' + container + '/' + directory + '/Representation_Access', proj_path + '/' + container + '/' + directory + '/pax_stage')
        shutil.move(proj_path + '/' + container + '/' + directory + '/Representation_Preservation', proj_path + '/' + container + '/' + directory + '/pax_stage')
        rep_count += 2
    print('Created {} pax_stage subdirectories and staged {} representation subdirectories'.format(pax_count, rep_count))

As we get closer to getting the PAX ready to be created in earnest, first we stage the contents so that they can be written into a zip file more easily. Since each directory that gets iterated through in container at this point has a folder “Representation_Preservation” and “Representation_Access” which will ultimately go into the PAX, a new staging folder called “pax_stage” is created, and both “Representation_” directories are moved into it.

create_pax()

def create_pax():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    project_log_hand.close()
    dir_count = 0
    for directory in os.listdir(path = proj_path + '/' + container):
        zip_dir = pathlib.Path(proj_path + '/' + container + '/' + directory + '/pax_stage/')
        pax_obj = ZipFile(proj_path + '/' + container + '/' + directory + '/' + directory + '.zip', 'w')
        for file_path in zip_dir.rglob("*"):
            pax_obj.write(file_path, arcname = file_path.relative_to(zip_dir))
        pax_obj.close()
        os.rename(proj_path + '/' + container + '/' + directory + '/' + directory + '.zip', proj_path + '/' + container + '/' + directory + '/' + directory + '.pax.zip')
        dir_count += 1
        zip_file = dir_count + ': ' + directory + '.pax.zip'
        print('created {}'.format(zip_file))
    print('Created {} PAX archives for ingest'.format(dir_count))

I have prided myself generally thus far on using the Python docs relevant to all these modules to write all this code, but when it came to creating the Zip archives for whatever reason I sort of hit a brick wall. This function uses the Python ZipFile library to add the “Representation_” folders (and their contents) to a zip archive, but I don’t claim to have a good grasp on exactly how it does it. This was a script-kiddie moment for me. After a lot of online searching, this article by Leodanis Pozo Ramos on the Real Python site provided me something I could work with.

The pathli.Path and zip_dir.rglob(“*”) section of this script I only sort of understand how they work. The “zip_dir” variable holds the path to our newly created staging directory, “/pax_stage”. We open up a new zip object, set it so it can be written to, and hold that in the “pax_obj” variable. Then everything that is held in the zip_dir variable (representing the path to “pax_stage”) is written into the zip archive. Finally we rename the archive so that instead of the directory plus “.zip” it is renamed to the directory plus “.pax.zip”.

cleanup_directories()

def cleanup_directories():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    file_count = 0
    dir_count = 0
    unexpected = 0
    for directory in os.listdir(path = proj_path + '/' + container):
        for entity in os.listdir(path = proj_path + '/' + container + '/' + directory):
            if entity.endswith('.zip') == True:
                print('PAX: ' + entity)
            elif entity.endswith('.opex') == True:
                print('metadata: ' + entity)
            elif entity.endswith('.xml') == True:
                os.remove(proj_path + '/' + container + '/' + directory + '/' + entity)
                file_count += 1
                print('removed metadata file')
            elif os.path.isdir(proj_path + '/' + container + '/' + directory + '/' + entity) == True:
                shutil.rmtree(proj_path + '/' + container + '/' + directory + '/' + entity)
                dir_count += 1
                print('removed pax_stage directory')
            else:
                print('***UNEXPECTED ENTITY: ' + entity)
                unexpected += 1
    print('Deleted {} metadata files and {} Representation_Preservation and Representation_Access folders'.format(file_count, dir_count))
    print('Found {} unexpected entities'.format(unexpected))
    project_log_hand.close()

We’re getting close to the end here. At this point every directory inside of container has an OPEX metadata file, a zipped PAX archive, the “pax_stage” directory, and whatever other files that weren’t the actual digital assets themselves. Mostly these should be MODS, DC, and FITS metadata files. Time to clean up all this extraneous stuff we no longer need.

This set of for loops identifies the PAX archive (but does nothing to it), the OPEX file (but does nothing to it), deletes anything with an “.xml” file extension, looks for a directory structure (“pax_stage”) and deletes it, and then prints out anything that doesn’t fit into the above framework.

Ideally the process_bags() function will have gotten rid of any extra files in the bags exported by Islandora that we don’t need, but this print statement is meant to catch any files that made it through that process that might need to be deleted or incorporated somehow.

ao_opex_metadata()

def ao_opex_metadata():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    project_log_hand.close()
    file_count = 0
    id_hand = open(proj_path + '/' + 'mcgraw_aonum_islid.txt', 'r')
    id_list = id_hand.readlines()
    id_hand.close()
    for directory in os.listdir(path = proj_path + '/' + container):
        opex_hand = open(proj_path + '/' + container + '/' + directory + '/' + directory + '.pax.zip.opex', 'r')
        opex_str = opex_hand.read()
        opex_hand.close()
        ao_num = ''
        for line in id_list:
            ids = line.split('|')
            aonum = ids[0]
            aonum = aonum.strip()
            isnum = ids[1]
            isnum = isnum.strip()
            if opex_str.find(isnum) != -1:
                ao_num = aonum
                print('found a match for {} and {}'.format(aonum, isnum))
        opex = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><opex:OPEXMetadata xmlns:opex="http://www.openpreservationexchange.org/opex/v1.0"><opex:Properties><opex:Title>' + ao_num + '</opex:Title><opex:Identifiers><opex:Identifier type="code">' + ao_num + '</opex:Identifier></opex:Identifiers></opex:Properties><opex:DescriptiveMetadata><LegacyXIP xmlns="http://preservica.com/LegacyXIP"><Virtual>false</Virtual></LegacyXIP></opex:DescriptiveMetadata></opex:OPEXMetadata>'
        ao_md_hand = open(proj_path + '/' + container + '/' + directory + '/' + ao_num + '.opex', 'a')
        ao_md_hand.write(opex)
        ao_md_hand.close()
        os.rename(proj_path + '/' + container + '/' + directory, proj_path + '/' + container + '/' + ao_num)
        file_count += 1
    print('Created {} archival object metadata files'.format(file_count))

Naturally, we’re not quite done creating OPEX metadata just yet. Since this process is leveraging the ArchivesSpace – Preservica integration, the OPEX metadata and PAX archive both need to be located inside a folder that is going to map over to ArchivesSpace and is a key part of what allows that integration to run. Thankfully this time it won’t involve concatenating 7 different variables to make the OPEX file.

Also, there is one area I don’t have a good answer to: mapping the archival object number from Preservica to the identifier from Islandora. I manually made my way through ArchivesSpace copying archival object numbers and the corresponding Islandora identifiers (based on links from associated ArchivesSpace digital objects) into a text file to create that mapping. Since this test collection is small, it only took about an hour, but is definitely something I’m hoping to automate in the future. This file was called “mcgraw_aonum_islid.txt”

Now the function first opens this file of archival object numbers and Islandora ids and reads the contents into a list. Next we start looping through directories in container just as we have been for most of these functions. Next we open the directory’s OPEX file and read the contents into a variable. Now for each directory we loop through the contents of the “mcgraw_aonum_islid.txt” list looking for a match between the Islandora identifier (stored in one of the OPEX identifiers mentioned in pax_metadata(), specifically the one labelled “code”), and try to match it to the contents of the string containing the OPEX metadata.

Once a match is found, the archival object number is written to a variable, and then inserted into some OPEX metadata, specifically in the <title> and <identifier> fields, once again the <identifier> must have a label of “code”.

Finally, a file is opened and all of the OPEX XML data is written into it. While the directories in container have generally been named following the “Collection_Box_Folder_Date-Sequence-Number” format thus far, now they need to be renamed to the archival object number. As you might guess this is for the ArchivesSpace – Preservica synchronization to work correctly. For OPEX incremental ingest, the metadata about a directory is included inside the directory, with the same name of the directory and an “.opex” file extension.

write_opex_container_md()

def write_opex_container_md():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    opex1 = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><opex:OPEXMetadata xmlns:opex="http://www.openpreservationexchange.org/opex/v1.0"><opex:Transfer><opex:Manifest><opex:Folders><opex:Folder>'
    opex2 = container
    opex3 = '</opex:Folder></opex:Folders></opex:Manifest></opex:Transfer></opex:OPEXMetadata>'
    container_opex_hand = open(proj_path + '/' + container + '/' + container + '.opex', 'w')
    container_opex_hand.write(opex1 + opex2 + opex3)
    print('Created OPEX metadata file for {} directory'.format(container))
    project_log_hand.close()
    container_opex_hand.close()

The last function in this series is to create another folder OPEX metadata file, this time for the entire container folder instead of just a single digital asset. In this case we sandwich the folder container name inside the OPEX <Transfer><Manifest><Folders><Folder> tags. This small OPEX file is then concatenated and written to our last OPEX file.

At this point, the container should be ready to drop into the OPEX incremental staging area and you can take a nap after all your hard work.


Adding Images to Wikipedia Articles via DPLA

DPLA is engaged in a fantastic project to integrate records harvested by DPLA and upload those rights-free materials to Wikimedia Commons so that they might be exposed to a wider audience that they might otherwise receive. For more detailed information on this project as a whole there is a recording of a conference presentation given on that topic at the ODNFest 2020 that can be useful. This post however is to provide information on the nitty-gritty of what buttons one has to click in order to get an image in your local image repository into a Wikipedia article (with lots and lots of screenshots).

This post assumes that your organization and DPLA Content or Service Hub has already gone through the work of ingesting your content into Wikimedia Commons. This outlines the nitty-gritty of what to do next.

If you are doing this work as someone responsible for a repository that is contributing to DPLA, there is a good chance actually searching for images inside your repository is the easiest way for you to go about the work since you are more familiar with how records are structured and organized in that environment. So say you find a photograph in a digitized local history book and think it’s a good candidate for inclusion in a Wikipedia article.

Finding the Image in Wikimedia Commons

Viewing the specific page of the book in Ohio Memory (powered by CONTENTdm)

For instance a picture of an explosive powder company in Toledo. First you need to find the image in Wikimedia Commons. Search the title of the book in Wikimedia Commons to find all the images associated with it. Since the ingest process from your local repository to Wikimedia Commons breaks complex digital objects (such as books) into many single simple digital objects, in this case we’ll get a number of results.

Searching for the book’s title in Wikimedia Commons

The ingest process into Wikimedia Commons creates wonderfully consistent file names which are reflected in the page URL. I’ve found the easiest way to find the specific page needed is to simply open any of the pages associated with the book being searched for in the Wikimedia Commons results page and alter the URL to get the page needed. In the first screenshot above we see that the page with our explosives company on it is page 37 in the CONTENTdm record, and the first entry in the Wikimedia Commons results is the same book but page 53.

The entire URL for the book page in Wikimedia Commons
Zooming in on the last (relevant) portion of the URL
https://commons.wikimedia.org/wiki/File:Year_book_-_photo_flashes_showing_Toledo%27s_phenomenal_progress,_thriving_industries_and_wonderful_resources_-_DPLA_-_ac95c5ef8efd2394c21e2b6edcd01d94_(page_53).jpg

Simply adjust the last portion of that URL from “(page_53).jpg” to “(page_37).jpg” will bring up the desired page quickly and easily without sorting through Wikimedia Common’s results page (there also may be a simpler way to do this using advanced search in Wikimedia Commons but this is easy enough that I haven’t bothered investigating).

The item record page for book’s page in Wikimedia Commons

If searching for the item by title brings up a long list of irrelevant search results in Wikimedia Commons, there is an alternate solution for that problem.

Editing Images for Articles

But we don’t want to insert this entire book page into an article, just the image on it. So we’ll need a utility that can be enabled in Wikimedia Commons called the Crop Tool. Once you are registered with a Wikimedia account (which is shared across Wikimedia Commons, Wikipedia, and other sites) click on “Preferences” in the top right corner of the screen, and then “Gadgets” in the tabbed menus in the center of the screen.

The Gadgets section of the Preferences Page for a user in Wikimedia Commons

In the “Interface: Editing and uploads” section, select the checkbox for “Crop Tool.”

The Interface: Editing and uploads section of the with the relevant Crop Tool highlighted

Now returning to our Wikimedia Commons record, we’ll see that “Crop Tool” shows up in the list of Tools on the left column of the screen. Click on your newly enabled tool to create a derivative image from the full page that was originally uploaded.

The Crop Tool now showing up in the sidebar of Wikimedia Commons

This then bring up a simple image editing tool that allows for crop and rotation of images. Simply drag the box around the portion of the image you want to crop into and hit preview. If you would like to rotate the image enter in a number of degrees into the relevant area which will rotate the image clockwise. Negative values are supported as well.

The editing interface for an image in Wikimedia Commons

The preview page allows you to review your work and edit the file name if necessary. The cropped image has an identical file name to the original with the addition of a new final element concatenated to the end: “(cropped).” In this case we can see a red-warning message letting us know that the file-name is not unique in Wikimedia Commons (this is an already completed example.)

The preview screen for an edited image in Wikimedia Commons, with options for file naming

However say you have a book or composite photograph from which you want to crop multiple images; simply add a sequence number to the end of the file name in order complete the process.

Adding a sequence number to a filename to differentiate it

Finally always ensure the radio button for “Upload as new file” is selected; we never want to overwrite previously existing objects in Wikimedia Commons. Now we see our newly cropped image, ready to be inserted into a Wikipedia article

The edited derivative image in Wikimedia Commons

The file name at the the top of the screen is what you’ll need to insert the image into the article so copying that now is a good idea. You’ll note this includes that prefix of “File:”

File:Year book - photo flashes showing Toledo's phenomenal progress, thriving industries and wonderful resources - DPLA - ac95c5ef8efd2394c21e2b6edcd01d94 (page 37) (cropped).jpg

Filenames in Wikimedia Commons

In the process of importing files over into Wikimedia Commons, all of the digital assets will have new filenames created for them. It’s worthwhile to analyze these briefly as they embed a great deal of useful information.

File:Year book - photo flashes showing Toledo's phenomenal progress, thriving industries and wonderful resources - DPLA - ac95c5ef8efd2394c21e2b6edcd01d94 (page 37) (cropped).jpg

The first section of the filename is the “File:” prefix that identifies this string of characters as a filename in Wikitext.

File:

Next up is the title cross-walked from the metadata in the IIIF Manifest:

File:Year book - photo flashes showing Toledo's phenomenal progress, thriving industries and wonderful resources

Next we have a dash separating the different sections of the file name followed by “DPLA” to indicate where the file originated from.

File:Year book - photo flashes showing Toledo's phenomenal progress, thriving industries and wonderful resources - DPLA -

After that we have another dash followed by the unique identifier assigned by DPLA when the record was harvested.

File:Year book - photo flashes showing Toledo's phenomenal progress, thriving industries and wonderful resources - DPLA - ac95c5ef8efd2394c21e2b6edcd01d94

If the original digital object in your local repository was a complex digital object that has since been transformed into a plethora of simple digital objects in Wikimedia Commons, that will be indicated by a page number in parantheses.

File:Year book - photo flashes showing Toledo's phenomenal progress, thriving industries and wonderful resources - DPLA - ac95c5ef8efd2394c21e2b6edcd01d94 (page 37)

Finally, if you use the aforementioned process to crop and edit images directly in Wikimedia Commons, the fact that the image is a derivative will be indicated by “(cropped)” and at the very end.

File:Year book - photo flashes showing Toledo's phenomenal progress, thriving iustries and wonderful resources - DPLA - ac95c5ef8efd2394c21e2b6edcd01d94 (page 37) (cropped)

Finally we have the file format extension, in this case, JPEG with an extension of “.jpg”.

File:Year book - photo flashes showing Toledo's phenomenal progress, thriving industries and wonderful resources - DPLA - ac95c5ef8efd2394c21e2b6edcd01d94 (page 37) (cropped).jpg

Even if the complex object being ingested wasn’t actually a book, for instance a collection of photographs, the resulting filenames in Wikimedia Commons will still be differentiated as “pages.”

File:1630 Broadway Street, exterior views, 2019 - DPLA - 9923f13b736689f90ff60e4adf4005b9 (page 1).jpg

Finally, for the sake of completeness, here is an example of a simple digital object file name imported into Wikimedia Commons that has not been editted:

File:Manhattan Iron Works Company floor men - DPLA - c9e6a11fe7b4ad2de073adda9b261dd8.jpg

Inserting Images in Articles

A photograph of an explosive powder company certainly seems suitable to an article about explosives, and luckily Wikipedia has just such an article:

https://en.wikipedia.org/wiki/Explosive

This particular article contains a history section that would benefit from an image. There are two ways to insert images into Wikipedia articles, using the Visual Editing method (utilizing a “What You See Is What You Get” (WYSIWYG) style editor) and the Source Editing method, which edits the particular flavor of markdown text used by Wikipedia known as Wikitext. Both options will be covered in detail.

Visual Editing

To begin the editing process, click on the “Edit” tab in the top right of the screen in order to bring up the visual editing interface.

The Wikipedia article about explosives to which an image will be added

The page will look pretty similar but now there will be a toolbar on the top of the article that has all of the editing utilities. First place your cursor generally where you want the image to appear in the article, and then select “Images and media” from the “Insert” dropdown menu.

The Visual Editing interface highlighting the option to insert Images and media

This will bring up a search box that will allow you to search for the specific cropped image, the file name which we had noted previously.

Searching for an image in Wikimedia Commons to insert into the article

Click on the image to select the one you want to insert into the article. You don’t have to wait for the search to complete before selecting the image.

Selecting an image in Wikimedia Commons to insert into the article

Click on the blue “Use this images” button in the top right of the “Media settings” box.

Adding a caption and alt text to an image while inserting

From here you can add a caption for the image that will show up in the article as well as alt-text for the image to increase accessibility. The “Advanced” tab in the “Media settings” box allows you to customize how the image will be displayed in the article. The standard settings (which is what you’ll see for most images appearing in articles) is for the image to float over to the right of the page, the size of a thumbnail with article text flowing around the image. Until you get more comfortable with this process, sticking with the standard appearance settings is a good way to go.

Changing the position, size, and formatting of how an inserted image will appear

We now see the image inserted into the article, though we are still in the Visual Editing mode, the image hasn’t actually been added to the article. Click on the blue “Publish changes” button in the top right of the screen to actually finalize the edit. All edits require a reason for why the change is being made in order to document the work being done on the article. If you check the “Watch this page” option, the article will be added to a shortcuts list which is a handy way of gathering up all the articles being edited to track your work.

Adding a comment to a change to a Wikipedia article to explain the update

We now can see the image finally and officially placed in the article.

Viewing the inserted image in a Wikipedia article

Source Editing

Editing images via the Source Editing option is very similar to the visual editing process; you’ll go through all the same steps you just do them in a slightly different fashion. To enable this mode, click on the “Edit” button in the top right of the screen as before, but this time click on the pencil-looking icon to change editing modes from the resulting dropdown menu.

Switching from Visual Editing to Source Editing in Wikipedia

This brings up a pretty dramatically different view. If you’ve got some basic HTML editing experience and created a website in school from scratch, this will probably look vaguely familiar. As previous mentioned the formatted text is actually Wikitext, a specific formatting language for creating rich text. The only aspects that we’re going to care about however are how to go about inserting images and not the many, many other options.

Seeing the article laid out as Wikitext

One of the advantages to editing with this method is precision. Sometimes when inserting images using the visual method you can put your cursor in one place and see ultimately that the image gets shoved far further down the page that you expected. Looking at the source-view allows you to see where the various elements are showing up in precisely the order in which they will be displayed on the page. Scrolling down to the history section of our previously uploaded image we can see how this is displayed in the Source Editing view.

Adding the Wikitext necessary to insert an image into a Wikipedia article

The relevant text we care about is:

[[File:Year book - photo flashes showing Toledo's phenomenal progress, thriving industries and wonderful resources - DPLA - ac95c5ef8efd2394c21e2b6edcd01d94 (page 37) (cropped).jpg|thumb|right|The Great Western Powder Company of Toledo, Ohio, a producer of explosives, seen in 1905]]

First thing to note, the entire image object needs to be bounded in double brackets (“[[” and “]]”) before and after all the information you include. Inside those double brackets need to be a minimum of three things. First you’ll see the file name for the object beginning with the “File:” prefix that we got from the record in Wikimedia Commons.

[[File:Year book - photo flashes showing Toledo's phenomenal progress, thriving industries and wonderful resources - DPLA - ac95c5ef8efd2394c21e2b6edcd01d94 (page 37) (cropped).jpg

All the qualifiers in this string are separated by pipes (“|”, a fairly uncommonly used keyboard character.) Next we add a qualifier specifying how large we’d like the image to be, in this case using “thumb” to indicate a thumbnail sized image.

[[File:Year book - photo flashes showing Toledo's phenomenal progress, thriving industries and wonderful resources - DPLA - ac95c5ef8efd2394c21e2b6edcd01d94 (page 37) (cropped).jpg|thumb

Next up we want to add a qualifier indicating where on the page the image should be located. Leaving this element out is an option, as the default assumption is that images will float to the right side of the page as we saw in the Visual Editing portion above.

[[File:Year book - photo flashes showing Toledo's phenomenal progress, thriving industries and wonderful resources - DPLA - ac95c5ef8efd2394c21e2b6edcd01d94 (page 37) (cropped).jpg|thumb|right

Finally we add one more pipe to this Wikitext block followed by the caption we would like to see displayed underneath the image. Don’t forget to put the closing double brackets at the end of the block as that will definitely screw things up on the page otherwise.

[[File:Year book - photo flashes showing Toledo's phenomenal progress, thriving industries and wonderful resources - DPLA - ac95c5ef8efd2394c21e2b6edcd01d94 (page 37) (cropped).jpg|thumb|right|The Great Western Powder Company of Toledo, Ohio, a producer of explosives, seen in 1905]]

Once we’ve entered in the information into the editing box, we scroll to the bottom of the page to enter in our comment for why we are making this edit, have the option to add the page to our watched pages, and can commit the edit to be publicly visible. I highly recommend using the “Show preview” option to see how the change will appear as this will prevent a lot of simple mistakes from accidentally being pushed live.

Adding a comment to a change to a Wikipedia article to explain the update

Once comfortable constructing those Wikitext strings, I’ve found this to be a much faster and simpler way to add images instead of using the Visual Editing option.

Another Way to Find Files in Wikimedia Commons

Finding a title like “Year book – photo flashes showing Toledo’s phenomenal progress, thriving industries and wonderful resources” as a simple search in Wikimedia Commons is pretty simple as it’s a nice long title with a high chance of being unique. What if you are looking for something shorter and more generic though, such as “Geography of Ohio”? In the search below, the book I’m looking for doesn’t even turn up in the first page of results, much less the first item.

Searching for the “Geography of Ohio” in Wikimedia Commons

However if I search for that same item in DPLA, the very first result is what I’m looking for.

Searching for the “Geography of Ohio” in DPLA

What we’re after here in the DPLA record is the unique identifier that DPLA assigns, a nice long gloriously unique series of characters that is embedded in all the filenames of digital assets once they are imported over into Wikimedia Commons. We can find this in the URL of the page after opening the record from the DPLA search results page.

Screenshot of the DPLA item URL
https://dp.la/item/aaba7b3295ff6973b6fd1e23e33cde14?q=Geography%20of%20Ohio

What we care about specifically is after the last forward slash and before the question mark. Everything after the question mark is the search I entered into DPLA. What we want is this:

aaba7b3295ff6973b6fd1e23e33cde14

Once we search for this character string in Wikimedia Commons, particularly if you enter the string bound by quotation marks to only match this exact string, the book that we’re looking for should populate first.

"aaba7b3295ff6973b6fd1e23e33cde14"
Searching for a DPLA unique identifier in Wikimedia Commons

More Advanced Use Cases

If you are looking for more advanced use cases than simply inserting thumbnail images across the right side of article pages, I would recommend bookmarking the Wikipedia Manual of Style related to images:

https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Images

There is a wealth of documentation on formatting, positioning, and styling of images added to articles. Similarly there is a great deal of documentation about the creation of image galleries should you wish to insert a group of images into an article.

https://en.wikipedia.org/wiki/Help:Gallery_tag