Migrating Digital Assets from Islandora to Preservica

About four months ago I was hired as the Digital Asset Management Lead at the University of Rochester, River Campus Libraries. One of the big areas of work I’m responsible for is establishing a sustainable digital preservation service using (among other resources) Preservica Cloud Professional edition. The organization currently holds digital assets in a wide variety of repository platforms, but of particular concern is our use of Islandora, based on Drupal 7, which will reach its end of life at the end of 2022. While it is possible that the organization could make the jump to Drupal 8/9, that isn’t a path we’re interested in pursuing, so instead I’m working on other strategies to migrate that content, one of which is moving it into Preservica.

To learn how to engage in this migration work I’m using a small collection (one that I have learned is now almost traditionally used as the “authorized test collection” for these sorts of migrations), the John McGraw Civil War Papers. Since this Islandora instance is slated to be shut down, and link rot is inevitable, I link here to a representative page preserved in the Internet Archive Wayback Machine; a finding aid is also available for the collection.

This collection contained 78 complex digital objects in Islandora. The access copies of these digital objects were PDFs, stored with both MODS and Dublin Core metadata associated with them. While Islandora also hosts the preservation-quality TIFF images, I have been unable to find a means to programmatically export those images from Islandora, and as such am securing the preservation masters from the digitization lab that originally photographed the materials (special thanks to Lisa Wright, Digital Scholarship Digitization Specialist in the Digital Scholarship department of the River Campus Libraries). The PDF and metadata XML are being exported from Islandora using the BagIt Utility Module.

To facilitate this migration process I’ve written a small set of Python scripts that successively manipulate the Islandora bags and preservation masters into something that Preservica can ingest. As an implementation team, we’re still working out exactly how we want to ingest digital assets into Preservica as a standard workflow (and to be honest, our instance of Preservica has yet to see a single authentic ingest outside of testing), but our intentions are the following:

  • To use the PAX standard to structure our digital assets for ingest so that we can have a “many to one” relationship among our SIPs (that is, one PDF and a whole bunch of TIFFs)
  • To use the OPEX standard to formulate our metadata, allowing the submission of an XML record that contains multiple local identifiers, multiple metadata fragments, and allows us to leverage the ArchivesSpace to Preservica synchronization tool so that digital assets in Preservica can be associated with archival objects in ArchivesSpace, and so that the finding aid hierarchy can be replicated in Preservica
  • Use the OPEX Incremental Workflow in Preservica so that folders of digital assets can simply be “dumped” into a staging area and left to be automatically ingested

Many thanks to Richard Smith’s post on the Preservica site, “Using OPEX and PAX for Ingesting Content,” for providing an accessible means of understanding these interwoven standards. The aforementioned Python scripts used to form the SIP for this collection can be found in a GitHub repository. Setting aside the lengthy preamble above, my goal is to lay out the purpose of each of the 18 individual functions and detail what I was trying to accomplish with each one.

I should note that I am trained as a librarian and definitely not a software engineer. I’m part of that class of library-critters that finally found problems in the course of their work that needed some code to surmount, instead of trying to tackle the problem manually, and as a result I finally took a MOOC on Python after saying that I would get around to it for years. To sum up, there are almost certainly better ways to execute just about everything I’m going to talk about.

Thus, this document assumes the following:

  • A basic understanding of Python
  • A fairly comprehensive understanding of Preservica, especially:
    • The OPEX metadata standard and ingest method
    • The PAX asset structure standard
    • The ArchivesSpace – Preservica synchronization workflows
  • Enough understanding of Islandora to work with the BagIt Utility Module

My last disclaimer is that I haven’t actually taken the SIP generated by these scripts and tried to ingest it into Preservica yet; that is coming next week. So, fair chance this gets an “Editor’s Note: something went WRONG” with some addendums on what I changed to make Preservica take the SIP intelligibly.

Python Import Statements and Variables

import os
import os.path
import shutil
import pathlib
import re
import xml.etree.ElementTree as ET
from datetime import datetime
from bdbag import bdbag_api
from bagit import BagValidationError
from bdbag.bdbagit import BaggingInterruptedError
from zipfile import ZipFile
from openpyxl import Workbook
from os.path import basename

This script set ends up using a pretty varied set of Python modules, some of them sparingly, some of them constantly. The os and shutil modules are used constantly to rename and move files and folders. The bdbag module is a Python library specifically created for the validation and manipulation of bags conforming to the BagIt specification. Extracting data from MODS and Dublin Core metadata is pretty common, hence the xml.etree.ElementTree module. Others are used sparingly, such as datetime, just to create a unique, time-based identifier for each SIP, and the ZipFile module when creating the PAX archives. During the process of staging the data, the access PDFs and the preservation TIFFs need to be compared to make sure they match on a one-to-one basis (reader, they did not), so an Excel spreadsheet was created, and openpyxl was used to generate it. All of this is covered in greater detail under the individual functions below.

proj_path = 'M:/IDT/DAM/McGraw_Preservica_Ingest'
ds_files = 'preservation_masters'

The “proj_path” variable is used constantly to reference the position on one of our network drives where this project folder is housed. The “ds_files” variable is used only briefly, to record the original name of the folder the preservation masters arrived in from Digital Scholarship. In general, all of the pathing throughout these scripts is built from variables representing directories, joined together with '/' separators.

  • proj_path represents the location of the project folder on a network drive. It contains the container folder that has all the digital assets in it, the python scripts, and some other sidecar files created by the scripts
  • container is a subdirectory of proj_path and is the folder that will eventually be uploaded via the OPEX incremental workflow and contains all the digital assets and is where they are manipulated to prepare them for ingest
  • bags_dir is a subdirectory of container that initially contains all the access copies of the digital assets and the metadata files, which are manipulated in this folder before being merged with the preservation assets in container

This ends up looking like “variable + '/' + variable + '/' + variable”, which makes it easier to visualize how deep into the directory structure you are, instead of including those slashes in the variables themselves (an os.path.join equivalent is sketched after the structure below). Once the scripts start working on this content in earnest, a fairly typical structure would be:

  • proj_path
    • container
      • directory1
      • directory2
      • directory3
      • bags_dir
        • bag1
        • bag2
        • bag3
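For what it’s worth, os.path.join or pathlib could do the same joining while handling the separators for you; a small sketch of the equivalent (not what these scripts actually do, and the container/bags_dir values shown are just examples of the naming pattern):

import os.path
from pathlib import Path

proj_path = 'M:/IDT/DAM/McGraw_Preservica_Ingest'
container = 'container_2022-02-18_18-07-37'
bags_dir = 'bags_2022-02-18_18-07-37'

#equivalent of proj_path + '/' + container + '/' + bags_dir
joined = os.path.join(proj_path, container, bags_dir)

#pathlib overloads '/' to join path segments
joined_path = Path(proj_path) / container / bags_dir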

create_container()

In advance of this function, the folder full of preservation masters provided by Digital Scholarship is copied into proj_path. The name of the folder is stored in the ds_files global variable.

def create_container():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'a')
    now = datetime.now()
    date_time = now.strftime('%Y-%m-%d_%H-%M-%S')
    project_log_hand.write(date_time + '\n')
    container = 'container_' + date_time
    os.rename(proj_path + '/' + ds_files, proj_path + '/' + container)
    project_log_hand.write(container + '\n')
    print('Container directory: {}'.format(container))
    project_log_hand.close()

First up, we need to create the container folder that will wrap the entire SIP sent off to Preservica. The OPEX incremental method is nice because the OPEX metadata includes a manifest telling Preservica what files to expect, so it doesn’t start processing a digital asset before it has all of that asset’s components (such as metadata), reducing errors on ingest. My understanding is that using a container folder to hold the whole SIP is a way of cheating this a bit: Preservica is told not to start doing anything until this one container folder has fully transferred, and that folder just so happens to contain a myriad of digital assets that all benefit from the same protection. This saves having to build a manifest for each individual asset.

This function opens up a “project_log.txt” file which will be referenced in every single subsequent function. The idea here was to park variables in a text file so that they would be portable and durable outside of the code itself, letting me set the project aside for days at a time and come back to it. The first line of the text file stores the current date and time, and the second stores “container_” with that date/time appended; this string corresponds to container. Then we rename the folder that Digital Scholarship provided (containing all the preservation masters) to the value stored in container, and a print statement reports what that container variable is called. If I ran it right now it would be “container_2022-02-18_18-07-37”, reflecting the current date and time.
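At this point “project_log.txt” holds the first two of its eventual three lines, which the later functions read back by line number (create_bags_dir() appends the bags_dir line below):

2022-02-18_18-07-37
container_2022-02-18_18-07-37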

folder_ds_files()

def folder_ds_files():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    folder_name = ''
    folder_count = 0
    file_count = 0
    for file in sorted(os.listdir(path = proj_path + '/' + container)):
        #sorted() keeps files that share a prefix consecutive, which the grouping below relies on
        file_root = file.split('-')
        file_root = file_root[0]
        if file_root == folder_name:
            shutil.move(proj_path + '/' + container + '/' + file, proj_path + '/' + container + '/' + folder_name + '/' + file)
            file_count += 1
        else:
            folder_name = file_root
            os.mkdir(proj_path + '/' + container + '/' + folder_name)
            folder_count += 1
            shutil.move(proj_path + '/' + container + '/' + file, proj_path + '/' + container + '/' + folder_name + '/' + file)
            file_count += 1
    for folder in os.listdir(path = proj_path + '/' + container):
        count = 0
        for file in os.listdir(proj_path + '/' + container + '/' + folder):
            count += 1
        if count > 99:
            os.rename(proj_path + '/' + container + '/' + folder, proj_path + '/' + container + '/' + folder + "-001-" + str(count))
        elif count > 9:
            os.rename(proj_path + '/' + container + '/' + folder, proj_path + '/' + container + '/' + folder + "-001-0" + str(count))
        else:
            os.rename(proj_path + '/' + container + '/' + folder, proj_path + '/' + container + '/' + folder + "-001-00" + str(count))
    print('Created and renamed {} subdirectories and moved {} files into them'.format(folder_count, file_count))
    project_log_hand.close()

After the function declaration, we have the standard opening block that reads “project_log.txt” and extracts some subset of the first three lines, containing the variables mentioned above: date_time, container, and bags_dir. Following this we have an empty string variable that will be used in a for loop shortly, as well as a couple of incrementing counters, present in most functions, that feed a concluding print statement showing just how many actions the function was able to take.

The folder of preservation assets that Digital Scholarship is passing along is one directory that contains all the assets, but we need those assets to be in separate subdirectories. The first for loop takes care of that process, thanks to consistent file-naming conventions. Each file has a file name that is a prefix, representing the physical location of the analogue asset in the archives as represented by “Collection Number – Box Number – Folder Number – Date” which is followed by a suffix in the form of a sequence number representing the pages. This for loop looks at the prefix of the file, compares it to what is stored in the folder_name variable, and if it’s different, it stores the prefix in the folder_name variable, creates a new directory named identically to the prefix, and then moves the file in the iteration into the folder. In the next round of the for loop, if the prefix is identical to what is stored in folder_name, it simply moves the file in.

The prefix is used as part of a unique identifier in the asset metadata that we’ll be using to match up preservation and access copies later, but it’s not quite fully formed yet. The identifier in the metadata is the prefix plus a range of three-digit sequence numbers reflecting how many pages there are in total. For instance, “D528_1_4_1864_02_28-001-002” represents an asset from collection D528, in box 1, folder 4, dated February 28, 1864, which contains two pages. Currently our folder names lack the information reflecting those two pages. The next for loop counts how many files are in each folder and then renames the folder to append that page count to the end of the folder name. This includes logic to handle how many leading zeroes should be included: if the loop counts more than 99 files in the folder there are no leading zeroes, but if it counts 9 or fewer, then there are two leading zeroes.
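As an aside, Python’s format specifiers can produce the same zero-padded count without the three-way branch; a sketch of an equivalent rename, assuming the same proj_path, container, folder, and count variables as the function above:

#count is the number of page images found in the folder
folder_path = proj_path + '/' + container + '/' + folder
#'{:03d}' pads to three digits: 2 -> '002', 42 -> '042', 142 -> '142'
os.rename(folder_path, folder_path + '-001-{:03d}'.format(count))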

create_bags_dir()

def create_bags_dir():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    date_time = vars[0]
    date_time = date_time.strip()
    container = vars[1]
    container = container.strip()
    bags_dir = 'bags_' + date_time
    os.mkdir(proj_path + '/' + container + '/' + bags_dir)
    project_log_hand.close()
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'a')
    project_log_hand.write(bags_dir + '\n')
    print('Created bags directory: {}'.format(bags_dir))
    project_log_hand.close()

This function starts with the standard call to “project_log.txt” and ultimately adds the third and final line of information to the text file: the name of the folder that will contain the access assets and metadata, referred to as bags_dir. The function uses the information in the date_time variable to name the new subdirectory, creates the subdirectory, and writes the variable information into “project_log.txt”.

Once the bags_dir directory is created the exported bags from Islandora should be copied over into the newly created folder.

extract_bags()

def extract_bags():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    bags_dir = vars[2]
    bags_dir = bags_dir.strip()
    num_bags = 0
    for file in os.listdir(path = proj_path + '/' + container + '/' + bags_dir):
        bdbag_api.extract_bag(proj_path + '/' + container + '/' + bags_dir + '/' + file, output_path = proj_path + '/' + container + '/' + bags_dir, temp=False)
        print('extracting bag: {}'.format(file))
        num_bags += 1
    for bag in os.listdir(path = proj_path + '/' + container + '/' + bags_dir):
        if bag.endswith('.zip'):
            print('removing zipped bag: {}'.format(bag))
            os.remove(proj_path + '/' + container + '/' + bags_dir + '/' + bag)
    print('Extracted {} bags'.format(str(num_bags)))
    project_log_hand.close()

At this point I’m going to stop referring to the fact that every function opens with the call to “project_log.txt”; you can assume each subsequent function will do this, pulling out the variables that represent the names of the container directory and the bags_dir directory as appropriate. I’m also going to skip mentioning that each function contains one or more counter variables to be used by a concluding print statement to summarize what the function did. Finally, many of the functions have print statements in the for loops themselves so that progress can be monitored as the function iterates through each file or directory.

As mentioned previously, Islandora exports digital assets as bags, per the BagIt specification. This specification was created by the Library of Congress to provide a means of ensuring that when you transfer large amounts of data over long distances, what is sent and what is received are identical. This is accomplished by a simple directory structure containing the assets and a set of text files listing all the included files (providing an inventory or manifest) along with checksums for each file (so you can compare fixity before and after transfer). Bags can be either zipped or unzipped; a zipped bag is just the bag directory structure turned into a zip archive.
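Once unzipped, a bag looks something like this (the tag files are defined by the BagIt spec; the checksum algorithm, and therefore the exact manifest file names, can vary):

  • bag_directory
    • bagit.txt
    • bag-info.txt
    • manifest-md5.txt
    • tagmanifest-md5.txt
    • data
      • MODS.xml
      • DC.xml
      • OBJ.pdf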

With that context, this function takes the zipped bags exported by Islandora and unzips them using the bdbag Python library, specifically the “bdbag_api.extract_bag(source path, output path)” function. This takes each bag, unzips it, and deposits it into the bags_dir directory.

The above for loop creates an unzipped version of the bag, but the zipped version is still hanging around. The second for loop looks for anything in bags_dir that has a “.zip” file extension and deletes it.

validate_bags()

def validate_bags():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    bags_dir = vars[2]
    bags_dir = bags_dir.strip()
    error_log_handle = open(proj_path + '/' + 'validation_error_log.txt', 'a')
    num_bags = 0
    for directory in os.listdir(path = proj_path + '/' + container + '/' + bags_dir):
        print('attempting to validate {}'.format(directory))
        num_bags += 1
        try:
            bdbag_api.validate_bag(proj_path + '/' + container + '/' + bags_dir + '/' + directory, fast = False)
        except BagValidationError:
            error_log_handle.write('Bag Validation Error | Directory: ' + directory + '\n')
        except BaggingInterruptedError:
            error_log_handle.write('Bagging Interrupted Error | Directory: ' + directory + '\n')
        except RuntimeError:
            error_log_handle.write('Runtime Error | Directory: ' + directory + '\n')
    print('Validated {} bags'.format(str(num_bags)))
    error_log_handle.close()
    project_log_hand.close()

Now that we have all the access assets and metadata inside unzipped bags in bags_dir, we next need to make sure that what we got from Islandora hasn’t been corrupted in transfer in any way. This for loop uses the checksums in the bag text files to validate the assets each bag contains, again using the bdbag Python library, specifically the “bdbag_api.validate_bag(path-to-bag)” function. If any of three types of errors are raised (BagValidationError, BaggingInterruptedError, or RuntimeError), they get written to a new sidecar file called “validation_error_log.txt” located in proj_path. Any bags that raise errors can then be worked with individually to see if a non-corrupt version can be pulled from Islandora.

create_id_ss()

def create_id_ss():
    wb = Workbook()
    ws = wb.active
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    bags_dir = vars[2]
    bags_dir = bags_dir.strip()
    project_log_hand.close()
    ws['A1'] = 'pres_file_name'
    ws['B1'] = 'acc_file_name'
    ws['C1'] = 'bag_id'
    pres_file_list = []
    for folder in os.listdir(path = proj_path + '/' + container):
        if folder.startswith('bags_'):
            continue
        else:
            pres_file_list.append(folder)
    bag_dict = dict()
    for bag in os.listdir(path =  proj_path + '/' + container + '/' + bags_dir):
        tree = ET.parse(proj_path + '/' + container + '/' + bags_dir + '/' + bag + '/data/MODS.xml')
        identifier = tree.find('{http://www.loc.gov/mods/v3}identifier').text
        bag_dict[identifier] = bag
    for item in pres_file_list:
        if item in bag_dict.keys():
            ws.append([item, item, bag_dict[item]])
        else:
            ws.append([item, '', ''])
    for item in bag_dict.keys():
        if item not in pres_file_list:
            ws.append(['', item, bag_dict[item]])
    wb.save(proj_path + '/' + 'pres_acc_bag_ids.xlsx')
    print('Created pres_acc_bag_ids.xlsx')

This function introduces a new Python library to the mix, openpyxl, which is used to read, manipulate, and create Excel spreadsheets. Since many of these digitization projects were completed years ago, memories get hazy about what was done over the course of a project: which assets were used in Islandora, and which ended up being left out. Digital Scholarship is providing a complete inventory of their work, which may or may not have a one-to-one counterpart among the exported bags from Islandora, so we need some way to compare the two lists to see how well they match up.

This function creates a workbook object (“wb = Workbook()”) and a worksheet in that workbook (“ws = wb.active”); a worksheet is one of the tabs at the bottom of an Excel file. The worksheet has three columns, with the headings “pres_file_name”, “acc_file_name”, and “bag_id”. The first for loop iterates through the contents of container (skipping bags_dir) and creates a list of all the folder names that container contains.

Next up we create a blank dictionary and loop through all of the bags in bags_dir, specifically pulling out the MODS metadata record and looking at the <identifier> field, which should hopefully match the folder names mentioned in the previous paragraph. The loop uses the <identifier> information as the key in the dictionary and the folder name of the bag as the value.

Next up we need to write all this information into the spreadsheet so we can look it over in detail. The first for loop attempts to match the container folder names against the MODS identifiers. If it finds a match, it adds a row to the spreadsheet with the folder name (representing the preservation assets) in the first column, the MODS &lt;identifier&gt; (representing the access assets/metadata) in the second column, and the bag directory name (also representing the access assets/metadata) in the third column.

If the first for loop can’t find a match, then it simply adds a row to the spreadsheet with the container folder name, leaving the second and third columns blank. These represent preservation assets that Digital Scholarship provided that have no counterpart in Islandora.

We still have one piece of the puzzle left though, those exported bags from Islandora that showed up without preservation masters from Digital Scholarship. The last for loop looks at each of the keys in the dictionary (which again are drawn from the MODS <identifier> fields) and if that key doesn’t show up in the list of container folder names, then it appends another line to the spreadsheet with the first column blank, the second column being the dictionary key (MODS <identifier>) and the third column being the dictionary value (bags_dir folder).

Finally it’s time to actually look over the spreadsheet and see what went wrong. In some cases there were assets that Digital Scholarship imaged that simply weren’t used in Islandora (in this particular collection, a number of envelopes). For some there were file-naming mismatches due to clerical error, or because a blank analogue (a physical piece of paper, or back of a piece of paper, with no writing on it) was imaged for the preservation masters but not included in the PDF which resulted in “prefix-001-002” and “prefix-001-003” being the same intellectual content but different file names. Going into depth here is likely not altogether useful, but the spreadsheet was analyzed, corrections were made, and the files/folders in container and bags_dir were manipulated so the scripts could continue.
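To make the three outcomes concrete, the finished spreadsheet ends up with rows along these lines (the identifiers follow the naming pattern described earlier and, like the bag directory names, are illustrative):

pres_file_name                 acc_file_name                  bag_id
D528_1_4_1864_02_28-001-002    D528_1_4_1864_02_28-001-002    Bag-islandora_1234
D528_1_4_1864_03_05-001-001    (blank)                        (blank)
(blank)                        D528_1_5_1863_11_20-001-004    Bag-islandora_5678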

representation_preservation()

def representation_preservation():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    folder_count = 0
    file_count = 0
    rep_pres = 'Representation_Preservation'
    for directory in os.listdir(path = proj_path + '/' + container):
        if directory.startswith('bags_'):
            continue
        path = proj_path + '/' + container + '/' + directory + '/' + rep_pres
        os.mkdir(path)
        folder_count += 1
        for file in os.listdir(path = proj_path + '/' + container + '/' + directory):
            if file == rep_pres:
                continue
            else:
                file_name = file.split('.')
                file_name = file_name[0]
                os.mkdir(path + '/' + file_name)
                shutil.move(proj_path + '/' + container + '/' + directory + '/' + file, path + '/' + file_name + '/' + file)
            file_count += 1
    print('Created {} Representation_Preservation directories | Moved {} files into created directories'.format(folder_count, file_count))
    project_log_hand.close()

All of the above functions involve setting the stage for what ideally should be a set of functions that can be run automatically in sequence rather than one at a time. With this function we start to build the directory structure necessary for the PAX that will represent the contents of each digital object.

Throughout these remaining functions you’ll see references to a path that looks like “proj_path + '/' + container + '/' + directory”, which represents the working folder for each digital asset. The PAX structure requires that each of these directories has a “Representation_Preservation” and a “Representation_Access” subdirectory, which are fairly self-explanatory. Each of these “Representation_” folders then contains subdirectories that each hold a single file: “Representation_Preservation” gets a series of subdirectories that each contain a single TIFF, while “Representation_Access” gets a single subdirectory containing the single PDF. If this is sounding confusing I would again recommend reading Smith’s blog post for an overview of PAX/OPEX.
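For a two-page asset, the target structure inside each working folder ends up looking like this (the exact TIFF file names will vary with the digitization lab’s conventions):

  • D528_1_4_1864_02_28-001-002
    • Representation_Preservation
      • D528_1_4_1864_02_28-001
        • D528_1_4_1864_02_28-001.tif
      • D528_1_4_1864_02_28-002
        • D528_1_4_1864_02_28-002.tif
    • Representation_Access
      • D528_1_4_1864_02_28-001-002
        • D528_1_4_1864_02_28-001-002.pdf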

The first for loop here makes sure to skip the bags_dir and then creates the new “Representation_Preservation” subdirectory. The second for loop then looks at all the files in a given directory, stores the file name (minus the extension) in a variable, and creates a subdirectory for each file, before moving the corresponding file into the newly created folder.

process_bags()

def process_bags():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    bags_dir = vars[2]
    bags_dir = bags_dir.strip()
    num_bags = 0
    error_log_handle = open(proj_path + '/' + 'validation_error_log.txt', 'r')
    error_log_str = error_log_handle.read()
    for directory in os.listdir(path = proj_path + '/' + container + '/' + bags_dir):
        #skips any directories that raised errors during validation
        if error_log_str.find(directory) != -1 :
            continue
        else:
            print('attempting to revert bag: {}'.format(directory))
            obj_file_name = ''
            #converts the bags back into normal directories, removing bagit and manifest files
            bdbag_api.revert_bag(proj_path + '/' + container + '/' + bags_dir + '/' + directory)
            #removes unnecessary files generated by Islandora
            unnecessary_files = ['foo.xml', 'foxml.xml', 'JP2.jp2', 'JPG.jpg', 'POLICY.xml', 'PREVIEW.jpg', 'RELS-EXT.rdf', 'RELS-INT.rdf', 'TN.jpg', 'HOCR.html', 'OCR.txt', 'MP4.mp4', 'PROXY_MP3.mp3']
            for file in os.listdir(path = proj_path + '/' + container + '/' + bags_dir + '/' + directory):
                if file in unnecessary_files:
                    os.remove(proj_path + '/' + container + '/' + bags_dir + '/' + directory + '/' + file)
                if re.search('^OBJ', file):
                    obj_file_name = file
                    extension = obj_file_name.split('.')
                    extension = extension[1]
                    extension = extension.strip()
                elif re.search('^PDF', file):
                    obj_file_name = file
                    extension = obj_file_name.split('.')
                    extension = extension[1]
                    extension = extension.strip()
            #use xml.etree to identify filename from MODS.xml
            tree = ET.parse(proj_path + '/' + container + '/' + bags_dir + '/' + directory + '/MODS.xml')
            identifier = tree.find('{http://www.loc.gov/mods/v3}identifier').text
            #rename the OBJ file to original filename pulled from MODS.xml
            os.rename(proj_path + '/' + container + '/' + bags_dir + '/' + directory + '/' + obj_file_name, proj_path + '/' + container + '/' + bags_dir + '/' + directory + '/' + identifier + '.' + extension)
        num_bags += 1
    error_log_handle.close()
    print('Processed {} bags'.format(str(num_bags)))
    project_log_hand.close()

The bags exported by Islandora need a bit more massaging to get them ready for their “Representation_Access” folder. This function starts off by opening the “validation_error_log.txt” file created in validate_bags() and puts all the contents into a string variable. Then as the bags in bags_dir are looped through, if any of the bags show up in the error log string, they are skipped.

Next we use the bdbag Python library to revert the bag into a simple directory, of files, removing the sidecar files (containing the manifests and checksums) and bag directory structure. After reverting the bags, a for loop is called to look at the contents of the reverted bags and make a decision about what to do with each file. Islandora exports a lot of files in the bags that are only relevant to Islandora which can be deleted, or things like thumbnails that aren’t necessary since Preservica will create those for us. The list of these files that can be deleted is stored in the “unnecessary_files” variable, and if a file is in the list, it gets deleted.

Next up in the conditional logic we use an incredibly simple regular expression to look for any files whose names begin with “OBJ” or “PDF”. We then store the file name in one variable and the file extension in a second. We’re going to be renaming these files, and now we have the “before” for the os.rename function.

Now we find the “after” by looking at the MODS record XML, finding our previously used &lt;identifier&gt; field and storing it in a variable. Then we rename the access asset from either “OBJ.pdf” or “PDF.pdf” to the contents of &lt;identifier&gt; plus the extension stored above; for example, “OBJ.pdf” becomes “D528_1_4_1864_02_28-001-002.pdf”.

representation_access()

def representation_access():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    bags_dir = vars[2]
    bags_dir = bags_dir.strip()
    folder_count = 0
    rep_acc = 'Representation_Access'
    for directory in os.listdir(path = proj_path + '/' + container):
        if directory.startswith('bags_'):
            print('bags_ folder found - skipped')
        else:
            path = proj_path + '/' + container + '/' + directory + '/' + rep_acc
            os.mkdir(path)
            print('created {}'.format(path))
            folder_count += 1
    print('Created {} Representation_Access directories'.format(folder_count))
    project_log_hand.close()

This function is the counterpart to representation_preservation(), now looping through the directories in container to create a “Representation_Access” folder, so that the access assets prepared in process_bags() can then be moved into the newly created folder.

access_id_path()

def access_id_path():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    bags_dir = vars[2]
    bags_dir = bags_dir.strip()
    access_id_hand = open(proj_path + '/' + 'access_ids.txt', 'a')
    access_count = 0
    for directory in os.listdir(path = proj_path + '/' + container + '/' + bags_dir):
        tree = ET.parse(proj_path + '/' + container + '/' + bags_dir + '/' + directory + '/MODS.xml')
        identifier = tree.find('{http://www.loc.gov/mods/v3}identifier').text
        access_id_hand.write(identifier + '|')
        path = proj_path + '/' + container + '/' + bags_dir + '/' + directory
        access_id_hand.write(path + '\n')
        access_count += 1
        print('logged {} and {}'.format(identifier, path))
    print('Logged {} paths and identifiers in access_ids.txt'.format(access_count))
    project_log_hand.close()
    access_id_hand.close()

So now our preservation assets are in the directories they should be in to create a well-formed PAX, but our access assets are still in bags_dir and in need of being merged, but as of yet we don’t have a mapping between the two to merge them. This function opens up a new sidecar text file called “access_ids.txt” that will hold a set of identifiers that are present in both the preservation masters as well as in the bags, and also the path to the unzipped bags.

The function then loops through each subdirectory in bags_dir and writes to “access_ids.txt” the identifier found in the &lt;identifier&gt; field of the MODS XML record, as well as the path to the current subdirectory in the iteration. This is formatted as a pipe (“|”) delimited text file.
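A line in “access_ids.txt” thus ends up looking something like this (the bag directory name is hypothetical):

D528_1_4_1864_02_28-001-002|M:/IDT/DAM/McGraw_Preservica_Ingest/container_2022-02-18_18-07-37/bags_2022-02-18_18-07-37/Bag-islandora_1234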

merge_access_preservation()

def merge_access_preservation():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    bags_dir = vars[2]
    bags_dir = bags_dir.strip()
    access_id_hand = open(proj_path + '/' + 'access_ids.txt', 'r')
    access_id_list = access_id_hand.readlines()
    rep_acc = 'Representation_Access'
    file_count = 0
    for directory in os.listdir(path = proj_path + '/' + container):
        if directory.startswith('bags_'):
            continue
        else:
            for line in access_id_list:
                print('merging {} and {}'.format(directory,line))
                access_info = line.split('|')
                identifier = access_info[0]
                identifier = identifier.strip()
                path = access_info[1]
                path = path.strip()
                if identifier == directory:
                    for file in os.listdir(path = path):
                        if file.endswith('.xml'):
                            shutil.move(path + '/' + file, proj_path + '/' + container + '/' + directory + '/' + file)
                            file_count += 1
                        else:
                            file_name = file.split('.')
                            file_name = file_name[0]
                            os.mkdir(proj_path + '/' + container + '/' + directory + '/' + rep_acc + '/' +  file_name)
                            shutil.move(path + '/' + file, proj_path + '/' + container + '/' + directory + '/' + rep_acc + '/' + file_name + '/' + file)
                            file_count += 1
    print('Moved {} access and metadata files'.format(file_count))
    project_log_hand.close()
    access_id_hand.close()

This is a bit of a tortured function, but I couldn’t figure out a better way to write it. It opens up the previously created “access_ids.txt” and reads all of the lines into a list using the file_handle.readlines() function. First we start by looping through every directory in container (first iteration), as usual skipping bags_dir when it shows up.

Now, as we loop through each directory in container, we then loop through the entire list of lines from “access_ids.txt” (second iteration), splitting each line and turning the identifier and path to the access assets into their own variables again. As we loop through these, if the identifier is identical to the name of the directory we are currently iterating through the next for loop triggers.

This next iteration (third iteration) identifies all of the metadata files (which end in “.xml”) and moves them over to the directory from our first iteration. Then it also takes the access asset itself, creates the subdirectory inside “Representation_Access” that is equivalent to the file name, and then moves the access asset into this newly created subdirectory inside “Representation_Access”.

I worry that these successive layers of nested iteration, and the fact that the entire “access_ids.txt” list has to be scanned for each directory in container, will make for a slow process once larger data sets are involved; the dictionary-based lookup sketched below is one possible alternative.
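A minimal sketch of that alternative (not what the scripts currently do): read “access_ids.txt” into a dictionary once, keyed on the identifier, and then look each directory up directly. This assumes the same proj_path, container, and access_id_list variables as the function above.

#build the identifier -> path mapping once
access_paths = dict()
for line in access_id_list:
    #each line is 'identifier|path'
    identifier, path = line.split('|')
    access_paths[identifier.strip()] = path.strip()

for directory in os.listdir(path = proj_path + '/' + container):
    if directory.startswith('bags_'):
        continue
    if directory in access_paths:
        #one dictionary lookup replaces a scan of the whole list
        path = access_paths[directory]
        #...move the metadata and access files as above...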

cleanup_bags()

def cleanup_bags():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    bags_dir = vars[2]
    bags_dir = bags_dir.strip()
    shutil.rmtree(proj_path + '/' + container + '/' + bags_dir)
    os.remove(proj_path + '/' + 'access_ids.txt')
    print('Deleted "{}" directory and access_ids.txt'.format(bags_dir))
    project_log_hand.close()

This next function is a much simpler one, now cleaning up the staging area that we used to process the digital assets from Islandora. First we delete bags_dir and all of its contents, then we delete the now unnecessary “access_ids.txt” file.

pax_metadata()

def pax_metadata():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    dir_count = 0
    for directory in os.listdir(path = proj_path + '/' + container):
        opex1 = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><opex:OPEXMetadata xmlns:opex="http://www.openpreservationexchange.org/opex/v1.0"><opex:Properties><opex:Title>'
        tree = ET.parse(proj_path + '/' + container + '/' + directory + '/DC.xml')
        root = tree.getroot()
        opex2 = tree.find('{http://purl.org/dc/elements/1.1/}title').text
        opex3 = '</opex:Title><opex:Identifiers>'
        id_list = []
        opex4 = ''
        for id in root.findall('{http://purl.org/dc/elements/1.1/}identifier'):
            id_list.append(id.text)
        for item in id_list:
            if item.startswith('ur'):
                opex4 += '<opex:Identifier type="code">' + item + '</opex:Identifier>'
            else:
                other_identifiers = item.split(':')
                label = other_identifiers[0]
                label = label.strip()
                value = other_identifiers[1]
                value = value.strip()
                opex4 += '<opex:Identifier type="' + label + '">' + value + '</opex:Identifier>'
        opex5 = '</opex:Identifiers></opex:Properties><opex:DescriptiveMetadata><LegacyXIP xmlns="http://preservica.com/LegacyXIP"><AccessionRef>catalogue</AccessionRef></LegacyXIP>'
        opex6 = ''
        for file in os.listdir(path = proj_path + '/' + container + '/' + directory):
            if file.endswith('.xml'):
                temp_file_hand = open(proj_path + '/' + container + '/' + directory + '/' + file, 'r')
                lines = temp_file_hand.readlines()
                for line in lines:
                    opex6 += line
                temp_file_hand.close()
        opex7 = '</opex:DescriptiveMetadata></opex:OPEXMetadata>'
        filename = directory + '.pax.zip.opex'
        pax_md_hand = open(proj_path + '/' + container + '/' + directory + '/' + directory + '.pax.zip.opex', 'a')
        pax_md_hand.write(opex1 + opex2 + opex3 + opex4 + opex5 + opex6 + opex7)
        pax_md_hand.close()
        print('created {}'.format(filename))
        dir_count += 1
    print('Created {} OPEX metadata files for individual assets'.format(dir_count))
    project_log_hand.close()

Each digital asset gets packaged as a PAX object, which is a zip archive with a corresponding OPEX metadata file. The OPEX file that accompanies each PAX is a fairly complex one, as the size of this function suggests. The overall strategy of the function is to build the different chunks of XML in different for loops, then concatenate all of the variables storing those strings and write them into a file with a suffix of “.pax.zip.opex”. All of the PAX zip archives have a suffix of “.pax.zip” and OPEX metadata files are recognized by a file extension of “.opex”, hence “.pax.zip.opex”.

The opex1 variable contains a static string: the XML declaration and the opening tags down to the OPEX &lt;Title&gt; element.

The opex2 variable contains the title of the digital asset. This is found by opening the Dublin Core metadata XML file, and finding the <title> field.

The opex3 variable contains a static string, the closing title tag, and the opening of the identifiers tag.

The opex4 variable is built by looping through the aforementioned Dublin Core XML file, in this case looking for values in the <identifier> field. In this case there should be at least one, but can be more than that, so each value is appended to the “id_list” variable. Once the list is created then the identifiers, surrounded by static values for their opening and closing tags, are concatenated onto the opex4 string variable.

The only exception to this process is that the Islandora ID (which is generally structured as “ur:#####” where the hash signs indicate an integer) will always have an XML attribute type of “code”, whereas the other identifiers will have the label from the Dublin Core metadata used as the XML attribute type. Having the Islandora identifier always be labelled as “code” has to do with the ArchivesSpace – Preservica integration, as it requires a unique identifier to initialize, and that unique identifier must have a label of “code”.

The opex5 variable is a static string containing the closing tags for the Identifiers and Properties sections of the metadata and the opening tag for the Descriptive Metadata portion, which also includes a small XIP metadata fragment, &lt;LegacyXIP xmlns="http://preservica.com/LegacyXIP"&gt;&lt;AccessionRef&gt;catalogue&lt;/AccessionRef&gt;&lt;/LegacyXIP&gt;, which again is necessary for the ArchivesSpace – Preservica integration.

The opex6 variable works with another for loop, this time looking for extra metadata files in the directory (identified by an “.xml” file extension) and adding those to the OPEX metadata file as well. This allows you to add multiple different metadata fragments in the same XML document, allowing the Preservica XIP, MODS, DC, and FITS, all to be added at the same time. In this case each file is opened, all of the lines are read into a list, and then each line concatenated with the opex6 variable.

The opex7 variable is a static string holding the closing tags for the Descriptive Metadata section and the entire OPEX XML document.

Finally, the script opens up a new file named after the current directory in the top level for loop, and an extension of “.pax.zip.opex”, writes opex1 through opex7 to the file, and closes it, before moving to the next directory in container.
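Pretty-printed for readability (the function writes it all on one line), the finished OPEX file looks roughly like the following; the title and the second identifier here are hypothetical, and the appended metadata fragments are elided:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<opex:OPEXMetadata xmlns:opex="http://www.openpreservationexchange.org/opex/v1.0">
  <opex:Properties>
    <opex:Title>Letter from John McGraw, February 28, 1864</opex:Title>
    <opex:Identifiers>
      <opex:Identifier type="code">ur:12345</opex:Identifier>
      <opex:Identifier type="local">D528_1_4_1864_02_28-001-002</opex:Identifier>
    </opex:Identifiers>
  </opex:Properties>
  <opex:DescriptiveMetadata>
    <LegacyXIP xmlns="http://preservica.com/LegacyXIP">
      <AccessionRef>catalogue</AccessionRef>
    </LegacyXIP>
    <!-- the MODS, DC, and any other .xml files in the directory are appended here -->
  </opex:DescriptiveMetadata>
</opex:OPEXMetadata>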

stage_pax_content()

def stage_pax_content():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    project_log_hand.close()
    pax_count = 0
    rep_count = 0
    for directory in os.listdir(path = proj_path + '/' + container):
        print(directory)
        os.mkdir(proj_path + '/' + container + '/' + directory + '/pax_stage')
        pax_count += 1
        shutil.move(proj_path + '/' + container + '/' + directory + '/Representation_Access', proj_path + '/' + container + '/' + directory + '/pax_stage')
        shutil.move(proj_path + '/' + container + '/' + directory + '/Representation_Preservation', proj_path + '/' + container + '/' + directory + '/pax_stage')
        rep_count += 2
    print('Created {} pax_stage subdirectories and staged {} representation subdirectories'.format(pax_count, rep_count))

As we get closer to getting the PAX ready to be created in earnest, first we stage the contents so that they can be written into a zip file more easily. Since each directory that gets iterated through in container at this point has a folder “Representation_Preservation” and “Representation_Access” which will ultimately go into the PAX, a new staging folder called “pax_stage” is created, and both “Representation_” directories are moved into it.

create_pax()

def create_pax():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    project_log_hand.close()
    dir_count = 0
    for directory in os.listdir(path = proj_path + '/' + container):
        zip_dir = pathlib.Path(proj_path + '/' + container + '/' + directory + '/pax_stage/')
        pax_obj = ZipFile(proj_path + '/' + container + '/' + directory + '/' + directory + '.zip', 'w')
        for file_path in zip_dir.rglob("*"):
            pax_obj.write(file_path, arcname = file_path.relative_to(zip_dir))
        pax_obj.close()
        os.rename(proj_path + '/' + container + '/' + directory + '/' + directory + '.zip', proj_path + '/' + container + '/' + directory + '/' + directory + '.pax.zip')
        dir_count += 1
        zip_file = str(dir_count) + ': ' + directory + '.pax.zip'
        print('created {}'.format(zip_file))
    print('Created {} PAX archives for ingest'.format(dir_count))

I have prided myself generally thus far on using the Python docs relevant to all these modules to write all this code, but when it came to creating the Zip archives for whatever reason I sort of hit a brick wall. This function uses the Python ZipFile library to add the “Representation_” folders (and their contents) to a zip archive, but I don’t claim to have a good grasp on exactly how it does it. This was a script-kiddie moment for me. After a lot of online searching, this article by Leodanis Pozo Ramos on the Real Python site provided me something I could work with.

The pathlib.Path and zip_dir.rglob("*") sections of this script I only sort of understand. The “zip_dir” variable holds the path to our newly created staging directory, “/pax_stage”. We open up a new zip object, set it so it can be written to, and hold that in the “pax_obj” variable. Then rglob("*") recursively matches every file and folder under zip_dir, and each one is written into the zip archive, with relative_to(zip_dir) stripping the staging prefix so that paths inside the archive start at the “Representation_” folders. Finally we rename the archive so that instead of the directory name plus “.zip” it is the directory name plus “.pax.zip”.

cleanup_directories()

def cleanup_directories():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    file_count = 0
    dir_count = 0
    unexpected = 0
    for directory in os.listdir(path = proj_path + '/' + container):
        for entity in os.listdir(path = proj_path + '/' + container + '/' + directory):
            if entity.endswith('.zip') == True:
                print('PAX: ' + entity)
            elif entity.endswith('.opex') == True:
                print('metadata: ' + entity)
            elif entity.endswith('.xml') == True:
                os.remove(proj_path + '/' + container + '/' + directory + '/' + entity)
                file_count += 1
                print('removed metadata file')
            elif os.path.isdir(proj_path + '/' + container + '/' + directory + '/' + entity) == True:
                shutil.rmtree(proj_path + '/' + container + '/' + directory + '/' + entity)
                dir_count += 1
                print('removed pax_stage directory')
            else:
                print('***UNEXPECTED ENTITY: ' + entity)
                unexpected += 1
    print('Deleted {} metadata files and {} pax_stage directories'.format(file_count, dir_count))
    print('Found {} unexpected entities'.format(unexpected))
    project_log_hand.close()

We’re getting close to the end here. At this point every directory inside container has an OPEX metadata file, a zipped PAX archive, the “pax_stage” directory, and whatever other files weren’t the actual digital assets themselves; mostly these should be MODS, DC, and FITS metadata files. Time to clean up all this extraneous stuff we no longer need.

This set of for loops identifies the PAX archive (but does nothing to it), the OPEX file (but does nothing to it), deletes anything with an “.xml” file extension, looks for a directory structure (“pax_stage”) and deletes it, and then prints out anything that doesn’t fit into the above framework.

Ideally the process_bags() function will have gotten rid of any extra files in the bags exported by Islandora that we don’t need, but this print statement is meant to catch any files that made it through that process that might need to be deleted or incorporated somehow.

ao_opex_metadata()

def ao_opex_metadata():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    project_log_hand.close()
    file_count = 0
    id_hand = open(proj_path + '/' + 'mcgraw_aonum_islid.txt', 'r')
    id_list = id_hand.readlines()
    id_hand.close()
    for directory in os.listdir(path = proj_path + '/' + container):
        opex_hand = open(proj_path + '/' + container + '/' + directory + '/' + directory + '.pax.zip.opex', 'r')
        opex_str = opex_hand.read()
        opex_hand.close()
        ao_num = ''
        for line in id_list:
            ids = line.split('|')
            aonum = ids[0]
            aonum = aonum.strip()
            isnum = ids[1]
            isnum = isnum.strip()
            if opex_str.find(isnum) != -1:
                ao_num = aonum
                print('found a match for {} and {}'.format(aonum, isnum))
        opex = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><opex:OPEXMetadata xmlns:opex="http://www.openpreservationexchange.org/opex/v1.0"><opex:Properties><opex:Title>' + ao_num + '</opex:Title><opex:Identifiers><opex:Identifier type="code">' + ao_num + '</opex:Identifier></opex:Identifiers></opex:Properties><opex:DescriptiveMetadata><LegacyXIP xmlns="http://preservica.com/LegacyXIP"><Virtual>false</Virtual></LegacyXIP></opex:DescriptiveMetadata></opex:OPEXMetadata>'
        ao_md_hand = open(proj_path + '/' + container + '/' + directory + '/' + ao_num + '.opex', 'a')
        ao_md_hand.write(opex)
        ao_md_hand.close()
        os.rename(proj_path + '/' + container + '/' + directory, proj_path + '/' + container + '/' + ao_num)
        file_count += 1
    print('Created {} archival object metadata files'.format(file_count))

Naturally, we’re not quite done creating OPEX metadata just yet. Since this process is leveraging the ArchivesSpace – Preservica integration, the OPEX metadata and PAX archive both need to be located inside a folder that is going to map over to ArchivesSpace and is a key part of what allows that integration to run. Thankfully this time it won’t involve concatenating 7 different variables to make the OPEX file.

Also, there is one area I don’t have a good answer to: mapping the archival object number from ArchivesSpace to the identifier from Islandora. I manually made my way through ArchivesSpace, copying archival object numbers and the corresponding Islandora identifiers (based on links from associated ArchivesSpace digital objects) into a text file to create that mapping. Since this test collection is small, it only took about an hour, but this is definitely something I’m hoping to automate in the future. This file was called “mcgraw_aonum_islid.txt”.
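Each line of that file pairs an archival object number with the corresponding Islandora identifier, pipe-delimited, so a (hypothetical) line looks like:

aspace_12345|ur:12345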

Now the function first opens this file of archival object numbers and Islandora ids and reads the contents into a list. We then start looping through the directories in container, just as we have for most of these functions, opening each directory’s OPEX file and reading its contents into a variable. For each directory we loop through the contents of the “mcgraw_aonum_islid.txt” list, looking for a match between the Islandora identifier (stored in one of the OPEX identifiers mentioned in pax_metadata(), specifically the one labelled “code”) and the string containing the OPEX metadata.

Once a match is found, the archival object number is written to a variable and then inserted into some OPEX metadata, specifically the &lt;title&gt; and &lt;identifier&gt; fields; once again the &lt;identifier&gt; must have a label of “code”.

Finally, a file is opened and all of the OPEX XML data is written into it. While the directories in container have generally been named following the “Collection_Box_Folder_Date-Sequence-Number” format thus far, now they need to be renamed to the archival object number. As you might guess this is for the ArchivesSpace – Preservica synchronization to work correctly. For OPEX incremental ingest, the metadata about a directory is included inside the directory, with the same name of the directory and an “.opex” file extension.

write_opex_container_md()

def write_opex_container_md():
    project_log_hand = open(proj_path + '/' + 'project_log.txt', 'r')
    vars = project_log_hand.readlines()
    container = vars[1]
    container = container.strip()
    opex1 = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><opex:OPEXMetadata xmlns:opex="http://www.openpreservationexchange.org/opex/v1.0"><opex:Transfer><opex:Manifest><opex:Folders><opex:Folder>'
    opex2 = container
    opex3 = '</opex:Folder></opex:Folders></opex:Manifest></opex:Transfer></opex:OPEXMetadata>'
    container_opex_hand = open(proj_path + '/' + container + '/' + container + '.opex', 'w')
    container_opex_hand.write(opex1 + opex2 + opex3)
    print('Created OPEX metadata file for {} directory'.format(container))
    project_log_hand.close()
    container_opex_hand.close()

The last function in this series creates one more OPEX metadata file, this time for the entire container folder instead of just a single digital asset. In this case we sandwich the container folder name inside the OPEX &lt;Transfer&gt;&lt;Manifest&gt;&lt;Folders&gt;&lt;Folder&gt; tags, concatenate the three pieces, and write them out as our last OPEX file.

At this point, the container should be ready to drop into the OPEX incremental staging area and you can take a nap after all your hard work.