In my role as Repository Services Team Lead (née Digital Asset Management Lead) for the University of Rochester, I’m helping to orchestrate the creation of a new Digital Preservation Service (DPS), using Preservica as the primary digital preservation platform. To date, this has involved the creation of a seemingly endless number of policy and workflow documents, among them our Metadata Requirements Policy.
The intrepid metadata librarians at the University of Rochester have well-established and well-thought-out practices for descriptive metadata, and for how those fields crosswalk into MARC, MODS, Dublin Core, and EAD. There are a few metadata elements that we need to capture but which don’t fit as cleanly as we’d like into one or more of those four metadata standards/encoding schemas:
- Copyright status and licensing information
- Nature of the digital asset, born digital versus digitized
- Manual preservation events on the digital assets
MODS, for example, has a good way to handle born digital vs. digitized (with the <digitalOrigin> tag), but EAD does not. After going round and round a bit, we decided to break all of the above bulleted items out from the descriptive metadata entirely and instead create records in PREMIS to handle administrative metadata, while leaving the descriptive metadata alone without trying to shoehorn things into it that didn’t really belong.
In this post I’ll describe the different sections of what we expect to be a fairly typical PREMIS record as it will be imported into Preservica, as well as go over the Python script that I used to generate the record. Below is a sample record in full:
WordPress doesn’t do a great job of making the XML all that legible, so instead I’ve provided a PDF conversion from Notepad++, which seemed to be more readable. I’m not going to go into a great deal of detail about specifically how to encode in PREMIS, but will talk more about the high-level subsections of the record and how we used them to satisfy our needs locally. The PREMIS data dictionary packs a wonderful amount of information into it.
PREMIS: Object
The object section of the record provides the context for which Asset in Preservica the PREMIS record is describing, as well as what the type of object is to begin with. The xsi:type="premis:intellectualEntity" value draws from a controlled vocabulary maintained by LC. In this case we are always mapping to a Preservica Asset, meaning the entire intellectual object (e.g. a photograph, book, map, article, oral history, etc.) and not the different representations, files, or bitstreams. Thus for our purposes, the type will always be premis:intellectualEntity. Preservica itself uses the same controlled vocabulary in its data model so this corresponds nicely for our purposes.
The Identifier and Identifier Type correspond to the Ref UUID that is created by Preservica when creating a digital asset. This section really just serves to provide the context for what digital asset it is that we’re talking about. This section is somewhat duplicative because each of the Rights and Events subsections of the record also incorporates the Preservica asset identifier. However, I think having the object type picked out (as above) is useful to help disambiguate the record should it be found divorced of any context.
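As a rough illustration of the Object section described above, here is a minimal sketch that builds the element with Python's standard library. It assumes the PREMIS 3 namespace, and the identifier type label "Preservica Ref UUID" and the function name build_object are placeholders of mine, not values taken from our actual records.

```python
import xml.etree.ElementTree as ET

# Assumption: PREMIS 3 namespace; adjust if your records target another version.
PREMIS_NS = "http://www.loc.gov/premis/v3"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
ET.register_namespace("premis", PREMIS_NS)
ET.register_namespace("xsi", XSI_NS)


def build_object(asset_ref, identifier_type="Preservica Ref UUID"):
    """Build a premis:object typed as an intellectual entity.

    asset_ref is the Ref UUID Preservica assigned to the asset.
    identifier_type is a hypothetical label; use whatever your policy names it.
    """
    obj = ET.Element(f"{{{PREMIS_NS}}}object")
    # The type is always intellectualEntity because we describe whole Assets.
    obj.set(f"{{{XSI_NS}}}type", "premis:intellectualEntity")
    ident = ET.SubElement(obj, f"{{{PREMIS_NS}}}objectIdentifier")
    ET.SubElement(ident, f"{{{PREMIS_NS}}}objectIdentifierType").text = identifier_type
    ET.SubElement(ident, f"{{{PREMIS_NS}}}objectIdentifierValue").text = asset_ref
    return obj


if __name__ == "__main__":
    # Made-up UUID purely for demonstration.
    demo = build_object("11111111-2222-3333-4444-555555555555")
    print(ET.tostring(demo, encoding="unicode"))
```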
PREMIS: Rights
The Rights section packs in a lot of information. Each Rights Statement section gets its own identifier, autogenerated by the Python script that creates these PREMIS records. PREMIS calls for a controlled vocabulary to be used when discussing issues of intellectual property rights, and the Basis you choose then dictates the rest of the fields included in the Rights subsection. For instance in our example here, we have chosen “copyright” as the Rights Basis, which means we also then fill in information related to the status of the copyright (in the public domain), the jurisdiction in which this applies (the United States), and the date when this determination was made. A Copyright Note field becomes useful as a free text field to include any additional narrative description or context. Finally, this Rights subsection gets tied back to the digital asset in question via the Linking Object Identifier, by including the Preservica Ref UUID here as well.
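To make the copyright case concrete, here is a minimal sketch of a Rights fragment filled in with str.format. The element names follow PREMIS 3 as I understand it; the function name, the "Preservica Ref UUID" label, and the example values are assumptions for illustration, not a copy of our production template.

```python
import uuid
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Assumption: PREMIS 3 element names and namespace.
COPYRIGHT_RIGHTS_TEMPLATE = """\
<premis:rights xmlns:premis="http://www.loc.gov/premis/v3">
  <premis:rightsStatement>
    <premis:rightsStatementIdentifier>
      <premis:rightsStatementIdentifierType>UUID</premis:rightsStatementIdentifierType>
      <premis:rightsStatementIdentifierValue>{statement_id}</premis:rightsStatementIdentifierValue>
    </premis:rightsStatementIdentifier>
    <premis:rightsBasis>copyright</premis:rightsBasis>
    <premis:copyrightInformation>
      <premis:copyrightStatus>{status}</premis:copyrightStatus>
      <premis:copyrightJurisdiction>{jurisdiction}</premis:copyrightJurisdiction>
      <premis:copyrightStatusDeterminationDate>{determined}</premis:copyrightStatusDeterminationDate>
      <premis:copyrightNote>{note}</premis:copyrightNote>
    </premis:copyrightInformation>
    <premis:linkingObjectIdentifier>
      <premis:linkingObjectIdentifierType>{id_type}</premis:linkingObjectIdentifierType>
      <premis:linkingObjectIdentifierValue>{asset_ref}</premis:linkingObjectIdentifierValue>
    </premis:linkingObjectIdentifier>
  </premis:rightsStatement>
</premis:rights>
"""


def copyright_rights(asset_ref, status, jurisdiction, determined, note,
                     id_type="Preservica Ref UUID"):
    """Return a Rights fragment whose basis is copyright, as an XML string."""
    values = {
        "statement_id": str(uuid.uuid4()),  # autogenerated statement identifier
        "status": status,
        "jurisdiction": jurisdiction,
        "determined": determined,
        "note": note,
        "id_type": id_type,
        "asset_ref": asset_ref,
    }
    # Escape &, < and > so free-text notes cannot break the XML.
    return COPYRIGHT_RIGHTS_TEMPLATE.format(
        **{key: escape(value) for key, value in values.items()}
    )


if __name__ == "__main__":
    # All values below are invented examples.
    print(copyright_rights("abc-123", "publicdomain", "us", "2023-01-15",
                           "Determination recorded by curator; see file & notes"))
```

Escaping the values before formatting matters mostly for the free-text note field, where an ampersand or angle bracket would otherwise produce a record that does not parse.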
However, the Basis could also be “license,” which involves building the record in a fairly different fashion. In that case we have the option to specify a documentation Identifier Type and Identifier Value, for which we could include the name and URI for a Creative Commons license. The License Note field is analogous to the Copyright Note field and provides an opportunity to include narrative information and context. The License Terms field can be used to link out to the legal text URI for the Creative Commons license as well.
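The license variant swaps the copyright block for a license block. The sketch below builds only that inner block, assuming PREMIS 3 element names; the function and helper names are mine, and the Creative Commons URIs in the demo are just example values.

```python
import xml.etree.ElementTree as ET

# Assumption: PREMIS 3 namespace and element names.
PREMIS_NS = "http://www.loc.gov/premis/v3"
ET.register_namespace("premis", PREMIS_NS)


def _child(parent, name, text=None):
    """Append a PREMIS-namespaced child element, optionally with text."""
    element = ET.SubElement(parent, f"{{{PREMIS_NS}}}{name}")
    if text is not None:
        element.text = text
    return element


def build_license_information(doc_id_type, doc_id_value, terms, note):
    """Build the licenseInformation block used when the Rights Basis is license."""
    info = ET.Element(f"{{{PREMIS_NS}}}licenseInformation")
    doc = _child(info, "licenseDocumentationIdentifier")
    _child(doc, "licenseDocumentationIdentifierType", doc_id_type)
    _child(doc, "licenseDocumentationIdentifierValue", doc_id_value)
    _child(info, "licenseTerms", terms)   # e.g. a link to the legal text
    _child(info, "licenseNote", note)     # free-text context
    return info


if __name__ == "__main__":
    # Example values only; substitute the license your collection actually uses.
    block = build_license_information(
        "URI",
        "https://creativecommons.org/licenses/by/4.0/",
        "https://creativecommons.org/licenses/by/4.0/legalcode",
        "License applied to the whole series at the donor's request.",
    )
    print(ET.tostring(block, encoding="unicode"))
```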
“Copyright” and “license” are likely to be two of the most commonly used Rights Basis options available to us, though “statute” might play a part as well when sharing materials online as part of a fair use calculation.
PREMIS: Events
The Event section accomplishes two major goals for us: it allows for the specification of born digital versus digitized digital assets, and it provides a place to record digital preservation actions that are not captured automatically by software. Similar to the Rights section above, the Python script automatically assigns a UUID identifier to each Event section of the PREMIS record. We then use the “creation” Event Type as the mandatory (in our local use) section that must be present in each record in order to specify born digital versus digitized. This is accomplished by providing context from the rest of the fields in the Event section. The Event Date Time section lets you specify when the digital asset was created, and the Event Detailed Information section allows the specification of born digital versus digitized. In this case we are drawing from an LC controlled vocabulary used in MODS. The Event section also allows the specification of the agents responsible for the event, so in the case of a born digital asset, we can specify the original creator (who perhaps donated their papers to our archives), whereas if the material was digitized in house or via a vendor, that information can also be captured here, by including names of companies, staff members and library departments.
The event section can also use all of the above fields for other types of actions, for instance if we transfer digital assets from a local storage solution to Preservica, or for past manual format migrations that were carried out. In this latter case, these had been shoehorned into MODS note fields, and trying to find a better solution for that is what kicked off this entire process for us.
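Here is a minimal sketch of the mandatory “creation” event, again assuming PREMIS 3 element names. The digital origin values are the ones MODS defines for digitalOrigin; the agent identifier type "name", the role strings, and the function names are placeholders I made up for the example.

```python
import uuid
import xml.etree.ElementTree as ET

# Assumption: PREMIS 3 namespace and element names.
PREMIS_NS = "http://www.loc.gov/premis/v3"
ET.register_namespace("premis", PREMIS_NS)

# Values defined for the MODS digitalOrigin element.
DIGITAL_ORIGINS = {
    "born digital",
    "reformatted digital",
    "digitized microfilm",
    "digitized other analog",
}


def _child(parent, name, text=None):
    """Append a PREMIS-namespaced child element, optionally with text."""
    element = ET.SubElement(parent, f"{{{PREMIS_NS}}}{name}")
    if text is not None:
        element.text = text
    return element


def build_creation_event(asset_ref, created, digital_origin, agents=(),
                         id_type="Preservica Ref UUID"):
    """Build a creation event; agents is an iterable of (name, role) pairs."""
    if digital_origin not in DIGITAL_ORIGINS:
        raise ValueError(f"unknown digital origin: {digital_origin!r}")
    event = ET.Element(f"{{{PREMIS_NS}}}event")
    ident = _child(event, "eventIdentifier")
    _child(ident, "eventIdentifierType", "UUID")
    _child(ident, "eventIdentifierValue", str(uuid.uuid4()))
    _child(event, "eventType", "creation")
    _child(event, "eventDateTime", created)
    detail = _child(event, "eventDetailInformation")
    _child(detail, "eventDetail", digital_origin)
    for name, role in agents:
        link = _child(event, "linkingAgentIdentifier")
        _child(link, "linkingAgentIdentifierType", "name")  # hypothetical type label
        _child(link, "linkingAgentIdentifierValue", name)
        _child(link, "linkingAgentRole", role)
    target = _child(event, "linkingObjectIdentifier")
    _child(target, "linkingObjectIdentifierType", id_type)
    _child(target, "linkingObjectIdentifierValue", asset_ref)
    return event


if __name__ == "__main__":
    demo = build_creation_event("abc-123", "2019-05-01", "reformatted digital",
                                agents=[("Backstage Library Works", "digitizer")])
    print(ET.tostring(demo, encoding="unicode"))
```

The same function shape would work for other event types, such as a transfer or a manual format migration, by making the event type a parameter and dropping the digital origin check.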
Python Script
https://github.com/rochester-rcl/premis-generator
The Python script to create these PREMIS records is pretty simple and only makes use of a couple of modules from the Python standard library. The general workflow idea is to ingest the digital assets as normal into Preservica, then manually identify different groupings of digital assets that will all need the same sort of record. For instance, “this entire series is A/V and was digitized by Backstage Library Works.” Grouping the assets this way makes it easy to use the pyPreservica library to identify the assets in question and spit out a list of Preservica Ref UUIDs.
These IDs can then be entered into a spreadsheet which has enough columns for all of the variables mentioned above: agents, event types, copyright information, etc. The hope is that this doesn’t become particularly labor intensive in the way that a lot of spreadsheet-based metadata creation can be. Ideally we have an entire series that was digitized in-house, or by the same vendor, or is from the same time period with the same copyright status, or was uniformly assigned the same Creative Commons license. Then it is simply a matter of entering a value once and filling it down the spreadsheet. I do realize that this probably sounds naïve and that just about anyone out there has a dozen examples of wildly heterogeneous collections with wildly inconsistent metadata; you gotta start some place though.
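The "enter once and fill down" step can also be done in code when the list of Ref UUIDs is already in hand. This is a small sketch with hypothetical column names (asset_ref, event_type, and so on); the real spreadsheet's columns may differ.

```python
import csv
import io


def fill_down(asset_refs, shared_values):
    """Return one row per asset, repeating the shared values on every row."""
    return [{"asset_ref": ref, **shared_values} for ref in asset_refs]


def write_rows(rows, handle):
    """Write the rows as CSV with a header line; handle is any text file object."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Invented identifiers and values, for demonstration only.
    shared = {"event_type": "creation", "event_date": "2019-05-01",
              "digital_origin": "reformatted digital"}
    buffer = io.StringIO()
    write_rows(fill_down(["abc-123", "def-456"], shared), buffer)
    print(buffer.getvalue())
```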
The script itself is essentially just a PREMIS template (stored in a multi-line string variable) that uses the str.format method to pass in values from the spreadsheet, then writes the resulting string to an XML file. The only other data passed into the template comes from the uuid module, which auto-generates the Rights and Events section identifiers.
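The overall shape of that approach looks roughly like the sketch below. It is not the script from the repository: the template is deliberately abbreviated (no Rights section, so not a complete record), the spreadsheet is assumed to be exported as CSV, and the column names asset_ref, event_type, and event_date are hypothetical.

```python
import csv
import uuid
from pathlib import Path
from xml.sax.saxutils import escape

# Abbreviated template for illustration; assumes PREMIS 3 element names.
RECORD_TEMPLATE = """\
<?xml version="1.0" encoding="UTF-8"?>
<premis:premis xmlns:premis="http://www.loc.gov/premis/v3"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="3.0">
  <premis:object xsi:type="premis:intellectualEntity">
    <premis:objectIdentifier>
      <premis:objectIdentifierType>Preservica Ref UUID</premis:objectIdentifierType>
      <premis:objectIdentifierValue>{asset_ref}</premis:objectIdentifierValue>
    </premis:objectIdentifier>
  </premis:object>
  <premis:event>
    <premis:eventIdentifier>
      <premis:eventIdentifierType>UUID</premis:eventIdentifierType>
      <premis:eventIdentifierValue>{event_id}</premis:eventIdentifierValue>
    </premis:eventIdentifier>
    <premis:eventType>{event_type}</premis:eventType>
    <premis:eventDateTime>{event_date}</premis:eventDateTime>
  </premis:event>
</premis:premis>
"""


def generate_records(csv_path, out_dir):
    """Write one XML file per spreadsheet row, named after the asset's Ref UUID."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Escape spreadsheet values so stray & or < cannot break the XML.
            values = {key: escape(value) for key, value in row.items()}
            # The only value not taken from the spreadsheet.
            values["event_id"] = str(uuid.uuid4())
            target = out_dir / f"{row['asset_ref']}.xml"
            target.write_text(RECORD_TEMPLATE.format(**values), encoding="utf-8")
            written.append(target)
    return written
```

Calling generate_records("records.csv", "premis_out") would leave one file per row in premis_out, each named after its Ref UUID, which is what the attachment step relies on.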
Each PREMIS record is named after the Preservica Ref UUID, so it should be simple to write a pyPreservica script that loops through a folder of files, splits each file name on the period, looks up the digital asset via the file name, and then adds the metadata fragment to the digital asset.
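A sketch of that loop might look like the following. The folder walk is plain standard library; the Preservica part is kept behind a small callable so the loop can be tried without a server. The pyPreservica calls shown (EntityAPI, asset, add_metadata) reflect my reading of that library's documentation and should be treated as assumptions to check against the version you have installed, as should the schema URI.

```python
from pathlib import Path


def attach_records(folder, attach):
    """Call attach(asset_ref, path) for every .xml file in folder.

    The asset reference is the part of the file name before the first period.
    Returns the list of references that were processed.
    """
    attached = []
    for path in sorted(Path(folder).glob("*.xml")):
        asset_ref = path.name.split(".")[0]
        attach(asset_ref, path)
        attached.append(asset_ref)
    return attached


def make_preservica_attach(schema_uri="http://www.loc.gov/premis/v3"):
    """Return an attach callable backed by pyPreservica (third-party library).

    Assumption: EntityAPI() picks up credentials from the environment or a
    properties file, and add_metadata accepts an open file handle.
    """
    from pyPreservica import EntityAPI  # imported lazily; not needed for a dry run

    client = EntityAPI()

    def attach(asset_ref, path):
        asset = client.asset(asset_ref)
        with open(path, "r", encoding="utf-8") as handle:
            client.add_metadata(asset, schema_uri, handle)

    return attach


if __name__ == "__main__":
    # Dry run: print what would be attached instead of contacting Preservica.
    attach_records("premis_out", lambda ref, path: print(ref, "<-", path))
```

Passing the attach step in as a function makes a dry run cheap, which seems worth having before writing metadata onto a large batch of assets.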
PREMIS is definitely interesting to work with because of how utterly flexible it is. Whether records are human or machine generated, there is a whole lot of information that can be packed into it, and it was fun to explore options here. We’re hoping that creating some administrative metadata here will help our collection stay well contextualized over the long term.
