At my organization we are in a lengthy process to set up a digital preservation service and naturally, during that process one issue that has been tackled is file-naming best practices. Since this is something that I’ve been immersed in lately, I had the thought to perhaps see about writing an article on the topic, maybe even including a survey of practices at other organizations. Or at least until I read the very first article I had flagged in my literature review: What’s in a Name? On ‘Meaningfulness’ and Best Practices in Filenaming within the LAM Community by Drew Krewer and Mary Wahl in code4lib Journal (2018). Krewer and Wahl have done a stellar job of investigating the topic, and I didn’t really feel I could add anything by writing something similar so I figured I’d try a different tack.
The article mentions how practitioners out in the community are (1) always eager to see documented practices happening at other institutions and (2) generally happy to share what they themselves are doing. So instead, in that spirit, I’m opting to share in more depth the particular file-naming scheme that we are in the process of implementing at my institution, along with the reasoning behind why we are pursuing this particular strategy. Unsurprisingly, the conventions that we find most important at the University of Rochester largely correspond with the “most frequently” and “frequently” mentioned conventions in Krewer and Wahl’s article, with one major exception. In the “occasionally mentioned” section, the very last item is:
Do not use values that could change: Using values that may change over time within a filename would create unnecessary retroactive work in the future.
Addressing this last issue is central to the file-naming guidelines that we’ve created, which were written with longevity in mind: matters such as systems migrations need to be both possible and sustainable. The goals as we laid them out are to create file names which…
- …provide useful information
- …are created consistently
- …can be interacted with programmatically
- …don’t require constant updating
- …aren’t tied to a particular software platform
Of somewhat less prominence is to have file names that are as brief as possible, and to have some degree of human readability. In our case, human readability will generally translate to providing a minimal amount of information that instead points users to where more information about the asset can be retrieved, for instance in documentation or metadata management systems. The overall structure looks like:
| Data | Delimiter | Data | Delimiter | Data | Delimiter | Data | Delimiter | Data |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Organizational Unit | _ (underscore) | Identifier | _ (underscore) | Multi-Part Item | - (dash) | Sequence Number | . (period) | File Extension |
Each of the bolded “Data” elements will be discussed in more detail.
Organizational Unit
Every unit within the institution is given a unique identifier that corresponds with it. This is a combination of letters and numbers, where the letters indicate the parent unit and the number indicates a subunit. For instance “RCL003” references the Rare Books, Special Collections, and Preservation department (003) inside the River Campus Libraries (RCL). Each of the top-level academic/parent units gets a three-letter abbreviation or acronym to pick it out, such as “MED” to represent the medical center, or “ESM” to represent the Eastman School of Music. The numerical identifier is zero-padded to allow for proper sorting, in anticipation that we could potentially hit 10 different subunits in a single parent unit, but are certainly not likely to hit 100. The numerical portion was generated randomly at first (to prevent any indication that the number is directly tied with stature inside the organization) and will now be assigned in the order in which we engage with projects from new subunits, incrementing by one for each.
An earlier draft of the above convention took this a step further actually, by using random letters instead of the three-letter abbreviation/acronym. They were all drawn from the word “ROCHESTER” (again, leaving out the “R” to prevent any unintended indication of stature), and made for a truly symbolic representation of the subunit. For instance in the above example of “RCL003” it would have instead been, for example “C003.” Without the documentation this would have been essentially meaningless, but if you understood the schema, it did allow for human readability.
The idea here, in either the early unused draft or the current guidelines, is that this portion of the filename can withstand a re-org. If the Rare Books, Special Collections, and Preservation department gets a massive financial donation and ends up being renamed to the “Jones Family Special Collections Library,” then instead of potentially updating the “RBSCP” acronym embedded in the file names of an interminable number of digital assets, we instead only have to log the name change in the guidelines themselves, indicating that whereas “RCL003” used to indicate one thing, at such and such date, it now indicates another. A big shout-out to former colleague Stephen Alexander who proposed this idea.
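To make the "log the name change, leave the file names alone" idea concrete, here is a minimal sketch of what such a log could look like as a data structure. This is not how our guidelines are actually stored; `UNIT_HISTORY` and `unit_name` are hypothetical names, the rename is the imagined one from the paragraph above, and both dates are placeholders made up purely for illustration.

```python
from datetime import date

# Hypothetical change log: unit code -> list of (effective date, name).
# The dates are placeholders, and the rename itself is the imagined one
# from the text, not a real event.
UNIT_HISTORY = {
    "RCL003": [
        (date(2000, 1, 1), "Rare Books, Special Collections, and Preservation"),
        (date(2030, 1, 1), "Jones Family Special Collections Library"),
    ],
}


def unit_name(code, on):
    """Return the name that a unit code referred to on a given date."""
    name = None
    for effective, label in sorted(UNIT_HISTORY[code]):
        if effective <= on:
            name = label
    if name is None:
        raise LookupError(f"{code} has no recorded name on {on}")
    return name


print(unit_name("RCL003", date(2025, 6, 1)))
print(unit_name("RCL003", date(2031, 6, 1)))
```

The point of the sketch is that a rename only ever adds a row to the log; nothing embedded in a file name has to change.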
Now of course, this isn’t perfectly resistant to organizational realignment, because if the River Campus Libraries gets broken up and the various libraries themselves are divvied up to other colleges and departments at the University of Rochester (or some other massive upheaval to the way the university is organized), that presents a serious problem for our file names. The hope here is that while changes at the departmental level may be expected somewhat frequently when thinking in the time frames of digital preservation, at the highest levels of the university the org chart should hopefully stay somewhat more consistent. Ultimately this was a compromise solution to embed a hair more human readability into the filenames and make them slightly less opaque. I still worry about whether or not this was the right choice, but as digital preservation is risk management, this course of action I think qualifies as “relatively low risk.”
The somewhat maddening issue here is that while we are trying to build file names that exist as perfect Platonic forms, above particular software platforms and independent of anything but best practices, this isn’t ever truly possible. In reality, “RCL001,” “RCL003,” and “RCL004” all represent the Rare Books, Special Collections, and Preservation department at the River Campus Libraries. Why? Because of one of the primary software platforms at play here: ArchivesSpace.
Identifier and Sequence Number
The reason that three separate indicators all point to the same organizational unit has to do with how we generate identifiers for individual digital assets in ArchivesSpace. The identifier is the second component to the filename and is used to create a point of commonality everywhere it is needed: in the file name of the digital asset itself, in the metadata management system, and in the digital asset management system. Specifically thinking of the metadata management systems at the University of Rochester, this means ArchivesSpace and Alma. One thing to note briefly is that the organizational unit indicator and the identifier are the only two mandatory components that every single file name will contain, the other two elements being optional.
At this point in time, the workflow we have established for creating identifiers is only well-developed for ArchivesSpace, and makes heavy use of preexisting software including a wonderful plugin developed by NYU that was created to accommodate creating digitization work orders in ArchivesSpace. This plugin allows a back-end user (it is not intended for the public to request things to be digitized after using a finding aid) to select item level objects inside of a finding aid, and generate a spreadsheet that can be passed along to a digitization lab to track what needs to be reformatted. Crucially, this plugin leverages the Component Unique Identifier field in ArchivesSpace and generates an identifier to put in the field. The value of the identifier is simply “cuid” plus an incrementing integer that begins with zero. Thus the first eleven CUIDs generated by the plugin would be:
- cuid0
- cuid1
- cuid2
- cuid3
- cuid4
- cuid5
- cuid6
- cuid7
- cuid8
- cuid9
- cuid10
Then, the plugin exports a spreadsheet that can include a wealth of data from ArchivesSpace:
- Resource ID for the collection
- Ref ID for the item
- URI, including the archival object number (more on this later)
- Container information
- Title
- Component Unique Identifier (generated by the plugin)
- Series and subseries name
- Dates
- Barcode (not used in our implementation)
I then use a short Python script to concatenate the organizational unit indicator for the department holding the collection with the CUID generated by the plugin, and put the output in a new column of the spreadsheet. For simple digital objects, this may be the entirety of the filename, but for any complex digital objects (multi page books, photographs where both recto and verso are imaged, etc.) a sequence number will be added as well (more on this element later). At this point a box full of archival materials can be handed to a digitization lab and this spreadsheet can be emailed to them, and now the lab will have all they need to be able to create digital assets with file names that conform to our requirements. So including a sequence number, some sample file names could be:
- RCL003_cuid0-00001.jpg
- RCL003_cuid172-00385.tif
- RCL003_cuid9175-08643.jp2
While the organizational unit indicator and the sequence number both have a static amount of zero-padding (a total of three digits and five digits, respectively), the value of the number in the CUID has no zero-padding. As we are working within the constraints of a piece of software not under our control, this element of inconsistency is another one of the compromises that the file naming convention is making, for better or worse.
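For anyone who wants to see the assembly step spelled out, below is a minimal sketch of how the pieces could be joined. It is not the actual script I use on the work order spreadsheet; `build_file_name` and its parameters are hypothetical names, and the only rules it encodes are the ones described above: underscore delimiters, a dash before a five-digit zero-padded sequence number, and a CUID passed through untouched.

```python
def build_file_name(org_unit, identifier, extension, sequence=None, multipart=None):
    """Assemble unit_identifier[_multipart][-sequence].extension."""
    name = f"{org_unit}_{identifier}"
    if multipart is not None:
        # Optional multi-part item indicator, delimited by an underscore.
        name += f"_{multipart}"
    if sequence is not None:
        # Sequence numbers are zero-padded to five digits; the CUID is not padded.
        name += f"-{sequence:05d}"
    return f"{name}.{extension}"


print(build_file_name("RCL003", "cuid0", "jpg", sequence=1))
# RCL003_cuid0-00001.jpg
print(build_file_name("RCL003", "cuid172", "tif", sequence=385))
# RCL003_cuid172-00385.tif
```

Leaving out `sequence` gives the bare two-element name that suffices for a simple digital object.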
Now finally to bring it back to where this section first started: why Rare Books, Special Collections, and Preservation has three separate organizational unit indicators. The NYU plugin creates CUIDs for each repository that exists in ArchivesSpace, but it starts off with “cuid0” for each separate repository. That is, it’s possible, and quite likely, to have duplicate identifiers. To ensure that the file names are unique, they need to be differentiated by the organizational unit indicator. The Rare Books, Special Collections, and Preservation department utilizes three different repositories of finding aids and accession records for the different major areas under their purview:
- Rare Books, Special Collections, and Preservation
  - University Archives
    - ArchivesSpace Repository: RBSCP-UA
    - Sample File Name: RCL001_cuid0-00001.jpg
  - Historic Manuscripts
    - ArchivesSpace Repository: RBSCP-HM
    - Sample File Name: RCL003_cuid0-00001.jpg
  - Rare Books
    - ArchivesSpace Repository: RBSCP-BK
    - Sample File Name: RCL004_cuid0-00001.jpg
Crucially, all three of those sample file names are totally valid in this convention, even though they reuse identifiers. Ultimately, the organizational unit indicator in conjunction with the CUID form the truly unique identifier in practice.
Thus, this doesn’t present much of a problem, and if anything it is easy to imagine a re-org that might split some of these sections off into their own departments (e.g. University Archives becomes its own department rather than a subsection of RBSCP), so there is the possibility that this even saves headaches down the line. However, this is a great example of how, even though we wanted to build a file naming convention devoid of dependencies on any particular software platform as one of our explicitly stated goals, this is a lot easier said than done. Features of both ArchivesSpace and the NYU plugin for that platform end up featuring pretty heavily in our guidelines, even if that data should still be portable to other systems should a migration occur.
A big shout out to Sarit Hand for clueing me in to the existence of the NYU Plugin. One conversation with her about it was what I needed to unstick a whole mess of problems I was having, which were largely related to Preservica.
Preservica and the Archival Object Number
One of the overarching goals of our digital preservation service is to have dedicated metadata management systems (ArchivesSpace and Alma) and dedicated digital asset management systems. For the University of Rochester, Preservica is the primary system for the latter. It has robust support for attaching metadata to the records it stores, and metadata can even be created in the system but that experience is not particularly feature rich, hence the desire to simply import it from systems of record for metadata and then keep it in sync over time. Network storage is the other primary digital asset management system in use, but that is decidedly not feature-rich, and doesn’t really enter this conversation.
Preservica is capable of keeping metadata in sync with ArchivesSpace automatically, so that changes made in ArchivesSpace are automatically imported over to the records in Preservica on a scheduled basis. This prevents the dreaded situation where metadata ends up “drifting,” and is no longer in sync across systems. In order to facilitate this, Preservica needs the Archival Object Number from ArchivesSpace. As luck would have it, this can be found at the end of the URI that is included in the work order spreadsheet generated by the NYU plugin.
This particular identifier, the Archival Object Number, is otherwise difficult to programmatically export from ArchivesSpace. Early, small pilot projects that I’ve been doing to test the ingest process into Preservica actually saw me copy-pasting the Archival Object Numbers into a text file that could be incorporated into my Python scripts, not exactly something I’m eager to make standard practice (once again, thank you Sarit!). Also, I am not an ArchivesSpace expert, or even amateur really, and have only really worked with the system in the context of getting it to talk amiably with Preservica. In other words, if there are easier ways to get all those Archival Object Numbers out of ArchivesSpace, and you are screaming them at your monitor currently, I just didn’t know about them.
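Pulling the number off the end of the URI is simple enough to sketch. The example below assumes the URI column holds values shaped like `/repositories/2/archival_objects/1234`; the repository and object numbers are made up, and `archival_object_number` is a hypothetical helper rather than anything provided by ArchivesSpace, Preservica, or the NYU plugin.

```python
def archival_object_number(uri):
    """Return the trailing number of an archival object URI as an integer."""
    head, _, tail = uri.rstrip("/").rpartition("/")
    # Refuse anything that does not look like .../archival_objects/<number>.
    if not head.endswith("archival_objects") or not tail.isdecimal():
        raise ValueError(f"not an archival object URI: {uri!r}")
    return int(tail)


print(archival_object_number("/repositories/2/archival_objects/1234"))
# 1234
```

Raising an error on anything unexpected is deliberate here: a silently wrong number would link a Preservica record to the wrong description.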
Now, none of this has anything to do with how to name files necessarily, but as we talk about Alma, the other metadata management system in use, as well as the next steps, this context will be useful.
Alma and the Multipart Indicator
The last element of the file naming convention that we have yet to talk about is the multipart item indicator. This is an optional element of the file names that can be used in situations where the identifier is referring to a collection of digital assets, rather than a single one. The need for this element of the convention became apparent as we began to think through how the convention would work with our other metadata management system, Alma, and its related discovery platform, Primo.
The initial thought was to generate identifiers for item-level records of materials in Alma, just as we are doing in ArchivesSpace. This however ran into a snag pretty quickly when we started to think of things that have enumeration/chronology issues; essentially things that are, or act like, serials. If you search for an article in Primo, it then kicks you out to the journal system via EZproxy (or similar), so that if you find an article stored in AGRICOLA, you are then directed to the database (generally speaking) and the specific article you want. If each of the individual items had an identifier, which corresponds to a Portfolio in the Alma data model for electronic resources, then when you search for something serial-esque that has been described in Alma and ingested into Preservica, you will then find as many links as there are individual resources, a very different user experience than what one would expect. Instead of being faced with one link to the journal database, you are faced with potentially hundreds of links to individual items.
Instead, it was desired to link out to the grouping of all the different assets together in aggregate. In terms of Preservica this will mean providing a link in Alma/Primo to the Preservica folder instead of to the Preservica item. Say for instance we have an undergraduate research journal published locally at the university; the filenames for these assets might look like:
- RCL001_cuid45_v1n1.pdf
- RCL001_cuid45_v1n2.pdf
- RCL001_cuid45_v1n3.pdf
Or perhaps the yearbooks of the university:
- RCL001_cuid9_1907-00001.tif
- RCL001_cuid9_1908-00001.tif
- RCL001_cuid9_1909-00001.tif
- RCL001_cuid9_1947-00001.tif
Once again we see an instance where the perfect form of the filename needed to compromise in order to handle the reality of the situation at hand and the limitations and behaviors of the systems whose constraints we operate within. The enumeration/chronology/multipart indicators are essentially free-text, with an emphasis on making them clear, compact, and consistent. Developing something like a controlled vocabulary for these is a long term goal as we undertake projects to increase consistency.
Additionally, for both serial and monographic records a new instance of the record is spawned to represent the digital asset. The record representing the physical asset isn’t doing double duty to represent the digital asset as well. Or at least all of this is the intention anyways, more on that in discussion of future developments in the next section.
While the multipart indicator has only been discussed in the context of Alma, there is no reason that it couldn’t be used in an ArchivesSpace context as well, hypothetically. In the interests of making file names as brief and consistent as possible however, this particular element of the convention should only be used if absolutely necessary. So you could have a letter described in a finding aid with file names:
- RCL003_cuid488_letter-00001.tiff
- RCL003_cuid488_letter-00002.tiff
- RCL003_cuid488_letter-00003.tiff
- RCL003_cuid488_letter-00004.tiff
- RCL003_cuid488_envelope-00001.tiff
- RCL003_cuid488_envelope-00002.tiff
However instead it would be better practice to create briefer and more consistent file names that look like:
- RCL003_cuid488-00001.tiff
- RCL003_cuid488-00002.tiff
- RCL003_cuid488-00003.tiff
- RCL003_cuid488-00004.tiff
- RCL003_cuid488-00005.tiff
- RCL003_cuid488-00006.tiff
Future Developments
There is a key difference in our workflows related to ArchivesSpace and Alma, however: our workflows for ArchivesSpace exist, whereas our workflows for Alma are entirely theoretical. The reality of the situation is that we were able to leverage pre-existing software in order to facilitate the full workflow of digital asset ingest and metadata synchronization because of the existence of the NYU plugin and the software Preservica developed to keep metadata in sync between itself and ArchivesSpace. The only thing we did locally was get the pieces in place and devise a workflow to take advantage of them.
When it comes to replicating that with Alma, it’s going to involve local software development in our institution to make that work. The hope is to build software that can:
1. Spawn a bibliographic record in Alma to represent the digital object
   - This is analogous to the Preservica-ArchivesSpace integration automatically spawning a Digital Object in ArchivesSpace for the resource
2. Generate a local identifier, insert the identifier into a MARC 099 field, and log the identifier in a database
   - This is analogous to the NYU ArchivesSpace plugin automatically generating CUIDs and inserting them into item level records
3. Insert the Preservica URIs into the spawned MARC records for the digital assets
   - This is analogous to the Preservica-ArchivesSpace integration automatically inserting the back-end and publicly facing URIs for Preservica assets into the created Digital Objects from (1.)
4. Update metadata records in Preservica when changes to the MARC metadata are detected in Alma
   - This is analogous to the Preservica-ArchivesSpace integration updating the EAD metadata stored in Preservica on a scheduled daily basis
This is a tremendous amount of local development work that needs to be accomplished. At the time of this writing we’re embarking on some real, genuine pilot projects (as opposed to the projects that have been more about testing to date) that all involve resources described in ArchivesSpace. The hope is to use these projects to discover pain points in production workflows and take that information, as well as the workflow models, to inform the local software development that will be needed to replicate the ArchivesSpace-Preservica workflows in the Alma-Preservica workflows.
For the sake of consistency, we’re anticipating having the generated identifier for materials described in Alma follow a similar convention to the one being used by the NYU Plugin. Instead of having “cuid#” the intention is to formulate the identifier as “ils#”. When looking at the file name then, a user will know to go to ArchivesSpace if they see the “cuid” prefix in the filename, and go to Alma if they see “ils.” Past that, they should function identically, with a non-zero-padded integer being assigned, ticking up by one for each record, and starting with “0.”
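As a rough illustration of how that prefix could be used programmatically, here is a small sketch that maps an identifier to the system where its metadata lives. Keep in mind that the “ils” half of this is still only an intention on our part; `SYSTEM_BY_PREFIX`, `IDENTIFIER`, and `metadata_system` are hypothetical names invented for the example.

```python
import re

# Hypothetical lookup table: identifier prefix -> metadata management system.
SYSTEM_BY_PREFIX = {"cuid": "ArchivesSpace", "ils": "Alma"}

# A prefix followed by a non-zero-padded integer that may be 0.
IDENTIFIER = re.compile(r"(cuid|ils)(0|[1-9][0-9]*)")


def metadata_system(identifier):
    """Return the name of the system that describes this identifier."""
    match = IDENTIFIER.fullmatch(identifier)
    if match is None:
        raise ValueError(f"unrecognized identifier: {identifier!r}")
    return SYSTEM_BY_PREFIX[match.group(1)]


print(metadata_system("cuid172"))
# ArchivesSpace
print(metadata_system("ils0"))
# Alma
```

The pattern rejects zero-padded values such as “cuid007” on purpose, since neither identifier is meant to be padded.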
Thinking through issues of sustainability and systems migration, the storage point for both identifiers is meant to be such they will be quite portable between systems. The CUID field in ArchivesSpace can be interacted with programmatically through the ArchivesSpace API, so a systems migration should mean this can be copied over into a new software platform, stored in a database, or utilized however we need. Since the identifier in Alma will be stored in the MARC metadata itself, it will come along for the ride wherever those MARC records go.
All Together Now
This turned into quite a bit more substantial blog post than I had originally anticipated but it kind of goes to show just how complicated a seemingly banal topic like file naming conventions can be. Doing it with an eye towards the types of timeframes we’re looking at in digital preservation (i.e. indeterminately long) is a thorny problem that involves compromises no matter how hard you want to stay away from them.
To sum up, our proposed file naming convention involves the combination of two mandatory elements that identify what departmental unit in the University of Rochester has curatorial authority over the digital asset, as well as an identifier that uniquely picks that particular digital asset out inside that departmental unit. In some situations, principally involving serial-esque materials that it is anticipated will be discoverable in Primo at some point, an optional multi-part indicator can be used to pick out certain elements of the series (such as a particular journal issue). Lastly, many digital objects are complex in nature, with a front and a back or multiple pages; these necessitate the use of a sequence number to reflect their complex nature.
So now we come back to those five desired qualities of a file naming convention that the blog post started off with to see if this proposed file naming convention accomplishes what it set out to.
To provide useful information
While the human readability of the file names may be pretty low if the user has no access to documentation about the file naming convention itself, the file names do manage to pack a fair amount of information if they do have documentation on the convention. The file name provides the user with information on the curatorial unit from whence the digital asset came, information on what metadata management system can provide further information, as well as what identifier to search that metadata system for. The presence of further file name elements can provide further context; if there is a multipart indicator, this will be human readable and it indicates the nature of the resource as something like a serial. If there is a sequence number, we know the digital object is complex in nature.
To be created consistently
The conventions themselves are consistent in how filenames should be constructed. The presence of the multipart indicator is one compromise here that weakens the ability to behave consistently, but overall the convention I believe does a good job here.
To be able to be interacted with programmatically
The file naming convention adheres to best practices around character choices in file names. Only letters, numbers, dashes, underscores, and periods are permitted in the file name. Only one period is permitted to delimit the file name from the file format extension. The file name cannot begin with a number and no spaces can be used in the file name. Between these rules and the consistent nature of how the filename itself is formatted, this allows for robust programmatic interaction. Even in cases where the multipart indicator is used, it is required to be used consistently and minimally to allow regular expressions to be implemented without a huge lift.
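To give a sense of what “without a huge lift” might mean, here is one possible regular expression for the convention. It is a sketch under some assumptions of my own rather than a pattern taken from our guidelines: the identifier is assumed to be lowercase letters followed by digits, the multipart indicator is assumed to be purely alphanumeric, and `FILE_NAME` and `parse_file_name` are hypothetical names.

```python
import re

FILE_NAME = re.compile(
    r"(?P<unit>[A-Z]{3}[0-9]{3})"            # organizational unit, e.g. RCL003
    r"_(?P<identifier>[a-z]+[0-9]+)"         # identifier, e.g. cuid172
    r"(?:_(?P<multipart>[A-Za-z0-9]+))?"     # optional multipart indicator
    r"(?:-(?P<sequence>[0-9]{5}))?"          # optional five-digit sequence number
    r"\.(?P<extension>[A-Za-z0-9]+)"         # the single period, then the extension
)


def parse_file_name(name):
    """Split a conforming file name into its elements, or return None."""
    match = FILE_NAME.fullmatch(name)
    return match.groupdict() if match else None


print(parse_file_name("RCL001_cuid9_1907-00001.tif"))
print(parse_file_name("RCL003 cuid0.jpg"))
# None
```

Optional elements that are absent come back as `None` in the resulting dictionary, which makes it easy to tell a simple object from a complex or multipart one.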
To ensure file names don’t require constant updating
This particular issue is in many ways the central thesis of this entire post. Creating a symbolic representation of an organizational unit rather than using a department’s acronym is one important facet of making sure that file names are durable.
To make sure they aren’t tied to a particular software platform
Also playing double duty with the previous point about durable file names is the practice of ensuring that identifiers are used in all file names and that those identifiers can easily be moved between software platforms, ensuring they can continue to be useful far into the future, making them both durable and able to sustain a software platform migration.
This file naming convention isn’t necessarily as robust as I might have hoped when I first set out to craft one with a maximalist useful time horizon. The file names are inextricably linked with the systems that they interact with, and sometimes, with scarce resources, it just makes a lot more sense to use the wonderful work of an institution like NYU, rather than trying to reengineer something from the beginning. Functionality in our discovery systems means that we need to use the multipart indicator for some of our resources, which is probably the largest (yet still necessary) compromise that had to be made. All that being said, I do truly believe this convention is pretty darn robust, and should set us up for success well into the future, beyond the use of systems like ArchivesSpace, Alma, and Preservica, and maybe even beyond the use of encoding standards like MARC and EAD.
While this narrative description of our file naming conventions isn’t really a robust set of documentation for our practices, once our implementation phase is complete at the University of Rochester, I have a personal goal to make all our policies publicly available for use and reference in the wider community. More on that, and all the many other policies we’ve been busy drafting all year, later.
I welcome any thoughts or feedback you might have on this particular topic and am incredibly curious what any readers might be doing around file naming in their own organizations!
