Quick-and-Dirty Email Archiving (But Different)

In 2019 a colleague and I set out to find a straightforward way to collect the email newsletters that Toledo/Lucas County-area organizations, such as the art museum or city government, regularly distribute to keep their audiences informed. Our first approach used Zapier to create a lightweight action: an email received in a Gmail account would automatically be “printed” to PDF using Google Cloud Print and then deposited into the associated Google Drive account to await file renaming and organization (an annoyingly manual process).

This worked, but only sort of. Frequently the PDFs would have content missing from them, and at best, only the content of the emails themselves was preserved. Metadata, attachments, and timestamps were never included in the PDF versions of these newsletters. When we revisited this automated pilot project a year later, we determined that it was not living up to expectations and did not need to be pursued further. Google is also shuttering Google Cloud Print at the end of 2020, so the workflow was bound to break sooner rather than later. However, all the emails collected were still in the Gmail account and could be used in a different workflow.

This program area only involves the collection of newsletters from area organizations. This project does not account for description, emulation, or access of this content. Should a researcher want to use the material included in the collection, the only means of access would be to pull it out of the Gmail account itself, perhaps by printing a hard copy or simply forwarding the email to the researcher. Nor is there any means of making the content available to the public in an indexed form. As such, there is no intention of advertising this collection in any way.

The point of this project is simply to begin collecting material so that future developments by research groups and the development of new tools can provide means to describe, contextualize, and grant access to this material. To that end, the first step of this workflow is simply to use the Google account created by the Library digitization department to sign up for as many relevant newsletters as possible.

Once newsletters start arriving, organizing them is pretty simple. Using the Labels functionality in Gmail, which is essentially a combination of a tagging and folder system, we group newsletters by the organization creating them, and then into subcategories so that each year’s worth of content produced by a particular organization is also identified.

For instance, an email might have the label “Toledo Museum of Art” as well as a subordinate label “Toledo Museum of Art 2020.” This becomes important when we get the emails out of Gmail for bit-level preservation and eventually, hopefully, for the other tasks described previously that are at present not possible.

Google also has a service called Google Takeout, which lets users export their data from Google services for greater ownership and flexibility in its use. This service can be used with Gmail to export email in the MBOX file format. This file format is generally used as a means of import/export between email services, for instance for migrating from a Gmail account to an Outlook account. Because it was designed to be used by email clients and services, there are no good access solutions for the material contained in an MBOX file. However, MBOX files do have the advantage of saving all the email metadata and attachments, providing a more robust record than a simple PDF print-off.
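To give a sense of what an MBOX export retains, here is a minimal sketch using the mailbox module from Python's standard library. It is not part of our workflow; the function name summarize_mbox and the file name in the usage note are made up for illustration, and header values are returned as stored, so some may still be in encoded form.

```python
import mailbox


def summarize_mbox(path):
    """Return one dict per message: key headers plus attachment filenames.

    Hypothetical helper for illustration; not part of the workflow described here.
    """
    summaries = []
    # create=False raises an error instead of silently creating an empty file.
    box = mailbox.mbox(path, create=False)
    try:
        for message in box:
            # walk() visits every MIME part; parts with a filename are
            # treated here as attachments.
            attachments = [
                part.get_filename()
                for part in message.walk()
                if part.get_filename()
            ]
            summaries.append(
                {
                    "from": message.get("From"),
                    "subject": message.get("Subject"),
                    "date": message.get("Date"),
                    "attachments": attachments,
                }
            )
    finally:
        box.close()
    return summaries
```

Calling summarize_mbox("toledo-museum-of-art-2020.mbox") on an export (the file name is a placeholder) would return the sender, subject, date, and attachment names for every newsletter in it, which is exactly the information the PDF print-offs lost.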

Still, with promising work being done by researchers in email archiving, it is hoped that eventually these MBOX exports can be made accessible by transformation into a common access format like PDF. In the meantime, to exert more intellectual control over the material, each year an MBOX export for each organization’s newsletters is generated and saved on Library servers. This works because a subsection of a Gmail account can only be selected for export by Label, hence the workflow above of creating Labels that pick out a specific organization’s newsletters in a specific year. Additionally, the overarching Labels for all newsletters from a specific organization, without a year qualifier, are there in case a researcher wants, for instance, an archive of everything the art museum has sent out.
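One way to spot-check an export against the Labels it was supposed to capture is to tally the Labels recorded on each message. The sketch below assumes that the export stores each message's Labels as a comma-separated list in a header named X-Gmail-Labels; that is an assumption worth verifying against an actual export, and the simple comma split would mishandle any Label that itself contains a comma. The function name count_labels is hypothetical.

```python
import mailbox
from collections import Counter


def count_labels(path, header="X-Gmail-Labels"):
    """Count how many messages in an MBOX file carry each Label.

    Assumes Labels are stored comma-separated in the given header.
    """
    counts = Counter()
    box = mailbox.mbox(path, create=False)
    try:
        for message in box:
            raw = str(message.get(header, ""))
            # Collapse whitespace so a folded (multi-line) header reads as one line.
            flattened = " ".join(raw.split())
            for label in flattened.split(","):
                label = label.strip()
                if label:
                    counts[label] += 1
    finally:
        box.close()
    return counts
```

If the count for a Label such as “Toledo Museum of Art 2020” matches the number of conversations Gmail shows under that Label, the export is probably complete; a mismatch is a prompt to look more closely before relying on the file.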

For now, all that is required is to go into the Gmail account annually to export the previous year’s content for each organization, and to spend a few minutes once or twice a month applying Labels to emails and archiving them in Gmail, thus clearing them out of the inbox.
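Since the exports are kept for bit-level preservation, one step that could be added to the annual routine is recording a checksum for each MBOX file when it is saved, so that later copies can be checked against it. This is a suggestion rather than something the workflow above currently does, and the function name sha256_of_file is made up for this sketch.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks.

    Reading in chunks keeps memory use flat even for very large exports.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Storing the resulting value alongside each year's export gives a simple way to confirm, years from now, that the file on the server is still the file that came out of Gmail.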