My Setup for Working from Home

I had the good fortune of getting all the various pieces and parts of my home office settled just before the novel coronavirus pandemic began really impacting life in the U.S., so while this situation is horrifying for so very many reasons, I was at least well positioned to keep working toward my digitization/digital library goals from home. I've been really enjoying seeing other folks' home setups and am curious to hear how they're adapting, so I thought I'd share mine.

  1. The central piece of this whole setup is a newly purchased Dell XPS 15 laptop with some fairly robust capabilities, particularly for the image editing and batch actions I'm doing. It has a Core i7 processor, 16 GB of RAM, a 0.5 TB SSD, an NVIDIA GeForce GTX 1650 graphics card, and Thunderbolt 3. I've got it parked on a cheap Nulaxy laptop stand that I picked up on Amazon.
  2. The Thunderbolt 3 port let me take advantage of the CalDigit TS3 Plus dock, which connects to all of the peripherals listed below and delivers power and Ethernet to the laptop. Conveniently, it tucks right under the laptop inside the Nulaxy stand. You can't really see it in the image, but just to the left of the dock and laptop stand is a Western Digital My Passport USB 3.0 2 TB external hard drive that also connects to the dock; it's my primary storage mechanism and where I'm saving all my projects.
  3. A friend passed along an old monitor he was no longer using, which is attached to the CalDigit dock. It's a HANNspree monitor, though I have no idea what the model is. Getting it working took a bit of trial and error: the CalDigit dock only outputs DisplayPort and the monitor only accepts HDMI, and it turns out you need an active DisplayPort adapter or cable, not a passive one (as I had initially purchased).
  4. My keyboard is a newly purchased Azio Large Print Tri-Color Backlit Wired Keyboard, and it has been lovely so far. It's a pretty cheap option; I just wanted a real keyboard to make this more like a conventional desktop setup. The backlighting is nice in the evenings, and honestly, as my youth continues to flee, the large print isn't terrible either. Connects to the CalDigit dock.
  5. My mouse is an old Anker gaming mouse that was surprisingly cheap at the time and allows a fair bit of customization, with enough input options to be useful. Nothing much to say here; I just used the mouse I still had lying around. Connects to the CalDigit dock.
  6. This is a Nekteck 72W charger that I use for my phone. I originally bought it to have a single charger for both my phone and my old ASUS C302 Chromebook, but as I've migrated laptops, it has become a dedicated phone charger on my desk. It also has three additional USB-A ports for charging other devices, which is handy.
  7. This Epson V600 photo scanner was borrowed from my workplace so that I could do at-home digitization projects, in conjunction with Epson Scan installed on my laptop. It has been a dream for staying productive on repetitive scanning at home, and it's now making me think about getting one for my own use. Connects to the CalDigit dock.
  8. It's nice not having my phone constantly in use streaming podcasts, so I'm using a Nest Hub as my media device, casting from Pocket Casts to it for entertainment during the day. I've also recently set it up as a digital photo frame, which has honestly been pretty awesome while feeling rather isolated; having images of friends and family scroll through helps a lot. Also, I just like the idea of this being my command center, like I'm on the bridge of a spaceship or something, so MOAR SCREENZ helps me achieve that admittedly silly goal.
  9. Continuing that theme, I've got a little temperature/humidity tracker that I previously used in an old apartment because I was convinced the thermostat was broken and wanted a second opinion. No sense getting rid of it since it's still truckin', and it's handy to have an idea of the temperature. And again, MOAR SCREENZ.
  10. I backed the original Antsy Labs Fidget Cube desk toy, and it lives on in its place on my desk. It's nice to have something tactile to play with while staring at progress bars.
  11. I've got an ancient 2.1 computer speaker set that I've moved through something like five apartments. Since they refuse to break, I keep using them; nothing much to say here other than they also connect to the CalDigit dock.
  12. The basis for my at-home digitization is an index card collection of clippings about soldiers in World War II. There are dozens of drawers of these cards, and digitizing them is time-consuming work, perfect for this at-home setup. I wrote more about this in my last blog post.
  13. Additional drawers of cards just waiting to be scanned.
  14. Books brought home from my office for reference and reading, including "A Field Guide to American Houses, 2nd edition" by Virginia Savage McAlester; "Radical Candor" by Kim Scott (a book my whole library system is encouraged to read); "Archives in Libraries: What Librarians and Archivists Need to Know to Work Together" by Jeannette A. Bastian, Megan Sniffin-Marinoff, and Donna Webber; and finally "The Data Wrangler's Handbook: Simple Tools for Powerful Results" by Kyle Banerjee.
  15. Finally my phone, a Pixel 3a, perched on a Toddy Gear phone wedge.

2020 Goals: Genealogical Record Sets

Some staffing changes at my local library mean that the amount of time I have available off the reference desk has shrunk considerably, a change that has made me rethink how exactly I can continue digitizing resources when I'm not able to sustainably work with the bulk of our digitization equipment. To address this change in a fashion that allows me to continue my work, I'm choosing to focus this year on genealogical records, which is turning out to be a successful plan for a variety of reasons.

Record Sets

After reviewing our manuscript collections and card indices, three primary categories of materials made themselves apparent as good targets for digitization: organizational records, cemetery records, and veterans records.

For the organizational records, identifying suitable record sets was luckily a straightforward process. I began by looking through our finding aids for collections devoted to a single organization (e.g., the International Union of Operating Engineers, Local 313 (Local 18) Collection and the Athena Art Society Collection), then a category of organizations (the Hospitals and Medical Organizations of Greater Toledo Collection), and, failing that, our catch-all Clubs and Organizations of Greater Toledo Collection. While perusing these I noted any mention of a membership list, directory, or yearbook.

We already have some corporate records from the Lloyd Brothers Walker Monument Company available online as a cemetery records set, but it has always bugged me that it never had another cemetery record set to join it. After some combing through our catalog and finding aids, I eventually turned up the Northwestern Ohio Genealogical Society (NWOGS) Cemetery Project Collection. In the 1970s this genealogical society sallied forth into Northwest Ohio, parts of Michigan, and even one cemetery in Canada to individually transcribe every headstone in every cemetery they could find. We have the original manuscript cards they filled out by hand, as well as (and crucially) the typescripts they generated afterwards: a perfect candidate for digitization, with the scope limited to those cemeteries in Lucas County.

Finally, for veterans records we have extensive card file indices of clippings from area newspapers documenting soldiers in World War II and the Korean War, as well as a recently rediscovered Women of World War II index. These three card files will form the starting point for our veterans record sets. They are also by far the most extensive collections, with each drawer containing well over 500 cards and several dozen drawers in all. Scanning of the card files will in all probability extend well into 2021.

Equipment and specifications

These records are all typed text, generally of good quality and clarity, with minimal images to consider. My goal in digitizing these records is to scan them at sufficient quality for successful OCR, thus allowing them to be searched across Ohio Memory, DPLA, and our own Federated Search Tool. As such, scanning at a resolution of 400 PPI in 48-bit color enables successful OCR without making the process so time-consuming as to be functionally impossible. I'm primarily working on this project myself, with an intern from a local university assisting with segments of it.

Our reference desk couldn't accommodate our Epson 11000XL large-format flatbed scanner, much less our Kirtas book scanner or I2S planetary scanner, but it is spacious enough to fit one of our smaller flatbeds: the Epson V600 photo scanner. This flatbed produces remarkably good images considering its $200 price point, and we happened to have a couple that had made their way around our department for various projects. Moving one out to the reference desk was a straightforward way to open up scanning capacity there with no expenditure of funds.

Copyright considerations

Contemplating the copyright status of these works was interesting because it turned out to be a situation I don't often find myself in. Frequently, as a digitization librarian, my projects center on materials in the public domain, materials for which we've been given permission by the copyright holder to share, or projects where we feel digital distribution constitutes a fair use of the work. This project, however, falls into another category: materials that are factual in nature and thus not eligible for copyright protection in the first place.

The organizational yearbooks/directories that I started this project with list committee makeup, membership rosters, and schedules of events, all information that simply documents and reflects what occurred throughout the year. None of it is particularly creative in nature. The same has been true of the cemetery transcriptions: they list who transcribed the tombstones, who generated the typescript, where the cemetery is located, when all this occurred, and then the tombstone transcriptions themselves. Again, simple facts, with no creative endeavor involved.

In those cases where sections move into more substantially creative territory, I've omitted them from public access to ensure that intellectual property is respected. For instance, the Athena Art Society yearbooks were frequently handmade chapbooks with individually created artistic covers during the run that was digitized; all of these were excluded from upload. Similarly, the Women's Auxiliary to the Academy of Medicine of Toledo and Lucas County materials included the organization's constitution and governance information; as this would almost certainly warrant copyright protection, it was also excluded.

The big exception here is the card files of clippings providing information on those involved in World War II and the Korean War. The newspaper articles are all protected by copyright, and none of them have entered the public domain to date. For these, a fair-use argument was needed, highlighting in particular that the amount shared is a small fraction of the original work, that there is no available means of licensing this resource, that creation of the index is a highly transformative use of the copyrighted material, and that the resource is meant for research purposes. The Columbia Fair Use Checklist (created by Kenneth D. Crews and Dwayne K. Buttler) was used to document the assessment. Ultimately I ended up with 14 factors in favor of fair use and one against, which certainly doesn't guarantee a finding of fair use but left me confident enough to proceed.

Privacy

An important factor for consideration here is obviously privacy; the entire point of this enterprise is to make it easier to find information about people. A firm cut-off of 1970 is being used, which amounts to 50 years at the time of writing. This provides a robust buffer of time and should ensure that most names and addresses are no longer current.

XML Parsing Errors During FTP Upload to the Internet Archive

One of the most common stumbling blocks I’ve come across when uploading ebooks to the Internet Archive via FTP has been the following XML parsing error:

Two common problems may result in seeing this error upon submitting your GET request to have the item ingested and derived for access in the Internet Archive.

Problem One: Ampersands. I am not an expert in XML, but from what I've been able to read, the ampersand ("&") is a special character reserved for specific uses in XML and cannot be parsed as plain text. If you have an ampersand in your metadata, simply replace the special character with the word "and" (or escape it as "&amp;"); problem solved. I run into this problem most frequently when uploading early 20th-century books that use an ampersand in the transcribed <publisher_original> field.
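If you generate your metadata files with a script rather than by hand, the escaping can be automated. Here's a minimal sketch using Python's standard library; the field names are just illustrative:

```python
from xml.sax.saxutils import escape

# Hypothetical metadata pulled from a spreadsheet or catalog record.
metadata = {
    "title": "Annual Report of the Board of Trade",
    "publisher_original": "Smith & Sons",
}

# escape() converts "&" to "&amp;" (and "<"/">" to their entities),
# which keeps the XML parser happy; substituting the word "and"
# works just as well if you prefer fully human-readable metadata.
safe = {key: escape(value) for key, value in metadata.items()}

print(safe["publisher_original"])  # Smith &amp; Sons
```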

Problem Two: Incorrect XML tags. I had planned on writing up this quick explainer because this is a problem I've run into a few times; however, in the process of creating the metadata for this upload, I unintentionally gave myself a metadata problem that actually needed fixing. I managed to insert a "1" in the third opening <subject> tag without even realizing it. It took my intern and me about 20 minutes to catch the problem, as I was using this as a sample to show her how to upload materials to the Internet Archive via FTP, and she was the one who caught it (thanks, Noelle!). Double-check your tags to ensure that the spelling is correct and no additional characters have been introduced.
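A quick well-formedness check before uploading can catch this whole class of typo in seconds. A minimal sketch, again using only Python's standard library:

```python
import sys
import xml.etree.ElementTree as ET

# Try to parse each metadata file named on the command line. A mismatched
# or misspelled tag (like my accidental extra "1") raises a ParseError
# that points to the offending line and column.
for path in sys.argv[1:]:
    try:
        ET.parse(path)
        print(f"{path}: well-formed")
    except ET.ParseError as err:
        print(f"{path}: {err}")
```

Pointing this at your metadata files before starting the FTP transfer takes seconds and would have saved us those 20 minutes.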

If you've done the above and avoided any other problems, you should see the following and can rest easy knowing you've successfully contributed one more item to the cultural record.

Fixing Large-Size/Low PPI Images for CONTENTdm Upload

Context: I am a digitization librarian working in the Local History and Genealogy department of a public library. We use CONTENTdm for our public access repository through Ohio Memory (a joint collaboration of the Ohio History Connection and the State Library of Ohio), and I use the CONTENTdm Project Client to upload materials to our collections there. I ran into an issue recently while uploading yearbooks to CONTENTdm: my 10,000-page monthly OCR license was being eaten up over the course of uploading ten items. The total number of images involved was probably no more than 2,500, and all of them should have been small enough to fit easily inside the image constraints of the ABBYY engine being used. Instead of using 25% of my license on the upload, it would exhaust 100% of the license four or five books in (I upload yearbooks in batches of one decade at a time).

The first time this occurred, I simply pushed through the upload process, leaving the last five or six books without an OCR transcription, and went back and added the transcripts the next month after my license reset. The next time a decade's worth of books was ready for ingest, I started checking the individual files to see what the problem was. The issue became readily apparent in Adobe Bridge and all the metadata it exposes, in this case for the original JPEG file:

JPEG Image Metadata in Adobe Bridge Showing 72 PPI

The images had the expected pixel dimensions of about 3000 by 4200 pixels, but the resolution of each image was a mere 72 PPI. At that pixel density, an image measured a whopping 43.4″ by 59.4″, far larger than the maximum size allowed by ABBYY FineReader. The CONTENTdm documentation notes that the OCR engine's maximum page size is A4 (roughly 8.3″ by 11.7″); if you have an image larger than that, it simply consumes a number of license pages equal to how many A4 pages the image's area is equivalent to at that resolution. At 72 PPI, A4 is a mere 595 by 842 pixels, so by the formula on OCLC's help page, a single 3124-by-4276-pixel image at 72 PPI eats up 27 pages of OCR license. That is a recipe for chewing through a 10,000-page OCR license very quickly.
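To see how quickly that adds up, here is a rough reconstruction of the area-based arithmetic; the authoritative formula is the one on OCLC's help page, so treat this as a back-of-the-envelope sketch:

```python
import math

# The OCR engine's maximum page size is A4 (about 8.27 x 11.69 inches).
A4_WIDTH_IN = 8.27
A4_HEIGHT_IN = 11.69

def ocr_pages_consumed(width_px: int, height_px: int, ppi: int) -> int:
    """How many A4 'pages' of OCR license a single image consumes."""
    area_in = (width_px / ppi) * (height_px / ppi)
    return math.ceil(area_in / (A4_WIDTH_IN * A4_HEIGHT_IN))

print(ocr_pages_consumed(3124, 4276, 72))   # 27 pages at 72 PPI
print(ocr_pages_consumed(3124, 4276, 400))  # 1 page at 400 PPI
```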

The fix for this is luckily fairly straightforward. I was working in Photoshop with the JPEG images generated by our KIRTAS Kabis IIIW. First, I converted all the JPEGs to our archival TIFF format, both for long-term preservation and to avoid the "JPEG Options" dialog that pops up in Photoshop. Once the TIFFs were created, I built a simple batch action to process all the images through Photoshop's "Image Size" menu. The key here is to change the resolution of the image without resampling it: if you try to upscale a 72 PPI image to 400 PPI and resample in the process, Photoshop will do its best to create a massively larger image that won't help your situation at all.

Image Size menu configuration in Adobe Photoshop

The final result is a TIFF image with the exact same pixel dimensions as the original, but now with the appropriate pixel density of 400 PPI and an image size in inches that easily fits inside A4, no longer destroying my OCR license. The next step is to identify where in the editing workflow these images are being converted to 72 PPI, to prevent this from happening in the first place. Big thanks to Glenn Fleishman's article "How to set a higher dpi without changing an image's resolution" at Macworld for helping me understand what was happening here; definitely worth a read!
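If you'd rather script this step than run a Photoshop batch action, the same no-resample fix can be done with the Pillow imaging library. A minimal sketch, assuming the JPEGs sit in a folder called scans:

```python
from pathlib import Path

from PIL import Image  # pip install Pillow

# Convert each 72 PPI JPEG to a TIFF stamped at 400 PPI. Passing a new
# dpi value to save() only rewrites the resolution metadata; the pixel
# data is carried over unresampled, just like unchecking "Resample"
# in Photoshop's Image Size dialog.
for jpeg_path in Path("scans").glob("*.jpg"):
    with Image.open(jpeg_path) as img:
        img.save(jpeg_path.with_suffix(".tif"), dpi=(400, 400))
```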

TIFF Image Metadata in Adobe Bridge Showing 400 PPI