Saturday, April 18, 2009

Making the Jump: Standardization of Legacy Data

One of the most important factors to consider in planning a migration to the Archivists' Toolkit is the condition of your legacy data (finding aids, accessions, etc.). In Manuscripts and Archives (MSSA), we were faced with daunting numbers: 2500 collections, 7000+ accessions, 100,000+ boxes. Because this data had been captured in a variety of systems by a variety of individuals over several decades, we quickly saw the need to begin a lengthy process of data standardization prior to our switch to the Toolkit.

We started with finding aids. These were still in EAD version 1.0 as of 2008, and we spent considerable manpower converting them to EAD version 2002. Using the Library of Congress' stylesheet, modified for our purposes, we batch converted our finding aids in an iterative process lasting several weeks. Given the variety of encoding practices over the years, however, we were still left with 400-500 invalid files that we had to fix. This took several months. As with the conversion process itself, we worked in an iterative fashion, fixing en masse where possible, first the common problems and then the unique ones. Find and Replace was a godsend here. Dreamweaver was particularly helpful, allowing for multiple-line Find and Replace across individual files, selected files, and entire directories.
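For anyone without Dreamweaver, the same kind of multiple-line Find and Replace across a directory can be sketched in a few lines of Python. This is only an illustration, not what we ran: the function name replace_in_tree, the ".xml" suffix, and the idea of passing in a regular expression are all assumptions made for the example.

```python
import re
from pathlib import Path


def replace_in_tree(root, pattern, replacement, suffix=".xml"):
    """Apply one regex replacement to every matching file under root.

    Hypothetical helper for illustration. Returns the paths of the
    files whose contents were actually changed.
    """
    # DOTALL lets "." match newlines, so a pattern can span several lines.
    regex = re.compile(pattern, re.DOTALL)
    changed = []
    for path in sorted(Path(root).rglob("*" + suffix)):
        original = path.read_text(encoding="utf-8")
        updated = regex.sub(replacement, original)
        if updated != original:
            # Only rewrite files where the pattern was found.
            path.write_text(updated, encoding="utf-8")
            changed.append(path)
    return changed
```

As with any batch change, it is safest to run something like this on a copy of the files and to re-validate them afterwards, since the sketch overwrites files in place.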

We moved on to MARC records. Two of the specific problems we encountered when trying to import MARCXML (converted into EAD) were inconsistent assignment of local call numbers (sometimes assigning a 90b, sometimes a 90a, other times assigning the same call number to multiple collections) and collection dates. This again stemmed from variations in practice over many years and had to be fixed by hand.
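The fixing itself was manual, but finding call numbers shared by more than one collection is the sort of check that can be scripted. Here is a minimal sketch, assuming the call numbers have already been pulled out of the records as (collection id, call number) pairs; the function name, the pair format, and the sample values are made up for the example.

```python
from collections import defaultdict


def shared_call_numbers(records):
    """Map each call number used by more than one collection to those collections.

    records is an iterable of (collection_id, call_number) pairs. How the
    pairs are extracted from the MARC records is left out of this sketch.
    """
    by_number = defaultdict(set)
    for collection_id, call_number in records:
        # Ignore case and surrounding whitespace when comparing.
        by_number[call_number.strip().upper()].add(collection_id)
    return {num: sorted(ids) for num, ids in by_number.items() if len(ids) > 1}


# Hypothetical sample data: two collections share one call number.
sample = [("c1", "MS 1"), ("c2", "ms 1 "), ("c3", "MS 2")]
report = shared_call_numbers(sample)
```

The resulting dictionary gives a worklist of call numbers to review by hand.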

We next addressed location info. Due to varied practice and limited authority control, our collection location information (stored in a series of Paradox tables) exhibited a good deal of disparity, if not outright oddities. Many items lacked basic control info (e.g. Record or Manuscript Group Number, Accession Number, or Series), and as a result we had to do a fair amount of detective work to assign these where possible or, in some cases, remove the items from our database.

Finally, we tackled accessions. Common problems encountered in our outdated collections management system (MySQL back end with Access front end) were inconsistent data formatting (dates, accession numbers, contact info) and input practices (throwing restriction info, contact info, and other odds and ends into a single note field), as well as MSSA-specific (and University Archives-specific) accession data and practices (see my other post on Standardization of Practice). These latter issues required a hack or two to enable shoe-horning odd MSSA data elements into appropriate AT data elements (and/or User-Defined Fields).
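To give a sense of what standardizing inconsistent dates involves, here is a minimal sketch that tries a handful of formats and converts whatever matches to a single form (YYYY-MM-DD). The list of formats is an assumption for the example, not an inventory of what was in our data, and the month-name formats assume an English locale.

```python
from datetime import datetime

# Assumed input formats; a real list would come from surveying the data.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%d %B %Y")


def normalize_date(text):
    """Return text as a YYYY-MM-DD date string, or None if unrecognized."""
    cleaned = text.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            # Not this format; try the next one.
            continue
    return None
```

Returning None rather than guessing leaves the unrecognized values for a person to review.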

We anticipate a great deal of post-migration clean-up in the AT as we continue to work towards a consistent data set. The great thing about doing this work in the AT, though, is having one system and one means for doing it. Indeed, migrating to the Toolkit, regardless of its future sustainability (an issue I will post on separately), has been a great opportunity for consistency and standardization within our department. We're just sorry it has taken so long!
