Sunday, May 31, 2009

Finding Aid Handling

This post will examine issues we've encountered with the AT concerning finding aids.

  1. EAD Import

    We've run into five issues importing EAD into the AT. First, as I mentioned in a previous post, we've had problems importing both large EAD files (6+ MB) and large numbers of files. Even after increasing the memory assigned to the AT (see Memory Allocation) and installing the client on the same machine as the database, importing these files has crashed the import process, forcing us to import in small batches or one file at a time as needed. Second, we've encountered issues with the handling of finding aids that have a parallel extent or container summary (see the sample encoding after this list). Here we've found that although the markup is correct, the AT is inconsistent: sometimes it assigns a blank extentNumber, other times it combines extentNumber and containerSummary in extentType, or, most often, it assigns a blank extentNumber and extentType and throws everything into a general physical description note. This calls for a lot of manual clean-up, depending upon how you encode these elements. Third, although our encoding is correct, we and other Yale repositories have found inconsistent languageCode assignment, most often resulting in blank values. Fourth, as perhaps many others of you have experienced, we've had problems with Names and Subjects, both in the creation of duplicate values and in names getting imported as subjects (and vice versa). This is likely due to how the AT treats names as subjects, a complicated concept which may or may not be revised for 2.0. Fifth, we've found inconsistent handling of internal references/pointers, which sometimes get assigned new target values and sometimes do not. Whether new target values are created after components are merged is another issue on the table for future investigation.

  2. DAO handling

    Unfortunately, one of the EAD elements currently not handled by the AT is <daogrp>, which our Yale EAD Best Practices Guidelines and common EAD authoring tool (XMetaL) use to handle all Digital Archival Objects (see the illustration after this list). As a result, none of our DAOs can be imported into the AT. This is a major issue that needs to be addressed in the 2.0 DAO module revision.

  3. EAD Export/Stylesheet

    Although it is possible to modify or replace the stylesheet the AT uses to export EAD or PDF (see Loading Stylesheet), it is not currently possible to select from or use multiple stylesheets, which may be important for multiple repository installations and/or for developing flexible export options (a workaround sketch appears after this list).

  4. Batch editing

    As I've mentioned in another post, one of the key weaknesses of the AT is the inability to batch edit. Given the lack of batch editing functionality, a repository will either have to commit to greater planning and development prior to migration, or use a database administration tool such as Navicat for MySQL to clean up data once it is in the AT.
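
To illustrate the extent problem from item 1: a representative parallel extent with container summary in EAD 2002 (a generic fragment, not one of our actual finding aids) looks like this:

    <physdesc>
      <extent>10 linear feet</extent>
      <extent>(24 boxes, 3 folios)</extent>
    </physdesc>

In principle the first extent should map to extentNumber (10) and extentType (linear feet), and the parenthetical extent to containerSummary; in practice, as described above, the same encoding is mapped differently from file to file.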
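
To illustrate item 2, here is a generic sketch of the two DAO encodings (simplified EAD 2002, not our production markup). In our experience the simple form imports, while the grouped form, which our guidelines prescribe for all digital objects, is skipped:

    <!-- simple form: handled on import -->
    <dao href="http://digital.example.edu/item001.jpg">
      <daodesc><p>Digitized photograph</p></daodesc>
    </dao>

    <!-- grouped form (one object, several files): skipped on import -->
    <daogrp>
      <daoloc href="http://digital.example.edu/item001-thumb.jpg" role="thumbnail"/>
      <daoloc href="http://digital.example.edu/item001.jpg" role="reference"/>
      <daodesc><p>Digitized photograph</p></daodesc>
    </daogrp>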
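
For item 3, until multiple stylesheets are supported, one workaround is to export EAD from the AT and apply repository-specific stylesheets outside the tool. Here is a minimal sketch using the XSLT support built into the JDK (the class name and file arguments are placeholders of our own):

    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    public class ApplyStylesheet {
        public static void main(String[] args) throws Exception {
            // args: exported EAD file, repository-specific stylesheet, output file
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer = factory.newTransformer(new StreamSource(args[1]));
            transformer.transform(new StreamSource(args[0]), new StreamResult(args[2]));
        }
    }

Each repository can then keep its own .xsl and run the transform as a post-export step, e.g. java ApplyStylesheet exported-ead.xml mssa.xsl finding-aid.html.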

Tuesday, May 26, 2009

Subject Handling

We in MSSA have chosen not to utilize the AT to manage subject terms for three reasons.

First and foremost, subject handling in the AT is not as robust or as functional as our current cataloging system.

Second, only a portion of the finding aids we imported contained controlled access points. This is because subjects have generally been assigned only to our master collection records, which are our MARC records. Only for a short period of time were controlled access points used on the manuscripts side. Furthermore, given the differences in content and purpose between MARC and EAD, trying to consolidate the two into a single system presented obvious practical issues. So, rather than try to programmatically marry our MARC subjects to our finding aids, we decided to maintain access points in MARC until the AT can serve as our master record--a scenario still some ways off, though made much simpler with the introduction of plugin functionality and custom import/export tools. In fact, with plugin functionality we might revisit the possibility of at least importing our subjects from MARC and attaching them to our resource records in the AT.
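
If we do revisit it, the first step would be pulling the subject headings out of our MARC records. Here is a rough sketch using the open-source marc4j library (the export file name is a placeholder, and attaching the extracted headings to AT resource records would still require a custom import plugin):

    import java.io.FileInputStream;

    import org.marc4j.MarcStreamReader;
    import org.marc4j.marc.DataField;
    import org.marc4j.marc.Record;

    public class ExtractSubjects {
        public static void main(String[] args) throws Exception {
            // binary MARC exported from Voyager (placeholder file name)
            MarcStreamReader reader = new MarcStreamReader(new FileInputStream("voyager-export.mrc"));
            while (reader.hasNext()) {
                Record record = reader.next();
                // 650 = topical subject headings; other 6xx tags could be added
                for (Object f : record.getVariableFields("650")) {
                    DataField field = (DataField) f;
                    if (field.getSubfield('a') != null) {
                        System.out.println(field.getSubfield('a').getData());
                    }
                }
            }
        }
    }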

The third reason we chose not to use the AT to manage subjects was the difficulty the AT has roundtripping data, especially from one format to another, and the concomitant need to develop tools to clean up this data for easy import into Voyager.

Thursday, May 21, 2009

AT 1.5.9 Plugin Functionality

With the release of 1.5.9 and plugin functionality, the AT team has introduced a much-needed mechanism for customizing and expanding the toolkit in an extensible fashion. In keeping with current software development trends centered on a core system that can be modified or expanded via external applications, the beauty of AT plugin functionality is that customizations and data imported via plugins are not affected by future AT upgrades, since the plugin code lies outside the main AT source code, stored in a separate plugin folder.

Plugins will thus offer repositories the capability to do many of the things that have perhaps kept them from fully adopting the AT up to this point. This includes the creation of custom importers, export modules, record editors, and display/workflow screens. We in MSSA plan to take full advantage of plugin functionality to create a custom import mechanism for marrying our existing container information (e.g. box type, barcode, VoyagerBib, and VoyagerHolding) to instance records in the AT, a custom exporter to send data from the AT to Voyager, and a custom import mechanism for importing EAD or partial EAD into an existing resource record (e.g. a new accession).
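
To give a sense of the container-marrying step, here is a sketch that simply loads container information from a hypothetical CSV export and keys it for matching (the file name, column layout, and key format are entirely illustrative; the real work of updating instance records would happen inside the plugin):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;

    public class ContainerInfoLoader {
        public static void main(String[] args) throws Exception {
            // hypothetical export: callNumber,boxNumber,boxType,barcode,voyagerBib,voyagerHolding
            Map<String, String[]> containers = new HashMap<String, String[]>();
            BufferedReader in = new BufferedReader(new FileReader("containers.csv"));
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split(","); // naive split; fine if fields contain no commas
                // key on call number + box number so each AT instance can be matched
                containers.put(cols[0] + "|" + cols[1], cols);
            }
            in.close();
            System.out.println("Loaded " + containers.size() + " containers");
            // the plugin would then walk the matching AT instance records and copy in
            // boxType, barcode, voyagerBib, and voyagerHolding
        }
    }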

As with the AT itself, plugin development requires knowledge of Java programming. For more information on developing plugins check out the Plugin Development Guide (.pdf). A list of current plugins can be found here: http://archiviststoolkit.org/forDevelopers/plugins.shtml.

Thursday, May 14, 2009

Sustainability & the AT

Perhaps the biggest question facing the AT is its sustainability. With only one or two releases left, the AT is fast approaching the end of its development. What does the future hold?

At SAA in San Francisco last year, the AT group reported that it had begun to work with a business consultant to formulate a business plan for the AT after Phase 2 of development ends in 2009. Options at that point included institutional affiliation, subscription-based service, or the pursuit of another Mellon grant. Also briefly discussed was the related need to develop a community governance model for guiding the direction and future development of the AT. Although initial steps have been made to address the governance model thanks in part to the SAA AT roundtable, little has been said yet as to the AT business plan. This post will explore the pros and cons of the three options on the table at this point.

Institutional affiliation/hosting is one option for the AT. The main challenge of this approach is the reality that few institutions are capable of taking on this responsibility, not only due to limited technical expertise and infrastructure, but also because the current economic situation is precluding many of us from taking on outside projects. Without a commitment of significant resources it seems likely that this model will only allow for ongoing support and not any additional development. Given some of the needs addressed in the AT user survey, ATUG-listserv postings, and other forums, however, this option seems in the end the least beneficial to the archival community as a whole.

A second option is subscription or fee-based AT support and development, provided either by individual consultants or by a single software firm. Although this option would impose costs on individual institutions, it does allow for continued development, especially for institution-specific needs. The central challenge of this approach is that not all institutions have the resources to commit to development and so might not have a say in the future direction of the AT; bigger institutions with bigger budgets would drive the agenda. To prevent this disparity, an effective governance model would have to evolve to lobby for community interests and manage ongoing development for all interested parties.

The third option is to pursue another round of Mellon grant support. This would obviously allow for continued development of specific needs voiced by the archival community in the AT user survey and other requests posted to the ATUG-l. Given the demands on the project staff, our current economic situation, and the little traction this option seems to have garnered up to this point, it seems unlikely that this will happen.

So where does that leave us? The AT project is coming to an end, sooner perhaps than we would like. Sure, we'd all like a few more bells and whistles before then, but whatever business model is ultimately adopted, the AT will at least be supported for the next few years. Will Archon, ICA-AtoM, or other products, open-source or proprietary, sufficiently evolve? It's hard to tell at this point. The one thing we in MSSA know is that we will be much, much, MUCH better off for having undertaken the work to get into the AT. No amount of my blogging can sell this enough. I'm sure many institutions would agree. And we thank the AT for that.

Thursday, May 7, 2009

Batch Editing & Data Clean-Up

One of the key weaknesses of the AT is the inability to batch edit data. The need for batch editing functionality was ranked of high importance in the AT user group survey and will hopefully be addressed in a future release. What then is a repository to do in the meantime? I suggest two possible options: 1) batch edit data prior to import; 2) manipulate the MySQL tables directly, or use a database administration tool such as Navicat for MySQL to connect to the AT's MySQL database and perform queries/updates on the tables.
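
For the second option, the edits themselves are ordinary SQL. Here is a hedged sketch of a single clean-up update run through JDBC (it assumes MySQL Connector/J on the classpath; the connection details are placeholders, and the table and column names are illustrative, so confirm them against your own installation and back up the database first):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class BatchEdit {
        public static void main(String[] args) throws Exception {
            Class.forName("com.mysql.jdbc.Driver"); // MySQL Connector/J
            // connect to the AT's MySQL database (host, database, credentials are placeholders)
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/AT", "atuser", "secret");
            // illustrative update: normalize a legacy extent type value
            PreparedStatement stmt = conn.prepareStatement(
                    "UPDATE ArchDescriptionPhysicalDescriptions SET extentType = ? WHERE extentType = ?");
            stmt.setString(1, "linear feet");
            stmt.setString(2, "lin. ft.");
            System.out.println(stmt.executeUpdate() + " rows updated");
            conn.close();
        }
    }

The same statements can of course be run directly in Navicat's query window; the point is that one UPDATE replaces hundreds of record-by-record edits in the client.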

As I have described before, over time our collection management data has been created according to a number of different ad hoc and de facto standards. We in MSSA have tried as much as possible to batch edit and standardize our data prior to import into the AT. This was straightforward for our accessions and location information, which was already stored in a database and thus easy to identify and manipulate. The one problem with this data was a tendency by MSSA staff to combine data belonging to a series of different data fields, as defined by EAD or the AT, into a single catch-all free-text note field (a place to find everything). Options for handling this data included exporting our legacy data to another file and modifying it there, or importing it into the AT and then editing. We chose the former, performing batch operations to format the files according to the AT import maps. Although this was largely successful, we still encountered edits that needed to be made once the data was in the AT. The options at that point were either to edit in the AT one record at a time or to perform another round of edits, delete the data in the AT, and reimport. We chose instead to perform batch edits in the AT using Navicat, saving a considerable amount of time and effort.

The biggest challenge we faced in standardizing our data prior to import, though, came with our finding aids. Because they're not in a database and therefore not easily comparable, it's hard to see what changes need to be made until they're actually in the AT. No matter how many iterative batch edits we ran on our finding aids, we still came across edits that needed to be made, and importing them into the AT would still likely require a further round of edits. Running edits outside of the AT, deleting all the data, and then reimporting would be a huge burden, particularly with over 2,500 finding aids. We chose instead to run batch operations in the AT using Navicat and to run find/replace operations separately on the finding aid files.
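
For the find/replace pass over the files themselves, any scripting approach will do; here is a minimal Java sketch that applies one literal substitution across a directory of EAD files (the directory path and the strings are placeholders):

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.util.Scanner;

    public class FindReplace {
        public static void main(String[] args) throws Exception {
            File dir = new File("findingaids"); // placeholder directory of EAD files
            for (File file : dir.listFiles()) {
                if (!file.getName().endsWith(".xml")) continue;
                // slurp the whole file
                Scanner scanner = new Scanner(file, "UTF-8").useDelimiter("\\A");
                String text = scanner.next();
                scanner.close();
                // one literal, repository-specific substitution (illustrative)
                String edited = text.replace("lin. ft.", "linear feet");
                Writer out = new OutputStreamWriter(new FileOutputStream(file), "UTF-8");
                out.write(edited);
                out.close();
            }
        }
    }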

Tuesday, May 5, 2009

Tips and Tricks: Resources

In migrating your legacy data into the AT, your repository might have collections that lack both a finding aid and a MARC record but still need to get into the AT as resource records. Rather than take the time to create these resource records in the AT manually, one by one, it is helpful to come up with a means of importing this information in a batch process. MSSA had over 250 such records to bring into the AT. Here is how we did it.

The legacy data we had to address, consisting mostly of what can loosely be described as deaccessions (i.e. collections that were transferred, destroyed, never came in, and/or who knows what), was in a Paradox database. Given the orderly data structure, we decided to generate a simple EAD finding aid for each collection using a mail merge and then batch import them into the AT.

The first step was to export the data from our Paradox database into Excel. We then filtered the information to select only the specific collection records needed and deleted any unnecessary data elements. We then modified the data a bit, in this case creating a column for filename in addition to the other data elements present (e.g. collection number, title, note, and disposition). This served as the data source for the merge.



Next we modified our existing EAD template to include only the basics needed for import into the AT (namely level, resource identifier, title, dates, extent, and language), as well as the information present in our Paradox database to distinguish it as a deaccession (e.g. note and disposition). We then opened the EAD template in Word and set up a mail merge, inserting the elements from our Excel data source into the appropriate places in the Word document. Here is a partial view of the Word EAD template:

[screenshot: partial view of the Word EAD merge template]

Then we completed the merge of the data elements in Word and saved the individual finding aids with appropriate filenames. Here is a partial view of one of the resulting finding aids:

[screenshot: one of the resulting EAD finding aids]

The final step was to import the finding aids into the AT. To distinguish these resources in the AT we checked them as Internal Only in the Basic Description tab and modified appropriate note fields as needed.

All in all, the process proved very easy and was much faster than trying to enter this data manually. The only real drawback was having to save 250 separate finding aids.
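
For repositories comfortable with a little code, the whole merge can also be scripted, which removes even that drawback. Here is a minimal sketch, assuming a comma-separated export with the columns we used (filename, collection number, title, note, disposition); the EAD skeleton is deliberately trimmed and would need whatever additional boilerplate (dates, extent, language) your AT import expects:

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.util.Scanner;

    public class GenerateFindingAids {
        // trimmed, illustrative EAD skeleton; %s slots take the merged values
        static final String TEMPLATE =
            "<ead><eadheader><eadid>%s</eadid><filedesc><titlestmt>"
            + "<titleproper>%s</titleproper></titlestmt></filedesc></eadheader>"
            + "<archdesc level=\"collection\"><did><unitid>%s</unitid>"
            + "<unittitle>%s</unittitle></did>"
            + "<note><p>%s</p><p>Disposition: %s</p></note></archdesc></ead>";

        public static void main(String[] args) throws Exception {
            Scanner rows = new Scanner(new File("deaccessions.csv"), "UTF-8");
            while (rows.hasNextLine()) {
                // columns: filename, collection number, title, note, disposition
                String[] c = rows.nextLine().split(","); // naive split; assumes no commas in fields
                Writer out = new OutputStreamWriter(new FileOutputStream(c[0] + ".xml"), "UTF-8");
                out.write(String.format(TEMPLATE, c[0], c[2], c[1], c[2], c[3], c[4]));
                out.close();
            }
            rows.close();
        }
    }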

Friday, May 1, 2009

AT Issues: Large Finding Aids

Manuscripts and Archives encountered several performance issues when loading our finding aids into the AT. First, given the large number of finding aids in question (2,500+), we encountered load-time issues. Our first attempt to batch import our finding aids lasted the better part of a weekend, ultimately crashing with the dreaded Java heap space error. Adding insult to injury, no log was generated to indicate the cause of the crash or the status of the import (e.g. which files were successfully imported). Our initial diagnosis pointed to our setup: we had the database installed on one of our aging local servers and ran the import via a client on a remote workstation. To address the load-time issue we changed our setup, moving the database to our fastest machine, a Mac with 8 GB of memory and multiple processors, installing the AT client on it, and saving copies of our finding aids to it for easy import.

Our second attempt, which involved approximately 1800 finding aids, was much, much faster but still crashed. The likely culprits this time were large finding aids and a memory leak. We found that large finding aids (3 MB+) crashed the import process when included as part of a batch import. In addition, we found a memory leak of sorts (i.e. a successive loss of memory with each imported finding aid), which greatly slowed the process over time and contributed to the crash. As a result, we separated out the large finding aids and imported them individually, and we created smaller batches of finding aids (both with respect to total number and total size) to import in stages. To give you some idea of the time required to import larger finding aids: using a remote client, a file of up to 2 MB averaged 20-30 minutes to import; 2-3 MB took 30-60 minutes; 5-6 MB took 90-120 minutes.
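
The batching itself is easy to script. Here is a minimal sketch that sorts a directory of EAD files into staging folders, holding out anything over 3 MB for individual import and capping each batch at a total size (the path and thresholds are placeholders tuned to the experience described above):

    import java.io.File;

    public class BatchFindingAids {
        public static void main(String[] args) {
            File source = new File("findingaids");   // placeholder path
            long maxFileSize = 3L * 1024 * 1024;     // 3 MB+ files get imported individually
            long maxBatchSize = 20L * 1024 * 1024;   // cap on each batch's total size
            int batch = 1;
            long batchTotal = 0;
            for (File file : source.listFiles()) {
                if (!file.getName().endsWith(".xml")) continue;
                if (file.length() >= maxFileSize) {
                    move(file, new File(source, "individual"));
                    continue;
                }
                if (batchTotal + file.length() > maxBatchSize) {
                    batch++;        // start a new batch once the cap is reached
                    batchTotal = 0;
                }
                move(file, new File(source, "batch" + batch));
                batchTotal += file.length();
            }
        }

        static void move(File file, File dir) {
            dir.mkdirs();
            file.renameTo(new File(dir, file.getName()));
        }
    }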

These strategies proved effective, allowing us to import all but the largest of our finding aids (8-12 MB each), which we are still working on. Because these present problems for our finding aid delivery system as well, one option is to split them into multiple files of around 2 MB each. The only problem with this option is dealing with and maintaining multiple files.
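
The splitting can also be scripted. Here is a minimal sketch that re-parses the large file once per part and keeps only that part's top-level components (the file name and part count are placeholders; it assumes numbered components, so each c01 travels with everything nested inside it, and every part retains the full header):

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class SplitEad {
        public static void main(String[] args) throws Exception {
            File source = new File("large-finding-aid.xml"); // placeholder
            int parts = 4;                                   // placeholder target
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            int total = builder.parse(source).getElementsByTagName("c01").getLength();
            int perPart = (total + parts - 1) / parts; // components per output file
            for (int p = 0; p < parts; p++) {
                // re-parse for each part so every copy keeps the complete header
                Document copy = builder.parse(source);
                NodeList c01s = copy.getElementsByTagName("c01");
                List<Node> drop = new ArrayList<Node>();
                for (int i = 0; i < c01s.getLength(); i++) {
                    if (i / perPart != p) drop.add(c01s.item(i)); // not this part's share
                }
                for (Node n : drop) n.getParentNode().removeChild(n);
                Transformer t = TransformerFactory.newInstance().newTransformer();
                t.transform(new DOMSource(copy), new StreamResult(new File("part" + (p + 1) + ".xml")));
            }
        }
    }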

For other institutions with similar numbers and sizes of finding aids, these strategies may be of help.