Monday, February 22, 2010

AT Transition Update

I apologize for the long delay in posting to the blog. I have spent the better part of the last several months addressing several (thousand) issues raised by our legacy data migration and plug-in development. Today, though, I am happy to announce that the day has finally come: we, Manuscripts and Archives, have made the jump. We are now fully in and committed to the AT! Unfortunately, much still remains to be worked out.

First off, we have to fix thousands of errors reported by our programmatic efforts to match legacy instance information to resources in the AT. These were mostly the result of errors in our finding aids and, less often, errors in our location database. Most have now been addressed, leaving only the truly disparate errors that require a good deal of research into how collections were processed and, oftentimes, examining and verifying box contents.
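For illustration only (our actual matching runs as an AT plug-in, and the file and column names below are invented for the example), a minimal Python sketch of the kind of matching involved indexes both exports by call number and box number and logs the boxes that fail to match:

```python
# Hypothetical sketch: compare a locations-database export against an
# export of AT instance/container data, keyed on call number and box
# number, and log the boxes that fail to match for manual review.
import csv

def index_rows(path, key_fields):
    """Index CSV rows by a tuple of key fields (e.g. call number + box)."""
    index = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            key = tuple(row[k].strip() for k in key_fields)
            index.setdefault(key, []).append(row)
    return index

locations = index_rows("locations_export.csv", ["call_number", "box"])
at_instances = index_rows("at_instances_export.csv", ["call_number", "box"])

with open("unmatched_boxes.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["call_number", "box", "barcode"])
    for key, rows in locations.items():
        if key not in at_instances:  # box in the locations database but not in a finding aid
            for row in rows:
                writer.writerow([key[0], key[1], row.get("barcode", "")])
```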

Another major category of clean-up work stems from our QC of the data sent back from our consultant, where we found that, with some overlap with our other error logs, several thousand barcodes from our locations database did not come into the AT. Thankfully, many of these can be easily diagnosed with our Box Lookup plug-in and fixed with our Assign Container Information plug-in. Others require more in-depth problem-solving. For those large collections where things just went haywire, or for those small collections with only a few boxes, we've decided to delete the resource, re-import it, and then re-assign instance information using the Assign Container Information plug-in. Other collections will require deleting one or more components, importing those components into a dummy resource, transferring said components back into the original resource, and then re-assigning container information.

A third major challenge is cleaning up the restriction flags we assigned to instances based on notes in our locations database. Our locations database had a variety of notes both at the collection/series/accession level and at the item level. Since these notes were wildly inconsistent and could not be easily parsed, we created blanket restrictions for instances based on the notes. As a result, we have to review the assigned restrictions, verifying that those that need to be restricted actually are and opening up those that should not be. Thankfully, these errors can easily be fixed with our Box Lookup and Assign Container Information plug-ins.
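As a rough illustration of the blanket approach (not our actual code; the keyword list here is invented), the logic amounts to keyword matching against the free-text notes:

```python
# Illustrative only: flag an instance as restricted whenever its
# location note contains one of a handful of tell-tale keywords.
# Real notes were far messier, hence the post-migration review.
RESTRICTION_KEYWORDS = ("restricted", "closed", "permission")

def looks_restricted(note):
    """Return True if a free-text note suggests the box should be restricted."""
    note = (note or "").lower()
    return any(keyword in note for keyword in RESTRICTION_KEYWORDS)

print(looks_restricted("Closed until 2020 per donor agreement"))  # True
print(looks_restricted("Transferred from storage 2004"))          # False
```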

Aside from these data errors, which are our first priority, we also have to finalize workflows, procedures, and documentation for accessioning, arrangement and description, and collections management. Although equally critical to our day-to-day operations, these were put off until we were in the AT so that we could fully model what needed to be done.

So, although we've made great progress up to this point, much remains to be done and much needs to be resolved. That is more or less the lasting impression of the project. For other large institutions planning similar migration projects, I can't say enough just how much work is involved and how important it is to get staff, especially technical staff, on board. For those institutions without technical support and dedicated staff, it is probably best to hire a consultant, especially when it comes to legacy data (e.g. instance) migration and customizations to the AT.

Tuesday, September 15, 2009

AT Issues: Box Ranges

The most recent issue we've encountered, and one I'm sure others out there have already run into, concerns box ranges. When EAD is imported into the AT with a range in the <container type="Box"> tag (e.g. <container type="Box">1-2</container> or <container type="Box">3, 256</container>), the AT creates a single instance for that component (e.g. Box 1-2 or Box 3, 256), rather than separate instances for each box. The problem is that each box is likely to have a separate barcode, box type, and perhaps even location. When you click on Manage Locations, for example, you are presented with a single instance to which multiple, separate values need to be assigned. There are a few options at this point to address the issue.

First, and perhaps least desirable, is to fix your EAD to eliminate box ranges. Aside from the considerable labor/programming involved, your only option is to create separate components (i.e. clones) for each instance rather than one component with multiple instances. This is because although the AT (1.5.9) allows you to create and export components with multiple instances (i.e. a <c0x> with multiple <container type="Box"> tags), it does not allow you to import them in the same fashion. Instead, each container (and only up to three) is imported as a separate container type within a single instance. All subsequent instances tied to a component are lost. Fortunately, I am told that version 2.0 will support import of multiple instances if parent/id attributes are used for each container tag.

Second, you can fix (i.e. break apart) the ranges in the AT. You can do this in two different ways depending on how you want to characterize each instance. One, you can create separate components (i.e. clones) for each instance. Two, you can create multiple instances within a single component. The problem with the first option is that your resource/finding aid loses scannability, includes somewhat redundant info, and may grow quite large if you have several sizable ranges. The problem with the second approach is that you will need a stylesheet to customize the display of instances, perhaps turning them back into a range if you so choose.

We've decided to address the box range issue with a combination approach, fixing some instances in the EAD and addressing most programmatically in the AT with code we're developing to clone components that are part of ranges. Hopefully this code can be added to one of our existing plug-ins that assigns other instance info, giving others the option of creating clones for components that are part of a range. More to come on that soon.
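For the curious, the core of the cloning work is just splitting the range string. A rough sketch of that step (the real code will live in an AT plug-in, and the parsing rules here are simplified assumptions) might look like this in Python:

```python
# Simplified sketch of the range-splitting step: expand container
# values like "1-2" or "3, 256" into individual box numbers, one per
# cloned component/instance.
import re

def expand_boxes(value):
    """Expand '1-2' -> ['1', '2'] and '3, 256' -> ['3', '256']."""
    boxes = []
    for part in re.split(r"[,;]", value):
        part = part.strip()
        match = re.fullmatch(r"(\d+)\s*-\s*(\d+)", part)
        if match and int(match.group(1)) <= int(match.group(2)):
            boxes.extend(str(n) for n in range(int(match.group(1)), int(match.group(2)) + 1))
        elif part:
            boxes.append(part)
    return boxes

print(expand_boxes("1-2"))     # ['1', '2']
print(expand_boxes("3, 256"))  # ['3', '256']
```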

Monday, July 20, 2009

Collaborative AT Instances: Pros & Cons

This post examines the pros and cons of consortial or collaborative AT instances. My comments are based on my experience administering the Yale University Library Collections Collaborative AT project and MSSA's AT development.

Pros
  1. The central benefit of a consortial/collaborative AT instance is the consolidation of systems, procedures, practices, and resources. Having one system, one set of procedures, one course of training across multiple repositories not only conserves resources, but also greatly facilitates consistency and efficiency across the institution. Such a configuration inherently allows for enhanced understanding of each other’s collections, and provides faster and more consistent access to collection information, as well as the possibility of—from one location—getting an overview of the special collections holdings across diverse repositories. In Manuscripts and Archives alone, implementation of the AT will result in the consolidation of numerous databases, centralizing collection information and reducing ongoing systems maintenance for these un-integrated databases.

  2. Centralizing collection and other archival information leads to increased security, as potentially sensitive information will no longer be scattered across databases, electronic office files, and often, paper logs.

Cons

  1. The primary challenge facing collaborative/consortial instances is that the AT does not scale well for large, complex data sets and hence exhibits noticeable performance issues. Because of an acknowledged design flaw, the AT's performance and functionality degrade past a certain point (see my previous post on Resource loading). As a result, especially for large institutions with multiple repositories or, more importantly, with extensive legacy data, consolidating multiple systems in a single AT instance will result in a slow-performing system. Unfortunately, there is currently no plan to alter the design of the AT to address this issue in the 2.0 release, but the potential exists for a third round of grant support (to merge the AT with Archon) that would allow for such. The only alternatives at this point are for one of the participating institutions to pay a consultant to do the redesign work, which might be costly, or to develop lookup tools/plug-ins that query the database directly without having to, say, build an entire Resource (see the sketch after this list).

  2. Sustainability is another major issue. With two major releases left, development of the AT is coming to a close. Unfortunately, this is happening just as many of us are finally getting around to evaluating and adopting the AT. Thankfully, though, with version 1.5.9 the AT team introduced plug-in functionality as a viable means for the community to customize and develop the AT. In addition, it's exciting to hear that the opportunity exists for a third round of grant support to merge the AT with Archon. Beyond greatly expanding the functionality and usability of AT/Archon, this would allow for finishing any development commitments still on the table when AT development ends.

  3. A third issue for consortial/collaborative instances is that any modifications/customization done with the AT by the superuser, e.g. modifying default values, lookup lists, user-defined fields, etc., apply to the instance as a whole and cannot be applied to just a single repository. Extensive modification of lookup lists, user-defined fields, and default values may thus result in a cluttered, hard-to-use interface or may even impede efficiency and performance. Plug-ins may be a solution to this problem, though, as they can be used to alter appearance and workflow.

  4. A fourth issue is that the only way to restore an instance (e.g. after a crash or failure) is to import an entire MySQL dump. Hence if one repository or individual screws something up, you'll have to go back to the last backup of the entire database, which may mean a lot of data has to be re-entered.

  5. A fifth challenge is the difficulty of migrating legacy data, especially instance or box information (e.g. location, box type, barcode, etc.), into the AT. Migration can also be difficult for those institutions that do not have EAD 2002 or that lack the expertise to export, format, and import their legacy data. For those institutions like us with a significant amount of complex legacy data, the only real option is to hire a consultant and develop a custom import process/plug-in.

  6. A sixth shortcoming of the AT is its inability to support the full EAD tag set, meaning that additional tools (e.g. stylesheets) or systems may be necessary to fully manage an institution's finding aids. On the bright side, especially for smaller institutions or those who lack a finding aid display/delivery system, the proposed AT-Archon merger might address the issue of full EAD support.

  7. A seventh issue is the inability to batch edit data in the AT. For those with a little know-how, though, a MySQL database administration tool such as Navicat can be used to query and update data in the AT's MySQL tables. This is definitely beyond the capability of the typical user, so you may want to address this gap via a plug-in.
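To give a feel for the lookup-tool idea mentioned in the first item above, here is a hedged Python sketch of a direct query against the AT's MySQL back end. The table and column names are placeholders only (check them against your AT version's schema), and the credentials are invented:

```python
# Sketch of a barcode lookup run directly against the AT database,
# avoiding the cost of building an entire Resource in the client.
# Table/column names are placeholders -- verify against your schema.
import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(
    host="localhost", user="at_readonly", password="secret", database="at"
)
cursor = conn.cursor(dictionary=True)
cursor.execute(
    "SELECT resourceIdentifier, container1Indicator, container1Type "
    "FROM instances_placeholder WHERE barcode = %s",
    ("39002123456789",),
)
for row in cursor.fetchall():
    print(row)
cursor.close()
conn.close()
```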

All told, it might seem that AT has more than its fair share of cons. This is not what I want to get across. True, there are some issues to take into consideration especially when creating a collaborative instance, but as MSSA is easily the largest user of the AT, none of what we've encountered thus far is a real deal-breaker. True, we'd like it to perform better, but there are options at this point to at least sidestep this issue until the AT/Archon redesign. And with plug-in functionality now available and soon to be expanded, the means for addressing the AT's shortcomings is now in the hands of the community. We just need to step up.

Wednesday, June 17, 2009

Finding Aid Clean-Up: Box Numbers

As we proceed with our AT development we are spending considerable time cleaning up and standardizing our finding aids. Aside from the work I've mentioned previously to create consistent dates, extent statements, and subjects, the main focus of our latest efforts is the standardization of container information (i.e. box numbers). The reason for all this work is to allow us to programmatically hook our location information (e.g. box type, barcode, vault, shelf, etc.) to our finding aids in the AT, the key to which, it turns out, is the box number.

Like many repositories, our arrangement and descriptive practices have waxed and waned over the years. Although our collection numbers and accession numbers have more or less been consistently applied, our box numbers have not, particularly when used in connection with our practice of housing small quantities of odd-sized materials in common containers. Formerly, we housed such items, especially folios or slides, in what we called common folio or common slide boxes (i.e. containers housing materials from multiple collections in a single or communal box), assigning a box number for the common folio/slide box and a folder number for the individual folder. Aside from the clear practical issues involved in administering such common containers, we've run into problems as we try to tie the box numbers we've assigned these items in our locator database to our finding aid data in the AT. More specifically, as our descriptive practice has varied over the years, the assignment of box numbers and box number extensions (e.g. an alphanumeric character used to indicate a use copy or duplicating master of a particular item) for these items has been inconsistent, unfortunately differing a great deal from the box/container info in our EAD. For example, what appears in the finding aid as "MS Common Folio 10" is entered in our locator database with Box '1' and BoxNumberExtension 'CF1F10'. As a result, we've had to manually edit data both in our location database and in our finding aids/AT for all these items.
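One way such manual reconciliation could be tracked (a simplified sketch of my own, not our actual workflow; the single entry below is the example just given) is a hand-built crosswalk between the locator database's (Box, BoxNumberExtension) pair and the container label used in the finding aid:

```python
# Illustrative crosswalk for common-container items: map a locator
# record's Box + BoxNumberExtension to the finding-aid container label.
# Entries are added by hand as boxes are reviewed and verified.
CROSSWALK = {
    ("1", "CF1F10"): "MS Common Folio 10",
}

def finding_aid_label(box, extension):
    """Return the finding-aid container label for a locator record, if known."""
    return CROSSWALK.get((box.strip(), extension.strip().upper()))

print(finding_aid_label("1", "CF1F10"))  # MS Common Folio 10
```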

This is a short-term solution. These items really need to be rehoused, and all such common containers need to be done away with, not only because of the issues at hand, but also to facilitate, say, the creation of future use copies of these materials.

Tuesday, May 26, 2009

Subject Handling

We in MSSA have chosen not to utilize the AT to manage subject terms for three reasons.

First and foremost, subject handling in the AT is not as robust or as functional as in our current cataloging system.

Second, only a portion of the finding aids we imported contained controlled access points. This is because subjects have generally only been assigned to our master collection records, which are our MARC records. Only for a short period of time were controlled access points used on the manuscripts side. Furthermore, given the differences in content and purpose between MARC and EAD, trying to consolidate the two into a single system presented obvious practical issues. So, rather than try to programmatically marry our MARC subjects to our finding aids, we decided to maintain access points in MARC until the AT could serve as our master record--a scenario still some ways off, though made much simpler with the introduction of plug-in functionality and custom import/export tools. In fact, with plug-in functionality we might revisit the possibility of at least importing our subjects from MARC and attaching them to our resource records in the AT.

The third reason we chose not to use the AT to manage subjects was the difficulty the AT has roundtripping data, especially from one format to another, and the concomitant need to develop tools to clean up this data for easy import into Voyager.

Thursday, May 7, 2009

Batch Editing & Data Clean-Up

One of the key weaknesses of the AT is the inability to batch edit data. The need for batch edit functionality was ranked of high importance in the AT user group survey and will hopefully be added in a future release. What then is a repository to do in the meantime? I suggest two possible options: 1) batch edit data prior to import; 2) manipulate the MySQL tables directly or use a database administration tool such as Navicat for MySQL to connect to the AT's MySQL database and perform queries/updates in the tables.
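As a hedged illustration of option 2 (the table and column names below are placeholders, not the actual AT schema, and you should back up the database before trying anything similar), a scripted batch update against the AT's MySQL tables might look like this:

```python
# Sketch of a scripted batch edit against the AT's MySQL tables,
# equivalent to what can be done interactively in Navicat. Placeholder
# table/column names -- check them against your AT schema, and back up
# the database before running anything like this.
import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(
    host="localhost", user="at_admin", password="secret", database="at"
)
cursor = conn.cursor()
# Example: normalize a container type value across every instance.
cursor.execute(
    "UPDATE instances_placeholder SET container1Type = %s WHERE container1Type = %s",
    ("Box", "box"),
)
conn.commit()
print(cursor.rowcount, "rows updated")
cursor.close()
conn.close()
```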

As I have described before, over time our collection management data has been created according to a number of different ad hoc and de facto standards. We in MSSA have tried as much as possible to batch edit and standardize our data prior to import into the AT. This was straightforward for our accessions and location information, which was already stored in a database and thus easy to identify and manipulate. The one problem that did exist with this data was a tendency by MSSA staff to combine data belonging to a series of different fields (as defined by EAD or the AT) into a single catch-all free-text note field (a place to find everything). Options for handling this data included exporting our legacy data to another file and modifying it there, or importing it into the AT and then editing it. We chose the former, performing batch operations to format the files according to the AT import maps. Although this was largely successful, we still encountered edits that needed to be made once in the AT. The options at that point were either to edit records in the AT one at a time or to perform another round of edits, delete the data in the AT, and then reimport it. We chose instead to perform batch edits in the AT using Navicat, saving a considerable amount of time and effort.

The biggest challenge we faced in standardizing our data prior to import, though, came with our finding aids. Because they're not in a database and therefore not easily comparable, it's hard to see what changes need to be made until they're actually in the AT. No matter how many iterative batch edits we ran on our finding aids, we still came across edits that needed to be made, and importing the files into the AT would still likely require a further round of edits. Running edits outside of the AT, deleting all the data, and then reimporting would be a huge burden, particularly with over 2,500 finding aids. We chose instead to run batch operations in the AT using Navicat and then run find/replace separately on the finding aids.

Monday, April 20, 2009

Deploying the AT Across Multiple Yale Repositories: Implementation

In setting up the AT for use across multiple Yale repositories we encountered a number of practical issues that needed to be resolved. The two most important were the need for standardization of practice and administrative set-up of the AT.
  1. Standardization of practice
    Each of the four participating repositories accessioned and managed special collections in a different way. To maximize the Toolkit's effectiveness we therefore needed to create standard procedures for accessioning, including defining a minimum-level accession record and application of consistent accession numbers. In addition, we created documentation in the form of instructions, guidelines, and tutorials to instruct both initial and future participants.

  2. AT set-up
    Again, given the variety exhibited in participating repositories' practices and collections information, we had to carefully consider whether to customize the Toolkit (e.g. use of user-defined fields, unique field labels, default values, etc.) to meet specific repository needs. The major challenge posed, however, was that the Toolkit only allows for customization across the AT instance as a whole and not specific to one repository within that instance.

    We had set up one database instance for the AT at Yale and created repositories for each of the special collections within it. An alternative strategy would have been to create separate instances for each repository, allowing for repository specific customization. The goal of the project, however, was to create one means for managing and querying special collections information across Yale's special collections. In addition, given the lack of distributed technical expertise and support, we decided to centrally manage the AT in a single instance.

    We initially chose not to customize the Toolkit, maintaining a vanilla installation for the Music Library, Arts Library, and Divinity Library collections. Given the sheer volume of collections information in Manuscripts and Archives (MSSA), however, we decided to create a separate MSSA instance, mostly for testing legacy data import, but also to incorporate MSSA-specific data elements in user-defined fields. We are also currently contracting out customization of the Toolkit to handle collections management processes, including Integrated Library System (ILS) export.

Sunday, April 19, 2009

Populating Resources: MARCXML vs. Finding Aids

Choosing how to populate resources in the AT was an important consideration for Manuscripts and Archives, one that ultimately had us reversing course and scrapping our initial plans. Our initial dilemma was that we lacked finding aids for all of our collections, with many of our University Archives finding aids lacking inventories for some accessions (some in paper only, some taken in with no inventory). As a result, we initially chose not to import our finding aids, choosing instead to import MARCXML, which we transformed via a stylesheet into EAD for batch import. Given the lack of data normally provided by importing finding aids, we developed a script to tie our container data (Paradox) to the resources. Several things didn't quite work, so we had to reconsider our options.
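For context, the batch transformation step amounts to running an XSLT stylesheet over every MARCXML record. A minimal Python sketch using lxml (the stylesheet and directory names are placeholders, not our actual files) might look like this:

```python
# Sketch of batch-applying a MARCXML-to-EAD stylesheet so the output
# can be imported into the AT. Stylesheet and paths are placeholders.
from pathlib import Path
from lxml import etree  # pip install lxml

transform = etree.XSLT(etree.parse("marcxml_to_ead.xsl"))
Path("ead_out").mkdir(exist_ok=True)

for marc_file in Path("marcxml").glob("*.xml"):
    ead = transform(etree.parse(str(marc_file)))
    out_path = Path("ead_out") / marc_file.name
    out_path.write_bytes(
        etree.tostring(ead, pretty_print=True, xml_declaration=True, encoding="UTF-8")
    )
```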

Having attended the AT User Group meeting at SAA last fall, we realized that our approach caused too many complications and problems of sustainability. As a result, we spec'd out importing our finding aids. Again, given our lack of finding aids for all collections, we worked out a plan to hire a consultant to help standardize our finding aids for import. We successfully imported 2,000 of our 2,500 finding aids and turned our attention to how to utilize the AT as a collection management system. We then realized we faced the problem of editing or significantly revising finding aids, especially on the University Archives side, where offices regularly provide accession inventories ready to copy and paste into EAD. Without a simple means for importing partial EAD for accessions, and without wanting to re-enter these manually in the AT, we reconsidered our plan again. Since our current EAD creation and editing process is sufficient, we decided that the AT is not currently capable of meeting our needs and that having a consultant customize it to meet those needs would be beyond our means.

As it stands now, we're back to square one; MARCXML is it (again). Unfortunately, given the considerable amount of work spent cleaning up and standardizing our finding aids, we will have to come up with a means for importing data from a separate AT EAD import (e.g. Resource Titles, Finding Aid Titles, and Citations) into a fresh AT instance populated via MARCXML. In addition, to meet our needs and function as a full collection management system, we will work with a consultant to modify the Toolkit to marry our container information with our resources and allow for easy export to our ILS (Voyager). Hopefully the need to easily import partial EAD will be worked out and we can use the AT as intended, populating resources via finding aids.

Saturday, April 18, 2009

Reporting

In addition to the AT's built-in reports, which may or may not be sufficient for repository statistics, there are a few options for generating customized reports. Common reporting software tie-ins include JasperReports, Crystal Reports, and iReport. Those wishing to create or customize their own reports with these applications will also need to make use of the Toolkit's application programming interface (API), which is available on the Archivists' Toolkit website.

Another option for those with knowledge of MySQL is to use a free database administration tool such as Navicat for MySQL. The beauty of this approach is that with a little MySQL you can query and batch edit data in the AT MySQL tables. A similar tool is DaDaBIK, a free PHP application that allows you to easily create a highly customizable front end for a database in order to search, insert, update, and delete records. Although these tools allow you to easily batch edit data in the AT, be forewarned that editing the tables directly is not officially sanctioned and may cause problems when upgrading.

We ran into problems during the upgrade to version 1.5 after we had written data directly to the tables, most likely because we created new values (especially primary keys) there ourselves. Subsequent work using these tools on data already imported into the AT via EAD and accession (XML) import has not caused problems.

Making the Jump: Standardization of Legacy Data

One of the most important factors to consider in planning for migrating to the Archivists' Toolkit is the condition of your legacy data (finding aids, accessions, etc.). In Manuscripts and Archives (MSSA), we were faced with daunting numbers: 2500 collections, 7000+ accessions, 100,000+ boxes. Captured in a variety of systems by a variety of individuals over several decades, our data clearly needed a lengthy process of standardization prior to our switch to the Toolkit.

We started with finding aids. They were still in EAD version 1.0 as of 2008, and we spent considerable staff time converting them to EAD version 2002. Utilizing the Library of Congress's stylesheet, modified for our purposes, we batch converted our finding aids in an iterative process lasting several weeks. Given the variety of encoding practices over the years, however, we were still left with 400-500 invalid files that we had to fix, which took several months. As with the conversion process itself, we worked in an iterative fashion, fixing the common problems en masse where possible and then tackling the unique ones. Find and Replace was a godsend here; Dreamweaver was particularly helpful, allowing multi-line Find and Replace across individual files, selected files, and entire directories.
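For a sense of the mechanics, here is a rough Python sketch of the convert-and-validate loop (the stylesheet, schema, and directory names are placeholders, and our actual tooling differed):

```python
# Sketch of an iterative convert-and-validate pass: apply the (locally
# modified) EAD 1.0 -> EAD 2002 stylesheet to each finding aid and
# collect the files that fail schema validation for hand repair.
from pathlib import Path
from lxml import etree  # pip install lxml

transform = etree.XSLT(etree.parse("ead1to2002_modified.xsl"))
schema = etree.XMLSchema(etree.parse("ead.xsd"))
invalid = []

Path("ead2002").mkdir(exist_ok=True)
for source in Path("ead1").glob("*.xml"):
    converted = transform(etree.parse(str(source)))
    (Path("ead2002") / source.name).write_bytes(
        etree.tostring(converted, xml_declaration=True, encoding="UTF-8")
    )
    if not schema.validate(converted):
        invalid.append(source.name)

print(len(invalid), "files need hand repair:", invalid[:10])
```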

We moved on to MARC records. Two specific problems we encountered when trying to import MARCXML (converted into EAD) were inconsistent assignment of local call numbers (sometimes in a 90b, sometimes a 90a, other times the same call number assigned to multiple collections) and inconsistent collection dates. These again stemmed from variations in practice over many years and had to be fixed by hand.

We next addressed location info. Due to varied practice and limited authority control, our collection location information (stored in a series of Paradox tables) exhibited a good deal of disparity, if not outright oddities. Many items lacked basic control info (e.g. Record or Manuscript Group Number, Accession Number, or Series) and as a result we had to do a fair amount of detective work to assign such, where possible, or, in some cases, remove these items from our database.

Finally, we tackled accessions. Common problems encountered in our outdated collections management system (MySQL back end with Access front end) were inconsistent data formatting (dates, accession numbers, contact info) and input practices (throwing restriction info, contact info, and other odds and ends into a single note field), as well as MSSA-specific (and University Archives-specific) accession data and practices (see my other post on Standardization of Practice). These latter issues required a hack or two to shoe-horn odd MSSA data elements into appropriate AT data elements (and/or user-defined fields).

We anticipate a great deal of post-migration clean-up in the AT to continue to work towards a consistent data set. The great thing though about doing this work in the AT is having one system and one means for doing it. Indeed, migrating to the Toolkit, regardless of its future sustainability (an issue I will post on separately), has been a great opportunity for consistency and standardization within our department. We're just sorry it has taken so long!

Deploying the AT Across Multiple Yale Repositories: Background

In 2008, four Yale special collections repositories (Arts Library, Divinity Library, Manuscripts and Archives, and the Music Library) participated in a project to 1) install and test the AT as an open source collections management system, and 2) examine the feasibility of establishing a Yale way of managing and tracking collections across the Yale Library system. This project was one of a number of projects that were undertaken as part of the Mellon Foundation Collections Collaborative at Yale University.

Focus

The specific focus of the project was the use of the Toolkit for accessioning. Although the Toolkit can do much more, we limited our focus to accessioning primarily because, with the exception of MSSA, the participants had only rudimentary systems in place for recording and managing accession information; in the past, this had been done primarily on paper. The Toolkit, however, facilitates easy capture, management, and searching of collection information that is vital to the day-to-day operations of repositories. It allows participants to utilize the same system and terminology, enhancing understanding of each other's collections and providing faster and more consistent access to collection information.

Work Summary

The principal investigator first met with the participants to examine existing collections management tools and record-keeping practices, and to discuss needs and expectations. These sessions provided the opportunity to specify collections management practices the Toolkit does not accommodate, distinguish between software issues and points where staff could or should be persuaded to do things differently, and explore the feasibility of developing a "Yale" way of managing special collections. In addition, with little in the way of legacy systems and practices for accessioning materials (Manuscripts and Archives excluded), it was determined that adoption of the Toolkit would be straightforward and much welcomed.

Following installation, project staff instructed participants in the use of the Toolkit and discussed issues concerning implementation and conversion of legacy data. Project staff then gave participants several weeks to use the Toolkit before following up with a focus group to examine participants' experiences, issues, questions, and needs. Important outcomes of the focus group included the expressed need for common practice and use (e.g. best practice guidelines), improved documentation (especially concerning required fields, terminology, and reports), and the identification of concerns regarding future administration and tie-in to other systems (i.e. the Finding Aid Creation Tool).

Products

To support and further Yale’s use of the Toolkit, a variety of products were (and continue to be) created. These include:

a. Website <http://www.library.yale.edu/mssa/at/>

b. Wiki <https://yaleat.pbwiki.com/>

c. Guided instructions and tutorials for 14 separate Toolkit features and processes.

d. Expanded data dictionary (.xls) [in progress]

e. Best practice guidelines for accessioning

Staff also reported participant experiences to the Toolkit developers and provided recommendations for potential incorporation in future AT releases.

Conclusions & Recommendations

With little previously in place to accession and track collection material, participants have enthusiastically adopted the Toolkit. Given the variety in local practices and needs, however, and the amount of information the Toolkit allows you to capture, it is recommended that best practice guidelines for accessioning be developed to standardize its use. Additionally, ongoing central support and administration will need to be formalized, including bringing in additional special collections repositories. Particularly important here, especially for larger repositories with established record-keeping systems, will be helping repositories map and migrate their legacy data. With future releases and expanded Toolkit functionality, efforts will likely be needed to integrate more and more legacy systems across Yale's libraries and special collections into the AT.