Showing posts with label Finding Aids. Show all posts
Showing posts with label Finding Aids. Show all posts

Wednesday, June 2, 2010

PDF & HTML Stylesheets for Resource Reports

As with perhaps most other users, we wanted to devise a means for customizing the AT's EAD output to comply with our local EAD implementation guidelines. Mike Rush at the Beinecke created and maintains a stylesheet that converts our AT EAD output to conform with Yale's EAD best practices. This now works very well for us. We export EAD, apply the transformation, validate it, and then save for upload to our finding aids database. That is all well and good. Our staff, though, wanted to be able to perform a similar operation, really a print preview, when editing a resource. This was possible in our previous EAD editor (XMetaL) thanks to built-in customizations. Although the AT's PDF & HTML resource reports do allow for such a print preview, many of our staff wanted a finding aid that looked more like our own. Thankfully, the AT allows you to swap stylesheets (see the AT's FAQ > Import & Export Questions for instructions) to address such needs. We found a few problems, however, that you may need to take into consideration when swapping stylesheets.

First, make sure to check the path for any or subordinate stylesheets you utilize. If you're saving your stylesheets in the AT's report folder (i.e. C:\Program Files\Archivists Toolkit 2.0\reports\Resources\eadToPdf) make sure to use the path \reports\Resources\eadToPdf\[filename]. Otherwise, if you're pointing to a URL, make sure to use the full URL. This was all that was needed to make our PDF stylesheets run in the AT.

Second, especially for HTML stylesheets, make sure that any parameters specified include default values. This was what was causing errors for us.

With these two simple tweaks we are now able to apply our PDF and HTML stylesheets when staff generate resource reports. Ideally, we would like to apply our AT to Yale BPG stylesheet prior to running the resource PDF & HTML reports, perhaps via a plug-in. I'm sure others would like to modify this process as well. For the time being though, we're satisfied with the current output, which allows our staff to easily preview their work in a format similar to our own finding aids.

Monday, May 24, 2010

Command Line Interface (CLI)

We have been testing the developer preview of AT Version 2.0 Update 6, which now features support for the command line interface. Based on work from Cyrus Farajpour at USC, the new version and corresponding ATCLI plug-in developed by Cyrus and Nathan Stevens allow for development of handy command line instructions such as batch export of resources--something we and perhaps others with lots of finding aids are excited about. After initial testing, the plug-in has been modified to address memory consumption issues (a new session needs to be open for each record otherwise the records get cached and used up memory) and allows for specification of output directory.

Our first batch export test of our 3000+ resources was performed on my desktop (2.99 GHz processor, 2 GB RAM) under normal workload. It took approximately 96 hours for the process to complete! Given these results, we've tweeked our approach quite a bit, running it from a better computer and adjusting the amount of memory assigned to the atcli.lax file (see AT's FAQ's for instructions). This seems to help a bit. The main problem though for us is the sheer number and size of our finding aids. No matter what we do, it's going to take some time. In addition, we have several finding aids that don't need to get exported. Ideally, we'd like to be able to select which finding aids to export (i.e. those not marked internal only or those updated since a given date). Perhaps further modifications to the the plug-in will allow such customization.

For the time being, we'd like to thank Cyrus and Nathan for developing this key tool and look forward to working with others to improve it's design and functionality.

UPDATE (6/23/2010)
Good news. Cyrus has modified the command line plug-in to incorporate parameters, allowing you to specify which subset of resources/EAD you want to export from the AT. The parameters include: findingAidStatus, author, eadFaUniqueIdentifier, resourceIdentifier1, resourceIdentifier2, resourceIdentifier3, resourceIdentifier4, internalOnly, and lastUpdated. Plus, you can use any combination of these to easily fine-tune the export process as necessary. For more information on setting up and using the plug-in check out his github site: http://github.com/smoil/AT-CLI-Export-Plugin

Thursday, March 11, 2010

AT Issues: Creators

Another pesky problem we've encountered with the AT EAD import process is that finding aids not encoded with a <corpname/> or <persname/> or <famname/> in <origination label="creator"/> do not get names created with the assigned function of creator in resource records. Unfortunately for us, this amounted to the vast majority of our finding aids, some 1600+ records. Rather than fix these one at a time in the AT, we came up with the following workaround.

1. Generate list of filenames and creators from our EAD files that lack <corpname/> or <persname/> or <famname/> in <origination/>. To do this Mark Matienzo, our new digital archivist, wrote a script for us that parsed our EAD files and dumped this data into an Excel file.

2. Format filenames as needed in separate spreadsheet (e.g. remove .xml extension, add quotes) to create list as follows:

"mssa.ms.0001",
"mssa.ms.0002",
...

3. Use Navicat to run query generating list of resourceIds based on formatted filenames (eadFaUniqueIdentifier in AT Resources table):

SELECT eadFaUniqueIdentifier, resourceId FROM Resources WHERE eadFaUniqueIdentifier IN ("mssa.ms.0001", mssa.ms.002", "mssa.ru.9019") ORDER BY eadFaUniqueIdentifier;

4. Copy resourceIds back into Excel file

5. Format names in separate spreadsheet (add quotes) to create list as follows:

"Yale University. Office of the Secretary.",
"Yale University. Office of the President.",
...

6. Create query in Navicat to retrieve nameids for any names that did make it into the AT:

SELECT nameId, sortName FROM Names WHERE sortName IN ("Yale University. Office of the Secretary.", "Yale University. Office of the President.", "Yale University. Office of the Provost") ORDER BY sortName;

7. Paste nameIds into master file with resourceIds and corpnames, famnames and persnames.

For those records with nameIds, proceed to step 13. Otherwise, for those names not present in the name table, proceed to step 8.

8. Open Names table in Navicat and proceed to last record

9. Copy last record in Names table into Excel to serve as model, noting column names. It is important to note that corpnames and persnames use different columns/fields so make sure to examine records in Names table for formatting both persnames and corpnames.

10. Copy contents of your master original file with persnames, famnames and corpnames into the new spreadsheet according to the Name table structure. Use Excel's fill-in feature to fill in data as needed and assign sequential nameIds from last record in table. Make sure to format the lastUpdated and created fields as text in Excel so as to mirror date encoding in the AT. Here is what your spreadsheet should look like: sample (.xls).

11. Paste records into Names table using Navicat

12. Copy nameIds for newly created names back into your master file of resourceIds and corpnames and persnames

13. Open ArchDescriptionNames table in Navicat

14. Go to last record and copy into new Excel file to serve as model for formatting data

15. Copy contents of master file with resourceIds, corpnames, famnames, and persnames, and nameids into Excel file mirrororing structre of sample record. Use Excel's fill-in feature to format data and assign sequential ids from last record in table. Here is a what your spreadsheet should look like: sample (.xls).

16. Paste contents from Excel file into ArchDescriptionNames table using Navicat

A couple of caveats. First, it is important that these steps be taken by someone comfortable with the aforementioned programs, as well as MySQL. Second, and most important, id creation and pasting of data directly into the AT tables should be done when no one else is using to AT to prevent accidental overlap or duplication of work. Finally, make sure to test the results, including creating new names and resource creator links in the AT client just to make sure everything is ok. If something does go haywire, you can always delete the records from the AT tables you just pasted in.

Wednesday, June 17, 2009

Finding Aid Clean-Up: Box Numbers

As we proceed with our AT development we are spending considerable time cleaning up and standardizing our finding aids. Aside from the work I've mentioned previously to create consistent dates, extent statements and subjects, the main focus of our latest efforts is the standardization of container information (i.e. box numbers). The reason for all this work is to allow us to programmatically hook our location information (e.g. box type, barcode, vault, shelf, etc.) to our finding aids in the AT, the key to which it turns out is the box number.

Like many repositories, our arrangement and descriptive practices have waxed and waned over the years. Although our collection numbers and accession numbers have more or less been consistently applied, our box numbers have not, particularly with used in connection with our practice of housing small quantities of odd-sized materials in common containers. Formerly, we housed such items, especially folio or slides in what we called common folio or common slide boxes (i.e. containers housing materials from multiple collections in a single or communal box), assigning a box number for the common folio/box and a folder number for the individual folder. Aside from the clear practical issues involved in administering such common containers, we've run into problems as we try to tie the box numbers we've assigned these items in our locator database to our finding aid data in the AT. More specifically, as our descriptive practice has varied over the years, the assignment of box numbers and box number extensions (e.g. an alphanumeric character used to indicate a use copy or duplicating master of a particular item) for these items has been inconsistent, unfortunately differing a great deal from box/container info in our EAD. For example, what appears in the finding aid (i.e. ) as "MS Common Folio 10" is entered in our locator database with Box '1' and BoxNumberExtension 'CF1F10'. As a result, we've had to manually edit data both in our location database and in our finding aids/AT for all these items.

This is a short-term solution. These items really need to be rehoused and all such common containers need to be done away with, not only due to the issues at hand, but also to facilitate say the creation of future use copies of these materials.

Sunday, May 31, 2009

Finding Aid Handling

This post will examine issues we've encountered with the AT concerning finding aids.

  1. EAD Import

    We've run into 5 issues importing EAD into the AT. First, as I mentioned in a previous post, we've had problems importing both large EAD files (6+ MB) and large numbers of files. Even with increasing the memory assigned to the AT (see Memory Allocation) and installing the client on the same machine as the database, importing these files has crashed import, causing us to import in small batches or one file at a time as needed. Second, we've encountered issues with handling for those finding aids with a parallel extent or container summary. Here we've found that although encoded correctly, the AT is inconsistent, sometimes assigning a blank extentNumber, other times combining extentNumber and containerSummary in extentType, or, most often, assigning a blank extentNumber, extentType and throwing everything into a general physical description note. This calls for a lot of manual clean-up depending upon how you encode . Third, although encoded correctly, we and other Yale repositories have found inconsistent languageCode assignment, most often resulting in blank values. Fourth, as perhaps many others of you have experienced, we’ve had problems with Names and Subjects, both in the creation of duplicates vales and in names getting imported as subjects (and vice versa). This is likely due to how the AT treats names as subjects—a complicated concept which may or may not be able to be revised for 2.0. Fifth, we’ve found inconsistent handling of internal references/pointers, sometimes getting assigned new target values, sometimes not. Whether or not new target values are created following merging of components is another issue on the table for future investigation.

  2. DAO handling

    Unfortunately one of the EAD elements currently not handled by the AT is , which our Yale EAD Best Practices Guidelines and common EAD authoring tool (XMetal) have set up to handle all Digital Archival Objects. As a result, all our DAOs are unable to be imported into the AT. This is a major issue that needs to be addressed in the 2.0 DAO module revision.

  3. EAD Export/Styleshseet

    Although it is possible to modify or replace the stylesheet the AT uses to export EAD or PDF (see Loading Stylesheet) it’s not currently possible to select or utilize multiple stylesheets, which may be important for multiple repository installations and/or developing flexible export possibilities.

  4. Batch editing

    I’ve mentioned previously in another post that one of the key weaknesses of the AT is the inability to batch edit. Given the lack of batch editing functionality, a repository will either have to commit to greater planning and development prior to integration, or use a database administration tool such as Navicat for MySQL to clean up data once in the AT.

Tuesday, May 5, 2009

Tips and Tricks: Resources

In migrating your legacy data into the AT your repository might have collections that lack both a finding aid and MARC record but yet still need to get into the AT as resource records. Rather than take the time to manually create these resource records in the AT one by one, it would be helpful to come up with a means for importing this information in a batch process. MSSA had over 250 such records that we had to bring into the AT. Here is how we did it.

The legacy data that we had to address, consisting mostly of what can loosely be described as deaccessions (i.e. collections that were transferred, destroyed, never came in, and/or who knows what), was in a Paradox database. Given the orderly data structure, we decided to to generate simple EAD finding aids for each collection using a mail merge and then batch import into the AT.

The first step was to export the data from our Paradox database into Excel. We then filtered the information to select only the specific collection records needed and deleted any unnecessary data elements. We then modified the data bit, in this case creating a column for filename in addition to the other data elements present (e.g. collection number, title, note, and disposition). This served as the data source for the merge.



Next we modified our existing EAD template to include only the basics needed for import into the AT (namely level, resource identifier, title, dates, extent, and language), as well as the information present in our Paradox database to distinguish it as a deaccession (e.g. note and disposition). We then opened the EAD template in Word and set up a mail merge, inserting the elements from our Excel data source into the appropriate places in the Word document. Here is a partial view of the Word EAD template:



Then we completed the merge of the data elements in Word and saved the individual finding aids with appropriate filenames. Here is a partial view of one of the resulting finding aids:



The final step was to import the finding aids into the AT. To distinguish these resources in the AT we checked them as Internal Only in the Basic Description tab and modified appropriate note fields as needed.

All in all the process proved very easy and was much faster than trying to enter this data manually. The only real drawback was having to save 250 separate finding aids.

Friday, May 1, 2009

AT Issues: Large Finding Aids

Manuscripts and Archives encountered several performance issues when loading our finding aids into the AT. First, given the large number of finding aids in question (2500+), we encountered load time issues. Our first attempt to batch import our finding aids lasted the better part of a weekend, ultimately crashing with the dreaded java heap space error. Adding insult to injury, no log was generated to indicate the cause of the crash or the status of import (e.g. which files were successfully imported). Our initial diagnosis pointed to our setup. We had the database installed our one of our local, aging servers and ran the import via a client on a remote workstation. To address the issue of load time we changed our setup and moved the database to our fastest machine, a Mac with 8GB of memory and multiple processors, installed the AT client on it, and saved copies of our finding aids to it for easy import.

Our second attempt, which involved approximately 1800 finding aids, was much much faster but still crashed. The likely culprits this time were large finding aids and memory leak. We found that large finding aids (3MB+) crashed the import process when included as part of a batch import. In addition, we found a so called memory leak (i.e. successive loss of processing memory with each imported finding aid), which greatly slowed the process over time and contributed to the crash. As a result, we separated out large finding aids and imported them individually, as well as creating smaller batches of finding aids (both with respect to total number and total size) to import in stages. Just to give you some idea of the time required to import larger finding aids, we found that using a remote client to import up to a 2 MB file averaged 20-30 minutes; 2-3 MB took 30-60 minutes; 5-6 MB took 90-120 minutes.

These strategies proved effective, allowing us to import all but the largest of our finding aids (8-12 MB each), which we are currently working on. Because these present problems for our finding aid delivery system as well, one option is to split them up into multiple files, each around 2 MB. The only problem with this option is dealing with/maintaining multiple files.

For other institutions with similar numbers and sizes of finding aids these strategies may be of help to you.