Mashing it up across the border with SPRUCE

Early last week we went up to Glasgow to hold the first Mashup event of the SPRUCE Project, an initiative that I'm now working on full time since leaving the British Library a few weeks ago. It's exciting to be concentrating on just one project and I'm really enjoying the challenge of developing a community focused approach to digital preservation.

It may have been the first SPRUCE Mashup, but it was the fourth in a series of events developed as part of the AQuA Project last year. We've now got the process for our practitioner and developer driven mashing well established after all this practice. But regardless of the structure and facilitation we are of course dependent on our attendees to do all the real work. Despite being a little short of developers for our Glasgow Mashup we had some excellent results, and our event room was filled with an air of amazement as our devs demo'd a variety of impressive solutions to us on the final day.


Peter May, from the British Library, was teamed up with practitioners requiring some automated file format identification and basic characterisation as the first step in appraising and managing their digital acquisitions. Peter has been working on developing the very promising capabilities of Apache Tika as part of the SCAPE Project, and worked on wrapping Tika to recursively run through directories of content, and then visualise the results. He also followed up with a method of enabling a run through disk images. Peter has already blogged in detail about his experiences here and Rebecca Nielsen of the Bodleian has blogged about her perspective as a practitioner here.
http://wiki.opf-labs.org/display/SPR/Tika+Batch+File+Identification
Lesson learned: Modularise a solution so that each component is tightly focused on a specific purpose and is easy to maintain.
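To make the "recursively run through directories and summarise the results" idea concrete, here is a minimal sketch in Python. It is not Peter's Tika wrapper: it swaps in a tiny hand-written table of leading-byte signatures as a stand-in for Tika's far larger detection registry, and the names `SIGNATURES`, `identify` and `survey` are hypothetical, invented for illustration.

```python
import os
from collections import Counter

# Illustrative stand-in for a real identification tool such as Apache Tika:
# a handful of well-known leading-byte signatures mapped to short labels.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF"),   # little-endian TIFF
    (b"MM\x00*", "TIFF"),   # big-endian TIFF
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2"),
]


def identify(path):
    """Return a label for the file based on its first few bytes."""
    with open(path, "rb") as handle:
        header = handle.read(16)
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"


def survey(root):
    """Walk every directory under root and tally the labels found."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            counts[identify(os.path.join(dirpath, name))] += 1
    return counts
```

Calling `survey("/path/to/acquisition")` returns a `Counter` that could be fed into whatever charting or reporting step suits. Keeping identification and traversal in separate functions means the identifier can be replaced by a real tool without touching the walk.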

Larry Murray, from PRONI, brought to Glasgow what first appeared to be a thoroughly taxing challenge of extracting email attachments from .msg files. The thought of trawling through OLE2 objects even had our Mashup veteran, Maurice de Rooij, quaking in his coding boots, but it turned out not to be an impossible challenge for three days of work. The msgparser library, which utilises the wonderful Apache POI (a previous Mashup fave), provided a method of getting to the attachments. Maurice just needed to apply the library in the correct manner and deal with nested attachments. OK, maybe not that straightforward, but Maurice still came through with flying colours.
http://wiki.opf-labs.org/display/SPR/Preserving+MS+Outlook+(.msg)+E-mails+with+Attachments+-+Solution
Lesson learned: Re-use code for a quick outcome and well supported results. Don't reinvent the wheel.
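The nested attachment wrinkle is easier to see in code. The sketch below is an analogy only: it is not Maurice's solution and it does not read .msg files, which need an OLE2-aware parser such as the msgparser library mentioned above. Instead it uses Python's standard `email` package on ordinary RFC 822 messages to show the recursive step, where an attached message is opened and searched for its own attachments. The function names are hypothetical.

```python
from email import message_from_binary_file, policy


def load_message(path):
    """Parse an RFC 822 message file (e.g. .eml) with the modern email API."""
    with open(path, "rb") as handle:
        return message_from_binary_file(handle, policy=policy.default)


def collect_attachments(message, depth=0):
    """Return (depth, filename, content) for every attachment.

    An attachment that is itself a message is not returned as a file;
    it is opened and searched recursively, one level deeper.
    """
    found = []
    for part in message.iter_attachments():
        if part.get_content_type() == "message/rfc822":
            # get_content() hands back the embedded message object.
            found.extend(collect_attachments(part.get_content(), depth + 1))
        else:
            found.append((depth, part.get_filename(), part.get_content()))
    return found
```

Recording the depth alongside each file keeps enough context to rebuild which email an attachment originally belonged to.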

Herein lies the underlying principle of our Mashup events. Off-the-shelf solutions to practitioners' concrete digital preservation challenges do not usually exist. Tools created by the DP community do not usually meet practitioners' needs well. But someone, out there in the world of open source software, usually has solved something pretty close to the problem at hand. With a practitioner on hand to steer the direction, a developer can usually solve most specific digital preservation challenges in a short space of time by re-using code that's freely available on the net.

We set Helder Silva, from KEEP Solutions in Portugal (and representative at SPRUCE Glasgow from the EC funded SCAPE Project), quite possibly the hardest challenge of the event. How can we get an obsolete (e.g. "Please install IE4") Win95 application running on a modern computer? This was real digital archaeology, and a series of increasingly tricky challenges (and most critically, a lack of time) prevented Helder from reaching the finishing post. He did however capture a variety of rather useful notes on the experience. Emulation and virtualisation are still a long way from becoming a productized solution for practitioners, although this case says as much about the collecting and documentation policies of typical institutions as the challenges of building a technical solution to a puzzle that has several key pieces missing.
http://wiki.opf-labs.org/display/SPR/Creation+of+a+virtual+machine+to+ru...
Lesson learned: Documenting an experience for others to learn from is as valuable as a working solution to a specific challenge.

Dev8d veteran Andrew Amato, from the London School of Economics, was a big hit at the mashup with a practical solution for consistency checking of processed files. It turned out that a number of our practitioners were after a pretty similar solution to the problem he solved. Digital obsolescence and the evil bit rot get all the attention in the digital preservation world, but it's good old-fashioned cock-ups that often present the most common preservation challenges. Files get mislaid, a copy from one network location to another goes down and files are lost, disks get full, a processing operation omits the last batch of files, and so on! Verifiable manifests, populated with the checksum of your choice, can do a lot to address these problems. But throw in some processing or format migration and making sure you're not missing any files can be a little more tricky. Andrew's neat spreadsheet solution is potentially adaptable to different requirements and we hope to see a slightly more generic solution made available as a follow-up to his initial work.
http://wiki.opf-labs.org/display/SPR/File+management+and+matching+of+tif...
Lesson learned: A solution to what might appear to be a mundane challenge might actually be the most widely useful.
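For anyone wondering what a verifiable manifest and a post-migration completeness check might look like in practice, here is a rough Python sketch. It is not Andrew's spreadsheet solution. It assumes SHA-256 as the checksum of choice and assumes that a migrated file keeps the same relative path and name as its source, differing only in extension; all function names are hypothetical.

```python
import hashlib
from pathlib import Path


def checksum(path, algorithm="sha256"):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): checksum(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify(root, manifest):
    """Compare a directory with an earlier manifest.

    Returns (missing, changed): paths no longer present, and paths whose
    content no longer matches the recorded checksum.
    """
    current = build_manifest(root)
    missing = sorted(set(manifest) - set(current))
    changed = sorted(
        name for name in manifest
        if name in current and current[name] != manifest[name]
    )
    return missing, changed


def unmatched_sources(source_root, derived_root):
    """List source files with no derived counterpart.

    Assumption: a counterpart has the same relative path with any extension,
    e.g. scans/p1.tif is matched by scans/p1.jp2.
    """
    def stems(root):
        root = Path(root)
        return {
            path.relative_to(root).with_suffix("").as_posix()
            for path in root.rglob("*")
            if path.is_file()
        }
    return sorted(stems(source_root) - stems(derived_root))
```

Checksums cannot be compared across a format migration because the bytes change by design, which is why the second check falls back to matching on names.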

Another of our Mashup regulars is Swithun Crowe from St Andrews. Swithun was tasked with tracking down a couple of different image corruption challenges, and he made short work of both the investigation and the coding of some very promising solutions. The first problem seemed to be caused by a scanning error that introduced black pixels at the end of scans, losing much of the scanned page. Swithun knocked up a tool to count and analyse black pixels and tested it on a large sample of the images, with success.
http://wiki.opf-labs.org/display/SPR/Malformed+TIFF+images+solution
Lesson learned: Inadequate QA leads to digital preservation problems further down the line.
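As a rough illustration of the kind of black pixel counting described above, here is a small Python sketch. It is not Swithun's tool. It assumes the image has already been decoded into rows of greyscale values (0 for black), and the 10% cut-off in `looks_truncated` is an arbitrary placeholder rather than a tested threshold; all names are hypothetical.

```python
def black_fraction(rows, threshold=0):
    """Fraction of pixels at or below the threshold (0 = pure black)."""
    total = sum(len(row) for row in rows)
    black = sum(1 for row in rows for value in row if value <= threshold)
    return black / total if total else 0.0


def trailing_black_rows(rows, threshold=0):
    """Count consecutive all-black rows at the end of the image."""
    count = 0
    for row in reversed(rows):
        if all(value <= threshold for value in row):
            count += 1
        else:
            break
    return count


def looks_truncated(rows, max_trailing_fraction=0.1, threshold=0):
    """Flag images whose tail is a block of black rows.

    The default cut-off is an assumption for illustration only.
    """
    if not rows:
        return False
    return trailing_black_rows(rows, threshold) / len(rows) > max_trailing_fraction
```

Looking at trailing rows rather than the overall black count helps separate a scan that stopped early from a page that simply contains a lot of dark ink.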

Swithun's approach to the second (and harder) challenge from the British Library was a masterclass in breaking down a seemingly near-impossible challenge into manageable pieces. It's not the first time we'd looked at this issue in one of our Mashups, so it was great to see a really promising solution that was identifying bad files in our sample set without any false positives. Excited to see the results of a full run over the collection in the near future...
http://wiki.opf-labs.org/display/SPR/Corrupted+JPEG+and+JPEG2000+files+s...
Lesson learned: Break down a tough challenge into bite-sized pieces.

Paddy McCann, of HATII and the DCC, took on a slightly different challenge from the M3P initiative at the University of Hull. M3P is capturing and preserving experiences and memories about music from Malta, but has a challenge with engaging its target audience to archive their content in the M3P wiki. Paddy explored a method of letting a user pull content from Facebook into the M3P MediaWiki, and has blogged about his experiences here. The resulting proof of concept won the best developer prize, as voted by the Mashup attendees themselves (Paddy is pictured here with Jen Mitcham of the ADS, who won the best practitioner prize).
http://wiki.opf-labs.org/display/SPR/Extracting+content+from+Facebook+to...
Lesson learned: Developing a proof of concept can be a useful approach for assessing the viability and effort required to develop a full solution to a complex challenge.

Carl Wilson, from the British Library, was responsible for technical facilitation at the Mashup, but had time to run a little cross-platform experiment with the Bagger front end for LoC's excellent BagIt software. There were a few build issues but Carl made it to the end, successfully bagging on one platform and unbagging on another using self-compiled versions of the software. There was a real mix of awareness of Bagger at our event (some satisfied users, some who had never heard of it), and we wondered if some extra publicity for this useful tool could be in order. I'm also interested to know if non-technical practitioners find it meets their needs. Is there a requirement for a cut-down version with a really simple interface for some use cases?
http://wiki.opf-labs.org/display/SPR/Checking+the+Authenticity+and+Integ...
Lesson learned: Make it as easy as possible to build and maintain your source code. SCAPE (in documenting an approach to packaging techniques) and OPF (with these excellent guidelines) are good starting points for more information on doing this well.

Our Mashups rely on frequent checkpoints between practitioners and developers in order to ensure developments are on track and are focused tightly on what the practitioners actually need. But while our devs are hacking away we run different sessions for our practitioners. In Glasgow we focused our practitioners' efforts on building business cases for their digital preservation activities. As well as helping focus their minds on selling what they do, it also provides us with some raw materials with which to build a generic business case for digital preservation, a tool which will hopefully prove useful for other practitioners. As with our requirements and hacking work, we captured these event outputs on a wiki. The process of generating them was quite fun, and was useful in stimulating discussion between attendees on a whole variety of topics. We wrapped up with an "elevator pitch", which provided a useful way of drawing together the other business plan elements in a one-minute pitch of all the salient points to a fictional senior manager.

We'll be running more events this year and next, and we'll be advertising the dates shortly if you'd like to get involved.
