There is a trend in digital preservation circles to question the need for migration. The argument varies a little from proponent to proponent but in essence, it states that software exists (and will continue to exist) that will read (and perform requisite functions, e.g., render) old formats. Hence, proponents conclude, there is no need for migration. I had thought it was a view held by a minority but at a recent workshop it became apparent that it has been accepted by many.
However, I’ve never thought this was a very strong argument. I’ve always seen a piece of software that can deal with both new and old formats as really just a piece of software that deals with new formats, with a migration tool seamlessly bolted onto the front of it. In essence, it is like saying I don’t need a migration tool and a separate rendering tool because I have a combined migration and rendering tool. Clearly that’s fine, but it doesn’t mean you’re not performing a migration.
As I see it, whenever a piece of software is used to interpret a non-native format it will need to perform some form of transformation from the information model inherent in the format to the information model used in the software. It can then perform a number of subsequent operations, e.g., render to the screen or maybe even save to a native format of that software. (If the latter happens this would, of course, be a migration.)
Clearly the way software behaves is infinitely variable, but it seems fair to say that there will normally be a greater risk of information loss in the first operation (the transformation between information models) than in subsequent operations, which are likely to utilise the information model inherent in the software (be it rendering or saving in the native format). Hence, if we are concerned with whether or not we are seeing a faithful representation of the original, it is the transformation step that should be verified.
This is where using a separate migration tool comes into its own (at least in principle). The point is that it allows an independent check of the quality of the transformation (by comparing the significant properties of the files before and after). Subsequent use of the migrated file (e.g., by a rendering tool) is assumed to be lossless (or at least less lossy), since you can choose the migrated format to be the native format of the tool you intend to use subsequently (meaning that when the file is read, no transformation of information model is required).
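In principle such a check can be as simple as extracting the significant properties from both files and diffing them. A minimal sketch in Python (the property names and values here are invented for illustration, not taken from any real characterisation tool):

```python
def changed_properties(before: dict, after: dict) -> dict:
    """Return the significant properties whose values differ between the
    original file and its migrated counterpart."""
    keys = set(before) | set(after)
    return {k: (before.get(k), after.get(k))
            for k in keys if before.get(k) != after.get(k)}

# Hypothetical properties extracted before and after a migration.
source = {"width": 2480, "height": 3508, "bit_depth": 8}
migrated = {"width": 2480, "height": 3508, "bit_depth": 1}

print(changed_properties(source, migrated))  # {'bit_depth': (8, 1)}
```

A real validation would of course depend on a well-defined set of significant properties per object type, which is exactly the gap discussed below.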
However, I would concede that there are some pragmatic things to consider...
First of all, migration either has a cost (if it requires the migrated file to be stored) or is slow (if it is done on demand). Hence, there are probably cases where simply using a combined migration and rendering tool is a more convenient solution and might be good enough.
Secondly, is migration validation worth the effort? Certainly it is worth testing, say, a rendering tool with some example files before deciding to use it, and most of the time that should be sufficient to determine that the tool works without detailed validation. However, we have seen cases where migration validation detects uncommon issues in common migration libraries, so it does catch problems that would go unnoticed if the same libraries were used in a combined migration and rendering tool.
Thirdly, is migration validation comprehensive enough? The answer to this depends on the formats but for some (even common) formats it is clear that better, more comprehensive tools would do a better job. Of course the hope is that this will continually improve over time.
So, to conclude, I do see migration as a valid technique (and in fact a technique that almost everyone uses even if they don’t realise it). I see one of the aims of the digital preservation community should be to provide an intellectually sound view of what constitutes a high quality migration (e.g., through a comprehensive view of significant properties across a wide range of object types). It might be that real-life tools provide some pragmatic approximation to this idealistic vision (potentially using short cuts like using a combined migration and rendering tool) but we should at least understand and be able to express what these short cuts are.
I hope this post helps to generate some useful debate.
Some time ago Will Palmer, Peter May and Peter Cliff of the British Library published a really interesting paper that investigated three different JPEG 2000 codecs, and their effects on image quality in response to lossy compression. Most remarkably, their analysis revealed differences not only in the way these codecs encode (compress) an image, but also in the decoding phase. In other words: reading the same lossy JP2 produced different results depending on which implementation was used to decode it.
A limitation of the paper's methodology is that it obscures the individual effects of the encoding and decoding components, since both are essentially lumped together in the analysis. Thus, it's not clear how much of the observed degradation in image quality is caused by the compression, and how much by the decoding. This made me wonder how similar the decode results of different codecs really are.

An experiment
To find out, I ran a simple experiment:
- Encode a TIFF image to JP2.
- Decode the JP2 back to TIFF using different decoders.
- Compare the decode results using some similarity measure.
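The similarity measure used below is PSNR. As a rough sketch of how it is computed (this is my own minimal implementation on flat pixel lists, not the tool used to produce the numbers in this post):

```python
import math

def psnr(a, b, max_val=255):
    """Peak signal-to-noise ratio between two equal-sized pixel sequences.
    Higher values mean more similar images; identical images give infinity."""
    if len(a) != len(b):
        raise ValueError("images must have the same dimensions")
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    if mse == 0:
        return float("inf")
    return 10 * math.log10(max_val ** 2 / mse)
```

In practice one would read the TIFFs with an imaging library and feed the pixel data to a function like this (or use an existing implementation).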
I used the following codecs:
- Kakadu v7.2.2 (kakadu)
- OpenJPEG 2.0 (opj20)
- ImageMagick 6.8.9-8 (im)
- GraphicsMagick 1.3.18 (gm)
- IrfanView 4.35 with JPEG2000 plugin 4.33 (irfan)
First I compressed my source TIFF (a grayscale newspaper page) to a lossy JP2 with a compression ratio of about 4:1. For this example I used OpenJPEG, with the following command line:

opj_compress -i krant.tif -o krant_oj_4.jp2 -r 4 -I -p RPCL -n 7 -c [256,256],[256,256],[256,256],[256,256],[256,256],[256,256],[256,256] -b 64,64

Decoding the JP2
Next I decoded this image back to TIFF using the aforementioned codecs. I used the following command lines:

Codec            Command line
opj20            opj_decompress -i krant_oj_4.jp2 -o krant_oj_4_oj.tif
kakadu           kdu_expand -i krant_oj_4.jp2 -o krant_oj_4_kdu.tif
kakadu-precise   kdu_expand -i krant_oj_4.jp2 -o krant_oj_4_kdu_precise.tif -precise
irfan            Used GUI
im               convert krant_oj_4.jp2 krant_oj_4_im.tif
gm               gm convert krant_oj_4.jp2 krant_oj_4_gm.tif
This resulted in 6 images. Note that I ran Kakadu twice: once using the default settings, and once with the -precise switch, which "forces the use of 32-bit representations".

Overall image quality
As a first analysis step I computed the overall peak signal to noise ratio (PSNR) for each decoded image, relative to the source TIFF:

Decoder          PSNR
opj20            48.08
kakadu           48.01
kakadu-precise   48.08
irfan            48.08
im               48.08
gm               48.07
So relative to the source image these results are only marginally different.

Similarity of decoded images
But let's have a closer look at how similar the different decoded images are. I did this by computing PSNR values for all possible decoder pairs. This produced the following matrix:

Decoder          opj20   kakadu  kakadu-precise  irfan   im      gm
opj20            -       57.52   78.53           79.17   96.35   64.43
kakadu           57.52   -       57.51           57.52   57.52   57.23
kakadu-precise   78.53   57.51   -               79.00   78.53   64.52
irfan            79.17   57.52   79.00           -       79.18   64.44
im               96.35   57.52   78.53           79.18   -       64.43
gm               64.43   57.23   64.52           64.44   64.43   -
Note that, unlike the table in the previous section, these PSNR values are only a measure of the similarity between the different decoder results. They don't directly say anything about quality (since we're not comparing against the source image). Interestingly, the PSNR values in the matrix show two clear groups:
- Group A: all combinations of OpenJPEG, Irfanview, ImageMagick and Kakadu in precise mode, all with a PSNR of > 78 dB.
- Group B: all remaining decoder combinations, with a PSNR of < 64 dB.
What this means is that OpenJPEG, Irfanview, ImageMagick and Kakadu in precise mode all decode the image in a similar way, whereas Kakadu (default mode) and GraphicsMagick behave differently. Another way of looking at this is to count the pixels that have different values for each combination. This yields up to 2 % different pixels for all combinations in group A, and about 12 % in group B. Finally, we can look at the peak absolute error value (PAE) of each combination, which is the maximum value difference for any pixel in the image. This figure was 1 pixel level (0.4 % of the full range) in both groups.
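The pixel-level comparisons can be sketched as follows (again my own illustrative code, operating on flat lists of pixel values rather than real TIFFs; the decoder names and values are made up):

```python
from itertools import combinations

def pae(a, b):
    """Peak absolute error: the largest per-pixel value difference."""
    return max(abs(x - y) for x, y in zip(a, b))

def pct_different(a, b):
    """Percentage of pixels whose values differ between two decodes."""
    return 100.0 * sum(x != y for x, y in zip(a, b)) / len(a)

# Hypothetical decoder outputs for a 6-pixel image.
decoded = {
    "decoder_a": [10, 20, 30, 40, 50, 60],
    "decoder_b": [10, 21, 30, 40, 50, 60],
    "decoder_c": [11, 21, 30, 40, 51, 60],
}
for (n1, p1), (n2, p2) in combinations(decoded.items(), 2):
    print(f"{n1} vs {n2}: PAE={pae(p1, p2)}, {pct_different(p1, p2):.1f}% differ")
```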
I also repeated the above procedure for a small RGB image. In this case I used Kakadu as the encoder. The decoding results of that experiment showed the same overall pattern, although the differences between groups A and B were even more pronounced, with PAE values in group B reaching up to 3 pixel values (1.2 % of full range) for some decoder combinations.

What does this say about decoding quality?
It would be tempting to conclude from this that the codecs that make up group A provide better quality decoding than the others (GraphicsMagick, Kakadu in default mode). If this were true, one would expect that the overall PSNR values relative to the source TIFF (see previous table) would be higher for those codecs. But the values in the table are only marginally different. Also, in the test on the small RGB image, running Kakadu in precise mode lowered the overall PSNR value (although by a tiny amount). Such small effects could be due to chance, and for a conclusive answer one would need to repeat the experiment for a large number of images, and test the PSNR differences for statistical significance (as was done in the BL analysis).
I'm still somewhat surprised that even in group A the decoding results aren't identical, but I suspect this has something to do with small rounding errors that arise during the decode process (maybe someone with a better understanding of the mathematical intricacies of JPEG 2000 decoding can comment on this). Overall, these results suggest that the errors that are introduced by the decode step are very small when compared against the encode errors.

Conclusions
OpenJPEG, (recent versions of) ImageMagick, IrfanView and Kakadu in precise mode all produce similar results when decoding lossily compressed JP2s, whereas Kakadu in default mode and GraphicsMagick (which uses the JasPer library) behave differently. These differences are very small when compared to the errors that are introduced by the encoding step, but for critical decode applications (e.g., migrating a lossy JP2 to some other format) they may still be significant. As both ImageMagick and GraphicsMagick are often used for calculating image (quality) statistics, the observed differences also affect the outcome of such analyses: calculating PSNR for a JP2 with ImageMagick and GraphicsMagick results in two different outcomes!
For losslessly compressed JP2s, the decode results for all tested codecs are 100% identical.
This tentative analysis does not support any conclusions on which decoders are 'better'. That would need additional tests with more images. I don't have time for that myself, but I'd be happy to see others have a go at this!
Digital was everywhere at this year’s Society of American Archivists annual meeting. What is particularly exciting is that many of these sessions were practical and pragmatic. That is, many sessions focused on exactly how archivists are meeting the challenge of born-digital records.
In one such session, Sibyl Schaefer, Head of Digital Programs at the Rockefeller Archive Center, offered such advice. I am excited to discuss some of the themes from her talk, “We’re All Digital Archivists: Digital Forensic Techniques in Everyday Practice,” here as part of the ongoing Insights Interview series.
Trevor: Could you unpack the title of your talk a bit for us? Why exactly is it time for all archivists to be digital archivists? What does that mean to you in practice?
Sibyl: We don’t all need to be digital archivists, but we do need to be archivists who work with digital materials. It’s not scalable to have one person, or one team, focus on the “digital stuff.” When I was first considering how to structure the Digital Team (or D-Team) at the RAC, it crossed my mind to mirror the structure of my organization, which is based on the main functions of an archive: collection development, accessioning, preservation, description, and access. I quickly realized that integrating digital practices into existing functions was essential.
The archivists at my institution take great pride in their knowledge of the collections, and not tapping into that knowledge would disadvantage the digital collections. We also don’t have many purely digital collections; the vast majority are hybrid. It wouldn’t make sense for one person to arrange and describe analog materials and another the digital materials. The principles of arrangement and description don’t change due to the format of the materials. Our archivists just need guidance in how to be effective in handling digital records, they need experience using tools so they feel comfortable with them, and they need someone available to ask if they have questions. So the digital archivists on my team are figuring out which software and tools to adopt, which workflows are the most efficient, and how to best educate the rest of the staff so they can do the actual archival work. The digital archivists aren’t actually doing traditional archival work and in that sense, “digital archivist” is a misnomer.
Trevor: If an archivist wants to get caught up-to-speed on the state and role of digital forensics for his or her work, what would you suggest they read/review? Further, what about these works do you see as particularly important?
Sibyl: The CLIR report, “Digital Forensics and Born-Digital Content in Cultural Heritage Collections,” is an excellent place to start. It clearly outlines what is gained by using forensics techniques in archival practice: namely the ability to capture digital archival materials in a secure manner that preserves more of their context and original order. These techniques also allow archivists to search through and review those materials without worrying about inadvertently altering them and affecting their authenticity.
I was ecstatic when I first saw Peter Chan’s YouTube video on processing born-digital materials using the Forensic ToolKit software. It was the first time I saw how functionality in FTK could be mapped to traditional processing activities: weeding duplicates, identifying Personally Identifiable Information and restricted records, arranging materials hierarchically, etc. It really answers the question of “So you have a disk image, now what do you do with it?” It also conveyed that the program could be picked up fairly easily by processing archivists.
The “From Bitstreams to Heritage: Putting Digital Forensics into Practice in Collecting Institutions” report (pdf) provides a really good overview of the recent activities in this area and a practical analysis of some of the capabilities and limitations of the forensics tools available.
Trevor: Could you tell us a bit about how the digital team works at the Rockefeller Archive Center? What kinds of roles do people take in the staff? How does the team fit into the structure of the Archive? How do you define the services you provide?
Sibyl: My team takes a user-centered approach in fulfilling our mission of leveraging technology to support all our program areas. We generally start by identifying a need for new technology, whether it be to place our finding aids online, create digital exhibits for our donors, preserve the context and authenticity of materials as they move from one physical medium to another, or increase our efficiency in managing researcher requests. We then try to involve users — both internal and external — as much as possible throughout the process. This involvement is crucial given that we usually aren’t the primary users of the software we implement.
One archivist focuses on delivery and access, which includes managing our online finding aid delivery system, as well as working very closely with our reference staff to develop and integrate tools that will help increase the efficiency of their work. Another team member is focused on digitization and metadata projects which includes things like scanning and outsourced digitization projects, as well as migrating from the Archivists’ Toolkit to ArchivesSpace. We just hired a new digital archivist to really delve into the digital forensics work I discussed in my presentation at SAA. She will be disk imaging and teaching our processing archivists to use FTK for description. In addition to overseeing the work of all the team members, I interface with our donor institutions, create policies and procedures, set team priorities and oversee our digital preservation system.
As I mentioned before, the RAC is divided up into five different archival functional areas: donor services, collections management, processing, reference and the digital team. Certain services, like digital preservation and digital duplication for special projects, are within our realm of responsibility, while with others we take a more advisory role. For example, we’re in the midst of an Aeon special collections management tool implementation, and although we won’t be internally hosting the server, we are helping our reference staff articulate and revise their workflows to take advantage of the efficiencies that system enables.
Our services are quite loosely defined; one of our program goals is to “leverage technology in an innovative way in support of all RAC program areas.” This gives us a lot of leeway in what we choose to do. I prioritize our preservation work based on risk and our systems work based on an evaluation of institutional priorities. For example, over the last year the RAC has been trying to increase the efficiency of our reference services, so we evaluated their workflows, replaced an unscalable method for organizing reference interactions with a user-friendly ticketing system, and are now aiding with the Aeon implementation.
Trevor: Could you tell us a bit about the workflow you have put in place to implement digital forensics in processing digital records? What roles do members of your team play and what roles do others play in that workflow?
Sibyl: My team takes care of inventorying removable media, creating disk images, running virus checks on those images, and providing them to the processing staff for analysis and description. Processing staff then identifies duplicates, restricted materials, and materials that contain PII. They arrange and describe materials within FTK. When they have finished, they notify the D-Team and we add the description to the Archivists’ Toolkit (or ArchivesSpace — we’re preparing to transition over soon) and ingest those files and related metadata into Archivematica.
There are a lot of details we still need to add that will greatly increase the complexity of the process, and some of them will require actual policy decisions. For example, the question of redaction comes up every time I review this process with our archivists. Redaction can be pretty straightforward with certain file formats, but definitely not with all. Also, how do we tell our researchers that information has been redacted? We need a policy that clearly outlines when we redact information (for materials going online? for use in the reading room?), what types of information we redact, and what types of files can securely be redacted.
Trevor: As your process is established and refined, what do you see as the future role and place of the digital team within the archive? That is, what things are on the horizon for you and your team?
Sibyl: In the years since I joined the RAC we’ve placed our finding aids and digital objects online in an access system, architected a system for digital preservation, and configured forensics workflows. Now that we’ve got that foundation for managing and accessing our digital materials, I want to start embodying our goals to be innovative and leaders in the field. One area I think we can contribute to is integrating systems. For example, we’re launching a new project with Artefactual, the developers of Archivematica, to create a self-submission mechanism for donors to transfer records to us. Part of the project includes integrating ArchivesSpace with Archivematica. How cool would it be to have an accession record automatically created in ArchivesSpace when a donor transfers materials to our Archivematica instance?
Likewise, I’ve been talking with a few people about using data in FTK to create interactive interfaces for researchers. We could use directory data captured during imaging or created during analysis (like labeling materials “restricted”) to recreate (but not necessarily emulate) the way files were originally organized, including listing deleted and duplicate files and then linking that directly to their final, archival organization. The researcher would be able to see how the files were originally organized by the donor and what is missing (or restricted) from what is presented as the final archival organization. I get giddy when I think of how we can use technology to increase the transparency of what happens during archival processing. I’m also excited about the prospect of working EAC-CPF records into our discovery interface to bolster our description.
We also have a great deal of less innovative but very necessary tasks ahead of us. We need to implement a DAMS to help corral the digitized materials that are created on request and also to provide more granular permissions to materials than what we currently have. We need to create and implement policies to fill in gaps in our policy framework and inch towards TRAC compliance. And lastly, we need to systematize our preservation planning. We have a lot of work to keep us busy! That said, it’s a really great time to be in the archival field. Digital materials may present new and complex challenges, but we also have a chance to be creative and innovative with systems design and applying traditional archival practices to new workflows.
Now that we are entering the final days of the SCAPE project, we would like to highlight some SCAPE Quality Assurance tools that have an online demonstrator.
See http://scape.demos.opf-labs.org/ for the following tools:
- Pagelyzer: compares web pages. Monitor your web content.
- Jpylyzer: validates images. JP2K validator and properties extractor.
- Xcorr-sound: compares audio sounds. Improve your digital audio recordings.
- Flint: validates different files and formats. Validate PDF/EPUB files against an institutional policy.
- Matchbox: compares documents (following soon). Duplicate image detection tool.
For more info on these and other tools and the SCAPE project, see http://scape.usb.opf-labs.org for the content of our SCAPE info USB stick.
My name is Ed Fay, I’m the Executive Director of the Open Planets Foundation.

Tell us a bit about your role in SCAPE and what SCAPE work you are involved in right now?
OPF has been involved in technical and take-up work all the way through the project, but right now we’re focused on sustainability – what happens to all the great results that have been produced after the end of the project.

Why is your organisation involved in SCAPE?
OPF has been responsible for leading the sustainability work and will provide a long-term home for the outputs, preserving the software and providing an ongoing collaboration of project partners and others on best practices and other learning. OPF members include many institutions who have not been part of SCAPE but who have an interest in continuing to develop the products, and through the work that has been done - for example on software maturity and training materials - OPF can help to lower barriers to adoption by these institutions and others.

What are the biggest challenges in SCAPE as you see it?
The biggest challenge in sustainability is identifying a collaboration model that can persist outside of project funding. As cultural heritage budgets are squeezed around the world and institutions adapt to a rapidly changing digital environment the community needs to make best use of the massive investment in R&D that has been made, by bodies such as the EC in projects such as SCAPE. OPF is a sustainable membership organisation which is helping to answer these challenges for its members and provide effective and efficient routes to implementing the necessary changes to working practices and infrastructure. In 20 years we won’t be asking how to sustain work such as this – it will be business as usual for memory institutions everywhere – but right now the digital future is far from evenly distributed.
But from the SCAPE perspective we have a robust plan which encompasses many different routes to adoption, which is of course the ultimate route to sustainability – production use of the outputs by the community for which they were intended. The fact that many outputs are already in active use – as open-source tools and embedded into commercial systems – shows that SCAPE has produced not only great research but mature products which are ready to be put to work in real-world situations.

What do you think will be the most valuable outcome of SCAPE?
This is very difficult for me to answer! Right now OPF has the privileged perspective of transferring everything that has matured during the project into our stewardship - from initial research, through development, and now into mature products which are ready for the community. So my expectation is that there are lots of valuable outputs which are not only relevant in the context of SCAPE but also as independent components. One particular product has already been shortlisted for the Digital Preservation Awards 2014 which is being co-sponsored by OPF this year while others have won awards at DL2014. These might be the most visible in receiving accolades, but there are many other tools and services which provide the opportunity to enhance digital preservation practice within a broad range of institutions. I think the fact that SCAPE is truly cross-domain is very exciting – working with scientific data, cultural heritage, web harvesting – it shows that digital preservation is truly maturing as a discipline.
If there could be one thing to come out of this, it would be an understanding of how to continue the outstanding collaboration that SCAPE has enabled, to sustain cost-effective digital preservation solutions that can be adopted by institutions of all sizes and diversity.
Weirder than old: The CP/M File System and Legacy Disk Extracts for New Zealand’s Department of Conservation
We’ve been doing legacy disk extracts at Archives New Zealand for a number of years, with much of the groundwork enabling this done by colleague Mick Crouch and former Archives New Zealand colleague Euan Cochran. Earlier this year, we received some disks from New Zealand’s Department of Conservation (DoC), which we successfully imaged, extracting what was needed by the department. While it was a fairly straightforward exercise, there was enough of interest about it to warrant this blog post documenting another facet of the digital preservation work we’re doing, in the spirit of providing another war story that others in the community can refer to. We conclude with a few thoughts about where we still relied on a little luck, which we’ll have to keep in mind moving forward.
We received 32 180 KB 5.25-inch disks from DoC: Maxell MD1-D, single-sided, double-density, containing what we expected to be survey data circa 1984/1985.
Our goal with these disks, as with any that we are finding outside of a managed records system, is to transfer the data to a more stable medium, as disk images, and then extract the objects on the imaged file system to enable further appraisal. From there a decision will be made about how much more effort should be put into preserving the content and making suitable access copies of whatever we have found – a triage.
For agencies with 3.5-inch floppy disks, we normally help to develop a workflow within that organisation that enables them to manage this work for themselves using more ubiquitous 3.5-inch USB disk drives. With 5.25-inch disks it is more difficult to find suitable floppy disk drive controllers so we try our best at Archives to do this work on behalf of colleagues using equipment we’ve set up using the KryoFlux Universal USB floppy disk controller. The device enables the write-blocked reading, and imaging of legacy disk formats at a forensic level, using modern PC equipment.
We create disk images of the floppies using the KryoFlux and continue to use those images as a master copy for further triage. A rough outline of the process we tend to follow, plus some of its rationale is documented by Euan Cochran in his Open Planets Foundation blog: Bulk disk imaging and disk-format identification with KryoFlux.
Through a small amount of trial and error we discovered that the image format with which we could read the most sectors without error was MFM (Modified Frequency Modulation / Magnetic Force Microscopy), with the following settings:

Image Type: MFM Sector Image
Start Track: At least 0
End Track: At most 83
Side Mode: Side 0
Sector Size: 256 Bytes
Sector Count: Any
Track Distance: 40 Tracks
Target RPM: By Image type
Flippy Mode: Off
We didn’t experiment to see if these settings could be further optimised as we found a good result. The non-default settings in the case of these disks were side mode zero, sector size 256 bytes, track distance at 40, and flippy mode was turned off.
Taken away from volatile and unstable media, we now have binary objects that we can attach fixity to and treat using more common digital preservation workflows. We managed to read 30 out of 32 disks.

Exploding the Disk Images
Successful imaging alone doesn’t guarantee ease of mounting. We still needed to understand the underlying file system.
The images that we’ve seen before have been FAT12 and mount with ease in MS-DOS or Linux. These disks did not share the same identifying signatures at the beginning of the bitstream. We needed a little help in identifying them and fortunately through forensic investigation, and a little experience demonstrated by a colleague, it was quite clear the disks were CP/M formatted; the following ASCII text appearing as-is in the bitstream:
*************************
*     MIC-501 V1.6      *
*   62K CP/M VERS 2.2   *
*************************
COPYRIGHT 1983, MULTITECH
BIOS VERS 1.6
CP/M (Control Program for Microcomputers) is an operating system of the 1970s and early 1980s for early Intel microcomputers. The makers of the operating system were approached by IBM about licensing CP/M for their Personal Computer product, but talks failed, and IBM went with MS-DOS from Microsoft; the rest is ancient history…
With the knowledge that we were looking at a CP/M file system we were able to source a mechanism to mount the disks in Windows. Cpmtools is a privately maintained suite of utilities for interacting with CP/M file systems. It was developed for working with CP/M in emulated environments, but works with floppy disks, and disk images equally well. The tool is available in Windows and POSIX compliant systems.
Commands for the different utilities look like the following:
Creating a directory listing:

C:\> cpmls -f bw12 disk-images\disk-one.img
This will list the user number (a CP/M concept), and the files objects belonging to that user.
E.g.:

0: File1.txt File2.txt
Extracting objects based on user number:

C:\> cpmcp -f bw12 -p -t disk-images\disk-one.img 0:* output-dir
This will extract all objects collected logically under user 0: and put them into an output directory.
Finding the right commands was a little tricky at first, but once the correct set of arguments were found, it was straightforward to keep repeating them for each of the disks.
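Once the right arguments were found, repeating them for each disk is easily scripted. A minimal batch sketch in Python, assuming cpmtools’ cpmcp is on the PATH and the disk-images layout shown above (the function names are ours):

```python
import subprocess
from pathlib import Path

def cpmcp_command(img: Path, out: Path) -> list:
    # Same arguments as the single-disk example above.
    return ["cpmcp", "-f", "bw12", "-p", "-t", str(img), "0:*", str(out)]

def extract_all(image_dir: str, out_root: str) -> None:
    # Run the extraction once per disk image, one output folder per disk.
    for img in sorted(Path(image_dir).glob("*.img")):
        out = Path(out_root) / img.stem
        out.mkdir(parents=True, exist_ok=True)
        subprocess.run(cpmcp_command(img, out), check=True)
```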
One of the less intuitive values supplied on the command line was the ‘bw12’ disk definition. This refers to a definition file, detailing the layout of the disk. The definition that worked best for our disks was the following:

# Bondwell 12 and 14 disk images in IMD raw binary format
diskdef bw12
  seclen 256
  tracks 40
  sectrk 18
  blocksize 2048
  maxdir 64
  skew 1
  boottrk 2
  os 2.2
end
The majority of the disks extracted well. A small, on-image modification we made was the conversion of filenames containing forward slashes. The forward slashes did not play well with Windows, and so I took the decision to change the slashes to hashes in hex to ensure the objects were safely extracted into the output directory.

WordStar and other bits ‘n’ pieces
Content on the disks was primarily WordStar – CP/M’s flavour of word processor. Although MS-DOS versions of WordStar existed, the program eventually lost market share to WordPerfect in the 1980s, almost in parallel with the demise of CP/M. It took a little searching to source a converter to turn the WordStar content into something more useful, but we did find something fairly quickly. The biggest issue in viewing WordStar content as-is in a standard text editor is the format’s use of the high-order bits within individual bytes to define word boundaries, as well as to make other denotations.
Example text, read verbatim might look like:
thå southerî coasô = the southern coast
At first, I was sure this was a sign of bit-flipping on less stable media. Again, the experience colleagues had with older formats was useful here, and a consultation with Google soon helped me to understand what we were seeing.
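What we were seeing was that high-order bit in use, so recovering readable text amounts to masking each byte back to seven bits. A minimal sketch (the function name is ours; a real converter also has to deal with WordStar’s control codes and dot commands):

```python
def strip_high_bits(data: bytes) -> bytes:
    # WordStar sets the top bit on certain bytes (e.g. at word boundaries);
    # masking with 0x7F restores the underlying 7-bit ASCII character.
    return bytes(b & 0x7F for b in data)

sample = "thå southerî coasô".encode("latin-1")
print(strip_high_bits(sample).decode("ascii"))  # the southern coast
```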
Looking for various readers or migration tools led me to a number of dead websites, but the Internet Archive came to the rescue and allowed us to see them: WordStar to other format solutions.
The tool we ended up using was the HABit WordStar Converter, with more information on Softpedia.com. It does bulk conversion of WordStar to plain text or HTML. We didn’t have to worry too much about how faithful the representation would be; as this was just a triage, we were more interested in the intellectual value of the content, or data. Rudimentary preservation of layout would be enough. We were very happy with plain text output, with the option of HTML output too.
Unfortunately, when we approached Henry Bartlett, the developer of the tool, about a small bug in the bulk conversion (the tool neutralises file format extensions on output of the text file, causing naming collisions), we were informed by his wife that he had sadly passed away. I hoped it would prove some reassurance to her to know that, at the very least, his work is still of great use to a good number of people doing format research, and to those who will eventually consume the objects that we’re working on.
Conversion was still a little more manual than we would have liked had we been dealing with larger numbers of files, but everything ran smoothly. Each of the deliverables was collected and taken back to the parent department on a USB stick along with the original 3.25-inch disks.
We await further news from DoC about what they’re planning on doing with the extracts next.

Conclusions
The research to complete this work took a couple of weeks overall. With more dedicated time it might have taken a week.
On completion, and delivery to the Department of Conservation, we’ve since run through the same process on another set of disks. This took a fraction of the time – possibly an afternoon. The process can be refined with each further iteration.
The next step is to understand the value in what was extracted. This might mean using the extract to source printed copies of the content and understanding that we can dispose of these disks and their content. An even better result might be discovering that there are no other copies of the material and these digital objects can become records in their own right with potential for long term retention. At the very least those conversations can now begin. In the latter instance, we’ll need to understand what out of the various deliverables, i.e. the disk images; the extracted objects; and the migrated objects, will be considered the record.
Demonstrable value acts like a weight on the scales of digital preservation where we try and balance effort with value; especially in this instance, where the purpose of the digital material is yet unknown. This case study is borne from an air-gap in the recordkeeping process that sees the parent department attempting to understand the information in its possession in lieu of other recordkeeping metadata.
Aside from the value in what was extracted, there is still a benefit to us as an archive, and as a team in working with old technology, and equipment. Knowledge gained here will likely prove useful somewhere else down the line.
Identifying the file system could have been a little easier, and so we’d echo the call from Euan in the aforementioned blog post to have identification mechanisms for image formats in DROID-like tools.
Forensic analysis of the disk images and comparing that data to that extracted by CP/M Tools showed a certain amount of data remanence, that is, data that only exists forensically on the disk. It was extremely tempting to do more work with this, but we settled for notifying our contact at DoC, and thus far, we haven’t been called on to extract it.
We required a number of tools to perform this work. How we maintain the knowledge of this work, and maintain the tools used are two important questions. I haven’t an answer for the latter, while this blog serves in some way as documentation of the former.
What we received from DoC was old, but it wasn’t a problem that it was old. The right tools enabled this work to be done fairly easily – that goes for any organisation willing to put modest tools such as the KryoFlux, and other legacy equipment, in the hands of their analysts and researchers. The disks were in good shape too. The curveball in this instance was that some of the pieces of the puzzle that we were interacting with were weirder than expected: a slightly different file system, and a word processing format that encoded data in an unexpected way, making 1:1 extraction and use a little more difficult. We got around it though. And indeed, as it stands, this wasn’t a preservation exercise; it was a low-cost and pragmatic exercise to support appraisal, continuity, and potential future preservation. The files have been delivered to DoC in their various forms: disk images, extracted objects, and migrated objects. We’ll await a further nod from them to understand where we go next.
Since 1996 the electronic journal Kairos has published a diverse range of webtexts: scholarly pieces made up of a range of media and hypermedia. The 18 years of digital journal texts are interesting both in their own right and as a collection of complex works of digital scholarship that illustrate a range of sophisticated issues for ensuring long-term access to new modes of publication. Douglas Eyman, Associate Professor of Writing and Rhetoric at George Mason University, is senior editor and publisher of Kairos. Cheryl E. Ball, associate professor of digital publishing studies at West Virginia University, is editor of Kairos. In this Insights Interview, I am excited to learn about the kinds of issues that this body of work exposes for considering long-term access to born-digital modes of scholarship. [There was also a presentation on Kairos at the Digital Preservation 2014 meeting.]
Trevor: Could you describe Kairos a bit for folks who aren’t familiar with it? In particular, could you tell us a bit about what webtexts are and how the journal functions and operates?
Doug: Webtexts are texts that are designed to take advantage of the web-as-concept, web-as-medium, and web-as-platform. Webtexts should engage a range of media and modes and the design choices made by the webtext author or authors should be an integral part of the overall argument being presented. One of our goals (that we’ve met with some success I think) is to publish works that can’t be printed out — that is, we don’t accept traditional print-oriented articles and we don’t post PDFs. We publish scholarly webtexts that address theoretical, methodological or pedagogical issues which surface at the intersections of rhetoric and technology, with a strong interest in the teaching of writing and rhetoric in digital venues.
(As an aside, there was a debate in 1997-98 about whether we were publishing hypertexts, which then tended to be available in proprietary formats and platforms and not freely available on the WWW or not; founding editor Mick Doherty argued that we were publishing much more than only hypertexts, so we moved from calling what we published ‘hypertexts’ to ‘webtexts’ — Mick tells that story in the 3.1 loggingon column).
Cheryl: WDS (What Doug said). One of the ways I explain webtexts to potential authors and administrators is that the design of a webtext should, ideally, enact authors’ scholarly arguments, so that the form and content of the work are inseparable.
Doug: The journal was started by an intrepid group of graduate students, and we’ve kept a fairly DIY approach since that first issue appeared on New Year’s day in 1996. All of our staff contribute their time and talents and help us to publish innovative work in return for professional/field recognition, so we are able to sustain a complex venture with a fairly unique economic model where the journal neither takes in nor spends any funds. We also don’t belong to any parent organization or institution, and this allows us to be flexible in terms of how the editors choose to shape what the journal is and what it does.
Cheryl: We are lucky to have a dedicated staff who are scattered across (mostly) the US: teacher-scholars who want to volunteer their time to work on the journal, and who implement the best practices of pedagogical models for writing studies into their editorial work. At any given time, we have about 25 people on staff (not counting the editorial board).
Doug: Operationally, the journal functions much like any other peer-reviewed scholarly journal: we accept submissions, review them editorially, pass on the ones that are ready for review to our editorial board, engage the authors in a revision process (depending on the results of the peer-review) and then put each submission through an extensive and rigorous copy-, design-, and code-editing process before final publication. Unlike most other journals, our focus on the importance of design and our interest in publishing a stable and sustainable archive mean that we have to add those extra layers of support for design-editing and code review: our published webtexts need to be accessible, usable and conform to web standards.
Trevor: Could you point us to a few particularly exemplary works in the journal over time for readers to help wrap their heads around what these pieces look like? They could be pieces you think are particularly novel or interesting or challenging or that exemplify trends in the journal. Ideally, you could link to it, describe it and give us a sentence or two about what you find particularly significant about it.
Cheryl: Sure! We sponsor an award every year for Best Webtext, and that’s usually where we send people to find exemplars, such as the ones Doug lists below.
Doug: From our peer-reviewed sections, we point readers to the following webtexts (the first two are especially useful for their focus on the process of webtext authoring and editing):
- Daniel Anderson, “Watch the Bubble” (2012)
- Susan H. Delagrange “When Revision Is Redesign: Key Questions for Digital Scholarship” (2009)
- David Rieder, “Typographia: A Hybrid, Alphabetic Exploration of Raleigh, NC” (2010)
- Madeleine Sorapure, “Between Modes: Assessing Students’ New Media Compositions” (2006)
- Melanie Yergeau, Kathryn Wozniak and Peter Vandenberg, “Expanding the Space of f2f: Writing Centers and Audio-Visual-Textual Conferencing” (2009)
- Scott Nelson et al’s, “Crossing Battle Lines: Teaching Multimodal Literacies through Augmented Reality Games” (2013)
Cheryl: From our editorially (internally) reviewed sections, here are a few other examples:
- Nathaniel Rivers’ “Circumnavigation: An Interview with Thomas Rickert” (a mid-career scholar who recently published an award-winning book) (2014)
- Jennifer deWinter et al’s review of “The Art of Video Games,” presented as a video game. (2014)
- Douglas Wall’s “An A-Word Production: Authentic Design,” a mini-manifesto for our short briefs section, Disputatio. (2008)
- Tara Wood and Shannon Madden’s “Suggested Practices for Syllabus Accessibility Statements,” published as part of our pedagogical tools and narratives section, PraxisWiki. (2013)
Trevor: Given the diverse range of kinds of things people might publish in a webtext, could you tell us a bit about the kinds of requirements you have enforced upfront to try and ensure that the works the journal publishes are likely to persist into the future? For instance, any issues that might come up from embedding material from other sites, or running various kinds of database-driven works or things that might depend on external connections to APIs and such.
Doug: We tend to discourage work that is in proprietary formats (although we have published our fair share of Flash-based webtexts) and we ask our authors to conform to web standards (XHTML or HTML5 now). We think it is critical to be able to archive any and all elements of a given webtext on our server, so even in cases where we’re embedding, for instance, a YouTube video, we have our own copy of that video and its associated transcript.
One of the issues we are wrestling with at the moment is how to improve our archival processes so we don’t rely on third-party sites. We don’t have a streaming video server, so we use YouTube now, but we are looking at other options because YouTube allows large corporations to apply bogus copyright-holder notices to any video they like, regardless of whether there is any infringing content (as an example, an interview with a senior scholar in our field was flagged and taken down by a record company; there wasn’t even any background audio that could account for the notice. And since there’s a presumption of guilt, we have to go through an arduous process to get our videos reinstated.) What’s worse is when the video *isn’t* taken down, but the claimant instead throws ads on top of our authors’ works. That’s actually copyright infringement against us that is supported by YouTube itself.
Another issue is that many of the external links in works we’ve published (particularly in older webtexts) tend to migrate or disappear. We used to replace these where we could with links to archive.org (aka The Wayback Machine), but we’ve discovered that their archive is corrupted because they allow anyone to remove content from their archive without reason or notice. So, despite its good intentions, it has become completely unstable as a reliable archive. But we don’t, alas, have the resources to host copies of everything that is linked to in our own archives.
Cheryl: Kairos holds the honor within rhetoric and composition of being the longest-running, and most stable, online journal, and our archival and technical policies are a major reason for that. (It should be noted that many potential authors have told us how scary those guidelines look. We are currently rewriting the guidelines to make them more approachable while balancing the need to educate authors on their necessity for scholarly knowledge-making and -preservation on the Web.)
Of course, being that this field is grounded in digital technology, not being able to use some of that technology in a webtext can be a rather large constraint. But our authors are ingenious and industrious. For example, Deborah Balzhiser et al created an HTML-based interface to their webtext that mimicked Facebook’s interface for their 2011 webtext, “The Facebook Papers.” Their self-made interface allowed them to do some rhetorical work in the webtext that Facebook itself wouldn’t have allowed. Plus, it meant we could archive the whole thing on the Kairos server in perpetuity.
Trevor: Could you give us a sense of the scope of the files that make up the issues? For instance, the total number of files, the range of file types you have, the total size of the data, and or a breakdown of the various kinds of file types (image, moving image, recorded sound, text, etc.) that exist in the run of the journal thus far?
Doug: The whole journal is currently around 20 GB — newer issues are larger in terms of data size because there has been an increase in the use of audio and video (luckily, HTML and CSS files don’t take up a whole lot of room, even with a lot of content in them). At last count, there are 50,636 files residing in 4,545 directories (this count includes things like all the system files for WordPress installs and so on). A quick summary of primary file types:
- HTML files: 12247
- CSS: 1234
- JPG files: 5581
- PNG: 3470
- GIF: 7475
- MP2/3/4: 295
- MOV: 237
- PDF: 191
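A breakdown like this can be produced with a short script; one possible sketch (the function name is ours, and the argument would be the journal’s document root):

```python
from collections import Counter
from pathlib import Path

def count_by_extension(root: str) -> Counter:
    # Walk the tree and tally files by (lower-cased) extension.
    return Counter(
        p.suffix.lower().lstrip(".") or "(none)"
        for p in Path(root).rglob("*")
        if p.is_file()
    )
```

Calling `count_by_extension(root).most_common()` then yields a summary like the one above.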
Cheryl: In fact, our presentation at Digital Preservation 2014 this year [was] partly about the various file types we have. A few years ago, we embarked on a metadata-mining project for the back issues of Kairos. Some of the fields we mined for included Dublin Core standards such as MIMEtype and DCMIType. DCMIType, for the most part, didn’t reveal too much of interest from our perspective (although I am sure librarians will see it differently!), but the MIMEtype search revealed both the range of filetypes we had published and how that range has changed over the journal’s history. Every webtext has at least one HTML file. Early webtexts (from 1996-2000ish) that have images generally have GIFs and, less prominently, JPEGs. But since PNGs rose to prominence (becoming an international standard in 2003), we began to see more and more of them. The same with CSS files around 2006, after web-standards groups started enforcing their use elsewhere on the Web. As we have all this rich data about the history of webtextual design, and too many research questions to cover in our lifetimes, we’ve released the data in Dropbox (until we get our field-specific data repository, rhetoric.io, completed).
Trevor: In the 18 years that have transpired since the first issue of Kairos a lot has changed in terms of web standards and functionality. I would be curious to know if you have found any issues with how earlier works render in contemporary web browsers. If so, what is your approach to dealing with that kind of degradation over time?
Cheryl: If we find something broken, we try to fix it as soon as we can. There are lots of 404s to external links that we will never have the time or human resources to fix (anyone want to volunteer??), but if an author or reader notifies us about a problem, we will work with them to correct the glitch. One of the things we seem to fix often is repeating backgrounds. lol. “Back in the days…” when desktop monitors were tiny and resolutions were tinier, it was inconceivable that a background set to repeat at 1200 pixels would ever actually repeat. Now? Ugh.
But we do not change designs for the sake of newer aesthetics. In that respect, the design of a white-text-on-black-background from 1998 is as important a rhetorical point as the author’s words in 1998. And, just as the ideas in our scholarship grow and mature as we do, so do our designs, which have to be read in the historical context of the surrounding scholarship.

Of course, with the bettering of technology also comes our own human degradation in the form of aging and poorer eyesight. We used to mandate webtexts not be designed over 600 pixels wide, to accommodate our old branding system that ran as a 60-pixel frame down the left-hand side of all the webtexts. That would also allow for a little margin around the webtext. Now, designing for specific widths — especially ones that small — seems ludicrous (and too prescriptive), but I often find myself going into authors’ webtexts during the design-editing stage and increasing their typeface size in the CSS so that I can even read it on my laptop. There’s a balance I face, as editor, of retaining the authors’ “voice” through their design and making the webtext accessible to as many readers as possible. Honestly, I don’t think the authors even notice this change.
Trevor: I understand you recently migrated the journal from a custom platform to the Open Journal System platform. Could you tell us a bit about what motivated that move and issues that occurred in that migration?
Doug: Actually, we didn’t do that.
Cheryl: Yeah, I know it sounds like we did from our Digital Preservation 2014 abstract, and we started to migrate, but ended up not following through for technical reasons. We were hoping we could create plug-ins for OJS that would allow us to incorporate our multimedia content into its editorial workflow. But it didn’t work. (Or, at least, wasn’t possible with the $50,000 NEH Digital Humanities Start-Up Grant we had to work with.) We wanted to use OJS to help streamline and automate our editorial workflow–you know, the parts about assigning reviewers and copy-editors, etc., — and as a way to archive those processes.
I should step back here and say that Kairos has never used a CMS; everything we do, we do by hand — manually SFTPing files to the server, manually making copies of webtext folders in our kludgy way of version control, using YahooGroups (because it was the only thing going in 1998 when we needed a mail system to archive all of our collaborative editorial board discussions) for all staff and reviewer conversations, etc.–not because we like being old school, but because there were always too many significant shortcomings with any out-of-the-box systems given our outside-the-box journal. So the idea of automating, and archiving, some of these processes in a centralized database such as OJS was incredibly appealing. The problem is that OJS simply can’t handle the kinds of multimedia content we publish. And rewriting the code-base to accommodate any plug-ins that might support this work was not in the budget. (We’ve written about this failed experiment in a white paper for NEH.)
Archive.org will obey robots.txt files if they ask not to be indexed. So, for instance, early versions of Kairos itself are no longer available on archive.org because such a file is on the Texas Tech server where the journal lived until 2004. We put that file there because we want Google to point to the current home of the journal, but we actually would like that history to be in the Internet Archive. You can think of this as just a glitch, but here’s the more pressing issue: if I find someone has posted a critical blog post of my work, and I ever get hold of the domain it was originally posted at, I can take it down there *and* retroactively make it unavailable on archive.org, even if it used to show up there. Even without such nefarious purposes, just the constant trade in domains and site locations means that no researcher can trust that archive when using it for history or any kind of digital scholarship.
Over the last three and a half years, the SCAPE project worked in several directions in order to propose new solutions for digital preservation, as well as improving existing ones. One of the results of this work is the SCAPE preservation environment (SPE). It is a loosely coupled system, which enables extending existing digital repository systems (e.g. RODA) with several components that cover collection profiling (i.e. C3PO), preservation monitoring (i.e. SCOUT) and preservation planning (i.e. Plato). Those components address key functionalities defined in the Open Archival Information System (OAIS) functional model.
Establishing the trustworthiness of digital repositories is a major concern of the digital preservation community, as it makes the threats and risks within a digital repository understandable. Several approaches to addressing trust in digital repositories have been developed over recent years. The most notable is Trustworthy Repositories Audit and Certification (TRAC), which was later promoted to an ISO standard by the International Organization for Standardization (ISO 16363, released in 2012). The standard comprises three pillars: organizational infrastructure; digital object management; and infrastructure and security management. For each of these it provides a set of requirements and the expected evidence needed for compliance.
A recently published whitepaper reports on the work done to validate the SCAPE Preservation Environment against the ISO 16363 – a framework for Audit and Certification of Trustworthy Digital Repositories. The work aims to demonstrate that a preservation ecosystem composed of building blocks as the ones developed in SCAPE is able to comply with most of the system-related requirements of the ISO 16363.
From a total of 108 metrics included in the assessment, the SPE fully supports 69. 31 metrics were considered to be “out of scope”, as they refer to organisational issues that cannot be solved by technology alone, nor analysed outside the framework of a breathing organisation, leaving 2 metrics considered “partially supported” and 6 considered “not supported”. This gives an overall compliance level of roughly 90% (if the organisation-oriented metrics are not taken into account).
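The arithmetic behind that figure can be checked directly from the numbers above:

```python
total = 108
fully_supported = 69
out_of_scope = 31        # organisational metrics, excluded from the figure
partially_supported = 2
not_supported = 6

# The four categories account for every metric in the assessment.
assert fully_supported + out_of_scope + partially_supported + not_supported == total

in_scope = total - out_of_scope              # 77 system-related metrics
print(f"{fully_supported / in_scope:.1%}")   # 89.6%, i.e. roughly 90%
```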
This work also enabled us to identify the main weak points of the SCAPE Preservation Environment that should be addressed in the near future. In summary the gaps found were:
- The ability to manage and maintain contracts or deposit agreements through the repository user interfaces;
- Support for tracking intellectual property rights;
- Improve technical documentation, especially on the conversion of Submission Information Packages (SIP) into Archival Information Packages (AIP);
- The ability to aid the repository manager to perform better risk management.
Our goal is to ensure that the SCAPE Preservation Environment fully supports the system-related metrics of the ISO 16363. In order to close the gaps encountered, additional features have been added to the roadmap of the SPE.
To get your hands on the full report, please go to http://www.scape-project.eu/wp-content/uploads/2014/09/SCAPE_MS63_KEEPS-V1.0.pdf
The following is a guest post by Chris Prom, Assistant University Archivist and Professor, University of Illinois at Urbana-Champaign.
I’ll never forget one lesson from my historical methods class at Marquette University. Ronald Zupko–famous for his lecture about the bubonic plague and a natural showman–was expounding on what it means to interrogate primary sources–to cast a skeptical eye on every source, to see each one as a mere thread of evidence in a larger story, and to remember that every event can, and must, tell many different stories.
He asked us to name a few documentary genres, along with our opinions as to their relative value. We shot back: “Photographs, diaries, reports, scrapbooks, newspaper articles,” along with the type of ill-informed comments graduate students are prone to make. As our class rattled off responses, we gradually came to realize that each document reflected the particular viewpoint of its creator–and that the information a source conveyed was constrained by documentary conventions and other social factors inherent to the medium underlying the expression. Settling into the comfortable role of skeptics, we noted the biases each format reflected. Finally, one student said: “What about correspondence?” Dr Zupko erupted: “There is the real meat of history! But, you need to be careful!”
Letters, memos, telegrams, postcards: such items have long been the stock-in-trade for archives. Historians and researchers of all types, while mindful of the challenges in using correspondence, value it as a source for the insider perspective it provides on real-time events. For this reason, the library and archives community must find effective ways to identify, preserve and provide access to email and other forms of electronic correspondence.
After I researched and wrote a guide to email preservation (pdf) for the Digital Preservation Coalition’s Technology Watch Report series, I concluded that the challenges are mostly cultural and administrative.
I have no doubt that with the right tools, archivists could do what we do best: build the relationships that underlie every successful archival acquisition. Engaging records creators and donors in their digital spaces, we can help them preserve access to the records that are so sorely needed for those who will write histories. But we need the tools, and a plan for how to use them. Otherwise, our promises are mere words.
For this reason, I’m so pleased to report on the results of a recent online meeting organized by the National Digital Stewardship Alliance’s Standards and Practices Working Group. On August 25, a group of fifty-plus experts from more than a dozen institutions informally shared the work they are doing to preserve email.
For me, the best part of the meeting was that it represented the diverse range of institutions (in terms of size and institutional focus) that are interested in this critical work. Email preservation is not something of interest only to large government archives, or to small collecting repositories, but also to every repository in between. That said, the representatives displayed a surprisingly similar vision for how email preservation can be made effective.
Robert Spangler, Lisa Haralampus, Ken Hawkins and Kevin DeVorsey described challenges that the National Archives and Records Administration has faced in controlling and providing access to large bodies of email. Concluding that traditional records management practices are not sufficient to task, NARA has developed the Capstone approach, seeking to identify and preserve particular accounts that must be preserved as a record series, and is currently revising its transfer guidance. Later in the meeting, Mark Conrad described the particular challenge of preserving email from the Executive Office of the President, highlighting the point that “scale matters”–a theme that resonated across the board.
The whole account approach that NARA advocates meshes well with activities described by other presenters. For example, Kelly Eubank from North Carolina State Archives and the EMCAP project discussed the need for software tools to ingest and process email records while Linda Reib from the Arizona State Library noted that the PeDALS Project is seeking to continue their work, focusing on account-level preservation of key state government accounts.
Ricc Ferrante and Lynda Schmitz Fuhrig from the Smithsonian Institution Archives discussed the CERP project which produced, in conjunction with the EMCAP project, an XML schema for email objects among its deliverables. Kate Murray from the Library of Congress reviewed the new email and related calendaring formats on the Sustainability of Digital Formats website.
Harvard University was up next. Andrea Goethals and Wendy Gogel shared information about Harvard’s Electronic Archiving Service (EAS). EAS includes tools for normalizing email from an account into EML format (conforming to the Internet Engineering Task Force RFC 2822), then packaging it for deposit into Harvard’s digital repository.
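To picture what such normalization involves, here is a small sketch, assuming an account arrives as a single mbox file; the function name is ours, and this is not Harvard’s EAS code:

```python
import mailbox
from pathlib import Path

def mbox_to_eml(mbox_path: str, out_dir: str) -> int:
    """Write each message in an mbox as a standalone RFC 2822 .eml file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for i, msg in enumerate(mailbox.mbox(mbox_path)):
        # Serialize the message bytes without the mbox "From " separator line.
        (out / f"{i:05d}.eml").write_bytes(msg.as_bytes())
        count += 1
    return count
```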
One of the most exciting presentations was provided by Peter Chan and Glynn Edwards from Stanford University. With generous funding from the National Historical Publications and Records Commission, as well as some internal support, the ePADD Project (“Email: Process, Appraise, Discover, Deliver”) is using natural language processing and entity extraction tools to build an application that will allow archivists and records creators to review email, then process it for search, display and retrieval. Best of all, the web-based application will include a built-in discovery interface, and users will be able to define a lexicon and to provide visual representations of the results. Many participants in the meeting commented that the ePADD tools may provide a meaningful focus for additional collaborations. A beta version is due out next spring.
In the discussion that followed the informal presentations, several presenters congratulated the Harvard team on a slide Wendy Gogel shared, comparing the functions provided by various tools and services (reproduced above).
As is apparent from even a cursory glance at the chart, repositories are doing wonderful work—and much yet remains.
Collaboration is the way forward. At the end of the discussion, participants agreed to take three specific steps to drive email preservation initiatives to the next level: (1) providing tool demo sessions; (2) developing use cases; and (3) working together.
The bottom line: I’m more hopeful about the ability of the digital preservation community to develop an effective approach toward email preservation than I have been in years. Stay tuned for future developments!
U2, already the most hated band in the world thanks to its invading millions of iOS devices with unsolicited files, isn’t stopping. An article on Time‘s website tells us, in vague terms, that
Bono, Edge, Adam Clayton and Larry Mullen Jr believe so strongly that artists should be compensated for their work that they have embarked on a secret project with Apple to try to make that happen, no easy task when free-to-access music is everywhere (no) thanks to piracy and legitimate websites such as YouTube. Bono tells TIME he hopes that a new digital music format in the works will prove so irresistibly exciting to music fans that it will tempt them again into buying music—whole albums as well as individual tracks.
It’s hard to read this as anything but an attempt to bring digital rights management (DRM) back to online music distribution. Users emphatically rejected it years ago, and Apple was among the first to drop it. You haven’t really “bought” anything with DRM on it; you’ve merely leased it for as long as the vendor chooses to support it. People will continue to break DRM, if only to avoid the risk of loss. The illegal copies will offer greater value than legal ones.
It would be nice to think that what U2 and Apple really mean is just that the new format will offer so much better quality that people will gladly pay for it, but that’s unlikely. Higher-quality formats such as AAC have been around for a long time, and they haven’t pushed the old standby MP3 out of the picture. Existing levels of quality are good enough for most buyers, and vendors know it.
Time implies that YouTube doesn’t compensate artists for their work. This is false. YouTube often doesn’t bother with small independent musicians, though it will if it’s reminded hard enough (as Heather Dale found out), but it’s hard to believe that groups with powerful lawyers, such as U2, aren’t being compensated for every view.
DRM and force-feeding of albums are two sides of the same coin of vendor control over our choices. This new move shouldn’t be a surprise.
It is difficult to write that headline. After nearly four years of hard work, worry, setbacks, triumphs, weariness, and exultation, the SCAPE project is finally coming to an end.
I am convinced that I will look back at this period as one of the highlights of my career. I hope that many of my SCAPE colleagues will feel the same way.
I believe SCAPE was an outstanding example of a successful European project, characterised by
- an impressive level of trouble-free international cooperation;
- sustained effort and dedication from all project partners;
- high quality deliverables and excellent review ratings;
- a large number of amazing results, including more software tools than we can demonstrate in one day!
I also believe SCAPE has made and will continue to make a significant impact on the community and practice of digital preservation. We have achieved this impact through
- scalability improvements on existing tools, for example Plato and Fedora 4;
- new scalable tools like Nanite, Hawarp and C3PO;
- new tools for quality control like Jpylyzer, xcorrSound, and Matchbox;
- APIs for repository interoperability;
- dozens of Taverna workflows and Hadoop-based workflows;
- documented best practices;
- a catalogue of preservation policies and advances in policy-based planning with Plato;
- advances in automated preservation watch with SCOUT;
- a high degree of take-up of project results at partner institutions and beyond.
I would like to thank all the people who contributed to the SCAPE project, who are far too numerous to name here. In particular I would like to thank our General Assembly members, our Executive Board/Sub-project leads, the Work Package leads, and the SCAPE Office, all of whom have contributed to the joy and success of SCAPE.
Finally, I would like to thank the OPF for ensuring that the SCAPE legacy will continue to live and even grow long after the project itself is just a fond memory.
It's been a pleasure, folks. Well done!
On Monday 8 September 2014, APARSEN and SCAPE together hosted a workshop called ‘Digital Preservation Sustainability on the EU Policy Level’. The workshop was held in connection with the Digital Libraries 2014 conference in London.
The room for the workshop was ‘The Great Hall’ at City University London – a lovely, old, large room with a stage at one end and lots of space for the 12 stalls featuring the invited projects and the 85 attendees.
The first half of the workshop was dedicated to a panel session. The three panellists each had 10-15 minutes to present their views on both the achievements and future of digital preservation, followed by a discussion moderated by Hildelies Balk from the Royal Library of the Netherlands, with real-time visualisations made by Elco van Staveren.

‘As a community we have failed’
With these words David Giaretta, Director of APARSEN (see presentation and visualisation), pinpointed the fact that there will be no EU funding for digital preservation research in the future, and that the EU expects to see results from the €100 million already distributed. The EU sees data as the new gold, and we should start mining it! One big difference between gold and data, though: gold does not perish, whereas data does.
The important thing is to create results – ‘A rising tide floats all boats’ – if we can at least show something that can be used, that will help fund the rest of the preservation work.

Let’s climb the wall!
David Giaretta was followed by Ross King, Project Coordinator of SCAPE (see presentation and visualisation), who started his presentation with a comparison between the two EU projects Planets and SCAPE, the latter being a follow-up to the former. Many issues already addressed in Planets were further explored and developed in SCAPE, the biggest difference being scalability – how to handle large volumes, scalability of planning processes, more automation etc. – which was the focal point of SCAPE.
To Ross King there were three lessons learned from working with Planets and SCAPE:
- there is still a wall between Production on one side and Research & Development on the other,
- the time issue – although libraries, archives etc. work with long-term horizons, most businesses have a planning horizon of five years or less,
- format migration may not be as important as we thought it was.
Ed Fay, Director of the Open Planets Foundation (see presentation and visualisation), opened with the message that those of us working in digital preservation have a great responsibility: helping to define the future of information management. With no future EU-funded projects, community collaboration at all levels is more needed than ever. Shared services and infrastructure are essential.
The Open Planets Foundation was founded after the Planets project to help sustain that project’s results. Together with SCAPE and other projects, OPF is now trying to mature tools so they can be widely adopted and sustained (see the SCAPE Final Sustainability Plan).
There are a lot of initiatives and a lot of momentum, from DPC, NDIIPP and JISC to OPF and APA – but what will the future look like? How do we ensure that initiatives are aligned up to the policy level?
Sustainability is about working out who pays – and when…
If digital preservation were delivering business objectives, we wouldn’t be here talking about sustainability – it would just be embedded in how organisations work. We are not there yet!

A diverse landscape with many facets
The panellist’s presentations were followed by questions from the audience, mostly concerned about risk approach. During the discussion it was stated that although the three presenters see the digital landscape from different views they all agree on its importance. People do need to preserve and to get digital value from that. The DP initiatives and organisations are the shopping window, members have lots of skills that the market could benefit from.
The audience was asked whether they found it important to have a DP community – apparently nobody disagreed! And it seemed that almost everyone was a member of OPF, APARSEN or a similar initiative.
There are not many H2020 digital preservation bids. In earlier days everybody had several proposals running in these rounds, but this is not catastrophic – good research has been done, and now we want the products to be consolidated. We would like to reach a point where digital preservation is an infrastructure service as obvious as your email. But we are not there yet!
Appraisal and ingest are still not solved – we need to choose the data to be preserved, especially when talking about petabytes!
The discussion was wrapped up by walking through the visualisation made by Elco van Staveren.
An overall comment was that even though there is no money directed specifically towards digital preservation, there is still lots of money for problems that digital preservation can solve. It is important that the digital preservation community thinks of itself NOT as the problem but as part of the solution. And although the visualisation is mostly about sustainability, risk still plays an important part. If you cannot explain the risk of doing nothing, you cannot persuade anyone to pay!

Clinic with experts
After the panel and one minute project elevator pitches there was a clinic session at which all the different projects could present themselves and their results at different stalls. A special clinic table was in turn manned by experts from different areas of digital preservation.
This was the time to meet a lot of different people from the Digital Preservation field, to catch up and build new relations. For a photo impression of the workshop see: http://bit.ly/1u7Lmnq.
The following is a guest post by Patrick Rourke, an Information Technology Specialist and the newest member of the Library’s Viewshare team.
I made my first forays into computing on days when it was too cold, wet or snowy to walk in the woods behind our house, in a room filled with novels, atlases and other books. Usually those first programming projects had something to do with books, or writing, or language – trying to generate sentences from word lists, or altering the glyphs the computer used for text to represent different alphabets.
After a traumatic high school exposure to the COBOL programming language (Edsger Dijkstra once wrote that “its teaching should be regarded as a criminal offense” (pdf)), in college I became fascinated with the study of classical Greek and Roman history and literature. I was particularly drawn to the surviving fragments of lost books from antiquity – works that were not preserved, but of which traces remain in small pieces of papyrus, in palimpsests, and through quotations in other works. I spent a lot of my free time in the computer room, using GML, BASIC and ftp on the university’s time sharing system.
My first job after graduation was on the staff of a classics journal, researching potential contributors, proofreading, checking references. At that time, online academic journals and electronic texts were being distributed via email and the now almost-forgotten medium of Gopher. It was an exciting time, as people experimented with ways to leverage these new tools to work with books, then images, then the whole panoply of cultural content.
This editorial experience led to a job in the technical publications department of a research company, and my interest in computing to a role as the company webmaster, and then as an IT specialist, working with applications, servers and networking. In my spare time, I stayed engaged with the humanities, doing testing, web design and social media engagement for the Suda On Line project, which publishes a collaborative translation and annotation of the 10th-century Byzantine lexicon in which many of those fragments of lost books are found.
My work on corporate intranets and my engagement with SOL motivated me to work harder on extending my programming skills, so before long I was developing web applications to visualize project management data and pursuing a master’s degree in computer science. In the ten years I’ve been working as a developer, I’ve learned a lot about software development in multiple languages, frameworks and platforms, worked with some great teams and been inspired by great mentors.
I join the National Digital Information Infrastructure and Preservation Program as an Information Technology Specialist, uniting my interests in culture and computing. My primary project is Viewshare, a platform the Library makes available to cultural institutions for generating customized visualizations – including timelines, maps, and charts – of digital collections data. We will be rolling out a new version of Viewshare in the near future, and then I will be working with the NDIIPP team and the Viewshare user community on enhancing the platform by developing new features and new ways to view and share digital collections data. I’m looking forward to learning from and working with my new colleagues at the Library of Congress and everyone in the digital preservation community.
A few websites refuse to present content if you use a browser other than one of the four or so big-name ones.
The example shown is what I got when I accessed Apple’s support site with iCab, a relatively obscure browser which I often use. Many of Google’s pages also refuse to deliver content to iCab.
Browsers can impersonate other browsers by setting the User-Agent header, and small-name browsers usually provide that option for getting around these problems. After a couple of tries with iCab, I was able to get through by impersonating Safari. Doing this also has an advantage for privacy; identifying yourself with a little-used browser can greatly contribute to unique identification when you may want anonymity. From the standpoint of good website practices, though, a site shouldn’t be locking browsers out unless there’s an unusual need. Web pages should follow standards so that they’re as widely readable as possible. This is especially important with a “contact support” page.
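Mechanically, impersonation is nothing more than sending a different User-Agent string with each request – the same thing niche browsers like iCab expose as a preference. A minimal sketch in Python; the Safari string below is illustrative (real strings vary by version and platform), and the function name is mine, not any browser’s API.

```python
import urllib.request

# An illustrative (not version-accurate) Safari User-Agent string.
SAFARI_UA = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9) "
             "AppleWebKit/537.71 (KHTML, like Gecko) "
             "Version/7.0 Safari/537.71")

def spoofed_request(url, user_agent):
    """Build a request that presents the given User-Agent header
    instead of the default Python-urllib identifier."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

# Fetching the page would then be:
#   urllib.request.urlopen(spoofed_request(url, SAFARI_UA))
```

The server has no reliable way to tell the difference, which is why User-Agent sniffing is such a poor basis for locking browsers out.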
Apple and Google both are browser vendors. Might we look at this as a way to make entry by new browsers more difficult?
At the Museum is an interview series highlighting the variety of digital collections in museums and the interesting people working to create and preserve these collections. For this installment I interviewed Ellice Engdahl, Digital Collections & Content Manager, and Brian Wilson, Digital Access and Preservation Archivist, at The Henry Ford in Dearborn, Michigan.
Sue: Tell us about your background and how you ended up in this role.
Ellice: My professional experience prior to The Henry Ford has been in the for-profit publishing industry, in a number of different roles. I started in that field in 1998, working on print books, CD-ROMs and a then brand-new web product providing book recommendations to readers based on other books they’d read. I moved on to work on my firm’s new eBooks program, converting print books (ours as well as other publishers’) into what was then Open eBook XML (now ePub). I also was a project manager leading Agile software development teams (I got my PMP project management certification in 2007), and my last role there was as a technical content implementation manager. I started at The Henry Ford in mid-2011.
Sue: Give us some background on your current position at The Henry Ford.
Ellice: I’m part of our Digital and Emerging Media Department, and act as a program manager for our digitization efforts, bringing together a cross-functional team to prioritize and manage ongoing collections digitization work. I also act as collections content representative on new and existing digital products with internal staff and external agencies and vendors. My role has expanded slightly recently, so I’m starting to get involved with how we aggregate digitized collections content to tell stories on the web.
Sue: Tell us a bit about your online collections area. Could you shed some light on how you are approaching the presentation of special, individual items within the context of large-scale digitization efforts?
Ellice: The Henry Ford has a large collection and an even larger archive–we estimate we have about one million objects, plus 25 million items in our archives. Though we have pockets of “digitized” collections content dating back to the 1990s (I’ve heard Günter Waibel of the Smithsonian call these “random acts of digitization”), digitization as a consistent, standardized process and as a part of our everyday work here began around 2011, with 300 objects online. Now we’re at about 26,000 online, all imaged and cataloged with at least minimal metadata (generally to CCO standards).
I think the challenge for an institution with massive amounts of material, ranging from the size of buttons to planes, trains and automobiles, is to find a balance between making our collections accessible at a high-level, with less detail, and then finding ways to highlight the gems of the collection and the major stories we as an institution tell. We haven’t yet quite solved this problem. One of our curators calls our collection “the bottomless pit of wonderfulness,” which really sums up the blessing and the curse of having so much material to work with. Even deciding what we digitize first, with the amount of material so greatly outweighing the number of digitization staff, can be a challenge, let alone the amount of time that you spend on each individual item.
Sue: What advice would you have to offer others at similar institutions?
Ellice: Don’t get overwhelmed. It is possible, perhaps just possible, that the work of digitizing museum collections tends to attract Type A personalities who really want everything to be perfect. The nice thing to remember is that even getting some information about your collections online, particularly for objects that are in storage or otherwise publicly inaccessible, makes information available to scholars and the general public in a way that wouldn’t have been possible 25 years ago.
Also, apply the 80-20 rule, along with cost-benefit analysis, constantly–80% of our impact clearly comes from 20% of our effort. We run into this all the time when attempting to digitize a particularly large or complicated object or group of objects. We try to see if there are creative ways to get the work done, or if it could become the basis for a grant application, or if we have existing assets (for example, old black and white photos) that we could use instead of new photography. Sometimes we defer specific objects or parts of the collection because tackling them at that moment would be so difficult. Fortunately for us, we always have a new part of the bottomless pit of wonderfulness to tackle.
Sue: What do you see as the biggest challenge in presenting and preserving these digital items?
Ellice: I think our biggest challenge in presentation right now is how we go beyond simply a searchable database of objects connected only by metadata to tell integrated stories, similar to those a presenter here might tell you if you come to visit, and yet still allow the user to pursue threads from the story they are particularly interested in. We are working on this, but haven’t yet licked it.
Preservation and new technologies are also a major issue. For example, we have some 360-degree files, which allow users to navigate around the inside of vehicles, that are in Flash format. We’re currently considering whether our master file format for images should remain TIFF or should switch to JPEG2000 or something else. It’s easy to create a backlog of data in a format that becomes obsolete, unless you’re really paying attention.
Sue: Are you currently working with any “born digital” materials in your collections, and what are your future plans for these materials?
Ellice: We do have a major collection of born-digital material that we have been adding to our collections website and to Flickr: the Dave Friedman Collection (pdf). These are automobile racing negatives that were scanned at high resolution by the original photographer and delivered to us as digital files. Since these came to us as well-organized digital image files, we are able to make these accessible much more quickly than their physical counterparts. We expect to see much more born-digital material in the future, however, and anticipate the level of organization and preservation when something reaches us will vary, so this will bring additional challenges. For example, how do you make an Excel or PowerPoint file accessible when the original software that created it is obsolete? How do you retrieve material on a 3.5″ disk from a circa 1990 Smith-Corona word processor?
Sue: Would you say your institution has adopted a system for preservation of digital objects or records?
Ellice: We have two big categories to consider on this front: 1) what we do with the material we’re newly digitizing going forward, and 2) how we clean up the backlog of those “random acts of digitization.”
For the first category, our newly created images and metadata, we’re creating master TIFF files for all images and ingesting those into a backed-up preservation server. Access to this server is fairly limited, and we’re moving to even more restrictive permissions. We’re also in the planning stages of creating checksum data for our master image files. For most public uses, we utilize a JPG derivative of the original TIFF file. Our collections object metadata is stored in EMu and is backed up locally nightly and weekly to tape.
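The checksum step Ellice mentions can be as simple as computing a fixed digest per master file and recording it alongside the metadata, then recomputing it later to detect corruption. A minimal sketch in Python; the function name and chunk size are illustrative, not part of The Henry Ford’s actual workflow.

```python
import hashlib

def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Compute a fixity checksum for a master file, reading in
    1 MB chunks so multi-gigabyte TIFFs never need to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Stored checksums only pay off if they are periodically re-verified against the files; a mismatch on re-check is the signal that a master needs to be restored from backup.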
The second category is much more difficult. We have pockets of institutional data (including digital collections data) on non-backed up physical media of varying age and obsolescence. We don’t really have the people, server space, or organizational plan to simply collect and dump these into top tier storage, but we are trying to move these pockets off removable media to space where they are at least backed up. Figuring out what they are, where they came from, whether any description exists for these digital files, and if so, where, is a huge effort. Right now, we’re picking through these as we find them and prioritizing our efforts based on institutional strategic goals.
Brian: Our digital preservation efforts over the last 3 years or so at The Henry Ford have been fairly basic and focused primarily on improving the storage of output from our large-scale digitization effort, which Ellice describes above. We’ve also quietly, and at times not so quietly, worked to raise awareness of the need for, and the opportunities afforded by, digital preservation.
Sue: How are you incorporating the storage of these large-scale digital images into your workflow?
Brian: We have replaced the use of distributed storage devices and removable media with backed-up, network spinning disk storage for capturing the 1-1.5TB of TIFF master files we’re creating each year. To use the new storage space effectively we’ve had to create workflows and procedures that describe data locations and storage structure, file transfer, naming conventions and so forth. Our forecasting of storage use has also been improved from less than a week to about 6 months in order to aid ITS in planning for and obtaining additional space. By the end of this summer we plan to fully implement a new electronic staging area, which will provide a place to deposit material intended for preservation as well as workspace for archival processing. The staging area will then allow for stricter permissions to be placed on our preservation storage and more effective use of checksums.
Sue: What thoughts do you each have on the need for digital preservation in general?
Ellice: Digital preservation can be a tough sell to folks not intimate with digital content. The front-end of your digital collections is exciting and vibrant and beautiful, but many people don’t think about the back-end until they need a file and it’s missing or corrupted. It has taken some time to get momentum behind digital preservation at The Henry Ford, a sea change caused in large part by Brian’s efforts; he is passionate about preservation and works closely with our IT staff to move us forward.
Brian: Along with the technology and infrastructure work we have also been making efforts to raise awareness of the need for digital preservation. Two years ago we drafted a digital preservation policy that, while still not formally approved, has been used to guide decisions and to provide support in grant applications. Working with faculty at both Wayne State University School of Information and Library Science and the University of Michigan School of Information we have hosted several student interns who have made great contributions to our preservation policy, efforts at dealing with moving image and audio materials and implementation of the staging area I mentioned previously. And there’s been a good amount of just speaking up during meetings to say, “Hey, don’t forget about preserving this electronic data you’re talking about!” The capture of old micro-websites before they’re taken offline, and the collection and storage of facilities images from staff are examples of a couple of these “oh-by-the-way” type of issues.
For what’s next, we want to complete implementation of the staging area and then take a hard look at transferring to network storage the legacy digital moving image and audio files that we’ve produced over the years and that still reside on portable hard drives and removable disk media. And of course, continue our awareness and education efforts. To paraphrase one of my grad school mentors, “It’s not a battle, it’s a campaign.”
This post is the latest in our NDSA Innovation Working Group’s ongoing Insights Interview series. Chelcie Rowell (Digital Initiatives Librarian, Wake Forest University) interviews Richard Ball (Associate Professor of Economics, Haverford College) and Norm Medeiros (Associate Librarian, Haverford Libraries) about Teaching Integrity in Empirical Research, or Project Tier.
Chelcie: Can you briefly describe Teaching Integrity in Empirical Research, or Project TIER, and its purpose?
Richard: For close to a decade, we have been teaching our students how to assemble comprehensive documentation of the data management and analysis they do in the course of writing an original empirical research paper. Project TIER is an effort to reach out to instructors of undergraduate and graduate statistical methods classes in all the social sciences to share with them lessons we have learned from this experience.
When Norm and I started this work, our goal was simply to help our students learn to do good empirical research; we had no idea it would turn into a “project.” Over a number of years of teaching an introductory statistics class in which students collaborated in small groups to write original research papers, we discovered that it was very useful to have students not only turn in a final printed paper reporting their analysis and results, but also submit documentation of exactly what they did with their data to obtain those results.
We gradually developed detailed instructions describing all the components that should be included in the documentation and how they should be formatted and organized. We now refer to these instructions as the TIER documentation protocol. The protocol specifies a set of electronic files (including data, computer code and supporting information) that would be sufficient to allow an independent researcher to reproduce–easily and exactly–all the statistical results reported in the paper. The protocol is and will probably always be an evolving work in progress, but after several years of trial and error, we have developed a set of instructions that our students are able to follow with a high rate of success.
Even for students who do not go on to professional research careers, the exercise of carefully documenting the work they do with their data has important pedagogical benefits. When students know from the outset that they will be required to turn in documentation showing how they arrive at the results they report in their papers, they approach their projects in a much more organized way and keep much better track of their work at every phase of the research. Their understanding of what they are doing is therefore substantially enhanced, and I in turn am able to offer much more effective guidance when they come to me for help.
Despite these benefits, methods of responsible research documentation are virtually, if not entirely, absent from the curricula of all the social sciences. Through Project TIER, we are engaging in a variety of activities that we hope will help change that situation. The major events of the last year were two faculty development workshops that we conducted on the Haverford campus. A total of 20 social science faculty and research librarians from institutions around the US attended these workshops, at which we described our experiences teaching our students good research documentation practices, explained the nuts and bolts of the TIER documentation protocol, and discussed with workshop participants the ways in which they might integrate the protocol into their teaching and research supervision. We have also been spreading the word about Project TIER by speaking at conferences and workshops around the country, and by writing articles for publications that we hope will attract the attention of social science faculty who might be interested in joining this effort.
We are encouraged that faculty at a number of institutions are already drawing on Project TIER and teaching their students and research advisees responsible methods of documenting their empirical research. Our ultimate goal is eventually to see a day when the idea of a student turning in an empirical research paper without documentation of the underlying data management and analysis is considered as aberrant as the idea of a student turning in a research paper for a history class without footnotes or a reference list.
Chelcie: How did TIER and your 10-year collaboration (so far!) get started?
Norm: When I came to the Haverford Libraries in 2000, I was assigned responsibility for the Economics Department. Soon thereafter I began providing assistance to Richard’s introductory statistics students, both in locating relevant literature as well as in acquiring data for statistical analysis. I provided similar, albeit more specialized, assistance to seniors in the context of their theses. Richard invited me to his classes and advised students to make appointments with me. Through regular communication, I came to understand the outcomes he sought from his students’ research assignments, and tailored my approach to meet these expectations. A strong working relationship ensued.
Meanwhile, in 2006 the Haverford Libraries in conjunction with Bryn Mawr and Swarthmore Colleges implemented DSpace, the widely-deployed open source repository system. The primary collection Haverford migrated into DSpace was its senior thesis archive, which had existed for the previous five years in a less-robust system. Based on the experience I had accrued to that point working with Richard and his students, I thought it would be helpful to future generations of students if empirical theses coexisted with the data from which the results were generated.
The DSpace platform provided a means of storing such digital objects and making them available to the public. I mentioned this idea to Richard, who suggested that not only should we post the data, but also all the documentation (the computer command files, data files and supporting information) specified by our documentation protocol. We didn’t know it at the time, but the seeds of Project TIER were planted then. The first thesis with complete documentation was archived on DSpace in 2007, and several more have been added every year since then.
Chelcie: You call TIER a “soup-to-nuts protocol for documenting data management and analysis.” Can you walk us through the main steps of that protocol?
Richard: The term “soup-to-nuts” refers to the fact that the TIER protocol entails documenting every step of data management and analysis, from the very beginning to the very end of a research project. In economics, the very beginning of the empirical work is typically the point at which the author first obtains the data to be used in the study, either from an existing source such as a data archive, or by conducting a survey or experiment; the very end is the point at which the final paper reporting the results of the study is made public.
The TIER protocol specifies that the documentation should contain the original data files the author obtained at the very beginning of the study, as well as computer code that executes all the processing of the data necessary to prepare them for analysis (including, for example, combining files, creating new variables, and dropping cases or observations) and finally generates the results reported in the paper. The protocol also specifies several kinds of additional information that should be included in the documentation, such as metadata for the original data files, a data appendix that serves as a codebook for the processed data used in the analysis, and a read-me file that serves as a users’ guide to everything included in the documentation.
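For illustration only, a documentation hierarchy along these lines can be scaffolded with a few shell commands. The folder names below are hypothetical stand-ins, not the protocol's official ones; the authoritative layout is in the TIER instructions themselves.

```shell
# Sketch of a TIER-style documentation skeleton (illustrative only;
# consult the official protocol instructions for the real spec).
mkdir -p documentation/original-data \
         documentation/command-files \
         documentation/processed-data \
         documentation/metadata

# A top-level read-me serves as the users' guide to everything included,
# and a data appendix serves as the codebook for the processed data.
touch documentation/read-me.txt
touch documentation/metadata/data-appendix.txt

# List the skeleton that was created.
find documentation -type d | sort
```

Setting this template up at the start of a project, before any data work begins, is what lets every later processing step land in a predictable place.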
This “soup-to-nuts” standard contrasts sharply with the policies of academic journals in economics and other social sciences. Some of these journals require authors of empirical papers to submit documentation along with their manuscripts, but the typical policy requires only the processed data file used in the analysis and the computer code that uses this processed data to generate the results. These policies do not require authors to include copies of the original data files or the computer code that processes the original data to prepare them for analysis. In our view, this standard, sometimes called “partial replicability,” is insufficient. Even in the simplest cases, construction of the processed dataset used in the analysis involves many decisions, and documentation that allows only partial replication provides no record of the decisions that were made.
Complete instructions for the TIER protocol are available online. The instructions are presented in a series of web pages, and they are also available for download in a single .pdf document.
Chelcie: You’ve taught the TIER protocol in two main curricular contexts: introductory statistics courses and empirical senior thesis projects. What is similar or different about teaching TIER in these two contexts?
Richard: The main difference is that in the statistics courses students do their research projects in groups made up of 3-5 members. It is always a challenge for students to coordinate work they do in groups, and the challenge is especially great when the work involves managing several datasets and composing several computer command files. Fortunately, there are some web-based platforms that can facilitate cooperation among students working on this kind of project. We have found two platforms to be particularly useful: Dataverse, hosted by the Harvard Institute for Quantitative Social Science, and the Open Science Framework, hosted by the Center for Open Science.
Another difference is that when seniors write their theses, they have already had the experience of using the protocol to document the group project they worked on in their introductory statistics class. Thanks to that experience, senior theses tend to go very smoothly.
Chelcie: Can you elaborate a little bit about the Haverford Dataverse you’ve implemented for depositing the data underlying senior theses?
Norm: In 2013 Richard and I were awarded a Sloan/ICPSR challenge grant with which to promote Project TIER and solicit participants. As we considered this initiative, it was clear to us that a platform for hosting files would be needed both locally for instructors who perhaps didn’t have a repository system in place, as well as for fostering cross-institutional collaboration, whereby students learning the protocol in one participating institution could run replications against finished projects at another institution.
We imagined such a platform would need an interactive component, such that one could comment on the exactness of the replication. DSpace is a strong platform in many ways, but it is not designed for these purposes, so Richard and I began investigating available options. We came across Dataverse, which has many of the features we desired. Although we have uploaded some senior theses as examples of the protocol’s application, it was really the introductory classes for which we sought to leverage Dataverse. Our Project TIER Dataverse is available online.
In fall 2013, we experimented with using Dataverse directly with students. We sought to leverage the platform as a means of facilitating file management and communication among the various groups. We built Dataverses for each of the six groups in Richard’s introductory statistics course. We configured templates that helped students understand where to load their data and associated files. The process of building these Dataverses was time-consuming, and at points we needed to jury-rig the system to meet our needs. Although Dataverse is a robust system, we found its interface too complex for our needs. This fall we plan to try the Open Science Framework to see if it serves our students better. Down the road, we can envision complementary roles for Dataverse and OSF as they relate to Project TIER.
Chelcie: After learning the TIER protocol, do students’ perceptions of the value of data management change?
Richard: Students’ perceptions change dramatically. I see this every semester. For the first few weeks, students have to do a few things to prepare to do what is required by the protocol, like setting up a template of folders in which to store the documentation as they work on the project throughout the semester, and establishing a system that allows all the students in the group to access and work on the files in those folders. There are always a few wrinkles to work out, and sometimes there is a bit of grumbling, but as soon as students start working seriously with their data they see how useful it was to do that up-front preparation. They realize quickly that organizing their work as prescribed by the protocol increases their efficiency dramatically, and by the end of the semester they are totally sold–they can’t imagine doing it any other way.
Chelcie: Have you experienced any tensions between developing step-by-step documentation for a particular workflow and technology stack versus developing more generic documentation?
Richard: The issue of whether the TIER protocol should be written in generic terms or tailored to a particular platform and/or a particular kind of software is an important one, but for the most part it has not been a source of tension. All of the students in our introductory statistics class and most of our senior thesis advisees use Stata, on either a Windows or Mac operating system. The earliest versions of the protocol were therefore written particularly for Stata users, which meant, for example, that we used the term “do-file” instead of “command file,” and instead of saying something like “a data file saved in the proprietary format of the software you are using” we would say “a data file saved in Stata’s .dta format.”
But fundamentally there is nothing Stata-specific about the protocol. Everything that we teach students to do using Stata works just fine with any of the other major statistical packages, like SPSS, R and SAS. So we are working on two ways of making it as easy as possible for users of different software to learn and teach the protocol. First, we have written a completely software-neutral version. And second, with the help of colleagues with expertise in other kinds of software, we are developing versions for R and SPSS, and we hope to create a SAS version soon. We will post each of these versions on the Project TIER website as it is completed.
The one program we have come across for which the TIER protocol is not well suited is Microsoft Excel. The problem is that Excel is an exclusively interactive program; there is no natural way to write an executable script that records a sequence of commands so they can be rerun. Executable command files are the heart and soul of the TIER protocol; they are the tool that makes it possible literally to replicate statistical results. So Excel cannot be the principal program used for a project that follows the TIER documentation protocol.
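To make the idea concrete, here is a minimal sketch of what an executable command file does, written in Python rather than Stata (the variable names and processing steps are invented for illustration; a TIER course would typically record the same kind of steps in a do-file). The point is that every data-management decision, such as dropping cases and constructing variables, is captured as code that anyone can rerun.

```python
# Illustrative command file: reads "original" data, applies documented
# processing steps, and writes the processed dataset used in analysis.
# All names and steps here are hypothetical examples.
import csv

# Stand-in original data; in a real project the original files would be
# preserved unmodified in the documentation folder.
original = [
    {"id": "1", "income": "30000", "age": "25"},
    {"id": "2", "income": "", "age": "41"},  # missing income
    {"id": "3", "income": "52000", "age": "67"},
]

# Processing steps, recorded as code so they can be replicated exactly:
#   1. drop observations with missing income,
#   2. construct a new variable (an over-65 indicator).
processed = []
for row in original:
    if row["income"] == "":
        continue  # decision: drop cases with missing data
    processed.append({
        "id": row["id"],
        "income": int(row["income"]),
        "senior": int(row["age"]) >= 65,  # constructed variable
    })

# Write the processed dataset that the analysis will use.
with open("processed.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "income", "senior"])
    writer.writeheader()
    writer.writerows(processed)

print(len(processed))  # number of observations retained
```

Because the dropped case and the constructed variable are recorded in the script rather than performed by hand, a reader can rerun the file and recover the processed data exactly, which is precisely what an interactive-only tool like Excel cannot document.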
Chelcie: What have you found to be the biggest takeaways from your experience introducing a data management protocol to undergraduates?
Richard: In the response to the first question in this interview, I described some of the tangible pedagogical benefits of teaching students to document their empirical research carefully. But there is a broader benefit that I believe is more fundamental. Requiring students to document the statistical results they present in their papers reinforces the idea that whenever they want to claim something is true or advocate a position, they have an intellectual responsibility to be able to substantiate and justify all the steps of the argument that led them to their conclusion. I believe this idea should underlie almost every aspect of an undergraduate education, and Project TIER helps students internalize it.
Chelcie: Thanks to funding from the Sloan Foundation and ICPSR at the University of Michigan, you’ve hosted a series of workshops focused on teaching good practices in documenting data management and analysis. What have you learned from “training the trainers”?
Richard: Our experience with faculty from other institutions has reinforced our belief that the time is right for initiatives that, like Project TIER, aim to increase the quality and credibility of empirical research in the social sciences. Instructors frequently tell us that they have thought for a long time that they really ought to include something about documentation and replicability in their statistics classes, but never got around to figuring out just how to do that. We hope that our efforts on Project TIER, by providing a protocol that can be adopted as-is or modified for use in particular circumstances, will make it easier for others to begin teaching these skills to their students.
We have also been reminded of the fact that faculty everywhere face many competing demands on their time and attention, and that promoting the TIER protocol will be hard if it is perceived to be difficult or time-consuming for either faculty or students. In our experience, the net costs of adopting the protocol, in terms of time and attention, are small: the protocol complements and facilitates many aspects of a statistics class, and the resulting efficiencies largely offset the start-up costs. But it is not enough for us to believe this: we need to formulate and present the protocol in such a way that potential adopters can see this for themselves. So as we continue to tinker with and revise the protocol on an ongoing basis, we try to be vigilant about keeping it simple and easy.
Chelcie: What do you think performing data management outreach to undergraduate, or more specifically TIER as a project, will contribute to the broader context of data management outreach?
Richard: Project TIER is one of a growing number of efforts bubbling up in several fields that share the broad goal of enhancing the transparency and credibility of research in the social sciences. In Sociology, Scott Long of Indiana University is a leader in the development of best practices in responsible data management and documentation. The Center for Open Science, led by psychologists Brian Nosek and Jeffrey Spies of the University of Virginia, is developing a web-based platform to facilitate pre-registration of experiments as well as replication studies. And economist Ted Miguel at UC Berkeley has launched the Berkeley Initiative for Transparency in the Social Sciences (BITSS), which focuses its efforts on strengthening professional norms of research transparency by reaching out to early-career social scientists. The Inter-university Consortium for Political and Social Research (ICPSR), which for over 50 years has served as a preeminent archive for social science research data, is also making important contributions to responsible data stewardship and research credibility. The efforts of all these groups and individuals are highly complementary, and many fruitful collaborations and interactions are underway among them. Each has a unique focus, but all are committed to the common goal of improving norms and practices with respect to transparency and credibility in social science research.
These bottom-up efforts also align well with several federal initiatives. Since 2011, the NSF has required all proposals to include a “data management plan” outlining procedures that will be followed to support the dissemination and sharing of research results. Similarly, the NIH requires all investigator-initiated applications with direct costs greater than $500,000 in any single year to address data sharing in the application. More recently, in 2013 the White House Office of Science and Technology Policy issued a policy memorandum titled “Increasing Access to the Results of Federally Funded Scientific Research,” directing all federal agencies with more than $100 million in research and development expenditures to establish guidelines for the sharing of data from federally funded research.
Like Project TIER, many of these initiatives have been launched just within the past year or two. It is not clear why so many related efforts have popped up independently at about the same time, but it appears that momentum is building that could lead to substantial changes in the conduct of social science research.
Chelcie: Do you think the challenges and problems of data management outreach to students will be different in 5 years or 10 years?
Richard: As technology changes, best practices in all aspects of data stewardship, including the procedures specified by the TIER protocol, will necessarily change as well. But the principles underlying the protocol–replicability, transparency, integrity–will remain the same. So we expect the methods of implementing Project TIER will continually be evolving, but the aim will always be to serve those principles.
Chelcie: Based on your work with TIER, what kinds of challenges would you like for the digital preservation and stewardship community to grapple with?
Norm: We’re glad to know that research data are specifically identified in the National Agenda for Digital Stewardship. There is an ever-growing array of non-profit and commercial data repositories for the storage and provision of research data; ensuring the long-term availability of these is critical. Although our protocol relies on a platform for file storage, Project TIER is focused on teaching techniques that promote transparency of empirical work, rather than on digital object management per se. This said, we’d ask that the NDSA partners consider the importance of accommodating supplemental files, such as statistical code, within their repositories, as these are necessary for the computational reproducibility advocated by the TIER protocol. We are encouraged by and grateful to the Library of Congress and other forward-looking institutions for advancing this ambitious Agenda.
On September 8 the SCAPE/APARSEN workshop Digital Preservation Sustainability on the EU Level was held at City University London in connection with the DL2014 conference. In conjunction with the workshop, SCAPE and APARSEN launched a competition:
Which message do YOU want to send to the EU for the future of Digital Preservation projects?
At the close of the workshop the winning tweet and two runner-up tweets were announced: three very different messages to the EU. One runner-up tweet urged the EU to allow for a small sustainability budget for at least 5 years after a project formally ends. The other runner-up tweet included the question ‘Will this tweet be preserved?’, which, very appropriately, has by now already been deleted and thus is seemingly lost forever.
But we are proud to announce:
More about the workshop in the official SCAPE/APARSEN workshop blogs, soon to be published!
- Jpylyzer by the KB (Royal Library of the Netherlands) and partners
- The SPRUCE Project by The University of Leeds and partners
- bwFLA Functional Long Term Archiving and Access by the University of Freiburg and partners
- Practical Digital Preservation: a how to guide for organizations of any size by Adrian Brown
- Skilling the Information Professional by Aberystwyth University
- Introduction to Digital Curation: An open online UCLeXtend Course by University College London
- Voices from a Disused Quarry by Kerry Evans, Ann McDonald and Sarah Vaughan, University of Aberystwyth
- Game Preservation in the UK by Alasdair Bachell, University of Glasgow
- Emulation v Format Conversion by Victoria Sloyan, University College London
The DPC Award for Safeguarding the Digital Legacy, which celebrates the practical application of preservation tools to protect at-risk digital objects.
- Conservation and Re-enactment of Digital Art Ready-Made, by the University of Freiburg and Partners
- Carcanet Press Email Archive, University of Manchester
- Inspiring Ireland, Digital Repository of Ireland and Partners
- The Cloud and the Cow, Archives and Records Council of Wales
The following is a guest post by Nicholas Woodward, an Information Technology Specialist and the newest member of the Library’s Web Archiving team.
The path that led me to the Library of Congress was long and circuitous, and it included everything from a tiny web startup to teaching economics in Nicaragua to rediscovering a passion for developing software in Austin, Texas. Like many folks who develop software in the academic and library world, I have a deep interest in the social sciences and humanities, in addition to technology.
But unlike others who began in these fields and subsequently developed technological knowledge and skills to do new and exciting things, I did the opposite. I spent years in the technology industry only to find that it had little value for me without serious contemplation of its effect on other people’s lives. Only later did I discover that software development in the library and academic environments lets one build such considerations into the process of writing code: the practical applications for research, or the ways different forces in society influence technological development and vice versa.
But I’m jumping ahead. Let’s get the events out of the way. In 2003 I graduated from the University of Nebraska-Lincoln with a BS in computer science and started working full-time at a very small web development company. After deciding there must be more to life than making websites for a salary, I joined the Peace Corps in 2005 and worked as a high school teacher in Nicaragua for roughly 2.5 years. After a brief stint observing elections in Guatemala, I returned to the U.S. in hopes of going back to school to study the social sciences with a focus on Latin America. My dream scenario took shape when I was accepted to an MA program in the Teresa Lozano Long Institute of Latin American Studies at the University of Texas at Austin. I earned my MA in 2011 and subsequently earned an MS in Library and Information Science in 2013, also at UT.
It was while I was an MA student that a graduate research assistantship changed my career path for good. As a dual research assistant for the Latin American Network Information Center and the Texas Advanced Computing Center I had the incredible opportunity to conduct research on a large web archive in a high-performance computing environment. In the process I learned about things such as the Hadoop architecture and natural language processing and Bayesian classifiers and distributed computing and…
But the real value, as far as I was concerned, was that I could see directly how software development could be more than just putting together code to do “cool stuff.” I realized that developing software to facilitate research and discovery of massive amounts of data in an open and collaborative fashion not only increases the opportunities for alternative types of knowledge production but also influences, in a very profound way, how that knowledge gets created. And being a part of this process, however small, was the ideal place for me.
Which brings us to today. I am thrilled to be starting my new role as an Information Technology Specialist with the web archiving team of the Library’s Office of Strategic Initiatives. It is an incredible opportunity to learn new skills, incorporate knowledge I’ve acquired in the past and contribute in whatever ways I can to an outstanding team that is at the forefront of Internet archiving.
As the newest member of the web archiving team, my focus will be to continue the ongoing development of Digiboard 4.0 (pdf), the next version of our web application for managing the web archiving process at the Library of Congress. Digiboard 4.0 will build on previous software that enables Library staff to create collections of web-archived content, nominate new websites and review crawls of the Internet for quality assurance, while also making the process more efficient and expanding opportunities for cataloging archived websites. Additionally, part of my time will include exploratory efforts to expand the infrastructure and capacity of the web archiving team for in-house Internet crawling.
I look forward to the challenges and opportunities that lie ahead as we contribute to the greater web archiving community through establishing best practices, improving organizational workflows for curation, quality review and presentation of web-archived content, and generally expanding the boundaries of preserving the Internet for current and future generations.
Every year, the Small Press Expo in Bethesda, Md., brings together a community of alternative comics creators and independent publishers. Given the Library of Congress’s significant history of collecting comics, it made sense for the Library’s Serial and Government Publications Division and the Prints & Photographs Division to partner with SPX to build a collection documenting alternative comics and comics culture. Over the last three years, this collection has been developing and growing.
While the collection itself is quite fun (what’s not to like about comics?), it is also a compelling example of the way that web archiving can complement and fit into the work of developing a special collection. To that end, I am excited to talk with Megan Halsband, Reference Librarian with the Library of Congress Serial and Government Publications Division and one of the key staff working on this collection, as part of our Content Matters interview series.
Trevor: First off, when people think Library of Congress I doubt “comics” is one of the first things that comes to mind. Could you tell us a bit about the history of the Library’s comics collection, the extent of the collections and what parts of the Library of Congress are involved in working with comics?
Megan: I think you’re right – the comics collection is not necessarily one of the things that people associate with the Library of Congress – but hopefully we’re working on changing that! The Library’s primary comics collections are two-fold – first there are the published comics held by the Serial & Government Publications Division, which appeared in newspapers/periodicals and later in comic books, as well as the original art, which is held by the Prints & Photographs Division.
The Comic Book Collection here in Serials is probably the largest publicly available collection in the country, with over 7,000 titles and more than 125,000 issues. People wonder why our section at the Library is responsible for the comic books – it’s because most comic books are published serially. Housing the comic collection in Serials also makes sense, since we are responsible for the newspaper collections (which include comics). The majority of our comic books come through the US Copyright Office via copyright deposit, and we’ve been receiving comic books this way since the 1930s/1940s.
The Library tries to have complete sets of all the issues of major comic titles but we don’t necessarily have every issue of every comic ever published (I know what you’re thinking and no, we don’t have an original Action Comics No. 1 – maybe someday someone will donate it to us!). The other main section of the Library that works with comic materials is Prints & Photographs – though Rare Book & Special Collections and the area studies reading rooms probably also have materials that would be considered ‘comics.’
Trevor: How did the idea for the SPX collection come about? What was important about going out to this event as a place to build out part of the collection? Further, in scoping the project, what about it suggested that it would also be useful/necessary to use web archiving to complement the collection?
Megan: The executive director of SPX, Warren Bernard, has been working in the Prints & Photographs Division as a volunteer for a long time, and the collection was established in 2011 after a Memorandum of Understanding was signed between the Library and SPX. I think Warren really was a major driving force behind this agreement, but the curators in both Serials and Prints & Photographs realized that our collections didn’t include materials from this particular community of creators and publishers in the way that they should.
Given that SPX is a local event with an international reputation and awards program (SPX awards the Ignatz) and the fact that we know staff at SPX, I think it made sense for the Library to have an ‘official’ agreement that serves as an acquisition tool for material that we probably wouldn’t otherwise obtain. Actually going to SPX every year gives us the opportunity to meet with the artists, see what they’re working on and pick up material that is often only available at the show – in particular mini-comics or other free things.
Something important to note is that the SPX Collection – the published works, the original art, everything – is all donated to the Library. This is huge for us – we wouldn’t be able to collect the depth and breadth of material (or possibly any material at all) from SPX otherwise. As far as including online content for the collection, the Library’s Comics and Cartoons Collection Policy Statement (PDF) specifically states that the Library will collect online/webcomics, as well as award-winning comics. The SPX Collection, with its web archiving component, specifically supports both of these goals.
Trevor: What kinds of sites were selected for the web archive portion of the collection? In this case, I would be interested in hearing a bit about the criteria in general and also about some specific examples. What is it about these sites that is significant? What kinds of documentation might we lose if we didn’t have these materials in the collection?
Megan: Initially the SPX webarchive (as I refer to it – though its official name is Small Press Expo and Comic Art Collection) was extremely selective – only the SPX website itself and the annual winner of the Ignatz Award for Outstanding Online Comic were captured. The staff wanted to see how hard it would be to capture websites with lots of image files (of various types). Turns out it works just fine, provided there’s no paywall or subscriber login required, so we expanded the collection to include all the Ignatz nominees in the Outstanding Online Comic category as well.
Some of these sites, such as Perry Bible Fellowship and American Elf, are long-running online comics whose creators have been awarded Eisner, Harvey and Ignatz awards. There’s a great deal of content on these websites that isn’t published or available elsewhere – and I think that this is one of the major reasons for collecting this type of material. Sometimes the website might have initial drafts or ideas that are later published, sometimes the online content is not directly related to published materials, but for in-depth research on an artist or publication, this type of related content is often extremely useful.
Trevor: You have been working with SPX to build this collection for a few years now. Could you give us an overview of what the collection consists of at this point? Further, I would be curious to know a bit about how the idea of the collection is playing out in practice. Are you getting the kinds of materials you expected? Are there any valuable lessons learned along the way that you could share? If anyone wants access to the collection how would they go about that?
Megan: At this moment in time, the SPX Collection materials that are here in Serials include acquisitions from 2011-2013, plus two special collections that were donated to us, the Dean Haspiel Mini-Comics Collection and the Heidi MacDonald Mini-Comics Collection. I would say that the collection has close to 2,000 items (we don’t have an exact count since we’re still cataloging everything) as well as twelve websites in the web archive. We have a wonderful volunteer who has been working on cataloging items from the collection, and so far there are over 550 records available in the Library’s online catalog.
Personally, I didn’t have any real expectations of what kinds of materials we would be getting – I think that definitely we are getting a good selection of mini-comics, but it seems like there are more graphic novels than I anticipated. One of the fun things about this collection is the new and exciting things that you end up finding at the show – like an unexpected tiny comic that comes with its own magnifying glass or an oversize newsprint series.
The process of collecting has definitely gotten easier over the years. For example, the Head of the Newspaper Section, Georgia Higley, and I just received the items that were submitted in consideration for the 2014 Ignatz Awards. We’ll be able to prep permission forms/paperwork in advance of the show for the materials we’re keeping from this material, and it will help us cut down on potential duplication. This is definitely a valuable lesson learned! We’ve also come up with a strategy for visiting the tables at the show – there are 287 tables this year – so we divide up the ballroom between four of us (Georgia and I, as well as two curators from Prints & Photographs – Sara Duke and Martha Kennedy) to make it manageable.
We also try to identify items that we know we want to ask for in advance of the show – such as ongoing serial titles or debut items listed on the SPX website – to maximize our time when we’re actually there. Someone wanting to access the collection would come to the Newspaper & Current Periodical Reading Room to request the comic books and mini-comics. Any original art or posters from the show would be served in the Prints & Photographs Reading Room. As I mentioned – there is still a portion of this collection that is unprocessed – and may not be immediately accessible.
Trevor: Stepping back from the specifics of the collection, what about this do you think stands for a general example of how web archiving can complement the development of special collections?
Megan: One of the true strengths of the Library of Congress is that our collections often include not only the published version, but also the ephemeral material related to the published item/creator, all in one place. From my point of view, collecting webcomics gives the Library the opportunity to collect some of this ‘ephemera’ related to comics collections and only serves to enhance what we are preserving for future research. And as I mentioned earlier, some of the content on the websites provides context, as well as material for comparison, to the physical collection materials that we have, which is ideal from a research perspective.
Trevor: Is there anything else with web archiving and comics on the horizon for your team? Given that web comics are such significant part of digital culture I’m curious to know if this is something you are exploring. If so, is there anything you can tell us about that?
Megan: We recently began another web archive collection to collect additional webcomics beyond those nominated for Ignatz Awards – think Dinosaur Comics and XKCD. It’s very new (and obviously not available for research use yet) – but I am really excited about adding materials to this collection. There are a lot of webcomics out there – and I’m glad that the Library will now be able to say we have a selection of this type of content in our collection! I’m also thinking about proposing another archive to capture comics literature and criticism on the web – stay tuned!