Planet DigiPres

Archiving video

File Formats Blog - 19 July 2014 - 10:59am

Suppose you see a cop beating someone up for jaywalking, or you’re stopped at one of the Border Patrol’s internal checkpoints. You’ve got your camera, phone, or tablet, so you make a video record of the incident. What do you do next? The Activists’ Guide to Archiving Video has some solid advice. Its purpose is to help you “make sure that the video documentation you have created or collected can be used for advocacy, as evidence, for education or historical memory – not just now but into the future.” Most of it applies to any video recording that has long-term importance. In essence, it’s the same advice you’d get from Files that Last or from the Library of Congress. It includes considerations that especially apply to sensitive video, such as encryption and information that might put people at risk, but it’s a valuable addition to anyone’s digital preservation library.

There’s a PDF version of the guide for people who don’t like hopping around web pages. Versions in Spanish and Arabic are also provided.


Tagged: metadata, preservation, video
Categories: Planet DigiPres

A VM4C3PO

Open Planets Foundation Blogs - 17 July 2014 - 2:36pm

We have just set up a Vagrant environment for C3PO. It starts a headless VM in which the C3PO-related functionalities (MongoDB, Play, a downloadable command-line jar) are manageable from the host's browser. Further, the VM itself has all relevant processes configured at start-up independently of Vagrant, so that, once created, it can be downloaded and used as a stand-alone C3PO VM. We think this could be a scenario applicable to other SCAPE projects as well. The following is a summary of the ideas we've had and the experience we've gained.

The Result

The Vagrantfile and a directory containing all Vagrant-relevant files live directly in the root directory of the C3PO repository. So after installing Vagrant and cloning the repository, a simple 'vagrant up' should do all the work, such as downloading the base box, installing the necessary software and booting the new VM.
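As a rough sketch, assuming only what is described here (the clone URL below is a placeholder, not the actual repository address), getting a running VM amounts to:

```bash
# Prerequisites: Vagrant and VirtualBox installed on the host.
git clone <c3po-repository-url>    # placeholder for the actual C3PO repository URL
cd c3po
vagrant up                         # downloads the base box, provisions and boots the VM
```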

After a few minutes one should have a running VM that is accessible from the host's browser at localhost:8000. This opens a central welcome page that contains information about the VM-specific aspects and links to the Play framework's URL (localhost:9000) and the MongoDB admin interface (localhost:28017). It also provides a download link for the command-line jar, which has to be used in order to import data. It can be used from outside the VM, as the MongoDB port is mapped as well. So I can import and analyse data with C3PO without having to fiddle through the setup challenges myself, and, believe me, that way can be long and stony.

The created image is self-contained in the sense that, if I put it on a server, anyone who has VirtualBox installed can download it and use it, without having to rely on Vagrant working on their machine.

General Setup

The provisioning script has a number of tasks:

  • it downloads all required dependencies for building the C3PO environment
  • it installs a fresh C3PO (from /vagrant, which is the shared folder connecting the git repository and the VM) and assembles the command-line app
  • it installs and runs a MongoDB server
  • it installs and runs the Play framework
  • it creates a port-forwarded static welcome page with links to all the functionalities above
  • it adds all of the above to the native Ubuntu startup (using /etc/rc.local, if necessary), so that an image of the VM can theoretically be run independently of the Vagrant environment

These are all trivial steps, but it makes a difference not having to implement all of them manually.
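Purely as an illustration of how such a provisioning script can be organised (the function bodies, package names and paths below are assumptions, not the actual C3PO script), a grouped-by-function skeleton might look like this:

```bash
#!/usr/bin/env bash
# Illustrative provisioning skeleton; package names and commands are placeholders.
set -e

install_dependencies() {
    apt-get update
    apt-get install -y openjdk-7-jdk maven mongodb    # assumed package set
}

install_c3po() {
    # /vagrant is the shared folder pointing at the cloned repository
    cd /vagrant && mvn clean install -DskipTests
}

start_services() {
    service mongodb start
    # start the Play application and serve the static welcome page here
}

persist_startup() {
    # append the start commands to /etc/rc.local so an exported image
    # boots the same services without Vagrant
    :
}

install_dependencies
install_c3po
start_services
persist_startup
```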

Getting rid of proxy issues

In case you're behind one of those very common NTLM company proxies, you'll really like that the only thing you have to provide is a config script with some details about your proxy. If the setup script detects this file, it will download the necessary software and configure Maven to use it. Doing it this way was actually the first time I got Maven running smoothly on a Linux VM behind our proxy.
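Purely for illustration, and only as an assumption about what such a config script could contain (the variable names are made up, not the ones the setup script actually expects), it might be as simple as:

```bash
# proxy.conf - hypothetical example; variable names are illustrative only
PROXY_HOST=proxy.example.com
PROXY_PORT=8080
PROXY_USER=jdoe
PROXY_PASS=secret

# The provisioning script could source this file, install an NTLM helper such
# as cntlm, and write a matching <proxy> entry into ~/.m2/settings.xml so that
# Maven downloads go through the company proxy.
```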

Ideas for possible next steps

There is loads left to do, here are a few ideas:

  • provide interesting initial test-data that ships with the box, so that people can play around with C3PO without having to install/import anything at all.
  • why not have a VM for more SCAPE projects? We could quickly create a repository for something like a SCAPE base VM configuration that is usable as a base for other VMs. The central welcome page could be pre-configured (SCAPE-branded), as could all the proxy- and development-environment-related items mentioned above.
  • I'm not sure about the sustainability of shell provisioning scripts as the complexity of the bootstrap process increases. Grouping the shell commands into functions is certainly an improvement, but it might be worth checking out other, more dynamic provisioners. One I find particularly interesting is Ansible.
  • currently there's no way of testing that the VM works with the current development trunk; a test environment that runs the VM and tests all the relevant connection bits would be handy

 

Preservation Topics: SCAPE
Categories: Planet DigiPres

CSV Validator version 1.0 release

Open Planets Foundation Blogs - 15 July 2014 - 12:10pm

Following on from my previous brief post announcing the beta release of the CSV Validator, http://www.openplanetsfoundation.org/blogs/2014-03-21-csv-validator-beta-releases, today we've made the formal version 1.0 release of the CSV Validator and the associated CSV Schema Language. I've described this in more detail on The National Archives' blog, http://blog.nationalarchives.gov.uk/blog/csv-validator-new-digital-preservation-tool/

Preservation Topics: Tools
Categories: Planet DigiPres

Crowdsourcing song identification

File Formats Blog - 14 July 2014 - 10:04am

Some friends of mine are pulling together a project for crowdsourcing identification of a large collection of music clips. At least a couple of us are professional software developers, but I’m the one with the most free time right now, and it fits with my library background, so I’ve become lead developer. In talking about it, we’ve realized it can be useful to librarians, archivists, and researchers, so we’re looking into making it a crowdfunded open source project.

A little background: “Filk music” is songs created and sung by science fiction and fantasy fans, mostly at conventions and in homes. I’ve offered a definition of filk on my website. There are some shoestring filk publishers; technically they’re in business, but it’s a labor of love rather than a source of income. Some of them have a large backlog of recordings from past conventions. Just identifying the songs and who’s singing them is a big task.

This project is, initially, for one of these filk publishers, who has the biggest backlog of anyone. The approach we’re looking at is making short clips available to registered crowdsource contributors, and letting them identify as much as they can of the song, the author, the performer(s), the original tune (many of these songs are parodies), etc. Reports would be delivered to editors for evaluation. There could be multiple reports on the same clip; editors would use their judgment on how to combine them. I’ve started on a prototype, using PHP and MySQL.

There’s a huge amount of enthusiasm among the people already involved, which makes me confident that at least the niche project will happen. The question is whether there may be broader interest. I can see this as a very useful tool for professionals dealing with archives of unidentified recordings: folk music, old jazz, transcribed wax cylinder collections, whatever. There’s very little in the current design that’s specific to one corner of the musical world.

The first question: Has anyone already done it? Please let me know if something like this already exists.

If not, how interesting does it sound? Would you like it to happen? What features would you like to see in it?

Update: On the Code4lib mailing list, Jodi Schneider pointed out that nichesourcing is a more precise word for what this project is about.


Tagged: archiving, crowdsourcing, filk, music
Categories: Planet DigiPres

New QA tool for finger detection on scans

Open Planets Foundation Blogs - 10 July 2014 - 11:49am

I would like to draw your attention to the new QA tool for finger detection on scans: https://github.com/openplanets/finger-detection-tool. This tool was developed by AIT within the scope of the SCAPE project.

 

Checking scans manually to identify fingers is a very time-consuming and error-prone process. You need a tool to help you: Fingerdet.

Fingerdet is an open source tool which:

  • provides decision-making support for detecting fingers on scans, within or across collections
  • identifies fingers independently of file format, scan quality, finger size, direction, shape, colour and lighting conditions
  • applies state-of-the-art image processing
  • is useful in assembling collections from multiple sources, and identifying corrupted files

Fingerdet brings the following benefits:

  • Automated quality assurance
  • Reduced manual effort and error
  • Saved time
  • Lower costs, e.g. storage, effort
  • Open source, standalone tool, also available as a Taverna component for easy invocation
  • Invariant to format, rotation, scale, translation, illumination, resolution, cropping, warping and distortions
  • May be applied to wide range of image collections, not just print images
Preservation Topics: Web Archiving, Preservation Risks, SCAPE, Software
Categories: Planet DigiPres

Quality assured ARC to WARC migration

Open Planets Foundation Blogs - 10 July 2014 - 10:44am

This blog post continues a series of posts about the web archiving topic "ARC to WARC migration"; namely, it is a follow-up on the posts "ARC to WARC migration: How to deal with de-duplicated records?" and "Some reflections on scalable ARC to WARC migration".

Especially the last of these posts, which described how SCAPE tools can be used for multi-Terabyte web archive data migration, is the basis for this post from a subject point of view. One consequence of evaluating alternative approaches for processing web archive records using the Apache Hadoop framework was to abandon the native Hadoop job implementation (the arc2warc-migration-hdp module was deprecated and removed from the master branch), because it had some disadvantages without bringing significant benefits in terms of performance and scale-out capability compared to the command-line application arc2warc-migration-cli used together with the SCAPE tool ToMaR for parallel processing. While that previous post did not elaborate on quality assurance, it will be the main focus of this post.

The workflow diagram in figure 1 illustrates the main components and processes that were used to create a quality assured ARC to WARC migration workflow.


Figure 1: Workflow diagram of the ARC to WARC migration workflow

The basis of the main components used in this workflow is the Java Web Archive Toolkit (JWAT) for reading web archive ARC container files. Based on this toolkit, the "hawarp" tool set was developed in the SCAPE project; it bundles several components for preparing and processing web archive data, especially data that is stored in ARC or WARC container files. Special attention was given to making sure that data can be processed using the Hadoop framework, an essential part of the SCAPE platform for distributed data processing using computer clusters.

The input of the workflow is an ARC container file, a format originally proposed by the Internet Archive to persistently store web archive records. The ARC to WARC migration tool is a Java command-line application which takes an ARC file as input and produces a WARC file. The tool basically performs a procedural mapping of metadata between the ARC and WARC formats (see the constructors of the eu.scape_project.hawarp.webarchive.ArchiveRecord class). One point worth highlighting is that the records of ARC container files from the Austrian National Library's web archive were not structured homogeneously. When creating ARC records (using the Netarchive Suite/Heritrix web crawler in this case), the usual procedure was to strip the HTTP response metadata off the server's response and store these data as part of the header of the ARC record. Possibly due to malformed URLs, this was not applied to all records, so that the HTTP response metadata were still part of the payload content, as was actually defined later for the WARC standard. The ARC to WARC migration tool handles these special cases accordingly. Generally, and as table 1 shows, HTTP response metadata is transferred from the ARC header to the WARC payload header and therefore becomes part of the payload content.

ARC record:                          WARC record:
  ARC header                           WARC header
    HTTP response metadata             WARC payload
  ARC payload                            HTTP response metadata (payload header)
                                         payload content

Table 1: HTTP response metadata is transferred from ARC header to WARC payload header

The CDX-Index Creation module, which is also part of the "hawarp" tool set, is used to create a file in the CDX file format to store selected attributes of web archive records aggregated in ARC or WARC container files - one line per record - in a plain text file. The main purpose of the CDX index is to provide a lookup table for the wayback software. The index file contains the necessary information (URL, date, offset: record position in container file, container identifier, etc.) to retrieve the data required for rendering an archived web page and its dependent resources from the container files.

Apart from serving the purpose of rendering web resources using the wayback software, the CDX index file can also be used for a basic verification of whether the container format migration process was successful, namely by comparing the CDX fields of the ARC CDX file and the WARC CDX file. The basic assumption here is that, apart from the offset and container identifier fields, all the other fields must have the same values for corresponding records. The payload digest in particular allows verifying that the digest computed for the binary data (payload content) is the same for corresponding records in the two container formats.
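A minimal sketch of such a check from the shell, assuming the URL and payload digest can be cut out of both CDX files (the field positions used below are assumptions and have to be adjusted to the CDX format line actually written by the CDX creation module):

```bash
#!/usr/bin/env bash
# Compare URL + payload digest between the ARC-derived and WARC-derived CDX files.
# Field numbers (1 = URL, 6 = payload digest) are assumptions; adjust to the
# actual CDX format specification line.
sort -k1,1 arc.cdx  | awk '{print $1, $6}' > arc_digests.txt
sort -k1,1 warc.cdx | awk '{print $1, $6}' > warc_digests.txt

if diff -u arc_digests.txt warc_digests.txt > cdx_diff.txt; then
    echo "All URLs and payload digests match."
else
    echo "Mismatching records found, see cdx_diff.txt"
fi
```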

An additional step of the workflow to verify the quality of the migration is to compare the rendering results of selected resources when retrieved from the original ARC and the migrated WARC container files. To this end, the CDX files are deployed to the wayback application in a first step. In a second step the PhantomJS framework is used to take snapshots of the rendering of the same resource, retrieved once from the ARC container and once from the WARC container file.

Finally, the snapshot images are compared using Exiftool (basic image properties) and ImageMagick (measure AE: absolute error) in order to determine if the rendering result is equal for both instances. Randomized manual verification of individual cases may then conclude the quality control process.
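A rough sketch of this final comparison step, assuming the two PhantomJS snapshots already exist (file names and the pass criterion are illustrative):

```bash
#!/usr/bin/env bash
# Compare two rendering snapshots of the same resource:
# arc_snapshot.png (rendered from the ARC) and warc_snapshot.png (from the WARC).

# Basic image properties via ExifTool
exiftool -ImageWidth -ImageHeight arc_snapshot.png warc_snapshot.png

# Pixel-wise comparison via ImageMagick; the AE metric (number of differing
# pixels) is written to stderr by the compare tool.
ae=$(compare -metric AE arc_snapshot.png warc_snapshot.png null: 2>&1)

if [ "$ae" = "0" ]; then
    echo "Rendering results are identical."
else
    echo "Rendering differs in $ae pixels, flag for manual inspection."
fi
```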

There is an executable Taverna workflow available on myExperiment. The Taverna workflow is configured by adapting the constant values (light-blue boxes) which define the paths to configuration files, deployment files, and scripts in the processing environment. However, as Taverna is used in this workflow only as an orchestration tool to build a sequence of bash script invocations, it is also possible to use the individual scripts of the workflow directly and replace the Taverna variables (enclosed in two percent symbols) accordingly.

The following prerequisites must be fulfilled to be able to execute the Taverna workflow and/or the bash scripts it contains:

The following screencast demonstrates the workflow using a simple "Hello World!" crawl as example:

Taxonomy upgrade extras: SCAPE, SCAPEProject, SCAPE-Project, Web Archiving
Preservation Topics: Preservation Actions, Migration, Web Archiving, SCAPE
Categories: Planet DigiPres

EaaS in Action — And a short meltdown due to a friendly DDoS

Open Planets Foundation Blogs - 9 July 2014 - 4:23pm

On June 24th 9.30 AM EST Dragan Espenschied, Digital Conservator at Rhizome NY, released an editorial on rhizome.org featuring a restored home computer previously owned by Cory Arcangel. The article uses an embedded emulator powered by the bwFLA Emulation as a Service framework  and the University of Freiburg’s computing center. The embedded emulator allows readers to explore and interact with the re-enacted machine. 

Currently the bwFLA test and demo infrastructure runs on an old, written-off IBM blade cluster, using 12 blades for demo purposes, each equipped with 8 core Intel(R) Xeon(R) CPUs (E5440 @ 2.83GHz). All instances are booted diskless (network boot) with the latest bwFLA codebase deployed. Additionally, there is an EaaS gateway running on 4 CPUs, delegating requests and providing a web container framework (JBoss) for IFrame delivery. To ensure decent performance of individual emulation sessions, we assign one emulation session to every available physical CPU. Hence, our current setup can handle 96 parallel sessions.

Due to some social media propaganda and good timing (US breakfast/coffee time) our resources were exhausted within minutes.

The figure above shows the total number of sessions for June 26th. Between 16:00 and 20:00 CEST, however, we were unable to deal with the demand.

After two days, however, load normalized again, though at a higher level.

Lessons learned

The bwFLA EaaS framework is able to scale with demand; our available (financial) resources, however, cannot. Our cluster is suitable for "normal" load scenarios. For peaks like the one we experienced with Dragan's post, a temporary deployment in the cloud (e.g. using PaaS / IaaS cloud services) is a cost-effective strategy, since these heavy-load situations last only a few days. The "local" infrastructure should be scaled to average demand, keeping the costs of running EaaS at bay. For instance, Amazon EC2 charges about €0.50 per hour for an 8-CPU machine. In the case of Dragan's post, the average session time of a user playing with the emulated machine was 15 minutes; hence the average cost per user is about €0.02 if a machine is fully utilized.
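As a back-of-the-envelope check of that figure, using the numbers above (one session per CPU, 15 minutes per session):

```latex
\frac{0.50\ \text{EUR/h}}{8\ \text{parallel sessions}} \times 0.25\ \text{h per session}
  \approx 0.016\ \text{EUR} \approx 0.02\ \text{EUR per user}
```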

 

Taxonomy upgrade extras: EaaS
Preservation Topics: Emulation
Categories: Planet DigiPres

BSDIFF: Technological Solutions for Reversible Pre-conditioning of Complex Binary Objects

Open Planets Foundation Blogs - 9 July 2014 - 12:31am

During my time at The National Archives UK, my colleague Adam Retter developed a methodology for the reversible pre-conditioning of complex binary objects. The technique was required to avoid doubling the storage for malformed JPEG2000 objects numbering in the hundreds of thousands. The difference between a malformed JPEG2000 file and a corrected, well-formed JPEG2000 file was, in this instance, a handful of bytes, yet the objects themselves were many megabytes in size. The cost of storage means that doubling it in such a scenario is not desirable in today’s fiscal environment – especially if it can be avoided.

As we approach ingest of our first born-digital transfers at Archives New Zealand, we also have to think about such issues. We’re also concerned about the documentation of any comparable changes to binary objects, as well as any more complicated changes to objects in any future transfers.

The reason for making changes to a file pre-ingest (in our process terminology, pre-conditioning) is to ensure well-formed, valid objects are ingested into the long-term digital preservation system. By using processes to ensure changes are:

  • Reversible
  • Documented
  • Approved

we can counter, up front, any issues identified as digital preservation risks by the system's custom rules, ensuring we don't have to perform any preservation actions in the short to medium term. Such issues may be raised through format identification, validation, or characterisation tools. The issues can be trivial or complex, and the objects that contain exceptions may themselves be trivial or complex.

At present, if pre-conditioning is approved, it will result in a change being made to the digital object and written documentation of the change, associated with the file, in its metadata and in the organisation’s content management system outside of the digital preservation system.

As an example of documentation for a change, we can look at a provenance note I might write to describe a change in a plain text file. The reason for the change is that the digital preservation system expects the object to be encoded as UTF-8. A conversion can give us stronger confidence about what this file is in future. Such a change, converting the object from ASCII to UTF-8, can be completed either as a pre-conditioning action pre-ingest or as a preservation migration post-ingest.

Provenance Note

“Programmers Notepad 2.2.2300-rc used to convert plain-text file to UTF-8. UTF-8 byte-order-mark (0xEFBBBF) added to beginning of file – file size +3 bytes. Em-dash (0x97 ANSI) at position d1256 replaced by UTF-8 representation 0xE28094 at position d1256+3 bytes (d1259-d1261) – file size +2 bytes.”

Such a small change is deceptively complex to document. Without the presence of a character sitting outside of the ASCII range we might have simply been able to write, “UTF-8 byte-order-mark added to beginning of file.” – but with its presence we have to provide a description complete enough to ensure that the change can be observed, and reversed by anyone accessing the file in future.
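For instance, the presence of the byte-order-mark and the change in file size described in the note can be checked from the shell (file names are illustrative):

```bash
# Show the first three bytes of the converted file; a UTF-8 BOM appears as "efbb bf"
head -c 3 converted.txt | xxd

# Compare file sizes of the original and the converted object
stat -c %s original.txt converted.txt    # GNU coreutils; use 'stat -f %z' on BSD
```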

Pre-conditioning vs. Preservation Migration

As pre-conditioning is a form of preservation action that happens outside of the digital preservation system, we don't have adequate tools to complete the action and document it for us – especially for complex objects. We're relying on good hand-written documentation being provided on ingest. The temptation, therefore, is to let the digital preservation system handle this using its inbuilt capability to record and document all additions to a digital object's record, including the generation of additional representations; but the biggest argument against relying on this is the cost of storage, and how that cost increases with the number of objects likely to require this sort of treatment over time.

Proposed Solution

It is important to note that the proposed solution can be implemented either pre- or post-ingest, which removes the emphasis from where in the digital preservation process it occurs; however, incorporating it post-ingest requires changes to the digital preservation system, whereas doing it pre-ingest enables it to be done manually, with immediate challenges addressed. Consistent usage and proven advantages over time might see it included in a digital preservation mechanism at a higher level.

The proposed solution is to use a patch file, specifically a binary diff, which stores instructions about how to convert one bitstream to another. We can create a patch file by using a tool that compares an original bitstream to a corrected (pre-conditioned) version of it and stores the result of the comparison. Patch files can add and remove information as required, so we can apply the instructions to a corrected version of any file to reproduce the uncorrected original.

The tool we adopted at The National Archives, UK was BSDIFF. This tool is distributed with the FreeBSD operating system, but is also available on Linux and Windows.

The tool was created by Colin Percival and there are two utilities required: one to create a binary diff, BSDIFF itself, and the other to apply it, BSPATCH. The manual instructions are straightforward, but the important part of the solution in a digital preservation context is to flip the terms <oldfile> and <newfile>. So, for example, in the manual:

  • $ bsdiff <oldfile> <newfile> <patchfile>

Can become:

  • $ bsdiff <newfile> <oldfile> <patchfile>

Further, in the below descriptions, I will replace <newfile> and <oldfile> for <pre-conditioned-file> and <malformed-file> respectively, e.g.

  • $ bsdiff <pre-conditioned-file> <malformed-file> <patchfile>

BSDIFF

BSDIFF generates a patch <patchfile> between two binary files. It compares <pre-conditioned-file> to <malformed-file> and writes a <patchfile> suitable for use by BSPATCH.

BSPATCH

BSPATCH applies a patch built with BSDIFF; it generates <malformed-file> using <pre-conditioned-file> and the <patchfile> from BSDIFF.

Examples

For my examples I have been using the Windows port of BSDIFF referenced from Colin Percival’s site.

To begin with, a non-archival example simply re-producing a binary object:

If I have the plain text file, hen.txt:

  • The quick brown fox jumped over the lazy hen.

I might want to correct the text to its more well-known pangram form – dog.txt:

  • The quick brown fox jumped over the lazy dog.

I create dog.txt and using the following command I create hen-reverse.diff:

  • $ bsdiff dog.txt hen.txt hen-reverse.diff

We have two objects we need to look after, dog.txt and hen-reverse.diff.

If we ever need to look at the original again we can use the BSPATCH utility:

  • $ bspatch dog.txt hen-original.txt hen-reverse.diff

We end up with a file that matches the original byte for byte, which can be confirmed by comparing the two checksums.

$ md5sum hen.txt
84588fd6795a7e593d0c7454320cf516 *hen.txt
$ md5sum hen-original.txt
84588fd6795a7e593d0c7454320cf516 *hen-original.txt

This illustrates that we can re-create the original binary object, but we're not saving any storage space at this point, as the patch file is bigger than the <malformed-file> and <pre-conditioned-file> together:

  • hen.txt – 46 bytes
  • dog.txt – 46 bytes
  • hen-reverse.diff – 159 bytes

The savings we can make by using binary diff objects to store pre-conditioning instructions, however, begin to show when we ramp up the complexity and size of the objects we're working with. Still working with text, we can convert the following plain-text object to UTF-8, complementing the pre-conditioning action we might perform on archival material as described in the introduction to this blog entry:

  • Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas quam lacus, tincidunt sit amet lobortis eget, auctor non nibh. Sed fermentum tempor luctus. Phasellus cursus, risus nec eleifend sagittis, odio tellus pretium dui, ut tincidunt ligula lorem et odio. Ut tincidunt, nunc ut volutpat aliquam, quam diam varius elit, non luctus nulla velit eu mauris. Curabitur consequat mauris sit amet lacus dignissim bibendum eget dignissim mauris. Nunc eget ullamcorper felis, non scelerisque metus. Fusce dapibus eros malesuada, porta arcu ut, pretium tellus. Pellentesque diam mauris, mollis quis semper sit amet, congue at dolor. Curabitur condimentum, ligula egestas mollis euismod, dolor velit tempus nisl, ut vulputate velit ligula sed neque. Donec posuere dolor id tempus sodales. Donec lobortis elit et mi varius rutrum. Vestibulum egestas vehicula massa id facilisis.

Converting the passage to UTF-8 doesn't require the conversion of any characters within the text itself, rather just the addition of the UTF-8 byte-order-mark at the beginning of the file. Using Programmers Notepad we can open lorem-ascii.txt and re-save it with a different encoding as lorem-utf-8.txt. As with dog.txt and hen.txt we can then create the patch, and then apply it to see the original again, using the following commands:

  • $ bsdiff lorem-utf-8.txt lorem-ascii.txt lorem-reverse.diff
  • $ bspatch lorem-utf-8.txt lorem-ascii-original.txt lorem-reverse.diff

Again, confirmation that bspatch outputs a file matching the original can be seen by looking at their respective MD5 values:

$ md5sum lorem-ascii.txt
ec6cf995d7462e20f314aaaa15eef8f9 *lorem-ascii.txt
$ md5sum lorem-ascii-original.txt
ec6cf995d7462e20f314aaaa15eef8f9 *lorem-ascii-original.txt

The file sizes here are much more illuminating:

  • lorem-ascii.txt – 874 bytes
  • lorem-utf-8.txt – 877 bytes
  • lorem-reverse.diff – 141 bytes

Just one more thing… Complexity!

We can also demonstrate the complexity of the modifications we can make to digital objects that BSDIFF affords us. Attached to this blog is a zip file containing supporting files, lorem-ole2-doc.doc and lorem-xml-docx.docx.

The files are used to demonstrate a migration exercise from an older Microsoft OLE2 format to the up-to-date OOXML format.

I’ve also included the patch file lorem-word-reverse.diff.

Using the commands as documented above:

  • $ bsdiff lorem-xml-docx.docx lorem-ole2-doc.doc lorem-word-reverse.diff
  • $ bspatch lorem-xml-docx.docx lorem-word-original.doc lorem-word-reverse.diff

We can observe that application of the diff file to the ‘pre-conditioned’ object, results in a file identical to the original OLE2 object:

$ md5sum lorem-ole2-doc.doc
3bb94e23892f645696fafc04cdbeefb5 *lorem-ole2-doc.doc
$ md5sum lorem-word-original.doc
3bb94e23892f645696fafc04cdbeefb5 *lorem-word-original.doc

The file-sizes involved in this example are as follows:

  • lorem-ole2-doc.doc – 65,536 bytes
  • lorem-xml-docx.docx – 42,690 bytes
  • lorem-word-reverse.diff – 16,384 bytes

The neat part of this as a solution, if it wasn't enough that the most complex of modifications are reversible, is that the provenance note remains the same for all transformations between all digital objects. The tools and techniques are documented instead, and the rest is consistently consistent, and perhaps even more accessible to users, who can understand this documentation more readily than the complex narrative, byte-by-byte breakdowns that might otherwise be necessary.
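Such a consistent procedure also lends itself to a small wrapper script. The sketch below (file handling is illustrative, not an agreed workflow) creates the reverse patch and immediately verifies that applying it to the pre-conditioned file really does reproduce the original, byte for byte:

```bash
#!/usr/bin/env bash
# Usage: ./make-reverse-patch.sh <pre-conditioned-file> <malformed-file> <patchfile>
# Creates a reverse patch with bsdiff and verifies the round trip with bspatch.
set -e

preconditioned="$1"
malformed="$2"
patchfile="$3"

bsdiff "$preconditioned" "$malformed" "$patchfile"

# Round-trip check: regenerate the original into a temporary file and compare
roundtrip=$(mktemp)
bspatch "$preconditioned" "$roundtrip" "$patchfile"

if cmp -s "$malformed" "$roundtrip"; then
    echo "OK: patch reproduces the original byte for byte."
    rm -f "$roundtrip"
else
    echo "FAIL: round trip does not match the original." >&2
    rm -f "$roundtrip"
    exit 1
fi
```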

Conclusions

Given the right problem, and the insight of an individual who was (at the time) outside the digital preservation sphere, Adam has shown us an innovative solution that helps us demonstrate provenance in a technologically and scientifically sound manner, more accurately and more efficiently than we might otherwise be able to do using current approaches. The solution:

  • Enables more complex pre-conditioning actions on more complex objects
  • Prevents us from doubling storage space
  • Encapsulates pre-conditioning instructions more efficiently and more accurately - there are fewer chances to make errors

While it is unclear whether Archives New Zealand will be able to incorporate this technique into its workflows at present, the solution will be presented alongside our other options so that it can be discussed and taken into consideration by the organisation as appropriate.

Work does need to be done to incorporate it properly, e.g. respecting original file-naming conventions, and some consideration should be given to where and when in the transfer / digital preservation process the method should be applied. However, it should prove to be an attractive and useful option for many archives performing pre-conditioning or preservation actions on future digital objects, trivial and complex alike.

 

---

Footnotes

Documentation for the BSDIFF file format, created by Colin Percival and BSD-licensed, is attached to this blog.

 

Preservation Topics: Preservation Actions, Migration, Preservation Strategies, Normalisation
Attachments: BSDIFF format documentation provided by Colin Percival [PDF] (29.93 KB); Support files for OLE2 to OOXML example [ZIP] (69.52 KB)
Categories: Planet DigiPres

HTML and fuzzy validity

File Formats Blog - 8 July 2014 - 5:54pm

Andy Jackson wrote an interesting post on the question of HTML validity. Only registered Typepad users can comment, so it’s easier for me to add something to the discussion here.

When I worked on JHOVE, I had to address the question of valid HTML. A few issues are straightforward; the angle brackets for tags have to be closed, and so do quoted strings. Beyond that, everything seems optional. There are no required elements in HTML, not even html, head, or body; a blank file or a plain text file with no tags can be a valid HTML document. The rules of HTML are designed to be forgiving, which just makes it harder to tell if a document is valid or not. I’ve recommended that JHOVE users not use the HTML module; it’s time-consuming and doesn’t give you much useful information.

There are things in XHTML which aren’t legal in HTML. The “self-closing” tag (<tag/>) is good XHTML, but not always legal HTML. In HTML5, <input ... /> is legal, but <span ... /> isn’t, because input doesn’t require a closing tag but span does. (In other words, it’s legal only when it’s superfluous.) However, any recent browser will accept both of them.

The set of HTML documents which are de facto acceptable and unambiguous is much bigger than the set which is de jure correct. Unfortunately, the former is a fuzzy set. How far can you push the rules before you’ve got unsafe, ambiguous HTML? It depends on which browsers and versions you’re looking at, and how strenuous your test cases are.

The problem goes beyond HTML proper. Most browsers deal with improper tag nesting, but JavaScript and CSS can raise bigger issues. These are very apt to have vendor-specific features, and they may have major rendering problems in browsers for which they weren’t tested. A document with broken JavaScript can be perfectly valid, as far as the HTML spec is concerned.

It’s common for JavaScript to be included by an external reference, often on a completely different website. These scripts may themselves have external dependencies. Following the dependency chain is a pain, but without them all the page may not work properly. I don’t have data, but my feeling is that far more web pages are broken because of bad scripts and external references than because of bad HTML syntax.

So what do you do when validating web pages? Thinking of it as “validating HTML” pulls you into a messy area without addressing some major issues. If you insist on documents that are fully compliant with the specs, you’ll probably throw out more than you accept, without any good reason. But at the same time, unless you validate the JavaScript and archive all external dependencies, you’ll accept some documents that have significant preservation issues.

It’s a mess, and I don’t think anyone has a good solution.


Categories: Planet DigiPres

Professional update

File Formats Blog - 8 July 2014 - 10:08am

Just to keep everyone up to date on what I’m doing professionally:

Currently I’m back in consulting mode, offering my services for software development and consultations. Those of you who’ve been following this blog regularly know I’ve been working with libraries for a long time and I’m familiar with the technology. I’ve updated my business home page at garymcgath.com and moved it to new hosting, which will allow me to put demos and other materials of interest on the site.

The key to success is, of course, networking, so if you happen to hear of a situation where my skills could be put to good use, please let me know.


Tagged: business
Categories: Planet DigiPres

The SCAPE Project video is out!

Open Planets Foundation Blogs - 3 July 2014 - 8:31am

Do you want a quick intro to what SCAPE is all about?

Then you should watch the new SCAPE video!

The video will be used at coming SCAPE events like SCAPE demonstration days and workshops and it will be available on Vimeo for everyone to use. You can help us to disseminate this SCAPE video by tweeting using this link https://vimeo.com/99803729

 

     

Stills from the video: "Standard tools become overtaxed..." and "...SCAPE addresses these challenges"

 

The production of this SCAPE video was part of the final project presentation. The idea behind the video is to explain what SCAPE is about to both technical and non-technical audiences. In other words, the overall outcomes and unique selling points of the project in a short and entertaining video. But how do you condense a four-year project with 19 partners and lots of different tools and other SCAPE products into just two minutes?

We started by formulating the overall SCAPE messages and unique selling points, from which a script was distilled. This was the basis for the voice-over text and a storyboard, after which the animation work began. There were lots of adjustments to be made in order to stay close to the actual SCAPE situation. It was great that SCAPErs from different areas of the project were kind enough to look at what we from the Take Up team came up with.

Please take a look and use this video to tell everyone how SCAPE helps you to bring your digital preservation into the petabyte dimension!

 

SCAPE Project - Digital Preservation into the Petabyte Dimension from SCAPE project on Vimeo.

Preservation Topics: SCAPE
Categories: Planet DigiPres

Introducing Flint

Open Planets Foundation Blogs - 2 July 2014 - 12:53pm

Hi, this is my first blog post in which I want to introduce the project I am currently working on: Flint.

history

Flint (File/Format Lint) developed out of DRMLint, a lightweight piece of Java software that makes use of different third-party tools (Preflight, iText, Calibre, JHOVE) to detect DRM in PDF files and EPUBs. Since its initial release we have added validation of files against an institutional policy, making use of Johan's pdfPolicyValidate work, restructured it to be modular and easily extendible, and found ourselves having developed a rather generic file format validation framework.

what does Flint do?

Flint is an application and framework to facilitate file/format validation against a policy. Its underlying architecture is based on the idea that file/format validation nearly always has a specific use case with concrete requirements that may differ from, say, validation against the official industry standard of a given format. We discuss the principal ideas we've implemented in order to match such requirements.

The code centres on individual file format modules, and thus takes a different approach to FITS; for example, the PDF module makes use of its own code and external libraries to check for DRM. Creating a custom module for your own file formats is relatively straightforward.

The Flint core, and modules, can be used via a command line interface, graphical user interface or as a software library. A MapReduce/Hadoop program that makes use of Flint as a software library is also included.

The following focuses on the main features:

Flint-the-API

The core module provides an interface for new format-specific implementations, which makes it easy to write a new module. The implementation is provided with a straightforward core workflow from the input file to standardised output results. Several optional functionalities (e.g. Schematron-based validation, exception and time-out handling of the validation process for corrupt files) help to build a robust validation module.

 

Visualisation of Flint's core functionality; a format-specific implementation can have domain-specific validation logic at code level (category C) or at configuration level (categories A and B). The emphasis is on a simple workflow from input file to standardised check results that bring everything together.

Policy-focused validation

The core module optionally includes a Schematron-based, policy-focused validator. 'Policy' in this context means a set of low-level requirements in the form of a Schematron XML file that is run against the XML output of other third-party programs. In this way domain-specific validity requirements can be customised and reduced to the essential. For example: does this PDF require fonts that are not embedded?

We make use of Johan’s work for a Schematron check of Apache Preflight outputs, introduced in this blog post. Using Schematron it is possible to check the XML output from tools, filtering and evaluating them based on a set of rules and tests that describe the concrete requirements of *your* organisation on digital preservation.

 

Flint-the-toolbox

Aside from its internal logic, Flint contains internal wrapper code around a variety of third-party libraries and tools to make them easier to use, ensuring any logic to deal with them is in one place only:

* Apache PDFBox

* Apache Tika

* Calibre

* EPUBCheck

* iText - if this library is enabled note that it is AGPL3 licensed

These tools all (a) do something slightly different, or (b) do not have full coverage of the file formats in some respects, or (c) do more than one actually needs. All of these tools relate more or less to the fields of PDF and EPUB validation, as these are the two existing implementations we're working on at the moment.

Format-specific Implementations
  • flint-pdf: validation of PDF files using configurable, Schematron-based validation of Apache Preflight results, plus internal logic and all the tools in the list above, focusing on DRM and well-formedness
  • flint-epub: validation of EPUB files using configurable, Schematron-based validation of EPUBCheck results, plus internal logic and all the tools in the list above, focusing on DRM and well-formedness

NOTE: both implementations are work-in-progress, and should be a good guide for how to implement your own format-validation implementation using Flint.  It would be easy to add a Microsoft Office file format module that looked for DRM etc, for example.

 

 

Visualisation of the Flint ecosystem with different entry points and several format/feature-specific implementations (deep blue: existing ones, baby blue: potentially existing ones); the core, as visualised in Figure 1 connects the different ends of the ‘ecosystem’ with each other

 

how we are using it

Due to the recent introduction of non-print legal deposit, the British Library is preparing to receive large numbers of PDF and EPUB files. Development of this tool has been discussed with operational staff within the British Library, and we aim for it to be used to help determine preservation risks within the received files.

what’s next

Having completed some initial large-scale testing of a previous version of Flint we plan on running more large-scale tests with the most recent version.  We are also interested in the potential of adding additional file format modules; work is underway on some geospatial modules.

help us make it better

It’s all out there (the schematron utils are part of our tools collection at https://github.com/bl-dpt/dptutils, Flint is here: https://github.com/openplanets/flint), please use it, please help us to make it better.

Preservation Topics: Characterisation, Preservation Risks, SCAPE
Categories: Planet DigiPres

How much of the UK's HTML is valid?

Open Planets Foundation Blogs - 2 July 2014 - 12:05pm

I thought OPF members might be interested in this UK Web Archive blog post I wrote on format identification and validation of our historical web archives: How much of the UK's HTML is valid?

Preservation Topics: Identification
Categories: Planet DigiPres

OOXML: The good and the bad

File Formats Blog - 27 June 2014 - 12:05pm

An article by Markus Feilner presents a very critical view of Microsoft’s Open Office XML as it currently stands. There are three versions of OOXML — ECMA, Transitional, and Strict. All of them use the same extensions, and there’s no easy way for the casual user to tell which variant a document is. If a Word document is created on one computer in the Strict format, then edited on another machine with an older version of Word, it may be silently downgraded to Transitional, with resulting loss of metadata or other features.

On the positive side, Microsoft has released the Open XML SDK as open source on Github. This is at least a partial answer to Feilner’s complaint that “there are no free and open source solutions that fully support OOXML.”

Incidentally, I continue to hate Microsoft’s use of the deliberately confusing term “Open XML” for OOXML.

Thanks to @willpdp for tweeting the links referenced here.


Tagged: Microsoft, standards, XML
Categories: Planet DigiPres

SCAPE Demo Day at Statsbiblioteket

Open Planets Foundation Blogs - 27 June 2014 - 8:38am

Statsbiblioteket (The State and University Library, Aarhus, hereafter called SB) welcomed a group of people from The Royal Library, The National Archives, and Danish e-Infrastructure Cooperation on June 25, 2014. They were invited for our SCAPE Demo day where some of SCAPE’s results and tools were presented. Bjarne S. Andersen, Head of IT Technologies, welcomed everybody and then our IT developers presented and demonstrated SB’s SCAPE work.

The day started with a nice introduction to the SCAPE project by Per Møldrup-Dalum, including short presentations of some of the tools which would not be presented in a demo. Among other things, this triggered questions about how to log in to Plato – a preservation planning tool developed in SCAPE.

Per continued with a presentation about Hadoop and its applications. Hadoop is a large and complex technology, and the decision to use it was already made before the project started. This has led to some discussion during the project, but Hadoop has proven really useful for large-scale digital preservation. Hadoop is available both as open source and as commercial distributions. The core concept of Hadoop is the MapReduce algorithm, which was presented in the paper "MapReduce: Simplified Data Processing on Large Clusters" in 2004 by Jeffrey Dean and Sanjay Ghemawat. This paper prompted Cutting and Cafarella to implement Hadoop, and they published their system under an open source license. Writing jobs for Hadoop has traditionally been done using the Java programming language, but in recent years several alternatives to Java have been introduced, e.g. Pig Latin and Hive. Other interesting elements in a Hadoop cluster are HBase, Mahout, Giraph, ZooKeeper and a lot more. At SB we use an Isilon Scale-Out NAS storage cluster, which enables us to run a lot of different experiments on the four 96GB RAM CPU nodes, each with a 2 Gbit Ethernet interface. This setup potentially makes the complete online storage of SB reachable for the Hadoop cluster.

                                            Sometimes it is hard to fit an elephant in a library

 

Bolette A. Jurik was next in line and told the story of how Statsbiblioteket wanted to migrate audio files using Hadoop (and Taverna… and xcorrSound Waveform Compare). The files were supposed to be migrated from mp3 to wav. Checking this collection in Plato gave us the result 'Do nothing' – meaning leave the files as mp3. But we still wanted to perform the experiment – to test that we have the tools to migrate, extract and compare properties, validate the file format and compare the content of the mp3 and wav files, and that we can create a scalable workflow for this. We did not have a tool for the content comparison, so we had to develop one, xcorrSound waveform-compare. The output shows which files need special attention – as an example, one of the files failed the waveform comparison although it looked right. This was due to a lack of content in some parts of the file, so Waveform Compare had no sound to compare! Bolette also asked her colleagues to create "migrated" sound files with problems that the tool would not find – read more about this small competition in this blog post.

Then Per was up for yet another presentation – this time describing the experiment: identification and feature extraction of web archive data based on Nanite. The test was to extract different kinds of metadata (like authors, GPS coordinates for photographs, etc.) using Apache Tika, DROID (and libmagic). The experiment was run on the Danish Netarchive (archiving of the Danish web – a task undertaken by The Royal Library and SB together). For the live demo a small job with only three ARC files was used – processing all of the 80,000 files in the original experiment would have taken 30 hours. Hadoop generates loads of technical metadata that enable us to analyse such jobs in detail after the execution. Per's presentation was basically a quick review of what is described in the blog post A Weekend With Nanite.

An analysis of the original Nanite experiment was done live in Mathematica, presenting a lot of fun facts and interesting artefacts. For one thing, we counted the number of unique MIME types in the 80,000 ARC files, or 260,603,467 individual documents:

  • 1384 different MIME types were reported by the HTTP server at harvest time,
  • DROID counted 319 MIME types,
  • Tika counted 342 MIME types.

A really weird artefact was that approx. 8% of the identification tasks were complete before they started! The only conclusion to this is that we’re experiencing some kind of temporal shift that would also explain the great performance of our cluster…

Two years ago SB concluded a job that had run for 15 months. 15 months of FITS characterising 12TB of web archive data. The experiment with Nanite characterised 8TB in 30 hours. Overall this extreme shift in performance is due to our involvement in the SCAPE project.

After sandwiches and a quick tour of the library tower, Asger Askov Blekinge took over to talk about integrating the Fedora-based DOMS repository with Hadoop. He described Bitmagasinet (SB's data repository) and DOMS (SB's Digital Object Management System based on Fedora) and how our repository is integrated with Hadoop.

SB is right now working on a very large project to digitize 32 million pages of newspapers. The digitized files are delivered in batches and we run Hadoop map/reduce jobs on each batch to do quality assurance. An example is to run Jpylyzer on a batch (Map runs Jpylyzer on each file, Reduce stores the results back in DOMS). The SCAPE way to do it includes three steps:

  • Staging – retrieves records
  • Hadooping – reads, works and writes new updated records
  •  Loading  - stores updated records in DOMS

The SCAPE data model is mapped to the newspapers in the following way:

                                           SCAPE Data Model mapped with newspapers

The SCAPE Stager/Loader creates a sequence file, which can then be read and each record updated by Hadoop; after that the records are stored in DOMS.

The last demo was presented by Rune Bruun Ferneke-Nielsen. He described the policy-driven validation of JPEG 2000 files, based on Jpylyzer and performed on SB's newspaper digitization project. The newspapers are scanned from microfilm by a company called Ninestars and then quality assured by SB's own IT department. We need to make sure that the content conforms to the corresponding file format specifications and that the file format profile conforms to our institutional policies.

                                                     

530,000 image files have been processed within approx. five hours.

We want to be able to receive 50,000 newspaper files per day, and this is more than one server can handle. All access to data for quality assurance etc. is done via Hadoop. Ninestars run a quality assurance check before they send the files back to SB, and then the files are QA'ed again in-house.

                                                               Fuel for the afternoon (Photo by Per Møldrup-Dalum)

One of the visitors at the demo is working at The Royal Library with the NetArchive and would like to make some crawl log analyses. These could perhaps be processed by using Hadoop - this is definitely worth discussing after today to see if our two libraries can work together on this.

All in all this was a very good day, and the audience learned a lot about SCAPE and the benefits of the different workflows and tools. We hope they will return for further discussion on how they can best use SCAPE products at their own institutions.

Preservation Topics: SCAPE
Attachments: ElephantOnSB.png (662 KB); DataModel.png (70.11 KB); NewspaperQA.png (160.83 KB); Fuel for the afternoon (Photo by Per Møldrup-Dalum) (45.11 KB)
Categories: Planet DigiPres

Bulk disk imaging and disk-format identification with KryoFlux

Open Planets Foundation Blogs - 26 June 2014 - 3:15pm
The problem

We have a large volume of content on floppy disks that we know are degrading but which we don't know the value of.

Considerations
  1. We don't want to waste time/resources on low-value content.
  2. We don't know the value of the content.
  3. We want to be able to back up the content on the disks to ensure it doesn't degrade any more than it already has.
  4. Using unskilled students to do the work is cost-effective.
  5. Unskilled students have often never seen "floppy" disks, let alone be able to distinguish between different formats of floppy disk. So we need a solution that doesn't require them to differentiate (e.g. between Apple formats, PC formats, Amiga, etc.).
Solution
  1. Make KryoFlux stream files using the KryoFlux hardware and software.
  2. Use the KryoFlux software to create every variant of disk image from those streams.
  3. Use the mount program on Linux to mount each disk image using each variant of file system parameter.
  4. Keep the disk images that can be mounted in Linux (as that ability implies that they are the right format).

Very rough beginnings of a program to perform the automatic format identification using the KryoFlux software and Mount are available here.
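The core of steps 3 and 4 can be sketched roughly as follows (the candidate file system list, image naming and paths are assumptions; the rough program linked above does considerably more):

```bash
#!/usr/bin/env bash
# For every disk image produced from a KryoFlux stream, try mounting it with a
# set of candidate file systems; keep the images that mount successfully.
# File system list, paths and naming are illustrative only.
mkdir -p /mnt/floppytest kept

for img in images/*.img; do
    for fstype in vfat msdos hfs; do
        if sudo mount -o loop,ro -t "$fstype" "$img" /mnt/floppytest 2>/dev/null; then
            echo "$img mounts as $fstype, keeping it"
            sudo umount /mnt/floppytest
            cp "$img" kept/
            break
        fi
    done
done
```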


Issues with the solution
  1. When you use the KryoFlux to create raw stream files it only seems to do one pass of each sector, whereas when you specify the format it will try to re-read sectors that it identifies as "bad sectors" in the first pass. This can lead to it successfully reading those sectors when it otherwise wouldn't, so using the KryoFlux stream files may not preserve as much content as you would get by specifying the format of the disk before beginning the imaging process. I'm trying to find out whether using "multiple" in the output options of the KryoFlux software might help with this.
  2. Mount doesn't mount all file systems - though as this improves in the future the process could be re-run.
  3. Mount can give false positives.
  4. I don't know whether there is a difference between disk images created with the KryoFlux using many of the optional parameters and those created using the defaults. For example, there doesn't appear to be a difference in mountability between disk images created when the number of sides is specified and those created when it is not and defaults to both sides (for e.g. MFM images the results of both seem to mount successfully).
  5. Keeping the raw streams is costly. A disk image for a 1.44 MB floppy is ~1.44 MB. The stream files are in the tens of MBs.
Other observations:
  1. It might be worth developing signatures for use in e.g. DROID to identify the format of the stream files directly in the future. Some emulators, for example, can already interact directly with the stream files, I believe.
  2. The stream files might provide a way of overcoming bad-sector-based copy protection (e.g. the copy protection used in Lotus 1-2-3 and Lotus Jazz) by enabling the use of raw stream files (which, I believe, contain the "bad" sectors as well as the good) in emulators.


Thoughts/feedback appreciated

Preservation Topics: Identification, Preservation Risks, Bit rot, Tools
Categories: Planet DigiPres

Will the real lazy pig please scale up: quality assured large scale image migration

Open Planets Foundation Blogs - 24 June 2014 - 9:12am

Authors: Martin Schaller, Sven Schlarb, and Kristin Dill

In the SCAPE Project, the memory institutions are working on practical application scenarios for the tools and solutions developed within the project. One of these application scenarios is the migration of a large image collection from one format to another.

There are many reasons why such a scenario may be of relevance in a digital library. On the one hand, conversion from an uncompressed to a compressed file format can significantly decrease storage costs. On the other hand, particularly from a long-term perspective, file formats may be in danger of becoming obsolete, which means that institutions must be able to undo the conversion and return to the original file format. In this case a quality assured process is essential to allow for reconstruction of the original file instances and especially to determine when deletion of original uncompressed files is needed – this is the only way to realize the advantage of reducing storage costs. Based on these assumptions we have developed the following use case: Uncompressed TIFF image files are converted into compressed JPEG2000 files; the quality of the converted file is assured by applying a pixel for pixel comparison between the original and the converted image.

For this, a sequential Taverna concept workflow was first developed, which was then modelled into a scalable procedure using different tools developed in the SCAPE Project.

The Taverna Concept Workflow

The workflow input is a text file containing paths to the TIFF files to be converted. This text file is then transformed into a list that allows the sequential conversion of each file, hence simulating a non-scalable process. Before the actual migration commences, the validity of each TIFF file is checked. This step is realized using FITS – a wrapper that applies different tools to extract the identification information of a file. Since the output of FITS is an XML-based validation report, an XPath service extracts and checks the validity information. If the file is valid, migration from TIFF to JPEG2000 can begin. The tool used in this step is OpenJPEG 2.0. In order to verify the output, Jpylyzer – a validator as well as feature extractor for JPEG2000 images created within the SCAPE Project – is employed. Again, an XPath service is used to extract the validity information. This step concludes the file format conversion itself, but in order to ensure that the migrated file is indeed a valid surrogate, the file is reconverted into a TIFF file, again using OpenJPEG 2.0. Finally, in a last step the reconverted and the original TIFF files are compared pixel for pixel using ImageMagick. Only through the successful execution of this final step can the validity as well as the possibility of a complete reconversion be assured.

Figure 1 (above): Taverna concept workflow
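
To make the individual steps more concrete, the following Python sketch shows what the per-file part of the workflow boils down to when expressed as plain command-line calls. The function names and the element names checked in the XML reports are mine, the tool options are illustrative rather than exact, and the real concept workflow wires these steps together as Taverna services and XPath components rather than a single script.

    import subprocess
    import xml.etree.ElementTree as ET

    def report_says_valid(xml_report, element_name):
        """Stand-in for the XPath service: pull a validity flag out of an
        XML report (namespace handling omitted for brevity)."""
        for elem in ET.fromstring(xml_report).iter():
            if elem.tag.endswith(element_name):
                return (elem.text or "").strip().lower() in ("true", "yes")
        return False

    def migrate_and_verify(tiff_path):
        jp2_path = tiff_path + ".jp2"
        roundtrip_path = tiff_path + ".roundtrip.tif"

        # 1. Check the source TIFF with FITS (XML report on stdout).
        fits_xml = subprocess.check_output(["fits.sh", "-i", tiff_path])
        if not report_says_valid(fits_xml, "well-formed"):
            return False

        # 2. Migrate TIFF -> JPEG2000 with OpenJPEG 2.0.
        subprocess.check_call(["opj_compress", "-i", tiff_path, "-o", jp2_path])

        # 3. Validate the JPEG2000 with jpylyzer.
        jpylyzer_xml = subprocess.check_output(["jpylyzer", jp2_path])
        if not report_says_valid(jpylyzer_xml, "isValidJP2"):
            return False

        # 4. Reconvert JPEG2000 -> TIFF with OpenJPEG 2.0.
        subprocess.check_call(["opj_decompress", "-i", jp2_path, "-o", roundtrip_path])

        # 5. Pixel-for-pixel comparison with ImageMagick; the AE metric is the
        #    number of differing pixels and is printed to stderr, so only a
        #    result of "0" counts as a fully reversible migration.
        result = subprocess.run(["compare", "-metric", "AE",
                                 tiff_path, roundtrip_path, "null:"],
                                capture_output=True, text=True)
        return result.stderr.strip() == "0"

This per-file unit of work is what the timings in Figure 2 break down.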

In order to identify how much time was consumed by each element of this workflow, we ran a test consisting of the migration of 1,000 files. Executing the described workflow on the 1,000 image files took about 13 hours and five minutes. Rather unsurprisingly, conversion and reconversion of the files took the longest: the conversion to JPEG2000 took 313 minutes and the reconversion 322 minutes. FITS validation needed 70 minutes and the pixel-wise comparison was finished in 62 minutes. The SCAPE-developed tool Jpylyzer required only 18 minutes and was thus much faster than the above-mentioned steps; together these five steps account for the full 785 minutes of wallclock time.

Figure 2 (above): execution times of each of the concept workflow's steps

Making the Workflow Scale

The foundation for the scalability of the described use case is a Hadoop cluster containing five Data Nodes and one Name Node (specification: see below). Besides its economic advantages – Hadoop runs on commodity hardware – it is also designed to tolerate failure, which reduces the problems associated with hardware crashes.

The distribution of tasks across the cores is implemented via MapReduce jobs. The Map phase splits the handling of the input: for example, if a large text file is to be processed, it is divided into several parts, and each part is processed on a different node. The Reduce phase then aggregates the outputs of the processing nodes back into a single result.
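
The pattern is easier to see in miniature. The toy Python sketch below splits the input file list into chunks, processes each chunk independently (the "map" side) and then combines the per-chunk counts into one summary (the "reduce" side). It uses local processes purely as an illustration and reuses the migrate_and_verify function from the sketch above – it is not how MapReduce jobs are written for the cluster.

    from multiprocessing import Pool

    def map_chunk(paths):
        """'Map' side: process one chunk of the file list and return
        (succeeded, failed) counts; migrate_and_verify is the per-file
        function from the earlier sketch."""
        ok = failed = 0
        for path in paths:
            if migrate_and_verify(path):
                ok += 1
            else:
                failed += 1
        return ok, failed

    def reduce_results(partials):
        """'Reduce' side: aggregate per-chunk outputs into a single result."""
        return (sum(p[0] for p in partials), sum(p[1] for p in partials))

    if __name__ == "__main__":
        with open("tiff_list.txt") as f:          # illustrative input list
            paths = [line.strip() for line in f if line.strip()]
        chunks = [paths[i:i + 100] for i in range(0, len(paths), 100)]
        with Pool(processes=4) as pool:
            partials = pool.map(map_chunk, chunks)
        print(reduce_results(partials))

On the cluster, HDFS and the MapReduce framework take care of the splitting, the distribution of the chunks across the worker nodes and the aggregation of the results – which is exactly the part that the next paragraph hands over to Apache Pig.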

But writing MapReduce jobs is a complex matter. For this reason, Apache Pig is used. Pig was built for Hadoop and translates a set of commands in a language called “Pig Latin” into MapReduce jobs, thus making the handling of MapReduce jobs much easier or, as Professor Jimmy Lin described the powerful tool during the ‘Hadoop-driven digital preservation Hackathon’ in Vienna, easy enough “… for lazy pigs aiming for hassle-free MapReduce.”

Hadoop HDFS, Hadoop MapReduce and Apache Pig make up the foundation for scalability, on which the SCAPE tools ToMaR and the XPath Service build. ToMaR wraps command-line tasks for parallel execution as Hadoop MapReduce jobs – in our case the execution of FITS, OpenJPEG 2.0, Jpylyzer and ImageMagick. As a result, these tools can be executed simultaneously on several nodes, which has a great impact on execution times, as Figure 3 (below) shows.

The blue line represents the non-scalable Taverna workflow. It is clearly observable how the time needed for file migration increases in proportion to the number of files that are converted. The scalable workflow, represented by the red line, shows a much smaller increase in time needed, thus suggesting that scalability has been achieved. This means that, by choosing the appropriate size for the cluster, it is possible to migrate a certain number of image files within a given time frame.

Figure 3 (above): Wallclock times of concept workflow and scalable workflow

Below is the specification of the Hadoop cluster: the master node runs the jobtracker and namenode/secondary namenode daemons, and each worker node runs a tasktracker and a datanode daemon.

Master node: Dell Poweredge R510

  • CPU: 2 x Xeon E5620@2.40GHz
  • Quadcore CPU (16 HyperThreading cores)
  • RAM: 24GB
  • NIC: 2 x GBit Ethernet (1 used)
  • DISK: 3 x 1TB DISKs; configured as RAID5 (redundancy); 2TB effective disk space

Worker nodes: Dell Poweredge R310

  • CPU: 1 x Xeon X3440@2.53GHz
  • Quadcore CPU (8 HyperThreading cores)
  • RAM: 16GB
  • NIC: 2 x GBit Ethernet (1 used)
  • DISK: 2 x 1TB DISKs; configured as RAID0 (performance); 2TB effective disk space

However, the throughput we can reach using this cluster and Pig/Hadoop job configuration is limited: as Figure 4 shows, the throughput (measured in Gigabytes per hour – GB/h) grows rapidly as the number of files being processed is increased, and then stabilises at slightly more than 90 GB/h when processing more than 750 image files.

Figure 4 (above): Throughput of the distributed execution measured in Gigabytes per hour (GB/h) against the number of files processed

As our use case shows, by using a variety of tools developed in the SCAPE Project together with the Hadoop framework, it is possible to distribute the processing across several machines, making large-scale image migration scalable and significantly reducing the time needed for data processing. In addition, the size of the cluster can be tailored to fit the size of the job so that it can be completed within a given time frame.

Apart from the authors of this blog post, the following SCAPE Project partners contributed to this experiment:

  • Alan Akbik, Technical University of Berlin
  • Matthias Rella, Austrian Institute of Technology
  • Rainer Schmidt, Austrian Institute of Technology
Preservation Topics: Migration, SCAPE, jpylyzer
Categories: Planet DigiPres

Library of Congress format recommendations

File Formats Blog - 23 June 2014 - 8:39pm

The Library of Congress has issued a set of recommendations for formats for both physical and digital documents. The LoC’s digital preservation blog has an interview with Ted Westervelt of the LoC on their development. They’re not just for the library’s own staff, he explains, but for “all stakeholders in the creative process.”

The guidelines repeatedly state: “Files must contain no measures that control access to or use of the digital work (such as digital rights management or encryption).” That’s pushback that can’t be ignored. In some cases, though, the message is mixed. For theatrically released films, standard or recordable Blu-Ray is accepted, but the boilerplate against DRM is included. I don’t know where they expect to get DRM-free Blu-Ray, but DRM-free options are few when it comes to big-name movies.

It’s also interesting that software, specifically games and learning materials, is included. This has been a growing area of interest in recent years. Rather than relying on emulation, the recommendations call for source code, documentation, and a specification of the exact compiler used to build the application.

There’s material here to fuel constructive debate and expansion for years.


Tagged: libraries, preservation, standards
Categories: Planet DigiPres

Interview with a SCAPEr - Leïla Medjkoune

Open Planets Foundation Blogs - 20 June 2014 - 11:59am
Who are you?

My name is Leïla Medjkoune and I am responsible for the Web Archiving projects and activities at Internet Memory.

Tell us a bit about your role in SCAPE and what SCAPE work you are involved in right now?

My involvement in SCAPE is twofold. I work as a project manager, following the project and ensuring that Internet Memory, as a partner, fulfils the project plan. I am also involved as a functional expert, representing web archivists’ needs. I therefore work within several areas of the project, such as Quality Assurance and the Web Testbed work, where I contribute to the development of tools and workflows related to web archiving.

Why is your organisation involved in SCAPE?

Since its creation in 2004 Internet Memory has actively participated in improving the preservation of the Internet. It supports cultural institutions involved in web archiving projects through its large-scale shared platform, builds its own web archive, and develops innovative methods and tools aimed at web archiving and large-scale preservation challenges – such as its own crawler, MemoryBot – both internally and through participation in EU-funded research projects. As part of SCAPE, Internet Memory wants to test, develop and hopefully implement within its infrastructure preservation tools and methods, including an automated visual quality tool applied to web archives.

What are the biggest challenges in SCAPE as you see it?

SCAPE is a very interesting project with a quite complex organisation, because we are looking at a broad range of tools and methods that try to tackle a variety of preservation issues. Beyond the organisational aspects, one of the biggest challenges is, as stated in the acronym, to address the scalability issues currently faced by most archives and libraries. This is even more critical for web archives, as the amount of heterogeneous content to preserve and provide access to is constantly growing. Another challenge will be to disseminate SCAPE's outcomes so that they reach the preservation community and are used within libraries, archives and preservation institutions in general.

What do you think will be the most valuable outcome of SCAPE?

Like most web archives, we want to implement robust automated tools within our infrastructure that would not only facilitate operations but also reduce costs. Improving characterisation tools so that they scale, and developing QA tools designed for web archives such as the Pagelyzer, are the most useful outcomes from our perspective. We are also strongly involved in the SCAPE Platform work and believe this platform is a useful example of how several preservation tools and systems can be integrated within a single infrastructure.

Contact information:

Leïla Medjkoune

leila.medjkoune@internetmemory.net

Preservation Topics: SCAPE
Categories: Planet DigiPres

Update on JHOVE

File Formats Blog - 18 June 2014 - 12:50pm

I’ve updated the UTF-8 module in the JHOVE source on Github to include the new code blocks for Unicode 7.0.0. Also, I’ve recently fixed the pom.xml file so it will put both the command line and the GUI JAR files into the local repository.

I need more input before I’m comfortable with creating a release 1.12 of JHOVE. I don’t have any prior experience with creating a public, open-source project that’s built with Maven, and I don’t know how much of the baggage of the SourceForge project really needs to be kept. There are some specialty JARs in the old project, but I don’t know if anyone uses them. Most importantly, there still needs to be a distribution in Zip and Tar formats. New features would be interesting, but the first thing is to make a JHOVE that is as useful as it was before.

Comments, suggestions, and code contributions are welcome, as always.


Tagged: JHOVE, preservation, software, Unicode
Categories: Planet DigiPres