Measuring Bigfoot
My previous blog post, "Assessing file format risks: searching for Bigfoot?", prompted some interesting feedback from a number of people. There was a particularly elaborate response from Ross Spencer, and I originally wanted to reply to it directly using the comment fields.
FIDO News
Here’s a short news bulletin about FIDO, OPF’s open source file format identification tool.
It seems that use of FIDO has grown over the last few months. I am getting responses by e-mail and through the GitHub issue tracker from all over the world, ranging from requests for help to suggestions for improvement and even some bug fixes. Thanks, and please keep them coming!
Identification of PDF preservation risks: the sequel
SCAPE Planning and Watch: Two years and a bit more
SCAPE & OPF Hackathon: Hadoop-driven digital preservation
The SCAPE Project and OPF are running a hackathon for developers and practitioners, focussing on Hadoop, an open source software framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Hadoop is designed to scale out from single servers to thousands of machines.
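The "simple programming model" behind Hadoop is MapReduce: a map step that emits key–value pairs from each input record, and a reduce step that aggregates all values sharing a key. As an illustration only, here is a minimal pure-Python word count that mimics that model locally; it is not the Hadoop Java API, just a sketch of the idea participants will work with.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Shuffle + reduce: group pairs by key and sum the counts per key."""
    grouped = defaultdict(int)
    for key, value in pairs:
        grouped[key] += value
    return dict(grouped)

docs = ["Hadoop scales out", "Hadoop runs on clusters of machines"]
counts = reduce_phase(map_phase(docs))
print(counts["hadoop"])  # "hadoop" appears once in each of the two documents
```

In a real Hadoop job the map and reduce functions run in parallel across the cluster, and the framework handles the shuffle (grouping by key) between them.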
Jimmy Lin from the University of Maryland will be our guest speaker at the event. Jimmy has been working with Big Data and Hadoop for many years, with a focus on natural language processing and information retrieval. He spent an extended sabbatical at Twitter from 2010 to 2012 working on large-scale analytics, on which he provides valuable insights in his 2013 Hadoop Summit EU talk "Big Data Mining Infrastructure: The Twitter Experience" (http://www.youtube.com/watch?v=T5ZjSFnOxys). He has a book out on MapReduce (http://lintool.github.io/MapReduceAlgorithms/) and is currently working on a scalable rendering engine for web archives based on HBase.
Scenarios
We will be working with two digital preservation scenarios:
- Web Archiving: File Format Identification/Characterisation
- Digital Books: Quality Assurance, text mining (OCR Quality)
Alternatively, if you have something else you would like to work on using Hadoop, just let us know; we are keen to hear your ideas.
*Competition*
Practitioners and developers will work together in groups to address digital preservation challenges using Hadoop. Practitioners will take the role of issue champion, articulating their requirements to the developers and documenting them on the wiki. Developers will brainstorm ideas and work on solutions to the issues. There will be regular check-in points to get feedback and refine requirements, and there will be a prize for the best issue champion and the best development solution.
All participants will gain practical experience of using digital preservation tools in characterisation and quality assurance processes. We will provide step-by-step worksheets for those who are less familiar with using the command line, and our experts will be on hand to help you through them.
There will be plenty of opportunities for discussion. We have a session for sharing experiences of implementing Hadoop at your organisation, research project reports, and a break-out space for lightning talks. We welcome suggestions for talks or discussions you would like to hear about.
Agenda
The draft agenda can be seen at: http://wiki.opf-labs.org/display/SP/Agenda+-+Hadoop+Driven+Digital+Preservation
Who should attend?
Practitioners (digital librarians and archivists, digital curators, repository managers, or anyone responsible for managing digital collections): you will learn how Hadoop might fit your organisation and how to write requirements to guide development, and you will gain some hands-on experience using the tools yourself and finding out how they work. To get the most out of this training course you will ideally have some knowledge or experience of digital preservation.
Developers of all experience levels can participate, whether you are writing your first Hadoop jobs or working on scalable solutions for the issues identified in the scenarios.
Registration
Please register here: https://hadoop-driven-digital-preservation.eventbrite.co.uk
OPF members are invited to attend free of charge. Please use the code issued by email to waive the fee.
Non-members are welcome to attend at a cost of €200. Morning and afternoon coffee breaks and lunch will be provided and are included in the registration fee.
*Early bird rate*: register before 25 October to get 10% off.
Registration will close on Monday 25 November.
For information about travel and accommodation please visit the event wiki page: http://wiki.opf-labs.org/pages/viewpage.action?pageId=32604217.
Open Research Challenges in Digital Preservation: Call for contributions!
Following the community response to our workshop last year, we want to invite you again to contribute your future preservation challenge!
EPUB for archival preservation: an update
Last year (2012) the KB released a report on the suitability of the EPUB format for archival preservation. A substantial number of EPUB-related developments have happened since then, and as a result some of the report’s findings and conclusions have become outdated.
Software Archiving for EaaS
The typical digital artefact or complex object does not function (render, execute, …) without a certain software environment. Emulation-as-a-Service (EaaS) provides original environments running in platform emulators. Depending on the (complex) object to be handled, several software components are required to reproduce an original environment.
What do we mean by "embedded" files in PDF?
The most important new feature of the recently released PDF/A-3 standard is that, unlike PDF/A-1 and PDF/A-2, it allows you to embed any file you like. Whether this is a good thing or not is the subject of some heated online discussions. But what do we actually mean by embedded files?
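For orientation: in the PDF object model, embedded files are registered in the document catalog's /Names dictionary under the /EmbeddedFiles name tree, each pointing to a file specification (/Filespec) dictionary, and PDF/A-3 additionally uses the /AF (associated files) key. As a rough, hedged illustration, the sketch below scans raw PDF bytes for these names; a real check would parse the object tree with a proper PDF library, since names can be hex-escaped or hidden inside compressed object streams.

```python
def embedded_file_flags(pdf_bytes: bytes) -> dict:
    """Crude byte-level heuristic for embedded-file markers in a PDF.

    Illustrative only: does not parse PDF object structure, so it can
    miss names in compressed object streams or match false positives.
    """
    return {
        "EmbeddedFiles": b"/EmbeddedFiles" in pdf_bytes,  # catalog name tree
        "Filespec": b"/Filespec" in pdf_bytes,            # file specification dicts
        "AF": b"/AF " in pdf_bytes,                       # PDF/A-3 associated files
    }

# Hypothetical minimal fragment with an /EmbeddedFiles entry in the catalog
sample = b"%PDF-1.7\n1 0 obj\n<< /Names << /EmbeddedFiles 2 0 R >> >>\nendobj\n"
print(embedded_file_flags(sample))
```

The distinction matters for the discussion above: a file referenced via /EmbeddedFiles is carried inside the PDF itself, which is precisely what PDF/A-3 now permits for arbitrary payloads.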
Identification of PDF preservation risks with Apache Preflight: a first impression
The PDF format contains various features that may make it difficult to access content that is stored in this format in the long term. Examples include (but are not limited to):