Blogs

A spot of Gardening: Weeding the Open Planets Foundation Format Corpus

Just as a garden needs watering, it sometimes needs a little weeding too. I think that's where we've been recently with the Open Planets Foundation Format Corpus on GitHub. In this blog I describe how I've remixed it so that it can be used more flexibly in future, hopefully leaving it in a position to be forked and consumed again by the wider digital preservation community.

SCAPE QA Tool: Technologies behind Pagelyzer - II Web Page Segmentation

Web pages are more complex than ever, which makes it difficult to identify their different elements, such as main content, menus, user comments and advertising. Web page segmentation is the process of dividing a web page into visually and semantically coherent units called blocks or segments.

SCAPE QA Tool: Technologies behind Pagelyzer - I Support Vector Machine

EDRMS across New Zealand’s Government – Challenges with even the most managed of records management systems!

Sarah McKenzie, a student completing a summer scholarship project with Victoria University and Archives New Zealand, blogs for the OPF on the work she is currently doing. Delving into the world of Electronic Document and Records Management Systems (EDRMS) and the challenges of technical metadata extraction, she describes how the task is as much about understanding the range of EDRMS in use across government as it is about connecting the tools in the digital preservation toolkit to that range of systems. Sarah talks about how she went about that research, the technical work completed so far, and her goals for the remaining few weeks of the project.

Why can't we have digital preservation tools that just work?

One of my first blogs here covered an evaluation of a number of format identification tools. One of the more surprising results of that work was that, of the five tools tested, no fewer than four (FITS, DROID, Fido and JHOVE2) failed to even run when executed with their associated launcher script. In many cases the Windows launcher scripts (batch files) only worked when executed from the installation folder. Apart from making things unnecessarily difficult for the user, this also completely flies in the face of all existing conventions on command-line interface design. Around the time of this work (summer 2011) I had been in contact with the developers of all the evaluated tools, and until last week I thought those issues were a thing of the past. Well, was I wrong!
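
The working-directory problem described above can usually be avoided by having the launcher work out where it is installed, rather than assuming the user has changed into that folder first. What follows is only a minimal sketch of that idea in Python, not the launcher of any of the tools mentioned: the function names and the `tool.jar` file name are hypothetical, and it assumes a tool packaged as a single JAR sitting next to its launcher.

```python
import os


def resolve_install_dir(script_path):
    """Return the absolute folder that contains the launcher script.

    realpath() makes the path absolute and follows symlinks, so the
    result does not depend on the caller's current working directory.
    """
    return os.path.dirname(os.path.realpath(script_path))


def build_java_command(script_path, jar_name, args):
    """Build a command line that locates the JAR relative to the launcher.

    jar_name is a hypothetical file name used for illustration only.
    """
    install_dir = resolve_install_dir(script_path)
    jar_path = os.path.join(install_dir, jar_name)
    return ["java", "-jar", jar_path] + list(args)


# Typical use inside a launcher script (assumed layout: tool.jar next to it):
#   import subprocess, sys
#   subprocess.call(build_java_command(__file__, "tool.jar", sys.argv[1:]))
```

Because the JAR path is derived from the launcher's own location, the resulting command is the same whichever folder the user happens to be in when they run it.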