Several of us at The British Library took part in the CURATEcamp file id hackathon on Friday.
We decided that one area where we could make a useful impact was the identification of various ebook formats. eBooks are an important content type for the British Library, especially with the expected implementation of non-print legal deposit legislation next year. For a long list of formats, see http://wiki.mobileread.com/wiki/E-book_formats
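To give a flavour of what identifying an ebook format involves, here is a minimal sketch for one format only, EPUB. It relies on the EPUB container rule that the file is a ZIP archive whose first entry is named `mimetype` and contains `application/epub+zip`. The function names `looks_like_epub` and `make_sample_epub` are my own for illustration; this is not the signature work done at the hackathon.

```python
import io
import zipfile

EPUB_MIMETYPE = b"application/epub+zip"


def looks_like_epub(data: bytes) -> bool:
    """Return True if the bytes look like an EPUB container (illustrative check only)."""
    # Every ZIP local file header starts with this magic number.
    if not data.startswith(b"PK\x03\x04"):
        return False
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = archive.namelist()
            # The container rule: the first entry must be called "mimetype".
            if not names or names[0] != "mimetype":
                return False
            # strip() makes the check tolerant of a stray trailing newline.
            return archive.read("mimetype").strip() == EPUB_MIMETYPE
    except zipfile.BadZipFile:
        return False


def make_sample_epub() -> bytes:
    """Build a tiny in-memory archive that satisfies the check above (test data, not a valid book)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", EPUB_MIMETYPE, compress_type=zipfile.ZIP_STORED)
        archive.writestr("META-INF/container.xml", "<container/>")
    return buffer.getvalue()
```

A real signature for a tool such as DROID or Fido would express the same idea as byte sequences at fixed offsets rather than by opening the archive, and other ebook formats would each need their own rule.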
Over the last few months, I have been researching the problem of large-scale content profiling for preservation analysis. I do this for a number of reasons. For one, I support the opinion that formats are just another property. The format is undoubtedly a very important property, but knowing which formats you have is not sufficient for good preservation planning and actions.
I'm not an astronomer but if I were I'd probably get excited watching the birth of a star. What I do get excited about is being around to watch the creation and evolution of a digital preservation problem right here and now.
In the context of digital information, many curation tasks need to be performed to ensure continuous access to information. As digital assets grow in size and number, tools must be deployed to ease the execution of common digital preservation tasks and thereby make the whole digital preservation process more manageable.
In the context of the SCAPE project, we have recently been running a series of experiments on identifying the content files held in ARC.GZ web archive containers. Why? Because you will presumably want to know which file formats your archive containers hold and how many files there are of each type.
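The "how many per type" part of that question can be sketched in a few lines. This is only an illustration under loud assumptions: the record payloads are taken to have been extracted from the container already, and the tiny signature table and the names `identify` and `tally_formats` are hypothetical stand-ins for the real identification tools used in the experiments.

```python
from collections import Counter

# A deliberately tiny, illustrative signature table.
# Real identification tools draw on far larger signature registries.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
]


def identify(payload: bytes) -> str:
    """Return a MIME type based on the leading bytes, or a generic fallback."""
    for magic, mime_type in SIGNATURES:
        if payload.startswith(magic):
            return mime_type
    return "application/octet-stream"


def tally_formats(payloads) -> Counter:
    """Count how many payloads were identified as each format."""
    return Counter(identify(payload) for payload in payloads)
```

Calling `tally_formats(payloads).most_common()` then gives the formats ordered by how often they occur in the container.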
On Monday I was asked to speak at an experts workshop aimed at steering developments in preservation services on the Reponet+ Project (part of JISC Innovation Zone).
I primarily wanted to get an understanding of SPARQL queries and how they can be used to query linked data. As a focus for my work, I set myself a challenge to get Fido working using signatures from the UDFR registry.
One of the biggest initial challenges to digital preservation is file format identification. While there has been a lot of work in this area, the ever-changing nature of digital formats realistically means the problem will never be "solved". This first SCAPE training course will give you the knowledge and experience to confidently choose file format identification and characterisation tools, which have been developed or extended during the SCAPE project.
I've already written a number of blog posts on format validation of JP2 files. Format validation is only one aspect of a quality assessment workflow. Digitisation guidelines typically impose various constraints on the technical characteristics of preservation and access images. For example, they may state that a preservation master must be losslessly compressed, and that its progression order must be RPCL. A format profile is a set of such technical constraints.
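To make the idea of a format profile concrete, here is a minimal sketch that treats a profile as a dictionary of required values and checks an image's extracted properties against it. The property names (`lossless`, `progression_order`) and the function `check_profile` are assumptions made for illustration; they do not correspond to the output of any particular JP2 tool.

```python
# Hypothetical profile for a preservation master, based on the two example
# constraints mentioned above: lossless compression and RPCL progression order.
PRESERVATION_MASTER_PROFILE = {
    "lossless": True,
    "progression_order": "RPCL",
}


def check_profile(properties, profile):
    """Return a list of human-readable failures; an empty list means the image conforms."""
    failures = []
    for name, expected in profile.items():
        # A missing property yields None and is reported as a failure.
        actual = properties.get(name)
        if actual != expected:
            failures.append(f"{name}: expected {expected!r}, found {actual!r}")
    return failures
```

An image whose properties are `{"lossless": True, "progression_order": "LRCP"}` would fail with a single message about the progression order, while one that matches both constraints returns an empty list.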