SCAPE

Quality Assurance for Digital Book Collections

Automatically identifying corresponding images and removing duplicates is challenging because the quality of digitized book collections is inconsistent. Validating such collections against qualitative criteria is also demanding, owing to the sheer amount of data that has to be processed. Traditional approaches seem to have peaked at a certain level.

SPRUCE Mashup: Batch File Identification using Apache Tika

My last post discussed the benefits of collaboration, centred around a SCAPE hackathon. I argued that, in general, it was the collaborative, collocated nature of the developers working together that made demo development quicker; more people staring at the same problem results in multiple and varied viewpoints, ideas, and solutions.

Identification tools, an evaluation

We have created a testing framework based on the Govdocs1 digital corpora (http://digitalcorpora.org/corpora/files) and are using the characterisation results from Forensic Innovations, Inc. (http://www.forensicinnovations.com/) as ground truth.

We have tested Tika 1.0, Fido 0.9.6 and Droid 6.0 with the V45 signature file.

Tika generally performs best across the 20 most common formats. For text files (text/plain) in particular, it is the only tested tool that identifies the files correctly.

Tika is the fastest of the tools, and Fido is the slowest.
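As a rough illustration of how identification results can be scored against a ground truth, here is a minimal Python sketch. It is not the testing framework from the post: the function name `per_format_accuracy`, the file names, and the sample MIME types are all made up for the example, and it assumes each tool's output has already been reduced to a mapping from file name to MIME type.

```python
from collections import Counter


def per_format_accuracy(ground_truth, tool_results, top_n=20):
    """Return {mime_type: fraction identified correctly} for the
    top_n most common formats in the ground truth.

    Both arguments map file name -> MIME type (a hypothetical layout).
    A file the tool did not report at all counts as a miss.
    """
    totals = Counter(ground_truth.values())
    correct = Counter()
    for file_name, expected in ground_truth.items():
        if tool_results.get(file_name) == expected:
            correct[expected] += 1
    return {fmt: correct[fmt] / count for fmt, count in totals.most_common(top_n)}


# Hypothetical sample data, not taken from Govdocs1.
ground_truth = {
    "000001.txt": "text/plain",
    "000002.txt": "text/plain",
    "000003.pdf": "application/pdf",
    "000004.pdf": "application/pdf",
}
tool_a = {
    "000001.txt": "text/plain",
    "000002.txt": "text/plain",
    "000003.pdf": "application/pdf",
    "000004.pdf": "application/octet-stream",  # misidentified
}

accuracy = per_format_accuracy(ground_truth, tool_a)
for fmt, score in accuracy.items():
    print(f"{fmt}: {score:.0%}")
```

Running the same function once per tool gives per-format scores that can be compared side by side. Timing each tool's run separately would cover the speed comparison.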

Benefits of Collaboration

Being relatively new to SCAPE and these "hackathons", I wasn't entirely sure what to expect. In theory, I could see the benefits of the group collectively sitting together, jointly discussing and working on project issues, but in practice I wasn't sure exactly how it would work. How prescriptive would the agenda be (or need to be)? Would there be enough time? Would people actually sit working together, or just sit next to each other working?

How SCAPE will contribute to us, how SCAPE can benefit from us and why others should follow SCAPE

We (Exlibris) have been part of SCAPE for a year already. We joined the project as a 'Digital Preservation Commercial Company' (actually the only one among the partners; of course other commercial companies are involved, but none that produce a full-scale Digital Preservation solution).
