As part of the evaluation framework I'm developing for OPF and SCAPE, I've been working on gathering a corpus of files to run experiments against.
Although Govdocs1 would seem like a good place to start, there are a few problems:
1) It's too big. One million files is just showing off.
2) It's full of repeats! There are over 700,000 PDF files.
3) Running experiments on a million files full of repeats generates too much data (yes, there is such a thing). One way to thin the set out is sketched below.
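To give a flavour of what that thinning might look like, here is a minimal sketch: drop byte-identical repeats by content hash, then sample a fixed number of files per format. This is not the actual corpus-building code; the directory layout, the per-format sample size, and the use of file extensions as a format proxy are all assumptions of mine.

```python
import hashlib
import random
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large corpus files don't exhaust memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def dedupe_and_sample(corpus_dir: str, per_format: int = 100) -> list[Path]:
    """Keep one copy of each distinct file, then sample up to
    per_format files per extension to keep the test set manageable."""
    seen: set[str] = set()
    by_ext: dict[str, list[Path]] = {}
    for path in Path(corpus_dir).rglob("*"):
        if not path.is_file():
            continue
        digest = sha256_of(path)
        if digest in seen:
            continue  # byte-identical repeat, skip it
        seen.add(digest)
        by_ext.setdefault(path.suffix.lower(), []).append(path)
    sample: list[Path] = []
    for paths in by_ext.values():
        sample.extend(random.sample(paths, min(per_format, len(paths))))
    return sample
```

Even something this crude cuts the experiment runs down from a million files to a few hundred per format, which keeps the output data at a size a human can actually look at.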
The scientific and R&D communities are talking more and more about collecting and reusing data and datasets, and archiving and preserving that content is core to both.
Sharing data is key for many reasons.
To make the generation of Debian packages easy, OPF has created, and pays to host, a number of Amazon AMIs that anyone can launch. These AMIs are already set up to build the package automatically: their only function is to download the latest release (by tag number), build it, and put it on the server's web page so that you can download it.
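The build step on the AMI boils down to roughly the following flow. This is a hedged sketch of that idea rather than the actual AMI contents: the repository URL, the paths, and the exact packaging invocation here are illustrative assumptions (the sketch assumes a Debian-style source tree whose build dependencies are already installed on the image).

```python
import subprocess
from pathlib import Path

# Hypothetical repository URL and web root; the real AMIs will differ.
REPO_URL = "https://github.com/example/example-tool.git"
WEB_ROOT = Path("/var/www/html")

def run(*cmd: str, cwd: Path | None = None) -> str:
    """Run a command, failing loudly if it exits non-zero."""
    return subprocess.run(cmd, cwd=cwd, check=True,
                          capture_output=True, text=True).stdout.strip()

def build_latest_release() -> None:
    src = Path("/tmp/build/src")
    src.parent.mkdir(parents=True, exist_ok=True)
    run("git", "clone", REPO_URL, str(src))
    # Find the most recent tag and build exactly that release.
    tag = run("git", "describe", "--tags", "--abbrev=0", cwd=src)
    run("git", "checkout", tag, cwd=src)
    run("dpkg-buildpackage", "-us", "-uc", cwd=src)
    # dpkg-buildpackage writes the .deb into the parent directory;
    # publish it where the AMI's web server can serve it.
    for deb in src.parent.glob("*.deb"):
        deb.rename(WEB_ROOT / deb.name)
```

The appeal of baking this into an AMI is that anyone can launch one and get an identical, repeatable build without installing the Debian toolchain locally.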
Since joining the project in July 2011 I have focused on aligning a number of different groups and outputs so that they are consistent and maintainable into the future. In this way I feel my role is not only to support OPF but to use it as a platform to support the ongoing digital preservation targets of others outside the immediate OPF and SCAPE project communities.