davetaz’s blog

Years on from the registry, why has the preservation community not solved the problem of well-managed, high-quality data publication?

While open data sources such as PRONOM, the Software Conversion Registry (CSR) and govdocs are excellent examples of publishing re-usable data (to some extent), there is still a big problem with gaining access to other sources of data.

From 1 Million to 21,000: Reducing Govdocs Significantly

As part of the evaluation framework I’m developing for OPF and SCAPE, I’ve been working on gathering a corpus of files to run experiments against.

Although Govdocs1 would seem like a good place to start, there are a few problems:

1) It’s too big, 1 Million Files is just showing off.

2) It’s full of repeats! There are over 700,000 PDF files.

3) Running experiments on 1 Million files that are full of repeats generates too much data (yes, there is such a thing).
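One way to shrink a corpus that is dominated by a single format is to cap the number of files kept per file extension. The sketch below is only an illustration of that idea, not necessarily the method used for the reduction described here: the function name `cap_per_extension`, the `cap` parameter and the use of the file extension as a stand-in for format are all assumptions.

```python
import random
from collections import defaultdict
from pathlib import PurePath


def cap_per_extension(paths, cap, seed=0):
    """Keep at most `cap` randomly chosen paths per file extension.

    Hypothetical helper: the extension is only a rough proxy for the
    real file format, and `cap` is an arbitrary illustrative limit.
    """
    # Group the paths by lower-cased extension; files without one
    # are collected under the "(none)" key.
    groups = defaultdict(list)
    for path in paths:
        ext = PurePath(path).suffix.lower() or "(none)"
        groups[ext].append(path)

    # A seeded generator makes the selection repeatable between runs.
    rng = random.Random(seed)
    kept = []
    for ext in sorted(groups):
        members = sorted(groups[ext])
        if len(members) > cap:
            members = rng.sample(members, cap)
        kept.extend(sorted(members))
    return kept
```

Formats with fewer than `cap` files are kept whole, so rare formats survive while an over-represented one, such as PDF, is cut down to size.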

Scalable Data Preservation

Currently the scientific and R&D communities are continuously talking about data and dataset collection and reuse. Core to these aspects is archiving and preserving this content.

Sharing data is key for many reasons:

  • Discovering new science
  • Reproducing results
  • Verifying research
  • Evidence-based decision making
  • Establishing trust

Turning GitHub Code into Debian Packages – The OPF Way

To make the generation of Debian packages easy, OPF has created, and is paying to host, a number of Amazon AMIs which can be launched by anyone. These AMIs are already set up to build the package automatically. Their only function is to download the latest release (by tag number), build it and put it on the server's web page so that you can download it.
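Picking the "latest release (by tag number)" can be sketched as a numeric comparison of tag names. This is a minimal illustration, not the code the AMIs run: the function name `latest_release_tag` and the tag scheme (dotted numbers with an optional leading `v`) are assumptions.

```python
import re


def latest_release_tag(tags):
    """Return the tag with the highest numeric version, or None.

    Hypothetical helper: assumes release tags look like "0.10" or
    "v1.2.3"; any tag that does not match that shape is ignored.
    """
    best_tag = None
    best_key = None
    for tag in tags:
        match = re.fullmatch(r"v?(\d+(?:\.\d+)*)", tag)
        if match is None:
            continue
        # Compare versions as tuples of integers so that 0.10 sorts
        # after 0.9, which a plain string comparison would get wrong.
        key = tuple(int(part) for part in match.group(1).split("."))
        if best_key is None or key > best_key:
            best_tag, best_key = tag, key
    return best_tag
```

Comparing integer tuples rather than strings is the point of the sketch: with string ordering, a tag such as `0.9` would wrongly be treated as newer than `0.10`.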

Summary of Outputs and Roadmap (Feb 2012)

Since joining the project in July 2011 I have focussed on aligning a number of different groups and outputs so that they are consistent and maintainable into the future. In this way I feel my role is not only to support OPF but to use it as a platform to support the ongoing digital preservation targets of others outside the immediate OPF and SCAPE project communities.