Blogs

An Analysis Engine for the DROID CSV Export

I have been working on some code to ensure the accurate and consistent output of any file format analysis based on the DROID CSV export. The tool produces summary information about any DROID export and more detailed listings for content of interest such as files with potentially problematic file names or duplicate content based on MD5 hash value. I describe some of the rationale and ask for advice on where to go next.
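
As a rough illustration of the kind of analysis described above, here is a minimal sketch in Python (not the tool itself). It counts files per format identifier and lists paths that share an MD5 hash. The column names `TYPE`, `FILE_PATH`, `PUID` and `MD5_HASH` are assumptions about the export layout and may need adjusting, and `analyse_droid_export` and `SAMPLE` are hypothetical names used only for this example.

```python
import csv
import io
from collections import Counter, defaultdict

# Tiny made-up export; the column names are assumed, not taken from a real file.
SAMPLE = """TYPE,FILE_PATH,PUID,MD5_HASH
Folder,/data,,
File,/data/a.pdf,fmt/18,abc
File,/data/b.pdf,fmt/18,abc
File,/data/c.bin,,def
"""


def analyse_droid_export(csv_file):
    """Return (format_counts, duplicates) for an open CSV file object."""
    reader = csv.DictReader(csv_file)
    format_counts = Counter()
    paths_by_hash = defaultdict(list)
    for row in reader:
        # Folders carry no format or hash information, so skip them.
        if row.get("TYPE") == "Folder":
            continue
        # Rows without an identifier are grouped under "unidentified".
        format_counts[row.get("PUID") or "unidentified"] += 1
        md5 = row.get("MD5_HASH")
        if md5:
            paths_by_hash[md5].append(row.get("FILE_PATH"))
    # Only hashes seen more than once indicate duplicate content.
    duplicates = {h: p for h, p in paths_by_hash.items() if len(p) > 1}
    return format_counts, duplicates


if __name__ == "__main__":
    counts, dupes = analyse_droid_export(io.StringIO(SAMPLE))
    print(dict(counts))
    print(dupes)
```

For a real export, an open file handle (for example `open(path, newline="", encoding="utf-8")`) could be passed in place of the `io.StringIO` wrapper.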

A Weekend With Nanite

Well over a year ago I wrote the “A Year of FITS” (http://www.openplanetsfoundation.org/blogs/2013-01-09-year-fits) blog post describing how we, during the course of 15 months, characterised 400 million harvested web documents using the File Information Tool Kit (FITS) from Harvard University. I presented the technique and the technical metadata and basically concluded that FITS didn’t fit that kind of heterogeneous data in such large amounts. In the time that has passed since that experiment, FITS has been improved in several areas, including the code base and the organisation of the development, and it could be interesting to see how far it has evolved for big data. Still, FITS is not what I will be writing about today. Today I’ll present how we characterised more than 250 million web documents, not in 9 months, but during a weekend.

Using Kanban at the SCAPE Developer Workshop

The SCAPE project is into its final 6 months, and with that came our final developer workshop. The main focus of this event was demonstrations, productisation and sustainability; however, with everyone together, it was also an opportune time to make progress on other SCAPE-related activities.