Open Planets Foundation Blogs

The Open Planets Foundation has been established to provide practical solutions and expertise in digital preservation, building on the €15 million investment made by the European Union and Planets consortium.

SCAPE Demo Day at Statsbiblioteket

27 June 2014 - 8:38am

Statsbiblioteket (The State and University Library, Aarhus, hereafter called SB) welcomed a group of people from The Royal Library, The National Archives, and Danish e-Infrastructure Cooperation on June 25, 2014. They were invited for our SCAPE Demo day where some of SCAPE’s results and tools were presented. Bjarne S. Andersen, Head of IT Technologies, welcomed everybody and then our IT developers presented and demonstrated SB’s SCAPE work.

The day started with a nice introduction to the SCAPE project by Per Møldrup-Dalum, including short presentations of some of the tools that would not be shown in a demo. Among other things, this triggered questions about how to log in to Plato, a preservation planning tool developed in SCAPE.

Per continued with a presentation about Hadoop and its applications. Hadoop is a large and complex technology, and the decision to use it was made before the project started. This has resulted in some discussion during the project, but Hadoop has proven really useful for large-scale digital preservation. Hadoop is available both as open source and as commercial distributions. The core concept of Hadoop is the MapReduce programming model, which was presented in the paper “MapReduce: Simplified Data Processing on Large Clusters” in 2004 by Jeffrey Dean and Sanjay Ghemawat. This paper prompted Cutting and Cafarella to implement Hadoop, and they published their system under an open source license. Jobs for Hadoop have traditionally been written in the Java programming language, but in recent years several alternatives to Java have been introduced, e.g. Pig Latin and Hive. Other interesting elements in a Hadoop cluster are HBase, Mahout, Giraph, Zookeeper and many more. At SB we use an Isilon Scale-Out NAS storage cluster, which enables us to run a lot of different experiments on our four compute nodes, each with 96GB RAM and a 2 Gbit Ethernet interface. This setup potentially makes the complete online storage of SB reachable for the Hadoop cluster.

                                            Sometimes it is hard to fit an elephant in a library

 

Bolette A. Jurik was next in line and told the story of how Statsbiblioteket wanted to migrate audio files using Hadoop (and Taverna, and xcorrSound Waveform Compare). The files were supposed to be migrated from mp3 to wav. Checking this collection in Plato gave us the result ‘Do nothing’, meaning leave the files as mp3. But we still wanted to perform the experiment: to test that we have the tools to migrate, extract and compare properties, validate the file format and compare the content of the mp3 and wav files, and that we can create a scalable workflow for this. We did not have a tool for the content comparison, so we had to develop one, xcorrSound waveform-compare. The output shows which files need special attention. As an example, one of the files failed the waveform comparison although it looked right. This was due to a lack of content in some parts of the file, so Waveform Compare had no sound to compare! Bolette also asked her colleagues to create "migrated" sound files with problems that the tool would not find; read more about this small competition in this blog post.

Then Per was up for yet another presentation, this time describing the experiment “Identification and feature extraction of web archive data based on Nanite”. The test was to extract different kinds of metadata (such as authors, GPS coordinates for photographs, etc.) using Apache Tika, DROID (and libmagic). The experiment was run on the Danish Netarchive (the archive of the Danish web, a task undertaken by The Royal Library and SB together). For the live demo a small job with only three ARC files was used; processing all of the 80,000 files in the original experiment would have taken 30 hours. Hadoop generates loads of technical metadata that enables us to analyse such jobs in detail after the execution. Per’s presentation was basically a quick review of what is described in the blog post A Weekend With Nanite.

An analysis of the original Nanite experiment was done live in Mathematica, presenting a lot of fun facts and interesting artefacts. For one thing, we counted the number of unique MIME types in the 80,000 ARC files, or 260,603,467 individual documents:

  • 1384 different MIME types were reported by the HTTP server at harvest time,
  • DROID counted 319 MIME types,
  • Tika counted 342 MIME types.

A really weird artefact was that approx. 8% of the identification tasks were complete before they started! The only conclusion to this is that we’re experiencing some kind of temporal shift that would also explain the great performance of our cluster…

Two years ago SB concluded a job that had run for 15 months. 15 months of FITS characterising 12TB of web archive data. The experiment with Nanite characterised 8TB in 30 hours. Overall this extreme shift in performance is due to our involvement in the SCAPE project.
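To put the two runs side by side, the figures above can be turned into rough throughput numbers. This is only a minimal sketch: it assumes decimal units (1 TB = 1000 GB) and an average month of 30.44 days, neither of which comes from the experiments themselves.

```python
# Rough throughput comparison of the two characterisation runs.
# Assumptions (not from the experiments): 1 TB = 1000 GB, 1 month = 30.44 days.
HOURS_PER_MONTH = 30.44 * 24


def throughput_gb_per_hour(terabytes, hours):
    """Return the average throughput in GB per hour."""
    return terabytes * 1000 / hours


fits_rate = throughput_gb_per_hour(12, 15 * HOURS_PER_MONTH)  # FITS: 12TB in 15 months
nanite_rate = throughput_gb_per_hour(8, 30)                   # Nanite: 8TB in 30 hours
speedup = nanite_rate / fits_rate

print(f"FITS:   {fits_rate:.1f} GB/h")
print(f"Nanite: {nanite_rate:.1f} GB/h")
print(f"Speed-up: roughly {speedup:.0f}x")
```

Under these assumptions the FITS job averaged about 1 GB per hour, while the Nanite job ran at well over 200 GB per hour. The two jobs extracted different metadata, so the ratio is indicative only.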

After sandwiches and a quick tour of the library tower, Asger Askov Blekinge took over to talk about integrating the Fedora-based DOMS repository with Hadoop. He described Bitmagasinet (SB’s data repository) and DOMS (SB’s Digital Object Management System based on Fedora) and how our repository is integrated with Hadoop.

SB is right now working on a very large project to digitize 32 million pages of newspapers. The digitized files are delivered in batches and we run Hadoop map/reduce jobs on each batch to do quality assurance. An example is to run Jpylyzer on a batch (Map runs Jpylyzer on each file, Reduce stores the results back in DOMS). The SCAPE way to do it includes three steps:

  • Staging – retrieves records
  • Hadooping – reads, works and writes new updated records
  • Loading – stores updated records in DOMS

The SCAPE Data model is mapped with the newspapers in the following way:

                                           SCAPE Data Model mapped with newspapers

SCAPE Stager/Loader creates a sequence file which can then be read and each record updated by Hadoop and after that the records are stored in DOMS.
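The three steps can be illustrated with a toy model in plain Python. This is a sketch under stated assumptions, not the SCAPE Stager/Loader: the dict `repository` stands in for DOMS, `check_file` stands in for a tool such as Jpylyzer, and all record fields are hypothetical.

```python
# Toy model of the Staging -> Hadooping -> Loading steps.
# The dict `repository` stands in for DOMS and `check_file` for a tool such as
# Jpylyzer; both are hypothetical placeholders, not the real interfaces.

def stage(repository, batch_id):
    """Staging: retrieve the records of one batch as (key, record) pairs."""
    return [(key, dict(rec)) for key, rec in repository.items()
            if rec["batch"] == batch_id]


def hadooping(records, check_file):
    """Hadooping: run the check on every record (map) and collect the
    updated records keyed by identifier (reduce)."""
    mapped = [(key, dict(rec, valid=check_file(rec["path"])))
              for key, rec in records]
    return dict(mapped)


def load(repository, updated):
    """Loading: store the updated records back in the repository."""
    repository.update(updated)


def fake_check(path):
    # Placeholder rule so the example runs without any external tool.
    return path.endswith(".jp2")


repository = {
    "page-1": {"batch": "B1", "path": "B1/page-1.jp2"},
    "page-2": {"batch": "B1", "path": "B1/page-2.tif"},
    "page-3": {"batch": "B2", "path": "B2/page-3.jp2"},
}
load(repository, hadooping(stage(repository, "B1"), fake_check))
```

Only the records of batch B1 gain a `valid` field; the record from batch B2 is left untouched because it was never staged.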

The last demo was presented by Rune Bruun Ferneke-Nielsen. He described the policy driven validation of JPEG 2000 files based on Jpylyzer and performed on SB’s Newspaper digitization project. The newspapers are scanned from microfilms by a company called Ninestars, and then quality assured by SB’s own IT department. We need to make sure that the content conforms to the corresponding file format specifications and that the file format profile conforms to our institutional policies.

                                                     

530,000 image files have been processed within approx. five hours.

We want to be able to receive 50,000 newspaper files per day, and this is more than one server can handle. All access to data for quality assurance etc. is done via Hadoop. Ninestars runs a quality assurance before they send the files back to SB, and then the files are QA’ed again in-house.

                                                               Fuel for the afternoon (Photo by Per Møldrup-Dalum)

One of the visitors at the demo is working at The Royal Library with the NetArchive and would like to make some crawl log analyses. These could perhaps be processed by using Hadoop - this is definitely worth discussing after today to see if our two libraries can work together on this.

All in all this was a very good day, and the audience learned a lot about SCAPE and the benefits of the different workflows and tools. We hope they will return for further discussion on how they can best use SCAPE products at their own institutions.

Preservation Topics: SCAPE
Categories: Planet DigiPres

Bulk disk imaging and disk-format identification with KryoFlux

26 June 2014 - 3:15pm
The problem

We have a large volume of content on floppy disks that we know are degrading but which we don't know the value of.

Considerations
  1. We don't want to waste time/resources on low-value content.
  2. We don't know the value of the content.
  3. We want to be able to back up the content on the disks to ensure it doesn't degrade any more than it already has.
  4. Using unskilled students to do the work is cost-effective.
  5. Unskilled students have often never seen "floppy" disks, let alone learned to distinguish between the different formats of floppy disk. So we need a solution that doesn't require them to differentiate (e.g. between Apple formats, PC formats, Amiga, etc.).
Solution
  1. Make KryoFlux stream files using the KryoFlux hardware and software.
  2. Use the KryoFlux software to create every variant of disk image from those streams.
  3. Use the mount program on Linux to mount each disk image using each variant of file system parameter. 
  4. Keep the disk images that can mount in Linux (as that ability implies that they are the right format).

Very rough beginnings of a program to perform the automatic format identification using the KryoFlux software and Mount are available here.
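Steps 3 and 4 of the solution could look roughly like the sketch below. It is independent of the program linked above and rests on several assumptions: the list of file system types is hypothetical, mounting normally needs root privileges, and the command runner is injectable so the logic can be tried out without touching a real disk image.

```python
import subprocess

# Hypothetical list of file system types to try; extend as needed.
CANDIDATE_FILESYSTEMS = ["vfat", "msdos", "hfs", "affs", "ext2"]


def mount_command(image, fstype, mountpoint):
    """Build a read-only loop-mount command for one image/file system pair."""
    return ["mount", "-t", fstype, "-o", "loop,ro", image, mountpoint]


def run_quietly(command):
    """Run a command and return its exit code (0 means success)."""
    result = subprocess.run(command, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode


def mountable_filesystems(image, mountpoint, run=run_quietly):
    """Return the file system types under which `image` mounts."""
    hits = []
    for fstype in CANDIDATE_FILESYSTEMS:
        if run(mount_command(image, fstype, mountpoint)) == 0:
            hits.append(fstype)
            # Unmount again so the next candidate starts from a clean state.
            run(["umount", mountpoint])
    return hits
```

A call such as `mountable_filesystems("disk.img", "/mnt/test")` (hypothetical paths) returns the candidates that mounted. As noted under the issues below, a successful mount can still be a false positive.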


Issues with the solution
  1. When you use the KryoFlux to create raw stream files, it only seems to do one pass of each sector, whereas when you specify the format it will try to re-read sectors that it identifies as "bad sectors" in the first pass. This can lead to it successfully reading those sectors when it otherwise wouldn't. So using the KryoFlux stream files may not lead to as much successful content preservation as you would get if you specified the format of the disk before beginning the imaging process. I'm trying to find out whether using "multiple" in the output options in the KryoFlux software might help with this.
  2. Mount doesn't mount all file systems, though as this improves in the future the process could be re-run.
  3. Mount can give false positives.
  4. I don't know whether there is a difference between disk images created with KryoFlux using many of the optional parameters and those created using the defaults. For example, there doesn't appear to be a difference in mountability between disk images created with the number of sides specified and disk images created without it, where it defaults to both sides (e.g. for MFM images the results of both seem to mount successfully).
  5. Keeping the raw streams is costly. A disk image for a 1.44 MB floppy is ~1.44 MB, whereas the stream files are in the tens of MBs.
Other observations:
  1. It might be worth developing signatures for use in e.g. DROID to identify the format of the stream files directly in the future. I believe some emulators can already interact directly with the stream files.
  2. The stream files might provide a way of overcoming bad-sector-based copy protection (e.g. the copy protection used in Lotus 1-2-3 and Lotus Jazz) by enabling the use of raw stream files (which, I believe, contain the "bad" sectors as well as the good ones) in emulators.


Thoughts/feedback appreciated

Preservation Topics: Identification, Preservation Risks, Bit rot, Tools
Categories: Planet DigiPres

Will the real lazy pig please scale up: quality assured large scale image migration

24 June 2014 - 9:12am

Authors: Martin Schaller, Sven Schlarb, and Kristin Dill

In the SCAPE Project, the memory institutions are working on practical application scenarios for the tools and solutions developed within the project. One of these application scenarios is the migration of a large image collection from one format to another.

There are many reasons why such a scenario may be of relevance in a digital library. On the one hand, conversion from an uncompressed to a compressed file format can significantly decrease storage costs. On the other hand, particularly from a long-term perspective, file formats may be in danger of becoming obsolete, which means that institutions must be able to undo the conversion and return to the original file format. In this case a quality assured process is essential to allow for reconstruction of the original file instances and especially to determine when deletion of original uncompressed files is needed – this is the only way to realize the advantage of reducing storage costs. Based on these assumptions we have developed the following use case: Uncompressed TIFF image files are converted into compressed JPEG2000 files; the quality of the converted file is assured by applying a pixel for pixel comparison between the original and the converted image.

For this, a sequential Taverna concept workflow was first developed, which was then modelled into a scalable procedure using different tools developed in the SCAPE Project.

The Taverna Concept Workflow

The workflow input is a text file containing paths to the TIFF files to be converted. This text file is then transformed into a list that allows the sequential conversion of each file, hence simulating a non-scalable process. Before the actual migration commences, the validity of the TIFF file is checked. This step is realized by using FITS, a wrapper that applies different tools to extract the identification information of a file. Since the output of FITS is an XML-based validation report, an XPath service extracts and checks the validity information. If the file is valid, migration from TIFF to JPEG2000 can begin. The tool used in this step is OpenJPEG 2.0. In order to verify the output, Jpylyzer (a validator as well as feature extractor for JPEG2000 images, created within the SCAPE Project) is employed. Again, an XPath service is used to extract the validity information. This step concludes the file format conversion itself, but in order to ensure that the migrated file is indeed a valid surrogate, the file is reconverted into a TIFF file, again using OpenJPEG 2.0. Finally, in a last step the reconverted and the original TIFF files are compared pixel for pixel using the Linux-based ImageMagick. Only through the successful execution of this final step can the validity as well as the possibility of a complete reconversion be assured.

Figure 1 (above): Taverna concept workflow
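The control flow of the concept workflow can be sketched in a few lines of Python. This is not the Taverna workflow itself: each step is passed in as a callable, and the command lines mentioned in the comments are illustrative assumptions that should be checked against each tool's documentation.

```python
# Sequential sketch of the concept workflow for a single TIFF file.
# The command lines in the comments are illustrative assumptions; check each
# tool's documentation for the exact options.

STEP_NAMES = [
    "validate_tiff",      # e.g. FITS, then an XPath check on its XML report
    "convert_to_jp2",     # e.g. opj_compress -i in.tif -o out.jp2
    "validate_jp2",       # e.g. jpylyzer out.jp2, then an XPath check
    "reconvert_to_tiff",  # e.g. opj_decompress -i out.jp2 -o back.tif
    "compare_pixels",     # e.g. ImageMagick compare -metric AE in.tif back.tif null:
]


def migrate_one(tiff_path, steps):
    """Run the steps in order and stop at the first one that fails.

    `steps` maps each step name to a callable that takes the file path and
    returns True on success. Returns (ok, name_of_failed_step_or_None).
    """
    for name in STEP_NAMES:
        if not steps[name](tiff_path):
            return False, name
    return True, None


def migrate_all(tiff_paths, steps):
    """Process the files one after another, as the non-scalable workflow does."""
    return {path: migrate_one(path, steps) for path in tiff_paths}
```

Because a file only counts as migrated when every step succeeds, a failed pixel comparison is reported in the same way as a failed validation.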

In order to identify how much time was consumed by each element of this workflow, we ran a test consisting of the migration of 1,000 files. Executing the described workflow on the 1,000 image files took about 13 hours and five minutes. Rather unsurprisingly, conversion and reconversion of the files took the longest: the conversion to JPEG2000 took 313 minutes and the reconversion 322 minutes. FITS validation needed 70 minutes and the pixel-wise comparison was finished in 62 minutes. The SCAPE-developed tool Jpylyzer required only 18 minutes and was thus much faster than the above-mentioned steps.

Figure 2 (above): Execution times of each of the concept workflow's steps

Making the Workflow Scale

The foundation for the scalability of the described use case is a Hadoop cluster containing five Data Nodes and one Name Node (specification: see below). Besides having economic advantages – Hadoop runs on commodity hardware – it also bears the advantage of being designed for failure, hence reducing the problems associated with hardware crashes.

The distribution of tasks to each core is implemented via MapReduce jobs. A Map job splits the handling of a file. For example, if a large text file is to be processed, a Map job divides the file into several parts. Each part is then processed on a different node. Hadoop Reduce jobs then aggregate the outputs of the processing nodes into a single file again.
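The split, map and reduce idea can be illustrated in plain Python without any Hadoop API. In this minimal sketch the chunks are processed one after another in a single process, whereas Hadoop would hand them to different nodes; the word-count task and all names are chosen for illustration only.

```python
from collections import Counter


def split_into_parts(lines, n_parts):
    """Divide the input lines into at most n_parts roughly equal chunks."""
    size = max(1, -(-len(lines) // n_parts))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def map_part(part):
    """Map step: count the words in one chunk (would run on one node)."""
    return Counter(word for line in part for word in line.split())


def reduce_counts(partial_counts):
    """Reduce step: merge the per-chunk results into a single result."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


lines = ["tiff jp2", "jp2 jp2", "tiff", "png"]
parts = split_into_parts(lines, 2)
word_counts = reduce_counts(map_part(part) for part in parts)
```

The result is the same as counting the words in one pass, which is what makes the work safe to distribute.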

But writing MapReduce jobs is a complex matter. For this reason, Apache Pig is used. Pig was built for Hadoop and translates a set of commands in a language called “Pig Latin” into MapReduce jobs, thus making the handling of MapReduce jobs much easier or, as Professor Jimmy Lin described the powerful tool during the ‘Hadoop-driven digital preservation Hackathon’ in Vienna, easy enough “… for lazy pigs aiming for hassle-free MapReduce.”

Hadoop HDFS, Hadoop MapReduce and Apache Pig make up the foundation of the scalability on which the SCAPE tools ToMaR and XPath Service are based. ToMaR wraps command line tasks for parallel execution as Hadoop MapReduce jobs. These are in our case the execution of FITS, OpenJPEG 2.0, Jpylyzer and ImageMagick. As a result, the simultaneous execution of these tools on several nodes is possible. This has a great impact on execution times as Figure 3 (below) shows.

The blue line represents the non-scalable Taverna workflow. It is clearly observable how the time needed for file migration increases in proportion to the number of files that are converted. The scalable workflow, represented by the red line, shows a much smaller increase in time needed, thus suggesting that scalability has been achieved. This means that, by choosing the appropriate size for the cluster, it is possible to migrate a certain number of image files within a given time frame.

Figure 3 (above): Wallclock times of concept workflow and scalable workflow

Below is the specification of the Hadoop cluster, where the master node runs the jobtracker and namenode/secondary namenode daemons, and the worker nodes each run a tasktracker and a data node daemon.

Master node: Dell Poweredge R510

  • CPU: 2 x Xeon E5620@2.40GHz
  • Quadcore CPU (16 HyperThreading cores)
  • RAM: 24GB
  • NIC: 2 x GBit Ethernet (1 used)
  • DISK: 3 x 1TB DISKs; configured as RAID5 (redundancy); 2TB effective disk space

Worker nodes: Dell Poweredge R310

  • CPU: 1 x Xeon X3440@2.53GHz
  • Quadcore CPU (8 HyperThreading cores)
  • RAM: 16GB
  • NIC: 2 x GBit Ethernet (1 used)
  • DISK: 2 x 1TB DISKs; configured as RAID0 (performance); 2TB effective disk space

However, the throughput we can reach using this cluster and Pig/Hadoop job configuration is limited. As Figure 4 shows, the throughput (measured in gigabytes per hour, GB/h) grows rapidly as the number of files being processed increases, and then stabilises at slightly more than 90 GB/h when processing more than 750 image files.

Figure 4 (above): Throughput of the distributed execution measured in Gigabytes per hour (GB/h) against the number of files processed
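The plateau in Figure 4 allows a back-of-the-envelope estimate of how long a larger migration would take on this cluster, or what throughput a cluster would need to meet a deadline. The sketch below assumes, without measurement, that throughput stays at about 90 GB/h for larger jobs; the collection sizes are hypothetical.

```python
# Back-of-the-envelope estimate based on the ~90 GB/h plateau in Figure 4.
# Assumption (not measured): throughput stays at that level for larger jobs.
PLATEAU_GB_PER_HOUR = 90.0


def estimated_hours(total_gb, throughput_gb_per_hour=PLATEAU_GB_PER_HOUR):
    """Estimate the wall-clock hours needed to migrate total_gb of images."""
    return total_gb / throughput_gb_per_hour


def required_throughput(total_gb, deadline_hours):
    """Throughput (GB/h) a cluster would need to finish within the deadline."""
    return total_gb / deadline_hours


hours_for_one_tb = estimated_hours(1000)            # hypothetical 1000 GB collection
needed_for_a_day = required_throughput(10000, 24)   # hypothetical 10,000 GB in 24 hours
```

Whether a bigger cluster actually delivers proportionally more throughput would have to be measured rather than assumed.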

As our use case shows, by using a variety of tools developed in the SCAPE Project together with the Hadoop framework it is possible to distribute the processing on various machines thus enabling the scalability of large scale image migration and significantly reducing the time needed for data processing. In addition, the size of the cluster can be tailored to fit the size of the job so that it can be completed within a given time frame.

Apart from the authors of this blog post, the following SCAPE Project partners contributed to this experiment:

  • Alan Akbik, Technical University of Berlin
  • Matthias Rella, Austrian Institute of Technology
  • Rainer Schmidt, Austrian Institute of Technology
Preservation Topics: Migration, SCAPE, jpylyzer
Categories: Planet DigiPres

Interview with a SCAPEr - Leïla Medjkoune

20 June 2014 - 11:59am
Who are you?

My name is Leïla Medjkoune and I am responsible for the Web Archiving projects and activities at Internet Memory.

Tell us a bit about your role in SCAPE and what SCAPE work you are involved in right now?

My involvement in SCAPE is twofold. I am working as a project manager, following the project and ensuring that Internet Memory as a partner fulfils the project plan. I am also involved as a functional expert, representing web archivists’ needs. I am therefore working within several areas of the project such as Quality Assurance and the Web Testbed work, where I contribute to the development of tools and workflows in relation to web archiving.

Why is your organisation involved in SCAPE?

Since its creation in 2004, Internet Memory has actively participated in improving the preservation of the Internet. It supports cultural institutions involved in web archiving projects through its large-scale shared platform, by building its own web archive, and by developing innovative methods and tools, such as its own crawler, MemoryBot. These are developed either internally or as a result of participation in EU-funded research projects that aim to tackle web archiving and large-scale preservation challenges. As part of SCAPE, Internet Memory wants to test, develop and hopefully implement preservation tools and methods within its infrastructure, including an automated visual quality tool applied to web archives.

What are the biggest challenges in SCAPE as you see it?

SCAPE is a very interesting project with quite a complex organisation. This is due to the fact that we are looking at a broad range of tools and methods trying to tackle a variety of preservation issues. Beyond the organisational aspects, one of the biggest challenges is, as stated within the acronym, to answer the scalability issues currently met by most archives and libraries. This is even more critical for web archives, as the amount of heterogeneous content to preserve and to provide access to is constantly growing. Another challenge will be to disseminate SCAPE's outcomes so that they reach the preservation community and are used within libraries, archives and preservation institutions in general.

What do you think will be the most valuable outcome of SCAPE?

Like most web archives, we want to implement robust automated tools within our infrastructure that would not only facilitate operations but also reduce costs. Improving characterisation tools so that they scale and developing QA tools designed for web archives, such as the Pagelyzer, are the most useful outcomes from our perspective. We are also strongly involved in the SCAPE platform work and believe this platform is a useful example of how several preservation tools and systems can be integrated within one single infrastructure.

Contact information:

Leïla Medjkoune

leila.medjkoune@internetmemory.net

Preservation Topics: SCAPE
Categories: Planet DigiPres

An Analysis Engine for the DROID CSV Export

3 June 2014 - 7:20am

I have been working on some code to ensure the accurate and consistent output of any file format analysis based on the DROID CSV export, example here. One way of looking at it is an executive summary of a DROID analysis, except I don't think executives, as such, will be its primary user-base. 

The reason for pushing the code and this blog out now is to seek feedback on what else might be useful for users. Specifically, I am interested in the formatting of the output and in other use cases: how do others analyse the same results, and can I incorporate those methods into my work to the benefit of the community?

The hope is that the results output by the tool can be used by digital preservation researchers, analysts, coders, archivists and digital archivists alike - where there is such a distinction to be drawn. 

The tool is split into two or three components depending on how you break it down.

droid2sqlite.py

This places DROID CSV export data into a SQLite database with the same filename as the input CSV file.

The process adds two additional columns to the saved table, URI_SCHEME so we can query to a greater granularity the URI scheme used by the various URIs output by DROID; and DIR_NAME to enable analysis on base directory names, e.g. to help us understand the break-down of directories in a collection.
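The idea can be sketched in a few lines of Python. This is not the code of droid2sqlite.py: the table name, the sample rows and the use of POSIX-style paths are assumptions made for illustration, and a real export has many more columns.

```python
import csv
import io
import posixpath
import sqlite3
from urllib.parse import urlparse


def load_droid_csv(csv_text, conn, table="droid"):
    """Load DROID-style CSV text into SQLite, adding URI_SCHEME and DIR_NAME."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # No hard-coded schema: the column names come from the CSV header.
    header = list(reader.fieldnames)
    columns = header + ["URI_SCHEME", "DIR_NAME"]
    column_sql = ", ".join('"{}"'.format(name) for name in columns)
    conn.execute('CREATE TABLE "{}" ({})'.format(table, column_sql))
    insert_sql = 'INSERT INTO "{}" VALUES ({})'.format(
        table, ", ".join("?" * len(columns)))
    for row in reader:
        values = [row[name] for name in header]
        # Derived columns: the outermost URI scheme and the parent directory.
        values.append(urlparse(row.get("URI") or "").scheme)
        values.append(posixpath.dirname(row.get("FILE_PATH") or ""))
        conn.execute(insert_sql, values)
    conn.commit()


# Hypothetical two-row export with a reduced set of columns.
sample = (
    "ID,URI,FILE_PATH,NAME,PUID\n"
    "1,file:/data/report.pdf,/data/report.pdf,report.pdf,fmt/18\n"
    "2,zip:file:/data/a.zip!/b.txt,/data/a.zip/b.txt,b.txt,x-fmt/111\n"
)
conn = sqlite3.connect(":memory:")
load_droid_csv(sample, conn)
rows = conn.execute(
    "SELECT NAME, URI_SCHEME, DIR_NAME FROM droid ORDER BY ID").fetchall()
```

Here the file inside the zip archive ends up with the scheme `zip`, while the top-level file keeps `file`. Paths exported on Windows would need `ntpath` instead of `posixpath`.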

droidsqliteanalysis.py

This combines the functions of droid2sqlite.py by calling droid2sqlite's primary class. Further it provides a query layer on top of the DROID SQLite database, outputting the results of various queries we might ask of the dataset to the command line.

MsoftFnameAnalysis.py

This is a class created to help spot potentially difficult-to-handle file names in any DROID CSV output. The class is based on a Microsoft Developer Network article but also checks for non-ASCII characters and a handful of other characters that can prove problematic, such as square brackets.
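A simplified version of such a check might look like the sketch below. It is not the MsoftFnameAnalysis.py class: the function name and the issue labels are hypothetical, and only a subset of Microsoft's naming rules (reserved characters, reserved device names, trailing space or period) is covered.

```python
import re

# Characters and names that Microsoft's file naming guidance treats as
# reserved, plus the extra checks mentioned in the post.
RESERVED_CHARS = set('<>:"/\\|?*')
RESERVED_NAMES = re.compile(
    r"^(CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9])(\..*)?$", re.IGNORECASE)
EXTRA_CHARS = set("[]")


def filename_issues(name):
    """Return a list of reasons why a file name may be difficult to handle."""
    issues = []
    if any(ch in RESERVED_CHARS for ch in name):
        issues.append("reserved character")
    if RESERVED_NAMES.match(name):
        issues.append("reserved name")
    if name.endswith((" ", ".")):
        issues.append("trailing space or period")
    if any(ord(ch) > 127 for ch in name):
        issues.append("non-ASCII character")
    if any(ch in EXTRA_CHARS for ch in name):
        issues.append("square bracket")
    return issues
```

Calling `filename_issues("draft[1].doc")` flags the square brackets, while an unremarkable name such as `report.pdf` returns an empty list.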

Database and Analysis Engine

This work mirrors some of that done by DROID already. DROID outputs an Apache Derby database for its Profile file format. Information on connecting to it can be found on the droid-list mailing list. For my purposes I had a desire to learn the database management system SQLite and, more practically, I found a greater amount of support for it in terms of libraries available in Python and applications I can use to access it. Instead of attempting to access the DROID Derby database and build on top of that, I decided to map the results to a SQLite database. SQLite also has features that I like and that might lend themselves better to long-term preservation, enabling the long-term storage of the database alongside any collection analysis documentation outside of the digital preservation system, if necessary.

DROID also enables filtering and the generation of reports. However, I haven’t found the way it collects information to be useful in the past and so needed a different approach, one that gives me greater flexibility to create more reports or manipulate output.

The DROID CSV export is as simple as it needs to be and provides a lot of useful information and so provided an adequate platform for this work in its own right.

The database engine doesn’t have a hard coded schema; it simply reads the column headers in the CSV provided to the tool. Given the appearance of particular columns it creates two additional columns on top to provide greater query granularity, as mentioned above.

The analysis output includes summary statistics, along with listings of PUIDs and file paths depending on the query that we’re interested in. On top of the summary statistics, the following information is output:

  • Identified PUIDs and format names
  • PUID frequency
  • Extension only identification in the collection and frequency
  • ID method frequency
  • Unique extensions identified across the collection
  • Multiple identification listing
  • MIME type frequency
  • Zero-byte object listing
  • No identification listing
  • Top signature and identified PUIDs list
  • Container types in collection
  • Duplicate content listing
  • Duplicate filename listing
  • Listing of potentially difficult filenames
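Several of these listings reduce to short SQL queries once the export is in SQLite. The sketch below runs against a toy table; the table name, the reduced column set and the sample rows are assumptions, and the tool's own queries may differ.

```python
import sqlite3

# Toy table with a reduced, hypothetical set of columns and rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE droid (NAME, SIZE, PUID)")
conn.executemany(
    "INSERT INTO droid VALUES (?, ?, ?)",
    [("a.pdf", 1200, "fmt/18"), ("b.pdf", 900, "fmt/18"),
     ("c.txt", 0, "x-fmt/111"), ("d.bin", 42, "")],
)

# PUID frequency, most common first.
puid_frequency = conn.execute(
    "SELECT PUID, COUNT(*) AS n FROM droid WHERE PUID != '' "
    "GROUP BY PUID ORDER BY n DESC, PUID"
).fetchall()

# Zero-byte object listing.
zero_byte_files = conn.execute(
    "SELECT NAME FROM droid WHERE CAST(SIZE AS INTEGER) = 0"
).fetchall()

# Objects without any PUID.
unidentified = conn.execute(
    "SELECT NAME FROM droid WHERE PUID = '' OR PUID IS NULL"
).fetchall()
```

Anyone with a different question can write a similar query against the same database without touching the analysis layer.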

An example analysis, based on a DROID scan from the re-factored opf-format-corpus I host, can be found here. The summary statistics generated are as follows:

Total files: 500
Total container objects: 14
Total files in containers: 176
Total directories: 85
Total unique directory names: 75
Total identified files (signature and container): 420
Total multiple identifications (signature and container): 1
Total unidentified files (extension and blank): 80
Total extension ID only count: 17
Total extension mismatches: 32
Total signature ID PUID count: 54
Total distinct extensions across collection: 64
Total files with duplicate content (MD5 value): 155
Total files with duplicate filenames: 117
Percentage of collection identified: 84.0
Percentage of collection unidentified: 16.0

One point to note is that DROID can analyse the contents of container files, or not. In the former case it makes it difficult to generate a count of top-level objects (objects not stored within a container). It is, however, useful to understand both counts where possible, but duplication of reports might be undesirable. The creation of a URI_SCHEME column in the database enables this count to be calculated without the need to run DROID twice. The number of top-level objects in the opf-format-corpus can be calculated by subtracting the number of files in containers from the total number of files, so: 324.
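The subtraction, and the kind of query that a URI_SCHEME column makes possible, can be sketched as follows. The toy rows and the TYPE values are assumptions; in particular, the query assumes that objects stored inside containers carry a scheme other than `file`.

```python
import sqlite3

# Figures from the summary statistics above.
total_files = 500
files_in_containers = 176
top_level_files = total_files - files_in_containers  # 324

# The same idea as a query on toy rows. Assumption: objects inside
# containers carry a non-"file" URI scheme such as "zip".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE droid (NAME, TYPE, URI_SCHEME)")
conn.executemany("INSERT INTO droid VALUES (?, ?, ?)", [
    ("docs", "Folder", "file"),
    ("a.zip", "Container", "file"),
    ("b.txt", "File", "zip"),
    ("c.pdf", "File", "file"),
])
top_level_count = conn.execute(
    "SELECT COUNT(*) FROM droid "
    "WHERE TYPE != 'Folder' AND URI_SCHEME = 'file'"
).fetchone()[0]
```

In the toy table the archive itself and the PDF count as top-level objects, while the text file inside the archive does not.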

Questions that we’re asking…

As we get into the analysis of a number of collections that we hope will be our first born-digital transfers at Archives New Zealand, we find ourselves asking questions of them to ensure they can be ingested with minimal issue into our long-term preservation system. We also want to ensure access is uninhibited for end-users once it arrives there.

Our first attempt at born-digital transfer sees us do this analysis up-front in an attempt to understand the transfers completely, looking at the issues likely to be thrown up on ingest and what pre-conditioning we are likely to have to do before that stage. Some of the questions are also part of a technical appraisal that will help us to understand what to do with examples of files with duplicate content and those that might otherwise be considered non-records or non-evidential e.g. zero-byte files.

The output of the tool represents ALL of the questions that we have considered so far. We do expect there to be more and better questions to be asked as well. Throwing the code and this blog into the public domain can help us build on this work through public input, so:

  • What other information might be useful to query the DROID export for?
  • What output format might be most useful?
  • What formatting of that output format will best lend itself to data visualisation?
  • What other comments and questions do readers of this blog have?
Footnotes

There is still some work to do here. I need to incorporate unit tests into the code base so that everything is as robust as I seek. I imagine future releases of DROID might initially break this tool's compatibility with DROID CSV exports and so that will have to be catered for at the time.

An important note about maintenance: having created this tool for my day-to-day work, I do hope to continue to maintain it as that work requires.

One of the things I like about accessing DROID results via a database is that we don’t need an analysis layer on top of it. If users have a different requirement of the database than I have catered for then they can simply use the database and use their own queries on top, using their preferred flavour of programming language. Other ways of using such a database might include re-mapping the output to be suitable for cataloguing and archival description, if one desires.

I have considered adding a temporal angle to the database by enabling the storage of multiple DROID reports relating to the same transfer. This could be used to monitor the result of pre-conditioning or analysis of a collection using progressively up-to-date DROID Signature Files. This lends itself to reporting and demonstration of progress to management. The realization of this is more difficult, as there doesn’t seem to be a single immutable piece of information we can hook into to make this possible: MD5 hashes are likely to change on pre-conditioning, and file paths may change depending on the machine used to complete a scan. Thoughts on this matter are appreciated.

The tool is licensed under the Zlib license and so can be easily re-used and incorporated into others’ work without issue.

Preservation Topics: Preservation Actions, Identification, Characterisation, Preservation Risks, Tools, Software
Categories: Planet DigiPres

A Weekend With Nanite

28 May 2014 - 9:30pm
Well over a year ago I wrote the “A Year of FITS” blog post (http://www.openplanetsfoundation.org/blogs/2013-01-09-year-fits) describing how we, during the course of 15 months, characterised 400 million harvested web documents using the File Information Tool Kit (FITS) from Harvard University. I presented the technique and the technical metadata and basically concluded that FITS didn’t fit that kind of heterogeneous data in such large amounts. In the time that has passed since that experiment, FITS has been improved in several areas, including the code base and the organisation of the development, and it could be interesting to see how far it has evolved for big data. Still, FITS is not what I will be writing about today. Today I’ll present how we characterised more than 250 million web documents, not in 9 months, but during a weekend.

Setting the Scene: The Tools and the Data

The Hardware

When we at the Danish State and University Library (SB) started our work on the SCAPE project, we acquired four machines to support this work. It was on those four machines that the FITS experiment was performed, and it is on the same four machines that the present experiment was performed. The big difference is the software that handles the processes and how the data is accessed.

The four machines are all blade servers, each with two six-core Intel Xeon 2.93GHz CPUs, 96GB RAM, and 2Gbit network interfaces (details at http://wiki.opf-labs.org/display/SP/SB+Test+Platform). In contrast to a traditional node in a Hadoop cluster, these machines do not have local data storage. At SB we rely very much on NAS for data storage; more specifically, our storage is based on the Isilon scale-out NAS solution from EMC. The Isilon system is a cluster designed for storage, and at SB it currently stores several PB of digital preservation data.

As NAS is a prerequisite for data storage at SB, we have been doing a lot of experiments on how best to integrate Hadoop with that kind of infrastructure. We have tested two Hadoop distributions, Cloudera and Pivotal HD, in different hardware and software configurations and the setup used for this experiment is the best so far while in no way being good enough. For the experiment we used the Cloudera distribution version 4.5.0, which builds upon Hadoop 2.0.0.

The Data

During the last months we have been moving our copy of the Danish Web Archive onto the above-mentioned Isilon cluster and providing online read-only access through NFS. This makes it possible for relevant jobs to access more than half a PB of web documents, more than 18 billion documents harvested during the last decade. That is a lot of data!

These documents are stored as ARC or WARC container files in a shallow directory tree with the ARC and WARC files in the nethermost directory. A few months ago we shifted from storing data in the ARC format to storing the data in the newer WARC format as this format is superior to the older ARC format. This will become relevant later in this post.

To select the data for this large-scale experiment, a few simple UNIX find commands were issued, giving a file with the file paths of 147,776 ARC files amounting to almost 15TB, or roughly 450 million documents.

The Software

For this large-scale characterisation experiment I substituted FITS with the Nanite project. This project is led by Andy Jackson from the UK Web Archive and enables DROID, Apache Tika, and libmagic to be effectively used on the Hadoop platform. For this experiment I will not use the libmagic component. Will Palmer of the British Library (BL) has written a blog post on this project (http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite) and how to integrate the Apache Tika parser into Nanite. Will Palmer has also been very helpful with uncovering and rectifying the problems found during the present experiment.

Before unleashing Nanite on the big data set, I did a lot of preliminary tests and tweaks in the source code of Nanite, more specifically the nanite-hadoop module that compiles to a JAR containing a self-contained Hadoop job. (In this blog post Nanite is a synonym for nanite-hadoop.)

One tweak was to write code for extracting the ARC metadata for each document and storing it alongside the properties extracted by Tika.

Another tweak was to substitute the newlines in the extracted HTML title elements with spaces, as those newlines break the data structure of one key/value pair per line in one of the output data files.

Both changes are now in the main Nanite code base.

After disabling a small hack that enables the code to run on the BL Hadoop cluster and disabling a preliminary check to ensure that all input files are compressed, which ours are not, the code should be ready for some real test runs.

Not quite so! The Nanite project depends upon another UK Web Archive project called warc-hadoop-recordreaders, which again depends on a version of the Heritrix project that has a bug when dealing with uncompressed ARC files. The Heritrix code simply cannot handle such uncompressed ARC files. When this bug was uncovered (with help from Will and Andy), it was easy to fix the problem by using a newer, but unreleased, version of Heritrix (3.1.2-SNAPSHOT) and rebuilding the dependencies.

During some fun and interesting Skype conversations with Will, we decided to have three different output formats for Nanite. So, basically Nanite produces two kinds of outputs. It produces a traditional data set that is created by the mappers and aggregated in the reducers. This data set contains lines of MIME type, format version, and, for DROID, PUID values. The other kind of output is the extracted features of the documents. Each mapper task handles one ARC file at a time and creates a single data set with all the extracted features from all the documents contained in the ARC file. This data set is then stored by the mapper in three different containers.

  • A ZIP file per ARC file that contains one file per document with key/value pairs, one per line.
  • A ZIP file per ARC file containing the serialised metadata objects for each document.
  • A sequence file per ARC file with the serialised metadata objects for each document.

When we know more about who will use this data and to what purpose, those three formats could be reduced to one or substituted by something entirely different, maybe even improving the performance of the process. The above three output formats are only available in Will’s fork of Nanite at https://github.com/willp-bl/nanite.

As a last experiment in this prologue I ran Nanite on 221GB of ARC files to test the performance. This test showed a processing speed of 4.5GB/minute. The test created circa 1GB of extracted metadata. Extrapolating from these values, it seemed that the complete 15TB could be processed in less than three days, producing less than 100GB of new data. The latter would have no problem fitting on the available HDFS space, and three days of processing time would be impressive, to say the least.
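The extrapolation above is plain linear scaling, and can be sketched as follows. This is only an illustration: decimal units (1TB = 1000GB) are an assumption, and the function name is made up for the example.

```python
# Figures quoted in the post. Treating 15TB as 15,000GB is an assumption.
TEST_INPUT_GB = 221
TEST_OUTPUT_GB = 1.0
THROUGHPUT_GB_PER_MIN = 4.5
FULL_INPUT_GB = 15_000


def extrapolate(full_input_gb, throughput_gb_per_min, test_input_gb, test_output_gb):
    """Linearly scale the small test run up to the full data set."""
    days = full_input_gb / throughput_gb_per_min / 60 / 24
    output_gb = full_input_gb * test_output_gb / test_input_gb
    return days, output_gb


days, output_gb = extrapolate(
    FULL_INPUT_GB, THROUGHPUT_GB_PER_MIN, TEST_INPUT_GB, TEST_OUTPUT_GB
)
print(f"estimated run time: {days:.1f} days, estimated metadata: {output_gb:.0f} GB")
```

Linear scaling ignores start-up costs and failures, so the result is best read as a lower bound on the real run time.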

Running the Experiment

The only thing left was to execute

$ hadoop jar jars/nanite-1.1.5-74-SNAPSHOT-job-will.jar ~/working/pmd/netarkiv-147776 out-147776

and wait three days.

I just forgot that this was big data. I forgot that with sheer data size comes complexity. That with 147,776 ARC files, something inevitably had to go wrong.

Before initiating this first run I had decided that I wouldn’t change code or fiddle with the cluster configuration more than absolutely necessary. My primary focus was to get the job to run, jumping as many fences as I had to in order to get the job to terminate without failure.

So when, after a few seconds, I got a heap space exception, I didn’t change the heap space configuration. Instead I split the input file into two equally sized chunks with the intention of running two separate Hadoop jobs.

When I subsequently got an “Exceeded max jobconf size” error, I didn’t experiment with the mapred.user.jobconf.limit Hadoop parameter but split the original input file into chunks with only 30,000 ARC references each.
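Splitting a list of file references into fixed-size chunks, as described above, can be sketched like this. The paths are hypothetical stand-ins for the real list of ARC references, and the helper name is invented for the example; the post does not say which tool was actually used for the split.

```python
from itertools import islice


def chunked(paths, size):
    """Yield consecutive lists of at most `size` items from `paths`."""
    it = iter(paths)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Hypothetical stand-in for the real list of 147,776 ARC file paths.
paths = [f"/netarkiv/{i}.arc" for i in range(147_776)]
chunks = list(chunked(paths, 30_000))
print([len(c) for c in chunks])
```

Each chunk can then be written to its own input file and submitted as a separate Hadoop job.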

After that I started getting “ZIP file must have at least one entry” exceptions. First I removed 4832 ARC files that didn’t contain documents but instead contained metadata regarding the original web harvest. I didn’t know these files were present until I got the above error. That didn’t fix the problem. Next I discovered that I unexpectedly also had WARC files in my data set, and the code presumably couldn’t handle those. The set of 147,776 files was now reduced to 79,831, split into three chunks. As a last guard against the job failing I did something bad: I surrounded the failing code with a try-catch instead of understanding the problem. In effect I just postponed the real bug hunt for another opportunity.

Late Saturday evening the first job was about to complete, but as I didn’t want to stay up too late just to start a Hadoop job, I dared to try running two Hadoop jobs at the same time. Wrong decision! The second job started spitting out a lot of “Error in configuring object” messages. So I killed it and went to bed. The next morning the first job had, luckily, completed with success and I had the first set of results. Of course I then started the second job yet again on an idle Hadoop cluster but, very worryingly, got the same error as on the previous evening! It being Sunday, I closed the ssh connection and tried to forget all about this for the rest of the weekend.

Monday morning at work this second job started without hiccups and I still don’t know what went wrong!
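The kind of input filtering described above could be done up front on the list of paths. The sketch below is only illustrative: recognising harvest-metadata files by the word "metadata" in the file name is an assumption about the naming convention, not something stated in this post, and the function name is invented.

```python
def select_arc_files(paths):
    """Keep plain ARC files; drop WARC files and harvest-metadata ARC files."""
    selected = []
    for path in paths:
        name = path.rsplit("/", 1)[-1].lower()
        if not name.endswith(".arc"):
            continue  # skips .warc files and anything else
        if "metadata" in name:  # assumed naming convention (hypothetical)
            continue
        selected.append(path)
    return selected


# Made-up example paths.
example = ["/a/1-2.arc", "/a/3.warc", "/a/4-metadata-1.arc", "/a/5.ARC"]
print(select_arc_files(example))
```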

The second job ran for ten hours, completing 29,999 ARC files out of 30,000 and then it failed and cleaned everything out. Upon examination of the log files and the input data I discovered an ARC file of size zero! That is a problem, not only for my experiment, but certainly also for the web archive and this discovery has been flagged in the organisation.

I removed the reference to this zero size ARC file, ran the job again with success. In the third job I also discovered two more zero size ARC files whose references were removed and the job completed with success.
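A pre-flight check for empty input files would have caught these cases before ten hours of processing. A minimal sketch, assuming the files are reachable as ordinary paths (as they are through the NFS mount); the function name is invented for the example.

```python
import os


def split_by_size(paths):
    """Return (non_empty, empty) lists; unreadable or missing files count as empty."""
    non_empty, empty = [], []
    for path in paths:
        try:
            size = os.path.getsize(path)
        except OSError:
            size = 0
        (non_empty if size > 0 else empty).append(path)
    return non_empty, empty
```

Only the first list would be handed to the Hadoop job; the second is worth reporting back to whoever maintains the archive.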

After 32 hours of processing time, all 79,829 ARC files were processed and a lot of interesting metadata had been created for further analysis, in addition to the new experience and knowledge gained.

Analysis

As mentioned above the job produces two kinds of metadata, but it is actually three kinds: MIME type identifications of the documents stored in tab separated text files, extracted metadata for each document stored in ZIP files, and data about the job run itself. I will analyse the last kind first.

To count the number of processed documents I ran this Bash script

for f in $(cat all-files.txt)
do
  rn=$(basename $f .zip)
  s=$(ls -l $f|awk '{print $5}')
  c=$(unzip -l $f|tail -1|awk '{print $2}')
  echo "$f, $c, $s"
done

that creates a list of ARC file name and document count pairs. This data gives the total amount of processed documents, which were 260,603,467, and the following distribution of number of documents per ARC file

Number of documents per ARC

In the tail not shown we have 350 ARC files with more than 15,000 documents. The ARC file with the most documents contains 41,140. Again, an anomaly that would be interesting to dive into, as are the many other questions this distribution raises.

The execution times for the three sub-jobs were
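The per-file document count used for this distribution can also be obtained without shelling out to unzip. The following is a sketch of the same counting step, assuming each ZIP holds one entry per document as described earlier; it is not the script that was actually run.

```python
import zipfile


def count_entries(zip_paths):
    """Return (path, number_of_entries) pairs, one per ZIP file."""
    counts = []
    for path in zip_paths:
        with zipfile.ZipFile(path) as zf:
            counts.append((path, len(zf.namelist())))
    return counts
```

Summing the second element of each pair gives the total number of processed documents.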

job id   amount of data read   number of ARC files read   processing time
0059     2.764TB               30,000                     12hrs, 43mins, 53sec
0069     2.763TB               30,000                     10hrs, 46mins, 21sec
0073     1.820TB               19,829                     7hrs, 44mins, 38sec
Total    7.347TB               79,829                     31.24hrs

Basically the cluster processed 3.92GB/minute, or 44 ARC files per minute, or 63,000 ARC files a day, or 2317 documents per second, whichever sounds most impressive.
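For readers who want to check the arithmetic, the rates can be recomputed from the per-job figures above. The sketch assumes decimal units (1TB = 1000GB) and uses the summed job durations as the time base. Recomputed this way, the per-file rate lands nearer 43 than 44 ARC files per minute; the difference is presumably down to rounding or a slightly different time base.

```python
JOBS = [
    # (job id, TB read, ARC files read, (hours, minutes, seconds))
    ("0059", 2.764, 30_000, (12, 43, 53)),
    ("0069", 2.763, 30_000, (10, 46, 21)),
    ("0073", 1.820, 19_829, (7, 44, 38)),
]
TOTAL_DOCUMENTS = 260_603_467


def throughput(jobs, total_documents):
    """Aggregate the per-job figures into overall processing rates."""
    seconds = sum(h * 3600 + m * 60 + s for _, _, _, (h, m, s) in jobs)
    gigabytes = sum(tb for _, tb, _, _ in jobs) * 1000  # decimal TB to GB (assumption)
    arc_files = sum(n for _, _, n, _ in jobs)
    minutes = seconds / 60
    return {
        "hours": seconds / 3600,
        "gb_per_minute": gigabytes / minutes,
        "arc_files_per_minute": arc_files / minutes,
        "documents_per_second": total_documents / seconds,
    }


rates = throughput(JOBS, TOTAL_DOCUMENTS)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```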

To dig a bit deeper into this data set, I collected the run time of each of the 79,839 map tasks. This data can be explored in an HTML table on the job level in the Cloudera Manager, but I wrote a small Java job that converts this data into a CSV file: https://github.com/perdalum/extract-hadoop-map-data. The data was then read into R for analysis. For the interested reader, I created a Gist with the unordered R code that generated the analysis and figures at https://gist.github.com/perdalum/a9041ff3f245986a62f3

The processing time spans from -43s [sic] to 1462s. A very weird observation is that 6,490 tasks, i.e. 8%, were reported as having completed in negative time. As I don’t think the Hadoop cluster utilises temporal shifts, though that would explain the fast processing, this should be investigated further.

Only 83 map tasks took more than 500s. Ignoring that long thin tail, the distribution of the processing times looks as below

This looks like an expected distribution. The median of this data set is 57s but that includes 8% negative values so it’s unclear what valuable information that gives.
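One way to see how much the negative values distort the picture is to report the median both with and without them. The analysis in this post was done in R; the sketch below shows the same idea in Python, and the durations are made-up values, not the measured data.

```python
from statistics import median


def summarise(durations):
    """Summarise task durations (seconds), separating out negative values."""
    negative = [d for d in durations if d < 0]
    valid = [d for d in durations if d >= 0]
    return {
        "tasks": len(durations),
        "negative": len(negative),
        "negative_share": len(negative) / len(durations),
        "median_all": median(durations),
        "median_valid": median(valid),
    }


# Made-up sample values in seconds, for illustration only.
sample = [-43, -5, 12, 40, 57, 61, 75, 120, 510, 1462]
summary = summarise(sample)
print(summary)
```

If the two medians differ noticeably, the negative durations are worth resolving before drawing conclusions from the distribution.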

The MIME type identifications are aggregated in 10 files, one per reducer. These files are easily aggregated further into one big file. Eyeball examination of this data file reveals that it contains lots of error messages. The cause of these errors should be investigated, and Will has been doing that, but I’ll jump the fence once more, this time using grep

grep -v ^Exception all-parts | grep -v ^IOException > all-parts-cleaned

Still, that was not enough because after trying to read the file into R, I discovered that I needed to remove all single and double quotes

sed 's/"//g' < all-parts-cleaned | sed "s/\'//g" > all-parts-really-cleaned

With that data cleaning completed, the following command finally reads the data into R

r<-read.table("data/all-parts-really-really-cleaned",sep="\t",header=FALSE, comment.char="",col.names=c("?","http","droid","tika","tikap","year","count"))

Observe the comment.char="" argument. This is necessary as R assumes everything after a # is a comment, and in three instances in the data an HTTP server reported a MIME type value as a series of #s. Why? The same server? This just gives rise to even more interesting questions that can be asked about the uncovered anomalies in this data.

Apart from the above-mentioned error messages, the data also contain another kind of error. The Tika parser has timed out on a lot of the documents, 177 million of them, to be exact. That is a 67% failure rate for the Tika parser and I would consider that a serious error. Fortunately this error has already been dealt with by Will and pushed to the main Nanite project.
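The same cleaning and loading steps (drop the exception lines, strip quotes, read tab-separated fields without treating # as a comment) can be sketched in a single pass. This is an illustration in Python rather than the grep, sed and R pipeline actually used; the column names are taken from the read.table call above and the function name is invented.

```python
import csv

# Column names as given in the read.table call above.
COLUMNS = ["?", "http", "droid", "tika", "tikap", "year", "count"]


def read_mime_rows(lines):
    """Parse tab-separated result lines, skipping logged exception lines."""
    kept = (
        line.replace('"', "").replace("'", "")  # same effect as the two sed calls
        for line in lines
        if not line.startswith(("Exception", "IOException"))  # same as grep -v
    )
    # QUOTE_NONE: no quote handling; '#' has no special meaning to the csv module.
    reader = csv.reader(kept, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(COLUMNS, row)) for row in reader if row]
```

Any iterable of lines works as input, for example an open file object.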

Had I chosen to include the harvest year into this data during the job, the detected MIME types could be correlated with harvest year. Instead I can compare the distribution of MIME types between the values reported by the server that served the documents, the DROID detection, and the Tika detection.

The web servers reported 1370 different MIME types; Tika detected 342; and DROID 319.

A plot of the complete set of these MIME types is presented for completeness

If we select the top 20 MIME types we get a clearer picture

Top 10 with MIME type names

It seems that the web servers generally claim that they deliver a few documents with a lot of different MIME types and a lot of documents with a few MIME types, primarily HTML pages, when in fact we receive quite a diversified set of document types as detected by Tika and DROID. Also, this data set can actually give a confidence factor for the trustworthiness of the Danish web servers, even a time series plot of the evolution of such a confidence factor. Or…, Or… The questions just keep coming…

Manual examination of the MIME type data set also reveals fun stuff. E.g. how did a Microsoft Windows Shortcut end up on the Internet? Or what about doing a complete virus scan of all the documents? This could give data for research into the evolution of computer viruses.
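One possible definition of such a confidence factor is the share of documents for which the server-reported MIME type equals the detected one. The sketch below assumes rows with the http, droid, tika and count fields of the aggregated data set; the definition itself and the counts are illustrative, not results from this experiment.

```python
def agreement(rows, detector):
    """Share of documents whose server-reported MIME type matches `detector`."""
    total = sum(r["count"] for r in rows)
    matching = sum(r["count"] for r in rows if r["http"] == r[detector])
    return matching / total if total else 0.0


# Made-up illustrative counts, not measured data.
rows = [
    {"http": "text/html", "droid": "text/html", "tika": "text/html", "count": 70},
    {"http": "text/html", "droid": "application/pdf", "tika": "application/pdf", "count": 20},
    {"http": "image/jpeg", "droid": "image/jpeg", "tika": "image/png", "count": 10},
]
print(agreement(rows, "droid"), agreement(rows, "tika"))
```

Computed per harvest year, the same function would give the time series mentioned above.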

There’s no end to the interesting questions when browsing this data. And that’s just looking at one feature, namely the MIME type.

The third kind of data created in the Hadoop job gives a basis for much more detailed observations. We’ve got EXIF data, including GPS if available, HTML title and keyword elements, authors, last modification time, etc. I selected at random a data file from a typical ARC file, i.e. one that contains circa the average of 3500 documents. This data file has 430 unique properties!

These extracted properties are collected in 79,829 ZIP files, and before any analysis would be feasible, these files should be combined into one big sequence file and a few Pig Latin UDFs should be written. This would facilitate a kind of explorative analysis. How close to a real-time read-eval-print loop such an investigation could be remains to be seen, as this is, unfortunately, a task for another day.

Lessons learned

First off, thank you if you’ve read this far. I did put quite a bit of detail into this blog post—without any TL;DR warning.

The process of going from the idea of this experiment to finishing this blog post has been long but very rewarding.

I’ve uncovered bugs in and added features to Nanite in fun collaboration with Will Palmer and this is far from over. I will continue working with Will on Nanite as I see great value in this tool. As a start I would like to address all the issues uncovered as described in this blog post.

I’ve learned that CPU time actually is much cheaper than developer time. If the cluster should run for 10 hours and fail in the last few minutes, who cares? It’s better than me spending hours trying to dig up a small bug that might be hidden somewhere down in the deepest Java dependencies. Up till a certain break-even threshold, that is!

I’ve learnt a valuable lesson: Know thy data! When performing jobs that potentially could run for weeks, it is very important that you know the data you’re trying to crunch. Still, it’s equally important that your tools are very robust. They should be able to survive anything that might be thrown at them because, when running on data volumes like this, every kind of valid and invalid data will be encountered. I.e. if one file in a million is a serious problem, you will encounter 17,000 serious problems processing the web archive. If the job runs for 3 months, that’s almost 10 serious problems an hour.

All this being said and done, we must not forget why we do this. It’s not for the sake of creating fast tools, nor reliable tools. It is to enable the curators to preserve our shared data as easily and reliably as possible. Also, especially relevant for web documents, it’s for the benefit of researchers in the humanities: to enable them to get answers to all the different questions they could possibly imagine asking of such a huge corpus of documents from the last decade.

Even though this experiment answered some questions, I now stand with even more questions to be answered. Oh, and I need to run Nanite on 18 billion documents that are just waiting for me on an NFS mount point…
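The back-of-the-envelope estimate above can be written out as follows. The collection size of 17 billion documents is the figure implied by the 17,000 problems quoted, and 90 days is used for three months; both are assumptions made for the sketch, as is the function name.

```python
def expected_failures(documents, one_in, run_days):
    """Expected number of problem files, and how many turn up per hour of run time."""
    failures = documents / one_in
    per_hour = failures / (run_days * 24)
    return failures, per_hour


# Assumed inputs: 17 billion documents, one problem per million, a 90-day run.
failures, per_hour = expected_failures(17_000_000_000, 1_000_000, 90)
print(f"{failures:.0f} problem files, about {per_hour:.1f} per hour")
```

Either way the order of magnitude is the point: several failures every hour, for months.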

Preservation Topics: Preservation Actions, Identification, Characterisation, Web Archiving, Tools, SCAPE
Categories: Planet DigiPres

New SCAPE project business case tools

21 May 2014 - 8:06pm

Over the last month I've been working with the Open Planets Foundation, as part of the SCAPE Project, to develop some guidance materials to help practitioners understand and leverage the business context to some exciting new SCAPE technologies. The SCAPE Business Case Templates provide detailed guidance on building a business case focused on the application of SCAPE tools. It's all about understanding, applying and selling relevant benefits, costs and risks.

Each template introduces one of the three SCAPE technologies before exploring the business context and the case for putting them into action. The templates cover:

  • One of the focused preservation tools developed by SCAPE: Jpylyzer
  • The scalable architecture and toolset developed by SCAPE to address the preservation processing of large datasets: The SCAPE Platform
  • The assessment and decision making tools and approaches developed by SCAPE: The Planning and Watch Suite

A core part of each of the templates is a set of concise business benefits of relevance to each technology. A business case is of course all about that cost/benefit ratio so the detailed benefits are accompanied by notes on costs and risks. This information provides many of the raw materials needed in a business case, but they need to be adapted and applied carefully so that they align with organisational objectives, use appropriate language for the target audience and sell aspects of the work specific to the case in question. Each template therefore includes a business case example illustrating how benefits can be applied to build a strong business case in a specific context.

The templates can be viewed as a standalone deliverable from SCAPE, but they've also been carefully integrated with an existing resource: the Digital Preservation Business Case Toolkit (DPBCT). This provides users of the SCAPE work with lots of useful background guidance and examples. It also ensures that these SCAPE results are able to live on, post project, as part of a toolkit under the stewardship of the Digital Preservation Coalition. As I said, the Templates can be viewed as standalone guides, but elements of the text are also embedded (using Mediawiki transclusion) in relevant sections of the DPBCT. This makes for a mutually beneficial collaboration with enhancement for DPBCT while still delivering detailed business case examples focused on SCAPE tech. I'm very keen on project work that builds on, extends or otherwise enhances existing resources rather than reinventing the wheel and then dying out at project end, and I think we've managed to pull that off here.

Paul Wheatley, Paul Wheatley Consulting Limited

Preservation Topics: SCAPE
Categories: Planet DigiPres

Catalogue of Policy Elements

20 May 2014 - 8:55pm

Writing preservation policies might be a daunting task, but support is underway!

The SCAPE project has produced a Catalogue of Preservation Policy Elements, which is now available as a wiki. Here you will find an explanation of the SCAPE Policy Framework, which consists of three levels of policies: from high-level Guidance Policies, through Preservation Procedure Policies, to a very detailed level of policies useful for automatic workflows in preservation.

The Preservation Procedure Policies, the intermediate level, are designed to assist you in creating or updating your preservation policies. This level describes the approach an organisation intends to take in order to achieve its high-level goals. It is this level that you can find in the Catalogue of Preservation Policy Elements. The catalogue offers a unique overview of policy elements that could be part of your preservation policy.

How does this work?

Each Guidance Policy has a set of related Preservation Procedure Policies. The details of these are described, using a template with a variety of information, for example a definition of the policy, the life cycle phase in which this policy will be relevant, a suggestion of who should be involved in creating the policy etc. For more inspiration you can also have a look at our collection of published preservation policies.

Although the SCAPE project is mainly focused on libraries, data centers and web archives, we believe that this Catalogue is also relevant for other disciplines with a preservation task.

We are convinced that the Catalogue of Preservation Policy Elements will need to be updated based on new insights in digital preservation, even after the SCAPE project finishes in September. So we would like to invite you to send us your feedback, either by adding it in the Catalogue (each page offers an opportunity to add comments), or by sending it to Barbara.Sierman@KB.NL.

The final version of the Catalogue of Preservation Policy Elements was created by: Barbara Sierman (National Library of the Netherlands), Catherine Jones (Science and Technology Facilities Council, UK) and Gry Elstrøm (State and University Library, Aarhus, Denmark)

Interested? Join the SCAPE Webinar on Preservation Policies on May 28, 14.00 hrs CET.

Preservation Topics: SCAPE
Categories: Planet DigiPres

Using Kanban at the SCAPE Developer Workshop

12 May 2014 - 8:17am

The SCAPE project is into its final 6 months and with that came our final developer workshop. The main focus of this event was demonstrations, productisation and sustainability, however with everyone together it provided an opportune time to make progress with other SCAPE related activities. With nearly 30 people there, there was a lot going on and so the agenda needed to be flexible to enable productive working, but managed to ensure the workshop’s overall goals were achieved. This post discusses our use and experience of kanban as a means to manage such workshops.

Kanban is a visual way of managing tasks through a workflow. It is lightweight, applicable to whatever process you have today, and doesn’t require a lot of overhead. It is based on three principles (some say 6): visualise, limit Work in Process (WIP) and manage flow, which serve to ensure tasks and processes are brought into the open for discussion, that lead times are reduced and that the workflow is understood and continually improved through monitoring. For the purposes of this SCAPE workshop the main priority was visibility of the work.

Visualising Tasks

Typically a workflow is represented on a kanban board along with the work items that move through this workflow. Placing such a board on the wall enables everyone to see what’s being worked on, who’s working on it, and what stage in the workflow it’s at. There is a minimal overhead (creating work item cards and moving them through the workflow stages), but this is manageable with short, regular update discussions.

Task Boards

Todo and Doing tasks

Using large sheets of paper stuck to the meeting room walls, we created a primary board with 3 main columns: To Do, Doing, Done. After introductory presentations recapping the purpose of the workshop and the main goals, participants were urged to consider tasks they needed to do and to write them on sticky notes, whilst further presentations were given about specific activities that needed doing (e.g., generating microsite documentation for each tool). This worked really well as it got participants thinking about and writing down what they needed to do whilst discussions were happening.

It is important to note that a fair amount of pre-work was performed in the lead-up to the workshop. For example, gaps in tool documentation and tool installation issues were both identified prior to the workshop. These were then introduced in the initial presentations (and discussed further during the remainder of the workshop) and provided a valuable initial set of tasks that quickly got the workshop moving. The motivation for this pre-work was driven primarily by the desired overall goals of the workshop.

Tasks

One sticky note equated to one task. We made no distinction between different coloured notes, although the colours could have been used to indicate categories of work, e.g., bugs, development, documentation, etc. This does add another level of complexity into visualising the work and given the workshop was only 3 days long, not giving meaning to the colours kept things simple.

Similarly, it is useful to know who is working on a task. We pre-printed everyone’s names onto white stickers which could be affixed to the sticky notes. Being white on coloured notes emphasised the names, making it easier to see which tasks were ownerless. Avatars were considered, however given the short duration of the workshop and the fact that few people pre-selected one, we didn't use them (they were film character based and so didn't relate to any specific individual at the workshop, meaning it probably would have been harder to work out who the task "owner" was anyway).

"Why does this tool have its own board?"

Beyond the main board we also had several smaller boards for individual software tools being worked on. These had the same 3 columns as before, but also included “Waiting for Feedback” (queue) and “Feedback” columns between ‘Doing’ and ‘Done’. These two additional columns were motivated by the work on documentation, where checks were thought necessary to ensure documentation completeness. I didn’t see these columns used much, and they could equally have been replaced by creating “Check documentation for tool X” sticky notes once the “Do documentation” task had been completed.

Specific Tool Kanban boards

We never had any logic or consistency in which tools had their own board and which didn’t, which resulted in some confusion; a few times I overheard questions such as “why does this tool have its own board?” to which there was no obvious answer. In many ways these smaller boards acted as swim lanes for those tools, the lanes just happened to be separated out into their own boards. Separate boards highlight the work on a specific tool/topic, but perhaps unnecessarily isolate that work from the rest. If a clear distinction between work item “topics” is needed, different coloured sticky notes could always be used instead (this wasn’t the case for us though), but caution should be used to avoid making it too complex (e.g. through use of too many colours).

Work in Process - What to look out for

The emphasis for taking a kanban approach was to bring the tasks being worked on out into the open. It can often be hard to manage workshops with 30 people all working on different things and ensure that the workshop’s goals are met. However, having tasks visible on the board with people’s names attached to them means no-one can hide.

No emphasis was placed on limiting the amount worked on by one person, although people naturally tended towards only working on one or two things at a time. Progress was simply monitored through ad-hoc group discussions centred on the boards and going around the table asking each participant how they were getting on. This sometimes resulted in the need to move a sticky note from one column to another, but forgetting to do this can be excused by unfamiliarity with the kanban approach. A number of times, people were working on things that didn’t have a sticky note at all; asking them to create one then and there was the easiest approach to ensuring it happened, and also encouraged a few others to slyly put up other missing notes!

Another thing to look out for is ill-written tasks: task descriptions that are either too vague to be understood (by anyone other than the task owner) or so high-level that they encompass many individual tasks. A big factor in using kanban is to bring the work out into the open so everyone can become familiar with what’s going on. If the actual things being worked on are hidden behind vague descriptions then nothing is communicated and you may as well not have had the note at all.

Lessons Learned

Kanban was a very effective approach to managing the wide variety of tasks going on at the workshop, and is recommended for such meetings. The following bullets summarise the discussions above, reflecting our experiences with the technique, and are directed at the use of kanban in short workshops (i.e. the recommendations may be different if applied to a long-running project, for instance).

Preparation:

  • Understand what the key goals are that need to be achieved at the workshop and prepare accordingly; from this it is likely that an initial set of tasks can be created (or at least hinted at) to help get the workshop up-to-speed quickly.

At the Workshop:

  • Briefly explain the kanban process and give everyone sticky notes so they can write tasks as soon as they think of them; do this up front before main presentations.
  • Keep any introductory presentations short and directed towards encouraging task identification around the main workshop goals; aim to swiftly move on to the "doing".

Visualise, but keep things simple:

  • Keep everything on the same board; 3 columns (To Do, Doing, Done) is often enough.
  • Don’t give meaning to sticky note colours unless there’s a good reason to; if there is a good reason, create a key to ensure everyone understands each colour’s meaning.
  • Use stickers with people’s name on. These are more obvious than handwritten names, and easier to recognise who they refer to than avatars (given the short duration of the workshop). If using avatars, consider just using people’s photo (perhaps also with their name).

Tasks:

  • Ask for note rewrites or breakdowns into multiple notes where task descriptions are vague or too high-level.
  • Encourage ownership of tasks by getting the person whose task it is to write the note.

Manage the boards:

  • Hold ad-hoc group discussions centred around the boards.
  • Ask for status updates by person rather than by task, so that missing task notes can be identified (this favours having all the tasks identified over an exact status for each).
  • Understand that people (especially those new to the process) forget to create/update tasks; encourage note creation as soon as the gap is recognised; if necessary, move tasks appropriately during the ad-hoc recaps.
  • Don’t move tasks to “Done” until they truly are done; “I’ve just got to push the code back to the repository” means it’s not finished.

Job Done!

Completed tasks

It was great to see everyone participate in the approach – I was expecting reluctance from people to get involved, but was surprised by the level of enthusiasm; I even noted someone exclaim at completing a task before jumping up to triumphantly move the sticky note to the done pile! With a variety of tasks, progress across the board can often be seen quickly (for us, things were complete by the end of the first day), and at the end of the workshop hopefully you end up with a “Done” column full of sticky notes and (if anything’s left) a “To Do” column with follow-up actions.

Preservation Topics: SCAPE
Categories: Planet DigiPres

The final SCAPE developers workshop in The Hague

9 May 2014 - 11:45am

The third and final SCAPE developers workshop was held at the Royal Dutch Library in The Hague on 23-25 April. This workshop was the final opportunity to work together face to face in a large group, since we are getting closer to the end of the project in September.

The workshop objectives were to identify and develop demonstrators for our workshop at the DL2014 Conference in September (more info will follow) and for the final EC review, to continue working towards sustaining our work beyond the project, and to bring everyone together to enable productive work on project deliverables.

In order to get as much work done as possible, we decided to work roughly according to Kanban methods. There were only a few presentations to start the process; everything was on the TODO/DOING/DONE boards, and we had regular stand-up moments to update progress on the boards.

 

And this is what KANBAN – the SCAPE way looked like:

At the beginning of the workshop: TO DO and DOING…
…and the DONE board at the final stand-up!

 

There was a focus on the productization of the SCAPE tools. In preparation for the workshop, the tools were tested for user-friendliness. The results were used as input for work on the maturity of the tools.

A lot of work was also done on the demonstrators. What will be presented and how? How can we integrate our main SCAPE messages? It was good to be able to discuss this together and to come up with an outline for further preparation.

Working from input gathered at the All Staff Meeting in February, a lot of the last documentation gaps were filled. This means we can now publish tool microsites, for which a template was developed in preparation for this workshop.

All in all, we can look back on three very productive days. The Kanban boards turned out to work as expected, and as always it was good to meet many other SCAPErs in a good atmosphere, not only during the working days but also afterwards. I'm confident we can continue this in the coming months!

Preservation Topics: Tools
Categories: Planet DigiPres