Open Planets Foundation Blogs

The Open Planets Foundation has been established to provide practical solutions and expertise in digital preservation, building on the €15 million investment made by the European Union and Planets consortium.

A Weekend With Nanite

28 May 2014 - 9:30pm
Well over a year ago I wrote the "A Year of FITS" blog post (http://www.openplanetsfoundation.org/blogs/2013-01-09-year-fits) describing how we, over the course of 15 months, characterised 400 million harvested web documents using the File Information Tool Kit (FITS) from Harvard University. I presented the technique and the technical metadata and basically concluded that FITS didn’t fit that kind of heterogeneous data in such large amounts. In the time that has passed since that experiment, FITS has been improved in several areas, including the code base and the organisation of the development, and it would be interesting to see how far it has evolved for big data. Still, FITS is not what I will be writing about today. Today I’ll present how we characterised more than 250 million web documents, not in 9 months, but during a weekend.

Setting the Scene: The Tools and the Data

The Hardware

When we at the Danish State and University Library (SB) started our work on the SCAPE project, we acquired four machines to support this work. It was on those four machines that the FITS experiment was performed, and it is on the same four machines that the present experiment was performed. The big difference is the software that handles the processes and how the data is accessed.

The four machines are all Blade servers, each with two six-core Intel Xeon 2.93GHz CPUs, 96GB RAM, and 2Gbit network interfaces (details at http://wiki.opf-labs.org/display/SP/SB+Test+Platform). Unlike a traditional Hadoop cluster node, these machines have no local data storage. At SB we rely heavily on NAS for data storage; more specifically, our storage is based on the Isilon scale-out NAS solution from EMC. The Isilon system is a cluster designed for storage, and at SB it currently stores several PB of digital preservation data.

As NAS is a prerequisite for data storage at SB, we have been doing a lot of experiments on how best to integrate Hadoop with that kind of infrastructure. We have tested two Hadoop distributions, Cloudera and Pivotal HD, in different hardware and software configurations, and the setup used for this experiment is the best so far, while still far from good enough. For the experiment we used the Cloudera distribution version 4.5.0, which builds upon Hadoop 2.0.0.

The Data

During the last months we have been moving our copy of the Danish Web Archive onto the above-mentioned Isilon cluster and providing online read-only access through NFS. This makes it possible for relevant jobs to access more than half a PB of web documents, roughly 18 billion documents harvested during the last decade. That is a lot of data!

These documents are stored as ARC or WARC container files in a shallow directory tree, with the ARC and WARC files in the lowest-level directories. A few months ago we shifted from the older ARC format to the newer, superior WARC format. This will become relevant later in this post.

To select the data for this large-scale experiment, a few simple UNIX find commands were issued, giving a file with the paths of 147,776 ARC files, amounting to almost 15TB and roughly 450 million documents.
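The post does not show the actual commands, but a minimal sketch, assuming a hypothetical mount point /netarchive for the NFS-mounted archive and a hypothetical subtree holding the selected harvests, could look like this (the output file name matches the one used in the job invocation further down; the paths are assumptions):

$ find /netarchive/selected-harvests -type f > ~/working/pmd/netarkiv-147776
$ wc -l < ~/working/pmd/netarkiv-147776    # should report 147776 file paths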

The Software

For this large-scale characterisation experiment I substituted FITS with the Nanite project. This project is led by Andy Jackson from the UK Web Archive and enables DROID, Apache Tika, and libmagic to be used effectively on the Hadoop platform. For this experiment I did not use the libmagic component. Will Palmer of the British Library (BL) has written a blog post on this project (http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite) and on integrating the Apache Tika parser into Nanite. Will has also been very helpful with uncovering and rectifying the problems encountered during the present experiment.

Before unleashing Nanite on the big data set, I did a lot of preliminary tests and tweaks in the source code of Nanite, more specifically the nanite-hadoop module, which compiles to a JAR containing a self-contained Hadoop job. (In this blog post, Nanite is a synonym for nanite-hadoop.)

One tweak was to write code for extracting the ARC metadata for each document and storing it alongside the properties extracted by Tika.

Another tweak was to substitute the newlines in the extracted HTML title elements with spaces, as those newlines break the data structure of one key/value pair per line in one of the output data files.

Both of these changes are now in the main Nanite code base.

After disabling a small hack that enables the code to run on the BL Hadoop cluster, and disabling a preliminary check that all input files are compressed (ours are not), the code should have been ready for some real test runs.

Not quite so! The Nanite project depends upon another UK Web Archive project called warc-hadoop-recordreaders, which in turn depends on a version of the Heritrix project that has a bug when dealing with uncompressed ARC files: the Heritrix code simply cannot handle them. When this bug was uncovered (with help from Will and Andy), the problem was easily fixed by using a newer, but unreleased, version of Heritrix (3.1.2-SNAPSHOT) and rebuilding the dependencies.

During some fun and interesting Skype conversations with Will, we decided to have three different output formats for Nanite. Basically, Nanite produces two kinds of output. The first is a traditional data set that is created by the mappers and aggregated in the reducers; it contains lines of MIME type, format version, and, for DROID, PUID values. The other kind of output is the extracted features of the documents. Each mapper task handles one ARC file at a time and creates a single data set with all the extracted features from all the documents contained in that ARC file. This data set is then stored by the mapper in three different containers:

  • A ZIP file per ARC file that contains one file per document with key/value pairs, one per line.
  • A ZIP file per ARC file containing the serialised metadata objects for each document.
  • A sequence file per ARC file with the serialised metadata objects for each document.

When we know more about who will use this data and for what purpose, these three formats could be reduced to one or substituted with something entirely different, maybe even improving the performance of the process. The above three output formats are only available in Will’s fork of Nanite at https://github.com/willp-bl/nanite.

As a last experiment in this prologue I ran Nanite on 221GB of ARC files to test the performance. This test showed a processing speed of 4.5GB/minute and created circa 1GB of extracted metadata. Extrapolating these values, it seemed that the complete 15TB could be processed in less than three days, yielding less than 100GB of new data. The latter would easily fit on the available HDFS space, and three days of processing time would be impressive, to say the least.
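The extrapolation itself is just back-of-the-envelope arithmetic; a sketch with bc, using the rounded figures quoted above:

$ echo "15 * 1024 / 4.5 / 60 / 24" | bc -l    # ~2.4 days to process 15TB at 4.5GB/minute
$ echo "15 * 1024 / 221" | bc -l              # ~70GB of metadata at ~1GB per 221GB of input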

Running the Experiment

The only thing left was to execute

$ hadoop jar jars/nanite-1.1.5-74-SNAPSHOT-job-will.jar ~/working/pmd/netarkiv-147776 out-147776

and wait three days.

I just forgot that this was big data. I forgot that with sheer data size comes complexity, and that with 147,776 ARC files, something was inevitably going to go wrong.

Before initiating this first run I had decided that I wouldn’t change code or fiddle with the cluster configuration more than absolutely necessary. My primary focus was to get the job to run, jumping as many fences as needed to get the job to terminate without failure.

So when, after a few seconds, I got a heap space exception, I didn’t change the heap space configuration. Instead I split the input file into two equal-sized chunks with the intention of running two separate Hadoop jobs.

When I subsequently got an "Exceeded max jobconf size" error, I didn’t experiment with the mapred.user.jobconf.limit Hadoop parameter but split the original input file into chunks with only 30,000 ARC references each.
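The chunking itself is a one-liner with standard tools; a sketch, where the chunk file prefix is an assumption:

$ split -l 30000 -d ~/working/pmd/netarkiv-147776 ~/working/pmd/netarkiv-chunk-

This produces files of at most 30,000 ARC references each (netarkiv-chunk-00, netarkiv-chunk-01, …), which can then be submitted as separate Hadoop jobs.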

After that I started getting "ZIP file must have at least one entry" exceptions. First I removed 4,832 ARC files that didn’t contain documents but instead contained metadata about the original web harvest; I didn’t know these files were present until I got the above error. That didn’t fix the problem. Next I discovered that I unexpectedly also had WARC files in my data set, and the code presumably couldn’t handle those. The set of 147,776 files was now reduced to 79,831, split into three chunks. As a last guard against the job failing I did something bad: I surrounded the failing code with a try-catch instead of understanding the problem. In effect, I just postponed the real bug hunt for another opportunity.

Late Saturday evening the first job was about to complete, but as I didn’t want to stay up too late just to start a Hadoop job, I dared to run two Hadoop jobs at the same time. Wrong decision! The second job started spitting out a lot of "Error in configuring object" messages, so I killed it and went to bed. The next morning the first job had, luckily, completed successfully and I had the first set of results. Of course I then started the second job again on an idle Hadoop cluster but, very worryingly, got the same error as the previous evening! It being Sunday, I closed the ssh connection and tried to forget all about it for the rest of the weekend.

Monday morning at work this second job started without hiccups, and I still don’t know what went wrong!

The second job ran for ten hours, completed 29,999 ARC files out of 30,000, and then failed and cleaned everything out. Upon examining the log files and the input data I discovered an ARC file of size zero! That is a problem, not only for my experiment, but certainly also for the web archive, and this discovery has been flagged in the organisation.

I removed the reference to this zero-size ARC file and ran the job again successfully. In the third job I also discovered two more zero-size ARC files; their references were removed and the job completed successfully.

After 32 hours of processing time, all 79,829 ARC files were processed, producing a lot of interesting metadata for further analysis, in addition to new experience and knowledge gained.

Analysis

As mentioned above, the job produces two kinds of metadata, but there are actually three kinds: MIME type identifications of the documents, stored in tab-separated text files; extracted metadata for each document, stored in ZIP files; and data about the job run itself. I will analyse the last kind first.

To count the number of processed documents I ran this Bash script

for f in $(cat all-files.txt)
do
  rn=$(basename $f .zip)
  s=$(ls -l $f | awk '{print $5}')              # size of the ZIP file in bytes
  c=$(unzip -l $f | tail -1 | awk '{print $2}') # number of entries (documents) in the ZIP
  echo "$f, $c, $s"
done

that creates a list of ARC file names with document counts and sizes. This data gives the total number of processed documents, 260,603,467, and the following distribution of the number of documents per ARC file

Number of documents per ARC

In the tail not shown here, we have 350 ARC files with more than 15,000 documents. The ARC file with the most documents contains 41,140 of them. Again, an anomaly that would be interesting to dive into, as would many of the other questions this distribution raises.
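Assuming the output of the script above was redirected to a file (arc-doc-counts.csv is a hypothetical name), the tail and its anomalies are easy to pull out for closer inspection:

$ sort -t',' -k2 -nr arc-doc-counts.csv | head        # the ARC files with the most documents
$ awk -F', ' '$2 > 15000' arc-doc-counts.csv | wc -l  # ARC files with more than 15,000 documents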

The execution times for the three sub-jobs were:

job id    amount of data read    number of ARC files read    processing time
0059      2.764TB                30,000                      12hrs, 43mins, 53sec
0069      2.763TB                30,000                      10hrs, 46mins, 21sec
0073      1.820TB                19,829                      7hrs, 44mins, 38sec
total     7.347TB                79,829                      31.24hrs

Basically, the cluster processed 3.92GB/minute, or 44 ARC files per minute, or 63,000 ARC files a day, or 2,317 documents per second, whichever sounds most impressive.

To dig a bit deeper into this data set, I collected the run time of each of the 79,829 map tasks. This data can be explored in an HTML table at the job level in the Cloudera Manager, but I wrote a small Java program that converts it into a CSV file: https://github.com/perdalum/extract-hadoop-map-data. The data was then read into R for analysis. For the interested reader, I created a Gist with the unordered R code that generated the analysis and figures at https://gist.github.com/perdalum/a9041ff3f245986a62f3

The processing times span from -43s [sic] to 1,462s. A very weird observation is that 6,490 tasks, i.e. 8%, were reported as having completed in negative time. As I don’t think the Hadoop cluster utilises temporal shifts (though that would explain the fast processing), this should be investigated further.

Only 83 map tasks took more than 500s. Ignoring that long, thin tail, the distribution of the processing times looks as below

This looks like an expected distribution. The median of this data set is 57s, but that includes the 8% negative values, so it is unclear how much valuable information it gives.

The MIME type identifications are aggregated in 10 files, one per reducer. These files are easily aggregated further into one big file. Eyeball examination of this data file reveals that it contains lots of error messages. The cause of these errors should be investigated, and Will has been doing that, but I’ll jump the fence once more, this time using grep

grep -v ^Exception all-parts | grep -v ^IOException > all-parts-cleaned

Still, that was not enough because after trying to read the file into R, I discovered that I needed to remove all single and double quotes

sed 's/"//g' < all-parts-cleaned | sed "s/\'//g" > all-parts-really-cleaned

With that data cleaning completed, the following command finally reads the data into R

r<-read.table("data/all-parts-really-really-cleaned",sep="\t",header=FALSE, comment.char="",col.names=c("?","http","droid","tika","tikap","year","count"))

Observe the comment.char="" argument. This is necessary as R assumes everything after a # is a comment, and in three instances in the data an http server reported a MIME type value as a series of #s. Why? The same server? This just gives rise to even more interesting questions that can be asked about the uncovered anomalies in this data.

Apart from the above-mentioned error messages, the data also contains another kind of error. The Tika parser timed out on a lot of the documents, 177 million of them, to be exact. That is a 67% failure rate for the Tika parser, which I would consider a serious error. Fortunately this has already been dealt with by Will and pushed to the main Nanite project.

Had I chosen to include the harvest year in this data during the job, the detected MIME types could have been correlated with harvest year. Instead I can compare the distribution of MIME types between the values reported by the servers that served the documents, the DROID detection, and the Tika detection.

The web servers reported 1370 different MIME types; Tika detected 342; and DROID 319.
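These counts can be reproduced directly from the cleaned file with standard tools. The column numbers below are assumptions based on the col.names given to read.table above (2 = server-reported MIME type, 3 = DROID, 4 = Tika):

$ for col in 2 3 4; do cut -f"$col" all-parts-really-cleaned | sort -u | wc -l; done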

A plot of the complete set of these MIME types is presented for completeness

If we select the top 20 MIME types we get a clearer picture

Top 10 with MIME type names

 

 

It seems that the web servers generally claim to deliver a few documents with a lot of different MIME types and a lot of documents with only a few MIME types, primarily HTML pages, when in fact we receive quite a diversified set of document types, as detected by Tika and DROID. Also, this data set could actually give a confidence factor for the trustworthiness of the Danish web servers, even a time series plot of the evolution of such a confidence factor. Or…, Or… The questions just keep coming…

Manual examination of the MIME type data set also reveals fun stuff. E.g. how did a Microsoft Windows Shortcut end up on the Internet? Or what about doing a complete virus scan of all the documents? That could provide data for research into the evolution of computer viruses.

There’s no end to the interesting questions when browsing this data. And that’s just looking at one feature, namely the MIME type.

The third kind of data created by the Hadoop job gives a basis for much more detailed observations. We’ve got EXIF data, including GPS coordinates if available, HTML title and keyword elements, authors, last modification times, etc. I selected at random a data file from a typical ARC file, i.e. one that contains roughly the average of 3,500 documents. This data file has 430 unique properties!

These extracted properties are collected in 79,829 ZIP files, and before any analysis would be feasible, these files should be combined into one big sequence file and a few Pig Latin UDFs should be written. This would facilitate a kind of explorative analysis. How close to a real-time read-eval-print loop such an investigation could get remains to be seen, as this is, unfortunately, a task for another day.

Lessons learned

First off, thank you if you’ve read this far. I did put quite a bit of detail into this blog post—without any TL;DR warning.

The process of going from the idea of this experiment to finishing this blog post has been long but very rewarding.

I’ve uncovered bugs in and added features to Nanite in fun collaboration with Will Palmer and this is far from over. I will continue working with Will on Nanite as I see great value in this tool. As a start I would like to address all the issues uncovered as described in this blog post.

I’ve learned that CPU time is actually much cheaper than developer time. If the cluster runs for 10 hours and fails in the last few minutes, who cares? It’s better than me spending hours trying to dig up a small bug that might be hidden somewhere deep in the Java dependencies. Up to a certain break-even threshold, that is!

I’ve learnt a valuable lesson: know thy data! When performing jobs that could potentially run for weeks, it is very important that you know the data you’re trying to crunch. Still, it’s equally important that your tools are robust. They should be able to survive anything that might be thrown at them, because when running on data volumes like this, every kind of valid and invalid data will be encountered. I.e. if one file in a million is a serious problem, you will encounter 17,000 serious problems processing the web archive. If the job runs for 3 months, that’s almost 10 serious problems an hour.

All this being said and done, we must not forget why we do this. It’s not for the sake of creating fast tools, nor reliable tools. It is to enable the curators to preserve our shared data as easily and as trustworthily as possible. Also, and especially relevant for web documents, it’s for the benefit of researchers in the humanities: to enable them to get answers to all the different questions they could possibly imagine asking of such a huge corpus of documents from the last decade.

Even though this experiment answered some questions, I now stand with even more questions to be answered. Oh, and I need to run Nanite on the 18 billion documents that are just waiting for me on an NFS mount point…

Preservation Topics: Preservation Actions, Identification, Characterisation, Web Archiving, Tools, SCAPE
Categories: Planet DigiPres

New SCAPE project business case tools

21 May 2014 - 8:06pm

Over the last month I've been working with the Open Planets Foundation, as part of the SCAPE Project, to develop some guidance materials to help practitioners understand and leverage the business context of some exciting new SCAPE technologies. The SCAPE Business Case Templates provide detailed guidance on building a business case focused on the application of SCAPE tools. It's all about understanding, applying and selling the relevant benefits, costs and risks.

Each template introduces one of the three SCAPE technologies before exploring the business context and the case for putting them into action. The templates cover:

  • One of the focused preservation tools developed by SCAPE: Jpylyzer
  • The scalable architecture and toolset developed by SCAPE to address the preservation processing of large datasets: The SCAPE Platform
  • The assessment and decision making tools and approaches developed by SCAPE: The Planning and Watch Suite

A core part of each of the templates is a set of concise business benefits of relevance to each technology. A business case is of course all about the cost/benefit ratio, so the detailed benefits are accompanied by notes on costs and risks. This information provides many of the raw materials needed in a business case, but it needs to be adapted and applied carefully so that it aligns with organisational objectives, uses appropriate language for the target audience and sells aspects of the work specific to the case in question. Each template therefore includes a business case example illustrating how benefits can be applied to build a strong business case in a specific context.

The templates can be viewed as a standalone deliverable from SCAPE, but they've also been carefully integrated with an existing resource: the Digital Preservation Business Case Toolkit (DPBCT). This provides users of the SCAPE work with lots of useful background guidance and examples. It also ensures that these SCAPE results are able to live on, post project, as part of a toolkit under the stewardship of the Digital Preservation Coalition. As I said, the templates can be viewed as standalone guides, but elements of the text are also embedded (using MediaWiki transclusion) in relevant sections of the DPBCT. This makes for a mutually beneficial collaboration, enhancing the DPBCT while still delivering detailed business case examples focused on SCAPE tech. I'm very keen on project work that builds on, extends or otherwise enhances existing resources rather than reinventing the wheel and then dying out at project end, and I think we've managed to pull that off here.

Paul Wheatley, Paul Wheatley Consulting Limited

Preservation Topics: SCAPE
Categories: Planet DigiPres

Catalogue of Policy Elements

20 May 2014 - 8:55pm

Writing preservation policies might be a daunting task, but support is underway!

The SCAPE project has produced a Catalogue of Preservation Policy Elements, which is now available as a wiki. Here you will find an explanation of the SCAPE Policy Framework, consisting of three levels of policies: from high-level Guidance Policies, through Preservation Procedure Policies, to a very detailed level of policies useful for automated preservation workflows.

Preservation Procedure Policies, the intermediate level, are designed to assist you in creating or updating your preservation policies. This level describes the approach an organisation intends to take in order to achieve its high-level goals. It is this level that you can find in the Catalogue of Preservation Policy Elements. The catalogue offers a unique overview of policy elements that could be part of your preservation policy.

How does this work?

Each Guidance Policy has a set of related Preservation Procedure Policies. The details of these are described using a template with a variety of information: for example, a definition of the policy, the life cycle phase in which the policy will be relevant, a suggestion of who should be involved in creating it, etc. For more inspiration you can also have a look at our collection of published preservation policies.

Although the SCAPE project is mainly focused on libraries, data centers and web archives, we believe that this Catalogue is also relevant for other disciplines with a preservation task.

We are convinced that the Catalogue of Preservation Policy Elements will need to be updated based on new insights in digital preservation, even after the SCAPE project finishes in September. So we would like to invite you to send us your feedback, either by adding it in the Catalogue (each page offers an opportunity to add comments) or by sending it to Barbara.Sierman@KB.NL.

The final version of the Catalogue of Preservation Policy Elements was created by: Barbara Sierman (National Library of the Netherlands), Catherine Jones (Science and Technologies Facilities Council, UK) and Gry Elstrøm (State and University Library, Aarhus, Denmark).

Interested? Join the SCAPE Webinar on Preservation Policies on May 28, 14.00 hrs CET.

Preservation Topics: SCAPE
Categories: Planet DigiPres

Using Kanban at the SCAPE Developer Workshop

12 May 2014 - 8:17am

The SCAPE project is into its final 6 months and with that came our final developer workshop. The main focus of this event was demonstrations, productisation and sustainability; however, with everyone together it provided an opportune time to make progress with other SCAPE-related activities. With nearly 30 people there, there was a lot going on, so the agenda needed to be flexible enough to enable productive working while still ensuring the workshop's overall goals were achieved. This post discusses our use and experience of kanban as a means of managing such workshops.

Kanban is a visual way of managing tasks through a workflow. It is lightweight, applicable to whatever process you have today, and doesn't require a lot of overhead. It is based on three principles (some say 6): visualise, limit Work in Process (WIP) and manage flow, which serve to ensure that tasks and processes are brought into the open for discussion, that lead times are reduced, and that the workflow is understood and continually improved through monitoring. For the purposes of this SCAPE workshop the main priority was visibility of the work.

Visualising Tasks

Typically a workflow is represented on a kanban board along with the work items that move through this workflow. Placing such a board on the wall enables everyone to see what's being worked on, who's working on it, and at what stage in the workflow it is. There is a minimal overhead (creating work item cards and moving them through the workflow stages), but this is manageable with short, regular update discussions.

Task Boards

Todo and Doing tasks

Using large sheets of paper stuck to the meeting room walls, we created a primary board with 3 main columns: To Do, Doing, Done. After introductory presentations recapping the purpose of the workshop and the main goals, participants were urged to consider tasks they needed to do and to write them on sticky notes, whilst further presentations were given about specific activities that needed doing (e.g., generating microsite documentation for each tool). This worked really well as it got participants thinking about and writing down what they needed to do whilst discussions were happening.

It is important to note that a fair amount of pre-work was performed in the lead-up to the workshop. For example, gaps in tool documentation and tool installation issues were both identified prior to the workshop. These were then introduced in the initial presentations (and discussed further during the remainder of the workshop) and provided a valuable initial set of tasks that quickly got the workshop moving. The motivation for this pre-work was driven primarily by the desired overall goals of the workshop.

Tasks

One sticky note equated to one task. We made no distinction between different coloured notes, although the colours could have been used to indicate categories of work, e.g., bugs, development, documentation, etc. This does add another level of complexity into visualising the work and given the workshop was only 3 days long, not giving meaning to the colours kept things simple.

Similarly, it is useful to know who is working on a task. We pre-printed everyone’s names onto white stickers which could be affixed to the sticky notes. Being white on coloured notes emphasised the names, making it easier to see which tasks were ownerless. Avatars were considered, however given the short duration of the workshop and the fact that few people pre-selected one, we didn't use them (they were film character based and so didn't relate to any specific individual at the workshop, meaning it probably would have been harder to work out who the task "owner" was anyway).

"Why does this tool have its own board?"

Beyond the main board we also had several smaller boards for individual software tools being worked on. These had the same 3 columns as before, but also included "Waiting for Feedback" (queue) and "Feedback" columns between 'Doing' and 'Done'. These two additional columns were motivated by the work on documentation, where checks were thought necessary to ensure documentation completeness. I didn't see these columns used much, and they could equally have been replaced by creating a "Check documentation for tool X" sticky note once the "Do documentation" task had been completed.

Specific Tool Kanban boards

We never had any logic or consistency in which tools had their own board and which didn't, which resulted in some confusion; a few times I overheard questions such as "why does this tool have its own board?" to which there was no obvious answer. In many ways these smaller boards acted as swim lanes for those tools; the lanes just happened to be separated out into their own boards. Separate boards highlight the work on a specific tool/topic, but perhaps unnecessarily isolate that work from the rest. If a clear distinction between work item "topics" is needed, different coloured sticky notes could always be used instead (this wasn't the case for us though), but caution should be used to avoid making it too complex (e.g. through use of too many colours).

Work in Process - What to look out for

The emphasis for taking a kanban approach was to bring the tasks being worked on out into the open. It can often be hard to manage workshops with 30 people all working on different things and ensure that the workshop’s goals are met. However, having tasks visible on the board with people’s names attached to them means no-one can hide.

No emphasis was placed on limiting the amount worked on by one person, although people naturally tended towards only working on one or two things at a time. Progress was simply monitored through ad-hoc group discussions centred on the boards and going around the table asking each participant how they were getting on. This sometimes resulted in the need to move a sticky note from one column to another, but forgetting to do this can be excused by unfamiliarity with the kanban approach. A number of times, people were working on things that didn’t have a sticky note at all; asking them to create one then and there was the easiest approach to ensuring it happened, and also encouraged a few others to slyly put up other missing notes!

Another thing to look out for is ill-written tasks: task descriptions that are either too vague to be understood (by anyone other than the task owner) or so high-level that they encompass many individual tasks. A big factor in using kanban is to bring the work out into the open so everyone can become familiar with what's going on. If the actual things being worked on are hidden behind vague descriptions then nothing is communicated and you may as well not have had the note at all.

Lessons Learned

Kanban was a very effective approach to managing the wide variety of tasks going on at the workshop, and is recommended for such meetings. The following bullets summarise the discussions above, reflecting our experiences with the technique, and are directed at the use of kanban in short workshops (the recommendations may differ if applied to a long-running project, for instance).

Preparation:

  • Understand what the key goals are that need to be achieved at the workshop and prepare accordingly; from this it is likely that an initial set of tasks can be created (or at least hinted at) to help get the workshop up-to-speed quickly.

At the Workshop:

  • Briefly explain the kanban process and give everyone sticky notes so they can write tasks as soon as they think of them; do this up front before main presentations.
  • Keep any introductory presentations short and directed towards encouraging task identification around the main workshop goals; aim to swiftly move on to the "doing".

Visualise, but keep things simple:

  • Keep everything on the same board; 3 columns (To Do, Doing, Done) is often enough.
  • Don’t give meaning to sticky note colours unless there’s a good reason to; if there is a good reason, create a key to ensure everyone understands each colour’s meaning.
  • Use stickers with people's names on. These are more obvious than handwritten names, and it is easier to recognise who they refer to than with avatars (given the short duration of the workshop). If using avatars, consider just using people's photos (perhaps also with their names).

Tasks:

  • Ask for note rewrites or breakdowns into multiple notes where task descriptions are vague or too high-level.
  • Encourage ownership of tasks by getting the person whose task it is to write the note.

Manage the boards:

  • Hold ad-hoc group discussions centred around the boards.
  • Ask for status updates by person rather than by task, so that missing task notes can be identified (this favours having all the tasks identified over an exact status for each).
  • Understand that people (especially those new to the process) forget to create/update tasks; encourage note creation as soon as the gap is recognised; if necessary, move tasks appropriately during the ad-hoc recaps.
  • Don’t move tasks to “Done” until they truly are done; “I’ve just got to push the code back to the repository” means it’s not finished.
Job Done!

Completed tasks

It was great to see everyone participate in the approach – I was expecting reluctance from people to get involved, but was surprised by the level of enthusiasm; I even noted someone exclaim at completing a task before jumping up to triumphantly move the sticky note to the done pile! With a variety of tasks, progress across the board can often be seen quickly (for us, things were complete by the end of the first day), and at the end of the workshop hopefully you end up with a “Done” column full of sticky notes and (if anything’s left) a “To Do” column with follow-up actions.

Preservation Topics: SCAPE
Categories: Planet DigiPres

The final SCAPE developers workshop in The Hague

9 May 2014 - 11:45am

The third and final SCAPE developers workshop was held at the Royal Dutch Library in The Hague on 23-25 April. This workshop was the final opportunity to work together face to face in a large group, since we are getting closer to the end of the project in September.

The workshop objectives were to identify and develop demonstrators for our workshop at the DL2014 Conference in September (more info will follow) and for the final EC review, to continue working towards sustaining our work beyond the project, and to bring everyone together to enable productive work on project deliverables.

In order to get as much work done as possible, we decided to work roughly according to Kanban methods. There were only a few presentations to start up the process; everything else was on the TODO/DOING/DONE boards, and we had regular stand-up moments to update progress on the boards.

 

And this is what Kanban, the SCAPE way, looked like:

At the beginning of the workshop: TO DO and DOING… ...and the DONE board at the final stand up!

 

There was a focus on the productisation of the SCAPE tools. In preparation for the workshop, tools were tested for user-friendliness; this was used as input for work on the maturity of the tools.

A lot of work was also done on the demonstrators. What will be presented and how? How can we integrate our main SCAPE messages? It was good to be able to discuss this together and to come up with an outline for further preparation.

Working from input gathered at the All Staff Meeting in February, a lot of the last documentation gaps were filled. This means we can now publish tool microsites, for which a template was developed in preparation for this workshop.

All in all we can look back on three very productive days. The Kanban boards turned out to work as expected, and as always it was good to meet many other SCAPErs in a good atmosphere, not only during the working days but also afterwards. I'm confident we can continue this in the coming months!

Preservation Topics: Tools
Categories: Planet DigiPres