This post covers two related topics: characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite.
Nanite is a Java project led by Andy Jackson from the UK Web Archive, formed of two main subprojects:
Whenever you find that you have got used to a command line tool and all of a sudden need to apply it to a large number of files over a Hadoop cluster, without having any clue about writing distributed programs, ToMaR will be your friend.
The SCAPE project is developing solutions to enable the processing of very large data sets with a focus on long-term preservation. One of the application areas is web archiving where long-term preservation is of direct relevance for different task areas, like harvesting, storage, and access.
First things first. The GitHub repository with the Audio QA workflows is here: https://github.com/statsbiblioteket/scape-audio-qa. And version 1 is working. "Version" is really the wrong word here. I should call it Workflow 1, which is this one:
Learning to Think Like a Package Maintainer
Lots of great digital preservation applications and services exist; however, very few are actively maintained and thus preserved. This is a big problem! By introducing the steps to develop these and engage the support of the community, this training course looks at what can be done to improve this situation. Specifically, it looks at how to prepare packages for submission into the very heart of many digital environments: the operating system and directly associated “app-stores”. Attendees will be given hands-on experience of developing and maintaining packages rather than software, and the key differences will be discussed and evaluated. Better preservation of preservation tools means better preservation of our digital history.
Learning Outcomes (by the end of the training event the attendees will be able to):
- Understand the complexities of package management and distinguish between the different practices relating to both package objectives and chosen programming language.
- Be able to carry out advanced package management operations in order to critically appraise current packages and propose changes.
- Understand the importance of clearly defined versioning and licenses and the role of clear documentation and examples.
- Apply best practice techniques in order to create a simple package suitable for long term maintenance.
- Evaluate a number of options for managing package configuration and behavior relating to package installation, removal, upgrade and re-installation.
- Analyse opportunities for automating package management and releases, maintaining a clear focus on the user and not the developer.
- Critically evaluate opportunities to generalise package management to allow the easy building and maintenance of packages on multiple platforms.
- Assess the potential to apply package management techniques in your own environment.
Delegates will receive a certificate of attendance for the training course.
The agenda can be seen here: http://wiki.opf-labs.org/display/SP/Agenda+-+Preserving+Your+Preservation+Tools.
Registration is now open! https://scape-preserving-tools.eventbrite.co.uk
How to fund and solve your digital preservation challenges
What will the event do for me?
This event will help to make your digital preservation more effective by demonstrating the best community-focused approaches and results from the JISC-funded SPRUCE Project. You'll be hearing from the SPRUCE Team experts and from the practitioners and developers who have been tackling digital preservation challenges in targeted SPRUCE Award projects. We'll also be hearing from you, so we can take on board what you need from our future work.
- If you're taking your first steps in preserving your digital assets we will demonstrate how to get started, where to get help, and how to make the case to resource your work more effectively.
- If you're already engaged in digital preservation we'll show how your efforts can be supported more effectively with help from the community.
Key topics we will be covering include:
- Securing funding for your digital preservation activities with the Digital Preservation Business Case Toolkit
- Community approaches to solving digital preservation challenges
- SPRUCE guides on how to assess your digital collections
- Stabilising data stored on obsolete hand-held media
- Results from the SPRUCE Award Projects
Who is this for?
Practitioners, developers and middle managers who are engaged (or would like to be engaged) in preserving their organisation's digital assets.
When, where and how do I register?
The free event will take place at 11am on the 25th November at the brand new Library of Birmingham. Register your attendance here. Please note that anyone who registers for the event and then fails to attend without giving at least one week of notice will be liable for a £50 cancellation charge. Places are limited, so please don't waste them!
The ‘Digital Preservation Without Tears’ Mash-up will appeal to collection owners and developers. The programme offers two connected strands – a hack and a sprint.
- In the hack, developers will have two days to develop, test and enhance practical tools for digital preservation. Collection owners will be invited to bring problem elements of their digital collections for analysis using the latest digital forensic and characterisation tools. This will help the collection owners develop practical workflows for management and preservation while helping developers spot and refine solutions that will enable better tools.
- In the sprint, collection owners will examine current thinking on digital preservation policy and planning in their organisations. Collection owners will present their own digital preservation policies and will be invited to assess these against each other and against emerging good practice, providing a managed environment for policy development and peer review. Collection owners will then be invited to pool their wisdom to create a Digital Preservation Policy Building Toolkit that can be shared.
This mashup will:
- Provide a forum for practical problem solving and analysis of digital collections
- Provide a forum for discussion, review and development of digital preservation policy
- Bring together developers and collection owners from across the DPC and OPF to address shared challenges
- Extend and enhance the corpus of digital preservation tools
- Deliver a simple beginners’ guide for the development of digital preservation policies
This event will be of interest to:
- Collections managers, librarians, curators, archivists and policy makers in all institutions with an interest in digital preservation
- Techies, tools developers, IT officers, database managers and systems analysts with an interest in long term data management
- Innovators and researchers in digital preservation
- Vendors and providers of digital preservation services
- CEOs, CTOs and CIOs seeking to develop institutional capacity for digital preservation
Everyone coming needs to bring a laptop computer. In addition:
- Collection owners will need to bring a data set that is giving them trouble in terms of characterisation or identification and be prepared to present their institutional policy on digital preservation
- Techies will need to tell us about the skills they have and bring a knowledge of existing digital forensic and characterisation tools
Also, because elements of the mash-up include peer-review of existing practice, participants need to understand and consent to working under ‘Chatham House Rules’ for parts of the programme.
Places are strictly limited and should be booked in advance. Priority will be given to DPC and OPF members, who can attend at no cost. Non-members are welcome at a cost of £150 per person. Lunch and refreshments are provided on three days and dinner on the first night. Accommodation will be recommended but is not included in the cost. Register online at: http://www.dpconline.org/events
Can’t make it?
Parts of the event will be available as a webcast. We’ll publish the slides after each event and will tweet live from the event using the hashtag #DPnoTears.
This webinar will present a case study of digital preservation and digital library development at the London School of Economics. It will cover:
- The nature of the digital library collections we are working with now, and a bit about our experiments and future directions for other kinds of born-digital material
- The high-level architecture and functional components we have in place, along with our general approach and what we feel we can avoid having an opinion about for now
- Our user experience design process and how we are integrating this way of thinking into other areas of the library, like our main website
- How we made the case to fund digital preservation and the development of our core team, and how we involve others within the library
Session lead: Ed Fay, Digital Library Manager, London School of Economics
Time: 14:00 BST / 15:00 CET
There are 25 places available, which will be allocated on a first-come, first-served basis. Registration will open soon.