Open Planets Foundation Blogs
It's been two weeks since the internal SCAPE developer workshop in Brno, Czech Republic. It was a great workshop. We had a lot of presentations and demos, and were brought up to date on what's going on in the other corners of the SCAPE project. We also had some (loud) discussions, but I think we came to some good agreements on where we as developers are going next. And we started a number of development and productisation activities. I came home with a long list of things to do next week (this ended up not at all being what I did last week, but I still have the list, so next week, fingers crossed). Tasks for week 48:
- make versioning stable and meaningful (this I looked at together with my colleague in week 48)
- release new version (this one we actually did)
- finish writing a nice microsite
- tell my colleague to finish writing a small website, where you can test the xcorrSound tools without installing them yourself
- write unit tests
- introduce automatic rpm packaging?
- finish xcorrSound Hadoop job
- do the xcorrSound Hadoop Testbed Experiment
  - Update the corresponding user story on the wiki
  - Write the new evaluation on the wiki
- finish the full Audio Migration + QA Hadoop job
- do the full Audio Migration + QA Hadoop Testbed Experiment
  - Update the corresponding user story on the wiki
  - Write the new evaluation on the wiki
- write a number of new blog posts about xcorrsound and SCAPE testbed experiments
- new demo of xcorrsound for the SCAPE all-staff meeting in February
- SCAPE testbed demonstrations
- define the demos that we at SB are going to do as part of testbed (this one we also did in week 48; the actual demos we'll make next year)
- FITS experiment (hopefully not me, but a colleague)
- JPylyzer experiment (hopefully not me, but a colleague)
- Mark FFprobe experiment as not active
- ... there are some more points for the next months, but I'll spare you...
So what did I do in week 48? Well, I sort of worked on the JPylyzer experiment, which is on the list above. In the Digital Preservation Technology Development department at SB we are currently working on a large-scale digitized newspapers ingest workflow, including QA. As part of this work we run JPylyzer from Hadoop on all the ingested files, and then validate a number of properties using Schematron. These properties come from the requirements given to the digitization company, but in the SCAPE context these properties should come from policies, so there is still some work to do for the experiment. But running JPylyzer from Hadoop, and validating properties from the JPylyzer output using Schematron, now seems to work in the SB large-scale digitized newspapers ingest project :-)
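For readers curious what such a Schematron check can look like in practice, here is a minimal, purely illustrative sketch (not the actual SB workflow code) that validates a jpylyzer output file against a single rule using Python and lxml; the element names used in the rule (jpylyzer, isValidJP2) are assumptions based on jpylyzer's XML output and may need adapting to your version and profile.

```python
# Minimal, purely illustrative sketch (not the actual SB workflow): validate a jpylyzer
# output file against a Schematron rule with lxml. The element names (jpylyzer,
# isValidJP2) are assumptions and may need adapting to your jpylyzer version.
from lxml import etree
from lxml.isoschematron import Schematron

SCHEMATRON_RULES = b"""
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <rule context="/jpylyzer">
      <assert test="isValidJP2 = 'True'">File is not a valid JP2</assert>
    </rule>
  </pattern>
</schema>
"""

def check(jpylyzer_output_path):
    # Compile the Schematron rules and validate the jpylyzer report against them
    schematron = Schematron(etree.fromstring(SCHEMATRON_RULES))
    return schematron.validate(etree.parse(jpylyzer_output_path))

if __name__ == "__main__":
    print(check("jpylyzer_output.xml"))
```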
And for now I'll put week 50 on the above list, and when I have finished a sufficient number of bullet points I'll blog again! This post is missing links, so I hope you can read it without them.
Lovebytes currently holds an archive of digital media assets representing 19 years of the organisation’s activities in the field of digital art and a rich historical record of emerging digital culture at the turn of the century. It contains original artworks in a wide variety of formats, video and audio documentation of events alongside websites and print objects.
In June 2013 we were delighted to receive an award from SPRUCE, which enabled us to devise and test a digital preservation plan for the archive through auditing, migrating and stabilising a representative sample of material, concentrating on migrating digital video and Macromedia Director files.
Alongside this we developed a Business Case, which makes the case for preserving the archive and describes the work that needs to be done to make it accessible for the benefit of current and future generations, with a view to this forming the basis of applications for funding to continue this work.
Lovebytes was set up to explore the cultural and creative impact of digitalisation across the whole gamut of artistic and creative practice through a festival of exhibitions, talks, workshops, performances, film screenings and commissions of new artwork.
We wanted the festival to be a forum to pose open questions about the impact of digitalisation for artists and audiences, in an attempt to find commonalities in working practice and new themes, highlight new and emerging forms and trends in creative digital practice, and also provide support for artists to disseminate and distribute their own work through commissions.
This was a groundbreaking model for a UK media festival and established Lovebytes as a key player amongst a new wave of international arts festivals.
The intention in developing a plan for the Lovebytes Media Archive is to look at how best to capture the 'shape' of the festival and how best to represent this in creating an accessible version of the archive.
The Objectives of the project funded through SPRUCE are outlined below:
- Develop a workflow for the migration of the digital files and interactive content, progressing on from work done during SPRUCE Mashup London.
- Tackle issues around dealing with obsolete formats and authoring platforms used by artists (such as Macromedia Director Projector files) and look at ways of making this content more accessible whilst also maintaining original copies for authenticity.
- Research and develop systems for transcription, data extraction and the use of metadata to increase accessibility of the archive.
- Report on progress and share our findings for the benefit of the digital preservation community.
- Develop a digital preservation Business case, with a view to approaching funders.
We started by developing a research plan for a representational sample of the archive (see below), focusing on one festival, rather than a range of samples from over the 19 years. We selected the year 2000 as this included a limited edition CD Rom / Audio CD publication which contains specially commissioned interactive and generative artwork in a variety of formats.
Additional assets in the representational sample include video documentation of panel sessions, printed publicity, photographs, press cuttings and audience interviews in a wide variety of formats.
Research plan for the representational sample
- Auditing the archive.
- Choosing a representative sample.
- Stabilising and migrating.
- Reviewing content to assess problems and risk
- Stabilise again with a view to rectifying problems
- Cataloguing and naming.
- Planning for future accessibility and interpretation.
- Extracting metadata.
- Prototyping a search interface to provide access to the archive (with Mark Osbourne from Nooode).
Data integrity is paramount in digital preservation and requires utmost scrutiny when dealing with 'born digital' artworks, where every aspect of the artist's original intentions should be considered a matter for preservation and any re-presentation of a digital artwork can be regarded as a reinterpretation of the work.
In all cases, the most urgent work was the migration of data to stabilise and secure it. Amongst the wide range of formats we hold, CDs and CD-ROMs are prone to bit rot, and other formats, such as magnetic media, can degrade gradually, be damaged by electrical and environmental conditions, or be easily damaged during attempts to read or play them back.
The majority of our preservation work was to migrate from a wide variety of formats to hard drive, essentially consolidating our collection into one storage medium, which is then duplicated as a part of a back up routine.
Our research focused on the following six areas:
- Macromedia Director Projector files
  - Migrating obsolete files and addressing compatibility issues.
- DV Tapes
  - Migrating DV tapes and transcribing panel sessions with a view to researching how transcriptions could be used for text-based searches of video content, and how this can be embedded as subtitles using YouTube.
- Restoring the Lovebytes website
  - The Lovebytes website is currently offline, although it is captured on the British Library's UK Web Archive.
- Developing naming systems for assets
- Prototyping a searchable web interface and exploring the potential for using ready-made, free and accessible tools for transcription dissemination.
- Writing a Business Case for Lovebytes Media Archive
We learned some valuable lessons on the way that we'd like to share with likeminded organisations, especially those who have limited resources and are looking to preserve their own digital legacy on a tight budget.
Our findings have been compiled into a detailed report, providing a workflow model which makes recommendations for capturing, cataloguing and preserving material. It outlines our research into preserving artwork on obsolete formats and authoring platforms, as well as systems for transcription, data extraction and the use of metadata to increase accessibility of the archive.
We wanted to begin looking at the preservation issues for our collection and devise our own systems and best practice; the recommendations reached for preserving digital assets in various media formats therefore reflect the organisational needs of Lovebytes and might not align with another organisation's goals.
We used the Digital Preservation Business Case Toolkit to help us get started on our Business Case. This was a fantastic resource and helped us shape our Case and consider all the information and options we needed to include.
The Business Case will form the foundation for applications for public and private funding and will be tailored to meet specific requirements. Through writing this, we were able to identify the potential risks to the archive, its value and how we might restage artworks or commission artists to use data from it within the preservation process.
As non-experts in digital preservation we knew we were about to encounter some steep climbs and were initially apprehensive about what lay ahead, given that most of our material had been sat in a garage for ten years. Our collection, until then, had remained largely un-catalogued and, aside from being physically sealed in oversized tupperware, the digital assets had been neglected. Many items were the only copy, stored in one location in danger of decay, damage or loss. As a small arts organisation recently hit by cuts to arts funding, Lovebytes and its archives were in a precarious position: unsupported and vulnerable.
The SPRUCE Award gave us the opportunity to take a step back and re-evaluate these assets, making us aware of their value and the need to save them and to start the preservation process. It has given us the opportunity to explore solutions and devise our own systems for best practice within the limited resources and funding options available to us.
It has allowed us to crystallize our thoughts around using the Lovebytes Media Archive to investigate digital archivism as a creative process and specifically how digital preservation techniques may be used to capture and preserve the curatorial shape and context of arts festivals.
By using available resources and bringing in external expertise where necessary, we found this process rewarding both in terms of developing new skills and also reaffirming in terms of our past, current and future curatorial practice.
Having undertaken this research we now feel positive about the future of the archive. We have a clear strategy for preservation and a case to take to funders and partners to secure it as an exemplar born-digital archive project which attempts to capture, preserve and represent the history of Lovebytes as a valuable record of early international digital arts practice at the turn of the century.
The preservation of audio CDs is something that is slightly different from the preservation of CDs containing data other than audio. Data on audio CDs cannot be easily cloned for preservation, as the music industry has lobbied the main operating system developers to curtail the duplication of CDs to crack down on the mass production of pirate copies. While this is understandable from an intellectual property perspective, it is rather problematic from a preservation viewpoint.
I have scoured published documents in this area but there are no comprehensive examples of best practice related to data preservation from audio CDs. There are guidebooks on the preservation of the CDs themselves but next to nothing about the preservation of the data on the audio CDs. This area requires urgent attention because audio CDs may contain at-risk and decaying audio data on a fragile medium. Certain types of audio CDs are nearing their end of life faster than others.
At the SPRUCE London Mashup in July 2013 I proposed the creation of a workflow model for the preservation of audio CDs. Working mainly with Peter May (British Library) and Carl Wilson (OPF), with input from other developers at the mashup, we established that the main problem that needed to be resolved was the fact that there was no open source tool to easily create a disk image or clone of data on an audio CD.
While this may seem a straightforward project, it took no fewer than three experienced developers working on this problem many hours before a practical solution was proposed, based on cdrdao. (See: an outline of the initial solution)
Having resolved the basic need to create a clone or disk image from an audio CD, the next step in this project was to explore how to catalogue the disk image and its contents, as well as normalise the audio files into the standard BWAV format. This was supported by a SPRUCE award (funded by JISC) covering the period August-October 2013, involving Carl Wilson and Toni Sant, with the participation of Darren Stephens from the University of Hull. Through further consultation with digital forensics experts at the British Library and elsewhere, as well as systematic development, this project has addressed this issue directly.
Once the fundamental open solution was in hand, our attention could be turned to the development of a four-step workflow model for the preservation of audio CDs. The four steps are as follows:
1. Disk Imaging (stabilizing the data)
2. Cataloguing (through individual Cue sheets)
3. Data Ripping (normalising the data)
4. Open access to the catalogue (outputting the metadata)
Working with a specific dataset (see: an outline of the dataset), this project is now able to provide a practical workflow model utilizing the solution proposed during the London SPRUCE mashup, a tool for steps 1 & 3 called arcCD. An example of good practice has now been established in this under-explored area of preservation. All materials produced for this project are available on GitHub. Darren Stephens is also integrating further development on outputting the metadata into MediaWiki for easy access and editing of the catalogue, as part of his PhD research project entitled 'A Framework for Optimised Interaction Between Mediated Memory Repositories and Social Media Networks.'
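As a purely illustrative sketch of what steps 1 and 2 can look like (this is not the arcCD code itself), the following Python wrapper drives cdrdao to image an audio CD and toc2cue to derive a cue sheet; the device path and file names are assumptions made for the example.

```python
# Illustrative sketch only (not the actual arcCD tool): steps 1 and 2 of the workflow,
# scripted around the cdrdao and toc2cue command line tools, which must be installed.
# The device path and file names are assumptions.
import subprocess

def image_audio_cd(basename, device="/dev/cdrom"):
    toc_file = f"{basename}.toc"
    bin_file = f"{basename}.bin"
    cue_file = f"{basename}.cue"
    # Step 1: disk imaging -- read the raw audio data and table of contents from the disc
    subprocess.run(["cdrdao", "read-cd", "--device", device, "--read-raw",
                    "--datafile", bin_file, toc_file], check=True)
    # Step 2: cataloguing -- derive a cue sheet from the TOC file
    subprocess.run(["toc2cue", toc_file, cue_file], check=True)
    return bin_file, cue_file

if __name__ == "__main__":
    image_audio_cd("disc0001")
```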
The initial dataset used for the development of this project is managed by the Malta Music Memory Project (M3P), which seeks to provide an inclusive repository for memories of Maltese music and associated arts, ensuring that these are kept in posterity for current and future generations. M3P is one of the projects within the Media and Memory Research Initiative (MaMRI) of the University of Hull and it is facilitated by the M3P Foundation, a voluntary organization registered in Malta.
1. Open Source: Previous research undertaken by the Digital Curator indicated that the implementation of an open source digital repository would not be feasible due to the investment and expertise required.
2. Out of the Box (recommended option): Preservica scored very highly and also proved to be the most cost effective solution based on initial calculations. Other out of the box solutions were considered such as Ex Libris Rosetta, but the cost of implementing this system in-house was prohibitive.
3. Hybrid: The combination of using the OAIS compliant Archivematica in conjunction with bit-level preservation provided by Arkivum was considered. However, the combination of these two solutions was not as comprehensive and cost effective in comparison to an out of the box solution.
Once the recommended option was decided, it was a case of using the guidance of the Digital Preservation Business Case Toolkit to form the final business case. What resulted was a straight-to-the-point and clear justification based on expert knowledge, which was presented internally to key stakeholders within NE.
Lessons Learnt
There is no one-size-fits-all solution!
- Much of what is concluded will be based on your own organisational context, all of which can influence the right approach towards digital preservation. However, it is hoped that this project can establish a methodology which other small to medium organisations can adopt.
- Aligning organisational goals from the outset will save you a great deal of work further down the line. By identifying these key drivers you can begin to build up support for your recommended solution before the big pitch to senior management.
- There are a number of fantastic resources out there which can save you reinventing the wheel. The first and most obvious point of contact is the new Digital Preservation Business Case Toolkit. A fantastic resource including everything you need to get started.
- Nail down upfront costs for at least the first three years. After all, you want a solution which can be sustained into the future. For any costs, include benefits and any potential returns on investment that can be identified.
- Travis compiles the projects and executes unit tests whenever a new commit is pushed to Github, or when a pull request is submitted to the project.
- Jenkins builds are generally scheduled once per day. After a build the software has its code quality analysed by Sonar
Complete details of how to build each non-Java project are contained within the .travis.yml files that are found in the project directories. As a side effect of this work the .travis.yml files can be used as instructions for independently building the projects.
Matchbox, Xcorrsound and Jpylyzer have CI builds that are capable of generating an installable Debian package, which we are aiming to publish. Java projects have had their Maven GroupId and package names changed to the appropriate SCAPE names so we can publish binary snapshots.
The daily Maven snapshots of code built in Jenkins are now (or soon will be) published to https://oss.sonatype.org/content/repositories/snapshots/eu/scape-project/ and can be used by adding this repository to your pom.xml:

```xml
<parent>
  <groupId>org.sonatype.oss</groupId>
  <artifactId>oss-parent</artifactId>
  <version>7</version>
</parent>
```
What you can do for your project
- Maintain your .travis.yml file if project dependencies change
- Ensure code matches the SCAPE/OPF functional review criteria – correct Java package names and Maven GroupIds are essential to be able to publish snapshots
- Ensure your project has an up to date README that contains details of how to build and run your software (including dependencies)
- Very importantly ensure that your project has (at the very least) a top level LICENSE, ideally source files should each contain a license header
- Add unit tests for your project
- Ensure that unit tests for your project can easily be run using standard dependencies. Relying on your particular installation for unit tests to pass means that they cannot be successfully run by Travis/Jenkins and show as test failures. Whilst it might not always be possible to have unit tests that can be run independently, if there have to be test dependencies then please document how these should be set up!
- Check your project at http://projects.opf-labs.org/
The CI days are generally about once a month. If you are interested in joining us do let us know as we could always do with more help. It’s an opportunity for you to work on CI with Travis/Jenkins, and do other work that is interesting (and rewarding), such as Debian packaging, that you might not normally get to work on.
During and around iPRES a couple of discussions sprang up around the topic of proper software archiving, and it was part of the DP challenges workshop discussions. With services emerging around emulation, e.g. as developed in the bwFLA project (see e.g. the blog post on the EaaS demo or Digital Art curation), proper measures need to be taken to make them sustainable from the software side. There are hardware museums around; something similar might be desirable for software too.
Research data, business processes, digital art and generic digital artefacts can often not be viewed or handled simply by themselves; instead they require a specific software and hardware environment to be accessed or executed properly. Software is a necessary mediator for humans to deal with and understand digital objects of any kind. In particular, artefacts based on any one of the many complex and domain-specific formats are often best handled by matching them with the application they were created with. Software can be seen as the ground truth for any file format: it is the software that creates files that truly defines how those files are formatted.
To make old software environments available on an automatable and scalable basis (for example, via Emulation-as-a-Service) proper enterprise-scale software archiving is required. At first look the task appears to be huge because of the large amount of software that has been produced in the past. Nevertheless, much of the software that has been created is standard software, more or less used all over the world, and there is a lot of low-hanging fruit to pick that would be highly beneficial to preserve and make available. If components of software can be uniquely described, deduplication should also reduce the overall workload significantly. For at least a significant proportion of the software to be covered, licensing might complicate the whole issue a fair amount, as different software licensing variants were deployed in different domains and different parts of the world, and current copyright and patent law differs in different jurisdictions in how it applies to older software.
Types of Software
Institutions and users have to decide which software needs to be preserved, how and by whom. The answers to these questions will depend on the intended use cases. In simpler cases all that may be needed to render preserved artefacts in emulated original environments could be a few standard office or business environments with standard software. Complex use cases may require very special non-standard, custom-made software components from non-standard sources, like use cases involving development systems or use cases involving the preservation of complex business processes.
Software components required to reproduce original environments for certain (complex) digital objects can be classified in several ways. Firstly, there are the standard software packages like operating systems and off-the-shelf applications sold in (significant) numbers to customers. And secondly there can be different releases and various localized versions (the user interaction part of a software application is often translated to different languages such as in Microsoft Windows or Adobe products) but otherwise the copies are often exactly the same. In general it does not really matter if it is a French, English, or German Word Perfect version being used to interact with a document. But for the user dealing with it or an automated process like the process used for migration-through-emulation the different labeling of menu entries and error messages matters.
The concept of versions is somewhat different for Open Source or Shareware-like software. Often there are many more "releases" available than with commercial software, as the software usually gets updated regularly and does not necessarily have a distinct release cycle. Also, unlike commercial software, open source packages usually include full localization in a single release, as there was no need to distinguish different markets.
In many domains custom made software and user programming plays a significant role. This can be scripts or applications written by scientists to run their analysis on gathered data, run specific computations, or extend existing standard software packages. Or it could be software tools written for governmental offices or companies to produce certain forms or implement and configure certain business processes. Such software needs to be taken care of and stored alongside the preserved base-files of an object in order to ensure they can be accessed and interacted with in the future. The same applies for complex setups of standard components with lots of very specific configurations.
If such standard software is required, it would make sense to be able to assign each instance a unique identifier. This would help to de-duplicate efforts to store copies. Even if a memory institution or commercial service maintains its own copy, it does not necessarily need to replicate the actual bits if other copies are already available somewhere. It may simply be able to manage its own licenses and use the bits/software copies provided by a central service. Additionally, it would simplify efforts to reproduce environments in an efficient way.
Some ideas about how to identify and describe software have already been discussed for the upcoming PREMIS 3.0 standard, in particular for the section regarding environments. Suitable persistent identifiers would definitely be helpful for tagging software: something like the ISBNs or ISSNs that describe books and other media (or the DOIs that are becoming ubiquitous for digital artefacts). These tags would be useful for tool registries like TOTEM as well, or could map to PREMIS PUIDs. There could be three layers of identification that become relevant (a rough sketch of how these might be modelled follows the list):
- On the most abstract layer a software instance is described as a complete package, e.g. Windows 3.11 US Edition, Adobe Page Maker Version X or Command & Conquer II containing all the relevant installation media, license keys etc. The ID of such a package could be the official product code or derived from it. However when using such an approach it might be difficult to distinguish between hidden updates, for example, during the software archiving experiment at Archives New Zealand we acquired and identified two different package sets of Word Perfect 6.0. So a more nuanced approach may be required.
- At the layer of the different media (relevant only if it is not just one downloaded installation package) each floppy disk or each optical medium (or USB media) could be distinguished. E.g. Windows 3.11 as well as applications like Word Perfect came with specific disks for just the printer drivers, and the CD (1 or 2) in the Command & Conquer game differentiated which adversary in the game you were assigned to.
- At the individual file layer, executables, libraries, helper files like font-sets etc. could be distinguished. The number of items in this set is the largest. An approach centred on maintaining a collection of digital signatures of known, traceable software applications is followed, e.g., by the NSRL (National Software Reference Library) and may be the most appropriate option for these types of applications.
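A rough, purely illustrative sketch of how these three layers might be modelled as a data structure is shown below; the field names are invented rather than a proposed standard, and the file-level digital signature follows the NSRL-style idea of hashing individual files.

```python
# Illustrative sketch only: one possible way to model the three identification layers
# described above (package, medium, file). Field names are invented, not a proposed
# standard; the file-level hash follows the NSRL idea of signatures for individual files.
from dataclasses import dataclass, field
from typing import List

@dataclass
class FileRecord:                 # layer 3: individual executables, libraries, fonts ...
    path: str
    sha1: str                     # digital signature of the file

@dataclass
class Medium:                     # layer 2: a single floppy, optical disc or download
    label: str                    # e.g. "printer driver disk 2 of 2"
    files: List[FileRecord] = field(default_factory=list)

@dataclass
class SoftwarePackage:            # layer 1: the complete product, e.g. "Windows 3.11 US"
    package_id: str               # could be derived from the official product code
    name: str
    media: List[Medium] = field(default_factory=list)
```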
Software Museum or Archive
Usually it is not trivial to map the installed files in an environment to files on the installation medium, as the files typically get packed (compressed in ‘archive’ files) on the medium and some files get created from scratch during the installation procedure.
Depending on the actual goal, the focus of the IDs will be different. To actually derive what kind of application or operating system is installed on a machine, file level identifiers will be needed. To just reproduce a particular original environment (for e.g. emulation) package level identifiers are more relevant. In some cases it may be useful to address a single carrier, e.g. to automate installation processes of standard environments consisting of an operating system and a couple of applications.
For the description of software and environments it might be useful to investigate what can be learned from commercial software installation handling and lifecycle management. Large institutions and companies have well-defined workflows to create software environments for certain purposes and their approaches may be directly applicable to the long term preservation use case(s).
What should be archived, who are the stakeholders and users and how can the archive be supported?
A model for nearly full archiving of a domain is the Computer Games Museum in Berlin, which receives every computer game that requires USK classification (USK is the German abbreviation for the Entertainment Software Self-Regulation Body, an organisation voluntarily established by the computer games industry to classify computer games). The collection is supplemented by donations of a wide range of software (operating systems, popular non-gaming applications) and hardware items (computers, gaming consoles, controllers). Thus, the museum has acquired a nearly complete collection of the domain. An upcoming problem is the rising number of browser and online games which never get a representation on a physical medium. Another unresolved issue is the maintenance of the collection. At the moment the museum does not even have enough funds for bitstream preservation and proper cataloguing of the collection.
Archiving (of standard software) already takes place, for example, at the Computer History Museum, the Australian National Library, the National Archives of New Zealand or the Internet Archive, to mention a few. Unfortunately, the activities are not coordinated. Neither the mostly "dark archives" of memory institutions nor the online sites for deprecated software of questionable origin are sufficient for a sustainable strategy. Nevertheless, landmark institutions like national libraries and archives could be a good place to archive software in a general way. However, the archived software is only of any use if it is properly described with standard metadata. Ideally, the software repositories would provide APIs to communicate with a central software archive and attach services to it. The service levels could differ from just offering metadata information to offering access to complete software packages. In addition to the basic services, museums could offer interactive access to selected original environments, as there is a significant difference between having a software package merely bit-stream preserved and having it available to explore and test interactively for a particular purpose. Often, specific, implicit knowledge is required to get some software item up and running, so keeping instances running permanently would have a great benefit. Archiving institutions like museums could try to build online communities around platforms and software packages. Live "exhibition" of software helps community exchange and can attract users with knowledge who would otherwise be difficult to find.
Software museums can help to reduce duplicated effort to archive and describe standard software. They can at least ensure that not every archive needs to store multiple copies of standard software but can simply refer to other repositories. Software museums or archives could become brokers for (obsolete) software licenses. They could serve as a place to donate software (from public and private entities), firmware and platform documentation. Such institutions could simplify the proceedings for a software company to take care of its digital legacy. A one-stop institution might be much more attractive to software vendors and archival institutions than the possible alternative of having multiple parties negotiating license terms of legacy packages with multiple stakeholders (software companies might have a positive attitude towards such a platform, or lawmakers could be persuaded to push it a bit). Software escrow services (discussed e.g. within the TIMBUS EU project) can complement these activities. A museum can operate in different modes, for example a not-for-profit branch for public presentation, community building, education etc. and a commercial branch to lend/lease out software to actually reproduce environments in emulators for commercial customers.
The situation could be totally different for research institutions and users of custom-made software. Such packages do not necessarily make sense in a (public) repository. In such cases the question of how the licensing will be handled arises. If obsolete, they could be handed over to the archive managing the primary research data.
Another issue is the handling of software versions. Products are updated until their announced end-of-life. Would it be necessary to keep every intermediate version, or to concentrate on general milestones? An operating system like Windows XP (32-bit) was officially available in several flavours (like "Home" or "Professional") from 2001 till 2014. In many cases a "fuzzy matching" would be acceptable, as a certain software package runs properly in all versions. Other software might require a very specific version to function properly. This needs to be addressable (and could be matched to the appropriate PRONOM environment identifiers). Plus, there are a couple of preservation challenges in the software lifecycle itself.
There are a number of questions which arise when creating or running a software archive or museum:
- On which level should a software archive be run: institutional (e.g. for larger national research institutions), state, federal or global, or should a federated approach be favoured?
- Does it make sense (at all) to run a centralized software archive in a relevant size, assuming that for modern, complex scientific environments, the software components are much too individual? What kind of software would be useful in such an archive? Which versions should be kept?
- Would it be possible to establish a PRONOM-like identifier system (agreed upon and shared among the relevant memory institutions)? Or use the DOI system to provide access to the base objects?
- How, through which APIs should software and/or metadata be offered (or ingested)?
- How should the software archive adapt to the ever changing form of installation media from tapes, floppies to optical media of different types to solely network based installations?
- Would it be possible to run the software archive as a backend, where locally ingested software is stored in the end?
- Is the advantage gained by centralizing knowledge and storage of standard software components big enough to outweigh the effort required to run such an archive?
- Do proper software license and handling models exist for such an archive, like donation of licenses, taking over abandoned packages, escrow services? Would it be possible to bridge the diverse interests of diverse users of a diverse range of software and software producers?
- Would there be advantages in running such an archive as, or within, a non-profit organisation? What business model would make most sense for such an organisation?
My previous blog post Assessing file format risks: searching for Bigfoot? resulted in some interesting feedback from a number of people. There was a particularly elaborate response from Ross Spencer, and I originally wanted to reply to that directly using the comment fields. However, my reply turned out to be a bit lengthier than I intended, so I decided to turn it into a separate blog entry.
Numbers first?
Ross's overall point is that we need the numbers first; he makes a plea for collecting more format-related data, and adding numbers to these. Although these data do not directly translate into risks, Ross argues that it might be possible to use these data to address format risks at a later stage. This may look like a sensible approach at first glance, but on closer inspection there's a pretty fundamental problem, which I'll try to explain below. To avoid any confusion, I will be speaking of "format risk" here in the sense used by Graf & Gordea, which follows from the idea of "institutional obsolescence" (which is probably worth a blog post by itself, but I won't go into this here).
The risk model
Graf & Gordea define institutional obsolescence in terms of "the additional effort required to render a file beyond the capability of a regular PC setup in particular institution". Let's call this effort E. Now the aim is to arrive at an index that has some predictive power of E. Let's call this index RE. For the sake of the argument it doesn't matter how RE is defined precisely, but it's reasonable to assume it will be proportional to E (i.e. as the effort to render a file increases, so does the risk):
RE ∝ E
The next step is to find a way to estimate RE (the dependent variable) as a function of a set of potential predictor variables:
RE = f(S, P, C, ... )
where S = software count, P = popularity, C = complexity, and so on. To establish the predictor function we have two possibilities:
- use a statistical approach (e.g. multiple regression or something more sophisticated);
- use a conceptual model that is based on prior knowledge of how the predictor variables affect RE.
The first case (statistical approach) is only feasible if we have actual data on E. For the second case we also need observations on E, if only to be able to say anything about the model's ability to predict RE (verification).
No observed data on E!
Either way, the problem here is that there's an almost complete lack of any data on E. Although we may have a handful of isolated 'war stories', these don't even come close to the amount of data that would be needed to support any risk model, no matter whether it is purely statistical or based on an underlying conceptual model [1]. So how are we going to model a quantity for which we do not have any observed data in the first place? Or am I overlooking something here?
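To make this concrete, here is a minimal, purely illustrative sketch of what a statistical approach would involve; the predictor values and effort figures below are invented, precisely because no systematically observed data on E exists.

```python
# Illustrative sketch only: any attempt to fit a predictor function f(S, P, C, ...)
# for RE needs observed rendering efforts E. The numbers below are invented purely
# to make the code run; real observations of E are exactly what is missing.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical predictor variables per format: software count, popularity, complexity
X = np.array([[12, 0.9, 3],
              [ 2, 0.1, 8],
              [ 7, 0.5, 5]])

# Observed effort E (e.g. person-hours to render) -- no such dataset exists in practice
E = np.array([1.0, 8.0, 3.5])

model = LinearRegression().fit(X, E)
print(model.coef_, model.intercept_)
```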
Looking at Ross's suggestions for collecting more data, all of the examples he provides fall into the potential (!) predictor variables category. For instance, prompted by my observation on compression in PDF, Ross suggests analysing large collections of PDFs to establish patterns in the occurrence of various types of compression (and other features), and attaching numbers to them. Ross acknowledges that such numbers by themselves don't tell you if PDF is "riskier" than another format, but he argues that:
once we've got them [the numbers], subject matter experts and maybe some of those mathematical types with far greater statistics capability than my own might be able to work with us to do something just a little bit clever with them.
Aside from the fact that it's debatable whether, in practical terms, the use of compression is really a risk (is there any evidence to back up this claim?), there's a more fundamental issue here. Bearing in mind that, ultimately, the thing we're really interested in here is E, how could collecting more data on potential predictor variables of E ever help here in the near absence of any actual data on E? No amount of clever maths or statistics can compensate for that! Meanwhile, ongoing work on the prediction of E mainly seems to be focused on the collection, aggregation and analysis of potential predictor variables (which is also illustrated by Ross's suggestions), even though the purpose of these efforts remains largely unclear.
Within this context I was quite intrigued by the grant proposal mentioned by Andrea Goethals which, from the description, looks like an actual (and quite possibly the first) attempt at the systematic collection of data on E (although like Andy Jackson said here I'm also wondering whether this may be too ambitious).
Obsolescence-related risks versus format instance risks
On a final note, Ross makes the following remark about the role of tools:
[W]ith tools such as Jpylyzer we have such powerful ways of measuring formats - and more and more should appear over time.
This is true to some extent, but a tool like jpylyzer only provides information on format instances (i.e. features of individual files); it doesn't say anything about preservation risks of the JP2 format in general. The same applies to tools that are able to detect features in individual PDF files that are risky from a long-term preservation point of view. Such risks affect file instances of current formats, and this is an area that is covered by the OPF File Format Risk Registry that is being developed within SCAPE (it only covers a limited number of formats). They are largely unrelated to (institutional) format obsolescence, which is the domain that is being addressed by FFMA. This distinction is important, because both types of risks need to be tackled in fundamentally different ways, using different tools, methods and data. Also, by not being clear about which risks are being addressed, we may end up not using our data in the best possible way. For example, Ross's suggestion on compression in PDF entails (if I'm understanding him correctly) the analysis of large volumes of PDFs in order to gather statistics on the use of different compression types. Since such statistics say little about individual file instances, a more practically useful approach might be to profile individual file instances for 'risky' features.
[1] On a side note, even conceptual models often need to be fine-tuned against observed data, which can make them pretty similar to statistically-derived models.
One of the activities in the European project SCAPE is to create a catalogue of policy elements. At the last iPRES conference we explained our work and you can read about it. During our activities we started collecting existing, published policies, and we have now put the current set on a wiki: http://wiki.opf-labs.org/display/SP/Published+Preservation+Policies. Looking at the results of your colleagues might help you to create or finalize your own preservation policies. As I said during my presentation at iPRES 2013, there are far more organizations dealing with digital preservation than published preservation policies on the internet – at least based on what we found!
If your organization has a digital preservation policy and you want to see yours in this list as well, please send an email to Barbara.Sierman@kb.nl and it will be added.
Last week someone drew my attention to a recent iPres paper by Roman Graf and Sergiu Gordea titled "A Risk Analysis of File Formats for Preservation Planning". The authors propose a methodology for assessing preservation risks for file formats using information in publicly available information sources. In short, their approach involves two stages:
- Collect and aggregate information on file formats from data sources such as PRONOM, Freebase and DBPedia
- Use this information to compute scores for a number of pre-defined risk factors (e.g. the number of software applications that support the format, the format's complexity, its popularity, and so on). A weighted average of these individual scores then gives an overall risk score.
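Purely by way of illustration (the factor names, scores and weights below are invented, not taken from the paper), a weighted average of this kind looks as follows; note how a zero score on one factor can be masked by high scores on the others, a point I return to below.

```python
# Invented factors, scores and weights, purely to illustrate how a weighted average
# can mask a critical weakness such as a complete lack of supporting software.
weights = {"software_count": 0.3, "popularity": 0.4, "complexity": 0.3}

format_a = {"software_count": 0.0, "popularity": 1.0, "complexity": 0.9}  # no software at all
format_b = {"software_count": 0.6, "popularity": 0.5, "complexity": 0.5}

def overall_score(scores):
    # higher score = "safer" in this toy example
    return sum(weights[factor] * scores[factor] for factor in weights)

print(round(overall_score(format_a), 2))  # 0.67 -- scores "safer" than ...
print(round(overall_score(format_b), 2))  # 0.53 -- ... a format that actually has tool support
```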
This has resulted in the "File Format Metadata Aggregator" (FFMA), which is an expert system aimed at establishing a "well structured knowledge base with defined rules and scored metrics that is intended to provide decision making support for preservation experts".
The paper caught my attention for two reasons: first, a number of years ago some colleagues at the KB developed a method for evaluating file formats that is based on a similar way of looking at preservation risks; second, just a few weeks ago I found out that the University of North Carolina is also working on a method for assessing "File Format Endangerment" which seems to be following a similar approach. Now let me start by saying that I'm extremely uneasy about assessing preservation risks in this way. To a large extent this is based on experiences with the KB-developed method, which is similar to the assessment method behind FFMA. I will use the remainder of this blog post to explain my reservations.
Criteria are largely theoretical
FFMA implicitly assumes that it is possible to assess format-specific preservation risks by evaluating formats against a list of pre-defined criteria. In this regard it is similar to (and builds on) the logic behind, to name but two examples, Library of Congress' Sustainability Factors and UK National Archives' format selection criteria. However, these criteria are largely based on theoretical considerations, without being backed up by any empirical data. As a result, their predictive value is largely unknown.
Appropriateness of measures
Even if we agree that criteria such as software support and the existence of migration paths to some alternative format are important, how exactly do we measure this? It is pretty straightforward to simply count the number of supporting software products or migration paths, but this says nothing about their quality or suitability for a specific task. For example, PDF is supported by a plethora of software tools, yet it is well known that few of them support every feature of the format (possibly even none, with the exception of Adobe's implementation). Here's another example: quite a few (open-source) software tools support the JP2 format, but for this many of them (including ImageMagick and GraphicsMagick) rely on JasPer, a JPEG 2000 library that is notorious for its poor performance and stability. So even if a format is supported by lots of tools, this will be of little use if the quality of those tools is poor.
Risk model and weighting of scores
Just as the employed criteria are largely theoretical, so are the computation of the risk scores, the weights that are assigned to each risk factor, and the way the individual scores are aggregated into an overall score. The latter is computed as the weighted sum of all individual scores, which means that a poor score on, for example, Software Count can be compensated by a high score on other factors. This doesn't strike me as very realistic, and it is also at odds with e.g. David Rosenthal's view of formats with open source renderers being immune from format obsolescence.
Accuracy of underlying data
A cursory look at the web service implementation of FFMA revealed some results that make me wonder about the data that are used for the risk assessment. According to FFMA:
- PNG, JPG and GIF are uncompressed formats (they're not!);
- PDF is not a compressed format (in reality text in PDF nearly always uses Flate compression, whereas a whole array of compression methods may be used for images);
- JP2 is not supported by any software (Software Count=0!), it doesn't have a MIME type, it is frequently used, and it is supported by web browsers (all wrong, although arguably some browser support exists if you account for external plugins);
- JPX is not a compressed format and it is less complex than JP2 (in reality it is an extension of JP2 with added complexity).
To some extent this may also explain the peculiar ranking of formats in Figure 6 of the paper, which marks down PDF and MS Word (!) as formats with a lower risk than TIFF (GIF has the overall lowest score).
What risks?
It is important to note that the concept of 'preservation risk' as addressed by FFMA is closely related to (and has its origins in) the idea of formats becoming obsolete over time. This idea is controversial, and the authors do acknowledge this by defining preservation risks in terms of the "additional effort required to render a file beyond the capability of a regular PC setup in [a] particular institution". However, in its current form FFMA only provides generalized information about formats, without addressing specific risks within formats. A good example of this is PDF, which may contain various features that are problematic for long-term preservation. Also note how PDF is marked as a low-risk format, despite the fact that it can be a container for JP2, which is considered high-risk. So doesn't that imply that a PDF that contains JPEG 2000 compressed images is at a higher risk?
Encyclopedia replacing expertise?
A possible response to the objections above would be to refine FFMA: adjust the criteria, modify the way the individual risk scores are computed, tweak the weights, change the way the overall score is computed from the individual scores, and improve the underlying data. Even though I'm sure this could lead to some improvement, I'm eerily reminded here of this recent rant blog post by Andy Jackson, in which he shares his concerns about the archival community's preoccupation with format, software, and hardware registries. Apart from the question whether the existing registries are actually helpful in solving real-world problems, Jackson suggests that "maybe we don't know what information we need", and that "maybe we don't even know who or what we are building registries for". He also wonders if we are "trying to replace imagination and expertise with an encyclopedia". I think these comments apply equally well to the recurring attempts at reducing format-specific preservation risks to numerical risk factors, scores and indices. This approach simply doesn't do justice to the subtleties of practical digital preservation. Worse still, I see a potential danger of non-experts taking the results from such expert systems at face value, which can easily lead to ill-judged decisions. Here's an example.
KB example
About five years ago some colleagues at the KB developed a "quantifiable file format risk assessment method", which is described in this report. This method was applied to decide which still image format was the best candidate to replace the then-current format for digitisation masters. The outcome of this was used to justify a change from uncompressed TIFF to JP2. It was only much later that we found out about a host of practical and standard-related problems with the format, some of which are discussed here and here. None of these problems were accounted for by the earlier risk assessment method (and I have a hard time seeing how they ever could be)! The risk factor approach of FFMA is covering similar ground, and this adds to my scepticism about addressing preservation risks in this manner.
Final thoughts
Taking into account the problems mentioned in this blog post, I have a hard time seeing how scoring models such as the one used by FFMA would help in solving practical digital preservation issues. It also makes me wonder why this idea keeps on being revisited over and over again. Similar to the format registry situation, is this perhaps another manifestation of the "trying to replace imagination and expertise with an encyclopedia" phenomenon? What exactly is the point of classifying or ranking formats according to perceived preservation "risks" if these "risks" are largely based on theoretical considerations, and are so general that they say next to nothing about individual file (format) instances? Isn't this all a bit like searching for Bigfoot? Wouldn't the time and effort involved in these activities be better spent on trying to solve, document and publish concrete format-related problems and their solutions? Some examples can be found here (accessing old Powerpoint 4 files), here (recovering the contents of an old Commodore Amiga hard disk), here (BBC Micro Data Recovery), or even here (problems with contemporary formats).
I think there could also be a valuable role for some of the FFMA-related work in all this: the aggregation component of FFMA looks really useful for the automatic discovery of, for example, software applications that are able to read a specific format, and this could be hugely helpful in solving real-world preservation problems.
For our evaluations within SCAPE it would be useful to have the ability to quantitatively measure the abilities of the Hadoop clusters available to us, to allow results from each cluster to be compared.
Fortunately as part of the standard Hadoop distribution there are some examples included that can be run as tests. Intel has produced a benchmarking suite - HiBench - that uses those included Hadoop examples to produce a set of results.
There are various aspects of performance that can be assessed. The main ones being:
- CPU loaded workflows (e.g. file format migration) where the workflow speed is limited by the CPU processing available
- I/O loaded workflows (e.g. identification/characterisation) where the workflow speed is limited by the I/O bandwidth available
For the testing of our cluster I used HiBench 2.2.1. I made some notes about getting it to run that should be useful (see below). Apart from the one change described below in the notes, there was no need to edit or change the code.
In SCAPE testbeds we are running various workflows on various clusters. However, individual workflows tend to be run on only one cluster. Running a standard benchmark on each Hadoop installation may allow us to better compare and extrapolate results from the different testbed workflows.
Notes – these steps are only required on the node that HiBench is run from.
- JAVA_HOME is needed by some tests - I set this using “export JAVA_HOME=/usr/lib/jvm/j2sdk1.6-oracle/”.
- For the kmeans test I changed the HADOOP_CLASSPATH line in “kmeans/bin/prepare.sh” to “export HADOOP_CLASSPATH=`mahout classpath | tail -1`” as it was unable to run without that change; mahout already being in the path.
- The nutchindexing and bayes tests required a dictionary to be installed on the node that HiBench was started from – I installed the “wbritish-insane” package.
Some tests use fewer map/reduce slots than are available and are therefore not that useful for comparison, as we want to max out the cluster. For example, the kmeans tests only used 5 map slots.
I have created a page on the SCAPE wiki where I have put the results from our cluster: “Benchmarking Hadoop installations”. I invite and encourage you to run the same tests above and add them to the wiki page. Running the tests was much quicker than I thought it might be – it took less than a morning to setup and execute.
To get a better understanding of which benchmarks are more/less appropriate I propose we first get some metrics from all the HiBench tests across different clusters. In future we may choose to refine or change the tests to be run but this is just a start of a process to better understand how our Hadoop clusters perform. It’s only through you participating that we will get useful results, so please join in!
One of the Open Planets Foundation’s main roles in the SCAPE project is to provide stewardship for, and ensure longevity of, the SCAPE software outputs.
The SCAPE project is committed to producing open source software that is available to the wider community on GitHub, with clear licence terms and appropriate documentation, at an early stage in development.
While the above steps are important and helpful in encouraging other developers to download a project's source code, compile it, and try the software, this isn’t an everyday activity for the less geeky members of the digital preservation community. Software in this state is also unlikely to meet with the approval of an institution's IT Operations / Support section.
What’s really required for software longevity is an active community of users who:
- Use the software for real world activities in their day to day work.
- Report bugs and request enhancements on the project's issue tracker.
- Contribute to community software documentation.
So how do we bridge the gap between our current developer-ready software, and software that non-geeks find easy to install and use?
Over October there will be a sustained effort to package, document and publish SCAPE software for download by anybody who wants to try it. If that sounds like you then read on.
Where can I find the SCAPE software?
We have compiled a list of tools that have been developed or extended as part of the SCAPE Project: http://www.scape-project.eu/tools. Currently our software is on the OPF’s GitHub page, though if you’re not comfortable with source code this might not prove very helpful. To help you make sense of what’s on the GitHub page the OPF have created a project health check page, which distills the information a little and provides helpful links to the projects' README and LICENSE files. This page is still a work in progress, so if there’s some information you’d like to see on it you can raise an issue on GitHub.
How do I know that the software builds?
All SCAPE software should have a Continuous Integration build that runs on the Travis-CI site; this means that the software is built every time somebody checks a change to the source code into GitHub. If the build fails the developer is informed, and corrects the problem as soon as possible. Every project listed on the project health check site has one of these graphics:
indicating the result of the most recent attempt to build the project on Travis, or informing you that a Travis build couldn’t be found. Click on the image and you’ll be taken to the project’s Travis page if you’re interested in the gory details.
So how do I download and use SCAPE software?
Which brings us round to October, where we’ll be fitting the final piece of the puzzle. The real aim of the nightly builds is to produce installable packages for you to download. These packages will be Debian apt packages, installable on Debian-based Linux distributions including Ubuntu, Mint, and of course Debian itself.
We’ll be creating stable release packages for download from the OPF's Bintray page, and overnight “snapshot” builds of the current project at a location to be decided. Keep an eye on @openplanets and @scapeproject for news and download links over the coming month.
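Once the packages are published, installing a SCAPE tool should be no more involved than installing anything else via apt. The repository URL and package name below are placeholders until we announce the real ones, so take this only as an illustration of the shape of things to come:

    # Add the (hypothetical) SCAPE apt repository to your sources
    echo "deb http://dl.bintray.com/openplanets/scape-debian /" | sudo tee /etc/apt/sources.list.d/scape.list
    sudo apt-get update

    # Install a tool; the package name here is a placeholder
    sudo apt-get install jpylyzer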
But I use Windows, Mac OS, or another Linux packaging system.
Fear not, all is not lost. We’ve chosen Debian-based Linux distros first because:
- it simplifies licensing issues for build machines and virtual test and demonstration environments;
- Debian-based distros are among the most widely used Linux distributions;
- Hadoop, the engine that runs SCAPE’s scalable platform, has historically not played well with Windows, although this is no longer such a problem.
Some of the software will run on other platforms easily; Jpylyzer, for example, is already available for Windows. Others may require a little more work, but if there’s interest and it’s practical we’ll do our best. We’re trying to establish a community of users, not exclude people.
So that’s why SCAPE software needs you, hopefully as much as you need SCAPE software.
Preservation Topics: Packaging, SCAPE
I’m Rui Castro. I have worked at KEEP SOLUTIONS since 2010, where I have the roles of Director of Infrastructures, project manager and researcher. Before joining KEEP SOLUTIONS, I was part of the team who developed RODA, the digital preservation repository used by the Portuguese National Archives.
Tell us a bit about your role in SCAPE and what SCAPE work you are involved in right now?
My role in SCAPE is primarily focused on Preservation Action Components and Repository Integration.
In Action Components, I’ve worked on the identification, evaluation and selection of large-scale action tools & services to be adapted to the SCAPE platform. I’ve contributed to the definition of a preservation tool specification with the purpose of creating a standard interface for all preservation tools and a simplified mechanism for packaging and redistributing those tools to the wider community of preservation practitioners. I have also contributed to the definition of a preservation component specification with the purpose of creating standard preservation components that can be automatically searched for, composed into executable preservation plans and deployed on SCAPE-like execution platforms.
Currently my work is focused on repository integration, where I have the task of implementing the SCAPE repository interfaces in RODA, an open-source digital repository supported and maintained by KEEP SOLUTIONS. These interfaces, when implemented, will enable the repository to use the SCAPE preservation environment to perform preservation planning, watch and large-scale preservation actions.
Why is your organisation involved in SCAPE?
KEEP SOLUTIONS is a company that provides advanced services for managing and preserving digital information. One of our driving forces is continuous innovation in the area of digital preservation. In the SCAPE project, KEEP SOLUTIONS is contributing expertise in digital preservation, especially migration technologies, and practical knowledge of the development of large-scale digital repository systems. KEEP SOLUTIONS is also acquiring new skills in digital preservation, especially in preservation planning, watch and service parallelisation; we are enhancing the digital preservation products and services we currently support, such as RODA, and strengthening relationships with world-leading digital preservation researchers and institutions. KEEP SOLUTIONS’ participation in the project will enhance our expertise in digital preservation, and that will result in better products and services for our current and future clients.
What are the biggest challenges in SCAPE as you see it?
SCAPE is a big project, from the number of people and institutions involved to the number of digital preservation aspects covered. I think the biggest challenge will be the integration of all parts into a single coherent system. From a technical point of view the integration between content repositories, automated planning & watch and the executable platform is a huge challenge.
What do you think will be the most valuable outcome of SCAPE?
I see two very interesting aspects emerging from SCAPE.
One is the integration of automated planning & watch into digital preservation repositories. Planning is an essential part of digital preservation and it involves human-level activities (like policy and decision making) and machine activities (like evaluation of alternative strategies, characterisation and migration of content). Being able to bridge these two realms and provide content holders with the tools to make informed decisions about what to do with their data is a great achievement.
The other is the definition of a system architecture for large-scale processing, applied to the specific domain of digital preservation, that is capable of executing preservation actions like characterisation, migration and quality assurance over huge amounts of data in a “short” time.
Preservation Topics: SCAPE
I've started to publish some of my notes on digital preservation. It's mostly a collection of 'war stories' and summaries of some of the little experiments I've carried out over the years, but never had time to write up properly. The idea of publishing these stories is inspired in part by XFR STN, but also by my experience co-coordinating the AQuA workshops and from observing the success of the SPRUCEdp project.
In short, I think we need to share more war stories, not just the occasional full research paper, but also the small stuff, and the failures. Maybe I can start the ball rolling by sharing mine. I'd really like to know if anyone else out there is interested in sharing theirs.
There's a couple of bigger items on there that I think might be of particular interest:
- A long-winded data migration story about accessing data from BBC Master floppy disks.
- A description of how bitwise analysis can be used to better understand formats and the tools that act upon them, somewhat related to an OPF blog post by Jay Gattuso earlier this year.
Feedback welcome, as ever.
Here's a little news bulletin about FIDO, the OPF's open source file format identification tool.
It seems that the use of FIDO has been growing over the last few months. I am getting responses by e-mail and through the GitHub issue tracker from all over the world, ranging from requests for help to suggestions for improvement and even some bug fixes. Thanks, and please keep them coming!
The most important change at the moment is the versioning schema of tagged releases.
If you have forked FIDO or are watching the tags for updates, please note that the versioning schema has changed from [major].[minor].[patch] to [major].[minor].[patch]-[PRONOM version number].
The reason for this is that from time to time there is a new PRONOM version available but no code changes to commit. As it is bad practice to update a tagged release, this was the most reasonable way to handle it.
For example, release 1.3.1 has PRONOM version 70 distributed with it and is tagged '1.3.1-70'.
If a PRONOM update is available but there are no code changes, the next tag will be '1.3.1-71'. Please note that this is only reflected in release tags; FIDO itself will still report its version number without the PRONOM version number.
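In practice this just means a few extra tags in the repository. If you track FIDO via git (I'm assuming the openplanets/fido repository on GitHub here; adjust the URL if your fork lives elsewhere), listing and checking out a PRONOM-suffixed release works exactly as before:

    git clone https://github.com/openplanets/fido.git
    cd fido
    git tag -l              # lists release tags such as 1.3.1-70
    git checkout 1.3.1-70   # the 1.3.1 code built against PRONOM version 70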
Currently I am also working on the FIDO usage guide. It is still a work in progress, but it could help you on your way using FIDO.
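The usage guide will cover this properly, but for the impatient, the simplest invocations of the current version 1 code from a checkout look like this (flags as of the 1.3.x releases; run "python fido.py -h" to see what your copy supports):

    # Identify a single file; FIDO prints one match line per file to STDOUT
    python fido.py /path/to/some/file.jp2

    # Recurse into a directory of files
    python fido.py -recurse /path/to/some/directory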
I'll be the first to admit that FIDO is still far from being "the perfect file format identification tool". Although it is quite stable and many things have been improved or fixed lately, such as the handling of files passed via STDIN and the option to use only the official PRONOM signatures, it still needs improvement on many levels.
Recently Carl Wilson (OPF technical lead) and I started thinking about what needs to change for FIDO version 2. This second generation of FIDO will not differ much in functionality from the current version 1 generation, but the way we plan on doing things will make a big difference. For starters, we will be creating unit tests for every function in FIDO. The second important thing is unit testing of individual PRONOM signatures and PRONOM container signatures: with each update of PRONOM we will run unit tests using corpora files.
But the biggest change of all will be the way we build FIDO. It will no longer be just "a script", but rather an API. The "fido.py" script will then merely function as a prototype of how to build your "own" FIDO into your workflow systems. It will also no longer output to STDOUT and STDERR, but will return results in a more Pythonic way. You will read more about all this in a later post.
In the meantime I (with a little help from you) will continue improving version 1 where possible. If you have any questions or suggestions about any of the above, please let me know.
I found it both truthful and inspiring...
Truthful, because the chaotic path of discovery involved in understanding mysterious digital media reflected my own experiences on similar digital preservation adventures, both for the library and for the AQuA and SPRUCE projects.
Inspiring, because it brought new light to my old concerns about format/software/hardware registry systems. I've long been worried that they have not been designed with their users in mind. Specifically, the users that know all of this information and are willing to spend time sharing it. Why would they do it? What incentive would they need? What form of knowledge sharing would they choose?
Upon reading Ben's article, things became clearer. As I twittered at the time:
- "Now, go through and read it one more time, and think about how such a registry could actually have helped. What would it need to include?"
- "Could it really replace the expertise of those five (or so) people? Or should its purpose be to capture and link what they have achieved?"
- "Is the answer really in building registries? Or is it better to run more XFR STNs and help document and preserve what they do?"
Maybe we don't know what information we need? Maybe we don't even know who or what we are building registries for? Are we trying to replace imagination and expertise with an encyclopedia? Is it wrong to focus on the information, and ignore the people? Do we need a registry if we have a community of expertise to rely on? Should that community come first, and then be allowed to build whatever it needs? Maybe running and documenting more events like XFR STN and AQuA/SPRUCE is the only way to find out?
Preservation Topics: Format Registry