Feed aggregator

Apache PDFBox Error Messages

OPF Wiki Activity Feed - 22 August 2014 - 11:44am

Page edited by Yvonne Friese

View Online Yvonne Friese 2014-08-22T11:44:34Z

Apache PDFBox Error Messages

OPF Wiki Activity Feed - 22 August 2014 - 9:26am

Page edited by Yvonne Friese

View Online Yvonne Friese 2014-08-22T09:26:38Z

2014-09-01 Preserving PDF - identify, validate, repair

OPF Wiki Activity Feed - 22 August 2014 - 9:25am

Page edited by Becky McGuinness

View Online Becky McGuinness 2014-08-22T09:25:36Z

2014-09-01 Preserving PDF - identify, validate, repair

OPF Wiki Activity Feed - 21 August 2014 - 1:53pm

Page edited by Becky McGuinness

View Online Becky McGuinness 2014-08-21T13:53:22Z

When is a PDF not a PDF? Format identification in focus.

Open Planets Foundation Blogs - 21 August 2014 - 10:40am

In this post I'll be taking a look at format identification of PDF files and highlighting a difference of opinion between format identification tools. Some of the details are a little dry, but I'll restrict myself to a single issue and be as light on technical details as possible. I hope to show that, once the technical details are clear, it really boils down to policy and requirements for PDF processing.

Assumptions

I'm considering format identification in its simplest role as first contact with a file about which little, if anything, is known. In these circumstances the aim is to identify the format as quickly and accurately as possible, then pass the file to format-specific tools for deeper analysis.

I'll also restrict the approach to magic number identification rather than trusting the file extension; more on this a little later.

Software and data

I performed the tests using the selected govdocs corpora (that's a large download BTW) that I mentioned in my last post. I chose four format identification tools to test:

  • the fine free file utility (also known simply as file),
  • DROID,
  • FIDO, and
  • Apache Tika.

I used versions that were as up to date as possible, but I'll spare the details until I publish the results in full.

So is this a PDF?

So there was plenty of disagreement between the results from the different tools; I'll be showing these in more detail at our upcoming PDF Event. For now I'll focus on a single issue: there is a set of files that FIDO and DROID don't identify as PDFs but that file and Tika do. I've attached one example to this post. Google Chrome won't open it, but my Ubuntu-based document viewer does. It's a three-page PDF about Rumen Microbiology, and this was obviously the intention of the creator. I've not systematically tested multiple readers yet, but LibreOffice won't open it while Ubuntu's print preview will. Feel free to try the reader of your choice and comment.

What's happening here?

It appears we have a malformed PDF, and that is indeed the case. The issue is caused by a difference in the way the tools go about identifying PDFs in the first place. This is where it gets a little dull, but bear with me. All of these tools use "magic" or "signature" based identification, meaning they look for (hopefully) unique strings of characters at specific positions in the file to work out the format. Here's the Tika 1.5 signature for PDF:

<match value="%PDF-" type="string" offset="0"/>

What this says is: look for the string %PDF- (the value) at the start of the file (offset="0") and, if it's there, identify the file as a PDF. The attached file indeed starts:

%PDF-1.2

meaning it's a PDF version 1.2. Now we can have a look at DROID's signature (from signature file version 77) for PDF 1.2:

<InternalSignature ID="125" Specificity="Specific">
    <ByteSequence Reference="BOFoffset">
        <SubSequence MinFragLength="0" Position="1"
            SubSeqMaxOffset="0" SubSeqMinOffset="0">
            <Sequence>255044462D312E32</Sequence>
            <DefaultShift>9</DefaultShift>
            <Shift Byte="25">8</Shift>
            <Shift Byte="2D">4</Shift>
            <Shift Byte="2E">2</Shift>
            <Shift Byte="31">3</Shift>
            <Shift Byte="32">1</Shift>
            <Shift Byte="44">6</Shift>
            <Shift Byte="46">5</Shift>
            <Shift Byte="50">7</Shift>
        </SubSequence>
    </ByteSequence>
    <ByteSequence Reference="EOFoffset">
        <SubSequence MinFragLength="0" Position="1"
            SubSeqMaxOffset="1024" SubSeqMinOffset="0">
            <Sequence>2525454F46</Sequence>
            <DefaultShift>-6</DefaultShift>
            <Shift Byte="25">-1</Shift>
            <Shift Byte="45">-3</Shift>
            <Shift Byte="46">-5</Shift>
            <Shift Byte="4F">-4</Shift>
        </SubSequence>
    </ByteSequence>
</InternalSignature>

This is a little more complex than Tika's signature, but what it says is that a matching file should start with the string %PDF-1.2 (the hex sequence 255044462D312E32), which our sample does. This is in the first <ByteSequence Reference="BOFoffset"> section, a beginning-of-file offset. Crucially, this signature adds another condition: the file must contain the string %%EOF (hex 2525454F46) within 1024 bytes of the end of the file.

There are two things that are different here. The change in the start condition, i.e. Tika's "%PDF-" vs. DROID's "%PDF-1.2", supports DROID's capability to identify versions of formats. Tika simply detects that a file looks like a PDF, returns the application/pdf MIME type, and has a single signature for the job. DROID can distinguish between versions and so has 29 different signatures for PDF. This is NOT the cause of the problem, though. The disagreement between the results is caused by DROID's requirement for a valid end-of-file marker, %%EOF. A hex search of our PDF confirms that it doesn't contain an %%EOF marker.
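To make the difference concrete, here's a minimal Python sketch of the two matching rules. This is my own illustration, not the tools' actual code, and the function names are invented:

def tika_style_match(path):
    # Tika's rule: the string %PDF- at offset 0 is enough.
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"

def droid_style_match(path):
    # DROID's rule: a version-specific header at offset 0 (shown here
    # for PDF 1.2, one of DROID's 29 PDF signatures), plus %%EOF
    # somewhere within the last 1024 bytes of the file.
    with open(path, "rb") as f:
        header_ok = f.read(8) == b"%PDF-1.2"
        f.seek(0, 2)                      # jump to the end of the file
        f.seek(max(f.tell() - 1024, 0))   # back up at most 1024 bytes
        trailer_ok = b"%%EOF" in f.read()
    return header_ok and trailer_ok

Run against the attached file, tika_style_match should return True while droid_style_match returns False, which is exactly the disagreement in the results above.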

So who's right?

An interesting question. The PDF 1.3 Reference states:

The last line of the file contains only the end-of-file marker, %%EOF. (See implementation note 15 in Appendix H.)

The referenced implementation note (3.4.4, "File Trailer") reads:

15. Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file.

So DROID's signature is indeed to the letter of the law plus amendments. It's really a matter of context when using the tools. Does DROID's signature introduce an element of format validation to the identification process? In a way yes, but understanding what's happening and making an informed decision is what really matters.

What's next?

I'll be putting some more detailed results onto GitHub along with a VM demonstrator. I'll tweet and add a short post when this is finished; it may have to wait until next week.

Preservation Topics: Identification

Attachment: It looks like a PDF to me.... (44.06 KB)
Categories: Planet DigiPres

2014-09-01 Preserving PDF - identify, validate, repair

OPF Wiki Activity Feed - 21 August 2014 - 9:33am

Page edited by Becky McGuinness

View Online Becky McGuinness 2014-08-21T09:33:57Z

Win an e-book reader!

SCAPE Blog Posts - 21 August 2014 - 7:30am

On September 8 the SCAPE/APARSEN workshop Digital Preservation Sustainability on the EU Level will be held at London City University in connection with the DL2014 conference.

The main objective of the workshop is to provide an overview of solutions to challenges within Digital Preservation Sustainability developed by current and past Digital Preservation research projects. The event brings together various EU projects/initiatives to present their solutions and approaches, and to find synergies between them.

Alongside the Digital Preservation Sustainability on the EU Level workshop, SCAPE and APARSEN are launching a competition:

Which message do YOU want to send to the EU for the future of Digital Preservation projects?

You can join the competition on Twitter. Only tweets including the hashtag #DP2EU will be considered for the competition. You may include a link to a text OR one picture with your message. Note, though, that messages containing more than 300 characters in total are excluded from the competition.

The competition will close September 8th at 16:30 UK time. The workshop panel will then choose one of the tweets as a winner. The winner will receive an e-book reader as a prize.

There are only a few places left for the workshop. Registration is FREE and must be completed by filling out the form at http://bit.ly/DPSustainability. Please don’t register for this workshop on the DL2014 registration page, since the workshop is free of charge!

Preservation Topics: SCAPE
Categories: SCAPE


Emulation as a Service (EaaS) at Yale University Library

The Signal: Digital Preservation - 20 August 2014 - 1:35pm

The following is a guest post from Euan Cochrane, Digital Preservation Manager at Yale University Library. This piece continues and extends exploration of the potential of emulation as a service and virtualization platforms.

Increasingly, the intellectual productivity of scholars involves the creation and development of software and software-dependent content. For universities to act as responsible stewards of these materials we need to have a well-formulated approach to how we can make these legacy works of scholarship accessible.

While there have been significant concerns about the practicality of emulation as a mode of access to legacy software, my personal experience (demonstrated via one of my first websites, about Amiga emulation) has always been contrary to that view. It is with great pleasure that I can now illustrate the practical utility of Emulation as a Service via three recent case studies from my work at Yale University Library. Consideration of an interactive artwork from 1997, interactive Hebrew texts from a 2004 CD-ROM and finance data from 1998 illustrates that it's no longer really a question of whether emulation is a viable option for access and preservation, but of how we can go about scaling up these efforts and removing any remaining obstacles to their successful implementation.

At Yale University Library we are conducting a research pilot of the bwFLA Emulation as a Service software framework.  This framework greatly simplifies the use of emulators and virtualization tools in a wide range of contexts by abstracting all of the emulator configuration (and its associated issues) away from the end-user. As well as simplifying use of emulators it also simplifies access to emulated environments by providing the ability to access and interact with emulated environments from right within your web browser, something that we could only dream of just a few years ago.

At Yale University Library we are evaluating the software against a number of criteria including:

  1. In what use-cases might it be used?
  2. How might it fit in with digital content workflows?
  3. What challenges does it present?

The EaaS software framework shows great promise as a tool for use in many digital content management workflows such as appraisal/selection, preservation and access, but also presents a few unique and particularly challenging issues that we are working to overcome.  The issues are mostly related to copyright and software licensing.  At the bottom of this post I will discuss what these issues are and what we are doing to resolve them, but before I do that let me put this in context by discussing some real-life use-cases for EaaS that have occurred here recently.

It has taken a few months (I started in my position at the Library in September 2013), but recently people throughout the Library system have begun to forward queries to me whenever they involve anything digital preservation related. Over the past month or so we have had three requests for access to digital content from the general collections that couldn't be interacted with using contemporary software. These requests are all great candidates for resolving using EaaS but, unfortunately (as you will see), we couldn't use it.

Screenshot of Puppet Motel running in the emulation service using the Basilisk II emulator.

Interactive Artwork, Circa 1997: Use Case One

An Arts PhD student wanted to access an interactive CD-ROM-based artwork (Laurie Anderson’s “Puppet Motel”) from the general collections. The artwork can only be interacted with on old versions of the Apple Mac “classic” operating system.

Fortunately the Digital Humanities Librarian (Peter Leonard) has a collection of old technology and was willing to bring a laptop from his personal collection into the library for the PhD student to use. This was not an ideal or sustainable solution (what would have happened if Peter’s collection wasn’t available? What happens when that hardware degrades past usability?).

Since responding to this request we have managed to get the Puppet Motel running in the emulation service using the Basilisk II emulator (for research purposes).

This would be a great candidate for accessing via the emulation service. The sound and interaction aspects all work well and it is otherwise very challenging for researchers to access the content.

Screenshot of the virtual machine used to access a CD-ROM that wouldn’t play in a current OS.

Hebrew Texts, Circa 2004: Use Case Two

One of the Judaica librarians needed to access data for a patron, and the data was on a Windows XP CD-ROM (Trope Trainer) from the general collections. The software on the CD would not run on the current Windows 7 operating system that is installed on the desktop PCs here in the library.

The solution we came up with was to create a Windows XP virtual machine for the librarian to have on her desktop. This is a good solution for her as it enables her to print the sections she wants to print and export pdfs for printing elsewhere as needed.

We have since ingested this content into the emulation service for testing purposes. In EaaS it can run on either VirtualBox, Oracle's virtualization software (which doesn't provide full emulation), or QEMU, an emulation and virtualization tool.

It is another great candidate for the service as this version of the content can no longer be accessed on contemporary operating systems and the emulated version enables users to play through the texts and hear them read just as though they were using the CD on their local machine. The ability to easily export content from the emulation service will be added in a future update and will enable this content to become even more useful.

Accessing legacy finance data through a Windows 98 Virtual Machine.

Finance Data, Circa 1998/2003: Use Case Three

A Finance PhD student needed access to data (inter-corporate ownership data) trapped within software on a CD-ROM from the general collection. Unfortunately the software was designed for Windows 98: “As part of my current project I need to use StatCan data saved using some sort of proprietary software on a CD. Unfortunately this software seemed not to be compatible with my version of Windows.” He had been able to get the data off the disc but couldn’t make any real sense of it without the software: “it was all just random numbers.”

We have recently been developing a collection of old hardware at the Library to support long-term preservation of digital content. Coincidentally, and fortunately, the previous day someone had donated a Windows 98 laptop. Using that laptop we were able to ascertain that the CD hadn’t degraded and the software still worked.  A Windows 98 virtual machine was then created for the student to use to extract the data. Exporting the data to the host system was a challenge. The simplest solution turned out to be having the researcher email the data to himself from within the virtual machine via Gmail using an old web browser (Firefox 2.x).

We were also able to ingest the virtual machine into the emulation service where it can run on either VirtualBox or QEMU.

This is another great candidate for the emulation service. The data is clearly of value but cannot be properly accessed without using the original custom software which only runs on older versions of the Microsoft Windows operating system.

Other uses of the service

In exploring these predictable use-cases for the service, we have also discovered some less-expected scenarios in which the service offers some interesting potential applications. For example, the EaaS framework makes it trivially easy to set up custom environments for patrons. These custom environments take up little space as they are stored as a difference from a base-environment, and they have a unique identifier that can persist over time (or not, as needed).  Such custom environments may be a great way for providing access to sets of restricted data that we are unable to allow patrons to download to their own computers. Being able to quickly configure a Windows 7 virtual machine with some restricted content included in it (and appropriate software for interacting with that content, e.g., an MS Outlook PST archive file with MS Outlook), and provide access to it in this restricted online context, opens entirely new workflows for our archival and special collections staff.
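bwFLA's storage layer isn't described here, but copy-on-write disk images implement the same difference-from-base idea. Here's an illustrative sketch using QEMU's qcow2 backing files, in Python; the image names are invented and qemu-img is assumed to be installed:

import subprocess
import uuid

# A patron-specific environment derived from a shared base image.
# The overlay records only the blocks the patron changes, so it
# stays small; the UUID gives it an identifier that can persist
# over time (or be discarded, as needed).
env_id = str(uuid.uuid4())
overlay = "patron-" + env_id + ".qcow2"
subprocess.run(
    ["qemu-img", "create", "-f", "qcow2",
     "-b", "windows7-base.qcow2", overlay],
    check=True,
)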

Why we couldn’t use bwFLA’s EaaS

In all three of the use-cases outlined above EaaS was not used as the solution for the end-user. There were two main reasons for this:

  1. We are only in possession of a limited number of physical operating system and application licenses for these older systems. While there is some capacity to use downgrade rights within the University’s volume licensing agreement with Microsoft, with Apple operating systems the situation is much less clear. As a result we are being conservative in our use of the service until we can resolve these issues.
  2. It is not always clear in the license of old software whether this use-case is allowed. Virtualization is rarely (if ever) mentioned in the license agreements. This is likely because it wasn’t very common during the period when much of the software we are dealing with was created. We are working to clarify this point with the General Counsel at Yale and will be discussing it with the software vendors.

Addressing the software licensing challenges

As things stand we are limited in our ability to provide access to EaaS due to licensing agreements (and other legal restrictions) that still apply to the content-supporting operating system and productivity software dependencies. A lot of these dependencies that are necessary for providing access to valuable historic digital content do not have a high economic value themselves.  While this will likely change over time as the value of these dependencies becomes more recognized and the software more rare, it does make for a frustrating situation.  To address this we are beginning to explore options with the software vendors and will be continuing to do this over the following months and years.

We are very interested in the opportunities EaaS offers for opening access to otherwise inaccessible digital assets. There are many use-cases in which emulation is the only viable approach for preserving access to this content over the long term. Because of this, anything that prevents the use of such services will ultimately lead to the loss of access to valuable and historic digital content, which will effectively mean the loss of that content. Without engagement from software vendors and licensing bodies, it may require a change in the law to ensure that this content is not lost forever.

It is our hope that the software vendors will be willing to work with us to save our valuable historic digital assets from becoming permanently inaccessible and lost to future generations. There are definitely good reasons to believe that they will, and so far, those we have contacted have been more than willing to work with us.

Categories: Planet DigiPres

Win an e-book reader

OPF Wiki Activity Feed - 20 August 2014 - 8:46am

Page edited by Jette Junge

View Online | Add Comment Jette Junge 2014-08-20T08:46:39Z


Digital Preservation Sustainability on the EU Policy Level

OPF Wiki Activity Feed - 20 August 2014 - 8:29am

Page edited by Jette Junge

View Online | Add Comment Jette Junge 2014-08-20T08:29:28Z


Win an e-book reader > ebook reader.jpg

OPF Wiki Activity Feed - 20 August 2014 - 8:26am

File attached by Jette Junge

JPEG File ebook reader.jpg (114 kB)

View Attachments Jette Junge 2014-08-20T08:26:02Z


Digital Preservation Sustainability on the EU Policy Level

OPF Wiki Activity Feed - 19 August 2014 - 11:46am

Page edited by Jette Junge

View Online | Add Comment Jette Junge 2014-08-19T11:46:43Z


Curating Extragalactic Distances: An interview with Karl Nilsen & Robin Dasler

The Signal: Digital Preservation - 18 August 2014 - 4:54pm

Screenshot of Extragalactic Distance Database Homepage.

While a fair amount of digital preservation focuses on objects that have clear counterparts in our analog world (still and moving images and documents, for example), there are a range of forms that are natively digital. Completely native digital forms, like database-driven web applications, introduce a variety of challenges for long-term preservation and access. I’m thrilled to discuss just such a form with Karl Nilsen and Robin Dasler from the University of Maryland, College Park. Karl is the Research Data Librarian, and Robin is the Engineering/Research Data Librarian. Karl and Robin spoke on their work to ensure long-term access to the Extragalactic Distance Database at the Digital Preservation 2014 conference.

Trevor: Could you tell us a bit about the Extragalactic Distance Database? What is it? How does it work? Who does it matter to today and who might make use of it in the long term?

Representation of the Extragalactic distance ladder from Wikimedia Commons.

Karl and Robin: The Extragalactic Distance Database contains information that can be used to determine distances between galaxies. For a limited number of nearby galaxies, the distances can be measured directly with a few measurements, but for galaxies beyond these, astronomers have to correlate and calibrate data points obtained from multiple measurements. The procedure is called a distance ladder. From a data curation perspective, the basic task is to collect and organize measurements in such a way that researchers can rapidly collate data points that are relevant to the galaxy or galaxies of interest.

The EDD was constructed by a group of astronomers at various institutions over a period of about a decade and is currently deployed on a server at the Institute for Astronomy at the University of Hawaii. It’s a continuously (though irregularly) updated, actively used database. The technology stack is Linux, Apache, MySQL and PHP. It also has an associated file system that contains FITS files and miscellaneous data and image files. The total system is approximately 500GB.

Extragalactic Distance Database Result table.

The literature mentioning extragalactic or cosmic distance runs to thousands of papers in Google Scholar, and over one hundred papers have appeared with 2014 publication dates. Explicit references to the EDD appear in twelve papers with 2014 publication dates and a little more than seventy papers published before 2014. We understand that some astronomers use the EDD for research that is not directly related to distances simply because of the variety of data compiled into the database. Future use is difficult to predict, but we view the EDD as a useful reference resource in an active field. That being said, some of the data in the EDD will likely become obsolete as new instruments and techniques facilitate more accurate distances, so a curation strategy could include a reappraisal and retirement plan.

Our agreement with the astronomers has two parts. In the first part, we’ll create a replica of the EDD at our institution that can serve as a geographically distinct backup for the system in Hawaii. We’re using rsync for transfer. Our copy will also serve as a test case for digital curation and preservation research. In this period, the copy in Hawaii will continue to be the database-of-record. In the second part, our copy may become the database-of-record, with responsibility for long-term stewardship passing more fully to the University of Maryland Libraries. In general, this project gives us an opportunity to develop and fine-tune curation processes, procedures, policies and skills with the goal of expanding the Libraries’ capacity to support complex digital curation and preservation projects.
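As a rough illustration of that replication step, an rsync invocation along these lines would do the mirroring. The host and paths are invented; the interview doesn't give the actual command:

import subprocess

# Mirror the EDD file system from the server of record to the replica.
# -a preserves permissions and timestamps, -z compresses in transit,
# and --partial lets interrupted transfers of large FITS files resume
# rather than restart from scratch.
subprocess.run(
    ["rsync", "-az", "--partial",
     "edd.example.edu:/srv/edd/", "/data/edd-replica/"],
    check=True,
)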

Trevor: How did you get involved with the database? Did the astronomers come to you or did you all go to them?

Karl and Robin: One of the leaders of the EDD project is a faculty member at the University of Maryland and he contacted us. We’re librarians on the Research Data Services team and we assist faculty and graduate students with all aspects of data management, curation, publishing and preservation. As a new program in the University Libraries, we actively seek and cultivate opportunities to carry out research and development projects that will let us explore different data curation strategies and practices. In early 2013 we included a brief overview of our interests and capabilities in a newsletter for faculty, and that outreach effort led to an inquiry from the faculty member.

We occasionally hear from other faculty members who have developed or would like to develop databases and web applications as a part of their research, so we expect to encounter similar projects in the future. For that reason, we felt that it was important to initiate a project that involves a database. The opportunities and challenges that arise in the course of this project will inform the development of our services and infrastructure, and ultimately, shape how we support faculty and students on our campus.

Trevor: When you started in on this, were there any other particularly important database preservation projects, reports or papers that you looked at to inform your approach? If so, I’d appreciate hearing what you think the takeaways are from related work in the field and how you see your approach fitting into the existing body of work.

Karl and Robin: Yes, we have been looking at work on database preservation as well as work on curating and preserving complex objects. We’re fortunate that there has been a considerable amount of research and development on database preservation and there is a body of literature available. As a starting point, readers may wish to review:

Some of the database preservation efforts have produced software for digital preservation. For example, readers may wish to look at SIARD (Software Independent Archiving of Relational Databases) or the Database Preservation Toolkit. In general, these tools transform the database content into a non-proprietary format such as XML. However, there are quite a few complexities and trade-offs involved. For example, database management systems provide a wide range of functionality and a high level of performance that may be lost or not easily reconstructed after such transformations. Moreover, these preservation tools may involve dependencies that seem trivial now but could introduce significant challenges in the future. We’re interested in these kinds of tools and we hope to experiment with them, but we recognize that heavily transforming a system for the sake of preservation may not be optimal. So we’re open to experimenting with other strategies for longevity, such as emulation or simply migrating the system to state-of-the-art databases and applications.
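To show the general idea behind such tools, and only the idea (SIARD's actual profile is far richer), here's a self-contained sketch that serializes a relational table to XML, using SQLite and a toy table so it runs anywhere:

import sqlite3
import xml.etree.ElementTree as ET

# Build a toy table; a real tool would read the live database instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE distances (galaxy TEXT, mpc REAL)")
conn.execute("INSERT INTO distances VALUES ('NGC 224', 0.78)")

# Serialize every row to format-neutral XML.
root = ET.Element("table", name="distances")
cur = conn.execute("SELECT * FROM distances")
cols = [d[0] for d in cur.description]
for row in cur:
    rec = ET.SubElement(root, "row")
    for col, val in zip(cols, row):
        ET.SubElement(rec, col).text = str(val)
print(ET.tostring(root, encoding="unicode"))

What this deliberately loses (keys, indexes, views, query performance) is precisely the functionality trade-off described above.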

Trevor:  Having a fixed thing to preserve makes things a lot easier to manage, but the database you are working with is being continuously updated. How are you approaching that challenge? Are you taking snapshots of it? Managing some kind of version control system? Or something else entirely? I would also be interested in hearing a bit about what options you considered in this area and how you made your decision on your approach.

Karl and Robin: We haven’t made a decision about versioning or version control, but it’s obviously an important policy matter. At this stage, the file system is not a major concern because we expect incremental additions that don’t modify existing files. The MySQL database is another story. If we preserve copies of the database as binary objects, we face the challenge of proliferating versions. That being said, it may not be necessary to preserve a complete history of versions. Readers may be interested to know that we investigated Git for transfer and version control, but discovered that it’s not recommended for large binary files.
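One conventional compromise, sketched here rather than anything the team has settled on, is periodic logical dumps: dated text snapshots compress well and can be meaningfully compared, unlike raw binary database files. The database name is invented:

import datetime
import gzip
import subprocess

# --single-transaction takes a consistent snapshot of the tables
# without locking the live database.
stamp = datetime.date.today().isoformat()
dump = subprocess.run(
    ["mysqldump", "--single-transaction", "edd"],
    check=True, capture_output=True,
).stdout
with gzip.open("edd-" + stamp + ".sql.gz", "wb") as f:
    f.write(dump)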

Trevor: How has your idea of database preservation changed and evolved by working through this project? Are there any assumptions you had upfront that have been challenged?

Karl and Robin: Working with the EDD has forced us to think more about the relationship between preservation and use. The intellectual value of a data collection such as the EDD is as much in the application (joins, conditions, grouping) as in the discrete tables. Our curation and preservation strategy will have to take this fact into account. We expect that data curators, librarians and archivists will increasingly face the difficult task of preservation planning, policy development and workflow design in cases where sustaining the value of data and the viability of knowledge production depends on sustaining access to data, code and other materials as a system. We’re interested to hear from other librarians, archivists and information scientists who are thinking about this problem.

Trevor: Based on this experience, is there a checklist or key questions for librarians or archivists to think through in devising approaches to ensuring long term access to databases?

Karl and Robin: At the outset, the questions that have to be addressed in database preservation are identical to the questions that have to be addressed in any digital preservation project. These have to do with data value, future uses, project goals, sustainability, ownership and intellectual property, ethical issues, documentation and metadata, data quality, technology issues and so on. A couple of helpful resources to consult are:

Databases may complicate these questions or introduce unexpected issues. For example, if the database was constructed from multiple data sources by multiple researchers, which is not unusual, the relevant documentation and metadata may be difficult to compile and the intellectual property issues may be somewhat complicated.

Trevor: Why are the libraries at UMD the place to do this kind of curation and preservation? In many cases scientists have their own data managers, and I imagine there are contributions to this project from researchers at other universities. So what is it that makes UMD the place to do it and how does doing this kind of activity fit into the mission of the university and the libraries in particular?

Karl and Robin: While there are well-funded research projects that employ data managers or dedicated IT specialists, there are far more scientists and scholars who have little or no data management support. The cost of employing a data manager, even part-time, is too great for most researchers and often too great for most collaborations. In addition, while the IT departments at universities provide data storage services and web servers, they are not usually in the business of providing curatorial expertise, publishing infrastructure and long-term preservation and access. Further, while individual researchers recognize the importance of data management to their productivity and impact, surveys show that they have relatively little time available for data curation and preservation. There is also a deficit of expertise in general, though some researchers possess sophisticated data management skills.

Like many academic libraries, the UMD Libraries recognize the importance of data management and curation to the progress of knowledge production, the growth of open science and the success of our faculty and students. We also believe that library and archival science provide foundational principles and sets of practices that can be applied to support these activities. The Research Data Services program is a strategic priority for the University of Maryland Libraries and is highly aligned with the Libraries’ mission to accelerate and support research, scholarship and creativity. We have a cross-functional, interdisciplinary team in the Libraries (made up of subject specialists and digital curation specialists as needed) and partners across the campus, so we can bring a range of perspectives and skills to bear on a particular data curation project. This diversity is, in our view, essential to solving complex data curation and preservation problems.

We have to acknowledge that our work on the EDD involves a number of people in the Libraries. In particular, Jennie Levine Knies, Trevor Muñoz and Ben Wallberg, as well as University of Maryland iSchool students Marlin Olivier and, formerly, Sarah Hovde, have made important contributions to this project.

Categories: Planet DigiPres

SCAPE Software Projects

OPF Wiki Activity Feed - 18 August 2014 - 4:27pm

Page edited by Hélder Silva

View Online | Add Comment Hélder Silva 2014-08-18T16:27:39Z
