Web pages are getting more complex than ever, so identifying their different elements, such as main content, menus, user comments and advertising, is becoming harder. Web page segmentation refers to the process of dividing a web page into visually and semantically coherent segments, called blocks or segments. Detecting these blocks is a crucial step for many applications, for example content visualization on mobile devices, information retrieval and change detection between versions in the web archive context.

Web Page Segmentation at a Glance
For a web page W, the output of its segmentation is a semantic tree W'. Each node represents a data region in the page, called a block. The root block represents the whole page, each inner block is the aggregation of its children blocks, and the leaf blocks are atomic units that form a flat segmentation of the page. Each block is identified by a block-id value (see Figure 1 for an example).
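To make this structure concrete, here is a minimal sketch of how such a block tree could be represented and how the flat segmentation falls out of its leaves. The class and field names (Block, block_id, children) are illustrative only; they are not taken from Block-o-Matic or pagelyzer.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Block:
    """One node of the segmentation tree (hypothetical representation)."""
    block_id: str
    children: List["Block"] = field(default_factory=list)

def leaf_blocks(block: Block) -> List[Block]:
    """The leaves of the tree, i.e. the flat segmentation of the page."""
    if not block.children:
        return [block]
    leaves = []
    for child in block.children:
        leaves.extend(leaf_blocks(child))
    return leaves

# The root block represents the whole page; inner blocks aggregate their children.
page = Block("B0", [Block("B1", [Block("B1.1"), Block("B1.2")]), Block("B2")])
print([b.block_id for b in leaf_blocks(page)])  # ['B1.1', 'B1.2', 'B2']
```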
An efficient web page segmentation approach is important for several reasons:
- Processing different parts of a web page according to their type of content.
- Assigning importance to some regions of a web page over others.
- Understanding the structure of a web page.
In this post, I will try to explain what web page segmentation does, specifically for pagelyzer, where it provides information about the web page content.

Web Page Segmentation Algorithm
We present here the details of the Block-o-Matic web page segmentation algorithm used by pagelyzer to perform the segmentation. It is a hybrid of the visual-based and document-processing approaches.
The segmentation process is divided into three phases: analysis, understanding and reconstruction. It comprises three tasks: filter, mapping and combine. It produces three structures: the DOM structure, the content structure and the logical structure. The main goal of the whole process is producing these structures, where the logical structure represents the final segmentation of the web page.
The DOM tree is obtained from the rendering of the page in a web browser. The result of the analysis phase is the content structure (Wcont), built from the DOM tree by the d2c algorithm. Mapping the content structure into a logical structure (Wlog) is called document understanding; this mapping is performed by the c2l algorithm with a granularity parameter pG. Web page reconstruction gathers the three structures (the Rec function):
W' = Rec(DOM, d2c(DOM), c2l(d2c(DOM), pG))
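Read as a pipeline, the formula says: analyse the DOM into a content structure, map that content structure into a logical structure at granularity pG, and then combine all three. Here is a minimal sketch of that composition in Python, purely to illustrate the order of the phases; the function bodies are placeholders, not the real d2c, c2l and Rec algorithms.

```python
def d2c(dom):
    """Analysis: build the content structure (Wcont) from the DOM tree (placeholder)."""
    return {"content_of": dom}

def c2l(content, pG):
    """Understanding: map the content structure to a logical structure (Wlog)
    at granularity pG (placeholder)."""
    return {"logic_of": content, "granularity": pG}

def rec(dom, content, logic):
    """Reconstruction: gather the three structures into the segmented page W'."""
    return {"dom": dom, "content": content, "logic": logic}

def segment(dom, pG):
    content = d2c(dom)               # analysis
    logic = c2l(content, pG)         # understanding
    return rec(dom, content, logic)  # reconstruction -> W'
```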
To integrate the segmentation outcome into pagelyzer, an XML representation is used: ViDIFF. It represents hierarchically the blocks, their geometric properties, and the links and text in each block.

Implementation
The Block-o-Matic algorithm is available:
- through pagelyzer itself (https://github.com/openplanets/pagelyzer),
There was a week in January 2014 where I participated in three meetings/events where emulation came up as a digital preservation solution. Emulation has really hit its stride, 20 years after I first heard about it.
An emulator is an environment that imitates the behavior of a computer or other electronic system. In recent years, this has come to be known as a Virtual Machine, which is a recreated computer environment — from the operating system to the video drivers and software — that can be run in an interactive manner using current technology, including a web browser in some instances.
I was very much a fan of collecting hardware for digital preservation, until I participated in the Library of Congress Preserving.exe meeting in May of 2013. I wrote about my own conversion to Team Emulation in an earlier post on this blog, and my colleague Bill Lefurgy responded to my post with a post of his own. (That said, we still need vintage hardware to read older media to bring operating systems and software into emulation environments.)
There are a few key articles on this topic:
- Granger, Stewart. “Emulation as a Digital Preservation Strategy.” D-Lib Magazine 6.19 (2000).
- Guttenbrunner, Mark, and Andreas Rauber. “A measurement framework for evaluating emulators for digital preservation.” ACM Transactions on Information Systems (TOIS) 30.2 (2012): 14.
- Rechert, Klaus, Dirk von Suchodoletz, and Randolph Welte. “Emulation based services in digital preservation.” Proceedings of the 10th annual joint conference on Digital libraries. ACM, 2010.
- Rothenberg, Jeffrey. “The Emulation Solution.” Avoiding Technological Quicksand: Finding a Viable Technical Foundation for Digital Preservation. Washington, DC: Council on Library and Information Resources, 1998.
- Van der Hoeven, Jeffrey, Bram Lohman, and Remco Verdegem. “Emulation for digital preservation in practice: The results.” International Journal of Digital Curation 2.2 (2008): 123-132.
Don’t let some of the early dates fool you – this issue was debated in just as lively a way 15 years ago as it is now.
The beginning is a very fine place to start indeed for the Federal Agencies Digitization Guidelines Initiative Born Digital Video subgroup of the Audio-Visual Working Group. As mentioned in a previous blog post, the FADGI Born Digital Video subgroup is taking a close look at the range of decisions to be made throughout the lifecycle of born digital video objects, from file creation through archival ingest and access delivery. Through case histories from federal agencies such as the National Archives and Records Administration, the Smithsonian Institution Archives, the National Oceanic and Atmospheric Administration, the Library of Congress, Voice of America and the American Folklife Center, we are exploring the “truth and consequences” of creating and archiving born digital video. In this blog post, we’ll look at some of our guiding principles for creating born digital video.
But as Julie Andrews says, let's start at the very beginning. What do we mean by born digital video? Quite simply, it's video that is recorded to a digital file at the point of creation. Born digital video is distinct from digitized or reformatted video, a label used to describe the result of translating the analog signal data emanating from a video object into a digitally encoded format. FADGI's Reformatted Video subgroup is developing a matrix which compares target wrappers and encodings against a set list of criteria that come into play when reformatting analog videotapes.
The first set of FADGI BDV case histories highlight what we call advice for shooters (a.k.a. videographers), and by extension, the project managers within cultural heritage institutions who are responsible for the creation of new born digital video files – especially for determining the technical file specifications. It's important to recognize that the FADGI target audience for these case histories isn't Hollywood or commercial entertainment producers. It's the cultural heritage community or smaller archives who create non-broadcast classes of content, such as oral history recordings. A great example is the Civil Rights History Project at AFC. These types of projects have the opportunity to spec out the born digital video deliverable from the very beginning and end up with a file that is ingest-ready for preservation and access systems.
The goal of the case histories project is to use guiding principles to illustrate the advantages of starting with high quality data capture from the very start. Two examples of FADGI’s guiding principles for creating born digital video include:
- Create uncompressed video instead of compressed video. Compressed video reduces the amount of data in a file or stream. Although a reduced amount of data can be beneficial for easing storage, transfer, and play-out requirements, it generally introduces additional technical complexity which can have a negative impact on usability of the file over time. Uncompressed video retains all the visual information captured at the selected resolution, which is preferable for preservation purposes.
- If compression is required, use lossless compression over lossy compression. Lossless compression uses algorithms that allow the original data to be restored exactly after decompression. It is essentially reversible compression. Lossy compression permanently alters or deletes the compressed data. If data reduction gains are significant enough to warrant the added complexity of compressed files, lossless compression is preferred to preserve video quality.
These are just two examples that focus on the video encoding. The guiding principles also cover considerations for file wrapper or container capabilities, format sustainability and more general project concerns.
But here’s the thing: our case histories don’t always follow our own guiding principles. And that’s just fine by us. None of us live in a utopian world where digital storage is abundant and systems are completely interoperable. We all have to make choices and compromises to work within our constraints. Uncompressed video files can be huge and a burden to manage and maintain. Lossy compression can be appropriate for certain projects. The guiding principles should all be read with the caveat “if you have the option….” Sometimes, you simply don’t have the option for a myriad of reasons. But when you do have the option, the guiding principles highlight the advantages of high quality data capture. The important take-away from the case histories project is that the choices made during the file creation process will have impacts on the long-term archiving and distribution processes, and it’s essential to understand what those impacts are and to have a plan to resolve any conflicts.
Our hope is that these guiding principles and case histories help us start to flesh out more specific format guidance for born digital video but that’s in the future. The case history project, which will be published on the Federal Agencies Digitization Guidelines Initiative website this spring, is the first step towards understanding where we are as a community and what we can learn from each other.
How do I know if a digital file/object has been corrupted, changed or altered? Further, how can I prove that I know what I have? How can I be confident that the content I am providing is in good condition, complete, or reasonably complete? How do I verify that a file/object has not changed over time or during transfer processes?
In digital preservation, a key part of answering these questions comes through establishing and checking the “fixity” or stability of digital content. At this point, many in the preservation community know they should be checking the fixity of their content, but how, when and how often?
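As a concrete illustration of the “how” in its simplest form, fixity checking usually means computing a cryptographic digest of each file and comparing it against a previously recorded value. Below is a minimal sketch in Python; the manifest format, file names and digest values are hypothetical and not prescribed by the NDSA guidance.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def check_fixity(manifest, base):
    """Return the files whose current digest no longer matches the stored one.

    `manifest` maps file names to digests recorded at ingest (hypothetical values).
    """
    changed = []
    for name, recorded in manifest.items():
        if sha256_of(base / name) != recorded:
            changed.append(name)
    return changed

manifest = {"item-001.tif": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"}
print(check_fixity(manifest, Path("/data/collection")))  # [] means no changes detected
```

How often to run such a check, and against which copies, is exactly the kind of decision the draft document is meant to help you make.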
A team of individuals from the NDSA Infrastructure & Standards working groups has developed Checking Your Digital Content: How, What and When to Check Fixity? in an effort to help stewards answer these questions in a way that makes sense for their organization, based on their needs and resources. We are excited to publicly share this draft document for broader open discussion and review here on The Signal. We welcome comments and questions; please post them at the bottom of this post for the working group to review.
Not Best Practices, but Guidance for Making Best Use of Resources at Hand
In keeping with work on the NDSA Levels of Digital Preservation, this document is not a benchmark or requirement. It is instead intended as a tool to help organizations develop a plan that fits resource constraints. Different systems and different collections are going to require different fixity checking approaches, and our hope is that this document can help.
Connection to National Agenda for Digital Stewardship
This guidance was developed in direct response to the need articulated in the infrastructure section of the inaugural National Agenda for Digital Stewardship. I’ll include it below at length for context.
Fixity checking is of particular concern in ensuring content integrity. Abstract requirements for fixity checking can be useful as principles, but when applied universally can actually be detrimental to some digital preservation system architectures. The digital preservation community needs to establish best practices for fixity strategies for different system configurations. For example, if an organization were keeping multiple copies of material on magnetic tape and wanted to check fixity of content on a monthly basis, they might end up continuously reading their tape and thereby very rapidly push their tape systems to the limit of reads for the lifetime of the medium.
There is a clear need for use-case driven examples of best practices for fixity in particular system designs and configurations established to meet particular preservation requirements. This would likely include description of fixity strategies for all spinning disk systems, largely tape-based systems, as well as hierarchical storage management systems. A chart documenting the benefits of fixity checks for certain kinds of digital preservation activities would bring clarity and offer guidance to the entire community. A document modeled after the NDSA Levels of Digital Preservation would be a particularly useful way to provide guidance and information about fixity checks based on storage systems in use, as well as other preservation choices.
Again, please share your comments on this here, and consider forwarding this on to others who you think might have comments to share with us.
The Web is constantly evolving over time. Web content such as text and images is updated frequently. One of the major problems encountered by archiving systems is understanding what happened between two different versions of a web page. We want to underline that the aim is not to compare two web pages like this (however, the tool can also do that):
but web page versions:
An efficient change detection approach is important for several reasons:
- Crawler optimization, by deciding on the fly whether a page should be crawled or not.
- Discovering new crawl strategies, e.g. based on patterns.
- Quality assurance for crawlers, for example by comparing the live version of a page with the just-crawled one.
- Detecting format obsolescence due to evolving technologies, e.g. checking whether web pages render identically across different versions of a browser or across different browsers.
- Archive maintenance: operations such as format migration can change the rendering of archived versions.
Pagelyzer is a tool containing a supervised framework that decides whether two web page versions are similar or not. Pagelyzer takes two URLs, two browser types (e.g. Firefox, Chrome) and one comparison type (image-based, hybrid or content-based) as input. If the browser types are not set, it uses Firefox by default.
It is based on two different technologies:
1 – Web page segmentation (let's keep the details for another blog post)
2 – Supervised Learning with Support Vector Machine(SVM).
In this blog post, I will try to explain simply (without any equations) what SVM does, specifically for pagelyzer. You have two URLs, let's say url1 and url2, and you would like to know if they are similar (1) or dissimilar (0).
You calculate the distance (or similarity) as a vector, based on the comparison type. If it is image-based, your vector will contain features related to images (e.g. SIFT, HSV). If it is content-based, your vector will contain features for text similarity (e.g. the Jaccard distance for links, images and words). To better explain how it works, let's assume that we have two dimensions (two features): one feature is SIFT and the other one is HSV, both visual descriptors.
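As an illustration of one such content-based feature, here is a minimal sketch of a Jaccard distance computed over two versions' sets of links (the same idea applies to words or image URLs). The variable names and example values are illustrative only, not taken from pagelyzer.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance: 1 minus |intersection| / |union|.
    0.0 for identical sets, 1.0 for disjoint ones."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

links_v1 = {"/home", "/about", "/news/1"}   # links extracted from version 1
links_v2 = {"/home", "/about", "/news/2"}   # links extracted from version 2
print(jaccard_distance(links_v1, links_v2))  # 0.5
```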
To make your system learn, you should provide it with annotated data at the beginning. In our case, we need a list of URL pairs <url1,url2> annotated manually as similar or not similar. For pagelyzer, this dataset is provided by the Internet Memory Foundation (IMF). With one part of your dataset (ideally 1/3) you train your system, and with the other part you test your results.
Let's start training:
First, you put all your vectors in the input space.
As this data is annotated, you know which pairs are similar (in green) and which are dissimilar (in red).
You find the optimal decision boundary (hyperplane) in input space. Anything above the decision boundary should have label 1 (similar). Similarly, anything below the decision boundary should have label 0 (dissimilar).
Your system is intelligent now! When you have a new pair of URLs without any annotation, you can say, based on the decision boundary, whether they are similar or not.
The pair of URLs shown in blue will be considered dissimilar, and the one in orange will be considered similar by pagelyzer.
When you choose different types of comparison, you choose different types of features and dimensions. The current version of pagelyzer uses an SVM trained with 202 pairs of web pages provided by IMF, 147 in the positive class and 55 in the negative class. As it is a supervised system, increasing the training set size will generally lead to better results.
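To make the training and prediction steps concrete, here is a minimal sketch of this kind of supervised learning using scikit-learn's SVM implementation. The feature values and tiny dataset are made up for illustration; this is not pagelyzer's actual training code.

```python
from sklearn import svm

# Each row is a feature vector for one pair of page versions,
# e.g. [SIFT-based similarity, HSV-based similarity] (made-up values).
X_train = [[0.92, 0.88], [0.85, 0.90], [0.20, 0.35], [0.15, 0.25]]
y_train = [1, 1, 0, 0]  # manual annotations: 1 = similar, 0 = dissimilar

classifier = svm.SVC(kernel="linear")  # finds the separating hyperplane
classifier.fit(X_train, y_train)

# A new, unannotated pair: compute its feature vector, then predict.
new_pair = [[0.80, 0.75]]
print(classifier.predict(new_pair))  # e.g. [1] -> similar
```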
An image to show what happens when you have more than two dimensions:
My two young teenage daughters spend hours playing Minecraft, building elaborate virtual landscapes and structures. They are far from alone; the game has millions of fans around the world. Teachers are seizing on Minecraft’s popularity with kids as a tool to teach both abstract and concrete subjects. What’s unique about this situation is not so much the product as that a virtual world is functioning as both a fun, engaging activity and a viable teaching tool. We’re witnessing the birth of a new genre of tools and a new set of challenges for preserving the digital creations people build with those tools.
Like most parents, I save many of the things that my daughters create. From where I’m sitting in my home as I write this blog post, I can see their works dotting the room. On one wall is a framed pencil sketch one daughter drew of our family; on a shelf is a perfect clay replica she made of Moomintroll. Hanging above a window are drawings my other daughter did — a Sharpie drawing of tree houses and a pen doodle of kaleidoscopic patterns that disappear into a tunnel-like vanishing point. Huge snowflakes (no two alike) that they cut from paper dangle here and there around the room.
I never gave much thought to their virtual gaming activities, aside from monitoring how much time they spend on their electronic devices. But I like that Minecraft lets my kids invent universes and play inside them together and I can tell that it feeds an important part of their intellectual growth as they make things, investigate things and solve problems. So I decided that I’d like to save what I can of the worlds they create, just as I save the rest of their crafts and artwork, which raised questions about what I can save, how I can save it and why I would even want to save it.
Over the last decade, the Library of Congress and its NDIIPP and NDSA partners have led the research into preserving virtual worlds, from military simulations to consumer games. Many of the questions – technological and philosophical – have long been asked and answered or at the least the challenges have been identified and defined. That’s fine for institutions that recognize the cultural value of virtual worlds and have the resources to archive them but what does it mean for a parent who just wants to save his or her kid’s virtual world creations?
A colleague at the Library of Congress, Trevor Owens, is part of the ongoing research on preserving digital worlds and preserving software. In fact, Owens is one of the organizers of the preserving software conference. He said that the solution to the question of saving something from virtual worlds depends on whether you want to save:
- the virtual world that you or someone else built
- testimony about what the virtual world meant to you or them at a particular time
- or documentation of the virtual world.
Preserving the virtual world itself is the most difficult and challenging option. The complexities of preserving virtual worlds are too much to go into in this blog post. And when it comes to talking about networked virtual worlds inhabited by live human participants, the subject often gets downright esoteric, like defining where “here” actually is and what “here” means in a shared virtual world and how telepresence applies to the virtual world experience. But to illustrate the basic technological dilemma of preserving a virtual world, here’s a simple example.
Let’s say I build an island, castle and estate in a virtual world and name it Balmy Island. If I want to save Balmy Island and be able to walk around it anytime I want to, I need all the digital files of which Balmy Island is constructed. I might need the exact version of the application or software that I used to build Balmy Island, as well as the exact operating system — and version of the OS — of the hardware device on which I built Balmy Island. And I might need the hardware device itself on which I created Balmy Island. So if I build Balmy Island on my computer, I have to preserve the computer, the software and the files just as they are. Never upgrade or modify anything. Just stick the whole computer in the closet, buy a new computer and pull out the old one whenever I wanted to revisit Balmy Island.
Another less-certain and less-authentic option is that I could save the Balmy Island files and hope that someday someone will build an emulator that will restore some approximate version of my original Balmy Island. It will not be exactly the same, but it might be close enough.
Saving the hardware and software for just this one purpose is unrealistic for the average person but for cultural institutions it makes perfect sense. Stanford University is the home of the Stephen M. Cabrinety Collection in the History of Microcomputing and it is also building a Forensics Lab with a library of software and electronic devices for extracting software from original media, so that it can be run later in native or emulated environments. Similar labs at other institutions include the Maryland Institute for Technology in the Humanities, the International Center for the History of Electronic Games at the Strong National Museum of Play and the UT Videogame Archive at the Dolph Briscoe Center for American History, University of Texas at Austin. The Briscoe Center was featured in the Signal post about video game music composer George Sanger. (Dene Grigar, who was the subject of another Signal blog post, created a similar lab devoted to her vintage electronic literature collection at Washington State University, Vancouver)
Henry Lowood, curator for History of Science & Technology Collections and Film & Media Collections in the Stanford University Libraries, was a lead in the Preserving Virtual Worlds project. Lowood has a historical interest in games, virtual worlds and their role in society, and he makes a case for the option of recording testimony about what a virtual world means to its users and builders.
Lowood helped create the Machinima and Virtual Worlds collections, which are hosted by our NDIIP/NDSA partner, the Internet Archive. These collections host video recordings of activities and events in virtual worlds and immersive games. As the users perform actions and navigate through the worlds, they sometimes give a running commentary about what is happening and their thoughts and observations about its meaning to them.
A parent or teacher could use this same approach by shooting a video of a child giving you a tour of their virtual world. It’s an opportunity to capture the context around their creation of the worlds and for them to tell you how they felt about it and what choices they made. If they interact with others in a shared virtual world, the child can describe his or her interactions and maybe even relate anecdotes about certain events and experiences.
Screenshots are easy to take on computers and most hand-held devices. PCs have a “print screen” button on the keyboard; for Macs, hold down the Apple key ⌘ plus shift plus 3. For iPods, press and hold the main button below the screen and the power button on the top edge of the device at the same time. And so on. Search online for how to take screen shots or screen captures for your device.
The screenshot will save as a graphic file, usually a JPEG or PNG file. Transfer that JPEG to your computer, crop it and modify it with a photo processing program if you want. Maybe print the screen shots and put them on the refrigerator for you to admire. When you’re finished with the digital photo file, back it up with your other personal digital archives.
Recording a walk through of a virtual world can be a slightly more complex task than taking a screenshot but not terribly so. Search online for “screencast software,” “motion capture” or “screen recording” to find commercial and freeware screencast software. Even version 10 of the QuickTime player includes a screen recording function. They all pretty much operate the same way: click a “Record” button, do your action on the computer and click “Stop” when you are finished. Everything that was displayed on the screen will be captured into a video file.
With the different screen capture software programs, be aware of the video file type that the software generates. QuickTime saves the video as an MOV file, Jing saves the video as an SWF file and so on. Different file types require different digital video players, so if you have any difficulty playing the file back on your computer search online to find the software that will play your video file type. If you upload a copy of your video to YouTube, backup a master copy somewhere else. Don’t rely on the YouTube version as your master “archived” copy.
Although this story is about the challenges of saving mementos from digital virtual worlds, the essence of the challenge — trying to preserve an experience — is not new. If I go to Hawaii, snorkel, build sand castles and have the time of my life, I cannot capture or hold onto that experience. I can only document the experience with photos, video and maybe write in a journal about it. In a way, it even goes back to the dawn of humanity, where people recorded their experiences by means of cave paintings.
So you cannot capture the experience of a virtual world but you can document it. And virtual worlds are a lot more accessible in 2014 than they were in 1990. It’s a long way from Jaron Lanier‘s work, from VPL labs and data gloves and headsets and exclusive access in special labs. Kids now carry their personalized virtual worlds in their handheld devices. Minecraft is just the current cool tool. Who can tell what is yet to come?
It seems appropriate to let Howard Rheingold have the last word on the subject. Rheingold is a writer, teacher, social scientist and thought-leader about the cultural impacts of technology. He is also an authority on virtual reality and virtual communities, having written the definitive books about both topics over twenty years ago. His current book is titled NetSmart.
In addition to his professional expertise, Rheingold is a caring father who dotes on his daughter. While he was researching and writing the books Virtual Reality (1991) and Virtual Communities: Homesteading on the Electronic Frontier (1994), his office walls were filled with her childhood artwork (she is now in her 20s). He brings a unique and authoritative perspective to this story.
Rheingold said, “I’ve been closely observing and writing about innovations in digital media and learning in recent years – and experiencing/experimenting directly through the classes I teach at Stanford and Rheingold U. Among my activities in this sphere is a video blog for DMLcentral, a site sponsored by the MacArthur Foundation’s Digital Media and Learning Initiative. It was there that I delved into the educational uses – and students and teachers’ passion for – Minecraft.
“In my interviews with teachers Liam O’Donnell and Sara Kaviar, it became clear that Minecraft was about much more than using computers to build things. It was a way to engage with a diverse range of abstract subject matter in concrete ways, from comparative religion to mathematics, and more importantly, a way for students to exercise agency in a schooling environment in which so much learning is dependent on what the teacher or textbook says.
“Minecraft artifacts are also important contributions to student e-portfolios, which will become more important than resumes in the not too distant future. Given the growing enthusiasm over Minecraft by students, teachers, and parents, and the pedagogical value of seeing these creations as artifacts and instruments of learning, it only makes sense to make it easy and inexpensive to preserve virtual world creations.”
The February issue of the Library of Congress Digital Preservation Newsletter (pdf) is now available!
Included in this issue:
- Spotlight on Digital Collections, including an interview with Lisa Green on Machine Scale Analysis of collections, and a look at the Cultural Heritage of the Great Smoky Mountains
- Digital Preservation Aid in Response to Tornado
- NDSA Digital Content Area: Web and Social Media
- Wikipedia and Digital Preservation
- AV Artifact Atlas, FADGI interview with Hanna Frost
- Several updates on the Residency Program
- Listing of upcoming events including the IDCC (Feb 24-27), Digital Maryland conference (March 7), Computers in Libraries (April 7-10), Personal Digital Archiving 2014 (April 10-11)
- And other articles about data, preservation of e-serials, and more.
To subscribe to the newsletter, sign up here
We’ve started planning our annual meeting, Digital Preservation 2014, which will be held July 22-24 in the Washington, DC area, and we want to hear from you! Any organization or individual with an interest in digital stewardship can propose ideas for potential inclusion in the meeting.
The Library of Congress has hosted annual meetings with digital preservation partners, collaborators and others committed to stewardship of digital content for the past ten years. The meetings have served as a forum for sharing achievements in the areas of technical infrastructure, innovation, content collection, standards and best practices and outreach efforts.
This year we’ve expanded participation from NDSA member organizations on the program committee. We’re delighted to have NDIIPP staff and NDSA members working together to contribute to the success of the meeting.
Digital Preservation 2014 Program Committee
- Vickie Allen, PBS Media Library
- Meghan Banach Bergin, University of Massachusetts Amherst
- Erin Engle, NDIIPP
- Abbie Grotke, NDIIPP
- Barrie Howard, NDIIPP
- Butch Lazorchak, NDIIPP
- Vivek Navale, U.S. National Archives and Records Administration
- Michael Nelson, Old Dominion University
- Trevor Owens, NDIIPP
- Abbey Potter, NDIIPP
- Nicole Scalessa, The Library Company of Philadelphia
Call for Proposals
We are looking for your ideas, accomplishments and project updates that highlight, contribute to, and advance the community dialog. Areas of interest include, but are not limited to:
- Scientific data and other content at risk of obsolescence, and what methods, techniques, and tools are being deployed to mitigate risk;
- Innovative methods of digital preservation, especially regarding sustainable practices, community approaches, and software solutions;
- Collaboration successes and lessons learned highlighting a wide-range of digital preservation activities, such as best practices, open source solutions, project management techniques and emerging tools;
- Practical examples of research and scholarly use of stewarded data or content;
- Educational trends for emerging and practicing professionals.
You are invited to express your interest in any of the following ways:
- Panels or presentations
- 5-minute lightning talks
A highlight of this past year was the release of the 2014 National Digital Stewardship Agenda at Digital Preservation 2013. The Agenda integrates the perspective of dozens of experts to provide funders and decision-makers with insight into emerging technological trends, gaps in digital stewardship capacity and key areas for development. It suggests a number of important sets of issues for the digital stewardship community to consider prioritizing for development. We’d be particularly interested to have you share projects your organization has undertaken in the last year that address any of the issues listed in the Agenda.
To be considered, please send 300 words or less describing what you would like to present to ndiipp [at] loc.gov by March 14. Accepted proposals will be notified on or around April 3.
The last day of the meeting, July 24, will be a CURATEcamp, which will take place off-site from the main meeting venue. The topic focus of this camp is still under discussion, so stay tuned for more information about the camp in the coming weeks.
Please let us know if you have any questions. Your contributions are important in making this a community program and we’re looking forward to your participation.
EDRMS across New Zealand’s Government – Challenges with even the most managed of records management systems!
First things first. The Github repository with the Audio QA workflows is here: https://github.com/statsbiblioteket/scape-audio-qa. And version 1 is working. Version is really all wrong here. I should call it Workflow 1, which is this one:
To sum up, this workflow performs migration, conversion and content comparison. The top left box (a nested workflow) migrates a list of mp3s to wav files in a Hadoop map-reduce job using the command-line tool FFmpeg, and outputs a list of migrated wav files. The top right box converts the same list of mp3s to wav files in another Hadoop map-reduce job using the command-line tool mpg321, and outputs a list of converted wav files. The Taverna workflow then puts the two lists of wav files together, and the bottom box receives a list of pairs of wav files to compare. The bottom box compares the content of the paired files in a Hadoop map-reduce job using the xcorrSound waveform-compare command-line tool, and outputs the results of the comparisons.
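Stripped of Taverna and Hadoop, the per-file logic of the workflow could be sketched roughly like this in Python. The command-line invocations, in particular the one for xcorrSound waveform-compare, are assumptions based on the tools' usual usage and are not taken from the scape-audio-qa workflows.

```python
import subprocess
from pathlib import Path

def migrate_with_ffmpeg(mp3: Path, out_dir: Path) -> Path:
    """Migrate one mp3 to wav with FFmpeg."""
    wav = out_dir / (mp3.stem + "_ffmpeg.wav")
    subprocess.run(["ffmpeg", "-y", "-i", str(mp3), str(wav)], check=True)
    return wav

def convert_with_mpg321(mp3: Path, out_dir: Path) -> Path:
    """Convert the same mp3 to wav with mpg321 (-w writes wav output)."""
    wav = out_dir / (mp3.stem + "_mpg321.wav")
    subprocess.run(["mpg321", "-w", str(wav), str(mp3)], check=True)
    return wav

def same_content(wav_a: Path, wav_b: Path) -> bool:
    """Compare audio content with xcorrSound waveform-compare
    (assumed invocation: waveform-compare <fileA> <fileB>)."""
    result = subprocess.run(["waveform-compare", str(wav_a), str(wav_b)])
    return result.returncode == 0

out_dir = Path("out")
out_dir.mkdir(exist_ok=True)
for mp3 in sorted(Path("input").glob("*.mp3")):
    ok = same_content(migrate_with_ffmpeg(mp3, out_dir), convert_with_mpg321(mp3, out_dir))
    print(mp3.name, "content match" if ok else "content MISMATCH")
```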
What we would like to do next is:
- "Reduce" the output of the Hadoop map-reduce job using the waveform-compare commandline tool
- Do an experiment on 1TB input mp3 files on the SB Hadoop cluster, and write an evaluation and a new blog post ;-)
- Extend the workflow with property comparison. The waveform-compare tool only compares sound waves; it does not look at the header information. This should be part of a quality assurance of a migration. The reason this is not top priority is that FFprobe property extraction and comparison is very fast, and will probably not affect performance much...
The following is a guest post by Julia Blase, National Digital Stewardship Resident at the National Security Archive.
In case you hadn’t heard, the ALA Midwinter Meeting took place in Philadelphia last weekend, attended by around 12,000 librarians and exhibitors. If you didn’t attend, or didn’t have friends there to take notes for you, the Twitter hashtag #alamw14 has it covered – enough content for days of exploration! If you’d like to narrow your gaze, and in the theme of this post, you could refine your search for tweets containing both #alamw14 and #NDSR, because the National Digital Stewardship Residents were there in force, attending and presenting.
Emily Reynolds, the Resident at the World Bank, was so kind as to compile a list of the sessions we aimed to attend before the conference. On Saturday, though none of us made it to every event, at least a few of us were at the Preservation Administrators Interest Group, Scholarly Communications Interest Group, Digital Conversion Interest Group, Digital Special Collections Discussion Group and Challenges of Gender Issues in Technology sessions.
The first session I attended, along with Lauren Work and Jaime McCurry, was the Digital Conversion Interest Group session, where we heard fantastic updates on audiovisual digital conversion practices and projects from the American Folklife Center, the American Philosophical Society library, Columbia University Libraries and George Blood Audio and Video. I particularly enjoyed hearing about the successful APS attempt to digitize audio samples of Native American languages, many of which are endangered and the positive reaction from the Native community. For audio, it seemed, sometimes digitization is the best form of preservation!
The second session I attended, with Emily Reynolds and Lauren Work, was the Gender Issues in Technology discussion group (see news for it at #libtechgender). We were surprised, but pleased, at the number of attendees and quality of the discussion around ways to improve diversity in the profession. Among the suggestions we heard were to include diverse staff members on search committees, to monitor the language within your own organization when you review candidates to ensure that code words like “gravitas” (meaning “male,” according to the panelists) aren’t being used to exclude groups of candidates, to put codes of conduct into place to help remind everyone of a policy of inclusiveness, and to encourage employees to respond positively to mentorship requests, especially from members of minority groups (women, non-white, not traditionally gendered). The discussion seemed to us to be a piece of a much larger, evolving, and extended conversation that we were glad to see happening in our professional community!
On Sunday, though a few of us squeezed in a session or two, our primary focus was our individual project update presentations, given at the Digital Preservation Interest Group morning session, and also our extended project or topic presentations at the Library of Congress booth in the early afternoon. The individual presentations, I’m pleased to say, went very well! It would be impossible to recap each presentation here; however, many of us have posted project updates recently, so please be sure to look us up for details. Furthermore, searching Twitter for #alamw14 and #NDSR brings you to this list, in which you can find representative samples of the highlights from our individual presentations.
Presentations – Question and Answer Session
We concluded the session by taking some questions, all of which were excellent – particularly the one from Howard Besser, who wanted to know how we believed our projects (or any resident or fellowship temporary project) could be carried on at the conclusion of our project term. The general response was that we are doing our best to ensure they are continued by integrating the projects, and ourselves, into the general workflows of our organizations – keeping all stakeholders informed from an early stage of our progress, finding support from other divisions, and documenting all of our decisions so that any action may be picked up again as easily as possible.
We also had an excellent question about how important networking had been for the success of our projects, and all agreed that, while networking with the D.C. community has been essential (through our personal efforts and also through groups like the DCHDC meetup), almost more significant has been our ability to network with each other – to share feedback, resources, documents, websites, and connections to other networks, which has helped us accomplish our goals more efficiently and effectively. One of the goals of the NDSR program was, of course, to help institutions get valuable work done in the area of digital stewardship, which we are all doing. However, another goal was for the program to help build a professional community in digital stewardship. What is a community if not a group of diverse professionals who trust and rely on each other, who share successes and setbacks, resources and networks, and who support each other as we learn and grow? Though the language is my own, the sentiment is one I heard shared between us over and over during the ALA weekend.
NDSR Recent Activity
In recent news, Emily Reynolds and Lauren Work both discuss their take on our ALA experience, Emily’s here and Lauren’s here. Molly Swartz published some pictures and thoughts on ALA Midwinter over here. Jaime McCurry recently interviewed Maureen McCormick-Harlow about her work at the National Library of Medicine. And to conclude, I’ve recently posted two updates on my project, one on this page and another courtesy of the Digital Libraries Federation.
Thanks for listening, and be sure to tune in two weeks from now when Maureen McCormick-Harlow will be writing another NDSR guest post. If you, like us, were at ALA Midwinter last weekend, I hope you found it as enjoyable as we did!
One of my first blogs here covered an evaluation of a number of format identification tools. One of the more surprising results of that work was that out of the five tools that were tested, no less than four of them (FITS, DROID, Fido and JHOVE2) failed to even run when executed with their associated launcher script. In many cases the Windows launcher scripts (batch files) only worked when executed from the installation folder. Apart from making things unnecessarily difficult for the user, this also completely flies in the face of all existing conventions on command-line interface design. Around the time of this work (summer 2011) I had been in contact with the developers of all the evaluated tools, and until last week I thought those issues were a thing of the past. Well, was I wrong!

FITS 0.8
Fast-forward 2.5 years: this week I saw the announcement of the latest FITS release. This got me curious, also because of the recent work on this tool as part of the FITS Blitz. So I downloaded FITS 0.8, installed it in a directory called c:\fits\ on my Windows PC, and then typed (while in directory f:\myData\):

f:\myData>c:\fits\fits
Instead of the expected helper message I ended up with this:

The system cannot find the path specified.
Error: Could not find or load main class edu.harvard.hul.ois.fits.Fits

Hang on, I've seen this before ... don't tell me this is the same bug that I already reported 2.5 years ago? Well, turns out it is after all!
This got me curious about the status of the other tools that had similar problems in 2011, so I started downloading the latest versions of DROID, JHOVE2 and Fido. As I was on a roll anyway, I gave JHOVE a try as well (even though it was not part of the 2011 evaluation). The objective of the test was simply to run each tool and get some screen output (e.g. a help message), nothing more. I did these tests on a PC running Windows 7 with Java version 1.7.0_25. Here are the results.

DROID 6.1.3
First I installed DROID in a directory C:\droid\. Then I executed it using:

f:\myData>c:\droid\droid
This started up a Java Virtual Machine Launcher that showed this message box:
The Running DROID text document that comes with DROID says:
To run DROID on Windows, use the "droid.bat" file. You can either double-click on this file, or run it from the command-line console, by typing "droid" when you are in the droid installation folder.
So, no progress on this for DROID either, then. I was able to get DROID running by circumventing the launcher script like this:

java -jar c:\droid\droid-command-line-6.1.3.jar
This resulted in the following output:

No command line options specified
This isn't particularly helpful. There is a helper message, for which you have to give the -h flag on the command line – but you only find that out once you already know about the -h flag. Catch 22, anyone?

JHOVE2 2.1.0
After installing JHOVE2 in c:\jhove2\, I typed:

f:\myData>c:\jhove2\jhove2
This gave me 1393 (yes, you read that right: 1393!) Java deprecation warnings, each along the lines of:

16:51:02,702 [main] WARN TypeConverterDelegate : PropertyEditor [com.sun.beans.editors.EnumEditor] found through deprecated global PropertyEditorManager fallback - consider using a more isolated form of registration, e.g. on the BeanWrapper/BeanFactory!
This was eventually followed by the (expected) JHOVE2 help message, and a quick test on some actual files confirmed that JHOVE2 does actually work. Nevertheless, by the time the tsunami of warning messages is over, many first-time users will have started running for the bunkers!

Fido 1.3.1
Fido doesn't make use of any launcher scripts any more, and the default way to run it is to use the Python script directly. After installing in c:\fido\ I typed:

f:\myData>c:\fido\fido.py
Which resulted in ..... (drum roll) ... a nicely formatted Fido help message, which is exactly what I was hoping for. Beautiful!

JHOVE 1.11
I installed JHOVE in c:\jhove\ and then typed:

f:\myData>c:\jhove\jhove
Which resulted in this:

Exception in thread "main" java.lang.NoClassDefFoundError: edu/harvard/hul/ois/jhove/viewer/ConfigWindow
        at edu.harvard.hul.ois.jhove.DefaultConfigurationBuilder.writeDefaultConfigFile(Unknown Source)
        at edu.harvard.hul.ois.jhove.JhoveBase.init(Unknown Source)
        at Jhove.main(Unknown Source)
Caused by: java.lang.ClassNotFoundException: edu.harvard.hul.ois.jhove.viewer.ConfigWindow
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        ... 3 more
I limited my tests to a Windows environment only, and results may well be better under Linux for some of these tools. Nevertheless, I find it nothing less than astounding that so many of these (often widely cited) preservation tools fail to even execute on today's most widespread operating system. Granted, in some cases there are workarounds, such as tweaking the launcher scripts, or circumventing them altogether. However, this is not an option for less tech-savvy users, who will simply conclude "Hey, this tool doesn't work", give up, and move on to other things. Moreover, this means that much of the (often huge) amounts of development effort that went into these tools will simply fail to reach its potential audience, and I think this is a tremendous waste. I'm also wondering why there's been so little progress on this over the past 2.5 years. Is it really that difficult to develop preservation tools with command-line interfaces that follow basic design conventions that have been ubiquitous elsewhere for more than 30 years? Tools that just work?
Here’s a simple experiment that involves asking an average person two questions. Question one is: “how do you feel about physical books?” Question two is: “how do you feel about digital data?”
The first question almost surely will quickly elicit warm, positive exclamations about a life-long relationship with books, including the joy of using and owning them as objects. You may also hear about the convenience of reading on an electronic device, but I’ll wager that most people will mention that only after expounding on paper books.
The second question shifts to cooler, more uncertain ground. The addressee may well appear baffled and request clarification. You could help the person a bit by specifying digital materials of personal interest to them, such as content that resides on their tablet or laptop. “Oh, that stuff,” they might say with measured relief. “I’m glad it’s there.”
These divergent emotional reactions should be worrying to those of us who are committed to keeping digital cultural heritage materials accessible over time. Trying to make a case for something that lacks emotional resonance is difficult, as marketing people say. Most certainly, the issue of limited resources is a common refrain when it comes to assessing the state of digital preservation in cultural heritage institutions; see the Canadian Heritage Information Network’s Digital Preservation Survey: 2011 Preliminary Results, for example.
The flip side is that traditional analog materials are a formidable competitor for management resources because those materials are seen in a glowing emotional context. I don’t mean to say that analog materials are awash in preservation money; far from it. But physical collections still have to be managed even as the volume of digital holdings rapidly rises, and efforts to move away from reliance on the physical are vulnerable to impassioned attack by people such as Nicholson Baker.
What is curious is that even as we collectively move toward an ever deeper relationship with digital, there remains a strong nostalgic bond with traditional book objects. A perfect example of this is a recent article, Real books should be preserved like papyrus scrolls. The author fully accepts the convenience and the future dominance of ebooks, and is profoundly elegiac in his view of the printed word. But, far from turning away from physical books, he declares that “books have a new place as sacred objects, and libraries as museums.” One might see this idea as one person’s nostalgic fetish, but it’s more than that. We can only wonder how long and to what extent this kind of powerful, emotionally-propelled thinking will drive how cultural heritage institutions operate, and more importantly, how they are funded.
As I’ve written before, we’re at a point where intriguing ideas are emerging about establishing a potentially deeper and more meaningful role for digital collections. This is vitally important, as a fundamental challenge that lies before those who champion digital cultural heritage preservation is how to develop a narrative that can compete in terms of personal meaning and impact.
Anyone willing to preserve digital content must be aware of events that might constitute a relevant risk. In SCAPE we are developing tools that will allow you to detect risks before they cause any irreversible damage.
Help us understand which preservation events, threats and opportunities you find most relevant, and the ways you would like us to detect them.
Participate in our survey and help us develop tools that would help you to automatically detect problems in your own content, and events that might put it at risk.
The survey has 30 short questions that should take about 10 minutes to complete.

Join the survey now! http://survey.scape-project.eu/index.php/862812/lang-en
How do we make digital collections available at scale for today’s scholars and researchers? Lisa Green, director of Common Crawl, tackled this and related questions in her keynote address at Digital Preservation 2013. (You can view her slides and watch a video of her talk online.) As a follow up to ongoing discussions of what users can do with dumps of large sets of data, I’m thrilled to continue exploring the issues she raised in this insights interview.
Trevor: Could you tell us a bit about Common Crawl? What is your mission, what kinds of content do you have and how do you make it available to your users?
Lisa: Common Crawl is a non-profit organization that builds and maintains an open repository of web crawl data that is available for everyone to access and analyze. We believe that the web is an incredibly valuable dataset capable of driving innovation in research, business, and education and that the more people that have access to this dataset, the greater the benefit to society. The data is stored on public cloud platforms so that anyone with access to the internet can access and analyze it.
Trevor: In your talk, you described the importance of machine scale analysis. Could you define that term for us and give some examples of why you think that kind of analysis is important for digital collections?
Lisa: Let me start by describing human scale analysis. Human scale analysis means that a person ingests information with their eyes and then processes and analyzes it with their brain. Even if several people – or even hundreds of people – work on the analysis, it is not as fast as a computer program can ingest, process, and analyze information. Machine scale analysis is when a computer program does the analysis. A computer program can analyze data millions to billions of times faster than a human. It can run 24 hours a day with no need for rest and it can simultaneously run on multiple machines.
Machine scale analysis is important for digital collections because of the massive volume of data in most digital collections. Imagine that a researcher wanted to study the etymology of a word and planned to use a digital collection to answers questions such as:
- What is the first occurrence of this word?
- How did the frequency of occurrence change over time?
- What type of publication did it first appear in?
- When did it first appear in other types of publications and how did the types of publications it appeared in change over time?
- What other words most commonly appear in the same sentence, paragraph or page with the word and how did that change over time?
Answering such questions using human scale analysis would take lifetimes of man hours to search the collection for the given word. Machine scale analysis could retrieve the information in seconds or minutes. And if the researcher wanted to make changes in the questions or criteria, only a small amount of effort would be required to alter the software program; the program could then be rerun and return the new information in seconds or minutes. If we want to optimize the extraction of knowledge from the enormous amounts of data in digital collections, human analysis is simply too slow.
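As a toy illustration of this kind of machine scale analysis, the sketch below scans a small collection for a word and tallies its frequency of occurrence by year, which directly answers the first two questions above. The collection format (records with a year and some text) is purely hypothetical; a real job over web-scale data would apply the same logic in parallel, for example as a map-reduce job.

```python
import re
from collections import Counter

# Hypothetical collection: each record has a publication year and full text.
collection = [
    {"year": 1998, "text": "The portal linked every page it could find."},
    {"year": 2003, "text": "Blog posts and blog comments were everywhere."},
    {"year": 2003, "text": "Another blog appeared that year."},
]

def frequency_by_year(records, word):
    """Count occurrences of `word` per publication year."""
    counts = Counter()
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    for record in records:
        counts[record["year"]] += len(pattern.findall(record["text"]))
    return dict(counts)

print(frequency_by_year(collection, "blog"))
# {1998: 0, 2003: 3} -> earliest occurrence and change over time fall out directly
```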
Trevor: What do you think libraries, archives and museums can learn from Common Crawl’s approach?
Lisa: I think it is of crucial importance to preserve data in a format that can be analyzed by computers. For instance, if material is stored as a PDF, it is difficult – and sometimes impossible – for software programs to analyze the material, and therefore libraries, archives and museums will be limited in the amount of information that can be extracted from it in a reasonable amount of time.
Trevor: What kind of infrastructure do you think libraries, archives and museums need to have to be able to provide capability for machine scale analysis? Do you think they need to be developing that capacity on their own systems or relying on third party systems and platforms?
Lisa: The two components are storage and compute capacity. When one thinks of digital preservation, storage is always considered but compute capacity is not always considered. Storage is necessary for preservation and the type of storage system influences access to the collection. Compute capacity is necessary for analysis. Building and maintaining the infrastructure for storage and compute can be expensive, so it doesn't make much financial sense for each organization to develop its own.
One option would be a collaborative, shared system built and used by many organizations. This would allow the costs to be shared, avoid duplicative work and storing duplicate material, and – perhaps most importantly – maximize the number of people who have access to the collections.
Personally I believe a better option would be to utilize existing third party systems and platforms. This option avoids the cost of developing custom systems and often makes it easier to maintain or alter the system as there is a greater pool of technologists familiar with the popular third party platforms.
I am a strong believer in public cloud platforms because there is no upfront cost for the hardware, no need to maintain or replace hardware, and one only pays for the storage and compute that is used. I think it would be wonderful to see more libraries, museums, and archives storing copies of their collections on public cloud platforms in order to increase access. The most interesting use of your data may be thought of by someone outside your organization, and the more people who can access the data, the more minds can work to find valuable insight within your data.
Interface, Exhibition & Artwork: Geocities, Deleted City and the Future of Interfaces to Digital Collections
In 2009, a band of rogue digital preservationists called Archive Team did their best to collect and preserve Geocities. The resulting data has become the basis for at least two works of art: Deleted City and One Terabyte of Kilobyte Age. I think the story of this data set and these works offers insights into the future roles of cultural heritage organizations and their collections.
Let Them Build Interfaces
In short, Archive Team collected the data and made the dataset available for bulk download. If you like, you can also access just the 51,000 MIDI music files from the data set via the Internet Archive. Beyond that, because the data was available in bulk, the corpus of personal websites became the basis for other works. Taking the Geocities data as a basis, Richard Vijgen’s Deleted City interprets and presents an interface to the data, and Olia Lialina & Dragan Espenschied’s One Terabyte of Kilobyte Age is in effect a designed reenactment grounded in an articulated approach to accessibility and authenticity.
An Artwork as the Interface to Your Collection
Some of the most powerful ways to interact with the Geocities collection are through works created by those who have access to the collection as a dataset. Working with digital objects means we don’t need to define in advance the way they will be accessed or made available. By making the raw data available on the web and providing a point of reference for the data set, anyone is enabled to create interfaces to it.
How to make digital collections and objects available?
Access remains the burning question for cultural heritage organizations interested in the acquisition and preservation of digital artifacts and collections. What kinds of interfaces do they need in place to serve what kinds of users? If you don’t know in advance how you will make something available, what can you do with it? I’ve been in discussions with staff from a range of cultural heritage organizations who don’t really want to wade too deep into acquiring born-digital materials without having a plan for how to make them available.
The story of Geocities, Archive Team and these artists suggests that if you can make the data available, you can invite others to invent the interfaces. If users can help figure out and develop modes of access, as illustrated in this case, then cultural heritage organizations could potentially invite much larger communities of users to help figure out issues around migration and emulation as modes of access as well. By making the content broadly available, organizations can broaden the network of people who might contribute to efforts to make digital artifacts accessible into the future.
Collections and Interfaces Inside and Outside
An exciting model can emerge here. Through data dumps of full sets of raw data, cultural heritage organizations can embrace the fact that they don’t need to provide the best interface, or for that matter much of any interface at all, for digital content they agree to steward. Instead, a cultural heritage organization can agree to acquire materials or collections that are considered interesting and important, but that it doesn’t necessarily have the resources or inclination to build sophisticated interfaces to, provided it is willing to simply give the data a canonical home, offer information about its provenance, and invest in dedicated ongoing bit-level preservation. This approach would resonate quite strongly with a More Product, Less Process approach to born-digital archival materials.
An Example: 4Chan Collection/Dataset @ Stanford
For a sense of what it might look like for a cultural heritage organization to do something like this, we need look no further than a recent Stanford University Library acquisition. The recent acquisition of a collection of 4Chan data into Stanford’s digital repository shows how a research library could go about exactly this sort of activity. The page for the data set/collection briefly describes the structure of the data and gives some information and context about the collector who offered it to Stanford. Stanford acts as the repository and makes the data available for others to explore, manipulate and create a multiplicity of interfaces to. How will others explore or interface with this content? Only time will tell. In any event, it likely did not take many resources to acquire and it will likely not require much to maintain at a basic level into the future.
How to encourage rather than discourage this?
If we wanted to encourage this kind of behavior, how would we do it? First off, I think we need more data dumps for this kind of data, with the added note that bite-size downloadable chunks of data are going to be the easiest thing for any potential user to right-click and save to their desktop. Beyond that, cultural heritage organizations could embrace this example and put up prizes and bounties for artists and designers to develop and create interfaces to different collections.
What I think is particularly exciting here is that, by letting go of the requirement to provide the definitive interface, cultural heritage organizations could focus more on selection and on working to ensure the long-term preservation and integrity of data. Who knows, some of the interfaces others create might be such great works of art that another cultural heritage organization might feature them in its own database of works.
Last spring, I attended a Hackathon at the University of Leeds, which resulted in my getting a SPRUCE Grant for a month’s work enhancing FITS, a tool which at the time was technically open source but which the Harvard Library treated a bit possessively. After I finished, it seemed for a while that nothing was happening with my work, but it was just a matter of being patient enough. Collaboration between Harvard and the Open Planets Foundation has resulted in a more genuinely open FITS, which now has its own website. There’s also a GitHub repository with five contributors, none of whom are me, since my work was on an earlier repository that was incorporated into this one.
It really makes me happy to see my work reach this kind of fruition, even if I’m so busy on other things now that I don’t have time to participate.
In western North Carolina, in the foothills of the Great Smoky Mountains, rests a boulder covered in prehistoric petroglyphs attributed to the Native Americans who have resided in the area for thousands of years. Experts debate the specific origin and meaning of the glyphs but the general interpretation describes Judaculla, a human-like giant with supernatural powers, who protects the Cherokee Nation and the land that nourishes and supports them. This cultural record of Cherokee society, called Judaculla Rock, has been accessible for millennia because it is recorded in stone. With protection and preservation, it might continue to be accessible for thousands of years to come.
A few miles away, at Western Carolina University (which was built on the site of a Cherokee village) in the town of Cullowhee, Anna Fariello has helped create digital cultural records of Cherokee society, which she has preserved and made accessible online as the Cherokee Traditions collection. Given the potential longevity of digital collections, Cherokee Traditions — with protection and appropriate preservation — might be accessible for many years to come. Maybe even as long as Judaculla Rock.
Fariello, the head of Digital Initiatives at Western Carolina University’s Hunter Library, does not limit her preservation work to the Cherokee culture. She is trying to digitally preserve as much of the rich cultural heritage of the western North Carolina Smoky Mountain region as she can and make those collections available online.
She spent the early part of her career creating exhibits for museums, which is evident in how she stages each of Hunter Library’s online collections. But the transition from displaying material objects in a museum to displaying digital objects online did not happen quickly for Fariello. Creating an appealing online collection involved more than just displaying photos and text in a browser; it required conceptualizing and planning for the browser medium and the user experience. The process also required some trial and error. For example, she points out the text-heaviness of Hunter Library’s first online collection, Craft Revival, and notes that with each collection they moved further away from dense explanatory text toward showcasing the richness of the cultural artifacts, within the limitations and the possibilities of the medium.
Soon after Fariello started working at Hunter Library in 2005, she began scouting her community for primary source material for possible collections to put online. There was the Craft Revival, of course. Cherokee culture was also an obvious choice. “When I first moved here, I knew of the Cherokee people here but I didn’t realize we were at the seat of the Cherokee homeland,” said Fariello. “That collection developed out of my growing awareness of that and reaching out to our partners, the Museum of the Cherokee Indian and Qualla Arts and Crafts, the Cherokee’s artisan guild. The project won a major recognition last year from the Association of Tribal Archives, Libraries and Museums.”
Each digital collection that she developed presented a new challenge. The Western Carolina University Herbarium seemed promising and uncomplicated because the content — 100,000 plant specimens — is archived at the university. And while the herbarium was historically relevant (the specimens, collected over 150 years, include some from the decimated American Chestnut tree), funding was a challenge because it is a natural history collection and the grants that Fariello was pursuing applied to cultural history collections.
She traveled throughout Appalachia — county to county, museum to museum, library to library — to talk with archivists and librarians and gather material. When Fariello researched content for Hunter Library’s Great Smoky Mountains exhibit, she found very few historic photos and digitized artifacts online relating to the Great Smoky Mountains National Park. When she went to the national park to assess its collections, she discovered that they had many well-preserved photos and artifacts but they had no plans to put them online. “Digitization is outside the scope of what they can do in the current economic climate,” said Fariello. She took that as a confirmation that Hunter Library should digitize the Great Smoky Mountains National Park materials and develop a digital collection.
Some of the collections Fariello digitized were not organized to begin with. “There’s quite a bit of curating that needs to happen with those,” said Fariello. “How to tell a coherent story and find the important aspects of that story. How to figure out what to leave out in order to build a strong collection.”
Fariello gave a presentation last fall at the American Folklife Center’s Cultural Heritage Archives Symposium. During the presentation she spoke about how Hunter Library acquired and archived a unique oral history collection through serendipity and rescued it from possible digital loss. She said she was approached for an interview for Stories of Mountain Folk, a highly polished radio show produced near Cullowhee. She was impressed by the mission of the show, the professionalism of the interviewers and the show’s high production values. When the producers told her they record the show digitally and it had been around for five years, Fariello’s digital-preservation instincts kicked in. She said, “I asked them, ‘You’ve been doing this for five years? Where are all the sound files?’ And the answer was, ‘On GoDaddy.’ I was surprised, to say the least.”
Fariello immediately began to make arrangements to archive the program at the university, which resulted in Hunter Library hosting the Stories of Mountain Folk collection. Hunter Library’s website describes the collection as, “Over 200 half-hour and hour-long recordings capture ‘local memory’ detailing traditions, events, and the life stories of mountain people. A wide range of interviewees include down-home gardeners, herbalists, and farmers, as well as musicians, artists, local writers, and more.”
Except for the digital audio files from Stories of Mountain Folk, most of the digital files in Hunter Library’s digital repository are photographs and documents. The library’s Digital Production Team scans each photograph and document as a 600 dpi TIFF master copy for preservation; these TIFFs reside on servers at Western Carolina University and are also backed up onto gold CDs. The team also creates a 300 dpi JPEG copy of each scan to display online in the collections. They enter the related metadata into a database.
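As an illustration of that master/access-copy pattern, the sketch below derives a 300 dpi JPEG from each 600 dpi TIFF master using the Pillow library. The folder names are hypothetical, and this is only a rough sketch of the idea rather than Hunter Library's actual production workflow.

```python
from pathlib import Path
from PIL import Image  # Pillow

MASTERS = Path("masters")   # hypothetical folder of 600 dpi TIFF preservation masters
ACCESS = Path("access")     # hypothetical folder for 300 dpi JPEG access copies
ACCESS.mkdir(exist_ok=True)

for tiff_path in MASTERS.glob("*.tif"):
    with Image.open(tiff_path) as master:
        # Halving the pixel dimensions of a 600 dpi scan yields a 300 dpi copy
        # at the same physical size
        access_copy = master.resize((master.width // 2, master.height // 2))
        access_copy = access_copy.convert("RGB")  # JPEG cannot hold every TIFF mode
        access_copy.save(ACCESS / (tiff_path.stem + ".jpg"),
                         "JPEG", quality=90, dpi=(300, 300))
```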
Hunter Library uses a content management system to transfer the JPEGs to a vendor, who displays each digitized item along with its metadata. Fariello likes the convenience and reliability of using a vendor — for which Hunter Library pays an annual fee — but doesn’t like that the URL changes in the browser from WCU’s to the vendor’s when a user is on a Hunter Library collection web page and they click on an item for a closer look. That “closer look” page displays its contents from the vendor’s server.
In other words, the collections’ top-level introductory pages reside at WCU and the individual item-level pages reside on the vendor’s server. Fariello would like to keep the entire online collection on campus, but Hunter Library lacks the financial and technological resources for that right now. The vendor service is an affordable compromise.
Like most libraries and museums in the U.S., Hunter Library’s small staff and tight funds limit the number of online collections it can create. Their vision exceeds their resources. Fariello said, “It seems to me that all over the country, digitization projects – and digital tools for preservation – are not always a funded part of core library services.” So she doggedly pursues grants. In the ten years she has been at Hunter Library, Fariello has raised more than half a million dollars to support their digital projects. She especially appreciates the way the state of North Carolina distributes Library Services and Technology Act funds, by way of IMLS. “In North Carolina those funds are administered by the state library,” said Fariello, “which created a grant program to get the funds out into the community at a local level.”
I asked Fariello if she saw Hunter Library’s online collections as a future direction for all libraries and her response was both realistic and hopeful. She said that the determining factor is whether a library has archived any collections to begin with. “The next phase for them would be to make the collections accessible through digitization,” said Fariello. “Not all libraries have an archival focus. If they don’t have collections, digitization is not going to be part of their responsibility.”
She said that libraries are changing with the times and librarians, especially young librarians, accept digital services as a natural function of a modern library. “It’s no longer a future function, it’s a present function,” said Fariello. If a library is interested in developing digital collections, the tools are available and standards are in place.
“In 2005 when I started, the standards weren’t clear,” she said. “We wondered, ‘How do you do this?’ Now the standards are standard. Sites like the federal digitization standards site (Federal Agencies Digitization Guidelines Initiative) and the Northeast Document Conservation Center are well established. You don’t have to invent how to do it. If you want to achieve a certain level of professionalism, follow those guidelines. Things have changed. It’s not that hard once somebody figures out how to do it.”
Most researchers begin online and they expect to — or hope to — find what they are looking for, or something related to it. Fariello said that, for researchers, online collections are as useful as eJournals and Wikipedia. Online collections do not replace research at a library or a museum, but they do make digital versions readily accessible.
“Access” has always been a guiding principle for Fariello in developing collections. She concentrates on making them useful and friendly for people. “The collections have been successful because I approach their development from the standpoint of someone who would use these collections,” said Fariello.
Librarians, curators, archivists and other information professionals provide a unique service by developing digital collections. And not just by digitizing the collections that reside within their institutions but also by looking outside, into the surrounding community, to rescue collections that are at risk.
“My position has never been to work within an ivory tower institution,” said Fariello. “I try to be aware of what is out in my community. Public institutions need to look to our communities and see where content is being created, especially by non-academic folks who don’t really know what to do with it once they pull it together.”
This blog post follows up on three earlier posts about detecting preservation risks in PDF files. In part 1 I explored to what extent the Preflight component of the Apache PDFBox library can be used to detect specific preservation risks in PDF documents. This was followed up by some work during the SPRUCE Hackathon in Leeds, which is covered in this blog post by Peter Cliff. Then last summer I did a series of additional tests using files from the Adobe Acrobat Engineering website. The main outcome of that more recent work was that, although showing great promise, Preflight was struggling with many of the more complex PDFs. Fast-forward another six months and, thanks to the excellent response of the Preflight developers to our bug reports, the most serious of these problems are now largely solved [1]. So, time to move on to the next step!

Govdocs Selected
Ultimately, the aim of this work is to be able to profile large PDF collections for specific preservation risks, or to verify that a PDF conforms to an institution-specific policy before ingest. To get a better idea of how that might work in practice, I decided to do some tests with the Govdocs Selected dataset, which is a subset of the Govdocs1 corpus. As a first step I ran the latest version of Preflight on every PDF in the corpus (about 15 thousand) [2].
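For reference, a batch run like this can be scripted around PDFBox's standalone Preflight application. The sketch below is only an illustration, not the exact setup used here: the jar name and location, its command-line behaviour and the report format all depend on the PDFBox version installed.

```python
import subprocess
from pathlib import Path

PREFLIGHT_JAR = "preflight-app.jar"   # assumed name/location of the standalone Preflight app
PDF_DIR = Path("govdocs-selected")    # hypothetical local copy of the corpus
OUT_DIR = Path("preflight-output")
OUT_DIR.mkdir(exist_ok=True)

for pdf in sorted(PDF_DIR.rglob("*.pdf")):
    # Run Preflight on one file at a time and keep whatever it prints as a per-file report
    result = subprocess.run(["java", "-jar", PREFLIGHT_JAR, str(pdf)],
                            capture_output=True, text=True)
    (OUT_DIR / (pdf.stem + ".txt")).write_text(result.stdout + result.stderr)
```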
Validation errors

As I was curious about the most common validation errors (or, more correctly, violations of the PDF/A-1b profile), I ran a little post-processing script on the output files to calculate error occurrences. The following table lists the results. For each Preflight error (which is represented as an error code), the table shows the number of PDFs for which the error was reported (expressed as a percentage) [3].

| Error code | % PDFs reported | Description (from Preflight source code) |
|---|---|---|
| 2.4.3 | 79.5 | color space used in the PDF file but the DestOutputProfile is missing |
| 7.1 | 52.5 | Invalid metadata found |
| – | – | RGB color space used in the PDF file but the DestOutputProfile isn't RGB |
| – | – | Error on the object delimiters (obj / endobj) |
| 1.4.6 | 34.3 | ID in 1st trailer and the last is different |
| 1.2.5 | 32.1 | The length of the stream dictionary and the stream length is inconsistent |
| 7.11 | 31.9 | PDF/A Identification Schema not found |
| – | – | Some mandatory fields are missing from the FONT Descriptor Dictionary |
| 3.1.3 | 29.4 | Error on the "Font File x" in the Font Descriptor (ed.: font not embedded?) |
| – | – | Some mandatory fields are missing from the FONT Dictionary |
| 3.1.6 | 17.1 | Width array and Font program Width are inconsistent |
| 5.2.2 | 13 | The annotation uses a flag which is forbidden |
| – | – | CMYK color space used in the PDF file but the DestOutputProfile isn't CMYK |
| 1.2.2 | 12 | Error on the stream delimiters (stream / endstream) |
| – | – | The stream uses a filter which isn't defined in the PDF Reference document |
| – | – | ID is missing from the trailer |
| – | – | The CIDSet entry is mandatory for a subset of composite font |
| 1.1 | 8.3 | Header syntax error |
| – | – | The stream uses an invalid filter (LZW) |
| – | – | Encoding is inconsistent with the Font |
| 2.3 | 6.7 | A XObject has an unexpected key defined |
| Exception | 6.6 | Preflight raised an exception |
| – | – | The CIDToGID is invalid |
| – | – | Charset declaration is missing in a Type 1 Subset |
| 7.2 | 5 | Metadata mismatch between PDF Dictionary and XMP |
| 7.3 | 4.3 | Description schema required not embedded |
| – | – | A XObject has an unexpected value for a defined key |
| – | – | Unknown metadata |
| – | – | A glyph is missing |
| – | – | Optional content is forbidden |
| – | – | A XObject SMask value isn't None |
| – | – | An object has an invalid offset |
| – | – | Last %%EOF sequence is followed by data |
| – | – | A Group entry with S = Transparency is used or the S = Null |
| 1 | 1.6 | Syntax error |
| – | – | Annotation uses a Color profile which isn't the same as the profile contained by the OutputIntent |
| – | – | The number is out of Range |
| – | – | The AP dictionary of the annotation contains forbidden/invalid entries (only the N entry is authorized) |
| 6.2.5 | 1 | An explicitly forbidden action is used in the PDF file |
| 1.4.7 | 1 | EmbeddedFile entry is present in the Names dictionary |
This table does look a bit intimidating (but see this summary of Preflight errors); nevertheless it is useful to point out a couple of general observations:
- Some errors are really common; for instance, error 2.4.3 is reported for nearly 80% of all PDFs in the corpus!
- Errors related to color spaces, metadata and fonts are particularly common.
- File structure errors (1.x range) are reported quite a lot as well. Although I haven't looked at this in any detail, I expect that for some files these errors truly reflect a deviation from the PDF/A-1 profile, whereas in other cases these files may simply not be valid PDF (which would be more serious).
- About 6.5% of all analysed files raised an exception in Preflight, which could either mean that something is seriously wrong with them, or alternatively it may point to bugs in Preflight.
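The error counts in the table above come from a small post-processing step over the per-file Preflight reports. A minimal sketch of that kind of step is shown below; it assumes one report per PDF (as written by the batch run sketched earlier) and simply greps for code-like tokens, so the pattern would need tuning to the real report layout.

```python
import collections
import re
from pathlib import Path

# Preflight error codes look like "1.2.5", "7.11" or "2.4.3"
CODE_PATTERN = re.compile(r"\b\d+\.\d+(?:\.\d+)?\b")

reports = list(Path("preflight-output").glob("*.txt"))
files_per_error = collections.Counter()

for report in reports:
    codes = set(CODE_PATTERN.findall(report.read_text()))
    files_per_error.update(codes)  # count each error code at most once per file

for code, count in files_per_error.most_common():
    print(f"{code}\t{100 * count / len(reports):.1f}%")
```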
Although it's easy to get overwhelmed by the Preflight output above, we should keep in mind here that the ultimate aim of this work is not to validate against PDF/A-1, but to assess arbitrary PDFs against a pre-defined technical profile. This profile may reflect an institution's low-level preservation policies on the requirements a PDF must meet to be deemed suitable for long-term preservation. In SCAPE such low-level policies are called control policies, and you can find more information on them here and here.
To illustrate this, I'll be using a hypothetical control policy for PDF that is defined by the following objectives:
- File must not be encrypted or password protected
- Fonts must be embedded and complete
- File must not contain embedded files (i.e. file attachments)
- File must not contain multimedia content (audio, video, 3-D objects)
- File should be valid PDF
Preflight's output contains all the information that is needed to establish whether each objective is met, except the last one (overall PDF validity), which would need a full-fledged PDF validator. By translating the above objectives into a set of Schematron rules, it is pretty straightforward to assess each PDF in our dataset against the control policy. If that sounds familiar: this is the same approach that we used earlier for assessing JP2 images against a technical profile. A schema that represents our control policy can be found here. Note that this is only a first attempt, and it may well need some further fine-tuning (more about that later).
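As a rough illustration of the assessment step, the sketch below applies a Schematron schema to each Preflight report using lxml's ISO Schematron support. The schema file name is hypothetical, and it assumes the Preflight results have been written as one XML report per PDF; the element names those reports use vary between versions, so the schema would have to match whatever your wrapper produces.

```python
from pathlib import Path
from lxml import etree
from lxml.isoschematron import Schematron

# "pdf-policy.sch" stands in for the control-policy schema discussed above
schema = Schematron(etree.parse("pdf-policy.sch"))

passed = failed = 0
for report in Path("preflight-output").glob("*.xml"):
    doc = etree.parse(str(report))
    if schema.validate(doc):
        passed += 1
    else:
        failed += 1

print(f"Pass: {passed}  Fail: {failed}")
```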
As a first step I validated all Preflight output files against this schema. The result is rather disappointing:OutcomeNumber of files%Pass397326Fail1112074
So, only 26% of all PDFs in Govdocs Selected meet the requirements of our control policy! The figure below gives us some further clues as to why this is happening:
Here each bar represents the occurrences of individual failed tests in our schema.

Font errors galore
What is clear here is that the majority of failed tests are font-related. The Schematron rules that I used for the assessment currently include all font errors that are reported by Preflight. Perhaps this is too strict an interpretation of objective 2 ("Fonts must be embedded and complete"). A particular difficulty here is that it is often hard to envisage the impact of particular font errors on the rendering process. On the other hand, the results are consistent with the outcome of a 2013 survey by the PDF Association, which showed that its members see fonts as the most challenging aspect of PDF, both for processing and writing (source: this presentation by Duff Johnson). So, the assessment results may simply reflect that font problems are widespread [4]. One should also keep in mind that Govdocs Selected was created by selecting on unique combinations of file properties from files in Govdocs1. As a result, one would expect this dataset to be more heterogeneous than most 'typical' PDF collections, and this would also influence the results. For instance, the Creating Program selection property could result in a relative over-representation of files that were produced by some crappy creation tool. Whether this is really the case could easily be tested by repeating this analysis for other collections.

Other errors
These preliminary results show that policy-based assessment of PDF is possible using a combination of Apache Preflight and Schematron. However, dealing with font issues appears to be a particular challenge. Also, the lack of reliable tools to test for overall conformity to PDF (e.g. ISO 32000) is still a major limitation. Another limitation of this analysis is the lack of ground truth, which makes it difficult to assess the accuracy of the results.

Demo script and data downloads
For those who want to have a go at the analyses that I've presented here, I've created a simple demo script here. The raw output data of the Govdocs Selected corpus can be found here. This includes all Preflight files, the Schematron output and the error counts. A download link for the Govdocs Selected corpus can be found at the bottom of this blog post.

Acknowledgements
Apache Preflight developers Eric Leleu, Andreas Lehmkühler and Guillaume Bailleul are thanked for their support and prompt response to my questions and bug reports.

Related blog posts
- Identification of PDF preservation risks with Apache Preflight: a first impression
- Identification of PDF preservation risks: the sequel
- Are your documents readable? How would you know? (Duff Johnson)
- From 1 Million to 21,000: Reducing Govdocs Significantly (Dave Tarrant)
- Creating machine understandable policy from human readable policy (Catherine Jones)
- Control Policies in the SCAPE Project (Sean Bechhofer)
[2] This selection was only based on file extension, which introduces the possibility that some of these files aren't really PDFs.
[3] Errors that were reported for less than 1% of all analysed PDFs are not included in the table.
[4] In addition to this, it seems that Preflight sometimes fails to detect fonts that are not embedded, so the number of PDFs with font issues may be even greater than this test suggests.