The following is a guest post by Nicole Scalessa, IT manager at The Library Company of Philadelphia, an NDSA member.
Digital stewardship is a prime topic for small institutions trying to keep pace with the increasing demands for digital content. The Library Company of Philadelphia, a special collections library founded by Benjamin Franklin in 1731, hosted the National Digital Stewardship Alliance Philly Regional meeting to inform and connect mid-Atlantic institutions so they may consider new collaborations to meet digital preservation demands.
The intent of the NDSA Philly Regional meeting was to present a slate of speakers that represented some of the most influential thinking and trends in digital preservation today. The event was opened to new audiences to spread the word of NDSA’s accomplishments and ongoing activities. Members of the Philadelphia Area Consortium of Special Collections Libraries, PhillyDH, and the Delaware Valley Archivists Group were in attendance. The event, on the cusp of ALA Mid-Winter, also drew audiences from around the country from as far as North Carolina, Florida, Colorado and Washington State.
Things kicked off Thursday evening, January 23rd, with a welcome by Library Company Director John C. Van Horne and an introduction by Erin Engle, digital archivist with NDIIPP. Erin provided a clear presentation of the NDSA mission to advocate for common needs among members through reports, guidance, meetings, events and webinars. As an example, she mentioned the 2014 National Agenda for Digital Stewardship, an insightful look into the trends and current state of digital preservation and a tool to help decision makers and funders.
This was followed by an enthusiastic and compelling keynote by Emily Gore, DPLA director for content, entitled “Building the Digital Public Library of America: Successes, Challenges and Future Directions.” A theme of sustainability resonated through her talk as she described the development of the DPLA and how it became clear that the hub model was the most successful strategy for the long term success of the project. The establishment of hubs is driven by the idea that asking a few existing digital repositories to aggregate content is the most efficient way to bring more institutions into DPLA. Hubs help DPLA with the management of data aggregation, metadata consistency, continual repository services, promoting new digitization, encouraging community engagement and self-evaluation for the improvement of existing and the development of new DPLA hubs.
The evening progressed into a series of lightning talks that focused on standards for preservation, digitization and description. This was a natural transition in the conversation and established a complete picture of the issues that must be addressed in any collaborative digitization strategy. Consistency was the prevailing message for success when conformity is often unattainable.
Meg Phillips, NARA’s external affairs liaison, initiated the lightning part of the evening with a presentation on the NDSA Levels of Digital Preservation, “a tiered set of recommendations for how organizations should begin to build or enhance their digital preservation activities.” She emphasized the importance of this document as a tool for self-assessment, program planning, institutional advocacy, strategic planning, and as a way to open communication with content creators. The success of the document lies in its simple descriptive format that is content agnostic. It includes four levels of preservation – protect your data, know your data, monitor your data, and repair your data; across five functions – storage and geographic location, file fixity and data integrity, information security, metadata and file formats.
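The levels-by-functions grid described above can be pictured as a small data structure. This is only an illustrative sketch of how an organization might record a self-assessment against the grid; the scoring approach is my own assumption, not part of the NDSA document.

```python
# Levels and functions are taken from the NDSA Levels of Digital Preservation
# as described above; the assessment helper itself is illustrative.

LEVELS = ["protect your data", "know your data",
          "monitor your data", "repair your data"]
FUNCTIONS = ["storage and geographic location", "file fixity and data integrity",
             "information security", "metadata", "file formats"]

def assess(scores):
    """scores maps each function to the highest level (1-4) achieved."""
    report = {}
    for function in FUNCTIONS:
        level = scores.get(function, 0)
        report[function] = LEVELS[level - 1] if 1 <= level <= 4 else "not started"
    return report

example = assess({"metadata": 2, "information security": 1})
print(example["metadata"])        # "know your data"
print(example["file formats"])    # "not started"
```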
Ian Bogus, MacDonald Curator of Preservation at the University of Pennsylvania Libraries, gave a lightning talk entitled “Why Create a Standard on Digitization? An Experience Creating the Association for Library Collections and Technical Services Minimum Digitization Capture Recommendation.” The goal of this project was to establish an acceptable minimum standard that would resonate with staff with different degrees of digitization experience. With this standard, libraries can create digital surrogates that are sustainable into the future. The guiding principles of the project were to create a standard high enough to ensure adequacy, that kept in line with other recommendations and projects, did not duplicate existing work, was basic enough for novices to use and was accurate enough for experts.
The evening concluded with a fun discussion on metadata, with some serious undertones. George Blood of George Blood Audio|Video|Film discussed how we as librarians are “Describing Ourselves to Death” and the “Failures of Metadata.” He began by affirming that he is a metadata pessimist because no one asks “what problem are we trying to solve?” or “what are we trying to provide metadata for?” Most metadata is collected “just because we can,” and because of this we do not test our metadata. The variety of metadata standards across and within institutions is staggering. Sometimes metadata standardization costs more than digitization itself. He encouraged the audience to consider what a standard is, whether a standard needs to be perfect, what the implications of local modifications are, and whether there is a one-size-fits-all solution. This was quite a formidable list of questions to end the evening, but a wonderful starting point for the Friday unconference the next morning.
Approximately 50 attendees convened to propose and vote upon the unconference sessions. The largest sessions included “making the case for digital preservation,” “let’s discuss a consortium data center,” and “how do we approach becoming a regional hub of DPLA.” The smaller breakout sessions included discussions on minimal standards for archival description, engaging leadership and encouraging organizational responsibility for digital projects, approaching rights and access issues, metrics for evaluation of digital archival resources, new technologies in digitization, and teaching digital preservation in library science and graduate archival programs. Notes from these sessions will be forthcoming on the event web page.
The two-day event was attended by nearly one hundred and fifty people from around the country and ended in promising collaboration discussions and new friendships. This experience demonstrates that NDSA Regional meetings offer opportunities for local institutions to connect with one another while becoming informed on trends in digital stewardship on a national scale.
The following is a guest post by Maureen McCormick Harlow, a National Digital Stewardship Resident at the National Library of Medicine in Bethesda, Maryland. She is working on a project to build a thematic web collection.
Greetings from the National Library of Medicine! It’s hard to believe it, but I’m heading into the fourth quarter of my residency here. I thought it was time to give an update on what I’ve been doing for my project, even though it’s not terribly Valentine’s Day-related!
My NDSR project is to build a thematic web collection at NLM that will be incorporated into the History of Medicine Division collection. HMD has extensive digital and modern manuscript collections, and this little collection that I’m working on will be accessioned into it as a curated, intentional collection.
Thematic collections can provide institutions with an opportunity to close known collection gaps. If institutions can identify areas of weakness within their collections, they can intentionally collect on the topics as they exist today on the Internet. This is an especially attractive option for topics that are in flux, or whose understanding is changing frequently.
Another benefit of thematic web collections is that they allow institutions to collect material that may be ephemeral. Blogs come and go frequently, and once they are taken down, the information contained in them is gone as well. Collecting websites can be akin to collecting gray literature.
My project is limited to creating one thematic collection to add to the HMD holdings, but I also wanted to establish a framework that could be used in the future for other thematic collections. The framework we eventually settled on is a thematic collection that represents two sides of the same coin, so to speak. In this case, Autism Spectrum Disorders are brain disorders generally diagnosed at the beginning of life, while the brain is still developing, whereas Alzheimer’s Disease is a brain disorder diagnosed at the end of life, in old age.
Although the two diseases are not related, they are diagnosed during the organ’s development and decay. Future thematic web collections could explore diagnoses in a particular body system or region made during the system/region’s development and at the end of life, or two extremes of the same issue. Some examples include:
- Teen pregnancy and infertility
- Diabetes type 1 and type 2
- Scoliosis and osteoarthritis
- Eating disorders and obesity
Each of these issues is one of strategic importance to NIH and, in some cases, the nation (see: the Let’s Move project by Michelle Obama and the Teen Pregnancy Prevention Resource Center in HHS’s Office of Adolescent Health). More importantly, many of these topics represent areas of great change and understandings that are in flux, making websites a viable way for future researchers to examine change over time.
Picking a Theme
Before you can create a thematic web collection, you’ve got to have a theme. This process took a while. My first step was to look over the various collecting documents. In my section at NLM, there were three to consider: the NIH Research Priorities, the NLM Collection Development Policy and Manual, and an internal document that deals with known collection gaps (for example, the caregiver perspective). Each of these helped to inform and narrow my possibilities. For instance, the Research Priorities at NIH report indicated several areas of interest to the larger NIH audience, alerting me to trends in research and some of the most prevalent problems in medicine. It stood to reason that, since these were priorities for NIH, there would be scholarly work produced about the diseases, and that the understanding of the diseases was in a period of flux, making web collecting more important than ever. Since this is a bit of a pioneer collection, I wanted it to fit squarely within each of these areas.
After reviewing all of these documents and spending a significant amount of time looking at internet resources, I came up with three proposals:
- Eating disorders
- Sexual assault
- Autism and Alzheimer’s
My last step in the process was to plug each potential topic into the NLM catalog and the HMD finding aid search to see what kind of resources we already had on each topic. Since one of my personal goals was to help fill some of the collecting gaps, I wanted to see that the web collection would be contributing something original to the HMD collection. In each case, I found that, while NLM collected extensively on each topic, the HMD holdings were limited.
We ended up going with the third option, and I’m calling the collection Disorders of the Developing and Aging Brain: Autism and Alzheimer’s on the Web.
Picking the Seeds
The scope of my collection was limited to approximately 40-60 seeds (individual websites/URLs that will be added to the collection). I decided to split the seeds roughly in half (a total of 64 seeds) and divided the ~30 per topic into six or seven different areas:
- Current understanding
- Caretakers (first-person resources, primarily blogs of caretakers)
- Patients/sufferers (also first-person, also primarily blogs)
- Prevention (for Alzheimer’s only)
For the first-person categories, I tried to make sure to cover a wide variety of ages, diagnoses, and roles/perspectives to represent a range of experiences.
Collecting the Material
After picking the seeds, we went about collecting permissions for the blogs. Although we have a strong argument for use under the ARL Best Practices for Fair Use guidelines, we’re proceeding with an abundance of caution and collecting as many permissions as possible for the blogs in the collection.
Describing the Collection
This is where I am now. My preliminary plan is to use the following methods to describe the new collection:
- Create a catalog record so that the collection is discoverable through the NLM catalog;
- Fully arrange and describe the collection using a finding aid and adhering to DACS principles and local implementations.
There are very few examples that I’ve found of web collections described in this manner, so it’s going to be a lot of work creating standards and best practices that will be robust and durable enough to make the collection usable to researchers, while also being flexible enough for archivists at NLM to use into the future.
That’s where my project stands now! I’m looking forward to finishing it, and I welcome the challenge of describing the collection and getting it incorporated into the HMD collection!
Other residents in the blogosphere: Heidi Dowding discusses digital asset management at cultural institutions in Baltimore, Emily Reynolds recaps her presentation at ALA Mid-Winter with Julia Blase and shares her slides, and Lauren Work shares her ALA Mid-Winter slides.
Every two years there is a fresh opportunity for excitement in following the Olympic games – not only for the thrill of the sports themselves, and rooting for hometown heroes, but for the fascination and variety of all the international culture in one place. And now, there is an effort going on behind the scenes to capture the highlights, the competition, and the general cultural history surrounding the Olympic Games. That is, a project to archive the 2014 Olympics web sites. This effort may not be well known, but the resultant archive will be invaluable for researchers in the future.
This web archiving project is being produced by the International Internet Preservation Consortium. The IIPC has been around since 2003; it is a collaborative organization dedicated to improving the tools, standards and best practices of web archiving while promoting international collaboration and the broad access and use of web archives for research and cultural heritage. Membership in the IIPC currently includes almost 50 organizations: libraries (including the Library of Congress), archives, museums and other cultural heritage organizations, representing over 25 countries. This Olympics project is being coordinated through the IIPC Access Working Group, and the project leaders are Nicola Bingham and Helen Hockx-Yu, both of the British Library.
The IIPC has produced similar projects before: there are archives of the 2010 Winter Olympics in Vancouver and the 2012 Summer Olympics and Paralympics in London. The current effort aims to preserve a range of web sites relating to the 2014 Olympics in Sochi, Russia.
A little bit about the process – IIPC member institutions all contribute their own list of suggested web sites (referred to as “seeds”) for inclusion in the collection. With so many member organizations around the world, the aim is to include Olympics-related sites from many countries, in a variety of languages and from a variety of viewpoints.
The previous IIPC project to capture the 2012 Olympics in London included many British sites that provide an overall view of the host country preparations. These archived sites are not available yet, but include the official London 2012 Olympic and Paralympic Games sites as well as the British Olympic Association which includes details of the Olympics bid, and a local council’s 2012 Olympic and Paralympic website. It also includes the Hidden London site showing the building stages of the Olympic stadium, as well as blogs and commentaries related to arts and culture, featuring such things as a torch from the 1948 London Olympics acquired by the Victoria and Albert Museum.
For this current 2014 Olympics project, the various IIPC member institutions are all recommending their own list of websites to be included. For example, the Library of Congress has recommended 131 web sites. As described by Michael Neubert, Supervisory Digital Projects Specialist here at the Library: “The selection of most sites for such collections is mechanical, in that we know we want sites for the various US teams – each team sport has its own site, for example, then along with that site there will be various social media sites/channels. In order to optimize the crawls, we nominate the social media separately. In addition to the team sites, we also chose a limited number of news media sites where the coverage of the Olympics seemed segregated from the rest of the site.”
Nicola Bingham of the British Library, one of the project coordinators, emphasizes additional contributions to this project: “The IIPC 2014 Winter Olympics project is being supported by the Internet Archive, who are crawling the seeds (sites), and the University of North Texas, who are supporting the nomination tool. A common subject scheme is being used to categorize websites according to producer type and Olympic sport. Crawling began in mid-December 2013, and to date 745 seeds have been nominated by 17 IIPC member institutions.”
“The Internet Archive has taken on the role of crawling, without which the project would have been much more difficult. Many other IIPC members would not have been able to perform the crawling, not necessarily for technical reasons but due to legal and/or political considerations.” For more about the web archiving process, see the IIPC “About Archiving” page.
As stated on the group’s Access Working Group page, “It is hoped that the project will enable institutions to continue to experiment with tools and processes that facilitate collaborative definition, collection and accessibility of web data.”
Over the next year or so, the IIPC will be working on creating wider access to all these Olympic archives. For the latest updates on this and other IIPC projects, follow @netpreserve.
Web pages are becoming more complex than ever, which makes identifying their different elements, such as main content, menus, user comments and advertising, increasingly difficult. Web page segmentation refers to the process of dividing a web page into visually and semantically coherent segments called blocks. Detecting these different blocks is a crucial step for many applications, for example content visualization on mobile devices, information retrieval and change detection between versions in the web archive context.
Web Page Segmentation at a Glance
For a web page (W), the output of its segmentation is a semantic tree (W'). Each node represents a data region in the web page, which is called a block. The root block represents the whole page. Each inner block is the aggregation of all its children blocks. All leaf blocks are atomic units and form a flat segmentation of the web page. Each block is identified by a block-id value (see Figure 1 for an example).
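The block tree described above can be sketched with a few lines of code. This is an illustrative model only; the class and field names are assumptions, not taken from the actual Block-o-Matic implementation.

```python
# Each node is a block with an id; inner blocks aggregate their children,
# and the leaves form the flat segmentation of the page.

class Block:
    def __init__(self, block_id, children=None):
        self.block_id = block_id
        self.children = children or []

    def leaves(self):
        """Return the atomic leaf blocks, i.e. the flat segmentation."""
        if not self.children:
            return [self]
        flat = []
        for child in self.children:
            flat.extend(child.leaves())
        return flat

# The root block represents the whole page W'.
root = Block("B0", [Block("B1", [Block("B1-1"), Block("B1-2")]), Block("B2")])
print([b.block_id for b in root.leaves()])  # ['B1-1', 'B1-2', 'B2']
```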
An efficient web page segmentation approach is important for several reasons:
- Processing different parts of a web page according to their type of content
- Assigning more importance to one region of a web page than to the rest
- Understanding the structure of a web page
In this post, I will try to explain what web page segmentation does, specifically for pagelyzer. It provides information about the web page content.
Web Page Segmentation Algorithm
We present here the details of the Block-o-Matic web page segmentation algorithm used by pagelyzer to perform the segmentation. It is a hybrid of the visual-based approach and the document processing approach.
The segmentation process is divided into three phases: analysis, understanding and reconstruction. It comprises three tasks: filtering, mapping and combining. It produces three structures: the DOM structure, the content structure and the logic structure. The main aspect of the whole process is producing these structures, where the logic structure represents the final segmentation of the web page.
The DOM tree is obtained from the rendering of a web browser. The result of the analysis phase is the content structure (W_cont), built from the DOM tree with the d2c algorithm. Mapping the content structure into a logical structure (W_log) is called document understanding. This mapping is performed by the c2l algorithm with a granularity parameter pG. Web page reconstruction gathers the three structures (the Rec function):
W' = Rec(DOM, d2c(DOM), c2l(d2c(DOM), pG))
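The composition above can be sketched schematically. The placeholder functions below only mirror the shape of the pipeline; the real d2c, c2l and Rec algorithms are of course far more involved, and the data structures here are stand-ins.

```python
# Schematic sketch of the three-phase Block-o-Matic pipeline:
# analysis (d2c), understanding (c2l) and reconstruction (Rec).

def d2c(dom):
    """Analysis: build the content structure W_cont from the DOM tree."""
    return {"content": dom}

def c2l(w_cont, pG):
    """Understanding: map the content structure to a logical structure
    W_log, using the granularity parameter pG."""
    return {"logic": w_cont, "granularity": pG}

def rec(dom, w_cont, w_log):
    """Reconstruction: gather the three structures into the segmentation W'."""
    return {"dom": dom, "content": w_cont, "logic": w_log}

def segment(dom, pG):
    w_cont = d2c(dom)
    w_log = c2l(w_cont, pG)
    return rec(dom, w_cont, w_log)

print(segment("dom-tree", 5)["logic"]["granularity"])  # 5
```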
To integrate the segmentation outcome into pagelyzer, an XML representation is used: ViDIFF. It represents hierarchically the blocks, their geometric properties, and the links and text in each block.
Implementation
The Block-o-Matic algorithm is available:
- through pagelyzer itself (https://github.com/openplanets/pagelyzer)
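To picture the ViDIFF representation mentioned above, here is a small sketch built with the standard library. The element and attribute names are illustrative guesses only, not the actual ViDIFF schema.

```python
# Hypothetical ViDIFF-style description of one block: its id, geometry,
# and the links and text it contains.

import xml.etree.ElementTree as ET

root = ET.Element("viDiff")
block = ET.SubElement(root, "block", id="B1", x="0", y="0", w="800", h="120")
ET.SubElement(block, "link", href="http://example.org/")
ET.SubElement(block, "text").text = "Example block text"

print(ET.tostring(root, encoding="unicode"))
```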
There was a week in January 2014 where I participated in three meetings/events where emulation came up as a digital preservation solution. Emulation has really hit its stride, 20 years after I first heard about it.
An emulator is an environment that imitates the behavior of a computer or other electronic system. In recent years, this has come to be known as a Virtual Machine, which is a recreated computer environment — from the operating system to the video drivers and software — that can be run in an interactive manner using current technology, including a web browser in some instances.
I was very much the fan of collecting hardware for digital preservation, until I participated in the Library of Congress Preserving.exe meeting in May of 2013. I wrote about my own conversion to Team Emulation in an earlier post on this blog, and my colleague Bill Lefurgy responded to my post with a post of his own. (That said, we still need vintage hardware to read older media to bring operating systems and software into emulation environments.)
There are a few key articles on this topic:
- Granger, Stewart. “Emulation as a Digital Preservation Strategy.” D-Lib Magazine 6.19 (2000).
- Guttenbrunner, Mark, and Andreas Rauber. “A measurement framework for evaluating emulators for digital preservation.” ACM Transactions on Information Systems (TOIS) 30.2 (2012): 14.
- Rechert, Klaus, Dirk von Suchodoletz, and Randolph Welte. “Emulation based services in digital preservation.” Proceedings of the 10th annual joint conference on Digital libraries. ACM, 2010.
- Rothenberg, Jeffrey. “The Emulation Solution.” Avoiding Technological Quicksand: Finding a Viable Technical Foundation for Digital Preservation. Washington, DC: Council on Library and Information Resources, 1998.
- Van der Hoeven, Jeffrey, Bram Lohman, and Remco Verdegem. “Emulation for digital preservation in practice: The results.” International Journal of Digital Curation 2.2 (2008): 123-132.
Don’t let some of the early dates fool you – this issue was debated in just as lively a way 15 years ago as it is now.
The beginning is a very fine place to start indeed for the Federal Agencies Digitization Guidelines Initiative Born Digital Video subgroup of the Audio-Visual Working Group. As mentioned in a previous blog post, the FADGI Born Digital Video subgroup is taking a close look at the range of decisions to be made throughout the lifecycle of born digital video objects, from file creation through archival ingest and access delivery. Through case histories from federal agencies such as the National Archives and Records Administration, Smithsonian Institution Archives, National Oceanic and Atmospheric Administration, Library of Congress, Voice of America and American Folklife Center, we are exploring the “truth and consequences” of creating and archiving born digital video. In this blog post, we’ll look at some of our guiding principles for creating born digital video.
But as Julie Andrews says, let’s start at the very beginning. What do we mean by born digital video? Quite simply, it’s video that is recorded to digital file at the point of creation. Born digital video is distinct from digitized or reformatted video, a label used to describe the result of translating the analog signal data emanating from a video object into a digitally encoded format. FADGI’s Reformatted Video subgroup is developing a matrix which compares target wrappers and encodings against a set list of criteria that come into play when reformatting analog videotapes.
The first set of FADGI BDV case histories highlights what we call advice for shooters (a.k.a. videographers) and, by extension, the project managers within cultural heritage institutions who are responsible for the creation of new born digital video files, especially for determining the technical file specifications. It’s important to recognize that the FADGI target audience for these case histories isn’t Hollywood or commercial entertainment producers. It’s the cultural heritage community and smaller archives who create non-broadcast classes of content, such as oral history recordings. A great example is the Civil Rights History Project at AFC. These types of projects have the opportunity to spec out the born digital video deliverable from the very beginning and end up with a file that is ingest-ready for preservation and access systems.
The goal of the case histories project is to use guiding principles to illustrate the advantages of starting with high quality data capture from the very start. Two examples of FADGI’s guiding principles for creating born digital video include:
- Create uncompressed video instead of compressed video. Compressed video reduces the amount of data in a file or stream. Although a reduced amount of data can be beneficial for easing storage, transfer, and play-out requirements, it generally introduces additional technical complexity which can have a negative impact on usability of the file over time. Uncompressed video retains all the visual information captured at the selected resolution, which is preferable for preservation purposes.
- If compression is required, use lossless compression over lossy compression. Lossless compression uses algorithms that restore the compressed data after decompression. It is essentially reversible compression. Lossy compression permanently alters or deletes the compressed data. If data reduction gains are significant enough to warrant using the added complexity of compressed files, lossless compression is preferred to preserve video quality.
These are just two examples that focus on the video encoding. The guiding principles also cover considerations for file wrapper or container capabilities, format sustainability and more general project concerns.
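The reversibility that separates lossless from lossy compression can be demonstrated directly. Here is a quick sketch using the standard library's zlib, a lossless codec: the original bytes are restored exactly after decompression, which a lossy codec cannot guarantee.

```python
# Lossless round trip: compress, decompress, and verify the bytes are
# restored exactly. (Real video codecs are more complex, but the
# reversibility property is the same.)

import zlib

original = b"frame data " * 1000
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), len(compressed))   # compression reduces the size
print(restored == original)             # True: nothing was lost
```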
But here’s the thing: our case histories don’t always follow our own guiding principles. And that’s just fine by us. None of us live in a utopian world where digital storage is abundant and systems are completely interoperable. We all have to make choices and compromises to work within our constraints. Uncompressed video files can be huge and a burden to manage and maintain. Lossy compression can be appropriate for certain projects. The guiding principles should all be read with the caveat “if you have the option….” Sometimes, you simply don’t have the option for a myriad of reasons. But when you do have the option, the guiding principles highlight the advantages of high quality data capture. The important take-away from the case histories project is that the choices made during the file creation process will have impacts on the long-term archiving and distribution processes, and it’s essential to understand what those impacts are and have a plan to resolve any conflicts.
Our hope is that these guiding principles and case histories help us start to flesh out more specific format guidance for born digital video but that’s in the future. The case history project, which will be published on the Federal Agencies Digitization Guidelines Initiative website this spring, is the first step towards understanding where we are as a community and what we can learn from each other.
How do I know if a digital file/object has been corrupted, changed or altered? Further how can I prove that I know what I have? How can I be confident that the content I am providing is in good condition, complete, or reasonably complete? How do I verify that a file/object has not changed over time or during transfer processes?
In digital preservation, a key part of answering these questions comes through establishing and checking the “fixity” or stability of digital content. At this point, many in the preservation community know they should be checking the fixity of their content, but how, when and how often?
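At its simplest, a fixity check records a cryptographic digest when content is ingested and recomputes it later to confirm nothing has changed. This is a minimal sketch of that idea using SHA-256 from the standard library; the function names are my own, not from any NDSA document.

```python
# Record a digest at ingest, then recompute and compare to detect
# corruption or alteration.

import hashlib

def sha256_of(path, chunk_size=65536):
    """Stream the file in chunks so large objects need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def check_fixity(path, recorded_digest):
    """True if the file still matches the digest recorded at ingest."""
    return sha256_of(path) == recorded_digest
```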
A team of individuals from the NDSA Infrastructure & Standards working groups has developed Checking Your Digital Content: How, What and When to Check Fixity? in an effort to help stewards answer these questions in a way that makes sense for their organization based on its needs and resources. We are excited to publicly share this draft document for broader open discussion and review here on The Signal. We welcome comments and questions; please post them at the bottom of this post for the working group to review.
Not Best Practices, but Guidance for Making Best Use of Resources at Hand
In keeping with work on the NDSA Levels of Digital Preservation, this document is not a benchmark or requirement. It is instead intended as a tool to help organizations develop a plan that fits resource constraints. Different systems and different collections are going to require different fixity checking approaches, and our hope is that this document can help.
Connection to National Agenda for Digital Stewardship
This guidance was developed as a first step toward addressing a need articulated in the infrastructure section of the inaugural National Agenda for Digital Stewardship. I’ll include it below at length for context.
Fixity checking is of particular concern in ensuring content integrity. Abstract requirements for fixity checking can be useful as principles, but when applied universally can actually be detrimental to some digital preservation system architectures. The digital preservation community needs to establish best practices for fixity strategies for different system configurations. For example, if an organization were keeping multiple copies of material on magnetic tape and wanted to check fixity of content on a monthly basis, they might end up continuously reading their tape and thereby very rapidly push their tape systems to the limit of reads for the lifetime of the medium.
There is a clear need for use-case-driven examples of best practices for fixity in particular system designs and configurations established to meet particular preservation requirements. This would likely include descriptions of fixity strategies for all-spinning-disk systems and largely tape-based systems, as well as hierarchical storage management systems. A chart documenting the benefits of fixity checks for certain kinds of digital preservation activities would bring clarity and offer guidance to the entire community. A document modeled after the NDSA Levels of Digital Preservation would be a particularly useful way to provide guidance and information about fixity checks based on storage systems in use, as well as other preservation choices.
Again, please share your comments on this here, and consider forwarding this on to others who you think might have comments to share with us.
The Web is constantly evolving. Web content such as text and images is updated frequently. One of the major problems encountered by archiving systems is understanding what happened between two different versions of a web page. We want to underline that the aim is not to compare two different web pages like this (however, the tool can also do that):
but web page versions:
An efficient change detection approach is important for several reasons:
Crawler optimization: deciding on the fly whether a page should be crawled.
Discovering new crawl strategies, e.g. based on patterns.
Quality assurance for crawlers, for example by comparing the live version of a page with the just-crawled one.
Detecting format obsolescence as technologies evolve, e.g. checking whether web pages render visually identically in different browsers or browser versions.
Archive maintenance: operations such as format migration can change the rendering of archived versions.
Pagelyzer is a tool built around a supervised framework that decides whether two web page versions are similar or not. Pagelyzer takes two URLs, two browser types (e.g. Firefox, Chrome) and one comparison type (image-based, hybrid or content-based) as input. If the browser types are not set, it uses Firefox by default.
It is based on two different technologies:
1 – Web page segmentation (let's keep the details for another blog post)
2 – Supervised Learning with a Support Vector Machine (SVM).
In this blog post, I will try to explain simply (without any equations) what the SVM does specifically for Pagelyzer. You have two URLs, let's say url1 and url2, and you would like to know whether they are similar (1) or dissimilar (0).
You calculate the distance (or similarity) as a vector based on the comparison type. If it is image-based, your vector will contain features related to images (e.g. SIFT, HSV). If it is content-based, your vector will contain features for text similarities (e.g. Jaccard distance for links, images and words). To better explain how it works, let's assume that we have two dimensions (two features). One feature is SIFT and the other is HSV; both are color descriptors.
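As a small illustration of the content-based features mentioned above, here is how a Jaccard similarity between, say, the sets of links on two page versions could be computed (a generic sketch with made-up link values, not Pagelyzer's actual code):

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|.
    1.0 means identical sets, 0.0 means no overlap."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are trivially identical
    return len(a & b) / len(a | b)

# Links extracted from two versions of the same page (hypothetical values):
links_v1 = {"/home", "/about", "/contact"}
links_v2 = {"/home", "/about", "/news"}
print(jaccard_similarity(links_v1, links_v2))  # 2 shared / 4 total = 0.5
```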
To make your system learn, you must first provide it with annotated data. In our case, we need a list of URL pairs <url1,url2> annotated manually as similar or not similar. For Pagelyzer, this dataset is provided by the Internet Memory Foundation (IMF). With one part of your dataset (ideally 1/3) you train your system; with the other part you test your results.
Let's start training:
First, you put all your vectors in the input space.
As this data is annotated, you know which pairs are similar (in green) and which are dissimilar (in red).
You find the optimal decision boundary (hyperplane) in input space. Anything above the decision boundary should have label 1 (similar). Similarly, anything below the decision boundary should have label 0 (dissimilar).
Your system is intelligent now! When you have a new pair of URLs without any annotation, you can say whether they are similar or not based on the decision boundary.
The pair of URLs in blue will be considered dissimilar by Pagelyzer; the one in orange will be considered similar.
When you choose different types of comparison, you choose different types of features and dimensions. The current version of Pagelyzer uses an SVM trained on 202 web page pairs provided by IMF, of which 147 are in the positive class and 55 in the negative class. As it is a supervised system, increasing the training set size will generally lead to better results.
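The training-and-prediction procedure described above can be sketched with scikit-learn (an illustrative toy example with made-up feature values and a tiny dataset, not Pagelyzer's real training data):

```python
import numpy as np
from sklearn.svm import SVC

# Each row is the feature vector for one <url1, url2> pair, e.g.
# [SIFT-based distance, HSV-based distance]; labels are 1 = similar,
# 0 = dissimilar, as in the manually annotated training set.
X_train = np.array([[0.10, 0.20], [0.15, 0.10], [0.20, 0.15],
                    [0.90, 0.80], [0.85, 0.95], [0.80, 0.90]])
y_train = np.array([1, 1, 1, 0, 0, 0])

# Fit a linear SVM: it finds the hyperplane separating the two classes.
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

# A new, unannotated pair is classified by which side of the
# decision boundary its feature vector falls on.
new_pairs = np.array([[0.12, 0.18],   # near the "similar" examples
                      [0.88, 0.85]])  # near the "dissimilar" examples
print(clf.predict(new_pairs))
```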
An image to show what happens when you have more than two dimensions:
My two young teenage daughters spend hours playing Minecraft, building elaborate virtual landscapes and structures. They are far from alone; the game has millions of fans around the world. Teachers are seizing on Minecraft’s popularity with kids as a tool to teach both abstract and concrete subjects. What’s unique about this situation is not so much the product as that a virtual world is functioning as both a fun, engaging activity and a viable teaching tool. We’re witnessing the birth of a new genre of tools and a new set of challenges for preserving the digital creations people build with those tools.
Like most parents, I save many of the things that my daughters create. From where I’m sitting in my home as I write this blog post, I can see their works dotting the room. On one wall is a framed pencil sketch one daughter drew of our family; on a shelf is a perfect clay replica she made of Moomintroll. Hanging above a window are drawings my other daughter did — a Sharpie drawing of tree houses and a pen doodle of kaleidoscopic patterns that disappear into a tunnel-like vanishing point. Huge snowflakes (no two alike) that they cut from paper dangle here and there around the room.
I never gave much thought to their virtual gaming activities, aside from monitoring how much time they spend on their electronic devices. But I like that Minecraft lets my kids invent universes and play inside them together and I can tell that it feeds an important part of their intellectual growth as they make things, investigate things and solve problems. So I decided that I’d like to save what I can of the worlds they create, just as I save the rest of their crafts and artwork, which raised questions about what I can save, how I can save it and why I would even want to save it.
Over the last decade, the Library of Congress and its NDIIPP and NDSA partners have led the research into preserving virtual worlds, from military simulations to consumer games. Many of the questions – technological and philosophical – have long been asked and answered, or at least the challenges have been identified and defined. That's fine for institutions that recognize the cultural value of virtual worlds and have the resources to archive them, but what does it mean for a parent who just wants to save his or her kid's virtual world creations?
A colleague at the Library of Congress, Trevor Owens, is part of the ongoing research on preserving digital worlds and preserving software. In fact, Owens is one of the organizers of the preserving software conference. He said that the solution to the question of saving something from virtual worlds depends on whether you want to save:
- the virtual world that you or someone else built
- testimony about what the virtual world meant to you or them at a particular time
- or documentation of the virtual world.
Preserving the virtual world itself is the most difficult and challenging option. The complexities of preserving virtual worlds are too much to go into in this blog post. And when it comes to talking about networked virtual worlds inhabited by live human participants, the subject often gets downright esoteric, like defining where “here” actually is and what “here” means in a shared virtual world and how telepresence applies to the virtual world experience. But to illustrate the basic technological dilemma of preserving a virtual world, here’s a simple example.
Let’s say I build an island, castle and estate in a virtual world and name it Balmy Island. If I want to save Balmy Island and be able to walk around it anytime I want to, I need all the digital files of which Balmy Island is constructed. I might need the exact version of the application or software that I used to build Balmy Island, as well as the exact operating system — and version of the OS — of the hardware device on which I built Balmy Island. And I might need the hardware device itself on which I created Balmy Island. So if I build Balmy Island on my computer, I have to preserve the computer, the software and the files just as they are. Never upgrade or modify anything. Just stick the whole computer in the closet, buy a new computer and pull out the old one whenever I wanted to revisit Balmy Island.
Another less-certain and less-authentic option is that I could save the Balmy Island files and hope that someday someone will build an emulator that will restore some approximate version of my original Balmy Island. It will not be exactly the same, but it might be close enough.
Saving the hardware and software for just this one purpose is unrealistic for the average person, but for cultural institutions it makes perfect sense. Stanford University is the home of the Stephen M. Cabrinety Collection in the History of Microcomputing and it is also building a Forensics Lab with a library of software and electronic devices for extracting software from original media, so that it can be run later in native or emulated environments. Similar labs at other institutions include the Maryland Institute for Technology in the Humanities, the International Center for the History of Electronic Games at the Strong National Museum of Play and the UT Videogame Archive at the Dolph Briscoe Center for American History, University of Texas at Austin. The Briscoe Center was featured in the Signal post about video game music composer George Sanger. (Dene Grigar, who was the subject of another Signal blog post, created a similar lab devoted to her vintage electronic literature collection at Washington State University, Vancouver.)
Henry Lowood, curator for History of Science & Technology Collections and Film & Media Collections in the Stanford University Libraries, was a lead in the Preserving Virtual Worlds project. Lowood has a historical interest in games, virtual worlds and their role in society, and he makes a case for the option of recording testimony about what a virtual world means to its users and builders.
Lowood helped create the Machinima and Virtual Worlds collections, which are hosted by our NDIIPP/NDSA partner, the Internet Archive. These collections host video recordings of activities and events in virtual worlds and immersive games. As the users perform actions and navigate through the worlds, they sometimes give a running commentary about what is happening and their thoughts and observations about its meaning to them.
A parent or teacher could use this same approach by shooting a video of a child giving you a tour of their virtual world. It’s an opportunity to capture the context around their creation of the worlds and for them to tell you how they felt about it and what choices they made. If they interact with others in a shared virtual world, the child can describe his or her interactions and maybe even relate anecdotes about certain events and experiences.
Screenshots are easy to take on computers and most hand-held devices. PCs have a “print screen” button on the keyboard; for Macs, hold down the Apple key ⌘ plus shift plus 3. For iPods, press and hold the main button below the screen and the power button on the top edge of the device at the same time. And so on. Search online for how to take screen shots or screen captures for your device.
The screenshot will save as a graphic file, usually a JPEG or PNG file. Transfer that JPEG to your computer, crop it and modify it with a photo processing program if you want. Maybe print the screen shots and put them on the refrigerator for you to admire. When you’re finished with the digital photo file, back it up with your other personal digital archives.
Recording a walk-through of a virtual world can be a slightly more complex task than taking a screenshot, but not terribly so. Search online for “screencast software,” “motion capture” or “screen recording” to find commercial and freeware screencast software. Even version 10 of the QuickTime player includes a screen-recording function. They all pretty much operate the same way: click a “Record” button, do your action on the computer and click “Stop” when you are finished. Everything that was displayed on the screen will be captured into a video file.
With the different screen capture software programs, be aware of the video file type that the software generates. QuickTime saves the video as an MOV file, Jing saves the video as an SWF file and so on. Different file types require different digital video players, so if you have any difficulty playing the file back on your computer search online to find the software that will play your video file type. If you upload a copy of your video to YouTube, backup a master copy somewhere else. Don’t rely on the YouTube version as your master “archived” copy.
Although this story is about the challenges of saving mementos from digital virtual worlds, the essence of the challenge — trying to preserve an experience — is not new. If I go to Hawaii, snorkel, build sand castles and have the time of my life, I cannot capture or hold onto that experience. I can only document the experience with photos, video and maybe write in a journal about it. In a way, it even goes back to the dawn of humanity, where people recorded their experiences by means of cave paintings.
So you cannot capture the experience of a virtual world but you can document it. And virtual worlds are a lot more accessible in 2014 than they were in 1990. It’s a long way from Jaron Lanier‘s work, from VPL labs and data gloves and headsets and exclusive access in special labs. Kids now carry their personalized virtual worlds in their handheld devices. Minecraft is just the current cool tool. Who can tell what is yet to come?
It seems appropriate to let Howard Rheingold have the last word on the subject. Rheingold is a writer, teacher, social scientist and thought-leader about the cultural impacts of technology. He is also an authority on virtual reality and virtual communities, having written the definitive books about both topics over twenty years ago. His current book is titled NetSmart.
In addition to his professional expertise, Rheingold is a caring father who dotes on his daughter. While he was researching and writing the books Virtual Reality (1991) and Virtual Communities: Homesteading on the Electronic Frontier (1994), his office walls were filled with her childhood artwork (she is now in her 20s). He brings a unique and authoritative perspective to this story.
Rheingold said, “I’ve been closely observing and writing about innovations in digital media and learning in recent years – and experiencing/experimenting directly through the classes I teach at Stanford and Rheingold U. Among my activities in this sphere is a video blog for DMLcentral, a site sponsored by the MacArthur Foundation’s Digital Media and Learning Initiative. It was there that I delved into the educational uses – and students and teachers’ passion for – Minecraft.
“In my interviews with teachers Liam O’Donnell and Sara Kaviar, it became clear that Minecraft was about much more than using computers to build things. It was a way to engage with a diverse range of abstract subject matter in concrete ways, from comparative religion to mathematics, and more importantly, a way for students to exercise agency in a schooling environment in which so much learning is dependent on what the teacher or textbook says.
“Minecraft artifacts are also important contributions to student e-portfolios, which will become more important than resumes in the not too distant future. Given the growing enthusiasm over Minecraft by students, teachers, and parents, and the pedagogical value of seeing these creations as artifacts and instruments of learning, it only makes sense to make it easy and inexpensive to preserve virtual world creations.”
The February issue of the Library of Congress Digital Preservation Newsletter (pdf) is now available!
Included in this issue:
- Spotlight on Digital Collections, including an interview with Lisa Green on Machine Scale Analysis of collections, and a look at the Cultural Heritage of the Great Smoky Mountains
- Digital Preservation Aid in Response to Tornado
- NDSA Digital Content Area: Web and Social Media
- Wikipedia and Digital Preservation
- AV Artifact Atlas, FADGI interview with Hanna Frost
- Several updates on the Residency Program
- Listing of upcoming events including the IDCC (Feb 24-27), Digital Maryland conference (March 7), Computers in Libraries (April 7-10), Personal Digital Archiving 2014 (April 10-11)
- And other articles about data, preservation of e-serials, and more.
To subscribe to the newsletter, sign up here
We’ve started planning our annual meeting, Digital Preservation 2014, which will be held July 22-24 in the Washington, DC area, and we want to hear from you! Any organization or individual with an interest in digital stewardship can propose ideas for potential inclusion in the meeting.
The Library of Congress has hosted annual meetings with digital preservation partners, collaborators and others committed to stewardship of digital content for the past ten years. The meetings have served as a forum for sharing achievements in the areas of technical infrastructure, innovation, content collection, standards and best practices and outreach efforts.
This year we’ve expanded participation from NDSA member organizations on the program committee. We’re delighted to have NDIIPP staff and NDSA members working together to contribute to the success of the meeting.
Digital Preservation 2014 Program Committee
- Vickie Allen, PBS Media Library
- Meghan Banach Bergin, University of Massachusetts Amherst
- Erin Engle, NDIIPP
- Abbie Grotke, NDIIPP
- Barrie Howard, NDIIPP
- Butch Lazorchak, NDIIPP
- Vivek Navale, U.S. National Archives and Records Administration
- Michael Nelson, Old Dominion University
- Trevor Owens, NDIIPP
- Abbey Potter, NDIIPP
- Nicole Scalessa, The Library Company of Philadelphia
Call for Proposals
We are looking for your ideas, accomplishments and project updates that highlight, contribute to, and advance the community dialog. Areas of interest include, but are not limited to:
- Scientific data and other content at risk of obsolescence, and what methods, techniques, and tools are being deployed to mitigate risk;
- Innovative methods of digital preservation, especially regarding sustainable practices, community approaches, and software solutions;
- Collaboration successes and lessons learned highlighting a wide-range of digital preservation activities, such as best practices, open source solutions, project management techniques and emerging tools;
- Practical examples of research and scholarly use of stewarded data or content;
- Educational trends for emerging and practicing professionals.
You are invited to express your interest in any of the following ways:
- Panels or presentations
- 5-minute lightning talks
A highlight of this past year was the release of the 2014 National Digital Stewardship Agenda at Digital Preservation 2013. The Agenda integrates the perspective of dozens of experts to provide funders and decision-makers with insight into emerging technological trends, gaps in digital stewardship capacity and key areas for development. It suggests a number of important sets of issues for the digital stewardship community to consider prioritizing for development. We’d be particularly interested to have you share projects your organization has undertaken in the last year that address any of the issues listed in the Agenda.
To be considered, please send a description of 300 words or less of what you would like to present to ndiipp [at] loc.gov by March 14. Proposers will be notified of acceptance on or around April 3.
The last day of the meeting, July 24, will be a CURATEcamp, which will take place off-site from the main meeting venue. The topic focus of this camp is still under discussion, so stay tuned for more information about the camp in the coming weeks.
Please let us know if you have any questions. Your contributions are important in making this a community program and we’re looking forward to your participation.
EDRMS across New Zealand’s Government – Challenges with even the most managed of records management systems!
First things first. The GitHub repository with the Audio QA workflows is here: https://github.com/statsbiblioteket/scape-audio-qa. And version 1 is working. “Version” is really the wrong word here; I should call it Workflow 1, which is this one:
To sum up, this workflow performs migration, conversion and content comparison. The top left box (a nested workflow) migrates a list of mp3s to wav files via a Hadoop map-reduce job using the command-line tool FFmpeg, and outputs a list of migrated wav files. The top right box converts the same list of mp3s to wav files via another Hadoop map-reduce job using the command-line tool mpg321, and outputs a list of converted wav files. The Taverna workflow then joins the two lists of wav files, and the bottom box receives a list of pairs of wav files to compare. The bottom box compares the content of the paired files in a Hadoop map-reduce job using the xcorrSound waveform-compare command-line tool, and outputs the results of the comparisons.
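Outside Taverna and Hadoop, the per-file core of this workflow can be sketched as plain command-line invocations (a simplified sketch; the exact waveform-compare argument order is an assumption here, and none of the map-reduce parallelism is shown):

```python
import subprocess

def migrate_cmd(mp3_path, wav_path):
    """Migration step: decode the mp3 to wav with FFmpeg."""
    return ["ffmpeg", "-y", "-i", mp3_path, wav_path]

def convert_cmd(mp3_path, wav_path):
    """Independent conversion step: decode with mpg321 (-w writes wav)."""
    return ["mpg321", "-w", wav_path, mp3_path]

def compare_cmd(wav_a, wav_b):
    """QA step: xcorrSound waveform-compare on the two decodings
    (argument order assumed for illustration)."""
    return ["waveform-compare", wav_a, wav_b]

def check_migration(mp3_path):
    """Run both decodings of one mp3 and compare their waveforms.
    If two independent decoders agree, the migration is likely sound."""
    migrated = mp3_path + ".ffmpeg.wav"
    converted = mp3_path + ".mpg321.wav"
    subprocess.run(migrate_cmd(mp3_path, migrated), check=True)
    subprocess.run(convert_cmd(mp3_path, converted), check=True)
    result = subprocess.run(compare_cmd(migrated, converted))
    return result.returncode == 0  # 0 = waveforms match
```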
What we would like to do next is:
- "Reduce" the output of the Hadoop map-reduce job that uses the waveform-compare command-line tool
- Do an experiment on 1TB input mp3 files on the SB Hadoop cluster, and write an evaluation and a new blog post ;-)
- Extend the workflow with property comparison. The waveform-compare tool only compares sound waves; it does not look at the header information, which should also be part of quality assurance for a migration. The reason this is not top priority is that FFprobe property extraction and comparison is very fast, and will probably not affect performance much...
The following is a guest post by Julia Blase, National Digital Stewardship Resident at the National Security Archive.
In case you hadn’t heard, the ALA Midwinter Meeting took place in Philadelphia last weekend, attended by around 12,000 librarians and exhibitors. If you didn’t attend, or didn’t have friends there to take notes for you, the Twitter hashtag #alamw14 has it covered – enough content for days of exploration! If you’d like to narrow your gaze, and in the theme of this post, you could refine your search for tweets containing both #alamw14 and #NDSR, because the National Digital Stewardship Residents were there in force, attending and presenting.
Emily Reynolds, the Resident at the World Bank, was so kind as to compile a list of the sessions we aimed to attend before the conference. On Saturday, though none of us made it to every event, at least a few of us were at the Preservation Administrators Interest Group, Scholarly Communications Interest Group, Digital Conversion Interest Group, Digital Special Collections Discussion Group and Challenges of Gender Issues in Technology sessions.
The first session I attended, along with Lauren Work and Jaime McCurry, was the Digital Conversion Interest Group session, where we heard fantastic updates on audiovisual digital conversion practices and projects from the American Folklife Center, the American Philosophical Society library, Columbia University Libraries and George Blood Audio and Video. I particularly enjoyed hearing about the successful APS attempt to digitize audio samples of Native American languages, many of which are endangered and the positive reaction from the Native community. For audio, it seemed, sometimes digitization is the best form of preservation!
The second session I attended, with Emily Reynolds and Lauren Work, was the Gender Issues in Technology discussion group (see news for it at #libtechgender). We were surprised, but pleased, at the number of attendees and quality of the discussion around ways to improve diversity in the profession. Among the suggestions we heard were to include diverse staff members on search committees, to monitor the language within your own organization when you review candidates to ensure that code words like “gravitas” (meaning “male,” according to the panelists) aren’t being used to exclude groups of candidates, to put codes of conduct into place to help remind everyone of a policy of inclusiveness, and to encourage employees to respond positively to mentorship requests, especially from members of minority groups (women, non-white, not traditionally gendered). The discussion seemed to us to be a piece of a much larger, evolving, and extended conversation that we were glad to see happening in our professional community!
On Sunday, though a few of us squeezed in a session or two, our primary focus was our individual project update presentations, given at the Digital Preservation Interest Group morning session, and also our extended project or topic presentations at the Library of Congress booth in the early afternoon. The individual presentations, I’m pleased to say, went very well! It would be impossible to recap each presentation here; however, many of us have posted project updates recently, so please be sure to look us up for details. Furthermore, searching Twitter for #alamw14 and #NDSR brings you to this list, in which you can find representative samples of the highlights from our individual presentations.
Presentations – Question and Answer Session
We concluded the session by taking some questions, all of which were excellent – particularly the one from Howard Besser, who wanted to know how we believed our projects (or any resident or fellowship temporary project) could be carried on at the conclusion of our project term. The general response was that we are doing our best to ensure they are continued by integrating the projects, and ourselves, into the general workflows of our organizations – keeping all stakeholders informed from an early stage of our progress, finding support from other divisions, and documenting all of our decisions so that any action may be picked up again as easily as possible.
We also had an excellent question about how important networking had been for the success of our projects, and all agreed that, while networking with the D.C. community has been essential (through our personal efforts and also through groups like the DCHDC meetup), almost more significant has been our ability to network with each other – to share feedback, resources, documents, websites, and connections to other networks, which has helped us accomplish our goals more efficiently and effectively. One of the goals of the NDSR program was, of course, to help institutions get valuable work done in the area of digital stewardship, which we are all doing. However, another goal was for the program to help build a professional community in digital stewardship. What is a community if not a group of diverse professionals who trust and rely on each other, who share successes and setbacks, resources and networks, and who support each other as we learn and grow? Though the language is my own, the sentiment is one I heard shared between us over and over during the ALA weekend.
NDSR Recent Activity
In recent news, Emily Reynolds and Lauren Work both discuss their take on our ALA experience, Emily’s here and Lauren’s here. Molly Swartz published some pictures and thoughts on ALA Midwinter over here. Jaime McCurry recently interviewed Maureen McCormick-Harlow about her work at the National Library of Medicine. And to conclude, I’ve recently posted two updates on my project, one on this page and another courtesy of the Digital Libraries Federation.
Thanks for listening, and be sure to tune in two weeks from now when Maureen McCormick-Harlow will be writing another NDSR guest post. If you, like us, were at ALA Midwinter last weekend, I hope you found it as enjoyable as we did!
One of my first blogs here covered an evaluation of a number of format identification tools. One of the more surprising results of that work was that out of the five tools that were tested, no less than four of them (FITS, DROID, Fido and JHOVE2) failed to even run when executed with their associated launcher script. In many cases the Windows launcher scripts (batch files) only worked when executed from the installation folder. Apart from making things unnecessarily difficult for the user, this also completely flies in the face of all existing conventions on command-line interface design. Around the time of this work (summer 2011) I had been in contact with the developers of all the evaluated tools, and until last week I thought those issues were a thing of the past. Well, was I wrong!

FITS 0.8
Fast-forward 2.5 years: this week I saw the announcement of the latest FITS release. This got me curious, also because of the recent work on this tool as part of the FITS Blitz. So I downloaded FITS 0.8, installed it in a directory called c:\fits\ on my Windows PC, and then typed (while being in directory f:\myData\):

f:\myData>c:\fits\fits
Instead of the expected help message I ended up with this:

The system cannot find the path specified.
Error: Could not find or load main class edu.harvard.hul.ois.fits.Fits
Hang on, I've seen this before ... don't tell me this is the same bug that I already reported 2.5 years ago ? Well, turns out it is after all!
This got me curious about the status of the other tools that had similar problems in 2011, so I started downloading the latest versions of DROID, JHOVE2 and Fido. As I was on a roll anyway, I gave JHOVE a try as well (even though it was not part of the 2011 evaluation). The objective of the test was simply to run each tool and get some screen output (e.g. a help message), nothing more. I did these tests on a PC running Windows 7 with Java version 1.7.0_25. Here are the results.

DROID 6.1.3
First I installed DROID in a directory C:\droid\. Then I executed it using:

f:\myData>c:\droid\droid
This started up a Java Virtual Machine Launcher that showed this message box:
The Running DROID text document that comes with DROID says:
To run DROID on Windows, use the "droid.bat" file. You can either double-click on this file, or run it from the command-line console, by typing "droid" when you are in the droid installation folder.
So, no progress on this for DROID either, then. I was able to get DROID running by circumventing the launcher script like this:

java -jar c:\droid\droid-command-line-6.1.3.jar
This resulted in the following output:

No command line options specified
This isn't particularly helpful. There is a help message, which you get by giving the -h flag on the command line. But you don't get to see that until you already know about the -h flag. Catch-22, anyone?

JHOVE2 2.1.0
After installing JHOVE2 in c:\jhove2\, I typed:

f:\myData>c:\jhove2\jhove2
This gave me 1393 (yes, you read that right: 1393!) Java deprecation warnings, each along the lines of:

16:51:02,702 [main] WARN TypeConverterDelegate : PropertyEditor [com.sun.beans.editors.EnumEditor] found through deprecated global PropertyEditorManager fallback - consider using a more isolated form of registration, e.g. on the BeanWrapper/BeanFactory!
This was eventually followed by the (expected) JHOVE2 help message, and a quick test on some actual files confirmed that JHOVE2 does actually work. Nevertheless, by the time the tsunami of warning messages is over, many first-time users will have started running for the bunkers!

Fido 1.3.1
Fido doesn't make use of any launcher scripts any more, and the default way to run it is to use the Python script directly. After installing in c:\fido\ I typed:

f:\myData>c:\fido\fido.py
Which resulted in ..... (drum roll) ... a nicely formatted Fido help message, which is exactly what I was hoping for. Beautiful!

JHOVE 1.11
I installed JHOVE in c:\jhove\ and then typed:

f:\myData>c:\jhove\jhove
Which resulted in this:

Exception in thread "main" java.lang.NoClassDefFoundError: edu/harvard/hul/ois/jhove/viewer/ConfigWindow
        at edu.harvard.hul.ois.jhove.DefaultConfigurationBuilder.writeDefaultConfigFile(Unknown Source)
        at edu.harvard.hul.ois.jhove.JhoveBase.init(Unknown Source)
        at Jhove.main(Unknown Source)
Caused by: java.lang.ClassNotFoundException: edu.harvard.hul.ois.jhove.viewer.ConfigWindow
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        ... 3 more
I limited my tests to a Windows environment only, and results may well be better under Linux for some of these tools. Nevertheless, I find it nothing less than astounding that so many of these (often widely cited) preservation tools fail to even execute on today's most widespread operating system. Granted, in some cases there are workarounds, such as tweaking the launcher scripts, or circumventing them altogether. However, this is not an option for less tech-savvy users, who will simply conclude "Hey, this tool doesn't work," give up, and move on to other things. Moreover, this means that much of the (often huge) amount of development effort that went into these tools will simply fail to reach its potential audience, which I think is a tremendous waste. I'm also wondering why there's been so little progress on this over the past 2.5 years. Is it really that difficult to develop preservation tools with command-line interfaces that follow basic design conventions that have been ubiquitous elsewhere for more than 30 years? Tools that just work?
Here’s a simple experiment that involves asking an average person two questions. Question one is: “how do you feel about physical books?” Question two is: “how do you feel about digital data?”
The first question almost surely will quickly elicit warm, positive exclamations about a life-long relationship with books, including the joy of using and owning them as objects. You may also hear about the convenience of reading on an electronic device, but I’ll wager that most people will mention that only after expounding on paper books.
The second question shifts to cooler, more uncertain ground. The addressee may well appear baffled and request clarification. You could help the person a bit by specifying digital materials of personal interest to them, such as content that resides on their tablet or laptop. “Oh, that stuff,” they might say with measured relief. “I’m glad it’s there.”
These divergent emotional reactions should be worrying to those of us who are committed to keeping digital cultural heritage materials accessible over time. Trying to make a case for something that lacks emotional resonance is difficult, as marketing people say. Most certainly, the issue of limited resources is a common refrain when it comes to assessing the state of digital preservation in cultural heritage institutions; see the Canadian Heritage Information Network’s Digital Preservation Survey: 2011 Preliminary Results, for example.
The flip side is that traditional analog materials are a formidable competitor for management resources because those materials are seen in a glowing emotional context. I don’t mean to say that analog materials are awash in preservation money; far from it. But physical collections still have to be managed even as the volume of digital holdings rapidly rises, and efforts to move away from reliance on the physical are vulnerable to impassioned attack by people such as Nicholson Baker.
What is curious is that even as we collectively move toward an ever deeper relationship with digital, there remains a strong nostalgic bond with traditional book objects. A perfect example of this is a recent article, Real books should be preserved like papyrus scrolls. The author fully accepts the convenience and the future dominance of ebooks, and is profoundly elegiac in his view of the printed word. But, far from turning away from physical books, he declares that “books have a new place as sacred objects, and libraries as museums.” One might see this idea as one person’s nostalgic fetish, but it’s more than that. We can only wonder how long and to what extent this kind of powerful, emotionally-propelled thinking will drive how cultural heritage institutions operate, and more importantly, how they are funded.
As I’ve written before, we’re at a point where intriguing ideas are emerging about establishing a potentially deeper and more meaningful role for digital collections. This is vitally important, as a fundamental challenge that lies before those who champion digital cultural heritage preservation is how to develop a narrative that can compete in terms of personal meaning and impact.
Anyone willing to preserve digital content must be aware of events that might constitute a relevant risk. In SCAPE we are developing tools that will allow you to detect risks before they cause any irreversible damage.
Help us understand which preservation events, threats and opportunities you find most relevant, and how you would like us to detect them.

Participate in our survey and help us develop tools that will help you automatically detect problems in your own content, and events that might put it at risk.
The survey has 30 short questions that should take about 10 minutes to complete. Join the survey now: http://survey.scape-project.eu/index.php/862812/lang-en
How do we make digital collections available at scale for today’s scholars and researchers? Lisa Green, director of Common Crawl, tackled this and related questions in her keynote address at Digital Preservation 2013. (You can view her slides and watch a video of her talk online.) As a follow up to ongoing discussions of what users can do with dumps of large sets of data, I’m thrilled to continue exploring the issues she raised in this insights interview.
Trevor: Could you tell us a bit about Common Crawl? What is your mission, what kinds of content do you have and how do you make it available to your users?
Lisa: Common Crawl is a non-profit organization that builds and maintains an open repository of web crawl data that is available for everyone to access and analyze. We believe that the web is an incredibly valuable dataset capable of driving innovation in research, business, and education, and that the more people who have access to this dataset, the greater the benefit to society. The data is stored on public cloud platforms so that anyone with access to the internet can access and analyze it.
Trevor: In your talk, you described the importance of machine scale analysis. Could you define that term for us and give some examples of why you think that kind of analysis is important for digital collections?
Lisa: Let me start by describing human scale analysis. Human scale analysis means that a person ingests information with their eyes and then processes and analyzes it with their brain. Even if several people – or even hundreds of people – work on the analysis, they cannot ingest, process, and analyze information as fast as a computer program can. Machine scale analysis is when a computer program does the analysis. A computer program can analyze data millions to billions of times faster than a human. It can run 24 hours a day with no need for rest and it can simultaneously run on multiple machines.
Machine scale analysis is important for digital collections because of the massive volume of data in most digital collections. Imagine that a researcher wanted to study the etymology of a word and planned to use a digital collection to answer questions such as:
- What is the first occurrence of this word?
- How did the frequency of occurrence change over time?
- What types of publications did it first appear in?
- When did it first appear in other types of publications and how did the types of publications it appeared in change over time?
- What other words most commonly appear in the same sentence, paragraph or page with the word and how did that change over time?
Answering such questions using human scale analysis would take lifetimes of man hours to search the collection for the given word. Machine scale analysis could retrieve the information in seconds or minutes. And if the researcher wanted to make changes to the questions or criteria, only a small amount of effort would be required to alter the software program; then the program could be rerun and return the new information in seconds or minutes. If we want to optimize the extraction of knowledge from the enormous amounts of data in digital collections, human analysis is simply too slow.
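The contrast Lisa draws can be made concrete with a small sketch. Assuming a toy corpus of (year, text) records (the corpus contents and function names below are invented for illustration), a few lines of Python answer the first two questions above in milliseconds:

```python
# A minimal sketch of machine scale analysis over a toy corpus.
# The corpus, years, and texts are invented for illustration.
from collections import Counter

corpus = [
    (1880, "the telephone was a curiosity"),
    (1910, "every office had a telephone on the desk"),
    (1950, "the telephone and the television competed for attention"),
]

def first_occurrence(word, corpus):
    """Earliest year in which the word appears, or None if absent."""
    years = [year for year, text in corpus if word in text.split()]
    return min(years) if years else None

def frequency_by_year(word, corpus):
    """Number of occurrences of the word in each year's text."""
    freq = Counter()
    for year, text in corpus:
        freq[year] += text.split().count(word)
    return dict(freq)

print(first_occurrence("telephone", corpus))    # 1880
print(frequency_by_year("telephone", corpus))   # {1880: 1, 1910: 1, 1950: 1}
```

Re-running with a different word, or adding another question, is a one-line change, which is exactly the flexibility described above.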
Trevor: What do you think libraries, archives and museums can learn from Common Crawl’s approach?
Lisa: I think it is of crucial importance to preserve data in a format that computers can analyze. For instance, if material is stored as a PDF, it is difficult – and sometimes impossible – for software programs to analyze the material, and therefore libraries, archives and museums will be limited in the amount of information that can be extracted from the material in a reasonable amount of time.
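A toy illustration of this point, with invented strings: content held as machine-readable plain text is directly searchable by a program, while the same sentence inside a PDF sits in a compressed binary stream that naive string search cannot reach.

```python
# Content stored as machine-readable plain text: one line answers
# "how often does this word occur?" (strings invented for illustration).
plain_text = "The telephone changed everything. The telephone rang."
occurrences = plain_text.lower().count("telephone")   # 2

# The same sentence inside a PDF would typically live in a compressed
# content stream, so a program must first run a text-extraction step
# (and for scanned, image-only PDFs even that may yield nothing).
```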
Trevor: What kind of infrastructure do you think libraries, archives and museums need to have to be able to provide capability for machine scale analysis? Do you think they need to be developing that capacity on their own systems or relying on third party systems and platforms?
Lisa: The two components are storage and compute capacity. When one thinks of digital preservation, storage is always considered but compute capacity is not. Storage is necessary for preservation and the type of storage system influences access to the collection. Compute capacity is necessary for analysis. Building and maintaining the infrastructure for storage and compute can be expensive, so it doesn’t make much financial sense for each organization to develop its own.
One option would be a collaborative, shared system built and used by many organizations. This would allow the costs to be shared, avoid duplicative work and duplicate storage of material, and – perhaps most importantly – maximize the number of people who have access to the collections.
Personally I believe a better option would be to utilize existing third party systems and platforms. This option avoids the cost of developing custom systems and often makes it easier to maintain or alter the system as there is a greater pool of technologists familiar with the popular third party platforms.
I am a strong believer in public cloud platforms because there is no upfront cost for the hardware, no need to maintain or replace hardware, and one only pays for the storage and compute that is used. I think it would be wonderful to see more libraries, museums, and archives storing copies of their collections on public cloud platforms in order to increase access. The most interesting use of your data may be thought of by someone outside your organization, and the more people who can access the data, the more minds can work to find valuable insight within it.
Interface, Exhibition & Artwork: Geocities, Deleted City and the Future of Interfaces to Digital Collections
In 2009, a band of rogue digital preservationists called Archive Team did their best to collect and preserve Geocities. The resulting data has since become the basis for at least two works of art: Deleted City and One Terabyte of Kilobyte Age. I think the story of this data set and these works offers insights into the future roles of cultural heritage organizations and their collections.
Let Them Build Interfaces
In short, Archive Team collected the data and made the dataset available for bulk download. If you like, you can also just access the 51,000 MIDI music files from the data set from the Internet Archive. Beyond that, because the data was available en masse, the corpus of personal websites became the basis for other works. Taking the Geocities data as a basis, Richard Vijgen’s Deleted City interprets and presents an interface to the data, and Olia Lialina & Dragan Espenschied’s One Terabyte of Kilobyte Age is in effect a designed reenactment grounded in an articulated approach to accessibility and authenticity.
An Artwork as the Interface to Your Collection
Some of the most powerful ways to interact with the Geocities collection are through works created by those who have access to the collection as a dataset. Working with digital objects means we don’t need to define in advance the way they will be accessed or made available. By making the raw data available on the web, and providing a point of reference for the data set, everyone is enabled to create interfaces to it.
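As a hypothetical sketch of what "creating an interface" to a raw dump can mean in practice, consider indexing a bulk download by file extension so that, say, every MIDI file can be listed; the file paths below are invented:

```python
# A tiny, hypothetical "interface" to a bulk data dump: group files by
# extension so that all MIDI files can be listed. Paths are invented.
from collections import defaultdict
from pathlib import PurePosixPath

def index_by_extension(paths):
    """Map lower-cased file extension -> list of matching paths."""
    index = defaultdict(list)
    for p in paths:
        index[PurePosixPath(p).suffix.lower()].append(p)
    return index

dump = [
    "geocities/area51/1000/index.html",
    "geocities/area51/1000/theme.mid",
    "geocities/sunsetstrip/2000/tune.MID",
]
index = index_by_extension(dump)
midi_files = index[".mid"]   # catches both .mid and .MID
```

Anyone holding the dump can build this kind of view themselves; the stewarding organization never has to anticipate it.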
How do we make digital collections and objects available?
Access remains the burning question for cultural heritage organizations interested in the acquisition and preservation of digital artifacts and collections. What kinds of interfaces do they need in place to serve what kinds of users? If you don’t know in advance how you will make it available, what can you do with it? I’ve been in discussions with staff from a range of cultural heritage organizations who don’t really want to wade too deep into acquiring born-digital materials without having a plan for how to make them available.
The story of Geocities, Archive Team and these artists suggests that if you can make the data available, you can invite others to invent the interfaces. If users can help figure out and develop modes of access, as illustrated in this case, then cultural heritage organizations could potentially invite much larger communities of users to help figure out issues around migration and emulation as modes of access as well. By making the content broadly available, organizations can broaden the network of people who might contribute to efforts to keep digital artifacts accessible into the future.
Collections and Interfaces Inside and Outside
An exciting model can emerge here. Through data dumps of full sets of raw data, cultural heritage organizations can embrace the fact that they don’t need to provide the best interface, or for that matter much of any interface at all, for digital content they agree to steward. Instead, a cultural heritage organization can agree to acquire materials or collections which are considered interesting and important but to which it doesn’t necessarily have the resources or inclination to build sophisticated interfaces, provided it is willing to simply offer a canonical home for the data, provide information about the data’s provenance, and invest in dedicated ongoing bit-level preservation. This approach would resonate quite strongly with a “more product, less process” approach to born-digital archival materials.
An Example: 4Chan Collection/Dataset @ Stanford
For a sense of what it might look like for a cultural heritage organization to do something like this, we need look no further than a recent Stanford University Library acquisition. The recent acquisition of an archive of 4Chan data into Stanford’s digital repository shows how a research library could go about exactly this sort of activity. The page for the data set/collection briefly describes the structure of the data and gives some information and context about the collector who offered it to Stanford. Stanford acts as the repository and makes the data available for others to explore, manipulate and create a multiplicity of interfaces to. How will others explore or interface with this content? Only time will tell. In any event, it likely did not take many resources to acquire, and it will likely not require much in resources to maintain at a basic level into the future.
How to encourage rather than discourage this?
If we wanted to encourage this kind of behavior, how would we do it? First off, I think we need more data dumps of this kind of data, with the added note that bite-sized downloadable chunks of data are going to be the easiest thing for any potential user to right-click and save to their desktop. Beyond that, cultural heritage organizations could embrace this example and put up prizes and bounties for artists and designers to develop and create interfaces to different collections.
What I think is particularly exciting here is that by letting go of the requirement to provide the definitive interface, cultural heritage organizations could focus more on selection and on working to ensure the long-term preservation and integrity of data. Who knows, some of the interfaces others create might be such great works of art that another cultural heritage organization might feature them in its own database of works.
Last spring, I attended a Hackathon at the University of Leeds, which resulted in my getting a SPRUCE Grant for a month’s work enhancing FITS, a tool which at the time was technically open source but which the Harvard Library treated a bit possessively. After I finished, it seemed for a while that nothing was happening with my work, but it was just a matter of being patient enough. Collaboration between Harvard and the Open Planets Foundation has resulted in a more genuinely open FITS, which now has its own website. There’s also a GitHub repository with five contributors, none of whom are me, since my work was on an earlier repository that was incorporated into this one.
It really makes me happy to see my work reach this kind of fruition, even if I’m so busy on other things now that I don’t have time to participate.