What if the Kennedy assassination had happened during the era of smartphones and laptops? And, assuming the perpetrator left a digital trail, would that evidence uncover any associated conspiracy?
As we approach the 50th anniversary of that awful day in Dallas, recent public opinion polls indicate that over 60 percent of Americans believe more than one person was involved with the assassination. These beliefs float on a steady stream of books and other media that scrutinize the various pieces of evidence available: recorded gunshots, photographs, bullets (both “magic” and regular) and the most famous home movie ever, the Zapruder film.
All manner of experts and enthusiasts have reviewed the evidence but agreement about what it means remains elusive: while 95 percent of all books on the subject depict a conspiracy, the purported conspirators are wildly varied and include Nazis, extraterrestrials and Corsican hitmen, among others. As The Atlantic noted a while back, much of this output is “popularized by a national appetite for mystery and entertainment.” Other studies have looked at the same evidence and concluded with certainty that Oswald acted alone.
If Oswald had lived in the digital age, he seems to me like the sort of person who would have actively participated in chat rooms, commented on blogs and broadcast his opinions via all kinds of social media. He probably would have left behind a device, such as a laptop, that documented his web browsing habits and his email contacts. Forensic investigators would have had a trove of information about whom he knew and when he knew them. That evidence would have been critical both for the initial needs of law enforcement and for later researchers.
Ah, endlessly fascinating. Would there be emails from disgruntled government operatives? Texts from organized crime figures? Photographs of other gunmen? Perhaps a series of tweets with darkly cryptic warnings? From a rational perspective, one would think that such details would go a long way toward proving or disproving a conspiracy.
One thing is for sure: there would be lots of digital information to capture, examine and preserve. The question, however, remains open as to the research impact of this kind of evidence. Data from an Oswald laptop could disprove theories or throw open the door to a flood of conspiratorial prospects. Or some jumbled mix of both, in spite of William S. Burroughs’s proclamation that “the purpose of technology is not to confuse the brain, but to serve the body, to make life easier.”
Ultimately, as with any subject, it would come down to what researchers make of the preserved body of evidence.
At this point, most of the experience with digital forensics lies within the law enforcement world, although there is growing interest on the part of memory organizations in acquiring this capability; see, for example, Digital Forensics and Preservation (PDF) and the BitCurator project. This is a good thing. Even though there is no Oswald laptop, there can be no doubt that digital forensic evidence will grow increasingly important for historical research.
The following is a guest post from Jane Mandelbaum, co-chair of the National Digital Stewardship Alliance Innovation Working Group and IT Project Manager at the Library of Congress.
As part of our ongoing series of Insights interviews with individuals doing innovative work related to digital preservation and stewardship, I am excited to talk with Brian Schmidt. Brian works as an astronomer at the Research School of Astronomy and Astrophysics at the Australian National University, and his research is based on a lot of the “big data” that many individuals in the digital preservation and stewardship community have been keenly interested in. Schmidt shared both the 2006 Shaw Prize in Astronomy and the 2011 Nobel Prize in Physics for providing evidence that the expansion of the universe is accelerating.
Jane: I read that you’ve predicted that IT specialists will be at the core of building new telescopes. For example, your SkyMapper project, which is currently scanning the southern sky, has a peak data rate of one terabyte per day. The Australian Square Kilometre Array Pathfinder, an array of 36 radio telescope dishes being built in Australia, will generate two terabytes per second. Can you talk about how you think astronomers and IT specialists will work together on these kinds of projects?
Brian: New telescopes like SkyMapper are creating massive amounts of data, a terabyte of data each night. Processing a terabyte of data a night and making that data useful is as much an interesting computer science problem as it is an astronomy problem. In the past, astronomers did a lot of this kind of computer science work themselves. But the reality is, this has moved beyond what I can do sensibly myself. We need interdisciplinary groups of researchers to work together to meet these challenges. Astronomers need to be able to specify the scientific outcomes and algorithms, but implementation, and the design of systems and databases and how that data is served, is a computer science problem. So we work with computer scientists to meet our needs. If you have a lot of data, and you’re not a computer scientist, you really want to use the expertise that is out there.
Jane: Do you think that astronomers deal with data differently than other scientists?
Brian: Astronomers are very open with their data. This is one of the reasons that projects like the Sloan Digital Sky Survey work in our field. Alongside that, our data consists of representations of the night sky. Everyone knows what stars look like, which means that people understand what we do in a way that they might not with other sciences. Aside from that, much of our data, for example images of galaxies, is beautiful in a way that something like DNA sequences isn’t. These features are all important for our ability to create complex citizen science projects.
Jane: It is sometimes said that astronomers are the scientists who are closest to practitioners of digital preservation because they are interested in using and comparing historical data observations over time. Do astronomers think they are in the digital preservation business?
Brian: Historical data is of the utmost importance in astronomy. Astronomers are often looking for subtle changes that occur over hundreds of years. For example, if we discover a new asteroid that might come close to Earth, we need to go back to the archives and see what data we have on it to figure out if it is a threat. The more years of data you have, the more accurately you can predict the orbit. Other sciences benefit from this kind of long view of historical data; however, we’re the discipline that has had our act together for the longest period of time.
Jane: What do you think the role of traditional libraries, museums and archives should be when dealing with astronomical data and artifacts?
Brian: I think we are still figuring out the role that libraries, archives and museums have to play in the contemporary work of astronomers. In 2003 a firestorm largely destroyed the Mount Stromlo Observatory, including its library. As a result of the work of IT and library staff, all of the digital information of the observatory was backed up and restored from off site. However, all the paper was just gone. Losing a library of resources is a major loss; however, at this point, astronomy is basically a completely digital field. We keep a small number of books around for reference, but when we want to read the literature we have the Harvard/Smithsonian Astrophysics Data System. Just about every interaction I and my colleagues have with papers and articles is through that portal. Just search and download the full text.
While we have digital access to research and reference material through services like the Astrophysics Data System, there are substantial information challenges we are facing that I think libraries, archives and museums could help with. We’re even more information driven than in the past. Our work could be substantially aided with libraries providing systems for working with and curating data. Libraries need to figure out how to help curate and make available data and data products. Ideally, we would have librarians taking on increasingly specialist niches, across many institutions. In our library, we are bringing in more staff who have expertise in data management – trained astronomers who decide they want to be exporting data to the masses. I think training people in library science curation is important too, and I imagine we will increasingly see individuals with these skill sets and background embedded in the teams that produce, maintain, and provide access to various data products.
Jane: “Big data” analysis is often cited as valuable for finding patterns and/or exceptions. How does this relate to the work of astronomers?
Brian: Astronomers are often interested in very rare objects. For example, SkyMapper is cataloging 10 billion stars, and we want to find the earliest stars in the Milky Way, which have a specific color signature. We need that many stars to find enough of them to do our research, and as a result, we need to use data mining techniques to find those very few needles in that gigantic haystack.
Jane: What do you think astronomers have to teach others about generating and using the increasing amounts of data you are seeing now in astronomy?
Brian: Astronomers have been very good at developing standards (database and serving standards). There is a persistent danger that every library uses its own standards. You don’t want to have to work across hundreds of standards to make sense of what each piece of data means. You want it to be universal and also flexible enough to add things. Astronomy has been doing this for a good while and it’s not easy. Getting workable data standards in place requires a consensus dictatorship. It requires collaboration between librarians and computer scientists to figure out how to create and maintain data hierarchies. Astronomers developed the FITS data standard in the 1980s and are still using it. In the last five to seven years it has diverged a bit in the field, which suggests we likely need to revisit and revise it. Every time an observatory observes something, there are stars in common between observations that can serve as a point of reference. Linking this data can be very complicated; cross-matching is a difficult problem for 10 billion objects. The obvious thing is to give every object an index number, but you have to allow for uncertainty.
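As a rough, hypothetical illustration of what cross-matching involves (the coordinates and the two-arcsecond tolerance below are invented, and this is not any survey’s actual pipeline), a nearest-neighbor match between two tiny catalogs might look like this using the open-source astropy library:

```python
from astropy.coordinates import SkyCoord
import astropy.units as u

# Two small, invented catalogs of sky positions (RA/Dec in degrees).
survey_a = SkyCoord(ra=[10.0000, 45.3000, 120.7500] * u.deg,
                    dec=[-5.0000, 20.1000, 33.4200] * u.deg)
survey_b = SkyCoord(ra=[10.0001, 45.3001, 200.0000] * u.deg,
                    dec=[-5.0002, 20.1001, -60.0000] * u.deg)

# For each object in survey_a, find its nearest neighbor in survey_b.
idx, sep2d, _ = survey_a.match_to_catalog_sky(survey_b)

# Accept a match only within an arbitrary 2-arcsecond tolerance, one crude
# way of "allowing uncertainty" when linking objects across observations.
for i, (j, sep) in enumerate(zip(idx, sep2d)):
    matched = sep < 2 * u.arcsec
    print(f"A[{i}] -> B[{j}]: {sep.to(u.arcsec).value:.2f} arcsec, matched={matched}")
```

Doing the same thing for 10 billion objects, with positional uncertainties and proper motion, is where the real difficulty Schmidt describes comes in.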
Jane: What do you think will be different about the type of data you will have available and use in 10 years or 20 years?
Brian: We are going to continue to have more and more data and information. Now we have images of the sky, but in the future we will have images at thousands of wavelengths (compared to five or six now). We are going to have data cubes that record coordinates and intensity at 16,000 frequencies from radio telescopes. We are talking about instruments that generate a petabyte of data a night. This quantity of data is a challenge for every part of a system. It’s difficult to store, retrieve, process and analyze, and exactly how we work with it is a work in progress. We very well may need to process this data in real time, finding the signal we care about and disregarding the noise, because the initial raw data is just too much to deal with if we let it pile up.
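To give a rough sense of why such data cubes strain storage and processing, here is a back-of-the-envelope sketch; the image dimensions and 32-bit samples are assumptions made for illustration, not the specification of any particular instrument:

```python
# Hypothetical radio-astronomy data cube: two sky axes plus a frequency axis.
n_ra, n_dec = 10_000, 10_000   # assumed image size in pixels (illustrative only)
n_freq = 16_000                # roughly the channel count mentioned above
bytes_per_sample = 4           # assumed 32-bit floating-point intensities

cube_bytes = n_ra * n_dec * n_freq * bytes_per_sample
print(f"one cube: {cube_bytes / 1e12:.1f} TB")       # about 6.4 TB for these assumptions

# A single spectrum (intensity vs. frequency at one sky position) is tiny by
# comparison, which is why data layout and indexing matter so much.
spectrum_bytes = n_freq * bytes_per_sample
print(f"one spectrum: {spectrum_bytes / 1024:.1f} KB")
```

A cube of these assumed dimensions already runs to several terabytes, and a night of observing produces many such products on top of the underlying raw data, which is how petabyte-per-night figures arise.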
Jane: Speaking of raw data, do astronomers share raw data, and if so, how? When they do share, what are their practices for assigning credit and value to that work? Do you think this will change in the future?
Brian: Astronomers tend to store data in multiple formats. There is the raw data, as it comes off the telescope, and we tend to store a copy of that. However, the average researcher doesn’t care about that. They want it transformed into its final state: fully calibrated, with every pixel mapped to a known position in the sky. At this point, all the data we provide access to is processed data. You can make a query and we give back “here’s this star and its properties.” It’s just too hard to query the actual images we’ve collected. That isn’t how our systems are set up.
Jane: You’ve talked about the value of citizen science projects such as Galaxy Zoo. How do you think these kinds of projects could make a case for preservation of data?
Brian: Citizen science, at its best, serves as outreach/education and the advancement of science simultaneously. We need to be careful that citizen science projects are doing scientifically useful work with the hours and effort people are putting in. Ideally, we can leverage the work people put into these kinds of projects to calibrate algorithms and double the value of their efforts. Given the immense data challenges facing astronomy and other sciences, and the potential for citizen science projects to bring the public in to help us make sense of this data, I think we are entering a brave new information world. At this point, we need library and information science to become a lot bolder to stay relevant. There are huge opportunities to do great things in this area. I think timidity is likely the biggest threat to the future role that libraries, archives and museums could play in sciences like astronomy.
“Bamboo is porous,” said Grigar. “It can absorb the paint. So my mother compensated by using very thick paint and very thick brushes to get the paint to stay on the surface.” Grigar’s mother fiddled with various materials and techniques until she figured out what worked and what did not. Within the constraints of the bamboo surface she created a lovely work of art.
Grigar tells that story to illustrate how artists can still create even when using material that is unfamiliar to them. And she should know. Like her mother, Grigar is an artist. She is also director and associate professor of the Creative Media & Digital Culture Program at Washington State University Vancouver. The medium to which she devotes herself is electronic literature, or eLit, particularly works from the period from the mid-1980s to the late 1990s.
During that period, personal computers proliferated and experimental artists were drawn to the ones with graphical user interfaces (as opposed to text-based command line screens) and interactive multimedia. Artists were lured to computers despite the unfamiliar material…or maybe because of it.
The new generation of personal computers in the 1980s, particularly Macintoshes, were a pleasure to use and play with, much like modern smart phones. Macs were not dry, business-only machines. There were no command lines to memorize, no “under the hood” technical details to fuss with. You simply turned Macs on and started playing. They invited play.
And artists did just that. They played. They explored. They tinkered. And from the palette of text, hyperlinks, audio and graphics arose, among other things, electronic literature.
The term “electronic literature” applies to works that are created on a computer and meant to be read and experienced on a computer. Grigar, a scholar and devotee of eLit, helped create the Electronic Literature Lab at Washington State University Vancouver, a lab in which to preserve and enjoy works of vintage electronic literature. The lab houses a collection of over 300 works of eLit — one of the largest collections in the world — and twenty-eight vintage Macintosh computers on which to run them. Each computer has its appropriate OS version and, for browser-based works, appropriate browser versions.
The ELL is never closed. Students with access rights can come and go at any time. Despite the age of the computers, they are all in good working condition. Grigar has someone who maintains the lab computers and keeps them tuned and running, and she uses a local computer-repair specialist for more serious technical issues.
In addition to preserving the software disks on which the works reside, the ELL backs up and preserves their software in a repository. In some cases, the ELL keeps a copy of the software on the computer on which the work is played rather than go through the whole re-installation process; on the older computers that could require loading several disks. For CD-based works, they make an ISO image backup copy.
The ELL has a searchable database to track all the works, the computers, operating systems and software requirements. If a user wants to view a work, he or she searches for it and, according to its requirements, locates which lab computer to use.
All of the electronic literature works at the ELL share one common element: they deviate from traditional literature. Unlike paper-bound literature with sequentially numbered pages and a beginning, middle and end, many works of eLit do not read linearly. There are underlying decision trees that enable users to decide where they want to go next; the experience is chunked into scene-like elements and it is up to the user which element to navigate to next. Navigation is often left to chance. In fact, the decision-making processes that are standard in many games today have their roots in vintage eLit. (Think of first-person shooters and multi-player adventure games, the “where can I go and what are my options?” games.)
In vintage eLit, a work that was rich in content pushed the limits of the computers of the day: the richer the content, the slower the computer ran. One of the challenges the artist faced was to see how much she or he could pack into a piece.
“One of the coolest things about working with these early pieces from, say, StorySpace,” said Grigar, “is that when you put the 3 1/2 inch floppy in and as the work was loading, you got a little dialog box that said ‘This work has 2000 nodes and has 1600 links’ and you’re watching each link load, one at a time. Part of the excitement was seeing how many nodes and how many links there were and how big and intricate the work was.”
Grigar is dedicated to preserving the experience of each work as the author or artist originally intended it, under the same physical conditions as when you would have experienced it when it was first released. That includes experiencing the sluggishness and snags of the technology. Not only are the works historically and culturally significant, their limitations and affordances are too.
“All of the quirks, all of the glitches, all of the constraints are obvious to you,” said Grigar. “And it was kind of a badge of honor to artists that you did this much work. It’s like handing someone James Joyce’s Ulysses as opposed to handing them a forty-page article. It’s like ‘This is my novel. See how big it is? See how many nodes there are? See how many hyperlinks I had to make?’
“When you put all this on an emulator, all of those differences collapse. The slowness and glitchiness was part of the beauty of the work…I’m not convinced that emulators can capture a lot of that experience and the wonder of how things actually moved.”
The computers in the ELL are arranged in chronological order to demonstrate the evolution of the art form. For example, beginning in 1983, you can see that artists created works in grayscale and ASCII characters. In time, computers acquired a palette of 256 colors, which spawned a different stage of creativity. Then came thousands of colors and another stage of creativity.
“The palette just kept getting bigger,” said Grigar. “And so they go crazy with that and have fun with that. CDs like the Voyager piece ‘Shining Flower’ — it’s just exquisite. It’s just amazing. You could tear up, it’s just that gorgeous.”
In the earliest works of eLit, artists coordinated words with audio and graphics. As the technology evolved and artists could include motion pictures, the storytelling blurred the lines between literature, animation and movies. Still, no matter how much artists stretched the genres, vintage eLit works were still limited by the computer keyboard and mouse.
Newer works of interactive media, or participatory media, reach for other methods of interactivity. For example, “The Breathing Wall,” by Kate Pullinger, responds to the user’s rate of breathing, not the clicking of a mouse. And new advances in augmented reality enable interactivity with software without directly touching — even if only with your breath — any hardware objects. In some game systems and art installations users can interact with software through gestures and eye movements. Artistic expressions of human/computer interaction will clearly continue to evolve along with technology.
For now, Grigar is focused on protecting vintage electronic literature. She does not assume that the machines and software of vintage eLit will always be available, so she and hypertext author Stuart Moulthrop created Pathfinders, which demonstrates the user experience through video recordings of the artist and users reading works of early eLit.
“We have the authors perform their work on the computers and we videotape it,” said Grigar. “And the video will be archived for posterity so that one day when there are no more Macintoshes from 1983, we will at least have the video. It is better than just an emulator, because you can see the work unfold and have the author talking.”

In April 2013, Grigar, along with colleague Kathi Inman Berens and eight of Grigar’s students, presented the Electronic Literature Showcase at the Library of Congress. She brought several Macintoshes with her (she has extra vintage Macs as well as extra copies of software) to demonstrate some notable works of eLit, including a Mac Classic on which to show Shelley Jackson’s “Patchwork Girl” and Michael Joyce’s “Afternoon, A Story.” She also brought along a G3 iMac on which to run her original copy of “Myst.”
The ELL is one of several labs dedicated to the preservation of vintage multimedia. Others include the Media Archaeology Lab, The Trope Tank and especially the Maryland Institute for Technology in the Humanities.
Preservation and access are equally important in the curation of electronic literature. Grigar and her colleagues are committed not only to preserving vintage works of digital humanities — the software — but also to maintaining access to them, keeping the machines running and encouraging people to experience each work in its native technological context.
Grigar said, “What drives my research is how artists use the medium and the platforms and all the things to their advantage and work through the constraints so that the constraints do not look like weaknesses but actually are part of the beautiful aspect of the work.”
1. Open Source: Previous research undertaken by the Digital Curator indicated that the implementation of an open source digital repository would not be feasible due to the investment and expertise required.
2. Out of the Box (recommended option): Preservica scored very highly and also proved to be the most cost effective solution based on initial calculations. Other out of the box solutions were considered such as Ex Libris Rosetta, but the cost of implementing this system in-house was prohibitive.
3. Hybrid: The combination of using the OAIS-compliant Archivematica in conjunction with bit-level preservation provided by Arkivum was considered. However, the combination of these two solutions was not as comprehensive or cost effective as an out of the box solution.

Once the recommended option was decided, it was a case of using the guidance of the Digital Preservation Business Case Toolkit to form the final business case. What resulted was a straight-to-the-point and clear justification based on expert knowledge, which was presented internally to key stakeholders within NE.

Lessons Learnt

- There is no one-size-fits-all solution!
- Much of what you conclude will be based on your own organisational context, which can influence the right approach towards digital preservation. However, it is hoped that this project can establish a methodology which other small to medium organisations can adopt.
- Aligning organisational goals from the outset will save you a great deal of work further down the line. By identifying these key drivers you can begin to build up support for your recommended solution before the big pitch to senior management.
- There are a number of fantastic resources out there which can save you from reinventing the wheel. The first and most obvious point of contact is the new Digital Preservation Business Case Toolkit, a fantastic resource that includes everything you need to get started.
- Nail down upfront costs for at least the first three years. After all, you want a solution which can be sustained into the future. Alongside any costs, include benefits and any potential returns on investment which can be identified.
Preservation Topics: SPRUCE
I increasingly deal with vintage hardware. Why? Because we have vintage media in our collections that we need to read to make preservation and access copies of the files stored on them.
I spend a lot of time thinking about hardware that I have interacted with and managed over the years. Some of it was innovative and exhibited remarkable adaptive uses, yet is sadly forgotten.
I cannot leave out the Telex, one of the earliest technologies to have a lasting effect on our practices today. Telex was a networked telecommunications and teleprinting service dating from 1933.
In the same vein, I think every archival professional knows something about the Memex, proposed by Vannevar Bush in his article “As We May Think” in The Atlantic in July, 1945. While this posited the use of early hypertext navigation, it was an interface to static microfilm.
The compact cassette – yes the cassette tape of our youthful mix tapes – was used for data storage on home computers in the 1970s and 80s. I most vividly remember friends in high school seeking out cassettes with clear leaders for the loading of software and data on TRS-80 home computers. Interestingly, cassettes may be coming back in a revived form as a storage medium.
I remember some of the early word processors, but one of the most interesting appears to be the DECmate from 1977, a PDP-8 compatible _desktop_ computer running word processing software, meant, according to its advertising, for “office workers.”
My colleague Jimi Jones quipped that any technology that ends in the word “-vision” should be on my list. Cartrivision analog video cassettes for consumer film distribution and for recording from 1972. The Polavision instant movie camera from 1977. The Magnavox Magnavision laserdisc player from 1978. The SelectaVision Capacitance Electronic Disc video disc player from 1981. The Fisher-Price PixelVision camera (with cassette storage) from 1987.
While writing this post I was introduced by my colleague Jerry McDonough to the short-lived Vectrex. A color, true 3D vector graphics display for home gaming in 1982. And gone from the market in 1984.
How about the GRiD Compass laptop from 1982? Rugged, with a graphical interface. It used bubble memory, very high capacity non-volatile memory for its day. And it was the first laptop to go into space. In 1991 GRiD introduced the GRiDPad SL, one of the first pen-based Tablets.
While writing this, a friend introduced me to the DECtalk, a text-to-speech synthesizer from 1984. It could work as an interface to an email system and had the capability to function as an alerting system by interacting with phone systems via touch tones.
While I never used one, I was fascinated by the description of the development of the Thunderscan, a hardware adapter with accompanying software to turn an Apple ImageWriter into a scanner, which hit the market in 1984.
I worked with a Sony Mavica digital camera in the mid-1980s. Yes, digital. While the first version recorded an analog signal, later versions, such as the one I worked with, were digital and wrote onto floppy disks.
How many people remember the NeXT, introduced in 1988? Perhaps it’s not fair to list this under hardware, because its OS, its OpenStep object-oriented development tools and its WebObjects web application development framework were just as influential as the hardware, if not more so. It was one of the earliest high-end workstations aimed at the scientific and higher education computer simulation market, with fast chips, a lot of memory for the time and magneto-optical storage, and it was truly WYSIWYG for layout and printing. You might remember that the first web browser was written by Sir Tim Berners-Lee on a NeXT, which also served as the first web server…
The QuickCam was one of the first widespread consumer webcam devices in 1994, although neither the web nor videoconferencing were ubiquitous yet. (Tangentially, my first real experience with videoconferencing was a job interview in late 1995).
In 1996 the Palm personal digital assistant appeared on the market. I had four different models over the years. It was one of the earliest devices to support syncing of email and calendars on both Windows and Mac systems, and it had a touchscreen for gestural writing capture using its Graffiti writing system. Of course it owed a huge debt to the Apple Newton from 1993, with its Notes, Names and Dates applications and other productivity tools, and its true handwriting recognition. I was also reminded by a colleague about the Sharp Wizard from 1989, with a memo pad, calendar and scheduling with alarms and repeating events, and a calculator. I had completely forgotten that I once had one of these in my household when they were new.
I will end with one of my sentimental favorites. In 2000 I received an odd little box in the mail as part of my Wired magazine subscription. That box contained a CueCat, a home barcode scanner. It was meant to plug into home computers to read barcodes in print magazines to take you to targeted web sites. It was described by PC World in 2006 as “One of the 25 Worst Tech Products of All Time.” Now of course we all have barcode readers in our phones to interact with barcodes and QR codes everywhere. I still have my CueCat and the Wired box. And there are home library cataloging tools to this day that can still work with them.
What are your favorite forgotten innovations in hardware?
Back in May, after an enjoyable trip to the University of Leeds, I worked for a month on improving the Harvard Library’s FITS tool for combining the results of several file format identification and validation tools. The results were well received and the Harvard Library incorporated some of my work in the main line of FITS. Still, there were a lot of loose ends left and more work to be done.
Things are picking up again with a “FITS Blitz” that’s starting this week. Paul Wheatley writes that “in partnership with Harvard and the Open Planets Foundation (with support from Creative Pragmatics), SPRUCE is supporting a two week project to get the technical infrastructure in place to make FITS genuinely maintainable by the community. ‘FITS Blitz’ will merge the existing code branches and establish a comprehensive testing setup so that further code developments only find their way in when there is confidence that other bits of functionality haven’t been damaged by the changes.”
I’ve moved on to other things, so I won’t be able to participate, but I wish them every success.
Tagged: FITS, Harvard, Open Planets Foundation, software
The November 2013 Library of Congress Digital Preservation Newsletter is now available!
- Digital Preservation Pioneer: Sam Brylawski
- Welcome NDSR Inaugural Class!
- New Report: Preserving.exe
- Digital Portals to State and Community History
- NDSA Report on Geospatial Data
- Lists of Upcoming events and educational courses
- Interviews with Edward McCain and Emily Gore
- Articles on personal digital archiving, meetings reports, new resources, and more
Subscribe directly here, and get the newsletter automatically every month!
The following is a guest post by Philip Ardery, the newest member of the Library’s Web Archiving team.
I can trace my interest in computers and technology back to a single factor of my childhood: my family’s perpetually faulty home internet connection. While my multitude of siblings continually cursed and physically writhed over the frequent network disconnects, my parents stood by powerless, not even aware of how to turn our computer on—though they were swift to realize that they could unplug the machine to turn it off. I quickly learned that, in order to end the madness, I had to figure out how to fix the thing myself.
Flashing forward a decade or so, it makes perfect sense that I found myself fresh out of college employed as a technical support analyst. But, if you back up a year or two, the logic begins to fail.
In 2010 I graduated from Kenyon College with a Bachelor of Arts degree in English. Despite having a continued interest in technology, the thought of pursuing a computer science degree had not even crossed my mind. After graduating, however, and stepping out into the real world with my stylish yet less-than-accommodating liberal arts degree, I began kicking myself for not considering a more dynamic and practical degree four years earlier. Nonetheless, my natural inclination for technical problem solving eventually resurfaced as I began learning about and enjoying computers again through my employment as a support analyst at FICO, a job that afforded me a wonderful crash course in Unix-based operating systems.
My new role as an Information Technology Specialist with the Web Archiving team of the Library’s Office of Strategic Initiatives is an ideal opportunity for me. Not only does it appeal to both my literary background and my love of technology, but it also incorporates my third most notable life passion: the internet! Despite some of its more questionable quirks, I firmly believe that history will look back on the internet in an ultimately favorable light, as one of mankind’s greatest inventions. Consequently, I am ecstatic about this opportunity to work with some of the most influential leaders in the internet archiving community and to contribute my part to this outstanding Library of Congress initiative.
As the newest member of the Web Archiving team, I will focus on supporting large data transfers relating to the Library’s various collections of archived web content, contributing to the greater internet archiving community’s expanding standard of best practices, and refining internal procedures to accomplish the team’s long-term goals more effectively and efficiently, while simultaneously providing a wide range of general troubleshooting support as needed. I greatly look forward to the challenges ahead of me and am eager to learn, contribute and accomplish as much as I can in this outstanding work environment. I invite all of you to introduce yourselves and let me know if I can help you with anything!
Those of us in the “cultural heritage” sector get used to being at the end of the line sometimes. With very few exceptions, the unique items that end up in our collections usually get here after all their primary value has been extracted.
While we’d love to have a more regularized path for the treasures to get here, it’s actually to our benefit that creators and intermediaries have such strong incentives to steward and properly preserve their digital materials.
This is especially true in the music industry, where artists and record labels are still struggling to turn their digital art into gold. Digital music files are valuable cultural artifacts in their own right, but before they become “artifacts” they’re valuable assets that need to be managed for the long term in order to sustain their earning potential.
There are tremendous opportunities for the cultural heritage community to leverage existing digital music workflows and to engage with the music community to implement digital stewardship processes for the benefit of all.
The best way to do this is to tap into existing initiatives and processes for managing digital music data. Nothing is currently hotter in the technical side of the music biz than discussions on metadata. For example, a new Recording Academy initiative called “Give Fans the Credit” is an effort to brainstorm ways to deliver more robust crediting information on digital music platforms.
The preservation benefits of rich metadata have long been apparent to NDIIPP. Metadata projects made up a number of 2007’s Preserving Creative America projects, including the “Metadata Schema Development for Recorded Sound” project, which focused on creating a standardized approach for gathering and managing metadata for recorded music and developing software models to assist creators and owners in collecting the data. The project ultimately developed the Content Creator Data tool, an open-source application that captures metadata at the inception of the recording process.
The NDIIPP connections with the music industry don’t stop there. John Spencer of the MSDRS project, a current member of the National Digital Stewardship Alliance Coordinating Committee, is also a participant in the Music Business Association’s Digital Asset Management Workgroup. The workgroup is co-chaired by Paul Jessop, a former chief technology officer for the RIAA, and Maureen Droney of the Recording Academy, Producers and Engineers Wing, who joined us for a conference panel a couple of years ago.
“I think there are two important on-going efforts that the music community is beginning to embrace,” said Spencer in a recent exchange. “One is that artists and performers are beginning to understand the importance of unique identifiers to define their ‘digital presence’ related to musical works. With the need to further automate the collection of royalties because of new delivery technologies, getting artists and performers to understand the importance of these identifiers is a place where the digital stewardship folks could help by showing examples of how they have implemented identifiers in their given space.”
The DAM group is working to “coordinate and standardize all non-recorded music assets relevant to the digital music value chain, such as artist images, credits, liner notes, archival assets, and more.” To that end, they spearheaded last year’s publication of “MetaMillions: Turning Bits Into Bucks for the Music Industry Through the Standardization of Archival and Contextual Metadata.” The paper looks at the current state of metadata collection and curation in the music industry and explores how the data is being shared at each stage of the lifecycle, with an emphasis on showcasing the sales and marketing rationale for a more standardized metadata framework.
The Producers and Engineers Wing will soon release an update of the “Recommendation for Delivery of Recorded Music Projects” (PDF). This report “specifies the physical deliverables that are the foundation of the creative process” and “recommends reliable backup, delivery and archiving methodologies for current audio technologies, which should ensure that music will be completely and reliably recoverable and protected from damage, obsolescence and loss.”
More recently, the Music Business Association has hosted a Music Industry Metadata Summit and is working to expand the uptake of work being done by the Digital Data Exchange, a not-for-profit organization creating standards for the transmission of metadata between systems along the music supply chain. DDEX has established a working group focused on studio metadata, chaired by the aforementioned Mr. Spencer, with the release of specifications still to be determined (though we should note that they have already published a wide variety of other standards and recommendations).
This intense focus on metadata by the creation and intermediary management ends of the music industry should provide immense benefit to stewarding institutions once they ultimately take possession of the materials. Still, there are aspects of stewardship that may not be addressed by the current metadata efforts on the creation side, and the input of stewardship professionals could add lots of value.
So what are the most effective ways for the cultural heritage community to engage with the music community?
“Currently, I believe DDEX is a key piece of the puzzle,” said Droney in a recent conversation, “as it is the only organization working on actual standards for music business metadata. Standardization of the collection and transmission of recording studio metadata is the goal. In the meantime, educating the music community about best practices, both for the collection of credits and other technical and descriptive information, and for the short- and long-term archiving of masters, are important first steps. Also of note, the Audio Engineering Society has taken a serious interest in the National Recording Preservation Plan (PDF), and at the recent AES convention in NYC there were a number of tracks related to audio archiving and preservation that were inspired by the Plan.”
The following is a guest post by Leah Weinryb-Grohsgal, program officer in the Division of Preservation and Access at the National Endowment for the Humanities.

The National Endowment for the Humanities is now accepting proposals for the National Digital Newspaper Program. The National Digital Newspaper Program is a partnership between NEH and the Library of Congress to develop a searchable database of historically significant newspapers published in the United States. The Library of Congress hosts the site for this project at Chronicling America, a collection of information and digitized newspapers published in the U.S. and its territories between 1836 and 1922, available on the web for anyone to use. The collection can now accept not only English titles, but Spanish, French, Danish, German, Hungarian, Italian, Norwegian, Portuguese and Swedish publications as well.
Each year, NEH and the Library of Congress seek to add more historic newspapers to the site, which currently includes more than 6.6 million pages and 1,100 titles. Each award is made in the form of a cooperative agreement that establishes a partnership between NEH and the applicant institution, with technical support provided by the Library of Congress. Awards support 2-year projects to digitize 100,000 newspaper pages from a state, primarily from microfilm negatives. A list of the 36 institutions currently participating in the National Digital Newspaper Program may be found at http://www.loc.gov/ndnp/awards/.
NEH hopes eventually to support projects in all states and U.S. territories. One organization within each U.S. state or territory will receive an award to collaborate with state partners. Previously funded projects are eligible to receive supplementary awards for continued work, but the program will give priority to new projects, especially those from states and territories that have not received NDNP funding in the past. New applicants are welcome to propose projects involving collaboration with previous partners, which might involve an experienced institution managing the creation and delivery of digital files, consulting on the project or providing formal training to the project staff of a new institution.
NDNP projects focus on:
- Selecting newspaper titles to be digitized and analyzing available microfilm for optimal scanning
- Digitizing page images from microfilm, preparing optical character recognition files, and creating relevant metadata
- Delivering files and metadata to the Library of Congress in conformity with technical guidelines
- Updating bibliographic records of digitized titles in WorldCat
- Identifying free open access newspapers in the state or territory for inclusion in the Chronicling America newspaper directory
Proposals are now being accepted from institutions wishing to participate in the National Digital Newspaper Program. For more information, please visit the program’s funding page at http://www.neh.gov/divisions/preservation/national-digital-newspaper-program and the technical guidelines at http://www.loc.gov/ndnp. Application guidelines may be found at http://www.neh.gov/grants/preservation/national-digital-newspaper-program.
Applications are due January 15, 2014.
The following is a guest post by Nicholas Taylor, Web Archiving Service Manager for Stanford University Libraries.
I’m inclined to blame the semantic flexibility of the word “archive” for the fact that someone with no previous exposure to web archives might variously suppose that they are: the result of saving web pages from the browser, institutions acting as repositories for web resources, a navigational feature of some websites allowing for browsing of past content, online storage platforms imagined to be more durable than the web itself, or, simply, “the Wayback Machine.” For as many policies and practices guide cultural heritage institutions’ approaches to web archiving, however, the “web archives” that they create and preserve are remarkably consistent. What are web archives, exactly?
At the most basic level, web archives are one of two closely-related container file formats for web content: the Web ARchive Container format or its precursor, the ARchive Container format. A quick perusal of the data formats used by the international web archiving community shows a strong predominance of WARC and/or ARC. The ratification of WARC as an ISO standard in 2009 made it an even more attractive preservation format, though both WARC and ARC had been de facto standards since well before then. First used in 1996, the ARC format is more specifically described by the Sustainability of Digital Formats website as the “Internet Archive ARC file format”, a testament both to the outsized contribution of the Internet Archive to the web archiving field as well as the recentness of the community’s broadening membership.
This extensive technical metadata is what distinguishes a web archive from, say, a copy of a web page. Aside from testifying to the provenance and facilitating temporal browsing of the archived data, the variety and ubiquity of record headers also creates intriguing opportunities for metadata extraction and analysis.
If you want to see for yourself, an appendix to the draft WARC specification contains examples of each of the WARC record types, including archived resources. Internet Archive also provides a set of test WARC files for download. Since even archived binary data is stored as (Base64-encoded) ASCII text, the files are surprisingly legible once unzipped and opened in a text editor. It’s not as seamless a way to navigate the past web as, say, Wayback Machine or Memento, but it will give a deeper understanding of the well-considered and widely-used data structure that makes those technologies work.
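For readers who would rather poke at WARC records programmatically than in a text editor, here is a minimal sketch using the open-source warcio Python library (not mentioned above, and just one of several tools that can read the format):

```python
from warcio.archiveiterator import ArchiveIterator

# Iterate over the records in a (possibly gzipped) WARC file and print a few
# of the header fields that make temporal browsing of the archive possible.
with open('example.warc.gz', 'rb') as stream:   # hypothetical file name
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            uri = record.rec_headers.get_header('WARC-Target-URI')
            date = record.rec_headers.get_header('WARC-Date')
            print(date, uri)
```

Pointing a loop like this at one of the Internet Archive’s test WARC files is a quick way to see the record types and headers described in the specification.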
Connecting Communities: FADGI Still Image Working Group’s Impact on the Library of Congress and Beyond
The following is a guest post from Carla Miller of the Library of Congress. This is the second in a two-part update on the recent activities of the Federal Agencies Digitization Guidelines Initiative. This article describes the work of the Still Image Working Group. The first article describes the work of the Audio-Visual Working Group.
While attending a Federal Agencies Digitization Guidelines Initiative Still Image Working Group meeting earlier this summer, I suddenly saw everything come together. What I mean by that is I realized how the digital preservation work performed by my team at the Library of Congress intersects and relates to the work being performed by other divisions within the Library as well as other government agencies.
Participants at the meeting came from multiple agencies throughout the federal government and from various divisions within the Library of Congress. Participants included:
- National Archives and Records Administration
- National Anthropological Archives at the Smithsonian
- National Museum of Health and Medicine
- National Oceanic and Atmospheric Administration
- Government Printing Office
- National Agricultural Library
- National Gallery of Art
- Library of Congress
At the Library of Congress, Dr. Lei He is an imaging scientist who is currently researching the effects of compression on digital images. Dr. He also uses quantitative methods to analyze “edges” found in images. “Edges” are naturally occurring high contrast areas of photographs that can be used to determine what resolution is needed for digitization. Dr. He’s research is already improving the processes at the Library of Congress. Similar analyses done on the Farm Security Administration photo collection at the Library determined a higher scanning resolution was required for groups of negatives in the collection. This determination was especially significant because many historic negatives are deteriorating, which means this may be the last chance to digitize them for preservation and access.
Another type of research and testing is being done by Don Williams of Image Science Associates, an expert consultant for the Library of Congress. Don works with Steve Puglia and Dr. Lei He at the Library to develop software and image targets for assessing image performance. The software is known as DICE (Digital Image Conformance Evaluation), and using targets it analyzes the quality of the actual image capture to help determine both if the product quality expected is occurring and if that quality is consistent throughout the workflow. One important aspect of the DICE targets is that they are produced with spectrally neutral gray patches; many neutral patches on color/grayscale targets are not. A spectrally neutral target for transmissive materials (think photographic negatives rather than printed photos) is also in development.
The Library of Congress uses the DICE targets to test scanning equipment and to verify output quality. The DICE software is also used in quality assurance and quality control testing for digitization projects funded by the Library. This testing and analysis assures consistent quality across projects. It also ensures that the final product will be as true to the original as possible, an aspect that is often important for users of the Library’s digitized collections.
In a joint effort with the Government Printing Office and the National Archives and Records Administration, Library staff members have developed a matrix of file format comparisons. Five formats for still images were chosen for analysis: PNG, TIFF, JPEG, JPEG 2000 and PDF. The group compared sustainability and cost factors for implementation and storage. The final draft of this document will be available for public comment on the FADGI site within the next couple of weeks.
The research work being done at the Library benefits other Federal agencies as well. In fact, the entire purpose of FADGI is for Federal agencies to collaborate and share information and best practices on digitizing our various collections and records. Some examples of these collaborations were shared at our most recent meeting: Don Williams will be working with the National Anthropological Archives at the Smithsonian on the digitization of endangered manuscript materials. The Smithsonian will work with the Library on standardized language we use in contracts requiring the use of DICE targets as an objective measurement of scanning devices. And in a general sense, the research we do often informs the development of policies, protocols and workflows throughout the Library and various other agencies.
Information Today recently published Personal Archiving: Preserving Our Digital Heritage, a collection of essays written by some of the leading practitioners, thinkers and researchers in the emerging field of personal digital archiving. We are honored that Information Today — and especially the book’s editor, Donald Hawkins — asked us to share our resources and experiences by contributing an essay to the book.
The term “personal digital archiving” can be interpreted in different ways, but I think it generally applies to digital preservation at the individual level as opposed to the institutional level. I say that the term “generally applies” because the concept of personal can be slippery to define.
Personal digital archiving could equally apply to individuals interested in securely saving their digital photos, families sharing and archiving all manner of born-digital and digitized memorabilia, local history and genealogy groups trying to deal with the increasing influx of digital material, public libraries acquiring non-commercial digital collections from the communities they serve and academics taking responsibility for the preservation of their digital professional works. So, for Personal Archiving: Preserving Our Digital Heritage, editor Donald Hawkins chose authors with a range of backgrounds and interests.
Summarizing the book might not do it justice, so here’s a quick look at the contents.
Brewster Kahle, visionary founder of the Internet Archive, wrote the introduction, and he addresses personal digital archiving as an emerging societal phenomenon. “Excitement is growing as researchers learn from one another and welcome the type of sharing culture that comes before commercial players enter a field,” said Kahle.
Jeff Ubois, the founder of the annual Personal Digital Archiving conference, gives the informed, high-level view in his essay, “Personal Archives: What They Are, What They Could Be and Why They Matter.”
Danielle Conklin wrote, “Personal Archiving for Individuals and Families,” in which she examines the approaches that four different individuals take to their personal digital archiving projects.
I wrote “The Library of Congress and Personal Digital Archiving,” which summarizes the Library of Congress’s efforts to date: our print, video and audio resources; our outreach events and educational presentations to the general public and our collaboration with the Public Library Association to spread awareness of personal digital archiving resources into local communities. The essay also covers our general step-by-step advice for preserving personal digital valuables.
Editor Donald Hawkins wrote, “Software and Services for Personal Archiving,” in which he assesses media collection systems for photos and documents, notes, email archives and home movies and videos.
Evan Carroll, one of the leading experts in the complexity of digital-age estate planning, wrote “Digital Inheritance: Tackling the Legal Issues.”
Catherine Marshall, of Microsoft Research, wrote “Social Media, Personal Data and Reusing Our Digital Legacy.” Marshall specializes in objective research into what people actually do or don’t do with their digital stuff — human nature versus best practices.
Jason Zallinger, Nathan Freier and Ben Shneiderman co-wrote, “Reading Ben Shneiderman’s Email: Identifying Narrative Elements in Email Archives,” in which they analyzed 45,000 of Shneiderman’s emails for narrative elements.
Elisa Stern Cahoy wrote “Faculty Members as Archivists: Personal Archiving Practices in the Academic Environment.”
In “Landscape of Personal Digital Archiving Activities and Research,” author Sarah Kim goes into the kind of exhaustive detail that only a PhD candidate can.
Aaron Ximm wrote “Active Personal Archiving and the Internet Archive” in which he details how the Internet Archive is already a public resource for personal digital archiving and he suggests some futuristic possibilities for the IA in actively capturing and preserving networked personal histories.
In “Our Technology Heritage,” Richard Banks of Microsoft Research details his philosophic and scientific observations about the intersection of the material and digital worlds, and their implications for next-generation technology.
Donald Hawkins, Christopher Prom and Peter Chan write about three interesting research projects in “New Horizons in Personal Archiving: 1 Second Everyday, myKive and MUSE.”
And appropriately, the book concludes with an essay from Clifford Lynch, “The Future of Personal Digital Archiving: Defining the Research Agendas.” One of Lynch’s gifts is his ability to make sense of concepts like personal digital context within broader contexts — in the entire informational and cultural ecosystem — and extrapolate where things might evolve next. Lynch is one of academia’s great explainers.
- Travis compiles the projects and executes unit tests whenever a new commit is pushed to GitHub, or when a pull request is submitted to the project.
- Jenkins builds are generally scheduled once per day. After a build, the software has its code quality analysed by Sonar.
Complete details of how to build each non-Java project are contained within the .travis.yml files that are found in the project directories. As a side effect of this work the .travis.yml files can be used as instructions for independently building the projects.
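As a rough, hypothetical illustration of the kind of instructions such a file carries (this is a generic sketch, not the contents of any actual SCAPE project’s .travis.yml; the requirements.txt and setup.py names are assumptions), a minimal configuration for a Python-based tool might look like:

```yaml
language: python            # the project's implementation language
python:
  - "2.7"
install:
  - pip install -r requirements.txt    # hypothetical dependency list
script:
  - python setup.py test                # run the project's unit tests
```

Because the file spells out the dependencies and the test command, anyone can follow the same steps by hand to build the project outside of Travis.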
Matchbox, Xcorrsound and Jpylyzer have CI builds that are capable of generating an installable Debian package, which we are aiming to publish. Java projects have had their Maven GroupId and package names changed to the appropriate SCAPE names so we can publish binary snapshots.
The daily Maven snapshots of code built in Jenkins are now (or soon will be) published to https://oss.sonatype.org/content/repositories/snapshots/eu/scape-project/ and can be used by adding this parent declaration to your pom.xml:

    <parent>
      <groupId>org.sonatype.oss</groupId>
      <artifactId>oss-parent</artifactId>
      <version>7</version>
    </parent>
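If a project cannot (or would rather not) inherit from that parent, declaring the snapshot repository explicitly should also let Maven resolve the snapshots. A minimal sketch, in which the repository id is arbitrary:

```xml
<!-- Sketch: resolve SCAPE snapshot artifacts from the Sonatype OSS snapshot repository. -->
<repositories>
  <repository>
    <id>sonatype-oss-snapshots</id>
    <url>https://oss.sonatype.org/content/repositories/snapshots/</url>
    <releases>
      <enabled>false</enabled>
    </releases>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>
```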
What you can do for your project
- Maintain your .travis.yml file if project dependencies change
- Ensure code matches the SCAPE/OPF functional review criteria – correct Java package names and Maven GroupIds are essential to be able to publish snapshots
- Ensure your project has an up to date README that contains details of how to build and run your software (including dependencies)
- Very importantly, ensure that your project has (at the very least) a top-level LICENSE; ideally, each source file should also contain a license header
- Add unit tests for your project
- Ensure that the unit tests for your project can easily be run using standard dependencies. If the tests rely on your particular local installation to pass, Travis/Jenkins cannot run them successfully and they will show up as test failures; see the sketch after this list. While it might not always be possible to have unit tests that run independently, if there have to be test dependencies then please document how they should be set up!
- Check your project at http://projects.opf-labs.org/
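To make the point about self-contained tests concrete, here is a minimal JUnit sketch; the class name and fixture are purely illustrative. The test creates its own input in a temporary location, so it will pass on any Travis or Jenkins worker, whereas a test that reads a hard-coded local path will not.

```java
import static org.junit.Assert.assertEquals;

import java.io.File;
import java.nio.file.Files;

import org.junit.Test;

// Illustrative only: a unit test that supplies its own fixture data.
public class SelfContainedExampleTest {

    @Test
    public void newTempFileIsEmpty() throws Exception {
        // Good: the fixture is created by the test itself, so it exists on any CI worker.
        File fixture = Files.createTempFile("scape-ci-example", ".bin").toFile();
        fixture.deleteOnExit();
        assertEquals("A freshly created temp file should contain no bytes",
                0L, fixture.length());
    }

    // Avoid: new File("/home/alice/testdata/sample.bin") passes locally but
    // fails on Travis/Jenkins, where that path does not exist.
}
```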
The CI days take place roughly once a month. If you are interested in joining us, do let us know, as we could always do with more help. It’s an opportunity for you to work on CI with Travis/Jenkins, and to do other interesting (and rewarding) work, such as Debian packaging, that you might not normally get to work on.
One Format Does Not Fit All: FADGI Audio-Visual Working Group’s Diverse Approaches to Format Guidance
This is the first in a two-part update on the recent activities of the Federal Agencies Digitization Guidelines Initiative. This article describes the work of the Audio-Visual Working Group. The second article, to be published on November 4th 2013, describes the work of the Still Image Working Group.
I wish I had a quick and easy answer when colleagues ask what file format they should use to create and archive digital moving images. My response usually starts out with “well, it depends.” And indeed it does depend on a wide variety of factors: what they want to achieve with the file, what equipment and storage space are available, and whether they are reformatting old videotapes or creating new born-digital material. The list of considerations that can impact the decision goes on. As a community, our general rule is to “make the best file that you can afford to create and maintain,” but what makes one format better than another in a given situation? (BTW: in this context, the term file format is understood to mean both the file “wrapper,” e.g. mov, avi and mxf, and the encoding within the wrapper, e.g. uncompressed, H.264 and JPEG 2000.)
The Federal Agencies Digitization Guidelines Initiative Audio-Visual Working Group has active members from across the Library of Congress, including the Packard Campus for Audio-Visual Conservation and the American Folklife Center, as well as the National Archives and Records Administration, the Smithsonian Institution Archives and the National Oceanic and Atmospheric Administration, among others. It has four subgroups working on informative guidance products to help answer the age-old question, “what should I do with my moving image collections?”
The lead video effort, now in its third year, entails the development of a specification for the use of the MXF format, in effect a special profile of this wrapper tailored to serve preservation. The specification is dubbed AS-07; it is of general interest to the community and directly supports the work of the Packard Campus, where a version of MXF with JPEG 2000 picture encoding has been in use for several years. Everyone expects that publication of the AS-07 specification will increase adoption of the format. Meanwhile, however, there are other formatting options to consider, especially for smaller archives or for classes of content that are less complex than, say, the broadcast collections that are an important part of the Packard Campus holdings.
The Working Group’s interest in exploring this wider range of options has led to the formation of the Digitized Video subgroup, spearheaded by staff from the NARA’s Video Preservation Lab. Taking a lead from the work of the FADGI Still Image Working Group, this subgroup is building a matrix to compare target wrappers and encodings against a set list of criteria that come into play when reformatting analog videotapes. The evaluation attributes include format sustainability, system implementation, cost, and settings and capabilities. The matrix and companion documents will be available for review on the FADGI website in the coming months.
The just-off-the-ground Born-Digital Video subgroup, led by staff from the American Folklife Center at the Library of Congress, is taking a lifecycle approach to born-digital video by focusing on guiding principles. Through visual examples and case histories, the subgroup’s product will illustrate the cause and effect of the range of decisions to be made during the creation and archiving lifecycle of a born-digital video file. This work is geared toward both file creators (such as videographers and others who create new digital video files) and file archivists (such as librarians, archivists and others who receive files from creators and have to archive and/or distribute them). For file creators, we want to emphasize the advantages of starting with high-quality data capture. For file archivists, we want to explore options for identifying the composition of video files and evaluating their characteristics to better understand whether action is warranted and, if so, when it needs to be taken.
Both the Digitized and Born-Digital video subgroup efforts build on the useful 2011 report by George Blood for the Library of Congress titled Determining Suitable Digital Video Formats for Medium-term Storage. In addition, our format comparisons will support the ongoing work of the International Association of Sound and Audiovisual Archives as they draft a general guideline for video preservation.
Motion Picture Film Efforts
The Film Scanning subgroup, led by staff from NARA’s Motion Picture Preservation Lab, is addressing the issues of digitizing motion picture film. The first product from this group will be an outline of technical components to address when outsourcing film scanning to commercial vendors, with the goal of improving access. Other efforts, including the Academy of Motion Picture Arts and Sciences’ Academy Color Encoding System, are focused on improving archival master formats, but until those efforts are ready for prime time, the community is looking for guidance on interim solutions that take possible future uses into account.
Every year the Library of Congress hosts a meeting on Designing Storage Architectures for Digital Collections, aka the Preservation Storage Meeting. The 2013 meeting was held September 23-24, and featured an impressive array of presentations and discussions.
The theme this year was standards. The term applies not just to media or to hardware, but to interfaces as well. In preservation, it is the interfaces – the software and operating system mechanisms through which users and tools interact with stored files – that disappear the most quickly, or change the least to keep up with changing needs. The quote of the meeting for me was from Henry Newman of Instrumental, Inc.: “These are not new problems, only new engineers solving old problems.”
Library of Congress staff kicked off the meeting by discussing some of the Library’s infrastructure and needs. The Library has reached a point where 50% of the files in its storage systems are inventoried, so we know what we have, where it is and who it belongs to, and we have fixities for future auditing. We have a wide range of needs, though, which vary with the type of content. The data center where text and images are primarily stored holds multiple millions of files in tens of petabytes; the data center where video and audio are primarily stored holds about 700,000 files, also in tens of petabytes. The different scales of file numbers and sizes mean different requirements for the hardware needed to stage and deliver this content. Of the Library’s storage purchases, 70% go toward technology refresh and 30% toward capacity expansion.
Tape technologies are always a big topic at this meeting. T10K tape migration is ever ongoing. Interfaces to tape environments reach end-of-life and are unsupported within 5-10 years of their introduction, according to Dave Anderson of Seagate. According to Gary Decad of IBM, rates of areal density increases are slowing down, and the annual rate of petabytes of storage manufacturing is no longer increasing.
Tape is far and away the highest MSI (millions of square inches) of storage in production use. Tape, hard disk drive and solid state storage are all surface-area-intensive technologies. Many meeting participants believe that solid state improves on hard disk drive technology. Less obvious for preservation concerns is the impact of NAND flash storage on the use of hard drive storage: replacing enterprise hard disk drives with flash would be exorbitantly expensive, and is not happening any time soon.
Across the board, there must be technologies licensed to multiple manufacturers and suppliers for stability in the marketplace. But it is extraordinarily expensive to build fabrication facilities for newer technologies such as NAND flash storage. The same is true for LTO tape facilities, not so much for the expense of building the facilities as for the lack of profitability in manufacturing. After the presentations at this meeting I am more familiar than I was before with the licensing of storage standards to manufacturing companies, and with the monopolies that exist.
The panel on “The Cloud” engendered some of the liveliest discussion. Three quotes stood out. The first, from Andy Maltz of the Academy of Motion Picture Arts and Sciences: “Clouds are nice but sometimes it rains.” And from Fenella France at the Library of Congress: “I have conversations with people who say ‘It’s in the cloud.’ And where is that, I ask. The cloud is still on physical servers somewhere.” And Mike Thuman from Tessella, referencing his slides, said “Those bidirectional arrows between the cloud and local? They’re not based on Kryder’s Law or Moore’s Law, it’s based on Murphy’s Law. You will need to bring data back. ”
David Rosenthal of Stanford University pointed out some key topics:
- When is the cloud better than doing it yourself? When you have spiky demand and not steady use;
- The use of the cloud is the “Drug Dealer’s algorithm”: The first one is free, and it becomes hard to leave because of the download/exit/migration charges;
- The cloud is not a technology, it’s a business model. The technology is something you can use yourself.
Jeff Barr of Amazon commented, “I guess I am the official market-dominating drug dealer.” But Amazon very much wants to know from the community what it is looking for in a preservation action reporting system for files stored in the AWS environment.
The session on standards ranged from an introduction to NISO and the standards development process (with a wonderful slide deck based on clip art), to identifiers and file systems, and the specifics of an emerging standard: AXF.
A relatively new topic for this year’s meeting was the use of open source solutions, such as the range of component tools in OpenStack. HTTP-based REST is the up-and-coming interface for files – the technology is moving from file system-based interfaces to object-based interfaces. Everything now has a custom storage management layer from the vendor.
Other forms of media were also discussed. Two of the most innovative are a laser-engraved stainless steel tape in a hermetically sealed cartridge, and a visually recorded metal alloy medium. Optical media is also not dead. Ken Wood from Hitachi pointed out that 30-year-old commercial audio CDs are still supported in the hardware marketplace, and that those CDs still play. Technically, that has as much to do with the software interface and error correction still being in play as with the hardware still being supported. But mechanical compact disc players and storage are disappearing with the rise of mobile devices and thin laptops that have no optical drives or hard disks.
Presentations by representatives of the digital curation and preservation community always make up a large percentage of this meeting. Projects such as the Data Conservancy and efforts at Los Alamos National Laboratory, the National Endowment for the Humanities and the Library of Congress were featured. It was noted more than once that content and data creators still do not often feel that preservation is part of their responsibility. The key quote was “You can spend more time figuring out what to save than actually saving it. The cost of curation to assess for retention can be huge.”
You should really check out the agenda and presentations, which are available online.
If so, you are in luck – we have a publication on that very subject. “Perspectives on Personal Digital Archiving” was published and announced earlier this year, but I think it’s worth a reminder at this point, especially for those that may not have seen it yet.
Because we are generating more and more content about this topic on our blog, we compiled the relevant posts to make them easier to access, all in one place. Access being, of course, a crucial element in any digital preservation plan.
Many of our readers are already aware that personal digital archiving is, for better or worse, becoming a necessity in our time – that is, a time when more and more of our personal documents are in digital form. Anyone who owns a digital camera, for example, has probably figured this out. Remember those days of printed photos placed in photo albums or even stuffed into shoeboxes? Now, since these items are in digital form, so too must be the storage, and eventually preservation, of those items. You’d hate to see a treasured physical item become fragile and break apart, but many people are surprised to learn that items in digital form can be even more fragile.
There are many variables that can affect the preservation of a digital item – outdated equipment or software, inaccessible files, lack of backup, etc. But luckily, there are steps you can take to make sure your documents, photos and even email all survive the next five years and beyond.
The process isn’t particularly complicated – see our specific advice here for digital photos. But it does require some amount of focused effort to make sure that your treasured personal items are available for your own long-term enjoyment, as well as that of future generations.
“Perspectives on Personal Digital Archiving” contains much general information as well as interviews and step by step instructions, all of which can serve the novice as well as those who may already have some experience. You can access and download this publication (free of charge, of course) on our general personal digital archiving page as well as our publications page.
Here are the chapter headings along with a sample of what’s included:
Under “Personal Digital Archiving Guidance”
- Four Easy Tips to Preserving Your Digital Photographs
- Archiving Cell Phone Text Messages
- What Image Resolution Should I Use?
Under “Personal Reflections on Personal Digital Archiving”
- One Family’s Personal Digital Archiving Project
- Personal Archiving: Year End Boot Camp
- Forestalling Personal Digital Doom
Under “Personal Digital Archiving Outreach”
- Librarians Helping Their Community with Personal Digital Archiving
- What Do Teenagers Know About Digital Preservation? Actually, More Than You Think…
- The Challenge of Teaching Personal Archiving
…and many more.
In the meantime, we are continuing to publish more blog posts all the time on personal digital archiving and related events. There is also a section on our website devoted to information and resources on the subject.
As always, we welcome any feedback on this resource as well as your own stories and experiences with personal digital archiving.
Digital Stewardship and the Digital Public Library of America’s Approach: An Interview with Emily Gore
The following is a guest post by Anne Wootton, CEO of Pop Up Archive, National Digital Stewardship Alliance Innovation Working Group member and Knight News Challenge winner.
In this installment of the Insights Interviews series, a project of the Innovation Working Group of the National Digital Stewardship Alliance, I caught up with Emily Gore, Director for Content at the Digital Public Library of America.
Anne: The DPLA launched publicly in April 2013 — an impressive turnaround from the first planning meeting in 2010. Tell us how it came to be, and how you ended up in your role as content director.
Emily: I started building digital projects fairly early in my career, in the early 2000s, when I was an entry-level librarian at East Carolina University. In the past, I’ve worked on a lot of collaborative projects at the state level. In North Carolina and South Carolina, I worked on a number of either small-scale or large-scale statewide collaborations. I led a project in North Carolina for a little over a year called NC-ECHO (Exploring Cultural Heritage Online) and so have always been interested in what we can do together as opposed to what we can do individually or on an institutional level. Standards are important. When we create data at our local institutions, we need to be thinking about that data on a global level. We need to think about the power of our data getting reused instead of just building a project for every institution — which is where all of us started, frankly. We all started in that way. We thought about our own box first, and then we started thinking about the other boxes, right? I think now we’re beginning to think broader and more globally. It’s always been where my passion has been, in these collaborations, especially across libraries, archives, and museums.
I was involved in the DPLA work streams early on and saw the power and promise of what DPLA could be, and I jumped at the offer to lead the content development. At the time, I had taken an associate dean of libraries position and had been at Florida State for about a year, and it was a real struggle for me to think about leaving after only being somewhere for a year… but I think, I guess we have to take leaps in our life. So I took the leap, and you know, I think we’re doing some pretty cool things. We’ve come really far from when I started last September, really fast. I haven’t even been working on the project for a year yet and we’ve already aggregated millions of objects and we’re adding millions more.
I love all the energy around the project and that a lot of people are excited about it and want to contribute. One of the first projects I coordinated was with a local farm museum, dealing with the actual museum objects, and marrying those with the rich text materials we had in the library’s special collections. And telling a whole story — people being able to actually see those museum objects described in that text. I just saw the power of that kind of collaboration from early on and what it could be more than just kind of a static, each-one-of-us-building-our-own-little-online-presence. The concept of the DPLA has really been a dream for me, to take these collaborations that have been built on the statewide, regional and organizational levels and expand them.
Anne: There are ongoing efforts in lots of countries outside the United States to create national digital libraries, many of which have been underway since before the DPLA. Are there any particular examples you’ve looked to for inspiration?
Emily: Europeana, a multi-country aggregation in Europe, has been around for about five years now. We’ve learned quite a bit from them, and talked to them a lot during the planning phase. They have shared advice with us regarding things they might have done differently if given the opportunity to start again. One particularly valuable piece of advice has been not to be so focused on adding content to DPLA that we forget to nurture our partnerships and to work with our users. Of course, my job is largely focused on content and partnerships, but we really want to make sure that the data we are bringing in to DPLA is getting used, that there are avenues for reuse, that people are developing apps, that we continue to make sure the Github code is updated, and that everything is open and we promote that openness and take advantage of showing off apps that have been built, encouraging other people (through hackathons, for example) to build on what we’ve got.
Europeana has also done a lot of work building their data model, testing that data model and making it work with their partners. That’s been a huge help for us starting off, to take their data model and adapt it for our use. They’ve also held rights workshops — Europeana formed 12 standardized rights statements, starting with CC0 and various levels of Creative Commons licensing, down to rights restricted or rights unknown. We all need to work with our partners to help them understand their rights and their collections better, and to place appropriate rights on them. Most of the collections we see coming in are “contact so-and-so,” “rights reserved,” that kind of thing. This is largely because people are afraid or there is a lack of history regarding rights. We want to work with Europeana and our partners to clarify rights regarding reuse for our end users. Europeana has started to work with their partners on that, and we want to do that together, so that the rights statements are the same between organizations, and we promote interoperability in that way.
Anne: So much of the DPLA is based on state hubs and the relationships that existing institutions have with those state hubs. How much collaboration do you see among the states?
[For uninitiated readers: the DPLA Digital Hubs Program is building a national network of state/regional digital libraries and myriad large digital libraries in the US, with the goal of uniting digitized content from across the country into a single access point for end users and developers. The DPLA Service Hubs are state or regional digital libraries that aggregate information about digital objects from libraries, archives, museums and other cultural heritage institutions within their given state or region. Each Service Hub offers its state or regional partners a full menu of standardized digital services, including digitization, metadata, data aggregation and storage services, as well as locally hosted community outreach programs to bring users in contact with digital content of local relevance.]
Emily: When the DPLA working groups started to examine how we should go about getting content into the DPLA, I remember saying “We should build off of existing infrastructure, because these collaborative projects exist in many states.” They’ve been working with the local institutions for a number of years. So if we can start working with those institutions, then we can build a network and get content. Trust is so important. I think that the small institutions often trust the institution that’s been aggregating their content for a number of years, and they might not trust someone from the DPLA coming in and saying, “I want your content.”
The states work extremely well together. We have project leads and other relevant staff from each state or region; right now we’re working with five states and one region that covers multiple states. We come together to talk about issues that are relevant to all of the states. The models are very different. Some of them have centralized repositories where the metadata work, the digitization work, everything is done in one central place. They work with partners to help provide initial data and to get the actual objects, but then all the work is done centrally to enhance that metadata and do the digitization work. In other places it’s totally distributed. I’ll take South Carolina as an example. The three major universities in the state have regional scan centers, and they work with the people in their respective regions to get materials digitized, described and put online. They’ll accept contributions from institutions that have already digitized their content and provided metadata for it, and then they’ll take it into their regional repository, and the three regional repositories are linked together to form one feed. It’s wonderful to hear the exchanges among the hubs: “this is what works in our state, and here are the reasons why.” And they figure out, “Maybe we’ll try this, maybe this will work better to attract folks.”
Anne: Have the state hubs helped build relationships with small institutions? Or how has the DPLA mission and reputation preceded it in these communities?
Emily: In several of the regions, because of the participation of the DPLA, people who refused to partner before are actually saying, “I want my content exposed through the DPLA so can we partner with you?” Partnerships are expanding in the hub states/region as a result of this. I think being at the national level is really helping. I think a lot of [the state hubs] are trying to do outreach and education — they’re doing webinars, they’re talking to people in their state, they’re trying to educate people about what the DPLA is and what the possibilities are. And trying to alleviate fears, where possible. There’s a lot of fear. Even opening metadata, it’s been interesting to see what people’s reactions to that are sometimes. I guess in my mind, I never thought about metadata having any rights. These states have had a challenge explaining what a CC0 license really means for metadata. I think that that has been a hurdle, but most of them are overcoming it, and partners in general are OK with it once they understand the importance of open data. They’re explaining why it’s important, and they’re talking about linked data and the power of possibility in a linked open data (LOD) world, and that that’s only going to happen if data is open.
Anne: How do you effectively provide context for these 4,000,000+ digital records? How do you root a museum artifact in the daily life of that place, and how do you do it within a given state versus across states?
Emily: We’ve done exhibitions of some of the content in the DPLA so far. We have worked with our service hubs to build some initial exhibitions around topics of national interest. Our goal initially was for different states to work together to help provide data from multiple collections. That happened on a very small scale. Mostly the exhibitions were built with collections from their own institutions, largely because of the time constraints we were under to get the exhibitions launched. But also, it’s easier. You know the curator down the hall, and you can get permission to get the large-scale images that are needed to actually go in the exhibitions. We did have some exceptions to that; we had a couple of institutions work together and share images with the others. We hope to do more of that — we pulled out 40 or 50 themes of national significance that we could potentially build exhibitions around, and there are a number of institutions who want to build more. Right now we’re working on a proposal to actually work with public librarians in several states, to reach some of the small rural public libraries that may have collections that haven’t been exposed through the hubs, which would in turn help build some of these exhibitions at a national level. And those would be cross-state: local content feeding into national-level topics of interest. We’re also doing a pilot with a couple of library schools on exhibition building. We’ve given them the same themes, and they’re going to use content that already exists in the DPLA.
Anne: You mentioned hackathons and encouraging people to build things using the DPLA API. What are people building so far?
Emily: To date, I think there are approximately nine apps on the site. There is a cross-search between Europeana and the DPLA — a little widget app where you can search both at the same time and get results, which is awesome. That was built early on. Ed Summers built the DPLA map tool that automatically recognizes where you are so you can look at what DPLA content is available around you. The Open Pics app is iOS-based — you can search and find images around all the topics in the DPLA and use them on your phone. It’s pretty cool. Culture Collage is the same kind of app – it visualizes search results for images from the DPLA. StackLife is a way to visualize book material in a browsing way, the way you would actually browse the stacks in a library.
We also hope to continue to have hackathons. We’ve talked a little bit to Code for America and hope to get more plugged in to their community, we were involved in the National Day of Civic Hacking, and we’re hoping to continue to promote the fact that we have this open data API that people can interface with and use to build these cool apps. We really want to encourage more of that.
Anne: Explain your vision for the Scannebago mobile scanning units.
Emily: When I was working in North Carolina years ago, we did a really extensive collections care survey of all the cultural heritage institutions in the state of North Carolina — about 1,000 institutions.
That survey took five years and two or three different cars! We surveyed these cultural heritage institutions looking specifically at their collections care and the conditions that their collection were in, but also with an eye toward what might need to be preserved for the long term, what needs to be digitized and made available, what are their gem collections that we could essentially help them expose? We saw so many amazing collections that, without physically going to these institutions, you would never ever see. Take the Museum of the Cherokee Indian as an example.
There we discovered wonderful textiles and pottery and other collections that, unless you physically go there, you will likely never see. And of course, like most museums, they only display a small portion of their collection at any time. Otherwise the collections are in storage, and on shelves, and until they rotate those collections in you never see them. It’s not only in North Carolina where we find those examples — it’s everywhere. The ability to see those objects online, I think, is so powerful. And even to potentially tell that rich contextual story, build exhibitions around that, talk about the important history there — I think can be very powerful. But we know that it took a trust relationship for us to even go there and survey their collections. There had to be a trust relationship built, instead of, “Hi, we’re from the state government and we’re coming here to survey your collections.” Obviously that is not really what a lot of people want to hear. So [during the North Carolina survey] we worked with cultural heritage professionals who had existing trust relationships with institutions and they helped us forge our own relationships. In the end, most institutions were confident that we were indeed only there to survey the collections, and that we had good intentions to help get funding, to help preserve these collections for the long term.
We use that network a lot. We’re not going to get local content without the local people, without the connections, without the trust relationships that have already been built. These people aren’t going to let materials out of their building to be digitized. They’re not going to send them to a regional scan center, or a statewide scan center — they’re just not going to do that. They care about those objects so much — they represent their history, and in many cases they’re not going to let them out of their sight. We have to come to them — how do we do that? Some of these places are up these long winding mountain roads — how in the world do we get up here, and how in the world do we get equipment to them to get this done? That’s where I came up with the concept of a mobile digitization vehicle that I called a Scannebago, a Winnebago shell that we can build out with scanning/camera equipment to get to these rural and culturally rich institutions. That’s the concept.
People ask me about taking content directly into DPLA, and I think the importance is the sustaining of that content. Somebody has to be responsible for the long term maintenance of that content — and at this point, that’s not us. We’re aggregating that content, exposing that content for reuse, but we are not the long-term preserver of that content. And these small institutions are not the long-term preservers of that content either — that’s why the hubs model continues to be important. When we go out with the Scannebago, I still want that digital material to go to the hubs to be preserved for the long term. The Scannebago is another way to make content available with its appropriate metadata through the DPLA, but we really want to see the digital objects preserved and maintained for the long term at some level, and right now that’s through the hubs. It doesn’t have to be geography-based — hubs could be organized around media type or organization type. But right now, a lot of these relationships exist already based on geography, so it seems logical to continue to build out hubs by geography as we build out other potential collaboratives as well.
The Scannebago has always been a dream — I had really hoped when I was working at the state of North Carolina that we’d be able to do it on some level, and it just didn’t become a reality — but John Palfrey (Head of School at Phillips Academy, Andover and chair of the DPLA board of directors) heard about what I wanted to do and picked it up and was really excited about the potential of doing this. We’re drawing out a schematic of what it would look like. We might potentially launch a Kickstarter campaign to try to build one out in the future. We really want to at least pilot the concept. I would also love to do a documentary on it — I think the stories we’ll find when we actually get to these places are just as important to preserve as the content — the curators, the people who are looking over this stuff and how important it is. I get chills just thinking about it, but one step at a time. One step at a time.