The BlogForever platform is one of the major results of the BlogForever project. It is a simple digital archiving platform for weblogs, designed to preserve them and ensure their authenticity, integrity, completeness, usability, and long-term accessibility as a valuable cultural, social, and intellectual resource.
This release consists of the BlogForever repository and two blog spiders: a free version based on .NET and an open-source version written in Python.
BlogForever Repository Component source code
BlogForever Free Spider binaries
BlogForever OSS Spider source code
My previous blog post "Assessing file format risks: searching for Bigfoot?" resulted in some interesting feedback from a number of people. There was a particularly elaborate response from Ross Spencer, and I originally wanted to reply to it directly using the comment fields. However, my reply turned out rather lengthier than intended, so I decided to turn it into a separate blog entry.

Numbers first?
Ross's overall point is that we need the numbers first; he makes a plea for collecting more format-related data and attaching numbers to these. Although such data do not directly translate into risks, Ross argues that it might be possible to use them to address format risks at a later stage. This may look like a sensible approach at first glance, but on closer inspection there's a pretty fundamental problem, which I'll try to explain below. To avoid any confusion, I will be using "format risk" here in the sense of Graf & Gordea, which follows from the idea of "institutional obsolescence" (which is probably worth a blog post by itself, but I won't go into it here).

The risk model
Graf & Gordea define institutional obsolescence in terms of "the additional effort required to render a file beyond the capability of a regular PC setup in particular institution". Let's call this effort E. Now the aim is to arrive at an index that has some predictive power of E. Let's call this index RE. For the sake of the argument it doesn't matter how RE is defined precisely, but it's reasonable to assume it will be proportional to E (i.e. as the effort to render a file increases, so does the risk):
RE ∝ E
The next step is to find a way to estimate RE (the dependent variable) as a function of a set of potential predictor variables:
RE = f(S, P, C, ... )
where S = software count, P = popularity, C = complexity, and so on. To establish the predictor function we have two possibilities:
- use a statistical approach (e.g. multiple regression or something more sophisticated);
- use a conceptual model that is based on prior knowledge of how the predictor variables affect RE.
The first case (statistical approach) is only feasible if we have actual data on E. For the second case we also need observations on E, if only to be able to say anything about the model's ability to predict RE (verification).

No observed data on E!
Either way, the problem here is that there's an almost complete lack of any data on E. Although we may have a handful of isolated 'war stories', these don't even come close to the amount of data that would be needed to support any risk model, no matter whether it is purely statistical or based on an underlying conceptual model [1]. So how are we going to model a quantity for which we do not have any observed data in the first place? Or am I overlooking something here?
Looking at Ross's suggestions for collecting more data, all of the examples he provides fall into the category of potential (!) predictor variables. For instance, prompted by my observation on compression in PDF, Ross suggests analysing large collections of PDFs to establish patterns in the occurrence of various types of compression (and other features), and attaching numbers to them. Ross acknowledges that such numbers by themselves don't tell you whether PDF is "riskier" than another format, but he argues that:
once we've got them [the numbers], subject matter experts and maybe some of those mathematical types with far greater statistics capability than my own might be able to work with us to do something just a little bit clever with them.
Aside from the fact that it's debatable whether, in practical terms, the use of compression is really a risk (is there any evidence to back up this claim?), there's a more fundamental issue here. Bearing in mind that, ultimately, the thing we're really interested in here is E, how could collecting more data on potential predictor variables of E ever help here in the near absence of any actual data on E? No amount of clever maths or statistics can compensate for that! Meanwhile, ongoing work on the prediction of E mainly seems to be focused on the collection, aggregation and analysis of potential predictor variables (which is also illustrated by Ross's suggestions), even though the purpose of these efforts remains largely unclear.
Within this context I was quite intrigued by the grant proposal mentioned by Andrea Goethals which, from the description, looks like an actual (and quite possibly the first) attempt at the systematic collection of data on E (although, like Andy Jackson said here, I'm also wondering whether this may be too ambitious).

Obsolescence-related risks versus format instance risks
On a final note, Ross makes the following remark about the role of tools:
[W]ith tools such as Jpylyzer we have such powerful ways of measuring formats - and more and more should appear over time.
This is true to some extent, but a tool like jpylyzer only provides information on format instances (i.e. features of individual files); it doesn't say anything about preservation risks of the JP2 format in general. The same applies to tools that are able to detect features in individual PDF files that are risky from a long-term preservation point of view. Such risks affect file instances of current formats, and this is an area covered by the OPF File Format Risk Registry that is being developed within SCAPE (although it only covers a limited number of formats). They are largely unrelated to (institutional) format obsolescence, which is the domain addressed by FFMA. This distinction is important, because both types of risks need to be tackled in fundamentally different ways, using different tools, methods and data. Also, by not being clear about which risks are being addressed, we may end up not using our data in the best possible way. For example, Ross's suggestion on compression in PDF entails (if I'm understanding him correctly) the analysis of large volumes of PDFs in order to gather statistics on the use of different compression types. Since such statistics say little about individual file instances, a more practically useful approach might be to profile individual file instances for 'risky' features.
[1] As a side note, even conceptual models often need to be fine-tuned against observed data, which can make them pretty similar to statistically derived models.
One of the activities in the European project SCAPE is to create a catalogue of policy elements. At the last iPRES conference we explained our work, and you can read about it. During our activities we started collecting existing, published policies, and we have now put the current set on a wiki: http://wiki.opf-labs.org/display/SP/Published+Preservation+Policies. Looking at the results of your colleagues might help you create or finalize your own preservation policies. As I said during my presentation at iPRES 2013, there are far more organizations dealing with digital preservation than published preservation policies on the internet – at least based on what we found!
If your organization has a digital preservation policy and you want to see yours in this list as well, please send an email to Barbara.Sierman@kb.nl and it will be added.
In a comment on a JHOVE bug, I said offhandedly that it’s approaching the end of its life. This caused a certain amount of concern in Twitter discussions. Andy said that software tools are one of the best ways to “preserve specific, reproducible knowledge about processes.” I don’t think dropping support of a rather dated tool is a big concern, though, as long as the code doesn’t vanish.
A software application is good for a certain number of years before it needs to be either left as legacy code or completely rewritten. Throwing out code and starting over takes a lot of effort, but it can result in much better code. I started on JHOVE in 2003 as a contractor to the Harvard University Libraries. After a few years it became clear that some of the design decisions weren’t ideal. Its all-or-nothing approach and its tendency to give up after the first error have long been obvious problems. The PDF module is a kludge built on a crock, and that’s without even talking about its profiles. The TIFF module, on the other hand, has a fair amount of elegance.
JHOVE2 was supposed to be the successor to JHOVE. Its creators learned from JHOVE and produced a better design. What they didn’t have was enough time and money to cover all the formats that JHOVE covered. I’ve continued to work on JHOVE because I know it inside and out. Someone else could pick up the work, but it might make more sense for a newcomer to the code to join the JHOVE2 effort instead. However, Maurice noted on Twitter that there hasn’t been much activity lately on JHOVE2 issues.
Both JHOVE and JHOVE2 were funded under grants. When the grant money ended, progress slowed down. The one-time grant model is the wrong way to fund preservation software. It’s an ongoing effort; new formats arise and old ones change, and there are always bugs to fix. What I’d like to see happen is for major libraries in the US to create an ongoing consortium for preservation work, similar to the Planets project in Europe. Or better yet, a consortium bringing together libraries all over the world. It wouldn’t take a lot from any individual institution. Its job would be to maintain information, preservation tools, test suites, and so on, on an ongoing basis. Instead of rushing to create a tool and then leaving it to freelancers like (formerly) me to maintain, it would support maintenance of tools for as long as it made sense and creation of new ones when it’s appropriate.
My voice isn’t enough to call anything like this into existence, but I can hope.
Tagged: JHOVE, preservation, software
The following is a guest post by Jeanette Altman, a Digital Projects Professional at the University of Alaska Fairbanks.
For many Alaskans, it’s not uncommon to be just slightly out of step with the rest of America. Things that might be easily obtainable Outside (that’s the Lower 48 to you) come at a premium here. Free shipping? Not to Alaska!
So when some of us here at the University of Alaska Fairbanks’ Elmer E. Rasmuson Library first got wind of the Library of Congress’ Digital Preservation Outreach and Education program’s Train the Trainer workshops, we asked, “When are you expanding to include Alaska?” After receiving the disappointing but altogether unsurprising news that there were no such plans, George Coulbourne, Executive Program Officer at the Library of Congress, offered us an opportunity for a collaborative partnership. One year later, Train the Trainer, Alaska Edition, was born.
Now in its third year, DPOE seeks to foster national outreach and education about digital preservation, using a Train the Trainer model to reach as many people as possible. Participants are trained in DPOE’s baseline curriculum, and then given the tools they need to build their own teaching network after they return to their communities.
The August 27-29 workshop in Fairbanks, Alaska, was hosted by the University of Alaska Fairbanks’ Elmer E. Rasmuson Library, and made possible by the generosity of the Alaska State Library and the Institute of Museum and Library Services. Participants throughout the state of Alaska were flown in to Fairbanks for the three-day training. Twenty-four participants now join the growing network of 87 “topical trainers” across the United States, and are the first in the state of Alaska.
Rasmuson Library opened the application process to Alaska residents in May of 2013. Participants were flown in from various regions of Alaska such as Kotzebue, Igiugig, and Skagway, and represented a myriad of organizations including the National Park Service, Alaskan tribal libraries, cultural foundations, and various museums and libraries.
“I so enjoyed participating in the workshop, and feel invigorated by all that I learned over the three day event,” said Angie Schmidt, a workshop participant and film archivist with the Alaska Film Archives. “Being able to interact and form contacts with leaders from the Library of Congress and other institutions, as well as colleagues from around the state was especially valuable. The framework provided for initiating and carrying through on digital preservation projects will be so beneficial to us all in coming months and years.”
On the first day of the training, six groups were formed to focus on each of the DPOE modules: Identify, Select, Store, Protect, Manage and Provide. The diversity of the participant population was a valuable addition overall, as each group brought aspects of their cultural heritage and experience to their presentations. The workshop provided time for networking and sharing of resources and experience, which has already led to further collaboration between Rasmuson Library and other state organizations. We hope that we can use this event as a starting point to find the right partners and funders to build out a digital preservation community in Alaska, including more Train the Trainer sessions, technical skills training, and investments in infrastructure.
“Alaska now has their first group of trained digital preservation practitioners,” Coulbourne noted in the event’s closing. “You all have the unique potential to collaborate across the state and use your newly acquired skills to enhance your communities’ efforts to preserve and make available the rich cultural heritage and treasures held by the Native Alaskan people.”
Robin Dale of LYRASIS, Mary Molinaro of the University of Kentucky and Jacob Nadal of the Brooklyn Historical Society continued their tradition of serving as lead or "anchor" instructors. Their generosity, their organizations' commitment, and the Library of Congress' focus on this national effort allowed the DPOE Train the Trainer Program to be offered to attendees from remote areas of Alaska who otherwise may not have been able to attend this critical skill-building program in digital stewardship.
It was obvious to me that one of DPOE's most valuable attributes is cost-effectiveness. The cultural heritage community needs quality training at a low cost. Digital preservation is a critical skill set, but training current staff is often too expensive for smaller institutions, or for states such as ours where access to in-person training is very challenging if not impossible during certain times of the year. This program has helped the Rasmuson Library staff to work with the state's professional and Native Alaskan organizations to preserve our rich history, folklore, and traditions in digital form. I hope the community formed at this training event will raise the level of digital preservation practice, forge new partnerships, and bring more Alaskans, and their valuable collections, up to speed with digital stewardship.
Last week someone pointed my attention to a recent iPRES paper by Roman Graf and Sergiu Gordea titled "A Risk Analysis of File Formats for Preservation Planning". The authors propose a methodology for assessing preservation risks for file formats using publicly available information sources. In short, their approach involves two stages:
- Collect and aggregate information on file formats from data sources such as PRONOM, Freebase and DBPedia
- Use this information to compute scores for a number of pre-defined risk factors (e.g. the number of software applications that support the format, the format's complexity, its popularity, and so on). A weighted average of these individual scores then gives an overall risk score.
This has resulted in the "File Format Metadata Aggregator" (FFMA), which is an expert system aimed at establishing a "well structured knowledge base with defined rules and scored metrics that is intended to provide decision making support for preservation experts".
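The scoring stage above can be sketched in a few lines. The factor names, weights and scores below are invented for illustration and are not taken from the paper; note how the weighted average lets a poor score on one factor be diluted by good scores on the others.

```python
# Sketch of a weighted-average risk score (hypothetical factors, weights
# and scores; here 1.0 = highest risk, 0.0 = lowest).
def overall_risk(scores, weights):
    """Weighted average of the per-factor risk scores."""
    total_weight = sum(weights[f] for f in scores)
    return sum(scores[f] * weights[f] for f in scores) / total_weight

scores = {"software_count": 0.9, "complexity": 0.2, "popularity": 0.1}
weights = {"software_count": 0.4, "complexity": 0.3, "popularity": 0.3}

# The very poor software_count score (0.9) is averaged away by the
# other factors, yielding a middling overall score of 0.45.
overall = overall_risk(scores, weights)
```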
The paper caught my attention for two reasons: first, a number of years ago some colleagues at the KB developed a method for evaluating file formats that is based on a similar way of looking at preservation risks. Second, just a few weeks ago I found out that the University of North Carolina is also working on a method for assessing "File Format Endangerment" which seems to be following a similar approach. Now let me start by saying that I'm extremely uneasy about assessing preservation risks in this way. To a large extent this is based on experiences with the KB-developed method, which is similar to the assessment method behind FFMA. I will use the remainder of this blog post to explain my reservations.

Criteria are largely theoretical
FFMA implicitly assumes that it is possible to assess format-specific preservation risks by evaluating formats against a list of pre-defined criteria. In this regard it is similar to (and builds on) the logic behind, to name but two examples, the Library of Congress' Sustainability Factors and the UK National Archives' format selection criteria. However, these criteria are largely based on theoretical considerations, without being backed up by any empirical data. As a result, their predictive value is largely unknown.

Appropriateness of measures
Even if we agree that criteria such as software support and the existence of migration paths to some alternative format are important, how exactly do we measure them? It is pretty straightforward to simply count the number of supporting software products or migration paths, but this says nothing about their quality or suitability for a specific task. For example, PDF is supported by a plethora of software tools, yet it is well known that few of them support every feature of the format (possibly even none, with the exception of Adobe's implementation). Here's another example: quite a few (open-source) software tools support the JP2 format, but many of them (including ImageMagick and GraphicsMagick) rely for this on JasPer, a JPEG 2000 library that is notorious for its poor performance and stability. So even if a format is supported by lots of tools, this will be of little use if the quality of those tools is poor.

Risk model and weighting of scores
Just as the employed criteria are largely theoretical, so are the computation of the risk scores, the weights assigned to each risk factor, and the way the individual scores are aggregated into an overall score. The latter is computed as the weighted sum of all individual scores, which means that a poor score on, for example, Software Count can be compensated by high scores on other factors. This doesn't strike me as very realistic, and it is also at odds with e.g. David Rosenthal's view that formats with open source renderers are immune from format obsolescence.

Accuracy of underlying data
A cursory look at the web service implementation of FFMA revealed some results that make me wonder about the data that are used for the risk assessment. According to FFMA:
- PNG, JPG and GIF are uncompressed formats (they're not!);
- PDF is not a compressed format (in reality text in PDF nearly always uses Flate compression, whereas a whole array of compression methods may be used for images);
- JP2 is not supported by any software (Software Count=0!), it doesn't have a MIME type, it is frequently used, and it is supported by web browsers (all wrong, although arguably some browser support exists if you account for external plugins);
- JPX is not a compressed format and it is less complex than JP2 (in reality it is an extension of JP2 with added complexity).
To some extent this may also explain the peculiar ranking of formats in Figure 6 of the paper, which marks down PDF and MS Word (!) as formats with a lower risk than TIFF (GIF has the overall lowest score).

What risks?
It is important to note that the concept of 'preservation risk' as addressed by FFMA is closely related to (and has its origins in) the idea of formats becoming obsolete over time. This idea is controversial, and the authors do acknowledge this by defining preservation risks in terms of the "additional effort required to render a file beyond the capability of a regular PC setup in [a] particular institution". However, in its current form FFMA only provides generalized information about formats, without addressing specific risks within formats. A good example of this is PDF, which may contain various features that are problematic for long-term preservation. Also note how PDF is marked as a low-risk format, despite the fact that it can be a container for JP2, which is considered high-risk. So doesn't that imply that a PDF that contains JPEG 2000 compressed images is at a higher risk?

Encyclopedia replacing expertise?
A possible response to the objections above would be to refine FFMA: adjust the criteria, modify the way the individual risk scores are computed, tweak the weights, change the way the overall score is computed from the individual scores, and improve the underlying data. Even though I'm sure this could lead to some improvement, I'm eerily reminded here of a recent blog post by Andy Jackson, in which he shares his concerns about the archival community's preoccupation with format, software and hardware registries. Apart from the question whether the existing registries are actually helpful in solving real-world problems, Jackson suggests that "maybe we don't know what information we need", and that "maybe we don't even know who or what we are building registries for". He also wonders if we are "trying to replace imagination and expertise with an encyclopedia". I think these comments apply equally well to the recurring attempts at reducing format-specific preservation risks to numerical risk factors, scores and indices. This approach simply doesn't do justice to the subtleties of practical digital preservation. Worse still, I see a potential danger of non-experts taking the results from such expert systems at face value, which can easily lead to ill-judged decisions. Here's an example.

KB example
About five years ago, some colleagues at the KB developed a "quantifiable file format risk assessment method", which is described in this report. This method was applied to decide which still image format was the best candidate to replace the then-current format for digitisation masters. The outcome was used to justify a change from uncompressed TIFF to JP2. It was only much later that we found out about a host of practical and standards-related problems with the format, some of which are discussed here and here. None of these problems were accounted for by the earlier risk assessment method (and I have a hard time seeing how they ever could be)! The risk factor approach of FFMA covers similar ground, and this adds to my scepticism about addressing preservation risks in this manner.

Final thoughts
Taking into account the problems mentioned in this blog post, I have a hard time seeing how scoring models such as the one used by FFMA would help in solving practical digital preservation issues. It also makes me wonder why this idea keeps being revisited over and over again. Similar to the format registry situation, is this perhaps another manifestation of the "trying to replace imagination and expertise with an encyclopedia" phenomenon? What exactly is the point of classifying or ranking formats according to perceived preservation "risks" if these "risks" are largely based on theoretical considerations, and are so general that they say next to nothing about individual file (format) instances? Isn't this all a bit like searching for Bigfoot? Wouldn't the time and effort involved in these activities be better spent on trying to solve, document and publish concrete format-related problems and their solutions? Some examples can be found here (accessing old PowerPoint 4 files), here (recovering the contents of an old Commodore Amiga hard disk), here (BBC Micro data recovery), and even here (problems with contemporary formats).
I think there could also be a valuable role for some of the FFMA-related work in all this: the aggregation component of FFMA looks really useful for the automatic discovery of, for example, software applications that are able to read a specific format, and this could be hugely helpful in solving real-world preservation problems.

Preservation Topics: Preservation Risks, Format Registry
For our evaluations within SCAPE it would be useful to have the ability to quantitatively measure the abilities of the Hadoop clusters available to us, to allow results from each cluster to be compared.
Fortunately, the standard Hadoop distribution includes some examples that can be run as tests. Intel has produced a benchmarking suite - HiBench - that uses those included Hadoop examples to produce a set of results.
There are various aspects of performance that can be assessed. The main ones are:
- CPU loaded workflows (e.g. file format migration) where the workflow speed is limited by the CPU processing available
- I/O loaded workflows (e.g. identification/characterisation) where the workflow speed is limited by the I/O bandwidth available
For the testing of our cluster I used HiBench 2.2.1. I made some notes about getting it to run that should be useful (see below). Apart from the one change described below in the notes, there was no need to edit or change the code.
In SCAPE testbeds we are running various workflows on various clusters. However, individual workflows tend to be run on only one cluster. Running a standard benchmark on each Hadoop installation may allow us to better compare and extrapolate results from the different testbed workflows.
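As a sketch of the kind of comparison this enables, normalising each benchmark result to a throughput figure gives a single number per cluster per test. The cluster names, data sizes and durations below are invented, not actual HiBench results.

```python
# Hypothetical comparison of two clusters on the same benchmark;
# the sizes and durations here are invented for illustration.
def throughput_mb_per_s(bytes_processed, duration_s):
    """Benchmark throughput in MB/s - a simple cross-cluster metric."""
    return bytes_processed / (1024 * 1024) / duration_s

# cluster -> (bytes processed by the benchmark, wall-clock seconds)
results = {
    "cluster_a": (10 * 1024**3, 420.0),
    "cluster_b": (10 * 1024**3, 310.0),
}

# Rank clusters from fastest to slowest on this benchmark
ranking = sorted(results, key=lambda c: -throughput_mb_per_s(*results[c]))
```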
Notes (these only need to be carried out on the node that HiBench is run from):
- JAVA_HOME is needed by some tests - I set this using “export JAVA_HOME=/usr/lib/jvm/j2sdk1.6-oracle/”.
- For the kmeans test I changed the HADOOP_CLASSPATH line in “kmeans/bin/prepare.sh” to “export HADOOP_CLASSPATH=`mahout classpath | tail -1`” as it was unable to run without that change; mahout already being in the path.
- The nutchindexing and bayes tests required a dictionary to be installed on the node that HiBench was started from – I installed the “wbritish-insane” package.
Some tests use fewer map/reduce slots than are available and are therefore not that useful for comparison, as we want to max out the cluster. For example, the kmeans tests only used 5 map slots.
I have created a page on the SCAPE wiki where I have put the results from our cluster: “Benchmarking Hadoop installations”. I invite and encourage you to run the same tests above and add them to the wiki page. Running the tests was much quicker than I thought it might be – it took less than a morning to setup and execute.
To get a better understanding of which benchmarks are more or less appropriate, I propose we first get some metrics from all the HiBench tests across different clusters. In future we may choose to refine or change the tests to be run, but this is just the start of a process to better understand how our Hadoop clusters perform. It's only through your participation that we will get useful results, so please join in!

Preservation Topics: SCAPE
JHOVE 1.11 is now available at
Thanks to Maurice de Rooij for helping to debug the Windows batch files.
Tagged: JHOVE, preservation, software
One of the Open Planets Foundation’s main roles in the SCAPE project is to provide stewardship for, and ensure longevity of the SCAPE software outputs.
The SCAPE project is committed to producing open source software that is available to the wider community on GitHub, with clear licence terms and appropriate documentation, at an early stage in development.
While the above steps are important and helpful in encouraging other developers to download a project's source code, compile it, and try the software, this isn’t an everyday activity for the less geeky members of the digital preservation community. Software in this state is also unlikely to meet with the approval of an institution's IT Operations / Support section.
What’s really required for software longevity is an active community of users who:
Use the software for real world activities in their day to day work.
Report bugs and request enhancements on the project's issue tracker.
Contribute to community software documentation.
So how do we bridge the gap between our current developer-ready software, and software that non-geeks find easy to install and use?
Over October there will be a sustained effort to package, document and publish SCAPE software for download by anybody who wants to try it. If that sounds like you then read on.
Where can I find the SCAPE software?
We have compiled a list of tools that have been developed or extended as part of the SCAPE Project: http://www.scape-project.eu/tools. Currently our software is on the OPF’s GitHub page, though if you’re not comfortable with source code this might not prove very helpful. To help you make sense of what’s on the GitHub page, the OPF has created a project health check page, which distills the information a little and provides helpful links to the projects' README and LICENSE files. This page is still a work in progress, so if there’s some information you’d like to see on it you can raise an issue on GitHub.
How do I know that the software builds?
All SCAPE software should have a Continuous Integration build that runs on the Travis-CI site. This means that the software is built every time somebody checks a change to the source code into GitHub. If the build fails the developer is informed and corrects the problem as soon as possible. Every project listed on the project health check site has one of these graphics:
indicating the result of the most recent attempt to build the project on Travis, or informing you that a Travis build couldn’t be found. Click on the image and you’ll be taken to the project’s Travis page if you’re interested in the gory details.
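For reference, a minimal Travis configuration for a Maven-based Java project looks something like the sketch below; this is an illustrative example only, not taken from any actual SCAPE repository.

```yaml
# Illustrative .travis.yml for a Maven-based Java project (hypothetical,
# not an actual SCAPE project file). Travis runs this on every push.
language: java
jdk:
  - openjdk7
script: mvn clean install
```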
So how do I download and use SCAPE software?
Which brings us round to October, when we’ll be fitting the final piece of the puzzle. The real aim of the nightly builds is to produce installable packages for you to download. These packages will be debian apt packages, installable on debian based linux distributions including ubuntu, mint, and of course debian itself.
We’ll be creating stable release packages for download from the OPF's Bintray page, and overnight “snapshot” builds of the current project at a to be decided location. Keep an eye @openplanets and @scapeproject for news and download links over the coming month.
But I use Windows, Mac OS, or another linux packaging system.
Fear not, all is not lost. We’ve chosen debian based linux distros first because:
it simplifies licensing issues for build machines and virtual test and demonstration environments.
debian based distros are among the most widely used linux distributions.
Hadoop, the engine that runs SCAPE’s scalable platform, has historically not played well with Windows, although this is no longer such a problem.
Some of the software will run on other platforms easily, Jpylyzer is available for Windows. Others may require a little more work, but if there’s interest and it’s practical we’ll do our best. We’re trying to establish a community of users, not exclude people.
So that’s why SCAPE software needs you, hopefully as much as you need SCAPE software.

Preservation Topics: Packaging, SCAPE
For this installment of the Content Matters interview series of the National Digital Stewardship Alliance Content Working Group I interviewed Edward McCain, digital curator of journalism at the Donald W. Reynolds Journalism Institute and University of Missouri Libraries. Missouri University Libraries joined the NDSA this past summer.
Ashenfelder: What is RJI’s relationship to the Missouri University School of Journalism?
McCain: RJI is a sort of sister organization of the University of Missouri School of Journalism. We work closely with the faculty and staff there. The J-School produces the journalists of the future and RJI is a think tank that works to insure and help direct the future of journalism.
Ashenfelder: You said that one of the motivations for RJI joining the NDSA was the Columbia Missourian’s loss of 15 years of digital newspaper archives in a server crash. Can you tell us about that event and why this content is so important to preserve?
McCain: The Columbia Missourian is a daily newspaper operated by the University of Missouri School of Journalism that has served this mid-Missouri community since 1908.
According to 2006 and 2008 reports by Victoria McCargar, a 2002 Missourian server crash wiped out fifteen years of text and seven years of photos. The archive was contained in an obsolete software package that effectively prevented cost-effective retrieval. The content that was lost represents a kind of “memory hole,” albeit not the intentional variety described in Orwell’s “1984.”
The disappearance of 15 years of news, birth announcements, obituaries and feature stories about the happenings in any community represents a loss of cultural heritage and identity. It also has an effect on the news ecosystem, since reporters often depend on the “morgue”– newspaper parlance for their library–to add background and context to their stories.
In other parts of the information food chain, radio and television newscasts often rely on newspapers as the basis for their efforts. This, in turn, can have an effect on the democratic process, since the election process benefits from an accurate record of the candidates’ words and actions. All this lends credence to Washington Post publisher Phil Graham’s statement that journalism is “a first rough draft of history.”
Ashenfelder: You began your career as a photojournalist. How did you get into library science?
McCain: I earned my Bachelor of Journalism degree here at Mizzou and worked in the field for over 30 years, operating my own business for the past twenty. One of McCain Photography’s profit centers has been and continues to be the sale of stock photography, which is based on my image archive.
I eventually found myself reading about controlled vocabularies, databases, metadata and other library science concepts in my spare time. I enjoyed the challenge of structuring information in a way that adds value to content. One day I called the University of Arizona’s School of Information Resources and Library Science program, and was connected to Dr. Peter Botticelli. I asked him a lot of questions. That phone conversation, plus the fact that the SIRLS Masters degree could be combined with the Digital Information Management (DigIn) certificate program, helped me decide to take the leap back into academia.
Ashenfelder: And then you came back to Missouri and joined RJI. What do you bring to RJI as its new digital curator?
McCain: From my perspective, the most important qualities I bring are imagination, the spirit of entrepreneurship and an ability to get things done. All human endeavors begin with a dream, the ability to visualize new possibilities. I’ve been a successful businessman, but more important is what I’ve learned over the years: the only failure is not owning your mistakes and learning from them so you can do better next time. To me, accomplishing things is often about having clear priorities and not caring who gets the credit; keeping egos (including my own) out of the way.
Those qualities, combined with my knowledge and experience as a journalist, photographer, software developer, businessman and library scientist all come into play in my new position. I’m still a bit amazed that MU Libraries and the Reynolds Journalism Institute created what I consider the perfect position for my skill set and interests at just the right time. And that as a result, I found my dream job.
Ashenfelder: The system you want to create will be able to archive the work of journalists from the newspaper, radio and TV. Can you broadly describe some of the requirements for such a system? What will it need to do in order to serve all of its stakeholders?
McCain: To be clear, we’re still in the embryonic phase of the software development process and we have a lot of research to do in terms of functional and technical requirements. It does seem likely that the framework will have to be modular, extensible and generally able to play well with others.
Obviously, the system will need to accommodate a wide range of file formats and packages during and across the processes needed during the life cycle of digital objects. I believe that we should be able to combine and build on existing open-source platforms to achieve this and more.
From early conversations with the three local media stakeholders, I imagine that they are going to be focused on search functionality and speed. That means that they want to find relevant content quickly and access and integrate it into their workflow seamlessly.
We are going to spend quite a bit of time optimizing search and workflow issues but once we have a handle on those issues, there will be opportunities for collaboration within and between all three media outlets that will improve their efficiency and enhance the experience for their respective audiences.
Ashenfelder: One of your first tasks is to create a plan for such a system. What research are you doing as you develop that plan?
McCain: The problems surrounding preservation of and access to digital news archives stem from a combination of frequently changing factors. I’m employing an approach adapted from the Build Initiative, which has successfully produced change in the area of education.
The Build Initiative framework is based on change theory and focuses on five broad interconnected elements: context, components, connections, infrastructure and scale. Having this kind of framework allows me to keep the big picture in mind when making decisions.
For example, one of our components provides a new business model for digital news archives. In order to successfully support this service, we need to work in the infrastructure area to create the open-source software required to implement the new model. As in most real-life systems, there are many interconnections between these components. The key is to identify segments where positive outcomes in one realm can spread synergistically into others and continue to build on those successes.
Ashenfelder: You said you would like to share RJI’s system with other people, especially smaller towns and smaller institutions, so their history won’t be lost. Can you please tell us more about that?
McCain: Journalism is struggling to find sustainable and profitable business models. Print advertising revenue is less than half of what it was in 2006 and the number of newspaper journalists has declined by 27 percent since peaking in 1989. This is particularly true in smaller towns and rural areas. Once those businesses close their doors, there is an increased likelihood that their archives, especially those in digital formats, will be lost forever. That’s why I feel it imperative to address issues relating to current and future business models involving news archives.
By creating open-source software, we hope to offer these struggling enterprises new possibilities for generating revenues from their archives. For example, we can assist these organizations in setting up cooperative efforts that allow multiple archives to reside on a single server. That would keep costs low and participants would benefit from a larger pool of content, which is generally more attractive to potential customers, ranging from research services to individual users.
In addition, for those enterprises that don’t want to deal with setting up their own server or establish a co-op, we would like to leverage the efficiencies of the Missouri University IT system to provide our system as a service at an affordable cost.
Since humans tend to save what they value, we will prioritize our programs to support private enterprise’s ability to profit from their archives. Once those archives are seen as valuable assets, they will be preserved and accessed. But in cases where that outcome isn’t realized, part of our initiative involves working as an intermediary between news archive owners and cultural heritage institutions to facilitate the safe transfer of resources to an appropriate location.
Ashenfelder: There are potential opportunities for RJI to collaborate with other institutions, such as the Missouri Press Association and the State Historical Society of Missouri.
McCain: Interestingly, the State Historical Society of Missouri was established by the Missouri Press Association in 1898 and subsequently assumed by the state. They are both significant players in newspaper preservation and access.
I spoke to the MPA board a few weeks ago and found definite interest in working with RJI and the J-School to advance the cause of news archive preservation and access. I spoke with several publishers who expressed a willingness to experiment with our software and other services at an appropriate time in the process. SHS has been participating in the National Digital Newspaper Program since 2008 and has valuable experience in working with those and other analog and digital news collections.
Ashenfelder: Much of news content comes from businesses and the private sector. How do you intend to interest profit-oriented companies in RJI’s archive and repository?
McCain: My position is charged with preservation of and access to news archives, whether public or private. While the NDNP continues to do amazing things, there is a gargantuan amount of archival content in the private sector that we probably can’t address with public funding alone. This is one reason why, in its landmark 2010 report “Sustainable Economics for a Digital Planet: Ensuring Long-Term Access to Digital Information (PDF),” the Blue Ribbon Task Force on Sustainable Digital Preservation and Access stated the need to “provide financial incentives for private owners to preserve on behalf of the public.”
In light of current funding models for archives in the U.S., it makes perfect sense to work with people in the private sector to demonstrate the potential value of their archive and to assist them in realizing it. If news executives see archives as a profit center instead of a burden, my hope is that those resources will stay viable until they enter the public domain and can be accessed and preserved by other means.
News organizations are businesses and if decision-makers don’t see value in keeping their archives, they have little incentive to preserve them–or even donate them–given current laws that don’t incentivize such transfers to cultural heritage institutions. We plan to address those and other issues in the future by launching efforts in the Context component of our initiative.
Ashenfelder: Can you tell us more about the digital news summit that you are planning at RJI next spring?
McCain: In the spring of 2011, RJI, MU Libraries and Mizzou Advantage hosted the first Newspaper Archive Summit. My colleague Dorothy Carner, Head of Journalism Libraries, was instrumental in bringing together publishers, digital archivists, journalists, librarians, news vendors and entrepreneurs to begin a conversation about how best to approach the challenges with which we are currently presented.
Dorothy and I see the next part of that ongoing conversation as a kind of “break out” group focused on dialoging with decision-makers and their influencers in order to better understand their perspectives on access and preservation of archives. Undoubtedly, a large part of the next conversation will involve finding better ways to generate profits from archival resources.
In light of his recent purchase of The Washington Post, we’ve extended an invitation to Jeff Bezos, CEO of Amazon, to speak at the summit next April. I’m not sure he will attend but I think he’s a logical choice as a speaker for the following reasons.
1) It’s no accident that Mr. Bezos started Amazon by selling books, which is another word for content. By establishing relationships with book buyers, Amazon was able to access uniquely useful information about individual tastes and interests that could then be used to customize its marketing of all kinds of other merchandise.
2) Bezos used the Internet to develop a long-tail merchandising platform that could exploit low overhead in order to profit from even rarely ordered items. Most brick and mortar stores can only carry an inventory of high-volume merchandise because their overhead makes selling unpopular items prohibitively expensive. Combine these two effects and – voilà! Amazon becomes the world’s largest online retailer.
I invite you to take a moment to imagine you were Jeff Bezos and had just purchased a business with a lot of potentially valuable content cleverly disguised as a news archive. What would you do with it?
What kind of content matters to you? If you or your institution would like to share your story of long-term access to a particular digital resource, please email email@example.com and in the subject line put “Attention: Content Working Group.”
Society of American Archivists Awards ANADP conference paper with the 2013 Preservation Publication Award
The following is a guest post from Michael Mastrangelo, a Program Support Assistant in the Office of Strategic Initiatives at the Library of Congress.
During the Society of American Archivists Annual Conference in New Orleans in August, the NDIIPP-supported initiative Aligning National Approaches to Digital Preservation (ANADP) received the prestigious Preservation Publication Award for 2013. ANADP is a 327-page collection of peer-reviewed essays that establishes 47 goals and strategies for aligning the national digital preservation efforts of the United States and nations throughout the European Union.
The Preservation Publication Award goes to outstanding preservation works, nominated by peers and reviewed by an SAA committee. SAA awarded this paper because it, “…broadens and deepens its impact by reflecting on the ANADP presentations,” and “…highlights the need for strategic international collaborations.” ANADP is written for information professionals from librarians to administrators, so it will have a broad impact on the whole information field, sparking cross-industry collaboration in addition to cross-border collaboration.
The honor goes to ANADP’s volume editor Nancy McGovern, the Head of Curation and Preservation Services at the MIT Libraries, series editor Katherine Skinner, the Executive Director of the Educopia Institute, and the section co-authors, including representatives of the publication’s main sponsor, the Library of Congress, as well as experts from the Joint Information Systems Committee, Open Planets Foundation and other national and international organizations.
The ANADP conference was conceived from brainstorming sessions between the Educopia Institute, the Library of Congress, the University of North Texas, Auburn University, the MetaArchive Cooperative and the National Library of Estonia. In 2011, 125 delegates from 20 countries met in Tallinn, Estonia where they shared their national digital preservation practices. Delegates divided the work to create an overarching plan for furthering international collaboration by authoring a number of separate “alignments” across organizations, legal regimes, technical issues, economic approaches, standards and education.
The technical alignment panel discussed infrastructure like LOCKSS (Lots of Copies Keep Stuff Safe), while the organizational panel covered cost-efficiencies and vendor relations. The standards panel noted that many standards are simply impractical or overly detailed, making them inaccessible to smaller institutions. The copyright/legal panel mentioned the complicated laws on orphan works across jurisdictions, noting that conflicting copyright laws complicate preservation even across Europe’s fluid borders.
On the final day, the education panel stressed internships for bridging theory and practice, and George Coulbourne of the Digital Preservation Outreach and Education initiative suggested corporate partnerships to fund hands-on post-graduate development. Finally, the economics panel tackled the difficult question of shrinking budgets and identified successful funding models in projects like congressionally-funded NDIIPP, and JISC, a public charity with non-profit arms.
ANADP II is planned for November 18-20, 2013 in Barcelona. International digital stewardship leaders will reconvene to track progress toward collaboration and develop specific preservation actions for each collaborator to implement.
“I hope that we’ll delegate specific tasks to all the representatives to get the ball rolling on the action items in ANADP I,” said Mary Molinaro, the Associate Dean for Library Technologies at the University of Kentucky and a member of the DPOE Steering Committee. “We created an exciting plan for international collaboration with that first publication, now we just need to execute it.”
This past weekend I got to do one of my favorite things of the year: work at the NDIIPP Digital Preservation booth at the 2013 National Book Festival.
Why is it one of my favorite things to do each year? Because I get to hear from real people about what their personal digital preservation issues are, and what they hope the Library can do to help them.
People have asked what we are doing at a BOOK festival. The Library has a pavilion where it demonstrates its own programs, and we have been privileged to be included the past several years. We set up a table full of vintage media, from floppy discs to CDs, paper tape to punch cards, and even vintage computers. People inevitably stop by out of curiosity: “I remember those!” We listen to parents telling their incredulous children that they used to store data on those weird looking floppies. We display all the media and hardware not just to draw people in, but to make a point: all media will eventually become obsolete, as will the hardware needed to read it. We all need to actively manage our personal digital collections and migrate them over time to new media environments.
We also provide handouts and bookmarks with links to the personal digital archiving guidance online at the NDIIPP web site.
And we answer a LOT of questions. Some have general questions about the Library and its services. Some hope that the Library provides digitization services to help them migrate files off older analog or digital media (sorry, we cannot do that). But most want to tell us about their pain: 10s or 100s of thousands of slides to digitize. 8mm home movies they want to migrate to digital. Email services that shut down access to vital personal communications and records.
Sometimes they share their successes: a book containing digitized images from a relative’s trip to China decades ago. A project to digitize materials at a school library. Online searches that came up with digitized books and records at cultural heritage organizations that helped them document their family history. We share in the joy of their successes, commiserate on their challenges, and provide guidance wherever we can. The questions we get help us decide what new guidance documents to develop.
Some years there are definite themes. One year we were asked dozens of questions about slide digitization. Another year it was video tape-to-digital conversion. Last year there were quite a few questions about email export and migration. This year I would not say I heard any one theme, but a lot of general concern. And we received a lot of appreciation that we were there to answer questions. And that appreciation makes it all worthwhile.
I’m Rui Castro. I have worked at KEEP SOLUTIONS since 2010, where I have the roles of Director of Infrastructures, project manager and researcher. Before joining KEEP SOLUTIONS, I was part of the team that developed RODA, the digital preservation repository used by the Portuguese National Archives.
Tell us a bit about your role in SCAPE and what SCAPE work you are involved in right now?
My role in SCAPE is primarily focused on Preservation Action Components and Repository Integration.
In Action Components, I’ve worked on the identification, evaluation and selection of large-scale action tools and services to be adapted to the SCAPE platform. I’ve contributed to the definition of a preservation tool specification with the purpose of creating a standard interface for all preservation tools and a simplified mechanism for packaging and redistributing those tools to the wider community of preservation practitioners. I have also contributed to the definition of a preservation component specification with the purpose of creating standard preservation components that can be automatically searched for, composed into executable preservation plans and deployed on SCAPE-like execution platforms.
Currently my work is focused on repository integrations, where I have the task of implementing the SCAPE repository interfaces in RODA, an open-source digital repository supported and maintained by KEEP SOLUTIONS. These interfaces, when implemented, will enable the repository to use the SCAPE preservation environment to perform preservation planning, watch and large-scale preservation actions.
Why is your organisation involved in SCAPE?
KEEP SOLUTIONS is a company that provides advanced services for managing and preserving digital information. One of the vectors that drive us is continuous innovation in the area of digital preservation. In the SCAPE project, KEEP SOLUTIONS is contributing with expertise in digital preservation, especially migration technologies, and with practical knowledge of the development of large-scale digital repository systems. KEEP SOLUTIONS is also acquiring new skills in digital preservation, especially in preservation planning, watch and service parallelisation; we are enhancing digital preservation products and services we currently support, such as RODA, and strengthening relationships with world-leading digital preservation researchers and institutions. KEEP SOLUTIONS’ participation in the project will enhance our expertise in digital preservation, and that will result in better products and services for our current and future clients.
What are the biggest challenges in SCAPE as you see it?
SCAPE is a big project, from the number of people and institutions involved to the number of digital preservation aspects covered. I think the biggest challenge will be the integration of all parts into a single coherent system. From a technical point of view, the integration between content repositories, automated planning & watch and the executable platform is a huge challenge.
What do you think will be the most valuable outcome of SCAPE?
I see two very interesting aspects emerging from SCAPE.
One is the integration of automated planning & watch into digital preservation repositories. Planning is an essential part of digital preservation and it involves human level activities (like policy and decision making) and machine activities (like evaluation of alternative strategies, characterisation and migration of contents). Being able to make the bridge between these two realms and provide content holders the tools to take informed decisions about what to do with their data is a great achievement.
The other is the definition of a system architecture for large-scale processing, applied to the specific domain of digital preservation, that is capable of executing preservation actions like characterisation, migration and quality assurance over huge amounts of data in a “short” time.
The following is a guest post by Lyssette Vazquez-Rodriguez, Program Support Assistant & Valeria Pina, Communications Assistant
This is the second part of a three part series of posts about the 2013-2014 NDSR class, read the first part here.
As part of the nine-month National Digital Stewardship Residency program, the residents recently completed their two-week digital content immersion workshop. Topics discussed included an overview of the digital landscape, how to identify and select potential digital content, and the levels of protection required for digital content, among others.
Mary Molinaro, Associate Dean for Library Technologies at the University of Kentucky Libraries, offered the workshop covering the overview of the digital landscape and the selection and review of digital content. In her workshop, she reviewed the process of identifying and selecting content that needs to be preserved to create an inventory. As part of the workshop, the residents were able to research and compare tools used for a variety of purposes in the digital stewardship and preservation lifecycle.
The workshop “Assess and Describe Digital Collections” was well received by the residents. Carlos Martinez, Information Technology Data Specialist at the Library of Congress, was a topical instructor in this workshop. He explained how it is crucial for digital stewards to be aware of the main characteristics of files and formats that need to be addressed when preservation initiatives are discussed and implemented. During the workshop, the residents were asked to explain how they would approach tackling metadata management issues and primary issues in building and maintaining a digital preservation infrastructure.
When asked to evaluate the workshops, the residents agreed that they were impressed with the engaging approach to digital preservation’s more abstract disciplines such as computer science. Having just finished their graduate degrees, they had the opportunity to refresh theory learned in library school. They agreed on the importance of learning how to use digital preservation tools on test data to complement the theory learned in graduate school.
Now that the immersion workshop is over, the residents will go to their host institutions and start working on their projects. In the next few weeks, we will bring you an update on the progress of their projects. Good luck to the NDSR inaugural class!
I've started to publish some of my notes on digital preservation. It's mostly a collection of 'war stories' and summaries of some of the little experiments I've carried out over the years, but never had time to write up properly. The idea of publishing these stories is inspired in part by XFNSTN, but also by my experience co-coordinating the AQuA workshops and from observing the success of the SPRUCEdp project.
In short, I think we need to share more war stories, not just the occasional full research paper, but also the small stuff, and the failures. Maybe I can start the ball rolling by sharing mine. I'd really like to know if anyone else out there is interested in sharing theirs.
There's a couple of bigger items on there that I think might be of particular interest:
- A long-winded data migration story about accessing data from BBC Master floppy disks.
- A description of how bitwise analysis can be used to better understand formats and the tools that act upon them, somewhat related to an OPF blog post by Jay Gattuso earlier this year.
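The core of that bitwise-analysis idea can be sketched in a few lines of Python. This is a minimal illustration of the technique, not the actual tooling used in the linked write-ups; the function name and file paths are hypothetical. Diffing the byte streams that two tools (or two versions of one tool) produce from the same input often reveals exactly which bytes a format's writers vary:

```python
def byte_diff(path_a, path_b):
    """Return (offset, byte_a, byte_b) tuples where two files differ.

    A crude building block for bitwise analysis: the list of differing
    offsets points you at the regions of a format worth inspecting in a
    hex editor or against the format specification.
    """
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        data_a, data_b = fa.read(), fb.read()

    # Compare byte-by-byte over the common length.
    diffs = [(i, a, b) for i, (a, b) in enumerate(zip(data_a, data_b)) if a != b]

    # Flag a trailing length mismatch, since zip() stops at the shorter file.
    if len(data_a) != len(data_b):
        diffs.append((min(len(data_a), len(data_b)), None, None))
    return diffs
```

Run against, say, the same image migrated by two different converters, an empty result means the outputs are bit-identical, while a handful of clustered offsets usually corresponds to a timestamp, a creator string or some other tool-specific header field.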
Feedback welcome, as ever.
A recent NDIIPP intern, Ingrid Jernudd, did some research into current web resources that provide digital access to a broad array of primary source materials at the state level. She prepared a list of sites that billed themselves as general-interest portals to historical resources. Although the list is likely incomplete, I was surprised she found so many.
It is worth bearing in mind also that the list, with one exception, does not include local or municipal websites (the one exception is the Denver Public Library; its Western History and Genealogy site is included because it has resources that extend beyond Denver proper).
Most of the materials available through these websites are digitized versions of analog items. Many of the sites could, however, accommodate born digital content, as well as serve as useful models for the ongoing development of access to cultural heritage resources.
Her findings are presented in the two tables below. Table 1 lists 67 websites that relate to individual states. Table 2 lists four sites that cover multiple states. If you know of resources that are not listed here, please let us know via a comment.
Table 1, State Digital Portals
- AL: Alabama Dept. of Archives and History Digital Collections (http://digital.archives.alabama.gov/); Alabama Virtual Library (http://www.avl.lib.al.us/)
- AK: Alaska’s Digital Archives (http://vilda.alaska.edu/)
- AR: Arkansas History Commission (http://www.ark-ives.com/); Arkansas State Library Digital Collections (http://cdm16039.contentdm.oclc.org/cdm/)
- AZ: Arizona Memory Project (http://azmemory.azlibrary.gov/); Arizona Cultural Inventory Project (http://cip.azlibrary.gov/)
- CA: Calisphere (http://www.calisphere.universityofcalifornia.edu/); California State Library Online Resources (http://www.library.ca.gov/services/online-resources.html)
- CO: Denver Public Library, Western History and Genealogy (http://digital.denverlibrary.org/)
- CT: Connecticut Digital Collections (http://www.ctstatelibrary.org/dld/pages/connecticut-digital-colle)
- DE: Delaware Public Archives (http://archives.delaware.gov/)
- FL: Florida Memory (http://www.floridamemory.com/)
- GA: Digital Library of Georgia (http://dlg.galileo.usg.edu/); Georgia’s Virtual Vault (http://cdm.georgiaarchives.org:2011/cdm/)
- HI: Hawaii State Archives Digital Collections (http://archives1.dags.hawaii.gov/gsdl/cgi-bin/library)
- IA: Iowa State Historical Society Digital Archives (http://www.iowahistory.org/libraries/index.html); Iowa Heritage Digital Collections (http://www.iowaheritage.org/)
- ID: Idaho State Historical Society Digital Collections (http://idahohistory.cdmhost.com/)
- IL: Illinois Digital Archives (http://www.idaillinois.org/); Explore Illinois: Illinois Digital Archives (http://www.finditillinois.org/ida/)
- IN: Indiana Memory (http://www.in.gov/memories/index.html)
- KS: Kansas Historical Society State Archives (http://www.kshs.org/p/state-archives-library/11933)
- KY: Kentucky Digital Library (http://kdl.kyvl.org/)
- LA: LOUISiana Digital Library (http://louisdl.louislibraries.org/); LDMA (Louisiana Digital Media Archive) (http://ldma.lpb.org/about-ldma)
- MA: Digital Commonwealth: Massachusetts Collections Online (http://www.digitalcommonwealth.org/); Massachusetts Board of Library Commissioners Digital Collections (http://mblc.state.ma.us/books/digital/)
- MD: Maryland Historical Society Online Collections (http://www.mdhs.org/museum/collections-online); Archives of Maryland Online (http://ow.ly/p8kO4)
- ME: Maine Memory Network (http://www.mainememory.net/)
- MI: Seeking Michigan (http://seekingmichigan.org/)
- MN: Minnesota Digital Library (http://www.mndigital.org/)
- MO: Missouri Digital Heritage (http://www.sos.mo.gov/mdh/)
- MS: Mississippi Department of Archives and History: Digital Archives (http://mdah.state.ms.us/arrec/digital_archives/); Upper Mississippi Valley Digital Image Gallery (http://www.umvphotoarchive.org/)
- MT: Montana Memory Project (http://mtmemory.org/)
- NC: North Carolina Digital Collections (http://digital.ncdcr.gov/); North Carolina Exploring Cultural Heritage Online (http://www.ncecho.org/); North Carolina’s Digital Collections (http://digitalnc.org/collections)
- ND: State Historical Society of North Dakota Digital Resources (http://www.history.nd.gov/archives/digitalresources.html)
- NE: Virtual Exhibits of the Nebraska State Historical Society (http://nebraskahistory.org/exhibits/index.shtml)
- NJ: New Jersey Digital Highway (http://www.njdigitalhighway.org/)
- NM: New Mexico State Library Digital Archive (http://ow.ly/p8hrp)
- NV: Nevada Statewide Digital Initiative (http://nsla.nevadaculture.org/)
- NY: New York Heritage Digital Collections (http://www.newyorkheritage.org/); New York Department of Records Photo Gallery (http://www.nyc.gov/html/records/html/gallery/home.shtml); New York Public Library Digital Gallery (http://digitalgallery.nypl.org/nypldigital/explore/)
- OH: Ohio Memory (http://www.ohiomemory.org/)
- OK: Oklahoma Digital Prairie (http://digitalprairie.ok.gov/)
- OR: Oregon Digital Library Project (http://odl.library.oregonstate.edu/record/search)
- PA: Digital Collections at the State Library of Pennsylvania (http://ow.ly/p8hv3); Historical Society of Pennsylvania Digital Library (http://digitallibrary.hsp.org/)
- RI: State of Rhode Island Virtual Archives (http://sos.ri.gov/virtualarchives/)
- SC: South Carolina Digital Library (http://www.scmemory.org/)
- SD: South Dakota State Historical Society Digital Archives (http://sddigitalarchives.contentdm.oclc.org/); Digital Library of South Dakota (http://dlsd.sdln.net/index.php)
- TN: Tennessee State Library and Archives: Digital Collections (http://www.tennessee.gov/tsla/resources/index.htm); Volunteer Voices (http://www.volunteervoices.org/)
- TX: Northeast Texas Digital Collections (http://dmc.tamu-commerce.edu/)
- UT: Digital Utah (http://pioneer.utah.gov/digital/utah.html)
- VA: Virginia Memory (http://www.virginiamemory.com/collections/)
- VT: Vermont Folklife Center (http://www.vermontfolklifecenter.org/digital-archive/collections/)
- WI: State of Wisconsin Collection (http://uwdc.library.wisc.edu/collections/WI)
- WV: West Virginia Division of Culture and History Online Exhibits (http://www.wvculture.org/museum/exhibitsonline.html)
- WY: Wyoming Memory (http://www.wyomingmemory.org/)
Table 2, Multi-State Digital Portals

States | Resource | Website
CO, NM, WY | Rocky Mountain Online Archive | http://rmoa.unm.edu/
MN, ND | Digital Horizons: A Plains Media Resource | http://digitalhorizonsonline.org/
UT, NV, ID, AZ, HI | Mountain West Digital Library | http://mwdl.org/
Various | Digital Public Library of America | http://dp.la
Update: Corrected link to the Vermont Folklife Center.
If you’re in DC this weekend, make sure to stop by the 2013 Library of Congress National Book Festival on the National Mall. Authors, poets, illustrators and several Library of Congress programs will be featured over two days, Saturday and Sunday, September 21 – 22, 2013. NDIIPP staff will be in the Library of Congress Pavilion (on Sunday only) with information and handouts about what we call Personal Digital Archiving: tips and guidelines on how people can keep their own digital photographs, documents, music, email and other digital information safe.
NDIIPP has been sharing these ideas at the NBF since 2006, and one of the most popular parts of our exhibit is the myriad of old storage discs and outdated computers we use to represent the constantly changing digital environment. Often, the computer punch cards on display bring back memories of loading reams of cards into computers that ran fairly simple operations. It’s amazing to think about how far technology has come and how much it has changed our everyday lives. With these dramatic changes we’ve all had to learn new things: how to use a computer, a digital camera, a mobile phone, email, the Internet. We also have to learn how to keep the output of these new technologies and devices so that the generations after us can know what we experienced–our story.
So, if you’re interested in learning more about saving your digital stuff, or just want to walk down memory lane with storage discs from your past, stop by on Sunday between 10 a.m. and 5 p.m.! We’ll be in the Library of Congress Pavilion. At 4:20 p.m. on Sunday, Bill LeFurgy will be presenting Preserving America’s Digital Heritage: The National Digital Information Infrastructure and Preservation Program.
The following is a guest post by Carlos Martinez III, a program support assistant in the Library of Congress Office of Strategic Initiatives and a recent graduate of the Catholic University of America’s Library and Information Science Masters Program.
Digital technologies have become an integral part of everyday life, influencing and changing the way information is searched, retrieved, accessed and preserved. Over the past decade, there has been a major shift in the types and formats of information resources people seek, leading to changes in the way new library and information science professionals prepare for the current marketplace. This shift has manifested itself by creating opportunities in information technology roles and positions, such as digital archivists, repository librarians and metadata specialists, versus more traditional roles of reference or cataloging librarians. Given this shift, I have often contemplated the critical skills library and information science professionals need to acquire today – in graduate school and the professional arena.
As a recent graduate and based on my recent work experience at the Library of Congress, I hope to shed some light on what having a “modern” library job entails, offer some thoughts on the types of skills I have concluded are necessary for librarians in today’s information environment, and offer some advice to emerging professionals.
Even with recent discussion on the gaps that exist among professionals in the workplace, “library science” coursework is critical to understanding the profession as a whole. The traditional skills of librarianship, like cataloging and reference services, are still vital to the profession. For example, the information that is accessed via search engines was built on taxonomies and controlled vocabularies, mainstays of librarianship. During my master’s coursework, I took introductory and advanced courses in cataloging and classification. The theoretical concepts and practices taught in these courses are highly applicable to describing information at different levels of granularity for access.
The digital age has also affected acquisition and collection development policies in libraries. Librarians must now consider how to effectively manage digital information resources while maintaining physical collections. Taking collection management or development courses will prepare you to think critically about how to apply the theoretical principles within your library or information center. In one of my classes we discussed the importance of updating collection development policies to provide access to digital information resources. Acquiring and providing access to electronic materials for patrons both on and offsite is a critical component of maintaining a useful collection.
All that said about theoretical coursework, it’s critical to take courses related to information system design and analysis. In light of previous discussions on this blog, it would be useful for new librarians to understand data as collections and, for example, have the ability to manipulate and manage large data sets. A lot of the work that I do at the Library of Congress involves metadata remediation, data management and data migration. The courses that I feel prepared me most for this work were database management and programming for web applications. These courses offered a solid introduction to managing data in a relational database, and to writing code for applications that create and access data.
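To make the kind of work described above concrete, here is a minimal sketch of a metadata remediation pass against a relational database. It is purely illustrative: the `records` table, its `date_created` field, and the handful of legacy date formats are invented for this example, not taken from any actual Library of Congress system.

```python
import sqlite3
from datetime import datetime

def normalize_date(raw):
    """Try a few common legacy date formats; return ISO 8601 or None."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # leave unparseable values for manual review

def remediate(conn):
    """Rewrite every parseable date_created value to a single canonical form."""
    cur = conn.execute("SELECT id, date_created FROM records")
    for rec_id, raw in cur.fetchall():
        iso = normalize_date(raw)
        if iso is not None and iso != raw:
            conn.execute("UPDATE records SET date_created = ? WHERE id = ?",
                         (iso, rec_id))
    conn.commit()

if __name__ == "__main__":
    # Build a tiny in-memory catalog with inconsistent date strings.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, date_created TEXT)")
    conn.executemany("INSERT INTO records (date_created) VALUES (?)",
                     [("09/21/2013",), ("September 22, 2013",), ("2013-09-23",)])
    remediate(conn)
    print([row[0] for row in
           conn.execute("SELECT date_created FROM records ORDER BY id")])
    # every row now reads as an ISO 8601 date
```

The real remediation projects mentioned here involve far messier data and richer metadata schemas, but the pattern is the same: query, normalize, write back, and flag what cannot be parsed for human review.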
Without a doubt, the work that librarians perform has been changing as access to free digital information becomes more prevalent. It is important to pursue a curriculum that allows you to be comfortable providing user support, reference services, readers’ advisory, and the myriad of skills associated with librarianship in a variety of formats. The best way to earn this experience is through an internship or a practicum.
Last summer, I completed an internship at the National Archives and Records Administration in the Center for Legislative Archives through the Hispanic Association of Colleges and Universities (HACU) National Internship Program. During this internship I migrated legacy data from House and Senate records into structured metadata to facilitate access for users. I also shadowed a reference archivist and assisted him in providing reference support services to patrons visiting the Archives. Before the internship was over, I began answering both virtual and in-person reference requests independently.
I started a second HACU internship at the Library of Congress Repository Development Center of the Office of Strategic Initiatives. Being a member of the RDC challenged me to learn about some of the problems librarians are facing with preserving digital information resources, such as digital content transfer and media degradation. I had the opportunity to meet with several stakeholders within the Library, and helped develop a set of data elements necessary for the acquisition of electronic journals through a system called e-Deposit.
After my internship with RDC I began working as an employee in OSI’s Integration Management office. My primary responsibilities are in the area of metadata remediation and management. I help develop metadata according to the new web framework guidelines for a variety of online content ranging from digitized materials to static web pages.
Over the past couple of years, Integration Management partnered with the Interpretive Programs Office to migrate online exhibitions into a new web framework. An example of this work can be seen by comparing the Internet Archive’s version of the Thomas Jefferson exhibit with the current online exhibit. The new framework allows users to access catalog records associated with digital objects, and creates page-level metadata that will refine the online catalog’s search capabilities.
Combining Theory with Practice
In my personal experience, learning how to provide traditional library services (like reference services and cataloging) is important, but capitalizing on the opportunity to develop a technical background while in library school is equally critical. The most valuable aspect of completing my graduate coursework was learning the principles of the profession and becoming instilled with its values, because the core mission of librarianship has not changed.
As an emerging information professional, the most important theoretical principles center on becoming familiar with authority control in the digital age, the ability to manage and manipulate large sets of data, and understanding the challenges of preserving physical and digital formats. New librarians need to possess the ability to assess and describe collections like traditional librarians, but they also need to know enough about technology to successfully curate digital collections in the information age. While it is important to have this knowledge when entering the profession, it is equally important to have had practical experience applying it. The experience you gain will not only prepare you for the workplace, but will give you an edge in applying theory to practice.
The John W. Kluge Center at the Library of Congress has announced a new set of Kluge Fellowships in Digital Studies to examine the impact of the digital revolution on society, culture and international relations using the Library’s collections and resources. I am thrilled to have the chance to talk with Jason Steinhauer, Program Specialist with the Kluge Center, about how this unique opportunity could fit with ongoing scholarship and research in digital stewardship.
Trevor: Could you give us a quick overview of the fellowship? What are the key points for anyone interested in it?
Jason: Sure. This is a call to scholars and thinkers worldwide to examine the digital revolution’s impact on how we think, how we live and how we relate to one another. Digital technology has made its way into every facet of our lives. Although it may be too early to fully know what the impact of the digital revolution is, it’s not too soon to ask the question. We hope to catalyze thinkers and scholars to take a step back, take a broad look at the evidence of the digital revolution’s effects on our lives and look deeply to see if something has fundamentally shifted. If so, what is it? What does it mean for us? What are the implications, positively or negatively? We hope to bring great minds to the world’s greatest repository of knowledge to investigate these questions.
Trevor: A recurring theme on The Signal has been bringing data science and computational analysis to bear on cultural heritage collections. For example, work funded through the interagency Digging into Data grants program often falls into this area. Would this call be an opportunity for data scientists and computer engineering researchers to develop that kind of corpus analysis research on things like the more than 30 million online documents in the National Digital Library mentioned in the call? I would be curious to hear you explain a few of the kinds of things you might imagine scholars could propose in this vein. Further, could you give us a sense of what would make this kind of proposal strong and compelling to reviewers?
Jason: Well, it’s a wide topic and applicants can approach the subject any number of ways. Most important, though, is to ensure that proposals address questions of deep concern to the humanities and the social sciences. We’d encourage scholars to go beyond data science and computational analysis and think about the digital revolution’s impact on language, education, communication, our thought patterns, on our values. Is the digital revolution ushering in a fundamental change in how we communicate, for example?
Some scholars speculate that the language of computer programmers may become the lingua franca of the future. Is that one of the implications of the digital revolution? Are there other implications for language, as more and more exchanges between people and nations are conducted through digital means? Is the digital revolution fundamentally shifting our values? If so, how? We want scholars who are willing to think deeply and critically about the implications of this massive transformation using the Library of Congress collections, as well as additional resources in Washington.
Trevor: Reading the call I thought of two very different streams of digital scholarship that might fit into it. On the one hand, there is work in new media studies that focuses on close readings and analysis of digital materials and their histories. On the other, there is work in the digital humanities that focuses on computational analysis of digitized collections of existing primary sources from earlier eras. Matthew Kirschenbaum talked about these different streams of research in a recent interview. Are both of these kinds of research projects in play for the fellowships? If so, could you provide a sense of how these very different kinds of proposals would be evaluated against each other?
Jason: It’s best to think of this as a humanities fellowship that critically explores the digital revolution’s impact on our lives. Not to say that digital scholarship is not interrelated to this, but a deeply-rooted humanities framework may be most helpful in crafting a proposal.
In terms of evaluation, all applications to the Kluge Center are evaluated against five criteria: the significance of the contribution that the project will make to knowledge in the specific field and to the humanities or social sciences generally; the quality of the applicant’s work; the quality of the conception, definition, organization and description of the project; the likelihood that the applicant will complete the project; and the appropriateness of the research for the Library of Congress collections.
We hope to offer up to three fellowships in the first year of this competition and the three selectees may take very different approaches. We’re hoping to see a lot of differing, creative approaches to the topic.
Trevor: The call specifically mentions the Twitter archive. Do you have a sense of what modes of access proposing scholars would have to the Twitter corpus?
Jason: The Twitter archive is a new kind of collection for the Library of Congress. Archiving and preserving outlets such as Twitter will give future researchers access to a fuller picture of today’s cultural norms, dialogue, trends and events in order to inform scholarship, the legislative process, new works of authorship, education, and other purposes.
The Library has received billions of tweets and corresponding metadata to date, and is now working to develop a stable and sustainable way to preserve and organize the collection. In the near term, the Library is working to develop basic levels of access for on-site researchers and scholars-in-residence. We anticipate this Kluge Fellowship in Digital Studies to be an ongoing program, so we felt it appropriate to mention the Twitter archive as a potential resource, even though the full functionality may not be in place by the time the first fellows arrive. Scholars should not base their proposals around the Twitter archive, but rather consider it as one of the resources to mine while here at the Library of Congress.
Trevor: Aside from its born-digital and digitized collections, the Library of Congress has extensive holdings of personal papers and other primary sources that would seem to offer considerable value to answering questions about the impact of the digital revolution. Off the top of my head, something like John Von Neumann’s papers comes to mind. To help spark potential researchers’ imaginations, do you have any thoughts on particular Library of Congress collections that might be ripe for this call?
Jason: This is a great point. The Library of Congress has 35 million books, millions of manuscripts, moving images, sound recordings, digital collections, journals, newspapers, oral histories, the general humanities collections, the Law Library collections, the records of the U.S. Copyright Office, the holdings of the Science, Business and Technology Division, the writings of 20th and 21st century writers and thinkers… depending on the research question proposed, any number of these collections could be appropriate. The sky is really the limit.
Trevor: The Library of Congress has a sizable collection of video games at the Packard Campus A/V Conservation facility. Would proposals that focused on studying this collection of video games and related materials be relevant to this fellowship? Assuming they are, what would a proposal to study these materials need to establish to be compelling?
Jason: That’s a great idea. The effect of video games and online simulation on the cultural and societal norms that shape our lives has been seismic. The video game collection could certainly support a proposal; the proposal should indicate how these collections would inform the larger research question.
Trevor: Are there any key final words or thoughts that you want to stress about the program?
Jason: This is a unique moment for the Kluge Center and the Library of Congress. We have an opportunity to step back and ask important questions about ourselves and how we relate to one another in this new digital world. The insights from the scholars and practitioners we bring to Washington will open numerous possibilities for programs, symposia, seminars and more, to explore with policymakers and the public what the digital revolution means to us and future generations.
This fellowship is just the start. We hope people across the world will join us—including The Signal—and those interested should subscribe to our RSS feed on our home page, as well as check out the Digital Studies Fellowship page on our website. Thanks for letting us share this announcement with your readers, and we hope that some of them will apply!