Of late it seems that almost every project I have been called to work on involves some aspect of “Big Data.” I have been challenged in the past on whether libraries actually have big data, because we don’t as a general rule collect social science or scientific datasets. But I feel strongly that our digital collections (texts, images, GIS, legislative documents, web archives, etc.) can be considered data in addition to being cultural artifacts. I first talked about this in a post on this blog in October 2011.

"Big Data Can Generate Big Brainstorms" from Flickr user Kevin Krejci, http://www.flickr.com/photos/kevinkrejci/6259499293/
I have had conversations recently with colleagues from many organizations about their collections, and some have told me that they do not have big data because they do not have datasets. Some have said that they do not have big data because they do not have large-scale collections or massive observational files. So it seems that we need to define not only the “data” in big data but also the “big.”
Big can most definitely mean small files, but a lot of them.
And they do not have to be seemingly exotic formats, like FITS files for astronomical images or HDF for earth science data. They can be Excel files, which California Digital Library and Microsoft Research are collaborating to preserve. PDFs are perhaps the most common file format in journal publishing; PDF/A became a formal standard meant for preservation in 2005. There are also HTML files in web archives. And TIFF or JP2 page image files from digitized books and newspapers. Our institutions have hundreds of thousands, millions, or even billions of those types of files.
I can give some examples. When working on the first phase of a publication archiving project, we received a relatively small content delivery: 100 GB. That delivery contained 1.3 million files. Or consider the Library of Congress web archives, which currently comprise over 6 billion files. All of the files mentioned are quite small and in quite common formats. And yet, in the aggregate, they have research value and are Big Data.
As I was writing this post, a tweet came across my twitterstream pointing to this article:
danah boyd & Kate Crawford (2012): CRITICAL QUESTIONS FOR BIG DATA, Information, Communication & Society, DOI:10.1080/1369118X.2012.678878
As much as I was tempted to delete all my text and have my entire post instead just say “Read This Article,” I showed some restraint. I still feel that I need to say that there are many definitions of what constitutes big data, that cultural organizations have big data in every possible definition of the phrase, and that we need to decide how we are going to steward and provide access to our collections as data. But definitely read their article for even more on what big data is and isn’t.
Page edited by Paul Wheatley
View Online Paul Wheatley 2012-05-17T13:01:03Z
Page edited by Paul Wheatley
View Online Paul Wheatley 2012-05-17T12:51:08Z
The Open Planets Foundation has been steadily collecting the practical experiences of those working to solve concrete digital preservation and digital curation challenges. This began with a structure developed during the AQuA Project, where we captured information on the OPF wiki consisting of Datasets, preservation Issues with those Datasets and Solutions to the Issues. Subsequent events and projects built on the approach and captured more information. It's also been adopted by the SCAPE Project, who are developing a variety of new preservation solutions.
As well as sharing information about prototypes and working solutions for others to pick up and use, it also captures information about what not to do. Perhaps where a particular tool was applied to a problem, but the result didn't work out. Perhaps to capture lessons learnt in approaching an arduous preservation task.
The other useful focus of this information is in the capture and sharing of the requirements of practitioners. Many digital preservation developments have been driven by those in an excellent position to come up with solutions, but without as much knowledge of the real digital preservation needs on the ground. As we've collated requirements from practitioners over the last 14 months, it's been evident that many of the real needs are associated with quite simple questions. What is this digital content I've got? What is the content about? Which bits shall I keep? What are the preservation risks here? The capability to apply basic assessment, appraisal and characterisation is not yet there. For many of the practitioners we've spoken to, preservation planning, migration and emulation are not the immediate need.
One of the aims of the SPRUCE Project is to provide better community support for our digital preservation practitioners and developers. The first significant output can be seen here:
Digital Preservation and Data Curation Requirements and Solutions
We've collated all the information about preservation requirements and solutions in one place, and made it easy to browse, search and navigate through.
Please tell us if this is a useful resource. If you'd like to contribute, sign up for the OPF wiki and start adding new comments, requirements and solutions. If you have a particular preservation problem you need solving, maybe others in the community have the same need. Maybe the community can find a solution for you...
Preservation Topics: SPRUCE
Page edited by Becky McGuinness
View Online Becky McGuinness 2012-05-17T07:47:18Z
Selection–what to keep, how to keep it, and how long to keep it–quickly comes up in connection with stewardship of digital content.
Consider two prevalent concepts at opposite extremes. One holds that we are failing to save enough digital content, a position taken in a recent article in the Economist, History flushed: The digital age promised vast libraries, but they remain incomplete. The other concept, perhaps in reaction to the first, is that organizations need to save every scrap of data because it’s impossible to predict what will have value down the road. David Rosenthal explores this idea in Lets Just Keep Everything Forever In The Cloud.

eternal impermanence, by Squant, on Flickr
If we attempt to look past whether we are saving too little or too much content, there is yet another selection issue that comes into play: the degree to which preserved content changes through migration, or even is lost as a result of system failure. Henry Newman notes that librarians and archivists discuss preservation in terms of data loss or no data loss in spite of the fact that “100% data reliability is impossible given the cost for large archives” (link here, PDF).
These are knotty issues that will take some time to settle. Yet I found myself thinking about them while reading something completely removed from the subject of digital stewardship. The Unbearable Impermanence of Things: Reflections on Buddhism, Cultural Memory and Heritage Conservation, a chapter in the Routledge Handbook of Heritage in Asia, has some fascinating observations on conservation and the impermanence of cultural heritage. Impermanence in this case is framed as both how physical objects transform over time and how cultures modify their interpretation of those objects.
The basic point is that heritage materials inevitably change and that heritage conservation involves dealing with that change. Objects change in all kinds of ways, from acquiring a fine patina to outright loss or destruction. The author notes that iconoclasm–the smashing of cultural objects–is a “selective process through which memory achieves social and cultural definition.” In the case of the two giant Buddha statues dynamited in 2001 by the Taliban in Afghanistan’s Bamiyan Valley, the act of erasure is clearly evident–it’s even “indefinitely replicated as a memorial image” via YouTube.
The author declares that all heritage remnants are fragments that can at best refer to an absent totality. Alterations, breakages and mistakes associated with a heritage object demonstrate its historicity and “existence in time within the society that created it.” Historical objects also have a tendency to accumulate layers of additional meaning, some of which can be radically different from what an original steward had in mind.
I know the comparison of physical objects to digital collections can only be taken so far. There are fundamental differences, including the fact that the former is rooted in material manifestation and the latter is literally disembodied. Nevertheless, I take some comfort in imagining that all the many challenges and complexities associated with digital preservation are subsumed in the same impermanence as the rest of the world.
Page edited by Paul Wheatley
View Online Paul Wheatley 2012-05-16T15:10:39Z
Page edited by Paul Wheatley
View Online Paul Wheatley 2012-05-16T15:10:06Z
Page edited by Paul Wheatley
View Online Paul Wheatley 2012-05-16T15:07:32Z
Page edited by Paul Wheatley
View Online Paul Wheatley 2012-05-16T15:05:52Z
Page edited by Paul Wheatley
View Online Paul Wheatley 2012-05-16T15:04:38Z
Page edited by Paul Wheatley
View Online Paul Wheatley 2012-05-16T15:03:58Z
Page edited by Paul Wheatley
View Online Paul Wheatley 2012-05-16T15:00:21Z
Page edited by Paul Wheatley
View Online Paul Wheatley 2012-05-16T14:59:06Z
Page edited by Paul Wheatley
View Online Paul Wheatley 2012-05-16T13:29:34Z
Page edited by Becky McGuinness
View Online Becky McGuinness 2012-05-16T08:29:25Z