It seems I have been working on this for a long time, and the direction has moved a number of times along the way. I may still end up with several versions...
Now, the new goal is an Audio Migration+QA Taverna workflow built from a number of Hadoop jobs. The workflow will migrate a large number of mp3 files using ffmpeg, perform a content comparison of the original and migrated files using xcorrSound waveform-compare, and thereby complete the audio part of checkpoint CP082 "Implementation of QA workflows 03" (PC.WP.3) (M36).
Thus version 1 (simple audio QA) will be a Taverna workflow including three Hadoop jobs: ffmpeg migration, mpg321 conversion, and waveform-compare on file lists/directories. Version 2 would add ffprobe property extraction and comparison. I have changed my mind a number of times about which input/output fits the tools / Taverna / Hadoop best. For now, the input to the Taverna workflow is a file containing a list of paths to the mp3 files to migrate, plus an output path (plus the number of files per task). This is also the input to the ffmpeg migration Hadoop job and to the mpg321 conversion job. The output from each of these jobs is a list of paths to the resulting wav files (the output directory will also contain logs). The two lists of paths, to the ffmpeg-migrated wavs and to the mpg321-converted wavs, are then combined in Taverna into a list of pairs of wav paths, which is the input to the xcorrSound waveform-compare Hadoop job.
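The combining step that Taverna has to do is simple enough to sketch in a few lines of Python. This is only an illustration of the pairing, not actual workflow code, and the paths are made up:

```python
# Sketch of the combining step: zip the ffmpeg-migrated wav paths and the
# mpg321-converted wav paths into pairs for waveform-compare.
# Assumes both migration jobs emit one wav path per input mp3, in the same order.

def pair_wavs(ffmpeg_wavs, mpg321_wavs):
    """Combine the two output lists into (ffmpeg_wav, mpg321_wav) pairs."""
    if len(ffmpeg_wavs) != len(mpg321_wavs):
        raise ValueError("the two migration outputs do not line up")
    return list(zip(ffmpeg_wavs, mpg321_wavs))

# Hypothetical paths, one pair per original mp3:
pairs = pair_wavs(
    ["/out/ffmpeg/a.wav", "/out/ffmpeg/b.wav"],
    ["/out/mpg321/a.wav", "/out/mpg321/b.wav"],
)
# Each pair becomes the input to one waveform-compare task.
```

The length check matters: if one of the jobs silently drops a file, comparing misaligned pairs would produce misleading QA results.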
The good thing about Taverna is the nice illustrations ;-) The Taverna part is hopefully fairly straightforward, so I'll leave it for last. First I want to get the three Hadoop jobs running. My trouble seems to be reading and writing files... Some of it may be caused by testing on a local one-machine cluster, which probably has some quirks! Right now I am wrapping the ffmpeg migration as a Hadoop job. The tool can read local nfs files. It also seems able to write local files, though it does so while leaving a "Permission denied" message in the log file?!? The Hadoop job can write the job result to an hdfs directory specified as part of the input. The Hadoop mapper, however, cannot write to that same directory, though it can write the logs to a different hdfs directory. So I now have the output distributed over three different locations just to make things work... and I do not yet have these settings in a nice configuration!
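To make the tool-wrapping idea concrete, here is a minimal sketch of this kind of mapper, written as a Hadoop Streaming script in Python. The output directory, the log naming, and the ffmpeg invocation are all assumptions for illustration, not the actual job:

```python
#!/usr/bin/env python
# Sketch of a Hadoop Streaming mapper wrapping ffmpeg for mp3 -> wav
# migration. Assumptions (for illustration only): input lines are absolute
# paths to mp3 files on an nfs mount, OUTPUT_DIR is on the same mount, and
# ffmpeg's stderr is kept next to the wav as a simple preservation event log.
import os
import subprocess
import sys

OUTPUT_DIR = "/mnt/nfs/migration-output"  # hypothetical mount point

def output_paths(mp3_path, out_dir):
    """Derive the wav path and the log path from the input mp3 path."""
    base = os.path.splitext(os.path.basename(mp3_path))[0]
    return (os.path.join(out_dir, base + ".wav"),
            os.path.join(out_dir, base + ".ffmpeg.log"))

def migrate(mp3_path, out_dir=OUTPUT_DIR):
    """Run ffmpeg on one file; return the path of the migrated wav."""
    wav_path, log_path = output_paths(mp3_path, out_dir)
    with open(log_path, "wb") as log:
        # ffmpeg writes its processing report to stderr.
        subprocess.check_call(["ffmpeg", "-y", "-i", mp3_path, wav_path],
                              stderr=log)
    return wav_path

if __name__ == "__main__":
    # Streaming feeds one input record per line on stdin; emitting the wav
    # path on stdout makes the job output the list of migrated files.
    for line in sys.stdin:
        mp3 = line.strip()
        if mp3:
            print(migrate(mp3))
```

Writing both the wav and the log to the same nfs directory is exactly the consolidation the post is after; whether the mapper is allowed to do that depends on the cluster's mount and permission setup.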
So this raises a number of questions:
- Why am I not using https://github.com/openplanets/tomar? They have probably encountered and solved many of the same issues. The answer for now is that this will be version 3, as it would be nice to compare.
- Where do I want to read data from and write data to? If I have my data in a repository, it is probably not on hdfs. Do I really want to copy data to hdfs for processing and copy results back from hdfs? The command line tools I am using do not understand hdfs, which makes the simple answer no. So I want an nfs-mounted input and output data storage on my cluster. I can then read from this mount and write to this mount. I will probably put the event logs here as well, instead of on hdfs (here the input to the ffmpeg migration is the original mp3 file; the output is the migrated wav file and the "preservation event log" from the tool).
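The waveform-compare job would wrap its tool the same way as the migration jobs. A sketch, assuming the tool takes the two wav paths as command-line arguments and exits non-zero when the waveforms do not match (the real xcorrSound interface, including any similarity-threshold options, may differ):

```python
# Sketch of the QA step: run xcorrSound waveform-compare on one pair of
# wav files. Assumption: the tool takes the two wav paths as arguments and
# exits non-zero when the waveforms do not match (check the tool's docs
# for the actual interface and output format).
import subprocess

def compare_pair(ffmpeg_wav, mpg321_wav, tool="waveform-compare"):
    """Return True if the tool considers the two wav files equivalent."""
    result = subprocess.run([tool, ffmpeg_wav, mpg321_wav],
                            capture_output=True, text=True)
    return result.returncode == 0
```

The mapper for the QA job would then read one pair of paths per input line and emit the pair together with the comparison verdict.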
This is work in progress... Soon to come: Part II with Taverna diagram :)
Preservation Topics: Migration, SCAPE
The following is a guest post by Heidi Dowding, Resident at the Dumbarton Oaks Research Library in Washington, DC
As part of the National Digital Stewardship Residency program’s biweekly takeover of The Signal, I’m here to talk about my project at Dumbarton Oaks Research Library and Collection. And by the way, if you haven’t already checked out Emily Reynolds’ post on the residency four months in as a primer, go back and read that first. I’ll wait.
OK then, on we go.
My brief history in residence at this unique institution technically started in September, but really the project dates back a little over a year to a digital asset management information gathering survey that was undertaken by staff at Dumbarton Oaks. Concerned with DO’s shrinking digital storage capacity, they were hoping to find out how various departments were handling their digital assets. What they discovered was that, with no central policy guiding digital asset management within the institution, ad hoc practices were overlapping and causing manifold problems.
This is about where my project entered the scene. As part of the first cohort of NDSR residents, I’ve been tasked with identifying an institution-wide solution to digital asset management. This has first involved developing a deep (at times, file-level) understanding of Dumbarton Oaks’ digital holdings. These include the standard fare – image collections, digital books, etc. – but also more specialized content like the multimedia Oral History Project and the GIS Tree Care Inventory. I started my research with an initial survey sent to everyone around the institution, and then undertook interviews and focus groups with key staff in every department.
While I uncovered a lot of nuanced information about user behaviors, institutional needs, and the challenges we currently face, the top-level findings are threefold.
First, relationships within an institution make or break its digital asset management.
This is largely because each department has a different workflow for managing assets, but no department is an island. In interdepartmental collaborations, digital assets are being duplicated and inconsistently named. This is especially apparent in the editorial process at DO, wherein an Area of Study department acts as intermediary between the Publications department and various original authors. Duplicative copies are being saved in various drives around the institution, with very little incentive to clean and organize files once the project has been completed.
In this case, defined policies will aid in the development of interdepartmental collaborations in digital projects. My recommendation of a Digital Asset Management System (DAMS) will also hopefully aid in the deduplication of DO’s digital holdings.
Second, file formats are causing big challenges. Sometimes I even ran into them with my own research.
Other times, these were more treacherous around the institution, being caused by a lack of timely software updates for some of our more specialized systems or by a general proliferation of file formats. A lot of these issues could be addressed by central policy based on the file format action plans discussed by NDSR resident Lee Nilsson. Effective plans should address migration schedules and file format best practices.
Finally, staff need to be more proactive in differentiating between archival digital assets and everyday files.
By archival digital assets, I mean content like images from the ICFA or photographs of the gardens, as distinct from everyday files like word processing documents. Failing to differentiate becomes particularly problematic depending on where items are saved: many of the departmental drives are only backed up monthly, while a bigger institutional drive collectively referred to as ‘the Shared Drive’ is backed up daily. So if everyday items are being stored on a departmental drive, there is a much higher likelihood of data loss, as there is no recent backup copy. Likewise, if archival assets are being put there with no local copy kept until the scheduled backup, really important digital assets could be lost. Finally, this also becomes problematic when digital assets are stored long-term on the Shared Drive: they take up precious space and are not being properly organized and cared for.
My job over the next few months will be to look at potential Digital Asset Management Systems to determine whether a specific tool would assist Dumbarton Oaks’ staff in better managing digital files. I will also be convening a Digital Preservation Working Group to carry on my work after my residency ends in May.
Please check out NDSR at the upcoming ALA Midwinter Digital Preservation Interest Group meeting at 8:30am on Sunday, January 24 in the Pennsylvania Room.
With 60M downloads, BitTorrent's Bundle experiment is paying off
It released a new file format called BitTorrent Bundles in September, which gives movie makers, recording artists, authors, and any other content creator the ability to embed a mini-store inside their work. This was revolutionary because it turned ...
In my work at the Library, one of my larger projects has to do with the acquisition and preservation of eserials. By this I don’t mean access to licensed and hosted eserials, but the acquisition and preservation of eserial article files that come to the Library.
In many ways, this is just like other acquisition streams and workflows: some specifications for the content are identified; electronic transfer mechanisms are put in place; processing includes automated and human actions including inspection, metadata extraction and enrichment, and organization; and files are moved to the appropriate storage locations.
In other ways, though, it is more complicated: these are serials, with a complex organization of files/articles/issues/volumes/titles. There are multiple format, content, and metadata standards in play. Publishers now often have a very frequent article-based publishing model that includes versions and updates. And the packages of files to be transferred between and within organizations can take many forms.
My Library of Congress colleague Erik Delfino reached out to our colleagues at the National Institutes of Health/National Library of Medicine who operate PubMed Central, who deal with similar issues. Out of our shared interest has come a NISO working group to develop a protocol for the transfer and exchange of files called PESC – Protocol for Exchanging Serial Content. This group is co-chaired by the Library of Congress and NIH, and has representatives from publishers small and large, data normalizers and aggregators, preservation organizations, and organizations with an interest in copyright issues.
This group is making great progress identifying the scope of the problem, looking at how a variety of organizations solve the problem for their own operations, and drafting its ideas for solutions for exchange that support the effective management and preservation of serials.
If you are interested in the work, please read the Work Item description at the PESC web site, and check out who’s involved. There will also be a brief update presented as part of the NISO standards session at ALA Midwinter on Sunday, January 26 from 1-2:30 PM in Pennsylvania Convention Center room 118 C.
We hear a constant stream of news about how crunching massive data collections will change everything from soup to nuts. Here on The Signal, it’s fair to say that scientific research data is close to the heart of our hopes, dreams and fears when it comes to big data: we’ve written over two-dozen posts touching on the subject.
In the context of all this, it’s exciting to see some major projects getting underway that have big data stewardship closely entwined with their efforts. Let me provide two examples.
The Registry of Data Repositories seeks to become a global registry of “repositories for the permanent storage and access of data sets” for use by “researchers, funding bodies, publishers and scholarly institutions.” The activity is funded by the German Research Foundation through 2014 and currently has 400 repositories listed. With the express goal of covering the complete data repository landscape, re3data.org has developed a typology of repositories that complements existing information offered by individual institutions. The aim is to offer a “systematic and easy to use” service that will strongly enhance data sharing. Key to this intent is a controlled vocabulary that describes repository characteristics, including policies, legal aspects and technical standards.
In a bow to the current trend for visual informatics, the site also offers a set of icons with variable values that represent repository characteristics. The project sees the icons as helpful to users as well as to assist repositories “identify strengths and weaknesses of their own infrastructures” and keep the information up to date.
I really like this model. It hits the trifecta in appealing to creators who seek to deposit data, to users who seek to find data and to individual repositories who seek to evaluate their characteristics against their peers. It remains to be seen if it will scale and if it can attract ongoing funding, but the approach is elegant and attractive.
The second example is ELIXIR, an initiative of the EMBL European Bioinformatics Institute. ELIXIR aims to “orchestrate the collection, quality control and archiving of large amounts of biological data produced by life science experiments,” and “is creating an infrastructure – a kind of highway system – that integrates research data from all corners of Europe and ensures a seamless service provision that is easily accessible to all.”
This is a huge undertaking, with the support of many nations that are contributing millions of dollars to build a “hub and nodes” network. It will connect public and private bioscience facilities throughout Europe and promote shared responsibility for biological data delivery and management. The intention is to provide a single interface to hundreds of distributed databases and a rich array of bioinformatics analysis tools.
ELIXIR is a clear demonstration of how a well-articulated need can drive massive investment in data management. The project has a well-honed business case that presents an irresistible message. “Biological information is of vital significance to life sciences and biomedical research, which in turn are critical for tackling the Grand Challenges of healthcare for an ageing population, food security, energy diversification and environmental protection,” reads the executive summary. “The collection, curation, storage, archiving, integration and deployment of biomolecular data is an immense challenge that cannot be handled by a single organisation.” This is what the Blue Ribbon Task Force on Sustainable Digital Preservation and Access termed “the compelling value proposition” needed to drive the enduring availability of digital information.
As a curious aside, it’s worth noting that projects such as ELIXIR may have an unexpected collateral impact on data preservation. Ewan Birney, a scientist and administrator working on ELIXIR, was so taken with the challenge of what he termed “a 10,000 year archive” holding a massive data store that he and some colleagues (over a couple of beers, no less) came up with a conjecture for how to store digital data using DNA. The idea was sound enough to merit a letter in Nature, Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. So, drawing the attention of bioinformaticians and other scientists to the digital preservation challenge may well lead to stunning leaps in practices and methods.
Perhaps one day the biggest of big data can even be reduced to the size of a bowl of alphabet soup or a bowl of mixed nuts!