Feed aggregator

Apache PDFBox Error Messages

OPF Wiki Activity Feed - 22 August 2014 - 11:44am

Page edited by Yvonne Friese

View Online Yvonne Friese 2014-08-22T11:44:34Z

Apache PDFBox Error Messages

OPF Wiki Activity Feed - 22 August 2014 - 9:26am

Page edited by Yvonne Friese

View Online Yvonne Friese 2014-08-22T09:26:38Z

2014-09-01 Preserving PDF - identify, validate, repair

OPF Wiki Activity Feed - 22 August 2014 - 9:25am

Page edited by Becky McGuinness

View Online Becky McGuinness 2014-08-22T09:25:36Z

2014-09-01 Preserving PDF - identify, validate, repair

OPF Wiki Activity Feed - 21 August 2014 - 1:53pm

Page edited by Becky McGuinness

View Online Becky McGuinness 2014-08-21T13:53:22Z

When is a PDF not a PDF? Format identification in focus.

Open Planets Foundation Blogs - 21 August 2014 - 10:40am

In this post I'll be taking a look at format identification of PDF files and highlighting a difference of opinion between format identification tools. Some of the details are a little dry, but I'll restrict myself to a single issue and be as light on technical details as possible. I hope to show that, once the technical details are clear, it really boils down to policy and requirements for PDF processing.

Assumptions

I'm considering format identification in its simplest role as first contact with a file about which little, if anything, is known. In these circumstances the aim is to identify the format as quickly and accurately as possible, then pass the file to format-specific tools for deeper analysis.

I'll also restrict the approach to magic number identification rather than trusting the file extension; more on this a little later.

Software and data

I performed the tests using the selected govdocs corpora (that's a large download BTW) that I mentioned in my last post. I chose four format identification tools to test:

  • the fine free file utility (also known simply as file),
  • DROID,
  • FIDO, and
  • Apache Tika.

I used versions that were as up to date as possible, but I'll spare the details until I publish the results in full.

So is this a PDF?

So there was plenty of disagreement between the results from the different tools; I'll be showing these in more detail at our upcoming PDF Event. For now I'll focus on a single issue: there is a set of files that FIDO and DROID don't identify as PDFs but that file and Tika do. I've attached one example to this post. Google Chrome won't open it, but my Ubuntu-based document viewer does. It's a three-page PDF about Rumen Microbiology, and this was obviously the intention of the creator. I've not systematically tested multiple readers yet, but LibreOffice won't open it while Ubuntu's print preview will. Feel free to try the reader of your choice and comment.

What's happening here?

It appears we have a malformed PDF, and that is indeed the case. The issue is caused by a difference in the way the tools go about identifying PDFs in the first place. This is where it gets a little dull, but bear with me. All of these tools use "magic" or "signature" based identification, meaning they look for (hopefully) unique strings of characters at specific positions in the file to work out the format. Here's the Tika 1.5 signature for PDF:

<match value="%PDF-" type="string" offset="0"/>

What this says is: look for the string %PDF- (the value) at the start of the file (offset="0") and, if it's there, identify the file as a PDF. The attached file indeed starts:

%PDF-1.2

meaning it's a PDF version 1.2. Now we can have a look at DROID's signature (from signature file version 77) for PDF 1.2:

<InternalSignature ID="125" Specificity="Specific">
    <ByteSequence Reference="BOFoffset">
        <SubSequence MinFragLength="0" Position="1"
            SubSeqMaxOffset="0" SubSeqMinOffset="0">
            <Sequence>255044462D312E32</Sequence>
            <DefaultShift>9</DefaultShift>
            <Shift Byte="25">8</Shift>
            <Shift Byte="2D">4</Shift>
            <Shift Byte="2E">2</Shift>
            <Shift Byte="31">3</Shift>
            <Shift Byte="32">1</Shift>
            <Shift Byte="44">6</Shift>
            <Shift Byte="46">5</Shift>
            <Shift Byte="50">7</Shift>
        </SubSequence>
    </ByteSequence>
    <ByteSequence Reference="EOFoffset">
        <SubSequence MinFragLength="0" Position="1"
            SubSeqMaxOffset="1024" SubSeqMinOffset="0">
            <Sequence>2525454F46</Sequence>
            <DefaultShift>-6</DefaultShift>
            <Shift Byte="25">-1</Shift>
            <Shift Byte="45">-3</Shift>
            <Shift Byte="46">-5</Shift>
            <Shift Byte="4F">-4</Shift>
        </SubSequence>
    </ByteSequence>
</InternalSignature>

This is a little more complex than Tika's signature, but what it says is that a matching file should start with the string %PDF-1.2 (the hex sequence 255044462D312E32), which our sample does. This is in the first <ByteSequence Reference="BOFoffset"> section, a beginning-of-file offset. Crucially, this signature adds another condition: the file must contain the string %%EOF (hex 2525454F46) within 1024 bytes of the end of the file.

There are two things that are different here. The change in the start condition, i.e. Tika's "%PDF-" vs. DROID's "%PDF-1.2", supports DROID's capability to identify versions of formats. Tika simply detects that a file looks like a PDF, returns the application/pdf MIME type, and has a single signature for the job. DROID can distinguish between versions and so has 29 different signatures for PDF. This is NOT the cause of the problem, though. The disagreement between the results is caused by DROID's requirement for a valid end-of-file marker, %%EOF. A hex search of our PDF confirms that it doesn't contain an %%EOF marker.
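To make the difference concrete, here's a minimal Python sketch of the two matching rules. This is my own illustration, not the tools' actual code, and the function names are invented:

def tika_style_match(path):
    # Tika's rule: the string %PDF- at offset 0 is enough.
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"

def droid_style_match(path):
    # DROID's rule: a version-specific header at offset 0 (shown here
    # for PDF 1.2, one of DROID's 29 PDF signatures), plus %%EOF
    # somewhere within the last 1024 bytes of the file.
    with open(path, "rb") as f:
        header_ok = f.read(8) == b"%PDF-1.2"
        f.seek(0, 2)                      # jump to the end of the file
        f.seek(max(f.tell() - 1024, 0))   # back up at most 1024 bytes
        trailer_ok = b"%%EOF" in f.read()
    return header_ok and trailer_ok

Run against the attached file, tika_style_match should return True while droid_style_match returns False, which is exactly the disagreement in the results above.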

So who's right?

An interesting question. The PDF 1.3 Reference states:

The last line of the file contains only the end-of-file marker, %%EOF. (See implementation note 15 in Appendix H.)

The referenced implementation note (3.4.4, "File Trailer") reads:

15. Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file.

So DROID's signature is indeed to the letter of the law plus amendments. It's really a matter of context when using the tools. Does DROID's signature introduce an element of format validation to the identification process? In a way yes, but understanding what's happening and making an informed decision is what really matters.

What's next?

I'll be putting some more detailed results onto GitHub along with a VM demonstrator. I'll tweet and add a short post when this is finished; it may have to wait until next week.

Preservation Topics: Identification

Attachment: It looks like a PDF to me.... (44.06 KB)
Categories: Planet DigiPres

2014-09-01 Preserving PDF - identify, validate, repair

OPF Wiki Activity Feed - 21 August 2014 - 9:33am

Page edited by Becky McGuinness

View Online Becky McGuinness 2014-08-21T09:33:57Z

Win an e-book reader!

SCAPE Blog Posts - 21 August 2014 - 7:30am

On September 8 the SCAPE/APARSEN workshop Digital Preservation Sustainability on the EU Level will be held at London City University in connection with the DL2014 conference.

The main objective of the workshop is to provide an overview of solutions to challenges within Digital Preservation Sustainability developed by current and past Digital Preservation research projects. The event brings together various EU projects/initiatives to present their solutions and approaches, and to find synergies between them.

Alongside the Digital Preservation Sustainability on the EU Level workshop, SCAPE and APARSEN are launching a competition:

Which message do YOU want to send to the EU for the future of Digital Preservation projects?

You can join the competition on Twitter. Only tweets including the hashtag #DP2EU will be considered for the competition. You may include a link to a text OR one picture with your message. Note, though, that messages containing more than 300 characters in total are excluded from the competition.

The competition will close September 8th at 16:30 UK time. The workshop panel will then choose one of the tweets as a winner. The winner will receive an e-book reader as a prize.

There are only a few places left for the workshop. Registration is FREE and must be completed by filling out the form at http://bit.ly/DPSustainability. Please don’t register for this workshop on the DL2014 registration page, since the workshop is free of charge!

Preservation Topics: SCAPE
Categories: SCAPE


Emulation as a Service (EaaS) at Yale University Library

The Signal: Digital Preservation - 20 August 2014 - 1:35pm

The following is a guest post from Euan Cochrane, Digital Preservation Manager at Yale University Library. This piece continues and extends exploration of the potential of emulation as a service and virtualization platforms.

Increasingly, the intellectual productivity of scholars involves the creation and development of software and software-dependent content. For universities to act as responsible stewards of these materials we need to have a well-formulated approach to how we can make these legacy works of scholarship accessible.

While there have been significant concerns about the practicality of emulation as a mode of access to legacy software, my personal experience (demonstrated via one of my first websites, about Amiga emulation) has always been contrary to that view. It is with great pleasure that I can now illustrate the practical utility of Emulation as a Service via three recent case studies from my work at Yale University Library. Consideration of an interactive artwork from 1997, interactive Hebrew texts from a 2004 CD-ROM and finance data from 1998 illustrates that it's no longer really a question of whether emulation is a viable option for access and preservation, but of how we can go about scaling up these efforts and removing any remaining obstacles to their successful implementation.

At Yale University Library we are conducting a research pilot of the bwFLA Emulation as a Service software framework.  This framework greatly simplifies the use of emulators and virtualization tools in a wide range of contexts by abstracting all of the emulator configuration (and its associated issues) away from the end-user. As well as simplifying use of emulators it also simplifies access to emulated environments by providing the ability to access and interact with emulated environments from right within your web browser, something that we could only dream of just a few years ago.

At Yale University Library we are evaluating the software against a number of criteria including:

  1. In what use-cases might it be used?
  2. How might it fit in with digital content workflows?
  3. What challenges does it present?

The EaaS software framework shows great promise as a tool for use in many digital content management workflows such as appraisal/selection, preservation and access, but also presents a few unique and particularly challenging issues that we are working to overcome.  The issues are mostly related to copyright and software licensing.  At the bottom of this post I will discuss what these issues are and what we are doing to resolve them, but before I do that let me put this in context by discussing some real-life use-cases for EaaS that have occurred here recently.

It has taken a few months (I started in my position at the Library in September 2013), but recently people throughout the Library system have begun to forward queries to me whenever they involve anything digital preservation related. Over the past month or so we have had three requests for access to digital content from the general collections that couldn't be interacted with using contemporary software. These requests are all great candidates for resolving using EaaS but, unfortunately (as you will see), we couldn't use it.

Screenshot of Puppet Motel running in the emulation service using the Basilisk II emulator.

Interactive Artwork, Circa 1997: Use Case One

An Arts PhD student wanted to access an interactive CD-ROM-based artwork (Laurie Anderson’s “Puppet Motel”) from the general collections. The artwork can only be interacted with on old versions of the Apple Mac “classic” operating system.

Fortunately the Digital Humanities Librarian (Peter Leonard) has a collection of old technology and was willing to bring a laptop from his personal collection into the library for the PhD student to use. This was not an ideal or sustainable solution (what would have happened if Peter’s collection wasn’t available? What happens when that hardware degrades past usability?).

Since responding to this request we have managed to get the Puppet Motel running in the emulation service using the Basilisk II emulator (for research purposes).

This would be a great candidate for accessing via the emulation service. The sound and interaction aspects all work well and it is otherwise very challenging for researchers to access the content.

Screenshot of the virtual machine used to access a CD-ROM that wouldn’t play in a current OS.

Hebrew Texts, Circa 2004: Use Case Two

One of the Judaica librarians needed to access data for a patron, and the data was on a Windows XP CD-ROM (Trope Trainer) from the general collections. The software on the CD would not run on the current Windows 7 operating system that is installed on the desktop PCs here in the library.

The solution we came up with was to create a Windows XP virtual machine for the librarian to have on her desktop. This is a good solution for her as it enables her to print the sections she wants to print and export pdfs for printing elsewhere as needed.

We have since ingested this content into the emulation service for testing purposes. In EaaS it can run on either VirtualBox, Oracle's virtualization software (which doesn't provide full emulation), or QEMU, an emulation and virtualization tool.

It is another great candidate for the service as this version of the content can no longer be accessed on contemporary operating systems and the emulated version enables users to play through the texts and hear them read just as though they were using the CD on their local machine. The ability to easily export content from the emulation service will be added in a future update and will enable this content to become even more useful.

Accessing legacy finance data through a Windows 98 Virtual Machine.

Finance Data, Circa 1998/2003: Use Case Three

A Finance PhD student needed access to data (inter-corporate ownership data) trapped within software on a CD-ROM from the general collection. Unfortunately the software was designed for Windows 98: “As part of my current project I need to use StatCan data saved using some sort of proprietary software on a CD. Unfortunately this software seemed not to be compatible with my version of Windows.” He had been able to get the data off the disc but couldn’t make any real sense of it without the software: “it was all just random numbers.”

We have recently been developing a collection of old hardware at the Library to support long-term preservation of digital content. Coincidentally, and fortunately, the previous day someone had donated a Windows 98 laptop. Using that laptop we were able to ascertain that the CD hadn’t degraded and the software still worked.  A Windows 98 virtual machine was then created for the student to use to extract the data. Exporting the data to the host system was a challenge. The simplest solution turned out to be having the researcher email the data to himself from within the virtual machine via Gmail using an old web browser (Firefox 2.x).

We were also able to ingest the virtual machine into the emulation service where it can run on either VirtualBox or QEMU.

This is another great candidate for the emulation service. The data is clearly of value but cannot be properly accessed without using the original custom software which only runs on older versions of the Microsoft Windows operating system.

Other uses of the service

In exploring these predictable use-cases for the service, we have also discovered some less-expected scenarios in which the service offers some interesting potential applications. For example, the EaaS framework makes it trivially easy to set up custom environments for patrons. These custom environments take up little space as they are stored as a difference from a base-environment, and they have a unique identifier that can persist over time (or not, as needed).  Such custom environments may be a great way for providing access to sets of restricted data that we are unable to allow patrons to download to their own computers. Being able to quickly configure a Windows 7 virtual machine with some restricted content included in it (and appropriate software for interacting with that content, e.g., an MS Outlook PST archive file with MS Outlook), and provide access to it in this restricted online context, opens entirely new workflows for our archival and special collections staff.
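bwFLA's storage layer isn't described here, but copy-on-write disk images implement the same difference-from-base idea. Here's an illustrative sketch using QEMU's qcow2 backing files, in Python; the image names are invented and qemu-img is assumed to be installed:

import subprocess
import uuid

# A patron-specific environment derived from a shared base image.
# The overlay records only the blocks the patron changes, so it
# stays small; the UUID gives it an identifier that can persist
# over time (or be discarded, as needed).
env_id = str(uuid.uuid4())
overlay = "patron-" + env_id + ".qcow2"
subprocess.run(
    ["qemu-img", "create", "-f", "qcow2",
     "-b", "windows7-base.qcow2", overlay],
    check=True,
)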

Why we couldn’t use bwFLA’s EaaS

In all three of the use-cases outlined above EaaS was not used as the solution for the end-user. There were two main reasons for this:

  1. We are only in possession of a limited number of physical operating system and application licenses for these older systems. While there is some capacity to use downgrade rights within the University’s volume licensing agreement with Microsoft, with Apple operating systems the situation is much less clear. As a result we are being conservative in our use of the service until we can resolve these issues.
  2. It is not always clear in the license of old software whether this use-case is allowed. Virtualization is rarely (if ever) mentioned in the license agreements. This is likely because it wasn’t very common during the period when much of the software we are dealing with was created. We are working to clarify this point with the General Counsel at Yale and will be discussing it with the software vendors.

Addressing the software licensing challenges

As things stand we are limited in our ability to provide access to EaaS due to licensing agreements (and other legal restrictions) that still apply to the content-supporting operating system and productivity software dependencies. A lot of these dependencies that are necessary for providing access to valuable historic digital content do not have a high economic value themselves.  While this will likely change over time as the value of these dependencies becomes more recognized and the software more rare, it does make for a frustrating situation.  To address this we are beginning to explore options with the software vendors and will be continuing to do this over the following months and years.

We are very interested in the opportunities EaaS offers for opening access to otherwise inaccessible digital assets. There are many use-cases in which emulation is the only viable approach for preserving access to this content over the long term. Because of this, anything that prevents the use of such services will ultimately lead to the loss of access to valuable and historic digital content, which will effectively mean the loss of that content. Without engagement from software vendors and licensing bodies, it may require a change in the law to ensure that this content is not lost forever.

It is our hope that the software vendors will be willing to work with us to save our valuable historic digital assets from becoming permanently inaccessible and lost to future generations. There are definitely good reasons to believe that they will, and so far, those we have contacted have been more than willing to work with us.

Categories: Planet DigiPres

Win an e-book reader

OPF Wiki Activity Feed - 20 August 2014 - 8:46am

Page edited by Jette Junge

View Online | Add Comment Jette Junge 2014-08-20T08:46:39Z


Digital Preservation Sustainability on the EU Policy Level

OPF Wiki Activity Feed - 20 August 2014 - 8:29am

Page edited by Jette Junge

View Online | Add Comment Jette Junge 2014-08-20T08:29:28Z


Win an e-book reader > ebook reader.jpg

OPF Wiki Activity Feed - 20 August 2014 - 8:26am

File attached by Jette Junge

JPEG File ebook reader.jpg (114 kB)

View Attachments Jette Junge 2014-08-20T08:26:02Z


Digital Preservation Sustainability on the EU Policy Level

OPF Wiki Activity Feed - 19 August 2014 - 11:46am

Page edited by Jette Junge

View Online | Add Comment Jette Junge 2014-08-19T11:46:43Z


Curating Extragalactic Distances: An interview with Karl Nilsen & Robin Dasler

The Signal: Digital Preservation - 18 August 2014 - 4:54pm

Screenshot of Extragalactic Distance Database Homepage.

While a fair amount of digital preservation focuses on objects that have clear counterparts in our analog world (still and moving images and documents, for example), there are a range of forms that are natively digital. Completely native digital forms, like database-driven web applications, introduce a variety of challenges for long-term preservation and access. I’m thrilled to discuss just such a form with Karl Nilsen and Robin Dasler from the University of Maryland, College Park. Karl is the Research Data Librarian, and Robin is the Engineering/Research Data Librarian. Karl and Robin spoke on their work to ensure long-term access to the Extragalactic Distance Database at the Digital Preservation 2014 conference.

Trevor: Could you tell us a bit about the Extragalactic Distance Database? What is it? How does it work? Who does it matter to today and who might make use of it in the long term?

Representation of the Extragalactic distance ladder from Wikimedia Commons.

Karl and Robin: The Extragalactic Distance Database contains information that can be used to determine distances between galaxies. For a limited number of nearby galaxies, the distances can be measured directly with a few measurements, but for galaxies beyond these, astronomers have to correlate and calibrate data points obtained from multiple measurements. The procedure is called a distance ladder. From a data curation perspective, the basic task is to collect and organize measurements in such a way that researchers can rapidly collate data points that are relevant to the galaxy or galaxies of interest.

The EDD was constructed by a group of astronomers at various institutions over a period of about a decade and is currently deployed on a server at the Institute for Astronomy at the University of Hawaii. It’s a continuously (though irregularly) updated, actively used database. The technology stack is Linux, Apache, MySQL and PHP. It also has an associated file system that contains FITS files and miscellaneous data and image files. The total system is approximately 500GB.

Extragalactic Distance Database Result table.

The literature mentioning extragalactic or cosmic distance runs to thousands of papers in Google Scholar, and over one hundred papers have appeared with 2014 publication dates. Explicit references to the EDD appear in twelve papers with 2014 publication dates and a little more than seventy papers published before 2014. We understand that some astronomers use the EDD for research that is not directly related to distances simply because of the variety of data compiled into the database. Future use is difficult to predict, but we view the EDD as a useful reference resource in an active field. That being said, some of the data in the EDD will likely become obsolete as new instruments and techniques facilitate more accurate distances, so a curation strategy could include a reappraisal and retirement plan.

Our agreement with the astronomers has two parts. In the first part, we’ll create a replica of the EDD at our institution that can serve as a geographically distinct backup for the system in Hawaii. We’re using rsync for transfer. Our copy will also serve as a test case for digital curation and preservation research. In this period, the copy in Hawaii will continue to be the database-of-record. In the second part, our copy may become the database-of-record, with responsibility for long-term stewardship passing more fully to the University of Maryland Libraries. In general, this project gives us an opportunity to develop and fine-tune curation processes, procedures, policies and skills with the goal of expanding the Libraries’ capacity to support complex digital curation and preservation projects.
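As a rough illustration of that replication step, an rsync invocation along these lines would do the mirroring. The host and paths are invented; the interview doesn't give the actual command:

import subprocess

# Mirror the EDD file system from the server of record to the replica.
# -a preserves permissions and timestamps, -z compresses in transit,
# and --partial lets interrupted transfers of large FITS files resume
# rather than restart from scratch.
subprocess.run(
    ["rsync", "-az", "--partial",
     "edd.example.edu:/srv/edd/", "/data/edd-replica/"],
    check=True,
)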

Trevor: How did you get involved with the database? Did the astronomers come to you or did you all go to them?

Karl and Robin: One of the leaders of the EDD project is a faculty member at the University of Maryland and he contacted us. We’re librarians on the Research Data Services team and we assist faculty and graduate students with all aspects of data management, curation, publishing and preservation. As a new program in the University Libraries, we actively seek and cultivate opportunities to carry out research and development projects that will let us explore different data curation strategies and practices. In early 2013 we included a brief overview of our interests and capabilities in a newsletter for faculty, and that outreach effort led to an inquiry from the faculty member.

We occasionally hear from other faculty members who have developed or would like to develop databases and web applications as a part of their research, so we expect to encounter similar projects in the future. For that reason, we felt that it was important to initiate a project that involves a database. The opportunities and challenges that arise in the course of this project will inform the development of our services and infrastructure, and ultimately, shape how we support faculty and students on our campus.

Trevor: When you started in on this, were there any other particularly important database preservation projects, reports or papers that you looked at to inform your approach? If so, I’d appreciate hearing what you think the takeaways are from related work in the field and how you see your approach fitting into the existing body of work.

Karl and Robin: Yes, we have been looking at work on database preservation as well as work on curating and preserving complex objects. We’re fortunate that there has been a considerable amount of research and development on database preservation and there is a body of literature available. As a starting point, readers may wish to review:

Some of the database preservation efforts have produced software for digital preservation. For example, readers may wish to look at SIARD (Software Independent Archiving of Relational Databases) or the Database Preservation Toolkit. In general, these tools transform the database content into a non-proprietary format such as XML. However, there are quite a few complexities and trade-offs involved. For example, database management systems provide a wide range of functionality and a high level of performance that may be lost or not easily reconstructed after such transformations. Moreover, these preservation tools may involve dependencies that seem trivial now but could introduce significant challenges in the future. We’re interested in these kinds of tools and we hope to experiment with them, but we recognize that heavily transforming a system for the sake of preservation may not be optimal. So we’re open to experimenting with other strategies for longevity, such as emulation or simply migrating the system to state-of-the-art databases and applications.
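To show the general idea behind such tools, and only the idea (SIARD's actual profile is far richer), here's a self-contained sketch that serializes a relational table to XML, using SQLite and a toy table so it runs anywhere:

import sqlite3
import xml.etree.ElementTree as ET

# Build a toy table; a real tool would read the live database instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE distances (galaxy TEXT, mpc REAL)")
conn.execute("INSERT INTO distances VALUES ('NGC 224', 0.78)")

# Serialize every row to format-neutral XML.
root = ET.Element("table", name="distances")
cur = conn.execute("SELECT * FROM distances")
cols = [d[0] for d in cur.description]
for row in cur:
    rec = ET.SubElement(root, "row")
    for col, val in zip(cols, row):
        ET.SubElement(rec, col).text = str(val)
print(ET.tostring(root, encoding="unicode"))

What this deliberately loses (keys, indexes, views, query performance) is precisely the functionality trade-off described above.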

Trevor:  Having a fixed thing to preserve makes things a lot easier to manage, but the database you are working with is being continuously updated. How are you approaching that challenge? Are you taking snapshots of it? Managing some kind of version control system? Or something else entirely? I would also be interested in hearing a bit about what options you considered in this area and how you made your decision on your approach.

Karl and Robin: We haven’t made a decision about versioning or version control, but it’s obviously an important policy matter. At this stage, the file system is not a major concern because we expect incremental additions that don’t modify existing files. The MySQL database is another story. If we preserve copies of the database as binary objects, we face the challenge of proliferating versions. That being said, it may not be necessary to preserve a complete history of versions. Readers may be interested to know that we investigated Git for transfer and version control, but discovered that it’s not recommended for large binary files.
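One conventional compromise, sketched here rather than anything the team has settled on, is periodic logical dumps: dated text snapshots compress well and can be meaningfully compared, unlike raw binary database files. The database name is invented:

import datetime
import gzip
import subprocess

# --single-transaction takes a consistent snapshot of the tables
# without locking the live database.
stamp = datetime.date.today().isoformat()
dump = subprocess.run(
    ["mysqldump", "--single-transaction", "edd"],
    check=True, capture_output=True,
).stdout
with gzip.open("edd-" + stamp + ".sql.gz", "wb") as f:
    f.write(dump)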

Trevor: How has your idea of database preservation changed and evolved by working through this project? Are there any assumptions you had upfront that have been challenged?

Karl and Robin: Working with the EDD has forced us to think more about the relationship between preservation and use. The intellectual value of a data collection such as the EDD is as much in the application (joins, conditions, grouping) as in the discrete tables. Our curation and preservation strategy will have to take this fact into account. We expect that data curators, librarians and archivists will increasingly face the difficult task of preservation planning, policy development and workflow design in cases where sustaining the value of data and the viability of knowledge production depends on sustaining access to data, code and other materials as a system. We’re interested to hear from other librarians, archivists and information scientists who are thinking about this problem.

Trevor: Based on this experience, is there a checklist or key questions for librarians or archivists to think through in devising approaches to ensuring long term access to databases?

Karl and Robin: At the outset, the questions that have to be addressed in database preservation are identical to the questions that have to be addressed in any digital preservation project. These have to do with data value, future uses, project goals, sustainability, ownership and intellectual property, ethical issues, documentation and metadata, data quality, technology issues and so on. A couple of helpful resources to consult are:

Databases may complicate these questions or introduce unexpected issues. For example, if the database was constructed from multiple data sources by multiple researchers, which is not unusual, the relevant documentation and metadata may be difficult to compile and the intellectual property issues may be somewhat complicated.

Trevor: Why are the libraries at UMD the place to do this kind of curation and preservation? In many cases scientists have their own data managers, and I imagine there are contributions to this project from researchers at other universities. So what is it that makes UMD the place to do it and how does doing this kind of activity fit into the mission of the university and the libraries in particular?

Karl and Robin: While there are well-funded research projects that employ data managers or dedicated IT specialists, there are far more scientists and scholars who have little or no data management support. The cost of employing a data manager, even part-time, is too great for most researchers and often too great for most collaborations. In addition, while the IT departments at universities provide data storage services and web servers, they are not usually in the business of providing curatorial expertise, publishing infrastructure and long-term preservation and access. Further, while individual researchers recognize the importance of data management to their productivity and impact, surveys show that they have relatively little time available for data curation and preservation. There is also a deficit of expertise in general, though some researchers possess sophisticated data management skills.

Like many academic libraries, the UMD Libraries recognize the importance of data management and curation to the progress of knowledge production, the growth of open science and the success of our faculty and students. We also believe that library and archival science provide foundational principles and sets of practices that can be applied to support these activities. The Research Data Services program is a strategic priority for the University of Maryland Libraries and is highly aligned with the Libraries’ mission to accelerate and support research, scholarship and creativity. We have a cross-functional, interdisciplinary team in the Libraries (made up of subject specialists and digital curation specialists as needed) and partners across the campus, so we can bring a range of perspectives and skills to bear on a particular data curation project. This diversity is, in our view, essential to solving complex data curation and preservation problems.

We have to acknowledge that our work on the EDD involves a number of people in the Libraries. In particular, Jennie Levine Knies, Trevor Muñoz and Ben Wallberg, as well as University of Maryland iSchool students Marlin Olivier and, formerly, Sarah Hovde, have made important contributions to this project.

Categories: Planet DigiPres

SCAPE Software Projects

OPF Wiki Activity Feed - 18 August 2014 - 4:27pm

Page edited by Hélder Silva

View Online | Add Comment Hélder Silva 2014-08-18T16:27:39Z
