I had the distinct pleasure of moderating the opening plenary session of the Joint Annual Meeting of COSA, NAGARA and SAA in Washington D.C. in early August. The panel was on the “state of access,” and I shared the dais with David Cuillier, an Associate Professor and Director of the University of Arizona School of Journalism, as well as the president of the Society of Professional Journalists; and Miriam Nisbet, the Director of the Office of Government Information Services at the National Archives and Records Administration.
The panel was a great opportunity to tease out the spaces between the politics of “open government” and the technologies of “open data,” but our time was much too short and we had to end just when the panelists were beginning to get to the juicy stuff.
There were so many more places we could have taken the conversation:
- Is our government “transparent enough”? Do we get the “open government” we deserve as (sometimes ill-informed) citizens?
- What is the role of outside organizations in providing enhanced access to government data?
- What are the potential benefits of reducing the federal government role in making data available?
- Is there the right balance between voluntary information openness and the need for the Freedom of Information Act?
- What are the job opportunities for archivists and records managers in the new “open information” environment?
- Have you seen positive moves towards addressing digital preservation and stewardship issues regarding government information?
I must admit that when I think of “access” and “open information” I’m thinking almost exclusively about digital data because that’s the sandbox I play in. At past SAA conferences I’ve had the feeling that the discussion of digital preservation and stewardship issues was something that happened in the margins. At this year’s meeting those issues definitely moved to the center of the conversation.
Just look at this list of sessions running concurrently during a single hour on Thursday August 14, merely the tip of the iceberg:
- Getting Things Done with Born-Digital Collections
- Spreading the Word: Access to Oral History Collections in the Digital Age
- Editathon: You Have One Hour to Increase Access to Archival Science Info on Wikipedia…Go!
- Ethics, Provenance, Metadata: Trust and Recordkeeping in the Cloud?
There were also a large number of web archiving-related presentations and panels including the SAA Web Archiving Roundtable meeting (with highlights of the upcoming NDSA Web Archiving Survey report), the Archive-It meetup and very full panels Friday and Saturday.
I was also pleased to see that the work of NDIIPP and the National Digital Stewardship Alliance was getting recognized and used by many of the presenters. There were numerous references to the 2014 National Agenda for Digital Stewardship and the Levels of Preservation work and many NDSA members presenting and in the audience. You’ll find lots more on the digital happenings at SAA on the #SAA14 Twitter stream.
The increased focus on digital is great news for the archival profession. Digital stewardship is an issue where our expertise can really be put to good use and where we can have a profound impact. Younger practitioners have recognized this for years and it’s great that the profession itself is finally getting around to it.
It is well-known that PDF documents can contain features that are preservation risks (e.g. see here and here). Migration of existing PDFs to PDF/A is sometimes advocated as a strategy for mitigating these risks. However, the benefits of this approach are often questionable, and the migration process can also be quite risky in itself. As I often get questions on this subject, I thought it might be worthwhile to do a short write-up on this.
PDF/A is a profile
First, it's important to stress that each of the PDF/A standards (A-1, A-2 and A-3) is really just a profile within the PDF format. More specifically, PDF/A-1 is a subset of PDF 1.4, whereas PDF/A-2 and PDF/A-3 are based on the ISO 32000 version of PDF 1.7. What these profiles have in common is that they prohibit some features (e.g. multimedia, encryption, interactive content) that are allowed in 'regular' PDF. They also narrow down the way other features are implemented, for example by requiring that all fonts are embedded in the document. Keeping this in mind, it's easy to see that migrating an arbitrary PDF to PDF/A can easily result in problems.
Loss, alteration during migration
Suppose, as an example, that we have a PDF that contains a movie. This is prohibited in PDF/A, so migrating to PDF/A will simply result in the loss of the multimedia content. Fonts are another example: all fonts in a PDF/A document must be embedded. But what happens if the source PDF uses non-embedded fonts that are not available on the machine on which the migration is run? Will the migration tool exit with a warning, or will it silently use some alternative, perhaps similar font? And how do you check for this?
Complexity and effect of errors
Also, migrations like these typically involve a complete re-processing of the PDF's internal structure. The format's complexity implies that there's a lot of potential for things to go wrong in this process. This is particularly true if the source PDF contains subtle errors, in which case the risk of losing information is very real (even though the original document may be perfectly readable in a viewer). Since we don't really have any tools for detecting such errors (i.e. a sufficiently reliable PDF validator), these cases can be difficult to deal with. Some further considerations can be found here (the context there is slightly different, but the risks are similar).
Digitised vs born-digital
The origin of the source PDFs may be another thing to take into account. If PDFs were originally created as part of a digitisation project (e.g. scanned books), the PDF is usually little more than a wrapper around a bunch of images, perhaps augmented by an OCR layer. Migrating such PDFs to PDF/A is pretty straightforward, since the source files are unlikely to contain any features that are not allowed in PDF/A. At the same time, this also means that the benefits of migrating such files to PDF/A are pretty limited, since the source PDFs weren't problematic to begin with!
The potential benefits of PDF/A may be more obvious for a lot of born-digital content; however, for the reasons listed in the previous section, the migration is more complex, and there's just a lot more that can go wrong (see also here for some additional considerations).
Conclusions
Although migrating PDF documents to PDF/A may look superficially attractive, it is actually quite risky in practice, and it may easily result in unintentional data loss. Moreover, the risks increase with the number of preservation-unfriendly features, meaning that the migration is most likely to be successful for source PDFs that weren't problematic to begin with, which defeats the very purpose of migrating to PDF/A. For specific cases, migration to PDF/A may still be a sensible approach, but the expected benefits should be weighed carefully against the risks. In the absence of stable, generally accepted tools for assessing the quality of PDFs (both source and destination!), it would also seem prudent to always keep the originals.
At the 2014 Society of American Archivists meeting, the CAD/BIM Taskforce held a session titled “Frameworks for the Discussion of Architectural Digital Data” to consider the daunting matter of archiving computer-aided design and Building Information Modelling files. This was the latest evidence that — despite some progress in standards and file exchange — archivists and the international digital preservation community at large are trying to get a firm grasp on the slippery topic of preserving CAD files.
CAD is a suite of design tools, software for 3-D modelling, simulation and testing. It is used in architecture, geographic information systems, archaeology, survey data, geophysics, 3-D printing, engineering, gaming, animation and just about any situation that requires a 3-D virtual model. It comprises geometry, intricate calculations, vector graphics and text.
The data in CAD files resides in structurally complex inter-related layers that are capable of much more than displaying models. For example, engineers can calculate stress and load, volume and weight for specific materials, the center of gravity and visualize cause-and-effect. Individual CAD files often relate and link to other CAD files to form a greater whole, such as parts of a machine or components in a building. Revisions are quick in CAD’s virtual environment, compared to paper-based designs, so CAD has eclipsed paper as the tool of choice for 3-D modelling.
CAD files — particularly as used by scientists, engineers and architects — can contain vital information. Still, CAD files are subject to the same risk that threatens all digital files, major and minor: failure of accessibility — being stuck on obsolete storage media or dependent on a specific program, in a specific version, on a specific operating system. In particular, the complexity and range of specifications and formats for CAD files make them even more challenging than many other kinds of born-digital materials.
As for CAD software, commerce thrives on rapid technological change, new versions of software and newer and more innovative software companies. This is the natural evolution of commercial technology. But each new version and type of CAD software increases the risk of software incompatibility and inaccessibility for CAD files created in older versions of software. Vendors, of course, do not have to care about that; the business of business is business — though, in fairness, businesses may continually surpass customer needs and expectations by creating newer and better features. That said, many CAD customers have long realized that it is important — and may someday be crucial — to be able to archive and access older CAD files.
Building Information Modelling files and Project Lifecycle Management files also require a digital-preservation solution. BIM and PLM integrate all the information related to a major project, not only the CAD files but also the financial, legal, email and other ancillary files.
Part of a digital preservation workflow is compatibility and portability between systems. One of the most significant standards for exchanging the product manufacturing information in CAD files is ISO 10303, known as the “Standard for the Exchange of Product model data” or STEP. Michael J. Pratt, of the National Institute of Standards and Technology, wrote in 2001 (pdf), “the development of STEP has been one of the largest efforts ever undertaken by ISO.”
Here are some other CAD preservation resources, many of which refer to STEP:
- The United States National CAD Standard encompasses The American Institute of Architects’ CAD Layer Guidelines, the Construction Specification Institute’s Uniform Drawing System and the National Institute of Building Sciences Plotting Guidelines.
- MIT conducted a two-year project (which included digital preservation pioneer Stephen Abrams on their advisory board) called “Future-proofing Architectural Computer-Aided DEsign,” where they analyzed CAD data from three renowned architects and their projects. The FACADE project’s final report (pdf) details recommendations and best practices.
- The National Archives’ “Revised Format Guidance for the Transfer of Permanent Electronic Records” lists “Preferred” and “Acceptable” formats for CAD.
- The Art Institute of Chicago Department of Architecture published “Collecting, Archiving and Exhibiting Digital Design Data” (pdf).
- In July 2013, the Digital Preservation Coalition held a conference titled “Preserving Computer Aided Design.” In 2013, the DPC also released their Technology Watch report, authored by Alex Ball, “Preserving Computer-Aided Design.”
- ISO 13567 is a CAD layer standard.
- CAD standards on Wikipedia.
- ISO 16739 for BIM data.
- “CAD: A Guide to Good Practice” is a collaborative effort from the UK’s Archaeology Data Service and the US’s Digital Antiquity.
- The list of CAD file formats is stunning.
Some simple preservation information that comes up repeatedly is to save the original CAD file in its original format. Save the hardware, software and system that runs it too, if you can. Save any metadata or documentation and document a one-to-one relationship with each CAD file’s plotted sheet.
The usual digital-preservation practice applies: organize the files, back up the files to a few different storage devices, put one copy in a geographically remote location in case of disaster, and every seven years or so migrate to a current storage medium to keep the files accessible. Given the complexity of these files, and recognizing that at its heart digital preservation is an attempt to hedge our bets against a range of potential risks, it is also advisable to generate a range of derivative files that are likely to be more viable in the future. That is, keep the originals, and also try to export to other formats that may lose some functionality and properties but are far more likely to be openable in the future. The final report from the FACADE project makes this recommendation: “For 3-D CAD models we identified the need for four versions with distinct formats to insure long-term preservation. These are:
1. Original (the originally submitted version of the CAD model)
2. Display (an easily viewable format to present to users, normally 3D PDF)
3. Standard (full representation in preservable standard format, normally IFC or STEP)
4. Dessicated (simple geometry in a preservable standard format, normally IGES)”
CAD files now join paper files — such as drawings, plans, elevations, blueprints, images, correspondence and project records — in institutional archives and firms’ libraries. In addition to the ongoing international work on standards and preservation, there needs to be a dialog with the design-software industry to work toward creating archival CAD files in an open-preservation format. Finally, trained professionals need to make sense of the CAD files to better archive them and possibly get them up and running again for production, academic, legal or other professional purposes. That requires knowledge of CAD software, file construction and digital preservation methods.
Either CAD users need better digital curatorial skills to manage their CAD archives or digital archivists need better CAD skills to curate the archives of CAD users. Or both.
The first part of the workshop will be a panel session at which David Giaretta (APARSEN), Ross King (SCAPE), and Ed Fay (OPF) will be discussing digital preservation.
After this a range of digital preservation projects will be presented at different stalls. This part will begin with an elevator pitch session at which each project will have exactly one minute to present their project.
Everybody is invited to visit all stalls and learn more about the different projects, their results and thoughts on sustainability. At the same time there will be a special ‘clinic’ stall at which different experts will be ready to answer any questions you have on their specific topic – for instance PREMIS metadata or audit processes.
The workshop takes place at City University London, 8 September 2014, 1pm to 5pm.
Looking forward to meeting you!
Register for the workshop (please note: registration for this workshop should not be done via the DL registration page)
Oh, did I forget? We also have a small competition going on… Read more.
I had occasion today to look up the “Rendering Matters” report I wrote while at Archives New Zealand (I was looking for this list of questions/object attributes that were tested for and included as an appendix in the report) and got distracted re-reading the findings in the report.
Summary findings from “Rendering Matters”:
- The choice of rendering environment (software) used to open or “render” an office file invariably has an impact on the information presented through that rendering. When files are rendered in environments that differ from the original then they will often present altered information to the user. In some cases the information presented can differ from the original in ways that may be considered significant.
- The emulated environments, with minimal testing or quality assurance, provided significantly better rendering functionality than the modern office suites. 60-100% of the files rendered using the modern office suites displayed at least one change compared to 22-35% of the files rendered using the emulated hardware and original software.
- In general, the Microsoft Office 2007 suite functioned significantly better as a rendering tool for older office files than either the open source LibreOffice suite or Corel’s WordPerfect Office X5 suite.
- Given the effectiveness of modern office applications to open the office files, many files may not need to have content migrated from them at this stage as current applications can render much of the content effectively (and the content’s accessibility will not be improved by performing this migration as the same proportion of the content can currently be accessed).
- Users do not often include a lot of problematic attributes in their files but often include at least one. This in turn indicates a level of unpredictability and inconsistency in the occurrence of rendering issues which may make it difficult to test the results of migration actions on files like these.
There were more detailed findings towards the end of the report:
"The [findings] show quantitatively that the choice of rendering environment (software) used to open or “render” an office file invariably has an impact on the information presented through that rendering. When files are rendered in environments that differ from the original they will often present altered information to the user. In some cases the information presented can differ from the original in ways that may be considered significant. This result is useful as it gives a set of ground-truth data to refer to when discussing the impact of rendering on issues of authenticity, completeness and the evidential value of digital office files.
The results give an indication of the efficacy of modern office suites as rendering tools for older office files. Risk analysis of digital objects in current digital repositories could be informed by this research. Digital preservation risk analysts could use this research to evaluate whether having access to these modern office suites means that files that can be “opened” by them are not at risk.
The results highlight the difficulty and expense in testing migration approaches by showing how long it took to test only ~100 files comprehensively (at least 13.5 hours). Scaling this to 0.5% of 1,000,000 files would give 675 hours or nearly 17 weeks at 40 hours per week. This level of testing may be considered excessive depending on the context, but similarly comprehensive testing of only 100 files per 1,000,000 of each format (.01%) would take at least 13.5 hours per format, per tool. More information on how long testing would take for a variety of different sample sizes and percentages of objects (e.g. 1% of 100,000 objects would take 150 hours) is available in Appendix 3.
The results also show the promise of running original software on emulated hardware to authenticate the rendering of files to ensure that all the content has been preserved. Although emulated environment renderings were not shown to be 100% accurate in this research, they were shown to have a far greater degree of accuracy in their renderings than current office suites (which are the tools currently used for migrating office files). Additionally, some of the changes introduced in the emulated environments may have been due to poor environment configuration.
The results give an indication of how prevalent certain attributes are in office files. With a greater sample size this research could help to show whether or not it is true that “most users only use the same 10% of functionality in office applications” (the data from this small sample indicates that in fact they only use about 10% of the functionality/attributes each, but often it is a different 10%).”
Findings specific to the prevalence of rendering “errors”
Personally, I found the findings related to the prevalence of problematic attributes in the files tested to be the most enlightening. The relevant findings from the report are included below:
- "The likelihood that any single file has a particular attribute that does not render properly in a particular rendering environment is low,
- The likelihood that the same file will have at least one attribute that doesn’t render properly in a particular environment is quite high (~60% and above).
In other words, the results indicate that users do not often include a lot of attributes in their files that caused rendering issues when rendered in modern environments but often include at least one. This in turn indicates a level of unpredictability and inconsistency in the occurrence of rendering issues.
A significant challenge for digital preservation practitioners is evaluating the effectiveness of digital preservation approaches. When faced with a large and ever increasing volume of digital files to be preserved, practitioners are forced to consider approaches that can be automated. The results in this report indicate that the occurrence of problematic attributes is inconsistent and they therefore may be difficult to automatically identify. Without identifying such attributes pre-migration it will not be possible to test whether the attributes exist post-migration and so the effectiveness of the migration will not be able to be evaluated. Without automatically identifying such attributes pre-migration then it is unlikely that any effective evaluation will be able to be made cost-effectively. The cost to manually identify these attributes for every object would likely be prohibitively large for most organisations given reasonably sized collections.”
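One way to see how both statements in the findings can hold at once is a back-of-the-envelope calculation. The sketch below assumes that each attribute in a file fails to render independently and with the same small chance; that assumption, the function name and the example numbers (20 attributes, a 5% chance each) are purely illustrative and are not taken from the report.

```python
def chance_of_at_least_one_issue(per_attribute_chance, attribute_count):
    """Chance that at least one of several attributes renders incorrectly.

    Assumes every attribute fails independently with the same chance,
    which is a simplification made only for this illustration.
    """
    chance_all_render_correctly = (1 - per_attribute_chance) ** attribute_count
    return 1 - chance_all_render_correctly


if __name__ == "__main__":
    # Hypothetical numbers: 20 attributes in a file, each with a 5% chance
    # of not rendering properly in a given environment.
    chance = chance_of_at_least_one_issue(0.05, 20)
    print(f"Chance of at least one rendering issue: {chance:.2f}")
```

With these made-up inputs the chance of at least one issue comes out at about 0.64, even though each individual attribute is very unlikely to cause a problem.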
Time to manually validate object rendering
Also included in the appendices was a table estimating the time it would take to manually validate a set percentage of objects for a given collection size. This was based on the average of 9 minutes it took to undertake the tests as part of the Rendering Matters research. I’ve included this table below; it is sobering.
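The arithmetic behind estimates of this kind is simple enough to sketch. The function below multiplies the number of sampled objects by the 9-minute average; the function name and the collection sizes and percentages in the loop are my own choices for illustration, not the rows of the report's table.

```python
def validation_hours(collection_size, sample_percent, minutes_per_object=9.0):
    """Estimate the hours needed to manually validate a sample of a collection.

    minutes_per_object defaults to the 9-minute average mentioned above.
    """
    sampled_objects = collection_size * sample_percent / 100
    return sampled_objects * minutes_per_object / 60


if __name__ == "__main__":
    # Illustrative collection sizes and sample percentages.
    for size in (100_000, 1_000_000):
        for percent in (0.01, 0.5, 1):
            hours = validation_hours(size, percent)
            print(f"{percent}% of {size:,} objects: {hours:,.1f} hours")
```

As a sanity check, 1% of 100,000 objects comes out at 150 hours, which matches the example quoted from the report above.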
Also included as an appendix in the report, and included in a separate web page, are some examples of the types of rendering issues that were identified, including screenshots e.g.:
Replicating the results
It has now been three and a half years since the publication of this report and as far as I am aware nobody has attempted to replicate the approach or the findings. Personally I found the process as enlightening as the results, and would welcome (and where possible, help) the replication of this research by others.
Every day, people from around the world upload photos to share on a range of social media sites and web applications. The results are astounding; collections of billions of digital photographs are now stored and managed by several companies and organizations. In this context, Yahoo Labs recently announced that they were making a data set of 100 million Creative Commons photos from Flickr available to researchers. As part of our ongoing series of Insights Interviews, I am excited to discuss potential uses and implications for collecting and providing access to digital materials with David Ayman Shamma, a scientist and senior research manager with Yahoo Labs and Flickr.
Trevor: Could you give us a sense of the scope and range of this corpus of photos? What date ranges do they span? The kinds of devices they were taken on? Where they were taken? What kinds of information and metadata they come with? Really, anything you can offer for us to better get our heads around what exactly the dataset entails.
Ayman: There’s a lot to answer in that question. Starting at the beginning, Flickr was an early supporter of the Creative Commons and since 2004 devices have come and gone, photographic volume has increased, and interests have changed. When creating the large-scale dataset, we wanted to cast as wide a representative net as possible. So the dataset is a fair random sample across the entire corpus of public CC images. The photos were uploaded from 2004 to early 2014 and were taken by over 27,000 devices, including everything from camera phones to DSLRs. The dataset is a list of photo IDs with a URL to download a JPEG or video plus some corresponding metadata like tags and camera type and location coordinates. All of this data is public and can generally be accessed from an unauthenticated API call; what we’re providing is a consistent list of photos in a large, rolled-up format. We’ve rolled up some but not all of the data that is there. For example, about 48% of the dataset has longitude and latitude data which is included in the rollup, but comments on the photos have not been included, though they can be queried through the API if someone wants to supplement their research with it.
Trevor: In the announcement about the dataset you mention that there is a 12 GB data set, which seems to have some basic metadata about the images and a 50 TB data set containing the entirety of the collection of images. Could you tell us a bit about the value of each of these separately, the kinds of research both enable and a bit about the kinds of infrastructure required to provide access to and process these data sets?
Ayman: Broadly speaking, research on Flickr can be categorized into two non-exclusive topic areas: social computing and computer vision. In the latter, one has to compute what are called ‘features’ or pixel details about luminosity, texture, cluster and relations to other pixels. The same is true for audio in the videos. In effect, it’s a mathematical fingerprint of the media. Computing these fingerprints can take quite a bit of computational power and time, especially at the scale of 100 million items. While the core dataset of metadata is only 12 GB, a large collection of features reaches into the terabytes. Since these are all CC media files, we thought to also share these computed features. Our friends at the International Computer Science Institute and Lawrence Livermore National Labs were more than happy to compute and host a standard set of open features for the world to use. What’s nice is this expands the dataset’s utility. If you’re from an institution (academic or otherwise), computing the features could cost a lot of compute time.
Trevor: The dataset page notes that the dataset has been reviewed to meet “data protection standards, including strict controls on privacy.” Could you tell us a bit about what that means for a dataset like this?
Ayman: The images are all under one of six Creative Commons licenses implemented by Flickr. However, there were additional protections that we put into place. For example, you could upload an image with the license CC Attribution-NoDerivatives and mark it as private. Technically, the image is in the public CC; however, Flickr’s agreement with its users supersedes the CC distribution rights. With that, we only sampled from Flickr’s public collection. There are also some edge cases. Some photos are public and in the CC but the owner set the geo-metadata to private. Again, while the geo-data might be embedded in the original JPEG and is technically under CC license, we didn’t include it in the rollup.
Trevor: Looking at the Creative Commons page for Flickr, it would seem that this isn’t the full set of Creative Commons images. By my count, there are more than 300 million creative commons licensed photos there. How were the 100 million selected, and what factors went into deciding to release a subset rather than the full corpus?
Ayman: We wanted to create a solid dataset given the potential public dataset size; 100 million seemed like a fair sample size that could bring in close to 50% geo-tagged data and about 800 thousand videos. We envision researchers from all over the world accessing this data, so we did want to account for the overall footprint and feature sizes. We’ve chatted about the possibility of ‘expansion packs’ down the road, both to increase the size of the dataset and to include things like comments or group memberships on the photos.
Trevor: These images are all already licensed for these kinds of uses, but I imagine that it would have simply been impractical for someone to collect this kind of data via the API. How does this data set extend what researchers could already do with these images based on their licenses? Researchers have already been using Flickr photos as data, what does bundling these up as a dataset do for enabling further or better research?
Ayman: Well, what’s been happening in the past is people have been harvesting the API or crawling the site. However, there are a few problems with these one-off research collections; the foremost is replication. By having a large and flexible corpus, we aim to set a baseline reference dataset for others to see if they can replicate or improve upon new methods and techniques. A few academic and industry players have created targeted datasets for research, such as ImageNet from Stanford or Yelp’s release of its Phoenix-area reviews. Yahoo Labs itself has released a few small targeted Flickr datasets in the past as well. But in today’s research world, the new paradigm and new research methods require large and diverse datasets, and this is a new dataset to meet the research demands.
Trevor: What kinds of research are you and your colleagues imagining folks will do with these photographs? I imagine a lot of computer science and social network research could make use of them. Are there other areas you imagine these being used in? It would be great if you could mention some examples of existing work that folks have done with Flickr photos to illustrate their potential use.
Ayman: Well, part of the exciting bit is finding new research questions. In one recent example, we began to examine the shape and structure of events through photos. Here, we needed to temporally align geo-referenced photos to see when and where a photo was taken. As it turns out, the time the photo was taken and the time reported by the GPS are off by as much as 10 minutes in 40% of the photos. So, in work that will be published later this year, we designed a method for correcting timestamps that are in disagreement with the GPS time. It’s not something we would have thought we’d encounter, but it’s an example of what makes a good research question. With a large corpus available to the research world at-large, we look forward to others also finding new challenges, both immediate and far-reaching.
Trevor: Based on this, and similar webscope data sets, I would be curious for any thoughts and reflections you might offer for libraries, archives and museums looking at making large scale data sets like this available to researchers. Are there any lessons learned you can share with our community?
Ayman: There’s a fair bit of care and precaution that goes into making collections like this - rarely is it ever just a scrape of public data; ownership and copyright do play a role. These datasets are large collections that reflect people’s practices, behavior and engagement with media like photos, tweets or reviews. So, coming to understand what these datasets mean with regard to culture is something to set our sights on. This applies to the libraries and archives that set out to preserve collections and to researchers and scientists, social and computational alike, who aim to understand them.
In this post I'll be taking a look at format identification of PDF files and highlighting a difference in opinion between format identification tools. Some of the details are a little dry but I'll restrict myself to a single issue and be as light on technical details as possible. I hope I'll show that once the technical details are clear it really boils down to policy and requirements for PDF processing.
Assumptions
I'm considering format identification in its simplest role as first contact with a file that little, if anything, is known about. In these circumstances the aim is to identify the format as quickly and accurately as possible then pass the file to format specific tools for deeper analysis.
I'll also restrict the approach to magic number identification rather than trust the file extension, more on this a little later.
Software and data
- the fine free file utility (also known simply as file),
- FIDO, and
- Apache Tika.
I used versions that were as up to date as possible but will spare the details until I publish the results in full.
So is this a PDF?
There was plenty of disagreement between the results from the different tools, and I'll be showing these in more detail at our upcoming PDF Event. For now I'll focus on a single issue: there is a set of files that FIDO and DROID don't identify as PDFs but that file and Tika do. I've attached one example to this post; Google Chrome won't open it but my Ubuntu-based document viewer does. It's a three page PDF about Rumen Microbiology and this was obviously the intention of the creator. I've not systematically tested multiple readers yet but LibreOffice won't open it while Ubuntu's print preview will. Feel free to try the reader of your choice and comment.
What's happening here?
It appears we have a malformed PDF and this is the case. The issue is caused by a difference in the way that the tools go about identifying PDFs in the first place. This is where it gets a little dull but bear with me. All of these tools use "magic" or "signature" based identification. This means that they look for unique (hopefully) strings of characters in specific positions in the file to work out the format. Here's the Tika 1.5 signature for PDF:
<match value="%PDF-" type="string" offset="0"/>
What this says is look for the string %PDF- (the value) at the start of the file (offset="0") and if it's there identify this as a PDF. The attached file indeed starts:
%PDF-1.2
meaning it's a PDF version 1.2. Now we can have a look at the DROID signature (version 77) for PDF 1.2:
<InternalSignature ID="125" Specificity="Specific">
  <ByteSequence Reference="BOFoffset">
    <SubSequence MinFragLength="0" Position="1" SubSeqMaxOffset="0" SubSeqMinOffset="0">
      <Sequence>255044462D312E32</Sequence>
      <DefaultShift>9</DefaultShift>
      <Shift Byte="25">8</Shift>
      <Shift Byte="2D">4</Shift>
      <Shift Byte="2E">2</Shift>
      <Shift Byte="31">3</Shift>
      <Shift Byte="32">1</Shift>
      <Shift Byte="44">6</Shift>
      <Shift Byte="46">5</Shift>
      <Shift Byte="50">7</Shift>
    </SubSequence>
  </ByteSequence>
  <ByteSequence Reference="EOFoffset">
    <SubSequence MinFragLength="0" Position="1" SubSeqMaxOffset="1024" SubSeqMinOffset="0">
      <Sequence>2525454F46</Sequence>
      <DefaultShift>-6</DefaultShift>
      <Shift Byte="25">-1</Shift>
      <Shift Byte="45">-3</Shift>
      <Shift Byte="46">-5</Shift>
      <Shift Byte="4F">-4</Shift>
    </SubSequence>
  </ByteSequence>
</InternalSignature>
Which is a little more complex than Tika's signature but what it says is a matching file should start with the string %PDF-1.2, which our sample does. This is in the first <ByteSequence Reference="BOFoffset"> section, a beginning of file offset. Crucially this signature adds another condition, that the file contains the string %%EOF within 1024 bytes of the end of the file. There are two things that are different here. The start condition change, i.e. Tika's "%PDF-" vs. DROID's "%PDF-1.2", is to support DROID's capability to identify versions of formats. Tika simply detects that a file looks like a PDF and returns the application/pdf mime type and has a single signature for the job. DROID can distinguish between versions and so has 29 different signatures for PDF. The start condition is also NOT the cause of the problem. The disagreement between the results is caused by DROID's requirement for a valid end of file marker %%EOF. A hex search of our PDF confirms that it doesn't contain an %%EOF marker.
So who's right?
An interesting question. The PDF 1.3 Reference states:
The last line of the file contains only the end-of-file marker, %%EOF. (See implementation note 15 in Appendix H.)
The referenced implementation note reads:
3.4.4, “File Trailer”
15. Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file.
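The two identification rules discussed above can be sketched in a few lines of Python. This is only an illustration of the logic, not how Tika or DROID are implemented: the function names, the sample byte strings and the simple "last 1024 bytes" window are assumptions made for the example.

```python
PDF_MAGIC = b"%PDF-"
EOF_MARKER = b"%%EOF"
TAIL_WINDOW = 1024  # assumed window: the last 1024 bytes of the file


def prefix_only_match(data: bytes) -> bool:
    """Tika-style rule: the file starts with %PDF- at offset 0."""
    return data.startswith(PDF_MAGIC)


def prefix_and_trailer_match(data: bytes, version: bytes = b"1.2") -> bool:
    """DROID-style rule: versioned header at offset 0 plus %%EOF near the end."""
    has_header = data.startswith(PDF_MAGIC + version)
    has_trailer = EOF_MARKER in data[-TAIL_WINDOW:]
    return has_header and has_trailer


# Hypothetical samples: one file cut off before its trailer, one complete.
truncated = b"%PDF-1.2\n1 0 obj\n<< >>\nendobj\n"
complete = truncated + b"trailer\n<< >>\n%%EOF\n"

for name, sample in (("truncated", truncated), ("complete", complete)):
    print(name, prefix_only_match(sample), prefix_and_trailer_match(sample))
```

A file that has lost its trailer passes the first rule and fails the second, which is the kind of disagreement described in this post.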
So DROID's signature is indeed to the letter of the law plus amendments. It's really a matter of context when using the tools. Does DROID's signature introduce an element of format validation to the identification process? In a way yes, but understanding what's happening and making an informed decision is what really matters.
What's next?
I'll be putting some more detailed results onto GitHub along with a VM demonstrator. I'll tweet and add a short post when this is finished, it may have to wait until next week.
On September 8 the SCAPE/APARSEN workshop Digital Preservation Sustainability on the EU Level will be held at London City University in connection with the DL2014 conference.
The main objective of the workshop is to provide an overview of solutions to challenges within Digital Preservation Sustainability developed by current and past Digital Preservation research projects. The event brings together various EU projects/initiatives to present their solutions and approaches, and to find synergies between them.
Attached to the workshop Digital Preservation Sustainability on the EU Level, SCAPE and APARSEN are launching a competition:
Which message do YOU want to send to the EU for the future of Digital Preservation projects?
You can join the competition on Twitter. Only tweets including the hashtag #DP2EU are contending in the competition. You are allowed to include a link to a text OR one picture with your message. Messages which contain more than 300 characters in total are excluded from the competition, though.
The competition will close September 8th at 16:30 UK time. The workshop panel will then choose one of the tweets as a winner. The winner will receive an e-book reader as a prize.
There are only a few places left for the workshop. Registration for the workshop is FREE and must be completed by filling out the form here - http://bit.ly/DPSustainability. Please don’t register for this workshop on the DL2014 registration page, since this workshop is free of charge!
The following is a guest post from Euan Cochrane, Digital Preservation Manager at Yale University Library. This piece continues and extends exploration of the potential of emulation as a service and virtualization platforms.
Increasingly, the intellectual productivity of scholars involves the creation and development of software and software-dependent content. For universities to act as responsible stewards of these materials we need to have a well-formulated approach to how we can make these legacy works of scholarship accessible.
While there have been significant concerns with the practicality of emulation as a mode of access to legacy software, my personal experience (demonstrated via one of my first websites about Amiga emulation) has always been contrary to that view. It is with great pleasure that I can now illustrate the practical utility of Emulation as a Service via three recent case studies from my work at Yale University Library. Consideration of interactive artwork from 1997, interactive Hebrew texts from a 2004 CD-ROM and finance data from 1998 illustrate that it’s no longer really a question of if emulation is a viable option for access and preservation, but of how we can go about scaling up these efforts and removing any remaining obstacles to their successful implementation.
At Yale University Library we are conducting a research pilot of the bwFLA Emulation as a Service software framework. This framework greatly simplifies the use of emulators and virtualization tools in a wide range of contexts by abstracting all of the emulator configuration (and its associated issues) away from the end-user. As well as simplifying use of emulators it also simplifies access to emulated environments by providing the ability to access and interact with emulated environments from right within your web browser, something that we could only dream of just a few years ago.
At Yale University Library we are evaluating the software against a number of criteria including:
- In what use-cases might it be used?
- How might it fit in with digital content workflows?
- What challenges does it present?
The EaaS software framework shows great promise as a tool for use in many digital content management workflows such as appraisal/selection, preservation and access, but also presents a few unique and particularly challenging issues that we are working to overcome. The issues are mostly related to copyright and software licensing. At the bottom of this post I will discuss what these issues are and what we are doing to resolve them, but before I do that let me put this in context by discussing some real-life use-cases for EaaS that have occurred here recently.
It has taken a few months (I started in my position at the Library in September 2013) but recently people throughout the Library system have begun to forward queries to me if they involve anything digital preservation-related. Over the past month or so we have had three requests for access to digital content from the general collections that couldn’t be interacted with using contemporary software. These requests are all great candidates for resolving using EaaS but, unfortunately (as you will see), we couldn’t do that.
Interactive Artwork, Circa 1997: Use Case One
An Arts PhD student wanted to access an interactive CD-ROM-based artwork (Laurie Anderson’s “Puppet Motel”) from the general collections. The artwork can only be interacted with on old versions of the Apple Mac “classic” operating system.
Fortunately the Digital Humanities Librarian (Peter Leonard) has a collection of old technology and was willing to bring in a laptop from his personal collection so that the PhD student could access the artwork. This was not an ideal or sustainable solution (what would have happened if Peter’s collection wasn’t available? What happens when that hardware degrades past usability?).
Since responding to this request we have managed to get the Puppet Motel running in the emulation service using the Basilisk II emulator (for research purposes).
This would be a great candidate for accessing via the emulation service. The sound and interaction aspects all work well and it is otherwise very challenging for researchers to access the content.
Hebrew Texts, Circa 2004: Use Case Two
One of the Judaica librarians needed to access data for a patron and the data was on a Windows XP CD-ROM (Trope Trainer) from the general collections. The software on the CD would not run on the current Windows 7 operating system that is installed on the desktop PCs here in the library.
The solution we came up with was to create a Windows XP virtual machine for the librarian to have on her desktop. This is a good solution for her as it enables her to print the sections she wants to print and export PDFs for printing elsewhere as needed.
We have since ingested this content into the emulation service for testing purposes. In the EaaS it can run on either VirtualBox, the virtualization software from Oracle (which doesn’t provide full emulation), or QEMU, an emulation and virtualization tool.
It is another great candidate for the service as this version of the content can no longer be accessed on contemporary operating systems and the emulated version enables users to play through the texts and hear them read just as though they were using the CD on their local machine. The ability to easily export content from the emulation service will be added in a future update and will enable this content to become even more useful.
Finance Data, Circa 1998/2003: Use Case Three
A Finance PhD student needed access to data (inter-corporate ownership data) trapped within software within a CD-ROM from the general collection. Unfortunately the software was designed for Windows 98: “As part of my current project I need to use StatCan data saved using some sort of proprietary software on a CD. Unfortunately this software seemed not to be compatible with my version of Windows.” He had been able to get the data out of the disc but couldn’t make any real sense of it without the software: “it was all just random numbers.”
We have recently been developing a collection of old hardware at the Library to support long-term preservation of digital content. Coincidentally, and fortunately, the previous day someone had donated a Windows 98 laptop. Using that laptop we were able to ascertain that the CD hadn’t degraded and the software still worked. A Windows 98 virtual machine was then created for the student to use to extract the data. Exporting the data to the host system was a challenge. The simplest solution turned out to be having the researcher email the data to himself from within the virtual machine via Gmail using an old web browser (Firefox 2.x).
We were also able to ingest the virtual machine into the emulation service where it can run on either VirtualBox or QEMU.
This is another great candidate for the emulation service. The data is clearly of value but cannot be properly accessed without using the original custom software which only runs on older versions of the Microsoft Windows operating system.
Other uses of the service
In exploring these predictable use-cases for the service, we have also discovered some less-expected scenarios in which the service offers some interesting potential applications. For example, the EaaS framework makes it trivially easy to set up custom environments for patrons. These custom environments take up little space as they are stored as a difference from a base-environment, and they have a unique identifier that can persist over time (or not, as needed). Such custom environments may be a great way for providing access to sets of restricted data that we are unable to allow patrons to download to their own computers. Being able to quickly configure a Windows 7 virtual machine with some restricted content included in it (and appropriate software for interacting with that content, e.g., an MS Outlook PST archive file with MS Outlook), and provide access to it in this restricted online context, opens entirely new workflows for our archival and special collections staff.
Why we couldn’t use bwFLA’s EaaS
In all three of the use-cases outlined above EaaS was not used as the solution for the end-user. There were two main reasons for this:
- We are only in possession of a limited number of physical operating system and application licenses for these older systems. While there is some capacity to use downgrade rights within the University’s volume licensing agreement with Microsoft, with Apple operating systems the situation is much less clear. As a result we are being conservative in our use of the service until we can resolve these issues.
- It is not always clear in the license of old software whether this use-case is allowed. Virtualization is rarely (if ever) mentioned in the license agreements. This is likely because it wasn’t very common during the period when much of the software we are dealing with was created. We are working to clarify this point with the General Counsel at Yale and will be discussing it with the software vendors.
Addressing the software licensing challenges
As things stand we are limited in our ability to provide access to EaaS due to licensing agreements (and other legal restrictions) that still apply to the content-supporting operating system and productivity software dependencies. A lot of these dependencies that are necessary for providing access to valuable historic digital content do not have a high economic value themselves. While this will likely change over time as the value of these dependencies becomes more recognized and the software more rare, it does make for a frustrating situation. To address this we are beginning to explore options with the software vendors and will be continuing to do this over the following months and years.
We are very interested in the opportunities EaaS offers for opening access to otherwise inaccessible digital assets. There are many use-cases in which emulation is the only viable approach for preserving access to this content over the long term. Because of this, anything that prevents the use of such services will ultimately lead to the loss of access to valuable and historic digital content, which will effectively mean the loss of that content. Without engagement from software vendors and licensing bodies, a change in the law may be required to ensure that this content is not lost forever.
It is our hope that the software vendors will be willing to work with us to save our valuable historic digital assets from becoming permanently inaccessible and lost to future generations. There are definitely good reasons to believe that they will, and so far, those we have contacted have been more than willing to work with us.
While a fair amount of digital preservation focuses on objects that have clear corollaries to objects from our analog world (still and moving images and documents for example), there are a range of forms that are basically natively digital. Completely native digital forms, like database-driven web applications, introduce a variety of challenges for long-term preservation and access. I’m thrilled to discuss just such a form with Karl Nilsen and Robin Dasler from the University of Maryland, College Park. Karl is the Research Data Librarian, and Robin is the Engineering/Research Data Librarian. Karl and Robin spoke on their work to ensure long-term access to the Extragalactic Distance Database at the Digital Preservation 2014 conference.
Trevor: Could you tell us a bit about the Extragalactic Distance Database? What is it? How does it work? Who does it matter to today and who might make use of it in the long term?
Karl and Robin: The Extragalactic Distance Database contains information that can be used to determine distances between galaxies. For a limited number of nearby galaxies, the distances can be measured directly with a few measurements, but for galaxies beyond these, astronomers have to correlate and calibrate data points obtained from multiple measurements. The procedure is called a distance ladder. From a data curation perspective, the basic task is to collect and organize measurements in such a way that researchers can rapidly collate data points that are relevant to the galaxy or galaxies of interest.
The EDD was constructed by a group of astronomers at various institutions over a period of about a decade and is currently deployed on a server at the Institute for Astronomy at the University of Hawaii. It’s a continuously (though irregularly) updated, actively used database. The technology stack is Linux, Apache, MySQL and PHP. It also has an associated file system that contains FITS files and miscellaneous data and image files. The total system is approximately 500GB.
The literature mentioning extragalactic or cosmic distance runs to thousands of papers in Google Scholar, and over one hundred papers have appeared with 2014 publication dates. Explicit references to the EDD appear in twelve papers with 2014 publication dates and a little more than seventy papers published before 2014. We understand that some astronomers use the EDD for research that is not directly related to distances simply because of the variety of data compiled into the database. Future use is difficult to predict, but we view the EDD as a useful reference resource in an active field. That being said, some of the data in the EDD will likely become obsolete as new instruments and techniques facilitate more accurate distances, so a curation strategy could include a reappraisal and retirement plan.
Our agreement with the astronomers has two parts. In the first part, we’ll create a replica of the EDD at our institution that can serve as a geographically distinct backup for the system in Hawaii. We’re using rsync for transfer. Our copy will also serve as a test case for digital curation and preservation research. In this period, the copy in Hawaii will continue to be the database-of-record. In the second part, our copy may become the database-of-record, with responsibility for long-term stewardship passing more fully to the University of Maryland Libraries. In general, this project gives us an opportunity to develop and fine-tune curation processes, procedures, policies and skills with the goal of expanding the Libraries’ capacity to support complex digital curation and preservation projects.
Trevor: How did you get involved with the database? Did the astronomers come to you or did you all go to them?
Karl and Robin: One of the leaders of the EDD project is a faculty member at the University of Maryland and he contacted us. We’re librarians on the Research Data Services team and we assist faculty and graduate students with all aspects of data management, curation, publishing and preservation. As a new program in the University Libraries, we actively seek and cultivate opportunities to carry out research and development projects that will let us explore different data curation strategies and practices. In early 2013 we included a brief overview of our interests and capabilities in a newsletter for faculty, and that outreach effort led to an inquiry from the faculty member.
We occasionally hear from other faculty members who have developed or would like to develop databases and web applications as a part of their research, so we expect to encounter similar projects in the future. For that reason, we felt that it was important to initiate a project that involves a database. The opportunities and challenges that arise in the course of this project will inform the development of our services and infrastructure, and ultimately, shape how we support faculty and students on our campus.
Trevor: When you started in on this, were there any other particularly important database preservation projects, reports or papers that you looked at to inform your approach? If so, I’d appreciate hearing what you think the takeaways are from related work in the field and how you see your approach fitting into the existing body of work.
Karl and Robin: Yes, we have been looking at work on database preservation as well as work on curating and preserving complex objects. We’re fortunate that there has been a considerable amount of research and development on database preservation and there is a body of literature available. As a starting point, readers may wish to review:
- Ribiero, Cristina and David Gabriel. “Database Preservation” (PDF). Digital Preservation Europe, 2009.
- Roberts, Bill et al. “Case Study: Database Preservation at the National Archives of the Netherlands” (PDF) Open Planets Foundation, 2010.
Some of the database preservation efforts have produced software for digital preservation. For example, readers may wish to look at SIARD (Software Independent Archiving of Relational Databases) or the Database Preservation Toolkit. In general, these tools transform the database content into a non-proprietary format such as XML. However, there are quite a few complexities and trade-offs involved. For example, database management systems provide a wide range of functionality and a high level of performance that may be lost or not easily reconstructed after such transformations. Moreover, these preservation tools may involve dependencies that seem trivial now but could introduce significant challenges in the future. We’re interested in these kinds of tools and we hope to experiment with them, but we recognize that heavily transforming a system for the sake of preservation may not be optimal. So we’re open to experimenting with other strategies for longevity, such as emulation or simply migrating the system to state-of-the-art databases and applications.
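To make the trade-off concrete, here is a minimal sketch of the general idea of dumping relational content to XML. It is not SIARD or the Database Preservation Toolkit, and it does not reflect the EDD's actual schema: SQLite stands in for MySQL, and the table names, column names, values and XML layout are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET


def table_to_xml(conn: sqlite3.Connection, table: str) -> ET.Element:
    """Dump one table into a generic <table><row><column> tree."""
    # The table name is interpolated directly, so only pass trusted names.
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]
    root = ET.Element("table", name=table)
    for record in cursor:
        row = ET.SubElement(root, "row")
        for column, value in zip(columns, record):
            cell = ET.SubElement(row, "column", name=column)
            # NULL becomes an empty string here, one small example of
            # information that a naive transformation can lose.
            cell.text = "" if value is None else str(value)
    return root


# Hypothetical two-table database, built in memory for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE galaxy (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE distance (galaxy_id INTEGER, method TEXT, mpc REAL)")
conn.execute("INSERT INTO galaxy VALUES (1, 'Example A')")
conn.execute("INSERT INTO distance VALUES (1, 'hypothetical method', 1.5)")

xml_text = ET.tostring(table_to_xml(conn, "distance"), encoding="unicode")
print(xml_text)
```

The exported XML keeps the values, but the ability to join the two tables, filter and group on demand lives in the database engine and would have to be rebuilt by whoever uses the export.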
Trevor: Having a fixed thing to preserve makes things a lot easier to manage, but the database you are working with is being continuously updated. How are you approaching that challenge? Are you taking snapshots of it? Managing some kind of version control system? Or something else entirely? I would also be interested in hearing a bit about what options you considered in this area and how you made your decision on your approach.
Karl and Robin: We haven’t made a decision about versioning or version control, but it’s obviously an important policy matter. At this stage, the file system is not a major concern because we expect incremental additions that don’t modify existing files. The MySQL database is another story. If we preserve copies of the database as binary objects, we face the challenge of proliferating versions. That being said, it may not be necessary to preserve a complete history of versions. Readers may be interested to know that we investigated Git for transfer and version control, but discovered that it’s not recommended for large binary files.
Trevor: How has your idea of database preservation changed and evolved by working through this project? Are there any assumptions you had upfront that have been challenged?
Karl and Robin: Working with the EDD has forced us to think more about the relationship between preservation and use. The intellectual value of a data collection such as the EDD is as much in the application–joins, conditions, grouping–as in the discrete tables. Our curation and preservation strategy will have to take this fact into account. We expect that data curators, librarians and archivists will increasingly face the difficult task of preservation planning, policy development and workflow design in cases where sustaining the value of data and the viability of knowledge production depends on sustaining access to data, code and other materials as a system. We’re interested to hear from other librarians, archivists and information scientists who are thinking about this problem.
Trevor: Based on this experience, is there a checklist or key questions for librarians or archivists to think through in devising approaches to ensuring long term access to databases?
Karl and Robin: At the outset, the questions that have to be addressed in database preservation are identical to the questions that have to be addressed in any digital preservation project. These have to do with data value, future uses, project goals, sustainability, ownership and intellectual property, ethical issues, documentation and metadata, data quality, technology issues and so on. A couple of helpful resources to consult are:
- Maron, Nancy L., Jason Yun, and Sarah Pickle. “Sustaining Our Digital Future: Institutional Strategies for Digital Content.” London, UK: JISC, 2013.
- Whyte, Angus, and Andrew Wilson. “How to Appraise & Select Research Data for Curation.” London, UK, and Melbourne, AU: Digital Curation Centre and Australian National Data Service, 2010.
Databases may complicate these questions or introduce unexpected issues. For example, if the database was constructed from multiple data sources by multiple researchers, which is not unusual, the relevant documentation and metadata may be difficult to compile and the intellectual property issues may be somewhat complicated.
Trevor: Why are the libraries at UMD the place to do this kind of curation and preservation? In many cases scientists have their own data managers, and I imagine there are contributions to this project from researchers at other universities. So what is it that makes UMD the place to do it and how does doing this kind of activity fit into the mission of the university and the libraries in particular?
Karl and Robin: While there are well-funded research projects that employ data managers or dedicated IT specialists, there are far more scientists and scholars who have little or no data management support. The cost of employing a data manager, even part-time, is too great for most researchers and often too great for most collaborations. In addition, while the IT departments at universities provide data storage services and web servers, they are not usually in the business of providing curatorial expertise, publishing infrastructure and long-term preservation and access. Further, while individual researchers recognize the importance of data management to their productivity and impact, surveys show that they have relatively little time available for data curation and preservation. There is also a deficit of expertise in general, though some researchers possess sophisticated data management skills.
Like many academic libraries, the UMD Libraries recognize the importance of data management and curation to the progress of knowledge production, the growth of open science and the success of our faculty and students. We also believe that library and archival science provide foundational principles and sets of practices that can be applied to support these activities. The Research Data Services program is a strategic priority for the University of Maryland Libraries and is highly aligned with the Libraries’ mission to accelerate and support research, scholarship and creativity. We have a cross-functional, interdisciplinary team in the Libraries–made up of subject specialists and digital curation specialists as needed–and partners across the campus, so we can bring a range of perspectives and skills to bear on a particular data curation project. This diversity is, in our view, essential to solving complex data curation and preservation problems.
We have to acknowledge that our work on the EDD involves a number of people in the Libraries. In particular, Jennie Levine Knies, Trevor Muñoz and Ben Wallberg, as well as University of Maryland iSchool students Marlin Olivier and, formerly, Sarah Hovde, have made important contributions to this project.
The following is a guest post from Julia Fernandez, this year’s NDIIPP Junior Fellow. Julia has a background in American studies and working with folklife institutions and worked on a range of projects leading up to CurateCamp Digital Culture in July. This is part of a series of interviews Julia conducted to better understand the kinds of born-digital primary sources folklorists, and others interested in studying digital culture, are making use of for their scholarship.
When Hasbro decided to reboot their 1980s “My Little Pony” franchise, who would have guessed that they would give rise to one of the most surprising and interesting fan subcultures on the web? The 2010 animated television series “My Little Pony: Friendship is Magic” has garnered an extremely loyal–and as a 2012 documentary put it, “extremely unexpected”–viewership among adult fans. Known colloquially as “bronies” (a portmanteau of “bro” and “ponies”), these fans are largely treated with fascination and confusion by the mainstream media. All of this interest has resulted in a range of scholars in different fields working to understand this cultural phenomenon.
In this installment of the NDSA Insights Interview series, I talk with Jason Nguyen and Kurt Baer. Both PhD students at Indiana University in the Department of Folklore and Ethnomusicology, Jason and Kurt decided to study this unique subculture. Their website is where they both conduct their field research, blog about their findings and invite feedback from the community.
Julia: Can you tell me a little bit more about bronies (and pegasisters)? How do they define themselves? How long have these movements been occurring and where are they communicating online? Do you have any sense of how large these communities are?
Jason: An important starting premise for us is that bronies attach a wide variety of different values and identity markers to the label of brony, imagining and experiencing their relationships to one another in multiple ways–sometimes even conflicting ones. Nonetheless, there are some shared histories that nearly all bronies will describe as specific to this community. Specifically, bronies as a concept unique from My Little Pony fandom arose out of the relaunch/reboot of the Hasbro franchise as My Little Pony: Friendship is Magic in fall 2010. Lauren Faust, particularly known to this group for her work with her husband Craig McCracken on Powerpuff Girls and Foster’s Home for Imaginary Friends, developed the idea and wrote for the show through its first two seasons, and her gender politics has a lot to do with the complex and often non-normative characterization of the ponies. Because of that, bronies will generally start with the content of the show as reason enough for being a fandom: it is smartly written and portrays a positive, socially-oriented world view. Some bronies will portray this oppositionally to other, more negative media, but at the same time, many are involved in multiple fandoms and are often fans of “darker” work as well.
In any case, the label of “brony” has a pretty specific starting point, arising out of the show’s popularity in 2010 on 4chan, which was to some extent ironic, i.e. “Haha, we’re grown men watching a little girls’ show,” though I think the irony of that moment is always overstated (since irony is a useful footing to allow a grown man to watch a little girls’ show if he so desires). Over the following year, the bronies started to overtake 4chan and were kicked out; 4chan eventually opened /mlp/ for them, but the conflict lasted for a few months and was an impetus to organize elsewhere on the web.
At this point, things get more complicated, because people who like FiM search for other fans online, but the cross-demographic appeal means that reasons for being a fan and even ways of being a fan are not necessarily shared in the way you might expect of a more homogenous group. For example, fans coming from other “geek” fandoms are used to the convention scene and fandom as a sort of genre (keeping in touch with friends online, then getting together a few times a year at a convention), but for many bronies, this is the first time they have participated in this kind of mass-mediated imagined community.
Kurt: As far as numbers go, it is really hard to tell how large the brony community is. This is partly due to the varying definitions of what makes a “brony.” However, the brony community (or communities) is quite large and very active both online and off. For instance, Bronycon, the largest brony convention, brought in over 8,000 people last year, Coder Brony’s 2014 herd census received over 18,000 responses from all around the world, and Equestria Daily is, as of now, rapidly approaching 500 million hits on their website. There are brony communities all over Facebook and Reddit (which even has multiple subreddits devoted to sorting out all of the MLP subreddits). There are very active 4chan, Twitter, SoundCloud and DeviantArt communities; brony groups on other online games ranging from Team Fortress to Minecraft to Clash of Clans; over a dozen 24-hour streaming radio stations for Brony music; and major news sites such as Equestria Daily and Everfree that link bronies to relevant information from all over the web. What’s more is that these “communities” are not discrete from one another. People bounce between platforms all of the time, sometimes between different online personas, making coming up with specific numbers very difficult.
Julia: How is your approach to studying bronies similar or different from approaches to studying other fan cultures, and for that matter, any number of other modes of participatory culture?
Jason: In a lot of ways, I don’t think the work we are doing is all that different than many ethnographic studies insofar as the basic process of participant observation is concerned. As for the field of fan/fandom studies, we have thus far not cast our work in that light, though not because of any strong feelings either way. Fandom studies has a strong thread of reception and media studies coming from a more literary and cultural studies perspective that we enjoy but it’s not our theoretical foundation (I’m thinking of Henry Jenkins’ early work, for example).
That emphasis on broad cultural production that I think is heavily influenced by the legacy of the Frankfurt School is perhaps one difference, since we are strongly ethnographic and thus more granular in our approach. That said, many scholars we might read in a fandom studies class have used ethnographic and anthropological methods as well, such as Bonnie Nardi in her great “My Life as a Night Elf Priest” about the “World of Warcraft” fandom.
Kurt: Ultimately, while we might be one of a few people researching about people and brightly colored ponies on the internet at the moment (that number is always growing), the questions that we are looking to understand and the ways that we are trying to understand them are quite similar to research coming from a long line of ethnographers dating (in the anthropological imagination, at least) all the way back to Bronislaw Malinowski. Perhaps one relatively substantial difference that we have at least been trying for, however, lies in the fact that we are trying to use the blog format to allow for more back-and-forth interaction between us and the people who we are studying/studying with than the traditional ethnographic monograph allows. While many ethnographers (such as Steven Feld in his ethnography “Sound and Sentiment”) are able to get feedback from the people they study with and incorporate that into the writing process (or at least their second editions), we have been trying to find ways to speed up that process of garnering feedback, learning from it, and using that knowledge as a means for further theorization.
Julia: You’ve stated that your blog “represents an attempt at participant-observation that collapses the boundaries between academic and interlocutor.” Can you expand on this? What are some of your goals with this blog? Why start your own blog as opposed to gathering data and engaging with bronies on their own virtual “turf,” like websites like Equestria Daily?
Kurt: One important bit of background information that I feel is important to bring up here is that Jason and I both come from fields that focus primarily upon ethnographic research, and in fact, the blog itself was started as part of a course in creative ethnography taught by Dr. Susan Lepselter that Jason and I took at Indiana University. In approaching this research ethnographically, we wanted to be able to ask questions and elicit observations from bronies themselves in addition to analyzing the various other types of “texts” such as the show itself, other websites, and pre-existing conversations. We also wanted to be clear and open about the fact that we are researchers conducting research. We figured that starting our own blog would give us the space that we needed to be able to ask questions and make observations while still being clear about our research and research objectives. Through our interactions with people on social media sites and on places such as Equestria Daily, it has been our hope that the blog becomes a space that is part of different bronies’ “turfs,” where they can go to interact with us and each other and discuss different aspects of being a brony.
As far as our attempts to collapse the boundaries between academic and interlocutor go, one of the things that drew us to the brony community in the first place is that they are already very involved in theorization about themselves and about the show. They talk about what it means to be a brony, provide deep textual analyses of the show and its themes, and grapple with the social implications of liking a show that some people think that they shouldn’t. Rather than us going into the “field,” collecting data about bronies, and then returning to write that information up in an article to be published in an academic journal, we hoped to create a space where we can theorize together and where all of the observations and ideas would be available in the same space to serve as material for more conversation and theorization.
Jason: Another way to think about this is that there is nothing more brony-like than to start a space of your own online. As Kurt has recounted above, bronies have been quite prolific in their production of cyberspaces for communal interaction, and not all of them are big like Equestria Daily. Of course there are always the YouTube stars and Twitter celebrities of any mass-media fandom, but the more mundane spaces are equally important, and the process of making a website, maintaining a Twitter profile, etc.–in short, creating a presentation of self as brony researchers amongst other people similarly engaged in a presentation of self as bronies–has been invaluable in our experience of the “participant” part of participant-observation. We both have web presences, as most bronies do before they join the fandom, but many choose to create fandom-specific identities, and that means anchoring those identities somewhere; we’ve in part chosen to anchor our brony-related identities on the website.
With all that said, we do spend a lot of time investigating bronies in other spaces and in less explicitly theoretical ways. We live-tweet (tweeting comments about something as it occurs) new episodes from time to time, which is a really fun experience that lets us interact with both fans and show staff alike. I have drawn fan art and Kurt has made fan music that we have shared via Twitter, Reddit and our site.
So we like to think that we are doing both things at the same time. Of course it is important for anyone doing anthropologically informed ethnography to meet people where they are and explore their lives as they lead them, but at the same time, many fans have shown an interest in a space where they can read about and join in conversations that marry explicit theorization with personal observations of their fandom, and the “Research Is Magic” blog produces a hybrid narrative framing that we found was not previously existing in either academic or brony fandom spaces.
Julia: One of the reasons bronies as a group are so interesting is because they appear to subvert both gender and age norms. But you argue that “an analytical orientation that positions bronies as resisters trivializes their rich social interactions and effaces complicated power dynamics within and peripheral to the fandom.” That’s some dense language! Can you unpack this a bit for us?
Kurt: Essentially, our argument here is one against the tendency to find resistance and subversion and then get carried away insisting on interpreting everything about the group in that light. There is certainly some very interesting subversion of age and gender norms going on in the fandom, but bronies are not only, or even (I would argue) primarily, resisting. Most bronies that we have talked to don’t think of themselves as being oppositional, but instead as simply liking a show that they like. While it is both productive and interesting to look at the ways that bronies are resisting gender norms, it is also very easy for academics to fall into the trap of casting everything in that light, limiting the rich and complex social interactions of bronies to a romanticized narrative about bronies rising up together and resisting the gender stereotypes of larger society.
Jason: Resistance as a concept works because of a binary opposition: X resists Y. However, multiple competing discourses may be at work and are probably not all aligned to one another. For example, earlier this year, a North Carolina school kept a nine-year-old boy from bringing his Rainbow Dash backpack to school because it was getting him bullied by other students. On one level, the reasoning on all sides is obvious. To the other boys, a boy wearing “girly” paraphernalia is ripe to be bullied. The school counselor wanted to ensure the boy’s safety, so removed what was believed to be the problem. Some parents were concerned that the boy was being punished for simply expressing himself, and that the bullies should have been punished instead. …
So, while each person appears to act in resistance according to a particular discourse of meaning, and each person may have a particular narrative, the entire scenario is complicated by these competing ideas of masculinity that intersect with ideologies of personal freedom and liberty. Rainbow Dash (the character on the backpack), for example, is clearly written as a “tomboy” character–good at sports, adventurous, daring and 20 percent cooler than you. If a boy was going to pick a character to identify with that does not break existing standards of masculinity, she would be the one; thus, insofar as male fans identify with her, they’re also identifying with characteristics that don’t challenge their heteronormativity. But she is also the one covered in rainbows, and that has a particular valence as a form of non-heteronormative imagery (e.g. LGBT rights symbolism). In short, there is a density of meaning attached to Rainbow Dash that complicates people’s responses, though I would argue that it’s that complexity and density of meaning that allows different groups to be drawn to MLP in the first place.
Kurt: The ways in which people are using the show in relation to gender norms further complicate things. While in many ways bronies are challenging gender norms through their liking the show and re-defining ideas about masculinity, in other ways many bronies are super heteronormative. While they like a show that some people think is for girls, their argument is less about the fact that gender norms need dismantling than it is about the fact that the show is written in a way that is appealing to heteronormative men and that men can still be manly while liking MLP. The World’s Manliest Brony, for instance, while going against gender norms in some ways by embracing MLP and re-enforcing the manliness of giving charitably, also reinforces them in others–leaving many ideas of masculinity intact but drawing MLP into the list of things that can be manly.
Julia: Psychologist Marsha Redden, one of the conductors of The Brony Study, stated in an interview that the fandom is a normal response to the anxiety of life in a conflict-driven time, saying “they’re tired of being afraid, tired of angst and animosity. They want to go somewhere a lot more pleasant.” Likewise, a lot of what you talk about on your blog has to do with the positivity of the actual show, how each episode has a positive message and emphasizes the importance of friendship and other values. It feels very rare that we hear something positive about bronies from the mainstream media. Can you talk a bit about this? What draws adults to the show, and to the community? What do you make of the moral panic surrounding Bronies in the mainstream media?
Jason: At the risk of sounding a little persnickety, I’d like to suggest that we invert the way we think about such causal explanations. Explanations similar to Dr. Redden’s–basically, some version of the idea that the world is a rough and cynical place and that MLP presents an alternative space, no matter how delimited or constrained, that is more trusting and open–are pretty common within the fandom as part of people’s personal narratives for why and how they became bronies (obviously, this is not true for everyone, but it’s clearly a fandom trope). In anthropology itself, scholars like Victor Turner and Max Gluckman have suggested that certain carnivalesque (to borrow Bakhtin’s term) rituals act as a kind of “safety valve” for a society to release its pent up frustrations and conflicts without destroying the order of things, and some version of that idea is laden in Redden’s theory and that of many bronies. There are many bronies who see involvement in fandom and watching the show as that safety valve.
But there are many others who narrate their experience as simply watching a show that they like–just like any other show–and, to their surprise, finding outside resistance. Indeed, we don’t expect people to explain their affinity for most elements of popular culture. You need not justify why you watch “Breaking Bad” or “Game of Thrones.”
The fact that causal explanations that answer why you are a brony are central to the narratives of many bronies does not really indicate too much about their truth value, but they are a useful indicator of where society draws its lines and how people who find themselves on the wrong sides of social lines create meaning based on their situations. Here, I’m drawing heavily on Lila Abu-Lughod’s ideas about resistance as a “diagnostic of power” that points us to the methods and configurations of power (“The Romance of Resistance: Tracing Transformations of Power Through Bedouin Women,” 1990). In this case, bronies (and researchers) find themselves having to produce narratives that can explain why they have crossed norms of gender and age appropriateness, even if they don’t live by those norms themselves. Jacob Clifton in “Geek Love: On the Matter of Bronies” does a great job arguing that, being the first generation raised by feminists, of course these young men don’t see any difference between Twilight Sparkle or Han Solo being their idols.
Kurt: Ultimately the fact that bronies have to justify why they like the show is in many ways coming from the fact that they get such negative press and draw such negative stereotypes. We haven’t done too much to tease out what actually draws people to the show, although we’ve seen many people give many different reasons as we’ve gone about our research–the good writing and production, the positive themes, the large and thriving fan community, having friends and relatives that like the show, that they just somehow liked it, etc. I’m not sure that there is necessarily one, or even a few, things inherent in the show or the fandom that draw people to it any more than there being something inherent in basketball that makes people want to watch it. There are a lot of really complex personal, psychological and socio-cultural things at work in personal preference and the reasons people give usually seem to explain less about why they like something (I couldn’t tell you why I like Carly Rae Jepsen or George Clinton) than they give culturally-determined reasons why it might be okay for them to like it.
Julia: Right now you have the benefit of both directly looking for source material on the open web, and having it come to you (through participation on your blog). Given your perspective, what kinds of online content do you think are the most critical for cultural heritage organizations to preserve for anthropologists of the future to study this moment in history?
Kurt: That’s a tough one, as even with our research on bronies I feel like everywhere I look, I see someone joining the Brony research herd with a new and different focus. Although we try to do a lot of our work by talking and collaborating directly with bronies, we’ve dealt with Twitter exchanges, media reports about MLP, message board archives, brony music collections, the show itself and just about anything that we can find where people are exchanging their ideas about the fandom. Others have dealt with collections of fanfics, sites dedicated to discussing MLP and religion, fan art, material culture and cosplay, and just about anything else you can think of. I’m always finding people who focus upon and draw insight from archives (both in the sense of actual archives and in the super-general sense of “stuff people use as the basis of their research”) that I would never have thought to use.
This being said, as someone that primarily studies expressive culture (my degree is from the department of Folklore and Ethnomusicology), I tend to place a lot of importance on it. The amount and quality of the music, art, videos, memes, stories, etc. floating around within the fandom has never ceased to astound me and was one of the primary reasons that I became attracted to the fandom in the first place. I feel like these bodies of creative works–from “My Little Dashie,” “Ponies: The Anthology,” and “Love me Cheerilee” to the Twilicane memes and crude saxophone covers of show tunes–are very important to the fandom and to those that want to understand it as scholars.
Jason: Broadly speaking, anthropologists have taken two approaches to describing the lives of others to their audience. The first is like a wide-angle lens, allowing someone to get a sense of the full scope of a social phenomenon, but it has trouble with the details and the charming little moments of creativity and agency–like fan-created fluffy ponies dancing on rainbows or background ponies portrayed as anthropologists studying humankind. Archival work needs that little-bit-of-everything for context, but it also needs a macro lens that can capture more of those particular and special moments. In anthropology, it might be akin to the difference between Malinowski’s epic “Argonauts of the Western Pacific”–a sprawling work that tried to introduce the entirety of a culture to us–and something like Anthony Seeger’s “Why Suyá Sing,” which performed the humbler, but no less impressive, task of letting us experience the nuances of a single ritual.
Since we can’t archive every little thing to that level of detail … we have to make choices, and that’s where bronies themselves are the best guides. What moments mattered to them, and “where” in cyberspace did they experience those moments? For a concrete example, the moment Twilight Sparkle gained her wings and became an alicorn princess (she was previously just a unicorn…thanks M.A. Larson) was particularly salient in the community, suggesting for some fans Hasbro’s stern hand manipulating the franchise. While there are some other similar instances, the unique expressions through Twitter, Reddit, YouTube, Tumblr, etc. during and immediately following the Season 3 episode “Magical Mystery Cure” (when that transformation occurs) provide a really important look into what holds meaning for this fandom.
On a technical level, I think that means being able to follow links surrounding particular events to multiple levels of depth across multiple media modalities.
Julia: If librarians, archivists and curators wanted to learn more about approaches like yours what examples of other scholars’ work would you suggest? It would be great if you could mention a few other scholars’ work and explain what you think is particularly interesting about their approaches.
Jason: One place to start is to consider what the cultural artifact is and what it is we are analyzing, interpreting, preserving, archiving, etc., because it is not, ethnographically speaking, simply media that we are studying. As Mary Gray has insisted, we should “de-center media as the object of analysis,” instead looking at what that media means and how it is contextualized. For the archivist or curator, I think that means figuring out how people come to understand media and how they attach particular ideologies to it. Ilana Gershon’s “The Breakup 2.0” and her work on “media ideology” broadly are great examples of shifting our attention so that we can hold both the “text” and “context” in view simultaneously.
Another example is danah boyd’s recent study of young people and their social media use, “It’s Complicated,” in which she inverts older people’s assumptions that teenagers’ social media use is crippling their ability to socialize, instead arguing that the constant texting and messaging indicates a desire to connect with one another that is born out of frustration with the previous generation’s (over-)protectiveness: truancy and loitering law, curfews, school busing, constant organized activity, etc. She arrives at that conclusion not only by studying teens’ messages, but by analyzing the historical conditions that produce the very different concerns of teens and their parents.
Kurt: As far as our approach goes, we’ve also been influenced by scholars working creatively with ethnography as a form or working just outside of its purview. We’ve brought up Kathleen Stewart’s “Ordinary Affects” in our blog and academic papers several times because it has been extremely influential upon both of us through its attempt to understand and express the ordinary moments in people’s lives that, while not unusual, per se, seem to have a weight to them that moves them somewhere in some direction–the little moments that are both ordinary and extraordinary, nondescript and meaningful. Susan M. Schultz’ “Dementia Blog” also comes to mind. While it isn’t necessarily an ethnography, per se, Schultz utilized blogging and its unique structural features (namely, that newer posts come first so that reading the blog in order is actually going backwards in time) as a means of looking into the poetics and tragic beauty of dementia while also expressing and understanding her own feelings as her mother’s mental illness progressed.
Jason: We are not too familiar with scholars who are interacting with fans in precisely the way that we are (or whether there are any), though it is important to be aware of the term “aca-fan” (academic fan) in fandom studies and some of the works being produced under that rubric. Henry Jenkins titles his website “Confessions of an Aca-Fan,” for example, and writes for an audience that includes both scholars and people interested in fandoms in general. The online journal Flow is another example that is somewhat more closely related to our blog, expressly attempting to link scholars with members of the public interested in talking about television. I’m also personally influenced by the work of Michael Wesch and Kembrew McLeod, both scholars who attempt to engage their students and the public in novel ways using media and technology.
The ability of websites to bypass privacy settings with “canvas fingerprinting” has caused quite a bit of concern, and it’s become a hot topic on the Code4lib mailing list. Let’s take a quick look at it from a technical standpoint. It is genuinely disturbing, but it’s not the unstoppable form of scrutiny some people are hyping it as.
The best article to learn about it from is “Pixel Perfect: Fingerprinting Canvas in HTML5,” by Keaton Mowery and Hovav Shacham at UCSD. It describes the basic technique and some implementation details.
Either canvas API (the 2D context or WebGL) lets you draw objects, not just pixels, to a browser. These include geometric shapes, color gradients, and text. The details of drawing are left to the client, so they will be drawn slightly differently depending on the browser, operating system, and hardware. This wouldn’t be too exciting, except that the API can read the pixels back. The getImageData method of the 2D context returns an ImageData object, which is a pixel map. This can be serialized (e.g., as a PNG image) and sent back to the server from which the page originated. For a given set of drawing commands and hardware and software configuration, the pixels are consistent.
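To make the read-back step a little more concrete, here is a minimal sketch of what could happen to the pixels once they reach the server. It is written in Python rather than browser JavaScript, and it assumes the pixel map has arrived as raw RGBA bytes. The function name fingerprint_pixels and the two toy buffers are hypothetical, and SHA-256 is just one reasonable digest to pick, not something the paper prescribes.

```python
import hashlib


def fingerprint_pixels(rgba: bytes, width: int, height: int) -> str:
    """Reduce a raw RGBA pixel map to a short, comparable identifier."""
    if len(rgba) != width * height * 4:
        raise ValueError("buffer size does not match width * height * 4")
    digest = hashlib.sha256()
    # Mix in the dimensions so the canvas shape is part of the identifier.
    digest.update(f"{width}x{height}:".encode("ascii"))
    digest.update(rgba)
    return digest.hexdigest()


# Two toy 2x1 canvases (hypothetical data): identical except for one
# channel of one pixel, the kind of tiny rendering difference at issue.
canvas_a = bytes([255, 0, 0, 255, 0, 0, 255, 255])
canvas_b = bytes([255, 0, 0, 255, 0, 0, 254, 255])

# Same pixels in, same identifier out; one changed byte gives a new one.
same_config = fingerprint_pixels(canvas_a, 2, 1) == fingerprint_pixels(canvas_a, 2, 1)
different_config = fingerprint_pixels(canvas_a, 2, 1) != fingerprint_pixels(canvas_b, 2, 1)
```

The point of hashing is only convenience: the server does not need to keep or compare whole images, just a short string per visitor.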
Drawing text is one way to use a canvas fingerprint. Modern browsers use a programmatic description of a font rather than a bitmap, so that characters will scale nicely. The fine details of how edges are smoothed and pixels interpolated will vary, perhaps not enough for any user to notice, but enough so that reading back the pixels will show a difference.
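A toy simulation can show why edge smoothing matters. The sketch below does not render real fonts; it assumes a made-up rasterizer, render_edge, that shades one side of a slanted line and smooths the edge by averaging sample points inside each pixel. Two different sample counts stand in for two different browser, operating system and hardware stacks. Everything here is hypothetical and only meant to illustrate the idea.

```python
import hashlib


def render_edge(size: int, samples: int) -> bytes:
    """Shade the half-plane y >= 0.3 * x into a size x size grayscale image,
    smoothing the edge by averaging samples x samples points per pixel."""
    pixels = bytearray()
    for row in range(size):
        for col in range(size):
            covered = 0
            for sy in range(samples):
                for sx in range(samples):
                    # Sample point at the centre of each sub-cell of the pixel.
                    x = col + (sx + 0.5) / samples
                    y = row + (sy + 0.5) / samples
                    if y >= 0.3 * x:
                        covered += 1
            pixels.append(round(255 * covered / (samples * samples)))
    return bytes(pixels)


def digest(pixels: bytes) -> str:
    return hashlib.sha256(pixels).hexdigest()


renderer_a = render_edge(16, samples=2)  # stand-in for one rendering stack
renderer_b = render_edge(16, samples=4)  # stand-in for another

# Count the pixels that differ and how far apart the gray levels are.
differing = sum(1 for a, b in zip(renderer_a, renderer_b) if a != b)
max_delta = max(abs(a - b) for a, b in zip(renderer_a, renderer_b))
```

Only pixels along the edge differ between the two renderings, and the rest of the image is identical, yet the digests of the two images do not match. That is the whole trick: differences too small to notice by eye are still differences once the pixels are read back.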
Was including getImageData in the spec a mistake? This can be argued both ways. Its obvious use is to draw a complex canvas once and then rubber-stamp it if you want it to appear multiple times; this can be faster than repeatedly drawing from scratch. It’s unlikely, though, that the designers of the spec thought about its privacy implications.
The following is a guest post from Julia Fernandez, this year’s NDIIPP Junior Fellow. Julia has a background in American studies and working with folklife institutions and worked on a range of projects leading up to CurateCamp Digital Culture in July. This is part of a series of interviews Julia conducted to better understand the kinds of born-digital primary sources folklorists, and others interested in studying digital culture, are making use of for their scholarship.
Online communities, and their digital records, can be a rich source of information, invaluable to academic researchers and to market researchers. In this installment of the Insights Interviews series, I’m delighted to talk with Robert V. Kozinets, professor of marketing at York University in Toronto and the originator of “netnography.”
Julia: In your book “Netnography: Doing Ethnographic Research Online,” you define “netnography” as a “qualitative method devised specifically to investigate the consumer behavior of cultures and communities present on the Internet.” Can you expand a bit on that definition for us? What is it about online communities that warrants minting a new word for doing ethnographic work online? Further, how would you compare and contrast your approach to other terms like “virtual ethnography”?
Robert: It’s a great question, and one that is difficult to do justice to in a short interview. For readers who are aware of the anthropological technique of ethnography, or participant-observation, it may be fairly easy to grasp that ethnographic work can also be performed in online or social media environments. However, doing ethnographic work on the combination of digital archives and semi-real-time conversations, and much more, that is the Internet is a bit different from, say, traveling to Outer Mongolia to learn about how people live there. The online environment is technologically mediated, it is instantly archived, it is widely accessible, and it is corporately controlled and monitored in ways that face-to-face behavior is not. Netnography is simply a way to approach ethnography online, and it could just as easily be called “virtual,” “digital,” “web,” “mobile” or other kinds of ethnography. The difference, I suppose, is that netnography has been associated with particular research practices in a way that these other terms are not.
Julia: You began implementing netnography as a research method in 1995. The web has changed a good bit since you started doing this work nearly twenty years ago. How has the continued development of web applications and software changed or altered the nature of doing netnographic research? In particular, has the increased popularity of social media (Facebook, Twitter) changed work in studying online communities?
Robert: This is a little like asking an experimental researcher if the experiments they run are different if they are running them on children or old people, or if they are experimenting on prisoners in a prison, or students at a party. It is a tactical and operational issue. The guiding principles of netnography are exactly the same whether it is a bulletin board, a blog or Facebook. Fundamental questions of focus, data collection, immersion and participation, analysis, and research presentation are identical.
Julia: How do you suggest finding communities online outside of the relatively basic search operations offered by Google and Yahoo? What are some signs that a particular online community will be a good source for netnographic research?
Robert: There are many search tools that are available, but there is no particular need to go beyond Google or Yahoo. The two keys to netnography are finding particularly interesting and relevant data amongst the load of existing data, and paying particular attention to one’s own role and consciousness as participant in the research process. Whatever tools one chooses to work with, this is time-consuming, painstaking and rewarding work. One thing I would love search engines to be able to do is to include and tag visual, audio and audiovisual material. It would be wonderful to have a search engine that spat out results to a search and gave me, along with website, blog and forum links, a full list of links to Instagram photos, YouTube videos and iTunes podcasts.
Julia: Throughout the book, you reinforce the point that the key to generating insight in netnography is building trust. Can you unpack that a bit? What are some ethical concerns researchers should keep in mind when conducting ethnographic research?
Robert: A range of ethical concerns have been raised about the use of Internet data, many of which have proven over the years to be non-starters. Notions of informed consent can be difficult online, and ethical imperatives can be difficult in environments where the line between public and private is so unclear. However, disclosure of the researcher or the research is not always necessary–it depends always upon the context. As with any research ethics question, it is generally a question of weighing potential benefits against potential risks.
Julia: From your perspective as an ethnographer and market researcher, what kinds of online content do you think are the most critical for cultural heritage organizations to preserve for researchers of the future to study this moment in history? Collecting and preserving content isn’t your area, but I’d be interested to hear whether you think there are particular subcultures, movements or content that aren’t getting enough attention.
Robert: I have used the Wayback Machine from time to time to look at snapshots of the Internet of the past. I also recall a recent research project in which we studied bloggers, and in which some interesting blog material was removed shortly after it was posted. It survived only in our fieldnotes, but we had not archived it. Of course, it would be nice to be able to instantly retrieve “the data that got away.” However, in my research, it is the immediate experience of the Internet which matters most.
Given the rapid spread of social media, I believe that the present holds far more information and insight than any other time in the past. There are so many archives of so many particular groups already, and those archives are, in themselves, rather revealing cultural artifacts. The ones I find the most fascinating to study are the archives that groups make of their own activities. So, to answer your last question, I suppose that, to answer a library sciences question, I would be more interested to see the archives that library science people construct about library science and how they represent themselves to themselves and to wider audiences of assumed “others” than I would about how library science people represent any other group.
Julia: Aside from what to collect, I would be curious to learn a bit more about what kinds of access you think researchers studying digital culture are going to want to have to these collections. How much of this do you think will be focused on close reading of what individual pages and sites looked like and how much on bulk analysis of materials as data?
Robert: I think researchers are hungry for everything. If you ask typical researchers what data they want, they will say everything. That is because, without a specific focus or research question, you want to keep all of your options open. Then the problem becomes what they do with all this data, and they end up with all sorts of big data methods that try to fit as much data as possible into models. My approach is a bit different, in that I am searching for individual experiences online that generate insight. This could come from masses of data, or from one page, one site, even one photograph or one video clip. I think the question of access is tied up with questions of categorizing, interpretation and ownership, and these are all interesting and complex matters that lend themselves to a lot more thought and debate. In the short- to medium-term, what is currently available on the Internet is certainly more than enough for me to work with.
The OPF is holding a PDF event in Hamburg on 1st-2nd September 2014 where we'll be taking an in-depth look at the PDF format, its sub-flavours like PDF/A and open source tools that can help. This is a quick list of things you can do to prepare for the event if you're attending and looking to get the most out of it.
Pre-reading
Johan van der Knijff's OPF blog has a few interesting posts on PDF preservation risks:
- PDF - Inventory of long-term preservation risks links to a report on the same subject. This is written from a preservation point of view and despite Johan admitting it's incomplete (see blog post) it's still a good overview of the format and associated preservation issues.
- Identification of PDF preservation risks with Apache Preflight: a first impression examines the use of Apache PDFBox's Preflight module to detect preservation risks. Again it links to a report written as part of the SCAPE project. PDFBox is one of the tools we'll be looking at in Hamburg, see the Tools section later in this post.
Tools
Below are brief details of the main open source tools we'll be working with. It's not essential that you download and install these tools. They all require Java and none of them have user friendly install procedures. We'll be looking at ways to improve that at the event. We'll also be providing a pre-configured virtual environment to allow you to experiment in a friendly, throw away environment. See the Software section a little further down.
JHOVE
JHOVE is an open source tool that performs format specific identification, characterisation and validation of digital objects. JHOVE can identify and validate PDF files against the PDF specification while extracting technical and descriptive metadata. JHOVE recognises PDFs that state that they conform to the PDF/A profile, but it can't then validate that a PDF conforms to the PDF/A specification.
- Official Website: JHOVE website on SourceForge
- Licensing: LGPL v2.1
- Version: v1.11 released 09/2013
- Download: SourceForge
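To make the difference between a PDF that states PDF/A conformance and one that has been validated a little more concrete, here is a minimal Python sketch. It is not how JHOVE works internally; it assumes only that the claim is recorded in the file's XMP metadata using the `pdfaid:part` and `pdfaid:conformance` properties, and that the metadata is stored as plain, unfiltered text. The function name `claimed_pdfa_flavour` is made up for illustration.

```python
import re

# PDF/A identification properties in an XMP packet, in either element form
# (<pdfaid:part>1</pdfaid:part>) or attribute form (pdfaid:part="1").
_PART = re.compile(rb'pdfaid:part\s*(?:=\s*["\']|>)\s*(\d)')
_CONF = re.compile(rb'pdfaid:conformance\s*(?:=\s*["\']|>)\s*([ABUabu])')


def claimed_pdfa_flavour(data: bytes):
    """Return the PDF/A flavour a file claims (e.g. '1b'), or None.

    This only reads the claim. It makes no attempt to check that the
    file really meets the PDF/A requirements.
    """
    # Assumption: a PDF header appears within the first 1024 bytes.
    if b"%PDF-" not in data[:1024]:
        return None
    part = _PART.search(data)
    if part is None:
        return None
    conf = _CONF.search(data)
    level = conf.group(1).decode("ascii").lower() if conf else ""
    return part.group(1).decode("ascii") + level


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        print(claimed_pdfa_flavour(handle.read()))
```

A file can pass this check and still break the PDF/A rules, which is exactly the gap a dedicated PDF/A validator is meant to fill.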
The Apache Foundation's Tika project is an application / toolkit that can be used to identify, parse, extract metadata, and extract content from many file formats.
- Official Website: Apache Tika Home
- Licensing: Apache License v2.0
- Version: v1.5 released 02/2014
- Download: Apache Tika download page
Written in Java, Apache PDFBox is an open source library for working with PDF documents. It's primarily aimed at developers but has some basic command line apps. PDFBox also includes a module, with its own command line utility, that verifies PDF/A-1 documents.
These libraries are of particular interest to Java developers, who can incorporate them into their own programs. Apache Tika uses the PDFBox libraries for PDF parsing.
- Official Website: Apache PDFBox Home
- Licensing: Apache License v2.0
- Version: v1.8.6 released 06/2014
- Download: Apache PDFBox download page
These test data sets were chosen because they're freely available. Again it's not necessary to download them before attending but they're good starting points for testing some of the tools or your code:
PDFs from GovDocs selected dataset
The original GovDocs corpus is a test set of nearly 1 million files and is nearly half a terabyte in size. The corpus was reduced in size by David Tarrant, who removed similar items, as described in this post. The remaining data set is still large at around 17GB and can be downloaded here.
Isartor PDF/A test suite
The Isartor test suite is published by the PDF Association's PDF/A competency centre. In their own words:
This test suite comprises a set of files which can be used to check the conformance of software regarding the PDF/A-1 standard. More precisely, the Isartor test suite can be used to “validate the validators”: It deliberately violates the requirements of PDF/A-1 in a systematic way in order to check whether PDF/A-1 validation software actually finds the violations.
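The "validate the validators" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `validator` stands in for whatever tool you are testing, wrapped as a function that returns True when it accepts a file, and `known_invalid` is a list of paths to files that deliberately break PDF/A-1, such as those in the test suite. Both names, and `false_passes` itself, are hypothetical.

```python
from typing import Callable, Iterable, List


def false_passes(validator: Callable[[str], bool],
                 known_invalid: Iterable[str]) -> List[str]:
    """Return the known-invalid files that the validator wrongly accepts."""
    return [path for path in known_invalid if validator(path)]


if __name__ == "__main__":
    # Hypothetical stand-in validator that accepts anything ending in .pdf,
    # so every deliberately broken file slips through.
    def naive(path: str) -> bool:
        return path.endswith(".pdf")

    for missed in false_passes(naive, ["broken-1.pdf", "broken-2.pdf"]):
        print("validator missed a violation in", missed)
```

An empty result means the validator caught every deliberate violation it was shown. It says nothing about how the validator treats valid files.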
The OPF has a GitHub repository where members can upload files that represent preservation risks / problems. This has a couple of sub-collections of PDFs: these show problem PDFs from the GovDocs corpus and this is a collection of PDFs with features that are "undesirable" in an archive setting.
Software
If you'd like the chance to get hands-on with the software tools at the event and try some interactive demonstrations / exercises we'll be providing light virtualised demonstration environments using VirtualBox and Vagrant. It's not essential that you install the software to take part but it does offer the best way to try things for yourself, particularly if you're not a techie. These are available for Windows, Mac, and Linux and should run on most people's laptops. Download links are shown below.
Be sure to install the VirtualBox extensions as well; it's the same download for all platforms.
What next?
I'll be writing another post for Monday 18th August that will take a look at using some of the tools and test data together with a brief analysis of the results. This will be accompanied by a demonstration virtual environment that you can use to repeat the tests and experiment yourself.
The following is a guest post from Julia Fernandez, this year’s NDIIPP Junior Fellow. Julia has a background in American studies and working with folklife institutions and worked on a range of projects that led up to CurateCamp Digital Culture in July. This is part of an ongoing series of interviews Julia conducted to better understand the kinds of born-digital primary sources folklorists, and others interested in studying digital culture, are making use of for their scholarship.
How do teens use the internet? For researchers, reporters and concerned parents alike, that question has never been more relevant. Many adults can only guess, or extrapolate based on news reports or their own social media habits. But researcher danah boyd took an old-fashioned but effective approach: she asked them.
I’m delighted to continue our ongoing Insights Interview series today with danah, a principal researcher at Microsoft Research, a research assistant professor in media, culture and communication at New York University, and a fellow at Harvard’s Berkman Center for Internet & Society. For her new book It’s Complicated: The Social Lives of Networked Teens, she spent about eight years studying how teens interact both on- and off-line.
Julia: The preface to your latest book ends by assuring readers that “by and large, the kids are all right.” What do you mean by that?
danah: To be honest, I really struggle with prescriptives and generalizations, but I had to figure out how to navigate those while writing this book. But this sentence is a classic example of me trying to add nuance to a calming message. What I really mean by this – and what becomes much clearer throughout the book – is that the majority of youth are as fine as they ever were. They struggle with stress and relationships. They get into trouble for teenage things and aren’t always the best equipped for handling certain situations. But youth aren’t more at-risk than they ever were. At the same time, there are some youth who are seriously not OK. Keep in mind that I spend time with youth who are sexually abused and trafficked for a different project. I don’t want us to forget that there are youth out there that desperately need our attention. Much to my frustration, we tend to focus our attention on privileged youth, rather than the at-risk youth who are far more visible today because of the internet than ever before.
Julia: In a recent article you stated that “social media mirror, magnify, and complicate countless aspects of everyday life, bringing into question practices that are presumed stable and shedding light on contested social phenomena.” Can you expand a bit on this?
danah: When people see things happening online that feel culturally unfamiliar to them, they often think it’s the internet that causes it. Or when they see things that they don’t like – like bullying or racism – they think that the internet has made it worse. What I found in my research is that the internet offers a mirror to society, good, bad and ugly. But because that mirror is so publicly visible and because the dynamics cross geographic and cultural boundaries, things start to get contorted in funny ways. And so it’s important to look at what’s happening underneath the aspect that is made visible through the internet.
Julia: In a recent interview you expressed frustration with how, in the moral panic surrounding social media, “we get so obsessed with focusing on relatively healthy, relatively fine middle- and upper-class youth, we distract ourselves in ways that don’t allow us to address the problems when people actually are in trouble.” What’s at stake when adults and the media misunderstand or misrepresent teen social media use?
danah: We live in a society and as much as we Americans might not like it, we depend on others. If we want a functioning democracy, we need to make sure that the fabric of our society is strong and healthy. All too often, in a country obsessed with individualism, we lose track of this. But it becomes really clear when we look at youth. Those youth who are most at-risk online are most at-risk offline. They often come from poverty or experience abuse at home. They struggle with mental health issues or have family members who do. These youth are falling apart at the seams and we can see it online. But we get so obsessed with protecting our own children that we have stopped looking out for those in our communities that are really struggling, those who don’t have parents to support them. The urban theorist Jane Jacobs used to argue that neighborhoods aren’t safe because you have law enforcement policing them; they are safe because everyone in the community is respectfully looking out for one another. She talked about “eyes on the street,” not as a mechanism of surveillance but as an act of caring. We need a lot more of that.
Julia: You conduct research on teen behaviors both on and offline. How are physical environments important to understanding mediated practices? What are the limitations to studying online communities solely by engaging with them online?
danah: We’ve spent the last decade telling teenagers that strangers are dangerous, that anyone who approaches them online is a potential predator. I can’t just reach out to teens online and expect them to respond to me; they think I’m creepy. Thus, I long ago learned that I need to start within networks of trust. I meet youth through people in their lives, working networks to get to them so that they will trust me and talk about their lives with me. In the process, I learned that I get a better sense of their digital activities by seeing their physical worlds first. At the same time, I do a lot of online observation and a huge part of my research has been about piecing together what I see online with what I see offline.
Julia: Researchers interested in young people’s social media use today can directly engage with research participants and a wealth of documentation over the web. When researchers look back on this period, what do you think are going to be the most critical source material for understanding the role of social media in youth culture? In that vein, what are some websites/data sets and other kinds of digital material that you think would be invaluable for future researchers to have access to for studying teen culture of today 50 years from now?
danah: Actually, to be honest, I think that anyone who looks purely at the traces left behind will be missing the majority of the story. A lot has changed in the decade in which I’ve been studying youth, but one of the most significant changes has to do with privacy. When I started this project, American youth were pretty forward about their lives online. By the end, even though I could read what they tweeted or posted on Instagram, I couldn’t understand it. Teens started encoding content. In a world where they can’t restrict access to content, they restrict access to meaning. Certain questions can certainly be asked of online traces, but meaning requires going beyond traces.
Julia: Alongside your work studying networked youth culture, you have also played a role in ongoing discussions of the implications of “big data.” Recognizing that researchers now and in the future are likely going to want to approach documentation and records as data sets, what do you think are some of the most relevant issues from your writing on big data for cultural heritage institutions to consider about collecting, preserving and providing access to social media, and other kinds of cultural data?
danah: One of the biggest challenges that archivists always have is interpretation. Just because they can access something doesn’t mean they have the full context. They work hard to piece things together to the best that they can, but they’re always missing huge chunks of the puzzle. I’m always amazed when I sit behind the Twitter firehose to see the stream of tweets that make absolutely no sense. I think that anyone who is analyzing this data knows just how dirty and confusing it can be. My hope is that it will force us to think about who is doing the interpreting and how. And needless to say, there are huge ethical components to that. This is at the crux of what archivists and cultural heritage folks do.
Julia: You’ve stated that “for all of the attention paid to ‘digital natives’ it’s important to realize that most teens are engaging with social media without any deep understanding of the underlying dynamics or structure.” What role can cultural heritage organizations play in facilitating digital literacy learning?
danah: What I love about cultural heritage organizations is that they are good at asking hard questions, challenging assumptions, questioning interpretations. That honed skill is at the very center of what youth need to develop. My hope is that cultural heritage organizations can go beyond giving youth the fruits of their labor and invite them to develop these skills. These lessons don’t need to be internet-specific. In many ways, they’re a part of what it means to be critically literate, period.
The August Library of Congress Digital Preservation Newsletter is now available:
- Digital Preservation 2014: It’s a Thing
- Preserving Born Digital News
- LOLCats and Libraries with Amanda Brennan
- Digital Preservation Questions and Answers
- End-of-Life Care for Aging, Fragile CDs
- Education Program updates
- Interviews with Henry Jenkins and Trevor Blank
- More on Digital Preservation 2014
- NDSA News, and more
A feeling came over me, the same horrified realization the translator of To Serve Man had: “It’s a cookbook!” It wasn’t designed to let you learn how the software works, but to get you turning out code as quickly as possible. There are too many of these books, designed for developers who think that understanding the concepts is a waste of time. Or maybe the fault belongs less to the developers than to managers who want results immediately.
A book that introduces a programming language or API needs to start with the lay of the land. What are its basic concepts? How is it different from other approaches? It has to get the terminology straight. If it has functions, objects, classes, properties, and attributes, make it clear what each one is. There should be examples from the start, so you aren’t teaching arid theory, but you need to follow up with an explanation.
If you’re writing an introduction to Java, your “Hello world” example probably has a class, a main() function, and some code to write to System.out. You should at least introduce the concepts of classes, functions, and importing. That’s not the place to give all the details; the best way to teach a new idea is to give a simple version at first, then come back in more depth later. But if all you say is “Compile and run this code, and look, you’ve got output!” then you aren’t doing your job. You need to present the basic ideas simply and clearly, promise more information later, and keep the promise.
Don’t jump into complicated boilerplate before you’ve covered the elements it’s made of. The point of the examples should be to teach the reader how to use the technology, not to provide recipes for specific problems. The problem the developer has to solve is rarely going to be the one in the book. They can tinker with the examples until they fit their own problem, not really understanding them, but that usually results in complicated, inefficient, unmaintainable code.
Expert developers “steal” code too, but we know how it works, so we can take it apart and put it back together in a way that really suits the problem. The books we can learn from are the ones that put the “how it works” first. Cookbooks are useful too, but we need them after we’ve learned the tech, not when we’re trying to figure it out.
The following is a guest post from David Gibson, a moving image technician in the Library of Congress. He was previously interviewed about the Library of Congress video games collection.
The discovery of that which has been lost or previously unattainable is one of the driving forces behind the archival profession and one of the passions the profession shares with the gaming community. Video game enthusiasts have long been fascinated by unreleased games and “lost levels,” gameplay levels which are partially developed but left out of the final release of the game. Discovery is, of course, a key component to gameplay. Players revel in the thrill of unlocking the secret door or uncovering Easter eggs hidden in the game by developers. In many ways, the fascination with obtaining access to unreleased games or levels brings this thrill of discovery into the real world. In a recent article written for The Atlantic, Heidi Kemps discusses the joy in obtaining online access to playable lost levels from the 1992 Sega Genesis game, Sonic The Hedgehog 2, reveling in the fact that access to these levels gave her a glimpse into how this beloved game was made.
Since 2006, the Moving Image section of the Library of Congress has served as the custodial unit for video games. In this capacity, we receive roughly 400 video games per year through the Copyright registration process, about 99% of which are physically published console games. In addition to the games themselves we sometimes receive ancillary materials, such as printed descriptions of the game, DVDs or VHS cassettes featuring excerpts of gameplay, or the occasional printed source code excerpt. These materials are useful, primarily for their contextual value, in helping to tell the story of video game development in this country and are retained along with the games in the collection.
Several months ago, while performing an inventory of recently acquired video games, I happened upon a DVD-R labeled Duke Nukem: Critical Mass (PSP). My first assumption was that the disc, like so many others we have received, was a DVD-R of gameplay. However, a line of text on the Copyright database record for the item intrigued me. It reads: Authorship: Entire video game; computer code; artwork; and music. I placed the disc into my computer’s DVD drive to discover that the DVD-R did not contain video, but instead a file directory, including every asset used to make up the game in a wide variety of proprietary formats. Upon further research, I discovered that the PlayStation Portable version of Duke Nukem: Critical Mass was never actually released commercially and was in fact a very different beast from the Nintendo DS version of the game which did see release. I realized then that in my computer was the source disc used to author the UMD for an unreleased PlayStation Portable game. I could feel the lump in my throat. I felt as though I had solved the wizard’s riddle and unlocked the secret door.
The first challenge involved finding a way to access the proprietary Sony file formats contained within the disc, including, but not limited to, graphics files in .gim format and audio files in .AT3 format. I enlisted the aid of Packard Campus Software Developer Matt Derby and we were able to pull the files off of the disc and get a clearer sense of the file structure contained within. Through some research on various PSP homebrew sites we discovered Noesis, a program that would allow us to access the .gim and .gmo files which contain the 3D models and textures used to create the game’s characters and 3D environments. With this program we were able to view a complete 3D view of Duke Nukem himself, soaring through the air on his jetpack and a pre-composite 3D model of one of the game’s nemeses, the Pig Cops. Additionally, we employed Mediacoder and VLC in order to convert the Sony .AT3 (ATRAC3) audio files to MP3 in order to have access to the game’s many music cues.
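For anyone facing a similar disc, a first pass at "getting a sense of the file structure" can be as simple as counting files by extension. The Python sketch below is not what we ran at the Library; it assumes only that the disc contents have already been copied to a local directory, and the function name `extension_counts` is made up for illustration.

```python
from collections import Counter
from pathlib import Path


def extension_counts(root):
    """Count files under root by lower-cased extension ('' if there is none)."""
    return Counter(
        path.suffix.lower()
        for path in Path(root).rglob("*")
        if path.is_file()
    )


if __name__ == "__main__":
    import sys

    # Print the most common extensions first.
    for extension, count in extension_counts(sys.argv[1]).most_common():
        print(f"{extension or '(none)'}\t{count}")
```

A tally like this shows quickly which unfamiliar formats dominate a disc and are worth researching first.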
Perhaps the most exciting discovery came when we used a hex editor to access the ASCII text held in the boot.bin file in the disc’s system directory. Here we located the full text and credit information for the game along with a large chunk of un-obfuscated software code. However, much of what is contained in this file was presented as compiled binaries. It is my hope that access to both the compiled binaries and ASCII code will allow us to explore future preservation options for video games. Such information becomes even more vital in the case of games such as this Duke Nukem title which were never released for public consumption. In many ways, this source disc can serve as an exemplary case as we work to define preferred format requirements for software received by the Library of Congress. Ultimately, I feel that access to the game assets and source code will prove to be invaluable both to researchers who are interested in game design and mechanics and to any preservation efforts the Library may undertake.
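Reading ASCII text out of a binary with a hex editor can also be approximated in code. The Python sketch below pulls runs of printable ASCII out of arbitrary bytes, much as the Unix `strings` utility does. It is an illustration only, not the process we followed; the function name `ascii_runs` and the four-character minimum are assumptions made for the example.

```python
import re


def ascii_runs(data: bytes, min_len: int = 4):
    """Return runs of printable ASCII that are at least min_len bytes long."""
    # Bytes 0x20 (space) through 0x7e (~) are the printable ASCII range.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        for text in ascii_runs(handle.read()):
            print(text)
```

Raising `min_len` cuts down on the short, accidental matches that compiled code tends to produce.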
Providing access to the disc’s content to researchers will, unfortunately, remain a challenge. As mentioned above, it was difficult enough for Library of Congress staff to view the proprietary formats found on the disc before seeking help from the homebrew community. The legal and logistical hurdles related to providing access to licensed software will continue to present themselves as we move forward but I hope that increased focus on the tremendous research value of such digital assets will allow for these items to be more accessible in the future. For now the assets and code will be stored in our digital archive at the Packard Campus in Culpeper and the physical disc will be stored in temperature-controlled vaults.
The source disc for the PSP version of Duke Nukem: Critical Mass stands out in the video game collection of the Library of Congress as a true digital rarity. In Doug Reside’s recent article “File Not Found: Rarity in the Age of Digital Plenty” (pdf), he explores the notion of source code as manuscript and the concept of digital palimpsests that are created through the various layers that make up a Photoshop document or which are present in the various saved “layers” of a Microsoft Word document. The ability to view the pre-compiled assets for this unreleased game provides a similar opportunity to view the game as a work-in-progress, or at the very least to see the inner workings and multiple layers of a work of software beyond what is presented to us in the final, published version. In my mind, receiving the source disc for an unreleased game directly from the developer is analogous to receiving the original camera negative for an unreleased film, along with all of the separate production elements used to make the film. The disc is a valuable evidentiary artifact and I hope we will see more of its kind as we continue to define and develop our software preservation efforts.
The staff of the Moving Image section would love the opportunity to work with more source materials for games and I hope that game developers who are interested in preserving their legacy will be willing to submit these kinds of materials to us in the future. Though source discs are not currently a requirement for copyright, they are absolutely invaluable in contributing to our efforts towards stewardship and long term access to the documentation of these creative works.
Special thanks to Matt Derby for his assistance with this project and input for this post.
Back in late June I attended the National Geospatial Advisory Committee (NGAC) meeting here in DC. NGAC is a Federal Advisory Committee sponsored by the Department of the Interior under the Federal Advisory Committee Act. The committee is composed of (mostly) non-federal representatives from all sectors of the geospatial community and features very high profile participants. For example, ESRI founder Jack Dangermond, the 222nd richest American, has been a member since the committee was first chartered in 2008 (his term has since expired). Current committee members include the creator of Google Earth (Michael Jones) and the founder of OpenStreetMap (Steve Coast).
So what is the committee interested in, and how does it coincide with what the digital stewardship community is interested in? There are a number of noteworthy points of intersection:
- In late March of this year the FGDC released the “National Geospatial Data Asset Management Plan – a Portfolio Management Implementation Plan for the OMB Circular A–16” (pdf). The plan “lays out a framework and processes for managing Federal NGDAs [National Geospatial Data Assets] as a single Federal Geospatial Portfolio in accordance with OMB policy and Administration direction. In addition, the Plan describes the actions to be taken to enable and fulfill the supporting management, reporting, and priority-setting requirements in order to maximize the investments in, and reliability and use of, Federal geospatial assets.”
- Driven by the release of the NGDA Management Plan, a baseline assessment of the “maturity” of various federal geospatial data assets is currently under way. This includes identifying dataset managers, identifying the sources of data (fed only/fed-state partnerships/consortium/etc.) and determining the maturity level of the datasets across a variety of criteria. With that in mind, several “maturity models” and reports were identified that might prove useful for future work in this area. For example, the state of Utah AGRC has developed a one-page GIS Data Maturity Assessment; the American Geophysical Union has a maturity model for assessing the completeness of climate data records (behind a paywall, unfortunately); the National States Geographic Information Council has a Geospatial Maturity Assessment; and the FGDC has an “NGDA Dataset Maturity Annual Assessment Survey and Tool” that is being developed as part of their baseline assessment. These maturity models have a lot in common with the NDSA Levels of Preservation work.
- Lots of discussion on a pair of reports on big data and geolocation privacy. The first, Big Data – Seizing Opportunities, Preserving Values Report from the Executive Office of the President, acknowledges the benefits of data but also notes that “big data technologies also raise challenging questions about how best to protect privacy and other values in a world where data collection will be increasingly ubiquitous, multidimensional, and permanent.” The second, the PCAST report on Big Data and Privacy (PCAST is the “President’s Council of Advisors on Science and Technology” and the report is officially called “Big Data and Privacy: A Technological Perspective”) “begins by exploring the changing nature of privacy as computing technology has advanced and big data has come to the forefront. It proceeds by identifying the sources of these data, the utility of these data — including new data analytics enabled by data mining and data fusion — and the privacy challenges big data poses in a world where technologies for re-identification often outpace privacy-preserving de-identification capabilities, and where it is increasingly hard to identify privacy-sensitive information at the time of its collection.” The importance of both of these reports to future library and archive collection and access policies regarding data cannot be overstated.
- The Spatial Data Transfer Standard is being voted on for withdrawal as an FGDC-endorsed standard. FGDC maintenance authority agencies were asked to review the relevance of the SDTS, and they responded that the SDTS is no longer used by their agencies. There’s a Federal Register link to the proposal. The Geography Markup Language (GML), which the FGDC has endorsed, now satisfies the encoding requirements that SDTS once provided. NARA revised their transfer guidance for geospatial information in April 2014 to make SDTS files “acceptable for imminent transfer formats” but it’s clear that they’ve already moved away from them. As a side note, GeoRSS is coming up for a vote soon to become an FGDC-endorsed standard.
- The Office of Management and Budget is reevaluating the geospatial professional classification. The geospatial community has an issue similar to that being faced by the library and archives community, in that the jobs are increasingly information technology jobs but are not necessarily classified as such. This coincides with efforts to reevaluate the federal government library position description.
- The Federal Geographic Data Committee is working with federal partners to make previously-classified datasets available to the public. These datasets have been prepared as part of the “HSIP Gold” program. HSIP Gold is a compilation of over 450 geospatial datasets of U.S. domestic infrastructure features that have been assembled from a variety of Federal agencies and commercial sources. The work of assembling HSIP Gold has been tasked to the Homeland Infrastructure Foundation-Level Data (HIFLD) Working Group (say it as “high field”). Not all of the data in HSIP Gold is classified, so they are working to make some of the unclassified portions available to the public.
The next meeting of the NGAC is scheduled for September 23 and 24 in Shepherdstown, WV. The meetings are public.
The following is a guest post by Chris Adams from the Repository Development Center at the Library of Congress, the technical lead for the World Digital Library.
We live in an age of cheap bits: scanning objects en masse has never been easier, storage has never been cheaper and large-scale digitization has become routine for many organizations. This poses an interesting challenge: our capacity to generate scanned images has greatly outstripped our ability to generate the metadata needed to make those items discoverable. Most people use search engines to find the information they need but our terabytes of carefully produced and diligently preserved TIFF files are effectively invisible for text-based search.
The traditional approach to this problem has been to invest in cataloging and transcription but those services are expensive, particularly as flat budgets are devoted to the race to digitize faster than physical media degrades. This is obviously the right call from a preservation perspective but it still leaves us looking for less expensive alternatives.
OCR is the obvious solution for extracting machine-searchable text from an image but the quality rates usually aren’t high enough to offer the text as an alternative to the original item. Fortunately, we can hide OCR errors by using the text to search but displaying the original image to the human reader. This means our search hit rate will be lower than it would be with perfect text, but since the content in question is otherwise completely unsearchable, anything better than no results will be a significant improvement.
Since November 2013, the World Digital Library has offered combined search results similar to what you can see in the screenshot below:
This system is entirely automated, uses only open-source software and existing server capacity, and provides an easy process to improve results for items as resources allow.
How it Works: From Scan to Web Page
Generating OCR Text
As we receive new items, any item which matches our criteria (currently books, journals and newspapers created after 1800) will automatically be placed in a task queue for processing. Each of our existing servers has a worker process which uses idle capacity to perform OCR and other background tasks. We use the Tesseract OCR engine with the generic training data for each of our supported languages to generate an HTML document using hOCR markup.
The hOCR document has HTML markup identifying each detected word and paragraph and its pixel coordinates within the image. We archive this file for future usage but our system also generates two alternative formats for the rest of our system to use:
- A plain text version for the search engine, which does not understand HTML markup
- A JSON file with word coordinates which will be used by a browser to display or highlight parts of an image on our search results page and item viewer
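As a rough illustration of that conversion step, here is a minimal Python sketch that reads hOCR markup and produces both derived formats. It is not the WDL code: the class and function names are made up for this example, and the JSON layout (a list of objects with `text` and `bbox` keys) is an assumption. The only hOCR details it relies on are the `ocrx_word` class and the `bbox x0 y0 x1 y1` entry in each element's `title` attribute.

```python
import json
import re
from html.parser import HTMLParser

# hOCR stores coordinates in the title attribute, e.g. "bbox 100 200 260 240; x_wconf 91"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect the text and bounding box of every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []    # [{"text": ..., "bbox": [x0, y0, x1, y1]}, ...]
        self._bbox = None  # bbox of the word currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = [int(n) for n in match.groups()]
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # A closing </span> ends the current word; other closing tags are ignored.
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append({"text": text, "bbox": self._bbox})
            self._bbox = None


def convert_hocr(hocr_html):
    """Return (plain text for the search engine, JSON word coordinates)."""
    parser = HocrWordParser()
    parser.feed(hocr_html)
    parser.close()
    plain_text = " ".join(word["text"] for word in parser.words)
    return plain_text, json.dumps(parser.words)


sample = """
<div class='ocr_page' title='bbox 0 0 1000 1500'>
 <p class='ocr_par'>
  <span class='ocr_line' title='bbox 100 200 600 240'>
   <span class='ocrx_word' title='bbox 100 200 260 240; x_wconf 91'>VILLAGE</span>
   <span class='ocrx_word' title='bbox 280 200 400 240; x_wconf 88'>FOULA</span>
  </span>
 </p>
</div>
"""
text, coords_json = convert_hocr(sample)
print(text)  # VILLAGE FOULA
```

Keeping the archived hOCR as the source of truth means either derived file can be regenerated later if the format needs to change.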
Search has become a commodity service with a number of stable, feature-packed open-source offerings such as Apache Solr, ElasticSearch or Xapian. Conceptually, these work with documents (i.e. complete records), which are used to build an inverted index: essentially a list of words and the documents which contain them. When you search for “whaling” the search engine performs stemming to reduce your term to a base form (e.g. “whale”) so it will match closely-related words, finds the term in the index, and retrieves the list of matching documents. The results are typically sorted by calculating a score for each document based on how frequently the terms are used in that document relative to the entire corpus (see the Lucene scoring guide for the exact details about how term frequency-inverse document frequency (TF-IDF) works).
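To make the inverted index idea concrete, here is a toy Python sketch. Everything in it is a simplification for illustration: the `stem` function is a crude suffix stripper rather than a real stemmer, and the score is a textbook TF-IDF sum, not Lucene's actual formula.

```python
import math
from collections import Counter, defaultdict


def stem(word):
    """Toy stemmer: lowercase and strip one common English suffix."""
    word = word.lower()
    for suffix in ("ing", "ers", "er", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stemmed term to a Counter of {doc_id: occurrences}."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for token in text.split():
            index[stem(token)][doc_id] += 1
    return index


def search(index, n_docs, query):
    """Rank documents by summed TF-IDF over the stemmed query terms."""
    scores = Counter()
    for term in (stem(token) for token in query.split()):
        postings = index.get(term)
        if not postings:
            continue
        # Terms that appear in fewer documents get a higher weight.
        idf = math.log(n_docs / len(postings)) + 1.0
        for doc_id, term_count in postings.items():
            scores[doc_id] += term_count * idf
    return [doc_id for doc_id, _ in scores.most_common()]


docs = {
    "a": "whale ships and whaling crews",
    "b": "ships in the harbor",
    "c": "the whalers returned",
}
index = build_index(docs)
results = search(index, len(docs), "whaling")
print(results)  # ['a', 'c']
```

Document "a" ranks first because it contains two words that reduce to the same stem as the query, while "b" contains none and is not returned at all.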
This approach makes traditional metadata-driven search easy: each item has a single document containing all of the available metadata and each search result links to an item-level display. Unfortunately, we need to handle both very large items and page-level results so we can send users directly to the page containing the text they searched for rather than page 1 of a large book. Storing each page as a separate document provides the necessary granularity and avoids document size limits but it breaks the ability to calculate relevancy for the entire item: the score for each page would be calculated separately and it would be impossible to search for multiple words which fall on different pages.
The solution for this final problem is a technique which Solr calls Field Collapsing (the ElasticSearch team has recently completed a similar feature referred to as “aggregation”). This allows us to make a query and specify a field which will be used to group documents before determining relevancy. If we tell Solr to group our results by the item ID the search ranking will be calculated across all of the available pages and the results will contain both the item’s metadata record and any matching OCR pages.
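A grouped request might look something like the following Python sketch, which only builds the query URL. The `group`, `group.field`, `group.limit`, `hl`, `hl.fl` and `wt` parameters are standard Solr request parameters; the field names `item_id` and `ocr_text`, the core name in the URL and the helper function are placeholders invented for this example, not the WDL schema.

```python
from urllib.parse import urlencode


def build_grouped_query(select_url, user_query):
    """Build a Solr select URL that groups page-level documents by item."""
    params = {
        "q": user_query,
        "group": "true",            # enable result grouping (field collapsing)
        "group.field": "item_id",   # hypothetical field shared by an item's pages
        "group.limit": 3,           # return at most three matching pages per item
        "hl": "true",               # ask Solr to highlight matching terms
        "hl.fl": "ocr_text",        # hypothetical field holding the OCR text
        "wt": "json",
    }
    return select_url + "?" + urlencode(params)


url = build_grouped_query(
    "http://localhost:8983/solr/example/select",  # placeholder host and core
    "village foula",
)
print(url)
```

The response then contains one group per item, each with its own short list of matching documents, which is what lets a results page show a single entry per item with links to specific pages.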
At this point, we can perform a search and display a nice list of results with a single entry for each item and direct links to specific pages. Unfortunately, the raw OCR text is a simple unstructured stream of text and any OCR glitches will be displayed, as can be seen in this example where the first occurrence of “VILLAGE FOULA” was recognized incorrectly:
The next step is replacing that messy OCR text with a section of the original image. Our search results list includes all of the information we need except for the locations for each word on the page. We can use our list of word coordinates but this is complicated because the search engine’s language analysis and synonym handling mean that we cannot assume that the word on the page is the same word that was typed into the search box (e.g. a search for “runners” might return a page which mentions “running”).
Here’s what the entire process looks like:
1. The server returns an HTML results page containing all of the text returned by Solr with embedded microdata indicating the item, volume and page numbers for results and the highlighted OCR text:
2. Now we can find each word highlighted by Solr and locate it in the word coordinates list. Since Solr returned the original word and our word coordinates were generated from the same OCR text which was indexed in Solr, the highlighting code doesn’t need to handle word tenses, capitalization, etc.
3. Since we often find words in multiple places on the same page and we want to display a large, easily readable section of the page rather than just the word, our image slice will always be the full width of the page starting at the top-most result and extending down to include subsequent matches until there is either a sizable gap or the total height is greater than the first third of the page.
Once the image has been loaded, the original text is replaced with the image:
4. Finally, we add a partially transparent overlay over each highlighted word:
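The logic in these steps can be sketched in a few functions. The real work happens in JavaScript in the browser; this Python version is only meant to show the arithmetic. The function names, the 100-pixel `max_gap` threshold and the reading of “first third of the page” as a height limit of one third of the page are all assumptions for the sake of the example, as is the word-coordinate layout (`text` plus `bbox` as `[x0, y0, x1, y1]`). It does rely on Solr's default behavior of wrapping highlighted terms in `<em>` tags.

```python
import re


def highlighted_words(snippet):
    """Return the set of words Solr wrapped in <em> tags."""
    return set(re.findall(r"<em>(.*?)</em>", snippet))


def find_matches(words, targets):
    """Exact string match is enough: both sides come from the same OCR text."""
    return [word for word in words if word["text"] in targets]


def compute_slice(matches, page_height, max_gap=100):
    """Return (top, bottom) of a full-width band starting at the first match.

    `matches` must be non-empty. The band grows to include later matches
    until the next one is more than `max_gap` pixels away or would make the
    band taller than a third of the page.
    """
    ordered = sorted(matches, key=lambda word: word["bbox"][1])
    top, bottom = ordered[0]["bbox"][1], ordered[0]["bbox"][3]
    for word in ordered[1:]:
        y0, y1 = word["bbox"][1], word["bbox"][3]
        if y0 - bottom > max_gap or y1 - top > page_height / 3:
            break
        bottom = max(bottom, y1)
    return top, bottom


def overlay_boxes(matches, top, bottom):
    """Return (left, top, width, height) of each overlay, relative to the slice."""
    return [
        (x0, y0 - top, x1 - x0, y1 - y0)
        for x0, y0, x1, y1 in (word["bbox"] for word in matches)
        if y0 >= top and y1 <= bottom
    ]


words = [
    {"text": "VILLAGE", "bbox": [100, 200, 260, 240]},
    {"text": "FOULA", "bbox": [280, 200, 400, 240]},
    {"text": "island", "bbox": [100, 260, 220, 300]},
    {"text": "FOULA", "bbox": [100, 1200, 220, 1240]},
]
targets = highlighted_words("THE <em>VILLAGE</em> OF <em>FOULA</em>")
matches = find_matches(words, targets)
top, bottom = compute_slice(matches, page_height=1500)
boxes = overlay_boxes(matches, top, bottom)
print(top, bottom)  # 200 240
print(boxes)        # [(100, 0, 160, 40), (280, 0, 120, 40)]
```

In this example the second “FOULA” near the bottom of the page is left out of the slice because it is far below the first cluster of matches.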
- The WDL management software records the OCR source and review status for each item. This makes it safe to automatically reprocess items when new versions of our software are released without the chance of inadvertently overwriting OCR text which was provided by a partner or which has been hand-corrected.
- You might be wondering why the highlighting work is performed on the client side rather than having the server return highlighted images. In addition to reducing server load, this design improves performance because a given image segment can be reused for multiple results on the same page (rounding the coordinates improves the cache hit ratio significantly) and both the image and word coordinates can be cached independently by CDN edge servers rather than requiring a full round-trip back to the server each time.
- This benefit is most obvious when you open an item and start reading it: the same word coordinates used on the search results page can be reused by the viewer and since the page images don’t have to be customized with search highlighting, they’re likely to be cached on the CDN. If you change your search text while viewing the book, highlighting for the current page will be immediately updated without having to wait for the server to respond.
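To see why rounding helps caching, consider this small Python sketch. The 50-pixel step, the choice to round outward (top down, bottom up) and the URL pattern are all assumptions made for illustration rather than details of the WDL implementation.

```python
def round_slice(top, bottom, page_height, step=50):
    """Snap a slice outward to multiples of `step` pixels."""
    snapped_top = (top // step) * step
    # -(-n // step) is ceiling division using only integers.
    snapped_bottom = min(page_height, -(-bottom // step) * step)
    return snapped_top, snapped_bottom


def slice_url(item_id, page, top, bottom):
    """Hypothetical URL scheme for a full-width slice of a page image."""
    return f"/images/{item_id}/{page}/slice-{top}-{bottom}.jpg"


# Two searches whose matches sit a few pixels apart on the same page...
first = round_slice(203, 241, page_height=1500)
second = round_slice(212, 236, page_height=1500)

# ...end up requesting exactly the same image, so the second one can be a cache hit.
print(slice_url("example-item", 7, *first))
print(slice_url("example-item", 7, *second))
```

Without rounding, nearly every distinct search would produce a unique pair of coordinates and therefore a unique URL that no cache has seen before.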
This approach works relatively well but there are a number of areas for improvement:
- The process described above leaves plenty of room to improve the OCR results through technical improvements such as more sophisticated image processing, OCR engine training, and workflow systems incorporating human review and correction.
- For collections such as WDL’s, which include older items, OCR accuracy is reduced by the condition of the materials and typographic conventions like the long s (ſ) or ligatures which are no longer in common usage. The Early Modern OCR Project is working on this problem and will hopefully provide a solution for many needs.
- Finally, there’s considerable appeal to crowd-sourcing corrections as demonstrated by the National Library of Australia’s wonderful Trove project and various experimental projects such as the UMD MITH ActiveOCR project.
- This research area is of benefit to any organization with large digitized collections, particularly projects with an eye towards generic reuse. Ed Summers and I have casually discussed the idea for a simple web application which would display images with the corresponding hOCR with full version control, allowing the review and correction process to be a generic workflow step for many different projects.