"Format" Identification/Characterization for Hard Drive Disk Images

Emulation requires disk images to be provided and used for the main hard drive of emulated computers. These disk images can be captured from old hard drives as discussed here. In order to automate aspects of running these disk images in an emulator a tool is needed that tells us which emulators would be compatible with the image and how to configure them. In this post I identify some of the requirements of such a tool and seek feedback from the community about them and the concept in general.

A tool for identifying and configuring emulators that are compatible with captured hard drive images would need to do the following:

  1. identify hard drive image formats (in order to know how to read their contents to identify other important information)
  2. identify if an operating system was installed on them (in order to know whether they can be directly executed in an emulator or would need to be attached to an emulator running from another disk image)
  3. identify what operating system was installed on them (including versions) (in order to know what emulators and emulated hardware should be compatible with the images)
  4. identify the hardware the system has been running on previously (in order to identify the hardware requirements of the emulated hardware)

It would also be useful to identify other software installed in the operating system environment stored on the disk image. That software may rely on hardware that could not be identified in step 4 but that could still be supplied by emulated hardware.

This information could then be mapped to emulators and the hardware available in them. This would enable the automatic configuration and execution of the appropriate emulator to load the disk image on emulated hardware.

Disk images could theoretically be assigned identifiers to be mapped to emulators and emulated hardware configurations/profiles based on either:

  1. The combination of software they have on them
  2. The Operating System (OS) they have on them
  3. The hardware configuration they were previously running on that the operating system has drivers configured for
  4. Generic emulated hardware profiles that the disk images are compatible with

There is huge potential complexity in this approach. Every variant of installed hardware could require a new "format", every variant of installed software profile could require a new "format" and every combination of installed software profile and hardware/driver configuration could require a new format.

In order to simplify the process it may be possible in many cases to just identify which OS is installed on the disk image. This may often be enough to configure an emulated environment to successfully execute the software installed on the disk image. This could be achieved by using a generic configuration that is known to work with that OS.
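A minimal sketch of that lookup in Python might look like the following. The profile names and values are hypothetical and only illustrate the idea of mapping an identified OS to a generic, known-good configuration; they are not taken from any real emulator.

```python
# Hypothetical generic profiles keyed by identified OS.
# Machine names, RAM sizes and bus types are illustrative assumptions only.
GENERIC_PROFILES = {
    "MS-DOS 6.22": {"machine": "pc-386", "ram_mb": 16, "disk_bus": "ide"},
    "Windows 98": {"machine": "pc-pentium", "ram_mb": 64, "disk_bus": "ide"},
    "Windows XP": {"machine": "pc-pentium3", "ram_mb": 512, "disk_bus": "ide"},
}


def generic_profile_for(os_name):
    """Return a generic profile recorded for the OS, or None if there is none."""
    return GENERIC_PROFILES.get(os_name)
```

An image whose OS has no recorded profile would fall through to the more detailed hardware identification discussed next.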

In other cases the environment may require a specialised hardware configuration that would require more extensive information to identify. Identifying generic emulated hardware profiles that are compatible with the hardware configured on the disk image will require identifying that previously configured hardware (i.e. 3. above). The concept of this approach would be to first identify the hardware components previously used to run the software installed on the disk image. These hardware components could then be compared against the set of available emulated hardware and the previously used components could be matched with compatible emulated components.  
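The matching step could be sketched roughly as below, assuming the previously configured hardware has already been extracted from the image as a list of component names. The compatibility table and the emulated device names are hypothetical placeholders.

```python
# Hypothetical table: hardware found configured on the image -> emulated equivalent.
# Both the keys and the emulated device names are illustrative assumptions.
EMULATED_EQUIVALENTS = {
    "Sound Blaster 16": "emulated-sb16",
    "NE2000 network card": "emulated-ne2000",
    "Cirrus Logic GD5446": "emulated-cirrus-vga",
}


def match_hardware(detected):
    """Split detected components into those with an emulated match and those without."""
    matched = {}
    unmatched = []
    for component in detected:
        equivalent = EMULATED_EQUIVALENTS.get(component)
        if equivalent is None:
            unmatched.append(component)
        else:
            matched[component] = equivalent
    return matched, unmatched
```

The unmatched list is the interesting output: it shows which components would need new drivers, a substitute device, or manual attention.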

Developing a tool to identify 2. the installed operating system, seems like a quick-win piece of work. Developing a tool to identify 4. generic emulated hardware profiles that the disk images are compatible with, seems like a much harder piece of work that may require a lot of documentation, but would potentially be much more useful. The work being carried out by the Totem Project to develop a schema for documenting hardware and software environments (and to document them) may be usable to help realise such a tool.

There are analogies between this approach and the approaches used by file format identification tools. It is interesting to consider hard drive images as files that require applications (i.e. emulators) to execute them. As such there is potential to repurpose format identification or characterisation tools for the purposes outlined above.

The purpose for this post was threefold:

  1. To raise the idea of such a tool within the OPF community
  2. To seek feedback on the value of such a tool: is such a tool/approach worth pursuing?
  3. To seek advice on whether there are any tools out there currently that perform this role or could be repurposed to perform such a role (both JHOVE and DROID have potential here).

 Any comments or feedback would be greatly appreciated.




An interesting idea, and one I suspect will become more pressing as virtualization usage increases. It would (probably) not be hard to develop DROID signatures that would recognise the primary type of virtual hard drive image. Recognising the contents may be harder to achieve reliably, as there will be file system structures and fragmentation obscuring the contents.

DROID 6 has architectural support for the notion of container formats, where a separate piece of code is invoked to read the container contents as integral files, rather than as a single mixed-up byte stream. You can, of course, ignore fragmentation to some extent if your signatures are not actively obscured by the containing format. Prior to DROID 6, this was how all binary Microsoft Office formats were recognised, peering through the slightly dirty lens of OLE2.

However, treating disk images as container formats would be a lot of work. You would have to decode the partition structure, then the contained file system, and only then decode the files themselves. And in DROID's case, you would have to write all that code in Java, probably from scratch. So, while I think there is some potential to use tools like DROID to help with broad identification of images and possibly the types of file system and operating system within them, for a full OS profile you would want something else. To be honest, I suspect you would want the ability to extract files from virtual images as a separate tool in any case.
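As a rough illustration of what decoding the partition structure involves, here is a minimal Python sketch for the simplest case only: a raw image with a classic MBR partition table. It assumes the first 512 bytes of the image have already been read; GPT disks and wrapped formats such as VHD or VMDK would need further handling. The function name and the small type table are illustrative choices, not part of any existing tool.

```python
import struct

# A few well-known MBR partition type codes (deliberately incomplete).
PARTITION_TYPES = {
    0x01: "FAT12",
    0x06: "FAT16",
    0x07: "NTFS/HPFS/exFAT",
    0x0B: "FAT32 (CHS)",
    0x0C: "FAT32 (LBA)",
    0x82: "Linux swap",
    0x83: "Linux",
}


def parse_mbr(sector):
    """Return the non-empty primary partition entries from a 512-byte MBR sector."""
    if len(sector) < 512 or sector[510:512] != b"\x55\xaa":
        raise ValueError("no MBR boot signature found")
    partitions = []
    for index in range(4):
        # The four 16-byte entries start at byte offset 446.
        entry = sector[446 + 16 * index: 446 + 16 * (index + 1)]
        type_code = entry[4]
        if type_code == 0:
            continue  # unused slot
        # Start sector (LBA) and sector count are little-endian 32-bit values.
        start_lba, sector_count = struct.unpack_from("<II", entry, 8)
        partitions.append({
            "index": index,
            "type_code": type_code,
            "type_name": PARTITION_TYPES.get(type_code, "unknown"),
            "start_lba": start_lba,
            "sector_count": sector_count,
        })
    return partitions
```

The start sector of each entry, multiplied by the sector size, gives the byte offset at which file system identification would then have to begin.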

Euan Cochrane's picture

Thanks for the reply Matt,

Your evaluation of the option of using DROID to achieve the aims outlined in the post is really useful.

To clarify, it sounds like any method for undertaking this kind of identification/characterisation will likely require some sort of multi-step identification process. For example, it might be possible to use the same identifier to identify disk images with the same operating system installed on them, but with different file systems. This might be able to be done using multiple signatures and the binary signature identification technique you described being used for MS-Office OLE files in DROID pre v6 (e.g. one signature for each file-system and operating system combination but only one unique identifier for all that share the same OS). However it would be an odd approach and potentially difficult.

As you point out it makes sense to instead identify the disk image type (.vhd, .vmdk, .img, etc.) and then the file system (e.g. adfs, affs, autofs, cifs, coda, coherent, cramfs, debugfs, devpts, efs, ext, ext2, ext3, ext4, hfs, hfsplus, hpfs, iso9660, jfs, minix, msdos, ncpfs, nfs, nfs4, ntfs, proc, qnx4, ramfs, reiserfs, romfs, squashfs, smbfs, sysv, tmpfs, ubifs, udf, ufs, umsdos, usbfs, vfat, xenix, xfs, xiafs etc!) and then unpack the images to identify the installed OS by, for example, looking for particular system files.
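The last of those steps, looking for particular system files, could be sketched as follows. This assumes some other tool has already produced a list of file paths from the unpacked image. The marker list is a small illustrative assumption and far from complete; markers can also overlap between systems, so the function returns every candidate rather than a single answer.

```python
# Illustrative marker files per OS family; an assumption, not an authoritative list.
# Each entry: (label, files that must all be present, as lower-case relative paths).
OS_MARKERS = [
    ("Windows NT family", ["windows/system32/ntoskrnl.exe"]),
    ("Windows NT family", ["winnt/system32/ntoskrnl.exe"]),
    ("MS-DOS or Windows 9x", ["io.sys", "msdos.sys"]),
    ("Mac OS X", ["system/library/coreservices/systemversion.plist"]),
    ("Linux", ["etc/os-release"]),
]


def guess_os(paths):
    """Return every OS label whose marker files are all present in the path list."""
    normalised = {p.replace("\\", "/").strip("/").lower() for p in paths}
    candidates = []
    for label, required in OS_MARKERS:
        if all(marker in normalised for marker in required) and label not in candidates:
            candidates.append(label)
    return candidates
```

Identifying the OS version, rather than just the family, would need a further step such as reading version information out of the matched files.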

It also sounds like you are saying it might be possible to use the multiple step identification and container format unpacking functionality in DROID 6. I have to admit that I don't really understand how the container unpacking works in DROID 6 though. I've also been struggling to find documentation about it. Also I think you are pointing out that we actually want to know something about the whole disk image but might have to identify it by individually looking at the files contained within it. This seems to be what we try to do/need to do with Open Document Format (ODF) files and Office Open XML (OOXML) files (i.e. unpacking the zip containers and looking in them to ascertain whether they are OOXML or ODF files etc). Unfortunately I don't understand whether that is what DROID 6's container unpacking enables or whether it just enables container files (e.g. zip) to be unpacked and the files they contain to be identified using regular DROID signatures.

Euan Cochrane's picture

I found this within the DROID zip file once I downloaded it from here. After some searching on the filename I found that it is also available here

"A “container” identification means that a format was identified by finding embedded files (possibly with signatures of their own) inside the main file. For example, Microsoft Office 2007 word processing files are actually zip files containing xml files, images or other resources used in the document. A container identification would identify the main file as a Microsoft Office 2007 file, not a zip file. This method is very reliable, as not only does the broad type of container have to be identified (e.g. zip), but the zip file must then be opened, and files inside scanned for further identifications to be made. The original zip identification is removed, and replaced by the Office 2007 identification, on the basis of the files discovered within it."

It looks like this might be able to be used to at least identify the OS type that the disk image has (by looking for system files), provided code could be included in DROID to unpack the container formats of the disk images.
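To make the container idea in the quoted passage concrete, here is a minimal Python sketch that refines a zip identification by looking at the files inside it. It is not DROID's implementation; the function name and returned labels are my own, and it only checks two well-known markers: the `[Content_Types].xml` part used by OOXML packages and the `mimetype` entry used by ODF.

```python
import io
import zipfile


def identify_zip_container(data):
    """Refine a zip identification by inspecting the names of the files inside."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return "not a zip"
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = set(archive.namelist())
        # OOXML files are OPC packages, which carry this part at the top level.
        if "[Content_Types].xml" in names:
            return "OOXML (OPC package)"
        # ODF files carry a 'mimetype' entry naming the document type.
        if "mimetype" in names:
            mimetype = archive.read("mimetype").decode("ascii", "replace").strip()
            if mimetype.startswith("application/vnd.oasis.opendocument."):
                return "ODF"
    # No more specific identification found, so the broad one stands.
    return "zip"
```

A disk image equivalent would follow the same pattern, with the zip reader replaced by code that can list files inside the image's file system.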

Hi Euan

In the interest of full disclosure, I should mention that I was the architect for DROID 5 and 6, but that I no longer work at the National Archives. So my views may be somewhat biased - but they are my own.

The thing about DROID signatures is that they're surprisingly powerful, especially combined together.  They aren't always elegant, but it's actually quite rare that it's not possible to recognise a binary file format. 

It may be entirely possible to separately identify the virtual hard disk format, the file system and the underlying operating system, if the fragmentation imposed by those systems doesn't horribly mix up the few bits we can use to identify them. DROID will happily report multiple identifications for the same underlying file, if they all match. Doing it this way avoids the combinatorial explosion you describe, and also avoids the need to write an amazing amount of code to decode them as container formats.
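The idea of independent signatures that can all match the same file can be sketched in a few lines of Python. This is not DROID's signature syntax, just a plain table of (label, byte offset, magic bytes) using some widely documented magic values. Offsets are relative to the start of whatever buffer is passed in, so file system signatures only line up if the buffer starts at the volume.

```python
# Independent signatures for different layers: (label, byte offset, magic bytes).
# One signature per image format plus one per file system (N + M entries),
# rather than one per combination (N x M).
SIGNATURES = [
    ("QCOW image", 0, b"QFI\xfb"),
    ("VMDK sparse extent", 0, b"KDMV"),
    ("VHD footer copy", 0, b"conectix"),
    ("boot sector signature", 510, b"\x55\xaa"),
    ("NTFS volume", 3, b"NTFS    "),
    ("ext2/3/4 superblock", 1080, b"\x53\xef"),
    ("ISO 9660 volume", 32769, b"CD001"),
]


def identify_all(data):
    """Return the label of every signature that matches, in table order."""
    return [
        label
        for label, offset, magic in SIGNATURES
        if data[offset:offset + len(magic)] == magic
    ]
```

A raw NTFS volume, for instance, would be reported under two labels at once, which is the multiple-identification behaviour described above.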

Container formats are a really powerful addition to DROID, although I confess I often mix up the terminology with archive formats.  Archive formats are just file containers, like zip or tar.  Container formats have the definition you (eventually) discovered.  And on the subject of documentation, there appears to be a real need for more.  Maybe that can be raised at the upcoming DROID workshop at the National Archives (which I unfortunately won't have time to attend).  In the meantime, I'm happy to answer any questions I can about it.



Andy Jackson's picture

Coincidentally, I just came across this disk image identification tool, called disktype: http://disktype.sourceforge.net/

Looks like it supports quite a range of disk formats, and it even installed straight away on OSX with a quick

sudo port install disktype

There's also a Debian package, and the online documentation looks like a useful reference for this kind of thing: http://disktype.sourceforge.net/doc/

There are a number of inhouse / commercial-off-the-shelf products that you might look towards that are being used by the cyber forensics world to do exactly what you are looking for.

There is an exceptional community run by the FBI that you could contact (no promises as to the assistance they are able to give): http://www.swgde.org/

The two software packages I am most aware of are FTK and EnCase, although I suspect that my knowledge is somewhat out of date by now.

I am aware of a large project simply to map versions of Windows installs, so that forensic examiners can remove standard or commonly found system/OEM installation files. This is essentially very close to the approach you propose. The main issue at that time was the diversity of installation files discovered, which meant they had to manage a large list of installations based on their specific file patterns. This has implications for forensic-level audit; however, the archiving requirement will have a somewhat higher tolerance for generic descriptions / file structures.
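One common way to do this kind of filtering is to compare file hashes against a reference set of known installation files. The comment above does not say that the project worked this way, so the following Python sketch is an assumption for illustration; the function name and the choice of SHA-1 are mine.

```python
import hashlib


def unknown_files(files, known_hashes):
    """Return the paths whose content hash is not in the known-file reference set.

    files: mapping of path -> file content as bytes
    known_hashes: set of hex SHA-1 digests of standard installation files
    """
    remaining = []
    for path, content in files.items():
        digest = hashlib.sha1(content).hexdigest()
        if digest not in known_hashes:
            remaining.append(path)
    return sorted(remaining)
```

For the emulation use case the complement is just as useful: the set of known files that do match says a lot about which OS version was installed.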

The biggest challenge is starting off with enough diversity of HDDs to let you begin mapping the elements of the file structure that can be used to provide OS signature data, hardware/software identifiers (that tell you how to set up the emulators) and any bespoke 'tweaks' or deviations from the standard that have been made (more important for older systems).

A similar approach is also used very successfully in the digital CCTV world, where there is often scant detail about the OS/platform the CCTV system is built on (often bespoke / hybrid)

All food for thought. Interesting post, thanks Euan.


Euan Cochrane's picture

Your post has led me to conclude that this could form the basis of a student project. The project could involve:

1. Producing a Linux distro to use for automated disk imaging for digital preservation.

2. Producing a workflow management tool/GUI that would enable:

a) Automatically identifying the disk image format using available tools.

b) Automatically converting the image to emulator/virtual machine compatible formats (e.g. using qemu-img convert).

c) Automatically adding drivers for emulated/virtualised hardware.

It does seem that most of the functional parts of such a project are already available; they just need to be combined and/or tweaked for these purposes and given a nice user-friendly interface.
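As an example of how thin the glue code for step b) could be, here is a small Python sketch that builds a conversion command for QEMU's `qemu-img convert` tool. The function name and default formats are my own assumptions; the sketch only constructs the command and does not run it.

```python
def qemu_img_convert_command(source, destination,
                             source_format="raw", target_format="qcow2"):
    """Build (but do not run) a qemu-img command that converts a disk image."""
    return [
        "qemu-img", "convert",
        "-f", source_format,   # format of the input image
        "-O", target_format,   # format of the output image
        source, destination,
    ]


# To actually run it (requires qemu-img to be installed):
#   import subprocess
#   subprocess.run(qemu_img_convert_command("captured.img", "captured.qcow2"), check=True)
```

In a workflow tool, the source format would come from the identification step in a) rather than being hard-coded.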