Automating Disk Imaging

Imaging complete systems can be the appropriate appraisal method for certain complex objects such as databases or digital art; see, for example, the work of the BitCurator project. The same applies to cases where a specific original environment is of interest, for example the machine of a famous author, scientist or politician. The joint Archives New Zealand and University of Freiburg paper at iPRES 2011 discussed several aspects of system imaging.

With yet another operating system successfully revived to rerun a database in its original environment, it makes sense to look more closely at the individual steps, the general workflow requirements and the options for at least partial automation of system imaging.
Even with a growing number of successfully dumped and emulated systems and more experience, it must be admitted that the procedure is still a rather specialist undertaking. Thus, in a discussion with J. Rothenberg, formerly of RAND Corporation, and I. Welch of Victoria University, it was agreed to look more closely at several automation aspects. A research agenda on this topic should discuss the following issues:

Non-intrusive Disk Dumping

To disturb the original machine as little as possible, a small, custom-made, non-intrusive Linux distribution could be booted on it. For laptops it should be bundled with, for example, a PCMCIA or CardBus network adaptor that is directly supported. This mini Linux should not only boot the system but also automatically configure a dedicated network connection to the dump target and automate the steps required to transfer the image. Linux variants based on the same core could be considered for different hardware architectures. In addition, the mini Linux could gather metadata on the actual machine, such as the original hardware configuration needed for the emulator configuration, as well as additional information such as the user profiles configured on the machine or external storage linked in.
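
The core of such a dump step can be illustrated with a minimal Python sketch. It copies a source block by block to a target and computes a SHA-256 fingerprint along the way, so that the received image can later be checked against the value recorded on the original machine. The function name dump_device, the block size and the device path mentioned in the comments are assumptions made for illustration; the network transport to the dump target is not shown.

```python
import hashlib
import io


def dump_device(source, target, block_size=1024 * 1024):
    """Copy source to target block by block.

    Returns (bytes_copied, sha256_hex) so the image can be verified later.
    """
    digest = hashlib.sha256()
    copied = 0
    while True:
        block = source.read(block_size)
        if not block:
            break
        target.write(block)
        digest.update(block)
        copied += len(block)
    return copied, digest.hexdigest()


if __name__ == "__main__":
    # Demonstration on an in-memory "disk". On a real machine the source
    # would be a block device opened read-only, for example
    # open("/dev/sda", "rb") (hypothetical path).
    fake_disk = io.BytesIO(b"\x00" * 4096 + b"boot sector and data")
    image = io.BytesIO()
    size, checksum = dump_device(fake_disk, image, block_size=512)
    print(size, checksum)
```

Because the source is only ever read, the sketch stays non-intrusive in the sense described above; the checksum would be a natural candidate for the metadata gathered alongside the hardware configuration.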

This system could even turn the procedure into a remote appraisal method, so that offices or other organizational entities such as research labs could hand over their systems in an organized way without risking the machine's integrity by physically shipping it. Appropriate authentication and encryption could be built into the system to secure the transport path. Another method could link to existing backup systems deployed within the institution. If those allow full machine recovery, the preservation workflows could start from this source.

The resetting of passwords and the removal of unnecessary components and users might already be incorporated into the system, especially when the donor is the one using it. The donor could then be asked to run or skip individual clean-up steps on the system being dumped. Otherwise these steps could be run within the memory institution receiving the image. The same would apply to the automatic exchange of all relevant hardware drivers, if possible.

Inside and Outside Operations on the Object

After the object has been dumped from the original source or retrieved from a backup system (if it allows full system restore), there are two options for dealing with the artefact. Either it is attached to a virtual machine or emulator after certain preparations, the original environment is run, and all further actions are carried out from within it. Or the image is handled externally within the archivist's working environment, which means that the imaged original environment is not executed at this point.

Disk Image Forensics

Methods and tools of digital forensics could be applied to disk images to gather a wide range of information, such as:

  • Block device structure and filesystem
  • Installed operating systems and (kernel) versions
  • Installed software components identified e.g. with the help of the NIST database
  • Hardware drivers that need to be exchanged to run the image within an emulator
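
As an illustration of the first item, the following Python sketch reads the partition table from the first sector of a raw disk image. It assumes a classic PC-style MBR layout; images from other platforms (for example Apple partition maps or Sun disk labels) and GPT disks would need different parsers. The function name parse_mbr and the short table of partition type names are chosen for this example and are far from complete.

```python
import struct

SECTOR_SIZE = 512

# A few well-known MBR partition type codes; this list is not complete.
PARTITION_TYPES = {
    0x05: "Extended",
    0x07: "HPFS/NTFS/exFAT",
    0x0B: "FAT32 (CHS)",
    0x0C: "FAT32 (LBA)",
    0x82: "Linux swap",
    0x83: "Linux",
}


def parse_mbr(sector):
    """Return the non-empty primary partition entries of an MBR sector."""
    if len(sector) < SECTOR_SIZE or sector[510:512] != b"\x55\xaa":
        raise ValueError("no MBR boot signature found")
    partitions = []
    for index in range(4):
        offset = 446 + 16 * index  # the table starts at byte 446
        entry = sector[offset:offset + 16]
        # status, CHS start, type, CHS end, first sector (LBA), sector count
        status, _chs_first, ptype, _chs_last, start_lba, sectors = struct.unpack(
            "<B3sB3sII", entry
        )
        if ptype == 0:
            continue  # unused slot
        partitions.append({
            "index": index,
            "bootable": status == 0x80,
            "type_code": ptype,
            "type_name": PARTITION_TYPES.get(ptype, "unknown"),
            "start_lba": start_lba,
            "size_bytes": sectors * SECTOR_SIZE,
        })
    return partitions
```

In use, the first 512 bytes of the image file would be read and passed to parse_mbr; the resulting list could be stored with the image metadata and used to locate the filesystems for the further analysis steps.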

Additional disk image forensics steps should run sanitizing procedures over the image file:

  • Remove unnecessary users and reset the passwords to values that are documented in the image metadata
  • Strip the disk image of any privacy-relevant data, if possible

The latter is an open research question, as it requires knowledge of the block structure, the operating system and the filesystem. Here, tool suites like BitCurator can help to run a number of standard tasks such as detecting and wiping privacy-related or other sensitive data. This procedure might put the authenticity of the original image at risk if it is impossible to prove that the sanitizing did not alter relevant components in an unwanted way.

A student project is planned to look into the information that could be gathered across a wider range of different system images (a wider range of x86 machines, a couple of Apple Macintosh systems, a Sun SPARCclassic). Nevertheless, a test set of system images, analogous to a test set of files, would be useful to have for the future, even with the more complicated legal challenges in mind. A first starting point here might be the Digital Corpora site.

Image Size Reduction and Authenticity of the Artefact

The digital artefact of interest is not necessarily the whole original system but, for example, a database installed on some operating system. This installation might even contain additional software, like an office package or a couple of games, which the person using the machine wanted but which is of no further interest for the preserved object.

The disks of the systems to be imaged are not necessarily used optimally, that is, filled to 99% of the available storage space. As the disk dumping procedure is agnostic of the content of the original disk and cannot distinguish between blocks containing meaningful data and blocks containing deleted data, it has to copy the whole disk in every case. In a worst-case scenario the system resides on an 80 gigabyte disk but consumes only a few gigabytes of it. Thus, it makes sense to try to minimize the resulting image. There are economic reasons for this too, as the secure long-term bit preservation of large images might be a significant cost factor. A size reduction could be achieved in a number of ways:

  • Ask the donor to remove all unnecessary components before actually starting the system dump
  • Let the donor run the available filesystem clean-up and defragmentation procedures
  • Use digital forensics tools to wipe blocks that are marked as unused but still contain old data which the underlying filesystem regards as deleted

In further steps the tools of virtual machines or emulators could be used to identify empty blocks and shrink the image to its minimal size. Another approach would be to use the compressed image types offered, for example, by VMware or QEMU. These procedures are identity transformations and thus do not alter the filesystem layer of the preserved object. This would be provable by comparing the number of files and their fingerprints before and after the transformation. It would be desirable to have more abstract methods to prove that the significant properties of the artefact to be preserved are not harmed.

If the artefact is clearly identifiable within a system image, different strategies might help to achieve significant size reductions. In particular, if there are several very similar but fairly standard installations of, for example, databases, it would be attractive to have a common base system. This could be achieved in a later step by running the original system in the emulator and monitoring the relevant objects to find out all the components and files they use. This information could then be used to reduce the image to a minimal set of files and applications, or to find out which components have to be present in a custom-made base system.
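
How much an image could shrink through the identification of empty blocks can be estimated before any transformation is applied. The following Python sketch counts the blocks of a raw image that consist entirely of zero bytes. The function name count_empty_blocks and the block size of 4096 bytes are assumptions for illustration. Blocks that still hold deleted data are not zero and will not be counted, which is why wiping unused blocks first matters for the achievable reduction.

```python
import io


def count_empty_blocks(image, block_size=4096):
    """Return (total_blocks, empty_blocks) for a raw image stream."""
    zero_block = bytes(block_size)
    total = 0
    empty = 0
    while True:
        block = image.read(block_size)
        if not block:
            break
        total += 1
        # A short final block is compared against an equally short run of zeros.
        if block == zero_block[:len(block)]:
            empty += 1
    return total, empty


if __name__ == "__main__":
    # In-memory stand-in for a raw disk image: mostly zeros, a little data.
    image = io.BytesIO(bytes(4096 * 6) + b"database pages" + bytes(4096 * 2))
    total, empty = count_empty_blocks(image)
    print(f"{empty} of {total} blocks are empty")
```

The ratio of empty to total blocks gives a rough upper bound on what sparse or compressed image formats could save for a given image.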

Cost Calculations and Object Metrics

For memory institutions it would be helpful to get an estimate of the expected costs for their artefacts. These costs add up from the one-time cost of the ingest procedure, the long-term storage of the object itself, the management effort for the associated emulator and the access costs induced by user requirements. Besides the cost, the artefact has a perceived mid-term and long-term value associated with it. Those considerations could finally lead to a preservation plan.
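
The cost components named above can be combined into a first rough estimate. The following Python sketch assumes, purely for illustration, that storage, emulator management and access each cost a constant amount per year; real cost models would be more detailed, and all parameter names are hypothetical.

```python
def estimate_cost(ingest, storage_per_year, emulator_per_year,
                  access_per_year, years):
    """Sum the one-time ingest cost and the recurring yearly costs.

    Assumes constant yearly costs, which is a simplification.
    """
    recurring = storage_per_year + emulator_per_year + access_per_year
    return ingest + years * recurring


if __name__ == "__main__":
    # Placeholder figures in an arbitrary currency unit, not measured values.
    print(estimate_cost(ingest=100, storage_per_year=10,
                        emulator_per_year=5, access_per_year=2, years=10))
```

Such an estimate could then be weighed against the perceived value of the artefact when drawing up a preservation plan.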

The effort to dump a system image is usually low: in the various experiments, the actual working time an archivist or experienced donor had to spend was rather minor. Nevertheless, depending on the disk size, the procedure itself might take several hours to complete. The system image already has a certain value in itself, as it at least preserves the bitstream and allows forensic analysis. The further steps are more tedious and expensive, as they might require individual steps and manual interaction. If research can show that many steps can be automated, the cost per object would go down with the number of objects processed.

Experiments and Outlook

The successful replication of various original environments in virtual machines or emulators raises a couple of new, exciting questions: What exactly is the object which is to be preserved authentically? Are the numerous operations to reduce the original disk image size identity transformations? Could this be proven in an automated way? How much change of the original image on the block level is acceptable?

The different problem sets laid out above need to be translated into concrete actions. Especially if quite different systems are taken into consideration, the actions should be formulated in an abstract way before being translated into workflows. There are at least two strands that could be pursued: the horizontal approach tries to automate as many steps as possible for a single operating system or hardware architecture, while the vertical approach looks into methods to generalize and unify the process over a wider range of different images.

A perceived disadvantage of system imaging is the ratio of object size to system image size. In cases where the outlined approach may be used, its value will have to be assessed against the cost of preserving such large objects (the disk images). From experimental experience it can be expected that in many cases the cost of understanding, documenting and migrating the database, along with the cost of providing meaningful access to the migrated database without the custom GUI, may make this emulation/virtualisation approach quite attractive in comparison. To mitigate the object size challenge, one could try to isolate the object so that it can be rendered in a standard original environment. Such a standard environment might be used for many different objects of the same type and reduce the per-unit cost of storing any "rendered object" to a minimal value. That is a big if for complex databases but should be straightforward for the likes of "stand-alone" MS-Access style databases.

Once access has been established, either via forensics or through a running virtual machine or emulator, further issues include access to parts of the artefact or to results, for example of database queries, for object reuse. Still, a number of challenges remain to be solved before non-experts can run these processes at manageable cost and effort; even today's IT professionals are not necessarily familiar any more with the intricacies of driver installation in, for example, Windows operating systems or the boot loaders of OS/2. Here, cooperation between the various memory institutions could help to maintain the necessary knowledge and provide the resources to employ the necessary digital archivists.

The challenges encountered in the various experiments showed that a certain "cooperation" of the original operating system is required for it to be properly preservable. Another issue might be the handover of the licenses for the OS and the database engine. Both should be taken into consideration for objects that are to be preservable in the future, and could be made a requirement when implementing a project or putting out a tender.

Comments

Dirk von Suchodoletz

The block device to be preserved by dumping a system image is usually (significantly) larger than necessary. As there are costs involved in the secure long-term storage of data, it makes sense to try to reduce the image size as much as possible. Quite a few of the available procedures, such as filesystem defragmentation, image shrinking and the final removal of deleted data, are identity transformations of a given filesystem. To prove this, provided the dumping system has access to the given filesystem, it would make sense to generate metadata on every contained file and compare it with the metadata generated from the final virtual machine or emulator image. The metadata could contain typical file information such as creation time, last access time, file size and a checksum. This data should not change under the aforementioned transformations.
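
The comparison described in this comment could look roughly like the following Python sketch. It assumes that the filesystem of the original and of the transformed image can each be mounted read-only as a directory tree. The function names build_manifest and compare_manifests are chosen for this example. Of the timestamps, only the modification time is recorded here, on the assumption that access times may already change when files are read for checksumming; which timestamps are stable in practice would have to be checked per filesystem.

```python
import hashlib
import os
import shutil
import tempfile


def build_manifest(root):
    """Map each relative file path under root to (size, mtime, sha256)."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # symlinks and special files are skipped in this sketch
            info = os.stat(path)
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                for chunk in iter(lambda: handle.read(65536), b""):
                    digest.update(chunk)
            relative = os.path.relpath(path, root)
            manifest[relative] = (info.st_size, int(info.st_mtime),
                                  digest.hexdigest())
    return manifest


def compare_manifests(before, after):
    """List files that are missing, added or changed after a transformation."""
    common = set(before) & set(after)
    return {
        "missing": sorted(set(before) - set(after)),
        "added": sorted(set(after) - set(before)),
        "changed": sorted(p for p in common if before[p] != after[p]),
    }


if __name__ == "__main__":
    # Two temporary directories stand in for the mounted original filesystem
    # and the filesystem inside the transformed image.
    with tempfile.TemporaryDirectory() as original, \
            tempfile.TemporaryDirectory() as shrunk:
        with open(os.path.join(original, "report.txt"), "w") as handle:
            handle.write("quarterly figures")
        shutil.copy2(os.path.join(original, "report.txt"), shrunk)
        print(compare_manifests(build_manifest(original), build_manifest(shrunk)))
```

If all three lists come back empty, the transformation left the file layer untouched as far as these properties are concerned; any entry would point to a file that needs closer inspection.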