Following a few interesting conversations recently, I got interested in the idea of a ‘bit flip’ – an occasion where a single binary bit changes state from a 0 to a 1, or from a 1 to a 0, inside a file.
I wrote a very inefficient script that sequentially flipped every bit in a JPEG file, saved the new bitstream as a JPEG, attempted to render it with the [im] Python library and, if successful, calculated an RMSE error value for the new file.
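The harness itself is simple to sketch. This is a hedged reconstruction, not the original script: the bit-flip and RMSE helpers below are pure stdlib, and the render step (which the original did with a Python imaging library) is indicated only in a comment.

```python
import math

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with one bit flipped (MSB-first within each byte)."""
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (7 - bit_index % 8)
    return bytes(out)

def rmse(a, b):
    """Root-mean-square error between two equal-length sequences of pixel values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# The render step would use an imaging library, e.g. with Pillow:
#   img = Image.open(io.BytesIO(flipped)); img.load()   # raises on hard failures
# then rmse() over the original and damaged pixel data.

original = b"\xff\xd8\xff\xe0"    # the first bytes of a JPEG (SOI + APP0 marker)
damaged = flip_bit(original, 0)   # flip the very first bit of the file
print(damaged[:1].hex())          # '7f' -- 0xff with its top bit cleared
```

Looping `flip_bit` over `range(len(data) * 8)` reproduces the "flip every bit in turn" experiment, one output file per bit.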
I’ve not really had much time to take this further at the moment, but it’s an academic notion I’d be interested in exploring some more.
I’m not sure if a bit flip is a theoretical or ‘real’ threat on modern storage devices – in the millions of digital objects that have passed through my hands in the past 10 years, I’ve never knowingly handled a file damaged by a random bit flip. I’d be interested in any thoughts / experiences / observations on the topic.
Please see the attached file for some pretty pictures.
Feel free to get in touch if you want any more data – images, RMSe data or scripts.
| Attachment | Size |
|---|---|
| | 1.76 MB |
Comments
A few examples
Hi Jay,
Nice work, but I’m not aware of any specific examples of disk/storage failure type bitrot with single bit flips. From what I’ve heard from people with experience of disk failures, this doesn’t tend to happen very much. Although I’d love to see some better evidence on this.
What we do know is that processes to manage files (move, replicate, migrate, etc) do sometimes go wrong. S**t does indeed happen. Software tools are buggy. Networks drop out. Humans press the wrong button. That’s life. And this means that quality assurance across the lifecycle is pretty important.
There are a few examples of damage to files (scroll down to the bit rot sections) that we’ve collected in our mashups, and these tend to be caused by an array of different issues. This rather interesting example with TIFFs is single bit damage, but seems likely to have been caused by the creating software, as it’s consistent across a lot of files (although this has not been confirmed).
I think there was a paper published by some of the Planets partners who conducted a similar experiment to yours, but I can’t locate it. I’m sure another reader will know it…?
Cheers
Paul
Submitted by Paul Wheatley on 14 February 2013 – 1:55pm Permalink
Bit damage analysis
Some of the previous work is covered in the Heydegger paper referenced from here.
Note that the reason we don’t see bit-level damage is precisely because all of our systems are very carefully engineered in order to address them. There are error detection and correction protocols working for us at every moment, at the lowest levels of our systems.
Which is why things mostly go wrong at the higher levels, where we haven’t fully understood the classes of threat to the data and so engineered management protocols to compensate.
Submitted by Andy Jackson on 14 February 2013 – 3:22pm Permalink
Agreed agreed.
Yupe, totally agree – I was following a couple of strands when I did this: one is covered in my reply to Paul, and the other was to see what the resulting images look like!
I’ve really only seen file construction errors (where a filestream is created incorrectly) or truncation errors (where files haven’t been fully written after transfer or write).
Submitted by Jay Gattuso on 14 February 2013 – 7:13pm Permalink
Processing error
I’ll share this one as well, as (to me as a DP geek…) it’s a rather interesting one. Judging from the way the corruption has been cropped, this seems to have occurred during the post-digitisation processing stage (de-skew, cropping, etc.).
Submitted by Paul Wheatley on 15 February 2013 – 10:50am Permalink
Private image…
I can’t see the file in Flickr.
Submitted by Jay Gattuso on 18 February 2013 – 9:55pm Permalink
Flickr fail
Apologies, I’ll try and get this fixed…
Submitted by Paul Wheatley on 19 February 2013 – 2:08pm Permalink
All good points
Hey Paul,
As ever, I completely agree.
Underneath this work, there is a question in my mind that looks at knowing where the critical parts of files are. Some file types will naturally localise errors into chunks, others will spread them evenly throughout the file object, and some will not tolerate any errors at all (depending on how the bitstream is organised in the file object, and how the file object is organised on the storage medium).
If a single bit in a txt file is errored, the damage is localised to the affected byte – any burst errors would be dispersed at byte level throughout the file, and any cluster errors the same.
If a single bit in an mp3 file is errored, as long as the bit is not in the critical setup/declaration parts of the header, the error is confined to the frame that contains it. Any burst errors would be dispersed at frame level throughout the file (without doing an impact study on mp3, I’m not sure what size of error mp3 frames are tolerant of).
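The localisation claim for plain text is easy to demonstrate: flipping one bit corrupts exactly one character and nothing else. (The sample text and bit position below are illustrative.)

```python
text = b"the quick brown fox jumps over the lazy dog"

damaged = bytearray(text)
damaged[4] ^= 1 << 1      # flip bit 1 of byte 4: 'q' (0x71) becomes 's' (0x73)
damaged = bytes(damaged)

# Exactly one byte differs; every other character is untouched.
diff = [i for i in range(len(text)) if text[i] != damaged[i]]
print(diff)               # [4]
print(damaged.decode())   # "the suick brown fox jumps over the lazy dog"
```

A structured format like JPEG or mp3 gives no such guarantee: the damage radius depends on what role the flipped bit plays in the bitstream.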
It’s not hard to imagine there are some files in which any damage to any bit results in a complete render failure.
What we can see from these jpgs is that some errors will cause critical failure (it follows that if a one-bit error is capable of preventing the object from being rendered, then at some offsets an error of any size can do the same).
A quick count of the failed renders: 1,930 files failed to render in the script. From that we can estimate the chance that any given bit error disables a jpg of comparable size at 1.37% per bit. Without accounting for location bias, a full byte flip error then has a ~10% chance of disabling the file.
This raises some questions: (1) What is the location bias? I can see from my data that the first 6 bytes of a jpg are critical – an error in any bit there gives a 91.6% chance that the whole file fails to render. (2) If I throw big enough errors at the file, when does it behave differently? (3) What happens if I spread those errors around the file (transmission or storage cluster errors) or concentrate them in a single block (transmission or write errors)?
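The jump from the 1.37% per-bit figure to the ~10% per-byte figure follows if the eight flips within a byte are treated as independent trials, which is a simplifying assumption; a quick check:

```python
# Per-bit failure probability quoted above (ignoring location bias).
p_bit = 0.0137

# A full byte flip is eight flipped bits; the file survives only if
# every one of the eight flips is individually survivable.
p_byte = 1 - (1 - p_bit) ** 8
print(f"{p_byte:.1%}")   # 10.4% -- consistent with the "~10%" estimate
```

In reality the eight bits of a byte are adjacent, so location bias makes them far from independent – which is exactly what question (1) is probing.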
So yupe, totally take your point! It is a hypothetical attack, but it has helped to start some interesting discussions already, so I’m very happy with that!
Submitted by Jay Gattuso on 14 February 2013 – 7:04pm Permalink
Great stuff
Great stuff, but I had trouble understanding this part:
Surely, at 8 bits to the byte, this should be 940,112 bits of data? Where did 140,112 come from?
Submitted by Andy Jackson on 14 February 2013 – 3:16pm Permalink
ham fingers
Good spot,
Fixed now, thanks.
Submitted by Jay Gattuso on 14 February 2013 – 7:16pm Permalink
Never mind preservation…
…this is a pretty nice art project.
Submitted by Peter Cliff on 14 February 2013 – 4:37pm Permalink
Arty
Indeed. When I get round to it, I’m going to loop them all into a movie. Even at full frame rate it’s going to be a very long and dull movie!
Submitted by Jay Gattuso on 14 February 2013 – 7:27pm Permalink
Volker’s work
Jay,
Volker has done some nice work on this:
Heydegger, V.: Just One Bit in a Million: On the Effects of Data Corruption in Files. In: Agosti, M. et al. (eds.): Research and Advanced Technology for Digital Libraries. Proceedings of the 13th European Conference, ECDL 2009. LNCS 5714, Springer (2009).
Also, the British Library has seen bit rot in its Digital Library System, detected when recalculating hashsums. They are real.
Best wishes,
Angela
Submitted by Angela Dappert on 15 February 2013 – 1:26pm Permalink
Real, but rare…
Just asked a colleague, and they said that over the last six years of operation of the main store, which has a current total of 50 million files containing about half a petabyte of data (replicated totals), the BL has seen spontaneous bitstream damage once (i.e. only one file has ever been repaired for this reason). There have been other errors, but they have been down to systematic sources like faulty hardware or workflow problems, rather than true spontaneous ‘bit rot’.
So yes, it happens, but it is certainly rare.
Submitted by Andy Jackson on 15 February 2013 – 1:44pm Permalink
Interesting.
It would be very interesting to know how they caught it… and repaired it (I can only guess a brute-force bit flip and MD5 comparison until the original MD5 is re-created…)
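That guessed repair strategy is easy to sketch: flip each bit in turn and recompute the checksum until it matches the known-good value. A toy version (using SHA-256 in place of MD5 – the approach is identical – with an illustrative bitstream):

```python
import hashlib

def repair_single_bitflip(damaged, good_digest):
    """Brute-force repair of a single flipped bit: try every bit position,
    return the repaired bytes if the SHA-256 matches, else None."""
    buf = bytearray(damaged)
    for i in range(len(buf) * 8):
        buf[i // 8] ^= 1 << (7 - i % 8)    # flip candidate bit
        if hashlib.sha256(buf).hexdigest() == good_digest:
            return bytes(buf)              # restored
        buf[i // 8] ^= 1 << (7 - i % 8)    # flip it back
    return None

original = b"some preserved bitstream"
digest = hashlib.sha256(original).hexdigest()
damaged = bytearray(original)
damaged[3] ^= 0x10                         # simulate one flipped bit
print(repair_single_bitflip(bytes(damaged), digest) == original)   # True
```

This is O(file size) hash computations, so it is only practical for a single-bit error; restoring from a replica (as the BL apparently did) is the sane production answer.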
Agree with your comment on workflow and tools – ditto over here. The main causes of errors are naff tools writing ill-formed objects, or hardware issues (e.g. we discovered through fixity checks that we had a ‘chattery’ network cable in a critical switch, resulting in intermittent corruption of written file bitstreams… we went back and fixed the problem, and the badly written files, so no loss, but it was worrying for a while as we tracked the fault down).
Submitted by Jay Gattuso on 18 February 2013 – 9:54pm Permalink
Plain old fixity checking
The errors were caught using plain old fixity checking (signed, timestamped SHA-256 I think), and restored from mirrors.
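For reference, the core of such a fixity check is small; a minimal sketch, assuming a simple path→digest manifest rather than the BL’s actual signed, timestamped scheme:

```python
import hashlib, os, tempfile

def sha256_of(path, chunk=1 << 20):
    """Stream a file through SHA-256 so large objects needn't fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def verify(manifest):
    """manifest maps path -> expected hex digest; returns paths failing fixity."""
    return [p for p, want in manifest.items() if sha256_of(p) != want]

# Tiny demo: a one-file 'store' passes its own fixity check.
fd, path = tempfile.mkstemp()
os.write(fd, b"a preserved bitstream")
os.close(fd)
manifest = {path: sha256_of(path)}
print(verify(manifest))   # [] -- no fixity failures
os.remove(path)
```

A real system would also sign and timestamp the manifest, so that tampering with both file and digest is detectable.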
In my experience, systematic faults like ‘chattery’ cables, dodgy disc controllers or flaky firmware are much more commonly problematic than random spontaneous damage (cosmic rays etc.).
Submitted by Andy Jackson on 19 February 2013 – 10:19am Permalink
Flips and clusters
You know, I’ve been thinking (dangerous I know) and wondered if my thought was a good idea, a bad idea, been done before, etc. and this seemed a good place to find out!
Flipping a bit produces a broken image. If you flip the bits on lots of images you get lots of broken images, and broken images – particularly JPEGs – seem to exhibit very similar artifacts; at least the broken images seem familiar somehow.
We could use this technique then to create a large body of broken images quite quickly.
Now, my question is, will we see any similarity in that breakage?
I’m not expecting a direct correlation between the bit and the damage (though it’d be neat if flipping bit 17 always resulted in a cyan swathe across the image, for example) but rather that images that are broken may all produce similar artifacts/shapes?
If (big if probably) we can extract features from each of the broken images (Matchbox?) we may then be able to cluster around these features and start to answer that question – is there any similarity in the breakages?
Why?
If we can spot similarity, we can use that cluster data as another measure of whether or not an image is broken in the absence of any “ground truth” – i.e. we’ve not migrated the image and aren’t checking against an original; we’re just handling an image in isolation – say from a CD-ROM we’re ingesting?
Could also do something similar with images identified as broken on the Atlas, but I’m not sure the corpus is big enough yet…
Having thought it all through, I think I’ll go get on with it!
Submitted by Peter Cliff on 18 February 2013 – 9:57am Permalink
I like your thinking…
Especially:
“I’m not expecting a direct correlation between the bit and the damage (though it’d be neat if flipping bit 17 always resulted in a cyan swathe across the image for example) but rather that images that are broken may all produce similar artefacts/shapes?”
In my head, there ARE a few bytes that are critical to a file – from my brief foray into bit mashing I saw that the first few bytes of a jpeg are critical. Given how inefficient my code was, I’ve not been able to run as many tests as I wanted – however, I’d love to see the aggregated results from a decent number of different jpgs (a few hundred, covering a range of frame sizes and compression levels) and see whether there are any relative or absolute patterns to errors and error percentages.
For example, we know that jpeg is built of 8×8 blocks, each block having undergone a DCT, and the result is stored as one of the main parts of the jpeg file. This means that any change to the MSB side of a byte will result in a larger ‘per DCT block’ error than a change at the LSB end – so we are more ‘tolerant’ (visually and arithmetically) of LSB-skewed errors in most of the jpeg file than of MSB errors.
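The MSB/LSB asymmetry is just positional notation: flipping the top bit of a stored 8-bit value shifts it by 128, the bottom bit by 1. (In a real JPEG the DCT coefficients are entropy-coded, so a flipped file bit doesn’t map onto a coefficient bit this directly; the sketch below illustrates the magnitude argument only, with an arbitrary example value.)

```python
value = 0b0101_0110                 # 86 -- an arbitrary stored 8-bit value

msb_flipped = value ^ 0b1000_0000   # flip the most significant bit
lsb_flipped = value ^ 0b0000_0001   # flip the least significant bit

print(abs(msb_flipped - value))     # 128 -- a large per-block error
print(abs(lsb_flipped - value))     # 1   -- numerically (and visually) negligible
```

So even within a single byte there is a strong location bias: the damage done by a flip varies by two orders of magnitude depending on which bit it hits.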
This raises some questions: at a base level, do these sorts of patterns affect how we should be clustering different file types to reduce the likelihood of errors / damage in the long term (or do we simply not care, because modern error correction, concealment and distributed bit writing methods remove this issue)?
What other patterns can we see? Are there critical portions of files that would benefit from a higher bit budget?
Submitted by Jay Gattuso on 18 February 2013 – 7:19pm Permalink