I would like to pose a scenario for your comment:
Description
A large set of files, ~5,000.
Created between ~1993 and ~1997
Creation software unknown
Given extension .ASC
PRONOM PUIDs:
x-fmt/22 (7-bit ASCII Text) and x-fmt/283 (8-bit ASCII Text) both relate; DROID matches them by extension only, as above
JHove: ASCII-hul (Status: Well-Formed and valid)
Visual inspection confirms that these files are ASCII text documents, with no BOM or other header/footer data. The characters seem to be limited to 7-bit ASCII, but a full check of the whole collection has not been made. This has to be undertaken manually, as there is no tool that will make the distinction in a ‘bulk’ mode.
The files have no discernible data in the filename other than an arbitrary text string and the extension.
Most of the inspected files have a contextual ‘header’ (a vaguely structured line of text) inside the document of the (approximately common) form:
example 1: ‘NEWZTEL NEWS: RNZ 1ZB “LARRY WILLIAMS” MONDAY 25 MARCH 1996’
example 2: ‘NEWZTEL NEWS: CAPITAL TV “NIGHTLY NEWS” TUESDAY 19 MARCH 1996’
example 3: ‘NEWZTEL LOG: RNZ 12:00 NOON NEWS FRIDAY 17 MARCH 1995’
example 4: ‘NEWZTEL NEWS: CAPITAL TV “NIGHTLY NEWS” WEDNESDAY 14 FEBRUARY 1996’
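Since these header lines are fairly regular, the broadcast date and source could probably be pulled out mechanically. A minimal sketch, assuming every header ends in "WEEKDAY DAY MONTH YEAR" as in the four examples above; the name `parse_header` is hypothetical and the pattern has not been tried against the real collection:

```python
import re
from datetime import date

MONTHS = ["JANUARY", "FEBRUARY", "MARCH", "APRIL", "MAY", "JUNE", "JULY",
          "AUGUST", "SEPTEMBER", "OCTOBER", "NOVEMBER", "DECEMBER"]

# Assumed shape: NEWZTEL <NEWS|LOG>: <source text> <WEEKDAY> <D> <MONTH> <YYYY>
HEADER_RE = re.compile(
    r"^NEWZTEL (?P<kind>NEWS|LOG): (?P<source>.+?) "
    r"(?P<weekday>[A-Z]+DAY) (?P<day>\d{1,2}) (?P<month>[A-Z]+) (?P<year>\d{4})$"
)

def parse_header(line):
    """Return kind, source and date from a header line, or None if it does not fit."""
    m = HEADER_RE.match(line.strip())
    if m is None or m.group("month") not in MONTHS:
        return None
    try:
        when = date(int(m.group("year")),
                    MONTHS.index(m.group("month")) + 1,
                    int(m.group("day")))
    except ValueError:  # e.g. a day number that does not exist in that month
        return None
    return {"kind": m.group("kind"), "source": m.group("source"), "date": when}
```

Lines that do not fit the assumed shape come back as None, so they can be listed for manual review rather than guessed at.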
Collection description:
Files are transcriptions of news broadcasts of the period.
The set is rich with useful search terms inside the files: Names, dates, places, themes etc.
The library describes the collection by a calendar month grouping (e.g. ‘Transcripts of news broadcasts from May 1995’). The library has not undertaken, and will not undertake, a ‘file-by-file’ description.
Technical details (Files with an .ASC extension)
There are two PUIDs that apply (as above). The only constraint appears to be either 7-bit or 8-bit ASCII character encoding. The files have a standard ASCII CR (carriage return, or new line) encoding of {0xda} and a LF (line feed) encoding of {0x0a}. There is what appears to be an EOF-type character {0x1a} at the end of the encoded text, which is followed by what appears to be some zero-byte padding (in the form of {0x00}, assumed to repeat until the total file size reaches a specific multiple).
These encodings are interpreted correctly by every text viewer that was tested (MS Word, Notepad, Notepad++).
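The layout described here (text, then {0x1a}, then {0x00} padding) could be checked per file with something like the following. It is only a sketch: it assumes the trailer is a single {0x1a} followed by nothing but {0x00} bytes, it takes CR to be the standard ASCII value 0x0d, and the function names are made up for illustration.

```python
def split_trailer(data: bytes):
    """Split raw file bytes into (text, has_eof_marker, padding_length)."""
    stripped = data.rstrip(b"\x00")          # drop the trailing zero padding
    padding = len(data) - len(stripped)
    has_eof = stripped.endswith(b"\x1a")     # EOF-type character, if present
    text = stripped[:-1] if has_eof else stripped
    return text, has_eof, padding

def line_ending_counts(text: bytes):
    """Count CRLF pairs and any stray CR or LF bytes in the text part."""
    crlf = text.count(b"\r\n")
    return {
        "CRLF": crlf,
        "lone CR": text.count(b"\r") - crlf,
        "lone LF": text.count(b"\n") - crlf,
    }
```

Running both over the whole set would show whether every file really follows the same pattern, and what total file sizes the padding rounds up to.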
Ingest options
1) Ingest as is. Create a rule that will associate all files of the extension ASC to x-fmt/22, and assume that all files are 7-bit ASCII (confidence in this assertion as yet unknown)
2) Change the extension of all the .ASC files to .txt. Ingest as x-fmt/111.
Justifications
1) (a) These files came in as .ASC files, so they should be ingested as such. Any modifications required in the future should be undertaken through the creation of a modified master, and a new representation ‘layer’ added to the IE.
(b) There is a matching and suitable PUID.
(c) File completes MDE completely.
2) (a) ASC is not a widely adopted format. It’s a legacy format identifier that simply indicates the file contains ASCII text.
(b) Long term, the value of these objects is making them accessible – to a human reader, and to a systematic parser/indexer
(c) External tools are unlikely to support ASC as a format type (where text format type is specified). By changing the extension to txt, this potential bottleneck is completely removed. Objects are delivered to viewers in a widely accepted format that (generally) will be natively rendered on most platforms. Objects are delivered to agents in a widely accepted format.
(d) It is expected that free text indexers or other content crawlers will be used at some point to extract context and search terms from this collection. If this process is not undertaken, its true value is unlikely to be realised by the Library given the limited description that is available. This includes harvesting the title, the dates, locations, names and other such useful terms, and making an index of these granular expressions available to researchers.
(e) Accepting the above, it would be more efficient to change the files once at ingest, (recording the changes as per policy), negating the need to revisit the objects in the future.
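For option 2, the change could be made on a copy, with a record kept of what was done. A rough sketch only: the record fields and the name `copy_as_txt` are hypothetical and not taken from any policy or ingest system mentioned here.

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path):
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def copy_as_txt(src, dest_dir):
    """Copy src into dest_dir with a .txt extension and return a change record."""
    src = Path(src)
    dest = Path(dest_dir) / (src.stem + ".txt")
    if dest.exists():
        raise FileExistsError(dest)          # never overwrite an earlier copy
    shutil.copyfile(src, dest)
    record = {
        "original": src.name,
        "renamed": dest.name,
        "sha256_before": sha256_of(src),
        "sha256_after": sha256_of(dest),
    }
    if record["sha256_before"] != record["sha256_after"]:
        raise IOError("content changed during copy: %s" % src)
    return record
```

The original .ASC file is left untouched, so the rename is recorded as a change of name only and the bytes can be shown to be identical.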
Thoughts? questions? comments?
CR & LF
I assume when you wrote {0xda} you meant either {0x0d} (CR) or {0x0a} (LF)?
Generally, I would wish to preserve all items precisely as they were submitted, and trial any renaming or other modifications for access via a ‘migrate on-the-fly’ approach at first. Once the intended preservation action has been tested out via a live process with real users for some time (and/or any other deep QA is complete), then I’d consider creating a new migrated form and replacing the old version with this new version.
However, in this case, that seems like a lot of work for what would appear to be a fairly minor change. But I’m still nervous because you’ve not tested that all files are ASCII-only, and because I’m concerned that the EOF and zero-padding might confuse other clients, like web browsers.
The batch-testing problem is an interesting one. If you’ve got access to a Linux prompt, I found this bit of grep magic that might help, perhaps in the form of:
find . -name "*.ASC" -type f -exec grep -P "[\x80-\xFF]" {} \;
This recursively finds all files ending with “ASC” and shows any lines that contain characters in the upper-bit range (128-255). If there are none, this will return nothing.
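If grep is not to hand, or if it treats the files as binary because of the zero padding (GNU grep may do this unless told otherwise), the same check can be sketched in Python by reading the raw bytes. The function names here are made up for illustration:

```python
from pathlib import Path

def high_bytes(data: bytes):
    """Return (offset, byte value) for every byte in the range 0x80-0xFF."""
    return [(i, b) for i, b in enumerate(data) if b >= 0x80]

def scan_tree(root, pattern="*.ASC"):
    """Map each matching file under root to its high bytes; clean files are omitted."""
    report = {}
    for path in sorted(Path(root).rglob(pattern)):
        if path.is_file():
            hits = high_bytes(path.read_bytes())
            if hits:
                report[str(path)] = hits
    return report
```

An empty result from `scan_tree` would mean every matching file is 7-bit only. Note that the pattern match is case-sensitive on most Linux filesystems, so files named `.asc` would need their own pass.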
Thanks Andy
Thanks for the reply. Very useful, I will investigate the grep line you suggest and see what that gives me.
I’m interested in your views around ‘migration-on-the-fly’, and the desire to keep things as precise as possible.
I’m not saying either approach is the right or wrong way of doing business, but I am interested in the discussion / cognition that underpins any policy like this.
To me the precision issue is an interesting one. I want to know, from both a technical and an intellectual/philosophical perspective, what the value is for either argument. Any activity has a cost, and if we considered the whole cost of the object (cost to render, cost to migrate, cost to classify, cost to store etc.), there has to be a cost/benefit analysis exercise at some point that justifies both the technological and the intellectual effort.
This example is a very useful use case. We should be asking ourselves ‘what are we interested in preserving here, and why’?
(Oh and good catch on the {0xda} I should have written {0x0d} (CR) and {0x0a} (LF), thanks)
Update on GREP
I managed to find some time to really go over this problem, and the grep idea worked really well, so thanks again for that Andy.
I can post a more complete update if there is any interest – I’m not sure it’s answered the primary questions, but it was a useful process to explore text encoding with real live data.
In short, I found that of the ~5000 files, about 10 have characters/bytes outside the expected range (about 30 occurrences of single characters across the whole collection). One of them was é, used in one report, which as a character is outside the 7-bit ASCII range. There were a couple of odd inverted commas, where the encoding seems to have come from another scheme. The lion’s share, though, was a character that appeared after numbers and was displayed as « by my decoder. It has a hex value of {0xab}, so I see ‘…1«% of …’ or ‘…2« Million…’. This is curious, and I am hunting around to see if there is an older/less used encoding scheme that would explain this. It doesn’t seem to be replacing/augmenting the [space] character. My only other thought at the moment is that it might be some markup for rendering, and given that we are unsure of either the creating or target application, that will be a tough idea to prove.
It does make me wonder if there would be some value in encoding identifiers… could we not use this approach and add known extensions to a fixed pool (e.g. as per above, ASCII + {0xab} is labelled encoding ‘ASCIILocalExtension’)? It would give me a format label to ringfence technically similar content at the minimum.
The other area is control words inside a document for markup etc. I did have a look at some WP stuff a while ago, and thought how useful it would be to have an ‘autodetect’ method for finding control words etc (and then perhaps even using the same scanning engine to replace known encodings to give a level of conformity where applicable…)
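A tally like the one described here (which byte values turn up, how often, and in what surroundings) could be produced along these lines. This is a sketch with made-up function names, not the process actually used:

```python
from collections import Counter

def tally_high_bytes(blobs):
    """Count how often each byte value >= 0x80 occurs across many files' bytes."""
    counts = Counter()
    for data in blobs:
        counts.update(b for b in data if b >= 0x80)
    return counts

def show(chunk: bytes) -> str:
    """Render printable ASCII as-is and every other byte in {0x..} form."""
    return "".join(chr(b) if 0x20 <= b < 0x7F else "{0x%02x}" % b for b in chunk)

def contexts(data: bytes, width: int = 8):
    """Return a short snippet of surrounding text for each high byte found."""
    return [show(data[max(0, i - width): i + width + 1])
            for i, b in enumerate(data) if b >= 0x80]
```

Showing the odd bytes in {0x..} form avoids committing to any one decoding while the encoding is still unknown.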
Why can’t we all just get along and use UTF-8?!
I followed a hunch, Googled ‘0xab codepage’ and got lucky.
So, this minor mysterious character, which I (like you) initially thought may be some short/non-breaking space or render markup, turns out to be a character (½) that significantly modifies the meaning of the message. Half a million seems significant to me! It is probably worth looking up the other one as well, in case that Codepage has a different meaning or accent for that character code.
I don’t know if we have a PUID for this codepage, and even then I’m not sure how one should use the chr/XXX PUIDs. Alternatively, maybe we could use “text/plain;charset=Cp437” or something?
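To see how much the reading of a single byte depends on the assumed codepage, here is a small sketch using Python's built-in codecs. The list of candidate encodings is my own guess at plausible ones for files of this period, not something known about the creating software:

```python
# Candidate encodings are an assumption, chosen only for comparison.
CANDIDATES = ("cp437", "cp850", "latin-1", "cp1252")

def readings(byte_value: int):
    """Decode one byte under each candidate encoding (None where undefined)."""
    out = {}
    for name in CANDIDATES:
        try:
            out[name] = bytes([byte_value]).decode(name)
        except UnicodeDecodeError:
            out[name] = None
    return out

for name, char in readings(0xAB).items():
    print(name, repr(char))
```

Under cp437 the byte 0xab decodes to ½, while under latin-1 and cp1252 it decodes to «, which matches what was seen on screen. Running the same comparison on the é and inverted-comma bytes might show whether one codepage explains all of them.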
All of this reminds me that I’d been meaning to look at some of the open source Charset Detection tools to see if they caught this kind of thing.
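This is not one of those detection tools, just a crude first-pass triage using nothing beyond the Python standard library. It only separates pure 7-bit files from the rest and makes no attempt to guess which 8-bit codepage was used; the labels it returns are my own:

```python
def classify(data: bytes) -> str:
    """Crude triage: 7-bit only, valid UTF-8, or some unidentified 8-bit encoding."""
    if all(b < 0x80 for b in data):
        return "7-bit ASCII"
    try:
        data.decode("utf-8")
        return "valid UTF-8"
    except UnicodeDecodeError:
        return "8-bit, encoding unknown"
```

Anything that lands in the last group would still need a human, or a proper detector, to decide between codepages.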
Assume nothing…
Bingo, great thought Andy. The original content (radio news report) is long dead, so I can’t go back to source and check the original words.
I will have a rummage around and see what I can find. Your comment answered a niggle that I keep revisiting, which was to see if, in these few documents, all numbers are affected, or if there are some without this extra character. This is a snippet from one of the offending files:
“WELL IF THE MORTGAGE RATE IS 10«% OR 11% AND HOUSEPRICES GO UP 20% IN A YEAR, I’M STILL GOING TO BE BETTER OFF.”
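The question of whether every number in a file carries the extra byte, or only some, could be checked with a byte-level regular expression. A sketch, assuming the byte of interest is always 0xab and always sits directly after the digits; `number_stats` is a made-up name:

```python
import re

# A run of digits, optionally followed directly by the byte 0xab.
NUMBER_RE = re.compile(rb"\d+(\xab?)")

def number_stats(data: bytes):
    """Count numbers that are, and are not, immediately followed by 0xab."""
    with_mark = without = 0
    for m in NUMBER_RE.finditer(data):
        if m.group(1):
            with_mark += 1
        else:
            without += 1
    return {"followed by 0xab": with_mark, "plain": without}
```

On the snippet above this would count one number with the extra byte (10) and two without (11 and 20), which fits the byte carrying meaning rather than being attached to every number.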
Also, thanks for the char detection tool. This looks very interesting and something I will explore. I have a huge ‘back-of-my-mind’ concern that we are not doing enough in the way of charset preservation.