- ═════════════════════════════════
-
- 8. WORKED EXAMPLES...
- BINARY DATA
-
- ═════════════════════════════════
-
-
- ≡≡≡≡->> QUESTION:
- If you have binary data that you can't decipher, and if
- you are able to give permission for a sample to be
- included in this topic as a worked example, then here's
- an opportunity! The first few samples received that I
- feel would be useful teaching tools for other readers
- will be built in, and the sender will receive a copy of
- the expanded topic.
- <<-≡≡≡≡
-
-
- ════════════════════════════════════
- 8.1 The preprocessing option
- ════════════════════════════════════
-
- Recall that the eight bits in a byte can occur in 256
- combinations. Most of the low-end control characters and the 128
- values with the high bit set are not printable. In a DOS file,
- allowance may be made for accented characters and, in some cases,
- for mathematical and Greek symbols. In topic 4 we looked at
- reasons why the non-printable characters might occur in a file...
- packed numbers, binary markup, compression substitutions, etc. To
- this list we may add data that is truly binary... numeric values
- occupying one, two or four bytes, or so-called "real" numbers with
- digits before and after a decimal point. Alternatively, files may
- contain data that are not intended for display on ASCII screens at
- all... audio, graphics, animation, etc.
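-
- As a rough sketch of the split described above, the following
- Python fragment sorts the bytes of a file into three classes. The
- function name is hypothetical, and counting tab, line feed and
- carriage return as printable is an assumption made for this
- example, not a rule from the tutorial.

```python
def byte_survey(data: bytes) -> dict:
    """Count bytes in three classes: printable, control, high-bit."""
    counts = {"printable": 0, "control": 0, "high_bit": 0}
    for b in data:
        if b >= 0x80:
            # One of the 128 values with the high bit set.
            counts["high_bit"] += 1
        elif 0x20 <= b <= 0x7E or b in (0x09, 0x0A, 0x0D):
            # Printable ASCII; tab, LF and CR are counted here by assumption.
            counts["printable"] += 1
        else:
            # Remaining low-end control characters (and DEL, 0x7F).
            counts["control"] += 1
    return counts


def survey_file(path: str) -> dict:
    """Read a whole file in binary mode and survey it."""
    with open(path, "rb") as handle:
        return byte_survey(handle.read())
```

- A file that reports zero in both the control and high_bit
- classes is a reasonable candidate for indexing without binary
- preprocessing.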
-
- It would be technically possible to index some types of
- binary data directly, provided the software were written to take
- into account every form of data which might be presented for
- indexing. The problem with specialized software is that disaster
- lurks one thoughtless command away. Sooner or later somebody is
- going to feed in data with peculiarities not foreseen by the
- programmer. In computer parlance, the "results are unpredictable."
- Specialized software requires specialized data.
-
- For most purposes, it is preferable to use a single
- general indexing program. Indexing at its core is a process of
- inverting a huge matrix. That's fairly complex and has to be made
- as efficient as possible. (We go into all the details in Tutorial
- THREE.) We don't want to load that software down with complex
- parsing. It's better to do any specialized work separately. Hence
- we add a step called "preprocessing" to bring data to a standard
- ASCII format. The following steps are applied, first to a sample,
- later to the full set of data:
-
- » Analyze data
-
- » Select or create preprocessing tools
-
- » Preprocess
-
- » Verify preprocessed data is to standard
-
- » Invert
-
- » Validate that indexes fulfil specifications
-
- » Set parameters, select display software
-
- » Finalize data
-
- » Test indexing and retrieval result
-
- We raise the subject of preprocessing now to show that
- all binary data has to be transformed prior to indexing. (This is
- independent of how the data is made available to the searcher, in
- its original form or preprocessed or otherwise modified.) Packed
- numbers must be unpacked, numeric fields changed to their ASCII
- equivalent, blocking data used to extract ASCII, markup used to
- identify records and their parts, compressed data uncompressed,
- etc.
-
-
- ═══════════════════════════
- 8.2 File signatures
- ═══════════════════════════
-
- Okay, you have some data to index and a byte survey
- shows that it contains non-printing characters. Where to from
- here?
-
- All data is created through the use of some form of
- software... Optical Character Recognition, word processing, report
- generators, transaction processing, and many other forms. Data
- that is not printable text often is identified internally so that
- its parent software will recognize the file if an attempt is made
- to process it. The signature is often within the first four bytes
- of a file. If not, the first 128 bytes usually contain important
- clues. Recall that to display the first 128 bytes, the command
-
- DUMP filename 0 128
-
- puts the data on the screen in Hexadecimal and ASCII format.
- Examples:
-
- » WordPerfect files are identified in the first four
- bytes by a hex value \FF followed by the letters "WPC".
-
- » Microsoft Word contains many null bytes among the first
- 128. Two items stand out, a style sheet name and a
- printer name, such as NORMAL.STY and NECP6.
-
- » DOS executable files have the signature "MZ" in the
- first two bytes. (But don't try to index their
- content!)
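-
- The inspection described above can be sketched in Python as
- follows. This is not the DUMP program itself, only an imitation
- of its display, and the function and table names are hypothetical.
- The table holds just the two signatures named in the list above.

```python
def hex_dump(data: bytes, start: int = 0, width: int = 16) -> str:
    """Format bytes as offset, hex values and an ASCII column."""
    pad = width * 3 - 1
    lines = []
    for off in range(0, len(data), width):
        chunk = data[off:off + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printing bytes are shown as dots in the ASCII column.
        text = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in chunk)
        lines.append(f"{start + off:6d}: {hex_part.ljust(pad)}  {text}")
    return "\n".join(lines)


# Signatures taken from the examples in this topic; extend as needed.
SIGNATURES = {
    b"\xffWPC": "WordPerfect",
    b"MZ": "DOS executable",
}


def identify(header: bytes) -> str:
    """Return the name of the first signature that matches, if any."""
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return "unknown"


def read_header(path: str, count: int = 128) -> bytes:
    """Read the first `count` bytes of a file in binary mode."""
    with open(path, "rb") as handle:
        return handle.read(count)
```

- Typical use would be print(hex_dump(read_header("MYFILE.DOC")))
- followed by identify() on the same header. A result of "unknown"
- only means the signature is not in the table.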
-
- ≡≡≡≡->> QUESTION:
- Another useful item you might contribute... the first
- 128 bytes of various file types, to extend the list of
- file "signatures". Use CPB to make these small samples
- and identify each sample by the software used to create
- it.
- <<-≡≡≡≡
-
-
- ════════════════════════════════════════════
- 8.3 Converting word processing files
- ════════════════════════════════════════════
-
- Many word processors have an option to output ASCII
- files. This option may be used to simplify preprocessing.
-
- The normal sequence to store a Microsoft Word file is
- Escape / Transfer / Save (three keystrokes \ESC T S). If a file
- name has already been specified, its name is presented for
- confirmation. To save the file in normal word processing mode,
- with its underlining, bold characters, etc., simply press return.
- To convert the file to ASCII, use the same \ESC T S sequence. You
- might modify the file name (for example, MYFILE.DOC to MYFILE.ASC).
- Then press TAB or the right arrow. The cursor moves to the right,
- past the word "formatted" and highlights Yes in a Yes / No choice.
- Press N for No. The program checks with you: "Enter Y to confirm
- loss of formatting." Press Y and the ASCII version is saved. If
- you do a byte survey on the result, you should find no binary data.
- There may be some work left to put the file into a standard format
- for indexing, but at least you are working with an ASCII file with
- no undefined codes.
-
- WordPerfect's sequence is not intuitively obvious, but
- it produces a clean result. It's CTL-F5 1 1. You are asked to
- supply a file name. Once that's in place, hit return and the ASCII
- version is saved. One of the nice features of this conversion is
- that all tabs are replaced by blanks, and you get a WYSIWYG file
- that preserves all indenting and centering. Note: My copy of
- WordPerfect 5.1 includes a CONVERT.EXE utility dated 02-08-91. The
- result is NOT the same. CONVERT lapses occasionally on indents and
- pads the end of the ASCII output with many unnecessary hex \1A
- bytes. (Incidentally, I use WordPerfect to write source code and
- to clean it up after debugging. I set margins to five characters
- left and right, and tabs to four. It's an easy habit to use the
- CTL-F5 1 1 sequence to save the file when ready to compile or tuck
- away for future use.)
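-
- The trailing hex \1A bytes mentioned above are the old DOS
- end-of-file marker (Ctrl-Z). If you end up with padded output, a
- small cleanup along these lines should remove it. The function
- name is hypothetical, and the sketch assumes the padding occurs
- only at the very end of the file.

```python
def strip_eof_padding(data: bytes) -> bytes:
    """Remove any run of 0x1A bytes from the end of the data."""
    return data.rstrip(b"\x1a")
```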
-
- ≡≡≡≡->> QUESTION:
- Sequences for other word processors, please!
- <<-≡≡≡≡
-
-
- ═════════════════════════════════════
- 8.4 Binary deblocking lengths
- ═════════════════════════════════════
-
- Binary blocking may be recognized by three features.
- First, there is a low proportion of binary characters in a file that
- is mostly printable. Second, the binary characters occur in small
- bursts at slightly varying distances. This is in contrast to the
- pattern with packed numbers (considered in the preceding topic) in
- which the non-printing characters recurred at specific points
- within fixed length records. The third feature is that the binary
- portions translate into lengths corresponding more or less to the
- distance between the current and the next burst.
-
- This third feature is not readily apparent. Display
- the data in hexadecimal and ASCII on either side of a burst.
-
- DUMP filename 19200 19400
-
- Within the result, look at the binary bytes, which in this case are
- \03\9e (decimal equivalent = 3 X 256 + 9 X 16 + 14 X 1 = 926,
- provided the bytes are stored in high to low order).
-
- 19264: 65 6e 64 20 6f 66 20 70 72 65 76 69 6f 75 73 20
- end of previous
- 19280: 62 6c 6f 63 6b 2e 03 9e 42 65 67 69 6e 6e 69 6e
- block...Beginnin
- 19296: 67 20 6f 66 20 74 68 65 20 6e 65 78 74 20 62 6c
- g of the next bl
-
- To test whether these are binary blocking data, do a
- dump 926 bytes further on and see if there are binary bytes either
- there or immediately adjacent. Variations occur depending on
- whether the blocking data include or exclude their own length, in
- this case two bytes.
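-
- The test above can be automated. The following Python sketch
- follows a chain of two-byte lengths from one burst to the next and
- reports whether the chain lands exactly on the end of the data.
- The function name and the includes_self switch are hypothetical;
- the sketch assumes two-byte lengths stored high-order first, as in
- the sample.

```python
def walk_blocks(data: bytes, offset: int = 0, includes_self: bool = False):
    """Follow two-byte block lengths; return (blocks, chain_is_clean)."""
    blocks = []
    while offset + 2 <= len(data):
        # High-order byte first: \03\9e becomes 3 * 256 + 158 = 926.
        length = int.from_bytes(data[offset:offset + 2], "big")
        # The stored length may or may not count its own two bytes.
        step = length if includes_self else length + 2
        if step < 2 or offset + step > len(data):
            # Chain broken: these bytes are probably not blocking data.
            return blocks, False
        blocks.append((offset, length))
        offset += step
    return blocks, offset == len(data)
```

- If the chain breaks under both settings of includes_self, try
- the reverse byte order before giving up on the blocking theory.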
-
- You may run into cases of extended block lengths,
- possibly four bytes, followed by local record or sub-block lengths
- which are usually two bytes each. The signal for this is four or
- six binary bytes at or near the top of the file. It is found in some
- library MARC records. In the next topic we introduce a program to
- deblock data in this form.
-
- Incidentally, the above sample is cooked data. It's
- not real. Here's a little data cooker:
-
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
- Usage: hex_bin ascii_input binary_output
-
- Create a file with any combination of printable and binary
- characters. Used to create test files.
-
- input: An edited ASCII file in which printable characters are as
- desired in output, and binary characters are represented by
- a backslash and two hex values (example, \08 to represent
- a backspace or \5C to represent a backslash).
-
- output: The same file with binary characters replacing all \xx
- values.
-
- writeup: MIR TUTORIAL ONE, topic 8
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
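-
- For readers without the program at hand, the behaviour described
- in the box can be imitated in a few lines of Python. This is a
- sketch written from the description above, not the MIR source
- code. The names hex_bin and cook_file are used here for
- illustration, and leaving a backslash untouched when it is not
- followed by two hex digits is an assumption.

```python
import re

# A backslash followed by exactly two hex digits, e.g. \08 or \5C.
_ESCAPE = re.compile(rb"\\([0-9A-Fa-f]{2})")


def hex_bin(text: bytes) -> bytes:
    """Replace every \\xx sequence with the single byte it names."""
    return _ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), text)


def cook_file(ascii_input: str, binary_output: str) -> None:
    """Read the edited ASCII file and write the cooked binary file."""
    with open(ascii_input, "rb") as src:
        cooked = hex_bin(src.read())
    with open(binary_output, "wb") as dst:
        dst.write(cooked)
```

- The input line "block.\03\9eBeginning" produces the burst shown
- in the sample dump earlier in this section.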
-
-
- ═══════════════════════════════════════════════
- 8.5 Binary data in fixed length records
- ═══════════════════════════════════════════════
-
- Binary data in fixed length records can be interpreted
- within a reasonable time period only if you have the record layout
- and/or examples of printouts that match data on hand and/or the
- software normally used to display the data. The more ASCII text
- sprinkled through the record, the easier it is to orient oneself
- within the record.
-
- Start by trying to match large integers... over 255 and
- if possible over 65535, that is, values that need more than one and
- preferably more than two bytes. Select a high integer value in the
- display or printout. Work out its hex equivalent. Example: The
- number 144,000 appears in the printout.
-
- 144,000 / 65536 = 2 (hex \02), remainder = 12,928
-
- 12,928 / 256 = 50 (hex \32), remainder = 128
-
- 128 / 1 = 128 (hex \80).
-
- Then look in the matching record for four bytes. Depending on the
- byte order of the system on which the data was created, the bytes
- \00\02\32\80 or the reverse set \80\32\02\00 should be somewhere
- within the record.
- For a fixed length layout, you now have identified exactly which
- bytes match the corresponding data. This work is painstaking, but
- it enables you to define the preprocessing task very precisely.
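-
- The search for a known value can be sketched in Python. The
- fragment below builds both byte orders of an integer and reports
- every position at which either one occurs in a record. The
- function name is hypothetical and the four-byte width is an
- assumption matching the example above.

```python
def find_value(record: bytes, value: int, size: int = 4):
    """Return (offset, byte_order) for each occurrence of value."""
    hits = []
    for order in ("big", "little"):
        # 144000 becomes \00\02\32\80 (big) or \80\32\02\00 (little).
        pattern = value.to_bytes(size, order)
        pos = record.find(pattern)
        while pos != -1:
            hits.append((pos, order))
            pos = record.find(pattern, pos + 1)
    return hits
```

- A single hit at the same offset in several records is strong
- evidence that you have located the field. Several hits in one
- record call for a second known value to settle the matter.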
-
- ≡≡≡≡->> QUESTION:
- Write a little routine to convert decimal values to
- hexadecimal. (UNIX has a utility BC which does exactly
- this task.)
- <<-≡≡≡≡
-
- Dates are often expressed as binary values as well.
- There are a variety of techniques. Typically a base is selected.
- Two bytes can hold elapsed days since the base date. More
- frequently, seven bits are reserved for years since the base year,
- four bits for the month and five for the day of the month. Once
- date bytes have been identified and matched to their corresponding
- display, the coding technique can be identified by inspection.
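-
- Both date techniques can be sketched in Python. The bit order
- used here (year in the top seven bits, then month, then day) and
- the base year of 1980 follow the DOS file date convention; they
- are assumptions for this example, and your data may use a
- different order or base. The function names are hypothetical.

```python
from datetime import date, timedelta


def unpack_ymd(value: int, base_year: int = 1980) -> date:
    """Decode a 16-bit value laid out as 7 year, 4 month, 5 day bits."""
    year = base_year + (value >> 9)   # top seven bits
    month = (value >> 5) & 0x0F       # middle four bits
    day = value & 0x1F                # low five bits
    # date() raises ValueError on an impossible month or day, which is
    # a useful hint that the guessed layout is wrong.
    return date(year, month, day)


def from_elapsed_days(days: int, base: date) -> date:
    """Decode a count of days elapsed since a base date."""
    return base + timedelta(days=days)
```

- Run a decoder over a handful of records and compare the results
- with the matching display before trusting it on the full set.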
-
-
- ═══════════════════════════
- 8.6 Compressed data
- ═══════════════════════════
-
- A variety of techniques are used for compression...
- pattern substitution, variable length encoding, Huffman code,
- suppression of repeated characters, differencing, etc. Unless
- information is available on how the data has been compressed, the
- indexer is faced with a daunting task. Until you have built up
- experience in other areas of preparation and indexing, you might
- choose to avoid working with compressed data.
-
- Pattern substitution is one method that can sometimes
- be deciphered within a reasonable amount of time. Repetitive
- strings are replaced in the text with one or two byte binary
- values. These binary bytes must be distinguishable from text. If
- there are no accented characters (or if accented characters are
- preceded by a reserved character to flag them), then the high bit
- set characters can be used either for 128 single byte replacements
- or as the lead byte for 32768 two byte replacements. In order to
- decompress rapidly, the display program relies on a body of text
- which it reads into RAM on startup. Some programs store the
- decompressed equivalents in a tree format. It's easier to work
- with those that employ a linear table, that is, the decompressed
- values listed one after another, usually with a supporting list of
- offsets indicating the beginning of each term.
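-
- To make the linear table idea concrete, here is a minimal Python
- sketch of single-byte pattern substitution. It assumes that a
- byte of \80 or higher stands for table entry number (byte - 128),
- and that the offsets list gives the start of each entry within the
- table. Both the token scheme and the names are assumptions; real
- products differ, and two-byte replacements are not handled.

```python
def decompress(data: bytes, table: bytes, offsets: list) -> bytes:
    """Expand high-bit tokens using a linear table and offset vector."""
    out = bytearray()
    for b in data:
        if b < 0x80:
            # Ordinary text byte: copy through unchanged.
            out.append(b)
            continue
        index = b - 0x80
        start = offsets[index]
        # An entry ends where the next one begins, or at the table end.
        end = offsets[index + 1] if index + 1 < len(offsets) else len(table)
        out += table[start:end]
    return bytes(out)
```

- With table "the ing tion" and offsets 0, 4 and 8, the input
- \80sec\82 expands to "the section".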
-
- Pattern substitution compressed data is recognized by:
-
- » high counts of non-printable characters;
-
- » words and word fragments frequently interspersed with
- binary values;
-
- » availability of display of the uncompressed form that
- matches the data being analyzed;
-
- » existence of a decompression table and possibly a
- vector of offsets pointing to starting points within
- the table.
-
- Once the compression technique is understood and the decompression
- table is available, it is possible to create software to decompress
- the data as a first step in preprocessing. Because of the variety
- of techniques in use, it is nearly impossible to write software to
- cover all cases. At a minimum, expect to spend some time adapting
- software; in the worst case, be prepared to write from scratch.
-
- ≡≡≡≡->> QUESTION:
- The source code I have on hand is too specific to have
- any teaching value. Do you have any C code that could
- serve the purpose? Alternatively, do you have a sample
- and a decompression table for which I could write the
- algorithm?
- <<-≡≡≡≡
-
-
- * * * * *
-
- Of all data formats, binary data presents the greatest
- challenge to the indexer. We have looked at reasons to preprocess
- binary data to a standard ASCII format. Where possible, we use the
- signature of a binary file to identify the parent software and use
- that software to reduce the data to an ASCII alternative. Binary
- blocking information and binary data within fixed length records
- can be fairly readily transformed to ASCII. Compressed data is
- more difficult, but some forms can be preprocessed within a
- reasonable time frame.