Power-Programmierung

Text File | 1992-06-29 | 15.6 KB | 345 lines

═════════════════════════════════
8. WORKED EXAMPLES... BINARY DATA
═════════════════════════════════

≡≡≡≡->> QUESTION: If you have binary data that you can't decipher, and if you are able to give permission for a sample to be included in this topic as a worked example, then here's an opportunity! The first few samples received that I feel would be useful teaching tools for other readers will be built in, and the sender will receive a copy of the expanded topic. <<-≡≡≡≡

════════════════════════════════════
    8.1 The preprocessing option
════════════════════════════════════

Recall that the eight bits in a byte can occur in 256 combinations. Most of the low end control characters and the 128 values with the high bit set are not printable. In a DOS file, allowance may be made for accented characters and, in some cases, for mathematical and Greek symbols.

In topic 4 we looked at reasons why the non-printable characters might occur in a file... packed numbers, binary markup, compression substitutions, etc. To this list we may add data that is truly binary... numeric values occupying one, two or four bytes, or so-called "real" numbers with values before and after a decimal place.

Alternatively, files may contain data that are not intended for display on ASCII screens at all... audio, graphics, animation, etc.

It would be technically possible to index some types of binary data directly, provided the software were written to take into account every form of data which might be presented for indexing. The problem with specialized software is that disaster lurks one thoughtless command away. Sooner or later somebody is going to feed in data with peculiarities not foreseen by the programmer. In computer parlance, the "results are unpredictable." Specialized software requires specialized data.

For most purposes, it is preferable to use a single general indexing program. Indexing at its core is a process of inverting a huge matrix.
That's fairly complex and has to be made as efficient as possible. (We go into all the details in Tutorial THREE.) We don't want to load that software down with complex parsing. It's better to do any specialized work separately. Hence we add a step called "preprocessing" to bring data to a standard ASCII format.

The following steps are applied, first to a sample, later to the full set of data:

     » Analyze data
     » Select or create preprocessing tools
     » Preprocess
     » Verify preprocessed data is to standard
     » Invert
     » Validate that indexes fulfil specifications
     » Set parameters, select display software
     » Finalize data
     » Test indexing and retrieval result

We raise the subject of preprocessing now to show that all binary data has to be transformed prior to indexing. (This is independent of how the data is made available to the searcher, in its original form or preprocessed or otherwise modified.) Packed numbers must be unpacked, numeric fields changed to their ASCII equivalent, blocking data used to extract ASCII, markup used to identify records and their parts, compressed data uncompressed, etc.

═══════════════════════════
    8.2 File signatures
═══════════════════════════

Okay, you have some data to index and a byte survey shows that it contains non-printing characters. Where to from here?

All data is created through the use of some form of software... Optical Character Recognition, word processing, report generators, transaction processing, and many other forms. Data that is not printable text often is identified internally so that its parent software will recognize the file if an attempt is made to process it.

The signature is often within the first four bytes of a file. If not, the first 128 bytes usually contain important clues. Recall that to display the first 128 bytes, the command

     DUMP filename 0 128

puts the data on the screen in Hexadecimal and ASCII format.
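
For readers without the DUMP utility at hand, the display it produces can be approximated with a few lines of Python. This is only a minimal sketch: the function name hex_dump, the row layout and the use of a dot for non-printing bytes are assumptions made for illustration, and the real DUMP output may differ in detail.

```python
def hex_dump(data: bytes, start: int = 0, end: int = 128, width: int = 16) -> str:
    """Return a hex + ASCII listing of data[start:end], `width` bytes per row.

    Hypothetical helper for illustration; not the tutorial's DUMP program.
    """
    rows = []
    for offset in range(start, min(end, len(data)), width):
        chunk = data[offset:min(offset + width, end)]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Printable ASCII is shown as-is; every other byte becomes a dot.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        # Pad the hex column so the ASCII column lines up on short rows.
        rows.append(f"{offset:6d}: {hex_part:<{width * 3 - 1}} {text}")
    return "\n".join(rows)
```

To survey a file, read its first 128 bytes in binary mode and pass them in, for example print(hex_dump(open(name, "rb").read(128))), where name is whatever file is being examined.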
Examples of signatures:

     » WordPerfect files are identified in the first four bytes by a hex value \FF followed by the letters "WPC".
     » Microsoft Word contains many null bytes among the first 128. Two items stand out, a style sheet name and a printer name, such as NORMAL.STY and NECP6.
     » DOS executable files have the signature "MZ" in the first two bytes. (But don't try to index their content!)

≡≡≡≡->> QUESTION: Another useful item you might contribute... the first 128 bytes of various file types, to extend the list of file "signatures". Use CPB to make these small samples and identify each sample by the software used to create it. <<-≡≡≡≡

════════════════════════════════════════════
    8.3 Converting word processing files
════════════════════════════════════════════

Many word processors have an option to output ASCII files. This option may be used to simplify preprocessing.

The normal sequence to store a Microsoft Word file is Escape / Transfer / Save (three keystrokes \ESC T S). If a file name has already been specified, its name is presented for confirmation. To save the file in normal word processing mode, with its underlining, bold characters, etc., simply press return.

To convert the file to ASCII, use the same \ESC T S sequence. You might modify the file name (for example, MYFILE.DOC to MYFILE.ASC). Then press TAB or the right arrow. The cursor moves to the right, past the word "formatted" and highlights Yes in a Yes / No choice. Press N for No. The program checks with you: "Enter Y to confirm loss of formatting." Press Y and the ASCII version is saved.

If you do a byte survey on the result, you should find no binary data. There may be some work left to put the file into a standard format for indexing, but at least you are working with an ASCII file with no undefined codes.

WordPerfect's sequence is not intuitively obvious, but it produces a clean result. It's CTL-F5 1 1. You are asked to supply a file name. Once that's in place, hit return and the ASCII version is saved.
One of the nice features of this conversion is that all tabs are replaced by blanks, and you get a WYSIWYG file that preserves all indenting and centering.

Note: My copy of WordPerfect 5.1 includes a CONVERT.EXE utility dated 02-08-91. The result is NOT the same. CONVERT lapses occasionally on indents and pads the end of the ASCII output with many unnecessary hex \1A bytes.

(Incidentally, I use WordPerfect to write source code and to clean it up after debugging. I set margins to five characters left and right, and tabs to four. It's an easy habit to use the CTL-F5 1 1 sequence to save the file when ready to compile or tuck away for future use.)

≡≡≡≡->> QUESTION: Sequences for other word processors, please! <<-≡≡≡≡

═════════════════════════════════════
    8.4 Binary deblocking lengths
═════════════════════════════════════

Binary deblocking may be recognized by three features. First, there is a low proportion of binary characters in a file that is mostly printable. Second, the binary characters occur in small bursts at slightly varying distances. This is in contrast to the pattern with packed numbers (considered in the preceding topic) in which the non-printing characters recurred at specific points within fixed length records. The third feature is that the binary portions translate into lengths corresponding more or less to the distance between the current and the next burst.

This third feature is not readily apparent. Display the data in hexadecimal and ASCII on either side of a burst.

     DUMP filename 19200 19400

Within the result, look at the binary bytes which in this case are \03\9e (decimal equivalent = 3 X 256 + 9 X 16 + 14 X 1 = 926, provided bytes are in sequence of high to low order).

     19264: 65 6e 64 20 6f 66 20 70 72 65 76 69 6f 75 73 20  end of previous
     19280: 62 6c 6f 63 6b 2e 03 9e 42 65 67 69 6e 6e 69 6e  block...Beginnin
     19296: 67 20 6f 66 20 74 68 65 20 6e 65 78 74 20 62 6c  g of the next bl

To test whether these are binary blocking data, do a dump 926 bytes further on and see if there are binary bytes either there or immediately adjacent. Variations occur depending on whether the blocking data include or exclude their own length, in this case two bytes.

You may run into cases of extended block lengths, possibly four bytes, followed by local record or sub-block lengths which are usually two bytes each. The signal for this is four or six bytes at or near the top of the file. It is found in some library MARC records.

In the next topic we introduce a program to deblock data in this form.

Incidentally, the above sample is cooked data. It's not real. Here's a little data cooker:

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Usage     hex_bin ascii_input binary_output

          Create a file with any combination of printable and binary characters. Used to create test files.

input:    An edited ASCII file in which printable characters are as desired in output, and binary characters are represented by a backslash and two hex values (example, \08 to represent a backspace or \5C to represent a backslash).

output:   The same file with binary characters replacing all \xx values.

writeup:  MIR TUTORIAL ONE, topic 8
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

═══════════════════════════════════════════════
    8.5 Binary data in fixed length records
═══════════════════════════════════════════════

Binary data in fixed length records can be interpreted within a reasonable time period only if you have the record layout and/or examples of printouts that match data on hand and/or the software normally used to display the data. The more ASCII data sprinkled through the data, the easier it is to orient oneself within the record.

Start by trying to match large integers... over 255 and if possible over 65535, that is, more than one and preferably more than two bytes in length. Select a high integer value in the display or printout. Work out its hex equivalent.

Example: The number 144,000 appears in the printout.

     144,000 / 65536 =   2  (hex \02), remainder = 12,928
      12,928 /   256 =  50  (hex \32), remainder = 128
         128 /     1 = 128  (hex \80)

Then look in the matching record for four bytes. Depending on the byte order of the system on which the file was created, the bytes \00\02\32\80 or the reverse set \80\32\02\00 should be somewhere within the record. For a fixed length layout, you now have identified exactly which bytes match the corresponding data. This work is painstaking, but it enables you to define the preprocessing task very precisely.

≡≡≡≡->> QUESTION: Write a little routine to convert decimal values to hexadecimal. (UNIX has a utility BC which does exactly this task.) <<-≡≡≡≡

Dates are often expressed as binary values as well. There are a variety of techniques. Typically a base is selected. Two bytes can hold elapsed days since the base date. More frequently, seven bits are reserved for years since the base year, four bits for the month and five for the day of the month. Once date bytes have been identified and matched to their corresponding display, the coding technique can be identified by inspection.

═══════════════════════════
    8.6 Compressed data
═══════════════════════════

A variety of techniques are used for compression... pattern substitution, variable length encoding, Huffman code, suppression of repeated characters, differencing, etc. Unless information is available on how the data has been compressed, the indexer is faced with a daunting task. Until you have built up experience in other areas of preparation and indexing, you might choose to avoid working with compressed data.

Pattern substitution is one method that can sometimes be deciphered within a reasonable amount of time.
Repetitive strings are replaced in the text with one or two byte binary values. These binary bytes must be distinguishable from text. If there are no accented characters (or if accented characters are preceded by a reserved character to flag them), then the high bit set characters can be used either for 128 single byte replacements or as the lead byte for 32768 two byte replacements.

In order to decompress rapidly, the display program relies on a body of text which it reads into RAM on startup. Some programs store the decompressed equivalents in a tree format. It's easier to work with those that employ a linear table, that is, the decompressed values listed one after another, usually with a supporting list of offsets indicating the beginning of each term.

Pattern substitution compressed data is recognized by:

     » high counts of non-printable characters;
     » words and word fragments interspersed frequently within binary values;
     » availability of display of the uncompressed form that matches the data being analyzed;
     » existence of a decompression table and possibly a vector of offsets pointing to starting points within the table.

Once the compression technique is understood and the decompression table is available, it is possible to create software to decompress the data as a first step in preprocessing. Because of the variety of techniques in use, it is nearly impossible to write software to cover all cases. At a minimum, expect to spend some time adapting software; in the worst case, be prepared to write from scratch.

≡≡≡≡->> QUESTION: The source code I have on hand is too specific to have any teaching value. Do you have any C code that could serve the purpose? Alternatively, do you have a sample and a decompression table for which I could write the algorithm? <<-≡≡≡≡

                         * * * * *

Of all data formats, binary data presents the greatest challenge to the indexer. We have looked at reasons to preprocess binary data to a standard ASCII format.
Where possible, we use the signature of a binary file to identify the parent software and use that software to reduce the data to an ASCII alternative. Binary blocking information and binary data within fixed length records can be fairly readily transformed to ASCII. Compressed data is more difficult, but some forms can be preprocessed within a reasonable time frame.
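
As one illustration of the fixed length record work described in section 8.5, the search for a known integer in both byte orders can be sketched in Python. This is a minimal sketch under stated assumptions: the function name find_integer, the default four-byte width and the sample record are all hypothetical, and real data may store the value in another width or encoding.

```python
def find_integer(record: bytes, value: int, width: int = 4):
    """Return (offset, byte_order) pairs where `value` occurs in `record`.

    Hypothetical helper: tries the value high byte first ("big") and
    low byte first ("little"), as suggested in section 8.5.
    """
    hits = []
    for order in ("big", "little"):
        pattern = value.to_bytes(width, order)
        pos = record.find(pattern)
        while pos != -1:
            hits.append((pos, order))
            pos = record.find(pattern, pos + 1)
    return hits


# Made-up 16-byte record: an 8-byte name field, then 144,000 stored
# low byte first, then four null bytes of padding.
record = b"SMITH   " + bytes([0x80, 0x32, 0x02, 0x00]) + b"\x00" * 4

print((144000).to_bytes(4, "big").hex())   # 00023280
print(find_integer(record, 144000))        # [(8, 'little')]
```

A hit at the same offset in every record of a fixed length file is good evidence that the field has been located; a single hit may be coincidence.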