Power-Programmierung

Text File | 1992-06-29 | 11.2 KB | 285 lines

══════════════════════════════
      9. DATA DEBLOCKING
══════════════════════════════

══════════════════════════════
    9.1 An aid to analysis
══════════════════════════════

It is common practice to group several data records together
into a block, either of fixed or variable length.  Before
input-output buffering was built into operating system software,
the use of blocks reduced the frequency of read/write
instructions and speeded up programs.  The size of a block
depended on (and often matched) the physical record size of the
storage medium.

In this topic, we examine several techniques of separating
blocks of data into records.  The topic is introduced at this
point because deblocking is often done within the analysis
stage.  Deblocking gets rid of byte counts or padding that have
nothing to do with the data being analyzed.  Byte surveys are
cleaner when they are restricted to the data proper.  The binary
component of a file may disappear completely through deblocking.

Blocks may be of fixed or variable length.  The data within a
fixed length block may itself be fixed.  Variable length data
can be found in blocks of any kind.

═════════════════════════════════
    9.2 Reducing line records
═════════════════════════════════

Line records date back to punch cards.  Continuous text would be
entered on a series of cards, with blank padding after the last
complete word that could fit on a given card.  Recall the
NEWLINES program, introduced in topic 7.1:

        NEWLINES  blocked_in  unblocked_out  bytes_per_line

NEWLINES simply inserted line feeds and carriage returns at
fixed intervals in the data.  For continuous text on 80 column
punched cards, this left blank padding at the end of almost
every line.

In order to get rid of blanks at the end of lines in any ASCII
text, use the utility F_TRAIL:

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Usage    f_trail [/4] < ASCII text > revised

         Remove trailing blanks from lines of ASCII text.
         The /4 option is for backward compatibility only; it
         leaves a blank in the fourth column where a line
         consists of a three digit field number only.

input:   Any printable ASCII file.

output:  The same file with trailing blanks removed from each
         line.

writeup: MIR TUTORIAL ONE, topic 9
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

For example, these two commands might be used in sequence:

        NEWLINES blocked.txt stage2.txt 80
        F_TRAIL < stage2.txt > stage3.txt

The file STAGE2.TXT in this case would be fixed length lines of
80 bytes each, plus line feed and carriage return.  STAGE3.TXT
would have variable length lines of text (none greater than 80)
and a line feed and carriage return at the end of each line.

The /4 option in F_TRAIL may be safely ignored.  It pads a three
digit field number with a single blank; this single blank pad is
not required in MIR production format records.  More on this in
MIR Tutorial TWO.

═════════════════════════════════════════
    9.3 Handling fixed length records
═════════════════════════════════════════

In topic 7.3 we showed how to extract a single field from a
fixed length record.  Here is a deblocking routine P_FIXED which
places all fields in continuous ASCII text:

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
usage:   p_fixed control_file fixed_length_input > ASCII_output

         Converts a fixed record length file to ASCII with field
         numbers.  A control file governs field lengths and
         handling of empty data.

input:   [1] A control file as in P_FIXED.CTL (also appears at
             end of source code).
         [2] The fixed length records data

output:  ASCII output with one or more lines per field.  New
         records are signalled by a line containing 000; all
         other lines begin with a three digit field number.
         Non-printable characters are shown in hex format with
         leading backslash.  Additional processing may be needed
         to bring individual fields into production indexing
         format.
writeup: MIR TUTORIAL ONE, topic 9
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

Here is the template P_FIXED.CTL:

#  Edit a copy of this file to use with P_FIXED.EXE in order
#  to break out fixed length records.  Each line consists of
#  three numbers and zero or more codes; each element is
#  separated by one or more blanks.  The numbers are:
#       field number
#       start byte (followed by R if right half of byte only)
#       end byte (followed by L if left half of byte only)
#  A special line must be included with field number 0, begin
#  byte 0, and end byte = last byte of record (i.e., record
#  length - 1).
#
#  Comment lines may be included.  Each must start with #
#
#  The codes that follow the three numbers are:
#       B   retain field if blank
#       Z   retain field if zeros
#       N   retain field if nulls
#       LB  retain leading blanks in field
#       LZ  retain leading zeros in field
#       TB  retain trailing blanks in field
#
0   0    53
1   0    27   LB TB
2   28   29
3   30   32L  N
4   32R  34L
5   35   38
6   39   42
7   43   49
8   50   50
9   51   52

The last ten lines above are samples only.  Simply edit a copy
of the template and give it a name of your choice.  Then run the
command P_FIXED with appropriate file names:

        P_FIXED my.ctl fixedlen.dta > ascii.dta

The output takes this form:

        000
        001 Text of field one
        002 Text of field two
        etc.
        015 \9a\81
        016 more data
        etc.

The output contains only ASCII characters.  Data that is in
non-printable form is converted to hexadecimal format a
character at a time.  Note that \9a is a single byte; three
characters are needed to represent each hexadecimal value.
Where a byte within a series of hexadecimals happens to be
printable, it is shown in its printable form.

More processing may be required on some fields.  Tutorial TWO
includes software for that purpose.
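
The breakout described in this topic can be illustrated with a
short sketch.  It is written in Python for brevity and is not
the P_FIXED source code.  The names escape and break_out, the
sample record and its layout are all hypothetical.  The sketch
handles whole-byte fields only; the half-byte R/L positions and
the retain codes (B, Z, N, LB, LZ, TB) are left out, so blanks
are always trimmed and blank fields are always dropped.

```python
def escape(data: bytes) -> str:
    """Show printable ASCII as-is; any other byte as backslash + two hex digits."""
    return "".join(chr(b) if 32 <= b <= 126 else "\\%02x" % b for b in data)


def break_out(record: bytes, fields):
    """Return the output lines for one fixed-length record.

    fields is a list of (field_number, start_byte, end_byte) tuples with
    inclusive end bytes, as in the control file.  Assumption: blanks are
    trimmed at both ends and fields left empty are dropped.
    """
    lines = ["000"]                              # new-record marker
    for number, start, end in fields:
        text = escape(record[start:end + 1].strip(b" "))
        if text:
            lines.append("%03d %s" % (number, text))
    return lines


# Hypothetical 12-byte record: an 8-byte name, then two 2-byte fields.
sample = b"SMITH   " + b"\x9a\x81" + b"  "
layout = [(1, 0, 7), (2, 8, 9), (3, 10, 11)]
print("\n".join(break_out(sample, layout)))
```

Run as shown, the sketch prints 000, then 001 SMITH, then
002 \9a\81.  Field 3 holds only blanks and so produces no line.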
══════════════════════════════════════════════
    9.4 Blocked records with ASCII lengths
══════════════════════════════════════════════

Variable length lines of ASCII text are sometimes blocked with a
four byte ASCII count at the beginning of each new line.  There
is no line feed or carriage return at the end of a line.  The
program DEBLOC_A may be used to deblock this kind of data.

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Usage    debloc_a ASCII_blocked_file > unblocked_version

         Remove blocking, insert line feeds in ASCII blocked
         file.

input:   ASCII file with four byte inclusive line lengths at the
         beginning of every line, no line feeds at end.

output:  Same data with counts out, line feeds/carriage returns
         in.

writeup: MIR TUTORIAL ONE, topic 9
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

Data might look like this (usually with longer lengths):

        0016First field.0009No. 2001401234567890013That's it

Note the inclusive counting.  The second field has only five
bytes, but the count adds another four bytes... 0009No. 2.
Deblocking that example would produce:

        First field.
        No. 2
        0123456789
        That's it

A byte survey of the blocked file would have heavy
concentrations of digits, especially of the digit zero.  The
data itself may contain digits, but in much smaller proportions.

═══════════════════════════════════════════════
    9.5 Blocked records with binary lengths
═══════════════════════════════════════════════

Newspaper and book publishers often use a blocking format which
has two levels.  The blocking values are in binary.  The order
of the binary bytes may vary.  The source code for DEBLOC_B
assumes high order byte, low order byte, then two NULLs to make
up the four bytes in each case.  Alter the source code in the
"get_data" function if you come across data with a different
sequence.

There are typically two levels... a block of several thousand
bytes, and sub-blocks within each block.  The counts are
inclusive.
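
The two level layout just described can be walked with a few
lines of code.  What follows is a sketch in Python, not the
DEBLOC_B source.  The names count_at and sub_blocks are
hypothetical.  It assumes that each block count covers its own
four bytes plus every sub-block inside the block, and that each
sub-block count covers its own four bytes plus its data.

```python
def count_at(data: bytes, pos: int) -> int:
    """Decode a four-byte count: high order byte, low order byte, two NULs."""
    high, low, pad1, pad2 = data[pos:pos + 4]
    if pad1 or pad2:
        raise ValueError("expected two NUL bytes at offset %d" % (pos + 2))
    return (high << 8) | low


def sub_blocks(data: bytes):
    """Yield the data portion of every sub-block in two-level blocked data."""
    pos = 0
    while pos < len(data):
        block_end = pos + count_at(data, pos)    # inclusive of the count itself
        pos += 4                                 # step over the block count
        while pos < block_end:
            length = count_at(data, pos)
            if length < 4:
                raise ValueError("bad sub-block count at offset %d" % pos)
            yield data[pos + 4:pos + length]     # drop the four count bytes
            pos += length


# Hypothetical block: two 5-byte items, so 4 + (4 + 5) + (4 + 5) = 22 bytes.
blocked = (bytes([0, 22, 0, 0])
           + bytes([0, 9, 0, 0]) + b"First"
           + bytes([0, 9, 0, 0]) + b"No. 2")
for item in sub_blocks(blocked):
    print(item.decode("ascii"))
```

For data with a different byte order, only count_at would need
to change, much as the text suggests altering "get_data" in the
real program.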
The program DEBLOC_B deblocks two level binary blocked data.  It
also addresses the problem that the data often originates on
mainframe computers which use the EBCDIC character set.  Using a
program like EBC_ASC to convert from EBCDIC to ASCII of course
also alters the bytes holding the binary block and sub-block
counts.  DEBLOC_B allows for this by converting the count bytes
back to EBCDIC before interpreting them.

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Usage    debloc_b binary_blocked_file [/s][/e] > unblocked_output

         Remove blocking, and (if not suppressed by /s argument)
         insert line feeds.  Argument /e must be used if file was
         originally EBCDIC, in which case the block lengths must
         be converted back to EBCDIC before they are interpreted.

input:   File with four byte binary inclusive block lengths and
         sub-lengths, two bytes in high to low order, then two
         NULLs.

output:  Same data with counts out, line feeds/carriage returns
         in (unless suppressed).

writeup: MIR TUTORIAL ONE, topic 9
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

The /s option is used if there is fixed length data included in
the result.  (I would have thought it unlikely, until I was
handed a nine-track tape containing such data.)

Notice the assumption that the data itself is printable ASCII
text.  If that is not the case and you are working in DOS, amend
the source code to write to a named binary output file.

An ancestor variation of DEBLOC_B is included with the source
code.  It has not been stylized for "copyleft", nor has it been
tested recently.  The program is P_MARC.C, intended for
deblocking MARC records.  MARC records were common for library
citation databases.  A companion ASCII document, MARC_REC.DOC,
is also included with the software.  It was reverse engineered
from a customer's data several years ago.  Its accuracy is not
assured.

≡≡≡≡->> QUESTION:  If you have access to data in MARC record
        format, could you either furnish a sample, or (better
        yet) take a run at upgrading both the MARC_REC.DOC
        document and the P_MARC.C source code?  <<-≡≡≡≡

                        *  *  *  *  *

Apart from the Glossary/Index, this completes MIR Tutorial ONE.
You have tools and learning materials that should equip you to
analyze most kinds of data that are likely to be indexed for
search using normal ASCII search terms... words, phrases,
numeric values, subject categories, etc.