- ══════════════════════════════
-
- 9. DATA DEBLOCKING
-
- ══════════════════════════════
-
-
- ══════════════════════════════
- 9.1 An aid to analysis
- ══════════════════════════════
-
- It is common practice to group several data records
- together into a block of either fixed or variable length. Before
- input-output buffering was built into operating system software,
- the use of blocks reduced the frequency of read/write instructions
- and sped up programs. The size of a block depended on (and
- often matched) the physical record size of the storage medium.
-
- In this topic, we examine several techniques of
- separating blocks of data into records. The topic is introduced at
- this point because deblocking is often done within the analysis
- stage. Deblocking gets rid of byte counts or padding that have
- nothing to do with the data being analyzed. Byte surveys are
- cleaner when they are restricted to the data proper. The binary
- component of a file may disappear completely through deblocking.
-
- Blocks may be of fixed or variable length. The data
- within a fixed length block may itself be of fixed length. Variable
- length data can be found in blocks of any kind.
-
-
- ═════════════════════════════════
- 9.2 Reducing line records
- ═════════════════════════════════
-
- Line records date back to punch cards. Continuous text
- would be entered on a series of cards, with blank padding after the
- last complete word that could fit on a given card. Recall the
- NEWLINES program, introduced in topic 7.1:
-
- NEWLINES blocked_in unblocked_out bytes_per_line
-
- NEWLINES simply inserted line feeds and carriage returns at fixed
- intervals in the data. For continuous text on 80 column punched
- cards, this left blank padding at the end of almost every line. In
- order to get rid of blanks at the end of lines in any ASCII text,
- use the utility F_TRAIL:
-
-
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
- Usage f_trail [/4] < ASCII text > revised
-
- Remove trailing blanks from lines of ASCII text. The /4
- option is for backward compatibility only; it leaves a
- blank in the fourth column where a line consists of a
- three digit field number only.
-
- input: Any printable ASCII file.
-
- output: The same file with trailing blanks removed from each line.
-
- writeup: MIR TUTORIAL ONE, topic 9
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
-
- For example, these two commands might be used in sequence:
-
- NEWLINES blocked.txt stage2.txt 80
- F_TRAIL < stage2.txt > stage3.txt
-
- The file STAGE2.TXT in this case would consist of fixed length lines
- of 80 bytes each, plus line feed and carriage return. STAGE3.TXT
- would have variable length lines of text (none longer than 80
- bytes) and a line feed and carriage return at the end of each line.
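-
- The two steps can be sketched in Python. This is an illustration
- only, not the MIR source code: the function names merely echo the
- utilities, the line ending is assumed to be carriage return followed
- by line feed, and a line width of 10 stands in for 80 to keep the
- sample short.

```python
def newlines(data: bytes, bytes_per_line: int) -> bytes:
    """Cut continuous data into fixed length lines, each ended by CR LF."""
    lines = [data[i:i + bytes_per_line]
             for i in range(0, len(data), bytes_per_line)]
    return b"".join(line + b"\r\n" for line in lines)


def f_trail(text: bytes) -> bytes:
    """Remove trailing blanks from every line, keeping the CR LF endings."""
    stripped = [line.rstrip(b" ") for line in text.split(b"\r\n")]
    return b"\r\n".join(stripped)


# Two "cards" of 10 columns each, padded with blanks (assumed sample data).
blocked = b"HELLO".ljust(10) + b"WORLD".ljust(10)
stage2 = newlines(blocked, 10)   # fixed length lines plus line endings
stage3 = f_trail(stage2)         # variable length lines, padding gone
print(stage3)
```

- The intermediate result keeps every line at the full width; only
- the second step shortens the lines.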
-
- The /4 option in F_TRAIL may be safely ignored. It
- pads a three digit field number with a single blank; this single
- blank pad is not required in MIR production format records. More
- on this in MIR Tutorial TWO.
-
-
- ═════════════════════════════════════════
- 9.3 Handling fixed length records
- ═════════════════════════════════════════
-
- In topic 7.3 we showed how to extract a single field
- from a fixed length record. Here is a deblocking routine P_FIXED
- which places all fields in continuous ASCII text:
-
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
- usage: p_fixed control_file fixed_length_input > ASCII_output
-
- Converts a fixed record length file to ASCII with field
- numbers. A control file governs field lengths and
- handling of empty data.
-
- input: [1] A control file as in P_FIXED.CTL (also appears at
- end of source code).
- [2] The fixed length records data
-
- output: ASCII output with one or more lines per field. New records
- are signalled by a line containing 000; all other lines
- begin with a three digit field number. Non-printable
- characters are shown in hex format with leading backslash.
- Additional processing may be needed to bring individual
- fields into production indexing format.
-
- writeup: MIR TUTORIAL ONE, topic 9
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
-
- Here is the template P_FIXED.CTL:
-
- # Edit a copy of this file to use with P_FIXED.EXE in order
- # to break out fixed length records. Each line consists of
- # three numbers and zero or more codes; each element is
- # separated by one or more blanks. The numbers are:
- # field number
- # start byte (followed by R if right half of byte only)
- # end byte (followed by L if left half of byte only)
- # A special line must be included with field number 0, begin
- # byte 0, and end byte = last byte of record (i.e., record
- # length - 1).
- #
- # Comment lines may be included. Each must start with #
- #
- # The codes that follow the three numbers are:
- # B retain field if blank
- # Z retain field if zeros
- # N retain field if nulls
- # LB retain leading blanks in field
- # LZ retain leading zeros in field
- # TB retain trailing blanks in field
- #
- 0 0 53
- 1 0 27 LB TB
- 2 28 29
- 3 30 32L N
- 4 32R 34L
- 5 35 38
- 6 39 42
- 7 43 49
- 8 50 50
- 9 51 52
-
- The last ten lines above are samples only. Simply edit
- a copy of the template and give it a name of your choice. Then run
- the command P_FIXED with appropriate file names:
-
- P_FIXED my.ctl fixedlen.dta > ascii.dta
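-
- To make the control file layout concrete, here is a hedged Python
- sketch of how such lines might be read. It is not the P_FIXED source
- code. The names FieldSpec, parse_control and whole_bytes are
- invented for this illustration, and the half byte markers (R and L)
- are only recorded, not acted upon, because the way P_FIXED splits a
- byte is not described in this topic.

```python
from typing import NamedTuple


class FieldSpec(NamedTuple):
    number: int
    start: int
    start_right_half: bool   # "R" suffix: right half of the start byte only
    end: int
    end_left_half: bool      # "L" suffix: left half of the end byte only
    codes: frozenset         # B, Z, N, LB, LZ, TB


def parse_control(text: str) -> list:
    """Read control lines, skipping blank lines and # comments."""
    specs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        start_tok, end_tok = parts[1].upper(), parts[2].upper()
        specs.append(FieldSpec(
            number=int(parts[0]),
            start=int(start_tok.rstrip("R")),
            start_right_half=start_tok.endswith("R"),
            end=int(end_tok.rstrip("L")),
            end_left_half=end_tok.endswith("L"),
            codes=frozenset(p.upper() for p in parts[3:]),
        ))
    return specs


def whole_bytes(record: bytes, spec: FieldSpec) -> bytes:
    """Bytes from start to end inclusive (half byte markers ignored here)."""
    return record[spec.start:spec.end + 1]


CONTROL = """\
# a few of the sample lines from the template
0 0 53
1 0 27 LB TB
2 28 29
3 30 32L N
4 32R 34L
"""

specs = parse_control(CONTROL)
record_length = specs[0].end + 1   # field 0 carries the last byte of the record
print(record_length, specs[4])
```

- Byte positions count from zero, so the special field 0 line with end
- byte 53 describes a record of 54 bytes.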
-
- The output takes this form:
-
- 000
- 001 Text of field one
- 002 Text of field two
- etc.
- 015 \9a\81
- 016 more data
- etc.
-
- The output contains only ASCII characters. Data that is in
- non-printable form is converted to hexadecimal format a byte at a
- time. Note that \9a is a single byte; three characters (a backslash
- and two hexadecimal digits) are needed to represent each such byte.
- Where a byte within a series of hexadecimals happens to be
- printable, it is shown in its printable form.
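-
- The hexadecimal convention can be sketched as follows. This is an
- assumption-laden illustration rather than the P_FIXED code: the
- names show_printable and format_line are invented here, "printable"
- is taken to mean bytes 0x20 through 0x7E, and a literal backslash in
- the data is passed through unchanged because its treatment is not
- stated in this topic.

```python
def show_printable(field: bytes) -> str:
    """Render printable bytes as themselves, all others as \\xx in hex."""
    out = []
    for b in field:
        if 0x20 <= b <= 0x7E:
            out.append(chr(b))          # printable ASCII: keep as is
        else:
            out.append("\\%02x" % b)    # backslash plus two hex digits
    return "".join(out)


def format_line(field_number: int, field: bytes) -> str:
    """Three digit field number, a blank, then the rendered data."""
    return "%03d %s" % (field_number, show_printable(field))


print(format_line(15, b"\x9a\x81"))
print(format_line(16, b"more data"))
print(show_printable(b"\x9aA\x81"))    # printable byte between two hex ones
```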
-
- More processing may be required on some fields.
- Tutorial TWO includes software for that purpose.
-
-
- ══════════════════════════════════════════════
- 9.4 Blocked records with ASCII lengths
- ══════════════════════════════════════════════
-
- Variable length lines of ASCII text are sometimes
- blocked with a four byte ASCII count at the beginning of each new
- line. There is no line feed or carriage return at the end of a
- line. The program DEBLOC_A may be used to deblock this kind of
- data.
-
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
- Usage debloc_a ASCII_blocked_file > unblocked_version
-
- Remove blocking, insert line feeds in ASCII blocked file.
-
- input: ASCII file with four byte inclusive line lengths at the
- beginning of every line, no line feeds at end.
-
- output: Same data with counts out, line feeds/carriage returns in.
-
- writeup: MIR TUTORIAL ONE, topic 9
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
-
- Data might look like this (usually with longer
- lengths):
-
- 0016First field.0009No. 2001401234567890013That's it
-
- Note the inclusive counting. The second field has only five bytes,
- but the count adds another four bytes... 0009No. 2. Deblocking
- that example would produce:
-
- First field.
- No. 2
- 0123456789
- That's it
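-
- The inclusive count logic can be sketched in Python. This is not
- the DEBLOC_A source code; the function name debloc_a is borrowed
- for illustration and the error handling is an assumption. In the
- sample string the count for "No. 2" is written as 0009 (five data
- bytes plus the four count bytes) so that every count is inclusive.

```python
def debloc_a(blocked: bytes) -> list:
    """Split data blocked with four byte inclusive ASCII counts into lines."""
    lines = []
    pos = 0
    while pos < len(blocked):
        count = int(blocked[pos:pos + 4].decode("ascii"))
        # The count includes its own four bytes, so it can never be below 4.
        if count < 4 or pos + count > len(blocked):
            raise ValueError("bad count %d at offset %d" % (count, pos))
        lines.append(blocked[pos + 4:pos + count])
        pos += count
    return lines


sample = b"0016First field.0009No. 2001401234567890013That's it"
for line in debloc_a(sample):
    print(line.decode("ascii"))
```

- Writing each returned line followed by a line ending gives the
- unblocked form.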
-
- A byte survey of the blocked file would have heavy
- concentrations of digits, especially of the digit zero. The data
- itself may contain digits, but in much smaller proportions.
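-
- A rough way to see that concentration is to compare the share of
- digit bytes before and after deblocking. The helper digit_share is
- invented for this sketch, and the short sample is far smaller than
- real data, so the figures are illustrative only.

```python
def digit_share(data: bytes) -> float:
    """Fraction of bytes that are ASCII digits 0 through 9."""
    if not data:
        return 0.0
    digits = sum(1 for b in data if 0x30 <= b <= 0x39)
    return digits / len(data)


blocked = b"0016First field.0009No. 2001401234567890013That's it"
unblocked = b"First field." + b"No. 2" + b"0123456789" + b"That's it"

print(round(digit_share(blocked), 3), round(digit_share(unblocked), 3))
```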
-
-
- ═══════════════════════════════════════════════
- 9.5 Blocked records with binary lengths
- ═══════════════════════════════════════════════
-
- Newspaper and book publishers often use a blocking
- format which has two levels... a block of several thousand bytes,
- and sub-blocks within each block. The blocking values are in
- binary, and the counts are inclusive. The order of the binary
- bytes may vary. The source code for DEBLOC_B assumes high order
- byte, low order byte, then two NULLs to make up the four bytes in
- each case. Alter the source code in the "get_data" function if you
- come across data with a different sequence.
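-
- A two level layout of this kind can be walked as in the Python
- sketch below. It is not the DEBLOC_B source code. The names
- read_count, debloc_b and make_count are invented for illustration,
- and the sketch assumes that blocks follow one another with no
- padding between them, which real tape data may not honor.

```python
def read_count(buf: bytes, pos: int) -> int:
    """Four byte count: high order byte, low order byte, two NULLs."""
    header = buf[pos:pos + 4]
    if len(header) < 4 or header[2:] != b"\x00\x00":
        raise ValueError("bad count header at offset %d" % pos)
    return (header[0] << 8) | header[1]


def debloc_b(data: bytes) -> list:
    """Return the sub-block contents of two level binary blocked data."""
    records = []
    pos = 0
    while pos < len(data):
        block_len = read_count(data, pos)      # inclusive of its 4 bytes
        block_end = pos + block_len
        if block_len < 4 or block_end > len(data):
            raise ValueError("bad block length at offset %d" % pos)
        sub = pos + 4
        while sub < block_end:
            sub_len = read_count(data, sub)    # also inclusive
            if sub_len < 4 or sub + sub_len > block_end:
                raise ValueError("bad sub-block length at offset %d" % sub)
            records.append(data[sub + 4:sub + sub_len])
            sub += sub_len
        pos = block_end
    return records


def make_count(n: int) -> bytes:
    """Build a count header for test data (hypothetical helper)."""
    return bytes([n >> 8, n & 0xFF, 0, 0])


sub1 = make_count(4 + 5) + b"first"
sub2 = make_count(4 + 6) + b"second"
block = make_count(4 + len(sub1) + len(sub2)) + sub1 + sub2
print(debloc_b(block + block))
```

- For data with a different byte order, only read_count would need
- to change in this sketch.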
-
- The program DEBLOC_B deblocks two level binary blocked
- data. It also addresses the problem that the data often originates
- on mainframe computers which use the EBCDIC character set. A
- program like EBC_ASC, run over the whole file to convert EBCDIC to
- ASCII, also translates the bytes holding the binary block and
- sub-block counts, so they no longer hold the right values. DEBLOC_B
- provides for this situation by converting the counting bytes back
- to EBCDIC before interpreting them.
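-
- The reconversion idea can be sketched in Python. The translation
- table used by EBC_ASC is not given in this topic, so Python's
- built-in cp037 codec stands in for it here purely as an assumption;
- the function names are likewise invented. The approach only works
- when the table is one-to-one. If the original conversion mapped
- several EBCDIC codes to the same ASCII byte, the counts cannot be
- recovered this way.

```python
def ebc_to_asc(data: bytes) -> bytes:
    """Stand-in for an EBCDIC to ASCII conversion (cp037 assumed)."""
    return data.decode("cp037").encode("latin-1")


def asc_to_ebc(data: bytes) -> bytes:
    """Reverse of the stand-in conversion above."""
    return data.decode("latin-1").encode("cp037")


def count_from_converted(buf: bytes, pos: int = 0) -> int:
    """Restore the four counting bytes, then read high and low order bytes."""
    header = asc_to_ebc(buf[pos:pos + 4])
    if len(header) < 4 or header[2:] != b"\x00\x00":
        raise ValueError("bad count header at offset %d" % pos)
    return (header[0] << 8) | header[1]


original = bytes([0x01, 0x2C, 0x00, 0x00])   # binary count of 300
converted = ebc_to_asc(original)             # what a whole file conversion leaves
print(converted, count_from_converted(converted))
```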
-
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
- Usage debloc_b binary_blocked_file [/s][/e] > unblocked_output
-
- Remove blocking, and (if not suppressed by /s argument)
- insert line feeds. Argument /e must be used if file was
- originally EBCDIC, in which case the block lengths must
- be converted back to EBCDIC before they are interpreted.
-
- input: File with four byte binary inclusive block lengths and
- sub-lengths, two bytes in high to low order, then two
- NULLs.
-
- output: Same data with counts out, line feeds/carriage returns
- in (unless suppressed).
-
- writeup: MIR TUTORIAL ONE, topic 9
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
-
- The /s option is used if there is fixed length data
- included in the result. (I would have thought it unlikely, until
- I was handed a nine-track tape containing such data.)
-
- Notice the assumption that the data itself is printable
- ASCII text. If that is not the case and you are working in DOS,
- amend the source code to write to a named binary output file.
-
- An ancestor variation of DEBLOC_B is included with the
- source code. It has not been stylized for "copyleft", nor has it
- been tested recently. The program is P_MARC.C, intended for
- deblocking MARC records. MARC records were common for library
- citation databases. A companion ASCII document, MARC_REC.DOC, is
- also included with the software. It was reverse engineered from a
- customer's data several years ago. Its accuracy is not assured.
-
-
- ≡≡≡≡->> QUESTION:
- If you have access to data in MARC record format, could
- you either furnish a sample, or (better yet) take a run
- at upgrading both the MARC_REC.DOC document and the
- P_MARC.C source code?
- <<-≡≡≡≡
-
-
- * * * * *
-
-
- Apart from the Glossary/Index, this completes MIR
- Tutorial ONE. You have tools and learning materials that should
- equip you to analyze most kinds of data that are likely to be
- indexed for search using normal ASCII search terms... words,
- phrases, numeric values, subject categories, etc.