- ═══════════════════════════════════
-
- 7. WORKED EXAMPLES...
- FIXED LENGTH RECORDS
-
- ═══════════════════════════════════
-
-
- In the 1950s, virtually all computer records were fixed
- length. They are still in wide use today. Again, we look to you
- for data samples that can be used to explain the various forms in
- which data is held in real files.
-
-
- ═══════════════════════════════════════════════
- 7.1 Recognizing fixed length ASCII text
- ═══════════════════════════════════════════════
-
- Recall that the first step in analysis of a file is a
- byte survey. We have concentrated so far on ASCII text data in
- which all characters are printable. A_BYTES (analyze bytes) tells
- us immediately if we are within the printable subset, which is much
- more manageable than the full set of 256 characters. Many of the
- standard software utilities have unpredictable (read "nasty")
- effects when used on non-printing characters. The MORE command
- produces garbage on a personal computer screen, and often locks up
- a terminal so badly that it has to be turned off to reset.
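-
- A byte survey of the kind described above can be sketched in a few
- lines of Python. This is not the A_BYTES program itself; the
- function name and the shape of the result are our own assumptions,
- chosen only to show what such a survey reports.
-

```python
from collections import Counter


def byte_survey(data: bytes) -> dict:
    """Count every byte value and summarize what kind of data this looks like."""
    counts = Counter(data)
    printable = set(range(0x20, 0x7F))   # space through tilde
    line_ends = {0x0A, 0x0D}             # line feed, carriage return
    return {
        "counts": counts,
        "all_printable": all(b in printable for b in counts),
        "line_feeds": counts.get(0x0A, 0),
        "carriage_returns": counts.get(0x0D, 0),
        # Anything that is neither printable nor a line ending.
        "other": sorted(b for b in counts if b not in printable | line_ends),
    }


survey = byte_survey(b"CL4-097-BVALVE SPRING ASSEMBLY   000057")
print(survey["all_printable"], survey["line_feeds"], survey["carriage_returns"])
```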
-
- Assuming a file contains printable characters only, how
- do we recognize if it contains fixed length data? First, check
- whether the byte survey shows any line feeds or carriage returns.
- Fixed length ASCII text normally contains none. If there are none,
- then use the MORE command to pass part of the file over the screen.
- MORE normally shows 1920 characters at a time (1840 on a terminal).
-
-
- ≡≡≡≡->> QUESTION:
- Why these numbers? And why are the numbers incorrect
- if there are line feeds or carriage returns present?
- <<-≡≡≡≡
-
- In cases where the fixed length is less than a single
- screen display, you should be able to spot recurring patterns. If
- the data is unfielded, watch for blank padding at the end of each
- record. If there are fields, watch for distinctive features such
- as dates, or numeric groups, or fields that are blank repeatedly.
- If the fixed length is a multiple of 80, the patterns will recur
- directly under one another. If it is just over a multiple of 80,
- the patterns will shift progressively to the right.
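-
- Spotting recurring patterns by eye can be backed up with a rough
- measurement. The sketch below is our own illustration, not one of
- the tutorial programs: for each candidate record length it counts
- how often a byte equals the byte exactly one record earlier. The
- function names and the synthetic sample are assumptions.
-

```python
def alignment_score(data: bytes, length: int) -> float:
    """Fraction of bytes that equal the byte exactly `length` positions earlier."""
    if length <= 0 or len(data) <= length:
        return 0.0
    matches = sum(1 for i in range(length, len(data)) if data[i] == data[i - length])
    return matches / (len(data) - length)


def rank_lengths(data: bytes, candidates) -> list:
    """Order candidate record lengths from best to worst alignment."""
    return sorted(candidates, key=lambda n: alignment_score(data, n), reverse=True)


# Synthetic sample (an assumption): 20 blank-padded records of 120 bytes each.
sample = "".join(f"PART{i:05d}".ljust(120) for i in range(20)).encode("ascii")
print(rank_lengths(sample, [80, 120, 128, 132]))
```

- Multiples of the true length score just as well as the length
- itself, so treat the smallest high-scoring candidate as the likely
- answer and confirm it on the screen.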
-
-
- ≡≡≡≡->> QUESTION:
- Each fixed length ASCII record in a file is 343 bytes
- long. How many lines will there be between pattern
- repetitions? Will the pattern appear to shift to the
- right or left? By how many columns each time?
- <<-≡≡≡≡
-
- For historic reasons (length of punch cards, early
- printer widths, etc.), certain fixed lengths are especially
- common... 80, 120, 128 and 132 bytes.
-
- The record length is a multiple of 80 plus the number of
- bytes by which a record spills onto the next line (or minus the
- number of bytes by which the next record intrudes on the last
- line). To test the length that you have calculated, use the
- NEWLINES program on a sample.
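-
- The arithmetic can be written down as a small sketch, assuming an
- 80 column screen. The function names are hypothetical. A positive
- shift means the pattern moved to the right, a negative shift means
- it moved to the left.
-

```python
SCREEN_WIDTH = 80  # assumed display width in columns


def record_length(lines_between: int, shift: int, width: int = SCREEN_WIDTH) -> int:
    """Record length from the observed line gap and column shift of a pattern."""
    return lines_between * width + shift


def expected_drift(length: int, width: int = SCREEN_WIDTH) -> tuple:
    """Inverse check: line gap and signed column shift for a given record length."""
    lines, right = divmod(length, width)
    if right * 2 <= width:
        return lines, right            # pattern recurs shifted to the right
    return lines + 1, right - width    # pattern recurs shifted to the left


print(record_length(1, 40))    # pattern one line down, 40 columns right
print(record_length(2, -28))   # pattern two lines down, 28 columns left
print(expected_drift(132))
```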
-
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
- Usage - newlines file_in file_out bytes_per_line
-
- Inserts carriage return and line feed every "bytes_per_line"
- bytes. Used to deblock data received in line blocks.
-
- input: A file in which data is broken into units in which every
- unit is precisely the same length, and there are no line
- feeds or carriage returns. In the past this was a common
- way of storing text and fixed length records.
-
- output: The same data, expanded by the addition of carriage returns
- and line feeds. In this form, the data can be further
- analyzed and processed using line-oriented programs.
-
- writeup: MIR TUTORIAL ONE, topic 7
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
-
- For example,
-
- NEWLINES SAMPLE SAMPLE.FIX 120
-
- Using MORE on SAMPLE.FIX looks quite different... every second line
- is only half full, and the patterns recur directly under each
- other. If they don't... back to the drawing board! Test alternate
- values until the patterns are aligned vertically.
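-
- For readers without the NEWLINES program, the same operation can be
- sketched in Python. This is an illustration under our own
- assumptions, not the program's source: in particular, we assume
- that a short final chunk is written out with its own line ending.
-

```python
def newlines(data: bytes, bytes_per_line: int) -> bytes:
    """Insert carriage return + line feed after every bytes_per_line bytes."""
    if bytes_per_line <= 0:
        raise ValueError("bytes_per_line must be positive")
    chunks = [data[i:i + bytes_per_line] for i in range(0, len(data), bytes_per_line)]
    # Assumption: a partial last chunk is kept and also gets a line ending.
    return b"".join(chunk + b"\r\n" for chunk in chunks)


# File-level use, with hypothetical file names:
#   with open("SAMPLE", "rb") as src, open("SAMPLE.FIX", "wb") as dst:
#       dst.write(newlines(src.read(), 120))
print(newlines(b"ABCDEFGH", 4))
```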
-
-
-
-
-
- ═════════════════════════
- 7.2 Field layouts
- ═════════════════════════
-
- Here is a fixed length version of the inventory record
- example in the previous topic. This version omits tags and end of
- data markers. Location within the record determines the field.
- The record is displayed 50 bytes per line. A byte counter is added
- below the record.
-
-
- CL4-097-BVALVE SPRING ASSEMBLY         00005700000
- 0016212008004004211003654031032023996
- 00 000100000001297500 000000                 CL4-9
- 97-XDY6-000-P
- 0         1         2         3         4
- 01234567890123456789012345678901234567890123456789
-
- Successive records each have the same fields. The same
- number of bytes is reserved for any given field in every record,
- whether or not the field is empty. A fixed length data record may
- prove cryptic unless we have a precise definition of the allocation
- of bytes. Recall the list of headings in the first version: it
- does not account for all the data. Watch what happens when we
- attempt to assign start and end bytes:
-
-                                                          Start   End
-                                                          byte    byte
-
-   Part Number:                   CL4-097-B               0       8
-   Description:                   VALVE SPRING ASSEMBLY   9       ???
-   Quantity on hand, location 1:  57                      39 ?    44
-   Quantity on hand, location 2:  0                       45 ?    50 ?
-   Quantity on hand, location 3:  16,212                  51 ?    56
-   Quantity on hand, location 4:  8,004                   57 ?    62
-   Usage, this month:             4,211                   63 ?    68
-   Usage, same month last year:   3,654                   69 ?    74
-   Usage, current year to date:   31,032                  75 ?    80
-   Usage, last year to date:      23,996                  81 ?    86
-   Economic order quantity:       1,000                   ???     ???
-   Cost per unit:                 $ 12.975                ???     ???
-   Permitted substitute part #:   CL4-997-X               145     153
-   Permitted substitute part #:   DY6-000-P               154     162
-
- Where there is a single question mark, there is a bit
- of uncertainty. Where there are triple question marks, we are
- really floundering. Is the economic order quantity in single units
- or to one, two or three places after the decimal? What is the
- maximum economic order quantity value? Is the description field 30
- bytes wide, or is there an intervening field which in this record
- is blank?
-
- In the ideal world, whoever provides the data also
- gives us the fixed record definition with full headings and byte
- ranges. Where the definition is not available, we can infer some
- (and perhaps all) of the byte assignments and field titles if we
- have access to the software normally used to display the data. A
- third best is to have hard copy of samples of the same data that is
- in the file. The more samples we have, the more likely we are to
- find the extreme values that fill fields to their full width. For
- example, if "usage, last year to date" displays as 248,377 and we
- find in bytes 81 through 86 the digits 248377 then that field is
- defined with full certainty.
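-
- Once byte ranges are known, pulling the fields out of a record is
- mechanical. The sketch below uses only the ranges from the table
- that carry no triple question marks, and builds a synthetic
- 163 byte record by placing the sample values at those offsets. The
- field names are our own, and the bytes whose layout is unknown
- (description width, order quantity, cost) are simply left blank.
-

```python
# Inclusive, zero-based (start, end) byte ranges taken from the table.
LAYOUT = {
    "part_number":         (0, 8),
    "qty_location_1":      (39, 44),    # marked uncertain in the table
    "qty_location_2":      (45, 50),    # marked uncertain in the table
    "qty_location_3":      (51, 56),
    "qty_location_4":      (57, 62),
    "usage_this_month":    (63, 68),
    "usage_same_month_ly": (69, 74),
    "usage_current_ytd":   (75, 80),
    "usage_last_ytd":      (81, 86),
    "substitute_1":        (145, 153),
    "substitute_2":        (154, 162),
}


def build_sample() -> bytes:
    """Synthetic record: known values at known offsets, blanks elsewhere."""
    record = bytearray(b" " * 163)
    placed = [(0, "CL4-097-B"), (9, "VALVE SPRING ASSEMBLY"),
              (39, "000057"), (45, "000000"), (51, "016212"), (57, "008004"),
              (63, "004211"), (69, "003654"), (75, "031032"), (81, "023996"),
              (145, "CL4-997-X"), (154, "DY6-000-P")]
    for start, text in placed:
        record[start:start + len(text)] = text.encode("ascii")
    return bytes(record)


def extract(record: bytes, layout: dict) -> dict:
    """Slice each field out of the record by its inclusive byte range."""
    return {name: record[start:end + 1].decode("ascii")
            for name, (start, end) in layout.items()}


fields = extract(build_sample(), LAYOUT)
print(fields["part_number"], int(fields["usage_last_ytd"]), fields["substitute_2"])
```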
-
-
- ═════════════════════════════════════
- 7.3 Extracting a single field
- ═════════════════════════════════════
-
- In MIR TUTORIAL TWO, we will look at software that
- converts fixed length ASCII fielded text into tagged text.
-
- ≡≡≡≡->> QUESTION:
- Under what conditions would the tagged version conserve
- resources? When is it wasteful? Which has to be more
- carefully planned... tagged or fixed length? Why?
- What are the advantages and disadvantages of tags
- versus fixed length record definitions?
- <<-≡≡≡≡
-
- For now, here is a quick and dirty way to break out the
- content of a single field. This works for record lengths up to 510
- bytes. (With a carriage return and line feed added, we reach the
- limit for the COLRM program.) The steps:
-
- » Use A_BYTES to check that there are no unprintable
- characters, no line feeds, and no carriage returns.
-
- » Verify that patterns recur at regular intervals.
-
- » Determine as accurately as possible the fixed record
- length.
-
- » Use NEWLINES to break the data into lines.
-
- » Use COLRM to delete all characters after the desired
- field, and all characters before. Watch differences in
- counting conventions; COLRM is based on the UNIX
- utility which numbers the first column "1".
-
- » If the field data appears repetitious, sort the result
- and run it through the A_OCCUR program to get a full
- listing of field content.
-
- This depth of analysis would normally be overkill. But
- there are times when you want to be sure of what you have before
- you go further.
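-
- The steps above can also be sketched in Python, as a stand-in for
- the NEWLINES, COLRM, sort and A_OCCUR sequence rather than a copy
- of those programs. The function names and the tiny sample are
- assumptions. Note that the byte positions here count from 0, unlike
- COLRM, which numbers the first column 1.
-

```python
from collections import Counter


def field_values(data: bytes, record_length: int, start: int, end: int):
    """Yield one field (zero-based, inclusive byte range) from each full record."""
    for offset in range(0, len(data) - record_length + 1, record_length):
        record = data[offset:offset + record_length]
        # Trailing blanks are padding, so strip them before counting.
        yield record[start:end + 1].decode("ascii").rstrip()


def occurrences(values) -> list:
    """Sorted list of (field content, frequency) pairs."""
    return sorted(Counter(values).items())


# Three synthetic 10 byte records: a 4 byte key and a 6 byte color field.
data = b"0001RED   0002BLUE  0003RED   "
print(occurrences(field_values(data, 10, 4, 9)))
```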
-
- As an alternative to the last step in the case where
- the full field is too long or variable, you can still make a
- listing of all the words or terms. Avoid this if there is heavy
- numeric content, because numbers vary far more widely than words;
- the sorting gets slow and the final list is not altogether useful.
- In order to make a list of terms, you convert all blanks in the
- field data to line feeds. In UNIX, the TR utility is ideal:
-
- tr '\040' '\012' < field_data | sort | a_occur > output_list
-
- In Tutorial TWO, we will introduce a DOS program REPLACE1. It
- could be used with a replacement table consisting of one line:
- "\20 \0A". Again sort the result and run it through A_OCCUR to get
- an ordered list with frequencies.
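-
- Where neither TR nor REPLACE1 is at hand, the term listing can be
- sketched in Python. This is our own illustration, with a
- hypothetical function name. One difference to keep in mind: the
- split used here breaks on any run of whitespace and drops empty
- entries, whereas converting each blank to a line feed leaves empty
- lines behind.
-

```python
from collections import Counter


def term_frequencies(field_values) -> list:
    """Break each field entry into terms and return sorted (term, count) pairs."""
    terms = Counter()
    for value in field_values:
        terms.update(value.split())   # split on runs of whitespace
    return sorted(terms.items())


entries = ["VALVE SPRING ASSEMBLY", "VALVE COVER"]
for term, count in term_frequencies(entries):
    print(count, term)
```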
-
- ≡≡≡≡->> QUESTION:
- The UNIX TR utility is exceptionally quick, yet it is
- flexible. I use it heavily in UNIX indexing. A DOS
- equivalent, especially one which is copyleft, would be
- widely welcomed by serious indexers. Would anyone
- care to take this on as a project? (You might consider
- using hexadecimal rather than octal notation for non-
- printing characters.)
- <<-≡≡≡≡
-
-
- ══════════════════════════════════════════════════
- 7.4 Packed numbers in fixed length records
- ══════════════════════════════════════════════════
-
- Suppose we have either display software or hard copy to
- match data in a file that we are examining. Suppose further that
- where numbers are shown, often the matching field bytes contain
- non-printing characters. The byte analysis shows a low to moderate
- frequency set of non-text characters. What's happening?
-
- Chances are, you are dealing with packed numbers.
- COBOL and BASIC each have packing schemes; doubtless there are
- others. COBOL packing works on the basis that four bits can
- express 16 different values. 0 through 9 take up 10 of the 16
- values. Others are used to indicate negative, positive, unsigned
- values, decimal sign, end of packed data. Compression is slightly
- under two to one compared with unpacked decimal data (one byte per
- character).
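-
- To make the idea concrete, here is a sketch of unpacking one such
- field. It assumes the layout commonly used on IBM systems: two
- digits per byte, with the final four bits holding the sign (hex C
- for positive, D for negative, F for unsigned). That layout is an
- assumption on our part; the data in front of you may follow a
- different scheme, and the decimal position is not handled here.
-

```python
def unpack_decimal(packed: bytes) -> int:
    """Decode a packed field: two digits per byte, last nibble is the sign.

    Assumed sign codes: 0xC positive, 0xD negative, 0xF unsigned.
    """
    if not packed:
        raise ValueError("empty packed field")
    nibbles = []
    for byte in packed:
        nibbles.append(byte >> 4)      # high four bits
        nibbles.append(byte & 0x0F)    # low four bits
    *digits, sign = nibbles
    if any(d > 9 for d in digits):
        raise ValueError("non-decimal digit nibble")
    if sign not in (0x0C, 0x0D, 0x0F):
        raise ValueError("unexpected sign nibble")
    value = 0
    for d in digits:
        value = value * 10 + d
    return -value if sign == 0x0D else value


print(unpack_decimal(bytes([0x12, 0x34, 0x5C])))
# The second byte below, 0x0D, has the same value as an ASCII carriage return.
print(unpack_decimal(bytes([0x01, 0x0D])))
```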
-
-
- ≡≡≡≡->> QUESTION:
- Can anyone find evidence that COBOL packed numbers have
- been rigorously standardized? I'm still looking!
- <<-≡≡≡≡
-
- The presence of packed values makes recognition of
- fixed length data more difficult. A byte within a packed value may
- happen to correspond to a line feed or carriage return or to some
- other value that will upset software intended for ASCII text. The
- next topic takes up the problem of analysis under such conditions.
-
- For analysis purposes, we simply need to know that
- specific ranges of bytes within a fixed length record have been
- assigned to packed numeric values. Do not push the analysis any
- further at this point. There is an extra headache that may confuse
- the analysis; typically the data started as EBCDIC and has been
- converted to ASCII. The packed characters have to be switched back
- to EBCDIC for unpacking, while the other fields must be left in
- ASCII. More on this in TUTORIAL TWO.
-
- * * * * *
-
- We have studied methods of recognizing fixed length
- ASCII text and of determining the fixed length. Building a list of
- field layouts from data is possible, particularly if we have
- display software or hard copy of records. It is possible to
- isolate the content of any field and create a summary listing of
- terms or field entries, complete with frequencies. The presence of
- packed information within fixed length data creates hazards for
- software intended for ASCII text only.
-
- In the next topic, we relax all assumptions about the
- character set, and assume that we are dealing with all 256 possible
- combinations of bits within a byte.