Text File | 1992-06-29 | 13.2 KB | 288 lines

═══════════════════════════════════
7.  WORKED EXAMPLES... FIXED LENGTH RECORDS
═══════════════════════════════════

In the 1950s, virtually all computer records were fixed length.
They are still in wide use today.

Again, we look to you for data samples that can be used to explain
the various forms in which data is held in real files.

═══════════════════════════════════════════════
7.1  Recognizing fixed length ASCII text
═══════════════════════════════════════════════

Recall that the first step in analysis of a file is a byte survey.
We have concentrated so far on ASCII text data in which all
characters are printable. A_BYTES (analyze bytes) tells us
immediately if we are within the printable subset, which is much
more manageable than the full set of 256 characters.

Many of the standard software utilities have unpredictable (read
"nasty") effects when used on non-printing characters. Fed such
data, the MORE command produces garbage on a personal computer
screen, and often locks up a terminal so badly that it has to be
turned off to reset.

Assuming a file contains printable characters only, how do we
recognize whether it contains fixed length data?

First, check whether the byte survey shows any line feeds or
carriage returns. Fixed length ASCII text normally contains none.

If there are none, then use the MORE command to pass part of the
file over the screen. MORE normally shows 1920 characters at a time
(1840 on a terminal).

≡≡≡≡->> QUESTION: Why these numbers? And why are the numbers
        incorrect if there are line feeds or carriage returns
        present? <<-≡≡≡≡

In cases where the fixed length is less than a single screen
display, you should be able to spot recurring patterns. If the data
is unfielded, watch for blank padding at the end of each record. If
there are fields, watch for distinctive features such as dates, or
numeric groups, or fields that are blank repeatedly.

If the fixed length is a multiple of 80, the patterns will recur
directly under one another.
If it is just over a multiple of 80, the patterns will shift
progressively to the right.

≡≡≡≡->> QUESTION: Each fixed length ASCII record in a file is 343
        bytes long. How many lines will there be between pattern
        repetitions? Will the pattern appear to shift to the right
        or left? By how many columns each time? <<-≡≡≡≡

For historic reasons (length of punch cards, early printer widths,
etc.), certain fixed lengths are especially common... 80, 120, 128
and 132 bytes.

On an 80-column display, the record length turns out to be a
multiple of 80 plus the number of extra bytes on the next line (or
minus the number of bytes by which a new record intrudes on the
last line). To test the number that you have calculated, use the
NEWLINES program on a sample.

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Usage - newlines file_in file_out bytes_per_line

Inserts carriage return and line feed every "bytes_per_line" bytes.
Used to deblock data received in line blocks.

input:   A file in which data is broken into units in which every
         unit is precisely the same length, and there are no line
         feeds or carriage returns. In the past this was a common
         way of storing text and fixed length records.

output:  The same data, expanded by the addition of carriage
         returns and line feeds. In this form, the data can be
         further analyzed and processed using line-oriented
         programs.

writeup: MIR TUTORIAL ONE, topic 7
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

For example,

        NEWLINES SAMPLE SAMPLE.FIX 120

Viewed with MORE, SAMPLE.FIX looks quite different... every second
line is only half full, and the patterns recur directly under each
other. If they don't... back to the drawing board! Test alternate
values until the patterns are aligned vertically.

═════════════════════════
7.2  Field layouts
═════════════════════════

Here is a fixed length version of the inventory record example in
the previous topic. This version omits tags and end of data
markers. Location within the record determines the field.
The record is displayed 50 bytes per line. A byte counter is added
below the record.

    CL4-097-BVALVE SPRING ASSEMBLY         00005700000
    0016212008004004211003654031032023996 00
    000100000001297500 000000                    CL4-9
    97-XDY6-000-P
    0         1         2         3         4
    01234567890123456789012345678901234567890123456789

Successive records each have the same fields. The same number of
bytes is reserved for any given field in every record, whether or
not the field is empty.

A fixed length data record may prove cryptic unless we have a
precise definition of the allocation of bytes. Recall the list of
headings in the first version: it does not account for all the
data. Watch what happens when we try to assign start and end bytes:

                                                       Start   End
                                                       byte    byte
    Part Number:                  CL4-097-B               0      8
    Description:                  VALVE SPRING ASSEMBLY   9    ???
    Quantity on hand, location 1: 57                     39 ?   44
    Quantity on hand, location 2: 0                      45 ?   50 ?
    Quantity on hand, location 3: 16,212                 51 ?   56
    Quantity on hand, location 4: 8,004                  57 ?   62
    Usage, this month:            4,211                  63 ?   68
    Usage, same month last year:  3,654                  69 ?   74
    Usage, current year to date:  31,032                 75 ?   80
    Usage, last year to date:     23,996                 81 ?   86
    Economic order quantity:      1,000                 ???    ???
    Cost per unit:                $ 12.975              ???    ???
    Permitted substitute part #:  CL4-997-X             145    153
    Permitted substitute part #:  DY6-000-P             154    162

Where there is a single question mark, there is a bit of
uncertainty. Where there are triple question marks, we are really
floundering. Is the economic order quantity in single units or to
one, two or three places after the decimal? What is the maximum
economic order quantity value? Is the description field 30 bytes
wide, or is there an intervening field which in this record is
blank?

In the ideal world, whoever provides the data also gives us the
fixed record definition with full headings and byte ranges. Where
the definition is not available, we can infer some (and perhaps
all) of the byte assignments and field titles if we have access to
the software normally used to display the data.
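
Once byte ranges are pinned down, even tentatively, pulling the
fields out of a record is mechanical. The following is a minimal
Python sketch, not part of the MIR software. The names LAYOUT,
SAMPLE and slice_fields are hypothetical, the 30-byte description
width is an assumption (the table marks its end "???"), and only
the first 87 bytes of the sample record are used because the
ranges beyond byte 86 are still unknown.

```python
# Hypothetical layout: (field name, start byte, end byte), zero-based and
# inclusive, taken from the table of start and end bytes in 7.2.
# ASSUMPTION: the description is 30 bytes wide (bytes 9-38).
LAYOUT = [
    ("part_number", 0, 8),
    ("description", 9, 38),
    ("qty_loc_1", 39, 44),
    ("qty_loc_2", 45, 50),
    ("qty_loc_3", 51, 56),
    ("qty_loc_4", 57, 62),
    ("usage_month", 63, 68),
    ("usage_month_last_year", 69, 74),
    ("usage_ytd", 75, 80),
    ("usage_last_ytd", 81, 86),
]


def slice_fields(record, layout=LAYOUT):
    """Return {name: text} for each byte range, trailing blanks stripped."""
    return {name: record[start:end + 1].rstrip()
            for name, start, end in layout}


# First 87 bytes of the sample record, rebuilt from the displayed values.
SAMPLE = "".join([
    "CL4-097-B",
    "VALVE SPRING ASSEMBLY".ljust(30),
    "000057", "000000", "016212", "008004",
    "004211", "003654", "031032", "023996",
])

fields = slice_fields(SAMPLE)
print(fields["part_number"], int(fields["qty_loc_3"]))
```

If a slice comes back with a digit from the neighbouring field, or
with a value that disagrees with the display software or hard
copy, the assumed boundary is wrong and the layout needs another
look.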

A third best is to have hard copy of samples of the same data that
is in the file. The more samples we have, the more likely we are to
find the extreme values that fill fields to their full width. For
example, if "usage, last year to date" displays as 248,377 and we
find in bytes 81 through 86 the digits 248377 then that field is
defined with full certainty.

═════════════════════════════════════
7.3  Extracting a single field
═════════════════════════════════════

In MIR TUTORIAL TWO, we will look at software that converts fixed
length ASCII fielded text into tagged text.

≡≡≡≡->> QUESTION: Under what conditions would the tagged version
        conserve resources? When is it wasteful? Which has to be
        more carefully planned... tagged or fixed length? Why?
        What are the advantages and disadvantages of tags versus
        fixed length record definitions? <<-≡≡≡≡

For now, here is a quick and dirty way to break out the content of
a single field. This works for record lengths up to 510 bytes.
(With a carriage return and line feed added, the line reaches 512
bytes, the limit for the COLRM program.) The steps:

  » Use A_BYTES to check that there are no unprintable characters,
    no line feeds, and no carriage returns.

  » Verify that patterns recur at regular intervals.

  » Determine as accurately as possible the fixed record length.

  » Use NEWLINES to break the data into lines.

  » Use COLRM to delete all characters after the desired field,
    and all characters before. Watch differences in counting
    conventions; COLRM is based on the UNIX utility, which numbers
    the first column "1", whereas the byte ranges in 7.2 start
    at 0.

  » If the field data appears repetitious, sort the result and run
    it through the A_OCCUR program to get a full listing of field
    content.

This depth of analysis would normally be overkill. But there are
times when you want to be sure of what you have before you go
further.

As an alternative to the last step, in the case where the full
field is too long or too variable, you can still make a listing of
all the words or terms.
Avoid this if there is heavy numeric content, because numbers vary
far more widely than words; the sorting gets slow and the final
list is not altogether useful.

To make a list of terms, convert all blanks in the field data to
line feeds. In UNIX, the TR utility is ideal:

    TR '\040' '\012' < field_data | SORT | A_OCCUR > output_list

In Tutorial TWO, we will introduce a DOS program REPLACE1. It could
be used with a replacement table consisting of one line:
"\20 \0A". Again sort the result and run it through A_OCCUR to get
an ordered list with frequencies.

≡≡≡≡->> QUESTION: The UNIX TR utility is exceptionally quick, yet
        it is flexible. I use it heavily in UNIX indexing. A DOS
        equivalent, especially one which is copyleft, would be
        widely welcomed by serious indexers. Would anyone care to
        take this on as a project? (You might consider using
        hexadecimal rather than octal notation for non-printing
        characters.) <<-≡≡≡≡

══════════════════════════════════════════════════
7.4  Packed numbers in fixed length records
══════════════════════════════════════════════════

Suppose we have either display software or hard copy to match data
in a file that we are examining. Suppose further that where numbers
are shown, the matching field bytes often contain non-printing
characters. The byte analysis shows a low to moderate frequency set
of non-text characters. What's happening?

Chances are, you are dealing with packed numbers. COBOL and BASIC
each have packing schemes; doubtless there are others.

COBOL packing works on the basis that four bits can express 16
different values. 0 through 9 take up 10 of the 16 values. Others
are used to indicate negative, positive, unsigned values, decimal
sign, end of packed data. Compression is slightly under two to one
compared with unpacked decimal data (one byte per character).

≡≡≡≡->> QUESTION: Can anyone find evidence that COBOL packed
        numbers have been rigorously standardized? I'm still
        looking!
        <<-≡≡≡≡

The presence of packed values makes recognition of fixed length
data more difficult. A byte within a packed value may happen to
correspond to a line feed or carriage return or to some other value
that will upset software intended for ASCII text. The next topic
takes up the problem of analysis under such conditions.

For analysis purposes, we simply need to know that specific ranges
of bytes within a fixed length record have been assigned to packed
numeric values. Do not push the analysis any further at this point.

There is an extra headache that may confuse the analysis: typically
the data started as EBCDIC and has been converted to ASCII. The
packed bytes were translated along with the text, so they have to
be switched back to EBCDIC for unpacking, while the other fields
must be left in ASCII. More on this in TUTORIAL TWO.

                         *   *   *   *   *

We have studied methods of recognizing fixed length ASCII text and
of determining the fixed length. Building a list of field layouts
from data is possible, particularly if we have display software or
hard copy of records. It is possible to isolate the content of any
field and create a summary listing of terms or field entries,
complete with frequencies. The presence of packed information
within fixed length data creates hazards for software intended for
ASCII text only.

In the next topic, we relax all assumptions about the character
set, and assume that we are dealing with all 256 possible
combinations of bits within a byte.
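
For readers who want to try the 7.3 procedure without the MIR
programs, the steps (deblock with NEWLINES, cut columns with COLRM,
then sort and count with A_OCCUR) can be approximated in a few
lines of Python. This is only a sketch under stated assumptions: it
is not the MIR code, the function names deblock, field_counts and
term_counts are hypothetical, the sample data is invented for
illustration, and the output format of A_OCCUR is not reproduced.

```python
from collections import Counter


def deblock(data, record_length):
    """Split unbroken data into records of record_length characters.

    Stands in for NEWLINES. A short trailing piece, if any, becomes
    the last record.
    """
    if record_length <= 0:
        raise ValueError("record_length must be positive")
    return [data[i:i + record_length]
            for i in range(0, len(data), record_length)]


def field_counts(records, first_col, last_col):
    """Count whole-field contents; columns are 1-based and inclusive."""
    return Counter(r[first_col - 1:last_col].rstrip() for r in records)


def term_counts(records, first_col, last_col):
    """Count blank-separated terms within one field."""
    counts = Counter()
    for r in records:
        counts.update(r[first_col - 1:last_col].split())
    return counts


# Invented sample: three 20-byte records, code in columns 1-8,
# description in columns 9-20, no line feeds or carriage returns.
data = "AB-001  VALVE SPRINGAB-002  VALVE SEAT  AB-001  VALVE SPRING"
records = deblock(data, 20)

# Ordered listing with frequencies, one entry per line.
for value, count in sorted(field_counts(records, 1, 8).items()):
    print(count, value)
for term, count in sorted(term_counts(records, 9, 20).items()):
    print(count, term)
```

The column arguments follow the count-from-1 convention mentioned
for COLRM, so a byte range taken from a zero-based layout needs 1
added to both ends. The sketch assumes printable text only; packed
fields should be excluded from the ranges passed in.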