Power-Programmierung

home *** CD-ROM | disk | FTP | other *** search

/ Power-Programmierung / CD2.mdf / doc / mir / 14analyz < prev next >

Wrap

Text File | 1992-07-01 | 33.5 KB | 868 lines

════════════════════════════════════════ 4. FIRST STEPS IN DATA ANALYSIS ════════════════════════════════════════ ══════════════════════ 4.1 Objectives ══════════════════════ Let's imagine that you have been assigned an indexing project. You are to prepare an indexed version of a large database, ready for other people to search in it for information. How shall you proceed? In this topic, we deal with first steps in hands-on analysis of machine-readable databases. You may have to start by fending off shortcuts. Some managers and clients want work completed yesterday. The first question you are asked is: "How long is it going to take?" "The data is the same as last time; we want only a few minor changes. This won't take very long, will it?" Refuse any time estimate until your analysis is complete. The ultimate objective is a satisfied user. Shortcuts at the analysis and specification stage rarely (make that "never") serve that objective. The best strategy is to write time estimates once the project is fully specified in writing and the data has been analyzed. Analysis logically comes before data preparation; we can't prepare something for indexing until we know what we have. If we are dealing with familiar data, we still have to watch for data quality problems... inconsistencies or unexpected patterns that are common within large accumulations of data. The analysis stage provides information that is needed to prepare the data. After analyzing the data, we should have enough information so that we will be able to: » extract searchable content, that is, separate out displayable text with punctuation and each search term (word, phrase, numeric value, etc.) intact, at the same time screening out all material that is not part of the intended display; » recognize record separations, that is, identify divisions between units in cases where data is logically divided into units which are meaningful to the searcher (a property in a real estate file, a person in a personnel file, a subsection in a manual of regulations, a heading in a book or magazine article, etc.); » recognize field separations, that is, identify divisions between elements of data that take on different meaning according to where they occur within a record. Examples of fields... purchase order number, street address, city, postal code, quantity, item description, cost per unit, etc. » recognize formatting aids, that is, identify bytes within the data that are intended to control display of data... indents from the margin, shifts to different fonts (e.g., italics), table spacing, etc. This may strike you as simply finding out what is needed to reduce data to formatted printable text. In a sense, it is. But there are some pitfalls along the way. The trick is to be able to recognize the pitfalls early in the indexing process while the costs are still low. It's quite discouraging to index a billion bytes and then find the index list contains large numbers of false or meaningless search terms. ══════════════════════════════════════════════ 4.2 Learn how the data was accumulated ══════════════════════════════════════════════ Ask questions. The data came from somewhere. How was the data put together? Specifically: » Was the same method used consistently to assemble all the data? Or were there changes along the way? (If methods or programs changed, then each subset of data must be separately analyzed.) If possible, find out why the data-gathering method changed. Was the change prompted by a quality control problem, or the introduction of new technology? In either case, check the first part of the new data carefully for errors. Starting up in unfamiliar surroundings can lead to extra errors for a while. If the change was made on account of quality control problems, re-check the portion of data created just before the change. » Was the data scanned from print media? Which scanning technology and software were used? What was the release date of that software? What was the quality of the print media? Was the layout regular? How thoroughly has the result been checked for reliability? » If keypunching was used as a low cost substitute for scanning printed material, were the keypunchers and their supervisors familiar with the subject matter? Were they working in their own language? What methods of verification were used? (For example, keying by a second operator is not always reliable.) How much pressure were the keyers under? Were they being monitored and paid by the kilo-keystroke as has often been the case? People under pressure tend to make more errors than those working "at their own pace". » Has the data been used and updated in ways that would sift out errors? ═══════════════════════════════════════════ 4.3 Learn how the data will be used ═══════════════════════════════════════════ Put yourself in the position of a person searching within this database. Don't settle for educated guesses; find out for whom the data is being indexed and ask questions: What are the needs of prospective searchers? How do they look for information now: manually, using a mainframe computer, not at all? In what ways can the indexing setup add to the value of the data for their needs? (For example, a computerized phone book can be organized one way to create mailing lists, in quite another to optimize the speed of looking up individual names.) What kinds of search term combinations are typically used? Perhaps some forms of search were deemed impractical in the past. If searchers were invited to imagine away all restrictions, what would they like to be able to do? Marketing questions are often outside the interests of technical people. That's unfortunate, since all sorts of technical side trips to nowhere might be avoided by interviewing even a few potential end users. ═══════════════════════════════════════════ 4.4 Access to samples and hard copy ═══════════════════════════════════════════ After acquiring technical background re the data and marketing information on potential users, the key issues are: » access to the data on media that can be handled by your computer. Non-standard media still exist. ("I just received the diskettes from the first volume on 8 inch disks"... letter in February 1992). And there are standard media that may not be useful to you... nine track 6250 BPI tape doesn't feed the average personal computer; 1.2 megabyte floppies aren't worth much on a 360K floppy drive. » ability to extract samples that are representative. This depends in part on the degree of consistency of the data. Pay particular attention to the first several thousand bytes, and the very last several thousand bytes. If the data set is large, extract randomly selected portions in between. If there are known changes in how the data was accumulated, put together samples from each variation. » access to hard copy. If at all possible, get paper printouts that match the samples. The paper version is helpful for identifying format and typesetting codes. The way the hard copy is laid out also tells volumes about the search philosophy used to date. Ask users how well the printed layout served their needs. ════════════════════════════════════ 4.5 Access to software tools ════════════════════════════════════ A variety of programs (software tools) are supplied with each tutorial, in source code and in executable form. Source code has the extension ".C" and the DOS executable normally has the extension ".EXE". There is also the occasional batch file, a series of commands in a simple text file that can be interpreted by DOS. If you have not already done so, include the directory that has the executable version in the series of PATH names in the AUTOEXEC.BAT file on your computer. Each program has a description, or help screen, which is displayed when you input the command (the program name without the extension) followed by /U (for Usage) or ? (for help). Normally we will include the help screen in the tutorial the first time a program is mentioned. ════════════════════════════════════════════════ 4.6 Extracting samples from larger files ════════════════════════════════════════════════ The program CPB (copy bytes) is useful for extracting samples from larger files. Here is its help screen description: ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ Usage cpb byte_count start_byte input_file output_file Copy bytes... Copy any portion of any file to a new file. Start at a specified byte, copy a specified byte count. Standard output may not be used because DOS drops carriage returns and CTL-Z. input: Any file whatsoever. output: Portion of the same file. writeup: MIR TUTORIAL ONE, topic 4 ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ Suppose you wish to separate out the first 10,000 bytes of a file named BIG_DATA into a file called SAMPLE.000. The command would be: CPB 10000 0 BIG_DATA SAMPLE.000 To isolate 7,500 bytes from the middle of a 200,000 byte file, the command takes the form: CPB 7500 100000 BIG_ONE MY_SAMPL CPB (copy bytes) is particularly useful for massive files. Here's how to call out a sample between 4,000 and 5,000 bytes in length from the end of a file that is 248,885,237 bytes long: CPB 5000 248881000 BIGGER.YET BIGGER.END The file BIGGER.END in this case would be 4,237 bytes long. CPB simply stops copying when it reaches the end of the input file. Sampling is one of many uses of CPB. We will use it in a variety of ways further along. If you wish to make a UNIX version of CPB, consider removing the input and output file names and using standard input and output instead. That is safe since UNIX does not quietly mess up on carriage returns and CTL-Z characters. The advantage of standard input and output is the convenience of piping from one program to another without the clutter of intervening files. ═══════════════════════════════════════════ 4.7 Byte surveys - a worked example ═══════════════════════════════════════════ We start the analysis with a byte survey of each sample, or of one file that combines all your data samples. If you choose to combine samples, use the DOS COPY command to join them together, being very sure to use the "/b" binary flag: COPY /B HEADER + BATCH1 + BATCH2 + TAIL BIG_ONE The destination file BIG_ONE has the four source files concatenated together in order: HEADER, BATCH1, BATCH2, TAIL. If you have a series of files named SAMPLE.001, SAMPLE.002, SAMPLE.003, etc., COPY /B SAMPLE.* BIG_ONE Included among the MIR support files supplied with TUTORIAL ONE is a 238,312 byte file SVP_TXT. This file contains English translation of some 17th century correspondence of the French priest and reformer, Vincent de Paul, who founded the "Congregation of the Mission". The Vincentians, generous as always, have given permission to use the correspondence for an example in demonstrating indexing and retrieval methods. The file SVP_TXT is used here to introduce tools and first steps in data analysis. The file's quarter megabyte size is no problem. For example, the program A_BYTES classifies and counts every one of the 238,312 bytes in only four seconds on a slow machine (AT 80286 at 12 Megahertz). The character survey is produced by the program A_BYTES ("analyze bytes")... ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ usage: a_bytes [ /L ] file_name[s] Analyze the bytes (characters) used within any file, report the frequency of each byte present. If the location flag /L is set, include offsets of the first 8 occurrences of each byte pattern present. input: Any file[s] whatsoever. output: file_name.BYT which contains up to 256 lines, one line for each different byte present. The byte is shown first in printable OR octal form, then the hexadecimal equivalent. The third column is frequency. The fourth column shows percentage of total occurrences within the file. If the /L locations option is selected, the output file is name file_name.LOC and the offsets of the first up to 8 occurrences follow at the end of each line. writeup: MIR TUTORIAL ONE, topic 4 ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ A_BYTES SVP_TXT creates a report file SVP_TXT.BYT which is a byte analysis of the entire file.. Here is that report: \012 [0A] 4117 1.7% \015 [0D] 4117 1.7% \032 [1A] 1 0.0% \040 [20] 38883 16.3% ! [21] 173 0.1% ' [27] 87 0.0% ( [28] 38 0.0% ) [29] 38 0.0% , [2C] 2754 1.2% - [2D] 266 0.1% . [2E] 1922 0.8% 0 [30] 258 0.1% 1 [31] 560 0.2% 2 [32] 168 0.1% 3 [33] 117 0.0% 4 [34] 485 0.2% 5 [35] 170 0.1% 6 [36] 771 0.3% 7 [37] 174 0.1% 8 [38] 198 0.1% 9 [39] 85 0.0% : [3A] 87 0.0% ; [3B] 158 0.1% < [3C] 1511 0.6% = [3D] 935 0.4% > [3E] 1511 0.6% ? [3F] 62 0.0% @ [40] 935 0.4% A [41] 658 0.3% B [42] 225 0.1% C [43] 403 0.2% D [44] 962 0.4% E [45] 1357 0.6% F [46] 124 0.1% G [47] 388 0.2% H [48] 501 0.2% I [49] 1715 0.7% J [4A] 90 0.0% K [4B] 5 0.0% L [4C] 519 0.2% M [4D] 1029 0.4% N [4E] 466 0.2% O [4F] 536 0.2% P [50] 476 0.2% Q [51] 9 0.0% R [52] 234 0.1% S [53] 534 0.2% T [54] 2364 1.0% U [55] 161 0.1% V [56] 166 0.1% W [57] 132 0.1% X [58] 841 0.4% Y [59] 106 0.0% Z [5A] 2 0.0% [ [5B] 67 0.0% ] [5D] 67 0.0% ^ [5E] 479 0.2% a [61] 11934 5.0% b [62] 2128 0.9% c [63] 3465 1.5% d [64] 6330 2.7% e [65] 21212 8.9% f [66] 3581 1.5% g [67] 2899 1.2% h [68] 9818 4.1% i [69] 11036 4.6% j [6A] 121 0.1% k [6B] 897 0.4% l [6C] 5929 2.5% m [6D] 4209 1.8% n [6E] 10777 4.5% o [6F] 15224 6.4% p [70] 2305 1.0% q [71] 96 0.0% r [72] 10071 4.2% s [73] 10261 4.3% t [74] 15670 6.6% u [75] 5655 2.4% v [76] 1945 0.8% w [77] 3214 1.3% x [78] 273 0.1% y [79] 3891 1.6% z [7A] 76 0.0% | [7C] 28 0.0% é [82] 43 0.0% â [83] 1 0.0% à [85] 1 0.0% ç [87] 7 0.0% è [8A] 4 0.0% î [8C] 4 0.0% ô [93] 10 0.0% The report (actually a single set of four columns) can be anywhere up to 256 lines long, one line for each possible arrangement among 8 off-and-on bits within one byte. The program deliberately omits a heading line and shows only one byte pattern per line. This enables us to get a clean result when we sort the report: SORT /+10 /R < SVP_TXT.BYT > SVP_TXT.BYS This is the standard DOS SORT routine. The /+10 causes the sort to start at the tenth column, that is, sort by the frequency. The /R makes it a reverse frequency sort. The top end of the output looks like this: \040 [20] 38883 16.3% e [65] 21212 8.9% t [74] 15670 6.6% o [6F] 15224 6.4% a [61] 11934 5.0% i [69] 11036 4.6% n [6E] 10777 4.5% If we include the "locations" flag when analyzing bytes, each line of the output contains more information. Processing is only two thirds as fast. The command would be: A_BYTES -L SVP_TXT The top end of the resulting SVP_TXT.LOC (note the changed report name) looks like this: \012 [0A] 4117 1.7% 31 56 120 154 178 244 309 374 \015 [0D] 4117 1.7% 30 55 119 153 177 243 308 373 \032 [1A] 1 0.0% 238311 \040 [20] 38883 16.3% 6 8 14 22 25 38 40 63 ! [21] 173 0.1% 2098 2477 2800 3527 3671 6419 8389 8395 The report including locations may be sorted in precisely the same manner as the BYT report: SORT /+10 /R < SVP_TXT.LOC > SVP_TXT.LOS If a file already exists with the target name, A_BYTES replaces the last character with a digit. BYT becomes BY0 or BY1, etc., up to BY9. The variations on LOC are LO0 through LO9. One final note: A_BYTES can be run on a series of files with a single command: A_BYTES -L MYFILE HERFILE FILE.WHO WHATEVER ══════════════════════ 4.8 Data types ══════════════════════ With a byte survey in hand, we can begin to answer two questions about a file: » What is in it? » How is it presented? The two questions are answered simultaneously during the analysis. In the next section, we will look at how data is presented. Let's focus on the first question. A file may contain one (or possibly more) of these data types: ASCII text / extended ASCII text text with ASCII markup codes text with binary markup codes text with BCD (binary coded decimal) packed numbers text with BASIC packed numbers text with compression substitutions EBCDIC binary (compression, encryption, GIF graphic interchange files, sound, etc.) ASCII (American Standard Code for Information Interchange) is an agreed-upon assignment of bit patterns to letters, digits, punctuation, control characters, etc. ASCII text consists of the letters of the printable alphabet, numeric digits, punctuation characters and space (hex 20 through 76) plus tab, vertical tab and new page (hex 09, 0B and 0c respectively). The newline character (hex 0A) is normally present (except in line records and some fixed length records... see below). If the file has been processed using a PC, for each newline character, there is normally exactly one carriage return (hex 0D), immediately following. PC ASCII text files often have exactly one hex 1A or CTL-Z as the last character (EOF or end of file marker). Extended ASCII text for our purposes is ASCII text in which accented characters appear within French, German, Spanish and other foreign language words. If we allow also for punctuation common in Spanish, the extra characters are in the ranges hex 80 to 9A and A0 to A8. These are all "high-bit-set" characters, specific to PC compatible ASCII data. ("High-bit-set" means that the leading bit is turned on; hence the value is hex 80 or higher.) The count of these characters when they are part of extended ASCII text is generally lower than the count of regular vowels in the same distribution. Greek and mathematical characters (hex AB, AC, E0 to FD) may also occur in extended ASCII text. We will look later at how to check context in order to verify whether high-bit-set characters are valid text, errors within text or indications of a different file type. Text with ASCII markup codes uses the same characters as in ASCII or extended ASCII text. It is common practice to insert special bytes or series of bytes within data to signal how the data should be displayed. This practice is called markup. Examples of items marked: justify text (left, center, right) select font (courier, etc.), font subsets (italics) select type size (pica, elite, with a count) change to/from bold underline protect against "widows" and "orphans" (ensuring paragraphs stay together) respond to new heading level The first indicator of ASCII markup is that some characters (often "<" and ">") are present in unusually high proportion. Text with binary markup codes may contain virtually all 256 characters. Null bytes (hex 00) do not occur in ASCII text, but are common in files with binary content. If you have Microsoft Word or WordPerfect files handy, try the command A_BYTES on one of them and look at the listing that results. Notice that lower case alphabetic letters still figure prominently in the distribution. Text with packed numbers is found most often in COBOL style fixed length fielded records. Distribution is like that of ASCII text except that a sprinkling of other values shows up. Binary Coded Decimal (BCD) shows up in COBOL records. BASIC language data files have their own variation of packed numbers. Text with compression substitutions has binary series in the midst of normal text. This is notoriously difficult to work with unless you have access to the decompression table. EBCDIC data warrants special treatment the moment it is identified. Work stations or mini computers that receive data on nine track tape often deal with non-ASCII data. (The problem is rare with personal computers.) If there are many '@' symbols and no recognizable text, it is possible that you are working with EBCDIC and not ASCII. (The reason for lots of '@' characters is that the EBCDIC space character is the same as the ASCII '@'.) Try the EBC_ASC program on a portion of the data: ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ usage: ebc_asc ebcdic_input ascii_output Converts an EBCDIC file to ASCII. EBCDIC (Extended Binary Coded Decimal Interchange Code) data is commonly produced by IBM mainframe computers. ASCII (American Standard Code for Information Interchange) is used on personal computers and computers produced by the majority of manufacturers. input: Any EBCDIC file output: ASCII equivalent writeup: MIR TUTORIAL ONE, topic 4 ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ If samples start to make sense when filtered through an EBCDIC to ASCII conversion, process all samples through EBC_ASC before continuing analysis. Later you will need to process all the data through the same conversion. Binary data: Suppose you find in every sample that the data looks completely meaningless. There are many reasons that data dumps appear to be a jumble. Among them are: binary compression encryption binary numeric data graphics content sound files non-ASCII data Attempts to decipher binary data are time consuming and expensive. The process is called reverse engineering. It is not always legal. Go back to asking people questions (the people providing the data, and maybe a lawyer) before getting too deep! ═════════════════════════════ 4.9 Data presentation ═════════════════════════════ A byte survey is also used as part of the analysis of how the data is presented. Main options are: byte stream line records fixed length records blocked records with ASCII lengths blocked records with binary lengths Here we simply describe the options. The topic on deblocking data will describe the programs used. Deblocking, like EBCDIC to ASCII conversion, may have to be carried out before analysis can be completed. One stage of the indexing process will be to create a byte stream equivalent of the file to be indexed. A byte stream is the crudest form of file... simply a series of bytes in the order in which they would be displayed if sent to a screen. The ASCII source code file for each MIR program is a byte stream. Line records are blocks of text or other data, padded with blanks out to some fixed length, very often 80. This form of data storage goes back to punch cards (remember them?) in which as many English (or other natural language) words as possible were fitted into one 80 column card. When the next word would cause overflow, it was placed at the beginning of the next card. A telltale sign of line records is a disproportionately high number of blanks and few if any linefeeds in the byte distribution. Fixed length records also date back to punch cards. One or more cards would be divided into groups of columns, with one group assigned to each field. Sizes of fields are fixed according to the amount of data that the file designer expects... perhaps eight columns/bytes for a purchase order number, 30 for a street address, 10 for a quantity, 55 for an item description, etc. Unused fields and unused spaces within fields are left blank. Text fields are normally left justified (extra spaces at the end), and numeric fields are right justified (leading bytes either blank or zero padded). Line feeds are rare, and would occur only within a long text field. The essence of fixed length records is that location determines meaning. Like line records, fixed length records show a very high frequency of blanks (hex 20). They are likely to have more numeric digits than line records. Alternately, if the records contain Binary Coded Decimal packed numbers, there will be a near random assortment of high-bit-set bytes. Blocked records with ASCII lengths are series of byte streams of variable length, with a measure of length at the front of each block. This length is typically 4 digits with zero padding (for example, 0032 or 3217 or 0539). The byte survey of this data shows high frequencies of digits, especially the digit zero. Line feeds show up less often than normal, or are non-existent. Otherwise, the content is fairly typical of ASCII text. Blocked records with binary lengths have been common in library data (so called "MARC" records) and in the publishing business. They are often more sophisticated than blocked records with ASCII lengths; there may be field sub-lengths within larger blocks. Byte surveys show a small percentage of randomly distributed binary characters. Those that have the high bit set are noticeable. High-bit-off binary bytes are hidden in the frequencies of normal text characters. ══════════════════════════════ 4.10 Byte distributions ══════════════════════════════ Here is the distribution of alphabetical characters in a sample of English text (actually drafts of seven MIR topics). The 79,657 characters were as follows: A 312 B 101 C 288 D 220 E 288 F 152 G 108 H 112 I 406 J 7 K 17 L 175 M 190 N 187 O 292 P 219 Q 17 R 281 S 331 T 447 U 109 V 38 W 117 X 15 Y 81 Z 5 a 6018 b 920 c 2764 d 2775 e 9658 f 1766 g 1482 h 2959 i 5478 j 49 k 378 l 2523 m 1996 n 5194 o 6008 p 1899 q 126 r 5433 s 4975 t 6966 u 2189 v 834 w 1010 x 337 y 1319 z 86 Here is the same data again, arranged from most frequent to least frequent: e 9658 t 6966 a 6018 o 6008 i 5478 r 5433 n 5194 s 4975 h 2959 d 2775 c 2764 l 2523 u 2189 m 1996 p 1899 f 1766 g 1482 y 1319 w 1010 b 920 v 834 T 447 I 406 k 378 x 337 S 331 A 312 O 292 E 288 C 288 R 281 D 220 P 219 M 190 N 187 L 175 F 152 q 126 W 117 H 112 U 109 G 108 B 101 z 86 Y 81 j 49 V 38 Q 17 K 17 X 15 J 7 Z 5 Frequency patterns in English text vary with the subject matter, but never to a large degree. Letters j and z are little used; letters e and t together with the space character almost assuredly account for one quarter to one third of all alphabetic bytes present. Percentages of course are lower when we include the space character and punctuation. Spaces (blanks) lead the frequency distribution in virtually all text files. European languages use the high-bit-set accented characters. In French, the frequency of unaccented letter e drops somewhat in favor of è with grave accent (infrequent) and é with acute accent (common). Spanish uses more of the letter l, the tilde form of ñ and Ñ, and of course the interrogative and exclamatory symbols ¿ and ¡ at the beginning of sentences. The PC high-bit-set accented characters apply as well in Scandinavian languages, German, and so forth. But no language based on the roman character set departs altogether from the basic patterns -- frequent use of vowels, certain consonants highly favored and others marginal across all languages. Treat each departure from typical distributions as a prompt to analyze each exceptional byte in its contexts. To illustrate, let's examine what significance we can attach to the example byte survey of the file SVP_TXT above. Start by making your own complete copy with the command: A_BYTES SVP_TXT What further analysis should we do? » The lower case letters have a typical English text distribution... j, z and q infrequent; e, a, t and o with very high frequency. » Upper case letters are also normally distributed for English text... Q and Z lowest, K very low (rarely the first letter in a sentence), M appearing together with the normally high vowels and T. M is frequent at the beginning of names, and in French greetings (Monsieur, Madame). » There are only 70 characters out of 238,312 that have the high bit set (hex 80 or higher) and all can be verified to be valid accented French characters that might appear in names of people and places in French correspondence. » The line feed (hex 0A) and carriage return (hex 0D) each appear 4117 times, normal in an ASCII text file with 4117 lines. The locations report shows that the first eight occurrences of 0A 0D are paired right together. » The only other non-printing characters are hex 20 (space) and hex 1A (CTL-Z or end of file in DOS). At 16.3% of occurrences, the space is the most frequent character. The CTL-Z occurs once only, in the very last byte of the file. Byte number 238311 is the last. (Offsets count from zero upward and SVP_TXT is a 238312 byte file.) CTL-Z is the standard end of file marker for ASCII text files. » We therefore have cumulative evidence that this is an extended ASCII file. We have not yet established whether it contains markup. » Round and square parentheses probably are matched sets ... 38 of ( and ), 67 of [ and ]. The latter are worth checking whether they occur in normal ways within text. » The sort by frequency reveals that < and > each appear 1511 times... a tip that they might be used as matching angle brackets. The location data shows that the first eight are indeed paired only two or three bytes apart. Together the < and > bytes comprise 1.2% of the file... abnormally high for correspondence, and therefore likely part of some sort of markup code. » The sort by frequency also shows @ and = occurring 935 times each. Are they matched also and part of a code? The first eight appearances are within 7 bytes of each other, with the @ symbol coming first each time. » We should investigate why ^ shows up 479 times and | is present 28 times. Neither is normal for correspondence. The next topic explains how to examine bytes in context and how to survey patterns across an entire file. Following this are topics on analysis of detail within each of the data types.