Simtel MSDOS 1992 June

Text File | 1989-07-26 | 8.4 KB | 142 lines

Program CODEBOOK.BAS 1.0, 25 July 1989, by Jim Groeneveld.
-----------------------------------------------------------------------------
NIPG TNO - - - - - <work> - - - - -|- <home> - - - - -| GROENEVELD@HDETNO51
Postbus 124    | Wassenaarseweg 56 | Schoolweg 14     | JIM%RULTNO@HDETNO51
2300 AC Leiden | 2333 AL Leiden    | 8071 BC Nunspeet | TNOSUR::GROENEVELD
Nederland (NL)   071-178810        | 03412-60413      | RULTNO::JIM
-----------------------------------------------------------------------------

PURPOSE
-------
This program unformats a fixed-format ASCII data file for STATGRAPHICS, using
a user-created codebook file.

STATGRAPHICS cannot read ASCII data files with records longer than 640 bytes.
Moreover, preparing it to read suitable ASCII files is time-consuming and
user-unfriendly: the user has to create a vector containing the position of
each variable in the data file. This requires a lot of initial arithmetic, and
errors are difficult to correct. Besides, all fields will be read: the vector
only specifies the widths of fields that directly follow the previous fields.
The sum of all widths cannot exceed 640. Using the standard STATGRAPHICS
options in this case is therefore clumsy, time-consuming and error-prone,
which called for the present replacement.

DESCRIPTION
-----------
CODEBOOK transforms any fixed-format ASCII file with one record per case, of
any (unlimited) length, into multiple Blank or Comma delimited (or, if
desired, Fixed formatted or Report) data files, each containing a
user-specified number of variables, optionally with STATGRAPHICS (or
generally acceptable, e.g. Lotus) variable names on the first row.

The advantage is that all the necessary preparation can be done within any
editor: creating a codebook file that indicates, for each variable to be
transformed, among other things the field width, the starting and ending
columns and the variable name.
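
As a rough illustration of the transformation described above, here is a
minimal Python sketch (not the original BASIC program). The function name
to_delimited and the (name, start, end) tuple layout are assumptions made for
this example only; columns are taken to be 1-based and inclusive, and fields
are copied as read, since the text does not say whether CODEBOOK trims them.

```python
def to_delimited(records, specs, delimiter=","):
    """Turn fixed-format records into delimited lines with a names row.

    specs is a list of (name, start, end) tuples; start and end are
    1-based, inclusive column numbers (an assumption of this sketch).
    """
    # First output line: the variable names.
    lines = [delimiter.join(name for name, _start, _end in specs)]
    for record in records:
        # Python slices are 0-based and end-exclusive, hence start - 1.
        values = [record[start - 1:end] for _name, start, end in specs]
        lines.append(delimiter.join(values))
    return lines


if __name__ == "__main__":
    records = ["001 23M", "002 45F"]
    specs = [("ID", 1, 3), ("AGE", 5, 6), ("SEX", 7, 7)]
    for line in to_delimited(records, specs):
        print(line)
```

Column 4 of the sample records is simply skipped, because no entry in specs
refers to it.
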
The resulting data files may then be read one after another into STATGRAPHICS
(or used with any appropriate program). Completely blank fields (mostly
representing missing values) in the original ASCII file may be replaced
automatically by any user-specified numerical or character value in the
resulting output data files. This makes reading ASCII data files much more
efficient, less error-prone, much quicker, and more logical and easier to
survey.

(Unique) limit specifications:
------------------------------
1) maximum number of variables: 32767 (practically unlimited), default: 58
2) if not enough space was reserved initially when specifying the current
   maximum number of variables: optional automatic (but rather slow)
   adaptation to the actually required number of variables (number of array
   elements) read from the codebook file
3) maximum input record length (= number of columns per line): unlimited
4) maximum column specification (interpreted bytes/record):
   32767*255-1 = 8,355,584 (practically unlimited; in practice limited by
   the available memory in BASIC)
5) number of cases: unlimited (only hardware and software (BASIC)
   restrictions)

The resulting unformatted data file(s) contain 58 variables by default because
of the STATGRAPHICS limits of 640 bytes maximum line length and 10-character
variable names separated by delimiters; STATGRAPHICS, however, allows a
maximum of 64 variables to be edited at the same time within its data editor,
and even more (?) within any (STATGRAPHICS) data file. The output records
(generally unformatted values) are preceded by a first line with variable
names. As many unformatted, blank or comma delimited (or fixed formatted or
report) output files are generated as are needed to hold the total number of
variables, rounded up to a multiple of the number of variables per output
file. They are named automatically after the original formatted data file,
with their sequence number as the extension. All output files may be read by
STATGRAPHICS.
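
The arithmetic behind these limits, the blank-field replacement and the
splitting into output files can be sketched in Python as follows. This is an
illustration under assumptions, not the program's own code: the helper names
fill_blank and plan_outputs are invented here, the default of 58 is derived as
640 bytes divided by a 10-character name plus one delimiter (which is
consistent with the text, though the author's exact reasoning is not given),
and the extension is assumed to be a plain decimal sequence number.

```python
import math
import os

LINE_LIMIT = 640   # STATGRAPHICS maximum line length (from the text)
NAME_WIDTH = 10    # variable name length (from the text)

# Assumed derivation: each name takes 10 characters plus one delimiter.
DEFAULT_VARS_PER_FILE = LINE_LIMIT // (NAME_WIDTH + 1)

# Maximum column specification as stated in the limit list.
MAX_COLUMN = 32767 * 255 - 1


def fill_blank(field, missing):
    """Replace an entirely blank field by the missing value."""
    return missing if field.strip() == "" else field


def plan_outputs(data_file, names, per_file=DEFAULT_VARS_PER_FILE):
    """Map each output file name to the variable names it would hold."""
    stem = os.path.splitext(data_file)[0]
    n_files = math.ceil(len(names) / per_file)
    # Assumption: the extension is the bare sequence number (1, 2, 3, ...).
    return {
        f"{stem}.{i + 1}": names[i * per_file:(i + 1) * per_file]
        for i in range(n_files)
    }


if __name__ == "__main__":
    print(DEFAULT_VARS_PER_FILE, MAX_COLUMN)
    plan = plan_outputs("SURVEY.DAT", [f"V{i}" for i in range(1, 131)])
    for file_name, variables in plan.items():
        print(file_name, len(variables))
```

With 130 variables and 58 per file, the sketch plans three output files, the
last one holding the remaining 14 variables.
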

USE
---
Create a codebook file with any editor, describing one variable per line as
follows (all data descriptors, except for the first column, MUST be separated
by COMMAs; only widths and columns may also be ended by one or more spaces):

FIRST column   :─┬─ space: numeric or character variable to be output 'as is';
(WITHOUT         ├─ "    : character variable to be output within double quotes;
 ending          ├─ '    : character variable to be output within single quotes;
 delimiter)      └─ any other character (or empty line): comment line, no action.
Missing Value  :─┬─ any value to replace originally entirely blank fields, may
(END with        │  be a character value, optionally enclosed by double quotes;
 a COMMA)        └─ if left empty it will take the value prompted for when run;
Starting Column : integer positive value <=8,355,584 (end with comma or spaces);
Ending Column   : integer positive value <=8,355,584 (end with comma or spaces);
Field Width    :─┬─ integer positive value <=255, for double checking field
(end with comma  │  correspondence with Starting and Ending Column;
 or spaces)      └─ if 0: disable checking with Starting and Ending Column;
Variable Name  :─┬─ any character value up to 255 (!) characters, not quoted;
(end with        └─ if omitted a default name, consisting of 'VarX' in which X
 comma/space/       is the variable number, will be generated; the Variable
 /EOL:CRLF)         Names are inserted as the first line of the output file(s);
Comment         : optional, may be omitted, not interpreted.

Column specifications need not be contiguous or sequential. Columns may be
skipped or read multiple times (as part of different variables). This allows
one or more data files, each with a restricted, specified number of
variables, to be extracted from the original database file.

MEMORY REQUIREMENTS
-------------------
Approximate memory space needed per variable to be processed:
     1 byte   VAR.TYPE$       (first column of codebook file)
  ≈  2 bytes  MISSING.VALUE$  (average 2 columns)
     4 bytes  BEGIN.COLUMN!   (single precision)
     4 bytes  END.COLUMN!     (single precision)
  ≈ 10 bytes  VARIABLE.NAME$  (common max. variable name length)
  ≈  2 bytes  VALUE$ in DATA.LINE$ (average 2 columns)
  ──────────
  ≈ 23 bytes  (say generally no more than 25) per variable altogether.

E.g. for 1000 variables this requires a data space in BASIC memory of ≈25 KB.
So the program's own algorithmic limit of 32767 variables may not be
reachable because of other limitations. (Maybe compiled BASIC allows for more
data space.) The same applies to the maximum column specification: this
would assume a data file with at least one line more than 8 MB long, which
has to fit within memory entirely.

Suggestion: if limits are hit while processing a codebook file, break it into
smaller pieces (of about the number of variables per output file) and run
CODEBOOK multiple times. This will not take more time in total. Beware of
duplicate file names! Rename in between if necessary.

Suggestion: if limits are hit while processing a database file, break it
into smaller pieces using COPYFIX (specify record lengths of ≤80 and included
and synchronized CRLFs) and SEPARATE (specify NO record numbers), and rerun
CODEBOOK with adapted parts (variables and column specifications) of the
original codebook file on each of the files generated by SEPARATE (renamed!:
different FileNameS and no numeric extensions).

GWBASIC-LINE INPUT
------------------
In GWBASIC a LINE INPUT reads at most 255 characters within ONE line. If 255
characters have been read, any following CRLF has not been encountered yet.
Any succeeding LINE INPUT will start from the point where the previous LINE
INPUT left off. If more than 255 characters still remain to be read, only 255
will be read, leaving the rest for the next LINE INPUT, if any. If fewer than
255 characters remain on the SAME line, even if only a remaining CRLF, they
are ALL read, INCLUDING the CRLF, but the CRLF is NOT part of the STRING
read. Any following LINE INPUT starts with the NEXT line.
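
The LINE INPUT behaviour described above can be modelled in a few lines of
Python. This is only a simulation of the rules as stated in this text, not
GW-BASIC itself; the functions line_input and read_record, and the idea of
reassembling a long record by repeating the call until a short chunk comes
back, are assumptions of this sketch and not necessarily how CODEBOOK does it.

```python
MAX_LINE_INPUT_LENGTH = 255   # same name as the BASIC variable in the text


def line_input(buf, pos, limit=MAX_LINE_INPUT_LENGTH):
    """Simulate one LINE INPUT on buf starting at pos.

    Returns (chunk, new_pos). buf is a string with CRLF line endings.
    """
    eol = buf.find("\r\n", pos)
    if eol == -1:
        eol = len(buf)                 # last line without a closing CRLF
    if eol - pos >= limit:
        # A full chunk was read; the CRLF has not been encountered yet.
        return buf[pos:pos + limit], pos + limit
    # Fewer than limit characters remain: read them and consume the CRLF,
    # which does not become part of the returned string.
    return buf[pos:eol], min(eol + 2, len(buf))


def read_record(buf, pos, limit=MAX_LINE_INPUT_LENGTH):
    """Reassemble one complete line from successive LINE INPUT chunks."""
    parts = []
    while True:
        chunk, pos = line_input(buf, pos, limit)
        parts.append(chunk)
        if len(chunk) < limit:         # a short chunk means the CRLF was hit
            return "".join(parts), pos


if __name__ == "__main__":
    data = "A" * 600 + "\r\n" + "B" * 255 + "\r\n" + "CCC\r\n"
    position = 0
    while position < len(data):
        record, position = read_record(data, position)
        print(len(record))
```

Note the edge case of a line that is exactly 255 characters long: the first
call returns all 255 characters, and a second call returns an empty string
while consuming the CRLF.
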
If another BASIC (e.g. BASICA, according to its manual) processes LINE INPUT
in a different way, the course of this program may be unpredictable and
erroneous, so BE AWARE of your BASIC version! The number of characters read
by a LINE INPUT statement may be changed, ONLY IF NECESSARY, by redefining the
BASIC variable MAX.LINE.INPUT.LENGTH in program line 70.

NOTE: (GW)BASIC performs Garbage Collection (or House Cleaning) regularly.
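
Returning to the codebook layout described under USE, the following Python
sketch shows how a single codebook line might be interpreted. It is a
simplified illustration, not the program's actual parser: the function name
parse_codebook_line and the returned dictionary are invented here, every
descriptor is assumed to be ended by a comma (the spaces-only endings that the
layout also permits are not handled), and a quoted missing value is assumed
not to contain a comma.

```python
def parse_codebook_line(line, var_number, prompted_missing="-9"):
    """Parse one codebook line; return None for a comment or empty line."""
    if line == "" or line[0] not in " \"'":
        return None                          # comment line, no action
    quote = "" if line[0] == " " else line[0]

    # Up to six comma-separated descriptors follow the first column.
    fields = [f.strip() for f in line[1:].split(",", 5)]
    fields += [""] * (6 - len(fields))
    missing, start, end, width, name, _comment = fields

    if len(missing) >= 2 and missing[0] == missing[-1] == '"':
        missing = missing[1:-1]              # optional double quotes
    if missing == "":
        missing = prompted_missing           # value prompted for when run

    start, end, width = int(start), int(end), int(width)
    if width != 0 and width != end - start + 1:
        raise ValueError(f"width {width} does not match columns {start}-{end}")

    # A name ends at the first space; a default 'VarX' is used if omitted.
    name = name.split()[0] if name else f"Var{var_number}"
    return {"quote": quote, "missing": missing,
            "start": start, "end": end, "name": name}


if __name__ == "__main__":
    print(parse_codebook_line(" -9,1,3,3,AGE,age in years", 1))
    print(parse_codebook_line("* this is a comment line", 2))
    print(parse_codebook_line('",10,19,0', 3))
```

The third example leaves the missing value and the name empty and sets the
width to 0, so the prompted missing value and the default name Var3 are used
and the width check is skipped.
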