- Program CODEBOOK.BAS 1.0, 25 July 1989, by Jim Groeneveld.
- -----------------------------------------------------------------------------
- NIPG TNO - - - - - <work> - - - - -|- <home> - - - - -| GROENEVELD@HDETNO51
- Postbus 124 | Wassenaarseweg 56 | Schoolweg 14 | JIM%RULTNO@HDETNO51
- 2300 AC Leiden | 2333 AL Leiden | 8071 BC Nunspeet | TNOSUR::GROENEVELD
- Nederland (NL) 071-178810 | 03412-60413 | RULTNO::JIM
- -----------------------------------------------------------------------------
-
- PURPOSE
- -------
- This program unformats a fixed-format ASCII data file for use with programs
- such as STATGRAPHICS, using a user-created codebook file. STATGRAPHICS cannot
- read ASCII data files with records longer than 640 bytes. Moreover, preparing
- to read suitable ASCII files is time-consuming and user-unfriendly: the user
- has to create a vector containing the position of each variable in the data
- file. This requires a lot of initial arithmetic, and errors are difficult to
- correct. Besides, all fields are read: the vector only specifies the widths of
- fields that directly follow the previous fields. The sum of all widths cannot
- exceed 640. So using the standard STATGRAPHICS options in this case is clumsy,
- time-consuming and error-prone. This called for the replacement solution
- offered here.
-
- DESCRIPTION
- -----------
- CODEBOOK transforms any fixed-format ASCII file with one record per case, of
- any (unlimited) length, into multiple Blank- or Comma-delimited (or, if
- desired, Fixed-format or Report) data files, each containing a user-specified
- number of variables. Optionally, the first row holds STATGRAPHICS (or
- generally acceptable, e.g. Lotus) variable names. The advantage is that all
- necessary preparation can be done within any editor by creating a codebook
- file that indicates, for each variable to be transformed, among other things
- the field width, the starting and ending columns and the variable name. The
- resulting data files may then be read one after another into STATGRAPHICS (or
- used with any appropriate program). Completely blank fields (mostly
- representing missing values) in the original ASCII file may be replaced
- automatically by any user-specified numerical or character value in the
- resulting output data files. This makes reading ASCII data files much more
- efficient, less error-prone, much quicker, and more logical and easier to
- survey.
-
- (Unique) limit specifications:
- ------------------------------
- 1) maximum number of variables: 32767 (practically unlimited), default: 58
- 2) if too little space was reserved initially when specifying the current
-    maximum number of variables: optional automatic (but rather slow)
-    adaptation to the number of variables (array elements) actually read from
-    the codebook file
- 3) maximum input record length (= number of columns per line): unlimited
- 4) maximum column specification (interpreted bytes/record):
-    32767*255-1 = 8,355,584 (nominally unlimited; in practice limited by the
-    memory available in BASIC)
- 5) number of cases: unlimited (only hardware and software (BASIC) restrictions)
-
- The resulting unformatted data file(s) contain 58 variables by default because
- of the STATGRAPHICS limits of a 640-byte maximum line length and 10-character
- variable names separated by delimiters. STATGRAPHICS, however, allows a
- maximum of 64 variables to be edited at the same time within its data editor,
- and even more (?) within any (STATGRAPHICS) data file.
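-
- The default of 58 follows from the two limits just mentioned. A minimal
- sketch of the arithmetic, written in Python rather than BASIC and assuming
- exactly one delimiter character between adjacent names (the function name is
- illustrative only and does not occur in CODEBOOK.BAS):

```python
def max_names_per_line(max_line_length=640, name_width=10, delimiter_width=1):
    """Largest n such that n names plus (n - 1) delimiters fit on one line."""
    # Solve n * name_width + (n - 1) * delimiter_width <= max_line_length for n.
    return (max_line_length + delimiter_width) // (name_width + delimiter_width)


print(max_names_per_line())  # 58
```

- With 58 names the header line needs 58*10 + 57 = 637 bytes; a 59th name
- would push it to 648 bytes, past the 640-byte limit.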
-
- The output records (generally unformatted values) are preceded by a first line
- with variable names. As many unformatted, blank- or comma-delimited (or
- fixed-format or report) output files are generated as are needed to hold all
- variables, given the number of variables per output file. Each output file is
- named automatically after the original formatted data file, with its sequence
- number as the extension. All output files may be read by STATGRAPHICS.
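-
- The file count and naming rule above can be sketched as follows (Python, not
- the program's own BASIC). The input name SURVEY.DAT is hypothetical, and the
- exact form of the numeric extension (for instance whether it is zero-padded)
- is an assumption, since the text only says the sequence number becomes the
- extension:

```python
import math
from pathlib import PurePath


def output_file_names(data_file, total_variables, variables_per_file=58):
    """Return one output name per group of variables_per_file variables."""
    n_files = math.ceil(total_variables / variables_per_file)
    stem = PurePath(data_file).stem  # file name without its extension
    # Assumption: plain sequence numbers 1, 2, 3, ... as the extension.
    return [f"{stem}.{i}" for i in range(1, n_files + 1)]


print(output_file_names("SURVEY.DAT", 150))
# ['SURVEY.1', 'SURVEY.2', 'SURVEY.3']
```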
-
- USE
- ---
- A codebook file should be created with any editor. Each line describes one
- variable as follows (all data descriptors, except for the first column, MUST
- be separated by COMMAS; only widths and columns may be ended with one or more
- spaces):
- FIRST column  :─┬─ space: numeric or character variable to be output 'as is';
-  (WITHOUT       ├─ "    : character variable to be output within double quotes;
-  ending         ├─ '    : character variable to be output within single quotes;
-  delimiter)     └─ any other character (or empty line): comment line, no action.
- Missing Value :─┬─ any value to replace originally entirely blank fields, may
-  (END with      │  be a character value, optionally enclosed by double quotes;
-  a COMMA)       └─ if left empty it will take the value prompted for when run;
- Starting Column : integer positive value <=8,355,584 (end with comma or spaces);
- Ending Column   : integer positive value <=8,355,584 (end with comma or spaces);
- Field Width   :─┬─ integer positive value <=255, for double checking field
-  (end with comma│  correspondence with Starting and Ending Column;
-  or spaces)     └─ if 0: disable checking with Starting and Ending Column;
- Variable Name :─┬─ any character value up to 255 (!) characters, not quoted;
-  (end with      └─ if omitted a default name, consisting of 'VarX' in which X is
-  comma/space/      the variable number, will be generated; the Variable Names are
-  EOL:CRLF)         inserted as the first line of the output file(s);
- Comment         : optional, may be omitted, not interpreted.
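-
- The line format above can be illustrated with a small parser. This is a
- simplified sketch in Python, not the program's BASIC code: it assumes that
- every field after the first column is separated by a comma (space-terminated
- widths and columns are not handled), that a missing value contains no comma,
- and that the function and key names are free to choose:

```python
def parse_codebook_line(line, var_number, prompted_missing=""):
    """Parse one codebook line; return None for comment or empty lines."""
    if not line.strip() or line[0] not in (" ", '"', "'"):
        return None  # comment line or empty line: no action
    var_type = line[0]  # first column, not followed by a delimiter
    fields = [f.strip() for f in line[1:].split(",")]
    fields += [""] * (5 - len(fields))  # pad so the variable name may be omitted
    # An empty missing value falls back to the value prompted for at run time.
    missing = fields[0].strip('"') or prompted_missing
    start, end, width = (int(f) for f in fields[1:4])
    # A width of 0 disables the cross-check against the column range.
    if width != 0 and width != end - start + 1:
        raise ValueError(
            f"variable {var_number}: width {width} does not match "
            f"columns {start}-{end}"
        )
    name = fields[4].split()[0] if fields[4] else f"Var{var_number}"
    return {"type": var_type, "missing": missing,
            "start": start, "end": end, "name": name}


print(parse_codebook_line(" -9,1,3,3,AGE,age in years", 1))
# {'type': ' ', 'missing': '-9', 'start': 1, 'end': 3, 'name': 'AGE'}
```

- A line such as `"NA,8,13,0` (quoted character variable, width check
- disabled, no name) would yield the default name Var2 when parsed as
- variable number 2.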
-
- Column specifications need not be contiguous or sequential. Columns may be
- skipped or read multiple times (as part of different variables). This allows
- extracting one or more data files, each with a restricted, specified set of
- variables, from the original database file.
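-
- As an illustration of non-contiguous and repeated column specifications, the
- sketch below (Python, with a made-up record and made-up variable
- specifications) extracts fields by 1-based inclusive columns, replaces blank
- fields by the missing value and writes a comma-delimited line. Stripping the
- surrounding blanks from each value is an assumption about what "unformatted"
- output means here:

```python
def extract_record(record, variables, delimiter=","):
    """Build one delimited output line from one fixed-format input record.

    variables: list of (var_type, missing_value, start_column, end_column),
    with 1-based inclusive columns as in the codebook file.
    """
    values = []
    for var_type, missing, start, end in variables:
        value = record[start - 1:end].strip()
        if not value:  # completely blank field
            value = missing
        if var_type in ('"', "'"):  # quoted character variable
            value = var_type + value + var_type
        values.append(value)
    return delimiter.join(values)


record = " 12    JANE  45"
variables = [
    (" ", "-9", 1, 3),    # columns 1-3
    (" ", "-9", 4, 7),    # columns 4-7 are blank: replaced by -9
    ('"', "NA", 8, 13),   # character variable in double quotes
    (" ", "-9", 14, 15),
    (" ", "-9", 3, 3),    # column 3 read again as a separate variable
]
print(extract_record(record, variables))  # 12,-9,"JANE",45,2
```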
-
- MEMORY REQUIREMENTS
- -------------------
- Approximate memory space needed per variable to process:
-     1 byte  VAR.TYPE$ (first column of codebook file)
-  ≈  2 bytes MISSING.VALUE$ (average 2 columns)
-     4 bytes BEGIN.COLUMN! (single precision)
-     4 bytes END.COLUMN! (single precision)
-  ≈ 10 bytes VARIABLE.NAME$ (common max. variable name length)
-  ≈  2 bytes VALUE$ in DATA.LINE$ (average 2 columns)
-  ──────────
-  ≈ 23 bytes (say generally no more than 25) per variable altogether.
- E.g. for 1000 variables this requires a data space in BASIC memory of ≈25 KB.
- So the program's own algorithmic limit of 32767 variables may not be reachable
- because of other limitations. (Maybe compiled BASIC allows for more data
- space.) The same applies to the maximum column specification: reaching it
- would require a data file with at least one line more than 8 MB long, which
- has to fit within memory entirely.
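-
- The estimate above can be reproduced with a few lines of Python. The
- per-variable byte counts are the ones listed in the table; the dictionary and
- function names are illustrative only:

```python
BYTES_PER_VARIABLE = {
    "VAR.TYPE$": 1,
    "MISSING.VALUE$": 2,   # average, per the table above
    "BEGIN.COLUMN!": 4,
    "END.COLUMN!": 4,
    "VARIABLE.NAME$": 10,  # common maximum name length
    "VALUE$": 2,           # average, per the table above
}


def estimated_bytes(n_variables, bytes_per_variable=25):
    """Rough data space needed, using the rounded-up figure of 25 bytes."""
    return n_variables * bytes_per_variable


print(sum(BYTES_PER_VARIABLE.values()))  # 23
print(estimated_bytes(1000))             # 25000
print(estimated_bytes(32767))            # 819175
```

- The last figure, roughly 800 KB for 32767 variables, shows why the
- algorithmic limit is unlikely to be reached in practice.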
-
- Suggestion: if limits are hit while processing a codebook file, break it into
- smaller pieces (of about the number of variables per output file) and run
- CODEBOOK multiple times. This will not take more time in total. Beware of
- duplicate file names! Rename files between runs if necessary.
-
- Suggestion: if limits are hit while processing a database file, break it into
- smaller pieces using COPYFIX (specify record lengths of ≤80 and included,
- synchronized CRLFs) and SEPARATE (specify NO record numbers). Then rerun
- CODEBOOK on each of the files generated by SEPARATE, using the matching part
- of the original codebook file with adapted variables and column
- specifications. Rename the generated files first: use different file names
- and no numeric extensions.
-
- GWBASIC-LINE INPUT
- ------------------
- In GWBASIC a LINE INPUT reads at most 255 characters within ONE line. If 255
- characters are read, the following CRLF has not been encountered yet. The
- next LINE INPUT starts from the point where the previous LINE INPUT left off.
- If more than 255 characters still remain, only 255 are read, leaving the rest
- for the next LINE INPUT. If fewer than 255 characters remain on the SAME line,
- even if only the CRLF remains, they are ALL read, INCLUDING the CRLF, but the
- CRLF is NOT part of the STRING read. The following LINE INPUT then starts with
- the NEXT line. If another BASIC (e.g. BASICA, according to its manual)
- processes LINE INPUT in a different way, the course of this program may be
- unpredictable and erroneous, so BE AWARE of your BASIC version! The number of
- characters read by a LINE INPUT statement may be changed, ONLY IF NECESSARY,
- by redefining the BASIC variable MAX.LINE.INPUT.LENGTH in program line 70.
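-
- The reading behaviour described above can be simulated outside BASIC. The
- Python sketch below is only a model of that description, not GWBASIC itself:
- it splits one physical line (without its CRLF) into the strings that
- successive LINE INPUT statements would return. The function name is
- illustrative; the constant mirrors the BASIC variable MAX.LINE.INPUT.LENGTH:

```python
MAX_LINE_INPUT_LENGTH = 255  # mirrors the BASIC variable MAX.LINE.INPUT.LENGTH


def line_input_chunks(line, max_length=MAX_LINE_INPUT_LENGTH):
    """Return the strings successive LINE INPUTs would read from one line."""
    chunks = []
    position = 0
    while True:
        chunk = line[position:position + max_length]
        chunks.append(chunk)
        position += len(chunk)
        # A read shorter than max_length (possibly empty) has reached and
        # consumed the CRLF, so the next read starts on the next line.
        if len(chunk) < max_length:
            return chunks


print([len(c) for c in line_input_chunks("x" * 600)])  # [255, 255, 90]
print([len(c) for c in line_input_chunks("x" * 255)])  # [255, 0]
```

- Note the second case: a line of exactly 255 characters needs two reads, the
- second of which returns an empty string. A complete record is therefore the
- concatenation of all chunks up to and including the first one shorter than
- 255 characters.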
-
- NOTE: (GW)BASIC performs Garbage Collection (or House Cleaning) regularly.
-