home *** CD-ROM | disk | FTP | other *** search
- ARC-FILE.INF contains information extracted from UNARC.INF by
- Robert A. Freed (September, 1986) and from SQSHINFO.DOC by Phil Katz,
- (December, 1986).
-
- Note: In the following discussion, UNARC refers to Robert A. Freed's
- CP/M-80 program for extracting files from MSDOS ARCs. The definitions
- of the ARC file format are based on MSDOS ARC and PKARC/PKXARC
- programs.
-
- ARCHIVE FILE FORMAT
- -------------------
-
- Component files are stored sequentially within an archive. Each
- entry is preceded by a 29-byte header, which contains the directory
- information. There is no wasted space between entries. (This is in
- contrast to the centralized directory used by Novosielski libraries.
- Although random access to subfiles within an archive can be noticeably
- slower than with libraries, archives do have the advantage of not
- requiring pre-allocation of directory space.)
-
- Archive entries are normally maintained in sorted name order. The
- general format of the 29-byte archive header is as follows:
-
- archive mark, header version, file header, file data...
-
- Byte 1: 1A Hex.
-
- This marks the start of an archive header. If this byte is not
- found when expected, UNARC will scan forward in the file (up to
- 64K bytes) in an attempt to find it (followed by a valid
- compression version). If a valid header is found in this
- manner, a warning message is issued and archive file processing
- continues. Otherwise, the file is assumed to be an invalid
- archive and processing is aborted. (This is compatible with
- MS-DOS ARC version 5.12). Note that a special exception is made
- at the beginning of an archive file, to accommodate
- "self-unpacking" archives (see below).
-
- Byte 2: Compression version, as follows:
-
- 0 = End of file marker (remaining bytes not present).
- 1 = Stored (obsolete).
- 2 = Stored - No compression.
- 3 = Packed - (non-repeat packing).
- 4 = Squeezed (Huffman squeezing, after packing)
- 5 = Crunched (Obsolete - 12-bit static LZW without non-repeat pack).
- 6 = Crunched (Obsolete - 12-bit static LZW with non-repeat packing).
- 7 = Crunched (Obsolete - after packing, using faster hash algorithm).
- 8 = Crunched (using dynamic LZW variations, after packing).
- (The initial LZW code size is 9-bits with a
- maximum code size of 12-bits. The possibility of
- adaptive resets are implemented in this mode.)
- 9 = Squashed (The file was compressed with Dynamic LZW
- compression without non-repeat packing. The
- initial LZW code size is 9-bits with a maximum
- maximum code size of 13-bits. The possibility of
- adaptive resets are implemented in this mode.)
-
- Bytes 3-15: ASCII file name, nul-terminated.
-
- (All of the following numeric values are stored low-byte first.)
-
- Bytes 16-19: File size of compressed file, in bytes.
- (The number of bytes in the file data area following the
- header.)
-
- Bytes 20-21: File date, in 16-bit MS-DOS format:
- Bits 15:9 = year - 1980
- Bits 8:5 = month of year
- Bits 4:0 = day of month
- (All zero means no date.)
-
- Bytes 22-23: File time, in 16-bit MS-DOS format:
- Bits 15:11 = hour (24-hour clock)
- Bits 10:5 = minute
- Bits 4:0 = second/2 (not displayed by UNARC)
-
- Bytes 24-25: Cyclic redundancy check (CRC) value (see below).
-
- Bytes 26-29: Original (uncompressed) file length in bytes.
- (This field is not present for version 1 entries, byte 2 = 1.
- I.e., in this case the header is only 25 bytes long. Because
- version 1 files are uncompressed, the value normally found in
- this field may be obtained from bytes 16-19.)
-
- SELF-UNPACKING ARCHIVES
- -----------------------
-
- A "self-unpacking" archive is one which can be renamed to a .COM
- file and executed as a program. An example of such a file is the MS-DOS
- program ARC512.COM, which is a standard archive file preceded by a
- three-byte jump instruction. The first entry in this file is a simple
- "bootstrap" program in uncompressed form, which loads the subfile
- ARC.EXE (also uncompressed) into memory and passes control to it. In
- anticipation of a similar scheme for future distribution of UNARC, the
- program permits up to three bytes to precede the first header in an
- archive file (with no error message).
-
- CRC COMPUTATION
- ---------------
-
- Archive files use a 16-bit cyclic redundancy check (CRC) for error
- control. The particular CRC polynomial used is x^16 + x^15 + x^2 + 1,
- which is commonly known as "CRC-16" and is used in many data
- transmission protocols (e.g. DEC DDCMP and IBM BSC), as well as by most
- floppy disk controllers. Note that this differs from the CCITT
- polynomial (x^16 + x^12 + x^5 + 1), which is used by the XMODEM-CRC
- protocol and the public domain CHEK program (although these do not
- adhere strictly to the CCITT standard). The MS-DOS ARC program does
- perform a mathematically sound and accurate CRC calculation. (We
- mention this because it contrasts with some unfortunately popular public
- domain programs we have witnessed, which from time immemorial have based
- their calculation on an obscure magazine article which contained a
- typographical error!)
-
- Additional note (while we are on the subject of CRC's): The
- validity of using a 16-bit CRC for checking an entire file is somewhat
- questionable. Many people quote the statistics related to these
- functions (e.g. "all two-bit errors, all single burst errors of 16 or
- fewer bits, 99.997% of all single 17-bit burst errors, etc."), without
- realizing that these claims are valid only if the total number of bits
- checked is less than 32767 (which is why they are used in small-packet
- data transmission protocols). I.e., for file sizes in excess of about
- 4K bytes, a 16-bit CRC is not really as good as what is often claimed.
- This is not to say that it is bad, but there are more reliable methods
- available (e.g. the 32-bit AUTODIN-II polynomial). (End of lecture!)
-
- Notes:
-
- For type 1 stored files, the file header is only 23 bytes in size,
- with the length field not present. In this case, the file length is the
- same as the size field since the file is stored without compression.
-
- The first byte of the data area following the header is used to
- indicate the maximum code size, however only a value of 12 (decimal) is
- currently used or accepted by existing ARC programs.
-
- The algorithm used "squashed" compression is identical to type 8
- crunched files with the exception that the maximum code size is 13 bits
- - i.e. an 8K entry LZW table. However, unlike type 8 files, the first
- byte following the file header is actual data, no maximum code size is
- stored.
-
- References
- ----------
-
- Source code for ARC 5.0 by Tom Henderson of Software Enhancement
- Associates, usually found in a file called ARC50SRC.ARC.
-
- Source code for general Ziv-Lempel-Welch routines by Kent Williams,
- found in a file LZX.ARC. Kent Williams work is also referenced in the
- SEA documentation.
-
- Source code and documentation from the Unix COMPRESS utilities, where
- most of the LZW algorithms used by SEA originated, found in a file
- called COMPRESS.ARC.
-
- Ziv, J. and Lempel, A. Compression of individual sequences via
- variable-rate coding. IEEE Trans. Inform. Theory IT-24, 5 (Sept. 1978),
- 530-536.
-
- The IBM DOS Technical Reference Manual, number 6024125.
-