home *** CD-ROM | disk | FTP | other *** search
- Techie info, or "How Does it Work?"
- -----------------------------------
-
- Don't read this unless you actually care how this program works.
-
- The method is reminiscent of LZSS encoding. I don't know exactly what the
- SS stands for, but this isn't LZSS anyway so I'm just going to call it LZF
- (yes, the 'F' stands for Fehdrau; I can't afford vanity plates for my car
- so this will have to do). Somebody tell me if I'm off-base using L & Z's
- initials in the name. (Lempel & Ziv, I think their names were. Apologies
- to them if my memory fails me.)
-
- There is a control byte for the next 8 elements, which can either be new
- bytes (corresponding bit set to 0) or three-byte references to strings in
- the old program (corresponding bit set to 1). If a string matches only
- one or two characters in the old file, the individual bytes are stored,
- since that way it is either 9 (control bit + byte) or 18 (2 x (control_bit
- + byte)) bits instead of 25 (control bit plus three byte reference).
- Three or more bytes are stored as a reference, since 25 bits is less
- than 27+ (3+ x (control bit + byte)).
-
- The references are to a 4096 byte window into the old file. They are
- stored as 12 bits (hi->lo) of location and 12 bits (hi->lo) of length. A
- reference of 4095,4095 is legal, since access outside of the window is
- allowed, so long as it STARTS in the window. String accesses outside of
- file bounds are not allowed and are cut off if they are found.
-
- The 4096 byte window pointer starts out at the start of the old file. The
- window itself is bounded by pointer-1024 and pointer+3071. However, the
- bounds are modified (never the pointer, just the bounds) so that they are
- never outside of the file bounds, except with files smaller than 4096 bytes
- in which the upper bound is out of the file. The pointer increments by a
- value equal to the length of the longest match if it is between 1 and 32
- bytes. Matches longer than 32 bytes (arbitrary limit, seems to work best)
- move the pointer to the end of that match in the old file. Using this
- method the dearchiving process can also figure out which 4096 bytes of the
- old file are in the window.
-
- "Seems kinda slow." Yeah. Unlike LZSS and LZH, I can't use a binary
- search tree to find the longest match because then I'd have to rebuild the
- tree after every change in the window position. Instead, I had to
- literally search the entire damn buffer. I got a little clever about it,
- though, so that if it, say, had a five-byte match already, it wouldn't
- check the first five bytes of any subsequent string unless the sixth byte
- matched the given string. That's the best I could think of. Better
- ideas, anyone? Anyway, using that method it's only slow on different
- files. A perfect match is really very fast, and close matches are quite
- respectable. If it goes too slow, the .up file is probably not going to
- be very small anyway.
-
- How well does it work? Quite, usually, for me. It's really meant as a
- patch program, not a full upgrade deal. The problem with executables is
- that 9/10 branches are relative, and therefore are changed if anything is
- added or deleted between the branch and its destination, so the relative
- offset changes and causes a string break. Similarly, if your compiler
- uses a register to access a global data area (ie. Manx's use of A4), and
- you add one single variable at the start of the data area, every
- reference to a global variable will be different. Too much of that and
- Update's not going to find many long strings. It's good for, and meant
- for, fixing that nasty division by zero error, unallocated string,
- version number, etc, but not changing 500 instances of printf() to
- fprintf(stderr,). Get it?
-
- By the way, if nothing ever matches, the resultant .up file will be 12.5%
- larger than the original (9 bits per new byte).
-
- For the future: Maybe add Huffman to the code. The control bytes, new
- bytes, and buffer locations should be more or less random. However,
- lengths will likely cluster. But that's another story.
-