- There has been a lot of interest in the format of MacWrite documents,
- specifically in terms of the compressed data format for the 3.x and 4.x
- test versions which will support disk-based files. I am hesitant to type
- the entire documentation for 3.x and 4.x file formats here and now, but if
- there is a sufficient uproar, I may be persuaded to do it.
-
- First, a word about MW 2.2. A MACA/WORD file (creator MACA, type WORD)
- whose first two bytes are '00 03' is in 2.2 format. As we all know, 2.2 reads
- the entire document into memory, and you manipulate the document within
- memory. Here is the RECORD format for some global variables stored at the
- beginning of a 2.2 file:
-
- MWGlobals = RECORD
-   VerNum,        { Version number = 3 for MW 2.2 }
-   ParaOffset,    { Pointer to start of first "paragraph" }
-   MainPCount,    { Number of paragraphs in Main document }
-   HeadPCount,    { Number of paragraphs in Header }
-   FootPCount:    { Number of paragraphs in Footer }
-     INTEGER;
-
-   TitlePgF,      { Title page? }
-   ScrapDispF,    { Display scrap? }
-   FootDispF,     { Display footers? }
-   HeadDispF,     { Display headers? }
-   RulerDispF,    { Display rulers? }
-   Unused1:       { Byte for word alignment }
-     BOOLEAN;
-
-   AcDocNum,      { Document that is currently active: }
-                  {   0 = Main   }
-                  {   1 = Header }
-                  {   2 = Footer }
-   StPgNum:       { Starting page number offset }
-     INTEGER;
- END; { MWGlobals }
-
- Every MacWrite formatted file has three sections (Main, header, footer) each
- comprised of paragraphs. A "paragraph" can be either text, a ruler, or a
- picture. The first paragraph is ALWAYS a ruler. So if you are using Fedit,
- and you follow the ParaOffset pointer, you will start at a ruler. The first
- two bytes of a paragraph are an integer indicating what kind of paragraph it
- is (0=ruler, 1=text, 2=picture). The next two bytes are an integer indicating
- the length of the paragraph. If you are only interested in text paragraphs,
- you can follow ParaOffset to the first paragraph, read the paragraph type
- and paragraph length, and skip that many bytes if it is not a text paragraph.
- Then read the next paragraph, and so on. There are MainPCount paragraphs for
- the main, followed by HeadPCount paragraphs for the header, followed by
- FootPCount paragraphs for the footer.
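-
- The walk described above can be sketched in Python. This is only a sketch
- under assumptions that the description does not spell out: integers are
- big-endian (the '00 03' version bytes suggest so), ParaOffset is measured
- from the start of the file, and the paragraph length counts the bytes that
- follow the four-byte type/length header. The names read_sections and KINDS
- are made up for illustration.
-
```python
import struct

# Paragraph type codes given in the text above.
KINDS = {0: "ruler", 1: "text", 2: "picture"}

def read_sections(buf):
    """Split a MW 2.2 file image into main/header/footer paragraph lists."""
    # First five INTEGERs of MWGlobals (assumed big-endian, 16 bits each).
    ver, para_off, n_main, n_head, n_foot = struct.unpack_from(">5h", buf, 0)
    if ver != 3:
        raise ValueError("not a MacWrite 2.2 file")
    sections = {}
    pos = para_off
    for name, count in (("main", n_main), ("header", n_head), ("footer", n_foot)):
        paragraphs = []
        for _ in range(count):
            kind, length = struct.unpack_from(">hh", buf, pos)
            # Assumption: 'length' counts the bytes after this 4-byte header.
            body = bytes(buf[pos + 4:pos + 4 + length])
            paragraphs.append((KINDS.get(kind, "unknown"), body))
            pos += 4 + length
        sections[name] = paragraphs
    return sections
```
-
- Each entry is a (kind, body) pair, so picking out the text paragraphs of
- the main document is a matter of filtering sections["main"] on kind.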
-
- If it is a text paragraph, there would be the two bytes for paragraph type,
- two bytes for paragraph length, and then two bytes for the length of the
- text. After the string of text characters, there is a formatting run, the
- layout of which is rather grody.
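-
- As a rough illustration of that layout, here is a Python sketch that splits
- the body of a 2.2 text paragraph (the bytes after the paragraph type and
- paragraph length) into its text and its formatting run. It assumes a
- big-endian length and Mac Roman text, and it makes no attempt to interpret
- the formatting run. The name split_text_paragraph is hypothetical.
-
```python
import struct

def split_text_paragraph(body):
    """Return (text, format_run) for the body of a MW 2.2 text paragraph."""
    # Two bytes of text length (assumed big-endian, unsigned).
    (text_len,) = struct.unpack_from(">H", body, 0)
    # Assumption: the characters are Mac Roman, as on any Mac of the period.
    text = body[2:2 + text_len].decode("mac_roman")
    # Whatever follows the text is the formatting run, returned untouched.
    format_run = body[2 + text_len:]
    return text, format_run
```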
-
- Enough for MW2.2 format. That should be enough to get people started. I
- encourage people to hack around in Fedit to learn more about it.
-
- A real sticky wicket, a veritable nasty noodle, is MW 3.x and 4.x file
- formats. They can be identified by a two-byte version number equal to
- '00 06'. The format for the globals at the beginning of the file is similar,
- but not identical, to 2.2:
-
- MWGlobals = RECORD
-   VerNum,        { Version number = 6 for MW 3.x and 4.x }
-   MainPCount,    { Number of paragraphs in Main document }
-   HeadPCount,    { Number of paragraphs in Header }
-   FootPCount:    { Number of paragraphs in Footer }
-     INTEGER;
-
-   TitlePgF,      { Title page? }
-   Unused1,       { Byte for word alignment }
-   ScrapDispF,    { Display scrap? }
-   FootDispF,     { Display footers? }
-   HeadDispF,     { Display headers? }
-   RulerDispF:    { Display rulers? }
-     BOOLEAN;
-
-   AcDocNum,      { Document that is currently active: }
-                  {   0 = Main   }
-                  {   1 = Header }
-                  {   2 = Footer }
-   StPgNum:       { Starting page number offset }
-     INTEGER;
- END; { MWGlobals }
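-
- For those who would rather not count bytes by hand, the 3.x/4.x record
- can be unpacked with a Python sketch along these lines. It assumes
- big-endian 16-bit INTEGERs and one byte per BOOLEAN, which is how the
- "byte for word alignment" comment reads but is not stated outright. The
- names Globals, LAYOUT, and read_globals are invented for this example.
-
```python
import struct
from collections import namedtuple

# Field names follow the MW 3.x/4.x MWGlobals record above, in order.
Globals = namedtuple("Globals", [
    "VerNum", "MainPCount", "HeadPCount", "FootPCount",
    "TitlePgF", "Unused1", "ScrapDispF", "FootDispF", "HeadDispF",
    "RulerDispF", "AcDocNum", "StPgNum",
])

# Assumed layout: big-endian, INTEGER = 16 bits, BOOLEAN = 1 byte.
LAYOUT = ">4h6?2h"

def read_globals(buf):
    """Unpack the globals at the start of a MW 3.x/4.x file image."""
    g = Globals._make(struct.unpack_from(LAYOUT, buf, 0))
    if g.VerNum != 6:
        raise ValueError("not a MacWrite 3.x/4.x file")
    return g
```
-
- Under these assumptions the record occupies 18 bytes, and MainPCount lands
- at bytes 0002-0003.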
-
- After the starting page number, there is information for using the "free list",
- which is how MW 3.x and 4.x manipulate pages on disk and in memory. It is not
- at all necessary to understand the "free list" unless you are writing a
- MacWrite clone where you need to be able to swap and edit pages from a
- MacWrite-formatted file. If you only need to be able to read, and perhaps
- display, a MacWrite document, you can pretty much avoid the free list stuff
- altogether.
-
- Reading a paragraph is tricky. There are "document variables" at hex positions
- 00A0 (footer), 00CE (header), and 00FC (main). Bytes 12-15 (decimal) of the
- document variables contain the position of the "information array" for the
- first paragraph; for the main document, that is file positions 0108-010B.
- (There is an information block for each paragraph in the header, footer,
- and main, and each information block is a total of 16 bytes long.) Counting
- the first byte of the information block as byte 0, byte 8 is the status byte
- for that paragraph. Bit number 3 is of special interest, in that
- it tells whether the paragraph is in compressed format (to be discussed
- shortly). Bytes 9-11 contain the location of the paragraph data, and bytes
- 12-13 are the length of the paragraph data. Bytes 0-1 contain the paragraph
- height, which also tells you what kind of paragraph it is: a positive value
- indicates text, a negative value means a picture, and zero indicates a
- ruler. Now follow bytes 9-11 to find the paragraph data. If it is text, the
- first two bytes of the paragraph data will be the length of the text in
- bytes. It is important to note that the length is in bytes, not characters,
- since the characters may be compressed.
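-
- The information block can be picked apart with a short Python sketch.
- It assumes big-endian fields, a signed height, and that "bit number 3"
- counts from the least significant bit (mask 08 hex); none of these is
- stated in the description, so check them against a real file in Fedit.
- The name parse_info_block is hypothetical.
-
```python
import struct

def parse_info_block(block):
    """Decode the documented fields of one 16-byte information block."""
    if len(block) != 16:
        raise ValueError("an information block is 16 bytes long")
    # Bytes 0-1: paragraph height; its sign gives the paragraph kind.
    (height,) = struct.unpack_from(">h", block, 0)
    if height > 0:
        kind = "text"
    elif height < 0:
        kind = "picture"
    else:
        kind = "ruler"
    # Byte 8: status byte. Assumption: bit 3 is the 0x08 bit.
    status = block[8]
    return {
        "kind": kind,
        "height": height,
        "compressed": bool(status & 0x08),
        # Bytes 9-11: three-byte location of the paragraph data.
        "data_pos": int.from_bytes(block[9:12], "big"),
        # Bytes 12-13: length of the paragraph data.
        "data_len": struct.unpack_from(">H", block, 12)[0],
    }
```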
-
- ***Text Compression***
- This is one of the most interesting things about MW 3.x and 4.x. In an
- attempt to save disk space, Apple came up with a scheme to compress ASCII
- text down so that two compressed characters would fit into one byte. It
- works as follows: for any given language (in our case, English), the
- developers of the MacWrite system determine the 15 most common characters in
- that language. Note that for almost every language, the most common character
- is the space character. After the space character, it varies from one language
- to another, based upon statistical analysis. These 15 characters are then
- combined to form a Str255 of length 15. The resource fork of the file holds
- this string as a resource of type STR (id = 700). For the English
- language, it looks like this: ' etnroaisdlhcfp'. Now, when it comes to
- compressing text, all you need is one NIBBLE to represent one of these 15
- characters. The nibbles that are used are 0-E: each character's nibble is
- its position in the string, counting from 0, so space = 0, e = 1, t = 2,
- and so on up to p = E. The nibble F is used to
- indicate that the two nibbles that follow are NOT compressed characters, but
- comprise a complete ASCII character. So for example, the word 'tent' would
- look like '21 32' in hex, and the word 'Tent' would look like 'F5 41 32'
- (capital T is ASCII 54 hex). The
- string 'The tent' would look like 'F5 4B 10 21 32'. Notice immediately that
- we gain a nibble for each compressed character, but we lose one for each
- non-compressed character. In the long run, this technique wins bytes. If,
- however, the text was pathologically bizarre, and it had lots of capital
- letters and punctuation and infrequent letters, then we would be wasting space
- to use up an extra nibble on each uncompressible character. That is why there
- is a compressed bit. The MacWrite program determines whether it will win
- anything by using this compression technique, and acts accordingly.
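-
- Here is a small Python sketch of the decompression, checked against the
- three examples above. The key string is the English one quoted in the
- text; the function names are invented. One thing the description leaves
- open is what fills the last nibble when a paragraph has an odd number of
- nibbles, so this sketch simply drops a trailing F escape that is cut short
- and decodes anything else. Escaped bytes are assumed to be Mac Roman.
-
```python
KEY = " etnroaisdlhcfp"  # English key string quoted above (15 characters)

def decompress(data, key=KEY):
    """Expand nibble-compressed MacWrite text into a string."""
    nibbles = []
    for byte in data:
        nibbles.append(byte >> 4)      # high nibble comes first
        nibbles.append(byte & 0x0F)
    out = []
    i = 0
    while i < len(nibbles):
        n = nibbles[i]
        if n != 0xF:
            out.append(key[n])         # one nibble = index into the key
            i += 1
        elif i + 2 < len(nibbles):
            # F escape: the next two nibbles form one literal byte.
            literal = (nibbles[i + 1] << 4) | nibbles[i + 2]
            out.append(bytes([literal]).decode("mac_roman"))
            i += 3
        else:
            break                      # truncated escape: assumed padding
    return "".join(out)

def compressed_nibbles(text, key=KEY):
    """Count nibbles needed: 1 per key character, 3 per escaped character."""
    return sum(1 if ch in key else 3 for ch in text)
```
-
- For 'The tent', compressed_nibbles gives 10 nibbles, i.e. 5 bytes against 8
- uncompressed, which matches the 'F5 4B 10 21 32' example. A string of
- nothing but capitals would need 3 nibbles per character and come out
- larger, which is the case the compressed bit is there to handle.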
-
- Let's try running through how we can read out the text of the main part of
- a MacWrite document:
-
- 1) Verify version number = 6 (bytes 0000-0001)
-
- 2) MainPCount = bytes 0002-0003
-
- 3) M_IAP = bytes 0108-010B                          Main Info Array Pointer
-    The byte positions are calculated from the Main Document Variable
-    offset of 00FC (hex), plus the offsets of 12-15 (decimal) into the
-    MainDocVars.
-
- 4) Paratype = bytes (M_IAP) through (M_IAP+1)       Paragraph height; its
-                                                     sign gives the type
-
- 5) psb = byte (M_IAP+8)                             Paragraph Status Byte
-
- 6) CompBit = bit 3 of psb                           Compression Bit
-
- 7) pdp = bytes (M_IAP+9) through (M_IAP+11)         Paragraph Data Pointer
-
- 8) para_len = bytes (M_IAP+12) through (M_IAP+13)   Paragraph length
-
- Now, if Paratype is positive, we go to the paragraph itself...
-
- 9) text_len = bytes (pdp) through (pdp+1)
-
- 10) proceed to read text_len number of bytes, and decompress using the
-     scheme described above (if compressed bit is set)
-
- 11) for the remaining paragraphs, the next information block should sit 16
-     bytes further on; repeat steps 4-10 until MainPCount blocks are done
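-
- Put together, the steps above come out roughly as the Python sketch below.
- Treat it as an illustration rather than a tested reader: it assumes
- big-endian fields, that bit 3 of the status byte is the 08 hex bit, that
- the information blocks for successive paragraphs sit back to back 16 bytes
- apart, and that text is Mac Roman. The function names are invented, and
- the free list is ignored entirely.
-
```python
import struct

KEY = " etnroaisdlhcfp"  # English key string quoted above

def decompress(data, key=KEY):
    """Expand nibble-compressed text (F escapes a literal byte)."""
    nibbles = [n for b in data for n in (b >> 4, b & 0x0F)]
    out, i = [], 0
    while i < len(nibbles):
        if nibbles[i] != 0xF:
            out.append(key[nibbles[i]])
            i += 1
        elif i + 2 < len(nibbles):
            literal = (nibbles[i + 1] << 4) | nibbles[i + 2]
            out.append(bytes([literal]).decode("mac_roman"))
            i += 3
        else:
            break  # truncated escape at the end: assumed to be padding
    return "".join(out)

def read_main_text(buf):
    """Return the text paragraphs of the main document of a 3.x/4.x file."""
    if struct.unpack_from(">h", buf, 0)[0] != 6:            # step 1
        raise ValueError("not a MacWrite 3.x/4.x file")
    main_p_count = struct.unpack_from(">h", buf, 2)[0]      # step 2
    m_iap = struct.unpack_from(">I", buf, 0x0108)[0]        # step 3
    texts = []
    for n in range(main_p_count):
        base = m_iap + 16 * n          # assumption: blocks are contiguous
        paratype = struct.unpack_from(">h", buf, base)[0]   # step 4
        if paratype <= 0:
            continue                   # ruler or picture, not text
        psb = buf[base + 8]                                 # step 5
        comp_bit = bool(psb & 0x08)                         # step 6
        pdp = int.from_bytes(buf[base + 9:base + 12], "big")  # step 7
        text_len = struct.unpack_from(">H", buf, pdp)[0]    # step 9
        raw = bytes(buf[pdp + 2:pdp + 2 + text_len])        # step 10
        texts.append(decompress(raw) if comp_bit else raw.decode("mac_roman"))
    return texts
```
-
- Reading the header or footer text would presumably work the same way,
- starting from the document variables at 00CE or 00A0 and using HeadPCount
- or FootPCount in place of MainPCount.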
-
-
- The information described above is excerpted from documentation furnished by:
- Encore Systems
- 20823 Stevens Creek Blvd. C1-B
- Cupertino, California 95014
- (408)446-9565
-
- We can vouch for the above description, since we are currently involved with
- developing software which will interact with MacWrite files. If you have
- any questions about details of MacWrite file format, or about our project here
- at Cornell, write or call us. We may respond individually, or via the net.
-
- Kate MacGregor and Douglas Young
- Decentralized Computer Services
- 401 Uris Hall
- Cornell University
- Ithaca, NY 14853
- (607)256-4981
-
- Doug Young's electronic address is DMYJARTJ@CORNELLA.BITNET
- -------
-