- CHANGES.DOC: How the CSTAR compiler could be improved
-
- June 25, 1991
-
- If I were to do CSTAR all over again I would make several significant
- changes. Although I doubt that anyone would want to revise CSTAR a
- whole lot, the following comments would also apply to speeding up a C
- compiler, so they may be of more general interest.
-
- What follows are terse hints about how to improve CSTAR's performance
- and coding style. I would be glad to discuss any of these ideas at greater
- length with you. Just give me a call.
-
- 1. I would make CSTAR token based, rather than character based. CSTAR
- spends a lot of time scanning identifiers and copying characters from one
- buffer to another. (I've measured.) Tokens would be created for identifiers,
- strings, constants and other entities very early on.
-
- 2. I would make CSTAR a multiple pass compiler. This would
- significantly *increase* its speed because only the first two passes would
- deal with characters. All other passes would deal with tokens. The pass
- structure would closely follow the translation phases described in the ANSI
- C Standard.
-
- Pass one would be a state machine that reads the entire file into memory
- and then gathers statistics about the file. These statistics would be used to
- allocate memory for all strings, identifiers and tokens. This is an important
- point: although this initial pass would seem to be pure overhead, it would
- *greatly* simplify later storage allocation. (No more testing to see if we
- have to allocate more memory.) Storage for all tokens, symbol tables and
- strings (including the spellings of identifiers) would then be allocated *all at
- once.*
-
- Pass two would also be a state machine, similar to the first pass. Pass two
- would simply "fill in" the already allocated storage with all the tokens and
- strings. Symbol tables would be created in a later pass. Later passes would
- handle parsing and output.
-
- By the way, I am not sure that the initial statistics-gathering pass would
- really be a good idea; it is worth looking into. If it doesn't work out, it
- could be abandoned without giving up tokens.
-
- 3. I would distinguish between an id's spelling table and its symbol table.
- The spelling table would be created when the id is first tokenized. It would
- contain information about whether the id could be a reserved word, is
- currently defined as a macro, etc. The symbol table would be used during
- parsing, and would contain a pointer to the spelling table.
-
- The spelling table would have the following format:
-
- attribute byte: reserved-word and macro-expansion bits and
- possibly other bits.
-
- length byte: the length of the identifier.
-
- the spelling of the identifier itself, \0 terminated.
-
- 4. I would have CSTAR look up an id only once, when it is first tokenized.
- All further references to an identifier would be via a pointer to the id's
- spelling table entry which would be part of the id's token. This will save
- *lots* of time.
-
- 5. I would rewrite the semantic routines. (Notice all the bug fixes in
- sem.c.) I tried to do all the semantic actions on the fly, which turned out
- to be a big mistake. It would be much better, and probably *faster*, to use
- three passes for doing semantic analysis. The first pass would build a full
- parse tree for declarations, the second pass would do flow analysis, and a
- third pass would output the program including the Sherlock macros.
- Having a tokenized parse tree helps a lot, since each pass can move back
- and forth in the tree as required.
-
- 6. I would scrap the file copier logic in sysnext. Although this logic
- works, it is *very* slow since it is used on every input character. CSTAR
- should produce its output in a final pass after all tokenizing and parsing is
- complete. That way very little scanning will need to be done.
-
- To summarize these first 6 points, I would use tokens and multiple passes
- instead of a single-pass character-oriented approach. The remaining points
- deal with programming style.
-
- 7. I would use stream routines in the macro expansion logic instead of a
- profusion of buffers. (See the code in def.c) The stream routines would
- hide the details of testing for full buffers and allocating new buffers.
- Although the details of macro expansion and token pasting are inherently
- complex, using streams would have significantly simplified matters.
- (Dealing with tokens instead of characters would also remove some
- complications from the macro expansion code.)
-
- 8. I would use many more header files. This is made possible and
- desirable by several new features of the ANSI C Standard. First, the
- Standard defines many new headers, which I would naturally incorporate
- into the program. Second, having function prototypes is a godsend, so I
- would define a header file for just about every .c file. The Macintosh
- version of CSTAR uses this new style and it has been quite an
- improvement.
-
- I have found two headers to be particularly useful: env.h and ver.h. env.h
- contains #defines that describe the current environment. So any machine-
- dependent code is enclosed in
-
- #ifdef ENV_XXX
- code
- #endif
-
- The ver.h header contains strings describing the current version of the code.
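- For illustration, a minimal env.h/ver.h pair might contain something like
- the following (the macro names here are hypothetical):

```c
/* env.h -- describes the current environment; exactly one
   ENV_XXX macro would be defined per build. */
#define ENV_MSDOS   1

/* ver.h -- strings describing the current version of the code. */
#define VER_NAME    "CSTAR"
#define VER_STRING  "1.4"
#define VER_DATE    "June 25, 1991"
```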
-
- Although my opinions are based on statistics and a great deal of reflection,
- I am not attached to them. If you have any questions, comments, or other
- suggestions, I would be happy to hear from you.
-