home *** CD-ROM | disk | FTP | other *** search
- OPAA
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- LET'S BUILD A COMPILER!
-
- By
-
- Jack W. Crenshaw, Ph.D.
-
- 16 April 1989
-
-
- Part IX: A TOP VIEW
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- PAA
-
-
-
-
-
- *****************************************************************
- * *
- * COPYRIGHT NOTICE *
- * *
- * Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. *
- * *
- *****************************************************************
-
-
- INTRODUCTION
-
- In the previous installments, we have learned many of the
- techniques required to build a full-blown compiler. We've done
- both assignment statements (with Boolean and arithmetic
- expressions), relational operators, and control constructs. We
- still haven't addressed procedure or function calls, but even so
- we could conceivably construct a mini-language without them.
- I've always thought it would be fun to see just how small a
- language one could build that would still be useful. We're
- ALMOST in a position to do that now. The problem is: though we
- know how to parse and translate the constructs, we still don't
- know quite how to put them all together into a language.
-
- In those earlier installments, the development of our programs
- had a decidedly bottom-up flavor. In the case of expression
- parsing, for example, we began with the very lowest level
- constructs, the individual constants and variables, and worked
- our way up to more complex expressions.
-
- Most people regard the top-down design approach as being better
- than the bottom-up one. I do too, but the way we did it
- certainly seemed natural enough for the kinds of things we were
- parsing.
-
- You mustn't get the idea, though, that the incremental approach
- that we've been using in all these tutorials is inherently
- bottom-up. In this installment I'd like to show you that the
- approach can work just as well when applied from the top down ...
- maybe better. We'll consider languages such as C and Pascal, and
- see how complete compilers can be built starting from the top.
-
- In the next installment, we'll apply the same technique to build
- a complete translator for a subset of the KISS language, which
- I'll be calling TINY. But one of my goals for this series is
- that you will not only be able to see how a compiler for TINY or
- KISS works, but that you will also be able to design and build
- compilers for your own languages. The C and Pascal examples will
- help. One thing I'd like you to see is that the natural
- structure of the compiler depends very much on the language being
- translated, so the simplicity and ease of construction of the
- compiler depends very much on letting the language set the
- program structure.ABAB
- - 2 -A*A*
- PAA
-
-
-
-
-
- It's a bit much to produce a full C or Pascal compiler here, and
- we won't try. But we can flesh out the top levels far enough so
- that you can see how it goes.
-
- Let's get started.
-
-
- THE TOP LEVEL
-
- One of the biggest mistakes people make in a top-down design is
- failing to start at the true top. They think they know what the
- overall structure of the design should be, so they go ahead and
- write it down.
-
- Whenever I start a new design, I always like to do it at the
- absolute beginning. In program design language (PDL), this top
- level looks something like:
-
-
- begin
- solve the problem
- end
-
-
- OK, I grant you that this doesn't give much of a hint as to what
- the next level is, but I like to write it down anyway, just to
- give me that warm feeling that I am indeed starting at the top.
-
- For our problem, the overall function of a compiler is to compile
- a complete program. Any definition of the language, written in
- BNF, begins here. What does the top level BNF look like? Well,
- that depends quite a bit on the language to be translated. Let's
- take a look at Pascal.
-
-
- THE STRUCTURE OF PASCAL
-
- Most texts for Pascal include a BNF or "railroad-track"
- definition of the language. Here are the first few lines of one:
-
-
- <program> ::= <program-header> <block> '.'
-
- <program-header> ::= PROGRAM <ident>
-
- <block> ::= <declarations> <statements>
-
-
- We can write recognizers to deal with each of these elements,
- just as we've done before. For each one, we'll use our familiar
- single-character tokens to represent the input, then flesh things
- out a little at a time. Let's begin with the first recognizer:
- the program itself.A6A6
- - 3 -A*A*
- PAA
-
-
-
-
-
- To translate this, we'll start with a fresh copy of the Cradle.
- Since we're back to single-character names, we'll just use a 'p'
- to stand for 'PROGRAM.'
-
- To a fresh copy of the cradle, add the following code, and insert
- a call to it from the main program:
-
-
- {--------------------------------------------------------------}
- { Parse and Translate A Program }
-
- procedure Prog;
- var Name: char;
- begin
- Match('p'); { Handles program header part }
- Name := GetName;
- Prolog(Name);
- Match('.');
- Epilog(Name);
- end;
- {--------------------------------------------------------------}
-
-
- The procedures Prolog and Epilog perform whatever is required to
- let the program interface with the operating system, so that it
- can execute as a program. Needless to say, this part will be
- VERY OS-dependent. Remember, I've been emitting code for a 68000
- running under the OS I use, which is SK*DOS. I realize most of
- you are using PC's and would rather see something else, but I'm
- in this thing too deep to change now!
-
- Anyhow, SK*DOS is a particularly easy OS to interface to. Here
- is the code for Prolog and Epilog:
-
-
- {--------------------------------------------------------------}
- { Write the Prolog }
-
- procedure Prolog;
- begin
- EmitLn('WARMST EQU $A01E');
- end;
-
-
- {--------------------------------------------------------------}
- { Write the Epilog }
-
- procedure Epilog(Name: char);
- begin
- EmitLn('DC WARMST');
- EmitLn('END ' + Name);
- end;
- {--------------------------------------------------------------}A6A6
- - 4 -A*A*
- PAA
-
-
-
-
-
- As usual, add this code and try out the "compiler." At this
- point, there is only one legal input:
-
-
- px. (where x is any single letter, the program name)
-
-
- Well, as usual our first effort is rather unimpressive, but by
- now I'm sure you know that things will get more interesting.
- There is one important thing to note: THE OUTPUT IS A WORKING,
- COMPLETE, AND EXECUTABLE PROGRAM (at least after it's assembled).
-
- This is very important. The nice feature of the top-down
- approach is that at any stage you can compile a subset of the
- complete language and get a program that will run on the target
- machine. From here on, then, we need only add features by
- fleshing out the language constructs. It's all very similar to
- what we've been doing all along, except that we're approaching it
- from the other end.
-
-
- FLESHING IT OUT
-
- To flesh out the compiler, we only have to deal with language
- features one by one. I like to start with a stub procedure that
- does nothing, then add detail in incremental fashion. Let's
- begin by processing a block, in accordance with its PDL above.
- We can do this in two stages. First, add the null procedure:
-
-
- {--------------------------------------------------------------}
- { Parse and Translate a Pascal Block }
-
- procedure DoBlock(Name: char);
- begin
- end;
- {--------------------------------------------------------------}
-
-
- and modify Prog to read:
-
-
- {--------------------------------------------------------------}
- { Parse and Translate A Program }
-
- procedure Prog;
- var Name: char;
- begin
- Match('p');
- Name := GetName;
- Prolog;
- DoBlock(Name);
- Match('.');
- Epilog(Name);A*A*
- - 5 -
- PAA
-
-
-
-
-
- end;
- {--------------------------------------------------------------}
-
-
- That certainly shouldn't change the behavior of the program, and
- it doesn't. But now the definition of Prog is complete, and we
- can proceed to flesh out DoBlock. That's done right from its BNF
- definition:
-
-
- {--------------------------------------------------------------}
- { Parse and Translate a Pascal Block }
-
- procedure DoBlock(Name: char);
- begin
- Declarations;
- PostLabel(Name);
- Statements;
- end;
- {--------------------------------------------------------------}
-
-
- The procedure PostLabel was defined in the installment on
- branches. Copy it into your cradle.
-
- I probably need to explain the reason for inserting the label
- where I have. It has to do with the operation of SK*DOS. Unlike
- some OS's, SK*DOS allows the entry point to the main program to
- be anywhere in the program. All you have to do is to give that
- point a name. The call to PostLabel puts that name just before
- the first executable statement in the main program. How does
- SK*DOS know which of the many labels is the entry point, you ask?
- It's the one that matches the END statement at the end of the
- program.
-
- OK, now we need stubs for the procedures Declarations and
- Statements. Make them null procedures as we did before.
-
- Does the program still run the same? Then we can move on to the
- next stage.
-
-
- DECLARATIONS
-
- The BNF for Pascal declarations is:
-
-
- <declarations> ::= ( <label list> |
- <constant list> |
- <type list> |
- <variable list> |
- <procedure> |
- <function> )*A6A6
- - 6 -A*A*
- PAA
-
-
-
-
-
- (Note that I'm using the more liberal definition used by Turbo
- Pascal. In the standard Pascal definition, each of these parts
- must be in a specific order relative to the rest.)
-
- As usual, let's let a single character represent each of these
- declaration types. The new form of Declarations is:
-
-
- {--------------------------------------------------------------}
- { Parse and Translate the Declaration Part }
-
- procedure Declarations;
- begin
- while Look in ['l', 'c', 't', 'v', 'p', 'f'] do
- case Look of
- 'l': Labels;
- 'c': Constants;
- 't': Types;
- 'v': Variables;
- 'p': DoProcedure;
- 'f': DoFunction;
- end;
- end;
- {--------------------------------------------------------------}
-
-
- Of course, we need stub procedures for each of these declaration
- types. This time, they can't quite be null procedures, since
- otherwise we'll end up with an infinite While loop. At the very
- least, each recognizer must eat the character that invokes it.
- Insert the following procedures:
-
-
- {--------------------------------------------------------------}
- { Process Label Statement }
-
- procedure Labels;
- begin
- Match('l');
- end;
-
-
- {--------------------------------------------------------------}
- { Process Const Statement }
-
- procedure Constants;
- begin
- Match('c');
- end;
-
-
- {--------------------------------------------------------------}
- { Process Type Statement }A6A6
- - 7 -A*A*
- PAA
-
-
-
-
-
- procedure Types;
- begin
- Match('t');
- end;
-
-
- {--------------------------------------------------------------}
- { Process Var Statement }
-
- procedure Variables;
- begin
- Match('v');
- end;
-
-
- {--------------------------------------------------------------}
- { Process Procedure Definition }
-
- procedure DoProcedure;
- begin
- Match('p');
- end;
-
-
- {--------------------------------------------------------------}
- { Process Function Definition }
-
- procedure DoFunction;
- begin
- Match('f');
- end;
- {--------------------------------------------------------------}
-
-
- Now try out the compiler with a few representative inputs. You
- can mix the declarations any way you like, as long as the last
- character in the program is'.' to indicate the end of the
- program. Of course, none of the declarations actually declare
- anything, so you don't need (and can't use) any characters other
- than those standing for the keywords.
-
- We can flesh out the statement part in a similar way. The BNF
- for it is:
-
-
- <statements> ::= <compound statement>
-
- <compound statement> ::= BEGIN <statement>
- (';' <statement>) END
-
-
- Note that statements can begin with any identifier except END.
- So the first stub form of procedure Statements is:A6A6
- - 8 -A*A*
- PAA
-
-
-
-
-
- {--------------------------------------------------------------}
- { Parse and Translate the Statement Part }
-
- procedure Statements;
- begin
- Match('b');
- while Look <> 'e' do
- GetChar;
- Match('e');
- end;
- {--------------------------------------------------------------}
-
-
- At this point the compiler will accept any number of
- declarations, followed by the BEGIN block of the main program.
- This block itself can contain any characters at all (except an
- END), but it must be present.
-
- The simplest form of input is now
-
- 'pxbe.'
-
- Try it. Also try some combinations of this. Make some
- deliberate errors and see what happens.
-
- At this point you should be beginning to see the drill. We begin
- with a stub translator to process a program, then we flesh out
- each procedure in turn, based upon its BNF definition. Just as
- the lower-level BNF definitions add detail and elaborate upon the
- higher-level ones, the lower-level recognizers will parse more
- detail of the input program. When the last stub has been
- expanded, the compiler will be complete. That's top-down
- design/implementation in its purest form.
-
- You might note that even though we've been adding procedures, the
- output of the program hasn't changed. That's as it should be.
- At these top levels there is no emitted code required. The
- recognizers are functioning as just that: recognizers. They are
- accepting input sentences, catching bad ones, and channeling good
- input to the right places, so they are doing their job. If we
- were to pursue this a bit longer, code would start to appear.
-
- The next step in our expansion should probably be procedure
- Statements. The Pascal definition is:
-
-
- <statement> ::= <simple statement> | <structured statement>
-
- <simple statement> ::= <assignment> | <procedure call> | null
-
- <structured statement> ::= <compound statement> |
- <if statement> |
- <case statement> |
- <while statement> |A*A*
- - 9 -
- PAA
-
-
-
-
-
- <repeat statement> |
- <for statement> |
- <with statement>
-
-
- These are starting to look familiar. As a matter of fact, you
- have already gone through the process of parsing and generating
- code for both assignment statements and control structures. This
- is where the top level meets our bottom-up approach of previous
- sessions. The constructs will be a little different from those
- we've been using for KISS, but the differences are nothing you
- can't handle.
-
- I think you can get the picture now as to the procedure. We
- begin with a complete BNF description of the language. Starting
- at the top level, we code up the recognizer for that BNF
- statement, using stubs for the next-level recognizers. Then we
- flesh those lower-level statements out one by one.
-
- As it happens, the definition of Pascal is very compatible with
- the use of BNF, and BNF descriptions of the language abound.
- Armed with such a description, you will find it fairly
- straightforward to continue the process we've begun.
-
- You might have a go at fleshing a few of these constructs out,
- just to get a feel for it. I don't expect you to be able to
- complete a Pascal compiler here ... there are too many things
- such as procedures and types that we haven't addressed yet ...
- but it might be helpful to try some of the more familiar ones.
- It will do you good to see executable programs coming out the
- other end.
-
- If I'm going to address those issues that we haven't covered yet,
- I'd rather do it in the context of KISS. We're not trying to
- build a complete Pascal compiler just yet, so I'm going to stop
- the expansion of Pascal here. Let's take a look at a very
- different language.
-
-
- THE STRUCTURE OF C
-
- The C language is quite another matter, as you'll see. Texts on
- C rarely include a BNF definition of the language. Probably
- that's because the language is quite hard to write BNF for.
-
- One reason I'm showing you these structures now is so that I can
- impress upon you these two facts:
-
- (1) The definition of the language drives the structure of the
- compiler. What works for one language may be a disaster for
- another. It's a very bad idea to try to force a given
- structure upon the compiler. Rather, you should let the BNF
- drive the structure, as we have done here.A6A6
- - 10 -A*A*
- PAA
-
-
-
-
-
- (2) A language that is hard to write BNF for will probably be
- hard to write a compiler for, as well. C is a popular
- language, and it has a reputation for letting you do
- virtually anything that is possible to do. Despite the
- success of Small C, C is _NOT_ an easy language to parse.
-
-
- A C program has less structure than its Pascal counterpart. At
- the top level, everything in C is a static declaration, either of
- data or of a function. We can capture this thought like this:
-
-
- <program> ::= ( <global declaration> )*
-
- <global declaration> ::= <data declaration> |
- <function>
-
- In Small C, functions can only have the default type int, which
- is not declared. This makes the input easy to parse: the first
- token is either "int," "char," or the name of a function. In
- Small C, the preprocessor commands are also processed by the
- compiler proper, so the syntax becomes:
-
-
- <global declaration> ::= '#' <preprocessor command> |
- 'int' <data list> |
- 'char' <data list> |
- <ident> <function body> |
-
-
- Although we're really more interested in full C here, I'll show
- you the code corresponding to this top-level structure for Small
- C.
-
-
- {--------------------------------------------------------------}
- { Parse and Translate A Program }
-
- procedure Prog;
- begin
- while Look <> ^Z do begin
- case Look of
- '#': PreProc;
- 'i': IntDecl;
- 'c': CharDecl;
- else DoFunction(Int);
- end;
- end;
- end;
- {--------------------------------------------------------------}
-
- Note that I've had to use a ^Z to indicate the end of the source.
- C has no keyword such as END or the '.' to otherwise indicate the
- end.A*A*
- - 11 -
- PAA
-
-
-
-
-
- With full C, things aren't even this easy. The problem comes
- about because in full C, functions can also have types. So when
- the compiler sees a keyword like "int," it still doesn't know
- whether to expect a data declaration or a function definition.
- Things get more complicated since the next token may not be a
- name ... it may start with an '*' or '(', or combinations of the
- two.
-
- More specifically, the BNF for full C begins with:
-
-
- <program> ::= ( <top-level decl> )*
-
- <top-level decl> ::= <function def> | <data decl>
-
- <data decl> ::= [<class>] <type> <decl-list>
-
- <function def> ::= [<class>] [<type>] <function decl>
-
-
- You can now see the problem: The first two parts of the
- declarations for data and functions can be the same. Because of
- the ambiguity in the grammar as written above, it's not a
- suitable grammar for a recursive-descent parser. Can we
- transform it into one that is suitable? Yes, with a little work.
- Suppose we write it this way:
-
-
- <top-level decl> ::= [<class>] <decl>
-
- <decl> ::= <type> <typed decl> | <function decl>
-
- <typed decl> ::= <data list> | <function decl>
-
-
- We can build a parsing routine for the class and type
- definitions, and have them store away their findings and go on,
- without their ever having to "know" whether a function or a data
- declaration is being processed.
-
- To begin, key in the following version of the main program:
-
-
- {--------------------------------------------------------------}
- { Main Program }
-
- begin
- Init;
- while Look <> ^Z do begin
- GetClass;
- GetType;
- TopDecl;
- end;
- end.A*A*
- - 12 -
- PAA
-
-
-
-
-
- {--------------------------------------------------------------}
-
-
- For the first round, just make the three procedures stubs that do
- nothing _BUT_ call GetChar.
-
- Does this program work? Well, it would be hard put NOT to, since
- we're not really asking it to do anything. It's been said that a
- C compiler will accept virtually any input without choking. It's
- certainly true of THIS compiler, since in effect all it does is
- to eat input characters until it finds a ^Z.
-
- Next, let's make GetClass do something worthwhile. Declare the
- global variable
-
-
- var Class: char;
-
-
- and change GetClass to do the following:
-
-
- {--------------------------------------------------------------}
- { Get a Storage Class Specifier }
-
- Procedure GetClass;
- begin
- if Look in ['a', 'x', 's'] then begin
- Class := Look;
- GetChar;
- end
- else Class := 'a';
- end;
- {--------------------------------------------------------------}
-
-
- Here, I've used three single characters to represent the three
- storage classes "auto," "extern," and "static." These are not
- the only three possible classes ... there are also "register" and
- "typedef," but this should give you the picture. Note that the
- default class is "auto."
-
- We can do a similar thing for types. Enter the following
- procedure next:
-
-
- {--------------------------------------------------------------}
- { Get a Type Specifier }
-
- procedure GetType;
- begin
- Typ := ' ';
- if Look = 'u' then begin
- Sign := 'u';A*A*
- - 13 -
- PAA
-
-
-
-
-
- Typ := 'i';
- GetChar;
- end
- else Sign := 's';
- if Look in ['i', 'l', 'c'] then begin
- Typ := Look;
- GetChar;
- end;
- end;
- {--------------------------------------------------------------}
-
- Note that you must add two more global variables, Sign and Typ.
-
- With these two procedures in place, the compiler will process the
- class and type definitions and store away their findings. We can
- now process the rest of the declaration.
-
- We are by no means out of the woods yet, because there are still
- many complexities just in the definition of the type, before we
- even get to the actual data or function names. Let's pretend for
- the moment that we have passed all those gates, and that the next
- thing in the input stream is a name. If the name is followed by
- a left paren, we have a function declaration. If not, we have at
- least one data item, and possibly a list, each element of which
- can have an initializer.
-
- Insert the following version of TopDecl:
-
-
- {--------------------------------------------------------------}
- { Process a Top-Level Declaration }
-
- procedure TopDecl;
- var Name: char;
- begin
- Name := Getname;
- if Look = '(' then
- DoFunc(Name)
- else
- DoData(Name);
- end;
- {--------------------------------------------------------------}
-
-
- (Note that, since we have already read the name, we must pass it
- along to the appropriate routine.)
-
- Finally, add the two procedures DoFunc and DoData:
-
-
- {--------------------------------------------------------------}
- { Process a Function Definition }
-
- procedure DoFunc(n: char);A*A*
- - 14 -
- PAA
-
-
-
-
-
- begin
- Match('(');
- Match(')');
- Match('{');
- Match('}');
- if Typ = ' ' then Typ := 'i';
- Writeln(Class, Sign, Typ, ' function ', n);
- end;
-
- {--------------------------------------------------------------}
- { Process a Data Declaration }
-
- procedure DoData(n: char);
- begin
- if Typ = ' ' then Expected('Type declaration');
- Writeln(Class, Sign, Typ, ' data ', n);
- while Look = ',' do begin
- Match(',');
- n := GetName;
- WriteLn(Class, Sign, Typ, ' data ', n);
- end;
- Match(';');
- end;
- {--------------------------------------------------------------}
-
-
- Since we're still a long way from producing executable code, I
- decided to just have these two routines tell us what they found.
-
- OK, give this program a try. For data declarations, it's OK to
- give a list separated by commas. We can't process initializers
- as yet. We also can't process argument lists for the functions,
- but the "(){}" characters should be there.
-
- We're still a _VERY_ long way from having a C compiler, but what
- we have is starting to process the right kinds of inputs, and is
- recognizing both good and bad inputs. In the process, the
- natural structure of the compiler is starting to take form.
-
- Can we continue this until we have something that acts more like
- a compiler. Of course we can. Should we? That's another matter.
- I don't know about you, but I'm beginning to get dizzy, and we've
- still got a long way to go to even get past the data
- declarations.
-
- At this point, I think you can see how the structure of the
- compiler evolves from the language definition. The structures
- we've seen for our two examples, Pascal and C, are as different
- as night and day. Pascal was designed at least partly to be easy
- to parse, and that's reflected in the compiler. In general, in
- Pascal there is more structure and we have a better idea of what
- kinds of constructs to expect at any point. In C, on the other
- hand, the program is essentially a list of declarations,
- terminated only by the end of file.A*A*
- - 15 -
- PAA
-
-
-
-
-
- We could pursue both of these structures much farther, but
- remember that our purpose here is not to build a Pascal or a C
- compiler, but rather to study compilers in general. For those of
- you who DO want to deal with Pascal or C, I hope I've given you
- enough of a start so that you can take it from here (although
- you'll soon need some of the stuff we still haven't covered yet,
- such as typing and procedure calls). For the rest of you, stay
- with me through the next installment. There, I'll be leading you
- through the development of a complete compiler for TINY, a subset
- of KISS.
-
- See you then.
-
-
- *****************************************************************
- * *
- * COPYRIGHT NOTICE *
- * *
- * Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. *
- * *
- *****************************************************************AIAI
-
-
-
-
-
- - 16 -A*A*
- @