home *** CD-ROM | disk | FTP | other *** search
- OPAA
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- LET'S BUILD A COMPILER!
-
- By
-
- Jack W. Crenshaw, Ph.D.
-
- 24 July 1988
-
-
- Part II: EXPRESSION PARSING
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- PAA
-
-
-
-
-
- *****************************************************************
- * *
- * COPYRIGHT NOTICE *
- * *
- * Copyright 1988 (C) Jack W. Crenshaw. All rights reserved. *
- * *
- *****************************************************************
-
-
- GETTING STARTED
-
- If you've read the introduction document to this series, you will
- already know what we're about. You will also have copied the
- cradle software into your Turbo Pascal system, and have compiled
- it. So you should be ready to go.
-
-
- The purpose of this article is for us to learn how to parse and
- translate mathematical expressions. What we would like to see as
- output is a series of assembler-language statements that perform
- the desired actions. For purposes of definition, an expression
- is the right-hand side of an equation, as in
-
- x = 2*y + 3/(4*z)
-
- In the early going, I'll be taking things in _VERY_ small steps.
- That's so that the beginners among you won't get totally lost.
- There are also some very good lessons to be learned early on,
- that will serve us well later. For the more experienced readers:
- bear with me. We'll get rolling soon enough.
-
- SINGLE DIGITS
-
- In keeping with the whole theme of this series (KISS, remember?),
- let's start with the absolutely most simple case we can think of.
- That, to me, is an expression consisting of a single digit.
-
- Before starting to code, make sure you have a baseline copy of
- the "cradle" that I gave last time. We'll be using it again for
- other experiments. Then add this code:
-
-
- {---------------------------------------------------------------}
- { Parse and Translate a Math Expression }
-
- procedure Expression;
- begin
- EmitLn('MOVE #' + GetNum + ',D0')
- end;
- {---------------------------------------------------------------}
-
-
- And add the line "Expression;" to the main program so that it
- reads:A*A*
- - 2 -
- PAA
-
-
-
-
-
- {---------------------------------------------------------------}
- begin
- Init;
- Expression;
- end.
- {---------------------------------------------------------------}
-
-
- Now run the program. Try any single-digit number as input. You
- should get a single line of assembler-language output. Now try
- any other character as input, and you'll see that the parser
- properly reports an error.
-
-
- CONGRATULATIONS! You have just written a working translator!
-
- OK, I grant you that it's pretty limited. But don't brush it off
- too lightly. This little "compiler" does, on a very limited
- scale, exactly what any larger compiler does: it correctly
- recognizes legal statements in the input "language" that we have
- defined for it, and it produces correct, executable assembler
- code, suitable for assembling into object format. Just as
- importantly, it correctly recognizes statements that are NOT
- legal, and gives a meaningful error message. Who could ask for
- more? As we expand our parser, we'd better make sure those two
- characteristics always hold true.
-
- There are some other features of this tiny program worth
- mentioning. First, you can see that we don't separate code
- generation from parsing ... as soon as the parser knows what we
- want done, it generates the object code directly. In a real
- compiler, of course, the reads in GetChar would be from a disk
- file, and the writes to another disk file, but this way is much
- easier to deal with while we're experimenting.
-
- Also note that an expression must leave a result somewhere. I've
- chosen the 68000 register DO. I could have made some other
- choices, but this one makes sense.
-
-
- BINARY EXPRESSIONS
-
- Now that we have that under our belt, let's branch out a bit.
- Admittedly, an "expression" consisting of only one character is
- not going to meet our needs for long, so let's see what we can do
- to extend it. Suppose we want to handle expressions of the form:
-
- 1+2
- or 4-3
- or, in general, <term> +/- <term>
-
- (That's a bit of Backus-Naur Form, or BNF.)ABAB
- - 3 -A*A*
- PAA
-
-
-
-
-
- To do this we need a procedure that recognizes a term and leaves
- its result somewhere, and another that recognizes and
- distinguishes between a '+' and a '-' and generates the
- appropriate code. But if Expression is going to leave its result
- in DO, where should Term leave its result? Answer: the same
- place. We're going to have to save the first result of Term
- somewhere before we get the next one.
-
- OK, basically what we want to do is have procedure Term do what
- Expression was doing before. So just RENAME procedure Expression
- as Term, and enter the following new version of Expression:
-
-
- {---------------------------------------------------------------}
- { Parse and Translate an Expression }
-
- procedure Expression;
- begin
- Term;
- EmitLn('MOVE D0,D1');
- case Look of
- '+': Add;
- '-': Subtract;
- else Expected('Addop');
- end;
- end;
- {--------------------------------------------------------------}
-
-
- Next, just above Expression enter these two procedures:
-
-
- {--------------------------------------------------------------}
- { Recognize and Translate an Add }
-
- procedure Add;
- begin
- Match('+');
- Term;
- EmitLn('ADD D1,D0');
- end;
-
-
- {-------------------------------------------------------------}
- { Recognize and Translate a Subtract }
-
- procedure Subtract;
- begin
- Match('-');
- Term;
- EmitLn('SUB D1,D0');
- end;
- {-------------------------------------------------------------}A6A6
- - 4 -A*A*
- PAA
-
-
-
-
-
- When you're finished with that, the order of the routines should
- be:
-
- o Term (The OLD Expression)
- o Add
- o Subtract
- o Expression
-
- Now run the program. Try any combination you can think of of two
- single digits, separated by a '+' or a '-'. You should get a
- series of four assembler-language instructions out of each run.
- Now try some expressions with deliberate errors in them. Does
- the parser catch the errors?
-
- Take a look at the object code generated. There are two
- observations we can make. First, the code generated is NOT what
- we would write ourselves. The sequence
-
- MOVE #n,D0
- MOVE D0,D1
-
- is inefficient. If we were writing this code by hand, we would
- probably just load the data directly to D1.
-
- There is a message here: code generated by our parser is less
- efficient than the code we would write by hand. Get used to it.
- That's going to be true throughout this series. It's true of all
- compilers to some extent. Computer scientists have devoted whole
- lifetimes to the issue of code optimization, and there are indeed
- things that can be done to improve the quality of code output.
- Some compilers do quite well, but there is a heavy price to pay
- in complexity, and it's a losing battle anyway ... there will
- probably never come a time when a good assembler-language pro-
- grammer can't out-program a compiler. Before this session is
- over, I'll briefly mention some ways that we can do a little op-
- timization, just to show you that we can indeed improve things
- without too much trouble. But remember, we're here to learn, not
- to see how tight we can make the object code. For now, and
- really throughout this series of articles, we'll studiously
- ignore optimization and concentrate on getting out code that
- works.
-
- Speaking of which: ours DOESN'T! The code is _WRONG_! As things
- are working now, the subtraction process subtracts D1 (which has
- the FIRST argument in it) from D0 (which has the second). That's
- the wrong way, so we end up with the wrong sign for the result.
- So let's fix up procedure Subtract with a sign-changer, so that
- it readsA9A9
-
- - 5 -A*A*
- PAA
-
-
-
-
-
- {-------------------------------------------------------------}
- { Recognize and Translate a Subtract }
-
- procedure Subtract;
- begin
- Match('-');
- Term;
- EmitLn('SUB D1,D0');
- EmitLn('NEG D0');
- end;
- {-------------------------------------------------------------}
-
-
- Now our code is even less efficient, but at least it gives the
- right answer! Unfortunately, the rules that give the meaning of
- math expressions require that the terms in an expression come out
- in an inconvenient order for us. Again, this is just one of
- those facts of life you learn to live with. This one will come
- back to haunt us when we get to division.
-
- OK, at this point we have a parser that can recognize the sum or
- difference of two digits. Earlier, we could only recognize a
- single digit. But real expressions can have either form (or an
- infinity of others). For kicks, go back and run the program with
- the single input line '1'.
-
- Didn't work, did it? And why should it? We just finished
- telling our parser that the only kinds of expressions that are
- legal are those with two terms. We must rewrite procedure
- Expression to be a lot more broadminded, and this is where things
- start to take the shape of a real parser.
-
-
- GENERAL EXPRESSIONS
-
- In the REAL world, an expression can consist of one or more
- terms, separated by "addops" ('+' or '-'). In BNF, this is
- written
-
- <expression> ::= <term> [<addop> <term>]*
-
-
- We can accomodate this definition of an expression with the
- addition of a simple loop to procedure Expression:AQAQ
-
- - 6 -A*A*
- PAA
-
-
-
-
-
- {---------------------------------------------------------------}
- { Parse and Translate an Expression }
-
- procedure Expression;
- begin
- Term;
- while Look in ['+', '-'] do begin
- EmitLn('MOVE D0,D1');
- case Look of
- '+': Add;
- '-': Subtract;
- else Expected('Addop');
- end;
- end;
- end;
- {--------------------------------------------------------------}
-
-
- NOW we're getting somewhere! This version handles any number of
- terms, and it only cost us two extra lines of code. As we go on,
- you'll discover that this is characteristic of top-down parsers
- ... it only takes a few lines of code to accomodate extensions to
- the language. That's what makes our incremental approach
- possible. Notice, too, how well the code of procedure Expression
- matches the BNF definition. That, too, is characteristic of the
- method. As you get proficient in the approach, you'll find that
- you can turn BNF into parser code just about as fast as you can
- type!
-
- OK, compile the new version of our parser, and give it a try. As
- usual, verify that the "compiler" can handle any legal
- expression, and will give a meaningful error message for an
- illegal one. Neat, eh? You might note that in our test version,
- any error message comes out sort of buried in whatever code had
- already been generated. But remember, that's just because we are
- using the CRT as our "output file" for this series of
- experiments. In a production version, the two outputs would be
- separated ... one to the output file, and one to the screen.
-
-
- USING THE STACK
-
- At this point I'm going to violate my rule that we don't
- introduce any complexity until it's absolutely necessary, long
- enough to point out a problem with the code we're generating. As
- things stand now, the parser uses D0 for the "primary" register,
- and D1 as a place to store the partial sum. That works fine for
- now, because as long as we deal with only the "addops" '+' and
- '-', any new term can be added in as soon as it is found. But in
- general that isn't true. Consider, for example, the expression
-
- 1+(2-(3+(4-5)))ABAB
- - 7 -A*A*
- PAA
-
-
-
-
-
- If we put the '1' in D1, where do we put the '2'? Since a
- general expression can have any degree of complexity, we're going
- to run out of registers fast!
-
- Fortunately, there's a simple solution. Like every modern
- microprocessor, the 68000 has a stack, which is the perfect place
- to save a variable number of items. So instead of moving the term
- in D0 to D1, let's just push it onto the stack. For the benefit
- of those unfamiliar with 68000 assembler language, a push is
- written
-
- -(SP)
-
- and a pop, (SP)+ .
-
-
- So let's change the EmitLn in Expression to read:
-
- EmitLn('MOVE D0,-(SP)');
-
- and the two lines in Add and Subtract to
-
- EmitLn('ADD (SP)+,D0')
-
- and EmitLn('SUB (SP)+,D0'),
-
- respectively. Now try the parser again and make sure we haven't
- broken it.
-
- Once again, the generated code is less efficient than before, but
- it's a necessary step, as you'll see.
-
-
- MULTIPLICATION AND DIVISION
-
- Now let's get down to some REALLY serious business. As you all
- know, there are other math operators than "addops" ...
- expressions can also have multiply and divide operations. You
- also know that there is an implied operator PRECEDENCE, or
- hierarchy, associated with expressions, so that in an expression
- like
-
- 2 + 3 * 4,
-
- we know that we're supposed to multiply FIRST, then add. (See
- why we needed the stack?)
-
- In the early days of compiler technology, people used some rather
- complex techniques to insure that the operator precedence rules
- were obeyed. It turns out, though, that none of this is
- necessary ... the rules can be accommodated quite nicely by our
- top-down parsing technique. Up till now, the only form that
- we've considered for a term is that of a single decimal digit.A6A6
- - 8 -A*A*
- PAA
-
-
-
-
-
- More generally, we can define a term as a PRODUCT of FACTORS;
- i.e.,
-
- <term> ::= <factor> [ <mulop> <factor ]*
-
- What is a factor? For now, it's what a term used to be ... a
- single digit.
-
- Notice the symmetry: a term has the same form as an expression.
- As a matter of fact, we can add to our parser with a little
- judicious copying and renaming. But to avoid confusion, the
- listing below is the complete set of parsing routines. (Note the
- way we handle the reversal of operands in Divide.)
-
-
- {---------------------------------------------------------------}
- { Parse and Translate a Math Factor }
-
- procedure Factor;
- begin
- EmitLn('MOVE #' + GetNum + ',D0')
- end;
-
-
- {--------------------------------------------------------------}
- { Recognize and Translate a Multiply }
-
- procedure Multiply;
- begin
- Match('*');
- Factor;
- EmitLn('MULS (SP)+,D0');
- end;
-
-
- {-------------------------------------------------------------}
- { Recognize and Translate a Divide }
-
- procedure Divide;
- begin
- Match('/');
- Factor;
- EmitLn('MOVE (SP)+,D1');
- EmitLn('DIVS D1,D0');
- end;AKAK
-
- - 9 -A*A*
- PAA
-
-
-
-
-
- {---------------------------------------------------------------}
- { Parse and Translate a Math Term }
-
- procedure Term;
- begin
- Factor;
- while Look in ['*', '/'] do begin
- EmitLn('MOVE D0,-(SP)');
- case Look of
- '*': Multiply;
- '/': Divide;
- else Expected('Mulop');
- end;
- end;
- end;
-
-
- {--------------------------------------------------------------}
- { Recognize and Translate an Add }
-
- procedure Add;
- begin
- Match('+');
- Term;
- EmitLn('ADD (SP)+,D0');
- end;
-
-
- {-------------------------------------------------------------}
- { Recognize and Translate a Subtract }
-
- procedure Subtract;
- begin
- Match('-');
- Term;
- EmitLn('SUB (SP)+,D0');
- EmitLn('NEG D0');
- end;ANAN
-
-
- - 10 -A*A*
- PAA
-
-
-
-
-
- {---------------------------------------------------------------}
- { Parse and Translate an Expression }
-
- procedure Expression;
- begin
- Term;
- while Look in ['+', '-'] do begin
- EmitLn('MOVE D0,-(SP)');
- case Look of
- '+': Add;
- '-': Subtract;
- else Expected('Addop');
- end;
- end;
- end;
- {--------------------------------------------------------------}
-
-
- Hot dog! A NEARLY functional parser/translator, in only 55 lines
- of Pascal! The output is starting to look really useful, if you
- continue to overlook the inefficiency, which I hope you will.
- Remember, we're not trying to produce tight code here.
-
-
- PARENTHESES
-
- We can wrap up this part of the parser with the addition of
- parentheses with math expressions. As you know, parentheses are
- a mechanism to force a desired operator precedence. So, for
- example, in the expression
-
- 2*(3+4) ,
-
- the parentheses force the addition before the multiply. Much
- more importantly, though, parentheses give us a mechanism for
- defining expressions of any degree of complexity, as in
-
- (1+2)/((3+4)+(5-6))
-
- The key to incorporating parentheses into our parser is to
- realize that no matter how complicated an expression enclosed by
- parentheses may be, to the rest of the world it looks like a
- simple factor. That is, one of the forms for a factor is:
-
- <factor> ::= (<expression>)
-
- This is where the recursion comes in. An expression can contain a
- factor which contains another expression which contains a factor,
- etc., ad infinitum.
-
- Complicated or not, we can take care of this by adding just a few
- lines of Pascal to procedure Factor:ABAB
- - 11 -A*A*
- PAA
-
-
-
-
-
- {---------------------------------------------------------------}
- { Parse and Translate a Math Factor }
-
- procedure Expression; Forward;
-
- procedure Factor;
- begin
- if Look = '(' then begin
- Match('(');
- Expression;
- Match(')');
- end
- else
- EmitLn('MOVE #' + GetNum + ',D0');
- end;
- {--------------------------------------------------------------}
-
-
- Note again how easily we can extend the parser, and how well the
- Pascal code matches the BNF syntax.
-
- As usual, compile the new version and make sure that it correctly
- parses legal sentences, and flags illegal ones with an error
- message.
-
-
- UNARY MINUS
-
- At this point, we have a parser that can handle just about any
- expression, right? OK, try this input sentence:
-
- -1
-
- WOOPS! It doesn't work, does it? Procedure Expression expects
- everything to start with an integer, so it coughs up the leading
- minus sign. You'll find that +3 won't work either, nor will
- something like
-
- -(3-2) .
-
- There are a couple of ways to fix the problem. The easiest
- (although not necessarily the best) way is to stick an imaginary
- leading zero in front of expressions of this type, so that -3
- becomes 0-3. We can easily patch this into our existing version
- of Expression:AKAK
-
- - 12 -A*A*
- PAA
-
-
-
-
-
- {---------------------------------------------------------------}
- { Parse and Translate an Expression }
-
- procedure Expression;
- begin
- if IsAddop(Look) then
- EmitLn('CLR D0')
- else
- Term;
- while IsAddop(Look) do begin
- EmitLn('MOVE D0,-(SP)');
- case Look of
- '+': Add;
- '-': Subtract;
- else Expected('Addop');
- end;
- end;
- end;
- {--------------------------------------------------------------}
-
-
- I TOLD you that making changes was easy! This time it cost us
- only three new lines of Pascal. Note the new reference to
- function IsAddop. Since the test for an addop appeared twice, I
- chose to embed it in the new function. The form of IsAddop
- should be apparent from that for IsAlpha. Here it is:
-
-
- {--------------------------------------------------------------}
- { Recognize an Addop }
-
- function IsAddop(c: char): boolean;
- begin
- IsAddop := c in ['+', '-'];
- end;
- {--------------------------------------------------------------}
-
-
- OK, make these changes to the program and recompile. You should
- also include IsAddop in your baseline copy of the cradle. We'll
- be needing it again later. Now try the input -1 again. Wow!
- The efficiency of the code is pretty poor ... six lines of code
- just for loading a simple constant ... but at least it's correct.
- Remember, we're not trying to replace Turbo Pascal here.
-
- At this point we're just about finished with the structure of our
- expression parser. This version of the program should correctly
- parse and compile just about any expression you care to throw at
- it. It's still limited in that we can only handle factors
- involving single decimal digits. But I hope that by now you're
- starting to get the message that we can accomodate further
- extensions with just some minor changes to the parser. You
- probably won't be surprised to hear that a variable or even a
- function call is just another kind of a factor.A*A*
- - 13 -
- PAA
-
-
-
-
-
- In the next session, I'll show you just how easy it is to extend
- our parser to take care of these things too, and I'll also show
- you just how easily we can accomodate multicharacter numbers and
- variable names. So you see, we're not far at all from a truly
- useful parser.
-
-
- A WORD ABOUT OPTIMIZATION
-
- Earlier in this session, I promised to give you some hints as to
- how we can improve the quality of the generated code. As I said,
- the production of tight code is not the main purpose of this
- series of articles. But you need to at least know that we aren't
- just wasting our time here ... that we can indeed modify the
- parser further to make it produce better code, without throwing
- away everything we've done to date. As usual, it turns out that
- SOME optimization is not that difficult to do ... it simply takes
- some extra code in the parser.
-
- There are two basic approaches we can take:
-
- o Try to fix up the code after it's generated
-
- This is the concept of "peephole" optimization. The general
- idea it that we know what combinations of instructions the
- compiler is going to generate, and we also know which ones
- are pretty bad (such as the code for -1, above). So all we
- do is to scan the produced code, looking for those
- combinations, and replacing them by better ones. It's sort
- of a macro expansion, in reverse, and a fairly
- straightforward exercise in pattern-matching. The only
- complication, really, is that there may be a LOT of such
- combinations to look for. It's called peephole optimization
- simply because it only looks at a small group of instructions
- at a time. Peephole optimization can have a dramatic effect
- on the quality of the code, with little change to the
- structure of the compiler itself. There is a price to pay,
- though, in both the speed, size, and complexity of the
- compiler. Looking for all those combinations calls for a lot
- of IF tests, each one of which is a source of error. And, of
- course, it takes time.
-
- In the classical implementation of a peephole optimizer,
- it's done as a second pass to the compiler. The output code
- is written to disk, and then the optimizer reads and
- processes the disk file again. As a matter of fact, you can
- see that the optimizer could even be a separate PROGRAM from
- the compiler proper. Since the optimizer only looks at the
- code through a small "window" of instructions (hence the
- name), a better implementation would be to simply buffer up a
- few lines of output, and scan the buffer after each EmitLn.
-
- o Try to generate better code in the first placeA6A6
- - 14 -A*A*
- PAA
-
-
-
-
-
- This approach calls for us to look for special cases BEFORE
- we Emit them. As a trivial example, we should be able to
- identify a constant zero, and Emit a CLR instead of a load,
- or even do nothing at all, as in an add of zero, for example.
- Closer to home, if we had chosen to recognize the unary minus
- in Factor instead of in Expression, we could treat constants
- like -1 as ordinary constants, rather then generating them
- from positive ones. None of these things are difficult to
- deal with ... they only add extra tests in the code, which is
- why I haven't included them in our program. The way I see
- it, once we get to the point that we have a working compiler,
- generating useful code that executes, we can always go back
- and tweak the thing to tighten up the code produced. That's
- why there are Release 2.0's in the world.
-
- There IS one more type of optimization worth mentioning, that
- seems to promise pretty tight code without too much hassle. It's
- my "invention" in the sense that I haven't seen it suggested in
- print anywhere, though I have no illusions that it's original
- with me.
-
- This is to avoid such a heavy use of the stack, by making better
- use of the CPU registers. Remember back when we were doing only
- addition and subtraction, that we used registers D0 and D1,
- rather than the stack? It worked, because with only those two
- operations, the "stack" never needs more than two entries.
-
- Well, the 68000 has eight data registers. Why not use them as a
- privately managed stack? The key is to recognize that, at any
- point in its processing, the parser KNOWS how many items are on
- the stack, so it can indeed manage it properly. We can define a
- private "stack pointer" that keeps track of which stack level
- we're at, and addresses the corresponding register. Procedure
- Factor, for example, would not cause data to be loaded into
- register D0, but into whatever the current "top-of-stack"
- register happened to be.
-
- What we're doing in effect is to replace the CPU's RAM stack with
- a locally managed stack made up of registers. For most
- expressions, the stack level will never exceed eight, so we'll
- get pretty good code out. Of course, we also have to deal with
- those odd cases where the stack level DOES exceed eight, but
- that's no problem either. We simply let the stack spill over
- into the CPU stack. For levels beyond eight, the code is no
- worse than what we're generating now, and for levels less than
- eight, it's considerably better.
-
- For the record, I have implemented this concept, just to make
- sure it works before I mentioned it to you. It does. In
- practice, it turns out that you can't really use all eight levels
- ... you need at least one register free to reverse the operand
- order for division (sure wish the 68000 had an XTHL, like the
- 8080!). For expressions that include function calls, we wouldA6A6
- - 15 -A*A*
- PAA
-
-
-
-
-
- also need a register reserved for them. Still, there is a nice
- improvement in code size for most expressions.
-
- So, you see, getting better code isn't that difficult, but it
- does add complexity to the our translator ... complexity we can
- do without at this point. For that reason, I STRONGLY suggest
- that we continue to ignore efficiency issues for the rest of this
- series, secure in the knowledge that we can indeed improve the
- code quality without throwing away what we've done.
-
- Next lesson, I'll show you how to deal with variables factors and
- function calls. I'll also show you just how easy it is to handle
- multicharacter tokens and embedded white space.
-
- *****************************************************************
- * *
- * COPYRIGHT NOTICE *
- * *
- * Copyright 1988 (C) Jack W. Crenshaw. All rights reserved. *
- * *
- *****************************************************************AIAI
-
-
-
-
-
- - 16 -A*A*
- @