- LET'S BUILD A COMPILER!
-
- By
-
- Jack W. Crenshaw, Ph.D.
-
- 2 April 1989
-
-
- Part VIII: A LITTLE PHILOSOPHY
-
-
- *****************************************************************
- * *
- * COPYRIGHT NOTICE *
- * *
- * Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. *
- * *
- *****************************************************************
-
-
- INTRODUCTION
-
- This is going to be a different kind of session than the others
- in our series on parsing and compiler construction. For this
- session, there won't be any experiments to do or code to write.
- This once, I'd like to just talk with you for a while.
- Mercifully, it will be a short session, and then we can take up
- where we left off, hopefully with renewed vigor.
-
- When I was in college, I found that I could always follow a
- prof's lecture a lot better if I knew where he was going with it.
- I'll bet you were the same.
-
- So I thought maybe it's about time I told you where we're going
- with this series: what's coming up in future installments, and in
- general what all this is about. I'll also share some general
- thoughts concerning the usefulness of what we've been doing.
-
-
- THE ROAD HOME
-
- So far, we've covered the parsing and translation of arithmetic
- expressions, Boolean expressions, and combinations connected by
- relational operators. We've also done the same for control
- constructs. In all of this we've leaned heavily on the use of
- top-down, recursive descent parsing, BNF definitions of the
- syntax, and direct generation of assembly-language code. We also
- learned the value of such tricks as single-character tokens to
- help us see the forest through the trees. In the last
- installment we dealt with lexical scanning, and I showed you
- simple but powerful ways to remove the single-character barriers.
-
- Throughout the whole study, I've emphasized the KISS philosophy
- ... Keep It Simple, Sidney ... and I hope by now you've realized
- just how simple this stuff can really be. While there are for
- sure areas of compiler theory that are truly intimidating, the
- ultimate message of this series is that in practice you can just
- politely sidestep many of these areas. If the language
- definition cooperates or, as in this series, if you can define
- the language as you go, it's possible to write down the language
- definition in BNF with reasonable ease. And, as we've seen, you
- can crank out parse procedures from the BNF just about as fast as
- you can type.
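-
- For instance, given a BNF rule, the corresponding parse
- procedure practically writes itself.  Here's a sketch in the
- series' Turbo Pascal, assuming the lookahead character Look and
- the Factor, Multiply, and Divide routines from the earlier
- installments:
-
-      { <term> ::= <factor> [ ('*' | '/') <factor> ]* }
-
-      procedure Term;
-      begin
-         Factor;
-         while Look in ['*', '/'] do
-            case Look of
-             '*': Multiply;
-             '/': Divide;
-            end;
-      end;
-
- Each piece of the rule maps directly onto a statement in the
- procedure, and that's all there is to it.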
-
- As our compiler has taken form, it's gotten more parts, but each
- part is quite small and simple, and very much like all the
- others.
-
- At this point, we have many of the makings of a real, practical
- compiler. As a matter of fact, we already have all we need to
- build a toy compiler for a language as powerful as, say, Tiny
- BASIC. In the next couple of installments, we'll go ahead and
- define that language.
-
- To round out the series, we still have a few items to cover.
- These include:
-
- o Procedure calls, with and without parameters
-
- o Local and global variables
-
- o Basic types, such as character and integer types
-
- o Arrays
-
- o Strings
-
- o User-defined types and structures
-
- o Tree-structured parsers and intermediate languages
-
- o Optimization
-
- These will all be covered in future installments. When we're
- finished, you'll have all the tools you need to design and build
- your own languages, and the compilers to translate them.
-
- I can't design those languages for you, but I can make some
- comments and recommendations. I've already sprinkled some
- throughout past installments. You've seen, for example, the
- control constructs I prefer.
-
- These constructs are going to be part of the languages I build.
- I have three languages in mind at this point, two of which you
- will see in installments to come:
-
- TINY - A minimal, but usable language on the order of Tiny
- BASIC or Tiny C. It won't be very practical, but it will
- have enough power to let you write and run real programs
- that do something worthwhile.
-
- KISS - The language I'm building for my own use. KISS is
- intended to be a systems programming language. It won't
- have strong typing or fancy data structures, but it will
- support most of the things I want to do with a higher-
- order language (HOL), except perhaps writing compilers.
-
- I've also been toying for years with the idea of a HOL-like
- assembler, with structured control constructs and HOL-like
- assignment statements. That, in fact, was the impetus behind my
- original foray into the jungles of compiler theory. This one may
- never be built, simply because I've learned that it's actually
- easier to implement a language like KISS, that only uses a subset
- of the CPU instructions. As you know, assembly language can be
- bizarre and irregular in the extreme, and a language that maps
- one-for-one onto it can be a real challenge. Still, I've always
- felt that the syntax used in conventional assemblers is dumb ...
- why is
-
- MOVE.L A,B
-
- better, or easier to translate, than
-
- B=A ?
-
- I think it would be an interesting exercise to develop a
- "compiler" that would give the programmer complete access to and
- control over the full complement of the CPU instruction set, and
- would allow you to generate programs as efficient as assembly
- language, without the pain of learning a set of mnemonics. Can
- it be done? I don't know. The real question may be, "Will the
- resulting language be any easier to write than assembly?"  If
- not, there's no point in it. I think that it can be done, but
- I'm not completely sure yet how the syntax should look.
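-
- Just to make the idea concrete, here's a minimal sketch of what
- such a translator might do with that assignment, built on the
- GetName, Match, and EmitLn routines from our cradle.  The
- single-character names and the fixed MOVE.L encoding are
- assumptions for illustration only:
-
-      { Translate  B=A  directly into  MOVE.L A,B }
-
-      procedure Assignment;
-      var Target, Source: char;
-      begin
-         Target := GetName;
-         Match('=');
-         Source := GetName;
-         EmitLn('MOVE.L ' + Source + ',' + Target);
-      end;
-
- The real job, of course, is giving the same treatment to every
- addressing mode and instruction the CPU offers.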
-
- Perhaps you have some comments or suggestions on this one. I'd
- love to hear them.
-
- You probably won't be surprised to learn that I've already worked
- ahead in most of the areas that we will cover. I have some good
- news: Things never get much harder than they've been so far.
- It's possible to build a complete, working compiler for a real
- language, using nothing but the same kinds of techniques you've
- learned so far. And THAT brings up some interesting questions.
-
-
- WHY IS IT SO SIMPLE?
-
- Before embarking on this series, I always thought that compilers
- were just naturally complex computer programs ... the ultimate
- challenge. Yet the things we have done here have usually turned
- out to be quite simple, sometimes even trivial.
-
- For a while, I thought it was simply because I hadn't yet gotten
- into the meat of the subject. I had only covered the simple
- parts. I will freely admit to you that, even when I began the
- series, I wasn't sure how far we would be able to go before
- things got too complex to deal with in the ways we have so far.
- But at this point I've already been down the road far enough to
- see the end of it.  Guess what?
-
- THERE ARE NO HARD PARTS!
-
-
- Then, I thought maybe it was because we were not generating very
- good object code. Those of you who have been following the
- series and trying sample compiles know that, while the code works
- and is rather foolproof, its efficiency is pretty awful. I
- figured that if we were concentrating on turning out tight code,
- we would soon find all that missing complexity.
-
- To some extent, that one is true. In particular, my first few
- efforts at trying to improve efficiency introduced complexity at
- an alarming rate. But since then I've been tinkering around with
- some simple optimizations and I've found some that result in very
- respectable code quality, WITHOUT adding a lot of complexity.
-
- Finally, I thought that perhaps the saving grace was the "toy
- compiler" nature of the study. I have made no pretense that we
- were ever going to be able to build a compiler to compete with
- Borland and Microsoft. And yet, again, as I get deeper into this
- thing the differences are starting to fade away.
-
- Just to make sure you get the message here, let me state it flat
- out:
-
- USING THE TECHNIQUES WE'VE USED HERE, IT IS POSSIBLE TO
- BUILD A PRODUCTION-QUALITY, WORKING COMPILER WITHOUT ADDING
- A LOT OF COMPLEXITY TO WHAT WE'VE ALREADY DONE.
-
-
- Since the series began I've received some comments from you.
- Most of them echo my own thoughts: "This is easy! Why do the
- textbooks make it seem so hard?" Good question.
-
- Recently, I've gone back and looked at some of those texts again,
- and even bought and read some new ones. Each time, I come away
- with the same feeling: These guys have made it seem too hard.
-
- What's going on here? Why does the whole thing seem difficult in
- the texts, but easy to us? Are we that much smarter than Aho,
- Ullman, Brinch Hansen, and all the rest?
-
- Hardly. But we are doing some things differently, and more and
- more I'm starting to appreciate the value of our approach, and
- the way that it simplifies things. Aside from the obvious
- shortcuts that I outlined in Part I, like single-character tokens
- and console I/O, we have made some implicit assumptions and done
- some things differently from those who have designed compilers in
- the past. As it turns out, our approach makes life a lot easier.
-
- So why didn't all those other guys use it?
-
- You have to remember the context of some of the earlier compiler
- development.  These people were working with very small computers
- of limited capacity. Memory was very limited, the CPU
- instruction set was minimal, and programs ran in batch mode
- rather than interactively.  As it turns out, those constraints
- forced key design decisions that really complicated the designs.
- Until recently, I hadn't realized how much of classical compiler
- design was driven by the available hardware.
-
- Even in cases where these limitations no longer apply, people
- have tended to structure their programs in the same way, since
- that is the way they were taught to do it.
-
- In our case, we have started with a blank sheet of paper. There
- is a danger there, of course, that you will end up falling into
- traps that other people have long since learned to avoid. But it
- also has allowed us to take different approaches that, partly by
- design and partly by pure dumb luck, have allowed us to gain
- simplicity.
-
- Here are the areas that I think have led to complexity in the
- past:
-
- o Limited RAM Forcing Multiple Passes
-
- I just read "Brinch Hansen on Pascal Compilers" (an
- excellent book, BTW). He developed a Pascal compiler for a
- PC, but he started the effort in 1981 with a 64K system, and
- so almost every design decision he made was aimed at making
- the compiler fit into RAM. To do this, his compiler has
- three passes, one of which is the lexical scanner. There is
- no way he could, for example, use the distributed scanner I
- introduced in the last installment, because the program
- structure wouldn't allow it. He also required not one but
- two intermediate languages, to provide the communication
- between phases.
-
- All the early compiler writers had to deal with this issue:
- Break the compiler up into enough parts so that it will fit
- in memory. When you have multiple passes, you need to add
- data structures to support the information that each pass
- leaves behind for the next. That adds complexity, and ends
- up driving the design. Lee's book, "The Anatomy of a
- Compiler," mentions a FORTRAN compiler developed for an IBM
- 1401. It had no fewer than 63 separate passes! Needless to
- say, in a compiler like this the separation into phases
- would dominate the design.
-
- Even in situations where RAM is plentiful, people have
- tended to use the same techniques because that is what
- they're familiar with. It wasn't until Turbo Pascal came
- along that we found how simple a compiler could be if you
- started with different assumptions.
-
-
- o Batch Processing
-
- In the early days, batch processing was the only choice ...
- there was no interactive computing. Even today, compilers
- run in essentially batch mode.
-
- In a mainframe compiler as well as many micro compilers,
- considerable effort is expended on error recovery ... it can
- consume as much as 30-40% of the compiler and completely
- drive the design. The idea is to avoid halting on the first
- error, but rather to keep going at all costs, so that you
- can tell the programmer about as many errors in the whole
- program as possible.
-
- All of that harks back to the days of the early mainframes,
- where turnaround time was measured in hours or days, and it
- was important to squeeze every last ounce of information out
- of each run.
-
- In this series, I've been very careful to avoid the issue of
- error recovery, and instead our compiler simply halts with
- an error message on the first error. I will frankly admit
- that it was mostly because I wanted to take the easy way out
- and keep things simple. But this approach, pioneered by
- Borland in Turbo Pascal, also has a lot going for it.
- Aside from keeping the compiler simple, it also fits very
- well with the idea of an interactive system. When
- compilation is fast, and especially when you have an editor
- such as Borland's that will take you right to the point of
- the error, then it makes a lot of sense to stop there, and
- just restart the compilation after the error is fixed.
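-
- In fact, the whole error "strategy" fits in a few lines.  The
- cradle's handler from Part I amounts to this (paraphrased from
- memory):
-
-      { Report an error and halt on the spot ... no recovery }
-
-      procedure Abort(s: string);
-      begin
-         WriteLn('Error: ', s, '.');
-         Halt;
-      end;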
-
-
- o Large Programs
-
- Early compilers were designed to handle large programs ...
- essentially infinite ones. In those days there was little
- choice; the ideas of subroutine libraries and separate
- compilation were still in the future. Again, this
- assumption led to multi-pass designs and intermediate files
- to hold the results of partial processing.
-
- Brinch Hansen's stated goal was that the compiler should be
- able to compile itself. Again, because of his limited RAM,
- this drove him to a multi-pass design. He needed as little
- resident compiler code as possible, so that the necessary
- tables and other data structures would fit into RAM.
-
- I haven't stated this one yet, because there hasn't been a
- need ... we've always just read and written the data as
- streams, anyway. But for the record, my plan has always
- been that, in a production compiler, the source and object
- data should all coexist in RAM with the compiler, a la the
- early Turbo Pascals. That's why I've been careful to keep
- routines like GetChar and Emit as separate routines, in
- spite of their small size.  It will be easy to change them
- to read from and write to memory.
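-
- As a sketch of what that change might look like (the buffer
- variables here are hypothetical, not part of the series' code):
-
-      var Source: string;    { program text, resident in RAM    }
-          Ptr: integer;      { read position, initialized to 1  }
-
-      { Read new character, now from the memory buffer }
-
-      procedure GetChar;
-      begin
-         Look := Source[Ptr];
-         Ptr := Ptr + 1;
-      end;
-
- Emit and its relatives would get the same treatment, appending
- to an output buffer instead of writing to the console.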
-
-
- o Emphasis on Efficiency
-
- John Backus has stated that, when he and his colleagues
- developed the original FORTRAN compiler, they KNEW that they
- had to make it produce tight code. In those days, there was
- a strong sentiment against HOLs and in favor of assembly
- language, and efficiency was the reason. If FORTRAN didn't
- produce very good code by assembly standards, the users
- would simply refuse to use it. For the record, that FORTRAN
- compiler turned out to be one of the most efficient ever
- built, in terms of code quality. But it WAS complex!
-
- Today, we have CPU power and RAM size to spare, so code
- efficiency is not so much of an issue. By studiously
- ignoring this issue, we have indeed been able to Keep It
- Simple. Ironically, though, as I have said, I have found
- some optimizations that we can add to the basic compiler
- structure, without having to add a lot of complexity. So in
- this case we get to have our cake and eat it too: we will
- end up with reasonable code quality, anyway.
-
-
- o Limited Instruction Sets
-
- The early computers had primitive instruction sets. Things
- that we take for granted, such as stack operations and
- indirect addressing, came only with great difficulty.
-
- Example: In most compiler designs, there is a data structure
- called the literal pool. The compiler typically identifies
- all literals used in the program, and collects them into a
- single data structure. All references to the literals are
- done indirectly to this pool. At the end of the
- compilation, the compiler issues commands to set aside
- storage and initialize the literal pool.
-
- We haven't had to address that issue at all. When we want
- to load a literal, we just do it, in line, as in
-
- MOVE #3,D0
-
- There is something to be said for the use of a literal pool,
- particularly on a machine like the 8086 where data and code
- can be separated. Still, the whole thing adds a fairly
- large amount of complexity with little in return.
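-
- For what it's worth, the in-line approach needs nothing more
- than an emitter like this one (the name LoadConstant is just
- illustrative):
-
-      { Load a literal in line ... no literal pool required }
-
-      procedure LoadConstant(n: string);
-      begin
-         EmitLn('MOVE #' + n + ',D0');
-      end;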
-
- Of course, without the stack we would be lost. In a micro,
- both subroutine calls and temporary storage depend heavily
- on the stack, and we have used it even more than necessary
- to ease expression parsing.A*A*
-
-
- o Desire for Generality
-
- Much of the content of the typical compiler text is taken up
- with issues we haven't addressed here at all ... things like
- automated translation of grammars, or generation of LALR
- parse tables. This is not simply because the authors want
- to impress you. There are good, practical reasons why the
- subjects are there.
-
- We have been concentrating on the use of a recursive-descent
- parser to parse a deterministic grammar, i.e., one that is
- not ambiguous and, moreover, can be parsed with one level of
- lookahead.  I haven't made much of this limitation,
- but the fact is that this represents a small subset of
- possible grammars. In fact, there is an infinite number of
- grammars that we can't parse using our techniques. The LR
- technique is a more powerful one, and can deal with grammars
- that we can't.
-
- In compiler theory, it's important to know how to deal with
- these other grammars, and how to transform them into
- grammars that are easier to deal with. For example, many
- (but not all) ambiguous grammars can be transformed into
- unambiguous ones. The way to do this is not always obvious,
- though, and so many people have devoted years to developing
- ways to transform them automatically.
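-
- A classic example is the left-recursive rule
-
-      <expr> ::= <expr> '+' <term> | <term>
-
- which describes just the expressions we want, but coded
- directly as a recursive procedure it would call itself forever
- without ever consuming a token.  Rewritten as
-
-      <expr> ::= <term> [ '+' <term> ]*
-
- it describes the same language, and it maps straight onto the
- simple loop we've been using all along.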
-
- In practice, these issues turn out to be considerably less
- important. Modern languages tend to be designed to be easy
- to parse, anyway. That was a key motivation in the design
- of Pascal. Sure, there are pathological grammars that you
- would be hard pressed to write unambiguous BNF for, but in
- the real world the best answer is probably to avoid those
- grammars!
-
- In our case, of course, we have sneakily let the language
- evolve as we go, so we haven't painted ourselves into any
- corners here. You may not always have that luxury. Still,
- with a little care you should be able to keep the parser
- simple without having to resort to automatic translation of
- the grammar.
-
-
- We have taken a vastly different approach in this series. We
- started with a clean sheet of paper, and developed techniques
- that work in the context that we are in; that is, a single-user
- PC with rather ample CPU power and RAM space. We have limited
- ourselves to reasonable grammars that are easy to parse, we have
- used the instruction set of the CPU to advantage, and we have not
- concerned ourselves with efficiency. THAT's why it's been easy.
-
- Does this mean that we are forever doomed to be able to build
- only toy compilers? No, I don't think so. As I've said, we can
- add certain optimizations without changing the compiler
- structure. If we want to process large files, we can always add
- file buffering to do that. These things do not affect the
- overall program design.
-
- And I think that's a key factor. By starting with small and
- limited cases, we have been able to concentrate on a structure
- for the compiler that is natural for the job. Since the
- structure naturally fits the job, it is almost bound to be simple
- and transparent. Adding capability doesn't have to change that
- basic structure. We can simply expand things like the file
- structure or add an optimization layer. I guess my feeling is
- that, back when resources were tight, the structures people ended
- up with were artificially warped to make them work under those
- conditions, and weren't optimum structures for the problem at
- hand.
-
-
- CONCLUSION
-
- Anyway, that's my arm-waving guess as to how we've been able to
- keep things simple. We started with something simple and let it
- evolve naturally, without trying to force it into some
- traditional mold.
-
- We're going to press on with this. I've given you a list of the
- areas we'll be covering in future installments. With those
- installments, you should be able to build complete, working
- compilers for just about any occasion, and build them simply. If
- you REALLY want to build production-quality compilers, you'll be
- able to do that, too.
-
- For those of you who are chafing at the bit for more parser code,
- I apologize for this digression. I just thought you'd like to
- have things put into perspective a bit. Next time, we'll get
- back to the mainstream of the tutorial.
-
- So far, we've only looked at pieces of compilers, and while we
- have many of the makings of a complete language, we haven't
- talked about how to put it all together. That will be the
- subject of our next two installments. Then we'll press on into
- the new subjects I listed at the beginning of this installment.
-
- See you then.
-
- *****************************************************************
- * *
- * COPYRIGHT NOTICE *
- * *
- * Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. *
- * *
- *****************************************************************