- James Roskind C Porting Preprocessor (JRCPP)
-
- JRCPP LANGUAGE REFERENCE MANUAL (3/23/90)
-
-
-
- Copyright (C) 1990 James Roskind, All rights reserved. Permission
- is granted to copy and distribute this file as part of any machine
- readable archive containing the entire, unmodified, JRCPP PUBLIC
- DISTRIBUTION PACKAGE (henceforth called the "Package").  The set of
- files that form the Package are described in the README file that
- is a part of the Package. Permission is granted to individual
- users of the Package to copy individual portions of the Package
- (i.e., component files) in any form (e.g.: printed, electronic,
- electro-optical, etc.) desired for the purpose of supporting
- users of the Package (i.e., providing online, or onshelf
- documentation access; executing the binary JRCPP code, etc.).
- Permission is not granted to distribute copies of individual
- portions of the Package, unless a machine readable version of the
- complete Package is also made available with such distribution.
- Abstracting with credit is permitted. There is no charge or
- royalty fee required for copies made in compliance with this
- notice. To otherwise copy elements of this package requires
- prior permission in writing from James Roskind.
-
- James Roskind
- 516 Latania Palm Drive
- Indialantic FL 32903
-
- End of copyright notice
-
-
- What the above copyright means is that you are free to use and
- distribute (or even sell) the entire set of files in this Package,
- but you can't split them up, and distribute them as separate files.
- The notice also says that you cannot modify the copies that you
- distribute, and this ESPECIALLY includes NOT REMOVING any part of
- the copyright notice in any file. JRCPP currently implements a C
- Preprocessor, but the users of this Package do NOT surrender any
- right of ownership or copyright to any source text that is processed
- by JRCPP, either before or after processing. Similarly, there are no
- royalty or fee requirements for using the post-preprocessed output of
- JRCPP.
-
- This Package is expected to be distributed by shareware and freeware
- channels (including BBS sites), but the fees paid for "distribution"
- costs are strictly exchanged between the distributor, and the
- recipient, and James Roskind makes no express or implied warranties
- about the quality or integrity of such indirectly acquired copies.
- Distributors and users may obtain the Package (the Public
- distribution form) directly from the author by following the ordering
- procedures in the REGISTRATION file.
-
-
- DISCLAIMER:
-
- JAMES ROSKIND PROVIDES THIS FILE "AS IS" WITHOUT WARRANTY OF ANY
- KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE
- IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR
- PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE
- PROGRAM AND DOCUMENTATION IS WITH YOU. Some states do not allow
- disclaimer of express or implied warranties in certain transactions,
- therefore, this statement may not apply to you.
-
-
- UNIX is a registered trademark of AT&T Bell Laboratories.
- ____________________________________________________________________
-
-
-
- James Roskind C Porting Preprocessor (JRCPP)
-
- JRCPP LANGUAGE REFERENCE MANUAL
-
- INTRODUCTION
-
- This document, in the company of the "ANSI Programming Language C"
- Standard, is intended to act as a language reference manual. Most
- significantly, this document discusses the performance of JRCPP in
- official "ANSI undefined", "ANSI unspecified" and "ANSI
- implementation defined" domains of the C Language. In addition, it
- lists performance limitations of JRCPP, and directly relates these
- limitations to the standard's requirements for "Implementation
- limits".
-
- As an additional matter, this document identifies vaguenesses (and in
- the rare case, errors) in the ANSI C Standard, and describes the
- resolution adopted by JRCPP. Hence this document is also the
- "Rationale" for JRCPP, in much the same way the the ANSI C standard
- has an accompanying document "Rational For ANSI Programming Language
- C". This document will generally not discuss aspects of the standard
- that do not involve preprocessing activities performed on source
- files.
-
- Note that this document was written based on the Draft Proposed ANSI
- C Standard, X3J11/88-158, dated December 8, 1988.  After a drawn out
- appeals process, I believe this draft was accepted in January 1990 by
- the ANSI Standards Committee. I am not aware that any changes were
- made during that appeals process, and I apologize in advance for any
- errors I might have made in this regard, or in the description that
- follows.
-
- In all cases where this Language Reference Manual deviates from the
- ANSI C Standard, this document should be assumed to be in error, and
- the corresponding bug/misperformance in the JRCPP program (if any)
- should be reported.  The ANSI C Standard is a tremendous work, and I
- realize that my abridged commentary in many areas does not do justice
- to the meticulous selection of elaborate wording in the official
- Standard. For many, my description will be enough, but for language
- lawyers, there is no replacement for the official ANSI document.
-
- Section numbers in this document have been chosen to match those of
- the ANSI C standard, and hence certain gaps are present. These gaps
- represent areas where either there is generally no impact on
- preprocessing activities, or no additional commentary seems necessary.
-
- LISTING OF SECTIONS
-
- 1.3 References
- 1.6 Definition of Terms
- 2. ENVIRONMENT
- 2.1.1.2 ENVIRONMENT- Translation phases
- 2.1.1.3 ENVIRONMENT- Diagnostics
- 2.2 ENVIRONMENTAL CONSIDERATION
- 2.2.1 ENVIRONMENTAL CONSIDERATION- Character sets
- 2.2.1.1  ENVIRONMENTAL CONSIDERATION- Trigraph Sequences
- 2.2.1.3 ENVIRONMENTAL CONSIDERATION- Character sets- Multibyte characters
- 2.2.4 ENVIRONMENTAL CONSIDERATION- Translation Limits
- 3.1 LANGUAGE- LEXICAL ELEMENTS
- 3.1.2 LANGUAGE- LEXICAL ELEMENTS- Identifiers
- 3.1.3.3 LANGUAGE- LEXICAL ELEMENTS- Character constants
- 3.1.4 LANGUAGE- LEXICAL ELEMENTS- String literals
- 3.1.7 LANGUAGE- LEXICAL ELEMENTS- Header names
- 3.1.8 LANGUAGE- LEXICAL ELEMENTS- Preprocessing numbers
- 3.8 LANGUAGE- PREPROCESSING DIRECTIVES
- 3.8.1 LANGUAGE- PREPROCESSING DIRECTIVES- Conditional inclusion
- 3.8.2 LANGUAGE- PREPROCESSING DIRECTIVES- Source file inclusion
- 3.8.3 LANGUAGE- PREPROCESSING DIRECTIVES- Macro replacement
- 3.8.3.2 LANGUAGE- PREPROCESSING DIRECTIVES- The # operator
- 3.8.3.3 LANGUAGE- PREPROCESSING DIRECTIVES- The ## operator
- 3.8.3.5 LANGUAGE- PREPROCESSING DIRECTIVES- Scope of macro definitions
- 3.8.4 LANGUAGE- PREPROCESSING DIRECTIVES- Line control
- 3.8.6 LANGUAGE- PREPROCESSING DIRECTIVES- Pragma directive
- 3.8.8 LANGUAGE- PREPROCESSING DIRECTIVES- Predefined macro names
-
-
- 1.3 References
-
- In addition to the 6 references listed in the standard (the most
- significant of which is probably "The C Reference Manual", by
- Kernighan and Ritchie), an additional reference set should be
- considered. Since JRCPP is intended to support many dialects of C,
- as well as C++, references for C++ are:
-
- "The C++ Programming Language", by Bjarne Stroustrup, Addison-Wessley
- (1986), Copyright Bell Telephone Laboratories Inc.
-
- "The Annotated C++ Reference Manual" by Margaret A. Ellis and Bjarne
- Stroustrup, Addison-Wesley (to be published).
-
-
- 1.6 Definition of Terms
-
- Among the 17 terms defined in this section (such as "bit", "byte",
- "argument", "parameter"...), which are certainly crucial to a reference
- manual, there are also several terms which identify the focus of this
- manual.  The definitions are for the phrases "Unspecified behavior",
- "Undefined behavior", and "Implementation defined behavior". The
- following are my interpretations of these definitions:
-
- "Unspecified behavior": Although the source code is considered
- correct, the standard has no requirements on any implementation. An
- example of this is the precedence for the paste (##) and stringize
- operators. Notice that it is not even required that an implementation
- be CONSISTENT in its handling of this issue!
-
- "Undefined behavior" the relevant source construct is not portable
- ANSI C. As a result, the implementation can accept or reject the
- construct, at any point in time from preprocessing and compilation,
- through bound execution. Fundamentally such behavior is used to
- clearly identify non-portable source constructs.
-
- "Implementation defined behavior": The relevant source code is
- considered correct, and each implementation is responsible for
- defining the behavior of that construct. An example of this is the
- number of significant characters in identifier names (above and
- beyond what is required in a minimally conforming ANSI C implementation).
-
- The above definitions will be referred to regularly during the
- commentary on JRCPP, and its support for the ANSI C standard.
-
-
- 2. ENVIRONMENT
-
- 2.1.1.2 ENVIRONMENT- Translation phases
-
- This section describes the actual phases of translation of C source
- code. The phases also serve to delineate the points between
- preprocessing, and compilation. The phases may be summarized as
- follows:
-
- Phase 1) The characters in the source are translated into those of
- the "source character set". During this process, JRCPP translates 8
- bit characters into 7 bit characters, by ignoring the high order bit,
- and by translating the source file characters 0 and 128 into simple
- spaces (ASCII 32). This phase also includes identification of line
- delimiters, and the removal of trigraphs.  On a DOS/OS2 platform,
- JRCPP identifies the two source file characters
- <carriage-return><line-feed> as the terminator for each line, which
- is henceforth referred to as <newline>. The ANSI Standard also
- requires that complete trigraph removal be performed in this phase,
- and JRCPP fully supports this. Note that all diagnostics issued by
- JRCPP are based upon line counts generated in this phase, and hence
- most editors can be used to move to the line identified in a
- diagnostic.
-
- Phase 2) All occurrences of a backslash followed immediately by a
- <newline> are removed. This removal "splices" together the
- consecutive lines that were only separated by this "escaped out
- newline". This process may also be seen as combining several
- physical lines, as viewed by an editor, into a long logical line.
- This activity is most useful for programmers that wish to have many
- characters on a single source line, in order to, for example, make
- them part of a single preprocessing directive. The ANSI Standard also
- requires that every non-empty source file end in a newline that is
- not escaped by a backslash (JRCPP diagnoses these conditions).
-
- Notice that phase 1 is complete before phase 2 is started. Hence the
- removal of an escaped newline CANNOT create a trigraph that is
- eligible for translation.
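-
- As a minimal illustration of this ordering (a hypothetical fragment,
- not drawn from any real source), consider two physical lines in which
- an escaped newline separates the characters of a would-be trigraph:
-
-      ?\
-      ?=define SPLIT 1
-
- After phase 2 splicing, the logical line reads `??=define SPLIT 1',
- but since trigraph replacement (phase 1) has already finished, the
- `??=' is NOT converted to `#', and no macro named SPLIT is defined.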
-
- Phase 3) This phase of translation is referred to as "tokenization".
- In this phase, sequences of characters are gathered together for
- processing as whole units (tokens). This phase also defines comments
- to be interpreted as equivalent to a single space character. The
- standard allows implementations to consider consecutive (non-newline)
- whitespace (space, tab, form feed, lone carriage return) as
- equivalent to single spaces. The ANSI Standard also specifies that a
- source file cannot end in either a partial (unterminated) comment, or
- in a partial preprocessing token. (JRCPP diagnoses an unterminated
- comment at the end of a file).
-
- Note that since comments and tokens are removed at the same time
- (i.e.: via a single left to right scan for the largest possible
- lexical group), there is some contention between "otherwise
- overlapping" string literals, character constants, and comments.
- This contention is always resolved by accepting the largest possible
- token (or comment) before allowing a new token to begin. For example,
- the following is a pair of comments surrounding an identifier:
-
- /* ignore " inside comment*/ identifier /* still ignore " in comment*/
-
- Hence we see that not only don't comments nest, but string literals
- do not form tokens within comments (and hence cannot "hide" the
- comment terminator). Similarly, the following is a pair of
- consecutive string literals:
-
- "comments begin /* outside" " and end with the */ sequence"
-
- This example shows that comments are not scanned for within string
- literals (and hence cannot "hide" the terminal close quote). Finally,
- the following is the sum of two character constants:
-
- '"' + '"'
-
- which demonstrates that character constants, likewise, are not
- scanned internally for any other extended sequences (such as comments
- or string literals).
-
- The standard does have the confusing phrase: "a source file shall not
- end in a partial preprocessing token", as part of its description of
- this phase. Recall that phase 2 ensured that the file ended in a
- newline, which "terminates" any preprocessing token!  It
- appears impossible to have a "non-terminated preprocessing token" at
- the end of a file. There is a CHANCE that the standard meant to say
- "shall not terminate in a partial preprocessing #if directive group",
- but this would not make sense as such items are not identified until
- later phases. Finally there is the possibility that this requirement
- was installed in the Standard before the agreement was reached that a
- file should not end in an escaped newline (re: phase 2 requirement),
- and then (accidentally) never taken out. We assume this latter
- interpretation is correct, and we ignore the constraint on "partial
- preprocessing token at end of file".
-
- JRCPP adopts the aforementioned policy allowing sequences of
- non-newline whitespaces to be equivalent to a single space, and
- compacts comments and whitespace into single spaces.  It is critical
- to note that, because comments are NOT removed prior to this phase, a
- program cannot "comment out" trigraph sequences, or any activity
- performed in the earlier phases.  In addition, the
- fact that comments are removed in this phase means that constructs
- that "look like comments" in later phases (e.g.: after macro
- expansion activity) are not regarded as comments.
-
- Finally, the fact that comments are translated into single space
- characters includes the case where the comment contains a newline!
- This specifically means that preprocessing directives (discussed in
- next phase) are not terminated at the end of the line, if the newline
- marking that point is within a comment. The implication of this
- should be clear to programmers who had previously used a macro
- definition of the following form on some non-ANSI compiler:
-
- #define start_comment /*
- code /* comment_text */ more_code
-
-
- The above lines do not define "start_comment" (as understood by later
- phases) to be the sequence "/*".  In the above sample, the "/*" and
- all characters up until the comment terminator "*/" are compacted
- into a single space.  Since the next comment terminator occurs on the
- next line of the example, the above code is equivalent to:
-
- #define start_comment more_code
-
- On the brighter side, JRCPP would have issued a warning about the
- above sequence as having a "/*" within a comment.
-
-
- Phase 4) In phase 4, the tokens are parsed (grouped together) to form
- preprocessing directives and source code. This activity includes
- establishing and maintaining a database of macros (re: #define and
- #undef), conditionally including sections of source code (re: #if,
- #ifdef, #ifndef, #elif, #else, #endif), inserting additional files
- (re: #include), providing user supplied error messages (#error), and
- servicing implementation defined directives (#pragma). Note that
- when a #include directive is processed, the phases 1-4 are all
- applied to the source file as it is inserted.
-
- Phase 5-8 of the ANSI Standard relate to phases of processing that I
- would refer to as compilation and linking. It is conceivable that
- the concatenation of adjacent string literals (phase 6) should be
- considered as part of the preprocessing effort, but they have NOT
- been included in JRCPP for two reasons:
-
- Reason 1) If a series of large string literals were concatenated,
- then there is a good chance that the result would be too large for
- many lexical analysers (re: the first scanning phase of a compiler)
- to handle. I would prefer to produce code that is acceptable to a
- larger range of compilers.
-
- Reason 2) Hexadecimal escape sequences have no termination mark.
- Hence the concatenation of two string literals may be MUCH more
- complex than concatenating the "sections between the quotes".
- (Example: "\x1" "b" is NOT the same as "\x1b". Specifically, in most
- in DOS environment, "\x1" "b" is the same as "\001" "\097" or
- equivalently "\001\097", whereas "\x1b" is the same as "\033".) Since
- hex escape sequences have no terminator, an example such as what was
- just given MUST be translated into a series of octal escape sequences
- (at least the trailing hex sequence in the first literal must).
- Unfortunately, the translation of long hex escape sequences with
- "equivalent" octal escape sequences, would introduce an area of
- platform dependency that is probably best avoided in a portable
- preprocessor.
-
-
- 2.1.1.3 ENVIRONMENT- Diagnostics
-
- The standard requires that at least one diagnostic message be emitted
- for every "violation of any syntax rule or constraint". JRCPP
- attempts to support this, with the caveat that parsing of the C
- language output is NOT performed, and the associated error checking
- is therefore not provided.  As an interesting example of this support,
- the user
- should note that special JRCPP features (re: #pragma
- diagnostic_adjust) allow arbitrary diagnostic messages to be
- "silenced". In order to support the above ANSI C requirement, the
- first such adjustment of a diagnostic severity level CAUSES a
- diagnostic to be issued.  Hence at least that diagnostic notification
- is present no matter what user customizations are applied to
- diagnostics.
-
-
- 2.2  ENVIRONMENTAL CONSIDERATION
-
- 2.2.1 ENVIRONMENTAL CONSIDERATION- Character sets
-
- The character set supported by JRCPP includes the full range of
- characters that are required by the standard. In addition, the ASCII
- characters in the range 129 to 255 are interpreted as though their
- high bit of an eight bit byte was 0 (i.e., mapped into values 1
- though 127), and the ASCII values 0 and 128 are treated as spaces
- (ASCII 32).
-
- 2.2.1.1  ENVIRONMENTAL CONSIDERATION- Trigraph Sequences
-
- All the standard trigraph sequences are supported. These include:
-
- Trigraph Equivalent Character
-
- ??= #
- ??( [
- ??/ \
- ??) ]
- ??' ^
- ??< {
- ??! |
- ??> }
- ??- ~
-
- Here again it is significant to recall that trigraph sequences are
- replaced in the very first phase of translation. Hence the following
- is not a comment:
-
- ??/* test*/
-
- as it is equivalent to:
-
- \* test*/
-
- In a similar vein, the following obscure code has surprising meaning:
-
- "/control??/" /* continue till "/* real comment */
-
- as it is equivalent to:
-
- "/control\" /* continue till "/* real comment */
-
- which is later tokenized as the single long string literal:
-
- "/control\" /* continue till "
-
-
-
-
- 2.2.1.3 ENVIRONMENTAL CONSIDERATION- Character sets- Multibyte
- characters
-
- Other than support for the single byte characters required in the
- standard, multibyte characters are not specified or supported in
- JRCPP. If multibyte characters are encountered, they are passed along
- blindly, but they cannot be evaluated in any meaningful way within a
- #if/#elif expression (a diagnostic is produced if such an attempt is
- made).
- This stance is in keeping with the requirements of the standard.
-
- 2.2.4 ENVIRONMENTAL CONSIDERATION- Translation Limits
-
- This section of the standard requires that at least one program exist
- that satisfies all of the limits, and can be translated. The
- following are the limits that relate to a preprocessor, and the
- details of how to construct a program exercising those limits. Note
- that the limits are very easy for JRCPP to handle, and no true
- "cunning" is required to generate a required test program.
-
-
- 8 nesting levels of conditional inclusion
-
- Currently about 38+ (see Appendix B of USERS MANUAL) levels are
- supported, but this static limit may be removed in future
- releases.
-
- 32 nesting levels of parenthesized expressions within a full
- expression
-
- The parsing stack for evaluation of preprocessor expressions is
- currently set at 150 (see Appendix B of USERS MANUAL) levels. At a
- minimum, it would require 150 non-white tokens (not just characters,
- but whole tokens) on a preprocessing #if line to cause a parser stack
- overflow.  Lines with fewer than 150 tokens cannot cause an overflow,
- but the absolute limit on parenthesis nesting depends upon the
- number of additional operators, along with their precedence and
- placement.  Any demonstration program that has expressions
- that "tower to the left", such as:
- "((((((...(((5+4)*3-3)+.../7)|6-8)+1", and has fewer than about
- 130 nested parentheses, should also be acceptable to JRCPP.
-
- 31 significant initial characters in an identifier name
-
- All identifiers are considered significant in all their
- characters, which may extend well beyond 31 characters. See
- Appendix B of USERS MANUAL for actual restrictions on the
- absolute length of identifiers.
-
- 511 External identifier names in one translation unit
-
- JRCPP has no static limit on the number of distinct identifiers
- of any type.
-
- 1024 Macro identifiers simultaneously defined in one translation unit
-
- JRCPP has no static limit on the number of macros defined.
-
- 31 Parameters in one macro definition
-
- JRCPP has no static limit on the number of parameters for a
- function like macro.
-
- 31 Arguments in one macro invocation
-
- JRCPP has no static limit on the number of arguments supplied to
- a macro invocation.
-
- 509 Characters in a logical source line
-
- JRCPP has no static limit on the number of characters in a source
- line. There is a limit on the number of characters in a single
- token, but there is no limit on the number of tokens on a line.
- (see Appendix B of USERS MANUAL).
-
- 509 Characters in a string literal (after concatenation)
-
- Since JRCPP does not currently perform string concatenation, this
- limit does not generally apply. The limit on the length of a
- single token applies to individual string literals (see Appendix
- B of USERS MANUAL).
-
- 8 levels of nested #include
-
- JRCPP has no static limit on the number of nested include files.
- To support this (no limit) stance, JRCPP does require that at
- least 2 file handles be made available to it, in addition to the
- standard set of stdin, stdout, stderr. See your operating system
- manual for details. Note that there is a limit on the depth of
- nested file inclusion when the original source file is actually
- stdin. This limit is based on the operating system restriction on
- the number of files that may be open at one time. This odd
- limitation may be removed in future versions, but typical DOS
- configurations would only limit nested inclusion at about 16
- levels.
-
-
- 3.1 LANGUAGE- LEXICAL ELEMENTS
-
- This section defines exactly how to interpret a series of characters
- as tokens. The one point of undefined behavior in this section
- concerns the presence of unmatched single (') or double (") quotes
- appearing on a logical line. JRCPP makes an effort to not abandon
- compilation when it encounters errors, and its behavior in this area
- is typical of such resolutions.
-
- In the case of an unmatched single quote ('), JRCPP assumes that the
- programmer forgot the quote, but assumes that only a single character
- "character constant" was intended. Hence for the purposes of error
- recovery, the single quote and at most one following c-char (which
- includes single characters, and a select set of escape sequences, but
- excludes newlines) are accepted as a character constant.  This
- construction of an erroneous token is performed despite the fact that,
- without the terminal quote, the spelling of the token is invalid.
-
- In the case of an unmatched double quote ("), JRCPP also assumes that
- the programmer forgot the quote. In the case of string literals, it
- is assumed that most literals are fairly long.  For the purposes
- of error recovery, JRCPP assumes that the original quote, along with
- the longest possible sequence of s-chars (a class of characters that
- includes single characters, and a select set of escape sequences, but
- excludes newlines) formed the string literal.
-
- Note that in both cases diagnostics are generated that will, by
- default, prevent any preprocessed output from being generated. The
- default settings of these diagnostics can however be overridden for
- the purposes of generating some output.
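-
- The following purely hypothetical fragments illustrate the recovery
- strategy just described:
-
-      total = 'a + b;
-      message = "hello, world;
-
- In the first line, JRCPP recovers by taking 'a as the (erroneous)
- character constant, and the remaining tokens `+ b ;' are scanned
- normally.  In the second line, the string literal is assumed to run
- from the opening quote to the end of the line.  In both cases the
- diagnostics described above are still generated.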
-
-
- 3.1.2 LANGUAGE- LEXICAL ELEMENTS- Identifiers
-
- JRCPP supports the standard definition of identifiers, consisting of
- a leading alphabetic character (or an underscore), and continuing
- with an arbitrary sequence of alphanumeric characters and underscores.
-
- As an extension, JRCPP also supports the presence of the character
- '$' at any position (including first character) of an identifier, but
- it flags such usage as an error. Here again JRCPP can be seen to
- comply with the ANSI requirements for diagnosing
- nonportable/nonstandard constructs, while still allowing the user the
- opportunity to ignore the error, and facilitate a porting operation
- (note that the default diagnostic level of such an error is
- sufficient to preclude output, but this level may be modified via the
- #pragma adjust_diagnostic ... directive). This extension does not in
- any way conflict with the ANSI standard, as a '$' character, outside
- of a string literal or character constant token, is usually illegal
- anyway. Hence incorporating it into an identifier does not preclude
- any valid constructs.
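-
- A hypothetical example of this extension (the name itself is
- illustrative only):
-
-      #define VMS$STATUS 1
-
- JRCPP accepts VMS$STATUS as a macro name, but issues a diagnostic
- whose default severity precludes output; lowering that severity (as
- described above) lets the definition be used during a port.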
-
- In certain obscure cases, an ANSI conformant program might have a '$'
- character provided outside of a string literal, or character
- constant. This placement is only potentially legal if the '$' is
- formed into part of a valid token by the end of the preprocessing
- phases. If this obscure case is actually significant to a user,
- modification of diagnostic levels can permit this construct. If I am
- pressed by registered users, I may modify the performance of the
- preprocessor to more naturally support such obscure ANSI C conformant
- cases.
-
- This section of the Standard also discusses the significance of
- characters in an identifier name. Specifically, it requires that all
- of the first 31 characters in a macro name be considered when
- comparing names and invocations. In order to support the many
- existing implementations, the standard leaves as "undefined behavior"
- whether identifiers that differ ONLY beyond the 31st character.
- JRCPP resolves this simply by treating all characters in any
- identifier name as significant. This may identify as errors some
- typos that other compilers overlook, but this only tends to make the
- code more robust in terms of portability.
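-
- As a hypothetical illustration, the following two macro names agree
- in their first 32 characters and differ only beyond that point:
-
-      #define device_driver_interrupt_handler_slow 1
-      #define device_driver_interrupt_handler_fast 2
-
- JRCPP treats these as two distinct macros; an implementation honoring
- only the minimum 31 significant characters might (as undefined
- behavior) treat them as the same name.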
-
-
-
- 3.1.3.3 LANGUAGE- LEXICAL ELEMENTS- Character constants
-
- In the discussion of character constants by the ANSI Standard, it is
- mentioned that when undefined escape sequences are encountered in a
- character constant, the results are undefined.
-
- Note that the defined escape sequences for use within character
- constants include:
-
- '\\' (backslash),
- '\'' (single quote),
- '\"' (double quote),
- '\?' (question mark),
- '\a' (alarm or bell),
- '\b' (backspace),
- '\f' (form feed),
- '\n' (newline),
- '\r' (carriage return),
- '\t' (tab),
- '\v' (vertical tab),
- octal escape sequences with 1-3 octal digits, and
- hexadecimal escape sequences with arbitrarily many hex digits.
-
- Examples of the latter two types are: '\27', and '\xab10cd'.
-
- When JRCPP finds an invalid escape sequence within a character
- constant (and there is a trailing quote found later on that line), a
- diagnostic is produced, but the character sequence is accepted as a
- character constant. The severity level of the diagnostic is
- sufficient to prevent the preprocessor from producing output, but the
- level may be varied by the user if acceptance of such sequences is
- considered reasonable for the user's target compiler.
-
-
- 3.1.4 LANGUAGE- LEXICAL ELEMENTS- String literals
-
- The undefined behavior in string literals is also centered on the
- presence of illegal escape sequences within the literal. In an
- analogous fashion to the handling of character constants, the
- presence of illegal escape sequences generates a diagnostic, but
- (as error recovery) the sequence is accepted.  The default severity of
- the diagnostic is high enough that no output will be produced by the
- preprocessor unless the level is adjusted downward.
-
- 3.1.7 LANGUAGE- LEXICAL ELEMENTS- Header names
-
- This section of the Standard discusses the lexical form of file names
- that are used in #include directives. The undefined behavior in this
- area involves the presence of the characters ', \, ", or /* within
- the <....> form of an include directive, and the presence of ', \, or
- /* within the "...." form of the directive. Since the original
- platform for JRCPP was DOS/OS2, defining the behavior of such
- sequences is quite important (DOS and OS2 file systems use '\' as a
- separator in path names, in the same way as UNIX systems use '/' as a
- separator). In order to support the use of standard DOS/OS2 path
- names, a header name is considered a special and distinct token.
-
- JRCPP defines a "...." style header name to begin with a double
- quote, and continue until a matching double quote is encountered,
- without passing the newline. Note that escape sequences are NOT
- honored during the scanning of this token, and hence backslash
- characters represent themselves directly (and the final quote CANNOT
- be escaped using a backslash). In addition, since this is a single
- token, the presence of /* within it is of no consequence. The only
- context in which a "...." style header name is permitted by JRCPP
- (and hence scanned for), is as the first non-whitespace token,
- following the keyword "include", on a #include directive line. Note
- that comments are considered whitespace, and may precede the "...."
- style header name. The following are examples of entirely legal
- include directives:
-
- #include /* comment */ "sys\header.h"
- #include "weird/*b.h"
- #include "any char is legal !@' wow \"
-
- Note however that the operating system will more than likely be
- unable to find such files!  The mapping from include directives into
- actual file names involves replacing each occurrence of a '\' or '/'
- with the appropriate path name separator, and requesting that file be
- opened. Consider for example the following:
-
- #define stringize(x) #x
- #include stringize( \sys\header )
-
- Since the macro will expand its argument to "\\sys\\header" (details
- of stringization are defined in section 3.8.3.2), the file will be
- searched for using four backslashes! This mapping into a file name
- is independent of whether the file name was provided in the "...."
- style, or it was a string literal generated by some preprocessing
- replacement.
-
- In an identical fashion, a <....> style header name is defined by
- JRCPP to begin with a '<' character, and to not terminate until the
- first '>' character is reached, without extending past the newline.
- The context for scanning for this token is identical to that of the
- "...." style header name. As with the "...." style header names,
- there are NO special characters (i.e.: escape sequences) interpreted
- during the scanning for such a token. The following are legal
- #include directives, and demonstrate this:
-
- #include <system.h\header.h/*this_too>
- #include /*comment*/ /*comment*/ <any char ' or " or even \>
- #/*comment*/ include /*comment
- comment continues*/ < spaces even count >
-
- The last example also demonstrates that comments are reduced to a
- single space, and hence do not disrupt the context of the scan as
- defined.
-
- Note that all characters between the delimiters < and > (or between
- the double quotes in the "...." style) are interpreted as being part
- of the file name.
-
- In the interest of portability, it is suggested that the user refrain
- from using the standard '\' path delimiter in a DOS/OS2 environment,
- and instead make use of the equivalent character '/'.
-
-
- 3.1.8 LANGUAGE- LEXICAL ELEMENTS- Preprocessing numbers
-
- The lexical token "preprocessing number" appears to have been placed
- into the standard to allow for arbitrary substrings of valid numbers
- to be manipulated conveniently. The need for such a token is perhaps
- motivated by the requirement (given elsewhere in the standard) that
- if the result of some preprocessing operation (such as token pasting
- via ##) is not a valid preprocessing token, then the resulting
- behavior is undefined. With that requirement "on the books", it then
- follows that substrings of numbers should be considered valid. For
- example, the substring `3.14' could be pasted onto `0e4', to yield
- the result `3.140e4'. In order to make life as easy as possible for
- the implementers, the standard is VERY broad in its allowance of what
- is a valid preprocessing number. For example, the sequence
- `1.2aZ4E-_6.7.2_3' is a valid preprocessing number. As per the
- standard, this token is supported, in the full generality that it is
- specified.
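-
- A hypothetical function like macro (the name GLUE is not part of
- JRCPP or the standard library) shows the pasting just described:
-
-      #define GLUE(a,b) a ## b
-
-      GLUE(3.14, 0e4)   /* yields the single preprocessing number 3.140e4 */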
-
- The user should also be warned of the fact that a preprocessing
- number is a SINGLE TOKEN, and hence is not scanned internally for the
- presence of macro names. For example, when the above example
- `1.2aZ4E-_6.7.2_3' is present in a file, the preprocessor will NOT
- consider the macro aZ4E for expansion, even if it is defined! The
- point being made here is that when numbers are placed adjacent to
- letters in the source file, they will typically be blended together
- into a single token, and the letters will not be eligible for macro
- substitution. Similarly, even though the `.' operator may be
- overloaded in C++, if it is placed to the right of and adjacent to
- ANY number sequence, it will be absorbed as part of that token!
-
-
- 3.8 LANGUAGE- PREPROCESSING DIRECTIVES
-
- The descriptions given in this section cover all aspects of
- preprocessor directives. I will in general paraphrase some of the
- significant areas that I consider non-intuitive. An interested
- reader should certainly consider examining the actual standard for
- any additional details.
-
- One notable item in the overview section is that tokens within
- directives are generally NOT subject to macro expansion, unless
- otherwise noted. Hence logical lines of text are categorized as
- directive or non-directive lines BEFORE macro expansion of such lines
- takes place.  In addition, as pointed out in a later section, if macro
- expansion produces something that "resembles" a directive line, it is
- NOT processed as a preprocessing directive. The following is a
- summary of the actions of various directives with regards to
- expansion of the tokens that follow the "# directive_name":
-
- Directives for which tokens, on the line with them, are expanded:
-
- #if
- #elif
- #include
- #line
-
- Directives for which tokens, on the line with them, are not expanded:
-
- #ifdef
- #ifndef
- #define
- #undef
- #error
- #pragma
-
- Directives which cannot legally have other tokens on the line with
- them:
-
- #else
- #endif
- #
-
- Note that the #if and #elif have some additional translation that is
- performed on their tokens both BEFORE and AFTER macro expansion. The
- #include and #line directives are only expanded when a standard form
- of arguments is not present. In the case of a #include directive
- that does require expansion of the tokens, the post-expansion tokens
- are processed (concatenated) after expansion and rescanned for a
- standard format. The fact that additional tokens are not allowed
- following the null directive (the lone #) is significant in that any
- other lines that begin with # are strictly illegal.
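-
- Returning to the point made above, that macro expansion which merely
- "resembles" a directive is not processed as one, a minimal
- illustration (the macro name EMPTY is hypothetical):
-
-      #define EMPTY
-      EMPTY   #   include <stdio.h>
-
- The second line is categorized as an ordinary text line BEFORE macro
- expansion, so after expansion it simply contributes the tokens
- `# include <stdio.h>' to the output; no file is included.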
-
-
- 3.8.1 LANGUAGE- PREPROCESSING DIRECTIVES- Conditional inclusion
-
- Conditional inclusion refers to the use of the #if directive (along
- with its various forms and grouping directives) to cause a section of
- code
- to be optionally included or excluded from the preprocessed result.
- Fundamentally, there are three ways to start an if directive group
- (#if, #ifndef, #ifdef), two ways to continue a group (#elif, #else),
- and one directive to mark the end of the group (#endif). Since such
- if groups can nest (i.e., contain inner groups), we will start with
- the description of an outermost conditional group, and then discuss
- the ramifications on inner groups. We will also defer discussion of
- #ifdef and #ifndef, as their definition follows directly from the
- definition of #if.
-
-
- Subsection A) Evaluation Of #if Expressions
-
- The first point to address is how the tokens on a line with a #if are
- evaluated, and what their resulting "value" signifies. To make the
- discussion clearer, we will assume the following example
-
- #define hi 5
- #if (1u == defined(hello)) || (3 < hi + low + int)
- ...
-
- The process of evaluating the tokens on a line with a directive
- consists of 6 phases:
-
- 1) remove all occurrences of "defined identifier" and "defined (
- identifier )", and replace them with either 0 or 1 (1 iff
- the identifier is currently defined as a macro)
-
- 2) macro expand the line that resulted from phase 1 (using
- standard rules described in section 3.8.3)
-
- 3) replace all identifiers and keywords in the result of phase 2
- with the number 0 (this result is a list of constants and
- operators)
-
- 4) convert all constants of type "int" to identical constants of
- type "long", and constants of type "unsigned int" to
- "unsigned long".
-
- 5) evaluate the expression produced by phase 4 according to
- standard C expressions methods, but always use "long"
- types for subexpressions that evaluate to an "int", and
- "unsigned long" for expressions that evaluate to an
- "unsigned int" (the final result is an integral constant
- of type "long" or "unsigned long")
-
- 6) if the final result of phase 5 is equal to 0, then the
- expression was false, otherwise it is true.
-
-
- The process can be demonstrated on the example given, with the
- following evaluations. After phase 1 removal of "defined":
-
- #define hi 5
- #if (1u == 0) || (3 < hi + low + int)
- ...
-
- After phase 2 macro expansion
-
- #define hi 5
- #if (1u == 0) || (3 < 5 + low + int)
- ...
-
- After phase 3 replacement of identifiers and keywords with 0:
-
- #define hi 5
- #if (1u == 0) || (3 < 5 + 0 + 0)
- ...
-
- After phase 4 conversion of "int"s to "long"s:
-
- #if (1uL == 0L) || (3L < 5L + 0L + 0L)
- ...
-
- The phase 5 evaluation might proceed something like:
-
- (1uL == 0L) || (3L < 5L + 0L + 0L)
- (    0L    ) || (       1L        )
- 1L
-
- Finally, in phase 6, the above constant can be seen to be non-zero,
- and hence the result of the evaluation is true.
-
- The above rules work for the most part, as expected by "almost"
- everyone, but the following details and anomalies are worth noting.
-
- Note that in phase 1 an attempt is made to remove all occurrences of
- the operator "defined". If this operator is not applied to an
- identifier, the Standard indicates that the results are undefined.
- JRCPP considers this scenario to be a syntax error, and aborts
- evaluation of the expression. As a means of error recovery, JRCPP
- assumes an evaluated result of FALSE, and a diagnostic is generated.
-
- A second point of ANSI C undefined behavior takes place when the
- result of macro expansion produces the operator "defined". For
- simplicity and portability of code, JRCPP disregards the presence of
- such an operator as the result of macro expansion, and follows
- exactly the multiphase algorithm supplied above. Hence occurrences
- of the keyword "defined" that are produced by macro expansion are
- replaced in phase 3 with a value of 0.
-
- During the expansion of phase 4, some ANSI C undefined behavior
- exists with regard to evaluating character constants. The two points
- here that must be resolved are how to evaluate multicharacter
- character constants (which contain more than one item, which is
- either an escape sequence or a simple character), and whether
- character constants may assume negative values. As mentioned in
- earlier sections, multicharacter character constants represent a
- major area of non-portability, and hence they are not effectively
- supported in #if expression evaluation. Specifically, if a
- multicharacter character constant (such as 'zq') appears in an
- expression, it is truncated to a single character constant, keeping
- only the leftmost character (or escape sequence) and of course a
- diagnostic is generated. Character constants under JRCPP evaluate as
- "signed int", which in a DOS 8088-80x86 environment is taken to be a
- 16 bit signed integer. Single character character constants always
- evaluate as positive numbers. Octal character constants are all
- considered positive (i.e., '\000' through '\777'), but hexadecimal
- character constants may evaluate to a negative number. Specifically,
- if the high order bit of a hexadecimal character constant (when
- viewed as a "signed int") is set (i.e., '\x8000' through '\xffff') on
- a 16 bit signed int architecture), then the number is negative, using
- a two's complement representation. Additionally, if a hexadecimal
- escape sequence exceeds the representational precision or range of a
- character constant (e.g., "signed int" under JRCPP, which corresponds
- to 16 bits under a DOS environment), then the high order bits are
- discarded.
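-
- For instance (hypothetical tests, assuming the 16 bit representation
- described above):
-
-      #if '\x8001' < 0    /* true: 0x8001 is -32767 as a 16 bit signed int */
-      #endif
-      #if '\377' < 0      /* false: octal character constants are positive */
-      #endif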
-
- There are several subtleties involving preprocessor "#if" expression
- evaluation. The first item to observe is that the expression must be
- formed using only integral constant subexpressions (i.e., no floating
- point; no pointers); casts may not be used; and the 'sizeof' operator
- is not evaluated (in fact, 'sizeof' is replaced in phase 3 by the
- value 0). As per allowance by the standard, there is no guarantee
- that character constants will be evaluated identically in the
- preprocessor as they are in the compiler (since JRCPP is external and
- unknown to your compiler, this is all the more important to be aware
- of).
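-
- A hypothetical example of the 'sizeof' restriction:
-
-      #if sizeof(long) > 2
-      #endif
-
- After phase 3 the expression reads roughly `0 ( 0 ) > 2', which is not
- a valid constant expression and cannot answer the question; size tests
- of this kind must be left to the compiler proper.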
-
- One slightly quirky aspect of the evaluation of #if centers around
- the consistent use of "long" types to replace "int" types. The
- following demonstrates this "quirk", and is commonly a thorn (bug?)
- in the side of many "would be ANSI compatible" preprocessors:
-
- #if 1 > (0 ? 1u : -1)
-
- The tricky aspect of evaluation of this example involves the value of
- the ternary "?:" subexpression AFTER the transition to "long" types
- is made.  The subexpression looks like "(0L ? 1uL : -1L)".  Note that
- the type associated with this ternary must be the "larger" of the
- types "unsigned long" (from 1uL) and "signed long" (from -1L). Hence,
- according to ANSI, the result of the ternary expression must be the
- "unsigned long" representation of "-1", which is actually the largest
- possible "unsigned long". So we can see that the above expression
- ends up evaluating to FALSE! The moral of the story for programmers
- is to exercise care when working with negative numbers in the
- preprocessor #if statements.
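-
- By contrast (a hypothetical variation on the same test), keeping every
- operand signed avoids the surprise:
-
-      #if 1 > (0 ? 1 : -1)
-      #endif
-
- Here the ternary subexpression becomes (0L ? 1L : -1L), whose value is
- -1L, and 1L > -1L is TRUE.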
-
-
- Subsection B: Other conditional inclusion directives
-
- As mentioned in the standard, the lines:
-
- #ifdef any_identifier
- and
- #ifndef any_identifier
-
- are equivalent, respectively, to:
-
- #if defined any_identifier
- and
- #if ! defined any_identifier
-
- Each basic conditional inclusion section of source code consists of a
- "if group", followed by any number of "elif groups", followed
- optionally by an "else group", and terminated by a line with a #endif
- directive. An "if group" consists of a #if directive (or
- equivalent), followed optionally by lines of code. An "elif group"
- consists of a #elif directive (which has an expression to evaluate),
- followed optionally by lines of code. An "else group", consists of a
- #else directive, followed optionally by lines of code. Note that for
- each #if directive, there MUST be a matching #endif directive that
- follows it, and that these conditional inclusion sections do nest
- within a single "group" of code (i.e., within a single "if group", or
- within a single "elif group", or within a single "else group").
-
-
-
- The semantics (meaning) of these directives is most simply given by
- an example:
-
- #if expression_1
- block 1
- #elif expression_2
- block 2
- #elif expression_3
- block 3
- #else
- block 4
- #endif
-
- Fundamentally, only one of the blocks 1, 2, 3, and 4 can EVER be passed
- to the output of the preprocessor (if we were missing the "else
- group", then it is possible that none of the blocks would be
- processed). If expression_1 evaluates to TRUE, then ONLY block 1 is
- processed, while blocks 2, 3, and 4 are discarded, and expressions 2 and
- 3 need not even be evaluated. On the other hand, if expression 1 is
- FALSE, then block 1 is discarded, and expression 2 is evaluated as
- though it were the start of a conditional inclusion section. Hence
- the first #if or #elif directive that evaluates to true causes its
- associated section of code to be included, and all other sections in
- the other groups to be discarded.  If none of these expressions
- evaluates to TRUE, then the code in the "else group", if it exists and
- has code, is processed.
-
-
- One very common use of conditional inclusion is to effectively
- comment out a large section of code that more than likely has
- /*...*/ based comments. Since standard /*...*/ based comments do not
- nest, large blocks of code CANNOT be safely removed using standard
- /*...*/ delimiters. In contrast, since conditional inclusion
- directives do nest, placing "#if 0" at the start of the section, and
- "#endif" at the end of the section effectively comments out (safely)
- an arbitrary block of code. Also, from a stylistic point of view,
- the fact that these directives DO NOT have to appear directly
- adjacent to the left margin (as was the case in some early C
- preprocessors) allows such commenting to be done in a very
- aesthetically pleasing format.
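-
- A short sketch of this idiom (the code being disabled is of course
- hypothetical):
-
-      #if 0                      /* temporarily disable this whole region */
-          result = old_algorithm();  /* comments in here cause no trouble */
-      #if VERBOSE
-          report(result);        /* nested conditional groups nest safely */
-      #endif
-      #endif
-
- Everything between the outer "#if 0" and its matching "#endif" is
- discarded, regardless of the comments or the inner conditional group
- it contains.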
-
-
-
- 3.8.2 LANGUAGE- PREPROCESSING DIRECTIVES- Source file inclusion
-
- Source file inclusion is performed using the "#include" directive.
- For our discussion, we will refer to directives like:
-
- #include <stdio.h>
-
- as <...> style includes, and
-
- #include "header.h"
-
- as "..." style includes.
-
- Subsection A) Interpretation of expanded tokens following #include
-
- One notable change made to many compilers to support the ANSI
- standard is the acceptance of include directives of the form:
-
- #include token1 token2 ..... tokeni
-
- wherein the token sequence cannot be interpreted as either a <...> or
- "..." style include. We will refer to this include directive as a
- "macro derived" directive, in honor of the fact that the tokens must
- be macro expanded before the include directive can be acted upon.
- Note that to not be a "..." style include, the first character of
- token1 must be other than `"', or at least there must be no other `"'
- character later on the include line. Similarly, to avoid the <...>
- style, the first character of token1 must be other than `<', or there
- must be no terminating `>' later on the line. Recall also from the
- discussion of tokenization, that backslashes cannot escape out a
- closing quote, and that the sequence /* is NOT honored within the
- file name in either the "..." or <...> style include (this is a
- special context).
-
- The first element of undefined behavior for #include directives
- concerns the method by which the results of macro expanding the
- tokens in a macro derived directive are interpreted. The most
- obvious (and simple) case is when the expansion of the entire token
- sequence is a simple string literal. The more complex case involves
- an expansion that is still a token sequence, such that the first
- token begins with a `<', and the last token ends with a `>'. The
- following are examples of such include directives:
-
- #define myfile "header.h"
- #define yourfile /* this macro has a null definition */
- #define less_than <
- #define greater_than >
- #define big_system_header os2.h
-
- #include yourfile myfile
- #include less_than big_system_header greater_than
-
- Although the above macro derived directives expand to token
- sequences that "look like" more common include directives, there are
- some special differences.  Specifically, using the above macro
- definitions, the resulting token sequences look like:
-
- #include "header.h"
- #include < os2.h >
-
- Note that the above casual macro definitions left leading and
- trailing spaces in the file name for the latter example. Although
- this could have been avoided by using function like macros, which can
- be placed sequentially with no white space between them, the presence
- of whitespace in the result is believed to be common among users of
- this feature. With the above examples understood, the JRCPP
- resolution is simply to concatenate all the tokens, ignoring
- inter-token whitespace, and THEN reinterpret the resulting character
- sequence in the context appropriate to <...> or "..." style include
- directives (i.e., no special meaning for backslashes, etc.).
-
-
- Subsection B) Search algorithm for included files
-
- A second point of undefined behavior in the ANSI C standard involves
- where exactly included files are searched for. There are actually
- several very distinct conventions for this search mechanism, even
- within file systems which are hierarchically based (such as UNIX).
- JRCPP has adopted a default strategy that is consistent with
- Microsoft C 5.1, and several other more recent compilers. There is
- also support (selected via #pragma include_search) for algorithms
- compatible with Microsoft C 4.0, and support for approaches
- compatible with older UNIX system cpp implementations.
-
- If an application is placing all the source and header files in a
- single directory, which is also the current working directory during
- the compilation, and the system include directory is a single
- absolute directory, then almost any search algorithm will suffice.  On
- the other hand, if header files in one directory include header files
- in a second directory, which in turn include header files in yet a
- third directory, while the user has a current working directory yet
- elsewhere, and some of the system
- include or application include directories use relative path names,
- the meaning of:
-
- #include "header.h"
-
- is far from obvious. Historically, projects developed under UNIX
- placed all source and header files in a single directory, and the
- discussion of search algorithms was irrelevant. With the growing
- complexity of code, and the presence of a multitude of programmers on
- a project, the need has arisen to hierarchically segregate sections
- of a large project into file hierarchies.  In order to support function
- calls between these sections, header file inclusion well outside the
- current directory has become commonplace. Various vendors have
- adopted algorithms that support the "trivial" case described
- originally, but there has often been disagreement about how to
- process the more complex cases.
-
- The philosophy that drove the development of the slightly complex
- include strategy of JRCPP was motivated by the following requirements:
-
- 1) It should be possible to write include files, that include
- other files, without concern about what source file was being
- compiled, where that file was, and what directory the user was in
- when the compilation was requested. This allows complex systems
- of header files to be written INDEPENDENT of the application that
- uses them.
-
- 2) To allow for even more complex sets of include files, if file
- A included file B, and file A was able to include file C, then
- file B should be able to include file C with equal ease. In some
- sense, this concept is similar to inheritance in object oriented
- programming.
-
- 3) It should be possible for a user to change current
- directories, and in doing so change what files are accessible via
- include searches (assuming the programmer has orchestrated the
- placement of header files to support this strategy). This allows
- different versions of an application to be compiled easily into
- different directory areas.
-
- 4) It should be possible to define the location of include files
- via a relative path. This facility would allow the construction
- of source file hierarchies that are easily ported to distinct
- absolute positions in file system hierarchies.
-
- We will start with some definitions of terms. We define the "system
- include path" to be a list of directories where system header files
- are provided. Typically all the ANSI C specified library header
- files are in directories listed in the system include path, and files
- in such directories can be expected to never change (hence it is rare
- to provide "make" dependencies on such files). The "application
- include path" is a list of directories that contain header files
- significant to a specific application or project. Typically, files
- in the application include path tend to change often during program
- development. Next there is the "current directory", with its
- standard meaning in a UNIX or DOS like file hierarchy.  Three
- additional terms need to be defined: the "original source
- directory", the "current source directory", and the "ancestral source
- directories". The "original source directory" is the directory in
- which the source file specified on the compilation or preprocessor
- command line was found. The "current source directory" is the
- directory which contains the file that is currently being parsed,
- i.e., the file with the include directive that we are trying to
- process. The "ancestral source directories" are a list of
- directories that begin with the "current source directory", and
- proceed back through each level of nested inclusion (specifying the
- directory in which that source file was found), all the way to the
- "original source directory".
-
- The algorithm for searching for header.h consists of the bottom level
- ancestral search, and a top level search. The algorithm terminates
- the first time a file can be accessed.  In pseudo-code, the algorithm
- would look like:
-
- ancestral_search(file_name)
- {
- if (file_name has relative prefix)
- {
- for (every prefix p, in the ancestral include list)
- do try to open (p/file_name)
- }
-
- /* but as a last resort ... */
- try to open (file_name) /* with no prefix */
- }
-
- Driving the code that we just listed is the higher level search,
- which is directed by the use of user specified include paths:
-
- standard_search(file_name)
- {
- try ancestral_search(file_name) /* with no prefix */
-
- if (file_name has relative path)
- {
- for every prefix p in the application include path
- do ancestral_search(p/file_name)
-
- for every prefix p in the system include path
- do ancestral_search(p/file_name)
- }
- }
-
- The above pseudo-code corresponds to the algorithm for finding "..."
- style includes.  When <...> style includes are searched for, the file
- is not searched for directly (with no prefix) unless it has an
- absolute path specifier, and the "application include path" is never
- made use of.  The following pseudo-code describes the search for <...>
- style include files:
-
- system_search(file_name)
- {
- if (file_name has relative path)
- {
- for every prefix p in the system include path
- do ancestral_search(p/file_name)
- }
- else /* file has an absolute path specification */
- try to open (file_name)
- }
-
-
- It should also be pointed out that when the ancestral_search function
- is used, the path prefixes for the ancestral include files are
- tried sequentially, starting with the directory of the current source
- file, and proceeding back to that of the original source file.
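-
- For readers who prefer real code to pseudo-code, the following is a
- minimal C sketch of the ancestral search (it is NOT the JRCPP source;
- the ancestor array, its ordering, and the simplistic test for a
- relative name are all assumptions made purely for the illustration):
-
-      #include <stdio.h>
-
-      /* `ancestors' runs from the current source directory back to the
-         original source directory; `count' is the number of entries.  */
-      static FILE *ancestral_search(const char *name,
-                                    const char *ancestors[], int count)
-      {
-          char path[FILENAME_MAX];
-          FILE *fp;
-          int i;
-
-          if (name[0] != '/' && name[0] != '\\')    /* relative prefix? */
-          {
-              for (i = 0; i < count; i++)
-              {
-                  sprintf(path, "%s/%s", ancestors[i], name);
-                  if ((fp = fopen(path, "r")) != NULL)
-                      return fp;                    /* first hit wins   */
-              }
-          }
-          return fopen(name, "r");         /* last resort: no prefix    */
-      }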
-
-
- Next, we will present an example of the above definitions, and the
- resulting search path. We will assume:
-
- The current working directory is /current_path. The source file
- original_path/original.c included the file header1path/header1.h
- (hence original_path is the "original source directory"). File
- header1.h included header2path/header2.h (hence the "ancestral
- directory list" is: header2path, header1path, original_path). The
- system include path contains the directories sysdir1 and sysdir2. The
- application include path contains the directories appdir1 and
- appdir2. For our example of searching, we will examine the search
- pattern when header file header2.h contains the include directive:
-
- #include "header3.h"
-
- If the file specified as header3.h does have an absolute path prefix,
- then only that absolute path/filename is used in searching for the
- file (an example with an absolute path is "/usr/lex/lexdefs.h"). If
- header3.h does not have an absolute path prefix, then the following
- search pattern is followed. With the caveat to be mentioned in a
- moment, the search for a file consists of trying to open the
- following:
-
- header3.h
- appdir1/header3.h
- appdir2/header3.h
- sysdir1/header3.h
- sysdir2/header3.h
-
- There is, as mentioned, one caveat to the above search sequence. The
- caveat is that whenever a file name in the above list does not have
- an absolute path prefix, the prefixes provided by the ancestral
- directories are tried sequentially, followed by the name as given,
- before moving on down the list. Since we assumed that header3.h did
- not have an
- absolute path prefix, the following files would actually be subject
- to fopen() calls:
-
- header2path/header3.h
- header1path/header3.h
- original_path/header3.h
- current_path/header3.h (same as simply `header3.h')
-
- The above list corresponds to a search using path prefixes taken from
- the ancestral include list, and then using the current directory.
-
- Note that if one of the application include path directories had only
- a relative path prefix, then it too would make use of the ancestral
- include directories for prefixes. For example, if appdir2 was a
- relative path, then header3.h would (if it were not found earlier in
- our list) be searched for in:
-
- header2path/appdir2/header3.h
- header1path/appdir2/header3.h
- original_path/appdir2/header3.h
- current_path/appdir2/header3.h (same as `appdir2/header3.h')
-
- Subsection B: Include strategy compatibility
-
- The include strategy described above is compatible with Microsoft C
- 5.1, and several other major compilers. Since this strategy is
- rather general, it tends to provide coverage of search areas
- sufficient for almost any project. If, however, absolute compatibility
- with other compilers is desired, the search strategy may be modified
- by use of the appropriate pragma options. The adjustments permitted
- all represent restrictions to search in only a subset of the default
- areas.
-
- For example, the SUN UNIX cpp searches for "..." style includes first
- in the parent directory, and then in the application's include path.
- To achieve complete compatibility with SUN, it is necessary to use
- the pragma:
-
- #pragma include_search youngest_ancestor_only
-
- As a second example, Lattice C 3.0 searched for "..." include files
- in the current working directory, and then in the application's
- include path. To be compatible with such a strategy, the following
- pragma should be entered:
-
- #pragma include_search current_directory
-
- Note that the options for the different ancestral search modes
- include: `youngest_ancestor_only', `eldest_ancestor_only', and
- `all_ancestors'. These options correspond respectively to using only
- the current source directory, using only the original source
- directory, and using the full ancestral search algorithm. The default
- provided by JRCPP is `all_ancestors', but omitting an ancestral
- selection (as in the last example) implies the use of no ancestral
- directories.
-
- Note that if both searching in the current directory and searching in
- the ancestral directories are disabled (by omitting those options
- from such a pragma), then it is impossible to include any files
- except through an absolute prefix supplied in the system include
- path or the application include path.
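-
- Conversely, the searched areas may be combined. The following line
- is only an illustration; the exact rules for listing several options
- in one include_search pragma should be confirmed in the JRCPP Users
- Guide:
-
-     #pragma include_search current_directory eldest_ancestor_only
-
- Under the option meanings given above, such a request would limit
- the ancestral portion of the search to the original source
- directory, while still permitting the unprefixed (current directory)
- attempt.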
-
- 3.8.3 LANGUAGE- PREPROCESSING DIRECTIVES- Macro replacement
-
- This section has only two mentions of undefined behavior, but it has
- several topics worthy of commentary. We will start with the
- comments, and then proceed to resolve the cases of undefined behavior.
-
- It should be mentioned here that JRCPP has a rather novel pragma that
- explains in detail the steps taken during a macro expansion. This
- information may be used as a tutorial assistant (for the novice), as
- a debugging assistant (for the professional programmer), or as a
- method of proof of a bug (for a registered user to file a bug
- report). The explanation given includes a step by step breakdown of
- the macro expansion process, along with reference listings of current
- macro definitions as they are applied, and reasons for ignoring
- current definitions (such as the fact that an identifier is already a
- part of a recursive expansion of itself). For more details see
- "#pragma describe_macro_expansions", and note that this feature can
- be turned on and off to localize its impact. It is expected that
- this pragma can easily augment this Language Reference Manual by
- providing annotated examples.
-
- ANSI C differs from many prior C preprocessor implementations in that
- an identifier may not be redefined "differently" from its current
- macro definition, without first undefining the existing definition.
- Some non-ANSI implementations allowed redefinitions to mask existing
- definitions, and future #undef directives to "pop" into visibility
- older definitions. Other implementations simply allowed new macro
- definitions to overwrite existing definitions. JRCPP only supports
- the ANSI C approach, in agreement with the ANSI C Rationale's view
- that the other schemes are error prone, and generally represent
- actual errors. Note that redefinitions that are identical to the
- original (such as occurs when a header file is included twice) are
- fully legal. The
- Standard is quite specific in defining what a benign redefinition is,
- but it can be summarized by saying that the only difference allowed
- in a redefinition is the length of white space interludes. (There is
- actually a subtle error in the Standard. The restriction that the
- "order of the parameters" be unchanged in a redefinition, was not
- listed. JRCPP assumes this was a typo, and requires that the order
- of parameters be unchanged in a macro redefinition). If I receive
- sufficient input from users that my decision here is a hindrance, I
- will support the other standards involving non-benign redefinitions
- (with appropriate diagnostics), controlled via pragma directives.
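-
- The following constructed lines illustrate the distinction:
-
-     #define MAX(a,b) ((a) > (b) ? (a) : (b))
-     #define MAX(a,b) ((a)  >  (b) ? (a) : (b)) /* benign: only white
-                                                   space widths differ */
-     #define MAX(x,y) ((x) > (y) ? (x) : (y))   /* NOT benign: the
-                                                   parameter names (and
-                                                   hence the replacement
-                                                   tokens) differ, so a
-                                                   prior #undef is
-                                                   required */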
-
- A second item worth noting is that whitespace at the start and
- end of a replacement list is not considered part of the replacement
- list. Hence it is impossible to define a macro that expands to
- whitespace.
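-
- For example, in the constructed lines below, the macro EMPTY has an
- empty replacement list:
-
-     #define EMPTY        /* trailing blanks (and this comment) are not
-                             part of the replacement list */
-     EMPTY                /* expands to no tokens at all, rather than
-                             to a white space "token" */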
-
- The two aspects of undefined behavior involve odd occurrences during
- the gathering of arguments for a function like macro invocation. The
- first problem that needs to be addressed is what the behavior is when
- a token sequence that could be interpreted as a directive is
- encountered while gathering a list of arguments. The second point of
- undefined behavior is present when an argument consists of "no
- preprocessing tokens".
-
- Fundamentally, the problem with allowing preprocessor directives to
- occur during the gathering of the argument list for a macro is that a
- #define or #undef directive might be encountered. Consider the
- following code:
-
- #define f(x) x+1
- f(2) /* that was easy */
- f( /* look for the argument ... */
- #undef f
- 2 ) /* found the arg, but should we use it? */
-
- JRCPP resolves such ambiguities very easily by not generally allowing
- directives to be present within macro argument lists. Specifically,
- JRCPP would generate a diagnostic indicating that there was no
- closing parenthesis for the macro invocation (JRCPP stopped looking
- when it reached the directive). Although this is a VERY reasonable
- result when #define and #undef directives are reached, it is not so
- obviously necessary when certain other directives are reached. To
- assist the programmer who is using existing code such as:
-
- #define f(x) x+1
- f(
- #if foo
- 2
- #else
- 3
- #endif
- )
-
- a pragma (#pragma delayed_expansion [on|off]) is provided that causes
- such "ANSI C undefined behavior" to be acceptable. The restriction
- on this pragma based extension is that no directives are allowed
- within the argument list that can even possibly cause a change in the
- macro database (i.e., the relevance of the change is not considered;
- the significance of the change, such as a benign redefinition, is not
- considered). Simply put, if a #define, #undef, or #pragma is
- encountered during a scan for arguments to a macro, the scan is
- terminated with a "missing close paren" diagnostic, even if the
- delayed_expansion pragma is active.
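-
- Assuming the pragma spelling shown above, the earlier example could
- be made acceptable as follows (a sketch; the JRCPP Users Guide is
- the authoritative reference for this pragma):
-
-     #pragma delayed_expansion on
-     #define f(x) x+1
-     f(              /* conditional directives are now tolerated */
-     #if foo
-     2
-     #else
-     3
-     #endif
-     )               /* ...but a #define, #undef, or #pragma here would
-                        still end the scan with a "missing close paren"
-                        diagnostic */
-     #pragma delayed_expansion off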
-
- As mentioned, the second area of ANSI C undefined behavior involves
- the "presence of an argument consisting of no tokens... before
- argument substitution...". This phrase unfortunately contradicts the
- definition of argument: "... sequence of preprocessing tokens ... in
- a macro invocation". Note also that "substitution" is the time at
- which expansion of the argument is considered, and hence the odd
- phrase cannot possibly refer to the situation where an argument is
- expanded, but the result is nil (or whitespace). Rather than harp on
- this inconsistency, we will discuss what perhaps is a related (or
- even intended) problem: What is the interpretation of a missing
- argument (or at least white space where an argument should be)?
-
- It is clearly stated that the number of arguments in a macro
- invocation must agree exactly with the number of parameters in the
- corresponding macro definition, and hence JRCPP generates a
- diagnostic if this is not the case. In addition, JRCPP enlists an
- error recovery strategy that consists of substituting whitespace
- (which is clearly not a valid ANSI C argument) for any missing
- arguments. This strategy is intended to be compatible with some
- prior non-ANSI implementations. Note that this action can cause a
- secondary error if the macro attempts to use this argument in certain
- ways. The following examples demonstrate these secondary errors:
-
- #define paste_left(x) something ## x
- #define paste_right(x) x ## something
- #define stringize(x) # x
- paste_left(/*white*/)
- paste_right(/*white*/)
- stringize(/*white*/)
-
- Fortunately, most code that exploits this non-ANSI behavior (a
- missing argument is actually whitespace) does not use the paste
- operator (##) or the stringize operator (#), hence this secondary
- error will tend not to occur. Applications that are porting
- non-ANSI code through JRCPP may then choose to lower the severity
- level of the diagnostic that reports the whitespace argument, and
- accept the error recovery procedure as reasonable. (User feedback on
- my error recovery scheme may improve compatibility with other
- implementations).
-
-
-
- 3.8.3.2 LANGUAGE- PREPROCESSING DIRECTIVES- The # operator
-
- The # operator provides to ANSI C the stringizing functionality that
- was often provided via expansion of parameters within string literals
- in older (and NON-ANSI) implementations. One key point that should
- be stressed is that when this functionality is used, the
- argument is NOT macro expanded before being stringized (i.e., placed
- into quotes). For example:
-
- #define stringize(x) #x
- #define A 2
- stringize(2) /* becomes "2" */
- stringize(A) /* becomes "A" */
-
- If a user wants the argument to be expanded and THEN stringized, the
- following construction should be used:
-
- #define stringize(x) # x
- #define expand_then_stringize(x) stringize(x)
- #define A 2
- expand_then_stringize(2) /* becomes "2" */
- expand_then_stringize(A) /* becomes stringize(2), which
- becomes "2"*/
-
- The Standard indicates that the order of evaluation of the operators
- # and ## is unspecified. Since the token to the right of the
- stringize operator (#) must be a parameter, it would appear that the
- following are the only two cases to consider:
-
- #define F(x) word ## # x
- #define G(x) # x ## other
-
- JRCPP follows the C tradition of providing higher precedence for
- unary operators than for binary operators. JRCPP parses the
- definition of F to attempt to paste a stringized version of parameter
- x onto the right side of the identifier `word'. This decision is a
- bit immaterial, as the result of pasting any valid preprocessing
- token to the left side of a string literal (the stringized version of
- x) is almost always an invalid preprocessing token. Similarly, the
- definition of G provides a request to paste the word `other' onto the
- right side of the stringized version of parameter x. Defining the
- precedence in any other way appears to be of equally little use.
-
-
- 3.8.3.3 LANGUAGE- PREPROCESSING DIRECTIVES- The ## operator
-
- The pasting operator ## supplies in ANSI C the functionality that was
- provided in various compilers in the past, by means of assorted hacks.
- Most hacks were based on methods of getting adjacent identifiers to
- "flow together". The two methods that I am aware of are:
-
- #define f() start
- f()end /* for SOME non-ANSI cpp, becomes: `startend' */
-
- and
-
- #define g start
- g/*comments go away*/end /* some non-ANSI cpp: `startend' */
-
- Neither of these constructs is supported under ANSI C, and in both
- cases JRCPP defaults to producing the two tokens `start' and `end',
- separated by a space. The first of the two approaches is supported
- via a pragma under JRCPP (see #pragma space_between_tokens).
-
- It should be emphasized that, just as with the stringize operator,
- arguments are NOT expanded prior to insertion into the replacement
- list at points where they are operands of a ## operator. For
- example:
-
- #define A 2
- #define append(x) x ## right left ## x x
- append(A) /* becomes: Aright leftA 2 */
-
- If the user desires an expansion prior to pasting, the construct
- described earlier with regard to stringization must be used.
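-
- That is, by direct analogy with expand_then_stringize (a constructed
- example):
-
-     #define A 2
-     #define paste(x,y) x ## y
-     #define expand_then_paste(x,y) paste(x,y)
-     paste(A,B)             /* becomes: AB */
-     expand_then_paste(A,B) /* becomes: paste(2,B), which becomes 2B */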
-
-
- 3.8.3.5 LANGUAGE- PREPROCESSING DIRECTIVES- Scope of macro definitions
-
- This section of the standard has some very nice examples of the
- process of macro expansion. These include the use of the paste
- operator (##), the stringize operator (#), and the prevention of
- infinite recursion of macros. If the user tries these torture tests
- on JRCPP, rather than reveal an error in JRCPP, they will reveal a
- typo in the ANSI C Standard. The specific error in the standard
- involves:
-
- #define str(x) #x
- str(: @\n)
-
- which the standard incorrectly expands to ": \n", but should have
- expanded to ": @\\n". The goal of the stringize operation is to
- produce text that, when printed, reads exactly as the argument that was
- supplied. The standard is clear on this point, and other items in the
- expansion demonstrate this. Unfortunately, I am sure that this typo
- in the Standard will also be the source of many bug reports.
-
- The user who is trying these tests should also run them with the
- pragma space_between_tokens set to off, if they would like the format
- to be closer to that of the listing in the Standard. In either case
- the results should be correct. The user may also note a slight
- discrepancy in the format of the output, due to the fact that JRCPP
- maintains line position information much more accurately than most
- other preprocessors. In this regard, consider the example:
-
- #define swap(x,y) y x
- swap (
- +i
- +j,
- +k
- +l)
-
- Most preprocessors produce the output:
-
- #line 2
- +k +l +i +j
-
- Whereas JRCPP produces:
-
- #line 5
- +k
- +l
- #line 3
- +i
- +j
-
- The big advantage of the method provided by JRCPP is exposed when
- compiler diagnostics wish to refer to a token in such a stream. When
- JRCPP is used, a diagnostic for "syntax error on token `+'" can be
- very specific about the line number with the offending character.
- With other preprocessors, the user is just told the error was on line
- 2. (As a historical note, many pre-ANSI preprocessors required that
- the macro name and all the arguments be placed on a single line.
- Many users have, as a result, built large logical lines when a macro
- was being invoked. ANSI C established a standard whereby this was no
- longer necessary, but many compiler manufacturers are slow to service
- the users that have moved to this more readable notation.)
-
-
- 3.8.4 LANGUAGE- PREPROCESSING DIRECTIVES- Line control
-
- The #line directive is fully supported by JRCPP. There are several
- points to note about its performance. The first item worthy of note
- is that the standard provides for macro expansion of the tokens on
- the logical line with the #line directive. Unfortunately, it does
- not provide for arithmetic reductions. The result of macro expansion
- must be either a digit sequence (representing a line number), or a
- digit sequence followed by a string literal (a file name).
-
- Note that the line directive requires a true string literal, and does
- not give the file name the special treatment that is provided for a
- file name in an include directive. This distinction means that when a
- backslash is used in a file name that is specified using a #line
- directive, it must be "escaped" using another backslash.
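-
- For example (the file names are, of course, only illustrative):
-
-     #line 100                       /* line number only */
-     #line 100 "parser.y"            /* line number and file name */
-     #line 100 "c:\\include\\defs.h" /* each backslash in the name
-                                        must be written as \\ */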
-
- 3.8.6 LANGUAGE- PREPROCESSING DIRECTIVES- Pragma directive
-
- JRCPP makes extensive use of pragmas in order to direct customization
- of the performance of the preprocessor. Users should refer to the
- JRCPP Users Guide for a complete list of valid pragmas, along with
- their meaning. As per the Standard, unrecognized pragmas are ignored
- by JRCPP. In order to facilitate the use of pragma directives to
- control the compiler, unrecognized pragmas are passed unchanged
- (except for comment removal and whitespace compaction) through to the
- post-preprocessed output file.
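-
- For example, assuming `pack' is not a pragma that JRCPP recognizes,
- the directive:
-
-     #pragma pack(1) /* structure packing request for the compiler */
-
- would appear in the post-preprocessed output with the comment
- removed and the white space compacted, ready for the compiler to
- interpret.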
-
- Note that because some pragmas do modify the macro database, pragma
- directives are not permitted within macro invocation argument lists.
- If there is some need to pass forward a pragma to the compiler
- without having it acted upon by the preprocessor (for example, when
- JRCPP would misunderstand it), then the following sort of approach
- can be taken:
-
- #define HIDE_PRAGMA
- HIDE_PRAGMA # pragma any tokens
-
- Because the results of macro expansion will NEVER be considered by
- JRCPP to be a directive, the pragma presented in this cloak will not
- be processed by JRCPP. Unfortunately, when a user is
- forced to this extreme, the protection against macro expanding the
- list of tokens for an unknown pragma is lost. JRCPP has endeavored
- to use novel pragma names that should not clash with the pragma
- specifications of many other implementations.
-
-
-
- 3.8.8 LANGUAGE- PREPROCESSING DIRECTIVES- Predefined macro names
-
- All five of the ANSI C predefined macros are supported by JRCPP. The
- macros are:
-
- __LINE__ Current presumed line number in the source file. The value
- of __LINE__ will be in the range permissible for a signed
- long integer.
-
- __FILE__ Current presumed file name, provided as a string literal.
- Note that because __FILE__ is a string literal, any
- occurrences of the backslash character in the actual source
- name have been replaced by `\\' (an escaped backslash).
-
- __DATE__ The date on which JRCPP began to preprocess the original
- source file, expressed as a character string. The format
- will always be "Mon dd yyyy", where Mon is the abbreviation
- for the month, dd is a 1 or 2 digit representation of the day
- of the month, and yyyy is the 4 digit calendar year.
-
- __TIME__ The time at which JRCPP began to preprocess the original
- source file, expressed as a character string. The format is
- "hh:mm:ss", where hh is the number of hours past midnight
- local time, mm is the number of minutes past the hour, and
- ss is the number of seconds past the whole minute.
-
- __STDC__ The integer constant 1. This constant is meant to indicate
- that the compiler/preprocessor is a conforming ANSI C
- implementation.
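-
- As a small illustration (constructed for this manual) of how the
- first two of these macros are commonly used, a debugging macro might
- be written as:
-
-     #include <stdio.h>
-
-     #define WHERE() \
-         fprintf(stderr, "reached %s, line %ld\n", __FILE__, (long) __LINE__)
-
- Here __FILE__ expands to a string literal such as "main.c", and
- __LINE__ expands to the decimal constant for the current presumed
- line number.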
-
- The Standard precludes redefining any of these macro names via
- #define, as well as attempting to remove any of them via #undef.
- Similarly, the standard precludes the use of #define or #undef on the
- identifier `defined' (which has special meaning in evaluating a
- #if/elif directive).
-
- JRCPP supports the full ANSI C Standard as indicated above, but
- maintains customization features that allow it to be modified
- slightly to be non-conforming.
-
- One major point is that most of the time, the compiler that JRCPP is
- preprocessing for is not ANSI conformant. With this situation in
- mind, there must be some mechanism for undefining __STDC__. Special
- pragmas have been provided in JRCPP to accomplish most of these tasks
- (see #pragma undefine_macros).
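-
- A plausible use (the exact argument syntax is an assumption here;
- the JRCPP Users Guide is authoritative) would be:
-
-     #pragma undefine_macros __STDC__
-
- after which conditional tests such as `#ifdef __STDC__' would no
- longer be satisfied.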
-
- As a special note, the pragma that switches to C++ mode (see #pragma
- cplusplus_mode) has the following effect: The macro __STDC__ is
- undefined, and the macro __cplusplus is defined. Moreover, this new
- macro __cplusplus has the same reserved status (i.e.: cannot be
- #define'd or #undef'ed) as __STDC__ has under default JRCPP. In
- addition, one line // style comments are also supported in C++
- mode.
-
-
-