Macro: User Manual and Report

(From: Crafting a Compiler in C/Fischer & LeBlanc slightly altered by Kris Stewart for cs524/spim Spring 2008) A BNF for MACRO
1. Introduction
The term project for CS 524 - Compiler Construction will be to implement a compiler for a subset of Macro. Macro is a langauge similiar in many ways to Ada, and has been chosen to illustrate concepts in the design of programming languages and compilers.

1.1 Lexical Considerations
The standard character set is implementation dependent, but must contain all the special symbols needed to form Macro tokens. Comments are delimited by "--" and end of line. Comments may appear anywhere an end of line may appear; they are ignored by the scanner. Upper and lower-case letters are considered equivalent in all tokens except strings. White space (blank, tab or end-of-line) serve to separate tokens; otherwise they are ignored. White space may not appear in any token except a string constant (see below). The maximum allowed line length is implementation-defined.

1.2 Tokens
The following are keywords. They are all reserved words.
              ARRAY        BEGIN      BODY         ELSE
              ELSIF        END        EXIT         IF
              LOOP         PROCEDURE  READ         RECORD
              THEN         TYPE       WHEN         WRITE
              PACKAGE      PRIVATE    IN           OUT
              IS           OF
Constants are either integer, float, or string. Integer constants contain only digits. Float constants contain a decimal point. At least one digit must both precede and follow the decimal point. String constants begin and end with a double quote (") and contain any sequence of printable characters. If a double quote appears inside a string constant, it must be repeated (e.g. """Help!"" he cried"). The allowable range of integer constants and the range and precision of float constants are implementation-defined.

Let letter = 'A'..'Z','a'..'z'; digit = '0'..'9'; using a regular expression notation in which ',' represents set union , '.' represents set product, '*' represents Kleene closure, LAMBDA (λ) represents the null string, NOT represents set complement, and literals are delimited by quotes ('), the above definitions may be made more precise:

          CONSTANT = INTCONST , FLOATCONST , STRCONST
          INTCONSTANT = digit . (digit)*
          FLOATCONSTANT = INTCONSTANT . '.' . INTCONSTANT
          STRCONSTANT = '"' . (NOT('"') , '"' . '"')* . '"'
Identifiers are strings of letters and digits starting with a letter (not to include the reserved keywords). Thus identifiers can be specified as follows, where RESERVED represents the set of reserved keywords:
           IDENTIFIER = (letter , (letter , digit)*) - RESERVED
The length of an identifier must not exceed 32 characters, but all characters of a legal identifier must be used in determining its uniqueness.

The following are the remaining operators and delimiters ($EOF$ represents end-of-file):

           ASSIGNOP = ":="
           MULTOP = "*" , "/"
           PLUSOP = "+" , "-"
           RELOP = "<" , "<=" , ">" , ">=" , "=" , "/="
           DELIM = ".." , ":", ";" , "," , "." , "(" , ")" , "$EOF$"
Adjacent identifiers, reserved words, and numeric literals must be separated by white space (or comments). This guarantees that there is no ambiguity in how tokens are scanned (e.g. 'begina' is one identifier, not 'begin' followed by 'a').

2. Informal Semantics
2.1 Predefined Types
Macro contains 4 predefined types:

2.2 Defining New Types
The syntax for creating a new type in Macro is:
               TYPE  IS ;
becomes the name of the new type whose structure is given by . Thus,
               TYPE fubar IS integer;
defines a new type, fubar, which behaves exactly like an integer. To describe more complex type structures, Macro provides two type specification forms for creating either a record or an array.
(a) Arrays are defined by:
TYPE IS ARRAY( .. ) OF ;
must be either the name of a previously defined type, or the specification of a new type (including ARRAY).
               Example:
                     TYPE arr1 IS ARRAY(1..9) OF integer;
                     TYPE arr2 IS ARRAY(0..1) OF arr1;
                     TYPE arr3 IS ARRAY(0..1) OF ARRAY(1..9) of integer;
(b) Record specifications take the form:
                  TYPE  IS
                   RECORD
                    
                   END RECORD;
For example:
                 TYPE R1 IS RECORD
                               i, j : float;
                               c    : integer;
                            END RECORD;

                 TYPE R IS RECORD
                             e : integer;
                             f : r1;
                           END RECORD;
A component declaration is syntactically identical to a variable declaration (see below). The field names within a record are local to that record and need be unique in that record only. As with arrays, a particular component may also be specifiec by a record form.

2.3 Variable Declarations
Variables are introduced by declarations of the form
               id, id, ..., id : ;
must be a type descriptor (e.g. integer, or the construction of a new type using either the record or array specification forms).
       Example:
                a, b : integer;
                c    : RECORD
                          a, b : integer;
                       END RECORD;

2.4 Variable Denotations
Fields of a record are designated by appending qualifications to a record id. Incomplete record qualifications are not permitted. Consider the following:

If we use the above definition of record r to obtain

                   a, b : ARRAY(1 .. 10) OF r,
all of the following are legal names:
                  a(1).e, a(1).f.c., b(I).f
but the following are illegal:
                   b(1).i, a.e, a.e(1), f.j.c

2.5 Operators and Expressions
The operators, in increasing order of precedence, are:
1. Relational operators: =, <, <=, >, >=, /=
2. Adding operators: +, -
3. Unary operators: +, -
4. Multiplying operators: *, /
5. Field/Package qualification ('.') and subscripting ('(' and ')')

Operators at the same precedence level are evaluated from left to right. All operators are left-associative except the relational operators, which are not associative at all.

2.6 Operator Descriptions
Multiplying Operators
- The operator * is multiplication; the operator / is division. Both operands must be float or integer; the result type is the same as the operand type.
Unary Operators
- The following are defined on all arithmetic types, and always return the type of the operand.
  • +: Identity operator
  • -: Complement operator
Adding operators
- The operator + is addition; the operator - is subtraction. Both operands must be float or integer; the result type is the same as the operand type.
Relational operators
- For the operators <, <=, >, >=, =, and /= both operands must be of the same type; the result is boolean. Equality and inequality (= and /=) are defined for all predefined types. The others are defined for scalar types.

2.7 Assignment Compatibility
Types may be assigned only if they have the same base type. Two distinct type definitions, even if structurally identical, are considered different. This is called name equivalence of types.

Example:

          Given TYPE t1 IS ARRAY(1..10) OF float;
                TYPE t2 IS ARRAY(1..10) OF float;
                a, b : t1;
                c, d : t2;
                e, f : ARRAY(1..10) OF float;
a := b, c := d and e := f are legal, but a := c and c := e are illegal.

There is however, one conversion which the compiler should recognize and perform. Namely, when an arithmetic expression contains both integers and floats, then the integers shall be coerced into floats and the result given as a float. Take special note that the reverse is not true, that is a float is never coerced into being an integer.

2.8 Assignment Statements
Assignments are as in Algol 60 or Pascal, given that the left and right sides of the assignment are compatible (as defined above). Compatible objects of any type may be assigned. The order of evaluation is implementation defined.

2.9 The IF Statement
The IF statement has the form:
          IF Boolean Expression THEN
             sequence of statements
          ELSIF Boolean Expression THEN
             sequence of statements
               .
               .
               .
          ELSIF Boolean Expression THEN
             sequence of statements
          ELSE
             sequence of statements
          END IF
The ELSE clause may be omitted. The semantics are the same as for the IF statement in PASCAL.

2.10 The LOOP Statement
The LOOP statement has the form:
             : LOOP
               sequence of statements
            END LOOP;
This is a "loop forever", but EXIT can be used to terminate iteration (see below). The " :" is optional.

2.11 The READ Statement
*** note - since SPIM always performs a read line taking only one value from the input stream, i chose to force our target language to have only the

READLN

statement. writes in SPIM can either be WRITE or WRITELN ks 8/93 **

In programming language design, a recurring problem is whether to treat READ and WRITE as statement types, or as predefined procedures. There are problems with both approachs. Considering READ and WRITE to be predefined procedures (as Ada does) isn't really satisfactory because (unlike ordinary procedures), we'd like them to take an indefinite number of arguments and perhaps allow non-standard parameter syntax (such as "width specifications"). On the other hand, reserving READ and WRITE seems to obviate the chance of allowing users to extend these statements to handle new data types. In Macro, we'll essentially treat READ and WRITE as predefined procedures that are overloaded. To allow lists of input or output items (which Ada doesn't really support) for the commonly used READ and WRITE procedures, we'll introduce a notational extension in which only the procedures for READ and WRITE may contain an arbitrarily long list of parameters. Thus

               WRITE ("Date =", month, day, year);
would be allowed.

In Macro, only scalars may be read. That is, the standard package contains definitions for READ that accept integer and float values. READ is not predefined to handle arrays or records, though user-defined routines for such types can be written.

READ can handle components of structured types, as long as the components are scalars. Our preprocessor forces READs to handle items left to right so, e.g.,

               READ (i, a(i)); is legal.

2.12 The WRITE Statement
All scalar values (and expressions) as well as strings are handled by the predefined WRITE procedure. Each scalar type has a default output width (implementation determined) which is sufficient to print any value of that type without truncation.

2.13 Subprograms
Procedures are the only kind of subprograms allowed.. A call of a procedure is P (ARG, ARG2, ...);, and a call of a procedure of no arguments is P;. Procedures may be recursive.

At the beginning of every subprogram is a list of all the formal parameters of the procedure. For each parameter its position, name, type and mode (IN, OUT, or IN OUT) is specified (the default is IN).

Example:

                  PROCEDURE F (x : float; y, z : IN OUT integer)
Three formal parameters (x, y, z) are declared. All formal parameters are considered to be local to the body of the subprogram.

The types of all formal parameters must be specified with a type name. An explicit type generator may not be used.

Macro subprograms follow the same scope rules as Pascal: Every variable is automatically imported into any function, or procedure contained within the definition, unless that inner definition contains another declaration of the same variable. Macro allows no forward references, so the scope of a declaration actually extends from the point of its definition to the end of its containing scope.

A procedure can also declare local constants, variables, types and procedures. The general structure of a procedure body definition is like that of PASCAL:

               PROCEDURE  (  ) IS
                       type, variable and subprogram declarations
               BEGIN
                       Statement list
               END;
The () part is optional.

2.14 Parameter modes
When a procedure is called, actual parameters are matched with corresponding formal parameters.
  1. 1. Parameters of mode IN are local constants initialized at the time of call to the corresponding actual parameter. The type of the actual parameter must be assignable to that of the formal parameter. Any actual parameter passed as an IN parameter may be any expression of the proper type. Note that IN parameters are the "safest" kind, since they cannot be altered. That's why they are the default.
  2. 2. Parameters of mode OUT are uninitialized local variables, whose values, at the point of return, are assigned to the corresponding actual parameter. The type of the formal parameter must be assignable to that of the actual parameter. An actual parameter passed as an OUT parameter must be a variable name (possibly qualified), since it will be the target of an assignment.
  3. 3. Parameters of mode IN OUT are local variables, initialized, at the point of call to the value of the value of the actual parameter. At the point of return, the local variable's value is assigned back to the corresponding actual parameter. The types of the formal and actual parameters must be mutually assignable. An actual parameter passed as an IN OUT parameter must be a variable name (possibly qualified), since it will be the target of an assignment. IN OUT parameters can obviously be implemented by performing two distinct copy operations. This is reasonable for scalar parameters. For records, arrays and strings, copy operations may be expensive. Macro therefore allows IN OUT parameters to be implemented as "reference" parameters; that is, by passing an address rather than an actual copy. Programs that depend on how IN OUT parameters are implemented are illegal.

2.15 Forward References
In Macro, no forward references are allowed. If a constant, variable, type or subprogram is to be used, it must already have been defined. It may happen that a subprogram must be called before it has been defined (for example, if A calls B and B calls A). For this reason the declaration of a subprogram may be separated from its actual definition. To declare a subprogram, we just provide its header which specifies its name, and its parameters (if any). Later (in the same name scope) we must provide a complete subprogram definition (including the header). For example:
            PROCEDURE  (  );

2.16 The EXIT Statement
Macro has no GOTO statement. Rather, EXIT serves as a restricted form of GOTO. An EXIT is used to leave a LOOP. An unlabeled EXIT leaves the innermost LOOP structure containing the EXIT statement. A labeled EXIT leaves the LOOP labeled with the corresponding identifier. In both cases, the LOOP exited must be in the same subprogram or main program that contains the EXIT statement. That is, an EXIT may not imply a return from a subprogram.

A conditional EXIT is implied by a "WHEN clause". This is of the form:

               EXIT  WHEN 
The EXIT is performed only if the Boolean expression is true. Why Ada doesn't simple use an IF statement is an interesting question. It appears they decided to "go for barroque."

2.17 Packages
Real world programming languages require a mechanism to modularize programs. This is necessary to structure large programs into manageable units. In Macro, the unit of modularization is the package. A package is of the form:
               PACKAGE  IS
                       type, variable and subprogram declarations
               PRIVATE
                       type declarations
               BODY
                       type, variable and subprogram definitions
               BEGIN
                       Statement list
               END;
The PRIVATE part may be ommitted if there are no private types.

The type, variable and subprogram headers that appear in package declaration section can be made visible to other packages. They are therefore called the visible part of the package. Objects can be referenced by qualifying the object's name with the name of the package in which it appears. Thus P.A would name object A in package P.

We wish to use packages to implement abstract data types, but this leads to a problem. If we place the definition of a type in the visible part of a package, it's implementation is visible outside the package. This is undesirable, as we wish to characterize an abstract data type by its operations, not its implementation. Declarations placed in the body of a package are never visible outside the package, so the definition of an abstract can't be placed there. The PRIVATE part of a package declaration exits to address this problem. In the visible part of the package, the type is decalared to be PRIVATE. The implementation of the type is hidden in the PRIVATE part. Thus is the declaration

          PACKAGE SetStuff IS
                -- This is the visible section
                TYPE Set IS PRIVATE;
                PROCEDURE InSet (i : integer; s : Set; j : OUT integer);
          PRIVATE
                -- This section shows the implementations of items
                -- in the visible section if the implementation wasn't
                -- given there.
                TYPE Set IS ARRAY(1 .. 10) OF integer;
                PROCEDURE InSet (i : integer;
                                 s : Set;
                                 j : OUT integer) IS
                BEGIN
                     j := s(i);
                END;
          BODY
                -- The body section is for initialization, first
                -- the declarations are given...
                i : INTEGER;
          BEGIN
                -- and then the code!
                READ (i);
                s(i) := 1;
          END;
The name is made visible, but not the fact that it is implemented as an array. If the full declaration were in the visible part, its implementation would be visible. By making types private, we can guarantee that they are manipulated only by operations defined in the package itself. All declarations in the package body are completely hidden outside the package. Declarations in the specification part are visible in the body, which is responsible for implementing the subprograms declared in the specification part.

The statements (if any) in the package body following the declarations are executed when the package body is executed. Package bodies are executed in the order in which package bodies are sequenced. Packages declarations may appear anywhere that a procedure, type or variable declaration can appear, and thus may be nested.

2.18 A BNF for Macro
The following is a BNF description of the Macro language syntax. Curly brackets are used to indicate items which may be repeated 0 or more times, while optional items are indicated by being enclosed in square brackets. Non-terminals appear on the left hand side of a production and are in lower case. Terminals are always in upper case. Special symbols, other that {, }, [ and ] are not quoted. Further, terminal names are recognized according to the syntax described at the beginning of this document.
       program      -> {package-decl} $EOF$
       package-decl -> PACKAGE IDENTIFIER IS {declaration}
                          [private-decl] [body] END ;
       private-decl -> PRIVATE {declaration}
       body         -> BODY {declaration} [ BEGIN statement {statement} ]
       declaration  -> IDENTIFIER {, IDENTIFIER} : type ;
                    -> TYPE IDENTIFIER IS type ;
                    -> PROCEDURE IDENTIFIER [formals-list] [proc-body] ;
                    -> package-decl
       type         -> variable
                    -> PRIVATE
                    -> ARRAY ( expression .. expression ) OF type
                    -> RECORD component {component} END RECORD
       component    -> IDENTIFIER {, IDENTIFIER} : type ;
       proc-body    -> IS {declaration} BEGIN statement {statement} END
       formals-list -> ( formal-param {; formal-param} )
       formal-param -> IDENTIFIER {, IDENTIFIER} : [mode] type
       mode         -> IN OUT
                    -> IN
                    -> OUT
       statement    -> READ ( expression {, expression} ) ;
                    -> WRITE ( write_expr {, write_expr} ) ;
                    -> IF boolean THEN statement {statement}
                          {ELSIF boolean THEN statement {statement}}
                          [ELSE statement {statement}] END IF ;
                    -> variable [:= expression] ;
                    -> [IDENTIFIER :] LOOP statement {statement} 
                           END LOOP ;
                    -> EXIT [IDENTIFIER] [WHEN boolean] ;
       write_expr   -> STRINGCONSTANT
                    -> expression
       boolean      -> expression relational-op expression
       relational-op-> >
                    -> <
                    -> =
                    -> >=
                    -> <=
                    -> /=
       expression   -> expression add-op expression
                    -> expression mult-op expression
                    -> ( expression )
                    -> [unary-op] expression
                    -> variable
                    -> INTCONSTANT
                    -> FLOATCONSTANT
       unary-op     -> -
                    -> +
       add-op       -> +
                    -> -
       mult-op      -> *
                    -> /
       variable     -> IDENTIFIER var-rest
       var-rest     -> ( expression {, expression} ) var-rest
                    -> . variable
                    ->
If you look carefully at the above BNF it is apparent that it will allow the construction of some syntactic structures which are illegal. While this is true, it is being done for two reasons:
  1. To make the construction of the LL(1) grammar simpler.
  2. To allow what would otherwise be syntactic errors to be more easily handled in the semantic routines of the compiler where they are more easily dealt with.