An Intuitive Discussion of Why Grammars Look the Way They Do Motivation for Reading Chapter 4: Syntax Analysis

Kris Stewart
Computer Science Department
San Diego State University
05Feb08

Suppose someone asked you to write the grammar (rewrite rules) that would specify arithmetic expressions, using identifiers, constants, adds (+), multiplies (*) and parenthesis (which must be matched). Its clear that the following would generate valid arithmetic expressions beginning with start symbol, E, and ending with a series of tokens of the language:

                         1.  E -> E + E
                         2.  E -> E * E
                         3.  E -> ( E )
                         4.  E -> "identifier"  (id for short)
                         5.  E -> "constant"
Therefore, we could verify that the string id1 + id2 * id3 is a valid string by presenting its derivation from the start symbol E using only the rules of the grammar above.
E->E + E -> id1 + E -> id1 + E * E -> id1 + id2 * E -> id1 + id2 * id3
 r1      r4         r2             r4               r4
Note, we have followed the convention of always expanding the left-most nonterminal. We can also represent this derivation using a tree, each application of a grammar rule corresponding to nodes of the tree.
Tree #1                           E
                               /  |  \
                              E   +   E
                              |    /  |  \
                            id1   E   *   E
                                  |       |
                                 id2     id3
The problem is that even though we have enforced some determinism by always choosing to expand the left most nonterminal remaining in the sentential form, we have ambiguity in the grammar. There is more than one left-most derivation of our target string:
E->E * E -> E + E * E -> id1 + E * E -> id1 + id2 * E -> id1 + id2 * id3
 r2      r1           r4             r4               r4
Again, we draw the derivation tree, where each branch corresponds to the application of a grammar rule:
Tree #2                           E
                               /  |  \
                             E    *    E
                          /  |  \      |
                         E   +   E    id3
                         |       |
                        id1     id2
This lack of determinism, the two different derivation trees from one grammar for this particular string, is what distinguishes this grammar as ambiguous. As a general rule, ambiguous grammars are to be avoided.

There is an additional consideration in the above example which has to do with semantics (meaning). We've been taught from our first math courses that the string 2+5*4 should result in 22, that the multiplication is always done before addition (unless there are parentheses to override this). This is called precedence for the arithmetic operators. It is, in a sense, an arbitrary choice, but one that has been made.

Tree #1 above is correct, Tree #2 is wrong. We want our grammar to be unambiguous and to enforce the correct precendence. Fortunately, this is easy to fix.

A simple strategy to disambiguate:

Suppose we have the grammar rules:

                              E -> E @ E
                              E -> E # E
                              E -> E ! E
                              E -> E " E
where @, #, !, " are operators. We know is that their precedence is given by
       #   higher than      @ , "    higher than     !
i.e. # is the highest precedence and should be performed first; @ and " are at the same precedence (like + and - in our arithmetic) and should be performed after # and before !, which is the lowest precedence. Our rewrite strategy is to take the weakest operator and begin stratification there by adding a new nonterminal, E1,
                             E -> E ! E1
                             E -> E1
The next to lowest operators were @ and ", so we add the nonterminal, E2:
                            E1 -> E1 @ E2
                            E1 -> E2
                            E1 -> E1 " E2
                                  and we already have E1 -> E2
The highest precedence is #, adding E3, and yielding the rules
                            E2 -> E2 # E3
                            E2 -> E3
We can now use these disambiguating rules on the Expression Grammar. Start with the lowest precedence, +, and call the non-terminal we must introduce, T (term):
                              E -> E + T        (E is for expression)
                              E -> T
Then we want the *, and the new non-terminal is called F (factor):
                              T -> T * F
                              T -> F
The highest precedence of all is given by the parenthesis:
                              F -> ( E )
                              F -> "identifier"
                              F -> "constant"

Hopefully, this gives some background for examining simple.lr and its grammar rules.

After reading this, try to incorporate the unary minus with the correct precedence into the expression grammar below. And if you want to be really brave, incorporate exponentiation (which is right associative).

Return to CS 524 Class Page