Kris Stewart
Computer Science Department
San Diego State University
05Feb08
Suppose someone asked you to write the grammar (rewrite rules) that would specify arithmetic expressions, using identifiers, constants, adds (+), multiplies (*) and parenthesis (which must be matched). Its clear that the following would generate valid arithmetic expressions beginning with start symbol, E, and ending with a series of tokens of the language:
1. E -> E + E 2. E -> E * E 3. E -> ( E ) 4. E -> "identifier" (id for short) 5. E -> "constant"Therefore, we could verify that the string id1 + id2 * id3 is a valid string by presenting its derivation from the start symbol E using only the rules of the grammar above.
E->E + E -> id1 + E -> id1 + E * E -> id1 + id2 * E -> id1 + id2 * id3 r1 r4 r2 r4 r4Note, we have followed the convention of always expanding the left-most nonterminal. We can also represent this derivation using a tree, each application of a grammar rule corresponding to nodes of the tree.
Tree #1 E / | \ E + E | / | \ id1 E * E | | id2 id3The problem is that even though we have enforced some determinism by always choosing to expand the left most nonterminal remaining in the sentential form, we have ambiguity in the grammar. There is more than one left-most derivation of our target string:
E->E * E -> E + E * E -> id1 + E * E -> id1 + id2 * E -> id1 + id2 * id3 r2 r1 r4 r4 r4Again, we draw the derivation tree, where each branch corresponds to the application of a grammar rule:
Tree #2 E / | \ E * E / | \ | E + E id3 | | id1 id2This lack of determinism, the two different derivation trees from one grammar for this particular string, is what distinguishes this grammar as ambiguous. As a general rule, ambiguous grammars are to be avoided.
There is an additional consideration in the above example which has to do with semantics (meaning). We've been taught from our first math courses that the string 2+5*4 should result in 22, that the multiplication is always done before addition (unless there are parentheses to override this). This is called precedence for the arithmetic operators. It is, in a sense, an arbitrary choice, but one that has been made.
Tree #1 above is correct, Tree #2 is wrong. We want our grammar to be unambiguous and to enforce the correct precendence. Fortunately, this is easy to fix.
A simple strategy to disambiguate:
Suppose we have the grammar rules:
E -> E @ E E -> E # E E -> E ! E E -> E " Ewhere @, #, !, " are operators. We know is that their precedence is given by
# higher than @ , " higher than !i.e. # is the highest precedence and should be performed first; @ and " are at the same precedence (like + and - in our arithmetic) and should be performed after # and before !, which is the lowest precedence. Our rewrite strategy is to take the weakest operator and begin stratification there by adding a new nonterminal, E1,
E -> E ! E1 E -> E1The next to lowest operators were @ and ", so we add the nonterminal, E2:
E1 -> E1 @ E2 E1 -> E2 E1 -> E1 " E2 and we already have E1 -> E2The highest precedence is #, adding E3, and yielding the rules
E2 -> E2 # E3 E2 -> E3We can now use these disambiguating rules on the Expression Grammar. Start with the lowest precedence, +, and call the non-terminal we must introduce, T (term):
E -> E + T (E is for expression) E -> TThen we want the *, and the new non-terminal is called F (factor):
T -> T * F T -> FThe highest precedence of all is given by the parenthesis:
F -> ( E ) F -> "identifier" F -> "constant"Hopefully, this gives some background for examining simple.lr and its grammar rules.
After reading this, try to incorporate the unary minus with the correct precedence into the expression grammar below. And if you want to be really brave, incorporate exponentiation (which is right associative).