Kris Stewart
Computer Science Department
San Diego State University
05Feb08
Suppose someone asked you to write the grammar (rewrite rules) that would specify arithmetic expressions, using identifiers, constants, adds (+), multiplies (*) and parenthesis (which must be matched). Its clear that the following would generate valid arithmetic expressions beginning with start symbol, E, and ending with a series of tokens of the language:
1. E -> E + E
2. E -> E * E
3. E -> ( E )
4. E -> "identifier" (id for short)
5. E -> "constant"
Therefore, we could verify that the string id1 + id2 * id3 is a valid
string by presenting its derivation from the start symbol E using only
the rules of the grammar above.
E->E + E -> id1 + E -> id1 + E * E -> id1 + id2 * E -> id1 + id2 * id3 r1 r4 r2 r4 r4Note, we have followed the convention of always expanding the left-most nonterminal. We can also represent this derivation using a tree, each application of a grammar rule corresponding to nodes of the tree.
Tree #1 E
/ | \
E + E
| / | \
id1 E * E
| |
id2 id3
The problem is that even though we have enforced some determinism by
always choosing to expand the left most nonterminal remaining in the
sentential form, we have ambiguity in the grammar. There is more than
one left-most derivation of our target string:
E->E * E -> E + E * E -> id1 + E * E -> id1 + id2 * E -> id1 + id2 * id3 r2 r1 r4 r4 r4Again, we draw the derivation tree, where each branch corresponds to the application of a grammar rule:
Tree #2 E
/ | \
E * E
/ | \ |
E + E id3
| |
id1 id2
This lack of determinism, the two different derivation trees from one grammar
for this particular string, is what distinguishes this grammar as ambiguous.
As a general rule, ambiguous grammars are to be avoided.
There is an additional consideration in the above example which has to do with semantics (meaning). We've been taught from our first math courses that the string 2+5*4 should result in 22, that the multiplication is always done before addition (unless there are parentheses to override this). This is called precedence for the arithmetic operators. It is, in a sense, an arbitrary choice, but one that has been made.
Tree #1 above is correct, Tree #2 is wrong. We want our grammar to be unambiguous and to enforce the correct precendence. Fortunately, this is easy to fix.
A simple strategy to disambiguate:
Suppose we have the grammar rules:
E -> E @ E
E -> E # E
E -> E ! E
E -> E " E
where @, #, !, " are operators. We know is that their precedence is given by
# higher than @ , " higher than !
i.e. # is the highest precedence and should be performed first; @ and " are
at the same precedence (like + and - in our arithmetic) and should be performed
after # and before !, which is the lowest precedence.
Our rewrite strategy is to take the weakest operator and begin stratification
there by adding a new nonterminal, E1,
E -> E ! E1
E -> E1
The next to lowest operators were @ and ", so we add the nonterminal, E2:
E1 -> E1 @ E2
E1 -> E2
E1 -> E1 " E2
and we already have E1 -> E2
The highest precedence is #, adding E3, and yielding the rules
E2 -> E2 # E3
E2 -> E3
We can now use these disambiguating rules on the Expression Grammar. Start
with the lowest precedence, +, and call the non-terminal we must introduce,
T (term):
E -> E + T (E is for expression)
E -> T
Then we want the *, and the new non-terminal is called F (factor):
T -> T * F
T -> F
The highest precedence of all is given by the parenthesis:
F -> ( E )
F -> "identifier"
F -> "constant"
Hopefully, this gives some background for examining simple.lr and its grammar
rules.
After reading this, try to incorporate the unary minus with the correct precedence into the expression grammar below. And if you want to be really brave, incorporate exponentiation (which is right associative).