Chap 4: Syntax Analysis

Chap 4: Syntax Analysis - Grammars and Parsing
05Feb08

CS 524 Spr08 students - keep in mind our focus will be on LR grammars, which are introduced in Sections 4.5 (Bottom-Up Parsing) and 4.6 (Intro to LR Parsing: Simple LR). Our text covers both LL grammars (top-down parsing) and LR grammars (bottom-up parsing).

4.1.1 The Role of the Parser

Excellent Figure 4.1 Position of parser in compiler model - show with document viewer

4.2 Context-Free Grammars: Concepts and Notation

We formalize our definitions and introduce some useful notation.

A context-free grammar (CFG) is defined by the following four components:

A finite terminal vocabulary V_t; this is the token set produced by the scanner.
A finite set of different, intermediate symbols, called the nonterminal vocabulary V_n.
A start symbol S ∈ V_n that starts all derivations. A start symbol is sometimes called the goal symbol.
P, a finite set of productions (sometimes called rewrite rules) of the form
A → X₁ … X_m,
where
A ∈ V_n,
X_i ∈ V_n ∪ V_t, 1 ≤ i ≤ m, m ≥ 0
Note that A → λ is a valid production.

Starting with S, nonterminals are rewritten using productions until only terminals remain (at which point the derivation is done). The set of terminal strings derivable from S comprises the context-free language of grammar G, denoted L(G).

These components are often grouped into a "four-tuple" (V_t, V_n, S, P), which is the formal definition of the CFG. The vocabulary V of a CFG is the set of terminal and nonterminal symbols, that is,

V = V_t ∪ V_n.
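
The four-tuple maps directly onto a small data structure. The following is a minimal sketch, not a fixed representation: the class name `CFG`, its field names, and the example grammar S → a S b | λ are all assumptions made for illustration, and V_t and V_n are assumed to be disjoint.

```python
from dataclasses import dataclass

LAMBDA = ()  # the empty right-hand side, as in A → λ


@dataclass(frozen=True)
class CFG:
    terminals: frozenset     # V_t
    nonterminals: frozenset  # V_n
    start: str               # S
    productions: tuple       # P, as (lhs, rhs) pairs; rhs is a tuple of symbols

    def vocabulary(self):
        """V = V_t ∪ V_n."""
        return self.terminals | self.nonterminals

    def check(self):
        """Verify the conditions stated in the definition."""
        assert self.start in self.nonterminals            # S ∈ V_n
        assert not (self.terminals & self.nonterminals)   # assumed disjoint
        for lhs, rhs in self.productions:
            assert lhs in self.nonterminals               # A ∈ V_n
            assert all(x in self.vocabulary() for x in rhs)  # X_i ∈ V_n ∪ V_t
        return True


# Hypothetical example grammar: S → a S b | λ
g = CFG(
    terminals=frozenset({"a", "b"}),
    nonterminals=frozenset({"S"}),
    start="S",
    productions=(("S", ("a", "S", "b")), ("S", LAMBDA)),
)
print(g.check())
print(sorted(g.vocabulary()))
```

Storing each right-hand side as a tuple makes m = 0 (the λ production) an ordinary empty tuple rather than a special case.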

In describing CFGs and their parsers, it is sometimes important to distinguish whether a single symbol is required or whether a string of symbols is possible. Similarly, sometimes only a terminal or nonterminal symbol is appropriate, and at other times any vocabulary symbol may occur. To clarify exactly what sorts of symbol strings are expected, we will use the following notational conventions:

a, b, c, … denote symbols in V_t
A, B, C, … denote symbols in V_n
U, V, W, … denote symbols in V
α, β, γ, … denote strings in V^*
u, v, w, … denote strings in V_t^*

Using this notation, a production would be written as

A → α or A → X₁ … X_m

Often more than one production shares the same left-hand side. Rather than repeat the left-hand side, an "or notation" is used:

A → α | β | … | ζ

If A → γ is a production, then αAβ ⇒ αγβ, where ⇒ denotes a one-step derivation (using production A → γ).

We extend ⇒ to ⇒⁺, derived in one or more steps, and ⇒^*, derived in zero or more steps.

If S ⇒^* β, then β is said to be a sentential form of the CFG.

SF(G) is the set of sentential forms of grammar G.
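
The derivation relations can be made concrete with a short sketch. It assumes a hypothetical grammar S → a S b | λ (not one from the text) and represents a string in V^* as a tuple of symbols; `one_step` and `sentential_forms` are illustrative names.

```python
# Hypothetical grammar: S → a S b | λ, as a map from nonterminal to right-hand sides.
PRODUCTIONS = {"S": [("a", "S", "b"), ()]}


def one_step(form):
    """Return every string reachable from `form` by a single ⇒ step."""
    results = []
    for i, symbol in enumerate(form):
        for rhs in PRODUCTIONS.get(symbol, []):
            # α A β ⇒ α γ β, using production A → γ
            results.append(form[:i] + rhs + form[i + 1:])
    return results


def sentential_forms(start, max_steps):
    """Collect the forms β with start ⇒* β in at most `max_steps` steps."""
    seen = {(start,)}          # zero steps: the start symbol itself
    frontier = {(start,)}
    for _ in range(max_steps):
        frontier = {new for form in frontier for new in one_step(form)} - seen
        seen |= frontier
    return seen


print(one_step(("S",)))                # [('a', 'S', 'b'), ()]
print(len(sentential_forms("S", 2)))   # 5
```

Since SF(G) is infinite for this grammar, the sketch bounds the number of derivation steps. The forms that contain only terminals, such as `('a', 'b')` and the empty tuple, are the members of L(G) found so far.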

4.2 Errors in Context-Free Grammars

A grammar may have useless nonterminals. Consider the grammar, G₁:
S → A | B
A → a
B → B b
C → c
The nonterminal C cannot be reached from S (the start symbol), and the nonterminal B derives no terminal string. Nonterminals that are unreachable or derive no terminal string are termed useless. Useless nonterminals (and productions that involve them) can be safely removed from a grammar without changing the language defined by the grammar. A grammar containing useless nonterminals is said to be nonreduced. After useless nonterminals are removed, the grammar is reduced. G₁ is nonreduced. After B and C are removed, we obtain an equivalent grammar, G₂, which is reduced:

S → A
A → a
Many parser generators (like yacc) check to see if a grammar is reduced. If it is not, the grammar probably contains errors (often caused by mistyping the grammar specification).
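
One way such a check might work is sketched below. This is an illustration, not the algorithm yacc uses. The input grammar (S → A | B, A → a, B → B b, C → c) is an assumed example in which C is unreachable and B derives no terminal string; any symbol that is not a key of the dictionary is treated as a terminal.

```python
# Assumed example grammar: S → A | B, A → a, B → B b, C → c
GRAMMAR = {
    "S": [("A",), ("B",)],
    "A": [("a",)],
    "B": [("B", "b")],
    "C": [("c",)],
}
START = "S"


def productive(grammar):
    """Nonterminals that derive at least one terminal string."""
    found = set()
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            if lhs in found:
                continue
            for rhs in alternatives:
                # Every symbol must be a terminal or an already productive nonterminal.
                if all(x not in grammar or x in found for x in rhs):
                    found.add(lhs)
                    changed = True
                    break
    return found


def reachable(grammar, start):
    """Nonterminals that occur in some string derivable from `start`."""
    found = {start}
    stack = [start]
    while stack:
        for rhs in grammar.get(stack.pop(), []):
            for x in rhs:
                if x in grammar and x not in found:
                    found.add(x)
                    stack.append(x)
    return found


def reduce_grammar(grammar, start):
    """Remove unproductive nonterminals first, then unreachable ones."""
    keep = productive(grammar)
    pruned = {
        lhs: [rhs for rhs in alts if all(x not in grammar or x in keep for x in rhs)]
        for lhs, alts in grammar.items()
        if lhs in keep
    }
    live = reachable(pruned, start)
    return {lhs: alts for lhs, alts in pruned.items() if lhs in live}


print(reduce_grammar(GRAMMAR, START))  # {'S': [('A',)], 'A': [('a',)]}
```

The order of the two passes matters: removing an unproductive nonterminal can make other nonterminals unreachable, so reachability is computed on the pruned grammar.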
A more serious grammar flaw is that sometimes a grammar allows a program to have two or more different parse trees (and thus a nonunique structure). Consider the following grammar, which generates expressions using just the infix operator -
1. <expr> → <expr> - <expr>
2. <expr> → ID
This grammar yields two different parse trees for ID - ID - ID: one groups the input as (ID - ID) - ID and the other as ID - (ID - ID). This is not a good thing.
Grammars that allow different parse trees for the same terminal string are termed ambiguous. They are rarely used because a unique structure (that is, parse tree) cannot be guaranteed for all inputs, and hence a unique translation guided by the parse tree may not be obtained.
We would like an algorithm that checks to see if a grammar is ambiguous. However, it is undecidable whether a given CFG is ambiguous (Hopcroft and Ullman 1969), so no such algorithm can exist. Fortunately, for certain grammar classes, we can prove that constituent grammars are unambiguous.
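
Although no general test exists, ambiguity for one particular input can be exhibited by brute force. The sketch below enumerates every parse tree of a token list under the grammar <expr> → <expr> - <expr> | ID. As an assumption for illustration, integers stand in for the ID tokens so that the two trees can also be evaluated; the function names are hypothetical.

```python
def parse_trees(tokens):
    """All parse trees of `tokens` under <expr> → <expr> - <expr> | ID."""
    if len(tokens) == 1 and tokens[0] != "-":
        return [tokens[0]]                      # production 2: a single ID leaf
    trees = []
    for i, tok in enumerate(tokens):
        if tok == "-":                          # production 1, split at this operator
            for left in parse_trees(tokens[:i]):
                for right in parse_trees(tokens[i + 1:]):
                    trees.append((left, "-", right))
    return trees


def evaluate(tree):
    """Translate a tree into a value, guided by its structure."""
    if isinstance(tree, tuple):
        left, _, right = tree
        return evaluate(left) - evaluate(right)
    return tree


# 5 - 3 - 2, with integers standing in for the three ID tokens
trees = parse_trees([5, "-", 3, "-", 2])
for t in trees:
    print(t, "=", evaluate(t))
# (5, '-', (3, '-', 2)) = 4
# ((5, '-', 3), '-', 2) = 0
```

The two trees give different values, which is exactly the nonunique translation that makes ambiguous grammars unsuitable. Finding two trees for one input proves a grammar ambiguous; failing to find any for the inputs tried proves nothing.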
The most potentially serious flaw that a grammar might have is that it generates the "wrong language".

< Program > → begin < Stmts > end $

< Stmts > → < Stmt > ; < Stmts >

< Stmts > → λ

< Stmt > → SimpleStmt

< Stmt > → begin < Stmt > end

We will draw the diagrams for a top-down parse and a bottom-up parse of the input stream

begin SimpleStmt ; SimpleStmt ; end $

A bottom-up parse proceeds by discovering subtrees and linking them into increasingly larger trees.

The productions predicted by a top-down parser represent a leftmost derivation; hence such parsers are said to produce a leftmost parse. The sequence of productions recognized by a bottom-up parser is termed a rightmost parse; it is the exact reverse of the production sequence that represents a rightmost derivation.
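
The relationship between the two parse orders can be checked on the example input. The sketch below numbers the productions of the statement grammar 1 to 5 in the order listed, builds the parse tree with a small recursive-descent parser, and reads both sequences off the tree. The name `Program` for the start symbol, the production numbering, and the one-token `predict` rule are assumptions for illustration; production 5 is copied as written above.

```python
# Productions of the statement grammar. "Program" is an assumed start-symbol name.
PRODUCTIONS = {
    1: ("Program", ["begin", "Stmts", "end", "$"]),
    2: ("Stmts", ["Stmt", ";", "Stmts"]),
    3: ("Stmts", []),                      # Stmts → λ
    4: ("Stmt", ["SimpleStmt"]),
    5: ("Stmt", ["begin", "Stmt", "end"]),
}
NONTERMINALS = {"Program", "Stmts", "Stmt"}


def predict(nonterminal, lookahead):
    """Choose a production number using one token of lookahead."""
    if nonterminal == "Program":
        return 1
    if nonterminal == "Stmts":
        return 2 if lookahead in ("SimpleStmt", "begin") else 3
    return 4 if lookahead == "SimpleStmt" else 5


def parse(tokens):
    """Top-down parse; returns a tree of (production number, subtrees) nodes."""
    pos = 0

    def expand(nonterminal):
        nonlocal pos
        lookahead = tokens[pos] if pos < len(tokens) else None
        number = predict(nonterminal, lookahead)
        children = []
        for symbol in PRODUCTIONS[number][1]:
            if symbol in NONTERMINALS:
                children.append(expand(symbol))
            elif pos < len(tokens) and tokens[pos] == symbol:
                pos += 1                   # match a terminal
            else:
                raise SyntaxError(f"expected {symbol!r} at position {pos}")
        return (number, children)

    return expand("Program")


def leftmost(tree):
    """Productions of the leftmost derivation: expand the leftmost subtree first."""
    number, children = tree
    return [number] + [n for child in children for n in leftmost(child)]


def rightmost(tree):
    """Productions of the rightmost derivation: expand the rightmost subtree first."""
    number, children = tree
    return [number] + [n for child in reversed(children) for n in rightmost(child)]


tree = parse("begin SimpleStmt ; SimpleStmt ; end $".split())
print(leftmost(tree))          # leftmost parse:  [1, 2, 4, 2, 4, 3]
print(rightmost(tree)[::-1])   # rightmost parse: [4, 4, 3, 2, 2, 1]
```

The reversed rightmost derivation begins with the two reductions to < Stmt > and ends with production 1, matching the way a bottom-up parser discovers small subtrees first and links them into the root last.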

Another intuitive example of how to disambiguate a grammar is provided by the Expression Grammar (see intuitive-grm-spr08.html).
