Chap 4: Syntax Analysis - Grammars and Parsing
05Feb08

CS 524 Spr08 students - keep in mind our focus will be on LR grammars, which are introduced in Sections 4.5 (Bottom-Up Parsing) and 4.6 (Intro to LR Parsing: Simple LR). Our text covers both LL grammars (top-down parsing) and LR grammars (bottom-up parsing).

4.1.1 The Role of the Parser
Excellent Figure 4.1 Position of parser in compiler model - show with document viewer

4.2 Context-Free Grammars: Concepts and Notation
We formalize our definitions and introduce some useful notation.

A Context Free Grammer (CFG) is defined by the following four components:

  1. A finite terminal vocabulary Vt; this is the token set produced by the scanner.
  2. A finite set of different, intermediate symbols, called the nonterminal vocabulatory Vn.
  3. A start symbol S ∈ Vn that starts all derivations. A start symbol is sometimes called the goal symbol/
  4. P, a finite set of productions (sometimes called rewrite rules) of the form
    A → X1 Xm,
    where
    A ∈ Vn,
    Xi Vn Vt, 1 ≤ i ≤ m, m ≥ 0

    Note that A → λ is a valid production.

Starting with S, nonterminals are rewritten using productions until only terminals remain (at which point the derivation is done). The set of strings derivable from S comprises the context-free grammar G, denoting L(G).

These components are often grouped into a "four-tuple" (Vt, Vn, S, P), which is the formal definition of the CFG. The vocabulary V of a CFG is the set of terminal and nontemrinal symbols, i.e.

V = V t V n.

In describing CFGs and their parsers, it is sometimes important to distinguish whether a single symbol is required or whether a string of symbols is possible. Similarly, sometimes only a terminal or nonterminal symbol is appropriate, and at other times any vocabulary symbol may occur. To clarify exactly what sorts of symbol strings are expected, we will use the following notational conventions:

Using this notation, a production would be written as

A → α or A → X1 Xm

Often more than one production shares the same right-hand-side. Rather than repeat the left-hand side, an "or notation" is used:

A → α | β | … | ζ

If A → γ is a production, then α A β ⇒ αγβ, where denotes a one-step derivation (using production A → γ ).

We extend to +, derived in one or more step, and *, derived in zero or more steps.

If A⇒*β, then β is said to be a sentential form of the CFG.

SF(G) is the set of sentential forms of grammar G.

4.2 Errors in Context-Free Grammar

begin < Stmts > end $

  • < Stmts > → < Stmt > ; < Stmts >
  • < Stmts > → λ
  • < Stmt > → SimpleStmt
  • < Stmt > → begin < Stmt > end

    We will draw the diagram for a Top-down parse and Bottom-up parse for the input stream

    begin SimpleStmt ; SimpleStmt ; end $

    A bottom-up parse proceeds by discovering subtrees and linking them into increasingly larger trees.

    The productions predicted by a top-down parser represent a leftmost derivation; hence such parsers are said to produce a leftmost parse. The sequence of producton recognized by a bottom-up parser is termed a rightmost parse; and is the exact reverse of the production sequence that represents a rightmost derivation.

    Another intuitive example of how to disambiguate a grammar is provided by the Expression Grammar. intuitive-grm-spr08.html