Ch3: Lexical Analysis

Ch3: Lexical Analysis
Text: Compilers: Principles, Techniques & Tools
28Jan08

Overview/Motivation

Scanner = Lexical Analyzer. Goal is to group input characters into tokens. Tokens are the fundamental unit of the program to be translated, for example, reserved words of the language or user declared variable names.
Precise definition of tokens is required, based on each programming language that we may wish to translated. For example, strings.
- Quoted string in Pascal - the body of the string can be any sequence of characters except a quote character, which must be doubled.
  Is the newline-character allowed? Does it make sense to have a newline-character in a string?
- C allows newline-character in a string, but it must have been escaped. Pascal forbids them. Ada goes further still and forbids all unprintable characters (because they are normally unreadable).
- Bottom line - the designer of a language must choose the careful, complete definition of a string and stick with it. Usually need to justify choices, initially, then stay with it.
Precise definition of token necessary to ensure that lexical rules are properly enforced. Consider fixed point decimal numbers such as 0.1 or 10.01. Should we also allow .1 or 10.?
- Fortran and C allow them, but in Pascal and Ada they are not, for an interesting reason. Scanners seek to make a token as long as possible, so that ABC is scanned as one identifier and not three separate identifiers A, B, and C. Now consider the sequence 1..10
  - In Pascal and Ada (our project), we wish this to be recognized as a range specifier (1 .. 10). If we are not careful in defining our token, we might scan 1..10 as two real literals, 1. and .10, which could lead to syntax errors.

Once we have a formal specification of token and program structure, it is possible to examine a language, for design flaws.

Scanners perform much the same process, no matter which language they are translating. Therefore we do not wish to spend time to write a scanner, instead we learn to use a UNIX scanner generator, lex.

A convenient way to specify sets of strings is to use Regular Expressions. Regular expressions can be used to generate a scanner which we will study this week.

We use the following definitions:
A set of strings defined by regular expressions are termed regular sets. We start with a finite character set, or vocabulary, denoted V. This vocabulary is the character set used to form tokens.
An empty or null string is allowed (denoted λ or "lambda").
Strings are built from characters in V by catenation, for example
:= begin 123.
The null string, when catenated with any string s, yields itself. Therefore, sλ = λs = s.
Catenation can be extended to sets of strings as follows:
Let P and Q be sets of strings. Then string
s ∈ (P Q) if and only if s can be broken into two pieces: s = s₁s₂ such that s₁ ∈ P and s₂ ∈ Q.

Small finite sets are conveniently represented by listing their elements, which can be individual characters or strings of characters.
Parenthesis are used to delimit expressions, and | (the alternation operator), is used to separate alternatives.
The characters (, ), ', *, + and | are meta-characters that represent punctuation and regular expression operators. Meta-characters must be quoted when used as ordinary characters to avoid ambiguity.
For example, simple.lex, our project's Lex file, gives us many examples. One might also see a list of delimiters specified as
Delim = ( '(' | ')' | := | ; | , | '+' | - | '*' | / | = | $$$ )

Alternation can be extended to sets of strings. Let P and Q be set of strings. Then string
s ∈ (P | Q) if and only if s ∈ P or s ∈ Q.
Large (or infinite sets) are conveniently represented by operations on a finite set of characters and strings. Catenation and alternation can be used.
The operator ^* represents the postfix Kleene Closure operator.
Let P be a set of strings. String s ∈ P^* if and only if s can be broken into zero or more pieces:
s = s₁ s₂ s₃ ... s_n
such that each s_i ∈ P.
A string in P^* is the catenation of zero or more selections (possibly repeated) from P. We explicitly allow n=0, so that λ is always in P^*.

Regular Expressions Defined

φ (Greek-phi) is a regular expression denoting the empty set (the set containing no strings)
λ is a regular expression denoting the set that contains only the empty string. Note that this set is not the same as the empty set, because it contains one element.
A string s is a regular expression denoting the set containing only s. If s contains meta-characters, s can be quoted to avoid ambiguity.
If A and B are regular expressions, then A | B, A B, and A^* are also regular expressions, denoting the alternation, catenation and Kleene closure of corresponding regular sets.

Any finite set of strings can be represented by a regular expression of the form (s₁ | s₂ | ... | s_k )
Operations on sets that will be convenient for our class:

P⁺ denotes all strings consisting of 1 or more strings in P (Positive Closure)

If A is a set of characters, Not(A) = (V - A) can be extended to sets using if S is a set of strings, Not(S) = (V^* - S)

If k is a constant then set A^k denoted all strings formed by catenating k (not necessarily different) strings from A.
A^k = ( A A A ... ) k copies

For example, D = ( 0|1|2...|9) L = (A|...|Z )
Allowing us to define the Ada Comment:
Comment = -- Not(Eol)^*Eol
and the Fixed decimal literal
Lit = D⁺.D⁺
and identifiers composed of letters, digits, underscore, must begin with a letter, ends with letter or digital with no consecutive underscores
ID = L (L|D)^* (_(L|D)⁺)^*
Comments delimited by ##, which allow a single # within the comment body
Comment2 = ##((#|lambda)Not(#))^*##

Finite Automata and Scanners
Use the Document Projector in Smart Classroom to work with Diagrams from this section of our text.

We will be using the UNIX scanner-generator, lex (Section 3.5), and have the "fundamental characters":
letter [a-zA-Z_] digit [0-9] letter_or_digit [a-zA-Z_0-9] white_space [ \t\n] blank [ \t] other .

Lex also defines character classes, delimited by [ and ]. For example, [xyz] represents the class that can match an x, y or z. Ranges of characters are separated by - so that [x-z] is the same as [xyz]. Lex follows the C conventions using \n for newline (or EOL), \t for tab, \\ for the backslash symbol and \10 is the character corresponding to octal 10. The ^ symbol (caret) complements a character class (as Not does). Therefore [^xy] is the character class that matches all characters except x and y.

Lex provides standard regular expression operators, as well as some additions.

Catenation is specified by the juxtaposition of two expressions, thus [ab][cd] will match any of ad, ac, bc and bd.
When outside of character class brackets, individual letters and numbers match themselves; other characters should be quote to avoid misinterpretation as regular expression operators.
For exammple, begin can be matched by the expressions begin, "begin" or [b][e][g][i][n].
Case is significant for lex. The alternation operator is | and parenthesis can be used to control grouping of subexpressions.
To match the reserved word end, allowing any mixture of upper- and lower-case letter, we could use

(E|e)(N|n)(D|d)

Postfix operators ^* (Kleene closure) and ⁺ (positive closure) are provided.
Lex also provides the used of ? (optional inclusion). For example,
Expr? matches Expr zero times or once. It is equivalent to
(expr) | λ
and obviates the need for an explict λ symbol.

The { and } symbols signal the macroexpansion of a symbol defined in the first section of the lex-file, referred to above as "fundamental characters". In our simple.lex file, we define
digit [0-9]
so that {digit}+ expands to [0-9]+ which may help with readability, once you become comfortable with this notation.

The Lex input file has sections delimited by %%.
Fundament Characters %% Regular expressions and the command to execute when recognized return token(): %% any additional C-code, as appropriate

Lex stores input that is matched in the string variable yytext whose length is yylength). Commands may alter yytext in any way and then write the altered text to the output file.
Lex creates an integer function yylex() that may be called from the parser (i.e. yacc in our case) as is standard in syntax-directed compilation. Tokens like white space can be deleted simply by having their associated command not return anything.
With this background, you should be able to examine our specification of tokens for the beginning language recognized by your compiler project:
simple.lex our scanner generator that is input to the UNIX tool lex

Practical Considerations
Reserved words, may look like identifiers, so care is taken in the definition of the language to avoid imposing restrictions on the language user.

Section 3.6 Finite Automata (FA)
You will want to become adept at moving between Regular Expressions and drawing the corresponding finite automata that recognizes them. As well as vice-versa - given a FA, write the regular expression.
Our focus will be on DETERMINISTIC Finite Automata.
You will be expected to draw the diagrams of the Finite Automata that correspond to regular expressions, and vice versa. For example, several problems in our text address this and midterm exam questions will be modeled on this.

Think of Lexical Analysis as the way to check the spelling for a language. Think of Parsing as the grammar or syntax of the language. We still need the semantic processing - does the sentence make any sense?

FYI (Online Refers that may be useful to complement our textbook):

en.wikipedia.org/wiki/Regular_expression
Deterministic Finite State Machine en.wikipedia.org/wiki/Finite_state_automaton
http://www.faqs.org/docs/htmltut/characterentities_famsupp_69.html HTML 4 representation of special characters, such as "lambda".

Ch3: Lexical Analysis Text: Compilers: Principles, Techniques & Tools 28Jan08

Regular Expressions Defined

Ch3: Lexical Analysis
Text: Compilers: Principles, Techniques & Tools
28Jan08