Ch3: Lexical Analysis
Text: Compilers: Principles, Techniques & Tools
- Scanner = Lexical Analyzer.
Goal is to group input characters
into tokens. Tokens are the fundamental unit of the program to be
translated, for example, reserved words of the language or user declared
- Precise definition of tokens is required, based on each programming
language that we may wish to translated. For example, strings.
- Quoted string in Pascal - the body of the string can be any
sequence of characters except a quote character, which must be
Is the newline-character allowed? Does it make sense
to have a newline-character in a string?
- C allows newline-character in a string, but it must have been
escaped. Pascal forbids them. Ada goes further still and forbids
all unprintable characters (because they are normally unreadable).
- Bottom line - the designer of a language must choose the careful,
complete definition of a string and stick with it. Usually need to
justify choices, initially, then stay with it.
- Precise definition of token necessary to ensure that lexical
rules are properly enforced. Consider fixed point decimal numbers
such as 0.1 or 10.01. Should we also allow .1 or 10.?
- Fortran and C allow them, but in Pascal and Ada they are not, for
an interesting reason. Scanners seek to make a token as long as
possible, so that ABC is scanned as one identifier and not three
separate identifiers A, B, and C. Now consider the sequence 1..10
- In Pascal and Ada (our project), we wish this to be
recognized as a range specifier (1 .. 10).
If we are not careful in defining our token, we
might scan 1..10 as two real literals, 1. and .10, which could lead to
- Once we have a formal specification of token and program structure,
it is possible to examine a language, for design flaws.
- Scanners perform much the same process, no matter which language
they are translating. Therefore we do not wish to spend time to write
a scanner, instead we learn to use a UNIX scanner generator,
- A convenient way to specify sets of strings is to use Regular
Expressions. Regular expressions can be used to generate a scanner
which we will study this week.
We use the following definitions:
- A set of strings defined by regular expressions are termed
regular sets. We start with a finite character set, or
vocabulary, denoted V. This vocabulary is the character
set used to form tokens.
- An empty or null string is allowed (denoted λ or "lambda").
- Strings are built from characters in V by catenation,
:= begin 123.
- The null string, when catenated with any string s, yields
itself. Therefore, sλ = λs = s.
- Catenation can be extended to sets of strings as follows:
Let P and Q be sets of strings. Then string
s ∈ (P Q) if
and only if s can be broken into two pieces: s = s1s2
such that s1 ∈ P and s2 ∈ Q.
- Small finite sets are conveniently represented by listing their
elements, which can be individual characters or strings of characters.
- Parenthesis are used to delimit expressions, and | (the alternation
operator), is used to separate alternatives.
- The characters (, ), ', *, + and | are meta-characters that
represent punctuation and regular expression operators. Meta-characters
must be quoted when used as ordinary characters to avoid ambiguity.
- For example,
simple.lex, our project's Lex file, gives us
many examples. One might also see a list of delimiters specified as
Delim = ( '(' | ')' | := | ; | , | '+' | - | '*' | / | = | $$$ )
- Alternation can be extended to sets of strings. Let P and
Q be set of strings. Then string
s ∈ (P | Q) if and only
if s ∈ P or s ∈ Q.
- Large (or infinite sets) are conveniently represented by
operations on a finite set of characters and strings. Catenation
and alternation can be used.
- The operator * represents the postfix Kleene Closure operator.
Let P be a set of strings. String s ∈ P* if and only if s can be
broken into zero or more pieces:
s = s1 s2 s3 ... sn
such that each si ∈ P.
A string in P* is the catenation of zero or more selections
(possibly repeated) from P. We explicitly allow n=0, so that λ
is always in P*.
Regular Expressions Defined
- φ (Greek-phi) is a regular expression denoting the empty
set (the set containing no strings)
- λ is a regular expression denoting the set that contains
only the empty string. Note that this set is not the same as the empty
set, because it contains one element.
- A string s is a regular expression denoting the set containing
only s. If s contains meta-characters, s can be quoted to
- If A and B are regular expressions, then
A | B, A B, and A* are also regular
expressions, denoting the alternation, catenation and Kleene closure of
corresponding regular sets.
Any finite set of strings can be represented by a regular expression of
the form (s1 | s2 | ... | sk )
Operations on sets that will be convenient for our class:
- P+ denotes all strings consisting of 1 or more strings in P
- If A is a set of characters, Not(A) = (V - A) can be extended to sets
using if S is a set of strings, Not(S) = (V* - S)
- If k is a constant then set Ak denoted all strings formed
by catenating k (not necessarily different) strings from A.
Ak = ( A A A ... ) k copies
For example, D = ( 0|1|2...|9) L = (A|...|Z )
Allowing us to define the Ada Comment:
Comment = -- Not(Eol)*Eol
and the Fixed decimal literal
Lit = D+.D+
and identifiers composed of letters, digits, underscore, must begin with a
letter, ends with letter or digital with no consecutive underscores
ID = L (L|D)* (_(L|D)+)*
Comments delimited by ##, which allow a single # within the comment body
Comment2 = ##((#|lambda)Not(#))*##
Finite Automata and Scanners
Use the Document Projector in Smart Classroom to work with Diagrams
from this section of our text.
We will be using the UNIX scanner-generator, lex (Section 3.5),
and have the "fundamental characters":
white_space [ \t\n]
blank [ \t]
Lex also defines character classes, delimited by [ and
]. For example, [xyz] represents the class that can match an
x, y or z. Ranges of characters are separated by - so
that [x-z] is the same as [xyz]. Lex follows the C conventions
using \n for newline (or EOL), \t for tab, \\ for
the backslash symbol and \10 is the character corresponding to
octal 10. The ^ symbol (caret) complements a character class
(as Not does). Therefore [^xy] is the character class that matches
all characters except x and y.
Lex provides standard regular expression operators, as well as some additions.
- The Lex input file has sections delimited by %%.
Regular expressions and the command to execute when recognized
any additional C-code, as appropriate
- Lex stores input that is matched in the string variable yytext
whose length is yylength). Commands may alter yytext in any
way and then write the altered text to the output file.
- Lex creates an integer function yylex() that may be
called from the parser (i.e. yacc in our case) as is standard in
syntax-directed compilation. Tokens like white space can be deleted simply
by having their associated command not return anything.
With this background, you should be able to examine our specification of
tokens for the beginning language recognized by your compiler project:
- simple.lex our scanner generator that is input to the UNIX tool
- Practical Considerations
Reserved words, may look like identifiers, so care is taken in the
definition of the language to avoid imposing restrictions on the
- Section 3.6 Finite Automata (FA)
You will want to become adept at moving between Regular Expressions and
drawing the corresponding finite automata that recognizes them. As well
as vice-versa - given a FA, write the regular expression.
Our focus will be on DETERMINISTIC Finite Automata.
You will be expected to draw the diagrams of the Finite Automata that
correspond to regular expressions, and vice versa. For example, several
problems in our text address this and midterm exam questions will be
modeled on this.
Think of Lexical Analysis as the way to check the spelling for
a language. Think of Parsing as the grammar or syntax of the language.
We still need the semantic processing - does the sentence make any sense?
FYI (Online Refers that may be useful to complement our textbook):
Deterministic Finite State Machine
http://www.faqs.org/docs/htmltut/characterentities_famsupp_69.html HTML 4
representation of special characters, such as "lambda".