Ch3: Lexical Analysis
Text: Compilers: Principles, Techniques & Tools
28Jan08

Overview/Motivation
  1. Scanner = Lexical Analyzer. Goal is to group input characters into tokens. Tokens are the fundamental unit of the program to be translated, for example, reserved words of the language or user declared variable names.

  2. Precise definition of tokens is required, based on each programming language that we may wish to translated. For example, strings.

    • Quoted string in Pascal - the body of the string can be any sequence of characters except a quote character, which must be doubled.
      Is the newline-character allowed? Does it make sense to have a newline-character in a string?
    • C allows newline-character in a string, but it must have been escaped. Pascal forbids them. Ada goes further still and forbids all unprintable characters (because they are normally unreadable).
    • Bottom line - the designer of a language must choose the careful, complete definition of a string and stick with it. Usually need to justify choices, initially, then stay with it.

  3. Precise definition of token necessary to ensure that lexical rules are properly enforced. Consider fixed point decimal numbers such as 0.1 or 10.01. Should we also allow .1 or 10.?

    • Fortran and C allow them, but in Pascal and Ada they are not, for an interesting reason. Scanners seek to make a token as long as possible, so that ABC is scanned as one identifier and not three separate identifiers A, B, and C. Now consider the sequence 1..10
      • In Pascal and Ada (our project), we wish this to be recognized as a range specifier (1 .. 10). If we are not careful in defining our token, we might scan 1..10 as two real literals, 1. and .10, which could lead to syntax errors.

Once we have a formal specification of token and program structure, it is possible to examine a language, for design flaws.

Scanners perform much the same process, no matter which language they are translating. Therefore we do not wish to spend time to write a scanner, instead we learn to use a UNIX scanner generator, lex.

A convenient way to specify sets of strings is to use Regular Expressions. Regular expressions can be used to generate a scanner which we will study this week.

  • Lex stores input that is matched in the string variable yytext whose length is yylength). Commands may alter yytext in any way and then write the altered text to the output file.
  • Lex creates an integer function yylex() that may be called from the parser (i.e. yacc in our case) as is standard in syntax-directed compilation. Tokens like white space can be deleted simply by having their associated command not return anything.

    With this background, you should be able to examine our specification of tokens for the beginning language recognized by your compiler project:

  • simple.lex our scanner generator that is input to the UNIX tool lex

    Practical Considerations

    Reserved words, may look like identifiers, so care is taken in the definition of the language to avoid imposing restrictions on the language user.

    Section 3.6 Finite Automata (FA)

    You will want to become adept at moving between Regular Expressions and drawing the corresponding finite automata that recognizes them. As well as vice-versa - given a FA, write the regular expression.

    Our focus will be on DETERMINISTIC Finite Automata.

    You will be expected to draw the diagrams of the Finite Automata that correspond to regular expressions, and vice versa. For example, several problems in our text address this and midterm exam questions will be modeled on this.


  • Think of Lexical Analysis as the way to check the spelling for a language. Think of Parsing as the grammar or syntax of the language. We still need the semantic processing - does the sentence make any sense?


    FYI (Online Refers that may be useful to complement our textbook):

    en.wikipedia.org/wiki/Regular_expression
    Deterministic Finite State Machine en.wikipedia.org/wiki/Finite_state_automaton
    http://www.faqs.org/docs/htmltut/characterentities_famsupp_69.html HTML 4 representation of special characters, such as "lambda".