CS 524 - Compiler Construction
Main Software Project
Spring 2008 - San Diego State University
Kris Stewart

This URL: https://stewart.sdsu.edu/cs524/spr08/Project/project.html

Return to CS 524 class page
Upd: 26March08 * Project Deadlines

This is a long document. I have tried to collect information in one place that you will need as the semester goes along. Do not be intimidated by all the details. The project is set up to ease you into things. The first phase is a one line change to the file simple.lex; each following phase is more involved, but tries to isolate on a single file for the change. At the end of the semester, you should feel comfortable with making big changes - but not until the end of the semester. The beginning code that you are given is a working compiler with many flags to govern its debugging and output capabilities. You should examine your copy of the early compiler to get comfortable with its capabilities. Its diagnostic output will be very helpful later on.

Table of Contents

Graphical Overview of Project gif file
Background
Obtaining the Initial Project
Tools
Documents
Brief Outline of the Phases of the Project
Due Dates - Grade on the Project (based on Phase Progress)
Project Check Directions
Recommendations on How to Proceed
The Beginning Compiler Project
The Compiler Project Files

Background

A major portion of this course centers on actual hands-on experience writing a compiler and coming to understand how it accomplishes the automatic translation into assembly language. YOU WILL NOT BE WRITING A COMPILER FROM SCRATCH. Instead, you are given a full, working compiler for a limited language. Over the semester we extend this language to resemble a modern day language. We are using the text, Crafting a Compiler in C by Fischer and LeBlanc and will be closely following the recommendations in the text for technique to implement ADA/CS, a subset of Ada.

A compiler is an essential tool that most students in computer science are familiar with only as a tool, you've never seen the insides. I hope that after this course you will have a better understanding of, and perhaps some sympathy for, some of the compilers you use and their lack of meaningful diagnostic messages, especially at run time.

There are several distinct parts in a compiler. This description will give an overview of the code that comprises your project for CS 524.

We use the simulated machine: MIPS R2000. Dr. James Larus, from the University of Wisconsin, developed the simulator, SPIM, to be used in conjunction with the text, "Computer Organization & Design: The Hardware/Software Interface" by Patterson and Hennessy.
Top


Obtaining Initial Compiler

  1. Your initial project should be copied from stewart's account on Rohan. Log into your own Rohan class account and create a directory

    mkdir Phase0

    Move into this directory

    cd Phase0

    and copy the files from stewart's directory

    cp ~stewart/cs524/simple/* .

    Do not forget the final period in the "cp" above. You may get a warning about some files that were not copied, do not worry about them.

  2. Type

    make

    generating the following text response, make-output.txt, and you now will have your own version of the initial compiler. You can work through some of the following examples using the compiler options.

  3. As a first look, just type

    simple

    and see what the options and compiler directives available are.

    rohan% simple
    in     simple.c
    
    No parameters specified
    usage :
    simple [-l list_name][-o code_file][-nc][-nl][-d][-a][-i][-s][-v] source_file
    
            -o name : rename the assembly code file
            -l name : rename the output list file
            -nc     : do not generate code
            -nl     : do not generate listing
            -d      : turn off Tokentrace/Semstack flag
            -a      : turn on assembler dump to screen
            -i      : dump Intermediate Representation
            -s      : dump Symbol Table after parse
            -v      : dump variables addresses
            -p      : turn on PtraceSem
            -c      : turn on PtraceCode
            -b      : turn on PtraceSymtab
            -t      : turn on PtraceTree
            -y      : view parser reductions (yacc)
    
    Compiler directives (default values in parens):
            %%dmp   : Dump the Symbol Table now (off)
            %%c     : Toggle code generation (on)
            %%t     : Toggle token trace (off)
            %%ss    : Toggle semantic stack display (off)
            %%pr    : Toggle procedure trace (off)
    

  4. Construct a small example of your own and turn the options and directives on and off and observe the change in output. Do this early while examples are small so you become familiar with properties.

    Some interesting, informative samples that you just copied from stewart's directory are:

    Frontend Testing [scanning, parsing, semantic analysis]
    simple -i tfirst           (simple-i-tfirst.txt output the AST, Abstract Syntax Tree)
    simple -y tfirst          (output the grammar reductions from yacc)

    Using a shorter, simpler test file: texp0

    Frontend Debug
    simple -p texp0        ( simple -p output)
    Output the entering/leaving of each routine in sem.c so help identify which routine you "enter" but "leave" and thereby the location of your run-time error

    Backend Testing [code generation]
    simple -a texp0        (output the SPIM assembler code to the screen)
    Backend Debug
    simple -c texp0 (output the entering/leaving of code genartion)
    More example options
    simple -d -i -v texp0 (turn on debug, get AST, dump var addresses)
    simple -d -i -a texp0 (turn on debug, get AST, dump assembler code to screen)
    simple -nc -i texp0 (debug off, get AST, do not generate assembler)

  5. Run the code using spim, using the s.out output file that would be produced by the calls to simple above.

    spim -file s.out

    which generates the output.

  6. Get more information on SPIM through its man page

    man spim

  7. There is a script file doit that can put all this together for you. You have to issue the UNIX command chmod 755 doit since the file type is not correct from copying from the instructor's directory.

    doit texp0

    producing the following output.

  8. Take a look at the makefile. This file is doing a lot of work on your behalf. Over the length of the semester, you will become more familiar with all these activities.

NOTE: If any source code example has a readln in it (which tfirst does), then SPIM is going to patiently wait (forever!) for you to type a number as input for the read. So, if the system appears hung, the example you are running might be waiting on a readln - try typing a number.
Top


Tools

Our main external Unix tools are Lex and Yacc. You should examine the man page discussion of each of these.

                             man lex
                             man yacc
For more details, the text discusses both of these. You have a makefile which will properly invoke these tools as needed in our project.

We will also be using the Gnu gcc compiler. It provides some decent run-time debugging help and the makefile is set up to generated the symbol table needed for the debugger to run. There is a lot of option output available from the compiler project itself, but when all else fails you can use the symbolic debugger in the following way. Suppose you have just produced a new version of the compiler project (called "simple" by default) and when running the project (using, for example, simple tfirst) you encounter a run-time error.

                            dbx simple
                            run -d tfirst 
should give you the line number and module in your source code where the run-time error occurred.
				exit
type exit to leave dbx.

You may also be interested in RCS (Revision Control System) for handling the many updates your project will be going through over the semester. A brief introduction is available:

                           man rcsintro
                           man rcs

Top

Documents

Appendix A: SPIM SPIM: A MIPS R2000/R3000 Simulator by J.R. Larus
Project Files
Macro User Manual and Report
This file contains a description of the ADA/CS language our textbook refers to. Implementing a subset (of this Ada subset) will be the goal of your compiler project for this class.
man spim
man xspim

Top

Brief Outline of the Phases of the Project

More details, hints and other helpful information will be made available as the semester proceeds. This is just an outline so that you have an idea of our plan of attack.
Phase 1
Implement comments (one line change simple.lex)
Phase 2
Implement strings (need strings for prompts).
Phase 3
Implement integer variable type and declarations. Implement package declaration. (semantic checking, but no new generated SPIM code)
Phase 4
Implement boolean types and operations
Phase 5
Control structures
  1. if-then-elsif-else-end if
  2. loops and exits
  3. when and while
Phase 6
Subroutines using global variables
Phase 7
Subroutines with parameters and/or local and global variables

Top

Due Dates - Grade based on Phase Progress
30jan08

You are given a compiler for SIMPLE which is compiles a "simple language" which includes arithmetic operations (multiplication, division, addition and subtraction), the If-then statement, Assignment statement and I/O statements. We will extend the SIMPLE compiler to implement Macro (subset of Ada/CS). A self-contained description of Macro is available under Documents of this handout and linked above.

The following extensions will be developed over the semester (you should refer to the Macro description for the definition of our language for more details on how these extensions are to look):

** 26March08Update**

To receive an A on the project, your group must budget its time to have things completed within the semester.
Top


Project Check Directions

Formal procedure for turning in your project. This will be the same procedure all semester long.

Once you have tested your project executable, simple, and feel is it running to your satisfaction you should:

  1. move (or copy) this executable simple to your root directory
  2. send mail to stewart@rohan indicating that your project is ready to be checked
The mail must be sent from the account where your project is located on rohan and must be your class account so that I can have access to the executable to run the tests.

When you begin working as a group, you will have identified the members of the group to your instructor and the Point of Contact (POC). The POC submits the notice that the project is ready to be check and must cc: the other members of the group on this email note.

Do not change the name of the executable. Do not change any of the protections on your file because the script is able to access it as Superuser and I do not want to worry about any unauthorized person trying to access your file.

Top


Recommendations on how to proceed

You have enough space in your rohan account to always keep two directories. One would have the current "finished" compiler project, i.e. the one that is correctly working. The other directory would have the new Phase and soon to be the "working" compiler project for the next Phase.

Over the semester, we will extend the grammar of our language and you will extend the capabilities of your compiler. I would recommend the following order below.

MAKE SURE EACH STEP IS WORKING CORRECTLY BEFORE PROCEEDING TO THE NEXT!

As your skills and confidence grows over the semester, you will be able to do more than one step at a time, but do not be too anxious to do this at first.

  1. Introduce the new tokens into simple.lex and create a SMALL test code, called tfirst with these new tokens used correctly. Run make and then run simple with only the token trace on by inserting

    %%t

    as the first line of the your new test file, first-d, and make sure all new tokens are being recognized.

    simple first-d

  2. Introduce new grammar rules into simple.lr and put dummy stubs into sem.c. Make sure the semantic actions in your grammar are being called during the parse when you expect. Then add the complet code in sem.c (with corresponding definitions of the structures in semform.h) to manipulate the semantic stack appropriately. Do not forget to update print_semstack as you increase the items pushed onto this stack. For your own benefit, continue the commenting technique in sem.c with its representation of the semstack before and after the particular action displayed in comments at the beginning of the semantic action routine. Turn on the yacc dump

    simple -y tfirst

    to be sure things are working properly with the grammar.

  3. Update the printree.c routine to be able to output the AST with the new semantic information. Output the tree and see that the semantic information is there.

    simple -i tfirst

  4. Update codegen.c to produce the assembler code translation for the new grammar statements. DON'T EVEN TRY THIS UNTIL YOUR TREE LOOKS GOOD!
Follow the convention of invoking "entering" and "leaving" in each function that you add to the project. This will greatly enhance the debugging capabilities of your code.

COME TALK TO ME IN OFFICE HOURS
DO NOT LET YOURSELF GET TOO CONFUSED

Top

The Beginning Compiler

You are given a working compiler that handles only simple input such as (this is a dumb code only to show what's possible its name is tfirst):
                beta :=2;
                write (beta,3);
                writeln;
                readln(gamma);
                writeln(gamma);
                alpha := 2 + beta * gamma;
                writeln(alpha);
                if beta = 2 then
                        beta := beta + gamma * alpha + alpha;
                        writeln(beta);
                end if;
You see we have:
  1. Integer variables, which do not have to be declared prior to use
  2. Integer arithmetic
  3. Integer I/O (readln, write and writeln)
  4. If then end if statement
There are four main phases in the process of translating an input source file to assembler language.

Lexical Analysis
effectively checking the spelling. This phase recognizes the "token"s of the language. (The file s.list is produced by the compiler, echoing valid input source.)
     reserved words       readln write writeln if then end if
     symbols              := + * - / ; ( ) = ,
     variable names       any sequence of characters, beginning with
                           an alpha character
For our project, this is defined in the file simple.lex
Syntactic Analysis (parsing)
checking the grammar. Do the statements look right? (e.g. for each "if" there must be an "end if", each statement ends in a ";") This is specified by a grammar, for our project, this is defined in the file simple.lr
Semantic Analysis
checking that statements make some sense. Right now, the language is so limited the programmer who obeys the grammar will probably produce a valid program. But you could use an uninitialized variable, which results in no error or exception with SPIM and the value 0 is printed if you write the uninitialized value. (NOTE: this is not detected now, but you will add this later.)

The first three phases are accomplished using the technique of syntax directed translation. Our compiler analyzes the input source and produces an "intermediate form", called an "abstract syntax tree" (AST).

For the program above, there is a compiler flag that you can use (-i) that will output this intermediate form. One way to represent the program above is through the sequence of statements (2 writeln's are omitted here for space):

    :=          write             readln                :=
   /  \            |               |                   /   \
beta    2          \ list          \list           alpha    +
                     beta 3         gamma                 /   \
                                                         2     *
                                                             /   \
                                                          beta  gamma


                     ifstmt
                     /    \
                   /        \
                  =         thenlist
                /   \          | writeln
             beta    2         | := stmt

Once you set up your copy of the initial project by copying from stewart's file space, you should try
                          simple -i tfirst
to get a feeling for how this intermediate representation (the AST) is presented using crude output capabilities.
Code Generation
After this intermediate form has been produced, the compiler will traverse the Astract Syntax Tree and produce assembler code (in s.out). For the program above, we produce the following. Read this over and try to make sense of it. You will have to be able to write c code that will generate SPIM assembler code and so need to understand this assembler language.

# Register Usage:
#     $s0 for global variables
#
        .text 
        .globl main 
 main:
        la      $s0, GVARS
#
# Start Code
# 
# Generate Assignment Statement 
        li      $t0,2
        sw      $t0,4($s0)
# 
# Generate Write statement 
        li      $v0, 1
        lw      $a0,4($s0)
        syscall
        li      $v0, 1
        li      $a0,3
        syscall
# 
# Generate Writeln statement 
        la      $a0, S0
        li      $v0, 4
        syscall

# 
# Generate Readln statement 
        li      $v0, 5
        syscall
        sw      $v0,8($s0)
# 
# Generate Writeln statement 
        li      $v0, 1
        lw      $a0,8($s0)
        syscall
        la      $a0, S0
        li      $v0, 4
        syscall

# 
# Generate Assignment Statement 
        lw      $t0,4($s0)
        lw      $t1,8($s0)
        mul     $t0,$t0,$t1
        add     $t0,$t0,2
        sw      $t0,0($s0)
# 
# Generate Writeln statement 
        li      $v0, 1
        lw      $a0,0($s0)
        syscall
        la      $a0, S0
        li      $v0, 4
        syscall

#
# Generate If-Then statement 
        lw      $t0,4($s0)
        seq     $t0,$t0,2
        beq     $t0, 0,IF0
# 
# Generate Assignment Statement 
        lw      $t0,8($s0)
        lw      $t1,0($s0)
        mul     $t0,$t0,$t1
        lw      $t1,4($s0)
        add     $t0,$t0,$t1
        lw      $t1,0($s0)
        add     $t0,$t0,$t1
        sw      $t0,4($s0)
# 
# Generate Writeln statement 
        li      $v0, 1
        lw      $a0,4($s0)
        syscall
        la      $a0, S0
        li      $v0, 4
        syscall

#
# End If 
IF0:
#
# Halt execution
        li      $v0 10
        syscall

#
# Finish up by writing out constants 
        .word 0
CONST:          #Constant storage area
        .data
S0:     .asciiz "\n"
#
# Reserve space for global variables
        .word 0
GVARS:          # space for Global Variables
        .data
_alpha: .word 0 # Offset at 0
        .data
_beta:  .word 0 # Offset at 4
        .data
_gamma: .word 0 # Offset at 8
        .data
Temp_Wr:
        .word 0 #just for alignment of write(exprtree)
Running SPIM
The output of our compiler is SPIM assembler code in the file "s.out" and a listing of the original source in "s.list". s.out can then be submitted to SPIM to be assembled and executed. This has been automated in the script file "q"
                             q tfirst
will invoke the compiler, simple, and uses command line options to renamed the generated assembler code, tfirst.s, and listing file, tfirst.list; then tfirst.s is run by SPIM to produce the following:
        execute the code in spim:
 
        SPIM Version 6.2 of January 11, 1999
        Copyright 1990-1998 by James R. Larus (larus@cs.wisc.edu).
        All Rights Reserved.
        See the file README for a full copyright notice.
        Loaded: /opt/spim/bin/trap.handler
        23    <--  appeared as output
        7     <--  I typed the value 7 because the code is waiting on readln
        7     <--  resulting output from the code above
        16    <--  2 + 2 (beta) * 7 (gamma)
        130   <--  2 (beta) + 7 (gamma) * 16 (alpha) + 16 (alpha)
There are several other script files available that produce different amounts of output. In the script "q" above, q stands for quick and provides a minimal level of output.

A more information script file is "doit" and one that produces much more output, that I would recommend using in the following manner:

                   listdoit tfirst >& tfirst.out
You need the redirect >& in order to capture any error messages that are the output from SPIM. (There should be any at this time, but there will be later as the semester progresses.)
Top

The Compiler Project Files

Your compiler is called "simple" and its command line was given above. "simple" translates your input source file and places the output into "s.out". "s.out" can then be read and executed by SPIM. Project File List Graphical Overview (use Print Preview and view at 175%)

simple.lex
contains the lex specification of the tokens of the initial language (used to produce lex-yy.c)
simple.lr
contains the lr specification of the grammar of the initial language (used to produce y-tab.c, y-tab.h)
simple.c
is the main program and directs traffic. This routine parses the command line flags, sets up files, invokes the parse (using yacc), outputs the symbol table and AST if desired, calls codegen.c to generate assembler.
symtab.c
manipulates the symbol table (stored as a binary tree). Initially variables are inserted into the symbol table when encountered, i.e. no previous declaration is required. Note: You must always "Lookup" an id name immediately prior to "Insert"ing the name in the table. This file contains "GetVars", called by sem.c, to convert the binary tree to an alphabetized singly linked list. This also contains dumpstab() to dump the current symtab at any time during the parse.
sem.c
is the semantic workhorse of this project. The grammar, simple.lr, contains "semantic actions" which are invoked as contructs of the language are recognized by Yacc. This file has YOUR code of what to do when items are recognized. This manipulates the semantic stack which contains all meaningful information needed to translate. This is related to, but different from, the parse stack which Yacc maintains to keep track of the LALR(1) parse. Take a careful look at the function done().
codegen.c
recursively traverses the AST constructed by sem.c and generates SPIM assembler code to "s.out".
tracing.c
Debugging aid. Self-indenting printout of each function called. Note: you must be consistent in using "entering" and "leaving" when you start adding your own routines for this debug feature to continue to be useful. I have added code so that you control the type of detailed output when you invoke the compiler, simple:
        -p      : turn on PtraceSem - code in sem.c
        -c      : turn on PtraceCode - code in codegen.c
        -b      : turn on PtraceSymtab - code in symtab.c
        -t      : turn on PtraceTree - code in printree.c
printree.c
Recursively traverses the AST (via Prog.main) and prints out the tree. You must update this as the AST structure is enhanced over the semester.
yaccextr.c
Tid bits to aid debugging and interfacing with Lex and Yacc. Of particular interest are the routines int token(tok_value) and void yyerror(s)
makefile
Script file to manage the whole project. A makefile is a standard mechanism in UNIX that detects when a file has changed and will recompile it, link it with the other files of the project, and produce an update version of simple, your runnning compiler. A simple way to cause a file to be renewed, i.e. have its timestamp change, is to touch filename, or to edit the file.
maketokn
Script file to construct a file (y-tok.h) used by Token() in Lex to output a source listing during the parse. NOTE: if you download a copy of this file from this web page, you must issue the UNIX command
chmod 711 maketokn
from the UNIX command line to make this script file executable.

I employed the following naming conventions for all the header files:

  1. Each *.c file has a *.h file, i.e. sem.c has a sem.h. In the corresponding *.h file you will find definitions of structures used by that routine.
  2. There are some special, complicated structures that are used by more than one *.c file. These are defined in their own *.h file. For example, the symbol table definition is in symboltb.h (note the restriction to 8 letters in order to be able to easily download to a PC). The other main structures of the compiler are:
    1. symboltb.h - the symbol table structure, a binary tree
    2. semform.h - the structure of entries in the semantic stack
    3. semstack.h - the declaration of the array used for the semantic stack
    4. astdef.h - the definition of the global structure constructed by sem.c to hold the AST. This consists of a structure (Prog) with 2 fields: a linked list of translated statements nodes and an alphabetized list of variable attritute records (from the symbol table tree).
    5. booldef.h - declaration of some Booleans used to control debugging i/o
    6. spim_def.h - the definition of machine constants and addressing modes used to generate code
    7. filedef.h - the definition of the files used by the compiler. Consists of the sourcefile and name, the output codefile and name, and the output listing file and name. When making changes, you should consult the appropriate simple.h, symtab.h, sem.h, printree.h, codegen.h, tracing.h or yaccextr.h file to see what structures that *.c code uses.

      You should NOT manipulate the following to header files:

    8. y-tab.h is effectively constructed by yacc (actually yacc produces y.tab.h which the makefile renames to y-tab.h)
    9. y-tok.h is constructed by maketokn from the output of lex
Most of your coding will be in sem.c and codegen.c
--> Top
Back to CS 524 Class Page