CS 524 - Compiler Construction
Main Software Project
Spring 2008 - San Diego State University
Kris Stewart
This URL: https://stewart.sdsu.edu/cs524/spr08/Project/project.html
Return to CS 524 class page
Upd: 26March08 * Project Deadlines
This is a long document. I have tried to collect information
in one place that you will need as the semester goes along. Do not
be intimidated by all the details. The project is set up to ease
you into things. The first phase is a one line change to the file
simple.lex; each following phase is more involved, but tries to
isolate on a single file for the change. At the end of the semester,
you should feel comfortable with making big changes - but not
until the end of the semester. The beginning code that you are
given is a working compiler with many flags to govern its debugging
and output capabilities. You should examine your copy of the
early compiler to get comfortable with its capabilities. Its
diagnostic output will be very helpful later on.
Table of Contents
- Graphical Overview of Project gif file
- Background
- Obtaining the Initial Project
- Tools
- Documents
- Brief Outline of the Phases of the Project
- Due Dates - Grade on the Project (based on Phase Progress)
- Project Check Directions
- Recommendations on How to Proceed
- The Beginning Compiler Project
- The Compiler Project Files
A major portion of this course centers on actual hands-on experience
writing a compiler and coming to understand how it accomplishes the
automatic translation into assembly language. YOU WILL NOT BE WRITING A
COMPILER FROM SCRATCH. Instead, you are given a full, working
compiler for a limited language. Over the semester we extend this
language to resemble a modern day language. We are using the
text, Crafting a Compiler in C by Fischer and LeBlanc and will
be closely following the recommendations in the text for
technique to implement ADA/CS, a subset of Ada.
A compiler is an essential tool that most students in
computer science are familiar with only as a tool, you've never
seen the insides. I hope that after this course you will have a
better understanding of, and perhaps some sympathy for, some of
the compilers you use and their lack of
meaningful diagnostic messages, especially at run time.
There are several distinct parts in a compiler.
This description will give an overview of the code that comprises
your project for CS 524.
We use the simulated machine: MIPS R2000. Dr. James Larus,
from the University of Wisconsin, developed the simulator, SPIM,
to be used in conjunction with the text, "Computer Organization
& Design: The Hardware/Software Interface" by Patterson and
Hennessy.
Top
- Your initial project should be copied from stewart's account on Rohan. Log
into your own Rohan class account and create a directory
mkdir Phase0
Move into this directory
cd Phase0
and copy the files from stewart's directory
cp ~stewart/cs524/simple/* .
Do not forget the final period in the "cp" above. You may get a warning
about some files that were not copied, do not worry about them.
- Type
make
generating the following text response,
make-output.txt,
and you now will have your own version of the initial compiler.
You can work through some of the following examples using the
compiler options.
- As a first look, just type
simple
and see what the options and compiler directives available are.
rohan% simple
in simple.c
No parameters specified
usage :
simple [-l list_name][-o code_file][-nc][-nl][-d][-a][-i][-s][-v] source_file
-o name : rename the assembly code file
-l name : rename the output list file
-nc : do not generate code
-nl : do not generate listing
-d : turn off Tokentrace/Semstack flag
-a : turn on assembler dump to screen
-i : dump Intermediate Representation
-s : dump Symbol Table after parse
-v : dump variables addresses
-p : turn on PtraceSem
-c : turn on PtraceCode
-b : turn on PtraceSymtab
-t : turn on PtraceTree
-y : view parser reductions (yacc)
Compiler directives (default values in parens):
%%dmp : Dump the Symbol Table now (off)
%%c : Toggle code generation (on)
%%t : Toggle token trace (off)
%%ss : Toggle semantic stack display (off)
%%pr : Toggle procedure trace (off)
- Construct a small example of your own and turn the options and
directives on and off and observe the change in output. Do this
early while examples are small so you become familiar with
properties.
Some interesting, informative samples that you just copied from stewart's
directory are:
Frontend Testing [scanning, parsing, semantic analysis]
- simple -i
tfirst
(simple-i-tfirst.txt output the AST, Abstract Syntax Tree)
- simple -y tfirst (output the grammar reductions from yacc)
Using a shorter, simpler test file: texp0
Frontend Debug
- simple -p texp0 (
simple -p output)
Output the entering/leaving of each routine in sem.c
so help identify which routine you "enter" but "leave" and thereby the location
of your run-time error
Backend Testing [code generation]
- simple -a texp0 (output the SPIM assembler code to the screen)
Backend Debug
- simple -c texp0 (output the entering/leaving of code genartion)
More example options
- simple -d -i -v texp0 (turn on debug, get AST, dump var addresses)
- simple -d -i -a texp0 (turn on debug, get AST, dump assembler code to screen)
- simple -nc -i texp0 (debug off, get AST, do not generate assembler)
- Run the code using spim, using the
s.out output file
that would be produced by the calls to simple above.
spim -file s.out
which generates the output.
- Get more information on SPIM through its man page
man spim
- There is a script file doit that can put all this together
for you. You have to issue the UNIX command chmod 755 doit since
the file type is not correct from copying from the instructor's directory.
doit texp0
producing the following
output.
-
Take a look at the makefile. This file
is doing a lot of work on your behalf. Over the length of the
semester, you will become more familiar with all these activities.
NOTE:
If any source code example has a readln in it (which tfirst does),
then SPIM is going to patiently wait (forever!) for you to type
a number as input for the read.
So, if the system appears hung, the example you are
running might be waiting on a readln - try typing a number.
Top
Our main external Unix tools are
Lex and
Yacc. You should
examine the man page discussion of each of these.
man lex
man yacc
For more details, the text discusses both of these. You have a
makefile which will properly invoke these tools as needed in our
project.
We will also be using the Gnu gcc compiler. It provides
some decent run-time debugging help and the makefile is set up to
generated the symbol table needed for the debugger to run. There
is a lot of option output available from the compiler project
itself, but when all else fails you can use the symbolic debugger
in the following way. Suppose you have just produced a new version
of the compiler project (called "simple" by default) and when
running the project (using, for example, simple tfirst) you
encounter a run-time error.
dbx simple
run -d tfirst
should give you the line number and module in your source code where
the run-time error occurred.
exit
type exit to leave dbx.
You may also be interested in RCS (Revision Control System)
for handling the many updates your project will be going through
over the semester. A brief introduction is available:
man rcsintro
man rcs
Top
- Appendix A: SPIM
SPIM: A MIPS R2000/R3000 Simulator by J.R. Larus
-
- Project Files
-
- Macro User Manual and Report
-
This file contains a description of the ADA/CS language our textbook refers to. Implementing a subset (of this Ada subset) will be the goal of your compiler project for this class.
- man spim
- man xspim
Top
More details, hints and other helpful information will be made
available as the semester proceeds. This is just an outline so that you
have an idea of our plan of attack.
- Phase 1
- Implement comments (one line change simple.lex)
- Phase 2
- Implement strings (need strings for prompts).
- Phase 3
- Implement integer variable type and declarations.
Implement package declaration. (semantic checking, but no new generated
SPIM code)
- Phase 4
- Implement boolean types and operations
- Phase 5
- Control structures
- if-then-elsif-else-end if
- loops and exits
- when and while
- Phase 6
- Subroutines using global variables
- Phase 7
- Subroutines with parameters and/or local and global variables
Top
Due Dates - Grade based on Phase Progress
30jan08
You are given a compiler for SIMPLE which compiles a "simple language" that
includes arithmetic operations (multiplication, division,
addition and subtraction), the If-then statement, Assignment statement and
I/O statements. We will extend the SIMPLE
compiler to implement
Macro (subset of Ada/CS). A self-contained
description of Macro is available under Documents of this handout
and linked above.
The following extensions will be developed over the semester (you
should refer to the Macro description for the definition of our language for
more details on how these extensions are to look):
** 26March08Update**
- 1.
Phase1: Extension of simple.lex to handle comments - 06Feb08(individual)
- 2.
Phase2: Add strings to language for output statements - 05March08 (individual)
** 25Feb08 Update**
- 3.
Phase3: Declaration of integer variables and package - Cancelled - Phase 3 provided to class TBD (as Group)
** First Midterm - Compile Time Structures [Lexical Analysis,
Parsing, Symbol Table] - 19Mar08 **
- 4.
Phase4: Declaraction of boolean types and operations - (Group)
* Postponed to end ** was:09April08
- 5.
Phase5: Control structures - as group - 14April08 [was 23Apr08 before deleting P3 and P4]
GETTING TO HERE GIVES YOU A 'C' ON THE PROJECT
- 6.
Phase6a:Simple procedures; global variables - as group
GETTING TO HERE GIVES YOU A 'B' ON THE PROJECT
TO RECEIVE AN 'A' ON THE PROJECT, COMPLETE ONE OF THE FOLLOWING by 12May08 midnight [Our final Exam day]:
- 7a.
Phase7a: Recursive procedures with local and global variables - as group
- 7b.
Phase7b: Recursive procedures with parameters - as group
** Second Midterm - 30April08 - Runtime Structures **
Extra Credit - Can bring your code score to 10/10 - Due by Last Day Finals Week.
You may request an extension on time to complete project only for the extra credit Phase.
Only groups who have Proceduress in by its deadline can make this request. You must
send email to stewart@rohan.sdsu.edu that you intend to pursue this.
It would be fun to see, but I think it asks too much, so I will just
leave this as an indication of how incomplete the compiler is:
- 8.
Declaration of boolean types and operations
- 9.
Record declarations, field references
- 10.
Array declarations, element references
- 11.
User defined types
To receive an A on the project, your group must budget its time
to have things completed within the semester.
Top
Formal procedure for turning in your project. This will be
the same procedure all semester long.
Once you have tested your project executable, simple, and feel
is it running to your satisfaction you should:
- move (or copy) this executable simple to your root directory
- send mail to stewart@rohan indicating that your project is
ready to be checked
The mail must be sent from the account where your project
is located on rohan and must be your class account so that I
can have access to the executable to run the tests.
When you begin working as a group, you will have identified the members
of the group to your instructor and the Point of Contact (POC). The
POC submits the notice that the project is ready to be check and must
cc: the other members of the group on this email note.
Do not change the name of the executable. Do not change any of the
protections on your file because the script is able to access it as
Superuser and I do not want to worry about any unauthorized person
trying to access your file.
Top
You have enough space in your rohan account to always keep
two directories. One would have the current "finished" compiler
project, i.e. the one that is correctly working. The other
directory would have the new Phase and soon to be the "working"
compiler project for the next Phase.
Over the semester, we will extend the grammar of our
language and you will extend the capabilities of your compiler. I
would recommend the following order below.
MAKE SURE EACH STEP IS WORKING CORRECTLY BEFORE PROCEEDING TO THE NEXT!
As your skills and confidence grows over the semester, you will be
able to do more than one step at a time, but do not be too anxious
to do this at first.
- Introduce the new tokens into simple.lex and create a
SMALL test code, called tfirst with these new tokens used
correctly. Run make and then run simple with only the token trace
on by inserting
%%t
as the first line of the your new test file, first-d, and make sure
all new tokens are being recognized.
simple first-d
- Introduce new grammar rules into simple.lr and put dummy
stubs into sem.c. Make sure the semantic actions in your grammar
are being called during the parse when you expect. Then add the
complet code in sem.c (with corresponding definitions of the
structures in semform.h) to manipulate the semantic stack
appropriately. Do not forget to update print_semstack as you
increase the items pushed onto this stack. For your own benefit,
continue the commenting technique in sem.c with its
representation of the semstack before and after the particular
action displayed in comments at the beginning of the semantic
action routine. Turn on the yacc dump
simple -y tfirst
to be sure things are working properly with the grammar.
-
Update the printree.c routine to be able to output the
AST with the new semantic information. Output the tree and see
that the semantic information is there.
simple -i tfirst
- Update codegen.c to produce the assembler code
translation for the new grammar statements. DON'T EVEN TRY THIS
UNTIL YOUR TREE LOOKS GOOD!
Follow the convention of invoking "entering" and "leaving" in
each function that you add to the project. This will greatly
enhance the debugging capabilities of your code.
COME TALK TO ME IN OFFICE HOURS
DO NOT LET YOURSELF GET TOO CONFUSED
Top
You are given a working compiler that handles only simple input
such as (this is a dumb code only to show what's possible its name is
tfirst):
beta :=2;
write (beta,3);
writeln;
readln(gamma);
writeln(gamma);
alpha := 2 + beta * gamma;
writeln(alpha);
if beta = 2 then
beta := beta + gamma * alpha + alpha;
writeln(beta);
end if;
You see we have:
- Integer variables, which do not have to be declared prior to use
- Integer arithmetic
- Integer I/O (readln, write and writeln)
- If then end if statement
There are four main phases in the process of translating an
input source file to assembler language.
- Lexical Analysis
- effectively checking the spelling.
This phase recognizes the "token"s of the language. (The file
s.list is produced by the compiler, echoing valid input source.)
reserved words readln write writeln if then end if
symbols := + * - / ; ( ) = ,
variable names any sequence of characters, beginning with
an alpha character
For our project, this is defined in the file simple.lex
- Syntactic Analysis (parsing)
- checking the grammar. Do the statements look right? (e.g. for
each "if" there must be an "end if", each statement ends in a ";")
This is specified by a grammar, for our project, this is defined in the
file simple.lr
- Semantic Analysis
- checking that statements make some
sense. Right now, the language is so limited the programmer who
obeys the grammar will probably produce a valid program. But you
could use an uninitialized variable, which results in no error or
exception with SPIM and the value 0 is printed if you write the
uninitialized value. (NOTE: this is not detected now, but you
will add this later.)
The first three phases are accomplished using the technique
of syntax directed translation. Our compiler analyzes the input
source and produces an "intermediate form", called an "abstract
syntax tree" (AST).
For the program above, there is a compiler flag that you can use
(-i) that will output this intermediate form. One way to represent the
program above is through the sequence of statements (2 writeln's are
omitted here for space):
:= write readln :=
/ \ | | / \
beta 2 \ list \list alpha +
beta 3 gamma / \
2 *
/ \
beta gamma
ifstmt
/ \
/ \
= thenlist
/ \ | writeln
beta 2 | := stmt
Once you set up your copy of the initial project by copying
from stewart's file space, you should try
simple -i tfirst
to get a feeling for how this intermediate representation (the
AST) is presented using crude output capabilities.
- Code Generation
-
After this intermediate form has been produced, the compiler
will traverse the Astract Syntax Tree and produce assembler code
(in s.out). For the program above, we produce the following. Read
this over and try to make sense of it. You will have to be able
to write c code that will generate SPIM assembler code and so
need to understand this assembler language.
# Register Usage:
# $s0 for global variables
#
.text
.globl main
main:
la $s0, GVARS
#
# Start Code
#
# Generate Assignment Statement
li $t0,2
sw $t0,4($s0)
#
# Generate Write statement
li $v0, 1
lw $a0,4($s0)
syscall
li $v0, 1
li $a0,3
syscall
#
# Generate Writeln statement
la $a0, S0
li $v0, 4
syscall
#
# Generate Readln statement
li $v0, 5
syscall
sw $v0,8($s0)
#
# Generate Writeln statement
li $v0, 1
lw $a0,8($s0)
syscall
la $a0, S0
li $v0, 4
syscall
#
# Generate Assignment Statement
lw $t0,4($s0)
lw $t1,8($s0)
mul $t0,$t0,$t1
add $t0,$t0,2
sw $t0,0($s0)
#
# Generate Writeln statement
li $v0, 1
lw $a0,0($s0)
syscall
la $a0, S0
li $v0, 4
syscall
#
# Generate If-Then statement
lw $t0,4($s0)
seq $t0,$t0,2
beq $t0, 0,IF0
#
# Generate Assignment Statement
lw $t0,8($s0)
lw $t1,0($s0)
mul $t0,$t0,$t1
lw $t1,4($s0)
add $t0,$t0,$t1
lw $t1,0($s0)
add $t0,$t0,$t1
sw $t0,4($s0)
#
# Generate Writeln statement
li $v0, 1
lw $a0,4($s0)
syscall
la $a0, S0
li $v0, 4
syscall
#
# End If
IF0:
#
# Halt execution
li $v0 10
syscall
#
# Finish up by writing out constants
.word 0
CONST: #Constant storage area
.data
S0: .asciiz "\n"
#
# Reserve space for global variables
.word 0
GVARS: # space for Global Variables
.data
_alpha: .word 0 # Offset at 0
.data
_beta: .word 0 # Offset at 4
.data
_gamma: .word 0 # Offset at 8
.data
Temp_Wr:
.word 0 #just for alignment of write(exprtree)
- Running SPIM
-
The output of our compiler is SPIM assembler code in the
file "s.out" and a listing of the original source in "s.list".
s.out can then be submitted to SPIM to be assembled and executed.
This has been automated in the script file "q"
q tfirst
will invoke the compiler, simple, and uses command line options
to renamed the generated assembler code, tfirst.s, and listing
file, tfirst.list; then tfirst.s is run by SPIM to produce the
following:
execute the code in spim:
SPIM Version 6.2 of January 11, 1999
Copyright 1990-1998 by James R. Larus (larus@cs.wisc.edu).
All Rights Reserved.
See the file README for a full copyright notice.
Loaded: /opt/spim/bin/trap.handler
23 <-- appeared as output
7 <-- I typed the value 7 because the code is waiting on readln
7 <-- resulting output from the code above
16 <-- 2 + 2 (beta) * 7 (gamma)
130 <-- 2 (beta) + 7 (gamma) * 16 (alpha) + 16 (alpha)
There are several other script files available that produce
different amounts of output. In the script "q" above, q stands for
quick and provides a minimal level of output.
A more information script file is "doit" and one that produces
much more output, that I would recommend using in the following manner:
listdoit tfirst >& tfirst.out
You need the redirect >& in order to capture any error messages
that are the output from SPIM. (There should be any at this
time, but there will be later as the semester progresses.)
Top
Your compiler is called "simple" and its command line was
given above. "simple" translates your input source file and
places the output into "s.out". "s.out" can then be read and
executed by SPIM.
Project File List
Graphical Overview (use Print Preview and
view at 175%)
- simple.lex
- contains the lex specification of the tokens of the initial language (used to produce lex-yy.c)
- simple.lr
- contains the lr specification of the grammar of the initial language (used to produce y-tab.c, y-tab.h)
- simple.c
- is the main program and directs traffic. This routine parses the
command line flags, sets up files, invokes the parse (using yacc),
outputs the symbol table and AST if desired, calls codegen.c to
generate assembler.
- symtab.c
- manipulates the symbol table (stored as a binary tree).
Initially variables are inserted into the symbol table when encountered,
i.e. no previous declaration is required. Note: You must always
"Lookup" an id name immediately prior to "Insert"ing the name in the table.
This file contains "GetVars", called by sem.c, to convert the binary tree
to an alphabetized singly linked list. This also contains dumpstab() to
dump the current symtab at any time during the parse.
- sem.c
- is the semantic workhorse of this project. The grammar, simple.lr,
contains "semantic actions" which are invoked as contructs of the language
are recognized by Yacc. This file has YOUR code of what to do when items
are recognized. This manipulates the semantic stack which contains all
meaningful information needed to translate. This is related to, but
different from, the parse stack which Yacc maintains to keep track of
the LALR(1) parse. Take a careful look at the function done().
- codegen.c
- recursively traverses the AST constructed by sem.c and generates SPIM
assembler code to "s.out".
- tracing.c
- Debugging aid. Self-indenting printout of each function called.
Note: you must be consistent in using "entering" and "leaving" when
you start adding your own routines for this debug feature to continue to
be useful. I have added code so that you control the type of detailed output
when you invoke the compiler, simple:
-p : turn on PtraceSem - code in sem.c
-c : turn on PtraceCode - code in codegen.c
-b : turn on PtraceSymtab - code in symtab.c
-t : turn on PtraceTree - code in printree.c
- printree.c
- Recursively traverses the AST (via Prog.main) and prints out the tree.
You must update this as the AST structure is enhanced over the semester.
- yaccextr.c
- Tid bits to aid debugging and interfacing with Lex and Yacc. Of
particular interest are the routines int token(tok_value) and
void yyerror(s)
- makefile
- Script file to manage the whole project. A makefile is a standard
mechanism in UNIX that detects when a file has changed and will recompile it,
link it with the other files of the project, and produce an update version of
simple, your runnning compiler. A simple way to cause a file to be
renewed, i.e. have its timestamp change, is to touch filename,
or to edit the file.
- maketokn
- Script file to construct a file (y-tok.h) used by Token() in
Lex to output a source listing during the parse. NOTE: if you
download a copy of this file from this web page, you must issue the
UNIX command
chmod 711 maketokn
from the UNIX command
line to make this script file executable.
I employed the following naming conventions for all the header files:
- Each *.c file has a *.h file, i.e. sem.c has a sem.h. In the
corresponding *.h file you will find definitions of structures used by
that routine.
- There are some special, complicated structures that are used by
more than one *.c file. These are defined in their own *.h file. For
example, the symbol table definition is in symboltb.h (note the
restriction to 8 letters in order to be able to easily download to a
PC). The other main structures of the compiler are:
- symboltb.h - the symbol table structure, a binary tree
- semform.h - the structure of entries in the semantic stack
- semstack.h - the declaration of the array used for the
semantic stack
- astdef.h - the definition of the global structure
constructed by sem.c to hold the AST. This consists of a
structure (Prog) with 2 fields: a linked list of
translated statements nodes and an alphabetized list of
variable attritute records (from the symbol table tree).
- booldef.h - declaration of some Booleans used to control
debugging i/o
- spim_def.h - the definition of machine constants and
addressing modes used to generate code
- filedef.h - the definition of the files used by the
compiler. Consists of the sourcefile and name, the output
codefile and name, and the output listing file and name.
When making changes, you should consult the appropriate simple.h,
symtab.h, sem.h, printree.h, codegen.h, tracing.h or yaccextr.h file
to see what structures that *.c code uses.
You should NOT manipulate the following to header files:
- y-tab.h is effectively constructed by yacc (actually yacc
produces y.tab.h which the makefile renames to y-tab.h)
- y-tok.h is constructed by maketokn from the output of lex
Most of your coding will be in sem.c and codegen.c
-->
Top
Back to CS 524 Class Page