CSC 652 Project

Engineering an Optimizing Compiler

Schedule

Phase 1: Trivial
Phase 2: Code Generation
Phase 3: Control Flow

What to Submit

The three phases will be graded separately. So, your submission should include three folders, one for each phase. In the folder for each phase, besides all the source code needed for building and running your compiler, you must include a Readme file that contains the phase number, a brief description of the files and directories in the folder, an explanation of your program structure especially the parts that you have developed, how to build and run, and known bugs. In addition, you need to submit a final writeup as required for every course project in our class. However, it may simply be a combination of the Readme files of your three phases. There are no page limits on the final writeup.

Downloads & References

  1. AST library: should be included in your CLASSPATH
  2. Test programs
  3. Test programs converted into ADAP format
  4. Test programs after code generation and instrumentation
  5. Example optimization phase: Constant Folding
  6. Instrumented example
  7. Java docs of the AST library
  8. AST library overview
  9. Java docs of the AST library
  10. Project template
  11. Java programming resources from Marty Hall

Introduction

Typically a compiler is made up of a front end that converts source programs to a intermediate representation, a middle end that carries out phases of optimizations and a back end that generates target object codes. Here in your 652 class project, you are expected to implement the middle part of a toy compiler that accepts a subset of C. Though much simpler than a product compiler, making it perfect is not trivial. You need to organize the compiler structure, implement the required phases, with the correctness of the generated code maintained.

Three phases are required:

  1. Trivial: to count the number of dynamic instructions in the program.
  2. Code Generation: to convert the program into an assembly-like format.
  3. Control Flow Graph: to build control flow graphs of the program.

With the development process going on, difficulty level would probably increase too. Regarding this, you'd better well-design your framework at the early stage to avoid rebuild the whole system to fit further optimizations.

The accepted language is a subset of C. The test programs that will be used to evaluate the following properties:

  • There are no pointers in the program.
  • There are no structure types in the program.
  • The only data types are integer and floating point scalars and one-dimensional integer arrays.

A set of test programs can be downloaded from here.

The compiler is written in Java. If you are not familiar with Java, please resort to the resource page.

Setup

Step I: test_prog.c --> {test_prog.adap, test_prog.adap.h}

We use lcc as our compiler's front end that converts the test programs into an intermediate form, ADAP.  ADAP is actually an abstract syntax tree (AST) whose readable form is dumped into an output file with extension .adap. For your convenience, we have already converted the test programs and from here you may download them.

Step II: {test_prog.adap, test_prog.adap.h} --> AST

Now it comes to Java. The AST_lib offers classes to transform the .adap files into internal AST representations. You'll create an instance of ProgAst using your test program's adap file name as the constructor's input parameter. Then you'll be working on the AST all the way until code generation.

Step III: AST --> test_prog.out.c

The third step is code generation which is simply done by calling the AST's method GenCode(). An output C file will be dumped with extension .out.c. Now compile the output .out.c file into an executable object file then run it on your PC, you'll see and check the outputs.

Of course you have to make every step working to see the final correct results.

An example ConstFolding.java shows the use of the parser and code generator as well as traversal and modification of an AST program tree. You can download it from here. A test input is also included: initial program is const1.c; first converted to const1.adap and const1.adap.h by lcc; then applied constant folding by ConstFolding procedure; finally the output is in const1.out.c.

You are provided with a template for the project. You can reuse the Makefile and the directory organization. Just add your own files and make small changes to the Makefile. Type make, you'll complete build, compile and run in a batch way. Check it out here and use it as a start point of your whole project.

In Detail

Phase I: Trivial

In the first phase, you need to instrument the IR and insert the following function calls into the program:

  1. Insert Init(); at the beginning of the program;
  2. Insert RecordInst(); for each statement (except return statements) in the program. For example, the following code fragment

    ...
    a = a + 1;
    foo(a);
    ...

should be transformed to

    ...
    RecordInst();
    a = a + 1;
    RecordInst();
    foo(a);
    ...

  1. Insert Report(); at each exit of the program.

You also need to implement functions Init(), RecordInst(), and Report() and make them part of the program. They maintain a global counter that records the number of dynamic instructions (statements in case of PHASE I) in the program. Init() sets the counter to zero. Each time RecordInst() is called, the counter is incremented by one. Finally, Report() prints the total number of instructions (statements) executed to the standard output. An example of instrumented programs (without the implementation of the functions) can be downloaded from here. Except for recording and reporting number of instructions executed, the instrumented program should behave exactly the same as the original program. Make sure that there is no possibility of name conflicts among variables. However, I would not mind if you do not ensure their uniqueness in the C code, since the names were given by me. However, other than these three functions you need to make sure that none of the entities that you add to the code conflicts with any other variable or function name in the code. Here are my test programs after code generation (PHASE II) and instrumentation. At this stage, you may ignore the extra statements and variables generated in them. You need to learn how to use the provided front-end and back-end tools, how to traverse and manipulate the AST tree before start working on the instrumentation. This will also prepare you for the later phases.

Hint: You may want to add function definitions not by modifying AST, but by adding them directly to the output file. Do not forget #include lines.  

Phase II: Code Generation

In this phase, you need to transform the program into an assembly-like format. In doing so, you are still working on the AST. What is changed is only its internal structure. After the transformation, you may call method GenCode() to generate the output C file.

Expressions in the transformed program should satisfy the following requirements:

  • Left hand side of an assignment statement should be a variable or an array access.
  • Right hand side of an assignment statement should be either a variable, a constant, an array access, a function call or an expression with no more than two operands (or one operator).
  • Any operand of an expression can only be a variable or a constant.
  • The predicate of a conditional jump can be one comparison operation whose two operands are either a variable or a constant.
  • Parameters to a function call can only be a variable or a constant.
  • Index of an array can only be a variable or a constant.
  • Return parameter can only be a variable or a constant.

After the program transformation, new temporary variables could be introduced. . An example program segment

    a = b + c[e] + d;
    if (a+d<b) goto L1;
         foo (m[a+d]);
         m[e+f] = m[a] + m[d];
    L1:

needs to be converted to something like the following:         

    t0 = c[e];
    t1 = b + t0;
    a = t1 + d;
    t2 = a + d;
    if (t2 < b) goto L1;
    t3 = a + d;
    t4 = m[t3];
    foo(t4);
    t5 = m[a];
    t6 = m[d];
    t7 = e + f;
    m[t7] = t5 + t6;
L1:

Some notices:

You should take scanf and printf as special instructions that need not conform to above specifications. You should not make any temporary variable to take value of the pointer passed to scanf. Remember that our language does not have pointers, so we do not deal with them. For printf, you still need to make all the arguments in variable or constant form except for the format string.

Here are my test programs after code generation and instrumentation. Your output should match with them. Remember that you have to take care of types. Our language has integers and floating points only.

Phase III: Control Flow

In this phase, you need to construct a control flow graph (CFG) for each function definition. You should identify all the basic blocks (BB) in a function and figure out predecessors and successors of each BB. You may assign each BB a unique ID with naming scope either within a function or throughout the whole program. As a debugging support, your compiler should provide a means of dumping the CFG's into an output stream. In this assignment, you are required to dump the CFG's of each program into a text file with extension .cfg (for instance, automaton.cfg for automaton.new.c) in the output program directory.

A sample control flow graph is shown as following

        int a, b;

    //Block 1:
    //         Predecessors: Entry
    //         Successors: Block 2, Block 3
        a = 1;
        b = 2;
        if ((a > b)) goto L6;

    //Block 2:
    //         Predecessors: Block 1
    //         Successors: Block 3, Block 4
        a = a - b;
        b = b - a;
        if ((a <= b)) goto L4;

    //Block 3:
    //         Predecessors: Block 1, Block 2
    //         Successors: Exit
    L6:
        return 0;

    //Block 4:
    //         Predecessors: Block 2
    //         Successors: Block 6
    L4:
        goto L3;

    //Block 5:
    //         Predecessors:
    //         Successors: Exit
    L7:
        return 1;

    //Block 6:
    //         Predecessors: Block 4
    //         Successors: Exit
    L3:

Note: your dumped CFG's should not omit labels if they mark the beginning of a basic block. Function calls are not necessarily conditions of starting a new BB.