$ echo "write 3+4*5;" | ./splc
target triple = "x86_64-pc-linux-gnu"
@pfmt = constant [5 x i8] c"%ld\0A\00"
declare i32 @printf(i8*,...)
define i32 @main() {
%v1 = mul i64 4, 5
%v2 = add i64 3, %v1
call i32(i8*,...) @printf(i8* getelementptr([5 x i8], [5 x i8]* @pfmt, i32 0, i32 0), i64 %v2)
ret i32 0
}
Of course the whole point of this is that once you have compiled
to LLVM you can ask the llvm toolchain (in this case Clang) to
compile to machine code for you!
$ echo "write 3+4*5;" | ./splc > foo.ll $ clang -o foo foo.ll $ ./foo 23Note that if you give splc a .spl file as input, it will produce a .ll file from it automatically.
$ echo "write 3+4*5;" > bar.spl $ ./splc bar.spl Reading input from bar.spl and writing output to bar.ll $ clang -o bar bar.ll $ ./bar 23
For this lab, you are going to implement a compiler for the SPL language.
Specifically, you will write a program splc that reads in an SPL
program and writes out a program in the LLVM Intermediate Representation language.
This is a real IR language that is used by many popular modern compilers, and it
allows your program to be ultimately compiled to machine code for just about any
computer that you can find today.
The IR code your splc program generates
can either be executed directly by using the lli command,
compiled to assembly using the llc command, or compiled directly
to an executable using clang.
The lab is structured in three main parts. First is an overview of the LLVM tools you will use and how the LLVM IR language works. There is nothing to turn in for this part, but it will be a useful reference for the rest of the assignment.
The second part of the lab is getting most of SPL supported by your compiler, including arithmetic, reading and writing, variables, and function calls. At this point, you will be able to compile many SPL programs to machine code, with the restriction that your functions won't support any non-local references.
The final part of the lab is partially up to you! We have five possible ways of making your compiler support a larger subset of the SPL language. You need to pick (at least) two of these five options and implement them. As you will discover, each option has some tricky aspects that will challenge you to think hard about how your compiler needs to work.
Writing a compiler is a difficult challenge, since it requires you to have three programs in mind: (1) the SPL program your compiler takes as input; (2) your actual compiler which is written in C++; and (3) the LLVM IR code that your compiler is producing. You are ready for this challenge, but it will take careful, deliberate work. Help each other out (within the bounds of the course policy), give yourself plenty of time, ask your instructors for help as needed, and have fun!
splc compiler is to convert an SPL program such
as myprog.spl into an LLVM IR program such as myprog.ll.
Note that LLVM IR code typically ends with a .ll extension.
There are three main tools that you can use to work with LLVM IR code. These should be installed already on the lab machines; on your virtual machine, you may need to install some extra packages by running:
sudo apt install clang llvm-6.0 llvm-6.0-dev llvm-6.0-doc llvm-runtime
$ clang myprog.c -O0 -S -emit-llvm # Produce myprog.ll from C program myprog.c $ clang myprog.ll -O2 -Wall -o myprog # Produce executable myprog from myprog.ll
.ll program directy. What this does under the hood is really just
compile your program and then run it.
$ lli myprog.ll # Run myprog.ll
$ llc myprog.ll # Compile myprog.ll to x86 assembly myprog.s
The most important types you will need to use in LLVM IR are:
char.
You have to use this type when calling system functions like
printf which expect (arrays of) characters.
int.
You will actually not use this type very much in your
LLVM IR, except as the return value of system calls such as
printf, scanf, or main.
long.
This is the type you will use in LLVM to store any SPL value,
whether it be a boolean, a number, or a function pointer.
[5 x i8].
main
is typically i32() if you don't care about command-line arguments,
or i32(i32,i8**) if you do.
There are various instructions in LLVM to convert between any of these types.
Most useful are
trunc,
to truncate a larger integer type to a smaller one,
zext,
to extend a smaller integer to a larger one by padding with 0 bits,
ptrtoint,
to convert a pointer to an i64, and (you guessed it)
inttoptr,
to do the opposite.
@ and are typically used for global constants or function names, for example
@main.
Within each function, you will more commonly use local register names, which are
preceded with a %. Because this is an intermediate representation, these
names don't really matter, and because the language is SSA, there need to be a lot of them,
so it's common to use some numbering scheme like %v1, %v2, etc.
You will notice that running clang produces register names tha are just numbers
like %1, %2, %3, but you should not use these
kind of names in your own LLVM IR because they have to follow some special (and annoying) rules.
LLVM IR is a 3AC language, and most typical instructions have this form:
%dest = instruction type %arg1, %arg2For example, here is an instruction to
subtract 3 from the 64-bit integer
stored in register %x, storing the result in register %y:
%y = sub i64 %x, 3
You can think of sub i64 as the entire name of
the instruction in the previous case. Some instructions like this are
add, sub, mul, sdiv, and, or.
The icmp
instruction is similar, except there is an extra component
for the type of comparison. For example, the following instruction compares two
64-bit values in registers %v1 and %v2 storing the i1
value of 1 or 0 in register %v3, depending on whether %v1 is
greater than or equal to %v2:
%v3 = icmp sge i64 %v1, %v2
somelabel:
and each basic block must end with a special
"terminator instruction", probably either
br or
ret.
Labels are actually stored in registers; for example the label above would be
stored in a register called %somelabel.
Branch instructions (br) can be either unconditional, like
br label %somelabel
or conditional, like
br i1 %v3, label %iflabel, label %elselabel
@), and arguments. For example,
here is the definition of a function that takes a 64-bit integer
n and returns n^2:
define i64 @sqr(i64 %n) {
%nsq = mul i64 %n, %n
ret i64 %nsq
}
To call a function, you use the call instruction,
like:
%twentyfive = call i64(i64) @sqr (i64 %five)Notice that in function calls, we have to specify the type of the function,
i64(i64) in the previous example,
the name of the function, and then the types and names of all the arguments.
This can feel a little redundant, but anyway it's a requirement of the language.
target triple = "x86_64-pc-linux-gnu"Next we should see any global variables. These will most commonly be string literals, for example like:
@msg = constant [14 x i8] c"Hello, world!\00"Next we have function prototypes for any external functions or system calls, using the
declare keyword, such as
declare i32 @puts(i8*)After this, all that remains is function definitions. One of them should of course be called
@main, like
define i32 @main() {
initial:
call i32(i8*) @puts(i8* getelementptr([14 x i8], [14 x i8]* @msg, i32 0, i32 0))
ret i32 0
}
Putting all this together gives you a complete, working LLVM IR program.
resout stream) LLVM IR code.
You won't actually be running the program, but printing the instructions
to run the program later. Your scanner, parser, and AST generation will be the
same, but your main(), as well as the eval and
exec methods in the AST, will work differently.
If you inspect the starter code for this lab, you'll notice a few differences from the previos interpreter labs:
main() function has been moved out of spl.ypp
and into its own file in splc.cpp. So now spl.ypp is
just a parser. In fact, you shouldn't need to modify the parser (spl.ypp)
or scanner (spl.lpp) for this lab.
value.hpp file, and the eval methods in
AST expression nodes return a string, not a Value object.
This is because the actual values are not known until run-time, so they have no
place in the compiler. So we have ...
string eval(ostream& out, Context* con);where eval does two things: 1) it prints to "out" the LLVM code that will evaluate the expression when the program is ultimately executed, and 2) it returns a string that is either the number that is the value of the expression (only really for constants) or the name of the LLVM variable that will contain the result of the expression when the program is ultimately executed. These variables should always be i64's! Thus, eval for ArithOp 3 + 4 would output to "out"
%v1 = add i64 3, 4and return the string "%v1". That way the parent expression will know to look in variable %v1 to find the result of the evaluation. As for exec, it only prints the LLV code that executes that statement.
eval and exec method
takes a pointer to a Context
object. You can see the (very short!) definition of the Context class in
splc.hpp. Right now, it just maintains a counter and implements a simple
function to return the next unused register name. This will be frequently
useful, since you have to come up with a new name for every instruction in
LLVM IR.
If you compile and run the starter code as-is, you will get an splc
program that works only for literal numbers (class Num),
arithmetic operators (class ArithOp),
and write statements (class Write).
Looking at how those eval and exec methods are implemented
will be useful to get started here.
scanf
to get read expressions working. Because scanf takes the address of
the value you are reading into, you might
need to use alloca, load to get it working.
write 10; write 5 + 6; write 20 + (-8); write -7 + 4*5; write true; write 5 < 1; write not 1 = 2 and 4 != 5; write read * 2; 20;
You will need to use the instructions alloca, load, store to get this working. Look at the documentation and examples there for help in getting started.
We are going to implement a single global scope first. My
suggestion is that you add a map<string,string<
to the context object along with methods to create a
binding and fetch a binding. Remember: you are binding an
SPL variable name to the name of an LLVM variable that
will be a pointer to a memory location; that memory
location contains the current value of "the variable".
This means bindings never change - once you put something
in the map for a given Id, that Id's value in the map is
never reassigned.
Rather than try the full test case shown below, think about starting very small and working your way up. For example, you could first just try to get a program to compile that declares a new variable. Then try assigning a variable and printing it out right away. Starting slowly like this and working carefully, examining the code your compiler produces for very simple cases first, will make this whole lab much easier.
Exercisesnew, assignment, and lookup working.
Here's a program which should work now:
new x := 10; write x + 3; new y := x * x; write y; write y + x; y := 1101; write x+y;
So, go back to your Context class, and change to a stack of scope ... although I strongly recommend you actually use a vector with .push_back(.) and .pop_back() in place of push and pop, and .back() to peek at the top element of the stack. And then make sure you push and pop as you enter and leave blocks.
Exercises
new i := 5;
{
new i := 10;
write i;
i := 11;
write i;
}
write i;
For example, a code fragment such as
ifelse 3 < 5
{ write 100; }
{ write 200; }
write 300;
might be compiled to
%v1 = icmp slt i64 3, 5
br i1 %v1, label %b1, label %b2
b1:
call void(i64) @write (i64 100)
br label %b3
b2:
call void(i64) @write (i64 200)
br label %b3
b3:
call void(i64) @write (i64 300)
Exercises
new secret := 42;
new guess := read;
while guess != secret {
ifelse guess < secret
{ write -1; }
{ write 1; }
guess := read;
}
write secret;Getting functions to work really consists of two tasks.
First, you have to implement function definitions, as triggered
by lambda expressions in the code. Each lambda in the SPL code will
correspond to one function in the LLVM IR code that your compiler produces.
But at the point of the lambda itself, your compiler is in the middle
of emitting code for main(), so you can't
just print this function definition.
I have provided a second stream for you to write to. It's
called delayout. In
fact, delayout is
an ostringstream, which is a C++ type that is
an ostream, which means you can << output to it just
like cout and cerr, but it collects all that output in a
string that you can access with the .str() method. The
starter code is set up so that the last thing the program
does is resout << delayout.str(), i.e.
anything that has been written to delayout is
emitted at the ind of the program.
So while you're writing out main, if you come across a
function, just output that function to delayout so that it
is written after main.
Note: to allow for lambdas containing lambdas, you
may want to introduce your own ostringstream
local variables. That looks like:
ostringstream myout;
At the point of the lambda itself, all your compiler needs to
do is convert the function pointer to an i64 type in LLVM, using the
ptrtoint-to
instruction. For example, check out the very
simple (and mostly pointless) SPL program below and the
LLVM snippet it might compiler to.
Notice how the function name @v2
has type i64(i64)*, i.e. it is a pointer to a
function. Also notice that we are referring to the name
in main, even though it doesn't get declared until later!
There is no "declare-before-use" requirement in LLVM.
new f := lambda x { ret := x*x; };
write 456;
might compile to something like
; ... header definitions etc up here ...
define i32 @main() {
%v3 = ptrtoint i64(i64)* @v2 to i64
%v1 = alloca i64
store i64 %v3, i64* %v1
call i32(i8*,...) @printf(i8* getelementptr([5 x i8], [5 x i8]* @pfmt, i32 0, i32 0), i64 42)
ret i32 0
}
define i64 @v2 (i64 %v4) {
; ... you'll have to figure out what goes here, of course!
}
Once you have lambdas working well, emitting code for function names
like we see above, the next step is to implement function calls. Just as with your
interpreter labs, I recommend starting by ignoring arguments and return values
and just get the control to go to the function and come back. Just as the
example above uses ptrtoint to convert the function pointer to a saved
i64 value, at the function call site you will have to convert
an i64 back to a function pointer using
inttoptr-to.
Important! The key question you have to tackle is this: should the binding of the parameter and ret to memory locations and the storing of the argument value in the parameter's memory location happen a) the code that calls the function, or b) inside the function itself? Note: when we say "function", we're talking about the LLVM function introduced with the "define" keyword. The right answer is b! You can't follow a and do this when the function is called, because you've lost the Lambda* at that point and you won't know what the parameter name is, nor do you have the scope you want --- you won't know what LLVM variable the body of the function will be using to get the locations of the parameter and "ret". So b is the only way to go.
new f := false;
ifelse read = 2
{ f := lambda x { ret := x*x; }; }
{ f := lambda x { ret := x*x*x; }; }
new i := 1;
while i <= 10 {
write f@i;
i := i + 1;
}
The trickier one to get right is this:
new g := lambda y {
new f := lambda x { ret := x*x; };
ret := f@(y+1);
};
write g@5;
i8 value with something like 0 indicating unset,
1 indicating a number, 2 indicating a boolean, and 3 indicating a function.
write
command in SPL should now respect the type of its argument, printing the value of
a number, or "true"/"false" for a boolean, or "function" for a function pointer.
write 1 * false; should still compile, but produce a run-time error from the failed coversion of
false to a number.
exit so you can safely terminate from any point in the program.
isnum.)
new showx := lambda x {
ret := lambda ignored { write x; };
};
new fiver := showx@5;
new sixer := showx@6;
fiver@0;
sixer@0;
fiver@0;
fiver@0; (Look back to Lab 8 for many more examples!)malloc."Please enter a positive number."
new x := read;
ifelse x > 0
{ "Thank you and good job." }
{ "You will now be punished with 50 punches."
new i := 0;
while i < 50 { "punch" i := i + 1; }
}
sqrt, rand, and one other function of your choosing.
main() you'll probably have to have instructions to store
those function pointers just like any other variable would be stored, so that function
calls to your built-in functions work just like any other function call.
isnum.)
and and or expressions,
using phi expressions in LLVM.
and
or or) is only evaluated if it's necessary. For example, if the first part of an
and is false, then we don't need to bother with evaluating the seecond part,
because we know already that the entire expression is definitely false.
alloca call every time a NewStmt is executed. That means a whole lot of memory gets
used even for simple loops like this one to compute \(n^2\) in a very stupid way:
new n := 1000;
new i := 1;
new s := 0;
while i <= n {
new j := 1;
while j <= n {
new temp := s + 1;
s := temp;
j := j + 1;
}
i := i + 1;
}
write s; If you compile and run the program above, you will most likely get a seg fault. But
the issue is not in how you are allocating memory, just in how much memory
you are allocating! As written, the space for variable temp gets allocated
a million itmes. That should be just enough to blow through the 8MB of stack space that Linux
gives you by default.
main
that your compiler outputs should statement(s) to allocate memory for all variables
that are used in main. Then, each time a variable is used, you just access that memory
address which was alreay allocated. In the example above, this means that the space
for temp is only allocated once instead of 1 million times - hooray!
submit -c=SI413 -p=lab13 *
submit -c=SI413 -p=lab13-ext *Note: You must have a README file in your submission that clearly states which of the extensions you have done ... and "done" means it works!