This tutorial walks through creating a compiler using LLVM. This closely follows the LLVM tutorial. In fact, it was originally my intention to work through that tutorial, but the tools I ended up using were different enough that I thought a write-up would be helpful to others.
The LLVM tutorial uses a hand-rolled C++ lexer, parser, and code generator. I created a lexer using Flex, a parser using Bison, and a LLVM IR generator using Digital Mars D, using the
LLVM C interface.
As mentioned, the source language is the Kaleidoscope language, essentially as described by the LLVM tutorial. All values are of type double. Functions can be defined which return a single expression. All functions take a number of double parameters and return a double. External functions can be declared with the extern keyword. If/then/else creates a single expression based returning either the then expression or the else expression.
Input() will read a single double from standard input, and output creates a main() method which writes a single value to standard output.
This example shows off the features of the language, and will be the example used throughout this tutorial. It is a convoluted way of reading a value n, and displaying the nth fibonacci number:
extern cos(x)
extern sqrt(x)
extern fib(n)
def pi() 3.14159
def square(x) x * x
def sin(x) cos(x + pi() / 2)
def len(x, y) sqrt(square(x) + square(y))
def fib(n)
if(n < 2) then
n - 1
else
fib(n - 2) + fib(n - 1)
output len(sin(pi()), fib(input()))
First, let's look at the lexer:
%option noyywrap
%{
#include "parser.tab.h"
char *identifier_str;
%}
IDENTIFIER [a-zA-Z][a-zA-Z0-9]*
NUMBER [0-9.]+
WHITESPACE [ \t\r\n]
SYMBOL [^a-zA-z0-9.]
COMMENT #.*
%%
{WHITESPACE} /* Do nothing */
{COMMENT} /* Do nothing */
{NUMBER} return TOKEN_NUMBER;
def return TOKEN_DEF;
extern return TOKEN_EXTERN;
input return TOKEN_INPUT;
output return TOKEN_OUTPUT;
if return TOKEN_IF;
then return TOKEN_THEN;
else return TOKEN_ELSE;
{IDENTIFIER} identifier_str = yytext; return TOKEN_IDENTIFIER;
{SYMBOL} return *yytext;
<<EOF>> return TOKEN_EOF;
This is a very straightforward Flex lexer. The only logic involved is for identifiers, a global var identifier_str is set to the value of the identifier. All other tokens simply return their int
constants.
The parser is more complex. The goal of the parser is to build an Abstract Syntax Tree (AST) for the language elements. Later, we will walk the AST and generate the code for it.
The AST is defined in the D code. It is a hierarchy of classes, each of which is a subclass of ASTNode. ASTNode looks like the following:
class ASTNode {
public:
int id;
ASTNode left;
ASTNode right;
this(int id, ASTNode left = null, ASTNode right = null) {
this.id = id;
this.left = left;
this.right = right;
}
mixin Accept;
}
For interfacing with Bison, we give each node an id, and keep an array of all the nodes. The ids of lower-level nodes are passed as parameters from Bison when creating higher nodes. The D code then looks up the node with the given id, and inserts it into the AST.
A concrete example helps illustrate this. Here is the Bison code for a multiplication expression:
| expression '*' expression
{ $$ = ast_binary_expression('*', $1, $3); }
This code passes the id of the two expressions to be multiplied into the ast_binary_expression function, which creates a new node for binary expressions and returns its id. That code looks like this:
int addNode(ASTNode node, bool isStatement = false) {
nodes.length = cast(int) fmax(nodes.length, node.id + 1);
nodes[node.id] = node;
if(!statements) {
statements = new Stack!(Statement)();
}
if(isStatement) {
Statement statementNode = new Statement(nextId(), node);
if(statements.length > 0) {
statements.peek().nextStatement = statementNode;
}
statements.push(statementNode);
}
return node.id;
}
int ast_binary_expression(char operation, int lhIndex, int rhIndex) {
return addNode(new BinaryExpression(nextId(), operation, nodes[lhIndex], nodes[rhIndex]));
}
The other nodes work the same way. Here is the complete parser source:
%{
#include
#include
#include
#include "ast.h"
char *endp;
extern char *yytext;
int yylex();
void yyerror(char *s);
%}
%token TOKEN_EOF
%token TOKEN_DEF
%token TOKEN_EXTERN
%token TOKEN_IF
%token TOKEN_THEN
%token TOKEN_ELSE
%token TOKEN_IDENTIFIER
%token TOKEN_NUMBER
%token TOKEN_INPUT
%token TOKEN_OUTPUT
%start statement
%left '<' '=' '>'
%left '+' '-'
%left '*' '/'
%left ','
%%
statement : extern statement
| function statement
| output statement
| TOKEN_EOF
;
identifier: TOKEN_IDENTIFIER
{ $$ = ast_identifier(strdup(yytext)); }
extern : TOKEN_EXTERN prototype
{ $$ = ast_extern($2); }
;
function : TOKEN_DEF prototype expression
{ $$ = ast_function($2, $3); }
;
input : TOKEN_INPUT '(' ')'
{ $$ = ast_input(); }
;
output : TOKEN_OUTPUT expression
{ $$ = ast_output($2); }
;
prototype : identifier '(' ')'
{ $$ = ast_prototype($1, -1); }
| identifier '(' prototype_args ')'
{ $$ = ast_prototype($1, $3); }
;
prototype_args : identifier
{ $$ = ast_prototype_arg($1, -1); }
| identifier ',' prototype_args
{ $$ = ast_prototype_arg($1, $3); }
;
call : identifier '(' ')'
{ $$ = ast_call($1, -1); }
| identifier '(' call_args ')'
{ $$ = ast_call($1, $3); }
;
call_args : expression
{ $$ = ast_call_arg($1, -1); }
| expression ',' call_args
{ $$ = ast_call_arg($1, $3); }
;
expression : expression '+' expression
{ $$ = ast_binary_expression('+', $1, $3); }
| expression '-' expression
{ $$ = ast_binary_expression('-', $1, $3); }
| expression '*' expression
{ $$ = ast_binary_expression('*', $1, $3); }
| expression '/' expression
{ $$ = ast_binary_expression('/', $1, $3); }
| '(' expression ')'
{ $$ = $2; }
| identifier
{ $$ = ast_variable($1); }
| TOKEN_NUMBER
{ $$ = ast_number(strtod(yytext, &endp)); }
| call
| if
| input
;
boolean_expression : expression '<' expression
{ $$ = ast_boolean_expression('<', $1, $3); }
| expression '=' expression
{ $$ = ast_boolean_expression('=', $1, $3); }
| expression '>' expression
{ $$ = ast_boolean_expression('>', $1, $3); }
;
if : TOKEN_IF '(' boolean_expression ')' then_else
{ $$ = ast_if($3, $5); }
;
then_else : TOKEN_THEN expression TOKEN_ELSE expression
{ $$ = ast_then_else($2, $4); }
;
%%
void yyerror(char *s) {
fprintf(stderr, "%s\n", s);
}
Once we have an AST tree, we can traverse the nodes for a variety of useful effects. As a simple example,
we can generate a GraphViz visualization of the AST. See the code for details, but here is the output for
the parser above:

The final step is to transform the AST into a set of LLVM IR instructions. The full syntax of the language is documented in the reference manual. We are using the C interface to programatically generate it, though, so our input will look a little different.
First, we generate a module. Modules are the independent units of compilation; for example, in C each .c file would be a module. For Kaleidoscope, we always create a single module. We also create a builder. Builders are used to create the instructions, and automatically insert them into the appropriate place in the
output. We also define a few basic types, and store them for later:
moduleRef = LLVMModuleCreateWithName(toStringz(name));
builderRef = LLVMCreateBuilder();
doubleType = LLVMDoubleType();
intType = LLVMInt32Type();
charType = LLVMInt8Type();
charPtrType = LLVMPointerType(charType, 0);
charPtrPtrType = LLVMPointerType(charPtrType, 0);
Constants in the source file get corresponding constants in the output:
void previsit(Number number) {
expressions.push(LLVMConstReal(doubleType, number.val));
}
Variables work similarly:
void previsit(Variable variable) {
expressions.push(fnContext.params[variable.name]);
}
Ok, those are pretty simple. Functions are divided into prototypes and implementations. Extern statements have a prototype with no implementation, and Defun statements have both. To facilitate this, we keep a context structure with information on the current function, if any,
auto functionType = getFunctionType(args.length);
fnContext.fn = LLVMAddFunction(moduleRef, toStringz(prototype.name), functionType);
LLVMSetLinkage(fnContext.fn, LLVMExternalLinkage);
}
void previsit(Prototype prototype) {
auto preExisting = LLVMGetNamedFunction(moduleRef, toStringz(prototype.name));
if(preExisting) {
fnContext.fn = preExisting;
} else {
auto args = prototype.flatArgs;
auto functionType = getFunctionType(args.length);
fnContext.fn = LLVMAddFunction(moduleRef, toStringz(prototype.name), functionType);
LLVMSetLinkage(fnContext.fn, LLVMExternalLinkage);
}
foreach(idx, arg; prototype.flatArgs) {
fnContext.params[arg.name] = LLVMGetParam(fnContext.fn, idx);
}
if(fnContext.type == FunctionType.defined) {
fnContext.block = LLVMAppendBasicBlock(fnContext.fn, "entry");
LLVMPositionBuilderAtEnd(builderRef, fnContext.block);
}
}
The last statement, LLVMPositionBuilderAtEnd, ensures that subsequent statements are always inserted into the correct context.
The other complication comes with the if/then/else conditional. For an if statement, three blocks of code are required: a then block, a else block, and a merge block. Code flows to either the then or else branch, both of which calculate their associated expressions and then return to the merge block.
Finally, a Phi node is created. A Phi nodes takes the expression calculated by one of the two blocks, depending on which block was executed. This is the result expression returned by the if statement.
void previsit(If ifNode) {
ifThenElseContext.mergeBlock = LLVMAppendBasicBlock(fnContext.fn, toStringz("merge"));
}
void visit(If ifNode) {
ifThenElseContext.thenBlock = LLVMAppendBasicBlock(fnContext.fn, toStringz("then"));
ifThenElseContext.elseBlock = LLVMAppendBasicBlock(fnContext.fn, toStringz("else"));
LLVMBuildCondBr(builderRef, expressions.pop(), ifThenElseContext.thenBlock, ifThenElseContext.elseBlock);
}
void postvisit(If ifNode) {
fnContext.block = ifThenElseContext.mergeBlock;
auto phi = LLVMBuildPhi(builderRef, doubleType, toStringz("phi"));
LLVMAddIncoming(phi, &ifThenElseContext.thenValue, &ifThenElseContext.thenBlock, 1);
LLVMAddIncoming(phi, &ifThenElseContext.elseValue, &ifThenElseContext.elseBlock, 1);
expressions.push(phi);
}
void previsit(ThenElse thenElseNode) {
LLVMPositionBuilderAtEnd(builderRef, ifThenElseContext.thenBlock);
}
void visit(ThenElse thenElseNode) {
LLVMBuildBr(builderRef, ifThenElseContext.mergeBlock);
LLVMPositionBuilderAtEnd(builderRef, ifThenElseContext.elseBlock);
}
void postvisit(ThenElse thenElseNode) {
ifThenElseContext.elseValue = expressions.pop();
ifThenElseContext.thenValue = expressions.pop();
LLVMBuildBr(builderRef, ifThenElseContext.mergeBlock);
LLVMPositionBuilderAtEnd(builderRef, ifThenElseContext.mergeBlock);
}
Finally, the output script is written to a binary file:
int LLVMWriteBitcodeToFile(LLVMModuleRef, char *);
This file can be disassembled to see the source:
; ModuleID = 'output.i'
@format = internal constant [4 x i8] c"%f\0A\00" ; <[4 x i8]*> [#uses=1]
declare i32 @printf(i8*, ...)
declare i32 @scanf(i8*, ...)
declare double @cos(double)
declare double @sqrt(double)
define double @fib(double) {
entry:
%lttmp = fcmp olt double %0, 2.000000e+00 ; <i1> [#uses=1]
br i1 %lttmp, label %then, label %else
merge: ; preds = %else, %then
%phi = phi double [ %subtmp, %then ], [ %addtmp, %else ] ; <double> [#uses=1]
ret double %phi
then: ; preds = %entry
%subtmp = sub double %0, 1.000000e+00 ; <double> [#uses=1]
br label %merge
else: ; preds = %entry
%subtmp1 = sub double %0, 2.000000e+00 ; <double> [#uses=1]
%call = call double @fib(double %subtmp1) ; <double> [#uses=1]
%subtmp2 = sub double %0, 1.000000e+00 ; <double> [#uses=1]
%call3 = call double @fib(double %subtmp2) ; <double> [#uses=1]
%addtmp = add double %call, %call3 ; <double> [#uses=1]
br label %merge
}
define double @pi() {
entry:
ret double 3.141590e+00
}
define double @square(double) {
entry:
%multmp = mul double %0, %0 ; <double> [#uses=1]
ret double %multmp
}
define double @sin(double) {
entry:
%call = call double @pi() ; <double> [#uses=1]
%divtmp = fdiv double %call, 2.000000e+00 ; <double> [#uses=1]
%addtmp = add double %0, %divtmp ; <double> [#uses=1]
%call1 = call double @cos(double %addtmp) ; <double> [#uses=1]
ret double %call1
}
define double @len(double, double) {
entry:
%call = call double @square(double %0) ; <double> [#uses=1]
%call1 = call double @square(double %1) ; <double> [#uses=1]
%addtmp = add double %call, %call1 ; <double> [#uses=1]
%call2 = call double @sqrt(double %addtmp) ; <double> [#uses=1]
ret double %call2
}
define i32 @main(i32, i8**) {
entry:
%call = call double @pi() ; <double> [#uses=1]
%call1 = call double @sin(double %call) ; <double> [#uses=1]
%inputtedDouble = alloca double ; <double*> [#uses=0]
%call2 = call double @fib(double 7.000000e+00) ; <double> [#uses=1]
%call3 = call double @len(double %call1, double %call2) ; <double> [#uses=1]
%printoutput = call i32 (i8*, ...)* @printf(i8* getelementptr ([4 x i8]* @format, i64 0, i64 0), double %call3) ; <i32> [#uses=0]
ret i32 1
}
Some other tools that can be used:
- lli--Runs the bytecode file directly
- opt--Optimize the bytecode
- llc--Compile to native code
So, with a few batch files, a compiler can be written to output optimized, native code.
The full source of this example can be found at
GitHub