LLVMlite is a small subset of the LLVM IR that we will be using throughout the course as the intermediate representation in our compiler. Conceptually, it is either an abstract assembly-like language or a even lower-level C-like language that is convenient to manipulate programatically.

LLVMlite's features include:

This document explains the structure of well-formed LLVMlite programs and the relevant parts of the code provided with the assignments. A description of the full LLVM intermediate representation can be found in the LLVM Language Reference.

Structure and Well-Formedness of Programs

To give you a sense of structure of LLVMlite programs and the most basic features, the following is our running example, the simple recursive factorial function written in the conrete syntax of the LLVMlite IR.

    define i64 @fac(i64 %n) {              ; (1)
      %1 = icmp sle i64 %n, 0              ; (2) 
      br i1 %1, label %ret, label %rec     ; (3)
    ret:                                   ; (4)
      ret i64 1
    rec:                                   ; (5)
      %2 = sub i64 %n, 1                   ; (6)
      %3 = call i64 @fac(i64 %2)           ; (7)
      %4 = mul i64 %n, %3
      ret i64 %4                           ; (8)

    define i64 @main() {                   ; (9)
      %1 = call i64 @fac(i64 6)
      ret i64 %1                           

First, notice the function definition at (1). The i64 annotations declare the return type and the type of the argument n. The argument is prefixed with "%" to indicate that it's an identifier local to the function, while fac is prefixed with "@" to indicate that it is in scope in the entire compilation unit.

Next, at (2) we have the first instruction of the body of fac, which performs a signed comparison of the argument %n and 0 and assigns the result to the temporary %1. The instruction at (3) is a "terminator", and marks the end the current block. It will transfer control to either ret at (4) or rec at (5). The labels at (4) and (5) each indicate the beginning of a new block of instructions. Notice that the entry block starting at (2) is not labeled: in LLVM it is illegal to jump back to the entry block of a function body. Moving on, (6) performs a subtraction and names the result %2. The i64 annotation indicates that both operands are 64-bit integers.

Finally, (8) returns from the function with the result named by %4, and (9) defines the main function of the program, which simply calls fac with a literal i64 argument.

Program Structure

LLVMlite programs consist of three types of global definitions: function definitions, global data definitions, and named type definitions, which may be interleaved. These definitions are in scope for the entire compiltaion unit, may be mutually recursive, and need not be declared in order.


Functions, global data definitions, and instruction are explicitly annotated with types. These are divided into "simple" types that may appear on the stack and as arguments to functions and "aggregate" types that may only appear in global and heap-allocated data. (Unlike full LLVM, LLVM lite does not allow locals to hold structured data.) There is also a "void" type that only appears in the return type of instructions and functions that do not return a value. This is essentially the ML unit type, but it has the additional restriction that it cannot appear as the type of an operand, so it is actually illegal to give it a name in the LLVM concrete syntax. In the following table we use T to range over simple and aggregate (non-void, non-function) types, F to range over function types, and S to range over simple types.

Concrete Syntax
void void Indicates the instruction does not return a usable value.
i1, i64 simple 1-bit (boolean) and 64-bit integer values.
T* simple Pointer that can be dereferenced if its target is compatible with T
i8* simple Pointer to the first character in a null-terminated array of bytes. Note: i8* is a valid type, but just i8 is not. LLVMlite programs do not operate over byte-sized integer values.
F* simple Function pointer
S(S1, ..., SN) function A function from S1, ..., SN to S
void(S1, ..., SN) function A function from S1, ..., SN to void
{ T1, ..., TN } aggregate Tuple of values of types T1, ..., TN
[ N x T ] aggregate Exactly N values of type T
%NAME * Abbreviation defined by a top-level named type definition

Named Types

Named type definitions

%IDENT = type T
define abbreviations for types in the scope of the entire compilation unit. The following specification assumes that these are replaced with their definitions whenever they are encountered. Note that recursive types, in which T mentions %IDENT are allowed, but for the type to be well formed, each such recursive occurrence must appear under a *. More generally, any collection of named types may be mutually recursive (i.e. the names may appear in the the definitions), but each cycle of such references must be broken by a *.

Global Definitions

The next kind of top-level definition is global data

@IDENT = global T G
where G ranges over global initializers, described in the following table, and T is the associated type. The global identifier @IDENT, when used in the program, has type T*. For example, the following program fragment has valid annotations:
      @foo = global i64 42
      @bar = global i64* @foo
      @baz = global i64** @bar

Concrete Syntax
null T* The null pointer constant.
[0-9]+ i64 64-bit integer literal.
@IDENT T* Global identifier. The type is always a pointer of the type associated with the global definition.
c"[A-z]*\00" [ N x i8 ] String literal. The size of the array N should be the length of the string in bytes, including the null terminator \00.
[ T G1, ..., T GN ] [ N x T ] Array literal.
{ T1 G1, ..., TN GN } {T1,...,TN}   Struct literal.


We now turn to the parts of a function declaration. Each instruction in a function has zero or more operands which for the purposes of determining the well-formedness of programs, are restricted to the following types.

Concrete Syntax
null T* The null pointer constant
[0-9]+ i64 64-bit integer literal
@IDENT T* Global identifier. The type can always be determined from the global definitions and is always a pointer
%IDENT S Local identifier: can only name values of simple type. The type determined by an local definition of %IDENT in scope

Instructions and Terminators

The following table describes the restrictions on the types that may appear as parameters of well-formed instructions, and the constraints on the operands and result of the instruction for the purposes of type-checking. We assume that named types have been replace by their definitions.

For example, in the call instruction, each type parameter S1, ..., SN must be a simple type. When we type check a program containing this instruction, we must make sure that the operand OP1 has exactly the function pointer type S1(S2, ..., SN)*, and that the remaining operands OP2, ..., OPN have types S2, ..., SN.

Concrete Syntax
Operand → Result Types
%L = BOP i64 OP1, OP2 i64 x i64 → i64
%L = alloca S - → S*
%L = load S* OP S* → S
store S OP1, S* OP2 S x S* → void
%L = icmp CND S OP1, OP2 S x S → i1
%L = call S1 OP1(S2 OP2, ..., SN OPN) S1(S2, ..., SN)* x S2 x ... x SN → S1
call void OP1(S2 OP2, ... ,SN OPN) void(S2, ..., SN)* x S2 x ... x SN → void
%L = getelementptr T1* OP1, i32 OP2, ..., i32 OPN   T1* x i64 x ... x i64 -> GEPTY(T1, OP1, ..., OPN)*
%L = bitcast T1* OP to T2* T1* → T2*

Getelementptr Well-Formedness and Result Type

The getelementptr instruction has some additional well-formedness requirements. Operands after the first must all be constants, unless they are used to index into an array. LLVM actually requires the operands used to index into structs to be 32-bit integers. Rather than introducing 32-bit integers into our language, we will use our 64-bit constants and operands and assume the arguments of getelementptr always fall in the range [0, Int32.max_int].

In the table above, the result type of a getelementptr instruction described using the GEPTY function, which is defined in pseudocode as follows:

      GEPTY : T -> operand list -> T
      GEPTY : T operand::path' = GEPTY' T path'

      GEPTY' : T -> operand list -> T
      GEPTY' T                             [] = T
      GEPTY' { T1, ..., TN } (Const m)::path' = GEPTY' TM path' when m <= N
      GEPTY' [ _ x T ]         operand::path' = GEPTY'  T path'
Notice that GEPTY is a partial function. When GEPTY is not defined, the corresponding instruction is malformed. This happens when, for example: Also notice that a GEP instruction that indexes beyond the size of an array is well-formed. The length information on array tags is only present to help the compiler lay out data in memory and is not verified statically.

Blocks, CFGs, and Function Definitions

A block (or "basic" block) is just a sequence of instructions followed by a terminator:

Concrete Syntax
Operand → Result Types
ret void - → -
ret S OP S → -
br label %LAB - → -
br i1 OP, label %LAB1, label %LAB2   i1 → -

The body of a function is represented by a control flow graph (CFG). A CFG consists of a distinguished entry block and a sequence blocks of prefixed with a label LAB:. A function definition has a return type, the function name, a list of formal parameters and their types, and the body of the function. The full syntax of a function definition is then:

define [S|void] @IDENT(S1 OP, ... , SN OP) { BLOCK (LAB: BLOCK)...}

Like global data definitions, the type of the defined global identifier @IDENT is S(S1, ... , SN)* or void(S1, ... , SN)*, a function pointer.

There are some additional global well-formedness requirements for function definitions. Each label and local definition must be unique. In this way, a local identifier both names the result of an instruction and serves to identify the instruction within a function body. For the locals in a CFG to be well scoped, there must never be a path to a use of a local that does not pass through its definition. We will not go into the details here.

Last modified: Tue Feb 7 16:45:49 EST 2017