The Vapor Programming Language

11 Jun 2014

Vapor is an attempt to create a programming language that encourages reasoning about mutability. While mutation is one of the primary means a programmer has to improve the performance of his or her program, it is also a source of many bugs and can make reasoning about programs very difficult. By tracing mutability and providing strong immutability guarantees, vapor allows programmers to make important trade-offs at the individual function level about what level of mutability is right for the problem at hand.

Specification

Vapor is a language which makes it easy for programmers to write code that can be correct and efficient by managing mutability at the language level. Most imperative languages use mutability for performance, but can have severe problems when it comes to reasoning about the application state. On the other hand, pure languages often are accused of having substandard performance because of excessive copying. In either case, it is often impossible for a programmer to use an object or function written in one style in the other, creating a barrier to entry when a different paradigm is required. Vapor aims to be a language that embraces both styles of programming for their merits and allows for great flexibility in managing the boundaries of pure and imperative code.

Many existing languages have some way for managing mutability of objects, but often it is either too lax or too restrictive to be used effectively by most audiences. One one hand the const keyword from C/C++ is easily ignored or removed by a cast and does not guarantee that a const reference will not be mutated by another part of the application. On the other hand the Rust borrow checker is so restrictive that it allows only one mutable reference to an object at a time. In vapor every immutable reference is guaranteed to be fully shareable, but algorithms for transformation can be written using an imperative style that yields better performance for consecutive updates.

Types

Vapor has 3 built-in types: int, bool, and nullable. These types can be used a-la-carte, or combined using structure declarations to encapsulate related fields. All types in vapor can easily be used either mutably or immutably. What this means is that the programmer is freed from choosing either a mutable or immutable object when they are designing a type, giving them flexibility when they use the object. A user-defined structure for a 2D point would be declared like so:

struct Point2D {
    x: int,
    y: int
}

This generates the equivalent of the following two Java classes:

public class MutablePoint2D {
    public int x;
    public int y;

    public MutablePoint2D(int x, int y) {
        this.x = x;
        this.y = y;
    }

    public ImmutablePoint2D getImmutable() {
        return new ImmutablePoint2D(x, y);
    }
}

public class ImmutablePoint2D {
    public final int x;
    public final int y;

    public ImmutablePoint2D(int x, int y) {
        this.x = x;
        this.y = y;
    }

    public MutablePoint2D getMutable() {
        return new MutablePoint2D(x, y);
    }
}

This can be a huge help to developers who are often put off by the tedium of writing a class hierarchy that can accurately reflect mutability and immutability.

Int

The int type represents integers in vapor. The standard integer operators +, -, *, /, and % are all defined with the standard meaning and operator precedence. It's worth noting that integers in vapor are represented as object with identity the same way that user-defined types are i.e. they can be passed mutably or immutably.

Bool

The bool type represents logical true or false values. The standard boolean operators !, &&, and || are defined with the standard meaning and operator precedence. Both && and || are defined as short-circuiting operators. Like ints, bools are represented as objects and can be passed either mutably or immutably.

Nullable

The nullable type is a generic type that represents an optional value. Nullables have a special syntax: for any type T, a nullable T is denoted *T. The legal value for a nullable type is any legal value of the underlying type and the special value null. The null value for a type can be obtained using the built-in function null<T> and any nullable can be checked for the null value using the isNull<T> function. Regular T values can be boxed to a nullable value by assigning into a nullable variable, but nullables can be unboxed using the * operator. The behavior of dereferencing a null value is termination of the program. Nullable values have identity and can be passed either mutably or immutably. For example, it is idiomatic to write the following code:

fn collesce<T>(opt: &*T, val: T) {
    if (isNull<T>(opt)) {
        opt = val;
    }
}

This function takes a mutable pointer and, if it is null, assigns it the value passed in as the second parameter. Since even null has identity, the caller of this function will observe the parameter opt to be non-null after calling this function like so:

var i = null<int>(); // i is set to null
coalesce(&i, 5); // i is now set to 5

Structs

Users of vapor can define their own data types by using the struct keyword. A struct consists of at least one named field of a previously defined type. The fields of a struct can be accessed by the . operator and have the same mutability as the containing object. It is not possible to have a mutable field in an immutable struct or vice-versa, but the value of fields can be copied into mutable or immutable variables and fields of other objects. For instance, one might define a type for rational numbers like so:

struct Ratio {
    num: int,
    den: int
}

Functions

Functions can be defined in vapor by using the fn keyword. Functions can take zero or more arguments of any built-in or user-defined types and may optionally return a value of any type. Parameters to functions are passed by value unless the type of the parameter is prefixed with the & sigil. For example, a immutably add method could be specified as follows:

fn add(lhs: int, rhs: int): int {
    return lhs + rhs;
}

In contrast, an increment function that changes one of its parameters can be written:

fn increment(lhs: &int, rhs: int): int {
    lhs = lhs + rhs;
    return lhs;
}

Parameter Mutability

Parameter mutability in vapor is purely a contract between callers and callees, once inside a function declaration, parameters can be used just like any other local variables including mutating them and passing them mutably to other functions. This means that parameters must be defensively copied when passed immutably into a function if they are mutated during the course of the function's execution.

Parameter mutability is vapor's most important feature because it makes the use of destructive updates very clear to the programmer. This is the exact opposite approach of languages like Java that are mutable by default. For instance, the following Java code leaks mutable state:

import java.util.Calendar;

public class Clock {
    private Calendar time;
    private int elapsed;

    public Clock(Calendar time) {
        this.time = time;
        this.elapsed = 0;
    }

    public void tick() {
        time.add(Calendar.SECOND, 1);
        elapsed++;
    }

    public Calendar getTime() {
        return time;
    }
}

While small, well defined classes like these are generally easy to understand, there are some technical issues with this class. There are two opportunities for external code to modify the internal state of the Clock without involving any of the clock's code. When the Calendar is passed to the constructor, and when the Calendar is returned from the getter, it is possible for another object to modify the state of the Calendar without incrementing the elapsed counter, destroying the consistency of the Clock object. To preserve the integrity of the Clock class it must be written in a much more defensive manner:

import java.util.Calendar;

public class Clock {
    private Calendar time;
    private int elapsed;

    public Clock(Calendar time) {
        this.time = (Calendar)time.clone();
        this.elapsed = 0;
    }

    public void tick() {
        time.add(Calendar.SECOND, 1);
        elapsed++;
    }

    public Calendar getTime() {
        return (Calendar)time.clone();
    }
}

This design is fraught with its own problems, it is tedious to maintain and very error-prone. Vapor allows for a much simpler way of dealing with problems like this using mutable parameters.

struct Clock {
    time: Calendar,
    elapsed: int
}

fn clock(time: Calendar) {
    return new Clock { time: time,
                       elapsed: 0 };
}

fn tick(self: &Clock) {
    addSecond(self.time, 1);
    self.elapsed = self.elapsed + 1;
}

fn getTime(self: Clock): Calendar {
    return self.time;
}

Because of vapor's mutability guarantees, the tick function is allowed to mutate the internal state of the clock, but neither the clock function nor the getTime function will be able to. A second advantage of this approach is the possibility for static analysis to elide the object copy if the caller never tries to mutate the return value of getTime.

Generics

Out of a desire not to have to handle the nullable type in a special method, I added generic function calls to vapor. Generic methods are invoked by calling specifying type parameters surrounded in < and > before the standard arguments. This allowed me to specify the null<T>() and isNull<T>(arg0: T) functions as special built-ins without creating specialized syntax. These functions are used like so:

var x = null<int>(); // type of x is *int
var isNullNull = isNull<int>(x); // returns true
x = 5; // this assignment succeeds by "boxing" 5
var isFiveNull = isNull<int>(x); // returns false

Implementation

My compiler is written using the ANTLR parser-generator and generates Java source code that is then compiled to Java bytecode by using the Java compiler API. Originally, I had designed a parser by hand, but issues around the infinite lookahead necessary for parsing generic function calls led me to choose a tool that was better suited for the job.

Java Parser

My first attempt to create a parser consisted of writing a parser by hand in Java. While I was able to make decent progress and parse the majority of the language after only a short time working on the parser, adding support for generic functions became a roadblock because of the ambiguity in token sequences like A < B .... This sequence can either be a comparison of two variables, A and B, or it is the beginning of an invocation of A with the first type parameter B. I was able to manage the ambiguity by creating a lexer that could mark a spot in the token stream and then rewind back to that point and replay the input. Unfortunately, this made my parser as a whole more brittle, and at this point it became apparent to me that it would be prudent to switch to using a parsing tool than to solve problems like these on my own.

ANTLR Parser

My second parser was created using the ANTLR tool to generate Java source code based on an EBNF-like grammar file. Using this code generation tool turned out to be much more robust and allowed me to make up the progress I had lost with my own parser.

Some of the problems that I encountered with ANTLR were issues with some of the implementation details of the tool. It does not automatically resolve issues like the infinite lookahead that I encountered without some sort of user intervention by enabling backtracking. This was easy enough to grasp and was much less painful than my previous solution, but there are some rough edges to the tool, and I found myself getting exceptions from deep inside the tool that I couldn't understand and that I could not find an explanation of on the internet. Ultimately, I was able to create an equivalent grammar using other constructs from the ANTLR tool that worked, though I am still not clear what the problem was.

Code Generation

After parsing and static analysis have been done using the ANTLR tool, my compiler generates Java source code for each struct and function that was declared in the input file. Because Java is such a high-level language, all of the operators and even generic type instantiation were translated into the equivalent syntax in Java. The built-in types and functions for vapor were also hand-written in Java and included in the generated source file.

After creating the Java source code, my compiler feeds the code into the Java compiler API and executes the main method. Because the runtime classes are already on the classpath, they are linked to the new code by the compiler automatically.

Optimizations

To increase the performance of applications written in vapor, I applied a few optimizations that allow for copy elision. Each optimization decreased the number of object allocations that the program executed without breaking any of the guarantees of mutability that vapor offers. Using the optimizations below, I was able to reduce the number of objects my code generated from over 40 to 13 on the binary tree benchmark in appendix A.

Caching

One area that was immediately apparent where I could save some allocations was by reducing the number of immutable objects that exist. Because immutable objects cannot be changed by any part of the program, they are perfect candidates for caching. If a program decides that it needs to mutate one of these cached objects, a mutable copy is made per the language specification. I was also able to enhance the copy logic of objects so that if I knew there was an existing immutable object with the same value as the given object, I would return the cached object rather than instantiating a new one.

While all of these caching techniques deal with the built-in types, that is the area that stands to gain the most from improvement since all other types will be built from a combination of the built-ins.

Bool Caching

Because the bool type has only two values, there should only ever be two immutable bool objects in any program. I was able to implement this by overriding the copy behavior to return one of my singleton instances whenever possible. In any program where booleans are not actively mutated, this can save a large number of object allocations.

Int Caching

Any int literal in a program is likely to be used again so I used a non-boxing Map from native Java ints to immutable vapor int instances. I also cached the immutable values between -128 and 127 because these values are likely to occur in many different calculations.

Nullable Caching

There should only ever be one null object for each type, so I added code generation of a singleton null instance for every type. For types like a binary tree where many nodes are null, this can amount to a huge savings and for types like int that are unlikely to be used as part of a nullable, it has only a small cost.

Nullable Flattening

While the native int and bool types have their own class and instances, I quickly realized that creating instances for everything that is nullable can be a huge burden on the runtime. By adding a single boolean flag to every object, the language pays a small cost in the size of objects while getting a large benefit in the number of objects created. There was also a slight increase in complexity of the generated code because actions like copy have to take into account the nullability of both objects. This flexibility did not introduce any bugs, however in the type checking mechanism because type checking is done entirely before Java code generation.

Parameter Copy Ellison

Because functions are allowed to mutate their parameters and the semantics of the language dictate that only changes to the mutable parameters will be visible to the caller, a defensive copy must be made when an immutable parameter is passed. By making the defensive copy the responsibility of the callee, a function can simply determine that it does not mutate any objects and elide the copy entirely. This enabled better performance for the built-in functions and any function that uses no mutation.

Struct Copy Ellison

One of the semantics of structs in vapor is that there is no aliasing of fields, each field belongs to exactly one object. This means that when a field on a struct is set, it has to make a defensive copy. I was able to save some allocations by allowing immutable fields to be shared between multiple objects without copying. When an object needs to mutate one of its fields, it will simply create a mutable copy first.

Constructor Ellison

One area where many objects were being created were places where new objects were being constructed and then assigned to a field of a mutable object. This was creating an unnecessary object that was then never used again as the older object would simply copy all of its fields. By allowing assignments from constructors to fields to skip the intermediate step and just perform the copy, a number of object allocations were saved.

Appendix

Appendix A: Binary Tree

struct Tree {
    value: int,
    left: *Tree,
    right: *Tree
}

fn insert(self: &*Tree, value: int) {
    if (isNull<Tree>(self)) {
        self = new Tree { value: value,
                          left: null<Tree>(),
                          right: null<Tree>() };
    } else {
        if (value < (*self).value) {
            insert(&(*self).left, value);
        } else {
            insert(&(*self).right, value);
        }
    }
}

fn printTree(self: *Tree) {
    if (!isNull<Tree>(self)) {
        printTree((*self).left);
        print((*self).value);
        printTree((*self).right);
    }
}

fn main() {
    var tree = null<Tree>();
    insert(&tree, 5);
    insert(&tree, 7);
    insert(&tree, 6);
    insert(&tree, 3);

    printTree(tree);
}

When this program is run after being compiled, it generates 13 objects total, three null objects (for int, bool, and Tree), two bool objects (true and false), 4 Tree objects, and 4 int objects. It prints the following to standard out:

3
5
6
7