The Implementation of Unlambda

by Noah Medling — 01 January 2010

A couple of weeks ago, I was perusing my list of things I’d like to accomplish, and “write a non-trivial programming language interpreter” had apparently bubbled to the top while I wasn’t looking. At around the same time, I happened across Unlambda, which is a functional Turing tarpit (as opposed to an imperative one: if it was functional in the useful sense, it wouldn’t be called a tarpit), and a project was born.

The Language

Unlambda, so named because it’s basically untyped lambda calculus without the lambda construct, consists entirely of twelve functions (s, k, i, v, d, c, e, .x, @, ?x, and |), and a single operator (`). There are no variables, user-defined functions, or indeed anything other than than that handful of predefined functions. It looks like this:

```s``s``sii`ki
  `k.*``s``s`ks
 ``s`k`s`ks``s``s`ks``s`k`s`kr``s`k`sikk
  `k``s`ksk

If you can’t tell by reading it—and I’d be very impressed if you could—that little program prints the Fibonacci sequence as asterisks (*, *, **, ***, *****, …) to standard output.

Function Application

Unlambda’s only operator, ` (that’s ASCII 96, the backtick/grave accent), denotes function application: the expression `FG means F applied to G. In C, that might be expressed as F(G). Because every function in Unlambda takes exactly one argument, this notation is unambiguous, and parentheses are unnecessary. For example, ``FGH means `FG applied to H, and `F`GH means F applied to `GH.

Expressions in Unlambda are evaluated left to right (except in the presence of the d function, which we’ll see later). In the expression `FG, F is evaluated first, then G, then the application of F to G.

As noted above, every function in Unlambda is unary. Some functions, however, require multiple arguments. We can resolve this by currying those functions: transforming them into series of single-argument functions that simply wait for more arguments. Consider, for example, a function F that takes two arguments. First, we apply F to G, yielding `FG, which does nothing. Applying that expression to a second argument, H, yields ``FGH, which applies the original function F to arguments G and H.

The Functions

Unlambda has three ‘simple’ functions: three from SKI combinator calculus (guess which!), and one more that’s simple enough to be lumped in with them:

s
The S combinator. Takes three arguments and applies the first to the third, the result of which is then applied to the result of applying the second argument to the third. More concicely, ```sFGH``FH`GH.
k
The K combinator. Takes two arguments, ignores the second, and returns the first. ``kFGF.
i
The I combinator. Takes one argument, which is returned unaltered. `iFF.
v
Void. Takes one argument and returns v. `vFv.

There are two functions for input and three for output. In all cases, ‘x’ is a placeholder for an arbitrary character, and not meant to represent the literal character ‘x’:

.x
Print. Equivalent to i, except that when applied, .x prints the character ‘x’ to standard output. Any single character can be substituted for ‘x’ here: .z prints ‘z’ instead of ‘x’.
r
Print newline. Shorthand for .<newline>, where <newline> is the literal newline character (ASCII 10).
@
Read. `@F reads a single character from standard input, recording that character as the current character, and evaluates to `Fi if the read was successful (and not EOF), otherwise `Fv.
?x
Query current character. If the current character is equal to ‘x’, `?xF evaluates to `Fi. If not, `Fv.
|
Print current character. If there is a current character (i.e., the most recent application of @ was successful), `|F evaluates to `F.x, where ‘x’ is the current character. If not, `Fv.

Finally, there are three functions that disregard Unlambda’s usual order of evaluation:

d

Delay. The expression `dF evaluates to a promise to evaluate the expression F at some point in the future. When that promise is applied, it first evaluates F, then applies F to its argument.

For example, consider the expression ``d`.xii. Here, d is applied to the expression `.xi, (which is not evaluated, and hence nothing is printed at this point), yielding a promise, which I’ll call <p>. This leaves us with `<p>i. Because <p> is being applied, the expression it contains is now substituted for <p>, leaving us with ``.xii. We apply .x to i, which prints ‘x’ and returns i, and we’re left with `ii, which evaluates to i.

c

Call with current continuation. The expression `cF applies F to the current continuation. If F evaluates without applying its argument, then `cF evaluates to that value. If, however, F’s argument is applied to something, `cF evaluates to that value, even if a different result has already been obtained.

Consider, for example, the expression ``ci.x: `ci applies i to the current continuation, which I’ll call <cont>, yielding `i<cont>. This reduces, simply, to <cont>, and we’re left with `<cont>.x. Now, because we’re applying a continuation to something, we travel back in time to the point when that continuation was created, in this case the application of c to i, and force that expression to return the value given to the continuation instead of whatever it returned before. So, we have `ci again, but because we got there by applying <cont> to .x, it returns .x and we’re left with `.x.x. That, in turn, prints ‘x’ and evaluates to .x.

e

Exit, or the top-level continuation. When applied, e terminates the program, forcing the whole thing to evaluate to e’s argument.

Learning to write programs in Unlambda—if you’re as unfamiliar with SKI combinator calculus as I was when I started writing this, anyway—is a frustrating, mind-bending experience, and I highly recommend it. I won’t, however, explain it here any further, as this post is intended to be about Unlambda’s implementation, not about the language itself.

If you’d like to learn more about the language itself, the Unlambda homepage is a good place to start, as are the entries on the Esolang wiki and Wikipedia.

Implementing Unlambda

The implementation of Unlambda in a language as poorly-endowed as C is somewhat tricky: C does not have closures, continuations, tail calls, or garbage collection, and Unlambda does. What makes Unlambda so attractive to implement, in fact, is that it contains nothing else: it has most of the things that make implementing ‘real’ functional programming languages interesting, and none of the easy, tedious things that make those languages useful.

I’ll be discussing the details of my implementation shortly. If you’d like to follow along with the actual source code rather than the truncated examples presented herein, you can acquire it here.

Garbage Collection

Because Unlambda does not have mutable values or any way to reference those values before or during their initialization, it is impossible to create cycles, and we can get away with reference counting: for each dynamically-allocated object, we simply maintain a count of the number of places in which it is referenced, and deallocate it when that count becomes zero.

The memory management code looks like this:

typedef struct Obj {
  size_t refcount;
  enum ObjectType {O_ATOM, O_EXPR, O_CONT} type;
} Obj;

struct Atom {Obj obj, ...};
struct Expr {Obj obj, ...};
struct Cont {Obj obj, ...};

void incref(Obj *);
void decref(Obj *);

#define INCREF(x) (incref((Obj *)(x)), (x))
#define DECREF(x) (decref((Obj *)(x)))

Working with reference counting is best thought of as a kind of ownership: A function owns and is responsible for its arguments, and when a function returns a dynamically-allocated value, the caller owns that value. When a value is discarded, DECREF should be called to destroy that particular reference; when ownership of a value is shared with another function, INCREF should be called so that there are enough references to go around.

Data Structures

Unlamda consists of only two types: functions (or atoms, to borrow Lisp’s terminology) and unevaluated expressions. Atoms are fairly easy to represent:

typedef struct Atom {
  struct Obj obj;
  enum AtomType {
    A_S, A_S1, A_S2, A_K, A_K1, A_I, A_V, A_D, A_D1,
    A_C, A_C1, A_E, A_DOT, A_AT, A_QUERY, A_PIPE
  } type;
  union {
    char ch;
    struct {
      struct Atom *a1, *a2;
    } p;
    struct Expr *expr;
    struct Cont *cont;
  } u;
} Atom;

In addition to the obvious members of AtomType, A_S1, A_S2, A_K1, and A_D1 stand for the partially-applied s, k, and d functions, and A_C1 is an unapplied continuation.

Expressions, being either atoms or applications of other expressions, are even easier:

typedef struct Expr {
  struct Obj obj;
  enum ExprType {E_ATOM, E_APPLY} type;
  union {
    struct Atom *atom;
    struct {
      struct Expr *e1, *e2;
    } a;
  } u;
} Expr;

There is also a third type, Cont, which describes continuations and is similarly straightforward. We’ll see that one in the section on tail calls and continuation-passing style.

Object Initialization

C lacks, among other things, optional and keyword arguments to functions. In C99, however, it is possible to borrow an idiom from JavaScript and pass a struct literal as the sole argument to a function instead of a larger number of primitives:

foo((struct bar){.a=1, .b=2});

Defining and keeping track of the the various structs necessary to do this would probably lead to an unmaintainable mess, but it works well for writing constructors for complicated data structures, as the types are already defined. For example, here’s the constructor for Atom:

Atom *new_atom(Atom a) {
  Atom *x = xmalloc(sizeof (Atom));
  *x = a, x->obj = (Obj){1,O_ATOM};
  return x;
}

#define ATOM(...) \
  new_atom((Atom){{0,O_ATOM},__VA_ARGS__})

There are similiar constructors for Expr and Cont. This allows us to allocate and initialize complex expressions in a simple and straighforward manner:

// ``k.x.y
EXPR(E_APPLY, .u.apply={
  EXPR(E_APPLY, .u.apply={
    EXPR(E_ATOM, .u.atom=ATOM(A_K)),
    EXPR(E_ATOM, .u.atom=ATOM(A_DOT, .u.ch='x'))
  }),
  EXPR(E_ATOM, .u.atom=ATOM(A_DOT, .u.ch='y'))
})

Parsing

Because Unlambda has only one operator and all but two of its literals consist of a single character, parsing Unlambda is trivial. The only feature of my parser that’s really even worth a mention is the fact that the atoms and expressions for those single-character literals are statically allocated, which saves time on allocation and deallocation, and keeps the memory footprint small by generating fewer duplicates:

static Atom atom_s = {{2, O_ATOM}, A_S};
// ...

static Expr expr_s = {{1, O_EXPR}, E_ATOM,
                      .u.atom=&atom_s};
//...

Expr *parse(FILE *f);

Evaluation

Once an expression has been parsed, it’s time to evaluate it. We’ll start with a naïve evaluator, written without regard for performance considerations or system limitations:

Atom *eval(Expr *e) {
  Atom *a;
  switch (e->type) {
    case E_ATOM:
      a = INCREF(e->u.atom);
      DECREF(e);
      return a;
    case E_APPLY:
      a = apply(eval(INCREF(e->u.a.e1)),
                eval(INCREF(e->u.a.e2)));
      DECREF(e);
      return a;
  }
}

Atom *apply(Atom *a1, Atom *a2) {
  Atom *a;
  switch (a1->type) {
    case A_S:
      a = ATOM(A_S1, .u.p={a2});
      DECREF(a1);
      return a;
    case A_S1:
      a = ATOM(A_S2, .u.p={INCREF(a1->u.p.a1), a2});
      DECREF(a1);
      return a;
    case A_S2:
      a = apply(apply(INCREF(a1->u.p.a1), a2),
                apply(INCREF(a1->u.p.a2), a2));
      DECREF(a1);
      return a;
    // ...
  }
}

That implementation would function just fine for the trivial sort of program that you can write out without having to think too hard about it. This implementation has a few problems. It would break down on more complicated programs, however.

Most obviously, c, d, and e are impossible to implement this way, and so any program that makes use of them won’t function properly, if at all. It isn’t very hard to restructure the above to accomadate d, and e can probably be made to work as well (i.e., without resorting to exit), but c is actually impossible.

Next, consider the fact that any nontrivial Unlambda program is going to use a lot of recursion, and C maintains a call stack of limited size. A deeply-recursive expression like ```sii``sii is going to exhaust the stack in short order.

Promises

Adding support for promises to the above is fairly straightforward. We can simply add another version of apply that expects its second argument to be unevaluated:

Atom *eval(Expr *e) {
  Atom *a;
  switch (e->type) {
    // ...
    case E_APPLY:
      a = apply1(eval(INCREF(e->u.a.e1)),
                 INCREF(e->u.a.e2));
      DECREF(e);
      return a;
    // ...
  }
}

Atom *apply1(Atom *a, Expr *e) {
  if (a->type == A_D) {
    DECREF(a);
    return ATOM(A_D1, .u.expr=INCREF(e));
  }
  return apply2(a, eval(e));
}

Atom *apply2(Atom *a1, Atom *a2) {
  Atom *a;
  switch (a1->type) {
    // ...
    case A_D1:
      a = apply2(eval(INCREF(a1->u.expr)), a2);
      DECREF(a1);
      return a;
    // ...
  }
}

Tail Calls and Continuation-Passing Style

When a function is called in C, an activation record is pushed onto the call stack to keep track of the return address, local variables, and whatever else the specific implementation finds useful. When a function returns, its activation record is popped from the stack, leaving the caller’s activation record at the top. This is a straightforward and easy way to work with function calls at a low level, but it can cause problems when recursion gets introduced: the call stack is generally a fixed size and not terribly large, and when it fills up, things break.

We can avoid allocating a new stack frame, however, when a function call is in tail position. That is, the value returned by the called function is returned again by the caller, with no further processing. In this case, we can perform what is known as tail call optimization and replace the current stack frame with that of the called function instead of pushing a new one onto the stack. Some C compilers perform this optimization, but it isn’t part of the ANSI C standard, nor is it particularly relable where it is available.

The usual approach to dealing with the lack of proper tail call optimizaiton is to transform recursive computations into iterative ones by using passing all function calls through another function, called a trampoline. Instead of calling a function in the normal manner, you’d return a thunk (a delayed computation, or in our case, a struct from which the call can reconstructed), which ‘bounces’ from the trampoline as the intended call.

Here’s what that might look like:

typedef struct Thunk {
  struct Obj obj;
  enum ThunkType {T_F, T_E, T_A1, T_A2} type;
  union {
    struct { struct Atom *a; } f;
    struct { struct Expr *e; } e;
    struct { struct Atom *a; struct Expr *e; } a1;
    struct { struct Atom *a1, *a2; } a2;
  } u;
} Thunk;

Atom *eval(Expr *e) {
  Thunk *t = THUNK(T_E, .u.e={e}), t2;
  Atom *a;
  for (;;) {
    switch (t->type) {
      case T_F: goto done;
      case T_E:
        t2 = expr(INCREF(t->u.e.e));
        break;
      case T_A1:
        t2 = apply1(INCREF(t->u.a1.a),
                    INCREF(t->u.a1.e));
        break;
      case T_A2:
        t2 = apply2(INCREF(t->u.a2.a1),
                    INCREF(t->u.a2.a2));
        break;
    }
    DECREF(t2);
    t = t2;
  }
  done:
  a = INCREF(t->u.f.a);
  DECREF(t);
  return a;
}

Thunk *expr(Expr *e) { /* ... */ }
Thunk *apply1(Atom *a, Expr *e) { /* ... */ }
Thunk *apply2(Atom *a1, Atom *a2) { /* ... */ }

Here, eval is our trampoline. It calls functions in a loop, returning when it comes across a thunk marked final (T_F). The function expr has taken the place of what we were calling eval earlier. This eliminates the danger of stack overflows, and it also brings us one step closer to being able to implement c and e.

By adding a field to our thunk indicating what to do next, we can encode the entire execution of our program. At this point, we can start calling it a continuation, because that’s what it is:

typedef struct Cont {
  struct Obj obj;
  enum ContType {C_F, C_E, C_A1, C_A2, C_A3} type;
  union {
    struct { struct Atom *a; } f;
    struct { struct Expr *e; } e;
    struct { struct Atom *a; struct Expr *e; } a1;
    struct { struct Atom *a1, *a2; } a2;
    struct { struct Atom *a1, *a2, *a3; } a3;
  } u;
  struct Cont *c;
} Cont;

Note the addition of C_A3 and its arguments: continuations make the evaluation of s tricky, and the easiest way to deal with it was to add another helper function.

Now, we need to add an additional paramater to all of our Thunk-returning functions: the current continuation. Just like in Unlambda, it’s a function that represents the execution of the remainder of the evaluation. One can’t call a struct in C, though, so we’ll introduce a utility function to mimic the behavior we want:

Cont *call(Atom *, Cont *);

This makes a shallow copy of the given Cont with the supplied Atom substituted for the first NULL argument. For example, the following two expressions should produce the same result:

call(a, CONT(C_A2, .u.a2={x, NULL}))
CONT(C_F, .u.a2={x, a})

Passing continuations around in this manner is called, predictibly, continuation-passing style. Now, let’s extend our trampoline to deal with continuations instead of thunks:

Atom *eval(Expr *e) {
  Cont *c = CONT(C_E, .u.e={e}, CONT(C_F));
  Cont *cc; Atom *a;
  for (;;) {
    switch (c->type) {
      case C_F:
        goto done;
      case C_E:
        cc = expr(INCREF(c->u.e.e), INCREF(c->c));
        break;
      case C_A1:
        cc = apply1(INCREF(c->u.a1.a),
                    INCREF(c->u.a1.e), INCREF(c->c));
        break;
      case C_A2:
        cc = apply2(INCREF(c->u.a2.a1),
                    INCREF(c->u.a2.a2), INCREF(c->c));
        break;
      case C_A3:
        cc = apply3(INCREF(c->u.a3.a1),
                    INCREF(c->u.a3.a2),
                    INCREF(c->u.a3.a3), INCREF(c->c));
        break;
    }
    DECREF(c); c = cc;
  }
  done:
  a = INCREF(c->u.f.a);
  DECREF(c);
  return a;
}

And here’s expr, in it’s continuation-passing glory:

Cont *expr(Expr *e, Cont *c) {
  switch (e->type) {
    case E_ATOM:
      c = call(e->u.atom, c);
      DECREF(e);
      return c;
    case E_APPLY:
      c = CONT(C_E, .u.e={e->u.a.e1},
               CONT(C_A1, .u.a1={NULL, e->u.a.e2},
                    c));
      DECREF(e);
      return c;
   }
}

And apply3, which processes the second and third applications of the S combinator, along with the portion of apply2 that made it necessary:

Cont *apply3(Atom *a1, Atom *a2, Atom *a3, Cont *c) {
  return CONT(C_A2, .u.a2={a2, a3},
              CONT(C_A2, .u.a2={a1, NULL}, c));
}

Cont *apply2(Atom *a1, Atom *a2, Cont *c) {
  switch (a1->type) {
    // ...
    case A_S2:
      c = CONT(C_A2, .u.a2={INCREF(a1->u.p.a1), a2},
               CONT(C_A3, .u.a3={NULL,
                 INCREF(a1->u.p.a2), INCREF(a2)}, c));
      DECREF(a1);
      return c;
    // ...
}

Now, we’re finally ready to implement c and e. With continuation-passing style, the hard work is already done, making the act of actually adding support for them trivial:

Cont *apply2(Atom *a1, Atom *a2, Cont *c) {
  switch (a1->type) {
    // ...
    case A_C:
      c = CONT(C_A2, .u.a2={a2,
        ATOM(A_C1, .u.cont=INCREF(c))}, c);
      DECREF(a1);
      return c;
    case A_C1:
      c = call(a2, INCREF(a1->u.cont));
      DECREF(a1);
      return c;
    case A_E:
      c = CONT(C_F, .u.f={a2}, c);
      DECREF(a1);
      return c;
    // ...
  }
}

The End

That’s it! If you didn’t grab when I mentioned it above, my code is available here. It’s under the MIT license, so go ahead and play around with it.