The Implementation of Unlambda
A couple of weeks ago, I was perusing my list of things I’d like to accomplish, and “write a non-trivial programming language interpreter” had apparently bubbled to the top while I wasn’t looking. At around the same time, I happened across Unlambda, which is a functional Turing tarpit (as opposed to an imperative one: if it was functional in the useful sense, it wouldn’t be called a tarpit), and a project was born.
The Language
Unlambda, so named because it’s basically untyped lambda calculus without the
lambda construct, consists entirely of twelve functions (s, k, i, v,
d, c, e, .x, @, ?x, and |), and a single operator (`).
There are no variables, user-defined functions, or indeed anything other
than than that handful of predefined functions. It looks like this:
```s``s``sii`ki
`k.*``s``s`ks
``s`k`s`ks``s``s`ks``s`k`s`kr``s`k`sikk
`k``s`ksk
If you can’t tell by reading it—and I’d be very impressed if you could—that
little program prints the Fibonacci sequence as asterisks (*, *, **,
***, *****, …) to standard output.
Function Application
Unlambda’s only operator, ` (that’s ASCII 96, the backtick/grave accent),
denotes function application: the expression `FG means F applied to G.
In C, that might be expressed as F(G). Because every function in Unlambda
takes exactly one argument, this notation is unambiguous, and parentheses are
unnecessary. For example, ``FGH means `FG applied to H, and
`F`GH means F applied to `GH.
Expressions in Unlambda are evaluated left to right (except in the presence of
the d function, which we’ll see later). In the expression `FG, F is
evaluated first, then G, then the application of F to G.
As noted above, every function in Unlambda is unary. Some functions, however,
require multiple arguments. We can resolve this by currying those functions:
transforming them into series of single-argument functions that simply wait for
more arguments. Consider, for example, a function F that takes two arguments.
First, we apply F to G, yielding `FG, which does nothing. Applying
that expression to a second argument, H, yields ``FGH, which applies the
original function F to arguments G and H.
The Functions
Unlambda has three ‘simple’ functions: three from SKI combinator calculus (guess which!), and one more that’s simple enough to be lumped in with them:
s- The S combinator. Takes three arguments and applies the first to the
third, the result of which is then applied to the result of applying the
second argument to the third. More concicely,
```sFGH→``FH`GH. k- The K combinator. Takes two arguments, ignores the second, and returns the
first.
``kFG→F. i- The I combinator. Takes one argument, which is returned unaltered.
`iF→F. v- Void. Takes one argument and returns
v.`vF→v.
There are two functions for input and three for output. In all cases, ‘x’ is a placeholder for an arbitrary character, and not meant to represent the literal character ‘x’:
.x- Print. Equivalent to
i, except that when applied,.xprints the character ‘x’ to standard output. Any single character can be substituted for ‘x’ here:.zprints ‘z’ instead of ‘x’. r- Print newline. Shorthand for
.<newline>, where<newline>is the literal newline character (ASCII 10). @- Read.
`@Freads a single character from standard input, recording that character as the current character, and evaluates to`Fiif the read was successful (and notEOF), otherwise`Fv. ?x- Query current character. If the current character is equal to ‘x’,
`?xFevaluates to`Fi. If not,`Fv. |- Print current character. If there is a current character (i.e., the most
recent application of
@was successful),`|Fevaluates to`F.x, where ‘x’ is the current character. If not,`Fv.
Finally, there are three functions that disregard Unlambda’s usual order of evaluation:
d-
Delay. The expression
`dFevaluates to a promise to evaluate the expressionFat some point in the future. When that promise is applied, it first evaluatesF, then appliesFto its argument.For example, consider the expression
``d`.xii. Here,dis applied to the expression`.xi, (which is not evaluated, and hence nothing is printed at this point), yielding a promise, which I’ll call<p>. This leaves us with`<p>i. Because<p>is being applied, the expression it contains is now substituted for<p>, leaving us with``.xii. We apply.xtoi, which prints ‘x’ and returnsi, and we’re left with`ii, which evaluates toi. c-
Call with current continuation. The expression
`cFappliesFto the current continuation. IfFevaluates without applying its argument, then`cFevaluates to that value. If, however,F’s argument is applied to something,`cFevaluates to that value, even if a different result has already been obtained.Consider, for example, the expression
``ci.x:`ciappliesito the current continuation, which I’ll call<cont>, yielding`i<cont>. This reduces, simply, to<cont>, and we’re left with`<cont>.x. Now, because we’re applying a continuation to something, we travel back in time to the point when that continuation was created, in this case the application ofctoi, and force that expression to return the value given to the continuation instead of whatever it returned before. So, we have`ciagain, but because we got there by applying<cont>to.x, it returns.xand we’re left with`.x.x. That, in turn, prints ‘x’ and evaluates to.x. e-
Exit, or the top-level continuation. When applied,
eterminates the program, forcing the whole thing to evaluate toe’s argument.
Learning to write programs in Unlambda—if you’re as unfamiliar with SKI combinator calculus as I was when I started writing this, anyway—is a frustrating, mind-bending experience, and I highly recommend it. I won’t, however, explain it here any further, as this post is intended to be about Unlambda’s implementation, not about the language itself.
If you’d like to learn more about the language itself, the Unlambda homepage is a good place to start, as are the entries on the Esolang wiki and Wikipedia.
Implementing Unlambda
The implementation of Unlambda in a language as poorly-endowed as C is somewhat tricky: C does not have closures, continuations, tail calls, or garbage collection, and Unlambda does. What makes Unlambda so attractive to implement, in fact, is that it contains nothing else: it has most of the things that make implementing ‘real’ functional programming languages interesting, and none of the easy, tedious things that make those languages useful.
I’ll be discussing the details of my implementation shortly. If you’d like to follow along with the actual source code rather than the truncated examples presented herein, you can acquire it here.
Garbage Collection
Because Unlambda does not have mutable values or any way to reference those values before or during their initialization, it is impossible to create cycles, and we can get away with reference counting: for each dynamically-allocated object, we simply maintain a count of the number of places in which it is referenced, and deallocate it when that count becomes zero.
The memory management code looks like this:
typedef struct Obj {
size_t refcount;
enum ObjectType {O_ATOM, O_EXPR, O_CONT} type;
} Obj;
struct Atom {Obj obj, ...};
struct Expr {Obj obj, ...};
struct Cont {Obj obj, ...};
void incref(Obj *);
void decref(Obj *);
#define INCREF(x) (incref((Obj *)(x)), (x))
#define DECREF(x) (decref((Obj *)(x)))
Working with reference counting is best thought of as a kind of ownership: A
function owns and is responsible for its arguments, and when a function returns
a dynamically-allocated value, the caller owns that value. When a value is
discarded, DECREF should be called to destroy that particular reference; when
ownership of a value is shared with another function, INCREF should be called
so that there are enough references to go around.
Data Structures
Unlamda consists of only two types: functions (or atoms, to borrow Lisp’s terminology) and unevaluated expressions. Atoms are fairly easy to represent:
typedef struct Atom {
struct Obj obj;
enum AtomType {
A_S, A_S1, A_S2, A_K, A_K1, A_I, A_V, A_D, A_D1,
A_C, A_C1, A_E, A_DOT, A_AT, A_QUERY, A_PIPE
} type;
union {
char ch;
struct {
struct Atom *a1, *a2;
} p;
struct Expr *expr;
struct Cont *cont;
} u;
} Atom;
In addition to the obvious members of AtomType, A_S1, A_S2, A_K1, and
A_D1 stand for the partially-applied s, k, and d functions, and A_C1
is an unapplied continuation.
Expressions, being either atoms or applications of other expressions, are even easier:
typedef struct Expr {
struct Obj obj;
enum ExprType {E_ATOM, E_APPLY} type;
union {
struct Atom *atom;
struct {
struct Expr *e1, *e2;
} a;
} u;
} Expr;
There is also a third type, Cont, which describes continuations and is
similarly straightforward. We’ll see that one in the section on tail calls and
continuation-passing style.
Object Initialization
C lacks, among other things, optional and keyword arguments to functions. In C99, however, it is possible to borrow an idiom from JavaScript and pass a struct literal as the sole argument to a function instead of a larger number of primitives:
foo((struct bar){.a=1, .b=2});
Defining and keeping track of the the various structs necessary to do this
would probably lead to an unmaintainable mess, but it works well for writing
constructors for complicated data structures, as the types are already defined.
For example, here’s the constructor for Atom:
Atom *new_atom(Atom a) {
Atom *x = xmalloc(sizeof (Atom));
*x = a, x->obj = (Obj){1,O_ATOM};
return x;
}
#define ATOM(...) \
new_atom((Atom){{0,O_ATOM},__VA_ARGS__})
There are similiar constructors for Expr and Cont. This allows us to
allocate and initialize complex expressions in a simple and straighforward
manner:
// ``k.x.y
EXPR(E_APPLY, .u.apply={
EXPR(E_APPLY, .u.apply={
EXPR(E_ATOM, .u.atom=ATOM(A_K)),
EXPR(E_ATOM, .u.atom=ATOM(A_DOT, .u.ch='x'))
}),
EXPR(E_ATOM, .u.atom=ATOM(A_DOT, .u.ch='y'))
})
Parsing
Because Unlambda has only one operator and all but two of its literals consist of a single character, parsing Unlambda is trivial. The only feature of my parser that’s really even worth a mention is the fact that the atoms and expressions for those single-character literals are statically allocated, which saves time on allocation and deallocation, and keeps the memory footprint small by generating fewer duplicates:
static Atom atom_s = {{2, O_ATOM}, A_S};
// ...
static Expr expr_s = {{1, O_EXPR}, E_ATOM,
.u.atom=&atom_s};
//...
Expr *parse(FILE *f);
Evaluation
Once an expression has been parsed, it’s time to evaluate it. We’ll start with a naïve evaluator, written without regard for performance considerations or system limitations:
Atom *eval(Expr *e) {
Atom *a;
switch (e->type) {
case E_ATOM:
a = INCREF(e->u.atom);
DECREF(e);
return a;
case E_APPLY:
a = apply(eval(INCREF(e->u.a.e1)),
eval(INCREF(e->u.a.e2)));
DECREF(e);
return a;
}
}
Atom *apply(Atom *a1, Atom *a2) {
Atom *a;
switch (a1->type) {
case A_S:
a = ATOM(A_S1, .u.p={a2});
DECREF(a1);
return a;
case A_S1:
a = ATOM(A_S2, .u.p={INCREF(a1->u.p.a1), a2});
DECREF(a1);
return a;
case A_S2:
a = apply(apply(INCREF(a1->u.p.a1), a2),
apply(INCREF(a1->u.p.a2), a2));
DECREF(a1);
return a;
// ...
}
}
That implementation would function just fine for the trivial sort of program that you can write out without having to think too hard about it. This implementation has a few problems. It would break down on more complicated programs, however.
Most obviously, c, d, and e are impossible to implement this way, and so
any program that makes use of them won’t function properly, if at all. It
isn’t very hard to restructure the above to accomadate d, and e can
probably be made to work as well (i.e., without resorting to exit),
but c is actually impossible.
Next, consider the fact that any nontrivial Unlambda program is going to use a
lot of recursion, and C maintains a call stack of limited size. A
deeply-recursive expression like ```sii``sii is going to exhaust the stack
in short order.
Promises
Adding support for promises to the above is fairly straightforward. We can
simply add another version of apply that expects its second argument to be
unevaluated:
Atom *eval(Expr *e) {
Atom *a;
switch (e->type) {
// ...
case E_APPLY:
a = apply1(eval(INCREF(e->u.a.e1)),
INCREF(e->u.a.e2));
DECREF(e);
return a;
// ...
}
}
Atom *apply1(Atom *a, Expr *e) {
if (a->type == A_D) {
DECREF(a);
return ATOM(A_D1, .u.expr=INCREF(e));
}
return apply2(a, eval(e));
}
Atom *apply2(Atom *a1, Atom *a2) {
Atom *a;
switch (a1->type) {
// ...
case A_D1:
a = apply2(eval(INCREF(a1->u.expr)), a2);
DECREF(a1);
return a;
// ...
}
}
Tail Calls and Continuation-Passing Style
When a function is called in C, an activation record is pushed onto the call stack to keep track of the return address, local variables, and whatever else the specific implementation finds useful. When a function returns, its activation record is popped from the stack, leaving the caller’s activation record at the top. This is a straightforward and easy way to work with function calls at a low level, but it can cause problems when recursion gets introduced: the call stack is generally a fixed size and not terribly large, and when it fills up, things break.
We can avoid allocating a new stack frame, however, when a function call is in tail position. That is, the value returned by the called function is returned again by the caller, with no further processing. In this case, we can perform what is known as tail call optimization and replace the current stack frame with that of the called function instead of pushing a new one onto the stack. Some C compilers perform this optimization, but it isn’t part of the ANSI C standard, nor is it particularly relable where it is available.
The usual approach to dealing with the lack of proper tail call optimizaiton is to transform recursive computations into iterative ones by using passing all function calls through another function, called a trampoline. Instead of calling a function in the normal manner, you’d return a thunk (a delayed computation, or in our case, a struct from which the call can reconstructed), which ‘bounces’ from the trampoline as the intended call.
Here’s what that might look like:
typedef struct Thunk {
struct Obj obj;
enum ThunkType {T_F, T_E, T_A1, T_A2} type;
union {
struct { struct Atom *a; } f;
struct { struct Expr *e; } e;
struct { struct Atom *a; struct Expr *e; } a1;
struct { struct Atom *a1, *a2; } a2;
} u;
} Thunk;
Atom *eval(Expr *e) {
Thunk *t = THUNK(T_E, .u.e={e}), t2;
Atom *a;
for (;;) {
switch (t->type) {
case T_F: goto done;
case T_E:
t2 = expr(INCREF(t->u.e.e));
break;
case T_A1:
t2 = apply1(INCREF(t->u.a1.a),
INCREF(t->u.a1.e));
break;
case T_A2:
t2 = apply2(INCREF(t->u.a2.a1),
INCREF(t->u.a2.a2));
break;
}
DECREF(t2);
t = t2;
}
done:
a = INCREF(t->u.f.a);
DECREF(t);
return a;
}
Thunk *expr(Expr *e) { /* ... */ }
Thunk *apply1(Atom *a, Expr *e) { /* ... */ }
Thunk *apply2(Atom *a1, Atom *a2) { /* ... */ }
Here, eval is our trampoline. It calls functions in a loop, returning when
it comes across a thunk marked final (T_F). The function expr has taken
the place of what we were calling eval earlier. This eliminates the danger
of stack overflows, and it also brings us one step closer to being able to
implement c and e.
By adding a field to our thunk indicating what to do next, we can encode the entire execution of our program. At this point, we can start calling it a continuation, because that’s what it is:
typedef struct Cont {
struct Obj obj;
enum ContType {C_F, C_E, C_A1, C_A2, C_A3} type;
union {
struct { struct Atom *a; } f;
struct { struct Expr *e; } e;
struct { struct Atom *a; struct Expr *e; } a1;
struct { struct Atom *a1, *a2; } a2;
struct { struct Atom *a1, *a2, *a3; } a3;
} u;
struct Cont *c;
} Cont;
Note the addition of C_A3 and its arguments: continuations make the
evaluation of s tricky, and the easiest way to deal with it was to add
another helper function.
Now, we need to add an additional paramater to all of our Thunk-returning
functions: the current continuation. Just like in Unlambda, it’s a function
that represents the execution of the remainder of the evaluation. One can’t call
a struct in C, though, so we’ll introduce a utility function to mimic the
behavior we want:
Cont *call(Atom *, Cont *);
This makes a shallow copy of the given Cont with the supplied Atom substituted
for the first NULL argument. For example, the following two expressions should
produce the same result:
call(a, CONT(C_A2, .u.a2={x, NULL}))
CONT(C_F, .u.a2={x, a})
Passing continuations around in this manner is called, predictibly, continuation-passing style. Now, let’s extend our trampoline to deal with continuations instead of thunks:
Atom *eval(Expr *e) {
Cont *c = CONT(C_E, .u.e={e}, CONT(C_F));
Cont *cc; Atom *a;
for (;;) {
switch (c->type) {
case C_F:
goto done;
case C_E:
cc = expr(INCREF(c->u.e.e), INCREF(c->c));
break;
case C_A1:
cc = apply1(INCREF(c->u.a1.a),
INCREF(c->u.a1.e), INCREF(c->c));
break;
case C_A2:
cc = apply2(INCREF(c->u.a2.a1),
INCREF(c->u.a2.a2), INCREF(c->c));
break;
case C_A3:
cc = apply3(INCREF(c->u.a3.a1),
INCREF(c->u.a3.a2),
INCREF(c->u.a3.a3), INCREF(c->c));
break;
}
DECREF(c); c = cc;
}
done:
a = INCREF(c->u.f.a);
DECREF(c);
return a;
}
And here’s expr, in it’s continuation-passing glory:
Cont *expr(Expr *e, Cont *c) {
switch (e->type) {
case E_ATOM:
c = call(e->u.atom, c);
DECREF(e);
return c;
case E_APPLY:
c = CONT(C_E, .u.e={e->u.a.e1},
CONT(C_A1, .u.a1={NULL, e->u.a.e2},
c));
DECREF(e);
return c;
}
}
And apply3, which processes the second and third applications of the S
combinator, along with the portion of apply2 that made it necessary:
Cont *apply3(Atom *a1, Atom *a2, Atom *a3, Cont *c) {
return CONT(C_A2, .u.a2={a2, a3},
CONT(C_A2, .u.a2={a1, NULL}, c));
}
Cont *apply2(Atom *a1, Atom *a2, Cont *c) {
switch (a1->type) {
// ...
case A_S2:
c = CONT(C_A2, .u.a2={INCREF(a1->u.p.a1), a2},
CONT(C_A3, .u.a3={NULL,
INCREF(a1->u.p.a2), INCREF(a2)}, c));
DECREF(a1);
return c;
// ...
}
Now, we’re finally ready to implement c and e. With continuation-passing
style, the hard work is already done, making the act of actually adding
support for them trivial:
Cont *apply2(Atom *a1, Atom *a2, Cont *c) {
switch (a1->type) {
// ...
case A_C:
c = CONT(C_A2, .u.a2={a2,
ATOM(A_C1, .u.cont=INCREF(c))}, c);
DECREF(a1);
return c;
case A_C1:
c = call(a2, INCREF(a1->u.cont));
DECREF(a1);
return c;
case A_E:
c = CONT(C_F, .u.f={a2}, c);
DECREF(a1);
return c;
// ...
}
}
The End
That’s it! If you didn’t grab when I mentioned it above, my code is available here. It’s under the MIT license, so go ahead and play around with it.