The Implementation of Unlambda

A cou­ple of weeks ago, I was perus­ing my list of things I’d like to accom­plish, and “write a non-​trivial pro­gram­ming lan­guage inter­preter” had appar­ently bub­bled to the top while I wasn’t look­ing. At around the same time, I hap­pened across Unlambda, which is a func­tional Tur­ing tarpit (as opposed to an imper­a­tive one: if it was func­tional in the use­ful sense, it wouldn’t be called a tarpit), and a project was born.

The Lan­guage

Unlambda, so named because it’s basi­cally untyped lambda cal­cu­lus with­out the lambda con­struct, con­sists entirely of twelve func­tions (s, k, i, v, d, c, e, .x, @, ?x, and |), and a sin­gle oper­a­tor (`). There are no vari­ables, user-​defined func­tions, or indeed any­thing other than than that hand­ful of pre­de­fined func­tions. It looks like this:

```s``s``sii`ki
  `k.*``s``s`ks
 ``s`k`s`ks``s``s`ks``s`k`s`kr``s`k`sikk
  `k``s`ksk

If you can’t tell by read­ing it—and I’d be very impressed if you could—that lit­tle pro­gram prints the Fibonacci sequence as aster­isks (*, *, **, ***, *****, …) to stan­dard out­put.

Func­tion Appli­ca­tion

Unlambda’s only oper­a­tor, ` (that’s ASCII 96, the backtick/​grave accent), denotes func­tion appli­ca­tion: the expres­sion `FG means F applied to G. In C, that might be expressed as F(G). Because every func­tion in Unlambda takes exactly one argu­ment, this nota­tion is unam­bigu­ous, and paren­the­ses are unnec­es­sary. For exam­ple, ``FGH means `FG applied to H, and `F`GH means F applied to `GH.

Expres­sions in Unlambda are eval­u­ated left to right (except in the pres­ence of the d func­tion, which we’ll see later). In the expres­sion `FG, F is eval­u­ated first, then G, then the appli­ca­tion of F to G.

As noted above, every func­tion in Unlambda is unary. Some func­tions, how­ever, require mul­ti­ple argu­ments. We can resolve this by cur­ry­ing those func­tions: trans­form­ing them into series of single-​argument func­tions that sim­ply wait for more argu­ments. Con­sider, for exam­ple, a func­tion F that takes two argu­ments. First, we apply F to G, yield­ing `FG, which does noth­ing. Apply­ing that expres­sion to a sec­ond argu­ment, H, yields ``FGH, which applies the orig­i­nal func­tion F to argu­ments G and H.

The Func­tions

Unlambda has three ‘sim­ple’ func­tions: three from SKI com­bi­na­tor cal­cu­lus (guess which!), and one more that’s sim­ple enough to be lumped in with them:

s
The S com­bi­na­tor. Takes three argu­ments and applies the first to the third, the result of which is then applied to the result of apply­ing the sec­ond argu­ment to the third. More con­ci­cely, ```sFGH``FH`GH.
k
The K com­bi­na­tor. Takes two argu­ments, ignores the sec­ond, and returns the first. ``kFGF.
i
The I com­bi­na­tor. Takes one argu­ment, which is returned unal­tered. `iFF.
v
Void. Takes one argu­ment and returns v. `vFv.

There are two func­tions for input and three for out­put. In all cases, ‘x’ is a place­holder for an arbi­trary char­ac­ter, and not meant to rep­re­sent the lit­eral char­ac­ter ‘x’:

.x
Print. Equiv­a­lent to i, except that when applied, .x prints the char­ac­ter ‘x’ to stan­dard out­put. Any sin­gle char­ac­ter can be sub­sti­tuted for ‘x’ here: .z prints ‘z’ instead of ‘x’.
r
Print new­line. Short­hand for .<newline>, where <newline> is the lit­eral new­line char­ac­ter (ASCII 10).
@
Read. `@F reads a sin­gle char­ac­ter from stan­dard input, record­ing that char­ac­ter as the cur­rent char­ac­ter, and eval­u­ates to `Fi if the read was suc­cess­ful (and not EOF), oth­er­wise `Fv.
?x
Query cur­rent char­ac­ter. If the cur­rent char­ac­ter is equal to ‘x’, `?xF eval­u­ates to `Fi. If not, `Fv.
|
Print cur­rent char­ac­ter. If there is a cur­rent char­ac­ter (i.e., the most recent appli­ca­tion of @ was suc­cess­ful), `|F eval­u­ates to `F.x, where ‘x’ is the cur­rent char­ac­ter. If not, `Fv.

Finally, there are three func­tions that dis­re­gard Unlambda’s usual order of eval­u­a­tion:

d

Delay. The expres­sion `dF eval­u­ates to a promise to eval­u­ate the expres­sion F at some point in the future. When that promise is applied, it first eval­u­ates F, then applies F to its argu­ment.

For exam­ple, con­sider the expres­sion ``d`.xii. Here, d is applied to the expres­sion `.xi, (which is not eval­u­ated, and hence noth­ing is printed at this point), yield­ing a promise, which I’ll call <p>. This leaves us with `<p>i. Because <p> is being applied, the expres­sion it con­tains is now sub­sti­tuted for <p>, leav­ing us with ``.xii. We apply .x to i, which prints ‘x’ and returns i, and we’re left with `ii, which eval­u­ates to i.

c

Call with cur­rent con­tin­u­a­tion. The expres­sion `cF applies F to the cur­rent con­tin­u­a­tion. If F eval­u­ates with­out apply­ing its argu­ment, then `cF eval­u­ates to that value. If, how­ever, F’s argu­ment is applied to some­thing, `cF eval­u­ates to that value, even if a dif­fer­ent result has already been obtained.

Con­sider, for exam­ple, the expres­sion ``ci.x: `ci applies i to the cur­rent con­tin­u­a­tion, which I’ll call <cont>, yield­ing `i<cont>. This reduces, sim­ply, to <cont>, and we’re left with `<cont>.x. Now, because we’re apply­ing a con­tin­u­a­tion to some­thing, we travel back in time to the point when that con­tin­u­a­tion was cre­ated, in this case the appli­ca­tion of c to i, and force that expres­sion to return the value given to the con­tin­u­a­tion instead of what­ever it returned before. So, we have `ci again, but because we got there by apply­ing <cont> to .x, it returns .x and we’re left with `.x.x. That, in turn, prints ‘x’ and eval­u­ates to .x.

e
Exit, or the top-​level con­tin­u­a­tion. When applied, e ter­mi­nates the pro­gram, forc­ing the whole thing to eval­u­ate to e’s argu­ment.

Learn­ing to write pro­grams in Unlambda—if you’re as unfa­mil­iar with SKI com­bi­na­tor cal­cu­lus as I was when I started writ­ing this, anyway—is a frus­trat­ing, mind-​bending expe­ri­ence, and I highly rec­om­mend it. I won’t, how­ever, explain it here any fur­ther, as this post is intended to be about Unlambda’s imple­men­ta­tion, not about the lan­guage itself.

If you’d like to learn more about the lan­guage itself, the Unlambda home­page is a good place to start, as are the entries on the Esolang wiki and Wikipedia.

Imple­ment­ing Unlambda

The imple­men­ta­tion of Unlambda in a lan­guage as poorly-​endowed as C is some­what tricky: C does not have clo­sures, con­tin­u­a­tions, tail calls, or garbage col­lec­tion, and Unlambda does. What makes Unlambda so attrac­tive to imple­ment, in fact, is that it con­tains noth­ing else: it has most of the things that make imple­ment­ing ‘real’ func­tional pro­gram­ming lan­guages inter­est­ing, and none of the easy, tedious things that make those lan­guages use­ful.

I’ll be dis­cussing the details of my imple­men­ta­tion shortly. If you’d like to fol­low along with the actual source code rather than the trun­cated exam­ples pre­sented herein, you can acquire it here.

Garbage Col­lec­tion

Because Unlambda does not have muta­ble val­ues or any way to ref­er­ence those val­ues before or dur­ing their ini­tial­iza­tion, it is impos­si­ble to cre­ate cycles, and we can get away with ref­er­ence count­ing: for each dynamically-​allocated object, we sim­ply main­tain a count of the num­ber of places in which it is ref­er­enced, and deal­lo­cate it when that count becomes zero.

The mem­ory man­age­ment code looks like this:

typedef struct Obj {
  size_t refcount;
  enum ObjectType {O_ATOM, O_EXPR, O_CONT} type;
} Obj;

struct Atom {Obj obj, ...};
struct Expr {Obj obj, ...};
struct Cont {Obj obj, ...};

void incref(Obj *);
void decref(Obj *);

#define INCREF(x) (incref((Obj *)(x)), (x))
#define DECREF(x) (decref((Obj *)(x)))

Work­ing with ref­er­ence count­ing is best thought of as a kind of own­er­ship: A func­tion owns and is respon­si­ble for its argu­ments, and when a func­tion returns a dynamically-​allocated value, the caller owns that value. When a value is dis­carded, DECREF should be called to destroy that par­tic­u­lar ref­er­ence; when own­er­ship of a value is shared with another func­tion, INCREF should be called so that there are enough ref­er­ences to go around.

Data Struc­tures

Unlamda con­sists of only two types: func­tions (or atoms, to bor­row Lisp’s ter­mi­nol­ogy) and uneval­u­ated expres­sions. Atoms are fairly easy to rep­re­sent:

typedef struct Atom {
  struct Obj obj;
  enum AtomType {
    A_S, A_S1, A_S2, A_K, A_K1, A_I, A_V, A_D, A_D1,
    A_C, A_C1, A_E, A_DOT, A_AT, A_QUERY, A_PIPE
  } type;
  union {
    char ch;
    struct {
      struct Atom *a1, *a2;
    } p;
    struct Expr *expr;
    struct Cont *cont;
  } u;
} Atom;

In addi­tion to the obvi­ous mem­bers of AtomType, A_S1, A_S2, A_K1, and A_D1 stand for the partially-​applied s, k, and d func­tions, and A_C1 is an unap­plied con­tin­u­a­tion.

Expres­sions, being either atoms or appli­ca­tions of other expres­sions, are even eas­ier:

typedef struct Expr {
  struct Obj obj;
  enum ExprType {E_ATOM, E_APPLY} type;
  union {
    struct Atom *atom;
    struct {
      struct Expr *e1, *e2;
    } a;
  } u;
} Expr;

There is also a third type, Cont, which describes con­tin­u­a­tions and is sim­i­larly straight­for­ward. We’ll see that one in the sec­tion on tail calls and continuation-​passing style.

Object Ini­tial­iza­tion

C lacks, among other things, optional and key­word argu­ments to func­tions. In C99, how­ever, it is pos­si­ble to bor­row an idiom from JavaScript and pass a struct lit­eral as the sole argu­ment to a func­tion instead of a larger num­ber of prim­i­tives:

foo((struct bar){.a=1, .b=2});

Defin­ing and keep­ing track of the the var­i­ous structs nec­es­sary to do this would prob­a­bly lead to an unmain­tain­able mess, but it works well for writ­ing con­struc­tors for com­pli­cated data struc­tures, as the types are already defined. For exam­ple, here’s the con­struc­tor for Atom:

Atom *new_atom(Atom a) {
  Atom *x = xmalloc(sizeof (Atom));
  *x = a, x->obj = (Obj){1,O_ATOM};
  return x;
}

#define ATOM(...) \
  new_atom((Atom){{0,O_ATOM},__VA_ARGS__})

There are sim­il­iar con­struc­tors for Expr and Cont. This allows us to allo­cate and ini­tial­ize com­plex expres­sions in a sim­ple and straigh­for­ward man­ner:

// ``k.x.y
EXPR(E_APPLY, .u.apply={
  EXPR(E_APPLY, .u.apply={
    EXPR(E_ATOM, .u.atom=ATOM(A_K)),
    EXPR(E_ATOM, .u.atom=ATOM(A_DOT, .u.ch='x'))
  }),
  EXPR(E_ATOM, .u.atom=ATOM(A_DOT, .u.ch='y'))
})

Pars­ing

Because Unlambda has only one oper­a­tor and all but two of its lit­er­als con­sist of a sin­gle char­ac­ter, pars­ing Unlambda is triv­ial. The only fea­ture of my parser that’s really even worth a men­tion is the fact that the atoms and expres­sions for those single-​character lit­er­als are sta­t­i­cally allo­cated, which saves time on allo­ca­tion and deal­lo­ca­tion, and keeps the mem­ory foot­print small by gen­er­at­ing fewer dupli­cates:

static Atom atom_s = {{2, O_ATOM}, A_S};
// ...

static Expr expr_s = {{1, O_EXPR}, E_ATOM,
                      .u.atom=&atom_s};
//...

Expr *parse(FILE *f);

Eval­u­a­tion

Once an expres­sion has been parsed, it’s time to eval­u­ate it. We’ll start with a naïve eval­u­a­tor, writ­ten with­out regard for per­for­mance con­sid­er­a­tions or sys­tem lim­i­ta­tions:

Atom *eval(Expr *e) {
  Atom *a;
  switch (e->type) {
    case E_ATOM:
      a = INCREF(e->u.atom);
      DECREF(e);
      return a;
    case E_APPLY:
      a = apply(eval(INCREF(e->u.a.e1)),
                eval(INCREF(e->u.a.e2)));
      DECREF(e);
      return a;
  }
}

Atom *apply(Atom *a1, Atom *a2) {
  Atom *a;
  switch (a1->type) {
    case A_S:
      a = ATOM(A_S1, .u.p={a2});
      DECREF(a1);
      return a;
    case A_S1:
      a = ATOM(A_S2, .u.p={INCREF(a1->u.p.a1), a2});
      DECREF(a1);
      return a;
    case A_S2:
      a = apply(apply(INCREF(a1->u.p.a1), a2),
                apply(INCREF(a1->u.p.a2), a2));
      DECREF(a1);
      return a;
    // ...
  }
}

That imple­men­ta­tion would func­tion just fine for the triv­ial sort of pro­gram that you can write out with­out hav­ing to think too hard about it. This imple­men­ta­tion has a few prob­lems. It would break down on more com­pli­cated pro­grams, how­ever.

Most obvi­ously, c, d, and e are impos­si­ble to imple­ment this way, and so any pro­gram that makes use of them won’t func­tion prop­erly, if at all. It isn’t very hard to restruc­ture the above to acco­ma­date d, and e can prob­a­bly be made to work as well (i.e., with­out resort­ing to exit), but c is actu­ally impos­si­ble.

Next, con­sider the fact that any non­triv­ial Unlambda pro­gram is going to use a lot of recur­sion, and C main­tains a call stack of lim­ited size. A deeply-​recursive expres­sion like ```sii``sii is going to exhaust the stack in short order.

Promises

Adding sup­port for promises to the above is fairly straight­for­ward. We can sim­ply add another ver­sion of apply that expects its sec­ond argu­ment to be uneval­u­ated:

Atom *eval(Expr *e) {
  Atom *a;
  switch (e->type) {
    // ...
    case E_APPLY:
      a = apply1(eval(INCREF(e->u.a.e1)),
                 INCREF(e->u.a.e2));
      DECREF(e);
      return a;
    // ...
  }
}

Atom *apply1(Atom *a, Expr *e) {
  if (a->type == A_D) {
    DECREF(a);
    return ATOM(A_D1, .u.expr=INCREF(e));
  }
  return apply2(a, eval(e));
}

Atom *apply2(Atom *a1, Atom *a2) {
  Atom *a;
  switch (a1->type) {
    // ...
    case A_D1:
      a = apply2(eval(INCREF(a1->u.expr)), a2);
      DECREF(a1);
      return a;
    // ...
  }
}

Tail Calls and Continuation-​Passing Style

When a func­tion is called in C, an acti­va­tion record is pushed onto the call stack to keep track of the return address, local vari­ables, and what­ever else the spe­cific imple­men­ta­tion finds use­ful. When a func­tion returns, its acti­va­tion record is popped from the stack, leav­ing the caller’s acti­va­tion record at the top. This is a straight­for­ward and easy way to work with func­tion calls at a low level, but it can cause prob­lems when recur­sion gets intro­duced: the call stack is gen­er­ally a fixed size and not ter­ri­bly large, and when it fills up, things break.

We can avoid allo­cat­ing a new stack frame, how­ever, when a func­tion call is in tail posi­tion. That is, the value returned by the called func­tion is returned again by the caller, with no fur­ther pro­cess­ing. In this case, we can per­form what is known as tail call opti­miza­tion and replace the cur­rent stack frame with that of the called func­tion instead of push­ing a new one onto the stack. Some C com­pil­ers per­form this opti­miza­tion, but it isn’t part of the ANSI C stan­dard, nor is it par­tic­u­larly relable where it is avail­able.

The usual approach to deal­ing with the lack of proper tail call opti­mizaiton is to trans­form recur­sive com­pu­ta­tions into iter­a­tive ones by using pass­ing all func­tion calls through another func­tion, called a tram­po­line. Instead of call­ing a func­tion in the nor­mal man­ner, you’d return a thunk (a delayed com­pu­ta­tion, or in our case, a struct from which the call can recon­structed), which ‘bounces’ from the tram­po­line as the intended call.

Here’s what that might look like:

typedef struct Thunk {
  struct Obj obj;
  enum ThunkType {T_F, T_E, T_A1, T_A2} type;
  union {
    struct { struct Atom *a; } f;
    struct { struct Expr *e; } e;
    struct { struct Atom *a; struct Expr *e; } a1;
    struct { struct Atom *a1, *a2; } a2;
  } u;
} Thunk;

Atom *eval(Expr *e) {
  Thunk *t = THUNK(T_E, .u.e={e}), t2;
  Atom *a;
  for (;;) {
    switch (t->type) {
      case T_F: goto done;
      case T_E:
        t2 = expr(INCREF(t->u.e.e));
        break;
      case T_A1:
        t2 = apply1(INCREF(t->u.a1.a),
                    INCREF(t->u.a1.e));
        break;
      case T_A2:
        t2 = apply2(INCREF(t->u.a2.a1),
                    INCREF(t->u.a2.a2));
        break;
    }
    DECREF(t2);
    t = t2;
  }
  done:
  a = INCREF(t->u.f.a);
  DECREF(t);
  return a;
}

Thunk *expr(Expr *e) { /* ... */ }
Thunk *apply1(Atom *a, Expr *e) { /* ... */ }
Thunk *apply2(Atom *a1, Atom *a2) { /* ... */ }

Here, eval is our tram­po­line. It calls func­tions in a loop, return­ing when it comes across a thunk marked final (T_F). The func­tion expr has taken the place of what we were call­ing eval ear­lier. This elim­i­nates the dan­ger of stack over­flows, and it also brings us one step closer to being able to imple­ment c and e.

By adding a field to our thunk indi­cat­ing what to do next, we can encode the entire exe­cu­tion of our pro­gram. At this point, we can start call­ing it a con­tin­u­a­tion, because that’s what it is:

typedef struct Cont {
  struct Obj obj;
  enum ContType {C_F, C_E, C_A1, C_A2, C_A3} type;
  union {
    struct { struct Atom *a; } f;
    struct { struct Expr *e; } e;
    struct { struct Atom *a; struct Expr *e; } a1;
    struct { struct Atom *a1, *a2; } a2;
    struct { struct Atom *a1, *a2, *a3; } a3;
  } u;
  struct Cont *c;
} Cont;

Note the addi­tion of C_A3 and its argu­ments: con­tin­u­a­tions make the eval­u­a­tion of s tricky, and the eas­i­est way to deal with it was to add another helper func­tion.

Now, we need to add an addi­tional para­mater to all of our Thunk–return­ing func­tions: the cur­rent con­tin­u­a­tion. Just like in Unlambda, it’s a func­tion that rep­re­sents the exe­cu­tion of the remain­der of the eval­u­a­tion. One can’t call a struct in C, though, so we’ll intro­duce a util­ity func­tion to mimic the behav­ior we want:

Cont *call(Atom *, Cont *);

This makes a shal­low copy of the given Cont with the sup­plied Atom sub­sti­tuted for the first NULL argu­ment. For exam­ple, the fol­low­ing two expres­sions should pro­duce the same result:

call(a, CONT(C_A2, .u.a2={x, NULL}))
CONT(C_F, .u.a2={x, a})

Pass­ing con­tin­u­a­tions around in this man­ner is called, pre­dictibly, continuation-​passing style. Now, let’s extend our tram­po­line to deal with con­tin­u­a­tions instead of thunks:

Atom *eval(Expr *e) {
  Cont *c = CONT(C_E, .u.e={e}, CONT(C_F));
  Cont *cc; Atom *a;
  for (;;) {
    switch (c->type) {
      case C_F:
        goto done;
      case C_E:
        cc = expr(INCREF(c->u.e.e), INCREF(c->c));
        break;
      case C_A1:
        cc = apply1(INCREF(c->u.a1.a),
                    INCREF(c->u.a1.e), INCREF(c->c));
        break;
      case C_A2:
        cc = apply2(INCREF(c->u.a2.a1),
                    INCREF(c->u.a2.a2), INCREF(c->c));
        break;
      case C_A3:
        cc = apply3(INCREF(c->u.a3.a1),
                    INCREF(c->u.a3.a2),
                    INCREF(c->u.a3.a3), INCREF(c->c));
        break;
    }
    DECREF(c); c = cc;
  }
  done:
  a = INCREF(c->u.f.a);
  DECREF(c);
  return a;
}

And here’s expr, in it’s continuation-​passing glory:

Cont *expr(Expr *e, Cont *c) {
  switch (e->type) {
    case E_ATOM:
      c = call(e->u.atom, c);
      DECREF(e);
      return c;
    case E_APPLY:
      c = CONT(C_E, .u.e={e->u.a.e1},
               CONT(C_A1, .u.a1={NULL, e->u.a.e2},
                    c));
      DECREF(e);
      return c;
   }
}

And apply3, which processes the sec­ond and third appli­ca­tions of the S com­bi­na­tor, along with the por­tion of apply2 that made it nec­es­sary:

Cont *apply3(Atom *a1, Atom *a2, Atom *a3, Cont *c) {
  return CONT(C_A2, .u.a2={a2, a3},
              CONT(C_A2, .u.a2={a1, NULL}, c));
}

Cont *apply2(Atom *a1, Atom *a2, Cont *c) {
  switch (a1->type) {
    // ...
    case A_S2:
      c = CONT(C_A2, .u.a2={INCREF(a1->u.p.a1), a2},
               CONT(C_A3, .u.a3={NULL,
                 INCREF(a1->u.p.a2), INCREF(a2)}, c));
      DECREF(a1);
      return c;
    // ...
}

Now, we’re finally ready to imple­ment c and e. With continuation-​passing style, the hard work is already done, mak­ing the act of actu­ally adding sup­port for them triv­ial:

Cont *apply2(Atom *a1, Atom *a2, Cont *c) {
  switch (a1->type) {
    // ...
    case A_C:
      c = CONT(C_A2, .u.a2={a2,
        ATOM(A_C1, .u.cont=INCREF(c))}, c);
      DECREF(a1);
      return c;
    case A_C1:
      c = call(a2, INCREF(a1->u.cont));
      DECREF(a1);
      return c;
    case A_E:
      c = CONT(C_F, .u.f={a2}, c);
      DECREF(a1);
      return c;
    // ...
  }
}

The End

That’s it! If you didn’t grab when I men­tioned it above, my code is avail­able here. It’s under the MIT license, so go ahead and play around with it.

Subscribe

Search

Tags

    ©2010 Noah Medling. All rights reserved.