How to program my universal Turing machine in LISP

Introduction---Why switch to binary programs?

Okay, so now we've got a fairly simple version of LISP. Its interpreter is only three hundred lines of Mathematica code, and it's less than a thousand lines of C and of Java. So let's use it!

The new stuff for AIT is just try, which enables us to give a LISP expression binary data that it can access using read-bit and read-exp, primitives which abort if they run off the end of the data. How can we use this to get the correct program-size complexity measure H(x) of AIT, that is, to measure the algorithmic information content of an arbitrary S-expression x?

Well, a simple way to do this would be to define H(x) to be the size of the smallest program for x in an actual programming language, one that human beings like to use. For example, we could define H(x) to be the size in characters of the smallest LISP expression whose value is x, when this LISP expression is printed in standard format with exactly one blank separating successive elements of a list.

Well, that's not a bad definition; it is possible to work with it up to a point. And it does have the advantage of being a very straightforward and understandable definition of program-size complexity.

But real programming languages are too redundant; that's why human beings like them! The information in real programs is not packed densely enough. They're too similar to English and other natural languages, in which a small change in a text can usually be detected and doesn't mess up its meaning too much. If information is really densely packed, then any random change in a meaningful message gives another meaningful message. Natural languages have helpful redundancy, which gives us a natural error-detection/correction mechanism.

So how can we pack information into a program as densely as possible? Well, the best way is to give raw bits; we should have absolutely no idea if the next bit is going to be a 0 or a 1. Both should be equally likely and it should be a complete surprise. Then we are really taking advantage of each bit in the program!

Also we want programs to be self-delimiting. What do I mean by this? What I mean is that we really only have a binary alphabet. If we use a blank or another special character as punctuation at the end of the program, then it's really a 3-symbol alphabet, not a binary alphabet! And then all three symbols would have to be used equally often, each equally likely, for us to take full advantage of this communications channel. In other words, if blank is only used as punctuation at the end of binary programs, then we are not using it often enough; we're wasting one of the symbols in our alphabet!

So let's stick to a truly binary alphabet, 0's and 1's, so that as a computer reads in a program bit by bit it has to decide by itself when to stop reading it. And if we can do this, then our information content measure will be (sub)additive, because programs can be concatenated and there is no ambiguity where one ends and the next begins. In other words, if a computer decides by itself when to stop reading a program, then we can build up programs by combining smaller subroutines without having to add any punctuation.
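
Here's the idea in miniature, as a Python sketch (Python rather than my LISP, and a made-up toy encoding, purely to illustrate self-delimiting concatenation):

   # A toy self-delimiting code (illustrative assumptions only): each
   # "program" announces its own end, so concatenated programs parse
   # unambiguously, with no end-marker symbol needed.

   def encode(payload: str) -> str:
       # Made-up scheme: the length n in unary (n 1's then a 0),
       # followed by the n payload bits.
       return "1" * len(payload) + "0" + payload

   def decode_one(bits: str, i: int):
       # Consume exactly one self-delimiting program starting at index i.
       n = 0
       while bits[i] == "1":        # read the unary length header
           n += 1
           i += 1
       i += 1                       # skip the terminating 0
       return bits[i:i + n], i + n  # payload, and where the next one starts

   # Two programs, concatenated; the decoder finds the boundary by itself.
   stream = encode("0101") + encode("111")
   p1, i = decode_one(stream, 0)
   p2, _ = decode_one(stream, i)
   assert (p1, p2) == ("0101", "111")

The decoder stops by itself after each program, and that's the whole (sub)additivity mechanism: two programs glued together, with no punctuation in between.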

Definition of our universal Turing machine U with self-delimiting programs

So our binary programs will consist of two parts. First there'll be a LISP expression that indicates some scheme for reading in the bits of raw binary data and what to do with them. We'll give this LISP prefix in ASCII, eight bits per character, and we'll put a special non-printable reserved character, the UNIX newline, at the end as punctuation. After that we'll put the raw binary data that this LISP prefix is reading in. The LISP prefix will get access to this raw binary data using read-bit and read-exp, so that it has to decide by itself where to stop reading; it won't get a graceful end-of-file indication. And the value of the LISP prefix will be the output of the program. [Later we'll also allow additional output using display. That's how we'll use U to generate infinite sets of S-expressions, and that's how we'll define the complexity H(X) of an infinite set of S-expressions X.]

And this is a universal self-delimiting scheme U, because we can use U to simulate any special-purpose self-delimiting binary computer C by putting in front of C's programs a prefix indicating in LISP how C works. In other words, if p is a program for C, and σ is a LISP program for simulating C, then (append (bits σ) p) is a program for our universal self-delimiting computer U that simulates C running p.

So how do we program U(p), our universal computer U running the binary program p, in LISP? Well, that's very easy to do! Here's a definition of U in LISP:

   define (U p) cadr try no-time-limit 'eval read-exp p
So the value of U is
   lambda (p) cadr try no-time-limit 'eval read-exp p
That's an M-expression; here's the corresponding S-expression:
   (lambda (p)
       (car (cdr (try no-time-limit (' (eval (read-exp))) p)))
   )
This says: to run the program p, use a try with no time limit and with p as the binary data, and give the binary data p to (eval (read-exp)). So this will read a LISP expression from the beginning of p, eight bits per character, until it finds a newline character. Then it'll run this LISP expression, it'll evaluate it, giving it the binary data that's left, which is exactly what we wanted it to do. And we select the second element of the triple that try returns, which is the final value returned by the LISP expression that we read in from the beginning of p. [When generating infinite sets of S-expressions, which are displayed, we'd take the third element of the triple that try returns, not the second element.]
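
If it helps, here's a toy model of U in Python rather than in LISP. Everything here is an illustrative stand-in: the textual prefix is a Python expression instead of a LISP expression, but the self-delimiting format is the same, eight bits per character up to a newline, followed by raw binary data.

   # A toy model of U (illustrative only, not the actual U).
   def toy_U(program: str):
       # Read 8 bits per character until we hit the newline character.
       text, i = "", 0
       while True:
           ch = chr(int(program[i:i + 8], 2))
           i += 8
           if ch == "\n":
               break
           text += ch
       data = list(program[i:])          # the raw binary data that's left

       def read_bit():
           if not data:                  # like read-bit, abort if we
               raise EOFError("ran off the end of the data")  # run out
           return int(data.pop(0))

       # Evaluate the prefix, giving it access to the remaining raw bits.
       return eval(text, {"read_bit": read_bit})

   def to_bits(text: str) -> str:        # like LISP bits, newline included
       return "".join(format(ord(c), "08b") for c in text + "\n")

   # A prefix that ignores the data, and one that reads exactly 3 bits:
   print(toy_U(to_bits("1 + 2")))                                   # -> 3
   print(toy_U(to_bits("[read_bit() for _ in range(3)]") + "101"))  # -> [1, 0, 1]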

And to make things easier, the Java applet version of my LISP interpreter provides this as a macro. You just write run-utm-on, which stands for cadr try no-time-limit 'eval read-exp.

Definition of the complexity H(x) of an S-expression x

So U is our universal self-delimiting binary computer, and we define the algorithmic information content or complexity of an arbitrary S-expression x to be the size in bits of the smallest program p for which U(p) = x.

What is the complexity of the n-bit string β?

Let's look at some examples. Let's see how we can use U to calculate an n-bit string β.

Well, the easiest way to do this is to include β as is in the LISP prefix of p and not to use the binary data at all. For example

   (U bits ' '(0 1 0 1 1 1))
evaluates to the 6-bit string (0 1 0 1 1 1). How long is this program? Well, it's
   length bits ' '(0 1 0 1 1 1)
bits long, that's 8 + 8 × the number of characters in the S-expression
   (' (0 1 0 1 1 1))
So in general, this gives us an 8 + 8 × (5 + 2n) bit program for the n-bit string β. The five is for the five characters (' )), and then it's two characters per bit, times eight bits per character. In other words, we have

H(β) ≤ 16n + c.
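
Here's a quick check of that bit count in Python (illustrative arithmetic only):

   # The "quote the bits literally" program for n = 6.
   n = 6
   s_expression = "(' (0 1 0 1 1 1))"    # 5 + 2n = 17 characters
   assert len(s_expression) == 5 + 2 * n
   print(8 + 8 * len(s_expression))      # 8 bits/char plus the newline:
                                         # -> 144 = 16n + 48 for n = 6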

Let's do better. To calculate the n-bit string β, let's concatenate the minimum-size program to calculate the decimal number n followed by the n bits in β. So we start by reading in and executing the program to calculate n, then once we know n, we read in the n bits in β. Let's do this in LISP. Well, to read in and run an arbitrary program for U to calculate n, we just do eval read-exp. Then we define a loop function, with one argument k. If k is 0, then we're finished, there are no bits to read, and β is just the empty list nil. If not, we cons read-bit to loop of k minus one. So let's put it all together:

   define pi
      let (loop k)
         if = k 0 nil
         cons read-bit (loop - k 1)
      (loop eval read-exp)
This is the prefix π we have to put in front of an arbitrary program for n followed by the n-bit string β. For example:
   (U
      append bits pi
      append bits 12
             '(0 0 1 1 1 1 1 1 0 0 0 1)
   )
yields the 12-bit string
   (0 0 1 1 1 1 1 1 0 0 0 1)
What this shows is that in general,

H(β) ≤ H(n) + n + c,

where c is length bits π, which turns out to be 912 bits.
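
Here's the same header-plus-data scheme modeled in Python (a sketch with a simplified header: n given directly as decimal digits, eight bits per character plus a newline, standing in for an arbitrary program for n):

   # Illustrative only: the program is a header that computes n,
   # followed by the n raw bits of beta. Once the decoder knows n, it
   # knows exactly how many more bits to read, so the whole program
   # delimits itself.
   def encode(beta: str) -> str:
       header = str(len(beta)) + "\n"
       return "".join(format(ord(c), "08b") for c in header) + beta

   def decode(program: str) -> str:
       text, i = "", 0
       while not text.endswith("\n"):    # read the header for n
           text += chr(int(program[i:i + 8], 2))
           i += 8
       return program[i:i + int(text)]   # then read exactly n bits

   beta = "001111110001"                 # the 12-bit string from above
   assert decode(encode(beta)) == beta
   print(len(encode(beta)))              # -> 8*3 + 12 = 36 bits total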

So in general, just by giving the length n as a decimal constant in LISP, we're using 8 bits for each decimal digit in n, which is about 8 log₁₀ n bits. This shows that H(β) is bounded by n plus a constant times log n, that is, by n plus something of order log n:

H(β) ≤ n + O(log n).

But of course for some values of n we can do much better, for example, if n is 10^(10^10). This is a very large number with a very small program, with a very simple algorithmic description. In fact, what is H(10^(10^10))? Well, here is a program for this n:

   (U bits ' ^ 10 ^ 10 10)
[Don't run this!] So its complexity H(10^(10^10)) is bounded by
   length bits ' ^ 10 ^ 10 10
which is very small. [This, you can run!]
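
Here's the corresponding arithmetic in Python, assuming the S-expression prints in standard format as (^ 10 (^ 10 10)):

   # Illustrative arithmetic only: 8 bits per character plus 8 for the
   # newline at the end of the LISP prefix.
   s_expression = "(^ 10 (^ 10 10))"
   print(8 + 8 * len(s_expression))   # -> 136 bits, for a number with
                                      # about 10^10 decimal digits!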

It turns out that this n + H(n) + c upper bound on H(β) is in general the best possible, because most n-bit strings β have complexity H(β) very close to this upper bound. We'll show this at the beginning of Part III.

Of course some n-bit strings have much lower complexity. For example, the string 0ⁿ consisting of n 0's has complexity within a bounded number of bits of the complexity of n, which in turn is bounded by a constant times log n, that is, of order log n:

H(0ⁿ) = H(n) + O(1) = O(log n).

[Exercise: work through the details to prove this!]
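
As a hint, here's the idea in Python (same toy decimal-digit header as before, illustrative only): a constant-size wrapper turns any program for n into a program for 0ⁿ, so the cost is H(n) plus a constant.

   # Illustrative only: read a header that gives n, then output n 0's.
   def decode_zeros(program: str) -> str:
       text, i = "", 0
       while not text.endswith("\n"):   # read the header that computes n
           text += chr(int(program[i:i + 8], 2))
           i += 8
       return "0" * int(text)           # the wrapper's constant-size job

   header = "".join(format(ord(c), "08b") for c in "12\n")
   assert decode_zeros(header) == "0" * 12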

Definition of the joint complexity H(x, y) of two S-expressions x and y

The next step is to consider the joint complexity of a pair of S-expressions x and y. That's written H(x, y) and it's defined to be the size in bits of the smallest program p for U to calculate the ordered pair (two-element list) (x y). In other words, H(x, y) is just H((x y)).

The first thing to notice about the joint complexity is that it's symmetrical:

H(x, y) = H(y, x) + O(1).

I'll leave that as an exercise for you, dear reader!

Proof that H(x, y) ≤ H(x) + H(y) + c

As I stated in the preface of this book and in the introduction to this chapter, one of the advantages of self-delimiting programs, one of the advantages of using my U, is that complexity is subadditive: the joint complexity of a pair is bounded by the sum of the individual complexities. How can we show this?

Just consider the prefix ρ defined as follows:

   define rho
      cons eval read-exp cons eval read-exp nil
So ρ is defined to be
   (cons (eval (read-exp)) (cons (eval (read-exp)) nil))
This is
   size rho
= 53 characters long, and it's 8 + 8 × 53 = 8 × 54 = 432 bits long when used as a prefix in a program for U. So that's the constant c in

H(x, y) ≤ H(x) + H(y) + c.

To see an example, try

   (U
      append bits rho
      append bits pi
      append bits 5
      append '(1 1 1 1 1)
      append bits pi
      append bits 9
             '(0 0 0 0 0 0 0 0 0)
   )
It yields, as it should, this pair:
   ((1 1 1 1 1) (0 0 0 0 0 0 0 0 0))
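
Here's the same trick modeled in Python (toy decimal-header format as before, illustrative only): because each subroutine delimits itself, the decoder can read two programs off a single bit stream and pair up their results, just like ρ does.

   # Illustrative only: two self-delimiting subroutines, back to back.
   def read_string(program: str, i: int):
       # One subroutine: a decimal length header, then that many bits.
       text = ""
       while not text.endswith("\n"):
           text += chr(int(program[i:i + 8], 2))
           i += 8
       n = int(text)
       return program[i:i + n], i + n

   def subroutine(s: str) -> str:
       return "".join(format(ord(c), "08b") for c in str(len(s)) + "\n") + s

   def decode_pair(program: str):
       x, i = read_string(program, 0)   # the first program delimits itself,
       y, _ = read_string(program, i)   # so we know where the second begins
       return (x, y)

   pair = decode_pair(subroutine("11111") + subroutine("000000000"))
   assert pair == ("11111", "000000000")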

Definition of the relative complexity H(y | x) of the S-expression y given the S-expression x

Is this inequality

H(x, y) ≤ H(x) + H(y) + c

sharp? No, not at all! It's only sharp when x and y have nothing in common, when they are algorithmically independent. Now we'll take a look at the relative complexity of y given x, which we'll write like this: H(y | x). That'll give us a sharp result,

H(x, y) = H(x) + H(y | x) + O(1),

at the end of Part II. For now, all we'll be able to show is one direction of this inequality, namely that

H(x, y) ≤ H(x) + H(y | x) + c.

By the way, since joint complexity is symmetrical, at the end of Part II we'll deduce that

H(x) + H(y | x) = H(y) + H(x | y) + O(1)

and, transposing, that

H(x) − H(x | y) = H(y) − H(y | x) + O(1).

In other words, the extent to which knowing y helps us to know x is the same as the extent to which knowing x helps us to know y! But we won't know this until the end of Part II.

To get these sharp results, we need to be subtle; the obvious definition of relative complexity won't do. That would be to define H(y | x) as the size of the smallest program for U to calculate y if it's given x for free. Instead, we're given a minimum-size program for x, not x directly; there is a subtle but important difference.

The reason for picking this definition instead of the obvious one is that a minimum-size program tells us its size as well as its output, so that's really different from just being given the output.

In fact, it's not difficult to see that

H(x, H(x)) = H(x) + O(1).

I'll leave this as an exercise for the reader, but I'll show you the main idea later. And any definition of H(y | x) that satisfies the fundamental decomposition result

H(x, y) = H(x) + H(y | x) + O(1),

would, taking y = H(x), have to give us this:

H(x, H(x)) = H(x) + H(H(x) | x) + O(1).

Thus, if it's properly defined, the relative complexity of H(x) given x has got to be bounded:

H(H(x) | x) = O(1).

Well, this is indeed the case if we're given a minimum-size program for x instead of x directly (we just measure its length instead of running it), as I'll show in a moment. It will take us all of Part II to show that this is all we need to do to get

H(x, y) = H(x) + H(y | x) + O(1)

to work in general! In other words, if we can get this to work for H(x, H(x)), then it'll work for H(x, y) with y ≠ H(x)! But this will be a lot of work to show. For now, later in this chapter, all we'll be able to prove is that

H(x, y) ≤ H(x) + H(y | x) + c.

But first I'll have to show you a new definition of U that works for relative complexity.

How to run relative programs on our universal Turing machine U

Before, programs for our universal machine started off with a LISP prefix, which was read in and run and which could then read in the rest of the program. The LISP prefix that's in binary at the front of the program is just a LISP expression that's evaluated, and as it's evaluated it has access to the binary data that follows it in the program.

Well, to get relative complexity instead of absolute complexity, we just make one small change. Now the binary program for U begins with a lambda expression for a function definition, for the function that gives the result if applied to the stuff we are given for free. So here's a program for n + 10 given n for free: the prefix is

   bits ' lambda (n*) + 10 run-utm-on n*
and it's followed by no binary data. And here's an H(m) + c bit program for n + m given n for free: the prefix is
   bits ' lambda (n*) let m eval read-exp
                      let n run-utm-on n*
                          + n m
and it's followed by the H(m) bits of a minimum-size program for m. Here's how we get the complexity H(x) if we're given x, that is, given a minimum-size program x* for x: the prefix is
   bits ' lambda (x*) length x*
and there's no binary data. Hence, as promised, H(H(x) | x) ≤ c; c = 8 + 8 × 25 = 208 is the number of bits in this prefix.

Two things need to be explained here. First, we need to use run-utm-on to get what we're being given for free, because we're given a minimum-size program for it, we're not getting it directly! Second, we use the macro run-utm-on, not the function U, because the definition of U is not visible inside a try, which always starts with a fresh environment.

So we can see if a program for our universal Turing machine U is a relative program or an absolute one. If there's a lambda expression

   (lambda (x y z...) body)
in binary at the beginning of the program, then it's a relative computation, and the number of arguments (x y z...) of the function tells us how many things we're being given for free. If the S-expression in binary at the beginning of the program is not a lambda expression, then it's an ordinary "absolute-complexity" program, not a relative-complexity program.

How can we actually run one of these relative programs? Well here's how:

   define (U2 p q)
      [p is the binary program for U,]
      [q is what we're given for free]
      cadr try no-time-limit  [first argument of try]
               cons 'read-exp [second argument of try]
               cons cons "'
                    cons q
                         nil
                    nil
               p              [third argument of try]
The expression that try evaluates is ((read-exp) (' q)), which reads in the lambda expression from the beginning of p and then gives it the free data q. If we were dealing with H(x | y, z), in which we're given two things for free, we'd form
   ((read-exp) (' q) (' r))
instead.
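
And here's a toy Python model of U2 (same caveats as the earlier toy_U sketch: a Python lambda stands in for the LISP lambda expression, and the format is the same eight-bits-per-character prefix):

   # Illustrative only: the prefix is a lambda, applied to the free data.
   def toy_U2(program: str, q):
       text, i = "", 0
       while True:                  # read the textual prefix, 8 bits/char
           ch = chr(int(program[i:i + 8], 2))
           i += 8
           if ch == "\n":
               break
           text += ch
       return eval(text)(q)         # apply the lambda to what's free

   def to_bits(text: str) -> str:
       return "".join(format(ord(c), "08b") for c in text + "\n")

   # Like the H(H(x) | x) <= c example: given a minimum-size program x*
   # for x (here just a stand-in bit string), its length is H(x); we
   # measure x* instead of running it.
   print(toy_U2(to_bits("lambda x_star: len(x_star)"), "0110100"))  # -> 7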

Exercise: now try this out on some examples! Of course, we can't run the real thing, because we can't be sure that we've got a minimum-size program for the free stuff. But in most cases any program for the free stuff will do, and such examples can be checked by running them.

Proof that H(x, y) ≤ H(x) + H(y | x) + c

Now it's time to prove as much of our fundamental result that

H(x, y) = H(x) + H(y | x) + O(1)

as we can now, namely, that

H(x, y) ≤ H(x) + H(y | x) + c.

So p consists of a prefix γ followed by a minimum-size program for x followed by a minimum-size program to calculate y from a minimum-size program for x, and we want to output the pair (x y). The idea is to start off by reading and running the program for x, not directly but indirectly, a bit at a time, so that we get each bit in the minimum-size program for x as well as x itself. [That's also the idea of the proof that H(x, H(x)) = H(x) + O(1), which I've left to the reader as an exercise.] Then we give the program for x to the program to get y from x, and then we cons up x and y.

Here's the prefix γ that does this and whose size in bits gives us c:

[  Proof that H(x,y) <= H(x) + H(y|x) + 2872.  ]
[  The 2872-bit prefix gamma proves this.      ]
define gamma
 
   [read program p bit by bit until we get it all]
   let (loop p)
      if = success car try no-time-limit 'eval read-exp p
      [then] p
      [else] (loop append p cons read-bit nil)
 
   let x* (loop nil)         [get x* = program for x]
   let x run-utm-on x*       [get x from x*]
   let y                     [get y from x* by running]
       eval cons 'read-exp   [((read-exp) (' x*))]
            cons cons "'
                 cons x*
                      nil
                 nil
   [form the pair x, y]
 
   cons x cons y nil
That's the definition of the prefix γ. To check that γ is actually 2872 bits long, you run this:
   length bits gamma
And here's an example showing how to use γ:
   run-utm-on

   [get pair x, y by combining                   ]
   [a program for x and a program to get y from x]
 
   append
 
      bits gamma
 
   append
 
      [x* = program to calculate x = 2]
      [[Supposedly x* is smallest possible,]]
      [[but this works for ANY x* for x.]]

      bits ' + 1 1
 
      [program to calculate y = x+1 from x*]

      bits ' lambda (x*) + 1 run-utm-on x*
This will produce the result (x y) = (2 3).
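
Finally, here's γ's bit-by-bit trick modeled in Python (toy decimal-header format as before, illustrative only): grow a candidate one bit at a time and test whether it's a complete program yet, so that we end up holding the program x* itself, not just its value.

   # Illustrative only: read one whole self-delimiting program, bit by
   # bit, off the front of a longer stream.
   def is_complete(candidate: str) -> bool:
       # Toy test: a full decimal header plus that many data bits.
       text, i = "", 0
       try:
           while not text.endswith("\n"):
               text += chr(int(candidate[i:i + 8], 2))
               i += 8
           return len(candidate) - i >= int(text)
       except ValueError:           # not even a well-formed header yet
           return False

   def read_whole_program(stream: str) -> str:
       candidate = ""
       for bit in stream:           # like gamma's (loop ...) in LISP
           candidate += bit
           if is_complete(candidate):
               return candidate     # x* itself, not just its value
       raise EOFError("ran off the end of the data")

   x_star = "".join(format(ord(c), "08b") for c in "3\n") + "101"
   trailing = "0011"                # bits that belong to the next program
   assert read_whole_program(x_star + trailing) == x_star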

Software for this chapter

The source code and runs of the software for this chapter can be found on the web at these URLs:
   http://www.cs.umaine.edu/~chaitin/ait/utm2.l
   http://www.cs.umaine.edu/~chaitin/ait/utm2.r

   http://www.cs.auckland.ac.nz/CDMTCS/chaitin/ait/utm2.l
   http://www.cs.auckland.ac.nz/CDMTCS/chaitin/ait/utm2.r

Exercises

  1. Show that

    H(x) ≤ H(x, y) + c

    and determine c.

  2. Show that

    H(x | y) ≤ H(x) + c

    and determine c.

  3. Consider an n-bit string β. Show that

    H(n) ≤ H(β) + c

    and determine c.

  4. Consider the n-bit string 0ⁿ consisting entirely of 0's. Show that

    |H(0ⁿ) − H(n)| ≤ c

    and determine c.

  5. Symmetry of joint complexity. Show that

    |H(x, y) − H(y, x)| ≤ c

    and determine c.

  6. Show that

    |H(x) − H(x, H(x))| ≤ c

    and determine c. In other words, a minimum-size program for x tells us its size H(x) (a decimal number) as well as its output x.

  7. The software for this chapter uses an experimental new primitive function was-read to speed up the program prefix γ. I'm not satisfied with this---invent a better solution!

  8. [This and the next exercise are previews of the last chapter of this book, in which we discuss the complexity H(X) of infinite sets X of S-expressions, defined as the size of the smallest program that makes our universal machine U display the elements of X in some order, with or without repetitions. Sets X for which this is possible are called r.e. = recursively enumerable (old name) or c.e. = computably enumerable (new name).]

    Represent a function ψ via its graph, that is, as the set of ordered pairs (x ψ(x)). Show that

    H(ψ(x)) ≤ H(ψ) + H(x) + c

    and determine c.

  9. Complexity of unions and intersections of infinite sets. Consider two r.e. (c.e.) infinite sets of S-expressions X and Y. Show that

    H(X ∪ Y) ≤ H(X) + H(Y) + c

    and determine c. Show that

    H(X ∩ Y) ≤ H(X) + H(Y) + c'

    and determine c'. Hint: merge the programs for generating X and Y in the order that their bits are needed when running both programs in parallel.