[self-interest] hardware implementation of PICs [long]

Sat Sep 4 23:31:52 UTC 1999

Since this is a hardware issue, I am sending this to the Merlin list.
Since this is a Self implementation issue, I am also sending this
to the Self-interest list and apologize to anyone who receives two
copies as a result. I will have to give some background to explain
my idea, so this will be rather long. Those not interested in very
low level implementation details will probably want to simply skip
this entirely.

Stefan Matthias Aust was asking about caching message lookups
and we got into a discussion about Polymorphic Inline Caches (PICs).
These PICs are simply short sequences of code that test the
receiver's type (class in Smalltalk, map in Self) against a short
list of candidates and jumps to the corresponding native code
directly. If none of the alternatives are the correct one, then
the traditional lookup system is invoked and it can extend the
PIC so the current answer is found the next time around. To
make this expansion easier, we might store the PIC in a separate
memory region from the main code in the method and have that
main code include a call to the PIC (by using jump instructions
in the PIC, we guarantee that the return instruction in the invoked
native method will bring us directly back to the main code).
Since the number of entries in a PIC is small, the best implementation
is something like:

      if (receiver->map == SmallIntegerMap) goto SmallIntegerAdd;
      if (receiver->map == FloatMap) goto FloatAdd;
      goto Lookup(receiver->map,"+");

The Self compiler can use this as a source of type information
instead of (or together with) type analysis. It can tell that
sending "+" to the object known as "receiver" at this point
in the source code (known as the call site) has only resulted
in either executing SmallIntegerAdd or FloatAdd. We can't
know if there will be additional types for receiver in the
future (though type analysis might be able to tell us that)
so we won't be able to eliminate these tests from the compiled
code (I am supposing we have decided to recompile this for some
reason), but we can move them around and maybe merge them with
similar tests nearby.

Here is data presented in Table A-1 in Urs's PhD thesis:

  number of receiver types          percentage of call sites
            0                                12.4%
            1                                81.9%
            2                                 4.4%
            3                                 0.8%
            4                                 0.2%
            5-10                              0.2%
            11-99                             0.05%
            >99                               0.02%

Ok, so on to Merlin 6. This machine has 64K words of 80 bits
each for use as a direct mapped data and "microcode" cache -
64 bits of data and 16 bits for the tags. The machine language
in Merlin 6 is Self bytecodes: nothing else exists in main
memory. When it needs to execute some bytecodes, the instruction
pointer is extended with a special "context register" and the
result is looked up in the cache. If the cache hits, then the
data found there is microcode to be directly executed and *not*
the bytecodes pointed to by the instruction pointer! This is
a subtle, but very important difference relative to all
architectures I am aware of. If the cache misses, then the
CPU fetches microcode from a fixed location in a special
region in the cache. This is the bytecode interpreter/compiler
and it will fetch the bytecodes (storing them in the *data*
region of the cache) and will either execute them directly
or generate new microcode to be stored in the cache.

The context register normally has a value of zero, but this
can be changed by the microcode. This allows more than
one microcode to be associated with a single bytecode. This
is used not only to implement multicycle instructions (so
it would be a microcode instruction pointer), but also for
Self 4.0 customized compilation.

About the microcode - the processor is a simple MOVE CPU
(http://cardit.et.tudelft.nl/MOVE/) and each 64 bit word
includes 4 move instructions that are executed in a single
clock. I will represent one microcode word as

[cond1]src1->dest1;[cond2]src2->dest2;[cond3]src3->dest2;[cond4]src4->dest4

which means that data in src1 is transfered to dest1 in this
cycle if cond1 happens to be true. Since each word is in a
cache, I will represent the whole thing as

 <IP,CR> microcode word

This means that if the current value of the instruction pointer
is IP and of the context register is CR, then this microcode word
(the four moves) is executed. Here is an example of microcode
for adding two integers:

   <109,0> [true] R2 -> ADDER_OP; [true] R3 -> ADDER_TRIG; ... ; ...
   <109,1> [true] ADDER_OUT -> R4 ; [true] INC_IP -> IP ; ... ; ...

As you can see, this is even more primitive than a RISC CPU. But
note that the compiler has complete control and could find
interesting things to do with the four moves I have omitted from
the example.

When I was designing the Tachyon CPU, I came up with a nice
implementation for PICs. But it used associative memory for
the microcode cache and I can't put that in the Merlin 6. So
here is the idea adapted for a direct mapped cache:

   <432,0> [true] R6 -> PIC_TRIG; [true] CONST32 -> PIC_OP; #????
   <IntMap> [true] R0-> ADDER_OP; [true] R2 -> ADDER_TRIG; #7 -> CR; ..
   <FloatMap> [true] R0 -> FADD_OP; [true] R2 -> FADD_TRIG; #3 -> CR; ..
   <XXXXXX> ZZZZZZZZZ
   ....
   <432,3> [true] FADD_OUT -> R0; ... ; ... ; ...
   ....
   <432,7> [true] ADDER_OUT -> R0; ... ; ... ; ...

I am supposing that by the time the CPU tries to execute the bytecode
at memory address 432 (which would be "send '+'" or something similar)
the map of the guy of the top of the stack has been loaded into R6.
We move a 32 bit constant into the operation register of the PIC
unit. This constant comes from the 32 lower bits in this cache
word (shown as #???? above), so the last two moves don't exist
(their condition is automatically forced to "false"). Meanwhile,
R6 has been moved to the trigger register. The PIC unit compares
bits 2 to 11 in R6 to three 10 bit values in the constant.
Depending on which one matches, it fetches in the next clock the
word in the cache located 1, 2 or 3 words after the one with these
moves. Instead of comparing that word's tag with <IP,CR> as it
normally does, it compares it with bits 12 to 27 in R6 and if
they don't match it causes a cache miss. That constant would
be calculated by the compiler like this:

        ???? = (((intMap >> 2) & 0x0FFC) << 20) |
               (((floatMap >> 2) & 0x0FFC) << 10) |
               0; /* no third entry */

This way we can have PICs up to 3 entries (which is most of them)
which start executing the destination code in just one clock. If
the PIC has less than 3 entries (as in this example), the other
cache lines can be reused for other things.

This is a key feature of Merlin 6, so I would love to know of
any flaws that I might have overlooked before I spend all my
money on it. I am specially interested in the opinions of those
who think that hardware of OO languages is a bad idea.

-- Jecel