[self-interest] Re: hardware implementation of PICs [long]

Sun Sep 5 03:23:14 UTC 1999

Hello, Jecel!
>From all you have written, I have only one question: how do you deal with
changes when you embed the language so deep into the hardware? You guys were
discussing eliminating some bytecodes, changing the syntax of the language,
and so on. If you change something, must you reimplement hardware?
Also I'm curious to know how you write the "system software", like the
compiler. In assembly?

Regards,
Douglas

Jecel Assumpcao Jr wrote:

> Since this is a hardware issue, I am sending this to the Merlin list.
> Since this is a Self implementation issue, I am also sending this
> to the Self-interest list and apologize to anyone who receives two
> copies as a result. I will have to give some background to explain
> my idea, so this will be rather long. Those not interested in very
> low level implementation details will probably want to simply skip
> this entirely.
>
> Stefan Matthias Aust was asking about caching message lookups
> and we got into a discussion about Polymorphic Inline Caches (PICs).
> These PICs are simply short sequences of code that test the
> receiver's type (class in Smalltalk, map in Self) against a short
> list of candidates and jumps to the corresponding native code
> directly. If none of the alternatives are the correct one, then
> the traditional lookup system is invoked and it can extend the
> PIC so the current answer is found the next time around. To
> make this expansion easier, we might store the PIC in a separate
> memory region from the main code in the method and have that
> main code include a call to the PIC (by using jump instructions
> in the PIC, we guarantee that the return instruction in the invoked
> native method will bring us directly back to the main code).
> Since the number of entries in a PIC is small, the best implementation
> is something like:
>
>       if (receiver->map == SmallIntegerMap) goto SmallIntegerAdd;
>       if (receiver->map == FloatMap) goto FloatAdd;
>       goto Lookup(receiver->map,"+");
>
> The Self compiler can use this as a source of type information
> instead of (or together with) type analysis. It can tell that
> sending "+" to the object known as "receiver" at this point
> in the source code (known as the call site) has only resulted
> in either executing SmallIntegerAdd or FloatAdd. We can't
> know if there will be additional types for receiver in the
> future (though type analysis might be able to tell us that)
> so we won't be able to eliminate these tests from the compiled
> code (I am supposing we have decided to recompile this for some
> reason), but we can move them around and maybe merge them with
> similar tests nearby.
>
> Here is data presented in Table A-1 in Urs's PhD thesis:
>
>   number of receiver types          percentage of call sites
>             0                                12.4%
>             1                                81.9%
>             2                                 4.4%
>             3                                 0.8%
>             4                                 0.2%
>             5-10                              0.2%
>             11-99                             0.05%
>             >99                               0.02%
>
> Ok, so on to Merlin 6. This machine has 64K words of 80 bits
> each for use as a direct mapped data and "microcode" cache -
> 64 bits of data and 16 bits for the tags. The machine language
> in Merlin 6 is Self bytecodes: nothing else exists in main
> memory. When it needs to execute some bytecodes, the instruction
> pointer is extended with a special "context register" and the
> result is looked up in the cache. If the cache hits, then the
> data found there is microcode to be directly executed and *not*
> the bytecodes pointed to by the instruction pointer! This is
> a subtle, but very important difference relative to all
> architectures I am aware of. If the cache misses, then the
> CPU fetches microcode from a fixed location in a special
> region in the cache. This is the bytecode interpreter/compiler
> and it will fetch the bytecodes (storing them in the *data*
> region of the cache) and will either execute them directly
> or generate new microcode to be stored in the cache.
>
> The context register normally has a value of zero, but this
> can be changed by the microcode. This allows more than
> one microcode to be associated with a single bytecode. This
> is used not only to implement multicycle instructions (so
> it would be a microcode instruction pointer), but also for
> Self 4.0 customized compilation.
>
> About the microcode - the processor is a simple MOVE CPU
> (http://cardit.et.tudelft.nl/MOVE/) and each 64 bit word
> includes 4 move instructions that are executed in a single
> clock. I will represent one microcode word as
>
> [cond1]src1->dest1;[cond2]src2->dest2;[cond3]src3->dest2;[cond4]src4->dest4
>
> which means that data in src1 is transfered to dest1 in this
> cycle if cond1 happens to be true. Since each word is in a
> cache, I will represent the whole thing as
>
>  <IP,CR> microcode word
>
> This means that if the current value of the instruction pointer
> is IP and of the context register is CR, then this microcode word
> (the four moves) is executed. Here is an example of microcode
> for adding two integers:
>
>    <109,0> [true] R2 -> ADDER_OP; [true] R3 -> ADDER_TRIG; ... ; ...
>    <109,1> [true] ADDER_OUT -> R4 ; [true] INC_IP -> IP ; ... ; ...
>
> As you can see, this is even more primitive than a RISC CPU. But
> note that the compiler has complete control and could find
> interesting things to do with the four moves I have omitted from
> the example.
>
> When I was designing the Tachyon CPU, I came up with a nice
> implementation for PICs. But it used associative memory for
> the microcode cache and I can't put that in the Merlin 6. So
> here is the idea adapted for a direct mapped cache:
>
>    <432,0> [true] R6 -> PIC_TRIG; [true] CONST32 -> PIC_OP; #????
>    <IntMap> [true] R0-> ADDER_OP; [true] R2 -> ADDER_TRIG; #7 -> CR; ..
>    <FloatMap> [true] R0 -> FADD_OP; [true] R2 -> FADD_TRIG; #3 -> CR; ..
>    <XXXXXX> ZZZZZZZZZ
>    ....
>    <432,3> [true] FADD_OUT -> R0; ... ; ... ; ...
>    ....
>    <432,7> [true] ADDER_OUT -> R0; ... ; ... ; ...
>
> I am supposing that by the time the CPU tries to execute the bytecode
> at memory address 432 (which would be "send '+'" or something similar)
> the map of the guy of the top of the stack has been loaded into R6.
> We move a 32 bit constant into the operation register of the PIC
> unit. This constant comes from the 32 lower bits in this cache
> word (shown as #???? above), so the last two moves don't exist
> (their condition is automatically forced to "false"). Meanwhile,
> R6 has been moved to the trigger register. The PIC unit compares
> bits 2 to 11 in R6 to three 10 bit values in the constant.
> Depending on which one matches, it fetches in the next clock the
> word in the cache located 1, 2 or 3 words after the one with these
> moves. Instead of comparing that word's tag with <IP,CR> as it
> normally does, it compares it with bits 12 to 27 in R6 and if
> they don't match it causes a cache miss. That constant would
> be calculated by the compiler like this:
>
>         ???? = (((intMap >> 2) & 0x0FFC) << 20) |
>                (((floatMap >> 2) & 0x0FFC) << 10) |
>                0; /* no third entry */
>
> This way we can have PICs up to 3 entries (which is most of them)
> which start executing the destination code in just one clock. If
> the PIC has less than 3 entries (as in this example), the other
> cache lines can be reused for other things.
>
> This is a key feature of Merlin 6, so I would love to know of
> any flaws that I might have overlooked before I spend all my
> money on it. I am specially interested in the opinions of those
> who think that hardware of OO languages is a bad idea.
>
> -- Jecel
>
> ------------------------------------------------------------------------
>
> eGroups.com home: http://www.egroups.com/group/self-interest
> http://www.egroups.com - Simplifying group communications