[self-interest] embedding based prototype languages

Wed Nov 6 23:48:56 UTC 2002

| claimed a 1 clock dispatch for an embedded Self (recently renamed 
"Oliver Smalltalk" in an effort to make other Smalltalkers see that we 
are really a part of their team).

On Wednesday 06 November 2002 18:26, James McCartney wrote:
> OK I've been curious about this ever since you posted it. If it is
> not a secret, I wish you would tell more about it. I assume you are
> flattening parents into a single object.

There are several tricks that work together. A simple masking function 
is used to convert an object's pointer into a pointer for its map, so 
we don't have to look that up in the object's first word or something. 
The drawback is that sometimes an object has to change maps after a 
programming change (added or removed a slot) and that means changing 
the object's pointer, which means scanning the whole memory to update 
any references to the old pointer.

This is a 16 bit machine so I can't afford to have many objects. So the 
sources for all of an object's slots are gathered into one string which 
is stored in the object's map (so it doesn't take up an object pointer 
of its own). The object's "outliner" gives you a more Self-like view, 
so this implementation detail is hidden. When you make any programming 
changes the new string is used to build a new map, which is then 
compiled and finally any of the object's children are recompiled.

Besides the source string, a map contains a table and machine language 
code. This table includes a two word entry for every single selector 
defined in the system. There can be up to 4096 selectors, so this per 
map table can grow to be as much as 16KB!! But like I said, I have 
memory to spare. The tables for all maps are always exactly the same 
size and they grow as you define new selectors.

Most instructions (except for "send") are only 5 bits long, so up to 
three of them can fit in a 16 bit word. With two words we can have up 
to 6 instructions or 3 instructions plus a 16 bit literal. When we 
compile a map, for each possible selector we fill in its two word entry 
with the right code. If it is a constant slot, we do something like

              pushLiteral  42 ; return

while if it is a data slot we might do

             read 3 ; return

or an assignment slot

            write 3 ; return

and if it is a method slot we can simply stuff the code into those two 
words if it will fit, or else put the code at the end of the map and 
put a jump to that code in the two words.

Note that most entries would have nothing defined for them in a given 
object, so we simple compile code to call the debugger with a "message 
not found" error. These entries are considered "empty" in what is 
described next.

Before actually compiling a map, we copy the table and extra code from 
its first parent (which already includes everything from its parents), 
then everything from the second parent and so on. Then we compile the 
map overwritting anything copied from the parents.

So at runtime, given the pointer to an object and a selector of a 
message to send to it, we can

  1) find the object's map with no memory accesses
  2) fetch from 1 the first word from the entry corresponding to the 
selector
  3) execute that word

> Do you also flatten the
> instance's own slots along with the parents?

If you mean that you can't tell what came from the object and what came 
from the parent, you are right.

What I described above won't work in some cases: selectors that access 
data slots in the parent and resends. In the first case, code is 
compiled to explicitly delegate to the parent. It is like adding this 
local slot:

              buttonColor: c = (globals ui preferences buttonColor: c)

I actually got rid of the lobby and globals so this sort of thing 
wouldn't  be so common (it adds a lot of overhead to every single 
map!).

And about the resends, the compiler simply inlines the parent's source 
for the method into the spot where the resend was.

> It would seem like you'd
> need to hash in that case, but obviously you're not since you say 1
> clock. But I'm not even sure what that means since even an array
> lookup is not 1 clock.

If we are executing a method in objA and the current top of stack points 
to objB, we can have these accesses via the cache (which needs a 16 
object pointer and a 16 bit field offset to find the 16 bit word we 
want):

clock N:    cache ( map(objA), PCa ) ===> "send selector23"
clock N+1:   cache ( map(objB), 23<<1 ) ===> "read 3 ; return"
clock N+2:   cache ( objB, 3 ) ===> data slot 3's contents
clock N+3:    ----
clock N+4:   cache ( map(objA), PCa+1) ===> "selfsend selector92"

It is annoying that the return takes 2 clocks (N+3 and N+4). With a 
little more hardware that can be reduced to 1 clock or even, when the 
instruction in N+2 doesn't do a memory access (which "read 3" does, but 
most don't), 0 clocks.

Anyway, what I mean by "1 clock dispatch" is that in clock N we fetched 
a send instruction and in clock N+1 we were already fetching the first 
instruction from the right method. This is possible because the 
hardware is a simple stack machine which isn't pipelined.

-- Jecel