[self-interest] Re: resends

Sun Aug 29 17:07:29 UTC 1999

Hi!

Jecel wrote three weeks ago:

>Here is the statistics for the bytecodes for the 26339
>methods in the standard snapshot:
>
>      implicitSend   81590
>      send           62233
>      literal        27795
>      pushSelf        4639
>      return          4158
>      index           3233
>      resendOp         598
>      delegatee         17

Which made me think about the encoding again.  We can use these numbers to
do some math.  Yeah!

A Self method object looks like this, I think:

(| bytecodes. literals. |)

where bytecodes contains a byteVector and literals a vector (or nil).  As I
don't know how many objects contain nil, I'll assume every object has at
least an empty vector.  Let's further assume that every normal object has
an object header of 8 bytes (4 bytes = 1 word = reference to map and 4
bytes for hash value, gc flags and type flags).  Vectors have 12 bytes
(an additional 4 bytes for the size) but let's assume we're using a compact
encoding for short vectors.  (Jecel even mentioned a 16-byte object header.)

That means, for 26339 methods, there's a VM overhead of (at least)
26339*3*8 = 632136 bytes.  Reducing a method object to a single 32-bit word
object (wordVector) will save (at most) 66%, 26339*2*8 = 421424 bytes,
needing only 210712 bytes.
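
The header arithmetic can be double-checked with a few lines of Python.
This is only a sketch under the assumptions stated above (8-byte headers,
three objects per method); the names are made up for illustration and are
not anything from the Self VM.

```python
# Assumed: every object has an 8-byte header, and a method consists of
# three objects (the method itself, its byteVector, its literal vector).
METHODS = 26339
HEADER_BYTES = 8


def header_overhead(objects_per_method):
    """Total header bytes for all methods in the snapshot."""
    return METHODS * objects_per_method * HEADER_BYTES


current = header_overhead(3)    # method + bytecodes + literals
combined = header_overhead(1)   # one combined wordVector per method
saved = current - combined

print(current, combined, saved)  # 632136 210712 421424
```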

This is already quite impressive.

Now Self needs 183648 bytes for its byte code instructions (ignoring
resends).  For Jecel's encoding and mine, this is more difficult to
calculate.  Let's go on.

Each method needs about 7 byte codes (183648 / 26339).  However, Jecel uses
only sends and pushes but needs to add a return to every method, not only
to 4158.  So I'll assume 202596 instructions (176257 sends and pushes plus
26339 returns).  This is about 7.7 instructions per method and means we can
still assume that 1 word is enough per method in general.  This sums up to
26339 * 4 = 105356 bytes then.  Calculating the size for the literals is
easy again, as every push and send instruction needs exactly 4 bytes:
176257 * 4 = 705028.

So Jecel's approach needs about 210712 (VM overhead) + 105356
(instructions) + 705028 (literals) = 1021096 bytes or about 997k.
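
For anyone who wants to replay the numbers, here is a small Python sketch
of the estimate for Jecel's encoding.  It assumes exactly what the text
assumes (one instruction word per method, one 4-byte literal per send or
push, one 8-byte header per method); the variable names are mine.

```python
# Bytecode counts from the quoted statistics; resendOp and delegatee are
# left out, as in the text.
counts = {
    "implicitSend": 81590,
    "send": 62233,
    "literal": 27795,
    "pushSelf": 4639,
    "return": 4158,
    "index": 3233,
}
METHODS = 26339
WORD = 4          # bytes per 32-bit word
HEADER_BYTES = 8  # assumed object header size

original_bytes = sum(counts.values())                    # 183648
sends_and_pushes = sum(
    counts[k] for k in ("implicitSend", "send", "literal", "pushSelf")
)                                                        # 176257
instructions = sends_and_pushes + METHODS                # one return each

header_bytes = METHODS * HEADER_BYTES                    # 210712
instruction_bytes = METHODS * WORD                       # 105356
literal_bytes = sends_and_pushes * WORD                  # 705028
total = header_bytes + instruction_bytes + literal_bytes

print(instructions, round(instructions / METHODS, 1))    # 202596 7.7
print(total, total // 1024)                              # 1021096 997
```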

Compared to original Self, this is 76k less for instructions but at least
18k more for literals (4639 * 4 = 18556 bytes for the push self), and
probably much more.  The big saving of ~410k comes from the reduced VM
overhead.  The net encoding saving is less than 60k!

I can't really calculate more for the original Self as I don't know how
often instructions can share a literal and how many methods have literals
at all.  93% of all instructions need literals.  This means about 6.5
literal references per method.  My VisualWorks Smalltalk has 34066 methods.
Only 9787 of them don't share literals; 20% share one literal and 51% share
two or more.  However, the typical VisualWorks method has 19 byte codes.
So let's assume that about 50% of all Self methods share at least one
symbol.  That is, of 171618 literal references in 26339 methods, we can
remove at least 13170 references, or 13170 * 4 = 52680 bytes.  Compared to
Jecel, the net saving is reduced to 8k.  Am I right?
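
The sharing estimate is the shakiest part, so here it is spelled out as a
Python sketch.  The 50% figure is a guess, not a measurement, and the names
are only illustrative.

```python
# VisualWorks figures from the text.
VW_METHODS = 34066
VW_NOT_SHARING = 9787
vw_sharing_pct = round(100 * (VW_METHODS - VW_NOT_SHARING) / VW_METHODS)

# Self figures; the sharing rate is an assumption, not a measurement.
SELF_METHODS = 26339
LITERAL_REFS = 81590 + 62233 + 27795     # sends plus literal pushes
WORD = 4

# Assume half of all methods can drop at least one reference (rounded up).
removed_refs = (SELF_METHODS + 1) // 2
removed_bytes = removed_refs * WORD

print(vw_sharing_pct)                         # 71
print(round(LITERAL_REFS / SELF_METHODS, 1))  # 6.5
print(removed_refs, removed_bytes)            # 13170 52680
```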

Now to my suggestion.  I need 202596 instructions, like Jecel.  However,
these are all single bytes which need to be packed into words.  7.7
instructions will fit into two words: 26339 * 8 = 210712 bytes.  Let's look
into VisualWorks again:
The 34066 methods contain 304607 message sends or other symbol literals.
These are the top 10 message sends:

#+       6561 
#==      5977 
#new     4478 
#=       4172 
#at:put: 4051 
#-       3292 
#at:     3272 
#size    2702 
#@       2635 
#value:  2328

The top 63 sends account for 92945 or 30% of all sends.  This means we can
remove at least 30%, or 43147, of the 143823 literals for sends.  The other
27795 literals are probably numbers, strings or something else.  Because of
encoding -1, 0, 1 or 2 as bytecodes, I'll remove an additional 10% or 2780
references.  In sum: (100676 + 25015 = 125691) * 4 = 502764 bytes.

So my approach needs about 210712 (VM overhead) + 210712 (instructions) +
502764 (literals) = 924188 bytes or about 903k.
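
The same estimate as a Python sketch, so the individual steps can be
checked.  The 30% and 10% rates are the guesses made above (taken over from
the VisualWorks numbers), and percent_of is just a helper invented here.

```python
METHODS = 26339
WORD = 4
HEADER_BYTES = 8                     # assumed object header size

SEND_LITERALS = 81590 + 62233        # implicitSend + send = 143823
OTHER_LITERALS = 27795               # numbers, strings, ...


def percent_of(n, pct):
    """Integer percentage of n, rounded half up."""
    return (n * pct + 50) // 100


# Guess: the 63 most common selectors get their own bytecodes (30%),
# and -1, 0, 1, 2 cover 10% of the remaining literal pushes.
sends_left = SEND_LITERALS - percent_of(SEND_LITERALS, 30)     # 100676
others_left = OTHER_LITERALS - percent_of(OTHER_LITERALS, 10)  # 25015

header_bytes = METHODS * HEADER_BYTES            # 210712
instruction_bytes = METHODS * 2 * WORD           # two words per method
literal_bytes = (sends_left + others_left) * WORD
total = header_bytes + instruction_bytes + literal_bytes

print(sends_left, others_left, literal_bytes)    # 100676 25015 502764
print(total, round(total / 1024))                # 924188 903
```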

Better but still disappointing.

I could save a few bytes by using this tricky encoding:

Byte code instructions are stored from the end of the combined vector and
literals from the beginning.  So you don't have to interleave both things
and you don't lose memory for padding to 4 bytes.  Why didn't I think
about this earlier?  (This would also work for Jecel's idea, to be fair.)
This would save the tremendous amount of some 8116 bytes.  Cool.
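
To make the layout concrete, here is a rough Python sketch of such a
combined vector.  Everything in it is an assumption for illustration:
little-endian 32-bit literals, bytecodes written backwards from the last
byte, and the function names.  It is not how any Self VM stores methods.

```python
import struct


def pack_method(literals, bytecodes):
    """Literals grow from the front, byte codes from the back."""
    size = 4 * len(literals) + len(bytecodes)
    buf = bytearray(size)
    for i, lit in enumerate(literals):
        struct.pack_into("<I", buf, 4 * i, lit)  # word-aligned at 4*i
    for i, op in enumerate(bytecodes):
        buf[size - 1 - i] = op                   # first op is the last byte
    return bytes(buf)


def literal_at(buf, i):
    return struct.unpack_from("<I", buf, 4 * i)[0]


def bytecode_at(buf, i):
    return buf[len(buf) - 1 - i]


method = pack_method([0x11223344, 0x55667788], [1, 2, 3])
print(len(method))                # 11
print(hex(literal_at(method, 1)))  # 0x55667788
print(bytecode_at(method, 0))      # 1

# The 8116 bytes: two padded words per method versus the raw byte count.
print(26339 * 8 - 202596)          # 8116
```

The literals stay word-aligned because they start at offset 0, whatever
the number of byte codes is; only the object as a whole might still be
rounded up by the allocator.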

To wrap up, if you really want to save space, don't bother with the byte
codes of all these little methods.  Either write big methods (uuh!) or
reduce the VM object overhead.

Actually you don't need a VM overhead at all if you add the code directly
to the map object.  This would probably add some work to the garbage
collector but otherwise, it would free up the complete VM overhead of
210712 bytes.

bye
--
Stefan Matthias Aust  //  Bevor wir fallen, fallen wir lieber auf.