Hi all,
Forgive the longish email.
I have been looking into Self's CPU usage as a side effect of looking into its file handling. Separately, Ben Noordhuis has opened an issue on GitHub which is also about CPU usage, but a different problem.
Both of these issues cause the standard Self image running morphic to use too much CPU. The first issue is masked by a 'fix' I made, which on reflection turns out to be entirely misconceived. The second issue is the interaction between the VM-level REPL and the scheduler.
I'm talking about the OS X build here. I haven't checked this out yet with the Linux build.
FIRSTLY
Deep in the heart of the file handling code lurks the method "unixGlobals os while_EINTR_do: block IfFail: errBlk"
In the 4.5 build this looks like this:
    unixGlobals os while_EINTR_do: block IfFail: errBlk = (
        [
            [ | :exit_inner_and_retry |
                process this sleep: 1.
                ^ block value: [ | :error |
                    error = 'EINTR' ifTrue: exit_inner_and_retry.
                    errBlk value: error.
                ].
            ] exit.
            scheduler isRunning ifTrue: [ process this yield ].
        ] loop
    )
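(For reference, this method is the Self-level rendering of the classic C retry-on-EINTR idiom; the snippet below is just that pattern, not VM code:)

    #include <errno.h>
    #include <unistd.h>

    /* Retry the system call whenever a signal interrupted it;
       pass any other error up to the caller. */
    ssize_t read_retrying(int fd, void *buf, size_t len) {
        ssize_t n;
        do {
            n = read(fd, buf, len);
        } while (n < 0 && errno == EINTR);
        return n;
    }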
If you remove the "process this sleep: 1", you will notice Self peg the CPU at 100% (or as high as the OS will allow). This was why I initially put that sleep in. This was wrong: Mea culpa. It fixed the symptoms, but not the problem.
The problem is *not* that too many EINTR signals are being received. The problem is that when reading or writing files, the reading/writing loop calls "suspendIfAsync", which is (for OS X):
    traits unixFile osVarients bsd suspendIfAsync = (
        setOwnerIfFail: [ process this yield. ^ self ].
        suspendForIO
    )
Following the call chain down:
    traits unixFile osVarients bsd setOwnerIfFail: fb = (
        fcntl: fcntls f_setown With: os getpid IfFail: fb
    )
This is attempting to set the VM as the process that will receive SIGIO and SIGURG signals for events on the file in question.
On the Mac this always fails with ENOTTY for the standard streams (stdin, stdout, stderr), so the suspend never happens. Instead the stream being read or written busy-loops, pegging the CPU. This includes the prompt reading from stdin.
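This is easy to reproduce outside Self; a few lines of C exercising the same fcntl call (a standalone sketch, not VM code) show the failure:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        /* Ask the kernel to deliver SIGIO/SIGURG for stdin to this
           process - the same thing setOwnerIfFail: attempts. */
        if (fcntl(STDIN_FILENO, F_SETOWN, getpid()) == -1)
            /* On OS X this prints "Inappropriate ioctl for device" (ENOTTY). */
            printf("F_SETOWN failed: %s\n", strerror(errno));
        else
            printf("F_SETOWN succeeded\n");
        return 0;
    }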
The Solaris version is:

    suspendIfAsync = ( scheduler isRunning ifTrue: [ suspendForIO ] )
Changing the OS X version to be the same as the Solaris version *seems* to work. Note that os_file's setAsync already does the ownership handling, so we shouldn't be redoing it each time anyway (Is this a sign that there were previously problems?)
But I don't yet understand why the setOwnerIfFail: is failing...
SECONDLY
Ben Noordhuis has opened Issue #34 on GitHub: https://github.com/russellallen/self/issues/34
He says:
> To reproduce: look at top(1) when ./vm/Self is running in another terminal. It always hovers at 5-10% CPU usage on my machine.

> Here is what I think happens: The SIGALRM signal handler that's installed by what I suspect is the scheduler (profiler?) seems to run at a 100 Hz frequency. When you start a shell, all those signals cause the fread() that is called by InteractiveScanner::read_next_char() to keep returning from and re-entering the read() system call.
My comment:
Changing "scheduler setRealTimer: ms" to either not set SIGALRM, or set SIGALRM no more frequently than at 1 second intervals, does drop the CPU usage down to about 1% CPU on my machine. The timer is not being set at 10ms intervals all the time - it depends on usage - but this seems a commonly requested interval.
The reading code is in scanner.cpp and is:
    fint InteractiveScanner::read_next_char() {
        char c;
        while (true) {
            if (fread(&c, sizeof(char), 1, stdin)) return c;
            if (feof(stdin)) return EOF;
        }
    }
CONCLUSION
Help! No seriously, any comments, suggestions? Is your setup giving you completely different answers/results? Is Linux the same?
Cheers, Russell
On Tue, Apr 22, 2014 at 9:07 PM, Russell Allen mail@russell-allen.com wrote:
> To reproduce: look at top(1) when ./vm/Self is running in another terminal. It always hovers at 5-10% CPU usage on my machine.
I did some investigation on a Linux Self system. If I run 'vm/Self' with no snapshot, so it's just in the interactive scanner, I see it floating between 0 and 1% CPU on Linux (3-5% running under strace). I see it jumping in and out of the 'read' system call. The CPU usage matches the following C program, which was written to imitate what the Self VM is doing at this point:
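(The program itself didn't survive in the archive; the sketch below is a reconstruction under the assumptions in Ben's description - a 100 Hz SIGALRM installed without SA_RESTART, driving the same fread loop as the scanner:)

    #include <signal.h>
    #include <stdio.h>
    #include <sys/time.h>

    static void on_alarm(int sig) { (void)sig; }  /* exists only to interrupt read() */

    int main(void) {
        struct sigaction sa;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;               /* no SA_RESTART: fread's read() sees EINTR */
        sa.sa_handler = on_alarm;
        sigaction(SIGALRM, &sa, 0);

        struct itimerval tv = { { 0, 10000 }, { 0, 10000 } };  /* 100 Hz */
        setitimer(ITIMER_REAL, &tv, 0);

        /* Same shape as InteractiveScanner::read_next_char() */
        for (;;) {
            char c;
            if (fread(&c, sizeof(char), 1, stdin)) putchar(c);
            else if (feof(stdin)) return 0;
        }
    }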
While low, it would be good to remove this if possible. Can the interactive scanner read loop be exited when it's no longer used? I assume SIGALRM is used for the Self scheduler. Perhaps that could be optimized so that it's turned off when all processes are blocked on i/o.
If I create a snapshot using 'worldBuilder.self' and run that without entering the desktop (i.e. staying in the REPL) I still see only 0-1% CPU. Now it's entering/exiting the 'select' system call though.
If I start the desktop (via 'desktop open'), Self CPU usage jumps to 3-5%. That desktop overhead seems high. Here strace shows that 'gettimeofday' is called a lot, followed by 'select'. This is what I get with about 60 seconds of idle time in the desktop:
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     84.32    0.024001           7      3312      3263 select
      6.95    0.001979           0      6342           rt_sigprocmask
      5.46    0.001555           0      3590      3269 rt_sigreturn
      2.70    0.000768           0     15455           gettimeofday
      0.37    0.000104           0      3504      3469 recv
      0.13    0.000036           0       196           getrusage
      0.07    0.000020           0        64           mmap2
      0.00    0.000000           0       179         6 read
      0.00    0.000000           0        45         2 write
      0.00    0.000000           0        30         1 open
      0.00    0.000000           0        28           close
The 'gettimeofday' calls appear to come from real_time_prim in miscPrims.cpp. This is the _TimeReal primitive and only used in 'currentTime' in 'traits time'. It's hard to test with that stubbed out since it's used a lot all over the place (via 'current' in 'traits time').
On Wed, Apr 23, 2014 at 2:18 AM, Chris Double chris.double@double.co.nz wrote:
> The 'gettimeofday' calls appear to come from real_time_prim in miscPrims.cpp. This is the _TimeReal primitive and only used in 'currentTime' in 'traits time'. It's hard to test with that stubbed out since it's used a lot all over the place (via 'current' in 'traits time').
On 64-bit Linux, calls to gettimeofday aren't real syscalls (they go through the vDSO), so they're faster. Unfortunately, because we compile Self as a 32-bit app, this optimisation isn't used. As a quick test I did a hacky patch that drops gettimeofday usage to 10 times/second by calling it in a thread.
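The patch isn't reproduced here, but a minimal sketch of the approach (hypothetical names; deliberately racy, it's a hack) would be something like:

    #include <pthread.h>
    #include <sys/time.h>
    #include <unistd.h>

    /* One background thread samples the clock ten times a second;
       the rest of the VM reads the cached value. */
    static struct timeval cached_now;

    static void *clock_thread(void *arg) {
        (void)arg;
        for (;;) {
            gettimeofday(&cached_now, 0);  /* the only real gettimeofday call */
            usleep(100 * 1000);            /* 100 ms between samples */
        }
        return 0;
    }

    static void start_clock_thread(void) {
        pthread_t t;
        pthread_create(&t, 0, clock_thread, 0);
    }

    /* real_time_prim in miscPrims.cpp would then return cached_now
       instead of calling gettimeofday() itself. */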
This dropped the CPU usage to 2% on Linux with the desktop open, so it does seem to have some effect. It also slows down the Self system, though. Does the scheduler use gettimeofday in some way?
On 23 Apr 2014, at 12:18 am, Chris Double chris.double@double.co.nz wrote:
> While low, it would be good to remove this if possible. Can the interactive scanner read loop be exited when it's no longer used? I assume SIGALRM is used for the Self scheduler. Perhaps that could be optimized so that it's turned off when all processes are blocked on i/o.
I think SIGALRM is also used to step morphic's polling of io (through worldMorph runLoop -> worldMorph delayIfNoInputFor: ms).
The runLoop method comment says:
Don't hog the CPU by running too fast when there is no user input, even though there may be ongoing animation or monitoring activities. If we get done before consuming the desired frame time, sleep for the remainder of the time, checking for user inputs at periodic intervals. (A refinement would be to hang on the XEvent queue when there is no animation or other activity happening, but the current scheme uses extremely little CPU time when nothing is happening anyway.)
This is probably also the main caller of gettimeofday, which it uses to determine the time for each morphic cycle.
We can, in the longer term, do what the comment suggests and not poll for io, but this won't entirely fix our issue, because as soon as we have an animation we will be back to SIGALRMs every few milliseconds to drive it.
I'll look to see whether I can find a way to turn off the interactive scanner read loop.
Russell
Hi Russell,
Just guessing, here, as I don't have anything to look at other than this post - but the symptoms you describe fit the pattern.
My guess is that someone changed a contract without noticing it. As you suggest here:
> Note that os_file's setAsync already does the ownership handling, so we shouldn't be redoing it each time anyway (Is this a sign that there were previously problems?)
The original contract was 'someone has to set the owner'.
> This is attempting to set the VM as the process that will receive SIGIO and SIGURG signals for events on the file in question.
It was probably changed (while trying to fix another interrupt related bug) in the following way.
1. Someone has to set the owner.
2. It almost always works; just this one case doesn't.
3. Aha - setAsync is assuming the owner will be set somewhere else.
4. If it is almost always set somewhere else, we'd get what we have here.
5. Let's set it right away, in setAsync.
Probably didn't solve the problem, but, having already reasoned that it *could* have solved the problem, and noticing that it *appears* not to make things worse, that 'fix' stayed in.
I'll skip the rest of what I think might be going on, because it doesn't matter, yet.
I suggest taking that bit out of setAsync (i.e. do not set the owner) and seeing what happens.
If my guess is right, the CPU will no longer be pegged. If my guess is wrong, revert the code back to what it was - no harm done.
-Jim
On 23 Apr 2014, at 10:47 am, jas jas@cruzio.com wrote:
> Just guessing, here, as I don't have anything to look at other than this post - but the symptoms you describe fit the pattern.

[snip]

> Probably didn't solve the problem, but, having already reasoned that it *could* have solved the problem, and noticing that it *appears* not to make things worse, that 'fix' stayed in.
This process is instantly recognisable to me as something I'm trying to train myself out of, though I can't take the blame this time :)
Taking that bit out of suspendIfAsync does, as you suggest, fix the immediate CPU pegging issue.
But I'm not happy yet, because I still can't do proper non-pegging IO on OS X - I can't call setOwnerIfFail: on pipes and then do the obvious:
    [ pipe suspendIfAsync. pipe dataReady ifTrue: [ doSomethingWith: pipe read ] ] loop
(or rather, I can but suspendIfAsync fails to suspend because setOwnerIfFail: fails)
Changing to pipe suspendForIO in these circumstances instead (i.e. dropping setOwnerIfFail:) has the interesting side effect that the thread suspends *until some input on the command line causes SIGIO to be sent* - so the pipe only reads while you hold down the enter key on stdin :)
Reading up on how setAsync/SIGIO work reveals that, of course, BSD, Linux and Solaris all do it slightly differently. Sigh.
I've just checked, and setOwnerIfFail:/fcntl is working on Linux, which is why the CPU pegging isn't noticed there.
Russell
On 23 Apr 2014, at 1:02 pm, Russell Allen mail@russell-allen.com wrote:
> But I'm not happy yet, because I still can't do proper non-pegging IO on OS X - I can't call setOwnerIfFail: on pipes
It seems as if not being able to successfully call fcntl(pipe, F_SETOWN, pid) on OS X is not Self's fault but an OS X thing.
The easiest illustration of this is probably in Python:
    import os, sys, fcntl
    fcntl.fcntl(sys.stdin, fcntl.F_SETOWN, os.getpid())
gives:
IOError: [Errno 25] Inappropriate ioctl for device
The same as for Self. Doing the same on a named pipe gives error 102, 'Operation not supported on socket', which is interesting given that a named pipe isn't a socket.
I can't find this documented anywhere - nothing in the fcntl man page or in Apple's bug report database. I have asked on Stack Overflow, but no one has answered, though someone did give me a 'student' badge, so that's nice.
I am at a bit of a loss here. We need something better than reading in a loop, or doing a select with an immediate timeout in a loop.
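For what it's worth, the BSD-native way to block until a descriptor is readable, without needing SIGIO at all, is kqueue. A minimal sketch (untested against the Self VM's scheduler):

    #include <stdio.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Block until fd is readable; 0 on success, -1 on error. */
    static int wait_readable(int fd) {
        int kq = kqueue();
        if (kq == -1) return -1;
        struct kevent ev;
        EV_SET(&ev, fd, EVFILT_READ, EV_ADD | EV_ONESHOT, 0, 0, NULL);
        int n = kevent(kq, &ev, 1, &ev, 1, NULL);  /* NULL timeout: block */
        close(kq);
        return n == 1 ? 0 : -1;
    }

    int main(void) {
        char buf[256];
        while (wait_readable(STDIN_FILENO) == 0) {
            ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
            if (n <= 0) break;
            fwrite(buf, 1, (size_t)n, stdout);  /* echo to show we woke up */
        }
        return 0;
    }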
Russell
On Tue, Apr 22, 2014 at 9:07 PM, Russell Allen mail@russell-allen.com wrote:
> Changing the OS X version to be the same as the Solaris version *seems* to work. Note that os_file's setAsync already does the ownership handling, so we shouldn't be redoing it each time anyway (Is this a sign that there were previously problems?)

> But I don't yet understand why the setOwnerIfFail: is failing...
From what I can see, Self expects i/o on files to result in a SIGIO.
It then does a 'select' and then schedules any process waiting on any file that the 'select' indicates has i/o available. To get that SIGIO it uses an 'fcntl' command in setOwnerIfFail. If that call fails it just yields the process since it can't guarantee receiving a SIGIO later to wake it up.
If that call is failing with ENOTTY but terminal i/o is still working, I think we could check for ENOTTY in the fail block and ignore it. It would pay to investigate how terminal i/o is handled, though. Perhaps it's being picked up via a call to select that's triggered by something else.
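A sketch of that suggestion at the C level (a hypothetical helper; the actual change would go in the Self-side fail block of setOwnerIfFail:):

    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Try to claim SIGIO ownership of fd; treat ENOTTY as benign on
       the assumption that the select() loop will still notice when
       the fd has data. */
    static int set_owner_best_effort(int fd) {
        if (fcntl(fd, F_SETOWN, getpid()) == 0) return 0;
        return errno == ENOTTY ? 0 : -1;
    }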