I've begun some very preliminary benchmarking of my state-based actors library against the current standard actors library and seen some rather interesting results. The test I'm running spawns four pairs of actors and sends about 40 million messages between each pair. The only logic in the actors is to count how many messages they have received, ensure they are received in order, and occasionally send back a "More" message. This is so the mailbox doesn't overfill and cause out-of-memory errors.
The benchmark has three different incarnations, each testing a different way of processing messages until some termination condition occurs. The versions are:
- loop
- recursive react
- reactWhile (only available with the State-based actors, but relatively easy to port to "classic" actors)
The code for the loop version is below:
class LoopPingSender(val totalPings: Int, val pingReceiver: LoopPingReceiver) extends Actor { private var pingsSent = 0 def act() { Actor.loop { pingReceiver ! Ping(pingsSent) pingsSent += 1 if (pingsSent == totalPings) { println("Sender Done") exit("sender completed successfully") } else if ((pingsSent % 1000) == 0) react { case More => } } } } class LoopPingReceiver(val totalPings: Int) extends Actor { private var pingsReceived = 0 def act() { Actor.loop { react { case Ping(n) if n == pingsReceived => { pingsReceived += 1 if (pingsReceived == totalPings) { println("Receiver Done") exit("receiver completed successfully") } else if ((pingsReceived % 1000) == 0) { reply(More) } } case Ping(n) => { println("error") exit("invalid n value: " + n) } } } } }
The non-scientifically gathered results are as follows, all times in seconds:
Version | Standard Actors | State-based Actors |
---|---|---|
loop | 298.22 | 96.84 |
Recursive react | 90.25 | 113.57 |
reactWhile | N/A | 40.63 |
Standard Actor loop
Let's consider the loop version first. loop in the standard actor implementation is rather convoluted. Rather than directly supporting looping, the standard actor implementation provides a package private hook for a function that is executed when the actor is terminating. loop uses this hook to subvert the termination process by throwing an exception and subsequently restarting execution of the actor. This makes it so that two exceptions are thrown per message processed rather than the normal one. Here's a small snippet from the profiling results:
Compiled + native Method 40.0% 5148 + 0 ExceptionBlob 13.9% 658 + 1133 scala.actors.Scheduler$$anon$2.run 13.1% 391 + 1299 scala.actors.Actor$.loop 3.4% 441 + 0 scala.actors.FJTaskRunner.run --- snip --- 73.7% 6705 + 2771 Total compiled
Stub + native Method 20.6% 0 + 2654 java.lang.Thread.isInterrupted
What you can see is that the ExceptionBlob is dominating the runtime, followed by Thread.isInterrupted. ExceptionBlob is a class in the JVM that observes the unwinding of the stack. As far as I could discern through so separate microbenchmarking (which I'll discuss another day, after I've spent some time in HotSpot code), the amount of time consumed by ExceptionBlob is proportionate depth of the stack being unwound by exceptions. In other words, throwing fewer exceptions doesn't do any good if you use them to unwind deeper stacks, and unwinding shallower stacks doesn't do any good if you throw more exceptions. Caching exceptions doesn't help, either, but I digress. The bottom line is that loop is pretty inefficient in the standard actors implementation.
State-based Actor loop
In my state-based actors, there are two states to handle looping, both of which are subclasses of Running
. The first is Looping
, which is initially entered when the actor starts looping. The second is ResumeLooping
, which is used to resume looping after the actor has been suspended. Both run the body of the react in a tail-recursive loop until their are no more messages to process, at which time they suspend the actor. This means that many message can potentially be processed before any exception-based signaling is used. The following is a snippet from a profile:
Compiled + native Method 67.5% 3233 + 0 ExceptionBlob 6.4% 306 + 0 scala.actors.Actor$class.withLock 4.9% 44 + 189 actorbench.LoopPingSender$$anonfun$act$1.apply 4.8% 97 + 135 scala.actors.Actor$$anonfun$send$1.apply 4.5% 116 + 99 scala.actors.Actor$Running.tr$1 3.8% 36 + 145 scala.actors.Actor$ResumeLooping.enter --- snip --- Stub + native Method 0.2% 0 + 9 java.lang.Thread.isInterrupted
Again, ExceptionBlob dominates the execution time, although the total time it consumes is lower. Unfortunately, despite using significantly fewer exception signals, my state-based actors tend to build up significantly deeper stacks than the actors in the standard library.
Standard Actor recursive react
Recursive react seems to be the best performing looping pattern for the standard actor library. It's a pretty simple pattern:
class PingSender(val totalPings: Int, val pingReceiver: PingReceiver) extends Actor { def act() { def recurse(i: Int): Unit = { if (i > 0) { pingReceiver ! Ping(i) if ((i % 1000) == 0) { react { case More => recurse(i - 1) } } else { recurse(i - 1) } } else { println("Sender Done") } } recurse(totalPings) } }
As you can see, the function containing the react
block is simply recursively called as needed. Note that despite the fact that the recursive call appears to be in a tail position, it is not a tail recursive call and will not be optimized by the compiler. The PartialFunction
defined within the react
block is actually a separate object with an apply method, so while recurse
and the PartialFunction
are mutually recursive, they are not tail recursive.
Compiled + native Method 19.1% 456 + 0 ExceptionBlob 13.0% 89 + 221 scala.actors.Reaction.run 8.6% 53 + 152 scala.actors.Actor$class.send 7.9% 61 + 127 scala.actors.Actor$class.react 5.3% 59 + 67 scala.Iterator$class.toList --- snip --- 68.1% 862 + 765 Total compiled Stub + native Method 26.4% 0 + 631 java.lang.Thread.isInterrupted --- snip --- 26.9% 0 + 643 Total stub
This time Thread.isInterrupted
dominates, with ExceptionBlob
not too far behind.
Actor States recursive react
It turns out that the "hybrid trampoline" I implemented to facilitate recursive reacts without giving up the thread provides pretty bad performance.
64.9% 3714 + 0 ExceptionBlob 11.4% 568 + 85 scala.actors.Actor$class.withoutLock 8.4% 478 + 1 scala.actors.Actor$class.withLock 5.1% 90 + 201 scala.actors.Actor$$anonfun$reactWithin$1.apply 4.2% 115 + 125 scala.actors.Actor$$anonfun$send$1.apply --- snip --- 97.8% 5006 + 595 Total compiled Stub + native Method 0.1% 0 + 6 java.lang.Thread.currentThread --- snip --- 0.2% 0 + 12 Total stub
The problem is that when I wrote it, I thought that the cost of throwing exceptions was primarily proportionate to the number of exceptions thrown rather than the amount of stack that is unwound. Consequently, I made the trampoline only bounce once every n invocations in order to reduce the number of exceptions being thrown. As it turns out, what I should do is find a place where the stack is likely to be the shallowest, and throw an exception every time.
Actor States reactWhile
reactWhile
addresses the pitfalls presented in the previous versions and provides roughly twice the performance of next best performer, recursive react
on the standard actor library.
Compiled + native Method 30.0% 171 + 331 scala.actors.Actor$Running.tr$1 24.4% 263 + 146 actorbench.ReactWhileReceiver.withLock 12.4% 26 + 181 actorbench.ReactWhileSender.sendPings 8.5% 142 + 0 scala.actors.Actor$class.withoutLock --- snip --- 79.0% 660 + 661 Total compiled Stub + native Method 15.1% 0 + 252 java.lang.Thread.isInterrupted 0.9% 0 + 15 java.lang.Thread.currentThread 16.0% 0 + 267 Total stub
ExceptionBlob
has completely disappeared from the profile. That is because reactWhile
makes minimal use of exceptions for signaling. Consequently, its performance is dominated by some of the complicated for my (worthless) hybrid trampoline, my completely unoptimized lock implementation, and calls to Thread.isInterrupted
that are made when the bound thread is returned to the pool when the actor suspends. I'll write a blog on how reactWhile
works at a later date, after I cleanup some of the other problems that I've listed.
Parting Matter
It really annoys me when people post benchmark results without providing details of their setup along with all of the code required to duplicate them, and that's just what I've done. I've also benchmarked a very narrow set of behavior, failed for warm up the JVM, and not run multiple trials. Developing a lightweight framework for benchmarking actor performance, along with a more comprehensive set of benchmarks is on my to-do list. Once I have that, I'll start doing more rigorous benchmarking. But I don't think I quite need it yet because bottlenecks are showing up quickly and consistently in my profiling, and in most cases they seem to have reasonably straight forward solutions. All the above profiles against my library were done against revision 49658b3475e2 in my bitbucket repository, so you can at least access that code.
Sphere: Related Content
6 comments:
Erik,
This is awesome work. I hope it makes it into the standard library.
thanks for doing this. Very good job. It has to make it into the STDlib.
/Jonas
I wonder if this can help with the ExceptionBlob.
Longjumps Considered Inexpensive
Using a pre-alocated exception (or a clone of) makes the exception throwing just like a simple "goto".
Gabriel,
I tried preallocated exceptions and it didn't help. The kicker is that the throw and the catch need to be optimized into the same compilation unit via inlining, which doesn't happen with larger pieces of code.
Yep, like the small letter in contracts, "if the thrower and catcher are together in the same compilation unit" comes to bite you back :)
BTW, I'm looking forward to use your refactored actors library!
hello friends I really liked this information, a few days ago I read something similar, I would like to receive updates on this issue, as it is very interesting, thanks!
Post a Comment