Friday, February 06, 2009

Scala Actors versus the ExceptionBlob

I've begun some very preliminary benchmarking of my state-based actors library against the current standard actors library and seen some rather interesting results.  The test I'm running spawns four pairs of actors and sends about 40 million messages between each pair.  The only logic in the actors is to count how many messages they have received, ensure they are received in order, and occasionally send back a "More" message.  This is so the mailbox doesn't overfill and cause out-of-memory errors.

The benchmark has three different incarnations, each testing a different way of processing messages until some termination condition occurs.  The versions are:

  • loop
  • recursive react
  • reactWhile (only available with the State-based actors, but relatively easy to port to "classic" actors)

The code for the loop version is below:

class LoopPingSender(val totalPings: Int, val pingReceiver: LoopPingReceiver) extends Actor {
  private var pingsSent = 0
  def act() {
    Actor.loop {
      pingReceiver ! Ping(pingsSent)
      pingsSent += 1
      if (pingsSent == totalPings) {
        println("Sender Done")
        exit("sender completed successfully")
      } else if ((pingsSent % 1000) == 0) react {
        case More => 
      }
    }
  }
}
class LoopPingReceiver(val totalPings: Int) extends Actor {
  private var pingsReceived = 0
  def act() {
    Actor.loop {
      react {
        case Ping(n) if n == pingsReceived => {
          pingsReceived += 1
          if (pingsReceived == totalPings) {
            println("Receiver Done")
            exit("receiver completed successfully")
          } else if ((pingsReceived % 1000) == 0) {
            reply(More)
          }
        }
        case Ping(n) => {
          println("error")
          exit("invalid n value: " + n)
        }
      }
    }
  }
}

The non-scientifically gathered results are as follows, all times in seconds:

Version Standard Actors State-based Actors
loop 298.22 96.84
Recursive react 90.25 113.57
reactWhile N/A 40.63

Standard Actor loop

Let's consider the loop version first.  loop in the standard actor implementation is rather convoluted.  Rather than directly supporting looping, the standard actor implementation provides a package private hook for a function that is executed when the actor is terminating.  loop uses this hook to subvert the termination process by throwing an exception and subsequently restarting execution of the actor.  This makes it so that two exceptions are thrown per message processed rather than the normal one.  Here's a small snippet from the profiling results:

     Compiled + native   Method                        
 40.0%  5148  +     0    ExceptionBlob
 13.9%   658  +  1133    scala.actors.Scheduler$$anon$2.run
 13.1%   391  +  1299    scala.actors.Actor$.loop
  3.4%   441  +     0    scala.actors.FJTaskRunner.run
--- snip ---
 73.7%  6705  +  2771    Total compiled
         Stub + native   Method                        
 20.6%     0  +  2654    java.lang.Thread.isInterrupted

What you can see is that the ExceptionBlob is dominating the runtime, followed by Thread.isInterrupted.  ExceptionBlob is a class in the JVM that observes the unwinding of the stack.  As far as I could discern through so separate microbenchmarking (which I'll discuss another day, after I've spent some time in HotSpot code), the amount of time consumed by ExceptionBlob is proportionate depth of the stack being unwound by exceptions.  In other words, throwing fewer exceptions doesn't do any good if you use them to unwind deeper stacks, and unwinding shallower stacks doesn't do any good if you throw more exceptions.  Caching exceptions doesn't help, either, but I digress.  The bottom line is that loop is pretty inefficient in the standard actors implementation.

State-based Actor loop

In my state-based actors, there are two states to handle looping, both of which are subclasses of Running.  The first is Looping, which is initially entered when the actor starts looping.  The second is ResumeLooping, which is used to resume looping after the actor has been suspended.  Both run the body of the react in a tail-recursive loop until their are no more messages to process, at which time they suspend the actor.  This means that many message can potentially be processed before any exception-based signaling is used.  The following is a snippet from a profile:

     Compiled + native   Method                        
 67.5%  3233  +     0    ExceptionBlob
  6.4%   306  +     0    scala.actors.Actor$class.withLock
  4.9%    44  +   189    actorbench.LoopPingSender$$anonfun$act$1.apply
  4.8%    97  +   135    scala.actors.Actor$$anonfun$send$1.apply
  4.5%   116  +    99    scala.actors.Actor$Running.tr$1
  3.8%    36  +   145    scala.actors.Actor$ResumeLooping.enter
--- snip ---
         Stub + native   Method                        
  0.2%     0  +     9    java.lang.Thread.isInterrupted

Again, ExceptionBlob dominates the execution time, although the total time it consumes is lower.  Unfortunately, despite using significantly fewer exception signals, my state-based actors tend to build up significantly deeper stacks than the actors in the standard library.

Standard Actor recursive react

Recursive react seems to be the best performing looping pattern for the standard actor library.  It's a pretty simple pattern:

class PingSender(val totalPings: Int, val pingReceiver: PingReceiver) extends Actor {
  def act() {
    def recurse(i: Int): Unit = {
      if (i > 0) {
        pingReceiver ! Ping(i)
        if ((i % 1000) == 0) {
          react {
            case More => recurse(i - 1)
          }
        } else {
          recurse(i - 1)
        }
      } else {
        println("Sender Done")
      }
    }
    recurse(totalPings)
  }
}

As you can see, the function containing the react block is simply recursively called as needed.  Note that despite the fact that the recursive call appears to be in a tail position, it is not a tail recursive call and will not be optimized by the compiler.  The PartialFunction defined within the react block is actually a separate object with an apply method, so while recurse and the PartialFunction are mutually recursive, they are not tail recursive.

     Compiled + native   Method                        
 19.1%   456  +     0    ExceptionBlob
 13.0%    89  +   221    scala.actors.Reaction.run
  8.6%    53  +   152    scala.actors.Actor$class.send
  7.9%    61  +   127    scala.actors.Actor$class.react
  5.3%    59  +    67    scala.Iterator$class.toList
--- snip ---
 68.1%   862  +   765    Total compiled

         Stub + native   Method                        
 26.4%     0  +   631    java.lang.Thread.isInterrupted
--- snip ---
 26.9%     0  +   643    Total stub

This time Thread.isInterrupted dominates, with ExceptionBlob not too far behind.

Actor States recursive react

It turns out that the "hybrid trampoline" I implemented to facilitate recursive reacts without giving up the thread provides pretty bad performance.

 64.9%  3714  +     0    ExceptionBlob
 11.4%   568  +    85    scala.actors.Actor$class.withoutLock
  8.4%   478  +     1    scala.actors.Actor$class.withLock
  5.1%    90  +   201    scala.actors.Actor$$anonfun$reactWithin$1.apply
  4.2%   115  +   125    scala.actors.Actor$$anonfun$send$1.apply
--- snip ---
 97.8%  5006  +   595    Total compiled
         Stub + native   Method                        
  0.1%     0  +     6    java.lang.Thread.currentThread
--- snip ---
  0.2%     0  +    12    Total stub

The problem is that when I wrote it, I thought that the cost of throwing exceptions was primarily proportionate to the number of exceptions thrown rather than the amount of stack that is unwound. Consequently, I made the trampoline only bounce once every n invocations in order to reduce the number of exceptions being thrown. As it turns out, what I should do is find a place where the stack is likely to be the shallowest, and throw an exception every time.

Actor States reactWhile

reactWhile addresses the pitfalls presented in the previous versions and provides roughly twice the performance of next best performer, recursive react on the standard actor library.

     Compiled + native   Method                        
 30.0%   171  +   331    scala.actors.Actor$Running.tr$1
 24.4%   263  +   146    actorbench.ReactWhileReceiver.withLock
 12.4%    26  +   181    actorbench.ReactWhileSender.sendPings
  8.5%   142  +     0    scala.actors.Actor$class.withoutLock
--- snip ---
 79.0%   660  +   661    Total compiled

         Stub + native   Method                        
 15.1%     0  +   252    java.lang.Thread.isInterrupted
  0.9%     0  +    15    java.lang.Thread.currentThread
 16.0%     0  +   267    Total stub

ExceptionBlob has completely disappeared from the profile. That is because reactWhile makes minimal use of exceptions for signaling.  Consequently, its performance is dominated by some of the complicated for my (worthless) hybrid trampoline, my completely unoptimized lock implementation, and calls to Thread.isInterrupted that are made when the bound thread is returned to the pool when the actor suspends.  I'll write a blog on how reactWhile works at a later date, after I cleanup some of the other problems that I've listed.

Parting Matter

It really annoys me when people post benchmark results without providing details of their setup along with all of the code required to duplicate them, and that's just what I've done.  I've also benchmarked a very narrow set of behavior, failed for warm up the JVM, and not run multiple trials.  Developing a lightweight framework for benchmarking actor performance, along with a more comprehensive set of benchmarks is on my to-do list.  Once I have that, I'll start doing more rigorous benchmarking.  But I don't think I quite need it yet because bottlenecks are showing up quickly and consistently in my profiling, and in most cases they seem to have reasonably straight forward solutions.  All the above profiles against my library were done against revision 49658b3475e2 in my bitbucket repository, so you can at least access that code.

Sphere: Related Content

6 comments:

Jorge Ortiz said...

Erik,

This is awesome work. I hope it makes it into the standard library.

Anonymous said...

thanks for doing this. Very good job. It has to make it into the STDlib.
/Jonas

Unknown said...

I wonder if this can help with the ExceptionBlob.
Longjumps Considered Inexpensive
Using a pre-alocated exception (or a clone of) makes the exception throwing just like a simple "goto".

Erik Engbrecht said...

Gabriel,
I tried preallocated exceptions and it didn't help. The kicker is that the throw and the catch need to be optimized into the same compilation unit via inlining, which doesn't happen with larger pieces of code.

Unknown said...

Yep, like the small letter in contracts, "if the thrower and catcher are together in the same compilation unit" comes to bite you back :)
BTW, I'm looking forward to use your refactored actors library!

wireless sex toys said...

hello friends I really liked this information, a few days ago I read something similar, I would like to receive updates on this issue, as it is very interesting, thanks!