Tuesday, February 10, 2009

Refactoring Scala Actors: Rethinking the Approach

When I started refactoring Scala's actor library, I really had several goals:

  1. Reduce the coupling within the library so that the implementation can be more easily extended and customized
  2. Create a more transparent and "programmer friendly" actor implementation
  3. Improve performance of various use cases
  4. Make actors that interact better with their environment, particularly non-actor code
  5. Maintain API compatibility with the existing library to the maximum extent practical

Thus far I've done my work by completely overhauling the Actor trait directly in the library.  While I have plans for how this will reduce coupling, the haven't come to fruition yet.  I think my state machine model is considerably more transparent and programmer friendly than the current implementation.  The state an actor in is always clear, transition are reasonably well defined, and performing correct locking and unlocking is now pretty straight forward.  I've substantially improved the performance of event-based actors for the common case where an actor loops while it receives messages until some termination criterion is reached.  I haven't done anything with making them interact better with their environment yet, as I believe Philipp Haller is in the process of incorporating some changes for which I submitted patches several months back that will help considerably (he doesn't appear to be using the patches directly, but the changes I've seen are close enough).

A few days ago David MacIver asked me a couple interesting questions on IRC:

  1. Have you considered using Kilim?
  2. Have you looked at the continuations support that appears to be emerging as a compiler plugin?

Using Kilim would almost certainly disqualify the changes I'm making from incorporation into the standard library because of the external dependency on a prerelease framework that uses bytecode manipulation, and I don't think the fledgling continuation support is mature enough to experiment with yet (or maybe just not well documented enough yet, who knows...).  That being said, both of these would be interesting avenues of development.  Event based actors rely heavily on continuations, and the performance of those continuations has a substantial effect on the performance of the actors.  Ultimately a properly decoupled implementation would allow someone to build an API compatible actor implementation on either of these technologies, or something else entirely.

I also received a recent nudge on the mailing list, when someone pointed out that it would be easier to experiment with my library if it was in it's own package.  I somewhat disagree with that.  It would be easier to experiment with it if you didn't have to do an entire build of the Scala distribution that has my code in it, and unfortunately for the foreseeable future I'm going to be depending on both code that is being committed into the trunk and on the ability to modify said code.  Also, the way I have it setup today, if someone wanted to test their Scala application/library/framework against my state-based actors, they would just have to pull my Mercurial repository, build it, and rebuild against the resulting distribution.  On the downside, it's a pain to do side-by-side comparisons between the two implementations, because you have to switch distributions and rebuild every time.

That being said, decoupling is one of my primary goals, and Martin & Company have already set the precedence of doing major work in a separate package with their efforts on redesigning the Scala collections library.  So as soon as I finish up some of my immediate tasks and have a good build again (I'm in the middle of redesigning the control flow pattern to minimize the stack size when exceptions are thrown), I'm going to push most of the public and protected interface on Actor into a new trait that will be between Actor and AbstractActor as abstract methods, move my code out of scala.actors and into scalax.actors in my own directory.  I definitely keep it within the same repository as the Scala distribution code, and will probably just make another directory for it.  This means I'll have to mess with Sabbus to add my portion to the build, which won't be fun, but shouldn't be too hard.  I'm sure there's going to be a lot more to do to extract it out, so I'll be adding items to my issues list as I think of them.

The end result should be my state based actors and the existing actors being able to live side-by-side and interact with one another in the same program.  Assuming I can at least get some patches into the main library, it will also mean that the future of my work will not be dependent on being incorporated into the standard distribution.  If it is, great, if not, I can distribute it separately.  I would have done this from the start, but initially I was highly dependent on the existing code and infrastructure into order to get something working reasonably quickly, and to be able to smoke it out to see if it indeed worked.  Back in December I was basically breaking partest with every change.  But now things are reasonably stable, I rarely break partest, and I don't think it will take much for my code to be able to stand on its own.

So what does everyone think?  Is this a good direction to head in?

Sphere: Related Content

Friday, February 06, 2009

Profiling the ExceptionBlob

Last week on the Scala IRC channel (#scala) I chatted with several people about the need to substantially reduce the use of exceptions for flow control within the Scala Actors library.  My remarks were met with a certain amount of healthy skepticism, as significant progress has been made on improving the performance of Java exceptions, as stated by John Rose in his blog Longjumps Considered Inexpensive.

I highly suggest you go read the post.  I don't doubt anything stated in it, and in fact noticed a substantial performance improvement when I stopped benchmarking against Java 5 and started doing it against Java 6 (I usually work on a Mac, and Java 5 is the default VM), and expect even more improvements with Java 7.  Once you're done, I'd like to call your attention to a couple fragments of it (additional emphasis mine):

The Hotspot JIT can optimize a throw of a preallocated exception into a simple goto, if the thrower and catcher are together in the same compilation unit (which happens because of inlining).

Here is a concrete example of non-local return, implemented efficiently via a cloned exception. I have observed the Hotspot server compiler emitting a machine-level goto for the throws. With current server VM optimizations of this code, the cost of a non-local return is less than three times the cost of a plain (local) return; with the client VM the ratio is ten. See the code example and benchmark for details: NonLocalExit.java (javadoc).

These statements have a couple implications:

  1. Reducing a throw to a long jump requires a certain amount of HotSpot magic take place to ensure that the throw and catch are in the same compilation unit.  The more complex the code is, the less likely this is to happen.
  2. Long jumps are still substantially slower than normal returns

I think John's post received a lot of attention, and people (including myself) saw it, didn't read it carefully, and assumed that HotSpot now performed miracles with cached exceptions.   What we missed was a post to the MLVM development mailing list about a year later by Charles Nutter, who had noticed that his non-local returns in JRuby where not being compiled down into long jumps, and subsequent response from John Rose.  There's a lot of very good technical discussion in there, but I think the gist is that HotSpot's ability to optimize throws into long jumps is still limited, it often needs help to do so, and very smart people are working at improving it.

Of course that all still leaves the question:  Just how much slower are cached exceptions than a normal return?  I can't really answer that in the general case, but I did put together a couple mircobenchmarks in Java that somewhat simulate the type of stack creation that happens within my state-based actors.  The benchmark is pretty simple.  It recursively descends down, incrementing a member variable, and then on the way back up it decrements another member variable.  In the case of the exception-based return, it does the decrements within a finally block.  This admittedly is rather pathological code, but it is also simple and easy to understand.

Here's the normal return version:

public class NormalReturn {
    public static void main(String[] args) {
        NormalReturn m = new NormalReturn();
        m.doTest(10000000);
    }
    public void doTest(int times) {
        for(int i = 0; i < times; i++) {
            down = 0;
            up = 0;
            deepRecurse(1000);
        }
        System.out.println(down);
        System.out.println(up);
    }
    private int down = 0;
    private int up = 0;
    public void deepRecurse(int i) {
        down++;
        if (down <= i) {
            deepRecurse(i);
            up--;
        }
    }
}

...and here's the version that uses a cached exception:

public class TryFinally {
    private static class Signal extends RuntimeException {
        @Override
        public Signal fillInStackTrace() { return this; }
    }
    private Signal signal = new Signal();
    public static void main(String[] args) {
        TryFinally m = new TryFinally();
        m.doTest(10000000);
    }
    public void doTest(int times) {
        for(int i = 0; i < times; i++) {
            try {
                down = 0;
                up = 0;
                deepRecurse(1000);
            } catch (Signal s) {
                // do nothing
            }
        }
        System.out.println(down);
        System.out.println(up);
    }
    private int down = 0;
    private int up = 0;
    public void deepRecurse(int i) {
        try {
            down++;
            if (down == i) throw signal;
            else deepRecurse(i);
        } finally {
            up--;
        }
    }
}

With profiling turned off on and -server on Java 6 on my Mac, the normal return version takes about 60 seconds to complete, while the TryFinally version takes over 10 minutes.  That's certainly much better than the 30x figure John Rose cited for the client VM, but it's still an order-of-magnitude worse than a normal return.  One interesting thing about the benchmark is that if you decrease the recursion depth and increase the number of times deepRecurse() is executed the total time stays relatively constant.

So with this in mind, I'm going to work hard to refactor the actors library to keep exception usage to a minimum, an where it must be used to minimize the depth of the stack.

Sphere: Related Content

Scala Actors versus the ExceptionBlob

I've begun some very preliminary benchmarking of my state-based actors library against the current standard actors library and seen some rather interesting results.  The test I'm running spawns four pairs of actors and sends about 40 million messages between each pair.  The only logic in the actors is to count how many messages they have received, ensure they are received in order, and occasionally send back a "More" message.  This is so the mailbox doesn't overfill and cause out-of-memory errors.

The benchmark has three different incarnations, each testing a different way of processing messages until some termination condition occurs.  The versions are:

  • loop
  • recursive react
  • reactWhile (only available with the State-based actors, but relatively easy to port to "classic" actors)

The code for the loop version is below:

class LoopPingSender(val totalPings: Int, val pingReceiver: LoopPingReceiver) extends Actor {
  private var pingsSent = 0
  def act() {
    Actor.loop {
      pingReceiver ! Ping(pingsSent)
      pingsSent += 1
      if (pingsSent == totalPings) {
        println("Sender Done")
        exit("sender completed successfully")
      } else if ((pingsSent % 1000) == 0) react {
        case More => 
      }
    }
  }
}
class LoopPingReceiver(val totalPings: Int) extends Actor {
  private var pingsReceived = 0
  def act() {
    Actor.loop {
      react {
        case Ping(n) if n == pingsReceived => {
          pingsReceived += 1
          if (pingsReceived == totalPings) {
            println("Receiver Done")
            exit("receiver completed successfully")
          } else if ((pingsReceived % 1000) == 0) {
            reply(More)
          }
        }
        case Ping(n) => {
          println("error")
          exit("invalid n value: " + n)
        }
      }
    }
  }
}

The non-scientifically gathered results are as follows, all times in seconds:

Version Standard Actors State-based Actors
loop 298.22 96.84
Recursive react 90.25 113.57
reactWhile N/A 40.63

Standard Actor loop

Let's consider the loop version first.  loop in the standard actor implementation is rather convoluted.  Rather than directly supporting looping, the standard actor implementation provides a package private hook for a function that is executed when the actor is terminating.  loop uses this hook to subvert the termination process by throwing an exception and subsequently restarting execution of the actor.  This makes it so that two exceptions are thrown per message processed rather than the normal one.  Here's a small snippet from the profiling results:

     Compiled + native   Method                        
 40.0%  5148  +     0    ExceptionBlob
 13.9%   658  +  1133    scala.actors.Scheduler$$anon$2.run
 13.1%   391  +  1299    scala.actors.Actor$.loop
  3.4%   441  +     0    scala.actors.FJTaskRunner.run
--- snip ---
 73.7%  6705  +  2771    Total compiled
         Stub + native   Method                        
 20.6%     0  +  2654    java.lang.Thread.isInterrupted

What you can see is that the ExceptionBlob is dominating the runtime, followed by Thread.isInterrupted.  ExceptionBlob is a class in the JVM that observes the unwinding of the stack.  As far as I could discern through so separate microbenchmarking (which I'll discuss another day, after I've spent some time in HotSpot code), the amount of time consumed by ExceptionBlob is proportionate depth of the stack being unwound by exceptions.  In other words, throwing fewer exceptions doesn't do any good if you use them to unwind deeper stacks, and unwinding shallower stacks doesn't do any good if you throw more exceptions.  Caching exceptions doesn't help, either, but I digress.  The bottom line is that loop is pretty inefficient in the standard actors implementation.

State-based Actor loop

In my state-based actors, there are two states to handle looping, both of which are subclasses of Running.  The first is Looping, which is initially entered when the actor starts looping.  The second is ResumeLooping, which is used to resume looping after the actor has been suspended.  Both run the body of the react in a tail-recursive loop until their are no more messages to process, at which time they suspend the actor.  This means that many message can potentially be processed before any exception-based signaling is used.  The following is a snippet from a profile:

     Compiled + native   Method                        
 67.5%  3233  +     0    ExceptionBlob
  6.4%   306  +     0    scala.actors.Actor$class.withLock
  4.9%    44  +   189    actorbench.LoopPingSender$$anonfun$act$1.apply
  4.8%    97  +   135    scala.actors.Actor$$anonfun$send$1.apply
  4.5%   116  +    99    scala.actors.Actor$Running.tr$1
  3.8%    36  +   145    scala.actors.Actor$ResumeLooping.enter
--- snip ---
         Stub + native   Method                        
  0.2%     0  +     9    java.lang.Thread.isInterrupted

Again, ExceptionBlob dominates the execution time, although the total time it consumes is lower.  Unfortunately, despite using significantly fewer exception signals, my state-based actors tend to build up significantly deeper stacks than the actors in the standard library.

Standard Actor recursive react

Recursive react seems to be the best performing looping pattern for the standard actor library.  It's a pretty simple pattern:

class PingSender(val totalPings: Int, val pingReceiver: PingReceiver) extends Actor {
  def act() {
    def recurse(i: Int): Unit = {
      if (i > 0) {
        pingReceiver ! Ping(i)
        if ((i % 1000) == 0) {
          react {
            case More => recurse(i - 1)
          }
        } else {
          recurse(i - 1)
        }
      } else {
        println("Sender Done")
      }
    }
    recurse(totalPings)
  }
}

As you can see, the function containing the react block is simply recursively called as needed.  Note that despite the fact that the recursive call appears to be in a tail position, it is not a tail recursive call and will not be optimized by the compiler.  The PartialFunction defined within the react block is actually a separate object with an apply method, so while recurse and the PartialFunction are mutually recursive, they are not tail recursive.

     Compiled + native   Method                        
 19.1%   456  +     0    ExceptionBlob
 13.0%    89  +   221    scala.actors.Reaction.run
  8.6%    53  +   152    scala.actors.Actor$class.send
  7.9%    61  +   127    scala.actors.Actor$class.react
  5.3%    59  +    67    scala.Iterator$class.toList
--- snip ---
 68.1%   862  +   765    Total compiled

         Stub + native   Method                        
 26.4%     0  +   631    java.lang.Thread.isInterrupted
--- snip ---
 26.9%     0  +   643    Total stub

This time Thread.isInterrupted dominates, with ExceptionBlob not too far behind.

Actor States recursive react

It turns out that the "hybrid trampoline" I implemented to facilitate recursive reacts without giving up the thread provides pretty bad performance.

 64.9%  3714  +     0    ExceptionBlob
 11.4%   568  +    85    scala.actors.Actor$class.withoutLock
  8.4%   478  +     1    scala.actors.Actor$class.withLock
  5.1%    90  +   201    scala.actors.Actor$$anonfun$reactWithin$1.apply
  4.2%   115  +   125    scala.actors.Actor$$anonfun$send$1.apply
--- snip ---
 97.8%  5006  +   595    Total compiled
         Stub + native   Method                        
  0.1%     0  +     6    java.lang.Thread.currentThread
--- snip ---
  0.2%     0  +    12    Total stub

The problem is that when I wrote it, I thought that the cost of throwing exceptions was primarily proportionate to the number of exceptions thrown rather than the amount of stack that is unwound. Consequently, I made the trampoline only bounce once every n invocations in order to reduce the number of exceptions being thrown. As it turns out, what I should do is find a place where the stack is likely to be the shallowest, and throw an exception every time.

Actor States reactWhile

reactWhile addresses the pitfalls presented in the previous versions and provides roughly twice the performance of next best performer, recursive react on the standard actor library.

     Compiled + native   Method                        
 30.0%   171  +   331    scala.actors.Actor$Running.tr$1
 24.4%   263  +   146    actorbench.ReactWhileReceiver.withLock
 12.4%    26  +   181    actorbench.ReactWhileSender.sendPings
  8.5%   142  +     0    scala.actors.Actor$class.withoutLock
--- snip ---
 79.0%   660  +   661    Total compiled

         Stub + native   Method                        
 15.1%     0  +   252    java.lang.Thread.isInterrupted
  0.9%     0  +    15    java.lang.Thread.currentThread
 16.0%     0  +   267    Total stub

ExceptionBlob has completely disappeared from the profile. That is because reactWhile makes minimal use of exceptions for signaling.  Consequently, its performance is dominated by some of the complicated for my (worthless) hybrid trampoline, my completely unoptimized lock implementation, and calls to Thread.isInterrupted that are made when the bound thread is returned to the pool when the actor suspends.  I'll write a blog on how reactWhile works at a later date, after I cleanup some of the other problems that I've listed.

Parting Matter

It really annoys me when people post benchmark results without providing details of their setup along with all of the code required to duplicate them, and that's just what I've done.  I've also benchmarked a very narrow set of behavior, failed for warm up the JVM, and not run multiple trials.  Developing a lightweight framework for benchmarking actor performance, along with a more comprehensive set of benchmarks is on my to-do list.  Once I have that, I'll start doing more rigorous benchmarking.  But I don't think I quite need it yet because bottlenecks are showing up quickly and consistently in my profiling, and in most cases they seem to have reasonably straight forward solutions.  All the above profiles against my library were done against revision 49658b3475e2 in my bitbucket repository, so you can at least access that code.

Sphere: Related Content

Saturday, January 31, 2009

Google is Paranoid

Apparently Google has recently become extremely paranoid about what may harm my computer. Here's a screenshot of some search results:

Looks like every result is dangerous...now here's what happens when I click on one of these dangerous links:

Must be Apple's developer website is out to get me...

Note: The behavior seems to have stopped. Quick failed Google experiment?

Sphere: Related Content

Saturday, January 24, 2009

Refactoring Scala Actors

I've been experimenting in my spare time with the Scala actors library for about a year now. Initially it was mostly as part of my participation in the Widefinder project, but for the past several months I've been working on building more complicated abstractions using the actor library. The challenge with Widefinder, if you chose to keep your parallelism implementation generic and reusable, was to spread line-oriented input stream processing across as many threads as possible (ok, some may disagree with that, but that was my simplified take). This was relatively straight forward because the lines didn't have to be processed in any given order. Actors worked well for the task, and I was able to build a fairly generic set of classes that could be used in a very natural way for any similar problem.

I was feeling pretty good about Widefinder so I decided tackle a more complicated problem: parallelizing XML parsing (I'll write more on that later). In the process of exploring various ideas to help parallelize the parsing of XML I started butting my head against limitations of the actors library. For example, I built an "asynchronous stream" class - something like a normal Scala stream but instead of being entirely lazy, it asynchronously evaluates itself ahead of what has already been consumed, effectively parallelizing computation of the stream. This worked, except that if the stream wasn't fully evaluated the application using the stream wouldn't terminate properly because it's actor was still active. That made sense, as in many actor based applications you wouldn't want them to terminate before all the actors were finished, so I tried to make a custom scheduler that used daemon threads and therefore would allow the application to terminate without all of the actor threads terminating. This seemed like a straight forward task, except that it turned out to not be possible given some current limitations (link to enhancement about schedulers). So I set myself to fixing/enhancing the actors library to meet my needs and submit some patches.

Making fixes and enhancements to the actors library turned out to be a much bigger challenge than I expected. The external API for actors, particularly the documented part, is pretty clean and flexible. However internally, at least to a new set of eyes, there is a lot of spaghetti (count methods and variables that are defined as private[actors] so all the classes in the package can muck with them), oddly named variables and methods, and over a dozen mutable variables on the Actor trait that must all be correctly set in the right combination in order for the actor to work - and completely lacking any sort of structured means of manipulating them. I braved this, submitted some patches, and even wrote several enhancements that I did not submit, but I kept running into subtle problems that seemed almost impossible to track down. I would sprinkle assertions throughout the code only to realize that even when things seemed to work much of what I assuming was really holding. In other words I was getting lucky - or unlucky - as challenges like this are pretty common when doing parallel programming.

By this time I had a reasonable, albeit incomplete and with some inaccuracies, idea of what all those magical instance variables within Actor do and how they correspond to meaningful states. Here is my take on it:

  1. waitingFor - the partial function that the actor is using to wait for a matching message
  2. shouldExit - used to signal that the actor should exit instead of doing normal processing
  3. isSuspended - seems to indicate that the actor is waiting while blocked on a thread
  4. isDetached - the actor is waiting and is not attached to a thread
  5. isWaiting - the actor is waiting in some way
  6. mailbox - queue of messages for the actor
  7. sessions - list used as a stack to keep track of the sender of messages while they are being processed
  8. received - the message that was received, used to pass around a received message, could be replaced with local variable within functions
  9. exitReason - the reason why the actor exited, default to 'normal when the actor is initialized and can be modified at any time, so it is there even when the actor has not yet exited.
  10. onTimeout - used for keeping track of a timeout action so that it can be canceled if a message is received within the timeout period
  11. kill - a function invoked by reaction prior to invoking exit on the actor. It is used by loop and loopWhile to bypass the normal exit process of reactions
  12. trapExit - if false (the default) then the actor will exit when an actor that it is linked to exits, if it's true then it will receive an Exit message
  13. rc - a reply channel - unclear why it is actually needed because it is never used other than to be set in freshReplyChannel on every invokation
  14. links - a list of linked actors. Linked actors are notified or shutdown when the actor that they are linked to shuts down
  15. continuation - the function to be executed when the actor resumes

All those were pretty confusing, and it took me weeks to make any sense out of them - and usually the sense I made contained subtle flaws, so I decided to do something drastic. In my opinion the actors library is in desperate need of being refactored into something that encapsulates its state better and makes it much more explicit so you do not need to be an advanced adept to figure out what is happening inside the library. So I decided to do that refactoring and see where it lead me.

I distilled much of the above instance variables into the following states, which I have represented as immutable objects:

  • NotStarted - the initial state of the actor
  • Active States
    • StartScheduled - the actor has been started but has not yet been attached a thread or started executing its body
    • Running States - states where the actor is bound to a thread and is executing its body
      • Running - the actor is bound to a thread and is executing its body
      • Looping - a special case of running where the body is repeated while a specified condition holds
      • Killing - when an actor is Running it cannot be directly stopped by the library, so it is shifted into the Killing state. When the library receives control while the actor is running it checks to see if the state has been changed to Killing. If the state has been changed to Killing, then it takes actions to regain control from user code so that the actor can be cleanly shutdown.
    • Wait States
      • BlockedWait - the type of wait performed within a receive block where the actor is waiting for a message while attached to a thread. There may be a timeout specified after which the actor will wakeup.
      • DetachedWait - the type of wait performed within a react block when no matching message is available.
    • ContinuationScheduled - similar to StartScheduled, only the processing of the message the actor was waiting for when in the DetachedWait state is scheduled instead of the starting of the actor.
  • Terminal States
    • Finished - the actor teriminated normally
    • Killed - the actor was killed before it could complete execution
    • UnhandledThrowable - an exception was thrown, usually from "user code," and not handled
    • InternalError - an error occurred within the actor library

...and now there are the following instance variables:

  1. state - the state of the actor
  2. mailbox queue of messages for the actor
  3. sessions list used as a stack to keep track of the sender of messages while they are being processed
  4. trapExit - if false (the default) then the actor will exit when an actor that it is linked to exits, if it's true then it will receive an Exit message
  5. links - a list of linked actors. Linked actors are notified or shutdown when the actor that they are linked to shuts down .

Reducing the statefulness of the actor, and more importantly making the primary state explicit instead of implicit, makes the code much easier to both debug and extend.  In working with actors, I found that it's relatively common to have an actor just seem to stop.  It could be because it's waiting for a message, it could be because it's stuck in a loop, it could be any number of things.  Without explicit state, it's very difficult to tell exactly what's happening with an actor.  With explicit state, it's right there in your debugger.

I have the code in a Bitbucket repository in the "actor_state_machine" branch.  Most of the changes are in the actor.scala file and I highly encourage you to take a look at the differences.  Right now it the Scala test suite completes with the same results as the trunk as of a couple months ago, and I've added a couple tests to catch problems that it didn't.  Next I'm going to do a little more refactoring and general code cleanup, then I'm going to focus on expanding the test suite.  The current Scala testing tool, partest, uses actors heavily so it gives them a workout, but there aren't any explicit tests for actors.  Finally, I'm going to try to decouple the states from one another and try to extract their code from the Actor trait so that they can be tested as independent units and are more extensible.  I'm not sure what I'm actually going to do with the code.  I'd be happy to contribute it back to the EPFL, but given it is such a substantial departure I'm not sure they would want it.  So I'll probably end up moving it into another namespace and distributing it as a separate library.

Sphere: Related Content