Sunday, June 21, 2009

Pondering Actor Design Trades

There's been a lot of discussion of the Scala actors library lately, much of it critical, and a recent flurry of alternate implementations.  The alternate implementations (except my languishing state-based one ;-) all have one thing in common:  They are several orders of magnitude simpler.  Writing a basic actor implementation is actually pretty trivial, especially given java.util.concurrent classes that provide a decent chunk of the functionality in Scala actors, all for free on JDK5+.  So this begs the question few questions:

  1. Why is the standard Scala actor implementation so complex when others have done it in a such simpler fashion?
  2. Is it better to have one, big actor library that supports a wide variety of use cases, or a bunch of smaller ones targeted at specific niches and programming styles?
  3. If there are to be a bunch, should they just be conceptually similar (e.g. all based on the actor model), or should there be interoperability among them?

I'm not going to answer these questions now.  Instead, I'm going to try to start laying out some of what I believe to be the key characteristics of an actor implementation, and how they detract or enforce one another.  So here it goes:

  1. Guarantees
  2. Expressivity
  3. Extensibility
  4. Performance
  5. Scalability

Guarantees

The purpose of a concurrency framework is to make concurrency easier.  Concurrency is hard largely because it is extremely difficult to reason about, and thus concurrent code tends to be hard to write, laden with bugs, and subject to various odd pitfalls.  By providing various guarantees, a concurrency framework makes it easier to reason about concurrent code.  Actors are intended to free the programmer from worrying about things like locks, semaphores, thread management, etc. by encapsulating all that complexity behind a simple interface, assuming the programmer follows some basic rules like "no shared mutable state among actors."

The problem with guarantees is that in they tend to break down in the presence of limited CPU and memory resources.

Expressivity

Expressivity is difficult to define.  For purposes here, I'm going to define it as the degree to which a concise, natural expression of the programmer's intent is supported, and illustrate it by comparing Scala Actor to Lift Actor.  Scala Actors allow you to execute logic independent of message processing (note: this a violation of the theoretical model for actors) by simply placing it in the act method.  Lift Actors, on the other hand, are only triggered when they receive of message (this is consistent with the theoretical model).  For example, this makes it so that Scala Actors can do things such as perform possibly costly setup operations in their own thread before they start listening for messages.  In order to accomplish this in the Lift model, the programmer must create the actor and then send it some sort of "init" message.  The same effect can be achieved with both implementations, but it is more naturally supported by Scala Actors.  Of course there is a tradeoff here, as deviating from the theoretical model potentially weakens any guarantees that the model may provide.  The Scala Actor way also implies that an Actor has an explicit lifecycle, which as we'll see later has other significant implications.

Another example is what I'll call the "nested react pattern."  It is relatively common to want an actor to take on a different behavior after processing a message, thus altering which messages are ignored and how the received messages are processed.

loop {
 react {
    case 'foo => { 
      // do some stuff...
      react {
        case 'bar => // do some other stuff... 
      } 
    } 
  } 
}

The code above alternates between processing 'foo messages and 'bar messages.  This can be done with Lift Actor as well, but the expression is a little less natural:

class MyActor extends LiftActor {
  private val fooMode: PartialFunction[Any, Unit] = {
    case 'foo => {
      // do some stuff
      mode = barMode
    }
  }
  private val barMode: PartialFunction[Any, Unit] = {
    case 'bar => {
      // do some other stuff...
      mode = fooMode
    }
  }
  private var mode = fooMode
  protected def messageHandler = mode
}

Finally, Lift Actors exclusively use an event-based model and have no support for blocking on a thread while waiting for a message, and thus looses the ability to express patterns such as the following:

loop {
  react {
    case 'converToNumber => {
      val i: Int = receive {
        case 'one => 1
        case 'two => 2
        case 'three => 3
      }
      reply(i)
    }
  }
}

Extensibility

For purposes here, I'm going to use "extensible" to mean that a piece of software is extensible if capabilities can be added without modifying the core or breaking its semantics in a amount of effort proportional to the size of the extension.  This is narrower than the traditional definition of extensibility, which also covers the ability of a system to evolve internally.  A good example of extensibility is the ability of both Scala Actors and Lift Actors to allow the user to specify a custom scheduler.  Other examples could include adding control structures, using a different data structure for a mailbox.

The challenge with extensibility is that in order to enable it, what could otherwise be treated as the internal bits of the library must instead have well defined interfaces for components along with appropriate hooks for inserting them.  For example, a while ago I did some work to make the MessageQueue used for the mailbox overrideable (it has temporarily been overcome-by-events due to other changes).  This is a small example, but it shows how extensibility requires a greater degree of forethought.

Extensibility also benefits substantially from simplicity.  Scala Actors are almost impossible to extend from outside the scala.actors package because of their heavy reliance on package-private methods and state (mostly fixed here, but I broke remote actors in the process so no patch yet).  Lift Actors, on the other hand, are very extensible, at least within the bounds of their design (purely event-based actors with no explicit lifecyle).  Many of the flow control mechanisms could be implemented on top of the baseline approach.

At this point we see that extensibility has an interesting relationship with expressivity.  I previously claimed that Scala Actors were more expressive because the wide variety of control structures they provide (and I didn't even touch on some of the DSL-like functionality that enables all sorts of interesting things).  However, given Lift Actors far simpler and more extensible foundation, there is much more opportunity to create custom control structures as extensions to Lift Actors without modifying the core.  Thus, if you are willing to do some lower-level programming, it could be argued that Lift Actors are in reality more expressive due to their extensibility.

Performance and Scalability

For purposes here, I'm going to treat performance as the rate a which an actor can receive and process messages at a relatively small, fixed number of simultaneous actors.   This means that improving performance in largely a matter of reducing the time it takes from when a message is initially sent to when user-code within the actor begins processing the message, including minimizing any pause between when an actor finishes processing one message and is available to start processing the next.  For moderate numbers of actors, performance is often maximized by having one thread per actor, and having the actor block while waiting for a message.  Given enough actors, the memory requirements of using a thread for each actor will eventually cause more slowdown than cost of scheduling a new reaction for each message.  This is illustrated in Philipp Haller's paper, "Actors that Unify Threads and Events" in the following graph:

Note that the above graph covers a microbenchmark running a simple, non-memory intensive task, and that the thread line is not a measurement of thread-bound actors, but rather of a simple threaded implementation.  However, my own benchmarking has shown that receive-based (ones that block on a thread) compare to event-based actors in almost the same way as threads to event-based actors in the above graph.  Also, remember that given a real application where heap space is needed for things besides the stacks of thousands of threads the point where the JVM throws an OutOfMemoryError will be much farther to the left.  There are also more subtle issues.  One of my first experiences with the Scala Actors library was creating a deadlock.  I created more thread-bound actors than the scheduler wanted to create threads, and thus actors were stuck blocking on threads waiting for messages from an actor that hadn't started yet because there were no available threads.  In other words, blocking can lead to situations such as deadlock, starvation, and simply extreme forms of unfairness with respect to how much CPU time is allocated each actor.  These all go against highly desirable guarantees that a actor library should provide outside of extreme circumstances.

Ultimately event-based actors make the better model.  For one, part of the reason why event-based Scala Actors are so expensive is that they suspend by throwing an exception to return control from user code to the library.  While exceptions have been heavily optimized in the JVM, especially in recent versions, they are still substantially slower than normal return paths.  Scala Actors need to use exceptions to suspend is a consequence of their expressivity.  Basically, because the library as little or no knowledge of what an actor is doing within a reaction, it cannot rely on traditional returns without introducing special control structures (see reactWhile numbers in one of my previous blogs).  Lift Actors, on the other hand, have do not need to use exceptions for control flow because the message processing cycle is essentially fixed - user code cannot intersperse weird (or even not-so-weird) patterns within it, or mix in blocking receives with event-based ones.  Another potential optimization of event-based actors is to have them block if there are plenty of threads available, and then release it if the thread they are on is needed by the scheduler.  To my knowledge this optimization is not implemented anywhere, but I think it would be relatively straight forward.  The only problem is that the actor becomes more tightly bound to its scheduler.

Parting Thoughts

Ultimately, time and community willing, I'd like to evolve what is here, plus solid treatment of a lot of lower-level details, into a Scala Improvement Document (SID).  There are a lot of subtle trades involved, and I think producing a general-purpose actors library is at least an order-of-magnitude more difficult than producing a special-purpose one.  I also believe that if an actor implementation is part of the standard library, then it should provide the necessary extension points for when users need something special-purpose they can create it and still leverage components of the standard library and interoperate with other actors.  In order words, I think it should define both the interface portion of an API along with providing a solid implementation.  I don't think we'll even get their without a clear and common understanding of the various considerations involved.

Sphere: Related Content

Monday, May 25, 2009

I'm on Twitter!

For those of you with ADD or it's internet induced equivalents, I've started posting on Twitter. I long avoided it because I feel like the last thing people need is yet another half-baked information stream, but then people seem to like it so I'm giving it a shot. I'll post links to bugs, patches, and other comments regarding my efforts (and those of others) with Scala actors...along with other less important matters. http://twitter.com/ErikEngbrecht

Sphere: Related Content

Refactoring Scala Actors: Progress Update

It's been a while since I've posted, so I thought I'd give everyone a status update.  This post covers several different semi-disjoint topics at a fairly high level.  I plan on diving into some of the issues later this week and beyond, but for now...

State-machine Based Actors

A while back Philipp Haller, the original author and current maintainer of the Scala actors library, contacted and basically said he found the changes I was making really interesting, but he really needed a smaller, more gradual set of patches.  It's a perfectly reasonable request, as I had pretty much completely ripped apart his library.  I had rethought my approach, anyway, so I went about moving my state-machine based actor implementation into its own package and rewiring some of the pieces so that they could share common base traits, common infrastructure, and interoperate with one another as if they were the same library.  So I shoved my code into scalax.actors, and started hacking insertion points for my code into the main library.

The first thing I thought I needed was a base trait that defines the basic structure and operations of an actor, so created a BaseActor in between AbstractActor and Actor (as well as my own StateActor):

trait BaseActor extends AbstractActor {
  def react(f: PartialFunction[Any, Unit]): Nothing
  def reactWithin(msec: Long)(f: PartialFunction[Any, Unit]): Nothing
  def receive[A](f: PartialFunction[Any, A]): A
  def receiveWithin[R](msec: Long)(f: PartialFunction[Any, R]): R
  /*def loop(body: => Unit): Nothing */
  /*def loopWhile(cond: => Boolean)(body: => Unit): Nothing */
  protected[actors] def mailbox: MessageQueue[Message[Any]]
  private[actors] final def mailboxForChannel: MessageQueue[Message[Any]] = mailbox
  def mailboxSize: Int
  def send(msg: Any, replyTo: OutputChannel[Any]): Unit
  def forward(msg: Any): Unit
  def reply(msg: Any): Unit
  /*protected[actors]*/ def sender: OutputChannel[Any]
  def ? : Any
  def start(): AbstractActor
  def freshReplyChannel: Channel[Any] = new Channel[Any](this)
  def scheduler: IScheduler //TODO: restrict access to scheduler??
}

I don't think BaseActor is going to be a permanent fixture because its contents probably belong in AbstractActor instead, but for now it serves its purpose.  One of the first things you should notice is that way to much stuff in there is public.  Most of it should be protected, or perhaps somewhere else entirely (like an InputChannel encapsulated by the actor).

Reworking MessageQueue

There's also the issue of the mailbox, which is a rather important and a den of mutable data that is passed all around with private[actors] qualifiers.  Basically it separates the Message from the elements within the MessageQueue, so that the MessageQueue can keep its internal structure private, and thus facilitating making it a trait so that an actor can provide its own specialized implementation.  I was about to submit a patch for the change, but a fix for a memory leak in FJTaskRunner came about that relied on clearing mutable fields in the message when a task is done processing.  I have an alternative fix by changing pieces of FJTaskScheduler2, but schedulers in general and FJTaskScheduler2 in specific are in flux right now due to bugs (here and here and probably elsewhere), and I want to tweak the design a bit, so I'm holding off.

Fixing Schedulers

Which brings me to schedulers...  Problems with plugging in custom schedulers (mostly fixed) are what originally caused me to dive into the guts of the actor library.  Closely related to schedulers is ActorGC, which is absolutely essential to actors (almost) transparently abstracting threads, but can also be problematic due to it's fundamentally non-deterministic nature (it relies on the garbage collector for some of its more advanced capabilities).  That being said, now in trunk ActorGC is optional, so environments that don't require an implicit shutdown of the actor worker threads can avoid the added complexity.  I intend to cover the details of ActorGC very soon.  There should also be a default scheduler with daemon semantics coming, which has a number of use cases.

Closing Matter

There's a lot more going on.  Some recent flare-ups on Scala Internals mailing, despite being a tad melodramatic, brought a welcome focus on actors for the next release of Scala.  The issue has also given rise to two minimalistic actor implementations, one in Lift and the other in Scalaz.  They both make interesting data points for design and potential interoperability (remember: one of my primary goals is an actor implementation that lets you plug in what you need).  There's issues around ActorProxy that I think will be a little hairy to sort out, but I'm confident they will be.  And finally, there's the omnipresent issue of ensuring actor's really make the guarantees that they claim (right now I think they do, but I wouldn't place money on it until I have tests to prove it).

That's it for now.  I'm going to try to take the time to blog about many of the above issues and more in depth in the coming weeks, and hopefully gain some insights from out in the cloud.

Sphere: Related Content

Tuesday, April 21, 2009

McKinsey and Cloud Computing

McKinsey has created a tempest-in-a-teapot by denouncing the economics behind both in-the-cloud-cloud such as Amazon E2C and behind-the-firewall clouds for large enterprises.  At a high level I think their analysis is actually pretty good, but the conclusions misleading due to a semantic twist.  They use Amazon E2C as a model, and their conclusions go something like this:

  1. Amazon E2C virtual CPU cycles are more expensive than real, in-house CPU cycles
  2. You waste 90% of those in-house CPU cycles
  3. You'll waste almost as many of those virtual cloud CPU cycles, only they cost more, so they are a bad deal
  4. You stand a decent shot at saving some of those real CPU cycles through virtualization, so you should aggressively virtualize your datacenter
  5. You're too inept to deliver a flexible cloud behind-the-firewall, so don't even try

I'll let you ponder which of the above statements is misleading while I address some related topics.

The goals of cloud computing are as old as computing itself.  They are:

  1. Reduce the time it takes to deploy a new application
  2. Reduce the marginal cost of deploying a new application over "standard" methods
  3. Reduce the marginal increase to recurring costs caused by deploying a new application over "standard" methods

Back in the days of yore, when programmers were real men, the solution to this was time sharing.  Computers were expensive and therefore should be run at as a high of utilization as possible.  While making people stand in line a wait to run their batch jobs was a pleasing ego trip for the data center operators, the machines still wasted CPU time while performing slow I/O operations and waiting in line generally made users unhappy.  Thus time sharing was born, and in a quite real sense the first cloud computing environments, because in many cases a large institution would purchase and host the infrastructure and then lease it out of smaller institutions or individuals.

The problem here is that the marginal cost equations end up looking like a stair-step function.  If you had a new application, and your enterprise / institution had excess mainframe capacity, then the marginal cost of letting you run your application was near zero.  But if there was no spare capacity - meaning the mainframe was being efficiently utilized - then the marginal cost was high because either someone else had to be booted off or you needed an additional mainframe.

Now fast-forward a couple decades to the PC revolution.  Somewhere along the way the cost curves for computers and people crossed, so it became appropriate to let the computer sit idle waiting for input from a user rather than having a user sit idle while waiting for a computer.  Now you could have lots of computers with lots of applications running on each one (although initially it was one application at at time, but still, the computer could run any number of them).  This smoothed out the non-recurring marginal cost curve, but as PCs proliferated it drove up recurring costs through sheer volume.

Unfortunately this had problems.  Many applications didn't work well without centralized backends, and some users still needed more compute power than could be reasonably mustered on the desktop.  So the new PCs were connected to mainframes, minicomputers, and eventually servers.  Thus client-server computing was born, along with increasingly confusing IT economics.  PCs were cheap, and constantly becoming cheaper, but backend hardware remained expensive.  The marginal non-recurring cost becomes completely dependent on the nature of the application, and recurring costs simply begin to climb with no end in sight.

Now fast forward a little more.  Microsoft releases a "server" operating system that runs on suped up PCs an convinces a whole bunch of bean counters that they can solve their remaining marginal non-recurring cost problems with Wintel servers that don't cost much more than PCs.  Now more expensive servers.  No more having to divide the cost of a single piece of hardware across several project.  Now if you want to add an application you can just add an inexpensive new Wintel server.  By this time the recurring cost equation had already become a jumbled mess, and the number of servers was still dwarfed by the PC on every desk, so there no tying back the ever increasing recurring costs.  This problem was then further exacerbated by Linux giving the Unix holdouts access to the same cheap hardware.

Thus began the era of one or more physical servers per application, which is where we are today, with McKinsey's suggestion for addressing: virtualization behind the firewall.  The problem with this suggestion is that, for a large enterprise, it isn't really that different from the cloud-in-the-cloud solution that they denounce as uneconomical.  One way is outsourcing a virtualized infrastructure to Amazon or similar, and the other is outsourcing it to their existing IT provider (ok, not all large enterprises outsource their IT, but a whole lot do).

Virtualization, in the cloud or otherwise, isn't the solution because it doesn't address the root cause of the problem - proliferation of (virtual) servers and the various pieces of infrastructure software that run on them, such as web servers and databases.  Hardware is cheap.  Software is often expensive.  System administrators are always expensive.  Virtualization attacks the most minor portion of the equation.

Virtualization is the right concept applied to the wrong level of the application stack.  Applications need to be protected from one another, but if they are built in anything resembling a reasonable way (that's a big caveat, because many aren't) then they don't need the full protections of running in a separate OS instance.  There's even a long standing commercially viable market for such a thing: shared web hosting.

It may not be very enterprisey, but shared web site/application hosting can easily be had for about $5 per month.  The cost quickly goes up as you add capabilities, but still - companies are making money by charging arbitrary people $5 per month to let them run arbitrary code on servers shared by countless other customers running arbitrary code.  How many enterprise IT organizations can offer a similar service at even an order-of-magnitude greater cost?

Not many, if any.  Yet do we see suggestions pointing out that Apache, IIS, Oracle, SQL Server, and countless other pieces of infrastructure can relatively easily be configured to let several applications share compute resources and expensive software licenses?  Nope.  They suggest you take your current mess, and virtualize it behind the firewall instead of virtualizing it outside the firewall.

Sphere: Related Content

Tuesday, February 10, 2009

Refactoring Scala Actors: Rethinking the Approach

When I started refactoring Scala's actor library, I really had several goals:

  1. Reduce the coupling within the library so that the implementation can be more easily extended and customized
  2. Create a more transparent and "programmer friendly" actor implementation
  3. Improve performance of various use cases
  4. Make actors that interact better with their environment, particularly non-actor code
  5. Maintain API compatibility with the existing library to the maximum extent practical

Thus far I've done my work by completely overhauling the Actor trait directly in the library.  While I have plans for how this will reduce coupling, the haven't come to fruition yet.  I think my state machine model is considerably more transparent and programmer friendly than the current implementation.  The state an actor in is always clear, transition are reasonably well defined, and performing correct locking and unlocking is now pretty straight forward.  I've substantially improved the performance of event-based actors for the common case where an actor loops while it receives messages until some termination criterion is reached.  I haven't done anything with making them interact better with their environment yet, as I believe Philipp Haller is in the process of incorporating some changes for which I submitted patches several months back that will help considerably (he doesn't appear to be using the patches directly, but the changes I've seen are close enough).

A few days ago David MacIver asked me a couple interesting questions on IRC:

  1. Have you considered using Kilim?
  2. Have you looked at the continuations support that appears to be emerging as a compiler plugin?

Using Kilim would almost certainly disqualify the changes I'm making from incorporation into the standard library because of the external dependency on a prerelease framework that uses bytecode manipulation, and I don't think the fledgling continuation support is mature enough to experiment with yet (or maybe just not well documented enough yet, who knows...).  That being said, both of these would be interesting avenues of development.  Event based actors rely heavily on continuations, and the performance of those continuations has a substantial effect on the performance of the actors.  Ultimately a properly decoupled implementation would allow someone to build an API compatible actor implementation on either of these technologies, or something else entirely.

I also received a recent nudge on the mailing list, when someone pointed out that it would be easier to experiment with my library if it was in it's own package.  I somewhat disagree with that.  It would be easier to experiment with it if you didn't have to do an entire build of the Scala distribution that has my code in it, and unfortunately for the foreseeable future I'm going to be depending on both code that is being committed into the trunk and on the ability to modify said code.  Also, the way I have it setup today, if someone wanted to test their Scala application/library/framework against my state-based actors, they would just have to pull my Mercurial repository, build it, and rebuild against the resulting distribution.  On the downside, it's a pain to do side-by-side comparisons between the two implementations, because you have to switch distributions and rebuild every time.

That being said, decoupling is one of my primary goals, and Martin & Company have already set the precedence of doing major work in a separate package with their efforts on redesigning the Scala collections library.  So as soon as I finish up some of my immediate tasks and have a good build again (I'm in the middle of redesigning the control flow pattern to minimize the stack size when exceptions are thrown), I'm going to push most of the public and protected interface on Actor into a new trait that will be between Actor and AbstractActor as abstract methods, move my code out of scala.actors and into scalax.actors in my own directory.  I definitely keep it within the same repository as the Scala distribution code, and will probably just make another directory for it.  This means I'll have to mess with Sabbus to add my portion to the build, which won't be fun, but shouldn't be too hard.  I'm sure there's going to be a lot more to do to extract it out, so I'll be adding items to my issues list as I think of them.

The end result should be my state based actors and the existing actors being able to live side-by-side and interact with one another in the same program.  Assuming I can at least get some patches into the main library, it will also mean that the future of my work will not be dependent on being incorporated into the standard distribution.  If it is, great, if not, I can distribute it separately.  I would have done this from the start, but initially I was highly dependent on the existing code and infrastructure into order to get something working reasonably quickly, and to be able to smoke it out to see if it indeed worked.  Back in December I was basically breaking partest with every change.  But now things are reasonably stable, I rarely break partest, and I don't think it will take much for my code to be able to stand on its own.

So what does everyone think?  Is this a good direction to head in?

Sphere: Related Content

Friday, February 06, 2009

Profiling the ExceptionBlob

Last week on the Scala IRC channel (#scala) I chatted with several people about the need to substantially reduce the use of exceptions for flow control within the Scala Actors library.  My remarks were met with a certain amount of healthy skepticism, as significant progress has been made on improving the performance of Java exceptions, as stated by John Rose in his blog Longjumps Considered Inexpensive.

I highly suggest you go read the post.  I don't doubt anything stated in it, and in fact noticed a substantial performance improvement when I stopped benchmarking against Java 5 and started doing it against Java 6 (I usually work on a Mac, and Java 5 is the default VM), and expect even more improvements with Java 7.  Once you're done, I'd like to call your attention to a couple fragments of it (additional emphasis mine):

The Hotspot JIT can optimize a throw of a preallocated exception into a simple goto, if the thrower and catcher are together in the same compilation unit (which happens because of inlining).

Here is a concrete example of non-local return, implemented efficiently via a cloned exception. I have observed the Hotspot server compiler emitting a machine-level goto for the throws. With current server VM optimizations of this code, the cost of a non-local return is less than three times the cost of a plain (local) return; with the client VM the ratio is ten. See the code example and benchmark for details: NonLocalExit.java (javadoc).

These statements have a couple implications:

  1. Reducing a throw to a long jump requires a certain amount of HotSpot magic take place to ensure that the throw and catch are in the same compilation unit.  The more complex the code is, the less likely this is to happen.
  2. Long jumps are still substantially slower than normal returns

I think John's post received a lot of attention, and people (including myself) saw it, didn't read it carefully, and assumed that HotSpot now performed miracles with cached exceptions.   What we missed was a post to the MLVM development mailing list about a year later by Charles Nutter, who had noticed that his non-local returns in JRuby where not being compiled down into long jumps, and subsequent response from John Rose.  There's a lot of very good technical discussion in there, but I think the gist is that HotSpot's ability to optimize throws into long jumps is still limited, it often needs help to do so, and very smart people are working at improving it.

Of course that all still leaves the question:  Just how much slower are cached exceptions than a normal return?  I can't really answer that in the general case, but I did put together a couple mircobenchmarks in Java that somewhat simulate the type of stack creation that happens within my state-based actors.  The benchmark is pretty simple.  It recursively descends down, incrementing a member variable, and then on the way back up it decrements another member variable.  In the case of the exception-based return, it does the decrements within a finally block.  This admittedly is rather pathological code, but it is also simple and easy to understand.

Here's the normal return version:

public class NormalReturn {
    public static void main(String[] args) {
        NormalReturn m = new NormalReturn();
        m.doTest(10000000);
    }
    public void doTest(int times) {
        for(int i = 0; i < times; i++) {
            down = 0;
            up = 0;
            deepRecurse(1000);
        }
        System.out.println(down);
        System.out.println(up);
    }
    private int down = 0;
    private int up = 0;
    public void deepRecurse(int i) {
        down++;
        if (down <= i) {
            deepRecurse(i);
            up--;
        }
    }
}

...and here's the version that uses a cached exception:

public class TryFinally {
    private static class Signal extends RuntimeException {
        @Override
        public Signal fillInStackTrace() { return this; }
    }
    private Signal signal = new Signal();
    public static void main(String[] args) {
        TryFinally m = new TryFinally();
        m.doTest(10000000);
    }
    public void doTest(int times) {
        for(int i = 0; i < times; i++) {
            try {
                down = 0;
                up = 0;
                deepRecurse(1000);
            } catch (Signal s) {
                // do nothing
            }
        }
        System.out.println(down);
        System.out.println(up);
    }
    private int down = 0;
    private int up = 0;
    public void deepRecurse(int i) {
        try {
            down++;
            if (down == i) throw signal;
            else deepRecurse(i);
        } finally {
            up--;
        }
    }
}

With profiling turned off on and -server on Java 6 on my Mac, the normal return version takes about 60 seconds to complete, while the TryFinally version takes over 10 minutes.  That's certainly much better than the 30x figure John Rose cited for the client VM, but it's still an order-of-magnitude worse than a normal return.  One interesting thing about the benchmark is that if you decrease the recursion depth and increase the number of times deepRecurse() is executed the total time stays relatively constant.

So with this in mind, I'm going to work hard to refactor the actors library to keep exception usage to a minimum, an where it must be used to minimize the depth of the stack.

Sphere: Related Content

Scala Actors versus the ExceptionBlob

I've begun some very preliminary benchmarking of my state-based actors library against the current standard actors library and seen some rather interesting results.  The test I'm running spawns four pairs of actors and sends about 40 million messages between each pair.  The only logic in the actors is to count how many messages they have received, ensure they are received in order, and occasionally send back a "More" message.  This is so the mailbox doesn't overfill and cause out-of-memory errors.

The benchmark has three different incarnations, each testing a different way of processing messages until some termination condition occurs.  The versions are:

  • loop
  • recursive react
  • reactWhile (only available with the State-based actors, but relatively easy to port to "classic" actors)

The code for the loop version is below:

class LoopPingSender(val totalPings: Int, val pingReceiver: LoopPingReceiver) extends Actor {
  private var pingsSent = 0
  def act() {
    Actor.loop {
      pingReceiver ! Ping(pingsSent)
      pingsSent += 1
      if (pingsSent == totalPings) {
        println("Sender Done")
        exit("sender completed successfully")
      } else if ((pingsSent % 1000) == 0) react {
        case More => 
      }
    }
  }
}
class LoopPingReceiver(val totalPings: Int) extends Actor {
  private var pingsReceived = 0
  def act() {
    Actor.loop {
      react {
        case Ping(n) if n == pingsReceived => {
          pingsReceived += 1
          if (pingsReceived == totalPings) {
            println("Receiver Done")
            exit("receiver completed successfully")
          } else if ((pingsReceived % 1000) == 0) {
            reply(More)
          }
        }
        case Ping(n) => {
          println("error")
          exit("invalid n value: " + n)
        }
      }
    }
  }
}

The non-scientifically gathered results are as follows, all times in seconds:

Version Standard Actors State-based Actors
loop 298.22 96.84
Recursive react 90.25 113.57
reactWhile N/A 40.63

Standard Actor loop

Let's consider the loop version first.  loop in the standard actor implementation is rather convoluted.  Rather than directly supporting looping, the standard actor implementation provides a package private hook for a function that is executed when the actor is terminating.  loop uses this hook to subvert the termination process by throwing an exception and subsequently restarting execution of the actor.  This makes it so that two exceptions are thrown per message processed rather than the normal one.  Here's a small snippet from the profiling results:

     Compiled + native   Method                        
 40.0%  5148  +     0    ExceptionBlob
 13.9%   658  +  1133    scala.actors.Scheduler$$anon$2.run
 13.1%   391  +  1299    scala.actors.Actor$.loop
  3.4%   441  +     0    scala.actors.FJTaskRunner.run
--- snip ---
 73.7%  6705  +  2771    Total compiled
         Stub + native   Method                        
 20.6%     0  +  2654    java.lang.Thread.isInterrupted

What you can see is that the ExceptionBlob is dominating the runtime, followed by Thread.isInterrupted.  ExceptionBlob is a class in the JVM that observes the unwinding of the stack.  As far as I could discern through so separate microbenchmarking (which I'll discuss another day, after I've spent some time in HotSpot code), the amount of time consumed by ExceptionBlob is proportionate depth of the stack being unwound by exceptions.  In other words, throwing fewer exceptions doesn't do any good if you use them to unwind deeper stacks, and unwinding shallower stacks doesn't do any good if you throw more exceptions.  Caching exceptions doesn't help, either, but I digress.  The bottom line is that loop is pretty inefficient in the standard actors implementation.

State-based Actor loop

In my state-based actors, there are two states to handle looping, both of which are subclasses of Running.  The first is Looping, which is initially entered when the actor starts looping.  The second is ResumeLooping, which is used to resume looping after the actor has been suspended.  Both run the body of the react in a tail-recursive loop until their are no more messages to process, at which time they suspend the actor.  This means that many message can potentially be processed before any exception-based signaling is used.  The following is a snippet from a profile:

     Compiled + native   Method                        
 67.5%  3233  +     0    ExceptionBlob
  6.4%   306  +     0    scala.actors.Actor$class.withLock
  4.9%    44  +   189    actorbench.LoopPingSender$$anonfun$act$1.apply
  4.8%    97  +   135    scala.actors.Actor$$anonfun$send$1.apply
  4.5%   116  +    99    scala.actors.Actor$Running.tr$1
  3.8%    36  +   145    scala.actors.Actor$ResumeLooping.enter
--- snip ---
         Stub + native   Method                        
  0.2%     0  +     9    java.lang.Thread.isInterrupted

Again, ExceptionBlob dominates the execution time, although the total time it consumes is lower.  Unfortunately, despite using significantly fewer exception signals, my state-based actors tend to build up significantly deeper stacks than the actors in the standard library.

Standard Actor recursive react

Recursive react seems to be the best performing looping pattern for the standard actor library.  It's a pretty simple pattern:

class PingSender(val totalPings: Int, val pingReceiver: PingReceiver) extends Actor {
  def act() {
    def recurse(i: Int): Unit = {
      if (i > 0) {
        pingReceiver ! Ping(i)
        if ((i % 1000) == 0) {
          react {
            case More => recurse(i - 1)
          }
        } else {
          recurse(i - 1)
        }
      } else {
        println("Sender Done")
      }
    }
    recurse(totalPings)
  }
}

As you can see, the function containing the react block is simply recursively called as needed.  Note that despite the fact that the recursive call appears to be in a tail position, it is not a tail recursive call and will not be optimized by the compiler.  The PartialFunction defined within the react block is actually a separate object with an apply method, so while recurse and the PartialFunction are mutually recursive, they are not tail recursive.

     Compiled + native   Method                        
 19.1%   456  +     0    ExceptionBlob
 13.0%    89  +   221    scala.actors.Reaction.run
  8.6%    53  +   152    scala.actors.Actor$class.send
  7.9%    61  +   127    scala.actors.Actor$class.react
  5.3%    59  +    67    scala.Iterator$class.toList
--- snip ---
 68.1%   862  +   765    Total compiled

         Stub + native   Method                        
 26.4%     0  +   631    java.lang.Thread.isInterrupted
--- snip ---
 26.9%     0  +   643    Total stub

This time Thread.isInterrupted dominates, with ExceptionBlob not too far behind.

Actor States recursive react

It turns out that the "hybrid trampoline" I implemented to facilitate recursive reacts without giving up the thread provides pretty bad performance.

 64.9%  3714  +     0    ExceptionBlob
 11.4%   568  +    85    scala.actors.Actor$class.withoutLock
  8.4%   478  +     1    scala.actors.Actor$class.withLock
  5.1%    90  +   201    scala.actors.Actor$$anonfun$reactWithin$1.apply
  4.2%   115  +   125    scala.actors.Actor$$anonfun$send$1.apply
--- snip ---
 97.8%  5006  +   595    Total compiled
         Stub + native   Method                        
  0.1%     0  +     6    java.lang.Thread.currentThread
--- snip ---
  0.2%     0  +    12    Total stub

The problem is that when I wrote it, I thought that the cost of throwing exceptions was primarily proportionate to the number of exceptions thrown rather than the amount of stack that is unwound. Consequently, I made the trampoline only bounce once every n invocations in order to reduce the number of exceptions being thrown. As it turns out, what I should do is find a place where the stack is likely to be the shallowest, and throw an exception every time.

Actor States reactWhile

reactWhile addresses the pitfalls presented in the previous versions and provides roughly twice the performance of next best performer, recursive react on the standard actor library.

     Compiled + native   Method                        
 30.0%   171  +   331    scala.actors.Actor$Running.tr$1
 24.4%   263  +   146    actorbench.ReactWhileReceiver.withLock
 12.4%    26  +   181    actorbench.ReactWhileSender.sendPings
  8.5%   142  +     0    scala.actors.Actor$class.withoutLock
--- snip ---
 79.0%   660  +   661    Total compiled

         Stub + native   Method                        
 15.1%     0  +   252    java.lang.Thread.isInterrupted
  0.9%     0  +    15    java.lang.Thread.currentThread
 16.0%     0  +   267    Total stub

ExceptionBlob has completely disappeared from the profile. That is because reactWhile makes minimal use of exceptions for signaling.  Consequently, its performance is dominated by some of the complicated for my (worthless) hybrid trampoline, my completely unoptimized lock implementation, and calls to Thread.isInterrupted that are made when the bound thread is returned to the pool when the actor suspends.  I'll write a blog on how reactWhile works at a later date, after I cleanup some of the other problems that I've listed.

Parting Matter

It really annoys me when people post benchmark results without providing details of their setup along with all of the code required to duplicate them, and that's just what I've done.  I've also benchmarked a very narrow set of behavior, failed for warm up the JVM, and not run multiple trials.  Developing a lightweight framework for benchmarking actor performance, along with a more comprehensive set of benchmarks is on my to-do list.  Once I have that, I'll start doing more rigorous benchmarking.  But I don't think I quite need it yet because bottlenecks are showing up quickly and consistently in my profiling, and in most cases they seem to have reasonably straight forward solutions.  All the above profiles against my library were done against revision 49658b3475e2 in my bitbucket repository, so you can at least access that code.

Sphere: Related Content

Saturday, January 31, 2009

Google is Paranoid

Apparently Google has recently become extremely paranoid about what may harm my computer. Here's a screenshot of some search results:

Looks like every result is dangerous...now here's what happens when I click on one of these dangerous links:

Must be Apple's developer website is out to get me...

Note: The behavior seems to have stopped. Quick failed Google experiment?

Sphere: Related Content

Saturday, January 24, 2009

Refactoring Scala Actors

I've been experimenting in my spare time with the Scala actors library for about a year now. Initially it was mostly as part of my participation in the Widefinder project, but for the past several months I've been working on building more complicated abstractions using the actor library. The challenge with Widefinder, if you chose to keep your parallelism implementation generic and reusable, was to spread line-oriented input stream processing across as many threads as possible (ok, some may disagree with that, but that was my simplified take). This was relatively straight forward because the lines didn't have to be processed in any given order. Actors worked well for the task, and I was able to build a fairly generic set of classes that could be used in a very natural way for any similar problem.

I was feeling pretty good about Widefinder so I decided tackle a more complicated problem: parallelizing XML parsing (I'll write more on that later). In the process of exploring various ideas to help parallelize the parsing of XML I started butting my head against limitations of the actors library. For example, I built an "asynchronous stream" class - something like a normal Scala stream but instead of being entirely lazy, it asynchronously evaluates itself ahead of what has already been consumed, effectively parallelizing computation of the stream. This worked, except that if the stream wasn't fully evaluated the application using the stream wouldn't terminate properly because it's actor was still active. That made sense, as in many actor based applications you wouldn't want them to terminate before all the actors were finished, so I tried to make a custom scheduler that used daemon threads and therefore would allow the application to terminate without all of the actor threads terminating. This seemed like a straight forward task, except that it turned out to not be possible given some current limitations (link to enhancement about schedulers). So I set myself to fixing/enhancing the actors library to meet my needs and submit some patches.

Making fixes and enhancements to the actors library turned out to be a much bigger challenge than I expected. The external API for actors, particularly the documented part, is pretty clean and flexible. However internally, at least to a new set of eyes, there is a lot of spaghetti (count methods and variables that are defined as private[actors] so all the classes in the package can muck with them), oddly named variables and methods, and over a dozen mutable variables on the Actor trait that must all be correctly set in the right combination in order for the actor to work - and completely lacking any sort of structured means of manipulating them. I braved this, submitted some patches, and even wrote several enhancements that I did not submit, but I kept running into subtle problems that seemed almost impossible to track down. I would sprinkle assertions throughout the code only to realize that even when things seemed to work much of what I assuming was really holding. In other words I was getting lucky - or unlucky - as challenges like this are pretty common when doing parallel programming.

By this time I had a reasonable, albeit incomplete and with some inaccuracies, idea of what all those magical instance variables within Actor do and how they correspond to meaningful states. Here is my take on it:

  1. waitingFor - the partial function that the actor is using to wait for a matching message
  2. shouldExit - used to signal that the actor should exit instead of doing normal processing
  3. isSuspended - seems to indicate that the actor is waiting while blocked on a thread
  4. isDetached - the actor is waiting and is not attached to a thread
  5. isWaiting - the actor is waiting in some way
  6. mailbox - queue of messages for the actor
  7. sessions - list used as a stack to keep track of the sender of messages while they are being processed
  8. received - the message that was received, used to pass around a received message, could be replaced with local variable within functions
  9. exitReason - the reason why the actor exited, default to 'normal when the actor is initialized and can be modified at any time, so it is there even when the actor has not yet exited.
  10. onTimeout - used for keeping track of a timeout action so that it can be canceled if a message is received within the timeout period
  11. kill - a function invoked by reaction prior to invoking exit on the actor. It is used by loop and loopWhile to bypass the normal exit process of reactions
  12. trapExit - if false (the default) then the actor will exit when an actor that it is linked to exits, if it's true then it will receive an Exit message
  13. rc - a reply channel - unclear why it is actually needed because it is never used other than to be set in freshReplyChannel on every invokation
  14. links - a list of linked actors. Linked actors are notified or shutdown when the actor that they are linked to shuts down
  15. continuation - the function to be executed when the actor resumes

All those were pretty confusing, and it took me weeks to make any sense out of them - and usually the sense I made contained subtle flaws, so I decided to do something drastic. In my opinion the actors library is in desperate need of being refactored into something that encapsulates its state better and makes it much more explicit so you do not need to be an advanced adept to figure out what is happening inside the library. So I decided to do that refactoring and see where it lead me.

I distilled much of the above instance variables into the following states, which I have represented as immutable objects:

  • NotStarted - the initial state of the actor
  • Active States
    • StartScheduled - the actor has been started but has not yet been attached a thread or started executing its body
    • Running States - states where the actor is bound to a thread and is executing its body
      • Running - the actor is bound to a thread and is executing its body
      • Looping - a special case of running where the body is repeated while a specified condition holds
      • Killing - when an actor is Running it cannot be directly stopped by the library, so it is shifted into the Killing state. When the library receives control while the actor is running it checks to see if the state has been changed to Killing. If the state has been changed to Killing, then it takes actions to regain control from user code so that the actor can be cleanly shutdown.
    • Wait States
      • BlockedWait - the type of wait performed within a receive block where the actor is waiting for a message while attached to a thread. There may be a timeout specified after which the actor will wakeup.
      • DetachedWait - the type of wait performed within a react block when no matching message is available.
    • ContinuationScheduled - similar to StartScheduled, only the processing of the message the actor was waiting for when in the DetachedWait state is scheduled instead of the starting of the actor.
  • Terminal States
    • Finished - the actor teriminated normally
    • Killed - the actor was killed before it could complete execution
    • UnhandledThrowable - an exception was thrown, usually from "user code," and not handled
    • InternalError - an error occurred within the actor library

...and now there are the following instance variables:

  1. state - the state of the actor
  2. mailbox queue of messages for the actor
  3. sessions list used as a stack to keep track of the sender of messages while they are being processed
  4. trapExit - if false (the default) then the actor will exit when an actor that it is linked to exits, if it's true then it will receive an Exit message
  5. links - a list of linked actors. Linked actors are notified or shutdown when the actor that they are linked to shuts down .

Reducing the statefulness of the actor, and more importantly making the primary state explicit instead of implicit, makes the code much easier to both debug and extend.  In working with actors, I found that it's relatively common to have an actor just seem to stop.  It could be because it's waiting for a message, it could be because it's stuck in a loop, it could be any number of things.  Without explicit state, it's very difficult to tell exactly what's happening with an actor.  With explicit state, it's right there in your debugger.

I have the code in a Bitbucket repository in the "actor_state_machine" branch.  Most of the changes are in the actor.scala file and I highly encourage you to take a look at the differences.  Right now it the Scala test suite completes with the same results as the trunk as of a couple months ago, and I've added a couple tests to catch problems that it didn't.  Next I'm going to do a little more refactoring and general code cleanup, then I'm going to focus on expanding the test suite.  The current Scala testing tool, partest, uses actors heavily so it gives them a workout, but there aren't any explicit tests for actors.  Finally, I'm going to try to decouple the states from one another and try to extract their code from the Actor trait so that they can be tested as independent units and are more extensible.  I'm not sure what I'm actually going to do with the code.  I'd be happy to contribute it back to the EPFL, but given it is such a substantial departure I'm not sure they would want it.  So I'll probably end up moving it into another namespace and distributing it as a separate library.

Sphere: Related Content