Sunday, June 21, 2009

Pondering Actor Design Trades

There's been a lot of discussion of the Scala actors library lately, much of it critical, and a recent flurry of alternate implementations.  The alternate implementations (except my languishing state-based one ;-) all have one thing in common:  They are several orders of magnitude simpler.  Writing a basic actor implementation is actually pretty trivial, especially given java.util.concurrent classes that provide a decent chunk of the functionality in Scala actors, all for free on JDK5+.  So this raises a few questions:

  1. Why is the standard Scala actor implementation so complex when others have done it in a much simpler fashion?
  2. Is it better to have one, big actor library that supports a wide variety of use cases, or a bunch of smaller ones targeted at specific niches and programming styles?
  3. If there are to be a bunch, should they just be conceptually similar (e.g. all based on the actor model), or should there be interoperability among them?

I'm not going to answer these questions now.  Instead, I'm going to try to start laying out some of what I believe to be the key characteristics of an actor implementation, and how they detract from or reinforce one another.  So here it goes:

  1. Guarantees
  2. Expressivity
  3. Extensibility
  4. Performance
  5. Scalability

Guarantees

The purpose of a concurrency framework is to make concurrency easier.  Concurrency is hard largely because it is extremely difficult to reason about, and thus concurrent code tends to be hard to write, laden with bugs, and subject to various odd pitfalls.  By providing various guarantees, a concurrency framework makes it easier to reason about concurrent code.  Actors are intended to free the programmer from worrying about things like locks, semaphores, thread management, etc. by encapsulating all that complexity behind a simple interface, assuming the programmer follows some basic rules like "no shared mutable state among actors."
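
As a rough illustration of what such a guarantee looks like in code, here is a minimal sketch of an event-based actor built only on java.util.concurrent.  The names MiniActor, Counter, and MiniActorDemo are hypothetical and this is not how Scala Actors or Lift Actors are implemented; it only shows how a mailbox plus one atomic flag lets the user keep plain, unlocked state inside the handler.

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, CountDownLatch, ExecutorService, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical minimal actor (not the scala.actors or Lift API): a mailbox
// plus a flag ensuring that at most one thread runs the handler at a time.
abstract class MiniActor(pool: ExecutorService) {
  private val mailbox = new ConcurrentLinkedQueue[Any]
  private val running = new AtomicBoolean(false)

  protected def handle(msg: Any): Unit

  def !(msg: Any): Unit = {
    mailbox.add(msg)
    schedule()
  }

  // Start a drain only if there is work and nobody else is draining.
  private def schedule(): Unit =
    if (!mailbox.isEmpty && running.compareAndSet(false, true))
      pool.execute(new Runnable { def run(): Unit = drain() })

  private def drain(): Unit =
    try {
      var msg = mailbox.poll()
      while (msg != null) { handle(msg); msg = mailbox.poll() }
    } finally {
      running.set(false)
      schedule() // a message may have arrived after the last poll
    }
}

// Actor-private state is a plain var: handle never runs concurrently.
final class Counter(pool: ExecutorService) extends MiniActor(pool) {
  private var total = 0
  val done = new CountDownLatch(1)
  def result: Int = total
  protected def handle(msg: Any): Unit = msg match {
    case n: Int => total += n
    case _      => done.countDown()
  }
}

object MiniActorDemo {
  def sum(senders: Int, perSender: Int): Int = {
    val pool = Executors.newFixedThreadPool(4)
    val counter = new Counter(pool)
    val threads = (1 to senders).map { _ =>
      new Thread(new Runnable {
        def run(): Unit = (1 to perSender).foreach(_ => counter ! 1)
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    counter ! "done" // queued after every increment, so it is handled last
    counter.done.await(10, TimeUnit.SECONDS)
    pool.shutdown()
    counter.result
  }
}
```

Several threads send to the same Counter, yet the total needs no lock as long as the "no shared mutable state" rule is followed.  The sketch leaves out everything that makes a real library hard: selective receive, linking, exception handling, and shutdown.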

The problem with guarantees is that they tend to break down in the presence of limited CPU and memory resources.

Expressivity

Expressivity is difficult to define.  For purposes here, I'm going to define it as the degree to which a concise, natural expression of the programmer's intent is supported, and illustrate it by comparing Scala Actors to Lift Actors.  Scala Actors allow you to execute logic independent of message processing (note: this is a violation of the theoretical model for actors) by simply placing it in the act method.  Lift Actors, on the other hand, are only triggered when they receive a message (this is consistent with the theoretical model).  This means, for example, that Scala Actors can perform possibly costly setup operations in their own thread before they start listening for messages.  In order to accomplish this in the Lift model, the programmer must create the actor and then send it some sort of "init" message.  The same effect can be achieved with both implementations, but it is more naturally supported by Scala Actors.  Of course there is a tradeoff here, as deviating from the theoretical model potentially weakens any guarantees that the model may provide.  The Scala Actor way also implies that an Actor has an explicit lifecycle, which as we'll see later has other significant implications.
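
The "init" message workaround can be sketched without either library.  In the following hypothetical example (CacheActor, Init, and Lookup are made-up names, and a single-thread executor stands in for the actor's scheduler), construction is cheap and the costly setup only happens once the actor is told to do it, on the actor's own thread rather than the caller's.

```scala
import java.util.concurrent.{ExecutorService, Executors, TimeUnit}

object InitMessageDemo {
  case object Init
  final case class Lookup(key: String)

  // Hypothetical event-only actor: all work, including setup, happens in
  // response to a message, on the executor thread rather than the caller's.
  final class CacheActor(exec: ExecutorService) {
    private var table: Map[String, Int] = Map.empty
    @volatile var setupThread: String = ""
    @volatile var lastResult: Option[Int] = None

    def !(msg: Any): Unit =
      exec.execute(new Runnable { def run(): Unit = handle(msg) })

    private def handle(msg: Any): Unit = msg match {
      case Init =>
        // Stand-in for a costly setup step.
        table = (1 to 1000).map(i => (i.toString, i * i)).toMap
        setupThread = Thread.currentThread().getName()
      case Lookup(k) => lastResult = table.get(k)
      case _         => ()
    }
  }

  def run(): (String, Option[Int]) = {
    val exec = Executors.newSingleThreadExecutor()
    val actor = new CacheActor(exec)
    actor ! Init         // construction was cheap; setup is requested explicitly
    actor ! Lookup("12")
    exec.shutdown()
    exec.awaitTermination(5, TimeUnit.SECONDS)
    (actor.setupThread, actor.lastResult)
  }
}
```

The cost of the pattern is visible in the calling code: whoever creates the actor has to remember to send Init before anything else, which is exactly the kind of intent that the act method expresses implicitly.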

Another example is what I'll call the "nested react pattern."  It is relatively common to want an actor to take on a different behavior after processing a message, thus altering which messages are ignored and how the received messages are processed.

loop {
  react {
    case 'foo => {
      // do some stuff...
      react {
        case 'bar => // do some other stuff...
      }
    }
  }
}

The code above alternates between processing 'foo messages and 'bar messages.  This can be done with Lift Actor as well, but the expression is a little less natural:

class MyActor extends LiftActor {
  private val fooMode: PartialFunction[Any, Unit] = {
    case 'foo => {
      // do some stuff
      mode = barMode
    }
  }
  private val barMode: PartialFunction[Any, Unit] = {
    case 'bar => {
      // do some other stuff...
      mode = fooMode
    }
  }
  private var mode = fooMode
  protected def messageHandler = mode
}
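
For readers who want to run the mode-switching idea without a Lift dependency, the following is a self-contained sketch of the same technique.  Alternator and its deliver method are hypothetical, messages are delivered synchronously, and strings replace the symbols; a real actor would leave an unmatched message in its mailbox instead of just reporting it.

```scala
object ModeSwitchDemo {
  // Hypothetical stand-in for an event-only actor; delivery is synchronous
  // here so the behaviour switch is easy to follow.
  final class Alternator {
    val log = scala.collection.mutable.ListBuffer.empty[String]

    private val fooMode: PartialFunction[Any, Unit] = {
      case "foo" =>
        log += "foo"
        mode = barMode
    }
    private val barMode: PartialFunction[Any, Unit] = {
      case "bar" =>
        log += "bar"
        mode = fooMode
    }
    // The current behaviour; each handler swaps it after processing a message.
    private var mode: PartialFunction[Any, Unit] = fooMode

    // Returns false when the current mode ignores the message.
    def deliver(msg: Any): Boolean =
      if (mode.isDefinedAt(msg)) { mode(msg); true } else false
  }
}
```

Delivering "foo", "foo", "bar" processes the first and third messages and ignores the second, since the actor is in barMode when it arrives.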

Finally, Lift Actors exclusively use an event-based model and have no support for blocking on a thread while waiting for a message, and thus lose the ability to express patterns such as the following:

loop {
  react {
    case 'convertToNumber => {
      val i: Int = receive {
        case 'one => 1
        case 'two => 2
        case 'three => 3
      }
      reply(i)
    }
  }
}

Extensibility

For purposes here, I'm going to say that a piece of software is extensible if capabilities can be added without modifying the core or breaking its semantics, with an amount of effort proportional to the size of the extension.  This is narrower than the traditional definition of extensibility, which also covers the ability of a system to evolve internally.  A good example of extensibility is the ability of both Scala Actors and Lift Actors to allow the user to specify a custom scheduler.  Other examples could include adding control structures or using a different data structure for a mailbox.

The challenge with extensibility is that in order to enable it, what could otherwise be treated as the internal bits of the library must instead have well defined interfaces for components along with appropriate hooks for inserting them.  For example, a while ago I did some work to make the MessageQueue used for the mailbox overrideable (it has temporarily been overcome-by-events due to other changes).  This is a small example, but it shows how extensibility requires a greater degree of forethought.
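
A small sketch of what such a hook can look like for schedulers follows.  The Scheduler trait here is hypothetical and deliberately tiny; it is not the IScheduler interface from scala.actors, and the Echo "actor" omits the mailbox entirely so that only the extension point is visible.

```scala
import java.util.concurrent.ExecutorService

object SchedulerHookDemo {
  // Hypothetical extension point; not the scala.actors IScheduler interface.
  trait Scheduler { def execute(task: () => Unit): Unit }

  // Runs reactions on the caller's thread; handy for deterministic tests.
  object SameThreadScheduler extends Scheduler {
    def execute(task: () => Unit): Unit = task()
  }

  // Hands reactions to any java.util.concurrent executor.
  final class PoolScheduler(pool: ExecutorService) extends Scheduler {
    def execute(task: () => Unit): Unit =
      pool.execute(new Runnable { def run(): Unit = task() })
  }

  // The "core" only knows the Scheduler trait, so either one can be plugged in.
  final class Echo(scheduler: Scheduler, sink: String => Unit) {
    def !(msg: String): Unit = scheduler.execute(() => sink("echo: " + msg))
  }

  def demo(): List[String] = {
    val out = scala.collection.mutable.ListBuffer.empty[String]
    val echo = new Echo(SameThreadScheduler, s => out += s)
    echo ! "a"
    echo ! "b"
    out.toList
  }
}
```

The forethought shows up in the constructor: once the core accepts a Scheduler instead of creating its own threads, swapping in PoolScheduler requires no change to Echo.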

Extensibility also benefits substantially from simplicity.  Scala Actors are almost impossible to extend from outside the scala.actors package because of their heavy reliance on package-private methods and state (mostly fixed here, but I broke remote actors in the process so no patch yet).  Lift Actors, on the other hand, are very extensible, at least within the bounds of their design (purely event-based actors with no explicit lifecycle).  Many of the flow control mechanisms could be implemented on top of the baseline approach.

At this point we see that extensibility has an interesting relationship with expressivity.  I previously claimed that Scala Actors were more expressive because of the wide variety of control structures they provide (and I didn't even touch on some of the DSL-like functionality that enables all sorts of interesting things).  However, given Lift Actors' far simpler and more extensible foundation, there is much more opportunity to create custom control structures as extensions to Lift Actors without modifying the core.  Thus, if you are willing to do some lower-level programming, it could be argued that Lift Actors are in reality more expressive due to their extensibility.

Performance and Scalability

For purposes here, I'm going to treat performance as the rate at which an actor can receive and process messages at a relatively small, fixed number of simultaneous actors.   This means that improving performance is largely a matter of reducing the time it takes from when a message is initially sent to when user-code within the actor begins processing the message, including minimizing any pause between when an actor finishes processing one message and is available to start processing the next.  For moderate numbers of actors, performance is often maximized by having one thread per actor, and having the actor block while waiting for a message.  Given enough actors, the memory requirements of using a thread for each actor will eventually cause more slowdown than the cost of scheduling a new reaction for each message.  This is illustrated in Philipp Haller's paper, "Actors that Unify Threads and Events" in the following graph:

Note that the above graph covers a microbenchmark running a simple, non-memory intensive task, and that the thread line is not a measurement of thread-bound actors, but rather of a simple threaded implementation.  However, my own benchmarking has shown that receive-based actors (ones that block on a thread) compare to event-based actors in almost the same way as threads to event-based actors in the above graph.  Also, remember that given a real application where heap space is needed for things besides the stacks of thousands of threads, the point where the JVM throws an OutOfMemoryError will be much farther to the left.  There are also more subtle issues.  One of my first experiences with the Scala Actors library was creating a deadlock.  I created more thread-bound actors than the scheduler wanted to create threads, and thus actors were stuck blocking on threads waiting for messages from an actor that hadn't started yet because there were no available threads.  In other words, blocking can lead to situations such as deadlock, starvation, and simply extreme forms of unfairness with respect to how much CPU time is allocated to each actor.  These all go against highly desirable guarantees that an actor library should provide outside of extreme circumstances.
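
The deadlock described above is easy to reproduce in miniature with a plain fixed-size thread pool.  This is a sketch under assumptions: StarvationDemo is a hypothetical name, two Runnables stand in for thread-bound actors, and timeouts are used so the example terminates instead of hanging.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}

object StarvationDemo {
  // Returns true if the blocked "actor" received its message in time.
  def run(poolSize: Int): Boolean = {
    val pool = Executors.newFixedThreadPool(poolSize)
    val message = new CountDownLatch(1)  // "sent" by the second actor
    val finished = new CountDownLatch(1)

    // First "actor": occupies a thread while blocking for a message.
    pool.execute(new Runnable {
      def run(): Unit =
        try { if (message.await(2, TimeUnit.SECONDS)) finished.countDown() }
        catch { case _: InterruptedException => () } // shutdownNow interrupts us
    })
    // Second "actor": would send that message, but needs a thread to start.
    pool.execute(new Runnable { def run(): Unit = message.countDown() })

    val ok = finished.await(500, TimeUnit.MILLISECONDS)
    pool.shutdownNow()
    ok
  }
}
```

With two threads the message gets through; with one thread the sender never starts, because the only thread is held by the actor waiting for it.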

Ultimately event-based actors make the better model.  For one, part of the reason why event-based Scala Actors are so expensive is that they suspend by throwing an exception to return control from user code to the library.  While exceptions have been heavily optimized in the JVM, especially in recent versions, they are still substantially slower than normal return paths.  Scala Actors' need to use exceptions to suspend is a consequence of their expressivity.  Basically, because the library has little or no knowledge of what an actor is doing within a reaction, it cannot rely on traditional returns without introducing special control structures (see reactWhile numbers in one of my previous blogs).  Lift Actors, on the other hand, do not need to use exceptions for control flow because the message processing cycle is essentially fixed - user code cannot intersperse weird (or even not-so-weird) patterns within it, or mix in blocking receives with event-based ones.  Another potential optimization of event-based actors is to have them block if there are plenty of threads available, and then release the thread they are on if it is needed by the scheduler.  To my knowledge this optimization is not implemented anywhere, but I think it would be relatively straightforward.  The only problem is that the actor becomes more tightly bound to its scheduler.
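
To make the suspend-by-exception technique concrete, here is a stripped-down sketch.  TinyActor, Suspend, and deliver are hypothetical names, delivery is synchronous, and there is no mailbox; the only point is to show why a react that can appear anywhere in user code has to unwind the stack with a throw.

```scala
import scala.util.control.ControlThrowable

object SuspendDemo {
  // Thrown to unwind user code back into the "library"; hypothetical name.
  private final class Suspend extends ControlThrowable

  final class TinyActor {
    private var continuation: PartialFunction[Any, Unit] = null
    val log = scala.collection.mutable.ListBuffer.empty[String]

    // Registers the handler for the next message, then never returns normally.
    def react(handler: PartialFunction[Any, Unit]): Nothing = {
      continuation = handler
      throw new Suspend
    }

    // Library side: run user code until it suspends itself via react.
    private def runUntilSuspend(body: => Unit): Unit =
      try body catch { case _: Suspend => () }

    def start(body: => Unit): Unit = runUntilSuspend(body)

    // Returns false if the pending handler ignores the message.
    def deliver(msg: Any): Boolean = {
      val k = continuation
      if (k != null && k.isDefinedAt(msg)) {
        continuation = null
        runUntilSuspend(k(msg))
        true
      } else false
    }
  }

  def demo(): List[String] = {
    val a = new TinyActor
    a.start {
      a.log += "setup" // runs before any message, like code in act
      a.react { case "foo" =>
        a.log += "got foo"
        a.react { case "bar" => a.log += "got bar" }
      }
    }
    a.deliver("bar") // ignored: the actor is waiting for "foo"
    a.deliver("foo")
    a.deliver("bar")
    a.log.toList
  }
}
```

Because react has return type Nothing, arbitrary code can precede it and nested reacts compose freely, but every suspension pays for a throw and a catch.  A fixed message-processing cycle, as in Lift Actors, can simply return from the handler.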

Parting Thoughts

Ultimately, time and community willing, I'd like to evolve what is here, plus solid treatment of a lot of lower-level details, into a Scala Improvement Document (SID).  There are a lot of subtle trades involved, and I think producing a general-purpose actors library is at least an order-of-magnitude more difficult than producing a special-purpose one.  I also believe that if an actor implementation is part of the standard library, then it should provide the necessary extension points so that when users need something special-purpose they can create it and still leverage components of the standard library and interoperate with other actors.  In other words, I think it should define both the interface portion of an API along with providing a solid implementation.  I don't think we'll ever get there without a clear and common understanding of the various considerations involved.


Monday, May 25, 2009

I'm on Twitter!

For those of you with ADD or its internet-induced equivalents, I've started posting on Twitter. I long avoided it because I feel like the last thing people need is yet another half-baked information stream, but then people seem to like it so I'm giving it a shot. I'll post links to bugs, patches, and other comments regarding my efforts (and those of others) with Scala actors...along with other less important matters. http://twitter.com/ErikEngbrecht


Refactoring Scala Actors: Progress Update

It's been a while since I've posted, so I thought I'd give everyone a status update.  This post covers several different semi-disjoint topics at a fairly high level.  I plan on diving into some of the issues later this week and beyond, but for now...

State-machine Based Actors

A while back Philipp Haller, the original author and current maintainer of the Scala actors library, contacted me and basically said he found the changes I was making really interesting, but he really needed a smaller, more gradual set of patches.  It's a perfectly reasonable request, as I had pretty much completely ripped apart his library.  I had rethought my approach, anyway, so I went about moving my state-machine based actor implementation into its own package and rewiring some of the pieces so that they could share common base traits, common infrastructure, and interoperate with one another as if they were the same library.  So I shoved my code into scalax.actors, and started hacking insertion points for my code into the main library.

The first thing I thought I needed was a base trait that defines the basic structure and operations of an actor, so I created a BaseActor in between AbstractActor and Actor (as well as my own StateActor):

trait BaseActor extends AbstractActor {
  def react(f: PartialFunction[Any, Unit]): Nothing
  def reactWithin(msec: Long)(f: PartialFunction[Any, Unit]): Nothing
  def receive[A](f: PartialFunction[Any, A]): A
  def receiveWithin[R](msec: Long)(f: PartialFunction[Any, R]): R
  /*def loop(body: => Unit): Nothing */
  /*def loopWhile(cond: => Boolean)(body: => Unit): Nothing */
  protected[actors] def mailbox: MessageQueue[Message[Any]]
  private[actors] final def mailboxForChannel: MessageQueue[Message[Any]] = mailbox
  def mailboxSize: Int
  def send(msg: Any, replyTo: OutputChannel[Any]): Unit
  def forward(msg: Any): Unit
  def reply(msg: Any): Unit
  /*protected[actors]*/ def sender: OutputChannel[Any]
  def ? : Any
  def start(): AbstractActor
  def freshReplyChannel: Channel[Any] = new Channel[Any](this)
  def scheduler: IScheduler //TODO: restrict access to scheduler??
}

I don't think BaseActor is going to be a permanent fixture because its contents probably belong in AbstractActor instead, but for now it serves its purpose.  One of the first things you should notice is that way too much stuff in there is public.  Most of it should be protected, or perhaps somewhere else entirely (like an InputChannel encapsulated by the actor).

Reworking MessageQueue

There's also the issue of the mailbox, which is rather important and is a den of mutable data that is passed all around with private[actors] qualifiers.  Basically, my change separates the Message from the elements within the MessageQueue, so that the MessageQueue can keep its internal structure private, which makes it easier to turn it into a trait so that an actor can provide its own specialized implementation.  I was about to submit a patch for the change, but a fix for a memory leak in FJTaskRunner came about that relied on clearing mutable fields in the message when a task is done processing.  I have an alternative fix that changes pieces of FJTaskScheduler2, but schedulers in general and FJTaskScheduler2 in particular are in flux right now due to bugs (here and here and probably elsewhere), and I want to tweak the design a bit, so I'm holding off.
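
The shape of that separation can be sketched as follows.  Everything here is hypothetical (the trait members, FifoQueue, NewestFirstQueue) and it is not the patch itself or the real scala.actors MessageQueue; it only illustrates a queue interface that exposes messages while hiding its internal structure, so that a second implementation can be swapped in.  The sketch is not thread-safe.

```scala
object MailboxDemo {
  final case class Message[A](body: A, sender: Option[String])

  // Hypothetical interface: callers see only messages, never queue internals.
  trait MessageQueue[A] {
    def append(msg: Message[A]): Unit
    // Removes and returns the first message accepted by the predicate.
    def extractFirst(p: Message[A] => Boolean): Option[Message[A]]
    def size: Int
  }

  // Shared storage logic; subclasses only decide where new messages go.
  abstract class BufferQueue[A] extends MessageQueue[A] {
    protected val items = scala.collection.mutable.ArrayBuffer.empty[Message[A]]
    def extractFirst(p: Message[A] => Boolean): Option[Message[A]] = {
      val i = items.indexWhere(p)
      if (i < 0) None else Some(items.remove(i))
    }
    def size: Int = items.size
  }

  final class FifoQueue[A] extends BufferQueue[A] {
    def append(msg: Message[A]): Unit = items += msg
  }

  // A specialized mailbox an actor could supply instead of the default.
  final class NewestFirstQueue[A] extends BufferQueue[A] {
    def append(msg: Message[A]): Unit = items.insert(0, msg)
  }

  // Code written against the trait works with either implementation.
  def firstOf(q: MessageQueue[String]): Option[String] = {
    q.append(Message("a", None))
    q.append(Message("b", Some("peer")))
    q.extractFirst(_ => true).map(_.body)
  }
}
```

Because extractFirst takes a predicate, the selective-receive behaviour an actor needs survives regardless of how the queue stores its elements.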

Fixing Schedulers

Which brings me to schedulers...  Problems with plugging in custom schedulers (mostly fixed) are what originally caused me to dive into the guts of the actor library.  Closely related to schedulers is ActorGC, which is absolutely essential to actors (almost) transparently abstracting threads, but can also be problematic due to its fundamentally non-deterministic nature (it relies on the garbage collector for some of its more advanced capabilities).  That being said, now in trunk ActorGC is optional, so environments that don't require an implicit shutdown of the actor worker threads can avoid the added complexity.  I intend to cover the details of ActorGC very soon.  There should also be a default scheduler with daemon semantics coming, which has a number of use cases.
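
As a rough idea of what "daemon semantics" means at the java.util.concurrent level, the sketch below builds a thread pool whose workers are daemon threads, so idle workers never keep the JVM alive and no implicit shutdown logic is needed.  DaemonPool is a hypothetical name and this is not the scheduler planned for the library.

```scala
import java.util.concurrent.{Callable, ExecutorService, Executors, ThreadFactory}

object DaemonPool {
  // Worker threads are marked as daemons so they never block JVM exit.
  private val factory = new ThreadFactory {
    def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "actor-worker")
      t.setDaemon(true)
      t
    }
  }

  def create(threads: Int): ExecutorService =
    Executors.newFixedThreadPool(threads, factory)

  // Asks a worker whether it is a daemon thread.
  def workerIsDaemon(): Boolean = {
    val pool = create(1)
    try pool.submit(new Callable[Boolean] {
      def call(): Boolean = Thread.currentThread().isDaemon()
    }).get()
    finally pool.shutdown()
  }
}
```

The tradeoff is that work still queued when the last non-daemon thread exits is silently dropped, which is why this suits some use cases and not others.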

Closing Matter

There's a lot more going on.  Some recent flare-ups on the Scala Internals mailing list, despite being a tad melodramatic, brought a welcome focus on actors for the next release of Scala.  The issue has also given rise to two minimalistic actor implementations, one in Lift and the other in Scalaz.  They both make interesting data points for design and potential interoperability (remember: one of my primary goals is an actor implementation that lets you plug in what you need).  There are issues around ActorProxy that I think will be a little hairy to sort out, but I'm confident they will be.  And finally, there's the omnipresent issue of ensuring actors really make the guarantees that they claim (right now I think they do, but I wouldn't place money on it until I have tests to prove it).

That's it for now.  I'm going to try to take the time to blog about many of the above issues and more in depth in the coming weeks, and hopefully gain some insights from out in the cloud.


Tuesday, April 21, 2009

McKinsey and Cloud Computing

McKinsey has created a tempest-in-a-teapot by denouncing the economics behind both in-the-cloud clouds such as Amazon EC2 and behind-the-firewall clouds for large enterprises.  At a high level I think their analysis is actually pretty good, but the conclusions are misleading due to a semantic twist.  They use Amazon EC2 as a model, and their conclusions go something like this:

  1. Amazon EC2 virtual CPU cycles are more expensive than real, in-house CPU cycles
  2. You waste 90% of those in-house CPU cycles
  3. You'll waste almost as many of those virtual cloud CPU cycles, only they cost more, so they are a bad deal
  4. You stand a decent shot at saving some of those real CPU cycles through virtualization, so you should aggressively virtualize your datacenter
  5. You're too inept to deliver a flexible cloud behind-the-firewall, so don't even try

I'll let you ponder which of the above statements is misleading while I address some related topics.

The goals of cloud computing are as old as computing itself.  They are:

  1. Reduce the time it takes to deploy a new application
  2. Reduce the marginal cost of deploying a new application over "standard" methods
  3. Reduce the marginal increase to recurring costs caused by deploying a new application over "standard" methods

Back in the days of yore, when programmers were real men, the solution to this was time sharing.  Computers were expensive and therefore should be run at as high a utilization as possible.  While making people stand in line and wait to run their batch jobs was a pleasing ego trip for the data center operators, the machines still wasted CPU time while performing slow I/O operations, and waiting in line generally made users unhappy.  Thus time sharing was born, and in a quite real sense so were the first cloud computing environments, because in many cases a large institution would purchase and host the infrastructure and then lease it out to smaller institutions or individuals.

The problem here is that the marginal cost equations end up looking like a stair-step function.  If you had a new application, and your enterprise / institution had excess mainframe capacity, then the marginal cost of letting you run your application was near zero.  But if there was no spare capacity - meaning the mainframe was being efficiently utilized - then the marginal cost was high because either someone else had to be booted off or you needed an additional mainframe.
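
The stair-step shape can be written down as a tiny function.  This is a toy model with made-up parameters (a fixed number of application slots per mainframe and a single purchase price), treating "near zero" as exactly zero.

```scala
object MarginalCost {
  // Cost of adding one more application when `appsRunning` are already hosted.
  // Hypothetical model: each mainframe holds `slots` applications and costs `price`.
  def marginalCost(appsRunning: Int, slots: Int, price: Long): Long =
    if (appsRunning % slots == 0) price // every machine is full: buy another
    else 0L                             // spare capacity: effectively free

  // Total spend is a staircase: it jumps by `price` every `slots` applications.
  def totalCost(apps: Int, slots: Int, price: Long): Long =
    ((apps + slots - 1) / slots) * price
}
```

With four slots per machine, the first, fifth, and ninth applications each cost a whole mainframe, while every application in between costs nothing extra.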

Now fast-forward a couple decades to the PC revolution.  Somewhere along the way the cost curves for computers and people crossed, so it became appropriate to let the computer sit idle waiting for input from a user rather than having a user sit idle while waiting for a computer.  Now you could have lots of computers with lots of applications running on each one (although initially it was one application at a time, but still, the computer could run any number of them).  This smoothed out the non-recurring marginal cost curve, but as PCs proliferated it drove up recurring costs through sheer volume.

Unfortunately this had problems.  Many applications didn't work well without centralized backends, and some users still needed more compute power than could be reasonably mustered on the desktop.  So the new PCs were connected to mainframes, minicomputers, and eventually servers.  Thus client-server computing was born, along with increasingly confusing IT economics.  PCs were cheap, and constantly becoming cheaper, but backend hardware remained expensive.  The marginal non-recurring cost becomes completely dependent on the nature of the application, and recurring costs simply begin to climb with no end in sight.

Now fast forward a little more.  Microsoft releases a "server" operating system that runs on souped-up PCs and convinces a whole bunch of bean counters that they can solve their remaining marginal non-recurring cost problems with Wintel servers that don't cost much more than PCs.  No more expensive servers.  No more having to divide the cost of a single piece of hardware across several projects.  Now if you want to add an application you can just add an inexpensive new Wintel server.  By this time the recurring cost equation had already become a jumbled mess, and the number of servers was still dwarfed by the PC on every desk, so there was no reining in the ever increasing recurring costs.  This problem was then further exacerbated by Linux giving the Unix holdouts access to the same cheap hardware.

Thus began the era of one or more physical servers per application, which is where we are today, with McKinsey's suggestion for addressing it: virtualization behind the firewall.  The problem with this suggestion is that, for a large enterprise, it isn't really that different from the cloud-in-the-cloud solution that they denounce as uneconomical.  One way is outsourcing a virtualized infrastructure to Amazon or similar, and the other is outsourcing it to their existing IT provider (ok, not all large enterprises outsource their IT, but a whole lot do).

Virtualization, in the cloud or otherwise, isn't the solution because it doesn't address the root cause of the problem - proliferation of (virtual) servers and the various pieces of infrastructure software that run on them, such as web servers and databases.  Hardware is cheap.  Software is often expensive.  System administrators are always expensive.  Virtualization attacks the most minor portion of the equation.

Virtualization is the right concept applied to the wrong level of the application stack.  Applications need to be protected from one another, but if they are built in anything resembling a reasonable way (that's a big caveat, because many aren't) then they don't need the full protections of running in a separate OS instance.  There's even a long standing commercially viable market for such a thing: shared web hosting.

It may not be very enterprisey, but shared web site/application hosting can easily be had for about $5 per month.  The cost quickly goes up as you add capabilities, but still - companies are making money by charging arbitrary people $5 per month to let them run arbitrary code on servers shared by countless other customers running arbitrary code.  How many enterprise IT organizations can offer a similar service at even an order-of-magnitude greater cost?

Not many, if any.  Yet do we see suggestions pointing out that Apache, IIS, Oracle, SQL Server, and countless other pieces of infrastructure can relatively easily be configured to let several applications share compute resources and expensive software licenses?  Nope.  They suggest you take your current mess, and virtualize it behind the firewall instead of virtualizing it outside the firewall.


Tuesday, February 10, 2009

Refactoring Scala Actors: Rethinking the Approach

When I started refactoring Scala's actor library, I really had several goals:

  1. Reduce the coupling within the library so that the implementation can be more easily extended and customized
  2. Create a more transparent and "programmer friendly" actor implementation
  3. Improve performance of various use cases
  4. Make actors that interact better with their environment, particularly non-actor code
  5. Maintain API compatibility with the existing library to the maximum extent practical

Thus far I've done my work by completely overhauling the Actor trait directly in the library.  While I have plans for how this will reduce coupling, they haven't come to fruition yet.  I think my state machine model is considerably more transparent and programmer friendly than the current implementation.  The state an actor is in is always clear, transitions are reasonably well defined, and performing correct locking and unlocking is now pretty straightforward.  I've substantially improved the performance of event-based actors for the common case where an actor loops while it receives messages until some termination criterion is reached.  I haven't done anything with making them interact better with their environment yet, as I believe Philipp Haller is in the process of incorporating some changes for which I submitted patches several months back that will help considerably (he doesn't appear to be using the patches directly, but the changes I've seen are close enough).
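
To give a flavor of what "the state is always clear and transitions are well defined" can mean in code, here is a purely illustrative sketch.  The states, events, and transition table below are invented for this example and are not the ones used in the state-based actor implementation discussed in this post.

```scala
object ActorStateDemo {
  // Illustrative states only; not the actual states of the implementation.
  sealed trait State
  case object NotStarted extends State
  case object Running extends State    // executing a reaction
  case object Waiting extends State    // suspended until a message arrives
  case object Terminated extends State

  sealed trait Event
  case object Start extends Event
  case object MessageArrived extends Event
  case object ReactionDone extends Event
  case object Exit extends Event

  // Returns None for transitions this toy model does not allow.
  def transition(s: State, e: Event): Option[State] = (s, e) match {
    case (NotStarted, Start)               => Some(Running)
    case (Running, ReactionDone)           => Some(Waiting)
    case (Waiting, MessageArrived)         => Some(Running)
    case (Running, Exit) | (Waiting, Exit) => Some(Terminated)
    case _                                 => None
  }

  // Folds a sequence of events, stopping at the first illegal transition.
  def run(events: List[Event]): Option[State] =
    events.foldLeft(Option[State](NotStarted)) { (state, e) =>
      state.flatMap(transition(_, e))
    }
}
```

Making the state an explicit value means illegal sequences, such as a message being processed by an actor that was never started, show up as a None rather than as a subtle race.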

A few days ago David MacIver asked me a couple interesting questions on IRC:

  1. Have you considered using Kilim?
  2. Have you looked at the continuations support that appears to be emerging as a compiler plugin?

Using Kilim would almost certainly disqualify the changes I'm making from incorporation into the standard library because of the external dependency on a prerelease framework that uses bytecode manipulation, and I don't think the fledgling continuation support is mature enough to experiment with yet (or maybe just not well documented enough yet, who knows...).  That being said, both of these would be interesting avenues of development.  Event based actors rely heavily on continuations, and the performance of those continuations has a substantial effect on the performance of the actors.  Ultimately a properly decoupled implementation would allow someone to build an API compatible actor implementation on either of these technologies, or something else entirely.

I also received a recent nudge on the mailing list, when someone pointed out that it would be easier to experiment with my library if it was in its own package.  I somewhat disagree with that.  It would be easier to experiment with it if you didn't have to do an entire build of the Scala distribution that has my code in it, and unfortunately for the foreseeable future I'm going to be depending on both code that is being committed into the trunk and on the ability to modify said code.  Also, the way I have it set up today, if someone wanted to test their Scala application/library/framework against my state-based actors, they would just have to pull my Mercurial repository, build it, and rebuild against the resulting distribution.  On the downside, it's a pain to do side-by-side comparisons between the two implementations, because you have to switch distributions and rebuild every time.

That being said, decoupling is one of my primary goals, and Martin & Company have already set the precedent of doing major work in a separate package with their efforts on redesigning the Scala collections library.  So as soon as I finish up some of my immediate tasks and have a good build again (I'm in the middle of redesigning the control flow pattern to minimize the stack size when exceptions are thrown), I'm going to push most of the public and protected interface on Actor, as abstract methods, into a new trait that will sit between Actor and AbstractActor, and move my code out of scala.actors and into scalax.actors in my own directory.  I'll definitely keep it within the same repository as the Scala distribution code, and will probably just make another directory for it.  This means I'll have to mess with Sabbus to add my portion to the build, which won't be fun, but shouldn't be too hard.  I'm sure there's going to be a lot more to do to extract it out, so I'll be adding items to my issues list as I think of them.

The end result should be my state based actors and the existing actors being able to live side-by-side and interact with one another in the same program.  Assuming I can at least get some patches into the main library, it will also mean that the future of my work will not be dependent on being incorporated into the standard distribution.  If it is, great, if not, I can distribute it separately.  I would have done this from the start, but initially I was highly dependent on the existing code and infrastructure in order to get something working reasonably quickly, and to be able to smoke it out to see if it indeed worked.  Back in December I was basically breaking partest with every change.  But now things are reasonably stable, I rarely break partest, and I don't think it will take much for my code to be able to stand on its own.

So what does everyone think?  Is this a good direction to head in?
