Sunday, August 22, 2010

Scala Actors: loop, react, and schedulers

One of the unfortunate aspects of many of the "published" (meaning blogged) Scala Actor benchmarks out there is that they rarely pay much attention, if any, to the effects of seemingly idiomatic patterns on performance. Some of the main culprits are:

  1. react versus receive (event-based versus threaded)
  2. loop/react versus recursive react
  3. loop/receive versus receive/while (see the sketch just after this list)
  4. tweaking (or failing to tweak) the scheduler
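
Regarding the third pairing: receive always ties up a dedicated thread while it waits, so wrapping it in loop only adds loop's per-iteration control-flow exception and task on top. Here is a minimal sketch of the two variants (Msg and Stop are hypothetical message types, not from the benchmark code):

import scala.actors.Actor._

case class Msg(m: String)
case object Stop

// loop/receive: loop schedules a fresh task (via a control-flow exception)
// on every iteration, even though receive already occupies a thread.
actor {
  loop {
    receive {
      case Msg(m) => // do stuff
      case Stop   => exit()
    }
  }
}

// receive/while: a plain while loop keeps the already-dedicated thread
// and avoids loop's per-iteration exception and task.
actor {
  var done = false
  while (!done) {
    receive {
      case Msg(m) => // do stuff
      case Stop   => done = true
    }
  }
}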

I've been working on setting up a "benchmarking framework" in conjunction with experimenting with modifications to the underlying thread pool, so that all of the possible permutations are tested automatically. What I have right now is a classic "ring" benchmark set up to permute the schedulers and loop/react versus recursive react. The loop/react pattern is more idiomatic (or at least more common), but it carries higher overhead. It looks something like this:

loop {
  react {
    case Msg(m) => // do stuff
    case Stop => exit()
  }
}

The reason it is high-overhead is that both loop and react raise control-flow exceptions that result in the creation of new tasks for the thread pool, so each iteration of the loop raises two exceptions and executes two tasks. There's overhead in both operations, especially in raising the exceptions. The recursive react pattern avoids the extra exception and task, and looks like this:

// rloop() would be called by the actor's act() method
def rloop(): Unit = react {
  case Msg(m) =>
    // do stuff
    rloop()
  case Stop => // just drop out or call exit()
}

Using loop instead of recursive react effectively doubles the number of tasks the thread pool has to execute to accomplish the same amount of work, which in turn makes any overhead in the scheduler far more pronounced when using loop. Now, I should point out that the overhead really isn't that large, so if the actor is performing significant computations it will be lost in the noise. But it's fairly common for actors to do very little with each message. Here are some results from the ring benchmark using 10 rings of 10,000 actors, each ring passing a token around 100 times before exiting. I'm using multiple rings because otherwise there is no parallelism in the benchmark. These runs were done on my dual-core MacBook.

Scheduler                      React Method     Time (sec)
ManagedForkJoinScheduler       LoopReact        45.416058
ManagedForkJoinScheduler       RecursiveReact   25.509482
ForkJoinScheduler              LoopReact        65.268584
ForkJoinScheduler              RecursiveReact   45.85605
ResizableThreadPoolScheduler   LoopReact        98.084794
ResizableThreadPoolScheduler   RecursiveReact   53.379757
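
To make the shape of the benchmark concrete, here's a minimal sketch of a single ring node written with the recursive-react pattern. The names (RingNode, Token) and the wiring snippet are illustrative only, not the actual benchmark code:

import scala.actors.Actor

case class Token(hopsLeft: Int)
case object Stop

// One node in the ring: forward the token until its hop count runs out,
// then circulate Stop once so every node drops out of its react loop.
class RingNode extends Actor {
  var next: RingNode = null // wired up after construction
  def act(): Unit = rloop()
  private def rloop(): Unit = react {
    case Token(0) => next ! Stop                  // done; start shutting down the ring
    case Token(n) => next ! Token(n - 1); rloop() // forward and wait for the next message
    case Stop     => next ! Stop                  // propagate Stop and drop out
  }
}

// Wiring and driving one ring: 100 laps around 10,000 actors = 1,000,000 hops.
val nodes = Array.fill(10000)(new RingNode)
for (i <- nodes.indices) nodes(i).next = nodes((i + 1) % nodes.length)
nodes.foreach(_.start())
nodes(0) ! Token(10000 * 100)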

The fork/join schedulers are faster than the ResizableThreadPoolScheduler because, rather than having all of the worker threads pull tasks off of a single shared queue, each thread maintains its own local deque onto which it can place tasks directly when they are generated while it is running a task. This creates a kind of "fast path" for those tasks that involves much less overhead.
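
The same two paths exist in the raw fork/join API. As a rough sketch (shown against the JDK's java.util.concurrent.ForkJoinPool; the schedulers here were built on the equivalent jsr166y pool), a task forked from inside a worker lands on that worker's local deque, while an external submission goes through the shared queue:

import java.util.concurrent.{ForkJoinPool, RecursiveAction, TimeUnit}

// Hypothetical illustration of the two submission paths.
class Step(n: Int) extends RecursiveAction {
  def compute(): Unit =
    if (n > 0)
      new Step(n - 1).fork() // called inside a worker: pushed onto that
                             // worker's local deque (the "fast path")
}

val pool = new ForkJoinPool()
pool.execute(new Step(5))    // called from an external thread: goes through
                             // the shared submission queue
pool.awaitQuiescence(1, TimeUnit.SECONDS)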

I believe the primary reason ManagedForkJoinScheduler is faster is that ForkJoinScheduler does not always leverage the "fast path," even when in theory it could be used. I'm unsure about some of the rationale behind this, but I know that some of the time the fast path is bypassed probabilistically in order to reduce the chance that starvation causes deadlock in the presence of long-running or blocking tasks. ManagedForkJoinScheduler escapes this particular issue by more actively monitoring the underlying thread pool and growing it when tasks are being starved. The second reason, though I'm unsure of the actual degree of its effect, is that ForkJoinScheduler configures the underlying thread pool so that the threads work through their local deques in FIFO order, while ManagedForkJoinScheduler configures the pool so that the local deques are processed in LIFO order. Processing in LIFO order lets the pool take advantage of locality among the tasks, on the assumption that the most recently generated task is the most likely to use data that is currently in cache, which reduces cache misses.
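
For reference, the FIFO-versus-LIFO choice corresponds to a single flag when the pool is constructed. A sketch against the JDK's ForkJoinPool, whose four-argument constructor takes an asyncMode flag (the jsr166y pool of the time exposed the same choice):

import java.util.concurrent.ForkJoinPool

// asyncMode = true: workers drain their local deques in FIFO order,
// which is what ForkJoinScheduler's configuration corresponds to.
val fifoPool = new ForkJoinPool(
  Runtime.getRuntime.availableProcessors,
  ForkJoinPool.defaultForkJoinWorkerThreadFactory,
  null,  // no custom uncaught-exception handler
  true)

// asyncMode = false: workers pop their local deques in LIFO order, so the
// most recently generated task (whose data is likely still in cache) runs
// next; this is what ManagedForkJoinScheduler's configuration corresponds to.
val lifoPool = new ForkJoinPool(
  Runtime.getRuntime.availableProcessors,
  ForkJoinPool.defaultForkJoinWorkerThreadFactory,
  null,
  false)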

The benchmark outputs a lot more information than I captured in the above table. If you'd like to run it, you can obtain the code here. The project uses sbt, so you'll need to have it working on your computer. After you run update in sbt to download all of the dependencies, you can run the ring benchmark as follows:

$ sbt
[info] Building project ManagedForkJoinPool 1.0 against Scala 2.8.0
[info]    using ManagedForkJoinPoolProject with sbt 0.7.4 and Scala 2.7.7
> ringbenchmark
[info] 
[info] == compile ==
[info]   Source analysis: 1 new/modified, 0 indirectly invalidated, 0 removed.
[info] Compiling main sources...
[info] Compilation successful.
[info]   Post-analysis: 79 classes.
[info] == compile ==
[info] 
[info] == copy-resources ==
[info] == copy-resources ==
[info] 
[info] == ringbenchmark ==
[info] RingBenchmark ManagedForkJoinScheduler LoopReact 2 ....output truncated...

You can tweak the benchmarks by modifying the sbt project file. If you do run them, I'm very interested in the results.
