Wednesday, October 10, 2007

Why test parallelism on a simple function?

On my last post, an anonymous commenter asked:

would a more expensive line-match-function make it more obvious if you are working in parallel?
I would say that one should be able to demonstrate that transparently supporting the potential for parallelism can be nearly free. If a parallel algorithm can solve a problem that doesn't benefit much from parallelism with roughly the same or better performance characteristics as the serial code, then it should do a lot better when you actually give it a more complex problem. Basically, parallelism should be free.

A lot of people have commented on Tim Bray's blogs that his test is unsuited for demonstrating the benefits of parallelism because it is IO bound. Tim claims this isn't true, and I suspect there's some truth to that if you have really optimized IO, but I do think the benefits of parallelization for his problem are very limited. That said, one thing he has clearly demonstrated is that parallelism isn't free. His "obvious" newbie solution in Erlang performed horribly and was considerably longer than the Ruby solution. Others have greatly improved the performance with extremely long, complicated chunks of code, but have yet to match Ruby. I find that really sad. So I would like to prove that parallelism can be almost free, meaning:
  1. Leveraging it does not impose a significant additional cognitive load on the programmer.
  2. Problems that are not effectively parallelizable should execute "about as fast" with the parallel functionality as with the serial code.
From an interface perspective I think I have it with the monadic interface to the file. I just need to work out some bugs or change the interface to make them go away. I'll write more on this when I've worked out some of the wrinkles. So that leaves the performance problem.

One of the big challenges with parallelization is that spawning new threads or processes is very expensive, and synchronization is somewhat expensive, so it's very easy for the cost of parallelization to overwhelm the cost of the actual solution. The most straightforward way to address this is to not parallelize when the function isn't complex enough or the input data set isn't large enough to justify it, but that is back to imposing a cognitive load on the programmer, because he has to figure that out. Either that, or always "start serial" and use runtime profiling tricks to detect whether the problem is worth parallelizing, which sounds expensive but probably has merit.

Another challenge is knowing how to divide up the problem to avoid excessive synchronization and/or messaging. When processing a file line-by-line, one could send each line out to be processed independently, but that requires a lot of messaging and synchronization if you don't have lock-free messaging. So really you want to break the problem into properly sized chunks and send each chunk as a message rather than simply use the most natural division. Figuring out how big a chunk should be (or how many chunks you should have) is a challenge because it is problem and runtime dependent. Again, this creates the potential to burden the programmer, use complex and potentially expensive runtime profiling, or somehow come up with a magic cheap heuristic.

So you can either solve the problems above, or you can have sufficiently cheap parallelism that you don't need good solutions. Right now I'm going after the sufficiently cheap approach. What I have so far is a mapreduce-style function using Scala Actors that breaks a file into chunks of lines and sends each chunk off to be processed by an Actor (a rough sketch of the idea appears at the end of this post). I plan on adding a parallel foreach function that could be used for problems like Widefinder using a parallel hash map. Performance-wise it's looking promising. Here are some numbers (using my 5+ year old machine):

Serial:
Count: 185300  Serial: 11592
real    0m12.107s
user    0m11.254s
sys     0m0.784s
Parallel:
Count: 185300  Serial: 11722
real    0m12.225s
user    0m11.441s
sys     0m0.723s
As you can see, the parallel code is slightly slower than the serial code. Across runs their times actually overlap a bit, but serial generally tends to be a tad faster. One thing I've noticed is that the deltas between the serial and parallel implementations don't really grow - and to some extent shrink - with increasing input sizes. I believe this is because there is a fixed penalty for setting up the thread pool for the actors. This only has to be done once per process invocation, and appears to cost about 200ms on my machine. In other words, parallelization for file processing can be almost free.

I actually think it could be better than free, even on a single-processor box, if IO were more efficient. My current solution uses a BufferedReader to read in the file one line at a time. This means the IO is probably being done in a less-than-optimal way, and that a lot of work is being done in serial for each line (converting from 8-bit ASCII to 16-bit Unicode strings, splitting the input into lines). I'd like to use nio to read the file in a block at a time, and then let all this work be done in separate threads. I think then there would be a performance increase, because one thread would be doing nothing but reading in buffers as fast as the OS and JVM can provide them, and others would be doing all the computation while the IO thread is blocking. But before that I'm going to get the interface cleaned up and solve the memory problem on large files.
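Before wrapping up, here is a minimal sketch of the chunking idea described above, written against the scala.actors library. The names ChunkedCount, countMatches, ChunkSize, and Workers are placeholders of my own, and the real mapreduce-style function is structured differently; this is only meant to show how little code it takes to batch lines into chunks and farm them out to actors.

import java.io.{BufferedReader, FileReader}
import java.util.regex.Pattern
import scala.actors.Actor._

object ChunkedCount {
  val ChunkSize = 512 // lines per message; picking this well is the hard part
  val Workers   = 4   // placeholder; ideally this would match the number of cores

  def countMatches(path: String, pat: Pattern): Int = {
    val caller = self

    // Each worker counts matches in the chunks it receives, then reports its total.
    val workers = (1 to Workers).toList.map { _ =>
      actor {
        var total = 0
        loop {
          react {
            case lines: List[String] =>
              for (l <- lines) if (pat.matcher(l).find) total += 1
            case "done" =>
              caller ! total; exit()
          }
        }
      }
    }

    // Read the file serially, batching lines into chunks and round-robining
    // the chunks across the workers.
    val in = new BufferedReader(new FileReader(path))
    var chunk: List[String] = Nil
    var size, i = 0
    var line = in.readLine()
    while (line != null) {
      chunk = line :: chunk
      size += 1
      if (size == ChunkSize) {
        workers(i % Workers) ! chunk
        chunk = Nil; size = 0; i += 1
      }
      line = in.readLine()
    }
    in.close()
    if (!chunk.isEmpty) workers(i % Workers) ! chunk
    workers.foreach(_ ! "done")

    // Sum the per-worker totals as they come back.
    var total = 0
    for (_ <- workers) receive { case t: Int => total += t }
    total
  }
}

Even this toy version shows where the tuning burden lives: in ChunkSize and the number of workers. Those are exactly the decisions I'd like the library to make for the programmer, so that the parallel version costs nothing extra to write.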


Sunday, October 07, 2007

Dangerous Monads?

I've been trying to write a parallelized version of Tim Bray's Widefinder problem in Scala where the parallelization is "hidden" in some library-like code, so that the programmer can write something that looks as clean...nay...cleaner than Tim's Ruby code using Scala. I'm also trying to figure out what the heck a monad is, and thinking about how I'll have to use ugly, imperative Java IO classes, so I decided to write some classes to make a file look like a list of strings, with each element representing one line.

One critical aspect of this list is that it is lazily generated, so you don't have to wait for the entire file to be loaded to start working with it. Another critical trait is that there are no "back references" from subsequent lines, so that if you no longer hold a reference to previous nodes in the list, those nodes can be garbage collected. The interface supports all the methods required for "for" comprehensions - map, foreach, filter, and flatMap. I think it qualifies as a monad, or at least is something close. I ended up with something where usage looked kind of like this:

import java.util.regex.Pattern
def main(args: Array[String]): Unit = {
  val pat = Pattern.compile("GET /ongoing/When/")
  var cnt = 0
  for (file <- args; line <- Line.open(file) if pat.matcher(line).find) cnt = cnt + 1
}
Note that I'm just counting lines there, not performing the full Widefinder functionality. But still, that's pretty concise and easy on the eyes. The "for" comprehension is translated into calls to foreach. It works great for a lot of files, but it blows up with an OutOfMemoryError on large files. So why is that? I'll give you a hint: fully expanded, it would look kind of like this:
args.foreach((file) => Line.open(file).filter((line) => pat.matcher(line).find).foreach((line) => cnt = cnt + 1))
Can you see the problem? No? Try this:
args.foreach((file) => {
  val firstLine = Line.open(file)
  firstLine.filter((line) => pat.matcher(line).find).foreach((line) => cnt = cnt + 1)
})
Can you see it now? The problem is that there's a reference to the first line hanging around. Even if you don't declare the variable, it's still there, lurking as the implicit "this" parameter for filter. That reference keeps the first line from being collected, and as a result none of the lines are collectable, because they are all reachable from the first line. So the whole file ends up loaded into memory, resulting in an OutOfMemoryError. That seems pretty dangerous to me.

So how can we solve this problem? Well, we have to make it so that references to lines disappear as we get done with them. There are a couple of ways to do it. The obvious imperative way is to use a var and a while loop, but then you might as well use BufferedReader directly or use an iterator. The functional way is to use tail recursion, so you write:
def scanLines(line: Line, pat: Pattern, cnt: Int): Int = {
  if (line == EndOfFile) cnt
  else {
    val v = if (pat.matcher(line.value).find) 1 else 0
    scanLines(line.next, pat, cnt + v)
  }
}
...and call it for each file like this:
for(file <- args) cnt = cnt + scanLines(Line.open(file), pat, 0)   
Notice that there is no val declaration holding the first line. This is critical, because if there were, the program would run out of memory again. So what do we do about it?

Well, it would be easy enough to refactor methods like foreach that trigger immediate file traversal out of the normal interface and into a tail-recursive method on the companion object. Unfortunately that would break usage in for comprehensions, be inconsistent with other collection-like objects, and in general feel like poor OO. Another way would be to fix Scala so that it supported full tail call optimization. Of course, that would also require adding full tail call support to the JVM. That way the unneeded "this" reference could silently disappear from the stack. This would also allow many methods to be expressed in a much cleaner way. For example:
  final def foreach(f: (String) => Unit): Unit = {
    def fe(line: Line): Unit = {
      if (line != EndOfFile) {
        f(line.value)
        fe(line.next)
      }
    }
    fe(this)
  }
Could be simplified down to:
  final def foreach(f: (String) => Unit): Unit = {
    f(value)
    next.foreach(f)
  }
For those of you who want to take a look at the code I have so far, here it is as an Eclipse project. I've tried to comment it a bit, but it's still a work in progress. There is a working (at least I think it works) mapreduce function hiding in there that allows lines to be processed in parallel using actors. Unfortunately it is slower (though not substantially slower) than just processing the lines serially. But then I'm running it on an old uniprocessor PC, so maybe with more cores it would do better. If I get some free time at work I'll try it out on a multicore machine and see what happens, but I suspect that unless (or until) I hack together something that uses nio in an optimal way the task will remain IO bound...and even then it may remain IO bound.
