Wednesday, October 10, 2007

Why test parallelism on a simple function?

On my last blog post, an anonymous commenter asked:

would a more expensive line-match-function make it more obvious if you are working in parallel?
I would say that transparently supporting the potential for parallelism should be nearly free, and that one should be able to demonstrate it. If a parallel algorithm can solve a problem that doesn't benefit much from parallelism with roughly the same or better performance characteristics as the serial code, then it should do a lot better when you actually give it a more complex problem. Basically, parallelism should be free.

A lot of people have commented on Tim Bray's blog posts that his test is unsuited for demonstrating the benefits of parallelism because it is IO bound. Tim claims this isn't true, and I suspect there's some truth to that if you have really optimized IO, but I do think the benefits of parallelization for his problem are very limited.

That being said, one thing he has clearly demonstrated is that parallelism isn't free. His "obvious" newbie solution in Erlang performed horribly and was considerably longer than the Ruby solution. Others have greatly improved the performance with extremely long, complicated chunks of code, but have yet to match Ruby. I find that really sad.

So I would like to prove that parallelism can be almost free, meaning:
  1. Leveraging it does not impose a significant additional cognitive load on the programmer.
  2. Problems that are not effectively parallelizable should execute "about as fast" when parallel functionality is used as with serial code.
From an interface perspective I think I have it with the monadic interface to the file. I just need to work out some bugs or change the interface to make them go away. I'll write more on this when I've worked out some of the wrinkles.

So that leaves the performance problem. One of the big challenges with parallelization is that spawning new threads or processes is very expensive, and synchronization is somewhat expensive, so it's very easy for the cost of parallelization to overwhelm the cost of the actual solution. The most straightforward way to address this problem is to not parallelize when the function is not complex enough or the input data set isn't large enough to justify it, but that goes back to imposing a cognitive load on the programmer, because he has to figure that out. The alternative is to always "start serial" and use runtime profiling tricks to detect whether the problem is worth parallelizing, which sounds expensive but probably has merit.

Another challenge is knowing how to divide up the problem to avoid excessive synchronization and/or messaging. When processing a file line by line, one could send each line out to be processed independently, but that requires a lot of messaging and synchronization if you don't have lock-free messaging. So really you want to break the problem into properly sized chunks and send each chunk as a message, rather than simply use the most natural division. Figuring out how big a chunk should be (or how many chunks you should have) is a challenge because it is problem and runtime dependent. Again, this creates the potential to burden the programmer, use complex and potentially expensive runtime profiling, or somehow come up with a magic cheap heuristic.

So you can either solve the problems above, or you can have sufficiently cheap parallelism that you don't need good solutions. Right now I'm going after the sufficiently cheap approach.
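To make the chunking trade-off concrete, here is a minimal sketch of counting matching lines chunk by chunk, with a serial fallback for small inputs. It is not the Actors-based code discussed in this post: it uses `scala.concurrent.Future` from the standard library instead, and the names `ChunkedCount`, `countMatches`, `DefaultChunkSize` and `SerialThreshold` are hypothetical, as are the two numeric defaults, which would need tuning per problem and per machine.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ChunkedCount {
  // Hypothetical defaults: good values are problem and runtime dependent.
  val DefaultChunkSize = 10000
  val SerialThreshold = 20000

  /** Counts the lines satisfying `matches`, going parallel only for large inputs. */
  def countMatches(lines: Seq[String],
                   matches: String => Boolean,
                   chunkSize: Int = DefaultChunkSize,
                   serialThreshold: Int = SerialThreshold)
                  (implicit ec: ExecutionContext): Int = {
    if (lines.size < serialThreshold) {
      // Too little work to pay for scheduling and synchronization.
      lines.count(matches)
    } else {
      // One task per chunk of lines rather than one per line ("map") ...
      val partials = lines.grouped(chunkSize)
        .map(chunk => Future(chunk.count(matches)))
        .toList
      // ... then combine the per-chunk counts ("reduce").
      Await.result(Future.sequence(partials), Duration.Inf).sum
    }
  }
}
```

With `scala.concurrent.ExecutionContext.Implicits.global` in scope, a call such as `ChunkedCount.countMatches(lines, _.contains("/match/"))` uses the same signature whether or not the work ends up being split, which is the "no extra cognitive load" goal. The fixed threshold is exactly the kind of guess the paragraphs above are wary of.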
What I have so far is a mapreduce-style function using Scala Actors that breaks a file into chunks of lines and sends them off to be processed by Actors. I plan on adding a parallel foreach function that could be used for problems like Widefinder using a parallel hash map.

Performance-wise it's looking promising. Here are some numbers (using my 5+ year old machine):

Serial:
Count: 185300 Serial: 11592
real 0m12.107s
user 0m11.254s
sys 0m0.784s

Parallel:
Count: 185300 Serial: 11722
real 0m12.225s
user 0m11.441s
sys 0m0.723s

As you can see, the parallel code is slightly slower than the serial code. Across runs their times actually overlap a bit, but serial generally comes out a tad faster. One thing I've noticed is that the deltas between the serial and parallel implementations don't really grow, and to some extent shrink, with increasing input sizes. I believe this is because there is a fixed penalty for setting up the thread pool for the actors. This only has to be done once per process invocation, and appears to cost about 200ms on my machine. In other words, parallelization for file processing can be almost free.

I actually think it could be better than free, even on a single-processor box, if IO were more efficient. My current solution uses a BufferedReader to read in the file one line at a time. This means the IO is probably being done in a less-than-optimal way, and that a lot of work is being done in serial for each line (converting from 8-bit ASCII to 16-bit Unicode strings, splitting it into lines). I'd like to use nio to read the file in a block at a time, and then let all this work be done in separate threads. I think there would then be a performance increase because one thread would be doing nothing but reading in buffers as fast as the OS and JVM can provide them, and others would be doing all the computation while the IO thread is blocking.

But before that I'm going to get the interface cleaned up and solve the memory problem on large files.
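For illustration, the block-at-a-time idea might look roughly like the sketch below. It is an untuned outline under several assumptions, not a finished design: `BlockReader`, `countMatches`, `countIn` and the 1 MB default block size are hypothetical, the file is assumed to be US-ASCII with `\n` line endings, and worker tasks are plain `scala.concurrent.Future`s rather than Actors. The calling thread only reads bytes and finds the last newline in each block. Decoding, line splitting and matching happen in the worker tasks.

```scala
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.charset.StandardCharsets
import java.nio.file.{Path, StandardOpenOption}
import java.util.Arrays
import scala.collection.mutable.ListBuffer
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object BlockReader {
  /** Decodes a run of whole lines and counts those satisfying `matches`. */
  private def countIn(bytes: Array[Byte], matches: String => Boolean): Int =
    new String(bytes, StandardCharsets.US_ASCII).linesIterator.count(matches)

  def countMatches(path: Path, matches: String => Boolean, blockSize: Int = 1 << 20)
                  (implicit ec: ExecutionContext): Int = {
    val channel = FileChannel.open(path, StandardOpenOption.READ)
    try {
      val buffer = ByteBuffer.allocate(blockSize)
      val pending = ListBuffer.empty[Future[Int]]
      var carry = Array.emptyByteArray // partial line left over from the previous block
      while (channel.read(buffer) != -1) {
        buffer.flip()
        // Prepend the leftover bytes so lines that straddle blocks stay intact.
        val block = new Array[Byte](carry.length + buffer.remaining())
        System.arraycopy(carry, 0, block, 0, carry.length)
        buffer.get(block, carry.length, buffer.remaining())
        buffer.clear()
        // Everything up to and including the last newline is whole lines.
        val cut = block.lastIndexOf('\n'.toByte) + 1
        if (cut > 0) {
          val complete = Arrays.copyOfRange(block, 0, cut)
          pending += Future(countIn(complete, matches)) // decode and match off the IO thread
        }
        carry = Arrays.copyOfRange(block, cut, block.length)
      }
      // A final line without a trailing newline is still sitting in `carry`.
      val tail = if (carry.nonEmpty) countIn(carry, matches) else 0
      Await.result(Future.sequence(pending.toList), Duration.Inf).sum + tail
    } finally channel.close()
  }
}
```

Note that nothing here limits how many blocks can be queued for the workers, so if reading outpaces matching, memory use grows with the file. A bounded queue between the reader and the workers would be one way to address that.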

1 comment:

James Abley said...

val reader = new BufferedReader(new InputStreamReader(new FileInputStream(fileName), "US-ASCII"))

That should speed up the I/O a bit. You can try using multiple threads to read from the file and dispatch to workers as well - each thread skips to a known offset, finds the first complete line and reads a given amount of bytes plus whatever is required to complete the final line.

I think an NIO solution would be interesting though, keep at it.