Actually the only serial part is the reading data ...

2007-12-05T22:58:00.000-05:00

Actually the only serial part is reading data from disk into a direct buffer. The rest is all parallel. The basic Scala for loop is just syntactic sugar for a call to foreach.
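To illustrate the for/foreach point, here is a minimal sketch. Countdown and ForSugarDemo are hypothetical names made up for this example and are not taken from pio.scala. It shows that a class only needs a foreach method to be usable in a for loop without yield.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical class for illustration only: it defines foreach and nothing else.
class Countdown(from: Int) {
  def foreach[U](f: Int => U): Unit = {
    var i = from
    while (i > 0) { f(i); i -= 1 }
  }
}

object ForSugarDemo {
  // Written with a for loop; the compiler rewrites this into a foreach call.
  def viaFor(n: Int): List[Int] = {
    val buf = ListBuffer[Int]()
    for (x <- new Countdown(n)) buf += x
    buf.toList
  }

  // The same loop written as the explicit foreach call.
  def viaForeach(n: Int): List[Int] = {
    val buf = ListBuffer[Int]()
    new Countdown(n).foreach(x => buf += x)
    buf.toList
  }
}
```

Both methods visit the same elements in the same order, so a foreach that hands work to other threads makes an ordinary-looking for loop parallel.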

Here's the main behind-the-scenes code:
http://mysite.verizon.net/erik.engbrecht/pio.scala.html

Go to the foreach method in ParallelLineReader. It spawns a coordinator actor, which then spawns one more worker than you have processors. Each worker reads a chunk of the file into a direct buffer and then passes the file channel on to the next worker. The workers form a circular list, so I/O is always done sequentially. Also, memory consumption is limited to the number of workers * the size of the buffers. So if the processing runs behind the input, it stops reading data until it catches up rather than eating up all your memory.
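The ring of workers described above can be sketched roughly as follows. This is not the code from pio.scala: it swaps the actors for plain threads and semaphores, has no coordinator, and simply counts newline bytes so that line boundaries do not matter. RingReadSketch, countNewlines, and their parameters are hypothetical names for this example.

```scala
import java.nio.ByteBuffer
import java.nio.channels.ReadableByteChannel
import java.util.concurrent.Semaphore
import java.util.concurrent.atomic.{AtomicBoolean, AtomicLong}

object RingReadSketch {
  // Counts '\n' bytes in the channel. Reads happen one at a time, in ring
  // order; the scanning of each buffer runs in parallel with later reads.
  def countNewlines(channel: ReadableByteChannel, workers: Int, bufSize: Int): Long = {
    val newline = '\n'.toByte
    val total = new AtomicLong(0)
    val eof = new AtomicBoolean(false)
    // turn(i) holds a permit exactly when it is worker i's turn to read.
    val turn = Array.fill(workers)(new Semaphore(0))

    val threads = (0 until workers).map { i =>
      new Thread(() => {
        // One direct buffer per worker: memory is bounded by workers * bufSize.
        val buf = ByteBuffer.allocateDirect(bufSize)
        var running = true
        while (running) {
          turn(i).acquire()                  // wait for the channel to come around
          buf.clear()
          if (eof.get || channel.read(buf) < 0) {
            eof.set(true)
            running = false
          }
          turn((i + 1) % workers).release()  // pass the channel to the next worker
          if (running) {
            buf.flip()
            var n = 0L
            while (buf.hasRemaining) if (buf.get() == newline) n += 1
            total.addAndGet(n)
          }
        }
      })
    }
    threads.foreach(_.start())
    turn(0).release()                        // hand the channel to the first worker
    threads.foreach(_.join())
    total.get
  }
}
```

A worker cannot read again until it has finished scanning its own buffer, so if processing falls behind, the ring stalls at that worker instead of buffering more input. Following the comment above, the worker count would be Runtime.getRuntime.availableProcessors + 1.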

The hard part is the line boundaries, because the beginning and the end of any given read most likely fall in the middle of a line, not at an end.
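One possible way to handle this, sketched under simplifying assumptions and not necessarily what pio.scala does: split each chunk into the fragment before its first newline, the whole lines in the middle, and the fragment after its last newline, then join each trailing fragment to the next chunk's leading fragment. The names below are hypothetical, and the chunks are assumed to be already-decoded strings rather than raw bytes in direct buffers.

```scala
import scala.collection.mutable.ListBuffer

object ChunkStitchSketch {
  // head: text before the first '\n'; whole: complete lines in between;
  // tail: text after the last '\n'. hasNewline is false if the chunk has no '\n'.
  final case class Pieces(head: String, whole: List[String], tail: String, hasNewline: Boolean)

  // Independent per chunk, so this step could run in parallel.
  def split(chunk: String): Pieces = {
    val first = chunk.indexOf('\n')
    if (first < 0) Pieces(chunk, Nil, "", hasNewline = false)
    else {
      val last = chunk.lastIndexOf('\n')
      val whole =
        if (last > first) chunk.substring(first + 1, last).split("\n", -1).toList
        else Nil
      Pieces(chunk.substring(0, first), whole, chunk.substring(last + 1), hasNewline = true)
    }
  }

  // Sequential stitching pass: glue each chunk's tail onto the next chunk's head.
  def lines(chunks: Seq[String]): List[String] = {
    val out = ListBuffer[String]()
    var carry = ""
    for (c <- chunks) {
      val p = split(c)
      if (!p.hasNewline) carry += p.head // the line continues past this chunk
      else {
        out += (carry + p.head)
        out ++= p.whole
        carry = p.tail
      }
    }
    if (carry.nonEmpty) out += carry // final line without a trailing newline
    out.toList
  }
}
```

For example, the chunks "ab\ncd", "e\nf", and "g\n" stitch back into the lines "ab", "cde", and "fg". Working on raw bytes would also need care with multi-byte characters that straddle a chunk boundary, which this sketch ignores.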

That's interesting, apparently you have paralleliz...

2007-12-05T15:36:00.000-05:00

That's interesting, apparently you have parallelized the line reading and splitting, but the counting itself is linear, isn't it?

So this is the opposite of my attempt, where splitting is sequential and counting (+ matching) is done in parallel. We should try to combine this :-)

Comments on Erik Engbrecht's Blog: Adventures in Widefinding: Complexity
