Wednesday, June 18, 2008

What does I/O bound really mean?

There's a lot of old folk wisdom out there that justifies the use slow languages and runtimes on the basis that the impact on performance doesn't matter. Here's a sampling of them:

  • Most applications are I/O bound so the performance of the programming language doesn't matter

  • The database does all the heavy lifting, so performance of the application code doesn't matter

  • Computer time is cheap compared to programmer time

Like most folk wisdom, these all have an element of truth. They are often brought up in discussions about the merits of strongly-typed compiled languages versus dynamic languages, and natively compiled languages versus virtual machine based languages versus interpreted languages. I could write about any one of them, and many more, but today I'll address I/O because of the “work” I've been doing on the WideFinder 2 project.

WideFinder 2

The original WideFinder project was, at least on first inspection, quite clearly I/O bound (well, if you had an input file smaller than your main memory...) The WideFinder 2 is a little better because it is a little more computationally complex, but not by much. The benchmark is really simple: process a big (42 GB) web server log file and report some basic statistics. In order to perform the benchmark, the program must parse the file, store information from each line on several maps, and finally extract some statistics out of them.

Tim benchmarked sequential block input on the test system at just shy of 150M/s. This means that the I/O alone required to process the 42GB file should take about 5 minutes, meaning that if the benchmark truly is I/O bound then the program shouldn't take much more than 5 minutes to run. Consequently, if we are to judge based on Tim's reference implementation of the WF2 in Ruby then it quite clearly isn't I/O bound.

I/O Takes CPU, too

If you look closely at the Bonnie benchmark results you'll notice that doing raw block I/O – meaning just reading files into a buffer – consumes almost 80% of the a single core of the CPU. That means that I/O alone comes pretty close to being bound by the CPU as well as being bound by the disk. In fact, experimentation uncovered that placing the system high load reduces maximum I/O throughput. In order to achieve maximum throughput, you actually have to ensure that you keep one core free to manage the I/O.

Application I/O is more than I/O

Another key factor is that what most application programmers think of as I/O involves a lot more than shuttling bits from disk into memory. For example, the Java and other unicode based platforms have to decode the character encoding of the file into the native character encoding of the runtime. In the case of the JVM, this not only requires that every byte be processed, but also frequently doubles the memory consumption of each character and requires the data be copied into a separate buffer. Furthermore, application code usually deals with files line-by-line in the form of some standard (often immutable) string object, thereby requiring yet another pass over the data. So when your code is simply iterating over lines in a file, it isn't really just doing I/O. A fair amount of CPU/memory bound work is required to deliver those nice, neat, unicode strings.

Large datasets change the rules

Memory management isn't free, and even with an industrial strength garbage collector it isn't always simple. A common practice with immutable objects is to have them share some internal state in order to avoid excess data copying. For example, when you take a substring of a Java string, the two string objects share an underlying character array. Often times this saves memory, and it changes generating a substring of a string from a linear time and space operation (with respect to string length) to a constant time and space operation. Unfortunately if the substring is much longer lived that the original string, and especially if it is much smaller, you end up with something that feels awful lot like a memory leak (I think it could be debated whether it is a leak or not).

Even if you're not carrying around a lot of excess baggage in the form of substrings, processing gigabytes of data consumes a lot of memory. In the case of WF2, a little bit of pretty much every line needs to be stored for the calculations at the end. The cast majority of my “debugging” time for WF2 was spent figuring out what JVM parameters will actually allow the program to finish without slowing to a crawl (pre-tuning there were points were it would spend 90% of its time in GC) or consuming every bit of memory available (at least a few WF2 solutions hog tons of memory).

All this means that when dealing with large files other parts of the system need to do more work, which takes away time that could be spent on “real” processing or on I/O (which we've already seen can be a CPU hog). Furthermore, I/O routines and routines commonly used along side heavy I/O (like regular expressions) must be very careful about their memory usage.

So is WF2 really I/O bound?

It depends. Go take a good hard look at the WF2 results page. If you look at Tim Bray's results you would conclude that no, WF2 clearly is not I/O bound. However, if you look at the top you'll see some very impressive numbers that indicate that WF2 indeed is I/O bound (side note: I really need to take a hard look at OCaml). Of course, you could argue that even in the OCaml case it really is CPU bound, because making the task take about as long as the required I/O saturated 8 cores and 32 hardware threads. Scanning down the list would seem to indicate that in most cases I/O does not represent a significant bottleneck, but then it's hard to really tell. The disk may not be the bottleneck, but the I/O routines within the libraries or runtime may be. Consequently, from the application programmer perspective, WF2 is I/O bound.

Redefining “I/O bound” and future impacts

For a long time “I/O bound” primarily referred to hardware or possibly operating system limitations. That was a useful definition, but it is time for it to change. Most software being developed today has a very tall stack of abstractions sitting between it and the hardware. Operating systems schedule I/O and have to split limited resources among many competing processes. Virtual machines sit on top of operating systems, isolating the programmer from the underlying OS and hardware. Libraries and frameworks isolate the programmer from the virtual machine and even other libraries and frameworks. I/O from the programmer's perspective is, or at least should be if he is working on top of good abstractions, that I/O is “everything that happens between when my program requests some data and when it receives it.” Consequently, libraries and runtimes should go to great lengths to ensure that being I/O bound is as close to its original meaning as possible. Prior to multicore systems becoming pervasive that was largely true, but today's I/O libraries fail to take advantage of them, and consequently force I/O into being a bottleneck when it should not be.

That's why in my Widefinder submissions I've worked to separate parallelized I/O into a library-like module. It quite clearly is not done, but it is relatively successful. I reused my the parallel I/O code that I developed for WF1 on WF2 without changing any of the logic. It doesn't provide many features, and it could use a fair amount of optimization and simplification (the line joining logic is an atrocity), but it works.

Sphere: Related Content


Markus Kohler said...

Very good post!

Actually in my experience NIO in Java is still far from being perfect.

As far as I know there's no direct way in Java to use NIO together with Regular Expression.
What really would be needed is a way to directly do regexp matching on a directly buffered stream.

Also the assumption that everything has to be Unicode plus the fact that String in Java is immutable, leads to unnecessary

By the way jruby uses a special regexp library that was ported from ruby to java which does not use Strings but byte arrays.
This could maybe solve the Unicode problem.



Erik Engbrecht said...

Java regular expressions operate on CharSequence, so you can take the CharBuffer (or a subsequence of it) you get after the decode step. There are some challenges with block versus line oriented I/O (or other oriented) but they can be overcome. I'm planning on combining Ray Waldin's regex based approach with my parallel I/O.

That way I can do line splitting and space delimiting all in one pass.

I think proper unicode support is really important. Yes, you can make things go a lot faster using plain old ASCII, and a lot of what's out there is just plain of ASCII, but we live in an international world.


Anonymous said...

Hi, I've been reading a bit about WideFinder 2. I wonder if compressing the log file and then decompressing on the fly would improve I/O at the expense of CPU time. The log file should compress by 10X.

Second, how about preprocessing the log file so each line is at a fixed offset? That way your parallel I/O library doesn't have to do anything fancy. Maybe then you could mmap the file and run a bunch of threads in parallel. With huge files, it's reasonable to expect the data be formatted better.

Viagra Online Without Prescription said...

Greetings Erik Engbrecht and readers.

I respectfully wanted to let you all know my point about this matter. The true fact of this issue is that I didn't had any idea that such a wonderful explanation existed to justify I/O.

Viagra said...

Thanks for the explanation on what this really means!

louisville cheap hotels said...

love this! so great! this is something very interesting. A great concept that reflects the excellent thoughts of the writer. Thank you for providing this information.

muebles en madrid said...

Goodness, there is so much useful information above!