Over the past few weeks in my copiuos free time I've been trying to come up with a fast Widefinder using Scala Actors where the parallelization magic is hidden behind what looks like a nice friendly API. Based on timings on my oh-so-fast Athlon XP 1900+ I seem to have failed on the first account (Ruby is faster) and the second part requires some awareness of parallel programming. But I'm hopeful that when Tim runs it on the T5120 it will be competive with the other solutions, although I doubt that it will be the fastest.
For those of you who don't want to listen to me ramble, here's a jar containing all of the class files needed to run my solution along with the Scala source code. It's so big because I put the Scala libraries in there for Tim so he wouldn't have to worry about CLASSPATH error or installing Scala.
Ok, so why isn't it the fastest? What's the point? Well, first let me define what I think a good widefinder solution should demonstrate.
- The "application code" should be similar in complexity and readability to Tim Bray's example in Ruby. Anything else should be hidden behind a general purpose API, preferably an out-of-the-box one.
- It should be environmentally friendly. This means that it shouldn't hog all the available RAM and make other processes start paging. It should also be well behaved on a machine under load. This is not relative to input file size. It should ALWAYS be environmentally friendly.
- It should maintain the abstractions to which programmers are accustomed. For example, it shouldn't bypass regexes or use non-standard string types. Also, if a language normally deals well with unicode and the potential for multiple encodings, it should deal well with unicode and multiple encodings as well.
- The person running it shouldn't have to tell it about details like how many worker processes to spawn. It should figure it out based on its environment. Ideally the programmer shouldn't have to worry about it either.
- It should be sufficiently performant on a single processor box.
- It should run faster as more processors are made available.
I think my solution is reasonable according to all of these measure. Hopefully we'll see about execution on lots of cores shortly. #1 is a little tricky, because the using a concurrent map was tricky and there is a definite impedance mismatch between Java collections and idiomatic Scala. Depending on your perspective a map-reduce based solution would have been more elegant, but unless map-reduce becomes required learning for all professional programmers then I'm not so sure.
#2 is one area where, in my opinion, many of the current leading solutions fall down. A favorite way of achieving easy parallel IO (well, IO on most disk can't really be done in parallel) is to spawn a bunch of processes (or threads) and have them each mmap a large chunk of the input file and process it. The issue with this is that, as
Fredrik Lundh noticed (BTW - he has an excellent discussion - go read it), this can cause serious paging on some operating systems. Other solutions allocate buffers measured in megabytes for each worker thread/process. I personally don't see this as very environmentally friendly when only kilobytes are really needed. My solution reads the file sequentially using one set of buffers per worker (about 128k total per worker). Each worker reads in 32k and then passes the FileChannel to the next worker.
#3 is the real kicker, and what kicked down the performance of Scala (and, I believe, any other JVM based solution). Java strings are unicode and based on 16 bit characters. The input file is ASCII and based on 8 bit characters. Consequently, a String-oriented solution on the JVM has to transcode the characters. Futhermore, it should be able to transcode the characters from any common file format, not just ASCII or UTF8. This turns out to take a lot of cycles, around 10% depending on the charset and the cost of some other functions. The good news is that each buffer can be transcoded in parallel.
#4 is more about maintaining abstractions. My solution does it automatically, the other solutions don't seem to...although adding it probably wouldn't be a big deal.
#5 is probably true of most of the solutions...or true if you substitute dual-core for uniprocessor, because that's what most of the people used to develop their solutions. In the case of my solution I think there is about a 30% performance penalty on a single processor box (that and transcoding are enough to make it slower than Ruby).
#6 is interesting because this is an I/O bound task. Assuming the data is actually being read from disk (I believe it is cached in all or most of his trials), then this task should only be able to keep a small handful of processors busy. But that's not to say it isn't important.
The reason it's important is because from the normal programmer's point of view, file I/O isn't just shuttling information from the disk to a buffer, it's everything that happens between him making the request and a string representing a line being returned. There's a lot of work that goes into making that string. Here's a list (more or less) for Java and Scala.
- Read the data into a direct ByteBuffer.
- Copy the data into a heap ByteBuffer**
- Transcode the ByteBuffer into a CharBuffer
- Find the characters that represent a line
- Make a String from those characters
Your operating system probably uses something called readahead to asynchronously fetch the next N bytes of a file before you actually request it, on the assumption that you will. The Java or Scala library could asynchrously fill a fixed-sized queue of lines, assuming concurrency is cheap enough.
I learned a lot in this excersise, but my two big takeways are:
- Abstraction has a real cost in terms of CPU cycles, and existing abstractions in mainstream languages (ok, Scala isn't mainstream yet, but Java is) are not optimized for multi-core environments. This will need to change.
- Abstraction has a big cost in terms of programmer cycles if the programmer wants something different. Without the source available from OpenJDK I would have been lost when I was trying to figure out why approaches that seem like they require less CPU work actually require more. The Client and Server VMs make a huge difference in performance that varies widely depending on file size and program structure. With the libraries, JVM, and OS all trying different things to make your code go faster it is extremely difficult to figure out exactly what is happening and why.
**I looked at the OpenJDK source and if you use NIO to read into a non-direct buffer, what actually happens is the Java library reads into an available direct buffer and then copies it into a direct buffer. This is essentially what happens with a regular FileInputStream as well. At first I thought I could speed things up by skipping the copy from the direct buffer to the non-direct buffer, but it doesn't. You see, decoders are optimized to use the array backing the buffer if it is available because direct array accesses are much faster than function calls. Array copies are done using low-level native code. Consequently, copying the direct buffer and the transcoding the in-heap copy is much faster than directly transcoding the direct buffer.
Sphere: Related Content