Friday, February 06, 2009

Profiling the ExceptionBlob

Last week on the Scala IRC channel (#scala) I chatted with several people about the need to substantially reduce the use of exceptions for flow control within the Scala Actors library.  My remarks were met with a certain amount of healthy skepticism, because significant progress has been made on the performance of Java exceptions, as described by John Rose in his blog post Longjumps Considered Inexpensive.

I highly suggest you go read the post.  I don't doubt anything stated in it, and in fact I noticed a substantial performance improvement when I stopped benchmarking against Java 5 and started benchmarking against Java 6 (I usually work on a Mac, where Java 5 is the default VM), and I expect even more improvement with Java 7.  Once you're done, I'd like to call your attention to a couple of fragments from it (additional emphasis mine):

The Hotspot JIT can optimize a throw of a preallocated exception into a simple goto, if the thrower and catcher are together in the same compilation unit (which happens because of inlining).

Here is a concrete example of non-local return, implemented efficiently via a cloned exception. I have observed the Hotspot server compiler emitting a machine-level goto for the throws. With current server VM optimizations of this code, the cost of a non-local return is less than three times the cost of a plain (local) return; with the client VM the ratio is ten. See the code example and benchmark for details: NonLocalExit.java (javadoc).

These statements have a couple of implications:

  1. Reducing a throw to a long jump requires a certain amount of HotSpot magic to take place, ensuring that the throw and catch end up in the same compilation unit.  The more complex the code, the less likely this is to happen.
  2. Long jumps are still substantially slower than normal returns.

I think John's post received a lot of attention, and people (including myself) saw it, didn't read it carefully, and assumed that HotSpot now performed miracles with cached exceptions.  What we missed was a post to the MLVM development mailing list about a year later by Charles Nutter, who had noticed that his non-local returns in JRuby were not being compiled down into long jumps, and the subsequent response from John Rose.  There's a lot of very good technical discussion in there, but I think the gist is that HotSpot's ability to optimize throws into long jumps is still limited, that it often needs help to do so, and that very smart people are working on improving it.

Of course, that all still leaves the question: just how much slower are cached exceptions than a normal return?  I can't really answer that in the general case, but I did put together a couple of microbenchmarks in Java that roughly simulate the kind of stack creation that happens within my state-based actors.  The benchmark is simple: it recursively descends, incrementing a member variable on the way down, and decrements another member variable on the way back up.  In the exception-based version, the decrement happens in a finally block.  This is admittedly rather pathological code, but it is also simple and easy to understand.

Here's the normal return version:

public class NormalReturn {
    private int down = 0;
    private int up = 0;

    public static void main(String[] args) {
        NormalReturn m = new NormalReturn();
        m.doTest(10000000);
    }

    public void doTest(int times) {
        for (int i = 0; i < times; i++) {
            down = 0;
            up = 0;
            deepRecurse(1000);
        }
        System.out.println(down);
        System.out.println(up);
    }

    // Descends until the counter passes i, then unwinds with plain returns,
    // decrementing on the way back up.
    public void deepRecurse(int i) {
        down++;
        if (down <= i) {
            deepRecurse(i);
            up--;
        }
    }
}

...and here's the version that uses a cached exception:

public class TryFinally {
    // A preallocated exception; overriding fillInStackTrace() avoids the
    // cost of capturing a stack trace on every throw.
    private static class Signal extends RuntimeException {
        @Override
        public Signal fillInStackTrace() { return this; }
    }

    private final Signal signal = new Signal();
    private int down = 0;
    private int up = 0;

    public static void main(String[] args) {
        TryFinally m = new TryFinally();
        m.doTest(10000000);
    }

    public void doTest(int times) {
        for (int i = 0; i < times; i++) {
            try {
                down = 0;
                up = 0;
                deepRecurse(1000);
            } catch (Signal s) {
                // expected: the cached exception unwound the stack
            }
        }
        System.out.println(down);
        System.out.println(up);
    }

    // Descends until the counter reaches i, then unwinds by throwing the
    // cached Signal; the finally block runs in every stack frame.
    public void deepRecurse(int i) {
        try {
            down++;
            if (down == i) throw signal;
            else deepRecurse(i);
        } finally {
            up--;
        }
    }
}

With profiling turned off and -server on Java 6 on my Mac, the normal-return version takes about 60 seconds to complete, while the TryFinally version takes over 10 minutes.  That's far worse than the sub-3x ratio John Rose cited for the server VM — it's roughly the 10x ratio he cited for the client VM — so a cached exception here is still an order of magnitude slower than a normal return.  One interesting thing about the benchmark is that if you decrease the recursion depth and proportionally increase the number of times deepRecurse() is executed, the total time stays relatively constant.
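If you want to reproduce the comparison in a single run, the two versions above can be wrapped in a small timing harness along these lines.  The class name, the time() helper, and the scaled-down iteration count are my own illustrative choices, not from the original benchmarks:

```java
public class ExceptionBenchmark {
    private static final class Signal extends RuntimeException {
        @Override
        public Signal fillInStackTrace() { return this; }
    }
    private static final Signal SIGNAL = new Signal();

    static int down;
    static int up;

    // Plain-return recursion, mirroring NormalReturn.deepRecurse.
    static void recurseReturn(int limit) {
        down++;
        if (down <= limit) {
            recurseReturn(limit);
            up--;
        }
    }

    // Exception-unwound recursion, mirroring TryFinally.deepRecurse.
    static void recurseThrow(int limit) {
        try {
            down++;
            if (down == limit) throw SIGNAL;
            else recurseThrow(limit);
        } finally {
            up--;
        }
    }

    // Runs the body once and returns elapsed wall-clock milliseconds.
    static long time(Runnable body) {
        long start = System.nanoTime();
        body.run();
        return (System.nanoTime() - start) / 1000000;
    }

    public static void main(String[] args) {
        final int iterations = 1000; // scaled down from the post's 10,000,000
        final int depth = 1000;
        long plain = time(() -> {
            for (int i = 0; i < iterations; i++) {
                down = 0; up = 0;
                recurseReturn(depth);
            }
        });
        long thrown = time(() -> {
            for (int i = 0; i < iterations; i++) {
                down = 0; up = 0;
                try { recurseThrow(depth); } catch (Signal s) { /* expected */ }
            }
        });
        System.out.println("plain return: " + plain + " ms");
        System.out.println("cached throw: " + thrown + " ms");
    }
}
```

With so few iterations the JIT may not fully warm up, so treat the numbers as rough; the original 10,000,000-iteration runs are what produced the 60-second and 10-minute figures.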

So with this in mind, I'm going to work hard to refactor the actors library to keep exception usage to a minimum, and, where exceptions must be used, to minimize the depth of the stack they unwind.
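As a sketch of the direction such a refactoring can take (the class name here is illustrative, not actual actors code): the non-local return in TryFinally can often be replaced by an ordinary return value that each frame checks, so nothing is thrown at all:

```java
public class SignalFree {
    int down = 0;
    int up = 0;

    // Returns true once the sentinel depth is reached.  Short-circuit
    // evaluation stops the descent, and the decrement runs in every
    // frame on the way back up, just like the finally block did.
    public boolean deepRecurse(int limit) {
        down++;
        boolean done = (down == limit) || deepRecurse(limit);
        up--;
        return done;
    }

    public static void main(String[] args) {
        SignalFree m = new SignalFree();
        m.deepRecurse(1000);
        System.out.println(m.down); // counts up on the way down
        System.out.println(m.up);   // counts down on the way back up
    }
}
```

The unwinding is now a chain of plain returns, which is exactly the case HotSpot already optimizes well.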


Comments:

Rich Dougherty said...

Thanks for the post, Erik. This is great work you're doing.
