## Sunday, August 03, 2008

### The Costs of Generality

I've been pondering the results of the Cedric's Code Challenge, and wondering just how much benefit is derived from optimized, purpose-specific solutions as opposed to solutions that rely on a more general libraries or frameworks.  It's fairly common to see debates where one person (or group) insists that general constructs from a standard library or other such solutions represent an unacceptable overhead, and the other side claims that the overhead is meaningless compared to runtime optimizations performed by HotSpot and the cost of programmer time.  These debates can be rather painful to watch, as both sides generally have good points, yet often seem to be arguing right past one another.  Consequently, I think a little exploration of various potential optimizations and what their respective impacts on performance would be beneficial.

For purposes here, I'm going to say that a solution based on a general framework would be one that uses a general purpose library to generate permutations of digits, filters out the ones with a leading zero, converts the permutations to numbers, and then collects the desired statistics.  A purpose specific solution would be one such as Crazy Bob's that is tailor-made for generating numbers based on permuted digits.

### The General Solution

I'm not aware of a combinatronics library for Scala, but it is simple enough to write a generic permutation generating function:

```  def permute[E](s: Set[E], n: Int)(f: List[E] => Unit): Unit = {
def p(s: Set[E], r: List[E]): Unit =
if (r.length == n) f(r) else for(e <- s) p(s - e, e :: r)
p(s, Nil)
}
```

This recursively generates all of the possible permutations.  When it as generated a complete permutation, it passes it to the function specified by the caller.  If `s` is an ordered set, then the permutations will be generated in a predictable order. This can then be used to generate the permutations of digits for the code challenge, as follows:

```  val digits = TreeSet(0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L)
def main(args: Array[String]): Unit = {
val start = java.lang.System.nanoTime
var last, cnt, jumpLow, jumpMagnitude = 0L
for(d <- 1 to 10) permute(digits, d) { p =>
val xs = p.reverse // digits are generated in the wrong order so must be reversed
if (xs.head != 0L) {  // make sure this is a valid set of digits
val cur = xs.foldLeft(0L)((z, a) => (z * 10L) + a)
val dif = cur - last
if (dif > jumpMagnitude) {
jumpLow = last
jumpMagnitude = dif
}
last = cur
cnt = cnt + 1L
}

}
val end = java.lang.System.nanoTime
println("Count: " + cnt)
println("Jump: " + jumpMagnitude + " (from " + jumpLow + " to " + (jumpLow + jumpMagnitude) + ")")
println("Time: " + ((end - start) / 1000000L) + " ms")
}
```

This solution takes about 13 seconds on my MacBook.

### Generate Only Valid Permutations

The above permutation function can be tweaked as follows to generate only valid permutations (ones without the leading zero), and thereby saving about 10% execution time.

```  def digitPermute(n: Int)(f: List[Long] => Unit): Unit = {
def p(s: Set[Long], r: List[Long]): Unit =
if (r.length == n) f(r) else for(e <- s) p(s - e, e :: r)
for(first <- (digits - 0L)) p(digits - first, first :: Nil)
}
```

The above solution executes in about 12 seconds.

### Accumulating Numbers Instead of Lists

Both of the methods above construct lists of numbers which are later assembled into numbers.  This wastes memory and cycles, because only the resulting numbers are required and they can be accumulated much more efficiently.  Doing so, as shown below, reduces execution time to about 7 seconds.

```  val digits = TreeSet(0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L)
def digitPermute(n: Int)(f: Long => Unit): Unit = {
def p(s: Set[Long], d: Int, r: Long): Unit =
if (d == n) f(r) else for(e <- s) p(s - e, d + 1, r * 10L + e)
for(first <- (digits - 0L)) p(digits - first, 1, first)
}
```

### Long Set without Boxing

The above implementations all use TreeSet from Scala's standard library, which imposes a few performance penalties. For one, it is "generic." This means that it requires both type-erasure and boxing instead of using primitives.  Second, if you look carefully at the definition of TreeSet, you'll notice that it doesn't require its contents to be Ordered, but rather uses a (potentially implicit) view converting the contained type into an Ordered.  This adds an extra layers of indirection and therefore an extra cost.

```  final class LongSet (val contents: Array[Long]) {
private def indexOf(v: Long, min: Int, max: Int): Int = {
if (min > max) -1
else {
val mid = (min + max) >>> 1
val midVal = contents(mid)
if (midVal < v) indexOf(v, mid + 1, max)
else if (midVal > v) indexOf(v, min, mid - 1)
else mid
}
}
def foreach(f: Long => Unit) {
var i = 0
val max = contents.length
while (i < max) {
f(contents(i))
i = i + 1
}
}
def -(v: Long): LongSet = {
val max = contents.length - 1
if (indexOf(v, 0, max) < 0) this
else {
val a = new Array[Long](max)
var i, j = 0
while (i <= max) {
val cur = contents(i)
if (cur != v) {
a(j) = contents(i)
j = j + 1
}
i = i + 1
}
new LongSet(a)
}
}
}
val digits = new LongSet(Array(0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L))
def digitPermute(n: Int)(f: Long => Unit): Unit = {
def p(s: LongSet, d: Int, r: Long): Unit =
if (d == n) f(r) else for(e <- s) p(s - e, d + 1, r * 10L + e)
for(first <- (digits - 0L)) p(digits - first, 1, first)
}
```

This implementation brings the execution time down to ~1.7 seconds, representing a substantial savings over TreeSet. The comparison isn't quite fair, as TreeSet uses a Red-Black balanced tree and the code above uses a sorted array, but the difference is still substantial and shows that having a more targeted data structure can improve performance significantly.  At this point you might be thinking "Well no sh*t sherlock!  Of course a data structure tuned for a specific type is faster than one that is written to handle any type!"  That's true...to a point.  Not all languages implement generics using type erasure and require boxing of values within parameterized classes.  For example, C++ was designed to ensure that data structures implemented using templates imposed little or no overhead above more raw ones.

### Special Purpose Set for Permutation Generation

Another approach is to use a more special-purpose data structure in the permutation function without reducing its generality.  The linked set used in Crazy Bob's solution can be generalized to generating permutations of any type.  Unfortunately, this structure is mutable, and mutates on every invocation.  This means that while it would be possible to pass it directly to client code, it would be extremely dangerous because the client code may maintain a reference to the rapidly changing data structure.  Consequently, the structure needs to be copied into a list or similar structure before being passed to client code.  The solution built around the code below completes in ~5 seconds, which is slower than using an structure explicitly coded for dealing with longs and generating longs, but over twice as fast as generating permutations using the standard TreeSet class.

```  private final class Element[E](val value: E, var next: Element[E], var previous: Element[E]) {
/** remove an element from the set*/
def use() {
if (previous ne null) previous.next = next
if (next ne null) next.previous = previous
}
/** put an element back in the set */
def yieldValue() {
if (previous ne null) previous.next = this
if (next ne null) next.previous = this
}
}
private def buildElements[E](s: Set[E]): Element[E] = {
val iter = s.elements
val first = new Element(iter.next, null, null)
var cur = first
while(iter.hasNext) {
cur.next = new Element(iter.next, null, cur)
cur = cur.next
}
first
}
def permute[E](s: Set[E], n: Int)(f: List[E] => Unit): Unit = {
def p(start: Element[E], head: Element[E], r: List[E]): Unit = {
def take(current: Element[E]): Unit = {
if (current ne null) {
val newR = current.value :: r
if (newR.length == n) {
f(newR)
take(current.next)
} else {
current.use()
current.yieldValue()
take(current.next)
}
}
}
take(start)
}
val first = buildElements(s)
p(first, first, Nil)
}
```

### Conclusion

The various implementations here represent a sampling of various ways that Cedric's Code Challenge can be implemented in Scala, and the effects they have on performance.  A relatively direct port of Crazy Bob's solution to Scala completes in ~0.4 seconds, making it by far the fastest solution and about 30 times faster than the solution using standard data structures with a generic permutation generator.  That's not really surprising, so what can we conclude?  The most obvious conclusion is that avoiding the construction of intermediate objects yields a substantial speedup.  This can be seen in two places.  The first is in the switch from constructing a List to represent the permutation to accumulating the Long directly.  The second is in using a special-purpose mutable data structure to generate the permutations, thereby avoiding repeated allocations of Set objects.  Finally, reducing overhead due to things like boxing and the casts associated with type erasure does make a noticeable difference in performance.  On the flip side, Scala's closure based constructs, such as nested functions and for loops, added negligible overhead, if any at all. Using more general constructs instead of more specific ones clearly has a substantial performance cost, but it's also worth mentioning that the cost is trivial compared to the benefit received in the transition from a brute-force solution to an optimal algorithm.

Sphere: Related Content