Saturday, June 28, 2008

Cedric's Code Challenge

Cedric Beust issued an interesting little challenge this morning:

Here is an interesting coding challenge: write a counter function that counts from 1 to max but only returns numbers whose digits don't repeat.

For example, part of the output would be:

  • 8, 9, 10, 12 (11 is not valid)
  • 98, 102, 103 (99, 100 and 101 are not valid)
  • 5432, 5436, 5437 (5433, 5434 and 5435 are not valid)

Also:

  • Display the biggest jump (in the sequences above, it's 4: 98 -> 102)
  • Display the total count of numbers
  • Give these two values for max=10000

He welcomed brute-force solutions, but really the challenge here is in coming up with something more efficient and elegant. There are basically three general approaches:

  1. Run through all the numbers from 0 to n, test each for no repeating digits, and track the above statistics while you do it. This is the brute force method.

  2. Permute the digits in a fashion that generates numbers sequentially and track the statistics. Alternatively you could generate them in any order, sort them, and then calculate the statistics.

  3. Derive a heuristic that proves a given sequence of numbers will all contain repeated digits and can therefore be skipped.

I think #2 is probably the ideal approach, but I didn't think of it until I was mostly done coding #3.

Finding Repeated Digits

The first step in solving this problem, no matter what the approach, is to come up with an algorithm to detect repeated digits. Commenters on Cedric's blog came up with a number of ways to do this, most of which centered around converting the integer into a string and then finding repeated characters. This is a rather frighteningly inefficient approach. There is certainly no need to convert the number into a string in order to know its digits. A much simpler approach is to allocate a ten-element array of booleans initialized to false, and then generate the digits from lowest to highest by repeatedly taking the number mod ten. The first time you encounter a digit, you flip its associated array element to true. The second time you encounter it, you exit because you have detected a repeat. The array is essentially serving as a thrifty man's map. Here it is in Scala:

  def containsRepeatedDigit(i: Long): Boolean = {
    val digits = new Array[Boolean](10) // elements default to false
    def f(i: Long): Boolean = {
      if (i == 0L) false // all digits have been processed
      else {
        val d = (i % 10L).asInstanceOf[Int]
        if (digits(d)) true
        else {
          digits(d) = true
          f(i / 10L)
        }
      }
    }
    if (i < 11L) false else f(i)
  }
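
For reference, the brute force method (approach #1) falls straight out of this test. A minimal sketch, assuming the definition above:

  // Approach #1: test every number from 1 to max using containsRepeatedDigit.
  def bruteForceCount(max: Long): Long = {
    def f(i: Long, cnt: Long): Long = {
      if (i > max) cnt
      else f(i + 1L, if (containsRepeatedDigit(i)) cnt else cnt + 1L)
    }
    f(1L, 0L)
  }

It's a useful cross-check for the heuristic solution below, but it scans every number in the range, which is exactly what the rest of this post tries to avoid.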

The Heuristic

Consider the number 2201. It has repeating digits. The question is: what's the next number without repeating digits? It is 2301. You could calculate it using brute force, but you'd end up scanning an extra 99 numbers. Notice that the repetition is in the upper digits. This means that you cannot get a number with non-repeating digits until the second digit (counting from the left) changes. Now consider the number 2200. In this case changes need to occur in both the lower and the upper digits; however, addressing the upper digits allows us to skip a much larger section of the search space. Finally, consider 22200. In this case, you still want the second digit. However, you are searching from the right, so algorithms that detect the first repeat won't work. Here's Scala code to find the appropriate digit. Notice that it looks very similar to the repeated digit test above.

  // Returns the largest element of the array.
  def max(array: Array[Int]): Int = {
    def f(idx: Int, m: Int): Int = {
      if (idx == array.length) m
      else if (array(idx) > m) f(idx + 1, array(idx))
      else f(idx + 1, m)
    }
    f(1, array(0))
  }

  // Scans the digits from right to left. For each digit, recentIdx holds the
  // most recent (leftmost so far) position at which it was seen and prevIdx
  // the position before that, counting from the right starting at 1. The
  // result is the highest position whose digit occurs again further to the
  // left, or 0 if no digit repeats.
  def repeatedDigit(i: Long): Int = {
    val prevIdx = new Array[Int](10)
    val recentIdx = new Array[Int](10)
    def f(i: Long, idx: Int) {
      if (i > 0) {
        val d = (i % 10L).asInstanceOf[Int]
        if (recentIdx(d) > 0) prevIdx(d) = recentIdx(d)
        recentIdx(d) = idx
        f(i / 10L, idx + 1)
      }
    }
    f(i, 1)
    Math.max(max(prevIdx), 0)
  }
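
A quick sanity check in the interpreter against the examples above (positions are counted from the right, starting at 1):

  repeatedDigit(2201L)  // == 3: no repeat-free number until the third digit changes
  repeatedDigit(2200L)  // == 3: the repeat in the upper digits dominates the lower one
  repeatedDigit(22200L) // == 4: the second digit from the left, as discussed above
  repeatedDigit(5436L)  // == 0: no repeated digits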

Now that we have an algorithm for finding the highest digit that needs to be changed, we need one that will take that information and turn it into the next possible number containing no repeating digits. This simply requires basic arithmetic.

  def nextPossibleNonRepeating(i: Long): Long = 
        nextPossibleNonRepeating(i, repeatedDigit(i))

  def nextPossibleNonRepeating(i: Long, idx: Int): Long = {
    if (idx == 0) i + 1L
    else {
      val p = Math.pow(10.0, (idx - 1).asInstanceOf[Double]).asInstanceOf[Long]
      val r = i % p
      val d = p - r
      i + d
    }
  }
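
To make the arithmetic concrete, here is the 2201 example again (a sketch assuming the definitions above):

  // repeatedDigit(2201) == 3, so p = 10^2 = 100, r = 2201 % 100 = 1, d = 99
  nextPossibleNonRepeating(2201L) // == 2300: the next candidate, which still repeats the 0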

Given this, it is easy to generate a sequence:

  def nextNonRepeating(i: Long): Long = nextNonRepeating(i, repeatedDigit(i))
  def nextNonRepeating(i: Long, idx: Int): Long = {
    val p = nextPossibleNonRepeating(i, idx)
    val d = repeatedDigit(p)
    if (d == 0) p else nextNonRepeating(p, d)
  }
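
Chaining the two together resolves any remaining repeats. For example:

  nextNonRepeating(2201L) // == 2301: 2300 still repeats, so one more step is taken
  nextNonRepeating(98L)   // == 102, matching the biggest jump in the challenge statement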

Solution

Once this is all done, the solution is pretty straight-forward. It takes the general form of the function used to generate the next number with non-repeating digits, only it has to keep track of a bunch of extra information.

  // start: next candidate to examine; stop: upper bound (inclusive)
  // last: most recent number found with no repeated digits
  // gapLow/gapHigh: endpoints of the biggest jump so far; cnt: running count
  def printNonRepeatingReport(start: Long, stop: Long, last: Long, gapLow: Long,
                              gapHigh: Long, cnt: Long): Unit = {
    if (start > stop) {
      println("max: " + last)
      println("max gap: " + (gapHigh - gapLow) + " between " + 
              gapLow + " and " + gapHigh)
      println("count: " + cnt)
    } else {
      val d = repeatedDigit(start)
      if (d == 0L) {
        //println(start)
        val gap = start - last
        val (gl, gh) = if (gap > (gapHigh - gapLow)) (last, start) 
                       else (gapLow, gapHigh)
        printNonRepeatingReport(start + 1L, stop, start, gl, gh, cnt + 1L)
      } else {
        printNonRepeatingReport(nextPossibleNonRepeating(start, d), stop, last, 
                                gapLow, gapHigh, cnt)
      }
    }
  }
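
Kicking it off for max=10,000 is just a matter of seeding the bookkeeping parameters, something like:

  // last, gapLow, gapHigh, and cnt all start at zero
  printNonRepeatingReport(1L, 10000L, 0L, 0L, 0L, 0L)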

Results

I'm not going to list all the numbers here, just the statistics for 1-10,000:

  • max: 9,876
  • max gap: 105 between 1,098 and 1,203
  • count: 5,274

Of course I haven't checked it against a brute-force solution or other posted solutions, so I owe a beer/coffee/tea, depending on your persuasion, to anyone who can point out a bug and provide the correct solution.

Just for kicks, here are the results for 1 to 10,000,000,000:

  • max: 9,876,543,210
  • max gap: 104,691,357 between 1,098,765,432 and 1,203,456,789
  • count: 8,877,690

That took 1 minute, 23 seconds on my MacBook. Try that with a brute-force approach.


Monday, June 23, 2008

Google AppEngine

There's a lot of buzz out there about how Google AppEngine is a game changer. My question is: Which game? AppEngine reduces the cost of hosting a web application down to zero. Zero is a pretty profound number. You don't even need to provide a credit card number in case they might want to charge you some day. All you need is a mobile phone number capable of receiving text messages. This means that not only is there no up-front cost, but also that there is no risk that you will suddenly incur huge fees for exceeding transfer quotas when you are lucky enough to be Slashdotted. Your application will just be temporarily unavailable. Hey, if that service level is good enough for Twitter, it's good enough for you, right?

The Open Source Game

Starting an open source project has been free for years. Many project hosting communities exist, providing projects with source code management, issue tracking, wikis, mailing lists, and other features, along with a little visibility within their directories. Web hosting is not exactly expensive, especially for light-weight scripting languages like PHP, Perl, and Python; but it still costs something and therefore requires a sponsor. Oftentimes for a fledgling project the sponsor is also the founder, who also happens to be the lead programmer, help desk, and promoter. There are some who do an amazing job of all this, but for the most part I think project founders choose to wait.

Note: If I am wrong about the availability of good, free hosting for open source web applications, please provide links to them in the comments section. I would love to be proven wrong on this.

The Startup Game

If Paul Graham were to comment on AppEngine, it would probably be that it eliminates one of the last remaining excuses anyone may have for not launching a web startup. If you have an idea, some time, and can hack Python – you can launch a web application with AppEngine. You don't need money, infrastructure, long-term commitment, or any of those other things that may scare a person away from a startup. With AdSense, you can even monetize your application with hardly any effort.

Of course there are limitations. For one thing, AppEngine is very limited in what it can do. Pretty much any idea with even moderate bandwidth, storage, or computation requirements that cannot be delegated to another web application (e.g. embedding YouTube functionality) is out. So, unless you plan on building a business based on mashing together existing applications, then AppEngine probably is not your free lunch. That being said, I fully expect that Google will gradually add APIs for all of its applications to AppEngine, thereby providing a very powerful base for new-but-thin ideas.

The R&D Game

Just for a moment, let's say there was a company that required all of its employees to spend 20% of their time working on a project other than their primary job. This is a company that is betting that it cannot predict which idea is going to be the next big thing, so it must try lots and lots of things and see what sticks. As this company grows, providing infrastructure for those projects would become a real challenge, especially infrastructure that allows them to be tested in the wild as opposed to simply internally, and that allows the projects to instantly scale if they just happen to turn out to be a success. This company would also want to ensure their employees weren't slowed down by having to deal with muck. Such a company would need an infrastructure that could scale to support many, many applications without burdening infrastructure teams. Such a company would need AppEngine.

Now extend the idea further. Let's say it doesn't really matter whether the next big thing was developed by an employee or not. What matters is that the next big idea is bound to the company, for example by using the company standard authentication mechanism or, more importantly, the company standard monetization mechanism.

Ok, so we all know what company I'm talking about. AppEngine allows Google to outsource some of their R&D at very low cost, and given that most successful AppEngine developers will probably choose AdSense to monetize their creations, Google stands to profit regardless of whether those developers ever pay any fees or not. In cases where creators do pay hosting fees, the great Google gets paid twice.

The Recruitment Game

Distinguishing among potential recruits is very challenging. The accomplished academic is almost certainly smart, but may fall apart when asked to work with a significant team on something important to the company rather than a research interest. The industry veteran may have great corporate experience, but political skills could be masking shallow or outdated technical skills. The bottom line is that recruiting is hard because in most cases you never see a direct sampling of an individual's work. At best you can see what a team he was on produced and take an educated guess as to his contribution. Open source projects can provide more information, but for most programmers there is no real motivation to participate in such projects.

AppEngine provides more motivation to the programmer, because he can more readily show his creation to other people without incurring any cost and there is always a chance that he will make some money. There are probably a lot of talented “almost founders” out there who would start a company, but perhaps lack some of the personality traits necessary to do so or found themselves in a situation where they need a steady income stream that simply isn't available for the founder of an early-stage startup. These individuals, and others, will make great pickings for Google.

Conclusion

Long term, in order for Google to grow, it has to attract more people to spend more time on sites displaying AdSense advertisements. Over the past few years Google has come out with countless new online services, most of which are still in beta, and none of which has yielded anything close to the monetization potential of their search business. AppEngine allows them to vicariously service the long tail without continuously expanding their already highly diverse R&D efforts. As a nice added bonus it will provide the opportunity to acquire future high-value startups before they are even real startups, by simply hiring the developers and maybe tossing some stock options their way. On the flip side, I don't think AppEngine is going to have much effect on mainstream web application hosting. The API is too constrained, and for a company with resources the ties to Google are most likely too tight. So I predict AppEngine will be more of a muted success for Google. The infrastructure built for it will be useful internally, it will help them get more sites up more quickly to address eclectic needs without burdening employees, and it will provide a proving ground for future hires and possibly very early stage acquisitions. This could all add up to AppEngine being a very significant success, but it also means the success may stay out of the public eye.


Wednesday, June 18, 2008

What does I/O bound really mean?

There's a lot of old folk wisdom out there that justifies the use of slow languages and runtimes on the basis that the impact on performance doesn't matter. Here's a sampling:

  • Most applications are I/O bound so the performance of the programming language doesn't matter

  • The database does all the heavy lifting, so performance of the application code doesn't matter

  • Computer time is cheap compared to programmer time

Like most folk wisdom, these all have an element of truth. They are often brought up in discussions about the merits of strongly-typed compiled languages versus dynamic languages, and natively compiled languages versus virtual machine based languages versus interpreted languages. I could write about any one of them, and many more, but today I'll address I/O because of the “work” I've been doing on the WideFinder 2 project.

WideFinder 2

The original WideFinder project was, at least on first inspection, quite clearly I/O bound (well, if you had an input file smaller than your main memory...). WideFinder 2 is a little better because it is a little more computationally complex, but not by much. The benchmark is really simple: process a big (42 GB) web server log file and report some basic statistics. In order to perform the benchmark, the program must parse the file, store information from each line in several maps, and finally extract some statistics out of them.

Tim benchmarked sequential block input on the test system at just shy of 150 MB/s. This means that the raw I/O required to process the 42 GB file should take about 5 minutes, so if the benchmark truly is I/O bound, the program shouldn't take much more than 5 minutes to run. Consequently, if we are to judge based on Tim's reference implementation of WF2 in Ruby, then it quite clearly isn't I/O bound.
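
That 5-minute figure is just the raw transfer time; back-of-the-envelope, taking the 150 MB/s and 42 GB numbers above at face value:

  // 42 GB of sequential block input at roughly 150 MB/s
  val seconds = 42.0 * 1024 / 150 // ~287 seconds, a little under 5 minutes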

I/O Takes CPU, too

If you look closely at the Bonnie benchmark results you'll notice that doing raw block I/O – meaning just reading files into a buffer – consumes almost 80% of a single core of the CPU. That means that I/O alone comes pretty close to being bound by the CPU as well as being bound by the disk. In fact, experimentation uncovered that placing the system under high load reduces maximum I/O throughput. In order to achieve maximum throughput, you actually have to ensure that you keep one core free to manage the I/O.

Application I/O is more than I/O

Another key factor is that what most application programmers think of as I/O involves a lot more than shuttling bits from disk into memory. For example, Java and other Unicode-based platforms have to decode the character encoding of the file into the native character encoding of the runtime. In the case of the JVM, this not only requires that every byte be processed, but also frequently doubles the memory consumption of each character and requires that the data be copied into a separate buffer. Furthermore, application code usually deals with files line-by-line in the form of some standard (often immutable) string object, thereby requiring yet another pass over the data. So when your code is simply iterating over lines in a file, it isn't really just doing I/O. A fair amount of CPU/memory bound work is required to deliver those nice, neat, Unicode strings.
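
Here's a rough sketch of what “just reading lines” expands to on the JVM, using the standard java.io classes (the file name is only a placeholder):

  import java.io.{BufferedReader, FileInputStream, InputStreamReader}

  val in      = new FileInputStream("access.log")   // raw bytes off the disk
  val decoder = new InputStreamReader(in, "UTF-8")  // every byte decoded into a 2-byte char
  val reader  = new BufferedReader(decoder)         // chars buffered and copied into immutable Strings
  var line = reader.readLine()
  while (line != null) {
    // by the time the application sees `line`, the data has already been
    // read, decoded, and copied at least once
    line = reader.readLine()
  }
  reader.close()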

Large datasets change the rules

Memory management isn't free, and even with an industrial strength garbage collector it isn't always simple. A common practice with immutable objects is to have them share some internal state in order to avoid excess data copying. For example, when you take a substring of a Java string, the two string objects share an underlying character array. Oftentimes this saves memory, and it changes generating a substring of a string from a linear time and space operation (with respect to string length) into a constant time and space operation. Unfortunately, if the substring is much longer lived than the original string, and especially if it is much smaller, you end up with something that feels an awful lot like a memory leak (I think it could be debated whether it is a leak or not).
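
Here's a small sketch of the effect on the JDKs of this era (the sizes are made up for illustration):

  // stands in for a long log line; only a short key is actually needed
  val line = new String(new Array[Char](4096))
  val key  = line.substring(0, 8)   // shares line's 4096-element char array,
                                    // so holding key alive retains all of it
  val compactKey = new String(key)  // defensive copy trims the retained data to 8 chars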

Even if you're not carrying around a lot of excess baggage in the form of substrings, processing gigabytes of data consumes a lot of memory. In the case of WF2, a little bit of pretty much every line needs to be stored for the calculations at the end. The vast majority of my “debugging” time for WF2 was spent figuring out which JVM parameters would actually allow the program to finish without slowing to a crawl (pre-tuning, there were points where it would spend 90% of its time in GC) or consuming every bit of memory available (at least a few WF2 solutions hog tons of memory).

All this means that when dealing with large files other parts of the system need to do more work, which takes away time that could be spent on “real” processing or on I/O (which we've already seen can be a CPU hog). Furthermore, I/O routines, and routines commonly used alongside heavy I/O (like regular expressions), must be very careful about their memory usage.

So is WF2 really I/O bound?

It depends. Go take a good hard look at the WF2 results page. If you look at Tim Bray's results you would conclude that no, WF2 clearly is not I/O bound. However, if you look at the top you'll see some very impressive numbers that indicate that WF2 indeed is I/O bound (side note: I really need to take a hard look at OCaml). Of course, you could argue that even in the OCaml case it really is CPU bound, because making the task take about as long as the required I/O saturated 8 cores and 32 hardware threads. Scanning down the list would seem to indicate that in most cases I/O does not represent a significant bottleneck, but then it's hard to really tell. The disk may not be the bottleneck, but the I/O routines within the libraries or runtime may be. Consequently, from the application programmer perspective, WF2 is I/O bound.

Redefining “I/O bound” and future impacts

For a long time “I/O bound” primarily referred to hardware or possibly operating system limitations. That was a useful definition, but it is time for it to change. Most software being developed today has a very tall stack of abstractions sitting between it and the hardware. Operating systems schedule I/O and have to split limited resources among many competing processes. Virtual machines sit on top of operating systems, isolating the programmer from the underlying OS and hardware. Libraries and frameworks isolate the programmer from the virtual machine and even other libraries and frameworks. I/O from the programmer's perspective is, or at least should be if he is working on top of good abstractions, “everything that happens between when my program requests some data and when it receives it.” Consequently, libraries and runtimes should go to great lengths to ensure that being I/O bound stays as close to its original meaning as possible. Prior to multicore systems becoming pervasive that was largely true, but today's I/O libraries fail to take advantage of them, and consequently force I/O into being a bottleneck when it should not be.

That's why in my WideFinder submissions I've worked to separate parallelized I/O into a library-like module. It quite clearly is not done, but it is relatively successful. I reused the parallel I/O code that I developed for WF1 on WF2 without changing any of the logic. It doesn't provide many features, and it could use a fair amount of optimization and simplification (the line joining logic is an atrocity), but it works.
