Monday, December 31, 2007

Double (Im)precision

Most computer users have, at one time or another, received an odd result from a spreadsheet or other program that performs calculations. Most programmers know that this is because of the impedance mismatch between the most common way for computers to "think" about numbers (base 2) and the way most people and businesses think about numbers (base 10). This issue receives a fair amount of attention, probably because we (we being programmers) have all had to explain to users why a piece of software can't seem to do arithmetic properly. If you're scratching your head right now, or want to know more, I suggest reading this FAQ about decimal arithmetic.

However, as I've long known but rarely contemplated, there is another form of floating point error that can cause significant problems. Standard floating point numbers only store so many digits of precision. The result is that if you add a very large number to a very small number, the very small number simply vanishes. Let me demonstrate:

scala> 10000000000000000.0 + 1.0
res62: Double = 1.0E16

The 1.0 simply vanished, because standard double-precision floating point numbers don't have enough digits to store the 1.0 part of such a large number. Doubles are an approximation, and that's fine, because often we're only approximating things anyway, and the roughly 15 decimal digits of precision provided by a double are plenty, right?
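You can see why with nothing but the standard library: the gap between adjacent doubles (the "ulp") at 1.0E16 is already 2.0, so there is simply no double for the sum to land on.

```scala
// The gap ("ulp") between adjacent doubles at 1e16 is 2.0, so 1e16 + 1.0
// falls between two representable values and rounds back to 1e16.
val big = 1.0e16
println(math.ulp(big))      // 2.0
println(big + 1.0 == big)   // true: the 1.0 vanished
```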

Well, actually, no. It depends on what you're doing with them. A month or so ago my father convinced me that instead of spending my free time playing with AI problems and studying fringe programming languages, I should do something useful like predict stock price movements. Of course I can use AI and fringe programming languages to do this...

The first thing I decided to do was to smooth out stock price history and make it continuous by fitting a polynomial. So I wrote a quick polynomial class, grabbed Jama (a Java matrix library), downloaded some data, wrote a function to search for the appropriate degree, and this is what I got (supposedly with about 4% error):

Hmmm...that looks kind of funny. The blue line is the actual price. The red line is the polynomial, which has a degree of 44. That's a rather large degree, but certainly not enough to generate all those squiggles. Those squiggles are an artifact of double precision numbers not being precise enough to be used in calculating a degree-44 polynomial. They don't work that well for the matrix calculations that produce the polynomial, either.
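The effect doesn't need degree 44 to show up. Here's a small sketch (my own illustration, not the actual fitting code): (x - 1)^7 expanded into monomial form is mathematically tiny and smooth near x = 1, but evaluating the expanded coefficients in doubles leaves rounding noise that dwarfs the true value.

```scala
// Coefficients of (x - 1)^7 expanded:
// -1 + 7x - 21x^2 + 35x^3 - 35x^4 + 21x^5 - 7x^6 + x^7
val coeffs = Array(-1.0, 7.0, -21.0, 35.0, -35.0, 21.0, -7.0, 1.0)

// Evaluate with Horner's rule, highest degree first.
def expanded(x: Double): Double =
  coeffs.indices.reverse.foldLeft(0.0)((acc, i) => acc * x + coeffs(i))

// Near x = 1 the true value is around 1e-35, but the expanded form's
// intermediate terms are as large as 35, so cancellation leaves rounding
// noise far bigger than the true value -- the same effect behind the
// squiggles, at a much lower degree.
val noisy  = expanded(0.99999)
val direct = math.pow(0.99999 - 1, 7)
```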

I think that red squiggly line is a very good illustration of what happens when you don't pay enough attention to how your software works under-the-hood. Anyway, here's what the results look like using floating point numbers (courtesy of JScience) with 80 digits of precision (keeping the error at about 4%).

Looks a lot better, doesn't it? One interesting thing is that the resulting polynomial has a much, much lower degree than what was produced by the same algorithm using double precision numbers. With double precision numbers, 4% error was about as good as it could get with this dataset. However, using the higher-precision numbers it can go much lower. Below is a graph at about 2% error:

At this point you might be wondering why anyone would use double precision numbers for this type of thing, or perhaps thinking I'm a moron for trying. I certainly felt a little stupid. But using doubles is relatively common. Matlab and everyone's favorite Excel use doubles. Many of the other toolkits and libraries that I found do as well.

Overall, either I'm missing something (because the use of doubles is so common), or this is very frightening. I'm not an expert in this area. I know that doubles use much, much less memory and fewer computational resources. They are also "just there" in pretty much all programming environments, so they are the default. Making arbitrary precision floating point work right wasn't a cakewalk, either. I spent the better part of a day tracing through my code, only to discover a bug in JScience. Also, once you have everything working, you have to figure out what precision to use. The default of 20 digits of precision in JScience wasn't noticeably better than regular doubles.
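If you want to experiment with the precision knob without pulling in JScience, java.math.BigDecimal's MathContext exposes the same choice. A sketch (the magnitudes here are placeholders, not taken from the stock data):

```scala
import java.math.{BigDecimal => JBigDec, MathContext}

// Working precision is a per-operation choice with MathContext.
val mc20 = new MathContext(20)
val mc80 = new MathContext(80)
val big  = new JBigDec("1e40")

// The exact sum 1e40 + 1 has 41 significant digits.
val sum20 = big.add(JBigDec.ONE, mc20)  // rounded to 20 digits: the 1 is lost
val sum80 = big.add(JBigDec.ONE, mc80)  // 80 digits is plenty: the 1 survives
```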


Tuesday, December 11, 2007

Data Center Meltdown

Denial and the coming “data meltdown” by ZDNet's Michael Krigsman -- Subodh Bapat, Sun Microsystems eco-computing vice president, believes we’ll soon see the first world-class data center meltdown. “You’ll see a massive failure in a year,” Bapat said at a dinner with reporters on Monday. “We are going to see a data center failure of that scale.” “That scale” referred to the problems [...]

Now let's think about that. How can a datacenter fail?

  1. It can lose power for an extended period. It takes a lot of backup generators to keep a 50 megawatt datacenter humming.
  2. A virus or worm can shut it down.
  3. A natural disaster can destroy it or force a shutdown. Often this is due to a power failure rather than destruction.
  4. It can have its WAN/Internet connection(s) severed. This isn't quite catastrophic unless you're on the other side of the WAN.

Michael has pointed out that Subodh Bapat doesn't point to the expected cause of a major data center meltdown; he just says one is coming. That's because it's not really a matter of one cause. There are so many risks that, multiplied across the growing number of data centers, a major failure is bound to happen soon. We just don't know what the precise cause will be.

Most of the threats involve a geographically localized destruction or disabling of the data center. This means you need off-site recovery, and it probably needs to be fast. That means you probably need more than recovery, you need one or more hot sites that can take over the load of one that fails. This is extremely expensive for a "normal" data center. I can hardly imagine how much it would cost for a 50+ megawatt facility. Basically what we have is too many eggs in one basket, with economies of scale pushing us to keep on putting more eggs in the same basket.

What surprises me is that Subodh Bapat didn't say, oh, well, Sun has the solution, considering that Jonathan Schwartz put it forward over a year ago. Ok, well, he didn't exactly suggest Blackbox as a means of easily distributing data centers. But think about it. If you're a large corporation you are probably already geographically distributed. If you expand your data center in cargo-container sized units across the nation (or world), you are half..err..maybe one quarter of the way there. You still have to figure out how to make them be hot sites for each other, or at least recovery sites. But at least losses at any given site would be minimized.


Sunday, December 09, 2007

Why isn't enterprise software sexy?

Robert Scoble asks: Why isn't enterprise software sexy?

A couple of the Enterprise Irregulars and several commenters respond: because it's supposed to be reliable and effective, not sexy.

I think the reasons are more fundamental than that. Consider:

  1. Most enterprise software customers have mutual non-disclosure agreements with the software vendors.
  2. Most bloggers have day jobs.
  3. Most companies do not want their employees violating the terms of contracts.
  4. Most companies do not want their employees airing the company's dirty laundry in a public forum.
  5. Many of the most interesting pieces of information surrounding enterprise software involve dirty laundry.

Personally, I have two ground rules for blogging:

  1. Keep it professional. No personal matters, politics, etc.
  2. Keep my employer, coworkers, and my employer's customers and suppliers out of it.

Let's face it. Most IT and software development projects don't go right. The commercial software that you use doesn't perform half as well as the salesman said it would. Consultants have mixed motives, and more importantly aren't miracle workers. The internal team is often overextended and working outside its core area of expertise. Goals are unclear, completely undefined, or changing on a regular basis. Politics are everywhere on all sides.

This is the reality of business, and talking about it in the general case is OK. People read it, nod their heads, and think "Been there, done that, doing it again right now." But specifics are quite different. Specifics can lead to embarrassment, contract violations, and lost sales.

More people don't blog about enterprise software because it strikes too close to home. I don't think it has anything to do with whether enterprise software is sexy or not.


It just occurred to me that I probably wasn't very clear here on a few points. What I mean by "enterprise software" is enterprise application software, like SAP, Oracle's various applications, PeopleSoft, etc. I don't mean infrastructure software like DBMSes, application servers, operating systems, etc. There is plenty of good information freely available on infrastructure, and people blog about it all the time.

Also, if you look at a lot of the blogs that are out there for, for example, SAP, they are almost all too high level to be particularly useful. It's pretty easy to find platitudes about how you need the right team, buy-in, executive sponsorship, etc., and how those (or a lack of them) caused an implementation to succeed or fail. That's all (for the most part) true, but everyone already knows it. What there aren't a lot of (that I know of; please post links in the comments if I am wrong) are people posting technical discussions about implementations. Google for "ABAP programming blog" (ABAP is the programming language for SAP), and then do the same for Scala and Lisp. I bet there are more people earning their living off of ABAP than Scala and Lisp combined. Probably an order of magnitude more. So why aren't there more people writing about it? Ok, so Scala and Lisp are both interesting languages. So do the same for Ada and COBOL.

Update: December 10, 2007: Feedback on enterprise software

Enterprise software does receive feedback. Often it's a significant amount, in forms much better thought out than most blogs. The difference is that the feedback is private between the customer and the software provider. Occasionally it is shared among customers through conferences or other types of "user group" meetings.

If the software supplier will listen (including acting), then this can work fairly well for software that is already deployed. The problem is that there is no solid information available to support purchasing decisions. There are no readily available sources of "tips and tricks" or "common pitfalls" for deployment or customization efforts. For example, someone could write:

We've spent the last 3 months trying to make Foomatic run on top of Oracle. All the sales material touts the Oracle support. But then your deployment tanks, the consultants come in, and they ask you why the heck you aren't using SQL Server. The software is developed against SQL Server and then "ported" to Oracle, and the Oracle port never works right.

Fill in your favorite or least favorite infrastructure products there, the name of an enterprise application, and there you have some useful information. The sales material lies. Once you deviate from their default stack you are screwed. That's the type of information that would be useful - before buying the software or before trying to run it on your "enterprise standard." Not from an over-priced consultant.


Running Linux apps on a Mac using X

John asks:

Please can you post something about how you set up and use X to run your linux apps from your mac? I've been using X to login and use my linux desktop on my mac laptop for a few things but what I would really love to know how to do is to run individual programs which I cannot get running on my mac or haven't got room for.
Certainly! I'll start from the beginning. The Linux distribution I'm using is Fedora Core 6. The instructions should be similar for any other Unix variant. I used the GUI administration screens to do the configuration (I know - LAME!) so that's what I'm going to use here.

1. Make sure your Linux box has a static IP address.

This step largely depends on your network setup, and technically isn't 100% necessary. In order to access your Linux box, you must either know its IP address or have a domain name for it that will be translated into an IP address. If you are on a corporate network there's a good chance that there's already a DNS entry for your Linux box so you just need to know its name. Here's what I did for my home network.

1.1 Log in to your router and reserve a range of IP addresses so that they will not be assigned by DHCP.

Please refer to your router's manual for details on how to do this. Depending on your router you may be able to assign a static IP address explicitly to your Linux box based on its MAC address. I could not.

1.2 Configure your Linux box to use a static IP address.

Launch the network config utility. This will require the root password. Edit your ethernet adapter and configure it for a static IP address instead of DHCP. Choose an IP address within the range that you reserved in your router, or one that your network admin has provided.

2. Enable sshd on your Linux box.

You can do this using the "Services" applet or by editing your /etc/rcN.d (where N is the desired runlevel) scripts. Enable the service and start it. Notice that I am at runlevel 3. This is because I'm now running my Linux box in headless mode, so I just want it to boot up to a console login prompt, not a graphical prompt. If your Linux box boots to a graphical prompt, and you intend to keep it that way, then you want runlevel 5.
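For reference, the non-GUI route on a Red Hat style system looks roughly like this (run as root; a sketch of the FC6-era commands, your distribution may differ):

```shell
# Enable sshd at boot for runlevels 3 and 5, then start it immediately.
chkconfig --level 35 sshd on
service sshd start
# Verify it took:
chkconfig --list sshd
```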

3. Setup a DNS entry on your Mac.

As root, edit your /etc/hosts file and add an entry for your Linux box. So open a terminal window and type:
  cd /etc
  emacs hosts
The line for "gondolin" is the entry that I added. Save the file by hitting Control-X Control-S, then exit emacs by hitting Control-X Control-C.
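The added line looks something like this (the address is a placeholder; use the static IP you configured in step 1.2):

```shell
# /etc/hosts on the Mac -- map the Linux box's static IP to a name
192.168.1.100   gondolin
```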

4. On your Mac, launch X and ssh as follows to your Linux box.

  ssh -X gondolin
The "-X" flag tells ssh to automatically setup your DISPLAY environment variable. Once you are logged in, simply type the name of the application that you want to run, followed by an ampersand.

5. Be careful about some applications.

Most applications work fine when started from the command line, but starting the full KDE or GNOME desktop sessions does not work properly for me. I'm pretty sure that's because on my Linux box they are configured to run locally. There are some applications, such as Nautilus and Konqueror, that are set up (at least on my FC6 box) to plug themselves into the desktop manager session. This results in all sorts of weirdness when there is no KDE or GNOME session running. Nautilus tries to start a GNOME session, which brings up an unusable desktop, and Konqueror just acts weird. However, Nautilus works fine if you pass it the "--no-desktop" command-line option. The following opens Nautilus in your home directory.
  nautilus --no-desktop ~/
That's it! Feel free to post any questions.


Thursday, December 06, 2007

Adventures in Widefinding: Performance

I've done some basic performance tests against:

  1. Scala Actor-based parallel Widefinder
  2. Scala serial Widefinder using a BufferedReader
  3. Tim Bray's Ruby Widefinder
The test platform was my 2.2 GHz MacBook with 4GB of RAM using a 6 million line file.  The times were as follows:
Scala Parallel:
  • real 0m14.588s
  • user 0m24.541s
  • sys 0m1.383s
Scala Serial:
  • real 0m20.095s
  • user 0m18.821s
  • sys 0m1.441s
Ruby:
  • real 0m14.301s
  • user 0m12.485s
  • sys 0m1.813s
The good news is that the parallel Scala version is noticeably faster than the serial version.  The bad news is that it is roughly the same speed as the Ruby version, and takes significantly more CPU.  The Scala versions are doing substantially more work, because they have to transcode the 8-bit ASCII contained in the input file into 16-bit Unicode strings.  This requires a full scan of the data.  I believe a fair amount of performance could be gained by combining the transcoding pass over the input with the line-splitting pass.
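The combined pass suggested above could look something like this sketch (a hypothetical helper, not the actual widefinder code): scan the raw bytes for newlines, and only transcode each line once, instead of transcoding the whole buffer and then splitting it.

```scala
// Find line boundaries in the raw bytes, transcoding each line exactly once.
def splitLines(buf: Array[Byte]): List[String] = {
  var lines: List[String] = Nil
  var start = 0
  var i = 0
  while (i < buf.length) {
    if (buf(i) == '\n') {
      lines = new String(buf, start, i - start, "ISO-8859-1") :: lines
      start = i + 1
    }
    i += 1
  }
  lines.reverse
}
```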
For those that are curious, the source code for the parallel widefinder is available here: the parallel IO code and the actual widefinder code.


Wednesday, December 05, 2007

Adventures in Widefinding: Complexity

For the moment I'm going to ignore the complexity of my widefinder "under the hood," and just focus on the complexity of it at the top level.   The following is Tim Bray's goal.  The red asterisk indicates that the loop should be processed in parallel instead of in serial.

counts = {}
counts.default = 0

ARGF.each_line* do |line|
  if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
    counts[$1] += 1
  end
end

keys_by_count = counts.keys.sort { |a, b| counts[b] <=> counts[a] }
keys_by_count[0 .. 9].each do |key|
  puts "#{counts[key]}: #{key}"
end
So here's the first part of my Scala solution:
   val pat = Pattern.compile("GET /ongoing/When/\\d\\d\\dx/(\\d\\d\\d\\d/\\d\\d/\\d\\d/[^ .]+) ")
   val map = new ConcurrentHashMap[String, AtomicInteger]()
   for (file <- args; line <- LineReader(file)) {
     val mat = pat.matcher(line)
     if (mat.find()) {
       val s = mat.group(1)
       map.putIfAbsent(s, new AtomicInteger(0))
       map.get(s).incrementAndGet()
     }
   }

Notice that this is not a map-reduce based solution. This is because I tried to make it look as similar to the original Ruby as possible. There are some minor differences due to the fact that regular expressions are not special in Scala the way they are in Ruby. The big, and in my opinion most intrusive difference, is that a ConcurrentHashMap has to be used to store the results, AtomicIntegers must be used for the counts, and that weird "putIfAbsent" call needs to be used for pulling counts out of the map.
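The idiom can be isolated into a small helper, which makes the racy corner explicit (my own sketch, not part of the original code): putIfAbsent returns the previous value, or null if the key was absent, so the AtomicInteger actually stored in the map is always the one that gets incremented, even when two threads race on a new key.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger

// Thread-safe "count occurrences" against a ConcurrentHashMap.
def bump(counts: ConcurrentHashMap[String, AtomicInteger], key: String): Int = {
  val fresh = new AtomicInteger(0)
  val existing = counts.putIfAbsent(key, fresh)   // null if fresh was inserted
  val counter = if (existing == null) fresh else existing
  counter.incrementAndGet()
}
```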

So while the concurrency itself is implicit, and the code is certainly succinct, the programmer still needs to be very much aware of the fact that the block inside the for loop will be executed in a concurrent fashion. This is certainly less elegant than a pure functional approach such as map-reduce, where no synchronization would be necessary, but I also think it is less intrusive to a programmer accustomed to imperative programming.

Here's an ugly part. This is largely due to an impedance mismatch between Java collections and Scala collections. It would look a lot better if Scala had an equivalent to ConcurrentHashMap in its library. There's also probably lots of other ways to make it cleaner. I tried several different approaches and always ran into some problem.

    val array = new Array[(String, Int)](map.size)
    val iter = map.entrySet.iterator
    var i = 0
    while (iter.hasNext) {
      val e = iter.next
      array(i) = (e.getKey, e.getValue.intValue)
      i += 1
    }
And here's the final part, once the ugliness is done:
    quickSort(array)((e) => OrderedTuple(e))
    val limit = if (array.length < 10) array.length - 1 else 9
    for (i <- 0 to limit) {
      print(array(i)._2)
      print(": ")
      println(array(i)._1)
    }

Notice that having less than ten results is cleanly handled, unlike in TB's code (unless Ruby magically ignores invalid indices).

Overall I would say I came pretty close to TB's goal without requiring a fundamental change in programming methodology. I didn't alleviate the need for the programmer to be aware of concurrency, but I don't think that's possible. Sure, people can learn to program in ways that are fundamentally parallelizable, and that probably is what universities should start teaching, but that's going to require a major shift on the part of millions of programmers, or for several generations to retire and be replaced by ones trained in functional programming. Lots of cores are here NOW, so it's not practical to wait until skills catch up. We need to work with what we have.


Monday, December 03, 2007


After about two years of running Linux-only (Fedora, to be exact) at home I've switched to a Mac.  I wanted a laptop, needed a new computer desperately (the old one is almost 6 years old), didn't want Windows, and was tired of messing with Linux configuration stuff.

This weekend I bought a Macbook.  So far I love it.  I was hesitant about the keyboard, which is odd, but I actually like it.  I was nervous about using a touchpad instead of a pointer stick, but somehow the touchpad on the Macbook works substantially better than the one on the HP I use at work.  I was worried about quality problems...we'll see about that one.  I just read about some Macbooks and Macbook Pros having defective hard drives, but the model number on mine is different.
So what have I done so far?
  1. Booted the first time.  Wireless worked flawlessly on the first try.  I've spent way too much time making Windows PCs (my wife's, my work laptop, visitors' machines) work with my wireless network.  One of my friends tried to make his Linux laptop work with it and after many hours never succeeded.
  2. Installed Eclipse and the Scala plugin, played around with that.
  3. Installed AquaMacs - IMHO much better than Emacs on Linux.
  4. Surfed the net, poked around, reconfigured some stuff to my liking, etc.
  5. Setup NFS mounts to my old Linux box. This was amazingly painless on both sides.  The only hiccup was that I had to change my UID on my Mac from 501 to 500, which created some oddities before I logged out and back in, but nothing horrible.  The whole process probably took about 15-30 minutes, compared to the hours I spent making Samba work properly with my wife's Windows XP Thinkpad.
  6. Setup sshd on my Linux box and then ran XWindows applications on my Linux box from my Mac.  This whole process took about 10 minutes, and the XWindows apps actually looked good and were very responsive.  Possibly more responsive than running them locally.  Note that this and the above are a tribute to both Fedora and MacOS X.
  7. Upgraded my RAM from the stock 1GB to 4GB.
  8. Ran my Scala widefinder from Eclipse.  It spikes both CPUs once everything is cached.  Puts both at about half if it has to read from disk.
  9. Wrote this blog.
So all that is very positive.  And the UI is so beautiful.  The fonts all look just right...although if I didn't have good eyes I think a lot of them might be too small.
Ok, so what's wrong:
  1. I haven't figured out the keyboard shortcut to switch between windows of an open application.  Command-Tab switches between applications.
  2. No page down, page up, home, end, etc keys on the laptop keyboard.
  3. The caps lock key is WAY too big.  Key sizes should be proportionate to usefulness/frequency of use.  Spacebar is huge.  Enter and delete are big.  Caps lock being the size of enter makes no sense.
  4. I had to install ~200MB worth of updates fresh out of the box...over half of that for Leopard.  Why can't a new computer come all up-to-date?  Seems like that "genius" who sold it to me could have given me one that was up-to-date.
  5. After the reboot for the Leopard updates, everything froze.  Just a blue-grey screen with no cursor.  That's a bad first impression.  But I turned it off and back on again and everything was fine.
  6. I had to install codecs for various video formats.  That's mildly annoying.  Why can't they be preinstalled?  It's not like I had to pay to download and install them.  But what's more annoying is that the first link presented to me for XviD didn't seem to have any appropriate install files for Mac.  I used DivX instead, which worked.
So overall I am very impressed.  I think I'm going to keep my old Linux box around as a server and to run Linux-only stuff over X.  If anything goes wrong I'll be sure to complain about it.
