Monday, December 31, 2007

Double (Im)precision

Most computer users have, at one time or another, received an odd result from a spreadsheet or other program that performs calculations. Most programmers know that this is because of the impedance mismatch between the most common way for computers to "think" about numbers (base 2) and the way most people and businesses think about numbers (base 10). This issue receives a fair amount of attention, probably because all of us (we being programmers) have had to explain to users why a piece of software can't seem to do arithmetic properly. If you're scratching your head right now, or want to know more, I suggest reading this FAQ about decimal arithmetic.

However, as I've long known but rarely contemplated, there is another form of floating point error that can cause significant problems. Standard floating point numbers store only so many digits of precision. The result is that if you add a very large number to a very small number, the very small number simply vanishes. Let me demonstrate:

scala> 10000000000000000.0 + 1.0
res62: Double = 1.0E16
scala> 

The 1.0 simply vanished, because standard double-precision floating point numbers don't have enough digits to store the 1.0 part of such a large number. Doubles are an approximation, and that's fine, because often we're approximating things anyway, and the roughly 15 decimal digits of precision provided by a double are plenty, right?

Well, actually, no. It depends on what you're doing with them. A month or so ago my father convinced me that instead of spending my free time playing with AI problems and studying fringe programming languages, I should do something useful like predict stock price movements. Of course I can use AI and fringe programming languages to do this...

The first thing I decided to do was to smooth out stock price history and make it continuous by fitting a polynomial. So I wrote a quick polynomial class, grabbed Jama (a Java matrix library), downloaded some data, wrote a function to search for the appropriate degree, and this is what I got (supposedly with about 4% error):

Hmmm...that looks kind of funny. The blue line is the actual price. The red line is the polynomial, which has a degree of 44. That's a rather large degree, but certainly not enough to generate all those squiggles. Those squiggles are an artifact of double precision numbers not being precise enough to be used in calculating a degree-44 polynomial. They don't work that well for the matrix calculations that produce the polynomial, either.
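For the curious, the fit itself boils down to a least-squares solve against a Vandermonde matrix. Here is a minimal sketch of that step using Jama (my actual polynomial class and degree search aren't shown, and fitPolynomial is just an illustrative name):

  import Jama.Matrix

  // Least-squares fit of a degree-d polynomial to the points (xs(i), ys(i)).
  // Build the Vandermonde matrix A with A(i, j) = xs(i)^j; for a rectangular A,
  // Jama's solve() returns the least-squares solution via QR decomposition.
  def fitPolynomial(xs: Array[Double], ys: Array[Double], degree: Int): Array[Double] = {
    val a = new Matrix(xs.length, degree + 1)
    for (i <- 0 until xs.length; j <- 0 to degree)
      a.set(i, j, Math.pow(xs(i), j))
    val b = new Matrix(ys, ys.length)  // column vector of observed prices
    a.solve(b).getColumnPackedCopy     // coefficients, constant term first
  }

Every double in that computation carries only about 15 significant digits, and for a degree-44 polynomial the powers of x span an enormous dynamic range, which is exactly where that precision runs out.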

I think that red squiggly line is a very good illustration of what happens when you don't pay enough attention to how your software works under the hood. Anyway, here's what the results look like using floating point numbers (courtesy of JScience) with 80 digits of precision (keeping the error at about 4%).

Looks a lot better, doesn't it? One interesting thing is that the resulting polynomial has a much, much lower degree than what was produced by the same algorithm using double precision numbers. With double precision numbers, 4% error was about as good as it could get with this dataset. However, using the higher-precision numbers it can go much lower. Below is a graph at about 2% error:

At this point you might be wondering why anyone would use double precision numbers for this type of thing, or perhaps thinking I'm a moron for trying. I certainly felt a little stupid. But using doubles is relatively common. Matlab and everyone's favorite Excel use doubles. Many of the other toolkits and libraries that I found do as well.

Overall, either I'm missing something (the use of doubles is so common), or this is very frightening. I'm not an expert in this area. I know that doubles use much, much less memory and far fewer computational resources. They are also "just there" in pretty much all programming environments, so they are the default. Making arbitrary precision floating point work right wasn't a cakewalk, either. I spent the better part of a day tracing through my code, only to discover a bug in JScience. Also, once you have everything working, you have to figure out what precision to use. The default of 20 digits of precision in JScience wasn't noticeably better than regular doubles.
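To make the difference concrete, here is the opening example again with an exact representation. This uses plain java.math.BigDecimal (whose add method is exact by default) rather than JScience, but it illustrates the same point: the small addend survives.

scala> new java.math.BigDecimal("10000000000000000.0").add(new java.math.BigDecimal("1.0"))
res63: java.math.BigDecimal = 10000000000000001.0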


Tuesday, December 11, 2007

Data Center Meltdown

Denial and the coming “data meltdown” by ZDNet's Michael Krigsman -- Subodh Bapat, Sun Microsystems eco-computing vice president, believes we’ll soon see the first world-class data center meltdown. According to News.com: “You’ll see a massive failure in a year,” Bapat said at a dinner with reporters on Monday. “We are going to see a data center failure of that scale.” “That scale” referred to the problems [...]

Now let's think about that. How can a data center fail?

  1. It can lose power for an extended period. It takes a lot of backup generators to keep a 50 megawatt data center humming.
  2. A virus or worm can shut it down.
  3. A natural disaster can destroy it or force a shutdown. Often this is due to a power failure rather than physical destruction.
  4. It can have its WAN/Internet connection(s) severed. This isn't quite catastrophic unless you're on the other side of the WAN.

Michael has pointed out that Subodh Bapat doesn't point to the expected cause of a major data center meltdown; he just says one is coming. That's because it's not really a matter of one cause. There are so many risks that, multiplied across the growing number of data centers, a major failure is bound to happen soon. We just don't know what the precise cause will be.

Most of the threats involve a geographically localized destruction or disabling of the data center. This means you need off-site recovery, and it probably needs to be fast. That means you need more than recovery: you need one or more hot sites that can take over the load of one that fails. This is extremely expensive for a "normal" data center; I can hardly imagine how much it would cost for a 50+ megawatt facility. Basically what we have is too many eggs in one basket, with economies of scale pushing us to keep putting more eggs in the same basket.

What surprises me is that Subodh Bapat didn't say: oh, well, Sun has the solution. After all, Jonathan Schwartz put it forward over a year ago. Ok, he didn't exactly suggest Blackbox as a means of easily distributing data centers. But think about it. If you're a large corporation, you are probably already geographically distributed. If you expand your data center in cargo-container sized units across the nation (or world), you are half...err...maybe one quarter of the way there. You still have to figure out how to make the units act as hot sites for each other, or at least recovery sites. But at least losses at any given site would be minimized.


Sunday, December 09, 2007

Why isn't enterprise software sexy?

Robert Scoble asks: Why isn't enterprise software sexy?

A couple of the Enterprise Irregulars and several commenters respond: because it's supposed to be reliable and effective, not sexy.

I think the reasons are more fundamental than that. Consider:

  1. Most enterprise software customers have mutual non-disclosure agreements with the software vendors.
  2. Most bloggers have day jobs.
  3. Most companies do not want their employees violating the terms of contracts.
  4. Most companies do not want their employees airing the company's dirty laundry in a public forum.
  5. Many of the most interesting pieces of information surrounding enterprise software involve dirty laundry.

Personally, I have two ground rules for blogging:

  1. Keep it professional. No personal matters, politics, etc.
  2. Keep my employer, coworkers, and my employer's customers and suppliers out of it.

Let's face it. Most IT and software development projects don't go right. The commercial software that you use doesn't perform half as well as the salesman said it would. Consultants have mixed motives and, more importantly, aren't miracle workers. The internal team is often overextended and working outside its core area of expertise. Goals are unclear, completely undefined, or changing on a regular basis. Politics are everywhere on all sides.

This is the reality of business, and talking about it in the general case is OK. People read it, nod their heads, and think, "Been there, done that, doing it again right now." But specifics are quite different. Specifics can lead to embarrassment, contract violations, and lost sales.

More people don't blog about enterprise software because it strikes too close to home. I don't think it has anything to do with whether enterprise software is sexy or not.

Update:

It just occurred to me that I probably wasn't very clear here on a few points. What I mean by "enterprise software" is enterprise application software, like SAP, Oracle's various applications, PeopleSoft, etc. I don't mean infrastructure software like DBMSes, application servers, operating systems, etc. There is plenty of good information freely available on infrastructure, and people blog about it all the time.

Also, if you look at a lot of the blogs that are out there for, say, SAP, they are almost all too high level to be particularly useful. It's pretty easy to find platitudes about how you need the right team, buy-in, executive sponsorship, etc., and how those (or a lack of them) caused an implementation to succeed or fail. That's all (for the most part) true, but everyone already knows it. What you won't find is many people (that I know of; please post links in the comments if I am wrong) posting technical discussions about implementations. Google for "ABAP programming blog" (ABAP is the programming language for SAP), and then do the same for Scala and Lisp. I bet there are more people earning their living off of ABAP than Scala and Lisp combined. Probably an order of magnitude more. So why aren't there more people writing about it? Ok, so Scala and Lisp are both interesting languages. So do the same for Ada and COBOL.

Update: December 10, 2007: Feedback on enterprise software

Enterprise software does receive feedback, often in significant amounts and in forms much better thought out than most blogs. The difference is that the feedback is private, between the customer and the software provider. Occasionally it is shared among customers through conferences or other types of "user group" meetings.

If the software supplier will listen (and act), then this can work fairly well for software that is already deployed. The problem is that there is no solid information available to support purchasing decisions. There are no readily available sources of "tips and tricks" or "common pitfalls" for deployment or customization efforts. For example, someone could write:

We've spent the last 3 months trying to make Foomatic run on top of Oracle. All the sales material touts the Oracle support. But then your deployment tanks, the consultants come in, and they ask you why the heck you aren't using SQL Server. The software is developed against SQL Server and then "ported" to Oracle, and the Oracle port never works right.

Fill in your favorite or least favorite infrastructure product there, along with the name of an enterprise application, and you have some genuinely useful information: the sales material lies, and once you deviate from the default stack you are screwed. That's the type of information that would be useful before buying the software, or before trying to run it on your "enterprise standard." Not from an overpriced consultant.


Running Linux apps on a Mac using X

John asks:

Please can you post something about how you set up and use X to run your linux apps from your mac? I've been using X to login and use my linux desktop on my mac laptop for a few things but what I would really love to know how to do is to run individual programs which I cannot get running on my mac or haven't got room for.
Certainly! I'll start from the beginning. The Linux distribution I'm using is Fedora Core 6. The instructions should be similar for any other Unix variant. I used the GUI administration screens to do the configuration (I know - LAME!), so that's what I'm going to use here.

1. Make sure your Linux box has a static IP address.

This step largely depends on your network setup, and technically isn't 100% necessary. In order to access your Linux box, you must either know its IP address or have a domain name for it that will be translated into an IP address. If you are on a corporate network there's a good chance that there's already a DNS entry for your Linux box so you just need to know its name. Here's what I did for my home network.

1.1 Log in to your router and reserve a range of IP addresses so that they will not be assigned out by DHCP.

Please refer to your router's manual for details on how to do this. Depending on your router you may be able to assign a static IP address explicitly to your Linux box based on its MAC address. I could not.

1.2 Configure your Linux box to use a static IP address.

Launch the network config utility. This will require the root password. Edit your ethernet adapter and configure it for a static IP address instead of DHCP. Choose an IP address within the range that you reserved in your router, or one that your network admin has provided.
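If you'd rather skip the GUI, on Fedora the same settings live in /etc/sysconfig/network-scripts/ifcfg-eth0. A minimal static configuration looks something like this (the addresses below are made-up examples; substitute values that fit your own network):

  DEVICE=eth0
  ONBOOT=yes
  BOOTPROTO=static
  IPADDR=192.168.1.50
  NETMASK=255.255.255.0
  GATEWAY=192.168.1.1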

2. Enable sshd on your Linux box.

You can do this using the "Services" applet or by editing your /etc/rcN.d scripts (where N is the desired runlevel). Enable the service and start it. Notice that I am at runlevel 3. This is because I'm now running my Linux box in headless mode, so I just want it to boot up to a console login prompt, not a graphical one. If your Linux box boots up to a graphical prompt, and you intend to keep it that way, then you want runlevel 5.
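If you prefer the console to the Services applet, the equivalent stock Fedora commands are (run as root):

  /sbin/chkconfig sshd on
  /sbin/service sshd start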

3. Setup a DNS entry on your Mac.

As root, edit your /etc/hosts file and add an entry for your Linux box. So open a terminal window and type:
  cd /etc
  emacs hosts
The line for "gondolin" is the entry that I added; an example is shown below. Save the file by hitting Control-X Control-S, then exit emacs by hitting Control-X Control-C.
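A hosts entry is just an IP address followed by one or more names, so the added line looks something like this (using the example address from step 1; yours will be whatever static IP you chose):

  192.168.1.50   gondolin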

4. On your Mac, launch X and ssh as follows to your Linux box.

  ssh -X gondolin
The "-X" flag tells ssh to automatically setup your DISPLAY environment variable. Once you are logged in, simply type the name of the application that you want to run, followed by an ampersand.

5. Be careful about some applications.

Most applications work fine when started from the command line, but starting a full KDE or GNOME desktop session does not work properly for me. I'm pretty sure that's because on my Linux box they are configured to run locally. There are some applications, such as Nautilus and Konqueror, that are set up (at least on my FC6 box) to plug themselves into the desktop manager session. This results in all sorts of weirdness when there is no KDE or GNOME session running. Nautilus tries to start a GNOME session, which brings up an unusable desktop, and Konqueror just acts weird. However, Nautilus works fine if you pass it the "--no-desktop" command-line option. The following opens Nautilus in your home directory:
  nautilus --no-desktop ~/
That's it! Feel free to post any questions.


Thursday, December 06, 2007

Adventures in Widefinding: Performance

I've done some basic performance tests against:

  1. Scala Actor-based parallel Widefinder
  2. Scala serial Widefinder using a BufferedReader
  3. Tim Bray's Ruby Widefinder
The test platform was my 2.2 GHz MacBook with 4GB of RAM using a 6 million line file.  The times were as follows:
Scala Parallel:
  • real 0m14.588s
  • user 0m24.541s
  • sys 0m1.383s
Scala Serial:
  • real 0m20.095s
  • user 0m18.821s
  • sys 0m1.441s
Ruby:
  • real 0m14.301s
  • user 0m12.485s
  • sys 0m1.813s
The good news is that the parallel Scala version is noticeably faster than the serial version.  The bad news is that it is roughly the same speed as the Ruby version, and takes significantly more CPU.  The Scala versions are doing substantially more work, because they have to transcode the 8-bit ASCII contained in the input file into 16-bit Unicode strings.  This requires a full scan of the data.  I believe a fair amount of performance could be gained by combining the transcoding pass over the input with the line-splitting pass.
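The idea would be to find the line boundaries in the raw bytes and decode each line directly, so the input is only scanned once. A minimal sketch of that approach (this isn't the benchmark code above, just an illustration of the idea):

  // One pass over the raw bytes: split on '\n' and decode each line
  // directly from ASCII, rather than decoding everything and then splitting.
  def asciiLines(bytes: Array[Byte]): Iterator[String] = new Iterator[String] {
    private var start = 0
    def hasNext = start < bytes.length
    def next(): String = {
      var i = start
      while (i < bytes.length && bytes(i) != '\n') i += 1
      val line = new String(bytes, start, i - start, "US-ASCII")
      start = i + 1
      line
    }
  }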
For those who are curious, the source code for the parallel widefinder is available here: the parallel IO code and the actual widefinder code.
