Wednesday, April 09, 2008

Multiprocess versus Multithreaded...

...or why Java infects Unix with the Windows mindset.

Recently Paul Murphy, the king of the Sun zealots, blogged about Java bringing the Windows mentality to Unix, all the while slamming Java. In response, John Carrol, a Microsoft employee, rose to the defense of Sun's self-declared crown jewel. Talk about weird.

The funny thing is they are both right, although Murph's arguments are pretty weak.

A little history

Unix and Windows evolved with very different definitions of what the primary unit of isolation should be. On Windows, it is (or was) the node. Each Windows user (and DOS user before him) occupied exactly one node. The worst that could happen was that the user destroyed his own workspace, so interactive performance reigned supreme over system integrity. You have a node. You have a user. The node does what the user wants as fast as it can. Initially this applied to running a single application at a time, then to allowing several to be open at once but with the one in the foreground receiving primary resources, and finally to allowing several applications to run simultaneously. Multithreading was king because it had lower overhead and focused on making that foreground process more responsive. Threads were optimized, while processes were neglected.

Unix evolved to be fundamentally multiuser, and its primary unit of isolation is the process. Unix systems were intended to be shared, so it was important that one user could not dominate another. Furthermore, a slew of processes (daemons) all ran under the same account while providing services to multiple users, so for users to share the machine, processes had to share it too. Unlike on Windows, one process crashing the entire system was not acceptable, because that would destroy multiple users' data. As a result, processes were designed to provide strong isolation and were heavily optimized to make sure people used them. Threads were largely ignored, or simply treated as processes with a shared heap space, because several cheap processes could simply be chained together to accomplish the same thing in a simpler manner.

The Unix Way

I want you to consider good old-fashioned CGI programs for a moment. Imagine one written in C. First, you may think "Oh my God, running a web application in a non-managed environment. The resource leaks! The memory leaks! The memory consumption of all those processes! Oh the horror!" Of course, you would be wrong. Repeatedly launching and terminating a Unix process is dirt cheap, especially for a simple program written in C. The OS will cache an image of the executable in memory, which can be shared among invocations. The individual process can leak all the resources it wants, because as soon as it terminates all the resources will be automatically freed by the OS, no matter how incompetent the programmer. If the process fails to terminate, your friendly neighborhood sysadmin can kill it without hurting any other process.

This method works for producing super-available applications despite incredibly crappy code. I've seen it, both in the form of CGI and in the form of much more sophisticated applications. It works. Users get upset about lost transactions, but the application as a whole almost never goes down.
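To make the one-process-per-request idea concrete, here is a minimal sketch. It is written in Python rather than C and is not real CGI; the names HANDLER, handle_request and serve are made up for illustration. Each request runs in its own short-lived process that allocates memory it never frees and sometimes crashes, yet the loop serving the requests carries on.

```python
import subprocess
import sys

# Hypothetical "CGI-style" handler: it allocates memory it never frees, then
# either finishes normally or crashes, depending on its argument.
HANDLER = """
import os, sys
leak = bytearray(10 * 1024 * 1024)  # never released by the program itself
if sys.argv[1] == "crash":
    os.abort()                      # simulate a fatal bug
print("handled " + sys.argv[1])
"""

def handle_request(request):
    """Run one request in its own process; return (exit code, output)."""
    result = subprocess.run(
        [sys.executable, "-c", HANDLER, request],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout.strip()

def serve(requests):
    """Serve each request in a fresh process; one failure does not stop the rest."""
    outcomes = []
    for request in requests:
        code, output = handle_request(request)
        outcomes.append(output if code == 0 else "failed " + request)
    return outcomes

if __name__ == "__main__":
    print(serve(["a", "crash", "b"]))
```

The leaked buffer goes back to the OS when each child exits, and the crashed request costs exactly one transaction.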

Enter Java

Java took cheap Unix processes and made them expensive. To compensate, it provided primitives for multithreading. It provided a garbage collector to at least slow memory leaks. It turned all those transient application processes into one big JVM process not only serving all the transactions for a given user, but serving all the transactions for an entire application or even multiple applications. Java made it more difficult to make destructive program errors, but it also made the consequences much more severe. Your friendly neighborhood sysadmin is powerless against a runaway thread or a slow memory leak. All he can do is kill the process, bumping out all of the users, killing all of their sessions.
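The sysadmin's problem can be sketched in a few lines. This uses Python as a stand-in for the JVM, so treat it as an analogy rather than a statement about Java's APIs; runaway and demo are hypothetical names. A runaway thread has no outside handle that lets you stop just that thread, while a runaway process can be killed without touching anything else.

```python
import subprocess
import sys
import threading
import time

def runaway():
    """Simulate a handler stuck in an endless loop."""
    while True:
        time.sleep(0.01)

def demo():
    # As a thread: Python's threading module has no call to forcibly stop it.
    # The only way to get rid of it is to end the whole process.
    thread = threading.Thread(target=runaway, daemon=True)
    thread.start()
    thread.join(timeout=0.2)           # give up waiting after 0.2 seconds
    thread_still_running = thread.is_alive()

    # As a process: it can be killed on its own, leaving everything else alone.
    process = subprocess.Popen([sys.executable, "-c", "while True: pass"])
    process.kill()
    process.wait(timeout=5)
    process_still_running = process.poll() is None

    return thread_still_running, process_still_running

if __name__ == "__main__":
    print(demo())
```

The thread is marked as a daemon only so that the demo itself can exit; in a long-running server it would keep burning CPU until the whole process was restarted.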

It's so bad, the process might as well be a node. Unix becomes Windows. The JVM is practically an operating system, but without all of the features of an operating system and a whole lot less mature.

Enter Java Frameworks

This is really what Murph was railing against, although he didn't name it and he conflated it with the core language by labeling it "Business Java." Frameworks evolved for a myriad of reasons, which are often summarized as "taking care of the plumbing so the developer can focus on the business logic." The "plumbing" is a lot of things, including managing certain resources and generally ensuring the application code executes within a well-defined life cycle where it is unlikely to do damage. In other words, instead of giving the user a simple, uniform mechanism like a process to protect the world from his mistakes, he is given dozens of hooks where he can implement little snippets of focused and hopefully bug-free functionality. All this involves a lot of learning above and beyond "the Java you learned in school" (meaning the core language and libraries), putting a cognitive load on the programmer and additional runtime load on the machine.

Multiprocess versus Multithreaded

Most Unixes have evolved efficient threading, and Windows has come a long way in becoming a multiprocess, multiuser environment. Consequently, developers need to be able to intelligently decide when to use multiple processes, when to use multiple threads, and when to use a hybrid approach. For example, Apache httpd has for quite a while now used a hybrid approach. On one hand, on most operating systems threads involve less overhead than processes, so it is more efficient to use multiple threads than multiple processes. On the other hand, multiple processes ultimately will give you better reliability because they can be spawned and killed independently of one another, so making a system that can run for months without stopping doesn't require writing a program that will run for months without stopping.
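That last point is easy to sketch: a small supervisor keeps the system up even though the worker it runs cannot stay up. This is an illustration in Python under made-up assumptions, not how Apache httpd does it; WORKER and supervise are hypothetical names, and the worker deliberately dies on a job called "bad".

```python
import subprocess
import sys

# Hypothetical worker: handles the jobs named on its command line in order,
# but dies on a job called "bad" without finishing the rest.
WORKER = """
import sys
for job in sys.argv[1:]:
    if job == "bad":
        sys.exit(1)
    print(job, flush=True)
"""

def supervise(jobs):
    """Restart the worker until every job has been handled or dropped."""
    pending = list(jobs)
    done, restarts = [], 0
    while pending:
        result = subprocess.run(
            [sys.executable, "-c", WORKER, *pending],
            capture_output=True,
            text=True,
        )
        finished = result.stdout.split()   # assumes job names have no spaces
        done.extend(finished)
        pending = pending[len(finished):]
        if result.returncode != 0:
            pending = pending[1:]          # drop the job that killed the worker
            restarts += 1
    return done, restarts

if __name__ == "__main__":
    print(supervise(["a", "bad", "b", "c"]))
```

The supervisor is a handful of lines that are easy to get right, so the part that has to run for months is tiny, and the part that is likely to be buggy only has to survive until its next restart.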

So how do you choose? My rule of thumb is to look at the amount of shared data or messaging required between concurrent execution paths and balance against how long the "process" (not OS process) is expected to live. Execution paths with lots of shared data or that are chatty will benefit from the lower overhead of threading, and threading allows you to avoid the complexities of shared memory or IPC. Of course, multiprocessing allows you to avoid the complexities of threading APIs, and there are libraries to address both, so the complexity issue could be a wash depending on your previous experience.
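Here is a rough sketch of the trade-off, again in Python with hypothetical names (count_with_threads, count_with_processes, CHILD). The threaded version shares one variable directly and pays for it with a lock; the multiprocess version shares nothing, so every result has to travel back over a pipe.

```python
import subprocess
import sys
import threading

def count_with_threads(workers, increments):
    """Threads update one shared counter in place, guarded by a lock."""
    total = 0
    lock = threading.Lock()

    def work():
        nonlocal total
        for _ in range(increments):
            with lock:                 # shared data needs synchronization
                total += 1

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return total

# Hypothetical child program: counts on its own and reports once on stdout.
CHILD = "import sys; print(sum(1 for _ in range(int(sys.argv[1]))))"

def count_with_processes(workers, increments):
    """Processes share nothing; each sends its result back through a pipe."""
    children = [
        subprocess.Popen(
            [sys.executable, "-c", CHILD, str(increments)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in range(workers)
    ]
    return sum(int(child.communicate()[0]) for child in children)

if __name__ == "__main__":
    print(count_with_threads(4, 1000), count_with_processes(4, 1000))
```

With one message per worker the pipe is no burden at all. If the workers had to exchange a message per increment, the process version would get chatty fast, which is the case where threads earn their keep.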

So why is Murph so wrong? Is JC right?

I think Murph wants to divide the world along nice clean lines. System programmers program in C. They don't need the hand-holding of managed runtimes or languages that treat them like impudent children. They do need lots of flexibility and lots of rope. Application programmers, on the other hand, need high-level abstractions that are close to the business domain that they are addressing. They need to be able to rapidly build software and rapidly change it as requirements evolve. They don't need lots of flexibility and should stay away from low-level details. So, in Murph's eyes, the problem with Java is that it doesn't do either particularly well. The managed runtime and object-orientation get in the system programmer's way, while the general-purpose nature of the language and mish-mash of libraries and frameworks just confuse application developers, or rather distract them from their true purpose. System programmers need C. Application developers need 4GLs.

The fatal flaw in Murph's reasoning is that it ignores the in-between. What happens when the systems programmer or 4GL creator fails to provide the right abstraction for the application developer? He's stuck, that's what happens. Software development is as much about creating abstractions as using them. Consequently, application developers need general-purpose languages.


Jeff said...

Consequently, application developers need general-purpose languages.

Hence lisp.

Larry Clapp said...

... Java bringing the Windows mentality to Windows

Run that by me again?

Panic said...

The author of this post must calm down a bit and take a long nap...

The post subject seemed to be interesting enough to make me select it for a read but infuriated me enough to quit reading and make this post.

Please take the time to prof read your texts and see if they make sense.

..."about Java bringing the Windows mentality to Windows"...

..."so in order for users to share processes must share."

I have quited reading after that second one... Hope you can do better next time, and take some time to give consideration to the readers.

Erik Engbrecht said...

re: Jeff
Yes, good old Lisp. The only problem with Lisp is it takes the "general" in general-purpose language farther than most can grok.

re: Larry and panic
Ok, apparently a lot more people are reading this than I expected. I admit I didn't proofread, and this is basically stream-of-consciousness writing. Revisions will follow.

Anonymous said...

I remember writing servers back in the mid-90s using pthreads and c on the major Unix platforms. Threads are not a Java only thing, they were already well on their way - Java just made it easier.

Greg said...

So the question becomes: How do you make a language that is actually good for both system developers and application developers? Does it exist, really? Java isn't it, obviously.

I guess I can see a low-level language with good support for macros and other ways to transform into a DSL, and a basic syntax that doesn't bind it to being useful only for systems work.

Also, FWIW, I had no trouble interpreting your typos :)

Anonymous said...

"One one hand on most operating systems threads involve less overhead than processes,"

More proof-reading help needed.

Larry Clapp said...

apparently a lot more people are reading this than I expected

You were Reddited, didn't you know?

George said...

I have to agree. By using multithreading within a single process, we've thrown away or reinvented most of an operating system. For example, we no longer have any memory protection. A dependency injection framework is just a poor man's linker. It's sad to have to worry about whether variables should be thread-local.

Erik Engbrecht said...

The point isn't that Java introduced threads to Unix, because it certainly didn't. Likewise, I believe Windows has always had processes.

But Java contributed to making threads more pervasive in Unix programming, and increased the importance of their performance. For example, look at the addition of NPTL to Linux. Prior to the introduction of NPTL with the 2.6 kernel, Linux threads were more-or-less Linux processes, with all the associated weight. IIRC HP-UX 10.2 was the same way, but 11.0 added "proper" threads (I may be mistaken on this).

A huge chunk of today's server workload is enterprise Java, and achieving high performance with enterprise Java requires really efficient threads.

Nuno said...

Of course, you would be wrong. Repeating launching and terminating a Unix process is dirt cheap. Especially a simple program written in C.

Wrong...! sure, invoking a C CGI a few times a second is not of particular concern, invoking it thousands of times a second is a different proposition. That is why CGI as a fork() model died out eventually. It simply does not scale well.

Your whole proposition is based on a false assumption.

Anonymous said...

Having done some POSIX threads work in C on various Unixes, then Linux from the late 90's, I'd say that using processes is the safest thing but forking lots of child processes loads the system to a point it becomes a PITA. Creating and destroying lots of threads also is problematic and could lead to resources loss. If I had to design something aiming at performance I'd start with a predefined thread pool, then would assign work to threads, putting them in a wait state when done, until the next job.
If thread programming hasn't changed in the last 7-8 years it should still be the way to go.

Anonymous said...

"I want you to consider good old-fashioned CGI programs for a moment. Imagine one written in C. First, you may think "Oh my God, running a web application in a non-managed environment. The resource leaks! The memory leaks! The memory consumption of all those processes! Oh the horror!." Of course, you would be wrong. Repeating launching and terminating a Unix process is dirt cheap. Especially a simple program written in C."


Fork() and exec() are very expensive operations. Wanna bring your webserver to its knees? Try replacing mod_php/perl/whatever with old-fashioned CGI execution. (No, you can't use FastCGI--that's too much like "Windows thinking".)

z0ltan said...

Excellent blog mate. Really the best article I have yet come across which explains the differences between threads and processes so lucidly clear.

Will be catching up on your older blogs mate. Keep up the good work!

Anonymous said...

Dear `panic', `quited' is not a word. If you're going to bitch about grammar... well, sort yourself out first.

Anyway, good read, cheers.

Anonymous said...

manual trackback.

I think one also needs to think about the footprint size of lots of processes, each running their own instance of large, complex libraries. I think there are solutions for that, but they might require an extension of programming techniques or style (see post).

bungle said...

What about processors and cores? Isn't single process always allocated on single core / processor? Threads on the other hand can execute their code on the processor / core that is currently available (and can even change core/processor during context switches)?

Anonymous said...

@bungle: That's what I was referring to in my post - you'll need a single process instance for each concurrently running user, which potentially sucks due to the footprint.

If your requests are short lived, this isn't as much of a problem, as you can simply handle all those requests sequentially in a queue.

But if you have a different scenario, maybe even something like a Comet style application with connections to thousands of users concurrently, processes won't work (and synchronous threads have problems too, you'll need asynchronous processing of events by threads).

Anonymous said...

You need to work on your history, and consider the design constraints. Java was designed to be multi-platform, not just at the source level, but at the bytecode level. This is a huge win, as it lets techies develop on UNIX machines, while managers and users run MSWindows 3.1; the embedded world was another potential target as well, and embedded OSes have even more variety in supported features.

The problem of a gazillion Frameworks in Java is due, I'm convinced, of Java hitting a sweet spot. A competent developer practically has a framework drop out of their codebase when they solve a moderately complicated problem: there's just enough introspection to make it useful, but not so much the programmer gets caught up in the navel-gazing maintenance nightmare of extreme metaprogramming and/or self-modifying code.

Plus, Sun tried to get the big vendors on board, because it's better to have them backing your product than fighting it. What can these big vendors sell, now that Java's so easy? Why, Frameworks, of course. The bigger the better, as complicated frameworks sell training and support contracts.

Without that backing, the language would likely go nowhere... there's C++ that was fulfilling the OOP mindspace, and an incentive to move away from C++ is a good thing, no matter how stupid such an incentive might seem at first.

So it's best to compare Java with C++, not C.

Alexandre Dulaunoy said...

Usually, you have three options : multi-processes, threading or events. I have a preference for events... A good summary about events versus threads. A lot of rock solid application are using events without the complexity and overhead of threading libraries.

James said...

A couple of commentators have been very excited to say "wrong!" about process creation efficiency in the CGI example. Rather than dismiss the illustration, readers should envision solutions where the process is started in a pool and handles requests until it detects an internal error. Then it exits and restarts. The reliability of such a system turns out to be very high, which is the point the author was making.
