
Sunday, January 16, 2011

Martin Odersky's Scala Levels

I just saw Martin's post on levels of expertise in Scala, and Tony Morris's response. The concept of levels is something that many in the community have called for repeatedly, in order to help ease people - especially team members who may be a little put off by some of Scala's more advanced features - into the language more gently. I understand the motivation, but it has always struck me as somewhat odd. The reason is that I think one needs a very compelling reason to justify such a significant change, especially a change from a mature, mainstream technology such as Java to one as new and immature as Scala. To begin, I'll recount my experience with adding Python as a staple in my programming toolbox.

Learning Python

I dabbled in Python a lot before really using it extensively. Much of my initial usage basically amounted to using Jython as a REPL for interactively poking at Java libraries, especially ones that weren't particularly well documented. I thought Python was neat, but at the time most of my programming experience was in C++ and Java. I valued static typing. I valued having a compiler. When I was in college, I had a professor who said something along the lines of "The compiler is your friend. Write your code so that the compiler is more likely to catch your mistakes. Don't try to trick it or bypass it." This was in the context of C++ programming, and given all the ways that one can shoot himself in the foot using C++, it always seemed like sage advice. So giving up a compiler to catch my mistakes, and an explicit structure clearly present in my code, seemed like an awful lot to ask. Sure, Python is more concise, but in my opinion conciseness is often more a matter of how closely the libraries you're using match the problem you are trying to solve. About a year ago I had some experimental Python code that I had thrown together that did some regular-expression-heavy text processing and then some XML processing (I was processing the output of an application that produced invalid XML, so the first phase was to fix it so the XML parser wouldn't crash and the second was to actually process it). I can't remember why, but I decided to convert it to Java. I think it was because I wanted to integrate it with something that integrated well with Java, but not as well with Python. Anyway, the point is that the Python code and the Java code weren't that different in length. Sure, Java has a lot more boilerplate, but once your code starts being dominated by actual logic instead of getters and setters, it's the libraries that make the difference.

Anyway, I didn't really get into Python until I discovered and thoroughly learned metaclasses and descriptors. In my mind, they took Python from being a toy with a clean syntax to a language that opened up an entire new dimension of abstraction. A dimension of abstraction that had been missing all my life. A dimension that I had glimpsed in C++ template metaprogramming but had never quite managed to enter. All of a sudden I had this tool that enabled me to drastically and repeatedly reduce code sizes, all while making the business logic cleaner and more explicit. I had a means to encode my ideas such that my coworkers could use them without fully understanding them (an admittedly dangerous proposition). It was beautiful! They remain an important part of my toolbox to this day.

Incidentally, my first use of metaclasses and descriptors was to concisely implement immutable, lazy data structures. This was long before I knew anything about functional programming.

Learning Scala

I also dabbled in Scala a lot before really becoming enamored with it. I was drawn in by the conciseness relative to Java, especially the absence of getters and setters, the presence of operator overloading, and multiple inheritance using traits. But I had operator overloading and multiple inheritance in both C++ and Python. I could usually abstract away explicit getters and setters in Python, and by the time I started picking up Scala I had already made a significant shift towards immutability and laziness, so setters weren't as necessary. Scala was neat, and the compatibility with Java was a real benefit, but at first it didn't seem compelling. In fact, when I first started experimenting with Scala it seemed too buggy to really be worth it. Something worth watching, but not yet worth using. Also, by the time I started learning Scala I had already learned Common Lisp, and functional programming without full tail-call optimization seemed awkward (and still does).

What eventually sold me on Scala was higher-kinded types, companion objects, and objects-as-modules. You see, by the time I started picking up Scala I had already developed a habit of using a lot of generic programming. This is easy in Python, where there's no compiler or type system to tell you that you are wrong. But a lot of the patterns I commonly used couldn't be translated into Java generics (in fact, in my first foray into Java generics I crashed javac writing code that was compliant with the Java specification). Scala gave me concise programs and type safety! And with the introduction of collections that actually use higher-kinded types in 2.8.0, along with the leaps and bounds made in robustness, I think it finally reached the threshold for serious use.
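As a rough illustration, here's a contrived little example of the kind of generic programming higher-kinded types enable - abstracting over the container type itself, not just the element type:

scala> trait Functor[F[_]] {
     |   def map[A, B](fa: F[A])(f: A => B): F[B]
     | }
scala> object OptionFunctor extends Functor[Option] {
     |   def map[A, B](fa: Option[A])(f: A => B): Option[B] = fa.map(f)
     | }
scala> OptionFunctor.map(Some(2): Option[Int])(_ * 21)
res0: Option[Int] = Some(42)

This sort of abstraction over a type constructor is exactly what Java generics can't express.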

That's not to say these are the only key features. Understanding folds was a real eureka moment for me. I'm not sure how one can do effective functional programming without them, but perhaps there is a way. Implicit definitions are also key, and they are used so extensively in Scala libraries that I think writing Scala without a decent understanding of them is nothing short of voodoo programming. In short, there's a lot, and the features tend to combine together in powerful ways.
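For the uninitiated: a fold collapses a whole collection through a single binary operation, and once you see it you start finding uses everywhere. A contrived example of my own - a word count in one expression:

scala> List("to", "be", "or", "not", "to", "be").foldLeft(Map.empty[String, Int]) {
     |   (acc, w) => acc + (w -> (acc.getOrElse(w, 0) + 1))
     | }
res0: scala.collection.immutable.Map[String,Int] = Map(to -> 2, be -> 2, or -> 1, not -> 1)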

Summary

I understand the motivation for establishing levels, but I personally think that if one avoids many of the features Martin has flagged as advanced, then the benefits of Scala don't justify the risk of using such a new technology and the need to retrain people. Without them, one is better off finding a better library or framework for their current stack. Also, given that the advanced features are omnipresent in the libraries, and that teams will always have someone who wants to use the more powerful features, avoiding them works a lot better in theory than it does in practice. You can't not use them, unless you avoid Scala's standard library and almost any library written explicitly for Scala. You can only pretend that you are not using them.

This is probably going to get me flamed, but reward must justify risk. In many circumstances Scala has the features to justify the risk associated with it, but the justification quickly vanishes as the more powerful features are stripped away. So while I think a programmer may be productive at Martin's lower levels, those levels don't justify Scala over Java.


Wednesday, January 12, 2011

Types of Project Failure

I saw this question on StackExchange and it made me think about all the projects I've been on that have failed in various ways, and at times been both declared a success and, in my opinion, been a failure at the same time (this happens to me a lot). Michael Krigsman posted a list of six different types a while back, but they are somewhat IT-centric and are more root causes than types (not that identifying root causes isn't a worthwhile exercise - it is). In fact, if you Google for "types of project failures," the hits in general are IT-centric. This may not seem surprising, but I think the topic deserves more general treatment. I'm going to focus on product development and/or deployment type projects, although I suspect much can be generalized to other areas.

Here's my list:

  1. Failure to meet one or more readily measurable objectives (cost, schedule, testable requirements, etc)
  2. Failure to deliver value commensurate with the cost of the project
  3. Unnecessary collateral damage done to individuals or business relationships

I think these are important because in my experience they are the types of failure that are traded "under the table." Most good project managers will watch their measurable objectives very closely, and wield them as a weapon as failure approaches. They can do this because objective measures rarely sufficiently reflect "true value" to the sponsor. They simply can't, because value is very hard to quantify, and may take months, years, or even decades to measure. By focusing on the objective measures, the PM can give the sponsor the opportunity to choose how he wants to fail (or "redefine success," depending on how you see it) without excessive blame falling on the project team.

Subjective Objective Failure

The objective measures of failure are what receive the most attention because, well, they're tangible and people can actually act upon them. My definition of failure here is probably a bit looser than normal. I believe that every project involves an element of risk. If you allocate the budget and schedule for a project such that it has a 50% chance of being met - which may be a perfectly reasonable thing to do - and the project comes in over budget but within a reasonably expected margin relative to the uncertainty, then I think the project is still a success. The same goes for requirements. There's going to be some churn, because no set of requirements is ever complete. There's going to be some unmet requirements. There are always some requirements that don't make sense, are completely extraneous, or are aggressive beyond what is genuinely expected. The customer may harp on some unmet requirement, and the PM may counter with some scope increase that was delivered, but ultimately one can tell whether stakeholders feel a product meets its requirements or not. It's a subjective measure, but it's a relatively direct derivation from objective measures.

Value Failure

Failure to deliver sufficient value is usually what the customer really cares about. One common situation I've seen is where requirements critical to the system's ConOps, especially untestable ones or ones that are identified late in the game, are abandoned in the name of meeting more tangible measures. In a more rational world such projects would either be canceled or have success redefined to include sufficient resources to meet the ConOps. But the world is not rational. Sometimes the resources can't be increased because they are not available, but the sponsor or other team members choose to be optimistic that they will become available soon enough to preserve the system's value proposition, and thus the team should soldier on. This isn't just budget and schedule - one of the more common problems I've seen is the allocation of critical personnel. Other times it would be politically inconvenient to terminate a project, whether due to loss of face, the necessity of pointing out an inconvenient truth about some en vogue technology or process (e.g. one advocated by a politically influential group or person), idle workers who cost just as much when they're idle as when they're doing something that might have value, or just a general obsession with sunk costs.

This is where the "value commensurate with the cost" comes in. Every project has an opportunity cost in addition to a direct cost. Why won't critical personnel be reallocated even though there's sufficient budget? Because they're working on something more important. A naive definition of value failure would be a negative ROI, or an ROI below the cost of capital, but it's often much more complicated.

It's also important to note here that I'm not talking about promised ROI or other promised operational benefits. Oftentimes projects are sold on the premise that they will yield outlandish gains. Sometimes people believe them. Often they don't. But regardless, it is usually perfectly possible to deliver significant value without meeting these promises. This would be a case of accepting objective failure because the value proposition is still there; it's just not as sweet as it was once believed to be.

Collateral Damage

Collateral damage receives a fair amount of press. We've all heard of, and most of us have at one time or another worked on, a Death March project. In fact, when I first thought about failure due to collateral damage, I thought any project causing significant collateral damage would also fall under at least one of the other two categories. It would be more a consequence of the others than a type unto itself. But then I thought about it, and realized I had experienced a couple of projects where I feel the damage done to people was unacceptable, despite the fact that in terms of traditional measures and value delivered they were successful. There are a couple of ways this can happen. One is that there are some intermediate setbacks from which the project ultimately recovers, but one or more people are made into scapegoats. Another arises from interactions among team members that are allowed to grow toxic but never reach the point where they boil over. In my experience a common cause is an "extreme" performer, either someone who is really good or really bad, with a very difficult personality, particularly when the two are combined on a project with weak management.

Now, you might be wondering how collateral damage can sometimes be a necessary cost. It's actually fairly simple. Consider a large project that involves a large subcontract. The project may honestly discover that the value contributed by the subcontract does not justify its cost, or that the deliverables of the subcontract are entirely unnecessary. In this case the subcontract should be terminated, which in turn is likely to lead to a souring of the business relationship between the contractor and the subcontractor, along with potentially significant layoffs at the subcontractor. Oftentimes a business must take action, and there will be people or other businesses who lose out through no fault of their own.

Back to Trading Among Failure Types

A Project Manager, along with project sponsors and other key stakeholders, may actively choose which type of failure, or balance among them, is most desirable. Sometimes this "failure" can even really be success, so long as significant value is delivered. Some of the possible trades include:

  1. Failing to meet budget and schedule in order to ensure value.
  2. Sacrificing value in order to meet budget and schedule...
  3. ...potentially to avoid the collateral damage that would be inflicted in the case of an overt failure
  4. Inflicting collateral damage through practices such as continuous overtime in order to ensure value or completing on target
  5. Accepting reduced value or increased budget/schedule in order to avoid the collateral damage of the political fallout for not using a favored technology or process

Ultimately some of these trades are inevitable. Personally, I strongly prefer focusing on value. Do what it takes while the value proposition remains solid and the project doesn't resemble a money pit. Kill it when the value proposition disappears or it clearly becomes an infinite resource sink. But of course I know this is rather naive, and sometimes preventing political fallout, which I tend to willfully ignore in my quest for truth, justice, and working technology, has an unacceptable if intangible impact. One of my most distinct memories is having a long conversation with a middle manager about a runaway project, and at the end being told something like "Erik, I agree with you. I know you're right. We're wasting time and money. But the money isn't worth the political mess it would create if I cut it off or forcefully reorganized the project." I appreciated the honesty, but it completely floored me, because it meant not only that politics could trump money, but that politics could trump my time, which was being wasted. Now I see the wisdom in it, and simply try to avoid such traps when I see them and escape them as quickly as possible when I trip over them.


Monday, January 03, 2011

StackOverflow and Scraper Sites

I recently noticed that Google searches were turning up a lot of sites that mirror StackOverflow content, as opposed to the originals. It appears that I'm not alone. This morning Jeff Atwood blogged about how they're having increasing problems with these sites receiving higher Google rankings. His post, and especially the comments, are filled with righteous indignation about how it's the end of the Internet as we know it. Of course, I can't remember a time when search results weren't polluted with faked-out results, so I don't understand why he's so surprised.

But I do think this situation is different. In many cases the presentation is a far cry from my previous exposure to link farms. Historically, my impression is that pages on such sites generally have the following attributes:

  1. The recycled content is poorly formatted and often truncated
  2. Unrelated content from multiple sources is lumped together in a single page (usually there is some common keyword)
  3. Advertisements are extremely dominant and often unrelated to the content
  4. The link-to-content ratio is often very high, and the links are to unrelated content

In fact, I think if one were to develop a set of quantitative metrics that could be automatically measured for pages based on the above criteria, StackOverflow would perform worse than some of the sites that mirror its content. It's rather ironic, but if you think about it, if you wanted to defeat an algorithm that was developed to find the "best" source for some common content, you'd do everything you could to make your scraper site look more legitimate than the original. Let's compare a question on StackOverflow with its cousin on eFreedom.com.
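To make "quantitative metrics" a little less hand-wavy, here's a naive sketch of one such measure - my own toy, certainly not what any search engine actually does - the link-to-content ratio, computed over raw HTML:

def linkToContentRatio(html: String): Double = {
  // toy metric: characters spent on anchor elements vs. visible text overall
  val anchors = """(?is)<a\b[^>]*>.*?</a>""".r.findAllIn(html).mkString
  val text = html.replaceAll("(?is)<[^>]*>", "").trim
  if (text.isEmpty) 0.0 else anchors.length.toDouble / text.length
}

Compute something like this over both pages, add similar scores for advertisement dominance and truncation, and you have the beginnings of the comparison below.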

Analysis

The primary content is essentially the same, with similar formatting. The two major differences are that eFreedom only directly displays the selected answer, as opposed to all of the answers with the selected answer on top, and that none of the comments are displayed. This may help avoid triggering the "unrelated content" rule, because the selected answer is probably the most cohesive with the question, and comments frequently veer off-topic. But I suspect the effect is minimal.

Now consider the advertisements. eFreedom has a few more, but they are more closely tied to the content (using Google AdWords, which probably helps). The advertisements on StackOverflow are for jobs identified via geolocation (I live in Maryland), and the words in them don't correlate particularly well to the primary content, even though they are arguably more relevant to StackOverflow users than run-of-the-mill AdWords ones.

Now let's consider the related links. StackOverflow has links to ~25 related questions in a column along the side. The only content from those questions is the title, and the majority seem to be related based on matching a single tag from the question. eFreedom, on the other hand, has links to 10 related questions (they appear to be matching on both tags), puts them inline with the main content, and includes a brief summary of each. As a human being, I think the StackOverflow links are much noisier and less useful. If I try to think like an algorithm, what I notice is that StackOverflow has a higher link-to-content ratio, and the links are to more weakly related content.

The only other major difference is that eFreedom has a "social network bar" on the page. I'm not sure how this would affect ranking. It probably helps with obtaining links.

If you look at the HTML for each page, both use Google Analytics, both use what I'm assuming is a CDN for images, and StackOverflow appears to have a second analysis service. On casual inspection, neither appears to be laden with external content for tracking purposes, although without deeper inspection there's no way to be sure. But I presume having a large number of tracking links would make a site look more like spam, and having few of them makes it look more legit.

Conclusion

I don't think either site looks like spam, but between the two, StackOverflow has more spam-like characteristics. eFreedom's content and links are considerably less noisy than StackOverflow's. Is eFreedom being a leech? Yes, certainly, but it, and I believe some of the other sites replicating StackOverflow's content, don't look like traditional link-spam sites. In fact, for a person who is just looking for content, as opposed to looking to participate in a community, eFreedom is at least as good, if not slightly better. There may be a moral argument that the original source should be given priority over replications, but from a pure content quality perspective StackOverflow and its copiers are essentially identical, and computer algorithms aren't generally in the business of making moral judgments. Also, there are many forums and mailing list archives out there with atrocious user interfaces, where the casual searcher is likely better off being directed to a content aggregator than to the original source, so I don't think a general rule giving preference to original sources would be productive. Ultimately, I think open community sites like StackOverflow are going to have to compete with better SEO, and perhaps better search and browsing UIs for non-contributors, rather than relying upon search engines to perform some miracle, because the truth is that from a content consumption perspective the replicated sites are just as good.


Sunday, August 08, 2010

Improving Schedulers for High-level, Fine-grained Concurrency Frameworks

Quite a while ago, Greg Meredith made a comment that really made me stop and think, and that has been lingering in the back of my mind ever since:

Erik, Alex, John, et al, i'm loathe to hijack this thread -- which is a good one -- but the experience with the lock up described below is really just the tip of the iceberg. Unless specific scheduling contracts and semantics can be worked out, a pluggable scheduling architecture is just asking for disaster. Worse, it means that a whole host of compiler support that could be provided to apps that use actor-based concurrency is null and void because those solutions (be they model-checking or types-based) will critically depend on specific ranges of scheduling semantics. Actors and other higher level concurrency constructs are a major appeal of languages like scala. If they prove themselves to be "not ready for prime time" the impact on perception might not be contained to just that usage of the language. Best wishes, --greg

That wasn't the first time that Greg made a comment along those lines, and it certainly wasn't the last. At one point he offered to let a few individuals look at some older code that he believed could help. Greg's a really smart guy. If he's worried, we should all be worried.

Empirical Evidence of the Problem

Contrary to what many may think, this is hardly an academic problem. A significant portion of the issues people new to Scala Actors encounter can be traced to the fact that they expect the scheduler to be making guarantees that it simply does not make. Here's a sampling:

  1. Actors not creating enough threads
  2. Scala Actors different behavior on JRE 1.5 and 1.6
  3. Actor starvation problem (related to previous)
  4. Actor as new thread

There are others, including cases where people have both rightly and wrongly blamed the default scheduler for serious problems with their programs. Most of the above can be traced back to the fact that the default scheduler for Scala Actors uses a slightly modified version of the JSR166y ForkJoinPool, which has the issues described below (source):
Like all lightweight-task frameworks, ForkJoin (FJ) does not explicitly cope with blocked IO: If a worker thread blocks on IO, then (1) it is not available to help process other tasks (2) Other worker threads waiting for the task to complete (i.e., to join it) may run out of work and waste CPU cycles. Neither of these issues completely eliminates potential use, but they do require a lot of explicit care. For example, you could place tasks that may experience sustained blockages in their own small ForkJoinPools. (The Fortress folks do something along these lines mapping fortress "fair" threads onto forkjoin.) You can also dynamically increase worker pool sizes (we have methods to do this) when blockages may occur. All in all though, the reason for the restrictions and advice are that we do not have good automated support for these kinds of cases, and don't yet know of the best practices, or whether it is a good idea at all.

The Scope of the Issue

The issue here is neither limited to Scala nor to actor based concurrency. The general consensus in the concurrency community is that locks and threads are not the right abstractions for concurrency in application code (or hardly any code, for that matter), but there isn't any consensus on what the right abstractions are, or if there is consensus it is that the right abstractions are problem specific. There's no one-size-fits-all concurrency abstraction.

The one trait that all high-level or higher-order concurrency abstractions have in common is that under the hood they rely on some sort of managed thread pool and/or scheduler. For purposes here, when I say "scheduler," I mean the layer with which a concurrency framework interacts directly, which likely contains framework-specific logic. When I say "thread pool," I mean the layer that directly manages the threads, is concurrency-framework agnostic, and may even be shared by several schedulers (sharing will be problematic, but I think ultimately necessary). This isn't a hard line, and oftentimes the two may be lumped into one. However, I think it's a useful dichotomy. I'm also assuming the scheduler and thread pool rely on cooperative thread sharing, and that preemptive multitasking is left to the virtual machine and/or operating system.

The point is, any concurrency abstraction where the user can submit arbitrary code to be executed can ultimately run into serious issues if that user code does something the abstraction does not expect. For example, the fork-join scheduler from JSR166y that the Scala Actors library uses (I think Clojure uses it as well, but it may not) doesn't automatically enlarge its pool in the presence of unmanaged blocks (including normal blocking IO) or simple long-running computations. This results in the majority of the problems I previously cited, because blocking operations are extremely common tasks, making this a very leaky abstraction.
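A toy demonstration of the leak using Scala 2.8 actors (my own snippet, not from any of the threads above): spawn more sleeping actors than the pool has workers, and most of them won't even start until the earlier ones wake up, because each sleep silently ties up a worker thread.

import scala.actors.Actor._

// with a pool sized near the core count and no growth on unmanaged
// blocks, most of these actors sit unstarted until earlier ones wake
for (i <- 1 to 100) actor {
  Thread.sleep(60000)  // an unmanaged block: the scheduler can't see it
  println("actor " + i + " finished")
}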

Key Characteristics of a Good Scheduler

Unfortunately my command of type theory is rudimentary at best, so while Greg could probably describe a well-behaved scheduler using types, I'm going to have to resort to plain English:

  1. The user should not have to think about choosing a scheduler.
  2. The user should not have to think about the size of the thread pool managed by the scheduler.
  3. The user should be able to write code naturally without worrying about any assumptions the scheduler may make.
  4. The user should not have to worry about starting or stopping the scheduler.
  5. Multiple flavors of concurrency abstraction should be able to share the same thread pool.
  6. The scheduler should impose minimal overhead.

There are probably others, but I think this is a good start.

The Performance Problem

So why hasn't this problem already been solved? Perhaps it has been and I just don't know about it. If someone has solved it, please let me know. But as far as I can tell, it's because there's a perceived nasty performance tradeoff. Here's a quote from Philipp Haller:

The starvation detection mechanism in Scala 2.7 was too expensive to support in the new default scheduler. However, there is a starvation detection mechanism for thread-blocking methods of the `Actor` object (`receive` etc.).

Similar answers have been provided when people have questioned the wisdom of using the FJ framework over the classes included with the JDK. I've tested it a bit, and it's true: the FJ framework does yield a significant improvement for certain classes of tasks over the JDK classes (although I believe using exceptions for control flow imposes a far larger penalty). Basically, it appears that the overhead associated with known solutions to the problem overcomes the benefits introduced by fine-grained concurrency.

ForkJoinPool in a Nutshell

What makes ForkJoinPool so fast (other than highly optimized code) is that it uses a two-level architecture for its task queues. There's one queue for the pool, and one deque per thread. Most of the data associated with the deques is only updated from the owning thread, so that tasks can be added and removed without synchronization and with minimal volatile and CAS operations. The worker threads also steal work from one another in order to spread out the load. When a worker is deciding the next task to execute, it performs the following checks, using the first one to return a task:

  1. The thread's own local deque
  2. The deques of other threads in the pool, scanned in a pseudo-random pattern with guiding hints
  3. The common task queue

Threads are only added to the pool if a ManagedBlocker that the pool can see is used to block, causing the target concurrency not to be met. The default target concurrency for the ForkJoinPool is set to the number of available processors. In Scala Actors this is overridden to be twice the number of available processors. The idea behind this is simple: if you're keeping all your processors busy, then adding more threads will just slow you down. As we've seen, problems arise when a higher level of concurrency is needed due to unmanaged blocks, long-running computations, or simply the required concurrency being inherently higher than the default. This happens because as long as a worker is busy or has something in its local deque, it won't look elsewhere for tasks. If all the workers are in this state, tasks can simply accumulate, resulting in starvation.
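For reference, here's roughly what cooperating with the pool looks like (a sketch against the JDK flavor of the ForkJoinPool API; the JSR166y jar Scala bundles is packaged a bit differently, but the ManagedBlocker idea is the same):

import java.util.concurrent.ForkJoinPool

// wraps a blocking call so a worker thread can tell the pool to
// compensate with a spare thread while it waits
def blocking[T](body: => T): T = {
  var result: Option[T] = None
  ForkJoinPool.managedBlock(new ForkJoinPool.ManagedBlocker {
    def block(): Boolean = { result = Some(body); true }
    def isReleasable(): Boolean = result.isDefined
  })
  result.get
}

The problem, of course, is that this puts the burden squarely on the user, which violates most of the characteristics of a good scheduler listed above.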

A Crack at Solving the Problem

As you might have guessed by now, I'm trying to solve this problem. Right now I wouldn't use the code for anything beyond testing, but despite the fact that I'm still wrestling with the internals of ForkJoinPool, I've had good results so far. My approach is simple: I added a manager thread that monitors the changes in size of the various queues in the pool, checking whether they've been flushed since the last check, and grows the pool if tasks appear to be starved. The overhead imposed on each worker is minimal, as it only has to update a single additional volatile field when it clears out its queue. The overhead of the manager thread is something, but in the benchmarking I've done so far it doesn't seem noticeable. Ironically, Scala Actors already maintains a dedicated thread for the scheduler; it's just that the heuristic checks on the size of the pool were removed in 2.8 (although honestly they never worked that well).
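To illustrate the shape of the heuristic, here's a toy version using a plain ThreadPoolExecutor so that it stands alone - the real ManagedForkJoinPool hooks into ForkJoinPool's internals instead:

import java.util.concurrent.ThreadPoolExecutor

// toy monitor: grow the pool when work is queued but nothing has
// completed since the last check - a crude proxy for "all workers stuck"
class PoolMonitor(pool: ThreadPoolExecutor, maxThreads: Int) extends Thread {
  setDaemon(true)
  private var lastCompleted = 0L
  override def run(): Unit = while (true) {
    Thread.sleep(100)  // check interval; the real thing would size this dynamically
    val completed = pool.getCompletedTaskCount
    if (!pool.getQueue.isEmpty && completed == lastCompleted &&
        pool.getCorePoolSize < maxThreads)
      pool.setCorePoolSize(pool.getCorePoolSize + 1)
    lastCompleted = completed
  }
}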

I have two simple tests/micro-benchmarks using Scala Actors with a custom scheduler built on my ManagedForkJoinPool. The SleepyActor benchmark performs extremely poorly if the scheduler doesn't significantly grow the pool, because each actor sleeps on the thread. The PingPong benchmark spawns a bunch of pairs of actors that simply message each other back and forth. It is the type of use case where the ForkJoinScheduler shines, at least in terms of throughput. The results on my MacBook are as follows (these are NOT scientific):

Benchmark    | Default Scheduler | ManagedForkJoinScheduler | ResizeableThreadPoolScheduler
SleepyActor  | 92 sec            | 30 sec                   | not tested
PingPong     | 59.49 sec         | 47.17 sec                | 66.02 sec

As you can see, performance actually improved with my scheduler. This is because the default scheduler for Scala Actors does not always use the "fast path" and put new tasks on the deque of the thread creating them. It only does so about half the time. So for the PingPong test you can see how the thread-local queues impact performance.

Conclusions

It's too early to draw solid conclusions, but based on what I've done so far I think I can develop a solid heuristic for managing the thread pool size, and that the overhead will be negligible. The key is to not impose any more overhead on the worker threads, and to keep the computational overhead of the manager thread as low as possible. This means that the heuristic needs to be simple, and that the wait times between checks should be as long as possible, probably dynamically sized based on how smoothly the pool seems to be running.

What I need right now are more tests and benchmarks to cover "real world" scenarios. I also need to test on machines with more cores. My MacBook isn't exactly a powerful system. If anyone has suggestions, please post them!


Sunday, August 01, 2010

Is the Market for Technical Workers a Market for Lemons?

I saw this article about Amazon Mechanical Turk this morning, and it struck me that the market for technical talent, especially software engineering and IT talent, is a market for lemons. For a little more critical (and hopeful!) look at the idea, take a look at this post at the Mises Institute.

What's a Lemon Market?

The basic idea behind a market for lemons is that the seller has significantly more information about the quality of a product than the buyer (i.e. there is an information asymmetry). The consequence is that buyers are unwilling to pay more than the perceived average value of the goods on the market, because a buyer does not know whether he is going to get a cherry (a high quality good) or a lemon (a low quality good). This creates disincentives to sell cherries, because buyers will not pay their market value, and incentives to sell lemons, because they can be sold for more than they are worth. The result is a vicious cycle in which high quality goods are withdrawn from the market, lowering the average quality of the available goods and consequently the prices buyers are willing to pay. The ultimate result is a race to the bottom in terms of prices, with high quality goods vanishing as collateral damage.

There are five criteria for a lemon market (paraphrased and reordered from Wikipedia):

  1. Asymmetry of information such that buyers cannot accurately assess the quality of goods and sellers can.
  2. There is an incentive for sellers to pass off low quality goods as high quality goods.
  3. There is either a wide continuum in the quality of goods being sold, or the average quality is sufficiently low such that buyers are pessimistic about the quality of goods available.
  4. Sellers have no credible means of disclosing the quality of the goods they offer.
  5. There is a deficiency of available public quality assurance.

The Lemon Market for Technical Talent

The market for technical talent is similar. The information available about most candidates for technical positions is superficial at best, and completely inaccurate at worst. Furthermore, even when information is available, the buyers often do not have the knowledge required to evaluate it. This even extends to existing, long-term employees and contractors. Finally, the quality of an employee is often highly contextual - environmental and cultural factors can significantly boost or dampen a person's performance, and the same factors may have the opposite effect on someone else. So let's run down the different factors:

  1. Information Asymmetry: Resumes are short, shallow, and often deceptive. Technical skills are very difficult to assess during an interview, and cultural fit can be almost impossible to assess.
  2. Incentives for sellers to deceive buyers: Resumes are deceptive for a reason. It's often stated that you need to pad your resume, because the person looking at it is automatically going to assume it is padded. Furthermore, for almost two decades now technology has been a source of high paying jobs that can be obtained with marginal effort (just turn on the radio and listen for advertisements for schools offering technology training).
  3. Continuum in the quality of goods: This exists in all professions.
  4. Sellers have no credible means of disclosing quality: This is largely, but not entirely, true. Most paid work that technology professionals do cannot be disclosed in detail, and even when it can be, buyers have reason to doubt its credibility. Employment agreements may also place restrictions on the disclosure of closely related work, even if it is done on one's own time with one's own resources.
  5. Deficiency of public quality assurance: Employment laws and potentials for litigation (at least here in the U.S.) make the disclosure of employee or even outsourcer performance very risky. Individual workers cannot effectively provide guarantees, and those provided by outsourcers are generally meaningless.

What all this really amounts to is: sellers have no good way of providing information about their high quality products, and buyers have no good way of obtaining reliable information about a seller's products. The problem centers entirely on making information available and ensuring its accuracy.

What Technology Professionals can do

Technology Professionals need to build up public reputations. We need to make samples of our work, or at least proxies of samples, publicly available. We need to build more professional communities with broader participation and more local, face-to-face engagement.

I think broader participation is the key. If you look at other professions (yes, I know, I'm lumping all "technology professionals" together, and I frequently rant against such lumping), members are much more active in various groups and associations. In many professions it's expected, and even explicitly required by employers. Sure, there are tons of groups on the internet. There are numerous, and oftentimes enormous, technology conferences. There are countless technology blogs and community websites. But I think the conferences are generally too big to be meaningful, because most attendees essentially end up being passive receptacles for sales pitches (that's the purpose of these conferences), with maybe a little networking and learning on the side. And I think even the most active community sites really represent very small slices of the professionals they are intended to serve. Most professionals are passive participants in them at best, just like with conferences.

But there are millions of engineers, software developers, and IT professionals out there. Millions! How many of them do you find actively participating in professional communities of any kind? Not many, when you really think about it. This is a problem, because as long as the vast majority of technology professionals have no public, professional identity, the vast majority of employers aren't going to see professional reputation as a useful search criterion, or even a useful measure when considering candidates.

What Employers can do

Employers can do one of two things:

  1. Expect applicants for new positions to provide meaningful evidence of their skills and take the time to evaluate such evidence
  2. Just outsource everything to the cheapest company possible. Information on quality is largely unavailable and unreliable, but information on price is readily available and relatively accurate.

You can see which one of those strategies is dominating the marketplace. One requires effort and involves going against the tide. The other is easy (at least to start, it's not easy to make work), and involves running along with the herd.

Except in the limited number of cases where employers, and the hiring managers who work for them, genuinely believe they can achieve a competitive advantage by actively seeking out candidates with demonstrated qualifications (as opposed to a padded resume and fast talk during an interview), I don't think employers are going to seriously consider reputation and objective evidence of skills until such information is easily obtainable and fairly reliable.

Is there any hope?

Yes!

We live and work in an information economy where new forms of information become available every day. There is no reason to believe that just because today employers mostly hire on faith and expect luck to carry them through, in the future there won't be much more information available. I'm sure there are several companies working on aggregating such information right now. The market is huge. While companies will rarely put much effort into obtaining information and aggregating it into a useful form, they will often pay large quantities of money for it. The key is to make sure the information is there for them to use.

Also, if your resume makes it through all the wickets to a real hiring manager, and you provide an easy way for him to find more good information about you, he'll probably use it. Interviewing people is a lot of work. Deciding among candidates can be very hard. Extra information will almost certainly be used. It just has to be easy to obtain. People are lazy.

But what about old fashioned networking?

I think the numbers involved are too large, and relying on old-fashioned networks tends to yield too poor a result. Recommendations and referrals are certainly useful and lead to many, many hires. But they tend to be made more on the basis of personal relationships than real qualifications, and therefore need to be checked. Schmoozing isn't a real technical skill. That being said, it could quite likely get you a job. It's just that in general I don't want to work with such people in a technical capacity, so I'm not going to recommend people go do it in order to obtain technical jobs. I'm selfish that way.


Tuesday, July 27, 2010

Higher-Level versus Higher-Order Abstraction

Engineering in general, and software engineering in particular, is all about abstraction. The creation and utilization of abstractions is a core part of the daily activities of every good engineer, and many engineers are on a never ending quest to increase their level of abstraction. Abstraction both increases productivity and increases the complexity of problems that can be tackled. Few would argue that increased abstraction is a bad thing.

But people do argue about abstraction, and often condemn abstractions that they view as unfit or too complex. It is common to hear a software developer praise the merits of one means of abstraction and then demean another in the same breath. Some abstractions are too general. Some aren't general enough. Some are simply too complex or, ironically, too abstract. So while engineers almost universally agree that abstraction is good, and more abstraction is better; they often disagree fiercely about what genuinely constitutes "better."

What is an abstraction?

Simply put, an abstraction is something that hides the details involved in realizing a concept, and thus frees the engineer to focus on the problem at hand rather than the details of some other concern. This can take many, many forms. The simplest and most common - so common that the term is often used to describe almost all "good" abstractions - is the higher-level abstraction.

Higher-level Abstractions

Higher-level abstractions are fairly simple: they encapsulate details so that they can be used without knowledge or concern about the details. Let's consider a simple example in Scala 2.8 (you can copy and paste these examples directly into the Scala 2.8 REPL):

scala> def sum(data: Array[Double]): Double = {
     |   var i = 0
     |   var total = 0.0 
     |   while(i < data.length) {
     |     total = total + data(i)
     |     i = i + 1
     |   }
     |   total
     | }
scala> def mean(data: Array[Double]): Double = sum(data) / data.length

So we have two functions: one that computes the sum of an array of doubles, and one that computes the mean. Pretty much any programmer (professional or otherwise) would feel confident both writing and using such functions. They are written in a modern multi-paradigm programming language, yet I bet if you went back in time and showed them to programmers using some of the very first high-level languages, they would recognize exactly what they do. They clearly encapsulate the details of performing certain computations. But what's more interesting about them is what's missing from them:

  1. Polymorphism
  2. Closures or first-class functions
  3. Usage or creation of higher-order functions
  4. Other Scala magic such as type classes and implicit parameters

In other words, they provide a simple, layered, hierarchical abstraction in a very bounded way. If we step away from software for a minute, you can imagine a digital designer placing a symbol for an adder or a register on a schematic without worrying about the arrangement of transistors that will be required to realize them in an actual circuit, or a mechanical engineer selecting a standard screw or clamp. These are parts that can be simply built, tested, and composed into larger devices.

Higher-Order Abstraction Take 1: The Mathematics of Fruit

Imagine I have some apples. But unfortunately I'm here, off in the internet, so I can't show them to you. I need an abstraction to tell you about them. For example, I could say I have two apples. Two is a number, and numbers are abstractions. I can use the same number two to describe the number of McIntoshes in my refrigerator or the number of Apple Macintoshes I own. Now let's say I also want to talk about my strawberries and blueberries. I have 16 strawberries and 100 blueberries. How many pieces of fruit do I have? 118! How did I figure that out? I used arithmetic, which is an abstraction of higher order than numbers. Now let's say I want to know how many days it will be before I will have eaten all my strawberries. I can write an equation such as: current_fruit - pieces_per_day * number_of_days = fruit_after_number_of_days. I can do even more with this by solving for different variables. This is algebra, and it is a higher-order abstraction than arithmetic. Now let's say I want to build upon that so that I can study the dynamics of the amount of fruit in my refrigerator. I purchase fruit and put it in my refrigerator. I take fruit out and eat it. I forget about fruit and it gets moldy, then I remove it and throw it away. I can capture all of these as a system of differential equations, and using calculus describe all sorts of things about my fruit at any given point in time. Calculus is a higher-order abstraction than algebra. In fact, abstractions similar to the ones built with calculus are what I mean when I say "higher-order abstraction."
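In equation form (my notation, nothing standard), the refrigerator dynamics might look like

\frac{dF}{dt} = p(t) - e(t) - s(t)

where F is the amount of fruit, p the rate of purchasing, e the rate of eating, and s the rate of spoilage - one compact expression describing the fruit level at every point in time.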

At each step up the chain of mathematics, both the generality and the number of concepts that can be conveyed by a single abstraction increase. In the case of calculus it becomes essentially infinite, and that's the essence of higher-order abstractions: they deal with the infinite or near-infinite. Also, observe that almost everyone from relatively small children on up understands numbers and arithmetic, most adolescents and adults can stumble through applying algebra, and only a very small portion of the population knows anything about calculus, much less can effectively apply it. Also observe that the functions I defined earlier sit just below algebra in terms of their order of abstraction.

Higher-Order Abstraction Take 2: Programming in Scala

Let's see if we can sprinkle some higher-order abstraction into our previously defined functions:

scala> def sum(data: Traversable[Double]) = data.foldLeft(0.0)(_ + _)

This definition of sum exposes one higher-order abstraction (polymorphism - it can now use any Traversable[Double] instead of just an Array[Double]), and it uses higher-order functions in its implementation to perform the summation. This new definition is both much shorter and much more general, but it's still relatively approachable, especially for use. Calling it with an Array[Double] works as before, and now it can be used with any number of collections, so long as the collections contain doubles. But forcing the collections to contain doubles is very limiting, so let's see if we can do better:

scala> def sum[T](data: Traversable[T])(implicit n: Numeric[T]): T = {                                  
     |   import n._
     |   data.foldLeft(zero)(_ + _)
     | }

Ahhh, that's better! Much more general. Not only will this work for any numeric type defined in the Scala standard library, but for any numeric type for which the Numeric type class has been defined! It doesn't even need to be in the same class hierarchy! In order to introduce this new level of generality, we've also introduced the following higher-order abstractions:

  1. Classic polymorphism (Traversable instead of Array)
  2. Parametric polymorphism (the type parameter T for various classes)
  3. Higher-order functions and closures (foldLeft and its argument that does the addition)
  4. Type classes (well, Scala's flavor of type classes, e.g. Numeric)
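A quick sanity check of just how general it is:

scala> sum(List(1, 2, 3))
res0: Int = 6

scala> sum(Vector(1.5, 2.5))
res1: Double = 4.0

Same function, different element types, different collection types, no overloads.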

Now, this is the point where people start screaming that abstraction has gone too far. Many professional programmers would look at it and think "WTF?" They could still guess what it does, but the mechanics are quite elusive for anyone who doesn't know Scala reasonably well. That being said, the code is still far more compact than the original imperative version and is extremely general (to which someone replies "Yes, but it's slow as heck compared to the original!"). At this point I would say the order of abstraction has gone from being just below algebra to being on par with integral calculus. Just like with mathematics, we see a significant drop-off in the number of people who readily understand it.

Higher-Order Abstraction Take 3: Conclusions

Let's consider a short, incomplete list of higher-order abstractions, means of abstraction, and fields that rely upon higher-order abstractions:

  1. Higher-order functions
  2. Polymorphism and parametric polymorphism
  3. Metaclasses such as in Python and Smalltalk
  4. Rich macros such as in Lisp or C++ Templates
  5. Calculus and all the mathematics built upon it (and various other forms of advanced math)
  6. First-order Logic and its kin, and all the higher-order logics
  7. Any physics or other sciences that are built upon calculus
  8. Any engineering or other applications that are built upon physics

Higher-order abstractions tend to exhibit one or more of the following traits:

  1. They deal in infinities (e.g. integration and differentiation in calculus, universal and existential quantification in first-order logic, polymorphism and type classes in programming)
  2. They deal in transformations (e.g. integration and differentiation in calculus, metaclasses and macros in programming)
  3. They rely upon composition rules (e.g. chain rule and integration by parts in calculus, higher-order functions in programming)
  4. Learning to use any given form of higher-order abstraction requires significant effort and discipline, therefore they tend to be the tools of specialists
  5. They form the foundation of our modern, technological society.

The last two can be converted into conclusions:

  1. Higher-order abstractions are critical for most interesting forms of science and technology
  2. You can't expect a colleague or customer from another field to understand the higher-order abstractions in your field, even if his own field is totally immersed in higher-order abstraction


Saturday, January 16, 2010

Changing Tastes

When I was a kid, I hated onions, green peppers, and mushrooms. I used to tell people I was allergic to mushrooms so they wouldn't try to make me eat them. I hated any sort of chunky sauce or really textured meat. I think I wanted everything to have either the consistency of a chicken nugget or ketchup. My parents used to tell me that when I was older my tastes would change. That I liked crappy food, disliked good food, and eventually I would realize it. They were right.

So kids like chicken nuggets and ketchup. Wow, huge revelation. What does this have to do with technology? I'm on the steering committee for an internal conference on software engineering that my employer is holding. I'm the most junior person on the committee, and most of the members are managers who have more managers reporting to them. Our technical program committee (separate, more technical people on it, but all very senior) just finished abstract selection and we've been discussing topics and candidates for a panel discussion. During this process I have been absolutely shocked by how my tastes differ from those of my colleagues.

I've noticed that among the selected presentations, topics on process, architecture, and management are over-represented. On the other side, many of the abstracts that I thought were very good and deeply technical fell below the line. I can't quite say there was a bias towards "high level" topics, because I think they were over-represented in the submissions. Given the diversity of technology represented, that's almost inevitable. A guy doing embedded signal processing and a guy doing enterprise systems would most likely submit very different technical abstracts, but ones on management or process could be identical. It's almost inevitable that topics that are common across specialties will have more submissions.

There's a similar story with the panel discussion. I wanted a narrower technical topic, preferably one that is a little controversial, so panelists and maybe even the audience can engage in debate. My colleagues were more concerned with who is going to be on the panel than with what they would talk about, and with keeping the topic broad enough to give the panelists freedom.

What's clear is that my colleagues have different tastes in presentation content than I do. I think they are genuinely trying to compose the best conference they can, using their own experiences and preferences as a guide. I think their choices have been reasonable and well intentioned. I just disagree with many of them. If I had been the TPC chair, I would have explicitly biased the selection criteria towards deeper, technical topics. Those are the topics I would attend, even if they are outside my area of expertise. I would use my preferences as a guide. But that leaves me wondering: in another five or ten years, are my tastes going to change? My tastes have certainly changed over the past decade, so I have no reason to believe they won't change over the next. Will I prefer process over technology and architecture over implementation? Will I stop thinking "show me the code!" and "show me the scalability benchmarks!" when I see a bunch of boxes with lines between them? I don't think so, but only time will tell. When I was a kid, I would have never believed that I would ever willingly eat raw fish, much less enjoy it, but today I do.



Tuesday, January 05, 2010

Your Company's App

Tim Bray just posted a blog about how Enterprise IT is doing it wrong. I can't really argue with that. He goes on to explain that Enterprise IT needs to learn from the companies building Web 2.0, because they deliver more functionality in less time and for a whole lot less money. This is where his argument breaks down. The problem is that the types of Enterprise Systems he's talking about aren't Twitter; they're your company's app.

I work at a pretty big, stodgy, conservative company. I'm pretty sure, as far as things like ERP and PLM are concerned, my employer is exactly the type of company Tim is talking about, and like I said - he's probably right about the problem. But based on my own experiences and observations at my one lonely company, I think he's wrong in his suggestions. Comparing ERP and Facebook is like comparing apples and Apple Jacks.

The reason I say this is that in terms of deploying Enterprise 2.0 applications, I think my employer has done fairly well. We have...

...and probably more stuff that I'm not aware of yet. Some of the above were bought, some were built, some were cobbled together with stuff from a variety of sources. I think most of them were built and deployed, at least initially, for costs about as competitive as possible with Web 2.0 startups and in reasonable timeframes. Of course, this means intranet scale at a Fortune 100 company (or in some cases a subset of it), not successful internet scale.

The problems the above face are much like the problems your typical startup faces: attracting users, keeping investors (sponsors) happy, dealing with the occasional onslaught of visitors after some publicity. But these problems are very different from the problems a traditional Enterprise Application faces. There is no SOX compliance. There are no entrenched stakeholders. There are no legacy systems, or if there are, they aren't that important. If only 10% of the potential user community actually uses the application, it's a glowing success. Negligible adoption can be accepted for extended periods while the culture adjusts, because the carrying cost of such applications is low.

But enterprise systems need to deal with SOX. They have more entrenched stakeholders than you can count. These days there's always at least one legacy system, and often several due to disparate business units and acquisitions. If these systems fail, business stops, people don't get paid, or the financials are wrong. If only 90% of your buyers use your ERP system to process purchase orders, it's an abject failure, possibly one that endangers the company.

A year or two ago someone "discovered" that a very common, important record in one of our internal systems had 44 (or something like that) required fields, and decided that this was ridiculous. A team was formed to streamline the processes associated with this record by reducing the number of fields. A detailed process audit was conducted. It turned out that every single field played a critical role in some downstream process. All the fields remained, and some old timers smiled knowingly.

As many humorless commenters pointed out on Eric Burke's blog, your company's app is your company's app for a reason, and often it's a good reason. These applications aren't complicated because of the technology. Internet applications are complex due to extrinsic forces - the sheer number of users and the quantity of data that can deluge the application at any given moment. Enterprise systems tend to be the opposite. Their complexity is intrinsic, arising from the diverse processes they support. The technological complexity (most of which is accidental) is secondary. Tim's suggestions provide a means of addressing technological complexity, and of building green-field, non-business-critical applications in the face of uncertain requirements. They don't provide a means for dealing with critical systems laden with stakeholders, politics, and legacy.

I think the solution, or at least part of the solution, to the problem with enterprise systems lies in removing much of the functionality from them. There are things that must be right all of the time, but most of business (and engineering, etc.) exists on a much fuzzier plane. The problem comes from coupling the precise (e.g. general ledger) with the imprecise (e.g. CM on an early stage design), and thus subjecting the imprecise to overly burdensome controls and restrictions. Only after this separation has been carefully implemented can functionality evolve in an agile fashion.


Monday, June 23, 2008

Google AppEngine

There's a lot of buzz out there about how Google AppEngine is a game changer. My question is: which game? AppEngine reduces the cost of hosting a web application to zero. Zero is a pretty profound number. You don't even need to provide a credit card number in case they might want to charge you some day. All you need is a mobile phone number capable of receiving text messages. This means that not only is there no up-front cost, but also that there is no risk of suddenly incurring huge fees for exceeding transfer quotas when you are lucky enough to be Slashdotted. Your application will just be temporarily unavailable. Hey, if that service level is good enough for Twitter, it's good enough for you, right?

The Open Source Game

Starting an open source project has been free for years. Many project hosting communities exist, providing projects with source code management, issue tracking, wikis, mailing lists, other features, and even a little visibility within their directories. Web hosting is not exactly expensive, especially for light-weight scripting languages like PHP, Perl, and Python; but it still costs something and therefore requires a sponsor. Often, for a fledgling project, the sponsor is also the founder, who also happens to be the lead programmer, help desk, and promoter. There are some who do an amazing job of juggling all of this, but for the most part I think project founders choose to wait.

Note: If I am wrong about the availability of good, free hosting for open source web applications, please provide links to them in the comments section. I would love to be proven wrong on this.

The Startup Game

If Paul Graham were to comment on AppEngine, he would probably say that it eliminates one of the last remaining excuses for not launching a web startup. If you have an idea, some time, and can hack Python - you can launch a web application on AppEngine. You don't need money, infrastructure, a long-term commitment, or any of those other things that may scare a person away from a startup. With AdSense, you can even monetize your application with hardly any effort.

Of course there are limitations. For one thing, AppEngine is very limited in what it can do. Pretty much any idea with even moderate bandwidth, storage, or computation requirements that cannot be delegated to another web application (e.g. embedding YouTube functionality) is out. So, unless you plan on building a business based on mashing together existing applications, then AppEngine probably is not your free lunch. That being said, I fully expect that Google will gradually add APIs for all of its applications to AppEngine, thereby providing a very powerful base for new-but-thin ideas.

The R&D Game

Just for a moment, let's say there was a company that required all of its employees to spend 20% of their time working on a project other than their primary job. This is a company betting that it cannot predict which idea is going to be the next big thing, so it must try lots and lots of things and see what sticks. As this company grows, providing infrastructure for those projects becomes a real challenge - especially infrastructure that allows the projects to be tested in the wild rather than simply internally, and that allows them to scale instantly if they happen to turn out to be successes. This company would also want to ensure its employees weren't slowed down by having to deal with muck. Such a company would need an infrastructure that could scale to support many, many applications without burdening infrastructure teams. Such a company would need AppEngine.

Now extend the idea further. Let's say it doesn't really matter whether the next big thing was developed by an employee or not. What matters is that the next big idea is bound to the company, for example by using the company standard authentication mechanism or, more importantly, the company standard monetization mechanism.

Ok, so we all know what company I'm talking about. AppEngine allows Google to outsource some of its R&D at very low cost, and given that most successful AppEngine developers will probably choose AdSense to monetize their creations, Google stands to profit whether or not they ever pay any fees. In cases where creators do pay hosting fees, the great Google gets paid twice.

The Recruitment Game

Distinguishing among potential recruits is very challenging. The accomplished academic is almost certainly smart, but may fall apart when asked to work with a significant team on something important to the company rather than on a research interest. The industry veteran may have great corporate experience, but political skills could be masking shallow or outdated technical skills. The bottom line is that recruiting is hard because in most cases you never see a direct sample of an individual's work. At best you can see what a team he was on produced and take an educated guess as to his contribution. Open source projects can provide more information, but for most programmers there is no real motivation to participate in them.

AppEngine provides more motivation to the programmer, because he can more readily show his creation to other people without incurring any cost, and there is always a chance that he will make some money. There are probably a lot of talented "almost founders" out there who would start a company but lack some of the personality traits necessary to do so, or who find themselves needing a steady income stream that simply isn't available to the founder of an early-stage startup. These individuals, and others, will make great pickings for Google.

Conclusion

Long term, in order for Google to grow, it has to attract more people to spend more time on sites displaying AdSense advertisements. Over the past few years Google has come out with countless new online services, most of which are still in beta, and none of which has yielded anything close to the monetization potential of their search business. AppEngine allows them to vicariously service the long tail without continuously expanding their already highly diverse R&D efforts. As a nice added bonus, it provides the opportunity to acquire future high-value startups before they are even real startups, by simply hiring the developers and maybe tossing some stock options their way. On the flip side, I don't think AppEngine is going to have much effect on mainstream web application hosting. The API is too constrained, and for a company with resources the ties to Google are most likely too tight. So I predict AppEngine will be a muted success for Google. The infrastructure built for it will be useful internally, it will help them get more sites addressing eclectic needs up more quickly without burdening employees, and it will provide a proving ground for future hires and possibly very early stage acquisitions. This could all add up to AppEngine being a very significant success, but it also means the success may be largely out of the public eye.


Wednesday, June 18, 2008

What does I/O bound really mean?

There's a lot of old folk wisdom out there that justifies the use of slow languages and runtimes on the basis that the impact on performance doesn't matter. Here's a sampling:

  • Most applications are I/O bound so the performance of the programming language doesn't matter

  • The database does all the heavy lifting, so performance of the application code doesn't matter

  • Computer time is cheap compared to programmer time

Like most folk wisdom, these all contain an element of truth. They often come up in discussions about the merits of statically-typed compiled languages versus dynamic languages, and of natively compiled languages versus virtual machine based languages versus interpreted languages. I could write about any one of them, and many more, but today I'll address I/O, because of the "work" I've been doing on the WideFinder 2 project.

WideFinder 2

The original WideFinder project was, at least on first inspection, quite clearly I/O bound (well, if you had an input file smaller than your main memory...). WideFinder 2 is a little better because it is a little more computationally complex, but not by much. The benchmark is really simple: process a big (42 GB) web server log file and report some basic statistics. In order to perform the benchmark, the program must parse the file, store information from each line in several maps, and finally extract some statistics from them.
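To make the shape of that work concrete, here's a rough sketch of the per-line processing involved. The regex and field layout are my invention for illustration, not the actual WF2 spec:

    import scala.collection.mutable

    // Hypothetical per-line work: pull the request URI and response size
    // out of each log line, then update a couple of aggregate maps.
    object Tally {
      val hits  = mutable.Map.empty[String, Int].withDefaultValue(0)
      val bytes = mutable.Map.empty[String, Long].withDefaultValue(0L)
      val Entry = """"GET (\S+)[^"]*" \d{3} (\d+)""".r

      def tally(line: String): Unit =
        for (m <- Entry.findFirstMatchIn(line)) {
          hits(m.group(1))  += 1
          bytes(m.group(1)) += m.group(2).toLong
        }
    }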

Tim benchmarked sequential block input on the test system at just shy of 150 MB/s. At that rate, the I/O alone required to process the 42 GB file should take about five minutes (42,000 MB ÷ 150 MB/s ≈ 280 seconds), meaning that if the benchmark truly is I/O bound then the program shouldn't take much more than five minutes to run. Tim's reference implementation of WF2 in Ruby takes far longer than that, so judging by it, WF2 quite clearly isn't I/O bound.

I/O Takes CPU, too

If you look closely at the Bonnie benchmark results, you'll notice that doing raw block I/O - meaning just reading files into a buffer - consumes almost 80% of a single core. That means the I/O alone comes pretty close to being bound by the CPU as well as by the disk. In fact, experimentation uncovered that placing the system under high load reduces maximum I/O throughput. In order to achieve maximum throughput, you actually have to keep one core free to manage the I/O.
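This is easy to observe for yourself. The sketch below (mine, not Bonnie's code) does nothing but sequential block reads; run it against a large file while watching top or vmstat and note how busy a core gets:

    import java.io.FileInputStream

    // Nothing but sequential block reads - no decoding, no parsing.
    // The read loop alone keeps a core surprisingly busy.
    def drain(path: String): Long = {
      val in  = new FileInputStream(path)
      val buf = new Array[Byte](1 << 20) // 1 MB blocks
      var total = 0L
      try {
        var n = in.read(buf)
        while (n >= 0) { total += n; n = in.read(buf) }
      } finally in.close()
      total
    }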

Application I/O is more than I/O

Another key factor is that what most application programmers think of as I/O involves a lot more than shuttling bits from disk into memory. For example, Java and other Unicode-based platforms have to decode the character encoding of the file into the native character encoding of the runtime. In the case of the JVM, this not only requires that every byte be processed, but also frequently doubles the memory consumption of each character and requires the data to be copied into a separate buffer. Furthermore, application code usually deals with files line-by-line in the form of some standard (often immutable) string object, requiring yet another pass over the data. So when your code is simply iterating over lines in a file, it isn't really just doing I/O. A fair amount of CPU- and memory-bound work is required to deliver those nice, neat Unicode strings.
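As a sketch of what hides under a typical "just read the lines" loop on the JVM (my example; the decoding and copying happen inside these standard library calls):

    import java.io.{BufferedReader, FileInputStream, InputStreamReader}

    // Every line delivered to f has already been through: a block read
    // into a byte buffer, charset decoding (bytes -> chars, often doubling
    // memory), a copy into a char buffer, and allocation of an immutable
    // String - none of which is "I/O" in the hardware sense.
    def forEachLine(path: String)(f: String => Unit): Unit = {
      val reader = new BufferedReader(
        new InputStreamReader(new FileInputStream(path), "UTF-8"))
      try {
        var line = reader.readLine()
        while (line != null) { f(line); line = reader.readLine() }
      } finally reader.close()
    }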

Large datasets change the rules

Memory management isn't free, and even with an industrial strength garbage collector it isn't always simple. A common practice with immutable objects is to have them share some internal state in order to avoid excess data copying. For example, when you take a substring of a Java string, the two string objects share an underlying character array. Often this saves memory, and it turns taking a substring from a linear time and space operation (with respect to string length) into a constant time and space operation. Unfortunately, if the substring is much longer lived than the original string, and especially if it is much smaller, you end up with something that feels an awful lot like a memory leak (whether it actually is a leak could be debated).
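The standard defensive idiom, on JVMs whose String.substring shares the backing array, is to force a compact copy of the slice you intend to keep. A minimal sketch, with hypothetical offsets:

    // Force a compact copy so the huge log line can be collected while only
    // the small slice we keep survives. On JVMs where substring shares the
    // parent's char[], skipping this copy retains the entire line in memory.
    def keepSlice(line: String, begin: Int, end: Int): String =
      new String(line.substring(begin, end))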

Even if you're not carrying around a lot of excess baggage in the form of substrings, processing gigabytes of data consumes a lot of memory. In the case of WF2, a little bit of pretty much every line needs to be stored for the calculations at the end. The vast majority of my "debugging" time for WF2 was spent figuring out what JVM parameters would actually allow the program to finish without slowing to a crawl (pre-tuning, there were points where it would spend 90% of its time in GC) or consuming every bit of memory available (at least a few WF2 solutions hog tons of memory).

All this means that when dealing with large files, other parts of the system need to do more work, which takes away time that could be spent on "real" processing or on I/O (which, as we've already seen, can be a CPU hog). Furthermore, I/O routines, and routines commonly used alongside heavy I/O (like regular expressions), must be very careful about their memory usage.

So is WF2 really I/O bound?

It depends. Go take a good hard look at the WF2 results page. If you look at Tim Bray's results, you would conclude that no, WF2 clearly is not I/O bound. However, if you look at the top of the list you'll see some very impressive numbers indicating that WF2 is indeed I/O bound (side note: I really need to take a hard look at OCaml). Of course, you could argue that even in the OCaml case it is really CPU bound, because making the task take about as long as the required I/O saturated 8 cores and 32 hardware threads. Scanning down the list would seem to indicate that in most cases I/O does not represent a significant bottleneck, but it's hard to really tell. The disk may not be the bottleneck, but the I/O routines within the libraries or runtime may be. Consequently, from the application programmer's perspective, WF2 is I/O bound.

Redefining “I/O bound” and future impacts

For a long time "I/O bound" primarily referred to hardware or possibly operating system limitations. That was a useful definition, but it's time for it to change. Most software being developed today has a very tall stack of abstractions sitting between it and the hardware. Operating systems schedule I/O and have to split limited resources among many competing processes. Virtual machines sit on top of operating systems, isolating the programmer from the underlying OS and hardware. Libraries and frameworks isolate the programmer from the virtual machine and even from other libraries and frameworks. From the programmer's perspective - or at least the perspective of a programmer working on top of good abstractions - I/O is "everything that happens between when my program requests some data and when it receives it." Consequently, libraries and runtimes should go to great lengths to keep "I/O bound" as close to its original meaning as possible. Prior to multicore systems becoming pervasive that was largely true, but today's I/O libraries fail to take advantage of multiple cores, and consequently force I/O into being a bottleneck when it should not be.

That's why in my WideFinder submissions I've worked to separate parallelized I/O into a library-like module. It quite clearly is not done, but it is relatively successful. I reused the parallel I/O code that I developed for WF1 on WF2 without changing any of the logic. It doesn't provide many features, and it could use a fair amount of optimization and simplification (the line joining logic is an atrocity), but it works.
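The shape of the module is roughly as follows. This is a simplified sketch of the approach, not the actual submission, and it omits the line joining entirely: one thread keeps the disk saturated with sequential block reads while a pool of workers processes the chunks in parallel.

    import java.io.FileInputStream
    import java.util.concurrent.{Executors, LinkedBlockingQueue, TimeUnit}

    // One reader thread, many processing threads. The bounded queue gives
    // backpressure so the reader cannot race ahead of the workers.
    object ParallelIO {
      def foreachChunk(path: String, chunkSize: Int, workers: Int)
                      (f: Array[Byte] => Unit): Unit = {
        val queue = new LinkedBlockingQueue[Array[Byte]](workers * 2)
        val pool  = Executors.newFixedThreadPool(workers)
        for (_ <- 1 to workers) pool.execute(new Runnable {
          def run(): Unit = {
            var chunk = queue.take()
            while (chunk.length > 0) { f(chunk); chunk = queue.take() }
            queue.put(chunk) // pass the empty end-of-input sentinel along
          }
        })
        val in = new FileInputStream(path)
        try {
          var n = 0
          while (n >= 0) {
            val buf = new Array[Byte](chunkSize)
            n = in.read(buf)
            if (n > 0) queue.put(java.util.Arrays.copyOf(buf, n))
          }
        } finally in.close()
        queue.put(new Array[Byte](0)) // sentinel: wake the workers, no more input
        pool.shutdown()
        pool.awaitTermination(Long.MaxValue, TimeUnit.NANOSECONDS)
      }
    }

A real version has to deal with the ugly part this sketch dodges: a chunk boundary almost never falls on a newline, so fragments from adjacent chunks must be stitched back together.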


Monday, March 03, 2008

Floating Points Strike CNN!

Looks like developers at cnn.com need a lesson in floating point numbers:

Either that or the Democratic primary is truly too close to call...
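For anyone who hasn't been bitten by this before, the lesson is cheap to reproduce (my example, not CNN's code). Most decimal fractions have no exact binary representation, so the error surfaces the moment you display a raw double:

    // 0.1 and 0.2 both have inexact binary representations as doubles,
    // so their sum visibly drifts when printed without explicit rounding.
    val share = 0.1 + 0.2
    println(share)                 // prints 0.30000000000000004, not 0.3
    printf("%.1f%%%n", 49.9 + 0.2) // rounding at the output: prints "50.1%"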


Monday, January 21, 2008

Programming Language Continuum

Introduction

Ever since I started getting into less-than-mainstream programming languages, I've pondered how to classify them according to their attributes, in the hope that doing so would yield insight into what an ideal programming language would be. There is the genealogical approach that traces roots and influences, but it doesn't break down attributes or make any qualitative judgements. Here I intend mostly to frame the context for debate, rather than draw significant conclusions. I believe bringing more structure to the discussion is necessary before much more can be gleaned from it.

So what I've done is attempt to break the essence of a programming language down into two dimensions:

  1. Enforced Structure / Dynamism
  2. Engineered Foundations / Mathematical Foundations

Here's a loose attempt at classifying some languages:

Enforced Structure vs Dynamism

Recent debates have focused heavily on the difference between statically typed and dynamically typed languages. I believe this is a specific case of the more general issue of how much structure a programming language should force on the programmer. Today we see Java versus Ruby, but it could just as easily be Ada versus Lisp. On one side of the debate you have people who believe that heavily structured languages are essential for scaling and for helping ensure correctness. One of my college professors once said (paraphrased): "the compiler is your friend, help it catch your errors for you." A well defined, compiler-checked structure can also help scale programming projects up to large teams by ensuring that all team members are working to the same program structure. Finally, this side points to the sophisticated tools that are available for many statically typed languages - particularly refactoring tools, but also code generators and static validators.

On the other side of the debate you have the dynamic language advocates. They claim that statically typed languages are too constraining and require more up-front design than is practical, especially given the vague and changing requirements typical of software development projects. Furthermore, robust code can be achieved through automated testing and by significantly reducing the amount of code required to deliver the required functionality. Finally, they point to the quicker feedback cycles that dynamic languages enable by shortening the change->compile->deploy->test loop to change->test. There was a time when this loop actually figured into software development cost and schedule estimates; today, outside of very large projects, it is simply a cognitive disruption.

Engineered vs Mathematical Foundations

Most of the mainstream programming languages have been "engineered" or "designed" to better enable some group of programmers to achieve some objective. For example, Ada was engineered to enable the development of very large, complicated, highly reliable systems by huge teams. SmallTalk was designed to enable the construction of modular software in a more natural or "human" manner. COBOL was designed to enable business analysts to program. The list goes on and on, but the common theme is that most of these languages were designed or engineered with very specific goals, and those goals were largely disconnected from strong mathematical foundations.

On the other side of the spectrum you see languages strongly influenced by computer science theory. Lisp started out as an executable implementation of the untyped lambda calculus and has stayed fairly true to that origin. Haskell combines the typed lambda calculus with category theory and focuses on functional purity. Prolog is based on logic, and SQL on relational theory. What these languages offer, especially strongly typed ones like Haskell, is that programs written in them can be much more easily reasoned about by theorem provers (and the like), and therefore they can provide a much higher degree of safety. To the initiated, they also provide much more natural and elegant abstractions.

Hybrid Languages

The idea of dynamic languages with optional type annotations is often raised as a potential means to bridge the divide between the advocates of static and dynamic typing. Common Lisp provides optional compile-time type checking, but in practice it seems to be used mostly as an optimization in special cases rather than as a means of ensuring correctness. Some touted StrongTalk as an answer, especially when Sun released it as open source, but it seems to have sputtered out. There was much talk about adding optional type annotations to Python, but I believe (please correct me if I am wrong) they are still absent from Python 3000. So while optional static typing is a favorite debate topic, it's not heavily used in the languages that have it, and attempts to add it to languages that don't have yet to reach fruition, despite considerable effort.

A recent phenomenon, or perhaps one simply recently receiving attention, is hybrid languages that attempt to blend concepts from strongly-typed functional languages such as Haskell with more mainstream object-oriented underpinnings. In my opinion, Scala is probably the best of these, as it offers innovative OO features in addition to functional ones, but others such as F# and OCaml certainly deserve mention, as do no doubt countless others that I am unfortunately going to neglect.
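To make the blend concrete, here's a tiny sketch of my own (the names are invented): a perfectly ordinary class participating directly in higher-order functional code.

    // OO side: a plain class with encapsulated state.
    class Account(val id: String, val balance: BigDecimal)

    // FP side: first-class functions, currying, and collection combinators
    // operating on those same objects.
    def totalOver(limit: BigDecimal)(accounts: List[Account]): BigDecimal =
      accounts.filter(_.balance > limit).map(_.balance).sum

The curried parameter list means totalOver(BigDecimal(10000)) can be partially applied to yield a reusable function; the OO and functional halves compose rather than fight.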

Conclusion

I think hybrid static/dynamic languages will never really become popular - or, more accurately, that the capability will not be extensively used in the languages that offer it. There are a couple of reasons for this. Primarily, I think that making static typing optional almost completely eliminates its benefits. Second, I think the primary advantage of dynamic languages is that they allow partially formed thoughts to be quickly expressed in "working" (meaning running, not correct) code, and static typing is a major barrier to that. I personally find doing exploratory programming in dynamic languages much more pleasant and productive, but once my ideas are concrete I prefer the compile-time checks and completeness of static typing.

I personally believe that languages that blend the engineered structure of OO with mathematical formalism represent the future of programming. Scala and F# are here, and both Java and C# are gradually acquiring features from strongly typed functional languages. What's going to be interesting is how it shakes out. If you peruse the Scala mailing list archives, you will notice that there is a marked tension between those from an object-oriented (engineered) perspective who enjoy the extra power that functional programming provides them, versus those from a more pure functional programming background (mathematical). Ultimately, at least from a popularity perspective, I think the more OO approach will win, as historically languages engineered to provide specific advantages have won out over languages with robust mathematical underpinnings.


Monday, September 17, 2007

The Tar Pit

The editor of TSS has decided to run a series discussing The Mythical Man Month, by Frederick Brooks. Hopefully it will produce some good discussion. There are a lot of Agile advocates who hang out on TSS who could really stand to (re)learn some of the lessons of software development. I make a point of rereading it every few years lest I forget the lessons learned by computing's pioneers. The first chapter - The Tar Pit - contains one of my favorite concepts, as illustrated by the graphic below (slightly changed from original): a two-by-two grid with a program in one corner, a programming product (three times the effort) along one axis, a programming system (three times the effort) along the other, and a programming system product (nine times the effort) in the far corner.

Programs

Let me explain. A program is what we all have developed. It's a simple piece of software that is useful to the programmer and/or to some set of users who are directly involved in defining its requirements. Most bespoke departmental applications and a substantial portion of enterprise applications fall into this category. They are sufficiently tested and documented to be useful within their originating context, but once that context is left their usefulness breaks down quickly. In addition, they are not solidly designed to be extensible, and certainly not to be used as components in a larger system. Obviously this is a range, and I've really described a fairly well developed program - one almost bordering on a programming product. That script you wrote yesterday to scan the system log for interesting events - the one that has to be run from your home directory using your user account in order to work - is also just a program.

Programming Products

Moving up the graph, we hit the programming product. In theory, all commercial applications and mature bespoke applications are programming products. In practice this isn't really the case - but we'll pretend, because they are supposed to be, and because I've raised the standard over what Brooks originally described. The big challenge with programming products is that, according to Brooks, they cost three times as much to develop as simple, fully-debugged programs, yet they contain the same amount of functionality. This is why it's so hard to get sufficient budget and schedule to do a project right. The difference between a solid piece of software and something just cobbled together is very subtle (you can't tell in a demo), yet the cost difference is quite astounding. Consequently, I think most commercial applications are released well before they hit this stage, and bespoke ones require years to mature or a highly disciplined development process to reach this point.

Programming Systems

Programming systems are programs intended to be reused as parts of larger systems. In modern terms, they are the libraries, frameworks, middleware, and other such components that are all the rage in software development. Like programming products, programming systems are thoroughly tested, documented, and - most importantly - useful outside of the context in which they were created. And, like programming products, according to Brooks they take three times as long to develop as a regular program. Developing programming systems for bespoke applications or niche use can be a tar pit all its own. For one, many programmers like building libraries and frameworks. The problems are more technically challenging, and there is no strange-minded user to consider: the programmer and his colleagues are the users. Programming systems are relatively common in groups that execute a lot of similar projects and/or that contain programmers who really want to build components.
Programming System Products

Amazingly, programming system products are relatively common - even if there really aren't that many of them. As you've probably guessed, a programming system product has all the traits of both a programming product and a programming system. It is useful to a wide range of users and can be effectively extended and/or embedded in larger systems. It has complete documentation and is extensively tested. Where are these wondrous things? Well, you are using one right now (unless you printed this): your operating system. It provides both useful user-level functions and a huge amount of infrastructure for creating other programs. MS Office is one as well, because it has a pretty extensive API. Most commercially developed enterprise systems should be programming system products, because:

  1. They provide off-the-shelf functionality for regular users
  2. Customers always customize them
  3. They often must be integrated with other products
  4. Even integrating their own components would go better on top of a programming system
The problem is that they are not, because of:

The Tar Pit

Brooks didn't explicitly write this definition of The Tar Pit, but I think he would agree. Put yourself in the position of a development manager at a startup, or in a larger company about to launch a new product. On one hand, you want to make the product as good as possible. You know that what you develop today will serve as the base for the company/product line for years to come. It needs to be useful. It needs to be extensible. It needs to be thoroughly tested and documented... and it needs to be cheap and delivered yesterday. The differences between a programming system product and a simple programming product are far more subtle than the differences between a program and a programming product. But the programming system product costs a full NINE TIMES as much to develop as the program with essentially the same "outward functionality" - at least if you are a sales guy or a potential customer sitting in a demo. I think this is the struggle of all engineering teams. If the product is long lived, doing it right will pay major dividends down the line. But it can't be long lived if it is never released, and it stands a worse chance if it comes out after the market is flooded with similar products (actually, that's debatable...). The ultimate result is a mish-mash of tightly coupled components that, as individuals, fall into one of the lesser categories, but as a whole fall down. There is a documented API, but the documentation isn't really that good and the application code bypasses it all the time. The user documentation is out-of-date. Oh, and the application isn't really that general - hence why all the major customers need the API, so they can actually make it work.

Escaping the Tar Pit

Ok, so if you develop a large system you are destined to fall into the tar pit, because cheap-and-now (well, over budget and past schedule) will override right-for-the-next-decade. You need a programming system product, but budget and schedule will never support much more than a program. So how do you escape?

Accept It

Products that give the outward impression of being far more than they are often sorely lack conceptual integrity. If you are building an application - build it right for the users. Remember, you can build it three times for the cost of building a programming system product. Maybe by the third time there will be budget and schedule for it. Or maybe, just maybe, you can evolve your current system. But pretending will just make a mess while wasting significant amounts of time and money.

Partition It

Some pieces of your system are probably more important than others. There are places where requirements will be volatile or highly diverse among customers - those are the places where you need a truly extensible system. You should also be able to reuse the strong abstractions that run through your system. The code associated with those abstractions should be top-notch and well documented. Other pieces just need to be great for the users, while the few that remain need only be convenient for you or your administrators to extend.

Open Source It

Find yourself developing yet another web framework because what exists just isn't right? Open source it. This isn't really my idea. It's what David Pollak of CircleShare is doing with the lift web framework for Scala (actually, I'm guessing at David's reasoning; I could be wrong). The infrastructure for your application is essential, but it isn't your application. It is what Jeff Bezos refers to as muck.
You have to get it right, but it's distracting you from your core mission. So why not recruit others with similar needs to help you for free? That way you don't have to completely give up control, but you also don't have to do it alone. Theoretically, the same could be done for applications. Many large customers of software companies have significant software development expertise - sometimes more than the software companies themselves. I think it would be entirely feasible for a consortium of such companies to develop applications that would serve them better than the commercial alternatives. But I have yet to convince anyone of that...
