Saturday, November 3, 2007

The Spirit of Saint Louis

Charles Lindbergh doesn’t usually come to mind when one thinks of software architecture, but there’s an interesting anecdote about him that I recall when thinking about distributed programs.

To make a transatlantic crossing, most aviators in the 1920s embraced multi-engine planes. The prevailing wisdom was that more engines meant more safety. They certainly brought more power and, necessarily, more weight. However, one by one, they all failed.

In contrast, the Spirit of Saint Louis was a single-engine plane. Lindbergh correctly reasoned that multi-engine planes are only safer if you have somewhere to land. In a transatlantic flight, more engines actually increase the odds of failure. This is easy to see if you concede that any transatlantic plane needs all its engines to have enough power to complete the crossing.

If the probability of an engine failing in a given time interval is p, and engines fail independently of one another, then the chance of a plane with n engines suffering no failures is (1-p)^n. Given that 0 < p < 1, a little algebra verifies that

(1-p) > (1-p)^n

for any integer n > 1. In other words, the single-engine plane is more likely to finish.
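To put numbers on the inequality, here is a minimal sketch in Python. The function name survival_probability and the value p = 0.1 are made up for illustration, and the calculation assumes that engines fail independently and that the plane needs every engine to finish.

```python
def survival_probability(p: float, n: int) -> float:
    """Chance that none of n independent engines fails.

    p is the (assumed) probability that a single engine fails
    during the interval of interest.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return (1.0 - p) ** n


# Illustrative value only; not a historical failure rate.
p = 0.1
for n in (1, 2, 3):
    print(n, round(survival_probability(p, n), 3))
```

With p = 0.1 the odds of finishing fall from 0.9 with one engine to 0.81 with two and 0.729 with three.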

A few jobs back (it was one of the failed startups I mentioned in my first post), I worked on a piece of code with a small team of about five developers. There was plenty of horsepower in the hardware to support the tasks that the software had to do.

The technical lead divided up the system into a half dozen or so separate Java programs, which communicated via RMI. The motivation was not really to parcel out the work among the developers. Rather, the intent was to improve fault resiliency.

Unfortunately, this didn’t work.

It turns out that the tech lead’s design required all the programs to be running for the system to work. So in this case, trying to achieve fault resiliency by introducing distributed programs didn’t help at all. A monolithic app would have been much simpler and no less reliable.
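The same arithmetic applies to programs. The sketch below is only an illustration: the 0.99 per-program availability is invented, the function names are hypothetical, and failures are assumed to be independent. It contrasts a system that needs every program running with one that survives as long as any replica is running.

```python
from math import prod
from typing import Sequence


def series_availability(availabilities: Sequence[float]) -> float:
    """System works only if every component is up (all engines needed)."""
    return prod(availabilities)


def redundant_availability(availabilities: Sequence[float]) -> float:
    """System works if at least one component is up (true redundancy)."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Hypothetical figure: each program is up 99% of the time.
a = 0.99
print("six programs, all required:", round(series_availability([a] * 6), 4))
print("two replicas, either suffices:", round(redundant_availability([a, a]), 4))
```

Under these assumptions, six programs that must all be running are available about 94.1% of the time, which is worse than any one of them alone. Only the redundant arrangement, at 99.99%, improves on a single program.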

More recently, I was asked to improve the availability of another software system. Management’s direction was to create an identical instance on different hardware, and fail over to it when problems arose. Sometimes, having a warm standby is a good way to attack this problem. But in this case, replicating the state from one machine to the other was unusually difficult with the given design. This was an architectural flaw that was too expensive to fix.

It turned out that simply rebooting the existing system would take less time than failing over. Had the requirement been cast in terms of MTBF instead of mandating one approach, there would have been more latitude to consider a Lindbergh-esque solution. Unfortunately, it’s hard to write customer-oriented requirements because it’s so easy for implementation choices to masquerade as requirements.
