Friday, November 23, 2007

NMI Mine

There's an old story about a computer science professor who tried to give his son a middle name of the empty string. Whether true or not, such efforts would surely fail. How could one fill out the birth certificate? The social security forms? There really is no way to indicate that a middle name exists but has no letters in it.

I used to work at a big company, complete with photo ID badges. Full names were printed on them. If someone lacked a middle name, their badge displayed "NMI" for "No Middle Initial." Naturally, I wondered what would happen if we ever hired someone with a middle name of "NMI." Of course, this never happened. But if I were responsible for the ID badge software, you can bet that a person with such a name would be a test case.

Programming 101 teaches us to keep data and control separate. But in reality, this lesson is violated all the time. Consider Unix system programming. If you want to open a file descriptor, the open() function returns it, or it returns an error code on failure. Since all the error codes are negative and all file descriptors are positive, this doesn't seem to get us into too much trouble.
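Java's standard library makes the same bargain. Here's a minimal illustration of data standing in for control:

```java
public class SentinelDemo {
    public static void main(String[] args) {
        // indexOf returns an index (data) on success, but -1 (control:
        // "not found") on failure. The two share one channel, and it
        // stays safe only because -1 can never be a valid index.
        int hit = "abcdef".indexOf('c');   // 2
        int miss = "abcdef".indexOf('z');  // -1
        System.out.println(hit + " " + miss);
    }
}
```

As with open(), the scheme works because the "control" values can never collide with legitimate data.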

But before you fancy yourself as wise as the designers of Unix, remember that it's worth keeping data and control separate whenever possible. I'm familiar with a software system that schedules tasks on distributed embedded hardware. For reporting, the system writes a CSV file that can be imported into Excel for friendly display.

To schedule a task, a SOAP message is sent into the system. This is interpreted, log messages are written, and an appropriate embedded device is chosen. Then the system sends a message of its own to the device, passing along the scheduling information.

Scheduled tasks can be edited. But if the start time has already arrived, then the start time can't be changed. Only other details can be changed. In such cases, the update message contains a "zero" as the start time, since null start times were not allowed by the XML schema.

This use of a magic value, where data masquerades as control, seemed harmless enough at the time. I let it slip by. There were bigger battles to wage, and I felt that I had been saying "no" too frequently anyway. Certainly the designers on the team felt that I had. Unfortunately, this was a mistake.

You see, the embedded systems ran Linux. So their "zero" time was the 1970 epoch. However, the outside world was using 1900 as zero time, because that's what Excel uses. I'm embarrassed to say that it took us many days to figure this out.
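The size of the resulting error is easy to estimate. A back-of-the-envelope sketch, using the well-known conversion constant 25569 (the Excel serial day that corresponds to 1970-01-01):

```java
public class EpochGap {
    public static void main(String[] args) {
        long millisPerDay = 24L * 60 * 60 * 1000;
        long unixZero = 0L;                      // 1970-01-01T00:00:00Z
        // Excel counts days from roughly the end of 1899; the Unix epoch
        // lands on Excel serial day 25569.
        long excelZero = -25569L * millisPerDay;
        long gapYears = (unixZero - excelZero) / (millisPerDay * 365);
        System.out.println("\"zero\" differs by about " + gapYears + " years");
    }
}
```

A seventy-year discrepancy is not subtle once you know to look for it; the trouble is that a magic "zero" gives you no reason to look.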

I wish I could blame the delay in fixing this bug on the fact that we have a geographically distributed team. I wish that human language barriers were a suitable excuse. I wish that I could offer something to deflect the cause away from my own bad judgment.

But I cannot. The real root cause of this bug was that I allowed an architectural flaw to creep into the system. Thou shalt not conflate data and control.

Wednesday, November 14, 2007

Aspect Variations

In a prior post, I suggested an analogy between different ways of formulating Physics and the contrast between thinking about objects and aspects. If OO is like vectors and matrices, then aspects are like variational principles. There's a cute little follow-up that I'd like to explore presently.

I'm very fond of Log4J and diagnostic contexts. In short, diagnostic contexts conveniently allow the code to store arbitrary strings in thread local storage. So suppose I have a jar that offers functionality to persist to a database or a file. The application level code can set a diagnostic context, and then the log messages written by the jar's code will contain that useful information.

In pseudo-code, we might do something like this...

import org.apache.log4j.NDC;
import mypersistpackage.Persister;
//...
NDC.push("current cseq=" + cseq);
try {
    Persister.save(foo);
} finally {
    NDC.pop(); // keep the context from leaking into unrelated work
}


Then any Log4J messages written by the Persister.save method will contain the specified cseq number. This is incredibly useful, and we didn't have to modify the Persister source code at all.

However, a subtle problem arises when we're using thread pools. Whenever the application code makes an RPC, there's a chance that the thread will get returned to the pool during the call. When the remote procedure returns, we might pick up where we left off on a different thread, which means that any context held in thread local storage will be stale.

In our example, we'd risk printing out the wrong cseq number after making a remote procedure call. The risks of this would increase when the system is under load, which is exactly when log messages are most important.

In a prior post, it was suggested that an aspect could advise all RPCs, so that timing information could be gathered. This is an example of a global principle that can be applied across the entire application. We can press that aspect into service to make sure our diagnostic contexts are not mangled when coming back from a remote call.
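The restore step can be sketched in plain Java. Here a bare ThreadLocal stands in for Log4J's NDC, and withContext is the advice we'd want woven around every task handed to the pool (all names are illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ContextCarrier {
    // A plain ThreadLocal standing in for Log4J's NDC.
    static final ThreadLocal<String> CONTEXT = new ThreadLocal<String>();

    // Snapshot the submitter's context, install it on whichever pooled
    // thread runs the task, and restore that thread's prior value after.
    static Runnable withContext(Runnable task) {
        final String captured = CONTEXT.get();
        return () -> {
            String prior = CONTEXT.get();
            CONTEXT.set(captured);
            try {
                task.run();
            } finally {
                CONTEXT.set(prior);
            }
        };
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        CONTEXT.set("cseq=42");
        // Without the wrapper, the pooled thread would log whatever stale
        // context its previous task left behind.
        pool.submit(withContext(() ->
            System.out.println("pooled thread sees: " + CONTEXT.get()))).get();
        pool.shutdown();
    }
}
```

The wrapping itself is exactly the sort of bookkeeping an aspect can apply everywhere, so no individual call site has to remember it.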

Even though we could do this with our OO helper class described in the prior post, this solution just plain didn't occur to us until we started thinking in aspects. (In fact, we abandoned using diagnostic contexts altogether.) And sure, I could solve the brachistochrone problem numerically using vectors, but I'm sure that a computed solution wouldn't give me the same insight I'd get by using the calculus of variations.

Tuesday, November 13, 2007

OO is to Vector as Aspect is to ...

When beginning the study of Physics, one typically learns about forces, just as Newton framed the subject. Forces are vectors, so you have to learn some trigonometry and linear algebra. Trajectories have a locally computable flavor about them, in that they are determined by summing up the forces imparting accelerations to particles. For example, the deflection of a ray of light through a prism is given by Snell's Law, which concerns itself only with the place where the light ray hits the glass, not the original source or final destination of the beam.

And just when you are comfortable with forces being central to everything, the rug gets pulled out from underneath you.

Physicists after Newton reformulated Physics with energy and action as central players. This requires a bit more mathematical sophistication. So as you get farther in the subject, you have to learn variational calculus. Trajectories are now the solutions to boundary value problems, and have a more global nature to them. For example, the deflection of a ray of light through a prism is governed by Fermat's Principle of Least Time. Out of a family of admissible trajectories, the one nature chooses is determined by a variational principle.

That's a very different way of thinking about natural phenomena. I wonder whether there's an analogy here to object oriented and aspect oriented programming.

Becoming comfortable with object oriented programming is like learning about vectors and matrices. Objects interact with each other by sending messages that change each other's state. This reminds me of forces imparting accelerations to particles. Direct method calls are like frictional forces, and JMS calls to mind magnetism or electrostatics.

The shift in thinking that's required to embrace aspect oriented programming is like learning about the principle of least action. For want of a better phrase, cross cutting concerns have a more, well, globally principled feel to them.

Consider a problem drawn from my real experiences with a system that makes a number of remote procedure calls. Sometimes, things get bogged down, and it's important to know where the time is getting consumed. The original OO approach was to design a little helper class that callers could use to keep track of how long each RPC took.

MyHelper helper = new MyHelper("rpcMethodA");
rpcMethodA();
helper.done();

The constructor took note of the current time and method name, and then the done method wrote an informative log message recording how long the method took.

But, rather than polluting the biz logic with all this bookkeeping, this problem calls out for an aspect oriented solution. A better approach would be to make an aspect that did the timekeeping and logging, and then advise whatever methods you wanted. In my imagination, this feels like imposing a variational principle on the software.
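Short of a full aspect weaver, a dynamic proxy can play the role of that advice in plain Java. The RemoteService interface below is hypothetical; the point is that call sites stay free of bookkeeping:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

public class TimingAdvice {
    // Hypothetical stand-in for the system's remote interface.
    public interface RemoteService {
        String rpcMethodA();
    }

    // Wrap any interface so that every call through the returned reference
    // is timed and logged; the cross-cutting concern lives here, not in
    // the business logic.
    @SuppressWarnings("unchecked")
    public static <T> T timed(Class<T> iface, final T target) {
        InvocationHandler handler = (proxy, method, args) -> {
            long start = System.nanoTime();
            try {
                return method.invoke(target, args);
            } finally {
                System.out.println(method.getName() + " took "
                        + (System.nanoTime() - start) + " ns");
            }
        };
        return (T) Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] { iface }, handler);
    }

    public static void main(String[] args) {
        RemoteService raw = () -> "pong";
        RemoteService svc = timed(RemoteService.class, raw);
        svc.rpcMethodA();   // timed and logged; no helper at the call site
    }
}
```

A real aspect goes one step further: it advises the methods themselves, so you don't even need to route calls through a wrapped reference.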

Monday, November 12, 2007

Four A's and a Zed

A couple of years ago, in an attempt to combat myopia, I tried to collect some thoughts on a software system I had a hand in developing. I came up with a few principles that applied to that code, but there might be something more general in them that’s worth capturing. So with apologies to Hugh Grant, here are Four A’s and a Zed.

Availability is Scalability


An important lesson learned is to never bolt on fault tolerance or fault resiliency at the end. One has to design it in from the beginning. The twin of this idea is that scalability isn’t accidental either. By scalability, I mean the capability of the architecture to improve some figure of merit (throughput, for example) by throwing more hardware at it.

All right, those points are obvious. But what wasn’t obvious (and what might not even be true generally) is the idea that availability and scalability are the same thing.

Architecture By Contract


ABC is a term I made up to denote an amalgam of Design By Contract (DBC) and Interface Oriented Programming (IOP).

Adaptors for Protocols, Plug-ins for Logic


Our product had to integrate into a number of environments. One mistake we made was to mix business logic into the same code that was handling the communication. A better approach would have been to create adaptors that only handled the protocols, and contained no biz logic. Instead, separate plug-ins would contain customizable logic that could be varied independently.
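A minimal sketch of the shape we should have built (all names hypothetical): the adaptor owns the wire format, the plug-in owns the decision, and either can change without touching the other.

```java
public class AdaptorPluginSketch {
    // The plug-in sees domain terms only -- no protocol details.
    public interface OrderPolicy {
        boolean approve(String customer, int amount);
    }

    // The adaptor sees protocol details only -- no business rules.
    public static class CsvAdaptor {
        private final OrderPolicy policy;

        public CsvAdaptor(OrderPolicy policy) {
            this.policy = policy;
        }

        // Parses a toy "customer,amount" message, then delegates.
        public String handle(String message) {
            String[] fields = message.split(",");
            boolean ok = policy.approve(fields[0], Integer.parseInt(fields[1]));
            return ok ? "ACCEPT" : "REJECT";
        }
    }

    public static void main(String[] args) {
        CsvAdaptor adaptor = new CsvAdaptor((customer, amount) -> amount <= 500);
        System.out.println(adaptor.handle("acme,100"));   // ACCEPT
        System.out.println(adaptor.handle("acme,9000"));  // REJECT
    }
}
```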

Asynchronicity Considered Harmful


Event-driven systems can allow design decisions to be deferred. Usually, it's better not to defer such decisions, but make the hard choices up front.

Zero Bugs in Zero Slocs


Code that you don't have to write can't have bugs in it. Our prowess at the keyboard should not be measured by how many lines of code we write, but how few.

Friday, November 9, 2007

Translating Orcs

I've been reading Seamus Heaney's translation of Beowulf. It's fantastic. In both senses of the word. I've always liked this kind of stuff, and it's only a coincidence that the movie has just come out. I mean, I would have been reading it anyway.

Something early on in the verses has caught my eye. Mr. Heaney translated "orcs" (actually orcneas) as "evil phantoms." I wonder if he was tempted to translate it as simply "orcs." I bet that a lot of Beowulf readers wouldn't be thrown by the word orcs. And those who were would probably be curious enough to look it up.

The Pragmatic Programmers give a presentation about the Dreyfus Model of skill acquisition. Workers (whether in nursing, cooking, or programming) can be divided into five categories ranging from beginner to expert. The less skilled require detailed rules. Bake in the oven at 450 for 30 minutes, then remove the pan using insulated potholders. The more skilled don't require such rules, and embrace intuition. Whip up some fritters.

It's suggested that communication across too many levels is difficult. A novice cook would have trouble following the latter directive. An expert would chafe at having to issue the former one. One of the catch phrases when considering the Dreyfus Model is "legalize intuition." In other words, good organizations tend to defer to experts' intuition.

I have to quibble with that, though.

In our profession, one of the phrases we hear too much is: "I have n years of experience, so you have to just trust me." Well, first of all, there's a difference between having twenty years of experience and having one year of experience twenty times in a row. Most of us overestimate our expertise.

But even when the person making that argument really is an expert, I feel that it's still a cop out not to articulate the logic behind one's point of view. If you really are such an expert, you should be able to convey why the solution you advocate is best. I just don't buy that the expert can't make himself understood to the novice.

Of course, that doesn't mean that the novice will believe him. Or embrace the direction given. But that's different from being unable to communicate.

I've been blessed with a number of very talented professors over the years. A great many were brilliant. None of them met the stereotype of solipsist genius that couldn't teach worth a darn. In fact, the most gifted were exactly the ones who communicated best.

Nobel laureate Richard Feynman remarked that if a Physics topic could not be explained to freshmen, then physicists really didn't understand the topic.

So, to my fellow architects out there, the next time you are charged with putting a little extra effort into defending your point of view, resist the temptation to take it as a challenge to your role in the group. Instead, welcome the opportunity to reify your intuition into a coherent explanation. And trust your audience to be bright enough, or at least curious enough, to know what orcs are.

Thursday, November 8, 2007

Unscientific Methods

Every once in a while, you need to <rant>

Some years back, I was a novice programmer on an important software effort. IIRC, it was something like two or three dozen coders for six calendar months. There were six milestones, one scheduled at the end of each month. For some ironclad non-technical reasons whose details don't matter here, the project's final deadline absolutely positively could not budge.

Well, you know how software development goes, and we completed our first milestone after two months elapsed. The project manager called a meeting. "Don't worry," he reassured us. "We're only a month behind schedule." I sure didn't see it that way, so I had a discreet conversation with my technical lead.

"Our estimate for how long it would take to complete the first milestone was off by 100%," I said. "If our other estimates are similarly off, we're not a month behind schedule. We're six months behind!" My tech lead endured my naivete. He reminded me that we had a talented and hardworking group. The first milestone was just a fluke.

"Well, sure," I pressed on. "But, think like a scientist. We've performed one experiment. It tells us that our estimating process is off by a factor of two. I really think we need more people." It was a big company. So throwing more bodies at the problem didn't seem unreasonable to me. But it was not to be.
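The arithmetic I had in mind is short. A sketch, with the numbers from the story:

```java
public class ScheduleMath {
    public static void main(String[] args) {
        int plannedMonths = 6;              // six milestones, one per month
        int firstMilestoneActual = 2;       // months the first one really took
        int scale = firstMilestoneActual;   // estimates are off by this factor
        int revisedTotal = plannedMonths * scale;       // 12 months of work
        int monthsBehind = revisedTotal - plannedMonths;
        System.out.println(monthsBehind + " months behind, not 1");
    }
}
```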

We worked hard, put in lots of overtime, and completed our second milestone after two more months elapsed. The project manager was replaced. The new manager gave us the Gipper speech, which pretty much alienated everybody who had been slaving away for the past four months. I now had two experiments supporting my view of the schedule.

As I saw it, the team was now effectively tasked with completing eight months of work in two months. I voiced this opinion, a little less discreetly this time. But on paper, management's unrevised schedule looked like four months of work in two months. Somehow that was more palatable. Seemingly the only one thinking like an experimental scientist, I just couldn't make myself believed.

Now, I know what you're thinking. You're thinking that they really believed me, but they were wearing their special emerald colored glasses that get bolted on when they enter the Oz of management. The deadline couldn't move, so they had to keep up the front that we can get the job done in time. After all, everyone died at Khitomer, because the alternative would be unthinkable.

I might buy that, if it weren't for what happened next.

To make up for lost ground, all the programmers were divided into three shifts, and we were each scheduled to come in for our assigned hours. Being young and single, it was little hardship for me to take 3rd shift. We were all working so many hours, there was quite a bit of overlap anyway.

That approach might have made sense if we all didn't have workstations of our own on our desks. It might have made sense if programming wasn't such a human activity, which thrived on good interpersonal communication among team members. It might have made sense if we were geographically distributed across time zones, and worked non-standard hours to maximize overlap. But none of those things were true.

Dividing us into shifts was like a shiny pocket watch that a hypnotist takes out of his pocket to distract someone, in this case, upper management. I'm pretty sure I expressed this simile to my tech lead, but we were both so bleary-eyed at the time I can't really be sure. If the decision makers could be duped by the illogic of such a bizarre work tactic, they probably didn't really understand why someone who thought like an experimental scientist would call for revising the milestone effort estimates.

At least, that's my hypothesis.

</rant>

Wednesday, November 7, 2007

Passivating Thoughts

Before a friend of mine became a brilliant linguist, he was a brilliant Linguistics student. I recall the two of us mulling over some thoughts about active and passive voice then. In English, the subject of an active voice sentence is the actor doing the action. Geoff studies languages. In passive voice, the subject and direct object get swapped. Languages are studied by Geoff. It turns out that most active voice sentences can be rewritten in passive voice and vice versa.

However, there is an interesting class of active voice sentences that cannot be “passivated,” if I may coin that term. He sank the boat to become a hero. If we try to flip that around, it no longer makes any real sense. The boat was sunk to become a hero. It’s no longer correct. Surely the sinker and not the boat itself is the hero.

There are similarly structured sentences that do admit passivation. He sank the boat to collect the insurance. This is nearly identical to the previous example. The boat was sunk to collect the insurance. That works! It’s clear that the boat is not the collector of the insurance. A million other examples flow off the tongue. He drank the Jolt to postpone sleep. We were in college after all. The Jolt was drunk to postpone sleep.

So, what’s so special about the hero example? Why does the passivation transformation sometimes fail?

After kicking this around some, one of us noticed a difference in the sentences. The verb “collect” can be used in passive voice. Insurance was collected. But “become” is special. Without taking poetic license with the language, one cannot passivate become. Hero was become, is not correct. We then conjectured that sentences of the form above could be passivated if and only if their infinitival clause had a valid passive voice form.

Armed with a conjecture, we thought we’d run a few more experiments and see how it bears up. There’s no way to say “was remain”, so we predicted that the following sentence could not be passivated. He shredded the contract to remain a free agent. And sure enough: The contract was shredded to remain a free agent. This sentence would imply that the contract itself was a free agent, but the active voice form does not. The attempt to passivate the verb “to shred” in this example fails because “to remain” admits no passive voice form.

Other experiments also shore up the conjecture. To dance all night, she chose comfortable shoes. That has a clear meaning. She, and not the shoes, is doing the dancing. But attempting to passivate the sentence fails. Comfortable shoes were chosen to dance all night. Even if that might be a grammatically correct sentence, the meaning is warped. The shoes were not selected to dance (among other dancing candidates). It’s the dancer that dances. Our conjecture correctly predicts this because the construction “were danced” doesn’t make sense. (Although I could dance a jig, which is transitive, and a jig could “be danced,” the flavor of dance used above is intransitive, admitting no direct object, and no passive voice form.)

We both celebrated with a Jolt, but here is where the differences in the way Physics students and Linguistics students look at the world came into play. “We’re done,” I exclaimed. “We looked at the data, formed a conjecture, and tested it with more data. Write it up!” That our conjecture was interesting and useful was enough for me.

My friend said something like, “No, we’re not done at all. Now we have to figure out why English obeys the conjecture. What forces could have driven the evolution of the language (or our minds, really) to behave this way? We can’t just offer the conjecture without justifying it.”

This notion floored me. It would never occur to me to ask why F = GMm/r^2. That’s just a useful law that Newton discovered. That it works is enough. Hypotheses non fingo. So, it seemed incredibly ambitious and speculative to try to explain why the linguistics conjecture worked. I was of no further use.

In summary, most English active voice sentences can be passivated without changing their meaning or rendering the new sentence ungrammatical. Putting on our Physics hat, we might say that they are invariant under the passivation transformation. However, there are a few sentences that are not, namely the ones with infinitival clauses that admit no passive form. This represents an interesting broken symmetry.

When refactoring software, one improves the internal structure without breaking the desired external behavior. In the literature, I perceive a couple of approaches to this. Both are compatible, but have different emphases. Martin Fowler and others emphasize the importance of comprehensive unit tests that pass before and after modifications are made. Bill Opdyke and others champion the idea that source code can be transformed in specific, discrete ways that leave behavior unchanged.

I feel that both approaches are important. The first approach admits the possibility that the unit test suite is not complete enough. A client somewhere might rely on some behavior that’s not checked by a unit test. So the refactoring attempt could fail. The second approach should always work, but it limits the refactoring repertoire to those actions that your tool can do. (Unless you are incredibly meticulous and can confidently edit the code by hand yourself.)
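A concrete failure mode for the first approach, in hypothetical Java: a client that looks a method up by name. No unit test on Greeter itself will catch a rename, because the dependency rides through a string:

```java
import java.lang.reflect.Method;

public class RenameHazard {
    // Hypothetical class; imagine a tool-assisted rename of greet().
    public static class Greeter {
        public String greet() { return "hello"; }
    }

    public static void main(String[] args) throws Exception {
        // This compiles no matter what Greeter's methods are named, so a
        // "behavior-preserving" rename of greet() breaks it only at runtime.
        Method m = Greeter.class.getMethod("greet");
        System.out.println(m.invoke(new Greeter()));
    }
}
```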

However, I wonder what subtle broken symmetries, analogous to the linguistics example above, might still exist when transforming source code according to our profession's ever growing catalog of refactorings.

Tuesday, November 6, 2007

On Being Imitated

A while back, I noticed that my son (who's nearly two) would hold out his arm and stare intently at the back of his wrist whenever someone asked "what time is it?" No one taught him this, he just picked it up because he observed adults doing it. This struck me with the one-two punch of (1) "Wow he's so observant! That's wonderful!" followed by (2) "Oh-oh, what else is he learning by osmosis? I need to be more careful."

I'm familiar with a piece of Java code that fires callbacks in response to receiving certain raw events. The idea is that the raw events themselves don't have enough information to be useful to the ultimate consumers. So this piece of code transforms the data, and then fires a more meaningful callback.

Here's how it goes. The Transformer class implements a pair of interfaces. One allows clients to register themselves as callbacks, so they can listen to meaningful events. The class maintains a collection of callbacks into which it fires the transformed messages. The other interface allows Transformer to listen to the raw events. Here's some pseudo-code.

import java.util.ArrayList;
import java.util.List;

public interface Callback
{
    void onEvent(String s);
}

public interface Publisher
{
    void registerCallback(Callback c);
}

public interface RawListener
{
    void onRawEvent(String s);
}

class Transformer implements Publisher, RawListener
{
    private final List<Callback> allCallbacks = new ArrayList<Callback>();

    public void registerCallback(Callback c)
    {
        this.allCallbacks.add(c);
    }

    public void onRawEvent(String s)
    {
        String transformedString = this.transform(s);
        for (Callback c : this.allCallbacks)
        {
            c.onEvent(transformedString);
        }
    }

    private String transform(String s)
    {
        // details omitted
        return s;
    }
}
So, clients of the Publisher interface that are interested in receiving events implement the Callback interface, then add themselves into the Transformer's collection. Some other class fires events into the RawListener interface. The idea is that the implementors of Callback don't have to know anything about the RawListener or the format of the raw events.

However, there's a difference between "don't have to know" and "shouldn't know". If the raw event format is subject to change, then we really don't want the implementors of Callback to depend on it in any way. The problem is that once a client has its mitts on a Publisher, it can cast it to a RawListener. Like a population of frogs expanding to fill a new niche in the ecosystem, living code will exploit that.

void cleverAndRisky(Publisher p)
{
    RawListener listener = (RawListener) p;
    String rawEvent = /* details omitted */ "";
    listener.onRawEvent(rawEvent);
}

In Don Box's excellent book, Effective COM, he offers some compelling arguments about the dangers of the QueryInterface method, which is basically a cast. (Incidentally, I think the first chapter of his book is among the finest technical writing I've ever read.) Whenever a class implements interfaces used for different purposes, one runs the risk of a client writing brittle code.

I've grown fond of using Inner Classes to attack this problem. Consider an improved Transformer implementation below. With this new approach, the cleverAndRisky method above won't ever work. This doesn't complicate the Transformer code too much, and it sets a good example of paying careful attention to what gets exposed.

class Transformer implements Publisher
{
    private final List<Callback> allCallbacks = new ArrayList<Callback>();

    public void registerCallback(Callback c)
    {
        allCallbacks.add(c);
    }

    private class RawListenerImpl implements RawListener
    {
        public void onRawEvent(String s)
        {
            String transformedString = transform(s);
            for (Callback c : allCallbacks)
            {
                c.onEvent(transformedString);
            }
        }
    }

    private String transform(String s)
    {
        // details omitted
        return s;
    }
}
Now, I'm not saying that you have to treat users of your code like two-year-olds, who will poke their fingers into every dangerous socket you leave open. Programmers are a pretty smart bunch. But I am suggesting that the public part of an API should be given careful thought. It's a subtle point, but the public API includes everything your objects can be cast into.

Keeping track of this sort of thing sets a good example. Be mindful of what you do when you are in a leadership position. Take pride when the troops imitate you.

Monday, November 5, 2007

Noether's Theorem

In my last semester as an undergraduate Physics concentrator, I took an initially promising class that turned out to be very depressing. In it, we learned that energy is not conserved. Now, everyone knows about the law of conservation of energy. The patent office has even been known to deny patents on the grounds that purported inventions resemble perpetual motion machines, which would violate the principle. And yet, much in the way that Newton’s laws fail in an Einsteinian cosmos, energy is not really conserved.

Here’s how it goes. It’s well accepted that the universe is expanding. This is a colloquial way to express the more precise notion that the distances between everything are getting larger. Space is not expanding into anything larger, in the way that a cake fills up the volume of an oven. We don’t notice this effect because our lives are incredibly short and because forces like gravity and electrostatics hold familiar objects together despite the expansion.

It’s also well known that light propagates in waves, and that the energy in a beam of light depends on the wavelength. More energetic waves have shorter wavelengths. Well, consider a ray of light traveling along in vacuum for a very long time. If left alone long enough, the distances between the crests of its wave will increase because of the expansion of the universe. This increases the wavelength, and robs the beam of energy. Put whimsically, even light gets tired as it ages.

Our little group was shaken by this line of reasoning. The course was an elective, and attended by maybe a dozen curious students. I don’t think we could recall a single problem set endured over the years that didn’t rely on energy conservation somewhere. So we just sat there for a few moments digesting this idea. Finally one of my study partners spoke up. “Professor, you’ve just undone the last four years of our lives,” he managed to get out.

Some years later, as a graduate student (in Mathematics, no less), I came to understand a far deeper principle. Noether’s theorem marries conservation laws to symmetries. Symmetry here has a specialized meaning that’s richer than the layman’s definition. If a deep symmetry can be found in nature, then some observable quantity must be conserved.

Specifically, if an experiment performed today would demonstrate the same behavior if performed tomorrow, then we say that the laws of Physics are invariant under translation in time. Invariance under time translation is an entirely reasonable and rather timid assumption. It’s an example of a symmetry (in the mathematical sense). Noether’s theorem tells us that this symmetry implies the law of conservation of energy. Other symmetries imply other conservation laws.

Armed with this understanding, conservation laws were displaced from my perspective as fundamental notions, and became natural consequences of mild assumptions about the world. This happy discovery more than made up for the earlier depressing one.

With this more profound perspective, it’s not so upsetting to contemplate that energy might not be conserved over time scales that are non-trivial fractions of the age of the universe. In fact, we might even expect it! We’d have to start thinking up gedankens that violate conservation, like the aging light beam above.

Why has all this come to mind when thinking and blogging about software architecture?

It comes to mind because over time I’ve found that some “best practices” that I’d embraced previously only make sense in limited contexts. This is a depressing discovery, akin to the feelings I had as an undergraduate described above. It suggests that we can only hope to architect systems on a sandy intellectual foundation.

However, whether or not a practice is best, or even good, is not happenstance. Rather, it’s a consequence of some deeper principle when certain assumptions are applied. The analogy to Noether’s theorem is too close not to be struck by it. This is a happy discovery, and I plan to expand on this idea and offer a concrete example or two in subsequent posts.

Saturday, November 3, 2007

The Spirit of Saint Louis

Charles Lindbergh doesn’t usually come to mind when one thinks of software architecture, but there’s an interesting anecdote about him that I recall when thinking about distributed programs.

To make a transatlantic crossing, most aviators in the 1920’s embraced multi-engine planes. The prevailing wisdom was that more engines meant more safety. They certainly brought more power and necessarily more weight. However, one by one, they all failed.

In contrast, the Spirit of Saint Louis was a single engine plane. Lindbergh correctly reasoned that multi-engine planes are only safer if you have somewhere to land. In a transatlantic flight, more engines actually increase the odds of failure. This is easy to see if you concede that any transatlantic plane needs all its engines to have enough power to complete the crossing.

If the probability of an engine failing in a given time interval is p, then the chance that a plane with n engines suffers no failures is (1-p)^n. Given that 0 < p < 1, a little algebra verifies that

(1-p) > (1-p)^n

for integral n > 1. In other words, the single engine plane is more likely to finish.
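Lindbergh’s reasoning is easy to check numerically. Here is a minimal sketch in Python; the per-crossing failure probability p = 0.05 is an invented figure for illustration, not a real aviation statistic.

```python
# Probability that a flight completes, assuming every engine must run
# for the entire crossing and engines fail independently with probability p.
def completion_probability(p: float, n: int) -> float:
    return (1 - p) ** n

p = 0.05  # hypothetical per-engine failure probability (illustrative only)
for n in (1, 2, 3, 4):
    print(n, "engines:", round(completion_probability(p, n), 6))
```

Each added engine multiplies in another factor of (1-p) < 1, so the completion probability only shrinks as n grows.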

A few jobs back (it was one of the failed startups I mentioned in my first post), I worked on a piece of code with a small team of about five developers. There was plenty of horsepower in the hardware to support the tasks that the software had to do.

The technical lead divided up the system into a half dozen or so separate Java programs, which communicated via RMI. The motivation was not really to parcel out the work among the developers. Rather, the intent was to improve fault resiliency.

Unfortunately, this didn’t work.

It turns out that the tech lead’s design required all the programs to be running for the system to work. So in this case, trying to achieve fault resiliency by introducing distributed programs didn’t help at all. A monolithic app would have been much simpler and no less reliable.
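The engine arithmetic applies directly to software: if all n programs must be up for the system to function, overall availability is the product of the individual availabilities. A hedged sketch (the 0.999 per-program figure is made up for illustration):

```python
# A distributed system that needs every one of its n programs running is
# the software analogue of the multi-engine plane: availabilities multiply,
# and the product only shrinks as n grows.
def system_availability(per_program: float, n: int) -> float:
    return per_program ** n

a = 0.999  # hypothetical availability of each program (illustrative only)
monolith = system_availability(a, 1)
six_programs = system_availability(a, 6)  # the half dozen RMI-coupled programs
print("monolith:", monolith, " six programs:", six_programs)
```

Unless a failed program can be tolerated or restarted independently, splitting the monolith into six mutually required pieces strictly lowers availability.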

More recently, I’d been asked to improve the availability of another software system. Management’s direction was to create an identical instance on different hardware, and fail over to it when problems arose. Sometimes, having a warm standby is a good attack on this problem. But in this case, replicating the state from one machine to the other was unusually difficult with the given design. This was an architectural flaw that was too expensive to fix.

It turns out that simply rebooting the existing system would take less time than failing over. Recasting the requirement in terms of MTBF (mean time between failures), instead of mandating one approach, would have left more latitude to consider a Lindbergh-esque solution. Unfortunately, it’s hard to write customer-oriented requirements because it’s so easy for implementation choices to masquerade as requirements.

Friday, November 2, 2007

Singular Technology

My son is nearly two and has encountered his first homonym. His word for horse is “neigh-neigh,” which makes perfect sense given that R’s are still too difficult for him to pronounce. He calls his Aunt Renee “neigh-neigh,” presumably for the same reason. He seems to have no difficulty grasping that two completely different words are pronounced the same way.

This occurred to me the other day when someone remarked that technology choices are singular. That’s actually a double entendre. It was meant in the sense that one’s technology choices have a distinguished role in the success (or failure) of a software project. Use an ORM if that would simplify development, for example. Consider using CORBA if you have communicating programs written in different languages running on heterogeneous hardware. That sort of thing.

I, however, misinterpreted the term in its linear-algebra sense. A singular operation is not invertible. For example, after you square a number, there’s no way to know which square root was fed into the squaring operation. Nine is the square of both three and negative three. Given this flavor of singular, the phrase “technology choices are singular” seems even more true to me than the other interpretation.

I think it was Ted Husted who remarked that with so many Java web application frameworks to choose from (Struts, JSF, Shale, etc.), it’s more important simply to choose one than which one you choose. However, once such architectural choices are made, it’s extremely expensive to undo them. This is actually one of the working definitions of architecture that I hold in my mind.

Architectural decisions (as opposed to plain old design decisions) are the ones that are too expensive to get wrong.

That’s not a complete definition, obviously, but it’s brief and useful. As an architect, on those occasions where I can’t persuade, nor am I persuaded, I have to consider pulling rank to get my way. Unless the cost of taking the team in the wrong direction is too great, I usually concede technical decisions to be design decisions, and therefore not in my bailiwick.

I once worked on a team that got an important architectural decision wrong. We chose CORBA to allow cooperating programs to communicate. That might have been a good move if we had to worry about

  • Endianness

  • Different languages, or

  • Operating system independence


Unfortunately none of those factors were in play. Moreover, our code ran on embedded hardware, and the CORBA footprint turned out to be uncomfortably large. Years later, they’re still stuck with it. Some of the reasons for the inertia are enumerated below.

  • Resources are allocated to develop new features instead -- Refactoring is often hard to justify to management

  • Lack of training in new technology -- An obvious point for developers, but don’t forget sales and support!

  • Previous decision is exposed to customers -- It’s hard enough to change, let alone retire, public APIs

  • No clear roadmap to migrate to new technology -- Typically must keep existing product running, but should avoid big-bang integrations

  • Emotional ties make change difficult -- Thanks for working 60 hour weeks, now we’re scrapping it


Therefore, costs are not only financial. Many factors can conspire to make decisions singular. Pun intended.

That team got other architectural decisions right, and I’m sure I’ve worked on other teams that got many architectural decisions right. But I hardly remember those decisions. In my memory, the only choices of note are the ones where I’ve later been asked to take the square root.

Thursday, November 1, 2007

Why Team Dynamics Matters

A couple of jobs ago I made a terrible mistake. Not of the Fred Brooksian multi-million dollar variety, but one with more modest financial repercussions. I recommended hiring a hotshot who was bright and hardworking, but who was also arrogant, antagonizing, and antisocial. You know the type.

In our field, leaders go through this calculus all the time. Can someone’s domain or technical strengths justify the negative impact their personality -- or lack thereof -- will have on the team? Asked another way, is anyone indispensable? This is the perennial tradeoff question. Experience teaches that the answer is always no. At the time, this was probably obvious to everybody except me.

Even when the answer seems to be yes, it’s really still no. The main reason is Conway’s Law, which holds that organizations produce software systems whose communication pathways reflect the communication pathways of the developers themselves. This notion is a two-sided coin. It predicts that dysfunctional teams will produce bad code. But Conway’s Law can also be used deliberately to make good code.

When it comes to engineering, a couple of developers in front of a whiteboard for five minutes can avert many person-days of barking up the wrong tree. One productive all-hands meeting, even though expensive, can pay for itself by identifying disconnects and misunderstandings early. And many a lunchtime conversation has its eureka moment, when friendly cross-pollination uncovers some useful idea.

Contrast that with my former environment where engineers rarely spoke because it just wasn’t worth the hassle to deal with our sociopathic new hire. Myself, I’m pretty confident and optimistic, but even I dreaded the long walk to the hotshot’s cube, knowing I’d be belittled for asking questions. On projects whose scope exceeds what one genius can deliver, no amount of technical prowess can compensate for the opportunities lost because of such hostility.

Upper management concluded that the answer to the tradeoff question above was “yes” because a working system was delivered, albeit belatedly. However, the product turned out to be a maintenance nightmare. It was brittle, bloated, and non-performant. Maybe in the hotshot’s original vision, it didn’t have to be any of those things, but upon completion it was. The product suffered because of
  • The hotshot’s inability to articulate the metaphor for the architecture
  • The unwillingness of the team to iterate towards a solution
  • Layer upon superfluous layer of indirection in the software itself
As I reflect on that structure now, with wrapper around insulating wrapper, I realize that the software was a perfect reflection of the dysfunctional communication of the team itself. Conway was right. Maintenance costs ran an order of magnitude higher than expected, at least some of which was attributable to the astonishingly high personnel turnover rate that resulted, IMHO, from the ambience.

More recently, I architected a system with Conway’s Law at the forefront of my mind. I was ridiculed good-naturedly for being an interface chauvinist. I demanded very narrow APIs between the different components, which presumably were to be developed by different people. And there was some chafing, at least at first, about my insistence that the separate components be tested in isolation as a prerequisite for integration. A system was built of cooperating state machines, engineered by cooperating people.

I was a bit heavy-handed in enforcing my vision of how the pieces fit together architecturally. But each developer was the tsar of their own component, so there was plenty of room for creativity and meaningful contributions. To the extent that we all played by the rules, we produced a clean system with fairly high conceptual integrity. Even though geographically distributed, the team members communicated. We produced a top-notch product on a very aggressive schedule, one that beat the performance requirement by an order of magnitude. Conway was right again.

Our humble little team had no indispensable ubermenschen, just decent programmers that enjoyed working together. This time around, that we would be so successful was obvious to nobody. Except me.