Friday, November 23, 2007

NMI Mine

There's an old story about a computer science professor who tried to give his son a middle name of the empty string. Whether true or not, such efforts would surely fail. How could one fill out the birth certificate? The Social Security forms? There really is no way to indicate that a middle name exists but has no letters in it.

I used to work at a big company, complete with photo ID badges. Full names were printed on them. If someone lacked a middle name, their badge displayed "NMI" for "No Middle Initial." Naturally, I wondered what would happen if we ever hired someone with a middle name of "NMI." Of course, this never happened. But if I were responsible for the ID badge software, you can bet that a person with such a name would be a test case.
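A minimal sketch of that test case, in Python. The `Employee` record and the function names are hypothetical and have nothing to do with the real badge software. The point is only that "no middle name" is carried by `None`, which is outside the set of possible names, and that the "NMI" marker appears only when the badge text is rendered.

```python
from dataclasses import dataclass
from typing import Optional

NO_MIDDLE_MARKER = "NMI"  # display text from the badge story


@dataclass(frozen=True)
class Employee:
    # Hypothetical record: None means "no middle name"; any string is a real name.
    first: str
    last: str
    middle: Optional[str] = None


def has_middle_name_in_band(stored_middle: str) -> bool:
    # Flawed design: the marker string doubles as control information,
    # so a person really named "NMI" is reported as having no middle name.
    return stored_middle != NO_MIDDLE_MARKER


def has_middle_name(employee: Employee) -> bool:
    # Absence is signaled out of band, so every string remains a valid name.
    return employee.middle is not None


def badge_text(employee: Employee) -> str:
    # Only the rendering step substitutes the marker.
    middle = employee.middle if employee.middle is not None else NO_MIDDLE_MARKER
    return f"{employee.first} {middle} {employee.last}"
```

The two badges still print the same text, but the record behind each one knows the difference, and that is what the test case would check.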

Programming 101 teaches us to keep data and control separate. But in reality, this lesson is violated all the time. Consider Unix system programming. If you want to open a file, the open() function returns a file descriptor, or it returns -1 on failure and stores the error code in errno. Since valid file descriptors are never negative, this doesn't seem to get us into too much trouble.

But unless you fancy yourself as wise as the designers of Unix, it's worth keeping data and control separate whenever possible. I'm familiar with a software system that schedules tasks on distributed embedded hardware. For reporting, the system writes a CSV file that can be imported into Excel for friendly display.

To schedule a task, a SOAP message is sent into the system. This is interpreted, log messages are written, and an appropriate embedded device is chosen. Then the system sends a message of its own to the device, passing along the scheduling information.

Scheduled tasks can be edited. But if the start time has already arrived, then the start time can't be changed. Only other details can be changed. In such cases, the update message contains a "zero" as the start time, since null start times were not allowed by the XML schema.
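Here is a rough sketch of the difference, in Python rather than SOAP and XML. The `Task` and `TaskUpdate` types and both functions are made up for illustration; the real system's messages looked nothing like this. The first function overloads zero to mean "leave the start time alone." The second carries that instruction separately from the value, as an optional field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    # Hypothetical task record; start_time is seconds since an agreed epoch.
    name: str
    start_time: int


@dataclass
class TaskUpdate:
    # None means "leave this field unchanged"; any int is a real start time.
    name: Optional[str] = None
    start_time: Optional[int] = None


def apply_update_magic_zero(task: Task, name: str, start_time: int) -> None:
    # Magic-value design: 0 is data that masquerades as control.
    task.name = name
    if start_time != 0:
        task.start_time = start_time


def apply_update_explicit(task: Task, update: TaskUpdate) -> None:
    # Explicit design: the "do not change" signal lives outside the value range.
    if update.name is not None:
        task.name = update.name
    if update.start_time is not None:
        task.start_time = update.start_time
```

With the explicit form, zero is just another time, and nobody has to agree on what it means in order to skip a field.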

This use of a magic value, where data masquerades as control, seemed harmless enough at the time. I let it slip by. There were bigger battles to wage, and I felt that I had been saying "no" too frequently anyway. Certainly the designers on the team felt that I had. Unfortunately, this was a mistake.

You see, the embedded systems ran Linux. So their "zero" time was the 1970 epoch. However, the outside world was using 1900 as zero time, because that's what Excel uses. I'm embarrassed to say that it took us many days to figure this out.
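To see how far apart the two zeros are, here is a small sketch using Python's standard datetime module. One assumption needs flagging: the sketch uses December 30, 1899 as the base for Excel serial numbers, a common convention that compensates for Excel treating 1900 as a leap year and that matches Excel for dates from March 1900 onward. The function names are my own.

```python
from datetime import datetime, timedelta, timezone

# Unix time counts seconds from this instant.
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Assumed base for Excel's 1900 date system (see the note above).
EXCEL_BASE = datetime(1899, 12, 30, tzinfo=timezone.utc)


def unix_to_datetime(seconds: int) -> datetime:
    # Interpret a number as seconds since the Unix epoch.
    return UNIX_EPOCH + timedelta(seconds=seconds)


def excel_serial_to_datetime(days: float) -> datetime:
    # Interpret a number as days since the assumed Excel base.
    return EXCEL_BASE + timedelta(days=days)


# The same "zero" lands about seventy years apart depending on who reads it.
gap_in_days = (unix_to_datetime(0) - excel_serial_to_datetime(0)).days
```

Under that assumption the gap is 25,569 days, which is why a "zero" that one side treats as a harmless placeholder can look like a real, if ancient, timestamp to the other.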

I wish I could blame the delay in fixing this bug on the fact that we have a geographically distributed team. I wish that human language barriers were a suitable excuse. I wish that I could offer something to deflect the cause away from my own bad judgment.

But I cannot. The real root cause of this bug was that I allowed an architectural flaw to creep into the system. Thou shalt not conflate data and control.
