Saturday, January 31, 2015

Learning From Mistakes

Here's a good tagline for this post: Mistakes happen. They are not the end of the world. They can be fixed or compensated for. What matters most is learning from them. And, yeah, in software, proper testing is important!

A quote attributed to Thomas John Watson, Sr. reads, "Recently, I was asked if I was going to fire an employee who made a mistake that cost the company $600,000. No, I replied, I just spent $600,000 training him. Why would I want somebody to hire his experience?"

I appreciate that philosophy. Mistakes, when they truly are mistakes and are not the result of ill intent or a demonstrated inability or refusal to learn, should be regarded as teaching opportunities. I firmly believe that success and failure together are the best way to get a "well-rounded" education in life.

This week I made a minor mistake at the day job. I renamed all the internal groups in our JIRA (issue and task management software) in preparation for synchronizing groups from our directory server. All seemed to go fine until a coworker appeared at my door and said that everyone on his team had just lost their workflow buttons. They had been there a few hours earlier. Hmmm... It turns out JIRA refers to groups by name in lots of places, and because of the ultra-generic (flexible?) way big Java applications all seem to be designed these days, there was no practical way for me to have written a script that found every one of those corner cases ahead of time. Now that I've figured out where they were, I have been able to deal with the two or three other cases that came up during the week where someone couldn't see something they were supposed to be able to see.
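If I had it to do over, I would at least have swept whatever configuration I could export for the old group names before renaming anything. Here is a rough sketch of that idea in Python; the export directory and group names are made up, and in practice JIRA tucks group references into more places than any export will show, but even a dumb scan like this would have shortened the week.

    # Sketch: scan exported JIRA configuration (workflow XML, permission
    # scheme dumps, etc.) for references to group names about to be renamed.
    # The directory and group names below are hypothetical.
    import os

    OLD_GROUPS = ["internal-developers", "internal-support"]
    EXPORT_DIR = "jira-config-export"

    for root, _dirs, files in os.walk(EXPORT_DIR):
        for name in files:
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                for lineno, line in enumerate(f, start=1):
                    for group in OLD_GROUPS:
                        if group in line:
                            print(f"{path}:{lineno}: still references {group}")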

Lessons learned? First, seriously, why was I doing this in production first? Second, I learned something about the tool I would never have learned otherwise. I am better for it, and therefore my company can be better for it as well. Third, most mistakes are recoverable. And fourth, even if a mistake is unrecoverable, there is always a way forward from it, and things can get back on track.

Ok, I could end this post right here, and you might judge it to be a nice little piece of possible wisdom that most people already share. But, like I said, this was a minor mistake. I've been the cause of some major headaches, mostly very early in my career. Read on for some entertainment...

What happens when I delete the C:\DOS directory?

I used a student loan and grant money to buy my first computer in college (a 386SX with 2MB RAM and an 80MB hard drive; I think this was 1992). This way I wouldn't have to spend hours upon hours in the computer labs to work on assignments, and could be home with my beautiful wife. Having a computer at home gave me plenty of opportunities to tinker and try things. Curious about what would happen, one day I decided to delete the C:\DOS directory. Imagine my surprise when the computer wouldn't boot after that. I learned a little more about what an operating system actually does (I was in my first year of computer science at this point). I spent a little time panicking, and seeing my wife realize that I might have just thrown away $2,000 only added to the horrible feeling in the pit of my stomach. In desperation, I put in floppy disk #1 and tried rebooting. Relief! It booted and offered to install DOS again. More learning.

Too much whitespace in the code on a 24-line 3270 terminal is a problem

Fast forward a couple of years. I was now working as a student programmer in the university's financial department, assigned to the payroll system, among other things. One day when I arrived at work, the payroll team lead asked me if I had tested my recent changes thoroughly enough. Of course, I thought I had, but... Accounting had called. Somehow, this payroll run was double what it should have been.

Here's what the problem was. We accessed the development environment on the mainframe via 3270 terminal emulators from our PCs. When paging up and down, the editor leaves one line from the previous page visible to help provide some continuity between pages. I knew that at one point in the logic I was updating I would need to call the ApplyPayment subroutine (or whatever it was called). I found the place in the code where I thought it should be called and added the appropriate line. I was right about it being the proper place for the call, because the call was already there, but it had scrolled off the screen. The line carried over from the previous page was blank, and I was only barely scanning the code as it scrolled by. There were two or three extra blank lines after the existing call, which made it that much more likely that the line I needed to see would not be the one carried over from the previous screen. So the subroutine ended up being called twice, and every payment was applied twice.
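If it helps to see it, here is a toy reconstruction of the bug in Python. The names and numbers are made up (the real thing was mainframe code), but the shape is the same: the call I added was a duplicate of one sitting just off screen.

    # Toy reconstruction of the payroll bug; names and amounts are made up.
    def apply_payment(employee, amount):
        employee["net_pay"] += amount

    def run_payroll(employee, gross):
        apply_payment(employee, gross)   # the original call, scrolled off the screen

        # ...a few blank lines of padding hid the call above...

        apply_payment(employee, gross)   # the call I added -- payment applied twice

    emp = {"name": "A. Worker", "net_pay": 0}
    run_payroll(emp, 1000)
    print(emp["net_pay"])  # 2000 instead of 1000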

Luckily the accountants caught it before any real damage was done. Every payroll run gets audited (thankfully) before releasing it to the printers and the direct deposit systems. A quick fix, a re-run of the payroll, and a few hours later, the checks were in the mail.

More learning. Testing is important. Good test data is important. But here's an observation. It's difficult to teach college students how to write good tests when they don't have enough experience to understand all the things that usually go wrong in software. Experience is so important to becoming competent in whatever you're doing in life.

Where did all the buildings go?

Fast forward another couple of years. The payroll mistake didn't cause me to be tossed from the career path, doomed to a life of ___(insert_your_least_favorite_menial_job_here)___. Just before I graduated, the department offered me full-time employment. (I should add that although I'm only highlighting mistakes in this post, I worked my tail off and tried to learn all I could. That overcomes a multitude of sins.)

I was now in charge of the capital equipment inventory system, which is a fancy accounting term that means watching the value of buildings and desks and things go down, usually for tax reasons. I don't even remember what change I was making to the system that day. I coded it up and tested it in the development database, and upon seeing that it did what I expected it to do, promoted it to production.

The next morning, another more experienced developer, the one whose role I had taken over on this part of the system, asked me whether I had tested the changes I had made. I said, "Yes." To which he responded, "Oh, maaaan. Something went really wrong last night." More than half of all the capital equipment (the buildings and everything bolted to the floors in them) was missing from the database. It turns out there was a condition in the data I hadn't accounted for, one that wasn't represented in the dev data we had to test with. Crud. I spent 36 straight hours at the office trying to reconstruct the data, but there wasn't enough there to reliably recreate everything. Month-end was coming, and pressure was high.

What's that? Just restore from the backup? Yeah. After a day and a half of trying to avoid doing that, we finally turned to the operations department to request the restored data. After a few hours of closed-door conversations, they came back sheepishly and said that their backup job had been failing every night for the last six months. The operators had been ignoring the messages on the console. After a while they had just started telling each other, "That message always happens, so it doesn't mean anything." The operators were college students, so you could give them a bit of a pass, since most of them didn't really know what those messages meant anyway. But the system admins? In six months, they never once bothered to review the logs?

At this point, yelling and screaming accomplishes nothing, and the business folks over in the administration building were pretty calm people. The accounting director decided to close the month early, using data manually entered from the last report, two weeks old. I fixed my bug and everything proceeded forward from there. Harm was done. Time was lost. Multiple people contributed to this disaster.

What did I learn? Again, testing is important. And having good test data is critical. And if you have any sysadmin responsibilities, it's a good idea to develop the habit of surveying the logs every day as a sanity check. And whatever the job is, be proactive and learn all you can about what you're working with.
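Even something as simple as a scheduled script that yells when last night's backup log contains a failure would have surfaced that problem on day one instead of month six. A minimal sketch, assuming a log file path and failure keywords that are entirely made up:

    # Daily sanity check on a backup log. The path and keywords are assumptions;
    # adapt them to whatever your backup job actually writes.
    import sys

    LOG_PATH = "/var/log/backup/last_run.log"
    FAILURE_KEYWORDS = ("ERROR", "FAILED", "ABEND")

    try:
        with open(LOG_PATH, encoding="utf-8", errors="ignore") as f:
            bad_lines = [line.rstrip() for line in f
                         if any(word in line for word in FAILURE_KEYWORDS)]
    except FileNotFoundError:
        print(f"Backup log {LOG_PATH} is missing -- did the job even run?")
        sys.exit(1)

    if bad_lines:
        print(f"Backup problems found in {LOG_PATH}:")
        for line in bad_lines:
            print(" ", line)
        sys.exit(1)

    print("Backup log looks clean.")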

These whoppers all happened 20+ years ago. To be sure, I have continued to learn a few things by not doing them right the first time. But nothing has really come close to the levels of disaster that I caused, or potentially caused, in those early years. I suppose this is what we call EXPERIENCE. Life would have been better on those occasions had I not made the mistakes I did. But knowing what can go wrong, really knowing it, makes me more capable now. The only real mistake in life is not learning from the experience when it doesn't go right.




