Wednesday, September 21, 2011

Pay Freeze, slightly melted, a random bug? Maybe.

Every now and then I read about a problem in a software system that makes the news.  I look at the article, read what is described as the problem, and often wonder how this supposed flaw got into the system.  In my experience it can be easy to fault the software for an error.  There have certainly been enough cases of odd failures for the general public to believe them, but is it really the software?

This week I heard about the story from Staunton, Virginia.  Apparently the school board had frozen pay for all of its employees for some period of time, yet, as the article stated, a glitch that gave some employees pay increases anyway went uncaught by the employees who spot-checked the payroll, right up until a news station requested salary records under the Freedom of Information Act.  That is when the discrepancy was noticed.  Now this glitch looks like something of a scandal.  The political black eye alone could be enough to make anyone nervous about the 'quality' of the application in question.

What concerns me, though, is that this glitch is being described as completely random.  First, do we really understand what it means when something is truly random?  According to Dictionary.com, random has four customary definitions.  The first is 'proceeding, made, or occurring without definite aim, reason, or pattern.'  The second is its use in statistics, specifically a process of selection in which each item of a set has an equal likelihood of being selected.  The third applies to physical trades, where a part, parcel, or piece of land may be non-uniformly shaped.  The last is an informal use implying that an occurrence was completely unexpected, or unpredictable.


Let's take a moment and consider the story.  The first definition implies that there is no rhyme or reason, no discernible pattern, to something that may make it random.  Is that the case here?  Reading further, I notice the following:
" the pay increase malfunction was random and included three teachers at Bessie Weller Elementary School, four at McSwain Elementary, four at Ware Elementary and a speech teacher and a secondary special education teacher."
Several of these teachers have one thing in common: they teach at one of three nearby elementary schools.  Wait, does that mean what I think it means?  Could this be the beginning of an actual pattern emerging, enough to discount the perceived randomness?  It could be, but as testers in this situation our job is to determine the nature of the fault, not just give our 'best guesses'.  We know a fault happened, therefore we must find a way to duplicate it.  If we continued this analysis, we'd likely have a couple of test ideas to begin with: we'd look at the data for all of the affected people and see just what it is that happened.  Is the overpayment of salaries the problem itself, or is it a side effect of some other hidden flaw that just became visible because of some quality of the instance we are examining?  Fortunately, I did a bit more research and found another article on this on MSNBC's site.  I will note that MSNBC's article is dated the sixteenth of September, and the other article the second of September; as I read, I find another nugget that seems to confirm my suspicion.

"All the affected teachers had previously worked at Dixon Elementary School and were reassigned to other schools after Dixon closed two years ago."

So it appears that this bug affected teachers who had all been assigned to a school that closed two years ago (no doubt around the time the glitch actually occurred).  Would you call that random?  No, I see a pattern, so it doesn't hold up against the first definition.  The second definition doesn't hold up either: if every salary in a sample this large had an equal chance of being affected, would you really expect the handful of wrong ones to all trace back to one closed school?  I don't buy that.  The third definition doesn't apply in this context, which leaves us with the remaining informal definition: simply that it was odd or unpredictable.
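To make that concrete, here is a minimal sketch of the kind of check a tester might run: gather the records of the affected employees and look for an attribute that every one of them shares.  The records and field names below are invented for illustration; they are not the district's actual data.

```python
from collections import Counter

# Invented records for the affected employees -- the field names and the
# fourth school are assumptions for illustration, not the district's data.
affected = [
    {"school": "Bessie Weller Elementary", "prior_school": "Dixon Elementary", "role": "teacher"},
    {"school": "McSwain Elementary",       "prior_school": "Dixon Elementary", "role": "teacher"},
    {"school": "Ware Elementary",          "prior_school": "Dixon Elementary", "role": "teacher"},
    {"school": "(unnamed in the article)", "prior_school": "Dixon Elementary", "role": "speech teacher"},
]

def common_factors(records):
    """Return the attributes whose value is shared by every record."""
    shared = {}
    for field in records[0]:
        value, count = Counter(r[field] for r in records).most_common(1)[0]
        if count == len(records):
            shared[field] = value
    return shared

print(common_factors(affected))
# -> {'prior_school': 'Dixon Elementary'}: one shared factor, not randomness
```

A check like this doesn't prove the cause, but it turns a hunch about "randomness" into a testable claim about the data.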

That much I do not doubt; no one predicted this would happen.  Now, I'm not writing this to criticize the vendor or the county where this happened.  That's not the point of this article.  Instead, my hope is to make you think.  As a tester, developer, user, or consumer of computing appliances, how often do we encounter behavior that surprises us?  How often do we not only get surprised but feel the event was unpredictable, with no reason it should be happening?

I imagine this happens more than we might like to admit.  How many times do we sit at our computers doing something normal?  We're checking our email in our client of choice; we've had no problems with our service, and we expect a 'no new messages' result if nothing is waiting for us.  We hit the send/receive button and wait, hoping to find (or not find) email.  Then we get a message that the client was unable to connect to the server.  That catches us by surprise.  Maybe we think it's an aberration, so we click the button again.

What does that second click do?  It lets us check whether this was a hiccup, a momentary failure, or perhaps a sign of a longer-term issue.  I've had this happen from time to time on web pages I visit frequently.  A forum for a football team may load very fast during the week, but on game day, as people check up on their team, it slows to a crawl, and a dependency like a style sheet or an image fails to download because of the sudden hit to the bandwidth serving the multitude of simultaneous requests.  It might even take minutes before you get that white page with some structure and no formatting.  Do we immediately think, wow, that's random, this forum is really bugged?  I know from experience that this isn't a fault of the software itself, at least as far as I can tell; instead it is a function of high load on a system that may not be able to keep up with a sudden increase in demand.
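That second click is really an informal probe.  Here's a minimal sketch of the same idea done a bit more deliberately: retry the request a few times and see whether the failure clears up or persists.  The URL is a placeholder, and the attempt count and delay are arbitrary assumptions.

```python
import time
import urllib.request
from urllib.error import URLError

def probe(url, attempts=3, delay=5):
    """Retry a request a few times to separate a momentary hiccup
    from a failure that persists."""
    results = []
    for i in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                results.append(("ok", resp.status))
        except URLError as err:
            results.append(("error", str(err.reason)))
        if i < attempts - 1:
            time.sleep(delay)
    return results

# One failure followed by successes suggests a transient blip under load;
# consistent failures point to something worth investigating further.
print(probe("https://forum.example.com/"))
```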

As testers, simply finding and reporting bugs is wholly insufficient; we need to communicate to the developer the nature and scope of the fault we've encountered.  In the case of the forum software, a subsequent refresh might fix the page, and it may load fine for several hours thereafter, leaving us unable to reproduce the issue.  Whatever the issue is, we must dig and see if we can prune down the steps we followed.  We can try to see if the bug happens from another location, try a different path through the software, or perhaps try a different role or persona, as the sketch below illustrates.  The point is that it is our job as testers to imagine how this bug could have occurred.  Where would your tester instincts tell you to look to find and prove this error so it could be fixed?  Do you have the answer?
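As a rough illustration of that pruning, here is a sketch that simply enumerates the factors a tester might vary, one combination at a time.  The factors, their values, and the stand-in run_scenario function are all hypothetical; in a real session that function would drive the application and record what was observed.

```python
import itertools

# Hypothetical factors to vary while trying to reproduce the fault.
factors = {
    "role": ["admin", "member", "guest"],
    "path": ["front page", "thread view", "search"],
    "load": ["quiet period", "peak traffic"],
}

def run_scenario(role, path, load):
    """Stand-in for actually driving the application under one combination."""
    # A real implementation would exercise the software here and return
    # the observed behavior; this stub just records the combination tried.
    return f"role={role}, path={path}, load={load}"

for combo in itertools.product(*factors.values()):
    print(run_scenario(*combo))
```

Walking the combinations this way makes it harder to miss the one factor, like a closed school or a game-day traffic spike, that turns "random" into reproducible.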

Hold that thought, because I am going to revisit this question later in the week.  For now, just remember that just because we can't see the pattern for a bug doesn't mean there isn't one, and as testers in particular, our use of language should be careful so as not to mislead the public, our developers, our clients, or our managers.
