Tuesday, September 27, 2011

If it's not random, how to decipher the pattern?

Earlier this week, I wrote about the software fault in the Staunton, Virginia, teacher payroll system.  I talked at length about the concept of 'random', and the importance of distinguishing something that is truly random from something that is better described as unexpected, unpredictable, or just 'having no discernible pattern to me as far as my senses go'.  Using precise language when describing defects in software benefits everyone on the team, including the customer.

Unfortunately, the fault in Staunton, Virginia's payroll system wasn't found by a tester; instead, it was discovered by someone researching the finances of the county's school system.  We may not know exactly how this fault was first brought to the attention of the school district.  That doesn't preclude speculation about how an investigation of a similar fault on a hypothetically similar system could be conducted.

So imagine a hypothetical payroll system for an organization with multiple locations, accounting for user entities of diverse pay grades and positions, similar to a school system's.  The more layers you add to the structure of the system, the greater its complexity.  Now let's suppose the vendor of this software received word about an apparent bug.  This bug affects certain persons within the system, who receive an unexpected and heretofore unnoticed pay increase.  So if you, as a software tester for this software vendor, receive this notice, where would you start?

Reporting and analyzing a defect that a tester stumbled upon through his or her own investigation of the software is one thing.  Trying to track down a flaw someone else found and reported is quite another.  If we follow the example of the system we discussed earlier, we can imagine the reports taking the form of output: potentially pay stubs, ledger logs, bank statements, etc.  In short, we possess a log or evidence that the problem occurred, but this evidence may be far enough removed from the system itself that we cannot reproduce the same conditions without a bit more digging.

So how can we reproduce these conditions and figure out where the real defect resides?  More information is required, and like a software Sherlock Holmes we must examine the evidence and piece together the story of what happened.  In the case of the payroll system, it is likely important to know how many individuals were impacted.  Might a search for more information about the individuals affected reveal each user to be part of particular entities or organizations within the system?  Did they work at particular locations, or have their data maintained at a particular data center?  An exhaustive analysis of whatever data can be culled from the system could help establish a definitive relationship between the affected users.
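To make that search for a common thread concrete, here is a minimal sketch in Python.  The records, the field names (`school`, `pay_grade`, `data_center`) and the helper `shared_attributes` are all invented for illustration; a real payroll system would have its own schema.  The idea is simply to ask which attributes every affected user has in common.

```python
from collections import Counter

# Hypothetical records for the affected employees; every field name and
# value here is an assumption made up for this sketch.
affected = [
    {"id": 101, "school": "North Elementary", "pay_grade": "T2", "data_center": "east"},
    {"id": 102, "school": "North Elementary", "pay_grade": "T4", "data_center": "east"},
    {"id": 103, "school": "North Elementary", "pay_grade": "T1", "data_center": "west"},
]


def shared_attributes(records, fields):
    """Return {field: value} for each field on which all records agree."""
    shared = {}
    for field in fields:
        counts = Counter(record[field] for record in records)
        if len(counts) == 1:  # exactly one distinct value across all records
            shared[field] = next(iter(counts))
    return shared


common = shared_attributes(affected, ["school", "pay_grade", "data_center"])
print(common)
```

With this made-up data, only `school` survives as a shared attribute, which is the kind of lead worth chasing first.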

From the headlines, it sounds like the school district performed an analysis just like this.  The result seemed to have something to do with individuals who went to a particular school.  Now, the age of the defect in the system may not be clear.  If a lot of time has gone by, the connection may be more subtle, and may not track to any particular organization or be so obvious; in this case, however, we strike pay dirt.  One piece of the puzzle is in place.

Given that all those results might track to a particular organization within the software, this may lead to our first hunch.  Were all the people assigned to this organization also affected by the same bug?  This might be where the investigation hits its first bump.  Maybe they aren't all affected.  That may suggest our initial hunch was wrong, but there could be a reason why some of them turned out to be the exception.
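Checking that hunch amounts to comparing two sets: everyone assigned to the organization, and everyone known to be affected.  A rough sketch, with invented employee IDs and a hypothetical helper named `compare_membership`:

```python
# Hypothetical employee IDs; both sets are assumptions for illustration.
org_members = {101, 102, 103, 104, 105}   # everyone assigned to the organization
affected_ids = {101, 102, 103}            # everyone who received the unexpected raise


def compare_membership(members, affected):
    """Split the hunch into its two possible counterexamples."""
    unaffected_members = members - affected   # in the organization, but not affected
    affected_outsiders = affected - members   # affected, but not in the organization
    return unaffected_members, affected_outsiders


exceptions, outsiders = compare_membership(org_members, affected_ids)
print("Members without the bug:", sorted(exceptions))
print("Affected non-members:", sorted(outsiders))
```

An empty list of affected non-members supports the hunch.  The members who escaped the bug are the exceptions worth studying, since whatever sets them apart may point at the real trigger.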

It's at this point in the defect analysis where a history of debugging similar enterprise applications could prove beneficial.  From reviewing some of the articles around the defect, a number of ideas come to mind, all of them based on similar behavior I've encountered in other projects I have worked on.  If these employees all worked at a facility that was shuttered, what happens to their accounts when the facility is shut down?  Are they transferred to a new facility?  Are they suspended outright?  Are they removed from the system?

I recall once, with a customer relationship management system, that we encountered a bug when a user account was removed from the system.  All the records linked to it would cascade and delete, or disappear and not show up in the system when searched.  Could a data integrity issue involving these closed locations be responsible for this behavior?
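One way to probe for that kind of integrity problem is to look for records that still point at something that no longer exists.  The sketch below uses Python's built-in `sqlite3` module with an in-memory database.  The `locations` and `employees` tables and their columns are assumptions I made up for the example, not the schema of any real payroll product.

```python
import sqlite3

# In-memory database with a made-up schema; table and column names are
# assumptions for this sketch only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE locations (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, location_id INTEGER);
    INSERT INTO locations VALUES (1, 'Open School');
    -- Location 2 was closed and its row removed, but its employees remain.
    INSERT INTO employees VALUES (10, 'A', 1), (11, 'B', 2), (12, 'C', 2);
""")

# Employees whose location_id has no matching row in locations.
orphans = conn.execute("""
    SELECT e.id
    FROM employees AS e
    LEFT JOIN locations AS l ON l.id = e.location_id
    WHERE l.id IS NULL
    ORDER BY e.id
""").fetchall()
print(orphans)
conn.close()
```

If a query like this returns the same people who show up in the bug reports, the closed location becomes a very strong suspect.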

Another possibility that occurs to me is that a system that freezes pay for all employees may apply the freeze group by group.  Might a group that these employees belonged to have been used to freeze their pay for some time period?  Might no longer belonging to an active group, because the original group was inactivated, cause the freeze to miss these accounts?
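To show how such a miss could happen, here is a small Python sketch of the speculation above.  It is purely hypothetical: the `Employee` class, the group names and the `apply_raise` function are invented, and nothing here is known about how the real system works.  The freeze list is built only from active groups, so an employee left in an inactivated group slips past it.

```python
from dataclasses import dataclass


@dataclass
class Employee:
    id: int
    group: str
    salary: float


# Hypothetical: "closed_school" was inactivated, so it is absent from this set.
active_groups = {"teachers", "staff"}


def apply_raise(employees, frozen_groups, pct):
    """Raise the salary of everyone whose group is not in frozen_groups."""
    for employee in employees:
        if employee.group not in frozen_groups:
            employee.salary = round(employee.salary * (1 + pct), 2)


employees = [
    Employee(1, "teachers", 40000.0),
    Employee(2, "closed_school", 40000.0),  # group no longer active
]

# The freeze is meant to cover everyone, but it is built from active groups only.
apply_raise(employees, frozen_groups=active_groups, pct=0.03)

raised = [e.id for e in employees if e.salary > 40000.0]
print(raised)
```

In this toy version, only the employee in the inactivated group gets the raise, which resembles the pattern in the reports and is why the group lifecycle is worth testing.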

It may be difficult to see the cause from just reading the few reports you receive from the user, but a simple, logical, step-by-step examination of the system could help reveal how the issue happened.  If it turns out the system was being used in a manner the software vendor had not planned for, that may indicate a fault in the business rules, or a lack of training for the users of the system.  Whatever the case, the team is now on its way to finding where this issue occurred.  Where would you test next?
