Martin Pinner, one of the co-founders of Application Performance Ltd.

Taking a probabilistic approach to fixing bugs

Does your organisation struggle to fix application errors? Can you capture enough information even to understand the problem? Can you reproduce it? Do you know who should be fixing it? Can you convince those people or organisations that they should be fixing it? Even if you use tools, do they help you to pinpoint the location of the problem quickly?

If you struggle with any of the above then I would like to suggest a different approach to fixing bugs: one that works with the tools available on the market, rather than trying to shoehorn them into an existing process, which is what typically happens today.

Most organisations have an incident service desk of some type. Problems, errors and bugs are logged and then assigned to a first-level support team. But the fact is, you only get to hear about the tiny minority of problems that people are prepared to report. My colleague, Mick McGuinness, has previously written about how few problems are ever actually reported in "Why Service Desk calls are just the tip of the iceberg". That article was primarily about problems on the end-user’s device, but the same argument applies to application errors.

Let’s assume you have an incident. Unless you do an analysis, you don’t know whether this is an isolated problem or one that many people experience, and a problem is more worthwhile to fix if many people are suffering from it. Perhaps you could do a keyword search for previous examples, but that is an error-prone business in its own right. Would the users describe it in exactly the same way? And that is before you know whether you have enough information to reproduce or fix it.

It is interesting to see the lengths that some organisations go to in order not to fix bugs. Much attention is given to assessing the impact of a particular incident. Is there a business justification for fixing it? Can it be worked around procedurally for now? How long would it take to fix? How much effort would it take?
This all takes considerable time. Logs are pored over, but they are often inadequate for determining the cause because they usually lack enough detail. And by concentrating on a specific incident, are patterns of related incidents being missed? How often does it happen? Is the organisation, in fact, repeatedly fixing “one-offs”? Are there teams of people who are trying to manage the problem rather than fixing the underlying cause? Is their time properly accounted for?

Having assessed the specific incident, the next stage is to find who is responsible or who would be best placed to fix it. Third-party developers will want a justification that meets their support agreement. Much time is spent trying to reproduce the problem. Often the results are inconclusive.

Tools such as OverOps and AppDynamics can help, but it is unlikely that they would capture detail for every single problem. The potential overhead prevents them from doing this. Instead they capture detail for a sample of the total set (typically a few percent). This does not necessarily fit well with trying to resolve a specific problem: there is a good chance that the data will not be there for the specific case.

Therefore, to work with these tools and get the most from them, an alternative approach is to fix the most frequent errors given the data they do provide. It will not always be possible to say who this will affect, or benefit, or what the impact will be, other than that eventually there will be a positive impact. Not doing this risks the tools failing to reach their potential.

In some cases it is going to be possible to marry the two approaches and get the best of both worlds: solve a specific issue using the tool because you were lucky enough to have the detailed data to hand. In general, though, a probabilistic approach needs to be taken. In the long run, using a tool to solve problems in a probabilistic manner will outperform solving individual cases.
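
To make the idea concrete, here is a minimal Python sketch of what working from sampled data might look like. It is an illustration only, not how OverOps or AppDynamics work internally: the error signatures, the 3% sample rate and the function names `rank_errors` and `chance_of_capture` are all assumptions made up for this example. It ranks the errors that did get captured by frequency, and estimates how likely it is that an error occurring a given number of times was captured in detail at least once.

```python
from collections import Counter


def rank_errors(sampled_events):
    """Rank error signatures from a sample by how often they appear.

    Returns a list of (signature, count, share_of_sample) tuples,
    most frequent first. All names here are hypothetical.
    """
    counts = Counter(sampled_events)
    total = sum(counts.values())
    return [(sig, n, n / total) for sig, n in counts.most_common()]


def chance_of_capture(occurrences, sample_rate):
    """Probability that at least one of `occurrences` was sampled in detail,
    assuming each occurrence is sampled independently at `sample_rate`."""
    return 1 - (1 - sample_rate) ** occurrences


# Made-up sample: signatures of the errors the tool happened to capture.
sampled = (
    ["NullPointerException@OrderService"] * 6
    + ["TimeoutError@PaymentGateway"] * 3
    + ["ParseError@CsvImport"] * 1
)

ranking = rank_errors(sampled)
for signature, count, share in ranking:
    print(f"{signature}: {count} captures ({share:.0%} of sample)")

# Assumed sample rate of 3% ("a few percent").
print(f"one-off incident captured: {chance_of_capture(1, 0.03):.1%}")
print(f"error seen 100 times captured: {chance_of_capture(100, 0.03):.1%}")
```

Under these assumptions a one-off incident has only about a 3% chance of having detailed data behind it, while an error that occurs 100 times is captured at least once with roughly 95% probability. That is why frequent errors are the ones the sampled data is best placed to help with.
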
We have been taking this approach with WebTuna, following our own mantra of working with the strengths of the tools we use. For example, one of our aims is to use OverOps to continually reduce the number of bad data records that we process. Hidden within them are a few records that we falsely reject. We intend to find them. Previously we couldn’t see the wood for the trees.

If you would like to try OverOps for yourself, and see how it can speed up fixing problems in your own environment, then go to http://www.applicationperformance.com/overops/ to find out more. If you’d like to talk to one of the team here at AP then please contact us.

25 Oct 2018

Code Quality
