The TL;DR is that we can bring in a small but very smart and experienced team that will rigorously drill down until the problem is found and resolved.
If you are faced with nasty bugs in your applications that are hard to diagnose or solve, we can help.
If the root cause of the issue is not yet known and there is pressure to solve the problem, we first investigate which temporary "quick fixes" or workarounds are possible. These are then implemented, tested and rolled out in a controlled manner.
To find the root cause of the problem we use the scientific method, together with divide and conquer and difference analysis. In this process, the entire team is extremely critical and goes through an iterative process of:
- formulating hypotheses (and discarding old ones)
- developing testable predictions
- executing tests and gathering test data
The hardest part of this exercise is usually finding a way to reproduce the error (especially in the case of race conditions). Once the error has been found, solving it is often(*) relatively easy.
Reproducing the error
To reproduce the error, it may help to devise a test setup that bombards the software with a huge number of parallel events in a (pseudo-)random manner.
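As an illustration of such a test setup, here is a minimal Python sketch. The `Account` class, its `apply` method and the helper names are hypothetical stand-ins for the real system under test, not part of any actual project. The event list is generated from a seed, so the same input can be replayed; the thread interleaving itself remains non-deterministic, which is exactly why many events and many runs are needed.

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor


class Account:
    """Hypothetical system under test: a balance guarded by a lock."""

    def __init__(self):
        self.balance = 0
        self._lock = threading.Lock()

    def apply(self, amount):
        # Remove the lock to see how an unguarded read-modify-write behaves.
        with self._lock:
            current = self.balance
            self.balance = current + amount


def make_events(seed, n_events):
    """Build a reproducible pseudo-random event list from a seed."""
    rng = random.Random(seed)
    return [rng.randint(-100, 100) for _ in range(n_events)]


def stress(target, events, n_workers=16):
    """Fire all events at the target from many threads at once."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(target.apply, events))


def run_trial(seed, n_events=20_000):
    """Return (expected, actual) so the caller can check the invariant."""
    events = make_events(seed, n_events)
    account = Account()
    stress(account, events)
    return sum(events), account.balance


if __name__ == "__main__":
    for seed in range(5):
        expected, actual = run_trial(seed)
        status = "ok" if expected == actual else "MISMATCH"
        print(f"seed={seed} expected={expected} actual={actual} {status}")
```

The key design choice is to check an invariant (here: the final balance equals the sum of all events) rather than to wait for a crash, and to log the seed of every run so that a failing input can be replayed.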
Even after taking special measures, there may still be cases in which the problem reproduces only very sporadically. The analysis and hypothesis formulation steps then become all the more important, because data is scarcer and because the problem typically becomes easier to reproduce as your hypothesis gets closer to the root cause.
Once an error can be reproduced on demand, it can usually be solved relatively quickly using divide and conquer and difference analysis.
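One common form of divide and conquer is bisection between a known-good and a known-bad state. The sketch below is a minimal Python illustration, assuming an ordered list of changes and a hypothetical `is_bad` check that runs the reproduction on a given change. It also assumes a single bug introduced at one point; with more than one bug that assumption breaks and the search can point at the wrong change.

```python
def first_bad(changes, is_bad):
    """Return the index of the first change for which is_bad() is True.

    Assumes every change before that index is good and every change from
    it onwards is bad. Returns None if no change is bad.
    """
    lo, hi = 0, len(changes)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(changes[mid]):
            hi = mid          # the first bad change is at mid or earlier
        else:
            lo = mid + 1      # the first bad change is after mid
    return lo if lo < len(changes) else None


if __name__ == "__main__":
    # Hypothetical build identifiers; in this example the bug appears in build-6.
    builds = [f"build-{i}" for i in range(10)]
    broken = set(builds[6:])
    index = first_bad(builds, lambda build: build in broken)
    print(builds[index])
```

Each probe halves the remaining search range, so even a long history of changes needs only a handful of reproduction runs.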
Key success factors for successful debugging
To be quick and successful in debugging requires a combination of the following approach and skills:
- A creative, associative, inductive way of thinking that allows the team to come up with good hypotheses.
- Experience in the applications/technologies covered, to help formulate relevant hypotheses more quickly.
- A rigorous, fact-based approach with sharp deductions that does not allow any assumptions to creep in.
- A thorough understanding of the subject matter. Copying and pasting some Stack Overflow text without really understanding what is going on is not enough.
- Being aware that there may very well be more than one bug, which means that divide and conquer can lead you to chase a red herring.
- Good data logging and visualization tools.
- Being able to iterate and test very rapidly: there should be no red tape, and communication with all parties involved should be quick and responsive.
- A strong drive and focus to find the errors and to keep on looking.
Common error causes
Likely culprits that we have seen over and over again (and that are hard to reproduce) are:
- Race conditions
- Memory leaks(**)
- Numerical instability: comparing floating-point numbers with == is rarely meaningful.
- For low-level C and assembly code: the classic uninitialized memory reads and invalid memory accesses ('buffer overflows').
- For managed code: Garbage Collection (GC) stalls.
- Practical but common:
  - Version incompatibility in very specific functionality: it seems to work, but it is not OK.
  - External services sporadically failing.
  - Default settings that 'seemed OK' for development, but not for production.
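The floating-point item in the list above can be illustrated with a short Python sketch. The helper name `nearly_equal` and the tolerance values are illustrative assumptions; suitable tolerances depend on the application.

```python
import math


def nearly_equal(a, b, rel_tol=1e-9, abs_tol=1e-12):
    """Compare two floats within a tolerance instead of using ==."""
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)


if __name__ == "__main__":
    total = 0.1 + 0.2
    print(total == 0.3)              # False: rounding error in binary floats
    print(nearly_equal(total, 0.3))  # True
    # A relative tolerance alone never matches against zero,
    # which is why an absolute tolerance is passed as well.
    print(math.isclose(1e-15, 0.0))  # False
    print(nearly_equal(1e-15, 0.0))  # True
```

Bugs of this kind are hard to reproduce because the comparison may succeed for most inputs and fail only for particular values or on a particular platform.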
Based on our considerable experience in debugging and software development, our drive and creativity, and our bent for being overly critical, we believe that we can help you significantly reduce the time it takes to find the error. Sometimes a mere fresh perspective can speed things up considerably for the existing team.
After solving the error
As a last step we want to be really sure that the error is solved in a robust manner, and that no side effects have crept in.
Besides code review and design review, extensive (stress) testing is performed to ensure that the application is good to go.
(*) But not always, unfortunately. For example, problems caused by Garbage Collector stalls may require significant code rewrites. Also, parts of applications sometimes become convoluted because successive authors 'fixed' other authors' mistakes and added functionality without fully understanding what was going on. In these cases rewrites of larger chunks of code may be required.
(**) Yes, these can also occur with managed runtimes and scripting languages.