I’ve been building domain-specific debuggers for many years now. It’s been a roller-coaster ride of frustrations and achievements, and the ride is just getting wilder by the minute. I’ve been meaning to put some of my experiences into written form for a while.
Bear with me, if you will; this is a first jot.
Developers react very differently when faced with a powerful dissonance between the expected and actual behavior of their software. The gamut of responses ranges from randomly poking at the code in disbelief, to meticulously stepping through every line in a traditional debugger. Some will rewrite chunks of code without root-causing the issue, thinking that it would take less time this way and they couldn’t make it any worse (oh my!).
How many will choose to write their own debugger to attack a specific problem? Few is my guess.
Certainly this shouldn’t be the first response. I can already hear someone snorting “hey Oliver, I’ve got a truncated string over here, think I should write a string debugger?” …no, no you shouldn’t, please don’t. But do consider this article on string processing if it applies to you.
Why Build a Domain-Specific Debugger
Simply put, you should consider it if you have a recurring debugging problem that is expensive to you and your team. It comes down to an economic trade-off, because building debuggers is also a very expensive proposition. Whatever you think it might take… multiply by 5 or 10, and attempt it only if your team has the skills to do it.
The cost hurdle isn’t insurmountable, however; far from it. You can assign a few people to build a tool which hundreds of people will use in their daily work. It doesn’t take a huge improvement in productivity to suddenly make the economics of this tool viable.
You should not build a domain-specific debugger if one already exists for your domain that you can purchase at reasonable cost. In-house tools can rarely compete with productized solutions. If some tools come close, try to see if you can bridge the gap with utility code.
What Resistance You Will Encounter
Sadly, even when it’s glaringly obvious that it’s a net win for the organization, you will face a tremendous amount of resistance from your peers. Domain-specific debugging tools go against some of the most basic intuitions that make up the programmer’s meme pool. Some of these are beliefs that have been reinforced since the very first program each developer wrote, from “hello world” even.
The top source of resistance from developers is disbelief that…
1. …they suffer from a debugging problem.
2. …others suffer from a debugging problem.
3. …this debugging problem is at least partly avoidable/solvable with tooling.
4. …your tool conforms to their lofty-but-undefined expectations.
In many cases #4 is a final expression of their internal rejection of your attempt to refactor this most basic activity of programming. In the field, a kind of heartwarming familiarity with printf-debugging is what keeps it in use in many scenarios where it’s completely inappropriate. Roughly nothing can deal with gigabytes of text files. :^/
How to Design a Domain-Specific Debugger
The very first thing to think about is the shape of your debugging problem. If you were able to store all the relevant data, however large that set may be, what kind of data structure shape would you need? Is it fundamentally a sequence? A tree? A DAG? A general graph?
Then you need equal measures of these three ingredients:
- Instrumentation takes data out of the software and makes it available to the debugging tool; it can either be applied from the outside or require code modifications.
- Visualization presents the data in a way that makes it immediately understandable, where the mark to beat is a plain text file.
- Correlation allows the debugging tool to know how the pieces of data are related. This is the most important part of it all. It depends on the instrumentation to make correlation easier and on the visualization to make it usable.
I can’t stress enough that the correlation part is essential. As a contrived example, consider memory-leak instrumentation in generalized C++ code. Telling me that there is a leak is only useful to a degree… telling me which line of the source code is responsible for the allocation is much better, and further telling me where a similar allocation used to be de-allocated before a change I made to the code would rock my world. Correlation is the essential problem of debugging: this is where humans waste their time, and this is where the tool needs to shine. For most software it is also essential that correlation not require multiple user-visible passes, so that interactivity with the debugging tool isn’t compromised.
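To make the leak example a little more concrete, here is a minimal sketch, written in Python rather than C++ purely for brevity. The `AllocationTracker` class and its `acquire`/`release` methods are hypothetical names of mine, not part of any real tool. The only point is that recording the call site at acquisition time is what lets the report tie a leak back to a line of source.

```python
import traceback


class AllocationTracker:
    """Toy tracker (hypothetical): remembers which source line acquired each live handle."""

    def __init__(self):
        self._live = {}  # handle -> (filename, line number) of the acquiring call

    def acquire(self, handle):
        # With limit=2 the summary holds [caller, acquire]; entry 0 is the caller.
        caller = traceback.extract_stack(limit=2)[0]
        self._live[handle] = (caller.filename, caller.lineno)

    def release(self, handle):
        self._live.pop(handle, None)

    def leaks(self):
        """Group the handles that were never released by their acquiring call site."""
        by_site = {}
        for handle, site in self._live.items():
            by_site.setdefault(site, []).append(handle)
        return by_site


tracker = AllocationTracker()
for name in ("a", "b", "c"):
    tracker.acquire(name)  # all three handles share this one call site
tracker.release("b")

for (filename, lineno), handles in tracker.leaks().items():
    print(f"{filename}:{lineno} still holds {handles}")
```

Comparing the set of leaking call sites before and after a code change would be the next level of correlation; this sketch stops short of that.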
At the end of the day the worth of your final solution will depend on the weakest of the three. Also, all three layers need to be designed to the same shape you chose. In my domain space the shape of choice is a tree, and by way of example, we have spent a few man-years on each of the three aspects for a single one of our debugging systems.
Problems to Solve in the Instrumentation Space
- Typefulness. If the software is written in C/C++ then the data in the source code is typed. It is very important to preserve that type encoding when the data leaves the software.
- Obtrusiveness. In the case where you explicitly instrument the software to export information, you do not want to obscure the real source code with instrumentation code. This is a very sensitive topic for developers who are inclined to reject the idea from the start.
- Performance. Heisenberg’s principle applies to software too. You can’t know what software is doing without altering how fast it’s doing it. Your instrumentation will slow it down, so be ready with mechanisms to mitigate this effect. There are no Heisenberg compensators here.
- Storage. This data needs to be placed somewhere as it’s accumulated. Be prepared for multi-gigabyte sized data dumps. Make sure you store the data in a way that aligns with the expected access pattern / shape of the problem.
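As a rough illustration of how these four concerns can meet in one place, here is a small sketch in Python. Everything in it is an assumption made for the example: the `Recorder` class, its `scope`/`value`/`dump` methods, and the record layout are hypothetical, not taken from any tool I have built. It assumes a tree-shaped problem, so nested scopes become tree nodes.

```python
import io
import json
from contextlib import contextmanager


class Recorder:
    """Hypothetical tree-shaped trace recorder."""

    def __init__(self, enabled=True):
        self.enabled = enabled  # Performance: one switch turns recording off
        self.records = []       # Storage: flat list, parent ids encode the tree
        self._open = []         # ids of the scopes currently being executed

    @contextmanager
    def scope(self, name):
        # Obtrusiveness: one `with` line per instrumented region.
        if not self.enabled:
            yield
            return
        node_id = len(self.records)
        parent = self._open[-1] if self._open else None
        self.records.append({"id": node_id, "parent": parent, "name": name, "values": {}})
        self._open.append(node_id)
        try:
            yield
        finally:
            self._open.pop()

    def value(self, key, val):
        if not self.enabled or not self._open:
            return
        # Typefulness: store the type name next to the value.
        sample = {"type": type(val).__name__, "value": val}
        self.records[self._open[-1]]["values"][key] = sample

    def dump(self, stream):
        # One JSON object per line, in the order the scopes were opened.
        for rec in self.records:
            stream.write(json.dumps(rec) + "\n")


rec = Recorder()
with rec.scope("frame"):
    rec.value("width", 640)
    with rec.scope("tile"):
        rec.value("gain", 0.5)

out = io.StringIO()
rec.dump(out)
print(out.getvalue())
```

A real system would need a far more compact storage format than JSON lines, and a cheaper off switch than a runtime check, but the division of responsibilities would look similar.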
Problems to Solve in the Visualization Space
- Typelessness. Once you’ve saved the typed data you’re likely to want to treat it typelessly in your visualization – you don’t want to implement an int plot, a float plot, a double plot, etc…
- Configuration. There’s no telling what your users want to view in a plot, or in an image, or a histogram… you’d better make that visualization tool configurable or else you’ll be implementing specializations for the rest of your life.
- Interactivity. The real value of a debugging tool is that it lets developers explore slices of the data that they direct as they build a mental model of the problem. Your tool needs to be fast enough for users to want to interact with it, and it has to have mechanisms for interaction.
- Queries. With all this flexibility you’ll find your back-end needs to be able to tune only to the relevant data. It needs to be able to handle queries that map to the shape of the user’s data slices – which will hopefully be like the shape of the problem (if you find that’s not the case, you picked the wrong shape).
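Here is a small sketch of the typelessness and query ideas, again in Python. The record layout (a scope `name` plus `values` that each carry a `type` tag) and the `as_number`/`query` helpers are assumptions I made up for the illustration. The idea is that the type tag is preserved in storage, but a single plotting path can consume every numeric type.

```python
def as_number(sample):
    """Collapse a typed sample into a float so one plot serves every numeric type."""
    if sample["type"] in ("int", "float", "bool"):
        return float(sample["value"])
    return None  # not numeric: this value needs a different kind of view


def query(records, name, key):
    """Return (record id, number) pairs for scopes called `name` that carry `key`."""
    points = []
    for rec in records:
        sample = rec["values"].get(key)
        if rec["name"] != name or sample is None:
            continue
        number = as_number(sample)
        if number is not None:
            points.append((rec["id"], number))
    return points


# Made-up records in the assumed layout.
records = [
    {"id": 0, "name": "frame", "values": {"width": {"type": "int", "value": 640}}},
    {"id": 1, "name": "tile", "values": {"gain": {"type": "float", "value": 0.5}}},
    {"id": 2, "name": "tile", "values": {"gain": {"type": "int", "value": 2}}},
    {"id": 3, "name": "tile", "values": {"label": {"type": "str", "value": "edge"}}},
]

print(query(records, "tile", "gain"))  # [(1, 0.5), (2, 2.0)]
```

The linear scan in `query` is only acceptable for a toy; with multi-gigabyte dumps the back-end would need to answer the same question without touching irrelevant data.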
Problems to Solve in the Correlation Space
- Relationships. This is the single biggest point you need to think about. How do pieces of data relate?
- Distance. If you need to go across great spans of stored data to find related information you’re in a difficult place.
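One way to picture both points, sketched in Python under assumptions of my own: records form a tree through `parent` ids, and a record’s `id` equals its position in the list. The `build_indexes` and `nearest_ancestor_with` helpers are hypothetical. The first makes relationships explicit in a single pass; the second shows that "distance" in a tree is the number of parent links you have to walk.

```python
from collections import defaultdict


def build_indexes(records):
    """One pass over the data builds both lookup tables."""
    children = defaultdict(list)  # parent id -> ids of direct children
    by_key = defaultdict(list)    # value key -> ids of records carrying it
    for rec in records:
        children[rec["parent"]].append(rec["id"])
        for key in rec["values"]:
            by_key[key].append(rec["id"])
    return children, by_key


def nearest_ancestor_with(records, node_id, key):
    """Walk parent links upward until a record carrying `key` is found."""
    current = records[node_id]["parent"]
    while current is not None:
        if key in records[current]["values"]:
            return current
        current = records[current]["parent"]
    return None


# Made-up records; each id matches its index in the list.
records = [
    {"id": 0, "parent": None, "values": {"width": 640}},
    {"id": 1, "parent": 0, "values": {}},
    {"id": 2, "parent": 1, "values": {"gain": 0.5}},
    {"id": 3, "parent": 0, "values": {"gain": 2}},
]

children, by_key = build_indexes(records)
print(children[0])                                 # [1, 3]
print(by_key["gain"])                              # [2, 3]
print(nearest_ancestor_with(records, 2, "width"))  # 0
```

If related records routinely sit many links (or many gigabytes) apart, that is a hint the instrumentation should record the relationship directly instead of leaving the tool to rediscover it.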
Think about these problems as they apply to your problem domain. If you are thinking of building a domain-specific debugger, you must plan to solve these issues.
** FEATURED COMMENT **
The only criticism I have to offer (constructive or otherwise) is that you should probably consider that for MOST developers, the mark to beat is the MSVS compiler, not printf-style debugging. -John
True, though presumably if you have a domain-specific problem that would warrant a domain-specific debugger then the VS debugger isn’t a very good fit. A standard debugger is only really good at debugging software issues, like a crash, an exception, an invalid argument assertion…
A domain-specific debugger is needed when the software performs its job without spectacular failure, but produces incorrect output. Then you need to look at most everything it did to decide where it went wrong. Stepping through a debugger for this gets tedious fast if the task is complex.