[1] https://github.com/plasma-umass/ChatDBG (north of 75K downloads to date) [2] https://arxiv.org/abs/2403.16354
We currently run everything through LiteLLM, so while it's undocumented, in theory other LLMs could work (in my experience they don't). I'm working on updating and fixing this.
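For reference, LiteLLM exposes an OpenAI-style completion call where the provider is picked by the model string, which is why other models could work in principle. A rough sketch (the model names and prompt are just examples, not what ChatDBG actually sends):

    from litellm import completion

    # The same call shape works across providers; whether another model drives
    # the debugger well is a separate question.
    resp = completion(
        model="gpt-4o",  # or e.g. "anthropic/claude-3-5-sonnet-20241022", "ollama/llama3"
        messages=[{"role": "user", "content": "Why did this program crash at frame #3?"}],
    )
    print(resp.choices[0].message.content)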
1. a shorter path to relevant information by querying for specific variables or functions rather than a longer investigation of the source code. LLMs are typically trained/instructed to keep their answers within a range of tokens, so keeping conversations shorter where possible extends the search space the LLM will be "willing" to explore before outputting a final answer.
2. a good starting point in some cases by immediately inspecting suspicious variables or function calls. In my experience this happens a lot in our Python implementation, where the first function calls are typically `info` calls to gather background on the variables and functions in the frame (a rough sketch of such a lookup follows below).
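For illustration only (this is not ChatDBG's actual implementation), an `info`-style lookup over the frame a Python debugger is stopped in might look something like this; `frame` and the helper name are assumptions:

    import inspect

    def info(frame, name: str) -> str:
        """Describe a variable or function visible in `frame` (hypothetical helper)."""
        scope = {**frame.f_globals, **frame.f_locals}
        if name not in scope:
            return f"{name!r} is not visible in the current frame"
        obj = scope[name]
        if callable(obj):
            try:
                sig = str(inspect.signature(obj))
            except (TypeError, ValueError):
                sig = "(...)"
            first_doc_line = (inspect.getdoc(obj) or "no docstring").splitlines()[0]
            return f"{name}{sig}: {first_doc_line}"
        return f"{name}: {type(obj).__name__} = {obj!r}"

The point is that a single call like this answers "what is this symbol here?" without the model having to read and reason over the surrounding source.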
The second command, `definition`, prints the location and source code for the definition corresponding to the first occurrence of a symbol on a given line of code. For example, `definition polymorph.c:118 target` prints the location and source for the declaration of `target` corresponding to its use on that line. The `definition` implementation leverages the `clangd` language server, which supports source code queries via JSON-RPC and Microsoft's Language Server Protocol.
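For context, such a definition query is just a JSON-RPC message framed with a Content-Length header and written to clangd's stdin. A minimal sketch of building that request in Python (the file URI and character offset are made up, and LSP positions are zero-based, so line 118 becomes 117):

    import json

    def lsp_frame(payload: dict) -> bytes:
        """Frame a JSON-RPC message for an LSP server such as clangd (stdio transport)."""
        body = json.dumps(payload).encode("utf-8")
        return f"Content-Length: {len(body)}\r\n\r\n".encode("ascii") + body

    # Hypothetical textDocument/definition request for the use of `target`
    # at polymorph.c:118; the path and character offset are illustrative.
    request = lsp_frame({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "textDocument/definition",
        "params": {
            "textDocument": {"uri": "file:///home/user/polymorph.c"},
            "position": {"line": 117, "character": 34},
        },
    })
    # A real client would first send `initialize`/`initialized` and a
    # `textDocument/didOpen` notification before issuing this query.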
That’s kinda beside the point then, if you want to do Windows debugging. Or am I missing something?
I am actually more interested in improving the debugger interface. For example, an AI assistant could help me create breakpoint commands that nicely print function parameters when you only partly know the function signature and do not have symbols. I used Claude/Gemini for such tasks and they were pretty good at it.
As a side note, I recall Kevin Gosse also implemented a WinDbg extension [1][2] which used OpenAI API to interpret the debugger command output.
imo what AI needs to debug is either:
- train it with RL to use breakpoints + a debugger or to do print debugging, but that'll suck because the chains of actions are super freaking long, and we know how it goes with AI memory currently: it's not great
- an always-on, sort-of-omniscient debugger that can inform the AI of everything the program/services did (Sentry-like observability, but on steroids). The AI would then just search within that and find the root cause
neither of the two approaches is going to be easy to make happen, but imo if we all spend 10+ hours every week debugging, it's worth a shot
that's why I'm currently working on approach 2. I made a time-travel debugger/observability engine for JS/Python, and I'm now working on plugging it into AI context as efficiently as possible so that hopefully one day it can debug even super long sequences of actions in dev & prod
it's super WIP and not self-hostable yet, but if you want to check it out: https://ariana.dev/
Like you said, running over a stream of events, states, and data for that debugging scenario is probably way more helpful. It would also be great to prime the context with business rules and history for the company. Otherwise LLMs will make the same mistake devs make: not knowing the "why" behind something and thinking the "what" is most important.
I'm looking at this as a better way to get the humans pointed in the right direction. Ariana.dev looks interesting!
These kinds of demos were cool 2 years ago - then we got function calling in the API, it became super easy to build this stuff - and the reality hit that LLMs were kind of shit and unreliable at using even the most basic tools. Like oh wow, you can get a toy example working and suddenly it's a "natural language interface to WinDBG".
I am excited about progress on this front in any domain - but FFS show actual progress or something interesting. Show me an article like this [1] where the LLM did anything useful. Or just show what you did that's not "oh I built a wrapper on a CLI" - did you fine-tune the model to get better performance? Did you compare which model performs better by setting up some benchmark and find one to be impressive?
I am not shitting on OP here because it's fine to share what you're doing and get excited about it - maybe this is step one, but why the f** is this a front page article?
[1]https://cookieplmonster.github.io/2025/04/23/gta-san-andreas...
A real, quality AI breakthrough in software creation & maintenance will require a deep rework of many layers in the software stack, low- and high-level.
Usually what I keep bumping into are people who never bothered to learn how to use their debuggers beyond the "introduction to debuggers" class, if any.
results["info"] = session.send_command(".lastevent")
results["exception"] = session.send_command("!analyze -v")
results["modules"] = session.send_command("lm")
results["threads"] = session.send_command("~")
You cannot debug a crash dump with only these 4 commands, all the time.
Just had a quick look at the code: https://github.com/svnscha/mcp-windbg/blob/main/src/mcp_serv...
I might be wrong, but at first glance I don't think it is only using those 4 commands. It might be using them internally to get context to pass to the AI agent, but it looks like it exposes:
- open_windbg_dump
- run_windbg_cmd
- close_windbg_dump
- list_windbg_dumps
The most interesting one is "run_windbg_cmd" because it might allow the MCP server to send whatever command the AI agent wants. E.g.:

    elif name == "run_windbg_cmd":
        args = RunWindbgCmdParams(**arguments)
        session = get_or_create_session(
            args.dump_path, cdb_path, symbols_path, timeout, verbose
        )
        output = session.send_command(args.command)
        return [TextContent(
            type="text",
            text=f"Command: {args.command}\n\nOutput:\n```\n" + "\n".join(output) + "\n```"
        )]
(edit: formatting)
After that, all that is required is interpreting the results and connecting them with the source code.
Still impressive at first glance, but I wonder how well it works on a more complex example (a crash in the Windows kernel due to a broken driver, for instance).
I for one enjoy crash dump analysis because it is a technically demanding, rare skill. I know I'm an exception, but I enjoy actually learning the stuff so I can deterministically produce the desired result! I even apply it to other parts of the job, like learning the currently used programming language and actually reading the documentation of libraries/frameworks, instead of copy-pasting solutions from the "shortcut du jour" like Stack Overflow yesterday and LLMs today!
Analyzing crash dumps is a small part of my job. I know enough to examine exception context records and associated stack traces, and 80% of the time that's enough. Bruce Dawson's blog has a lot of great stuff but it's pretty advanced.
I’m looking for material to help me jump that gap.
There's no magic to getting good at it. Like anything else, it's mostly about practice.
People like Bruce and Raymond Chen had a little bit of a leg up over people outside Microsoft in that if you worked in the Windows division, you got to look at more dumps than you'd have wanted to in your life. That plus being immersed in the knowledge pool and having access to Windows source code helps to speed up learning.
Which is to say, you will eventually "bridge the gap" with them with experience. Just keep plugging at it and eventually you'll understand what to look for and how to find it.
It helps that in a given application domain the nature of crashes will generally be repeated patterns. So after a while you start saying "oh, I bet this is a version of that other thing I've seen devs stumble over all the time".
A bit of a rambling comment to say: don't worry, you'll "get really good at it" with experience.
I have a dog-eared copy of Advanced Windows Debugging that I've used, but I also have books on reverse engineering and disassembly, and a little bit of curiosity and practice. I have the .NET version as well, which I haven't used as much. I also enjoyed the Vostokov books, even though there is a lack of editing in them.
Edit to add: It is not so much about usage of the tool as it is about understanding what is going on in the dump file; you are ahead in knowledge if you can do stack traces and look at exception records.
* Reading the whole source code
* Looking up dependency documentation and code, and searching related blog posts
* Getting compilation/linter warnings and errors
* Running tests
* Running the application and validating the output (e.g., for a webserver: start the server, send requests, get the response)
The tooling is slowly catching up, and you can enable a bunch of this already with MCP servers, but we are nowhere near the optimum yet.
Expect significant improvements in the near future, even if the models don't get better.
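To make the MCP point concrete, a tool that runs the test suite and hands the output back to the model might look roughly like the sketch below; it assumes the official `mcp` Python SDK's FastMCP helper and pytest, and the server/tool names are made up:

    import subprocess

    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("dev-tools")

    @mcp.tool()
    def run_tests(path: str = "tests") -> str:
        """Run the project's test suite and return the combined output."""
        proc = subprocess.run(
            ["pytest", path, "-q"],
            capture_output=True, text=True, timeout=600,
        )
        return proc.stdout + proc.stderr

    if __name__ == "__main__":
        mcp.run()  # serves the tool over stdio so an agent can call it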
It definitely does allow models to do more.
However, the high-level planning, reflection, and executive function still aren't there. LLMs can nowadays navigate very complex tasks using "intuition": just ask them to do the task, give them tools, and they do a good job. But if the task is too long or requires too much information, the growing context degrades performance significantly, so you have to switch to a multi-step pipeline with multiple levels of execution.
This is, perhaps unexpectedly, where things start breaking down. Having the LLM write down a plan lossily compresses the "intuition", and LLMs (yes, even Gemini 2.5 Pro) cannot understand what's important to include in such a grand plan, how to predict possible externalities, etc. This is a managerial skill and seems distinct from closed-form coding, which you can always RL towards.
Errors, omissions, and assumptions baked into the plan get multiplied many times over by the subsequent steps that follow the plan. Sometimes, the plan heavily depends on the outcome of some of the execution steps ("investigate if we can..."). Allowing the "execution" LLM to go back and alter the plan results in total chaos, but following the plan rigidly leads to unexpectedly stupid issues, where the execution LLM is trying to follow flawed steps, sometimes even recognizing that they are flawed and trying to self-correct inappropriately.
In short, we're still waiting for an LLM which can keep track of high-level task context and effectively steer and schedule lower-level agents to complete a general task on a larger time horizon.
For a more intuitive example, see how current agentic browser use tools break down when they need to complete a complex, multi-step task. Or just ask Claude Code to do a feature in your existing codebase (that is not simple CRUD) the way you'd tell a junior dev.
However, this usually takes much more effort than just doing the damn thing myself.
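To illustrate the plan-then-execute structure described above, here is a deliberately naive sketch; `complete` is assumed to be a wrapper around whatever LLM API is in use, and real pipelines add tool calls, validation, and replanning on top of this:

    def plan_then_execute(task: str, complete) -> str:
        # Planning step: the model's "intuition" gets lossily compressed into a text plan.
        plan = complete(f"Break this task into a short numbered list of concrete steps:\n{task}")
        steps = [line.strip() for line in plan.splitlines() if line.strip()]
        notes = []
        for step in steps:
            # Each step only sees the task, the step itself, and prior results, so
            # errors or bad assumptions baked into the plan propagate downstream.
            result = complete(
                f"Task: {task}\nStep to perform now: {step}\n"
                "Results so far:\n" + "\n".join(notes)
            )
            notes.append(f"{step} -> {result}")
        return complete(f"Task: {task}\nSummarize the outcome:\n" + "\n".join(notes))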
A few things that stand out:
The use of MCP to connect CDB with Copilot is genius. Too often, AI tooling is skin-deep—just a chat overlay that guesses at output. You've gone much deeper by wiring actual tool invocations to AI cognition. This feels like the future of all expert tooling.
You nailed the problem framing. It’s not about eliminating expertise—it’s about letting the expert focus on analysis instead of syntax and byte-counting. Having AI interpret crash dumps is like going from raw SQL to a BI dashboard—with the option to drop down if needed.
Releasing it open-source is a huge move. You just laid the groundwork for a whole new ecosystem. I wouldn’t be surprised if this becomes a standard debug layer for large codebases, much like Sentry or Crashlytics became for telemetry.
If Microsoft is smart, they should be building this into VS proper—or at least hiring you to do it.
Curious: have you thought about extending this beyond crash dumps? I could imagine similar integrations for static analysis, exploit triage, or even live kernel debugging with conversational AI support.
Amazing work. Bookmarked, starred, and vibed.
Domain expertise remains crucial though. As complexity increases, you need to provide guidance to the LLM. However, when the model understands specialized tools well - like WinDBG in my experience - it can propose valuable next steps. Even when it slightly misses the mark, course correction is quick.
I've invested quite some time using WinDBG alongside Copilot (specifically Claude in my configuration), analyzing memory dumps, stack frames, and variables, and inspecting third-party structures in memory. While it doesn't solve everything automatically, it substantially enhances productivity.
Consider this as another valuable instrument in your toolkit. I hope tool vendors like Microsoft continue integrating these capabilities directly into IDEs rather than requiring external solutions. This approach to debugging and analysis tools is highly effective, and many already incorporate AI capabilities.
What Copilot currently lacks is the ability to configure custom Agents with specific System Prompts. This would advance these capabilities significantly - though .github/copilot-instructions.md does help somewhat, it's not equivalent to defining custom system prompts or creating a chat participant that enables Agent mode. This functionality will likely arrive eventually.
Other tools already allowing system prompt customization might yield even more interesting results. Reducing how often I need to redirect the LLM could further enhance productivity in this area.
The whole point of this was me chatting with Copilot about a crash dump: I asked it which command to use for some specific task, because I didn't remember, and it suggested commands I could try to investigate further - and I was like, wait, what if I let it do this automatically?
That's basically the whole idea behind it: me being too lazy to keep copy-pasting Copilot's suggestions into WinDBG. What was just a test at first became a proof of concept, and now, almost overnight, it has gotten quite a lot of attention. I am probably as excited as you are.
It knows plenty of arcane commands in addition to the common ones, which is really cool & lets it do amazing things for you, the user.
To the author: most of your audience knows what MCP is; may I suggest adding a tl;dr to help people quickly understand what you've done?
How does it compare to using the Ghidra MCP server?
A disassembler takes compiled binaries and displays the assembly code the machine executes.
A decompiler translates the disassembled code back to pseudocode (e.g. disassembly -> C).
A debugger lets you step through the disassembly. WinDbg is a pretty powerful debugger, but it has the downside of a pretty unintuitive syntax (but I'm biased, coming from gdb/lldb).
Both MCP servers could probably be used together, but they do different things. A neat experiment would be to see if they're aware of each other and can use each other to "vibe reverse".
If you're debugging from a crash dump, you probably have a large, real-world program that actual people have reviewed, deemed correct, and released into the wild.
Current LLMs can't produce a sane program over 500 lines; the idea that they can understand a correct-looking program several orders of magnitude larger, well enough to diagnose and fix a subtle issue that the people who wrote it missed, is absurd.