My thesis is that an intermediate layer would eventually end up being equivalent to the docx format, so I've decided not to have any intermediate representation.
We convert docx to html and send it AI. When AI rewrites the HTML and it back, we diff the rewritten HTML against the docx's document.xml and make the modification. This is a simplistic explanation of it. There are a bunch of validations and processing going on.
Regarding the tracked changes/comments, we simply invent new HTML tags for those things e.g. <ins>, <del>, <commentRangeStart> and etc.