The problem that I ran into was that the Dungeon Master AI was not realizing when I told it to do something I could not do. It would just happily narrate as if the act I just took was completely fine. Spending gold I didn't have, Casting spells I didn't know, etc...
As I dove into that problem and started fixing it, a buddy of mine said something about switching rulesets to RIFTS. I said 'I think I can". It was then that I realized that the system I was developing might just work on any type of data, with any LLM.
I shifted my focus, and began developing. IIt's a system that reads any corpus, generates a set of questions and the correct answers straight from that source, then makes the MUT (the LLM or enterprise endpoint being tested) answer them. It runs the same questions through its pipeline too, and a different vendor's model handles the close calls. THen, it provides a complete report of the audit, including every question and answer pair, the answers given, and a question by question analysis of any incorrect answers, determining the reason the MUT was incorrect. It compiles those collectively, looking for patterns, and then provides an extensive report which lists specific training goals for the MUT based upon the audit.
Although it provides a significant Boost in accurate answers, the pipeline is not all knowing. its limits are set by the model it is testing. I built this solo, on a gaming PC and rented GPU's.
I would really like people's opinions of the output, presentation, and anything else about it that you want to critique. Good or bad, I'll welcome it.
I'm happy to answer any questions if you have them.
Thanks HN community - Brian