This matches my experience trying to stabilize long LangGraph workflows. The regex checks are fine for formatting, but they miss the semantic drift that shows up once you're actually injecting context. The rubric-based approach makes sense, but I'm not sure how a bootstrapped team implements it without a human labeling budget. I've tried using a stronger model to grade the outputs, but the latency overhead is brutal.
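
To make the regex point concrete, here's a minimal sketch of what I mean. The output format, the refund-policy context, and the `passes_format` helper are all made up for illustration, not taken from any real workflow. Both outputs pass the format check, even though the second one contradicts the injected context:

```python
import re

# Hypothetical format rule (an assumption for this example):
# a single line shaped like "STATUS: <APPROVED|REJECTED>; REASON: <text>".
FORMAT = re.compile(r"STATUS: (APPROVED|REJECTED); REASON: .+")


def passes_format(output: str) -> bool:
    """Return True if the output matches the expected shape. Says nothing about meaning."""
    return FORMAT.fullmatch(output) is not None


# Hypothetical context injected into the prompt.
context = "Refund requests over 30 days old must be rejected."

# Consistent with the context.
faithful = "STATUS: REJECTED; REASON: request is 45 days old, past the 30-day window"

# Semantically drifted: well-formed, but it contradicts the context.
drifted = "STATUS: APPROVED; REASON: request is 45 days old, within the 90-day window"

print(passes_format(faithful))  # True
print(passes_format(drifted))   # True
```

The regex only sees shape, so catching the second case needs something that compares the output against the context, which is where the rubric or grader model (and its latency) comes in.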