wizeyone 3 hours ago
Two things. The headline math (0.95^20 ≈ 0.358) assumes independent errors. The body argues the opposite - "every subsequent action operates on flawed foundations."
Real long-chain failure is worse than the math predicts, not equal to it. The headline undersells the problem the article actually describes.
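To make that concrete, here is a rough sketch (toy numbers, not anything from the article). It assumes a 5% chance of a fresh error at each of 20 steps, and compares two cases: errors that stay local to their step, and errors that never self-correct, so every later step inherits them. The function names are made up for illustration. Under this toy model the chance of a fully clean run is the same 0.358 either way; what changes is how much of the chain ends up compromised.

```python
def clean_run_probability(p_step: float, n_steps: int) -> float:
    """P(no step introduces an error), with per-step success rate p_step."""
    return p_step ** n_steps


def expected_flawed_steps_independent(p_step: float, n_steps: int) -> float:
    """Expected flawed steps if an error only affects the step it occurs in."""
    return n_steps * (1 - p_step)


def expected_flawed_steps_propagating(p_step: float, n_steps: int) -> float:
    """Expected flawed steps if an error taints every later step (assumption).

    Step k is flawed when any of steps 1..k introduced an error,
    which happens with probability 1 - p_step**k.
    """
    return sum(1 - p_step ** k for k in range(1, n_steps + 1))


if __name__ == "__main__":
    p, n = 0.95, 20
    print(f"clean run:             {clean_run_probability(p, n):.3f}")              # 0.358
    print(f"flawed steps (local):  {expected_flawed_steps_independent(p, n):.2f}")  # 1.00
    print(f"flawed steps (tainted): {expected_flawed_steps_propagating(p, n):.2f}") # 7.81
```

So with propagation, roughly 8 of the 20 steps are expected to run on a bad foundation, versus 1 if errors stayed local. That gap is what a single 0.95^20 figure doesn't show.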
Also, DTCM's eval is narrative QA across 250 stories: reading comprehension over accumulated context, not agent tool use.
The production failure modes it discusses (wrong tool selection, brittle API contracts, etc.) don't obviously map to that benchmark. The 96% number is encouraging but not directly translatable.