The models are nondeterministic, and therefore it's pretty normal for different runs to give different results.
I don't see this as evidence that Opus 4.6 has gotten worse.
And how is that an excuse?
I don't care about how good a model could be. I care about how good a model was on my run.
Consequently, my opinion of a model is going to be based on its worst performance, not its best.
As such, this qualifies as strong evidence that Opus 4.6 has gotten worse.
Even then, depending on the specific implementation, floating-point associativity can cause differences across batch sizes, across exactly how the KV cache is implemented, etc.
Anyone with reasonable experience in GPU computation who pays attention knows that even randomness in warp completion times can easily lead to non-determinism, because it changes the order of operations and floating-point addition isn't associative.
For instance: https://www.twosigma.com/articles/a-workaround-for-non-deter...
Among practitioners, it is well known that CUDA isn't strongly deterministic because of these factors. Differences in inference batch sizes compound the issue.
Edit: to be more specific, the non-determinism mostly comes from map-reduce-style operations: the map step is deterministic, but the order in which items reach the reduce step (or how elements are arranged in a tree reduce) can be non-deterministic.
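A minimal sketch of why reduction order matters (this is just illustrative CPU Python, not actual CUDA behavior): floating-point addition is not associative, so the same set of values summed in two different orders can produce different results. This is exactly the effect that nondeterministic warp completion order or tree-reduce shape exposes.

```python
# Floating-point addition is not associative: at 1e16 the spacing
# between adjacent doubles is 2.0, so adding 1.0 to 1e16 rounds away.
a, b, c = 1e16, 1.0, -1e16

left = (a + b) + c   # 1e16 + 1.0 rounds back to 1e16, then cancels to 0.0
right = (a + c) + b  # cancellation happens first, so the 1.0 survives

print(left)   # 0.0
print(right)  # 1.0
```

In a parallel reduction, which of these orderings you get depends on how threads are scheduled, so bitwise-identical inputs need not produce bitwise-identical sums.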
Come on Anthropic, admit what you're doing already and let us access your best models unhindered, even if it costs us more. At the moment we just all feel short-changed.