I just stumbled on this site, which monitors how stable the performance of various models is over time, e.g. ChatGPT Codex 4.3. Some models seem to fluctuate in performance, probably because of dynamic reallocation of compute budgets or something similar. Fairly interesting stuff, and it gives credence to the idea that the same model performs differently on different days, and that some models (e.g. ChatGPT Codex 5.2) are more consistent than newer ones (e.g. ChatGPT 5.4).