That being said: I feel that there must be some kind of benchmark for this. If no such benchmark exists, use your framework, pair up with a couple of pharmacists, and create one.
Nothing public exists at the level of "here's a real multi-component formulation, here's what happened when it was made." Every consumer packaged goods (CPG) and pharma company has thousands of these records, but they're locked in internal R&D databases.
I've started building one (FormulaBench) with defined splits and baselines on the public data that does exist, but you're right that the real version needs domain collaborators. If you know pharmacists or formulation scientists who'd be interested in contributing, I'd genuinely welcome the introduction.