There are already examples of this in the wild: language and vision models not just performing scientific experiments, but coming up with new hypotheses on their own, designing experiments from scratch, laying out plans for how to carry those experiments out, instructing human helpers to execute them, gathering data, and validating or invalidating the hypotheses.
The open question is whether we can derive a process, assemble the data, and train models such that they can 1. detect when a task or question falls outside the training distribution, and 2. come up with a process for exploring that new task or question distribution such that they (eventually) arrive at an acceptable answer, if not a good one.
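The first half of that question, detecting that an input is out of distribution, is the more studied one. As a minimal sketch, and only under the assumption of a generic classifier that exposes its logits, one common approach is an energy-style score: take the negative log-sum-exp of the logits and flag inputs where the model assigns little probability mass anywhere. The threshold below is a made-up value for the toy example, not a calibrated one.

```python
import numpy as np

def ood_score(logits: np.ndarray) -> float:
    """Energy-style OOD score: the negative log-sum-exp of the logits.
    Higher (less negative) values mean the model is uncertain everywhere,
    which is one common signal that the input is out of distribution."""
    m = logits.max()
    return -(m + np.log(np.sum(np.exp(logits - m))))

def is_out_of_distribution(logits: np.ndarray, threshold: float = -5.0) -> bool:
    """Flag an input as OOD when its energy score exceeds a threshold.
    The -5.0 default is purely illustrative; in practice the threshold
    would be calibrated on held-out in-distribution data."""
    return ood_score(logits) > threshold

# Toy usage: a confident prediction vs. a flat, uncertain one.
confident = np.array([9.0, 0.5, 0.3, 0.1])
uncertain = np.array([0.2, 0.1, 0.3, 0.25])
print(is_out_of_distribution(confident))  # False: score around -9
print(is_out_of_distribution(uncertain))  # True: score around -1.6
```

The second half, coming up with a process for exploring the new distribution once it has been detected, is far less settled, which is why the sketch above covers only detection.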