Honestly, you could probably get better results by simply prompting a frontier model. I wasn’t optimizing for results: I used the smallest model, the cheapest GPU, and synthetic data. But it was a fun learning exercise in how you can actually run RL to improve models without a huge investment, using current advances in AI, tooling, and infrastructure.
Overall, training the model cost me less than $2 on Colab's pay-as-you-go plan, which was much less than I expected.
The example notebook is on GitHub, so feel free to give it a try on your own free plan!