I was tired of hand-writing JSONL for my Qwen fine-tunes, so I built DataForge. It's a Python framework that generates structured training data from tool schemas — completely deterministic, no API calls needed.
What it does:
You define tool schemas (JSON) + data pools → it generates SFT conversations with tool calls DPO preference pairs from contrastive ranking Anti-template explosion detection (Bloom filter + trigram analysis) Quality gates (configurable thresholds, not vibes) Streaming generation, constant RAM — tested up to 100K examples Output: OpenAI/ShareGPT/ChatML format, ready for trl or axolotl Two working examples included (restaurant assistant, customer support) — ~600 SFT + 60 DPO each, runnable out of the box.
pip install -e . → dataforge generate --config config.yaml → dataset ready.