5 points by monatis 2 hours ago | 2 comments
  • monatis 2 hours ago
    We kept running into the same exact bottleneck with fine-tuning and evals: You have the source documents, and you have the base model, but you usually don’t have the actual conversations.

    If you’re working with internal docs, regulatory text, or technical manuals, there’s plenty of material but zero multi-turn chat logs. And flattening this into standard instruction/response pairs creates models that sound like templates, failing to capture how users actually ask for clarification or push back.

    So we open-sourced a small, opinionated library called AfterImage.

    It generates synthetic multi-turn conversations grounded in a corpus you provide. The architecture is straightforward:

    - A simulated user ("Correspondent") with optional persona variation
    - A simulated assistant ("Respondent")
    - Both strictly grounded via sampled source material
    - Outputs directly to JSONL for your SFT (Supervised Fine-Tuning) / eval pipelines
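    To make the two-agent setup concrete, here is a minimal, self-contained sketch of the idea: a Correspondent and a Respondent both grounded in the same sampled chunk, emitting one JSONL record per conversation. The function names and the stubbed turn logic are hypothetical stand-ins for LLM calls, not AfterImage's actual API.

```python
import json
import random

def correspondent_turn(chunk: str, history: list[dict]) -> str:
    # Hypothetical stand-in for an LLM call: the simulated user
    # formulates a question about the sampled source chunk.
    return f"Can you explain this part: {chunk[:40]}...?"

def respondent_turn(chunk: str, history: list[dict]) -> str:
    # Hypothetical stand-in for an LLM call: the simulated assistant
    # answers strictly from the same grounding chunk.
    return f"Based on the source: {chunk}"

def generate_dialogue(corpus: list[str], turns: int = 2, seed: int = 0) -> dict:
    rng = random.Random(seed)
    chunk = rng.choice(corpus)  # ground both agents in one sampled chunk
    messages: list[dict] = []
    for _ in range(turns):
        messages.append({"role": "user", "content": correspondent_turn(chunk, messages)})
        messages.append({"role": "assistant", "content": respondent_turn(chunk, messages)})
    return {"messages": messages}

corpus = ["The warranty covers parts for 24 months.", "Returns require an RMA number."]
record = generate_dialogue(corpus, turns=2)
print(json.dumps(record))  # one JSONL line per generated conversation
```

    The `{"messages": [...]}` shape is the common conversational SFT format; in practice you would append one such line per dialogue to a `.jsonl` file.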

    *Why build this?* The narrow bet here is that multi-turn dialogue is its own distinct data problem. There are already great general synthetic data tools (distilabel, synthetic-data-kit); we aren't competing with them. AfterImage prioritizes a composable design where generation can be customized with callbacks: you can connect it to different data sources (local files, Qdrant collections), choose retriever strategies for RAG, or pick aggregation methods for composite evaluation.
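    The composability point can be sketched as passing a retriever callable into the response step, so swapping strategies (local files, a Qdrant collection, etc.) needs no changes to the core loop. The names and the naive keyword retriever below are illustrative assumptions, not AfterImage's real interface.

```python
from typing import Callable

Retriever = Callable[[str], list[str]]

def keyword_retriever(corpus: list[str]) -> Retriever:
    # Hypothetical retriever strategy: rank documents by naive
    # keyword overlap with the query and keep the top two.
    def retrieve(query: str) -> list[str]:
        q = set(query.lower().split())
        scored = sorted(corpus,
                        key=lambda d: len(q & set(d.lower().split())),
                        reverse=True)
        return scored[:2]
    return retrieve

def respond(query: str, retrieve: Retriever) -> str:
    # The assistant's context is whatever the pluggable retriever
    # returns, which is the composability hook described above.
    context = "\n".join(retrieve(query))
    return f"[answer grounded in]\n{context}"

docs = ["Invoices are due in 30 days.",
        "Refunds take 5 business days.",
        "Support is 24/7."]
print(respond("when are invoices due", keyword_retriever(docs)))
```

    A vector-store-backed retriever (e.g. over a Qdrant collection) would slot into the same `Retriever` signature.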

    *A few honest caveats:*

    - We don't have a strong published benchmark yet (semantic similarity only so far).
    - Quality noticeably degrades/loops as conversations get long (more than ~5 turns). Luckily, one to three turns is more than enough for most SFT cases.

  • efecnc 2 hours ago
    Simulating a user that actually sounds real is definitely the hardest part of this. Curious how you're handling the chunking and retrieval under the hood here.

    Does the 'user' agent get fed a specific chunk of text to formulate its questions, and does the 'assistant' agent get that exact same chunk to reply? If they're both looking at the identical text, have you thought about injecting some noise or unrelated distractor chunks into the assistant's context? Might be a solid way to make the resulting SFT data more robust against hallucinations.
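    The distractor-injection idea suggested here can be sketched in a few lines: mix the gold grounding chunk with unrelated chunks and shuffle, so the assistant must locate the relevant passage instead of parroting the only text it sees. This is a sketch of the commenter's suggestion, not AfterImage's actual behavior; all names are hypothetical.

```python
import random

def build_assistant_context(gold_chunk: str, corpus: list[str],
                            n_distractors: int = 2, seed: int = 0) -> list[str]:
    # Combine the grounding chunk with randomly sampled distractor
    # chunks, then shuffle so its position carries no signal.
    rng = random.Random(seed)
    pool = [c for c in corpus if c != gold_chunk]
    context = [gold_chunk] + rng.sample(pool, min(n_distractors, len(pool)))
    rng.shuffle(context)
    return context

corpus = ["A", "B", "C", "D"]
ctx = build_assistant_context("B", corpus, n_distractors=2)
print(ctx)  # "B" plus two shuffled distractors
```

    Only the user agent would see the gold chunk alone; the assistant's prompt would contain the full `ctx`, which should make the resulting SFT data more robust against hallucination.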

    • monatis 2 hours ago
      Yeah, this is one possible way to generate grounded "responses" in AfterImage. For context augmentation when generating a response, it lets you use different RAG strategies, where the retriever can be chosen for the specific use case at hand. This is where composability comes into play.