  • cerru905, 5 hours ago
    I kept getting annoyed by LLM inference non-reproducibility, and one thing that really surprised me was that changing the batch size can change outputs even under “deterministic” settings.
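
    To make the batch-size effect concrete, here is a rough sketch of the kind of check I mean (the model name and prompt are just placeholders, and whether the diff is nonzero depends on your hardware and kernels):

        # Sketch: same prompt, batch of 1 vs. batch of 2, greedy settings.
        # "gpt2" is a placeholder; any HF causal LM shows the effect on many GPU stacks.
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        torch.manual_seed(0)
        tok = AutoTokenizer.from_pretrained("gpt2")
        model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

        ids = tok("The capital of France is", return_tensors="pt").input_ids

        with torch.no_grad():
            logits_b1 = model(ids).logits[0, -1]                # batch size 1
            logits_b2 = model(ids.repeat(2, 1)).logits[0, -1]   # same row inside a batch of 2

        # Batched matmuls can pick different kernels / reduction orders,
        # so these are often not bit-identical even with a fixed seed.
        print("max abs diff:", (logits_b1 - logits_b2).abs().max().item())
        print("same next token:", logits_b1.argmax().item() == logits_b2.argmax().item())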

    So I built DetLLM: it measures and proves repeatability using token-level traces + a first-divergence diff, and writes a minimal repro pack for every run (env snapshot, run config, applied controls, traces, report).
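
    For context, the first-divergence diff is conceptually just “find the first token position where two traces disagree.” A toy sketch of that idea (not DetLLM’s actual code; the token IDs below are made up):

        from typing import Optional

        def first_divergence(trace_a: list[int], trace_b: list[int]) -> Optional[int]:
            """Index of the first token where two traces differ, or None if identical.
            If one trace is a strict prefix of the other, divergence is at the shorter length."""
            for i, (a, b) in enumerate(zip(trace_a, trace_b)):
                if a != b:
                    return i
            return None if len(trace_a) == len(trace_b) else min(len(trace_a), len(trace_b))

        # Two runs of the "same" prompt drifting at position 3 (made-up token IDs).
        run_a = [464, 3139, 286, 4881, 318]
        run_b = [464, 3139, 286, 1578, 318]
        idx = first_divergence(run_a, run_b)
        print(f"first divergence at token {idx}: {run_a[idx]} vs {run_b[idx]}")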

    I prototyped this version today in a few hours with Codex. The hardest part was the high-level design (HLD) I did a few days ago; I was honestly surprised by how well Codex handled the implementation, and I didn’t expect it to come together in under a day.

    Would love feedback, and to hear about any prompts/models/setups that still make it diverge.