Aggregated by date:
2026 3
2025 7
2024 1
2023 3
2022 1
2021 1
2019 1
1984 1
Given the time needed to gather data, write it up, and go through the peer review process, this is about what I would expect for up-to-date empirical findings combined with some background base understanding. Can you suggest some better and more recent empirical findings?
Which ones are wildly outdated?
- Stray2026 covers a “two-year period” of commits. The paper was submitted in September 2025 and revised in January 2026. “Vibe coding” at that time (so from 2023-2025) was still mostly pasting code from chat windows into your IDE or accepting autocomplete suggestions.
- He2026 covers a similar timeframe (submitted November 2025, revised January 2026) and focuses entirely on Cursor, which at that time was a very different product: code completion and inline chat prompts rather than agentic back-and-forth with extensive tool use and autonomy. Again, very different from what reality looks like currently.
- Becker2025 explicitly evaluated Claude 3.5/3.7 Sonnet, an entire generation removed from the current state-of-the-art.
- Xu2025 and Bakal2025 say they evaluated “GitHub Copilot”, which isn’t an AI model but an AI router. I couldn’t find any data on whether they actually checked which models the developers’ requests ended up going to.
The point is that there is no recent empirical data because by the time a rigorous study is ready to publish, the industry and its capabilities have already moved on dramatically. The truth is that, as of right now, anyone claiming to have empirical proof of either slowdowns or efficiency gains is wrong. There is no way to tell.