4 points by Facens 6 hours ago | 2 comments
  • potter098 4 hours ago
    The weighted-diff idea is more interesting than raw LoC, but the real challenge is separating throughput from rework. A team can look 'more productive' simply because the agent helped them generate more change volume, even if review burden or rollback risk also went up. The metric gets a lot more credible if you pair it with something like review acceptance rate, revert rate, or time-to-merge stability rather than presenting one weighted number alone.
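    One of those guardrails (revert rate) is cheap to pull straight from git history. A rough sketch in Python, assuming a local clone; `revert_rate` is a hypothetical helper, and matching "Revert" commit subjects is only a proxy that misses fix-forward rework:

```python
import subprocess

def revert_rate(repo: str, since: str = "90 days ago") -> float:
    """Rough rework proxy: share of recent commits whose subject
    marks a revert. Pair with review data for a fuller picture."""
    def count(*extra: str) -> int:
        out = subprocess.run(
            ["git", "-C", repo, "rev-list", "--count",
             f"--since={since}", *extra, "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        return int(out or 0)

    total = count()
    # `git revert` writes subjects like: Revert "feat: ..."
    reverts = count("--grep=^Revert ")
    return reverts / total if total else 0.0
```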
  • Facens 6 hours ago
    TLDR: I made a Claude Code plugin to measure coding productivity.

    This helped us measure a productivity improvement of over 70% in iubenda's dev team. Rationale and details follow.

    For the past year I've been pretty obsessed with using AI for productivity improvement and I've been running initiatives to increase AI adoption within iubenda and team.blue, particularly amongst developers.

    The challenge was measuring the results, and I saw problems with the most commonly used methods:

    - % of developers using Claude Code: largely uninformative; it just tells you who is using it. Fine for an initial rollout, but it gives no sense of what the productivity gain actually is. It's the kind of "tick the box" approach that leaves many companies with very superficial AI adoption.
    - Number of MRs / PRs: not the worst metric, but unreliable, since teams and developers differ in contribution style (few large changes vs. many small ones), so more or fewer MRs / PRs doesn't necessarily mean a more or less productive team.
    - Story points: not all teams use them, and story-point scoring is a qualitative, subjective process. It also requires tracking story points across MRs / PRs / commits, which is complex: very few teams have a truly deterministic link between their git repo and their task-management tool, so gaps in data coverage make this method unreliable even for teams that do use story points.
    - Lines of code changed: I really like the objectivity of this metric, and if a team's code verbosity and mix of change types (tests, translations, updates, comments, refactors, genuinely new code) stay constant, it isn't bad at all. In our tests, though, it still showed huge variability, with large refactors or wide but low-value changes skewing the numbers completely.

    Several weeks into the rabbit hole, I landed on using lines of code changed, BUT scoring them with Haiku. In essence, the plugin will:

    - Download all diffs from all repos you select, across all branches, and deduplicate them to avoid double-counting merge commits
    - Score each file diff with Haiku, assigning it a weight: e.g. zero for a trivial file change, low for a translation change, low or zero for a library update, high for a genuine code change or refactor (this can also act as a code verbosity index)
    - Calculate a sort of "weighted lines of code" metric that you can plot over time to measure productivity improvements
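    The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the plugin's actual code: the Haiku call is stubbed out as a `score` callback returning a 0..1 weight, and `git patch-id` stands in for the merge-commit dedup step:

```python
import subprocess

def patch_id(repo: str, commit: str) -> str:
    """Content-based dedup key: identical diffs (e.g. the same change
    seen again through a merge) yield the same `git patch-id`."""
    diff = subprocess.run(
        ["git", "-C", repo, "show", "--format=", commit],
        capture_output=True, text=True, check=True,
    ).stdout
    out = subprocess.run(
        ["git", "-C", repo, "patch-id", "--stable"],
        input=diff, capture_output=True, text=True, check=True,
    ).stdout
    return out.split()[0] if out else ""

def weighted_loc(file_diffs, score) -> float:
    """Sum added/removed lines per file diff, each scaled by the
    model-assigned weight (0 = noise, 1 = genuine code change)."""
    total = 0.0
    for file_diff in file_diffs:
        changed = sum(
            1 for line in file_diff.splitlines()
            if line.startswith(("+", "-"))
            and not line.startswith(("+++", "---"))
        )
        total += changed * score(file_diff)
    return total
```

    With `score` fixed at 1.0 this degrades to plain LoC changed; the whole metric lives in how the weights are assigned.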

    Scoring is very cheap, at around $7 per thousand commits.

    The plugin also has a number of other features, like creating reports, anonymizing developers with local hashing, and the option to use BigQuery to share the database across a team.
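    The anonymization step could look something like keyed local hashing. A sketch of the idea (my naming, not the plugin's actual implementation): the salt never leaves the machine, so pseudonyms are stable locally but not reversible by anyone who only sees the shared database:

```python
import hashlib
import hmac

def anonymize(author_email: str, salt: bytes) -> str:
    """Stable local pseudonym: same email + same salt -> same short ID.
    Normalizes case so a@x.com and A@X.com map to one developer."""
    digest = hmac.new(salt, author_email.lower().encode(), hashlib.sha256)
    return digest.hexdigest()[:12]
```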

    I'm publishing it so you can grill me on the methodology, cross-check it, find bugs, you name it. All contributions are welcome.

    • apothegm 5 hours ago
      Really? We’re back to using LoC as a metric? Have we learned absolutely nothing in the past 50 years?

      Oh, never mind, we already know the answer to that…