4 points by danoandco 6 hours ago | 2 comments
  • danoandco 6 hours ago
    OpenAI published an article and demo for scoring how well AI agents can work in a codebase (https://openai.com/index/harness-engineering/, https://www.youtube.com/watch?v=rhsSqr0jdFw). We turned it into a free tool anyone can use.

    Paste any public GitHub repo (or connect a private one) and get a live score across seven dimensions: bootstrap setup, task entry points, test harnesses, lint gates, agent docs, structured documentation, and decision records. It clones the repo, runs static analysis, and scores each dimension 0-3 with evidence pulled from actual files. Takes about 60 seconds.
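
    To make the scoring idea concrete, here is a minimal sketch of a file-presence check that scores each dimension 0-3 and keeps the matching paths as evidence. The file names in RUBRIC and the "one point per matching file" rule are assumptions for illustration, not the tool's actual rubric or analysis.

```python
from pathlib import Path

# Hypothetical rubric: dimension -> paths whose presence counts as evidence.
# These candidates are illustrative assumptions, not the real tool's checks.
RUBRIC = {
    "bootstrap_setup": ["Makefile", "setup.sh", "docker-compose.yml", ".devcontainer"],
    "task_entry_points": ["Makefile", "package.json", "justfile"],
    "test_harnesses": ["tests", "test", "pytest.ini", "jest.config.js"],
    "lint_gates": [".pre-commit-config.yaml", ".eslintrc.json", "ruff.toml"],
    "agent_docs": ["AGENTS.md", "CLAUDE.md"],
    "structured_documentation": ["docs", "README.md"],
    "decision_records": ["docs/adr", "docs/decisions"],
}

MAX_SCORE = 3  # each dimension is scored 0-3


def score_dimension(repo: Path, candidates: list[str]) -> tuple[int, list[str]]:
    """Return (score, evidence): one point per candidate path found, capped at 3."""
    evidence = [c for c in candidates if (repo / c).exists()]
    return min(len(evidence), MAX_SCORE), evidence


def score_repo(repo: Path) -> dict[str, tuple[int, list[str]]]:
    """Score every dimension of an already-cloned repo on disk."""
    return {dim: score_dimension(repo, cands) for dim, cands in RUBRIC.items()}
```

    Calling score_repo(Path("path/to/clone")) returns a score and the list of matching paths for each of the seven dimensions. A real scorer would presumably read file contents too, not just check that they exist.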

    Some repos we scored:

    PostHog: https://twill.ai/score/fd033516-628b-4c7c-8db6-d84e3f2737ba

    Supabase: https://twill.ai/score/b2825715-6c3d-4de1-a21b-fc5d9b17103b

    Codex: https://twill.ai/score/d7372d95-0501-4ad3-ae90-8f112ccafee0

    The pattern we keep seeing: most repos lose points on agent-specific docs and decision records. Everything else tends to be decent.

    We built this scorecard as a free tool because agent performance is bounded by repo structure, not just model quality.

    Would love to hear what scores people get, and whether the rubric is missing anything.

  • RoxaneFischer1 4 hours ago
    not sure about the decision records. seems ideal but no one does that in practice
    • danoandco 4 hours ago
      true, i think the key thing is explaining somewhere in the repo "why" something was done. like the rationale for choosing service X over service Y, for instance.

      maybe this record is just the git log, and the agent just needs to access the git log.

      we'll see how that matures over time
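
      As a rough sketch of the "git log as decision record" idea, an agent could pull recent commit subjects and bodies for a path and treat them as rationale. The function names and the record shape here are hypothetical; the only external dependency is the standard git CLI (git log with --max-count and a --format string using the %H, %s, %b, and %xNN placeholders).

```python
import subprocess
from typing import Optional

FIELD_SEP = "\x1f"   # ASCII unit separator between fields
RECORD_SEP = "\x1e"  # ASCII record separator between commits


def parse_log(raw: str) -> list[dict[str, str]]:
    """Split raw `git log` output into {sha, subject, body} records."""
    records = []
    for chunk in raw.split(RECORD_SEP):
        # Strip only newlines: str.strip() would also eat the separators.
        chunk = chunk.strip("\n")
        if not chunk:
            continue
        sha, subject, body = chunk.split(FIELD_SEP, 2)
        records.append({"sha": sha, "subject": subject, "body": body.strip()})
    return records


def read_rationale(repo_path: str, path: Optional[str] = None,
                   limit: int = 20) -> list[dict[str, str]]:
    """Return the most recent commits (optionally touching `path`)."""
    cmd = [
        "git", "-C", repo_path, "log",
        f"--max-count={limit}",
        "--format=%H%x1f%s%x1f%b%x1e",  # sha, subject, body, record end
    ]
    if path is not None:
        cmd += ["--", path]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_log(out)
```

      This only helps if commit bodies actually explain the "why"; a log full of one-line subjects gives the agent little rationale to work with.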