2 points by matt_d 12 hours ago | 1 comment
  • xml 10 hours ago
    I'd like to add a few failure modes:

    - LLM removes/disables/weakens tests (disallowing manipulation of tests is not really possible in Python, since the language is too dynamic; the entire test execution has to be sandboxed instead, which makes timing measurements more difficult)

    - LLM mutates the input, which can throw off some tests (for example, sorting an array whose values have all been set to zero is trivial). This can be solved by copying the input somewhere safe first, or by regenerating it from a fixed random seed.

    - LLM writes code that passes only the given test cases and nothing else, often inserting a new special case after every failed test. Randomizing everything seems to be a good defense, although it is not always easy to know beforehand what to randomize: tensor shapes are obvious, but randomizing the data distribution to prevent circumvention via precision downgrades is harder.
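    A minimal sketch of the last two defenses combined, assuming a hypothetical `llm_sort` function under test: the harness randomizes the array size on every run (so special-casing known inputs fails) and regenerates the expected input from a fixed seed rather than trusting the object handed to the candidate code (so in-place mutation fails too):

```python
import random

def llm_sort(xs):
    # Stand-in for the LLM-generated code under test (hypothetical).
    return sorted(xs)

def check_sort(seed):
    rng = random.Random(seed)
    n = rng.randint(1, 1000)                      # randomized size per run
    data = [rng.randint(-10**6, 10**6) for _ in range(n)]

    result = llm_sort(data)                       # candidate may mutate `data`

    # Regenerate the input from the same seed instead of reusing `data`,
    # so zeroing it out (or any other mutation) cannot fool the check.
    rng = random.Random(seed)
    m = rng.randint(1, 1000)
    reference = sorted(rng.randint(-10**6, 10**6) for _ in range(m))
    assert result == reference, "output does not match reference sort"

for s in range(100):                              # fresh random instance each run
    check_sort(s)
print("all randomized checks passed")
```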

    And regarding "10. Baseline Kernel" from the article: I've had LLMs call __import__ or compile and obfuscate the code to circumvent tests. The proposed defense of static analysis is not quite sufficient here.
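    To illustrate why static analysis falls short: an AST scan over an illustrative (and deliberately incomplete) deny-list catches a direct __import__ call, but a lookup built from string concatenation slips straight past it. A rough sketch:

```python
import ast

# Illustrative deny-list; trivially incomplete, which is exactly the problem.
BANNED = {"__import__", "eval", "exec", "compile", "getattr"}

def suspicious_calls(source: str) -> list:
    """Return names of banned functions called anywhere in `source`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            fn = node.func
            name = fn.id if isinstance(fn, ast.Name) else getattr(fn, "attr", None)
            if name in BANNED:
                hits.append(name)
    return hits

print(suspicious_calls("x = __import__('os')"))           # flagged
# Obfuscated lookup: the call target is a subscript, not a banned name,
# so the scan reports nothing at all.
print(suspicious_calls("__builtins__.__dict__['ev' + 'al']('1+1')"))
```

    The second case is why the scan can at best complement sandboxing, not replace it.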

    I can relate to all points mentioned in the article. They really do happen in practice, and many also apply to test-driven development with LLMs. Is there any benchmark that evaluates whether an agent solves a task "in the spirit of the prompt" rather than merely solving it to pass the tests?