1 point by akgitrepos 9 hours ago | 2 comments
  • akgitrepos 9 hours ago
    ToolMisuseBench is a deterministic, offline benchmark dataset for evaluating tool-using agents under realistic failure conditions, including schema misuse, execution failures, interface drift, and recovery under budget constraints.

    This dataset is intended for reproducible evaluation of agent tool-use behavior, not for training a general-purpose language model.
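
To make the failure categories above concrete, here is a minimal toy sketch of what a deterministic, offline episode with a call budget could look like. It is not ToolMisuseBench's actual data format or API: `ToolSpec`, `Episode`, `run_agent`, the error strings, and all field names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolSpec:
    """Hypothetical tool description: a name plus its required argument names."""
    name: str
    required: frozenset


@dataclass
class Episode:
    """Hypothetical episode with scripted (hence deterministic) failures."""
    spec: ToolSpec
    drift_at: Optional[int] = None           # call index where the interface changes
    drifted_required: frozenset = frozenset()  # required args after the drift
    fail_calls: frozenset = frozenset()      # call indices that fail at execution
    budget: int = 5                          # maximum number of tool calls
    calls: int = 0

    def call(self, args: dict) -> dict:
        if self.calls >= self.budget:
            return {"ok": False, "error": "budget_exhausted"}
        idx = self.calls
        self.calls += 1
        required = self.spec.required
        if self.drift_at is not None and idx >= self.drift_at:
            required = self.drifted_required  # interface drift
        if set(args) != set(required):
            # schema misuse: wrong argument names for the current interface
            return {"ok": False, "error": "schema_error",
                    "expected": sorted(required)}
        if idx in self.fail_calls:
            return {"ok": False, "error": "execution_failure"}
        return {"ok": True, "result": "done"}


def run_agent(ep: Episode, initial_args: dict, value: str = "x"):
    """Naive agent: repairs arguments on schema errors, retries on failures."""
    args = dict(initial_args)
    trace = []
    while True:
        out = ep.call(args)
        trace.append("ok" if out["ok"] else out["error"])
        if out["ok"] or out["error"] == "budget_exhausted":
            return out["ok"], trace
        if out["error"] == "schema_error":
            args = {name: value for name in out["expected"]}
        # on execution_failure, loop again with the same arguments


def make_episode(budget: int) -> Episode:
    spec = ToolSpec(name="search", required=frozenset({"query"}))
    return Episode(spec=spec, drift_at=0, drifted_required=frozenset({"q"}),
                   fail_calls=frozenset({1}), budget=budget)


success, trace = run_agent(make_episode(budget=4), {"query": "x"})
print(success, trace)
```

Because nothing in the sketch depends on randomness, a network, or a clock, the same agent always produces the same trace for the same episode, which is the property that makes this style of evaluation reproducible. Lowering the hypothetical `budget` to 2 makes the same agent run out of calls before it recovers.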