OpenAPI spec indexing is a good idea. Semantic search is fine for general API questions, but it often sucks at specific ones, like the exact requirements for a field. We've built a lot of connectors at my company and have hit this problem: the agent makes up arguments or misses required fields and types because it's inferring too much instead of checking against an actual schema. I think benchmarking per-endpoint correctness (did the agent construct a valid request on the first try?) would be the most useful thing to eval.
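
To make the eval idea concrete, here is a rough sketch of a first-try validity check. Everything in it is an assumption for illustration: the `CREATE_USER_SCHEMA` fragment, the `POST /users` endpoint, and the helper names are hypothetical, and the checker only covers required fields, unknown fields, and primitive types. A real harness would more likely run a full JSON Schema validator against the indexed spec.

```python
# Hypothetical schema fragment, shaped like an OpenAPI request-body schema.
CREATE_USER_SCHEMA = {
    "required": ["email", "age"],
    "properties": {
        "email": {"type": "string"},
        "age": {"type": "integer"},
        "nickname": {"type": "string"},
    },
}

# Map JSON Schema primitive type names to Python types.
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def validation_errors(body, schema):
    """Return a list of problems with a request body (empty list means valid)."""
    errors = []
    props = schema.get("properties", {})

    # Required fields the agent left out.
    for name in schema.get("required", []):
        if name not in body:
            errors.append(f"missing required field: {name}")

    for name, value in body.items():
        # Assumption: fields not in the schema count as made-up arguments.
        if name not in props:
            errors.append(f"unknown field: {name}")
            continue
        expected = props[name].get("type")
        py_type = JSON_TYPES.get(expected)
        if py_type is None:
            continue  # no primitive type declared, nothing to check here
        # bool is a subclass of int in Python, so reject it for numeric types.
        is_bool_as_number = isinstance(value, bool) and expected in ("integer", "number")
        if is_bool_as_number or not isinstance(value, py_type):
            errors.append(f"wrong type for {name}: expected {expected}")
    return errors


def first_try_valid_rate(attempts, schemas):
    """attempts: list of (endpoint, body) pairs, one per first attempt."""
    counts = {}
    for endpoint, body in attempts:
        ok = not validation_errors(body, schemas[endpoint])
        passed, total = counts.get(endpoint, (0, 0))
        counts[endpoint] = (passed + int(ok), total + 1)
    return {ep: passed / total for ep, (passed, total) in counts.items()}


if __name__ == "__main__":
    schemas = {"POST /users": CREATE_USER_SCHEMA}
    attempts = [
        ("POST /users", {"email": "a@b.c", "age": 30}),
        ("POST /users", {"email": "a@b.c", "age": "30", "plan": "pro"}),
    ]
    print(first_try_valid_rate(attempts, schemas))
```

The per-endpoint breakdown is the point: a single aggregate score would hide which endpoints the agent keeps getting wrong.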