80% of the benchmark questions are aggregations, 16% are multi-hop, 4% are lookups/subqueries.
Multi-hop is where LLMs struggle the most (hallucinations, partial answers), and aggregations are where you get the biggest token savings, since you skip the pagination you'd otherwise need with APIs/MCPs that don't provide filters.
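To make the token-efficiency point concrete, here's a rough sketch (all names and numbers hypothetical, not the benchmark's actual code): answering "total revenue" via a filter-less API means paging every record through the model's context, while a queryable layer returns one row from a single aggregation.

```python
# Hypothetical comparison: answering "total revenue" via a
# paginated, filter-less API vs. a single SQL aggregation.

# With a queryable layer, one aggregation returns one row:
SQL_ANSWER = "SELECT SUM(amount) FROM charges;"

def total_via_pagination(fetch_page, page_size=100):
    """Sum amounts by paging through an API with no filters.
    Every page's payload would land in the model's context."""
    total, page = 0.0, 0
    while True:
        records = fetch_page(page, page_size)
        if not records:
            return total
        total += sum(r["amount"] for r in records)
        page += 1

# Fake in-memory "API" standing in for a filter-less endpoint:
DATA = [{"amount": 10.0} for _ in range(1050)]
def fetch_page(page, size):
    return DATA[page * size:(page + 1) * size]

print(total_via_pagination(fetch_page))  # 10500.0, after 11+ round-trips
```

The pagination path costs a round-trip and a full payload per page; the aggregation path costs one query and one row, which is where the token savings come from.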
The annotation/semantic layer agent writes a fresh description of the schema on every sync, representing its current state. As of today that description still includes stale columns, since data is never dropped.
I’ll implement automated schema migrations in the next week or so!
There are 75 questions, divided into 5 use-case groups: revenue ops, e-commerce, knowledge bases, devops, and support.
I then generated a synthetic dataset with data mimicking APIs ranging from Stripe to Hubspot to Shopify to Zendesk, etc.
I expose all the data through Dinobase rather than having one MCP per source, e.g. one MCP for Stripe data, one for Hubspot data, and so on.
I tested this with 11 models, ranging from Kimi 2.5 to Claude Opus 4.6.
Finally, an LLM-as-a-judge decides whether the answer is correct, and I log latency and token usage.
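For reference, a minimal sketch of what an LLM-as-a-judge step can look like (the prompt, client, and function names here are my own illustration, not the benchmark's actual implementation):

```python
# Minimal LLM-as-a-judge sketch. `call_llm` is a hypothetical
# stand-in for whatever model client the harness uses.
import json

JUDGE_PROMPT = """You are grading an answer to a data question.
Question: {question}
Ground-truth answer: {expected}
Model answer: {actual}
Reply with JSON: {{"correct": true or false, "reason": "..."}}"""

def judge(call_llm, question, expected, actual):
    """Ask a judge model whether the answer matches, returning the
    verdict plus a reason to log alongside latency and tokens."""
    raw = call_llm(JUDGE_PROMPT.format(
        question=question, expected=expected, actual=actual))
    verdict = json.loads(raw)
    return bool(verdict["correct"]), verdict.get("reason", "")

# Stub "judge model" for demonstration: naive string matching.
def fake_llm(prompt):
    model_part = prompt.split("Model answer:")[1]
    return json.dumps({"correct": "$12,340" in model_part,
                       "reason": "string match"})

ok, why = judge(fake_llm, "Total revenue in March?", "$12,340", "$12,340")
print(ok)  # True
```

A real judge would use a strong model with a structured-output constraint; the value of routing through a function like this is that the verdict and reason can be logged next to the latency and token counts.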