Social Intelligence Benchmark (gertlabs.com)

5 points by gertlabs 6 hours ago | 1 comment

big-chungus4 5 hours ago
I've been running stuff like this too. I ran one "benchmark" with 10 agents, where each agent initially knows only the name of the next agent in the list. The goal is for each agent to end up with a unique rank from 1 to n assigned to it, where n is the number of agents (also initially unknown to them). They are invoked in a random order and can only message one other agent per step.
Qwen3.5-9B was able to do this after a lot of time.
Qwen3.6-35B-A3B failed because it kept insisting that it needed to know n, but didn't try to figure it out by messaging other agents.
Granite 4.1 9B failed completely because it just wrote non-descriptive messages like "Request to know the order" to the other agents and didn't reply to anyone.
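
For anyone who wants to try the setup, here is a rough sketch of what such a harness could look like, with a scripted (non-LLM) baseline agent standing in for the model. Everything in it is an assumption rather than the actual benchmark code: the list is treated as a ring (the last agent's "next" is the first), messages are plain lists of names, each round invokes every agent once in shuffled order, and the names `Agent`, `step`, and `run` are made up for illustration. The baseline passes a token around the ring until it returns, which reveals n, and then takes its rank from the sorted list of names.

```python
import random
from collections import deque


class Agent:
    """Scripted baseline (hypothetical): forwards name-lists around the ring."""

    def __init__(self, name, next_name):
        self.name = name
        self.next_name = next_name        # the only name known at the start
        self.outbox = deque([[name]])     # tokens waiting to be sent
        self.rank = None                  # unknown until own token returns

    def step(self, inbox):
        """Process received tokens, then send at most one message."""
        for token in inbox:
            if token[0] == self.name:
                # Own token has gone all the way around: it lists every agent,
                # so n = len(token) and the sorted position is a unique rank.
                self.rank = sorted(token).index(self.name) + 1
            else:
                self.outbox.append(token + [self.name])
        if self.outbox:
            return self.next_name, self.outbox.popleft()
        return None


def run(names, seed=0, max_rounds=1000):
    """Run rounds until every agent has a rank; returns (ranks, rounds_used)."""
    rng = random.Random(seed)
    n = len(names)
    # Assumption: the list wraps around, so the last agent knows the first.
    agents = {nm: Agent(nm, names[(i + 1) % n]) for i, nm in enumerate(names)}
    inboxes = {nm: [] for nm in names}
    for rounds in range(1, max_rounds + 1):
        order = list(names)
        rng.shuffle(order)                # random invocation order each round
        for nm in order:
            inbox, inboxes[nm] = inboxes[nm], []
            out = agents[nm].step(inbox)
            if out is not None:
                recipient, token = out    # one message per step
                inboxes[recipient].append(token)
        if all(a.rank is not None for a in agents.values()):
            return {a.name: a.rank for a in agents.values()}, rounds
    return None, max_rounds


if __name__ == "__main__":
    ranks, rounds = run([f"agent-{i}" for i in range(10)], seed=1)
    print(rounds, ranks)
```

Swapping the body of `step` for a model call (prompt built from the inbox, reply parsed into a recipient and message) would turn this into the LLM version; the scripted agent is only there to show the task is solvable under these assumed rules.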
- gertlabs 5 hours ago
  Nice, that's a good one -- interesting dynamics can come out of deceptively simple social games.