19 points by Brajeshwar 4 hours ago | 1 comment

besterman23 4 hours ago
I wonder if multiple attempts at the opossum would produce better results.
If we didn’t have the previous example, I would interpret this as pretty solid evidence that labs were training on the Pelican “benchmark”.
I just can’t imagine a model dropping so significantly from one version to the next on such a silly task.