OLYMPICS RECORDS.
1. 14.2 Seconds, Holder: Pinchy
2. (60s) 120 pinches, Holder: Pinchy
3. (db) 110db, Holder: EIDOLONX
4. (rhythm) 9.7/10, Holder: Skeletorus
5. 12.3m, Holder: Satochi Goat
6. (50m) 32.1 sec, Holder: Pinchy
7. (1hr) 100m, Holder: GrandMittens
8. (6hours), Holder: Satochi Goat
Economic boost: $CRAB up 0.0001% (Sideways as Always.)
Providing them with medal count will improve their win rate against the baseline $HORIZON.
I do still wonder if adapting something like Card Forge for LLM use would result in engaging gameplay with an LLM.
But also, with a rules engine, you have to manually go through every step and pass priority after every action.
I think it makes more sense to let an LLM play magic like a person would. On early turns it is acceptable to say "I play a land and pass" without going through every phase. And you can say "I tap all my land and play this card" without having to use a tool call and agent turn for every land tap.
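To make the "tap all my land and play this card" idea concrete, here is a minimal sketch of a single coarse-grained tool that taps lands and casts a card in one call. Everything in it is an assumption for illustration: `Battlefield`, `cast_with_auto_tap`, and the simplification that costs are plain land counts with no colors, no stack, and no priority passing.

```python
from dataclasses import dataclass, field


@dataclass
class Battlefield:
    """Hypothetical, heavily simplified board state (generic mana only)."""
    untapped_lands: int
    hand: dict[str, int] = field(default_factory=dict)  # card name -> total mana cost
    in_play: list[str] = field(default_factory=list)


def cast_with_auto_tap(board: Battlefield, card: str) -> str:
    """One tool call: tap as many lands as the card costs and cast it, or refuse."""
    cost = board.hand.get(card)
    if cost is None:
        return f"{card} is not in hand"
    if cost > board.untapped_lands:
        return f"not enough mana for {card}"
    # Tap the lands and move the card in a single step, instead of
    # one tool call (and one agent turn) per land.
    board.untapped_lands -= cost
    del board.hand[card]
    board.in_play.append(card)
    return f"cast {card}, tapped {cost} lands"


board = Battlefield(untapped_lands=3, hand={"Grizzly Bears": 2})
print(cast_with_auto_tap(board, "Grizzly Bears"))
```

The tradeoff is that a tool like this hides the individual steps, so it only works for plays where the order of tapping does not matter.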
Also card forge would not let you goldfish a deck. You must have opponents.
(defrule connection
  (connection ?id)
  =>
  (println "User " ?id " connected")
  (printout ?id "Welcome to the chatroom from CLIPS!" crlf)
  (do-for-all-facts ((?f connection)) (neq ?id (nth$ 1 ?f:implied))
    (printout (nth$ 1 ?f:implied) "User " ?id " connected" crlf)))

(defrule say
  (connection ?id)
  ?f <- (message-buffered ?id)
  ?ff <- (message ?id ~/me ?message)
  =>
  (retract ?f ?ff)
  (printout ?id "You: " ?message crlf)
  (do-for-all-facts ((?f connection)) (neq ?id (nth$ 1 ?f:implied))
    (printout (nth$ 1 ?f:implied)
      ?id ": " ?message crlf)))

Have the LLM submit a proposed move, then either advance the game state or reply "permission denied, try again". Probably also log how many times that happens, since attempted violations seem like a valuable signal as well.
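A minimal sketch of that propose-then-validate loop, assuming a toy rule set with only two moves (`play_land` and `pass_turn`). The `Referee` and `GameState` names and the move strings are hypothetical; a real harness would delegate the legality check to an actual rules engine.

```python
from dataclasses import dataclass, field

DENIED = "permission denied, try again"


@dataclass
class GameState:
    """Hypothetical toy state: just enough to enforce one land per turn."""
    lands_in_hand: int = 1
    land_played_this_turn: bool = False
    turn: int = 1


@dataclass
class Referee:
    state: GameState = field(default_factory=GameState)
    violations: int = 0  # attempted illegal moves, kept as a signal

    def submit(self, move: str) -> str:
        """Advance the game state if the move is legal, otherwise reject and count it."""
        if move == "play_land":
            if self.state.land_played_this_turn or self.state.lands_in_hand == 0:
                self.violations += 1
                return DENIED
            self.state.lands_in_hand -= 1
            self.state.land_played_this_turn = True
            return "ok"
        if move == "pass_turn":
            self.state.turn += 1
            self.state.land_played_this_turn = False
            return "ok"
        # Unknown moves are treated as violations too.
        self.violations += 1
        return DENIED


ref = Referee(GameState(lands_in_hand=2))
for move in ["play_land", "play_land", "pass_turn", "play_land"]:
    print(move, "->", ref.submit(move))
print("violations:", ref.violations)
```

Because the state only changes on an accepted move, a rejected proposal can simply be retried, and the violation counter can be reported per model alongside the game result.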
Unfortunately it gets really expensive to run, even with some optimizations for the context.
I can only afford to play them with the deepseek models. They make serious blunders sometimes. This is not an easy "harness" to build and I don't have the time or disposable cash to work on it. I think a lot of work could still be done on improving it and testing better models.
It would make an amazing "arena" bench. There are plenty more duel decks that are well balanced against each other.
I think I object more to the decks used in testing than the machines' decisions. I do have nitpicks though: this hand is quite poor and should be mulliganed: https://app.mtgautodeck.com/public/benchmarks/4bd9955b-ebe1-.... The poor runout reinforces this decision.
This project is cool though, props for making it!
https://github.com/CallumFerguson/mtg-auto-deck/blob/a877c08...
With maximum thinking and web search to look up magic rules, I didn't ever see it make a mistake. It is probably better at following the rules than the average magic player (but not better at making the most strategic moves).
The benchmark was mostly meant to find the cheapest model, at the lowest reasoning effort, that would provide a good experience for the app. The answer turned out to be that, for now, there is no cost-effective way to run this app.
To provide a good experience, the simulations either need to be near instant, or you need to be able to run dozens or hundreds of simulations in parallel and do statistical analysis.
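As a rough illustration of the "many simulations in parallel plus statistics" option, the sketch below runs a batch of games concurrently and reports a Wilson score interval for the win rate. `simulate_game` is a hypothetical stand-in (a seeded coin flip) for one LLM-driven simulation, and the thread pool assumes the real work would be I/O-bound API calls.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def simulate_game(seed: int) -> bool:
    """Hypothetical stand-in for one simulation; returns True on a win."""
    rng = random.Random(seed)
    return rng.random() < 0.6  # assumed win probability, for illustration only


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def run_batch(n_games: int, workers: int = 16) -> tuple[int, tuple[float, float]]:
    """Run n_games simulations concurrently and summarize the win rate."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(simulate_game, range(n_games)))
    wins = sum(results)
    return wins, wilson_interval(wins, n_games)


wins, (low, high) = run_batch(200)
print(f"{wins}/200 wins, interval {low:.3f} to {high:.3f}")
```

The width of the interval is what drives the "dozens or hundreds" requirement: with only a handful of games, the interval is too wide to compare two decks or two models.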
It is far from ideal, but from my testing, even underpowered small LLMs that could not complete a single legal turn were reasonably good at judging if a simulation was legal. The final judging was all done by gpt-5.5 (medium) which might have given the OpenAI models an advantage, but from all the simulations I looked at, it seemed pretty fair.
This benchmark ended up being more of a test of how well an LLM can call tools without contradicting itself or backtracking. Most of the failures were not because of breaking Magic rules, but because it could not sequence the tool calls correctly.
For example: https://app.mtgautodeck.com/public/benchmarks/6349dda2-4069-...
and: https://app.mtgautodeck.com/public/benchmarks/dcc18bd8-339d-...
The failure mode seems to be that some models are overly trained to start tool calls, even when the model itself knows that it should not be calling the tool. In neither of those examples did the error come from the judge prompt ruling a play illegal. In both, the model stopped the simulation itself, knowing that it had made a tool error.
The Opus 4.8 examples are especially weird because it will consistently make the same tool call error 2 or 3 times in a row, and it will put things like "placeholder" or "noop" for the tool call reason.
I'll have to look into that project, but I also have an RTX 5090 and did a lot of testing with Qwen3.6 27B and Gemma 4 31B. I was not able to get either to play legal turns consistently. I had to keep expanding the system prompt and adding rules for edge cases. By the end, the prompt was over 10k tokens, and while the models mostly made legal turns, they did not make good turns. And all the heuristics in the prompt degraded the performance and increased the cost for frontier models.
Like how the strawberry example was overtrained for, or how the pelican on a bike started being used in official release posts.
IOW, it's as complicated as possible.