I kept seeing the same pattern in AI agent demos. You hand an LLM a price feed, it gets {"price": 94200, "change_24h": -2.3}, and it burns half its context window figuring out basics. Is this up from last week? What percentile? How does hash rate correlate? The agent does all that work before it starts reasoning about what to do. So I started pre-computing the analysis server-side and returning ~400 token markdown briefings instead of raw JSON.
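The idea is simple: do the arithmetic before the model sees anything. A minimal sketch of what such a briefing generator might look like (field names, window math, and layout are all illustrative, not the real schema):

```python
from datetime import datetime, timezone

def render_briefing(feed: dict, history: list[float]) -> str:
    """Turn a raw price tick plus recent history into a short markdown
    briefing, so the agent doesn't burn context re-deriving basics.
    Illustrative sketch; not the actual production schema."""
    price = feed["price"]
    week_ago = history[-7] if len(history) >= 7 else history[0]
    wow = (price - week_ago) / week_ago * 100
    below = sum(1 for p in history if p <= price)
    pct = 100 * below / len(history)
    return "\n".join([
        f"## Market briefing ({datetime.now(timezone.utc):%Y-%m-%d})",
        f"- Price: ${price:,.0f} ({feed['change_24h']:+.1f}% 24h)",
        f"- Week-over-week: {wow:+.1f}%",
        f"- Percentile vs. trailing window: {pct:.0f}th",
    ])

print(render_briefing({"price": 94200, "change_24h": -2.3},
                      [101_000, 99_500, 98_000, 97_200, 96_500, 95_800, 96_700]))
```

The agent gets the percentile and week-over-week move for free instead of reconstructing them from raw numbers.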
The experiment: 4-arm RCT. Treatment gets real-time briefings. Control gets price only. A third arm uses web search instead of briefings. Placebo gets the same briefings but time-shifted 5-7 months, presented as current. All arms run Claude, one trading decision per tick.
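The placebo arm is the mechanically interesting one. A sketch of how the 5-7 month time shift could work, assuming an archive of past briefings keyed by date (function name, archive shape, and shift bounds are my assumptions):

```python
from datetime import date, timedelta
import random

def placebo_feed(archive: dict[date, dict], today: date,
                 min_shift_days: int = 150, max_shift_days: int = 210) -> dict:
    """Serve an archived briefing from ~5-7 months ago, relabeled with
    today's date, for the placebo arm. Illustrative sketch, not the
    experiment's real harness."""
    shift = random.randint(min_shift_days, max_shift_days)
    stale_day = today - timedelta(days=shift)
    # fall back to the nearest archived date at or before the target
    candidates = [d for d in archive if d <= stale_day]
    payload = dict(archive[max(candidates)])
    payload["as_of"] = today.isoformat()  # presented as current
    return payload
```

The key property: the data is internally coherent and well-formatted, just wrong about the present, which is exactly what makes the placebo result informative.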
Latest run, 202 ticks over 6 months. BTC fell 34.7%.
Treatment (briefings): +7.83% | max drawdown 5.95%
Control (price only): -8.14% | max drawdown 15.95%
Web search arm: -1.55% | max drawdown 12.63%
Placebo (stale data): -7.70% | max drawdown 10.17%
BTC buy-and-hold: -34.70%
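The drawdown figures above are the standard peak-to-trough measure. A minimal sketch of how they'd be computed from each arm's equity curve:

```python
def max_drawdown(equity: list[float]) -> float:
    """Largest peak-to-trough decline of an equity curve, as a percentage.
    Returns a positive number, e.g. 15.95 for a 15.95% drawdown."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)                       # running high-water mark
        worst = max(worst, (peak - value) / peak * 100)
    return worst

print(max_drawdown([100, 110, 99, 105, 120, 102]))  # → 15.0
```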
Treatment beat control by +15.97pp and beat web search by +9.38pp. All 7 experiments were positive, ranging from +4.46pp to +15.97pp across two models (Opus 4.6, Sonnet 4.5).

The edge is almost entirely defensive. Treatment's return came from two short campaigns during crashes. In rallies and sideways markets, it matched or underperformed control. Long trades were coin flips.
What didn't work: the earliest run was the worst, with treatment finishing last. Rich data with no guardrails caused the agent to flip-flop every tick: BUY, SELL, BUY across three consecutive ticks, $79K traded, zero net position change. A later run was aborted at tick 33 after the agent translated "macro bearish" into "go short" when the right move was cash. Of 24 total runs, 1 was negative and 5 were inconclusive.
Stale data was worse than no data: the placebo arm consistently underperformed plain control across runs. Well-structured wrong information is more dangerous than no information.
Things I'm still uncertain about: the edge is untested in a bull market (every window skews bearish); 202 ticks isn't statistically conclusive within a single run (years of ticks would be far more convincing); and the web search arm had contamination risk from future-dated search results.