My understanding is that there was only 1 run per configuration?
If that's correct then, given the run-to-run variability, it really doesn't say much. It takes several trials per prompt per arm before the results start to stabilize on a plot. Doing that properly is prohibitively expensive, so I've been running the same prompt on the same model 5 times just to get a visual sense of performance.
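For the curious, the repeated-trials loop is roughly this (a minimal sketch assuming the Anthropic Python SDK; the model id and prompt are placeholders, not my exact setup):

```python
# Minimal sketch: run the same prompt N times against the same model and
# record output token counts. Model id and prompt are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = "Write a binary search in Rust."  # placeholder: same prompt every run
N_TRIALS = 5

output_tokens = []
for i in range(N_TRIALS):
    resp = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder: same model every run
        max_tokens=8192,
        messages=[{"role": "user", "content": PROMPT}],
    )
    output_tokens.append(resp.usage.output_tokens)
    print(f"trial {i + 1}: {resp.usage.output_tokens} output tokens")

spread = max(output_tokens) - min(output_tokens)
print(f"min={min(output_tokens)} max={max(output_tokens)} spread={spread}")
```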
Someone did the same with lambda calculus. I wanted to make the point about how much run-to-run variability and cost difference you get with the same prompt on the same model across only 5 trials. I classified each of the thinking steps using Opus 4.6 (that classification alone costs ~$4 in tokens per run) and plotted them on custom flame graphs.
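The classification pass could look something like this (a hedged sketch: the label set and judge model id are illustrative assumptions, and the flame-graph rendering is omitted):

```python
# Sketch of labeling thinking steps with a judge model. LABELS and the
# judge model id are illustrative assumptions, not the original setup.
import anthropic

client = anthropic.Anthropic()

LABELS = ["planning", "exploration", "verification", "backtracking", "other"]

def classify_step(step_text: str) -> str:
    """Ask the judge model to assign exactly one label to one thinking step."""
    resp = client.messages.create(
        model="claude-opus-4-5",  # placeholder judge model id
        max_tokens=8,
        messages=[{
            "role": "user",
            "content": (
                f"Classify the reasoning step below as exactly one of: "
                f"{', '.join(LABELS)}. Reply with the label only.\n\n{step_text}"
            ),
        }],
    )
    label = resp.content[0].text.strip().lower()
    return label if label in LABELS else "other"

# Usage: label every step of one run's thinking trace.
steps = ["Let me restate the problem...", "Check: does the base case hold?"]
print([classify_step(s) for s in steps])
```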
When the run-to-run variability on the same prompt ranges from 8,163 to 17,334 tokens, a spread of more than 2x, none of these tests mean that much.
It is the same idiocy that permeates EV cars. You buy an expensive car to get from A to B while also giving you comfort. If I have to think about whether or not to use the seat heating, I'm out of my comfort zone. So no, fuck caveman, and I don't fucking care about the burned tokens.
Be brief. It's easy, needs no setup, and it isn't yet another mindless mumbo-jumbo extension with its 325 dependencies.