RWKV-7 beats Llama 3.2 with 3x fewer training tokens and formally exceeds TC^0 (ai.gopubby.com)

2 points by Aedelon 5 hours ago | 1 comment

Aedelon 5 hours ago
Author here. The core claim: RWKV-7 (2.9B params, RNN) scores 72.8% avg across standard benchmarks vs Llama 3.2's 69.7%, trained on 3.1T tokens vs ~9T. Comparable parameter count, roughly one-third the data.
The more interesting result is architectural: under standard complexity conjectures, RWKV-7 formally exceeds TC⁰, the complexity class bounding standard Transformers (Merrill & Sabharwal's proof in the paper). It solves state-tracking problems that fixed-depth attention provably cannot, given those conjectures.
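To make "state tracking" concrete, here is a minimal sketch of the kind of task usually meant: follow a sequence of swaps over five items and report the final arrangement. This is an illustration only, not code from the paper. The function name `track_swaps` and the example swaps are hypothetical.

```python
from typing import List, Tuple


def track_swaps(n_items: int, swaps: List[Tuple[int, int]]) -> List[int]:
    """Apply position swaps in order and return the final arrangement.

    state[i] is the item currently sitting in position i. The state has a
    fixed size (n_items) no matter how long the swap sequence is, but each
    update depends on the result of the previous one.
    """
    state = list(range(n_items))
    for i, j in swaps:
        state[i], state[j] = state[j], state[i]
    return state


if __name__ == "__main__":
    # Hypothetical example sequence over 5 items.
    print(track_swaps(5, [(0, 1), (1, 2), (0, 4), (2, 3)]))
```

A recurrent model can in principle carry `state` forward one step at a time; the claim in the comment is about whether a fixed-depth attention stack can do the same for arbitrarily long sequences.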
Inference memory is constant in context length: the model carries a fixed-size recurrent state, with no KV cache. The hybrid variant (RWKV-X) hits 99.8% passkey retrieval at 64K and 1.37x Flash Attention v3 throughput at 128K.
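A back-of-the-envelope sketch of why the missing KV cache matters at long context. The layer counts, head counts, and head sizes below are made-up placeholders, not the real RWKV-7 or Llama 3.2 configurations, and the recurrent-state formula assumes one head_dim x head_dim matrix per head and ignores any smaller auxiliary state.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Keys plus values for every cached token: grows linearly with seq_len."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def recurrent_state_bytes(n_layers: int, n_heads: int, head_dim: int,
                          bytes_per_elem: int = 2) -> int:
    """One head_dim x head_dim state matrix per head: independent of seq_len."""
    return n_layers * n_heads * head_dim * head_dim * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical configurations, chosen only to show the scaling behaviour.
    attn = dict(n_layers=32, n_kv_heads=8, head_dim=64)
    rnn = dict(n_layers=32, n_heads=40, head_dim=64)
    for seq_len in (1_024, 65_536, 131_072):
        kv_mib = kv_cache_bytes(seq_len, **attn) / 2**20
        st_mib = recurrent_state_bytes(**rnn) / 2**20
        print(f"{seq_len:>7} tokens: KV cache {kv_mib:8.1f} MiB, "
              f"recurrent state {st_mib:6.1f} MiB")
```

The absolute numbers depend entirely on the placeholder configs. The point is only the shape: one column scales with context length and the other does not.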
Paper: https://arxiv.org/abs/2503.14456 (COLM 2025, peer-reviewed)
Weights: https://huggingface.co/collections/RWKV/rwkv-v7-67d43835efa2...
Code: https://github.com/BlinkDL/RWKV-LM (Apache 2.0)
Happy to discuss the delta rule generalization, the TC⁰ proof, or the benchmark methodology. I went through 36 sources digging into the caveats.
- xml 5 hours ago
  > Specifically, we collected new data created after January 2025, including: [...] new fiction on Archive of Our Own (Various, 2025),
  Not sure how to feel about this. From a researcher's point of view, reproducibility is important, but the last time someone publicly collected data from AO3, the community was not very fond of that.
  https://huggingface.co/datasets/nyuuzyou/archiveofourown/dis...
  - Aedelon 3 hours ago
    Yeah, that HF dataset page is rough. 247+ threads, mostly DMCA reports, archive-locked fics scraped without consent, dataset reuploaded after takedown. The AO3 community had every reason to be furious.
    Not RWKV-specific though. Most large corpora have the same sources in them, they just don't list them explicitly. Whether the transparency makes it better or worse is a real question.