The 16MB constraint is a fascinating forcing function.
Most architectural improvements in recent years have come
from scaling, so it'll be interesting to see whether
depth recurrence or aggressive parameter tying can
meaningfully close the gap.
Curious if anyone has a prior on how much bits-per-byte
can realistically improve over a well-tuned baseline
at this parameter count.