Though I wonder how much of the gains here actually come from having 75% more parameters than the baseline, even though the inference FLOPs are matched.
Can't help but see this as just a different twist on the sparse parameter use that MoE models leverage: those also gain performance at constant forward-pass FLOPs because of the extra parameters.
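
To make the analogy concrete, here's a rough back-of-the-envelope sketch (toy numbers of my own, not from the paper) of why a top-1-routed MoE FFN multiplies total parameters while keeping per-token compute roughly flat:

    # Toy comparison: dense FFN vs. top-1-routed MoE FFN with same-sized experts.
    # Illustrative only; dimensions and expert count are made up.
    d_model, d_ff, n_experts = 1024, 4096, 4

    # Dense FFN: two projections, d_model -> d_ff -> d_model
    dense_params = 2 * d_model * d_ff
    dense_flops_per_token = 2 * (2 * d_model * d_ff)  # ~2 FLOPs per multiply-add

    # MoE FFN: n_experts copies of that FFN, but top-1 routing runs only one,
    # so per-token compute matches the dense layer (router cost is negligible).
    moe_params = n_experts * dense_params
    moe_flops_per_token = dense_flops_per_token

    print(f"dense: {dense_params/1e6:.1f}M params, {dense_flops_per_token/1e6:.1f}M FLOPs/token")
    print(f"moe:   {moe_params/1e6:.1f}M params, {moe_flops_per_token/1e6:.1f}M FLOPs/token")

So both tricks buy extra capacity "for free" at inference time, and the question is how much of the reported gain is just that extra capacity.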