I made three versions: one with 128 experts kept, one with 150, and the biggest (which only borderline fits) with 180 experts out of 256. The kept experts were selected for coding / agentic / research workloads.
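For a quick sense of how much of the expert pool each variant keeps, here is a minimal Python sketch. The K128 and K150 labels are assumed by analogy with the K180 name, and the percentages cover expert count only; they say nothing about total parameter count or memory footprint.

```python
# Share of the 256 experts retained by each pruned variant.
# "K128" / "K150" are assumed labels, modelled on the "K180" naming.
TOTAL_EXPERTS = 256
KEPT = {"K128": 128, "K150": 150, "K180": 180}


def kept_fraction(kept, total=TOTAL_EXPERTS):
    """Return the fraction of experts that survive pruning."""
    return kept / total


for name, kept in KEPT.items():
    print(f"{name}: {kept}/{TOTAL_EXPERTS} experts kept ({kept_fraction(kept):.1%})")
```

So the three variants keep roughly 50%, 59% and 70% of the experts.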
The goal is to offer a higher-precision (NVFP4) option for running the model; the original full ds4 already runs the IQ2XXS version. Custom CUDA kernels were written to fit the NVFP4 models to the Spark as well as possible.
The K180 uses around 119 of 122 GB of RAM at the full 1M context; I tested it up to a 32k prefill and it was stable. For best memory efficiency, you might need to set DS4_CUDA_MANAGED_MODEL=1 and DS4_KV_TURBO=1. More memory/bandwidth optimizations are coming; after that I plan to tackle re-adjusting the MTP heads (which would require re-training them on the new architectures).
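If you launch from a script rather than a shell, the two variables can be passed like this. It is only a sketch: the two variable names come from the post, but the helper names are made up for illustration and the child command below is a stand-in that just echoes the variables back, so swap in your actual ds4 command line.

```python
import os
import subprocess
import sys


def ds4_env(base=None):
    """Copy the environment and add the two memory-related settings."""
    env = dict(os.environ if base is None else base)
    env["DS4_CUDA_MANAGED_MODEL"] = "1"
    env["DS4_KV_TURBO"] = "1"
    return env


def run_with_memory_flags(cmd):
    """Run cmd (a list of arguments) with both variables set."""
    return subprocess.run(
        cmd, env=ds4_env(), check=True, capture_output=True, text=True
    )


if __name__ == "__main__":
    # Stand-in child process: prints the variables to show they arrive.
    # Replace this list with your real launch command.
    demo = [
        sys.executable,
        "-c",
        "import os; print(os.environ['DS4_CUDA_MANAGED_MODEL'], os.environ['DS4_KV_TURBO'])",
    ]
    print(run_with_memory_flags(demo).stdout.strip())
```

Copying the environment instead of editing `os.environ` keeps the settings scoped to the launched process.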
Benchmarking hasn't been done yet, as I have mostly been busy with the CUDA work. Treat these as experimental.
Model links: https://huggingface.co/sleepyeldrazi/DeepSeek-v4-Flash-REAP-... https://huggingface.co/sleepyeldrazi/DeepSeek-v4-Flash-REAP-... https://huggingface.co/sleepyeldrazi/DeepSeek-v4-Flash-REAP-...