We hypothesized that domain knowledge in monolithic Mixture-of-Experts models is not holographically entangled across all routing layers, but physically separable. Using histogram-based activation profiling across 10 coding languages, we surgically extracted the 256 experts responsible for Python from Qwen3-Coder-Next-80B, and separately the 256 experts responsible for Web/Frontend logic. Across all 48 layers, we applied an activation bias that re-ranked the experts and then selected them up to a budget of 256 experts per layer.
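The per-layer selection step can be sketched roughly as follows. This is a hypothetical illustration, not the repo's actual code: the function and variable names, the score shapes, and the simple additive bias are all assumptions about what "bias the ranking, keep up to the budget" could look like.

```python
import numpy as np

def select_experts(router_scores, domain_bias, budget=256):
    """Rank one layer's experts by routing score plus a per-domain
    activation bias, then keep the top `budget` expert indices.
    (Hypothetical sketch; the real extraction may differ.)"""
    # router_scores: (num_experts,) mean routing weight over a profiling corpus
    # domain_bias:   (num_experts,) bias derived from the activation histogram
    scores = router_scores + domain_bias
    keep = np.argsort(scores)[::-1][:budget]  # highest-scoring experts first
    return np.sort(keep)                      # sorted indices of retained experts

# Toy example: a layer with 512 experts, pruned to a 256-expert budget.
rng = np.random.default_rng(0)
kept = select_experts(rng.normal(size=512), rng.normal(size=512))
print(len(kept))  # -> 256
```

Repeating this over every routing layer yields the per-layer expert lists that drive the weight-slicing of the .gguf file.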
The resulting 40B Python Specialist retains a 93% score on HumanEval (compared to the 80B model's 94%), despite losing half its parameters. Conversely, the 40B Web Specialist retains near-perfect UI generation capabilities while completely losing the ability to emit raw Python logic. Note that this was achieved strictly via weight-slicing the unmodified .gguf file, with zero post-surgery fine-tuning.
The repo linked above contains Demo v1.5, which uses a fast ONNX supervisor (DML/CUDA) to hot-swap these massive 40B lobes via Ollama, allowing 80B-class MoE routing on consumer hardware (29GB VRAM footprint).
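For readers curious about the supervisor-outside-the-engine design, here is a minimal sketch of the dispatch logic: a small classifier assigns each prompt a domain, and the matching 40B lobe is addressed by its Ollama model name. The model names and the keyword stub are assumptions for illustration; in the actual demo the classifier would be an ONNX model run via onnxruntime on the DML or CUDA execution provider.

```python
# Hypothetical lobe registry: Ollama model names are placeholders,
# not the repo's real tags.
LOBES = {
    "python": "coe-python-40b",
    "web": "coe-web2-40b",
}

def route(prompt, classify):
    """Pick the Ollama model name for this prompt's domain."""
    domain = classify(prompt)
    return LOBES.get(domain, LOBES["python"])  # default lobe if unrecognized

def stub_classify(prompt):
    # Stand-in for the ONNX supervisor: a trivial keyword heuristic.
    web_terms = ("html", "css", "react", "component", "frontend")
    return "web" if any(t in prompt.lower() for t in web_terms) else "python"

print(route("Build a React login component", stub_classify))  # -> coe-web2-40b
print(route("Parse this CSV with pandas", stub_classify))     # -> coe-python-40b
```

Keeping the supervisor in a separate, lightweight runtime means the routing decision never touches the LLM's KV cache or VRAM budget, which is what makes hot-swapping the lobes cheap.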
Below are the relevant links:

Whitepaper (PDF): [https://github.com/JThomas-CoE/College-of-Experts-AI/blob/ma...]
The Extracted Models: [https://huggingface.co/JThomas-CoE/CoE-WEB2-40b-A3b-GGUF]

We are currently preparing to decompose the new Qwen3.5-35B model into a full 10-domain suite. I would love to hear feedback on the layer-slicing methodology, or on the architectural implications of hosting the routing supervisor outside the LLM inference engine.